Bioinformatics applications through visualization of variations on protein structures, comparative functional genomics, and comparative modeling for protein structure studies

A dissertation presented

by

Alper Uzun

to

The Department of Biology

In partial fulfillment of the requirements for the degree of

Doctor of Philosophy

in the field of

Biology

Northeastern University Boston, Massachusetts July 2009

©2009 Alper Uzun ALL RIGHTS RESERVED

Bioinformatics applications through visualization of variations on protein structures, comparative functional genomics, and comparative modeling for protein structure studies

by

Alper Uzun

ABSTRACT OF DISSERTATION

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Biology in the Graduate School of Arts and Sciences of Northeastern University, July, 2009

Abstract

The three-dimensional structure of a protein provides important information for understanding and answering many biological questions in molecular detail. Sequenced genes and related genomic information are rapidly accumulating in the biological databases. As the amount of information on protein sequences, protein structures and DNA sequences grows exponentially, it is important to combine biological data with the development of bioinformatics tools. At the same time, the number of known protein sequences is much larger than the number of experimentally solved protein structures, because experimental methods cannot always be applied and protein structures are therefore not available for all protein sequences. Comparative protein modeling closes this gap for protein sequences with unknown structures by constructing a three-dimensional model of a given protein sequence based on its sequence similarity to one or more known structures. I present here

several different applications of comparative protein modeling by developing servers for

mapping nsSNPs on to comparative protein models, studying comparative functional

genomics of HMGN3a and SMARCAL1 along with multiple sequence analysis,

developing comparative models for other techniques in order to find active sites, and

understanding the possible functional properties of while substitutions occur in a

given protein.

Part of the presented research focused on nonsynonymous SNPs (nsSNPs). Understanding the functional consequences of nonsynonymous changes and predicting potential causes and the molecular basis of diseases involve integrating information from multiple heterogeneous sources, including sequence data, structure data and pathway relations between proteins. To visualize nsSNPs on protein structures and to analyze them, a web server, Structure SNP (StSNP), was developed. It provides the ability to analyze and compare human nsSNP(s) in protein structures, protein complexes and protein–protein interfaces, where nsSNP and structure data on protein complexes are available in PDB. In the second part of the research, a comparative functional genomics analysis of HMGN3a and SMARCAL1 was carried out, combining comparative modeling with multiple protein sequence analysis across mammals. Our results showed a high degree of structural conservation of HMGN3a and SMARCAL1 in the mammalian species studied. In the third part of the research, several comparative models were built from different species in order to find active site residues by the THEMATICS method. In the last part of the research, multiple sequence alignments of APEX1 and DNA polymerase beta and comparative models of γ-tubulins were studied and presented.

Acknowledgements

First, and foremost, I would like to express my sincere gratitude to my advisor, Dr.

Valentin Ilyin, for his guidance, suggestions and patience during the course of my work.

Dr. Valentin Ilyin has provided me with necessary assistance and his supervision during my graduate work which will always be appreciated. I would like to thank the members

of my committee, Dr. William Detrich, Dr. Kostia Bergman, Dr. Veronica Godoy, Dr.

Mary Jo Ondrechen. I am thankful for the valuable collaboration of Dr. William Detrich,

Dr. Mary Jo Ondrechen, Dr. Phyllis Strauss, and Dr. Erdogan Memili.

I would also like to thank the Biology Department at Northeastern University, which always supported me morally and financially during my PhD years. I would like to thank Dr. Scott Mohr from Boston University for his encouragement and for introducing me to the field of bioinformatics. Special thanks are also due to Dr. Kostia Bergman and Dr. William Detrich, whose support and guidance let me take the initial steps into the field of bioinformatics.

During my graduate study, I held an internship at the Broad Institute of MIT and Harvard University. I would especially like to thank Dr. Jill Mesirov and Dr. Michael Reich, not only for supervising me but also for kindly giving their full support over the years, even after the internship.

I would like to thank my colleagues from the Ilyin lab, Alex Abyzov, Chesley Leslin and Haifeng Weng. I am especially thankful to Alex Abyzov for patiently answering my every question.

My friends outside the lab were also very supportive and sincere. I would like to thank Ahmet Ozcan, who was like a brother to me; Taner Kaya, a genuine friend over the years, for sharing the tough times and the fun times; Murat Erdem and Karin Schon, who are always supportive and sincere; and Ata Murat Kaynar and Dilsun Kaynar, who, although they later moved to Pittsburgh, constantly show their sincere love and friendship even from a distance.

I am grateful to my mom and dad for their everlasting trust, emotional support, encouragement, and unconditional love. I have spent many years away from them; I understand and appreciate their sacrifice. Since I am their only child, their patience and understanding mean a lot to me.

TABLE OF CONTENTS

Abstract

Acknowledgements

Table of Contents

List of Abbreviations

List of Figures

List of Tables

Chapter 1. Introduction

Chapter 2. Mapping and modeling nsSNPs on protein structures

Chapter 3. Functional genomics of HMGN3a and SMARCAL1 in early mammalian embryogenesis

Chapter 4. Comparative model structures and active site predictions

Chapter 5. Studies on multiple sequence alignment and comparative modeling

Conclusion

Tables

Figures

References

List of Abbreviations

Abbreviations Proper name

ALDH2 Aldehyde dehydrogenase-2

ANOLEA Atomic non-local environment assessment

APEX1 Apurinic/apyrimidinic endonuclease 1

Arg Arginine

AspAT Aspartate aminotransferase

ATP Adenosine-5'-triphosphate

BER Base excision repair

BLAST Basic local alignment search tool

BLOSUM Blocks of amino acid substitution matrix

dbSNP SNP database

DFIRE Distance-scaled, finite ideal-gas reference

DHAP Dihydroxyacetone phosphate

DNA Deoxyribonucleic acid

EGA Embryonic genome activation

GAP D-glyceraldehyde 3-phosphate

Gln Glutamine

GST Glutathione S-transferase

g-T1 Gamma tubulin 1

g-T2 Gamma tubulin 2

GV Germinal vesicle

HARP HepA-related protein

HGVbase Human genome variation database

HMGN High mobility group nucleosomal

HPPK 6-Hydroxymethyl-7,8-dihydropterin pyrophosphokinase

Ile Isoleucine

JSNP Japanese Single Nucleotide Polymorphism

KEGG Kyoto encyclopedia of genes and genomes

Lys Lysine

MEGA Molecular evolutionary genetics analysis

Met Methionine

mRNA Messenger ribonucleic acid

MSA Multiple sequence alignment

MZT Maternal to zygotic transition

NCBI National Center for Biotechnology Information

nsSNP Nonsynonymous single nucleotide polymorphism

PCR Polymerase chain reaction

PDB Protein Data Bank

pol-β DNA polymerase beta

PROQ Protein quality predictor

ProsaII Protein structure analysis II

PSI_BLAST Position specific iterative BLAST

RMSD Root mean square deviation

SEDB Structural exon database

Ser Serine

SIOD Schimke immunoosseous dysplasia

SMARCAL1 SWI/SNF related, matrix associated, actin dependent regulator of chromatin, subfamily a member 1 SNF2N like domain

SNP Single nucleotide polymorphism

StSNP Structure SNP

SWI/SNF SWItching and sucrose non-fermenting in yeast

THEMATICS Theoretical microscopic titration curves

TIM Triosephosphate isomerase

TRIP7 TRAF-interacting protein 7

UPGMA Unweighted pair group method with arithmetic mean

List of Figures

Figure 1. Growth of released protein structures per year.

Figure 2. Schematic of StSNP web server.

Figure 3. Data generation in StSNP.

Figure 4. nsSNPs and Glutathione S Transferase.

Figure 5. nsSNPs and Aldehyde dehydrogenase-2.

Figure 6. Phylogenetic tree for HMGN3a.

Figure 7. MSA of HMGN3a.

Figure 8. Four distinctive domains of SMARCAL1.

Figure 9. Phylogenetic tree for SMARCAL1.

Figure 10. First HARP domain of SMARCAL1.

Figure 11. MSA of the first HARP domain in SMARCAL1.

Figure 12. Second HARP domain of SMARCAL1.

Figure 13. MSA of second HARP domain in SMARCAL1.

Figure 14. SNF2N domain of SMARCAL1.

Figure 15. MSA of SNF2N domain in SMARCAL1.

Figure 16. Phylogenetic tree of helicase C terminal domain in SMARCAL1.

Figure 17. MSA of the helicase C-terminal domain in SMARCAL1.

Figure 18. Disparity Index test of HMGN3a.

Figure 19. Disparity Index test of SMARCAL1.

Figure 20. MSA of helicase like and helicase domains with modeled protein structure.

Figure 21. MSA of APEX1 and pol-β.

Figure 22. A view of model structures of g-T1 and g-T2 at the positions 303, 205.

List of Tables

Table 1. Representing query and modeling options for resources.

Table 2. Summary of resources.

Table 3. Organisms and protein accession numbers used in MSA of SMARCAL1.

Table 4. Organisms and protein accession numbers used in MSA of HMGN3a.

Table 5. Predicted clusters for orthologous structures for three templates.

Table 6. Predicted active site clusters for additional orthologous sets for the templates.

Table 7. Predicted active site clusters for nine homologous sets.

Chapter 1. Introduction

Comparative modeling of protein structures

The three-dimensional structure of a protein provides important information for understanding and answering many biological questions in molecular detail. Sequenced genes and related genomic information are accumulating rapidly in the biological databases. As of June 2009, more than 58,000 experimentally determined protein structures had been deposited in the Protein Data Bank (PDB) [1] (see Figure 1). Because experimental methods cannot always be applied, protein structures are not available for all protein sequences. Computational methods and tools can therefore be useful for predicting protein structures.

Comparative protein structure modeling (or homology modeling) techniques have been developed to build a three-dimensional model of a given protein sequence on the basis of an alignment that relies on sequence similarity to at least one known structure, which is called a template. To predict a protein structure by comparative modeling, two conditions have to be met [2-4].

1) The sequence to be modeled must have measurable similarity to another sequence of known structure. For example, more than 50% sequence similarity provides a good basis for modeling [5-7]. Accuracy decreases as the percentage identity falls below approximately 30% (the so-called ‘twilight zone’) [8].

2) It must be possible to compute an accurate alignment between the protein sequence and the template structure. High-accuracy comparative models are usually based on more than 50% sequence identity to their templates [5].

The degree of sequence similarity between the protein sequence and the template is an

important predictor of the model accuracy. Higher sequence similarity (such as more than

50%) to the template is a sign of a more accurate model [2-7].
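The identity thresholds discussed above can be illustrated with a minimal Python sketch. It assumes the target and template have already been aligned (gaps written as "-") and computes percent identity over the columns where both sequences have a residue. The function names and the "intermediate" label for the 30-50% range are illustrative assumptions, not part of any published tool, and other definitions of percent identity (for example, dividing by the length of the shorter sequence) are also in use.

```python
def percent_identity(aligned_target: str, aligned_template: str) -> float:
    """Percent identity over the aligned (gap-free) columns of a pairwise alignment."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    aligned = 0
    identical = 0
    for a, b in zip(aligned_target.upper(), aligned_template.upper()):
        if a == "-" or b == "-":
            continue  # skip columns with a gap in either sequence
        aligned += 1
        if a == b:
            identical += 1
    return 100.0 * identical / aligned if aligned else 0.0


def expected_accuracy(identity: float) -> str:
    """Rough label based on the thresholds quoted in the text (50% and 30%)."""
    if identity > 50.0:
        return "high"
    if identity >= 30.0:
        return "intermediate"  # assumed label; the text only names the two extremes
    return "twilight zone"


if __name__ == "__main__":
    pid = percent_identity("ACDEFGHIKL", "ACDEYGHIKL")
    print(pid, expected_accuracy(pid))
```

A template whose alignment falls in the "twilight zone" would normally call for more sensitive template search methods before a model is built.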

The search for template(s) starts with the target sequence, that is, the query sequence. Comparative modeling in general consists of four main steps:

1) Identifying experimentally solved structure(s) that can be used as template(s) for modeling the protein sequence.

2) Accurately aligning the protein sequence (target) to the template(s).

3) Building the three-dimensional model on the basis of the alignment(s).

4) Evaluating the quality of the model [2, 4, 9].

The search for templates for the query sequence (step 1) can be categorized into three classes of methods. The templates for modeling may be found by a) pairwise comparison methods, such as the Basic Local Alignment Search Tool (BLAST) [10] and FASTA [11], which align the protein sequence with all the sequences in the database of known structures; b) profile-sequence alignment methods, such as PSI_BLAST [12], which depend on profiles derived from multiple sequence alignments in order to increase the accuracy and sensitivity of the template search; and c) methods that use a combination of sequence and structure considerations to detect similarities between sequences and structures. In this third class, the protein sequence is threaded through a library of 3-D profiles or folds, and each threading is evaluated based on a certain scoring function. Superfamily [13] and GenThreader [14] are examples of this class.

I now compare these three classes briefly. The pairwise sequence comparison methods are the least sensitive and are best used to detect close homologs. The profile-based methods are usually capable of recognizing homologs sharing only approximately 25% sequence identity, and threading methods can sometimes recognize common folds even in the absence of any statistically significant sequence similarity [15, 16]. During my Ph.D. research I mainly used the first class in studies that depended on comparative modeling.

Aligning the protein sequence and the template is the second step of modeling. It is a critical step in the model-building process because it largely determines the accuracy of the model. A direct correlation has been shown between the sequence identity level of a pair of protein structures and the deviation of the Cα atoms of their common core [9, 17]. The more similar the two sequences are, the more closely the constructed structure can be expected to correspond to the template [9].

Since this is a vital step in modeling, an erroneous alignment will lead to the construction of an incorrect model. Decreasing sequence identity, alignment errors and the incorrect modeling of large insertions are the major sources of inaccuracy. As the percentage identity falls below approximately 30% (the so-called ‘twilight zone’) [8], model quality estimation on the basis of sequence identity becomes unreliable [2, 8, 9, 18-20].

In the third step of comparative modeling, the protein sequence-template alignment maps the protein sequence onto the template structure. This mapping is used to build the 3-D model of the protein sequence. There are several methods of building the model:

1) Assembling the model from a small number of rigid bodies which are gathered from

the aligned protein structures [3].

2) Modeling by segment matching or coordinate reconstruction [21, 22]. This method can

construct main-chain and side-chain atoms. It can also model unaligned regions [23].

3) Loop modeling: the protein sequence to be modeled may have inserted residues that have no corresponding regions in the template(s). Since the template provides no structural information about the inserted residues, loop modeling is needed at this point. There are two main classes of loop modeling methods:

a) the database search; scanning through all available databases for protein

structures in order to find segments fitting the anchor core regions [24, 25],

b) the conformational search; depending on an optimization of a scoring function

[26, 27]. Other than these two main methods, some methods combine these two

approaches [28, 29].

4) Side chain modeling: two simplifications are commonly applied when modeling side chains. The first is that amino acid replacements often leave the backbone unchanged (145), so the backbone can be kept fixed during the search for the best side chain conformations. The second is that most side chains in high-resolution crystallographic structures can be represented by a limited number of conformers that comply with stereochemical and energetic constraints [30]. On this basis, libraries of side-chain rotamers have been derived [31-33]. Rotamers on a fixed backbone are often used when all the side chains need to be modeled on a given backbone. This approach avoids the combinatorial explosion associated with a full conformational search of all the side chains, and is applied by some comparative modeling and protein design approaches [3, 34].

The last step in comparative modeling is predicting the accuracy of the models. Two important factors may influence the ability to predict accurate models: one is the extent of structural conservation between the protein sequence and the template, and the other is the correctness of the alignment [4, 35, 36]. Models based on templates with more than 50% sequence identity are generally very accurate. They can exhibit approximately 1 Å Cα RMSD from the experimental structure.
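As an illustration of the accuracy measure quoted above, the following sketch computes the root mean square deviation between matched Cα coordinates of a model and an experimental structure. It assumes that the two coordinate lists are already superimposed and listed in the same residue order; the function name `ca_rmsd` and the coordinates in the example are hypothetical.

```python
import math


def ca_rmsd(model, reference):
    """RMSD in the units of the input coordinates (e.g. angstroms).

    Both arguments are sequences of (x, y, z) tuples for matched C-alpha atoms.
    No superposition is performed here; that is assumed to be done beforehand.
    """
    if len(model) != len(reference) or not model:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(model, reference):
        total += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(total / len(model))


if __name__ == "__main__":
    # Made-up coordinates: every model atom is displaced by 1 unit along z.
    model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]
    reference = [(0.0, 0.0, 1.0), (3.8, 0.0, 1.0)]
    print(ca_rmsd(model, reference))
```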

Sequence identity between the protein sequence and the template is not the only parameter for estimating the difficulty or the quality of a comparative model [35-37]. The quality of the alignment between these two sequences also depends on the number and the similarity distribution of all the sequences in the multiple sequence alignment [38]. Given the unprecedented growth of both structural and sequence databases, improvements in the quality of comparative models seem to be largely due to the increased availability of sequences and structures homologous to the protein of interest [35, 38]. Several programs are available on the web for testing the accuracy of 3-D models, such as DFIRE [39], VERIFY3D [40], ANOLEA [41], PROQ [42] and ProsaII [43].

MODELLER [4] and SWISS-MODEL [44] are commonly used, publicly available programs for building models.

An introduction to multiple sequence alignment

The discovery that sequences from different organisms are often related is highly significant for evolutionary research and analysis in biology. Many genes are represented in highly conserved forms in a wide range of organisms, and similar genes are conserved across broadly divergent species. Usually they perform very similar functions, but an alteration caused by mutations or rearrangements can change that function. Such alterations and similarities can be seen through simultaneous alignment of the sequences of the genes or proteins.

We can see the patterns of change among species, among different tissue types, or between the genomes of diseased and healthy individuals. In this sense, multiple sequence alignment (MSA) is a very powerful tool for analyzing and understanding the functional and structural properties of genes and proteins [45, 46].

Multiple sequence alignments are broadly used in computational analysis of protein sequences, comparative structure modeling, phylogenetic analysis, functional site prediction, and sequence database searching [45-47]. The purpose of constructing a multiple sequence alignment is to arrange residues with inferred common evolutionary origin or functional and structural similarities in the same column position for a set of sequences. Multiple sequence alignment provides position-specific information about conservation, correlation, and residue usage for several applications [46-48]. Therefore, the reliability of the information acquired from an MSA relies on the quality of the alignment. In recent years, precise and swift construction of MSAs has been under intensive research; several methods and programs have been developed in order to improve the quality of the alignments [48].
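The position-specific conservation mentioned above can be sketched in a few lines of Python. This is only one simple measure (the fraction of sequences that carry the most common residue in a column), chosen here for illustration; the function name and the toy alignment are assumptions, and published tools typically use more refined scores.

```python
from collections import Counter


def column_conservation(alignment):
    """Per-column conservation for an MSA given as equal-length strings.

    The score is the count of the most common residue divided by the number
    of sequences, so gaps ("-") lower the score of a column.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        if not residues:
            scores.append(0.0)  # column made only of gaps
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


if __name__ == "__main__":
    toy_msa = ["MKV-A", "MKVLA", "MRV-A"]  # made-up sequences
    print(column_conservation(toy_msa))
```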

One of the most commonly used programs for multiple sequence alignment is CLUSTAL, which has been in use since 1988 [49, 50]. CLUSTALW [51] is a more recent version of CLUSTAL. The “W” stands for weighting, which refers to the ability of the program to assign weights to the program parameters and sequences [52]. CLUSTALW is designed to provide an adequate alignment of a large number of more closely related sequences and a reliable indication of the domain structure of those sequences. The program also has options for adding one or more additional sequences with weights, or an alignment, to an existing alignment [52, 53].

Once an alignment has been made, a phylogenetic tree may be built by the neighbor-joining method, or Unweighted Pair Group Method with Arithmetic mean (UPGMA) guide trees can be built. UPGMA guide trees in particular help to speed up the alignment of extremely large data sets. The predicted trees may also be displayed by various programs [46, 52].
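To make the UPGMA idea concrete, here is a minimal sketch of the clustering step. It is not the CLUSTALW implementation: the function name, the use of `frozenset` keys for pairwise distances, and the nested-tuple output are illustrative choices, and branch lengths are omitted. At each step the two closest clusters are merged, and the distance from the new cluster to every other cluster is the size-weighted average of the two old distances.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a UPGMA tree topology.

    names: list of leaf labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    Returns the topology as nested tuples.
    """
    clusters = {name: 1 for name in names}  # cluster -> number of leaves
    d = dict(dist)
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        merged = (a, b)
        # Size-weighted average distance to every remaining cluster.
        for other in clusters:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


if __name__ == "__main__":
    labels = ["A", "B", "C", "D"]  # hypothetical sequences
    distances = {
        frozenset(("A", "B")): 2.0,
        frozenset(("A", "C")): 6.0,
        frozenset(("B", "C")): 6.0,
        frozenset(("A", "D")): 10.0,
        frozenset(("B", "D")): 10.0,
        frozenset(("C", "D")): 10.0,
    }
    print(upgma(labels, distances))
```

In this toy example A and B are joined first, then C, then D, which is the order in which a progressive aligner would combine the sequences.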

Nonsynonymous single nucleotide polymorphism

Single nucleotide polymorphisms (SNPs) represent one of the most common forms of genetic variation in a population [54, 55]. SNPs are DNA sequence variations that occur when a single nucleotide (A, T, C, or G) in the genome sequence is altered. For example, a SNP might alter the DNA sequence “…CGTGATTACGATTA…” to “…CGTGTTTACGATTA…”. For a variation to be considered a SNP, it must occur in at least 1% of the population [54, 56]. SNPs constitute about 90% of all human genetic variation and occur with a very high frequency, with estimates ranging from about 1 in 1000 bases to 1 in 100 to 300 bases. More than 99% of human DNA sequences are the same, but variations in DNA sequence may have a major impact on how humans respond to disease, environmental factors, and drugs [54]. A SNP or set of SNPs may not be directly responsible for any disease, but the sheer number of SNPs means that they can also be used to locate genes that influence such traits [57]. This makes SNPs valuable for biomedical research and for developing pharmaceutical products, computational methods and bioinformatics tools.

As of May 2009, the public SNP database (dbSNP, build 130) [58] contained 17.8 million SNP candidates, of which 6.5 million had been validated, that is, experimentally verified. Nonsynonymous SNPs (nsSNPs) are SNPs located within the open reading frame of a gene that result in an alteration in the amino acid sequence of the encoded protein. They might directly or indirectly affect the functionality of the protein alone or its interactions in a multi-protein complex, increasing or decreasing the activity of the metabolic pathway [54, 59]. nsSNPs have been linked to a wide variety of diseases because they affect protein function, alter DNA and binding sites, reduce protein solubility and destabilize protein structures [59]. Therefore, understanding the functional consequences of nonsynonymous changes and predicting potential causes and the molecular basis of diseases involve integrating information from multiple heterogeneous sources, including sequence data, structure data and pathway relations between proteins.
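The distinction between synonymous and nonsynonymous changes can be shown with a short sketch based on the standard genetic code. The function name is an assumption, and the codons in the example are chosen only to illustrate an isoleucine-to-valine change, a glutamate-to-lysine change and a silent change. Note that this simple version also labels a change to a stop codon ("*") as nonsynonymous.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(codon, position, new_base):
    """Classify a single-base change inside one codon.

    position is 0, 1 or 2 (index within the codon).
    Returns (kind, reference_amino_acid, altered_amino_acid).
    """
    codon = codon.upper()
    new_base = new_base.upper()
    if len(codon) != 3 or position not in (0, 1, 2) or new_base not in BASES:
        raise ValueError("expected a DNA codon, a position 0-2 and a DNA base")
    mutated = codon[:position] + new_base + codon[position + 1:]
    ref_aa = CODON_TABLE[codon]
    alt_aa = CODON_TABLE[mutated]
    kind = "synonymous" if ref_aa == alt_aa else "nonsynonymous"
    return kind, ref_aa, alt_aa


if __name__ == "__main__":
    print(classify_coding_snp("ATC", 0, "G"))  # Ile -> Val
    print(classify_coding_snp("GAA", 0, "A"))  # Glu -> Lys
    print(classify_coding_snp("CTT", 2, "C"))  # Leu -> Leu
```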

SNP information is currently collected in several databases, including dbSNP, the Human Genome Variation Database (HGVbase) [60], the Japanese Single Nucleotide Polymorphism (JSNP) database [61] and the HapMap Project [54]. A number of studies and resources have begun to explore the effects of nsSNPs on the tertiary structure of proteins and their functionality. SNPs3D [56], PolyPhen [62], TopoSNP [63], ModSNP [64], LS-SNP [65], SNPeffect [66], MutDB [67, 68], Snap [69] and StSNP [70] have all been released for public use. A brief description of the available resources for SNP analysis is presented in Tables 1 and 2. Note that these are reference tables rather than comparison tables, as the field is in its infancy and all resources are currently evolving, with each database having its own strengths.

Chapter 2. Mapping nsSNPs on protein structures.

StSNP; a web server for mapping and modeling nsSNPs on protein structures

StSNP is a web-based server that provides the ability to analyze and compare human nsSNP(s) in protein structures, protein complexes and protein–protein interfaces, where nsSNP and structure data on protein complexes are available in PDB, along with the analysis of the metabolic data within a given pathway. Usually nsSNPs do not inactivate protein functionality completely, since such a mutation would most likely be lethal. Instead, nsSNPs change the protein activity at some level, either directly or indirectly through interactions with other proteins in the pathway; therefore, such information has to be considered together. StSNP was developed for this purpose. It utilizes information from different sources and provides ‘on the fly’ comparative modeling of the wild-type and mutated proteins (when an appropriate structural template is available), along with real-time analysis and visualization of structures and sequences [71], to assist researchers in visual inspection of the possible effects of nsSNPs on protein structure. Users can analyze data in different formats with StSNP. Search options include keyword, NCBI protein accession number, PDB ID and NCBI nsSNP ID, which help users quickly retrieve targeted information.

Design and implementation sources

In general, the internal database structure has been inherited from the Structural Exon database (SEDB) [72]. StSNP was implemented using a MySQL database running on a Linux server, with Perl scripts used for all data retrieval and output (Figure 2). StSNP utilizes three major data sources:

(1) Protein sequences from NCBI,

(2) Reference and nsSNP locations from NCBI’s dbSNP,

(3) Structures and sequences from the PDB.

Every protein sequence has a pre-calculated list of structural modeling templates found by BLAST and stored in a database for quick retrieval. The actual alignment of the protein sequence and the PDB sequence was implemented with the Smith–Waterman algorithm [73], using similarity-specific scoring matrices, from BLOSUM30 to BLOSUM90 [74]. The pathway information is taken from KEGG [75, 76], human gene/protein information is gathered from NCBI’s Gene [77], and the comparative modeling phase is done by MODELLER. The modeling part of StSNP is interactive and allows the user to choose a template from the list, select particular mutations to be modeled, calculate the model and subsequently visualize the superimposition of the models and template in the Friend applet. Additionally, simultaneous analysis of structurally similar proteins/models for structural correlation of nsSNP locations can be done in the Friend applet by the TOPOFIT structure alignment method [78, 79].
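The Smith–Waterman local alignment mentioned above can be sketched as follows. This is a simplified illustration and not the StSNP code: it uses a plain match/mismatch score and a linear gap penalty instead of the BLOSUM matrices named in the text, and the function name, default scores and example sequences are assumptions.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment of two sequences.

    Returns (best_score, aligned_a, aligned_b) for one optimal local alignment.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    # Fill the dynamic programming matrix; scores never drop below zero.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("AAAGATTACAGGG", "TTGATTACATT"))
```

Replacing the match/mismatch rule with a lookup in a substitution matrix would bring the sketch closer to the similarity-specific scoring described in the text.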

StSNP currently (June 2009) contains 107,028 nsSNPs, 31,834 protein sequences, 12,741

genes and 30,517 protein structures.

Web server features

StSNP has several types of search options, including search by Protein ID, PDB ID or keyword, all of which integrate nsSNP related information. For example, the Protein ID search displays the known nsSNP(s) for the protein, while the PDB ID search provides a list of similar Protein IDs with nsSNP(s). Both searches provide a link to pathway information if the data are available. The resulting report pages provide the user with options for model template selection. Only templates satisfying the following two criteria are shown: the nsSNP(s) must lie within the alignment of the protein sequence with the template, and the sequence similarity of the alignment must be at least 30%. The modeling step lets the user choose which nsSNPs to map, and after completion the user can instantly visualize the models with the Friend applet. StSNP has several browsing and search capabilities as well, for example, searching for available structures by protein length and percent similarity, or by a specifically chosen reference and nonsynonymous residue within a particular . The features found in StSNP have been designed with graphics, plots and easily readable tables with the end user in mind.
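The two template criteria described above can be expressed as a small filter. This is a sketch under stated assumptions rather than the StSNP implementation: the `TemplateHit` fields, the function name and the template labels are hypothetical, the 30% value is treated as a minimum, and all requested nsSNP positions are required to fall inside the aligned region.

```python
from dataclasses import dataclass


@dataclass
class TemplateHit:
    template_id: str   # hypothetical label for a structural template
    chain: str
    aln_start: int     # first aligned residue of the protein sequence (1-based)
    aln_end: int       # last aligned residue of the protein sequence (1-based)
    similarity: float  # percent sequence similarity of the alignment


def usable_templates(hits, snp_positions, min_similarity=30.0):
    """Keep templates whose alignment covers every nsSNP position and whose
    similarity reaches the threshold; best similarity first."""
    selected = []
    for hit in hits:
        covers_all = all(hit.aln_start <= pos <= hit.aln_end for pos in snp_positions)
        if covers_all and hit.similarity >= min_similarity:
            selected.append(hit)
    return sorted(selected, key=lambda h: h.similarity, reverse=True)


if __name__ == "__main__":
    # Made-up hits for illustration only.
    hits = [
        TemplateHit("tmplA", "B", 1, 209, 98.0),
        TemplateHit("tmplB", "A", 1, 100, 85.0),   # does not cover position 105
        TemplateHit("tmplC", "A", 1, 209, 25.0),   # below the similarity threshold
        TemplateHit("tmplD", "A", 50, 200, 45.0),
    ]
    for hit in usable_templates(hits, [105, 110, 176]):
        print(hit.template_id, hit.similarity)
```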

Examples of use 1

Glutathione S Transferase

Glutathione S Transferase (GST, Protein ID: NP_000843) is a family of multifunctional enzymes involved in cellular detoxification of xenobiotics and reactive endogenous compounds of oxidative metabolism. nsSNPs of GST were mapped onto the protein structure, as shown in Figure 3 [80]. The output page reports the available reference and nonsynonymous residues for the protein with their rs numbers, the amino acid properties for the variations, and the alignment picture of the protein sequence with the template, including nsSNP locations. An rs number, or reference SNP ID, is an identification tag assigned by NCBI to a group or cluster of SNPs that map to an identical location [81]. In this example, all nsSNPs are located inside the alignment and are thus available for mapping onto PDB ID 1aqv chain B. The next step is to choose the nsSNPs for modeling. All the known nsSNPs associated with GST (I105V, T110S, A114V, D147Y and L176M) have been modeled in this example and are presented in Figure 4A.

A black circle denotes where isoleucine has changed to valine at position 105. The role of the functional I105V GSTP1 polymorphism in the pathogenesis of methamphetamine abuse has been studied, with researchers noting that individuals with the G allele (valine) are expected to have decreased GST detoxification [80]. The mapping of this nsSNP onto the protein structure (Figure 4A) shows that I105V is located in direct contact with the glutathione and could potentially have a strong effect on GST activity or its binding affinity for glutathione. The results section also provides the user with a link to glutathione metabolism in order to view other members found in the pathway (Figure 4B).
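Substitution labels such as I105V follow a simple pattern: reference residue, 1-based position, variant residue. The sketch below parses such labels and applies them to a sequence to produce the mutated sequence that would be passed to a modeling step. The function names and the short test sequence are hypothetical, and three-letter labels such as Glu504Lys are not handled.

```python
import re

# One-letter reference residue, position, one-letter variant residue.
MUTATION_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_mutation(label):
    """Split a label such as 'I105V' into ('I', 105, 'V')."""
    m = MUTATION_RE.match(label)
    if not m:
        raise ValueError(f"unrecognized mutation label: {label!r}")
    return m.group(1), int(m.group(2)), m.group(3)


def apply_mutations(sequence, labels):
    """Return a copy of the sequence with every listed substitution applied.

    Each reference residue is checked against the sequence before it is changed.
    """
    residues = list(sequence)
    for label in labels:
        ref, pos, alt = parse_mutation(label)
        if pos < 1 or pos > len(residues):
            raise ValueError(f"{label}: position outside the sequence")
        if residues[pos - 1] != ref:
            raise ValueError(f"{label}: sequence has {residues[pos - 1]} at {pos}")
        residues[pos - 1] = alt
    return "".join(residues)


if __name__ == "__main__":
    print(parse_mutation("I105V"))
    print(apply_mutations("MKIAT", ["I3V", "T5S"]))  # made-up five-residue sequence
```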

Examples of use 2

Aldehyde Dehydrogenase-2

Aldehyde Dehydrogenase-2 (ALDH2) (Protein ID: NP_000681) is illustrated in Figure 5. ALDH2 is involved in the oxidation of acetaldehyde at physiological concentrations, such as those found when a person consumes alcohol. Worldwide, the Lys504 allele has its highest prevalence (30–50%) in Asian populations [82]. In this example, glutamate is replaced by lysine at position 504 (Glu504Lys), a substitution that has been demonstrated to essentially eliminate ALDH2 activity [83]. From these examples, one can see how a quick search in StSNP, in conjunction with the structural mapping of the nsSNP locations, provides structural support to the medical studies mentioned here and may facilitate the design of future experiments.

Discussions and Conclusions

StSNP provides practical, user friendly access to the wealth of information related to nsSNPs by seamlessly connecting various databases into one pipeline. Key functional and structural information along with known pathways the proteins are involved in, have all been linked together to provide users some advantages when compared to other current resources:

(a) the sequence, structure and pathway information have all been cross-referenced, which enables a user to quickly query and visualize the inter-related nsSNP data;

(b) a graphical display of the nsSNPs provides a user with the location of the nsSNP(s) in

terms of primary sequence, and whether such nsSNP(s) can be modeled;

(c) the modeling options let the user choose which nsSNP to map and visualize which nsSNPs could potentially have deleterious effects on a protein’s function;

(d) the modeled protein structures are automatically loaded in Friend, where they can be

easily viewed, compared and analyzed;

(e) finally, StSNP will be updated on a regular basis following the updates on the major sources, dbSNP, PDB, KEGG and others.

Chapter 3. Functional genomics of HMGN3a and SMARCAL1 in early mammalian embryogenesis.

Stages of early embryonic development

Early embryonic development in general is initiated when mature oocytes (MII) are fertilized by spermatozoa. Maternal factors, such as mRNAs, microRNAs and proteins stored in the oocyte, provide the means of support for the first few days of development.

The transition from a maternal to a zygotic control of development, called maternal to zygotic transition (MZT), and the activation of the embryonic genome involve chromatin structural modifications that take place during the first few embryonic cell cycles [84].

Embryonic genome activation (EGA) sets the stage for later development [85, 86].

Changes in chromatin structure have been characterized throughout the transition from transcriptional incompetence to the minor activation of the zygotic genome at the 1-cell stage and through the major genome activation at the 2-cell stage in murine embryos

[87]. In bovine embryos EGA occurs at the 8- to 16-cell stage with extensive programming of . However, the regulation of during EGA still remains a mystery [88].

Description of Chromatin remodeling

Chromatin remodeling is an extensive process occurring during early embryogenesis. An essential property of the embryonic chromatin structure is to prevent the access of the transcriptional machinery to all of the promoters in the genome [88]. The expression of some genes may be mediated by chromatin remodeling proteins. Chromatin remodeling complexes may change the overall pattern of expression of mammalian genes, allowing transcription factors and signaling pathways to produce different genomic transcriptional responses to common signals [89]. This is particularly important for preimplantation embryos starting cell differentiation cascades that will lead to tissue and organogenesis.

These changes in chromatin structure generate activation of the transcriptional machinery and gene expression occurring during early embryo development, leading to a unique chromatin structure capable of maintaining totipotency during embryogenesis and differentiation during postimplantation development [86, 88].

High mobility group nucleosomal protein family

The High Mobility Group Nucleosomal (HMGN) protein family is the only group of nuclear proteins that bind to the 147-bp-long nucleosome core particle with no sequence specificity [90]. HMGN proteins are present in the nuclei of all mammalian and most vertebrate cells at approximately 10% of the abundance of histones [91]. They bind

as homodimers to the nucleosome and cause chromatin modifications that facilitate and

enhance several DNA-dependent activities, such as transcription, replication and DNA

repair. This protein family is composed of 3 members, HMGN1 (also known as HMG-

14), HMGN2 (also known as HMG-17), and the most recently discovered HMGN3,

initially named TRIP7 for its ability to bind the thyroid hormone receptor [92].

In the mouse HMGN1 and HMGN2 have been detected throughout oogenesis and

preimplantation development and are progressively down-regulated throughout the entire

embryo, except in cell types undergoing active differentiation [93]. Since a reduction in the levels of HMGN1 and HMGN2 mRNA also occurs during myogenesis in the rat, down-regulation of HMGN mRNA may be associated with tissue differentiation [94]. Depletion of HMGN1 and HMGN2 in one- or two-cell embryos delays subsequent embryonic divisions. Cells derived from HMGN1-/- mice have an altered transcription profile and are hypersensitive to stress [93]. Experimental manipulations of the intracellular levels of HMGN1 in X. laevis embryos cause specific developmental defects at the post-blastula stages. Furthermore, HMGN proteins regulate the expression of specific genes during X. laevis development [95]. Several lines of evidence implicate HMGN1 and HMGN2 in transcriptional regulation. Chromatin containing actively transcribed genes has two to three times more HMGN1 and HMGN2 than total chromatin [88, 93].


Description of HMGN3

The human HMGN3 transcript produces two splice variants: HMGN3a, the long isoform with 99 amino acids, and HMGN3b, with 77 amino acids, which arises from a truncation of the fifth exon. Although no HMGN3b protein has been identified in the rat and cow,

ESTs with high identity to it suggest that this splice variant may also exist in these

species. HMGN3a constitutes a family of relatively low molecular weight non-histone

components of about 100 amino acid residues. The cow, mouse, and rat HMGN3a

proteins share more than 81% identity with the human HMGN3a protein [92]. The role of

HMGN3a has not been studied in mammalian development. Although the exact function

of HMGN3a during early embryonic development has not been determined, its role in

facilitating chromatin modifications and enhancing transcription, replication, and DNA

repair is critical for early embryo development [92].

SWI/SNF protein family

Another important mechanism in regulation of chromatin structure in the early embryo is

mediated by nucleosome repositioning factors, which are ATP-dependent chromatin-remodeling enzymes. Nucleosome repositioning factors use energy released by ATP hydrolysis to alter histone-DNA contacts and reposition nucleosomes to create chromatin environments that are either open or compact. These factors do not recognize sequence-specific DNA binding sites but are instead recruited to promoter regions by specific transcription factors. Nucleosome repositioning factors typically exist as multi-subunit protein complexes, like the SWI/SNF (from SWItching and Sucrose Non-Fermenting in yeast) ATP-dependent chromatin remodeling complex [96]. SWI/SNF complexes are thought to regulate transcription of certain genes by altering the chromatin structure around them with their helicase and ATPase activities [97, 98]. In mammals, each SWI/SNF complex contains one of two distinct ATPases, SMARCA2 or SMARCA4, as its catalytic subunit [99]. Both ATPases have important developmental functions. In primates, expression of both subunits remains constant and low throughout embryogenesis until the blastocyst stage [100]. In mouse embryos, Smarca4 transcripts remain at stable levels throughout preimplantation development, while Smarca2 transcripts remain low until the blastocyst stage, when their mRNA levels increase [101]. In porcine embryos, SMARCA2 transcripts are most abundant in germinal vesicle (GV) stage oocytes and decline progressively during embryo development to the blastocyst stage [102]. Mutant mice lacking the Smarca4 gene die at preimplantation, while the Smarca2-null mouse mutant is viable and shows a mild overgrowth phenotype [103, 104].

Description of SMARCAL1

Another member of the SWI/SNF family of proteins involved in chromatin remodeling is

SMARCA1 (SWI/SNF related, matrix associated, actin dependent regulator of chromatin, subfamily a member 1), considered a global transcription activator and also called

SNF2L1. Like other SWI/SNF members, the SMARCA1 protein has a helicase ATP-binding domain. However, since the rest of its motifs diverge from other members of the

SWI/SNF family, it has been classified in the ISWI (for Imitation SWItch) subfamily of

ATPases, together with SMARCA5. Decreasing levels of SMARCA5 were found during

Rhesus monkey embryogenesis from GV oocytes until blastocyst stage. The same study

reported low levels of SMARCA1 throughout all stages of embryogenesis except for the

8-cell stage [100].

Members of the SNF2 subfamily of SWI/SNF proteins are characterized by seven motifs (I, Ia, II, III, IV, V and VI) [105]. SMARCAL1 (SWI/SNF related, matrix associated, actin dependent regulator of chromatin, subfamily a-like 1) is one of the SNF2 members and shows high sequence similarity to the E. coli RNA polymerase-binding protein HepA [105]. Recent reports have linked mutations in the SMARCAL1 gene with Schimke immunoosseous dysplasia (SIOD), a human autosomal recessive disorder with the diagnostic features of spondyloepiphyseal dysplasia (a descriptive term for a group of disorders with primary involvement of the vertebrae and epiphyseal centers, resulting in short-trunk disproportionate dwarfism), renal dysfunction, and T-cell immunodeficiency [106-108]. The ability of SMARCAL1 to interact primarily with nucleosomes was demonstrated using protein interaction microarrays. SMARCAL1 transcripts are ubiquitously expressed in different human and mouse tissues, suggesting a role in normal cellular functions or housekeeping activities, such as transcriptional regulation [105]. Although no studies have reported the expression of SMARCAL1 during early embryogenesis in mammals, our collaborator Dr. Memili and his group previously detected a 7-fold increase of SMARCAL1 mRNA in 8-cell bovine embryos as compared with MII oocytes by using oligonucleotide microarray gene expression analysis and real-time PCR validation [86]. Additionally, studies on the SWI/SNF complex associated factor SMARCC1 (also called SRG3 and BAF155), a core subunit of the SWI/SNF complex, have highlighted the importance of the ATPase

subunits and the whole complex during embryogenesis. In the absence of Smarcc1,

mouse embryonic development ceased during preimplantation stages, indicating that

Smarcc1, as well as the chromatin-remodeling process, plays an essential role in early

mouse development [109]. SMARCC1 mRNA was found in high levels in GV stage

Rhesus monkey oocytes and at very low levels throughout early embryogenesis but was higher later at the hatched blastocyst stage [100, 105].

Comparative functional genomics analyses of HMGN3a across mammals

I have used seven mammalian sequences (the number of species depends on the availability of data in current databases) in the construction of an HMGN3a phylogenetic tree (Figure 6). The percentage of replicate trees in which the associated taxa clustered together in the bootstrap test (500 replicates) is shown next to the branches. All positions containing gaps and missing data were eliminated from the dataset. There were a total of 95 positions in the final dataset, of which 18 were parsimony-informative. A site is parsimony-informative if it contains at least two types of nucleotides or amino acids, and at least two of them occur with a minimum frequency of two. The most significant observation in the multiple sequence alignment of HMGN3a was the insertion of an alanine in the fifth exon of the Bos taurus protein (highlighted in red in Figure 7). Several substitutions in the bovine sequence were shared by other mammals in the alignment. Macaca mulatta and Canis familiaris HMGN3a proteins have longer sequences with regions not shared with the other species. We focused on the regions of the protein shared by all species. We also marked other alanine substitutions in the alignment with stars (Figure 7).
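The parsimony-informative criterion described above can be illustrated with a short Python sketch. It is not the MEGA 4 implementation; the function names and the toy alignment are hypothetical and serve only to show how columns with gaps are removed (complete deletion) and how the remaining columns are tested.

```python
from collections import Counter


def is_parsimony_informative(column):
    """True if at least two character states each occur at least twice."""
    counts = Counter(column)
    states_seen_twice = sum(1 for n in counts.values() if n >= 2)
    return states_seen_twice >= 2


def count_informative_sites(alignment, missing="-?"):
    """Return (positions kept, parsimony-informative positions).

    Columns containing a gap or missing-data symbol are dropped first,
    mirroring the complete-deletion treatment described in the text.
    """
    kept = 0
    informative = 0
    for column in zip(*alignment):
        if any(ch in missing for ch in column):
            continue  # complete deletion of gapped columns
        kept += 1
        if is_parsimony_informative(column):
            informative += 1
    return kept, informative


# Toy alignment (hypothetical sequences, not the HMGN3a data).
toy_alignment = [
    "MKAAE",
    "MKAGE",
    "MRSGE",
    "MR-GD",
]
kept, informative = count_informative_sites(toy_alignment)
print(kept, informative)
```

In the toy alignment only the second column qualifies, because it is the only ungapped column in which two different residues (K and R) each appear in at least two sequences.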

Comparative functional genomics analyses of SMARCAL1 across mammals

SMARCAL1 has four conserved domains (Figure 8). The first and the second are two HARP (HepA-related protein) domains, each approximately 60 residues long, with single-stranded DNA-dependent ATPase activity. The third conserved domain is a helicase-like domain named the SNF2 N-terminal domain, and the fourth is a helicase C-terminal domain [105].

I used the whole SMARCAL1 protein sequences from 9 mammalian species (the number of species depends on the availability of data in current databases) to construct the

phylogenetic tree (Figure 9). The percentage of replicate trees in which the associated

taxa clustered together in the bootstrap analysis (500 replicates) is shown next to the

branches. All positions containing gaps and missing data were eliminated from the

dataset. There were a total of 856 positions in the final dataset, of which 198 were

parsimony informative. I used each one of the four SMARCAL1 conserved domains to

build separate multiple sequence alignments and construct separate phylogenetic trees for

each domain (Figures 10, 12, 14 and 16). Phylogenetic analysis shows that Homo sapiens, Pan troglodytes, and Macaca mulatta cluster together, whereas Equus caballus, Canis familiaris, and Bos taurus occupy relatively distant positions in the tree. Rattus norvegicus and Mus musculus separate these two groups in the tree. When the first and second HARP domains of SMARCAL1 are compared, there is also a clear separation between the group of Canis familiaris, Rattus norvegicus, and Mus musculus and the group of Pan troglodytes, Homo sapiens, and Macaca mulatta. Equus caballus is closer to the second group in the first HARP domain. Monodelphis domestica (gray short-tailed opossum) is the most distant species among the 9 mammals in the tree. For the first HARP domain, the positions at which substitutions occur are highlighted in yellow (Figure 11). Monodelphis domestica was the most distantly related mammal with respect to this domain. Substitutions were observed in 24 positions. On the

4th substitution, glutamate, a medium size acidic amino acid was substituted by alanine, a small size hydrophobic amino acid in Bos taurus. On the 8th substitution, while Bos taurus, Equus caballus, and Canis familiaris have a serine, it is substituted for asparagine

in Pan troglodytes, and Homo sapiens, and for arginine in Macaca mulatta and Rattus

norvegicus. Additionally Mus musculus has a histidine, and Monodelphis domestica has a

lysine at this position. On the 10th substitution, while Bos taurus, Equus caballus, Canis

familiaris, and Monodelphis domestica have an alanine, a small size hydrophobic amino

acid, Pan troglodytes, Homo sapiens, and Macaca mulatta have aspartate, a medium size

acidic amino acid. Both Rattus norvegicus and Mus musculus have phenylalanine at this

position. Taken together, these substitutions along the HARP sequence may suggest that the distribution of acidic residues is conserved. Although Bos taurus is the only species with alanine instead of glutamate at the 4th position, the acidic property at that position is otherwise well conserved across the species. For the second HARP domain, there were

34 positions with amino acid substitutions in at least one of the species studied. These

substitutions are highlighted in the alignment (Figure 13). Again Monodelphis domestica

was the most distant species for this domain.

In the phylogenetic tree analysis of the HARP1 domain, significantly higher bootstrap values were observed for Rattus norvegicus, Mus musculus, Pan troglodytes and Homo sapiens. In the second HARP domain, high bootstrap values were retained only for Rattus norvegicus and Mus musculus. For both domains, Monodelphis domestica was the most distant of the 9 mammalian species. When we compared the first and second HARP domains of SMARCAL1, there was also a clear separation between the group of Canis familiaris, Rattus norvegicus, and Mus musculus and the group of Pan troglodytes, Homo sapiens, and Macaca mulatta. Equus caballus was closer to the second group in the first HARP domain.

The phylogenetic tree of SNF2N, the third domain of SMARCAL1, has a composition similar to the first two phylogenetic trees (Figure 14), but the most significant difference is that Canis familiaris moves closer to Bos taurus.

For the last domain of SMARCAL1, one of the clearest observations is the lower bootstrap value between Bos taurus and GLEAN_20241 compared with the other phylogenetic trees. In the phylogenetic and sequence analysis, 11 parsimony-informative sites were detected and 46 sites were conserved among the species (Figure 15). Positions with insertions and deletions are marked with stars. The first insertion comprises 3 additional amino acids (glutamate, leucine, and lysine) present only in the Equus caballus protein. There is a deletion of the amino acid arginine, which is present in all species except for the NCBI bovine sequence. However, GLEAN_20241 does not have the deletion. The amino acid threonine is also absent in both Rattus norvegicus and Mus musculus. The bovine NCBI sequence showed significant mutations in the third conserved domain, marked in red on the alignment. However, the official gene set sequence for this protein (Bovine Genome Database http://racerx00.tamu.edu/bovine) shows a

higher homology to all species, differing in only 2 amino acids from the horse and human

protein. These findings indicate sequencing errors in the currently available bovine

SMARCAL1 protein. These errors will likely be corrected with the completion of the

bovine genome annotation effort. The bovine helicase C-terminal domain protein shows a

deletion (marked with a star) and several substitutions highlighted in red (Figure 17) that

do not exist in GLEAN_20241. These observations point to the need for an update in

SMARCAL1 protein sequence currently available at NCBI. In addition to our analysis,

we applied disparity index, ID [110], which measures the observed difference in

evolutionary patterns for a pair of sequences. The disparity index for HMGN3a (Figure

18), did not show any significant pairs of species. The disparity index for each domain of

SMARCAL1 is presented in Figure 19. In the first HARP domain (Figure 19A) 6 pairs of

species (Bos taurus-Rattus norvegicus, Pan troglodytes-Rattus norvegicus, Homo

sapiens- Rattus norvegicus, Macaca mulatta-Rattus norvegicus, Canis familiaris-Rattus

norvegicus, and Mus musculus-Homo sapiens) were considered significant. The disparity index did not detect differences in evolutionary patterns for the second HARP domain (Figure 19B). There were 9 significant pairs in the SNF2 N-terminal domain disparity index (Bos taurus-Equus caballus, Bos taurus-Pan troglodytes, Bos taurus-Homo sapiens, Bos taurus-Macaca mulatta, Bos taurus-Rattus norvegicus, Bos taurus-Monodelphis domestica, Equus caballus-Pan troglodytes, Equus caballus-Homo sapiens, Equus caballus-Macaca mulatta) (Figure 19C). In the disparity index for the helicase C-terminal domain, only the pair Bos taurus-Canis familiaris was significant (Figure 19D).

Analysis of the conserved/non-conserved regions on the comparative modeled structure of SMARCAL1

Since template protein structures are available only for the helicase-like and helicase domains of SMARCAL1, a comparative model was built covering only these domains. The sequence identity between the template and the protein sequence was 24%. Based on the multiple sequence alignments, all non-conserved residues were mapped onto the modeled structure (Figure 20). The SNF2N and helicase C domains contain nucleotide binding and ATP binding residues. There are 9 residues responsible for nucleotide binding and 8 residues for ATP binding, which were retrieved from the literature [111, 112]. These locations were mapped onto the modeled structure. The analysis showed that all ATP binding residues lie in the conserved regions. Although Threonine781 is among the residues responsible for nucleotide binding, it falls into a non-conserved region of the protein sequence. The multiple alignment shows that at this position only one species, Mus musculus, has a variation, which is proline (Figure 12). This substitution changes the amino acid side chain polarity as well as the hydrophobicity and size at that position.
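The mapping step described above amounts to finding the alignment columns in which the sequences disagree and translating them into residue numbers of the modeled protein. The Python sketch below shows one way this could be done; it is not the procedure implemented in Friend, and the function name, the toy alignment, and the starting residue number are hypothetical.

```python
def nonconserved_positions(alignment, reference_index=0, start_number=1):
    """Residue numbers of the reference sequence at non-conserved columns.

    A column counts as non-conserved when the aligned sequences do not all
    carry the same symbol (a gap in another sequence counts as a difference).
    Columns where the reference itself has a gap are skipped, because there
    is no residue in the model to map them onto.
    """
    reference = alignment[reference_index]
    positions = []
    residue_number = start_number - 1
    for column_index, ref_char in enumerate(reference):
        if ref_char == "-":
            continue
        residue_number += 1
        symbols = {sequence[column_index] for sequence in alignment}
        if len(symbols) > 1:
            positions.append(residue_number)
    return positions


# Toy alignment (hypothetical): the reference is the first sequence and its
# first residue is assumed to be number 780.
toy_alignment = ["GT-KS", "GTAKS", "GP-KS"]
print(nonconserved_positions(toy_alignment, start_number=780))
```

The returned residue numbers could then be passed to a molecular viewer for highlighting, while known binding residues are checked against the same list.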

Methods of comparative functional genomics analyses of HMGN3a and SMARCAL1 across mammals

Protein sequences of SMARCAL1 were retrieved from NCBI by performing a protein BLAST search against the mammalian database using Bos taurus SMARCAL1 (NP_788839) as the query protein. Sequence data were manipulated with the Friend software, a bioinformatics application designed for simultaneous analysis and visualization of multiple structures and sequences of proteins, DNA or RNA. A multiple sequence alignment of the nine mammalian SMARCAL1 protein sequences listed in Table 3 was created using Clustal W within Friend. We defined conserved regions based on the domains listed in the Pfam database [113], which catalogs conserved amino acid sequence regions. The same steps were applied in constructing the HMGN3a phylogenetic tree, for which we used the only available Reference Sequence protein for Bos taurus (NP_001029676). Since the availability of mammalian HMGN3a sequences is limited, we excluded Monodelphis domestica from the HMGN3a phylogenetic analyses, which were conducted in MEGA 4 [110]. The 7 mammalian species used in our analysis of HMGN3a are shown in Table 4. The Maximum Parsimony method was used for inferring the evolutionary history when creating the phylogenetic trees for both SMARCAL1 and HMGN3.

Methods of comparative modeling of SMARCAL1

Because suitable templates were available only for SMARCAL1, a comparative model was created for SMARCAL1 alone. Chain A of PDB entry 1z63, which shares 24% sequence identity with the Bos taurus SMARCAL1 protein sequence, was used as the template. The template covered residues 422 to 869, spanning two domains of SMARCAL1: the helicase-like and helicase domains. Comparative modeling was performed with MODELLER 9v1 [4]. Structural analysis was done in Friend, and the model picture was created with Chimera [114].

Discussions and conclusions

In the analysis, the bovine HMGN3a and SMARCAL1 showed a high degree of homology in all studied mammals. This high structural conservation highlights the importance of chromatin remodeling in the regulation of gene expression, particularly during early embryonic development. Understanding the interactions between these proteins and their roles could improve our understanding of epigenetics in reproduction and disease. Appropriate models for the study of chromatin remodeling proteins are essential to understanding this process, particularly in the case of diseases like SIOD,

caused by a mutation in the SMARCAL1 gene. The greater similarities of the HMGN3a

and SMARCAL1 proteins in human and bovine species may suggest that more attention

should be paid to a bovine model in the study of chromatin remodeling [88].

Chapter 4. Comparative model structures and active site predictions.

Description of Theoretical Microscopic Titration Curves

THEMATICS (Theoretical Microscopic Titration Curves) is a computational predictor of

the active sites of enzymes from protein structure [115-119]. For the corresponding

protein, the electrical potential function is computed by using Finite Difference Poisson-

Boltzmann methods. Then the predicted titration curves for all of the ionisable residues in

the protein structure are calculated [116]. The shapes of the predicted titration curves are

analyzed for identification of residues with elongated, non-sigmoidal titration behavior

[116]. A cluster of two or more such anomalous residues in physical proximity is a highly

reliable predictor of the active site of the protein [115, 116].

An advantage of THEMATICS is that it requires only the three-dimensional structure of the query protein as input. The query protein does not need to have any similarity in sequence or in structure to any previously characterized protein [115, 116]. However, the method also has a disadvantage: a three-dimensional structure of the protein has to exist. This raises the question of which kind of structure is sufficient: an experimentally determined structure, or a theoretical model structure [116]?

Building comparative model structures for THEMATICS

In order to answer the question, comparative model structures were built from [7]. In the present work, it is shown that THEMATICS can predict active site locations in comparative model structures [7, 115]. The procedure started with an experimentally determined template structure [1, 120] and the sequence of the query protein. Sequence alignments and comparative structure modeling [6, 121] were performed using the integrated application Friend, which interfaces with ClustalW and with MODELLER.

To perform the modeling, pairwise alignments were done in MODELLER. The titration curves are calculated for all of the ionisable residues in each of the template and comparative model structures [116]. The curves are analyzed to select the ones that deviate most from the typical sigmoidal shape. Most of the curves possess the characteristic sigmoidal shape, with a sharp fall-off in charge in the region around the midpoint, as predicted by the Henderson-Hasselbalch equation [116]. Only a small fraction (about 3 to 7%) of the ionisable residues deviates from the typical behavior.
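As a point of reference for what a typical titration curve looks like, the sketch below evaluates the Henderson-Hasselbalch expression for an isolated ionisable group. It does not reproduce the Poisson-Boltzmann based curves computed by THEMATICS; the function names and the pKa value used are illustrative assumptions.

```python
def fraction_protonated(ph, pka):
    """Henderson-Hasselbalch fraction of a group that is protonated at a given pH."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


def mean_charge(ph, pka, acidic):
    """Mean net charge of one isolated ionisable group.

    Acidic groups (e.g. Glu) go from 0 to -1 on deprotonation;
    basic groups (e.g. His) go from +1 to 0.
    """
    protonated = fraction_protonated(ph, pka)
    return protonated - 1.0 if acidic else protonated


# Idealized curve for a histidine-like group; the pKa of 6.5 is an assumed value.
curve = [(ph, round(mean_charge(ph, 6.5, acidic=False), 3)) for ph in range(0, 15)]
for ph, charge in curve:
    print(ph, charge)
```

In this idealized curve the protonated fraction falls from about 0.91 to about 0.09 within one pH unit on either side of the pKa. Residues flagged by THEMATICS are those whose computed curves lack this sharp fall-off.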

THEMATICS identifies the deviant ones [116]. A search was then performed for a cluster of residues with deviant titration behavior that are in physical proximity. A residue is deemed to belong to a cluster if it is a nearest neighbor, or is within 7 Å, of another cluster member [116]. Since these clusters are highly reliable predictors of active site location in the protein structure, they are called THEMATICS positives [115].
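The distance criterion above can be sketched as a simple single-linkage grouping. This is an illustration only, not the THEMATICS code: the function name, the residue labels, and the coordinates are hypothetical, each residue is reduced to a single representative point, and the nearest-neighbor criterion is omitted for brevity.

```python
import math


def cluster_residues(coords, cutoff=7.0, min_size=2):
    """Group residues so that each member lies within `cutoff` of another member.

    `coords` maps a residue label to an (x, y, z) point in angstroms.
    Groups smaller than `min_size` are discarded.
    """
    unvisited = set(coords)
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        cluster = {seed}
        frontier = [seed]
        while frontier:
            current = frontier.pop()
            near = {
                name
                for name in unvisited
                if math.dist(coords[current], coords[name]) <= cutoff
            }
            unvisited -= near
            cluster |= near
            frontier.extend(near)
        if len(cluster) >= min_size:
            clusters.append(sorted(cluster))
    return sorted(clusters)


# Hypothetical positions of four residues with deviant titration curves.
deviant = {
    "A": (0.0, 0.0, 0.0),
    "B": (5.0, 0.0, 0.0),
    "C": (11.0, 0.0, 0.0),
    "D": (30.0, 0.0, 0.0),
}
print(cluster_residues(deviant))
```

Residues A and C are 11 Å apart but still fall in one cluster because each is within 7 Å of B, whereas the isolated residue D is not reported.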

Comparative protein modeling for THEMATICS

In this part comparative models and their related results for THEMATICS will be presented.

Triosephosphate isomerase (TIM) orthologs

The conversion of D-glyceraldehyde 3-phosphate (GAP) to dihydroxyacetone phosphate (DHAP) is catalyzed by triosephosphate isomerase (TIM). The x-ray crystal structure of TIM from chicken (PDB ID: 1tph), with a resolution of 1.8 Å, was obtained from the Protein Data Bank. Since TIM is active as a dimer, the calculations are performed on the dimer.

The first of the four structures homologous to the chicken TIM structure 1tph is built from the sequence for Schistosoma japonicum with 60% sequence identity in the pairwise alignment and a 0.16 Å RMSD value for the model structure. The second model is built from the sequence for Enterococcus faecalis with 40.2% sequence identity, resulting in a 0.29 Å RMSD value with the template structure. The third model is built from the sequence of Bartonella henselae with 38.7% identity and an RMSD value of 0.31 Å, and the last model is built from the sequence of Mycoplasma genitalium with 33% identity and an RMSD value of 1.73 Å. These structures are all obtained with MODELLER and are summarized in Table 5.
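The sequence identities quoted above are derived from pairwise alignments. A minimal Python sketch of such a calculation is given below. The function name and the toy sequences are hypothetical, and the choice of denominator (aligned positions with a residue in both sequences) is an assumption; alignment programs differ on this point, so the values need not match those reported by MODELLER.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over positions where both aligned sequences have a residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [
        (a, b)
        for a, b in zip(aligned_a, aligned_b)
        if a != gap and b != gap
    ]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


# Toy pairwise alignment (hypothetical sequences).
template = "MKV-LA"
query = "MRVQLA"
print(percent_identity(template, query))
```

Here four of the five residue pairs are identical, and the gapped column is ignored.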

Table 5 gives the THEMATICS result for the active site cluster for each template

structure and the orthologous model structures. Known active site residues are shown in

boldface and “second shell” residues (those immediately adjacent to known active site

residues but not considered to be in the active site) are underlined. For the TIM structure

from chicken (1tph), four neighboring residues with anomalous titration behavior are

identified as the active site cluster. Two of these residues, H95 and E165, are well

established by experiment as catalytically active residues [116, 122, 123].

Two other residues, C126 and Y164, are located in the active site cleft but any possible

catalytic role for these residues has not been investigated experimentally. Upon alignment

of the sequences and superposition of the structures, it is confirmed that all four of these

residues are conserved, both in the sequence and in the spatial arrangement of the active

site cleft, in all of the four model structures. Sequence alignment across a wider range of

species again reveals high conservation of all four of these residues. THEMATICS thus locates the known active site residues on the models. The two other residues, C126 and Y164, may serve as a guide for further study of whether they also belong to the active site or play a supporting role.

6-Hydroxymethyl-7,8-dihydropterin pyrophosphokinase (HPPK) orthologs

6-Hydroxymethyl-7,8-dihydropterin pyrophosphokinase (HPPK) is a monomeric pyrophosphate transferase. Its crystal structure from E. coli (PDB ID: 1hka), at 1.5 Å resolution, was obtained from the Protein Data Bank. Four models homologous to the E. coli structure 1hka are built using the MODELLER software from the sequences for the following organisms: Vibrio vulnificus (with 63% sequence identity with E. coli and 0.34 Å RMSD), Vibrio parahaemolyticus (with 57% sequence identity and 0.22 Å RMSD), Pseudomonas aeruginosa (with 51% sequence identity and 0.36 Å RMSD) and Pseudomonas putida (with 48% sequence identity and 0.50 Å RMSD). All of them are

conserved across the four species for which the model structures are built and are also

generally well conserved across bacterial kinases. When the four model structures are

superimposed onto the template E. coli structure, the positions of these residues are

conserved in the active site pocket with similar orientations. For the HPPK case,

THEMATICS identifies the same cluster for all four of the model structures as for the E.

coli template structure (see Table 5).
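The RMSD values reported for the models measure how far matched atoms lie from their counterparts in the template. The Python sketch below shows the basic formula. It assumes that the two coordinate lists are already superimposed and matched atom by atom; it performs no fitting, and the function name and coordinates are hypothetical rather than taken from the software used in this work.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two matched lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Two hypothetical atoms, each displaced by 1 angstrom along z in the model.
template_atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
model_atoms = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(template_atoms, model_atoms))
```

Because the squared deviations are averaged before the square root is taken, a few badly placed atoms can dominate the value, which is one reason a low-identity model may show a much larger RMSD than the others.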

Aspartate aminotransferase (AspAT) orthologs

The structure of the pyridoxamine 5'-phosphate dependent enzyme aspartate aminotransferase from E. coli at 2.2 Å resolution (PDB ID: 1amr) is used as the template. Its fold is a unique aminotransferase fold. AspAT is active as a homodimer and the calculations are performed on the dimer structure. Using the MODELLER software, four model structures homologous to the AspAT template from E. coli are constructed from the sequences for the following organisms: Vibrio cholerae (with 62% pairwise identity and 1.52 Å RMSD), Oryza sativa (with 44% identity and 0.64 Å RMSD), Neisseria meningitidis (41% identity and 1.28 Å RMSD), and Clostridium perfringens (22% identity and 3.67 Å RMSD). For all four models and for the template, THEMATICS

finds the active site cluster, although the list of identified residues is a little different for

each species for the AspAT case (see Table 5).

Additional examples of comparative protein models for application of THEMATICS

Tables 6-7 give THEMATICS results for additional sets of enzymes with a significant variety of different folds and chemical functions. The comparative structures range from

93% to 22% sequence identity with templates. Table 6 gives the active site cluster

predicted by THEMATICS for eight sets of orthologous proteins. Results are given for

the eight templates and a total of 31 comparative model structures.

In these examples (in Table 6), the homologues are presumed to have the same function

as the template. Table 7 gives the active site cluster predicted by THEMATICS for nine

more sets of proteins, including nine templates and 36 comparative models. In the

examples given in Table 7, there may be variation in function among some of the

members of the homologous sets.

In most cases, active sites in the comparative model structures that are similar to those of

the corresponding templates are located by THEMATICS. There are also some examples

with low sequence identity where the predicted active site cluster is similar to the

template. However, in a couple of cases involving distant homologues, the predicted

active site is quite different from that of the template.


Discussion and conclusions

THEMATICS successfully locates the active sites for the comparative models. In some

cases, there is some minor variation in the list of important residues, but the catalytically

active residues almost always seem to be properly identified. In a couple of cases involving remote homologues, the predicted active site residues are quite different from

those of the template. This may happen because the function of the remote homologue is

different, or because the quality of the comparative model may not be adequate for

THEMATICS to predict the active site residues. On the other hand, for almost all of the cases studied, the comparative model structures are good enough to yield an accurate active site prediction from THEMATICS.


Chapter 5. Studies on multiple sequence alignment and comparative modeling.

Multiple sequence analyses of APEX1 and pol-β

Abasic (apurinic/apyrimidinic, AP) sites are one of the most common lesions in DNA. It

has been estimated that approximately 10,000 AP sites are formed in each mammalian

cell per day under normal physiological conditions [124-127]. They can occur in DNA as a result of spontaneous hydrolysis of the N-glycosylic bond or the removal of altered bases by DNA glycosylases [124]. They are potentially mutagenic and lethal lesions that can prevent normal DNA replication and transcription [127]. The cell has systems to recognize and repair such sites; the base excision repair (BER) pathway is specifically responsible for the repair of alkylation and oxidative DNA damage.

Apurinic/apyrimidinic endonuclease 1 (APEX1) cleaves the phosphodiester backbone 5’ to the AP site [2–4]. The cleavage, which is a key step in the BER pathway, is followed by nucleotide insertion and removal of the downstream deoxyribose moiety, performed most often by DNA polymerase beta (pol-β) [5]. The fact that nucleotide insertion

requires cleavage of the AP site suggests interaction of the two enzymes. While several

biochemical studies indicate interaction between the two proteins, the details of the

interaction remain unknown.

In my Ph.D. research, one of the collaborative projects that I was involved in focused on

predicting the most likely protein-protein interface between APEX1 and pol-β by

applying a new methodology [126]. This methodology relies on the assumption, which is validated by experimental evidence, that both proteins must bind to DNA in order to interact. Analysis of the simulated protein behavior in water suggests how protein

interaction might be coupled to conformational changes in DNA polymerase beta [126].

Moreover, multiple sequence alignment of APEX1 and pol-β in related organisms

identified a set of correlated mutations of specific residues at the predicted interfaces

[126]. Methods and results are presented below.

Multiple Sequence Analysis of APEX1 and pol-β Proteins

BLAST searches against the non-redundant protein sequence database were performed using human APEX1 (accession number NP_001632) and pol-β (accession number NP_002681) as query sequences. Fourteen eukaryotic organisms were identified in which both proteins were available. The selected sequences were aligned using the CLUSTALW program.
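The selection of organisms in which both proteins are available can be sketched as a simple intersection of the two BLAST hit lists. This is a minimal illustration and not the procedure actually used: the function name, the (organism, accession) hit format, and every entry other than the two human query accessions are hypothetical placeholders.

```python
from typing import Dict, Iterable, Tuple

Hit = Tuple[str, str]  # (organism name, protein accession)


def organisms_with_both(
    apex1_hits: Iterable[Hit], polb_hits: Iterable[Hit]
) -> Dict[str, Tuple[str, str]]:
    """Map each organism found in both hit lists to its (APEX1, pol-beta) accessions."""
    apex1: Dict[str, str] = {}
    for organism, accession in apex1_hits:
        apex1.setdefault(organism, accession)  # keep the first hit listed per organism
    polb: Dict[str, str] = {}
    for organism, accession in polb_hits:
        polb.setdefault(organism, accession)
    shared = sorted(set(apex1) & set(polb))
    return {organism: (apex1[organism], polb[organism]) for organism in shared}


# Placeholder hit lists; only the human accessions come from the text.
apex1_hits = [("Homo sapiens", "NP_001632"), ("Organism A", "APEX1_A"), ("Organism B", "APEX1_B")]
polb_hits = [("Homo sapiens", "NP_002681"), ("Organism B", "POLB_B"), ("Organism C", "POLB_C")]

for organism, pair in organisms_with_both(apex1_hits, polb_hits).items():
    print(organism, pair)
```

Only the organisms returned by such a filter would contribute one sequence to each of the two alignments, which keeps the rows of the APEX1 and pol-β alignments in one-to-one correspondence.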

Results for correlated mutations of the interface residues

If APEX1 and pol-β evolved to form a molecular complex so that the specificity of their

interaction optimized the function of the BER pathway, then it may be expected that the

network of inter-residue contacts constrains the protein sequence. It may suggest that the

changes accumulated in the evolution of one of the interacting proteins would be

compensated by changes in the other one [16]. Consistent with this expectation, correlated mutations in the predicted interface regions of APEX1 and pol-β were observed across a variety of species. Multiple sequence analyses of the two proteins show correlated mutations at the interface of the two proteins in the 3’-complex (Figure 21). In particular, Arg221 of APEX1 and Gln31 of pol-β, which interacted in the 3’-complex with pol-β in the closed conformation, were changed in five organisms to Lys and Arg, respectively. In four of these organisms there was also correlated variation of Ser275 in APEX1 and Ser109 in pol-β, but these residues did not interact in the predicted complex. In addition, in S. purpuratus there was one more coordinated change in interacting residues: Gly225 of APEX1 was mutated to Ser and Ile33 of pol-β was mutated to Met.
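The idea of a correlated mutation can be made concrete with a small sketch that compares one alignment column from each protein. It assumes both alignments list the same organisms in the same order, with the reference (for example, human) sequence first; the function names and the six-organism toy columns are hypothetical and do not reproduce the alignment in Figure 21.

```python
from typing import List


def co_varying_organisms(col_a: str, col_b: str, ref: int = 0) -> List[int]:
    """Indices of organisms whose residues differ from the reference in BOTH columns."""
    if len(col_a) != len(col_b):
        raise ValueError("columns must cover the same organisms")
    return [
        i
        for i, (a, b) in enumerate(zip(col_a, col_b))
        if i != ref and a != col_a[ref] and b != col_b[ref]
    ]


def is_strictly_correlated(col_a: str, col_b: str, ref: int = 0) -> bool:
    """True when every organism changes either both positions or neither."""
    if len(col_a) != len(col_b):
        raise ValueError("columns must cover the same organisms")
    return all((a != col_a[ref]) == (b != col_b[ref]) for a, b in zip(col_a, col_b))


# Toy columns: one residue per organism, reference organism first (illustrative only).
apex1_column = "RRKKRK"  # e.g. an Arg position that becomes Lys in some organisms
polb_column = "QQRRQR"   # e.g. a Gln position that becomes Arg in the same organisms

print(co_varying_organisms(apex1_column, polb_column))    # [2, 3, 5]
print(is_strictly_correlated(apex1_column, polb_column))  # True
```

Applied to every pair of predicted interface positions, a check of this kind would flag residue pairs that change together, which is the pattern reported above for Arg221/Gln31.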

Comparative protein structure modeling of E. focardii γ-tubulins

Description of γ-tubulins

In a study characterizing γ-tubulin from the psychrophilic Antarctic ciliate Euplotes focardii, comparative models of γ-tubulins were built. γ-Tubulin is a low-abundance protein that localizes to the pericentriolar material. It is important in the nucleation and polar orientation of microtubules [128]. Microtubule assembly is nucleated by organizing centers, which include centrioles, basal bodies, and other structures [128, 129]. Both centrioles and basal bodies require γ-tubulin for their assembly and maintenance. Microtubule assembly is entropically driven, predominantly via hydrophobic interactions, and therefore environmental temperature plays an important role both in vitro and in vivo [129-131].

Comparative protein structure modeling of E. focardii γ-tubulins

In the presented study, sequences of γ-T1 and γ-T2 of E. focardii were modeled [129].

Comparative homology models of the two E. focardii γ-tubulins were obtained using MODELLER (version 9v1) and the Friend interface. The 3.0 Å structure of human γ-tubulin containing bound GTP (Protein Data Bank entry 1z5w) was used as the template for comparative modeling. Structural alignments between the template and the modeled sequences were performed with TOPOFIT, and the models were analyzed with the Friend software. The percentage similarities between the modeled and template sequences were 68.36% for γ-T1 and 69.28% for γ-T2, and the length of the alignment was 433 residues for both models. Based on these values, it is estimated that the accuracies of the modeled structures of γ-T1 and γ-T2 approach 3 Å [129, 132].
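Percentages such as those quoted above are computed over the aligned (ungapped) columns of a pairwise alignment. The sketch below shows one simple way to do this. The text does not state which similarity scheme was used, so the residue grouping here is an assumption made purely for illustration, as are the function name and the toy sequences.

```python
from typing import Tuple

# Illustrative grouping of physico-chemically similar residues (an assumption,
# not the scheme used to obtain the values reported in the text).
SIMILAR_GROUPS = ("ILVM", "FYW", "KRH", "DE", "ST", "NQ", "AG")


def alignment_stats(seq_a: str, seq_b: str, gap: str = "-") -> Tuple[int, float, float]:
    """Return (aligned length, % identity, % similarity) over ungapped columns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    aligned = identical = similar = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # skip columns with a gap in either sequence
        aligned += 1
        if a == b:
            identical += 1
            similar += 1  # identical residues also count as similar
        elif any(a in group and b in group for group in SIMILAR_GROUPS):
            similar += 1
    if aligned == 0:
        raise ValueError("no aligned columns")
    return aligned, 100.0 * identical / aligned, 100.0 * similar / aligned


# Toy alignment (hypothetical sequences, not γ-tubulin).
length, identity, similarity = alignment_stats("MKT-LVSE", "MRTALISD")
print(length, round(identity, 2), round(similarity, 2))  # 7 57.14 100.0
```

Whether gapped columns are excluded from the denominator, as done here, is a convention that varies between tools, so values from different programs are not always directly comparable.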

In the comparative model structures, it has been shown that the substitutions of proline for threonine at position 297 of γ-T2 and for serine at position 303 of γ-T1 do not significantly alter the conformation of the H9-S8 loop, although they are likely to restrict its mobility.

However, the Pro303 substitution in γ-T1 eliminates the bent hydrogen bond that forms between Ser303 and Asn205 in γ-T2 (see Figure 22A and B).

Conclusion

The three-dimensional structure of a protein provides important information for understanding and answering many biological questions. Combining this information with other available sources offers new extensions and perspectives. The development of the StSNP web server is a good example of uniting the available sources for scientific exploration. Since variations in DNA sequence may have a major impact on how humans respond to disease, environmental factors, and drugs [54], it is also important to visualize and analyze nsSNPs on protein structures. As described previously, however, structures are not currently available for every protein sequence in the databases. Comparative structure modeling provides models for the sequences with unknown structures. Mapping nsSNPs onto these models not only provides deeper information about substitutions and their possible structural effects, but also gives insight into where nsSNPs may lie in the three-dimensional structure of proteins whose structures are unknown. Combining this information with pathway data makes it possible to propose interactions among proteins and pathways that may be involved in disease, and thus to understand the underlying structural and physiological mechanisms. Thus, the first steps have been taken in the development of a resource for mapping nsSNPs onto protein structures, providing structural insight into the effects of nsSNPs on proteins, such as stability, functionality, protein–protein interactions, and other structurally related issues. As a web server in a rapidly evolving area of research, StSNP is designed to evolve with other related resources; future directions include a more detailed analysis of the SNP, predictions of the functional and biological implications of the SNP(s), and the use of image map technology from the KEGG API for more interactive data retrieval. StSNP creates the basis for further studies involving the metabolic pathways and the disease(s) associated with a particular SNP.

The comparative functional genomics analysis of HMGN3a and SMARCAL1 across mammals includes both comparative protein structure modeling and multiple sequence alignment analysis. The results suggest high levels of structural conservation of these proteins and highlight the importance of chromatin remodeling in the regulation of gene expression, particularly during early mammalian embryonic development. The greater similarity of the human and bovine HMGN3a and SMARCAL1 proteins may suggest the cow as a valuable model to study chromatin remodeling at the onset of mammalian development. Understanding the roles of chromatin remodeling proteins during embryonic development emphasizes the importance of epigenetics and could shed light on the underlying mechanisms of early mammalian development.

Comparative protein structure modeling was also used in one of the research projects to locate active site residues with the THEMATICS method. The importance of the study was to show that THEMATICS can predict active site locations in comparative model structures. The protein sequences and templates needed for comparative modeling were available, and the results showed that THEMATICS performed successfully. THEMATICS was applied to more than 40 comparative models from several different species. This study also showed the importance of using accurate models, since model accuracy affects end results such as locating the active sites accurately.

APEX1 and pol-β are involved in the DNA repair mechanism. In this study, the interacting interface residues of these proteins were determined with the aid of molecular dynamics applications. In further steps of the study, multiple sequence alignments were performed across species, which identified coordinated mutations of specific residues at the predicted interfaces.

The research projects that I was involved in during my work on this dissertation were published as six articles in peer-reviewed journals, and the data and related information presented in this dissertation were obtained from these publications [70, 88, 116, 126, 129, 133].

Tables

Table 1. Representing query and modeling options for resources. Table is reproduced from Uzun et al. [70].

57

58

Table 2. Summary of resources. Table shows (pages 59 and 60) the differences and the similarities of the resources for their search options and background information (number of nsSNP information for each database is from 2007, current update of StSNP is presented in the introduction part). Table is reproduced from Uzun et al. [70].

59

60

61

Table 3. Organisms and protein accession numbers used in MSA of SMARCAL1. Table is reproduced from Uzun et al. [88].

Table 4. Organisms and protein accession numbers used in multiple sequence alignment of HMGN3a. Table is reproduced from Uzun et al. [88].

62

Table 5. Predicted clusters for orthologous structures for three templates. For each model structure, % pairwise identity with the template and the RMSD value in Å are given. THEMATICS results for the active site cluster are given with known active site residues shown in boldface and second shell residues underlined. Sequence numbers for the models are adjusted to match those of the template structures. Table is reproduced from Shehadi et al. [116].

63

Table 6. Predicted active site clusters for additional orthologous sets for the templates. Known active site residues are shown in bold. Second shell residues are underlined. For the models, residues aligned with a known active site residue in the template are shown in bold; those aligned with a second shell template residue are underlined. Table is reproduced from Shehadi et al. [116].

64

65

Table 7. Predicted active site clusters for nine homologous sets (pages 66 and 67). For the templates, known active site residues are shown in bold. Second shell residues are underlined. For the models, residues aligned with a known active site residue in the template are shown in bold; those aligned with a second shell template residue are underlined. Table is reproduced from Shehadi et al. [116].

66

67

68

Figures

Figure 1. Growth of released protein structures per year. Graph displays the number of searchable structures per year in PDB since 1976. Red bars represent the number of structures accumulated per year. The number of structures available prior to 1985 was less than 195.

69

70

Figure 2. Schematic of StSNP web server. StSNP is an interactive web server, which utilizes several heterogeneous data sources.

71

72

Figure 3. Data generation in StSNP. (A) Main query page, (B) Formatted data for nsSNPs along with graphical alignment representation, (C) nsSNP(s) selection for modeling, (D) Output page, and (E) Visualization in the Friend applet.

73

74

Figure 4. nsSNPs and Glutathione S Transferase. (A) Glutathione S Transferase is shown with nsSNP locations displayed in ball and stick representation, with I105V marked with a black circle. The reference residues are shown in blue, nonsynonymous residues in red and the substrate glutathione is displayed in space fill representation (yellow). The query for the example was Protein ID NP_000843 and template PDB ID 1aqv chain B. (B) The Results section also provides a user with a link to glutathione metabolism in order to view other members found in the pathway. Figure 4A is reproduced from Uzun et al. [70].

75

76

77

Figure 5. nsSNPs and Aldehyde dehydrogenase-2. Aldehyde dehydrogenase-2 is shown with nsSNP locations displayed in ball and stick representation, with E504K marked with a black circle. The reference residues are shown in blue, nonsynonymous residues in red and the substrate NAD is displayed in space fill representation (green). The query for the example was Protein ID NP_000681 and template PDB ID 1ag8 chain A. Backbones in the figure for the models with nsSNPs and with reference residues are shown in different colors (purple and green). Figure is reproduced from Uzun et al. [70].

78

79

Figure 6. Phylogenetic tree for HMGN3a. Phylogenetic tree of evolutionary relationships of HMGN3a in 7 mammalian taxa using the Maximum Parsimony method. The percentage of replicate trees in which the associated taxa clustered together in the bootstrap test (500 replicates) is shown next to the branches. All positions containing gaps and missing data were eliminated from the dataset. Figure is reproduced from Uzun et al. [88].

80

81

Figure 7. MSA of HMGN3a. Highlighted regions show substitutions in at least 1 of the 7 species. The alignment includes both the official bovine HMGN3a gene model GLEAN_08006, and the bovine NCBI HMGN3a protein (NP_001029676.1). The insertion of alanine in the fifth exon of the bovine protein is marked in red. Figure is reproduced from Uzun et al. [88].

82

83

Figure 8. Four distinctive domains of SMARCAL1. The starting and ending residue numbers are 245–299, 342–396, 437–727, and 741–818. Figure is reproduced from Uzun et al. [88].

84

85

Figure 9. Phylogenetic tree for SMARCAL1. Phylogenetic tree of evolutionary relationships of the complete SMARCAL1 protein in 9 mammalian taxa, using the Maximum Parsimony method. The bootstrap consensus tree inferred from 500 replicates is taken to represent the evolutionary history of the taxa analyzed. All positions containing gaps and missing data were eliminated from the dataset. The percentage of replicate trees in which the associated taxa clustered together in the bootstrap test (500 replicates) is shown next to the branches. Figure is reproduced from Uzun et al. [88].

86

87

Figure 10. First HARP domain of SMARCAL1. SMARCAL1 first HARP domain phylogenetic tree with the highest parsimony (length = 46). The consistency index is 0.95, the retention index is 0.94, and the composite index is 0.90 for all sites and parsimony-informative sites. There were a total of 57 positions in the final dataset, of which 20 were parsimony informative. The percentage of replicate trees in which the associated taxa clustered together in the bootstrap test (500 replicates) is shown next to the branches. Figure is reproduced from Uzun et al. [88].

88

89

Figure 11. MSA of first HARP domain in SMARCAL1. Substitutions in at least one species are highlighted. Numbers on top of the alignment show significant substitutions which are mentioned in the text. Figure is reproduced from Uzun et al. [88].

90

91

Figure 12. Second HARP domain of SMARCAL1. SMARCAL1 second HARP domain phylogenetic tree with the highest parsimony (length = 53). The consistency index is 0.84, the retention index is 0.85, and the composite index is 0.77 for all sites. There were a total of 62 positions in the final dataset, of which 17 were parsimony informative. The percentage of replicate trees in which the associated taxa clustered together in the bootstrap test (500 replicates) is shown next to the branches. Figure is reproduced from Uzun et al. [88].

92

93

Figure 13. MSA of second HARP domain in SMARCAL1. Highlighted regions show where the substitutions occur. Figure is reproduced from Uzun et al. [88].

94

95

Figure 14. SNF2N domain of SMARCAL1. SMARCAL1 SNF2N domain phylogenetic tree with the highest parsimony (length = 145). The consistency index is 0.81, the retention index is 0.79, and the composite index is 0.70 for all sites. There were a total of 290 positions in the final dataset, out of which 39 were parsimony informative. The percentage of replicate trees in which the associated taxa clustered together in the bootstrap test (500 replicates) is shown next to the branches. Figure is reproduced from Uzun et al. [88].

96

97

Figure 15. MSA of SNF2N domain in SMARCAL1. Multiple sequence alignment of the SNF2N domain in SMARCAL1. Highlighted regions show where the substitutions occur. The multiple substitutions marked in red in the Bos taurus sequence may be due to sequencing errors since the corrected model for this protein (GLEAN_20241, in blue) added to this portion of the alignment only differs from the human sequence in 2 amino acids. Figure is reproduced from Uzun et al. [88].

98

99

Figure 16. Phylogenetic tree of helicase C-terminal domain in SMARCAL1. SMARCAL1 helicase C-terminal domain phylogenetic tree with the highest parsimony (length = 52). The consistency index is 0.86, the retention index is 0.81, and the composite index is 0.77 for all sites. There were a total of 78 positions in the final dataset, out of which 11 were parsimony informative. The percentage of replicate trees in which the associated taxa clustered together in the bootstrap test (500 replicates) is shown next to the branches.

100

101

Figure 17. MSA of the helicase C-terminal domain in SMARCAL1. Highlighted regions show where the substitutions occur. Figure is reproduced from Uzun et al. [88].

102

103

Figure 18. Disparity Index test of HMGN3a. Probability of rejecting the null hypothesis that HMGN3a sequences have evolved with the same pattern of substitution, as judged from the extent of differences in base composition biases between sequences (Disparity Index test). A Monte Carlo test (1000 replicates) was used to estimate the P-values, which are shown below the diagonal. P-values smaller than 0.05 are considered significant. The estimates of the disparity index per site are shown for each sequence pair above the diagonal. There were a total of 95 positions in the final dataset. None of the P-values were smaller than 0.05. All positions containing gaps and missing data were eliminated from the dataset. Figure is reproduced from Uzun et al. [88]. Black colored numbers: Probability computed (must be < 0.05 for hypothesis rejection at 5% level), Blue colored numbers: Disparity Index.

104

105

Figure 19. Disparity Index test of SMARCAL1. Probability of rejecting the null hypothesis that the sequences of the SMARCAL1 conserved domains have evolved with the same pattern of substitution, as judged from the extent of differences in base composition biases between sequences (Disparity Index test). A Monte Carlo test (1000 replicates) was used to estimate the P-values, which are shown below the diagonal. P-values smaller than 0.05 are considered significant. The estimates of the disparity index per site are shown for each sequence pair above the diagonal. All positions containing gaps and missing data were eliminated from the dataset. A. First HARP domain: there were a total of 57 positions in the final dataset. B. Second HARP domain: there were a total of 62 positions in the final dataset. C. SNF2N domain: there were a total of 290 positions in the final dataset. D. Helicase C-terminal domain: there were a total of 78 positions in the final dataset. Figure is reproduced from Uzun et al. [88]. Black colored numbers: Probability computed (must be < 0.05 for hypothesis rejection at 5% level [yellow background]), Blue colored numbers: Disparity Index.

106

107

Figure 20. MSA of helicase-like and helicase domains with modeled protein structure. Based on the domains in the multiple sequence alignment, residues are colored in green and blue. Green represents the helicase-like domain and blue represents the helicase domain. ATP binding residues are shown in red and represented as balls and sticks. Nucleotide binding residues are colored in magenta. The arrows show the residue located in the non-conserved region, both on the structure and in the MSA, where it is indicated in yellow. Figure is reproduced from Uzun et al. [88].

108

109

Figure 21. MSA of APEX1 and pol-β. Only the alignment for fragments of interacting regions in the 3’ complexes (with open and closed conformation of pol-β) is shown. Residues at the interfaces are in bold; neighboring residues are in normal font. Interacting residues include residues from segments #1, #2, and #3 and adjacent residues, found at the interface only in the complex with the open conformation of pol-β. Adjacent residues are termed (where possible) AR. Correlated mutations of interacting residues are highlighted in cyan and orange. Other variations in interacting residues are highlighted in red. Figure is reproduced from Abyzov et al. [126].

110

111

Figure 22. A view of the model structures of γ-T1 and γ-T2 at positions 303 and 205. (A) γ-T1 contains a proline at position 303, in contrast to the serine of γ-T2 (B). The proline substitution of γ-T1 eliminates the hydrogen bond between Ser303 and Asn205 in γ-T2. Figure is reproduced from Marziale et al. [129].

112

113

References
1. Berman, H.M., et al., The Protein Data Bank. Nucleic Acids Res, 2000. 28(1): p. 235-42.
2. Baker, D. and A. Sali, Protein structure prediction and structural genomics. Science, 2001. 294(5540): p. 93-6.
3. Blundell, T.L., et al., Knowledge-based prediction of protein structures and the design of novel molecules. Nature, 1987. 326(6111): p. 347-52.
4. Marti-Renom, M.A., et al., Comparative protein structure modeling of genes and genomes. Annu Rev Biophys Biomol Struct, 2000. 29: p. 291-325.
5. Jacobson, M. and A. Sali, Comparative Protein Structure Modeling and its Applications to Drug Discovery. Annual Reports in Medicinal Chemistry, 2004. 39: p. 259-276.
6. Sali, A., Modeling mutations and homologous proteins. Curr Opin Biotechnol, 1995. 6(4): p. 437-51.
7. Sali, A., 100,000 protein structures for the biologist. Nat Struct Biol, 1998. 5(12): p. 1029-32.
8. Rost, B., Twilight zone of protein sequence alignments. Protein Eng, 1999. 12(2): p. 85-94.
9. Bordoli, L., et al., Protein structure homology modeling using SWISS-MODEL workspace. Nat Protoc, 2009. 4(1): p. 1-13.
10. Altschul, S.F., et al., Basic local alignment search tool. J Mol Biol, 1990. 215(3): p. 403-10.
11. Pearson, W.R., Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol, 1990. 183: p. 63-98.
12. Altschul, S.F., et al., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, 1997. 25(17): p. 3389-402.
13. Gough, J., et al., Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. J Mol Biol, 2001. 313(4): p. 903-19.
14. Jones, D.T., GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. J Mol Biol, 1999. 287(4): p. 797-815.
15. Lindahl, E. and A. Elofsson, Identification of related proteins on family, superfamily and fold level. J Mol Biol, 2000. 295(3): p. 613-25.
16. John, B. and A. Sali, Detection of homologous proteins by an intermediate sequence search. Protein Sci, 2004. 13(1): p. 54-62.
17. Chothia, C. and A.M. Lesk, The relation between the divergence of sequence and structure in proteins. Embo J, 1986. 5(4): p. 823-6.
18. Canutescu, A.A., A.A. Shelenkov, and R.L. Dunbrack, Jr., A graph-theory algorithm for rapid protein side-chain prediction. Protein Sci, 2003. 12(9): p. 2001-14.
19. Rohl, C.A., et al., Modeling structurally variable regions in homologous proteins with rosetta. Proteins, 2004. 55(3): p. 656-77.
20. Soto, C.S., et al., Loop modeling: Sampling, filtering, and scoring. Proteins, 2008. 70(3): p. 834-43.
21. Bystroff, C. and D. Baker, Prediction of local structure in proteins using a library of sequence-structure motifs. J Mol Biol, 1998. 281(3): p. 565-77.
22. Unger, R., et al., A 3D building blocks approach to analyzing and predicting structure of proteins. Proteins, 1989. 5(4): p. 355-73.
23. Chinea, G., et al., The use of position-specific rotamers in model building by homology. Proteins, 1995. 23(3): p. 415-21.
24. Chothia, C. and A.M. Lesk, Canonical structures for the hypervariable regions of immunoglobulins. J Mol Biol, 1987. 196(4): p. 901-17.
25. Jones, T.A. and S. Thirup, Using known substructures in protein model building and crystallography. Embo J, 1986. 5(4): p. 819-22.
26. Bruccoleri, R.E. and M. Karplus, Prediction of the folding of short polypeptide segments by uniform conformational sampling. Biopolymers, 1987. 26(1): p. 137-68.
27. Shenkin, P.S., et al., Predicting antibody hypervariable loop conformation. I. Ensembles of random conformations for ringlike structures. Biopolymers, 1987. 26(12): p. 2053-85.
28. Deane, C.M. and T.L. Blundell, CODA: a combined algorithm for predicting the structurally variable regions of protein models. Protein Sci, 2001. 10(3): p. 599-612.
29. van Vlijmen, H.W. and M. Karplus, PDB-based protein loop prediction: parameters for selection and methods for optimization. J Mol Biol, 1997. 267(4): p. 975-1001.
30. Janin, J. and C. Chothia, Role of hydrophobicity in the binding of coenzymes. Appendix. Translational and rotational contribution to the free energy of dissociation. Biochemistry, 1978. 17(15): p. 2943-8.
31. Mendes, J., et al., Improved modeling of side-chains in proteins with rotamer-based methods: a flexible rotamer model. Proteins, 1999. 37(4): p. 530-43.
32. Tuffery, P., et al., A new approach to the rapid determination of protein side chain conformations. J Biomol Struct Dyn, 1991. 8(6): p. 1267-89.
33. Xiang, Z. and B. Honig, Extending the accuracy limits of prediction for side-chain conformations. J Mol Biol, 2001. 311(2): p. 421-30.
34. Desjarlais, J.R. and T.M. Handel, Side-chain and backbone flexibility in protein core design. J Mol Biol, 1999. 290(1): p. 305-18.
35. Ginalski, K., Comparative modeling for protein structure prediction. Curr Opin Struct Biol, 2006. 16(2): p. 172-7.
36. Kryshtafovych, A., et al., Progress over the first decade of CASP experiments. Proteins, 2005. 61 Suppl 7: p. 225-36.
37. Tress, M., et al., Assessment of predictions submitted for the CASP6 comparative modeling category. Proteins, 2005. 61 Suppl 7: p. 27-45.
38. Cozzetto, D. and A. Tramontano, Relationship between multiple sequence alignments and quality of protein comparative models. Proteins, 2005. 58(1): p. 151-7.
39. Zhou, H. and Y. Zhou, Distance-scaled, finite ideal-gas reference state improves structure-derived potentials of mean force for structure selection and stability prediction. Protein Sci, 2002. 11(11): p. 2714-26.
40. Luthy, R., J.U. Bowie, and D. Eisenberg, Assessment of protein models with three-dimensional profiles. Nature, 1992. 356(6364): p. 83-5.
41. Melo, F. and E. Feytmans, Assessing protein structures with a non-local atomic interaction energy. J Mol Biol, 1998. 277(5): p. 1141-52.
42. Wallner, B. and A. Elofsson, Can correct protein models be identified? Protein Sci, 2003. 12(5): p. 1073-86.
43. Sippl, M.J., Boltzmann's principle, knowledge-based mean fields and protein folding. An approach to the computational determination of protein structures. J Comput Aided Mol Des, 1993. 7(4): p. 473-501.
44. Arnold, K., et al., The SWISS-MODEL workspace: a web-based environment for protein structure homology modelling. Bioinformatics, 2006. 22(2): p. 195-201.
45. Notredame, C., Recent evolutions of multiple sequence alignment algorithms. PLoS Comput Biol, 2007. 3(8): p. e123.
46. Mount, D., Bioinformatics: sequence and genome analysis. 2nd ed. 2004, New York: Cold Spring Harbor Laboratory Press.
47. Pei, J., Multiple protein sequence alignment. Curr Opin Struct Biol, 2008. 18(3): p. 382-6.
48. Wallace, I.M., G. Blackshields, and D.G. Higgins, Multiple sequence alignments. Curr Opin Struct Biol, 2005. 15(3): p. 261-6.
49. Higgins, D.G. and P.M. Sharp, CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene, 1988. 73(1): p. 237-44.
50. Higgins, D.G. and P.M. Sharp, Fast and sensitive multiple sequence alignments on a microcomputer. Comput Appl Biosci, 1989. 5(2): p. 151-3.
51. Thompson, J.D., D.G. Higgins, and T.J. Gibson, CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res, 1994. 22(22): p. 4673-80.
52. Larkin, M.A., et al., Clustal W and Clustal X version 2.0. Bioinformatics, 2007. 23(21): p. 2947-8.
53. Higgins, D.G., J.D. Thompson, and T.J. Gibson, Using CLUSTAL for multiple sequence alignments. Methods Enzymol, 1996. 266: p. 383-402.
54. The International HapMap Project. Nature, 2003. 426(6968): p. 789-96.
55. Sachidanandam, R., et al., A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature, 2001. 409(6822): p. 928-33.
56. Wang, Z. and J. Moult, SNPs, protein structure, and disease. Hum Mutat, 2001. 17(4): p. 263-70.
57. Stoneking, M., Single nucleotide polymorphisms. From the evolutionary past. Nature, 2001. 409(6822): p. 821-2.
58. Sherry, S.T., et al., dbSNP: the NCBI database of genetic variation. Nucleic Acids Res, 2001. 29(1): p. 308-11.
59. Chasman, D. and R.M. Adams, Predicting the functional consequences of non-synonymous single nucleotide polymorphisms: structure-based assessment of amino acid variation. J Mol Biol, 2001. 307(2): p. 683-706.
60. Fredman, D., et al., HGVbase: a human sequence variation database emphasizing data quality and a broad spectrum of data sources. Nucleic Acids Res, 2002. 30(1): p. 387-91.
61. Hirakawa, M., et al., JSNP: a database of common gene variations in the Japanese population. Nucleic Acids Res, 2002. 30(1): p. 158-62.
62. Sunyaev, S., V. Ramensky, and P. Bork, Towards a structural basis of human non-synonymous single nucleotide polymorphisms. Trends Genet, 2000. 16(5): p. 198-200.
63. Stitziel, N.O., et al., topoSNP: a topographic database of non-synonymous single nucleotide polymorphisms with and without known disease association. Nucleic Acids Res, 2004. 32 Database issue: p. D520-2.
64. Yip, Y.L., et al., The Swiss-Prot variant page and the ModSNP database: a resource for sequence and structure information on human protein variants. Hum Mutat, 2004. 23(5): p. 464-70.
65. Karchin, R., et al., LS-SNP: large-scale annotation of coding non-synonymous SNPs based on multiple information sources. Bioinformatics, 2005. 21(12): p. 2814-20.
66. Reumers, J., et al., SNPeffect: a database mapping molecular phenotypic effects of human non-synonymous coding SNPs. Nucleic Acids Res, 2005. 33(Database issue): p. D527-32.
67. Dantzer, J., et al., MutDB services: interactive structural analysis of mutation data. Nucleic Acids Res, 2005. 33(Web Server issue): p. W311-4.
68. Han, A., et al., SNP@Domain: a web resource of single nucleotide polymorphisms (SNPs) within protein domain structures and sequences. Nucleic Acids Res, 2006. 34(Web Server issue): p. W642-4.
69. Li, S., et al., Snap: an integrated SNP annotation platform. Nucleic Acids Res, 2007. 35(Database issue): p. D707-10.
70. Uzun, A., et al., Structure SNP (StSNP): a web server for mapping and modeling nsSNPs on protein structures with linkage to metabolic pathways. Nucleic Acids Res, 2007. 35(Web Server issue): p. W384-92.
71. Abyzov, A., et al., Friend, an integrated analytical front-end application for bioinformatics. Bioinformatics, 2005. 21(18): p. 3677-8.
72. Leslin, C.M., A. Abyzov, and V.A. Ilyin, Structural exon database, SEDB, mapping exon boundaries on multiple protein structures. Bioinformatics, 2004. 20(11): p. 1801-3.
73. Smith, T.F., and Waterman M.S., Comparison of biosequences. Adv. Appl. Math., 1981. 2: p. 482-489.
74. Henikoff, S. and J.G. Henikoff, Performance evaluation of amino acid substitution matrices. Proteins, 1993. 17(1): p. 49-61.
75. Kanehisa, M., A database for post-genome analysis. Trends Genet, 1997. 13(9): p. 375-6.
76. Kanehisa, M. and S. Goto, KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res, 2000. 28(1): p. 27-30.
77. Pruitt, K.D. and D.R. Maglott, RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res, 2001. 29(1): p. 137-40.
78. Ilyin, V.A., A. Abyzov, and C.M. Leslin, Structural alignment of proteins by a novel TOPOFIT method, as a superimposition of common volumes at a topomax point. Protein Sci, 2004. 13(7): p. 1865-74.
79. Leslin, C.M., A. Abyzov, and V.A. Ilyin, TOPOFIT-DB, a database of protein structural alignments based on the TOPOFIT method. Nucleic Acids Res, 2007. 35(Database issue): p. D317-21.
80. Hashimoto, T., et al., A functional glutathione S-transferase P1 gene polymorphism is associated with methamphetamine-induced psychosis in Japanese population. Am J Med Genet B Neuropsychiatr Genet, 2005. 135B(1): p. 5-9.
81. Marth, G.T., et al., The allele frequency spectrum in genome-wide human variation data reveals signals of differential demographic history in three large world populations. Genetics, 2004. 166(1): p. 351-72.
82. Goedde, H.W., et al., Population genetic studies on aldehyde dehydrogenase isozyme deficiency and alcohol sensitivity. Am J Hum Genet, 1983. 35(4): p. 769-72.
83. Li, Y., et al., Mitochondrial aldehyde dehydrogenase-2 (ALDH2) Glu504Lys polymorphism contributes to the variation in efficacy of sublingual nitroglycerin. J Clin Invest, 2006. 116(2): p. 506-11.
84. Memili, E., T. Dominko, and N.L.

72. Leslin, C.M., A. Abyzov, and V.A. Ilyin, Structural exon database, SEDB, mapping exon boundaries on multiple protein structures. Bioinformatics, 2004. 20(11): p. 1801-3.
73. Smith, T.F. and M.S. Waterman, Comparison of biosequences. Adv Appl Math, 1981. 2: p. 482-489.
74. Henikoff, S. and J.G. Henikoff, Performance evaluation of amino acid substitution matrices. Proteins, 1993. 17(1): p. 49-61.
75. Kanehisa, M., A database for post-genome analysis. Trends Genet, 1997. 13(9): p. 375-6.
76. Kanehisa, M. and S. Goto, KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res, 2000. 28(1): p. 27-30.
77. Pruitt, K.D. and D.R. Maglott, RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res, 2001. 29(1): p. 137-40.
78. Ilyin, V.A., A. Abyzov, and C.M. Leslin, Structural alignment of proteins by a novel TOPOFIT method, as a superimposition of common volumes at a topomax point. Protein Sci, 2004. 13(7): p. 1865-74.
79. Leslin, C.M., A. Abyzov, and V.A. Ilyin, TOPOFIT-DB, a database of protein structural alignments based on the TOPOFIT method. Nucleic Acids Res, 2007. 35(Database issue): p. D317-21.
80. Hashimoto, T., et al., A functional glutathione S-transferase P1 gene polymorphism is associated with methamphetamine-induced psychosis in Japanese population. Am J Med Genet B Neuropsychiatr Genet, 2005. 135B(1): p. 5-9.
81. Marth, G.T., et al., The allele frequency spectrum in genome-wide human variation data reveals signals of differential demographic history in three large world populations. Genetics, 2004. 166(1): p. 351-72.
82. Goedde, H.W., et al., Population genetic studies on aldehyde dehydrogenase isozyme deficiency and alcohol sensitivity. Am J Hum Genet, 1983. 35(4): p. 769-72.
83. Li, Y., et al., Mitochondrial aldehyde dehydrogenase-2 (ALDH2) Glu504Lys polymorphism contributes to the variation in efficacy of sublingual nitroglycerin. J Clin Invest, 2006. 116(2): p. 506-11.
84. Memili, E., T. Dominko, and N.L. First, Onset of transcription in bovine oocytes and preimplantation embryos. Mol Reprod Dev, 1998. 51(1): p. 36-41.
85. Memili, E. and N.L. First, Control of gene expression at the onset of bovine embryonic development. Biol Reprod, 1999. 61(5): p. 1198-207.
86. Misirlioglu, M., et al., Dynamics of global transcriptome in bovine matured oocytes and preimplantation embryos. Proc Natl Acad Sci U S A, 2006. 103(50): p. 18905-10.
87. Thompson, E.M., E. Legouy, and J.P. Renard, Mouse embryos do not wait for the MBT: chromatin and RNA polymerase remodeling in genome activation at the onset of development. Dev Genet, 1998. 22(1): p. 31-42.

88. Uzun, A., et al., Functional genomics of HMGN3a and SMARCAL1 in early mammalian embryogenesis. BMC Genomics, 2009. 10: p. 183.
89. Olave, I., et al., Identification of a polymorphic, neuron-specific chromatin remodeling complex. Genes Dev, 2002. 16(19): p. 2509-17.
90. Shirakawa, H., et al., Targeting of high mobility group-14/-17 proteins in chromatin is independent of DNA sequence. J Biol Chem, 2000. 275(48): p. 37937-44.
91. Bustin, M. and R. Reeves, High-mobility-group chromosomal proteins: architectural components that facilitate chromatin function. Prog Nucleic Acid Res Mol Biol, 1996. 54: p. 35-100.
92. West, K.L., et al., HMGN3a and HMGN3b, two protein isoforms with a tissue-specific expression pattern, expand the cellular repertoire of nucleosome-binding proteins. J Biol Chem, 2001. 276(28): p. 25959-69.
93. Mohamed, O.A., M. Bustin, and H.J. Clarke, High-mobility group proteins 14 and 17 maintain the timing of early embryonic development in the mouse. Dev Biol, 2001. 229(1): p. 237-49.
94. Crippa, M.P., J.M. Nickol, and M. Bustin, Developmental changes in the expression of high mobility group chromosomal proteins. J Biol Chem, 1991. 266(5): p. 2712-4.
95. Korner, U., et al., Developmental role of HMGN proteins in Xenopus laevis. Mech Dev, 2003. 120(10): p. 1177-92.
96. Banine, F., et al., SWI/SNF chromatin-remodeling factors induce changes in DNA methylation to promote transcriptional activation. Cancer Res, 2005. 65(9): p. 3542-7.
97. Fry, C.J. and C.L. Peterson, Chromatin remodeling enzymes: who's on first? Curr Biol, 2001. 11(5): p. R185-97.
98. Pollard, K.J. and C.L. Peterson, Chromatin remodeling: a marriage between two families? Bioessays, 1998. 20(9): p. 771-80.
99. Lusser, A. and J.T. Kadonaga, Chromatin remodeling by ATP-dependent molecular machines. Bioessays, 2003. 25(12): p. 1192-200.
100. Zheng, P., et al., Expression of genes encoding chromatin regulatory factors in developing rhesus monkey oocytes and preimplantation stage embryos: possible roles in genome activation. Biol Reprod, 2004. 70(5): p. 1419-27.
101. LeGouy, E., et al., Differential preimplantation regulation of two mouse homologues of the yeast SWI2 protein. Dev Dyn, 1998. 212(1): p. 38-48.
102. Magnani, L. and R.A. Cabot, Developmental arrest induced in cleavage stage porcine embryos following microinjection of mRNA encoding Brahma (Smarca 2), a chromatin remodeling protein. Mol Reprod Dev, 2007. 74(10): p. 1262-7.
103. Gebuhr, T.C., S.J. Bultman, and T. Magnuson, Pc-G/trx-G and the SWI/SNF connection: developmental gene regulation through chromatin remodeling. Genesis, 2000. 26(3): p. 189-97.

104. Reyes, J.C., et al., Altered control of cellular proliferation in the absence of mammalian brahma (SNF2alpha). Embo J, 1998. 17(23): p. 6979-91.
105. Coleman, M.A., J.A. Eisen, and H.W. Mohrenweiser, Cloning and characterization of HARP/SMARCAL1: a prokaryotic HepA-related SNF2 helicase protein from human and mouse. Genomics, 2000. 65(3): p. 274-82.
106. Dahiya, R., S. Cleveland, and C.A. Megerian, Spondyloepiphyseal dysplasia congenita associated with conductive hearing loss. Ear Nose Throat J, 2000. 79(3): p. 178-82.
107. Boerkoel, C.F., et al., Mutant chromatin remodeling protein SMARCAL1 causes Schimke immuno-osseous dysplasia. Nat Genet, 2002. 30(2): p. 215-20.
108. Spranger, J., A. Winterpacht, and B. Zabel, The type II collagenopathies: a spectrum of chondrodysplasias. Eur J Pediatr, 1994. 153(2): p. 56-65.
109. Sun, F., et al., Expression of SRG3, a chromatin-remodelling factor, in the mouse oocyte and early preimplantation embryos. Zygote, 2007. 15(2): p. 129-38.
110. Tamura, K., et al., MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Mol Biol Evol, 2007. 24(8): p. 1596-9.
111. Muthuswami, R., et al., A eukaryotic SWI2/SNF2 domain, an exquisite detector of double-stranded to single-stranded DNA transition elements. J Biol Chem, 2000. 275(11): p. 7648-55.
112. Theis, K., et al., Crystal structure of UvrB, a DNA helicase adapted for nucleotide excision repair. Embo J, 1999. 18(24): p. 6899-907.
113. Finn, R.D., et al., Pfam: clans, web tools and services. Nucleic Acids Res, 2006. 34(Database issue): p. D247-51.
114. Pettersen, E.F., et al., UCSF Chimera--a visualization system for exploratory research and analysis. J Comput Chem, 2004. 25(13): p. 1605-12.
115. Ondrechen, M.J., J.G. Clifton, and D. Ringe, THEMATICS: a simple computational predictor of enzyme function from structure. Proc Natl Acad Sci U S A, 2001. 98(22): p. 12473-8.
116. Shehadi, I.A., et al., Active site prediction for comparative model structures with THEMATICS. J Bioinform Comput Biol, 2005. 3(1): p. 127-43.
117. Shehadi, I.A., H. Yang, and M.J. Ondrechen, Future directions in protein function prediction. Mol Biol Rep, 2002. 29(4): p. 329-35.
118. Tong, W., et al., Enhanced performance in prediction of protein active sites with THEMATICS and support vector machines. Protein Sci, 2008. 17(2): p. 333-41.
119. Wei, Y., et al., Selective prediction of interaction sites in protein structures with THEMATICS. BMC Bioinformatics, 2007. 8: p. 119.
120. Westbrook, J., et al., The Protein Data Bank and structural genomics. Nucleic Acids Res, 2003. 31(1): p. 489-91.

121. Fiser, A., R.K. Do, and A. Sali, Modeling of loops in protein structures. Protein Sci, 2000. 9(9): p. 1753-73.
122. Lodi, P.J. and J.R. Knowles, Neutral imidazole is the electrophile in the reaction catalyzed by triosephosphate isomerase: structural origins and catalytic implications. Biochemistry, 1991. 30(28): p. 6948-56.
123. Zhang, Z., et al., Crystal structure of recombinant chicken triosephosphate isomerase phosphoglycolo-hydroxamate complex at 1.8-A resolution. Biochemistry, 1994. 33(10): p. 2830-7.
124. Podlutsky, A.J., et al., Human DNA polymerase beta initiates DNA synthesis during long-patch repair of reduced AP sites in DNA. Embo J, 2001. 20(6): p. 1477-82.
125. Lindahl, T. and B. Nyberg, Rate of depurination of native deoxyribonucleic acid. Biochemistry, 1972. 11(19): p. 3610-8.
126. Abyzov, A., et al., An AP endonuclease 1-DNA polymerase beta complex: theoretical prediction of interacting surfaces. PLoS Comput Biol, 2008. 4(4): p. e1000066.
127. Boiteux, S. and M. Guillet, Abasic sites in DNA: repair and biological consequences in Saccharomyces cerevisiae. DNA Repair (Amst), 2004. 3(1): p. 1-12.
128. Shu, H.B. and H.C. Joshi, Gamma-tubulin can both nucleate microtubule assembly and self-assemble into novel tubular structures in mammalian cells. J Cell Biol, 1995. 130(5): p. 1137-47.
129. Marziale, F., et al., Different roles of two gamma-tubulin isotypes in the cytoskeleton of the Antarctic ciliate Euplotes focardii: remodelling of interaction surfaces may enhance microtubule nucleation at low temperature. Febs J, 2008. 275(21): p. 5367-82.
130. Detrich, H.W., 3rd, et al., Brain and egg tubulins from Antarctic fishes are functionally and structurally distinct. J Biol Chem, 1992. 267(26): p. 18766-75.
131. Detrich, H.W., 3rd, K.A. Johnson, and S.P. Marchese-Ragona, Polymerization of Antarctic fish tubulins at low temperatures: energetic aspects. Biochemistry, 1989. 28(26): p. 10085-93.
132. Hoffman, D.C., et al., Macronuclear gene-sized molecules of hypotrichs. Nucleic Acids Res, 1995. 23(8): p. 1279-83.
133. Shehadi, I., et al., THEMATICS is Effective for Active Site Prediction in Comparative Model Structures. APBC, 2004. 1: p. 209-215.