MCCMB ’09
Proceedings of the International Moscow Conference on Computational Molecular Biology
July 20-23, 2009, Moscow, Russia
Organizers
Department of Bioengineering and Bioinformatics of M.V. Lomonosov Moscow State University
Biological Department of M.V. Lomonosov Moscow State University
State Scientific Centre GosNIIGenetika
Institute for Information Transmission Problems, RAS
The Scientific Council on Biophysics RAS,
Engelhardt Institute of Molecular Biology Russian Academy of Sciences
Sponsored by
Russian Fund of Basic Research
INRIA, France (the French National Institute for Research in Computer Science and Control)
4-TH MOSCOW CONFERENCE ON COMPUTATIONAL MOLECULAR BIOLOGY (MCCMB ’09), Moscow, Russia, July 20–23, 2009
NEW METHOD TO IMPROVE ERROR PROBABILITY ESTIMATION APPLIED TO ILLUMINA SEQUENCING
IRINA ABNIZOVA 1, TOM SKELLY 1, YUMI YAN 1, TONY COX 1
1 Wellcome Trust Sanger Institute, Hinxton, United Kingdom, [email protected]
The new short-read sequencing technique has introduced new technological and computational challenges. It requires reconsideration of well-known error estimation algorithms, taking into account the differences between sequencing platforms.
DETECTION OF GENES THAT UNDERWENT POSITIVE SELECTION IN DEEP-SEA ARCHAEBACTERIA OF PYROCOCCUS GENUS
K.V. GUNBIN 1, D.A. AFONNIKOV 2, N.A. KOLCHANOV 2
1 Institute of Cytology and Genetics SB RAS, [email protected]
2 Institute of Cytology and Genetics SB RAS, Novosibirsk State University, [email protected]; [email protected]
Pressure is an environmental parameter of crucial importance for organisms. Archaeal species of the Pyrococcus genus live under both normal (~0.1 MPa) and high (>10 MPa) pressures. To date, the genomes of three Pyrococcus species have been completely sequenced: P. furiosus lives under normal pressure, whereas P. horikoshii and P. abyssi are piezophilic (they live in the deep-sea environment under high pressure, at 14 MPa and 20 MPa, respectively). In this work we analyze the rate of nucleotide substitution in search of genes that underwent positive selection in the deep-sea species of the Pyrococcus genus. A phylogenetic analysis was performed to determine the evolutionary relatedness of the piezophilic species of the Pyrococcus genus, with T. kodakaraensis as the outgroup. The analysis of the phylogenetic tree demonstrates that the piezophilic species have a common origin and that their ancestor emerged from archaebacteria phylogenetically close to the extant species of the Pyrococcus genus that inhabit normal-pressure environments. Events of positive selection (PS) for adaptation to life under high pressure were searched for in a set of 508 homologous genes whose protein sequences are close homologs (amino acid sequence identity greater than 40%) and have no paralogs in the genomes. We reconstructed the genes and proteins of the most recent ancestor of the piezophilic species of the Pyrococcus genus and of the common ancestor of P. furiosus, P. horikoshii and P. abyssi. The reconstructed ancestral gene and protein sequences were compared with the extant sequences using the ratio of nonsynonymous to synonymous substitution rates, the ratio of radical to conservative amino acid replacement rates, and amino acid dissimilarity measures. We used the ArCOG functional classification of the analyzed genes and demonstrated that positive selection events occurred in genes and proteins of the ‘Coenzyme transport and metabolism’ and ‘Energy production and conversion’ functional groups (Table 1). The results suggest that genes of these functional classes may be important for the adaptation of piezophilic Pyrococcus species to the deep-sea environment.
Table 1. ArCOG group enrichment in the full set of analyzed genes and in genes with identified positive selection events. The last column gives the probability that the difference in the number of genes between the full and PS sets is observed by chance, according to a Monte Carlo shuffling test with 10^5 replicas. The difference is statistically significant (p<0.05) for two ArCOG groups: Coenzyme transport and metabolism, and Energy production and conversion.
ArCOG group | Number in full dataset | Number in PS group | p-value of observing by random chance
Amino acid transport and metabolism | 34 | 8 | 0.18225
Carbohydrate transport and metabolism | 22 | 2 | 0.90471
Cell cycle control; cell division; chromosome partitioning | 8 | 0 | *
Cell motility | 7 | 2 | 0.32479
Cell wall/membrane/envelope biogenesis | 13 | 1 | 0.90748
Coenzyme transport and metabolism | 15 | 6 | 0.02416
Defense mechanisms | 3 | 0 | *
Energy production and conversion | 33 | 11 | 0.01072
Inorganic ion transport and metabolism | 16 | 0 | *
Intracellular trafficking; secretion; and vesicular transport | 6 | 0 | *
Lipid transport and metabolism | 5 | 0 | *
Nucleotide transport and metabolism | 24 | 5 | 0.36181
Posttranslational modification; protein turnover; chaperones | 18 | 4 | 0.34395
Replication; recombination and repair | 24 | 4 | 0.58199
Secondary metabolites biosynthesis; transport and catabolism | 5 | 1 | 0.59636
Signal transduction mechanisms | 3 | 1 | 0.42142
Transcription | 28 | 4 | 0.70909
Translation; ribosomal structure and biogenesis | 76 | 14 | 0.36997
Function unknown | 87 | 6 | 0.99903
General function prediction only | 75 | 14 | 0.34782
Not annotated | 6 | 1 | 0.66633
Total | 508 | 84 |
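The Monte Carlo shuffling test behind the last column of Table 1 can be illustrated with a minimal sketch. It assumes a one-sided test: the PS labels are redistributed at random over all analyzed genes, and the p-value is the fraction of replicas in which a functional group receives at least as many PS genes as observed. The function name, its parameters, and the choice of a one-sided statistic are assumptions made for illustration and are not taken from the abstract.

```python
import random


def enrichment_p_value(group_size, group_hits, total_genes, total_hits,
                       replicas=10_000, seed=0):
    """Estimate P(hits in group >= group_hits) under random labelling.

    Genes are numbered 0..total_genes-1; the first `group_size` numbers
    stand for the members of the functional group (an arbitrary but
    harmless convention, since the labels are drawn uniformly at random).
    """
    rng = random.Random(seed)
    genes = range(total_genes)
    at_least_as_many = 0
    for _ in range(replicas):
        # Draw a random set of "positively selected" genes of the observed size.
        drawn = rng.sample(genes, total_hits)
        hits = sum(1 for g in drawn if g < group_size)
        if hits >= group_hits:
            at_least_as_many += 1
    return at_least_as_many / replicas


if __name__ == "__main__":
    # Counts taken from Table 1: 508 genes in total, 84 with PS events.
    p = enrichment_p_value(group_size=33, group_hits=11,
                           total_genes=508, total_hits=84)
    print(f"Energy production and conversion: p ~ {p:.4f}")
```

With more replicas the estimate becomes less noisy; the table reports 10^5 replicas, while the default here is smaller to keep the sketch fast.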
The work was supported by SB RAS integration project No. 109, Scientific School grant NSh-2447.2008.4, the RAS program “Origin and evolution of Biosphere” and CRDF grant REC-008.
MATHEMATICAL MODELING OF THE MOLECULAR GENETIC SYSTEMS REGULATING PLANT DEVELOPMENT
ILYA AKBERDIN 1, FEDOR KAZANTSEV 1, STANISLAV FADEEV 2, IRINA GAINOVA 2, VITALY LIKHOSHVAI 1
1 The Institute of Cytology and Genetics SB RAS, Russian Federation, [email protected]
2 The Institute of Mathematics SB RAS, Russian Federation
Keywords: auxin metabolism, gene network, automatic generation, mathematical model, plant development
Indole-3-acetic acid (IAA) is physiologically active in the form of the free acid, but can also be found in conjugated forms in plant tissues. IAA can be degraded, and redundant pathways lead to its synthesis. Auxin participates in the regulation of cell differentiation during the development of the embryo, leaves, vascular tissue, fruit, and primary and lateral roots, and in the control of apical dominance and tropisms. The regulation of IAA metabolism (synthesis, conjugation and degradation) is rather complex and may partly explain how this simple substance is able to influence such diverse processes. Mathematical modeling of the IAA metabolic gene network can help reveal the main factors governing this complex process. To this end, we first reconstructed a gene network of auxin biosynthesis, conjugation and degradation by annotating experimental data from 107 published papers in the GeneNet computer system. After reduction, this gene network was fed into a converter to generate a mathematical model of auxin metabolism. We have thus reconstructed the gene network and developed a mathematical model of auxin metabolism in Arabidopsis shoots. The model reproduces some phenomenological and molecular-genetic aspects of the role of auxin in plant development. The results obtained confirm the adequacy of the developed model. In silico experiments indicate that the molecular genetic regulation of the system's homeostasis involves qualitatively rapid processes. The accumulated experimental data made it possible to start constructing a spatially distributed hierarchical model that describes molecular genetic processes and processes at the level of cell-cell interactions simultaneously. The cellular automaton model that we developed earlier, which imitates the morphodynamics of embryo development through the regulation of signals produced by different embryonic cells, is a first step in modelling the process of development in general and the gene network for morphogenesis in particular [1]. The next step in applying mathematical modeling to the study of plant development is the integration of the spatially distributed hierarchical model with the model of intracellular auxin metabolism.
1. Akberdin I.R., Ozonov E.A., Mironova V.V., Gorpinchenko D.N., Omelyanchuk N.A., Likhoshvai V.A., Kolchanov N.A. (2007) A cellular automaton to model the development of shoot meristems of Arabidopsis thaliana, Journal of Bioinformatics and Computational Biology, 5: 641-650.
WATER-MEDIATED HYDROGEN BONDS ARE ESSENTIAL FOR LOOP STABILIZATION IN PROTEIN STRUCTURES
EVGENIY AKSIANOV 1, SERGEI SPIRIN 1,2, ANNA KARYAGINA 1,3,4, ANDREI ALEXEEVSKI 1,2
1 Belozersky Institute, Moscow State University, Moscow, Russia, [email protected]
2 Scientific Research Institute for System Studies (NIISI RAN), Moscow
3 Gamaleya Institute of Epidemiology and Microbiology, 18 Gamaleya st., Moscow, 123098, Russia
4 Institute of Agricultural Biotechnology, 42 Timiryazevskaya st., Moscow, 127550, Russia
Keywords: protein structure, water, hydrogen bond, water-mediated bond
Protein structures are mostly composed of secondary structural elements (SSEs): alpha-helices and beta-strands. SSEs are connected by unstructured regions (loops). Loops resolved in X-ray experiments are not flexible; they are stable, at least in a crystal. Regular networks of hydrogen bonds (H-bonds) stabilize both helices and sheets and are important for the stability of SSEs. No regular hydrogen bond networks are known to stabilize loop conformations. Based on a number of examples, we hypothesized that intradomain hydrogen bonds mediated by water molecules contribute significantly to the stabilization of loops. To test our hypothesis, we analyzed intradomain direct hydrogen bonds and water-mediated hydrogen bonds in a non-redundant set of high-resolution X-ray structures of protein domains.
Methods. 995 protein domains were obtained from the SCOP 1.73 database; sequence identity between each pair of domains was ≤90%, and all structures are X-ray structures with resolution better than 1.5 Å. Secondary structural elements (β-strands and α-helices) were detected using the DSSP algorithm. An H-bond was defined as a pair of atoms such that (1) one of the atoms may be a proton donor and the other a proton acceptor, (2) the distance between the atoms is 2.3–3.7 Å and (3) the angle between the direction of the H-bond and the optimal direction of the H-bond is ≤40° for both atoms.
Results. The numbers (per 20 residues of the corresponding SSEs) of H-bonds and water-mediated bonds between helices, strands and loops are shown in Table 1. The numbers of backbone-backbone H-bonds per 20 residues in helices and sheets are lower than the maximal possible 20 (11.5 for strands and 12.1 for helices), mainly because of the large number of short helices and hairpins (where the number of regular H-bonds is half as large) and irregularities in the H-bond networks of SSEs.
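The geometric part of the H-bond definition given in the Methods (conditions 2 and 3) can be sketched as follows. The abstract does not say how the optimal H-bond direction of an atom is obtained, so this sketch assumes the caller supplies it as a vector for each atom; the function names and argument layout are hypothetical, and condition 1 (donor/acceptor chemistry) is assumed to be checked elsewhere.

```python
import math


def angle_deg(u, v):
    """Angle between two 3D vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against rounding errors before acos.
    cos_angle = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_angle))


def is_hbond(donor_xyz, acceptor_xyz, donor_dir, acceptor_dir,
             d_min=2.3, d_max=3.7, max_angle=40.0):
    """Check the distance and angle criteria for a donor/acceptor pair.

    donor_dir and acceptor_dir are the assumed optimal H-bond directions
    of the two atoms (any non-zero length), pointing away from each atom.
    """
    bond = tuple(a - d for d, a in zip(donor_xyz, acceptor_xyz))
    distance = math.sqrt(sum(c * c for c in bond))
    if not (d_min <= distance <= d_max):
        return False
    reverse = tuple(-c for c in bond)
    # The bond direction is compared with the optimal direction at both atoms.
    return (angle_deg(donor_dir, bond) <= max_angle
            and angle_deg(acceptor_dir, reverse) <= max_angle)
```

For example, a donor at the origin and an acceptor 3 Å away along the x axis, with optimal directions pointing at each other, satisfy both criteria, whereas the same pair at 4 Å does not.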
Table 1. Number of direct/water-mediated hydrogen bonds between helices, strands, and loops.
 | Strands, side chain | Strands, backbone | Helixes, side chain | Helixes, backbone | Loops, side chain | Loops, backbone
Length (1) | 40897 | 40897 | 47681 | 47681 | 37062 | 37062
Atoms (2) | 14.13 | 40.00 | 17.93 | 40.00 | 89.75 | 40.00
BONDS WITHIN THE SAME SSE (3) | 0.51 / 1.23 | 0.00 / 0.02 | 1.49 / 2.45 | 12.14 (5) / 0.05 | 3.18 / 4.57 | 1.78 / 0.83
BONDS WITHIN THE SAME SSE (backbone to side chain): Strands 0.10 / 0.07; Helixes 0.14 / 0.07; Loops 0.25 / 0.13
BONDS BETWEEN DIFFERENT SSEs
 | Strands (s.c.) | Strands (bb.) | Helixes (s.c.) | Helixes (bb.) | Loops (s.c.) | Loops (bb.)
Strands (s.c. (4)) | 2.07 / 8.45 | 0.23 / 1.27 | 0.36 / 0.76 | 0.03 / 0.14 | 1.40 / 3.95 | 0.85 / 2.05
Strands (bb. (4)) | 0.23 / 1.27 | 11.48 / 0.5 | 0.06 / 0.34 | 0.03 / 0.05 | 0.32 / 0.93 | 1.31 / 0.82
Helixes (s.c.) | 0.36 / 0.76 | 0.06 / 0.34 | 1.22 / 10.13 | 0.49 / 0.99 | 1.76 / 4.10 | 0.80 / 1.77
Helixes (bb.) | 0.03 / 0.14 | 0.03 / 0.05 | 0.49 / 0.99 | 0.07 / 0.38 | 0.74 / 1.02 | 2.00 / 0.52
Loops (s.c.) | 1.40 / 3.95 | 0.32 / 0.93 | 1.76 / 4.10 | 0.74 / 1.02 | 2.65 / 20.75 | 3.48 / 9.04
Loops (bb.) | 0.85 / 2.05 | 1.31 / 0.82 | 0.80 / 1.77 | 2.00 / 0.52 | 3.48 / 9.04 | 1.52 / 4.54
(1) The total length of all elements in the investigated structures (in amino acids).
(2) Number of hydrogen donors and acceptors per 20 residues.
(3) 0.51 direct bonds and 1.23 water-mediated bonds per 20 amino acids. The same notation is used in all other cells of the table.
(4) s.c. means side chains, bb. means backbone atoms.
(5) Numbers greater than 4 bonds per 20 residues are shown bold and large in the original table.
From Table 1 it follows that water-mediated bonds between the side chains of two helices or two strands were detected for approximately half of the residues. In the case of loop-to-loop interactions, water-mediated bonds were detected on average for each residue, and their contribution to loop-loop interactions exceeds that of direct H-bonds.
We conclude that intra-domain water-mediated bonds are a common feature of protein structures. Such bonds may be especially important for loop stabilization.
The work is partly supported by the Russian Foundation for Basic Research, grants 07-04-91560 and 08-04-91975.
GENOMIC INSIGHTS INTO THE ORIGINS OF METAZOAN CELL DIFFERENTIATION
KIRILL V. MIKHAILOV 1, A.V. KONSTANTINOVA 1, M.A. NIKITIN 1, V.V. ALEOSHIN 1, L.YU. RUSIN 2, YURI V. PANCHIN 2
1 Belozersky Institute for Physicochemical Biology, Lomonosov Moscow State University, Moscow, Russian Federation, [email protected]
2 Institute for Information Transmission Problems, Russian Academy of Sciences, Moscow 127994, Russian Federation
Keywords: Mesomycetozoea; molecular phylogenetics; origin of Metazoa
Choanoflagellates and mesomycetozoeans are two groups of unicellular organisms that are the closest relatives of animals [1]. The ongoing effort to sequence the genomes of their members is an attempt to understand the origin of animals and multicellularity in the context of the evolution of genes and genomes [2]. These studies have brought about the notion of “Metazoa-specific” genes: genes found exclusively in metazoans, which are thus considered likely to be novelties specifically associated with the development of multicellularity. The “Metazoa-specific” genes encode a large number of cell signalling and adhesion proteins, such as cadherins and TGFb pathway components, to name a few. However, the list of “Metazoa-specific” genes is rapidly contracting as the number of sequenced genomes of unicellular relatives of metazoans increases. The genomes of choanoflagellates were found to contain a multitude of tyrosine kinases, proteins involved in the regulation of cell proliferation and motility that were originally considered to be a metazoan novelty [3]. Another example is a mesomycetozoean that possesses components involved in cell-matrix adhesion, such as focal adhesion kinase and integrin beta [4]. Here we present evidence for the exclusion of yet another set of genes from the “Metazoa-specific” list by demonstrating their presence in another mesomycetozoean and showing that they are actively expressed. The premetazoan ancestry of metazoan transcription factor families and signal transduction pathways is poorly accommodated by the traditional view of the metazoan ancestors as blastula-like colonies that subsequently underwent cell differentiation. The new data suggest that elements of the genetic toolkit for the development of multicellular animals were possibly already in use by their unicellular relatives.
Mapping of major gene families and ecological traits onto the phylogeny indicates that the presence of different cell types at different stages of the life cycle and the appearance of multicellular aggregates are not an intrinsic property of metazoans, but of a much wider group of organisms, the Opisthokonta [5]. The emerging scenario regards the last common ancestor of multicellular animals as an integration of different stages of the unicellular ancestor’s life cycle.
1. E.T. Steenkamp, J. Wright, S.L. Baldauf (2006) The protistan origins of animals and fungi, Molecular Biology and Evolution, 23: 93–106.
2. I. Ruiz-Trillo, G. Burger, P.W. Holland, N. King, B.F. Lang, et al. (2007) The origins of multicellularity: a multi-taxon genome initiative, Trends in Genetics, 23: 113–118.
3. N. King, M.J. Westbrook, S.L. Young, A. Kuo, M. Abedin, et al. (2008) The genome of the choanoflagellate Monosiga brevicollis and the origin of metazoans, Nature, 451: 783–788.
4. K. Shalchian-Tabrizi, M.A. Minge, M. Espelund, R. Orr, T. Ruden, et al. (2008) Multigene phylogeny of choanozoa and the origin of animals, PLoS ONE, 3: 2098.
5. K.V. Mikhailov, A.V. Konstantinova, M.A. Nikitin, P.V. Troshin, L.Yu. Rusin, V.A. Lyubetsky, Y.V. Panchin, et al. (2009) The origin of Metazoa: a transition from temporal to spatial cell differentiation, Bioessays, 31: (in press).
INHERENT POTENTIALITIES OF VORONOI-DELAUNAY TESSELLATION AS APPLIED TO BIOLOGY PROBLEMS
ANASTASYA ANASHKINA 1, NATALIA ESIPOVA 1, VLADIMIR TUMANYAN 1
1 Engelhardt Institute of Molecular Biology RAS, Russian Federation, [email protected]
Researchers in different fields have long used the Voronoi-Delaunay tessellation effectively to solve various problems. In recent years interest in the method has grown because of its potential in complex biological studies, alongside crystallography and chemistry. By definition, the Voronoi polyhedron (Voronoi region) of a center is the part of space whose points lie closer to this center than to any other center of the system. A tetrahedron based on four centers of the system is a Delaunay simplex if its circumsphere contains no other centers of the system. The set of all Delaunay simplexes of a system, like the set of Voronoi polyhedra, fills space without gaps or overlaps. These tessellations are dual and topologically equivalent. The uniqueness of the Voronoi-Delaunay tessellation and its independence of any parameters make the method extremely attractive to researchers. Such mathematical rigor and exactness are rare in the biological sciences. The method has been developed for both two-dimensional and three-dimensional cases. Modifications of the basic method provide additional capabilities and allow the analysis not only of systems of points but also of systems of spheres of similar radii, systems of spheres of different radii, and systems of bodies of arbitrary shape [1]. The Voronoi-Delaunay tessellation encounters some problems in practical use. In particular, boundary conditions must be set. Another problem is the long computation time for systems of many atoms. The two-dimensional Voronoi-Delaunay tessellation is used even for the analysis of cell culture architecture. A Voronoi facet, like a Delaunay edge, is a natural, unambiguous, non-parametric way to reveal nearest neighbors in three-dimensional space. This procedure is equivalent to identifying contacts between atoms. Consequently, the Voronoi-Delaunay tessellation allows the calculation of local atomic density and of contacts between biopolymer molecules.
A contact between two atoms, in this case, is a common facet of their Voronoi polyhedra. Accordingly, a contact between two residues is defined as the set of common facets of the Voronoi polyhedra of their atoms. It is thus possible to explore the statistics of contacts between atoms or residues/nucleotides in protein-protein [2] and protein-nucleic acid [3] interfaces. Knowledge of the rules that control interactions in protein-protein interfaces is necessary for correct prediction of interaction sites on the surface of proteins or protein complexes. The method may also help to resolve the question of whether a code of nucleic acid-protein recognition exists. The Voronoi network (more specifically, the Voronoi S-network) is the main tool for analyzing empty interatomic space. This network runs through the interatomic space of the system and represents the locus of points farthest from the atoms [4].
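The empty-circumsphere definition of a Delaunay simplex and the use of Delaunay edges to find nearest neighbors can be illustrated in two dimensions, where the simplex is a triangle and the circumsphere is a circumcircle. The sketch below is a brute-force illustration that tests every triple of points. It assumes points in general position, is far too slow for real multi-atomic systems, and is not the implementation used by the authors; all function names are hypothetical.

```python
from itertools import combinations


def circumcircle(a, b, c):
    """Return (center, squared radius) of the circle through a, b, c,
    or None if the three points are collinear."""
    ax, ay = a
    bx, by = b
    cx, cy = c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return (ux, uy), (ax - ux) ** 2 + (ay - uy) ** 2


def delaunay_triangles(points):
    """All triples whose circumcircle contains no other point of the system."""
    triangles = []
    for i, j, k in combinations(range(len(points)), 3):
        circle = circumcircle(points[i], points[j], points[k])
        if circle is None:
            continue
        (ux, uy), r2 = circle
        if all((px - ux) ** 2 + (py - uy) ** 2 >= r2 - 1e-9
               for m, (px, py) in enumerate(points) if m not in (i, j, k)):
            triangles.append((i, j, k))
    return triangles


def delaunay_neighbours(points):
    """Map each point index to the indices it shares a Delaunay edge with."""
    neighbours = {i: set() for i in range(len(points))}
    for triangle in delaunay_triangles(points):
        for a, b in combinations(triangle, 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    return neighbours


if __name__ == "__main__":
    # Four corners of a square and its center: the center neighbors every
    # corner, while opposite corners are not neighbors of each other.
    pts = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0), (1.0, 1.0)]
    print(delaunay_neighbours(pts))
```

No distance cutoff appears anywhere in the sketch, which reflects the parameter-free character of the neighbor definition described above.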
1. N.N. Medvedev (2000) Metod Voronogo-Delone v issledovanii struktury nekristallicheskih sistem [The Voronoi-Delaunay method in studies of the structure of non-crystalline systems], Novosibirsk: NIC OIGGM SO RAN.
2. A. Anashkina et al. (2007) Comprehensive statistical analysis of residues interaction specificity at protein-protein interfaces, Proteins, 67(4): 1060-77.
3. A.A. Anashkina et al. (2008) Geometricheskij analiz DNK-belkovyh vzaimodejstvij na osnove metoda Voronogo-Delone [Geometric analysis of DNA-protein interactions based on the Voronoi-Delaunay method], Biofizika, 53(3): 402-6.
4. N.N. Medvedev, V.P. Voloshin (2003) Issledovanie mezhatomnyx pustot v molekulyarnyh sistemah [Study of interatomic voids in molecular systems], Struktura i dinamika molekulyarnyh sistem, X (1): 299-304.
This research was supported by Russian Foundation for Basic Research grants 07-04-01765a and 08-04-01770a.
COMPUTATIONAL ANTI-AIDS DRUG DESIGN RESULTING FROM THE STUDY ON SPECIFIC INTERACTIONS OF IMMUNOPHILINS WITH THE HIV-1 GP120 V3 LOOP
ALEXANDER ANDRIANOV 1
1 Institute of Bioorganic Chemistry, National Academy of Sciences of Belarus, Kuprevich Street, 5/2, 220141 Minsk, Republic of Belarus, [email protected]
Keywords: HIV-1, V3 Loop, 3D Structure, Computer Modeling, Molecular Docking
Currently, research teams involved in anti-AIDS drug studies pay special attention to the HIV-1 V3 loop (reviewed in [1]). The strong interest in V3 stems from numerous experimental data showing that this gp120 site forms the principal target for neutralizing antibodies and determines the choice of co-receptor, and hence the preference of the virus for T-lymphocytes or primary macrophages. Since the V3 loop governs cell tropism and cell fusion (see, e.g., [1]), one strategy for developing anti-HIV-1 drugs may be to search for chemicals capable of efficiently blocking this functionally significant stretch of gp120. Comprehensive analysis of the data of study [2] suggests that immunophilins, which exhibit specific high-affinity interactions with the HIV-1 V3 loop, may serve as a starting point in the search for potential anti-AIDS therapeutic agents. This work continues my previous study [3], in which a virtual molecule representing a promising anti-HIV-1 pharmacological substance was designed by computer modeling based on the analysis of specific interactions between the FK506-binding protein and a synthetic peptide imitating the immunogenic crown of the V3 loop. The aim of the present study was to generate a model of the structural complex of cyclophilin A (CycA) with the HIV-MN V3 loop, followed by the computer-aided design of an immunophilin-derived peptide able to mask the biologically important V3 segments.
To this end, the following problems were solved:
(i) an NMR-based conformational analysis of the HIV-MN V3 loop was carried out, and its low-energy structure fitting the input experimental observations was determined;
(ii) molecular docking of this V3 structure with the X-ray conformation of CycA was carried out, and the energy of the simulated structural complex was refined;
(iii) the matrix of inter-atomic distances for the amino acids of the molecules forming the supramolecular complex was computed, the types of interactions responsible for its stabilization were analyzed, and the CycA stretch responsible for binding to V3 was identified;
(iv) the most probable 3D structure of this stretch in the unbound state was predicted and compared with the X-ray structure of the corresponding site of CycA;
(v) the potential energy function and its components were studied for the structural complex generated by molecular docking of the V3 loop with the CycA peptide, a virtual molecule that imitates the CycA segment making a key contribution to the interactions of the native protein with the HIV-1 principal neutralizing determinant;
(vi) as a result, the designed molecule was shown to be capable of effectively blocking the functionally crucial V3 sites; and
(vii) starting from the joint analysis of the results obtained here and in study [3], the composition of a peptide cocktail representing a promising anti-AIDS pharmacological substance was developed.
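Step (iii) above relies on a matrix of inter-atomic distances between the two molecules of the complex. A minimal sketch of such a matrix, and of extracting the atom pairs that lie within a contact distance, is given below. The 4.0 Å cutoff, the function names, and the plain coordinate tuples are assumptions made for illustration; the abstract does not state which threshold or software was used.

```python
import math


def distance_matrix(atoms_a, atoms_b):
    """Pairwise distances: rows follow atoms_a, columns follow atoms_b."""
    return [[math.dist(a, b) for b in atoms_b] for a in atoms_a]


def contacts(atoms_a, atoms_b, cutoff=4.0):
    """Return (index_a, index_b, distance) for all pairs within the cutoff.

    The cutoff (in the same units as the coordinates) is a hypothetical
    default chosen only for this example.
    """
    matrix = distance_matrix(atoms_a, atoms_b)
    return [(i, j, d)
            for i, row in enumerate(matrix)
            for j, d in enumerate(row)
            if d <= cutoff]


if __name__ == "__main__":
    # Toy coordinates standing in for atoms of the two docked molecules.
    loop_atoms = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
    ligand_atoms = [(0.0, 0.0, 3.0), (0.0, 0.0, 8.0)]
    for i, j, d in contacts(loop_atoms, ligand_atoms):
        print(f"atom {i} of molecule A contacts atom {j} of molecule B at {d:.1f}")
```

Grouping the resulting atom pairs by residue would then indicate which stretch of one molecule faces the other.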
The molecules simulated here by molecular modeling methods may become the first representatives of a new class of chemicals (immunophilin-derived peptides) offering promising basic structures for the design of efficacious and safe antiviral agents.
The author thanks the Belarusian Republican Foundation for Basic Research for financial support (project No. X08-003).
1. S. Sirois, T. Sing, K.C. Chou (2005) HIV-1 gp120 V3 loop for structure-based drug design, Curr. Protein Pept. Sci., 6: 413-422.
2. M.M. Endrich, H. Gehring (1998) The V3 loop of human immunodeficiency virus type-1 envelope protein is a high-affinity ligand for immunophilins present in human blood, Eur. J. Biochem., 252: 441-446.
3. A.M. Andrianov (2008) Computational anti-AIDS drug design based on the analysis of the specific interactions between immunophilins and the HIV-1 gp120 V3 loop. Application to the FK506-binding protein, J. Biomol. Struct. Dynam., 26: 49-56.
HOMOLOGY MODELING AND MOLECULAR DYNAMICS IN STRUCTURAL STUDIES ON THE HIV-1 GP120 V3 LOOPS: INSIGHT INTO THE VIRUS SUBTYPE A
IVAN ANISHCHENKO 1, ALEXANDER ANDRIANOV 2
1 United Institute of Informatics Problems, National Academy of Sciences of Belarus, Surganov Street 6, 220012 Minsk, Republic of Belarus, [email protected]
2 Institute of Bioorganic Chemistry, National Academy of Sciences of Belarus, Kuprevich Street, 5/2, 220141 Minsk, Republic of Belarus, [email protected] net.by
Keywords: HIV-1, V3 Loop, 3D Structure, Computer Modeling, Molecular Docking
The V3 loop of the HIV-1 gp120 glycoprotein, a 35-residue-long, frequently glycosylated, highly variable, and disulfide-bonded structure, plays a central role in the biology of the virus and forms the principal target for neutralizing antibodies and the major viral determinant for co-receptor binding. Here we present computer-aided studies of the 3D structure of the HIV-1 subtype A V3 loop (SA-V3 loop), in which its structurally inflexible regions and individual amino acids were identified and a structure-function analysis of V3 was carried out to provide informational support for anti-AIDS drug research. To this end, the following successive steps were carried out:
(i) using homology modeling and simulated annealing, an ensemble of low-energy structures was generated for the consensus amino acid sequence of the SA-V3 loop, and its most probable conformation was selected on the basis of general criteria widely adopted as measures of the quality of protein structures in terms of their 3D folds and local geometry;
(ii) the elements of secondary structure in the built V3 conformations were characterized, and the corresponding data from experimental observations of the V3 loops in various HIV-1 strains were carefully analyzed;
(iii) to reveal common structural motifs in the HIV-1 V3 loops regardless of their sequence variability and changes in environment, the simulated structures were compared with each other as well as with V3 structures determined by NMR spectroscopy and X-ray studies for diverse virus isolates in different environments;
(iv) to examine the conformational features of the SA-V3 loop, a molecular dynamics trajectory was computed from its static 3D structure, the structurally rigid V3 segments were determined, and the findings were compared with those obtained above; and
(v) to evaluate the masking effect that can arise from interaction of the SA-V3 loop with the two virtual molecules constructed previously [1, 2] by computational modeling and named the FKBP and CycA peptides, molecular docking of V3 with these molecules was carried out, and the inter-atomic contacts in the simulated complexes were analyzed to identify the V3 stretches in contact with the ligands.
As a result, V3 segments 3-7, 15-20, and 28-32, which contain highly conserved and biologically meaningful residues of gp120, were shown to retain their 3D main chain shapes in all the cases of interest, making them promising targets for anti-AIDS drug research. Based on the molecular docking data, synthetic analogs of the CycA and FKBP peptides were proposed as suitable frameworks for V3-based anti-HIV-1 drug projects. In addition, the computational V3 model proposed above provides a productive basis for gaining better insight into the principles of virus functioning and can therefore be used in subsequent studies to investigate structure-function relationships, to examine the structural effects of mutations, or to distinguish between various forms of the V3 loop under different conditions.
1. A.M. Andrianov (2008) Computational anti-AIDS drug design based on the analysis of the specific interactions between immunophilins and the HIV-1 gp120 V3 loop. Application to the FK506-binding protein, J. Biomol. Struct. Dynam., 26: 49-56.
2. A.M. Andrianov (2009) Immunophilins and HIV-1 V3 loop for structure-based anti-AIDS drug design, J. Biomol. Struct. Dynam., 26: 445-454.
This study was supported by grants from the Union State of Russia and Belarus (scientific program SKIF-GRID; № 4U-S/07-111) as well as from the Belarusian Foundation for Basic Research (project X08-003).
3D STRUCTURE MODELING AND POSTERIOR COLLATION OF THE HIV-1 V3 VARIABLE LOOPS FOR DISCOVERY OF THEIR STRUCTURALLY INVARIANT SITES EXPOSING THE ACHILLES' HEEL IN THE HIV-1 “REDOUBTS”
A. M. ANDRIANOV 1, I.V. ANISHCHENKO 2
1 Institute of Bioorganic Chemistry, National Academy of Sciences of Belarus, Kuprevich Street, 5/2, 220141 Minsk, Republic of Belarus, [email protected]
2 United Institute of Informatics Problems, National Academy of Sciences of Belarus, Surganov Street 6, 220012 Minsk, Republic of Belarus, [email protected]
Keywords: HIV-1, V3 Loop, 3D Structure, Computer Modeling, Molecular Docking
The HIV-1 gp120 V3 loop, which forms the principal neutralizing determinant of the virus and the determinants of cell tropism and cell fusion, is considered one of the promising targets for anti-AIDS drug studies (reviewed in [1]). The V3 loops from different HIV-1 isolates have highly variable amino acid sequences, which prevents antibodies bound to the V3 loop of one isolate from acting on the V3 loops of other isolates. However, analysis of various HIV-1 V3 loop sequences makes it clear that, despite this high variability, which fundamentally complicates studies of the V3 loop structure, some of the amino acid positions located in the N- and C-terminal regions, and especially those in its immunogenic tip, are highly conserved. The conservation of these V3 stretches suggests that the residues occupying them may preserve their conformational states in diverse HIV-1 strains and may therefore be promising targets for the development of new therapeutic agents. Information on the 3D structure of V3 and its inflexible regions is therefore needed and is of particular importance for the success of anti-AIDS drug studies [1]. In light of the above, computational approaches combining NMR-based protein structure modeling with methods of mathematical statistics were used here to define locally accurate 3D structures of the HIV-1 gp120 V3 loops from the Minnesota, Haiti, RF, and Thailand isolates in water solution, as well as from the Minnesota and Haiti isolates in a water/trifluoroethanol mixed solvent.
To identify the structural motifs of V3 that adopt similar spatial folds regardless of sequence and environment variability, the simulated structures and their individual segments of different lengths were compared with each other and with those derived previously from homology modeling [2] and X-ray crystallography [3]. As a result, the sequence and environment changes were found to trigger considerable structural rearrangements of the V3 loop, but, at the same time, some of the functionally crucial V3 stretches were shown to keep their 3D shapes in all the cases in question. This applies, first of all, to the core V3 sequence 15-20 as well as to its N- and C-terminal sites 3-7 and 28-32, which comprise residues that contribute significantly to virus immunogenicity and cell tropism. In addition, the structurally rigid V3 stretch 3-7 includes the highly conserved glycosylation site of gp120 used by the virus for defense against neutralizing antibodies and for enhancing its infectivity. In the context of these findings, the inflexible V3 motifs identified in this study may represent weak points in the HIV-1 protection system, and their detection is therefore of great importance for the successful design of V3-based anti-AIDS drugs able to stop the spread of HIV.
1. S. Sirois, T. Sing, K.C. Chou (2005) HIV-1 gp120 V3 loop for structure-based drug design, Curr. Protein Pept. Sci., 6: 413-422.
2. I.V. Anishchenko, A.M. Andrianov (2008) Computer-aided modeling of the 3D structure for the HIV-1 gp120 V3 loop: exploring the virus subtype A, Proceedings of II International Conference “Advanced Information and Telemedicine Technologies for Health” (Minsk, 2008): 12-16.
3. C.C. Huang, M. Tang, M.Y. Zhang, S. Majeed, E. Montabana, R.L. Stanfield, D.S. Dimitrov, B. Korber, J. Sodroski, I.A. Wilson, R. Wyatt, P.D. Kwong (2005) Structure of a V3-containing HIV-1 gp120 core, Science, 310: 1025-1028.
This study was supported by grants from the Union State of Russia and Belarus (scientific program SKIF-GRID; № 4U-S/07-111) as well as from the Belarusian Foundation for Basic Research (project X08-003).
POLYCTLDESIGNER – THE SOFTWARE FOR CONSTRUCTING POLYEPITOPE IMMUNOGENS. DENIS ANTONETS 1, AMIR MAKSYUTOV 2, SERGEY BAZHAN 3
Keywords: Immunity, cytotoxic T-lymphocyte, T-cell epitope, polyepitope antigen
1 Research Center of Virology and Biotechnology Vector, Russian Federation, [email protected]
2 Research Center of Virology and Biotechnology Vector, Russian Federation, [email protected]
3 Research Center of Virology and Biotechnology Vector, Russian Federation, [email protected]

The design of artificial polyepitope immunogens capable of eliciting high levels of CD8+ CTL responses is a promising approach to creating efficient vaccines. When designing such immunogens, it is necessary to optimize the processing and presentation of the epitopes they contain. DNA vaccine constructs encoding poly-CTL-epitope immunogens that contain N-terminal ubiquitin and spacer sequences ensuring correct processing and presentation of the selected epitopes were shown to be highly efficient in stimulating CD8+ CTL responses. These results inspired us to create the PolyCTLDesigner software, intended for designing optimal polyepitope antigens.

To optimize a polytope sequence for inducing a high level of CTL response, one should take into account the major steps of MHC class I-dependent antigen processing: proteasomal/immunoproteasomal cleavage of the antigen and TAP-dependent transport of the generated peptide fragments into the endoplasmic reticulum, where they bind to MHC class I molecules. To predict proteasomal/immunoproteasomal processing, PolyCTLDesigner uses predictive models developed by Toes et al. [1]. The proteasomal cleavage site should be located at the C-terminus of the epitope. Thus, to optimize proteasomal cleavage (if necessary), the C-terminus of the epitope should be extended with a spacer motif of up to six amino acid residues. To predict peptide binding to TAP, our program uses models developed by Peters et al. [2]. According to a widely accepted hypothesis, the major contributions to TAP binding come from the first three N-terminal amino acid residues of the peptide and the last (C-terminal) one. Since the C-terminus of the epitope must stay unchanged, only the N-terminus of the antigenic peptide can be extended to optimize its interaction with the TAP1/TAP2 heterodimer. According to the chosen models and algorithms for TAP-binding prediction, the maximal length of the N-terminal spacer sequence is three residues: ARY.

PolyCTLDesigner is integrated with the previously created TEpredict program (http://tepredict.sourceforge.net), which it uses to predict T-cell epitopes. PolyCTLDesigner allows the user to select the minimal set of epitopes with known (or predicted) specificity towards various allelic variants of MHC class I molecules that covers the selected MHC repertoire with a specified redundancy. Currently PolyCTLDesigner uses two algorithms to design polyepitope immunogens. The first uses an optimal spacer motif derived from the selected predictive models (e.g., ADLVKV). The second uses a redundant spacer motif and minimizes the formation of non-target epitopes in the sequence of the desired polyepitope immunogen. The software implements a rational approach to designing highly immunogenic poly-CTL-epitope vaccine constructs and can be used to design new candidate polyepitope vaccines capable of eliciting high levels of T-cell-mediated immune responses. A more detailed description of the program and its source code are available at http://tepredict.sourceforge.net/PolyCTLDesigner.html. The program is written in the Python programming language (http://python.org).
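The first design algorithm mentioned above, joining epitopes with a fixed spacer motif, can be sketched in a few lines of Python. This is a minimal illustration and not PolyCTLDesigner's actual code: the function names `build_polyepitope` and `junction_peptides`, the 9-residue window, and the two example peptides are assumptions made for this example. The second function only lists the peptides that span a junction or spacer; deciding whether any of them is a non-target epitope would require a predictor such as TEpredict, which is not called here.

```python
def build_polyepitope(epitopes, spacer="ADLVKV"):
    """Join epitopes in the given order, placing the spacer motif between neighbours."""
    return spacer.join(epitopes)


def junction_peptides(epitopes, spacer="ADLVKV", length=9):
    """List peptides of the given length in the construct that are not input epitopes.

    These windows overlap a spacer or a junction and are the candidates that a
    separate epitope predictor would have to score as possible non-target epitopes.
    """
    construct = build_polyepitope(epitopes, spacer)
    targets = set(epitopes)
    windows = {construct[i:i + length] for i in range(len(construct) - length + 1)}
    return sorted(windows - targets)


# Hypothetical input: two peptides used purely as placeholders.
example = ["SIINFEKL", "GILGFVFTL"]
construct = build_polyepitope(example)
candidates = junction_peptides(example)
```

A real design loop would try different epitope orders and spacer variants and keep the construct with the fewest predicted non-target epitopes; that search is omitted here.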
1. Toes R.E. et al. (2001) Discrete Cleavage Motifs of Constitutive and Immunoproteasomes Revealed by Quantitative Analysis of Cleavage Products. J. Exp. Med., 194:1-12.
2. Peters B. et al. (2003) Identifying MHC class I epitopes by predicting the TAP transport efficiency of epitope precursors. J. Immunol., 171:1741-1749.
GENOME-WIDE SEARCH FOR 5’-UTR OF SACCHAROMYCES CEREVISIAE GENES AND THEIR ORTHOLOGS KIRILL ANTONEZ 1, ALSU SAIFITDINOVA 2
Keywords: yeast, 5'-UTR
1 Saint-Petersburg State University, Russian Federation, [email protected]
2 Saint-Petersburg branch of Vavilov Institute of General Genetics RAS, Russian Federation, [email protected]

Motivation and Aims: Prokaryotic and eukaryotic mRNAs are key intermediates in protein biosynthesis and consist of a coding sequence and untranslated regions (UTRs). UTRs play an essential role in the posttranscriptional life of mRNA and may harbor regulatory elements in addition to translation initiation sequences. The 5’-UTRs of both prokaryotic and eukaryotic mRNAs may also form stable secondary structures that influence the efficiency of translation initiation. Certain 5’-UTRs contain riboswitches that regulate protein synthesis by ligand binding, decreasing or enhancing translation efficiency [1]. The realization of genetic information in eukaryotes includes RNA processing, transport from the nucleus to the cytoplasm, translation and decay [2, 3]. Regulatory elements in 5’- and 3’-UTRs can hasten mRNA decay. UTRs may also contain stems that interact with specific proteins, leading to inhibition or initiation of translation [4]. Besides the main ORF, an mRNA may contain an upstream ORF located in the 5’-UTR that decreases the efficiency of translation [5]. All these elements can regulate tissue-specific protein production and fast responses to stress, or influence the development and progression of disease [6]. It is therefore important to identify regulatory sequences in mRNA. A common way to find regulatory elements is to compare a set of sequences that harbor putative elements. Currently there is no convenient tool for the analysis of Saccharomyces cerevisiae 5’-UTRs. Our aim was to write a program that retrieves the set of 5’-UTRs of yeast genes and their orthologs.

Methods and Algorithms: The program was written in C# for .NET Framework 3.5, using Microsoft Visual Studio 2008 and Windows Workflow Foundation. Data on yeast genes were taken from the Saccharomyces Genome Database (www.yeastgenome.org) and from published data on UTR lengths [7, 8]. Information about yeast gene orthologs was obtained from the Princeton Protein Orthology Database (ppod.princeton.edu). Detailed data for other organisms were taken from WormBase (www.wormbase.org), FlyBase – A Database of Drosophila Genes & Genomes (www.flybase.org), TAIR (www.arabidopsis.org), Mouse Genome Informatics (www.informatics.jax.org), Protein Knowledgebase (www.uniprot.org) and Homo sapiens genes (NCBI36). The BioMart tool (www.ensembl.org/biomart/) was used for downloading human 5’-UTR sequences.

Results: We have designed the program UTRdbMaker for obtaining a set of 5’-UTRs. For the given gene names, it retrieves the ORF and 5’-UTR sequences of yeast genes and their orthologs. UTRdbMaker also analyses the nucleotide composition of the 5’-UTRs. Search results are written to text files as tables and include general descriptions of the yeast genes. These results may be used to explore the conservation of 5’-UTRs and to search for regulatory elements in them. The code of UTRdbMaker can be extended for similar work with other regions or other databases.
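The nucleotide-composition step and the tabular text output described for UTRdbMaker can be illustrated with a short sketch. UTRdbMaker itself is a C# program whose code is not shown in the abstract; the Python below is only an assumed equivalent, and the function names, the column layout and the gene name `YFG1` are hypothetical.

```python
from collections import Counter


def nucleotide_composition(seq):
    """Return the fraction of A, C, G and T in a sequence (U is counted as T)."""
    counts = Counter(seq.upper().replace("U", "T"))
    total = sum(counts[base] for base in "ACGT")
    if total == 0:
        return {base: 0.0 for base in "ACGT"}
    return {base: counts[base] / total for base in "ACGT"}


def composition_table(utrs):
    """Format a {gene name: 5'-UTR sequence} mapping as a tab-separated table."""
    lines = ["gene\tlength\tA\tC\tG\tT"]
    for gene, seq in sorted(utrs.items()):
        comp = nucleotide_composition(seq)
        fractions = "\t".join(f"{comp[base]:.3f}" for base in "ACGT")
        lines.append(f"{gene}\t{len(seq)}\t{fractions}")
    return "\n".join(lines)


# Hypothetical input: one made-up gene with a four-base 5'-UTR.
table = composition_table({"YFG1": "AACG"})
```

Ambiguous bases such as N are ignored when the fractions are computed, which is one possible convention among several.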
1. W.C. Winkler et al. (2004) Control of gene expression by a natural metabolite-responsive ribozyme, Nature, 428: 281-286.
2. J.E.G. McCarthy (1998) Posttranscriptional Control of Gene Expression in Yeast, Microbiol. Mol. Biol. Reviews, 62: 1492-1553.
3. Ch. Dimaano et al. (2004) Nucleocytoplasmic Transport: Integrating mRNA Production and Turnover with Export through the Nuclear Pore, Mol. Cell. Biol., 24: 3069-3076.
4. A.M. Thomson et al. (1999) Iron-regulatory proteins, iron-responsive elements and ferritin mRNA translation, Int. J. Biochem. Cell Biol., 31: 1139-1152.
5. A.M. Resch et al. (2009) Evolution of alternative and constitutive regions of mammalian 5’UTRs, BMC Genomics, 10: 162.
6. J.T. Rogers et al. (2002) An iron-responsive element type II in the 5’-untranslated region of the Alzheimer’s amyloid precursor protein transcript, J. Biol. Chem., 277: 45518-45528.
7. F. Miura et al. (2006) A large-scale full-length cDNA analysis to explore the budding yeast transcriptome, PNAS, 103: 17846-17851.
8. Z. Xu et al. (2009) Bidirectional promoters generate pervasive transcription in yeast, Nature, 457: 1033-1037.
A TRUSTY KNOWLEDGE-BASED POTENTIAL ENERGY BASED ON PAIRWISE RESIDUE CONTACT AREA SEYED SHAHRIAR ARAB 1, ARMITA SHEARI 1, MEHDI SADEGHI 2, CHANGIZ ESLAHCHI 3, HAMID PEZESHK 4
Keywords: Knowledge-based potential, decoy sets, protein structure prediction, protein folding
1 Department of Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Iran, [email protected], [email protected]
2 National Institute of Genetic Engineering and Biotechnology, Tehran-Karaj Highway, Tehran, Iran, [email protected]
3 Department of Mathematical Sciences, Shahid Beheshti University, Tehran Iran, [email protected]
4 School of Computer Science, Institute for Studies in Theoretical Physics and Mathematics, Iran, [email protected]

We develop a new approach to calculating a knowledge-based potential of mean force based on pairwise residue contact area. To test its effectiveness, we apply it to several decoy sets and measure its ability to discriminate native structures from decoys. In all cases the potential distinguished native structures from the decoys with about 100% accuracy, and the calculated Z-score is high for all protein datasets. Because this knowledge-based mean force discriminates native structures from decoys effectively, it should be useful for protein structure prediction and model refinement.

An energy function that detects the correct protein fold among incorrect ones is very important for protein structure prediction and protein folding. Two main types of potential energy function are currently used, either to identify native protein models in a large set of decoys or in protein fold recognition and threading studies. The first class, the so-called physical-based potentials or physical energy functions, is based on a fundamental analysis of the forces between particles. The second class consists of knowledge-based energy functions derived from known protein structures. A physical energy function uses a molecular mechanics force field. Such force fields are parameterized from ab initio calculations and small-molecule structural data. They are essentially the sum of pairwise electrostatic and van der Waals interaction energies and of bond, angle and dihedral angle terms. In addition, terms that are not included explicitly, such as entropy and solvent effects, are considered implicitly. Although physical energy functions are widely used in molecular dynamics simulations of proteins in their native and denatured states and can be used to distinguish decoy from native structures, they have not been efficient in protein structure prediction because of their greater computational cost.

To reduce the computational complexity of the protein folding problem, knowledge-based or empirical mean-force potentials are widely used. Since the structure of a folded protein reflects the free energy of the interaction of all its components, including all enthalpic and entropic contributions as well as solvent effects, such potentials provide an excellent shortcut towards a powerful objective function. The potentials between groups of atoms are obtained from experimentally determined structures. In this approach, statistical thermodynamics is used to analyze the frequency of observed states and thereby estimate the underlying free energy. Most often, the distribution of pairwise distances is used to extract a set of effective potentials between residues or atoms. This distribution can be compiled from the protein structure database and, after a reference state is defined, Boltzmann's law is used to calculate the interaction energy of a particular pair. The total potential energy of a protein is simply taken as the sum over all pairwise interactions.
In most cases, one or two points per residue are used to represent a protein. These points are usually C(alpha), C(beta) or the center of mass of each side chain. Each interaction can be distance-dependent. A large variety of knowledge-based potentials of mean force have been developed by introducing additional interactions such as surface area terms, main chain and side chain dihedral angles, three- and four-body terms and heavy atoms. In a contact potential, whether distance-dependent or dependent on contact only, the distance between the C(alpha) atoms, the C(beta) atoms, the centers of mass or all heavy atoms of two residues is calculated, and the observed frequency of contacts between residues is converted to free energy using Boltzmann's equation. This approach has a drawback: the distance between the C(alpha) atoms of two residues may equal the distance between the same atoms of these residues in another position while the orientation of the two side chains is quite different, yet both cases are treated as pairs with equal pairwise distance. In other words, the side chains of the two residues may have no direct contact with each other, and some atoms may be buried inside the structure.

In this study, we develop a new approach to calculating a knowledge-based potential energy based on pairwise residue contact area. We calculated the parts of the surface area of each residue pair that are in contact (in Å²) by rolling probe balls of different sizes around the atoms of a residue to determine the direct contact surface area of each pair. This pairwise direct contact area was used to determine the statistical contact area preference of each residue pair, where a contact area preference estimates the sum of an energetic interaction and a structural constraint. A good energy function should, at its minimum, discriminate native structures from decoys.
To test the effectiveness of this new potential, we applied it to several decoy sets and measured its ability to discriminate native structures from decoys. The decoy sets contained from one to hundreds of decoy proteins generated in different ways, and in all cases the potential distinguished native structures from the decoys with about 100% accuracy. The calculated Z-score, a useful measure of the validity of the computed potential, is high for all protein datasets. The knowledge-based mean force derived from pairwise direct contact area discriminates effectively, so it should be useful for protein structure prediction and model refinement.
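The two calculations named in this abstract, a Boltzmann-type conversion of observed contact statistics into energies and a Z-score for native versus decoy energies, can be sketched as follows. The abstract does not give the authors' reference state or their exact Z-score definition, so everything specific here is an assumption: the reference state is the product of per-residue-type area fractions, energies are in units of kT, the Z-score is defined so that a larger positive value means better discrimination, and the names `contact_area_potential` and `z_score` are hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean, pstdev


def contact_area_potential(area):
    """Convert summed pairwise contact areas into energies in units of kT.

    `area` maps an unordered residue-type pair, stored once as a tuple (a, b),
    to the total contact area (in square angstroms) observed in a structure set.
    Assumed reference state: contacts form independently of residue type, so the
    expected area fraction of a pair is the product of the per-type fractions.
    All areas must be positive; sparse data would need pseudocounts.
    """
    total = sum(area.values())
    marginal = defaultdict(float)
    for (a, b), s in area.items():
        marginal[a] += s
        marginal[b] += s
    energy = {}
    for (a, b), s in area.items():
        observed = s / total
        fa = marginal[a] / (2 * total)
        fb = marginal[b] / (2 * total)
        expected = fa * fb * (1 if a == b else 2)
        energy[(a, b)] = -math.log(observed / expected)
    return energy


def z_score(native_energy, decoy_energies):
    """Distance of the native energy below the decoy mean, in decoy standard deviations."""
    return (mean(decoy_energies) - native_energy) / pstdev(decoy_energies)


# Toy data: observed areas equal the reference expectation, so all energies are zero.
toy = contact_area_potential({("A", "A"): 1.0, ("A", "B"): 2.0, ("B", "B"): 1.0})
z = z_score(-10.0, [-4.0, -6.0])
```

With this sign convention a pair that is in contact more often than expected receives a negative (favourable) energy, and a native structure whose energy lies far below the decoy distribution receives a large positive Z-score.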
EVOLUTIONARY DYNAMICS OF CRISPR-CASSETTES VALERY SOROKIN 1, IRENA ARTAMONOVA 2
Keywords: prokaryotic immunity, CRISPR-cassettes, metagenome, evolution
1 M.V. Lomonosov Moscow State University, Russian Federation, [email protected]
2 Vavilov Institute of General Genetics RAS; Kharkevich Institute of Information Transmission Problems RAS, Russian Federation, [email protected]

CRISPRs (Clustered Regularly Interspaced Short Palindromic Repeats) are a new type of prokaryotic anti-phage immunity system. A typical CRISPR system consists of a CRISPR-cassette, which is a chain of almost identical repeats separated by unique spacers, a leader region, and CRISPR-associated genes [1]. We analyzed CRISPR systems in metagenomic sequence data. There are no efficient tools for CRISPR-cassette search in metagenomes: when applied to such data, all three publicly available programs, CRT [2], PILER-CR [3], and CRISPRFinder [4], produce high levels of false positives. To search for CRISPR-cassettes in metagenomes, we therefore developed a filtering procedure based on a combination of these three programs. This procedure was applied to the Sorcerer II [5] metagenome data and yielded 192 reliable cassettes. All cassettes found by at least one of the three tools were collected in a database called MeCRISPR (http://iitp.bioinf.fbb.msu.ru/vsorokin/crispr). The database interface allows browsing and analysis of pre-calculated CRISPR-cassettes and their flanking sequences, and in particular searching against spacers, repeats and metagenomic contigs containing at least one CRISPR-cassette.

We clustered CRISPR-cassettes based on similarity between repeat units. Additional analysis of flanking regions allowed us to distinguish between lateral transfer and parallel evolution of cassettes in related strains. For every group of homologous cassettes, we reconstructed the evolutionary history. We observed that similarities representing phage-related spacers or lateral transfers of cassettes were significantly enriched in metagenome contigs from the same geographical locations. This shows that ongoing phage-host encounters at specific ocean locations involve the CRISPR-mediated response and leave an imprint on the host genome.

We also investigated CRISPR-cassettes in closely related strains of Xanthomonas oryzae. An attempt to construct an experimental system for studying CRISPR systems failed because of an unresolved paradox in two strains, Xo604 and Xo21. A spacer shared by the homologous CRISPR-cassettes of these strains is identical to a sequence of the Xp10 phage and should, theoretically, prevent phage infection in both cases. However, while Xo21 is indeed resistant to this phage, the Xo604 strain is sensitive. We explained this by identifying a mutation in the phage regulatory motif discovered for the Xanthomonas cassettes. A comparative analysis of all known CRISPR-cassettes of Xanthomonas oryzae (five strains) will be presented.

This is joint work with Mikhail S. Gelfand, Konstantin V. Severinov, Mikhail A. Pyatnitskiy, Ekaterina Semenova and Maxin Nagronykh. This work was partially supported by the Russian Foundation of Basic Research (09-04-01098-a) and the Russian Academy of Sciences (programs “Molecular and Cellular Biology” and “Fundamental problems of Oceanology”).
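The abstract does not describe how the outputs of CRT, PILER-CR and CRISPRFinder were combined, so the sketch below shows only one plausible form such a filter could take: keep a predicted cassette when at least `min_support` tools report overlapping coordinates on the same contig. The agreement rule, the data layout and the names `overlaps` and `supported_cassettes` are assumptions made for illustration, not the authors' procedure.

```python
def overlaps(a, b):
    """True if two (contig, start, end) predictions share a contig and overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def supported_cassettes(predictions, min_support=2):
    """Keep predictions confirmed by at least `min_support` different tools.

    `predictions` maps a tool name to a list of (contig, start, end) tuples.
    Each retained prediction is returned together with the sorted list of
    tools that reported an overlapping cassette (including its own tool).
    """
    kept = []
    for tool, hits in predictions.items():
        for hit in hits:
            support = {tool}
            for other, other_hits in predictions.items():
                if other != tool and any(overlaps(hit, h) for h in other_hits):
                    support.add(other)
            if len(support) >= min_support:
                kept.append((hit, sorted(support)))
    return kept


# Hypothetical coordinates for two contigs.
calls = {
    "CRT": [("contig1", 100, 500), ("contig2", 10, 90)],
    "PILER-CR": [("contig1", 120, 480)],
    "CRISPRFinder": [],
}
reliable = supported_cassettes(calls)
```

The same cassette is returned once per tool that reported it; merging those overlapping records into a single interval would be a natural next step.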
1. R. Sorek et al. (2008) CRISPR--a widespread system that provides acquired resistance against phages in bacteria and archaea, Nat. Rev. Microbiol., 6: 181-186.
2. C. Bland et al. (2007) CRISPR recognition tool (CRT): a tool for automatic detection of clustered regularly interspaced palindromic repeats, BMC Bioinformatics, 8: 209.
3. R.C. Edgar (2007) PILER-CR: fast and accurate identification of CRISPR repeats, BMC Bioinformatics, 8: 18.
4. I. Grissa et al. (2007) CRISPRFinder: a web tool to identify clustered regularly interspaced short palindromic repeats, Nucleic Acids Res., 35: W52-7.
5. D.B. Rusch et al. (2007) The Sorcerer II Global Ocean Sampling expedition: northwest Atlantic through eastern tropical Pacific, PLoS Biol., 5: e77.
INVESTIGATING BRANCH POINT SITE CONSENSUS OF HUMAN FEDOR GONCHAROV 1, VLADIMIR BABENKO 2
1 Institute of Cytology and Genetics, Russian Federation, [email protected]
2 Institute of Cytology and Genetics, Russian Federation, [email protected]

Splicing is commonly recognized as one of the ultimate regulation stages of gene expression. In particular, alternative splicing (AS) is a widespread mechanism with an important role in generating the appropriate tissue- and/or stage-specific product from the same gene. On the other hand, one of the key binding sites in the course of spliceosome assembly, the branch point site (BPS), is drastically degenerate in mammals in contrast to intron-poor organisms, e.g. yeast (Gao et al., 2008). We explored the 30 bp branch point region sequences [-50, -20] relative to the 3’ splice site in 28156 human introns. For the analysis, we built a maximum parsimony tree for 7-mers, taking into account the pairwise correlation values of the positions in the 7-mer occurrence distribution. The analysis led to several conclusions. There are several major branch point site consensus sequences in human, which suggests BPS heterogeneity.

1. The most abundant human BPS is represented by the ACTGACG oligonucleotide, which is consistent with (Irimia, Roy, 2008) and differs from that of, e.g., yeast (TACTAAC).
2. The human U2 RNP can bind to the mRNA BPS not only by the canonical GTAGTA site but, in a significant number of cases, by the IIa loop (Pomeranz et al., 2009), which is confirmed by the extensive representation of ATTAAAC as a BPS in human (Henscheid, Voelker, Berglund, 2008).
3. The BPS sequence depends on the intron length: it is closer to canonical in small to moderate introns.
4. Cassette exon-related BPSs 3’ downstream possess significantly lower BPS strength (more mismatches from the major consensus sequences) than those of obligatory exons (p<1e-8).

In metazoan cells, increasing tissue-specific complexity leads to multistage gene regulation in the course of replication, transcription and posttranscriptional phases. It was shown (Irimia, Roy) that intron-rich organisms usually belong to the top hierarchical clade of the organization complexity tree. We believe that branch point redundancy arose as part of the evolution of AS regulation. In particular, a strong BPS does not allow cis-regulatory elements to affect splicing, so a BPS of the canonical type could be regarded as an Intronic Splicing Enhancer (ISE). Conversely, regulated exons lack a strong BPS signal, apparently because of this regulation.
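The first step of the analysis above, taking the window [-50, -20] upstream of the 3’ splice site and tabulating the 7-mers it contains, is easy to illustrate. The sketch below assumes that each intron is given as a 5’-to-3’ string whose last base is the final intronic position before the 3’ splice site; the function names and the toy intron are hypothetical, and the maximum parsimony tree and position correlations used by the authors are not reproduced.

```python
from collections import Counter


def branch_point_window(intron, start=-50, end=-20):
    """Return the 30 bp region between positions -50 and -20 from the intron's 3' end.

    Introns shorter than 50 bp yield an empty string.
    """
    if len(intron) < -start:
        return ""
    return intron[start:end]


def count_kmers(introns, k=7):
    """Count every overlapping k-mer inside the branch point window of each intron."""
    counts = Counter()
    for intron in introns:
        window = branch_point_window(intron.upper())
        for i in range(len(window) - k + 1):
            counts[window[i:i + k]] += 1
    return counts


# Toy intron: 30 filler bases, the yeast-like TACTAAC motif, then 43 more bases.
toy_intron = "G" * 30 + "TACTAAC" + "T" * 23 + "C" * 20
kmer_counts = count_kmers([toy_intron])
```

Each 30 bp window contributes 24 overlapping 7-mers, so ranking the resulting counts over all introns gives a first view of which heptamers dominate the region.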
1. Irimia M, Roy SW. Evolutionary convergence on highly-conserved 3' intron structures in intron-poor eukaryotes and insights into the ancestral eukaryotic genome. PLoS Genet. 2008. 4(8):e100014
2. Gao K, Masuda A, Matsuura T, Ohno K. Human branch point consensus sequence is yUnAy. Nucleic Acids Res. 2008. 36(7):2257-6
3. Henscheid KL, Voelker RB, Berglund JA. Alternative modes of binding by U2AF65 at the polypyrimidine tract. Biochemistry. 2008. 47(1):449-59.
4. Pomeranz Krummel DA, Oubridge C, Leung AK, Li J, Nagai K. Crystal structure of human spliceosomal U1 snRNP at 5.5 A resolution. Nature. 2009. 458(7237):475-80.
GLAUCOMA AND MYOPIA WHOLE GENOME ASSOCIATION STUDY VLADIMIR BABENKO 1, MARINA GUBINA 1, IGOR KULIKOV 1, RUSLAN AITNASAROV 1
Keywords: Illumina 550, SNP analysis, glaucoma, myopia
1 Institute of Cytology and Genetics, Russian Federation, [email protected]

Forty individuals were genotyped with the Illumina 550 SNP array (Illumina, Inc., http://illumina.com) at the “Bioengineering” Center, RAS, Russia. The data comprise 27 healthy individuals, 5 patients with glaucoma and 8 with a myopia diagnosis. All individuals are Caucasians from the Novosibirsk urban region, Russia. The total SNP set comprises more than 340 thousand SNPs. We implemented an SQL database schema of our own design to maintain the sample, and a software suite to analyze it.

Results. We identified 44 target SNPs while analyzing 11 normal and 13 disorder cases, for which the discrepancy between the control and affected sample sets exceeded an empirically chosen significance threshold of 9 genotypes. Using the Haploview software suite (www.hapmap.org), we selected 28 non-redundant unlinked SNPs. Next we scanned the OMIM database (www.ncbi.nlm.nih.gov/omim) for the genes containing the target SNPs. There we identified 5 genes with ‘glaucoma’ and ‘myopia’ as keywords, namely: myocilin (MYOC), optineurin (OPTN), cytochrome P450 family 1 subfamily B (CYP1B1), optic atrophy 1 isoform 8 (OPA1), and WD repeat domain 36 (WDR36). The gene OPA1 (optic atrophy, chromosome 3), which is significantly associated with the target SNPs, is located within a recently identified cluster of genes (MFN1, SOX2OT and PSARL; Andrew T et al., PLoS Genetics, 2008) and has been shown to be associated with myopia. We thus reconfirm the impact of this gene on myopia in the ethnic population considered.
AN EVOLUTIONARY STUDY IN THE GENOMICS OF VERTEBRATE POXVIRUSES IGOR BABKIN 1
Keywords: DNA virus, Poxviridae, Virus evolution, Smallpox history
1 Institute of Cytology and Genetics SB RAS, Russian Federation, [email protected]

Members of the family Poxviridae are the most studied among the known cytoplasmic DNA-containing viruses. According to the accepted taxonomy, they are divided into two subfamilies, Entomopoxvirinae and Chordopoxvirinae; the latter contains eight genera and two unclassified viruses, deer poxvirus and crocodile poxvirus. The members of the Chordopoxvirinae subfamily use two types of evolutionary strategy: Parapoxvirus, Molluscipoxvirus, and crocodile poxvirus accumulate CG sequences in their genomes, whereas the remaining poxviruses accumulate AT sequences [1]. To introduce a time scale into the evolutionary reconstruction, it is necessary to determine the divergence time points for one or several tree nodes. One such constraint is the moment when variola virus (VARV) was exported to the American continent from West Africa in the 16th century [2]. We have earlier discovered the genetic relatedness between the virus strains from these regions [3], which form a separate biological subtype of VARV. This has allowed us to estimate the divergence time points for poxviruses using the Bayesian relaxed clock [4]. We have earlier determined the rates of orthopoxvirus molecular evolution from an analysis of the extended central conserved region of their genomes, and those of AT-rich poxviruses from the nucleotide sequences of virus RNA polymerase subunits.

The goal of this study was to reconstruct the evolutionary history of the vertebrate poxviruses with AT-rich genomes by Bayesian relaxed clock analysis of a large set of highly conserved, vitally important genes of these viruses. For this analysis, we selected only highly conserved genes with similar evolutionary rates, namely 35 genes encoding proteins involved in transcription, DNA replication, and the system of S–S bond formation. The accumulation rate of nucleotide substitutions was 1–6 × 10⁻⁶ substitutions per site per year. The Bayesian time estimates indicate that the modern viruses of the genus Avipoxvirus diverged from the progenitor approximately 283 ± 102 thousand years ago. Presumably, the progenitor virus of the modern mammalian poxviruses had a wide range of sensitive hosts and specialized to different organisms in the course of evolution. The progenitor of the genus Orthopoxvirus was the first to diverge, approximately 171 ± 55 thousand years ago. The progenitor of the genus Leporipoxvirus separated next, about 136 ± 44 thousand years ago. This genus contains the viruses inducing tumors in rabbits, hares, and squirrels. The next to diverge was the progenitor of the genus Yatapoxvirus, the representatives of which induce benign tumors in primates. The progenitor of three ungulate virus genera (Capripoxvirus, Suipoxvirus, and the recently discovered unclassified deerpox virus) appeared 107 ± 36 thousand years ago. VARV diverged from its progenitor, shared with camelpox and taterapox viruses, 5.8 ± 1.4 thousand years ago. However, we have earlier performed a more reliable calculation based on the extended central conserved region of orthopoxvirus genomes, which estimated the time of independent VARV evolution as 3.4 ± 0.8 thousand years [4]. This dating of the VARV progenitor to 3–4 thousand years ago demonstrates that VARV is a comparatively young virus.

This work was supported by the Russian Foundation for Basic Research (project no. 08-04-00443-a).
1. Moss B. (1996) Poxviridae: The viruses and their replication, In: Fields Virology, Fields B.N. et al. (Eds.), 2637-2671 (Philadelphia: Lippincott-Raven Publishers).
2. Fenner F., Henderson D.A. et al. (1988) Smallpox and its Eradication. Geneva: World Health Organization, 1460 p.
3. Babkina I.N., Babkin I.V. et al. (2004) Phylogenetic comparison of the genomes of different strains of variola virus, Dokl. Biochem. Biophys., 398: 316-319.
4. Babkin I.V., Shchelkunov S.N. (2008) Molecular evolution of poxviruses, Genetika, 44: 1029-1044.
DOSAGE COMPENSATION AND DEMASCULINIZATION OF X CHROMOSOMES IN DROSOPHILA DORIS BACHTROG 1, NICHOLAS TODA 1, STEVEN LOCKTON 1
Keywords: sex chromosomes, demasculinization
The X chromosome of Drosophila shows a deficiency of genes with male- biased expression, while mammalian X chromosomes are enriched for both spermatogenesis genes expressed pre-meiosis and multi-copy testis genes. Meiotic X inactivation and sexual antagonism can only partly account for these patterns. Here, we show that dosage compensation in Drosophila contributes substantially to the depletion of male genes on the X. To equalize expression of X-linked genes between the sexes, male Drosophila hyper-transcribe their single X, while female mammals silence one of their two X chromosomes. By combining fine-scale mapping-data of dosage compensated regions in D. melanogaster with genome-wide expression profiles, we demonstrate that the dosage compensation machinery directly limits further up-regulation of X- linked genes in males. As a result, most male-biased genes on the X chromosome are located outside dosage compensated regions. We also show that dosage compensation in Drosophila contributes to gene trafficking of male-genes off the X. Thus, while natural selection operates more efficiently on the hemizygous X chromosome in males, dosage compensation prevents the emergence of male genes on the Drosophila X. Conversely, since base-line levels of X-linked transcription are identical in male and females, no sex- specific restriction on gene regulation exists and selection can act to masculinize the X in mammals. The vastly different mechanisms of dosage compensation can therefore help to explain X-chromosomal gene content differences between mammals and Drosophila.
1 University of California Berkeley, United States, [email protected]
CODON SIZE REDUCTION AS THE ORIGIN OF THE TRIPLET GENETIC CODE. PAVEL BARANOV 1, MAXIME VENINE 2, GREGORY PROVAN 2
The genetic code appears to be optimized in its robustness to missense errors and frameshift errors [1-3]. In addition, the genetic code is near optimal in terms of its ability to carry information in addition to the sequences of encoded proteins [4]. As evolution has no foresight, optimality of the genetic code suggests its evolutionary origin as opposed to an accidental origin. The length of codons in the genetic code is also optimal, as three is the minimal nucleotide combination allowing encoding of the twenty standard amino acids. The apparent impossibility of transitions between codon sizes in a discontinuous manner during evolution has resulted in an unbending view that the genetic code was always triplet. Yet recent experimental evidence on quadruplet decoding [5-8], as well as the discovery of organisms with ambiguous [9, 10] and dual decoding [11], suggests that the possibility of the evolution of triplet decoding from living systems with non-triplet decoding merits reconsideration and further exploration. We designed a mathematical model of the evolution of primitive digital organisms capable of decoding nucleotide sequences into protein sequences. These organisms are allowed to evolve their nucleotide sequences via genetic events of Darwinian evolution, such as point mutations. The replication rates of such organisms depend on the accuracy of generated protein sequences. Computer simulations based on our model show that decoding systems with codons of length greater than three spontaneously evolve into predominantly triplet decoding systems. Our findings suggest a plausible scenario for the evolution of the triplet genetic code in a continuous manner. This scenario suggests an explanation of how protein synthesis could be accomplished by means of long RNA-RNA interactions prior to the emergence of complex decoding machinery, such as the ribosome, that is required for stabilization and discrimination of otherwise weak triplet codon-anticodon interactions.
1 Biochemistry Department, University College Cork, Ireland, [email protected]
2 Computer Science Department, University College Cork, Ireland, [email protected], [email protected]
1. T.Maeshiro, M.Kimura (1998) The role of robustness and changeability on the origin and evolution of genetic codes, Proc Natl Acad Sci U S A, 95:5088-5093. 2. S.J.Freeland et al. (2000) Early fixation of an optimal genetic code, Mol Biol Evol, 17:511-518. 3. A.S.Novozhilov et al. (2007) Evolution of the genetic code: partial optimization of a random code for robustness to translation error in a rugged fitness landscape, Biol Direct, 2:24. 4. S.Itzkovitz, U.Alon (2007) The genetic code is nearly optimal for allowing additional information within protein-coding sequences, Genome Res, 17:405-412. 5. D.L.Riddle, J.Carbon (1973) Frameshift suppression: a nucleotide addition in the anticodon of a glycine transfer RNA, Nat New Biol, 242:230-234. 6. B.Moore et al. (2000) Quadruplet codons: implications for code expansion and the specification of translation step size, J Mol Biol, 298:195-209. 7. T.J.Magliery, J.C.Anderson, P.G.Schultz (2001) Expanding the genetic code: selection of efficient suppressors of four-base codons and identification of "shifty" four-base codons with a library approach in Escherichia coli, J Mol Biol, 307:755-769. 8. J.C.Anderson, T.J.Magliery, P.G.Schultz (2002) Exploring the limits of codon and anticodon size, Chem Biol, 9:237-244. 9. A.C.Gomes et al. (2007) A genetic code alteration generates a proteome of high diversity in the human pathogen Candida albicans, Genome Biol, 8:R206. 10. I.Miranda et al. (2007) A genetic code alteration is a phenotype diversity generator in the human pathogen Candida albicans, PLoS ONE, 2:e996. 11. A.A.Turanov et al. (2009) Genetic code supports targeted insertion of two amino acids by one codon, Science, 323:259-261.
TOWARD UNIVERSAL MALIGNOMETER: GENOME-WIDE EXPRESSION PATTERNS AS COMPOSITE BIOMARKERS GANIRAJU MANYAM 1, ALESSANDRO GIULIANI 2, ANCHA BARANOVA 3
Keywords: global patterns of gene expression, attractor, tumorigenesis, expression dynamics
Abstract To date, most high-throughput gene expression studies have focused on elucidating gene signatures that discriminate cell phenotypes. On the other hand, a given cell type can be represented as a dynamic system occupying a specific position in the multidimensional phase space spanned by all expressed genes. In terms of dynamics, this specific position is called an “attractor”, i.e. a “stable” position characterized by a specific pattern of gene expression levels that determines the particular type of cell differentiation. Some studies have indicated that the differentiation destinies of progenitor cells can be defined as high-dimensional attractor states of the underlying molecular networks. A possible middle ground between discriminating signatures and entire expression landscapes may be described as a combination of attractor-like behavior with some local “vantage points” represented by the genes most sensitive to dynamical changes of the system. Affymetrix microarray datasets were extracted from the NCBI Gene Expression Omnibus. We analyzed the following two categories of datasets: A) datasets describing paired normal and tumor tissue samples collected from the same individual; B) datasets describing a group of normal and a group of tumor samples collected from the same tissue type across a number of subjects. The global and specific expression distances (Dglobal and Dspecific) were calculated based on all transcripts on the chip and on the transcripts significantly differentially expressed according to the Mann-Whitney test, respectively. The distances between the expression profiles of two biological samples were estimated using Pearson correlation coefficients. In all studied datasets, on average, tumors were further away from the Normal Sample Space than the paired samples with normal histology. Interestingly, this observation held only when distances were calculated using Dglobal.
Surprisingly, similarly calculated distances for Normal samples from the Normal Space defined by Dspecific did not differ significantly, mostly due to larger variations in the expression of cancer-specific genes in the normal samples. In all datasets, mean (Dglobal) distances from individual normal samples to the Normal Space were correlated with mean (Dglobal) distances from individual tumor samples (R=0.9236, p <= 0.00186). Principal Component Analysis (PCA) provided, for the first time, a quantitative estimation of the relative importance of global and local features of the gene expression regulation landscape in the process of tumor development. The remarkable behavioral invariance we observed in eighteen independent tumor data sets gives robust support to the dynamical picture of cell populations.
1 George Mason University, United States, [email protected]
2 Istituto Superiore di Sanità, Italy
3 George Mason University, United States
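The abstract states only that distances between expression profiles were estimated from Pearson correlation coefficients. The sketch below shows one plausible reading, assuming the distance is 1 - r and that the distance of a sample to the Normal Space is the mean distance to a set of reference normal profiles. Both choices, the function names and the toy numbers are assumptions made for illustration, not the authors' procedure.

```python
import numpy as np

def pearson_distance(x, y):
    """Distance between two expression profiles, assumed here to be 1 - Pearson r."""
    r = np.corrcoef(x, y)[0, 1]
    return 1.0 - r

def distance_to_normal_space(sample, normals):
    """Mean Pearson distance from one sample to each reference normal profile."""
    return float(np.mean([pearson_distance(sample, n) for n in normals]))

# Toy data (hypothetical): 3 normal profiles over 5 transcripts and one tumor-like profile.
normals = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [1.1, 2.1, 2.9, 4.2, 5.1],
    [0.9, 1.8, 3.2, 3.9, 4.8],
])
tumor = np.array([5.0, 1.0, 4.0, 2.0, 3.0])

d_tumor = distance_to_normal_space(tumor, normals)
# Leave-one-out distances: each normal sample is excluded from its own reference set.
d_normals = [
    distance_to_normal_space(normals[i], np.delete(normals, i, axis=0))
    for i in range(len(normals))
]
print(d_tumor, d_normals)
```

Under these assumptions, a Dspecific-style distance would be obtained by passing only the columns of the differentially expressed transcripts to the same functions.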
MATHEMATICAL MODELLING OF CELL-FATE DECISION NETWORKS EMMANUEL BARILLOT 1, LAURENCE CALZONE 2, SIMON FOURQUET 3, LAURENT TOURNIER 4, ANDREI ZINOVYEV 5, DENIS THIEFFRY 6
Keywords: systems biology, apoptosis, cell-fate decision, death receptors
Engagement of death domain receptors such as TNFR1 or Fas can trigger cell death by apoptosis or necrosis, or lead to the activation of pro-survival signaling pathways such as NF-κB. Our study aims at identifying determinants of this cell fate decision process. Apoptosis represents a tightly controlled mechanism of cell death that is triggered by overwhelming stress conditions or external death signals, and results in vacuolization of cellular content followed by its phagocyte-mediated elimination. It is a physiological process that regulates cell homeostasis, development, and clearance of damaged, virus-infected or cancer cells. Necrosis results in plasma membrane disruption and release of intracellular content that can trigger inflammation in the neighboring tissues. Long seen as an accidental cell death, necrosis can also be a regulated process, possibly involved in the clearance of virus-infected or cancer cells that escaped apoptosis.
Modeling of these pathways could help identify in which conditions and how the cell chooses between different types of cellular deaths or survival. Moreover, modeling could suggest ways to re-establish the apoptotic death when it is altered. The decision process appears to be very complex: it integrates many intertwined signaling pathways and the molecular interactions controlling this process are regulated by multiple positive and negative feedback loops. Mathematical modeling provides a good tool to understand and analyse the dynamical behaviours of such complex systems.
1 Institut Curie, Mines ParisTech, INSERM U900, France, [email protected]
2 Institut Curie, Mines ParisTech, INSERM U900, France, [email protected]
3 Institut Curie, Mines ParisTech, INSERM U900, France, [email protected]
4 Institut Curie, Mines ParisTech, INSERM U900, France, [email protected]
5 Institut Curie, Mines ParisTech, INSERM U900, France, [email protected]
6 Faculté des Sciences de Luminy, Université de la Méditerranée, France, [email protected]
For that purpose, based on the literature, we established a generic influence network that includes the main species that participate in cell fate decision in response to death signals (mediated by Fas and TNF). A first annotated version of this “master” model was built in a discrete framework. An initial study was performed on the steady states: eight different clusters of steady states that correspond to the expected cellular phenotypes were identified. This result constitutes a first validation of the proposed structure of the network.
In order to propose a more refined dynamical analysis, we suggested a reduction of the model preserving the same dynamical properties. We went from 22 variables in the “master” model to 11 variables in the reduced version. Thanks to this reduction, the realistic asynchronous updating strategy could be used and qualitative simulations were performed. In particular, the computation of all discrete trajectories starting from specific initial conditions allowed us to identify the corresponding “reachable” phenotypes in the case of TNF and Fas-induced signals, for the wild-type and mutant models. The mutants mostly fit the expected behaviours and suggested some improvements in the “master” model.
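The asynchronous updating strategy and the search for reachable steady states can be illustrated with a very small discrete model. The sketch below is a toy three-variable network with mutual inhibition between a survival node and a death node. The rules and variable names are hypothetical and far simpler than the 22-variable "master" model or its 11-variable reduction.

```python
from itertools import product

# Hypothetical toy rules (not the authors' model); a state is (TNF, NFkB, CASP).
RULES = (
    lambda s: s[0],               # TNF: external input, held constant
    lambda s: s[0] and not s[2],  # NFkB: activated by TNF, blocked by CASP
    lambda s: s[0] and not s[1],  # CASP: activated by TNF, blocked by NFkB
)

def successors(state):
    """Asynchronous update: flip exactly one variable whose rule disagrees with its value."""
    out = []
    for i, rule in enumerate(RULES):
        target = int(bool(rule(state)))
        if target != state[i]:
            out.append(state[:i] + (target,) + state[i + 1:])
    return out

def steady_states():
    """All states with no asynchronous successor."""
    return [s for s in product((0, 1), repeat=len(RULES)) if not successors(s)]

def reachable_steady_states(start):
    """Explore every asynchronous trajectory from start and collect the steady states reached."""
    seen, stack, fixed = {start}, [start], set()
    while stack:
        state = stack.pop()
        nxt = successors(state)
        if not nxt:
            fixed.add(state)
        for t in nxt:
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return fixed

print(steady_states())
print(reachable_steady_states((1, 0, 0)))
```

In this toy network, a TNF signal applied to the resting state can end in either the survival-like or the death-like steady state, depending on which variable happens to be updated first.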
This work is supported by the APO-SYS EU FP7 project and the authors of the work are members of the team "Systems Biology of Cancer," Equipe labellisée par la Ligue Nationale Contre le Cancer.
CONSERVATIVE REGIONS OF PROTEINS EVOLVE UNDER STRONGER POSITIVE SELECTION GEORGII BAZYKIN 1, ALEXEY KONDRASHOV 2
Positive selection, i.e. natural selection that promotes change, is usually assumed to play a larger role in evolution of rapidly evolving sequences than in evolution of slowly evolving sequences. We use the McDonald-Kreitman test [1] to study how the strength of positive selection in segments of coding sequences in the divergence of Drosophila simulans and D. melanogaster depends on the overall evolutionary conservation of the segment between Drosophila species. The fraction of amino acid positions evolving under positive selection in the most conserved sites is twice as high as in the least conserved sites. The analysis of pairs of substitutions in adjacent nucleotide sites within a codon [2] reveals that the clumping of substitutions, indicative of positive selection, is also strongest in the most conserved segments. By making use of the dense phylogeny of Drosophila species with complete genomes sequenced, we ascertain the distribution of the evolutionary times between the substitutions, as well as the strength of the selection coefficients favoring the second substitution in each pair. In conserved segments, the average second substitution occurred under selection that accelerated evolution by a factor of 20. Together, our results indicate that strong positive selection within conservative regions is an important component of adaptive evolution.
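The abstract does not say how the fraction of positively selected amino acid changes was computed from the McDonald-Kreitman table. A commonly used estimator built on that table is alpha = 1 - (Ds * Pn) / (Dn * Ps), where Dn and Ds are nonsynonymous and synonymous fixed differences and Pn and Ps are the corresponding polymorphisms. The sketch below applies it to made-up counts. Whether the authors used this exact estimator is an assumption.

```python
def mk_alpha(dn, ds, pn, ps):
    """Estimated fraction of amino acid substitutions driven by positive selection.

    dn, ds: nonsynonymous / synonymous fixed differences between species
    pn, ps: nonsynonymous / synonymous polymorphisms within species
    """
    return 1.0 - (ds * pn) / (dn * ps)

# Hypothetical counts for one class of segments, for illustration only.
alpha = mk_alpha(dn=20, ds=40, pn=5, ps=40)
print(alpha)
```

Computing alpha separately for segments binned by conservation would give one value per bin that could then be compared across bins.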
1. J. McDonald, M. Kreitman (1991) Adaptive protein evolution at the Adh locus in Drosophila, Nature, 351:652–654. 2. G. Bazykin et al. (2004) Positive selection at sites of multiple amino acid replacements since rat–mouse divergence, Nature, 429:558–562.
1 Institute for Information Transmission Problems of the Russian Academy of Sciences (Kharkevich Institute), Russian Federation, [email protected]
2 Life Sciences Institute and Department of Ecology and Evolutionary Biology, University of Michigan, United States, [email protected]
MODELLING AND STABILITY ANALYSIS OF INTERCONNECTED REGULATORY CYCLES MAHSA BEHZADI 1, MIREILLE REGNIER 1, LAURENT SCHWARTZ 1, JEAN-MARC STEYAERT 1
Keywords: Systems biology, ordinary differential equations, enzymatic reactions, stability analysis, cycle oscillations, equilibria.
Biochemical reactions are continually taking place in all living organisms. The complexity of biochemical and biological processes is such that the development of computer models is often essential in trying to understand the phenomenon under consideration. Our aim is to build a generic framework with which one could simulate the behavior of complex systems of interconnected regulatory cycles. For the simulation of a biological system we use the traditional reaction-rate approach by means of equations describing the system. In this approach, chemical reactions are modelled by ordinary differential equations (ODEs) representing the variations of the concentrations of the substances. In each of the differential equations we express the kinetics of one reactant as a sum of fractional terms for enzymatic reactions and non-fractional terms for simple reactions. Once the model is constructed, we aim to study the various modes of cell behaviour according to the concentrations of the relevant enzymes in enzymatic reactions. Since stable and unstable equilibria play different roles in the dynamics of a system, it is useful and important to be able to classify equilibrium points based on their stability, and this is what we are able to do by simulation and also by mathematical study. By stability analysis, given an equilibrium we can first determine whether it is a stable point or not; furthermore, through a mathematical study we are able to find the stability and instability regions by changing one or several parameters. As a first try we have constructed a model for the central part of the system of GlyceroPhosphoLipid metabolism in the human cell. The model comprises enzymatic reactions of PhosphatidylEthanolamine (PtdEth) and PhosphatidylCholine (PtdCho) [1, 2].
Given the experimentally observed values of metabolite concentrations (Ci), we have managed to find appropriate parameter values (Pi) which allow us to completely describe the system with a set of ordinary differential equations (ODEs). Our analysis of this model demonstrates that, with these parameter values, the system has a stable solution. Moreover, we investigated the possibility that a change in parameter values could give an unstable or oscillating solution. For that purpose we studied the system mathematically over a large range of values and proved that the solution is always stable and without oscillations regardless of the parameter values of the system. We have also applied our method to the cell division cycle model, i.e. the well-known interactions of the proteins cdc2 and cyclin. A mathematical model was already constructed by John Tyson [3], who used numerical integration (carried out using Gear's algorithm) for simulation and stability analysis of the model. We studied this system of interactions and, using our approach based on the analysis of the eigenvalues of the linearized system, confirmed the nature of the results for the same parameter values. We currently use this approach to study the stability of a complex metabolic network containing several interconnected regulatory cycles such as glycolysis, the Krebs cycle, the phospholipid pathway and amino acids.
1 Bioinformatics group, LIX, Ecole Polytechnique, Palaiseau, 91128, France [email protected], [email protected], [email protected], [email protected]
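The stability classification described in this abstract rests on the eigenvalues of the system linearized around an equilibrium: the point is locally stable when all eigenvalues of the Jacobian have negative real parts, and nonzero imaginary parts indicate oscillatory behaviour. The sketch below applies this to a toy two-metabolite system with one fractional (Michaelis-Menten) term and one linear term. The equations, parameter values and function names are assumptions chosen for illustration, not the authors' glycerophospholipid model.

```python
import numpy as np

VMAX, KM, K_IN, K_OUT = 1.0, 0.5, 0.4, 0.3  # hypothetical parameter values

def rhs(c):
    """Toy system: constant influx into A, enzymatic conversion A -> B, linear loss of B."""
    a, b = c
    conv = VMAX * a / (KM + a)  # fractional term for the enzymatic reaction
    return np.array([K_IN - conv, conv - K_OUT * b])

def jacobian(f, c, eps=1e-6):
    """Central finite-difference Jacobian of f at the point c."""
    c = np.asarray(c, dtype=float)
    n = c.size
    jac = np.zeros((n, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = eps
        jac[:, j] = (f(c + step) - f(c - step)) / (2 * eps)
    return jac

def classify(f, c_eq):
    """Return (stable, oscillatory, eigenvalues) for the linearization at c_eq."""
    eig = np.linalg.eigvals(jacobian(f, c_eq))
    stable = bool(np.all(eig.real < 0))
    oscillatory = bool(np.any(np.abs(eig.imag) > 1e-9))
    return stable, oscillatory, eig

# Equilibrium of the toy system, solved by hand from rhs(c) = 0.
c_eq = np.array([K_IN * KM / (VMAX - K_IN), K_IN / K_OUT])
stable, oscillatory, eig = classify(rhs, c_eq)
print(stable, oscillatory, eig)
```

Repeating the classification while sweeping one parameter gives a numerical picture of stability and instability regions, which complements an analytical study of the kind the abstract reports.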
1. Henry, S. A., and Patton-Vogt, J. L. (1998) Prog. Nucleic Acids Res. Mol. Biol. 61, pp. 133-179. 2. R. Sundler, B. Akesson (1975) Biochem. J. 146, pp. 309-315. 3. J. Tyson (1991) Cell Biology, Vol 88, pp. 7328-7332.
INVOLVEMENT OF PROTEIN-PROTEIN INTERACTIONS IN COMPOSITE ELEMENTS DETECTION ALEXANDER A. BELOSTOTSKY 1, VSEVOLOD Y. MAKEEV 1
Keywords: transcription factor, transcription factor binding site, composite element, protein-protein interaction
A composite element (CE) is a group of transcription factor binding sites (TFBSs) located near each other in a statistically significant number of cases. Detection of CEs is a crucial task in understanding transcription regulation. Many methods exist for predicting CEs from the co-occurrence of different sites in a set of regulatory sequences. In some cases these methods take into account the score of every site in a CE and the distance between sites [1, 2]. In other cases only the co-occurrence of different sites in some large genomic region is used, but the search is performed over a set of co-regulated genes [3, 4], sometimes with conservation estimation added [5]. All these methods share one disadvantage: they are based on known CEs. Here we present a method for CE prediction that uses experimentally determined protein-protein interactions, namely interactions between transcription factors (TFs). This method allows us to bring some structural aspects into CE detection. This source of experimental data is independent of previously examined CEs. In our approach we simply count CEs that contain sites of TFs able to interact with each other. We searched for groups of sites: sites of a particular TF (the TF of interest) and sites of TFs capable of interacting with the TF of interest. These sites must score above a threshold and be located no further than a given distance from each other. We tested the idea on a set of Hif1-dependent genes with experimentally determined Hif1 TFBSs. The names of these genes and the positions and sequences of the Hif1 sites were taken from TransFac. Our objective was to predict experimentally determined Hif1 sites as part of a predicted CE. As predicted CEs we selected those having sites of Hif1 itself with sites of TFs capable of interacting with Hif1 in close vicinity. We set thresholds for sites
constituting the predicted CE and for the distance between sites. We compared the results of our prediction with the predictions of the programs TFM-Explorer, Cluster-Buster, MSCAN and DiRE. Of the 18 genes in TransFac with experimentally determined Hif1 sites in their upstream regions, we found 12. This can be compared with 3 genes found by DiRE, 6 genes by Cluster-Buster, 3 genes by TFM-Explorer and 0 genes by MSCAN. The advantage of our method is that it uses a short list of TFs selected from known TF-TF interactions to search for all possible combinations of sites constituting a CE. Surprisingly, taking into account conservation estimated by phastCons negatively affected the sensitivity and specificity of Hif1 prediction. We are grateful to Dmitry Malko for help in programming. This study has been supported by Russian Fund of Basic Research project 07-04-01623.
1 State Research Institute of Genetics and Selection of Industrial Microorganisms, GosNIIGenetika, Moscow, [email protected]
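The selection rule described in this abstract (sites of the TF of interest, plus sites of TFs known to interact with it, all scoring above a threshold and lying within a given distance) can be sketched as follows. The data structure, function name, partner names, scores and thresholds are all hypothetical. The abstract does not give the actual thresholds or the list of Hif1 partners.

```python
from collections import namedtuple

Site = namedtuple("Site", "tf pos score")

def predict_composite_elements(sites, tf_of_interest, partners, min_score, max_dist):
    """Pair each strong site of the TF of interest with nearby strong sites of its partners."""
    strong = [s for s in sites if s.score >= min_score]
    anchors = [s for s in strong if s.tf == tf_of_interest]
    others = [s for s in strong if s.tf in partners]
    elements = []
    for anchor in anchors:
        near = [o for o in others if abs(o.pos - anchor.pos) <= max_dist]
        if near:  # a CE needs at least one interacting partner site nearby
            elements.append((anchor, near))
    return elements

# Hypothetical predicted sites in one upstream region.
sites = [
    Site("HIF1", 120, 0.92),
    Site("TF_A", 150, 0.88),
    Site("TF_B", 135, 0.60),   # score below the threshold
    Site("HIF1", 900, 0.95),   # no partner site within max_dist
    Site("TF_C", 2000, 0.90),
]
partners = {"TF_A", "TF_B", "TF_C"}  # stand-in for TFs with a known interaction with the TF of interest
ces = predict_composite_elements(sites, "HIF1", partners, min_score=0.8, max_dist=100)
print(ces)
```

A gene would then count as found when at least one of its experimentally determined sites appears as the anchor of a predicted CE, which is one way to read the evaluation reported above.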
1. Shelest, E., et al. (2003) Prediction of potential C/EBP/NF-kappaB composite elements using matrix-based search methods, In Silico Biol, 3(1-2): p. 71-9. 2. Kel-Margoulis, O.V., et al. (2002) TRANSCompel: a database on composite regulatory elements in eukaryotic genes, Nucleic Acids Res, 30(1): p. 332-4. 3. Kel, A., et al. (2006) Composite Module Analyst: a fitness-based tool for identification of transcription factor binding site combinations, Bioinformatics, 22(10): p. 1190-7. 4. Waleev, T., et al. (2006) Composite Module Analyst: identification of transcription factor binding site combinations using genetic algorithm, Nucleic Acids Res, 34(Web Server issue): p. W541-5. 5. Gotea, V. and I. Ovcharenko (2008) DiRE: identifying distant regulatory elements of co-expressed genes, Nucleic Acids Res, 36(Web Server issue): p. W133-9.
STUDYING THE IMPACT OF GENE COPY NUMBER VARIATIONS ON GENE EXPRESSION VIA A GENE REGULATION NETWORK SYLVAIN BLACHON 1, CARITO GUZIOLOWSKI 1, GAUTIER STOLL 2, GAELLE PIERRON 3, STELLY BALLET 3, FRANCK TIRODE 3, OLIVIER DELATTRE 3, EMMANUEL BARILLOT 2, ANDREI ZYNOVIEV 2, ANNE SIEGEL 1, OVIDIU RADULESCU 4
During tumorigenesis, the DNA repair machinery is perturbed. As a result, genomic aberrations arise and may deeply affect tumor cell physiology. It has been partially demonstrated that an increase in gene copy number induces higher expression, but this effect is less clear for small genomic modifications. To study it, we propose a systems biology approach that enables the integration of CGH and expression data together with an influence graph derived from biological knowledge. This work is based on 3 concepts. 1. Studying inter-individual variations in gene copy number and in expression makes it possible to grasp tumor variability and ultimately addresses the problem of individual-centered therapeutics. 2. Confronting post-genomic data with known regulations is a good way to check the soundness and limits of current knowledge. 3. The abstraction level of qualitative modeling allows integration of heterogeneous data sources. We tested this approach using data on two tumor types: Ewing tumors and bladder tumors. It allowed the definition of new biological hypotheses that were assessed by random permutation of the initial data sets.
1 INRIA, Centre Inria Rennes - Bretagne Atlantique, 263, avenue du General Leclerc, Campus de Beaulieu, 35042 Rennes Cedex, France, [email protected]
2 Institut Curie Bioinformatics Group, Institut Curie, Service BIOINFORMATIQUE, 26 rue d'Ulm, 75248 PARIS cedex 05, France
3 Genetics and biology of paediatric tumors and sporadic breast cancers - Institut Curie / Inserm Unit 830, 26 rue d'Ulm 75248 Paris cedex 05, France
4 IRMAR, UMR CNRS 6625, Campus de Beaulieu, bâtiments 22 et 23, 263 avenue du Général Leclerc, CS 74205, 35042 RENNES Cédex, France
USING SVM AND A MEASURE OF MOTIF ‘SURPRISE’ TO DISTINGUISH REGULATORY DNA RENE TE BOEKHORST 1, IRINA ABNIZOVA 2, FEDOR NAUMENKO, IVAN KULAKOVSKI 3, WERNISCH LORENZ 4
Motivation and Aim. There are still no satisfactory computational methods to reliably recognize regulatory DNA. Assuming that the main biological and statistical “signature” of regulatory regions is the presence of multiple regulatory motifs, we aim to identify motifs that contribute significantly to the separation of coding (C), regulatory (R) and non-coding non-regulatory (N) DNA.
Methods and Algorithms We use unsupervised pattern recognition (cluster analysis) to back up the performance and to visualize the results of a supervised method (Support Vector Machine). These methods were applied to a new feature representation of DNA sequences. The feature set is a 4^k-dimensional vector whose elements measure how likely each k-mer is in comparison to a model assuming nucleotide independence, and thus how “surprising” a k-mer is (i.e. its degree of over- or under-representation). We subjected the feature set to a hierarchical test procedure that first distinguishes coding from non-coding sequences, and in a next step separates regulatory regions from non-coding non-regulatory DNA.
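The feature vector described above can be sketched as a table of Z-scores, one per k-mer, comparing the observed count with the count expected if nucleotides were independent. The abstract does not give the exact formula, so the version below uses a simple binomial approximation that ignores correlations between overlapping occurrences. The function name and the variance formula are assumptions.

```python
from collections import Counter
from itertools import product
from math import sqrt

def kmer_surprise(seq, k=3):
    """Z-score of every k-mer count against an independent-nucleotide model."""
    seq = seq.upper()
    n = len(seq) - k + 1  # number of k-mer start positions
    base_freq = {b: seq.count(b) / len(seq) for b in "ACGT"}
    counts = Counter(seq[i:i + k] for i in range(n))
    scores = {}
    for word in map("".join, product("ACGT", repeat=k)):
        p = 1.0
        for b in word:
            p *= base_freq[b]  # probability of the word under independence
        expected = n * p
        var = n * p * (1.0 - p)  # binomial approximation, overlaps ignored
        scores[word] = (counts[word] - expected) / sqrt(var) if var > 0 else 0.0
    return scores

# Toy sequence: a periodic stretch followed by a run of A.
seq = "ACGT" * 10 + "A" * 10
features = kmer_surprise(seq, k=3)
print(len(features), features["AAA"], features["CCC"])
```

The resulting dictionary has 4^k entries. Ordering its values by k-mer gives the fixed-length vector that a classifier such as an SVM would take as input.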
Data The positive training set is a collection of experimentally verified functional Drosophila melanogaster regulatory regions (enhancers) [Nazina & Papatsenko, BMC Bioinformatics 22 (2003)]. The two other (negative training) sets are: (i) 60 randomly picked Drosophila exons, and (ii) 60 randomly picked Drosophila non-coding, non-regulatory (NCNR) sequences.
Results The SVM separated coding DNA (C) very well from other DNA types (R, N), with an overall accuracy of 97% at the first step. The second step predicted regulatory DNA with a 95% overall accuracy.
1 University of Hertfordshire, United Kingdom, [email protected]
2 Wellcome Trust Sanger Institute, United Kingdom, [email protected]
3 University of Moscow, Russian Federation
4 MRC Biostatistics Unit Institute of Public Health, United Kingdom
K-means cluster analysis (K=3) resulted in a cluster mainly composed of coding regions and two non-coding clusters of which the smallest is dominated by regulatory regions. Tests for the association between type of DNA (C, R, N) and cluster membership are highly significant (χ² = 129.16, df = 4, p = 5.89E-27). Also a hierarchical cluster analysis (Euclidean Distances, Ward’s method) clearly distinguished between coding and regulatory regions. One cluster contains only 5 of all 60 coding regions, whereas the second virtually lacks regulatory regions (2 out of 60). A hierarchical cluster analysis of words on sequences resulted in a main cluster containing all the low entropy words (AAA, CCC, GGG and TTT), 70% of the self-repetitive words and about half (56%) of all the 24 intermediate entropy words. The other cluster is made up of the remaining intermediate and all the high entropy words and 67% of the palindromes. Combining the dendrogram of sequences with the dendrogram of words showed that: i) regulatory sequences stand out by either over- or underrepresented words; ii) overrepresented words tend to be of low entropy whereas underrepresented ones are mostly of high entropy. The motifs characteristic for regulatory DNA tend to be biologically important fragments of known TFBS. We stress the hitherto overlooked importance of underrepresented motifs.
Comparison with other methods Our methodology outperforms SVM applications based on string [Leslie et al., Pac. Symp. Biocomput. (2002)] and mismatch kernels [Leslie et al., Adv. Neural Inf. Process. Syst, 20 (2003)]. The latter worked well for the detection of functionally similar proteins, but achieved no more than about 50% accuracy when we applied them to our data. Boeva et al. [Algorithms for Molecular Biology (2007)] developed an algorithm for computing the probability (p-value) that s different, possibly overlapping, motifs occur respectively k1, ..., ks or more times. When we used p-values calculated for the Drosophila data as the input for SVM, we obtained almost the same specificity, sensitivity and accuracy as for our Z-scores.
SEARCH FOR DEGENERATE TANDEM REPEATS IN NUCLEOTIDE SEQUENCES. THEIR POSSIBLE ROLE IN REGULATION OF GENE EXPRESSION. V. BOEVA 1, V.J. MAKEEV 2, M. REGNIER 3
During the last decade many experiments have demonstrated that degenerate tandem repeats occur in regulatory regions and play a role in the regulation of gene expression [1, 2]. However, recent work shows high mutability of tandem repeats located in regulatory regions, even between closely related species [3]. Hence the hypothesis arises that, for the regulation of gene activity, the presence of a tandem repeat itself is important rather than the specific motif sequence. The program SWAN [4] was written to search for degenerate tandem repeats in DNA sequences. Its advantages are the possibility to set a minimal significance level for repeats and the calculation of the statistical significance of all tandem repeats found. In addition, SWAN returns a single result file with a table containing all necessary information about the tandem repeats found, which is easy to process with Excel or Perl. Using SWAN we analyzed frequencies of degenerate tandem repeats in the complete genome of D. melanogaster as well as in various functional regions such as coding and regulatory regions, intergenic regions and heterochromatin. The frequency of degenerate tandem repeats in the X chromosome was found to be about 1.5 times greater than in autosomes. This agrees with the result obtained in [5] that frequencies of exact tandem repeats with period lengths from 1 to 4 are also higher in the X chromosome. We analyzed frequencies of degenerate tandem repeats of each period in annotated loci of D. melanogaster (Fig 1). Periods divisible by 3 are significantly abundant in coding regions. Apparently this is induced by some regulatory structure of coded proteins, e.g. poly(Ala) chains. Interestingly, tandem repeats with periods 6, 7 and 8 occur more frequently in non-coding regions of loci, especially in regulatory ones, than in intergenic regions.
We suppose that this is caused by partial destabilization of the double helix (each turn of which is about 10.2 bp), which facilitates transcription factor binding. This hypothesis is corroborated by the fact that repeats with periods divisible by 5, which should stabilize the double helix under our hypothesis, are overrepresented in heterochromatin regions of D. melanogaster. By definition this DNA is not transcribed and stays in a condensed state. The authors thank Andrey Mironov, Natal’ya Esipova and Nika Oparina for fruitful discussions. This work has been supported by the project EcoNet-08159PG and RFBR 04-04-49601.
1 Moscow State University, Vorob'evy Gory, Moscow, Russia, [email protected]
2 State Center GosNIIGenetika, Moscow, Russia, [email protected]
3 INRIA Rocquencourt, France, [email protected]
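The notion of a degenerate tandem repeat with a given period can be made concrete with a simple self-comparison score: the fraction of positions i at which the base equals the base one period further along. The sketch below is not SWAN's algorithm, which also evaluates the statistical significance of each repeat. It is a minimal stand-in, with hypothetical function names, that shows how a dominant period can be read off a sequence that contains mismatches.

```python
def period_match_fraction(seq, period):
    """Fraction of positions i with seq[i] == seq[i + period]."""
    n = len(seq) - period
    if n <= 0:
        return 0.0
    matches = sum(1 for i in range(n) if seq[i] == seq[i + period])
    return matches / n

def best_period(seq, max_period=13):
    """Period in 1..max_period with the highest self-match fraction."""
    return max(range(1, max_period + 1),
               key=lambda p: period_match_fraction(seq, p))

# Toy repeat: six copies of a 7-bp unit with one substitution (degenerate repeat).
exact = "ACGTTGA" * 6
degenerate = exact[:10] + "C" + exact[11:]
print(best_period(degenerate), period_match_fraction(degenerate, 7))
```

The range 1 to 13 matches the periods shown on the axis of Fig 1. Sliding such a score along a genome in windows would give a crude per-period coverage profile.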
[Fig 1. Coverage (vertical axis, 0 to 0.07) by degenerate tandem repeats as a function of repeat period (horizontal axis, 1 to 13) for intergenic regions, coding regions, regulatory regions, spacers in loci, heterochromatin and a random sequence.]
1. Ott RW, Hansen LK. (1996) Repeated sequences from the Arabidopsis thaliana genome function as enhancers in transgenic tobacco. Mol Gen Genet., 252(5), 563-71. PMID: 8914517 2. Antoniewski C, Mugat B, Delbac F, Lepesant JA. (1996) Direct repeats bind the EcR/USP receptor and mediate ecdysteroid responses in Drosophila melanogaster. Mol Cell Biol., 16(6), 2977-86. PMID: 8649409. 3. Sinha S. and Siggia E.D. (2005) Sequence turnover and tandem repeats in cis-regulatory modules in Drosophila. MBE, published online on January 19, 2005. 4. V.A. Boeva, M. Regnier, V.J. Makeev (2004) SWAN: searching for highly divergent tandem repeats in DNA sequences with evaluation of their statistical significance. Proceedings of the JOBIM'2004, Montreal, Canada, 2004. 5. Mukund V. Katti, Prabhakar K. Ranjekar and Vidya S. Gupta (2001) Differential Distribution of Simple Sequence Repeats in Eukaryotic Genome Sequences. Molecular Biology and Evolution 18:1161-1167.
APPLICATION OF THE COMPUTER PROGRAM ROSETTA FOR THE PROTEIN STRUCTURE INTERPRETATION FROM TRITIUM PLANIGRAPHY TECHNIQUE DATA: M1 PROTEIN OF INFLUENZA VIRUS A ELENA BOGACHEVA 1, ALEXEY CHULICHKOV 1, ALEXEY DOLGOV 1, ALEKSANDR SHISHKOV 1, ILIYA KUZMIN 2, LIDIA NEFEDOVA 2 , LUDMILA BARATOVA 3
Keywords: protein, spatial structure, tritium planigraphy, computer simulation
Determining the spatial structure of proteins remains a highly topical problem, especially when the proteins are part of multicomponent biological complexes such as viruses. The matrix protein M1, which underlies the membrane, is the major structural component of influenza A virus (about 1100-3000 copies per virion). The atomic structure of the N-terminal two thirds of the M1 protein was solved at acidic and neutral pH [1]. However, the spatial structure of M1 in a membrane environment remains to be understood. Tritium planigraphy provides data on the steric accessibility of hydrocarbon fragments of a macromolecule, which is directly connected with its spatial structure and reflects the «architecture» of the object [2, 3]. The tritium label is introduced through single collisions of tritium atoms with the target protein. The label distribution in the investigated object is usually analyzed at the level of individual amino acids, which is achieved by fragmenting the tagged proteins into short peptides with various proteases. This procedure makes it possible to determine the relative exposure of amino acid residues to tritium, gives detailed information on the structure of the surface, and allows preliminary conclusions concerning the packing of residues in the macromolecule [4]. We have developed a computer algorithm that imitates the anisotropic conditions of bombardment of proteins in a membrane surrounding by a beam of “hot” tritium atoms, properly accounting for the orientation of the protein molecule relative to the membrane surface.
The first working model of the spatial structure of the M1 protein as a component of influenza virus is proposed. This model is based on the data obtained by tritium labeling of intact virions and free M1 protein, theoretical prediction of the secondary structure of the C-terminal domain of M1, and application of the developed computer algorithm. The experimental and theoretical data obtained by tritium bombardment and the simulation algorithm were compared with the Rosetta prediction of the three-dimensional structure of the C-domain [5]. The clusters with the best correlation between the methods were selected. The combined approach substantially reduced the number of hypothetically possible spatial structures of the C-domain. Analysis of the Rosetta algorithms has shown that experimental tritium planigraphy data can be used for more accurate construction of 3D structures.
This work was partially supported by the Russian Foundation for Basic Research (09-03-00469, 09-04-01160) and International Science and Technology Center (BTEP#82/ISTC#2816).
1 N.N. Semenov Institute of Chemical Physics, Russian Academy of Sciences, ul. Kosygina, 4, Moscow, 119991 Russia, e-mail: [email protected]
2 Biology faculty, Moscow State University, Russian Federation
3 Belozersky Institute of Physico-Chemical Biology of Moscow State University, Leninskie Gory, 1, Moscow, 119992 Russia, [email protected]
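The selection of structures by agreement between experiment and simulation can be illustrated with a minimal sketch. It assumes that each candidate model has a simulated per-residue labeling profile and that agreement is measured by the Pearson correlation coefficient; the abstract does not specify the correlation measure, and the function names, model names and numbers below are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rank_models(experimental, simulated_by_model):
    """Sort (model name, correlation) pairs by decreasing correlation."""
    scores = {name: pearson(experimental, profile)
              for name, profile in simulated_by_model.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical per-residue tritium label levels (arbitrary units).
experimental = [0.9, 0.1, 0.5, 0.8, 0.2]
# Hypothetical simulated profiles for two candidate structures.
simulated = {
    "model_a": [0.8, 0.2, 0.4, 0.9, 0.1],  # similar exposure pattern
    "model_b": [0.1, 0.9, 0.5, 0.2, 0.8],  # inverted exposure pattern
}
ranking = rank_models(experimental, simulated)
```

In this toy input `model_a` ranks first; with real data the profiles would come from the bombardment simulation applied to each predicted structure.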
1. B.Sha, M.Luo (1997) Structure of a bifunctional membrane-RNA binding protein, influenza virus matrix protein M1. Nat. Struct. Biol. 4:239–244. 2. L.A.Baratova, E.N.Bogacheva, V.I.Goldanskii, V.A.Kolb, A.S.Spirin, A.V.Shishkov (1999) Tritium planigraphy of biological macromolecules. Moscow.: Nauka, 175p. 3. E.N.Bogacheva, V.I.Goldanskii, A.V.Shishkov, A.V.Galkin and L.A.Baratova (1998) Tritium planigraphy: from the accessible surface to the spatial structure of a protein. Proc. Natl. Acad. Sci. USA. 95:2790–2794. 4. A.V.Shishkov, E.N.Bogacheva (2007) Tritium planigraphy of biological macromolecules. In: Methods in Protein Structure and Stability Analysis: Conformational Stability, Size, Shape and Surface of Protein Molecules. Eds. V.N.Uversky and E.A.Permyakov, 317–353 (N.-Y.: Nova Science Publishers). 5. K.M.Misura, D.Chivian, C.A.Rohl, D.E.Kim, D.Baker (2006) Physically realistic homology models built with Rosetta can be more accurate than their templates. Proc. Natl. Acad. Sci. USA, 103:5361–5366.
FSDETECTOR: FRAMESHIFT PREDICTION IN PROTEIN CODING SEQUENCES BY THE VITERBI ALGORITHM IVAN ANTONOV 1, MARK BORODOVSKY 2
In 2005 the 454 Life Sciences company released a new machine that sequences 400-600 megabases of DNA per 10-hour run. The innovation revolutionized sequencing technology. The new method is 100 times faster and much cheaper than the previously used Sanger capillary sequencing. For these reasons a number of genome projects have by now switched to 454 pyrosequencing and similar next-generation sequencing platforms. Due to the nature of pyrosequencing, the 454 method is prone to errors at homopolymer locations. Even with high average coverage, errors in finished sequences are likely to occur more frequently than with the “old” sequencing techniques. An insertion or deletion of one or two nucleotides inside a protein-coding region causes a frameshift and results in a wrong annotation of the gene, or even in part of the gene being missed. It is highly desirable to detect frameshift errors as early as possible and to resequence regions with predicted errors before the genome sequence is released to the public.
Here we present a new method, called FSdetector, to predict frameshifts in protein-coding regions. FSdetector can be applied to a nucleotide sequence that contains intronless protein-coding regions. Thus, the method is applicable to prokaryotic genomic sequences, to sequences from fungal genomes with intronless genes, or to clustered EST sequences. The method works in two steps. In the first step the gene finding program GeneMarkS [1] is used to identify genes in the given DNA sequence. On encountering a gene with a frameshift, GeneMarkS predicts two genes in different frames. These two putative genes, located in the same strand, appear as overlapping or adjacent genes. In the second step all DNA regions containing predicted overlapping and adjacent genes are selected. Each region is analyzed by FSdetector to identify a possible frameshift.
The algorithm design is centered around a Hidden Markov Model (HMM) of a genomic region that could be a genuine gene overlap, a pair of adjacent genes, or a gene with a frameshift (Fig. 1). States 1, 2 and 3 correspond to the three possible “global” frames of reading the genetic code in the given strand. States designated as i/j, where i=1,2,3 and j=1,2,3, indicate gene overlap regions, with i indicating the frame of the upstream gene and j the frame of the downstream gene. The colors of the start and stop codon states are indicative of their global frames. A direct transition from one coding state to another is possible only as a frameshift. A genuine gene overlap is identified by a transition between two coding states that passes through the overlap states (i/j type); adjacent genes are connected through the non-coding state (N/C). The maximum likelihood path through the model for a given sequence is found by the Viterbi algorithm. If this path includes a direct transition between coding states, a frameshift is predicted.
[Fig. 1. HMM designed for FSdetector. States shown: coding states 1, 2, 3; overlap states 1/2, 1/3, 2/1, 2/3, 3/1, 3/2; non-coding state N/C; start codon and stop codon states.]
In accuracy tests of FSdetector on the whole Escherichia coli genomic sequence, with frameshifts introduced randomly into annotated genes, we observed 76.3% sensitivity (Sn) and 73.3% specificity (Sp). It should be noted that the Sn and Sp values were obtained for the 2nd order sequence model with HMM parameters chosen by initial heuristics. The initial settings leave ample room for further improvement; in the conference presentation we will therefore discuss the method with further improvements, generalizations and applications to various species.
1 Division of Computational Science and Engineering, Georgia Institute of Technology, 801 Atlantic Drive, Atlanta, GA, USA 30332-0280, [email protected]
2 Department of Biomedical Engineering and Division of Computational Science and Engineering, Georgia Institute of Technology, 313 Ferst Drive, Atlanta, GA, USA 30332-0535, [email protected]
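The decision rule described above (decode the most likely state path, then look for a direct transition between coding states) can be sketched on a toy model. The sketch below is not the FSdetector HMM: it assumes only two coding states `A` and `B` (standing for two reading frames), one non-coding state `N`, a three-letter observation alphabet standing in for frame-specific signals, and invented probabilities. All names and numbers are hypothetical.

```python
from math import log

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for obs (log-space Viterbi)."""
    score = [{s: log(start_p[s]) + log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        score.append({})
        back.append({})
        for s in states:
            # Best predecessor of state s at time t.
            prev = max(states, key=lambda p: score[t - 1][p] + log(trans_p[p][s]))
            score[t][s] = (score[t - 1][prev] + log(trans_p[prev][s])
                           + log(emit_p[s][obs[t]]))
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

def frameshift_positions(path, coding_states):
    """Indices where the path jumps directly from one coding state to another."""
    return [t for t in range(1, len(path))
            if path[t - 1] in coding_states and path[t] in coding_states
            and path[t - 1] != path[t]]

# Toy parameters (invented for illustration only).
STATES = ["A", "B", "N"]
START = {"A": 0.4, "B": 0.4, "N": 0.2}
TRANS = {
    "A": {"A": 0.90, "B": 0.02, "N": 0.08},  # A -> B is the rare "frameshift"
    "B": {"A": 0.02, "B": 0.90, "N": 0.08},
    "N": {"A": 0.10, "B": 0.10, "N": 0.80},
}
EMIT = {
    "A": {"a": 0.8, "b": 0.1, "n": 0.1},
    "B": {"a": 0.1, "b": 0.8, "n": 0.1},
    "N": {"a": 0.1, "b": 0.1, "n": 0.8},
}

shifted = viterbi("aaabbb", STATES, START, TRANS, EMIT)
adjacent = viterbi("aaannnbbb", STATES, START, TRANS, EMIT)
shift_calls = frameshift_positions(shifted, {"A", "B"})
adjacent_calls = frameshift_positions(adjacent, {"A", "B"})
```

With these toy inputs the first sequence is decoded with a direct `A` to `B` jump (a predicted frameshift), while the second is decoded through `N`, which corresponds to a pair of adjacent genes.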
1. J. Besemer et al. (2001) GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Nucleic Acids Res., 29: 2607-18.
AUTOMATIC TOOL TO DESCRIBE STRUCTURE OF RELIABLE BLOCKS IN A MULTIPLE ALIGNMENT OF PROTEIN SEQUENCES BORIS BURKOV 1, BORIS NAGAEV 2, SERGEI SPIRIN 3, ANDREI ALEXEEVSKI 4
Keywords: multiple alignment of protein sequences, blocks detection
To extract useful information from a multiple alignment of protein sequences, an expert first needs to distinguish between reliably aligned parts and parts where no relevant alignment can be detected at the sequence level. The former parts of the alignment can be verified by 3D structure comparison (if structures are available). The latter parts may correspond, for example, to differently located loops of the proteins, so the alignment in those parts makes no sense. Most alignment programs do not take this fact into consideration. A number of tools facilitating multiple alignment analysis are currently available. They are implemented in alignment editors and visualization servers (e.g. Jalview [1] or T-coffee [2]). They do not seem to cover all alignment features of interest. We created a tool for autOmatic Partition of a given multiple ALignment (OPAL) into so-called blocks. A block is a part of the alignment defined by a continuous series of positions within a subfamily of sequences. Blocks are divided into two groups: blocks of reliable alignment (plus-blocks) and blocks of senseless or unreliable alignment (minus-blocks). The output of the main program is the sets of plus- and minus-blocks. Plus- and minus-blocks together cover the whole alignment, and blocks may not intersect. The OPAL_vis module displays all blocks of the alignment, allowing navigation through the blocks and visualization of each block in the context of the alignment. The algorithm iteratively repeats a procedure that finds one plus-block within an analyzed block. First the procedure is applied to the entire alignment. The plus-block found may be either full-width in the alignment or full-width in the subalignment defined by a cluster of sequences.
If a plus-block is found, it is stored in the output data and the remaining parts of the analyzed block are analyzed by the same procedure. Otherwise, the input is considered a minus-block. Special criteria of block reliability were developed and implemented. The algorithm was implemented in the OPAL_cut program. To test OPAL_cut on multiple alignments of proteins with solved 3D structures, the OPAL_test module was created. For each plus-block found by OPAL_cut, its so-called geometrical core [3] is determined. If the geometrical core comprises the whole block or a significant part of it, the reliability of the block is considered to be supported by 3D data. The OxBench benchmark alignment database [4] was used as a source of structural data for a large-scale test. The test showed that 90% of plus-blocks are supported by structural data (the geometrical core comprises >80% of the block's positions). The OPAL package can be useful for expert analysis of large protein alignments and for a number of other purposes, such as refinement of multiple alignments, assessment of the performance of multiple alignment programs, identification of subfamilies and reconstruction of phylogeny.
We are grateful to Elena Lukina for help in preparing 3D superimpositions and structural alignments and to Daniil Alexeevski for helpful hints and assistance. The work is partly supported by the RFBR-DFG grants 07-04-91560 and 08-04-91975.
1 Moscow State University, Russian Federation, [email protected]
2 Moscow State University, Russian Federation, [email protected]
3 Belozersky Institute, Moscow State University, Russian Federation, [email protected]
4 Belozersky Institute, Moscow State University, Russian Federation, [email protected]
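The recursive scheme (find one plus-block, store it, repeat on what remains, otherwise declare a minus-block) can be sketched in a simplified one-dimensional form. The sketch below is an illustration under strong assumptions, not OPAL_cut itself: it splits the alignment only by columns and ignores sequence subfamilies, and its reliability criterion (a column is reliable if one residue fills at least 80% of it) is a hypothetical stand-in for the criteria developed by the authors, which the abstract does not describe.

```python
def column_is_reliable(column, min_identity=0.8):
    """Hypothetical criterion: the most common residue fills >= min_identity of the column."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return False
    top = max(residues.count(r) for r in set(residues))
    return top / len(column) >= min_identity

def find_plus_block(alignment, start, end, min_len=3):
    """Longest run of reliable columns in [start, end), or None if shorter than min_len."""
    best, run_start = None, None
    for i in range(start, end + 1):
        # i == end acts as a sentinel that closes a run reaching the block edge.
        reliable = i < end and column_is_reliable([row[i] for row in alignment])
        if reliable and run_start is None:
            run_start = i
        elif not reliable and run_start is not None:
            if best is None or i - run_start > best[1] - best[0]:
                best = (run_start, i)
            run_start = None
    if best is None or best[1] - best[0] < min_len:
        return None
    return best

def partition(alignment, start, end, plus, minus):
    """Recursively split columns [start, end) into plus- and minus-blocks."""
    if start >= end:
        return
    block = find_plus_block(alignment, start, end)
    if block is None:
        minus.append((start, end))
        return
    plus.append(block)
    partition(alignment, start, block[0], plus, minus)  # part left of the block
    partition(alignment, block[1], end, plus, minus)    # part right of the block

# Hypothetical toy alignment: conserved ends, unreliable middle.
toy = ["MKVLAGT-QWERTY",
       "MKVLSA-PQWERTY",
       "MKVL-CDEQWERTY"]
plus_blocks, minus_blocks = [], []
partition(toy, 0, len(toy[0]), plus_blocks, minus_blocks)
```

On the toy alignment the two conserved ends become plus-blocks and the gapped middle becomes a single minus-block; the blocks cover all columns and do not intersect, as required in the text.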
1. M.Clamp et al. (2004) The Jalview Java alignment editor, Bioinformatics, 20(3):426-7. 2. O.Poirot, E.O'Toole and C.Notredame (2003) Tcoffee@igs: a web server for computing, evaluating and combining multiple sequence alignments Nucleic Acids Research, 31(13): 3503-3506. 3. M.Gribkov et al. (2004) Life Core, the program for classification of 3D structures of macromolecules Biofizika, 48(1):157-166 4. G.P.Raghava et al. (2003) OXBench: A benchmark for evaluation of protein multiple sequence alignment accuracy. BMC Bioinformatics, 4:47
EVOLUTION OF SIGNAL PEPTIDE APPEARANCE/DISAPPEARANCE IN BACTERIAL GENOMES NADEZHDA BYKOVA 1, ANDREJ MIRONOV 1
Keywords: signal peptide
Introduction. A signal peptide is a 15-30 amino acid sequence at the N-terminus of a protein that directs it to the export pathway from the cytoplasm. In previous work we have shown that the presence of a signal peptide is not conserved within clusters of orthologous genes, and that this is not only due to errors of prediction programs (unpublished data). In the present work we studied evolutionary events of signal peptide appearance and disappearance in such clusters. We have found evidence of both ancient and recent events. We also tried to characterize the clusters and genomes in which these events are overrepresented. One of the important outcomes of this work is a list of recent signal peptide appearances. We suggested that signal peptide appearance is preceded by gene duplication, so we also studied clusters and genomes rich in pairs of paralogs in which one protein has a signal peptide and the other does not. The most active were some symbiotic and pathogenic bacteria, and there were slight differences even between strains of the same species, for example between pathogenic and non-pathogenic strains. This shows a correlation between their adaptation requirements and high rates of signal peptide appearance. All the data, including tree pictures and signal/non-signal paralogs, are available at http://www.bioinf.fbb.msu.ru/SignalWeb/
Materials and Methods.
1) Protein clusters were downloaded from the NCBI Protein Clusters database [1]. We considered only clusters that contain more than 8 proteins.
2) For signal peptide prediction we used SignalP 3.0-NN [2] with the standard thresholds.
3) We also corrected annotation errors in suspicious pairs of proteins (identity > 70% and different signal peptide predictions) by pairwise alignment followed by a search for signal peptides within 150 bp upstream of the start of the local alignment.
Results. Of the 37863 clusters we analysed, 25471 were predicted as completely non-signal and 2168 as completely signal, so 27% of the clusters have potential appearance/disappearance events. For our purpose we took only clusters that contain at least 3 proteins with and 3 proteins without a predicted signal peptide. After the correction of gene starts, we analysed the distribution of predicted signal peptides on the evolutionary trees of such clusters using the Events Number value and the E (economy) value. We found a significant number of relatively ancient divergences (see Table 1). An example is the deaminase cluster (PRK06846), where the divergence happened at the level of Gram-positive/Gram-negative bacteria.
1 Department of Bioengineering and Bioinformatics, Moscow State University, GSP-2, building 73, Leninskiye Gory, Moscow, 119992, Russia, [email protected]
Table 1. Signal peptide appearance/disappearance events

Events number   All clusters   Clusters with signal/non-signal paralogs
1               889            141
2               1327           248
3               995            177
>3              964            270
Total           4175           836
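The cluster counts quoted in the Results, and the rule for keeping a cluster for tree analysis, can be checked with a short sketch. The per-protein boolean representation and the function names are assumptions made for illustration; only the three counts are taken from the text.

```python
def classify_cluster(has_signal):
    """Classify a cluster from per-protein predictions (True = signal peptide predicted)."""
    n_signal = sum(has_signal)
    if n_signal == 0:
        return "non-signal"
    if n_signal == len(has_signal):
        return "signal"
    return "mixed"

def passes_filter(has_signal, min_each=3):
    """Keep clusters with at least min_each signal and min_each non-signal proteins."""
    n_signal = sum(has_signal)
    return n_signal >= min_each and len(has_signal) - n_signal >= min_each

# Counts reported in the Results.
total, all_non_signal, all_signal = 37863, 25471, 2168
mixed = total - all_non_signal - all_signal
mixed_percent = 100 * mixed / total

# Hypothetical cluster of nine proteins, four of them with a predicted signal peptide.
example = [True, True, True, True, False, False, False, False, False]
example_class = classify_cluster(example)
example_kept = passes_filter(example)
```

The 10224 mixed clusters amount to about 27% of all clusters, in agreement with the figure given in the text.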
On the other hand, we also found recent events that are unlikely to be prediction errors, because the signal peptide is deleted in one of the sequences (and there is a stop codon immediately before the start of the local alignment). One example is the disappearance of the signal peptide in a cluster of endo-1,4-D-glucanase (PRK11097; catalyzes the hydrolysis of 1,4-beta-D-glucosidic linkages in cellulose, lichenin and cereal beta-D-glucans) in 4 strains of Yersinia pestis and in Yersinia pseudotuberculosis IP 32953, while it is still present in Y. enterocolitica and in all other members of this cluster. We conclude that signal peptide appearance/disappearance events are relatively fast and that some symbiotic/pathogenic bacteria use this feature for their adaptation, as can be seen by comparing pathogenic and non-pathogenic strains (for example, the pathogenic strain Escherichia coli O157:H7 has 5 additional clusters with diverged signal/non-signal paralogs compared with E. coli K12).
Acknowledgements. Howard Hughes Medical Institute [grant number 55005610]; the Program “Molecular and Cellular Biology” of the Russian Academy of Sciences; and Russian Foundation of Basic Research [grants number 09-04-92742, 07-04-91555].
1. Klimke W. et al. (2009) The National Center for Biotechnology Information's Protein Clusters Database, Nucleic Acids Res., 37(Database issue): D216–23. 2. Bendtsen J.D. et al. (2004) Improved prediction of signal peptides: SignalP 3.0, J. Mol. Biol., 340:783-795.
A STATISTICAL METHOD FOR PWM CLUSTERING SOLENNE CARAT 1,2 , REMI HOULGATTE 1, JEREMIE BOURDON 2
Introduction. Motif discovery is a fundamental problem in molecular biology. It has important applications in the study of regulatory signals and in the discovery of transcription factor binding sites. Several motif discovery tools have been proposed (see [1] for a complete review). They all extract significant motifs from sets of sequences. Nevertheless, motif discovery for complex organisms is still a challenge. It is thus worthwhile to exploit the specificities of each discovery tool, run with different parameters, to extract several putatively interesting motifs. Doing this requires dealing with redundant motifs, which must be removed. Here, we propose a method for comparing several motifs given by their PSSM (Position Specific Scoring Matrices). This method automatically detects periodic motifs and redundant motifs. It is also possible to compare a final set of motifs with public databases [2,3]. Palindromic motifs can also be detected easily with this method.
Methods. Our method is based on the comparison of PSSM. The use of PSSM, rather than PWM, is justified by the exactness of the content, while PWM may require pseudo-count adaptation. These PSSM can be constructed easily from the output of any motif discovery tool. All matrices are compared pairwise. Reverse complements are also taken into account. For each pair of motifs (m,n), the comparison is done between m and all possible shifts of n. Shifts make it possible to detect overlapping (imbricated) motifs. The specificity of our comparison method is that it is performed only on bases whose frequencies are above a given threshold, for example the background frequency. This limits the effect of noise on the comparison. Finally, a Chi-square test is used to compare the two frequency distributions. This comparison method makes it possible to detect a periodic motif, such as the tandem repeat GC, by comparing a PSSM to itself with a lag of 2 bases. If the two PSSM are similar, the motif is periodic (Fig. 1.1).
In the same vein, comparing a PSSM with its reverse complement makes it possible to determine whether it is a palindrome (Fig. 1.2).
1 Institut du thorax, INSERM U915, Nantes, {Solenne.Carat,Remi.Houlgatte}@univ-nantes.fr
2 LINA, CNRS UMR6241, Nantes, [email protected]
Fig. 1 : Comparison of several motifs
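The comparison scheme of the Methods (all shifts, both orientations, only bases above a threshold) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: a motif is represented as a list of per-position frequency dictionaries, the per-column score is a chi-square-like distance rather than a full Chi-square test with a p-value, and the threshold, tolerance and all function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def from_consensus(seq, major=0.97):
    """Hypothetical helper: build a frequency matrix from a consensus string."""
    minor = (1 - major) / 3
    return [{b: (major if b == c else minor) for b in "ACGT"} for c in seq]

def reverse_complement(pssm):
    return [{COMPLEMENT[b]: f for b, f in col.items()} for col in reversed(pssm)]

def column_distance(p, q, threshold=0.25):
    """Chi-square-like distance over bases above the threshold in either column."""
    bases = [b for b in "ACGT" if p[b] > threshold or q[b] > threshold]
    return sum((p[b] - q[b]) ** 2 / (p[b] + q[b]) for b in bases)

def shifted_distance(m, n, shift, min_overlap=2):
    """Mean column distance between m[i] and n[i - shift]; None if overlap is too short."""
    pairs = [(m[i], n[i - shift]) for i in range(len(m)) if 0 <= i - shift < len(n)]
    if len(pairs) < min_overlap:
        return None
    return sum(column_distance(p, q) for p, q in pairs) / len(pairs)

def best_match(m, n):
    """Smallest distance over all shifts of n and both orientations."""
    best = None
    for orientation, cand in (("+", n), ("-", reverse_complement(n))):
        for shift in range(-(len(cand) - 1), len(m)):
            d = shifted_distance(m, cand, shift)
            if d is not None and (best is None or d < best[0]):
                best = (d, shift, orientation)
    return best

def is_periodic(m, lag, tol=0.05):
    """Compare the motif to itself shifted by lag positions."""
    d = shifted_distance(m, m, lag)
    return d is not None and d <= tol

def is_palindromic(m, tol=0.05):
    """Compare the motif to its own reverse complement."""
    d = shifted_distance(m, reverse_complement(m), 0)
    return d is not None and d <= tol

gc_repeat = from_consensus("GCGCGC")
nested = best_match(from_consensus("TTGACA"), from_consensus("GACA"))
```

Here `gc_repeat` is detected as periodic with a lag of 2 bases, as in the GC tandem repeat example of the text, and `nested` reports that the shorter motif matches the longer one at a shift of 2 positions.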
Optimizations. Many parts of the processing are largely independent. It is thus possible to take advantage of modern computer architectures (multicore computers, clusters, grids) by parallelizing these parts of the computation. This greatly reduces the time needed to obtain a full result.
Discussion. Motif comparison makes it possible to detect periodic and palindromic motifs and to identify, through public databases, the transcription factors that recognize them. Moreover, by grouping similar motifs, it is possible to generate consensus motifs that correspond to a larger number of sequences, and to reduce the number of motifs to be studied.
1. G.K. Sandve, F. Drablos (2006), Biology Direct, 1:11
2. A. Sandelin et al. (2004), Nucleic Acids Res., 32:D91-94
3. V. Matys et al. (2006), Nucleic Acids Res., 34:D108-110
CONSTRUCTION AND HETEROLOGICAL EXPRESSION IN E. COLI OF THE DELETION DERIVATIVES OF THE CYANOBACTERIUM SYNECHOCYSTIS SP. PCC 6803 DRGA GENE AND ITS HYBRIDS WITH GFP REGINA CHAKHIRIDIS 1, VERA GRIVENNIKOVA 1, ELENA MURONETS 1, KIRILL TIMOFEEV 1, IRINA ELANSKAYA 1, VIKTORIYA TOPOROVA 2, ALEXEI NEKRASOV 2, DMITRY DOLGIKH 2
Keywords: Cyanobacteria, NAD(P)H:quinone oxidoreductase, nitroreductase, electron transport
The soluble NAD(P)H:quinone oxidoreductase encoded by the drgA gene of the cyanobacterium Synechocystis sp. PCC 6803 is involved in NADPH oxidation and is responsible for cell sensitivity to nitroaromatic inhibitors as well as for resistance to the oxidative stress inducer menadione [1]. The DrgA protein is responsible for peroxide reduction in the Fenton reaction [2] and participates in the regulation of photosynthetic and respiratory electron transport in cyanobacterial thylakoid membranes [3]. The protein sequences of DrgA from Synechocystis sp. PCC 6803 and of its homologues from other microorganisms were aligned and studied for their information content by analysis of the Shannon-Weaver informational entropy computed as a function of the distance between amino acid residues [4-6]. Sites with an increased degree of information coordination between residues (IDIC-sites) were identified. Associations of information-coordinated structural elements (IDIC-trees and IDIC-branches) were mapped. The coding sequence of the drgA gene was amplified by PCR. To study the functional topology of DrgA, several new deletion derivatives of the drgA gene (drgA ∆1, drgA ∆2, and drgA ∆3) were constructed by PCR. To facilitate protein purification, we fused the 3’-ends of all genes with a 12xHis tag coding sequence. For visualization of DrgA, the genes encoding the green fluorescent proteins (GFP) cherry or egfp were placed between the drgA and 12xHis tag coding sequences. Several constructs for direct constitutive and inducible intracellular expression in E. coli of drgA and its deletion variants were designed and investigated. The recombinant proteins were purified by IMAC chromatography. The enzyme activity of DrgA was tested. The purified DrgA-12His protein exhibited high quinone reductase and nitroreductase activity. The rate of re-reduction of the photooxidized Photosystem I reaction center increased after addition of DrgA-12His protein and NADPH to isolated cyanobacterial thylakoid membranes. Thus, the DrgA protein may participate in electron transfer from NADPH to the plastoquinone pool in thylakoid membranes of the cyanobacterium Synechocystis sp. PCC 6803.
1 Faculty of Biology, M.V. Lomonosov Moscow State University, Moscow 119991, Leninskie Gory, 1-12; tel. (495)9391179, fax (495)9392957, [email protected]
2 Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, Russian Academy of Sciences, Miklukho-Maklaya 16/10, Moscow, 117997, Russia, tel. (495)3306983, fax (495)3357103, [email protected]
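The entropy measure used by the authors is defined in refs. [4-6] and is not reproduced in the abstract. As a rough illustration of what an entropy "computed as a function of the distance between residues" can look like, the sketch below computes the Shannon entropy of residue pairs separated by a given distance along one sequence. This particular definition, the function names and the toy sequence are assumptions and should not be read as the IDIC method.

```python
from collections import Counter
from math import log2

def pair_entropy(sequence, distance):
    """Shannon entropy (bits) of residue pairs (s[i], s[i + distance])."""
    pairs = Counter(zip(sequence, sequence[distance:]))
    total = sum(pairs.values())
    return -sum((n / total) * log2(n / total) for n in pairs.values())

def entropy_profile(sequence, max_distance):
    """Pair entropy for every distance from 1 to max_distance."""
    return {d: pair_entropy(sequence, d) for d in range(1, max_distance + 1)}

# Toy two-letter sequence with period 2.
profile = entropy_profile("ABABABAB", 3)
```

For the strictly alternating toy sequence, pairs at distance 2 are always identical residues and take only two equally frequent values, so the entropy at that distance is exactly 1 bit.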
The work was supported by RFBR grant 09-04-01119.
1. Elanskaya I.V., Chesnavichene E.A., Vernotte C., and Astier C. (1998) Resistance to nitrophenolic herbicides and metronidazole in the cyanobacterium Synechocystis sp. PCC 6803 as a result of the inactivation of a nitroreductase-like protein encoded by drgA gene. FEBS Letters, 428: 188-192. 2. Takeda, K., Iizuka, M., Watanabe T., Nakagawa, J., Kawasaki, S., and Niimura Y. (2007) Synechocystis DrgA protein functioning as nitroreductase and ferric reductase is capable of catalyzing the Fenton reaction. FEBS J., 274: 1318-1327. 3. Matsuo M., Endo T., and Asada K. (1998) Isolation of a novel NAD(P)H- quinone oxidoreductase from the cyanobacterium Synechocystis PCC 6803. Plant Cell Physiol., 39: 751-755. 4. Nekrasov A.N. (2002) Entropy of Protein Sequences: an Integral Approach. Journal of Biomolecular Structure & Dynamics, 20: 87-92. 5. Rogov S.I., Nekrasov A.N. (2001) A Numerical Measure of Amino Acid Residues Similarity Based on the Analysis of their Surroundings in Natural Protein Sequences. Protein Engineering, 14: 459-463. 6. Nekrasov A.N. (2004) Analysis of Information Structure of Protein Sequences: A New Method for Analyzing the Domain Organization of Proteins. Journal of Biomolecular Structure & Dynamics, 21: 615-623.
ROLE OF GATA4 AND NKX2-5 IN CONGENITAL HEART DEFECTS OF INDIAN POPULATION: A PRELIMINARY REPORT ANBARASAN CHAKRAPANI 1, ASHOK KUMAR MANICKARAJA 1, CHERIAN K. M 1, SOMA GUHATHAKURTA 1, VIJAYA M NAYAK 1
Congenital heart disease (CHD) is a structural abnormality of the heart that is present at birth, even if it is discovered much later. The burden of CHD in India is quite high, with a prevalence rate of 2%. A number of studies have identified GATA-4, Nkx2-5, and Tbx5 among the candidate genes causing CHD. The zinc finger transcription factor GATA4 and the evolutionarily conserved homeodomain-containing transcription factor Nkx2-5, located on 8p23.1-22 and 5q35.2 respectively, are thought to play a vital role in cardiogenesis. The objective of the present study was to screen CHD patients of the Indian population for reported mutations in the Nkx2-5 gene, exon 1 (249C→T) and exon 2 (723A→G, 735C→T), and in exons 3 and 4 of the GATA4 gene (substitutions at 687G, 700G (700G→A), 779G, 796C, 818A and 848G). Only these exons of the Nkx2-5 and GATA4 genes were examined, because previous studies in other populations reported a high incidence of mutations in them. Forty phenotypically well-characterized non-syndromic patients [19 Atrial Septal Defects (ASD), 12 Ventricular Septal Defects (VSD), 2 Atrioventricular Septal Defects (AVSD), Tetralogy of Fallot (TOF), 2 Corrected Transposition of Great Arteries (CTGA)], referred to the International Centre for Cardio Thoracic & Vascular Diseases (A Unit of Frontier Lifeline Pvt. Ltd. & Dr. K. M. Cherian Heart Foundation, Chennai) for CHD treatment from November 2008 to March 2009, were selected. Preoperative blood samples were collected from the patients after obtaining their informed consent. Genetic counselling revealed that 7.5% (ASD=2, VSD=1) of the patients were born to consanguineous parents, 2.5% (n=1, ASD) had a familial history of CHD and 2.5% (n=1, ASD) were born premature. DNA was isolated from peripheral blood using Lahiri’s method [1] and quantified by agarose gel electrophoresis using a λ Hind-III digested ladder [MBO Fermentas, USA]. The exon 1 and exon 2 regions of the Nkx2-5 gene [2] and the exon 3 and exon 4 regions of GATA4 [3] were amplified using the corresponding primers and subjected to RFLP analysis using the reported restriction enzymes. Mutations were observed in exon 2 of Nkx2-5 (735C→T, Gln187Ter, heterozygous) in one VSD patient and in exon 3 of GATA4 (700G→A, Gly234Ser, heterozygous) in one CTGA patient and one OSASD patient. The 735C→T transition in the Nkx2-5 gene found in one VSD patient was previously observed in a German study [4]. The GATA4 exon 3 mutation Gly234Ser was identified in two patients, one CTGA and one OSASD. A Japanese study has previously reported the same mutation in 1 patient among 68 mutations [5]. None of the other mutations studied in GATA4 and Nkx2-5 was observed in our population. These results indicate that the above two mutations are not population specific. The results show that Indians also carry mutations in GATA4 and Nkx2-5. Further, new mutations could also be identified among these patients, as Indians are a unique genetic entity. The results have to be validated in a larger number of patients for extensive studies on the role of GATA4 and Nkx2-5 in the Indian population.
1 Department of Genetic Engineering, Frontier Tissue Line, R-30-C, Ambattur Industrial Estate Road, Mogappair, Chennai-600 101, Tamil Nadu, India, [email protected], [email protected]
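A C→T substitution exchanges one pyrimidine for another and is therefore conventionally classed as a transition, while G→A exchanges one purine for another. The minimal sketch below classifies single-base substitutions and, assuming that a position is numbered from the first base of the coding sequence, converts it to a codon number. The function names are hypothetical, and the numbering assumption is checked here only against the GATA4 change 700G→A (Gly234Ser) reported above; it need not hold for the numbering used for Nkx2-5.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def substitution_type(ref, alt):
    """Return 'transition' (purine<->purine or pyrimidine<->pyrimidine) or 'transversion'."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid or ref == alt:
        raise ValueError("expected two different bases from A, C, G, T")
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"

def codon_number(cds_position):
    """1-based codon index, assuming 1-based numbering from the start of the coding sequence."""
    return (cds_position - 1) // 3 + 1

nkx_change = substitution_type("C", "T")    # 735C>T
gata_change = substitution_type("G", "A")   # 700G>A
gata_codon = codon_number(700)
```

Under this numbering assumption, position 700 falls in codon 234, which agrees with the reported Gly234Ser change.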
1. Lahiri D.K. et al. (1993), DNA isolation by a rapid method from human blood samples. Effect of MgCl2, EDTA, storage time and temperature on DNA yield and quality, Biochemical Genetics, 31: 321-328
2. Wei-min Z., Xiao-feng L., Zhong-yuan M., et al. (2009), GATA4 and NKX2.5 gene analysis in Chinese Uygur patients with congenital heart disease, Chinese Medical Journal, 122(4): 416-419
3. Reamon-Buettner S.M., Cho S.H., Borlak J. (2007), Mutations in the 3'-untranslated region of GATA4 as molecular hotspots for congenital heart disease (CHD), BMC Medical Genetics, 8:38
4. Reamon-Buettner S.M., Hecker H., Spanel-Borowski K. et al. (2004), Novel NKX2–5 Mutations in Diseased Heart Tissues of Patients with Cardiac Malformations, American Journal of Pathology, 164(6).
5. Reamon-Buettner S.M., Borlak J. (2005), GATA4 zinc finger mutations as a molecular rationale for septation defects of the human heart, Journal of Medical Genetics, 42
I would like to thank my research project students Saranya Devi C., Reshme J., Shruthi V., Srividya V., Aishwarya V., Ram Prasath G., Nelson Rajkamal A., Pooranamathi and Muhammed Sirajeeden for their support in the research work.
HYDROGEN BOND GEOMETRY IN REGULAR HELIX STRUCTURES DMITRII L. UKRAINSKII1, VLADIMIR O. CHEKHOV 1, VLADIMIR G. TUMANYAN 1, NATALIA G. ESIPOVA 1
Quantum-chemical calculations of compounds that model interpeptide H-bonds in polypeptide helices provide unique information about the physical nature of these bonds. Our purpose was quantum-chemical modeling of interpeptide H-bonds with variation of geometric parameters. Two semi-empirical methods (PM3 and AM1) and ab initio calculations with the STO-3G, 3-21G and 6-31G** basis sets were used in this study, as implemented in the GAMESS and HyperChem Pro 6 packages. The AM1 method was found to be the most adequate for our purposes, because the optimal orientation of the N–H bond obtained from AM1 calculations differs from the ab initio one by no more than 3°. It also appears to be valid for simulations of peptide groups belonging to regular helical peptide chains, exemplified by the proteins 1cq2 and 2mb5. We examined how the total energy of a single peptide group in regular (ideally infinite) helical structures depends on the orientation of the N–H bond. We computed the energies of regular helical octa-, nona- and deca-Gly structures at different Ramachandran angles ϕ and ψ. The dependence of the total energy of the peptide group situated between the sixth and the seventh amino acids (counting from the N-terminus) on deviations of the N–H bond from the bisector of the Cα–N–C′ valence angle was obtained with the geometries of the N–H bonds of the remaining peptide groups frozen. In all these and the following simulations the bond length was taken to be 1.01 Å. Boundary effects were eliminated during the calculations. For Ramachandran angles –75° ≤ ϕ ≤ –47° and –57° ≤ ψ ≤ –25°, typical for A-area structures, we observed that even when it is hard to choose between hydrogen acceptors, the total energy of the peptide group has a single minimum as a function of the N–H bond direction. All these dependencies suggest H-bonding in “indecisive” positions even where the Rose criterion predicts the existence of a direct H-bond.
In all cases the energetically permitted range lies within a ±10° interval in the plane of the peptide group, while for deviations in the perpendicular plane the range is about ±30°. This minimisation at each point of the Ramachandran plot results in a rather flat energy surface in the region adjacent to the line described by the equation (ϕ+51°)/(ψ+50°) ≈ 1.1. The region of the plot under consideration contains the α-helix area, the 3₁₀-helix area and a part of the π-helix area. Interestingly, the energy minimum (ϕ = –51°, ψ = –50°) does not coincide with any canonical helical form. The energy corresponding to the classical Pauling α-helix exceeds the minimal energy by 0.7 kcal/mol. Note that kT at room temperature is about 0.6 kcal/mol. The π-helix energy is practically the same as the α-helix energy, while the 3₁₀-helix energy exceeds the α-helix energy by approximately 1.5 kcal/mol. The type of helical structure thereby depends on the nature of its residues and possibly their surroundings. For H-bonds, the donor-acceptor distances lie between 2.4 and 4.4 Å in the ϕ, ψ region under investigation. The H–N–O(acceptor) angles follow the distribution shown in Fig. 1. One can see that the angles are predominantly found in the 25°-30° interval. A significant number of the angles are in the 35°-55° interval; however, the residue energies in these cases exceed 5 kcal/mol. The angles are smallest when the donor-acceptor distance is about 3 Å, and they are never less than 15°. Thus, almost every hydrogen bond in the A-area can be regarded as “indecisive”.
1 Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, ul. Vavilova 32, Moscow, 117984 Russia; fax: +7 (499) 135-1405; e-mail: [email protected]; [email protected]
Fig. 1. Histogram of the absolute value of the angle between the N–H direction and the direction from the hydrogen donor atom (N) towards an acceptor (O). Black bins include only the cases in which the effective energy per glycine residue exceeds the minimum by no more than 5 kcal/mol. Grey bins include all cases.
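The angle plotted in Fig. 1 can be obtained directly from atomic coordinates. The following is a minimal Python sketch, assuming Cartesian coordinates in Å; the function name and the example geometry are illustrative and are not taken from the study.

```python
import math

def hno_angle(n, h, o):
    """Angle in degrees between the N->H bond vector and the N->O (donor to acceptor) vector."""
    nh = [h[i] - n[i] for i in range(3)]
    no = [o[i] - n[i] for i in range(3)]
    dot = sum(a * b for a, b in zip(nh, no))
    norm = math.sqrt(sum(a * a for a in nh)) * math.sqrt(sum(b * b for b in no))
    # Clamp to [-1, 1] to guard against rounding error before acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

# Hypothetical geometry: N at the origin, H at 1.01 Å along x,
# O at 3 Å from N and rotated by 30 degrees in the xy plane.
n_atom = (0.0, 0.0, 0.0)
h_atom = (1.01, 0.0, 0.0)
o_atom = (3.0 * math.cos(math.radians(30)), 3.0 * math.sin(math.radians(30)), 0.0)
angle = hno_angle(n_atom, h_atom, o_atom)
```

Taking the absolute value is unnecessary here because the angle between two vectors returned by acos is already in the 0° to 180° range.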
This work was supported by grants from the Russian Foundation for Basic Research (projects Nos. 07-04-01765 and 08-04-00849) and by the Molecular and Cellular Biology Program of the Russian Academy of Sciences.
NEGATIVE INFORMATION ENTROPY AS A MEASURE OF NONEXPONENTIALITY OF PROTEIN FOLDING KINETICS SERGEI F. CHEKMAREV 1
In many cases, when the folding process is complicated by the presence of on/off-pathway intermediates, proteins reveal nonexponential folding kinetics (e.g., [1-6]). To see how far the kinetics deviate from exponential (two-state) kinetics, or which of several kinetics deviate more, a quantitative measure of nonexponentiality of the first-passage-time distributions (FPTDs) is needed. For this purpose, the difference between the information (Shannon) entropies of the exponential distribution and of a given FPTD (∆S) can be employed [7]. It is essentially the Schrödinger-Brillouin [8,9] negative entropy (negentropy), except that the probability for the system to escape from a certain state at a given time is considered instead of the probability for the system to be found in a certain state, and it is closely related to the well-known Kullback-Leibler divergence [10], widely used in information theory. The utility of the negative entropy thus introduced is twofold [7]. First, a positive value of ∆S indicates that the FPTD is less random than that of a Poisson process, so that the process under consideration presumably involves some intermediates, which break the Poisson character of the process. Secondly, ∆S has a straightforward interpretation in terms of transition state theory, so that it can be expressed in terms of the free energy and, correspondingly, be measured in kBT units. In contrast to the other known measures of nonexponentiality of FPTDs, which are based on comparing the standard deviation and the median of an FPTD with its mean value, ∆S gives an unambiguous estimate of the nonexponentiality of an FPTD. Potentially, the present approach has a broad range of application in the analysis of kinetic processes, because it is applicable to any problem to which the concepts of information entropy and transition state theory are relevant. The theoretical analysis is illustrated with simulation and experimental results from protein folding [1-6].
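As an illustration of the entropy difference described above, the sketch below evaluates ∆S numerically for two model first-passage-time densities. It assumes that the reference exponential distribution has the same mean as the given FPTD (in which case its entropy is 1 + ln of the mean, in units of kB), that the density is normalised, and that the integration range is long enough to cover its tail; the exact definition used in [7] may differ in detail. The function name and the two test densities are hypothetical.

```python
import math

def delta_s(pdf, t_max, n=200_000):
    """Return S_exp - S for a normalised first-passage-time density on [0, t_max].

    S is the differential entropy, minus the integral of p ln p; S_exp = 1 + ln(mean)
    is the entropy of the exponential distribution with the same mean.
    """
    dt = t_max / n
    mean = 0.0
    entropy = 0.0
    for i in range(n):
        t = (i + 0.5) * dt  # midpoint rule
        p = pdf(t)
        mean += t * p * dt
        if p > 0.0:
            entropy -= p * math.log(p) * dt
    return 1.0 + math.log(mean) - entropy

# Two-state kinetics: a single exponential step with unit rate.
ds_two_state = delta_s(lambda t: math.exp(-t), t_max=60.0)

# Hypothetical kinetics with an obligatory intermediate: two sequential
# steps of unit rate, which gives the density t * exp(-t).
ds_sequential = delta_s(lambda t: t * math.exp(-t), t_max=60.0)
```

For the single exponential the result is zero up to integration error, while the sequential model gives a positive value (analytically ln 2 minus the Euler-Mascheroni constant, about 0.12). With equal means, this difference coincides with the Kullback-Leibler divergence of the FPTD from the exponential distribution.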
Considering a limited but not specific set of proteins, it has been found that ∆S typically varies in the range from several hundredths of kBT (two-state kinetics) to several tenths of kBT (multistate kinetics). Knowledge of ∆S and of the free energy barrier between the unfolded and folded states of the protein allows estimation of the relative deviation of the folding process from two-state kinetics. This work was supported in part by grants from the Russian Foundation for Basic Research (No. 08-04-91104) and the Civilian Research and Development Foundation (No. RUB2-2913-NO-07).
1 Institute of Thermophysics, SB RAS, and Novosibirsk State University, 630090 Novosibirsk, Russia, [email protected]
1. J. Sabelko, et al. (1999) Proc. Natl. Acad. Sci. U.S.A. 96: 6031-6036.
2. J. M. Sorenson and T. Head-Gordon (2002) Proteins: Struct., Funct., Genet. 46: 368-379.
3. H. Kaya and H. S. Chan (2003) Proteins: Struct., Funct., Genet. 52: 524-533.
4. J. M. Borreguero, et al. (2004) Biophys. J. 87: 521-533.
5. S. F. Chekmarev, et al. (2005) J. Phys. Chem. B 109: 5312-5330.
6. Yu. Palyanov, et al. (2007) J. Phys. Chem. B 111: 2675-2687.
7. S. F. Chekmarev (2008) Phys. Rev. E 78: 066113.
8. E. Schrödinger (1945) What is Life? The Physical Aspect of the Living Cell (Cambridge University Press, Cambridge, England).
9. L. Brillouin (1953) J. Appl. Phys. 24: 1152-1163.
10. S. Kullback (1959) Information Theory and Statistics (Wiley, New York).
CHANGES IN THE CONTENT OF CYTOSINE, GUANINE, CpG AND CpNpG SEQUENCES OF rDNA IN LONG PHYLOGENETIC BRANCHES OF FLOWERING PLANTS HAVE A BACK-AND-FORTH NATURE VLADIMIR CHUPOV 1
Variations in nucleotide composition and in the frequency of CpG and CpNpG sequences in the clusters of nuclear ribosomal genes of taxa belonging to two long phylogenetic branches of Angiospermae have been analyzed. This region of eukaryotic genomes is the nucleolus organizer and functions in a separate compartment of the cell nucleus, which may make the processes occurring there rather specific. It was shown that, at the level of orders and superorders of flowering plants, the level of evolutionary advancement of a taxon, as defined from morphological data, correlates positively with the content of dC, dG, CpG and CpNpG (Chupov et al., 2007; Chupov et al., 2008a, b). This contradicts the prevailing view of the general rules of nucleotide composition change in evolution, which suggests dC and CpG suppression. However, further studies demonstrated that the increased content of cytosine, guanine, CpG and CpNpG sequences is confined to specific mono- or oligotypic cryptaffine taxa, which are the links between large families. Within individual families of flowering plants another process dominates: the replacement of cytosine by thymine and, consequently, a reduction in dC, dG, CpG and CpNpG content. Thus the changes in nucleotide composition and in the dinucleotide profiles of rDNA of flowering plants have a back-and-forth, wave-like character.
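The quantities compared in this abstract can be counted from a sequence in a few lines. The sketch below is only an illustration of such counting on one strand; the function name, the normalisation by the number of available positions, and the example fragment are assumptions, not the procedure used by the author.

```python
def composition(seq):
    """Fractions of C and G, and frequencies of CpG and CpNpG, in one DNA strand.

    Assumes the sequence has at least three bases.
    """
    seq = seq.upper()
    n = len(seq)
    cpg = sum(1 for i in range(n - 1) if seq[i] == "C" and seq[i + 1] == "G")
    # CpNpG: a C followed two positions later by a G, with any base N in between.
    cpnpg = sum(1 for i in range(n - 2) if seq[i] == "C" and seq[i + 2] == "G")
    return {
        "C": seq.count("C") / n,
        "G": seq.count("G") / n,
        "CpG": cpg / (n - 1),      # per dinucleotide position
        "CpNpG": cpnpg / (n - 2),  # per trinucleotide position
    }

stats = composition("ACGCTGCG")  # made-up fragment, not real rDNA
```

Comparing these values between ancestral and derived taxa along a branch would show whether the content rises or falls at each step.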
1. V. S. Chupov, E. O. Punina, E. M. Machs, A. V. Rodionov (2007) Nucleotide Composition and CpG and CpNpG Content of ITS1, ITS2, and the 5.8S rRNA in Representatives of the Phylogenetic Branches Melanthiales–Liliales and Melanthiales–Asparagales (Angiospermae, Monocotyledones) Reflect the Specifics of Their Evolution, Mol. Biol., 41: 808–829.
2. V. S. Chupov, E. M. Machs, A. V. Rodionov (2008a) The Dinucleotide Composition of Ribosomal Spacer Regions ITS1-5.8S rDNA-ITS2 as an Indicator of Evolutionary Development and a Phylogenetic Marker of Monocotyledon Plants (Melanthiaceae, Iridaceae, Trilliaceae and Liliaceae). General Changes in the Dinucleotide Composition, Usp. Sovrem. Biol., 128: 482–497. (In Russ.)
3. V. S. Chupov, E. M. Machs, A. V. Rodionov (2008b) The Dinucleotide Composition of Ribosomal Spacer Regions ITS1-5.8S rDNA-ITS2 as an Indicator of Evolutionary Development and a Phylogenetic Marker of Monocotyledon Plants (Melanthiaceae, Iridaceae, Trilliaceae and Liliaceae). Dinucleotide Spectrum of Cryptaffine Taxa, Usp. Sovrem. Biol., 128: 482–497. (In Russ.)
1 Komarov Botanical Institute, Russian Academy of Sciences, Russian Federation, [email protected]
EVOLUTION OF SEQUENCES UNDER STRONG SELECTION: SPLICE SITES AND SHINE-DALGARNO BOXES STEPAN DENISOV 1, AKSINIYA GAYDUKOVA 1, ANDREY MIRONOV 1, ALEXANDER FAVOROV 2, RAMIL NURTDINOV 1, MIKHAIL GELFAND 3
Splice sites (in eukaryotes) and Shine-Dalgarno (SD) boxes (in prokaryotes) are highly conserved sequences. They play key roles in gene expression at the level of splicing (splice sites) and initiation of translation (SD). Splice sites are located at the exon-intron boundaries of eukaryotic genes. The spliceosome binds directly to these sequences and then performs the splicing reactions [1, 2]. The Shine-Dalgarno sequences are special motifs located upstream of the start codons of many prokaryotic genes. These sequences are essential for the initiation of translation. The 16S rRNA (part of the ribosome) binds to the SD sequence via standard Watson-Crick base-pairing [3]. Hence, the Shine-Dalgarno sequences and splice sites experience strong selective pressure. Given the large number of such sequences (several thousand) in the available genomes, it is interesting to understand their evolution at the nucleotide level. The raw splice site data consisted of ~30000 triple alignments of orthologous donor splice sites and the same number of acceptor splice sites from the human, mouse and dog genomes. These data were extracted from the EDAS database ([4], http://edas.bioinf.fbb.msu.ru/ ). The SD sequences were identified using a rule involving a positional weight matrix and information about the position of the SD box relative to the start of translation in genomes of bacteria from the Enterobacteriaceae family. After all filtration procedures, the total number of SD sequences was 15260 (for all species). The aim was to study the pattern of evolution at each position and to compare the (calculated) strengths of ancestral and current sites. All evolutionary events were considered independently for each branch of the phylogenetic tree. For each position and for each branch of the tree a
substitution matrix was calculated using the parsimony and maximum likelihood methods. Properties of the substitution matrix were studied (matrix asymmetry; ancestor, descendant and steady vectors of nucleotide frequencies). In many cases the steady vectors differ significantly from both the ancestor and the descendant vectors. For each pair of ancestor and descendant sites the difference in strength was calculated, in order to study changes in site strength and the (in)dependence of mutations in the sites. Alternative and constitutive splice sites were studied independently. We found that in many samples of splice sites (constitutive sites and different types of alternative ones) the weights of ancestor sites are slightly but statistically significantly larger than the weights of descendant sites. It was shown that distinct positions in the sites do not mutate independently: mutations tend to be compensated by other mutations so that the weight of the site remains relatively stable.
1 Lomonosov Moscow State University, GSP-2, building 73, Leninskiye Gory, Moscow, 119992, [email protected]
2 Division of Oncology Biostatistics and Bioinformatics, The Sidney Kimmel Cancer Center at Johns Hopkins, 550 North Broadway, Suite 1103, Baltimore, MD 21205, USA
3 Institute for Information Transmission Problems, Russian Academy of Sciences, Bolshoy Karetny pereulok 19, Moscow, 127994, Russia, [email protected]
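To make the notions of a per-position substitution matrix and its steady vector concrete, here is a minimal Python sketch. It assumes that ancestral and descendant sites are already paired (the ancestral reconstruction by parsimony or maximum likelihood is not shown), estimates the matrix by simple counting, and finds the steady vector by repeated multiplication. All names and the toy data are hypothetical.

```python
BASES = "ACGT"

def substitution_matrix(pairs, pos):
    """Row-normalised counts P[a][d] of ancestor base a -> descendant base d at one position."""
    counts = {a: {d: 0 for d in BASES} for a in BASES}
    for ancestor, descendant in pairs:
        counts[ancestor[pos]][descendant[pos]] += 1
    matrix = {}
    for a in BASES:
        total = sum(counts[a].values())
        # Assumption: an ancestor base that was never observed is treated as unchanged.
        matrix[a] = {d: (counts[a][d] / total if total else float(a == d)) for d in BASES}
    return matrix

def steady_vector(matrix, iterations=1000):
    """Nucleotide frequencies reached by applying the matrix repeatedly to a uniform start."""
    v = {b: 0.25 for b in BASES}
    for _ in range(iterations):
        v = {d: sum(v[a] * matrix[a][d] for a in BASES) for d in BASES}
    return v

# Toy data for a single position: A->G in 1 of 10 cases, G->A in 2 of 10 cases.
pairs = ([("A", "A")] * 9 + [("A", "G")] + [("G", "G")] * 8 + [("G", "A")] * 2
         + [("C", "C"), ("T", "T")])
P = substitution_matrix(pairs, pos=0)
steady = steady_vector(P)
```

In this toy case the steady frequencies of A and G differ from both the ancestral and the descendant frequencies, which is the kind of discrepancy the abstract refers to.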
1. J. Rogers and R. Wall (1980) A mechanism for RNA splicing, Proc Natl Acad Sci USA, 77(4): 1877–1879.
2. D.A. Wassarman and J.A. Steitz (1992) Interactions of small nuclear RNA's with precursor messenger RNA during in vitro splicing, Science, 257(5078): 1918-1925.
3. T. Nakamoto (2006) A unified view of the initiation of protein synthesis, Biochem Biophys Res Commun, 341(3): 675-678.
4. R.N. Nurtdinov et al. (2006) EDAS, databases of alternatively spliced human genes, Biofizika, 51(4): 589-592.
COMPUTER SIMULATION OF C. ELEGANS MUSCULAR SYSTEM AND NEURAL NETWORK ALEXANDER DIBERT 1, ANDREY PALYANOV 2
Keywords: C. elegans, simulation, neural network, muscle system, 3-D environment
Investigation of the structure and functioning of the nervous system is one of the most interesting and complex problems. A functional computer model of a nervous system that reproduces the properties of the original with high accuracy would be evidence of a high level of understanding of the processes that take place in it. Reproducing the architecture of a real neural network seems to be a good approach to start with. The mammalian brain, and even the brains of simpler organisms, are too complex for the positions of all the neurons and the connections between them to be determined and simulated on contemporary computers. Moreover, although many different neuron models have been proposed, it is difficult to estimate how close to reality they are.
C. elegans, a free-living soil nematode, is a model organism that is widely used and extensively studied by biologists. It is the only organism for which the neural network architecture (the positions of its neurons and the connections between them) is almost completely known. Its nervous system consists of 302 neurons, over 5000 synapses and more than 2000 neuromuscular junctions, and these elements are invariant among individuals of the same sex. In view of this, simulation of the nervous system of C. elegans seems to be one of the most timely and necessary tasks. The small size of the neural network will allow us to perform the calculations in reasonable time on contemporary computers. Besides the model of the nervous system, it is very important to develop a model of the organism's body, including muscles and receptors, in a three-dimensional physical environment, which will provide sensory input and feedback to the working nervous system and make it possible to observe the organism's behavior.
The model of the nematode body consists of a set of mass points, passive spring connections, which simulate tissues, and active spring connections, which can receive an input signal from motoneurons and simulate muscles. The three-dimensional model of the worm and the model of the physical environment, which includes the supporting force, the friction force, the muscle tension, gravity and the surface resistance, were implemented in C++, with the OpenGL library used for real-time visualization. The muscle system of a real organism consists of 4 longitudinal muscle groups. Each group consists of 23 or 24 muscles arranged in an interleaved pattern. Each muscle in our model corresponds to a muscle of a real worm.
1 Novosibirsk State University, Russian Federation, [email protected]
2 A.P. Ershov Institute of Informatics Systems, Russian Federation, [email protected]
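The idea of an active spring between mass points can be sketched in a few lines. The example below is written in Python for brevity (the authors' implementation is in C++) and is a one-dimensional toy, not their model: two mass points are joined by one damped spring whose rest length shortens in proportion to a motoneuron activation between 0 and 1. The spring constant, damping, contraction fraction and integration scheme are all assumptions chosen for illustration.

```python
def relax_muscle(activation, steps=20000, dt=1e-3):
    """Distance between two mass points joined by one active spring after relaxation."""
    k, damping, mass, rest = 50.0, 2.0, 1.0, 1.0
    # Assumption: full activation shortens the rest length by 30 percent.
    target = rest * (1.0 - 0.3 * activation)
    x = [0.0, rest]   # positions of the two mass points
    v = [0.0, 0.0]    # their velocities
    for _ in range(steps):
        stretch = (x[1] - x[0]) - target
        pull = k * stretch  # Hooke's law: positive when the spring is stretched
        forces = [pull - damping * v[0], -pull - damping * v[1]]
        for i in range(2):
            # Semi-implicit Euler: update the velocity first, then the position.
            v[i] += forces[i] / mass * dt
            x[i] += v[i] * dt
    return x[1] - x[0]

relaxed_length = relax_muscle(activation=0.0)
contracted_length = relax_muscle(activation=1.0)
```

A passive spring is the special case with zero activation; a full body model would connect many such points in three dimensions and add the environmental forces listed above.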
We examined some simple neuron models based on input signal summation with an adjustable actuation threshold. Some neuron parameters are unknown, so we built a muscle contraction model that allows the C. elegans model to perform sinusoidal movement, and we use genetic algorithms based on this model, together with experimental data, to approximate adequate values.
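A neuron model of the kind mentioned above, summation of inputs with an adjustable threshold, can be sketched as follows. The weights and threshold are hypothetical values of the sort that a genetic algorithm might tune; nothing here is taken from the authors' code.

```python
def neuron_fires(inputs, weights, threshold):
    """Return True when the weighted sum of the input signals reaches the threshold."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return total >= threshold

# Hypothetical motoneuron with two excitatory inputs and one inhibitory input.
weights = [1.0, 1.0, -1.5]
active = neuron_fires([1, 1, 0], weights, threshold=1.5)  # weighted sum 2.0
silent = neuron_fires([1, 1, 1], weights, threshold=1.5)  # weighted sum 0.5
```

The Boolean output could then serve as the activation signal passed to an active spring in the body model.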
The result of our work is a virtual model of the C. elegans nematode consisting of a body frame, a muscle system and a neural system, which are not merely separate fragments of the C. elegans systems but a set of interconnected systems. This allows the neural network to receive a signal from the environment and react to it. Visualization allows us to study the structure of the neural network, which is quite complex, by selecting any combination of neurons whose axons and dendrites are then displayed at the required scale and projection. It also gives us an opportunity to observe the behavior of the virtual model, so that we can judge the adequacy of a neuron model while adjusting its parameters.
NEW PROFILES FOR TWO DOMAINS OF QUORUM-SENSING HISTIDINE KINASES FROM FIRMICUTES BACTERIA D.V. DIBROVA 1
Keywords: histidine kinase, annotation analysis
Introduction
Proteins of Two-Component Systems (TCSs) are responsible for the majority of bacterial reactions to changes in the environment [1]. Each TCS consists of at least two proteins: a sensor Histidine Kinase (HK) and a Response Regulator (RR). Signal transduction is performed in three steps: (1) autophosphorylation of the HK at a His residue in response to an external stimulus; (2) transfer of the phosphate from the His of the HK to an Asp residue of the RR; (3) activation of the effector domain of the RR, which leads to the cell reaction; the great majority of RRs are transcription factors, and their effector domains bind to DNA. Generally, HKs are membrane proteins with various numbers of transmembrane helices. A typical HK has three domains: (1) the N-terminal sensor domain, the most variable one; (2) the dimerization domain with the His residue that is phosphorylated during signal transduction; (3) the C-terminal kinase domain, which performs ATP hydrolysis. Several families of histidine kinases have been described, one of which is known to act in quorum-sensing systems of Firmicutes bacteria [2].
Results
Information about histidine kinases from two different sources was compared, and inconsistencies between them were detected. In particular, several proteins were annotated as histidine kinases in the RefSeq databank [3] but were not detected by any Pfam [4] or Prosite [5] profile. Some of them were previously reported to act in quorum-sensing systems [2, 6]. These proteins were used to build two new profiles, one of which covers the presumable dimerization domain with an absolutely conserved His residue, while the other covers an unusual kinase domain. 82 proteins had hits with these profiles. They form a family of histidine kinases not found by the existing Pfam and Prosite profiles.
Four pieces of indirect evidence indicate that these proteins are indeed HKs: (1) the presence of a conserved His residue and a special region around it with both conserved and highly variable residues; (2) the N-terminal region of these proteins holds several predicted transmembrane helices (usually 7); (3) the closest genomic neighbors of their genes are RR genes, which is typical for TCSs; (4) the kinase domain of these proteins lacks one of the four conserved motifs, which is in agreement with the literature.
1 Moscow State University, Moscow, Russia, [email protected]
1. Ann M. Stock, Victoria L. Robinson, Paul N. Goudreau (2000) Two-Component Signal Transduction, Annu. Rev. Biochem., 69: 183-215.
2. Richard P. Novick, Edward Geisinger (2008) Quorum Sensing in Staphylococci, Annu. Rev. Genet., 42: 541-64.
3. Kim D. Pruitt et al. (2007) NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins, Nucleic Acids Research, 35: D61-D65.
4. R.D. Finn et al. (2008) The Pfam protein families database, Nucleic Acids Research, 36: D281-D288.
5. Hulo N. et al. (2008) The 20 years of PROSITE, Nucleic Acids Research, 36: D245-D249.
6. Regine Hakenbeck (2000) Transformation in Streptococcus pneumoniae: mosaic genes and the regulation of competence, Res. Microbiol., 151: 453–456.
MULTISCALE MODELING AND DESIGN OF BIOLOGICAL MOLECULES NIKOLAY V. DOKHOLYAN 1
Some of the emerging goals in modern medicine are to uncover the molecular origins of human diseases and, ultimately, to contribute to the development of new therapeutic strategies to rationally abate disease. Of immediate interest are the roles of molecular structure and dynamics in certain cellular processes leading to human diseases, and the ability to rationally manipulate these processes. Despite recent revolutionary advances in experimental methodologies, we are still limited in our ability to sample and decipher the structural and dynamic aspects of single molecules that are critical for their biological function. Thus, there is a crucial need for new and unorthodox techniques to uncover the fundamentals of molecular structure and interactions. We developed a multiscale approach based on tailoring simplified protein models to the systems of interest. Such an approach significantly extends the length and time scales accessible in studies of complex biological systems. I will describe several recent studies that demonstrate the predictive power of simplified protein models within a hypothesis-driven modeling approach utilizing rapid Discrete Molecular Dynamics (DMD) simulations.
1 Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, NC, United States, dokh@med.unc.edu
PREDICTION OF FLEXIBILITY AND ABILITY TO HYDROGEN-DEUTERIUM EXCHANGE FOR PROTEIN CHAIN USING AMINO ACID SEQUENCE NIKITA DOVIDCHENKO 1, ALEXEY SURIN 2, SERGIY GARBUZYNSKIY 3, MICHAIL LOBANOV 4, XANA GALZITSKAYA 5
Keywords: hydrogen-deuterium exchange, secondary structure, hydrogen bond, B-factor, regions with irregular secondary structure
Since flexible protein regions frequently play an important role in biological functioning, it is not surprising that the structural explanation of these dynamic properties is at present a very active area of research. Some structural aspects of local flexibility are outlined in this work. We have investigated the possibility of predicting the protection of the main polypeptide chain from hydrogen-deuterium exchange. Exchange data for 14 proteins with published rates of native-state out-exchange have been compiled. Different structural parameters reflecting the flexibility of amino acid residues and their amide groups have been analyzed to determine whether these parameters can be used to predict the protection of amino acid residues from hydrogen-deuterium exchange using only the amino acid sequence. A method for such prediction has been developed. For 70% of the residues considered in this work we can correctly predict their status, that is, whether or not they will be protected from hydrogen exchange. An additional goal of our study is to assess whether properties inferred using the bioinformatics approach are readily applicable to predicting the behavior of proteins in solution. Mass spectrometry analysis of hydrogen-deuterium exchange has been carried out for five proteins and compared with the predictions of our method.
1 Institute of Protein Research RAS, Russian Federation, [email protected]
2 Institute of Protein Research RAS, Russian Federation, [email protected]
3 Institute of Protein Research RAS, Russian Federation, [email protected]
4 Institute of Protein Research RAS, Russian Federation, [email protected]
5 Institute of Protein Research RAS, Russian Federation, [email protected]
MATHEMATICAL MODELING OF STEADY-STATE METABOLISM IN SACCHAROMYCES CEREVISIAE MITOCHONDRIA RENATA A. ZVYAGILSKAYA 1, NAFISA N. NAZIPOVA 2, ALEXSANDER A. ALEXSANDROV 3, LYUSIEN N. DROZDOV-TIKHOMIROV 3
The steady-state metabolism of mitochondria from Saccharomyces cerevisiae cells growing under aerobic conditions with sucrose as the sole carbon source is described here by a mathematical model based on the previously elaborated method of steady-state metabolic flux balance (the SMFB method) and on the computer program package FLUX II, specially designed for this purpose. In the SMFB method, the steady-state rates of the metabolic reactions are taken as variables. Each equation of the SMFB method is an equation of the balance between the incoming and outgoing fluxes of one of the metabolites. Therefore, the model can be written as a set of linear algebraic equations in which the left-hand sides are formed by the stoichiometric matrix of the reaction system, while the right-hand sides are the resulting metabolic flux values for each metabolite of the system under consideration. The constructed advanced model makes it possible to calculate the optimal distribution of reaction rates in the mitochondrial metabolic network, provided that the monomer composition of the mitochondria-forming biopolymers (proteins, DNA, RNAs, membranous lipoproteins), the list of metabolites entering the mitochondria and the ATP efflux from the mitochondria are given. It is assumed that mitochondria are a self-reproducing system that divides synchronously with cell division. Importantly, the calculated levels of oxygen consumption and CO2 export were in good agreement with the experimentally obtained results, thus reinforcing the validity of the SMFB method for the quantification of cell metabolism.
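The structure of the balance equations can be illustrated with a toy system. The sketch below is not the FLUX II package: it assumes a tiny, square system (as many balance equations as unknown rates) that can be solved directly by Gaussian elimination, whereas a realistic network is typically underdetermined and requires an optimisation step. The network, the numbers and the function name are hypothetical.

```python
def solve(a, b):
    """Solve the square linear system a·x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

# Toy network: reaction v1 converts A to B, reaction v2 converts B to C.
# Rows of S are the metabolites B and C; columns are the reactions v1 and v2.
S = [[1.0, -1.0],  # B: produced by v1, consumed by v2
     [0.0, 1.0]]   # C: produced by v2
b = [1.0, 2.0]     # net output: 1 unit of B into biopolymers, 2 units of C exported
v1, v2 = solve(S, b)
```

Here the left-hand side is the stoichiometric matrix multiplied by the vector of reaction rates, and the right-hand side holds the required net flux of each metabolite, mirroring the form of the equations described above.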
1 Moscow, A.N. Bach Institute of Biochemistry, Russian Academy of Sciences, Russian Federation
2 Pushchino, Institute of Mathematical Problems of Biology, Russian Academy of Sciences, Russian Federation
3 Moscow, Institute of Molecular Genetics, Russian Academy of Sciences, Russian Federation, [email protected]
STRUCTURAL TREES AND CLASSIFICATION OF PROTEINS ALEXANDER EFIMOV 1
The structural tree for proteins is a scheme that includes all the intermediate and final three-dimensional structures that can be obtained by stepwise addition of secondary structural elements to the root (starting) structure. Secondary structural elements are added to the growing structures in accordance with a set of rules inferred from known principles of protein structure. A structural motif having a unique overall fold is taken as the root structure of the tree. Possible folding pathways are shown by lines that connect all the structures with each other, giving one structural tree. Because of their structural similarity, proteins and domains included in one structural tree can be classified into one structural class or superfamily. Proteins and domains found within branches of a structural tree can be grouped into subclasses or subfamilies. Levels of structural similarity between different proteins can easily be observed by visual inspection. Within one branch, protein structures having a higher position in the tree include the structures located lower. Proteins and domains of different branches have the structure located at the branching point as their common fold. This classification is based on the similarity of overall folds and on modelled folding pathways of proteins and domains. Amino acid sequences, functions and homology of proteins are not taken into account in this classification, so it differs from other known classification systems. To date, structural trees have been constructed for nine large protein superfamilies: beta-proteins containing abcd-units, 3-beta-corners, S-like beta-sheets; two-layer (alpha+beta)-proteins containing abCd-units; three-layer alpha/beta-proteins containing five- and seven-segment alpha/beta-motifs; alpha-proteins containing alpha-alpha-corners; proteins containing phi-motifs; and proteins containing combinations of beta-alpha-beta-units and psi-motifs.
Some updated structural trees and the corresponding databases are now available at http://strees.protres.ru/.
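One way to picture the tree described in this abstract is to record each structure as the list of elements added, in order, to the root motif. Under that assumed representation (which is an illustration, not the format of the published databases), the common fold of two domains is the longest shared beginning of their growth paths. The element names after the root are hypothetical.

```python
def common_fold(path_a, path_b):
    """Longest shared prefix of two growth paths, i.e. the structure at the branching point."""
    shared = []
    for step_a, step_b in zip(path_a, path_b):
        if step_a != step_b:
            break
        shared.append(step_a)
    return shared

# Hypothetical growth paths: the root motif followed by the elements added at each step.
root = "abcd-unit"
domain_1 = [root, "beta-strand", "alpha-helix"]
domain_2 = [root, "beta-strand", "beta-hairpin"]
fold = common_fold(domain_1, domain_2)
```

Two domains on the same branch differ only in how far their paths extend, so the shorter path is itself the common fold.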
1 Institute of Protein Research, Russian Academy of Sciences, Russian Federation, [email protected]
INVESTIGATION OF CORRELATION BETWEEN DOMAIN BORDERS AND CORRESPONDING EXON BORDERS IN THE NONREDUNDANT SET OF HUMAN PROTEINS V.A. EPANESHNIKOV 1, A.A. ANASHKINA 1, E.N. KUZNETZOV 2, V.G. TUMANYAN 1
Keywords: Protein structural domain, exon, domain/exon shuffling
Gilbert [1] suggested that exons can be shuffled and that this is a way in which new genes are formed. Novel protein functions can also be produced by rearranging the exons of existing genes. In this scheme introns may be treated as hot spots for genetic recombination [2]. Thus, one or several exons correspond to a protein module or domain. Stoltzfus et al. [3] point out that no correlation between intron positions and protein modules is observed for ancient proteins. However, other authors show that intron positions in ancient proteins correlate with the boundaries of compact protein modules [4]. Work in this field is developing hand in hand with the sequencing of more and more animal genomes. Kaessmann et al. [5] showed that domains flanked by phase 1 introns have prominently expanded in the human genome due to domain shuffling. In another work [6], statistical evidence was obtained for nine eukaryotic genomes that protein domain borders correlate strongly with the exon-intron structure of genes. However, in these works a protein domain was defined as a functional unit, which in general does not coincide with a structural domain. Thus, the literature data do not allow a final conclusion about the correlation between domain and exon borders. Our task is to determine whether there is a statistically significant correspondence between the borders of structural protein domains and the exon borders of the corresponding genes. Our investigation consists of a detailed comparison of the exon and domain structure for a nonredundant set of human proteins. This nonredundant set includes 632 protein chains. For each protein chain in this set, the corresponding transcript and its exon annotation were established using the PDB identifier (http://www.rcsb.org , http://www.ncbi.nlm.nih.gov ). After the protein and transcript sequences are aligned with the program fasta3 [7], the domain and exon patterns are compared.
A special mathematical criterion, the measure of difference, was developed. The measure of difference is equal to the sum of the distances from the domain borders to the nearest exon borders of the aligned transcript. It was calculated for each domain and the corresponding exons. Three domain databases, CATH [8], SCOP [9] and Dali [10], were taken into account. The distribution of the measure of difference was constructed for each database. These distributions are quite similar. To estimate the statistical significance of the observed distributions, a theoretical random model was constructed. Comparison of the observed and random distributions leads to the conclusion that they indeed differ from each other. Additionally, a threshold value was determined that helps to separate the coinciding and the noncoinciding regions. After this, the types of domains characterized by a correlation of exon and domain borders were selected. The phases of the flanking introns were computed both for domains that coincide with exon borders and for those that do not. Interestingly, domains of the former type show a preference for the 1-1 phase, in contrast to noncoinciding domains, which show no excess of the 1-1 phase. This result supports the shuffling mechanism of exon expansion and new gene formation throughout the genome for coinciding domains.
1 Engelhardt Institute of Molecular Biology RAS, [email protected], [email protected]
2 Institute of Control Problems RAS, [email protected]
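The measure of difference defined above translates directly into code. The following Python sketch assumes that domain and exon borders are given as residue positions along the aligned protein sequence; the function name and the example positions are hypothetical.

```python
def measure_of_difference(domain_borders, exon_borders):
    """Sum of the distances from each domain border to the nearest exon border."""
    return sum(min(abs(d - e) for e in exon_borders) for d in domain_borders)

# Hypothetical positions in residues along the aligned protein sequence.
exon_borders = [10, 48, 125, 200]
score = measure_of_difference((50, 120), exon_borders)
```

In this example the domain start at 50 is 2 residues from the exon border at 48 and the domain end at 120 is 5 residues from the border at 125, so the measure is 7; a value of 0 means that both domain borders fall exactly on exon borders.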
1. Gilbert, W., Why genes in pieces? Nature, 1978. 271(5645): p. 501.
2. Gilbert, W., S.J. de Souza, and M. Long, Origin of genes. Proc Natl Acad Sci U S A, 1997. 94(15): p. 7698-703.
3. Stoltzfus, A., et al., Testing the exon theory of genes: the evidence from protein structure. Science, 1994. 265(5169): p. 202-7.
4. de Souza, S.J., et al., Intron positions correlate with module boundaries in ancient proteins. Proc Natl Acad Sci U S A, 1996. 93(25): p. 14632-6.
5. Kaessmann, H., et al., Signatures of domain shuffling in the human genome. Genome Res, 2002. 12(11): p. 1642-50.
6. Liu, M., et al., Significant expansion of exon-bordering protein domains during animal proteome evolution. Nucleic Acids Res, 2005. 33(1): p. 95-105.
7. Pearson, W.R., Empirical statistical estimates for sequence similarity searches. J Mol Biol, 1998. 276(1): p. 71-84.
8. Orengo, C.A., et al., CATH--a hierarchic classification of protein domain structures. Structure, 1997. 5(8): p. 1093-108.
9. Murzin, A.G., et al., SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol, 1995. 247(4): p. 536-40.
10. Alexandrov, N. and I. Shindyalov, PDP: protein domain parser. Bioinformatics, 2003. 19(3): p. 429-30.
EVOLUTION OF STRUCTURE AND SEQUENCE IN ALTERNATIVELY SPLICED DROSOPHILA GENES DMITRY MALKO 1, EKATERINA ERMAKOVA 2, MIKHAIL GELFAND 3
Keywords: exon-intron structure, nucleotide substitutions, alternative splicing, Drosophila
BACKGROUND. Two major mechanisms of evolution of genomic sequences are shuffling of genomic fragments and fine-tuning of coding and cis-regulatory regions via nucleotide substitutions. Alternative splicing provides extra freedom for both mechanisms [1]. Evolution of exon-intron structure and alternative splicing in insects is poorly studied as compared to vertebrates [2-4]. We consider the evolutionary diversity of the Drosophila genus at the level of exon-intron structure and at the level of nucleotide substitutions. We study gain and loss of exonic, intronic, and alternatively spliced regions within the same framework, considering nucleotide substitutions in different types of alternative coding regions separately.

RESULTS. The patterns of evolution in terms of gain and loss of introns, constitutive exons, and alternatively spliced gene segments, as well as substitution rates in constitutively and alternatively spliced coding regions, were considered for eleven Drosophila species (D. melanogaster, D. sechellia, D. yakuba, D. erecta, D. ananassae, D. pseudoobscura, D. persimilis, D. willistoni, D. mojavensis, D. virilis, D. grimshawi). Alternative segments are gained and lost at a higher rate than introns and constitutive exons, and introns are gained at a higher rate than constitutive exons. The patterns of structural rearrangements in the pairs of recently diverged species D. yakuba ↔ D. erecta and D. pseudoobscura ↔ D. persimilis differ dramatically, despite similar rates of nucleotide substitutions. Extremely high rates of structural rearrangements were observed in D. persimilis. During the evolutionary periods when the rate of intron loss was greater than the rate of intron gain (recent evolution of D. ananassae and D. willistoni, and evolution in the pseudoobscura subgroup before the D. pseudoobscura ↔ D. persimilis split), the rates of gain and loss of coding regions were extremely low.
Alternative regions contain more nonsynonymous substitutions than constitutive regions of spliced genes. Intronless genes contain more nucleotide substitutions than constitutively spliced regions of multiexonic genes. The substitution rates in alternative regions of different types vary dramatically. In particular, cassette exons have nearly twice as many nucleotide substitutions as mutually exclusive exons. The substitution rates in duplicated and non-duplicated mutually exclusive exons also differ. 5′-terminal exon extensions due to acceptor sites have the highest rate of nonsynonymous substitutions, while retained introns have the highest rate of synonymous substitutions.

CONCLUSIONS. Alternatively spliced regions are hotspots of molecular evolution both at the level of structural rearrangements and at the level of nucleotide substitutions. This demonstrates that alternative splicing is one of the major evolutionary mechanisms generating protein diversity. The rates of structural rearrangements in close species are more variable than the rates of nucleotide substitutions. Substitution rates in alternative regions of different types vary. This variation may be caused by differences in the density of cis-regulatory elements in alternative regions of different types. In particular, our results show that three types of alternative exons (cassette exons, duplicated mutually exclusive exons, and non-duplicated mutually exclusive exons) should be considered separately in comparative genomic studies.

1 State Scientific Center "GosNIIGenetika", Russian Federation, [email protected]
2 A.A. Kharkevich Institute for Information Transmission Problems, Russian Academy of Sciences, Russian Federation, [email protected]
3 A.A. Kharkevich Institute for Information Transmission Problems, Russian Academy of Sciences, Russian Federation, [email protected]
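The distinction between synonymous and nonsynonymous substitutions used in this abstract can be made concrete with a short Python sketch. It is only an illustration, not the procedure used by the authors: the function name is hypothetical, the standard genetic code is assumed, and codons that differ at more than one position are skipped rather than resolved, which a full method would handle by averaging over mutational paths.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_substitutions(seq_a, seq_b):
    """Count codon differences between two aligned, in-frame coding sequences.

    Returns (synonymous, nonsynonymous, skipped). A codon pair is skipped when
    it differs at more than one position or contains a non-ACGT symbol
    (simplifying assumption of this sketch).
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned, gap-free and in frame")
    synonymous = nonsynonymous = skipped = 0
    for i in range(0, len(seq_a), 3):
        codon_a = seq_a[i:i + 3].upper()
        codon_b = seq_b[i:i + 3].upper()
        if codon_a == codon_b:
            continue
        n_diff = sum(x != y for x, y in zip(codon_a, codon_b))
        if n_diff != 1 or codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            skipped += 1
        elif CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1  # same amino acid encoded
        else:
            nonsynonymous += 1  # amino acid changed
    return synonymous, nonsynonymous, skipped
```

Applying such a count separately to constitutive regions and to each type of alternative region (cassette exons, mutually exclusive exons, retained introns) gives the kind of per-class comparison the abstract reports.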
1. Modrek B. and Lee C.J. Alternative splicing in the human, mouse and rat genomes is associated with an increased frequency of exon creation and/or loss. Nat Genet 34 (2003).
2. Malko D.B., Makeev V.J., Mironov A.A., and Gelfand M.S. Evolution of the exon-intron structure and alternative splicing in fruit flies and malarial mosquito genomes. Genome Res 16 (2006).
3. Ermakova E.O., Mal'ko D.B., and Gel'fand M.S. Different patterns of evolution in alternative and constitutive coding regions of Drosophila alternatively spliced genes. Biofizika 51 (2006).
4. Coulombe-Huntington J. and Majewski J. Intron loss and gain in Drosophila. Mol Biol Evol 24 (2007).
SECONDARY STRUCTURE OF COPOLYMER CONSISTING OF AMPHIPHILIC AND HYDROPHILIC MONOMER UNITS: IMPACT OF THE RANGE OF THE INTERACTION POTENTIAL VITALY ERMILOV 1, VALENTINA VASILEVSKAYA 2, ALEXEI KHOKHLOV 3
Keywords: amphiphilic copolymers, simple model of polypeptide chain, HP model
The dependence of the coil-globule transition of a copolymer composed of amphiphilic and hydrophilic monomers on the range of the interaction potential has been studied by molecular dynamics simulations. The structure of the globules formed in such systems depends substantially on the range of the interaction potential. With a long-range potential, the globule resulting from hydrophobically driven collapse has a blob structure. With a short-range potential, the globule has a quasi-helical structure in which the chain backbone forms helical turns whose direction of twist can vary from turn to turn. The coil-globule transition in such systems passes through a necklace conformation consisting of quasi-helical micelle beads. For long macromolecules, the size of the globules depends linearly on the degree of polymerization.
1 A.N. Nesmeyanov Institute of Organoelement Compounds, Russian Academy of Sciences, Russian Federation, [email protected]
2 A.N. Nesmeyanov Institute of Organoelement Compounds, Russian Academy of Sciences, Russian Federation, [email protected]
3 Lomonosov Moscow State University, Russian Federation, [email protected]
MUTUAL ORIENTATION OF Qy TRANSITION DIPOLES OF SUBANTENNAE PIGMENTS AS A STRUCTURAL FACTOR OPTIMIZING THE PHOTOSYNTHETIC ANTENNA FUNCTION. THEORETICAL AND EXPERIMENTAL STUDIES ANASTASIYA ZOBOVA 1, ANDREY YAKOVLEV 1, VLADIMIR NOVODEREZHKIN 1, ALEXANDRA TAISOVA 1, ZOYA FETISOVA 1
Keywords: structure optimization, functional criteria, photosynthesis, light-harvesting antenna, model calculations
This work continues a series of our investigations of efficient strategies for the functioning of natural light-harvesting antennae, initiated by our concept of rigid optimization of the structure of the photosynthetic apparatus by a functional criterion. It addresses the problem of finding the optimal orientation of the Qy transition dipole moments of light-harvesting bacteriochlorophyll (BChl) a molecules of the B798 subantenna (absorption maximum at 798 nm) in the green bacterium Chloroflexus aurantiacus [1]. We modeled infinite 3D antennae whose elementary fragment is a 1D unit (parallel to the Z axis) containing molecules of three subantennae, B740, B798 and B808 (Fig. 1).
1 M.V. Lomonosov Moscow State University, Moscow, 119992, Russian Federation, [email protected]; [email protected]

B798 is the acceptor for the oligomeric BChl c subantenna B740 and the donor for the monomeric BChl a subantenna B808. Orientations of the Qy transition dipoles are known only for B740 and B866 (Fig. 1). Using the probability matrix approach, we computed the time t (a.u.) of excitation energy transfer (EET) from B740 to B808 as a function of α, Δ ≡ α − β and φ, where φ determines the sought orientation of the B798 Qy dipoles (Δ ∈ [0°, 180°]; α ∈ [0°, 90°); φ ∈ (−90°, +90°)). Each set of curves t(α, Δ, φ) was computed for R12/R23 = 0.5, 1.0 and 2.0 with R12 + R23 = const (Rij is the distance between dipoles i and j). For each R12/R23 value, one can find stable minima of the curves t(α, Δ, φ) near tmin(φopt), which are much lower than the times tr for randomly oriented dipoles: η ≡ tr/tmin > 1.2. It was found that (1) at R12/R23 = 2, φopt ∈ [0° ± 5°] for α < 30°, Δ ≤ 45°; (2) at R12/R23 = 1, φopt ∈ [±(20–32)°] for α ≤ 30°, Δ ≤ 60°; (3) at R12/R23 = 0.5, φopt ∈ [±(37–70)°] for any Δ and 0° ≤ α ≤ 75°. Experiments in vivo revealed that the second stage is limiting in the EET B740→B798→B808, which corresponds to the case R12/R23 = 0.5. We assumed that in a single chlorosome the B798 subantenna is formed by ordered chains of BChl a protein complexes with either (i) fixed BChl a dipole orientations, according to the Table, for any Δ and 0° ≤ α ≤ 75° (model No. 1), or