MCCMB ’09

PROCEEDINGS OF THE INTERNATIONAL MOSCOW CONFERENCE ON COMPUTATIONAL MOLECULAR BIOLOGY

July 20-23, 2009
Moscow, Russia

Organizers

Department of Bioengineering and Bioinformatics of M.V. Lomonosov Moscow State University
Biological Department of M.V. Lomonosov Moscow State University
State Scientific Centre GosNIIGenetika
Institute for Information Transmission Problems, Russian Academy of Sciences
The Scientific Council on Biophysics, Russian Academy of Sciences
Engelhardt Institute of Molecular Biology, Russian Academy of Sciences

Sponsored by

Russian Fund of Basic Research
INRIA, France (the French National Institute for Research in Computer Science and Control)

4-TH MOSCOW CONFERENCE ON COMPUTATIONAL MOLECULAR BIOLOGY
July 20–23, 2009

NEW METHOD TO IMPROVE ERROR PROBABILITY ESTIMATION APPLIED TO ILLUMINA SEQUENCING
IRINA ABNIZOVA 1, TOM SKELLY 1, YUMI YAN 1, TONY COX 1

1 Wellcome Trust Sanger Institute, Hinxton, United Kingdom, [email protected]

The new short-read sequencing technology has introduced new technological and computational challenges. It requires reconsidering well-known error estimation algorithms so that they take the different sequencing platforms into account.

DETECTION OF GENES THAT UNDERWENT POSITIVE SELECTION IN DEEP-SEA ARCHAEBACTERIA OF PYROCOCCUS GENUS
K.V. GUNBIN 1, D.A. AFONNIKOV 2, N.A. KOLCHANOV 2

1 Institute of Cytology and Genetics SB RAS, [email protected]
2 Institute of Cytology and Genetics SB RAS, Novosibirsk State University, [email protected]; [email protected]

Pressure is an environmental parameter of crucial importance for organisms. Archaeal species of the genus Pyrococcus live under both normal (~0.1 MPa) and high (>10 MPa) pressures. To date, the genomes of three Pyrococcus species have been completely sequenced: P. furiosus lives under normal pressure, whereas P. horikoshii and P. abyssi are piezophilic (they live in the deep sea under high pressures of 14 MPa and 20 MPa, respectively). In this work we analyze the rate of nucleotide substitution in search of genes that underwent positive selection in the deep-sea species of the genus Pyrococcus. A phylogenetic analysis was performed to determine the evolutionary relatedness of the piezophilic Pyrococcus species, with T. kodakaraensis as the outgroup. The phylogenetic tree demonstrates that the piezophilic species have a common origin and that their ancestor emerged from archaebacteria phylogenetically close to the extant Pyrococcus species inhabiting normal-pressure environments. Events of positive selection (PS) associated with adaptation to life under high pressure were searched for in a set of 508 homologous genes whose protein sequences are close homologs (amino acid sequence identity greater than 40%) and have no paralogs in the genomes. We reconstructed the genes and proteins of the most recent ancestor of the piezophilic Pyrococcus species and of the common ancestor of P. furiosus, P. horikoshii and P. abyssi. The reconstructed ancestral gene and protein sequences were compared with the extant sequences using the ratio of nonsynonymous to synonymous substitution rates, the ratio of radical to conservative amino acid replacement rates, and amino acid dissimilarity measures. We used the ArCOG functional classification of the analyzed genes and demonstrated that positive selection events occurred in genes and proteins of the ‘Coenzyme transport and metabolism’ and ‘Energy production and conversion’ functional groups (Table 1). The results suggest that genes of these functional classes may be important for the adaptation of piezophilic Pyrococcus species to the deep-sea environment.

Table 1. ArCOG group enrichment in the full set of analyzed genes and in genes with identified positive selection events. The last column gives the probability that the difference between the numbers of genes in the full and PS sets is observed by chance, according to a Monte Carlo shuffling test with 10^5 replicas. ArCOG groups with a statistically significant difference (p<0.05) are marked with †.

ArCOG group                                                     Number in     Number in     p-value of observing
                                                                full dataset  PS group      by random chance
Amino acid transport and metabolism                             34            8             0.18225
Carbohydrate transport and metabolism                           22            2             0.90471
Cell cycle control; cell division; chromosome partitioning      8             0             *
Cell motility                                                   7             2             0.32479
Cell wall/membrane/envelope biogenesis                          13            1             0.90748
Coenzyme transport and metabolism                               15            6             0.02416 †
Defense mechanisms                                              3             0             *
Energy production and conversion                                33            11            0.01072 †
Inorganic ion transport and metabolism                          16            0             *
Intracellular trafficking; secretion; and vesicular transport   6             0             *
Lipid transport and metabolism                                  5             0             *
Nucleotide transport and metabolism                             24            5             0.36181
Posttranslational modification; protein turnover; chaperones    18            4             0.34395
Replication; recombination and repair                           24            4             0.58199
Secondary metabolites biosynthesis; transport and catabolism    5             1             0.59636
Signal transduction mechanisms                                  3             1             0.42142
[group name missing]                                            28            4             0.70909
Translation; ribosomal structure and biogenesis                 76            14            0.36997
Function unknown                                                87            6             0.99903
General function prediction only                                75            14            0.34782
Not annotated                                                   6             1             0.66633
Total                                                           508           84
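The kind of shuffling test named in the caption of Table 1 can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the exact test statistic, so the sketch assumes a one-sided test (the fraction of random gene sets in which the group is at least as frequent as observed), and the function name `shuffle_enrichment_p` and its parameters are hypothetical.

```python
import random


def shuffle_enrichment_p(total, group_size, ps_size, observed,
                         replicas=10_000, seed=0):
    """Estimate P(group count in a random gene set >= observed).

    total      -- number of genes in the full dataset
    group_size -- number of genes of one functional group in the full dataset
    ps_size    -- number of genes in the positive selection (PS) set
    observed   -- number of genes of that group actually found in the PS set
    Assumption: one-sided enrichment test; the original statistic may differ.
    """
    rng = random.Random(seed)
    # 1 marks a gene of the functional group, 0 marks any other gene.
    labels = [1] * group_size + [0] * (total - group_size)
    hits = 0
    for _ in range(replicas):
        # Draw a random "PS set" of the same size without replacement.
        sample = rng.sample(labels, ps_size)
        if sum(sample) >= observed:
            hits += 1
    return hits / replicas


# Example with the 'Energy production and conversion' row of Table 1.
p_energy = shuffle_enrichment_p(508, 33, 84, 11, replicas=2000, seed=1)
```

The estimate depends on the random seed and on the number of replicas, so it should only be expected to agree approximately with a tabulated value.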

The work was supported by SB RAS integration project №109, Scientific School НШ-2447.2008.4, RAS program “Origin and evolution of Biosphere” and CRDF REC-008 grant.


MATHEMATICAL MODELING OF THE MOLECULAR GENETIC SYSTEMS REGULATING A PLANT DEVELOPMENT
ILYA AKBERDIN 1, FEDOR KAZANTSEV 1, STANISLAV FADEEV 2, IRINA GAINOVA 2, VITALY LIKHOSHVAI 1

1 The Institute of Cytology and Genetics SB RAS, Russian Federation, [email protected]
2 The Institute of Mathematics SB RAS, Russian Federation

Keywords: auxin metabolism, gene network, automatic generation, mathematical model, plant development

Indole-3-acetic acid (IAA) is physiologically active as the free acid but is also found in conjugated forms in plant tissues. IAA can be degraded, and redundant pathways lead to its synthesis. Auxin participates in the regulation of cell differentiation during the development of the embryo, leaves, vascular tissue, fruit, and primary and lateral roots, and in the control of apical dominance and tropisms. The regulation of IAA metabolism (synthesis, conjugation and degradation) is rather complex and may partly explain how this simple substance is able to influence such diverse processes. Mathematical modeling of the IAA metabolic gene network can help reveal the main factors governing this complex process. To this end, we first reconstructed a gene network of auxin biosynthesis, conjugation and degradation by annotating experimental data from 107 published papers in the GeneNet computer system. After reduction, this gene network was fed into a converter to generate a mathematical model of auxin metabolism. We have thus reconstructed the gene network and developed a mathematical model of auxin metabolism in Arabidopsis shoots. The model reproduces some phenomenological and molecular-genetic aspects of the role of auxin in plant development, which supports the adequacy of the developed model. In silico experiments indicate that the molecular genetic regulation of the system's homeostasis involves qualitatively rapid processes. The accumulated experimental data allowed us to begin constructing a spatially distributed hierarchical model that simultaneously describes molecular genetic processes and processes at the level of cell-cell interactions. The cellular automaton model that we developed earlier, which imitates the morphodynamics of embryo development by means of the regulation of signals produced by different embryonic cells, is a first step in modelling the process of development in general and the gene network for morphogenesis in particular [1]. The next step in applying mathematical modeling to the study of the rules of plant development is the integration of the spatially distributed hierarchical model with the model of intracellular auxin metabolism.
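To make the idea of a kinetic model of IAA synthesis, conjugation and degradation concrete, here is a toy sketch. It is not the model generated from GeneNet: the two-pool structure, the first-order kinetics, the explicit Euler integration and every rate constant below are assumptions chosen purely for illustration.

```python
def simulate_iaa(synthesis=1.0, k_conj=0.5, k_hydr=0.2, k_deg=0.1,
                 free0=0.0, conj0=0.0, dt=0.01, t_end=1000.0):
    """Integrate a toy two-pool model of IAA metabolism with explicit Euler.

    d(free)/dt = synthesis - (k_conj + k_deg) * free + k_hydr * conj
    d(conj)/dt = k_conj * free - k_hydr * conj
    All parameter values are hypothetical and carry arbitrary units.
    """
    free, conj = free0, conj0
    steps = int(round(t_end / dt))
    for _ in range(steps):
        d_free = synthesis - (k_conj + k_deg) * free + k_hydr * conj
        d_conj = k_conj * free - k_hydr * conj
        free += dt * d_free
        conj += dt * d_conj
    return free, conj


# At steady state the free pool equals synthesis / k_deg and the
# conjugated pool equals k_conj * free / k_hydr.
free_iaa, conjugated_iaa = simulate_iaa()
```

In this toy system the steady-state level of free IAA is set only by the balance of synthesis and degradation, while conjugation acts as a reversible buffer that changes how quickly the steady state is approached.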

1. I.R.Akberdin, E.A.Ozonov, V.V.Mironova, D.N.Gorpinchenko, N.A.Omelyanchuk, V.A.Likhoshvai, N.A.Kolchanov (2007) A cellular automaton to model the development of shoot meristems of Arabidopsis thaliana, Journal of Bioinformatics and Computational Biology, 5: 641-650.


WATER-MEDIATED HYDROGEN BONDS ARE ESSENTIAL FOR LOOP STABILIZATION IN PROTEIN STRUCTURES
EVGENIY AKSIANOV 1, SERGEI SPIRIN 1,2, ANNA KARYAGINA 1,3,4, ANDREI ALEXEEVSKI 1,2

1 Belozersky Institute, Moscow State University, Moscow, Russia, [email protected]
2 Scientific Research Institute for System Studies (NIISI RAN), Moscow
3 Gamaleya Institute of Epidemiology and Microbiology, 18 Gamaleya st., Moscow, 123098, Russia
4 Institute of Agricultural Biotechnology, 42 Timiryazevskaya st., Moscow, 127550, Russia

Keywords: protein structure, water, hydrogen bond, water-mediated bond

Protein structures are mostly composed of secondary structural elements (SSEs): alpha-helices and beta-strands. SSEs are connected by unstructured regions (loops). Loops resolved in X-ray experiments are not flexible; they are stable, at least in a crystal. Regular networks of hydrogen bonds (H-bonds) stabilize both helices and sheets and are important for SSE stability. No regular hydrogen bond networks are known to stabilize loop conformations. Based on a number of examples, we hypothesized that intradomain hydrogen bonds mediated by water molecules contribute significantly to the stabilization of loops. To test this hypothesis, we analyzed intradomain direct and water-mediated hydrogen bonds in a non-redundant set of high-resolution X-ray structures of protein domains.

Methods. 995 protein domains were obtained from the SCOP 1.73 database; the sequence identity between each pair of domains was ≤90%, and all structures were X-ray structures with a resolution better than 1.5 Å. Secondary structural elements (β-strands and α-helices) were detected using the DSSP algorithm. An H-bond was defined as a pair of atoms such that (1) one atom can be a proton donor and the other a proton acceptor, (2) the distance between the atoms is 2.3–3.7 Å, and (3) the angle between the direction of the H-bond and the optimal H-bond direction is ≤40° for both atoms.

Results. The numbers of H-bonds and water-mediated bonds between helices, strands and loops (per 20 residues of the corresponding SSEs) are shown in Table 1. The numbers of backbone-backbone H-bonds per 20 residues in helices and sheets (11.5 for strands and 12.1 for helices) are lower than the maximal possible 20, mainly because of the large number of short helices and hairpins (where the number of regular H-bonds is half as large) and irregularities in the SSE H-bond networks.
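The geometric part of the H-bond definition in Methods, criteria (2) and (3), can be sketched as below. This is a minimal illustration, not the authors' program: the function names are hypothetical, criterion (1) is assumed to have been checked by the caller, and the "optimal direction" of each atom is assumed to be supplied as a vector, since the abstract does not say how it is derived.

```python
import math


def angle_deg(u, v):
    """Angle between two 3D vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))


def is_hbond(donor_xyz, acceptor_xyz, donor_dir, acceptor_dir,
             d_min=2.3, d_max=3.7, max_angle=40.0):
    """Apply the distance and angle criteria to one donor/acceptor pair.

    donor_dir and acceptor_dir are the assumed optimal H-bond directions of
    the two atoms (hypothetical inputs). Distances are in angstroms.
    """
    # Vector pointing from the donor to the acceptor.
    bond = tuple(a - d for d, a in zip(donor_xyz, acceptor_xyz))
    distance = math.sqrt(sum(c * c for c in bond))
    if not (d_min <= distance <= d_max):
        return False
    # Seen from the acceptor, the bond points back towards the donor.
    reverse = tuple(-c for c in bond)
    return (angle_deg(donor_dir, bond) <= max_angle
            and angle_deg(acceptor_dir, reverse) <= max_angle)


# A 3.0 angstrom pair whose optimal directions point straight at each other.
example = is_hbond((0.0, 0.0, 0.0), (3.0, 0.0, 0.0),
                   (1.0, 0.0, 0.0), (-1.0, 0.0, 0.0))
```

A water-mediated bond could then be detected, under the same assumptions, as two such bonds that share one water oxygen.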

Table 1. Number of direct/water-mediated hydrogen bonds between helices, strands, and loops.

                                Strands                         Helices                         Loops
                                Side chain      Backbone        Side chain      Backbone        Side chain      Backbone
Length (1)                      40897           40897           47681           47681           37062           37062
Atoms (2)                       14.13           40.00           17.93           40.00           89.75           40.00
Bonds within the same SSE (3)   0.51 / 1.23     0.00 / 0.02     1.49 / 2.45     12.14† / 0.05   3.18 / 4.57†    1.78 / 0.83
  (backbone to side chain)      0.10 / 0.07                     0.14 / 0.07                     0.25 / 0.13

Bonds between different SSEs (4)

                Strands (s.c.)  Strands (bb.)   Helices (s.c.)  Helices (bb.)   Loops (s.c.)    Loops (bb.)
Strands (s.c.)  2.07 / 8.45†    0.23 / 1.27     0.36 / 0.76     0.03 / 0.14     1.40 / 3.95     0.85 / 2.05
Strands (bb.)   0.23 / 1.27     11.48† / 0.5    0.06 / 0.34     0.03 / 0.05     0.32 / 0.93     1.31 / 0.82
Helices (s.c.)  0.36 / 0.76     0.06 / 0.34     1.22 / 10.13†   0.49 / 0.99     1.76 / 4.10†    0.80 / 1.77
Helices (bb.)   0.03 / 0.14     0.03 / 0.05     0.49 / 0.99     0.07 / 0.38     0.74 / 1.02     2.00 / 0.52
Loops (s.c.)    1.40 / 3.95     0.32 / 0.93     1.76 / 4.10†    0.74 / 1.02     2.65 / 20.75†   3.48 / 9.04†
Loops (bb.)     0.85 / 2.05     1.31 / 0.82     0.80 / 1.77     2.00 / 0.52     3.48 / 9.04†    1.52 / 4.54†

(1) The total length of all elements in the investigated structures (in amino acids).
(2) Number of hydrogen donors and acceptors per 20 residues.
(3) 0.51 direct bonds and 1.23 water-mediated bonds per 20 amino acids. The same notation is used in all other cells of the table.
(4) s.c. means side chains, bb. means backbone atoms.
(5) † marks numbers greater than 4 bonds per 20 residues.

It follows from Table 1 that water-mediated bonds between the side chains of two helices or two strands were detected for approximately half of the residues. In loop-to-loop interactions, water-mediated bonds were detected on average for each residue, and their contribution to loop-loop interactions exceeds that of direct H-bonds.

We conclude that intradomain water-mediated bonds are a common feature of protein structures. Such bonds may be especially important for loop stabilization.

The work is partly supported by the Russian Foundation for Basic Research, grants 07-04-91560 and 08-04-91975.


GENOMIC INSIGHTS INTO THE ORIGINS OF METAZOAN CELL DIFFERENTIATION
KIRILL V. MIKHAILOV 1, A.V. KONSTANTINOVA 1, M.A. NIKITIN 1, V.V. ALEOSHIN 1, L.YU. RUSIN 2, YURI V. PANCHIN 2

1 Belozersky Institute for Physicochemical Biology, Lomonosov Moscow State University, Moscow, Russian Federation, [email protected]
2 Institute for Information Transmission Problems, Russian Academy of Sciences, Moscow 127994, Russian Federation

Keywords: Mesomycetozoea; molecular phylogenetics; origin of Metazoa

Choanoflagellates and mesomycetozoeans are two groups of unicellular organisms that are the closest relatives of animals [1]. The ongoing genome sequencing effort aimed at their members is an attempt to understand the origin of animals and multicellularity in the context of the evolution of genes and genomes [2]. These studies have brought about the notion of “Metazoa-specific” genes: genes found exclusively in metazoans, which are thus considered likely to be novelties specifically associated with the development of multicellularity. The “Metazoa-specific” genes encode a large number of cell signalling and adhesion proteins, such as cadherins and TGF-β pathway components, to name a few. However, the list of “Metazoa-specific” genes is rapidly contracting as the number of sequenced genomes of unicellular relatives of metazoans increases. The genomes of choanoflagellates were found to contain a multitude of tyrosine kinases, proteins involved in the regulation of cell proliferation and motility that were originally considered to be a metazoan novelty [3]. Another example is a mesomycetozoean that possesses components involved in cell-matrix adhesion, such as focal adhesion kinase and integrin beta [4]. Here we present evidence for the exclusion of yet another set of genes from the “Metazoa-specific” list by demonstrating their presence in another mesomycetozoean and showing that they are actively expressed. The premetazoan ancestry of metazoan transcription factor families and signal transduction pathways is poorly accommodated by the traditional view of the metazoan ancestors as blastula-like colonies that subsequently underwent cell differentiation. The new data suggest that elements of the genetic toolkit for the development of multicellular animals were possibly already in use by their unicellular relatives.

Mapping of major gene families and ecological traits onto the phylogeny indicates that the presence of different cell types at different stages of the life cycle and the appearance of multicellular aggregates are intrinsic properties not of metazoans alone, but of a much wider group of organisms, the Opisthokonta [5]. The emerging scenario regards the last common ancestor of multicellular animals as an integration of different stages of the unicellular ancestor’s life cycle.

1. E.T.Steenkamp, J.Wright, S.L.Baldauf (2006) The protistan origins of animals and fungi, Molecular Biology and Evolution, 23: 93–106.
2. I.Ruiz-Trillo, G.Burger, P.W.Holland, N.King, B.F.Lang, et al. (2007) The origins of multicellularity: a multi-taxon genome initiative, Trends in Genetics, 23: 113–118.
3. N.King, M.J.Westbrook, S.L.Young, A.Kuo, M.Abedin, et al. (2008) The genome of the choanoflagellate Monosiga brevicollis and the origin of metazoans, Nature, 451: 783–788.
4. K.Shalchian-Tabrizi, M.A.Minge, M.Espelund, R.Orr, T.Ruden, et al. (2008) Multigene phylogeny of choanozoa and the origin of animals, PLoS ONE, 3: 2098.
5. K.V.Mikhailov, A.V.Konstantinova, M.A.Nikitin, P.V.Troshin, L.Yu.Rusin, V.A.Lyubetsky, Y.V.Panchin, et al. (2009) The origin of Metazoa: a transition from temporal to spatial cell differentiation, Bioessays, 31: (in press).


INHERENT POTENTIALITIES OF VORONOI-DELAUNEY TESSELLATION AS APPLIED TO BIOLOGY PROBLEMS
ANASTASYA ANASHKINA 1, NATALIA ESIPOVA 1, VLADIMIR TUMANYAN 1

1 Engelhardt Institute of Molecular Biology RAS, Russian Federation, [email protected]

Researchers in many fields have long used the Voronoi-Delaunay tessellation effectively to solve various problems. In recent years, interest in this method has grown because of its potential in complex biological studies, in addition to crystallography and chemistry. By definition, the Voronoi polyhedron (or Voronoi region) of a given center is the part of space whose points lie closer to that center than to any other center of the system. A tetrahedron built on four centers of the system is a Delaunay simplex if its circumsphere contains no other centers of the system. The set of all Delaunay simplexes of a system, like the set of its Voronoi polyhedra, fills space without gaps or overlaps. These tessellations are dual and topologically equivalent. The uniqueness of the Voronoi-Delaunay tessellation and its independence of any parameters make the method extremely attractive for researchers. Such mathematical rigor and exactness are very rare in the biological sciences. The method has been developed for both the two-dimensional and the three-dimensional case. Modifications of the basic method provide additional capabilities and make it possible to analyze not only systems of points but also systems of spheres of similar radii, systems of spheres of different radii, and systems of bodies of arbitrary shape [1]. The Voronoi-Delaunay tessellation encounters some problems in practical use. In particular, boundary conditions must be set. Another problem is the long computation time for systems of many atoms. The two-dimensional Voronoi-Delaunay tessellation is used even for the analysis of cell culture architecture. A Voronoi facet, like a Delaunay edge, provides a natural, unambiguous, non-parametric way to reveal nearest neighbors in three-dimensional space. This procedure is equivalent to identifying contacts between atoms. Consequently, the Voronoi-Delaunay tessellation allows the calculation of local atomic density and of contacts between biopolymer molecules.

A contact between two atoms, in this case, is a common facet of their Voronoi polyhedra. Accordingly, a contact between two residues is defined as the set of common facets of the Voronoi polyhedra of their atoms. It is therefore possible to explore the statistics of contacts between atoms or residues/nucleotides in protein-protein [2] and protein-nucleic acid [3] interfaces. Knowledge of the rules that control interactions in protein-protein interfaces is necessary for the correct prediction of interaction sites on the surface of a protein or protein complex. It may also be that application of this powerful method will settle the question of whether some kind of code of nucleic acid-protein recognition exists. The Voronoi network (more specifically, the Voronoi S-network) is the main tool for the analysis of empty interatomic space. This network runs through the interatomic space of the system and represents the locus of points most distant from the atoms [4].
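The empty-circumsphere definition of a Delaunay simplex, and the way Delaunay edges reveal nearest neighbors, can be illustrated in the two-dimensional case, where the simplex is a triangle and the circumsphere is a circumcircle. The sketch below is a brute-force illustration written for clarity, not an efficient algorithm and not the authors' software; the function names are hypothetical, and points in degenerate position (for example four points on one circle) are not treated specially.

```python
from itertools import combinations


def circumcircle(a, b, c):
    """Return ((x, y), radius squared) of the circle through a, b, c,
    or None if the three points are collinear."""
    ax, ay = a
    bx, by = b
    cx, cy = c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return (ux, uy), (ax - ux) ** 2 + (ay - uy) ** 2


def delaunay_triangles(points, eps=1e-9):
    """Triangles whose circumcircle contains no other point of the system."""
    triangles = []
    for i, j, k in combinations(range(len(points)), 3):
        circle = circumcircle(points[i], points[j], points[k])
        if circle is None:
            continue
        (ux, uy), r2 = circle
        if all((px - ux) ** 2 + (py - uy) ** 2 >= r2 - eps
               for m, (px, py) in enumerate(points) if m not in (i, j, k)):
            triangles.append((i, j, k))
    return triangles


def delaunay_neighbors(points):
    """Map each point index to the indices it shares a Delaunay edge with."""
    neighbors = {i: set() for i in range(len(points))}
    for triangle in delaunay_triangles(points):
        for a, b in combinations(triangle, 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


# Four corners of a square and its center: the center touches every corner,
# while opposite corners are not neighbors.
points = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0), (2.0, 2.0)]
neighbors = delaunay_neighbors(points)
```

The cost of this exhaustive check grows as the fourth power of the number of points, which is one way to see why computation time becomes a concern for systems of many atoms.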

1. N.N.Medvedev (2000) Metod Voronogo-Delone v issledovanii struktury nekristallicheskih sistem, Novosibirsk: NIC OIGGM SO RAN.
2. A.Anashkina et al. (2007) Comprehensive statistical analysis of residues interaction specificity at protein-protein interfaces, Proteins, 67(4): 1060-77.
3. A.A.Anashkina et al. (2008) Geometricheskij analiz DNK-belkovyh vzaimodejstvij na osnove metoda Voronogo-Delone, Biofizika, 53(3): 402-6.
4. N.N.Medvedev, V.P.Voloshin (2003) Issledovanie mezhatomnyx pustot v molekulyarnyh sistemah, Struktura i dinamika molekulyarnyh sistem, X (1): 299-304.

This research was supported by the Russian Foundation for Basic Research, grants 07-04-01765a and 08-04-01770a.


COMPUTATIONAL ANTI-AIDS DRUG DESIGN RESULTING FROM THE STUDY ON SPECIFIC INTERACTIONS OF IMMUNOPHILINS WITH THE HIV-1 GP120 V3 LOOP
ALEXANDER ANDRIANOV 1

1 Institute of Bioorganic Chemistry, National Academy of Sciences of Belarus, Kuprevich Street, 5/2, 220141 Minsk, Republic of Belarus, [email protected]

Keywords: HIV-1, V3 Loop, 3D Structure, Computer Modeling, Molecular Docking

Currently, research teams working on anti-AIDS drugs pay special attention to the HIV-1 V3 loop (reviewed in [1]). The strong interest in V3 stems from numerous experimental data showing that this gp120 site is the principal target for neutralizing antibodies and determines the choice of co-receptor, and thus the preference of the virus for T-lymphocytes or primary macrophages. Since the V3 loop governs cell tropism and cell fusion (see, e.g., [1]), one strategy for developing anti-HIV-1 drugs may be to search for chemicals capable of efficiently blocking this functionally significant stretch of gp120. Comprehensive analysis of the data of study [2] suggests that immunophilins, which exhibit specific high-affinity interactions with the HIV-1 V3 loop, may serve as a starting point in the search for potential anti-AIDS therapeutic agents. This work continues my previous study [3], in which a virtual molecule representing a promising anti-HIV-1 pharmacological substance was designed by computer modeling based on the analysis of specific interactions between the FK506-binding protein and a synthetic peptide imitating the immunogenic crown of the V3 loop. The aim of the present study was to generate a model of the structural complex of cyclophilin A (CycA) with the HIV-MN V3 loop, followed by the computer-aided design of an immunophilin-derived peptide able to mask the biologically important V3 segments.

To this end, the following problems were solved: (i) an NMR-based conformational analysis of the HIV-MN V3 loop was carried out, and its low-energy structure fitting the input experimental observations was determined; (ii) molecular docking of this V3 structure with the X-ray conformation of CycA was carried out, and the energy of the simulated structural complex was refined; (iii) the matrix of inter-atomic distances for the amino acids of the molecules forming the resulting supramolecular ensemble was computed, the types of interactions responsible for its stabilization were analyzed, and the CycA stretch that accounts for binding to V3 was identified; (iv) the most probable 3D structure of this stretch in the unbound state was predicted and compared with the X-ray structure of the corresponding site of CycA; (v) the potential energy function and its components were studied for the structural complex generated by molecular docking of the V3 loop with the CycA peptide, a virtual molecule that imitates the CycA segment making a key contribution to the interactions of the native protein with the HIV-1 principal neutralizing determinant; (vi) as a result, the designed molecule was shown to be capable of effectively blocking the functionally crucial V3 sites; and (vii) starting from a joint analysis of the results obtained here and in study [3], the composition of a peptide cocktail representing a promising anti-AIDS pharmacological substance was developed.
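Step (iii) above, computing a matrix of inter-atomic distances between the two molecules of a complex and reading off which residues are in contact, can be sketched as follows. This is a generic illustration rather than the procedure used in the study: the function names, the input layout (a residue number paired with coordinates for each atom) and the 4.0 Å cutoff are all assumptions.

```python
import math


def distance_matrix(atoms_a, atoms_b):
    """Distances between every atom of molecule A and every atom of B.

    Each atom is a (residue_number, (x, y, z)) pair; rows follow atoms_a
    and columns follow atoms_b.
    """
    return [[math.dist(xyz_a, xyz_b) for _, xyz_b in atoms_b]
            for _, xyz_a in atoms_a]


def contacting_residues(atoms_a, atoms_b, cutoff=4.0):
    """Residue pairs with at least one atom pair within `cutoff` angstroms.

    The cutoff value is a hypothetical choice for illustration only.
    """
    matrix = distance_matrix(atoms_a, atoms_b)
    pairs = set()
    for (res_a, _), row in zip(atoms_a, matrix):
        for (res_b, _), distance in zip(atoms_b, row):
            if distance <= cutoff:
                pairs.add((res_a, res_b))
    return sorted(pairs)


# Toy coordinates: two atoms of a ligand residue 1 and one atom each of
# receptor residues 10 and 11.
ligand = [(1, (0.0, 0.0, 0.0)), (1, (1.5, 0.0, 0.0))]
receptor = [(10, (3.0, 0.0, 0.0)), (11, (9.0, 0.0, 0.0))]
contacts = contacting_residues(ligand, receptor)
```

Classifying the types of interaction behind each contact would need atom types and geometry beyond plain distances, which this sketch leaves out.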
The molecules simulated here by molecular modeling methods may become the first representatives of a new class of chemicals (immunophilin-derived peptides) offering promising basic structures for the design of efficacious and safe antiviral agents.

The author thanks the Belarusian Republican Foundation for Basic Research for financial support (project No. X08-003).

1. S.Sirois, T.Sing, K.C.Chou (2005) HIV-1 gp120 V3 loop for structure-based drug design, Curr. Protein Pept. Sci., 6: 413-422.
2. M.M.Endrich, H.Gehring (1998) The V3 loop of human immunodeficiency virus type-1 envelope protein is a high-affinity ligand for immunophilins present in human blood, Eur. J. Biochem., 252: 441-446.
3. A.M.Andrianov (2008) Computational anti-AIDS drug design based on the analysis of the specific interactions between immunophilins and the HIV-1 gp120 V3 loop. Application to the FK506-binding protein, J. Biomol. Struct. Dynam., 26: 49-56.


HOMOLOGY MODELING AND MOLECULAR DYNAMICS IN STRUCTURAL STUDIES ON THE HIV-1 GP120 V3 LOOPS: INSIGHT INTO THE VIRUS SUBTYPE A
IVAN ANISHCHENKO 1, ALEXANDER ANDRIANOV 2

1 United Institute of Informatics Problems, National Academy of Sciences of Belarus, Surganov Street 6, 220012 Minsk, Republic of Belarus, [email protected]
2 Institute of Bioorganic Chemistry, National Academy of Sciences of Belarus, Kuprevich Street, 5/2, 220141 Minsk, Republic of Belarus, [email protected] net.by

Keywords: HIV-1, V3 Loop, 3D Structure, Computer Modeling, Molecular Docking

The V3 loop of the HIV-1 gp120 glycoprotein, a 35-residue-long, frequently glycosylated, highly variable, disulfide-bonded structure, plays a central role in the biology of the virus and forms the principal target for neutralizing antibodies and the major viral determinant of co-receptor binding. Here we present computer-aided studies of the 3D structure of the HIV-1 subtype A V3 loop (SA-V3 loop), in which its structurally inflexible regions and individual amino acids were identified and a structure-function analysis of V3 was carried out to provide informational support for anti-AIDS drug research. To this end, the following successive steps were carried out: (i) using homology modeling and simulated annealing, an ensemble of low-energy structures was generated for the consensus amino acid sequence of the SA-V3 loop, and its most probable conformation was determined on the basis of the general criteria widely adopted as measures of the quality of protein structures in terms of their 3D folds and local geometry; (ii) the secondary structure elements of V3 in the built conformations were characterized, and the corresponding experimental data for the V3 loops of various HIV-1 strains were carefully analyzed; (iii) to reveal structural motifs common to the HIV-1 V3 loops regardless of their sequence variability and changes in the medium, the simulated structures were compared with each other as well as with the V3 structures determined by NMR spectroscopy and X-ray studies for diverse virus isolates in different environments; (iv) to examine the conformational features of the SA-V3 loop, a molecular dynamics trajectory was computed starting from its static 3D structure, the structurally rigid V3 segments were determined, and the findings were compared with those obtained in the preceding steps; and (v) to evaluate the masking effect that can arise from the interaction of the SA-V3 loop with two virtual molecules constructed previously [1, 2] by computational modeling and named the FKBP and CycA peptides, molecular docking of V3 with these molecules was carried out, and the inter-atomic contacts appearing in the simulated complexes were analyzed to identify the V3 stretches in contact with the ligands.

As a result, V3 segments 3-7, 15-20, and 28-32, which contain highly conserved and biologically meaningful residues of gp120, were shown to retain the 3D shapes of their main chains in all the cases of interest, making them promising targets for anti-AIDS drug research. Based on the molecular docking data, synthetic analogs of the CycA and FKBP peptides were suggested as suitable frameworks for V3-based anti-HIV-1 drug projects. In addition, the computational V3 model proposed above provides a productive basis for gaining better insight into the principles of virus functioning and can therefore be used in subsequent studies to investigate structure-function relationships, to examine the structural effects of mutations, or to distinguish between various forms of the V3 loop under different conditions.
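One common way to locate structurally rigid segments in a molecular dynamics trajectory, as in step (iv), is to compute the root-mean-square fluctuation (RMSF) of each residue and keep runs of residues with low values. The abstract does not state which measure was used, so the sketch below is an assumed stand-in: it expects frames that have already been superimposed, one representative atom per residue, and a threshold and minimum run length that are hypothetical.

```python
import math


def rmsf(trajectory):
    """Per-residue root-mean-square fluctuation.

    trajectory is a list of frames; each frame is a list of (x, y, z)
    coordinates, one per residue (for example C-alpha atoms). The frames
    are assumed to be superimposed on a common reference beforehand.
    """
    n_frames = len(trajectory)
    values = []
    for r in range(len(trajectory[0])):
        mean = [sum(frame[r][k] for frame in trajectory) / n_frames
                for k in range(3)]
        msd = sum(sum((frame[r][k] - mean[k]) ** 2 for k in range(3))
                  for frame in trajectory) / n_frames
        values.append(math.sqrt(msd))
    return values


def rigid_segments(values, threshold, min_len=3):
    """1-based inclusive (first, last) runs of residues with values <= threshold."""
    segments = []
    start = None
    for i, value in enumerate(values):
        if value <= threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                segments.append((start + 1, i))
            start = None
    if start is not None and len(values) - start >= min_len:
        segments.append((start + 1, len(values)))
    return segments


# Toy fluctuation profile with two low-mobility runs.
profile = [0.2, 0.3, 0.2, 1.5, 1.6, 0.1, 0.2, 0.3, 0.2]
segments = rigid_segments(profile, threshold=0.5)
```

In practice the threshold would have to be chosen relative to the overall mobility of the loop, and the result depends on how the frames were superimposed.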

1. A.M.Andrianov (2008) Computational anti-AIDS drug design based on the analysis of the specific interactions between immunophilins and the HIV-1 gp120 V3 loop. Application to the FK506-binding protein, J. Biomol. Struct. Dynam., 26: 49-56.
2. A.M.Andrianov (2009) Immunophilins and HIV-1 V3 loop for structure-based anti-AIDS drug design, J. Biomol. Struct. Dynam., 26: 445-454.

This study was supported by grants from the Union State of Russia and Belarus (scientific program SKIF-GRID; № 4U-S/07-111) as well as from the Belarusian Foundation for Basic Research (project X08-003).


3D STRUCTURE MODELING AND POSTERIOR COLLATION OF THE HIV-1 V3 VARIABLE LOOPS FOR DISCOVERY OF THEIR STRUCTURALLY INVARIANT SITES EXPOSING THE ACHILLES' HEEL IN THE HIV-1 “REDOUBTS”
A. M. ANDRIANOV 1, I.V. ANISHCHENKO 2

1 Institute of Bioorganic Chemistry, National Academy of Sciences of Belarus, Kuprevich Street, 5/2, 220141 Minsk, Republic of Belarus, [email protected]
2 United Institute of Informatics Problems, National Academy of Sciences of Belarus, Surganov Street 6, 220012 Minsk, Republic of Belarus, [email protected]

Keywords: HIV-1, V3 Loop, 3D Structure, Computer Modeling, Molecular Docking

The HIV-1 gp120 V3 loop, which forms the principal neutralizing determinant of the virus as well as determinants of cell tropism and cell fusion, is considered one of the promising targets for anti-AIDS drug studies (reviewed in [1]). The V3 loops of different HIV-1 isolates have highly variable amino acid sequences, which prevents antibodies bound to the V3 loop of one isolate from acting on the V3 loops of other isolates. However, analysis of various HIV-1 V3 loop sequences shows that, despite their high variability, which fundamentally complicates studies of the V3 loop structure, some amino acid positions located in the N- and C-terminal regions, and especially those in the immunogenic tip, are highly conserved. The conservation of these V3 stretches suggests that the residues occupying them may preserve their conformational states in diverse HIV-1 strains and may therefore be promising targets for the development of new therapeutic agents. Information on the 3D structure of V3 and its inflexible regions is thus needed and is of particular importance for the success of anti-AIDS drug studies [1]. In the light of the above, computational approaches combining NMR-based protein structure modeling with methods of mathematical statistics were used here to determine locally accurate 3D structures of the HIV-1 gp120 V3 loops from the Minnesota, Haiti, RF, and Thailand isolates in water solution, as well as from the Minnesota and Haiti isolates in a water/trifluoroethanol mixed solvent.

To identify the structural motifs of V3 that adopt similar spatial folds regardless of sequence and environment variability, the simulated structures and their individual segments of different lengths were compared with one another and with those derived previously from homology modeling [2] and X-ray crystallography [3]. As a result, the sequence and environment changes were found to trigger considerable structural rearrangements of the V3 loop, but, at the same time, some of the functionally crucial V3 stretches were shown to keep their 3D shapes in all the cases in question. This applies first of all to the core V3 sequence 15-20 and to its N- and C-terminal sites 3-7 and 28-32, which comprise residues that contribute significantly to the immunogenicity and cell tropism of the virus. In addition, the structurally rigid V3 stretch 3-7 includes the highly conserved glycosylation site of gp120 used by the virus for defense against neutralizing antibodies and for increasing its infectivity. In the context of these findings, the inflexible V3 motifs identified in this study may be weak points in the HIV-1 protection system, and their detection is therefore of great importance for the successful design of V3-based anti-AIDS drugs able to stop the spread of HIV.

1. S. Sirois, T. Sing, K.C. Chou (2005) HIV-1 gp120 V3 loop for structure-based drug design, Curr. Protein Pept. Sci., 6: 413-422.
2. I.V. Anishchenko, A.M. Andrianov (2008) Computer-aided modeling of the 3D structure for the HIV-1 gp120 V3 loop: exploring the virus subtype A, Proceedings of II International Conference “Advanced Information and Telemedicine Technologies for Health” (Minsk, 2008): 12-16.
3. C.C. Huang, M. Tang, M.Y. Zhang, S. Majeed, E. Montabana, R.L. Stanfield, D.S. Dimitrov, B. Korber, J. Sodroski, I.A. Wilson, R. Wyatt, P.D. Kwong (2005) Structure of a V3-containing HIV-1 gp120 core, Science, 310: 1025-1028.

This study was supported by grants from the Union State of Russia and Belarus (scientific program SKIF-GRID; № 4U-S/07-111) as well as from the Belarusian Foundation for Basic Research (project X08-003).

POLYCTLDESIGNER – THE SOFTWARE FOR CONSTRUCTING POLYEPITOPE IMMUNOGENS.
DENIS ANTONETS 1, AMIR MAKSYUTOV 2, SERGEY BAZHAN 3

1 Research Center of Virology and Biotechnology Vector, Russian Federation, [email protected]
2 Research Center of Virology and Biotechnology Vector, Russian Federation, [email protected]
3 Research Center of Virology and Biotechnology Vector, Russian Federation, [email protected]

Keywords: Immunity, cytotoxic T-lymphocyte, T-cell epitope, polyepitope antigen

Design of artificial polyepitope immunogens capable of eliciting high levels of CD8+ CTL responses is a promising approach to creating efficient vaccines. When designing such immunogens, it is necessary to optimize the processing and presentation of the epitopes they contain. DNA vaccine constructs encoding poly-CTL-epitope immunogens that contain N-terminal ubiquitin and spacer sequences ensuring correct processing and presentation of the selected epitopes were shown to be highly efficient in stimulating CD8+ CTL responses. These results inspired us to create the PolyCTLDesigner software, intended for designing optimal polyepitope antigens. To optimize a polytope sequence for inducing a high level of CTL response, one should take into account the major steps of MHC class I-dependent antigen processing: proteasomal/immunoproteasomal cleavage of the antigen and TAP-dependent transport of the generated peptide fragments into the endoplasmic reticulum, where they bind to MHC class I molecules. To predict proteasomal/immunoproteasomal processing, PolyCTLDesigner uses the predictive models developed by Toes et al. [1]. The site of proteasomal cleavage should be located at the C-terminus of the epitope. Thus, to optimize proteasomal cleavage (if necessary), the C-terminus of the epitope should be extended with a spacer motif of up to six amino acid residues. To predict peptide binding to TAP, our program uses the models developed by Peters et al. [2]. According to a widely accepted hypothesis, the major contributions to TAP binding come from the first three N-terminal amino acid residues of the peptide and the last (C-terminal) one. Since the C-terminus of the epitope must stay unchanged, only the N-terminus of the antigenic peptide can be extended to optimize its interaction with the TAP1/TAP2 heterodimer.
According to the chosen models and algorithms for TAP-binding prediction, the maximal length of the N-terminal spacer sequence is three residues: ARY. PolyCTLDesigner is integrated with the previously created TEpredict program (http://tepredict.sourceforge.net), which it uses to predict T-cell epitopes. PolyCTLDesigner allows the user to select a minimal set of epitopes with known (or predicted) specificity towards various allelic variants of MHC class I molecules, covering the selected MHC repertoire with a specified redundancy. Currently, PolyCTLDesigner uses two algorithms to design polyepitope immunogens. The first uses an optimal spacer motif derived from the selected predictive models (e.g., ADLVKV). The second uses a redundant spacer motif and minimizes the formation of "non-target" epitopes in the sequence of the desired polyepitope immunogen. The developed software implements a rational approach to designing highly immunogenic poly-CTL-epitope vaccine constructs and can be used for designing new candidate polyepitope vaccines capable of eliciting high levels of T-cell-mediated immune responses. A more detailed description of the program and its source code are available at http://tepredict.sourceforge.net/PolyCTLDesigner.html. The program is written in the Python programming language (http://python.org).
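
The epitope selection step described above (a small set of epitopes covering the chosen MHC class I alleles with a specified redundancy) can be illustrated with a greedy set-cover sketch. This is not PolyCTLDesigner's actual code: the function name `select_epitopes`, the input format, the epitope labels and the allele assignments are all hypothetical, and a greedy pass only approximates a minimal set.

```python
def select_epitopes(epitope_alleles, alleles, redundancy=1):
    """Greedily pick epitopes until every allele is covered `redundancy` times.

    epitope_alleles: dict mapping an epitope label to the set of MHC class I
    alleles it is known (or predicted) to bind. Hypothetical input format.
    """
    need = {allele: redundancy for allele in alleles}
    remaining = dict(epitope_alleles)
    chosen = []

    def gain(epitope):
        # Number of still-uncovered allele slots this epitope would fill.
        return sum(1 for a in remaining[epitope] if need.get(a, 0) > 0)

    while any(n > 0 for n in need.values()):
        best = max(remaining, key=gain, default=None)
        if best is None or gain(best) == 0:
            raise ValueError("requested coverage cannot be reached")
        for allele in remaining.pop(best):
            if need.get(allele, 0) > 0:
                need[allele] -= 1
        chosen.append(best)
    return chosen


# Toy data: labels and allele assignments are made up for illustration.
toy = {
    "E1": {"HLA-A*02:01", "HLA-B*07:02"},
    "E2": {"HLA-A*02:01"},
    "E3": {"HLA-B*07:02"},
}
targets = ["HLA-A*02:01", "HLA-B*07:02"]
print(select_epitopes(toy, targets, redundancy=1))
print(select_epitopes(toy, targets, redundancy=2))
```

With a redundancy of 1 a single broadly binding epitope may suffice, while a higher redundancy forces additional epitopes into the set. An exact minimal set would require an exhaustive or integer-programming search instead of the greedy pass used here.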

1. Toes R.E. et al. (2001) Discrete Cleavage Motifs of Constitutive and Immunoproteasomes Revealed by Quantitative Analysis of Cleavage Products. J. Exp. Med., 194:1-12.
2. Peters B. et al. (2003) Identifying MHC class I epitopes by predicting the TAP transport efficiency of epitope precursors. J. Immunol., 171:1741-1749.

GENOME-WIDE SEARCH FOR 5’-UTR OF SACCHAROMYCES CEREVISIAE GENES AND THEIR ORTHOLOGS
KIRILL ANTONEZ 1, ALSU SAIFITDINOVA 2

1 Saint-Petersburg State University, Russian Federation, [email protected]
2 Saint-Petersburg branch of Vavilov Institute of General Genetics RAS, Russian Federation, [email protected]

Keywords: yeast, 5'-UTR

Motivation and Aims: Prokaryotic and eukaryotic mRNAs are an important intermediate in protein biosynthesis and consist of a coding sequence and untranslated regions (UTRs). UTRs play an essential role in the posttranscriptional life of mRNA and may harbor regulatory elements in addition to translation initiation sequences. The 5’-UTRs of both prokaryotic and eukaryotic mRNAs may also form stable secondary structures that influence the efficiency of translation initiation. Certain 5’-UTRs contain riboswitches that regulate protein synthesis by ligand binding and decrease or enhance translation efficiency [1]. Realization of genetic information in eukaryotes includes RNA processing, transport from the nucleus to the cytoplasm, translation and decay [2, 3]. There are regulatory elements in 5’- and 3’-UTRs that hasten mRNA decay. UTRs may also contain stem structures with which specific proteins interact, leading to inhibition or initiation of translation [4]. Besides the main ORF, an mRNA may contain an upstream ORF located in the 5’-UTR that decreases the efficiency of translation [5]. All these elements can regulate tissue-specific protein production and rapid response to stress, or influence the development and progression of disease [6]. It is therefore important to identify regulatory sequences in mRNA. A common way to find regulatory elements is to compare a set of sequences that harbor putative elements. Currently there is no convenient tool for the analysis of Saccharomyces cerevisiae 5’-UTRs. Our aim was to write a program that retrieves the set of 5’-UTRs of yeast genes and their orthologs.

Methods and Algorithms: The program was written in C# for .NET Framework 3.5 using Microsoft Visual Studio 2008 and Windows Workflow Foundation. Data on yeast genes were taken from the Saccharomyces Genome Database (www.yeastgenome.org) and from published data on UTR lengths [7, 8]. Information about yeast gene orthologs was obtained from the Princeton Protein Orthology Database (ppod.princeton.edu). Detailed data for other organisms were taken from WormBase (www.wormbase.org), FlyBase – A Database of Drosophila Genes & Genomes (www.flybase.org), TAIR (www.arabidopsis.org), Mouse Genome Informatics (www.informatics.jax.org), Protein Knowledgebase (www.uniprot.org) and Homo sapiens genes (NCBI36). The BioMart tool (www.ensembl.org/biomart/) was used for downloading human 5’-UTR sequences.

Results: We have designed the program UTRdbMaker for obtaining a set of 5’-UTRs. For the given gene names, it retrieves information containing the ORF and 5’-UTR sequences of yeast genes and their orthologs. UTRdbMaker analyses the nucleotide composition of 5’-UTRs. Search results are written to text files as tables and include general descriptions of the yeast genes. These results may be used to explore the conservation of 5’-UTRs and to search for regulatory elements in them. The UTRdbMaker code can be extended for similar work with other regions or other databases.
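
Two of the analyses mentioned above, nucleotide composition of a 5’-UTR and detection of upstream ORFs, can be sketched in a few lines. This is only an illustration in Python, not UTRdbMaker's C# code: the function names `nucleotide_composition` and `find_uorfs` are hypothetical, the sequences are assumed to be DNA strings, and an upstream ORF is assumed here to be an ATG followed by an in-frame stop codon inside the UTR.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}


def nucleotide_composition(utr):
    """Return the fraction of A, C, G and T in a DNA sequence."""
    seq = utr.upper()
    if not seq:
        return {}
    counts = Counter(seq)
    # Characters other than A/C/G/T (e.g. N) count towards the length only.
    return {base: counts.get(base, 0) / len(seq) for base in "ACGT"}


def find_uorfs(utr):
    """List (start, end) index pairs of ORFs fully contained in the UTR.

    `start` is the index of the A in ATG; `end` is one past the stop codon.
    """
    seq = utr.upper()
    uorfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon in the frame set by this ATG.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                uorfs.append((start, pos + 3))
                break
    return uorfs


# Invented toy sequence, not a real yeast 5'-UTR.
toy_utr = "CCATGAAATAGCC"
print(nucleotide_composition(toy_utr))
print(find_uorfs(toy_utr))
```

An upstream ATG whose reading frame runs into the main ORF without a stop would not be reported by this sketch, since only ORFs that terminate within the UTR are collected.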

1. W.C. Winkler et al. (2004) Control of gene expression by a natural metabolite-responsive ribozyme, Nature, 428: 281-286.
2. J.E.G. McCarthy (1998) Posttranscriptional Control of Gene Expression in Yeast, Microbiol. Mol. Biol. Reviews, 62: 1492-1553.
3. Ch. Dimaano et al. (2004) Nucleocytoplasmic Transport: Integrating mRNA Production and Turnover with Export through the Nuclear Pore, Mol. Cell. Biol., 24: 3069-3076.
4. A.M. Thomson et al. (1999) Iron-regulatory proteins, iron-responsive elements and ferritin mRNA translation, Int. J. Biochem. Cell Biol., 31: 1139-1152.
5. A.M. Resch et al. (2009) Evolution of alternative and constitutive regions of mammalian 5’UTRs, BMC Genomics, 10: 162.
6. J.T. Rogers et al. (2002) An iron-responsive element type II in the 5’-untranslated region of the Alzheimer’s amyloid precursor protein transcript, J. Biol. Chem., 277: 45518-45528.
7. F. Miura et al. (2006) A large-scale full-length cDNA analysis to explore the budding yeast transcriptome, PNAS, 103: 17846-17851.
8. Z. Xu et al. (2009) Bidirectional promoters generate pervasive transcription in yeast, Nature, 457: 1033-1037.

A TRUSTY KNOWLEDGE-BASED POTENTIAL ENERGY BASED ON PAIRWISE RESIDUE CONTACT AREA
SEYED SHAHRIAR ARAB 1, ARMITA SHEARI 1, MEHDI SADEGHI 2, CHANGIZ ESLAHCHI 3, HAMID PEZESHK 4

1 Department of Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Iran, [email protected], [email protected]
2 National Institute of Genetic Engineering and Biotechnology, Tehran-Karaj Highway, Tehran, Iran, [email protected]
3 Department of Mathematical Sciences, Shahid Beheshti University, Tehran, Iran, [email protected]
4 School of Computer Science, Institute for Studies in Theoretical Physics and Mathematics, Iran, [email protected]

Keywords: Knowledge-based potential, decoy sets, protein structure prediction, protein folding

We develop a new approach to calculating a knowledge-based potential of mean force based on pairwise residue contact area. To test its effectiveness, we apply it to several decoy sets to measure its ability to discriminate native structures from decoys. In all cases the potential distinguished native structures from decoys with about 100% accuracy. The calculated Z-score is also high for all protein datasets. This knowledge-based mean force can discriminate native structures from decoys effectively, so it will be useful for protein structure prediction and model refinement.

An energy function able to distinguish a correct protein fold from incorrect ones is very important for protein structure prediction and protein folding. Two main types of potential energy function are currently in use, both for identifying native protein models in a large set of decoys and for protein fold recognition and threading studies. The first class, the so-called physics-based potentials or physical energy functions, is based on a fundamental analysis of the forces between particles. The second type is the knowledge-based energy function, based on information from known protein structures. A physical energy function uses a molecular mechanics force field. Molecular mechanics force fields are parameterized from ab initio calculations and small-molecule structural data. They are essentially the sum of pairwise electrostatic and van der Waals interaction energies and bond, angle and dihedral angle terms. In addition, terms that are not included explicitly, such as entropy and solvent effects, are considered implicitly. Although physical energy functions are widely used in molecular dynamics simulations of proteins in their native and denatured states and can be used to distinguish decoy and native structures, they have not been efficient in protein structure prediction because of their greater computational cost.

To reduce the computational complexity of the protein folding problem, knowledge-based or empirical mean-force potentials are widely used. Since the structure of folded proteins reflects the free energy of the interaction of all their components, including all enthalpic and entropic contributions as well as solvent effects, such potentials provide an excellent shortcut towards a powerful objective function. The potential between groups of atoms is obtained from experimentally determined structures. In this approach, statistical thermodynamics is used to analyze the frequency of observed states and so estimate the underlying free energy. Most often, the distribution of pairwise distances is used to extract a set of effective potentials between residues or atoms. The distribution of pairwise distances can be compiled from the protein structure database and, once a reference state is defined, Boltzmann's law is used to calculate the interaction energy of a particular pair. The total potential energy of a protein is simply taken as the sum over all pairwise interactions.

In most cases, one or two points per residue are used to represent a protein. These points are usually the C(alpha) or C(beta) atom or the center of mass of each side chain. Each interaction can be distance-dependent. A large variety of knowledge-based potentials of mean force have been developed by introducing additional interactions such as surface area terms, main-chain and side-chain dihedral angles, three- and four-body terms and heavy atoms. In a contact potential, whether distance-dependent or dependent only on contact, the distance between the C(alpha) atoms, the C(beta) atoms, the centers of mass or all heavy atoms of two residues is calculated, and the observed frequency of contacts between residues is converted to free energy using Boltzmann's equation. A problem with this approach is that the distance between the C(alpha) atoms of two residues may equal the distance between the same atoms of these residues in another position while the orientation of the two side chains is quite different; the two cases are nevertheless treated as pairs with equal pairwise distance. In other words, the side chains of two residues may not be in direct contact with each other, and some atoms may be buried in the interior.

In this study, we develop a new approach to calculating a knowledge-based potential energy based on pairwise residue contact area. We calculated the parts of the surface area of each residue pair that are in contact (in Å²) by rolling a probe ball of different sizes around the atoms of a residue to determine the direct contact surface area of each pair. This pairwise direct contact area was used to determine the statistical contact area preference between each residue pair, where the contact area preference estimates the sum of an energetic interaction and a structural constraint. A good energy function should, at its minimum, discriminate native structures from decoys.

So, to test the effectiveness of this new potential, we applied it to several decoy sets to measure its ability to discriminate native structures from decoys. Several decoy sets containing from one to hundreds of decoy proteins generated in different ways were used, and in all cases the potential distinguished native structures from decoys with about 100% accuracy. The calculated Z-score, which is a useful measure of the validity of the computed potential, is high for all protein datasets. The knowledge-based mean force based on pairwise direct contact area can discriminate effectively, so it will be useful for protein structure prediction and model refinement.
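
The statistical step described above can be sketched as an inverse-Boltzmann calculation on contact areas. This is a minimal illustration, not the authors' implementation: the reference state (random mixing in proportion to each residue type's share of the total contact area), the value of kT, the input format and the Z-score sign convention are all assumptions, and the contact areas below are invented toy numbers.

```python
import math
from collections import defaultdict
from statistics import mean, pstdev


def contact_area_potential(contacts, kT=1.0):
    """Derive pair energies from (res_a, res_b, area) contact records.

    E(a, b) = -kT * ln(observed area / expected area), where the expected
    area assumes random mixing of residue types (an assumed reference state).
    """
    pair_area = defaultdict(float)
    res_area = defaultdict(float)
    total = 0.0
    for a, b, area in contacts:
        pair_area[tuple(sorted((a, b)))] += area
        res_area[a] += area
        res_area[b] += area
        total += area

    energies = {}
    for (a, b), observed in pair_area.items():
        # Each contact has two ends, so the shares sum to 1 over 2 * total.
        share_a = res_area[a] / (2 * total)
        share_b = res_area[b] / (2 * total)
        expected = total * share_a * share_b * (1 if a == b else 2)
        energies[(a, b)] = -kT * math.log(observed / expected)
    return energies


def total_energy(contacts, energies):
    """Sum pair energies over a structure's contacts (unseen pairs score 0)."""
    return sum(energies.get(tuple(sorted((a, b))), 0.0) for a, b, _ in contacts)


def z_score(native_energy, decoy_energies):
    """Assumed convention: positive when the native lies below the decoys."""
    return (mean(decoy_energies) - native_energy) / pstdev(decoy_energies)


# Toy "database" of contacts between two residue types (areas in Å²).
database = [("A", "A", 10.0), ("A", "B", 10.0), ("B", "B", 10.0)]
energies = contact_area_potential(database)
print(energies)
print(z_score(-10.0, [-2.0, 0.0, 2.0]))
```

Pairs whose observed contact area exceeds the random-mixing expectation receive negative (favorable) energies, and a large Z-score under the assumed convention means the native structure scores well below the decoy ensemble.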

EVOLUTIONARY DYNAMICS OF CRISPR-CASSETTES
VALERY SOROKIN 1, IRENA ARTAMONOVA 2

1 M.V. Lomonosov Moscow State University, Russian Federation, [email protected]
2 Vavilov Institute of General Genetics RAS; Kharkevich Institute of Information Transmission Problems RAS, Russian Federation, [email protected]

Keywords: prokaryotic immunity, CRISPR-cassettes, metagenome, evolution

CRISPRs (Clustered Regularly Interspaced Short Palindromic Repeats) are a new type of prokaryotic anti-phage immunity system. A typical CRISPR system consists of a CRISPR-cassette, which is a chain of almost identical repeats separated by unique spacers, a leader region, and CRISPR-associated genes [1]. We analyzed CRISPR systems in metagenomic sequence data. There are no efficient tools for CRISPR-cassette search in such data: when applied to metagenomes, all three publicly available programs, CRT [2], PILER-CR [3], and CRISPRFinder [4], produce high levels of false positives. Thus, to search for CRISPR-cassettes in metagenomes we developed a filtering procedure based on a combination of these three programs. This procedure was applied to the Sorcerer II [5] metagenome data, resulting in 192 reliable cassettes. All cassettes found by at least one of the three tools were collected in a database called MeCRISPR (http://iitp.bioinf.fbb.msu.ru/vsorokin/crispr). The database interface allows browsing and analyzing pre-calculated CRISPR-cassettes and their flanking sequences, in particular searching against spacers, repeats and metagenomic contigs containing at least one CRISPR-cassette. We clustered CRISPR-cassettes based on similarity between repeat units. Additional analysis of flanking regions allowed us to distinguish between lateral transfer and parallel evolution of cassettes in related strains. For every group of homologous cassettes, we reconstructed the evolutionary history. We observed that similarities representing phage-related spacers or lateral transfers of cassettes were significantly enriched in metagenome contigs from the same geographical locations. This shows that ongoing phage-host encounters at specific ocean locations involve the CRISPR-mediated response and leave an imprint on the host genome.

We also investigated CRISPR-cassettes in closely related strains of Xanthomonas oryzae. An attempt to construct an experimental system for studying CRISPR systems failed because of an unresolved paradox in two strains, Xo604 and Xo21. A spacer shared by the homologous CRISPR-cassettes of these strains is identical to a sequence of the Xp10 phage and should, theoretically, prevent phage infection in both cases. However, while Xo21 is indeed resistant to this phage, the Xo604 strain is sensitive. We explained this by identifying a mutation in the phage regulatory motif discovered for the Xanthomonas cassettes. A comparative analysis of all known CRISPR-cassettes of Xanthomonas oryzae (five strains) will be presented.

This is joint work with Mikhail S. Gelfand, Konstantin V. Severinov, Mikhail A. Pyatnitskiy, Ekaterina Semenova and Maxin Nagronykh. This work was partially supported by the Russian Foundation of Basic Research (09-04-01098-a) and the Russian Academy of Sciences (programs “Molecular and Cellular Biology” and “Fundamental problems of Oceanology”).

1. R. Sorek et al. (2008) CRISPR--a widespread system that provides acquired resistance against phages in bacteria and archaea, Nat. Rev. Microbiol., 6: 181-186.
2. C. Bland et al. (2007) CRISPR recognition tool (CRT): a tool for automatic detection of clustered regularly interspaced palindromic repeats, BMC Bioinformatics, 8: 209.
3. R.C. Edgar (2007) PILER-CR: fast and accurate identification of CRISPR repeats, BMC Bioinformatics, 8: 18.
4. I. Grissa et al. (2007) CRISPRFinder: a web tool to identify clustered regularly interspaced short palindromic repeats, Nucleic Acids Res., 35: W52-7.
5. D.B. Rusch et al. (2007) The Sorcerer II Global Ocean Sampling expedition: northwest Atlantic through eastern tropical Pacific, PLoS Biol., 5: e77.

INVESTIGATING BRANCH POINT SITE CONSENSUS OF HUMAN
FEDOR GONCHAROV 1, VLADIMIR BABENKO 2

1 Institute of Cytology and Genetics, Russian Federation, [email protected]
2 Institute of Cytology and Genetics, Russian Federation, [email protected]

Splicing is commonly recognized as one of the ultimate regulation stages of gene expression. In particular, alternative splicing (AS) is a widespread mechanism with an important role in generating the appropriate tissue- and/or stage-specific products from the same gene. On the other hand, one of the key binding sites in the course of spliceosome assembly, the branch point site (BPS), is drastically degenerate in mammals in contrast to intron-poor organisms, e.g. yeast (Gao et al., 2008). We explored the 30 bp branch point region sequences [-50, -20] relative to the 3’ splice site from 28156 human introns. For the analysis we built a maximum parsimony tree for 7-mers, taking into account the pairwise correlation values of the positions in the 7-mer occurrence distribution. The analysis produced several results. There are several major BPS consensus sequences in human, which suggests BPS heterogeneity.
1. The most abundant human BPS is represented by the ACTGACG oligonucleotide, which is consistent with (Irimia, Roy, 2008) and differs from that of, e.g., yeast (TACTAAC).
2. The human U2 RNP can bind to the mRNA BPS not through the canonical GTAGTA site but, in a significant number of cases, through the IIa loop (Pomeranz Krummel et al., 2009), which is confirmed by the extensive representation of ATTAAAC as a BPS in human (Henscheid, Voelker, Berglund, 2008).
3. The BPS sequence depends on intron length: it is closer to canonical in small to moderate introns.
4. Cassette exon-related BPSs (3’ downstream) have significantly lower BPS strength (more mismatches from the major consensus sequences) than those of obligatory exons (p<1e-8).

In metazoan cells, increasing tissue-specific complexity leads to multistage gene regulation in the course of replication, transcription and posttranscriptional phases. It was shown (Irimia, Roy, 2008) that intron-rich organisms usually belong to the top hierarchical clade of the organization complexity tree. We believe that branch point redundancy arose as part of the evolution of AS regulation. In particular, a strong BPS does not allow cis-regulatory elements to affect splicing, so a BPS of the canonical type could be referred to as an Intronic Splicing Enhancer (ISE). In contrast, regulated exons lack a strong BPS signal, apparently because of regulation.
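
The first step of the analysis described above, taking the 30 bp window [-50, -20] upstream of the 3’ splice site and tabulating its 7-mers, can be sketched as follows. The function names `branch_point_region` and `count_kmers` are hypothetical, introns are assumed to be given 5’ to 3’ with the last base adjacent to the 3’ splice site, and the toy intron is invented; the maximum parsimony tree and the positional correlation analysis are not reproduced here.

```python
from collections import Counter


def branch_point_region(intron, start=-50, end=-20):
    """Return the window [start, end) counted back from the 3' splice site."""
    if len(intron) < -start:
        return ""  # intron too short to contain the whole window
    return intron[start:end]


def count_kmers(regions, k=7):
    """Count every overlapping k-mer across a collection of sequences."""
    counts = Counter()
    for region in regions:
        for i in range(len(region) - k + 1):
            counts[region[i:i + k]] += 1
    return counts


# Invented 70 nt intron with the yeast-like motif TACTAAC placed in the window.
toy_intron = "A" * 20 + "TACTAAC" + "C" * 23 + "G" * 20
region = branch_point_region(toy_intron)
kmers = count_kmers([region])
print(len(region), sum(kmers.values()))
print(kmers.most_common(3))
```

Each 30 bp window contributes 24 overlapping 7-mers, so the table built from all introns grows linearly with the number of introns analyzed.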

1. Irimia M, Roy SW. Evolutionary convergence on highly-conserved 3' intron structures in intron-poor eukaryotes and insights into the ancestral eukaryotic genome. PLoS Genet. 2008. 4(8):e100014
2. Gao K, Masuda A, Matsuura T, Ohno K. Human branch point consensus sequence is yUnAy. Nucleic Acids Res. 2008. 36(7):2257-6
3. Henscheid KL, Voelker RB, Berglund JA. Alternative modes of binding by U2AF65 at the polypyrimidine tract. Biochemistry. 2008. 47(1):449-59.
4. Pomeranz Krummel DA, Oubridge C, Leung AK, Li J, Nagai K. Crystal structure of human spliceosomal U1 snRNP at 5.5 A resolution. Nature. 2009. 458(7237):475-80.

GLAUCOMA AND MYOPIA WHOLE GENOME ASSOCIATION STUDY
VLADIMIR BABENKO 1, MARINA GUBINA 1, IGOR KULIKOV 1, RUSLAN AITNASAROV 1

1 Institute of Cytology and Genetics, Russian Federation, [email protected]

Keywords: Illumina 550, SNP analysis, glaucoma, myopia

Forty individuals were genotyped with the Illumina 550 SNP array (Illumina, Inc., http://illumina.com) at the “Bioengineering” Center, RAS, Russia. The data comprise 27 healthy individuals, 5 patients with glaucoma and 8 with a myopia diagnosis. All individuals are Caucasians from the Novosibirsk urban region, Russia. The total SNP set comprises more than 340 thousand SNPs. We implemented an SQL database schema of our own design for maintaining the sample, and a software suite to analyze it.

Results. We identified 44 target SNPs while analyzing 11 normal and 13 disorder cases where the discrepancy between the control and affected sample sets exceeded an empirically chosen significance threshold of 9 genotypes. Using the Haploview software suite (www.hapmap.org) we selected 28 non-redundant unlinked SNPs. Next we scanned the OMIM database (www.ncbi.nlm.nih.gov/omim) for the genes comprising the target SNP set. There we identified 5 genes with ‘glaucoma’ and ‘myopia’ as keywords, namely: myocilin (MYOC), optineurin (OPTN), cytochrome P450 family 1 subfamily B (CYP1B1), optic atrophy 1 isoform 8 (OPA1), and WD repeat domain 36 (WDR36). The gene OPA1 (optic atrophy, chromosome 3), significantly associated with target SNPs, is located within a recently identified cluster of genes (MFN1, SOX2OT and PSARL; Andrew T et al., PLoS Genetics, 2008) shown to be associated with myopia. We thus reconfirm the impact of this gene on myopia in the ethnic population considered.

AN EVOLUTIONARY STUDY IN THE GENOMICS OF VERTEBRATE POXVIRUSES
IGOR BABKIN 1

1 Institute of Cytology and Genetics SB RAS, Russian Federation, [email protected]

Keywords: DNA virus, Poxviridae, Virus evolution, Smallpox history

Members of the family Poxviridae are the most studied among the known cytoplasmic DNA-containing viruses. According to the accepted taxonomy, they are divided into two subfamilies, Entomopoxvirinae and Chordopoxvirinae; the latter contains eight genera and two unclassified viruses, deerpox virus and crocodile poxvirus. The members of the Chordopoxvirinae subfamily use two types of evolutionary strategy: Parapoxvirus, Molluscipoxvirus, and crocodile poxvirus accumulate CG sequences in their genomes, and the remaining poxviruses accumulate AT sequences [1]. To introduce a time scale into the evolutionary reconstruction, it is necessary to determine the divergence time points for one or several tree nodes. One such constraint is the moment when variola virus (VARV) was exported to the American continent from West Africa in the 16th century [2]. We earlier discovered the genetic relatedness between the virus strains from these regions [3], which form a separate biological subtype of VARV. This allowed us to estimate the divergence time points for poxviruses using the Bayesian relaxed clock [4]. We earlier determined the rates of molecular evolution of orthopoxviruses, based on analysis of the extended central conserved region of their genomes, and of AT-rich poxviruses, by analyzing the nucleotide sequences of virus RNA polymerase subunits. The goal of this work was to study the evolutionary history of the vertebrate poxviruses with AT-rich genomes by Bayesian relaxed clock analysis using a large set of highly conserved, vitally important genes of these viruses. For this analysis, we selected only highly conserved genes with similar evolutionary rates, namely 35 genes encoding proteins involved in transcription, DNA replication, and the system of S–S bond formation. The accumulation rate of nucleotide substitutions was 1–6 × 10⁻⁶ nucleotide substitutions per site per year.
Applying the Bayesian method for determining the time estimates, we conclude that the modern viruses of the genus Avipoxvirus diverged from the progenitor approximately 283 ± 102 thousand years ago. Presumably, the progenitor virus of the modern mammalian poxviruses had a wide range of sensitive hosts and specialized to different organisms during evolution. The progenitor of the genus Orthopoxvirus was the first to diverge, approximately 171 ± 55 thousand years ago. Then the progenitor of the genus Leporipoxvirus separated, about 136 ± 44 thousand years ago. This genus contains viruses inducing tumors in rabbits, hares, and squirrels. The next to diverge was the progenitor of the genus Yatapoxvirus, whose representatives induce benign tumors in primates. The progenitor of three ungulate virus genera (Capripoxvirus, Suipoxvirus, and the recently discovered unclassified deerpox virus) appeared 107 ± 36 thousand years ago. VARV diverged from its progenitor, shared with camelpox and taterapox viruses, 5.8 ± 1.4 thousand years ago. However, we earlier performed a more reliable calculation based on the extended central conserved region of orthopoxvirus genomes, which estimated the time of independent VARV evolution as 3.4 ± 0.8 thousand years [4]. This dating of the VARV progenitor to 3–4 thousand years ago demonstrates that VARV is a comparatively young virus.

This work was supported by the Russian Foundation for Basic Research (project no. 08-04-00443-a).

1. Moss B. (1996) Poxviridae: The viruses and their replication, In: Fields Virology, Fields B.N. et al. (Eds.), 2637-2671 (Philadelphia: Lippincott-Raven Publishers).
2. Fenner F., Henderson D.A. et al. (1988) Smallpox and its Eradication. Geneva: World Health Organization, 1460 p.
3. Babkina I.N., Babkin I.V. et al. (2004) Phylogenetic comparison of the genomes of different strains of variola virus, Dokl. Biochem. Biophys., 398: 316-319.
4. Babkin I.V., Shchelkunov S.N. (2008) Molecular evolution of poxviruses, Genetika, 44: 1029-1044.

DOSAGE COMPENSATION AND DEMASCULINIZATION OF X CHROMOSOMES IN DROSOPHILA DORIS BACHTROG 1, NICHOLAS TODA 1, STEVEN LOCKTON 1

Keywords: sex chromosomes, demasuclinization

The X chromosome of Drosophila shows a deficiency of genes with male- biased expression, while mammalian X chromosomes are enriched for both spermatogenesis genes expressed pre-meiosis and multi-copy testis genes. Meiotic X inactivation and sexual antagonism can only partly account for these patterns. Here, we show that dosage compensation in Drosophila contributes substantially to the depletion of male genes on the X. To equalize expression of X-linked genes between the sexes, male Drosophila hyper-transcribe their single X, while female mammals silence one of their two X chromosomes. By combining fine-scale mapping-data of dosage compensated regions in D. melanogaster with genome-wide expression profiles, we demonstrate that the dosage compensation machinery directly limits further up-regulation of X- linked genes in males. As a result, most male-biased genes on the X chromosome are located outside dosage compensated regions. We also show that dosage compensation in Drosophila contributes to gene trafficking of male-genes off the X. Thus, while natural selection operates more efficiently on the hemizygous X chromosome in males, dosage compensation prevents the emergence of male genes on the Drosophila X. Conversely, since base-line levels of X-linked transcription are identical in male and females, no sex- specific restriction on gene regulation exists and selection can act to masculinize the X in mammals. The vastly different mechanisms of dosage compensation can therefore help to explain X-chromosomal gene content differences between mammals and Drosophila.

1 University of California Berkeley, United States, [email protected]

CODON SIZE REDUCTION AS THE ORIGIN OF THE TRIPLET GENETIC CODE. PAVEL BARANOV 1, MAXIME VENINE 2, GREGORY PROVAN 2

The genetic code appears to be optimized in its robustness to missense errors and frameshift errors [1-3]. In addition, the genetic code is near optimal in terms of its ability to carry information in addition to the sequences of encoded proteins [4]. As evolution has no foresight, optimality of the genetic code suggests its evolutionary origin as opposed to an accidental origin. The length of codons in the genetic code is also optimal, as three is the minimal nucleotide combination allowing encoding of the twenty standard amino acids. The apparent impossibility of transitions between codon sizes in a discontinuous manner during evolution has resulted in an unbending view that the genetic code was always triplet. Yet, recent experimental evidence on quadruplet decoding [5-8], as well as the discovery of organisms with ambiguous [9, 10] and dual decoding [11], suggest that the possibility of the evolution of triplet decoding from living systems with non-triplet decoding merits reconsideration and further exploration. We designed a mathematical model of the evolution of primitive digital organisms capable of decoding nucleotide sequences into protein sequences. These organisms are allowed to evolve their nucleotide sequences via genetic events of Darwinian evolution, such as point mutations. The replication rates of such organisms depend on the accuracy of generated protein sequences. Computer simulations based on our model show that decoding systems with codons of length greater than three spontaneously evolve into predominantly triplet decoding systems. Our findings suggest a plausible scenario for the evolution of the triplet genetic code in a continuous manner. This scenario suggests an explanation of how protein synthesis could be accomplished by means of long RNA-RNA interactions prior to the emergence of complex decoding machinery, such as the ribosome, that is required for stabilization and discrimination of otherwise weak triplet codon-anticodon interactions.

1 Biochemistry Department, University College Cork, Ireland, [email protected]
2 Computer Science Department, University College Cork, Ireland, [email protected], [email protected]

1. T. Maeshiro, M. Kimura (1998) The role of robustness and changeability on the origin and evolution of genetic codes, Proc Natl Acad Sci U S A, 95:5088-5093.
2. S.J. Freeland et al. (2000) Early fixation of an optimal genetic code, Mol Biol Evol, 17:511-518.
3. A.S. Novozhilov et al. (2007) Evolution of the genetic code: partial optimization of a random code for robustness to translation error in a rugged fitness landscape, Biol Direct, 2:24.
4. S. Itzkovitz, U. Alon (2007) The genetic code is nearly optimal for allowing additional information within protein-coding sequences, Genome Res, 17:405-412.
5. D.L. Riddle, J. Carbon (1973) Frameshift suppression: a nucleotide addition in the anticodon of a glycine transfer RNA, Nat New Biol, 242:230-234.
6. B. Moore et al. (2000) Quadruplet codons: implications for code expansion and the specification of translation step size, J Mol Biol, 298:195-209.
7. T.J. Magliery, J.C. Anderson, P.G. Schultz (2001) Expanding the genetic code: selection of efficient suppressors of four-base codons and identification of "shifty" four-base codons with a library approach in Escherichia coli, J Mol Biol, 307:755-769.
8. J.C. Anderson, T.J. Magliery, P.G. Schultz (2002) Exploring the limits of codon and anticodon size, Chem Biol, 9:237-244.
9. A.C. Gomes et al. (2007) A genetic code alteration generates a proteome of high diversity in the human pathogen Candida albicans, Genome Biol, 8:R206.
10. I. Miranda et al. (2007) A genetic code alteration is a phenotype diversity generator in the human pathogen Candida albicans, PLoS ONE, 2:e996.
11. A.A. Turanov et al. (2009) Genetic code supports targeted insertion of two amino acids by one codon, Science, 323:259-261.


TOWARD UNIVERSAL MALIGNOMETER: GENOME-WIDE EXPRESSION PATTERNS AS COMPOSITE BIOMARKERS GANIRAJU MANYAM 1, ALESSANDRO GIULIANI 2, ANCHA BARANOVA 3

Keywords: global patterns of gene expression, attractor, tumorigenesis, expression dynamics

Abstract To date, most high-throughput gene expression studies have focused on elucidating gene signatures that discriminate cell phenotypes. On the other hand, a given cell type can be represented as a dynamic system occupying a specific position in the multidimensional phase space spanned by all expressed genes. In terms of dynamics, this specific position is called an “attractor”, i.e. a “stable” position characterized by a specific pattern of gene expression levels that determines the particular type of cell differentiation. Some studies have indicated that the differentiation destinies of progenitor cells can be defined as high-dimensional attractor states of the underlying molecular networks. A possible middle ground between discriminating signatures and entire expression landscapes may be described as a combination of attractor-like behavior with some local “vantage points” represented by the genes most sensitive to dynamical changes of the system. Affymetrix microarray datasets were extracted from the NCBI Gene Expression Omnibus. We analyzed the following two categories of datasets: A) datasets describing paired normal and tumor tissue samples collected from the same individual; B) datasets describing a group of normal and a group of tumor samples collected from the same tissue type across a number of subjects. The global and specific expression distances (Dglobal and Dspecific) were calculated based on all transcripts on the chip and on the transcripts found to be significantly differentially expressed by the Mann-Whitney test, respectively. The distances between the expression profiles of two biological samples were estimated using Pearson correlation coefficients. In all studied datasets, on average, tumors were further away from the Normal Sample Space than the paired samples with normal histology. Interestingly, this observation held only when distances were calculated using Dglobal. Surprisingly, similarly calculated distances for Normal samples from the Normal Space defined by Dspecific were not significantly different, mostly due to larger variations in the expression of cancer-specific genes in the normal samples. In all datasets, mean (Dglobal) distances from individual normal samples to the Normal Space were correlated with mean (Dglobal) distances from individual tumor samples (R=0.9236, p <= 0.00186). Principal Component Analysis (PCA) provided, for the first time, a quantitative estimation of the relative importance of global and local features of the gene expression regulation landscape in the process of tumor development. The remarkable behavioral invariance we observed in eighteen independent tumor data sets gives a robust proof of the dynamical picture of cell populations.

1 George Mason University, United States, [email protected]
2 Istituto Superiore di Sanità, Italy
3 George Mason University, United States
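The correlation-based distance described in this abstract can be illustrated with a short sketch. It rests on two assumptions that the abstract does not state: that the distance between two profiles is 1 - r (r being the Pearson correlation coefficient), and that the distance of a sample to the Normal Sample Space is the mean of its distances to the normal samples. The function names and the toy expression values are hypothetical.

```python
import numpy as np

def pearson_distance(x, y):
    """Distance between two expression profiles, assumed here to be 1 - Pearson r."""
    r = np.corrcoef(x, y)[0, 1]
    return 1.0 - r

def distance_to_normal_space(sample, normals):
    """Mean distance from one profile to a set of normal profiles (assumed definition)."""
    return float(np.mean([pearson_distance(sample, n) for n in normals]))

# Toy data (hypothetical): three normal profiles and one tumor profile, six transcripts each.
normals = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [1.1, 2.1, 2.9, 4.2, 5.1, 5.8],
    [0.9, 1.8, 3.2, 3.9, 4.8, 6.1],
])
tumor = np.array([6.0, 1.0, 5.0, 2.0, 4.0, 3.0])

d_tumor = distance_to_normal_space(tumor, normals)
d_normal = distance_to_normal_space(normals[0], normals[1:])
print(f"tumor: {d_tumor:.3f}  normal: {d_normal:.3f}")
```

Under these assumptions, Dglobal would be obtained by passing all transcripts to the function and Dspecific by passing only the differentially expressed subset.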


MATHEMATICAL MODELLING OF CELL-FATE DECISION NETWORKS EMMANUEL BARILLOT 1, LAURENCE CALZONE 2, SIMON FOURQUET 3, LAURENT TOURNIER 4, ANDREI ZINOVYEV 5, DENIS THIEFFRY 6

Keywords: systems biology, apoptosis, cell-fate decision, death receptors

Engagement of death domain receptors such as TNFR1 or Fas can trigger cell death by apoptosis or necrosis, or lead to the activation of pro-survival signaling pathways such as NF-κB. Our study aims at identifying determinants of this cell fate decision process. Apoptosis represents a tightly controlled mechanism of cell death that is triggered by overwhelming stress conditions or external death signals, and results in vacuolization of cellular content followed by its phagocyte-mediated elimination. It is a physiological process that regulates cell homeostasis, development, and clearance of damaged, virus-infected or cancer cells. Necrosis results in plasma membrane disruption and release of intracellular content that can trigger inflammation in the neighboring tissues. Long seen as an accidental cell death, necrosis can also be a regulated process, possibly involved in the clearance of virus-infected or cancer cells that escaped apoptosis.

Modeling of these pathways could help identify in which conditions and how the cell chooses between different types of cellular deaths or survival. Moreover, modeling could suggest ways to re-establish the apoptotic death when it is altered. The decision process appears to be very complex: it integrates many intertwined signaling pathways and the molecular interactions controlling this process are regulated by multiple positive and negative feedback loops. Mathematical modeling provides a good tool to understand and analyse the dynamical behaviours of such complex systems.

1 Institut Curie, Mines ParisTech, INSERM U900, France, [email protected]
2 Institut Curie, Mines ParisTech, INSERM U900, France, [email protected]
3 Institut Curie, Mines ParisTech, INSERM U900, France, [email protected]
4 Institut Curie, Mines ParisTech, INSERM U900, France, [email protected]
5 Institut Curie, Mines ParisTech, INSERM U900, France, [email protected]
6 Faculté des Sciences de Luminy, Université de la Méditerranée, France, [email protected]

For that purpose, based on the literature, we established a generic influence network that includes the main species that participate in cell fate decision in response to death signals (mediated by Fas and TNF). A first annotated version of this “master” model was built in a discrete framework. An initial study was performed on the steady states: eight different clusters of steady states that correspond to the expected cellular phenotypes were identified. This result constitutes a first validation of the proposed structure of the network.

In order to propose a more refined dynamical analysis, we suggested a reduction of the model that preserves the same dynamical properties. We went from 22 variables in the “master” model to 11 variables in the reduced version. Thanks to this reduction, the realistic asynchronous updating strategy could be used and qualitative simulations were performed. In particular, computing all discrete trajectories starting from specific initial conditions allowed us to identify the corresponding “reachable” phenotypes in the case of TNF- and Fas-induced signals, for the wild-type and mutant models. The mutants mostly fit the expected behaviours and suggested some improvements to the “master” model.
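The idea of enumerating all asynchronous trajectories from an initial condition and collecting the reachable steady states can be sketched on a toy example. The three-node network below is hypothetical and is not the authors' 22- or 11-variable model; it only assumes a Boolean framework in which exactly one node is updated per step.

```python
# Hypothetical 3-node toy network (NOT the authors' model):
# the signal activates both outcomes, which inhibit each other.
RULES = {
    "signal":   lambda s: s["signal"],                        # input, held constant
    "survival": lambda s: s["signal"] and not s["death"],
    "death":    lambda s: s["signal"] and not s["survival"],
}
NODES = list(RULES)

def successors(state):
    """Asynchronous updating: each successor differs from `state` in exactly one node."""
    named = dict(zip(NODES, state))
    out = []
    for i, node in enumerate(NODES):
        new_value = bool(RULES[node](named))
        if new_value != state[i]:
            out.append(state[:i] + (new_value,) + state[i + 1:])
    return out

def reachable_steady_states(initial):
    """Explore every asynchronous trajectory from `initial`; return reachable fixed points."""
    seen, stack, steady = {initial}, [initial], set()
    while stack:
        state = stack.pop()
        nxt = successors(state)
        if not nxt:                 # no node wants to change: a steady state
            steady.add(state)
        for n in nxt:
            if n not in seen:
                seen.add(n)
                stack.append(n)
    return steady

start = (True, False, False)        # signal on, no outcome committed yet
phenotypes = reachable_steady_states(start)
print(sorted(phenotypes))
```

In this toy case two steady states are reachable from the same initial condition, which is the kind of branching that synchronous updating would hide. A mutant can be mimicked by replacing one rule with a constant.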

This work is supported by the APO-SYS EU FP7 project and the authors of the work are members of the team "Systems Biology of Cancer," Equipe labellisée par la Ligue Nationale Contre le Cancer.


CONSERVATIVE REGIONS OF PROTEINS EVOLVE UNDER STRONGER POSITIVE SELECTION GEORGII BAZYKIN 1, ALEXEY KONDRASHOV 2

Positive selection, i.e. natural selection that promotes change, is usually assumed to play a larger role in the evolution of rapidly evolving sequences than in the evolution of slowly evolving sequences. We use the McDonald-Kreitman test [1] to study how the strength of positive selection in segments of coding sequences in the divergence of Drosophila simulans and D. melanogaster depends on the overall evolutionary conservation of these segments between Drosophila species. The fraction of amino acid positions evolving under positive selection in the most conserved sites is twice as high as in the least conserved sites. The analysis of pairs of substitutions in adjacent nucleotide sites within a codon [2] reveals that the clumping of substitutions, indicative of positive selection, is also strongest in the most conserved segments. By making use of the dense phylogeny of Drosophila species with complete genomes sequenced, we ascertain the distribution of the evolutionary times between the substitutions, as well as the strength of the selection coefficients favoring the second substitution in each pair. In conserved segments, the average second substitution occurred under selection that accelerated evolution by a factor of 20. Together, our results indicate that strong positive selection within conservative regions is an important component of adaptive evolution.
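As a brief illustration of how a McDonald-Kreitman table can be turned into a fraction of adaptive substitutions, the sketch below uses one widely used estimator, alpha = 1 - (Ds * Pn) / (Dn * Ps). The abstract does not say which estimator the authors applied, so this is an assumption, and the counts are hypothetical.

```python
def mk_alpha(dn, ds, pn, ps):
    """Estimate the fraction of nonsynonymous substitutions driven by positive selection.

    dn, ds: fixed nonsynonymous / synonymous differences between species
    pn, ps: nonsynonymous / synonymous polymorphisms within species
    Assumed estimator: alpha = 1 - (ds * pn) / (dn * ps).
    """
    if dn == 0 or ps == 0:
        raise ValueError("dn and ps must be non-zero for this estimator")
    return 1.0 - (ds * pn) / (dn * ps)

def neutrality_index(dn, ds, pn, ps):
    """NI = (pn / ps) / (dn / ds); values below 1 suggest an excess of fixed replacements."""
    return (pn / ps) / (dn / ds)

# Hypothetical counts for one segment class.
counts = dict(dn=7, ds=17, pn=2, ps=42)
alpha = mk_alpha(**counts)
ni = neutrality_index(**counts)
print(f"alpha = {alpha:.3f}, NI = {ni:.3f}")
```

Computing alpha separately for segments binned by conservation would give the kind of comparison the abstract reports, under the stated assumption.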

1. J. McDonald, M. Kreitman (1991) Adaptive protein evolution at the Adh locus in Drosophila, Nature, 351:652–654. 2. G. Bazykin et al. (2004) Positive selection at sites of multiple amino acid replacements since rat–mouse divergence, Nature, 429:558–562.

1 Institute for Information Transmission Problems of the Russian Academy of Sciences (Kharkevich Institute), Russian Federation, [email protected]
2 Life Sciences Institute and Department of Ecology and Evolutionary Biology, University of Michigan, United States, [email protected]

MODELLING AND STABILITY ANALYSIS OF INTERCONNECTED REGULATORY CYCLES MAHSA BEHZADI 1, MIREILLE REGNIER 1, LAURENT SCHWARTZ 1, JEAN-MARC STEYAERT 1

Keywords: systems biology, ordinary differential equations, enzymatic reactions, stability analysis, cycle oscillations, equilibria.

Biochemical reactions are continually taking place in all living organisms. The complexity of biochemical and biological processes is such that the development of computer models is often essential in trying to understand the phenomenon under consideration. Our aim is to build a generic framework with which one could simulate the behavior of complex systems of interconnected regulatory cycles. For the simulation of a biological system we use the traditional reaction-rate approach by means of equations describing the system. In this approach, chemical reactions are modelled by ordinary differential equations (ODEs) representing the variations of the concentrations of the substances. In each of the differential equations we express the kinetics of one reactant as a sum of fractional terms for enzymatic reactions and non-fractional terms for simple reactions. Once the model is constructed, we aim to study the various modes of cell behaviour according to the concentrations of the relevant enzymes in enzymatic reactions. Since stable and unstable equilibria play different roles in the dynamics of a system, it is useful and important to be able to classify equilibrium points based on their stability, and this is what we are able to do by simulation and also by mathematical study. By stability analysis, given an equilibrium, we can first determine whether it is a stable point or not; furthermore, through a mathematical study we are able to find the stability and instability regions by changing one or several parameters. As a first try we have constructed a model for the central part of the system of the GlyceroPhosphoLipid metabolism in the human cell. The model comprises enzymatic reactions of PhosphatidylEthanolamine (PtdEth) and PhosphatidylCholine (PtdCho) [1, 2]. Given the values of metabolite concentrations (Ci) which were observed experimentally, we have managed to find the appropriate parameter values (Pi) which allow us to completely describe the system with a set of ordinary differential equations (ODEs). Our analysis of this model demonstrates that, with these parameter values, the system has a stable solution. Moreover, we investigated the possibility that a change in parameter values could give an unstable or oscillating solution. For that purpose we studied the system mathematically over a wide range of values, and we prove that the solution is always stable and without oscillations regardless of the parameter values of the system. We have also applied our method to the cell division cycle model, based on the well-known interactions of the proteins cdc2 and cyclin. A mathematical model was already constructed by John Tyson [3], who used numerical integration (carried out using Gear's algorithm) for simulation and stability analysis of the model. We studied this system of interactions and, using our approach based on the analysis of the eigenvalues of the linearized system, we confirmed the nature of the results for the same parameter values. We currently use this approach to study the stability of a complex metabolic network containing several interconnected regulatory cycles such as Glycolysis, Krebs cycle, Phospholipids pathway and Amino acids.

1 Bioinformatics group, LIX, Ecole Polytechnique, Palaiseau, 91128, France [email protected], [email protected], [email protected], [email protected]
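The eigenvalue-based stability test mentioned in this abstract can be sketched on a toy system. The two-variable model below is hypothetical and is not the authors' phospholipid or cell-cycle model: it has one enzymatic (fractional, Michaelis-Menten) term and one simple linear term, with made-up parameter values. The Jacobian is approximated by central differences, and an equilibrium is called stable when all eigenvalues have negative real parts.

```python
import numpy as np

def rhs(c, k_in=1.0, vmax=2.0, km=0.5, k_out=0.8):
    """Toy ODE right-hand side: constant influx -> x -(enzyme)-> y -> linear efflux."""
    x, y = c
    flux = vmax * x / (km + x)            # fractional (enzymatic) term
    return np.array([k_in - flux, flux - k_out * y])

def numerical_jacobian(f, c, eps=1e-6):
    """Central-difference approximation of the Jacobian of f at point c."""
    c = np.asarray(c, dtype=float)
    n = c.size
    jac = np.zeros((n, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = eps
        jac[:, j] = (f(c + step) - f(c - step)) / (2 * eps)
    return jac

# Equilibrium of the toy system for the default parameters:
# x* = k_in * km / (vmax - k_in) = 0.5 and y* = k_in / k_out = 1.25.
equilibrium = np.array([0.5, 1.25])

eigvals = np.linalg.eigvals(numerical_jacobian(rhs, equilibrium))
is_stable = bool(np.all(eigvals.real < 0))
oscillatory = bool(np.any(np.abs(eigvals.imag) > 1e-9))
print(eigvals, is_stable, oscillatory)
```

Repeating the eigenvalue computation over a grid of parameter values is one simple way to map stability regions numerically; the abstract's analytical proof for all parameter values is a stronger result than such a scan can give.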

1. Henry, S. A., and Patton-Vogt, J. L. (1998) Prog. Nucleic Acids Res. Mol. Biol. 61, pp. 133-179.
2. R. Sundler, B. Akesson (1975) Biochem. J. 146, pp. 309-315.
3. J. Tyson (1991) Cell Biology, Vol 88, pp. 7328-7332.


INVOLVEMENT OF PROTEIN-PROTEIN INTERACTIONS IN COMPOSITE ELEMENTS DETECTION ALEXANDER A BELOSTOTSKY 1, VSEVOLOD Y. MAKEEV 1

Keywords: transcription factor, transcription factor binding site, composite element, protein-protein interaction

A composite element (CE) is a group of transcription factor binding sites (TFBSs) located near each other in a statistically significant number of cases. CE detection is a crucial task in understanding transcription regulation. There exist many methods for predicting CEs using data on the co-occurrence of different sites in a set of regulatory sequences. In some cases these methods take into account the score of every site in a CE and the distance between the sites [1, 2]. In other cases only the co-occurrence of different sites in some large genomic region is considered, but the search is performed over a set of co-regulated genes [3, 4], sometimes with conservation estimation added [5]. All these methods have one common disadvantage: they are based on known CEs. Here we present a method for the prediction of CEs that uses experimentally determined protein-protein interactions, specifically information about interactions between transcription factors (TFs). This method allows us to involve some structural aspects in CE detection, and this source of experimental data is independent of previously examined CEs. In our approach we simply count CEs that contain sites of TFs able to interact with each other. We searched for groups of sites: sites of a particular TF (the TF of interest) and sites of TFs capable of interacting with the TF of interest. These sites must score above the threshold and be located no further than a given distance from each other. We tested the idea on a set of Hif1-dependent genes that have experimentally determined Hif1 TFBSs. The names of these genes and the positions and sequences of the Hif1 sites were taken from TransFac. Our objective was to predict the experimentally determined Hif1 sites as part of a predicted CE. As predicted CEs we selected those having sites of Hif1 itself with sites of TFs capable of interacting with Hif1 in close vicinity. We set a threshold for the sites constituting a predicted CE and for the distance between sites. We compared the results of our prediction with the predictions of the programs TFM-Explorer, Cluster-Buster, MSCAN and DiRE. Of the 18 genes contained in TransFac that have experimentally determined Hif1 sites in their upstream region, we found 12. This can be compared with 3 genes found by DiRE, 6 genes by Cluster-Buster, 3 genes by TFM-Explorer and 0 genes by MSCAN. The advantage of our method is that it uses a short list of TFs selected from known TF-TF interactions to search for all possible combinations of sites constituting a CE. Surprisingly, taking into account conservation estimation by phastCons negatively affected the sensitivity and specificity of Hif1 prediction. We are grateful to Dmitry Malko for help in programming. This study has been supported by the Russian Fund of Basic Research, project 07-04-01623.

1 State Research Institute of Genetics and Selection of Industrial Microorganisms, GosNIIGenetika, Moscow, [email protected]
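The search described in this abstract (sites of a TF of interest paired with nearby sites of its interaction partners, all scoring above a threshold) can be sketched as follows. This is a minimal illustration, not the authors' program: the `Site` record, the function name, the partner names `TF_A` and `TF_B`, and all scores, positions and thresholds are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Site:
    tf: str          # transcription factor whose motif matched
    position: int    # coordinate of the site in the upstream region
    score: float     # motif match score

def predict_composite_elements(sites, tf_of_interest, partners, min_score, max_distance):
    """Pair each site of the TF of interest with partner-TF sites that are close enough.

    Only sites scoring at least `min_score` are considered, and the two sites of a
    pair must lie no further than `max_distance` apart.
    """
    good = [s for s in sites if s.score >= min_score]
    anchors = [s for s in good if s.tf == tf_of_interest]
    partner_sites = [s for s in good if s.tf in partners]
    return [
        (a, p)
        for a in anchors
        for p in partner_sites
        if abs(a.position - p.position) <= max_distance
    ]

# Hypothetical predicted sites in one upstream region.
sites = [
    Site("HIF1", 100, 0.90),
    Site("TF_A", 130, 0.85),   # interacting partner, close to the HIF1 site
    Site("TF_B", 400, 0.95),   # interacting partner, too far away
    Site("HIF1", 800, 0.50),   # below the score threshold
    Site("TF_A", 820, 0.90),
    Site("TF_C", 110, 0.99),   # close and strong, but not a known partner
]
ces = predict_composite_elements(
    sites, "HIF1", partners={"TF_A", "TF_B"}, min_score=0.8, max_distance=50
)
print(ces)
```

The set of partners would come from a protein-protein interaction resource, which is what makes the prediction independent of previously annotated CEs.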

1. Shelest, E., et al. (2003) Prediction of potential C/EBP/NF-kappaB composite elements using matrix-based search methods, In Silico Biol, 3(1-2): p. 71-9.
2. Kel-Margoulis, O.V., et al. (2002) TRANSCompel: a database on composite regulatory elements in eukaryotic genes, Nucleic Acids Res, 30(1): p. 332-4.
3. Kel, A., et al. (2006) Composite Module Analyst: a fitness-based tool for identification of transcription factor binding site combinations, Bioinformatics, 22(10): p. 1190-7.
4. Waleev, T., et al. (2006) Composite Module Analyst: identification of transcription factor binding site combinations using genetic algorithm, Nucleic Acids Res, 34(Web Server issue): p. W541-5.
5. Gotea, V. and I. Ovcharenko (2008) DiRE: identifying distant regulatory elements of co-expressed genes, Nucleic Acids Res, 36(Web Server issue): p. W133-9.


STUDYING THE IMPACT OF GENE COPY NUMBER VARIATIONS ON GENE EXPRESSION VIA A GENE REGULATION NETWORK SYLVAIN BLACHON 1, CARITO GUZIOLOWSKI 1, GAUTIER STOLL 2, GAELLE PIERRON 3, STELLY BALLET 3, FRANCK TIRODE 3, OLIVIER DELATTRE 3, EMMANUEL BARILLOT 2, ANDREI ZYNOVIEV 2, ANNE SIEGEL 1, OVIDIU RADULESCU 4

During tumorigenesis, the DNA repair machinery is perturbed. As a result, genomic aberrations arise and may deeply affect the physiology of the tumor cell. It has been partially demonstrated that an increase in gene copy number induces higher expression, but this effect is less clear for small genomic modifications. To study it, we propose a systems biology approach that enables the integration of CGH and expression data together with an influence graph derived from biological knowledge. This work is based on 3 concepts.
1. Studying inter-individual variations in gene copy number and in expression makes it possible to grasp tumor variability and ultimately addresses the problem of individual-centered therapeutics.
2. Confronting post-genomic data with known regulations is a good way to check the soundness and limits of current knowledge.
3. The abstraction level of qualitative modeling allows the integration of heterogeneous data sources.
We tested this approach using data on two tumor types: Ewing tumors and bladder tumors. It allowed the definition of new biological hypotheses that were assessed by random permutation of the initial data sets.

1 INRIA, Centre Inria Rennes - Bretagne Atlantique, 263, avenue du General Leclerc, Campus de Beaulieu, 35042 Rennes Cedex, France, [email protected]
2 Institut Curie Bioinformatics Group, Institut Curie, Service BIOINFORMATIQUE, 26 rue d'Ulm, 75248 PARIS cedex 05, France
3 Genetics and biology of paediatric tumors and sporadic breast cancers - Institut Curie / Inserm Unit 830, 26 rue d'Ulm 75248 Paris cedex 05, France
4 IRMAR, UMR CNRS 6625, Campus de Beaulieu, bâtiments 22 et 23, 263 avenue du Général Leclerc, CS 74205, 35042 RENNES Cédex, France

USING SVM AND A MEASURE OF MOTIF ‘SURPRISE’ TO DISTINGUISH REGULATORY DNA RENE TE BOEKHORST 1, IRINA ABNIZOVA 2, FEDOR NAUMENKO , IVAN KULAKOVSKI 3, WERNISCH LORENZ 4

Motivation and Aim. There are still no satisfactory computational methods to reliably recognize regulatory DNA. Assuming that the main biological and statistical “signature” of regulatory regions is the presence of multiple regulatory motifs, we aim to identify motifs that contribute significantly to the separation of coding (C), regulatory (R) and non-coding non-regulatory (N) DNA.

Methods and Algorithms We use unsupervised pattern recognition (cluster analysis) to back up the performance and to visualize the results of a supervised method (Support Vector Machine). These methods were applied to a new feature representation of DNA sequences. The feature set is a 4^k-dimensional vector whose elements measure how likely each k-mer is in comparison to a model assuming nucleotide independence, and thus how “surprising” a k-mer is (i.e. its degree of over-/under-representation). We subjected the feature set to a hierarchical test procedure that first distinguishes coding from non-coding sequences, and in a next step separates regulatory regions from non-coding non-regulatory DNA.
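A feature vector of this kind can be sketched as follows. The abstract does not give the exact formula, so this sketch assumes a simple binomial Z-score: the expected count of a k-mer is the number of windows times the product of its single-nucleotide frequencies, and overlaps between occurrences are ignored. The function name and the toy sequence are hypothetical.

```python
from collections import Counter
from itertools import product
from math import prod, sqrt

def kmer_surprise(seq, k=3):
    """Return a dict mapping each of the 4**k k-mers to an assumed surprise Z-score.

    Positive values indicate over-representation and negative values
    under-representation relative to a nucleotide-independence model.
    """
    seq = seq.upper()
    n_windows = len(seq) - k + 1
    base_freq = {b: seq.count(b) / len(seq) for b in "ACGT"}
    observed = Counter(seq[i:i + k] for i in range(n_windows))
    z = {}
    for letters in product("ACGT", repeat=k):
        word = "".join(letters)
        p = prod(base_freq[b] for b in word)      # probability under independence
        expected = n_windows * p
        variance = n_windows * p * (1 - p)        # binomial approximation
        z[word] = (observed[word] - expected) / sqrt(variance) if variance > 0 else 0.0
    return z

# Toy sequence (hypothetical): an A-rich stretch followed by a short repeat.
sequence = "AAAAAAAAAA" + "ACGT" * 5
features = kmer_surprise(sequence, k=3)
print(len(features), round(features["AAA"], 2), round(features["CCC"], 2))
```

The resulting 64 values (for k = 3), taken in a fixed k-mer order, form one row of the matrix that would be passed to a classifier or a clustering routine.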

Data The positive training set is a collection of experimentally verified functional Drosophila melanogaster regulatory regions (enhancers) [Nazina & Papatsenko, BMC Bioinformatics 22 (2003)]. The two other (negative training) sets are: (i) 60 randomly picked Drosophila exons, and (ii) 60 randomly picked Drosophila non-coding, non-regulatory (NCNR) sequences.

Results The SVM separated coding DNA (C) very well from the other DNA types (R, N), with an overall accuracy of 97% at the first step. The second step predicted regulatory DNA with an overall accuracy of 95%.

1 University of Hertfordshire, United Kingdom, [email protected]
2 Wellcome Trust Sanger Institute, United Kingdom, [email protected]
3 University of Moscow, Russian Federation
4 MRC Biostatistics Unit Institute of Public Health, United Kingdom

K-means cluster analysis (K=3) resulted in a cluster mainly composed of coding regions and two non-coding clusters, of which the smallest is dominated by regulatory regions. Tests for the association between type of DNA (C, R, N) and cluster membership are highly significant (χ² = 129.16, df = 4, p = 5.89E-27). A hierarchical cluster analysis (Euclidean distances, Ward's method) also clearly distinguished between coding and regulatory regions. One cluster contains only 5 of all 60 coding regions, whereas the second virtually lacks regulatory regions (2 out of 60). A hierarchical cluster analysis of words on sequences resulted in a main cluster containing all the low entropy words (AAA, CCC, GGG and TTT), 70% of the self-repetitive words and about half (56%) of all the 24 intermediate entropy words. The other cluster is made up of the remaining intermediate and all the high entropy words and 67% of the palindromes. Combining the dendrogram of sequences with the dendrogram of words showed that: i) regulatory sequences stand out by either over- or underrepresented words; ii) overrepresented words tend to be of low entropy whereas underrepresented ones are mostly of high entropy. The motifs characteristic of regulatory DNA tend to be biologically important fragments of known TFBSs. We stress the importance of underrepresented motifs, which has been overlooked until now.

Comparison with other methods Our methodology outperforms SVM applications based on string [Leslie et al, Pac. Symp. Biocomput. (2002)] and mismatch kernels [Leslie et al., Adv. Neural Inf. Process. Syst, 20 (2003)]. The latter worked well for the detection of functionally similar proteins, but achieved no more than about 50% accuracy when we applied them to our data. Boeva et al [Algorithms for Molecular Biology (2007)] developed an algorithm for computing the probability (p-value) that s different, possibly overlapping, motifs occur respectively k1, ..., ks or more times. When we used p-values calculated for the Drosophila data as the input for SVM, we obtained almost the same specificity, sensitivity and accuracy as for our Z-scores.


SEARCH FOR DEGENERATE TANDEM REPEATS IN NUCLEOTIDE SEQUENCES. THEIR POSSIBLE ROLE IN REGULATION OF GENE EXPRESSION. V. BOEVA 1, V.J. MAKEEV 2, M. REGNIER 3

During the last decade many experiments have demonstrated that degenerate tandem repeats occur in regulatory regions and play a role in the regulation of gene expression [1, 2]. However, the latest work shows high mutability of tandem repeats located in regulatory regions, even between closely related species [3]. Hence the hypothesis arises that, for the regulation of gene activity, it is the presence of a tandem repeat itself that is important, and not the concrete motif sequence. The program SWAN [4] was written to search for degenerate tandem repeats in DNA sequences. Its advantages are the ability to set a minimal significance level for repeats and the calculation of the statistical significance of every tandem repeat found. In addition, SWAN returns a single result file with a table containing all necessary information about the tandem repeats found, which is easy to process with Excel or Perl. Using SWAN we analyzed the frequencies of degenerate tandem repeats in the complete genome of D. melanogaster as well as in various functional regions such as coding and regulatory regions, intergenic regions and heterochromatin. It was found that the frequency of degenerate tandem repeats in the X chromosome is about 1.5 times greater than in autosomes. This agrees with the result obtained in [5] that the frequencies of exact tandem repeats with period lengths from 1 to 4 are also higher in the X chromosome. We analyzed the frequencies of degenerate tandem repeats of each period in annotated loci of D. melanogaster (Fig 1). One can see that periods divisible by 3 are significantly abundant in coding regions. Apparently this fact is induced by some regulatory structure of coded proteins, e.g. poly(Ala) chain. Interestingly, tandem repeats with periods 6, 7 and 8 occur more frequently in non-coding regions of loci, especially in regulatory ones, than in intergenic regions. We suppose that this is caused by partial destabilization of the double helix (each turn of which is about 10.2 bp), which facilitates transcription factor binding. This hypothesis is corroborated by the fact that repeats with a period divisible by 5, which should stabilize the double helix under our hypothesis, are overrepresented in heterochromatin regions of D. melanogaster. By definition this DNA is not transcribed and stays in a condensed state. The authors thank Andrey Mironov, Natal’ya Esipova and Nika Oparina for fruitful discussions. This work has been supported by the project EcoNet-08159PG and RFBR 04-04-49601.

1 Moscow State University, Vorob'evy Gory, Moscow, Russia, [email protected]
2 State Center GosNIIGenetika, Moscow, Russia, [email protected]
3 INRIA Roquencourt, France, [email protected]

[Fig 1. Coverage (vertical axis, 0 to 0.07) by degenerate tandem repeats as a function of repeat period (horizontal axis, 1 to 13) for intergenic regions, coding regions, regulatory regions, spacers in loci, heterochromatin and a random sequence.]

1. Ott RW, Hansen LK. (1996) Repeated sequences from the Arabidopsis thaliana genome function as enhancers in transgenic tobacco. Mol Gen Genet., 252(5), 563-71. PMID: 8914517.
2. Antoniewski C, Mugat B, Delbac F, Lepesant JA. (1996) Direct repeats bind the EcR/USP receptor and mediate ecdysteroid responses in Drosophila melanogaster. Mol Cell Biol., 16(6), 2977-86. PMID: 8649409.
3. Sinha S. and Siggia E.D. (2005) Sequence turnover and tandem repeats in cis-regulatory modules in Drosophila. MBE, published online on January 19, 2005.
4. V.A. Boeva, M. Regnier, V.J. Makeev (2004) SWAN: searching for highly divergent tandem repeats in DNA sequences with evaluation of their statistical significance. Proceedings of the JOBIM'2004, Montreal, Canada, 2004.
5. Mukund V. Katti, Prabhakar K. Ranjekar and Vidya S. Gupta (2001) Differential Distribution of Simple Sequence Repeats in Eukaryotic Genome Sequences. Molecular Biology and Evolution 18:1161-1167.


APPLICATION OF THE COMPUTER PROGRAM ROSETTA FOR THE PROTEIN STRUCTURE INTERPRETATION FROM TRITIUM PLANIGRAPHY TECHNIQUE DATA: M1 PROTEIN OF INFLUENZA VIRUS A ELENA BOGACHEVA 1, ALEXEY CHULICHKOV 1, ALEXEY DOLGOV 1, ALEKSANDR SHISHKOV 1, ILIYA KUZMIN 2, LIDIA NEFEDOVA 2 , LUDMILA BARATOVA 3

Keywords: protein, spatial structure, tritium planigraphy, computer simulation

1 N.N. Semenov Institute of Chemical Physics, Russian Academy of Sciences, ul. Kosygina, 4, Moscow, 119991 Russia, e-mail: [email protected]
2 Biology Faculty, Moscow State University, Russian Federation
3 Belozersky Institute of Physico-Chemical Biology of Moscow State University, Leninskie Gory, 1, Moscow, 119992 Russia, [email protected]

Determining the spatial structure of proteins remains a highly topical problem, especially when the proteins are part of multicomponent biological complexes such as viruses. The matrix protein M1, which underlies the membrane, is the major structural component of influenza A virus (about 1100-3000 copies per virion). The atomic structure of the N-terminal two thirds of the M1 protein has been solved at acidic and neutral pH [1]. However, the spatial structure of M1 in a membrane environment remains to be understood. Tritium planigraphy provides data on the steric accessibility of hydrocarbon fragments of a macromolecule, which is directly connected with its spatial structure and reflects the «architecture» of the object [2, 3]. The tritium label is introduced through single collisions of tritium atoms with the target protein. The label distribution in the investigated object is usually analyzed at the level of individual amino acids, which is achieved by fragmenting the labeled proteins into short peptides with various proteases. This procedure determines the relative exposure of amino acid residues to tritium, gives detailed information on the structure of the surface, and supports preliminary conclusions about the packing of residues in the macromolecule [4]. We have developed a computer algorithm that imitates the anisotropic conditions of the bombardment of proteins in a membrane environment, properly accounting for the orientation of the protein molecule relative to the membrane surface for the beam of “hot” tritium atoms.

The first working model of the spatial structure of the M1 protein as a component of influenza virus is proposed. The model is based on data obtained by tritium labeling of intact virions and of free M1 protein, on theoretical prediction of the secondary structure of the C-terminal domain of M1, and on application of the developed computer algorithm. The experimental and theoretical data obtained by tritium bombardment and by the simulation algorithm were compared with the Rosetta prediction of the three-dimensional structure of the C-domain [5]. The clusters with the best correlation between the methods were selected. The combined approach substantially reduced the number of hypothetically possible spatial structures of the C-domain. Analysis of the Rosetta algorithms has shown that tritium planigraphy experimental data can be used to construct more accurate 3D structures. This work was partially supported by the Russian Foundation for Basic Research (09-03-00469, 09-04-01160) and the International Science and Technology Center (BTEP#82/ISTC#2816).
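The selection of structure clusters by their agreement with the labeling data can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the abstract does not say which correlation measure was used, so Pearson correlation is assumed here, and the names `pearson`, `rank_models`, `cluster_a`, `cluster_b` and all numbers are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def rank_models(experimental, simulated_by_model):
    """Sort candidate models by how well simulated labeling matches experiment."""
    scores = {name: pearson(experimental, simulated)
              for name, simulated in simulated_by_model.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical relative tritium labeling per residue (experiment) and the
# labeling simulated for two hypothetical candidate structure clusters.
experimental = [0.9, 0.1, 0.5, 0.7]
simulated_by_model = {
    "cluster_a": [0.8, 0.2, 0.4, 0.6],  # exposes the same residues
    "cluster_b": [0.1, 0.9, 0.5, 0.3],  # exposes the opposite residues
}
ranking = rank_models(experimental, simulated_by_model)
print(ranking[0][0])  # cluster_a
```

Any other agreement score could be substituted for `pearson` without changing the ranking logic.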

1. B.Sha, M.Luo (1997) Structure of a bifunctional membrane-RNA binding protein, influenza virus matrix protein M1. Nat. Struct. Biol. 4:239–244. 2. L.A.Baratova, E.N.Bogacheva, V.I.Goldanskii, V.A.Kolb, A.S.Spirin, A.V.Shishkov (1999) Tritium planigraphy of biological macromolecules. Moscow.: Nauka, 175p. 3. E.N.Bogacheva, V.I.Goldanskii, A.V.Shishkov, A.V.Galkin and L.A.Baratova (1998) Tritium planigraphy: from the accessible surface to the spatial structure of a protein. Proc. Natl. Acad. Sci. USA. 95:2790–2794. 4. A.V.Shishkov, E.N.Bogacheva (2007) Tritium planigraphy of biological macromolecules. In: Methods in Protein Structure and Stability Analysis: Conformational Stability, Size, Shape and Surface of Protein Molecules. Eds. V.N.Uversky and E.A.Permyakov, 317–353 (N.-Y.: Nova Science Publishers). 5. K.M.Misura, D.Chivian, C.A.Rohl, D.E.Kim, D.Baker (2006) Physically realistic homology models built with Rosetta can be more accurate than their templates. Proc. Natl. Acad. Sci. USA, 103:5361–5366.


FSDETECTOR: FRAMESHIFT PREDICTION IN PROTEIN CODING SEQUENCES BY THE VITERBI ALGORITHM IVAN ANTONOV 1, MARK BORODOVSKY 2

1 Division of Computational Science and Engineering, Georgia Institute of Technology, 801 Atlantic Drive, Atlanta, GA, USA 30332-0280, [email protected]
2 Department of Biomedical Engineering and Division of Computational Science and Engineering, Georgia Institute of Technology, 313 Ferst Drive, Atlanta, GA, USA 30332-0535, [email protected]

In 2005 the 454 Life Sciences company released a new machine that sequences 400-600 megabases of DNA per 10-hour run. The innovation revolutionized sequencing technology. The new method is 100 times faster and much cheaper than the previously used Sanger capillary sequencing. For these reasons a number of genome projects have by now switched to 454 pyrosequencing and similar next generation sequencing platforms. Due to the nature of pyrosequencing, the 454 method is prone to errors at homopolymer locations. Even with high average coverage, errors in finished sequences are likely to occur more frequently than with the “old” sequencing techniques. Insertion or deletion of one or two nucleotides inside a protein coding region causes a frameshift and results in wrong annotation of the gene, or even in a part of the gene being missed. It is highly desirable to detect frameshift errors as early as possible and to resequence regions with predicted errors before the genome sequence is released to the public.

Here we present a new method, called FSdetector, to predict frameshifts in protein coding regions. FSdetector can be applied to a nucleotide sequence that contains intronless protein-coding regions. Thus, the method is applicable to prokaryotic genomic sequences, to sequences from fungal genomes with intronless genes, or to clustered EST sequences. The method works in two steps. In the first step the gene finding program GeneMarkS [1] is used to identify genes in the given DNA sequence. Upon approaching a gene with a frameshift, GeneMarkS predicts two genes in different frames. These two putative genes, located in the same strand, appear as overlapping or adjacent genes. In the second step all DNA regions containing predicted overlapping and adjacent genes are selected. Each region is analyzed by FSdetector to identify a possible frameshift.

The algorithm design is centered around a Hidden Markov Model (HMM) of a genomic region that could be a genuine gene overlap, a pair of adjacent genes, or a gene with a frameshift (Fig. 1). States 1, 2 and 3 correspond to the three possible “global” frames of reading the genetic code in the given strand. States designated i/j, where i=1,2,3 and j=1,2,3, indicate gene overlap regions, with i indicating the frame of the upstream gene and j the frame of the downstream gene. The colors of the start and stop codon states are indicative of their global frames. A direct transition from one coding state to another is possible only as a frameshift. A genuine gene overlap is identified by a transition between two coding states that passes through the overlapping states (i/j type); adjacent genes are connected through the non-coding state (N/C). The maximum likelihood path through the model for a given sequence is found by the Viterbi algorithm. If this path includes a direct transition between coding states, a frameshift is predicted.

Fig. 1. HMM designed for FSdetector. [Diagram: coding states 1, 2 and 3; overlap states 1/2, 1/3, 2/1, 2/3, 3/1 and 3/2; the non-coding state N/C; start codon and stop codon states.]

In accuracy tests of FSdetector on the whole Escherichia coli genomic sequence, with frameshifts introduced randomly into annotated genes, we observed 76.3% sensitivity (Sn) and 73.3% specificity (Sp). It should be noted that the Sn and Sp values were obtained for the 2nd order sequence model with HMM parameters chosen by initial heuristics. The initial settings leave ample room for improvement; in the conference presentation we will discuss the method with further improvements, generalizations and applications to various species.
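The decoding step can be illustrated with a minimal log-space Viterbi sketch in Python. It is not the FSdetector implementation. The three-state toy model (two coding frames `F1` and `F2` plus a non-coding state `NC`), the symbols `a`, `b`, `n` standing for per-position evidence, and all probabilities are assumptions invented for illustration; the real model uses the states of Fig. 1 and a 2nd order sequence model.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable state path for `obs`, computed in log space."""
    log = math.log
    # score[s]: best log-probability of any path that ends in state s
    score = {s: log(start_p[s]) + log(emit_p[s][obs[0]]) for s in states}
    backpointers = []
    for symbol in obs[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            best = max(states, key=lambda p: prev[p] + log(trans_p[p][s]))
            score[s] = prev[best] + log(trans_p[best][s]) + log(emit_p[s][symbol])
            pointers[s] = best
        backpointers.append(pointers)
    # Trace back from the best final state
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]

def predicted_frameshifts(path, coding_states):
    """Positions where the path jumps directly from one coding state to another."""
    return [t for t in range(1, len(path))
            if path[t - 1] in coding_states
            and path[t] in coding_states
            and path[t - 1] != path[t]]

# Toy model (all values hypothetical)
states = ["F1", "F2", "NC"]
start_p = {s: 1 / 3 for s in states}
trans_p = {s: {t: (0.9 if s == t else 0.05) for t in states} for s in states}
emit_p = {
    "F1": {"a": 0.8, "b": 0.1, "n": 0.1},
    "F2": {"a": 0.1, "b": 0.8, "n": 0.1},
    "NC": {"a": 0.1, "b": 0.1, "n": 0.8},
}
path = viterbi("aaaabbbb", states, start_p, trans_p, emit_p)
shifts = predicted_frameshifts(path, {"F1", "F2"})
print(path)    # ['F1', 'F1', 'F1', 'F1', 'F2', 'F2', 'F2', 'F2']
print(shifts)  # [4]
```

In this toy run the best path switches from `F1` to `F2` without passing through `NC`, which is the pattern the abstract interprets as a frameshift.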

1. J. Besemer et al. (2001) GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Nucleic Acids Res., 29: 2607-18.

AUTOMATIC TOOL TO DESCRIBE STRUCTURE OF RELIABLE BLOCKS IN A MULTIPLE ALIGNMENT OF PROTEIN SEQUENCES BORIS BURKOV 1, BORIS NAGAEV 2, SERGEI SPIRIN 3, ANDREI ALEXEEVSKI 4

Keywords: multiple alignment of protein sequences, blocks detection

1 Moscow State University, Russian Federation, [email protected]
2 Moscow State University, Russian Federation, [email protected]
3 Belozersky Institute, Moscow State University, Russian Federation, [email protected]
4 Belozersky Institute, Moscow State University, Russian Federation, [email protected]

To extract useful information from a multiple alignment of protein sequences, an expert first needs to distinguish between reliably aligned parts and parts where no relevant alignment can be detected at the sequence level. The former parts can be verified by 3D structure comparison (if structures are available). The latter parts may correspond, for example, to differently located loops of the proteins, so the alignment in those parts makes no sense. Most alignment programs do not take this into consideration. A number of tools facilitating multiple alignment analysis are currently available. They are implemented in alignment editors and visualization servers (e.g. Jalview [1] or T-coffee [2]), but they do not seem to cover all alignment features of interest.

We created a tool for autOmatic Partition of a given multiple ALignment (OPAL) into so-called blocks. A block is a part of the alignment defined by a continuous series of positions within a subfamily of sequences. Blocks are divided into two groups: blocks of reliable alignment (plus-blocks) and blocks of senseless or unreliable alignment (minus-blocks). The output of the main program is the sets of plus- and minus-blocks. Together, plus- and minus-blocks cover the whole alignment, and blocks do not intersect. The OPAL_vis module displays all blocks of the alignment, allowing navigation through the blocks and visualization of each block within the alignment.

The algorithm iteratively repeats a procedure that finds one plus-block within an analyzed block. First the procedure is applied to the entire alignment. The plus-block may be either full-width in the alignment or full-width in the subalignment defined by a cluster of sequences. If a plus-block is found, it is stored in the output data and the remaining parts of the analyzed block are analyzed by the same procedure. Otherwise, the input is considered a minus-block. Special criteria of block reliability were developed and implemented. The algorithm was implemented in the OPAL_cut program.

To test OPAL_cut on multiple alignments of proteins with solved 3D structures, the OPAL_test module was created. For each plus-block found by OPAL_cut, its so-called geometrical core [3] is determined. If the geometrical core comprises the whole block or a significant part of it, the reliability of the block is considered to be supported by 3D data. The OxBench benchmark alignment database [4] was used as a source of structural data for a large-scale test. The test showed that 90% of plus-blocks are supported by structural data (the geometrical core comprises >80% of the block's positions).

The OPAL package can be useful for expert analysis of large protein alignments and for a number of other purposes, such as refinement of multiple alignments, assessment of the performance of multiple alignment programs, identification of subfamilies and reconstruction of phylogeny.

We are grateful to Elena Lukina for help in preparing 3D superimpositions and structural alignments and to Daniil Alexeevski for helpful hints and assistance. The work is partly supported by the RFBR-DFG grants 07-04-91560 and 08-04-91975.
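The recursive partition described above can be sketched in Python as follows. This is a simplified illustration, not OPAL_cut: it only finds full-width blocks (no sequence subfamilies), and the reliability criterion (`column_is_reliable`, a gap-free column in which one residue reaches `min_identity`) and the `min_len` parameter are hypothetical stand-ins for the special criteria mentioned in the abstract.

```python
def column_is_reliable(column, min_identity=0.8):
    """Hypothetical criterion: no gaps and one residue dominates the column."""
    if "-" in column:
        return False
    most_common = max(set(column), key=column.count)
    return column.count(most_common) / len(column) >= min_identity

def find_plus_block(alignment, start, end, min_len=3):
    """Longest run of reliable columns inside [start, end), or None."""
    best, run_start = None, None
    for pos in range(start, end + 1):
        # pos == end acts as a sentinel that closes a run reaching the border
        reliable = pos < end and column_is_reliable([seq[pos] for seq in alignment])
        if reliable and run_start is None:
            run_start = pos
        elif not reliable and run_start is not None:
            length = pos - run_start
            if length >= min_len and (best is None or length > best[1] - best[0]):
                best = (run_start, pos)
            run_start = None
    return best

def partition(alignment, start=0, end=None, min_len=3):
    """Split [start, end) into ('+', s, e) plus-blocks and ('-', s, e) minus-blocks."""
    if end is None:
        end = len(alignment[0])
    if start >= end:
        return []
    plus = find_plus_block(alignment, start, end, min_len)
    if plus is None:
        return [("-", start, end)]  # nothing reliable found: a minus-block
    s, e = plus
    # Store the plus-block and analyze the remaining parts with the same procedure
    return (partition(alignment, start, s, min_len)
            + [("+", s, e)]
            + partition(alignment, e, end, min_len))

alignment = ["ACDEFGHIK",
             "ACDEWGHIK",
             "ACD-YGHIK"]
blocks = partition(alignment)
print(blocks)  # [('+', 0, 3), ('-', 3, 5), ('+', 5, 9)]
```

The resulting blocks cover the whole alignment and do not intersect, as required of the plus- and minus-blocks.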

1. M.Clamp et al. (2004) The Jalview Java alignment editor, Bioinformatics, 20(3):426-7. 2. O.Poirot, E.O'Toole and C.Notredame (2003) Tcoffee@igs: a web server for computing, evaluating and combining multiple sequence alignments Nucleic Acids Research, 31(13): 3503-3506. 3. M.Gribkov et al. (2004) Life Core, the program for classification of 3D structures of macromolecules Biofizika, 48(1):157-166 4. G.P.Raghava et al. (2003) OXBench: A benchmark for evaluation of protein multiple sequence alignment accuracy. BMC Bioinformatics, 4:47


EVOLUTION OF SIGNAL PEPTIDE APPEARANCE/DISAPPEARANCE IN BACTERIAL GENOMES NADEZHDA BYKOVA 1, ANDREJ MIRONOV 1

Keywords: signal peptide

1 Department of Bioengineering and Bioinformatics, Moscow State University, GSP-2, building 73, Leninskiye Gory, Moscow, 119992, Russia, [email protected]

Introduction
A signal peptide is a sequence of 15-30 amino acids at the N-terminus of a protein that directs the protein to export from the cytoplasm. In previous work we have shown that the presence of a signal peptide is not conserved in clusters of orthologous genes and that this is not only due to mistakes of prediction programs (unpublished data). In the present work we studied the evolutionary events of signal peptide appearance and disappearance in such clusters. We found evidence of both ancient and recent events. We also tried to characterize the clusters and genomes in which these events are overrepresented. One of the important outcomes of this work is a list of recent signal peptide appearances. We suggested that signal peptide appearance is preceded by gene duplication, so we also studied clusters and genomes rich in pairs of paralogs in which one protein has a signal peptide and the other does not. The most active were some symbiotic and pathogenic bacteria, and there were even slight differences between strains of the same species, for example between pathogenic and non-pathogenic strains. This shows a correlation between adaptation requirements and high rates of signal peptide appearance. All the data, including tree pictures and signal/non-signal paralogs, are available at http://www.bioinf.fbb.msu.ru/SignalWeb/

Materials and Methods
1) Protein clusters were downloaded from the NCBI Protein Clusters database [1]. We considered only clusters that contain more than 8 proteins.
2) For signal peptide prediction we used SignalP 3.0-NN [2] with the standard thresholds.
3) We also corrected annotation errors in suspicious pairs of proteins (sequence identity >70% and different signal peptide predictions) by pairwise alignment followed by a search for signal peptides in the 150 bp upstream of the start of the local alignment.

Results
Of the 37863 clusters we analysed, 25471 were predicted as completely non-signal and 2168 as completely signal, so 27% of the clusters have potential appearance/disappearance events. For our purpose we took only clusters that contain at least 3 signal and 3 non-signal peptides. After the correction of gene starts, we analysed the distribution of predicted signal peptides on the evolutionary tree of such clusters using the Events Number value and the E (economy) value. We found a significant number of relatively ancient divergences (see Table 1), for example in the deaminase cluster (PRK06846), where the divergence happened at the level of gram-positive/gram-negative bacteria.
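The cluster counts quoted in the Results can be checked with a few lines of Python. The sketch also shows one possible way to classify a cluster from per-protein predictions; the function `classify_cluster` and its labels are illustrative assumptions, only the numbers and the "at least 3 signal and 3 non-signal" rule come from the text.

```python
def classify_cluster(has_signal, min_each=3):
    """Classify a cluster from per-protein signal peptide predictions (booleans)."""
    n_signal = sum(has_signal)
    n_non_signal = len(has_signal) - n_signal
    if n_signal == 0:
        return "non-signal"
    if n_non_signal == 0:
        return "signal"
    if n_signal >= min_each and n_non_signal >= min_each:
        return "mixed, analysed"  # at least 3 signal and 3 non-signal proteins
    return "mixed"

# Counts reported in the text
total, all_non_signal, all_signal = 37863, 25471, 2168
mixed = total - all_non_signal - all_signal
mixed_percent = round(100 * mixed / total)
print(mixed, mixed_percent)  # 10224 27

print(classify_cluster([True] * 4 + [False] * 5))  # mixed, analysed
```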

Table 1. Signal peptide appearance/disappearance events

Events number    All clusters    Clusters with signal/non-signal paralogs
1                889             141
2                1327            248
3                995             177
>3               964             270
Total            4175            836

On the other hand, we also found recent events that are unlikely to be prediction errors, because the signal peptide is deleted in one of the sequences (and there is a stop codon immediately before the start of the local alignment). An example is the disappearance of the signal peptide in a cluster of endo-1,4-D-glucanase (PRK11097; catalyzes the hydrolysis of 1,4-beta-D-glucosidic linkages in cellulose, lichenin and cereal beta-D-glucans) in 4 strains of Yersinia pestis and in Yersinia pseudotuberculosis IP 32953, while it is still present in Y. enterocolitica and in all other members of this cluster. We can therefore conclude that signal peptide appearance/disappearance events are relatively fast and that some symbiotic/pathogenic bacteria use this feature for their adaptation, as can be seen by comparing pathogenic and non-pathogenic strains (for example, the pathogenic strain Escherichia coli O157:H7 has 5 additional clusters with diverged signal/non-signal paralogs in comparison with E. coli K12).

Acknowledgements
Howard Hughes Medical Institute [grant number 55005610]; the Program “Molecular and Cellular Biology” of the Russian Academy of Sciences; and Russian Foundation of Basic Research [grant numbers 09-04-92742, 07-04-91555].

1. Klimke W. et al. (2009) The National Center for Biotechnology Information's Protein Clusters Database, Nucleic Acids Res., 37(Database issue): D216–23. 2. Bendtsen J.D. et al. (2004) Improved prediction of signal peptides: SignalP 3.0, J. Mol. Biol., 340:783-795.


A STATISTICAL METHOD FOR PWM CLUSTERING SOLENNE CARAT 1,2 , REMI HOULGATTE 1, JEREMIE BOURDON 2

1 Institut du thorax, INSERM U915, Nantes, {Solenne.Carat,Remi.Houlgatte}@univ-nantes.fr
2 LINA, CNRS UMR6241, Nantes, Jeremie.Bourdon@univ-nantes.fr

Introduction
Motif discovery is a fundamental problem in molecular biology. It has important applications in the study of regulatory signals and in the discovery of transcription factor binding sites. Several motif discovery tools have been proposed (see [1] for a complete review). They all extract significant motifs from sets of sequences. Nevertheless, motif discovery for complex organisms is still a challenge. It is therefore worthwhile to exploit the specificities of each discovery tool, run with different parameters, to extract several putative motifs of interest. Doing this requires dealing with redundant motifs, which must be removed. Here, we propose a method for comparing several motifs given by their PSSM (Position Specific Scoring Matrices). The method automatically detects periodic motifs and redundant motifs. It is also possible to compare a final set of motifs with public databases [2,3]. Palindromic motifs can also be detected easily with this method.

Methods
Our method is based on the comparison of PSSM. The use of PSSM, rather than PWM, is justified by the exactness of the content, while PWM may require pseudo-count adaptation. These PSSM can be constructed easily from the output of any motif discovery tool. All matrices are compared pairwise. Reverse complements are also taken into account. For each pair of motifs (m,n), m is compared with all possible shifts of n. Shifts make it possible to detect overlapping motifs. The specificity of our comparison method is that it is performed only on bases whose frequencies exceed a given threshold, for example the background frequency. This limits the effect of noise on the comparison. Finally, a Chi-square test is used to compare the two frequency distributions. This comparison method makes it possible to detect a periodic motif, such as the tandem repeat GC, by comparing a PSSM to itself with a lag of 2 bases. If the two PSSM are similar, the motif is periodic (Fig. 1.1). In the same vein, comparing a PSSM with its reverse complement makes it possible to determine whether it is a palindrome (Fig. 1.2).
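The periodicity and palindrome checks described above can be sketched in Python. This is a minimal illustration under assumptions, not the authors' code: a PSSM is represented as a list of per-position count dictionaries, the threshold is taken to be a uniform background of 0.25, columns are compared with a chi-square test of homogeneity at the 5% level using tabulated critical values, and the function names and toy counts are hypothetical. The shift search between two different motifs is omitted.

```python
BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
# Chi-square critical values at alpha = 0.05 for 1, 2 and 3 degrees of freedom
CHI2_CRIT_05 = {1: 3.841, 2: 5.991, 3: 7.815}

def reverse_complement(pssm):
    """Reverse the column order and swap complementary bases."""
    return [{COMPLEMENT[b]: col[b] for b in BASES} for col in reversed(pssm)]

def columns_differ(col_a, col_b, background=0.25):
    """Chi-square test restricted to bases above background in either column."""
    n_a, n_b = sum(col_a.values()), sum(col_b.values())
    kept = [b for b in BASES
            if col_a[b] / n_a > background or col_b[b] / n_b > background]
    a = [col_a[b] for b in kept]
    b = [col_b[b] for b in kept]
    row_a, row_b = sum(a), sum(b)
    if row_a == 0 or row_b == 0:
        return True   # the columns share no dominant base at all
    if len(kept) < 2:
        return False  # both columns are dominated by the same single base
    total = row_a + row_b
    stat = 0.0
    for x, y in zip(a, b):
        col_total = x + y
        for observed, row in ((x, row_a), (y, row_b)):
            expected = row * col_total / total
            stat += (observed - expected) ** 2 / expected
    return stat > CHI2_CRIT_05[len(kept) - 1]

def is_periodic(pssm, lag):
    """Compare the PSSM with itself shifted by `lag` positions."""
    return all(not columns_differ(a, b) for a, b in zip(pssm, pssm[lag:]))

def is_palindromic(pssm):
    """Compare the PSSM with its reverse complement."""
    return all(not columns_differ(a, b)
               for a, b in zip(pssm, reverse_complement(pssm)))

g = {"A": 1, "C": 1, "G": 17, "T": 1}  # G-rich column (toy counts)
c = {"A": 1, "C": 17, "G": 1, "T": 1}  # C-rich column (toy counts)
gc_repeat = [g, c, g, c, g, c]
print(is_periodic(gc_repeat, 2))  # True
print(is_periodic(gc_repeat, 1))  # False
print(is_palindromic(gc_repeat))  # True
```

Restricting the test to bases above background follows the noise-limiting idea in the text; the exact handling of columns with a single retained base is a design choice of this sketch.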

Fig. 1 : Comparison of several motifs

Optimizations
Many parts of the processing are independent of each other. It is thus possible to take advantage of modern computer architectures (multicore computers, clusters, grids) by parallelizing these parts of the computation. This greatly reduces the time needed to obtain a full result.

Discussion
Motif comparison makes it possible to detect periodic and palindromic motifs and, through public databases, to identify the transcription factors that recognize them. Moreover, by grouping similar motifs, it is possible to generate consensus motifs that correspond to a larger number of sequences and to reduce the number of motifs to be studied.

1. G. K, Sandve, F. Drablos (2006), Biology direct, 1:11 2. A. Sandelin et al., (2004), Nucleic Acids Res. 32: D91-94 3. V.Matys et al., (2006), Nucleic Acids Res., 34:D108-110


CONSTRUCTION AND HETEROLOGICAL EXPRESSION IN E. COLI OF THE DELETION DERIVATIVES OF THE CYANOBACTERIUM SYNECHOCYSTIS SP. PCC 6803 DRGA GENE AND ITS HYBRIDS WITH GFP REGINA CHAKHIRIDIS 1, VERA GRIVENNIKOVA 1, ELENA MURONETS 1, KIRILL TIMOFEEV 1, IRINA ELANSKAYA 1, VIKTORIYA TOPOROVA 2, ALEXEI NEKRASOV 2, DMITRY DOLGIKH 2

Keywords: Cyanobacteria, NAD(P)H:quinone oxidoreductase, nitroreductase, electron transport

1 Faculty of Biology, M.V. Lomonosov Moscow State University, Moscow 119991, Leninskie Gory, 1-12; tel. (495)9391179, fax (495)9392957, [email protected]
2 Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, Russian Academy of Sciences, Miklukho-Maklaya 16/10, Moscow, 117997, Russia, tel. (495)3306983, fax (495)3357103, [email protected]

The soluble NAD(P)H:quinone oxidoreductase encoded by the drgA gene of the cyanobacterium Synechocystis sp. PCC 6803 is involved in NADPH oxidation and is responsible for the sensitivity of the cell to nitroaromatic inhibitors as well as for resistance to the oxidative stress inducer menadione [1]. The DrgA protein is responsible for peroxide reduction in the Fenton reaction [2] and participates in the regulation of photosynthetic and respiratory electron transport in cyanobacterial thylakoid membranes [3]. The protein sequences of DrgA from Synechocystis sp. PCC 6803 and of its homologues from other microorganisms were aligned and studied for their information content by analysis of the Shannon-Weaver informational entropy computed as a function of the distance between amino acid residues [4-6]. Sites with an increased degree of information coordination between residues (IDIC-sites) were identified. Associations of information-coordinated structural elements (IDIC-trees and IDIC-branches) were mapped. The coding sequence of the drgA gene was amplified by PCR. To study the functional topology of DrgA, several new deletion derivatives of the drgA gene (drgA ∆1, drgA ∆2, and drgA ∆3) were constructed using PCR. To facilitate protein purification, we fused the 3’-ends of all genes with a 12xHis tag coding sequence. For visualization of DrgA, the genes encoding the green fluorescent proteins (GFP) cherry or egfp were placed between the drgA and 12xHis tag coding sequences. Several constructs for direct constitutive and inducible intracellular expression in E. coli of drgA and its deletion variants were designed and investigated. The recombinant proteins were purified by IMAC chromatography. The enzyme activity of DrgA was tested. The purified DrgA-12His protein exhibited high quinone reductase and nitroreductase activity. The rate of re-reduction of the photooxidized Photosystem I reaction center increased after addition of the DrgA-12His protein and NADPH to isolated cyanobacterial thylakoid membranes. Thus, the DrgA protein may participate in electron transfer from NADPH to the plastoquinone pool in thylakoid membranes of the cyanobacterium Synechocystis sp. PCC 6803.
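The entropy analysis is only named in the abstract, not specified. As a rough illustration of what "entropy as a function of the distance between residues" can mean, the following Python sketch computes the Shannon entropy of residue pairs separated by d positions in a single sequence. The function name `pair_entropy`, the use of pair frequencies and the toy sequences are assumptions; the published method [4-6] may differ substantially.

```python
import math
from collections import Counter

def pair_entropy(sequence, distance):
    """Shannon entropy (bits) of residue pairs (seq[i], seq[i + distance])."""
    pairs = Counter(zip(sequence, sequence[distance:]))
    total = sum(pairs.values())
    return -sum(n / total * math.log2(n / total) for n in pairs.values())

# Toy sequences: a strictly alternating sequence and a homopolymer
alternating = "ABABABAB"
print(pair_entropy(alternating, 2))  # 1.0 (pairs AA and BB are equally frequent)
print(pair_entropy("AAAA", 1) == 0)  # True (only one kind of pair occurs)
```

Lower entropy at a given distance indicates that residues at that separation constrain each other more strongly, which is the general kind of signal an information-coordination analysis looks for.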

The work was supported by RFBR grant 09-04-01119.

1. Elanskaya I.V., Chesnavichene E.A., Vernotte C., and Astier C. (1998) Resistance to nitrophenolic herbicides and metronidazole in the cyanobacterium Synechocystis sp. PCC 6803 as a result of the inactivation of a nitroreductase-like protein encoded by drgA gene. FEBS Letters, 428: 188-192. 2. Takeda, K., Iizuka, M., Watanabe T., Nakagawa, J., Kawasaki, S., and Niimura Y. (2007) Synechocystis DrgA protein functioning as nitroreductase and ferric reductase is capable of catalyzing the Fenton reaction. FEBS J., 274: 1318-1327. 3. Matsuo M., Endo T., and Asada K. (1998) Isolation of a novel NAD(P)H- quinone oxidoreductase from the cyanobacterium Synechocystis PCC 6803. Plant Cell Physiol., 39: 751-755. 4. Nekrasov A.N. (2002) Entropy of Protein Sequences: an Integral Approach. Journal of Biomolecular Structure & Dynamics, 20: 87-92. 5. Rogov S.I., Nekrasov A.N. (2001) A Numerical Measure of Amino Acid Residues Similarity Based on the Analysis of their Surroundings in Natural Protein Sequences. Protein Engineering, 14: 459-463. 6. Nekrasov A.N. (2004) Analysis of Information Structure of Protein Sequences: A New Method for Analyzing the Domain Organization of Proteins. Journal of Biomolecular Structure & Dynamics, 21: 615-623.


ROLE OF GATA4 AND NKX2-5 IN CONGENITAL HEART DEFECTS OF INDIAN POPULATION: A PRELIMINARY REPORT ANBARASAN CHAKRAPANI 1, ASHOK KUMAR MANICKARAJA 1, CHERIAN K. M 1, SOMA GUHATHAKURTA 1, VIJAYA M NAYAK 1

1 Department of Genetic Engineering, Frontier Tissue Line, R-30-C, Ambattur Industrial Estate Road, Mogappair, Chennai-600 101, Tamil Nadu, India, [email protected], [email protected]

Congenital heart disease (CHD) is a structural abnormality of the heart that is present at birth, even if it is discovered much later. The burden of CHD in India is quite high, with a prevalence of 2%. A number of studies have identified GATA4, Nkx2-5 and Tbx5 among the candidate genes causing CHD. The zinc finger transcription factor GATA4 and the evolutionarily conserved homeodomain-containing transcription factor Nkx2-5, located on 8p23.1-22 and 5q35.2 respectively, are thought to play a vital role in cardiogenesis. The objective of the present study was to screen CHD patients of the Indian population for reported mutations in the Nkx2-5 gene, exon 1 (249C→T) and exon 2 (723A→G, 735C→T), and in exons 3 and 4 of the GATA4 gene (at positions 687G, 700G, 796C, 848G, 779G and 818A, including 700G→A). Only these exons of the Nkx2-5 and GATA4 genes were examined, because a high incidence of mutations in them was reported in previous studies of other populations.

Forty phenotypically well characterized non-syndromic patients [19 Atrial Septal Defects (ASD), 12 Ventricular Septal Defects (VSD), 2 Atrioventricular Septal Defects (AVSD), Tetralogy of Fallot (TOF), 2 Corrected Transposition of Great Arteries (CTGA)] who had been referred to the International Centre for Cardio Thoracic & Vascular Diseases (A Unit of Frontier Lifeline Pvt. Ltd. & Dr. K. M. Cherian Heart Foundation, Chennai) for CHD treatment from November 2008 to March 2009 were selected. Preoperative blood samples of the patients were collected after obtaining their informed consent. Genetic counselling revealed that 7.5% of the patients (ASD=2, VSD=1) were born to consanguineous parents, 2.5% (n=1, ASD) had a familial history of CHD and 2.5% (n=1, ASD) were born premature. DNA was isolated from peripheral blood using Lahiri’s method [1], and the DNA was quantified by agarose gel electrophoresis using a λ Hind-III digested ladder [MBO Fermentas, USA]. The exon 1 and exon 2 regions of the Nkx2-5 gene [2] and the exon 3 and exon 4 regions of GATA4 [3] were amplified using the corresponding primers and subjected to RFLP analysis using the reported restriction enzymes.

Mutations were observed in exon 2 of Nkx2-5 (735C→T, Gln187Ter, heterozygous) in one VSD patient and in exon 3 of GATA4 (700G→A, Gly234Ser, heterozygous) in one CTGA and one OSASD patient. The 735C→T transition of the Nkx2-5 gene found in one VSD patient was previously observed in a German study [4]. The GATA4 exon 3 mutation Gly234Ser was identified in two patients, one CTGA and one OSASD. A Japanese study has previously reported the same mutation in 1 patient among 68 mutations [5]. None of the other mutations studied in GATA4 and Nkx2-5 was observed in our population. These results indicate that the above two mutations are not population specific and that Indians also carry mutations in GATA4 and Nkx2-5. Further, new mutations could also be identified among these patients, as Indians are a unique genetic entity. The result has to be validated with a larger number of patients in extensive studies on the role of GATA4 and Nkx2-5 in the Indian population.

1. Lahiri D. K. et al. (1993), DNA isolation by a rapid method from human blood samples. Effect of MgCl2, EDTA, storage time and temperature on DNA yield and quality, Biochemical Genetics, 31: 321-328
2. Wei-min Z., Xiao-feng L., Zhong-yuan M., et al. (2009), GATA4 and NKX2.5 gene analysis in Chinese Uygur patients with congenital heart disease, Chinese Medical Journal, 122(4):416-419
3. Reamon-Buettner S. M., Cho S. H., Borlak J. (2007), Mutations in the 3'-untranslated region of GATA4 as molecular hotspots for congenital heart disease (CHD), Biomedical Centre Medical Genetics, 8:38
4. Reamon-Buettner S. M., Hecker H., Spanel-Borowski K. et al. (2004), Novel NKX2–5 Mutations in Diseased Heart Tissues of Patients with Cardiac Malformations, American Journal of Pathology, 164(6).
5. Reamon-Buettner S. M., Borlak J. (2005), GATA4 zinc finger mutations as a molecular rationale for septation defects of the human heart, Journal of Medical Genetics, 42

I would like to thank my research project students Saranya Devi C., Reshme J., Shruthi V., Srividya V., Aishwarya V., Ram Prasath G., Nelson Rajkamal A., Pooranamathi and Muhammed Sirajeeden for their support in the research work.


HYDROGEN BOND GEOMETRY IN REGULAR HELIX STRUCTURES DMITRII L. UKRAINSKII 1, VLADIMIR O. CHEKHOV 1, VLADIMIR G. TUMANYAN 1, NATALIA G. ESIPOVA 1

Quantum-chemical calculations of compounds that allow modeling of interpeptide H-bonds in polypeptide helices provide unique information about the physical nature of these bonds. Our purpose was quantum-chemical modeling of interpeptide H-bonds with variation of geometric parameters. Two semi-empirical methods PM3, AM1 and ab initio methods STO3G, 3- 21G and 6-31G** were used in this study. The above mentioned methods were included into application packages GAMESS and HyperChem Pro 6. So the AM1 method was found the most adequate for our purposes as the difference between the optimal orientation of the N–H bond obtained from AM1 calculations and the one from ab initio lies within the 3 ° limit. It also appears to be valid for simulations of peptide groups belonging to regular helical peptide chains exemplified by 1cq2 and 2mb5 proteins. We observed how the total energy of a single peptide group in regular (ideally infinite) helical structures depends on the orientation of N–H bond. We computed the energies of regular helical octo-, nano- and deca-Gly structures at different Ramachandran angles ϕ and ψ. The dependence of the total energy of peptide group situated between the sixth and the seventh (from N-terminus) amino acids versus N–H bond deviations from the bisector line of C α–N–C′ valence angle was obtained at frozen geometries of N–H bonds for the rest peptide groups. In all these and the following simulations the bond length was adopted to be 1.01 Å. Boundary effects have been eliminated during the calculations. For Ramachandran angles –75 ° ≤ ϕ ≤ –47 ° and –57 ° ≤ ψ ≤ –25 ° typical for A- area structures, we observed that even when it is hard to choose between hydrogen acceptors, the peptide group total energy has a single minimum depending on N-H bond direction. It was shown that all these dependencies suggest the presumed H-bonding in “indecisive” positions even if Rose criterion predicts existence of a direct H-bond. 
In all cases the energetically permitted range of deviations lies within a ±10° interval in the plane of the peptide group, while for deviations in the perpendicular plane the range is about ±30°. Performing this minimisation at each point of the Ramachandran plot yields a rather flat energy surface in the region adjacent to the line described by the equation (ϕ+51°)/(ψ+50°) ≈ 1.1. The region of the plot under consideration contains the α-helix area, the 3₁₀-helix area and part of the π-helix area. Interestingly, the energy minimum (ϕ = –51°, ψ = –50°) does not coincide with any of the canonical helical forms. The energy corresponding to the classical Pauling α-helix exceeds the minimal energy by 0.7 kcal/mol. Note that kT at room temperature is about 0.6 kcal/mol. The π-helix energy is practically the same as the α-helix energy, while the 3₁₀-helix energy exceeds the α-helix energy by approximately 1.5 kcal/mol. The type of helical structure therefore depends on the nature of its residues and possibly on their surroundings. For H-bonds, the donor-acceptor distances lie between 2.4 and 4.4 Å in the ϕ, ψ region under investigation. The H–N–O(acceptor) angles follow the distribution shown in Fig. 1. The angles are predominantly found in the 25°-30° interval. A significant number of the angles fall in the 35°-55° interval; however, the residue energies in these cases exceed 5 kcal/mol. The angles are smallest when the donor-acceptor distance is about 3 Å, and they are never less than 15°. Thus, almost every hydrogen bond in the A-area can be regarded as “indecisive”.

1 Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, ul. Vavilova 32, Moscow, 117984 Russia; fax: +7 (499) 135-1405; e-mail: [email protected]; [email protected]

Fig. 1. Histogram of the absolute value of the angle between the N–H direction and the direction from the hydrogen donor atom (N) towards an acceptor (O). Black bins include only the cases in which the effective energy per glycine residue exceeds the minimum by no more than 5 kcal/mol. Grey bins include all cases.
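The angle plotted in Fig. 1 can be obtained from atomic coordinates with elementary vector algebra. The following Python sketch is only an illustration and not the authors' code: the function name `hno_angle_deg` and the coordinates are hypothetical, chosen so that the N–H bond is 1.01 Å long and the acceptor lies 3 Å from the donor.

```python
import math

def hno_angle_deg(n, h, o):
    """Angle (degrees) at the donor N between the N-H bond and the N...O direction."""
    nh = [h[i] - n[i] for i in range(3)]
    no = [o[i] - n[i] for i in range(3)]
    dot = sum(a * b for a, b in zip(nh, no))
    norm = math.hypot(*nh) * math.hypot(*no)
    # Clamp to [-1, 1] to guard against rounding before taking acos
    cos_theta = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_theta))

# Hypothetical coordinates in angstroms (not taken from any structure):
# N at the origin, H at 1.01 A along x, O at 3 A from N and 25 degrees off the N-H axis.
n_atom = (0.0, 0.0, 0.0)
h_atom = (1.01, 0.0, 0.0)
o_atom = (3.0 * math.cos(math.radians(25.0)),
          3.0 * math.sin(math.radians(25.0)),
          0.0)

angle = hno_angle_deg(n_atom, h_atom, o_atom)
print(f"H-N-O angle: {angle:.1f} degrees")
```

A perfectly linear N–H···O bond would give 0° in this convention, so the 25°-30° peak of the histogram corresponds to N–H bonds that point noticeably away from the acceptor.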

This work was supported by grants from the Russian Foundation for Basic Research (projects No. 07-04-01765 and 08-04-00849) and by the Molecular and Cellular Biology Program of the Russian Academy of Sciences.

NEGATIVE INFORMATION ENTROPY AS A MEASURE OF NONEXPONENTIALITY OF PROTEIN FOLDING KINETICS SERGEI F. CHEKMAREV 1

In many cases, when the folding process is complicated by the presence of on/off-pathway intermediates, proteins show nonexponential folding kinetics (e.g., [1-6]). To see how far the kinetics deviate from exponential (two-state) kinetics, or which of two kinetics deviates more, a quantitative measure of the nonexponentiality of first-passage-time distributions (FPTDs) is needed. For this purpose, the difference between the information (Shannon) entropies of the exponential distribution and of a given FPTD (∆S) can be employed [7]. It is essentially the Schrödinger-Brillouin [8,9] negative entropy (negentropy), except that the probability for the system to escape from a certain state at a given time is considered instead of the probability for the system to be found in a certain state. It is also closely related to the well-known Kullback-Leibler divergence [10], which is widely used in information theory. The utility of the negative entropy thus introduced is twofold [7]. First, a positive value of ∆S indicates that the FPTD is less random than the Poisson distribution, so that the process under consideration presumably involves some intermediates, which breaks the Poisson process. Secondly, ∆S has a straightforward interpretation in terms of transition state theory, so that it can be expressed in terms of the free energy and, correspondingly, be measured in units of kBT. In contrast to the other known measures of nonexponentiality of FPTDs, which are based on comparing the standard deviation and median of an FPTD with its mean value, ∆S gives an unambiguous estimate of the nonexponentiality of an FPTD. Potentially, the present approach has a broad range of application in the analysis of kinetic processes, because it is applicable to any problem to which the concepts of information entropy and transition state theory are relevant. The theoretical analysis is illustrated with simulation and experimental results on protein folding [1-6].
For a limited but not specially selected set of proteins, it was found that ∆S typically ranges from several hundredths of kBT (two-state kinetics) to several tenths of kBT (multistate kinetics). Knowledge of ∆S and of the free energy barrier between the unfolded and folded states of the protein allows the relative deviation of the folding process from two-state kinetics to be estimated. This work was supported in part by a grant from the Russian Foundation for Basic Research (No. 08-04-91104) and by the Civilian Research and Development Foundation (No. RUB2-2913-NO-07).

1 Institute of Thermophysics, SB RAS, and Novosibirsk State University, 630090 Novosibirsk, Russia, [email protected]
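As a rough numerical illustration of ∆S, the sketch below integrates the differential entropy of a first-passage-time density on a time grid and subtracts it from the entropy of an exponential distribution. It rests on assumptions that the abstract does not spell out (the exact definition is given in [7]): the reference exponential is taken to have the same mean first-passage time as the FPTD, entropies are in natural units, and the two test densities (a single-step exponential and a two-step sequential process) are hypothetical examples.

```python
import math

def trapezoid(xs, ys):
    """Trapezoidal-rule integral of the samples ys over the grid xs."""
    return sum(0.5 * (ys[i] + ys[i + 1]) * (xs[i + 1] - xs[i])
               for i in range(len(xs) - 1))

def shannon_entropy(times, density):
    """Differential entropy, minus the integral of p ln p dt (p = 0 contributes 0)."""
    integrand = [-p * math.log(p) if p > 0.0 else 0.0 for p in density]
    return trapezoid(times, integrand)

def delta_s(times, density):
    """Entropy of the exponential with the same mean minus the entropy of the FPTD.

    An exponential distribution with mean tau has entropy 1 + ln(tau).
    """
    tau = trapezoid(times, [t * p for t, p in zip(times, density)])
    return (1.0 + math.log(tau)) - shannon_entropy(times, density)

dt = 0.001
times = [i * dt for i in range(40001)]                      # grid from 0 to 40
two_state = [math.exp(-t) for t in times]                   # exponential, mean 1
sequential = [4.0 * t * math.exp(-2.0 * t) for t in times]  # two equal steps, mean 1

print(f"two-state:  dS = {delta_s(times, two_state):.4f}")
print(f"sequential: dS = {delta_s(times, sequential):.4f}")
```

Under these assumptions the exponential density gives ∆S close to zero, while the two-step density gives a positive value of roughly a tenth, which agrees qualitatively with the two ranges quoted above.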

1. J. Sabelko et al. (1999) Proc. Natl. Acad. Sci. U.S.A. 96: 6031-6036.
2. J. M. Sorenson and T. Head-Gordon (2002) Proteins: Struct., Funct., Genet. 46: 368-379.
3. H. Kaya and H. S. Chan (2003) Proteins: Struct., Funct., Genet. 52: 524-533.
4. J. M. Borreguero et al. (2004) Biophys. J. 87: 521-533.
5. S. F. Chekmarev et al. (2005) J. Phys. Chem. B 109: 5312-5330.
6. Yu. Palyanov et al. (2007) J. Phys. Chem. B 111: 2675-2687.
7. S. F. Chekmarev (2008) Phys. Rev. E 78: 066113.
8. E. Schrödinger (1945) What is Life? The Physical Aspect of the Living Cell (Cambridge University Press, Cambridge, England).
9. L. Brillouin (1953) J. Appl. Phys. 24: 1152-1163.
10. S. Kullback (1959) Information Theory and Statistics (Wiley, New York).

CHANGES IN THE CONTENT OF CYTOSINE, GUANINE, CpG AND CpNpG SEQUENCES OF rDNA IN LONG PHYLOGENETIC BRANCHES OF FLOWERING PLANTS HAVE A BACK-AND-FORTH NATURE VLADIMIR CHUPOV 1

Variations in nucleotide composition and in the frequency of CpG and CpNpG sequences in the clusters of nuclear ribosomal genes of taxa belonging to two long phylogenetic branches of Angiospermae have been analyzed. This region of eukaryotic genomes is the nucleolus organizer and functions in a separate compartment of the cell nucleus, which may make the processes occurring there rather specific. It was shown that, at the level of orders and superorders of flowering plants, the degree of evolutionary advancement of a taxon, as defined from morphological data, correlates positively with the content of dC, dG, CpG and CpNpG (Chupov et al., 2007; Chupov et al., 2008a, b). This contradicts the prevailing view of the general rules governing changes in nucleotide composition during evolution, which assumes suppression of dC and CpG. However, further studies demonstrated that the increased content of cytosine, guanine, CpG and CpNpG sequences is confined to specific mono- or oligotypic cryptaffine taxa, which are the links between large families. Within individual families of flowering plants another process dominates: the replacement of cytosine by thymine and, consequently, a reduction in dC, dG, CpG and CpNpG content. Thus the general character of changes in the nucleotide composition and in the dinucleotide profiles of the rDNA of flowering plants is back-and-forth, or wave-like.
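The quantities compared between taxa can be counted directly from a sequence. The Python sketch below is a minimal illustration rather than the procedure used in the cited papers: the function name, the choice to report raw counts alongside per-position frequencies, and the toy sequence are all assumptions.

```python
def cg_profile(sequence):
    """Count C, G, CpG and CpNpG occurrences in a DNA sequence (one strand)."""
    seq = sequence.upper()
    n = len(seq)
    counts = {
        "C": seq.count("C"),
        "G": seq.count("G"),
        # CpG: a C immediately followed by a G
        "CpG": sum(1 for i in range(n - 1) if seq[i] == "C" and seq[i + 1] == "G"),
        # CpNpG: a C followed by any nucleotide and then a G
        "CpNpG": sum(1 for i in range(n - 2) if seq[i] == "C" and seq[i + 2] == "G"),
    }
    # Frequencies per available starting position (an assumed normalization)
    freqs = {
        "C": counts["C"] / n,
        "G": counts["G"] / n,
        "CpG": counts["CpG"] / (n - 1),
        "CpNpG": counts["CpNpG"] / (n - 2),
    }
    return counts, freqs

# Toy sequence for illustration only, not real ITS data
counts, freqs = cg_profile("ACGTCAGCCG")
print(counts)
print(freqs)
```

Applied to ITS1, 5.8S and ITS2 sequences of different taxa, such profiles would give the dC, dG, CpG and CpNpG values whose trends are discussed above.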

1. V. S. Chupov, E. O. Punina, E. M. Machs, A. V. Rodionov (2007) Nucleotide Composition and CpG and CpNpG Content of ITS1, ITS2, and the 5.8S rRNA in Representatives of the Phylogenetic Branches Melanthiales–Liliales and Melanthiales–Asparagales (Angiospermae, Monocotyledones) Reflect the Specifics of Their Evolution, Mol. Biol., 41: 808–829.

2. V. S. Chupov, E. M. Machs, A. V. Rodionov (2008a) The Dinucleotide Composition of Ribosomal Spacer Regions ITS1-5.8S rDNA-ITS2 as an Indicator of Evolutionary Development and a Phylogenetic Marker of Monocotyledon Plants (Melanthiaceae, Iridaceae, Trilliaceae and Liliaceae). General Changes in the Dinucleotide Composition, Usp. Sovrem. Biol., 128: 482–497. (In Russ.)
3. V. S. Chupov, E. M. Machs, A. V. Rodionov (2008b) The Dinucleotide Composition of Ribosomal Spacer Regions ITS1-5.8S rDNA-ITS2 as an Indicator of Evolutionary Development and a Phylogenetic Marker of Monocotyledon Plants (Melanthiaceae, Iridaceae, Trilliaceae and Liliaceae). Dinucleotide Spectrum of Cryptaffine Taxa, Usp. Sovrem. Biol., 128: 482–497. (In Russ.)

1 Komarov Botanical Institute, Russian Academy of Sciences, Russian Federation, [email protected]

EVOLUTION OF SEQUENCES UNDER STRONG SELECTION: SPLICE SITES AND SHINE-DALGARNO BOXES STEPAN DENISOV 1, AKSINIYA GAYDUKOVA 1, ANDREY MIRONOV 1, ALEXANDER FAVOROV 2, RAMIL NURTDINOV 1, MIKHAIL GELFAND 3

Splice sites (in eukaryotes) and Shine-Dalgarno (SD) boxes (in prokaryotes) are highly conserved sequences. They play key roles in gene expression at the level of splicing (splice sites) and initiation of translation (SD boxes). Splice sites are located at the exon-intron boundaries of eukaryotic genes. The spliceosome binds directly to these sequences and then performs the splicing reactions [1, 2]. Shine-Dalgarno sequences are special motifs located upstream of the start codons of many prokaryotic genes. These sequences are essential for the initiation of translation: the 16S rRNA (part of the ribosome) binds to the SD sequence via standard Watson-Crick base pairing [3]. Hence, both Shine-Dalgarno sequences and splice sites experience strong selective pressure. Given the large number of such sequences (several thousand) in the available genomes, it is interesting to understand their evolution at the nucleotide level. The raw splice site data consisted of ~30000 triple alignments of orthologous donor splice sites, and the same number for acceptor splice sites, from the human, mouse and dog genomes. These data were extracted from the EDAS database ([4], http://edas.bioinf.fbb.msu.ru/). The SD sequences were identified using a rule involving a positional weight matrix and information about the position of the SD box relative to the start of translation in genomes of bacteria from the Enterobacteriaceae family. After all filtering procedures, the total number of SD sequences was 15260 (for all species). The aim was to study the pattern of evolution at each position and to compare the (calculated) strengths of ancestral and current sites. All evolutionary events were considered independently for each branch of the phylogenetic tree. For each position and for each branch of the tree, a substitution matrix was calculated using the parsimony and maximum likelihood methods. Properties of the substitution matrices were studied (matrix asymmetry and the ancestor, descendant and steady-state vectors of nucleotide frequencies). In many cases the steady-state vectors differ significantly from both the ancestor and the descendant vectors. For each pair of ancestor and descendant sites the difference in strength was calculated, in order to study changes in site strength and the (in)dependence of mutations within the sites. Alternative and constitutive splice sites were studied separately. We found that in many samples of splice sites (constitutive sites and different types of alternative ones) the weights of ancestral sites are slightly but statistically significantly larger than the weights of descendant sites. It was also shown that distinct positions within a site do not mutate independently: mutations tend to be compensated by other mutations, which keeps the weight of the site relatively stable.

1 Lomonosov Moscow State University, GSP-2, building 73, Leninskiye Gory, Moscow, 119992, [email protected]
2 Division of Oncology Biostatistics and Bioinformatics, The Sidney Kimmel Cancer Center at Johns Hopkins, 550 North Broadway, Suite 1103, Baltimore, MD 21205, USA
3 Institute for Information Transmission Problems, Russian Academy of Sciences, Bolshoi Karetny pereulok 19, Moscow, 127994, Russia, [email protected]
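The relation between a per-position substitution matrix and its steady-state vector can be sketched as follows. This is a simplified illustration, not the authors' pipeline: ancestral nucleotides are assumed to be already reconstructed (the parsimony or maximum likelihood step is not shown), the matrix is estimated by simple counting, the steady-state vector is found by repeated multiplication, and the substitution counts are invented.

```python
NUCS = "ACGT"

def substitution_matrix(pairs):
    """Row-stochastic matrix P[a][d]: fraction of ancestor a seen as descendant d."""
    counts = {a: {d: 0 for d in NUCS} for a in NUCS}
    for ancestor, descendant in pairs:
        counts[ancestor][descendant] += 1
    matrix = {}
    for a in NUCS:
        total = sum(counts[a].values())
        # Assumption: an ancestor nucleotide that was never observed stays unchanged
        matrix[a] = {d: (counts[a][d] / total if total else float(a == d))
                     for d in NUCS}
    return matrix

def steady_vector(matrix, iterations=1000):
    """Nucleotide frequencies left unchanged by the matrix (power iteration)."""
    freq = {n: 0.25 for n in NUCS}
    for _ in range(iterations):
        freq = {d: sum(freq[a] * matrix[a][d] for a in NUCS) for d in NUCS}
    return freq

# Invented (ancestor, descendant) counts for one site position on one branch
observed = {
    ("A", "A"): 88, ("A", "G"): 10, ("A", "C"): 1, ("A", "T"): 1,
    ("G", "G"): 78, ("G", "A"): 20, ("G", "C"): 1, ("G", "T"): 1,
    ("C", "C"): 93, ("C", "T"): 5, ("C", "A"): 1, ("C", "G"): 1,
    ("T", "T"): 88, ("T", "C"): 10, ("T", "A"): 1, ("T", "G"): 1,
}
pairs = [pair for pair, n in observed.items() for _ in range(n)]

matrix = substitution_matrix(pairs)
steady = steady_vector(matrix)
print({n: round(steady[n], 3) for n in NUCS})
```

Comparing `steady` with the ancestor and descendant frequency vectors shows whether a position is still drifting towards its equilibrium composition, which is the kind of difference reported in the abstract.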

1. J. Rogers and R. Wall (1980) A mechanism for RNA splicing, Proc Natl Acad Sci USA, 77(4): 1877-1879.
2. D.A. Wassarman and J.A. Steitz (1992) Interactions of small nuclear RNA's with precursor messenger RNA during in vitro splicing, Science, 257(5078): 1918-1925.
3. T. Nakamoto (2006) A unified view of the initiation of protein synthesis, Biochem Biophys Res Commun, 341(3): 675-678.
4. R.N. Nurtdinov et al. (2006) EDAS, databases of alternatively spliced human genes, Biofizika, 51(4): 589-592.

COMPUTER SIMULATION OF C.ELEGANS MUSCULAR SYSTEM AND NEURAL NETWORK ALEXANDER DIBERT 1, ANDREY PALYANOV 2

Keywords: C. elegans, simulation, neural network, muscle system, 3-D environment

Investigating the structure and functioning of the nervous system is one of the most interesting and complex problems. A functional computer model of a nervous system that reproduces the properties of the original with high accuracy would be evidence of a high level of understanding of the processes that take place in it. Reproducing the architecture of a real neural network seems a good way to start. The mammalian brain, and even the brains of simpler organisms, are too complex for the positions of all the neurons and the connections between them to be determined and simulated on contemporary computers. Moreover, although many different neuron models have been proposed, it is difficult to estimate how close to reality they are.

C. elegans, a free-living soil nematode, is one of the model organisms widely used and extensively studied by biologists. It is the only organism for which the neural network architecture, that is, the positions of its neurons and the connections between them, is almost completely known. Its nervous system consists of 302 neurons, over 5000 synapses and more than 2000 neuromuscular junctions, and these elements are invariant among individuals of the same sex. In view of this, simulation of the nervous system of C. elegans seems to be one of the most timely and necessary tasks. The small size of the neural network will allow calculations to be made in reasonable time on contemporary computers. Besides the model of the nervous system, it is very important to develop a model of the organism’s body, including muscles and receptors, in a three-dimensional physical environment, which will provide sensory input and feedback to the working nervous system and make it possible to observe the organism’s behavior.

The model of the nematode body consists of a set of mass points, passive spring connections, which simulate tissues, and active spring connections, which can receive input signals from motor neurons and simulate muscles. The three-dimensional model of the worm and the model of its physical environment, which includes the supporting force, the friction force, muscle tension, gravity and surface resistance, were implemented in C++, with the OpenGL library used for real-time visualization. The muscle system of the real organism consists of 4 longitudinal muscle groups. Each group consists of 23 or 24 muscles arranged in an interleaved pattern. Each muscle in our model corresponds to a muscle of the real worm.

We examined some simple neuron models based on summation of input signals with an adjustable activation threshold. Some neuron parameters are unknown, so we built a muscle contraction model that allows the C. elegans model to perform sinusoidal movement, and we use genetic algorithms based on this model, together with experimental data, to approximate adequate parameter values.

The result of our work is a virtual model of the C. elegans nematode consisting of a body frame, a muscle system and a nervous system, which are not merely separate fragments of C. elegans systems but a set of interconnected systems. This allows the neural network to receive signals from the environment and react to them. Visualization allows us to study the structure of the neural network, which is quite complex, by selecting any combination of neurons whose axons and dendrites are then displayed at the required scale and projection. It also lets us observe the behavior of the virtual model, so that we can judge the adequacy of a neuron model while adjusting its parameters.

1 Novosibirsk State University, Russian Federation, [email protected]
2 A.P. Ershov Institute of Informatics Systems, Russian Federation, [email protected]
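The two building blocks named above, a summation neuron with an activation threshold and a spring whose behavior depends on neural input, can be sketched in a few lines. The sketch is in Python for brevity and is not the authors' C++ code: the function names, the Hooke-law form of the spring force, and the rule that an activated muscle shortens its rest length by a fixed `contraction` factor are all assumptions made for illustration.

```python
import math

def neuron_fires(inputs, weights, threshold):
    """Summation neuron: active when the weighted input sum reaches the threshold."""
    return sum(x * w for x, w in zip(inputs, weights)) >= threshold

def spring_force(p, q, rest_length, stiffness):
    """Hooke force acting on mass point p from a spring that connects p and q."""
    delta = [b - a for a, b in zip(p, q)]
    length = math.sqrt(sum(d * d for d in delta))
    magnitude = stiffness * (length - rest_length)
    return [magnitude * d / length for d in delta]

def muscle_force(p, q, rest_length, stiffness, activated, contraction=0.7):
    """Active spring: activation shortens the rest length (hypothetical rule)."""
    target = rest_length * contraction if activated else rest_length
    return spring_force(p, q, target, stiffness)

# Two mass points one unit apart, joined by a muscle with rest length 1
p, q = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
active = neuron_fires(inputs=[1, 1, 0], weights=[0.5, 0.4, 0.9], threshold=0.8)

relaxed = muscle_force(p, q, rest_length=1.0, stiffness=10.0, activated=False)
contracted = muscle_force(p, q, rest_length=1.0, stiffness=10.0, activated=active)
print(active, relaxed, contracted)
```

In a full simulation these forces would be summed for every mass point, together with gravity, friction and the supporting force, and integrated over time; the thresholds and weights are the kind of parameters a genetic algorithm could tune.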

NEW PROFILES FOR TWO DOMAINS OF QUORUM-SENSING HISTIDINE KINASES FROM FIRMICUTES BACTERIA D.V. DIBROVA 1

Keywords: histidine kinase, annotation analysis

Introduction

Proteins of two-component systems (TCSs) are responsible for the majority of bacterial responses to changes in the environment [1]. Each TCS consists of at least two proteins: a sensor histidine kinase (HK) and a response regulator (RR). Signal transduction proceeds in three steps: (1) autophosphorylation of the HK at a His residue in response to an external stimulus; (2) transfer of the phosphate from the His of the HK to an Asp residue of the RR; (3) activation of the effector domain of the RR, which leads to the cellular response; the vast majority of RRs are transcription factors whose effector domains bind DNA. HKs are generally membrane proteins with varying numbers of transmembrane helices. A typical HK has three domains: (1) an N-terminal sensor domain, the most variable one; (2) a dimerization domain with the His residue that is phosphorylated during signal transduction; (3) a C-terminal kinase domain, which performs ATP hydrolysis. Several families of histidine kinases have been described, one of which is known to act in quorum-sensing systems of Firmicutes bacteria [2].

Results

Information about histidine kinases from two different sources was compared, and inconsistencies between them were detected. In particular, several proteins were annotated as histidine kinases in the RefSeq databank [3] but were not detected by any Pfam [4] or Prosite [5] profile. Some of them have previously been reported to act in quorum-sensing systems [2, 6]. These proteins were used to build two new profiles, one of which covers the presumed dimerization domain with an absolutely conserved His residue, while the other covers the unusual kinase domain. 82 proteins had hits with these profiles. They form a family of histidine kinases not found by the existing Pfam and Prosite profiles.

Four lines of indirect evidence indicate that these proteins are indeed HKs: (1) the presence of a conserved His residue and of a special region around it containing both conserved and highly variable residues; (2) the N-terminal region of these proteins contains several predicted transmembrane helices (usually 7); (3) the closest genomic neighbors of their genes are RR genes, which is typical of TCSs; (4) the kinase domain of these proteins lacks one of the four conserved motifs, in agreement with the literature.

1 Moscow State University, Moscow, Russia, [email protected]
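One of the checks mentioned above, the presence of an absolutely conserved His residue, is easy to express on a multiple alignment. The Python sketch below is only an illustration: the function name and the toy alignment are hypothetical and are not taken from the profiles described in the abstract.

```python
def conserved_columns(alignment, residue="H"):
    """Indices of alignment columns in which every sequence carries `residue`."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("aligned sequences must have equal length")
    return [i for i in range(length)
            if all(row[i] == residue for row in alignment)]

# Toy alignment for illustration only ('-' marks a gap)
toy_alignment = [
    "MKLHEIRNP",
    "MRLHDLKNP",
    "-KIHELRSP",
]

his_columns = conserved_columns(toy_alignment)
print(his_columns)
```

A column reported here is a candidate phosphorylation site only in combination with the other evidence listed above, such as the sequence context around the His residue and the neighboring RR genes.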

1. Ann M. Stock, Victoria L. Robinson, Paul N. Goudreau (2000) Two-Component Signal Transduction, Annu. Rev. Biochem., 69: 183-215.
2. Richard P. Novick, Edward Geisinger (2008) Quorum Sensing in Staphylococci, Annu. Rev. Genet., 42: 541-564.
3. Kim D. Pruitt et al. (2007) NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins, Nucleic Acids Research, 35: D61-D65.
4. R.D. Finn et al. (2008) The Pfam protein families database, Nucleic Acids Research, 36: D281-D288.
5. N. Hulo et al. (2008) The 20 years of PROSITE, Nucleic Acids Research, 36: D245-D249.
6. Regine Hakenbeck (2000) Transformation in Streptococcus pneumoniae: mosaic genes and the regulation of competence, Res. Microbiol., 151: 453-456.

MULTISCALE MODELING AND DESIGN OF BIOLOGICAL MOLECULES NIKOLAY V. DOKHOLYAN 1

Some of the emerging goals in modern medicine are to uncover the molecular origins of human diseases and, ultimately, to contribute to the development of new therapeutic strategies that rationally abate disease. Of immediate interest are the roles of molecular structure and dynamics in certain cellular processes leading to human diseases, and the ability to manipulate these processes rationally. Despite recent revolutionary advances in experimental methodologies, we are still limited in our ability to sample and decipher the structural and dynamic aspects of single molecules that are critical for their biological function. Thus, there is a crucial need for new and unorthodox techniques to uncover the fundamentals of molecular structure and interactions. We developed a multiscale approach based on tailoring simplified protein models to the systems of interest. Such an approach significantly extends the length and time scales accessible in studies of complex biological systems. I will describe several recent studies that demonstrate the predictive power of simplified protein models within a hypothesis-driven modeling approach that uses rapid Discrete Molecular Dynamics (DMD) simulations.

1 Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, NC, United States, dokh@med.unc.edu

PREDICTION OF FLEXIBILITY AND ABILITY TO HYDROGEN-DEUTERIUM EXCHANGE FOR PROTEIN CHAIN USING AMINO ACID SEQUENCE NIKITA DOVIDCHENKO 1, ALEXEY SURIN 2, SERGIY GARBUZYNSKIY 3, MICHAIL LOBANOV 4, XANA GALZITSKAYA 5

Keywords: hydrogen-deuterium exchange, secondary structure, hydrogen bond, B-factor, regions with irregular secondary structure

Since flexible protein regions frequently play an important role in biological functioning, it is not surprising that the structural explanation of these dynamic properties is currently a very active area of research. Some structural aspects of local flexibility are outlined in this work. We investigated the possibility of predicting the protection of the main polypeptide chain from hydrogen-deuterium exchange. Exchange data for 14 proteins with published rates of native-state out-exchange were compiled. Different structural parameters reflecting the flexibility of amino acid residues and their amide groups were analyzed to answer the question of whether these parameters can be used to predict the protection of amino acid residues from hydrogen-deuterium exchange using only the amino acid sequence. A method for such prediction has been developed. For 70% of the residues considered in this work we can correctly predict their status, that is, whether or not they are protected from hydrogen exchange. An additional goal of our study is to assess whether properties inferred with the bioinformatics approach are readily applicable to predicting the behavior of proteins in solution. Mass spectrometry analysis of hydrogen-deuterium exchange was carried out for five proteins and compared with the predictions of our method.

1 Institute of Protein Research RAS, Russian Federation, [email protected]
2 Institute of Protein Research RAS, Russian Federation, [email protected]
3 Institute of Protein Research RAS, Russian Federation, [email protected]
4 Institute of Protein Research RAS, Russian Federation, [email protected]
5 Institute of Protein Research RAS, Russian Federation, [email protected]

MATHEMATICAL MODELING OF STEADY-STATE METABOLISM IN SACCHAROMYCES CEREVISIAE MITOCHONDRIA RENATA A. ZVYAGILSKAYA 1, NAFISA N. NAZIPOVA 2, ALEXSANDER A. ALEXSANDROV 3, LYUSIEN N. DROZDOV-TIKHOMIROV 3

Steady-state metabolism of mitochondria from Saccharomyces cerevisiae cells growing under aerobic conditions with sucrose as the sole carbon source is described here by a mathematical model built with the previously developed method of steady-state metabolic flux balance (the SMFB method) and the purpose-built computer program package FLUX II. In the SMFB method, the steady-state rates of the metabolic reactions are taken as variables. Each equation of the SMFB method expresses the balance between incoming and outgoing fluxes for one of the metabolites. The model can therefore be written as a set of linear algebraic equations in which the left-hand sides are formed by the stoichiometric matrix of the reaction system, while the right-hand sides are the resulting metabolic flux values for each metabolite of the system under consideration. The constructed model makes it possible to calculate the optimal distribution of reaction rates in the mitochondrial metabolic network, provided that the monomer composition of the mitochondria-forming biopolymers (proteins, DNA, RNAs, membrane lipoproteins), the list of metabolites entering the mitochondria and the ATP efflux from the mitochondria are given. Mitochondria are assumed to be a self-reproducing system that divides synchronously with the cell. Importantly, the calculated levels of oxygen consumption and CO2 export were in good agreement with experimentally obtained results, which supports the validity of the SMFB method for quantifying cell metabolism.
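The structure of such a balance system can be illustrated on a toy network. The Python sketch below is not the FLUX II package and does not reproduce the mitochondrial model: the three reactions, the right-hand sides and the solver are hypothetical, and the toy system is square so that it has a unique solution, whereas a real network is generally underdetermined and requires an optimization step.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

# Hypothetical reactions: r1: A -> B, r2: B -> 2 C, r3: C -> (export)
# Rows are metabolites A, B, C; columns are the reaction rates v1, v2, v3.
stoichiometry = [
    [-1.0,  0.0,  0.0],   # A is consumed by r1
    [ 1.0, -1.0,  0.0],   # B is produced by r1 and consumed by r2
    [ 0.0,  2.0, -1.0],   # C is produced by r2 and exported by r3
]
# Right-hand sides: net flux of each metabolite (A is supplied at 5 units,
# 1 unit of B is withdrawn for biosynthesis, C does not accumulate).
net_flux = [-5.0, 1.0, 0.0]

rates = solve_linear(stoichiometry, net_flux)
print(rates)
```

Each row reproduces one balance equation of the kind described above: the stoichiometric coefficients multiply the unknown reaction rates on the left, and the net flux of the metabolite stands on the right.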

1 Moscow, A.N. Bach Institute of Biochemistry, Russian Academy of Sciences, Russian Federation
2 Pushchino, Institute of Mathematical Problems in Biology, Russian Academy of Sciences, Russian Federation
3 Moscow, Institute of Molecular Genetics, Russian Academy of Sciences, Russian Federation, [email protected]

STRUCTURAL TREES AND CLASSIFICATION OF PROTEINS ALEXANDER EFIMOV 1

The structural tree for proteins is a scheme that includes all the intermediate and final three-dimensional structures that can be obtained by stepwise addition of secondary structural elements to the root (starting) structure. Secondary structural elements are added to the growing structures in accordance with a set of rules inferred from known principles of protein structure. A structural motif with a unique overall fold is taken as the root structure of the tree. Possible folding pathways are shown by lines that connect all the structures to one another, giving one structural tree. Because of their structural similarity, proteins and domains included in one structural tree can be classified into one structural class or superfamily. Proteins and domains found within the branches of a structural tree can be grouped into subclasses or subfamilies. Levels of structural similarity between different proteins can easily be seen by visual inspection. Within one branch, protein structures at a higher position in the tree include the structures located lower. Proteins and domains of different branches share, as their common fold, the structure located at the branching point. This classification is based on the similarity of overall folds and on modelled folding pathways of proteins and domains. Amino acid sequences, functions and homology of proteins are not taken into account, so it differs from other known classification systems. To date, structural trees have been constructed for nine large protein superfamilies: beta-proteins containing abcd-units, 3-beta-corners, S-like beta-sheets; two-layer (alpha+beta)-proteins containing abCd-units; three-layer alpha/beta-proteins containing five- and seven-segment alpha/beta-motifs; alpha-proteins containing alpha-alpha-corners; proteins containing phi-motifs; and proteins containing combinations of beta-alpha-beta-units and psi-motifs.
Some updated structural trees and the corresponding databases are now available at http://strees.protres.ru/.
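The relations described above (stepwise growth from a root, pathways as lines in the tree, and the branching point as the common fold of two structures) map naturally onto a rooted tree. The Python sketch below is a hypothetical data-structure illustration, not the format of the databases at the address given above; the class, method names and structure labels are invented.

```python
class StructuralTree:
    """Rooted tree in which each structure is its parent plus one added element."""

    def __init__(self, root):
        self.parent = {root: None}

    def add(self, structure, parent):
        """Register a structure obtained by adding one element to `parent`."""
        if parent not in self.parent:
            raise KeyError(f"unknown parent structure: {parent}")
        self.parent[structure] = parent

    def pathway(self, structure):
        """Modelled pathway from the root structure to `structure`."""
        path = []
        while structure is not None:
            path.append(structure)
            structure = self.parent[structure]
        return path[::-1]

    def common_fold(self, a, b):
        """Structure at the branching point of two structures."""
        on_path_a = set(self.pathway(a))
        for structure in reversed(self.pathway(b)):
            if structure in on_path_a:
                return structure

# Invented labels: a root motif and structures grown from it
tree = StructuralTree("abcd-unit")
tree.add("abcd-unit + strand e", "abcd-unit")
tree.add("abcd-unit + strand e + helix", "abcd-unit + strand e")
tree.add("abcd-unit + strand f", "abcd-unit")

path = tree.pathway("abcd-unit + strand e + helix")
shared = tree.common_fold("abcd-unit + strand e + helix", "abcd-unit + strand f")
print(path, shared)
```

Structures on one pathway belong to the same branch, and each structure on the pathway contains all those before it, which mirrors the statement that higher structures include the lower ones.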

1 Institute of Protein Research, Russian Academy of Sciences, Russian Federation, [email protected]

INVESTIGATION OF CORRELATION BETWEEN DOMAIN BORDERS AND CORRESPONDING EXON BORDERS IN THE NONREDUNDANT SET OF HUMAN PROTEINS V.A. EPANESHNIKOV 1, A.A. ANASHKINA 1, E.N. KUZNETZOV 2, V.G. TUMANYAN 1

Keywords: Protein structural domain, exon, domain/exon shuffling

Gilbert [1] suggested that exons can be shuffled and that this is one way in which new genes are formed. Novel protein functions can also be produced by rearranging exons of existing genes. In this scheme introns may be treated as hot spots for genetic recombination [2]. Thus, one or several exons correspond to a protein module or domain. Stoltzfus et al. [3] point out that no correlation between intron positions and protein modules is observed for ancient proteins. However, other authors show that intron positions in ancient proteins correlate with the boundaries of compact protein modules [4]. Work in this field develops hand in hand with the sequencing of more and more animal genomes. Kaessmann et al. [5] showed that domains flanked by phase 1 introns have expanded prominently in the human genome owing to domain shuffling. Another study [6] presented statistical evidence from nine eukaryotic genomes that protein domain borders correlate strongly with the exon-intron structure of genes. However, in these studies a protein domain was defined as a functional unit, which in general does not coincide with a structural domain. Thus, the literature data do not allow a final conclusion about the correlation between domain and exon borders. Our task is to determine whether there is a statistically significant correspondence between the borders of structural protein domains and the exon borders of the corresponding genes. Our investigation consists of a detailed comparison of the exon and domain structure for a nonredundant set of human proteins. This nonredundant set includes 632 protein chains. For each protein chain in this set, the corresponding transcript and its exon annotation were established using the PDB identifier (http://www.rcsb.org, http://www.ncbi.nlm.nih.gov). After the protein and transcript sequences are aligned with the program fasta3 [7], the domain and exon patterns are compared.
A special mathematical criterion, the measure of difference, was developed. The measure of difference is equal to the sum of the distances from the domain borders to the nearest exon borders of the aligned transcript. The measure of difference was calculated for each domain and its corresponding exons. Three domain databases, CATH [8], SCOP [9] and Dali [10], were taken into account. A distribution of the measure of difference was constructed for each database. These distributions are quite similar. To estimate the statistical significance of the observed distributions, a theoretical random model was constructed. Comparison of the observed and random distributions leads to the conclusion that they do indeed differ. Additionally, a threshold value was determined that helps to separate the coinciding and noncoinciding regions. After this, those types of domains that are characterized by a correlation of exon and domain borders were selected. Intron phases were computed both for domains that coincide with exon borders and for those that do not. Interestingly, domains of the former type show a preference for the 1-1 phase combination, in contrast to noncoinciding domains, which show no excess of the 1-1 phase. For coinciding domains, this result supports the shuffling mechanism of exon expansion and new gene formation throughout the genome.

1 Engelhardt Institute of Molecular Biology RAS, [email protected], [email protected]
2 Institute of Control Problems RAS, [email protected]
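The measure of difference defined above is simple to compute once domain and exon borders are expressed in the same coordinates. The Python sketch below is an illustration under stated assumptions: borders are given as residue positions of the aligned protein, exon borders are converted from coding-sequence nucleotide offsets by integer division by 3, intron phase is taken as that offset modulo 3, and all numbers are invented.

```python
def measure_of_difference(domain_borders, exon_borders):
    """Sum of distances from each domain border to its nearest exon border."""
    return sum(min(abs(d - e) for e in exon_borders) for d in domain_borders)

def intron_phase(cds_offset):
    """Phase of an intron that follows `cds_offset` coding nucleotides (0, 1 or 2)."""
    return cds_offset % 3

# Invented example: two exon borders, given as coding-sequence offsets
exon_borders_nt = [115, 286]
exon_borders_res = [nt // 3 for nt in exon_borders_nt]   # residue coordinates

domain = (40, 93)   # first and last residue of a hypothetical structural domain

score = measure_of_difference(domain, exon_borders_res)
phases = [intron_phase(nt) for nt in exon_borders_nt]
print(score, phases)
```

A small score relative to the chosen threshold would place this domain among the coinciding ones, and the pair of phases corresponds to the 1-1 combination discussed above.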

1. Gilbert, W., Why genes in pieces? Nature, 1978. 271(5645): p. 501.
2. Gilbert, W., S.J. de Souza, and M. Long, Origin of genes. Proc Natl Acad Sci U S A, 1997. 94(15): p. 7698-703.
3. Stoltzfus, A., et al., Testing the exon theory of genes: the evidence from protein structure. Science, 1994. 265(5169): p. 202-7.
4. de Souza, S.J., et al., Intron positions correlate with module boundaries in ancient proteins. Proc Natl Acad Sci U S A, 1996. 93(25): p. 14632-6.
5. Kaessmann, H., et al., Signatures of domain shuffling in the human genome. Genome Res, 2002. 12(11): p. 1642-50.
6. Liu, M., et al., Significant expansion of exon-bordering protein domains during animal proteome evolution. Nucleic Acids Res, 2005. 33(1): p. 95-105.
7. Pearson, W.R., Empirical statistical estimates for sequence similarity searches. J Mol Biol, 1998. 276(1): p. 71-84.
8. Orengo, C.A., et al., CATH--a hierarchic classification of protein domain structures. Structure, 1997. 5(8): p. 1093-108.
9. Murzin, A.G., et al., SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol, 1995. 247(4): p. 536-40.
10. Alexandrov, N. and I. Shindyalov, PDP: protein domain parser. Bioinformatics, 2003. 19(3): p. 429-30.


EVOLUTION OF STRUCTURE AND SEQUENCE IN ALTERNATIVELY SPLICED DROSOPHILA GENES DMITRY MALKO 1, EKATERINA ERMAKOVA 2, MIKHAIL GELFAND 3

Keywords: exon-intron structure, nucleotide substitutions, alternative splicing, Drosophila

BACKGROUND. Two major mechanisms of evolution of genomic sequences are shuffling of genomic fragments and fine-tuning of coding and cis-regulatory regions via nucleotide substitutions. Alternative splicing provides extra freedom for both mechanisms [1]. Evolution of exon-intron structure and alternative splicing in insects is poorly studied as compared to vertebrates [2-4]. We consider the evolutionary diversity of the Drosophila genus at the level of exon-intron structure and at the level of nucleotide substitutions. We study gain and loss of exonic, intronic, and alternatively spliced regions within the same framework, considering nucleotide substitutions in different types of alternative coding regions separately.

RESULTS. The patterns of evolution in terms of gain and loss of introns, constitutive exons, and alternatively spliced gene segments, as well as substitution rates in constitutively and alternatively spliced coding regions, were considered for eleven Drosophila species (D. melanogaster, D. sechellia, D. yakuba, D. erecta, D. ananassae, D. pseudoobscura, D. persimilis, D. willistoni, D. mojavensis, D. virilis, D. grimshawi). Alternative segments are gained and lost at a higher rate than introns and constitutive exons, and introns are gained at a higher rate than constitutive exons. The patterns of structural rearrangements in the pairs of recently diverged species D. yakuba ↔ D. erecta and D. pseudoobscura ↔ D. persimilis differ dramatically, despite similar rates of nucleotide substitutions. Extremely high rates of structural rearrangements were observed in D. persimilis. During the evolutionary periods when the rate of intron loss was greater than the rate of intron gain (recent evolution of D. ananassae and D. willistoni, and evolution in the pseudoobscura subgroup before the D. pseudoobscura ↔ D. persimilis split), the rates of gain and loss of coding regions were extremely low.
Alternative regions contain more nonsynonymous substitutions than constitutive regions of spliced genes. Intronless genes contain more nucleotide substitutions than constitutively spliced regions of multiexonic genes. The substitution rates in alternative regions of different types vary dramatically. In particular, cassette exons have nearly twice as many nucleotide substitutions as mutually exclusive exons. The substitution rates in duplicated and non-duplicated mutually exclusive exons also differ. 5′-terminal exon extensions due to acceptor sites have the highest rate of nonsynonymous substitutions, while retained introns have the highest rate of synonymous substitutions.

CONCLUSIONS. Alternatively spliced regions are hotspots of molecular evolution both at the level of structural rearrangements and at the level of nucleotide substitutions. This demonstrates that alternative splicing is one of the major evolutionary mechanisms generating protein diversity. The rates of structural rearrangements in close species are more variable than the rates of nucleotide substitutions. Substitution rates in alternative regions of different types vary. This variation may be caused by differences in the density of cis-regulatory elements in alternative regions of different types. In particular, our results show that three types of alternative exons (cassette exons, duplicated mutually exclusive exons, and non-duplicated mutually exclusive exons) should be considered separately in comparative genomic studies.

1 State Scientific Center "GosNIIGenetika", Russian Federation, [email protected]
2 A.A. Kharkevich Institute for Information Transmission Problems, Russian Academy of Sciences, Russian Federation, [email protected]
3 A.A. Kharkevich Institute for Information Transmission Problems, Russian Academy of Sciences, Russian Federation, [email protected]

1. Modrek B. and Lee C.J. Alternative splicing in the human, mouse and rat genomes is associated with an increased frequency of exon creation and/or loss. Nat Genet 34 (2003).
2. Malko D.B., Makeev V.J., Mironov A.A., and Gelfand M.S. Evolution of the exon-intron structure and alternative splicing in fruit flies and malarial mosquito genomes. Genome Res 16 (2006).
3. Ermakova E.O., Mal'ko D.B., and Gel'fand M.S. Different patterns of evolution in alternative and constitutive coding regions of Drosophila alternatively spliced genes. Biofizika 51 (2006).
4. Coulombe-Huntington J. and Majewski J. Intron loss and gain in Drosophila. Mol Biol Evol 24 (2007).


SECONDARY STRUCTURE OF COPOLYMER CONSISTING OF AMPHIPHILIC AND HYDROPHILIC MONOMER UNITS: IMPACT OF THE RANGE OF THE INTERACTION POTENTIAL VITALY ERMILOV 1, VALENTINA VASILEVSKAYA 2, ALEXEI KHOKHLOV 3

Keywords: amphiphilic copolymers, simple model of polypeptide chain, HP model

The dependence of the coil-globule transition of a copolymer composed of amphiphilic and hydrophilic monomers on the range of the interaction potential has been studied via molecular dynamics simulations. It has been shown that the structure of the globules formed in such systems depends substantially on the range of the interaction potential. In the case of a long-range potential, the globule resulting from hydrophobically driven collapse has a blob structure; if the potential is short-ranged, a quasi-helical globule is formed, in which the backbone of the chain forms helical turns whose direction of twisting can vary from turn to turn. The coil-globule transition in such systems passes through a stage in which a necklace conformation consisting of quasi-helical micelle-beads forms. For long macromolecules, the size of the globules depends linearly on the degree of polymerization.

1 A.N. Nesmeyanov Institute of Organoelement Compounds, Russian Academy of Sciences, Russian Federation, [email protected]
2 A.N. Nesmeyanov Institute of Organoelement Compounds, Russian Academy of Sciences, Russian Federation, [email protected]
3 Lomonosov Moscow State University, Russian Federation, [email protected]

MUTUAL ORIENTATION OF Qy TRANSITION DIPOLES OF SUBANTENNAE PIGMENTS AS A STRUCTURAL FACTOR OPTIMIZING THE PHOTOSYNTHETIC ANTENNA FUNCTION. THEORETICAL AND EXPERIMENTAL STUDIES ANASTASIYA ZOBOVA 1, ANDREY YAKOVLEV 1, VLADIMIR NOVODEREZHKIN 1, ALEXANDRA TAISOVA 1, ZOYA FETISOVA 1

Keywords: structure optimization, functional criteria, photosynthesis, light-harvesting antenna, model calculations

This work continues a series of our investigations on efficient strategies of functioning of natural light-harvesting antennae, initiated by our concept of rigid optimization of the photosynthetic apparatus structure by a functional criterion. This work deals with the problem of finding the optimal orientation of the Qy transition dipole moments of light-harvesting bacteriochlorophyll (BChl) a molecules of the subantenna B798 (absorption maximum at 798 nm) in the green bacterium Chloroflexus aurantiacus [1]. We used infinite 3D antennae, an elementary fragment of which is a 1D unit (parallel to the Z axis) containing molecules of three subantennae, B740, B798 and B808 (Fig.1).

B798 is the acceptor for the oligomeric BChl c B740 subantenna and the donor for the monomeric BChl a B808 subantenna. Orientations of the Qy transition dipoles are known only for B740 and B866 [Fig.1]. Using the probability matrix approach, we computed the time (t, a.u.) of excitation energy transfer (EET) from B740 to B808 as a function of α, Δ ≡ α-β and φ, where φ determines the sought orientation of the B798 Qy dipoles (Δ ∈ [0˚, 180˚]; α ∈ [0˚, 90˚); φ ∈ (–90˚, +90˚)). Each set of curves t(α,Δ,φ) was computed for R12/R23 = 0.5, 1.0 and 2.0 at R12 + R23 = const (Rij is the distance between dipoles i and j). For each R12/R23 value, one can find stable minima of the curves t(α,Δ,φ) near tmin(φopt), which are much lower than the times (tr) for randomly oriented dipoles: η ≡ tr/tmin > 1.2. It was found that (1) at R12/R23 = 2, φopt ∈ [0˚ ± 5˚] at α < 30˚, Δ ≤ 45˚; (2) at R12/R23 = 1, φopt ∈ [±(20–32)˚] at α ≤ 30˚, Δ ≤ 60˚; (3) at R12/R23 = 0.5, φopt ∈ [±(37–70)˚] at any Δ and 0˚ ≤ α ≤ 75˚. Experiments in vivo revealed that the second stage is limiting in EET B740→B798→B808, which corresponds to the case of R12/R23 = 0.5. We assumed that in a single chlorosome, the B798 subantenna is formed by ordered chains of BChl a protein complexes with either (i) fixed orientations of the BChl a dipoles, according to the Table, for any Δ and 0˚ ≤ α ≤ 75˚ (model No.1), or (ii) random orientation of the BChl a dipoles around the normal to the membrane, with a deviation from the membrane plane by angles within the range φopt = 37–70˚. The second model implies a fixed angle that meets the requirement η ≡ tr/topt > 1.2 at any Δ and α considered; such an angle was found to lie within the range φopt = 47–57˚ (model No.2). Note that this conclusion was drawn for a single chlorosome.

Table. η and φopt (degrees) for different Δ (degrees) and α.

Δ      α = 0º        α = 30º       α = 45º       α = 75º
       η    φopt     η    φopt     η    φopt     η    φopt
0      3.4  -56      3.3  -55      3.2  -53      2.1  -37
30     3.2  -59      3.1  -57      3.0  -55      1.9  -39
45     2.9  -61      2.9  -59      2.8  -57      1.8  -41
90: 2.2, ±70; 2.0, ±65; >1.2, ±52

Using femtosecond difference absorption spectroscopy, we showed that the second model is confirmed. Room temperature isotropic and anisotropic pump-probe spectra were measured on the femtosecond through picosecond time scales for the chlorosome BChl a Qy band upon direct excitation of this band. The monomeric nature of B798 BChl a was manifested. The anisotropy in the B798 band decayed from r = 0.4 (t = 0) to r = 0.1 (steady state). A simulation assuming the random orientation of chlorosomes realized in our experiments confirmed the proposed model No.2. Fig.2 shows the theoretical dependence of the steady-state anisotropy parameter r on the sought angle φ. The angle corresponding to the experimental value r = 0.1 equals 54.7°, which is close to φopt = 47–57˚ calculated earlier.

1 M.V. Lomonosov Moscow State University, Moscow, 119992, Russian Federation, [email protected]; [email protected]

Fig. 2. Theoretical dependence of the steady-state anisotropy r on the angle φ (degrees): $r = 0.1\,(3\cos^2\varphi - 2)^2$; r = 0.1 at φ = 54.7˚.
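The step from the measured anisotropy to the angle can be checked numerically. The sketch below assumes that the relation printed in Fig. 2 reads r = 0.1(3cos²φ − 2)², which is a reading of the figure label rather than a formula stated in the text; the function names are hypothetical.

```python
import math


def steady_state_anisotropy(phi_deg):
    """r(phi) under the assumed relation r = 0.1 * (3*cos(phi)**2 - 2)**2."""
    c2 = math.cos(math.radians(phi_deg)) ** 2
    return 0.1 * (3.0 * c2 - 2.0) ** 2


def angles_for_anisotropy(r):
    """All angles in [0, 90] degrees that give the anisotropy r."""
    root = math.sqrt(r / 0.1)
    solutions = set()
    for sign in (1.0, -1.0):
        # Solve 3*cos^2(phi) - 2 = +/- root for cos^2(phi).
        c2 = (2.0 + sign * root) / 3.0
        if -1e-12 <= c2 <= 1.0 + 1e-12:
            c2 = min(max(c2, 0.0), 1.0)
            solutions.add(round(math.degrees(math.acos(math.sqrt(c2))), 4))
    return sorted(solutions)


angles = angles_for_anisotropy(0.1)
```

Under this assumed relation, r = 0.1 is reached both at φ = 0˚ and at the angle where cos²φ = 1/3, which is about 54.7˚; the latter is the value quoted in the text.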

The work was supported by the Russian Foundation for Basic Research (Grant 08-04-01587a).

1. A.V. Zobova, A.G. Yakovlev, A.S. Taisova, Z.G. Fetisova (2009) Search for an optimal orientational ordering of Qy transition dipoles of subantennae molecules in superantenna of photosynthetic green bacteria. Model calculations, Molec. Biol. (Engl.transl.), 43: 420-443.


ORIENTATIONAL FACTORS FOR FÖRSTER'S RESONANCE EXCITATION ENERGY TRANSFER V.S. DUJENKO 1, A.V. ZOBOVA 1, Z.G. FETISOVA 1

Keywords: excitation energy transfer, photosynthesis, dipoles orientation

According to Förster's theory [1], the probability of electronic excitation energy being transferred from the donor molecule (1) to the acceptor molecule (2) strongly depends on the orientational factor $k^2$, defined by the mutual orientation of the donor ($\vec{\mu}_1$) and acceptor ($\vec{\mu}_2$) Qy transition dipoles (see Figure). In this work, we derive the general formula for calculating the orientational factor $k^2$ in the spherical coordinate system and calculate the $k^2$ values for some disordered and partially ordered systems. Without loss of generality, one can consider the vectors $\vec{\mu}_1$, $\vec{\mu}_2$ and $\vec{r}$ to be normalized. The Figure displays the donor-acceptor system concerned.

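Figure. The donor-acceptor system: the transition dipoles $\vec{\mu}_1$ and $\vec{\mu}_2$ of molecules 1 and 2 in the coordinate system (x, y, z), with the vector $\vec{r}$ along the y axis and the angles $\alpha$, $\varphi$, $\theta_1$, $\theta_2$, $\theta_{12}$.

$$\vec{\mu}_1 = (\cos\varphi_1\cos\alpha_1,\ \cos\varphi_1\sin\alpha_1,\ \sin\varphi_1),\qquad \vec{\mu}_2 = (\cos\varphi_2\cos\alpha_2,\ \cos\varphi_2\sin\alpha_2,\ \sin\varphi_2),\qquad \vec{r} = (0, 1, 0)$$

$$\cos\theta_1 = (\vec{\mu}_1, \vec{r}),\qquad \cos\theta_2 = (\vec{\mu}_2, \vec{r}),\qquad \cos\theta_{12} = (\vec{\mu}_1, \vec{\mu}_2)$$

$$-\pi \le \alpha \le \pi,\qquad -\frac{\pi}{2} \le \varphi \le \frac{\pi}{2},\qquad |\vec{r}| = |\vec{\mu}_1| = |\vec{\mu}_2| = 1$$

$$k^2 = (\cos\theta_{12} - 3\cos\theta_1\cos\theta_2)^2$$

(Förster's expression for $k^2(\theta_{12}, \theta_1, \theta_2)$ [1])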

General Results

1. General formula for the orientational factor $k^2$ (in the spherical coordinate system)

$$k^2 = c^2(\alpha_1,\alpha_2)\cos^2\varphi_1\cos^2\varphi_2 + \sin^2\varphi_1\sin^2\varphi_2 + \frac{1}{2}\,c(\alpha_1,\alpha_2)\sin 2\varphi_1\sin 2\varphi_2,$$

where $c(\alpha_1,\alpha_2) = \cos\alpha_1\cos\alpha_2 - 2\sin\alpha_1\sin\alpha_2$.

2. The averaging of $k^2(\alpha_1,\alpha_2,\varphi_1,\varphi_2)$

(a) The random variables $\alpha_1$ and $\alpha_2$ are independent and uniformly distributed within the variation interval $[-\pi, \pi]$. Then

$$\overline{k^2}(\varphi_1,\varphi_2) = \frac{1}{4}\left(5\cos^2\varphi_1\cos^2\varphi_2 + 4\sin^2\varphi_1\sin^2\varphi_2\right).$$

If $\varphi_1 = \varphi_2 = 0$, then $\overline{k^2}(0, 0) = \frac{5}{4}$ (in-plane chaos).

All the following cases ((b), (c), and (d)) imply that all the variables $(\alpha_1,\alpha_2,\varphi_1,\varphi_2)$ are random and independent; the variables $\alpha_1$ and $\alpha_2$ are uniformly distributed within the variation interval $[-\pi, \pi]$.

(b) If $\varphi_1$ and $\varphi_2$ are uniformly distributed within the variation interval $\left[-\frac{\pi}{2}, \frac{\pi}{2}\right]$, then

$$\overline{k^2} = \left(\frac{3}{4}\right)^2 = \frac{9}{16},$$

instead of the well-known $\overline{k^2} = \frac{2}{3}$ (spatial chaos).

(c) If $\varphi_1$ and $\varphi_2$ are uniformly distributed within the variation interval $[-\Delta, \Delta]$, where $0 < \Delta \le \frac{\pi}{2}$, then

$$\overline{k^2}(\Delta) = \frac{1}{16}\left[9 + 2\,\frac{\sin 2\Delta}{2\Delta} + 9\left(\frac{\sin 2\Delta}{2\Delta}\right)^2\right].$$

Thus, $\overline{k^2}(\Delta) \to \frac{5}{4}$ when $\Delta \to 0$ (in-plane chaos); $\overline{k^2}(\Delta) = \frac{9}{16}$ when $\Delta = \frac{\pi}{2}$ (spatial chaos).

(d) If the densities of $\varphi_1$ and $\varphi_2$ are equal to $\frac{1}{2\sin\Delta}\cos\varphi_i$ (where i = 1, 2) within the variation interval $[-\Delta, \Delta]$, where $0 \le \Delta \le \frac{\pi}{2}$, then

$$\overline{k^2}(\Delta) = \frac{1}{4}\left[5 - \frac{10}{3}\sin^2\Delta + \sin^4\Delta\right],$$

and, as a consequence, $\overline{k^2}\left(\frac{\pi}{2}\right) = \frac{2}{3}$ and $\overline{k^2}(0) = \frac{5}{4}$.

1 M.V. Lomonosov Moscow State University, Moscow, 119992, Russian Federation, [email protected], [email protected], [email protected]
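The averaged values quoted above (5/4 for in-plane chaos, 9/16 for angles φ uniform on [−π/2, π/2], and 2/3 for the cosine-weighted density) can be cross-checked with a small Monte Carlo sketch. It evaluates Förster's expression directly from the unit vectors rather than from the derived formula; the function names, the sample size and the seed are illustrative assumptions.

```python
import math
import random


def kappa2(alpha1, phi1, alpha2, phi2):
    """Forster factor (cos t12 - 3 cos t1 cos t2)^2 for r = (0, 1, 0)."""
    mu1 = (math.cos(phi1) * math.cos(alpha1),
           math.cos(phi1) * math.sin(alpha1),
           math.sin(phi1))
    mu2 = (math.cos(phi2) * math.cos(alpha2),
           math.cos(phi2) * math.sin(alpha2),
           math.sin(phi2))
    cos12 = sum(a * b for a, b in zip(mu1, mu2))
    cos1, cos2 = mu1[1], mu2[1]  # projections on r = (0, 1, 0)
    return (cos12 - 3.0 * cos1 * cos2) ** 2


def uniform_phi(delta):
    """Sampler for phi uniformly distributed on [-delta, delta]."""
    return lambda rng: rng.uniform(-delta, delta)


def cosine_phi(delta):
    """Sampler for phi with density cos(phi) / (2 sin(delta)) on [-delta, delta]."""
    s = math.sin(delta)
    # With this density, sin(phi) is uniform on [-sin(delta), sin(delta)].
    return lambda rng: math.asin(rng.uniform(-s, s))


def mean_kappa2(sample_phi, n=200_000, seed=1):
    """Monte Carlo average of kappa2 with alpha1, alpha2 uniform on [-pi, pi]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        a1 = rng.uniform(-math.pi, math.pi)
        a2 = rng.uniform(-math.pi, math.pi)
        total += kappa2(a1, sample_phi(rng), a2, sample_phi(rng))
    return total / n
```

Calling `mean_kappa2(uniform_phi(math.pi / 2))` and `mean_kappa2(cosine_phi(math.pi / 2))` contrasts cases (b) and (d): only the cosine-weighted density corresponds to directions distributed uniformly over the sphere, which is why it recovers the familiar value 2/3.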

The work was supported by the Russian Foundation for Basic Research (Grant 08-04-01587a).

1. Förster T. 1965. Modern quantum chemistry. In: Istanbul Lectures. Part III: Action of Light and Organic Crystals. Ed. Sinanoğlu O. New York: Academic Press.

SEARCH FOR AN OPTIMAL INTERFACING SUBANTENNAE IN SUPERANTENNA OF PHOTOSYNTHETIC GREEN BACTERIA V.G. POPOV 1, A.V. ZOBOVA 1, A.S. TAISOVA 1, Z.G. FETISOVA 1

Keywords: structure optimization, functional criteria, photosynthesis, light-harvesting antenna, model calculations

Theoretical investigation of the optimality of model antenna functioning is a powerful tool for studying efficient light-harvesting strategies in photosynthesis. Our previous theoretical analysis of the optimality of subantennae constitution in the superantenna of the green bacterium Oscillochloris trichoides (from a new family of green bacteria, Oscillochloridaceae, registered in 2000) prompted us to predict the existence of a new subantenna Bx, in addition to the known ones, B750 and B805-860, that optimizes excitation energy transfer (EET) B750→B805 [1]. A targeted search for the theoretically predicted subantenna Bx subsequently allowed us to recognize it in the Osc. trichoides chlorosomal baseplate. However, this BChl a subantenna was not visually identified in the absorption spectra of isolated chlorosomes. At the same time, we succeeded in finding the native fluorescence spectra of both chlorosomal subantennae (B750 and Bx), which turned out to differ notably from those used in our previous theoretical analysis. This requires additional examination of the problem of optimal interfacing of the energy levels of neighboring subantennae in the Osc. trichoides superantenna.

We assumed a model 3D superantenna with 1D antenna units and used a Förster-type description of EET B750→Bx→B805 (denoted 1→2→3). The probability matrix method was used to simulate EET. Varying the Bx spectral position (λX), we computed the time (t, a.u.) of EET 1→2→3 for 0.5 ≤ R12/R23 ≤ 2 (Rij is the distance between donor i and acceptor j); k12/k23 = 1 (kij, orientational factors); n12/n23 = 1 (nij, refractive indexes) (see Figure). Such a formal fourfold change in the R12/R23 ratio (at R12 + R23 = const) was chosen to cover possible variations of the currently unknown parameters kij, nij and Rij. The horizontal line corresponds to the trapping time for the direct EET B750→B805, i.e., in the absence of the intermediate Bx subantenna. The η value is the ratio η ≡ t(B750→B805)/tmin(B750→Bx→B805).

Figure. Time t (a.u.) of EET B750→Bx→B805 versus the Bx spectral position λX (nm) for R12/R23 = 0.5, 0.7, 0.8, 1.0, 1.2, 1.5 and 2.0; the horizontal line shows t(B750→B805). Inset table columns: λBx(tmin), nm; η; R12/R23.

The figure shows that: (1) introducing the Bx subantenna notably decreases the time of the direct EET B750→B805 (up to 30 times), i.e., the direct EET B750→B805 is not optimal; (2) each parametric curve t(λX) has an individual stable minimum tmin(λX) in the spectral region under study, i.e., at any value of the R12/R23 parameter (0.5 ≤ R12/R23 ≤ 2), the Bx subantenna allows one to control the efficiency of the entire superantenna, which is governed by the Bx spectral position λX (770 nm < λX < 805 nm); (3) all minima tmin(λX) are localized within a rather narrow spectral range, from 778 nm to 801 nm; (4) an increase in the R12/R23 value from 0.5 to 2.0 causes a short-wave shift of the minima tmin(λX) from 801 nm to 778 nm; (5) for each curve t(λX), the maximal effect of optimizing the superantenna structure depends on the R12/R23 value and varies from η = 4 (at R12/R23 = 0.5) to η = 30 (at R12/R23 = 1.5, the most effective value).

Our calculations demonstrated that the optimal interfacing of subantennae in the superantenna of the green bacterium Osc. trichoides leads to a stable minimization of the energy transfer time within the superantenna and, consequently, to a decrease in energy losses, thereby ensuring the high efficiency and stability of the overall superantenna function. Thus, the biological expedience of the existence of the intermediate-energy BChl a subantenna connecting B750 and B805-860 in Osc. trichoides chlorosomes is theoretically well founded.

The work was partially supported by the Russian Foundation for Basic Research (Grant 08-04-01587a).

1 M.V. Lomonosov Moscow State University, Moscow, 119992, Russian Federation, [email protected]; [email protected]

1. A.A. Novikov, A.S. Taisova, Z.G. Fetisova (2007) Analysis of spectral conjugation of nonuniform subantennae in the light-harvesting superantenna of Oscillochloridaceae photosynthetic green bacteria, Biofizika, 52: 63-68.


EVOLUTION OF SEX CHROMOSOMES IN DIPLOIDS AND HAPLOIDS DMITRY FILATOV 1

Recombination is one of the most important factors affecting the evolution of genes and genomes [1], and lack of recombination often leads to the accumulation of deleterious mutations and repetitive DNA and to genetic degeneration [2]. The human Y chromosome represents an extreme case of genetic degeneration, with only a few functional genes remaining intact [3], which makes it unsuitable for studying the evolutionary processes that dominate genomic regions at the early stages of degeneration. White campion Silene latifolia [2] and a number of fungal species (Neurospora tetrasperma, Microbotryum violaceum [4, 5]) are more suitable for the evolutionary analysis of the early stages of non-recombining genomic regions, as these species evolved non-recombining sex chromosomes or (in the case of fungi) mating type-specific chromosomes relatively recently. Silent DNA divergence between homologous X- and Y-linked (or mating type-linked) sequences in these species is less than 15%, suggesting that the sex- or mating type-linked non-recombining regions evolved within the last ~10 million years.

Our analysis of S. latifolia Y-linked genes and their recombining X-linked homologues revealed that the effective population size is dramatically reduced on the S. latifolia Y chromosome, suggesting that the efficacy of selection in the Y-linked genes should be reduced compared to the X-linked genes [6]. Thus, S. latifolia Y-linked genes are expected to accumulate deleterious mutations and undergo genetic degeneration. Surprisingly, with few notable exceptions, most Y-linked genes in that species show no apparent signs of genetic degeneration: they are transcribed, their open reading frames are intact, and the ratio of non-synonymous to synonymous substitutions (Ka/Ks) is <<1, reflecting fairly strong purifying selection in these Y-linked genes [7, 8]. Recent sequencing-based expression analyses, however, revealed that the expression of S. latifolia Y-linked genes is reduced compared to their X-linked homologues, suggesting that genetic degeneration of non-recombining regions first proceeds via alterations of expression rather than via accumulation of deleterious amino acid replacements or gene loss.

One factor that may, at least partly, prevent genetic degeneration of plant Y-linked genes is haploid expression [9]. Unlike animals, where only a few genes are expressed in gametes, a significant proportion of the plant genome is expressed in pollen [10], which may make purifying selection in the Y-linked genes stronger and slow down or even prevent genetic degeneration [9]. Indeed, the 'sex locus' (effectively a sex chromosome, as recombination is suppressed in this region) of the fungal pathogen Cryptococcus neoformans contains about 20 genes in just 100 kb and does not show any obvious signs of genetic degeneration [11]. However, the 20 genes present in this region may not be sufficient for the operation of such detrimental population genetic processes as background selection or Muller's ratchet, as the speed of these processes critically depends on the number of active genes linked together [2]. Two other fungal species, Neurospora tetrasperma and Microbotryum violaceum, are known to have relatively large non-recombining regions [4, 5]. N. tetrasperma is a pseudohomothallic species with a constant state of heterokaryosis, while M. violaceum does have a haploid stage in which individuals contain either of the two mating type-specific chromosomes. Thus, comparisons between these two fungal species are analogous to comparisons between species with haploid and diploid sex determination. The N. tetrasperma genome has just become available, and the M. violaceum genome was recently sequenced in our lab. The analysis of these genomes is in progress and will hopefully answer the question of whether haploid expression is capable of preventing genetic degeneration in non-recombining regions.

1 University of Oxford, United Kingdom, [email protected]

1. Gaut, B.S., et al., (2007) Recombination: an underappreciated factor in the evolution of plant genomes. Nat Rev Genet, 8: 77-84.
2. Charlesworth, D., (2008) Sex chromosome origins and evolution, in Evolutionary Genomics and Proteomics, P.A. Pagel M, editor. Sinauer Associates: Sunderland. p. 207-240.
3. Skaletsky, H., et al., (2003) The male-specific region of the human Y chromosome is a mosaic of discrete sequence classes. Nature, 423: 825-837.
4. Hood, M.E., (2002) Dimorphic mating-type chromosomes in the fungus Microbotryum violaceum. Genetics, 160: 457-461.
5. Menkis, A., et al., (2008) The mating-type chromosome in the filamentous ascomycete Neurospora tetrasperma represents a model for early evolution of sex chromosomes. PLoS Genet, 4: e1000030.
6. Filatov, D.A., et al., (2000) Low variability in a Y-linked plant gene and its implications for Y-chromosome evolution. Nature, 404: 388-390.
7. Filatov, D.A., (2005) Substitution rates in a new Silene latifolia sex-linked gene, SlssX/Y. Mol Biol Evol, 22: 402-408.
8. Marais, G.A., et al., (2008) Evidence for degeneration of the Y chromosome in the dioecious plant Silene latifolia. Curr Biol, 18: 545-549.
9. Bull, J.J., (1983) Evolution of sex determining mechanisms. The Benjamin/Cummings Publishing.
10. Boavida, L.C., et al., (2005) The making of gametes in higher plants. Int J Dev Biol, 49: 595-614.
11. Lengeler, K.B., et al., (2002) Mating-type locus of Cryptococcus neoformans: a step in the evolution of sex chromosomes. Eukaryot Cell, 1: 704-718.


ANALYSIS OF 3D STRUCTURE, THERMOSTABILITY AND MECHANICAL CHARACTERISTICS OF I, II, III, V AND XI TYPES OF COLLAGENS IVAN V. FILATOV 1, YURI V. MILCHEVSKY 1, VLADIMIR A. NAMIOT 1, MARIANNA V. MOLDAVER 2, SERGEY A. LUKSHIN 2, MAXIM A. RUBIN 2, ELISA I. TIKTOPULO 3, NATALIA G. ESIPOVA 2, VLADIMIR G. TUMANYAN 2

Using a modification of the molecular mechanics approach, we calculated the spatial structures of triple-helical collagen macromolecules. Structures of whole molecules of type I, II, III, V and XI collagens, as well as fragments of the type IV collagen macromolecule, have been calculated. Our method includes a special procedure for determining the helical parameters of molecules without strict helical symmetry. We developed a tool for reducing the conformational space of side chains. Subunits of the side chain conformation space are displayed. We observed that various subsets of the full set of side chain conformations were asymptotically equivalent in optimization. It was shown that the amino acid sequence determines the alternation of segments with one or two hydrogen bonds per tripeptide along the macromolecule, without variation in the type of triple helix. The second network of interpeptide hydrogen bonds forms if amino (rather than imino) acids occur in the second position in the triplets. Formation of the double network of hydrogen bonds results in displacement of the CO groups of Gly and of the third residue in each triplet into symmetrically equivalent positions at the surface of the triple-helical complex. Hence the collagen hydration pattern becomes independent of its amino acid sequence, i.e., it is tissue-specific rather than species-specific. Our computations also confirm the specific role of Pro residues in collagen hydration and thermostability. Pro residues increase the number of independent cooperative units, which becomes evident when the melting of collagens with different Pro content is compared; this is related to the entropy increase in the transition. It was demonstrated that the cooperative character of hydration of a single polypeptide chain in the polyproline II conformation is determined by the sequence of prolines and also by the symmetry of their arrangement.

The helical parameters of five types of collagens were calculated. We observed a complex distribution of these parameters along the axis of the macromolecule, with a periodicity corresponding to the periodicities in the amino/imino acid sequence. The differences in the periodic distribution of helical parameters are reflected in the different types of collagen fibrillogenesis. We estimated numerical values of the Young's modulus and the persistence length for collagen-type structures. It was found that the persistence length of collagen significantly exceeds the size of the macromolecule. Thus, intact unbent collagen macromolecules participate in the process of fibril formation, and the pattern of helical parameters must influence fibrillogenesis. Interestingly, the global minimum of potential energy is attained under stretching of the triple-helical macromolecules. Simultaneously, the helix height becomes equal to the experimental helix height obtained in X-ray studies of collagens of different origin. This suggests the existence of a specific mechanism of stretching of collagen macromolecules in a cell. The influence of mutagenesis on the structural and mechanical properties of collagens is discussed. The effect of amino acid substitutions is analyzed in collagen IV as well.

1 Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, ul. Vavilova 32, Moscow, 119991 Russia; fax: +7 (499) 135-1405, e-mail: [email protected]
2 Institute of Protein Research, Russian Academy of Sciences, Pushchino, Moscow Region, 142290 Russia
3 Institute of Nuclear Physics, Lomonosov Moscow State University, Vorob'evy Gory, Moscow, 119992 Russia

Fig. 1. 3D structures of human type III collagen (left) and human type V collagen (right).

This work was supported by grants from Russian Foundation for Basic Research (projects No 08-04-01770a and 07-04-01765a), and the Molecular and Cellular Biology Program of the Russian Academy of Sciences.


A NEW ATOMIC FORCE FIELD "FFS" FOR PROTEIN INTERACTIONS, COMPUTED FROM SOLUBILITY OF MOLECULAR CRYSTALS IN WATER ALEXEI FINKELSTEIN 1, LEONID PEREYASLAVETS 2

Keywords: non-covalent atom-atom interactions, implicit water surrounding, sublimation, solvation

Detailed calculations of protein interactions with explicitly considered water (see, e.g., [1]) take enormous computer time. The calculations become much faster if water is considered implicitly (as a continuous medium rather than as molecules); however, such calculations are much less precise unless one uses an additional (and also voluminous) computation of the solvent-accessible areas of all atoms (see, e.g., [2]).

The aim of our study is to preserve a conventional atom-atom interaction scheme and extend it by developing parameters for non-bonded atom-atom interactions for the case when the water surrounding is considered implicitly. The "in-vacuum" interactions of atoms are obtained from experimental structures of crystals [3] and enthalpies of their sublimation [4]; the "in-water" interactions of atoms must be corrected using solvation free energies of molecules, which can be obtained from the Henry constants [5].

Using 58 structures of molecular crystals and thermodynamic data on their sublimation and solubility, we obtained van der Waals parameters for the "in-water" attraction and repulsion of atoms typical of protein structures (H, C, N, O, S) in various covalently bonded states, as well as parameters for electrostatic interactions of these atoms. All parameters of covalent interactions necessary for the calculations were taken from the ENCAD force field [6], and all partial charges of atoms in the various molecules were obtained with the RESP method [7] from quantum-mechanical calculations done with PC-GAMESS [8]. The sought parameters of the "in-water" van der Waals and electrostatic interactions were optimized so as to achieve the best fit between equilibrium and experimental crystal structures and their sublimation and solvation at room temperature. With the optimized parameters, the accuracy of the effective molecular cohesion energy in crystals was, on average, about 10% in both the "in-vacuum" and "in-water" cases.

We are grateful to the MCB program of the Presidium of RAS, RFBR (grant 07-04-00388), the program "Leading Scientific Schools" (grant 2791.2008.4), INTAS (grant 05-1000004-7747) and the Howard Hughes Medical Institute (award 55005607).

1 Institute of Protein Research, Russian Academy of Sciences, Russian Federation, [email protected]
2 Institute of Protein Research, Russian Academy of Sciences, Russian Federation, [email protected]

1. J. Wang, R.M. Wolf, J.W. Caldwell, P.A. Kollman, D.A. Case (2004) Development and testing of a general amber force field. J. Comp. Chem., 25:1157-1174.
2. E. Gallicchio, L.Y. Zhang, R.M. Levy (2002) The SGB/NP Hydration Free Energy Model Based on the Surface Generalized Born Solvent Reaction Field and Novel Nonpolar Hydration Free Energy Estimators. J. Comp. Chem., 23:517-529.
3. F.H. Allen (2002) The Cambridge Structural Database: a quarter of a million crystal structures and rising. Acta Cryst., B58:380-388.
4. J.S. Chickos, W.E. Acree, Jr. (2002) Enthalpies of Sublimation of Organic and Organometallic Compounds. 1910-2001. J. Phys. Chem. Ref. Data, 31:537-698.
5. R. Sander (1999) Compilation of Henry's Law Constants for Inorganic and Organic Species of Potential Importance in Environmental Chemistry. Air Chemistry Department, Max-Planck Institute of Chemistry: http://www.mpch-mainz.mpg.de/~sander/res/henry.html.
6. M. Levitt, M. Hirshberg, R. Sharon, V. Daggett (1995) Potential energy function and parameters for simulations of the molecular dynamics of proteins and nucleic acids in solution. Comp. Phys. Commun., 91:215-231.
7. C.I. Bayly, P. Cieplak, W. Cornell, P.A. Kollman (1993) A well-behaved electrostatic potential based method using charge restraints for deriving atomic charges: the RESP model. J. Phys. Chem., 97:10269–10280.
8. A.A. Granovsky (2008) PC-GAMESS/Firefly version 7.1, http://classic.chem.msu.su/gran/gamess/index.html.


X(Y)N-TYPE MICROSATELLITES IN THE HUMAN AND MOUSE GENOME FRIDMAN M.V. 1, MAKEEV V. 1, OPARINA N.J. 2

Keywords: microsatellites, Alu

We have studied the frequencies of different types of 4-6-bp microsatellites in the human and mouse genomes [7]. The enormous frequency of X(Y)n-type repeats in comparison with the other types was demonstrated [5]. The prevailing X(Y)n repeats were TAAAA, GAAAA, etc., probably correlating with sites of retroposon integration [7]. This might be explained by a mechanism of repeat generation connected with the endonuclease of L1-like elements, which cleaves AT-rich sequences, preferably the T(A)n motifs [1,6]. Our study of the intersection of such microsatellites with different dispersed repetitive elements showed that X(Y)n microsatellites are indeed preferentially bordered by Alus. The oldest human Alus intersect with microsatellites less often than the youngest Alus, and the intersections with the youngest Alus contain more exact microsatellite repeats. This may be evidence that a microsatellite is generated practically simultaneously with Alu integration and that the retroposon and the microsatellite subsequently age together.

Nevertheless, other dispersed repeats were rarely located at the sites of X(Y)n tandems. Apart from Alu elements, LINEs could be largely underestimated because of difficulties in mapping them onto the genome. We propose that X(Y)n repeats could also originate through the integration of human L1 and of mouse B1 and L1 elements. What is the further fate of such microsatellites? We have detected traces of their degeneration and deletion, the degree of which depends on the genomic location of these microsatellites. For example, we have compared the fate of these repeats in 3'UTRs and in other regions. Their fate may also include amplification [4] and making the target sequence prone to further SINE/LINE integration [8]. These microsatellites are still the prevailing class in the human and mouse genomes, which is why it is important to study their probable role in genome rearrangements and disorders [2, 3]. We also search for traces of the recruitment of such repeats as new functional genomic elements.

Acknowledgements: This study was partially supported by the Russian Fund of Basic Research, grant # 07-04-01584-а.

 1 Institute of genetics and selection of industrial microorganisms, GosNIIgenetika, Moscow, Russia, [email protected] 2 Engelhardt Institute of Molecular Biology, RAS, Moscow, Russia, [email protected]

1. Jurka J. (1997) Sequence Patterns Indicate an Enzymatic Involvement in Integration of Mammalian Retroposons. PNAS USA, 94: 1872-1877.
2. Li Y-C, Korol A.B., Fahima T., Nevo E. (2004) Microsatellites within Genes: Structure, Function, and Evolution. Molecular Biology and Evolution, 21(6): 991-1007.
3. Li Y-C, Korol A.B., Fahima T., Beiles A., Nevo E. (2002) Microsatellites: genomic distribution, putative functions and mutational mechanisms: a review. Molecular Ecology, 11: 2453–2465.
4. D. Pumpernik, B. Oblak, B. Borstnik (2008) Replication slippage versus point mutation rates in short tandem repeats of the human genome. Mol. Genet. Genomics, 279 (1): 53-61.
5. Toth, G., Gaspari, Z., Jurka, J. (2000) Microsatellites in different eukaryotic genomes: survey and analysis. Genome Res, 10 (7): 967-981.
6. Zingler N. et al (2005) Analysis of 5’ junctions of human LINE-1 and Alu retrotransposons suggests an alternative model for 5’-end attachment requiring microhomology-mediated end-joining. Genome Research, 15: 780-789.
7. Pdf of BGRS 2008 conference proceedings: http://www.bionet.nsc.ru/meeting/bgrs2008/BGRS2008_Proceedings.pdf
8. Alu Pairs Database, NIEHS http://www.niehs.nih.gov/research/resources/databases/alu/index.cfm


DIPROGB: A NEW GENOME BROWSER THAT ENCODES SEQUENCE INFORMATION BY THERMODYNAMIC AND GEOMETRICAL DINUCLEOTIDE PROPERTIES MAIK FRIEDEL 1, THOMAS WILHELM 1, JÜRGEN SÜHNEL 2

The aim of computational genome analysis is to convert the dramatically increasing amount of genomic information into biological knowledge. This is currently done almost exclusively by using the character string representation of genomic sequences. Instead, we have developed the new genome browser DiProGB, which encodes the sequence by geometrical and physicochemical dinucleotide properties [1]. We call the corresponding plot a sequence graph. Analysing physical properties allows detection of sequence motifs that cannot be seen in the usual character string representation. The sequence graph can be manipulated in real time by zooming in and out, changing the amplitude, and smoothing with a shifting window technique. All GenBank features and qualifiers, such as exons and introns, can be addressed separately. DiProGB also offers tools for statistical and Fourier analyses as well as for motif and repeat search. As a case study, we present here a classification of chloroplast genomes that uses information from the repeat structure of rRNA gene clusters. We found, for instance, that these genes have significantly different physicochemical properties, e.g. stacking energy, than the rest of the genome. We also demonstrate that the repeat structure of chloroplast rRNA gene clusters can be used to infer phylogenetic relationships between different species [2].
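The encoding and smoothing steps described above can be illustrated with a minimal sketch. It is not DiProGB code: the function names are hypothetical, and the property table is a toy one (the number of G/C bases in each dinucleotide) chosen only so that no measured thermodynamic values have to be assumed. A real table, for example stacking energies from DiProDB [1], could be passed in the same way.

```python
from typing import Dict, List


def gc_count_table() -> Dict[str, float]:
    """Toy dinucleotide property (assumption): number of G/C bases, 0, 1 or 2."""
    bases = "ACGT"
    return {a + b: float((a in "GC") + (b in "GC")) for a in bases for b in bases}


def sequence_graph(seq: str, table: Dict[str, float]) -> List[float]:
    """Encode a sequence as one property value per overlapping dinucleotide."""
    seq = seq.upper()
    return [table[seq[i:i + 2]] for i in range(len(seq) - 1)]


def smooth(values: List[float], window: int) -> List[float]:
    """Shifting-window smoothing: mean of each run of `window` consecutive values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


if __name__ == "__main__":
    graph = sequence_graph("ACGTTGCA", gc_count_table())
    print(graph)
    print(smooth(graph, 3))
```

Plotting the smoothed values against sequence position gives a curve in the spirit of the sequence graph; the window width plays the role of the smoothing control.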

1. M. Friedel et al. (2009) DiProDB: a database for dinucleotide properties, Nucleic Acids Res., 37:D37–40. 2. J. D. Palmer (1985) Comparative organization of chloroplast genomes, Ann. Rev. Genet., 19: 325-354.

 1 Fritz Lipmann Institute Jena, Beutenbergstr. 11, 07745 Jena, Germany, [email protected] 2 Institute of Food Research, Norwich Research Park, Colney, Norwich NR4 7UA, UK

HELIX-HELIX CONTACTS IN MEMBRANE PROTEINS: ANALYSIS, PREDICTION AND APPLICATIONS ANGELIKA FUCHS 1, ANDREAS KIRSCHNER 2, BARBARA HUMMEL 2, DMITRIJ FRISHMAN 2

Keywords: Membrane proteins, contact prediction, protein classification

Despite increasing numbers of available 3D structures, membrane proteins, which constitute up to 30% of a genome, still account for less than 1% of all structures in the Protein Data Bank. Additionally, recent high-resolution structures indicate a clearly higher structural diversity of membrane proteins than initially anticipated, motivating the development of reliable structure prediction methods for membrane proteins. A commonly addressed 2D structure prediction problem in soluble proteins is the prediction of residue-residue contacts, which can subsequently be used as constraints for ab initio structure prediction or fold recognition. For membrane proteins, however, this field of structural bioinformatics has so far received far less attention, prompting us to develop the first predictor for helix-helix contacts within membrane proteins. Furthermore, we introduce a potential application of predicted residue contacts specific to membrane proteins, namely the distinction of different helix architectures based on predicted helix interaction graphs.

For soluble proteins, initial contact prediction methods were rooted in the idea of analyzing correlated mutations [1], while recent methods mostly use a variety of input features in combination with a machine learning approach [2]. Similarly, we first conducted a study of correlated mutations in polytopic membrane proteins, in which we showed that co-evolving residues alone are not sufficient to predict helix-helix contacts, but that these residues still carry a strong signal for the detection of interacting transmembrane helices owing to their frequent occurrence in close sequence neighborhood to helix-helix contacts [3]. In a second step, we developed a neural-network-based approach to predict helix-helix contacts specifically in α-helical membrane proteins [4]. Input features for this neural network include both features commonly used for contact prediction in soluble proteins, such as windowed residue profiles and residue distance in the sequence, and features that apply to membrane proteins only, such as a residue’s position within the transmembrane segment or its orientation towards the hydrophilic or lipophilic environment. Trained on a dataset of 62 membrane proteins with solved structure, the neural network predicts contacts between residues in transmembrane segments with nearly 26% accuracy and recalls 3.5% of all helix-helix contacts. It is therefore the first contact predictor developed specifically for α-helical membrane proteins that performs with accuracy equal to that of state-of-the-art contact predictors available for soluble proteins.

We further demonstrate how the predicted contacts can be used to identify interacting transmembrane helices distant in sequence, which is an important step in the discrimination of different helix architectures of membrane proteins. Based on a simple selection procedure, which requires several predicted residue contacts to rate a given helix pair as interacting, we are able to remove incorrectly predicted helix-helix contacts and predict interacting helices with a sensitivity of 53.1%, a specificity of 86.3% and an accuracy of 78.1%, which clearly outperforms the results obtained earlier with correlated mutations alone.

Applying this procedure to our CAMPS database of membrane proteins [5], we combine predictions from several proteins classified to the same fold into a consensus helix interaction graph representing the helix architecture of a given fold. By doing so, we further improve the prediction accuracy for interacting helices and are also able to identify superfamilies with similar helix interaction patterns. Furthermore, we plan to incorporate the obtained consensus interaction graphs into the procedure of classifying a newly sequenced protein to its most likely fold.

 1 Department of Genome Oriented Bioinformatics, Technische Universität München, Wissenschaftszentrum Weihenstephan, 85350 Freising, Germany, [email protected] 2 Technische Universität München, Wissenschaftszentrum Weihenstephan, 85350 Freising, Germany, [email protected] , [email protected]
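The selection procedure mentioned in the abstract (a helix pair is rated as interacting only if it is supported by several predicted residue contacts) might be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the cutoff `min_contacts=3`, the residue-to-helix dictionary `helix_of` and all function names are hypothetical, since the abstract does not give these details.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

HelixPair = Tuple[int, int]


def interacting_helices(contacts: Iterable[Tuple[int, int]],
                        helix_of: Dict[int, int],
                        min_contacts: int = 3) -> Set[HelixPair]:
    """Rate a helix pair as interacting if it has >= min_contacts predicted contacts.

    contacts : predicted residue-residue contacts (residue numbers)
    helix_of : residue number -> index of the transmembrane helix it belongs to
    """
    counts: Counter = Counter()
    for i, j in contacts:
        hi, hj = helix_of.get(i), helix_of.get(j)
        if hi is None or hj is None or hi == hj:
            continue  # skip residues outside helices and intra-helix contacts
        counts[(min(hi, hj), max(hi, hj))] += 1
    return {pair for pair, n in counts.items() if n >= min_contacts}


def evaluate(predicted: Set[HelixPair], observed: Set[HelixPair],
             all_pairs: Set[HelixPair]) -> Dict[str, float]:
    """Sensitivity, specificity and accuracy over all candidate helix pairs."""
    tp = len(predicted & observed)
    fp = len(predicted - observed)
    fn = len(observed - predicted)
    tn = len(all_pairs) - tp - fp - fn
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "accuracy": (tp + tn) / len(all_pairs) if all_pairs else 0.0,
    }
```

Raising `min_contacts` trades sensitivity for specificity, which is the kind of filtering of isolated false-positive contacts that the abstract describes.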

1. Gobel, U., Sander, C., Schneider, R. & Valencia, A. (1994). Correlated mutations and residue contacts in proteins. Proteins 18, 309-17.
2. Izarzugaza, J. M., Grana, O., Tress, M. L., Valencia, A. & Clarke, N. D. (2007). Assessment of intramolecular contact predictions for CASP7. Proteins 69 Suppl 8, 152-8.
3. Fuchs, A., Martin-Galiano, A. J., Kalman, M., Fleishman, S., Ben-Tal, N. & Frishman, D. (2007). Co-evolving residues in membrane proteins. Bioinformatics 23, 3312-9.
4. Fuchs, A., Kirschner, A. & Frishman, D. (2009). Prediction of helix-helix contacts and interacting helices in polytopic membrane proteins using neural networks. Proteins 74, 857-71.
5. Martin-Galiano, A. J. & Frishman, D. (2006). Defining the fold space of membrane proteins: the CAMPS database. Proteins 64, 906-22.

PREDICTION OF UNSTRUCTURED RESIDUES IN PROTEIN CHAINS OXANA GALZITSKAYA 1, SERGIY GARBUZYNSKIY 1, MICHAIL LOBANOV 1

In this work, a statistical analysis of unstructured residues was carried out on 28727 unique protein chains taken from the PDB. In this set, 4.65% of residues are unstructured (that is, invisible in X-ray structures). The statistics were obtained separately for the N- and C-termini as well as for the central part of the protein chain. Based on the collected statistics, a scale for the prediction of unstructured regions was constructed. From this statistical scale, an optimized scale was derived by maximizing the sum of sensitivity and specificity, a measure usually used by the assessors of CASP (Critical Assessment of Techniques for Protein Structure Prediction). The obtained scales correlate at the level of 90% with the scale of contacts (the average number of close residues at a distance below 8 Å; we previously used this scale successfully [1] for predictions of unstructured regions and totally disordered proteins). The newly obtained scales were used to predict the status of each residue (ordered or disordered) in a protein chain with the FoldUnfold method (previously developed by us for such predictions [2]). To test the quality of our predictions of intrinsically disordered regions in proteins, we used two databases, one containing sequences of 427 intrinsically disordered proteins and regions and the other containing sequences of 559 fully ordered proteins [3]. The obtained scales give a slightly better result than the previous scale (the scale of contacts) within the FoldUnfold method (the specificity increased by 7% while the sensitivity remained practically unchanged). In addition, a new method based on dynamic programming has been developed for finding not only disordered regions but also individual unstructured residues in a protein chain. This method correctly identifies 75% of unstructured residues and 85% of structured residues in the Protein Data Bank.
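The optimization criterion named above, the sum of sensitivity and specificity, can be illustrated with a small sketch. It only shows how a decision threshold could be chosen on per-residue scores by maximizing that sum; it is not the FoldUnfold method or the authors' scale optimization. The convention that a residue is called disordered when its score is at or below the threshold, and all function names, are assumptions.

```python
from typing import Optional, Sequence, Tuple


def sens_plus_spec(pred: Sequence[bool], truth: Sequence[bool]) -> float:
    """Sum of sensitivity and specificity for a binary per-residue prediction."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    tn = sum(1 for p, t in zip(pred, truth) if not p and not t)
    pos = sum(1 for t in truth if t)
    neg = len(truth) - pos
    sens = tp / pos if pos else 0.0
    spec = tn / neg if neg else 0.0
    return sens + spec


def best_threshold(scores: Sequence[float],
                   disordered: Sequence[bool]) -> Tuple[Optional[float], float]:
    """Scan every observed score as a threshold and keep the best one.

    Assumption: residues with score <= threshold are predicted disordered
    (a low expected number of contacts suggests disorder).
    """
    best_thr: Optional[float] = None
    best_val = float("-inf")
    for thr in sorted(set(scores)):
        pred = [s <= thr for s in scores]
        val = sens_plus_spec(pred, disordered)
        if val > best_val:
            best_thr, best_val = thr, val
    return best_thr, best_val
```

A perfect separation gives a value of 2.0, while predicting every residue as ordered (or every residue as disordered) gives 1.0, which is why this sum is a convenient single figure of merit.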

This work was supported by the programs "Molecular and cellular biology" and "Fundamental sciences – medicine", by the Russian Foundation for Basic Research (08-04-00561), and by the "Russian Science Support Foundation".

 1 Institute of Protein Research RAS, Russian Federation, [email protected]

1. O.V.Galzitskaya, S.O.Garbuzynskiy, M.Yu.Lobanov (2006) Prediction of natively unfolded regions in protein chains, Molecular Biology, 40:910–918.
2. O.V.Galzitskaya, S.O.Garbuzynskiy, M.Yu.Lobanov (2006) FoldUnfold: web server for the prediction of disordered regions in protein chain, Bioinformatics, 22:2948–2949.
3. O.V.Galzitskaya, S.O.Garbuzynskiy, M.Yu.Lobanov (2006) Prediction of Amyloidogenic and Disordered Regions in Protein Chain, PLoS Comput. Biology, 2:1639–1648.


AGGREGATION PROPENSITY OF YEAST AND HUMAN PROTEOMES NATALYA BOGATYREVA 1, OXANA GALZITSKAYA 2

Keywords: Amyloid formation, aggregation parameters, amyloid-like fibrils, proteome, function, cellular localization

We analyze the propensity to form amyloid-like fibrils of 5818 protein sequences of the yeast proteome and 44423 protein sequences of the human proteome. We used a predictive algorithm developed by us [1] to calculate the aggregation propensity of every protein sequence of the human and yeast proteomes. A set of parameters was calculated for each of the 50241 sequences: the frequency of the aggregation peaks; the average length of all the aggregation peaks present in the sequence; and the area of each aggregation peak, Sagg, i.e. the surface under the peak that lies above the threshold of 21.4, which was then normalized by both the protein length (Sagg/Lprotein) and the number of peaks (Sagg/Npeaks). All intrinsic membrane proteins (10656 sequences) were removed from the database and analyzed separately, as they themselves have a very high aggregation potential. We found that proteins with different subcellular localizations have different aggregation propensities. We compared folded proteins that form amyloid-like fibrils in vivo in the context of human diseases with proteins of the corresponding subcellular localizations and functions in the human proteome (without membrane proteins), using the Gene Ontology component and functional annotations [2]. We found that the former category of proteins (β2-microglobulin, gammaC-crystallin_R168W, insulin, Ig_REC, lithostathine, lysozyme, PrPc, transthyretin, prolactin and medin, with the exception of cystatin-C, SAA and SOD-1) has significantly higher aggregation propensities than the latter, in contrast to the work [3], where human proteins involved in amyloidoses in vivo do not differ extensively from the rest of the proteome. In [3], the effect of the aggregating properties of human proteins involved in amyloidoses in vivo was lost by averaging the properties over the whole proteome (without membrane proteins).
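The peak parameters listed above can be made concrete with a short sketch. It assumes that a per-residue aggregation profile has already been computed by some predictor (the algorithm of [1] is not reproduced here), that a peak is a contiguous run of residues with profile values strictly above the threshold of 21.4, that the area is approximated by a discrete sum, and that "frequency" means peaks per residue. These interpretations and all names are assumptions.

```python
from typing import Dict, List, Sequence, Tuple

THRESHOLD = 21.4  # threshold quoted in the abstract


def find_peaks(profile: Sequence[float],
               threshold: float = THRESHOLD) -> List[Tuple[int, int]]:
    """Return half-open index ranges [start, end) of runs above the threshold."""
    peaks: List[Tuple[int, int]] = []
    start = None
    for i, value in enumerate(profile):
        if value > threshold:
            if start is None:
                start = i
        elif start is not None:
            peaks.append((start, i))
            start = None
    if start is not None:
        peaks.append((start, len(profile)))
    return peaks


def aggregation_parameters(profile: Sequence[float],
                           threshold: float = THRESHOLD) -> Dict[str, float]:
    """Peak count, frequency, mean length and area above threshold (Sagg)."""
    if not profile:
        raise ValueError("profile must not be empty")
    peaks = find_peaks(profile, threshold)
    n_peaks = len(peaks)
    length = len(profile)
    # Discrete approximation of the surface under the peaks above the threshold.
    s_agg = sum(profile[i] - threshold for a, b in peaks for i in range(a, b))
    return {
        "n_peaks": float(n_peaks),
        "peak_frequency": n_peaks / length,
        "mean_peak_length": (sum(b - a for a, b in peaks) / n_peaks) if n_peaks else 0.0,
        "s_agg": s_agg,
        "s_agg_per_residue": s_agg / length,            # Sagg / Lprotein
        "s_agg_per_peak": (s_agg / n_peaks) if n_peaks else 0.0,  # Sagg / Npeaks
    }
```

Normalizing by protein length and by peak count separates proteins that have many small peaks from those with a few strongly aggregation-prone segments.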

 1 Institute of Protein Research, RAS, Russian Federation, [email protected] 2 Institute of Protein Research, RAS, Russian Federation, [email protected]

This work was supported by the programs "Molecular and cellular biology" and "Fundamental sciences – medicine", by the Russian Foundation for Basic Research (08-04-00561), and by the "Russian Science Support Foundation".

1. O.V.Galzitskaya, S.O.Garbuzynskiy, M.Yu.Lobanov (2006) Prediction of Amyloidogenic and Disordered Regions in Protein Chain, PLoS Comput. Biology, 2:1639-1648. 2. E. Camon, M. Magrane, D. Barrell, V. Lee, E. Dimmer, et al. (2004) The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology, Nucleic Acids Res., 32:D262-266. 3. E. Monsellier, M. Ramazzotti, N. Taddei, F. Chiti (2008) Aggregation propensity of the human proteome, PLoS Comput. Biology, 4:e1000199.


POSITIONS OF PROTEIN FOLDING NUCLEI CORRESPOND TO POSITIONS OF ROOT STRUCTURAL MOTIFS SERGIY O. GARBUZYNSKIY 1, MARIA S. KONDRATOVA 1

Keywords: Folding nucleus; Transition state; Root structural motif; Protein folding

A crucial event in protein folding is the formation of the folding nucleus. The folding nucleus is the structured part of the protein chain in the transition state (that is, at the top of the free-energy barrier on the protein folding pathway). Its formation is thus the rate-limiting step of protein folding, which is why folding nucleus formation is considered a key event of the folding process [1]. Since the folding nucleus corresponds to the top of the free-energy barrier, it is very unstable, and its experimental investigation [2] is therefore very complicated. Theoretical approaches that allow prediction of the position of the folding nucleus in a protein structure are thus in great demand. We demonstrate a considerable correlation between the location of the folding nucleus and the location of the so-called "root structural motifs" [3,4] (supersecondary structures with unique overall folds and handedness) [5]. For proteins that possess a single root structural motif, the involvement in the formation of the folding nucleus is on average significantly higher for amino acid residues in the root structural motif than for residues in other parts of the protein molecule. The fraction of contacts formed in the transition state ensemble (reflecting the involvement in the folding nucleus) is on average twice as large for amino acid residues belonging to the root structural motif as for the other residues. Statistical tests showed that the observed difference is reliable. Thus, we have found a structural feature that corresponds to the folding nucleus. This observation reduces the task of searching for the folding nucleus in a protein structure to the much simpler task of determining root structural motifs.

The work was supported by the Russian Foundation for Basic Research (grants №№ 07-04-00388-a and 08-04-00561-a), the Russian Academy of Sciences (programs "Molecular and Cell Biology" and "Fundamental Sciences to Medicine"), the Howard Hughes Medical Institute (grant № 55005607) and the "Russian Science Support Foundation".

 1 Institute of Protein Research, RAS, Russian Federation, [email protected]

1. A.R.Fersht (1997) Nucleation mechanisms in protein folding, Curr. Opin. Struct. Biol., 7:3–9. 2. J.T.Matouschek et al. (1989) Mapping the transition state and pathway of protein folding by protein engineering, Nature, 340:122–126. 3. A.V.Efimov (1994) Favoured structural motifs in globular proteins, Structure, 2:999–1002. 4. A.V.Efimov (1997) Structural trees for protein superfamilies, Proteins, 28:241–260. 5. S.O.Garbuzynskiy, M.S.Kondratova (2008) Structural features of protein folding nuclei, FEBS Letters, 582:768–772.


IN SILICO DESIGN OF PRIMER FOR 28 KDA ANTIGEN PRECURSOR PROTEIN OF MYCOBACTERIUM LEPRAE ADITYA GAUR 1

Bioinformatics has become an essential tool not only for basic research but also for applied research in biotechnology and the biomedical sciences. An optimal primer sequence and an appropriate primer concentration are essential for maximum specificity and efficiency of PCR. A poorly designed primer can result in little or no product because of non-specific amplification and/or primer-dimer formation, which can become competitive enough to suppress product formation. Several online tools are available to help molecular biologists design effective primers. This review is intended as a guide to choosing the most efficient way to design a specific primer using currently available public links and web resources. It also provides general recommendations for the design and use of PCR primers. A primer is a short synthetic oligonucleotide that is used in many molecular techniques, from PCR to DNA sequencing. Primers were designed with the help of various online software tools such as Genefisher, Primer3, DNASIS-Max, Math Primer, etc. The query sequence of the 28 kDa antigen precursor protein of Mycobacterium leprae was taken in FASTA format and submitted to the nucleotide BLAST tool. The results were analyzed for similarities between sequences. The same query sequence was then run through the online primer design tools, and the primer calculations were illustrated and studied. The primer lengths, nucleotide sequences and Tm values differed between the tools. The complete data were then reorganized, and a comparative study was carried out to identify the overlapping primer sequences generated by different tools based on distinct algorithms. Primers with the same sequence across tools were considered the best and most favorable for use in wet-lab sequencing/PCR.
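The comparative step described above, keeping primers that several tools agree on, amounts to a set comparison. The sketch below is an illustration only: the tool labels and primer strings in the usage example are placeholders, not results from the study, and the function names are hypothetical. The melting temperature estimate uses the well-known Wallace rule, Tm = 2(A+T) + 4(G+C) °C, which is a rough guide for short oligonucleotides and is not necessarily the formula used by any of the tools named in the abstract.

```python
from typing import Dict, List, Set


def wallace_tm(primer: str) -> int:
    """Wallace-rule melting temperature estimate in degrees Celsius."""
    p = primer.upper()
    at = p.count("A") + p.count("T")
    gc = p.count("G") + p.count("C")
    return 2 * at + 4 * gc


def consensus_primers(by_tool: Dict[str, Set[str]],
                      min_tools: int = 2) -> List[str]:
    """Return primers reported by at least `min_tools` different tools."""
    counts: Dict[str, int] = {}
    for primers in by_tool.values():
        # Normalise case and count each primer once per tool.
        for primer in {p.upper() for p in primers}:
            counts[primer] = counts.get(primer, 0) + 1
    return sorted(p for p, n in counts.items() if n >= min_tools)


if __name__ == "__main__":
    # Placeholder data: hypothetical tool labels and made-up sequences.
    results = {
        "tool_a": {"ATGCATGCATGCATGCAT", "GGGTTTAAACCCGGGTTT"},
        "tool_b": {"atgcatgcatgcatgcat"},
    }
    for primer in consensus_primers(results):
        print(primer, wallace_tm(primer))
```

Exact string matching is the strictest notion of overlap; primers that differ by a shifted start position would need an alignment-based comparison instead.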

 1 Dr. Bhim Rao Ambedkar University, Agra, India, [email protected]

ONE CODON – TWO AMINO ACIDS VADIM GLADYSHEV 1

Strict one-to-one correspondence between codons and amino acids is thought to be an essential feature of the genetic code. In the ciliated protozoan Euplotes crassus, cysteine (Cys) is encoded by three codons, UGA, UGU and UGC. We sequenced the macronuclear genome of this organism and found that the UGA codon also specifies insertion of selenocysteine (Sec). The dual use of this codon can occur even within the same gene. Consistent with the use of UGA for both Sec and Cys insertion, we identified the corresponding Sec and Cys tRNAs containing the UCA anticodon. In addition, four selenoprotein genes were found that had UGA codons encoding both Sec and Cys, whereas in four other selenoprotein genes UGA encoded only Sec. We examined a Euplotes protein, thioredoxin reductase 1, that has seven in-frame UGA codons and found that Sec was inserted only at the classical Sec site, whereas the other UGA positions did not support Sec insertion. Further studies revealed that Sec insertion depended on the location of the UGA codon within the ORF, the presence of a Sec insertion sequence element in the 3’-UTR and the availability of this structure for interaction with the ribosome. Thus, E. crassus uses UGA for insertion of both Cys and Sec, establishing it as the first known organism that uses one codon to code unambiguously for two different amino acids. More generally, the data show that the genetic code can support the use of one codon to code for multiple amino acids.

1. A.A. Turanov, A.V. Lobanov, D.E. Fomenko, H.G. Morrison, M.L. Sogin, L.A. Klobutcher, D.L. Hatfield, V.N. Gladyshev (2009) Genetic code supports targeted insertion of two amino acids by one codon. Science, 323:259-261.

 1 Redox Biology Center and Department of Biochemistry, University of Nebraska, Lincoln, NE 68588, United States, [email protected]

DYNAMICS AND RIGIDITY/FLEXIBILITY OF THERMOPHILIC AND MESOPHILIC PROTEINS ANNA V. GLYAKINA 1, TATYANA B. MAMONOVA 2, MARIA G. KURNIKOVA 2, OXANA V. GALZITSKAYA 3

Keywords: Stability, MD simulation, MC simulation, thermophilic proteins, mesophilic proteins

Protein molecules require both flexibility and rigidity to function. The fast and accurate prediction of protein rigidity/flexibility is one of the important problems in protein science. We have determined the flexible regions for four homologous pairs of proteins from thermophilic and mesophilic organisms by two methods: the fast FoldUnfold, which uses the amino acid sequence, and the time-consuming MDFirst, which uses three-dimensional structures. We demonstrate that both methods allow flexible regions in a protein structure to be determined. For three of the four thermophile-mesophile pairs of proteins, FoldUnfold predicts practically the same flexible regions as those found by the MDFirst method. Molecular dynamics simulations show, as expected, that thermophilic proteins are more stable than their mesophilic homologs. Analysis of rigid clusters provides new insights into protein stability. We found that there are two groups of proteins. The first is characterized by a salt-bridge or ionic network. This network includes the salt-bridge triads Arg-Glu-Lys and Arg-Glu-Arg, or salt bridges (such as Arg-Glu) connected by hydrogen bonds. This ionic network accumulates alpha helices and rigidifies the structure. The second group is characterized by single salt bridges and hydrogen bonds or by small ionic clusters. This difference in the network of salt bridges results in different flexibility of homologous proteins. Considering both approaches makes it possible to characterize, in atomic detail, the structural features that determine the rigidity/flexibility of a protein structure.

Folding pathways for two pairs of thermophilic and mesophilic proteins from classes A and D were obtained with the Monte Carlo method. Fifty folding trajectories were simulated and analyzed for each protein. The analysis revealed that the folding pathways of thermophilic and mesophilic proteins of class A (all-α proteins; six α-helices in our proteins) are different. The folding pathways of the mesophilic protein are more heterogeneous than those of the thermophilic one. The first passage time (the time during which half of the trajectories become folded) for mesophilic proteins is twice that for thermophilic proteins. For thermophilic proteins, in all cases except one the folding pathway is as follows: the first and second α-helices (the N-terminus) are formed first, then half of the third α-helix, and finally the fourth, fifth and sixth α-helices (the C-terminus). For mesophilic proteins, in 19 cases the two middle α-helices (the third and fourth) are formed first, then the fifth and sixth α-helices (the C-terminus), and finally the first and second α-helices (the N-terminus). In 10 cases the four α-helices at the C-terminus are formed first, followed by the first two α-helices at the N-terminus. The other folding pathways cannot be clearly separated. Thus, in thermophilic proteins the N-terminus is formed first, whereas in mesophilic proteins it is, in most cases, the C-terminus.

Folding of thermophilic and mesophilic proteins of class D (α+β proteins) follows the same pathways, and the first passage time is practically the same for both proteins. A frequently observed folding pathway is the following: the first and second α-helices are formed first, followed by the other elements of secondary structure (in 37 and 45 trajectories for thermophilic and mesophilic proteins, respectively). A rarely observed folding pathway is as follows: the β-sheet, which consists of three β-strands, and the fourth α-helix are formed first, followed by the first and second α-helices (in 3 trajectories for thermophilic and mesophilic proteins). The other folding pathways cannot be clearly separated.

This work was supported by the programs "Molecular and Cellular Biology" and "Fundamental Sciences to Medicine", by the Russian Foundation for Basic Research (08-04-00561), and by the "Russian Science Support Foundation".

 1 Institute of Mathematical Problems of Biology, Russian Academy of Sciences, [email protected] 2 Chemistry Department, Carnegie Mellon University, Pittsburgh, PA 15213 3 Institute of Protein Research RAS, Pushchino, 142290, Russia, [email protected]
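The first passage time used in this abstract, the time by which half of the simulated trajectories have folded, can be computed from per-trajectory folding times as in the following sketch. The input format (one folding time per trajectory, `None` for a trajectory that never folded within the simulation) and the function name are assumptions made for illustration.

```python
import math
from typing import Optional, Sequence


def half_folding_time(times: Sequence[Optional[float]]) -> Optional[float]:
    """Smallest time by which at least half of all trajectories have folded.

    times : folding time of each trajectory, or None if it did not fold.
    Returns None when fewer than half of the trajectories folded at all.
    """
    n = len(times)
    if n == 0:
        raise ValueError("at least one trajectory is required")
    folded = sorted(t for t in times if t is not None)
    needed = math.ceil(n / 2)  # number of trajectories that must have folded
    if len(folded) < needed:
        return None
    return folded[needed - 1]
```

With fifty trajectories per protein, as in the abstract, this is the 25th smallest folding time.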


A NOVEL APPROACH TO STRUCTURAL ALIGNMENT OF PROTEINS BASED ON ENERGY LANDSCAPES CALCULATION MAXIM GODSIE 1, IGOR OFERKIN 1, PAVEL IVANOV 1

Traditional methods of structural alignment of proteins have several drawbacks. One of them is not taking into account the energy-conformation interdependences in proteins. Another one is the ignorance of differences between native protein structures and those deposited in databases, e.g. Protein Data Bank, that is ususally used as an alignment input. Such a difference is proved to be very significant because of specific experimental conditions under which protein structures are determined by the NMR techniques or X-ray crystallography [1]. Therefore, the results of any modern structural alignment algorithm might not be accurate as soon as any deformations in each input structure are not allowed inside the algorithm while the input structures seem to be already deformed. We present a fundamentally new approach to protein structural alignment designed to determine the differences between two protein structures previously conformed to a set of native-like states. The latter are determined as local minima of function F = a*(E1+E2)+b*RMSD (1) obtained by numerical optimization with a set of different initial conditions. Here, a and b are weighting coefficients and RMSD is calculated within pairs of atoms preliminary aligned by sequence alignment. In the above function, energies E1 and E2 are the total energies of two proteins to be aligned that are calculated using MMFF94 Force Field [2] separately for each protein. As soon as the set of function (1) minima is found, the algorithm suggests several strategies of alignment score computation that correspond to various methods of taking into account structural differences and residual deformations in 'structurally optimized' protein states. We implemented our algorithm in software for an HPC environment. In this software, total protein energy computations are based on MMFF94 Force Field modified by authors to work with metal-binding proteins. 
To validate our algorithm, we used the MMFF94 Validation Suite and the obenergy module from the OpenBabel software. For several test structures, our results were identical to those provided by the aforementioned packages, while our software was four times faster. As output, our algorithm provides the differences between the input protein structures that do not depend on the conformational changes the proteins undergo on their paths from their native states to the structures deposited in the PDB, or on any other sort of deformation except denaturation. The results of algorithm validation on globins, as well as on proteins from thermophilic and mesophilic organisms [3], will be presented.

1 Moscow State University, Russian Federation, [email protected]

1. Michael Andrec, D. A. S., Zhiyong Zhou, Jasmine Young, Gaetano T. Montelione, and Ronald M. Levy (2007). Proteins 69: 449–465.
2. Halgren, T. A. (1999). J. Comp. Chem. 20: 730–748.
3. Anna V. Glyakina, Sergiy O. Garbuzynskiy, Michail Yu. Lobanov, and Oxana V. Galzitskaya (2007). Struct. Bioinformatics 23: 2231–2238.


INFERRING GENE EVOLUTION ALONG A SPECIES TREE K. GORBUNOV 1, V. LYUBETSKY 1

First task: given a gene tree G and a species tree S, construct a scenario G' (which uniquely defines the optimal mapping of G into S) with inferred gene evolution events (non-speciation duplications, losses and horizontal transfers) assigned to the branches of S. Such a scenario is defined as an inner tree G' whose branches are contained within “tubes” (the branches of tree S) and are allowed to transfer between tubes (HGT events) within the same temporal slice. After the branches of G' corresponding to gene losses are removed, the tree becomes isomorphic to the initial G but provides more evolutionary information than G or the standard mapping α of G into S [1]. A fast algorithm to construct G' has been developed, which minimizes a weighted sum of the numbers of gene evolution events. A fast algorithm has also been developed to construct temporal slices in S; it accounts for the branch lengths of the input gene trees used to build tree S. The definitions and algorithms are described in detail in [1].

Second task: gene tree G and scenario G' are constructed simultaneously for a given species tree S. A fast algorithm has been developed for this task. First, a set M is defined to contain sets corresponding to putative clades in G: if some set P is already contained in M, then for any gene g whose information content in P exceeds a certain threshold, the set P+g is included in M. The information content is estimated as in [2]. The algorithm then constructs an optimal pair (G, G') that minimizes the cost of mapping G into S, where all clades in G are contained in M (conditional optimization). The cost accounts for the events of gene evolution (duplications, losses, HGTs) and of sequence evolution (substitutions and indels). The innovation here is the separate construction of putative clades (sets from M) and of tree G, the latter being inferred simultaneously with scenario G'.

Third task: given a set of gene trees, a “supertree” S is constructed by minimizing the total cost of scenarios, where the clades in S are taken from among the precomputed sets in M. A fast algorithm has been developed to compute the set M of sets of candidate clades in S and then to construct the optimal S by minimizing the total cost of scenarios within tree S, in which all clades are contained in M (conditional optimization). All algorithms produce optimal and suboptimal solutions and have computer implementations. Numerous applications of the algorithms showed a statistically significant improvement in the cost of mapping the resulting gene tree into the species tree. The assumed costs were 2 for one loss, 3 for one duplication and 11 for one HGT.

1 Institute for Information Transmission Problems (Kharkevich Institute), RAS, Bolshoy Karetnyi lane, 19, 127994, Moscow, Russia, [email protected], [email protected]

Examples.
1) COG0012, 41 sequences. The algorithm selected 10158 sets as putative clades. The cost of the inferred scenario (gene evolution events only) was 105, lower than the 138 obtained with the algorithm from [1]. This adds credit to the clades inferred in [1]; the current COG tree better describes gene evolution.
2) COG0272, 31 sequences. The algorithm selected 4842 sets as putative clades. The cost of the inferred scenario (gene evolution events only) was 115, higher than the 83 obtained with the algorithm from [1]. This suggests that some clades were unreliable in the gene tree used in [1].
3) COG0180, 43 sequences. The algorithm selected 2990 sets as putative clades. The cost of the inferred scenario (gene evolution events only) was 110, lower than the 170 obtained previously in [1]. This adds credit to the clades in [1]; the current COG tree better describes gene evolution.
Large-scale comparisons of the algorithms’ performance between this study and [1] demonstrated an improvement in the mapping cost (as in examples 1 and 3) in 80% of cases and suggested that some incorrect clades had previously been inferred in COG trees (as in example 2) in 20% of cases.
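The scenario cost used in these comparisons is a weighted sum of event counts, with the costs quoted above (2 per loss, 3 per duplication, 11 per HGT). A minimal Python sketch of that arithmetic follows; the function name, the dictionary layout and the event counts are hypothetical and are not taken from the authors' implementation or from any of the COG examples.

```python
# Event costs quoted in the abstract: 2 per loss, 3 per duplication, 11 per HGT.
EVENT_COSTS = {"loss": 2, "duplication": 3, "hgt": 11}

def scenario_cost(event_counts, costs=EVENT_COSTS):
    """Weighted sum of gene evolution events in one scenario."""
    unknown = set(event_counts) - set(costs)
    if unknown:
        raise KeyError(f"no cost defined for: {sorted(unknown)}")
    return sum(costs[kind] * n for kind, n in event_counts.items())

# Hypothetical event counts, chosen only to illustrate the arithmetic.
example = {"loss": 20, "duplication": 10, "hgt": 5}
total = scenario_cost(example)  # 20*2 + 10*3 + 5*11
```

With these costs a single HGT is more expensive than one duplication followed by several losses, so the weights directly control which kind of explanation the optimization prefers.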

The authors are grateful to L. Rusin for discussions.

1. K.Yu. Gorbunov, V.A. Lyubetsky (2009). Inferring gene evolution along a species tree. Molecular Biology, to appear.
2. K.Yu. Gorbunov, V.A. Lyubetsky (2005). Searching for ancestral genes that introduce incongruence between gene and species trees. Molecular Biology, 39(5): 847–858.


MODE OF STOP CODON RESTRICTION BY THE EUPLOTES ERF1 TRANSLATION TERMINATION FACTOR EVGENY GORDIENKO 1, BORIS ELISEEV 1, ELENA ALKALAEVA 1, LUDMILA FROLOVA 1

Keywords: eRF1, translation termination, ciliates, Euplotes

In universal-code eukaryotes, a single translation termination factor, eRF1, decodes the three stop codons UAA, UAG and UGA. In some ciliates, such as Euplotes and Blepharisma, eRF1s exhibit UAR-only decoding specificity, while UGA is reassigned as a sense codon. Since variant-code ciliates may have evolved from one or more universal-code ancestors, structural features should exist in ciliate eRF1s that restrict their stop codon recognition. In omnipotent eRF1s, stop codon recognition is associated with the amino-terminal domain of the protein [1]. Using an in vitro assay, we show here that chimeric molecules composed of the N-terminal domain of Euplotes eRF1 fused to the core domain (MC domain) of human eRF1 retain specificity towards UAA and UAG; this unambiguously associates eRF1 stop codon specificity with the nature of its N-terminal domain. Functional analysis of eRF1 chimeras constructed by swapping ciliate N-terminal domain sequences with the matching ones from the human protein highlighted the crucial role of the α3-helix region in restricting Euplotes specificity to UAR. By applying the same chimera approach to Stylonychia and Paramecium eRF1s, two restriction patterns for the UGA-only response were found [2], which differ profoundly from the Euplotes case. Our results provide insight into the mechanism of stop codon decoding by the N-terminal domain of eRF1.

This work was supported by Russian Foundation for Basic Research Grant 08-04-01091а (to E.A.).

1. H. Song et al. (2000) The crystal structure of human eukaryotic release factor eRF1: mechanism of stop codon recognition and peptidyl-tRNA hydrolysis, Cell, 100:311–21.
2. S. Lekomtsev et al. (2007) Different modes of stop codon restriction by the Stylonychia and Paramecium eRF1 translation termination factors, Proc Natl Acad Sci U S A, 104:10824–29.

1 Engelhardt Institute of Molecular Biology RAS, Russian Federation, [email protected], [email protected], [email protected], [email protected]

BIOINFORMATICS ANALYSIS OF LAGLIDADG HOMING ENDONUCLEASES FOR CONSTRUCTION OF ENZYMES WITH CHANGED DNA RECOGNITION SPECIFICITY ALEXANDER GRISHIN 1, INES FONFARA 2, WOLFGANG WENDE 2, DANIIL ALEXEYEVSKY 3, ANDREI ALEXEEVSKI 3,4, SERGEI SPIRIN 3,4, OLGA ZANEGINA 3, ANNA KARYAGINA 5

Keywords: homing endonuclease, genome engineering, molecular design

Homing endonucleases are rare-cutting enzymes with long DNA target sites (14-40 bp), encoded in group I, group II and archaeal introns and in inteins. These enzymes use the host DNA-repair machinery to promote the propagation of the sequences that encode them in vivo. Because of their extreme specificity, homing endonucleases have been proposed as a powerful tool for genome engineering. Despite the large number of characterized homing endonucleases, the range of cleaved sequences is still not sufficient for effective use in various genome engineering projects. Methods for designing novel homing endonucleases with predefined specificity are therefore needed [1]. Of the five families of homing endonucleases, the LAGLIDADG family is the most abundant and the most extensively studied. Because of its unique structural properties, this family also appears to be the most suitable for redesign. The catalytically active unit of these enzymes consists of two conserved LAGLIDADG domains, each recognizing one DNA half-site. For this reason, and because the interdomain interface of these enzymes is well conserved, one possible approach to novel endonuclease design is to combine LAGLIDADG domains from different LAGLIDADG nucleases [2, 3]. We performed a comprehensive analysis of the available sequences and 3D structures, which provides information that can aid the development of approaches for combining LAGLIDADG domains from different LAGLIDADG endonucleases and creating novel homing endonucleases with predefined specificity. By superimposing the 3D structures of 12 LAGLIDADG homing endonucleases, we confirmed that the intersubunit interface is the most conserved part of the enzyme structure. All interacting interfacial residues were identified. Eight of them interact through their side chains in all structures and are therefore the most promising for chimeric protein design. By combining a structure-based alignment with Pfam alignments of subfamilies, we were able to create a reliable alignment of, at least, the alpha-helices involved in dimerization for the 467 available sequences of LAGLIDADG endonucleases. As a result, we identified the eight interacting residues in the alignment. By weighting the sequences, we found the most represented patterns of interfacial residues. These patterns are considered promising for constructing endonucleases with new specificity. To predict the most promising residue substitutions for the dimerization of two given LAGLIDADG subunits, we have developed a computer program, fitprot. The program enumerates substitutions of prescribed amino acid residues by rotamers from a rotamer library and selects those with the highest score, which roughly reflects the free energy of interaction. The selected patterns are to be characterized in more detail, including by molecular dynamics simulation. Preliminary testing of fitprot gave reasonable results: the native pattern of amino acid residues occurs near the top of the list of patterns. Experimental testing of the best computer predictions is planned.

1 All-Russia Research Institute for Agricultural Biotechnology, Moscow, Russia; [email protected]
2 Institut für Biochemie, Justus-Liebig Universität Gießen, Germany
3 Belozersky Institute for Physical-Chemical Biology, Moscow State University, Moscow
4 Scientific Research Institute for System Studies (NIISI RAS), Moscow
5 Gamaleya Institute of Epidemiology and Microbiology, Moscow
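The step of finding the most represented patterns of interfacial residues in a weighted alignment can be pictured with the following Python sketch. It is not the authors' procedure: the weighting scheme is not described in the abstract, so the weights are simply taken as given, and the function name, the toy sequences and the column indices are invented for illustration.

```python
from collections import defaultdict

def weighted_patterns(aligned_seqs, columns, weights):
    """Sum sequence weights for each residue pattern at the given alignment columns."""
    if len(aligned_seqs) != len(weights):
        raise ValueError("one weight per sequence is required")
    totals = defaultdict(float)
    for seq, w in zip(aligned_seqs, weights):
        pattern = "".join(seq[c] for c in columns)
        totals[pattern] += w
    # Most represented patterns first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Toy alignment, interface columns and weights, invented for illustration only.
seqs = ["LAGLIDADG", "LAGFIDGDG", "LAGLVDADG", "LAGFIDGDG"]
cols = [3, 4, 6]
wts = [1.0, 0.75, 0.5, 0.75]
ranked = weighted_patterns(seqs, cols, wts)
```

In the study itself the pattern would span the eight interacting interface positions; three columns are used here only to keep the example short.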

This work is partially supported by RFBR-DFG grant 08-04-91975.

1. B. L. Stoddard (2005), Homing endonucleases structure and function, Quarterly Reviews of Biophysics, 1–47.
2. B. S. Chevalier et al. (2002), Design, activity and structure of a highly specific artificial endonuclease, Molecular Cell, 10: 895–905.
3. G. H. Silva et al. (2006), From monomeric to homodimeric endonucleases and back: engineering novel specificity of LAGLIDADG enzymes, J. Mol. Biol., 361: 744–754.


A COMPARATIVE ASSESSMENT OF METHODS FOR RECOGNITION OF BINDING SITES IN PROTEINS CONCETTINA GUERRA 1

The prediction of interactions of proteins with ligands is a major task in biology that is relevant for function assignment and drug design. The experimental in vitro determination of interactions is expensive and time consuming, so computational prediction techniques can complement experimental ones. When a novel protein with unknown function is discovered, bioinformatics tools are used to screen huge datasets of proteins with known function and binding sites, searching for a candidate binding site in the new protein. More specifically, if a surface region of the novel protein is similar to the binding site of another protein with known function, the function of the novel protein can be inferred and its molecular interactions predicted. Much work has been done on the analysis of protein binding sites and their identification using various approaches [5,7]. We developed a suite of methodologies for protein-ligand binding site recognition, based on a representation of proteins by a collection of spin-images [2,3]. We have recently made the programs available on the web at http://bcb.dei.unipd.it/MolLoc/ [1]. Here we present a large-scale computational comparison of existing methods for binding site comparison and recognition, with the goal of identifying inaccuracies and outlining possible drawbacks. While a comprehensive evaluation of protein structure alignment methods is available that uses the SCOP classification of proteins as a gold standard [6], no such effort has been made for binding site recognition. One reason may be the lack of a gold standard against which to evaluate the results of the matching. Each method proposed in the literature uses its own native score, which cannot easily be exported and computed for the other methods. In this work we evaluate three existing methods based on protein surface descriptors that differ in accuracy and computational complexity.
Specifically, we compare spin images, spherical harmonics [4] and context shapes. We base the evaluation on the geometric measures SI, MI and SAS [6], which combine the number of matched atoms and the RMSD of the matched atoms into a single expression, thus overcoming the problem of the trade-off between the two values. Furthermore, we consider the quality of the solution in terms of the physico-chemical properties of the contacts with the ligands. We show that the additional computational complexity of handling context shapes is not justified by the increase in accuracy. Furthermore, while spherical harmonics and spin images tend to have similar performance, spin images require longer computation times.

1 University of Padova, via Gradenigo 6/a, 35131 Padova, Italy; Georgia Tech, 5th st. 30332, Atlanta, USA; [email protected], [email protected]
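For orientation, the three geometric measures can be written down in a few lines of Python. The formulas below follow the definitions usually quoted from [6] (with the normalization constant w0 = 1.5 Å) and should be checked against that reference before use; the function and parameter names are assumptions of this sketch, and how the sizes of the two compared sites are counted in this study is not stated in the abstract.

```python
def sas(rmsd, n_matched):
    """Structural Alignment Score: RMSD scaled to 100 matched atoms (lower is better)."""
    return rmsd * 100.0 / n_matched

def si(rmsd, n_matched, size_1, size_2):
    """Similarity Index: RMSD scaled by the smaller structure size (lower is better)."""
    return rmsd * min(size_1, size_2) / n_matched

def mi(rmsd, n_matched, size_1, size_2, w0=1.5):
    """Match Index, written as 1 - MI' so that lower is better; w0 is assumed to be 1.5 A."""
    return 1.0 - (1.0 + n_matched) / ((1.0 + rmsd / w0) * (1.0 + min(size_1, size_2)))

# Hypothetical match: 50 atoms matched at 2.0 A RMSD between sites of 100 and 80 atoms.
scores = (sas(2.0, 50), si(2.0, 50, 100, 80), mi(2.0, 50, 100, 80))
```

All three scores decrease when more atoms are matched at the same RMSD, which is how they resolve the trade-off between the two quantities.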

Acknowledgements: Funding was provided by Progetto di Ateneo Università di Padova and Progetto Cariparo.

1. S. Angaran, M.E. Bock, C. Garutti, G. Guerra (2009) MolLoc: a Web Tool for the Local Alignment of Molecular Surfaces, Nucleic Acids Research (submitted).
2. M.E. Bock, C. Garutti, G. Guerra (2008) Cavity Detection and Binding Site Recognition in Proteins, Theoretical Computer Science, doi:10.1016/j.tcs.2008.08.018.
3. M.E. Bock, C. Garutti, G. Guerra (2007) Discovery of Similar Regions on Protein Surfaces, J. Computational Biology, 233: 387–406.
4. M. Comin, C. Guerra, F. Dellaert (2009) Binding Balls: Fast detection of Binding Sites using a property of Spherical Fourier Transform, J. Computational Biology (submitted).
5. Glaser, F., Morris, R.J., Najmanovich, R.J., Laskowski, R.A., and J.M. Thornton (2006) A Method for Localizing Ligand Binding Pockets in Protein Structures. Proteins: Struct. Funct. Bioinf. 62, 479–488.
6. R. Kolodny, P. Koehl, M. Levitt (2005) Comprehensive Evaluation of Protein Structure Alignment Methods: Scoring by Geometric Measures, J. Mol. Biology. 346, 1173–1188.
7. Shulman-Peleg, A., Nussinov, R., and Wolfson, H.J. (2004). Recognition of Functional Sites in Protein Structures. J. Mol. Biology. 339, 607–633.


COMPARATIVE GENOMICS AND EVOLUTIONARY ACCOUNT OF GPI ANCHORED PROTEINS: AN IN SILICO STUDY ASHUTOSH MANI 1, DWIJENDRA K. GUPTA 1

Glycosylphosphatidylinositol (GPI)-anchored proteins occur on the exoplasmic cell surface as protozoal antigens, adhesion molecules and mammalian antigens, and are involved in significant cellular functions such as dense packing of proteins on the cell surface, increased protein mobility on the cell surface, specific release from the cell surface, control of exit from the ER and toxin binding. Mutations in these proteins lead to paroxysmal nocturnal hemoglobinuria and other disorders. This study combined comparative proteomics and phylogenetic approaches to address the cross-family evolution of GPI-anchored proteins from 23 species. The results revealed interesting specifics about conserved domains across different taxa.

Introduction. GPI-anchored proteins carry a phosphoinositol-based glycolipid attached to the C-terminus during post-translational modification [1, 3]. GPI proteins have been found in a wide variety of eukaryotes: mammals (45 in humans), chickens (10), fish, rays, sea urchin, fruit flies (5), silk moth, ticks, grasshopper, protozoa (trypanosomes, leishmania, paramecium), fungi, slime mold, unicellular green alga, mung bean and even herpes virus (simian surface glycoprotein), but not in bacteria, and, oddly, none have been reported from nematodes (out of 1208 proteins). A GPI anchor implies a signal peptide, but the converse does not hold. There is a division between O- and N-glycosylation. The identified functions are clearly appropriate to the extra-cytoplasmic location; GPI proteins are over-represented in neurons. At least one human disease, paroxysmal nocturnal hemoglobinuria, results from defective GPI anchor addition to plasma membrane proteins.

Materials and methods. GPI-anchored protein family members were searched for using the blastp program against the protein database at NCBI. The Homo sapiens GPI anchor protein was used as the query. For pairwise and multiple alignments the gap opening penalty was -7 and the gap extension penalty was -1. A BLOSUM weight matrix was used for substitution scoring. Manual editing was performed in BioEdit [2]. The evolutionary history was inferred using the Neighbour-Joining method. The bootstrap consensus tree inferred from 10000 replicates was taken to represent the evolutionary history of the taxa analyzed. There were a total of 167 positions in the final dataset of GPI anchor proteins. Phylogenetic analyses were conducted in MEGA4 [4].

Results and Discussion. A conserved region search yielded four regions. An entropy plot was generated for all aligned positions. Positions beyond 500 show little conservation. A hydrophobicity profile plot shows that the mean hydrophobicity of the protein is below zero at most positions in all species, only occasionally becoming positive. The profile also shows that the regions corresponding to conserved positions have a balanced residue composition, with the profile staying close to zero. In the phylogenetic trees, the organisms are grouped according to their GPI anchor proteins. Arabidopsis thaliana, being a plant species, appears on a branch completely diverged from the main tree, with a bootstrap support of 69. The node for the fungal species (Aspergillus fumigatus, Neosartorya fischeri, Candida albicans, Sclerotinia sclerotiorum, Saccharomyces cerevisiae and Schizosaccharomyces pombe) has a bootstrap support of 92. The node for arthropods (Drosophila melanogaster, Drosophila pseudoobscura and Tribolium castaneum) has a bootstrap support of 89. The node for mammals has a high bootstrap support of 99.

1 Center of Bioinformatics, Institute of Interdisciplinary Studies, University of Allahabad, India, [email protected]
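An entropy plot of the kind mentioned in the results can be computed per alignment column as Shannon entropy. The Python sketch below is only an illustration: the natural logarithm, the treatment of gap characters as ordinary symbols and the toy alignment are assumptions, and the sketch is not a description of how BioEdit produces its plot.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (in nats) of the residues in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log(n / total) for n in counts.values())

def entropy_profile(alignment):
    """Entropy for every column of an alignment given as equal-length strings."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return [column_entropy(col) for col in zip(*alignment)]

# Toy alignment invented for illustration; gaps would be counted as a symbol.
toy = ["MKVL", "MKIL", "MRVL", "MKAL"]
profile = entropy_profile(toy)
```

Fully conserved columns give an entropy of zero, and the value grows as more residue types share a column, so low stretches in the profile mark conserved regions.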

1. Birgit Eisenhaber, Sebastian Maurer-Stroh, Maria Novatchkova, Georg Schneider, Frank Eisenhaber. “Enzymes and auxiliary factors for GPI lipid anchor biosynthesis and post-translational transfer to proteins”. Research Institute of Molecular Pathology, Vienna, Republic Austria. 25: 367–385, 2003. © 2003 Wiley Periodicals, Inc. (web link http://mendel.imp.ac.at)
2. Hall, T.A. (1999). BioEdit: a user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT. Nucl. Acids. Symp. Ser. 41, 95–98.
3. Janes, P.W., Ley, S.C., Magee, A.I., and Kabouridis, P.S. (2000). Semin. Immunol. 12, 23–34.
4. Tamura, K., Dudley, J., Nei, M. & Kumar, S. (2007). MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Molecular Biology and Evolution 24, 1596–1599.


IN-SILICO SEQUENCE ANALYSIS, FUNCTIONAL AND EVOLUTIONARY CHARACTERIZATION OF A NOVEL COLD SHOCK DOMAIN PROTEIN FROM INDIAN ERI SILKWORM, PHILOSAMIA RICINI ASHUTOSH MANI 1, PRAMOD K YADAVA 2, DWIJENDRA K. GUPTA 1

We have cloned and sequenced the first cDNA coding for a Y-box protein, a member of the cold shock domain proteins, from Philosamia ricini mRNA and predicted its amino acid sequence. On the basis of the deduced amino acid sequence and phylogenetic analysis, we confirm that the protein belongs to the cold shock domain protein family. A motif search of the deduced amino acid sequence showed the presence of an N-terminal domain, a cold shock domain and a C-terminal domain.

Introduction. The Y-box proteins are the most evolutionarily conserved nucleic acid-binding proteins described so far and occur in bacteria, plants and animals. All vertebrate Y-box proteins contain a variable N-terminal domain, a cold shock domain (CSD) and a C-terminal tail domain. The CSD is a highly conserved nucleic acid-binding domain that confers RNA-binding and single- and double-stranded DNA-binding activities on the Y-box proteins. The eukaryotic Y-box proteins were originally identified through their ability to interact with DNA containing a reverse CCAAT box, the Y-box sequence CTGATTGGCCAA [1]. They have been implicated in various cellular processes, including adaptation to low temperatures, cellular growth, nutrient stress and stationary phase [2]. The discovery of the CSD, a domain that shows strikingly high homology and similar RNA-binding properties to cold shock proteins (CSPs), in a growing number of eukaryotic nucleic-acid-binding proteins suggests that these proteins have an ancient origin.

Materials and Methods. Silk glands were isolated from fifth-instar larvae of the Eri silkworm, Philosamia ricini, a multivoltine lepidopteran insect of considerable economic importance to the Indian silk industry. Total RNA was isolated from the silk glands. Reverse transcription was performed using random hexamer primers. Gene-specific primers, designed on the basis of consensus sequences obtained from a multiple sequence alignment of available insect CSDPs, were used for amplification of the target cDNA. Sequencing was performed on an Applied Biosystems sequencer. Manual editing was performed in BioEdit [3]. The DNA sequence was translated into protein sequence in all six reading frames. The homology-based sequence search was performed with the NCBI blastp program. Functional characterization was done using a homology-based approach. CSD sequences from the CSDPs of Bombyx mori (gi|112982792), Anopheles gambiae (gi|158295359), Tribolium castaneum (gi|91081963), Drosophila simulans (gi|195589720), Drosophila melanogaster (gi|24663131), Aedes aegypti (gi|157118310), Pediculus humanus corporis (gi|212508401) and Schistosoma japonicum (gi|56754985), retrieved from NCBI Entrez, were used for evolutionary characterization of the novel sequence. The phylogenetic tree was constructed using the Neighbour-Joining method. Evolutionary analyses were conducted using MEGA 4.0 [4].

Results and Discussion. The cloned cDNA is 795 bp long and codes for a 265-amino-acid Y-box protein, including a 71-residue cold shock domain. Most invertebrate cold shock domain proteins carry a domain of the same length. Codon usage frequency was quite similar to that of Bombyx mori and Drosophila melanogaster. The deduced amino acid sequence of the YBP gene of P. ricini showed about 80% identity to the homolog in Bombyx mori. The analysis further revealed that the cold shock domain is rich in hydrophobic residues and has close homology with CSDs that have RNA-binding properties. We therefore consider that the domain also has RNA-binding properties and functions as a transcription factor during protein synthesis. The multiple sequence alignment showed high homology among the CSDs, suggesting that they are highly conserved. The phylogenetic tree was well supported by high bootstrap values and revealed a common origin of all insect CSD protein family members.

1 Center of Bioinformatics, University of Allahabad, Allahabad-211002 (India), e-mail: [email protected]
2 Applied Molecular Biology Laboratory, School of Life Sciences, Jawaharlal Nehru University, New Delhi
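Translation in all six reading frames, as used in the methods, can be sketched in Python as follows. This is a generic illustration using the standard genetic code, not the tool the authors used; the function names, the frame labels and the short test sequence are assumptions of the sketch.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (first base slowest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(dna):
    """Reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(c, "X") for c in codons)

def six_frame_translation(dna):
    """Return translations for frames +1..+3 and -1..-3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

# Short made-up sequence, used only to show the output layout.
frames = six_frame_translation("ATGGCCTAA")
```

In practice the frame containing the longest open reading frame, or the one whose product finds homologs in a blastp search, is taken as the coding frame.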

Supported by BIF Grant from Department of Biotechnology, Govt of India

1. Didier, D.K., J. Schiffenbauer, S. Woulfe, M. Zacheis, and B.D. Schwartz (1988). Characterization of the cDNA encoding a protein binding to the major histocompatibility class II Y box. Proc. Natl. Acad. Sci. USA 85, 7332-7326.
2. Peter L. Graumann and Mohamed A. Marahiel (1998). A superfamily of proteins that contain the cold-shock domain. Trends Biochem Sci. 8, 286-90.
3. Hall, T.A. (1999). BioEdit: a user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT. Nucl. Acids. Symp. Ser. 41, 95-98.
4. Tamura, K., Dudley, J., Nei, M. & Kumar, S. (2007). MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Molecular Biology and Evolution 24, 1596-1599.


PREDICTION OF GENOME-WIDE FUNCTIONAL LINKAGES IN MYCOBACTERIUM TUBERCULOSIS USING GENOME CONTEXT METHODS AND GENE EXPRESSION DATA CHANDRANI DAS 1, SHUBHADA HEGDE 1, SHEKHAR MANDE 1

The increased rate of tuberculosis infection and a fatality rate of ~23% have made it necessary to search for new ways to prevent this chronic disease. This work attempts to infer genome-wide functional linkages in Mycobacterium tuberculosis. A Support Vector Machine (SVM), a machine learning algorithm, was used for the predictions. The feature values were obtained from genome context methods, such as phylogenetic profiles and gene distance, and from gene expression correlations. The positive set of interacting protein pairs was obtained by the bidirectional best hit method from interacting protein pairs of Escherichia coli. The negative datasets were generated on the basis of differential localization of proteins in the cell. We predict 62,253 binary interactions among 2,884 M. tuberculosis proteins with an accuracy of 89%. The protein interaction network has a degree exponent of 1.15, showing scale-free behavior. We hope that this resource will be helpful for systems-level analysis of M. tuberculosis and for the identification of potential drug targets.
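The construction of the positive set by bidirectional best hits can be illustrated with a short Python sketch. It assumes that the best hit of every protein in the other proteome has already been determined (for example from a sequence similarity search); the function names and all protein identifiers below are hypothetical placeholders rather than real gene names.

```python
def bidirectional_best_hits(best_a_to_b, best_b_to_a):
    """Return {a: b} for pairs of proteins that are each other's best hit."""
    return {a: b for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a}

def transfer_interactions(interactions, orthologs):
    """Map interacting pairs of the source organism onto their orthologs in the target."""
    mapped = set()
    for p, q in interactions:
        if p in orthologs and q in orthologs:
            mapped.add(frozenset((orthologs[p], orthologs[q])))
    return mapped

# Hypothetical best-hit tables and source interactions (placeholder identifiers).
eco_to_mtb = {"ecoA": "mtb1", "ecoB": "mtb2", "ecoC": "mtb2"}
mtb_to_eco = {"mtb1": "ecoA", "mtb2": "ecoB"}
orthologs = bidirectional_best_hits(eco_to_mtb, mtb_to_eco)
positives = transfer_interactions([("ecoA", "ecoB"), ("ecoA", "ecoC")], orthologs)
```

Requiring the hit to be reciprocal discards one-directional matches such as ecoC above, which keeps the transferred positive pairs conservative.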

1 Centre for DNA Fingerprinting and Diagnostics, Hyderabad, India, [email protected]

DEPENDENCE BETWEEN EXON, INTRON LENGTH AND NUCLEOTIDE CONTENT OF GENES IN HUMAN AND PROTIST GENOMES ANATOLIY IVACHSHENKO 1, ANEL KABDULLINA 1, VLADIMIR KHAILENKO 1, SHARA ATAMBAYEVA 1

Keywords: exon, intron, gene, genome, nucleotide content

Number of higher and lower eukaryotic genomes contain genes with exon- intron structure. Since the time introns were discovered in genes their many properties have been revealed, however mutual dependence between exons and introns remains unclear. Dependence of exon length on intron number and also intron number on total exon lengths was established. In some genomes dependence between intron and exon lengths was revealed, correlation was both positive and negative. In human genes proportion of exon and intron lengths depends on gene density in DNA. In region of DNA with high gene density (30 genes/Mbp) intron length is about 12 times longer than exon length, and in genes from region of DNA with low density (4 genes/Mbp) it is 60 times longer. We have determined that fC/fG-fA/fT (∆fN) value is negative or positive in exons for genes coding hydrophilic and hydrophobic proteins, respectively. Introns influence on nucleotide content of genes is such that absolute value ∆fNgn of genes approaches zero. Genes with 1-2 introns, in region of DNA with 30 genes/Mbp and coding hydrophilic proteins, reveal absolute value ∆fNgn lower by 79-93% than ∆fNex of exons, and in genes with 3-5 introns ∆fNgn value equals zero. At great intron number in genes ∆fNgn value takes positive. This tendency of ∆fNgn value variability for genes coding hydrophilic proteins was identical to genes in region with low, medial and high density of genes. In intron-containing genes coding hydrophobic proteins, positive ∆fNgn value was lower in comparison with ∆fNex of exons by 36-44%. Absolute ∆fNgn value in genes with introns is smaller than in exons of protist genomes (Plasmodium falciparum, Theileria parva, Dictyostelium discoideum). In genes with 1-2 introns absolute ∆fNgn value was smaller than ∆fNex in exons by 24, 23 and 27%, and in genes with 6-9 introns ∆fNgn was smaller than ∆fNex of exons by 48, 43 and 46% in P. falciparum, T. parva, D. discoideum genomes, respectively. 
Note that in P. falciparum genes the average intron length exceeds the average exon length by a factor of 1.2 in the sample of genes with 15 or more introns. In the other samples of genes, and in all samples of T. parva and D. discoideum genes, introns were 2 to 5 times shorter than exons. Since introns reduce the ∆fNgn value, we hypothesized a dependence between total exon length (Lex) multiplied by the ∆fNex value of the exons (Lex•∆fNex) and total intron length (Lin) multiplied by the ∆fNin value of the introns (Lin•∆fNin). In 557 human genes from a region of chromosome 19 with a density of 30 genes/Mbp, the correlation coefficient between Lex•∆fNex and Lin•∆fNin was r=0.46 (p<9e-31). For another group of 777 human genes the relationship was negative: r=-0.53 (p<4e-58). In 376 genes from DNA regions with a density of 4 genes/Mbp (chromosomes 4, 7, 8, 13, 16, 19, 21) the relationship between Lex•∆fNex and Lin•∆fNin was r=0.45 (p<3e-20), and for another group of 767 genes it was r=-0.37 (p<2e-25). A similar dependence between Lex•∆fNex and Lin•∆fNin was revealed in protist genes. In genes of chromosome 14 of P. falciparum the correlation coefficients were r=-0.80 (p<4e-37) and r=0.23 (p<0.005) for two samples of genes. In genes of chromosome 1 of D. discoideum the dependence was characterized by r=-0.44 (p<6e-17) and r=0.55 (p<9e-19). In genes of chromosome 1 of T. parva it was characterized by r=-0.22 (p<0.015) and r=0.55 (p<4e-6). The dependence between Lex•∆fNex and Lin•∆fNin has also been established in several human genes with a large number of introns, which is necessary for representative exon and intron samples. In the FGR2 gene the dependence had the following parameters: r=0.67 (p<0.006); in the BRAF gene r=0.70 (p<0.003); in the KIAA1276 gene r=-0.73 (p<0.0009). Correlation coefficients between Lex•∆fNex and Lin•∆fNin for some genes of chromosome 14 of P. falciparum were: PF14_0385 gene r=0.49 (p<0.06), PF14_0021 gene r=0.51 (p<0.08), PF14_0506 gene r=0.74 (p<0.01). In genes of chromosome 1 of T. parva the correlation coefficients were: TP01_0118 gene r=0.63 (p<0.04), TP01_0515 gene r=-0.74 (p<0.02), TP01_0864 gene r=0.78 (p<0.01), TP01_0923 gene r=0.68 (p<0.02). No significant dependence between Lex•∆fNex and Lin•∆fNin was revealed in genes of chromosome 1 of D. discoideum, as this chromosome did not contain genes with a representative number of introns. The correlation between exon and intron lengths within a gene was lower than the dependence between Lex•∆fNex and Lin•∆fNin. Thus, the length and nucleotide content of exons are connected with the length and nucleotide content of introns.

1 Al-Farabi Kazakh National University, Almaty, Kazakhstan, [email protected]
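The correlation analysis described in this abstract can be illustrated with a minimal Python sketch. It assumes that Lex, ∆fNex, Lin and ∆fNin have already been computed for each gene; the function names and the toy values are hypothetical and are not taken from the study.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)

def weighted_content_correlation(genes):
    """Correlate Lex*dfNex with Lin*dfNin.

    genes: list of (Lex, dfNex, Lin, dfNin) tuples, one tuple per gene.
    """
    exon_terms = [lex * dfn_ex for lex, dfn_ex, _, _ in genes]
    intron_terms = [lin * dfn_in for _, _, lin, dfn_in in genes]
    return pearson_r(exon_terms, intron_terms)

# Made-up illustrative values (hypothetical, not data from the abstract).
toy_genes = [
    (1200, -0.10, 5000, 0.02),
    (900, -0.05, 3000, 0.01),
    (2000, -0.20, 9000, 0.05),
    (1500, 0.08, 4000, -0.03),
]
r = weighted_content_correlation(toy_genes)
print(f"r = {r:.2f}")
```

The sketch reports only the coefficient; the p-values quoted in the abstract would require an additional significance test.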


AN UPDATE OF KINETICDB, THE DATABASE OF PROTEIN FOLDING KINETICS NATALYA BOGATYREVA 1, ALEXANDER OSYPOV 2, DMITRY IVANKOV 3

Keywords: protein folding rates; protein folding kinetics; two-state folding; multi-state folding

The problem of protein folding is one of the most fundamental in molecular biology. In the last decade, the understanding of protein folding processes has resulted in the development of the first crude models of protein folding for cases where the protein 3D structure is known [1–6]. The relevance of protein folding models is often tested by their ability to predict protein folding rates [3–6]. At the same time, a number of empirical and bioinformatic methods have been developed that provide additional information on the determinants of protein folding and allow protein folding rates to be predicted from tertiary, secondary or primary protein structure [7–10]. Prediction of protein folding rates is of special value because aggregation directly depends on the rate of protein folding. We present here an update of KineticDB [11], a systematically compiled database of protein folding kinetics, which now contains about 100 unique proteins. It is necessary to provide a researcher with as much data as possible in a simple and easy-to-use way. At present, KineticDB contains the results of folding kinetics measurements of single-domain proteins and separate protein domains as well as several short peptides. It also includes data on many mutants that have been systematically accumulated over the last 10 years. KineticDB is the largest collection of protein folding kinetic data presented as a database. It is available at http://kineticdb.protres.ru/db/index.pl.

Acknowledgements: This work was supported by the programs "Molecular and cellular biology" and "Fundamental sciences – medicine", by the Russian Foundation for Basic Research (08-04-00561), and by the "Russian Science Support Foundation".

1. O.V.Galzitskaya, A.V.Finkelstein (1999) A theoretical search for folding/unfolding nuclei in three-dimensional protein structures, Proc Natl Acad Sci U S A, 96:11299-11304.
2. E.Alm, D.Baker (1999) Prediction of protein-folding mechanisms from free-energy landscapes derived from native structures, Proc Natl Acad Sci U S A, 96:11305-11310.
3. V.Munoz, W.A.Eaton (1999) A simple model for calculating the kinetics of protein folding from three-dimensional structures, Proc Natl Acad Sci U S A, 96:11311-11316.
4. D.N.Ivankov, A.V.Finkelstein (2001) Theoretical study of a landscape of protein folding-unfolding pathways. Folding rates at midtransition, Biochemistry, 40:9957-9961.
5. E.Alm et al. (2002) Simple physical models connect theory and experiment in protein folding kinetics, J Mol Biol, 322:463-476.
6. S.O.Garbuzynskiy et al. (2004) Outlining folding nuclei in globular proteins, J Mol Biol, 336:509-525.
7. K.W.Plaxco et al. (1998) Contact order, transition state placement and the refolding rates of single domain proteins, J Mol Biol, 277:985-994.
8. D.N.Ivankov et al. (2003) Contact order revisited: influence of protein size on the folding rate, Protein Sci, 12:2057-2062.
9. H.Gong et al. (2003) Local secondary structure content predicts folding rates for simple, two-state proteins, J Mol Biol, 327:1149-1154.
10. D.N.Ivankov, A.V.Finkelstein (2004) Prediction of protein folding rates from the amino acid sequence-predicted secondary structure, Proc Natl Acad Sci U S A, 101:8942-8944.
11. N.S.Bogatyreva et al. (2009) KineticDB: a database of protein folding kinetics, Nucleic Acids Res, 37:D342-D346.

1 Institute of Protein Research, RAS, Russian Federation, [email protected]
2 Institute of Cell Biophysics, RAS, Russian Federation, [email protected]
3 Institute of Protein Research, RAS, Russian Federation, [email protected]


A NEW APPROACH FOR DETECTING TUMOR MARKER GENES FROM MICROARRAY DATASETS USING EVOLUTIONARY ALGORITHM GEORGY GULBEKYAN 1, VALERY VALYAEV 1, PAVEL IVANOV 1

The advent of expression microarray technologies made it possible to study the transcriptome in various types of malignant cells [1,2]. It also allows one to diagnose several subtypes of the same tumor with relatively high precision in order to prescribe appropriate treatment [3]. Unfortunately, the use of high-throughput microarrays for this purpose is costly and time-consuming. Therefore, the problem of detecting a small number of marker genes for reliable partitioning of different subtypes of the same pathology becomes extremely important.

To date, several algorithms addressing this problem have been proposed. Their accuracy in differential diagnostics of tumors varies from 70 to 98 percent [4,5]. They are usually based on traditional supervised data classification or on evolutionary algorithms; in most cases they lack statistical estimates of partition reliability and are hardly applicable to classification into a relatively large number of classes.

We present a new approach to revealing marker genes in multiclass microarray datasets that combines and advances two well-proven approaches, namely supervised classification and evolution simulation. Since most currently available cancer microarray datasets contain expression profiles for tens of thousands of genes (in such a profile for a given gene, one expression value corresponds to a particular patient), a filtration step should precede further analysis. As the most adequate for the problem of marker gene detection, we propose a method of expression profile filtration that, first, approximates gene expression profiles in different classes by beta distribution functions and, second, estimates a gene relevance measure as a multiple convolution of such distributions.

1 M.V.Lomonosov Moscow State University, Russian Federation, [email protected], [email protected]

We select the SVM Leap version of the Support Vector Machine (SVM) technique to partition microarray samples into multiple classes using the genes that remain after filtration. We use Leave-K-out Cross Validation to partition the initial data into training and test datasets and to fit the parameters of the SVM algorithm. After that, by randomizing the initial data we estimate the classification error for a large number of training/test partitions. Then we generate mutations in a randomly chosen set of potential marker genes (a predictor) from the filtered gene set and imitate the simultaneous evolution of several such predictors combined in a predictor pool. At each evolutionary epoch, we retain a predictor in the pool or exclude it from the pool based on its classification power (quality measure). An elitism principle can also be added at this step. Finally, we stop the evolutionary process when changes in classification power become smaller than a chosen threshold or after a predefined number of iterations.
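The evolutionary loop described above can be sketched roughly as follows. This is only an illustration under loud assumptions: a nearest-centroid classifier stands in for the SVM, a single fixed training/test split replaces Leave-K-out Cross Validation, and the stopping rule is simplified to "perfect score or epoch limit". All function names and parameters are hypothetical.

```python
import random

def centroid_accuracy(train, test, genes):
    """Fraction of test samples assigned to the correct class by a
    nearest-centroid rule restricted to the given gene indices.
    (Stand-in for the SVM used in the abstract.)"""
    by_class = {}
    for expr, label in train:
        by_class.setdefault(label, []).append([expr[g] for g in genes])
    centroids = {c: [sum(col) / len(rows) for col in zip(*rows)]
                 for c, rows in by_class.items()}
    hits = 0
    for expr, label in test:
        point = [expr[g] for g in genes]
        pred = min(centroids,
                   key=lambda c: sum((a - b) ** 2
                                     for a, b in zip(point, centroids[c])))
        hits += pred == label
    return hits / len(test)

def mutate(predictor, n_genes, rng):
    """Replace one gene of the predictor with a gene not yet in it.
    Requires n_genes > len(predictor)."""
    child = list(predictor)
    candidates = [g for g in range(n_genes) if g not in child]
    child[rng.randrange(len(child))] = rng.choice(candidates)
    return child

def evolve(train, test, n_genes, pool_size=10, predictor_size=3,
           max_epochs=200, seed=0):
    """Evolve a pool of predictors (lists of gene indices)."""
    rng = random.Random(seed)
    pool = [rng.sample(range(n_genes), predictor_size)
            for _ in range(pool_size)]
    scores = [centroid_accuracy(train, test, p) for p in pool]
    for _ in range(max_epochs):
        for i, parent in enumerate(pool):
            child = mutate(parent, n_genes, rng)
            score = centroid_accuracy(train, test, child)
            if score >= scores[i]:  # keep the mutant only if it is no worse
                pool[i], scores[i] = child, score
        if max(scores) == 1.0:  # simplified stopping rule
            break
    return pool, scores
```

Here each sample is a pair `(expression_values, class_label)`. Accepting mutants with equal scores allows neutral drift; a stricter rule or an elitism step could be substituted without changing the overall structure.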

In most cases, several predictors from the predictor pool reach 100% classification power on a training dataset after hundreds or thousands of evolutionary epochs. We propose two methods to select the best predictor. One approach is to choose the predictor with the minimal number of genes among the predictors with maximal classification power. The other is to rank predictors by their reliability, which we define as the number of genes that can be linearly separated from the corresponding classes in the training dataset, and then to choose the predictor of highest reliability.
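The first selection rule (fewest genes among the predictors with maximal classification power) can be written as a short Python sketch; the function name and the toy pool are hypothetical.

```python
def select_smallest_best(pool, scores):
    """Return the predictor with the fewest genes among those that
    share the maximal classification score."""
    best = max(scores)
    winners = [p for p, s in zip(pool, scores) if s == best]
    return min(winners, key=len)

# Toy pool: two predictors reach the top score, the smaller one wins.
toy_pool = [[1, 2, 3], [4, 5], [6]]
toy_scores = [1.0, 1.0, 0.9]
chosen = select_smallest_best(toy_pool, toy_scores)
```

The second rule (ranking by reliability) is not sketched because the abstract does not specify how linear separability is tested.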

Results of applying the proposed algorithm to a model dataset as well as to results of experimental microarray tumor studies [3,6] will be presented.

1. Ramaswamy, S. and Golub, T.R. (2001) J. Clin. Oncol. 20: 1932-1941.
2. Segal, E. et al. (2002). Nat. Genet. 37: S38-S45.
3. Yeoh, E.J. et al. (2002). Cancer Cell 1: 133-143.
4. Ancona, N. et al. (2006) BMC Bioinformatics 7: 387.
5. Jirapech-Umpai, T. and Aitken, S. (2005). BMC Bioinformatics 6: 148.
6. Ross, M.E. et al. (2003). Blood 102: 2951-2959.


ANALYSIS OF TIME SERIES MICROARRAY DATA USING DYNAMIC BAYESIAN NETWORK K.G. SRINIVASA 1, SEEMA S 2, MANOJ JAISWAL 3

Keywords: Gene Regulatory Network, Clustering, Dynamic Bayesian Network.

A Gene Regulatory Network represents how genes interact with each other. Using genetic network modelling, it is possible to explain cell functions at the molecular level. DNA microarrays can measure the expression levels of thousands of genes simultaneously. A two-step method is adopted to model large-scale Gene Regulatory Networks using time series microarray data. First, genes are clustered based on existing biological knowledge (Gene Ontology annotations), and then a dynamic Bayesian network is applied to model causal relationships between the genes in each cluster. Finally, the learned sub-networks are integrated into a global network. This project aims at inferring the regulatory network that describes the interactions between the various genes. Our aim is to apply data mining techniques to gene expression data and to infer regulatory networks for various experiments, including experiments in good and bad conditions, using the information available in Gene Ontology.

1 MSRIT, Bangalore, India, [email protected]
2 MSRIT, Bangalore, India, [email protected]
3 MSRIT, Bangalore, India, [email protected]

CHROMOSOME PROPERTIES OF UNICELLULAR EUKARYOTIC GENOMES ANEL KABDULLINA 1, ANATOLIY IVACHSHENKO 1, MAKPAL TAUASAROVA 1, SHARA ATAMBAYEVA 1

Keywords: exon, intron, gene, genome

In the past few years the genome sequences of more than 30 unicellular eukaryotes have been completed, and tens more are being sequenced. In many of these genomes a major portion of the genes contain introns, which makes it possible to analyze the exon-intron organization of genes of lower eukaryotes comparatively. Genome size, chromosome number, the mechanism of expression of genetic information and the portion of genes with introns differ significantly among the genomes of these organisms. The purpose of this work was to examine the diversity of exon-intron structure of genes of unicellular eukaryotes in order to reveal properties that characterize the genomes of these organisms. In each sample of genes with 1, 2, 3, 4, 5, 6-9, 10-14, and 15 or more introns we determined the average exon and intron lengths, the total exon length in a gene (Lex), the gene length (Lgn), the portion of total exon length in a gene and the intron number in a gene (Nin). The examined genomes of lower fungi have the following portions of genes with introns (%): C.neoformans (96.9), N.crassa (79.5), A.fumigatus (77.5), M.grisea (75.0), S.pombe (45.3), U.maydis (38.5), P.stipitis (25.4), Y.lipolytica (10.5), D.hansenii (5.1), S.cerevisiae (4.5), E.gossypii (4.5), K.lactis (2.5), C.glabrata (1.5). In all chromosomes of each genome of lower fungi the portion of genes with introns was nearly equal: SD was less than 3.5%. Hence, the portion of genes with introns is held constant with high accuracy, and this reveals a genome property that is common to all chromosomes. The average chromosome length was the largest in the M.grisea genome (6.163 Mbp) and the smallest in the S.cerevisiae genome (0.754 Mbp). The lengths of the largest and the smallest chromosomes differ by a factor of 18.3 in the Y.lipolytica genome and by a factor of 1.9 in the D.hansenii and M.grisea genomes. The density of genes in the genomes varied from 247 genes/Mbp (N.crassa) to 540 genes/Mbp (E.gossypii).
However, the density of genes per 1 Mbp in all chromosomes of each genome was approximately identical: the standard deviation (SD) was less than 6%. The average density of genes in genomes of lower fungi correlates negatively with genome size and with the portion of genes with introns. There is a positive correlation between the portion of genes with introns and genome size. The obtained data indicate a relationship between three characteristics of the genomes of lower fungi: density of genes, portion of genes with introns and genome size. The average length of genes without introns varied only slightly among the chromosomes of each genome: SD was less than 8%. The portion of genes with introns (%) in the examined genomes of protists varied in a wide range: B.natans (84.8), T.parva (73.5), D.discoideum (68.5), B.bovis (60.4), P.falciparum (55.4), P.tricornutum (46.9), C.muris (21.5), O.lucimarinus (20.2), C.parvum (1.0). However, the lengths of genes with introns differ insignificantly among the chromosomes of each genome: SD did not exceed 6%. In protist genomes the average density of genes varied from 228 genes/Mbp (P.falciparum) to 779 genes/Mbp (B.bovis). In each genome the density of genes in the chromosomes was nearly equal: SD was less than 7%. However, in chromosome 3 of the B.bovis genome the density of genes was 455 genes/Mbp, about half the average density of genes in the three other chromosomes (887±25). At the same time the average portion of genes with introns and the average length of intronless genes were nearly equal in all chromosomes. The average length of intronless genes varied slightly in each genome: SD was less than 11%. There was no significant correlation between genome size and the average density of genes, or between genome size and the portion of genes with introns, in the examined genomes. It was determined that there is no considerable relationship between the portion of intron-containing genes and the density of genes or the length of intronless genes. The portion of genes with introns increases due to the appearance of a new kind of gene with a large number of introns, not due to an increase in the portion of genes with one or two introns. This appears to hold not only for complete genomes but also for the chromosomes of one organism. For example, chromosome 2 of the O.lucimarinus genome has 33.0% of genes with introns, unlike the rest of the genome (20.2%), and this chromosome, unlike the other chromosomes, has genes with 8, 10, 11, 12, 13 and 15 introns. A strong association has been established between the average intron number in genes and the portion of genes with introns in the genomes of lower fungi and protists: Nin=0.03D + 0.85 (r=0.86; p<10^-6). This suggests that genes with exon-intron structure share a common nature in the course of evolution. In the genomes of lower fungi and protists with a large number of introns, a linear dependence was established between the intron number in genes and the total exon length (Nin=aLex + b) and the gene length (Nin=cLgn + d). The regression coefficients (a, b, c, d) depend on the properties of the exon-intron structure of genes.

1 Al-Farabi Kazakh National University, Kazakhstan, [email protected], [email protected]
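The regression Nin=0.03D + 0.85 reported above can be made concrete with a small Python sketch. It assumes that D is the portion of genes with introns expressed in percent, which the abstract does not state explicitly; the function names are hypothetical, and the fitting routine is an ordinary least-squares line, not necessarily the authors' procedure.

```python
def predict_mean_intron_number(d_percent, slope=0.03, intercept=0.85):
    """Evaluate Nin = slope * D + intercept.
    D is ASSUMED to be the portion of genes with introns, in percent."""
    return slope * d_percent + intercept

def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Portions of genes with introns quoted in the abstract (percent).
for name, d in [("C.neoformans", 96.9), ("S.pombe", 45.3), ("S.cerevisiae", 4.5)]:
    print(name, round(predict_mean_intron_number(d), 2))
```

The same `fit_line` helper could be applied to (Lex, Nin) or (Lgn, Nin) pairs to estimate the coefficients a, b, c and d mentioned in the text.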


REVERSE ENGINEERING OF EARLY ENDOCYTIC COMPARTMENTS ORGANIZATION BY MODELLING CARGO PROPAGATION YANNIS KALAIDZIDIS 1, MARTA MIACZYNSKA 2, JOCHEN RINK 3, INNA KALAIDZIDIS 1, MARINO ZERIAL 1

Keywords: Endocytosis

Pulse-chase experiments measuring fluorescent cargo uptake provide rich kinetic information concerning the trafficking of different cargos through the endocytic system. High-resolution fluorescence microscopy allows for quantitative measurements of the co-localisation of cargo within various endocytic compartments. A "neutral" cargo (transferrin) and a signalling cargo (EGF) were chased for 30 minutes after a 30 second pulse in cultured HeLa cells. Two different endosomal markers, EEA1 and APPL, were used to mark the compartments of interest. The quantitative colocalization of cargos and endocytic markers provides strong evidence of the sorting properties of APPL1-positive EEA1-negative endosomes. At the same time, fitting different models to the experimental data makes it possible to search for the "simplest" model that describes the data with the accuracy one can expect given the measurement noise. The "simplest" model revealed a kinetically distinct sub-population of EEA1-labelled endosomes, which could not be distinguished by morphological analysis of antibody staining. The cargo-dependent asymmetry of cargo exchange between the APPL and EEA1 compartments provides insight into the role of non-canonical APPL1-positive early endosomes.

1 Max Planck Institute of Molecular Cell Biology and Genetics, Germany, [email protected], [email protected], [email protected]
2 International Institute of Molecular and Cell Biology, Poland, [email protected]
3 University of Utah School of Medicine, United States, [email protected]

PREDICTING NOVEL PROTEIN-SMALL MOLECULE INTERACTIONS USING MOLECULAR MODELLING TECHNIQUES OLGA KALININA 1,2 , ROBERT RUSSELL 2

Keywords: structural bioinformatics, small molecule, modelling

Understanding the nature of interactions between proteins and low-molecular weight ligands is a key issue in chemoinformatics and drug design. The increasing amount of data makes it possible to go beyond studying interactions of a single protein with a single small molecule and to perform comparative studies in this field. It has recently been shown that drug-target interactions do not follow the simple lock-and-key model, where the keys (drugs) are specific and selective for the locks (target proteins): frequently one protein binds different drugs, or one drug fits many proteins (Yildirim et al., 2007). Complementarily, Keiser et al. demonstrated that low molecular weight ligands can be regarded as links between unrelated proteins and allow for prediction of new interactions (Keiser et al., 2007).

In this study, we apply molecular modeling techniques to study the possible promiscuity of both small molecules and their target proteins, and we aim to predict new potential drugs on a large scale. We use similar small molecules as anchors to superimpose the proteins to which they are bound, and fit other interaction partners into the structures of proteins that they were never observed to bind. By assessing the geometry and chemical properties of the emerging interactions, we can conclude whether the considered protein-small molecule pair is likely to interact.

We illustrate our approach by considering binding of phosphodiesterases (PDE) 4 and 5 with their specific inhibitors. These two enzyme families are specific for hydrolysis of cAMP and cGMP, respectively, and development of their specific inhibitors is an issue of great importance in the drug industry. Sildenafil is a potent inhibitor of PDE5A; however, it binds PDE4B to a limited extent as well. Using a structure of human PDE4B bound to sildenafil, we fit other PDE5A-specific inhibitors into the pocket of PDE4B. We show that PDE4B is probably unable to bind tadalafil, another specific inhibitor of PDE5A, due to steric clashes with Tyr403, Arg404, Thr407 and Met411 of PDE4B. Experimental assays confirm the virtual absence of binding of tadalafil to PDE4B (Card et al., 2004). We also analyze binding of chemicals transported by transthyretin to hormone receptors and report a range of new binding candidates.

1 Institute for Information Transmission Problems, B. Karetny per. 19, 127994, Moscow, Russia
2 EMBL-Heidelberg, Meyerhofstr., 1, 69117, Heidelberg, Germany, [email protected]

We benchmark our technique by rediscovering existing interactions: if two divergent enough ligands bind two divergent enough proteins, we exclude one of the four interactions from the dataset and aim to reproduce it. The presented technique is widely applicable to discovery of new drug targets and in assessment of potential toxicity of drugs.
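The leave-one-out benchmark described above can be sketched as follows. The sketch only enumerates the hold-out tasks; it assumes that the list of interactions has already been filtered for sufficiently divergent proteins and ligands, and all names in it are hypothetical.

```python
from itertools import combinations

def holdout_tasks(interactions):
    """Enumerate benchmark tasks from known (protein, ligand) pairs.

    For every two proteins and two ligands whose four pairings are all
    known, each pairing in turn is held out and the other three are kept
    as the evidence from which it should be reproduced.
    """
    known = set(interactions)
    proteins = sorted({p for p, _ in known})
    ligands = sorted({l for _, l in known})
    tasks = []
    for p1, p2 in combinations(proteins, 2):
        for l1, l2 in combinations(ligands, 2):
            square = [(p1, l1), (p1, l2), (p2, l1), (p2, l2)]
            if all(pair in known for pair in square):
                for held_out in square:
                    rest = [pair for pair in square if pair != held_out]
                    tasks.append((held_out, rest))
    return tasks

# Hypothetical toy data: proteins P1, P2 and ligands L1..L3.
toy = [("P1", "L1"), ("P1", "L2"), ("P2", "L1"), ("P2", "L2"), ("P1", "L3")]
tasks = holdout_tasks(toy)
```

Each task would then be passed to the fitting procedure, and the benchmark score would be the fraction of held-out interactions that are recovered.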

1. Keiser M.J., Roth B.L., Armbruster B.N., Ernsberger P., Irwin J.J., Shoichet B.K. (2007) Relating protein pharmacology by ligand chemistry. Nat. Biotechnol. 25(2), 197-206.
2. Yildirim M.A., Goh K.-I., Cusick M.E., Barabasi A.-L., Vidal M. (2007) Drug-target network. Nat. Biotechnol. 25(10), 1119-1126.
3. Card G.L., England B.P., Suzuki Y., Fong D., Powell B., Lee B., Luu C., Tabrizizad M., Gillette S., Ibrahim P.N., Artis D.R., Bollag G., Milburn M.V., Kim S.H., Schlessinger J., Zhang K.Y. (2004) Structural basis for the activity of drugs that inhibit phosphodiesterases. Structure 12(12), 2233-2247.


BIOINFORMATIC SEARCH OF PLANT MICROTUBULE- AND CELL CYCLE RELATED SERINE-THREONINE PROTEIN KINASES P.A. KARPOV 1, E.S. NADEZHDINA 2,3, A.I. YEMETS 1, V.G. MATUSOV 1, A.YU. NYPORKO 1, N.YU. SHASHINA 2, Y.B. BLUME 1

Keywords: protein kinases, viridiplantae

Among post-translational modifications, phosphorylation represents a special case because of its wide occurrence (doi: 10.1199/tab.0106) and its ability to regulate the structure and function of around 30% of all proteins in eukaryotes (PMID:3774008, PMID:11114734). At the same time, several microtubular proteins show homology between animals and plants, e.g., tubulins (PMID:12681322), microtubule-associated proteins type I (MAP1) [1], etc. Therefore, these proteins may be expected to have similar phosphorylation sites and to be phosphorylated by corresponding conserved protein kinases (PKs). Phosphorylation of other proteins that form plant microtubules and are involved in cell division may also be mediated by PKs that have well characterized homologues in the animal kingdom [2]. Our previous research demonstrated that only ~50% of Arabidopsis thaliana PKs have a catalytic (kinase) domain homologous to animal PKs (http://www.ims.nus.edu.sg/Programs/08compsys/files/blume_ab.pdf), and the goal of the present work was to identify plant homologs of the microtubule- and cell cycle-related serine-threonine PKs (PMID:12471243). Based on the data from the Human Kinome project (http://kinase.com/) and the literature, 68 human serine-threonine PKs that phosphorylate microtubule proteins and regulate the cell cycle were selected. The search for plant (Viridiplantae) homologs was then performed in the UniProt (Swiss-Prot/TrEMBL) database with SIB BLASTp scanning against the amino acid sequences of the catalytic domains of the human PKs. Viridiplantae homologs were identified for 35 human protein kinases from 19 families (Table): Aurora (A, B, C), BUB1, BUBR1, CDC2, CDK2, CDK5, CHK1, CK2a1, GSK3A, GSK3B, MAST (1, 2, 3), MASTL, NDR1, NEK7, PITSLRE, PLK1, PLK3, SLK, TAO1, TAO2 and TTK.

1 Institute of Food Biotechnology and Genomics, Natl. Academy of Sciences of Ukraine, 04123 Kyiv, Ukraine, [email protected], [email protected]
2 Institute of Protein Research, Russian Academy of Sciences, 142290 Pushchino, Moscow Region, Russian Federation, [email protected]
3 AN Belozersky Institute of Physico-Chemical Biology, Moscow State University, Leninsky Gory, 119992 Moscow, Russian Federation, [email protected]

Table - Results of the BLASTp search for Viridiplantae homologs of the human microtubule- and cell cycle-related serine-threonine protein kinases (Protein / homolog found: yes or no)

Aurora kinases: AurA / yes; AurB / yes; AurC / yes
Breakpoint cluster: BCR / no
Mitotic checkpoint: BUB1 / yes; BUBR1 / yes
Cyclin-dependent: CDC2 / yes; CDK2 / yes; CDK5 / yes
CAMK Ser/Thr: CHK1 / yes; CHK2 / no
Casein kinase II: CK2a1 / yes
CaMK subfamily: DCLK1 / no
MNB/DYRK: DYR1A / no
CMGC Ser/Thr: Erk2 / no
CMGC Ser/Thr: GSK3A / yes; GSK3B / yes
Inhibitor of nuclear: IKKa / no
Integrin-linked: ILK / yes
Mitogen-activated: MAPK8 / no
Kinase suppressor of: KSR2 / no
Large tumor: LATS1 / yes; LATS2 / yes
LIM domain kinase 1: LIMK1 / no
Leucine-rich repeat: LRRK2 / yes
Mitogen-activated: MAP2K3 / no; MAP2K6 / no; MAP3K1 / yes; MAP3K3 / yes; MAP3K7 / yes; MAP3K11 / no
MAP/microtubule: MARK1 / no; MARK2 / no; MARK3 / no; MARK4 / no
Microtubule-: MAST1 / yes; MAST2 / yes; MAST3 / yes; MAST4 / yes; MASTL / yes
Nuclear Dbf2-related: NDR1 / yes
NimA-related protein: NEK2 / no; NEK7 / yes; NEK8 / no; NEK9 / no
p21-activated: PAK1 / no; PAK5 / no
Cell division cycle 2-: PITSLRE / yes
Protein kinase C-: PKCi / no
Mitogen-activated: MAP3K15 / no
CAMK Ser/Thr protein: PRKD2 / no
Serine/threonine-: PSKH1 / no
Polo-like kinases: PLK1 / yes; PLK3 / yes; PLK4 / no
TKL Ser/Thr protein: RIPK1 / no; RIPK2 / yes; RIPK3 / yes
Rho-associated: ROCK1 / no
STE20-like: SLK / yes
Thousand and one: TAO1 / yes; TAO2 / yes
TANK-binding: TBK1 / no
Tau-tubulin kinases: TTBK1 / no; TTBK2 / no
Dual specificity: TTK / yes
Titin: TTN / no
CaM kinase-like: CAMKV / no

At the same time, plant homologs of the BCR, CHK2, DCLK1, DYR1A, Erk2, IKKa, MAPK8, KSR2, LIMK1, MAP2K3, MAP2K6, MAP3K11, MARK1, MARK2, MARK3, MARK4, NEK2, NEK8, NEK9, PAK1, PAK5, PKCi, MAP3K15, PRKD2, PSKH1, PLK4, RIPK1, ROCK1, TBK1, TTBK1, TTBK2, TTN and CAMKV PKs were not found.

Acknowledgment: This work is supported by bilateral grant No. 08-04-90454 of the Natl. Academy of Sciences of Ukraine and the Russian Foundation for Basic Research (RFBR), 2008-2009.

1. P.A. Karpov, Ya.B. Blume (2008) Bioinformatic search for plant homologues of animal structural MAPs in the Arabidopsis thaliana genome, In: The Plant Cytoskeleton: a Key Tool for Agro-Biotechnology. Ya.B. Blume et al., 373-397 (Springer).
2. A.Yemets et al. (2008) Effects of tyrosine kinase and phosphatase inhibitors on microtubules in Arabidopsis root cells, Cell Biol Int, 32:630-637.


NET2DRUG: COMBINED TARGETING THE KEY-NODES IN SIGNAL TRANSDUCTION NETWORK SHIFTS BALANCE BETWEEN APOPTOSIS AND SURVIVAL MECHANISMS IN TUMOR CELLS. ALEXANDER KEL 1, ANGELA GLUCH 1, ULYANA BOYARSKIH 2, VLADIMIR POROIKOV 3, ALEXEY ZAKHAROV 3, GALINA SELIVANOVA 4

Keywords: antitumor drugs, signal transduction, transcription factors, genetic algorithm

Cell proliferation is controlled by a complex interplay between genes governing cell proliferation (protooncogenes and other pro-survival genes) and their antagonists (tumor suppressors, including pro-apoptotic genes). Disruption of the balance between these two sets of genes can lead to tumor progression. We have applied a systems biology approach to study tumor suppressor genes in various processes leading to the cell quiescence phase (G0 phase), in normal conditions as well as under treatment by anti-tumor drugs that lead to arrest of cell proliferation and to apoptosis. In this study, we analyzed microarray gene expression data on the arrest of cell proliferation in human fibroblasts under serum deprivation, and a large-scale gene expression study of the treatment of a breast cancer cell line by the antitumor drugs RITA and Nutlin, whose direct targets are p53 and Mdm2.

We have applied the ExPlain™ computer system [1], which makes it possible to analyze the composite structure of promoters, identify transcription factors involved in the regulation of these processes, and perform topological modeling of the signal transduction processes leading to arrest of cell proliferation and to apoptosis. We found that several transcription factors, such as FoxF1, Sox-9 and IRF, play an important role at the early stages of these processes, switching on the specific regulatory program of entry into the cell quiescence phase (G0 phase). We also found that transcription factors of the Crx, Oct and Fox families are key factors contributing to the downregulation of pro-survival genes upon p53 reactivation by the antitumor drug RITA, leading to apoptosis of cancer cells. Topological modeling of the signal transduction network upstream of these transcription factors using ExPlain™ tools allows us to reveal key nodes of this network, which might master-regulate the whole program of cell survival that is balanced against the program of entry into the cell quiescence phase and apoptosis. We consider such key nodes to be the most promising targets for novel anticancer drugs.

1 BIOBASE GmbH, Wolfenbuettel, Germany, [email protected], [email protected]
2 Institute of Chemical Biology and Fundamental Medicine, Novosibirsk, Russian Federation, [email protected]
3 Institute of Biomedical Chemistry, Russian Academy of Medical Science, Moscow, Russian Federation, [email protected], [email protected]
4 Microbiology and Tumor Biology Center (MTC), Karolinska Institutet, Sweden, [email protected]

Finally, we applied a powerful chemoinformatics approach based on the computer tool PASS [2], which allowed us to search libraries of small molecular compounds for prospective leads that target, in a combined manner, the key nodes found by the ExPlain™ system. Prospective drug candidates will be identified whose application to cancer cells, in combination with low-concentration RITA treatment, can shift the balance of survival mechanisms in tumor cells towards apoptosis.

This work was partially supported by EU grants: VALAPODYN (LSHG-CT-2006-037277) and Net2Drug (LSHB-CT-2007-037590).

1. Kel A, Voss N, Valeev T, Stegmaier P, Kel-Margoulis O, Wingender E. ExPlain: finding upstream drug targets in disease gene regulatory networks. SAR QSAR Environ Res. 2008;19(5-6):481-94.
2. Alexey Lagunin, Alla Stepanchikova, Dmitrii Filimonov, Vladimir Poroikov: PASS: prediction of activity spectra for biologically active substances. Bioinformatics 16(8): 747-748 (2000)


HIGHLY CONNECTED CANCER METASIGNATURE GENES ARE NOT EVOLUTIONARY CONSERVED THROUGHOUT THE THREE DOMAINS OF LIFE MUHUMMADH KHAN 1, KAISER JAMIL 2

In the present study we identified the highly connected genes of the cancer metasignature and compared them with the most conserved ones [1]. To generate the functional network we employed the network prediction tool STRING, which combines an increasing number of gene context protein interaction prediction methods using a unified scoring scheme [2]. STRING contains a unique scoring framework based on benchmarks of the different types of associations against a common reference set, integrated in a single confidence score per prediction. The graphical representation of the network of inferred, weighted protein interactions provides a high-level view of functional linkage, facilitating the analysis of modularity in biological processes. In the combined signature there are 18 genes that are highly interconnected. These include genes with intra-nuclear functions such as PCNA, MAD2L1, RPA3, cyclins, cyclin-dependent kinases and MCM proteins. They seem to form the crux of the combined cancer metasignature, and this common group of genes seems to be essential for cancer progression. The combined signature network contains six sub-networks established by these genes. These sub-networks belong to chaperonins, proteasomes, cyclin-dependent kinases, minichromosome maintenance proteins, replication factors and cell division cycle proteins. The sub-networks are highly interconnected, except for the chaperonin and proteasome sub-networks. This model of combined signature sub-networks contributes to the unfailing tendency of cancer cells to divide. We also generated a phylogenetic profile to determine the conservation of cancer metasignature genes through the three domains of life: archaea, bacteria and eukarya. Genes that are vital for the proper functioning of life processes crucial for the cell tend to be more conserved than others. Important genes are found in all three domains most of the time.
Hence we wanted to find out whether this is the case with the metasignature genes. We used the orthology detection implemented in STRING. At present STRING covers around 315 bacterial species, 35 eukaryotic species and 26 archaeal species. Of the neoplastic metasignature genes, 27 are conserved in Archaea and 30 in Bacteria, whereas for the undifferentiated metasignature the overall number of homologs was 9, a very low figure indicating that these genes are almost exclusively eukaryotic. On the whole, about 5 genes of the neoplastic and 4 genes of the undifferentiated metasignature were strongly conserved in Bacteria, Archaea and Eukarya. These are HSPD1, NME1, AHCY, MTHFD2, DDX48, DPM1, MTHFD2, PRDX4, and NME1. Barring HSPD1 and DDX48, all of these genes encode enzymes. Conservation of these genes shows that they are vital not only to a cancerous cell but also to a normal cell. Most of the metasignature genes had homologs only in eukaryotic organisms, indicating that these genes are exclusive to eukaryotes. It is interesting that the highly connected cancer metasignature genes are not conserved in species from the other two domains of life. It seems plausible that the ancestral genes were replaced by newly evolved counterparts in eukaryotes. Another important concern is chemotherapeutic drug targeting: the pathways of these overexpressed conserved genes could be considered when planning and implementing a potentially viable chemotherapy.

1 Department of Bioinformatics, MGNIRSA, Gagan Mahal, Hyderabad, A.P. India, [email protected]
2 Department of Genetics, Mahavir Hospital and Research Center, Masab Tank, Hyderabad, A.P. India, [email protected]
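The comparison made in this abstract, between highly connected genes of an interaction network and genes conserved in all three domains of life, can be sketched as follows. This is only a minimal illustration: the gene names, confidence scores, thresholds and the function names `hub_genes` and `conserved_in_all_domains` are hypothetical placeholders, not STRING output and not the authors' procedure.

```python
# Toy data only: gene names, scores and profiles are invented placeholders.
DOMAINS = ("Archaea", "Bacteria", "Eukarya")


def conserved_in_all_domains(profiles):
    """Return genes whose orthologs were found in every domain of life."""
    required = set(DOMAINS)
    return sorted(g for g, found in profiles.items() if required <= set(found))


def hub_genes(edges, min_score, min_degree):
    """Return genes with at least `min_degree` partners, counting only
    interactions whose confidence score is >= `min_score`."""
    degree = {}
    for a, b, score in edges:
        if score >= min_score:
            degree[a] = degree.get(a, 0) + 1
            degree[b] = degree.get(b, 0) + 1
    return sorted(g for g, d in degree.items() if d >= min_degree)


# Phylogenetic profile: gene -> domains in which an ortholog was detected.
profiles = {
    "geneA": {"Archaea", "Bacteria", "Eukarya"},
    "geneB": {"Eukarya"},
    "geneC": {"Bacteria", "Eukarya"},
    "geneD": {"Eukarya"},
}
# Weighted interactions: (gene, gene, confidence score between 0 and 1).
edges = [
    ("geneA", "geneB", 0.90),
    ("geneB", "geneC", 0.95),
    ("geneA", "geneC", 0.40),
    ("geneB", "geneD", 0.80),
]

hubs = hub_genes(edges, min_score=0.7, min_degree=2)
universal = conserved_in_all_domains(profiles)
# Hubs that are NOT conserved across all three domains.
print([g for g in hubs if g not in universal])
```

With real data, the edge list would come from the network export and the profiles from the orthology search; the cutoffs used here are arbitrary and would need to be chosen for the dataset at hand.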

Figure: (a) The highly connected genes of the cancer metasignature. (b) The cellular processes which the functional network of the metasignature genes represent. Sub-network labels: CDK, cyclin-dependent kinases; MCM, minichromosome maintenance proteins; CCT, chaperonins; PSM, proteasomes; RFC, replication factors; CDC, cell division cycle proteins.

1. D.R. Rhodes, et al. (2004) Large-scale meta-analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression. Proc Natl Acad Sci U S A. 101(25): p. 9309-14.

2. von Mering, C., et al. (2007) STRING 7 - recent developments in the integration and prediction of protein interactions. Nucleic Acids Res. 35(Database issue): p. D358-62.


SNPS OF MTHFR OCCUR AT SITES EXHIBITING SIGNIFICANT CONSERVATION IN COMPARISON TO THE ACTIVE SITES MUHUMMADH KHAN 1, KAISER JAMIL 2

The phylogenetic tree was constructed using homologs from diverse species of Archaea, Bacteria and Eukarya. Our multiple alignment shows that the active catalytic residues, aspartic acid and glutamic acid, are conserved in all the representative sequences from the three domains of life. The eukaryotic subtree is clustered into two clades: one contains the homologs of animal species, and the other contains many fungi along with plants and other unicellular eukaryotes. This clustering of homologs marks the different evolutionary paths that the ancestral MTHFR has taken. We examined the structure of MTHFR using the ConSurf tool. The conserved core is the part of the three-dimensional structure of a protein that is maintained throughout its evolutionary history. It provides the scaffolding for the exact spatial arrangement and maintenance of the active sites, catalytic residues and binding pockets of the protein. In the case of MTHFR, the multiple alignment makes clear that the catalytic residues, along with the MTHFR domain, are remarkably conserved throughout its evolution. To trace this conserved core onto the three-dimensional structure, a conservation score for each site in the multiple sequence alignment was calculated using the Bayesian method implemented in ConSurf [1]. The conservation score at a site corresponds to the site's evolutionary rate. The rate of evolution is not the same at all amino acid sites of a protein: it is low in well-conserved regions and high in variable regions. This rate variation corresponds to the level of purifying selection at a given position in the protein sequence. The scores are normalized so that the average score over all residues is zero and the standard deviation is one. The conservation scores calculated by ConSurf are a relative measure of evolutionary conservation at each sequence site of the target protein.
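The normalization described above (mean zero, standard deviation one) is a standard z-score transformation. A minimal sketch follows; the per-site rates are invented, the function name `normalize_scores` is hypothetical, and the use of the population standard deviation is an assumption, since the abstract does not say which estimator ConSurf applies.

```python
from statistics import mean, pstdev


def normalize_scores(rates):
    """Rescale per-site evolutionary rates to mean 0 and standard deviation 1.

    Assumption: the population standard deviation is used.
    """
    mu = mean(rates)
    sigma = pstdev(rates)
    if sigma == 0:
        raise ValueError("all sites have the same rate; scores are undefined")
    return [(r - mu) / sigma for r in rates]


# Invented per-site rates for a four-residue toy protein.
rates = [0.2, 1.0, 1.8, 3.0]
z = normalize_scores(rates)

# The site with the lowest normalized score is the most conserved one,
# relative to the other sites of the same protein.
most_conserved_site = z.index(min(z))
print(most_conserved_site, [round(v, 2) for v in z])
```

Because the scores are relative, a strongly negative value says only that the site evolves more slowly than the average site of this protein in this alignment.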
The lowest score represents the most conserved position in a protein. It does not necessarily indicate 100% conservation (i.e., no mutations at all), but rather indicates that this position is the most conserved in this specific protein, as calculated from a specific multiple sequence alignment. The resulting conservation, as seen in the figure, shows that the active sites are well conserved, as expected; interestingly, the two SNPs of MTHFR also occur at conserved sites, though less strongly conserved ones. Hence we assume that these SNPs occur at sites under functional constraint, where the constraint is weaker than at the conserved sites but stronger than at the variable sites of the same protein. We speculate that this could be a possible reason for the fixation of SNPs in MTHFR. The contribution of a fixed SNP to the structural destabilization of a given protein is subject to stochastic processes. Simply put, it should occur at the right place at the right time.

1 Department of Bioinformatics, MGNIRSA, Gagan Mahal, Hyderabad, A.P. India, [email protected]
2 Department of Genetics, Mahavir Hospital and Research Center, Masab Tank, Hyderabad, A.P. India, [email protected]

Figure 1: MTHFR phylogenetic tree, conservation scores of the MTHFR structure, and the actual protein structure, with higher-intensity greyscale depicting higher conservation. The dark spheres represent the active sites and the lighter spheres represent the SNPs. Although the SNP-bearing sites are not as conserved as the active sites, they nonetheless exhibit significant conservation.

1. M. Landau et al., (2005) ConSurf 2005: the projection of evolutionary conservation scores of residues on protein structures, Nucleic Acids Res., 33(Web Server issue): p. W299-302.


CHROMATIN ORGANIZATION IN D. MELANOGASTER PETER KHARCHENKO 1, ART ALEKSEYENKO 1, ANDREY GORCHAKOV 1, MICHAEL TOLSTORUKOV 1, MITZI KURODA 1, PETER PARK 1, YURI SCHWARTZ 2, DANIELA LINDER-BASSO 2, VINCENZO PIRROTTA 2, NICOLE RIDDLE 3, SARAH GADEL 3, SARAH MARCHETTI 3, SARAH ELGIN 3, AKI MINODA 4, CAMERON KENNEDY 4, GREGORY SHANOWER 4, GARY KARPEN 4

Keywords: D.melanogaster chromatin

The functional roles and mechanisms of epigenetic regulation by chromatin structure remain poorly understood. To provide a baseline reference of chromatin organization, we are examining genome-wide distributions of more than a hundred histone modifications and relevant chromosomal proteins across the D. melanogaster genome in cell line and tissue samples. Our work is part of the larger effort by the ENCODE project (ENCyclopedia Of DNA Elements) to generate a comprehensive dataset of principal functional elements in the fly, worm and human genomes. The combinatorial patterns of histone modifications delineate specific functional regions of the D. melanogaster genome, including euchromatic and heterochromatic compartments, domains of Polycomb-mediated silencing, putative enhancers, transcription start sites and bodies of expressed genes. Using multivariate analysis techniques we identify predominant epigenetic patterns associated with gene activation and silencing that are specific to euchromatic regions, heterochromatic regions and individual chromosomes. To investigate mechanisms that may account for partitioning of the genome into functionally independent compartments, we have examined a number of putative insulator proteins. We find that the binding positions of many such proteins frequently coincide, forming several types of co-binding groups, with key insulator proteins such as CTCF and Su(Hw) at their core. Colocalization with specific binding partners is often linked to significant differences in the apparent magnitude of binding, or to specific positioning relative to genetic elements and epigenetic marks. The coordinated set of experimental measurements by the consortium members allows further integrative analysis of the chromatin organization with respect to other functional data such as copy-number variation or replication timing, as well as comparison of chromatin marks across different organisms.

1 Harvard Medical School, Boston, MA, USA, [email protected]
2 Rutgers University, Piscataway, NJ, USA
3 Washington University, St. Louis, MO, USA
4 Lawrence Berkeley National Laboratory, Berkeley, CA, USA
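One simple way to quantify how often the binding positions of two proteins coincide is the fraction of sites of one protein that have a site of the other within a fixed window. The sketch below is an illustration under stated assumptions, not the consortium's analysis: positions are invented coordinates on a single chromosome, and the 200 bp window and the function name `cobinding_fraction` are hypothetical.

```python
from bisect import bisect_left


def cobinding_fraction(sites_a, sites_b, window=200):
    """Fraction of sites in `sites_a` that have a site of `sites_b`
    within `window` base pairs (same chromosome assumed)."""
    if not sites_a:
        return 0.0
    b = sorted(sites_b)
    hits = 0
    for pos in sites_a:
        # First B site at or to the right of the left edge of the window.
        i = bisect_left(b, pos - window)
        if i < len(b) and b[i] <= pos + window:
            hits += 1
    return hits / len(sites_a)


# Invented binding-site midpoints for two hypothetical proteins.
protein_a = [1000, 5000, 9000]
protein_b = [1100, 7000, 9150]
print(cobinding_fraction(protein_a, protein_b))
```

The measure is asymmetric, so it is usually reported in both directions, and it should be compared against the value expected for randomly placed sites.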


A NOVEL TYPE OF REPEATS MEDIATES INTERACTION BETWEEN SCHIZOSACCHAROMYCES POMBE RAD51 AND SFR1 PROTEINS. OLGA KHASANOVA 1, FUAT KHASANOV 1

Keywords: fission yeast, Rad51, DNA repair

The central reaction of homologous recombination is homology search and DNA strand invasion by the Rad51-ssDNA presynaptic filament, which is assisted by mediator proteins. In S. pombe cells the mediator proteins Rhp55/Rhp57 and Sfr1 were shown to help Rad51 overcome the inhibitory binding of RPA to single-stranded DNA in an in vitro strand-exchange reaction via protein interactions. Analysis of the amino acid sequence of Sfr1 revealed that the C-terminus includes a coiled-coil domain (amino acids 178 – 236) strongly predicted by the program COILS. We also found two novel tandem repeats, approximately 40 amino acids long, in the region preceding the coiled-coil domain, with the consensus sequence of the core part, F-[H/K/R]-[S/T/P]-P-[I/M/L], predicted by several algorithms at the Network Protein Sequence Analysis server (Pole BioInformatique Lyonnais, France). We named this novel motif PSA (pombe Sfr1 associated). Alignment of the PSA repeat sequences generated by Clustal W and secondary structure prediction using Network Protein Sequence Analysis showed that PSA repeats consist of an α-helical central core flanked by regions of high amino acid conservation. A TBLASTX search identified several proteins with PSA repeats in other eukaryotic organisms: C. neoformans CNBF2390 (two repeats), S. pombe Swi2 (one repeat), S. cerevisiae Sae2/Com1 (one repeat), N. crassa EAA36464.1 (one repeat), M. grisea MG03610.4 (one repeat) and A. nidulans ANO810.2 (one repeat). The identification of PSA repeats in Sfr1 suggested that they might have functional significance. To address this, we tested whether overproduction of PSA repeats can cause a dominant negative effect on cell survival under conditions of genotoxic stress. We overexpressed a truncated Sfr1 protein (a.a. 71-176, PSA1 and PSA2) in wild-type and rad51 strains and performed a drop assay with MMS. PSA repeat overexpression was found to have a negative effect on cell survival under conditions of genotoxic stress in the absence of Rad51.
This suggests that PSA repeats may be involved in the interaction between the Sfr1 and Rad51 proteins. The consensus sequence of the most conserved part of the PSA repeat contains Phe. Site-directed mutagenesis of the PSA core region was performed using QuikChange (Stratagene) to generate the F107E mutation in the PSA1 repeat and F163E in the PSA2 repeat (substitutions of phenylalanine with glutamate) and to test their ability to cripple the Sfr1-Rad51 interaction. Yeast two-hybrid analysis showed a strong decrease in the interaction between Rad51 and Sfr1 carrying the amino acid substitutions. Thus, the PSA repeats of Sfr1 serve as a protein-protein interaction module, and the switch of the charge to the opposite would most likely cripple the function of the repeat. The double mutation in the repeats resulted in cell sensitivity to CPT and MMS similar to that of the sfr1 mutant. Moreover, the strain sfr1-F107E F163E showed defects in meiosis and meiotic recombination. From these data we concluded that Sfr1, via its PSA repeats, is important for wild-type levels of meiotic recombination in fission yeast. The impairment of Sfr1 function by mutations in the PSA repeats is not due to destabilization of the protein structure, as similar protein levels were detected in whole cell extracts prepared from strains with wild-type and mutant Sfr1. This indicates that PSA repeats are essential for the mechanism of action of Sfr1 in DNA repair.

1 Institute of Gene Biology, Russian Academy of Sciences, Russian Federation, [email protected], [email protected]
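The core consensus F-[H/K/R]-[S/T/P]-P-[I/M/L] quoted in this abstract translates directly into a regular expression, which is one simple way to scan a protein sequence for candidate PSA cores. The sketch below is illustrative only: the test sequence is invented (it is not Sfr1), the function name `find_core_motifs` is hypothetical, and a pattern match on five residues is far weaker evidence than the alignment-based repeat detection the authors describe.

```python
import re

# Regular-expression form of the consensus F-[H/K/R]-[S/T/P]-P-[I/M/L].
PSA_CORE = re.compile(r"F[HKR][STP]P[IML]")


def find_core_motifs(sequence):
    """Return (1-based start position, matched residues) for every
    occurrence of the core consensus in a protein sequence."""
    return [(m.start() + 1, m.group()) for m in PSA_CORE.finditer(sequence)]


# Invented toy sequence containing two core-like segments.
toy = "MAAFKSPLGGGFRTPIAA"
print(find_core_motifs(toy))

# Replacing the phenylalanine with glutamate (as in the F-to-E mutants
# described above) removes the match at that position.
mutant = toy[:3] + "E" + toy[4:]
print(find_core_motifs(mutant))
```

Since no position after the first in the pattern allows F, two matches can never overlap, so the non-overlapping scan performed by `finditer` misses nothing.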


COMPARATIVE ANALYSIS OF GENE EXPRESSION PROFILES IN LIVER AND KIDNEY OF PIGS N.S. KHLOPOVA 1, V.I.GLAZKO 1, T.T. GLAZKO 1

Profiles of gene expression in different organs make it possible to reveal "critical" genes and the metabolic pathways they control, which are closely connected with the formation of various phenotypic characteristics. We carried out a comparative analysis of gene expression profiles in two pig organs, liver and kidney. It is important to note that, despite the wide use of microarray technologies for the analysis of gene expression profiles, the method has some problems that can lead to erroneous results. One such source of error, widely discussed in the literature (for example, [1]), is cross-hybridization. In our study particular attention was given to this question. The study was carried out on six-month-old female pigs of the Landrace breed kept on the experimental farm of the University of Minnesota, USA. The primary gene expression data were obtained at the experimental facilities of this University under the supervision of Professor S. C. Fahrenkrug. The experiment used 70-mer oligonucleotide microarrays designed at the request of Professor S. C. Fahrenkrug. Total RNA was isolated from the liver and kidney of five pigs, and cDNA was obtained separately for each sample by RT-PCR using a special RT Primer Oligo (Cy3/Cy5). After the RT reaction was stopped and the RNA degraded, a first hybridization of the cDNA to the microarray and a second hybridization of the 3DNA Capture Reagent to the cDNA bound on the microarray were carried out. Among the 600 liver genes with the largest differences in hybridization intensity between liver cDNA and kidney cDNA, 12 genes were represented by more than one probe on the microarray. The data made it possible to compare, separately for each animal, the intensity of hybridization (the digitized signal strength) of microarray probes to different sites of the same gene.
Among these 12 genes, two basic groups were distinguished: those with the smallest differences in hybridization signal strength between probes to different sites of the cDNA of the same mRNA (up to 4500 standard luminescence units), and those with the greatest differences (more than 10000 standard luminescence units). The latter group included two genes: Alpha-1-antichymotrypsin precursor (ACT) and Fibrinogen alpha chain precursor. The observed differences were reproduced in independent experiments for all animals investigated. Among the 600 kidney genes with the largest differences in hybridization intensity between kidney cDNA and liver cDNA, 9 genes were represented on the microarray by more than one probe to different sites of the same gene. The group of genes whose internal hybridization sites differed by more than 10000 standard luminescence units consisted of the genes encoding ATP synthase a chain, Chromogranin A precursor (CgA), and Ubiquitin. The search for sites homologous to the microarray probes was performed with BLASTn against the Sus scrofa EST database at NCBI. All cases of reproducible differences in hybridization intensity between different microarray probes to the cDNA of the same gene involved genes belonging to supergene families. In all cases, the homologous sites for different probes to the same cDNA differed markedly in the number of homologous fragments and in the presence of homologous sites in other genes, including paralogs. The overall comparative analysis of gene expression in liver and kidney revealed 40 genes whose expression was substantially higher in kidney than in liver. In general, the main differences were connected with genes controlling inter- and intracellular ion exchange, and also mechanisms of cell division. This agrees well with the dominant role of the kidneys, compared with the liver, in maintaining the ionic balance of mammalian blood, and also with the known reduced cytokinesis activity in the liver (polyploidy of hepatocytes). Thus, the differences revealed in the gene expression profiles of kidney and liver corresponded to the functional and histological differences between these organs. The data obtained clearly demonstrate the potential of short DNA fragments (70 nucleotides) for in-depth studies of the genetic and biochemical mechanisms of cellular and tissue phenotype formation, and for developing experimental approaches to its control.

1 Russian State Agrarian University – Moscow Timiryasev Agrarian Academy, 127550, Timiryasev st., 49, Moscow, Russia, [email protected]
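The grouping of multi-probe genes by the spread of their probe signals can be sketched as below. The two cutoffs (4500 and 10000 luminescence units) are taken from the abstract; everything else is an assumption made for illustration: the signal values and gene labels are invented, the function names are hypothetical, and the abstract does not say how genes falling between the two cutoffs were treated, so they are simply labelled "intermediate" here.

```python
from collections import defaultdict


def probe_spread(signals):
    """Map each gene with more than one probe to the difference between
    its strongest and weakest probe signal.

    `signals` is an iterable of (gene, probe_id, intensity) tuples.
    """
    by_gene = defaultdict(list)
    for gene, _probe_id, intensity in signals:
        by_gene[gene].append(intensity)
    return {g: max(v) - min(v) for g, v in by_gene.items() if len(v) > 1}


def classify(spread, low=4500, high=10000):
    """Assign a gene to a group using the two cutoffs quoted in the text."""
    if spread <= low:
        return "consistent"
    if spread > high:
        return "discordant"
    return "intermediate"


# Invented signals for one animal: (gene, probe, digitized intensity).
signals = [
    ("g1", "p1", 12000), ("g1", "p2", 13500),
    ("g2", "p1", 30000), ("g2", "p2", 8000),
    ("g3", "p1", 5000),  # single probe, therefore not compared
]
spreads = probe_spread(signals)
print({gene: classify(s) for gene, s in spreads.items()})
```

Genes landing in the "discordant" group would then be the candidates for a homology search of each probe sequence, as described in the abstract.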

1. Okoniewski M. J., Miller C. J. Hybridization interactions between probesets in short oligo microarrays lead to spurious correlations//BMC Bioinformatics.- 2006. – Vol.7. – P.276-290


REGULATION OF SPLICING BY SMALL NON-CODING RNAs EKATERINA KHRAMEEVA 1, ANDREY MIRONOV 1, MIKHAIL GELFAND 2, DMITRI PERVOUCHINE 1

Keywords: splicing, RNA secondary structure, ncRNA

Non-coding RNAs (ncRNAs) are various functional RNA molecules that are not translated into proteins. Non-coding RNA genes include highly abundant and functionally important RNAs such as transfer RNA (tRNA) and ribosomal RNA (rRNA), as well as RNAs such as snRNAs, snoRNAs, microRNAs (mature and precursors), siRNAs, piRNAs and long ncRNAs. The functions of many of these transcripts are still not identified. Several experimental studies suggest regulation of splicing by non-coding RNAs that form secondary structures with regulatory elements on messenger RNA. For example, a small nucleolar RNA (snoRNA), HBII-52, exhibits sequence complementarity to the alternatively spliced exon Vb of the serotonin receptor 5-HT(2C)R. HBII-52 regulates alternative splicing of 5-HT(2C)R by binding to a silencing element in exon Vb, and thereby promotes inclusion of exon Vb. Defects in 5-HT(2C)R pre-mRNA processing are known to contribute to the Prader-Willi syndrome. In addition to experimental methods, non-coding RNAs associated with secondary structures can be studied in silico. We performed a comparative genomic analysis of introns in three taxonomic groups (Drosophila, Mammalia and Nematoda) to identify conserved complementary motifs capable of forming stable RNA structures between non-coding RNAs and regions surrounding splice sites. We developed a generic method for the identification of such conserved RNA structures and implemented it, including a visual interface, in Java. Even under the strictest search conditions, hundreds of conserved joint ncRNA-mRNA structures were found for each taxonomic group. Additionally, we looked at the abundance of the predicted structures at different positions relative to splice sites. There are some positions with an unexpectedly high (with respect to random nucleotide context) number of predicted structures.
As expected, we observed many structures formed by snRNAs and donor splice sites. However, we also observed an unexpectedly high number of secondary structures formed by the polypyrimidine tract and long ncRNAs. We suggest that non-coding RNAs can participate in the regulation of splicing to a much broader extent than is currently believed. One possible mechanism is that non-coding RNAs interfere with splice sites or with cis-regulatory elements related to splicing (enhancers, silencers). They can also form secondary structures with both ends of the intron, which would bring the corresponding splice sites closer to each other. Finally, non-coding RNAs can modulate the native secondary structure of the mRNA, thus indirectly changing the accessibility of cis-regulatory elements to splicing factors.

1 Moscow State University, Russian Federation, [email protected], [email protected], [email protected]
2 Institute for Information Transmission Problems, Russian Federation, [email protected]
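The basic idea of searching for complementary segments between an ncRNA and a region near a splice site can be illustrated with a deliberately simplified sketch: find the longest stretch of perfect antiparallel Watson-Crick pairing. This is not the authors' method (which is implemented in Java and also uses evolutionary conservation); G-U wobble pairs, mismatches, bulges and thermodynamic stability are ignored here, the sequences are invented, and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Return the reverse complement of an RNA sequence (A, C, G, U only)."""
    return rna.translate(COMPLEMENT)[::-1]


def longest_duplex(ncrna, target):
    """Longest run of perfect Watson-Crick pairing between `ncrna` and
    `target`. Returns (length, 0-based start of the paired run in target).

    Implemented as a longest-common-substring search between the target
    and the reverse complement of the ncRNA.
    """
    probe = reverse_complement(ncrna)
    best_len, best_end = 0, 0
    prev = [0] * (len(target) + 1)
    for i in range(1, len(probe) + 1):
        cur = [0] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            if probe[i - 1] == target[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], j
        prev = cur
    return best_len, best_end - best_len


# Invented sequences: an 8-nt "ncRNA" and a region near a splice site.
ncrna = "GGAUCCAU"
region = "UUUAUGGAUCCUUU"
print(longest_duplex(ncrna, region))
```

A realistic search would score candidate duplexes by folding energy and keep only those whose complementarity is preserved across related genomes.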


COMMON PREDECESSOR’S EFFECT IN ARCHAEAL GENOMES AND PROTEOMES VLADISLAV VICTOROVICH KHRUSTALEV 1, EUGENE VICTOROVICH BARKOVSKY 1

In this work we analyzed the G+C composition of genomes and the amino acid content of proteomes for 25 completely sequenced archaeal genomes whose data have already been submitted to the Codon Usage Database [1] (www.kazusa.or.jp/codon). They are: Halobacterium sp.; Natronomonas pharaonis; Haloarcula marismortui; Haloquadratum walsbyi; Methanopyrus kandleri; Thermofilum pendens; Pyrobaculum calidifontis, arsenaticum, islandicum and aerophilum; Aeropyrum pernix; Thermococcus kodakarensis; Methanothermobacter thermautotrophicus; Archaeoglobus fulgidus; Thermoplasma acidophilum and volcanium; Metallosphaera sedula; Pyrococcus abyssi, horikoshii and furiosus; Picrophilus torridus; Sulfolobus acidocaldarius, solfataricus and tokodaii; Nanoarchaeum equitans. Total GC-content (G+C), GC-content in the first, second and third codon positions, in fourfold (GC4f) and twofold degenerate sites in third codon positions (GC2f3p), levels of arginine codon usage (Arg2: AGA/G; Arg4: CGX) and levels of amino acid usage were calculated for each coding district from these genomes by the original CGS algorithm (www.barkovsky.hotmail.ru). All the genomic and proteomic data obtained in our research can be summarized in a single hypothesis. There was a strong mutational AT-pressure in the genome of the common predecessor of all archaea. Isoleucine and lysine are coded by GC-poor codons, so their levels increased in the predecessor’s proteome due to AT-pressure. Once the mutator gene (or genes) causing elevated rates of GC to AT transitions had been destroyed, the 3GC levels of genes began to increase. But the mutator gene (or genes) causing elevated rates of GC to AT transversions has not been destroyed. That is why a bias in GC4f and GC2f3p has occurred (GC4f>Arg4) in most of the offspring of the common archaeal predecessor [2].

Arg4 became higher than Arg2 only after the increase in GC4f (after the rates of AT to GC transversions had become significantly higher than the rates of GC to AT ones). This shift, observed in archaeal genes with G+C > 0.6, results not only in an increase of Arg4, but also in increased usage of aspartic acid, histidine and glutamine due to a decrease in lysine usage, as well as in increased usage of valine and threonine due to a decrease in isoleucine usage (see Figure 1a and 1b). In archaeal genomes with strong GC-pressure, arginine is coded preferentially by GC-rich Arg4 codons. In the genome of Haloquadratum walsbyi, which is closely related to GC-rich archaea, GC-content has decreased mostly in third codon positions, while the Arg4>>Arg2 bias still persists. The amino acid content of archaeal proteomes (coded by genomes with G+C<0.6) carries characteristic features (relatively elevated levels of isoleucine and lysine, and relatively decreased levels of alanine, histidine, cysteine and glutamine) due to the common predecessor’s effect. Shifts in amino acid usage in GC-rich archaea can also be due to this effect.

1 Belarussian State Medical University, Belarus, Minsk, Dzerzinskogo 83, [email protected]
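Several of the quantities named in this abstract (GC-content by codon position, GC4f, and the counts of Arg2 and Arg4 codons) can be computed from a coding sequence with a few lines of code. The sketch below is not the CGS algorithm, whose details are not given here; the function names and the example sequence are hypothetical, the standard codon assignments are assumed, and GC2f3p is omitted for brevity.

```python
# First two bases of the eight fourfold degenerate codon families
# (Leu CUN, Val GUN, Ser UCN, Pro CCN, Thr ACN, Ala GCN, Arg CGN, Gly GGN),
# written in DNA letters. Assumes the standard codon assignments.
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}


def gc_fraction(bases):
    """Fraction of G or C among `bases`, or None for an empty list."""
    return sum(b in "GC" for b in bases) / len(bases) if bases else None


def codon_stats(cds):
    """GC-content by codon position, GC4f and arginine codon counts
    for one in-frame DNA coding sequence."""
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    stats = {f"GC{k + 1}": gc_fraction([c[k] for c in codons]) for k in range(3)}
    # GC4f: G+C in the third position of fourfold degenerate codons only.
    stats["GC4f"] = gc_fraction(
        [c[2] for c in codons if c[:2] in FOURFOLD_PREFIXES]
    )
    stats["Arg4"] = sum(c.startswith("CG") for c in codons)      # CGX codons
    stats["Arg2"] = sum(c in ("AGA", "AGG") for c in codons)     # AGA/AGG
    return stats


# Invented six-codon example: ATG CGT AGA GCC GGA TAA.
example = codon_stats("ATGCGTAGAGCCGGATAA")
print(example)
```

Applied gene by gene across a genome, such statistics can be plotted against total G+C to look for the biases discussed in the abstract.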

Figure 1. Amino acid content of proteins coded by genes (arranged by G+C) from 25 archaeal species. Significant differences in amino acid usage are marked by arrows.

1. Y. Nakamura et al. (2000) Codon usage tabulated from the international DNA sequence databases: status for the year 2000. Nucleic Acids Research, 28(1): 292.
2. V.V. Khrustalev, E.V. Barkovsky (2007) Levels of CpG and GpC dinucleotides in coding districts of archaeal genomes. Computational Phylogenetics and Molecular Systematics “CPMS’ 2007”. Conference proceedings. Moscow: KMK Scientific Press Ltd. P.354-357.


CONFORMATIONAL ANALYSIS OF ROTAMER CHANGES UPON PROTEIN-PROTEIN BINDING TATSIANA KIRYS 1, ANATOLY RUVINSKY 2, ALEXANDER TUZIKOV 1, ILYA VAKSER 2

Keywords: rotamers, rotamer library, protein-protein docking

The scale and spectrum of rotamer changes reflect important features of protein interactions [1]. The side chains on the protein surface are less conformationally restricted than the buried ones. Thus the conformations of the surface side chains are more likely to change upon the formation of protein-protein complexes. To better understand the process of side-chain packing in protein-protein interactions, the conformational preferences at the interface area in bound and unbound proteins were examined. Such preferences will be integrated into the procedure for the refinement of protein docking predictions. The analysis was performed on the Dockground [2] dataset of bound and corresponding unbound proteins. All residues in a protein were partitioned into three categories: core (I), semi-exposed (II), and exposed (III). The change of the residue solvent-accessible surface area (SASA) upon binding was used to differentiate the interface residues from the non-interface ones (the interface residues were defined as those with > 1 Å² SASA change). Thus, the residue classification included 3 interface and 3 non-interface residue categories. The distribution of torsional angles was calculated for each residue type, and the correlation between the distributions of corresponding bound and unbound conformations was analyzed. The results (Table 1) show a weakening of the correlation with increasing SASA for all residue types. This trend can be explained by the difference in the numbers of nearest neighbors of surface and core residues (the environment effect [1]). In comparison with the core residues, the surface residues have fewer nearest neighbors. Thus, they are less restricted and more prone to changes upon binding. The degree of correlation shows the expected dependence on the hydrophilicity of residues. Polar residues on average demonstrate a stronger decrease of correlation than the nonpolar ones.
This trend relates to the environment effect and the non-homogeneous distributions of residues in proteins (e.g., the polar residues prefer surface positions and the nonpolar ones are more often found in the protein core).

1 UIIP NAS of Belarus, Surganova, 6, 220012 Minsk, [email protected]
2 Center for Bioinformatics, The University of Kansas, 2030 Becker Drive, Lawrence, KS 66047, USA, [email protected]
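The two ingredients of this analysis, the SASA-based interface definition and the correlation between bound and unbound torsion-angle distributions, can be sketched as follows. The 1 Å² cutoff comes from the abstract; the rest is assumed for illustration only: a single side-chain torsion angle per residue, 30-degree histogram bins, Pearson correlation as the measure, invented angle values, and hypothetical function names.

```python
import math


def is_interface(sasa_unbound, sasa_bound, cutoff=1.0):
    """Interface residue: SASA changes by more than `cutoff` (in Å^2)
    between the unbound and bound structures."""
    return abs(sasa_unbound - sasa_bound) > cutoff


def angle_histogram(angles_deg, bin_width=30):
    """Normalized histogram of torsion angles (degrees) over 0..360."""
    n_bins = 360 // bin_width
    counts = [0] * n_bins
    for a in angles_deg:
        counts[int((a % 360) // bin_width) % n_bins] += 1
    return [c / len(angles_deg) for c in counts]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sx * sy)


# Invented torsion angles for one residue type in one exposure class.
unbound = [-60, -65, 180, 175, 60]
bound = [-62, -58, 178, 60, 65]
r = pearson(angle_histogram(unbound), angle_histogram(bound))
print(round(r, 2))
```

A value of r near 1 means the bound and unbound conformations populate the same regions of torsional space; lower values indicate a redistribution upon binding.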

Table 1. Correlation between corresponding areas of bound and unbound conformations

Amino    Non-interface          Interface
acid     I      II     III      I      II     III
C        0.97   0.90   0.71     0.97   0.82   0.42
P        1.00   0.99   0.93     0.99   0.94   0.94
S        0.98   0.98   0.79     0.96   0.95   0.63
T        1.00   0.94   0.63     0.99   0.88   0.79
V        1.00   0.92   0.88     0.99   0.94   0.88
E        0.87   0.46   0.08     0.70   0.28   0.04
M        0.90   0.09   0.05     0.68   0.12   0.00
Q        0.84   0.32   0.19     0.50   0.27   0.22
K        0.88   0.37   0.26     0.64   0.13   0.00
R        0.75   0.25   0.13     0.43   0.18   0.06
D        0.96   0.87   0.44     0.89   0.80   0.25
F        0.95   0.67   0.77     0.87   0.46   0.53
H        0.93   0.80   0.29     0.79   0.55   0.40
I        0.99   0.74   0.59     0.97   0.65   0.23
L        0.99   0.94   0.84     0.97   0.86   0.68
N        0.94   0.91   0.50     0.89   0.51   0.45
W        0.95   0.32   0.71     0.82   0.18  -0.01
Y        0.96   0.59   0.59     0.89   0.65   0.48

The distributions of torsional angles for each residue type were clustered. The centers of the clusters were defined as rotamers. Two factors, the cluster occupancy and the clustering torsional cutoff, were used to derive libraries of bound and unbound rotamers. These libraries systematically reflect conformational preferences in different protein areas, which is essential for the design of better docking procedures for the unbound proteins.

1. Ruvinsky A.M. and Vakser I.A. (2009), Sequence composition and environment effects on residue fluctuations in protein structures, submitted.
2. Gao Y., Douguet D., Tovchigrechko A. and Vakser I.A. (2007), DOCKGROUND system of databases for protein recognition studies: Unbound structures for docking, Proteins 69: 845–851; http://dockground.bioinformatics.ku.edu.


INTRODUCTION AND APPLICATION OF CELLEXPRESS, A NEW DATABASE FOR STUDYING HUMAN TISSUE SPECIFIC GENE EXPRESSION LARISA KISELEVA 1, RAYMOND WAN 1, PAUL HORTON 1

Background Databases such as NCBI’s GEO and EBI’s ArrayExpress hold a tremendous volume and variety of expression data, and fulfill an invaluable role as primary repositories of gene expression data. However, although some auxiliary annotation, such as tissue type, can be obtained from these repositories, the content, vocabulary and level of detail of the annotation are left up to the data depositor. This makes these repositories less than ideal for analyses that pool data from multiple experiments. For that task, curated databases which provide unified annotation and classification of expression samples are needed.

CellExpress We present CellExpress, a curated database of human gene expression. To create CellExpress, we annotated and organized gene expression data for about 10,000 microarray expression samples of various normal and diseased human tissues found in NCBI’s GEO database. Consulting the literature as necessary, we manually annotated and classified the samples into categories based on tissue type, normal or disease type, specimen type (cell line or tissue), gender and age of subject, etc. We expect CellExpress to facilitate the analysis of gene expression in various human tissue types. To illustrate this we outline two ongoing projects which use CellExpress.

Example Applications First, we suggest a method for visualizing relationships between distinct cell/tissue types. This method employs a correlation similarity metric and the minimum spanning tree concept of graph theory to display inter-cellular relations as networks, with nodes representing particular cell types and edges representing the correlation distance between those cell types. As a result we obtained an undirected acyclic graph in which cell types of similar origin and function appeared to be linked. HAMSTER (Helpful Abstraction using Minimum Spanning Trees for Expression Relations) is our web server for automatic generation of network images (http://hamster.cbrc.jp/). Second, using the annotated gene expression data of the CellExpress database, we extracted the expression levels of genes coding for transcription factors and characterized the features of their expression across normal human tissues. As a result, we identified transcription factors whose expression is specific to one or a few particular tissues, suggesting their role in the regulation of tissue-specific genes.

1 AIST, Japan, [email protected]
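The first application, a minimum spanning tree over correlation distances between cell types, can be sketched in a few lines. This is an illustration under stated assumptions rather than the HAMSTER implementation: the distance is taken as one minus the Pearson correlation, Kruskal's algorithm with a union-find structure is used to build the tree, and the tissue labels and expression values are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def minimum_spanning_tree(profiles):
    """Kruskal's algorithm on the complete graph whose edge weights are
    correlation distances (1 - Pearson r) between samples.

    Returns a list of (sample_a, sample_b, distance) tree edges.
    """
    names = sorted(profiles)
    edges = sorted(
        (1.0 - pearson(profiles[a], profiles[b]), a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
    )
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the component root, compressing the path.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    tree = []
    for dist, a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # the edge joins two components: keep it
            parent[root_a] = root_b
            tree.append((a, b, dist))
    return tree


# Invented expression profiles (four genes) for four hypothetical tissues.
toy_profiles = {
    "t1": [1.0, 2.0, 3.0, 4.0],
    "t2": [1.1, 2.0, 3.2, 3.9],
    "t3": [4.0, 3.0, 2.0, 1.0],
    "t4": [4.2, 2.9, 2.1, 1.0],
}
tree = minimum_spanning_tree(toy_profiles)
print([(a, b, round(d, 3)) for a, b, d in tree])
```

For n samples the tree has n - 1 edges, so even a large collection of tissues can be drawn as a sparse network in which strongly correlated samples are directly linked.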

Availability CellExpress can be downloaded in spreadsheet form from http://cellexpress.cbrc.jp
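The minimum-spanning-tree view described under Example Applications can be illustrated with a minimal sketch. It assumes Pearson correlation as the similarity metric and d = 1 - r as the correlation distance, and it builds the tree with Kruskal's algorithm; the abstract does not state which correlation measure or algorithm HAMSTER uses, and the cell type names and expression values below are invented for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def minimum_spanning_tree(profiles):
    """Kruskal's algorithm on the correlation distance d = 1 - r."""
    edges = sorted(
        (1.0 - pearson(profiles[a], profiles[b]), a, b)
        for a, b in combinations(sorted(profiles), 2)
    )
    parent = {name: name for name in profiles}

    def find(node):
        # Union-find lookup with path halving.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    tree = []
    for dist, a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # the edge joins two separate components
            parent[root_a] = root_b
            tree.append((a, b, round(dist, 3)))
    return tree


# Hypothetical expression profiles (four genes per cell type).
profiles = {
    "hepatocyte": [9.1, 2.0, 1.1, 7.9],
    "liver": [8.7, 2.4, 1.0, 8.2],
    "T cell": [1.2, 8.8, 7.5, 2.1],
    "B cell": [1.0, 9.0, 6.9, 2.6],
}
tree = minimum_spanning_tree(profiles)
for a, b, dist in tree:
    print(f"{a} -- {b}  (distance {dist})")
```

With n cell types the tree always has n - 1 edges, so the picture stays readable even when the full correlation matrix is dense.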


REPLICA-EXCHANGE SIMULATIONS OF AMYLOID GROWTH DMITRI K KLIMOV 1

Experimental data suggest that the onset of Alzheimer’s disease is related to extracellular aggregation of Aβ peptides. One of the key aggregation events is amyloid fibril formation and its growth by deposition of Aβ monomers. However, at present little is known about amyloid growth at the molecular level. To address this issue we use all-atom replica exchange molecular dynamics to explore the thermodynamics of deposition of Aβ peptides on the preformed amyloid fibril (Fig. 1) [1,2]. Consistent with experiments [3], we identify two deposition stages. The first (docking) stage occurs over a wide temperature range and is completed at Td = 380 K. Several lines of evidence, including the analysis of the free energy landscape, suggest that docking is continuous and occurs without free energy barriers. Docking does not result in the formation of ordered structures by incoming peptides and consequently bears similarity to the adsorption of polymers on attractive walls. The second (locking) stage occurs at the temperature Tl = 360 K < Td and is characterized by a rugged free energy landscape (Fig. 2). Locking is associated with the emergence of parallel β-sheets (phase (p)) formed by incoming Aβ peptides with the fibril. Due to the coexistence of (p) with metastable states and the rugged free energy surface, the formation of (p) bears similarity to a first-order transition. The parallel β-sheet structure formed by the edge peptides is consistent with the internal fibril structure resolved experimentally (Fig. 1) [4]. Because locking resembles a first-order transition, it is similar to folding in proteins [5]. We analyzed in detail the energetics of Aβ fibril growth. We showed that considerable variations in fibril binding propensities are observed along the Aβ sequence. The peptides in the fibril and those binding to its edge interact primarily through their N-termini (Fig. 1).
The polarized aggregation interface is rationalized by the uneven distribution of entropic binding costs along the Aβ sequence. We also perturbed the binding free energy landscape by scanning partial deletions of side chain interactions at various Aβ sequence positions. This study led us to the surprising conclusion that strong hydrophobic side chain contacts impede fibril growth by favoring disordered docking of incoming peptides. Therefore, fibril elongation may be promoted by a moderate reduction of Aβ hydrophobicity. Simulation results are tested by comparing in silico and experimental chemical shifts. More importantly, our simulations rationalize some of the available experimental data on amyloid growth and contribute to its microscopic physicochemical description.

1 George Mason University, Virginia, USA, [email protected]
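Replica exchange molecular dynamics, the sampling method named in this abstract, runs copies of the system at different temperatures and periodically attempts to swap neighboring replicas using a Metropolis criterion. The sketch below shows only that standard acceptance rule. The energies are hypothetical, and the two temperatures are borrowed from Tl and Td purely as example values, not as the temperature ladder used in the study.

```python
import math
import random

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol*K)


def swap_probability(temp_i, energy_i, temp_j, energy_j):
    """Metropolis probability of exchanging the configurations of two replicas.

    p = min(1, exp[(beta_i - beta_j) * (E_i - E_j)]), with beta = 1 / (k_B * T).
    """
    beta_i = 1.0 / (K_B * temp_i)
    beta_j = 1.0 / (K_B * temp_j)
    delta = (beta_i - beta_j) * (energy_i - energy_j)
    return 1.0 if delta >= 0 else math.exp(delta)


def attempt_swap(temp_i, energy_i, temp_j, energy_j, rng=random):
    """Return True if the exchange is accepted."""
    return rng.random() < swap_probability(temp_i, energy_i, temp_j, energy_j)


# Hypothetical potential energies (kcal/mol) of two neighboring replicas.
# The colder replica holding the higher energy is always swapped:
p_favorable = swap_probability(360.0, -100.0, 380.0, -120.0)
# The reverse situation is accepted only with Boltzmann-weighted probability:
p_unfavorable = swap_probability(360.0, -120.0, 380.0, -100.0)
print(p_favorable, round(p_unfavorable, 3))
```

Because the rule satisfies detailed balance, each replica still samples the canonical ensemble at its own temperature while configurations diffuse along the temperature ladder.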

Fig. 1 Cartoon representation of Aβ fibril growth. Peptides in grey form the fibril fragment, whereas two incoming peptides in color are bound to the fibril edge. Together with the fibril, they form β-sheets during the locking stage. Binding involves predominantly the N-termini (in red).

[Fig. 2 plot: free energy surface ΔF/RT over the hydrogen bond counts Nphb and Nahb, with the state (p) marked.]

Fig. 2 Locking of Aβ peptides in the fibril is governed by a rugged free energy landscape. The free energy F of the incoming peptide is projected as a function of the numbers of parallel and antiparallel hydrogen bonds (Nphb and Nahb), which describe the formation of parallel and antiparallel β-sheets by the incoming peptide on the fibril edge. The free energy minimum associated with (p) is shown.

Acknowledgements: This work is supported by the NIH grant R01 AG028191.

1. T. Takeda, D. K. Klimov (2009) Replica exchange simulations of the thermodynamics of Aβ fibril growth. Biophys. J. 96:442–452.
2. T. Takeda, D. K. Klimov (2009) Interpeptide interactions induce helix to strand structural transition in Aβ peptides. Proteins Struct. Funct. Bioinform. (doi 10.1002/prot.22406).
3. W. P. Esler, E. R. Stimson, J. M. Jennings et al. (2000) Alzheimer’s disease amyloid propagation by a template dependent dock-lock mechanism. Biochemistry 39:6288–6295.
4. A. T. Petkova, W.-M. Yau, R. Tycko (2006) Experimental constraints on quaternary structure in Alzheimer’s β-amyloid fibrils. Biochemistry 45:498–512.
5. E. I. Shakhnovich, A. V. Finkelstein (1989) Theory of cooperative transitions in protein molecules. I. Why denaturation of globular protein is a first-order phase transition. Biopolymers 28:1667–1680.


FINDING OF MOLECULAR TARGETS AND THEIR LIGANDS FOR BREAST CANCER THERAPY O.N. KOBOROVA 1, D.A. FILIMONOV 1, A.V. ZAKHAROV 1, A.A. LAGUNIN 1, V.V. POROIKOV 1

The low efficacy of current therapy motivates the search for new anticancer drug targets. In recent years, the accumulation of “omics” data on the structural and functional organization of regulatory networks in the cell has made it possible to identify potential targets involved in pathological processes and to select the most promising ones for future drug development. We propose an algorithm for anticancer drug target identification, which is implemented in the NetFlowEx program (see Figure 1).

Figure 1. Scheme of dichotomy model.

The algorithm simulates the behavior of a regulatory network on the basis of a dichotomy model, using microarray data to define the initial states of the network. The simulation is stopped when a selected outcome is reached, which corresponds to the activation or inhibition of a particular fragment of the regulatory network. The effect of pharmaceutical agents that inhibit a particular protein or combination of proteins in the regulatory network is simulated by blockade of single nodes in the network or their combinations [1, 2]. Recently, the method was applied to three groups of breast cancer types (HER2/neu-positive breast carcinomas; invasive ductal carcinoma and ductal carcinoma in situ; invasive ductal carcinoma and/or a nodal metastasis) and to generalized breast cancer, using a fragment of the regulatory network (802 proteins/genes and 1309 interactions between them) that contains proteins involved in cell cycle regulation, apoptosis, breast cancer progression and normal breast development. As a result, individual proteins and their combinations were identified as promising targets for breast cancer therapy. Inhibitors of some of the identified targets are known as potential drugs for the therapy of malignant diseases; for some other targets we identified hits in databases of commercially available samples.

1 Institute of Biomedical Chemistry Rus. Acad. Med. Sci., 10, Pogodinskaya Street, 119121, Moscow, Russia, [email protected]
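The idea of simulating a two-state network and mimicking an inhibitor by blocking a node can be sketched as follows. This is not the NetFlowEx algorithm, whose update rule is not given in the abstract. It assumes a synchronous update in which a node switches on when its active activators outnumber its active inhibitors, and it uses an invented four-node cascade.

```python
def simulate(edges, initial, blocked=(), target=None, max_steps=50):
    """Synchronously update a two-state (0/1) signed network.

    edges   -- (source, destination, sign) triples, sign +1 or -1
    initial -- dict mapping node name to its starting state
    blocked -- nodes held at 0 to mimic an inhibitor
    target  -- node whose activation ends the simulation
    """
    blocked = set(blocked)
    state = {n: 0 if n in blocked else v for n, v in initial.items()}
    for _ in range(max_steps):
        if target is not None and state[target] == 1:
            break  # selected outcome reached
        score = dict.fromkeys(state, 0)
        for src, dst, sign in edges:
            if state[src] == 1:
                score[dst] += sign
        new_state = {}
        for node, value in state.items():
            if node in blocked:
                new_state[node] = 0
            elif score[node] > 0:
                new_state[node] = 1
            elif score[node] < 0:
                new_state[node] = 0
            else:
                new_state[node] = value  # no net input: keep the state
        if new_state == state:
            break  # steady state
        state = new_state
    return state


# Hypothetical cascade with one inhibitory input.
EDGES = [
    ("ligand", "receptor", +1),
    ("receptor", "kinase", +1),
    ("suppressor", "kinase", -1),
    ("kinase", "proliferation", +1),
]
INITIAL = {"ligand": 1, "receptor": 0, "kinase": 0, "suppressor": 0,
           "proliferation": 0}
TARGET = "proliferation"

baseline = simulate(EDGES, INITIAL, target=TARGET)

# Single-node blockades that prevent activation of the target.
hits = sorted(
    node for node in INITIAL
    if node != TARGET
    and simulate(EDGES, INITIAL, blocked=[node], target=TARGET)[TARGET] == 0
)
print(baseline[TARGET], hits)
```

Combinations of targets can be screened the same way by passing several nodes in `blocked`, at the cost of a combinatorially growing number of runs.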

The work was supported by European Commission project No. 037590 (FP6-2005-LIFESCIHEALTH-7).

1. O. N. Koborova et al. (2008) Bioinformatics technologies as implication for promising drug target identification. Rus. Biotherapeut. J., 7 (2), 54–56.
2. O. N. Koborova et al. (2009) Modeling of regulatory networks to identify promising drug targets for breast cancer therapy. The Herald of Vavilov Society for Geneticists and Breeding Scientists, 13 (1), 201–207.


INTERACTION OF ANTIBODIES WITH SMALL AROMATIC LIGANDS DARJA SVISTUNOVA 1, VLADIMIR ARZHANIK 2, OLEG KOLIASNIKOV 3

Keywords: Antibodies; Small ligands; Pi-stacking

Antibodies are proteins responsible for antigen recognition in vertebrate organisms. Practically any molecule can be bound by antibodies. The main subject of this study is the binding of antibodies to small aromatic antigens. The structural simplicity of these molecules makes it possible to analyze the antibody-antigen interaction as a whole. In this work 177 structures of antibody-antigen complexes were taken from the PDB database (http://www.rcsb.org/) and compared. Visualization was performed with Swiss PDB Viewer 3.7. The main epitope of the studied ligands is an aromatic ring. Antibodies bind it in a deep pocket that lies between complementarity-determining regions (CDR) H3 and L3 and is formed by aromatic residues. In most cases the aromatic ring of the ligand was positioned parallel to one or two aromatic side chains of the binding site at a distance of 3.5–4 Å. This arrangement of aromatic rings indicates pi-stacking. This interaction is common in biological macromolecules, but it is usually mentioned only in descriptions of nucleic acids. It should be noted that we observed only the parallel stacking layout; no examples of T-shaped arrangements were found. We also found pi-stacking interactions in complexes of some of the studied ligands with enzymes, but the structural motifs of the stacked residues differed from those in antibodies. In antibodies, this interaction was observed most frequently for residues in positions H33, H95, L32 and L93. As an example, we considered aromatic residues in position H95, which belongs to CDR H3, and found a correlation between the side chain conformation and the conformation of the CDR H3 “torso” region. We therefore conclude that small aromatic ligands bind to antibodies via pi-stacking. The few exceptions are cage-like antigens (such as morphine), organometallic complexes (e.g. ferrocene) and structures of unnatural antibodies, abzymes, which are specially selected for catalytic purposes. The project was supported by a grant of the Council of the Club of Kolmogorov School for Physics and Mathematics.

1 Lomonosov Moscow State University, Kolmogorov Advanced Education and Scientific Center, Russian Federation, [email protected]
2 Lomonosov Moscow State University, Faculty of Bioengineering and Bioinformatics, Russian Federation, [email protected]
3 Lomonosov Moscow State University, Kolmogorov Advanced Education and Scientific Center, Russian Federation, [email protected]
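The geometric criterion used above (rings lying parallel at 3.5–4 Å) can be checked automatically once ring atom coordinates are available. The sketch below is an illustration, not the authors' procedure, which relied on visual inspection in Swiss PDB Viewer. It assumes that the distance is measured between ring centroids and that "parallel" means an angle below 20° between ring normals; both choices, and the idealized hexagon coordinates, are assumptions.

```python
import math


def centroid(points):
    """Geometric center of a list of (x, y, z) atoms."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def ring_normal(points):
    """Unit normal of a roughly planar ring (Newell's method).

    Atoms must be listed in the order in which they occur around the ring.
    """
    nx = ny = nz = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(points, points[1:] + points[:1]):
        nx += (y1 - y2) * (z1 + z2)
        ny += (z1 - z2) * (x1 + x2)
        nz += (x1 - x2) * (y1 + y2)
    length = math.sqrt(nx * nx + ny * ny + nz * nz)
    return (nx / length, ny / length, nz / length)


def stacking_geometry(ring_a, ring_b):
    """Return (centroid distance in Å, angle between ring planes in degrees)."""
    distance = math.dist(centroid(ring_a), centroid(ring_b))
    na, nb = ring_normal(ring_a), ring_normal(ring_b)
    cos_angle = abs(sum(p * q for p, q in zip(na, nb)))
    angle = math.degrees(math.acos(min(1.0, cos_angle)))
    return distance, angle


def is_parallel_stack(ring_a, ring_b, min_dist=3.5, max_dist=4.0, max_angle=20.0):
    """True for a parallel arrangement within the given distance window."""
    distance, angle = stacking_geometry(ring_a, ring_b)
    return min_dist <= distance <= max_dist and angle <= max_angle


def hexagon(z, radius=1.39):
    """Idealized six-membered ring in the plane at height z (hypothetical)."""
    return [(radius * math.cos(k * math.pi / 3),
             radius * math.sin(k * math.pi / 3), z) for k in range(6)]


ligand_ring = hexagon(0.0)
side_chain_ring = hexagon(3.7)  # parallel ring 3.7 Å above the ligand
t_shaped_ring = [(x, 5.0, y) for x, y, _ in hexagon(0.0)]  # perpendicular ring

print(stacking_geometry(ligand_ring, side_chain_ring))
print(is_parallel_stack(ligand_ring, t_shaped_ring))
```

A T-shaped contact would show up as an angle near 90°, which is how the absence of such arrangements could be confirmed across a set of structures.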


SIMILAR CURVED MOTIF SURROUNDS CENP-B BOX IN DIFFERENT CENTROMERIC SATELLITE DNA ALEKSEY KOMISSAROV 1, OLGA PODGORNAYA 1

Keywords: centromere, satellite DNA, neocentromere, DNA structure

Centromere-specific proteins are well conserved in evolution despite the lack of similarity between the sequences to which they bind. One key paradox that must be explained is the species-specificity of the satellites (satDNA), i.e. their high variability in evolution, versus the evolutionary conservation of the proteins bound to them, for example CENP-B. The possibility of a structure-specific mechanism of protein binding to centromeric (CEN) satDNAs was the reason to look for a sequence superstructure. Computer analysis of satellite DNA was previously done using the wedge model [Bolshoy, 1991], which describes how the curved state depends on particular nucleotide sequences. Mouse major satellite (MaSat) and centromeric mouse minor satellite (MiSat), which differ greatly in their sets of binding proteins, also differ in curvature. MaSat is “curved”, resembling a classical MAR (Matrix Attachment Region), while MiSat adopts the form of a stretched helix, i.e. is rather “straight”. The analysis was extended to all centromeric sequences. In the budding yeast S. cerevisiae, centromeres from all 16 chromosomes were tested functionally, and we found that they bind yeast CENP-B. All the fragments tested are “straight”. The human CENP-B box containing α21-I is also rather “straight”, as is the inner part of the CEN of the yeast S. pombe. In this work, we calculated the curvature, curvature angle and bend angle parameters for satDNA arrays. Arrays were extracted from tandem repeats found by the program TRF [Benson, 1999] in 4 genomes according to the following criteria: a long tandem repeat array of more than 2 kb containing a CENP-B box or with a proven CEN location. MiSat arrays, being well described, were chosen to search for motifs in DNA curvature. In all MiSat arrays we found a stable motif of DNA curvature around the CENP-B box. The motif spans 100 to 120 bp, compared with ~15 bp for the CENP-B box itself. We performed a search for this motif in satDNA arrays from other genomes.
The same motif exists in human alpha satellite, in African green monkey alpha satellite, and in known centromeric sequences of S. pombe and S. cerevisiae. Little is known about C. elegans CEN satDNA. The CEN of this species is defined as diffuse. Ten CENP-B box containing arrays longer than 2 kb were found among the satDNA arrays of C. elegans. In all of them the CENP-B box is surrounded by the same curved motif. We suggest that precisely these satDNA arrays are the CEN foundation in C. elegans. Evidently, the exact base sequence is not the determining factor. CEN satDNA arrays of different animals are composed of very different sequences (alignment approaches to the comparison of CEN sequences failed to find common features), and yet CENP-B can bind them owing to their structural similarity. It seems that more than one type of DNA is able to fulfil the CEN function even within a single species. Cloning and detailed structural analysis of 80 kb of DNA corresponding to the core protein-binding domain of one of the so-called “analphoid” (α-satDNA-negative) marker chromosomes, the human chromosome 10-derived neoCEN (NC10), have been published. The sequence is devoid of human CEN α-satellites, γ- and β-satDNA, and the periCEN ATRS (A/T rich sequences) and 48 bp repeat DNA. Searching NC10 for AT-rich DNA repeats found no α-satDNA, but instead found 34 other tandem repeats, including 21 tandem copies of a 28 bp repeat (AT28). Comparing the AT28 repeat to α-satDNA, no similarities were found. One copy of a sequence that is related to the CENP-B box motif is present, and a number of copies of other periCEN sequences, including pJα and classical satDNAs I and III (HS3) in non-tandem organization, have been reported [Barry, 1999]. We did find a curvature pattern similar to that of MiSat at several sites in NC10, in spite of the lack of long tandem arrays and of the conventional CENP-B box in NC10. In this case the curvature pattern present around all CENP-B boxes has minor modifications. The structural features revealed are sufficiently preserved to allow sequence recognition by CENP-B, as could be supposed from the range of the species used.

1 Institute of Cytology RAS, 194064, Saint-Petersburg, Russian Federation, [email protected], [email protected]

1. A. Bolshoy et al. (1991) Curved DNA without A-A: experimental estimation of all 16 DNA wedge angles, Proc. Natl. Acad. Sci. USA, 88:2312–2316.
2. G. Benson (1999) Tandem repeats finder: a program to analyze DNA sequences, Nucleic Acids Research, 27:573–580.
3. A. E. Barry et al. (1999) Sequence analysis of an 80 kb human neocentromere, Hum. Mol. Genet., 8:217–227.

MOLECULAR EVOLUTION OF INFLUENZA A VIRUS HEMAGGLUTININ IN CONSIDERATION OF ENZYME PROTEOLYSIS, MASS SPECTROMETRY AND PHYLOGENY ANALYSIS DATA YULIA SMIRNOVA 1, VIKTOR LEBEDEV 1, TATIANA SEMASHKO 1, EKATERINA KROPOTKINA 2, LARISA KORDYUKOVA 3, MARINA SEREBRYAKOVA 4

Keywords: Influenza virus, hemagglutinin, molecular evolution, enzyme proteolysis, S-acylation, mass spectrometry

Influenza A virus hemagglutinin (HA) is a major transmembrane glycoprotein of the lipoprotein envelope surrounding the viral nucleocapsid. The HA molecules form homotrimeric spikes that mediate attachment of the virion to the cell surface at neutral pH and delivery of the viral genome into the cell cytoplasm via fusion of the viral and endosomal membranes at acidic pH. The large ectodomain, which carries the antigenic determinants, has been cleaved from the viral surface by the enzyme bromelain, crystallized and studied by X-ray analysis for 5 of the 16 antigenic HA subtypes. Structural information is scarce for the HA anchoring segment of ~45 amino acids (aa), which includes a transmembrane domain (27 aa), an intraviral domain (10–11 aa) and a part of the spike “neck” above the viral membrane (≤10 aa), and which is post-translationally modified by long saturated fatty acids (palmitate, C16, and stearate, C18). It was shown experimentally earlier that this segment contributes to fusion. It is reasonable to suppose that it participates in orienting the HA ectodomain in the viral membrane; however, the features that distinguish it among different HA subtypes are unknown. We have developed a unique approach to elucidate the primary structure of the HA anchoring segment, starting from proteolytic digestion of intact virions, followed by chloroform/methanol extraction and MALDI-TOF mass spectrometry analysis of the HA anchoring segment [1].
Now, using several proteolytic enzymes of different specificity, we have discovered that the HA “neck” region is the primary target for proteolytic attack. Within a narrow region (5–7 aa) the enzymes act according to their specificity as measured with chromogenic low-molecular-weight substrates. The amino acid sequence in the “neck” area is rather conserved, so it is unclear why HA spike cleavability differs significantly between subtypes. Spatial models of the cysteine proteinase bromelain and the serine proteinase subtilisin were created as a first step towards understanding the enzyme interaction with the HA “neck” region. Several phylogenetic trees demonstrating the molecular evolution of the Influenza A virus HA ectodomain have been published. We have built, for the first time, a phylogenetic tree based on the amino acid sequence of the HA anchoring segment of various subtypes and found that it is the same as the tree built for the ectodomain region of HA responsible for membrane fusion [2]. This means that the evolution of the HA anchoring segment is strongly coupled to the evolution of that part of the ectodomain and, very probably, of the whole ectodomain. Possibly, the different cleavability of the spike “neck” region, as well as the different palmitate/stearate ratio found for strains of various subtypes, contributes to the HA homotrimer organization within the viral envelope. This work was supported by RFBR grants 09-04-01160 and 09-03-01007.

1 Department of Bioengineering and Bioinformatics, Moscow State University, Russian Federation, [email protected], [email protected], [email protected]
2 Chumakov Institute of Poliomyelitis and Viral Encephalitides, RAMS, Russian Federation
3 Belozersky Institute of Physico-Chemical Biology, Moscow State University, Russian Federation, [email protected]
4 Institute of Physico-Chemical Medicine, Moscow, Russian Federation, [email protected]

1. M. V. Serebryakova et al. (2006) Mass spectrometric sequencing and acylation character analysis of C-terminal anchoring segment from Influenza A hemagglutinin, Eur. J. Mass Spectrom., 12:51–62.
2. R. J. Russell et al. (2008) Structure of influenza hemagglutinin in complex with an inhibitor of membrane fusion, Proc. Natl. Acad. Sci. USA, 105:17736–17741.


AN ONLINE TOOL FOR SEARCH OF CORRELATIONS BETWEEN SEQUENCES OF DNA-BINDING PROTEINS AND THEIR BINDING SITES YURIY KOROSTELEV 1, OLGA LAIKOVA 2, ALEXANDRA RAKHMANINOVA 1, MIKHAIL GELFAND 2

Keywords: Protein-DNA binding

Transcription factors play a major role in the regulation of gene expression. By binding specifically to DNA sites, they either promote binding of RNA polymerase, and hence effective initiation of transcription, or repress it. Such specific binding is key to the regulation of the cell cycle, tissue differentiation, responses to changes in the environment, etc. Protein-DNA binding is one of the major problems of structural biology and bioinformatics. We have developed an online tool (available at http://www.bioinf.fbb.msu.ru/Prot-DNA-Korr) that analyzes statistical properties of transcription factors and their DNA binding sites and uses them to identify positions important for specific protein-DNA recognition. As a test, the program was applied to the LACI family of bacterial transcription factors. The resulting list of correlated positions is almost identical to the list of specific protein-DNA contacts. We then studied how single correlations group into more complex mutual correlations between a group of amino acid positions and a group of nucleotide positions in the binding site. These mutual correlations turned out to be universal for the whole LACI family. We therefore believe that the correlations found reflect universal mechanisms of protein-DNA recognition. We further analyzed another family, NRTR, for which structural data were not initially available. The structure of an NRTR-family transcription factor with its operator, published later, confirmed our original predictions about contacting pairs of amino acids and nucleotides. In addition, we found a group of residues that do not contact DNA and form a hydrophobic cluster above the recognition helix. We believe these residues orient the recognition helix in the major groove of DNA. Substitutions in these residues may therefore greatly alter the specificity of the transcription factor.

Currently, correlations are being studied in five families of DNA-binding proteins: LACI, NRTR, CRP-FNR, N4-N6 methyltransferases and C-proteins. Further development of this approach will make it possible to use the obtained rules and regularities to predict the specificity of transcription factors prior to labor- and time-consuming experiments, as well as to predict the consequences of mutations in transcription-factor binding sites (increase or decrease of affinity, changes of the recognized motifs).

1 Faculty of Bioengineering and Bioinformatics, Moscow State University, Russian Federation, [email protected]
2 Institute of Information Transmission Problems, Russian Federation, [email protected]
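The search for correlated pairs of positions can be illustrated with a small sketch. The abstract does not name the statistic used by the online tool, so mutual information between an alignment column of the proteins and a column of their binding sites is used here as one common choice; the sequences and the 0.5-bit threshold are invented for illustration.

```python
from collections import Counter
from math import log2


def mutual_information(column_a, column_b):
    """Mutual information (in bits) between two paired columns."""
    n = len(column_a)
    joint = Counter(zip(column_a, column_b))
    freq_a = Counter(column_a)
    freq_b = Counter(column_b)
    return sum(
        (count / n) * log2(count * n / (freq_a[a] * freq_b[b]))
        for (a, b), count in joint.items()
    )


def correlated_positions(proteins, sites, threshold=0.5):
    """List (protein position, site position, MI) for pairs above threshold.

    proteins[k] and sites[k] must belong to the same factor/site pair, and
    both sets of sequences are assumed to be aligned already.
    """
    pairs = []
    for i in range(len(proteins[0])):
        prot_col = [p[i] for p in proteins]
        for j in range(len(sites[0])):
            site_col = [s[j] for s in sites]
            mi = mutual_information(prot_col, site_col)
            if mi >= threshold:
                pairs.append((i, j, round(mi, 3)))
    return pairs


# Hypothetical aligned fragments: four factors and their binding sites.
proteins = ["RKA", "RKS", "QKA", "QKS"]
sites = ["TGA", "TGC", "TAA", "TAC"]

print(correlated_positions(proteins, sites))
```

On real data the raw score would need a significance estimate, for example by shuffling the pairing between proteins and sites, because closely related family members inflate apparent correlations.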


A KNOWLEDGE-RICH APPROACH TO DRUG DISCOVERY EKATERINA KOTELNIKOVA 1, NIKOLAI DARASELIA 2

A large amount of heterogeneous biological data has already been published, including information about molecular interactions, chemical reactions, high-throughput experiments, biological algorithms and so on. However, it is very hard to integrate all this information manually in order to formulate a biologically meaningful hypothesis or to suggest a new drug for a specific disease. It is therefore important to use tools and databases that help to combine different approaches to drug discovery. To devise a possible drug discovery workflow, we used publicly available microarray experiments as well as the Ariadne Genomics Pathway Studio software with the ResNet database [1] and the ChemEffect database of chemical effects. Both databases were automatically constructed using MedScan, a natural language processing engine for MEDLINE abstracts [2]. Here we tried to find potentially effective chemicals for glioblastoma, one of the most common, aggressive and invasive types of primary brain tumor in humans, which does not yet have an effective treatment. Two different approaches were applied to predict glioblastoma-effective chemicals. The first approach was literature-oriented. The glioblastoma signaling pathway was constructed in Pathway Studio on the basis of literature data and the ResNet database. Since it could be more effective for a drug to target multiple proteins in the activated pathway, compounds known to affect multiple proteins in the glioblastoma pathway were found using ChemEffect database entries. The second approach was data-oriented. We took a glioblastoma microarray experiment from the GEO database, determined the differential expression of glioblastoma tissue vs normal brain, and applied the Sub-Network Enrichment Analysis Tool, a statistical test similar to the Broad Institute Gene Set Enrichment Analysis (GSEA). Sub-networks were built dynamically around all proteins and represent their expression targets in the database. This tool allowed us to identify key regulators of differentially expressed genes in glioblastoma. Two key activators that are important for angiogenesis, CYR61 and NOV, were found using this analysis.

Using the ChemEffect database, we found an intersection between the chemical inhibitors of the key activators found by the Sub-Network Enrichment Analysis Tool and the chemicals found to target multiple proteins in the glioblastoma pathway: Fulvestrant.

1 Ariadne Genomics, Russian Federation, [email protected]
2 Ariadne Genomics, United States

1. A. Nikitin, S. Egorov, N. Daraselia, I. Mazo (2003) Pathway studio – the analysis and navigation of molecular networks, Bioinformatics, 19(16): 2155–2157.
2. N. Daraselia, A. Yuryev, S. Egorov, S. Novichkova, A. Nikitin, I. Mazo (2004) Extracting human protein interactions from MEDLINE using a full-sentence parser, Bioinformatics, 20(5): 604–611.


SYSTEMS BIOLOGY APPROACH TO STUDY MORPHOGENETIC FIELD KONSTANTIN KOZLOV 1, EKATERINA MYASNIKOVA 1, MARIA SAMSONOVA 1

In this study we apply the systems biology approach to characterize the mechanisms of cell fate determination and developmental robustness in a morphogenetic field. We use segment determination in the fruit fly as a model system. At the core of our methodology is the extraction of accurate quantitative data on the expression of all network genes. We have developed a pipeline of original methods for the acquisition of quantitative data on segmentation gene expression from experimental images obtained with a confocal microscope. Each fixed fly embryo is scanned for the expression of up to three genes at once and, optionally, for the concentrations of proteins that mark nuclei. The digital images from one embryo are then segmented and the quantitative data are read off in the form of a table in which the coordinates and the concentrations of three proteins are calculated for each segmented nucleus. The individual quantitative patterns are then subjected to a background removal procedure that reduces the effect of nonspecific antibody staining. Because of the individual variability of embryos, the gene expression patterns need to be registered. We apply a point mapping technique based on the extraction of a small set of characteristic features (GCP) from each pattern. The coordinates of nuclei along the A-P axis are affine transformed to align the GCPs in all patterns. The pipeline also includes methods for the estimation and correction of data errors that arise in the course of fluorescence quantification. This pipeline allows us to construct a spatiotemporal atlas of gene expression in situ. The atlas is freely available in the FlyEx database at http://urchin.spbcas.ru/flyex/ [1]. The atlas includes the reference expression patterns of 13 genes involved in segment determination in the early Drosophila embryo. Each nucleus in each individual pattern corresponds to the closest nucleus in the reference pattern. The expression of each gene in each nucleus of the reference pattern equals the average expression over the corresponding nuclei.
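The registration step, in which A-P coordinates are affine transformed so that the characteristic features (GCP) of each pattern coincide with those of a reference, can be sketched in one dimension as a least-squares fit of x' = a·x + b. This is an illustrative assumption about how the alignment could be computed, not the GCPReg implementation, and all positions below are hypothetical.

```python
def fit_affine(source, target):
    """Least-squares scale a and shift b such that a * source + b ~ target."""
    n = len(source)
    mean_s = sum(source) / n
    mean_t = sum(target) / n
    var_s = sum((s - mean_s) ** 2 for s in source)
    cov = sum((s - mean_s) * (t - mean_t) for s, t in zip(source, target))
    a = cov / var_s
    b = mean_t - a * mean_s
    return a, b


def register(coordinates, features, reference_features):
    """Map nuclear A-P coordinates of one embryo onto the reference frame."""
    a, b = fit_affine(features, reference_features)
    return [a * x + b for x in coordinates]


# Hypothetical feature positions along the A-P axis.
embryo_features = [30.0, 45.0, 60.0]
reference_features = [35.0, 51.5, 68.0]

a, b = fit_affine(embryo_features, reference_features)
registered = register([30.0, 40.0, 60.0], embryo_features, reference_features)
print(round(a, 3), round(b, 3), [round(x, 2) for x in registered])
```

After registration, nuclei from different embryos can be matched to the closest reference nucleus and their expression values averaged, as described for the atlas.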

The majority of the methods used in our pipeline can be easily adapted to images of other organisms. All methods are implemented as separate software modules that can be assembled into a complex data processing scenario by means of the graphical user interface of the software package ProStack (Processing Stacks). ProStack is designed to automate the analysis of 2D and 3D digital images of biological objects. ProStack includes geometrical, morphological, histogram, segmentation and other domain-specific and domain-independent image processing methods. The processing operations can be tuned to ensure customization and flexibility without loss of efficiency. A designed scenario can be saved as a complex program module and re-used in other scenarios. We have developed an easy-to-use interactive graphical interface, GCPReg, that can be used for the registration of any one-dimensional data on gene expression, both at the RNA and protein levels, at the resolution of a single cell. GCPReg makes it possible to correctly extract characteristic features even from poorly resolved expression patterns, such as patterns of segmentation gene expression in wild type embryos at cleavage cycle 13 and early cleavage cycle 14A, in mutant embryos, or complicated segment-polarity-like patterns. Rich means of data visualization enable the user to estimate both the accuracy of registration and the quality of the resulting reference data.

1 St.Petersburg State Polytechnical University, Russian Federation, [email protected], [email protected], [email protected]

This work is supported by NIH grant RR07801, GAP award RUB1-1578 and RFBR grants 08-04-00712-a, 08-01-00315-a and 09-04-01590-а.

1. Andrei Pisarev, Ekaterina Poustelnikova, Maria Samsonova, John Reinitz (2009). FlyEx, the quantitative atlas on segmentation gene expression at cellular resolution. Nucl. Acids Res., 37: D560 - D566.


EST-BASED BIOINFORMATIC APPROACHES TO IDENTIFICATION OF CANCER BIOMARKERS GEORGE KRASNOV 1, NINA OPARINA 1, MASHKOVA TAMARA 1, SERGEY BERESTEN 1

Keywords: EST, bioinformatics, expression, cancer

A method for EST database analysis has been developed. Our approach has several advantages over analogous methods: the EST counts are normalized not only by the total representation of the library but also by the level of selected control genes appropriate for the cancer under study. Only genes with the most stable differences between normal and tumor libraries are selected. Additional analysis of aberrant transcript variants makes it possible to select only putative protein-coding ESTs. Our method also produces a rating of genes with probably stable expression levels, most of which are housekeeping genes. We present the results of normal-versus-cancer comparisons for several human cancers using our approach to EST analysis. Manual curation of EST libraries is the first stage of our work. Several new genes characterized by stable differential expression between normal and tumor tissues were found. We have compared our approach with existing analogs [1, 2]. In our method the EST counts are normalized not only by the total representation of the library but also by the level of selected control genes appropriate for the cancer under study. This helps to avoid false-positive results caused by a general increase or decrease in expression levels. The large number of cancer EST libraries allows us to perform inter-library comparison rather than pooled analysis. This approach lets us select the “top” genes with the most stable differences between normal and tumor libraries. Additional filtering of aberrant transcripts selects only putative protein-coding ESTs, making our results much more suitable for transcriptome-proteome comparison. We also produce a rating of probably stable, non-aberrant genes, most of which are housekeeping genes. Indeed, only some of the known housekeeping genes remain stably expressed in cancer cells. Thus our method is useful for identifying new candidate control genes appropriate for the studied cancers.
For this purpose we take into account splice aberrations such as intron retention or exon skipping, the stability of the gene expression level between normal and cancer libraries, and the frequency of cancer-related mutations in the transcript.

1 Engelhardt Institute of Molecular Biology Russian Academy of Sciences, Russian Federation, [email protected]
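
The two-step normalization described above can be illustrated with a minimal Python sketch. The data layout, the helper name `normalize_est_counts`, and the use of the mean control-gene level as the second scaling factor are assumptions made for illustration; the abstract does not give the exact formula.

```python
def normalize_est_counts(counts, control_genes):
    """Normalize the EST counts of one library (hypothetical helper).

    counts: dict mapping gene name -> raw EST count in the library.
    control_genes: genes assumed to be stably expressed in this cancer type.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("empty library")
    # Step 1: scale by the total representation of the library.
    per_library = {g: c / total for g, c in counts.items()}
    # Step 2: scale by the mean level of the control genes (an assumption).
    control_level = sum(per_library.get(g, 0.0) for g in control_genes) / len(control_genes)
    if control_level == 0:
        raise ValueError("control genes are not expressed in this library")
    return {g: v / control_level for g, v in per_library.items()}
```

Values normalized this way are expressed relative to the control genes, so a global rise or fall of expression in a library cancels out before libraries are compared.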

RECONSTRUCTING ANCESTRAL MULTI-DOMAIN PROTEINS ROLAND KRAUSE 1

Proteins are organized into discrete domains which can evolve separately. Some protein domains recombine frequently, and a complete evolutionary history of proteins must include insertion and deletion of domains in addition to duplication of complete genes. I present the improvement and first evaluation of a published suggestion for the reconstruction of the evolution of multi-domain proteins using reconciled gene trees. We improve on a computationally expensive step in the partition and evaluation of the ancestral domain composition and explore how to deal with inconsistencies in the gene trees.

1 Free University Berlin / MPI for Molecular Genetics, Ihnestr. 73, 14195 Berlin, Germany, [email protected]

PERIODIC PATTERNS IN B.SUBTILIS PROMOTER STRUCTURE ARE ASSOCIATED WITH PROMOTER SELECTIVITY BY DIFFERENT FORMS OF RNA POLYMERASE HOLOENZYME

G. KRAVATSKAYA 1, YU. KRAVATSKY 1, YU. MILCHEVSKY 1, N. ESIPOVA 1

Bacterial DNA-dependent RNA polymerase (RNAP) consists of several subunits, i.e., 2α, β, β′, ω and σ. The σ subunit provides specificity in promoter recognition and contributes to the separation of DNA strands during the formation of an open promoter–RNAP complex in transcription initiation [1]. We have demonstrated that the spectral pattern of the nucleotide sequence is an informative characteristic of functional DNA regions such as promoters and replication initiation sites [2–5], and revealed [3] a set of contacts between minimal RNAP and promoter DNA comprising periodically disposed nucleotides. This periodicity becomes more pronounced when the RNAP complex includes a σ factor. Does this periodicity play any role in the specific interaction of DNA with the RNAP holoenzyme? In this work we assessed the possible connection between the promoter spectral patterns and the selectivity of RNAP holoenzymes containing different σ factors. Using a special version of Fourier analysis for symbolic sequences [6,7,8], Fourier spectra were obtained for the primary structure of promoters recognized by one of several Bacillus subtilis RNA polymerase holoenzymes. Nucleotide sequences of the promoters and the data on the RNAP σ factors that recognize these sequences were drawn from DBTBS (http://dbtbs.hgc.jp; only promoters with known transcription start points, 643 promoters). The sequences were supplemented with the corresponding nucleotides from the Bacillus subtilis genome to cover positions –75 to +25 around the transcription start point. Unlike previous researchers, we did not focus on the conventional “consensus”, “–10” and “–35” motifs obtained upon statistical averaging of the promoter sets, but examined the promoter sequences over a broader range. Stepwise discriminant analysis with jackknife testing was performed for different promoter data sets.
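
As an illustration of what a Fourier spectrum of a symbolic sequence can look like, the following Python sketch computes the power of each harmonic for the indicator sequence of every nucleotide. The indicator-sequence transform, the 1/M scaling and the function names are assumptions; the specific version of the analysis used in [6,7,8] may differ.

```python
import cmath

def structure_factors(seq, alphabet="ACGT"):
    """Return {nucleotide: [power at harmonic n = 1 .. M//2]} for a DNA string."""
    M = len(seq)
    spectra = {}
    for base in alphabet:
        # Indicator sequence: 1 where this nucleotide occurs, 0 elsewhere.
        indicator = [1.0 if ch == base else 0.0 for ch in seq]
        powers = []
        for n in range(1, M // 2 + 1):
            q = 2 * cmath.pi * n / M  # wave number of harmonic n
            amp = sum(x * cmath.exp(-1j * q * m) for m, x in enumerate(indicator))
            powers.append(abs(amp) ** 2 / M)  # squared amplitude, scaled by 1/M
        spectra[base] = powers
    return spectra

def period_of(n, M):
    """Period (in nucleotides) corresponding to harmonic n of a length-M sequence."""
    return M / n
```

For a 100-nucleotide window such as –75 to +25, harmonic n corresponds to a period of 100/n nucleotides; the resulting spectral heights are the kind of features that could be passed to a discriminant analysis.
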
Based on the spectral patterns of the nucleotide sequences, the data sets could be sorted with 100% accuracy into one of four classes according to the type of corresponding sigma factor (A, B, D, H), with 100% accuracy into one of five classes according to the type of corresponding sigma factor (E, F, G, H, K), and with more than 85% accuracy into one of 11 classes according to the type of corresponding sigma factor (A, B, D, E, F, G, H, K, W, X, L). Bacillus subtilis promoters recognized by different forms of RNA polymerase were found to have different periodic patterns of nucleotide disposition. The set of the most significant periods revealed by discriminant analysis unambiguously assigns the promoters recognized by holoenzymes with different sigma factors. Thus, the periodicity in nucleotide distribution along the DNA chain is itself an attribute sufficient for selective recognition of the cognate promoter by RNA polymerase. The study was supported by the Russian Foundation for Basic Research, project no. 07-04-01765a.

1 Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Moscow, 119991 Russia, [email protected]

1. J. D. Helmann and M. J. Chamberlin (1988) Ann. Rev. Biochem. 57:839-872.
2. G. I. Kutuzova, G. K. Frank, V. Yu. Makeev, et al. (1997) Biofizika 42:354-362 [Biophysics 42:335-343].
3. G. I. Kutuzova, U. K. Frank, V. Yu. Makeev, et al. (1999) Biofizika 44:216-223 [Biophysics 44:216-223].
4. N. G. Esipova, G. I. Kutuzova, V. Yu. Makeev, et al. (2000) Biofizika 45:432-438 [Biophysics 45:421-427].
5. G. I. Kravatskaya, G. K. Frank, V. Yu. Makeev, and N. G. Esipova (2002) Biofizika 47:595-599 [Biophysics 47:553-556].
6. V. R. Chechetkin and A. Yu. Turigin (1995) J. Theor. Biol. 175:477-494.
7. V. Yu. Makeev and V. G. Tumanyan (1996) CABIOS 12(1):49-54.
8. G. I. Kravatskaya, Yu. V. Kravatsky, Yu. V. Milchevsky, and N. G. Esipova (2007) Biophysics 52:521–526.


PREDICTING RNA SECONDARY STRUCTURES INCLUDING PSEUDOKNOTS ANDREY KRAVCHENKO 1, RUNE LYNGSO 2

Keywords: RNA Secondary Structure, Pseudoknots, NNTM

RNA secondary structures play a vital role in modern genetics, and a lot of time and effort has been put into their study. It is important to be able to predict them with high accuracy, since methods involving manual analysis are expensive, time-consuming and error-prone. Predictions can also be used to guide experiments and so reduce time and money requirements. Several algorithms have been developed for this task. Most of them assume that the desired secondary structure will not contain pseudoknots. However, pseudoknots, though not occurring that often, play an important role in a secondary structure as a whole. This report describes in detail the full thermodynamic model used to predict secondary structures without pseudoknots and the associated algorithm. It proceeds to extend the model to include a restricted class of pseudoknots and presents an efficient algorithm for the prediction of structures within this class. This algorithm has a time complexity of O(n^4) and a space complexity of O(n^2), which makes it competitive with other known algorithms that take pseudoknots into account.
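
The algorithms in this report use the full nearest-neighbour thermodynamic model (NNTM), which is not reproduced here. As a much simpler stand-in, the following Python sketch uses Nussinov-style base-pair maximization to show the kind of dynamic programming over subsequences on which pseudoknot-free prediction relies. It runs in O(n^3) time and O(n^2) space; the minimum hairpin loop length of 3 and the set of allowed pairs (including G-U wobble pairs) are assumptions of the sketch.

```python
def max_base_pairs(rna, min_loop=3):
    """Maximum number of nested (pseudoknot-free) base pairs in an RNA string."""
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(rna)
    # best[i][j] = maximum number of pairs within rna[i..j] (inclusive)
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i + 1][j]  # case 1: position i is unpaired
            for k in range(i + min_loop + 1, j + 1):  # case 2: i pairs with k
                if (rna[i], rna[k]) in pairs:
                    inside = best[i + 1][k - 1]
                    outside = best[k + 1][j] if k < j else 0
                    score = max(score, 1 + inside + outside)
            best[i][j] = score
    return best[0][n - 1] if n else 0
```

Because every pair (i, k) splits the problem into an independent inside and outside part, crossing pairs can never be produced; handling pseudoknots requires additional table dimensions, which is where the higher O(n^4) time bound comes from.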

1 Oxford University, United Kingdom, [email protected]
2 Oxford University, United Kingdom, [email protected]

RARE VARIANTS BASED ASSOCIATION STUDIES – ARE THEY FEASIBLE? GREGORY KRYUKOV 1

Continuing reduction in cost of DNA sequencing will enable human geneticists to relate complete sequence information in genes and, soon, complete genomes to human traits of clinical relevance. Deep sequencing in large samples promises to reveal a vast trove of rare alleles, a significant fraction of which may be important determinants of complex traits. Although knowledge of all rare variants segregating in the population would seem to increase the power of genetic analysis, this prospect faces daunting statistical challenges. A larger pool of sequence variants would require a more stringent multiple testing correction, while the power to detect an association for less common variants is reduced.

One potential solution to this problem is to pool rare allelic variants by gene, functional non-coding region or pathway. We analyzed the potential of a gene discovery strategy that combines multiple rare variants from the same gene and treats genes, rather than individual alleles, as the units for the association test. Using computer simulations based on deep resequencing data, we showed that genes meaningfully affecting a human trait can be identified in an unbiased fashion, although large sample sizes would be required to achieve substantial power.
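
A minimal Python sketch of such gene-level pooling is given below. It collapses all rare variants of a gene into a single carrier status per individual and compares carriers between cases and controls with a one-sided Fisher exact test. The choice of test, the data layout and the function names are assumptions; the abstract does not state which statistic was used in the simulations.

```python
from math import comb

def carrier_counts(genotypes):
    """genotypes: per-individual lists of rare-allele counts, one entry per variant site.

    Returns the number of individuals carrying at least one rare allele in the gene.
    """
    return sum(1 for person in genotypes if any(count > 0 for count in person))

def fisher_one_sided(case_carriers, n_cases, control_carriers, n_controls):
    """P(at least `case_carriers` carriers among cases) under the hypergeometric null."""
    total_carriers = case_carriers + control_carriers
    n_total = n_cases + n_controls
    denom = comb(n_total, n_cases)
    upper = min(total_carriers, n_cases)
    p = 0
    for k in range(case_carriers, upper + 1):
        p += comb(total_carriers, k) * comb(n_total - total_carriers, n_cases - k)
    return p / denom
```

With one test per gene instead of one per allele, the multiple-testing correction is applied over the number of genes, which is the motivation for treating genes as the units of association.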

1 Genetics Division, Brigham & Women's Hospital and Harvard Medical School, United States

MISHIMA – A NEW HEURISTIC METHOD OF MULTIPLE SEQUENCE ALIGNMENT KIRILL KRYUKOV 1, KAZUHO IKEO 1, TAKASHI GOJOBORI 1, NARUYA SAITOU 1

Large nucleotide sequence datasets are becoming an increasingly common object of comparison. Complete bacterial genomes are reported almost every day. This creates challenges for developing new multiple sequence alignment methods. Conventional multiple alignment methods are based on pairwise alignment and/or on the progressive alignment technique. These approaches have performance problems when the number of sequences is large and with genome-scale sequences. We present a new method of multiple sequence alignment, called MISHIMA (Method for Inferring Sequence History In terms of Multiple Alignment), that does not depend on pairwise sequence comparison. A new algorithm is used to quickly find rare oligonucleotide sequences shared by all sequences. A divide-and-conquer approach, similar to [1], is then applied to break the sequences into fragments that can be aligned independently by an external alignment program. We used ClustalW [2] for this purpose in this study. The partial alignments are assembled to form a complete alignment of the original sequences. MISHIMA provides improved performance compared to the commonly used multiple alignment methods. As an example, four complete bacterial genomes were aligned in less than 3 hours. In another example, 100 complete mtDNA genomes of mammals were aligned in about 40 minutes. Availability: a standalone executable and an online server are available at the MISHIMA homepage [3].
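
The anchor-finding step can be sketched in Python as follows. The sketch takes "rare" to mean "occurring exactly once in every sequence", which is an assumption, and `shared_unique_kmers` is a hypothetical helper rather than the algorithm implemented in MISHIMA.

```python
from collections import Counter

def shared_unique_kmers(sequences, k):
    """Return k-mers that occur exactly once in every sequence, with their positions."""
    shared = None
    positions = []
    for seq in sequences:
        counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
        unique = {kmer for kmer, c in counts.items() if c == 1}
        # Keep only k-mers that are unique in all sequences seen so far.
        shared = unique if shared is None else shared & unique
        positions.append({seq[i:i + k]: i for i in range(len(seq) - k + 1)})
    shared = shared or set()
    return {kmer: [pos[kmer] for pos in positions] for kmer in sorted(shared)}
```

Each shared k-mer gives one position per sequence; cutting all sequences at such anchors yields fragments that an external aligner can process independently.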

1. U. Tonges, S. W. Perrey, J. Stoye, A. W. M. Dress (1996) A General Method for Fast Multiple Sequence Alignment, Gene, 172(1):GC33-GC41.
2. J. D. Thompson, D. G. Higgins, T. J. Gibson (1994) CLUSTALW: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice, Nucleic Acids Research, 22:4673-4680.
3. http://esper.lab.nig.ac.jp/study/mishima/

1 National Institute of Genetics, Japan, [email protected]

MODEL-BASED TIMING OF GENE EXPRESSION

ANDRZEJ KUDLICKI 1, MALGORZATA ROWICKA 2

Keywords: microarrays, gene expression, timecourse, cell cycle, ribosome

We have developed an approach to analysing data from timecourse microarray experiments which takes into account prior information about a model of the expected temporal profile. Using MAP estimation techniques, we reconstruct the timeline of gene expression in a single cell at a very high resolution, surpassing the temporal resolution of the original data. Moreover, the reconstructed information is very robust to experimental noise. We present example results that allow us to infer the timeline of assembly of molecular complexes, including the DNA replication machinery and the ribosome.

1 University of Texas Medical Branch, United States, [email protected]
2 University of Texas Medical Branch, United States, [email protected]

CHIPMUNK: A FAST DNA MOTIF FINDER FOR CHIP DATA AND ITS APPLICATION TO DATA INTEGRATION FROM DIFFERENT EXPERIMENTAL SOURCES IVAN V. KULAKOVSKIY 1, VALENTINA A. BOEVA 2, ALEXANDER V. FAVOROV 3, VSEVOLOD J. MAKEEV 4

Introduction: ChIP-chip and ChIP-seq are current technologies for studying specific DNA-protein interactions. They can yield thousands of sequence fragments, from about three hundred bp (ChIP-seq) up to a few thousand bp (ChIP-chip) long. Each fragment is supposed to contain the binding site for the protein of interest. ChIP data are difficult to process because of the huge number of sequences. Another problem is that such long fragments can contain binding sites of more than one transcription factor. We present the Chipmunk algorithm, which can extract the single optimal motif from large sets of ChIP data. Our tool can also integrate different kinds of experimental sources (such as SELEX, footprinting, etc.) to ensure correct motif extraction.

Motif models: The input data consist of a set of DNA segments with assigned weights. In the case of equal sequence weights, the motif is represented as a Positional Count Matrix (PCM) with integer elements corresponding to the column-specific counts of each nucleotide in the multiple local alignment. For unequal sequence weights, a real-valued Weighted Positional Count Matrix (WPCM) is used. Each sequence in the alignment contributes to the corresponding WPCM elements in proportion to its weight. A PWM can be calculated from the WPCM by the following formula:

$$S_{i,j} = \log \frac{x_{i,j} + q_i}{(N + 1)\, q_i}$$

Here $i \in \{A, C, G, T\}$, $j$ is the position in the motif, $x_{i,j}$ is the WPCM matrix element, $q_i$ is the background frequency of nucleotide $i$, and $N$ is the number of sequences. The quality of a motif represented with a WPCM can be measured with the following statistic for the information content (DIC):

$$I = \sum_{j=1}^{m} \frac{1}{N} \left( \sum_{i} \log x_{i,j}! - \log N! \right).$$

1 Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Vavilov str. 32, Moscow 119991, Russia, [email protected]
2 Curie Institute, 26 rue d'Ulm, 75248, Paris cedex 05, France, [email protected]
3 The Sidney Kimmel Comprehensive Cancer Center at Johns Hopkins, Baltimore, MD 21231, USA, [email protected]
4 State Scientific Institute of Genetics and Selection of Industrial Microorganisms, GosNIIgenetika, 1st Dorozhny proezd, 1, Moscow 117545, Russia, [email protected]
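
The PWM and DIC definitions above can be written out as a short Python sketch. The matrix layout (a list of per-column dictionaries), the function names, and the use of the gamma function to extend the factorial to non-integer weighted counts are assumptions made for illustration.

```python
from math import lgamma, log

BASES = "ACGT"

def pwm_from_wpcm(wpcm, background, n_seqs):
    """wpcm: list of columns, each a dict base -> (weighted) count x_{i,j}."""
    return [
        {b: log((col[b] + background[b]) / ((n_seqs + 1) * background[b])) for b in BASES}
        for col in wpcm
    ]

def log_factorial(x):
    # lgamma(x + 1) equals log(x!) for integers and extends it to real-valued counts
    # (using the gamma function for non-integer weights is an assumption).
    return lgamma(x + 1.0)

def dic(wpcm, n_seqs):
    """I = sum_j (1/N) * (sum_i log x_{i,j}! - log N!)."""
    return sum(
        (sum(log_factorial(col[b]) for b in BASES) - log_factorial(n_seqs)) / n_seqs
        for col in wpcm
    )
```

Under this definition a perfectly conserved column contributes 0 and any mixed column contributes a negative value, so a higher DIC corresponds to a more conserved motif.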

Algorithm: We expect that the probability of sequences with missing motifs is known a priori from the weight values. Thus, the correct motif model is the set of the best motif occurrences, one from each data sequence. To search for the optimal motif we use an EM approach accompanied by bootstrapping. We start from a PWM for a single random word. In the first stage we use a bootstrapping procedure, selecting a random subset of the data sequences with a total weight equal to the square root of N. The matrix optimization is performed on this random subset only for a fixed number of rounds rather than until convergence. The resulting matrix is then optimized on the entire dataset. Then the next round of bootstrapping is performed on the next random sequence sample. The total number of tested random WPCMs, the number of bootstrapping rounds and the number of optimization steps on a random subset are controlled by the “try-limit”, “step-limit” and “iteration-limit” parameters, respectively. We select the motif with the highest DIC. Algorithm scheme and pseudocode representation are given below.
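
The search loop described in this paragraph can be approximated by the following simplified Python sketch. It is not the authors' implementation: it assumes equal sequence weights, a uniform background, and a fixed number of re-alignment steps on the full dataset, and all function and parameter names are hypothetical.

```python
import random
from math import lgamma, log, sqrt

BASES = "ACGT"

def counts(words):
    """Positional count matrix: one dict base -> count per motif column."""
    return [{b: sum(w[j] == b for w in words) for b in BASES} for j in range(len(words[0]))]

def pwm(words, q=0.25):
    """S = log((x + q) / ((N + 1) q)) with a uniform background q (assumed)."""
    n = len(words)
    return [{b: log((col[b] + q) / ((n + 1) * q)) for b in BASES} for col in counts(words)]

def dic(words):
    """I = sum_j (1/N) (sum_i log x_ij! - log N!)."""
    n = len(words)
    return sum((sum(lgamma(col[b] + 1) for b in BASES) - lgamma(n + 1)) / n
               for col in counts(words))

def best_hits(seqs, matrix):
    """Best-scoring window of the motif length in every sequence."""
    m = len(matrix)
    def score(w):
        return sum(col[c] for col, c in zip(matrix, w))
    return [max((s[i:i + m] for i in range(len(s) - m + 1)), key=score) for s in seqs]

def optimize(seqs, matrix, steps):
    """A fixed number of re-alignment rounds (not run to convergence)."""
    for _ in range(steps):
        matrix = pwm(best_hits(seqs, matrix))
    return matrix

def find_motif(seqs, m, tries=5, rounds=3, steps=3, seed=0):
    rng = random.Random(seed)
    k = max(1, round(sqrt(len(seqs))))  # subset size ~ sqrt(N) for equal weights
    best_words, best_dic = None, float("-inf")
    for _ in range(tries):  # cf. "try-limit"
        s = rng.choice(seqs)
        i = rng.randrange(len(s) - m + 1)
        matrix = pwm([s[i:i + m]])  # start from a single random word
        for _ in range(rounds):  # cf. "step-limit"
            subset = rng.sample(seqs, k)
            matrix = optimize(subset, matrix, steps)  # cf. "iteration-limit"
            matrix = optimize(seqs, matrix, steps)  # then refine on the whole set
        words = best_hits(seqs, matrix)
        if dic(words) > best_dic:
            best_words, best_dic = words, dic(words)
    return best_words, best_dic
```

Optimizing on a subset of about sqrt(N) sequences keeps each early step cheap, while the subsequent pass over the whole dataset ties the matrix back to all the data before the next random subset is drawn.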


Results: We tested Chipmunk on human ChIP-seq data for GABP (more than 20 thousand sequences) and NRSF, and on Drosophila ChIP-chip data accompanied by footprinting and SELEX data. Chipmunk was up to 15 times faster than MEME on the same dataset.

The work was supported by Russian Fund of Basic Research projects [07-04-01623 to A.V.F., 07-04-01584 to V.J.M.]; INTAS Project [05-1000008-8028 to V.J.M.]; Russian Federation Agency in Science and Innovation State Contract [02.531.11.9003]; and French INRIA Équipe associé MIGEC.


CHANGES OF SELECTIVE PRESSURE AFFECTING THE ISOENZYMES OF GLYCERALDEHYDE-3-PHOSPHATE DEHYDROGENASE MIKHAIL L. KURAVSKY 1, VLADIMIR I. MURONETZ 1, VLADIMIR V. ALESHIN 1

Glyceraldehyde-3-phosphate dehydrogenase (GAPD, EC 1.2.1.12) is a homotetrameric glycolytic enzyme that catalyzes the phosphorylation of 3-phosphoglyceraldehyde to 1,3-diphosphoglycerate coupled with the reduction of NAD+ to NADH. Mammals are known to possess two tissue-specific GAPD isoenzymes: somatic (GAPD-1) and testis-specific (GAPD-2, GAPDS). Recent studies established that GAPD-1 is not simply a classical metabolic protein involved in glycolytic energy production, but also participates in a number of non-glycolytic processes. In contrast to soluble GAPD-1, mammalian GAPD-2 is tightly attached to the cytoskeleton, supplying the dynein ATPases of the filament with energy. Experimental investigation of GAPD-2 is significantly complicated by the strong association of the protein with the cytoskeleton. As a result, data on GAPD-2 properties are virtually absent. GAPD-1 and GAPD-2 are also found in some other vertebrates besides Mammalia, but their expression is not necessarily tissue-specific in all cases. Based on the phylogenetic trees, it was hypothesized that GAPD diverged into the isoenzymes around the origin of Deuterostomia. However, some species were found to lack one of the isoenzymes. Single-copy genes are thought to evolve conservatively because of strong negative selective pressure. Gene duplications produce a redundant gene copy and thus release one or both copies from negative selective pressure. Thus, duplications should be an important precursor of functional divergence. The increased availability of sequences in public databases allows the investigation of the molecular evolution of the GAPD gene family and the evaluation of selection following duplication events. In the present study, we evaluate the isoenzyme- and lineage-specific changes in selective pressure and look into the metamorphosis of GAPD-2 into a testis-specific protein.
The GAPD isoenzymes of Mammalia and Actinopterygii were examined, as well as GAPD of Insecta, which is thought to have arisen before GAPD-1 and GAPD-2 diverged. Results of dN/dS ratio calculations show that GAPD isoenzymes are under varying selective pressure. Pairwise comparisons of orthologs of Mammalia, Actinopterygii and Insecta suggest strong purifying selection for all isoenzymes. The dN/dS ratio estimates for GAPD-1, GAPD-2 of Actinopterygii and GAPD of Insecta were found to be practically equal, while the estimates for mammalian GAPD-2 were two times greater. Four branch-specific substitution models were used to reveal isoenzyme- and lineage-specific changes of selective pressure more accurately. The likelihood-ratio test indicated significant differences between the rates of non-synonymous substitutions for (1) GAPD-1/GAPD and GAPD-2, and (2) mammalian and actinopterygian GAPD-2. The selective pressures affecting GAPD-1 and GAPD of Insecta were found not to differ significantly. These results suggest that mammalian GAPD-2 does not participate in a number of the non-glycolytic processes peculiar to GAPD-1 and is therefore under weaker selective constraints. If so, this confirms previously obtained results based on the study of short functional motifs of human GAPD isoenzymes. The absence of differences between the selective pressures on GAPD-1 and GAPD of Insecta indicates that GAPD-1 could have retained the functions of the original GAPD, including non-glycolytic ones. Mammalian GAPD-2 presumably specialized by losing the ability to perform some of them. The retention of the isoenzyme is presumably explained by its tissue-specific expression.
On the other hand, the selective constraints on actinopterygian GAPD-2 are just slightly weaker than those on GAPD-1. It is therefore supposed to perform some novel functions, since if two isoenzymes perform the same function in the same tissues, one of them is free from functional constraint and its gene will eventually turn into a non-functional pseudogene or be deleted. The other assumption is that actinopterygian GAPD acts in a heterotetrameric form consisting of both GAPD-1 and GAPD-2 subunits. The work is supported by the Russian Foundation for Basic Research (09-04-01122), Russian Federal Purpose Program (02.512.11.2249), and NATO (PDD(CP)-(CBP.NR.RIG 982779)).

1 Lomonosov Moscow State University, Leninskie Gory, Moscow 119991, Russia, [email protected]
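
The likelihood-ratio test used to compare branch-specific models can be illustrated with a short Python sketch. It assumes two nested models that differ by a single free parameter (for example, one additional dN/dS ratio for a chosen lineage); the abstract does not state the number of parameters of the four models, and the log-likelihood values in the usage line are hypothetical.

```python
from math import erfc, sqrt

def lrt_pvalue_df1(lnL_null, lnL_alt):
    """Likelihood-ratio test for nested models differing by one free parameter."""
    stat = 2.0 * (lnL_alt - lnL_null)
    stat = max(stat, 0.0)  # guard against tiny negative values from optimization noise
    # For 1 degree of freedom the chi-square survival function is erfc(sqrt(stat / 2)).
    return stat, erfc(sqrt(stat / 2.0))

# Hypothetical log-likelihoods of a one-ratio and a two-ratio model.
stat, p = lrt_pvalue_df1(-1000.0, -997.0)
```

A small p-value means that allowing a separate dN/dS ratio on the chosen branches fits the data significantly better than a single shared ratio.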


PATTERNS OF EVOLUTION IN PROTEIN PHOSPHORYLATION SITES YERBOL Z. KURMANGALIYEV 1

Keywords: protein phosphorylation, molecular evolution

Post-translational modifications play an important role in diversifying protein structure and function. Protein phosphorylation is one of the most important and widely distributed types of post-translational modification. In eukaryotes, reversible protein phosphorylation plays a key role in signal transduction and other processes. Recent advances in mass spectrometry have allowed large-scale identification of phosphorylation events. Analyses of these data have already shown some specific features of phosphosites. In particular, it has been demonstrated that phosphoserines and phosphothreonines tend to be located in loops and hinges, whereas phosphotyrosines occur within regions of regular secondary structure [1]. Phosphorylation sites have been shown to be more conserved than generic loops and than non-phosphosites of the same proteins [1]. Comparison of phosphoproteomic sets revealed very old phosphorylation events shared by plants and animals [2]. It has also been shown that phosphorylation sites tend to occur in alternatively spliced protein segments [3]. Since modified amino acids are chemically “new” types of amino acids, they might behave differently in evolution from their non-modified counterparts. Here we compare the substitution patterns of phosphorylated and non-phosphorylated serine residues in three groups of eukaryotes. Human [4], Drosophila [5] and yeast [5] phosphopeptides were mapped to the corresponding proteomes, and only peptides with a unique match were used in the further study. The identified phosphoproteins were aligned with their orthologs (for human proteins among 7 vertebrates, for D. melanogaster among 10 fruit flies, and for S. cerevisiae among 14 yeasts). For each phosphoserine we reconstructed the evolution of the site in the corresponding taxonomic group using a modification of a maximum likelihood algorithm (A.Goland, personal communication).
Since we cannot reconstruct the moment when a serine became modified, we assumed that it coincides with the oldest serine in a given tree. Then we counted the number of substitutions of ancestral phosphoserines to other amino acids, and calculated vectors of substitution frequencies of phosphoserines. The control set consisted of non-phosphorylated serines of the same proteins subjected to the same processing. Differences in the substitution vectors between phosphorylated and non-phosphorylated serines varied among different groups of organisms, but some trends were stable and significant. In particular, in all three groups of organisms, phosphoserines changed more frequently to aspartate and glutamate residues (in comparison with non-phosphoserines) and less frequently to alanine and cysteine. Interestingly, the artificial substitution of serine with aspartate or glutamate is called a phosphomimetic mutation and is widely used to confirm phosphorylation of serine. I am grateful to Alexander Goland, Dmitry Malko, Ekaterina Ermakova and Anna Lyubeckaya who shared their programs and data. This is joint work with M.S.Gelfand. This study was partially supported by RFBR grant 09-04-90907.

1 National Center for Biotechnology of the Republic of Kazakhstan, Astana, 010000, Kazakhstan; Institute for Information Transmission Problems (the Kharkevich Institute) RAS, Bolshoi Karetny pereulok 19, Moscow, 127994, Russia, [email protected]
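
The comparison of substitution vectors can be illustrated with a small Python sketch. The input format (a list of the amino acids to which ancestral serines changed) and the function names are assumptions, and the example lists are hypothetical rather than data from the study.

```python
from collections import Counter

def substitution_vector(substitutions):
    """substitutions: iterable of target amino acids (one-letter codes) to which
    an ancestral serine changed; returns relative frequencies per target."""
    counts = Counter(substitutions)
    total = sum(counts.values())
    return {aa: c / total for aa, c in counts.items()} if total else {}

def frequency_differences(phospho, control):
    """Difference (phospho minus control) of substitution frequencies per amino acid."""
    p, c = substitution_vector(phospho), substitution_vector(control)
    return {aa: p.get(aa, 0.0) - c.get(aa, 0.0) for aa in sorted(set(p) | set(c))}

# Hypothetical substitution targets for phosphoserines and control serines.
diff = frequency_differences(["D", "E", "D", "A"], ["A", "A", "C", "D"])
```

Positive entries mark targets favoured for phosphoserines (D and E in this toy input) and negative entries mark targets that are avoided (A and C).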

1. F. Gnad et al. (2007) PHOSIDA (phosphorylation site database): management, structural and evolutionary investigation, and prediction of phosphosites. Genome Biol. 8:R250.
2. J. Boekhorst et al. (2008) Comparative phosphoproteomics reveals evolutionary and functional conservation of phosphorylation across eukaryotes. Genome Biol. 9:R144.
3. Y. Z. Kurmangaliyev and M. S. Gelfand. Alternative Splicing Tends to Involve Protein Phosphorylation Sites. Mol Biol [in press].
4. J. V. Olsen et al. (2006) Global, In Vivo, and Site-Specific Phosphorylation Dynamics in Signaling Networks. Cell 127:635-648.
5. B. Bodenmiller et al. (2008) PhosphoPep – a database of protein phosphorylation sites in model organisms. Nature Biotechnology 26:1339-1340.


FINDING OF THE GENE FRUITLESS IN ANTS. TATIANA KUZMENKO 1, MIKHAIL SKOBLOV 2, SERGEY NUZHDIN 3, ANCHA BARANOVA 4

Keywords: ants, fruitless, cloning, recognition, alternative splicing

Social behavior is based on two major types of interaction: cooperation between individuals within a group and competition between groups. The main question here is: how do individuals recognise group membership and their role within the group? The search for the molecular basis of this process brought us to the gene fruitless, which was first discovered in the fruit fly Drosophila. This gene is a master regulator that directs brain development in male- or female-specific ways by alternative splicing. It was shown that in the Drosophila male brain there is a cluster of neurons coexpressing the male-specific fruitless isoform (FruM) and octopamine. These neurons play a key role in recognising the social status (for example, sex) of other individuals and in deciding what kind of behavior to activate (courtship or aggression). FruM mutants cannot recognise other males and try to court both males and females. The function of fruitless appears to be well conserved among many insects. Therefore, it would be interesting to find out how this system changes in insects with a much more developed system of recognition: ants. Ants are eusocial animals with functional specialization of individuals inside the colony. The whole ant society is based on the correct recognition of the social status of other group members, which determines each individual's own decisions. To begin with, we cloned the gene fruitless in worker ants of Iridomyrmex humilis (Linepithema humile). Also known as the imported Argentine ant, this species is distributed worldwide and in some places forms great super-colonies including thousands of nests, which allows it to push out all other ant species and makes it one of the most important pests in the area. We successfully cloned ant fruitless using degenerate primers and RACE PCR technology. It was shown that the gene fruitless in the ant I. humilis forms at least two different transcripts.
Bioinformatic analysis of the transcript structure showed that the first one has strong similarity to the predicted fruitless mRNA of Apis mellifera (bee) and the second to the predicted fruitless mRNA of Nasonia vitripennis (wasp). It is possible that these different transcripts are associated with different behavioral patterns. Further study of the ant fruitless expression profile will clarify this interesting hypothesis.

1 Lomonosov Moscow State University, Russian Federation, [email protected]
2 Research Center for Medical Genetics, RAMS, Moscow, Russian Federation
3 University of Southern California, United States
4 George Mason University, United States


PROTEIN-DNA BINDING STATISTICS AND ESTIMATING THE TOTAL NUMBER OF BINDING SITES OF TRANSCRIPTION FACTOR IN THE GENOME VLADIMIR KUZNETSOV 1, ONKAR SINGH 1, PIROON JENGAROENPOON 1

Keywords: transcription factor, binding avidity, ChIP-seq, Kolmogorov-Waring statistics, specificity, sensitivity

Identification of transcription regulatory elements in a mammalian genome is an important problem of systems molecular biology and statistical genomics. Among those elements, the transcription factor binding sites (TFBSs) on chromosomes are considered the basic units of gene functional activity and of the protein-DNA interactome [1-4]. TFBSs have the potential to serve as targets for a transcription factor (TF), which binds to the specific binding site and regulates gene transcription. The recently developed next-generation sequencing technology, the chromatin immunoprecipitation sequencing (ChIP-seq) method [2], can generate many millions of sequences in a single run and can accurately detect a relatively larger number of specific TFBSs at higher resolution (up to a few base pairs) than ChIP-chip and other ChIP-based methods. In a ChIP-seq experiment, immunoprecipitated DNA fragments are directly sequenced at one end for ~27 bp, and millions of short DNA sequence reads are then mapped to the reference genome. After mapping, the DNA fragments enriched by the binding of the transcription factor are clustered and quantified by the peak heights of DNA fragment cluster overlaps. One of the most crucial problems with this ChIP-based genome-wide experimental analysis is how to extract statistically reliable and biologically meaningful phenomena from the resulting large data sets. In this context, the specificity and, more importantly, the sensitivity of protein-DNA binding event mapping still depend essentially on subjective rules for pre-processing/filtration of the derived sequences and on the statistical criteria used to identify ChIP-seq DNA fragment clusters and their peak values. Owing to the large amount of data generated by ChIP-based high-throughput sequencing technology, the non-uniform genome background noise and the sampling errors in datasets, adequate mathematical models and computational tools are required to identify specific events in the generated datasets.
In this work, we develop our probability mixture model of the specific and non-specific TF-DNA binding avidity distribution function [1,2] and present a new method for estimating the specificity and sensitivity of TF-DNA binding in ChIP-based experiments. We assume here that the Kolmogorov-Waring function [2] can be considered as an exploratory stochastic model of the evolution of specific TF-DNA binding events. In that model, the evolution of specific TF-DNA events is considered as a stochastic binding process that takes into account at least two binding transition probabilities: (i) a preferential attachment process (specific binding potential) and (ii) a Poisson process (non-specific potential). Two similar processes, but with different intensities, are assumed for TF detachment events. For eleven essential TFs of mouse embryonic stem cells (Nanog, Oct4, c-Myc, Klf4, etc.) studied in [4], our TF-DNA binding model (i) re-estimated the specificity and sensitivity (ChIP-seq), (ii) predicted the numbers of specific binding events in the ChIP-seq dataset and (iii) estimated the total numbers of specific TFBSs for the 11 TFs in the mouse genome. Finally, we demonstrate that the sensitivity problem has not been technically resolved by current ChIP-based methods, including ChIP-seq. Our approach provides a statistically based framework for the comprehensive computational identification of TFBSs and other regulatory sequences (RNA polII, chemically modified genome regions, etc.) when low-avidity and moderate-avidity sequences are over-represented in ChIP-derived sequence samples.

1 Bioinformatics Institute, Singapore, [email protected]

1. Wei, C.L., et al. (2006) A global map of p53 transcription-factor binding sites in the human genome, Cell, 124:207-219.
2. Kuznetsov, V.A., et al. (2007) Computational analysis and modeling of genome-scale avidity distribution of transcription factor binding sites in ChIP-PET experiments, Genome Inform, 19:83-94.
3. Johnson, D.S., et al. (2007) Genome-wide mapping of in vivo protein-DNA interactions, Science, 316:1497-1502.
4. Chen, X., et al. (2008) Integration of external signaling pathways with the core transcriptional network in embryonic stem cells, Cell, 133:1106-1117.

EVOLVABILITY AND BIODIVERSITY – MODELING OF COEVOLUTION IN COMMUNITIES USING EVOLUTIONARY CONSTRUCTOR PROGRAM SERGEY A. LASHIN 1, VALENTIN V. SUSLOV 1, YURII G. MATUSHKIN 1

The relation between stability and evolvability in communities (populations and ecosystems) is the key issue in the study of evolution. In general, evolutionary success may be defined as a decrease in the probability of death of a biological unit (individual, taxon, population, ecosystem) due to environmental fluctuations [1]. There are two universal ways towards this goal: the ability to search for novel environmental sources of energy and substrates [2], which implies evolvability; and autonomy through increased multifunctionality and closure [3,4], which implies stability. Models with fixed stoichiometric constants for each substrate are traditionally used. The set of these constants specifies the conditions under which a species can exist, in accordance with Liebig’s law. In nature, however, Rubel’s replaceability of substrates [4] is observed. We used the “Evolutionary Constructor” (EC) to compare in silico the stability and evolvability of trophically closed communities of haploids with Liebig and Rubel trophism. EC combines imitation and generalized modeling approaches [5]. The populations were grouped into trophic ring-like networks (TRLN). Each population utilized a specific substrate secreted by its previous TRLN neighbor, and produced and secreted a specific product that could be utilized by its next TRLN neighbor as that neighbor’s specific substrate. In addition, all populations required the same nonspecific substrate, which was injected into the system from outside. We modeled mutations that increased substrate utilization efficiency and analyzed the survival of populations under a sublethal deficiency of some substrates (constants such as mutability, utilization efficiency and growth rate were estimated from E. coli cell data). We considered two types of TRLN: 1) networks in which insufficiency of some substrates was compensated by redundancy of others (TRLN-A, adaptive); 2) networks without compensation (TRLN-L, Liebig).
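To illustrate the difference between the two trophism types, the sketch below contrasts a Liebig-type growth rate, limited by the scarcest substrate, with a compensated rate in which a surplus of one substrate offsets a deficit of another. It is not the Evolutionary Constructor model: the Monod-type saturation term, the averaging rule used for compensation, and all names and constants are assumptions chosen for the example.

```python
def saturation(s, k):
    """Monod-type uptake saturation for substrate level s (assumed form)."""
    return s / (k + s)


def liebig_rate(substrates, half_sat, mu_max=1.0):
    """Liebig trophism: growth is set by the most limiting substrate."""
    return mu_max * min(saturation(s, k) for s, k in zip(substrates, half_sat))


def compensated_rate(substrates, half_sat, mu_max=1.0):
    """Replaceable substrates: a deficit of one is offset by surplus of others.

    Averaging the saturations is a hypothetical compensation rule.
    """
    sats = [saturation(s, k) for s, k in zip(substrates, half_sat)]
    return mu_max * sum(sats) / len(sats)


# Hypothetical case: one substrate is deficient (0.1), the other abundant (10.0).
deficient = (0.1, 10.0)
half_sat = (1.0, 1.0)
print(liebig_rate(deficient, half_sat))       # limited by the scarce substrate
print(compensated_rate(deficient, half_sat))  # partly rescued by the abundant one
```

With balanced substrate levels the two rules give the same rate; they diverge only when one substrate becomes deficient, which is the regime the abstract analyzes.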
Fixation of a mutation in a population of TRLN-L was shown to lead to short-term growth of all TRLN populations, while the lifetime of the ring remained essentially the same. The mutant population and the other populations gained biomass only over a short period and then went extinct, but until the extinction the biodiversity of the system was preserved. A mutation in TRLN-A either saved the ring or significantly increased its lifetime, although some populations could still go extinct. TRLN-A showed better evolvability than TRLN-L, but at the same time it tended to lose its biodiversity and structure. The modeling therefore did not confirm the intuitively assumed absolute advantage of Rubel’s trophism: in the long term, the gradual loss of biodiversity deprives TRLN-A of its evolutionary ability. The advantage of TRLN-L, the preservation of higher biodiversity, should play a role in biotopes with unstable, irregular environmental conditions. High biodiversity is the potential for the system’s evolution, which leads to a search for novel substrates and energy sources. During a short-term deterioration of conditions these sources are used for survival; after the return of normal conditions they are used for progressive evolution [2]. This corresponds to paleontological data [1,6,7]. Conversely, ecosystems of physiologically universal taxa (e.g. cyanobacteria) function under stable conditions, while the taxa themselves are in morpho-physiological stasis [1,6,7].

1 Institute of Cytology and Genetics SB RAS, Russian Federation, [email protected], [email protected], [email protected]

The work is supported by grants NSh-2447.2008.4, RFBR No.06-04-49556; RAS projects No.10.7, No.18.13; gov. contract No.10104-37/П-18/110- 327/180608/015

1. V.A. Krasilov (1986) Unsolved Problems of Evolution Theory. Vladivostok, FERSAS SSSR (in Russ.).
2. V.F. Levchenko (1993) Models in Biological Evolution Theory. St. Petersburg, Nauka (in Russ.).
3. I.I. Shmalgauzen (1968) Evolutionary Factors. Moscow, Nauka (in Russ.).
4. Eu. Odum (1975) Fundamentals of Ecology. Moscow, Mir (in Russ.).
5. Lashin, S.A., Suslov, V.V., Kolchanov, N.A., Matushkin, Yu.G. (2007) Simulation of coevolution in community by using the "Evolutionary Constructor" program. In Silico Biol. 7: 261-275.
6. G.A. Zavarzin (2003) Lectures on Natural Resource Microbiology. Moscow, Nauka (in Russ.).
7. S.V. Meyen (1987) Fundamentals of Paleobotany. Moscow, Nedra (in Russ.).

A NONCODING ANTISENSE RNA - PROTEIN INFORMATION SYSTEM FOR MAMMALIAN STRESS RESPONSE GEORGES ST. LAURENT III 1, DMITRY SCHTOKALO 2, SERGEY NECHKIN 2, ANDREY POLYANOV 2, AJIT KUMAR 3, MOHAMMAD ALI FAGHIHI, FARZANEH MODARRESI, CLAES WAHLESTEDT

Keywords: Natural antisense transcript, NATs, siRNA library, stress response, noncoding RNA, HuR protein, information, ncRNA

Natural antisense transcripts (NATs) represent a prevalent class of RNA molecules transcribed from the opposite strand of protein coding genes (sense). Extensive sequencing efforts suggest that NATs are expressed from loci throughout the mammalian genome. NATs are members of a larger group of long non-coding RNAs (ncRNAs), with as many as 30,000 members in the human transcriptome. However, many questions remain regarding pathways for their processing, utilization and function, leading to persistent doubts about their relevance.

Like other non-coding RNAs (ncRNAs), NATs contain unique information content and computational features, coupling the digital information universe of nucleic acids with the analog universe of cellular protein interactions. Notable among these features is the ability to rapidly represent newly acquired information as a conformational change that drives protein interactions and downstream signaling. On balance, a favorable ratio of information-codable versus thermal degrees of freedom results in a rapid yet reversible regulatory machinery, ideal for the finely tuned regulation of the mammalian stress response. In a few instances, evidence indicates that long ncRNAs effect rapid changes in target gene expression during various forms of cellular stress response (BACE1-AS [4]; HSR [5]; HIF1a-AS [6]; iNOS [8]).

Establishing the connection between NATs, neuronal stress, and the onset of neurodegeneration, we recently reported a long ncRNA that functions to stabilize BACE1 mRNA after neuronal stress, increasing BACE1 gene expression in vivo and in vitro [4]. This example, together with several other recently published reports, suggests that NAT – protein interaction machineries may play a broad role in sensing and responding to a range of cellular stresses. Assembling evidence that deciphers the role of NATs in mammalian stress response will likely reveal important flows of functional information regulating the dynamics of inflammation and disease etiology.

1 Brown University, United States, [email protected] 2 Biorainbow, Russian Federation, [email protected], [email protected] 3 George Washington University, United States, [email protected]

Interactions with cellular proteins represent key early steps in the biological function of ncRNAs, providing important clues to their mechanisms of action. We have studied the cellular stress-associated RNA-binding protein HuR using a novel cryogenic immunoprecipitation technique that preserves in vivo RNA – protein interactions. We find that HuR protein binds to the HIF1alpha natural antisense transcript, establishing a mechanism for the known involvement of this NAT in the hypoxic stress response. Further, we find a large number of transcripts associated with HuR protein, many of which are antisense transcripts. Bioinformatic analysis shows that almost 50% of these transcripts contain previously identified HuR protein binding motifs. These NAT – protein interactions likely mediate the action of NATs in the mammalian stress response system, and may establish a functional theme for HuR protein as a coordinator of NAT-associated stress response mechanisms.

Complementary to our RNA – protein interaction studies, we have evaluated the functionality of non-coding NATs using a high throughput RNAi screen which targeted a family of 794 conserved NATs. Among the targeted NATs, 622 were non-coding RNA and 174 were coding antisense sequence or an extension of a coding cDNA. Surprisingly, on a percentage basis, siRNAs to non-coding NATs scored almost as strongly as siRNAs to coding NATs, suggesting a widespread role for the non-coding NATs as a class. The results demonstrate NAT modulation of fundamental cellular events, underscoring a pervasive role for these novel transcripts in the finely tuned mechanisms of reversible stress response, and ultimately in the information flow associated with disease etiology.

MOLECULAR DYNAMICS SIMULATION OF MEMBRANE CURVATURE INDUCTION BY I-BAR DOMAIN OF MIM OLGA LEVTSOVA 1, ILDAR DAVLETOV 1, OLGA SOKOLOVA 1

Keywords: molecular dynamics, I-BAR domain, membrane

I-BAR domains are found in cytoskeletal proteins such as missing-in-metastasis (MIM) and IRSp53. Members of the IM domain protein family appear to be involved in the formation of membrane protrusions, such as filopodia, lamellipodia and plasma membrane ruffles. It has been shown that the I-BAR domain directly binds PI(4,5)P2-rich membranes and deforms them into tubular structures. This domain is structurally related to the membrane-tubulating BAR (Bin-Amphiphysin-Rvs) domains, but it induces a membrane curvature opposite to that of the BAR domain and deforms membranes by binding to the interior of the tubule. The structure of the I-BAR domain has been resolved. It forms a dimer with a high density of positively charged residues on one side of its surface. In this work we used molecular dynamics simulation to show how a single I-BAR domain induces local curvature in a negatively charged membrane (20% PI(4,5)P2 PO lipids and 80% POPC lipids). The interaction of the I-BAR domain with a neutral membrane (100% POPC lipids) was also investigated. The IM domain-membrane systems were simulated at two levels of description: all-atom molecular dynamics with the OPLS-AA force field (AAMD) and coarse-grained molecular dynamics with the MARTINI force field (CGMD). In the case of the POPC lipid bilayer, the I-BAR domain did not interact with the membrane surface and moved into the surrounding water. The negatively charged lipids attract the positively charged I-BAR domain and pull it towards the membrane surface. The membrane-protein interaction induces rotation of the outer loop (residues 145-155) towards the membrane surface. This loop rotation results in a conformational stress that propagates into the protein and provokes membrane curvature. The membrane deformation is opposite to that of a BAR domain. Aggregation of PI(4,5)P2 lipids was observed under the protein. Our results agree with experiments reported by other groups.
Long CGMD simulations show that the IM domain induces local curvature in the membrane. The bending mode developed underneath the I-BAR domain occupied the whole region of the lipid bilayer and stabilized after 50 ns. The radius of curvature was not constant during the whole simulation (300 ns): it oscillated between 50 nm and 200 nm. This oscillation may be caused by an “edge effect”. It should be noted that membrane areas free of protein were not deformed and therefore destabilized the curvature. The results of AAMD and CGMD were very similar, so a coarse-grained model can be used to investigate the membrane deformation induced by the I-BAR domain. Simulating the whole mechanism of tubular structure formation requires a lot of computer power and time, which is why simplified models such as CGMD are very promising.

1 Moscow State University, Russian Federation, [email protected]

TINC (TARGET ID BY NETWORK CONNECTIVITY) DMITRIY LEYFER 1, UGUR GUNER 2

Keywords: text mining; literature mining; drug discovery; networks; pathways; ontology

Motivation: In the pharmaceutical industry it is often the case that a new and exciting target is brought to clinical trials only to reveal serious safety concerns or a lack of efficacy. A gene downstream or upstream in the pathway might be the solution; however, not all pathways are known, and finding such an alternative target using existing in silico or bench tools can be labor-intensive. A method that could automatically find similarities between targets according to published information would significantly accelerate the search.

Method: Targets were compared based on their nearest neighbors in the literature network space using an adjusted residuals method that was adapted to account for non-independence of neighbor groups. The result is a rank-ordered list of targets that are most similar to the original query. The method can be generalized to annotating nodes using edges in any network whose nodes have large connectivity, especially biologically relevant scale-free networks. It can be used to annotate diseases with similar etiology, reposition existing drugs, or discover adverse events for the targets. The results can be further clustered to create groups of similar nodes. The method can also be used to create an ontology of physiological processes to describe phenotypes.

Results: TINC was used to analyze, group and rank histone acetylation enzymes. It has been shown that phylogenetically similar targets cluster together. Immediate further analysis will include discovery of diabetes and atherosclerosis targets.
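The adjusted residuals named in the Method can be illustrated with the standard textbook form for a contingency table of counts. This is only a sketch: the abstract does not give the authors' adaptation for non-independent neighbor groups, so the code shows the unadapted statistic, and the table layout and counts below are hypothetical.

```python
import math


def adjusted_residuals(table):
    """Adjusted standardized residuals for an r x c table of counts.

    Assumes every row and column total is positive and smaller than
    the grand total, so no variance term is zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    residuals = []
    for i, row in enumerate(table):
        out = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            variance = expected * (1 - row_totals[i] / n) * (1 - col_totals[j] / n)
            out.append((observed - expected) / math.sqrt(variance))
        residuals.append(out)
    return residuals


# Hypothetical 2 x 2 table: rows = node is / is not a neighbor of the candidate,
# columns = node is / is not a neighbor of the query target.
counts = [[10, 20], [30, 40]]
for row in adjusted_residuals(counts):
    print(row)
```

A large positive residual in the top-left cell would indicate that the candidate and the query share more literature neighbors than expected by chance, which is the kind of score that could be used to rank candidates.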

1 Pfizer, United States, [email protected] 2 Pfizer, United States, [email protected]

UNDERSTANDING THE AMINO ACID SUBSTITUTION PROCESS DAVID A. LIBERLES 1

In Darwinian evolution, mutations occur approximately at random in a gene and are translated into amino acid mutations by the genetic code. Some mutations are fixed to become substitutions and some are eliminated from the population. Partitioning pairs of closely related species with complete genome sequences by population size, we look at the BLOSUM matrices generated for these partitions and compare the substitution patterns between species. A population genetic model is generated that relates the relative fixation probabilities of different types of mutations to the selective pressure and population size. Parameterizations of the average and distribution of selective pressures for different amino acid substitution types in different population size comparisons are generated using a Bayesian framework. We find that partitions in population size as well as in substitution type are required to explain the substitution data. Mechanistic explanations of this will be discussed.

To further explore the role of underlying processes in amino acid substitution, we analyzed embryophyte (plant) gene families from the TAED database for which a solved structure of at least one member exists in PDB. Using PAML, branches were assigned to three categories: strong negative selection, moderate negative selection/neutrality, and positive selection. Focusing on the first and third categories, sites changing along gene family lineages were identified and the spatial patterns of substitution were observed. Selective sweeps are expected to create primary sequence clustering under positive selection. Co-evolution through direct physical interaction is expected to cause tertiary structural clustering. Under positive selection, the most significant signal was found at the primary sequence level, reflecting the action of selective sweeps. Less surprisingly, under strong negative selection, the most significant signal was found at the level of protein structure, reflecting the role of direct physical interaction in driving co-evolution.
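The dependence of fixation probability on selective pressure and population size can be illustrated with Kimura's classical diffusion approximation. The abstract does not state which model the author used, so this is an assumed stand-in: it takes additive selection, an effective population size equal to the census size, and a new mutation present as a single copy in a diploid population.

```python
import math


def fixation_probability(s, n):
    """Kimura's fixation probability for a new mutation.

    s: selection coefficient (positive = beneficial);
    n: diploid population size (effective size assumed equal to n),
       so the initial frequency is 1 / (2n).
    Very large negative values of 4*n*s overflow the exponential.
    """
    if s == 0:
        return 1.0 / (2 * n)  # neutral limit
    # (1 - exp(-2s)) / (1 - exp(-4ns)), written with expm1 for accuracy
    return math.expm1(-2 * s) / math.expm1(-4 * n * s)


def relative_fixation(s, n):
    """Fixation probability relative to a neutral mutation."""
    return fixation_probability(s, n) * 2 * n


for s in (-0.001, 0.0, 0.001):
    print(s, relative_fixation(s, 1000))
```

Because the ratio depends on the product of population size and selection coefficient, the same mutation type can be effectively neutral in a small population and clearly selected in a large one, which is consistent with the need to partition the data by population size.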

1 Department of Molecular Biology, University of Wyoming, Laramie, WY 82071, United States, [email protected]

POSITIONING OF EXONS AND INTRONS IN COLLAGEN I AND VII GENES MAY BE DETERMINED BY NUCLEOSOMES A.P. LIFANOV 1, P.K. VLASOV 1, V.YU. MAKEEV 1,2, N.G. ESIPOVA 1

Keywords: exon, intron, collagen I, collagen VII, nucleosome repeat length

The majority of human protein-coding sequences are split into many exons separated by introns. The purpose of such a split is generally unclear; for loci of multidomain proteins, no correlation between exon borders and structural domain borders is usually observed [1,2]. Collagen proteins contain flanking globular segments and a middle “fibrillar” segment with a very simple amino acid sequence (Gxy)n, which is strict for fibrillar collagens and has short insertions in other types. For a group containing all known types of collagen proteins (approx. 30), we study the exon-intron structure of their loci. The length of an exon+intron pair is calculated as the distance between the start nucleotides of consecutive exons; no significant changes are observed if stop nucleotides or exon midpoints are used instead. We determine the shares of exon+intron pairs with lengths of 0-100, 101-200 and 201-500 nucleotides among all such pairs with lengths of 0 to 500 nucleotides (table below).

exon+intron pair length, nucl.           0-100    101-200    201-500
fibrillar regions of collagen locus      0%       42%        58%
nonfibrillar regions of collagen locus   0%       17.5%      82.5%
genome                                   0.2%     18%        81.8%
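The shares in the table can be computed from exon start coordinates along the following lines. This is a minimal sketch rather than the authors' code: the function name and the example coordinates are hypothetical, and pairs longer than 500 nucleotides are excluded from the total, as described in the text.

```python
def pair_length_shares(exon_starts, bins=((0, 100), (101, 200), (201, 500))):
    """Share of exon+intron pairs per length bin.

    A pair length is the distance between start nucleotides of consecutive
    exons; pairs that fall outside all bins are left out of the total.
    """
    starts = sorted(exon_starts)
    lengths = [b - a for a, b in zip(starts, starts[1:])]
    counts = {rng: 0 for rng in bins}
    for length in lengths:
        for low, high in bins:
            if low <= length <= high:
                counts[(low, high)] += 1
                break
    total = sum(counts.values())
    if total == 0:
        return {rng: 0.0 for rng in bins}
    return {rng: counts[rng] / total for rng in bins}


# Hypothetical exon start coordinates within one locus.
shares = pair_length_shares([0, 90, 255, 420, 1000])
for (low, high), share in shares.items():
    print(f"{low}-{high}: {share:.1%}")
```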

Fibrillar regions of collagen loci contain many exon+intron pairs of rather small length. In the short and “densely packed” loci of collagen I (coding part of the locus 16000 nucl. long, 51 exons) and collagen VII (30800 nucl., 117 exons), exons separated by introns form a close-to-periodic structure. To show this and to estimate the period lengths for the collagen VII locus, we plot a periodogram (left figure) using a complex Morlet wavelet (sine and cosine functions with a Gaussian envelope).
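A wavelet periodogram of this kind can be sketched by treating exon starts as point events and evaluating a complex Morlet wavelet at each locus position and trial period. The abstract does not specify the signal encoding or the wavelet parameters, so the point-event encoding, the envelope width (`cycles`) and the synthetic coordinates below are all assumptions.

```python
import cmath
import math


def morlet_power(positions, center, period, cycles=3.0):
    """Power at one locus position and one period for a set of point events.

    The wavelet is a complex exponential of the given period under a Gaussian
    envelope whose standard deviation is `cycles` periods (an assumed choice).
    """
    sigma = cycles * period
    total = 0j
    for p in positions:
        t = p - center
        envelope = math.exp(-t * t / (2 * sigma * sigma))
        total += envelope * cmath.exp(-2j * math.pi * t / period)
    return abs(total) ** 2


def periodogram(positions, centers, periods):
    """Rows correspond to periods, columns to locus positions (centers)."""
    return [[morlet_power(positions, c, T) for c in centers] for T in periods]


# Synthetic locus: 40 exon starts spaced exactly 165 nucleotides apart.
starts = list(range(0, 6600, 165))
centers = list(range(0, 6600, 300))
power = periodogram(starts, centers, periods=[165, 227])
mean_165 = sum(power[0]) / len(power[0])
mean_227 = sum(power[1]) / len(power[1])
print(mean_165 > mean_227)
```

On real data the exon starts are only approximately periodic, so the power would be spread over neighboring periods and would vary along the locus.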

1 Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, [email protected] 2 FGUP GosNIIGenetika

[Figure. Left: periodogram of the collagen VII locus (x axis: locus position, nucl.; y axis: period, nucl.). Right: stacked segment plots of (a) the fibrillar region, segment length 165, and (b) the nonfibrillar region, segment length 227 (x axis: position in a segment, nucl.; y axis: segment number).]

Extended regions with an approximately constant layout of periods are observed. The borders of these regions correlate well with the borders of the fibrillar region of the locus (nucleotides 9278-27593), marked by vertical lines. The right figure shows two “multiple alignment” plots that visualize the fibrillar (a) and nonfibrillar (b) regions of the collagen VII locus sequence (exons black, introns white), split into segments of a fixed length (shown above the plots) and placed in a stack. Borders of neighbouring exons are frequently found one below another; one can conclude that exons form a local close-to-periodic structure in the locus. The period of these repeats differs between the fibrillar (165) and nonfibrillar (227) regions of the locus; in the fibrillar region it is close to the minimal nucleosome repeat length [3]. The splitting of the fibrillar region of collagen-coding loci into tens of short (close to 100 nucl.) exons cannot be explained by “protein structure needs”: these loci code for a very simple and monotonous (Gxy)n amino acid sequence. The appearance of repeats with a characteristic length of 165 nucl. can probably be attributed to “locus structure needs”: a long nucleotide sequence is packed into nucleosomes.

This work is supported by RFBR grant 07-04-01765a.

1. Traut, T.W., Do exons code for structural or functional units in proteins? Proc Natl Acad Sci U S A, 1988. 85(9): p. 2944-8.
2. Elder, D., Split gene origin and periodic introns. J Theor Biol, 2000. 207(4): p. 455-72.
3. Stein, A. and M. Bina, A model chromatin assembly system. Factors affecting nucleosome spacing. J Mol Biol, 1984. 178(2): p. 341-63.

APPLICATION OF NUCLEIC ACID PROGRAMMABLE PROTEIN ARRAYS (NAPPA) TO SEROLOGICAL PROFILING FOR TYPE 1 DIABETES ASSOCIATED AUTOANTIBODIES T. LOGVINENKO 1, S. MIERSCH 2, S. SIBANI 2, J. LABAER 2

High-content, multiplexed, array-based proteomics solutions are a valuable, unbiased discovery tool that can facilitate the identification of novel autoantibodies (AA’s) associated with various diseases, in particular autoimmune diseases. Further, an expanded panel of biomarkers is likely to increase detection rates by combining the sensitivity and specificity of a number of AA's and may ultimately prove to be of clinical value as a diagnostic tool. The association between islet cell AA's and the progression to Type 1 diabetes (T1D) is well recognized. Although detection rates in individuals possessing all three commonly tested AA's (anti-GAD, -IA-2, -insulin) are thought to be >80%, there are clinically presenting individuals who fail to present any of these biomarkers at the time of diagnosis. In light of this, it is likely that additional AA's specific for T1D exist. We undertook a serological screening effort against a clone library encoding >6000 human proteins using the Nucleic Acid Programmable Protein Array (NAPPA). Employing such a broad set of potential molecular targets, our aims were 1) to identify, through multiple confirmatory rounds of screening, an enhanced panel of serum-reactive antigens to which AA's associated with T1D can be detected and 2) to test the ability of this panel of antigens to accurately discriminate between those with and without diabetes. We will discuss the pre-processing of the data obtained from NAPPA arrays and the statistical analyses used to identify antigens associated with T1D status. To this end, we have completed 1) a pre-screen aimed at eliminating uninformative antigens (those displaying no differential AA reactivity between T1D+ (n=50) and T1D- (n=20) serum) and 2) a training screen in which the statistically significant, differentially recognized autoantigens observed in the pre-screen between patients and controls were verified in an independent serum set (n=75 T1D+/-).
From these efforts we have identified >100 putative autoantigens (including the recently reported ZnT8) that exhibit enhanced AA reactivity with serum from T1D+ patients versus controls. Of these putative targets, a subset of approximately 10 antigens exhibits diagnostic sensitivities in the range of 50% and specificities between 70 and 80%, based upon microarray data. With this panel we are developing an optimized diagnostic algorithm or classifier (using support vector machines) and plan to test its performance under blinded conditions. Ultimately, it is our aim to fully validate these biomarkers by orthogonal means and to make use of the combined sensitivities and specificities of this collective panel of potential autoantigens in array format in order to improve diagnostic performance for clinical application. This study represents a significant advance toward comprehensive, unbiased identification of novel AA's associated with T1D and the demonstration of array-based immunodiagnostics based upon these newly identified autoantigens.

1 Tufts Medical Center, United States, [email protected] 2 Harvard Institute of Proteomics, United States
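The effect of combining individually weak markers into a panel can be illustrated with a small calculation. The sketch below is not the support vector machine classifier mentioned in the abstract: it uses a simple "any marker positive" rule as an assumed stand-in, and the calls and labels are made-up data.

```python
def sensitivity_specificity(calls, labels):
    """Return (sensitivity, specificity) for boolean calls against true labels."""
    tp = sum(1 for c, y in zip(calls, labels) if c and y)
    fn = sum(1 for c, y in zip(calls, labels) if not c and y)
    tn = sum(1 for c, y in zip(calls, labels) if not c and not y)
    fp = sum(1 for c, y in zip(calls, labels) if c and not y)
    return tp / (tp + fn), tn / (tn + fp)


def any_positive(*marker_calls):
    """Panel rule (assumed): a sample is positive if any marker is positive."""
    return [any(sample) for sample in zip(*marker_calls)]


# Hypothetical data: 4 T1D+ sera followed by 4 control sera.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
marker_a = [1, 1, 0, 0, 0, 0, 0, 1]
marker_b = [0, 1, 1, 0, 0, 0, 1, 0]

print(sensitivity_specificity(marker_a, labels))
print(sensitivity_specificity(marker_b, labels))
print(sensitivity_specificity(any_positive(marker_a, marker_b), labels))
```

In this toy case the panel detects more patients than either marker alone but also flags more controls, which is why a trained classifier rather than a simple OR rule is a natural next step.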

CHLOROPHYLL SYNTHESIS REGULATION IN PLANT CHLOROPLASTS K.V. LOPATOVSKAYA 1, A.V. SELIVERSTOV 1, V.A. LYUBETSKY 1

We present a study of the expression regulation of genes encoding proteins with iron-sulphur clusters (ChlL in plants, rubredoxin-like proteins in diatoms and Piroplasmida) or proteins involved in the formation of such clusters (SufB in Eimeria, Toxoplasma, Plasmodium and Porphyra). In this abstract we describe the regulation of the chlLN operon. The light-independent protochlorophyllide oxidoreductase is involved in the synthesis of chlorophyll a and consists of three subunits encoded by chlL, chlN and chlB.

MATERIALS. All sequenced plastomes available from GenBank were analyzed. The genes chlL, chlN and chlB are typically found in chloroplasts of algae and green plants, except for flowering plants, Welwitschia mirabilis from Gymnospermae and Psilotum nudum from Psilotophyta [2].

METHOD. Several promoter-like regions are predicted upstream of the chlL genes based on similarity with the bacterial-type psbA promoter in Sinapis alba, for which information is available on the effect of point mutations on RNA polymerase binding [1]. Control searches were conducted upstream of the genes psbA, psbB, psaA and rbcL, which have been experimentally characterized in some flowering plants. The neighborhood of the promoters is then searched for conserved sites by constructing multiple alignments along the species tree, i.e. by comparing sequences in the order of their relatedness. The consensuses thus constructed are used to query the entire plastome. Some of the original programs are freely available at http://lab6.iitp.ru/ru/treeal/ and http://lab6.iitp.ru/ru/twobox/.

RESULTS. Factor-mediated regulation of transcription initiation is predicted in the neighborhood of the chlL promoter. The factor binding site overlaps the promoter and has a tandem repeat structure with the consensus GATCTAT-11-GATCTAT. The site is found in the alga Chara vulgaris, the moss Anthoceros formosae, the club-moss Huperzia lucidula, the fern Adiantum capillus-veneris, and the gymnosperms Cycas taitungensis, Keteleeria davidiana, Picea abies, Pinus koraiensis and P. thunbergii. A corrupt site with a promoter-like region is found upstream of the chlL pseudogene in Gnetum gnemon. Another site overlaps the promoter and has an inverted repeat structure with the consensus GATCTAT-11-ATAGATC. It is found in the gymnosperms Chamaecyparis lawsoniana, Ch. obtusa and Cunninghamia lanceolata. The sites were not found upstream of chlL in any other algae, in the mosses Aneura mirabilis, Marchantia polymorpha, Physcomitrella patens and Syntrichia ruralis, in the ferns Angiopteris evecta and Polystichum acrostichoides, or in the gymnosperms Ephedra equisetina, Metasequoia glyptostroboides, Sequoia sempervirens, Sequoiadendron giganteum, Cryptomeria japonica, C. fortunei, Glyptostrobus pensilis, Taxodium distichum, Chamaecyparis pisifera, Cupressus sempervirens, Juniperus rigida, J. chinensis, Platycladus orientalis, Thuja standishii, Th. occidentalis, Th. plicata and Thujopsis dolabrata. The chlL promoters are less conserved than those that we found upstream of psbA, psbB, psaA and rbcL.

1 IITP RAS, 19, B. Karetny, Moscow, Russia, [email protected]
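A plastome region can be scanned for the two consensus sites reported above with a short regular-expression search. This is a minimal sketch, not the authors' programs: it requires exact matches to the consensus half-sites with any 11 nucleotides between them, whereas a real search would presumably tolerate mismatches, and the test sequence is synthetic.

```python
import re

# Lookahead groups allow overlapping occurrences to be reported.
TANDEM = re.compile(r"(?=(GATCTAT[ACGT]{11}GATCTAT))")
INVERTED = re.compile(r"(?=(GATCTAT[ACGT]{11}ATAGATC))")


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_sites(seq):
    """Return (site type, strand, start) for exact consensus matches.

    Starts on the "-" strand are given in reverse-complement coordinates.
    """
    seq = seq.upper()
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for name, pattern in (("tandem", TANDEM), ("inverted", INVERTED)):
            for match in pattern.finditer(s):
                hits.append((name, strand, match.start()))
    return hits


# Synthetic example with one tandem site on the forward strand.
example = "AAA" + "GATCTAT" + "C" * 11 + "GATCTAT" + "TTT"
print(find_sites(example))
```

Note that the inverted-repeat consensus is its own reverse complement apart from the spacer, so a true inverted site is reported on both strands.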

DISCUSSION. Charophyte algae develop a highly differentiated thallome, which implies the presence of tissue-specific gene expression patterns. Chlorophyll synthesis genes are absent from the plastome of flowering plants, probably due to higher tissue differentiation. Dark-grown seedlings of flowering plants, Ginkgo biloba, Larix kaempferi and Thuja standishii fail to accumulate chlorophyll or have low levels of chlorophyll [Kusumi, 2006].

ACKNOWLEDGEMENTS. The authors are grateful to L. Rusin for discussions.

1. A. Homann, G. Link (2003) DNA-binding and transcription characteristics of three cloned sigma factors from mustard (Sinapis alba L.) suggest overlapping and distinct roles in plastid gene expression, Eur J Biochem, 270: 1288–300.
2. J. Kusumi et al. (2006) Relaxation of functional constraint on light-independent protochlorophyllide oxidoreductase in Thuja, Mol Biol Evol, 23(5): 941–948.

COMPARATIVE GENOMIC ANALYSIS OF THE ATTENUATION REGULATION OF AMINO ACID AND AMINOACYL-tRNA BIOSYNTHESIS IN BACTERIA V.A. LYUBETSKY 1, K.V. LOPATOVSKAYA 1

We performed a large-scale search for attenuation regulation in bacteria using two original computer programs that model attenuation regulation and construct a multiple alignment along a phylogenetic tree. The programs and their detailed descriptions are available from http://lab6.iitp.ru/en/rnamodel/ and http://lab6.iitp.ru/en/treeal/. The first such search was reported in [Vitreschak et al., 2004]. Candidate attenuations are predicted in many bacteria from α-, β-, γ-, δ-Proteobacteria, Actinobacteria, Bacteroidetes/Chlorobi, Firmicutes, Thermotogae and Chloroflexi. Prediction frequencies differ between genes: many genes of amino acid and aminoacyl-tRNA biosynthesis have attenuation in many γ-Proteobacteria, while other bacterial taxa were not predicted to have attenuation. Attenuation was found neither in Chlamydiae, Cyanobacteria, Mollicutes, ε-Proteobacteria and Spirochaetales, nor in chloroplasts of algae possessing amino acid synthesis genes. Searches were conducted with all bacterial genomes contained in GenBank. The evolution of attenuation is discussed. Classic phenylalanine-tRNA-dependent attenuation is observed only in α-, β- and γ-Proteobacteria for the phenylalanine biosynthesis gene pheA and the operon pheST. Classic attenuation of threonine and isoleucine synthesis genes is observed only in α-, β-, γ- and δ-Proteobacteria. Classic tryptophanyl-tRNA-mediated attenuation is predicted in α-, β-, γ-, δ-Proteobacteria, Actinobacteria, Bacteroidetes and Thermotoga spp. Classic histidine synthesis attenuation is observed for gene hisS in α-Proteobacteria, gene hisG in γ-Proteobacteria, gene hisZ in Firmicutes, gene hisG in Bacteroidetes and gene hisS in Thermotogae. In γ-Proteobacteria and some Firmicutes, regulation is based on helices and triplexes. Common triplexes contain Py-Pu-Py triads, although those in Alteromonadales bacterium and Pseudoalteromonas haloplanktis are mixed and possess both Py-Pu-Py and Pu-Pu-Py triads. Bacillus cereus, B. thuringiensis, B. anthracis and B. weihenstephanensis may possess a weak cytidyl-guanyl Py-Pu-Py triplex in the structure of the co-terminator. The co-terminator in Clostridium difficile possesses a Pu-Pu-Py triplex upstream of gene hisZ. Polymerase-coding gene lysQ in Lactococcus lactis has classic attenuation with histidine regulatory codons and helices, with the co-terminator being stabilized by a Pu-Pu-Py triplex. Classic attenuation mediated by branched-chain amino acids is predicted in many Proteobacteria and Actinobacteria. The regulatory region of gene ilvD in Staphylococcus and Listeria (Firmicutes) contains the leader peptide sequence and a set of helices: four conserved helices forming a triplex with Py-Pu-Py triads within the co-antiterminator. The list of LEU-regulations from [Seliverstov et al., 2005] is largely extended to include leuA genes of most actinomycetes. Novel non-classic LEU1-regulation is predicted for gene leuA in α- and β-Proteobacteria. These results can be seen in the multiple alignments available at http://lab6.iitp.ru/ru/lopatovskaya_supp/.

1 IITP RAS, 19, B. Karetny, Moscow, Russia, [email protected]

ACKNOWLEDGEMENTS. The authors are grateful to L. Rusin for discussions.

1. A.G. Vitreschak et al. (2004) Attenuation regulation of amino acid biosynthetic operons in proteobacteria: comparative genomics analysis, FEMS Microbiol Lett., 234: 357-370.
2. A.V. Seliverstov et al. (2005) Comparative analysis of RNA regulatory elements of amino acid metabolism genes in Actinobacteria, BMC Microbiol., 5(54).


REFINEMENT OF SPATIAL STRUCTURE MODEL OF POTATO VIRUS X COAT PROTEIN AND DETECTION OF FUNCTIONALLY SIGNIFICANT STRUCTURAL ALTERATIONS IN THIS PROTEIN WITH THE HELP OF TRITIUM PLANIGRAPHY METHOD PAVEL SEMENYUK 1, ANNA MUKHAMEDZHANOVA 1, ELENA LUKASHINA 1

Keywords: potato virus X coat protein structure, structure-function relationship, spatial structure model, tritium planigraphy

Potato virus X (PVX) is the type member of the Potexvirus group of filamentous plant viruses. The PVX virus particle consists of a single RNA molecule and ∼1300 identical protein subunits that form a protective shell around the RNA. The three-dimensional structure of the PVX "coat protein" (CP) subunits, both in the virions and in the isolated state, has still not been determined experimentally, as potexviruses do not form fibers with orientation sufficient for high-resolution X-ray fiber diffraction analysis, and their isolated CPs do not produce "good" crystals. At the same time, knowledge of the PVX CP spatial structure is essential for understanding the structural transformations recently found to take place in the PVX virions. It was found that intravirus PVX RNA, which normally cannot be translated in cell-free systems and in vivo, acquires the ability to be translated after several molecules of a non-virion but virus-specific movement protein, encoded by the first gene of the PVX triple gene block (triple gene block 1 protein, TGBp1), bind to the PVX particle. It was also shown that TGBp1 molecules interact with the end of the PVX particle that contains the RNA 5'-terminus, and that this interaction results in a strong decrease in virion stability. In the present work, alterations in PVX CP structure after binding of TGBp1 to the virions have been studied using tritium planigraphy. This method uses atomic tritium as a surface nanoprobe: hydrogen is substituted by tritium in the thin surface layer of biological macromolecules and their aggregates when their preparations are bombarded by atomic tritium generated on a hot tungsten wire in a special vacuum device. Different parts of a studied object acquire different radioactivity depending on their exposure on the object surface. This phenomenon makes it possible to obtain information on the organization of the object and to construct a model of its spatial structure. In the present study it has been found that interaction of TGBp1 with the PVX virions leads to an increase of about fifty percent in tritium label incorporation into amino acid residues 176-198 of the 236-residue PVX CP subunit, with some decrease in label incorporation into the N-terminal CP region. According to our model of the intravirus PVX CP three-dimensional structure (Fig. 1), the 176-198 segment is assigned to the β-sheet region located at the subunit surface, presumably participating in CP interactions with the intravirus RNA and/or in protein-protein interactions, while the N-terminal CP region corresponds to the other part of the same β-sheet. For the remaining segments of the PVX CP subunit no significant difference in tritium incorporation between the untreated and TGBp1-treated PVX was observed. Our experimental data on the radioactivity distribution in the CP chain allowed us to suggest a probable mechanism for the transformation of PVX virions to a translationally active state after TGBp1 binding. The work was partially supported by the Russian Foundation for Basic Research (project 09-04-01373) and a Grant of the President of the Russian Federation for support of young Russian scientists (MK-5272.2008.3).

1 Moscow State University, Leninskie Gory, Moscow 119991, Russia, [email protected]

Fig. 1. “Sandwich” variant of the spatial structure model of PVX CP subunit in a virion. The virion long axis is located on the left side. α-Helices are shown as cylinders, β- strands – as arrows, except the strand of probable β2. The 35 to 39 and the 136 to 144 hinges are highlighted. The N- and C-termini are indicated.


ALLELE-SPECIFIC EXPRESSION USING SOLEXA BRADLEY MAIN 1, RYAN BICKEL 2, LAUREN MCINTYRE 3, RITA GRAZE 4, SERGEY NUZHDIN 5

Keywords: Allele-specific Expression, Drosophila, Solexa

Next-generation sequencing technology can generate millions of reads in a single run. Here, we take advantage of this technology to create an accurate and high-throughput approach to allele-specific expression (ASE) assays. This PCR-based method enriches for pre-defined regions of transcripts and assesses allele-specific expression using digital counts of reads containing a known SNP. We demonstrate the effectiveness of this method for hundreds of ASE assays using a single Solexa sequencing lane and estimate that thousands can potentially be performed. We applied this method to a set of genes in a Drosophila simulans parental mix, F1 and introgression, and found that the vast majority of expression divergence can be explained by cis-regulatory variation for the 4 genes and 6 inbred lines tested. Furthermore, this variation appears to be additive, as we were unable to detect cis-by-trans interactions in the genes examined.
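As a rough illustration of how digital read counts at a known SNP can be turned into an allele-specific expression call, the sketch below computes the allelic ratio and an exact two-sided binomial test against a 50:50 expectation. The abstract does not state which statistic was used, so the test, the function names and the example counts are assumptions made for illustration only.

```python
from math import comb


def allelic_ratio(ref_reads, alt_reads):
    """Fraction of SNP-containing reads that carry the reference allele."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads cover the SNP")
    return ref_reads / total


def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial test (assumed here, not taken from the abstract).

    Sums the probabilities of all outcomes that are no more likely
    than the observed count k out of n reads.
    """
    probs = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = probs[k]
    # Small relative tolerance guards against floating-point ties.
    return min(1.0, sum(q for q in probs if q <= observed * (1 + 1e-9)))


# Hypothetical counts for one SNP in an F1 sample.
ratio = allelic_ratio(30, 10)
p_value = binomial_two_sided_p(30, 40)
```

In an F1 hybrid both alleles share the same trans-acting environment, so a consistent departure from equal counts is usually read as evidence of cis-regulatory divergence; a real analysis would also need to correct for multiple testing across assays.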

We would like to thank Ryan Bickel for significant contributions to the design, analysis, and writing; Lauren McIntyre for statistical contributions; Hyo-sik Jang for fly work; and Joe Dunham and Johanna Main for discussions and comments on the method and paper, respectively.

1 University of Southern California, United States, [email protected] 2 University of Southern California, United States, [email protected] 3 University of Florida, United States, [email protected] 4 University of Florida, United States, [email protected] 5 University of Southern California, United States, [email protected]

A NOVEL METHOD FOR GENE PREDICTION IN PROKARYOTIC GENOMES RAHIM MALEKSHAHI 1, ALIREA MEHRIDEHNAVI 2, HEDAYATOLAH HOSSEINI 1, MAJID BEIGI 2

Abstract. The development of bacterial databases is crucial, and the number of sequenced prokaryotic genomes increases every year. The problem of identifying genes in genomic DNA sequences by computational methods has attracted considerable research attention in recent years. A fully automatic, self-training gene finder is presented in this work. The system uses a non-looped HMM to measure the statistical significance of genes in prokaryotic genomes. The software consists of three main programs developed in C++. The first program extracts the data (long non-overlapping ORFs) used to train the machine learning algorithm in a self-training manner. The codon usage probabilities in the extracted set of long non-overlapping ORFs show that they are highly likely to be coding (up to 96%). The second program performs the training stage, in which the HMM is trained on the data obtained in the previous stage. We model standard 'textbook genes' with an unbroken open reading frame. In the last program, the long ORFs are scored with the trained system. Finally, genes are selected on the basis of their lengths and scores. Our gene finder predicts genes with specificity Sp > 96% and sensitivity Sn > 84%. The results show that the overall performance of our software matches that of other methods developed outside Iran.
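The first stage described above, extraction of long ORFs as self-training data, can be sketched in a few lines. The authors' software is written in C++ and its details are not given in the abstract, so the Python version below is only an assumed illustration: it scans the three forward reading frames, uses ATG as the only start codon, and applies a hypothetical `min_codons` length threshold. Reverse-strand scanning and the removal of overlapping ORFs are omitted.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=100):
    """Return half-open (start, end) coordinates of forward-strand ORFs.

    An ORF runs from the first ATG in a frame to the next in-frame stop
    codon (stop included in the coordinates). Only ORFs with at least
    min_codons codons before the stop are kept.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None
    return orfs


# Toy sequence: ATG followed by three GCT codons and a TAA stop.
toy = "CC" + "ATG" + "GCT" * 3 + "TAA" + "G"
toy_orfs = find_orfs(toy, min_codons=4)
```

A larger threshold trades training-set size for a higher chance that each retained ORF is really coding, which is the property the abstract reports for its long-ORF set.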

1 Ahvaz University of Medical Sciences, Iran, [email protected], [email protected] 2 Isfahan Medical University, Iran, [email protected], majid.beigi@eng.ui.ac.ir

SOrt-ITEMS AND DiScRIBinATE: SIMILARITY BASED BINNING ALGORITHMS FOR ACCURATE TAXONOMIC ASSIGNMENT OF METAGENOMIC SEQUENCES MONZOORUL HAQUE MOHAMMED 1, TARINI SHANKAR GHOSH 1, SHARMILA MANDE 1

Metagenomics involves sequencing of DNA isolated directly from environmental samples. The major objectives of metagenomic analysis include surveying the taxonomic diversity within microbial communities, understanding community metabolism, and, in the process, discovering new organisms, genes and proteins with potential applications in industrial microbiology, biotechnology and medicine. One of the first steps in metagenomic analysis is the estimation of taxonomic diversity in a given environmental sample. This step, called binning, involves identifying the source organisms of the DNA fragments (referred to as 'reads') obtained by sequencing a metagenomic sample. Existing similarity based binning methods assign reads to organisms/taxa based on their similarity to known sequences present in the reference databases. Since the majority of reads in metagenomic samples originate from new or partially characterized genomes (the corresponding sequences of which are either absent or underrepresented in existing reference databases), assignment of such reads poses a challenge to similarity based binning methods. In addition, binning methods are further challenged by the relatively short length of reads generated using contemporary sequencing technologies such as 454.

We have developed two new algorithms, namely SOrt-ITEMS and DiScRIBinATE, for the accurate taxonomic assignment of reads originating from known, partially characterized and new organisms. The methods adopt a two-phase approach, the first phase being the same for both. In the first phase various alignment parameters between the reads and their corresponding hits with similar sequences in the database are obtained. These parameters are used to first identify an appropriate taxonomic level to which the assignment of the read is to be restricted. In the second phase, while SOrt-ITEMS uses an orthology based approach for the final assignment of the reads, DiScRIBinATE calculates 'bit-score/distance' ratios and uses this information to finally assign the reads to the taxa/clade which lie closest to the corresponding subset of high scoring hits.

1 Tata Consultancy Services Limited, India, [email protected]

The methods have been validated using simulated reads with incorporated sequencing errors and having lengths that are similar to those originating from 454 and Sanger sequencing technologies. Results obtained have been individually compared with another similarity based method called MEGAN. It is seen that both methods show improved accuracy of taxonomic assignments of reads as compared to that by MEGAN. This improvement is especially significant in simulated scenarios wherein sequences corresponding to the source organism of the read are removed from the reference database (thus simulating reads originating from new organisms present in the metagenomic sample).

Both programs can be downloaded for academic use from the following locations:
http://metagenomics.atc.tcs.com/binning/SOrt-ITEMS
http://metagenomics.atc.tcs.com/discribinate/

Details of the two algorithms and the results obtained will be presented during the conference.


PREDICTION OF CONDITIONAL GENE ESSENTIALITY THROUGH GRAPH THEORETICAL ANALYSIS OF GENOME-WIDE FUNCTIONAL LINKAGES PALANISAMY MANIMARAN 1, SHUBHADA HEGDE 1, SHEKHAR MANDE 1

The genome of an organism characterizes the complete set of genes that it is capable of coding. However, not all the genes are transcribed and translated under any defined condition. The robustness that an organism exhibits to environmental perturbations is partly conferred by the genes which are constitutively expressed under all conditions, and partly by a subset of genes that are induced under the defined conditions. The conditional importance of genes in conferring robustness can be understood in the context of the functional attributes of these genes and their correlations to the defined environmental conditions. However, an a priori prediction of such genes for a given condition is not yet possible. We have attempted such predictions by integrating the available gene expression data with genome-wide functional linkages through the well-known centrality-lethality correlations in graph theory. We make use of three distinct concepts of centrality, namely degree, closeness and betweenness, which yield mutually complementary information. We then demonstrate the efficacy of combined graph theoretical and machine learning approaches in ranking essential nodes from a large network of genome-wide functional linkages, which yields predictions with high accuracy. We therefore regard such predictions as highly useful in applications such as defining and prioritizing drug targets.
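The three centrality measures named above can be illustrated on a toy undirected network. The sketch below is not the authors' pipeline: the graph, the node names and the function names are hypothetical, the expression-data integration and machine learning steps are left out, and betweenness is computed with Brandes' algorithm for unweighted graphs, which is one standard choice.

```python
from collections import deque


def degree_centrality(graph):
    """Number of functional linkages per node."""
    return {v: len(nbrs) for v, nbrs in graph.items()}


def closeness_centrality(graph):
    """(reachable nodes) / (sum of shortest-path distances) for each node."""
    result = {}
    for s in graph:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total = sum(dist.values())
        result[s] = (len(dist) - 1) / total if total > 0 else 0.0
    return result


def betweenness_centrality(graph):
    """Unnormalised shortest-path betweenness (Brandes' algorithm)."""
    bc = dict.fromkeys(graph, 0.0)
    for s in graph:
        stack, dist = [], {s: 0}
        preds = {v: [] for v in graph}
        sigma = dict.fromkeys(graph, 0)  # number of shortest paths from s
        sigma[s] = 1
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(graph, 0.0)
        while stack:  # accumulate dependencies in reverse BFS order
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: b / 2 for v, b in bc.items()}


# Hypothetical linkage network: gene "b" is a hub, "d" bridges to "e".
toy_graph = {"a": ["b"], "b": ["a", "c", "d"], "c": ["b"],
             "d": ["b", "e"], "e": ["d"]}
```

On this toy network the hub `b` ranks first by all three measures, while `d` stands out mainly by betweenness because every path to `e` passes through it, which is the kind of complementary information the abstract refers to.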

1 Centre for DNA Fingerprinting and Diagnostics, Hyderabad, India, [email protected]

IN SEARCH OF ANTISENSE TO AFAP1 HUMAN GENE ANDREY MARAKHONOV 1, ANCHA BARANOVA 2, TATYANA KAZUBSKAYA 3, SERGEY SHIGEEV 4, MIKHAIL SKOBLOV 5

Keywords: AFAP1, antisense transcription, antisense regulation of gene expression

Whole-genome sequencing and subsequent annotation of human and murine cDNA libraries revealed that about 20% of genes overlap and form sense-antisense pairs in these organisms. Nevertheless, despite the abundance of antisense transcription, the role of this phenomenon in the regulation of gene expression remains poorly investigated. Previously, a genome-wide in silico screening of cis-antisense transcript clusters in human was performed in our laboratory. Estimation of the expression profiles of the sense and antisense transcripts of each pair in different tissues revealed pairs that showed predominant expression of the antisense partner in tumors. Such antisense RNAs may both serve as tumor markers and participate in regulating the expression of important oncogenes and tumor suppressors. This work is devoted to the search for and characterization of an antisense transcript to the human AFAP1 gene. The AFAP1 gene encodes actin filament associated protein, which coordinates the Src signaling pathway and actin filament remodeling. AFAP1 is a multidomain protein capable of oligomerization and of interaction with the Src and PKCα protein kinases. AFAP1 plays an important role in the formation of stress fibers, focal contacts and podosomes. We hypothesized that expression of antisense RNA to AFAP1 may lead to suppression of sense gene activity and, consequently, to compensatory restraint of tumor progression. To investigate the tumor-specific expression of the antisense RNA asAFAP1, we performed a computational analysis of antisense ESTs and estimated the level of transcription in normal and tumor samples of human tissues and cell lines. The exon-intron structure of the antisense transcript and the expression level of the sense-antisense pair were determined. We have also analyzed potential trans-antisense interactions.
1 Research Centre for Medical Genetics, Russian Academy of Medical Science, Russian Federation, [email protected] 2 George Mason University, United States, [email protected] 3 Blokhin Cancer Research Center, Russian Academy of Medical Sciences, Russian Federation 4 People’s Friendship University of Russia, Russian Federation 5 Research Centre for Medical Genetics, Russian Academy of Medical Science, Russian Federation, [email protected]

EQUILIBRIUM AND DYNAMICAL PROPERTIES OF PROTEIN BINDING NETWORKS SERGEI MASLOV 1

Large-scale protein-protein interaction networks serve as a paradigm of complex properties of living cells. These networks are naturally weighted, with edges characterized by binding strength (association constant) and nodes (proteins) by their concentrations. However, state-of-the-art high-throughput experimental techniques generate only binary (yes or no) information about individual interactions. As a result, most of the previous research concentrated only on the topology of these networks. In a series of recent publications [1-4] my collaborators and I went beyond purely topological studies and calculated the mass-action equilibrium of a genome-wide binding network using experimentally determined protein concentrations, subcellular localizations, and reliable binding interactions in baker’s yeast. We then studied how this equilibrium responds to large perturbations [1-2] and stochastic noise [3] in the concentrations of proteins. It was found that the magnitude of relative changes in the free (monomer) and bound (heterodimer) concentrations of perturbed proteins decays exponentially with network distance from the source of perturbation. This explains why, despite a globally connected topology, individual functional modules in such networks are able to operate fairly independently. Another conclusion of our study is that the robustness of the equilibrium state is determined by 1) the topological structure of the network; 2) the balance of concentrations of interacting proteins; and 3) the average binding strength of interactions. At the same time it depends only weakly on the (currently unknown) dissociation constants of individual interactions. In a separate study [4] we quantified the interplay between specific and non-specific binding interactions under crowded conditions inside living cells.
We show how the need to limit the waste of resources inside non-specific complexes constrains the number of types and concentrations of proteins that are present at the same time and at the same compartment of the cell.
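To make the notion of mass-action equilibrium concrete, the sketch below treats a network in which proteins form only heterodimers. Writing C_i for the total concentration, F_i for the free concentration and K_ij for the dissociation constant of the pair (i, j), the law of mass action gives a dimer concentration D_ij = F_i F_j / K_ij, and conservation of mass gives F_i = C_i / (1 + Σ_j F_j / K_ij). The code solves this by simple fixed-point iteration. The function name, the data layout and the toy values are assumptions for illustration, and convergence of the plain iteration is assumed rather than guaranteed for arbitrary networks.

```python
def free_concentrations(total, kd, tol=1e-12, max_iter=10_000):
    """Free (monomer) concentrations at mass-action equilibrium.

    total: dict protein -> total concentration C_i
    kd:    dict (protein_a, protein_b) -> dissociation constant K_ab
    Only heterodimers are modelled (an assumption of this sketch).
    """
    partners = {name: [] for name in total}
    for (a, b), k in kd.items():
        partners[a].append((b, k))
        partners[b].append((a, k))

    free = dict(total)  # start from the unbound state
    for _ in range(max_iter):
        updated = {
            i: total[i] / (1.0 + sum(free[j] / k for j, k in partners[i]))
            for i in total
        }
        change = max(abs(updated[i] - free[i]) for i in total)
        free = updated
        if change < tol:
            break
    return free


# Toy example: a single pair A-B with equal totals and K = 1.
toy_free = free_concentrations({"A": 1.0, "B": 1.0}, {("A", "B"): 1.0})
toy_dimer = toy_free["A"] * toy_free["B"] / 1.0
```

Re-running the function after changing one total concentration and comparing the free concentrations of proteins at increasing network distance is one way to explore the perturbation decay described in the abstract.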

1. S. Maslov, I. Ispolatov, PNAS 104:13655 (2007).
2. S. Maslov, K. Sneppen, I. Ispolatov, New J. of Phys. 9: 273 (2007).
3. K-K. Yan, D. Walker, S. Maslov, Phys. Rev. Lett., 101, 268102 (2008).
4. J. Zhang, S. Maslov, and E. I. Shakhnovich, Mol. Syst. Biol. 4, 210 (2008).

1 Brookhaven National Laboratory, New York, USA

INVESTIGATION OF AGE RELATED ALTERNATIVE SPLICING CHANGES IN HUMAN BRAIN USING SOLEXA SEQUENCING PAVEL MAZIN 1, PHILIP KHAITOVICH 2, ANDREY MIRONOV 3, MIKHAIL GELFAND 4

Keywords: splicing, aging

Alternative splicing (AS) is a major source of protein diversity in higher eukaryotes. AS affects almost all human genes and plays a role in development, cell differentiation and aging. Previously, AS differences between tissues and ages were studied using microarrays with probes corresponding to known exon junctions [1]. However, this approach has some limitations: it is not possible to observe new exon junctions or to distinguish between intron retention and alternative 3’ or 5’ splice sites. In addition, expression microarray data tend to have insufficient reproducibility and a narrow dynamic range. Here, we use the Illumina Genome Analyzer system to survey AS changes in the human brain during aging. We used samples of cortex tissue for three ages: extremely young, adult and extremely old. Each age sample was a mixture of samples from a few individuals. For each sample, we performed two technical replicates. As a result, we obtained 2.6 to 9.3 million 36-nt paired reads for each replicate (32,385,206 paired reads in total). We removed approximately 30% of reads by filtering out those of low quality or low complexity. After filtering, all reads were mapped to the human genome (hg18) and to all possible exon junctions by SOAPaligner [2], allowing up to two mismatches and no gaps. The exon junctions were produced by concatenating all exon pairs from one gene, with the end of the first exon joined to the beginning of the second exon. The exons were extracted from the GenBank and EDAS [3] databases. 49% of reads were mapped uniquely to the genome and 1% mapped uniquely to the splice junctions. The counts of reads mapped uniquely with 0, 1 and 2 mismatches closely followed a Poisson distribution with mean λ = 0.25 errors per 36-nt read, which is less than the value obtained in [4] (0.4). Expression levels of genes, exons, exon pairs and exon junctions were calculated in terms of coverage by reads for the different ages.
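The Poisson comparison mentioned above can be sketched as follows. If the number of mismatches per read is Poisson distributed with mean λ, the expected shares of mapped reads with 0, 1 and 2 mismatches follow from the probability mass function, renormalised because reads with more than two mismatches are not mapped. The abstract does not describe the fitting procedure, so the function names and the simple ratio estimator (which uses P(1)/P(0) = λ) are assumptions for illustration.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """Probability of exactly k mismatches under a Poisson model."""
    return lam**k * exp(-lam) / factorial(k)


def mismatch_fractions(lam, max_mismatches=2):
    """Expected shares of mapped reads with 0..max_mismatches mismatches.

    Shares are renormalised over the mappable classes only, since reads
    with more mismatches than allowed never appear among mapped reads.
    """
    raw = [poisson_pmf(k, lam) for k in range(max_mismatches + 1)]
    total = sum(raw)
    return [p / total for p in raw]


def estimate_lambda(reads_with_0, reads_with_1):
    """Ratio estimator: under a Poisson model P(1) / P(0) equals lambda."""
    if reads_with_0 <= 0:
        raise ValueError("need at least one read without mismatches")
    return reads_with_1 / reads_with_0


# Expected shares for the value reported in the abstract.
expected_shares = mismatch_fractions(0.25)
```

Comparing `expected_shares` with the observed shares of reads in each mismatch class gives a quick visual check of how closely the data follow the model.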
1 MSU, Russian Federation, [email protected] 2 Max Planck Institute for Evolutionary Anthropology, China 3 MSU, Russian Federation 4 IITP, Russian Federation

1. K. Kechris, Y. H. Yang, R.-F. Yeh (2008) Prediction of alternatively skipped exons and splicing enhancers from exon junction arrays, BMC Genomics 9:551-566.
2. R. Li, Y. Li, K. Kristiansen, J. Wang (2008) SOAP: short oligonucleotide alignment program, Bioinformatics, 24(5):713-4.
3. R.N. Nurtdinov, A.D. Neverov, D.B. Mal'ko, I.A. Kosmodem'ianskiĭ, E.O. Ermakova, V.E. Ramenskiĭ, A.A. Mironov, M.S. Gel'fand (2006) EDAS, databases of alternatively spliced human genes, Biofizika, 51(4):589-92.
4. E.T. Wang, R. Sandberg, S. Luo, I. Khrebtukova, L. Zhang, C. Mayr, S.F. Kingsmore, G.P. Schroth, C.B. Burge (2008) Alternative isoform regulation in human tissue transcriptomes, Nature, 456(7221):470-6.


KNOWLEDGE PROFILE APPROACH: INSIGHTS INTO DRUG ACTION AND TOXICITY MECHANISMS ILYA MAZO 1, EKATERINA KOTELNIKOVA 2, NIKOLAI DARASELIA 1

Providing a rich context for experimental data promises to offer new insights into mechanisms of compound action and toxicity. Evaluating genome-wide experimental data against a resource of millions of findings gleaned from a broad corpus of biomedical literature can highlight specific molecules that are otherwise easily missed. The challenge of using this large amount of data in decision-making can be met with appropriate hypothesis testing. Using the proprietary high-content linguistics tool MedScan, a database of knowledge profiles associated with different diseases and small molecule effects has been compiled by extracting information from the scientific literature. Different approaches to reconstructing individual pathways or cascades from the resulting knowledgebase and from profiling data will be described. The systematic mining of this database for findings on existing drugs and drug candidates is shown to help in tasks as diverse as 1) identifying potential novel components of the glioblastoma pathway and suggesting a new application for a known agent to inhibit this pathway, and 2) hypothesizing a mechanism behind drug-induced cholestasis and suggesting a biomarker candidate independently validated in other studies.

1 Ariadne, United States, [email protected], [email protected] 2 Ariadne, Russian Federation, [email protected]

SPECIFIC RECOGNITION OF UGA, UAA, UAG BUT NOT UGG BY ERF1 PROTEIN: MOLECULAR MODELING STUDY YURIY MAZUR 1, NINA OPARINA 1, VLADLEN SKVORTSOV 2, IGOR BASKIN 3, VLADIMIR PALYULIN 3

Keywords: docking, molecular dynamics, translation termination factor

The eRF1 protein is involved in translation termination by the ribosome. It was previously shown that the eRF1 N-domain is responsible for specific stop codon decoding. Despite a huge amount of experimental data, the mechanism of this recognition remains unclear. We have used molecular modeling to study a test system consisting of the single N-domain of human eRF1 and dipurine dinucleotides. We have compared putative complexes between the eRF1 N-domain and two dipurine classes: AA, AG and GA, representing fragments of the specific stop-codon motifs, and the GG dinucleotide, representing a fragment of the tryptophan UGG codon, which is not recognized by eRF1. Molecular docking and dynamics studies allowed us to construct these complexes, which are characterized by similar binding energies for the specific dinucleotides, but not for GG. We have carried out virtual mutagenesis studies of these complexes based on published mutations and on comparative analysis of omnipotent eRF1s (which recognize all three stop codons) and unipotent eRF1 (capable of binding to UGA only). A hypothetical scheme explaining the mechanism of stop-codon decoding by eRF1 was constructed in accordance with these molecular modeling data.

1 IMB RAS, Russian Federation, [email protected] 2 Institute of Biomedical Chemistry, Russian Federation 3 Moscow State University, Russian Federation

INNER STRUCTURE OF CpG ISLANDS JULIA MEDVEDEVA 1, NIKA OPARINA 2, VSEVOLOD MAKEEV 3

Although mammalian genomes are generally CG-depleted, there are regions with a high C+G content and an increased number of CG dinucleotides, called CpG islands. CpG islands are believed to contribute to transcription regulation, namely gene suppression in normal conditions (e.g. gene imprinting, X-chromosome inactivation) and in disease (e.g. cancer), as well as to chromatin modification, chromosome replication, etc. Considering the diverse biological functions mentioned above, we supposed that CpG islands (CGIs) could have an inner structure, with different compositional and therefore functional properties in different areas. First of all we decided to explore properties associated with transcription. We considered CpG islands located close to transcription start sites (TSS) of known genes with a high level of transcription activity. We divided such CpG islands into two groups according to the strand of the corresponding gene. Each such CpG island was split into two parts: the regions upstream and downstream of the TSS. Then we checked all the groups for overrepresentation of SP1 binding sites and some other CG-rich motifs. We found that the upstream regions in each group of CGIs had many more CG-rich motifs (especially SP1 binding sites) than the downstream regions, although the latter also contained more CG-rich motifs than statistically expected. New genome-wide sequencing methods have demonstrated that in many cases transcription begins several hundred bases upstream of the previously annotated transcription start sites. We can therefore assume that within CpG islands it is precisely the transcribed region that shows a strong overrepresentation of (C)nG(C)n-like motifs, especially SP1 binding sites. This study was partially supported by RFBR grant 07-04-01584-a.

1 GosNIIgenetika, Moscow, Russia, [email protected] 2 IMB, Moscow, Russia, [email protected] 3 GosNIIgenetika, Moscow, Russia, [email protected]

COMPENSATORY EVOLUTION IN mt-tRNAs NAVIGATES SHIFTING BALANCE-LIKE VALLEYS OF LOW FITNESS MARGARITA MEER 1, FYODOR KONDRASHOV 1

Conventional wisdom holds that in the course of evolution a population does not commonly advance over a valley of low fitness. On the molecular level this point of view is also commonly assumed when interpreting data on compensatory evolution. For example, a transition between an AU complementary interaction in an RNA stem structure and a GC interaction is thought of as proceeding through a benign GU intermediate. We have studied patterns of compensatory evolution in 1758 mammalian mitochondrial (mt) tRNAs. We show that the two substitutions that make up a compensatory transition between AU and GC occur in a non-random fashion, such that they are observed on the same branch of a phylogenetic tree more often than expected by chance. We use the natural mutational bias of the mitochondrial genome strands to compare the rate of AU to GC evolution through GU versus AC intermediates. We compared the rates of evolution on the strand that strongly favors the AU <-> AC <-> GC evolutionary pathway over the AU <-> GU <-> GC pathway because of the differences in the respective mutation rates. We find that compensatory evolution through the a priori less fit AC intermediate occurs as frequently as evolution through the GU intermediate, and that the compensatory substitutions in the AU <-> AC <-> GC pathway are much more likely to be clustered on the same branch of the phylogenetic tree. We conclude that compensatory evolution in mt-tRNAs commonly proceeds through a valley of low fitness, giving broad support to Wright's Shifting Balance theory on the molecular level.

1 Center for Genomic Regulation (CRG), Spain, [email protected], [email protected]

MODELING OF AUXIN DISTRIBUTION IN ROOT: RHIZOTAXIS IS DEFINED BY AUXIN REGULATION OF ITS OWN TRANSPORT VICTORIA MIRONOVA 1, NADYA OMELYANCHUK 1, VITALY LIKHOSHVAI 1

Rhizotaxis, the arrangement of lateral roots along the primary root, has a left-right alternating pattern in some species, for example in Arabidopsis thaliana [1]. However, in contrast to phyllotaxis, the mechanisms of lateral root positioning are not so robust: a substantial fraction of plants in an A. thaliana population has a variable number of lateral roots at sporadic positions. The first stage of lateral root primordium development is marked by an increase of auxin concentration in a pericycle cell located 15-20 mm upstream of the root apical meristem (RAM) [2]. This cell then divides several times and forms the lateral root meristem. However, other auxin maxima precede these changes in the pericycle cell. These maxima occur with regular periods in protophloem cells 200-350 micrometres upstream of the RAM [1]. Recently, a model of auxin-regulated rhizotactic patterning has been suggested that explains the increase of auxin concentration in pericycle cells by changes in cell shape on the curves of the primary root [2]. The model reproduces left-right lateral root positioning in A. thaliana; however, it does not consider (i) the violations of rhizotactic patterning and (ii) the auxin maxima in protophloem that precede lateral root initiation in the pericycle. Previously, we published a model in which computational simulations clearly demonstrated that auxin dose-dependent regulation of its own transport is a sufficient mechanism for the generation and positioning of auxin maxima along the central axis of the primary root [3]. Here, we present a model that explains the mechanisms behind both aspects of lateral root positioning: the regular formation of auxin maxima at a distance from the RAM and the variations in longitudinal spacing between two lateral root primordia. The model describes the processes of auxin diffusion, dissipation and active transport in root cells and the apoplast. Two forms of auxin, protonated and anionic, which have different membrane permeability, are taken into account. Both directions of active transport are considered: auxin transfer from cell to apoplast by PIN1 efflux carriers and the opposite uptake by AUX1 influx carriers. Expression of PIN1 and AUX1 depends on the auxin concentration in the cell. The model has two linear configurations: the file of protoxylem cells, where auxin is transported by PIN1 only, and the file of protophloem cells, where both PIN1 and AUX1 transporters work. We also investigated auxin distribution in a 2D model that represents a longitudinal root section and contains 12 cell files (2 protophloem, 8 protoxylem, 2 pericycle). Computational simulations showed that the appearance of auxin maxima in protophloem cell files is a regular event that occurs with a definite periodicity in response to increasing auxin flow from the shoot. At the same time, the location of each newly formed auxin maximum along the central axis of the primary root varies within a finite interval that is defined by the dynamics of growth. Thus, the ratio of variable to regular trends in lateral root initiation determines the level of regularity of the rhizotactic pattern. The model demonstrates (1) the sufficiency of the minimal mechanism, consisting of auxin-regulated PIN1 expression, for periodic formation of the inner auxin maxima, and (2) the requirement of AUX1-mediated auxin transport for positioning these maxima at a finite interval away from the RAM.

1 Institute of Cytology and Genetics, SB RAS, Novosibirsk, Russian Federation, [email protected], [email protected], [email protected]

Acknowledgments: This work was partly supported by the following grants: RFBR №08-04-01008, SB RAS integration projects №107, №119, the RAS programs 21 (project 26) and 22 (project 8), SS-2447.2008.4.

1. De Smet I. et al., 2007. Development. 134 (4). 681-690. 2. Laskowski M. et al. 2008. PLOS Biology. V.6. N. 12. e307. 3. Likhoshvai V.A. et al. 2007. Russian journal of developmental biology. 38 (6). 446-456.


CORRELATIONS BETWEEN DNA-BINDING DOMAINS AND THEIR DNA BINDING SITES D.S. MITEVA 1, V.V. STEPANOVA 1, A.B. RAKHMANINOVA 1

Keywords: protein-DNA binding, correlations

Protein-DNA interactions play a major role in many processes such as replication, repair, restriction, transcription and its regulation. The exact mechanisms of specific DNA-protein recognition are still poorly understood. One possible approach to this problem is to study coevolving positions in the DNA-binding protein domains of transcription factors (TFs) and their binding sites on DNA (BSs). Earlier, our group developed a program for such analysis. The correlation between residues in a column of a protein alignment and bases in a column of a DNA alignment is measured by the mutual information between the two distributions. The program was initially tested on transcription factors from the LacI family, and the results were in good agreement with experimental and structural data (Y. Korostelev, this conference). The goal of this study is to analyze DNA-protein correlations in other protein families. The CRP-FNR superfamily includes the DNR, NnrR, HcpR, CooA, CRP and FNR families of bacterial transcription factors. We analyzed a set of 62 transcription factors and 932 binding sites of these TFs. The data were kindly provided by D. Rodionov and D. Ravcheev. Only 7% of possible pairs (position in the TF alignment, position in the BS alignment) were significantly correlated. The most correlated pair of positions corresponds to Arg180-Gua5 in the 3D structure of CRP from E. coli (PDB ID: 1zrc). The pairs showing the highest correlation all involve BS position 5 or the complementary position 18. Only three amino acid residues in CRP form specific contacts with DNA (that is, contacts between the amino acid side chain and the base). One of these (Arg185) is highly conserved, whereas the other two (Arg180, Glu181) are at the top of the list of correlated pairs. Further analyses demonstrated that these correlations do not result from shared history (they are not a phylogenetic trace) and that the results agree with published data on point mutations [1].
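The mutual information measure mentioned above can be illustrated with a minimal sketch. It assumes that each transcription factor is paired with one binding site, so that a protein alignment column and a DNA alignment column have the same length; the function name and the toy columns are hypothetical, and the significance test used by the authors' program is not reproduced here.

```python
from collections import Counter
from math import log2


def mutual_information(protein_column, dna_column):
    """Mutual information (in bits) between paired symbols of two alignment columns.

    protein_column[i] and dna_column[i] are assumed to come from the same
    TF-BS pair (an assumption of this sketch).
    """
    if len(protein_column) != len(dna_column):
        raise ValueError("columns must contain the same number of TF-BS pairs")
    n = len(protein_column)
    joint_counts = Counter(zip(protein_column, dna_column))
    aa_counts = Counter(protein_column)
    nt_counts = Counter(dna_column)
    mi = 0.0
    for (aa, nt), count in joint_counts.items():
        p_joint = count / n
        # p(aa, nt) / (p(aa) * p(nt)) expressed with raw counts
        ratio = count * n / (aa_counts[aa] * nt_counts[nt])
        mi += p_joint * log2(ratio)
    return mi


# Hypothetical toy columns: residue and base change together in the first case
coupled = mutual_information("RRKK", "GGAA")
uncoupled = mutual_information("RRKK", "GAGA")
```

In this toy case the fully coupled columns give one bit of mutual information and the uncoupled columns give zero; real alignments would additionally need a correction for sample size and phylogenetic relatedness.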

The C-protein family includes regulators of restriction-modification systems in Prokaryota (130 TF-BS pairs from [2]). Only 2% of position pairs were correlated. We saw no correlations in the right part of the BS motif and confirmed the weakly palindromic structure in the left part of the motif. We observed correlated pairs that include Glu25 and Arg35. These amino acids form the contacts between TF monomers (numbering as in PDB ID: 3clc). The most frequent amino acid in the best correlated pairs is the negatively charged Asp34, which is next to the highly conserved Arg35 and in close contact with Gua3. This motif is reminiscent of the pair Glu181 and Arg180 in the CRP-FNR superfamily. Finally, we have studied the N6-N4-Methylase family (PF01555). This family contains enzymes that methylate the exocyclic amino group (NH2) of cytosine or adenine, yielding N4-methylcytosine (N4mC) or N6-methyladenine (N6mA), respectively. We used 311 enzyme-site pairs from REBASE (http://rebase.neb.com/rebase/rebase.html). Only 4% of possible position pairs (24 of 768) were significantly correlated. No 3D structure of a protein-DNA complex is known for this family, so a detailed structural analysis was not feasible. Still, the amino acid positions from four of the 24 correlated pairs correspond to Lys197 and Asp30 in the structure PDB ID: 1g60; these amino acids are located near the transferred CH3 group of S-adenosyl-L-methionine (SAM). We propose that these positions determine which nucleotide is methylated (A or C): most adenine-specific methyltransferases contain Lys and Asp, and cytosine-specific enzymes contain Phe and Ser, respectively.
To summarize, we have found that
• the correlated pairs agree with available 3D structures and with known experimental data;
• two families were shown to have a common structural pattern: a helix containing a correlated negatively charged amino acid (Glu or Asp) next to an Arg that forms a close contact with DNA;
• the binding pocket for the methylated base in the N6-N4-Methylase family was predicted.
We are grateful to M.S. Gelfand for useful discussion.

1 Department of Bioengineering and Bioinformatics, Moscow State University, Russian Federation

1. Ebright RH, Cossart P, Gicquel-Sanzey B, Beckwith J. Mutations that alter the DNA sequence specificity of the catabolite gene activator protein of E. coli. Nature, 311: 232 - 235 (20 September 1984) 2. Sorokin V, Severinov K, Gelfand MS. Systematic prediction of control proteins and their DNA binding sites. Nucleic Acids Res. 2009 Feb; 37(2): 441-51.

EXCEPTIONAL NUCLEOTIDE SEQUENCES IN GENOMES OF DIFFERENT ORGANISMS SERGEI MITROFANOV 1, ALEXANDER PANCHIN 2, ANDREI ALEXEEVSKI 3, SERGEI SPIRIN 3, YURY PANCHIN 4

Keywords: genomes, exceptional words, CpG, SNP

1 Faculty of Bioengineering and Bioinformatics, Moscow State University, Russian Federation
2 Faculty of Bioengineering and Bioinformatics, Moscow State University; Institute for Information Transmission Problems, RAS, Russian Federation
3 Belozersky Institute, Moscow State University; Scientific Research Institute for System Studies (NIISI RAS), Moscow, Russian Federation
4 Institute for Information Transmission Problems, RAS; Faculty of Bioengineering and Bioinformatics, Moscow State University, Russian Federation

One of the most obvious tasks in bioinformatics is the analysis of the relative representation of different nucleotide sequences (words), from single letters (GC content) to di-, tri-, tetranucleotides and longer, within and across genomic sequences. With the development of reliable sequencing techniques and the accumulation of sequence data, important observations on DNA content were made. As genomic databases are growing rapidly, it is necessary to update these observations and extend them to a wide range of species. To serve this purpose, we analyzed DNA content and the relative representation of nucleotide words 1-15 letters long for over a hundred eukaryotic species. A method suggested by Karlin and Ladunga [1] was used to estimate under- and overrepresentation of words in each genome. We found that some words that were considered to be universally over- or underrepresented show considerable exceptions, while some other words appear to show a more universal trend. We also studied word content in the human genome in more detail and compared different types of sequences: coding, non-coding, or masked for different repeats. The best-known underrepresented two-letter word is CpG. It is accepted that the activity of a CpG-specific methyltransferase increases the mutation rate from CpG to TpG. Indeed, CpG deficit is universal for all studied Viridiplantae and most Metazoa. In Metazoa, some species of insects and nematodes show no CpG deficit or even show overrepresentation of this dinucleotide. In nematodes this correlates with the loss of methyltransferase in some species, yet in insects the situation is more complicated: in the honey bee, active methyltransferases coexist with CpG overrepresentation. The studied fungi also demonstrate a diverse spectrum of CpG representation. The next most universally underrepresented dinucleotide is TpA, which is underrepresented in all studied genomes except for Plasmodium falciparum. The tendency of T and A nucleotides to form long homogeneous stretches (AA…A, TT…T) may contribute to this effect, because expected TpA frequencies do not take it into account. Indeed, TpA underrepresentation is negatively correlated with ApA(TpT) overrepresentation. This explanation would predict the same behavior for the ApT dinucleotide. Nevertheless, TpA is underrepresented compared to ApT in all species. ApC and GpT are underrepresented in most analyzed genomes. The exceptions are three chordate groups: Lamprey, Lancelet and Ciona. We do not know of any statistical reasons for the underrepresentation of ApC(GpT). An actual biological mechanism appears to be responsible for CpG deficit. It would be interesting to know whether any specific mechanism results in ApC(GpT) or TpA deficit.
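The notion of dinucleotide under- and overrepresentation discussed above can be illustrated with a simple observed-to-expected ratio. This is only a sketch under stated assumptions: it uses the single-strand ratio f(XY) / (f(X) f(Y)) with no strand symmetrization or handling of ambiguous bases, and it is not claimed to be the exact statistic of [1]; the function name and the toy sequence are hypothetical.

```python
from collections import Counter


def dinucleotide_odds_ratios(sequence):
    """Observed/expected ratio for every dinucleotide present in the sequence.

    Expected frequency of XY is the product of the single-nucleotide
    frequencies of X and Y (an independence assumption).
    Values below 1 suggest underrepresentation, above 1 overrepresentation.
    """
    seq = sequence.upper()
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two nucleotides")
    mono_counts = Counter(seq)
    di_counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_mono = len(seq)
    n_di = len(seq) - 1
    ratios = {}
    for pair, count in di_counts.items():
        expected = (mono_counts[pair[0]] / n_mono) * (mono_counts[pair[1]] / n_mono)
        ratios[pair] = (count / n_di) / expected
    return ratios


# Hypothetical toy sequence with homogeneous A and T stretches
ratios = dinucleotide_odds_ratios("AAAATTTT")
```

In this toy sequence the homogeneous stretches make ApA and TpT overrepresented, while TpA does not occur at all, which mirrors the stretch effect described in the text.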
We have also taken advantage of the human SNP database to determine the trends of nucleotide substitution in words of different length. To determine the direction of single nucleotide mutations in the human genome, we selected SNPs that were mapped to a corresponding Pan troglodytes genomic region. We compared the alleles of each human SNP with the chimp variant to determine the direction of the mutation: the allele matching the chimp variant was considered to be ancestral. For instance, our analysis has shown that the rate of (G or C) to (A or T) mutations is larger than the rate of (A or T) to (G or C) mutations. When these rates are taken into account, the equilibrium nucleotide content of the human genome turns out to be 38.6% (G+C) and 61.4% (A+T). The current nucleotide content of the human genome is 42.1% (G+C) and 57.9% (A+T). The nucleotide content of the human genome has not reached equilibrium, and a further decrease in (G+C) is predicted. Acknowledgements: supported by RFBR 08-04-00478, 08-04-91975 and MCB RAS
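The link between the two substitution rates and the equilibrium (G+C) content can be sketched with a two-state model. This is an assumption of the sketch, not necessarily the authors' calculation: every site is either GC or AT, u is the per-site rate of (G or C) to (A or T) changes, v is the rate of the reverse changes, and the equilibrium GC fraction is v / (u + v). The function names are hypothetical.

```python
def equilibrium_gc(rate_gc_to_at, rate_at_to_gc):
    """Equilibrium GC fraction of a two-state model: v / (u + v)."""
    return rate_at_to_gc / (rate_gc_to_at + rate_at_to_gc)


def implied_rate_ratio(gc_equilibrium):
    """Ratio u / v of the two rates that would give the stated equilibrium."""
    return (1.0 - gc_equilibrium) / gc_equilibrium


# Equilibrium value quoted in the abstract: 38.6% (G+C)
ratio = implied_rate_ratio(0.386)
gc_back = equilibrium_gc(ratio, 1.0)
```

Under this simplified model, the quoted equilibrium of 38.6% (G+C) would correspond to (G or C) to (A or T) changes being roughly 1.6 times as frequent per site as the reverse changes.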

1. S.Karlin, I.Ladunga (1994) Comparisons of eukaryotic genomic sequences, Proc Nat Acad Sci, 91:12832-12836.


MATHEMATICAL MODEL OF THE INHIBITING PART IN TCA AT CITRIC ACID SYNTHESIS BY SUPERPRODUCERS CROSS-MUTANTS OF YARROWIA LIPOLYTICA FROM GLUCOSE YULIA LUNINA 1, ANDREW RUDENKO 2, IGOR MORGUNOV 1

1 G. K. Skryabin Institute of Biochemistry and Physiology of Microorganisms, Russian Academy of Sciences, Russian Federation
2 Faculty of Physics, Lomonosov Moscow State University, Russian Federation

As described in [1], treatment of the natural yeast Yarrowia lipolytica 704 with UV irradiation and the mutagen N-methyl-N'-nitro-N-nitrosoguanidine (NG) yielded 1500 variants, and three of these mutant strains were found to be excellent citric acid (CA) superproducers on glucose. The acid-forming activity of one of those mutants (N 15) substantially (by 50%) exceeded that of the natural strain. Furthermore, the mutant mass yield was 43.0% of consumed glucose and considerably exceeded that of the natural strain (19%). The natural strain and mutant N 15, cultivated in fermenters, both accumulated CA at significant levels, about 70 g/l. However, the mutant strain produced this amount in 3 days of cultivation, whereas the natural yeast accumulated the same amount only by day 5. Optimization of nutrient media and cultivation parameters did not substantially help to overcome the barrier of 70 g/l in either the natural strain or the mutant. Apparently, this is connected with the inhibition of some metabolic reactions involved in CA oversynthesis. To understand at what stage of synthesis the inhibition occurs, it was necessary to construct a mathematical model. The Wayman and Tseng equation with Andrews-type inhibition was taken as a basis [2]. The Andrews function was combined with a linearly decreasing activity function to form a five-parameter discontinuous model.
Model fitting. The parameter values obtained agree with general concepts of CA oversynthesis by yeast and with literature data [3-5].
Results.
1. CA yield levels of about 70 g/l were reached stably and were slightly exceeded;
2. The excess over these levels was statistically significant (σ² and least squares method) but small in size (a few percent);
3. The system’s behavior is non-conservative, no stability with respect to a small parameter was found, and the inhibition starts abruptly and stochastically.
Future prospects: to model the whole TCA cycle not only with a view to searching for "bottlenecks", but also to create a full mathematical model of all chemical changes in the TCA cycle, by analogy with the well-investigated and well-modeled thermodynamic Carnot cycle.
Acknowledgements. The authors are very grateful to Dr. Svetlana V. Kamzolova (Institute of Biochemistry and Physiology of Microorganisms, Russian Academy of Sciences, Pushchino) for the numerous valuable remarks made during discussion of this paper.
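The five-parameter discontinuous model mentioned in the abstract can be sketched as follows. This is one plausible reading, assuming the standard Andrews (Haldane) rate expression and a linear activity loss above a threshold concentration; the parameter names (mu_max, k_s, k_i, slope, threshold) and the example values are hypothetical and are not taken from the authors' fit.

```python
def andrews_rate(s, mu_max, k_s, k_i):
    """Andrews (Haldane) inhibition kinetics: mu_max * s / (k_s + s + s^2 / k_i)."""
    return mu_max * s / (k_s + s + s * s / k_i)


def andrews_with_linear_cutoff(s, mu_max, k_s, k_i, slope, threshold):
    """Andrews rate minus a linear term that switches on above `threshold`.

    Below the threshold the model equals the Andrews function; above it the
    activity decreases linearly with slope `slope`; the rate is clipped at zero,
    which represents complete inhibition.
    """
    rate = andrews_rate(s, mu_max, k_s, k_i)
    if s > threshold:
        rate -= slope * (s - threshold)
    return max(rate, 0.0)


# Hypothetical parameter values, for illustration only
below = andrews_with_linear_cutoff(1.0, 1.0, 1.0, 1.0, slope=1.0, threshold=5.0)
above = andrews_with_linear_cutoff(10.0, 1.0, 1.0, 1.0, slope=1.0, threshold=5.0)
```

The switch at the threshold is what makes the model discontinuous in its behavior: below the threshold the rate follows the smooth Andrews curve, while above it the rate can drop to zero within a narrow concentration range.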

1. T.V.Finogenova et al. (2008) Obtaining of the mutant Yarrowia lipolytica strains producing citric acid from glucose, Prikl. Biokhimiya i Microbiologiya 44, 2: 219-224. (rus) 2. G.Alagappan, R.M.Cowan (2001) Biokinetic models for representing the complete inhibition of microbial activity, Biotechnology and bioengineering 75, 4: 393-405. 3. L.M.Glazunova, T.V.Finogenova (1976) Enzyme activity of citrate, glyoxylate and pentose phosphate cycles during synthesis of citric acids by Candida lipolytica, Microbiologiya XLV, vol. 3: 444-449. (rus) 4. I.T.Ermakova, T.V.Finogenova (1971) Participation of glyoxylate cycle in metabolism of alkane-oxidizing yeast Candida lipolytica during biosynthesis of α-keto-glutaric acid, Microbiologiya XL, vol. 2: 223-226. (rus) 5. I.G.Morgunov et al. (2004) Regulation of NAD+-dependent isocitrate dehydrogenase in the citrate producing yeast Yarrowia lipolytica, Biochemistry (Moscow) 69, 12: 1391-1398.


STUDYING ORIGIN OF LIFE THROUGH DATA MINING: TRACES OF THE PRIMEVAL ZINC WORLD IN MODERN PROTEIN AND RNA DATABASES. ARMEN Y. MULKIDJANIAN1,2, MICHAEL GALPERIN 3

1 School of Physics, Universität Osnabrück, D-49069 Osnabrück, Germany
2 A.N. Belozersky Institute of Physico-Chemical Biology, Moscow State University, Moscow, 119991, Russia
3 National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA

The complexity of the problem of the origin of life has resulted in a large number of possible evolutionary scenarios. Their number, however, can be dramatically reduced by the simultaneous consideration of different bioenergetic, physical, and geological constraints as boundary conditions. We put forward a consensus evolutionary scenario that satisfies the known constraints by proposing that life on Earth emerged, powered by solar radiation, at porous, photosynthetically active edifices made of zinc sulfide (ZnS), similar to those found around modern deep-sea hydrothermal vents [1]. Under the high pressure of the carbon dioxide-dominated primeval atmosphere, such compartmentalized ZnS edifices could build up in sub-aerial settings on the first continents, with direct access to UV-rich solar light. This scenario suggests that ZnS surfaces (1) used solar radiation to drive carbon dioxide reduction, yielding the building blocks for the first biopolymers, (2) served as templates for the synthesis of longer biopolymers from simpler building blocks, and (3) protected the first biopolymers from photo-dissociation by absorbing excess radiation energy from them. The Zinc World concept resolves the conflict between the “metabolism first” and “replication first” concepts of abiogenesis. These are suggested to reflect the two facets of the Zinc World, namely the continuous abiogenic photosynthesis of metabolites and their further conversion by ZnS-confined replicating entities [1]. We also formulate a set of biological predictions stemming from the hypothesis of the photosynthetic origin of life in hydrothermal ZnS settings and check the validity of these predictions using data from public protein and RNA databases. ZnS-mediated photosynthesis should result in the release of Zn2+ ions, increasing their concentration inside the ZnS compartments. Therefore, the idea that life started inside such compartments leads to the suggestions that (i) the elevated Zn2+ content of the primordial environments should be conserved inside modern cells; (ii) there should be ribozymes with Zn-dependent catalytic activities; (iii) crystal structures involving RNA molecules should be enriched in Zn2+ ions; (iv) Zn2+ ions should be associated with the evolutionarily oldest protein folds; (v) enzymes with evolutionarily “old” functions should depend on Zn2+; (vi) the enzymes that emerged to take over the catalytic functions from ribozymes should be dependent on Zn2+. We show that the Zinc World hypothesis successfully passes each of these falsification tests. In addition, we demonstrate the explanatory power of the Zinc World concept by elucidating several facts that have so far remained without acceptable rationalization. In particular, the Zinc World concept implies a new scenario for the separation of Bacteria, Archaea, and Eukarya [2]. The ability of the Zinc World hypothesis to generate non-trivial verifiable predictions and to explain previously obscure observations gives credence to its key postulate that the development of the first life forms started inside photosynthesizing ZnS formations of hydrothermal origin. Further work will be needed to provide detailed mechanisms, including primordial (bio)chemical reactions, of the origin of life in the Zinc World.

1. Mulkidjanian AY: Origin of life in the Zinc World: 1. Photosynthetic, porous edifices built of hydrothermally precipitated zinc sulfide (ZnS) as cradles of life on Earth. Biol Direct 2009:in press. 2. Mulkidjanian AY, Galperin MY: Origin of life in the Zinc World. 2. Validation of the hypothesis. Biology Direct 2009:in press.


PIPELINE FOR ACQUISITION OF HIGH-PRECISION QUANTITATIVE INFORMATION ON GENE EXPRESSION FROM CONFOCAL IMAGES . EKATERINA MYASNIKOVA 1, KONSTANTIN KOZLOV 1, MARIA SAMSONOVA 1

Keywords: confocal scanning microscopy, quantitative gene expression data, image processing

Confocal scanning microscopy is now a commonly used method for the acquisition of high-quality digital images of molecular biological objects. The quality of confocal images allows quantitative data to be extracted at single-cell resolution, the availability of which is a necessary prerequisite for successful systems biology studies. In view of this, there is a great need for a full software solution providing biologists with methods and tools for the visualization of confocal images, their analysis, and the acquisition of quantitative information. We have developed a pipeline of original methods for the acquisition of quantitative data on gene expression from confocal images. The methods include image segmentation, removal of non-specific background, spatial registration of gene expression patterns, creation of a spatiotemporal atlas of gene expression, as well as methods for the improvement of data accuracy. We estimate and correct data errors that arise in the course of fluorescence quantification, namely errors due to photon noise, averaging of clipped confocal scans, and diffractive scattering of images. The proposed methods were successfully applied to the acquisition of quantitative data on segmentation gene expression in Drosophila, the estimation of their accuracy, and the construction of an integrated map of Drosophila segmentation gene expression. The dataset has been generated over more than a decade and is stored in the freely available FlyEx database at http://urchin.spbcas.ru/flyex/ . The majority of the methods can be easily adapted to images of other organisms. For example, the methods were successfully applied to acquire quantitative data from the sea anemone Nematostella vectensis and the coral Acropora millepora. All the developed methods are implemented both as separate software tools and as part of the software package Prostack (Processing of Stacks), created earlier by the authors.
1 St.Petersburg State Technical University, Russian Federation

ORGANIZATION OF PHYSICAL INTERACTOMES AS UNCOVERED BY NETWORK SCHEMAS. ERIC BANKS 1, ELENA NABIEVA 2, BERNARD CHAZELLE 3, MONA SINGH 4

Keywords: protein networks

1 Princeton University; now at the Broad Institute, United States
2 Princeton University; now at the Whitehead Institute for Biomedical Research, United States
3 Princeton University, United States
4 Princeton University, United States

Understanding the ways in which proteins come together to perform various biological processes and thus create the life of a cell is one of the key challenges of biology. Large-scale determination of protein-protein interactions is an important step towards addressing this fundamental question. Commonly represented as networks or graphs, protein interactions create new opportunities for understanding cellular organization and functioning, but simultaneously pose the challenge of interpreting these data to obtain biological knowledge. Here, we focus on the problem of identifying shared mechanisms within interactomes, and introduce network schemas to describe patterns of interaction among distinct types of proteins. Network schemas specify descriptions of proteins and the topology of interactions among them. We develop a novel computational procedure for systematically uncovering recurring, over-represented schemas in interaction networks. We apply our methods to the S. cerevisiae physical interactome, focusing on schemas consisting of proteins described via sequence motifs and molecular function annotations and interacting with one another in one of four basic network topologies. We identify hundreds of recurring and over-represented network schemas of various complexities, and demonstrate via graph-theoretic representations how more complex schemas are organized in terms of their lower-order constituents. The uncovered schemas span a wide range of cellular activities, with many signaling and transport related higher-order schemas. We establish the functional importance of the schemas by showing that they correspond to functionally cohesive sets of proteins, are enriched in the frequency with which they have instances in the H. sapiens interactome, and are useful for predicting protein function.
In addition, we touch upon the use of schema analysis for comparative interactomic studies by examining the simplest network schemas in the H. sapiens interactome. Our findings suggest that network schemas are a powerful paradigm for organizing, interrogating, and annotating cellular networks.
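The idea of a schema instance can be illustrated for the simplest topology, a single interacting pair. The sketch below is an assumption-laden toy, not the procedure of [1]: proteins are described by sets of annotation labels, a pair schema is two labels joined by an interaction, and the protein names, labels and function name are hypothetical. The over-representation statistics are not implemented.

```python
def pair_schema_instances(edges, annotations, label_a, label_b):
    """Find instances of the pair schema (label_a)-(label_b) in an interaction network.

    Returns ordered pairs (a, b): protein a carries label_a, protein b carries
    label_b, and the two proteins interact. If label_a == label_b, an edge
    between two such proteins is reported in both orientations.
    """
    instances = set()
    for u, v in edges:
        # Interactions are undirected, so both role assignments are tried
        for a, b in ((u, v), (v, u)):
            if label_a in annotations.get(a, set()) and label_b in annotations.get(b, set()):
                instances.add((a, b))
    return instances


# Hypothetical toy interactome
edges = [("P1", "P2"), ("P2", "P3"), ("P3", "P4")]
annotations = {
    "P1": {"kinase"},
    "P2": {"SH3"},
    "P3": {"kinase"},
    "P4": {"transporter"},
}
found = pair_schema_instances(edges, annotations, "kinase", "SH3")
```

Counting such instances is only the first step; deciding whether a schema recurs more often than expected would require comparison against a suitable randomized network model.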

1. Banks E, Nabieva E, Chazelle B, Singh M. Organization of physical interactomes as uncovered by network schemas. PLoS Comput Biol. 2008 4(10): e1000203


RECLASSIFICATION OF GH13 FAMILY OF GLYCOSIDE HYDROLASES DIANA I. GIZATULLINA 1, DANIIL G. NAUMOFF 1

Keywords: Sequence-based classification of proteins, protein phylogeny, protein family, protein subfamily, glycoside hydrolases, GH13, GH31, GH36, GH70, TIM-barrel fold, CAZy, COG, PSI-BLAST, PSI Protein Classifier, α-amylase

1 State Institute for Genetics and Selection of Industrial Microorganisms, I-Dorozhny proezd, 1, Moscow 117545, Russia

Catalytic domains of the GH13 family of glycoside hydrolases occur in 5899 proteins in the CAZy database [1]. GH13 together with the GH70 and GH77 families forms the GH-H clan at a higher hierarchical level. Domains of these families have the (β/α)8-barrel (or TIM-barrel) structure, which is the most common 3D structure among glycoside hydrolase catalytic domains. Biochemically characterized members of the GH13 family possess 22 different enzymatic activities. Because of this polyspecificity, membership of this family alone cannot be used for predicting protein function from sequence. To solve this problem, in 2006 about 80% of the family members were grouped into 36 mainly monofunctional subfamilies (GH13_1-GH13_36) [1-3]. Members of the GH13 family encoded in complete genomes are grouped into four unicellular (COG) and four eukaryotic (KOG) clusters of orthologous groups of proteins [4]. Comparative analysis of GH13-containing proteins from the CAZy and COG/KOG databases allowed us to determine the correspondence between these two classifications. At least one member of 31 out of the 36 GH13 subfamilies is represented in the COG0296, COG0366, COG1523, COG3280, KOG0470, KOG0471, KOG2212, or KOG3625 clusters. We have retrieved 9328 non-identical sequences of GH13 domains from the GenPept database using the BLAST algorithm. The PSI Protein Classifier program [5] was employed to analyze the query-dependent order of sequence appearance during PSI-BLAST searches. To assign unclassified proteins to subfamilies, we developed E-value-based criteria that distinguish the members of each subfamily of the GH13 family. Some proteins listed in the CAZy database were added to or excluded from a particular subfamily; however, none of them was moved from one subfamily to another. In total, the 36 subfamilies contain GH13 domains from 6302 proteins (or 67.56%). Some subfamilies were found to be very similar.
Grouping them together, followed by enlargement, resulted in three extended subfamilies: GH13_15/24 (or KOG2212), GH13_20/21, and GH13_29/31. Analysis of the remaining proteins allowed us to propose ten additional subfamilies (GH13_A-GH13_J) in the GH13 family. Altogether, 7189 proteins (or 77.07%) were assigned to subfamilies. Comparative analysis of proteins belonging to the GH-H clan showed that the catalytic domains of the GH13_25 and GH13_33 subfamilies have a lower sequence similarity with domains from the other GH13 subfamilies than the latter have with GH70 domains. Based on these results, we propose to consider GH13_25 (or KOG3625) and GH13_33 as distinct families inside clan GH-H, which therefore consists of five families. Iterative screening of the protein database by PSI-BLAST using several GH13 domains as queries retrieved, during the first three iterations, some representatives of the GH31, GH36D, GH70, COG1649, and COG2342 families, suggesting their evolutionary relationship.

1. P.M. Coutinho, B. Henrissat (2009) Carbohydrate-Active Enzymes server [http://www.cazy.org/]. 2. M.R. Stam et al. (2006) Dividing the large glycoside hydrolase family 13 into subfamilies: towards improved functional annotations of α-amylase-related proteins, Protein Eng. Des. Sel., 19:555–562. 3. M. Stam (2006) Evolution et prédiction des activités des glycoside hydrolases, Ph.D. Dissertation, Université Aix-Marseille 1. 174 p. [http://www.sfbi.fr/Theses/2006_Stam_Mark.pdf]. 4. R.L. Tatusov et al. (2009) COG: Phylogenetic classification of proteins encoded in complete genomes [http://www.ncbi.nlm.nih.gov/COG/]. 5. D.G. Naumoff, M. Carreras (2009) New program PSI Protein Classifier automatizes the PSI-BLAST results analysis, Molecular Biology (Engl. Transl.), 43(4). In press. [http://bioinform.genetika.ru/members/Naumoff/PSI_Protein_Classifier.htm].


SEQUENCE ANALYSIS OF ENDO-α-N-ACETYLGALACTOSAMINIDASES AND THEIR HOMOLOGUES DANIIL G. NAUMOFF 1

1 Laboratory of Bioinformatics, State Institute for Genetics and Selection of Industrial Microorganisms, I-Dorozhny proezd, 1, Moscow 117545, Russia

Glycoside hydrolases are a widespread group of enzymes hydrolyzing various carbohydrates and glycoconjugates. On the basis of sequence similarity of their catalytic domains, all of them have been grouped into more than 100 families (GH1-GH115). Family GH101 includes retaining endo-α-N-acetylgalactosaminidases (EC 3.2.1.97) and their uncharacterized homologues (41 proteins in total) [1]. This family was described in 2005 [2]. The 3D structure of its representative from Streptococcus pneumoniae (PDB, 3ECQ) has been solved recently [3]. According to these data, GH101 domains have a distorted (β/α)8-barrel structure. The (β/α)8-barrel or TIM-barrel is the most common 3D structure among glycoside hydrolase catalytic domains [1, 4]. We have retrieved 98 non-identical protein sequences of GH101 domains from the GenPept database using the BLAST algorithm. They represent 21 genera of bacteria: Abiotrophia, Actinomyces, Anaerococcus, Arthrobacter, Bacillus, Bacteroides, Bifidobacterium, Blautia, Catonella, Clostridium, Collinsella, Dyadobacter, Enterococcus, Janibacter, Opitutaceae, Propionibacterium, Renibacterium, Ruminococcus, Stackebrandtia, Streptococcus, and Streptomyces. We applied the PSI Protein Classifier program [5] to analyze the order of sequence appearance during the first round of PSI-BLAST searches, depending on the query. This analysis allowed us to distinguish six subfamilies (101a-101f) in the GH101 family. Phylogenetic analysis of the GH101 family suggests the monophyletic status of each subfamily. Most GH101-containing proteins have several additional common domains. Based on the conserved domain structure and the presence of three invariant catalytically essential residues, we propose the same enzymatic function for all proteins of the GH101 family. Iterative screening of the protein database by PSI-BLAST revealed the closest relationship of GH101 domains with uncharacterized protein domains representing two new protein families.
Proteins with accession numbers AAN24642.1 and EDM96541.1 are their representatives. More distant similarity was found with some proteins from the GH13, GH20, GH27, GH29, GH31, GH36, GH66, GH97, COG1306, and COG1649 families. The closest of them is GH13, which is the biggest family among glycoside hydrolase catalytic domains. Moreover, GH13 domains have a typical TIM-barrel structure. Two key catalytic residues (nucleophile and proton donor) of GH13 and GH101 proteins are located in homologous positions. Using AAN24642.1 and EDM96541.1 as queries allowed us to reveal a relationship with several other enzymatically uncharacterized protein families. Based on the obtained results, we suggest a hierarchical classification of the analyzed protein domains. It can be incorporated into the classification of the TIM-barrel type glycoside hydrolases that we proposed earlier [4].

1. P.M. Coutinho, B. Henrissat (2009) Carbohydrate-Active Enzymes server [http://www.cazy.org/]. 2. K. Fujita et al. (2005) Identification and molecular cloning of a novel glycoside hydrolase family of core 1 type O-glycan-specific endo-α-N-acetylgalactosaminidase from Bifidobacterium longum, J. Biol. Chem., 280:37415–37422. 3. M.E.C. Caines et al. (2008) The structural basis for T-antigen hydrolysis by Streptococcus pneumoniae: a target for structure-based vaccine design, J. Biol. Chem., 283:31279–31283. 4. D.G. Naumoff (2006) Development of a hierarchical classification of the TIM-barrel type glycoside hydrolases, Proceedings of the Fifth International Conference on Bioinformatics of Genome Regulation and Structure. July 16-22, 2006. Novosibirsk. Russia. 1:294–298 [http://www.bionet.nsc.ru/meeting/bgrs2006/BGRS_2006_V1.pdf]. 5. D.G. Naumoff, M. Carreras (2009) New program PSI Protein Classifier automatizes the PSI-BLAST results analysis, Molecular Biology (Engl. Transl.), 43(4). In press. [http://bioinform.genetika.ru/members/Naumoff/PSI_Protein_Classifier.htm].


PROKARYOTIC TRANSFER RNA: RATE OF MOLECULAR EVOLUTION, NUMBER OF COPIES, STABILITY & CODON USAGE. OLESYA NECHAY 1, MAKSIM SOROKIN 2, KONSTANTIN POPADIN 3

Keywords: tRNA, molecular evolution, gene conversion, codon usage, stability

The rate of molecular evolution of transfer RNA (tRNA) genes can be explained in several ways. According to recent studies, each tRNA has evolved to adjust its affinity for the elongation factor and for the ribosome in a way that compensates for the particular affinity of its cognate amino acid (Dale and Uhlenbeck, 2005). At the same time, tRNA genes are generally present in many copies, which evolve in dependence on each other as a result of ectopic gene conversion (Li, Molecular evolution, 1997). Based on these literature data we expected that: (i) tRNA stability should depend on amino acid affinity and might correlate with the evolutionary rate of the gene; (ii) an increase in the number of gene copies might decrease the evolutionary rate of the gene. Our aim was to test these hypotheses.

In total, 432 prokaryotic genomes containing at least 18 common types of tRNA were analyzed. Sequences of prokaryotic tRNA genes taken from the TIGR database (JCVI) were aligned, and their stability was calculated. A phylogenetic tree was built from a concatenation of the alignments of the 18 tRNA types most widespread in our sample; constraints of deep phylogeny were also taken into account. Given the alignment of an analyzed gene and its phylogenetic tree, the evolutionary rate was taken as the length of the external branch of the tree leading to the present-day species. The data were analyzed statistically on two different scales: (i) the comparative-species scale, where the evolution of each tRNA molecule was observed in different species and the association between evolutionary rate and the number of tRNA gene copies was then analyzed; (ii) the comparative-tRNA scale, where the same parameters were compared between different tRNA genes within one prokaryotic genome.

(i) Our observations corroborate that the stability of a tRNA molecule is mainly defined by thermodynamic compensation of the affinity of its cognate amino acid; all other effects were weak and secondary. A significant positive regression with a weak slope was obtained between stability and evolutionary rate for tRNAs of all types. (ii) An increasing number of gene copies was associated with a decreasing evolutionary rate for the overwhelming majority of tRNAs. Interestingly, the same correlation was observed on both scales of our analysis. Such a result could be explained both by gene conversion and by strong evolutionary constraints acting on multigene tRNA families. A more direct consequence of gene conversion, namely greater gene similarity within a large tRNA gene family than within a small one, was revealed for each type of tRNA. This gives extra support to the suggestion of active gene conversion: the higher the number of gene copies in a genome, the lower the mean divergence of the copies.

Additionally, an association of the tRNA gene characteristics described above (stability, rate of molecular evolution and copy number) with codon usage was proposed. We expected that: (i) the closely located codon is used more frequently than the average one; (ii) the higher the usage of a codon, the more stable its cognate tRNA; (iii) the higher the usage of a codon, the lower the rate of evolution of the corresponding tRNA gene. Tests of these three hypotheses are the subject of our current analyses.

1 MSU FBB, Russian Federation, [email protected]
2 MSU FBB, Russian Federation, [email protected]
3 IPPI, Russian Federation, [email protected]

1. Taraka Dale and Olke C. Uhlenbeck (2005) Amino acid specificity in translation, TRENDS in Biochemical Sciences, 12:659-665.
2. Li (1997) Chapter 11: gene conversion, Molecular evolution.


THE BASE-CALLING ALGORITHM WITH VOCABULARY YURI S. FANTIN 1, DENIS A. RESHETOV 2, ALEXEY D. NEVEROV 1, ALEXANDER V. FAVOROV 3, ANDREY A. MIRONOV 2, VLADIMIR P. CHULANOV 1

1 Central Research Institute of Epidemiology, Russian Federation, [email protected], [email protected], [email protected]
2 Moscow State University, Russian Federation, [email protected], [email protected]
3 State Scientific Centre GosNIIGenetika, Russian Federation, [email protected]

We developed a new computational method of base calling for Sanger chromatograms that uses a vocabulary of homologous sequences to improve the signal-to-noise ratio. In clinical practice, population (direct) sequencing of PCR products is widely used to monitor drug resistance mutations in the genomes of fast-evolving pathogens during therapy. In such studies the locus of interest in the pathogen genome is well known, and a multiple alignment of homologous DNA sequences from public databases can be built easily. The new method allows any polymorphic variants in sequence positions and has better base-calling specificity for low-quality segments of a chromatogram. The main advantage of our approach is its accuracy and high sensitivity of minor allele detection.

Following [1], we represent a chromatogram from a standard capillary sequencing machine as an array of peaks $Z = z_1 \dots z_N$, $z_i = [x_i, t_i, s_i]$, ordered by coordinate $t$. For each peak, $x$ is a vector of physical parameters and $s$ is the corresponding nucleotide. The set of aligned known nucleotide sequences homologous to the unknown genetic variants mixed in the studied chromatogram is denoted $W$. We are looking for an optimal base calling $\max_B P(B \mid Z, W)$, where $B$ is a partition of the peaks into positions of the nucleotide sequence. The partition $B$ can be represented as a sequence of peak indexes of the same length as the chromatogram, $N = \|Z\|$: $B = y(1) \dots y(N)$, where $y(i) \in \mathbb{N}$, $y(i) \ge 0$, $i = 1 \dots N$. The peak index $y(i)$ is the position number in the corresponding nucleotide sequence, or zero for erroneous peaks. Bayesian decomposition of our function gives

$$P(B \mid Z, W) = \frac{P(Z \mid B)\,P(B \mid W)}{\sum_{B} P(Z, B \mid W)}$$

We formulate all possible partitions as paths on a graph whose vertices correspond to possible combinations of peaks into nucleotide positions. Each position combines at least one and at most four peaks of different nucleotide types. Two parameters determine graph construction: the maximal distance between peaks combined in one position, and the maximal distance between positions. We assume a first-order Markov chain on positions. $P(Z \mid B)$ is a likelihood function that describes the physical properties of the chromatogram $Z$ given its partition. The vocabulary $W$ is represented as a POA graph [2]. For each partition $B$ our algorithm then calculates the prior probability $P(B \mid W)$ of being aligned with the vocabulary $W$. The objective $P(B \mid Z, W)$ is described as a Hidden Markov Model in which $B$ is the variable of interest.

Our algorithm was tested on manually annotated chromatograms of a fragment of the hepatitis D virus genome coding for the D-antigen, on known mixtures of cloned sequences, and on simulated data. The method is equally sensitive to the primary nucleotide sequence and at least 1% more specific compared to Sequencing Analysis v.3.7 (Applied Biosystems). The specificity gain for low-quality parts at the beginning and the end of a chromatogram is about 10%. The sensitivity to a minor allele in a mixture is about 15% and depends on the level of contamination of the target PCR product by nonspecific DNA.
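The Bayesian decomposition used in this abstract can be illustrated with a minimal sketch. It assumes a small, explicitly enumerated set of candidate partitions, whereas the abstract describes a graph and a Hidden Markov Model. The function name, the partition labels and all probabilities below are hypothetical and serve only to show how the likelihood P(Z|B) and the vocabulary prior P(B|W) combine into a posterior.

```python
from typing import Dict


def posterior_over_partitions(
    likelihood: Dict[str, float],  # P(Z | B) for each candidate partition B
    prior: Dict[str, float],       # P(B | W) for each candidate partition B
) -> Dict[str, float]:
    """Return P(B | Z, W) = P(Z | B) P(B | W) / sum_B P(Z, B | W)."""
    # Joint probability P(Z, B | W) for every candidate partition.
    joint = {b: likelihood[b] * prior[b] for b in likelihood}
    evidence = sum(joint.values())
    if evidence <= 0.0:
        raise ValueError("all candidate partitions have zero probability")
    return {b: j / evidence for b, j in joint.items()}


# Hypothetical toy numbers: three candidate base calls for one short segment.
likelihood = {"ACGT": 0.20, "ACAT": 0.15, "AC-T": 0.05}
prior = {"ACGT": 0.70, "ACAT": 0.25, "AC-T": 0.05}

posterior = posterior_over_partitions(likelihood, prior)
best_partition = max(posterior, key=posterior.get)
```

In this toy case the vocabulary prior reinforces the call that the signal already favors; a real implementation would search over paths in the partition graph instead of enumerating candidates.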

1. L.M. Andrade-Cetto, et al. (2005) A Graphical Model Formulation of the DNA Base-Calling Problem, Machine Learning for Signal Processing, 2005 IEEE Workshop, 28:369-374.
2. C. Lee, C. Grasso, et al. (2002) Multiple sequence alignment using partial order graphs, Bioinformatics, 18:452-64.


DETECTION OF GENOMIC VARIATION BY SELECTION OF A 9MB DNA REGION AND HIGH THROUGHPUT SEQUENCING SERGEY NIKOLAEV 1, CHRISTIAN ISELI 2, ANDREW SHARP 3, DANIEL ROBYR 3, JACQUES ROUGEMONT, CORINNE GEHRIG, LAURENT FARINELLI, STYLIANOS E. ANTONARAKIS

Detection of rare polymorphisms and causative mutations of genetic diseases in a targeted genomic area has become a major goal in understanding genomic and phenotypic variability. We have interrogated repeat-masked regions of 8.9Mb on human chromosomes 21 (7.6Mb) and 7 (1.1Mb) from an individual from the International HapMap Project (NA1278). We have optimized a method of genomic selection for high throughput sequencing. Microarray-based selection and sequencing resulted in 260-fold enrichment, with 41% of reads mapping to the target region. 83% of SNPs in the targeted region had at least 4-fold sequence coverage and 54% at least 15-fold. When assaying HapMap SNPs in NA12782, our sequence genotypes are 91.3% concordant in regions with coverage ≥4-fold, and 97.9% concordant in regions with coverage ≥15-fold. About 80% of the SNPs recovered with both thresholds are listed in dbSNP. We observed that regions with low sequence coverage occur in close proximity to low-complexity DNA. Validation experiments using Sanger sequencing were performed for 46 SNPs with 15-20-fold coverage, with a confirmation rate of 96%, suggesting that DNA selection provides an accurate and cost-effective method for identifying rare genomic variants.

1 University of Geneva, Switzerland, [email protected]
2 Ecole Polytechnique Fédérale de Lausanne
3 University of Geneva

ANALYSIS OF GENE REGULATION IN ESCHERICHIA COLI SWETLANA NIKOLAJEWA 1, MAIK FRIEDEL 2, REINHARD GUTHKE 3

Keywords: gene regulation, gene expression, pattern recognition, E. coli

The principles of transcriptional regulation in E. coli still remain unclear. It is known that most genes are not differentially expressed under different conditions. In order to find significant patterns and features important for gene regulation, we compare two sets of highly and lowly expressed genes of E. coli. The expression data were taken from the Many Microbe Microarrays Database (http://m3d.bu.edu), which provides uniformly normalized Affymetrix microarrays from 380 different experiments for all known E. coli genes (4298). From these data we extracted the 200 most highly expressed and the 200 most lowly expressed genes, using the mean expression over all experiments as reference. With the help of DiProGB (http://diprogb.fli-leibniz.de) we found remarkable differences between the two groups with respect to gene and operon structures. We also found that not only the promoter region but also the region downstream of a gene shows significant differences.
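The selection step described above (ranking genes by mean expression over all experiments and taking the extremes) can be sketched as follows. This is an illustration only: the function name, the gene names and the expression values are hypothetical, and the abstract does not state how ties or missing values were handled.

```python
from statistics import mean
from typing import Dict, List, Tuple


def split_by_mean_expression(
    expression: Dict[str, List[float]], n: int
) -> Tuple[List[str], List[str]]:
    """Return (n most highly expressed, n most lowly expressed) genes.

    Genes are ranked by their mean expression over all experiments.
    """
    if 2 * n > len(expression):
        raise ValueError("n is too large for disjoint high and low sets")
    ranked = sorted(expression, key=lambda gene: mean(expression[gene]))
    lowest = ranked[:n]            # ascending: lowest mean first
    highest = ranked[-n:][::-1]    # descending: highest mean first
    return highest, lowest


# Hypothetical toy data: five genes measured in three experiments.
toy_expression = {
    "geneA": [9.0, 9.5, 8.5],
    "geneB": [2.0, 2.5, 1.5],
    "geneC": [5.0, 5.0, 5.0],
    "geneD": [7.0, 6.0, 8.0],
    "geneE": [3.0, 3.5, 2.5],
}
high_set, low_set = split_by_mean_expression(toy_expression, n=2)
```

For the data set in the abstract the same call would use n=200 on the 4298 genes and 380 experiments.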

1 Hans-Knoell-Institute, HKI Jena, Germany, [email protected]
2 Fritz-Lipmann-Institute, FLI Jena, Germany, [email protected]
3 Hans-Knoell-Institute, HKI Jena, Germany, [email protected]

CHARGAFF'S SECOND PARITY RULE SWETLANA NIKOLAJEWA 1, REINHARD GUTHKE 1, MAIK FRIEDEL 2

1 Hans-Knöll-Institute (HKI), Leibniz Institute for Natural Product Research and Infection Biology e.V. Jena Germany, Beutenbergstr. 11a, Jena 07745 Germany, [email protected]
2 Leibniz Institute for Age Research - Fritz Lipmann Institute (FLI) Jena, Germany, maikfr@fli-leibniz.de

Chargaff's well-known first parity rule states that the number of As exactly equals the number of Ts, and the number of Gs equals the number of Cs, in any piece of double-stranded DNA [1]. Today it is well known that Chargaff's first parity rule is a consequence of the spatial organization of DNA into a double helix, discovered by Watson and Crick [2]. The rules of base pairing tell us, for example, that if we have the nucleotide sequence of one DNA strand, we can immediately deduce the complementary sequence of the other strand. In contrast, it is less widely known that the first rule also holds, approximately, for single DNA strands. This rule is called "Chargaff's second rule". After separating the DNA strands of Bacillus subtilis, Chargaff and colleagues discovered that %A ≈ %T and %G ≈ %C within single DNA strands [3]. Although this observation was made forty years ago, the second rule still has no generally accepted explanation [6]. The underlying mechanism or selective pressure that maintains the rule remains puzzling.

Validation of the second rule shows striking compliance across a large number of the available fully sequenced genomes. The rule holds for four of the five types of double-stranded DNA genomes: eukaryotic, bacterial, archaeal and viral genomes. The exceptions are some organellar genomes and all types of single-stranded DNA and RNA (viral) genomes. Further studies found that the second rule can be extended from mononucleotide to oligonucleotide composition: the counts of each oligonucleotide and of its reverse complement are nearly equal on the same strand, e.g. %AA ≈ %TT, %CA ≈ %TG, … [4,5].

In this work we propose a new constraint, based on the distribution of coding regions within prokaryotic genomes, which leads to Chargaff's second rule. We show that the composition and organization of genes in prokaryotes allow us not only to explain the second rule but also to understand symmetries in the frequencies of oligonucleotides and their reverse complements. This observation also sheds light on the exceptions to the rule. We conclude that, just as the first rule is a consequence of base pairing in the double helix, the second rule seems to be a consequence of the distribution of coding regions within genomes.
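The second rule and its oligonucleotide extension can be checked on any single-strand sequence by comparing the count of each k-mer with the count of its reverse complement. The sketch below is illustrative only: the function names and the toy sequence are hypothetical and not taken from the abstract, and a real test would use a complete genome strand.

```python
from collections import Counter
from typing import Dict, Tuple

# Translation table for complementing A/C/G/T; other symbols are left unchanged.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def kmer_counts(strand: str, k: int) -> Counter:
    """Count all overlapping k-mers on ONE strand."""
    strand = strand.upper()
    return Counter(strand[i:i + k] for i in range(len(strand) - k + 1))


def second_rule_pairs(strand: str, k: int) -> Dict[Tuple[str, str], Tuple[int, int]]:
    """Map each (k-mer, reverse complement) pair to its two single-strand counts.

    Self-complementary k-mers (e.g. AT, GC) are skipped because the
    comparison is trivially an equality for them.
    """
    counts = kmer_counts(strand, k)
    pairs: Dict[Tuple[str, str], Tuple[int, int]] = {}
    for kmer in sorted(counts):
        rc = reverse_complement(kmer)
        if kmer == rc:
            continue
        first, second = sorted((kmer, rc))
        pairs[(first, second)] = (counts[first], counts[second])
    return pairs


# Hypothetical toy strand; real genomes are needed for a meaningful test.
toy_strand = "ATGCAT"
mono_pairs = second_rule_pairs(toy_strand, k=1)
di_pairs = second_rule_pairs(toy_strand, k=2)
```

Under the second rule the two counts in each pair should be nearly equal for long genomic strands; large deviations would point to the exceptions mentioned above.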

1. E. Chargaff (1951) Structure and function of nucleic acids as cell constituents, Fed. Proc., 10:654–659.
2. J.D. Watson, F.H.C. Crick (1953) Genetical implications of the structure of deoxyribonucleic acid, Nature, 171:964-967.
3. R. Rudner, J.D. Karkas, E. Chargaff (1968) Separation of B. subtilis DNA into complementary strands, PNAS, 60:921-922.
4. V.V. Prabhu (1993) Symmetry observations in long nucleotide sequences, NAR, 21:2797-2800.
5. P.F. Baisnee et al. (2002) Why are complementary DNA strands symmetric? Bioinformatics, 18:1021-1033.
6. G. Albrecht-Buehler (2006) Asymptotically increasing compliance of genomes with Chargaff's second parity rules through inversion and inverted transpositions, PNAS, 103:17828-17833.


PREDICTION OF REGULATORY ELEMENTS IN DROSOPHILA GENOMES USING HIDDEN MARKOV MODEL BASED ON THE ARRANGEMENT OF TRANSCRIPTION FACTOR BINDING SITES ANNA NIKULOVA 1, ANDREY MIRONOV 1

1 Department of Bioengineering and Bioinformatics, M.V. Lomonosov Moscow State University. GSP-2, building 73, Leninskiye Gory, Moscow, 119992, Russian Federation, [email protected]

The identification and analysis of transcriptional regulatory elements is crucial for understanding the transcriptional control of development and many other biological processes. In higher eukaryotes, transcription factor binding sites (TFBSs) tend to be rather short (5-15 bp) and degenerate, and are often spread over extensive non-coding regions. For more reliable prediction it is therefore necessary to employ other criteria in addition to site sequence.

TFBSs are known to cluster along the DNA strand. This tendency can be explained by the need for sites to lie close together within the particular regions that are open for protein access. In some cases, however, close localization may be crucial for interaction between transcription factors. There is much evidence that in higher eukaryotes binding sites can form so-called composite elements: groups of sites characterized by a specific arrangement. Transcription factors interact with each other to form a regulatory complex, which imposes constraints on site arrangement. Identical composite elements can have similar functions in the regulatory regions of different genes. We call the set of such constraints, or rules of site arrangement, the structure of a cluster (regulatory region).

On the other hand, we can use the comparative genomics approach, which can be a powerful instrument for predicting real functional elements. Since the level of sequence conservation in non-coding regions is rather low, an approach based on conservation of cluster structure, rather than of sequence, seems more reasonable [1].

Here, we have tried to construct a Hidden Markov Model (HMM) of the transcriptional regulatory region that takes into account the structure of the regulatory region and is able to identify co-regulated genes. Emissions in the HMM are nucleotide sequences. There are three general types of sequence: background sequence, sites (with some known models) and sequences between sites in a cluster (spacers). As some transcription factors interact with each other, the distance between their TFBSs (in our model, the length of the spacer) should satisfy some rules. In the current work we used only two simple rules for the distance between two adjacent sites in a cluster: 1) a geometric distribution, which only favors sites being closer to each other; 2) a distance distribution that looks like peaks decreasing with a period of 10 nucleotides (corresponding to the situation when the interacting proteins are on the same side of the DNA strand). Thus, the structure of a regulatory region is described by a set of parameters for every pair of site types: the frequency of the pair in the regulatory region and the frequencies of the possible distance distributions for this site pair. We can train these parameters on known regulatory regions and then search through an entire genome, or several genomes, for regulatory regions with similar structure. In this way we can find genes regulated similarly to the gene we trained on.

The comparative approach was applied at two stages of this work. First, we trained the HMM on the upstream regions of homologous genes from different genomes, as we believe that closely related genomes do not vary much in the structure of their regulatory regions. Second, after the genome-wide search for regulatory regions, we considered only those putative regulatory regions that were supported by clusters found in homologous DNA fragments of some other genomes.

In the current work we used the well-studied system of early developmental genes in Drosophila to test our model. We performed a genome-wide search for regulatory regions that can respond to some combination of the seven transcription factors of this system. Genes close to these putative regions were examined using their GO and other annotations as well as their expression profiles.

The work was supported by grants RFBR [09-04-92742], Howard Hughes Medical Institute [55005610] and by the program of the Russian Academy of Sciences ('Molecular and Cellular Biology'). The authors thank A.V.Favorov for useful ideas and discussions.
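The two spacer-length rules named in this abstract (a geometric distribution, and peaks that decrease with a period of 10 nucleotides) can be sketched as discrete distributions over spacer length. The abstract does not give the functional form of the periodic rule, so the Gaussian peak shape, the decay factor, the peak width, the phase (peaks at multiples of the period starting from zero) and the function names below are all assumptions made for illustration.

```python
import math
from typing import List


def geometric_spacer(max_len: int, p: float) -> List[float]:
    """Truncated geometric distribution over spacer lengths 0..max_len.

    Shorter spacers are always more probable, so sites are pulled together.
    """
    weights = [(1.0 - p) ** d * p for d in range(max_len + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def periodic_spacer(
    max_len: int,
    period: int = 10,    # one peak per `period` nucleotides
    decay: float = 0.5,  # assumed height ratio of successive peaks
    width: float = 1.0,  # assumed peak width (standard deviation, nt)
) -> List[float]:
    """Peaks at multiples of `period` whose heights decrease with distance."""
    n_peaks = max_len // period + 2
    weights = []
    for d in range(max_len + 1):
        w = 0.0
        for k in range(n_peaks):
            center = k * period
            w += decay ** k * math.exp(-0.5 * ((d - center) / width) ** 2)
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]


geo = geometric_spacer(max_len=40, p=0.1)
per = periodic_spacer(max_len=40)
```

In an HMM of the kind described, such a distribution would govern how long the model stays in a spacer state between two adjacent sites; the per-pair parameters mentioned in the abstract would select which of the two distributions applies.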

1. O.Hallikas, K.Palin, N.Sinjushina, R.Rautiainen, J.Partanen, E.Ukkonen, J.Taipale (2006) Cell, 124: 47-59.


PROBE-LEVEL ANNOTATION DATABASE FOR AFFYMETRIX EXPRESSION MICROARRAYS RAMIL N. NURTDINOV 1, MIKHAIL O. VASILIEV 2, ANNA S. ERSHOVA 3, ILIA S. LOSSEV 4, ANNA S. KARYAGINA 5

1 Department of Bioengineering and Bioinformatics, M.V. Lomonosov Moscow State University, Vorobyevy Gory 1-73, Moscow, 119992, Russia, [email protected]
2 Moscow Institute of Physics and Technology, Dolgoprudny, Institutsky Per. 9, Moscow Region 141700, Russia, Gamaleya Institute of Epidemiology and Microbiology, Russian Academy of Medical Sciences, Gamaleya Str., 18, Moscow 123098, Russia, [email protected]
3 Gamaleya Institute of Epidemiology and Microbiology, Russian Academy of Medical Sciences, Gamaleya Str., 18, Moscow 123098, Russia; A.N. Belozersky Institute of chemical and physical biology, M.V. Lomonosov Moscow State University, Vorobyevy Gory 1-73, Moscow, 119992, Russia, [email protected]
4 Parascript Llc, Winchester Circle Suite 200, Boulder 6899, USA, [email protected]
5 Gamaleya Institute of Epidemiology and Microbiology, Russian Academy of Medical Sciences, Gamaleya Str., 18, Moscow 123098, Russia; A.N. Belozersky Institute of chemical and physical biology, M.V. Lomonosov Moscow State University, Vorobyevy Gory 1-73, Moscow, 119992, Russia; Institute of Agricultural Biotechnology, Russian Academy of Agricultural Sciences, Timiryazevskaya Str., 42, Moscow 127550, Russia, [email protected]

Standard Affymetrix technology evaluates gene expression by measuring the intensity of mRNA hybridization to a panel of 25-mer oligonucleotide probes (probe sets) designed from the gene sequence, and summarizing the probe signal intensities by a robust averaging method. In many cases, however, a probe's signal intensity does not correlate with gene expression because of hybridization of the probe to a transcript of another gene, mapping of the probe to an intron, alternative splicing, SNPs or mutations. We have aligned probe sequences from four Affymetrix arrays (HG-U133A, HG-U133B, Human Exon 1.0 and Human Gene 1.0) to the human genome and identified those falling into transcript-coding regions. The database is available at http://affymetrix2.bioinf.fbb.msu.ru/ .

Files containing information about probes, their sequences and microchip positions were downloaded from the Affymetrix site. The hg18 version of the human genome assembly was downloaded from the UCSC ftp site. Probe sequences were mapped to the genome using the Blat program. We allowed alignments with no more than two mismatches and required introns of 40 or more nucleotides for exon-junction probes. The corresponding hits were stored and subjected to further analysis. We assigned a probe to a particular gene if the probe alignment intersected the gene annotation on the correct strand. We also took into account possible mistakes in gene annotation by extending the 3' end of each gene by 1000 nucleotides. In the case of overlapping genes, a probe was assigned to all genes that matched the above conditions.

For expression analysis we used information about spliced alignments of human ESTs and mRNAs to the human genome provided by the UCSC genome browser. The UCSC genome browser does not supply EST and mRNA alignments with information about the genes they belong to. A considerable number of alignments cover several closely located genes, alignments frequently do not intersect the reference annotation of exons, and finally some alignments represent different types of noncoding RNA. We used the RefSeq genome annotation and UniGene (build 219) to assign sequences to genes.

Table 1 contains information about the probes, probe alignments and annotations for all four arrays. From 45% to 68% of probes align perfectly and unambiguously to gene regions, and from 19% to 60% fall into exons that are mainly included in the final mRNA.

Table 1. Database summary.

                                        HG-U133A    HG-U133B      HuGene        HuEx
Data
  Probesets                                22215       22645       33843     1432033
  Probes                                  246799      249502      862560     5431924
Alignment
  No mismatch                             400719      426769     1498676    11171290
  1 mismatch                              380005      337872      594102     6266943
  2 mismatch                              231238      231750      350277     3738059
  Total                                  1011962      996391     2443055    21176292
Annotation of unique perfect hits
  Exon                                    147722       78200      482503     1026852
  Exon/intron                               3916        4924       17055       59473
  Intron                                   16618       44499       80814     1331594
  Total probes                            168256      127623      580372     2417919
  Total genes                              12783        9351       21254       27996

Use of the database can help in interpreting results obtained with the standard Affymetrix protocol, and can also promote nonconventional probe-level use of Affymetrix data, for example for studying alternative splicing and genome rearrangements in various pathologies.
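The probe-to-gene assignment rule described in this abstract (overlap with the gene annotation on the correct strand, with the 3' end of each gene extended by 1000 nucleotides, and assignment to every overlapping gene) can be sketched as below. The data structure, the half-open coordinate convention, the function names and the toy intervals are assumptions made for illustration. The sketch also assumes that "correct strand" means the probe alignment and the gene are recorded on the same strand.

```python
from dataclasses import dataclass
from typing import List, Tuple

THREE_PRIME_EXTENSION = 1000  # nucleotides, as stated in the abstract


@dataclass(frozen=True)
class Interval:
    name: str
    chrom: str
    start: int   # 0-based, inclusive (assumed convention)
    end: int     # exclusive
    strand: str  # "+" or "-"


def extended_span(gene: Interval) -> Tuple[int, int]:
    """Extend the gene at its 3' end, which depends on the strand."""
    if gene.strand == "+":
        return gene.start, gene.end + THREE_PRIME_EXTENSION
    return max(0, gene.start - THREE_PRIME_EXTENSION), gene.end


def assign_probe(probe: Interval, genes: List[Interval]) -> List[str]:
    """Return the names of all genes the probe alignment is assigned to."""
    hits = []
    for gene in genes:
        if gene.chrom != probe.chrom or gene.strand != probe.strand:
            continue
        g_start, g_end = extended_span(gene)
        if probe.start < g_end and g_start < probe.end:  # intervals overlap
            hits.append(gene.name)
    return hits


# Hypothetical annotation: two overlapping plus-strand genes, one minus-strand gene.
genes = [
    Interval("geneX", "chr1", 1000, 2000, "+"),
    Interval("geneY", "chr1", 1800, 3500, "+"),
    Interval("geneZ", "chr1", 5000, 6000, "-"),
]
probe_downstream = Interval("p1", "chr1", 2100, 2125, "+")
probe_minus = Interval("p2", "chr1", 4500, 4525, "-")
probe_outside = Interval("p3", "chr1", 4500, 4525, "+")
```

Here `assign_probe(probe_downstream, genes)` reports both plus-strand genes, because the probe lies inside geneY and within the 1000-nucleotide 3' extension of geneX.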

STRUCTURAL FEATURES OF β-TUBULIN SPECIFIC INTERACTION WITH BENZIMIDAZOLE COMPOUNDS YU. NYPORKO 1, YA. B. BLUME 2

1 Institute of Food Biotechnology and Genomics of National Academy of Science of Ukraine
2 Osipovskogo str., 2a, Kiev, 04123, Ukraine, [email protected]

Among the specific tubulin effectors that prevent microtubule polymerization, benzimidazole compounds have the widest spectrum of action. Owing to their high affinity for tubulins of very different origin they are used as fungicides, herbicides and anthelmintic drugs. In addition, antiprotozoan and anticancer activity was recently shown for them [1]. Rational in silico design of new improved drugs based on particular chemical compounds requires exact information on the location of their binding site in the structure of the target protein(s). However, despite long-term investigations and existing models of interaction [2], the exact localization of the benzimidazole binding site on tubulin is still an open question [3].

To predict the possible benzimidazole binding site accurately we used a combination of several approaches: analysis of the spatial distribution of tubulin mutations conferring benzimidazole resistance; analysis of the structural peculiarities of benzimidazoles that are most probably responsible for site recognition; reconstruction of the spatial structure of "β-tubulin-benzimidazole" complexes; and estimation of their stability via long-time molecular dynamics calculations.

Mutations causing resistance to benzimidazole compounds have been characterized in fungal and animal tubulins [4]. Nine mutable positions, which can be divided into two groups, have been described. Substitutions in positions of the first group (165, 167, 198, 200) alter sensitivity/resistance to benzimidazoles in several species. The second group comprises unique substitutions in positions 237, 240, 241, 250 and 288. The spatial localization of the two mutation groups differs: the multispecies replacements form a compact group buried inside the β-tubulin molecule, whereas the unique ones are exposed on the surface near the intradimer contact area. Benzimidazole resistance conferred by the multispecies replacements is accompanied by changes of sensitivity/resistance to other antimicrotubular compounds, the phenylcarbamates, and can be cross-positive (replacements in positions 167 and 200) as well as cross-negative (positions 165 and 198). Thus it can be supposed that both classes of compounds share the same binding site and a common structural part. This part (carbamic acid methyl ester) has to interact with the amino acids in positions 167 and 200, whereas the differing components of the ligands interact with residues 165 and 198.

To verify our supposition, the spatial structures of fungal (Neurospora crassa and Mycosphaerella graminicola) and animal (Haemonchus contortus and Homo sapiens) β-tubulins were reconstructed as described earlier [5]. Their complexes with albendazole (typically used as an anthelmintic drug) and carbendazim (antifungal and anthelmintic treatment) were built by inserting the ligands into the spatial cavity between the mutated residues. Spatial structure optimization and 30 ns molecular dynamics of the investigated complexes, as well as of the free ligands in a water environment, were calculated using the mdrun module of the GROMACS software. The stability of the resulting complexes was estimated from conformational energy dynamics (using the g_energy module) and levels of molecular oscillations (using the g_rms module).

According to the data obtained, the benzimidazole binding site on tubulin contains the amino acids in positions 152, 156, 163, 164, 165, 167, 195, 196, 197, 198, 200, 236, 250, 251 and 253. The ligands are buried quite deeply in tubulin, which is typical for benzimidazole derivatives with affinity for other proteins [6]. The conformational energy of both ligands in the binding site is consistently lower than in the free state. The energy decrease ranges from 48.6 kJ/mol for the complex "M. graminicola β-tubulin-albendazole" to 88.5 kJ/mol for the complex "N. crassa-carbendazim". Notably, the energy of albendazole in the binding site proposed by Robinson and coworkers [2] is substantially higher than in our site (by 45 kJ/mol). Molecular oscillations of the bound benzimidazoles are also reduced: the average oscillation level for albendazole decreases from 2.5 Å to 0.7 Å. For carbendazim this value practically does not change, but the amplitudes of oscillation for both ligands diminish approximately 10-fold. Thus, the described site can be considered the most favorable for interaction between β-tubulin and benzimidazole molecules.

This work is supported by cooperative STCU-NAS grant 4929 “Application of grid-resource for total analysis of beta-tubulin spatial structure features causing a different sensitivity to benzimidazoles”.

1. M. Yenjerla et al. (2009) Carbendazim inhibits cancer cell proliferation by suppressing microtubule dynamics, J Pharmacol Exp Ther. 328:390-398.
2. M.W. Robinson et al. (2004) A possible model of benzimidazole binding to beta-tubulin disclosed by invoking an inter-domain movement, J Mol Graph Model.
3. M.J. Clément et al. (2008) Benomyl and colchicine synergistically inhibit cell proliferation and mitosis: evidence of distinct binding sites for these agents in tubulin, Biochemistry, 47:13016-13025.
4. A. Yu. Nyporko, Y. B. Blume (2008) Spatial distribution of tubulin mutations conferring resistance to antimicrotubular compounds. In: The Plant Cytoskeleton: a Key Tool for Agro-Biotechnology, Blume Y. B., et al. (Eds.), 397-417 (Springer).
5. A. Yu. Nyporko, Ya. B. Blume (2001) Comparative analysis of secondary structure of tubulins and FtsZ proteins, Biopolym. and Cell. 17:61-69.
6. N. Foloppe et al. (2006) Identification of a buried pocket for potent and selective inhibition of Chk1: prediction and verification, Bioorg. Med. Chem. 14:1792-1804.


NEISSERIA GONORRHOEAE OUTER MEMBRANE PROTEIN TRANSLOCATION DISORDER: COMBINED BIOINFORMATIC AND EXPERIMENTAL ANALYSIS. NINA OPARINA 1, ELENA ILINA 1, MAYA MALAKHOVA 1, ALEXANDRA BOROVSKAYA 1, IRINA DEMINA 1, MARINA SEREBRYAKOVA 1, MARIA ROGOVA 1, VADIM GOVORUN 1

Keywords: outer membrane proteins, pathogens, drug-resistance

The biogenesis of the outer membrane of Gram-negative bacteria is still the object of numerous studies. The complex structure of the outer membrane is characterized by the presence of phospholipids, lipopolysaccharides, lipoproteins and integral outer membrane proteins (OMPs). All these components are synthesized in the cytoplasm and then translocated through the inner membrane and the periplasm. The two main classes of outer membrane protein components are lipoproteins, anchored in the outer membrane by an N-terminal lipid tail, and OMPs containing membrane-spanning helices. Despite decades of experiments, the mechanism regulating OMP translocation into the outer membrane is still under investigation. We are interested in the following paradox: the overall conservation of outer membrane biogenesis (characterized by several features common to bacteria and plastids) versus the fast evolution of OMPs such as porins. We proposed that the ability of several bacteria, including human pathogens, to produce different variants of the outer membrane with aberrant features could give us additional information about the regulation of membrane biogenesis, mainly concerning the OMP translocation pathway.

The well-described human pathogen Neisseria gonorrhoeae was studied. Large-scale genetic analysis of a collection of clinical isolates allowed us to select an epidemiologically active strain with reduced susceptibility to penicillin (PEN) and without known aberrations in its genomic DNA. Several tests showed that in this case the non-specific mechanism of drug resistance could be explained by lower membrane permeability, demonstrated for tetracycline and penicillin import. We studied the proteome of this strain in comparison with a susceptible control and, surprisingly, detected a high concentration of OMPs (mostly the PorB protein) in the cytoplasmic protein fraction. We therefore proposed that the OMP translocation machinery is impaired in the studied strain. A subset of genes putatively involved in OMP translocation and processing (including the Sec system) was selected using a comparative genomics approach. Both sequencing and transcriptomic studies showed that none of these genes was down-regulated or mutated. Moreover, compensatory overexpression was shown for many of these genes. These data suggest that a previously undescribed regulatory mechanism is responsible for this feedback regulation of the OMP translocation pathway. In our work we present the results of genomic studies that shed light on such candidate regulators.

1 Scientific Research Institute of Physical-Chemical Medicine, Moscow, Russian Federation, [email protected]


SEARCH FOR CpG-ISLANDS: COMPARISON OF MODERN APPROACHES NINA OPARINA 1, MARINA FRIDMAN 2

Keywords: CpG islands, CpG clusters

In mammalian genomes, CpG methylation followed by frequent C->T mutation leaves this hypermutable dinucleotide strongly underrepresented. Nevertheless, in genomic regions called CpG-islands, CG motifs are clustered together. For years CpG-islands were mapped mostly using the following parameters: length >200 bp and observed/expected frequency of CpGs >0.6 (the classic Gardiner-Garden and Frommer algorithm, 1987). During the last decade several new approaches were published: the Takai and Jones modification of the classical algorithm (2002), CpGProD (2002), and statistical algorithms (CpGcluster, 2006 and CG cluster, 2007). How can we decide which approach is "better"? Traditionally the following features were taken into account: intersection of CpG-islands with genes, mostly with their TSS regions; intersection of CpG-islands with repetitive elements such as Alu and L1 repeats in the human genome; and the length and total number of CpG-islands. The intuitive view of CpG-islands as "gene marks" is responsible for most of these features. For example, the larger the fraction of mapped CpG-islands that intersect exons and promoters rather than SINEs and LINEs, the "better" the CpG-island prediction algorithm. Here we present the results of a comparative analysis of CpG-island distribution in a large set of vertebrate genomes. We reviewed the features of several CpG-island detection algorithms and their usefulness for comparative genomics of CpG-islands. The following features were included in our comparative analysis: intersection of CpG-islands with TSSs, exons, introns and repetitive elements; sequence conservation of CpG-islands; and coincidence of CpG-islands with orthologous genes in a set of vertebrate genomes.
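The two classical parameters quoted above (length >200 bp, observed/expected CpG >0.6) can be illustrated with a minimal Python sketch. The function names are hypothetical, the observed/expected ratio uses the usual definition (number of CpG × length) / (number of C × number of G), and only the two criteria named in the abstract are checked. It is not a reimplementation of any of the compared tools.

```python
def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: (#CpG * N) / (#C * #G)."""
    seq = seq.upper()
    n = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    if c_count == 0 or g_count == 0:
        return 0.0  # ratio undefined without both C and G; treat as no enrichment
    return cpg_count * n / (c_count * g_count)


def passes_classic_criteria(seq: str, min_length: int = 200,
                            min_ratio: float = 0.6) -> bool:
    """Check only the two thresholds named in the text for one candidate region."""
    return len(seq) > min_length and cpg_obs_exp(seq) > min_ratio


if __name__ == "__main__":
    candidate = "CG" * 101             # toy 202 bp region, maximally CpG-rich
    depleted = "G" * 101 + "C" * 101   # same composition, no CpG at all
    print(passes_classic_criteria(candidate), passes_classic_criteria(depleted))
```

In practice such a check is applied to sliding windows along a chromosome; the window and step sizes differ between algorithms and are not specified here.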

1 Engelhardt Institute of Molecular Biology, Moscow, Russian Federation, [email protected] 2 Institute of genetics and selection of industrial microorganisms, GosNIIgenetika, Moscow, Russian Federation, [email protected]

SOME LIKE IT SWEET: TOWARDS GENOMIC ENCYCLOPEDIA OF SUGAR CATABOLISM IN BACTERIA. ANDREI OSTERMAN 1, DMITRY RODIONOV 2

Carbohydrates provide the key source of carbon and energy for many heterotrophic bacteria. The diversity of carbohydrates (poly-, oligo- and monosaccharides) in various ecosystems is mirrored by substantial variations in sugar utilization machinery (SUM), even between phylogenetically close microorganisms. Developing a capability to confidently reconstruct SUM from genomic sequence would improve our understanding of microbial evolution, adaptation and ecophysiology in a variety of natural habitats, from soil to the human gut, and strongly impact biotechnological and biomedical applications. However, at present the bulk of our knowledge about sugar catabolic pathways is limited to a handful of model bacteria. An accurate projection and expansion of this knowledge across the growing variety of bacteria with completely sequenced genomes constitutes a substantial challenge. This is largely due to the aforementioned variations, which include numerous cases of non-orthologous gene replacements, families of paralogous enzymes with varying substrate specificity, and alternative and presently unknown biochemical routes. As a result, functional annotations of the respective genes are often incorrect or imprecise. A subsystems-based approach to genome analysis allows us to substantially improve the accuracy of genomic annotations, predict functions of previously unknown gene families and infer novel pathways. This approach combines parallel metabolic reconstruction of a pathway (or rather a group of related pathways) in multiple integrated genomes with genome context analysis, most importantly the analysis of conserved operons and regulons, revealing functional coupling of known and unknown genes.
The predictive power of this approach may be illustrated by the results of our recent studies aimed at reconstruction of SUM in two divergent groups of environmental bacteria represented by (i) Shewanella oneidensis and (ii) Thermotoga maritima. A comparative analysis of 15 completely sequenced genomes from the group of Shewanella, aquatic gamma-Proteobacteria with potential applications in bioremediation, allowed us to reconstruct pathways for utilization of 17 distinct sugars (mono- or disaccharides). The observed mosaic distribution of these pathways across the collection of Shewanella spp. illustrates the genetic plasticity of SUM, which apparently plays a key role in adaptation of bacteria to variable environmental conditions. Remarkably, a majority of these pathways contained previously uncharacterized components (enzymes, transporters, regulators) predicted by a combination of bioinformatic techniques. Some of these predictions were confirmed by targeted biochemical and genetic experiments. Growth studies performed using a panel of representative sugar substrates revealed remarkable consistency between predicted and observed growth phenotypes. In another genomic survey we focused on context-based prediction and experimental assessment of the substrate specificity of 16 sugar kinases involved in the extensive SUM of Thermotoga maritima, a deep-sea hyperthermophilic bacterium with biotechnological potential for biohydrogen production.
This analysis revealed a strong correlation between the physiological roles of these enzymes (based on genome context and metabolic reconstruction) and their in vitro substrate preferences (assessed versus a panel of ~40 different sugars). The results of both studies substantially contributed to building a comprehensive genomic encyclopedia of sugar catabolism. The established combined approach is scalable, and its systematic application to other groups of bacteria will allow us to rapidly accumulate sufficient knowledge for accurate automated SUM reconstruction from any genomic and metagenomic data.

1 Burnham Institute for Medical Research, 10901 North Torrey Pines Road, La Jolla, CA USA 92037; Fellowship for Interpretation of Genomes, Burr Ridge, IL USA 60527, United States, [email protected] 2 Burnham Institute for Medical Research, 10901 North Torrey Pines Road, La Jolla, CA USA 92037; Institute for Information Transmission Problems (the Kharkevich Institute), RAS, 127994 Moscow, Russia, United States


COMPENSATORY EVOLUTION IN RESPONSE TO A NOVEL RNA POLYMERASE: ELECTROSTATIC PROPERTIES OF PROMOTERS MAY LEAD THE ADAPTATION ALEXANDER OSYPOV 1, SVETLANA KAMZOLOVA 1, ANATOLY SOROKIN 2

It is known that not only the consensus sequence text is essential for RNA polymerase-promoter recognition, but that some additional information can be encoded in the physical properties of DNA. In particular, electrostatic interactions between promoter DNA and RNA polymerase are of considerable importance in regulating promoter function [1-4]. Here we report an analysis of the electrostatic properties of the promoters described in [5]. In that work, the RNA polymerase (RNAP) gene of the obligate lytic bacteriophage T7 was replaced with that of the related phage T3, and the genome was thus forced to evolve a new system of regulation [5]. T3 RNAP was supplied in trans by the bacterial host to a T7 genome lacking its own RNAP gene, and the phage population was continually propagated on naive bacteria throughout the adaptation. Evolution of the T3 RNAP gene was thereby prevented, and selection was for the evolution of regulatory signals throughout the phage genome. T3 RNAP transcribes from T7 promoters only at low levels, but a single mutation in the promoter confers high expression, providing a ready mechanism for reevolution of gene expression in this system. On selection for rapid growth, the fitness of the engineered phage evolved close to that of the native phage. More than 30 mutations were observed in the evolved genome, but changes were found in only 9 of the 16 promoters, and several coding changes occurred in genes with no known contacts with the RNAP. Surprisingly, only 7 of 13 mutations in promoters were toward the T3 promoter consensus sequence, and one was even from consensus to non-consensus. In the -11 position, known to be conserved in T3 and considered important for distinguishing T7 and T3 promoters, only one of 6 mutations led to the consensus. So the principles of promoter discrimination based on sequence text seem to fail to explain the observed situation in all its details.
We found that the main differences between the electrostatic profiles of T7 promoters, on the one hand, and those of T3 and mutated T7 promoters, on the other, lie in the start point region. At positions -2 to -5 b.p. the electrostatic potential of T7 promoters is considerably lower than that of T3 and mutant promoters. Mutated promoters demonstrated the largest potential shift, though non-mutated promoters also showed some potential gain due to mutations in flanking regions [6]. Although fitness is an integral indicator of promoter activity strength, we can still assume that the differential recognition of promoters by T7 and T3 RNA polymerases can be driven at least in part by their electrostatic properties.

1 Institute of Cell Biophysics of RAS, Russian Federation, [email protected] 2 The University of Edinburgh, United Kingdom

1. R.V.Polozov, T.R.Dzhelyadin, A.A.Sorokin, N.N.Ivanova, V.S.Sivozhelezov, S.G.Kamzolova (1999) Electrostatic potentials of DNA. Comparative analysis of promoter and nonpromoter nucleotide sequences, J. Biomol. Struct. Dyn., 16(6):1135-1143.
2. S.G.Kamzolova, A.A.Sorokin, T.R.Dzhelyadin, P.M.Beskaravainy, A.A.Osypov (2005) Electrostatic potentials of E.coli genome DNA, J. Biomol. Struct. Dyn., 23(3):341-346.
3. S.G.Kamzolova, V.S.Sivozhelezov, A.A.Sorokin, T.R.Dzhelyadin, N.N.Ivanova, R.V.Polozov (2000) RNA polymerase-promoter recognition. Specific features of electrostatic potential of "early" T4 phage DNA promoters, J. Biomol. Struct. Dyn., 18(3):325-334.
4. A.A.Sorokin, A.A.Osypov, T.R.Dzhelyadin, P.M.Beskaravainy, S.G.Kamzolova (2006) Electrostatic properties of promoter recognized by E. coli RNA polymerase Esigma70, J. Bioinform. Comput. Biol., 4(2):455-467.
5. J.J.Bull, R.Springman, I.J.Molineux (2007) Compensatory Evolution in Response to a Novel RNA Polymerase: Orthologous Replacement of a Central Network Gene, Molecular Biology and Evolution, 24(4):900-908.
6. A.A.Osypov (2009) Electrostatic properties of genome DNA, PhD thesis, Moscow.


DEPPDB – THE DNA ELECTROSTATIC POTENTIAL DATABASE. ELECTROSTATIC PROPERTIES OF NATURAL GENOMES ALEXANDER OSYPOV 1, SVETLANA KAMZOLOVA 1, ANATOLY SOROKIN 2

A large and growing number of genomes have been sequenced. Their biochemical and genetic characterization lags far behind, which makes automatic annotation necessary. Many tools based on sequence text analysis have been developed to predict key properties of genomes such as ORFs, promoters and other regulatory elements. Despite this wealth of information about sequence structure, it is difficult from the sequence data alone to pinpoint some critical elements, such as potential promoter sequences in genome DNA, or to predict their functional characteristics. A large set of promoter search algorithms based on text analysis has failed to correctly predict promoter sites in genomes. It was shown that some additional information for promoter recognition can be encoded in the physical properties of DNA. In particular, electrostatic interactions between promoter DNA and RNA polymerase are of considerable importance in regulating promoter function. Even more important is the discovered ambiguity in the dependence of the electrostatic potential profile on the sequence: this property depends strongly on the whole sequence with its flanking regions rather than on the sequence text at the given point of consideration [1-5]. Given that, we developed DEPPDB, the DNA Electrostatic Potential Database, to hold and provide all the available information about the electrostatic properties of genomes together with comprehensive annotation of their sequences. The database is available for academic use at http://promodel.icb.psn.ru and currently consists of 1533 bacterial and plasmid genomes and 2733 viral genomes with annotations, taken mainly from the NCBI RefSeq DB [5]. Using DEPPDB we revealed that natural genomes demonstrate a close to linear dependence of the mean electrostatic potential on GC content, with linear coefficients similar for different taxonomic groups. This dependence is considerably more evident in the AT-rich region and less so in the GC-rich region.
Analysis of the dependence of potential formation on GC content reveals the greater ability of AT-rich regions, compared to GC-rich ones, to form different pronounced electrostatic patterns. The ambiguity of the dependence of electrostatic potential on sequence is illustrated by calculated sequences with A/T/G/C=1/1/1/1, which differ vastly in their electrostatic potential despite their identical composition. It is worth noting the non-symmetrical distribution of natural genomes, biased towards AT-richness, which can to some extent be explained by the comparative flexibility of AT-rich regions in forming electrostatic patterns [5]. Considering the common electrostatic properties of different genome elements, it is worth mentioning the higher electrostatic potential in promoter regions on average, which may enhance promoter DNA-RNA polymerase interactions, and the abrupt W-like pattern in terminators, possibly reflecting their palindromic nature.

1 Institute of Cell Biophysics of RAS, Russian Federation, [email protected] 2 The University of Edinburgh, United Kingdom
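The near-linear dependence of mean electrostatic potential on GC content described above could be checked along the following lines. This is only a sketch. The electrostatic potential itself must come from the DEPPDB calculations, which are not reproduced here, so the potential values below are made-up placeholders in arbitrary units, and the function names are hypothetical.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def linear_fit(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    # Two sequences with A/T/G/C = 1/1/1/1: identical composition, different order.
    # Per the abstract, such sequences can still differ in electrostatic potential.
    shuffled = "ATGC" * 5
    blocked = "AAAAATTTTTGGGGGCCCCC"
    print(gc_content(shuffled), gc_content(blocked))

    # Placeholder (synthetic) per-genome values, NOT data from DEPPDB.
    gc_values = [0.3, 0.4, 0.5, 0.6]
    mean_potential = [1.6, 1.8, 2.0, 2.2]
    print(linear_fit(gc_values, mean_potential))
```

Fitting each taxonomic group separately and comparing the resulting slopes would correspond to the comparison of linear coefficients mentioned in the abstract.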

1. R.V.Polozov, T.R.Dzhelyadin, A.A.Sorokin, N.N.Ivanova, V.S.Sivozhelezov, S.G.Kamzolova (1999) Electrostatic potentials of DNA. Comparative analysis of promoter and nonpromoter nucleotide sequences, J. Biomol. Struct. Dyn., 16(6):1135-1143.
2. S.G.Kamzolova, A.A.Sorokin, T.R.Dzhelyadin, P.M.Beskaravainy, A.A.Osypov (2005) Electrostatic potentials of E.coli genome DNA, J. Biomol. Struct. Dyn., 23(3):341-346.
3. S.G.Kamzolova, V.S.Sivozhelezov, A.A.Sorokin, T.R.Dzhelyadin, N.N.Ivanova, R.V.Polozov (2000) RNA polymerase-promoter recognition. Specific features of electrostatic potential of "early" T4 phage DNA promoters, J. Biomol. Struct. Dyn., 18(3):325-334.
4. A.A.Sorokin, A.A.Osypov, T.R.Dzhelyadin, P.M.Beskaravainy, S.G.Kamzolova (2006) Electrostatic properties of promoter recognized by E. coli RNA polymerase Esigma70, J. Bioinform. Comput. Biol., 4(2):455-467.
5. A.A.Osypov (2009) Electrostatic properties of genome DNA, PhD thesis, Moscow.


ELECTROSTATIC PROPERTIES OF T7-LIKE PHAGES PROMOTERS FOR HOST BACTERIAL AND NATIVE VIRAL RNA POLYMERASES ALEXANDER OSYPOV 1, SVETLANA KAMZOLOVA 1, ANATOLY SOROKIN 2

It is known that not only the consensus sequence text is essential for RNA polymerase-promoter recognition, but that some additional information can be encoded in the physical properties of DNA. In particular, electrostatic interactions between promoter DNA and RNA polymerase are of considerable importance in regulating promoter function [1-4]. In the genome organization of some T7-like phages there are two main genome regions and consequently two main promoter types. The early region is transcribed by the bacterial host RNA polymerase and possesses a tandem of two or three strong promoters. One of the main products of this region is the native viral RNA polymerase, which transcribes all of the following region. Bacterial polymerases are large proteins with several subunits and have a landing region of some 150 b.p., while viral polymerases are small and consist of one subunit with a landing region of some 20-30 b.p., which a priori suggests different characteristics of their interactions with promoter DNA. Using DEPPDB, the DNA Electrostatic Potential Properties Database, we compared the electrostatic properties of promoters for host bacterial and native viral RNA polymerases of some T7-like phages. Electrostatic potential profiles of the early region containing bacterial promoters exhibit associated strong elements of large amplitude spanning some hundreds of base pairs, while those of promoters for viral RNA polymerase show considerable, consensus-like similarity only over a region of some twenty to thirty b.p. [5]. Thus there is a considerable difference in scale between the electrostatic potential profile elements of “bacterial” and “viral” promoters, which closely reflects the physical properties of the two kinds of proteins interacting with them.
The similarity between the early region electrostatic profiles of T7, T3, phiA1122, phiYeO3-12, K1-5 and SP6 suggests the presence and activity of early host promoters in phiYeO3-12, for which only non-experimental evidence is given in the NCBI RefSeq Database, and in K1-5, SP6 and phiA1122, for which there are no annotations there [5].

1 Institute of Cell Biophysics of RAS, Russian Federation, [email protected] 2 The University of Edinburgh, United Kingdom

The authors are grateful to Saveljeva E. G. for technical support.

1. R.V.Polozov, T.R.Dzhelyadin, A.A.Sorokin, N.N.Ivanova, V.S.Sivozhelezov, S.G.Kamzolova (1999) Electrostatic potentials of DNA. Comparative analysis of promoter and nonpromoter nucleotide sequences, J. Biomol. Struct. Dyn., 16(6):1135-1143.
2. S.G.Kamzolova, A.A.Sorokin, T.R.Dzhelyadin, P.M.Beskaravainy, A.A.Osypov (2005) Electrostatic potentials of E.coli genome DNA, J. Biomol. Struct. Dyn., 23(3):341-346.
3. S.G.Kamzolova, V.S.Sivozhelezov, A.A.Sorokin, T.R.Dzhelyadin, N.N.Ivanova, R.V.Polozov (2000) RNA polymerase-promoter recognition. Specific features of electrostatic potential of "early" T4 phage DNA promoters, J. Biomol. Struct. Dyn., 18(3):325-334.
4. A.A.Sorokin, A.A.Osypov, T.R.Dzhelyadin, P.M.Beskaravainy, S.G.Kamzolova (2006) Electrostatic properties of promoter recognized by E. coli RNA polymerase Esigma70, J. Bioinform. Comput. Biol., 4(2):455-467.
5. A.A.Osypov (2009) Electrostatic properties of genome DNA, PhD thesis, Moscow.


RNA POLYMERASE-DNA INTERACTIONS: ARE THEY DRIVEN BY ELECTROSTATICS? ALEXANDER OSYPOV 1, SVETLANA KAMZOLOVA 1, ANATOLY SOROKIN 2

Physical properties of DNA are known to be essential for RNA polymerase-promoter recognition. In particular, electrostatic interactions between promoter DNA and RNA polymerase are of considerable importance in regulating promoter function [1-5]. Here we report our analysis of the electrostatic properties of the phage lambda genome DNA and its interactions with RNA polymerase, based on [6]. Using total internal reflection fluorescence microscopy, the authors of [6] directly observed individual interactions of single RNA polymerase molecules with a single molecule of phage DNA suspended in solution by optical traps. The interactions of RNA polymerase molecules were not homogeneous along the DNA. They dissociated slowly from the positions of the promoters and sequences common to promoters, at a rate that was more than several-fold smaller than the rate at other positions. RNA polymerase molecules on the fast dissociation sites underwent linear diffusion (sliding) along the DNA. The binding to the slow dissociation sites was greatly enhanced when the DNA was released to a relaxed state, suggesting that the binding depended on the strain exerted on the DNA [6]. Nevertheless, the binding frequency of RNA polymerase molecules to the stretched (and thus linear) phage DNA is not homogeneous along the DNA, which calls for additional research into the possible underlying physical mechanisms. Using DEPPDB, the DNA Electrostatic Potential Properties Database [5], we studied the electrostatic properties of the lambda phage genome DNA. Our observations of its electrostatic potential profile revealed a non-homogeneous distribution of the potential along the DNA molecule which strongly resembles the binding profile obtained in the imaging experiment with stretched DNA. Regions with the highest potential tend to bind RNA polymerase more frequently, and RNA polymerase rests there for a longer time [5].
These data directly illustrate the possible role of the electrostatic properties of the DNA molecule in the interactions of RNA polymerase with genome DNA.

1 Institute of Cell Biophysics of RAS, Russian Federation, [email protected] 2 The University of Edinburgh, United Kingdom

The authors are grateful to Saveljeva E. G. for technical support.

1. R.V.Polozov, T.R.Dzhelyadin, A.A.Sorokin, N.N.Ivanova, V.S.Sivozhelezov, S.G.Kamzolova (1999) Electrostatic potentials of DNA. Comparative analysis of promoter and nonpromoter nucleotide sequences, J. Biomol. Struct. Dyn., 16(6):1135-1143.
2. S.G.Kamzolova, A.A.Sorokin, T.R.Dzhelyadin, P.M.Beskaravainy, A.A.Osypov (2005) Electrostatic potentials of E.coli genome DNA, J. Biomol. Struct. Dyn., 23(3):341-346.
3. S.G.Kamzolova, V.S.Sivozhelezov, A.A.Sorokin, T.R.Dzhelyadin, N.N.Ivanova, R.V.Polozov (2000) RNA polymerase-promoter recognition. Specific features of electrostatic potential of "early" T4 phage DNA promoters, J. Biomol. Struct. Dyn., 18(3):325-334.
4. A.A.Sorokin, A.A.Osypov, T.R.Dzhelyadin, P.M.Beskaravainy, S.G.Kamzolova (2006) Electrostatic properties of promoter recognized by E. coli RNA polymerase Esigma70, J. Bioinform. Comput. Biol., 4(2):455-467.
5. A.A.Osypov (2009) Electrostatic properties of genome DNA, PhD thesis, Moscow.
6. Y.Harada, T.Funatsu, K.Murakami, Y.Nonoyama, A.Ishihama, T.Yanagida (1999) Single-Molecule Imaging of RNA Polymerase-DNA Interactions in Real Time, Biophys. J., 76:709–715.


MAJOR TRENDS IN THE EVOLUTION OF YOUNG HUMAN PARALOGS ALEXANDER PANCHIN 1, MIKHAIL GELFAND 2, VASILY RAMENSKY 3, IRENA ARTAMONOVA 4

Gene duplication events are a potent source of new genes during evolution. This makes families of paralogous genes an interesting target for evolutionary studies. We introduce a novel method for calculating the evolutionary rates of individual genes from such families. It shows that the negative selection experienced by a duplicated gene is weaker during the period soon after the duplication event and then increases. These changes in negative selection pressure seem to be a major trend in the evolution of young human paralogs. The second major trend concerns the asymmetry in the evolution of the two gene copies that result from a duplication event. In about 22% of pairs of recently duplicated genes from young paralogous gene families, the two gene copies accumulate amino acid substitutions at significantly different rates. Differences in gene expression levels do not explain this asymmetry. The asymmetry in the accumulation rate of synonymous substitutions is much weaker and not significant. A possible explanation of this trend would be the need for one of the two duplicated gene copies to retain its initial function while the other copy evolves rapidly.
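The abstract does not say how the significance of the rate asymmetry between two copies was assessed. One simple possibility, assumed here purely for illustration and not presented as the authors' method, is an exact two-sided binomial test on the numbers of amino acid substitutions assigned to each copy (for example, by comparison with an outgroup). Under equal rates each substitution falls on either copy with probability 0.5. The function name is hypothetical.

```python
from math import comb


def asymmetry_p_value(subs_a: int, subs_b: int) -> float:
    """Exact two-sided binomial p-value for H0: both copies evolve at equal rates.

    subs_a and subs_b are substitution counts on the two paralog lineages.
    The p-value sums the probabilities of all outcomes that are no more
    likely than the observed split under Binomial(n, 0.5).
    """
    n = subs_a + subs_b
    if n == 0:
        return 1.0  # no substitutions, no evidence of asymmetry
    observed = comb(n, subs_a)
    as_extreme = sum(comb(n, k) for k in range(n + 1) if comb(n, k) <= observed)
    return as_extreme / 2 ** n


if __name__ == "__main__":
    print(asymmetry_p_value(10, 10))  # balanced pair
    print(asymmetry_p_value(20, 2))   # strongly asymmetric pair
```

When many gene pairs are tested, as in a genome-wide survey, the resulting p-values would additionally need a multiple-testing correction.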

Acknowledgements: This work was partially supported by the Russian Academy of Sciences Presidium programs “Molecular and Cellular Biology” and “Biological diversity”.

1 Lomonosov Moscow State University, Russian Federation, [email protected] 2 Kharkevich Institute for Information Transmission Problems, Russian Federation, [email protected] 3 Engelhardt Institute of Molecular Biology Main University, Russian Federation, [email protected] 4 Vavilov Institute of General Genetics, Russian Federation, [email protected]

NEW EVIDENCE FOR DIVERSITY OF INTERCELLULAR CHANNEL (GAP JUNCTION) PROTEINS. YURI PANCHIN 1, LUDMILA POPOVA 2, IGOR KOSEVICH 3, YULIA KRAUS 3, IRINA SHAGINA 4, MARIA KURNIKOVA 4, DMITRY SHAGIN 5

Gap junctions (GJ) are composed of membrane proteins that form a channel permeable to ions and small molecules, connecting the cytoplasm of adjacent cells. They are considered to be a universal feature of all multicellular animals (Metazoa) and play an important role in different biological functions. Connexins were identified as the molecular components of vertebrate GJ about 20 years ago (Paul, 1986). Numerous attempts to clone connexins from invertebrates failed, and it was finally suggested that invertebrate GJs are assembled from proteins unrelated to connexins. This protein family was originally designated OPUS (Barnes, 1994). It was suggested that these are specific invertebrate gap junction proteins, and they were later renamed innexins (Phelan et al., 1998). When the presence of innexin homologs in vertebrates was demonstrated, we proposed to reclassify innexins together with their vertebrate homologs into a larger family named pannexins (Panchin et al., 2000). The diversity of known GJ molecules now includes about 20 different connexin paralogs in the genomes of vertebrates and tunicates and from 3 to about 20 pannexin paralogs in the genomes of different vertebrates and invertebrates. We study GJ diversity at different levels, from gene families to individual mutations. Here we report two new separate findings in this field. 1) Our comparative genomics studies, supported by physiological experiments, predict that a new family (or families) of GJ proteins may exist in addition to connexins and pannexins. 2) We report a new form of transcript of the human GJ protein Cx26 that may partially compensate for the most common mutation in this gene, which is responsible for widespread heritable deafness.

1 Institute for Information Transmission Problems, RAS, Russian Federation, [email protected] 2 A.N. Belozersky Institute, Moscow State University, Russian Federation, [email protected] 3 Moscow State University, Russian Federation, [email protected], [email protected] 4 Evrogen JSC, Russian Federation, [email protected], [email protected] 5 Shemiakin and Ovchinnikov Institute of Bioorganic Chemistry [email protected]

Recent comparative genomic studies (Shestopalov and Panchin 2008, Panchin 2007, Litvin et al., 2006) suggest that some metazoans (such as sea urchin, sea anemone and sponge) have neither connexins nor pannexins. In the sequenced genome of the sea anemone Nematostella vectensis (Anthozoa, Cnidaria) (Putnam et al., 2007) we found no relatives of the pannexin or connexin families. This finding is supported by an early paper by G. O. Mackie, P. A. V. Anderson and C. L. Singla (1984), who suggested the absence of GJ in Anthozoa. At the same time our direct physiological data demonstrate the presence of GJ in Nematostella. By intracellular recordings we demonstrate electrical coupling between pairs of blastomeres in embryos at the 8- and 16-cell stages. It is known that from the 8-cell stage on, the blastomeres in Nematostella become completely separated (Fritzenwanker et al., 2007). In our experiments we combined electrophysiological measurements with dye injections. Lucifer yellow, fluorescein and dicarboxyfluorescein show no dye coupling in 8- and 16-cell stage embryos. At the same time, dye-injected cells displayed strong electrical coupling to other blastomeres within the embryo. Our data imply the presence of intercellular channels that are permeable to ions but not to dye molecules, connecting the cytoplasm of adjacent cells in Nematostella. As the known GJ protein families are absent from the Nematostella genome, we predict the discovery of additional protein families utilized for GJ function in Nematostella and other organisms, potentially even in vertebrates.

Mutations in the human GJB2 gene, which encodes connexin26 (Cx26), underlie various forms of hereditary deafness and skin disease. Non-syndromic recessive deafness (DFNB1) is considered to be due to simple loss-of-function mutations in Cx26, as the most frequent recessive Cx26 mutation (involved in 70–85% of Cx26-related deafness (Zelante et al. 1997)) is a single base deletion (35delG) that results in a frameshift at position 12 in the coding sequence and premature termination of the protein at amino acid 13. All but one of over 50 other reported Cx26 mutations implicated in DFNB1 are missense, so that only one amino acid is changed and the protein may potentially retain some functional activity, as has actually been shown for some of these mutations. Strangely enough, Cx26 knockout in mice is lethal (Gabriel HD et al. 1998), while in humans the presumed loss of the same gene in 35delG causes only non-syndromic deafness. We questioned whether 35delG really results in complete functional disruption of this GJ protein in humans. The gene is located on chromosome 13q11-q12. It is accepted that the Cx26 gene contains 2 exons and that exon 1 is untranslated. Our search of the NCBI databases revealed mRNA and EST sequences consistent with the supposition that an alternative Cx26 mRNA may exist. This idea was verified by RT-PCR experiments and by cloning and sequencing of the alternative mRNA. If this RNA does not contain the 35delG single nucleotide deletion, it may be translated from the initial ATG start codon and produce a protein indistinguishable from the “classical” form. Yet if the 35delG single nucleotide deletion is present, an alternative upstream ATG start codon could be utilized. In this case the predicted translated protein will differ from the “classical” protein in a short N-terminal part. The first 10 amino acids would be substituted in the new 35delG mutated form by an alternative sequence 45 amino acids long, while the main body of the protein, including all 4 TM domains, remains identical in both forms. To check whether the new form of Cx26 mRNA may be translated into a real transmembrane protein, we cloned the 35delG mutated sequence from a DFNB1 individual and fused it with an EGFP sequence. mRNA was transcribed in vitro from this construct and injected into Xenopus oocytes. After 2-3 days of incubation we found fluorescent GJ plaques in the membranes of injected cells, similar to those formed by the “classical” mRNA of the non-mutant gene. Our data show that one of the most common mutations causing non-syndromic deafness does not necessarily abolish Cx26 protein membrane channel function, as is commonly held. Alternative mRNA from the 35delG mutant of this gene can be translated into a membrane protein similar to the normal one. This finding demonstrates an interesting case of GJ protein diversity, changes our view of the mechanism of DFNB1 deafness and may alter our approaches to its treatment.
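The effect of a single-base deletion such as 35delG on the reading frame can be illustrated with a short Python sketch. The coding sequence below is a short illustrative reading frame chosen for this example and is not asserted to be the GJB2 reference sequence; the helper names are hypothetical. Deleting one G from the run of six G's that ends at position 35 changes codon 12 and turns codon 13 into a stop codon, matching the description above.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (NCBI translation table 1).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(cds: str) -> str:
    """Translate a coding sequence from its first base up to the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def delete_base(cds: str, position: int) -> str:
    """Remove the base at a 1-based position."""
    return cds[:position - 1] + cds[position:]


# Illustrative reading frame (assumption for this sketch), written codon by codon.
wild_type = ("ATG GAT TGG GGC ACG CTG CAG ACG ATC CTG "
             "GGG GGT GTG AAC AAA CAC TAA").replace(" ", "")
mutant = delete_base(wild_type, 35)

if __name__ == "__main__":
    print(translate(wild_type))  # full-length toy peptide
    print(translate(mutant))     # truncated peptide after the frameshift
```

The alternative upstream ATG discussed in the abstract would restore the original frame downstream of the deletion; that upstream sequence is not given in the text and is therefore not modelled here.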


TIME WARPING OF GLOBAL EXPRESSION DATA FOR EVOLUTIONARY DISTANT SPECIES DMITRI PAPATSENKO 1, YURY GOLTSEV 1

Keywords: time warping, yeast cell cycle, Drosophila, Anopheles, development

Background: Comparative analysis of the temporal dynamics of gene expression has a broad range of potential applications, including evolutionary biology, developmental biology, and medicine. However, when species are separated by large evolutionary distances, constructing global alignments and subsequently comparing the time-series data is difficult. The main reason is the accumulation of variability in the expression profiles of orthologous genes in the course of evolution. Results: We applied Pearson distance matrices, in combination with other noise-suppression techniques, to enhance the capacity to capture similarities between temporal gene expression datasets separated by large evolutionary distances. We aligned and compared temporal gene expression data from budding (Saccharomyces cerevisiae) and fission (Schizosaccharomyces pombe) yeast, which are separated by more than ~400 Myr of evolution. We also aligned developmental time courses of fly (Drosophila melanogaster) and mosquito (Anopheles gambiae), species separated by ~250 Myr of evolution. In the case of the yeasts, we found that the global alignment (time warping) properly matched the durations of the cell cycle phases in these distant organisms, which had been measured in prior studies. In the case of the insects, we unambiguously identified an alignment path as well. Concordantly and discordantly expressed genes and gene batteries were identified in both alignments. Conclusions: Our predictions of the variability in the duration of the cell cycle phases in the two yeast species, based on the global alignment, were in good agreement with the existing data, thus supporting the computational strategy adopted in this study. Alignment of the insect datasets correctly identified some known developmental events that are organized differently in fly and mosquito.
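As a rough illustration of the alignment step described in this abstract, the following minimal Python sketch builds a Pearson distance matrix between the time points of two expression datasets and finds a warping path by classic dynamic time warping. It is not the authors' implementation: the function names, the toy data, the use of 1 - r as the distance, and the three-move step pattern are all assumptions, and the additional noise-suppression techniques mentioned in the abstract are omitted.

```python
import math


def pearson_distance(x, y):
    """Return 1 - Pearson correlation between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 1.0  # assumption: a flat profile is treated as uncorrelated
    return 1.0 - cov / (sx * sy)


def time_warp(series_a, series_b):
    """Globally align two time series of expression vectors.

    series_a[i] and series_b[j] hold the expression levels of the same
    ordered set of orthologous genes at time points i and j.
    Returns (total cost, warping path as a list of (i, j) pairs).
    """
    n, m = len(series_a), len(series_b)
    dist = [[pearson_distance(a, b) for b in series_b] for a in series_a]
    inf = float("inf")
    cost = [[inf] * m for _ in range(n)]
    cost[0][0] = dist[0][0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best = min(
                cost[i - 1][j] if i > 0 else inf,
                cost[i][j - 1] if j > 0 else inf,
                cost[i - 1][j - 1] if i > 0 and j > 0 else inf,
            )
            cost[i][j] = dist[i][j] + best
    # Trace the cheapest path back from the last pair of time points.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((cost[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((cost[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((cost[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return cost[n - 1][m - 1], path


# Toy data (hypothetical): the second series repeats the middle time point,
# mimicking a phase that lasts twice as long in the second species.
series_a = [[1, 2, 3], [3, 1, 2], [2, 3, 1]]
series_b = [[1, 2, 3], [3, 1, 2], [3, 1, 2], [2, 3, 1]]
total_cost, path = time_warp(series_a, series_b)
print(path)
```

In this toy case the path maps the middle time point of the first series onto two consecutive time points of the second, which is the kind of difference in phase duration that such an alignment is meant to expose.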

1 University of California, Berkeley, United States, [email protected]

EVIDENCE OF PROTEIN DOMAINS STABILITY DUE TO AROMATIC INTERACTIONS. LEONID PEREYASLAVETS 1

Keywords: aromatic interaction, protein domain

The significance of aromatic interactions for protein stability was outlined long ago [1]. A more detailed view revealed that nearly 74% of aromatic interactions in proteins connect different secondary structures [2]. This work presents a quantitative analysis of aromatic interactions in protein domains. The main source of data is the CATH [3] protein structure classification database. Within every sequence family defined at the 35% homology level, all pairs of domains with homology below 60% were aligned. The mean amino acid persistence level should therefore vary between 35% and 60%. Analysis of the persistence of aromatic amino acids showed that those taking part in aromatic interactions are highly conserved, whereas the remaining aromatic amino acids show very low conservation (about 35% persistence). An aromatic amino acid that interacts with one or more aromatic residues within a 7 Å range can be considered a special “Ar” amino acid. The conservation of such “Ar” amino acids reaches a notably high level of about 60%, comparable only with proline (the most rigid residue) and glycine (the most flexible residue). Detailed investigation reveals a major role of aromatic interactions in protein domain stability, involving not only high contact order but special geometric characteristics as well.
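The “Ar” definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each residue is represented by a single coordinate (assumed here to be the side-chain ring centroid, since the abstract does not say which atoms define the 7 Å range), histidine is not counted as aromatic, and the input format and function name are hypothetical.

```python
import math

# Assumption: only Phe, Tyr and Trp are treated as aromatic residues.
AROMATIC = {"PHE", "TYR", "TRP"}


def find_ar_residues(residues, cutoff=7.0):
    """Return ids of aromatic residues that have another aromatic residue
    within `cutoff` angstroms.

    residues: list of (residue_id, residue_name, (x, y, z)) tuples, where the
    coordinate is assumed to be the ring centroid of the side chain.
    """
    aromatic = [(rid, xyz) for rid, name, xyz in residues if name in AROMATIC]
    ar = set()
    for idx, (rid_a, xyz_a) in enumerate(aromatic):
        for rid_b, xyz_b in aromatic[idx + 1:]:
            if math.dist(xyz_a, xyz_b) <= cutoff:
                ar.add(rid_a)
                ar.add(rid_b)
    return sorted(ar)


# Hypothetical toy domain: Phe10 and Tyr25 are 5 angstroms apart, Trp40 is isolated.
toy_domain = [
    (10, "PHE", (0.0, 0.0, 0.0)),
    (25, "TYR", (5.0, 0.0, 0.0)),
    (40, "TRP", (20.0, 0.0, 0.0)),
    (41, "LEU", (21.0, 0.0, 0.0)),
]
print(find_ar_residues(toy_domain))  # [10, 25]
```

Residues flagged this way could then be compared with the remaining aromatic residues in terms of persistence across the aligned domain pairs.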

1. S. K. Burley, G. A. Petsko (1985) Aromatic-Aromatic Interaction: A Mechanism of Protein Structure Stabilization, Science, 229: 23-28.
2. A. Thomas et al. (2002) Aromatic Side-Chain Interactions in Proteins. I. Main Structural Features, Proteins: Structure, Function and Bioinformatics, 48: 628-634.
3. L. H. Greene et al. (2007) The CATH domain structure database: new protocols and classification levels give a comprehensive resource for exploring evolution, Nucleic Acids Research, 35: D291-D297.

1 Institute of Protein Research RAS, Russian Federation, [email protected]

CpG ISLANDS: EVOLUTION OF ‘NON-OBJECTS’ IN THE GENOME INNA PERTSOVSKAYA 1, ARTEM ARTEMOV 2, NINA OPARINA 3, ALEXANDER FAVOROV 4, ANDREI MIRONOV 2, DMITRRY VINOGRADOV 5

Keywords: CpG islands, DNA methylation, evolution, comparative genomics

CpG islands were described more than 30 years ago as a special kind of region in vertebrate genomes. These regions differ from the rest of the genome by an increased frequency of CpG dinucleotides. Commonly, CpG dinucleotides are subject to methylation, and MetCpG mutates to TpG very fast. Thus the frequency of CpG in the genome is significantly lower than one would expect from the Bernoulli model given the C and G frequencies. The frequency of CpG in the islands is still lower than the independent expectation, but it is noticeably higher than in other genome regions. Because CpG islands are defined as “non-objects” (i.e. regions where the methylation-and-mutation pipeline washes CpGs away less intensely than in the remaining genome) rather than as objects, the question of their functionality remains unclear despite the rather long history of investigation. First announced as non-methylated parts of the genome, they have finally become something like a statistical object based on a simple formal rule or algorithm. DNA methylation has many known functions in the genome, e.g. imprinting, X chromosome inactivation, and regulation of gene expression. One can therefore assume that the main function of CpG islands in a genome is either the lack of functions that are usually provided by methylated regions, or the potential of islands to be methylated or not in different tissues. The investigation we present here aims to test the hypothesis that selection pressure at least partially explains the existence of CpG islands. Our results are based on evolutionary genomics methods and support this hypothesis. We observed very little consistency of CpG islands in a set of related genomes (human, chimpanzee, cow and dog).
It could be explained by the notion that CpG islands are not a definite evolutionary object; they are a collection of different regions that have avoided methylation and/or the consequent hypermutability. A permutation test that compared CpG dinucleotide birth probabilities (rates) inside the CpG islands and in random genome regions showed that the rate is higher inside the islands. This difference is evidence of a role of selection pressure in the conservation of CpG islands. At the same time, the fact that the difference between the CpG death rates inside and outside the islands is significantly higher than the corresponding difference between the CpG birth rates confirms the common “methylation-protection” explanation of the existence of CpG islands. The conclusion is that it is more informative to consider CpG islands as “non-objects” in the genome and thus to allow for their heterogeneity both in function and in origin.

1 IDIBAPS, Spain, [email protected] 2 MSU, Russian Federation, [email protected] , [email protected] 3 GosNIIGenetika, Russian Federation, [email protected] 4 Johns Hopkins School of Medicine, United States, [email protected] 5 IITP RAS, Russia, [email protected]
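Returning to the Bernoulli expectation mentioned in this abstract: under a model of independent bases, the expected number of CpG dinucleotides in a sequence of length n is roughly f(C) * f(G) * (n - 1). The Python sketch below computes the widely used observed/expected CpG ratio on that basis. It is an illustrative assumption, not the statistic or the island-detection rule used by the authors; the function name and the toy sequences are hypothetical.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio under an independent-bases (Bernoulli) model.

    Expected CpG count = f(C) * f(G) * (n - 1), where f() are base frequencies
    and n - 1 is the number of dinucleotide positions in the sequence.
    """
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        return 0.0
    freq_c = seq.count("C") / n
    freq_g = seq.count("G") / n
    observed = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    expected = freq_c * freq_g * (n - 1)
    return observed / expected if expected > 0 else 0.0


# Hypothetical toy sequences with identical base composition.
print(cpg_obs_exp("CGCGATAT"))  # CpG-rich: ratio well above 1
print(cpg_obs_exp("CAGTCTGA"))  # CpG-depleted: ratio of 0
```

A ratio below 1 corresponds to the CpG depletion described for the bulk genome, while values closer to 1 are what distinguish island-like regions.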

This work was supported by Howard Hughes Medical Institute [grant number 55005610]; the Program ‘Molecular and Cellular Biology’ of the Russian Academy of Sciences; and Russian Foundation of Basic Research [grants number 09-04-92742, 07-04-91555].


HIGH RATE OF ADAPTATION IN DROSOPHILA DMITRI PETROV 1, JOSEFA GONZALEZ 1, J. MICHAEL MACPHERSON 1, LENKOV KAPA 1

Despite the foundational importance of adaptation, we know very little about the process of adaptation in natural populations. We don't even know how frequent adaptation is and whether it generally involves mutations of small or large phenotypic effect. Quantifying molecular adaptation is the key task of modern evolutionary biology in the genomic era. (i) I will discuss how patterns of genome-wide neutral polymorphism and functional divergence in Drosophila can be used to infer that adaptation is both frequent (approximately one every 1000 generations) and often strong (on the order of 1% advantage). If true, this would mean that virtually all neutral polymorphisms in the Drosophila genome are affected by adaptations in their genomic vicinity. (ii) I will also describe a search for very recent adaptation in Drosophila generated by insertions of transposable elements (TEs). Our search uncovered 13 putatively adaptive TEs, implying a very high rate of TE-induced adaptation. Most of these TEs appear to be adaptive in some but not other environments, suggesting that a substantial proportion of recent adaptations are ephemeral. I will discuss the implications of these results for understanding the nature of genetic variation and divergence between species.

1 Stanford University, United States, [email protected], [email protected], [email protected], [email protected]

REGULATION OF RIBOSOMAL GENES IN BACTERIA: COMPARATIVE GENOMIC ANALYSIS SVETLANA A. PETROVA 1, ALEXEY G. VITRECHACK 2

Keywords: regulation, RNA structures, ribosomal operons

Bacteria use a wide range of regulatory mechanisms to control gene expression. While the most common regulatory mechanism seems to be regulation of transcription by DNA-binding proteins, there are other important mechanisms, in particular regulation of transcription (by premature termination) and translation (by interference with initiation) via the formation of RNA structures in 5’-untranslated gene regions. Synthesis of ribosomal proteins is often regulated by a structured mRNA that interacts with a ribosomal protein. The regulatory ribosomal protein usually plays a pivotal role in the assembly of the central domain of the large or small ribosomal subunit and regulates its own expression by a feedback mechanism at the translational level. The protein recognizes both rRNA and mRNA targets that share partial similarity (molecular mimicry). In the present work, we analysed the regulation of 6 ribosomal operons: rpsO (S15), α, spc (S8), S10 (L4), L11-L1, β (L10). Using a sample of known E. coli regulatory sites, we predicted regulatory sites in other gamma-proteobacteria. Moreover, using a comparative approach we identified the most conserved regions of the mRNA structures (“core” binding sites). In the case of the L11-L1 operon, we used a training set of regulatory sites to construct a search pattern and scan available bacterial genomes. New regulatory sites were found in various Gram-negative bacteria (α-, β-, γ-, δ-, ε-proteobacteria), Firmicutes (including Mollicutes), Actinomycetales and some other groups (Bacteroidetes, Cyanobacteria and Archaea). Regulatory sites were identified upstream of both rplK and rplA in most Firmicutes (Bacillales, Lactobacillales and Clostridiales, except some species). Candidate regulatory sites upstream of only one gene were found in Actinomycetales, Bacteroidetes, Cyanobacteria (rplK) and Proteobacteria (rplA). A correlation between the localization of the regulatory structure and the L11-L1 intergenic distance was identified. We predicted rplK and rplA co-translation as well as independent gene translation for all analysed bacteria. Evolutionary analysis of the regulatory signals shows that regulation of the L11-L1 genes is variable and involves three regulatory strategies: co-regulation of both genes by a single site upstream of the first gene (rplK); independent translation and regulation of rplK and rplA, each by its own site; and auto-regulation of rplA only. A taxonomy tree with reconstructed regulatory events for the L11-L1 operon was built, and an evolutionary model of the regulatory mechanism is presented.

1 Department of Bioengineering and Bioinformatics of M.V. Lomonosov Moscow State University, MSU, GSP-2, building 73, Leninskiye Gory, Moscow, 119992, [email protected]
2 Institute for Information Transmission Problems (the Kharkevich Institute), Russian Academy of Sciences, Bolshoi Karetny per. 19, Moscow, 127994, Russia, [email protected]


POLYMORPHISM OF ISSR-PCR MARKERS AND POSITIONING OF INVERT REPEATS OF MICROSATELLITES IN SEQUENCES OF BOVIDAE FAMILY ANTON PHEOPHILOV 1, VALERIY GLAZKO 1

Keywords: sequence homology, polymorphism, ISSR-PCR markers

In the current work we compared the spectra of ISSR-PCR markers generated using fragments of dinucleotide ((AG)9C and (GA)9C) and trinucleotide ((ACC)6G, (AGC)6G, (GAG)6C and (CTC)6C) microsatellite loci as primers in the polymerase chain reaction, and examined the positioning of these sequences in sequenced DNA from the GenBank database for different species of Bovidae. PCR with the (AG)9C primer on the genomes of two sheep breeds revealed spectra of amplification products containing 13 DNA fragments, versus only 8 for the (GA)9C primer. The polymorphism information content (PIC) was 0.461 for the former primer and 0.375 for the latter. Using the BLASTn algorithm, 22 homologous sequences were found in the GenBank database for the pair (AG)9 and (CT)9; for the other pair no homology was detected. Most of the matches for (AG)9C were found in different, partly overlapping sequences of the DR beta-chain antigen-binding domain of major histocompatibility complex class II DRB genes. The relatively higher polymorphism of DNA fragments flanked by the inverted repeat (AG)9C compared with (GA)9C in the investigated sheep breeds could be related to the localization of the AG repeat in the most polymorphic gene systems of mammals, those related to the immune system: MHC genes and interleukins. For the (GA)9C sequence we again found homologous matches in the cattle sequence database in the MHC, in the DRB gene group. Exon localization of this repeat may explain its conservation despite its relationship to immune system genes. The highest share of polymorphic amplicons in the spectrum of PCR products was revealed with the (ACC)6G and (CTC)6C primers in cattle. A GenBank search revealed the presence of the (CTC)6C sequence in the protamine gene cluster. The lowest share of polymorphic loci was observed in the spectra of the trinucleotide primers (GAG)6C and (AGC)6G. The search revealed a sequence homologous to the (AGC)6G repeat in the gene for insulin-like growth factor binding protein (IGFBP2) in both sheep and cattle, which indicates conserved localization of this sequence in different species. We also found the (CTC)6C microsatellite in the Ovis aries insulin-like growth factor II (IGF-II) gene. The differences observed allow us to conclude that ISSR-PCR marker polymorphism depends on the repeat motif and its positioning in nucleotide sequences belonging to different genes or other regions. In some cases, differences in DNA fragment polymorphism could be associated with the positioning of the inverted repeat in exons or introns. The ISSR-PCR marker polymorphism we revealed is likely caused, to some degree, by the many regions in the introns and exons of immune system genes (DRB, interleukin), and in genes related to either transforming growth factors or their receptors, that are homologous to the microsatellite loci used as primers.

1 Russian State Agrarian University, Moscow Agrarian Timiryasev University, Russian Federation, [email protected], [email protected]


EVOLUTION OF MITOCHONDRIAL GENOME SIZE: LARGE GENOMES IN SMALL MAMMALS AND SMALL GENOMES IN LARGE MAMMALS KONSTANTIN POPADIN 1

Keywords: mitochondrial genome size, mammals, intracellular selection, deletions

Oogonia of the human female embryo halt their mitotic divisions as early as 7 months of prenatal development. After that, primary oocytes begin the first meiotic division, but the process stops in prophase and the cells remain in a dormant stage until puberty. From the beginning of puberty, primary oocytes start to grow again each month, resume meiosis I and transform into secondary oocytes. The secondary oocytes begin the second meiotic division, followed by ovulation and fertilization. In total, each ovulated human oocyte spends 12-50 years in a non-dividing condition (arrested phase of meiosis I). What happens to mitochondria during this prolonged period of dormancy? Although the metabolic rate in arrested primary oocytes is decreased, the mtDNA turnover rate is not zero, owing to autophagy of old or damaged mitochondria and synthesis of new mtDNA. There is therefore a possibility of intracellular selection, which gives a replication advantage to the shortest mitochondrial genomes (genomes with deletions). In this respect primary oocytes are similar to non-dividing cells such as neurons and skeletal muscle fibers, where clonal expansion of short mitochondrial genomes during the cell lifetime leads to a series of encephalomyopathies and to aging. An oocyte with a high fraction of large-scale deletions will most likely be eliminated and will not reach ovulation. Small deletions, however, especially in non-coding regions, might be fixed in an oocyte as effectively neutral mutations. Because clonal expansion of short mitochondrial genomes is more probable during a long period of dormancy, I hypothesize that mammalian species with a long generation time possess smaller mitochondrial genomes than mammalian species with a short generation time. (I) Genome size and generation time.
I found a significant negative linear regression between mitochondrial genome length (bp) and generation time (days) for 131 placental mammals (genome length = generation time*(-0.04933) + 16720, p = 0.00473, R2 = 0.06). Additionally, I demonstrated that genome size variation is associated mainly with copy number variation of tandem repeats, located preferentially in the control region of mtDNA (genome length = tandem length*(0.9837) + 16430, p < 0.001, R2 = 0.82). Finally, I found a significant negative regression between generation time and total length of tandem repeats (tandem length = generation time*(-0.0527) + 297, p = 0.001, R2 = 0.082). (II) Genome size variation on the phylogenetic tree. To understand the evolutionary dynamics of mitochondrial genome size, I (i) reconstructed a phylogenetic tree, (ii) reconstructed the complete mitochondrial genome sequence at each node of the tree, treating indels (insertions and deletions) as a fifth character, and (iii) estimated the change in genome size for each branch of the tree (∆l). A sample of 34 short-lived animals with generation time < 500 days (the lower quartile) demonstrated a significantly higher increase in genome size than 34 long-lived mammals with generation time > 1643 days (the upper quartile) (acctran: ∆l for short-lived is +56.8 nucleotides while for long-lived it is -3.8, P = 0.03, t-test; deltran: ∆l for short-lived is +83.3 nucleotides while for long-lived it is +4.6, P = 0.025, t-test). After that I computed Kendall’s rank correlation between ∆l and the generation time of modern species.

1 Institute for Information Transmission Problems of the Russian Academy of Sciences (Kharkevich Institute), Russian Federation, [email protected]
Both models of node reconstruction demonstrated a significant negative trend: the shorter the generation time, the larger the increase in genome size on the external branch (acctran: Kendall’s tau = -0.104, P = 0.041; deltran: Kendall’s tau = -0.117, P = 0.025). The trends observed above could be explained by either (i) strong selective constraints on small genome size in long-lived mammals during the prolonged period of oocyte dormancy or (ii) an adaptive effect of tandem repeats in short-lived mammals. The first explanation implies that tandem repeats are selfish, slightly deleterious elements that accumulate in short-lived mammals because of less effective intracellular selection. The second explanation implies that for some reason tandem repeats are favorable in short-lived mammals. The first explanation seems more plausible because of (i) high intra-species variation in tandem repeat number in short-lived species; (ii) independent loss of tandem repeats in a few lineages of long-lived mammals that originated from short-lived ones; and (iii) the absence of good evidence for a biological function of tandem repeats in mtDNA.
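To give a feel for the size of the effects reported in part (I), the short Python sketch below simply evaluates the three regression lines quoted in the abstract at the two quartile boundaries (500 and 1643 days). The coefficients are taken from the text; the function names are hypothetical, and the calculation is a worked illustration rather than a re-analysis of the data.

```python
def genome_length_from_generation_time(days):
    """Reported fit: genome length (bp) = generation time * (-0.04933) + 16720."""
    return -0.04933 * days + 16720


def genome_length_from_tandem_length(tandem_bp):
    """Reported fit: genome length (bp) = tandem length * 0.9837 + 16430."""
    return 0.9837 * tandem_bp + 16430


def tandem_length_from_generation_time(days):
    """Reported fit: tandem length (bp) = generation time * (-0.0527) + 297."""
    return -0.0527 * days + 297


short_lived, long_lived = 500, 1643  # quartile boundaries quoted in the text (days)

# Predicted difference in genome length between the two generation times.
length_gap = (genome_length_from_generation_time(short_lived)
              - genome_length_from_generation_time(long_lived))
# Predicted difference in total tandem repeat length.
tandem_gap = (tandem_length_from_generation_time(short_lived)
              - tandem_length_from_generation_time(long_lived))

print(round(length_gap, 1), round(tandem_gap, 1))
```

The fitted lines predict a difference of only a few dozen base pairs between the quartile boundaries, and with R2 values of 0.06 and 0.082 generation time explains a small share of the variance, so these numbers describe a weak average trend rather than a prediction for any single species.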


BIOINFORMATICS AS A "CRITICAL TECHNOLOGY" FOR LIFE SCIENCES V.V. POROIKOV 1

Keywords: bioinformatics, data, information, knowledge, databases, computer programs, critical technology, life sciences

Different definitions of bioinformatics are in use. One example: “The merger of biotechnology and information technology with the goal of revealing new insights and principles in biology” [1]. Another example: “Bioinformatics derives knowledge from computer analysis of biological data. These can consist of the information stored in the genetic code, but also experimental results from various sources, patient statistics, and scientific literature. Research in bioinformatics includes method development for storage, retrieval, and analysis of the data. Bioinformatics is a rapidly developing branch of biology and is highly interdisciplinary, using techniques and concepts from informatics, statistics, mathematics, chemistry, biochemistry, physics, and linguistics. It has many practical applications in different areas of biology and medicine” [2]. We define bioinformatics as a multidisciplinary field of science that integrates informational and computational technologies with the achievements of biology and medicine. The subject of bioinformatics is the development and application of mathematical methods, computer programs and databases for the storage and retrieval of information and for the generation of new knowledge from the vast amounts of experimental data obtained in high-throughput biomedical studies. Solving the following problems requires the application of bioinformatics methods:

1. Comparative genomics and proteomics. Analysis of nucleotide and amino acid sequences of deciphered genomes. Study of the molecular evolution of organisms.
2. Functional genomics and proteomics. Determination of gene structure and regulatory signals, antigenic determinants, active sites and other functionally essential regions in nucleic acids and proteins. Study of gene expression regulatory systems. Analysis of the structure and function of non-coding regions in nucleic acids.
3. Structural genomics and proteomics. Modeling of the 3D structure of proteins and nucleic acids. Determination of posttranscriptional and posttranslational modifications.
4. Analysis of proteomes of various biological samples for determination of similarity and diversity. Identification of protein-protein interactions.
5. Systems biology. Analysis of biological processes at the supramolecular level using genomic, transcriptomic, proteomic and metabolomic information. Modeling of metabolic and signal regulatory pathways in a cell.
6. Analysis of SNPs and SAPs, abundance of repeats and other changes at the molecular level, with regard to disease predisposition, features of disease development, and individual pharmacotherapeutic responses.
7. Discovery of potential markers for the detection of diseases.
8. Identification of prospective molecular targets for new drugs.
9. Analysis of the interaction of natural and synthetic bioregulators with molecular targets; design and optimization of lead compounds with the required properties as new pharmaceutical agents.
10. Design of immunogenic constructs for the creation of new vaccines.
11. Bioengineering (design) of microorganisms and plants as producers of physiologically active substances with required properties.
12. Textomics as a method for determining associations between scientific results obtained in different fields of biomedical science.
13. Detection of signals in noisy postgenomic data (mass spectrometry, atomic force microscopy, microarrays, etc.).
14. Integration and development of means for the analysis of heterogeneous types of data in large-scale databases of biomedical information.

Since the mission of bioinformatics is “Filling Gaps in the Existing Knowledge”, it can be considered a “Critical Technology” for the life sciences.

1 Institute of Biomedical Chemistry of Rus. Acad. Med. Sci., Pogodinskaya Str., 10, Moscow, 119121, Russia; [email protected]

1. http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/glossary2.html
2. http://www.pasteur.fr/recherche/unites/Binfs/definition/bioinformatics_definition.html


EXPANSION OF THE PROTEIN SEQUENCE UNIVERSE INNA POVOLOTSKAYA 1, FYODOR KONDRASHOV 1

It is thought that many different proteins had evolved prior to the divergence of all life from the last universal common ancestor (LUCA). The divergent evolution that has occurred throughout the last ~4 billion years gave rise to the modern protein universe. Here, we investigate basic parameters of this universe, mainly whether or not the extant protein universe has reached an equilibrium, that is, whether or not protein sequences that have diverged since LUCA are continuing to diverge. To this end, we measured the correlation between the divergence rates of three closely related species and the Hamming distance to the ortholog from a distant reference species, for orthologous protein alignments in 29 sets of genomes. Triplets of closely related genomes were obtained from the ATGC database (1) or identified using the bidirectional best-hit approach: 19 and 10 triplets, respectively. The average protein distance within each triplet was 1-15% between the sister genomes and 2-20% to the third, outgroup genome. To obtain the distantly related reference species we used the 841 complete prokaryotic genomes extracted from the NCBI Genome Assembly database and clustered into COGs using the multiple genome-specific best hits approach (2). Our final dataset contained multiple COG-like families of homologous proteins from each species. Using this dataset, we measured the rate of divergence of the closely related orthologs of the two sister species from the distant reference ortholog. To reduce errors caused by false identification of ortholog pairs, we analyzed only highly conserved sets of proteins that are assumed to have been present in the last universal common ancestor (3). The rate of divergence was measured in the following manner. Using the third species, an outgroup to the two closely related sister species, we were able to polarize the substitutions between the sister species. Then, these substitutions were related to the amino acid states in the distant ortholog.
We calculated the number of substitutions away from and towards the distant reference ortholog in the closely related species. Figure 1 shows an example of three closely related species with one distant reference ortholog, with 2 substitutions occurring in the sister species away from the reference ortholog and 2 substitutions towards the reference ortholog.

1 Centre for Genomic Regulation, C/ Dr. Aiguader, 88, Barcelona, Spain, [email protected], [email protected]

Sister1:   ATGNTYLDEF
Sister2:   AVGHTYHDEY
Outgroup:  ATGNTYHDEY
Reference: KTGHTAHDEF

Figure 1. A hypothetical alignment of two sister species, their closely related ortholog and a distant reference ortholog. The outgroup can be used to polarize the substitutions in the sister species and the directionality of the substitutions can be related to the sequence of the reference ortholog. In yellow are substitutions that occur away from the reference sequence, and magenta outlines cases where the substitution occurs towards the reference sequence.
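The counting procedure illustrated by Figure 1 can be sketched as follows. This minimal Python example assumes a simple parsimony rule for polarization (the outgroup state is taken as ancestral when it matches exactly one of the two sisters, and sites where it matches neither are skipped); the abstract does not spell out the exact rule, and the function name is hypothetical.

```python
def count_polarized_substitutions(sister1, sister2, outgroup, reference):
    """Count substitutions in the sister lineages that move away from or
    towards the distant reference ortholog.

    All four arguments are aligned sequences of equal length.
    Returns (away, toward).
    """
    away = toward = 0
    for s1, s2, out, ref in zip(sister1, sister2, outgroup, reference):
        if s1 == s2:
            continue  # no substitution between the sisters at this site
        if out == s1:
            ancestral, derived = s1, s2
        elif out == s2:
            ancestral, derived = s2, s1
        else:
            continue  # assumption: sites that cannot be polarized are skipped
        if ancestral == ref:
            away += 1    # lineage left the state shared with the reference
        elif derived == ref:
            toward += 1  # lineage acquired the state seen in the reference
    return away, toward


# The hypothetical alignment from Figure 1.
result = count_polarized_substitutions(
    "ATGNTYLDEF",  # Sister1
    "AVGHTYHDEY",  # Sister2
    "ATGNTYHDEY",  # Outgroup
    "KTGHTAHDEF",  # Reference
)
print(result)  # (2, 2)
```

For the Figure 1 alignment the sketch reproduces the two substitutions away from and two towards the reference described in the text; substitutions where neither the ancestral nor the derived state matches the reference are counted in neither class.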

To make sure that errors in alignment are not overwhelming we excluded from the analyses sets of orthologs with the extent of divergence more than 60%. We found that the number of substitutions away from the reference ortholog was always greater than the number of substitutions towards the reference sequence. Thus, the protein universe is currently continuing to expand such that sequence similarity of distantly related proteins will continue to decrease.

1. P. Novichkov et al. (2009) ATGC: a database of orthologous genes from closely related prokaryotic genomes and a research platform for microevolution of prokaryotes, Nucleic Acids Research, 37: 448–454.
2. R. Tatusov et al. (2001) The COG database: new developments in phylogenetic classification of proteins from complete genomes, Nucleic Acids Research, 29: 22-28.
3. B. Mirkin et al. (2003) Algorithms for computing parsimonious evolutionary scenarios for genome evolution, the last universal common ancestor and dominance of horizontal gene transfer in the evolution of prokaryotes, BMC Evolutionary Biology, 3: 2. doi: 10.1186/1471-2148-3-2.


CLUSTER ANALYSIS OF PHYLOGENETIC PROFILES MIKHAIL PYATNITSKIY 1, A.V. LISITSA 1, A.I. ARCHAKOV 1

Keywords: phylogenetic profiles, phyletic patterns, cluster analysis, protein-protein interactions, functional links

The advent of whole-genome sequencing led to computational methods that infer protein function and linkages. One of the most promising approaches for predicting structural and functional protein-protein interactions is the study of phylogenetic profiles [1]. A phylogenetic profile of a protein is a binary vector in which each component represents the presence or absence of a homolog of that protein in a specific organism. It has been shown that proteins with similar patterns of co-occurrence across many organisms tend to participate in the same protein complex or biochemical pathway, or to have a similar sub-cellular location. In the present work we explored the application of cluster analysis to phylogenetic profiles in order to determine whether it can reveal protein functional modules, and to find optimal parameters that improve the method’s performance. Phylogenetic profiles for E. coli K12 were obtained from the COG database. The KEGG database was used as the “ground truth”, and each metabolic pathway for E. coli was considered a separate cluster. The profile matrix contained 889 rows (proteins) and 65 columns (genomes), while the membership matrix contained 889 rows and 123 columns (KEGG pathways). We applied several standard techniques of cluster analysis, including hierarchical agglomerative and divisive clustering, and PAM as an iterative partitioning method. We also tried different distances between profiles: Hamming, Jaccard and Kulczynski distances, and the probability of the profiles co-occurring purely by chance [2]. We propose using cluster evaluation indexes as a way to assess predictions of groups of related proteins. External indexes (the Rand index and the Euclidean distance between membership matrices) were used to estimate the agreement between the resulting partitions and KEGG pathways. Several internal indexes (Dunn index, Davies-Bouldin index, Hubert gamma statistic, silhouette width, etc.) were used to assess the clustering results in the more realistic case where a “gold standard” partition is missing.
Null distributions for all indices were computed to evaluate statistical significance of partitions. All

 1 Institute of Biomedical Chemistry, 119121, Moscow, Pogodinskaya 10, [email protected] 300 4-TH MOSCOW CONFERENCE ON COMPUTATIONAL MOLECULAR BIOLOGY July 20–23, 2009 software was implemented as a set of platform-independent Perl/BioPerl and R scripts and is freely available upon request. We performed extensive search for all possible combinations of clustering method and distance measure and computed different indexes to evaluate obtained partitions, similar to work by Glazko and Mushegian [3]. According to both external indexes, Ward’s method coupled with Hamming distance showed the best agreement with KEGG clustering, optimal number of clusters was estimated to be 128 and 38.9% of proteins in one cluster participated in common KEGG pathway. However, results of different internal indexes were not always fully consistent with each other, but main tendencies remained the same. For example, index “average silhouette width” yielded that while optimal clustering algorithm was also Ward’s method, but distance should be defined as probability of profile’s co-occurrence by chance; number of clusters was estimated to be 120 with 34.8% of proteins in one cluster participating in common KEGG pathway. Single linkage clustering showed the worst performance in all cases. Results of our work showed, that cluster analysis of phylogenetic profiles can reveal modules of functionally related proteins, while application of proper clustering algorithms can significantly improve performance of the method. Our approach can also be utilized for evaluation of different methods for computational predictions of functionally related proteins.
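As an illustration of the quantities named in this abstract, the following minimal Python sketch computes the Hamming and Jaccard distances between two binary phylogenetic profiles and the Rand index between two partitions. It is not the authors' Perl/R software: the function names and the toy profiles are hypothetical, the Kulczynski and chance co-occurrence distances are omitted, and the Rand index is written for hard partitions, whereas KEGG pathway membership can overlap.

```python
from itertools import combinations


def hamming(p, q):
    """Number of genomes in which exactly one of the two proteins has a homolog."""
    return sum(a != b for a, b in zip(p, q))


def jaccard(p, q):
    """1 - (genomes where both are present) / (genomes where either is present)."""
    both = sum(a and b for a, b in zip(p, q))
    either = sum(a or b for a, b in zip(p, q))
    return 0.0 if either == 0 else 1.0 - both / either


def rand_index(labels_a, labels_b):
    """Fraction of protein pairs that two hard partitions treat the same way."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Toy profiles (hypothetical data): one entry per genome, 1 = homolog present.
prot_a = [1, 1, 0, 1, 0]
prot_b = [1, 1, 0, 1, 1]
prot_c = [0, 0, 1, 0, 1]

print(hamming(prot_a, prot_b), jaccard(prot_a, prot_b))
print(rand_index([0, 0, 1], [0, 1, 1]))
```

In practice a distance matrix built this way would be handed to a clustering routine (Ward, PAM and so on), and the resulting labels compared with the reference partition.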

1. D. Barker, M. Pagel (2005) Predicting functional gene links from phylogenetic-statistical analyses of whole genomes, PLoS Comput Biol, 1:e3.
2. J. Wu et al. (2003) Identification of functional links between genes using phylogenetic profiles, Bioinformatics, 19:1524-1530.
3. G.V. Glazko, A.R. Mushegian (2004) Detection of evolutionarily stable fragments of cellular pathways by hierarchical clustering of phyletic patterns, Genome Biol, 5:R32.


STUDYING NF-kB RESPONSE TO CELLULAR SIGNALS BY HIERARCHICAL MODELING
OVIDIU RADULESCU 1, VINCENT NOEL 1, ALEXANDER GORBAN 2, ALAIN LILIENBAUM 3, ANDREI ZINOVYEV 4

1 University of Rennes, France, [email protected]
2 University of Leicester, United Kingdom, [email protected]
3 CNRS UMR 7000, France, [email protected]
4 Institut Curie, France, [email protected]

NF-kB is an important node in cellular response, integrating signals from many signaling pathways and controlling hundreds of genes. Because of its bowtie position in regulation networks and its dynamically intricate activation mechanism, NF-kB has been in the spotlight of mathematical modelling for many years [1]. However, most mathematical models deal with the canonical NF-kB activation mechanism, which is the IKK-induced degradation of IkB. In simple experimental settings, the canonical pathway can be activated either by TNFa or by LPS. The shape of the response differs between these two cases: transient for TNFa and sustained for LPS. The shape is correlated with similar behavior of IKK. The sustained IKK activity under the LPS signal could be explained by a modulation of the activity of the A20 protein, which inactivates IKK, by LPS-induced autocrine production of TNFa, or by both.

Outside this simplified setting lies the immense realm of eukaryote physiology. Present models could be the starting point for more realistic studies of how NF-kB receives input from various signalling pathways and how its response can be modulated by pathway crosstalk. In particular, modifications of this complex molecule could answer the intriguing question of how different specific actions can be dispatched by the same node. The modifications that we study in this paper concern acetylation.

To understand how the NF-kB pathway responds to various cellular signals, we apply the methodology of hierarchical modeling that we developed in [2]. The aim is to start from a complex model of the NF-kB pathway, containing detailed biochemical mechanisms that communicate with external signals, and to apply model reduction systematically in order to obtain a series of smaller models, each adapted to a particular scenario of the response. We then compare the models and discriminate between them by determining a set of critical parameters responsible for each type of pathway response.

The functioning of complex networks is based on non-linearity and dynamical transitions. For a given choice of inputs, a linear system increases its response in proportion to the input intensity. A non-linear system can react in a hybrid way, with small quantitative changes punctuated by qualitative changes in the type of response. This phenomenon has various facets, such as analog-to-digital transformations (p53), hypersensitivity of signalling cascades (MAPK), dynamic transitions in carbohydrate metabolism (glycolysis to PPP), and molecular switches (stem cells). Dynamically, these transitions can have very diverse characterizations: piecewise-linear dynamics, intricate invariant manifolds (crazy quilt), and various types of bifurcations (saddle-node, Hopf). We demonstrate that, by exploiting various non-linear mechanisms, signalling networks could respond specifically not only to the type of input but also, for the same type of input, to various ranges of the stimulus intensity.

1. Cheong R., Hoffmann A., Levchenko A. (2008) Understanding NF-kB signaling via mathematical modeling, Molecular Systems Biology 4:192.
2. Radulescu O., Gorban A., Zinovyev A., Lilienbaum A. (2008) Robust simplifications of multiscale biochemical networks, BMC Systems Biology 2:86.


ANALYSIS OF INHIBITOR FOR BREAST CANCER CAUSING GPR30 PROTEIN
KARTHIKA RAGHAVAN 1, NITHYA PALANIAPPAN 1, DIVYA RAMKUMAR 1

1 SRM University, India, [email protected]

In this paper, protein sequences of GPR30 (G protein-coupled estrogen receptor 1, GPER, a transmembrane protein expressed in most breast cancer cells) are analyzed to establish a structure-sequence correlation, and the structure is then derived using threading tools. This structure is validated and interpreted using structure analysis tools and docking studies. Molecular dynamics simulated annealing is applied to confirm a probable receptor binding site on a cyclic peptide (with the GPR30 receptor) that inhibits estrogen-stimulated proliferation of breast cancer. The hydrophilic cyclopeptide EMTOVNOGQ (O = 4-hydroxyproline), derived from alpha-fetoprotein, is an inhibitor of estrogen-stimulated proliferation of human breast cancer. A conformational analysis of the cyclopeptide aided in choosing the final structure, which was docked on several sites of the predicted 3-dimensional GPR30 model. The interaction of the cyclopeptide with the GPR30 transmembrane protein helped to understand the molecular dynamics of the protein complex. Protein-protein interaction methods were used to predict mutants of the cyclopeptide and their interaction with GPR30. This made it possible to propose a stable structure of the peptide (with or without mutations) to be used as a probable drug in the future.

PREDICTING BINDING SITES OF IONS IN PROTEIN STRUCTURES
SERGEI RAHMANOV 1, IVAN KULAKOVSKY 1, VSEVOLOD MAKEEV 1

Keywords: protein interaction potentials ions

1 GosNIIGenetika Research Institute, Moscow, Russian Federation, [email protected]

Metal ions in proteins play a number of very important roles, including catalysis, electron transfer, signaling and regulation, and storage and transfer. A detailed quantitative model for the interaction of protein atoms of different types with various intracellular ions, capable of predicting ion binding sites in proteins and their specificities, remains a challenge for modern computational molecular biology. We developed knowledge-based distance-dependent potentials for the interaction of protein atoms with metal ions, based on experimental data on the positions of various ions in 2114 3D structures of proteins. A non-redundant subset of the Protein Data Bank (PDB) was used as the training data set; it comprises structures of overall high quality and limited sequence similarity [1], to prevent possible bias due to the prevalence of certain families such as hemoglobins. The Monte Carlo reference state [2] was used to calculate the expected densities of atom contacts, resulting in highly detailed potentials covering the full range of contact distances.

Although relatively few protein 3D structures have metal ions incorporated into them, we were able to obtain statistical, or knowledge-based, potentials for interaction with protein atoms for a number of metal ions such as calcium, zinc and magnesium. For bivalent cations, certain oxygen atoms in the side chains of aspartate and glutamate that carry a partial negative charge demonstrated a high affinity for forming close (non-covalent) contacts with these ions, as reflected in the number of contacts between them observed in the training set of structures. The preferred contact distances were highly specific for each ion and each protein atom type.

The potentials were tested for their ability to predict the locations of ions in protein structures. For this test, all non-protein atoms, including water molecules, were removed from the structures of proteins that contained the ion(s), and the free energy of placing the ion was estimated at all nodes of a cubic grid covering the structure with a small step (0.1-0.2 Å, depending on the structure size). The potential ion binding sites in the protein structure, defined as minima of the cumulative binding energy, were compared to the actual experimentally determined positions of the bound ions. In most cases this comparison revealed very accurate predictions, with an rmsd of 0.1-0.4 Å between the experimental and the predicted ion location. A web server dedicated to the prediction of ion binding sites and specificities in macromolecular structures was implemented at the following URL: http://line.imb.ac.ru/ion-calculator/
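The grid test described above can be sketched in a few lines of Python. This is a simplified illustration and not the authors' implementation: it assumes the conventional inverse-Boltzmann form E = -kT ln(observed/expected) for a knowledge-based potential, takes the expected contact counts as a given input instead of deriving them from the Monte Carlo reference state, and uses a single atom type. All function names, the pseudocount and the toy numbers are hypothetical.

```python
import math
from itertools import product

KT = 0.593  # kcal/mol near 298 K; only sets the energy scale in this sketch


def potential_from_counts(observed, expected, pseudocount=1e-6):
    """Per-distance-bin energy, assuming E = -kT * ln(observed / expected)."""
    return [
        -KT * math.log((o + pseudocount) / (e + pseudocount))
        for o, e in zip(observed, expected)
    ]


def site_energy(point, atoms, energies, bin_width):
    """Sum the binned potential over all protein atoms within range of `point`."""
    total = 0.0
    for atom in atoms:
        b = int(math.dist(point, atom) / bin_width)
        if b < len(energies):
            total += energies[b]
    return total


def best_grid_node(atoms, energies, bin_width, lo, hi, step):
    """Scan a cubic grid and return the node with the lowest cumulative energy."""
    n = int(round((hi - lo) / step)) + 1
    axis = [lo + i * step for i in range(n)]
    return min(
        product(axis, repeat=3),
        key=lambda p: site_energy(p, atoms, energies, bin_width),
    )


# Toy example: contacts are strongly enriched in the 2-3 Angstrom bin.
energies = potential_from_counts([0, 0, 10, 1], [1, 1, 1, 1])
atoms = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
print(best_grid_node(atoms, energies, 1.0, 0.0, 4.0, 1.0))
```

A realistic version would keep one potential per (ion, atom type) pair and use the 0.1-0.2 Å grid step quoted in the abstract; the coarse 1 Å step here only keeps the toy scan small.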

This work was supported by the Russian Foundation for Fundamental Research (RFFI), research grant #08-04-01383, whose contribution is gratefully acknowledged.

1. Hobohm U., Sander C. (1994) Enlarged representative set of protein structures. Protein Sci, 3(3):522-524.
2. Rahmanov S.V., Makeev V.J. (2007) Atomic hydration potentials using a Monte Carlo Reference State (MCRS) for protein solvation modeling. BMC Struct Biol 7:19.


DELETERIOUS AND COMPENSATORY MUTATIONS IN PROTEINS
OLGA KALININA 1, ANASTASYA ANASHKINA 2, ALEXANDRA MIRINA 3, VASILY RAMENSKY 2

Keywords: compensatory mutation, protein evolution, protein structure

1 European Molecular Biology Laboratory, Germany, [email protected]
2 Engelhardt Institute of Molecular Biology of Russian Academy of Sciences, Russian Federation, [email protected]
3 Moscow State Lomonosov University, Russian Federation, [email protected]

Compensatory amino acid substitutions in proteins (also known as suppressor mutations) mask the deleterious effects of another mutation and constitute an important mechanism for the adaptation and evolution of most organisms. In particular, resistance to antibiotics, antivirals and antifungals is usually associated with a fitness cost. The initial fitness costs conferred by resistance mutations (or other deleterious mutations) can be reduced by compensatory substitutions. We present a literature-derived set of 77 experimentally verified deleterious amino acid replacements and the corresponding 348 compensatory substitutions, observed in 36 proteins with source organisms ranging from human to E.coli. The sequence- and structure-related properties of the mutants and their suppressors are discussed.

POSITIVE SELECTION AND ALTERNATIVE SPLICING OF HUMAN GENES
VASILY RAMENSKY 1, R. NURTDINOV 2, A. NEVEROV 3, A. MIRONOV 4, MIKHAIL GELFAND 5

1 Engelhardt Institute of Molecular Biology of Russian Academy of Sciences, Moscow, 119991, Russia, [email protected]
2 Moscow State University, Moscow, 119992, Russia
3 State Scientific Center GosNIIGenetika, Moscow, 117545, Russia
4 Central Research Institute of Epidemiology, Moscow, 111123, Russia
5 Institute for Information Transmission Problems (Kharkevich Institute) of Russian Academy of Sciences, Moscow, 127994, Russia, [email protected]

Alternative splicing (AS) of genes is the processing of an RNA transcript into various mRNA molecules (and then proteins) by including some exons and excluding others. It is both a source of proteome complexity [1,2] and an important mechanism of accelerated genome evolution [3,4]. The analysis of polymorphism and human-chimpanzee divergence in 52,151 constitutive and 14,196 alternative exons in 6,671 human genes shows that alternatively spliced exons experience lower selective pressure at the amino acid level, accompanied by selection against synonymous sequence variation. In addition, the results of the McDonald-Kreitman test [5] suggest that, unlike the regions constitutively included in the mRNA, the alternatively spliced ones also experience positive selection, with up to 27% of amino acids fixed by positive selection [7].

A common measure of the degree of evolutionary constraint on a sequence is the ratio Ka/Ks of nonsynonymous substitutions per nonsynonymous site (Ka) to synonymous substitutions per synonymous site (Ks). Higher values of the ratio indicate weaker negative selection acting on a sequence. Depending on the number of protein, mRNA and EST sequences that cover the corresponding region, all exons in the sample were divided into three major classes: minor alternative, major alternative and constitutive. The Ka/Ks ratios for SNPs and divergence are very close for constitutive and major alternative regions but differ more than two-fold from those for functional minor alternative regions, confirming that weaker negative selection against amino acid substitutions is especially characteristic of these fragments.

In the McDonald-Kreitman test [5], a statistically significant excess of non-synonymous substitutions relative to polymorphisms (Da/Ds > Pa/Ps) implies positive selection that drives the fixation of advantageous mutations. The test can also estimate the fraction of fixed amino acid substitutions driven by positive selection. In our AS sample this value equals 0.27 in minor alternatives, suggesting that they experience positive selection, unlike the major alternatives and constitutive exons, which evolve under purifying selection. The synonymous divergence and polymorphism rates are slightly lower in minor AS exons. This observation can be explained by RNA-level selection acting on synonymous sites of alternatively spliced exons in order to maintain splicing regulation motifs, in particular ESEs (exonic splicing enhancers) [6].
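The two statistics used in this abstract can be written down compactly. The sketch below computes the Ka/Ks ratio from raw counts and the fraction of adaptive substitutions using the estimator commonly paired with the McDonald-Kreitman test, alpha = 1 - (Ds · Pa)/(Da · Ps). The abstract does not state which estimator the authors used, so that formula is an assumption here; the function names and the counts in the example are hypothetical and are not taken from the study. No correction for multiple substitutions is applied.

```python
def ka_ks(nonsyn_subs, nonsyn_sites, syn_subs, syn_sites):
    """Ratio of nonsynonymous to synonymous substitutions per respective site."""
    ka = nonsyn_subs / nonsyn_sites
    ks = syn_subs / syn_sites
    return ka / ks


def mk_excess(Da, Ds, Pa, Ps):
    """True when Da/Ds > Pa/Ps, i.e. divergence shows an excess of
    nonsynonymous changes relative to polymorphism."""
    return Da * Ps > Pa * Ds


def mk_alpha(Da, Ds, Pa, Ps):
    """Estimated fraction of amino acid substitutions fixed by positive
    selection, assuming alpha = 1 - (Ds * Pa) / (Da * Ps)."""
    return 1.0 - (Ds * Pa) / (Da * Ps)


# Toy counts (hypothetical): Da, Ds are fixed differences, Pa, Ps polymorphisms.
print(ka_ks(10, 1000, 20, 500))
print(mk_excess(60, 100, 30, 100), mk_alpha(60, 100, 30, 100))
```

A real analysis would also test the 2x2 table (Da, Ds, Pa, Ps) for significance, for example with Fisher's exact test, before interpreting alpha.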

1. B.R. Graveley (2001) Alternative splicing: increasing diversity in the proteomic world, Trends Genet, 17:100-107.
2. A.A. Mironov, J.W. Fickett, M.S. Gelfand (1999) Frequent alternative splicing of human genes, Genome Res, 9:1288-1293.
3. Y. Xing, C. Lee (2006) Alternative splicing and RNA selection pressure -- evolutionary consequences for eukaryotic genomes, Nat Rev Genet, 7:499-509.
4. X.H. Zhang, L.A. Chasin (2006) Comparison of multiple vertebrate genomes reveals the birth and evolution of human exons, Proc Natl Acad Sci U S A, 103:13427-13432.
5. J.H. McDonald, M. Kreitman (1991) Adaptive protein evolution at the Adh locus in Drosophila, Nature, 351:652-654.
6. T.I. Orban, E. Olah (2001) Purifying selection on silent sites - a constraint from splicing regulation?, Trends Genet, 5:252-253.
7. V.E. Ramensky, R.N. Nurtdinov, A.D. Neverov, A.A. Mironov, M.S. Gelfand (2008) Positive selection in alternatively spliced exons of human genes, Am J Hum Genet, 83:94-98.


SEQWORD GENE ISLAND SNIFFER: A TOOL TO STUDY THE LATERAL GENETIC EXCHANGE AMONG BACTERIA
OLIVER BEZUIDT 1, GIPSI LIMA-MENDEZ 2, OLEG REVA 1

Keywords: mobile genetic elements, computer program, genome evolution

1 University of Pretoria, South Africa, [email protected], [email protected]
2 Université Libre de Bruxelles, Belgium, [email protected]

Identifying horizontally transferred mobile genetic elements (MGE), mapping their distribution in bacterial communities and tracing their evolutionary origins have long been among the greatest challenges in computational genomics. MGE are subject to high mutation rates and recombination, and current techniques for identifying horizontal transfer events suffer from imprecise prediction of the borders of MGE inserts within a genome. Hence our limited understanding of MGE behavior conceals the functional and evolutionary importance of environmental bacterial populations. MGE are recognized as atypical genomic entities in prokaryotic genomes that influence the dissemination of genes contributing to bacterial antibiotic resistance, diversity and virulence (1). Virulence-associated genomic elements were initially detected in human pathogenic microorganisms. These virulence determinants have since been detected in numerous environmental species, which puts mankind in jeopardy, as pathogens that harbor genes of as yet unknown function continue to emerge. Besides pathogenicity, MGE may confer other traits such as fitness, metabolic versatility, adaptability, symbiosis, commensalism and speciation (2).

We have therefore developed SeqWord Gene Island Sniffer (SWGIS), a new computational tool for the automated identification of MGE in bacterial and plasmid DNA sequences. The program is available for download at http://www.bi.up.ac.za/SeqWord/sniffer/index.html. The approach is based on the analysis of compositional biases in the genome-wide distribution of tetranucleotides, as described in our previous publication (3). However, the ability of SWGIS to identify precise insertion borders is significantly improved compared to the method used in the previously published SeqWord Genome Browser (3). Moreover, in contrast to the latter method, SWGIS allows fully automated processing of multiple genomes in a single run.

Comparison of results obtained from SWGIS with those from Prophinder, a tool that predicts prophages on the basis of gene annotation and similarity searches of conserved DNA pairs using BLASTP (4), showed consistency in many cases. However, SWGIS failed to identify short and ancient ameliorated MGE that were efficiently detected by Prophinder, whereas Prophinder missed horizontally transferred gene cassettes and truncated MGE that were precisely detected by SWGIS, likely because they do not harbor phage-specific genes. Thus, combined use of the two methods may be recommended, as it could increase the efficiency of MGE detection.

A global search for lateral inserts throughout 637 complete bacterial genomes with SWGIS retrieved 3517 putative MGE, which were grouped by sequence and gene similarities into 382 classes. The largest classes were found to comprise genes that encode polysaccharide and O-antigen biosynthesis and transport (220 MGE), outer membrane proteins (173 MGE), and ABC iron transport (19 MGE). For every identified MGE an oligonucleotide usage pattern (OUP) was calculated and searched for similarity against a reference database of OUP calculated for all completely sequenced plasmids, bacteriophages and bacterial chromosomes. This approach allowed us to identify putative origins of MGE and to trace the routes of MGE distribution along donor-recipient chains.

Acknowledgements: this work was funded by the National Bioinformatics Network of South Africa.
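The idea of detecting inserts through compositional bias in tetranucleotide usage can be illustrated with a deliberately simplified Python sketch. This is not the SWGIS algorithm, whose statistics are described in reference (3); it only compares the tetranucleotide frequencies of each sliding window with those of the whole sequence using an L1 distance. The function names, window size, step and toy sequence are hypothetical choices.

```python
from collections import Counter
from itertools import product

TETRAMERS = ["".join(t) for t in product("ACGT", repeat=4)]


def tetra_freqs(seq):
    """Relative frequency of each of the 256 tetranucleotides in `seq`."""
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[t] for t in TETRAMERS)
    return [counts[t] / total if total else 0.0 for t in TETRAMERS]


def window_distances(genome, window, step):
    """L1 distance between each window's tetranucleotide usage and the
    genome-wide usage; large values flag compositionally atypical regions."""
    ref = tetra_freqs(genome)
    out = []
    for start in range(0, len(genome) - window + 1, step):
        local = tetra_freqs(genome[start:start + window])
        out.append((start, sum(abs(a - b) for a, b in zip(local, ref))))
    return out


# Toy "genome": a repetitive host sequence with an atypical insert at 240-320.
genome = "ACGT" * 60 + "G" * 80 + "ACGT" * 60
most_atypical = max(window_distances(genome, 80, 80), key=lambda t: t[1])
print(most_atypical[0])
```

Non-overlapping windows are used here for brevity; overlapping windows with a smaller step give finer estimates of insert borders at higher cost.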

1. Juhas M. et al. (2009) Genomic islands: tools of bacterial horizontal gene transfer and evolution. FEMS Microbiol Rev, 33:376-393.
2. Dobrindt U. et al. (2004) Genomic islands in pathogenic and environmental microorganisms. Nat Rev Microbiol, 2:414-424.
3. Ganesan H. et al. (2008) The SeqWord Genome Browser: an online tool for the identification and visualization of atypical regions of bacterial genomes through oligonucleotide usage. BMC Bioinformatics, 9:333.
4. Lima-Mendez G. et al. (2008) Prophinder: a computational tool for prophage prediction in prokaryotic genomes. Bioinformatics, 24(6):863-865.


CRYPTIC TRANSCRIPTS REGULATED DURING THE YEAST METABOLIC CYCLE
MALGORZATA ROWICKA 1, ANDRZEJ KUDLICKI 1, BENJAMIN TU 2

Keywords: cryptic transcripts, gene expression, periodicity, function prediction, yeast

1 University of Texas Medical Branch at Galveston, United States, [email protected], [email protected]
2 University of Texas Medical Center, United States, [email protected]

Cryptic transcripts were recently found to be pervasive in eukaryotes, but their roles mostly remain an enigma. We have identified ~200 novel cryptic transcripts in S. cerevisiae that are significantly upregulated within precise temporal windows during the yeast metabolic cycle. Their tight regulation suggests that they may be functional, and their expression times provide clues as to which biological processes they may regulate. We will present a computational analysis and prediction of their function, which will be validated experimentally in the future.

AN AVERAGE NUMBER OF SUFFIX-PREFIXES
M. REGNIER 1, E. FURLETOVA 2, M. ROYTBERG 2

1 INRIA, 78153 Le Chesnay, France, [email protected]
2 Institute of Mathematical Problems of Biology, 4, Institutskaja str., 142290, Pushchino, Moscow Region, Russia, [email protected], [email protected]

Motivation and Aim: Consider a pattern, i.e. a set of words, H. A word w is a suffix-prefix (an overlap) for H iff ∃ h1, h2 ∈ H such that h1 ≠ h2, w is a proper prefix of h1 and w is a proper suffix of h2. The set of all suffix-prefixes for a pattern H is called the overlap set OV(H). For a pattern consisting of a single word h, the overlap set is the autocorrelation set [1]. We are interested in estimating the average number of suffix-prefixes over all patterns generated according to a Bernoulli model. This problem appears in computational biology and in combinatorics on words. In particular, we have proposed [2] an algorithm for computing the probability of exactly p occurrences of words from a pattern H in a given text of length n. The time complexity of this algorithm is O(n · p · (|OV(H)| + |H|)), where OV(H) is the overlap set. To estimate the complexity of this algorithm we have to estimate the overlap set size |OV(H)|.

Hypothesis and main results: Our aim is to prove the following hypothesis.

Hypothesis. Consider patterns H consisting of r words of length m over an alphabet V, and suppose that the patterns are generated according to the Bernoulli model. Then the average number S of suffix-prefixes of the pattern is C · r, where C is a constant that does not depend on the word length but depends on the probability distribution.

We have already proved this for patterns with the uniform distribution (the Bernoulli distribution in which all letters have the same probability 0.25). For a general (biased) Bernoulli distribution we have proved that S = C · r^α, where C does not depend on the word length m, and α > 1 also does not depend on m.

Experiments and results: To verify our hypothesis we performed experiments. We generated random sets of r words of the same length m over the alphabet V = {A, C, G, T}, distributed according to the Bernoulli model. Let S be the average number of suffix-prefixes. The experiments show that for the uniform distribution (letter probabilities 0.25, 0.25, 0.25, 0.25) C ≈ 1 and S ≈ r. These results are shown in Fig. 1. We also considered the Bernoulli distribution {0.1, 0.1, 0.1, 0.7}. For this distribution S ≈ 3.1 · r (see Fig. 2).
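The overlap set and the kind of simulation described above can be sketched in Python as follows. This is an illustration and not the authors' code: the function names, the number of trials and the seed are hypothetical. The `distinct` flag follows the stated requirement h1 ≠ h2; setting it to False also admits h1 = h2, which reproduces the autocorrelation case of a single word. The empty word is not counted, which is an assumption of this sketch.

```python
import random
from collections import defaultdict


def overlap_set(pattern, distinct=True):
    """Words that are a proper prefix of some h1 and a proper suffix of some h2 in H."""
    has_prefix = defaultdict(set)  # w -> words having w as a proper prefix
    has_suffix = defaultdict(set)  # w -> words having w as a proper suffix
    for h in set(pattern):
        for k in range(1, len(h)):
            has_prefix[h[:k]].add(h)
            has_suffix[h[-k:]].add(h)
    result = set()
    for w in has_prefix.keys() & has_suffix.keys():
        # With distinct=True at least two different words must be involved.
        if not distinct or len(has_prefix[w] | has_suffix[w]) >= 2:
            result.add(w)
    return result


def average_overlaps(r, m, probs, trials=20, seed=0):
    """Monte Carlo estimate of S for r words of length m under a Bernoulli model.
    Duplicate random words collapse, so a pattern may hold fewer than r words."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        pattern = {"".join(rng.choices("ACGT", weights=probs, k=m)) for _ in range(r)}
        total += len(overlap_set(pattern))
    return total / trials


print(sorted(overlap_set({"ACG", "GAC"})))
print(average_overlaps(100, 10, [0.25, 0.25, 0.25, 0.25]))
```

Running `average_overlaps` for several values of r with a fixed m gives the kind of S versus r curve that the figures summarize; the estimates depend on the number of trials and on how duplicate words are handled.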

Fig. 1. Uniform distribution: average number of suffix-prefixes S versus word length m, for r = 100, r = 500 and r = 1000.

Fig. 2. Distribution {0.1, 0.1, 0.1, 0.7}: average number of suffix-prefixes S versus number of words r, for m = 10.

Acknowledgments: The work was supported by grants 08-01-92496-NCNIL-a and 09-04-01053 from RFBR (Russia), Intas grant 05-1000008-8028 and the Migec-Inria associate team.

1. E. Rivals (2006). Autocorrelation of Strings. Encyclopedia of Integer Sequences.
2. M. Regnier, Z. Kirakosyan, E. Furletova, and M. Roytberg. A Word Counting Graph (accepted to “LONDON ALGORITHMICS 2008: THEORY AND PRACTICE”)


COMPARISON OF STRUCTURE-BASED AND COVARIANCE-BASED SECONDARY STRUCTURES OF 23S RNA
D.N. IVANKOV 1, M.A. ROYTBERG 2

1 Institute of Protein Research, Pushchino, Moscow Region, Institutskaya, 4, Russia, 142290, [email protected]
2 Institute of Mathematical Problems in Biology, Pushchino, Moscow Region, Institutskaya, 4, Russia, 142290, [email protected]

Motivation and Aim: The study of the 2D structure of 23S RNA started about 25 years ago [1, 2] and was mainly based on covariance analysis of multiple sequence alignments [3]; the statistical methods were supported by experiments checking whether a given RNA region is single-stranded. More recently, crystallographic data on 23S RNA became available, which gives another way to determine the 2D structure of 23S RNA. The aim of our study is a careful comparison of the 2D structures obtained by these two methods. This may reveal possible errors in the data and lead to a better understanding of the interrelations between the structure and evolution of 23S RNA. Here we present very preliminary results of the study.

Methods: All results below relate to E.coli. The covariance-based structure was obtained from the Gutell lab site [4]. The current version of PDB has 23 entries with E.coli 23S RNA; the study relied on the 2qao entry, which has the best available resolution (3.21 Å). Only 17 base pairings of the entry are not supported by the 5 other best-resolution PDB entries with E.coli 23S RNA. The H-bonds were identified with the 3DNA program [5].

Results:

1. The 3D-structure-based secondary structure contains significantly more base pairings (1145 vs. 869). All but 16 of the additional 3D-supported base pairings are not Watson-Crick or G-U pairs. About half of the additional base pairings are A-A and G-A pairs (cf. [6]). Only ~15% of all 3D-supported G-A pairs and ~5% of A-A pairs belong to the covariance-based structure (cf. ~95% for G-C pairs).

2. Only 50 of the 869 base pairs suggested by the covariance-based model are not supported by the 3D structure. These include 3 helices (879-885:891-898; 2120-2124:2174-2178; 2127-2129:2159-2161) and 35 individual base pairings. The helices belong to two sequence intervals (878-889 and 2111-2178) where the 3D structure shows an abnormally low number of base pairings. This may indicate an error in the crystallographic data; however, the corresponding regions in archaea (PDB entry 1vq8; regions 970-998 and 2137-2238) show the same effect. Among the other 35 base pairings, 15 cases can be explained by inaccuracy in crystallography, e.g. if a base pairing is added at the end of a helix. The remaining 20 base pairings reflect significant rearrangements, e.g. the terminal base pair of a helix in the covariance-based structure may correspond to a pseudoknot, etc. These cases are the subject of further investigation.

3. The 3D-based structure contains 20 “singular” base pairings connecting different sections of the standard 2D structure; a pair (i, j) is singular if both (i-1, j+1) and (i+1, j-1) are not paired. Only two of them are supported by the covariance model. However, all but one of them, according to 3DNA data, take part in non-standard stacking interactions; typically one member of the pair is adjacent to a “standard” helix. As conjectured by A. Karyagina [7], these singular “links” may help to set the desired orientation of one section (or part of it) relative to another.

All the differences between the 2D structures pointed out above are the subject of further research.
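The definition of a singular base pairing given in result 3 translates directly into code. The sketch below is a minimal illustration with a hypothetical function name and toy pair list; it checks only the local condition on (i-1, j+1) and (i+1, j-1) and does not test whether the pair connects different sections of the standard 2D structure.

```python
def singular_pairs(pairs):
    """Return base pairs (i, j) for which neither (i-1, j+1) nor (i+1, j-1) is paired."""
    # Store every pair with the smaller index first so lookups are unambiguous.
    normalized = {(min(i, j), max(i, j)) for i, j in pairs}
    return sorted(
        (i, j)
        for i, j in normalized
        if (i - 1, j + 1) not in normalized and (i + 1, j - 1) not in normalized
    )


# Toy structure (hypothetical): a three-pair helix plus one isolated pairing.
pairs = [(1, 20), (2, 19), (3, 18), (7, 40)]
print(singular_pairs(pairs))
```

Applied to the base pairs reported by a structure annotation program, the same filter would list the candidates for the “singular links” discussed above.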

Acknowledgments: We thank A. Finkelstein, A. Karyagina, S. Spirin and A. Mironov for fruitful discussions and suggestions. The work was supported by grants 07-04-00388, 08-01-92496-NCNIL-a and 09-04-01053 from RFBR (Russia); the “Scientific schools” grant from the President of RF; a grant from the Molecular and cellular biology program (RAS); and a grant from the Howard Hughes Medical Institute (USA).

1. H.F. Noller et al. (1981) Nucleic Acids Res. 9:6167-89.
2. R.R. Gutell et al. (1994) Microbiol. Rev. 58:10-26.
3. S. Griffiths-Jones et al. (2005) Nucleic Acids Res. 33:D121-D124.
4. http://www.rna.ccbb.utexas.edu/
5. X.-J. Lu and W.K. Olson (2003) Nucleic Acids Research 31:5108-5121.
6. T. Elgavish et al. (2001) Journal of Molecular Biology 310:735-753.
7. A. Karyagina. Personal communication.


SEARCH FOR NEW GENES OF D.VIRILIS AND D.MOJAVENSIS
T.V. ASTAHOVA 1, N.S. BOGATYREVA 2, M.A. ROYTBERG 3

1 Institute of Mathematical Problems in Biology, Pushchino, Moscow Region, Institutskaya, 4, Russian Federation, [email protected]
2 Institute of Protein Research, Pushchino, Moscow Region, Institutskaya, 4, Russia, 142290, [email protected]
3 Institute of Mathematical Problems in Biology, Pushchino, Moscow Region, Institutskaya, 4, Russia, 142290, [email protected]

Motivation and Aim: Computer-based gene annotation of Drosophila genomes mostly relies on two ideas: (1) de novo annotation and (2) search for genes homologous to those of Drosophila melanogaster. This may lead to the loss of genes in species that are far from D. melanogaster, e.g. D. virilis and D. mojavensis. These two species, in turn, are close enough to each other that reliable genome alignments can be built and used for gene recognition. Our aim was to study this approach and find out whether it can help to recognize additional genes compared to those predicted by the standard methods.

Methods: First, using the OWEN [1, 2] program we constructed pairwise alignments of the main scaffolds of D. virilis and D. mojavensis that are syntenic to chromosome 2L of D. melanogaster. Namely, the following scaffolds were used: (1) D. virilis: s12963 (20.2 Mbp); (2) D. mojavensis: s6500 (32 Mbp). Then, using the ALEX [3] program we identified the “exon-like regions” within the alignments and, finally, the putative exons within the regions. The basic idea of the method is the difference in mutation patterns between coding and non-coding similarities. The predictions were compared with the annotations made by other programs, see web sites [4, 5].

Results: Our predictions mainly coincide with the predictions from [4, 5]. We identified 1467 exon-like regions, each containing from 1 to 3 exons. Among them we found 134 new exons in D. virilis and 122 new exons in D. mojavensis; all these exons have a length of at least 60 and a Ks/Kn ratio greater than 1.0. The number of homologous exon pairs that are new both for D. virilis and for D. mojavensis is 105. The new exons with Ks/Kn ≥ 3 were checked for homology with known proteins; such proteins were found for 6 of the 43 checked exons. However, the high value of the Ks/Kn ratio allows one to assume that the predicted exons belong to genes that are actually transcribed.

Acknowledgments: The work was supported by grants 07-04-00388, 08-01-92496-NCNIL-a and 09-04-01053 from RFBR (Russia); the “Scientific schools” grant from the President of RF; and grants from the Molecular and cellular biology program (RAS) and the Howard Hughes Medical Institute (USA).

1. M.A. Roytberg, A.Yu. Ogurtsov, S.A. Shabalina, A.S. Kondrashov (2002) A hierarchical approach to aligning collinear regions of genomes. Bioinformatics, 18: 1673–1680. 2. A.Yu. Ogurtsov, M.A. Roytberg, S.A. Shabalina, A.S. Kondrashov (2002) OWEN: aligning long collinear regions of genomes. Bioinformatics, 18: 1703–1704. 3. T.V. Astakhova, S.V. Petrova, I.I. Tsitovich, M.A. Roytberg (2006) Recognition of coding regions in genome alignment. In: Bioinformatics of Genome Regulation and Structure II, N. Kolchanov and R. Hofestaedt (Eds.), 3–10 (Springer Science+Business Media, Inc). 4. ftp://ftp.flybase.net/genomes/dvir/dvir_r1.2_FB2008_07/fasta/dvir-all-predicted-r1.2.fasta.gz 5. ftp://ftp.flybase.net/genomes/Drosophila_mojavensis/dmoj_r1.3_FB2008_07/fasta/dmoj-all-predicted-r1.3.fasta.gz


COMPARATIVE GENOMICS OF THE FATTY ACIDS BIOSYNTHESIS IN GAMMA-PROTEOBACTERIA NATALIYA S. SADOVSKAYA 1

The comparative approach is one of the main techniques for analyzing metabolic pathways in completely sequenced bacterial genomes. It is useful both for the analysis of experimentally studied systems of co-regulated genes and for genes with unknown regulation. Fatty acid biosynthesis is a vital component of metabolism in all living cells; hence, its pathways are a good target for modern antibacterial drugs [1]. In Escherichia coli, fatty acid biosynthesis is transcriptionally controlled by the FabR repressor. This protein binds to the predicted sites of the genes fabA, fabB and yqfA, and does not bind to FadR binding sites [2]. It belongs to the TetR family and recognizes an 18-bp palindromic motif with the consensus sequence AGCGTACAnGTGTTCGCT, where n is any nucleotide. The balance of saturated to unsaturated fatty acids depends on the levels of the FabA and FabB enzymes. In this study we used comparative genomic analysis of regulatory sites to describe the FabR regulon and to reconstruct the fatty acid biosynthesis pathways in gamma-proteobacteria. This is joint work with M. Gelfand.
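As an illustration of how candidate sites for such a consensus might be located, the following minimal sketch scans a sequence for windows that differ from the 18-bp consensus in at most a few positions. It is not the method used in the study (comparative analyses of this kind typically rely on positional weight matrices); the function names, the mismatch threshold and the demo sequence are assumptions made for the example, and only one strand is scanned.

```python
# Consensus from the abstract; "N" stands for the position where any nucleotide is allowed.
CONSENSUS = "AGCGTACANGTGTTCGCT"


def count_mismatches(window, consensus=CONSENSUS):
    """Count positions where the window differs from the consensus (N matches anything)."""
    return sum(1 for w, c in zip(window, consensus) if c != "N" and w != c)


def find_candidate_sites(sequence, max_mismatches=3, consensus=CONSENSUS):
    """Return (start, site, mismatches) for every window within the mismatch limit.

    max_mismatches is an arbitrary illustrative threshold, not a value from the study.
    """
    sequence = sequence.upper()
    width = len(consensus)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        mismatches = count_mismatches(window, consensus)
        if mismatches <= max_mismatches:
            hits.append((start, window, mismatches))
    return hits


if __name__ == "__main__":
    # Made-up sequence with one exact match to the consensus embedded in it.
    demo = "TTTT" + "AGCGTACATGTGTTCGCT" + "CCCC"
    for hit in find_candidate_sites(demo, max_mismatches=0):
        print(hit)
```

To cover both strands, the same function could be applied to the reverse complement of the sequence.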

1. H.T. Wright, K.A. Reynolds (2007) Antibacterial targets in fatty acid biosynthesis, Curr Opin Microbiol, 10:447-53. 2. L. McCue et al. (2001) Phylogenetic footprinting of transcription factor binding sites in proteobacterial genomes, Nucleic Acids Res, 29:774-82.

1 State Scientific Center “GosNIIGenetika”, Moscow, Russia, [email protected]

ADENOSINE DEAMINASE AND ITS ISOENZYMES IN SERUM OF PATIENTS WITH PRIMARY IMMUNODEFICIENCY DISEASES REZA SAGHIRI 1, HADI AKHBARI 2, PEGHAH POURSHARIFI 1, MINA EBRAHIMI-RAD 1, MANIJEH AHMADI 3, H. NAZEM 1, Z. POURPAK 1, M. MOIN 1, S. SHAMS 1, M. SAGHIRI 1, M. KARAMI 1

Keywords: ADA , EHNA , immunodeficiency

Adenosine deaminase (ADA) is involved in purine metabolism and plays a significant role in the functioning of the immune system. We aimed to investigate the activity of ADA and its isoenzymes in the serum of patients with various primary immunodeficiency (PID) syndromes. The sera of 76 children with various PID syndromes and of 30 healthy control subjects were examined. The activities of total ADA (tADA) and its isoenzymes were determined using the kinetic method described by Ellis. ADA2 activity was measured in the presence of a specific ADA1 inhibitor, erythro-9-(2-hydroxy-3-nonyl)adenine (EHNA). Our results indicated that tADA and ADA2 levels were higher in patients with chronic granulomatous disease (CGD), leukocyte adhesion deficiency (LAD), hyper-IgM syndrome (HIM) and Wiskott-Aldrich syndrome (WAS) than in the corresponding controls (P<0.01). There was a significant elevation of tADA and ADA1 activities in IgA-deficient patients as compared to healthy individuals (P<0.01). Our results suggest that altered ADA activity may be associated with altered immunity. Therefore, the serum ADA level could be used as an indicator, along with other parameters, in the diagnosis and follow-up of patients with CGD, LAD, IgA deficiency, HIM and WAS.

1 Institute Pasteur of Iran - Dept, Biochemistry, Iran, [email protected] 2 Birjand Medical University, Iran 3 Institute Pasteur of Iran - Dept, cellular of bank, Iran

POLYALLELIC SNPS IN POPULATION OF DROSOPHILA MELANOGASTER VLADIMIR SEPLARSKIJ 1, GEORGII BAZYKIN 2

Keywords: population genetics, polyallelic SNP, transitions, transversions

Under standard population genetics models, neutral derived alleles usually segregate at low frequencies and are short-lived. Since the rate of mutation is low, the chance of a secondary mutation hitting a derived allele during its lifetime in a population is generally assumed to be negligible. In 50 individuals of D. melanogaster, we analyze the polyallelic SNPs, i.e. the positions in genome alignments at which more than two alleles segregate in the population. Tri-allelic and tetra-allelic SNPs occur at higher rates than expected from the frequencies of the corresponding bi-allelic SNPs. Moreover, tri-allelic SNPs in which both derived alleles are separated from the ancestral (D. yakuba) allele by a transversion are overrepresented significantly more than those in which one derived allele is separated from the ancestral allele by a transversion and the other by a transition. Therefore, we observe an excess of pairs of derived alleles linked to each other by a transition. This implies that the second derived allele of a polyallelic SNP frequently arises through a mutation in the first derived allele, not in the ancestral allele as conventionally believed. These results have implications for our understanding of the fate of a SNP in a population.
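The link between the two observations can be made concrete with a small sketch that classifies the three pairwise relations at a tri-allelic site. When both derived alleles differ from the ancestral allele by a transversion, they necessarily belong to the same chemical class (both purines or both pyrimidines) and are therefore linked to each other by a transition. The function names below are illustrative assumptions, not part of the authors' pipeline.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
NUCLEOTIDES = PURINES | PYRIMIDINES


def substitution_type(a, b):
    """Classify the difference between two distinct nucleotides."""
    a, b = a.upper(), b.upper()
    if a not in NUCLEOTIDES or b not in NUCLEOTIDES or a == b:
        raise ValueError("expected two different nucleotides from A, C, G, T")
    # Same chemical class (purine/purine or pyrimidine/pyrimidine) means a transition.
    if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
        return "transition"
    return "transversion"


def classify_triallelic(ancestral, derived1, derived2):
    """Describe a tri-allelic site by its three pairwise relations."""
    return {
        "ancestral-derived1": substitution_type(ancestral, derived1),
        "ancestral-derived2": substitution_type(ancestral, derived2),
        "derived1-derived2": substitution_type(derived1, derived2),
    }


if __name__ == "__main__":
    # Ancestral A with derived C and T: two transversions from the ancestral allele,
    # while the derived alleles themselves differ by a transition.
    print(classify_triallelic("A", "C", "T"))
```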

1 MSU FBB, Russian Federation, [email protected] 2 IITP RAS, Russian Federation, [email protected]

STATISTICAL ANALYSIS OF HIV-1 PROTEIN MUTATIONS AND ASSOCIATION WITH ANTIRETROVIRAL THERAPY R.S. SERGEEV 1, A.V. TUZIKOV 2, V.F. EREMIN 3

Keywords: Protein sequences, phylogeny, HIV-1, antiretroviral therapy, drug resistance

Motivation and Aim: Drug resistance is a major obstacle to the effective treatment of human immunodeficiency virus type 1 (HIV-1) infection. Although a number of antiretroviral drugs have been approved for the treatment of HIV-1, cross-resistance within each of the three antiretroviral drug classes (nucleoside reverse transcriptase (RT) inhibitors, non-nucleoside RT inhibitors and protease inhibitors) often leads to the development of multidrug resistance [1]. This process is accompanied by intensive mutation of the viral protease and reverse transcriptase (HIV pol gene), but only a subset of treatment-associated mutations is responsible for the establishment of drug resistance. Most of the published data on resistance to protease and RT inhibitors cover mutations in HIV-1 Subtype B isolates, whereas the majority of infection cases in Belarus involve HIV-1 Subtype A (Svetlogorsk variant). The inconsistency of the available data on drug-resistance mutations in HIV-1 Subtype A strains hinders the development of effective treatment regimens. The aims of this work are to: 1) analyze HIV-1 Subtype A mutations from treated and untreated persons; 2) develop reliable statistical algorithms that recognize inhibitor-associated mutations and clusters of correlated mutations and analyze their contribution to the development of drug resistance; 3) identify mutations in Subtype B and Subtype A HIV-1 strains that lead to similar effects; 4) develop software to recognize and analyze HIV-1 Subtype A mutations.

Methods and Algorithms: Data on every person are obtained in the form of aligned HIV-1 primary protein (pol gene) sequences reflecting the dynamics of mutation development in connection with inhibitor treatment. Different statistical methods, including contingency tables, correlation analyses, multiple hypothesis testing [3] and some exhaustive search techniques, were used to analyze the association of mutations with antiretroviral therapy and its subsequent effect. Phylogenetic analysis [2] was used to retrieve additional information for recognizing correlated mutations.

Results: Several mathematical models describing mutations in HIV-1 primary protein sequences have been constructed. To test the correctness of the method, published HIV-1 Subtype B sequences from the Los Alamos HIV Database (http://www.hiv.lanl.gov/content/) and the Stanford HIV Drug Resistance Database (http://hivdb.stanford.edu/) were used. Patients' histories and conclusions on the provided data were taken into consideration. Collection of HIV-1 data from patients in Belarus is in progress, and a sufficient data set is expected to be collected by the end of the year. Finally, we expect that the results of this work will allow us to develop recommendations for effective antiretroviral therapy and, possibly, to change treatment regimens to prevent the establishment of drug resistance.

Conclusion: Treatment-associated mutations differ in their effect on drug-resistance development. Some of them may serve as specific markers of HIV-1 drug resistance or have a synergistic effect when clustered as correlated mutations.

1 Belarusian State University, Belarus, [email protected] 2 United Institute of Informatics Problems of the National Academy of Sciences of Belarus, [email protected] 3 Research Institute for Epidemiology and Microbiology, Belarus
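The multiple hypothesis testing step cited as [3] refers to the Benjamini-Hochberg procedure for controlling the false discovery rate. A minimal sketch of that procedure is given below, under the assumption that one p-value has already been obtained per candidate mutation (for example, from a contingency-table test of treated versus untreated sequences). The function name and the example p-values are hypothetical and do not come from the study.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR level alpha.

    Step-up rule: find the largest rank k (1-based, p-values sorted ascending)
    with p_(k) <= k / m * alpha, then reject the hypotheses with ranks 1..k.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * alpha:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[index] = True
    return rejected


if __name__ == "__main__":
    # Hypothetical p-values, one per candidate mutation.
    example = [0.039, 0.001, 0.2, 0.008]
    print(benjamini_hochberg(example))
```

Because the rule is step-up, a p-value that exceeds its own threshold can still be rejected when a larger p-value further down the sorted list meets its threshold.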

1. S.-Y. Rhee, W.J. Fessel et al. (2005) HIV-1 Protease and Reverse Transcriptase Mutations: Correlations with Antiretroviral Therapy in Subtype B Isolates and Implications for Drug-Resistance Surveillance, The Journal of Infectious Diseases, 192:456-65. 2. A. Abbas, S. Holmes. (2004) Bioinformatics and Management Science: Some Common Tools and Techniques. Operation Research, Vol.52, No.2, March-April 2004, 165-190. 3. Y. Benjamini and Y. Hochberg. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B, 1995, 57:289–300.


IN SILICO AND IN VIVO ANALYSIS OF FUNCTIONS OF SOME OF THE CHROMOSOMAL REGIONS. ANNA N. SHABARINA 1, M.V. GLAZKOV 1

Keywords: computer analysis, nuclear organization, gene expression, position effects.

A key question of molecular biology concerns the mechanisms of normal genome functioning. The DNA sequences that attach chromosomes to the nuclear envelope are thought to participate in this process. We performed a computer analysis of one of the fragments from our collection of nuclear DNAs, EnvM4, to examine its capacity to influence gene expression. The sequence is 300 bp long, AT-rich and noncoding, and contains 171 topoisomerase I recognition sites. Its distinctive feature is the presence of several polypurine/polypyrimidine tracts, in particular one that is 17 bp long: AAAAAGAAAAGAAAAGA. Scanning EnvM4 with ChrClass, a computer program devised by our group, confirmed that this fragment belongs to the class of chromosomal DNA fragments isolated from the nuclear lamina [1]. We then ran a homology search for EnvM4 in the GenBank database. The long polypurine tract within this sequence shows 90-100% homology with almost all of the examined genomes (human, rat, mouse, drosophila, yeast, plant, bacterium), and many of them also have an individual site of homology within EnvM4. This may represent a general nucleotide motif, shared among species, that attaches DNA to the nuclear (cellular) envelope. In different organisms, the regions homologous to EnvM4 are mostly located between genes or in introns. Interestingly, a special characteristic of such sequences (polypurine tracts) is their ability to form a filament (H-form DNA) that acts as a barrier to transcription elongation by RNA polymerase II [2]. Many border elements are known to have special binding proteins connected with their function. Using Genomatix Sequence Shaper, we found that the EnvM4 sequence contains sites for more than 50 vertebrate transcription factors. It is natural to suppose that some of these factors may play a role in gene regulation.
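A simple way to locate stretches like the 17-bp one quoted above is to search for uninterrupted runs of purines (A, G) or pyrimidines (C, T). The sketch below does this with a regular expression; it is only an illustration and not the ChrClass program, and the function name, the minimum length of 10 and the flanking bases in the demo are assumptions.

```python
import re


def find_tracts(sequence, min_length=10):
    """Return (start, kind, tract) for purine-only or pyrimidine-only runs.

    min_length is an arbitrary illustrative cutoff, not a value from the study.
    """
    pattern = re.compile(r"[AG]{%d,}|[CT]{%d,}" % (min_length, min_length))
    hits = []
    for match in pattern.finditer(sequence.upper()):
        kind = "polypurine" if match.group()[0] in "AG" else "polypyrimidine"
        hits.append((match.start(), kind, match.group()))
    return hits


if __name__ == "__main__":
    # The 17-bp stretch from the abstract, flanked by made-up bases.
    demo = "CCTT" + "AAAAAGAAAAGAAAAGA" + "TCCT"
    print(find_tracts(demo))
```

A polypyrimidine run on the given strand corresponds to a polypurine run on the complementary strand, so scanning one strand for both kinds covers the duplex.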
Another part of the work was devoted to a comparative analysis of EnvM4 and boundary elements thought to be implicated in the delimitation of gene domains: MARs/SARs, LCRs and insulators. We did not find any significant homology of EnvM4 with MARs/SARs or with the scs, scs’ and Su(Hw) insulators, but noticed a similarity in their structure. All of these elements contain polypurine/polypyrimidine tracts that are capable of forming triple-stranded DNA hairpin structures, or H-form DNA. The generation of these structures is accompanied by the simultaneous formation of single-stranded DNA regions, which can bind tightly and nonspecifically to the nuclear envelope. This observation is very important in the context of structural models of insulator function. Together, these data explain how the EnvM4 DNA fragment can bind to nuclear envelopes and suggest that, like many other boundary elements, it may play a role in the establishment of gene domains.

To examine how EnvM4 influences the expression of reporter genes, we used the P-transformation method in Drosophila melanogaster. Drosophila embryos with inactivated yellow and white genes were injected with a reporter construct containing these genes and regulatory elements flanked by EnvM4. We obtained 10 lines, each showing a high level of reporter expression, which indicates that position effects were overcome. To investigate the role of EnvM4 sequences in this process, we deleted the white gene enhancer in vivo and observed a decrease in the expression of the white gene to the basal level. This means that the sequences flanking the genes do not affect gene expression itself but participate in the maintenance of a defined expression level. Since EnvM4 fragments are connected with the nuclear envelope, this might be the way they establish an independent gene domain and help to eliminate the influence of surrounding sequences.

In summary, computer analysis of the EnvM4 sequence revealed its capacity to participate in the establishment of gene domains, which was confirmed by in vivo experiments.

1 N.K. Koltsov Institute of Development Biology RAS, Moscow, Russia, [email protected]

1. G.V. Glazko et al. (2001) Comparative study and prediction of DNA fragments associated with various elements of the nuclear matrix. Biochim. Biophys. Acta. 1517(3): 351-364. 2. T. Misteli (2005) Concepts in nuclear architecture. BioEssays. 27: 477-487.

This work was supported by the Russian Academy of Sciences (Presidium Program "Biological diversity", subprogram "Gene pools and genetic diversity").


CORRELATION OF HIV-1 REV BINDING HOST FACTOR STRUCTURE AND EVOLUTION PROFILES AND THEIR IMPORTANCE IN HIV ASSOCIATED NEUROPATHOGENESIS DEEPAK SHARMA 1

Keywords: Correlation of HIV-1 Rev binding host factor structure and evolution profiles and their importance in HIV associated Neuropathogenesis

HIV-1 Rev protein, an important post-transcriptional regulator of the human immunodeficiency virus type 1 (HIV-1) life cycle, is known to be associated with neuropathogenesis. Altered Rev activity due to Rev-host protein interactions observed in brain cells has been correlated with differential HIV replication in these cells. However, the exact mechanism of this Rev-dependent pathway is still far from clear. Identifying the role of host factors in this mechanism would be of interest and could help to identify new drug targets against HIV. We aimed to analyze the patterns of protein evolution of three HIV-1 Rev-interacting proteins (believed to be associated with HIV-1 replication in brain cells) by correlating sequence-based and structure-based evolutionary analyses.

Three proteins, namely DDX1 helicase, Nucleophosmin and RBP, representing three different aspects of Rev-host protein interaction in brain cells, were selected following extensive literature analysis and information gathered from the HIV protein interaction database (NCBI). For each protein, a cluster of sequence homologues was selected from the UniRef database at both the 50% (UniRef50) and 90% (UniRef90) levels of similarity. Only the clusters with the maximum number of species from Eutheria were considered. Sequences were aligned with Clustal X2 using the standard Gonnet PAM250 matrix. Solved 3D structures matching the sequences were obtained from RCSB PDB at an expectation value <3e60. Sequence and structure information for each protein was submitted to the Evolutionary Trace (ET) server. The calculation yielded rank scores for each amino acid of the sequence, representing the correlation between sequence variations in an alignment and evolutionary divergences. In addition, sequence conservation scores were obtained (using an empirical Bayesian approach) and plotted on the PDB-derived structure using a combination of the ConSurf server and the Chimera Extensible Molecular Analysis System.

An overview of codon sequence evolution was also attempted by estimating the ratio of non-synonymous (amino acid altering) to synonymous (silent) substitutions (the Ka/Ks ratio), which is a measure of positive and purifying selection at each amino acid site. The M8 Bayesian model incorporated in the Selecton server was employed for this purpose. For a clearer picture we also used other models that assume different Ka/Ks rates at different sites of the sequence. Finally, the evolution patterns were studied for their role in protein interactions by analysing binding pockets, interaction hot spots, docking with small ligands and relative entropy measures (using a combination of software such as protein dossier, pocket binder, ASTRO fold etc.).

This study revealed a number of positively selected sites in DDX helicase (P>0.93) and a number of conserved residues in Nucleophosmin and RBP. The positively selected residues in DDX helicase are mainly at the N-terminus, but some scattered amino acids indicative of positive selection were also located elsewhere. The mostly conserved regions in the other two proteins indicate purifying selection. Our study indicates that DDX helicase contains a potential cluster of residues that can modulate host defence activities in human brain cells (and hence alter HIV-1 Rev activity) following positive Darwinian selection. The failure to detect any signatures of positive selection in Nucleophosmin and RBP could suggest that these proteins play an indirect role in antiviral strategies and may affect additional brain-specific host proteins further down the pathway they are part of.

Based on these studies we have initiated an experimental two-hybrid approach to study protein-protein interactions between Rev and host factors derived from human brain. We are also extending these evolutionary studies to other brain-specific host proteins, attempting to predict theoretical models for some brain-derived host proteins, and studying Rev-host protein interactions through docking and lab-based yeast two-hybrid approaches.

1 National AIDS Research Institute, Pune, India, [email protected]


EXPRESSION ANALYSIS OF INTRONLESS TRANSCRIPTOME OF MOUSE VIKTORIA SERZHANOVA 1, ANTON KIREEV 2, ANNA GUSKOVA 3, ANCHA BARANOVA 4, MIKHAIL SKOBLOV 5

Keywords: expression analysis of transcriptome

It is generally accepted that the average number of introns per gene increases in higher evolutionary taxa. While analyzing a database of full-size mRNAs (FANTOM), we identified a substantial proportion of transcripts represented by only one exon. These transcripts mostly correspond to non-coding mRNA and do not have orthologues in the human genome. Here we present an initial characterization of these intronless mouse transcripts and experimental data reflecting their prevalence in various murine tissues.


1 Medical Genetics Research Center, Russian Federation, [email protected] 2 Medical Genetics Research Center, Russian Federation, [email protected] 3 Medical Genetics Research Center, Russian Federation, [email protected] 4 George Mason University, United States, [email protected] 5 Medical Genetics Research Center, Russian Federation, [email protected]

STUDY OF ANTISENSE REGULATION OF HUMAN CARBONYL REDUCTASE 3 YURII CHERNOHVOSTOV 1, ANNA GUSKOVA 2, TATIYANA KAZUBSKAYA 3, ANCHA BARANOVA 4, MIKHAIL SKOBLOV 5

Keywords: antisense regulation, carbonyl reductase

Monomeric carbonyl reductases (CBRs) are enzymes that catalyze the reduction of many endogenous and xenobiotic carbonyl compounds, including steroids and prostaglandins. There are two monomeric CBR genes in the human genome, CBR1 and CBR3, which exhibit high homology in their amino acid sequences. In the CBR3 locus we found an antisense cluster. The dbEST analysis showed that the CBR3 gene is expressed predominantly in tumor tissues, whereas the antisense cluster is more represented in normal tissues. To study the antisense regulation of this locus, we carried out a detailed bioinformatics analysis with subsequent experimental validation.


1 Medical Genetics Research Center, Russian Federation, 2 Medical Genetics Research Center, Russian Federation, [email protected] 3 Blokhin Cancer Research Center, Russian Federation, [email protected] 4 George Mason University, United States, [email protected] 5 Medical Genetics Research Center, Russian Federation, [email protected]

PAAS: MACHINE LEARNING METHOD FOR CLASSIFICATION OF AMINO ACID SEQUENCES USING THE LOCAL SIMILARITY SCORES BORIS SOBOLEV 1, KIRILL ALEXANDROV 1, DMITRY FILIMONOV 1, VLADIMIR POROIKOV 1

Keywords: Functional annotation of proteins, Sequence similarity, Machine learning, Recognizing functional classes

When a new amino acid sequence is determined, it should be annotated. To solve this problem, different classification schemes are used: protein families, grouping of sequences by experimentally defined features (e.g., the substrate repertoires of enzymes) and others. The machine learning approach is now widely used for recognition of the protein class. We proposed a machine learning method (PAAS, Projections of Amino Acid Sequences) [1, 2] that uses a sequence description similar to the classical dot-matrix procedure: each sequence of the training set is compared with the annotated sequences, and local similarity scores are calculated for all amino acid positions. The novelty of the PAAS method lies in using the local similarity scores as input data for an original classifier based on the naive Bayesian approach.

The method was tested on sets representing different enzyme classes. High accuracy was shown for both protein families and subfamilies. Our program predicted EC (Enzyme Classification) classes with an accuracy superior to that of the SVMProt program and comparable with that of HMMer. The EC classification is compiled by experts. To analyze a more complicated case, we also tested the PAAS method on the superfamily of cytochromes P450. In P450s, one protein may interact with many ligands. Cytochromes interacting with the same substrate or inducer were assigned to the same class of ligand specificity. Phylogenetic clusters do not always correspond to groups of ligand specificity [3]. With the suggested method, the classes of P450 ligand specificity were recognized with lower accuracy than the non-intersecting EC classes; however, for larger groups the recognition accuracy was better.

We showed that the PAAS method makes it possible to reveal relatively short motifs in the amino acid sequences of remote homologues (bacterial and viral serine peptidases) using a training set composed of mammalian proteins. To classify the remote homologues correctly, we apply feature filtration based on statistics calculated for randomly shuffled sequences. The selected local similarity scores were used as input data for the classifier, which provided reliable classification of the remote homologues. Thus, the suggested method can be used both for predicting the protein functional class and for selecting functionally significant motifs in amino acid sequences.

1 Institute of Biomedical Chemistry of Rus. Acad. Med. Sci, Pogodinskaya Street, 10, Moscow, 119121, Russian Federation, [email protected]

This work was supported by FP7 (grant LSHB-CT-2007-037590) and the Russian Foundation for Basic Research (Grant N 09-04-01281).

1. K. Alexandrov, D. Filimonov, V. Poroikov et al. (2008) Recognition of protein function using the local similarity, J. Bioinform. Comput. Biol., 6: 709–725. 2. K.E. Alexandrov et al. Functional annotation of the amino acid sequences based on the local similarity, VOGiS Herald (Rus), 13: 114-121. 3. Yu. Borodina et al. (2003) If there exists correspondence between similarity of substrates and protein sequences in cytochrome P450 superfamily, Nova Acta Leopoldina, 87: 47-55.


GERM-BASED SPATIAL ALIGNMENT OF PROTEINS DIAN ZHEMOLDINOV 1, ANDREI ALEXEEVSKI 2,3 , SERGEI SPIRIN 2,3

Keywords: Spatial alignment, DNA-protein complex

There are many algorithms for spatial alignment of protein structures, some of which are implemented as publicly available programs ([1], [2], [3] etc.). We suggest an algorithm that aligns protein chains in two given structures according to the similarity of their positions relative to some “germ” of the alignment. The germ may be a common ligand or a conserved part of the protein. The input of the algorithm consists of: 1) two structures (i.e., files in PDB format), each including one protein chain (and, possibly, some other atoms); 2) a “germ” set of atoms in each structure; 3) an alignment (one-to-one correspondence between atoms) of the germ sets. The output is an alignment of some subsets of residues of the two protein chains. In the output alignment, each pair of aligned residues is assigned a score characterizing the similarity of the positions of the residues relative to the germ sets. In the tested examples, the proteins are transcription factors and the germ sets consist of phosphorus atoms of DNA in the sites recognized by those proteins. The algorithm consists of the following steps. First, a similarity score is defined for each pair consisting of a residue from one structure and a residue from the other. Second, the standard Smith–Waterman algorithm is applied to the resulting similarity matrix. The similarity score is calculated by the following formula:

where |I| is the size of the germ alignment, i runs over all positions of the germ alignment, $r_i^{(1)}$ and $r_i^{(2)}$ are the distances between the Cα-atom of the residue and the i-th atom of the germ set in the two structures, and $\alpha_i^{(1)}$ and $\alpha_i^{(2)}$ are the angles between the direction to the Cα-atom of the next residue and the direction to the i-th atom of the germ set. The parameters β, γ and κ are arbitrary positive real numbers, and 0 ≤ λ ≤ 1. In the Smith–Waterman procedure, the gap penalty is equal to 1 (an affine gap penalty is not used), so the parameter β plays the role of an inverse gap penalty. The parameters γ and κ regulate the strictness with respect to, respectively, the distance to the germ atoms and the direction of the compared parts of the protein chains.

The algorithm is implemented as a program called “align_by_lcs”. We also developed an algorithm and a program, “dnalign”, that automatically generates germ sets and their alignment for two structures of DNA-protein complexes. The germ set in each structure consists of five consecutive phosphorus atoms from one of the DNA chains. Among all pairs of such sets, the germ is the pair whose sets are positioned most similarly relative to the protein chains in the two structures; the similarity is estimated by comparing histograms of distances between the phosphorus atoms of a DNA segment and the Cα-atoms of the entire protein.

Both programs were tested on several pairs of structures of related DNA-protein complexes and showed reliable results. For testing, we used the following parameter values: β=0.8, γ=κ=0.15, λ=0.5. In our opinion, the “germ-based” approach to spatial alignment may have advantages in a number of problems arising in structural biology. The aligned residues are those that are positioned similarly relative to some related objects in the two structures; in addition, a measure of their similarity is available.

We plan to replace the Smith–Waterman procedure with a variant of “global-local” alignment, to create a program for multiple germ-based alignment, and to classify all available DNA-protein complexes using our approach.

The work is partly supported by RFBR grants 07-04-91560 and 08-04-91975.

1 Faculty of Bioengineering and Bioinformatics, Moscow State University, Russian Federation, [email protected] 2 Belozersky Institute of Moscow State University 3 Institute of System Studies of RAS, Moscow, Russia, [email protected]
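The second step of the algorithm described in this abstract, a Smith–Waterman local alignment over a precomputed residue-by-residue similarity matrix with a non-affine gap penalty of 1, can be sketched as follows. This is not the “align_by_lcs” program: the similarity matrix is taken as given (the scoring formula itself is not reproduced here), and the function name and the toy matrix are assumptions made for the example.

```python
def smith_waterman(similarity, gap_penalty=1.0):
    """Local alignment over a precomputed similarity matrix with a linear gap penalty.

    similarity[i][j] is the score for pairing residue i of the first chain with
    residue j of the second chain. Returns (best_score, aligned_pairs), where
    aligned_pairs is a list of (i, j) index pairs in chain order.
    """
    n = len(similarity)
    m = len(similarity[0]) if n else 0
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(
                0.0,
                H[i - 1][j - 1] + similarity[i - 1][j - 1],  # pair residues i-1 and j-1
                H[i - 1][j] - gap_penalty,                    # skip a residue of chain 1
                H[i][j - 1] - gap_penalty,                    # skip a residue of chain 2
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a zero cell is reached.
    pairs = []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + similarity[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] - gap_penalty:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs


if __name__ == "__main__":
    # Toy matrix: residue 1 of the first chain matches residue 2 of the second,
    # so the best local alignment contains one gap.
    toy = [
        [3, -2, -2, -2],
        [-2, -2, 3, -2],
        [-2, -2, -2, 3],
    ]
    print(smith_waterman(toy))
```

With the gap penalty fixed at 1, the scale of the similarity scores determines how costly a gap is relative to a match, which is consistent with the remark in the abstract that β acts as an inverse gap penalty.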

1. Holm L, Sander C. (1993) Protein structure comparison by alignment of distance matrices, J Mol Biol, 233(1):123–138 2. Holm L, Sander C. (1995) Dali: a network tool for protein structure comparison, Trends Biochem Sci. 20(11):478–480. 3. E. Krissinel, K. Henrick (2004). Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions. Acta Cryst. D60:2256–2268.


NPIDB, A DATABASE OF STRUCTURES OF NUCLEIC ACID – PROTEIN COMPLEXES DMITRY KIRSANOV 1, OLGA ZANEGINA 2, ANDREI ALEXEEVSKI 2,3, SERGEI SPIRIN 2,3, ALEXANDER GRISHIN 1, ANNA KARYAGINA 2,4

Keywords: Database, DNA-protein interaction, NPIDB, 3D structure, complexes

The resource NPIDB (Nucleic acids – Protein Interaction DataBase) includes a collection of files in the PDB format containing structural information on DNA-protein and RNA-protein complexes, and a number of online tools for analysis of the complexes. These tools are: an original program, CluD [1], for analysis of hydrophobic clusters on interfaces; a program for detecting potential hydrogen bonds and water bridges; and visualization of structures with Jmol (http://jmol.sourceforge.net/). SCOP [2] and Pfam [3] domains present in the protein chains of the structures are also detected. Structures of protein – nucleic acid complexes are extracted from PDB as files in the PDB format representing both asymmetric units (PDB entries “as is”) and biological units. Structures are revised to correct possible mistakes (such as duplication of atoms) and inconveniences (such as two or more variants of a structure placed in one coordinate space; see, for example, PDB entry 1QPI, where two variants of each DNA chain are superimposed). All structural files of NPIDB are available for download. The content is updated regularly by a special program module. As of May 2009, NPIDB contained 2314 structures. NPIDB is available via the Internet: http://mouse.belozersky.msu.ru/NPIDB/. The main part of the web interface is the list of available structures. Each NPIDB entry has its own web page containing general information, links to other resources (e.g., PDBsum), a table describing biological units or (in the case of structures solved with NMR) models, tables describing Pfam and SCOP domains in protein chains, and the list of available actions (including Jmol visualization). The web interface also contains the lists of all Pfam and SCOP domains represented. Each domain family has its own web page with the list of entries that include domains of the family. Representatives of Pfam families are available for download. Each representative is a PDB-format file with a fragment of a protein chain that is a domain of the family, together with the fragments of nucleic acid chains that are in contact with the protein domain.

Since the publication of the paper [4], the following new features have appeared. Pfam domains are now detected not from the Pfam data on PDB files but with the Pfam HMM profiles, which allows recognition of Pfam domains in new structures. Superimposed domains of SCOP families, together with information on conserved water molecules, have been added.

The work is partly supported by the Russian Foundation of Basic Research, grants 07-04-91560 and 08-04-91975.

1 Institute of Agricultural Biotechnology, 42 Timiryazevskaya st., Moscow, 127550, Russia, Russian Federation, [email protected], [email protected] 2 Belozersky Institute of Physical and Chemical Biology, Moscow State University, Moscow, 119992, Russia, Russian Federation, [email protected] 3 Scientific Research Institute for System Studies (NIISI RAN), Moscow, Russian Federation, [email protected], [email protected] 4 Gamaleya Institute of Epidemiology and Microbiology, 18 Gamaleya st., Moscow, 123098, Russia; Institute of Agricultural Biotechnology, 42 Timiryazevskaya st., Moscow, 127550, Russia; [email protected]

1. A.Alexeevski et al. (2003) CluD, a program for determination of hydrophobic clusters in 3D structures of protein and protein-nucleic acid complexes, Biophysics 48 suppl. 1, S146–S156.
2. A.G.Murzin et al. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247:536–540.
3. R.D.Finn et al. (2006) Pfam: clans, web tools and services, Nucleic Acids Research 34, Database issue, D247–D251.
4. S.Spirin et al. (2007) NPIDB, a Database of Nucleic Acids–Protein Interactions. Bioinformatics 23 (23):3247–3248.


STATISTICAL APPROACH FOR DISCOVERING EVOLUTIONARY CONSERVED MEMBERS OF REGULON E. STAVROVSKAYA 1,2, D.A. RODIONOV 3, A.A. MIRONOV 1,2 , I. DUBCHAK 4, P.S. NOVICHKOV 4

Keywords: regulation, TFBS profile, orthologous group

Reconstruction of transcriptional regulatory networks is one of the major challenges facing the bioinformatics community in view of the constantly growing number of complete genomes. The comparative genomics approach has been successfully used for the analysis of the transcriptional regulation of many metabolic systems in various bacterial taxa. The key step in this approach is, given a position weight matrix, to find an optimal threshold for the search of potential binding sites in genomes. Here we demonstrate that this problem is tightly bound to the problem of discovering the optimal content of a regulon and suggest an approach to solve both problems simultaneously. First, we select an arbitrary score of the transcription factor binding site (TFBS) as a potential optimal threshold S*. For each orthologous group i we calculate a quality Zi, which allows ranking of all orthologous groups to get the most promising ones at the top of the list. A particular orthologous group is described by its size (the number of orthologous genes) and the average length of the upstream regions among all genes in the group. After application of the TFBS profile to each gene upstream region, the number of genes having a potential regulatory site with a score passing the threshold S* can be calculated. We define the quality of an orthologous group in terms of the probability of finding at least this number of potentially regulated genes given that the upstream regions are random sequences. The corresponding probability can be calculated as:

1 Department of Bioengineering and Bioinformatics, Moscow State University, Leninskiye Gory 1-73, Moscow, 119992, Russia, [email protected] 2 IITP, Bol'shoi Karetnyi per. 19, Moscow, 127994, Russia 3 Burnham Institute for Medical Research, La Jolla, CA 92037 4 Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA, [email protected]

Here the probability entering this expression is that of observing a binding site with a score passing S* in a random sequence of the given upstream length; it can in turn be calculated using the extreme value distribution as:

This expression involves the length of the profile and the probability of observing such a score in a random sequence of the same length as the profile. Finally, the quality of the group can be calculated as:

At the second step we use the “Bernoulli Estimator” (BE) routine [1] to select a subset of high-quality orthologous groups that delivers the lowest probability of observing such a subset given that gene upstream regions are random sequences. BE assumes that the input values are a mixture of two distributions, a background distribution describing the noise and a signal distribution, and automatically defines the optimal threshold to distinguish the signal from the noise. To do this, BE requires the background distribution as an additional input, which can be calculated as:

where M is the total number of orthologous groups of genes. Finally, it can be shown that, for a given group the following equality is valid:

Thus, for an arbitrarily selected threshold S*, the application of BE to the orthologous group qualities provides the most probable set of orthologous groups of genes under regulation, as well as the probability of observing such a set given random sequences of upstream regions (the BE probability). Iterating through all potential thresholds S*, the optimal threshold, which minimizes the BE probability, can be obtained. The approach was tested on 7 Shewanella genomes using the position-specific weight matrix of the SOS response regulator LexA.
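The group quality described above can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the abstract: the probability of at least one site in a random upstream region is approximated by treating the sequence windows as independent (a stand-in for the extreme value expression), the group probability is taken as a binomial tail, and the quality is taken as its negative decimal logarithm. All function and parameter names are hypothetical.

```python
import math


def site_probability(q: float, upstream_len: int, profile_len: int) -> float:
    """Probability of at least one window passing the threshold in a random region.

    q is the per-window probability of a passing score. Windows are treated as
    independent, which is an assumption made for this sketch.
    """
    windows = max(upstream_len - profile_len + 1, 0)
    return 1.0 - (1.0 - q) ** windows


def binomial_tail(n_genes: int, k_hits: int, p: float) -> float:
    """P(X >= k_hits) for X ~ Binomial(n_genes, p)."""
    return sum(
        math.comb(n_genes, j) * p ** j * (1.0 - p) ** (n_genes - j)
        for j in range(k_hits, n_genes + 1)
    )


def group_quality(n_genes: int, k_hits: int, p: float) -> float:
    """Assumed quality score: -log10 of the binomial tail probability."""
    tail = binomial_tail(n_genes, k_hits, p)
    return math.inf if tail <= 0.0 else -math.log10(tail)
```

Under these assumptions, a group in which many genes carry a passing site despite a small per-gene probability receives a high quality and ranks near the top of the list.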

Fig. 1. The dependence of BE probability (log scale) on the TFBS score threshold.

The clear deep minimum of the probability (Fig. 1) corresponds to the score threshold value 5.1 and selects 8 orthologous groups as potential members of the LexA regulon. Manual analysis of the LexA regulon in Shewanella genomes yields 13 orthologous groups that are true members of the regulon. Comparison of the eight selected orthologous groups with the results of the manual analysis shows that all of them are true positives.
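The counts reported above can be restated as precision and recall. The following short sketch only does that arithmetic; the function name is illustrative.

```python
def precision_recall(n_selected: int, n_true_positive: int, n_reference: int) -> tuple:
    """Precision and recall from simple counts."""
    precision = n_true_positive / n_selected
    recall = n_true_positive / n_reference
    return precision, recall


# 8 groups selected, all 8 confirmed, 13 groups in the manually curated regulon.
precision, recall = precision_recall(8, 8, 13)
```

With these counts the precision is 1.0 and the recall is 8/13, about 0.62: the automatic selection is conservative and misses five of the manually identified groups.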

1. Kalinina O.V., Mironov A.A., Gelfand M.S., Rakhmaninova A.B. (2004) Automated selection of positions determining functional specificity of proteins by comparative analysis of orthologous groups in protein families, Protein Sci. 13: 443–456.


COMPUTER SIMULATION AND QUANTUM CHEMISTRY CALCULATIONS IN THE ANALYSIS OF THE PHYSICAL MECHANISM OF THE BIOLOGICALLY SIGNIFICANT ACTIVITY OF NUCLEOTIDES VASILY STEFANOV 1, OLGA ROGACHEVA 1, ALEXANDER TULUB 1

Keywords: quantum chemistry calculations, ATP, Mg, cAMP, cGMP, protein kinase, spin, triplet/singlet state

Quantum chemistry calculations and computer simulation were used to study the physical mechanism of the biologically significant activity of nucleotides. Calculations were carried out by means of the following methods: semi-empirical MNDO, molecular dynamics using the density functional approach (DFT:B3LYP), and ab initio calculations (RHF); and the following software: GAMESS 6.4, Gromacs 3.1.2, Gaussian 03W, Gaussian 94W, and the docking software Quantum 3.3.0.

The cyclic nucleotides cAMP and cGMP were analyzed using computational methods of quantum biochemistry. The occurrence of two conformations (syn and anti) was demonstrated for the cyclic nucleotide cAMP in the protonated form. They are separated by an energy barrier of ~6 kcal/mol, making transition between them impossible under physiological conditions. The syn-conformation is more favorable (the energy difference between the two conformations is 2.3 kcal/mol). The calculated enthalpy of hydrolysis of cAMP in the syn- and anti-conformations is 15.5 and 18.8 kcal/mol, respectively. Similar values (15.0 and 17.9 kcal/mol) were obtained for cGMP in the syn- and anti-conformations. It was shown that cAMP-dependent activation of protein kinase A is mediated by the transition of its regulatory subunit into a thermodynamically favorable conformation, which can be realized only in the presence of the ligand (ΔG˚=-23.9±8.2 kJ/mol), and by the increased affinity of cAMP for the regulatory subunit in the induced conformation compared to that observed in the inactive holoenzyme complex (ΔG˚=-28.1±9.7 kJ/mol). The calculated true binding constants of cAMP for the protein kinase A holoenzyme are 60 and 57 μM for the A and B domains of the regulatory subunit, respectively. Since the intracellular cAMP concentration varies in the range 2–55 μM, these values can account for protein kinase A activation in response to regulatory signals. Docking of the ionized form of cAMP in the syn-conformation into site B of the regulatory subunit suggested the formation of 5 hydrogen bonds and one stacking interaction, with an energy of -5.9 kcal/mol. It was shown that, unlike the A site, the B site within the holoenzyme fails to generate the native pattern of interactions with the ligand. Docking analysis proved that ATP can function as a competitive inhibitor of protein kinase activation.

The MD DFT:B3LYP method (6-31G** basis set, T=310 K) was used to study interactions (singlet, S, and triplet, T, paths) between ATP (ATP-subsystem) and the Mg-complex [Mg(H2O)6]2+ (Mg-subsystem) in a water environment modeled with 78 water molecules. The computations reveal the appearance of low- and high-energy states (stable, quasi-stable, and unstable), assigned to different spin symmetries. At the initial stage of interaction, ATP donates a part of its negative charge to the Mg-complex, making Mg slightly charged. As a result, the initial octahedral Mg-complex loses two (S-state) or four (T-state) water molecules. Moving along the S- or T-potential energy surfaces (PESs), Mg(H2O)4 or Mg(H2O)2 reveal different ways of complexation with ATP. The S-path favors formation of a stable chelate with the O1-O2 fragment of the ATP triphosphate tail, whereas the T-path favors the appearance of a single-bonded complex [Mg(H2O)2-(O2)ATP]. The single-bonded complex is unstable and undergoes further conversion into a spin-separated complex, also unstable, and two quasi-stable S complexes (S3 state), which are subsequently transformed into two stable chelates (S1 low-energy state and S4 high-energy state). The spin-separated complex undergoes rapid decomposition, resulting in production of a highly reactive ion-radical •AMP-. The feasibility of the ion-radical pathway of ATP decomposition, which is many orders of magnitude faster than the conventional hydrolytic one, is supported by 31P experiments conducted by the CIDNP (Chemically Induced Dynamic Nuclear Polarization) method, which is capable of recording free radicals in the nanosecond time range. The ion-radical pathway can play a key role in initiating assembly processes in the cell (DNA/RNA polymerization; self-assembly of microtubules) according to the “living chain” polymerization mechanism known from organic chemistry.

1 St. Petersburg State University, Russian Federation, [email protected], [email protected], [email protected]
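For orientation, a binding constant in μM can be put on the same kJ/mol scale as the free energies quoted above through the standard relation ΔG° = RT ln Kd. The sketch below assumes that the quoted constants are dissociation constants, that the standard state is 1 M, and that T = 298.15 K; none of these is stated in the abstract.

```python
import math

R_KJ_PER_MOL_K = 8.314462618e-3  # gas constant in kJ/(mol*K)


def delta_g_from_kd(kd_molar: float, temperature_k: float = 298.15) -> float:
    """Standard binding free energy (kJ/mol) for a dissociation constant in mol/L."""
    return R_KJ_PER_MOL_K * temperature_k * math.log(kd_molar)


def kd_from_delta_g(delta_g_kj: float, temperature_k: float = 298.15) -> float:
    """Dissociation constant (mol/L) for a standard binding free energy in kJ/mol."""
    return math.exp(delta_g_kj / (R_KJ_PER_MOL_K * temperature_k))


dg_site_a = delta_g_from_kd(60e-6)  # 60 uM
dg_site_b = delta_g_from_kd(57e-6)  # 57 uM
```

Under these assumptions both constants correspond to roughly −24 kJ/mol, which is the same order of magnitude as the free energy changes reported for the regulatory subunit.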


SNPS IN THE HIV-1 TATA BOX AND THE AIDS PANDEMIC SUSLOV V.V. 1, P.M. PONOMARENKO 2, V.M. EFIMOV 1, M.P. PONOMARENKO 1, L.K. SAVINKOVA 1, N.A. KOLCHANOV 1

HIV-1 is a very variable virus. Its reverse transcriptase makes ~1 error per genome replication, and its growth rate is ~10^10 virions per case. Groups M, O, and N are believed to be related to independent transmissions of the HIV-1 ancestor from apes to humans. The main group, M, arose in Cameroon/Congo at the beginning of the 20th century. It has 9 subtypes (some of which have sub-subtypes) and tens of recombinant forms, both circulating (isolated from more than two cases) and unique. The phylogeny was based on coding RNAs and proteins, but the adaptive role of subtypes was controversial [1,2]. We analyzed the TATA box controlling the transcription of the integrated viral DNA as a marker of evolutionary trends. The 5'- and 3'-halves of the LTR bear one imperfect TATA copy each. We extracted 2662 TATA boxes (2311 HIV-1 isolates) from GenBank and aligned them to obtain 146 variants. The agatgctgCATATAAgcagctgcttt sequence, found in 59% of cases, five times as often as the second most prevalent variant, was taken to be normal (S0). The affinity of the TATA-binding protein (TBP) was estimated by the equation for the TBP/TATA binding equilibrium deduced by us in [3]:

$$-\ln[K_{D,\mathrm{TATA}}(S)] = 10.9 - 0.23\,\ln[K_{D,\mathrm{TBP/dsDNA}}(S)] + 0.15\,\mathrm{PWM}_{\mathrm{TATA,Bucher}}(S) - 0.20\,\ln[K_{D,\mathrm{TBP/ssDNA}}(S)]$$

where 10.9 is the nonspecific TBP/DNA affinity; −ln[K_D,TBP/dsDNA] is the contribution of TBP sliding along DNA; PWM_TATA,Bucher is the contribution of TBP/TATA recognition; −ln[K_D,TBP/ssDNA] is the contribution of the stabilization of the TBP/TATA complex; and 0.23, 0.15, and 0.20 are stoichiometric coefficients. The significance of deviations was assessed by the Student t test (δ±5%). The mutation-related decrease in TBP/TATA affinity is in agreement with that expected from the binomial law (α>0.8); 54 negative deviations ∆ < −δ5% = −0.10 produced a significant excess (α<10^−31), which was interpreted as selection towards low-expressing HIV-1 forms. This agrees with the data on FIV and SIV [4] and with direct expression measurements in HIV-1 strains of the 1980s and the 21st century [5]. Analysis of the geographic prevalence of the 146 variants revealed principal components F1 (58% of the variance) and F2 (34% of the variance). F1 showed a significant positive correlation with the prevalence of S0 and a negative one with that of low-expressing mutants. F2 showed a significant positive correlation with the prevalence of mutants not differing from S0 in TBP affinity. Three clusters were recognized (fig.). We assume that the presence of France in its cluster can be explained by its relations with its former African colonies (bold). In cluster ■, selection of the low-expressing forms is the strongest.

1 Institute of Cytology and Genetics SB RAS, Novosibirsk, Russia, [email protected] 2 Novosibirsk State University, Novosibirsk, Russia
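The TBP/TATA affinity equation given above can be evaluated directly once its three sequence-dependent terms are known. The sketch below takes those terms as plain numbers; how they are computed from a sequence is not covered. It also assumes that the deviation ∆ is the difference in −ln K_D between a variant and the normal sequence S0, which is an interpretation of the text. Function and argument names are hypothetical.

```python
def neg_ln_kd_tata(ln_kd_dsdna: float, pwm_bucher: float, ln_kd_ssdna: float) -> float:
    """-ln K_D of TBP/TATA binding from the three sequence-dependent terms."""
    return 10.9 - 0.23 * ln_kd_dsdna + 0.15 * pwm_bucher - 0.20 * ln_kd_ssdna


def is_low_affinity(variant_value: float, normal_value: float, delta: float = 0.10) -> bool:
    """True if the variant falls below the normal sequence by more than delta.

    The comparison of -ln K_D values is an assumed reading of the deviation
    used in the text; delta = 0.10 is the 5% threshold quoted there.
    """
    return (variant_value - normal_value) < -delta
```

For example, with purely illustrative inputs `neg_ln_kd_tata(-10.0, 2.0, -5.0)` evaluates to 10.9 + 2.3 + 0.3 + 1.0 = 14.5.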

Fig. Countries plotted by the principal components F1 (“p(wt) − p(−26T→A)”) and F2 (“p(Mut≈Norm)”) of the geographic prevalence of the TATA box variants. The plot is annotated with the correlations r=0.785 (α<0.00025), r=0.393 (α>0.06), and r=−0.817 (α<10^−12).

The work is supported by grants NSh-2447.2008.4 and RFBR 08-04-01048; RAS projects 10.7 and 18.13; SB RAS Integration project 119; and RAS Integration project 23.29 Biodiversity.

1. Kalish, M.L., et al. (2004) Recombinant viruses and early global HIV-1 epidemic. Emerg. Infect. Dis. 10: 1227-1234.
2. Taylor, B.S. et al. (2008) The challenge of HIV-1 subtype diversity. N. Engl. J. Med. 358: 1590-1602.
3. Ponomarenko, P.M. et al. (2008) A step-by-step model of TBP/TATA box binding allows predicting human hereditary diseases by single nucleotide polymorphism. Dokl. Biochem. Biophys. 419: 88-92.
4. H.Friedman et al. (2006) In vivo Models of HIV Disease and Control. N.Y., Springer Science+Business Media.
5. Ariën, K.K et al. (2007) Is HIV-1 evolving to a less virulent form in humans? Nat. Rev. Microbiol. 5: 141-151.


MODELING OF STRUCTURE AND SUBSTRATE RECOGNITION OF PENICILLIN ACYLASE FROM STREPTOMYCES MOBARAENSIS USING MOLECULAR DOCKING TO EVALUATE PROPER ACTIVE SITE GEOMETRY DIMITRY SUPLATOV 1, IRINA POULIAKHINA 1, VLADIMIR ARZHANIK 1, VYTAS ŠVEDAS 1

Keywords: structure modeling, docking, enzyme specificity, penicillin acylase

Penicillin acylases (PAs, EC 3.5.1.11) are a group of industrially important enzymes widely used for modification of beta-lactam antibiotics. Bacterial PAs are classified into three major substrate specificity groups and display remarkable differences in activity and stability. A proper understanding of the molecular basis of specificity could have important consequences for their extended application in biotechnological processes. PA from the actinomycete Streptomyces mobaraensis (SmPA) is a novel enzyme that has the heterodimeric structure common to penicillin G acylases while displaying the highest hydrolytic activity on penicillin V. Thus SmPA could be seen as a Rosetta stone for deciphering the substrate specificity mechanism and for gaining further insight into the structure-function relationship in the PA family. In this work a novel approach to homology modeling of enzyme structure was developed and applied to study the substrate recognition mechanism of SmPA. Since SmPA has less than 20% sequence identity with the closest relative of known three-dimensional structure, a sequence alignment created using canonical methods was used as a template to identify poorly aligned regions containing potential motifs of the active site and the substrate binding cavity. Those regions were then combinatorially shuffled within a rational scope to create a set of random alignments covering all possible locations of SmPA residues as superimposed on the main chain of the template protein. The resulting alignments were used by the Modeller program to build three-dimensional structures and to evaluate them with a scoring function describing local stereochemical features through CHARMM force field terms.
The 10 best-scoring models were then selected for optimization of the initial conformation, beginning with conjugate-gradient energy minimization, followed by molecular dynamics with simulated annealing, and finalized by energy minimization. The resulting structures were screened with common substrates for PA hydrolytic activity using the AutoDock4 molecular docking software. In this work 2200 shuffled alignments were created and 22000 candidate structures were reviewed in more than 1000000 docking iterations. Models with proper binding were evaluated considering the output docking energies and the geometrical features of penicillin enzymatic hydrolysis reported in the literature. The best candidate models were then selected for structural alignment to discriminate the important binding cavity residues from the unimportant ones, which do not affect the docking results. The final structures were analyzed and the binding sites of different chemical groups of the substrate were identified. The analysis showed that the catalytic triad in SmPA (SerB164, AsnB439 and ValB233) is similar to that of already studied PAs. Among the most interesting observations are ArgB193, which seems to be analogous to ArgA145 in Escherichia coli PA in stabilizing the substrate leaving group; HisB231, which is responsible for proper orientation of the carboxyl group; and TyrB187, which interacts with the beta-lactam ring. This work presents an attempt to introduce molecular docking into the homology modeling protocol as a tool for verification of proper models, in contrast to widely used falsification schemes, which can only point to incorrect models. Despite the current limitations of molecular docking, this pipeline is seen as a next step in homology modeling of enzyme structures. The results obtained for the SmPA active site will be used in further analysis of the structure-function relationship in the PA family.

1 Lomonosov Moscow State University, Russian Federation, [email protected], [email protected]


CONSTRUCTION OF INTERACTIVE DATA BASE OF HUMAN ALU REPEATS DIGESTION AT SHORT NUCLEOTIDE SEQUENCES VICTOR TOMILOV 1, MURAT ABDURASHITOV 1, SERGEY DEGTYAREV 1

Keywords: Alu repeats, database, DNA cleavage, in silico

The Alu repeat family, which belongs to the SINE class of DNA repeats, is one of the most abundant and best characterized groups of repetitive elements in the human genome. The total number of annotated Alu sequences exceeds 1.1 million copies, and they make up about 10% of the genome. Digestion of human chromosomal DNA in vitro and in silico at short nucleotide sequences that are recognition sites of restriction endonucleases provides distinct DNA cleavage patterns [1]. Most of the small DNA fragments shorter than 300 bp in these cleavage patterns are produced from Alu repeats [2]. Earlier we developed a database of aligned Alu repeats [3], which includes 1,193,407 sequences with a total length of ~350 million bp [2]. In this work we have calculated the frequency of each nucleotide at every position of the Alu repeat consensus sequence and constructed a new interactive version of the Alu repeat database. This new database, about 100 kb in size, is much smaller than the original one and allows quick analysis of the presence, location and percentage of any short nucleotide sequence in the set of Alu repeats. We have constructed such a physical map of Alu repeat cleavage at more than 20 recognition sequences of restriction endonucleases and have shown good agreement between the theoretical results and experimental data on human DNA hydrolysis. Working with the interactive database of Alu repeats saves time in human DNA studies.
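The idea of replacing the full alignment with per-position nucleotide frequencies can be sketched as follows. The sketch assumes equal-length aligned sequences with `-` as the gap character, and it estimates the share of repeats carrying a short site at a given position as the product of the per-position frequencies, which treats positions as independent. That independence assumption, and all names used, are illustrative and not taken from the database itself.

```python
from collections import Counter
from typing import Dict, List


def position_frequencies(aligned: List[str]) -> List[Dict[str, float]]:
    """Frequency of A, C, G and T in every alignment column.

    Gap characters count toward the column total, so the four frequencies
    may sum to less than 1.
    """
    length = len(aligned[0])
    freqs = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned)
        total = sum(counts.values())
        freqs.append({base: counts.get(base, 0) / total for base in "ACGT"})
    return freqs


def site_fraction(freqs: List[Dict[str, float]], site: str, start: int) -> float:
    """Estimated fraction of repeats with `site` at `start` (independent positions)."""
    fraction = 1.0
    for offset, base in enumerate(site):
        fraction *= freqs[start + offset][base]
    return fraction


def scan(freqs: List[Dict[str, float]], site: str) -> List[float]:
    """Estimated fraction for every possible start position of `site`."""
    return [site_fraction(freqs, site, s) for s in range(len(freqs) - len(site) + 1)]
```

A table of four frequencies per consensus position is tiny compared with more than a million aligned sequences, which is why lookups for a recognition sequence such as AGCT become fast.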

1 SibEnzyme Ltd., Russian Federation, [email protected]

1. Abdurashitov M.A., Tomilov V.N., Chernukhin V.A., Gonchar D.A., Degtyarev S.Kh. Comparative analysis of human chromosomal DNA digestion with restriction endonucleases in vitro and in silico // Medical genetics V.6, No 8, pp 29-36, 2007 (Online version - http://science.sibenzyme.com/article14_article_31_1.phtml)
2. Murat A Abdurashitov, Victor N Tomilov, Valery A Chernukhin and Sergey Kh Degtyarev. A physical map of human Alu repeats cleavage by restriction endonucleases // BMC Genomics 2008, 9:305 (Online version - http://www.biomedcentral.com/1471-2164/9/305)
3. Aligned set of Alu-repeats for human DNA - http://science.sibenzyme.com/article4_article_37_1.phtml


ANIONIC PHOSPHOLIPID ASYMMETRIC LOCATION IN ZWITTERIONIC/ANIONIC VESICLES FRANCISCO TORRENS 1, GLORIA CASTELLANO 2

Keywords: Protein–lipid interaction; Binding isotherm; Partition coefficient; Ionic strength; Asymmetric distribution

The role of electrostatics is studied in the adsorption of cationic proteins to anionic (phosphatidylcholine/phosphatidylglycerol, PC/PG) and zwitterionic (PC) small unilamellar vesicles (SUVs) [1]. For model proteins the interaction is monitored vs. PG content at low ionic strength [2]. The adsorption of lysozyme (Fig. 1) and myoglobin (isoelectric point, pI 7–11) is investigated in SUVs, along with changes in the fluorescence emission spectra of the cationic proteins on their adsorption to SUVs [3]. In the Gouy–Chapman formalism, the activity coefficient goes with the square of the charge number [4]. Deviations from the ideal model show the asymmetric location of the anionic phospholipid in the inner leaflet of the bilayer in mixed zwitterionic/anionic SUVs for both lysozyme– and myoglobin–PC/PG systems, in agreement with experiments and molecular dynamics simulations (Fig. 2). The effective SUV charge stays constant. The effective – formal difference increases by 0.417 e.u. The effective protein charge increases as PC/PG < PC, being greater for myoglobin. The molar free energies of the protein in the aqueous and lipid phases increase as PC < PC/PG. Both free-energy changes are greater for myoglobin. The effective interfacial charge stays constant for anionic PC/PG SUVs, being greater for myoglobin. With the Gouy–Chapman formalism, γ is obtained as ln γ ∝ ν·sinh⁻¹(ν) ≈ ν², so the activity coefficient goes with the square of the charge number. As Γ ∝ γ at constant ν, it can be expected that ln Γ ∝ ln γ ∝ zL·sinh⁻¹(zL) ≈ zL². For lysozyme and myoglobin adsorption on mixed zwitterionic/anionic PC/PG vesicles, deviations from the ideal model show the asymmetric location of the anionic phospholipid in the inner leaflet of the bilayer, in agreement with experiments and molecular dynamics simulations. Each leaflet follows an independent behaviour, in that changes in one did not correlate with those in the other.
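The approximation ν·sinh⁻¹(ν) ≈ ν² used above follows from sinh⁻¹(ν) ≈ ν for small |ν|. The short numerical sketch below shows how quickly it degrades as ν grows; the proportionality constant of the Gouy–Chapman expression is left out, so only the shape of the dependence is compared.

```python
import math


def ln_gamma_shape(nu: float) -> float:
    """Shape of ln(gamma) up to a constant factor: nu * asinh(nu)."""
    return nu * math.asinh(nu)


def relative_error_of_square(nu: float) -> float:
    """Relative error made by replacing nu * asinh(nu) with nu**2."""
    exact = ln_gamma_shape(nu)
    return abs(nu ** 2 - exact) / exact


errors = {nu: relative_error_of_square(nu) for nu in (0.1, 0.5, 1.0)}
```

The quadratic form is accurate to a fraction of a percent near ν = 0.1, to a few percent near ν = 0.5, and overestimates by more than 10% at ν = 1, so it is a small-charge approximation.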

 1 Institut Universitari de Ciència Molecular, Universitat de València, Edifici d’Instituts de Paterna, P. O. Box 22085, 46071 València, Spain, [email protected] 2 Instituto Universitario de Medio Ambiente y Ciencias Marinas, Universidad Católica de Valencia San Vicente Mártir, 46003, València, Spain, [email protected] 347 4-TH MOSCOW CONFERENCE ON COMPUTATIONAL MOLECULAR BIOLOGY July 20–23, 2009

Fig. 1. Influence of vesicle charge on adsorption of lysozyme–PC/PG at T=20ºC, pH 7.0. Curves for PC, PC/PG 9:1, PC/PG 8:2 and PC/PG 6:4 vesicles; vertical axis α/Ri*, horizontal axis (1−α)[P]T.

Fig. 2. Influence of vesicle charge on the natural logarithm of theoretical partition coefficient. Curves for lysozyme, myoglobin, and the corresponding ideal models; vertical axis ln Γ, horizontal axis vesicle charge (e.u.).


1. F. Torrens, A. Campos, C. Abad (2003) Binding of vinyl polymers to anionic model membranes, Cell. Mol. Biol., 49:991–998.
2. F. Torrens, C. Abad, A. Codoñer, R. García-Lopera, A. Campos (2005) Interaction of polyelectrolytes with oppositely charged micelles studied by fluorescence and liquid chromatography, Eur. Polym. J., 41:1439–1452.
3. F. Torrens, G. Castellano, A. Campos, C. Abad (2007) Negatively cooperative binding of melittin to neutral phospholipid vesicles, J. Mol. Struct., 834-836:216–228.
4. F. Torrens, G. Castellano, A. Campos, C. Abad (2009) Binding of water-soluble, globular proteins to anionic model membranes, J. Mol. Struct., 924-926:274–284.


PREDICTION OF SUPER-SECONDARY STRUCTURE IN α-HELICAL AND β-BARREL TRANSMEMBRANE PROTEINS VAN DU TRAN 1, PHILIPPE CHASSIGNET 1, JEAN-MARC STEYAERT 1

Keywords: β-barrel, transmembrane protein, super-secondary structure, permutation

Transmembrane proteins are divided into two main classes based on their conformation: α-helical bundles and β-barrels. Learning-based computational methods are poorly tractable, since transmembrane structures are difficult to determine by standard experimental methods. Generally, those structures are not only a series of β-strands or α-helices in which each is bonded to the ones immediately before and after it in the primary sequence; they may also contain Greek key, and sometimes jelly roll, motifs [1]. This level of structure may be described as a permutation of the order of the bonded segments. We model the minimum-energy protein folding problem as finding the longest closed path in a graph with respect to a given permutation. The energy functions can be tuned according to the studied class of proteins. Using dynamic programming, the algorithm runs in O(n³) time for the identity permutation and in at most O(n⁵) time for the Greek key motifs, where n is the number of amino acids. A three-dimensional structure is also computed using geometric criteria. The algorithm can be used to predict the structure of different families of proteins and is validated on the class of β-barrel transmembrane proteins. For the latter, the prediction accuracy, evaluated as the percentage of well-predicted residues, reaches 70–85%, which compares favourably with existing work such as [2–4]. The number of strands is found correctly, whereas another main geometric characteristic of a β-barrel, the shear number, is predicted reasonably well. We plan to carry out a screening of the genomes of certain species such as Paramecium and Neisseria meningitidis.
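The accuracy measure mentioned above, the percentage of well-predicted residues, can be computed as in the following sketch. It assumes a per-residue labelling in which each residue carries one state character; the labels used here (`M` for a membrane strand, `i` and `o` for the two sides of the membrane) are hypothetical and chosen only for illustration.

```python
def residue_accuracy(predicted: str, observed: str) -> float:
    """Percentage of residues whose predicted state matches the observed one."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("label strings must be non-empty and of equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# One residue out of ten is assigned to the wrong state.
accuracy = residue_accuracy("MMMMiiooMM", "MMMiiiooMM")
```

In this toy case the accuracy is 90%; the 70–85% range reported above refers to real β-barrel transmembrane proteins, not to this example.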

1 Laboratoire d'Informatique de l'Ecole Polytechnique, France, [email protected], [email protected], [email protected]

1. C.Zhang, S.H.Kim (2000) A comprehensive analysis of the Greek key motifs in protein β-barrels and β-sandwiches, Proteins: Struct Funct Genet, 40:409–419.
2. J.Waldispühl, B.Berger, P.Clote, J.-M.Steyaert (2006) Predicting transmembrane β-barrels and interstrand residue interactions from sequence, Proteins: Struct Funct Bioinformatics, 65:61–74.
3. H.R.Bigelow, D.S.Petrey, J.Liu, D.Przybylski, B.Rost (2004) Predicting transmembrane beta-barrels in proteomes, Nucleic Acids Res, 32:2566–2577.
4. P.Martelli, P.Fariselli, A.Krogh, R.Casadio (2002) A sequence-profile-based HMM for predicting and discriminating β-barrel membrane proteins, Bioinformatics, 18 Suppl 1:S46–S53.


ETHANOLAMINE UTILIZATION: STUDY OF EVOLUTION AND REGULATION USING COMPARATIVE GENOMICS OLGA TSOY 1, DMITRY RAVCHEEV 1, ARCADY MUSHEGIAN 2

Ethanolamine is used as a source of carbon and nitrogen by phylogenetically diverse bacteria. Ethanolamine degradation is enabled by the enzyme ethanolamine-ammonia lyase, which is typically encoded by two genes, named eutB and eutC in the best-studied case of Salmonella typhimurium [1]. The ethanolamine degradation pathway requires some additional enzymes and transport proteins such as EutA, EutG, EutD, EutH, and Eat. In S. typhimurium, all these genes are part of a single operon, along with the eutR gene for the transcriptional regulator [2]. Despite extensive studies in S. typhimurium, much remains to be learned about EutBC structure and catalytic mechanism, the evolutionary origin of ethanolamine utilization, and the regulatory links between cobalamin and ethanolamine metabolism. In this work we applied computational analysis of sequences, structures, genome contexts and phylogenies of ethanolamine-ammonia lyases to address some of these questions. The eut genes were found in almost 100 fully sequenced bacterial genomes. Genes for ethanolamine-ammonia lyase were observed in variable genome contexts: two main types of the eut operon, “short” and “long”, were detected. The phylogeny of the EutB and EutC proteins and analysis of the genomic context allowed us to reconstruct the evolutionary history of the eut operon. Regulation of eut operon transcription was studied in two taxa, Enterobacteriaceae and some Burkholderiales. In both taxa, conserved motifs were found upstream of the eut operon and are proposed to be EutR binding sites. In Enterobacteriaceae, predicted EutR binding sites were also detected upstream of the cobalamin biosynthesis operon.

We are grateful to M.S. Gelfand for valuable discussions and appreciate comments from A.E. Kazakov. We also thank Hua Li and D. Zhu for help with statistical calculations. This study was supported by grants from the Howard Hughes Medical Institute, the Russian Academy of Sciences (Program "Molecular and Cellular Biology"), and the Russian Foundation for Basic Research.

1 Institute for Information Transmission Problems, RAS, Russian Federation, [email protected], [email protected]
2 Stowers Institute for Medical Research; Department of Microbiology, Molecular Genetics, and Immunology, University of Kansas Medical Center, United States, [email protected]

1. G.W.Chang, J.T.Chang (1975) Evidence for the B12-dependent enzyme ethanolamine deaminase in Salmonella, Nature, 254:150–151.
2. E.Kofoid et al. (1999) The 17-gene ethanolamine (eut) operon of Salmonella typhimurium encodes five homologues of carboxysome shell proteins, Journal of Bacteriology, 181:5317–5329.


THE TALE OF “UNDERLYING BIOLOGY”: FUNCTIONAL ANALYSIS OF MAQC II DATA MARINA TSYGANOVA 1, WEIWEI SHI 2, DAMIR DOSYMBEKOV 1, ZOLTAN DEZSO 2, TATIANA NIKOLSKAYA 1,2 , YURI NIKOLSKY 2

Keywords: functional analysis, interactome, ontology enrichment, MAQCII, gene signature

The MAQCII experimental set-up has created a unique opportunity to conduct the first comprehensive study on functional analysis (FA) of statistics-generated predictor gene signatures. First, this study addresses data size, as the original expression data is both massive and diverse, with 6 large-scale datasets and 13 phenotypic “end points”, including three for drug responses and ten for three different types of cancer. Second, we collectively deal with a pool of signatures for each end point that were generated by 34 expert statistician teams. A diversified collection of signatures provides a large enough “union” dataset that is suitable for FA and accounts for the many functional dependencies and correlations that became apparent in cross-signature comparisons. Here we report the results of our functional analysis. One of our initial observations suggests that statistically selected gene descriptors do not make biological sense in the context of end points. Instead, functional correlations manifested only at the level of gene function and “hub” composition of signatures (gene content level), where they were logically distributed across ontologies of cellular processes, pathways and biomarkers (ontology enrichment), and physically connected into significant networks. Ontology enrichment correlation was particularly pronounced for non-redundant gene “unions” of all signatures of a given end point (lower distribution p-values for unions than for individual signatures). This suggests that different models were selective for different subsets of genes on the same pathways and processes. This highlights a caveat of the instability of molecular signatures, where variable selection can lead to many quantitative solutions of equal reliability in terms of prediction rates. Descriptor signatures selected by different models share certain features throughout the end points. Thus, most signatures and all unions were enriched in “hubs” (the 25% most highly connected human genes).
Signature genes predominantly encoded “IN” proteins with most interactions upstream (as opposed to “OUT” proteins with mostly downstream interactions). A high IN/OUT interactions ratio is typical for “effector” proteins such as metabolic enzymes, housekeeping genes encoding homeostasis functions etc. Genes encoding “effector” proteins are likely to be subject to constitutive or conditional expression, as opposed to transiently expressed regulatory genes, contributing to a higher probability of selection in predictive models. Both protein function and hub distribution were end point- and functionality-dependent. Thus, signatures for toxicity end points were enriched in “IN” proteins and in metabolic enzymes. Breast cancer signatures featured a high fraction of transcription factors and “OUT” hubs. End point dependency is evident from multiple analyses. In addition to a high fraction of OUT interactions, breast cancer end points featured the largest number of statistically significant direct interaction (DI) networks for the signatures and the largest DI networks for unions. Breast cancer is a complex and heterogeneous disease with different sub-types and hundreds of involved pathways and processes. The models applied by different teams were likely selective for gene sets responsible for different, yet related, processes of carcinogenesis. Signatures for the same end point displayed similarity (congruency) and synergy (inter-connections). We compared signature similarity (congruency) at the “gene content” level and at the “pathway” level (mostly Disease biomarkers ontology) by two different statistical approaches. Congruency at the “pathway” level was consistently higher than at the “feature” level for all end points. A logical explanation for this observation is that statistical models select different sets of genes from the same biologically relevant entities. Higher pathway congruency supports an assumption of common underlying biological mechanisms for each end point. Signature congruency is in agreement with an observation of “synergy” between signatures in ontology enrichment analysis, manifested as lower p-values for unions compared to the individual signatures for the same end point (data not shown). These two observations suggest that functional analysis procedures are robust and efficient tools for measuring similarity between datasets and gene lists. The pathway congruency technique we applied here can be very useful in such important applications as patient cohort stratification and clustering of clinical samples in biomarker discovery studies.

1 Vavilov Institute for General Genetics, Moscow B333, 117809, Russia, [email protected]
2 GeneGo, Inc. 500 Renaissance Drive, St. Joseph, MI, USA, [email protected]
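The observation that pathway-level congruency can exceed gene-level congruency can be illustrated with a toy calculation. The sketch below is only an illustration under stated assumptions: the abstract does not name its two statistical approaches, so the Jaccard index is used here as a stand-in, and the gene and pathway identifiers are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pathway_set(genes, gene_to_pathways):
    """Map a gene signature onto the set of pathways its genes belong to."""
    return {p for g in genes for p in gene_to_pathways.get(g, ())}


# Hypothetical annotation: two signatures with no genes in common
# that nevertheless hit the same two pathways.
gene_to_pathways = {
    "G1": ["P1"], "G2": ["P2"],
    "G3": ["P1"], "G4": ["P2"],
}
signature_a = {"G1", "G2"}
signature_b = {"G3", "G4"}

gene_congruency = jaccard(signature_a, signature_b)
pathway_congruency = jaccard(
    pathway_set(signature_a, gene_to_pathways),
    pathway_set(signature_b, gene_to_pathways),
)
print(gene_congruency, pathway_congruency)
```

In this toy case the two signatures share no genes, yet map onto identical pathway sets, which is the pattern the abstract describes for models that select different genes from the same biological entities.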


CAPTURE AND RELEASE OF CODING DNA: EVOLUTION OF BACTERIAL GENES BY SHIFT OF STOP CODONS ANNA VAKHRUSHEVA 1, MARAT KAZANOV 2, ANDREW MIRONOV 1, GEORGIY BAZYKIN 2

Keywords: stop codon, genomes, evolution

The de novo origin of coding sequence remains an obscure issue in molecular evolution. One of the possible paths for inclusion (exclusion) of DNA segments into (from) a gene is a shift of a stop codon. Single nucleotide substitutions can create a premature stop codon via a nonsense mutation, or destroy the existing stop codon, leading to uninterrupted translation up to the next stop codon in the gene’s reading frame. Here we describe the evolution of the coding sequence of bacterial genes by shift of stop codons. We aligned the families of homologous genes from 623 complete bacterial genomes. In the alignments, we analyzed all cases of inconsistent position of stop codons between individual genes of a family. We concluded that the stop codon has shifted, and that a segment of coding DNA has been captured or released by the gene, if the coding nucleotide sequence just before the stop codon of a gene was unambiguously aligned to the non-coding sequence immediately after the stop codon of its homologue. We polarized the corresponding mutations by assuming that the majority of the genes in the family represent the ancestral state. In individual cases, the polarization was verified using maximum parsimony. This allowed us to tell evolutionary gains from losses of a C-terminal coding segment. We describe cases of loss of a C-terminal coding segment, as well as cases of incorporation of a region of the 3’UTR into the gene due to a mutation in the stop codon. At least some of the observed cases are not due to sequencing errors, since both the short and the long forms of the gene were observed in several variants in a number of bacterial genomes. The obtained results indicate that the position of a stop codon is evolutionarily labile. A point mutation of a stop codon is a simple evolutionary path to obtaining a new coding sequence.
Alignments of large gene families from species of different evolutionary relatedness will allow us to study the further evolution of a DNA segment after its capture by a gene.

1 Faculty of Bioengineering and Bioinformatics, Russian Federation, [email protected]
2 The Institute for Information Transmission Problems of the RAS
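The mechanism described in this abstract, where a point mutation destroys a stop codon and translation continues to the next in-frame stop, can be sketched in a few lines. This is a minimal illustration, not the authors' alignment pipeline; the function name and the toy sequence are hypothetical, and the standard stop codons TAA, TAG and TGA are assumed.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_stop(seq, start=0):
    """Return the index of the first in-frame stop codon at or after `start`.

    `start` must be in the reading frame of the gene. Returns None when
    no in-frame stop codon is present.
    """
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i
    return None


# Toy gene: ATG GCT TAA, followed by downstream sequence GGC TGA.
ancestral = "ATGGCTTAAGGCTGA"
# A single substitution (T -> C) turns the stop codon TAA into CAA.
derived = "ATGGCTCAAGGCTGA"

old_stop = first_stop(ancestral)
new_stop = first_stop(derived)
captured_nt = new_stop - old_stop  # nucleotides added to the coding sequence
print(old_stop, new_stop, captured_nt)
```

Here the stop codon moves from position 6 to position 12, so the former stop codon and one downstream codon (6 nucleotides) are captured by the gene.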

BINDING DETERMINANTS OF INTERACTIONS BETWEEN ANTIAPOPTOTIC PROTEINS BCL-2, BCL-XL, MCL1 AND LIGANDS ABT737 AND GOSSYPOL. A.I. DAVIDOVSKII 1, V.G. VERESOV 1

Modern anticancer strategies are finally moving away from the use of crude nonspecific cytotoxic agents toward the application of rationally designed drugs that inhibit well-defined targets in specific cellular signaling pathways involved in tumorigenesis. Small-molecule drugs that induce apoptosis in tumor cells by activation of the Bcl-2-regulated mitochondrial outer membrane permeabilization (MOMP) hold promise for rational anticancer therapies (1-3). Accumulating evidence indicates that the natural product gossypol and its derivatives, as well as ABT-737, which was identified through high-throughput screening, can kill tumor cells by targeting antiapoptotic Bcl-2 family members in such a manner as to trigger MOMP. However, the precise mechanisms by which interactions of ABT-737 and gossypol with the antiapoptotic proteins lead to MOMP and apoptosis remain poorly understood. The antiapoptotic proteins Bcl-2, Bcl-xL and Mcl-1 use equivalent binding interfaces to bind both gossypol and ABT-737, but the binding profiles of these six pairs differ and the structural basis of this is far from clear (1, 2). Here, we used computational docking to elucidate the structural basis of differences in the binding affinities of these two BH3-mimetics with the antiapoptotic proteins. The main problem with the use of modern docking techniques is taking account of receptor flexibility. It has long been recognized that a simplistic rigid model of ligand-receptor interactions is inadequate and that incorporation of ligand and receptor flexibility is required for accurate docking. While ligand flexibility has been addressed by a variety of algorithms, receptor flexibility remains a formidable challenge. In the current work, a two-stage simulation protocol was used to take receptor flexibility into account. First, the AutoDock program (version 4.1) (4) was used with the multiple conformations of receptors (MCR) approach to account crudely for receptor flexibility.
To this end, a number of 3D structures from X-ray and NMR analyses of Bcl-xL, Bcl-2 and Mcl-1 were taken from the Protein Data Bank (sixteen for Bcl-xL, six for Bcl-2, and seven for Mcl-1) for the simulations of binding of the antiapoptotic proteins with BH3-ligands, with the side chains of critical protein residues within the major binding grooves treated as flexible. At the second stage, the structures with the lowest energies of binding were subjected to refinement with the use of the program RosettaLigand (5) with full ligand and receptor flexibility. Results of the simulations are shown in Table 1.

1 Institute of Biophysics and Cell Engineering, Minsk, Belarus. [email protected]

Table 1. Calculated binding energies of the antiapoptotic proteins Bcl-2, Bcl-xL, Mcl-1 with ABT-737 and gossypol

Receptor   Ligand     Binding energy         Refined binding energy after
                      by AutoDock program    the use of RosettaLigand program
Bcl-xL     ABT-737    -14.2 kcal/mol         -15.3 kcal/mol
Bcl-2      ABT-737    -10.1 kcal/mol         -10.9 kcal/mol
Mcl-1      ABT-737     -3.7 kcal/mol          -4.8 kcal/mol
Bcl-xL     gossypol    -7.7 kcal/mol          -8.5 kcal/mol
Bcl-2      gossypol    -7.2 kcal/mol          -7.9 kcal/mol
Mcl-1      gossypol    -5.6 kcal/mol          -7.1 kcal/mol

The results of the simulations show that two factors make the major contributions to the binding of the two ligands to the three receptors under consideration and determine the binding profile: interactions with the NWGR motif (residues 136-139 of Bcl-xL, 143-146 of Bcl-2 and 241-244 of Mcl-1), which is common to all antiapoptotic proteins, and solvent reorganization due to insertion of the hydrophobic parts of the two ligands into the hydrophobic pockets p2 and p4 within the hydrophobic grooves of the proteins.

1. V. Labi, F. Grespi, F. Baumgartner, A. Villunger (2008) Targeting the Bcl-2-regulated apoptosis pathway by BH3-mimetics: a breakthrough in anticancer therapy?, Cell Death Differ., 15: 977-987.
2. S. W. Fesik (2005) Promoting apoptosis as a strategy for cancer drug discovery, Nat. Rev. Cancer, 5: 876-885.
3. M. Vogler, D. Dinsdale, M. J. S. Dyer, G. M. Cohen (2009) Bcl-2 inhibitors: small molecules with a big impact on cancer therapy, Cell Death Differ., 16: 360-367.
4. F. Osterberg, G. M. Morris, M. F. Sanner, A. J. Olson, D. S. Goodsell (2002) Automated docking to multiple target structures: incorporation of protein mobility and structural water in AutoDock, Proteins, 46: 34-40.
5. I. W. Davis, D. Baker (2009) RosettaLigand Docking with full ligand and receptor flexibility, J. Mol. Biol., 385: 381-392.


EXPLORING THE MOLECULAR BASIS OF THE BINDING OF ABT737 AND ABT263 TOWARDS ANTIAPOPTOTIC PROTEINS BCL-2, BCL-XL, MCL-1, A1 A.I. DAVIDOVSKII 1, V.G. VERESOV 1

Keywords: proteins, anticancer drugs, protein-ligand docking

One of the cardinal features of cancers is a deregulation of their apoptotic machinery that provides them with a survival advantage (1). Antiapoptotic members of the Bcl-2 family of proteins, such as Bcl-2, Bcl-xL, Bcl-w, Mcl-1 and A1, are overexpressed in many types of human cancers and are associated with the resistance of many tumors to chemo- and radiotherapy and with the failure of conventional anticancer drugs (1). Small-molecule inhibitors of antiapoptotic proteins that induce apoptosis in tumor cells hold promise for anticancer therapies. Two such compounds, ABT-737 and ABT-263, have shown potent cytotoxicity against numerous human tumor cell lines by targeting the antiapoptotic proteins Bcl-2 and Bcl-xL with high affinity (2). However, these compounds have failed to inhibit another subclass of antiapoptotic proteins, including Mcl-1 and A1, although both subclasses use similar binding interfaces to bind ABT-737 and ABT-263. The structural reasons why interactions of ABT-737 and ABT-263 with one subclass of antiapoptotic proteins lead to apoptosis, while those with the other subclass result in low-affinity binding, remain poorly understood (3). Here, we used computational docking to elucidate the structural basis of differences in the binding affinities of these ligands with the two subclasses of antiapoptotic proteins. The main problem of modern docking techniques is taking account of receptor flexibility. In the current work, a three-stage simulation protocol was used to take receptor flexibility into account. First, the AutoDock program (version 4.1) (4) was used with the multiple conformations of receptors (MCR) approach to account crudely for receptor flexibility.
To this end, a number of 3D structures from X-ray and NMR analyses of Bcl-xL, Bcl-2 and Mcl-1 were taken from the Protein Data Bank (sixteen for Bcl-xL, six for Bcl-2, and seven for Mcl-1) for the simulations of binding of the antiapoptotic proteins with BH3-ligands, with the side chains of critical protein residues within the major binding grooves treated as flexible. At the second stage, the structures with low energies of binding were subjected to refinement with the use of the program RosettaLigand (5) with full ligand and receptor flexibility. At the third stage, the several lowest-energy structures for each binding pair underwent MD refinement with the use of the GROMACS software (6), followed by a further round of RosettaLigand docking. Results of the simulations are shown in Table 1.

1 Institute of Biophysics and Cell Engineering, Minsk, Belarus. [email protected]

Table 1. Calculated binding energies of the antiapoptotic proteins Bcl-2, Bcl-xL, Mcl-1, A1 with ABT-737 and ABT-263

Receptor   Ligand     Binding energy         Refined binding energy
                      by AutoDock program
Bcl-xL     ABT-737    -14.2 kcal/mol         -19.1 kcal/mol
Bcl-2      ABT-737    -10.1 kcal/mol         -19.9 kcal/mol
Mcl-1      ABT-737     -3.7 kcal/mol         -15.2 kcal/mol
A1         ABT-737     -3.2 kcal/mol         -12.3 kcal/mol
Bcl-xL     ABT-263     -7.7 kcal/mol         -17.5 kcal/mol
Bcl-2      ABT-263     -7.2 kcal/mol         -16.9 kcal/mol
Mcl-1      ABT-263     -5.6 kcal/mol         -15.1 kcal/mol
A1         ABT-263     -4.5 kcal/mol         -12.8 kcal/mol

It was shown that both ABT-263 and ABT-737 are more deeply inserted into the grooves of Bcl-2 and Bcl-xL than into those of Mcl-1 and A1. The simulations showed that this is caused by the presence of threonine (Thr247) within the hydrophobic groove of Mcl-1 at the position occupied by alanine within the corresponding grooves of Bcl-2 and Bcl-xL (Ala149 of Bcl-2 and Ala142 of Bcl-xL), and by the presence of bulky residues within the hydrophobic groove of A1. Thr247 of Mcl-1 and the shallowness of the A1 hydrophobic groove prevent the ligands from inserting as deeply into the hydrophobic grooves of these proteins as they do in Bcl-2 and Bcl-xL.

1. V. Labi et al. (2008) Targeting the Bcl-2-regulated apoptosis pathway by BH3-mimetics: a breakthrough in anticancer therapy?, Cell Death Differ., 15: 977-987.
2. M. Vogler et al. (2009) Bcl-2 inhibitors: small molecules with a big impact on cancer therapy, Cell Death Differ., 16: 360-367.
3. E. F. Lee et al. (2007) Crystal structure of ABT-737 complexed with Bcl-xL: implications for selectivity of antagonists of the Bcl-2 family, Cell Death Differ., 14: 1711-1719.
4. F. Osterberg et al. (2002) Automated docking to multiple target structures: incorporation of protein mobility and structural water in AutoDock, Proteins, 46: 34-40.
5. I. W. Davis, D. Baker (2009) RosettaLigand Docking with full ligand and receptor flexibility, J. Mol. Biol., 385: 381-392.
6. B. Hess et al. (2008) GROMACS 4: Algorithms for Highly Efficient, Load-Balanced, and Scalable Molecular Simulation, J. Chem. Theory Comput., 4: 435-447.


CODON USAGE BIAS: BIOLOGICAL FUNCTION OR NEUTRAL MARKER? SVETLANA VINOGRADOVA 1, DMITRIY VINOGRADOV 1, ANDREY MIRONOV 1

In the standard genetic code, 18 of the 20 amino acids are encoded by more than a single codon, but in many organisms synonymous codons are not used with equal frequency. The biased use of synonymous codons can be explained in different ways. On the one hand, ‘non-optimal’ codons correlate with translational pause sites, and consecutive runs of ‘non-optimal’ codons reduce the rate of translation. These pause sites have an established role in the correct folding of proteins (Zalucki et al., 2007). It is also well known that, typically, highly expressed genes preferentially use a subset of ‘optimal’ codons (Sharp et al., 1993). On the other hand, the presence of ‘non-optimal’ codons could be explained by stochastic reasons, and such codons are then supposed to be neutral markers. Our aim was to analyze the codon usage variation on a genome-wide scale. We considered multiple alignments of proteins from PFam and built corresponding DNA alignments using the genome data. All codons were grouped into two classes according to their frequencies, ‘optimal’ and ‘non-optimal’. Then each codon in the alignments was converted into 0 or 1, depending on its class. For each column of the resulting matrices, the information content (IC) was calculated, and the positional IC was plotted for each alignment. Monte-Carlo analysis was performed to assess the statistical significance of the observed peaks. In most cases we observed a small number of positions with relatively high values of the IC. Most of them were caused by ‘non-optimal’ codons. Overall, the IC distribution can be considered random. We therefore propose that the biased use of synonymous codons is mainly a neutral marker, but that at some positions ‘non-optimal’ codons are functional.

This work was supported by Howard Hughes Medical Institute [grant number 55005610]; the Program ‘Molecular and Cellular Biology’ of the Russian Academy of Sciences; and Russian Foundation of Basic Research [grants number 09-04-92742, 07-04-91555].
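The per-column information content and its Monte-Carlo assessment can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the IC formula or the null model, so IC is taken here as 1 minus the Shannon entropy (in bits) of a binary column with a uniform background, the null model shuffles codon classes within each sequence, alignment gaps are ignored, and all function names are hypothetical.

```python
import math
import random


def column_ic(column):
    """Information content (bits) of a 0/1 column: 1 - Shannon entropy."""
    p1 = sum(column) / len(column)
    entropy = 0.0
    for p in (p1, 1.0 - p1):
        if p > 0:
            entropy -= p * math.log2(p)
    return 1.0 - entropy


def positional_ic(matrix):
    """IC for every column of a binary matrix (rows = aligned sequences)."""
    return [column_ic(col) for col in zip(*matrix)]


def monte_carlo_pvalues(matrix, n_iter=1000, seed=0):
    """Empirical p-value per column from row-wise shuffling (assumed null)."""
    rng = random.Random(seed)
    observed = positional_ic(matrix)
    exceed = [0] * len(observed)
    for _ in range(n_iter):
        shuffled = [rng.sample(row, len(row)) for row in matrix]
        for j, ic in enumerate(positional_ic(shuffled)):
            if ic >= observed[j]:
                exceed[j] += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return [(c + 1) / (n_iter + 1) for c in exceed]


# Toy alignment: 1 = 'non-optimal' codon, 0 = 'optimal' codon.
toy = [
    [0, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 1, 1, 1],
]
print(positional_ic(toy))
print(monte_carlo_pvalues(toy, n_iter=200))
```

Columns in which every sequence carries the same codon class reach the maximum IC of 1 bit, while a column split evenly between the two classes has an IC of 0.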

1 Faculty of Bioengineering and Bioinformatics, Moscow State University, Russian Federation, [email protected], [email protected]

1. P.M. Sharp, M. Stenico, J.F. Peden, A.T. Lloyd (1993) Codon usage – mutational bias, translational selection, or both? Biochemical Society Transactions, 21:835-841.
2. Y.M. Zalucki, M.P. Jennings (2007) Experimental confirmation of a key role for non-optimal codons in protein export, BBRC, 355: 143-148.


COLLAGEN-LIKE PATTERNS IN THE HUMAN GENOME ANNA V. VLASOVA 1, PETR K. VLASOV 1, NATALIA G. ESIPOVA 1, VLADIMIR G. TUMANYAN 1

Keywords: collagen, pattern, genome

Collagen proteins are found in all animals, constitute a substantial portion of the animal body, and make up about 25% of the total protein mass in mammals. Every collagen protein has a fibrillar («real collagen») region with three helical chains coiled around each other, and a globular (non-collagen) region, but the exact size and the relative ratio of these regions vary in different collagens. In addition, collagen-like segments occur in other proteins and, apparently, endow these proteins with a variety of specific functions. The regular structure of collagen may be described by a simple template, (Gly-X-Y)n, where X and Y are any residue. This pattern corresponds to a specific periodicity of the nucleotide coding sequence, in which the glycine codon occurs at every third codon, i.e., every nine nucleotides. This sequence periodicity pattern leads to unique sequence-structure interplay [8]. Given the importance of the collagen protein family and the high number of collagen genes, it is imperative to develop a method capable of searching for and recognizing collagen-like segments in genome sequences. Initial approaches to collagen gene scans in genomes were based on the standard BLAST algorithm. Special “collagen sequence patterns” were applied to find collagen-like segments in bacterial genomes. However, known collagen motifs have high diversity, and any sequence-specific pattern restricts the sensitivity of alignments. It is difficult to construct a universal collagen sequence query and to take into account the high sequence variability of different collagen genes. Thus, there are clear limitations of simple BLAST-based searches. However, the specific nucleotide periodicity of collagen genes mentioned above can be used to recognize similar patterns in the genome sequence. We implemented a new method for a thorough search for collagen-like patterns (CLPs) in any nucleotide sequence. Our approach correctly identified all annotated exons in the fibrillar region of a collagen gene.
Our program, CollagenFinder, unlike many standard gene prediction programs, can scan any nucleotide sequence regardless of length and nucleotide content to annotate the CLP regions. The results of our prediction were compared with the GenBank collagen annotation as well as with the results of the popular gene prediction program Genscan. A high level of correspondence between CLP prediction and collagen gene annotation provides evidence for the high accuracy of our approach. Indeed, our program recognized 85% of all collagen exons, compared with 60% for Genscan. The prediction results show that our proposed approach can indeed improve collagen gene prediction accuracy, and that it is best to combine the standard Genscan method with our approach. Our method marks CLPs in the coding regions of many non-collagen proteins. Some of these proteins have annotated collagen fragments (e.g., acetylcholinesterase), but most CLPs were found in gene (exon/intron) regions that are not annotated as encoding collagen-containing proteins. The annotation of these CLPs gives additional information about the structural and functional roles of human proteins. Interestingly, many more CLPs were found in intergenic regions. The functional or evolutionary role of these regions remains unknown. The results obtained indicate the existence of a strong specific periodicity throughout the human genome. According to our results, the human genome has numerous regions with 9-nucleotide periodicity that correspond to CLPs. Being highly sequence-specific, predicted CLP regions may serve as useful markers for identification of various genome regions with divergent biological functions.

1 Engelhardt Institute of Molecular Biology RAS, Russian Federation, [email protected], [email protected], [email protected], [email protected]
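The 9-nucleotide periodicity implied by the (Gly-X-Y)n template can be illustrated with a simple scan for glycine codons (GGN) recurring every nine nucleotides. The sketch below is not the CollagenFinder algorithm, whose details are not given in the abstract; the function name, the `min_repeats` threshold and the toy sequence are hypothetical.

```python
def find_clp_regions(seq, min_repeats=4):
    """Return (start, end) spans where a GGN codon recurs every 9 nt.

    A span is reported when at least `min_repeats` consecutive 9-nt
    units begin with the dinucleotide GG (any glycine codon, GGN).
    """
    seq = seq.upper()
    regions = []
    for offset in range(9):  # try every phase of the 9-nt period
        run_start, count = None, 0
        for i in range(offset, len(seq) - 8, 9):
            if seq[i:i + 2] == "GG":
                if run_start is None:
                    run_start = i
                count += 1
            else:
                if count >= min_repeats:
                    regions.append((run_start, run_start + 9 * count))
                run_start, count = None, 0
        if count >= min_repeats:  # run that reaches the end of the sequence
            regions.append((run_start, run_start + 9 * count))
    return sorted(regions)


# Toy sequence: five Gly-Pro-Pro units (GGT CCT CCA) with AT-rich flanks.
toy = "ATATAT" + "GGTCCTCCA" * 5 + "ATATAT"
print(find_clp_regions(toy))
```

A scan of this kind ignores the X and Y positions entirely, which is why a periodicity-based search can tolerate the sequence diversity that limits pattern-specific BLAST queries.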

1. R.E. Burgeson, M.E. Nimni (1992) Collagen types. Molecular structure and tissue distribution, Clin. Orthop., 282: 250-272.
2. D.J.S. Hulmes (1992) The collagen superfamily - diverse structures and assemblies, Essays Biochem., 27: 49-67.
3. M. Rasmussen, M. Jacobsson, L. Bjorck (2003) Genome-based Identification and Analysis of Collagen-related Structural Motifs in Bacterial and Viral Proteins, J Biol Chem., 278(34): 32313–32316.
4. S.K. Gara, P. Grumati, A. Urciuolo, P. Bonaldo, B. Kobbe, M. Koch, M. Paulsson, R. Wagener (2008) Three novel collagen VI chains with high homology to the alpha3 chain, J Biol Chem., 283(16): 10658-10670.
5. M.A. Fox (2008) Novel roles for collagens in wiring the vertebrate nervous system, Curr Opin Cell Biol., 20(5): 508-513.


INTERRELATION BETWEEN THE TRANSLATION INITIATION SIGNAL AND THE N-END OF ENCODED PROTEIN IN HUMAN mRNA OXANA VOLKOVA 1, ALEX KOCHETOV 1

Accurate prediction of the efficiency of the eukaryotic translation initiation signal is important for evaluating both the mRNA coding potential and the translation rate. However, this task has not yet been solved. The recognition of the AUG triplet as a translation initiation site depends on its nucleotide context. It is known that nucleotide frequencies in positions surrounding the start AUG codon are highly biased. The relative importance of a few context positions (-3 and +4) has been shown experimentally. It is widely accepted that a purine in pos. -3 and guanine in pos. +4 make the context optimal (i.e., recognized by virtually all ribosomes as a start site). The roles of other context positions are still unknown. This especially concerns the 3’ part of the context, located at the beginning of the CDS. There are conflicting opinions, in favour of the functional significance [1] or insignificance [2] of the nucleotides in positions +4, +5, +6. The problem is complicated by the dual character of this mRNA segment: it belongs to both the protein coding sequence (second codon of the CDS) and the translation initiation signal, and its structure was formed under the influence of various factors. In addition, some N-end amino acids could influence protein stability and certain posttranslational modifications. To analyze the role of the nucleotides in positions +4, +5, +6 in translation initiation, we performed a comparative analysis of a human mRNA sample (24154 nucleotide sequences) characterized by either optimal or suboptimal nucleotides in the key position -3. It is believed that translation initiation signals (TIS) with purines and pyrimidines in pos. -3 can be roughly classified as “more optimal” and “less optimal”, respectively [1]. We hypothesized that if the nucleotides in pos. +4, +5, +6 participate in start codon recognition, these samples will differ.
Comparative statistical analysis of nucleotide, codon and amino acid frequencies in the second positions of CDS and corresponding proteins showed that: 1. The context variant considered to be “optimal” (RnnAUG, R=purine) is heterogeneous with respect to the nucleotide preferences in pos. +4, +5, +6: it is likely that G in pos. +4 is important only for functioning of the GnnAUG context variant. This unexpected finding means that the TIS variant AnnAUGG has no preference over AnnAUGN. In turn, the suboptimal context variants YnnAUG (Y=pyrimidine) were also characterized by a significant over-representation of G+4. 2. The second position of the amino acid sequences differs between proteins encoded by mRNA samples with different start codon contexts. Notably, the proteins encoded by mRNAs with the AnnAUG context were characterized by a specific and significant over-representation of serine, whereas the presence of the GnnAUG context correlated with a higher occurrence of alanine and glycine. It is likely that serine in the 2nd protein position can facilitate the translation initiation efficiency of human mRNAs with the AnnAUG context variant. In turn, over-representation of Ala and Gly, correlated with the presence of the GnnAUG context, could be more dependent on selection at the level of nucleotides in positions +4, +5, +6. The observed statistical phenomenon can result from a specific functional relationship between Ser in the 2nd position of the protein and the variant of start codon context with A in position -3. For example, it may be assumed that the formation of the first peptide bond between the initiator Met and Ser allows the ribosome to avoid some steric constraints resulting from the presence of A in position -3. In turn, over-representation of Ala and Glu in the 2nd position of proteins encoded by the GnnAUG mRNA subsample is likely to be less dependent on selection at the level of amino acids and can reflect noticeable selection at the level of nucleotides in positions +4, +5, +6. Thus, the over-representation of G in position +4 could reflect (at least in part) the functional significance of this nucleotide for the recognition of GnnAUG and YnnAUG by eukaryotic ribosomes as translation start sites.

1 Institute of Cytology and Genetics SB RAS & Novosibirsk State University, Novosibirsk, Russia, [email protected], [email protected]
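The grouping of mRNAs by start codon context used in this abstract can be sketched as a small classifier. This is an illustrative sketch, not the authors' code; the function name and the example sequence are hypothetical. It assumes the usual numbering in which the A of AUG is position +1 and the nucleotide immediately upstream is -1, so position -3 lies three nucleotides before the A and positions +4, +5, +6 form the second codon.

```python
PURINES = {"A", "G"}


def tis_context(mrna, aug_index):
    """Describe the context of the start codon beginning at `aug_index`.

    Returns the nucleotides at positions -3 and +4, the context variant
    (AnnAUG, GnnAUG or YnnAUG) and the second codon (positions +4..+6).
    """
    mrna = mrna.upper().replace("T", "U")
    if mrna[aug_index:aug_index + 3] != "AUG":
        raise ValueError("no AUG at the given index")
    if aug_index < 3 or aug_index + 6 > len(mrna):
        raise ValueError("sequence too short around the start codon")
    minus3 = mrna[aug_index - 3]
    second_codon = mrna[aug_index + 3:aug_index + 6]
    prefix = minus3 if minus3 in PURINES else "Y"
    return {
        "minus3": minus3,
        "plus4": second_codon[0],
        "variant": prefix + "nnAUG",
        "second_codon": second_codon,
    }


# Hypothetical example: A in pos. -3 and G in pos. +4.
example = tis_context("GCCACCAUGGCG", 6)
print(example)
```

Tabulating `variant` against `plus4` or against the translated `second_codon` over a set of mRNAs gives the kind of frequency comparison described in the abstract.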

This work was supported by RFBR (08-04-00525), Ministry of Science & Education (3H-324-09) and the Program of RAS (Molecular and Cellular Biology). We also thank SD RAS Complex Integration Program for partial support.

1. Kozak, M. (2005) Regulation of translation via mRNA structure in prokaryotes and eukaryotes, Gene, 361: 13-37.
2. Xia, X. (2007) The +4G site in Kozak consensus is not related to the efficiency of translation initiation, PLoS ONE, 2: 188.


IMPROVED PREDICTION OF HUMAN miRNAs BASED ON CONTEXT-STRUCTURAL HMM PAVEL VOROZHEIKIN 1, A.I. KULIKOV 1, IGOR I. TITOV 2

Keywords: hidden Markov model, miRNA, secondary structure

miRNAs are a large family of small noncoding RNAs that control mRNA expression either by cleavage or by translation arrest. These single-stranded RNAs of 19-25 nt in length are processed by Dicer from miRNA precursors, which form a stem-loop secondary structure. Identification of novel miRNAs has recently become an important approach towards understanding posttranscriptional gene regulation. While cloning methods have successfully identified highly expressed miRNAs from various tissues, computational prediction could become a reliable approach for tissue-specific or lowly expressed miRNAs. Most computational methods for miRNA prediction have focused on the search for close homologs among related miRNAs. To find distant miRNAs as well as close homologs, Nam and coauthors [1] suggested a probabilistic co-learning method based on a paired hidden Markov model (HMM) for miRNA genes. It combines sequence and structural information on the precursor stem to calculate the Dicer cleavage site probability. While the method performed well on the first datasets (probably capturing the main factors of miRNA expression level), we found that it frequently fails to detect recently discovered miRNAs, even after re-learning on the latest miRNA sets. The original method frequently recognizes an inner or bulge loop in the precursor stem as the miRNA boundary, which is often the case (Fig.1), while neglecting the excessive loop frequency in the miRNA duplex center (Fig.1). We modified the algorithm of Nam et al. [1] to take into account the loop frequency distribution and found that the modified algorithm shows better prediction, although it underperforms on the first miRNA sets.

1 Novosibirsk State University, Novosibirsk, Pirogova st., 2, [email protected]
2 Institute of Cytology and Genetics, Russian Federation, [email protected]

Fig.1. Loop frequencies within 3’ strand of human miRNA genes. MiRNAs are aligned by 5' ends.
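A loop-frequency profile of the kind summarized in Fig.1 could be computed along the following lines. This is only a minimal sketch, not the authors' implementation: it assumes each 3' strand is given as a dot-bracket string (')' for a paired base, '.' for an unpaired one) that starts at the 5' end of the miRNA, and the function name and toy structures are hypothetical.

```python
from typing import List


def loop_frequency(structures: List[str]) -> List[float]:
    """Fraction of strands that are unpaired ('.') at each position.

    Position 0 is the 5' end of the miRNA; strands shorter than a given
    position simply do not contribute to that position.
    """
    max_len = max(len(s) for s in structures)
    freqs = []
    for pos in range(max_len):
        covering = [s for s in structures if len(s) > pos]
        unpaired = sum(1 for s in covering if s[pos] == ".")
        freqs.append(unpaired / len(covering))
    return freqs


# Hypothetical dot-bracket fragments for the 3' strand of three precursors
toy = ["))))"  + "." + ")))))", ")))" + "." + "))))", ")))" + ".." + ")))))"]
profile = loop_frequency(toy)
for pos, freq in enumerate(profile):
    print(pos, round(freq, 2))
```

Such a per-position profile is the kind of prior that a modified model could weigh against candidate miRNA boundaries; how the authors actually incorporate it into the HMM is not specified here.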

The work was supported by RAS Program #22 "Molecular and cell biology" (Project #8 "System biology: computational and experimental approaches").

1. Jin-Wu Nam, Ki-Roo Shin, Jinju Han, Yoontae Lee, V. Narry Kim and Byoung-Tak Zhang (2005) Human microRNA prediction through a probabilistic co-learning model of sequence and structure, Nucleic Acids Research, 33(11):3570-3581.

GENETICS OF VARIATION OF COPIA SUPPRESSION IN DROSOPHILA MELANOGASTER. WENDY VU 1, SERGEY NUZHDIN 1

Keywords: Retrotransposon copia, QTL mapping, transcription, plasmid DNA, transposition, Drosophila Melanogaster

Transposable elements (TEs) are genomic parasites that survive by exploiting the reproductive mechanisms of their hosts. However, some hosts within a population have evolved the ability to silence TE activity, while others lack suppression and allow TE activity. We are interested in investigating the host silencing mechanism of the copia long terminal repeat retrotransposon and its population variation in Drosophila melanogaster. Here we identified large-effect genes involved in copia suppression by using a semi-quantitative analysis to assay levels of copia plasmids (an intermediate believed to lead to transposition) in ninety-eight recombinant inbred lines constructed from a line exhibiting high copia transposition and a line exhibiting no transposition. The results revealed that the influence of copia copy number and transcription level on copia plasmid concentrations is weak and that genomic factors, presumably encoded by the host, have stronger effects on transposition rates. We mapped a QTL affecting copia plasmid concentration in the 33A-43E interval and applied a quantitative deficiency complementation analysis to this chromosomal region. One of the two deficiencies with large effects on copia plasmid concentrations corresponded to a gene called vasa, an important component of the nuage-piRNA TE silencing machinery. Therefore, we hypothesize that copia suppression occurs by the joint action of several post-transcriptional mechanisms, with at least one of the blocks taking place in the nuage.

1 University of Southern California, United States, [email protected], [email protected]

XIPPI: INTEGRATING INFORMATION ON PROTEIN-PROTEIN INTERACTIONS YURI VYATKIN 1, DMITRY AFONNIKOV 1

Keywords: protein-protein interactions, databases, information integration

Current experimental techniques for protein-protein interaction (PPI) detection have led to a substantial increase in the volume of available data. The information is spread over a number of databases with different data formats, interaction descriptions and protein sequence referencing. This complicates information processing, especially the preparation of consistent sets of PPIs. An efficient way to overcome these difficulties is data integration [1]. However, the problem of redundant protein sequence identifiers in datasets remains. For example, a protein in a PPI database can be referenced by several GI or accession numbers because several isoforms exist. Another source of inconsistency is the redundancy of protein IDs in sequence databases [2]. In this work, we suggest a tool, XIPPI (eXtended Indexing for Protein-Protein Interactions), to overcome the problem of protein sequence redundancy in PPI databases. This tool merges a set of currently available PPI databases and allows building non-redundant datasets of interacting proteins from different databases. XIPPI merges information from BioGrid, DIP, MINT, IntAct, Reactome, HPRD, and BOND (formerly BIND). The following PPI information is stored for each record: the identifiers of the interacting partners and their synonyms (e.g. RefSeq, UniProt and NCBI GI identifiers for the same protein); the species in which the interaction occurs; PubMed IDs of the papers in which the interaction was determined or verified; the interaction type and the method used to determine it; and a link to the source database from which the interaction was taken. While each standalone database is of good quality, the quality of datasets taken from several databases at once has to be critically assessed, so we searched for possible contradictions and created an automated framework for eliminating them.
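The idea of collapsing redundant identifiers before merging interaction records can be illustrated with a small sketch. It is not the XIPPI implementation, which is not described in detail here: it assumes synonym groups are already known, uses a union-find structure to pick one representative per protein, and all identifiers and function names below are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, Sequence, Set, Tuple


class IdIndex:
    """Union-find over protein identifiers; synonyms share one representative."""

    def __init__(self) -> None:
        self.parent: Dict[str, str] = {}

    def find(self, x: str) -> str:
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a: str, b: str) -> None:
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            # Keep the lexicographically smaller id so the result is deterministic
            low, high = sorted((ra, rb))
            self.parent[high] = low


def merge_interactions(
    synonyms: Iterable[Sequence[str]],
    interactions: Iterable[Tuple[str, str]],
) -> Set[FrozenSet[str]]:
    """Return unordered interaction pairs expressed in representative ids."""
    index = IdIndex()
    for group in synonyms:
        for other in group[1:]:
            index.union(group[0], other)
    return {frozenset((index.find(a), index.find(b))) for a, b in interactions}


# Hypothetical identifiers: two proteins, each known under several ids
synonym_groups = [["gi:1", "uniprot:P1", "refseq:NP_1"], ["gi:2", "uniprot:P2"]]
records = [
    ("gi:1", "gi:2"),                # e.g. reported by one source database
    ("uniprot:P2", "refseq:NP_1"),   # the same pair under other ids
    ("uniprot:P1", "gi:3"),
]
merged = merge_interactions(synonym_groups, records)
print(len(merged))
```

In this toy input the first two records describe the same pair of proteins, so they collapse into one non-redundant interaction.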
The XIPPI database allows performing the following tasks:
• checking the information on given interactions (one or a list of them) in different databases to increase the reliability of information;
• outputting protein IDs for PPI datasets as GI, UniProt, RefSeq, and PDB accession numbers;
• searching for all possible interactions available for a given protein identifier;
• building datasets of protein interactions for given species names.

The XIPPI tool is available on request.

Acknowledgements. This work was supported by interdisciplinary integration projects of SB RAS #26, 109, 113, 119, State Contract with the PIN on the RAS Presidium Basic Research Program, Subprogram 2 “The Origin and Evolution of Biosphere”, NSh-2447.2008.4. Scientific School “Bioinformatics and Computational Systems Biology”.

1 Institute of Cytology and Genetics SB RAS, Russian Federation, [email protected], [email protected]

1. Stark C. et al. (2006) Biogrid: A General Repository for Interaction Datasets. Nucleic Acids Res. 34:D535-9
2. Glynn Dennis, Jr et al. (2003) DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol. 4(9): R60

RATE OF EVOLUTION OF PROTEIN-CODING GENES AND THE GENERALIZED MISTRANSLATION-INDUCED MISFOLDING HYPOTHESIS YURI WOLF 1, IRINA GOPICH 2, EUGENE KOONIN 3

Keywords: evolutionary rate, expression level, mistranslation-induced misfolding

We used proteomic data to reexamine the correlations between protein abundances and evolutionary rates of protein-coding genes in an attempt to quantitatively assess the importance of structural-functional constraints (SFC) and protein abundance as determinants of the evolutionary rate. We show that the correlation between lineage-specific, short-term evolutionary rates of orthologous genes in nematodes and flies is much lower than the correlation between the respective protein abundances. A mathematical model was developed to estimate the relative contributions of SFC and the amplifying effect of the expression level (amplification by expression, ABE) to the evolution rate of protein-coding genes from the correlations between lineage-specific, short-term evolutionary rates of orthologous genes in nematodes and flies, and the respective protein abundances. We find that combined SFC and ABE account for approximately 50% of the variance of the evolutionary rates and that the contribution of SFC is likely to be 2 to 4-fold greater than the contribution of ABE. The mistranslation-induced misfolding (MIM) hypothesis posits that the sequence evolutionary rate is controlled largely through selection for the robustness of both the correctly translated protein and the entire ensemble of its mistranslated variants to misfolding. Our findings are compatible with a generalized MIM hypothesis under which the SFC is the primary determinant of robustness, whereas the expression level modulates the actual misfolding cost incurred in the course of (mis)translation.

1 NCBI/NLM/NIH, United States, [email protected]
2 NIDDK/NLM/NIH, United States
3 NCBI/NLM/NIH, United States

STABILIZATION OF SEPARATED CHARGES IN REACTION CENTERS OF BACTERIAL PHOTOSYNTHESIS A.G. YAKOVLEV 1, V.A. SHUVALOV 1,2

Keywords: photosynthesis, reaction center, electron transfer, charge separation

The reaction center (RC) of photosynthesis is a pigment-protein complex in which light energy is converted into the free chemical energy of the charge-separated states. The state P+BA- can be the first charge-separated state. Stabilization of the separated charges on P+ (primary electron donor, bacteriochlorophyll dimer) and BA- (primary electron acceptor, monomeric bacteriochlorophyll) is studied in terms of the participation of the OH group of TyrM210. TyrM210 is separated by 4.8 Å from the C atom of ring IV of PA and by 4.7 Å from the N atom of ring II of BA. For a dynamic stabilization of the state P+BA-, two possibilities can be taken into account. (i) An electron from P* is transferred to a higher vibrational level on the potential energy surface of the P+BA- state and then goes down to the lowest level by vibrational relaxation. This situation requires a non-symmetrical arrangement of the potential energy surfaces of P* and P+BA-. A relatively slow electron transfer from P* to BA may take place as a result. (ii) Stabilization occurs as a result of reorientation of the surrounding groups during the reversible formation of the P+BA- dipole. In the latter case, a symmetrical arrangement of the potential surfaces of the P* and P+BA- states and the maximal possible rate of electron transfer between P* and BA can be achieved. Excitation of P by femtosecond light pulses with a broad spectrum creates a coherent nuclear wavepacket which moves in an oscillatory manner on the P* potential energy surface. When the wavepacket approaches the crossing region between the P* and P+BA- surfaces, both states, P* and P+BA-, are observed. The wavepacket is then reflected back to the pure P* surface if there are no additional changes in the surrounding nuclear configuration. These changes can be induced by reorientation of a surrounding polar group such as Oδ−Hδ+ of TyrM210 in Rba. sphaeroides (TyrM208 in Rps. viridis).
In the absence of tyrosine YM210, seven periods of 215 fs oscillations at 1020 nm (the absorption band of the anion BA-) are observed within ~1.5 ps in the RCs from YM210W(L) mutants of Rba. sphaeroides without stabilization of the state P+BA- in the ps time domain. Such stabilization could be due to the motion of Hδ+ of OH-TyrM210 towards BA-, which could lower the energy of P+BA- with respect to that of P*. The energy difference in the system was estimated for two positions of Hδ+ of OH-TyrM210 with respect to PA and BA. In the first, neutral position, the Oδ−Hδ+ dipole of TyrM210 is perpendicular to the line connecting C-N(IV) of PA and N(II) of BA, which are the closest neighbours of TyrM210 and carry positive and negative charges, respectively, in the state P+BA-. One can assume that this position corresponds to the neutral states PBA or P*BA. In the second position, Hδ+ of OH-TyrM210 is on the line connecting Oδ− of OH-TyrM210 and N(II) of BA. This position can be realized when Hδ+ is attracted by BA- and repelled by PA+. The energy difference between the two positions was estimated to be ~900 cm-1 on the basis of the Coulomb interaction of Oδ−Hδ+ of TyrM210 with BA- and P+. The experimental energy difference between P* and P+BA- in the stabilized state P+BA- in Pheo-modified RCs was found to be 350-550 cm-1. So the calculated energy difference is enough to stabilize an electron on BA- if Hδ+ of OH-TyrM210 is shifted towards BA- during the primary charge separation between P and BA.
The estimates of the distance covered by Hδ+ during each appearance of the wavepacket between the P* and P+BA- surfaces show that in more than 50% of RCs an electron is stabilized on BA within ~1 ps due to a shift of Hδ+. Note that the attraction and repulsion of Hδ+ of OH-TyrM210 by BA- and PA+, respectively, occur when P+BA- is formed and are absent in the neutral state P*. The stabilization time increases with temperature owing to the interaction of Hδ+ of OH-TyrM210 with phonons.

1 Laboratory of Photobiophysics, Belozersky Institute of Chemical and Physical Biology of Moscow State University, Moscow 119899, Russian Federation. E-mail: [email protected]
2 Institute of Basic Biological Problems, Russian Academy of Sciences, Pushchino, Moscow Region 142290, Russian Federation. E-mail: [email protected]
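For orientation, a point-charge Coulomb estimate of the kind mentioned above can be expressed in wavenumbers as follows. This is a rough sketch under stated assumptions, not the authors' calculation: it treats only the Hδ+ partial charge (the Oδ− end of the dipole is ignored), models BA- and PA+ as unit point charges, and the partial charge, distances and dielectric constant passed in the usage line are hypothetical placeholders.

```python
COULOMB_EV_ANGSTROM = 14.3996  # e^2 / (4*pi*eps0), in eV * Angstrom
EV_TO_WAVENUMBER = 8065.54     # 1 eV expressed in cm^-1


def coulomb_cm1(q1: float, q2: float, r_angstrom: float, eps_r: float = 1.0) -> float:
    """Coulomb energy (cm^-1) of two point charges given in units of e."""
    if r_angstrom <= 0 or eps_r <= 0:
        raise ValueError("distance and dielectric constant must be positive")
    return COULOMB_EV_ANGSTROM * EV_TO_WAVENUMBER * q1 * q2 / (eps_r * r_angstrom)


def shift_energy_cm1(
    q_h: float,
    r_ba_before: float,
    r_ba_after: float,
    r_pa_before: float,
    r_pa_after: float,
    eps_r: float = 1.0,
) -> float:
    """Energy change when a partial charge q_h moves relative to BA- and PA+.

    BA- is modelled as a -1 point charge and PA+ as a +1 point charge.
    A negative result means the shifted position is lower in energy.
    """
    before = coulomb_cm1(q_h, -1.0, r_ba_before, eps_r) + coulomb_cm1(q_h, 1.0, r_pa_before, eps_r)
    after = coulomb_cm1(q_h, -1.0, r_ba_after, eps_r) + coulomb_cm1(q_h, 1.0, r_pa_after, eps_r)
    return after - before


# Hypothetical inputs: partial charge 0.4 e moving 1 Angstrom towards BA-
delta = shift_energy_cm1(0.4, 4.7, 3.7, 4.8, 5.8, eps_r=2.0)
print(delta < 0)
```

The sign convention makes a stabilizing shift negative; the magnitude depends strongly on the assumed partial charge, geometry and dielectric constant, none of which are given in the abstract.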

The work was done with the financial support of Russian Basic Research Foundation (grant N 08-04-00888).

A NOVEL PROMOTER OF THE ESCHERICHIA COLI YFIA GENE AND PATHWAYS OF ITS REGULATION UNDER OXIDATIVE STRESS CONDITIONS T. M. KHLEBODAROVA 1, A. V. ZADOROZHNY 1, V. A. LIKHOSHVAI 1, N. V. TIKUNOVA 1, D. YU. OSHCHEPKOV 1, N. A. KOLCHANOV 1

The yfiA gene in Escherichia coli encodes the pY (RaiA) protein, which stabilizes ribosome structure and is involved in the regulation of translation elongation under stress conditions. Microarray studies indicate that yfiA is sensitive to various environmental factors, including oxidative stress. Previously, the yfiA gene was thought to belong to the operon containing the ectD gene. However, according to microarray data, these genes respond differently to anoxic conditions and to the presence of the FNR transcription factor (TF). Moreover, we found a putative Rho-independent transcription terminator 30 nt downstream of the end of the ectD ORF. We therefore suggested that yfiA has a promoter of its own and identified a putative regulatory region of yfiA, stretching from the end of the ectD gene to the ATG codon of yfiA. This sequence was fused with the reporter gfp gene to produce the E. coli/pYfi-gfp genosensor. This genosensor responded to oxidative stress, but the processes mediating this response were unknown. We used experimental values of the fluorescence of the genosensor cells at the maximum response to H2O2 [1] and approximation by rational polynomials [2] to assess the complexity of the H2O2-dependent regulation of the yfiA promoter. The following rational polynomial describes the dependence of the promoter efficiency Vn,m(s) on the H2O2 concentration:

n n m  s    s   s  Vs()1=++ wv    1 +   1 +    , nm, k   k  k   1    1   2   where kkk111 and kkk 222 are constants with the dimension of concentration that determine the effect of the factor on promoter operation, www is the ratio between the promoter activity without H 2O2 and the background fluorescence  1 Institute of Cytology and Genetics SB RAS, 630090, Novosibirsk, Lavrentiev av. 10, [email protected] 376 4-TH MOSCOW CONFERENCE ON COMPUTATIONAL MOLECULAR BIOLOGY July 20–23, 2009 signal, and vvv is the ratio between the transcription initiation efficiency constant of the activated promoter and the background fluorescence signal. The nnn and mmm values illustrate the complex regulation of the promoter by H 2O2: the larger they are, the more complex regulation is expected. Simulation showed that the maximum approximation to the experimental data was observed in a system modeling complex regulation, probably, mediated by several TFs (Fig. 1).

[Figure 1: plot of fluorescence (arbitrary units) versus H2O2 concentration (mM); panel label: n=1.3, m=2.8]

Fig. 1 Course of yfiA expression under oxidative stress. Dots indicate experimental relative activities of the yfiA promoter after 60-min exposure to various H2O2 concentrations [1]. The solid curve presents calculation by the normalized model Vn,m(s)/Vn,m(so), so=0.5nM. The following parameters were taken in the calculation: w=0, v=111, k1=8.1, k2=0.97, n=1.35, m=2.6.
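The normalized curve described in the caption can be sketched numerically. The sketch below is illustrative only: it assumes the model has the form V(s) = 1 + (w + v(s/k1)^n) / ([1 + (s/k1)^n][1 + (s/k2)^m]), which is a reading of the definitions of w, v, k1 and k2 and may differ from the authors' exact expression, and it takes s0 = 0.5 in the same concentration units as s. The parameter values are those quoted in the caption; the function names are hypothetical.

```python
def promoter_response(s: float, w: float, v: float, k1: float, k2: float,
                      n: float, m: float) -> float:
    """Assumed form of V_{n,m}(s): background (1) plus promoter term."""
    activation = (s / k1) ** n   # activating branch, scale k1
    repression = (s / k2) ** m   # repressing branch, scale k2
    return 1.0 + (w + v * activation) / ((1.0 + activation) * (1.0 + repression))


# Parameter values quoted in the Fig. 1 caption
FIG1_PARAMS = dict(w=0.0, v=111.0, k1=8.1, k2=0.97, n=1.35, m=2.6)
S0 = 0.5  # normalization concentration, assumed to share the units of s


def normalized_response(s: float) -> float:
    """V(s) / V(s0), the quantity plotted as the solid curve."""
    return promoter_response(s, **FIG1_PARAMS) / promoter_response(S0, **FIG1_PARAMS)


for s in (0.5, 1.0, 5.0, 10.0, 20.0):
    print(s, round(normalized_response(s), 3))
```

Under this assumed form the response rises above its value at s0 for concentrations near k2 and falls below it at high concentrations, because the repressing factor with the larger exponent m dominates there.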

This prompted us to seek potential transcription factor binding sites (TFBSs) in the regulatory region of yfiA by the SITECON method, which recognizes conserved physicochemical and conformational features of TFBSs [3]. High reliability of the prediction with regard to type II error probability (false negatives) was shown for the binding sites of the MarA, IscR, MetJ, PurR, and SoxS TFs, which are directly or indirectly involved in the response to oxidative stress (Table 1). We applied these data to reconstruct the structure of the E. coli yfiA promoter and to predict its regulation pathways under oxidative stress (Fig. 2).

Table 1. Putative TFBSs found in the yfiA promoter by the SITECON method

       Parameters of putative binding sites
TF     Positions            P*             Type I error     Type II error
PurR   -125/-98             0.819          0.2500           1/2776
MetJ   -111/-91; -31/-12    0.737; 0.712   0.5000; 0.4583   1/11106; 1/3570
MarA   -78/-47              0.827          0.7500           1/19987
SoxS   -74/-48              0.740          0.5556           1/1851
IscR   -73/-32              0.756          0.2500           1/2776

P*, conformational similarity to known binding sites.

[Figure 2: regulatory scheme linking H2O2 through OxyR, SoxR, SoxS, Fur, IscR, MetJ, MarA and PurR to the binding sites (PurR, MetJ, IscR, SoxS/MarA, MetJ) on the yfiA promoter; legend: activation, repression, de-repression of transcription]

Fig. 2. Putative pathways of yfiA (raiA) regulation in E.coli under oxidative stress conditions. Arrows indicate regulatory events: solid line – activation, dotted line– repression, dots – de-repression.

The presence of putative TFBSs for IscR and SoxS on the yfiA promoter was confirmed experimentally by the EMSA method.

This work was supported by the RFBR, project 08-04-01008; Program of the RAS on molecular and cell biology, No. 10.7; Programs of the SB RAS, Nos. 107 and 119; and grant for scientific schools No. NSh-2447.2008.4.

1. Tikunova N.V., Khlebodarova T.M. et al. (2007) Dokl. Biochem. Biophys., 417: 357-361.
2. Likhoshvai V. and Ratushny A. (2007) J. Bioinform. Comput. Biol., 5: 593-610.

A COMPARATIVE VIEW ON microRNA GENES IN ANIMAL GENOMES

EVGENY ZDOBNOV 1

Keywords: microRNA, comparative genomics

MicroRNAs (miRNAs) are short, non-protein-coding RNAs that direct the widespread phenomenon of post-transcriptional regulation of metazoan genes. The mature ~22 nt RNAs are processed from genome-encoded stem-loop structured precursor genes (pre-miRNA). Hundreds of such genes have been experimentally validated in vertebrate genomes, yet their discovery remains challenging, and substantially higher numbers have been estimated. Our computational survey of microRNA genes in over 40 animal species, miROrtho [1], is intended to be conceptually complementary to the miRBase catalog of experimentally verified miRNA sequences. We devised a pipeline consisting of an ab initio SVM predictor of potential stem-loop structures, an orthology delineation step [2], and an SVM classifier of alignments of miRNA families. The pipeline provides a consistent comparative genomics perspective and identifies additional novel miRNA genes with strong evolutionary support. Hundreds of human miRNA genes have been experimentally validated, yet the functions of only a handful of them have been characterized. The majority of animal miRNAs have only limited complementarity to their experimentally verified targets, requiring interaction of only several nucleotides. Despite remaining a very hot topic of research in many leading laboratories, deciphering microRNA/mRNA regulation remains a mammoth challenge, as the outputs of independent analyses of common datasets vary significantly, with little overlap. We are also developing novel computational approaches that use microRNA/mRNA co-folding, the power of comparative analysis among multiple genomes and the fast-growing amount of functional genomics data to elucidate microRNA targets (e.g. to statistically test the correlation between the effect of experimental miRNAome perturbation on the transcriptome and the predicted potential miRNA-mRNA interactions [3, 4]).
Once our understanding of microRNAs matures, we will be able to approach questions similar to those we ask about protein-coding genes, e.g. [5], aiming to quantify the global evolutionary trends of the microRNA'omes and their regulatory networks.

1 University of Geneva Medical School, Switzerland, [email protected]

Acknowledgements: Daniel Gerlach, Evgenia Kriventseva, Nazim Rahman, Charles Vejnar.

1. Gerlach, D., et al., miROrtho: computational survey of microRNA genes. Nucleic Acids Research, 2009. 37: p. D111-D117.
2. Kriventseva, E.V., et al., OrthoDB: the hierarchical catalog of eukaryotic orthologs. Nucleic Acids Research, 2008. 36: p. D271-D275.
3. Papaioannou, M.D., et al., Sertoli cell Dicer is essential for spermatogenesis in mice. Developmental Biology, 2009. 326(1): p. 250-259.
4. Gatfield, D., et al., Integration of microRNA miR-122 in hepatic circadian gene expression. Genes & Development 2009. 23(11).
5. Wyder, S., et al., Quantification of ortholog losses in insects and vertebrates. Genome Biology, 2007. 8(11): p. -.

FINDING MEANINGFUL STRUCTURES IN HIGH-THROUGHPUT DATA: FROM PRINCIPAL TREES TO SPECTRAL FILTERING ON GRAPHS ANDREI ZINOVYEV 1, ALEXANDER GORBAN 2, EMMANUEL BARILLOT 1, JEAN-PHILIPPE VERT 3

In our presentation we will give a short overview of the ideas, methods and software for exploratory high-throughput data analysis that were developed and used by the authors in the projects of the Bioinformatics Laboratory of Institut Curie (Paris, France). Data exploration is an important stage in any project on high-throughput data analysis. It allows one, first, to eliminate evident artifacts and estimate data quality and, second, to find geometrical structures in the distribution of data points that can be interpreted using biological knowledge or clinical information. We argue that the classical supervised machine learning setting in functional genomic studies is prone to the problems of the curse of dimensionality, poor reproducibility and unreliable or biased sample labeling. Methods of geometrical data analysis allow one to extract significant signals contained in the data with the goal of matching them with available external information. If such a match is successful, the conclusions are in general more robust to re-testing in different technical settings or experimental conditions. We start with a discussion of the most widely used unsupervised learning methods, such as hierarchical clustering and principal component analysis, and briefly discuss their properties, drawbacks and impact on research in molecular biology. This discussion is followed by a presentation of recently developed methods that extend the classical ideas, such as Independent Component Analysis and ISOMAP, for application in genome-wide gene expression data analysis. We pay particular attention to unsupervised learning methods that allow one to integrate biological knowledge in the form of a graph of interactions between biological entities (Network Component Analysis, method of spectral filtering on graphs [1]).
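The general idea of filtering expression values over an interaction graph can be illustrated with a small low-pass filter. This is a hedged sketch, not the method of [1]: instead of an explicit eigendecomposition of the graph Laplacian L it applies the polynomial filter (I - alpha*L)^steps, which damps a component with Laplacian eigenvalue lambda by the factor (1 - alpha*lambda)^steps. The gene names, parameter values and function name are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def smooth_on_graph(
    values: Dict[str, float],
    edges: List[Tuple[str, str]],
    alpha: float = 0.2,
    steps: int = 5,
) -> Dict[str, float]:
    """Apply (I - alpha * L)^steps to a signal defined on graph nodes.

    L is the unnormalized graph Laplacian. For monotone damping, alpha
    should be small enough that alpha * lambda_max < 1; lambda_max is at
    most twice the largest node degree.
    """
    neighbours: Dict[str, Set[str]] = {gene: set() for gene in values}
    for a, b in edges:
        if a in neighbours and b in neighbours and a != b:
            neighbours[a].add(b)
            neighbours[b].add(a)

    current = dict(values)
    for _ in range(steps):
        updated = {}
        for gene, x in current.items():
            # (L x)_gene = sum over neighbours of (x_gene - x_neighbour)
            laplacian_term = sum(x - current[h] for h in neighbours[gene])
            updated[gene] = x - alpha * laplacian_term
        current = updated
    return current


# Hypothetical three-gene chain with an expression contrast along it
expression = {"geneA": 1.0, "geneB": 0.0, "geneC": -1.0}
network = [("geneA", "geneB"), ("geneB", "geneC")]
smoothed = smooth_on_graph(expression, network, alpha=0.2, steps=5)
print(smoothed)
```

A signal that is constant over a connected graph passes through unchanged, while contrasts between neighbouring nodes are progressively attenuated.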

We discuss the application of non-linear generalizations of principal component analysis to high-throughput data (method of elastic maps [2] and method of principal trees [3]) and, using a systematic test of the quality of data mapping into low-dimensional spaces, demonstrate that their application to high-throughput data is beneficial in comparison with the performance of linear methods.

1 Institut Curie, France, [email protected], [email protected]
2 University of Leicester, United Kingdom, [email protected]
3 Ecole de Mines, France, [email protected]

1. Rapaport F., Zinovyev A., Dutreix M., Barillot E., Vert J.-P. (2007) Classification of microarray data using gene networks. BMC Bioinformatics 8:35.
2. Gorban A., Zinovyev A. (2008) Elastic Maps and Nets for Approximating Principal Manifolds and Their Application to Microarray Data Visualization. Lecture Notes in Computational Science and Engineering 58: 97-128.
3. Gorban A., Sumner N., Zinovyev A. (2008) Beyond The Concept of Manifolds: Principal Trees, Metro Maps, and Elastic Cubic Complexes. Lecture Notes in Computational Science and Engineering 58: 223-240.

TABLE OF CONTENTS

New method to improve error probability estimation applied to Illumina sequencing. Irina Abnizova , Tom Skelly, Yumi Yan, Tony Cox 1 Detection of genes that underwent positive selection in deep-sea archaebacteria of Pyrococcus genus . K.V. Gunbin, D.A. Afonnikov , N.A.Kolchanov 2 Mathematical modeling of the molecular genetic systems regulating a plant development. Ilya Akberdin , Fedor Kazantsev, Stanislav Fadeev, Irina Gainova, Vitaly Likhoshvai 4 Water-mediated hydrogen bonds are essential for loop stabilization in protein structures. Evgeniy Aksianov , Sergei Spirin, Anna Karyagina, Andrei Alexeevski 6 Genomic insights into the origins of metazoan cell differentiation. Kirill V. Mikhailov, A.V. Konstantinova, M.A. Nikitin, V.V. Aleoshin, L.Yu. Rusin, Yuri V. Panchin 9 Inherent potentialities of Voronoi-Delauney tessellation as applied to biology problems. Anastasya Anashkina , Natalia Esipova , Vladimir Tumanyan 11 Computational Anti-AIDS Drug Design Resulting from the Study on Specific Interactions of Immunophilins with the HIV-1 gp120 V3 Loop. Alexander Andrianov 13 Homology Modeling and Molecular Dynamics in Structural Studies on the HIV-1 gp120 V3 Loops: Insight into the Virus Subtype A. Ivan Anishchenko, Alexander Andrianov 15 3D Structure Modeling and Posterior Collation of the HIV-1 V3 Variable Loops for Discovery of Their Structurally Invariant Sites Exposing the Achilles' Heel in the HIV-1 “Redoubts”. A. M. Andrianov , I.V. Anishchenko 17 PolyCTLDesigner – the software for constructing polyepitope immunogens. Denis Antonets , Amir Maksyutov, Sergey Bazhan 19 Genome-wide search for 5’-UTR of Saccharomyces cerevisiae genes and their orthologs. Kirill Antonez , Alsu Saifitdinova 21

383 4-TH MOSCOW CONFERENCE ON COMPUTATIONAL MOLECULAR BIOLOGY July 20–23, 2009 A Trusty Knowledge-Based Potential Energy Based on Pairwise Residue Contact Area. Seyed Shahriar Arab , Armita Sheari, Mehdi Sadeghi, Changiz Eslahchi, Hamid Pezeshk 23 Evolutionary dynamics of CRISPR-cassettes. Valery Sorokin, Irena Artamonova 26 Investigating Branch Point Site consensus of human. Fedor Goncharov, Vladimir Babenko 28 Glaucoma and myopia whole genome association study. Vladimir Babenko , Marina Gubina, Igor Kulikov, Ruslan Aitnasarov 30 An Evolutionary Study in the Genomics of Vertebrate Poxviruses. Igor Babkin 31 Dosage compensation and demasculinization of X chromosomes in Drosophila. Doris Bachtrog , Nicholas Toda, Steven Lockton 33 Codon size reduction as the origin of the triplet genetic code. Pavel Baranov , Maxime Venine, Gregory Provan 34 Toward universal malignometer: genome-wide expression patterns as composite biomarkers. Ganiraju Manyam, Alessandro Giuliani, Ancha Baranova 36 Mathematical modelling of cell-fate decision networks. Emmanuel Barillot , Laurence Calzone, Simon Fourquet, Laurent Tournier, Andrei Zinovyev, Denis Thieffry 38 Conservative regions of proteins evolve under stronger positive selection. Georgii Bazykin , Alexey Kondrashov 40 Modelling and stability analysis of interconnected regulatory cycles. Mahsa Behzadi , Mireille Regnier, Laurent Schwartz, Jean-Marc Steyaert 41 Involvement of protein-protein interactions in composite elements detection. Alexander A Belostotsky , Vsevolod Y. Makeev 43 Studying the impact of gene copy number variations on gene expression via a gene regulation network. Sylvain Blachon , Carito Guziolowski, Gautier Stoll, Gaelle Pierron, Stelly Ballet, Franck Tirode, Olivier Delattre, Emmanuel Barillot, Andrei Zynoviev,

384 4-TH MOSCOW CONFERENCE ON COMPUTATIONAL MOLECULAR BIOLOGY July 20–23, 2009 Anne Siegel, Ovidiu Radulescu 45 Using SVM and a measure of motif ‘surprise’ to distinguish regulatory DNA. Rene te Boekhorst , Irina Abnizova , Fedor Naumenko , Ivan Kulakovski , Wernisch Lorenz 46 Search for degenerate tandem repeats in nucleotide sequences. Their possible role in regulation of gene expression. V. Boeva , V.J. Makeev, M. Regnier 48 Application of the computer program Rosetta for the protein structure interpretation from tritium planigraphy technique data: M1 protein of influenza virus A. Elena Bogacheva , Alexey Chulichkov, Alexey Dolgov, Aleksandr Shishkov, Iliya Kuzmin, Lidia Nefedova, Ludmila Baratova 50 FSdetector: frameshift prediction in protein coding sequences by the Viterbi algorithm. Ivan Antonov, Mark Borodovsky 52 Automatic tool to describe structure of reliable blocks in a multiple alignment of protein sequences. Boris Burkov , Boris Nagaev, Sergei Spirin, Andrei Alexeevski 54 Evolution of signal peptide appearance/disappearance in bacterial genomes. Nadezhda Bykova , Andrej Mironov 56 A statistical method for PWM clustering. Solenne Carat , Rémi Houlgatte, Jérémie Bourdon 59 Construction and Heterological Expression in E. coli of the Deletion Derivatives of the Cyanobacterium Synechocystis sp. PCC 6803 drgA Gene and its Hybrids with gfp. Regina Chakhiridis , Vera Grivennikova, Elena Muronets, Kirill Timofeev, Irina Elanskaya, Viktoriya Toporova, Alexei Nekrasov, Dmitry Dolgikh 61 Role of GATA4 and NKX2-5 in congenital heart defects of Indian population: a preliminary report. Anbarasan Chakrapani , Ashok Kumar Manickaraja, herian K. M, Soma Guhathakurta, Vijaya M Nayak 63 Hydrogen bond geometry in regular helix structures. Dmitrii L. Ukrainskii, Vladimir O. Chekhov , Vladimir G. Tumanyan, Natalia G. 
Esipova 63 385 4-TH MOSCOW CONFERENCE ON COMPUTATIONAL MOLECULAR BIOLOGY July 20–23, 2009 Negative Information Entropy as a Measure of Nonexponentiality of Protein Folding Kinetics. Sergei F. Chekmarev 67 Changing the content of cytosine, guanine, CPG and CPNPG sequences of RDNA in long phylogenetic branches of flowering plants is a back-and-forth nature. Vladimir Chupov 69 Evolution of sequences under strong selection: splice sites and Shine-Dalgarno boxes. Stepan Denisov , Aksiniya Gaydukova, Andrey Mironov, Alexander Favorov, Ramil Nurtdinov, Mikhail Gelfand 71 Computer simulation of C.Elegans muscular system and neural network. Alexander Dibert , Andrey Palyanov 73 New profiles for two domains of quorum-sensing histidine kinases from Firmicutes bacteria. D.V. Dibrova 75 Multiscale modeling and design of biological molecules. Nikolay V. Dokholyan 77 Prediction of flexibility and ability to hydrogen-deuterium exchange for protein chain using amino acid sequence. Nikita Dovidchenko , Alexey Surin, Sergiy Garbuzynskiy, Michail Lobanov, Оxana Galzitskaya 78 Mathematical Modeling of Steady-State Metabolism in Saccharomyces cerevisiae Mitochondria. Renata A. Zvyagilskaya, Nafisa N. Nazipova, Alexsander A. Alexsandrov, Lyusien N. Drozdov-Tikhomirov 79 Structural trees and classification of proteins. Alexander Efimov 80 Investigation of correlation between domain borders and corresponding exon borders in the nonredundant set of human proteins. V.A. Epaneshnikov , A.A. Anashkina, E.N. Kuznetzov, V.G. Tumanyan 81 Evolution of structure and sequence in alternatively spliced Drosophila genes. Dmitry Malko , Ekaterina Ermakova , Mikhail Gelfand 84 Secondary structure of copolymer consisting of amphiphilic and hydrophilic monomer units: impact of the range of the interaction potential. Vitaly Ermilov ,

386 4-TH MOSCOW CONFERENCE ON COMPUTATIONAL MOLECULAR BIOLOGY July 20–23, 2009 Valentina Vasilevskaya, Alexei Khokhlov 86

Mutual Orientation of Qy Transition Dipoles of Subantennae Pigments as a Structural Factor Optimizing the Photosynthetic Antenna Function. Theoretical and Experimental Studies. Anastasiya Zobova, Andrey Yakovlev, Vladimir Novoderezhkin, Alexandra Taisova, Zoya Fetisova 87
Orientational Factors for Förster's Resonance Excitation Energy Transfer. V.S. Dujenko, A.V. Zobova, Z.G. Fetisova 90
Search for an Optimal Interfacing Subantennae in Superantenna of Photosynthetic Green Bacteria. V.G. Popov, A.V. Zobova, A.S. Taisova, Z.G. Fetisova 93
Evolution of sex chromosomes in diploids and haploids. Dmitry Filatov 96
Analysis of 3D structure, thermostability and mechanical characteristics of I, II, III, V and XI types of collagens. Ivan V. Filatov, Yuri V. Milchevsky, Vladimir A. Namiot, Marianna V. Moldaver, Sergey A. Lukshin, Maxim A. Rubin, Elisa I. Tiktopulo, Natalia G. Esipova, Vladimir G. Tumanyan 99
A New Atomic Force Field "FFS" for Protein Interactions, Computed from Solubility of Molecular Crystals in Water. Alexei Finkelstein, Leonid Pereyaslavets 101
X(Y)n-type microsatellites in the human and mouse genome. Fridman M.V., Makeev V., Oparina N.J. 103
DIPROGB: A new genome browser that encodes sequence information by thermodynamic and geometrical dinucleotide properties. Maik Friedel, Thomas Wilhelm, Jürgen Sühnel 105
Helix-helix contacts in membrane proteins: analysis, prediction and applications. Angelika Fuchs, Andreas Kirschner, Barbara Hummel, Dmitrij Frishman 106
Prediction of unstructured residues in protein chains. Oxana Galzitskaya, Sergiy Garbuzynskiy, Michail Lobanov 108

Aggregation propensity of yeast and human proteomes. Natalya Bogatyreva, Oxana Galzitskaya 110
Positions of Protein Folding Nuclei Correspond to Positions of Root Structural Motifs. Sergiy O. Garbuzynskiy, Maria S. Kondratova 112
In Silico design of Primer for 28 kDa Antigen Precursor Protein of Mycobacterium Leprae. Aditya Gaur 114
One Codon – Two Amino Acids. Vadim Gladyshev 115
Dynamics and Rigidity/Flexibility of Thermophilic and Mesophilic Proteins. Anna V. Glyakina, Tatyana B. Mamonova, Maria G. Kurnikova, Oxana V. Galzitskaya 116
A Novel Approach to Structural Alignment of Proteins Based on Energy Landscapes Calculation. Maxim Godsie, Igor Oferkin, Pavel Ivanov 118
Inferring gene evolution along a species tree. K. Gorbunov, V. Lyubetsky 120
Mode of stop codon restriction by the Euplotes eRF1 translation termination factor. Evgeny Gordienko, Boris Eliseev, Elena Alkalaeva, Ludmila Frolova 122
Bioinformatics analysis of LAGLIDADG homing endonucleases for construction of enzymes with changed DNA recognition specificity. Alexander Grishin, Ines Fonfara, Wolfgang Wende, Daniil Alexeyevsky, Andrei Alexeyevsky, Sergei Spirin, Olga Zanegina, Anna Karyagina 123
A comparative assessment of methods for recognition of binding sites in proteins. Concettina Guerra 125
Comparative Genomics and Evolutionary Account of GPI Anchored Proteins: An in silico Study. Ashutosh Mani, Dwijendra K. Gupta 127
In-silico Sequence Analysis, Functional and Evolutionary Characterization of a Novel Cold Shock Domain Protein from Indian Eri silkworm, Philosamia ricini. Ashutosh Mani, Pramod K Yadava, Dwijendra K. Gupta 129

Prediction of Genome-wide Functional Linkages in Mycobacterium tuberculosis using Genome context methods and Gene expression data. Chandrani Das, Shubhada Hegde, Shekhar Mande 132
Dependence between Exon, Intron Length and Nucleotide Content of Genes in Human and Protist Genomes. Anatoliy Ivachshenko, Anel Kabdullina, Vladimir Khailenko, Shara Atambayeva 133
An Update of KineticDB, the Database of Protein Folding Kinetics. Natalya Bogatyreva, Alexander Osypov, Dmitry Ivankov 135
A New Approach For Detecting Tumor Marker Genes From Microarray Datasets Using Evolutionary Algorithm. Georgy Gulbekyan, Valery Valyaev, Pavel Ivanov 137
Analysis of time series Microarray data using Dynamic Bayesian network. K.G. Srinivasa, Seema S, Manoj Jaiswal 139
Chromosome Properties of Unicellular Eukaryotic Genomes. Anel Kabdullina, Anatoliy Ivachshenko, Makpal Tauasarova, Shara Atambayeva 140
Reverse engineering of early endocytic compartments organization by modelling cargo propagation. Yannis Kalaidzidis, Marta Miaczynska, Jochen Rink, Inna Kalaidzidis, Marino Zerial 142
Predicting novel protein-small molecule interactions using molecular modelling techniques. Olga Kalinina, Robert Russell 143
Bioinformatic Search of Plant Microtubule- and Cell Cycle Related Serine-Threonine Protein Kinases. P.A. Karpov, E.S. Nadezhdina, A.I. Yemets, V.G. Matusov, A.Yu. Nyporko, N.Yu. Shashina, Y.B. Blume 145
Net2Drug: Combined targeting the key-nodes in signal transduction network shifts balance between apoptosis and survival mechanisms in tumor cells. Alexander Kel, Angela Gluch, Ulyana Boyarskih, Vladimir Poroikov, Alexey Zakharov, Galina Selivanova 148

Highly connected cancer metasignature genes are not evolutionary conserved throughout the three domains of life. Muhummadh Khan, Kaiser Jamil 150
SNPs of MTHFR occur at sites exhibiting significant conservation in comparison to the active sites. Muhummadh Khan, Kaiser Jamil 153
Chromatin organization in D. melanogaster. Peter Kharchenko, Art Alekseyenko, Andrey Gorchakov, Michael Tolstorukov, Mitzi Kuroda, Peter Park, Yuri Schwartz, Daniela Linder-Basso, Vincenzo Pirrotta, Nicole Riddle, Sarah Gadel, Sarah Marchetti, Sarah Elgin, Aki Minoda, Cameron Kennedy, Gregory Shanower, Gary Karpen 155
A novel type of repeats mediates interaction between Schizosaccharomyces pombe Rad51 and Sfr1 proteins. Olga Khasanova, Fuat Khasanov 157
Comparative analysis of gene expression profiles in liver and kidney of pigs. N.S. Khlopova, V.I. Glazko, T.T. Glazko 159
Regulation of splicing by small non-coding RNAs. Ekaterina Khrameeva, Andrey Mironov, Mikhail Gelfand, Dmitri Pervouchine 161
Common Predecessor’s Effect in Archaeal Genomes and Proteomes. Vladislav Victorovich Khrustalev, Eugene Victorovich Barkovsky 163
Conformational analysis of rotamer changes upon protein-protein binding. Tatsiana Kirys, Anatoly Ruvinsky, Alexander Tuzikov, Ilya Vakser 165
Introduction and application of CellExpress, a new database for studying human tissue specific gene expression. Larisa Kiseleva, Raymond Wan, Paul Horton 167
Replica-exchange simulations of amyloid growth. Dmitri K Klimov 169
Finding of molecular targets and their ligands for breast cancer therapy. O.N. Koborova, D.A. Filimonov, A.V. Zakharov, A.A. Lagunin, V.V. Poroikov 172
Interaction of antibodies with small aromatic ligands. Darja Svistunova, Vladimir Arzhanik, Oleg Koliasnikov 174

Similar curved motif surrounds CENP-B box in different centromeric satellite DNA. Aleksey Komissarov, Olga Podgornaya 176
Molecular Evolution of Influenza A Virus Hemagglutinin in consideration of Enzyme Proteolysis, Mass Spectrometry and Phylogeny Analysis Data. Yulia Smirnova, Viktor Lebedev, Tatiana Semashko, Ekaterina Kropotkina, Larisa Kordyukova, Marina Serebryakova 178
An online tool for search of correlations between sequences of DNA-binding proteins and their binding sites. Yuriy Korostelev, Olga Laikova, Alexandra Rakhmaninova, Mikhail Gelfand 180
A knowledge-rich approach to drug discovery. Ekaterina Kotelnikova, Nikolai Daraselia 182
Systems biology approach to study morphogenetic field. Konstantin Kozlov, Ekaterina Myasnikova, Maria Samsonova 184
EST-based bioinformatic approaches to identification of cancer biomarkers. George Krasnov, Nina Oparina, Mashkova Tamara, Sergey Beresten 186
Reconstructing ancestral multi-domain proteins. Roland Krause 187
Periodic patterns in B. subtilis promoter structure are associated with promoter selectivity by different forms of RNA polymerase holoenzyme. G. Kravatskaya, Yu. Kravatsky, Yu. Milchevsky, N. Esipova 188
Predicting RNA Secondary Structures Including Pseudoknots. Andrey Kravchenko, Rune Lyngso 190
Rare variants based association studies – are they feasible? Gregory Kryukov 191
MISHIMA – a new heuristic method of multiple sequence alignment. Kirill Kryukov, Kazuho Ikeo, Takashi Gojobori, Naruya Saitou 192
Model-based timing of gene expression. Andrzej Kudlicki, Malgorzata Rowicka 193

Chipmunk: a fast DNA motif finder for ChIP data and its application to data integration from different experimental sources. Ivan V. Kulakovskiy, Valentina A. Boeva, Alexander V. Favorov, Vsevolod J. Makeev 194
Changes of selective pressure affecting the isoenzymes of glyceraldehyde-3-phosphate dehydrogenase. Mikhail L. Kuravsky, Vladimir I. Muronetz, Vladimir V. Aleshin 197
Patterns of evolution in protein phosphorylation sites. Yerbol Z. Kurmangaliyev 199
Finding of the gene fruitless in ants. Tatiana Kuzmenko, Mikhail Skoblov, Sergey Nuzhdin, Ancha Baranova 201
Protein-DNA Binding Statistics and Estimating the Total Number of Binding Sites of Transcription Factor in the Genome. Vladimir Kuznetsov, Onkar Singh, Piroon Jengaroenpoon 203
Evolvability and biodiversity – modeling of coevolution in communities using evolutionary constructor program. Sergey A. Lashin, Valentin V. Suslov, Yurii G. Matushkin 205
A noncoding antisense RNA – protein information system for mammalian stress response. Georges St. Laurent III, Dmitry Schtokalo, Sergey Nechkin, Andrey Polyanov, Ajit Kumar, Mohammad Ali Faghihi, Farzaneh Modarresi, Claes Wahlestedt 207
Molecular dynamics simulation of membrane curvature induction by I-BAR domain of MIM. Olga Levtsova, Ildar Davletov, Olga Sokolova 209
TINC (Target Id by Network Connectivity). Dmitriy Leyfer, Ugur Guner 211
Understanding the Amino Acid Substitution Process. David A. Liberles 212
Positioning of exons and introns in collagen I and VII genes may be determined by nucleosomes. A.P. Lifanov, P.K. Vlasov, V.Yu. Makeev, N.G. Esipova 213
Application of Nucleic Acid Programmable Protein Arrays (NAPPA) to serological profiling for Type 1 Diabetes associated autoantibodies. T. Logvinenko, S. Miersch, S. Sibani, J. LaBaer 215
Chlorophyll synthesis regulation in plant chloroplasts. K.V. Lopatovskaya, A.V. Seliverstov, V.A. Lyubetsky 217
Comparative genomic analysis of the attenuation regulation of amino acid and aminoacyl-tRNA biosynthesis operons in bacteria. V.A. Lyubetsky, K.V. Lopatovskaya 219
Refinement of Spatial Structure Model of Potato Virus X Coat Protein and Detection of Functionally Significant Structural Alterations in This Protein with the Help of Tritium Planigraphy Method. Pavel Semenyuk, Anna Mukhamedzhanova, Elena Lukashina 221
Allele-Specific Expression Using Solexa. Bradley Main, Ryan Bickel, Lauren McIntyre, Rita Graze, Sergey Nuzhdin 223
A novel method for gene prediction in prokaryotic genomes. Rahim Malekshahi, Alirea Mehridehnavi, Hedayatolah Hosseini, Majid Beigi 224
SORT-ITEMS and DiScRIBinATE: Similarity based binning algorithms for accurate taxonomic assignment of metagenomic sequences. Monzoorul Haque Mohammed, Tarini Shankar Ghosh, Sharmila Mande 225
Prediction of conditional gene essentiality through graph theoretical analysis of genome-wide functional linkages. Palanisamy Manimaran, Shubhada Hegde, Shekhar Mande 227
In search of antisense to AFAP1 human gene. Andrey Marakhonov, Ancha Baranova, Tatyana Kazubskaya, Sergey Shigeev, Mikhail Skoblov 228
Equilibrium and dynamical properties of protein binding networks. Sergei Maslov 229
Investigation of age related alternative splicing changes in human brain using Solexa sequencing. Pavel Mazin, Philip Khaitovich, Andrey Mironov, Mikhail Gelfand 230
Knowledge Profile Approach: Insights Into Drug Action and Toxicity Mechanisms. Ilya Mazo, Ekaterina Kotelnikova, Nikolai Daraselia 232
Specific recognition of UGA, UAA, UGA but not UGG by eRF1 protein: molecular modeling study. Yuriy Mazur, Nina Oparina, Vladlen Skvortsov, Igor Baskin, Vladimir Palyulin 233
Inner structure of CpG islands. Julia Medvedeva, Nika Oparina, Vsevolod Makeev 234
Compensatory evolution in mt-tRNAs navigates shifting balance-like valleys of low fitness. Margarita Meer, Fyodor Kondrashov 235
Modeling of auxin distribution in root: rhizotaxis is defined by auxin regulation of its own transport. Victoria Mironova, Nadya Omelyanchuk, Vitaly Likhoshvai 236
Correlations between DNA-binding domains and their DNA binding sites. D.S. Miteva, V.V. Stepanova, A.B. Rakhmaninova 238
Exceptional nucleotide sequences in genomes of different organisms. Sergei Mitrofanov, Alexander Panchin, Andrei Alexeevski, Sergei Spirin, Yury Panchin 240
Mathematical model of the inhibiting part in TCA at Citric Acid synthesis by superproducers cross-mutants of Yarrowia lipolytica from glucose. Yulia Lunina, Andrew Rudenko, Igor Morgunov 242
Studying origin of life through data mining: Traces of the primeval Zinc World in modern protein and RNA databases. Armen Y. Mulkidjanian and Michael Galperin 244
Pipeline for acquisition of high precision quantitative information on gene expression from confocal images. Ekaterina Myasnikova, Konstantin Kozlov, Maria Samsonova 246
Organization of physical interactomes as uncovered by network schemas. Eric Banks, Elena Nabieva, Bernard Chazelle, Mona Singh 247
Reclassification of GH13 family of glycoside hydrolases. Diana I. Gizatullina, Daniil G. Naumoff 249
Sequence analysis of endo-α-N-acetylgalactosaminidases and their homologues. Daniil G. Naumoff 251
Prokaryotic transfer RNA: rate of molecular evolution, number of copies, stability & codon usage. Olesya Nechay, Maksim Sorokin, Konstantin Popadin 253
The base-calling algorithm with vocabulary. Yuri S. Fantin, Denis A. Reshetov, Alexey D. Neverov, Alexander V. Favorov, Andrey A. Mironov, Vladimir P. Chulanov 255
Detection of genomic variation by selection of a 9Mb DNA region and high throughput sequencing. Sergey Nikolaev, Christian Iseli, Andrew Sharp, Daniel Robyr, Jacques Rougemont, Corinne Gehrig, Laurent Farinelli, Stylianos E. Antonarakis 257
Analysis of gene regulation in Escherichia coli. Swetlana Nikolajewa, Maik Friedel, Reinhard Guthke 258
Chargaff's Second Parity Rule. Swetlana Nikolajewa, Reinhard Guthke, Maik Friedel 259
Prediction of regulatory elements in Drosophila genomes using hidden Markov model based on the arrangement of transcription factor binding sites. Anna Nikulova, Andrey Mironov 261
Probe-level annotation database for Affymetrix expression microarrays. Ramil N. Nurtdinov, Mikhail O. Vasiliev, Anna S. Ershova, Ilia S. Lossev, Anna S. Karyagina 263
Structural features of β-tubulin specific interaction with benzimidazole compounds. A.Yu. Nyporko, Ya.B. Blume 265
Neisseria gonorrhoeae outer membrane protein translocation disorder: combined bioinformatic and experimental analysis. Nina Oparina, Elena Ilina, Maya Malakhova, Alexandra Borovskaya, Irina Demina, Marina Serebryakova, Maria Rogova, Vadim Govorun 268
Search for CpG-islands: comparison of modern approaches. Nina Oparina, Marina Fridman 270
Some like it sweet: towards genomic encyclopedia of sugar catabolism in bacteria. Andrei Osterman, Dmitry Rodionov 271
Compensatory evolution in response to a novel RNA polymerase: electrostatic properties of promoters may lead the adaptation. Alexander Osypov, Svetlana Kamzolova, Anatoly Sorokin 273
DEPPDB – the DNA Electrostatic Potential Database. Electrostatic properties of natural genomes. Alexander Osypov, Svetlana Kamzolova, Anatoly Sorokin 275
Electrostatic properties of T7-like phages promoters for host bacterial and native viral RNA polymerases. Alexander Osypov, Svetlana Kamzolova, Anatoly Sorokin 277
RNA polymerase-DNA interactions: are they driven by electrostatics? Alexander Osypov, Svetlana Kamzolova, Anatoly Sorokin 279
Major trends in the evolution of young human paralogs. Alexander Panchin, Mikhail Gelfand, Vasily Ramensky, Irena Artamonova 281
New evidence for diversity of intercellular channel (gap junction) proteins. Yuri Panchin, Ludmila Popova, Igor Kosevich, Yulia Kraus, Irina Shagina, Maria Kurnikova, Dmitry Shagin 282
Time warping of global expression data for evolutionary distant species. Dmitri Papatsenko, Yury Goltsev 285
Evidence of protein domains stability due to aromatic interactions. Leonid Pereyaslavets 286
CpG islands: evolution of ‘non-objects’ in the genome. Inna Pertsovskaya, Artem Artemov, Nina Oparina, Alexander Favorov, Andrei Mironov, Dmitriy Vinogradov 287
High rate of adaptation in Drosophila. Dmitri Petrov, Josefa Gonzalez, J. Michael Macpherson, Lenkov Kapa 289
Regulation of ribosomal genes in bacteria: comparative genomic analysis. Svetlana A. Petrova, Alexey G. Vitrechack 290
Polymorphism of ISSR-PCR markers and positioning of invert repeats of microsatellites in sequences of Bovidae family. Anton Pheophilov, Valeriy Glazko 292
Evolution of mitochondrial genome size: large genomes in small mammals and small genomes in large mammals. Konstantin Popadin 294
Bioinformatics as a "Critical Technology" for Life Sciences. V.V. Poroikov 296
Expansion of the protein sequence universe. Inna Povolotskaya, Fyodor Kondrashov 298
Cluster Analysis of Phylogenetic Profiles. Mikhail Pyatnitskiy, A.V. Lisitsa, A.I. Archakov 300
Studying NF-κB response to cellular signals by hierarchical modeling. Ovidiu Radulescu, Vincent Noel, Alexander Gorban, Alain Lilienbaum, Andrei Zinovyev 302
Analysis of inhibitor for breast cancer causing GPR30 protein. Karthika Raghavan, Nithya Palaniappan, Divya Ramkumar 304
Predicting binding sites of ions in protein structures. Sergei Rahmanov, Ivan Kulakovsky, Vsevolod Makeev 305
Deleterious and compensatory mutations in proteins. Olga Kalinina, Anastasya Anashkina, Alexandra Mirina, Vasily Ramensky 307
Positive selection and alternative splicing of human genes. Vasily Ramensky, R. Nurtdinov, A. Neverov, A. Mironov, Mikhail Gelfand 308
SeqWord Gene Island Sniffer: a tool to study the lateral genetic exchange among bacteria. Oliver Bezuidt, Gipsi Lima-Mendez, Oleg Reva 310
Cryptic transcripts regulated during the yeast metabolic cycle. Malgorzata Rowicka, Andrzej Kudlicki, Benjamin Tu 312
An average number of suffix-prefixes. M. Regnier, E. Furletova, M. Roytberg 313
Comparison of structure-based and covariance-based secondary structures of 23S RNA. D.N. Ivankov, M.A. Roytberg 315
Search for new genes of D. virilis and D. mojavensis. T.V. Astahova, N.S. Bogatyreva, M.A. Roytberg 317

Comparative Genomics of the Fatty Acids Biosynthesis in Gamma-Proteobacteria. Nataliya S. Sadovskaya 319
Adenosine Deaminase and its Isoenzymes in serum of Patients with Primary Immunodeficiency Diseases. Reza Saghiri, Hadi Akhbari, Peghah Poursharifi, Mina Ebrahimi-Rad, Manijeh Ahmadi, H. Nazem, Z. Pourpak, M. Moin, S. Shams, M. Saghiri, M. Karami 320
Polyallelic SNPs in population of Drosophila melanogaster. Vladimir Seplarskij, Georgii Bazykin 321
Statistical Analysis of HIV-1 Protein Mutations and Association with Antiretroviral Therapy. R.S. Sergeev, A.V. Tuzikov, V.F. Eremin 322
In silico and in vivo analysis of functions of some of the chromosomal regions. Anna N. Shabarina, M.V. Glazkov 324
Correlation of HIV-1 Rev binding host factor structure and evolution profiles and their importance in HIV associated Neuropathogenesis. Deepak Sharma 326
Expression analysis of intronless transcriptome of mouse. Viktoria Serzhanova, Anton Kireev, Anna Guskova, Ancha Baranova, Mikhail Skoblov 328
Study of antisense regulation of human carbonyl reductase 3. Yurii Chernohvostov, Anna Guskova, Tatiyana Kazubskaya, Ancha Baranova, Mikhail Skoblov 329
PAAS: Machine Learning Method for Classification of Amino Acid Sequences Using the Local Similarity Scores. Boris Sobolev, Kirill Alexandrov, Dmitry Filimonov, Vladimir Poroikov 330
Germ-based spatial alignment of proteins. Dian Zhemoldinov, Andrei Alexeevski, Sergei Spirin 332
NPIDB, a database of structures of nucleic acid – protein complexes. Dmitry Kirsanov, Olga Zanegina, Andrei Alexeevski, Sergei Spirin, Alexander Grishin, Anna Karyagina 334
Statistical approach for discovering evolutionary conserved members of regulon. E. Stavrovskaya, D.A. Rodionov, A.A. Mironov, I. Dubchak, P.S. Novichkov 336
Computer simulation and quantum chemistry calculations in the analysis of the physical mechanism of the biologically significant activity of nucleotides. Vasily Stefanov, Olga Rogacheva, Alexander Tulub 339
SNPs in the HIV-1 TATA Box and the AIDS Pandemic. Suslov V.V., P.M. Ponomarenko, V.M. Efimov, M.P. Ponomarenko, L.K. Savinkova, N.A. Kolchanov 341
Modeling of Structure and Substrate Recognition of Penicillin Acylase from Streptomyces Mobaraensis Using Molecular Docking to Evaluate Proper Active Site Geometry. Dimitry Suplatov, Irina Pouliakhina, Vladimir Arzhanik, Vytas Švedas 343
Construction of interactive data base of human Alu repeats digestion at short nucleotide sequences. Victor Tomilov, Murat Abdurashitov, Sergey Degtyarev 345
Anionic Phospholipid Asymmetric Location in Zwitterionic/Anionic Vesicles. Francisco Torrens, Gloria Castellano 347
Prediction of super-secondary structure in α-helical and β-barrel transmembrane proteins. Van Du Tran, Philippe Chassignet, Jean-Marc Steyaert 350
Ethanolamine utilization: study of evolution and regulation using comparative genomics. Olga Tsoy, Dmitry Ravcheev, Arcady Mushegian 352
The Tale of “Underlying biology”: Functional Analysis of MAQC II data. Marina Tsyganova, Weiwei Shi, Damir Dosymbekov, Zoltan Dezso, Tatiana Nikolskaya, Yuri Nikolsky 354
Capture and release of coding DNA: evolution of bacterial genes by shift of stop codons. Anna Vakhrusheva, Marat Kazanov, Andrew Mironov, Georgiy Bazykin 356
Binding Determinants of Interactions Between Antiapoptotic Proteins Bcl-2, Bcl-xL, Mcl1 and ligands ABT737 and Gossypol. A.I. Davidovskii, V.G. Veresov 357
Exploring the Molecular Basis of the Binding of ABT737 and ABT263 towards Antiapoptotic Proteins Bcl-2, Bcl-xL, Mcl-1, A1. A.I. Davidovskii, V.G. Veresov 359
Codon usage bias: biological function or neutral marker? Svetlana Vinogradova, Dmitriy Vinogradov, Andrey Mironov 362
Collagen-like patterns in the human genome. Anna V. Vlasova, Petr K. Vlasov, Natalia G. Esipova, Vladimir G. Tumanyan 364
Interrelation between the translation initiation signal and the N-end of encoded protein in human mRNA. Oxana Volkova, Alex Kochetov 366
Improved prediction of human miRNAs based on context-structural HMM. Pavel Vorozheikin, A.I. Kulikov, Igor I. Titov 368
Genetics of variation of copia suppression in Drosophila melanogaster. Wendy Vu, Sergey Nuzhdin 370
XIPPI: integrating information on protein-protein interactions. Yuri Vyatkin, Dmitry Afonnikov 371
Rate of evolution of protein-coding genes and the generalized mistranslation-induced misfolding hypothesis. Yuri Wolf, Irina Gopich, Eugene Koonin 373
Stabilization of separated charges in reaction centers of bacterial photosynthesis. A.G. Yakovlev, V.A. Shuvalov 374
A novel promoter of the Escherichia coli yfiA gene and pathways of its regulation under oxidative stress conditions. T.M. Khlebodarova, A.V. Zadorozhny, V.A. Likhoshvai, N.V. Tikunova, D.Yu. Oshchepkov and N.A. Kolchanov 376
A comparative view on microRNA genes in animal genomes. Evgeny Zdobnov 379
Finding meaningful structures in high-throughput data: from principal trees to spectral filtering on graphs. Andrei Zinovyev, Alexander Gorban, Emmanuel Barillot, Jean-Philippe Vert 381


AUTHOR INDEX

Murat Abdurashitov 345
Irina Abnizova 1,46
Dmitry Afonnikov 2,371
Manijeh Ahmadi 320
Ruslan Aitnasarov 30
Ilya Akberdin 4
Hadi Akhbari 320
Evgeniy Aksianov 6
Vladimir Aleoshin 9
Vladimir Aleshin 197
Kirill Alexandrov 330
Andrei Alexeevski 6,54,123,240,332,334
Daniil Alexeyevsky 123
Alexsander A. Alexsandrov 79
Elena Alkalaeva 122
Anastasya Anashkina 11,81,307
Alexander Andrianov 13,15,17
Ivan Anishchenko 15,17
Denis Antonets 19
Kirill Antonez 21
Ivan Antonov 52
Stylianos E. Antonarakis 257
Seyed Shahriar Arab 23
Alexander Archakov 300
Irena Artamonova 26,281
Artem Artemov 287
Vladimir Arzhanik 174,343
T.V. Astahova 317
Shara Atambayeva 133,140
Vladimir Babenko 28,30
Igor Babkin 31
Doris Bachtrog 33
Stelly Ballet 45
Eric Banks 247
Pavel Baranov 34
Ancha Baranova 36,201,228,328,329
Ludmila Baratova 50
Emmanuel Barillot 38,45,381
Eugene Barkovsky 163
Igor Baskin 233
Sergey Bazhan 19
Georgii Bazykin 40,321,356
Mahsa Behzadi 41
Majid Beigi 224
Alexander Belostotsky 43
Sergey Beresten 186
Oliver Bezuidt 310
Ryan Bickel 223
Sylvain Blachon 45
Yaroslav Blume 145,265
Valentina Boeva 48,194
Elena Bogacheva 50
Natalya Bogatyreva 110,135,317
Mark Borodovsky 52
Alexandra Borovskaya 268
Jeremie Bourdon 59
Ulyana Boyarskih 148
Boris Burkov 54
Nadezhda Bykova 56
Laurence Calzone 38
Solenne Carat 59
Gloria Castellano 347
Regina Chakhiridis 61
Anbarasan Chakrapani 63
Philippe Chassignet 350
Bernard Chazelle 247
Vladimir Chekhov 65
Sergei Chekmarev 67
Yurii Chernohvostov 329
Alexey Chulichkov 50
Vladimir Chupov 69
Tony Cox 1
Nikolai Daraselia 182,232
Chandrani Das 132
Alexander Davidovskii 357,359
Ildar Davletov 207
Sergey Degtyarev 345
Olivier Delattre 45
Irina Demina 268
Stepan Denisov 71
Zoltan Dezso 354
Alexander Dibert 73
Daria Dibrova 75
Nikolay V. Dokholyan 77
Alexey Dolgov 50
Damir Dosymbekov 354
Nikita Dovidchenko 78
Lyusien N. Drozdov-Tikhomirov 79
Inna Dubchak 336
Vladilen Dujenko 90
Mina Ebrahimi-Rad 320
Alexander Efimov 80
Vadim M. Efimov 341
Irina Elanskaya 61
Boris Eliseev 122
Sarah Elgin 155
Vladislav Epaneshnikov 81
Vladimir Eremin 322
Ekaterina Ermakova 84
Vitaly Ermilov 86
Anna Ershova 263
Natalia Esipova 11,65,99,188,213,364
Changiz Eslahchi 23
Stanislav Fadeev 4
Mohammad Ali Faghihi 207
Yuri Fantin 255
Laurent Farinelli 257
Alexander Favorov 71,194,255,287
Zoya Fetisova 87,90,93
Dmitry Filatov 96
Ivan V. Filatov 99
Dmitry Filimonov 172,330
Alexei Finkelstein 101
Ines Fonfara 123
Simon Fourquet 38
Marina Fridman 103,270
Maik Friedel 105,258,259
Dmitrij Frishman 106
Ludmila Frolova 122
Angelika Fuchs 106
E. Furletova 313
Sarah Gadel 155
Irina Gainova 4
Michael Galperin 244
Oxana Galzitskaya 78,108,110,116
Sergiy Garbuzynskiy 78,108,112
Aditya Gaur 114
Aksiniya Gaydukova 71
Corinne Gehrig 257
Mikhail Gelfand 71,84,161,180,230,281,308
Tarini Shankar Ghosh 225
Alessandro Giuliani 36
Diana Gizatullina 251
Vadim Gladyshev 115
Tatiana Glazko 159
Valerii Glazko 159,292
Mikhail Glazkov 324
Angela Gluch 148
Anna Glyakina 116
Maxim Godsie 118
Takashi Gojobori 192
Yury Goltsev 285
Fedor Goncharov 28
Josefa Gonzalez 289
Irina Gopich 373
Alexander Gorban 381
Konstantin Gorbunov 120
Andrey Gorchakov 155
Evgeny Gordienko 122
Vadim Govorun 268
Rita Graze 223
Alexander Grishin 123,334
Vera Grivennikova 61
Marina Gubina 30
Concettina Guerra 125
Soma Guhathakurta 63
Noëlle Guillon 45
Georgy Gulbekyan 137
Konstantin Gunbin 2
Ugur Guner 211
Dwijendra K. Gupta 127,129
Anna Guskova 328,329
Reinhard Guthke 258,259
Carito Guziolowski 45
Shubhada Hegde 132,227
Paul Horton 167
Hedayatolah Hosseini 224
Remi Houlgatte 59
Barbara Hummel 106
Kazuho Ikeo 192
Elena Ilina 268
Christian Iseli 257
Anatoliy Ivachshenko 133,140
Dmitry Ivankov 135,315
Pavel Ivanov 118,137
Manoj Jaiswal 139
Kaiser Jamil 150,153
Piroon Jengaroenpoon 203
Cherian K. M 63
Anel Kabdullina 133,140
Inna Kalaidzidis 142
Yannis Kalaidzidis 142
Olga Kalinina 143,307
Svetlana Kamzolova 273,275,277,279
Lavanya Kannan 326
Lenkov Kapa 289
M. Karami 320
Gary Karpen 155
Pavel Karpov 145
Anna Karyagina 6,123,263,334
Marat Kazanov 356
Fedor Kazantsev 4
Tatyana Kazubskaya 228,329
Alexander Kel 148
Cameron Kennedy 155
Vladimir Khailenko 133
Philip Khaitovich 230
Muhummadh Khan 150,153
Peter Kharchenko 155
Fuat Khasanov 157
Olga Khasanova 157
Tamara Khlebodarova 376
Natalia Khlopova 159
Alexei Khokhlov 86
Ekaterina Khrameeva 161
Vladislav Khrustalev 163
Anton Kireev 328
Dmitry Kirsanov 334
Andreas Kirschner 106
Tatsiana Kirys 165
Larisa Kiseleva 167
Dmitri Klimov 169
Olga Koborova 172
Alex Kochetov 366
Nikolay Kolchanov 2,341,376
Oleg Koliasnikov 174
Aleksey Kondrashov 40,176
Fyodor Kondrashov 235,298
Maria S. Kondratova 112
Anastasiya Konstantinova 9
Eugene Koonin 373
Larisa Kordyukova 178
Yuriy Korostelev 180
Igor Kosevich 282
Ekaterina Kotelnikova 182,232
Konstantin Kozlov 184,246
George Krasnov 186
Yulia Kraus 282
Roland Krause 187
Galina Kravatskaya 188
Yury Kravatsky 188
Andrey Kravchenko 190
Ekaterina Kropotkina 178
Gleb Krutinin 279
Eugenia Krutinina 279
Gregory Kryukov 191
Kirill Kryukov 192
Andrzej Kudlicki 193,312
Ivan Kulakovskiy 46,194,305
Aleksander Kulikov 368
Igor Kulikov 30
Ajit Kumar 207
Mikhail Kuravsky 197
Yerbol Kurmangaliyev 199
Maria Kurnikova 116,282
Mitzi Kuroda 155
Tatiana Kuzmenko 201
Iliya Kuzmin 50
Vladimir Kuznetsov 203
Eugeniy Kuznetzov 81
Joshua LaBaer 215
Alexey Lagunin 172
Olga Laikova 180
Sergey A. Lashin 205
Georges St. Laurent III 207
Viktor Lebedev 178
Olga Levtsova 207
Dmitriy Leyfer 211
Hua Li 326
David A. Liberles 212
Daniela Linder-Basso 155
Alexander Lifanov 213
Vitaly Likhoshvai 4,236,376
Gipsi Lima-Mendez 310
Andrey Lisitsa 300
Michail Lobanov 78,108
Steven Lockton 33
Tanya Logvinenko 215
Kristina Lopatovskaya 217,219
Wernisch Lorenz 46
Ilia Lossev 263
Elena Lukashina 221
Sergey A. Lukshin 99
Yulia Lunina 242
Rune Lyngso 190
Vassily Lyubetsky 120,217,219
Vijaya M. Nayak 63
J. Michael Macpherson 289
Bradley Main 223
Vsevolod Makeev 43,48,103,194,213,234,305
Amir Maksyutov 19
Maya Malakhova 268
Rahim Malekshahi 224
Dmitry Malko 84
Tatyana Mamonova 116
Shekhar Mande 132,225,227
Ashutosh Mani 127,129
Ashok Kumar Manickaraja 63
Palanisamy Manimaran 227
Ganiraju Manyam 36
Andrey Marakhonov 228
Sarah Marchetti 155
Sergei Maslov 229
Yurii G. Matushkin 205
Vadym Matusov 145
Pavel Mazin 230
Ilya Mazo 232
Yuriy Mazur 233
Lauren McIntyre 223
Julia Medvedeva 234
Margarita Meer 235
Alirea Mehridehnavi 224
Marta Miaczynska 142
Shane Miersch 215
Kirill Mikhailov 9
Yury Milchevsky 99,188
Alexandra Mirina 307
Andrey Mironov 56,71,161,230,255,261,287,308,336,356,362
Victoria Mironova 236
Desislava Miteva 238
Sergei Mitrofanov 240
Farzaneh Modarresi 207
Monzoorul Haque Mohammed 225
M. Moin 320
Marianna V. Moldaver 99
Igor Morgunov 242
Anna Mukhamedzhanova 221
Armen Mulkidjanian 244
Elena Muronets 61
Vladimir Muronetz 197
Arcady Mushegian 326,352
Ekaterina Myasnikova 184,246
Elena Nabieva 247
Elena Nadezhdina 145
Boris Nagaev 54
Vladimir A. Namiot 99
Fedor Naumenko 46
Daniil Naumoff 249,251
H. Nazem 320
Nafisa N. Nazipova 79
Olesya Nechay 253
Sergey Nechkin 207
Lidia Nefedova 50
Alexey Neverov 255,308
Mikhail Nikitin 9
Sergey Nikolaev 257
Swetlana Nikolajewa 258,259
Tatiana Nikolskaya 354
Yuri Nikolsky 354
Anna Nikulova 261
Pavel Novichkov 336
Vladimir Novoderezhkin 87
Ramil Nurtdinov 71,263,308
Sergey Nuzhdin 201,223,370
Alexey Nyporko 145,265
Igor Oferkin 118
Nadya Omelyanchuk 236
Nika Oparina 234
Nina Oparina 103,186,233,268,270,287
Dmitry Oshchepkov 376
Andrei Osterman 271
Alexander Osypov 135,273,275,277,279
Nithya Palaniappan 304
Andrey Palyanov 73
Vladimir Palyulin 233
Alexander Panchin 240,281
Yuri Panchin 240,282
Dmitri Papatsenko 285
Peter Park 155
Leonid Pereyaslavets 101,286
Inna Pertsovskaya 287
Dmitri Pervouchine 161
Dmitri Petrov 289
Svetlana Petrova 290
Hamid Pezeshk 23
Anton Pheophilov 292
Gaelle Pierron 45
Vincenzo Pirrotta 155
Olga Podgornaya 176
Andrey Polyanov 207
Mikhail P. Ponomarenko 341
Piotr M. Ponomarenko 341
Konstantin Popadin 253,294
Vladislav Popov 93
Ludmila Popova 282
Vladimir Poroikov 148,172,296,330
Irina Pouliakhina 343
Z. Pourpak 320
Peghah Poursharifi 320
Inna Povolotskaya 298
Gregory Provan 34
Mikhail Pyatnitskiy 300
Ovidiu Radulescu 45,302
Karthika Raghavan 304
Sergei Rahmanov 305
Alexandra Rakhmaninova 180
Vasily Ramensky 281,307,308
Divya Ramkumar 304
Dmitry Ravcheev 352
Mireille Regnier 41,48,313
Denis Reshetov 255
Oleg Reva 310
Nicole Riddle 155
Jochen Rink 142
Daniel Robyr 257
Dmitry Rodionov 271,336
Olga Rogacheva 339
Maria Rogova 268
Jacques Rougemont 257
Malgorzata Rowicka 193,312
M.A. Roytberg 313,315,317
Maxim A. Rubin 99
Andrew Rudenko 242
Leonid Rusin 9
Robert Russell 143
Anatoly Ruvinsky 165
S. Seema 139
Mehdi Sadeghi 23
Nataliya Sadovskaya 319
M. Saghiri 320
Reza Saghiri 320
Alsu Saifitdinova 21
Naruya Saitou 192
Maria Samsonova 184,246
Ludmila K. Savinkova 341
Dmitry Schtokalo 207
Laurent Schwartz 41
Galina Selivanova 148
Alexandr Seliverstov 217
Tatiana Semashko 178
Pavel Semenyuk 221
Vladimir Seplarskij 321
Marina Serebryakova 178,268
Roman Sergeev 322
Viktoria Serzhanova 328
Anna Shabarina 324
Dmitry Shagin 282
Irina Shagina 282
S. Shams 320
Gregory Shanower 155
Deepak Sharma 326
Andrew Sharp 257
N. Yu. Shashina 145
Armita Sheari 23
Weiwei Shi 354
Sergey Shigeev 228
Aleksandr Shishkov 50
Vladimir Shuvalov 374
Sahar Sibani 215
Anne Siegel 45
Mona Singh 247
Onkar Singh 203
Tom Skelly 1
Mikhail Skoblov 201,228,328,329
Vladlen Skvortsov 233
Yulia Smirnova 178
Boris Sobolev 330
Olga Sokolova 207
Anatoly Sorokin 273,275,277,279
Maksim Sorokin 253
Valery Sorokin 26
Sergei Spirin 6,54,123,240,332,334
K.G. Srinivasa 139
Elena Stavrovskaya 336
Vasily Stefanov 339
Vita Stepanova 238
Jean-Marc Steyaert 41,350
Gautier Stoll 45
Jürgen Sühnel 105
Dimitry Suplatov 343
Alexey Surin 78
Valentin V. Suslov 205,341
Vytas Švedas 343
Darja Svistunova 174
Alexandra Taisova 87,93
Mashkova Tamara 186
Makpal Tauasarova 140
Rene te Boekhorst 46
Denis Thieffry 38
Elisa I. Tiktopulo 99
Nina Tikunova 376
Kirill Timofeev 61
Franck Tirode 45
Igor Titov 368
Nicholas Toda 33
Victor Tomilov 345
Francisco Torrens 347
Laurent Tournier 38
Van Du Tran 350
Olga Tsoy 352
Marina Tsyganova 354
Benjamin Tu 312
Alexander Tulub 339
Vladimir Tumanyan 11,65,81,99,364
Alexander Tuzikov 165,322
Dmitrii Ukrainskii 65
Anna Vakhrusheva 356
Ilya Vakser 165
Valery Valyaev 137
Valentina Vasilevskaya 86
Mikhail Vasiliev 263
Maxime Venine 34
Valery Veresov 357,359
Jean-Philippe Vert 381
Dmitriy Vinogradov 287,362
Svetlana Vinogradova 362
Alexey Vitrechack 290
Peter Vlasov 213,364
Anna V. Vlasova 364
Oxana Volkova 366
Pavel Vorozheikin 368
Wendy Vu 370
Yuri Vyatkin 371
Claes Wahlestedt 207
Raymond Wan 167
Wolfgang Wende 123
Thomas Wilhelm 105
Yuri Wolf 373
Pramod K. Yadava 129
Andrey Yakovlev 87,374
Yumi Yan 1
Alla Yemets 145
Andrey Zadorozhnyi 376
Alexey Zakharov 148,172
Olga Zanegina 123,334
Evgeny Zdobnov 379
Marino Zerial 142
Dian Zhemoldinov 332
Andrei Zinovyev 38,45,381
Anastasiya Zobova 87,90,93
Renata A. Zvyagilskaya 79
