
MCCMB'2015

International Moscow Conference on Computational Molecular Biology MCCMB'2015

PROGRAM

Moscow, Russia July 16-19, 2015



Short Program

Thursday, 16 July 2015

9.00-10.00  Registration and morning coffee (MSU, Biological Faculty)
10.00-10.10  Opening (MSU, Biological Faculty, auditorium M1)

Session: Bacterial genomics (M1)
10.10-10.50  Emergence of Membrane Bioenergetics from Ancient Systems of Na+/K+ Homeostasis
    Armen Mulkidjanian
10.50-11.10  A universal signaling mechanism in bacterial chemoreceptors
    Igor Zhulin
11.10-11.30  Genomic analysis of the respiration of the microbiota of human intestine
    Dmitry Ravcheev
11.30-12.00  Coffee break

Session: Transcriptomics (M1)
12.00-12.40  Classifying transcriptional and genetic heterogeneity in single-cell measurements
    Peter Kharchenko
12.40-13.00  Network integration of parallel metabolomic-transcriptional data reveals novel metabolic modules regulating divergent macrophage polarization
    Maxim Artyomov


13.00-13.20  Differential gene expression by RNA-seq data in brain structures of laboratory animals with aggressive and tolerant behavior
    Yuriy Orlov
13.20-15.00  Lunch break

Session: Replication and chromatin (M1)
15.00-15.40  Inferring direction of replication fork and mechanism of DNA damage using sequencing data
    Maga Rowicka
15.40-16.00  APOBEC-induced mutations are strongly enriched on the lagging strand during replication in human cancers
    Vladimir Seplyarskiy
16.00-16.40  Diversity and Domestication in Oenological Yeasts
    David Sherman
16.40-18.40  Coffee break and Poster session
18.40-19.00  Buses depart for the Conference dinner
19.00  Dinner


Friday, 17 July 2015

09.00-10.00  Morning coffee

10.00-12.40  OMICS-2015 (Skolkovo satellite) (MSU, Biological Faculty, auditorium M1)
10.10  Introduction
    Yuri Nikolsky
10.15  Introduction
    Pakhomov or Meyer
10.20  Introduction of Jury
10.30-12.10  Presentation of projects: Projects 1-10
12.10-12.40  Coffee break

12.40-15.00  OMICS-2015 (Skolkovo satellite), cont'd (MSU, Biological Faculty, auditorium M1)
12.40  Presentation of projects: Projects 11-20
14.20-14.40  Jury Counting
14.40-15.00  Announcement of winners, presentation of awards, conclusion

10.00-13.00  MoBiLe 10Y symposium (Leiden satellite) (MSU, Faculty of Bioengineering and Bioinformatics, auditorium 221)
10.00-10.10  Welcome
    A.E.Gorbalenya, A.Alexeevski


10.10-10.40  Fast evolution of a conserved residue of polyomaviruses defines a new mechanism of adaptation that operates by accelerated codon-constrained Val-Ala (COCO-VA) toggling within an intrinsically disordered protein region
    A.E.Gorbalenya
10.40-11.00  The impact of biological ageing on RNA processing
    I.Pulyakhina, V.A.Takhaveev, M. Vermaat, M.S. Gelfand, J.F.J. Laros, P.A.C. 't Hoen, BIOS consortium
11.00-11.20  Probing-directed structured elements detection in RNA sequences
    Svetlana Vinogradova
11.20-11.50  Mobilis in MoBiLe: Students in a Dynamic Research Field
    O.A. Mayboroda
11.50-12.10  Coffee break
12.10-12.40  Novel insights in the regulation of mRNA transcription, processing and translation through integration of mRNA sequencing data
    P.A.C. 't Hoen
12.40-13.00  Active chromatin regions are sufficient to define borders of topologically associated domains in D. melanogaster interphase chromosomes
    E.Khrameeva, S.Ulyanov, A.Gavrilov, Yu.Shevelyov, M.Gelfand, and S.Razin
13.00-14.00  Lunch break

14.00-15.50  MoBiLe 10Y symposium (Leiden satellite), cont'd (Faculty of Bioengineering and Bioinformatics, auditorium 221)
14.00-14.30  Alignment-free telomere length estimation from whole genome NGS data
    S.M.Kielbasa


14.30-14.50  A comparison of two methods for detection of exceptional words in genomic sequences of prokaryotes
    Ivan Rusinov
14.50-15.10  Peptide search engine approach for detection of translated mutations
    P.Sinitcyn, S.Tyanova, M.Mann and J.Cox
15.10-15.30  NPG-explorer: a new tool for nucleotide pangenome construction and analysis of closely related prokaryotic genomes
    Boris Nagaev
15.30-15.50  O2PLS as an integrative tool in systems oncology
    E.Nevedomskaya and H.Keun

Session: Medical bioinformatics (MSU, Biological Faculty, auditorium M1)
15.20-15.50  Building a Sustainable Bioinformatics Program Through Integrated Support
    Michael Tartakovsky
15.50-16.20  Network of the Country Tuberculosis Portals
    Alexander Rosenthal

Session: Cancer-1 (M1)
16.20-17.00  Deciphering gene interaction to explain tumor progression
    Emmanuel Barillot
17.00-17.30  Coffee break

Session: Cancer-2 (M1)
17.30-17.50  A model for scoring damaging mutations in the non-coding tumoral genome based on germline and tumor data
    Jia Li


17.50-18.10  Selection pressure on breast cancer somatic mutations revealed by bioinformatics sequence analysis
    Ivan Kulakovskiy
18.10-18.30  Searching For Essential Cancer Proteins: Analysis Of Hypomutated Genes In Skin Melanoma
    Mikhail Pyatnitskiy
18.30-19.10  Revealing mechanisms of cancer progression by pan-cancer deconvolution of tumoral transcriptomes
    Andrei Zinovyev


Saturday, 18 July 2015

09.00-10.00  Morning coffee

Session: Methods and algorithms (M1)
10.00-10.20  Analysis of variation in 3000 rice genomes project
    Tatiana Tatarinova
10.20-10.40  Distance-based profiling aids in evaluation of ageing-related phenomena
    Ancha Baranova
10.40-11.00  Scaffold assembly based on genome rearrangement analysis
    Max Alekseyev
11.00-11.20  A method for model comparison based on the parameter sensitivity measures
    Ekaterina Myasnikova
11.20-11.40  Method to predict the percentage of cell types in human blood
    Anna Igolkina
11.40-12.10  Coffee break

Session: Systems biology: Mammals and beyond (M1)
12.10-12.50  Mammalian Systems biology
    Alistair Forrest
12.50-13.30  Genomics of Lifespan Control
    Vadim Gladyshev
13.30-15.00  Lunch break

Session: Genomics of regulation (M1)
15.00-15.20  Computer analysis of genome co-localization of transcription factor binding sites based on ChIP-seq data
    Arthur Dergilev


15.20-15.40  Search for simple and composite auxin responsive elements in Arabidopsis thaliana genome
    Victoria Mironova
15.40-16.00  Antisense interactions of long noncoding RNAs in human cells
    Ivan Antonov

Session: Viruses – 1 (M1)
16.00-16.40  Challenges in virus genomics
    Manja Marz
16.40-17.10  Coffee break

Session: Viruses – 2 (M1)
17.10-17.30  Sequence and structural analysis of related proteins in distant viral species
    Olga Kalinina

Session: Proteins (M1)
17.30-17.50  Bioinformatic analysis of diverse protein superfamilies to design improved enzymes
    Dmitry Suplatov
17.50-18.10  Detecting the features of functional specificity in protein families based on the local sequence similarity
    Boris Sobolev
18.10-18.30  Determination of the size of folding nuclei of protofibrils from the concentration dependence of the rate and lag-time of their formation
    Oxana Galzitskaya
18.30-19.10  Assessing protein synthesis with ribosome profiling
    Pavel Baranov


Sunday, 19 July 2015

09.00-10.00  Morning coffee

Session: Genomics of anhydrobiosis (M1)
10.00-10.40  Anhydrobiosis in the sleeping chironomids: where are we now
    Takahiro Kikawada
10.40-11.00  Expression regulation of desiccation-resistance genes in Polypedilum vanderplanki
    Pavel Mazin
11.00-11.20  Molecular basics of different mechanisms of desiccation tolerance in Chironomidae midges
    Olga Kozlova
11.20-11.40  Adapting to extremes: linking metabolome and genome of an anhydrobiotic insect
    Elena Shagimardanova
11.40-12.10  Coffee break

Session: Toolkits for anhydrobiosis research (M1)
12.10-12.50  Single cell molecular toolkit for inducible resistance to complete desiccation
    Oleg Gusev
12.50-13.30  Genetic toolkit for investigation of anhydrobiosis: promoters and RNAi
    Richard Cornette
13.30-15.00  Lunch break

Session: Genome structure (M1)
15.00-15.20  Genome mapping revealed scaffold misassemblies and elevated gene shuffling on the X chromosome in malaria mosquitoes
    Igor Sharakhov


15.20-15.40  Detection of short size mutations and copy number alterations in ultra-deep targeted sequencing data
    Valentina Boeva
15.40-16.00  Genomic structural instability and homologous recombination deficiency in breast and ovarian cancers
    Tatiana Popova
16.00-16.20  Genome Track Analyzer: New tool for genome-wide study of correlations between distributed genome features
    Galina Kravatskaya
16.20-16.40  Tale on the transposons on chromatin landscape
    Vladimir Babenko
16.40-17.10  Coffee break

Session: Evolution (M1)
17.10-17.30  Assessing the impact of horizontal gene transfer on the evolution of prokaryotes
    Vladimir Makarenkov
17.30-17.50  Rare amino acid changes fixation drives divergence in Metazoa evolution
    Konstantin Gunbin
17.50-18.10  Evolution of TAG codon in Methanosarcina
    Margarita Meer
18.10-18.30  A model of protein evolution within local fitness landscape changing with time
    Dinara Usmanova
18.30-19.10  Chartering the local fitness landscape of the green fluorescent protein
    Fedor Kondrashov
19.10-19.20  Closing
19.30  Farewell party


Extended Program

Thursday, 16 July 2015

9.00-10.00  Registration and morning coffee (MSU, Biological Faculty)
10.00-10.10  Opening (MSU, Biological Faculty, auditorium M1)

Session: Bacterial genomics (M1)

10.10-10.50  Emergence of Membrane Bioenergetics from Ancient Systems of Na+/K+ Homeostasis
    Armen Mulkidjanian

It is well known that the cytoplasm of living cells generally contains more potassium ions than sodium ions. The prevalence of K+ ions is crucial for the activity of numerous (nearly) universal, key enzymes, including those components of the translation system that even preceded the Last Universal Cellular Ancestor (LUCA). In modern prokaryotic cells, the [K+]/[Na+] ratio > 1.0 is maintained by ion-tight cellular membranes and an arsenal of ion pumps. It is unlikely that modern-type ion-tight membranes made of two-tail lipids, not to mention a plethora of ion-pumping machines, were present in the very first cells. It is more likely that the monovalent ion content of the cytoplasm of the first cells would have to be equilibrated with the environment. The inhibitory effect of Na+ on many of these ubiquitous K+-dependent enzymes does not seem compatible with the evolution of the respective cellular systems and, generally, the first cells in environments with high sodium levels. As early as 1926, Archibald Macallum suggested that the first cells might have emerged in K+-rich habitats. Several different, albeit complementary, geochemical scenarios have been recently proposed for the K+-rich environments of the primordial Earth. Marine and freshwater environments generally show a [K+]/[Na+] ratio less than unity. Therefore, to invade such environments while maintaining the cytoplasmic [K+]/[Na+] ratio over unity, primordial cells needed ion-tight membranes and means to extrude sodium ions. The foray into new, Na+-rich habitats was the likely driving force behind the evolution of diverse redox-, light-, chemically-, or osmotically-dependent sodium export pumps.

By combining comparative structural and phylogenomic analyses we try to reconstruct how an interplay between diverse, initially independent sodium export pumps could lead to the emergence of membrane bioenergetics.

10.50-11.10  A universal signaling mechanism in bacterial chemoreceptors
    Igor Zhulin


Bacterial chemoreceptors serve as a model system for understanding transmembrane signaling. However, the mechanisms by which conformational signals move within and between receptors and how they control kinase activity remain unknown. Using all-atom, microsecond-range molecular dynamics simulations on a special-purpose supercomputer, we show that the kinase-activating cytoplasmic tip of the chemoreceptor fluctuates between two stable conformations in a signal-dependent manner. A specific residue, Phe396, appears to serve as the conformational switch, because flipping of the stacked aromatic rings of an interacting F396-F396' pair in the receptor homodimer took place concomitantly with the signal-related conformational changes. Comparative genomic analysis reveals that F396 is the single most conserved residue in the entire chemoreceptor molecule: it is invariant in 99.8% of chemoreceptor sequences from all available genomes of bacteria and archaea. We conclude that despite substantial differences in the signaling domain between diverse bacterial species, the signaling mechanism is universally conserved.

11.10-11.30  Genomic analysis of the respiration of the microbiota of human intestine
    Dmitry Ravcheev

11.30-12.00  Coffee break

Session: Transcriptomics (M1)

12.00-12.40  Classifying transcriptional and genetic heterogeneity in single-cell measurements
    Peter Kharchenko

Single-cell assays are making it possible to examine transcriptional states and other genome-wide properties of thousands of individual cells. The ability to directly assess cell heterogeneity is particularly critical in the context of cancer therapy, where the presence of phenotypically distinct subclonal populations fuels relapse and resistance to treatment. The transcriptional heterogeneity within such tumors and its impact on disease progression are poorly understood. Furthermore, the extent to which genetic and transcriptional subpopulations correspond to each other cannot currently be assessed.

To investigate these questions we have developed methods for analysis of single-cell RNA-seq data in concert with other genomic information. To characterize transcriptional subpopulations we identify annotated or newly-discovered gene sets that are linked to statistically significant heterogeneity within the measured collection of cells. To infer genotype information we rely on probabilistic assessment of single nucleotide variants and copy number variation in individual cells, which can be used to distinguish genetically subclonal populations. We apply these methods to examine transcriptional and genetic heterogeneity in samples of multiple myeloma and other tumors.


12.40-13.00  Network integration of parallel metabolomic-transcriptional data reveals novel metabolic modules regulating divergent macrophage polarization
    Maxim Artyomov

We have developed an integrated high-throughput transcriptional-metabolic profiling and analysis pipeline, and applied it to characterize global rewiring during murine macrophage polarization to pro- and anti-inflammatory (M1 and M2) states. Network-based integration of metabolic and transcriptional RNA-seq data allowed us to mitigate problems specific to individual types of data and to obtain a global view of the metabolic changes during macrophage polarization. Metabolic profiling can be directly associated with a well-defined network of biochemical reactions, but it is not thorough: absence of a signal does not imply absence of the metabolite. On the other hand, transcriptional profiling catches all sufficiently expressed genes, and this information can be associated with metabolic reactions via enzymes. Thus, we compiled a network of reactions based on KEGG as a framework to integrate the metabolic and transcriptional profiling data. Next, we adapted the BioNet algorithm to weigh the nodes and edges in the network based on the p-value of differential expression (DE) between M1 and M2 conditions. Then we found a most connected subnetwork that contained as many positively scored and as few negatively scored nodes as possible. That led to a set of the most important interconnected reactions. As expected, this set contained well-known macrophage-related pathways such as glycolysis, TCA cycle, etc. However, 1) it showed how these pathways were interacting and 2) it contained modules not described previously. In M2 macrophages we discovered novel glutamine/glutamate- and UDP-GlcNAc-associated modules, and validated their involvement using isotope labeling studies. Functional importance of these modules was further confirmed by glutamine deprivation and N-glycosylation inhibition experiments.
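The subnetwork search described above can be illustrated on a toy example. The sketch below is not the BioNet implementation and does not reproduce its scoring model: it assumes a simplified node score (positive when the DE p-value is below a hypothetical threshold, negative otherwise) and finds the best-scoring connected node set by brute force, which is only feasible for very small graphs. The node names, p-values and threshold are all invented for illustration.

```python
from itertools import combinations
import math


def node_score(p_value, threshold=0.05):
    """Simplified score (an assumption): > 0 if p_value < threshold, < 0 otherwise."""
    return math.log10(threshold / p_value)


def is_connected(nodes, graph):
    """Check that `nodes` induce a connected subgraph of `graph`."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen = {start}
    stack = [start]
    while stack:
        current = stack.pop()
        for neighbour in graph[current]:
            if neighbour in nodes and neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen == nodes


def best_connected_module(graph, scores):
    """Exhaustively search all connected node subsets for the maximum total score."""
    best_nodes, best_total = set(), float("-inf")
    names = sorted(graph)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            if is_connected(subset, graph):
                total = sum(scores[n] for n in subset)
                if total > best_total:
                    best_nodes, best_total = set(subset), total
    return best_nodes, best_total


# Hypothetical reaction network (undirected adjacency lists) and DE p-values.
graph = {
    "A": ["B", "C", "E"],
    "B": ["A"],
    "C": ["A", "D"],
    "D": ["C"],
    "E": ["A"],
}
p_values = {"A": 0.001, "B": 0.01, "C": 0.5, "D": 0.0001, "E": 0.9}
scores = {name: node_score(p) for name, p in p_values.items()}

module, total = best_connected_module(graph, scores)
print(sorted(module), round(total, 3))
```

In this toy network the negatively scored node C is kept because it bridges two strongly scored nodes, while E is left out, which is the trade-off between positively and negatively scored nodes mentioned in the abstract.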
In M1 macrophages we identified a metabolic break at Idh fragmenting the TCA cycle, and validated it using isotope labeling. Label distribution suggested the presence of a novel variant of the aspartate-arginosuccinate shunt. Consistently, inhibition of aspartate-aminotransferase, a key enzyme of the shunt, hindered NO and IL6 production while promoting mitochondrial respiration. This systems approach provides a highly integrated picture of the physiological modules supporting macrophage polarization, identifying potential pharmacologic control points for both macrophage phenotypes.

13.00-13.20  Differential gene expression by RNA-seq data in brain structures of laboratory animals with aggressive and tolerant behavior
    Yuriy Orlov

13.20-15.00  Lunch break

Session: Replication and chromatin (M1)


15.00-15.40  Inferring direction of replication fork and mechanism of DNA damage using sequencing data
    Maga Rowicka

Double-stranded DNA breaks (DSBs) are a genotoxic form of DNA damage. The damage to both DNA strands precludes the straightforward use of the complementary strand as a template for repair, resulting in mutagenic lesions. Despite many studies on the mechanisms of DSB formation, our knowledge of them is very incomplete. A main reason for our limited knowledge is that, to date, DSB formation has been extensively studied only at specific loci but remains largely unexplored at the genome-wide level. We recently developed a method to label DSBs in situ followed by deep sequencing (BLESS), and used it to map DSBs in human cells with a resolution 2-3 orders of magnitude better than previously achieved. There are many factors inducing DSBs, including replication stress, oxidative stress and irradiation. Most of them cause two-ended DSBs (having two free ends of DNA); the only exception is replication stress, which usually induces one-ended DSBs (caused by replication fork stalling and collapse). We use this observation to infer DSBs resulting from replication stress and to analyze chromatin context and sequence features related to replication stress-induced DSBs. Moreover, we show how to reconstruct the direction of replication fork movement from the BLESS-Seq read pattern. We apply this concept to infer replication domain boundaries for several cell lines and conditions and to analyze how they change upon treatments and vary between cell lines. We also provide experimental verification for the proposed computational method and show that purely computational methods can predict >80% of experimentally detected DSBs.

15.40-16.00  APOBEC-induced mutations are strongly enriched on the lagging strand during replication in human cancers
    Vladimir Seplyarskiy

Mutagenesis induced by deaminases of the APOBEC family is prevalent in many cancers. A fraction of APOBEC mutations is clustered around DSBs; however, the vast majority of them are dispersed over the genome. Since APOBEC specifically mutates single-stranded DNA (ssDNA), we hypothesized that the lagging DNA strand, which exists in a single-stranded state during DNA replication, may be a frequent target for APOBEC mutations. Knowing the direction of replication fork progression in the human genome, we were able to predict for each genomic region which of the two DNA strands is lagging during replication. We observed that APOBEC mutations exhibit a strong 1.96-fold bias towards the lagging strand, suggesting that this is the major mechanism of generation of APOBEC mutations, explaining more than 1/3 of cases. Additionally, we report a 2.3-fold excess of APOBEC mutations at non-methylated cytosines relative to 5-methylcytosine, and a nearly complete absence of enrichment of APOBEC and non-APOBEC mutations in late-replicating regions in patients with the APOBEC signature. This research provides novel insights into APOBEC mutagenesis and suggests mechanistic explanations for a considerable fraction of APOBEC-induced mutations.

16.00-16.40  Diversity and Domestication in Oenological Yeasts
    David Sherman


16.40-18.40  Coffee break and Poster session
18.40-19.00  Buses depart for the Conference dinner
19.00  Dinner


Friday, 17 July 2015

09.00-10.00  Morning coffee

10.00-12.40  OMICS-2015 (Skolkovo satellite) (MSU, Biological Faculty, auditorium M1)
10.10  Introduction
    Yuri Nikolsky
10.15  Introduction
    Pakhomov or Meyer
10.20  Introduction of Jury
10.30-12.10  Presentation of projects: Projects 1-10
12.10-12.40  Coffee break

12.40-15.00  OMICS-2015 (Skolkovo satellite), cont'd (MSU, Biological Faculty, auditorium M1)
12.40  Presentation of projects: Projects 11-20
14.20-14.40  Jury Counting
14.40-15.00  Announcement of winners, presentation of awards, conclusion

10.00-13.00  MoBiLe 10Y symposium (Leiden satellite) (MSU, Faculty of Bioengineering and Bioinformatics, auditorium 221)
10.00-10.10  Welcome
    A.E.Gorbalenya, A.Alexeevski


10.10-10.40  Fast evolution of a conserved residue of polyomaviruses defines a new mechanism of adaptation that operates by accelerated codon-constrained Val-Ala (COCO-VA) toggling within an intrinsically disordered protein region
    A.E.Gorbalenya

It is commonly accepted that conserved residues evolve slowly. We challenge the generality of this central tenet of molecular biology by describing the fast evolution of a nucleotide position that is among the most conserved in the long overlap of de novo and ancestral open reading frames (ORFs) of a large subset of polyomaviruses. The de novo ORF is expressed through either the ALTO protein or the Middle T antigen (MT/ALTO), while the ancestral ORF encodes the N-terminal domain of the helicase-containing Large T (LT) antigen. In the latter domain, the conserved Cys codon of the LXCXE pRB-binding motif constrains codon evolution in the overlapping MT/ALTO ORF to a binary choice between Val and Ala codons, termed here codon-constrained Val-Ala (COCO-VA) toggling. We found the rate of COCO-VA toggling to approach the speciation rate and to be significantly accelerated compared to the baseline rate of chance substitution in a large monophyletic lineage of MT/ALTO-encoding viruses comprising dozens of species. We have then extended this analysis to the characterization of the evolution of the COCO-VA site within a single polyomavirus species. To this end, we have analyzed thirteen mostly newly sequenced genomes of Trichodysplasia spinulosa-associated polyomavirus (TSPyV) representing ~40% of reported cases of the Trichodysplasia spinulosa disease in humans world-wide. Only very limited genome variation (≤ 0.6%) was found, with a total of four non-synonymous substitutions (NSS). Three of these affected only MT/ALTO, with one NSS, fixed earliest in TSPyV evolution, involving the COCO-VA toggling.

Importantly, the COCO-VA site is located in a short linear motif (SLiM) of an intrinsically disordered region, a typical characteristic of adaptive responders. These findings provide evidence that the COCO-VA toggling is under positive selection in TSPyV and many other polyomaviruses that form a monophyletic lineage and infect a wide range of hosts. Thus, the COCO-VA toggling plays a critical role in virus adaptation, which is unprecedented for conserved residues.
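The codon constraint behind COCO-VA toggling can be checked with the standard genetic code. The sketch below assumes that the overlapping ORF is read one nucleotide downstream of the LT frame, so that its codon starts at the second nucleotide of the Cys codon; this frame offset is an assumption made for illustration and is not stated in the abstract. The function and variable names are likewise hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def overlapping_residues(constrained_aa="C", shift=1):
    """Amino acids possible in a frame shifted by `shift` nucleotides
    when the reference frame is fixed to encode `constrained_aa`."""
    residues = set()
    for codon, aa in CODON_TABLE.items():
        if aa != constrained_aa:
            continue
        # Keep the nucleotides shared with the shifted codon, then try
        # every possible downstream nucleotide that completes it.
        for tail in product(BASES, repeat=shift):
            shifted_codon = codon[shift:] + "".join(tail)
            residues.add(CODON_TABLE[shifted_codon])
    return residues


allowed = overlapping_residues("C", shift=1)
print(sorted(allowed))
```

Under this assumed offset, the two Cys codons (TGT and TGC) leave only GTN and GCN codons available in the overlapping frame, that is, Val or Ala, and a single T/C transition toggles between them.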

10.40 The impact of biological ageing on RNA processing 11.00 I.Pulyakhina, V.A.Takhaveev, M. Vermaat, M.S. Gelfand, J.F.J. Laros, P.A.C. 't Hoen, BIOS consortium Ageing of humans has been associated with large-scale changes in gene expression, however, influence of biological ageing on RNA processing has not been extensively studied yet. In this work we utilized an unprecedented collection of transcriptomes to uncover changes that encompass alternative splicing over the process of human ageing. Whole blood was collected from a large cohort of 626 individuals with a broad age distribution (20-80 years) and subjected to RNA-sequencing. Analyzing this RNA-Seq data, we developed a statistical model to evaluate characteristics of alternative splicing, accounting for potential confounder effects of phenotypic traits and age-related switching in the cell composition of blood. We discovered that the rates of exon skipping and intron

19 MCCMB'2015 retention significantly elevate with age, and that affected genes show no functional selectivity. GC content of the transcriptome was found to increase temporatily, and the changes in alternative splicing were recognized contributing to that. We discovered that the usage of non-canonical donor splice sites increases with age, furthermore, we show that the number of acceptors paired with one donor significantly increases with age leading to potential functional changes. Our findings indicate that splicing machinery undergoes significant age-related changes. They lead to the increased incidence of such alternative splicing events as intron retention and exon skipping, and promote implication of novel splice sites with unconventional nucleotide motifs. 11.00 Probing-directed structured elements detection in 11.20 RNA sequences Svetlana Vinogradova Transcripts often harbor RNA elements, which regulate cell processes co- or post- transcriptionally. The functions of many regulatory RNA elements depend on their structure, so it is important to determine the structure as well as to scan genomes for structured elements. The best way to do this is to use comparative genomics approach and search for evolutionary conserved structures. But a suitable set of homologous sequences with moderate sequence divergence is too often not available due to the lack of related sequenced genomes, or rugged fitness landscape resulted in extremely high or low sequence conservation of structured RNAs. In these cases, we have to deal with single RNA sequences. Functional RNAs are more stable than genomic background and we used this fact to develop the RNASurface algorithm that detects putative structured elements in RNA sequences. The sizes of regulatory RNA elements vary from tens to hundreds of nucleotides and this results in limitation for computational approaches based on sliding window. 
RNASurface does not restrict the search to elements of fixed size but rather detects structured RNAs of optimal lengths. Chemical probing of RNA is an alternative source of structural information: probing reactivities strongly correlate with local nucleotide flexibility. We incorporate probing data in MFE calculation using procedure that is called ‘soft constraint’ approach. It is based on pseudo-energies that favour individual positions in RNA structure to be paired or unpaired. One important advantage of our approach is the ability to incorporate any type of experimental data (SHAPE, PARS, DMS, etc.): we deal with probabilities of nucleotides of being paired/unpaired instead of relying on arbitrary normalized data from experiments. Incorporation of RNA probing data into computation pipeline increases the signal/noise rate of structures prediction and detects more functional structures. However, our method is still dependent on the quality of probing data. Though at the moment high quality genome-wide RNA probing data for various organisms is not available, we believe that global interrogation of RNA structure will assist computational strategies to better model RNA structure, predict RNA function and screen genomes for functional RNAs. 11.20 Mobilis in MoBiLe: Students in a Dynamic Research 11.50 Field O.A. Mayboroda One of the most quoted definitions of bioinformatics describes it as a discipline which “encompassesalmostallcomputerapplicationsin biological sciences” (Attwood, 1999), but

20 MCCMB'2015

at the same time points out that the term “was originally coined in the mid-1980s for the analysis of the biological sequence data”. Indeed, the analysis of the biological sequence data had dominated the field until analytical instrumentation such as mass spectrometry and NMR has become an everyday reality of a biological/medical laboratory. Today, translation of the instrumental raw data into a “computationally efficient” format, matching tandem mass spectra to peptide sequences (derived from DNA sequences) or 1D and 2D NMR data to metabolic libraries is probably most dynamic part of the field. Here we present the data processing solutions for proteomics and metabolomics implemented into routine analysis over the last years. An essential element of the selected workflows is a contribution of the MoBiLe program students to their development. Presented with small but intellectually challenging tasks, the students have often had to better define the problem as a first step towards solving it. Examples include a format converter for a novel algorithm matching tandem mass spectra to peptides that had to infer enzyme specificity, and a model for species identification that ended up simulating and comparing bottom-up proteomics experiments. This year, a pair of students (or “Dimitries”) will combine anatomical and stage ontologies, controlled vocabularies with mass spectrometry, RNA-Seq or NMR data to produce visual, molecular, maps projecting the NMR, RNA-Seq or MS data onto representative anatomical drawings of our model systems. Finally, using an ongoing collaboration with Department of Nephrology as an example, we show how the tools developed by our students can lead to clinically meaningful results. 11.50-12.10 Coffee break 12.10 Novel insights in the regulation of mRNA 12.40 transcription, processing and translationthrough integration of mRNA sequencing data P.A.C. 
't Hoen To date, the human transcriptome is known to contain around 80,000 protein-coding transcripts, and the estimated number of proteins synthesized range from 250,000 to 1 million. All these transcripts and proteins are coded by less than 20,000 genes, suggesting extensive regulation at transcriptional, post-transcriptional and translational level. I will discuss how integration of data obtained from diverse RNA sequencing technologies (RNA-seq, deepCAGE, ribosome footprinting) improves our understanding of these regulatory mechanisms and I will illustrate how these mechanisms jointly orchestrate the changes in protein demands during muscle differentiation. The individual regulatory layers appear to be tightly linked, with extensive cross-talk and feedback between them. To decipher the cross-talk between transcriptional and posttranscriptional regulation, we analysed PacBio® single-molecule long sequencing reads capturing full- length mRNA molecules. These data show that the vast number of potential combinations between alternative transcription start sites, alternatively spliced exons and alternative polyadenylation sites result in a relatively limited number of mRNA species, supporting the tight coupling between these processes. Further integration of RNA sequencing data will elucidate the true complexity of the transcriptome and its multi- layered regulation. Resume: Peter-Bram ’t Hoen is Associate Professor in Bioinformatics at Leiden University Medical Center. Since 2010, he has been responsible for all

21 MCCMB'2015 bioinformatics activities within the department of Human Genetics. He is leading a multidisciplinary team of researchers (molecular biologists and bioinformaticians) working on transcriptomics (RNA-seq) and proteomics data analysis, modeling of transcriptional networks, (cross-species) data integration, analysis of biological networks, and discovery of molecular biomarkers. He is an expert in RNA sequencing and coordinator of the yearly BioSB “Advanced RNA sequencing data analysis” course. He is also a member of the management team of BBMRI’s Biobank-based integrative omics study (BIOS). His main research interest is the regulation of gene expression and the mechanisms controlling alternative transcription, splicing, polyadenylation, and translation. 12.40 Active chromatin regions are sufficient to define 13.00 borders of topologically associated domains in D. melanogaster interphase chromosomes E.Khrameeva, S.Ulyanov, A.Gavrilov, Yu.Shevelyov, M.Gelfand, and S.Razin In Drosophila, interphase chromosomes are organized in topologically associated domains (TADs) within which chromatin-chromatin interactions are frequent, while interactions across domain borders are rare. TAD positions on chromosomes appear to be conservative between cells of different lineages, and even between animal species. However, molecular mechanisms underlying partitioning of chromosomes in TADs are poorly understood. Insulator elements have been proposed to play a key role in definition of TAD borders but recently experimental evidences against this hypothesis have appeared. Here we used Hi-C method to map TADs in four drosophila cell lines of different origin. The cell lines share up to 80% TAD positions, while cell type specific TAD borders correlate with transcription changes between cell lines. TADs appear to be self-organizing condensed chromatin domains depleted in active chromatin marks. 
Active chromatin regions that cannot be organized into compact structures separate TADs and are sufficient to establish TAD borders without the contribution of insulator proteins such as Su(Hw) or CTCF.

13.00-14.00 Lunch break

14.40-15.50 MoBiLe 10Y symposium (Leiden satellite), cont'd (Faculty of Bioengineering and Bioinformatics, auditorium 221)

14.00-14.30 Alignment-free telomere length estimation from whole genome NGS data
S.M.Kielbasa
Telomeres are repetitive structures present at each end of a chromatid. They play a role in the maintenance of genome integrity. Due to the nature of the chromosome replication process, telomeres shorten at each replication cycle. Consequently, the average telomere length decreases over the lifetime of the organism, and it may be used as a marker of the organism’s biological age. Here we present a method for accurate estimation of telomere lengths from unaligned whole genome sequencing reads. We developed the method based on a dataset provided by The Genome of the Netherlands (GoNL) project, which generated whole genome sequencing data for 754 samples of 248 Dutch families. For 381 of the samples telomere length measurements were available. These measurements were obtained without the use of next generation sequencing methods. Our method contains two components: a read classifier and a linear model. The read classifier is a fast function for detecting repetitive sequences (in particular the telomeric motif TTAGGG) in read sequences. We apply this function to all reads of a sample and then build a table of counts of reads with various repetitive motifs. Next, based on the read count table and the available telomere length measurements, we train a linear predictor of telomere length. We demonstrate that the simplest possible predictor, which is based only on the frequency of reads with the telomeric motif TTAGGG, displays a strong sequencing batch bias. When the frequencies of a few other repetitive motifs are incorporated into the model, its performance improves significantly. Finally, we compare our predictions with predictions obtained from the telseq algorithm. The telseq estimates show a strong effect of sequencing batch. Moreover, we demonstrate that our method delivers estimates more strongly associated with the age of the individuals.

14.30-14.50 A comparison of two methods for detection of exceptional words in genomic sequences of prokaryotes
Ivan Rusinov
An exceptional word is an oligonucleotide whose observed frequency in a genome differs notably from the expected one. Such words are good candidates for functional sites under evolutionary pressure. The maximum order Markov model (Mmax) is widely used to estimate the expected frequency of a short word in a genome. However, real DNA sequences are poorly described by such a model. Karlin et al. proposed another method that takes into account the observed frequencies of all subwords of a word, including degenerate ones, to estimate its expected frequency.
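As an illustration of the Mmax estimate mentioned above, a minimal Python sketch follows. It assumes the commonly used form of the maximum order Markov estimate, in which the expected count of a word w1...wn is N(w1...wn-1) * N(w2...wn) / N(w2...wn-1). The function names, the single-strand counting and the toy sequence are assumptions made for illustration and are not taken from the abstract.

```python
from collections import Counter


def count_words(seq, k):
    """Count all overlapping words of length k in seq (one strand only)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def mmax_expected(seq, word):
    """Expected count of `word` under the maximum order Markov model.

    E(w1..wn) = N(w1..wn-1) * N(w2..wn) / N(w2..wn-1),
    where N(.) are observed counts in seq.
    """
    n = len(word)
    if n < 3:
        raise ValueError("word must have at least 3 letters")
    shorter = count_words(seq, n - 1)
    prefix = shorter[word[:-1]]                     # N(w1..wn-1)
    suffix = shorter[word[1:]]                      # N(w2..wn)
    core = count_words(seq, n - 2)[word[1:-1]]      # N(w2..wn-1)
    return prefix * suffix / core if core else 0.0


def contrast_ratio(seq, word):
    """Observed / expected count; values far from 1 flag exceptional words."""
    observed = count_words(seq, len(word))[word]
    expected = mmax_expected(seq, word)
    return observed / expected if expected else float("nan")


if __name__ == "__main__":
    toy = "ACGTACGTAC"  # hypothetical sequence, for illustration only
    print(mmax_expected(toy, "ACG"), contrast_ratio(toy, "ACG"))
```

A real analysis would typically count both strands and assess the significance of the deviation; both are left out of this sketch.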
We compared Karlin's method with the Mmax-based one in terms of detecting recognition sites of restriction-modification systems that are avoided in a prokaryotic genome. Restriction sites were chosen as target short words for the comparison because of the high specificity of restriction-modification systems. The two methods gave significantly different estimates of restriction site representation. Thus, the method used has a significant impact on the results. We demonstrated that Karlin's method is more reliable for detection of exceptional words in prokaryotic genome sequences, probably because it uses the frequencies of all site subwords to evaluate representation.

14.50-15.10 Peptide search engine approach for detection of translated mutations
P.Sinitcyn, S.Tyanova, M.Mann and J.Cox

15.10-15.30 NPG-explorer: a new tool for nucleotide pangenome construction and analysis of closely related prokaryotic genomes
Boris Nagaev
Genomes of closely related bacteria have highly similar sequences of orthologous

fragments but usually undergo multiple rearrangements, long deletions, insertions of mobile elements and, occasionally, horizontally transferred regions. We developed a new tool, Nucleotide PanGenome explorer (NPG-explorer), designed for alignment and analysis of a set of closely related input genomes. NPG-explorer constructs a nucleotide pangenome (NPG): a set of aligned blocks, each block consisting of orthologous fragments. The minimum block length (default 100 bp) and minimum identity (default 90%) are algorithm parameters. NPG-explorer iterates the block detection algorithm until the following criterion is satisfied: an all-against-all BLAST search of block consensuses detects no hits of appropriate size and identity. Each nucleotide from the input genomes belongs to exactly one block of the NPG (this is the reason for the NPG terminology). Blocks are classified into four categories. Stable blocks (s-blocks) are composed of one fragment from each genome. Hemi-stable blocks (h-blocks) are represented by one fragment from each genome in a subset of genomes. Repeat-containing blocks (r-blocks) contain more than one fragment from at least one genome. Unique sequence blocks (u-blocks) contain only one fragment of length greater than a threshold. Minor blocks (m-blocks) are blocks of fragments of length less than a threshold. A blockset of global and intermediate blocks is also constructed. Global blocks consist of joined consecutive collinear s-blocks and the sequence fragments between them. Intermediate blocks consist of the sequence fragments between consecutive global blocks. In addition, NPG-explorer provides:
(1) Multiple alignments of input chromosomes represented as sequences of block identifiers. These alignments make it possible to detect chromosomal rearrangements.
(2) A file with consensus sequences of all blocks and a file describing all mutations with respect to the consensuses. Thus, all input genome sequences can be completely reconstructed from these two files.
(3) Phylogenetic trees of blocks and of whole genomes. Core blocks are those that contain exactly one fragment of each genome. These trees are computed on the basis of diagnostic positions in block alignments.
(4) All gene annotations, mapped onto blocks. These data are useful for detecting and correcting mis-annotations, gene corruption, etc.
Using NPG-explorer we constructed nucleotide pangenomes of five sets of genomes: 17 complete genomes of the Brucella genus (56 Mb in total), 39 partially completed genomes of the Brucella genus (129 Mb in total), 12 genomes of Yersinia pestis (55 Mb), 8 genomes of Rickettsia rickettsii (10 Mb), and 5 genomes of Burkholderia cenocepacia (38.5 Mb). In the Brucella pangenome there are 653 stable blocks covering 91.5% of the total length of all genomes. Identity within the joined alignment of s-blocks is 99.2%, showing high sequence similarity among all genomes. The program detected 33 global blocks. The phylogenetic tree of genomes computed by NPG-explorer from diagnostic positions agrees with published data for 10 Brucella genomes. The program found a large translocation from the first to the second chromosome in Brucella suis ATCC 23445 and a large inversion in chromosome 2 of Brucella abortus, both described earlier. The NPG-visualization tool interactively presents a list of blocks, the alignment with mapped genes, and alignments of block identifiers. NPG-explorer is written in C++ and is licensed under the GNU GPL. A simple script language for invoking program modules is introduced.
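The block categories described in this abstract can be sketched as a small classifier. This is a hypothetical Python illustration, not NPG-explorer code (the tool itself is written in C++). The function name, the (genome, length) representation of fragments, the order of the checks and the reuse of 100 bp as the length threshold are all assumptions.

```python
def classify_block(fragments, n_genomes, min_length=100):
    """Classify a block from its fragments.

    fragments  : list of (genome_id, fragment_length) tuples
    n_genomes  : total number of input genomes
    min_length : length threshold separating minor blocks (assumed value)
    Returns one of "s", "h", "r", "u", "m".
    """
    if not fragments:
        raise ValueError("a block must contain at least one fragment")
    # Minor block: every fragment is shorter than the threshold.
    if all(length < min_length for _, length in fragments):
        return "m"
    # Unique block: a single (long enough) fragment.
    if len(fragments) == 1:
        return "u"
    genomes = [genome for genome, _ in fragments]
    # Repeat-containing block: some genome contributes more than one fragment.
    if len(set(genomes)) < len(genomes):
        return "r"
    # Stable block: exactly one fragment from every genome.
    if len(set(genomes)) == n_genomes:
        return "s"
    # Hemi-stable block: one fragment from each genome of a proper subset.
    return "h"


if __name__ == "__main__":
    print(classify_block([("A", 500), ("B", 480), ("C", 510)], n_genomes=3))
    print(classify_block([("A", 500), ("A", 490), ("B", 480)], n_genomes=3))
```

Checking the minor-block condition first is a design choice of this sketch; the abstract does not say how a block is treated when it is both short and repeated.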

15.30-15.50 O2PLS as an integrative tool in systems oncology
E.Nevedomskaya and H.Keun
Altered metabolism is a universal characteristic of cancer that is implicated in such clinically relevant phenotypes as metastasis and chemotherapy resistance. Regulation of metabolic reprogramming in the heterogeneous genomic context of cancer is poorly understood. Systematic integration of omics data can unravel the interconnectivity of multiple components of this regulation. We approached such integration through joint analysis of metabolic, gene expression and microRNA data. For this we employed a statistical integration method, O2PLS, to combine data from the well-characterized NCI-60 cancer cell line panel. O2PLS is a generalization of the OPLS approach, which combines orthogonal signal correction (OSC) and Partial Least Squares (PLS) analyses. OPLS separates the variation in the data matrix X into the following parts: variation correlated to the response Y, variation systematically unrelated (orthogonal) to Y, and the residual variance. Such a separation allows the sources of variation to be examined. With the bidirectional O2PLS method we were able to focus on the correlations of interest between sets of multidimensional data and achieve improved interpretability of the results. With this work we demonstrate that O2PLS is a versatile tool for data integration through joint analysis of metabolomic, transcriptomic and microRNA data. We combined knowledge- and literature-based selection of molecules of interest (based on GWAS, metabolic reconstruction and target prediction) with rigorous cross-validation to identify correlations of interest between metabolites and microRNAs, as well as between metabolites and mRNAs. We identified microRNA modules associated with catabolic and anabolic processes, and defined NT5E as a novel regulator of cancer metabolism. We confirmed the observed correlations using other datasets and furthermore demonstrated the involvement of NT5E in intrinsic and acquired resistance to chemotherapy in ovarian cancer and various cancer subtypes.
With this we present an integrative biology approach to the study of cancer cell molecular profiles (‘systems oncology’) that facilitates the discovery of novel players in cancer metabolism, progression and therapy resistance.
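The decomposition of X described in this abstract is usually written as follows. The notation is the one common in the chemometrics literature and is an assumption here; the abstract itself does not give formulas.

```latex
% OPLS: X is split into a Y-predictive part, a Y-orthogonal part and a residual
X = T_p P_p^{\top} + T_o P_o^{\top} + E

% O2PLS: the same idea applied symmetrically to both blocks
X = T W^{\top} + T_{Y\perp} P_{Y\perp}^{\top} + E
Y = U C^{\top} + U_{X\perp} P_{X\perp}^{\top} + F
```

Here T and U denote the joint score matrices of X and Y, W and C the corresponding loadings, the terms with the subscripts Y-perp and X-perp the structured variation in one block that is orthogonal to the other, and E and F the residuals. The correlations of interest between the two data sets are then read from the joint parts only.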

Session: Medical bioinformatics (MSU, Biological Faculty, auditorium M1)

15.20-15.50 Building a Sustainable Bioinformatics Program Through Integrated Support
Michael Tartakovsky
The rapid growth of advanced computational research methods being applied across the sciences increasingly demands support beyond the capabilities of the average information technology (IT) department. Nowhere is this more evident than in the field of bioinformatics. The necessary cyberinfrastructure - both hardware and people - must be highly specialized, but also diverse, to meet the needs of a broad range of users and applications. The Office of Cyber Infrastructure and Computational Biology (OCICB) at the National Institute of Allergy and Infectious Diseases (NIAID), part of the U.S. National Institutes of Health (NIH), coordinates IT resources and training for a staff of over 4000 people, including over 2300 research scientists and scientific support staff located in the US and abroad. The mission of OCICB is to strategically enhance the Institute’s capabilities in clinical informatics and bioinformatics, and to ensure that NIAID researchers can access and fully utilize the most advanced bioinformatics tools available. To accomplish this, OCICB brings together multidisciplinary teams of engineers, developers, analysts, and specialists to provide a broad suite of scientific services and resources tailored to the NIAID research community. Highly trained, doctoral-level research scientists are embedded throughout OCICB, specializing in structural biology, biostatistics, phylogenetics, systems biology, and the many ‘omics fields. By including these subject matter experts in day-to-day operations, OCICB ensures that the needs of the primary end users - NIAID research scientists - will be met. This collaborative approach has been critical to the success of OCICB. For example, it was the computational biologists at OCICB who advocated for the creation of a high-performance computing cluster when they recognized the potential impact of next generation sequencing when it first became commercially available in 2005. The NIAID cluster provides robust, reliable, cost-effective, and scalable infrastructure. As additional compute power becomes a necessity, just-in-time modular upgrades provide continuous improvement and up-to-date systems, as opposed to filling rack space with outdated, unused servers. Researchers were polled at the Institute level to determine unmet needs in using the data center. Based on the feedback provided, training series were developed to address the gaps. Additionally, this feedback has helped OCICB focus on up-and-coming research areas. By reaching out to stakeholders in the early stages and creating pilot programs, OCICB is able to get ahead of the curve and provide transformational tools, allowing organic growth to take place.

15.50-16.20 Network of the Country Tuberculosis Portals
Alexander Rosenthal
Tuberculosis (TB) is a major global public health problem. The recent escalation in the occurrence of the disease has been complicated by the appearance and development of multidrug-resistant tuberculosis (MDR TB) or extensively drug-resistant tuberculosis (XDR TB), as well as HIV/TB coinfection. The need for fast, precise diagnostics of resistant TB and for new, efficient anti-TB drugs calls for an integrated approach and multicenter, multicountry collaborations. The ability of the worldwide community of TB researchers to understand the nature of the disease will be greatly improved by a common database containing anonymized medical images, treatment information, lab work, clinical data, and bacterial genomes. The value of such a database would be further increased if it contained unique patient cohorts, molecular information on the coexistence of multiple Mycobacterium tuberculosis strains, and tools for genomic and bioimaging analysis. The aim of the Network of the Country Tuberculosis Portals initiative is to maintain a network of open-access tuberculosis centers that use common database architectures, user interfaces, programmatic solutions, and medical and scientific nomenclature. This unified approach can facilitate adherence to treatment protocols, serve as a consistent repository of records and provide a rich source for tuberculosis-related data mining and epidemiological studies.

Session: Cancer-1 (M1)

16.20-17.00 Deciphering gene interaction to explain tumor progression
Emmanuel Barillot


17.00-17.30 Coffee break

Session: Cancer-2 (M1)

17.30-17.50 A model for scoring damaging mutations in the non-coding tumoral genome based on germline and tumor data
Jia Li
Cancer driver mutations are somatic events that promote tumor growth or metastasis. Previous computational studies have largely focused on driver mutations located in protein-coding exons that change amino acid residues with damaging effects. However, non-coding RNA (ncRNA) genes and non-coding parts of coding genes (introns, UTRs) are now emerging as significant players in the regulation of gene expression and potentially in tumor progression. There is an urgent need for methods that can evaluate the effect of somatic mutations in such non-coding regions and prioritize mutations for further scrutiny. Here we develop two random forest models for predicting germline and somatic mutation constraints in any non-coding region. These models combine functional features from ENCODE and other genome surveys, using as response variables the mutational constraints provided by the 1000 Genomes Project (germline model) and by collections of tumor whole genome sequences (somatic model). We show that each model reflects a different set of constraints acting on the normal and tumor genomes, and we identify the specific features (such as conserved elements and histone marks) that contribute most to these constraints. Furthermore, high-scoring regions defined by each model are enriched in known disease-related mutations, indicating that the resulting scores can be used as a proxy for damaging non-coding mutations. We combine both models to predict regions in ncRNAs and in introns/UTRs of protein-coding genes where mutations are most likely to be damaging. This system paves the way for the detection of non-coding driver genes and regulatory elements in cancer.

17.50-18.10 Selection pressure on breast cancer somatic mutations revealed by bioinformatics sequence analysis
Ivan Kulakovskiy
Among the different variations of the human genome, single nucleotide variants (SNVs) are the most common. SNVs located in coding regions may directly affect the function of a particular protein through alterations of the protein sequence and, consequently, the structure. Nucleotide substitutions occurring in regulatory regions do not alter the protein but may change the expression of the corresponding genes. In particular, SNVs in promoters and enhancers may alter transcription factor (TF) DNA binding and thus affect the efficiency of transcription initiation. With the binding patterns of hundreds of human TFs known, it is finally possible to predict regulatory effects of mutations purely by sequence analysis in silico. In the past, SNVs were primarily studied in a population context as single-nucleotide polymorphisms (SNPs). High-throughput sequencing gave rise to fundamentally new data on somatic mutations, in particular those emerging in cancer. Here we discuss a new version of PERFECTOS-APE (Vorontsov et al., 2015), the software to PrEdict Regulatory Functional Effect of SNVs by Approximate P-value Estimation. We applied PERFECTOS-APE to analyze somatic mutations detected in 21 breast cancer samples by Nik-Zainal et al., 2012. Using the HOCOMOCO (Kulakovskiy et al., 2013) collection of transcription factor binding patterns, we identified TFs whose binding sites were affected by somatic substitutions in breast cancer cells. Binding sites of several transcription factors were damaged by mutations significantly more often than expected by chance. At the same time, for dozens of transcription factors the binding sites were protected from mutations, i.e. were affected by them significantly less often than expected by chance. We believe this is evidence for positive and negative selection of cancer somatic mutations in regulatory regions.

18.10-18.30 Searching For Essential Cancer Proteins: Analysis Of Hypomutated Genes In Skin Melanoma
Mikhail Pyatnitskiy
We propose an approach to the detection of essential genes/proteins required for cancer cell survival. A gene is considered essential if a mutation with high impact on the function of the encoded protein causes the death of the cancer cell. We draw an analogy between essential cancer proteins and Abraham Wald’s well-known work on estimating the critical areas of a plane using data on the survivability of aircraft encountering enemy fire. Wald reasoned that the parts hit least on the returned planes are critical and should be protected more. Similarly, we propose that genes essential for the tumor cell should carry fewer high-impact mutations in cancer compared to polymorphisms found in normal cells. We used data on mutations from the Cancer Genome Atlas and polymorphisms found in healthy humans (from the 1000 Genomes Project) to predict 91 protein-coding genes essential for melanoma.
These genes were selected according to several criteria, including negative selection, expression in melanocytes and a decrease in the proportion of high-impact mutations in cancer compared with normal cells. Gene ontology analysis revealed enrichment of essential proteins related to the membrane and cell periphery. We speculate that this could be a sign of immune system-driven negative selection of cancer neo-antigens. Another finding is the overrepresentation of semaphorin receptors, which can mediate distinctive signaling cascades and are involved in various aspects of tumor development. The cytokine receptors CCR5 and CXCR1 were also identified as cancer essential proteins, which is confirmed by other studies. Overall, our goal was to illustrate the idea of detecting proteins whose sequence integrity and functioning are important for cancer cell survival. Hopefully, this prediction of essential cancer proteins may point to new targets for anti-tumor therapies.

18.30-19.10 Revealing mechanisms of cancer progression by pan-cancer deconvolution of tumoral transcriptomes
Andrei Zinovyev
Large-scale projects are generating massive amounts of molecular profiles for tumoural samples. It is a major challenge to establish a “catalogue” of signals that can shape tumoural transcriptomes in a cancer type-specific manner and of signals common to many cancer types, as well as to distinguish them from commonly observed technological and other biases and from signals coming from the tumoural microenvironment. In other words, we need to decipher the tumoural transcriptome in order to focus on specific mechanisms that can be targeted in therapy. One of the most suitable methodologies for this decoding comes from the signal processing field and is connected to linear matrix factorization, such as the method of Independent Component Analysis (ICA). We analysed data on nine different cancers from 21 patient cohorts and 6671 tumours and identified their commonalities, as well as the cancer type-specific characteristics. By careful interpretation of the ICA results, we managed to distinguish the signals coming from tumoural cells from those coming from the tumour microenvironment, and clearly identified signals associated with technology and biases related to different treatments of tumour tissue. New insights were obtained in bladder cancer. The projections of the tumors onto the different components make it possible to characterize and compare any predefined subgroups of tumors. We could thus distinguish, for the first time, FGFR3-mutated tumors and RAS-mutated tumors. The analysis of a bladder cancer-specific component led to the identification of PPARG as an oncogene controlling both differentiation and proliferation in bladder tumors, and this prediction was verified experimentally. We showed that the information captured in independent components is also reflected in anatomopathological staining microscopy images.


Saturday, 18 July 2015

09.00-10.00 Morning coffee

Session: Methods and algorithms (M1)

10.00-10.20 Analysis of variation in 3000 rice genomes project
Tatiana Tatarinova
Rice is the staple food for half the world population, particularly in poor developing countries in Asia. Remarkably, rice has significant within-species genetic diversity. Traditional rice varieties encompass a huge range of potentially valuable genes. These can be used to develop superior varieties for farmers to take part in the uphill battle of feeding an ever-increasing world population (estimated to reach 9.6 billion by 2050). The genes linked to valuable traits can help breeders create new rice varieties that have improved yield potential, higher nutritional quality, better ability to grow in problem soils, and improved tolerance of pests, diseases, and the stresses, such as flood and drought, that will be inevitable with future climate change. Much of this diversity is conserved within the International Rice Genebank Collection (IRGC) at the International Rice Research Institute (IRRI). In the framework of the 3,000 rice genomes project, IRRI and collaborators have completed the sequencing of 3,000 rice genomes of varieties and lines representing 89 countries. The 3,000 Rice Genomes Project is funded by the Bill and Melinda Gates Foundation and the Chinese Ministry of Science and Technology. The project’s entire 13.4-terabyte dataset was released in 2014 in an open-access database, GigaDB, which instantly quadrupled the previous amount of publicly available rice sequence data. The dataset contains genome sequences (averaging 14X depth of coverage) derived from 3,000 accessions of rice with global representation of genetic and functional diversity. The availability of 3K rice genomes provided a unique opportunity to explore the variability of different functional regions of the genome.
We focused our analysis on those regions that are most likely enriched in transcription factor binding sites, such as promoters and 5’- and 3’-UTRs. We examined the distribution of SNPs, known transcription factor binding sites, and DNA methylation in those regions. We observe increased sequence conservation in these regions and hypothesize that unusually conserved motifs in these regions have biological significance. We found the most conserved motifs and performed an enrichment analysis for these motifs in various biological processes. We applied our reAdmix tool to the analysis of 3000 rice genomes, using currently sequenced varieties of wild rice as a reference. We present a novel plantMix pipeline for the analysis of domesticated species using their wild relatives.

10.20-10.40 Distance-based profiling aids in evaluation of ageing-related phenomena
Ancha Baranova
In a typical biological assay performed in a high-throughput mode, either expression levels of individual genes or other quantifiable variables are assessed in parallel. These variables can be represented as dimensions of the information space that we study. In a high-dimensional space, the data become sparse. In other words, when a data set contains a large number of attributes, we are faced with a choice of either completely suppressing most of the data or losing the desired level of statistical significance for any possible finding. The problem outlined above is known as the "curse of dimensionality". There is a need to develop either integrative approaches capable of combining data from multiple high-throughput experiments to increase sample size, or statistically sound and robust techniques to reduce the data to the most informative features. In our previous studies, we developed a novel approach based on "distances" in the multidimensional space of gene expression values. As a proof of principle, we showed that this approach produces surprisingly good results in separating normal and affected samples, both for human malignancies and for chronic progressive conditions such as psoriasis. In the current work, we applied distance-based metrics to the problem of quantifying ageing and age-related phenotypes. Aging has been an intriguing field of study for biologists for decades. As cells experience stress and damage from internal and external factors, they normally progress toward cellular senescence, at which point they cease to replicate but acquire pro-inflammatory features. This process comes with significant changes in the gene expression profile (GEP) of the cell. Here we performed a systematic classification of gene expression profiles from 12 microarray datasets. Samples from multiple disorders and healthy controls, taken from various tissues, were included. The array data were grouped and analyzed by the age of the donor. The Pearson correlation coefficient and the Kolmogorov-Smirnov test were used to compare GEPs between different groups. In this way, we built a holistic marker that takes into account the quantifiable expression levels of all genes assayed, rather than extracting top-ranked features as markers.
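The comparison of expression profiles described above can be sketched in Python as follows. This is only an illustration and not the authors' implementation: the function names, the use of 1 - r as a distance and the toy vectors are assumptions.

```python
from bisect import bisect_right
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant profile has no defined correlation")
    return sxy / sqrt(sxx * syy)


def profile_distance(x, y):
    """Correlation distance between two profiles (0 = identical shape)."""
    return 1.0 - pearson(x, y)


def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: the largest absolute
    difference between the empirical distribution functions of x and y."""
    xs, ys = sorted(x), sorted(y)
    d = 0.0
    for v in xs + ys:
        fx = bisect_right(xs, v) / len(xs)
        fy = bisect_right(ys, v) / len(ys)
        d = max(d, abs(fx - fy))
    return d


if __name__ == "__main__":
    young = [1.0, 2.0, 3.0]      # hypothetical expression values
    senescent = [3.0, 2.0, 1.0]  # hypothetical expression values
    print(profile_distance(young, senescent), ks_statistic(young, senescent))
```

Each sample is treated as one point in the space of all assayed genes, so the distance uses the whole profile rather than a short list of top-ranked genes.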
In our analysis, the cumulative gene expression pattern of an individual patient is considered as a whole and is represented as a data point in a multidimensional space formed by all gene expression features assayed in the given system. The degree of separation between samples indicates the drift of the test samples away from the stable cellular state in the process of cellular senescence. The classifiers showed clear separation between different age groups, as verified by k-fold cross-validation. The holistic marker was further compared with specific markers extracted based on the ranking of statistical significance. The performance of the classifiers was evaluated by the receiver operating characteristic (ROC) curve. As an example of the analysis, here we show linear distance plots for dataset GSE13330. In the respective experiment, human foreskin BJ fibroblasts were mock- or bleomycin sulfate-treated (100 µg/ml, Sigma, St. Louis, MO) for 24 hrs, while replicatively senescent fibroblasts were obtained by continuous passage. After 72 hr of serum starvation, RNA was collected and biotinylated cRNA was hybridized to Affymetrix Human Genome U133 Plus 2.0 GeneChips (Affymetrix, Santa Clara, CA) in the Washington University Microarray Facility. There were 4, 6, and 6 samples for Stress-Induced Prematurely Senescent (SIPS), Replicative Senescent (RS), and Young, respectively. Our distance-based marker demonstrates that the predictive power of global signatures is as good as that of specific markers, yet with better robustness and reproducibility. The classifiers may be used to identify the aging status of tissues and to verify whether disease-based aging models resemble the normal aging process.

10.40-11.00 Scaffold assembly based on genome rearrangement analysis
Max Alekseyev
Advances in DNA sequencing technology over the past decade have increased the volume of raw sequenced genomic data available for further assembly and analysis. While there exist many algorithms for assembly of sequenced genomic material, they often experience difficulties in constructing complete genomic sequences. Instead, they produce long genomic subsequences (scaffolds), which then become subject to scaffold assembly aimed at reconstructing their order along genome chromosomes. The balance between reliability and cost for scaffold assembly is not there just yet, which inspires one to seek new approaches to address this problem. We present a new method for scaffold assembly based on the analysis of gene orders and genome rearrangements in multiple related genomes (some or even all of which may be fragmented). Evaluation of the proposed method on artificially fragmented mammalian genomes demonstrates its high reliability. We also apply our method to incomplete Anophelinae genomes, which exhibit high fragmentation, and further validate the assembly results with reference-based scaffolding. While the two methods demonstrate consistent results, the proposed method is able to identify more assembly points than the reference-based scaffolding.

11.00-11.20 A method for model comparison based on the parameter sensitivity measures
Ekaterina Myasnikova
In modeling of complex biological systems one often faces a trade-off between over-simplification of the mechanisms underlying the modelled biological processes and over-parameterization of the model. In the former case the model may turn out to be unrealistic, while in the latter case fitting to experimental data may lead to non-identifiable parameter estimates. Methods for the analysis of parameter sensitivity and identifiability may give a clue to the correct choice of the level of model detail. In our previous work (Myasnikova & Kozlov, 2014) we introduced quantitative measures of the model prediction power based on relative sensitivity to parameters.
We propose a modified version of the method, based on similar principles and designed to compare models of different complexity describing the same biological system. The idea of the method is to make sure that the added model complexity is practically justified by checking the sensitivity of the model to the additionally introduced subset of parameters and their identifiability. The performance of the method is demonstrated on the model of transcriptional control of the Drosophila melanogaster even-skipped gene published in (Janssens et al., 2006).

11.20-11.40 Method to predict the percentage of cell types in human blood
Anna Igolkina

11.40-12.10 Coffee break

Session: Systems biology: Mammals and beyond (M1)

12.10-12.50 Mammalian Systems biology
Alistar Forrest
We are complex multicellular organisms composed of hundreds of different cell types.


The specialization of cell types and division of labour allow us to have coordinated complex functions such as responding to pathogens, movement and maintaining homeostasis. In the FANTOM5 project we have been interested in identifying the complete set of transcribed objects in the human genome and then predicting how they work together in the context of transcriptional regulatory networks (TRN). Each primary cell type runs a different version of the TRN based on the set of gene products it expresses. Not only this, but the FANTOM5 CAGE data reveal a wealth of cell-type-specific enhancers that are expressed in a very specific manner. Understanding the cell-type specificity of these elements and promoters is key to building cell-type-specific TRNs. Lastly, we go beyond the TRNs and examine cell-cell signaling within a multicellular organism. By identifying the sets of protein ligands and receptors expressed in any given human cell type we have made the first draft cell-cell communication network map.

12.50-13.30 Genomics of Lifespan Control
Vadim Gladyshev
Understanding the mechanisms that control lifespan is among the most challenging biological problems. Many complex human diseases are associated with aging, which is both the most significant risk factor and the process that drives the development of these diseases. It is clear that the aging process and the maximum lifespan of species can be regulated and adjusted. For instance, mammals are characterized by a >100-fold difference in lifespan, which can both increase and decrease during evolution. We employ this diversity in mammalian lifespan and the associated life-history traits to shed light on the mechanisms that regulate species lifespan. For this, we utilize methods of comparative genomics to examine the genomes of exceptionally long-lived species and carry out analysis of lifespan across a panel of mammals.
We sequenced the genomes of several mammals with exceptional lifespan, including the naked mole rat, the Damaraland mole rat, and the Brandt’s bat, and identified genes that may contribute to their longevity. We also apply transcriptomics and metabolomics approaches to analyze the molecular basis for adaptations associated with longevity across mammals. These studies point to both lineage-specific and common processes involving various pathways. It is our hope that a better understanding of molecular mechanisms of mammalian lifespan control will lead to a better understanding of human diseases of aging.

13.30-15.00 Lunch break

Session: Genomics of regulation (M1)

15.00-15.20 Computer analysis of genome co-localization of transcription factor binding sites based on ChIP-seq data
Arthur Dergilev

The scientific problem addressed is the co-localization of transcription factor binding sites (TFBS) in mammalian genomes, studied using ChIP-seq data. ChIP-seq technology, which combines chromatin immunoprecipitation (ChIP) with highly efficient DNA sequencing, makes it possible to determine transcription factor binding sites at genome scale. The tasks that arise in analyzing genome-wide ChIP-seq data are to identify the coordinates of TFBS and to compare their location with genomic annotation (relative location and distance to gene transcription start sites, promoter regions etc.). In addition to determining the location of binding sites for a single transcription factor, there is the problem of identifying clusters of sites of different transcription factors that coincide or lie at short distances (100-200 nt) from each other on chromosomes, which suggests similar function and regulatory mechanisms. This technically requires programs that process huge amounts of text data (bed, wig files), identify intersections of genomic annotations (coordinates), and are adapted to the respective model genomes. We developed a set of scripts for TFBS location analysis. To study site clusters, we used ChIP-seq data on the binding sites of 15 different transcription factors in the mouse genome. A computer program in C++ was developed to calculate the relative positions of TFBS and their clusters. Methods for detecting complex signals and patterns with the "Discovery" algorithm (program GeneDiscovery), previously developed in the framework of data analysis theory (Data Mining, Knowledge Discovery) for signals in DNA segments, were used for the analysis of clusters of binding sites. We confirmed the separation of TFBS clusters in the mouse genome (embryonic stem cells) into classes represented by Oct4, Nanog and Sox2 on one side, and c-Myc on the other. This analysis was extended to the exact location of nucleotide motifs in ChIP-seq peaks relative to each other and to iterative correction of such motifs.

15.20-15.40 Search for simple and composite auxin responsive elements in Arabidopsis thaliana genome
Victoria Mironova

The hormone auxin is a major regulator of plant growth and development. The influence of auxin on gene transcription is primarily mediated through Auxin Response Factors (ARFs).
ARFs bind in target promoters to specific sites called AuxREs (Auxin Response Elements) with the TGTCNN (most frequently TGTCTC) consensus core sequence. While ChIP-seq data for most ARFs are still unavailable, prediction of potential AuxREs is restricted to consensus models that detect too many false positive sites. About half of the Arabidopsis thaliana genes have at least one TGTCTC in any orientation within the first 1000 nt of their promoter regions. While a single TGTCTC hexamer does not confer auxin inducibility (Ulmasov et al. 1997), this is provided by multimerized (Guilfoyle et al. 1998) or composite AuxREs (Ulmasov et al. 1995). In the composite AuxREs, TGTCNN adjoins or overlaps with coupling elements (Ulmasov et al. 1995; Guilfoyle et al. 1998). We performed a bioinformatical analysis of the distribution of simple and composite AuxREs in the Arabidopsis thaliana genome. AuxREs were recognized by three different models: (1) the simple TGTCNN consensus, (2) TGTCNN pairs with a certain distance between them and (3) a combination of the oPWM and SiteGA tools (AuxREP&S) (Mironova et al., 2014). To test which model better predicts AuxREs associated with auxin response, we performed a meta-analysis of 23 publicly available microarray experiments with auxin treatments (Mironova et al., 2014). First, we created a list of auxin-regulated genes which significantly changed their expression (by more than 1.5-fold or to less than 0.67-fold, p<0.05) in at least four microarrays. The threshold for the number of microarrays was set by the binomial trial estimate. The resulting list contained 1301 up-regulated and 1262 down-regulated genes. Second, the fractions of the significantly up- or down-regulated genes with an AuxRE variant in their promoter were compared with that for all the genes tested in the experiment. The statistical significance of the difference between the fractions was estimated by the t-test for arcsine square-root transformed proportions. This analysis showed that all three models predicted AuxREs that were enriched in auxin-responsive genes, but the gene sets differed. For example, AuxREP&S sites, which were highly associated with auxin response, were predicted in about 10% of auxin-responsive genes and were associated only with up-regulation, whereas several variants of the TGTCNN consensus were significantly associated with auxin down-regulation. Additionally, we performed a context analysis of the flanks in experimentally proven AuxREs and found three distinct types of potential coupling motifs (Y-patch, AuxRE-like, and ABRE-like) (Mironova et al., 2014). A similar bioinformatical analysis of associations in a number of microarray datasets assured us that composite elements with a specific orientation of the AuxRE and the coupling motifs and a certain range of spacer length between them were associated with auxin responsiveness. The methodology proposed in this work is suggested for cis-regulatory element annotation in cases where the cis-element is associated with a response to physiological and ecological factors.

15.40-16.00 Antisense interactions of long noncoding RNAs in human cells
Ivan Antonov

The hybridization of two RNA molecules is called antisense interaction. These interactions are usually based on long (>100 bp) highly complementary duplexes that correspond to transcripts produced from overlapping genes (cis-interactions) or are based on Alu repeats. It was also hypothesized that RNA-RNA hybridization can be based on several short antisense sites (trans-interaction). Recently it has been demonstrated by RNA pull-down assay that a cytoplasmic long noncoding RNA (lncRNA) is capable of binding hundreds of mRNAs in human cell lines. The identified transcripts did not have long antisense duplexes with the lncRNA but rather several short (< 30 bp each) duplexes, thus suggesting the possibility of trans-antisense interactions.
To check this hypothesis we used thermodynamics-based tools to compute the energy of the putative lncRNA-mRNA interactions. We have shown that the energies for all the pull-down genes are an order of magnitude weaker than for functional cases of cis and Alu-based duplexes of similar total length. Moreover, the energies for the majority of these duplexes are comparable to the values observed in random simulations, suggesting that such pull-down transcripts are indirectly associated with the lncRNA and should not be considered as RNA-RNA interactions. Nevertheless, in the two analyzed pull-down experiments we have found 12 and 17 cases, respectively, of putative trans-antisense interactions: lncRNA-mRNA pairs with energies significantly stronger than in random simulations. We thus continued the search for functional trans-antisense duplexes, focusing on regulatory RNA-RNA interactions. Ab initio analysis performed for 71 lncRNAs expressed in the HEK293 cell line identified 12 potential cases of regulatory trans-antisense interactions that await experimental validation.

Session: Viruses – 1 (M1)

16.00-16.40 Challenges in virus genomics
Manja Marz

Computer-assisted studies of the structure, function, and evolution of viruses remain a neglected area of research. The attention of bioinformaticians to this interesting and challenging field is far from commensurate with its medical and biotechnological importance. The purpose of this talk is to increase awareness among bioinformatics researchers about the pressing needs and unsolved problems of computational virology. I focus primarily on RNA viruses, which pose problems for many standard bioinformatics analyses due to their compact genome organization, fast mutation rate, and low evolutionary conservation.

16.40-17.10 Coffee break

Session: Viruses – 2 (M1)

17.10-17.30 Sequence and structural analysis of related proteins in distant viral species
Olga Kalinina

Unlike cellular organisms, viruses do not constitute a monophyletic group in which the phylogenetic history can be traced back to a common ancestor. The origin and relatedness of different virus families is currently a subject of active discussion. It is unclear whether viruses have evolved by reduction of many essential genes from cellular species, descend from mobile elements of other organisms, or precede cellular life and are ancient self-replicating units. Possibly, each of these hypotheses is true for a subset of viral families. The recent discovery of giant viruses revived this discussion with suggestions that a certain clade of them may represent a fourth domain of life. Analysis of evolutionary relationships between distant viral families presents particular difficulties, since the sequence similarity of viral proteins is rarely detectable outside the immediate viral family. We have performed an all-to-all sequence and structural comparison of viral proteins, and focused on cases where similarity is detected between proteins from viruses that use different types of nucleic acid to encode their genome. We can split the corresponding protein families into families with balanced and unbalanced distribution of viral genome types.
For the former category, we recapitulate viral hallmark genes (i.e. genes characteristic only of viruses and present in diverse species) and other known widespread viral proteins, providing the first comprehensive analysis of these cases. The protein families of the latter category can often be characterized by horizontal gene transfer events. We could not detect any events of horizontal gene transfer between different viruses; however, we have identified several events of horizontal gene transfer from the host to an infecting virus. We have also identified proteins from several protein families that appear in very distant viruses, whose function is likely conserved but whose origin cannot be traced back to a single viral class, which hints at a much more complex network of kinship in the virus world than previously recognized.

Session: Proteins (M1)

17.30-17.50 Bioinformatic analysis of diverse protein superfamilies to design improved enzymes
Dmitry Suplatov


17.50-18.10 Detecting the features of functional specificity in protein families based on the local sequence similarity
Boris Sobolev

The functional specificity of different subgroups in a protein family is determined by particular amino acid residues. Commonly, such residues are identified by methods using Multiple Sequence Alignment (MSA). We propose the SPrOS method for estimating the specificity of sequence positions based on independent comparisons of sequence fragment pairs. It is better suited for locating significant positions that are shifted in the MSA and for analyzing overlapping classes. The method was tested on data representing various types of sequence-function relations. Using artificially generated sequences with introduced position-specific exchanges, highly accurate recognition of the group-specific positions was shown. Application of SPrOS to the LacI/GalR protein family identified positions whose functional significance had been experimentally determined earlier. In the more complicated case of protein kinases classified by inhibitor specificity, SPrOS was able to predict group-specific positions with statistically significant estimates. When our results were mapped onto 3D structures, positions predicted with high significance were found in ligand-binding areas. In many cases evolutionarily coupled mutations significantly complicate recognition of the positions actually determining the group specificity. In the case of protein kinases we showed that excluding the close homologues of the test sequence allowed this problem to be overcome.

18.10-18.30 Determination of the size of folding nuclei of protofibrils from the concentration dependence of the rate and lag-time of their formation
Oxana Galzitskaya

In this work we suggest a kinetic model of the formation of amyloid protofibrils which allows calculation of the size of the nuclei using only kinetic data.
In addition to the stage of primary nucleation, which is believed to be present in many protein aggregation processes, the model includes both linear growth of protofibrils (proceeding only through attachment of monomers to the ends) and exponential growth of protofibrils through growth from the surface, branching, and fragmentation with secondary nuclei. Theoretically, only exponential growth is compatible with the existence of a pronounced lag-period (which can take much more time than the growth of the aggregates themselves). According to our theory, one can distinguish between some mechanisms of growth on the basis of kinetic data.

18.30-19.10 Assessing protein synthesis with ribosome profiling
Pavel Baranov

We used ribosome profiling (ribo-seq) to assess the gene expression response of mammalian cells to various stresses, such as increased eIF2 phosphorylation (the key step in the Integrated Stress Response), and Oxygen and Glucose Deprivation (OGD). It enabled us to delineate the rapid translational response affecting thousands of genes. The response frequently involves translation of short regulatory ORFs usually located in the 5’ leaders of mRNAs. We also observed translation of unannotated long ORFs that likely leads to the synthesis of novel protein products specific to stress conditions. To assist the research community in using ribo-seq data we are developing the RiboSeq.Org suite of tools (http://riboseq.org), which currently consists of the GWIPS-viz browser for the visualization of genomic alignments of ribosome footprints and RiboGalaxy, a Galaxy instance specifically tailored for the analysis of ribo-seq data. In addition, we developed a simple computational approach for the characterization of ribo-seq datasets. This technique is resistant to irregular technical noise and aberrant footprint densities caused by ribosome pauses. Application of this approach to several ribo-seq datasets revealed the strong impact of sequencing biases and translation inhibitors on the distribution of aligned ribosome footprints as well as substantial non-biological variability between datasets obtained from different laboratories.


Sunday, 19 July 2015

09.00-10.00 Morning coffee

Session: Genomics of anhydrobiosis (M1)

10.00-10.40 Anhydrobiosis in the sleeping chironomids: where are we now
Takahiro Kikawada

10.40-11.00 Expression regulation of desiccation-resistance genes in Polypedilum vanderplanki
Pavel Mazin

In my talk, I will show how sophisticated analysis of RNA-Seq data can help to understand the molecular mechanisms of dehydration tolerance, using the example of Polypedilum vanderplanki, an insect that can survive almost complete water loss and revive within just three hours of re-hydration. Our results reveal that heat shock transcription factor (HSTF) is responsible for desiccation-induced transcription activation of many genes in P. vanderplanki, but not in the congeneric desiccation-sensitive P. nubifer. This is likely achieved by binding of HSTF to a doubled binding site in the promoter region of its own gene, which results in self-activation of HSTF in P. vanderplanki, but not in P. nubifer, where the HSTF-binding motif is absent from the promoter region of the HSTF gene. While HSTF seems to be responsible for the activation of hundreds of genes under desiccation, these are just a tiny fraction of the genes that alter their expression in either direction under desiccation and/or re-hydration. For example, genes that encode heme-binding proteins, globins and cytochromes, are significantly enriched among both desiccation-suppressed and desiccation-activated genes. Some of these genes are expressed only in animals that survived desiccation and are almost silent before dehydration. Transforming growth factor beta, genes that encode proteins involved in DNA repair, polyketide (chemicals involved in pheromone communications and defense) synthesis and many others are significantly enriched among genes activated soon after the start of re-hydration. Analysis of gene expression can provide some clues about desiccation-related regulation of these genes. For example, all four histone deacetylases encoded by the P. vanderplanki genome are differentially expressed under desiccation, as is a single histone acetylase. Genes that encode nuclear hormone receptors are significantly enriched among genes activated after three hours of re-hydration, which points to a possible role of steroid hormones in re-hydration-induced gene expression changes.

11.00-11.20 Molecular basics of different mechanisms of desiccation tolerance in Chironomidae midges
Olga Kozlova

11.20-11.40 Adapting to extremes: linking metabolome and genome of an anhydrobiotic insects
Elena Shagimardanova


11.40-12.10 Coffee break

Session: Toolkits for anhydrobiosis research (M1)

12.10-12.50 Single cell molecular toolkit for inducible resistance to complete desiccation
Oleg Gusev

Larvae of the sleeping chironomid Polypedilum vanderplanki are the most complex organism capable of tolerating complete desiccation. Upon desiccation, the larvae enter a reversible ametabolic state (anhydrobiosis). It was shown that during desiccation, a non-reducing sugar (trehalose) substitutes for water in the cells, leading to "vitrification". This mechanism prevents damage to molecules, cell structures and organelles. It has been demonstrated that anhydrobiosis is a property of individual cells rather than a hormonally controlled process (reviewed in Cornette and Kikawada, 2010). One of the recent achievements is the establishment of a P. vanderplanki embryonic cell line capable of withstanding complete desiccation via inducible anhydrobiosis (Nakahara et al., 2010). Sleeping chironomid genome sequencing revealed several peculiarities in its structure associated with desiccation resistance (Gusev et al., 2014). It is suggested that anhydrobiotic clusters (ARIds) of genes that were not found in the genomes of other insects, including the closely related chironomid Polypedilum nubifer, are responsible for the formation of a "molecular shield" during dehydration. In the current project we aim to dissect the molecular background of inducible desiccation resistance in the cells by combining data from whole-genome cap analysis of gene expression (CAGE), transcriptomics and comparative proteomics (iTraq). The first stage of the analysis revealed that, in contrast to whole larvae, in which the expression of more than 15% of all genes is altered by desiccation, in the cell line less than 1% of all genes are differentially expressed under desiccation during inducible anhydrobiosis. We further found that only selected members of the ARIds “gene islands” are expressed and further up-regulated in response to preconditioning with trehalose and subsequent desiccation in the cell line.
Taken together, the data suggest that the current approach is an effective tool to define the minimum essential gene set needed for induction of anhydrobiosis in a stand-alone chironomid cell line and would further be useful for developing artificial anhydrobiosis methodology for other eukaryotic cell lines. In addition, tissue or organ specialization might be one of the explanations for the paralogization of anhydrobiosis-related genes in the sleeping chironomid.

12.50-13.30 Genetic toolkit for investigation of anhydrobiosis: promoters and RNAi
Richard Cornette

13.30-15.00 Lunch break

Session: Genome structure (M1)


15.00-15.20 Genome mapping revealed scaffold misassemblies and elevated gene shuffling on the X chromosome in malaria mosquitoes
Igor Sharakhov

15.20-15.40 Detection of short size mutations and copy number alterations in ultra-deep targeted sequencing data
Valentina Boeva

The emergence of the amplicon sequencing technique, which followed whole exome sequencing, promises a revolution in cancer diagnostics and treatment. Amplicon sequencing consists of the PCR amplification of a limited number of genomic regions of interest (amplicons) followed by high-throughput sequencing. These genomic regions generally correspond to exons of “actionable” cancer-related genes: ALK, BRAF, MYCN, ERBB2, etc. Due to the exceedingly high read coverage of amplicon sequencing data, there is no methodological issue in the identification of clonal point mutations and small insertions or deletions (indels) in actionable genes targeted by amplicon sequencing. However, how to reliably detect copy number changes and identify subclonal mutations present in a very small proportion of tumor cells from amplicon sequencing data is still open to discussion. Here we provide a solution, ONCOCNV, to the challenging question of extracting copy number alterations (CNAs) from amplicon sequencing data by (i) defining a method to normalize read coverage with a small set of normal control samples and (ii) assigning statistical significance to putative CNAs resulting from the segmentation of normalized profiles. We also propose a method, TargetZoom, to detect subclonal mutations in amplicon sequencing data.
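As a rough illustration of step (i), normalizing amplicon read coverage against normal control samples, the following minimal Python sketch computes per-amplicon log2 coverage ratios. It is not the ONCOCNV algorithm: the function name `normalized_log2_ratios`, the pseudocount, the library-size scaling and the median centering are all assumptions chosen for simplicity, and the read counts are invented toy numbers.

```python
import math
import statistics


def normalized_log2_ratios(tumor_counts, control_counts_list, pseudocount=0.5):
    """Toy per-amplicon log2 ratio of tumor coverage to a control baseline.

    Hypothetical helper, not part of ONCOCNV. Each sample is scaled to
    fractions of its own library size, the baseline is the mean control
    fraction per amplicon, and the ratios are median-centered so that
    copy-neutral amplicons sit near zero.
    """
    def to_fractions(counts):
        total = sum(counts) + pseudocount * len(counts)
        return [(c + pseudocount) / total for c in counts]

    tumor = to_fractions(tumor_counts)
    controls = [to_fractions(c) for c in control_counts_list]
    n_amplicons = len(tumor_counts)
    baseline = [
        sum(ctrl[i] for ctrl in controls) / len(controls)
        for i in range(n_amplicons)
    ]
    raw = [math.log2(t / b) for t, b in zip(tumor, baseline)]
    center = statistics.median(raw)
    return [r - center for r in raw]


# Invented read counts: four amplicons, two normal controls, one tumor
# sample in which the third amplicon has about four times the coverage.
controls = [[100, 100, 100, 100], [200, 200, 200, 200]]
tumor = [100, 100, 400, 100]
print([round(r, 2) for r in normalized_log2_ratios(tumor, controls)])
```

In this toy input the amplified amplicon ends up with a log2 ratio close to 2 while the others stay at zero. Step (ii), segmentation and significance testing, is not covered by the sketch.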

15.40-16.00 Genomic structural instability and homologous recombination deficiency in breast and ovarian cancers
Tatiana Popova

16.00-16.20 Genome Track Analyzer: New tool for genome-wide study of correlations between distributed genome features
Galina Kravatskaya

A broad class of tasks in genetics and epigenetics can be reduced to the study of various features that are distributed over the genome (genome tracks). The rapid and efficient processing of the huge amount of data stored in genome-scale databases cannot be achieved without advanced software based on analytical methods. However, strong inhomogeneity of genome tracks hampers the development of relevant statistics. We developed analytical criteria for the assessment of genome track inhomogeneity and of correlations between two genome tracks. We also developed a software package, Genome Track Analyzer, based on this theory. It contains the following tools applicable to genomic track investigations:
* Correlations between point-wise and stretch-wise genomic tracks
* Correlations between profiles (including expression and DNA-protein binding profiles)
* Correlations between point-wise and stretch-wise genomic tracks and expression profiles
* Statistical Kolmogorov-Smirnov and entropy tests for assessment of distribution of genomic tracks over the chromosomes

The theory and software were tested on simulated data, and were applied to the study of correlations between CpG islands and transcription start sites in the Homo sapiens genome, between profiles of protein binding sites in chromosomes of Drosophila melanogaster, and between DNA double-strand breaks and histone marks in the Homo sapiens genome. Significant correlations between transcription start sites on the forward and the reverse strands were observed in the genomes of Drosophila melanogaster, Caenorhabditis elegans, Mus musculus, Homo sapiens, and Danio rerio. The observed correlations may be related to the regulation of gene expression in eukaryotes. Genome Track Analyzer is freely available at http://ancorr.eimb.ru/

16.20-16.40 Tale on the transposons on chromatin landscape
Vladimir Babenko

We categorized non-overlapping 100kb segments of the human genome by their DNase Hypersensitive Site (DHS) counts based on data in (Sheffield et al., 2011). They fit a Weibull long tail distribution with a peak at around 14 DHSs per bin. The few (around 50) bins with fewer than 14 DHSs were mostly gene deserts, long introns, or some quite distinct gene clusters like the ubiquitin peptidase family. Then we performed linear regression analysis between transposon counts categorized by family and the number of DHSs. We revealed, with high confidence, two major classes of transposon families: those that prefer “silent” chromatin and those tending to reside in “open” chromatin bins. Further on, we discovered that the number of Alu retroposons strongly correlates with the number of genes in the bin.
Based on this observation, we worked out a method based on a non-linear Alu-gene correlation to infer some non-linear evolution events like the emergence of tandem repeated gene clusters. We also crossed the family-categorized transposons with the Txn table (transcription factor binding sites verified by ChIP-seq; genome.ucsc.edu) to elucidate their transposon-specific propagation, similar to (Jjingo et al., 2014). Further on, we assessed the chromosome-wise bias of repeat families and found that most chromosome-specific repeat family expansions (mostly LINEs) are maintained on the X chromosome. Some CTCF-related open chromatin LTR expansions were observed specifically on chromosome 19, in a way similar to B2 SINE in mouse (Lunyak et al., 2007). Overall we report that the properties of transposon distribution and density within a genomic segment can disclose its specific evolutionary history and features.

16.40-17.10 Coffee break

Session: Evolution (M1)

17.10-17.30 Assessing the impact of horizontal gene transfer on the evolution of prokaryotes
Vladimir Makarenkov

Horizontal Gene Transfer (HGT) is one of the major evolutionary processes affecting the evolution of prokaryotic species. Two known types of horizontal gene transfer are complete and partial transfers. Partial HGT can be viewed as a complete HGT followed by intragenic recombination and leading to the creation of a mosaic gene. The identification of the origins and the rates of horizontal gene transfers in the context of complete and partial HGT models, and for different phylogenetic families and ecological niches, is a very relevant and challenging problem. We will present a novel bioinformatics framework designed to estimate and compare the rates of complete and partial HGTs at different phylogenetic and ecological levels. Our results suggest that partial HGTs are almost twice as frequent as complete ones. We also determined that the majority of prokaryotic genes (here, a gene is represented by a multiple alignment of the corresponding alleles) have been affected multiple times by gene transfers during their evolutionary history: we found that 83% of the considered prokaryotic genes have been affected by at least one complete HGT and 96% by at least one complete or partial HGT.

17.30-17.50 Rare amino acid changes fixation drives divergence in Metazoa evolution
Konstantin Gunbin

This report answers the following questions: What is the physical nature of the two contrasting groups of amino acid substitutions, atypical (statistically rare) and typical? What is the difference between protein sites fixing atypical substitutions and protein sites fixing any amino acid replacements? What is the difference between lineages fixing atypical replacements and lineages without these replacements? How are atypical amino acid substitutions distributed among protein functional groups? Is there a connection between the frequency of atypical replacements and genus birth in the fossil record? Which branches of the Metazoa tree are enriched with atypical amino acid replacements, and why does it matter?

17.50-18.10 Evolution of TAG codon in Methanosarcina
Margarita Meer

18.10-18.30 A model of protein evolution within local fitness landscape changing with time
Dinara Usmanova

Each amino acid in a protein interacts with others.
Thus the fitness contribution of a specific amino acid at a particular site depends on the whole genetic background. This background changes over time, resulting in a change of allele fitness. In other words, selection acting against particular alleles is not constant. We developed methods for the analysis of long-term protein evolution which allow us to observe patterns of this changing selection. We then formulate a covarion-like model of protein evolution, which describes this process mathematically. The model tracks not only the evolution of the sequence but also the evolution of its local fitness landscape. In more detail, we allow the fitness contribution of a specific amino acid at a particular site to switch from being acceptable to being deleterious and vice versa. We calculated the rate of these switches for approximately 100 bacterial genes and 10000 vertebrate genes. It appears that the fitness landscape changes very fast: on average, 5 switches between allowed and blocked states occur in the same timeframe as a single amino acid substitution.
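To make the idea of amino acids switching between allowed and blocked states concrete, here is a minimal Python sketch of a single site evolving under such a scheme. It is a toy event-driven simulation under stated assumptions, not the authors' model: the function `simulate_site`, its rate parameters, and the rule that the resident amino acid always stays allowed are all hypothetical simplifications.

```python
import random


def simulate_site(n_events, switch_rate, subst_rate, n_states=20, seed=0):
    """Toy covarion-like simulation of one protein site (hypothetical).

    Each of the n_states amino acids is either allowed or blocked at this
    site. Every non-resident amino acid flips its state at rate switch_rate;
    the resident amino acid is replaced by each currently allowed alternative
    at rate subst_rate. Returns (number of switches, number of substitutions).
    """
    rng = random.Random(seed)
    allowed = [True] * n_states  # start with every amino acid acceptable
    current = 0                  # index of the resident amino acid
    switches = 0
    substitutions = 0
    for _ in range(n_events):
        others = [a for a in range(n_states) if a != current]
        targets = [a for a in others if allowed[a]]
        total_switch = switch_rate * len(others)
        total_subst = subst_rate * len(targets)
        # Pick the next event type in proportion to its total rate.
        if rng.random() * (total_switch + total_subst) < total_switch:
            a = rng.choice(others)
            allowed[a] = not allowed[a]  # landscape change at this site
            switches += 1
        else:
            current = rng.choice(targets)  # substitution to an allowed state
            substitutions += 1
    return switches, substitutions


switches, substitutions = simulate_site(10000, switch_rate=1.0, subst_rate=0.5)
print(switches, substitutions, switches / max(substitutions, 1))
```

The printed ratio of switches to substitutions depends entirely on the assumed rates; the value of about 5 reported in the abstract is an estimate from real data and is not something this sketch derives.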


18.30-19.10 Chartering the local fitness landscape of the green fluorescent protein
Fedor Kondrashov

The nature of the genotype to phenotype connection, the fitness landscape, and the extent to which it is shaped by the non-independent contribution of mutations, epistasis, remain poorly understood. My talk will focus on an assay of the native function, fluorescence, of tens of thousands of genotypes of the green fluorescent protein, including genotypes containing multiple missense mutations, allowing for the exploration of the local fitness landscape of an entire protein-coding gene in unprecedented detail. We find that the impact of multiple missense mutations on fluorescence was influenced by epistasis, especially those in functionally important sites with a weak individual effect on fluorescence. Furthermore, although the fitness landscape can be approximated with a relatively simple unidimensional function, it is also affected by multidimensional epistasis, such that a multidimensional representation of the genotype space leads to a more accurate prediction of the level of fluorescence for each genotype. The broad congruence of the estimates of the prevalence of epistasis from long-term evolution with our data suggests that our query of the shape of the local fitness landscape can be extrapolated to a larger scale. However, the local fitness landscape does not appear to be influenced by epistasis between amino acid residues with a direct interaction in the protein structure. This observation appears to contrast with the general conclusions on the importance of structural interactions in long-term evolution, suggesting that multidimensional epistatic interactions are rare in short-term evolution but accumulate with protein divergence.

19.10-19.20 Closing

19.30 Farewell party


Poster session

# Name Section Poster title 1 Shlikht Anatoliy algorithms Automated working place of bioinformatics

2 Taranov Evgeny algorithms DegenPrimer: a software for in silico simulation of multiplex PCR with degenerate primers 3 Golosova Olga algorithms NGS Data Analysis with Unipro UGENE

4 Kalinina algorithms The method for Anastasia homologous recombination detection within bacterial species 5 Klimchuk Olesya algorithms OLESA: Operon Loci Examination and Sorting Application 6 Demidov German algorithms A Novel Statistical Algorithm to Detection of Large-scale Deletions in PCR-enriched Target Sequencing Data 7 Demidov German algorithms Stochastic modeling of enhancer molecular configurations 8 Fedonin algorithms Characterization of highly Gennadiy diverse viral populations by fast reference selection and accurate read mapping


9 | Flegontov Pavel | algorithms | A read mapper for investigation of U-insertion/deletion RNA editing
10 | Gerasimov Evgeny | algorithms | Human-guided genome assembly finishing software

11 | Lyubetsky Vassily | algorithms | A method of detecting local gene synteny rearrangement

12 | Poverennaia Irina | algorithms | Investigation of exon-intron structure multiple alignments
13 | Soldatov Ruslan | algorithms | Differential activity of polymerase ? associated with replication timing and gene bodies in humans: evidence from mutational signatures
14 | Nagaev Boris | algorithms | NPG-explorer, a tool for creating and exploring nucleotide pangenome for closely related prokaryotic genomes
15 | Zhuravleva Ekaterina | epigenetics and chromatin structure | Evaluation of the positional correlations between whole genome annotations: novel statistical approaches development, advancement of the GenometriCorr methodologies


16 | Stavrovskaya Elena | epigenetics and chromatin structure | StereoGene: a tool for fast correlation assessment and its application to the analysis of bivalent histone methylation
17 | Galitsyna Aleksandra | epigenetics and chromatin structure | Spatial configuration of the alpha-globin gene domain in three cell types of G. gallus
18 | Khrameeva Ekaterina | epigenetics and chromatin structure | Active chromatin regions are sufficient to define borders of topologically associated domains in D. melanogaster interphase chromosomes
19 | Kulakova Ekaterina | epigenetics and chromatin structure | Computer analysis of chromosome contacts obtained by ChIA-PET and Hi-C technologies
20 | Klink Galya | evolution | Analysis of prevalence of epistasis on the basis of huge phylogenies

21 | Potapova Nadezhda | evolution | Accumulation of mutations in nonsense alleles of Drosophila melanogaster

22 | Rusinov Ivan | evolution | Estimation of selection pressure on degenerate sequences in genomes: choice of method
23 | Terekhanova Nadezhda | evolution | Local variation of the mutation rate across the primate phylogeny


24 | Teterina Anastasia | evolution | The evolution of cod protein coding genes: intra- and interspecies levels

25 | Savitskaya Ekaterina | evolution | Autoimmune primed CRISPR adaptation in I-E and I-F systems: comparative analysis of new spacer selection mechanisms
26 | Novakovsky German | evolution | Phylogenomic analysis of the type I NADH:quinone-oxidoreductase
27 | Olga Bondareva | evolution | Study of lactobacteria's genomes evolution
28 | Tarasov Oleg | evolution, taxonomy | Sequencing genomes of Saccharomyces cerevisiae strains belonging to the Peterhof Genetic Collection helps elucidate the origin of several widely used laboratory strains
29 | Troitsky Aleksey | evolution, taxonomy | Moss phylogeny reconstructed from 24 full mitogenome sequences using new "pangenome" based approach.
30 | Korvigo Ilia | evolution, taxonomy | The Evolutionary Space of bacterial 16S rRNA gene
31 | Baranova Mariia | medical and population genetics | Extremely high polymorphism level in fungi S. commune: the cause and the importance for population genomics


32 | Bai Haihua | medical genetics | Identification of the susceptibility gene loci associated with ischemic stroke in a Mongolian population in China
33 | Belenikin Maxim | medical genetics | Studying of epileptic encephalopathies using NimbleGen-based target panels

34 | Belenikin Maxim | medical genetics | Finding of compound heterozygous mutations in the ALDH7A1 gene. Clinical case

35 | Reznik Aleksandr | medical genetics | Evolutionary analysis of NPC1 improves accuracy of predicting disease causing missense mutations
36 | Sergeev Roman | medical genetics | Mutation analysis of M. tuberculosis nucleotide sequences from patients in Belarus
37 | Bizin Ilya | medical genetics, cancer | A bioinformatics pipeline for analysing germline mutations in human breast cancer by exome sequencing
38 | Milchevskaya Vladislava | medical genetics, cancer | Improved gene annotations for microarray based identifications of reporter metabolites in recurrent breast cancer


39 | Moshkovskii Sergei | medical genetics, cancer | Exome-based proteogenomics of human cancer cell lines
40 | Terskikh Anastasia | medical genetics, cancer | Analysis of mutational landscape of patients with chronic lymphocytic leukemia
41 | Popova Anfisa | metagenomics | BCVISS: a web application for analyzing mixed 16S rRNA gene chromatograms
42 | Dubinkina Veronika | metagenomics | Assessment of k-mer spectrum applicability for metagenomic dissimilarity analysis of human gut microbiota
43 | Garushyants Sofya | metagenomics | Comparative metagenomic profiling of two pilot-scale microbial fuel cells treating industrial wastewaters
44 | Kiseleva Larisa | metagenomics |
45 | Kovarsky Boris | metagenomics | Recent genomic changes in the human gut microbiome
46 | Shavkunov Konstantin | metagenomics | Bacteria revived from an ancient bison gut
47 | Kazakov Sergey | metagenomics | MetaFast: fast reference-free graph-based comparison of shotgun metagenomic data
48 | Ivashchenko Anatoliy | miRNA | Binding sites of miRNAs with transcription factors' genes of Camelus ferus and Homo sapiens


49 | Ivashchenko Anatoliy | miRNA | Features of miR-574-5p and miR-574-3p binding sites in mRNA of target genes
50 | Prosvirov Kirill | miRNA | At least 6% of conserved miRNAs' sites are misaligned.
51 | Niyazova Raigul | miRNA, cancer | Interactions between miRNAs and mRNAs of apoptosis genes in lung cancer
52 | Niyazova Raigul | miRNA, cancer | The interaction of miRNAs with mRNAs of the cell cycle genes in lung cancer
53 | Hadarovich Anna | protein function | Quantitative comparison of functional properties in protein-protein complexes
54 | Petrov Artem | protein function | A novel Arg H52 and Tyr H33 conservative binding motif in antibodies: a correlation between sequence of immunoglobulins and their binding properties
55 | Alexandrov Anton | protein function, algorithms | Computational prediction of MHC class I tumor-specific antigens

56 | Argun Dmitriy | protein function, algorithms | Sequence analysis in short functionally important peptides by combination of bioinformatics, molecular dynamics and testing of biological activity


57 | Bogatyreva Natalya | protein structure | Methods for protein folding rate prediction
58 | Dudko Anna | protein structure | TOM-complex structure modeling
59 | Gushchina Irina | protein structure | Molecular model of tyrosyl-DNA phosphodiesterase 1 for a structure-based screening for its inhibitors
60 | Guzenko Dmytro | protein structure | Constrained Modelling of an Intermediate Filament Dimer

61 | Ivankov Dmitry | protein structure | Testing applicability of machine learning for protein folding rate prediction

62 | Milchevskiy Yury | protein structure | Local protein structure prediction based on physicochemical properties of amino acids
63 | Nyporko Alex | protein structure | The 8-oxo-7,8-dihydro-2-dGTP behavior in active site of human DNA polymerase : structural investigation in silico
64 | Rogacheva Olga | protein structure | cAMP-induced conformational changes of Protein Kinase A Ia A-domain

65 | Scherbakov Kirill | protein structure | Docking method reveals binding patterns of -dioc acids by albumin


66 | Shalaeva Daria | protein structure | Modeling the role of positively charged moieties in hydrolysis of nucleoside triphosphates
67 | Tarnovskaya Svetlana | protein structure | Structural analysis of mutations associated with idiopathic restrictive cardiomyopathy in cytoskeletal and sarcomeric proteins
68 | Aksianov Evgeniy | protein structure, algorithms | Sequence alignment of non-superposable beta-sheets

69 | Bykov Alexander | regulation of transcription | Changing the transcriptional activity of genome regulatory loci by PCR-mutagenesis

70 | Khoroshkin Matvei | regulation of transcription | Transcriptional Regulation of the Carbohydrate Metabolism in the Bifidobacterium Genus

71 | Suvorova Inna | regulation of transcription | Reconstruction of GABA and taurine metabolic regulons, controlled by MocR-subfamily transcription factors
72 | Tutukina Maria | regulation of transcription | Revealing and comparing regulons of homologous transcription factors UxuR and ExuR in Escherichia coli.


73 | Zharov Ilya | regulation of transcription | Correlations of substitutions predict specific protein-DNA contacts in the MerR family of transcriptional factors
74 | Zalevsky Arthur | RNA and DNA structure | Unraveling CD spectra of G-quadruplexes

75 | Lomzov Alexander | RNA and DNA structure | Hybridization energy of native and modified DNA duplexes calculated using molecular dynamics
76 | Chervontseva Zoe | RNA structure | The evolution of 5' untranslated regions' structure in Bacilli and Clostridia genomes

77 | Moldovan Mikhail | RNA structure | Comparative genomics analysis of thiamine-pyrophosphate riboswitches in fungal genomes
78 | Baulin Eugene | RNA structure | Long-range stem-based RNA tertiary motifs

79 | Vasileva Aleksandra | RNA structure | Secondary structures in the coding regions of mRNAs: literature survey and comparison of prediction methods
80 | Vinogradova Svetlana | RNA structure | Probing-directed structured elements detection in RNA sequences


81 | Volkova Oxana | RNA structure | Estimation of translational importance of mammalian mRNA nucleotide sequence characteristics based on ribosomal profiling data
82 | Lavrekha Viktoriya | systems biology, modeling | Mathematical modeling of morphogenetic regulation of the meristem zone formation in the plant root

83 | Novikov Konstantin | systems biology, modeling | A new method for identification of molecular motor role in endocytosis

84 | Spirov Alexander | systems biology, modeling | Dynamic modeling of genes for spatial patterning in embryo development on the example of the Drosophila segmentation gene hunchback
85 | Dymova Arina | systems biology, networks | Combined sequenced-based model of the Drosophila gap gene network

86 | Kogan Valeria | systems biology, networks | Critical dynamics of gene networks is behind ageing and Gompertz law

87 | Kondratova Maria | systems biology, networks | Atlas of Cancer Signaling Network: from intracellular networks to tumoral microenvironment


88 | Misko Vladislav | systems biology, networks | The construction of gene networks for Mycobacterium tuberculosis by analyzing next-generation sequencing data
89 | Mitra Chanchal | systems biology, networks | Modelling the Metabolic Pathways
90 | Sergushichev Alexey | systems biology, networks | Globally connected networks of GEO transcriptional profiles reveal hypothesis generation and drug repurposing potential
91 | Shagimardanova Elena | systems biology, networks | Comparative metabolomic profiling of desiccation tolerant midge

92 | Cherkasov Alexander | transcriptomics | Whole genome analysis of variety and expression of heat-shock protein encoding genes during desiccation stress in an anhydrobiotic midge Polypedilum vanderplanki
93 | Gazizova Guzel | transcriptomics | How to escape from muscle atrophy: whole-genome analysis of gene expression in edible dormouse (Glis glis) during immobilization
94 | Kuznetsova Svetlana | transcriptomics | Transcriptomic of the leech Ozobranchus jantseanus


95 | Naumenko Sergey | transcriptomics | Building the set of orthologous genes for 66 Gammaridae transcriptomes
96 | Nesmelov Alexander | transcriptomics | Antioxidant system of desiccation-tolerant insect Polypedilum vanderplanki
97 | Spitsina Anastasia | transcriptomics, algorithms | Computer tool for gene expression data processing and correlation analysis
98 | Garanina Irina | transcriptomics, splicing | Splicing sites evolution in primates prefrontal cortex
99 | Speshilov Gleb | transcriptomics, splicing | Comprehensive comparison of RNA-seq based methods for differential splicing analysis
100 | Vinogradov Dmitry | transcriptomics, splicing | Alternative splicing in hepatocellular carcinoma


Conference materials are available at http://mccmb.belozersky.msu.ru/2015

250 copies

© ITTP RAS, 2015
