Bioinformatics 2012, Stockholm, June 11-14, 2012

http://socbin.org/bioinfo2012/

Welcome

SocBiN, in collaboration with the Center for Biomembrane Research, welcomes you to the 12th annual conference in bioinformatics. This year the conference will be held in beautiful Stockholm, starting at lunch-time on June 11 and ending at lunch on June 14. It will be held in the lecture hall "Berzelius" (Berzelius väg 3 / Tomtebodavägen, at the bus stop for SL bus 69) on the Karolinska Institutet campus in Stockholm, close to Science for Life Laboratory. We are looking forward to an exciting scientific program with 4 invited keynote speakers and 5 sessions (Molecular machines, Using next generation sequence data, Data analysis of proteomics assays, Bioinformatics of chemical biology and RNA bioinformatics).

We wish you all a warm welcome.

The organizing committee

Arne Elofsson, Department of Biochemistry and Biophysics, Stockholm University; Erik Lindahl, Theoretical Physics, KTH; and Bengt Persson, Linköping University

SocBiN

SocBiN (Society for Bioinformatics in Northern Europe) is a non-profit organization for people working with and interested in bioinformatics and computational biology. The members of the organization are predominantly from the Nordic and Baltic countries, but others are also welcome.

We are grateful for the help of our session chairs:

• Arne Elofsson, Science for Life Laboratory, Stockholm University, Sweden
• Lukas Käll, Science for Life Laboratory, KTH, Sweden
• Anders Andersson, Science for Life Laboratory, KTH, Sweden
• Jens Carlsson, Center for Biomembrane Research, Stockholm University, Sweden
• Janusz Bujnicki, International Institute of Molecular and Cell Biology in Warsaw, Poland

Program

Mon 11
Data Analysis of Proteomics Assays
13:45-14:00 Arne Elofsson: Welcome
14:00-14:30 Ruedi Aebersold: Searching and Mining of Proteomic SWATH-MS datasets
14:30-15:00 Lennart Martens: Snakes and ladders: where do proteomics assays fail and how can we fix them?
15:00-15:15 Finn Drabløs: The Triform algorithm: improved sensitivity and specificity in ChIP-Seq peak finding
Coffee
15:45-16:15 Edward Marcotte: Insights from proteomics into protein organization, evolution, and genetic disease
16:15-16:45 Roman Zubarev: Pathway Analysis in Expression Proteomics
16:45-17:00 Paul Horton: MoiraiSP: a novel mitochondrial cleavage site predictor
17:00-19:00 Reception and poster session (Presentation by odd numbers)

Tue June 12
RNA Bioinformatics
09:30-10:00 Bob Darnell
10:00-10:30 Jan Gorodkin: Towards the search for RNA-RNA interaction based networks
10:30-10:45 Mihaela Zavolan: A biophysical model to infer canonical and non-canonical microRNA-target interaction
Coffee
11:15-11:45 Eric Westhof: The Detection of the Architectural Modules of RNA and Recent Progress in RNA Modelling
11:45-12:15 Samuel Flores: A structural and dynamical model of human telomerase
12:15-12:30 Nanjiang Shu: Computational analysis of membrane protein topology evolution
LUNCH
Keynotes Session
13:30-14:30 Anders Krogh: On the accuracy of short read mapping
14:30-15:30 Kerstin Lindblad-Toh
Coffee
16:00-17:00 Jens Nielsen: Genome-Scale Metabolic Models: A Bridge between Bioinformatics and Systems Biology
17:00-18:00 Paul Horton: Excavating human NUMTs
18:00-19:00 Michael Levitt: Solving the Recalcitrant Crystal Structure of Group II Chaperonin TRiC/CCT by Mass Spectrometry and Sentinel Correlation Analysis
19:30-24:00 Conference Dinner

Wed 13
Bioinformatics of chemical biology
09:30-10:00 Gert Vriend: What can we (not yet) learn from 70 GPCR structures
10:00-10:30 Raymond Stevens: Understanding Human G-protein Coupled Receptor Structural Diversity and Modularity
10:30-10:45 David Gloriam: Chemogenomic Discovery of Allosteric Antagonists at the GPRC6A Receptor
Coffee
11:15-11:45 Helgi Schiöth: The origin of GPCRs, the largest family of membrane bound proteins
11:45-12:15 Andreas Bender: Using Chemogenomics Approaches to Modulate Biological Systems
12:15-12:30 Kentaro Tomii: PoSSuM: a database of known and potential ligand-binding sites in proteins
Using Next generation sequence data
14:00-14:30 Jeroen Raes: Metagenomics data analysis: from the oceans to the human microbiome
14:30-15:00 Christopher Quince: Extracting ecological signal from noisy microbiomics data
15:00-15:15 Johan Bengtsson: Comprehensive Analysis of Antibiotic Resistance Genes in River Sediment, Well Water and Soil Microbial Communities Using Metagenomic DNA Sequencing
15:15-15:30 Daniel Edsgärd: Allele specific expression changes after induction of inflammation
Coffee
16:00-16:30 Erik van Nimwegen: Reconstructing transcription regulatory networks in mammals using a combination of modeling and next-generation sequencing data
16:30-17:00 Joakim Lundeberg: Sequencing and assembly of the largest and most complex genome to date - the Norway spruce (Picea abies)
17:00-17:30 Ivo Gut: High-resolution whole-genome analysis and cancer
17:30-19:00 Poster session (Presentation by even numbers)

Thu 14
Molecular machines
09:00-09:30 Martin Weigt: From sequence variability to protein (complex) structure prediction
09:30-10:00 Evolution teaches protein prediction
10:00-10:15 Janusz Bujnicki: If There's an Order in All of This Disorder…: Structural Bioinformatics of the Human Spliceosomal Proteome
10:15-10:30 Joanna M Kasprzak: PyRy3D: a software tool for modelling of large macromolecular complexes
Coffee
11:00-11:30 Ingemar André: Design and Prediction of Protein Self-assembly
11:30-12:00 Rob Russell
12:00-12:15 Closing words

List of participants

Conference "Bioinformatics 2012", June 11-14 at Karolinska Institutet, Stockholm, Sweden

First name Surname/Family name; University/Organization; Country
Ruedi Aebersold; ETH Zurich; SWITZERLAND
Rahul Agarwal; Chalmers; SWEDEN
Mehmood Alam Khan; KTH; SWEDEN
Raja Hashim Ali; Kungliga Tekniska Hogskolan; SWEDEN
Anders Andersson; KTH; SWEDEN
Ingemar André; Lund University; SWEDEN
Reidar Andreson; University of Tartu; ESTONIA
Lars Arvestad; Stockholm University; SWEDEN
Ahmad Barghash; Saarland University; GERMANY
Walter Basile; SWEDEN
Johan Bengtsson; University of Gothenburg; SWEDEN
Jorrit Boekel; Scilifelab Stockholm; SWEDEN
Mikael Borg; SWEDEN
Susanne Bornelöv; Uppsala university; SWEDEN
John Boss; Karolinska Institutet; SWEDEN
Fredrik Boulund; Chalmers University of Technology; SWEDEN
Christian Brüffer; Lund University; SWEDEN
Torben Brömstrup; KTH; SWEDEN
Janusz Bujnicki; IIMCB; POLAND
Ignas Bunikis; Uppsala University; SWEDEN
Jens Carlsson; Stockholm University; SWEDEN
Alexey Chernobrovkin; IBMC RAMS; RUSSIA
Anna Czerwoniec; Adam Mickiewicz University; POLAND
Robert Darnell; The Rockefeller University; UNITED STATES
Carsten Daub; Karolinska Institutet and SciLifeLab; SWEDEN
Ino De Bruijn; Stockholm University; SWEDEN
Finn Drablos; Norwegian Univ of Science and Technology; NORWAY
Lei Du; Karolinska Institutet; SWEDEN
Stanislaw Dunin-Horkawicz; IIMCB; POLAND
Daniel Edsgärd; KTH, Science for Life Laboratory; SWEDEN
Arne Elofsson; Principal Investigator/Lab Head/Senior R; SWEDEN
Olof Emanuelsson; Kungliga Tekniska Högskolan; SWEDEN
Hassan Foroughi Asl; Karolinska Institutet; SWEDEN
Oliver Frings; SWEDEN
Mattias Frånberg; Stockholms Universitet/Karolinska Instit; SWEDEN
David Gloriam
David Gomez-Cabrero; BILS; SWEDEN
Jan Gorodkin; University of Copenhagen; DENMARK
Viktor Granholm; Stockholm University; SWEDEN
Svenn Helge Grindhaug; Uni Research; NORWAY
Dimitri Guala; SWEDEN
Ivo Gut; Centro Nacional de Análisis Genómico; SPAIN
Mohamed Hamed; Saarland University; GERMANY
Sampsa Hautaniemi; University of Helsinki; FINLAND
Sikander Hayat
Paul Horton; AIST, Computational Biology Res. Ctr.; JAPAN
Luisa Hugerth; KTH; SWEDEN
Lina Hultin Rosenberg; Karolinska Institutet; SWEDEN
Lukasz Huminiecki
Katherine Abigail Icay; University of Helsinki; FINLAND
Henrik Johansson; Karolinska Institute; SWEDEN
Anna Johnning; University of Gothenburg; SWEDEN
Viktor Jonsson; University of Gothenburg; SWEDEN
Sini Junttila; University of Turku; FINLAND
Mette Jørgensen; University of Copenhagen; DENMARK
Yvonne Kallberg; Karolinska Institutet; SWEDEN
Joanna Kasprzak; Adam Mickiewicz University; POLAND
Zeeshan Khaliq; Uppsala University; SWEDEN
Per Kraulis; SWEDEN
Anders Krogh; University of Copenhagen; DENMARK
Deepak Kumar; Adam Mickiewicz University; POLAND
Mayank Kumar; Saarland University; GERMANY
Kanthida Kusonmano; Uni Research AS; NORWAY
Leena Kytömäki; University of Turku; FINLAND
Lukas Käll; Royal Institute of Technology (KTH); SWEDEN
Jens Lagergren; KTH; SWEDEN
Silja Laht; University of Tartu; ESTONIA
Dan Larhammar; Uppsala University; SWEDEN
Ksenia Lavrichenko; University of Bergen; NORWAY
Fredrik Levander; Lund University; SWEDEN
Michael Levitt; Stanford University; UNITED STATES
Sara Light; SWEDEN
Erik Lindahl; KTH Royal Institute of Technology; SWEDEN
Jessica Lindvall; Karolinska Institutet; SWEDEN
Joakim Lundeberg; KTH, Science for Life Laboratory; SWEDEN
Ingrid Lundell; Uppsala University; SWEDEN
Fredrik Lysholm; Linköping University; SWEDEN
Ari Löytynoja; University of Helsinki; FINLAND
Muhammad Owais Mahmudi; KTH; SWEDEN
Edward Marcotte; University of Texas; UNITED STATES
Tonu Margus; University of Tartu; ESTONIA
Lennart Martens; VIB and Ghent University; BELGIUM
Paula Martinez; Chalmers University of Technology; SWEDEN
Dorota Matelska; IIMCB; POLAND
Veli Mäkinen; University of Helsinki; FINLAND
Jens Nielsen; Chalmers University of Technology; SWEDEN
Henrik Nielsen; Technical University of Denmark; DENMARK
Roland Nilsson; Karolinska Institutet; SWEDEN
Wieslaw Nowak; Uniwersytet M.Kopernika w Toruniu; POLAND
Johan Nylander; Swedish Museum of Natural History; SWEDEN
Pall Isolfur Olason; Uppsala University; SWEDEN
Ananta Paine; Karolinska Institute; SWEDEN
Pekka Parviainen; KTH; SWEDEN
Maria Pernemalm; Karolinska Institutet; SWEDEN
Bengt Persson; Linköping University and BILS; SWEDEN
Christoph Peters; SWEDEN
Kjell Petersen; Uni Research AS; NORWAY
Robert Pilstål; Linköping University; SWEDEN
Rui Pinto; Umeå University; SWEDEN
Iman Pouya; Royal Institute of Technology; SWEDEN
Jasna Pruner; Uppsala University; SWEDEN
Christopher Quince; University of Glasgow; UNITED KINGDOM
Jeroen Raes; Vrije Universiteit Brussel; BELGIUM
Balaji Rajashekar; Tartu University; ESTONIA
Anirudh Ranganathan
Henri Raska; Tallinn University of Technology; ESTONIA
Johan Reimegård; Royal Institute of Technology; SWEDEN
Maido Remm; University of Tartu; ESTONIA
Dirk Repsilber; Leibniz Inst. for Farm Animal Biology; GERMANY
Ana Maria Rodriguez Sanchez; KAROLINSKA UNIVERSITY HOSPITAL; SWEDEN
Burkhard Rost; Technische Universitaet Muenchen; GERMANY
Arcadio Rubio García; Technical University of Denmark; DENMARK
Robert Russell; University of Heidelberg; GERMANY
Kristoffer Sahlin; KTH/Science for life Laboratory; SWEDEN
Helgi Schiöth; Uppsala University; SWEDEN
Thomas Schmitt; SWEDEN
Sophie Schwaiger; KTH; SWEDEN
Bengt Sennblad; Karolinska Institutet; SWEDEN
Alexey Sergushichev; University ITMO; RUSSIA
Hossein Shahrabi Farahani; KTH; SWEDEN
Nanjiang Shu; SWEDEN
Gilad Silberberg; Stockholm University; SWEDEN
Indranil Sinha; Karolinska Institute; SWEDEN
Joel Sjöstrand; SWEDEN
Marcin Skwark; POLAND
Erik Sonnhammer; SWEDEN
Matthew Studham; SWEDEN
Shravan Sukumar; University of Wisconsin-Madison; UNITED STATES
Valentine Svensson; SWEDEN
Thomas Svensson; Science for Life Laboratory; SWEDEN
Christian Tellgren-Roth; Uppsala University; SWEDEN
Andreas Tjärnberg; SWEDEN
Kentaro Tomii; AIST; JAPAN
Ikram Ullah; KTH; SWEDEN
Per Unneberg; SWEDEN
Björn Wallner; Linköping University; SWEDEN
Roman Valls Guimera; SWEDEN
Erik Van Nimwegen; University of Basel; SWITZERLAND
Lixiao Wang; Umeå University; SWEDEN
Per Warholm; SWEDEN
Martin Weigt; University Pierre & Marie Curie; FRANCE
Björn Wesén; KTH; SWEDEN
Eric Westhof; University of Strasbourg, IBMC-CNRS; FRANCE
Francesco Vezzi; KTH Royal Institute of Technology; SWEDEN
Viola Volpato; University College Dublin (UCD); IRELAND
Gert Vriend; CMBI; NETHERLANDS, THE
Bo Xu; Uppsala University; SWEDEN
Özge Yoluk; KTH; SWEDEN
Katarzyna Zaremba-Niedzwiedzka; Uppsala University; SWEDEN
Weizhou Zhao; Uppsala University; SWEDEN
Marie Öhman; Stockholm University; SWEDEN
Linus Östberg; Karolinska Institutet; SWEDEN

Abstracts from Invited speakers

Ruedi Aebersold, Institute of Molecular Systems Biology, ETH Zurich and Faculty of Science, University of Zurich

Searching and Mining of Proteomic SWATH-MS datasets

Recently we introduced a new data independent acquisition (DIA) method termed SWATH-MS (1). This method is, in effect, a time- and mass-segmented acquisition method in which complex, high-specificity fragment ion maps of all precursor ions within a user-defined precursor RT and m/z space are generated and recorded. This is accomplished by stepping the isolation window of a specifically tuned quadrupole time-of-flight (QqTOF) instrument in discrete increments recursively throughout the duration of the LC separation. The data acquired by SWATH-MS are not searchable by conventional database search engines, because each fragment ion spectrum is a composite of multiple, concurrently fragmented precursor ions.

In this presentation we will describe an automatic pipeline for peptide identification and quantification from SWATH-MS datasets. It is conceptually related to the mProphet algorithm developed for the analysis of S/MRM datasets (2). The algorithm applies a targeted search strategy, whereby peak groups uniquely identifying a particular peptide are extracted from the SWATH-MS dataset and assigned a probability of being correctly associated with the target peptide. The algorithm uses a system of individual feature score rankings that are then combined into a composite score.
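As a rough illustration of how per-feature rankings can be combined into a composite score, here is a minimal Python sketch. The feature names (`coelution`, `library_correlation`, `rt_agreement`), the toy values and the equal-weight averaging are assumptions made for this example only; they are not the pipeline described in the abstract, and the probability assignment step is not shown.

```python
def rank_scores(values):
    """Map raw feature values to ranks in [0, 1]; the largest value gets 1.0."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    for rank, idx in enumerate(order):
        ranks[idx] = rank / (n - 1) if n > 1 else 1.0
    return ranks


def composite_scores(peak_groups, feature_names):
    """Average the per-feature ranks of each candidate peak group."""
    per_feature = {
        name: rank_scores([pg[name] for pg in peak_groups])
        for name in feature_names
    }
    return [
        sum(per_feature[name][i] for name in feature_names) / len(feature_names)
        for i in range(len(peak_groups))
    ]


# Hypothetical candidate peak groups for one target peptide (toy values).
FEATURES = ["coelution", "library_correlation", "rt_agreement"]
candidates = [
    {"coelution": 0.9, "library_correlation": 0.8, "rt_agreement": 0.95},
    {"coelution": 0.4, "library_correlation": 0.5, "rt_agreement": 0.30},
    {"coelution": 0.6, "library_correlation": 0.2, "rt_agreement": 0.50},
]
scores = composite_scores(candidates, FEATURES)
best_index = max(range(len(scores)), key=lambda i: scores[i])
print(best_index, scores)
```

In this toy data the first peak group ranks highest on every feature and therefore receives the top composite score.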

The performance of the method will be illustrated with selected examples that indicate the power of the approach for the reproducible analysis of proteomes, the detection of modified peptides and the estimation of the absolute quantity of proteins and proteomes.

1. Gillet LC, Navarro P, Tate S, Roest H, Selevsek N, Reiter L, Bonner R, Aebersold R. (2012) Targeted data extraction of the MS/MS spectra generated by data independent acquisition: a new concept for consistent and accurate proteome analysis. MCP [Epub ahead of print]
2. Reiter L, Rinner O, Picotti P, Huettenhain R, Beck M, Brusniak MY, Hengartner MO, Aebersold R. (2011) mProphet: automated data processing and statistical validation for large-scale SRM experiments. Nat Methods 8(5):430-5.

Ingemar André, Center for Molecular Protein Science, Biochemistry and Structural Biology, Lund University

Design and Prediction of Protein Self-assembly

Many of the largest protein complexes in biology are composed of a single type of subunit that is repeated a large number of times to generate a functional assembly. Such homomeric structures are often assembled spontaneously from individual components through the process of self-assembly. Research in our group is focused on the prediction of the three-dimensional structure of homomeric assemblies and the rational design of novel self-assembling proteins and peptides. Over the last several years we have developed computational methods to model the structure of homomeric assemblies using the powerful constraint of molecular symmetry. In this presentation I will illustrate how these prediction methods, in conjunction with limited experimental constraints, can be used to tackle important problems in structural biology. The second part of the talk will deal with the rational design of self-assembling proteins and peptides. We combine the powerful design template of self-assembly with structural modeling and computational protein design to design protein assemblies on an atomic level.

Andreas Bender, Cambridge University

Using Chemogenomics Approaches to Modulate Biological Systems

Modulating biological systems can be achieved via biological means (such as knock-out animals or RNA interference); however, chemical modulation by small molecules is an alternative method with significantly different properties, such as the ability to control the dose and time course of administration in detail. In this presentation, different methods for analysing the mode of action of small molecules that show an effect in phenotypic assays will be discussed, in order to understand small molecule action better. Also, reversing the direction of the analysis, we will outline how the large bioactivity databases available today can be used to design molecules with the desired effect on a biological system, be it by modulating single targets or, as has become more popular recently, by modulating a defined set of target proteins.

Samuel Flores, Uppsala University

A structural and dynamical model of human telomerase

Mutations in the telomerase complex disrupt either nucleic acid binding or catalysis, and are the cause of numerous human diseases. Despite its importance, the structure of the human telomerase complex has not been observed crystallographically, nor are its dynamics understood in detail. Fragments of this complex from Tetrahymena thermophila and, more controversially, Tribolium castaneum have been crystallized. Biochemical probes provide important insight into dynamics. In this work we use available structural fragments to build a homology model of human TERT, and validate the result with functional assays. We then generate a trajectory of telomere elongation following a “typewriter” mechanism: the RNA template moves to keep the end of the growing telomere in the active site, disengaging after every 6-residue extension to execute a “carriage return” and go back to its starting position. A hairpin can easily form in the telomere, from DNA residues leaving the telomere-template duplex. The trajectory is consistent with available experimental evidence and suggests focused biochemical experiments for further validation.
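The "typewriter" cycle described above can be pictured with a toy, sequence-level simulation in Python. This is only a sketch of the bookkeeping (one residue added per step, a reset after every 6 residues); it is not the structural trajectory from the abstract. The function name `elongate` and the register of the `TTAGGG` repeat used here are assumptions for illustration.

```python
def elongate(primer, n_residues, repeat="TTAGGG"):
    """Toy model of repeat addition with a "carriage return" every len(repeat) residues.

    `repeat` stands in for the DNA copied from the RNA template; the register
    (where the repeat starts) is an assumption of this sketch.
    """
    telomere = primer
    template_pos = 0        # position of the template relative to the active site
    carriage_returns = 0
    for _ in range(n_residues):
        telomere += repeat[template_pos]   # add one residue
        template_pos += 1                  # template moves by one position
        if template_pos == len(repeat):    # 6-residue extension complete
            template_pos = 0               # disengage and return to start
            carriage_returns += 1
    return telomere, carriage_returns


sequence, returns = elongate("", 12)
print(sequence, returns)
```

Extending by 12 residues in this toy model yields two full repeats and two carriage returns.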

Jan Gorodkin, Center for non-coding RNA in Technology and Health, Denmark

Towards the search for RNA-RNA interaction based networks

In recent years the awareness of non-coding RNAs has increased rapidly, and experimental as well as in silico results have elucidated their large potential. Our motivation here is the thousands of in silico generated RNA structure candidates in the genome. A major challenge is to assign function to these. The first step is to search for interactions of these RNAs with other RNAs (or with DNA or proteins). Searching for RNA-RNA interactions is in general a time-consuming task. As a first step we have developed an approach that searches only for near-complement interactions (ignoring intramolecular base pairs). We show that this approach is faster than existing methods while maintaining accuracy, and that the method can be used as a filter (on existing methods) for microRNA target search. In a case study on microRNAs, we combined target predictions (conserved in human and mouse) to protein-coding genes with literature mining and obtained a combined enrichment only for transcription factors (TFs); we subsequently found that TFs are also enriched for targeting microRNAs. Our results suggest a network of mutually activating and suppressive regulation.
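To make the idea of a near-complement search concrete, here is a minimal brute-force Python sketch. It slides a short query RNA along a target and reports windows that pair antiparallel with at most a given number of mismatches, ignoring intramolecular structure. It is not the method presented in the talk; the function name, the toy sequences, and the decision to count G-U wobble pairs as matches are assumptions of this example.

```python
# Watson-Crick pairs plus G-U wobble (treating wobble as a match is an assumption).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def near_complement_sites(query, target, max_mismatches=1):
    """Return (start, mismatches) for each target window that nearly pairs with query.

    Both sequences are written 5'->3'. The query is reversed so that its 3' end
    pairs with the 5' end of the target window (antiparallel duplex, no gaps).
    """
    reversed_query = query[::-1]
    width = len(reversed_query)
    hits = []
    for start in range(len(target) - width + 1):
        window = target[start:start + width]
        mismatches = sum((q, t) not in PAIRS for q, t in zip(reversed_query, window))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


# Toy sequences, not real transcripts.
print(near_complement_sites("AGCUUA", "CCCUAAGCUCCC"))
print(near_complement_sites("AGCUUA", "CCCUAAGGUCCC"))
```

A filter of this kind only proposes candidate sites; a full interaction model would still be needed to evaluate them.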

Ivo Gut, Centro Nacional de Análisis Genómico, C/Baldiri Reixac 4, 08028 Barcelona, Spain

High-resolution whole-genome analysis and cancer

The International Cancer Genome Consortium (ICGC) aims to exhaustively characterize 50 tumour/normal sample pairs in each of the 50 most common forms of cancer and then to validate observations in a further 450 samples. The first three years of this project have seen huge advances in the development, implementation and standardisation of the methods for characterising samples, ethical approval, whole-genome sequencing, exome sequencing, RNA sequencing, epigenetic analysis, methods for validation, informatics analysis and data basing.

The Spanish contribution to the ICGC is on Chronic Lymphocytic Leukaemia (CLL). Our main responsibility has been on whole genome, exome analysis, RNA sequence analysis and epigenetic analysis. Complete genome sequencing of many samples requires bringing together many different elements, starting from samples, preparation for sequencing, sequencing itself, data analysis, through to verification of results and translating a result into biological knowledge. Thorough examination of the first 4 tumour/normal pairs and follow-up in a large replication set allowed us to identify four recurrent mutations in the NOTCH1, XPO1, MYD88 and KLHL6 genes. In an extension we analysed 100 tumour/normal pairs by exome sequencing, which allowed the identification of further recurrent somatic mutations, the most frequent being in SF3B1 and POT1. Interestingly, the two recognised subtypes of CLL, immunoglobulin modulated and not, are not completely reflected in the recurrent mutations. The methods and findings will be discussed.

Paul Horton, CBRC

Excavating human NUMTs

NUMTs (Nuclear mtDNA) are partial copies of the mitochondrial genome found in the nuclear genome. They are sometimes referred to as molecular fossils and, due to the higher mutation rate of mtDNA, can in some cases be more similar to parts of our ancestral mtDNA than our extant mtDNA genome is. The existence of NUMTs has been known for decades, and many informatics studies on NUMTs have attempted to elucidate the characteristics of their insertion sites. By showing that NUMTs are typically very clean insertions with only minimal deletion or duplication of the surrounding nuclear DNA, these studies have led to a consensus opinion that most NUMTs are likely inserted as filler DNA via NHEJ (Non-Homologous End Joining). Previous informatics studies have not shed much light upon the preferred insertion sites of NUMTs. Most of them conclude that NUMT insertion is random, except for contradictory reports that NUMTs correlate positively, or negatively, with retrotransposons. Fortunately, by employing more careful methodology, we were able to discover several previously unknown aspects of this phenomenon. We found that inferred NUMT insertion sites strongly correlate with predicted physical properties of DNA (curvature and bendability) and A+T rich oligomers. Moreover, recently inserted NUMTs correlate strongly with nucleosome free regions as measured by DNase-seq and FAIRE-seq. We also firmly establish that NUMTs do indeed tend to co-occur with retrotransposons. As for the source mtDNA which is copied to create NUMTs, we find that part of the mtDNA D-loop region is very seldom copied. Relating these facts to concrete hypotheses regarding the mechanism of NUMT insertion proved very challenging, but also fascinating, as it touched upon diverse topics in molecular biology: from retrotransposon activity and DNA repair to evolutionary conservation of chromatin structure and the packaging of mtDNA.

REFERENCES Tsuji et al., under revision, NAR

Anders Krogh, Copenhagen University, Denmark

On the accuracy of short read mapping.

Next-generation DNA sequencing technologies produce huge amounts of DNA sequence reads. Often the initial bioinformatics task is to map these reads to a reference genome. For this, well-tested methods like BLAST are far too slow and next-generation bioinformatics tools are needed. Several new methods have been developed, some of which build on the Burrows-Wheeler index, an elegant indexing of the genome that facilitates fast searches in a small memory footprint. These methods are based on mapping the reads exactly apart from a few mismatches and indels. Most of them do not report any significance or probability that a match is actually correct. In this talk I will briefly review the field, give some general results for mapping accuracy, and suggest a more precise notion of uniqueness. I will also present a probabilistic approach to short read mapping, which uses quality scores to calculate mapping probabilities. This can improve mapping accuracy, in particular when mapping very short reads, such as small RNAs, various tag sequences, and ancient DNA. The effect on mapping performance will be illustrated using both simulated and actual DNA reads.
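One common way to turn quality scores into a mapping probability can be sketched in a few lines of Python. The sketch assumes Phred-scaled qualities (error probability 10^(-Q/10)), independent base errors, substitutions only, and a uniform prior over a given list of candidate positions. It illustrates the general idea and is not necessarily the approach presented in the talk; all names and the toy genome are hypothetical.

```python
def read_likelihood(read, quals, ref_segment):
    """P(read | it originated from ref_segment), substitutions only."""
    likelihood = 1.0
    for base, qual, ref_base in zip(read, quals, ref_segment):
        error = 10 ** (-qual / 10)           # Phred quality -> error probability
        # A match has probability 1 - error; a specific wrong base has error / 3.
        likelihood *= (1 - error) if base == ref_base else error / 3
    return likelihood


def mapping_posteriors(read, quals, genome, candidates):
    """Posterior probability of each candidate start, assuming a uniform prior."""
    likelihoods = [
        read_likelihood(read, quals, genome[start:start + len(read)])
        for start in candidates
    ]
    total = sum(likelihoods)
    return [lik / total for lik in likelihoods]


genome = "ACGTAAACGAAA"   # toy reference
read = "ACGA"
# Position 0 ("ACGT") differs from the read at its last base; position 6 matches.
low_q = mapping_posteriors(read, [30, 30, 30, 10], genome, [0, 6])
high_q = mapping_posteriors(read, [30, 30, 30, 40], genome, [0, 6])
print(low_q, high_q)
```

When the mismatching base has low quality, the alternative location keeps a noticeable share of the probability; with a high-quality base, the exact match dominates.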

Michael Levitt, Stanford, USA

Solving the Recalcitrant Crystal Structure of Group II Chaperonin TRiC/CCT by Mass Spectrometry and Sentinel Correlation Analysis

Eukaryotic group II Chaperonin TRiC or CCT is a 0.95 megadalton protein complex that is essential for the correct and efficient folding of cytosolic polypeptides. The closed form is a 16 nm sphere made of two hemi‐spherical rings of 8 subunits (~550 residues/subunit) that rotate to open a central folding chamber. In eukaryotes, 8 different genes encode the subunits of this ATP‐powered nanomachine. The high sequence identity of subunits made the 40,320 (=8 factorial) possible arrangements indistinguishable in previous cryo‐electron microscopy and crystallographic analysis. We solve this problem by independent studies on bovine and yeast TRiC chaperonin.

First we use cross‐linking, mass spectrometry and combinatorial homology modeling. We react bovine TRiC under native conditions with a lysine‐specific cross‐linker, follow up with trypsin digestion, and use mass spectrometry to identify 63 cross‐linked pairs providing distance restraints. Independently of the cross‐link set, we construct all 40,320 homology models of the TRiC particle. When we compared each model with the cross‐link set, we discovered that one model is significantly more compatible than any other model. Bootstrapping analysis confirms that this model is 10 times more likely to result from this cross‐link set than the next best‐fitting model.

Second, we re‐examine the 3.8 Å resolution X‐ray data of yeast TRiC. Our method of Sentinel Correlation Analysis (SCA) exhaustively tests all 2,580,460 possible models. This unbiased analysis singles out with overwhelming significance one model, which is fully consistent with our previous biochemical data and refines to a much lower Rfree value than reported previously with the same X‐ray data. With four‐fold averaging, our structure reveals remarkably resolved details of the unique conformation of each subunit, and suggests a mechanism for the initiation of transition to the open state. More generally, we expect SCA to resolve ambiguity in future low‐resolution crystallographic studies.
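The exhaustive enumeration of subunit arrangements described above can be illustrated with a toy Python sketch. It enumerates all 8! = 40,320 orderings of eight subunits in a ring and counts how many restraints each ordering satisfies. The subunit labels, the hidden "true" order and the restraints are invented for the example, and a restraint here simply means "subunit b sits immediately clockwise of subunit a"; the real analyses score three-dimensional models against cross-link distances and X-ray data, which this sketch does not attempt.

```python
from itertools import permutations
from math import factorial

SUBUNITS = "ABCDEFGH"            # placeholder labels for the 8 subunits
# Toy restraints derived from a hypothetical true ring order "ADBHEGCF".
TOY_CROSSLINKS = [("A", "D"), ("D", "B"), ("B", "H"), ("H", "E"),
                  ("E", "G"), ("G", "C"), ("C", "F")]


def satisfied(order, crosslinks):
    """Count restraints (a, b) where b is the clockwise neighbour of a."""
    n = len(order)
    position = {subunit: i for i, subunit in enumerate(order)}
    return sum((position[a] + 1) % n == position[b] for a, b in crosslinks)


def rank_arrangements(crosslinks):
    """Score every permutation and return (score, order) pairs, best first."""
    scored = [(satisfied(p, crosslinks), "".join(p)) for p in permutations(SUBUNITS)]
    scored.sort(reverse=True)
    return scored


def rotations(order):
    """All cyclic rotations of a ring order (the same ring, different start)."""
    return {order[i:] + order[:i] for i in range(len(order))}


scored = rank_arrangements(TOY_CROSSLINKS)
best_score = scored[0][0]
best_orders = {order for score, order in scored if score == best_score}
print(len(scored), factorial(8), best_score, sorted(best_orders)[0])
```

In this toy setting the top-scoring orderings are exactly the eight rotations of the hidden ring order, so the restraints single out one arrangement.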

Joakim Lundeberg, SciLifeLab, Sweden and The Spruce genome project

Sequencing and assembly of the largest and most complex genome to date - the Norway spruce (Picea abies)

Conifers are the dominant plant species in many ecosystems, including large areas in Sweden. Despite this, no conifer genome has yet been published, mainly owing to their large size and complexity. The lack of a genome sequence has hampered our understanding of conifer biology and evolution, as well as the development of potential novel breeding strategies of these economically important species.

We are currently performing whole genome sequencing and assembly of the 20 Gbp Norway spruce genome. This genome contains huge amounts of repeated elements, with an estimated gene density of only 1/500 kbp. In common with other tree genomes, heterozygosity is high, which further complicates the assembly process. The Spruce Genome Project is addressing questions of genome size, content and evolution, including analyses of gene families and repeats, and will establish Norway spruce as a prime model species for conifer research.

In this talk, we will present our main strategies concerning sequencing and assembly of this de novo genome, and give an update on the results obtained so far. In brief, we use a combination of whole genome shotgun and fosmid pool sequencing, followed by scaffolding and merging of the separate assemblies. This is complemented by a manually curated spruce-specific repeat library, sequencing of random fosmid clones for assembly benchmarking, as well as assemblies of the chloroplast and mitochondrial genomes.

Ed Marcotte, University of Texas Austin, US

Insights from proteomics into protein organization, evolution, and genetic disease

Lennart Martens, VIB, Gent, Belgium

Snakes and ladders: where do proteomics assays fail and how can we fix them?

Proteomics assays increasingly rely on two distinct and largely independent informatics processing steps: identification and quantification. Both processing steps can rely on a plethora of available algorithms and tools, but the maturity of these algorithms is quite distinct. Whereas identification is typically handled by venerable algorithms called search engines, which have been in use for many years, quantification algorithms are still continuously evolving to accommodate the increasing resolution and sensitivity of modern mass spectrometers. Despite this difference in maturity, both steps can be improved. Indeed, the performance of current quantitative workflows can be boosted by simply combining several of them into a single, joint analysis, making the most of the specific sensitivities of each of the algorithms used. On the other hand, the long-serving search engines have also reached crucial limits in terms of specificity, effectively preventing proteomics from reaching a central status in the life sciences. Fortunately, this inherent limitation of current search engines can be fixed by improving the way in which we use the measurements provided by the mass spectrometer. We will here discuss these developments, and highlight how both quantification and identification can be improved; the former by incremental advances, the latter by a more radical change in approach.

Jens Nielsen Department of Chemical and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden

Genome-Scale Metabolic Models: A Bridge between Bioinformatics and Systems Biology

We are currently working on building a Human Metabolic Atlas, a novel web-based database and modelling tool that can be used by medical and pharmaceutical researchers to analyse clinical data with the objectives of identifying biomarkers associated with disease development and improving health care. The central technology in the Human Metabolic Atlas is so-called genome-scale metabolic modelling (GEMs), which will be made tissue-specific by using different types of experimental data, e.g. from the Human Protein Atlas. These models allow for context-dependent analysis of clinical data, providing much more information than traditional statistical correlation analysis, and hence advance the identification of biomarkers from high-throughput experimental data that can be used for early diagnosis of metabolic related diseases. As part of the Human Metabolic Atlas we are developing GEMs for the gut microbiome. In this context we are using metagenomics for identification of different metabolic functions that are associated with human diseases. Here we are using metagenomics sequencing data from the gut microbiome of patients with different diseases, e.g. arteriosclerosis and type 2 diabetes. Through the combination of the bacterial GEMs and metagenomics data we have identified enriched metabolic functions in the microbiome, and based on this we point to novel prospective biomarkers for disease development. We are further integrating metagenomics information into predictive metabolic models that have the prospect for simulation of how the gut microbiome will respond to diet.

Raymond Stevens, The Scripps Research Institute, USA

Understanding Human G-protein Coupled Receptor Structural Diversity and Modularity

GPCRs constitute one of the largest protein families in the human genome and play essential roles in normal cell processes, most notably in cell signaling. The human GPCR family contains more than 800 members and recognizes thousands of different ligands and activates a number of signaling pathways through interactions with a small number of binding partners. GPCRs have also been implicated in numerous human diseases, and represent more than 40% of drug targets. Delivering GPCR structures in close collaboration with experts on specific receptor systems is of immense value to the basic science community interested in cell signaling and molecular recognition, as well as the applied science community interested in drug discovery. This work is being followed up with additional biophysical characterization including NMR spectroscopy, HDX mass spectrometry, medicinal chemistry and community wide assessments with computational biology groups throughout the world. Crystal structures are now available for rhodopsin, adrenergic, and adenosine receptors in both inactive and activated forms, as well as for chemokine, dopamine, histamine, S1P1, muscarinic, opioid receptors in inactive conformations. A review of the common structural features seen in these receptors will be discussed and the scope of structural diversity of GPCRs at different levels of homology provides insight into our growing understanding of the biology of GPCR action and their impact on drug discovery. Given the current set of GPCR structural data, a distinct modularity is now being observed between the extracellular (ligand-binding) and intracellular (signaling) regions. The rapidly expanding repertoire of GPCR structures provides a solid framework for experimental and molecular modeling studies, and helps to chart a roadmap for comprehensive structural coverage of the whole superfamily and an understanding of GPCR biological and therapeutic mechanisms. 
The long range goal is to understand GPCR molecular recognition and evolution in relation to human cognition.

This work was supported by NIGMS PSI:Biology for GPCR structure processing (U54GM094618) and the NIH Roadmap Initiative (JCIMPT) for technology development (GM073197).

Burkhard Rost, TU Munich

Evolution teaches protein prediction

The objective of our group is to predict aspects of protein function from sequence. The only reason why we can pursue such an ambitious goal is the wealth of evolutionary information available through the comparison of the whole bio-diversity of species. Many approaches have benefited substantially from using evolutionary information; for some of these methods learning from evolution made the difference between possible and impossible. In my talk I will present examples of methods that target the prediction of protein interactions, of protein disorder, and of the effect of single residue mutations upon and function.

Schiöth HB.

The origin of GPCRs, the largest family of membrane bound proteins

G protein-coupled receptors (GPCRs) are the largest superfamily among membrane bound proteins. The GPCRs in humans are classified into the five main families named Glutamate, Rhodopsin, Adhesion, Frizzled and Secretin according to the GRAFS classification. Several families of GPCRs show however no apparent sequence similarities to each other, and it has been debated which of them share a common origin. Mining of early vertebrates including lancelet (Branchiostoma floridae) and one of the most primitive animals, the cnidarian sea anemones (Nematostella vectensis) provided considerable evidence suggesting that the Adhesion family is ancestral to the peptide hormone binding Secretin family of GPCRs. We also used integrated and independent HHsearch, Needleman-Wunsch-based and motif analyses to determine the relationship of the other main families. We found strong evidence that the Adhesion and Frizzled families are children to the cyclic AMP (cAMP) family while the large Rhodopsin family is likely a child of the cAMP family. We suggest that the Adhesion and Frizzled families originated from the cAMP family in an event close to that which gave rise to the Rhodopsin family. We also found convincing evidence that the Rhodopsin family is parent to the important sensory families; Taste 2 and Vomeronasal type 1 as well as the Nematode chemoreceptor families. The insect odorant, gustatory, and Trehalose receptors, frequently referred to as GPCRs, form a separate cluster without relationship to the other families, and we propose, based on these and other results, that these families are ligand-gated ion channels rather than GPCRs. We suggest common descent of at least 97% of the GPCR sequences found in humans. Moreover, we provide the first evidence that four of the five main mammalian families of GPCRs, namely Rhodopsin, Adhesion, Glutamate and Frizzled, are present in Fungi.
The unicellular relatives of the Metazoan lineage, Salpingoeca rosetta and Capsaspora owczarzaki have a rich group of both the Adhesion and Glutamate families, which in particular provided insight to the early emergence of the N-terminal domains of the Adhesion family. Further mining of Dictyostelium discoideum suggests that the Glutamate family is as ancient as the cAMP receptor family. Together, these studies clarify the early evolutionary history of the GPCR superfamily and their emergence could be traced back at least 1400 MYA.

Gert Vriend, Radboud University Nijmegen Medical Centre, Netherlands

What can we (not yet) learn from 70 GPCR structures?

Headed by the next speaker, the crystallography community has cracked the GPCR crystallisation problem, and in the past years we have seen at least one new GPCR structure enter the PDB each month. These structures are in an active state, semi-active state, inactive state, or sometimes also an artefactual state. We have been comparing all available structures trying to average out the things done to make the GPCRs crystallize (mutation of crucial residues; adding llama antibodies; adding funny salts and lipids; cloning-in lysozyme). The sheer volume of data now allows us to extract the beginning of a coherent story about the activation of GPCRs. Not surprisingly, this story agrees more with basic laws of physics and thermodynamics, and less with the myriads of funny activation schemas that include distinct states like R, R*, etc., that have entered the literature over the years.

Martin Weigt, University of Sorbonne, France

From sequence variability to protein (complex) structure prediction

Many families of homologous proteins show a remarkable degree of structural and functional conservation, despite their large variability in amino acid sequences. We have developed a statistical-mechanics inspired inference approach to link this variability (easy to observe) to structure (hard to obtain), i.e. to infer directly co-evolving residue pairs which turn out to form native contacts in the folded protein with high accuracy. The gained information is used to guide tertiary and quaternary structure prediction. As a specific example, I will discuss the auto-phosphorylation complex of histidine kinases, which are involved in the majority of signal transduction systems in bacteria. Only a multidisciplinary approach integrating statistical genomics, biophysical protein simulation, and mutagenesis experiments allows us to predict and verify the so far unknown active kinase structure.
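As a rough illustration of how sequence variability can point to co-evolving residue pairs, the sketch below ranks column pairs of a toy alignment by mutual information. This is a deliberately simpler, swapped-in statistic and not the statistical-mechanics inference approach described in the abstract, which aims to separate direct from indirect couplings. The alignment and all function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    col_i = Counter(seq[i] for seq in msa)
    col_j = Counter(seq[j] for seq in msa)
    pairs = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((col_i[a] / n) * (col_j[b] / n)))
    return mi


def rank_column_pairs(msa):
    """Return all column pairs sorted by decreasing mutual information."""
    length = len(msa[0])
    scores = {
        (i, j): mutual_information(msa, i, j)
        for i, j in combinations(range(length), 2)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy alignment (hypothetical): columns 0 and 3 vary together,
# column 1 varies independently, column 2 is fully conserved.
toy_msa = ["AKLE", "ARLE", "GKLD", "GRLD"]

if __name__ == "__main__":
    for (i, j), score in rank_column_pairs(toy_msa):
        print(f"columns {i},{j}: MI = {score:.2f} bits")
```

On real alignments, raw mutual information is confounded by phylogeny and by indirect correlations, which is the motivation for more elaborate inference schemes such as the one presented in the talk.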

Eric Westhof Architecture et Réactivité de lʼARN, Université de Strasbourg, Institut de Biologie Moléculaire et Cellulaire, CNRS, 15 rue René Descartes, 67084 Strasbourg, France

The Detection of the Architectural Modules of RNA and Recent Progress in RNA Modelling

RNA architecture can be viewed as the hierarchical assembly of preformed double-stranded helices defined by Watson-Crick base pairs and RNA modules maintained by non-Watson-Crick base pairs. RNA modules are recurrent ensembles of ordered non-Watson-Crick base pairs. Such RNA modules constitute a signal for detecting noncoding RNAs with specific biological functions. It is, therefore, important to be able to recognize such elements within genomes. Through systematic comparisons between homologous sequences and x-ray structures, followed by automatic clustering, the whole range of sequence diversity in recurrent RNA modules has been characterized. These data permitted the construction of a computational pipeline for identifying known 3D structural modules in single and multiple RNA sequences in the absence of any other information. Any module can in principle be searched, but four can be searched automatically: the G-bulged loop, the Kink-turn, the C-loop and the tandem GA loop. The present pipeline can be used for RNA 2D structure refinement, 3D model assembly, and for searching and annotating structured RNAs in genomic data. Following the recent dramatic advances in tools aimed at RNA 3D modelling, a first, collective, blind experiment in RNA three-dimensional structure prediction has been performed. The goals are to assess the leading edge of RNA structure prediction techniques, compare existing methods and tools, and evaluate their relative strengths, weaknesses, and limitations in terms of sequence length and structural complexity. The results should give potential users insight into the suitability of available methods for different applications and help the RNA structure prediction community in its efforts to improve its tools.

Roman A. Zubarev Division of Physiological Chemistry I, Department of Medical Biochemistry and Biophysics, Karolinska Institutet, Scheeles väg 2, S-171 77 Stockholm, Sweden

Pathway Analysis in Expression Proteomics

Proteomics studies have revealed the unexpected plasticity and dynamic nature of the human proteome. The paradigm that the time evolution of a biological system can be described by abundance variation of relatively few “regulated” proteins has been shattered, being replaced by the growing understanding that the whole proteome is regulated, and virtually no protein remains unaffected when the system undergoes transition from one state to another.

This finding underlines the importance of systems biology analysis of expression proteomics data. Systems biology shifts the analytical focus from thousands of proteins to hundreds of signaling pathways, thus reducing the number of entities to be analyzed. Application of these methods required the development of novel systems biology tools, such as the pathway search engine (PSE [1-3]). These tools can only be effective when they are quantitative, i.e. predict not only the activated pathway, but also the relative degree of its activation. Introducing the quantitative aspect in systems biology is one of the greatest challenges this field is facing today, since the final goal of pathway analysis is the creation of a quantitative predictive model of the biological process under investigation.

1. Zubarev, R. A.; Nielsen, M. L.; Savitski, M. M.; Kel-Margoulis, O.; Wingender, E.; Kel, A. Identification of dominant signaling pathways from proteomics expression data, J. Proteomics, 2008, 1, 89-96.
2. Ståhl, S.; Fung, Y.M.E.; Adams, C. M.; Lengqvist, J.; Mörk, B.; Stenerlöw, B.; Lewensohn, R.; Lehtiö, J.; Zubarev, R. A.; Viktorsson, K. Proteomics and Pathway Analysis Identifies JNK-signaling as Critical for High-LET Radiation-induced Apoptosis in Non-Small Lung Cancer Cells, Mol. Cell Proteomics, 2009, 8, 1117-1129.
3. Marin-Vicente, C.; Zubarev, R. A. Search engine for proteomics, Fact or Fiction? G.I.T. Lab J, 2009, 11-12, 10-11.

Poster Number 1

ScalaLife – Scalable Software Services for Life Science

Rossen Apostolov, KTH

The Life Sciences have rapidly become one of the major beneficiaries of the European e-Infrastructures, placing a growing demand on the capabilities of simulation software and on the support services. The ScalaLife project has set out to address some of the specific problems associated with this growth, acting along two distinct and complementary directions.

On the one hand, the project is concerned with the discrepancy between the scalability advances made by e-Infrastructure projects such as PRACE/DEISA on large molecular systems and the reality of the typical Life Science simulation, which works predominantly with small-to-medium systems. Thus, ScalaLife is implementing new techniques for efficient small-system parallelisation, developing new hierarchical approaches (explicitly based on ensemble and high-throughput computing for new multi-core and streaming/GPU architectures) and establishing open software standards for data storage and exchange.

On the other hand, the project is committed to the long-term support of the Life Science users and communities, providing both training and expert advice. First, ScalaLife is documenting and developing training material for the new techniques and data storage formats implemented by the project. Second, the project has created a pilot for a cross-disciplinary Competence Centre, which enables the Life Science community to exploit the key European applications developed as part of the project as well as the existing European e-Infrastructures effectively.

By providing a training and support infrastructure and by developing an adequate framework and associated policies to foster collaboration, the Competence Centre establishes a long-term structure for the maintenance and optimisation of Life Science software.

The ScalaLife Competence Centre is welcoming developers of bioinformatics applications for partnership projects!

Poster Number 2

Bioinformatics2012Abstract

Prokaryotic and eukaryotic genomes each encode hundreds of membrane transporter proteins that play important roles in the cellular import and export of ions, small molecules or macromolecules. Therefore, the functional classification of membrane proteins is an important task in genome annotation. Experimental knowledge about transporter function has been compiled in databases such as TCDB, TransportDB, and Aramemnon. An important research question for membrane biology is whether two membrane transporters in organisms X and Y that show a certain sequence similarity will have the same function or not. Previous computational work in this area includes, e.g., the tool TransportTP (Xhao 2009) and work by Gromiha (2008). Prediction methods often include features such as sequence homology, enriched motifs, and amino acid properties. Interestingly, no study has so far critically analyzed the reliability margins of the individual features.

Here, we provide a benchmarking study of the transferability of functional classifications of membrane transporters between organisms. We have tested the method using the transporters of the two model organisms E. coli and Arabidopsis thaliana. 157 experimentally validated transporter sequences from E. coli were obtained from TransportDB and 156 such sequences from A. thaliana were obtained from the Aramemnon database. The statistical significance of sequence similarity between an input sequence and sequences in the training set was determined using the well-known tools BLAST and HMMER. The MEME program suite was used to identify enriched motifs in different transporter families. Later, the MAST program from the MEME suite provided a score for statistically significant motifs identified in the unknown sequence. If all 3 approaches (BLAST, HMMER, MEME) assigned membership to the same TC family, this was considered a high confidence annotation.
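The high-confidence rule described above (all three approaches must assign the same TC family) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the `Hit` record, the function name, the example TC family identifiers and the default E-value cut-offs are assumptions chosen for the example, and in practice the cut-offs would be tuned per benchmark subset.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Illustrative per-method E-value cut-offs (assumed defaults, to be tuned).
DEFAULT_THRESHOLDS = {"BLAST": 1e-4, "HMMER": 1e-3, "MEME": 1e-8}


@dataclass(frozen=True)
class Hit:
    """Best hit reported by one method for a query sequence (hypothetical record)."""
    tc_family: str
    e_value: float


def consensus_family(
    hits: Dict[str, Hit],
    thresholds: Dict[str, float] = DEFAULT_THRESHOLDS,
) -> Optional[str]:
    """Return the TC family only if every method has a significant hit
    and all methods agree on the family; otherwise return None."""
    families = set()
    for method, cutoff in thresholds.items():
        hit = hits.get(method)
        if hit is None or hit.e_value > cutoff:
            return None  # missing or non-significant evidence from this method
        families.add(hit.tc_family)
    return families.pop() if len(families) == 1 else None


if __name__ == "__main__":
    query_hits = {
        "BLAST": Hit("2.A.1", 1e-30),
        "HMMER": Hit("2.A.1", 1e-12),
        "MEME": Hit("2.A.1", 1e-10),
    }
    print(consensus_family(query_hits))
```

Requiring agreement between independent lines of evidence trades sensitivity for specificity: a query that only two of the three methods support is left unannotated rather than assigned with lower confidence.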

We tested at which E-value annotations could be reliably transferred between E. coli and A. thaliana. For this purpose we created subsets according to (1) TC families, (2) substrate annotations and (3) substrates split into TC families. According to the TC system, transporters of the two organisms are annotated to 47 different TC families (E. coli) and 29 (A. thaliana). 14 TC families are shared and could be used for testing. Concerning the first subset, E-values of 10^-4, 10^-3 and 10^-8 were identified as reliable thresholds for the three classifiers BLAST, HMMER and MEME, respectively. Different thresholds were discovered for the other subsets. To the best of our knowledge, these results provide the first benchmarking study for the transfer of functional annotations for the important class of membrane transporter proteins.

Poster Number 3

Comprehensive Analysis of Antibiotic Resistance Genes in River Sediment, Well Water and Soil Microbial Communities Using Metagenomic DNA Sequencing

Johan Bengtsson*, Fredrik Boulund^, Erik Kristiansson^, DG Joakim Larsson*

* Dept. of Neuroscience and Physiology, University of Gothenburg, Sweden ^ Dept. of Mathematical Statistics, Chalmers University of Technology, Sweden

The development and spread of antibiotic resistance across the globe has emerged as one of the most immense health problems in modern time, further accentuated by the slow pace of development of antibiotics with new functional mechanisms. While the role of antibiotics use and abuse in resistance development has been extensively investigated, examination of the impact of environmental antibiotic pollution in promoting emergence and dissemination of resistance genes has been limited. We have shown that the selection pressure of antimicrobial agents can be exceptionally high in environments contaminated by wastewater from antibiotic manufacturing facilities, creating the kind of extreme conditions that likely could drive mobilization of resistance genes.

We have compared bacterial genes within microbial communities from river sediments upstream and downstream of a treatment plant in India that receives wastewater from the pharmaceutical industry and releases effluent containing high concentrations of several antibiotics into a small river. We have previously characterized these metagenomes using 454 pyrosequencing; however, to get a more thorough view of the community composition and the resistance gene content, we have now sequenced the same communities using high-throughput Illumina sequencing. In addition, we have sampled soil from nearby farmland, as well as water from wells in villages affected by antibiotic pollution. From the DNA extracted from these microbial communities, we have generated more than 650 million paired-end reads, corresponding to between 15 and 20 million pairs of reads per sample.

In these data, we can identify a wide range of resistance gene types. Preliminary analysis of the resistance gene content reveals clear differences in abundances between upstream and downstream samples; for example, the sul2 and sul3 genes are much more commonly encountered downstream from the treatment plant. In addition, in a nearby lake polluted by dumping of industrial waste, we find further deviations from the resistance gene pattern of the river communities, with for example higher abundance of sul1. The preliminary data also indicate that there are substantial differences in the prevalence of antibiotic resistance genes between bacterial communities from different wells.

Utilizing short-read sequencing technologies opens up broader screening for antibiotic resistance genes in various environments, as the vast number of reads generated by e.g. Illumina sequencing allows for far deeper studies than the fairly limited pyrosequencing approaches. Thus, we are able to search also for relatively rare types of resistance genes. However, some caution should be exercised, as the complexity of the sampled community may be too large to generate sufficiently long stretches of DNA to accurately identify and classify resistance genes and mobile genetic elements. Nevertheless, the material investigated allows more precise studies of the effect on resistance promotion in microbial communities, and consequently of the risks for further dissemination to human pathogens as a result of antibiotic pollution from manufacturing sites.

Poster Number 4

Searching metagenomes to identify and discover mobile fluoroquinolone antibiotic resistance genes using hidden Markov models

Fredrik Boulund1, Anna Johnning2, Mariana B. Pereira1, Joakim D.G. Larsson2, Erik Kristiansson1

1Department of Mathematical Sciences, Chalmers University of Technology and University of Gothenburg, SE-412 96 Göteborg, Sweden, 2Department of Neuroscience and Physiology, the Sahlgrenska Academy at the University of Gothenburg, Box 434, SE-405 30 Göteborg, Sweden

Antibiotics are one of our most powerful tools for treating bacterial infections and have since their introduction vastly improved human health and drastically reduced mortality rates. However, the growing use of antibiotics has brought increased resistance in pathogens. Bacteria can acquire resistance either through chromosomal mutations or via horizontal transfer of antibiotic resistance genes. It is believed that there exists a vast and unexplored environmental library of mobile resistance genes called the resistome. Many antibiotics are derived from compounds produced by organisms in the environment and bacteria have therefore developed natural protection mechanisms against such substances. Not surprisingly, it has been shown that several of the clinically important resistance genes originate from the environment. Fluoroquinolones are a family of widely used broad-spectrum antibiotics of synthetic origin, thus lacking any known natural production system. Consequently, it was originally believed that they would lack any natural resistance mechanisms. However, a class of mobile fluoroquinolone resistance genes called qnr was recently discovered. There are currently five known subclasses of plasmid-mediated qnr genes, with the last novel subclass discovered as late as 2009. It is unknown whether more subclasses exist in the environmental resistome. The Qnr proteins are pentapeptide repeat proteins that display a repeating pattern of five amino acid residues. Based on this distinctive sequence feature we created a hidden Markov model from the sequences of all currently known plasmid-mediated subclasses and variants. To enable identification of novel qnr-like gene variants or subclasses, we developed a classifier to discriminate between putative novel qnr sequence fragments and non-qnr fragments in metagenomic data.
Evaluation of the model’s performance showed that the statistical power for correctly classifying fragments from a novel class of qnr genes was more than 94% for input sequences as short as 100 nucleotides. We applied the model to several large datasets containing both annotated (e.g. NCBI GenBank) and metagenomic sequences produced with high-throughput sequencing technologies (e.g. CAMERA, Meta-HIT). Using our method, we were able to identify all previously known qnr genes, as well as several putative novel variants. In addition, we discovered several sequences in the annotated data sources where we could correct and improve annotation.

Poster Number 5

Two-site mechanism for the allosteric modulation of pentameric ligand gated ion channels by anesthetics and alcohols

Torben Broemstrup (a,c), Rebecca Howard (d), Samuel Murail (c), James Trudell (e), Adrian Harris (d), Eric Lindahl (a,b)

(a) Center for Biomembrane Research, Department of Biochemistry and Biophysics, Stockholm University, SE-10961 Stockholm, Sweden
(b) Theoretical and Computational Biophysics, Kungliga Tekniska högskolan Royal Institute of Technology, SE-10691 Stockholm, Sweden
(c) Institut Pasteur, Groupe Récepteurs-Canaux, and Centre National de la Recherche Scientifique, Unité de Recherche Associée 2182, F-75015 Paris, France
(d) Waggoner Center for Alcohol and Addiction Research, The University of Texas at Austin, Austin, Texas, United States of America
(e) Department of Anesthesia and Beckman Program for Molecular and Genetic Medicine, Stanford University School of Medicine, Stanford, United States of America

Pentameric ligand-gated ion channels (pLGICs) of the Cys-loop family mediate fast chemo-electrical transduction. General anesthetics and n-alcohols alter nerve signaling by interacting with pLGICs. Despite mutagenesis and labeling studies, the relevant anesthetic binding sites remain controversial, as modeling studies have proposed diverse intrasubunit and intersubunit binding sites. The recent determination of the crystal structure of GLIC, a prokaryotic member of the pLGIC family, enables structural studies to characterize the anesthetic and alcohol binding sites. Although it comes from a lower organism, GLIC reproduces the bimodal n-alcohol modulation of eukaryotic channels: methanol and ethanol potentiate the channel, whereas longer n-alcohols inhibit it.

Site-directed mutagenesis studies and a chimera between GLIC and the human glycine receptor identified the transmembrane domain as the alcohol binding location. A single mutation in GLIC was identified that turns the volatile anesthetics desflurane and chloroform from inhibitors into activators. Furthermore, this mutation increases ethanol potentiation and extends n-alcohol potentiation up to hexanol, while longer-chain alcohols still inhibit the channel; in the wild type, only methanol and ethanol are potentiating. To explain the increased potentiation of the GLIC mutant, the exact interaction sites of general anesthetics and n-alcohols need to be characterized and the binding to the different sites needs to be quantified.

To this end we apply atomistic MD simulations and the Free Energy Perturbation method (FEP) to get binding free energies for desflurane and chloroform as well as n-alcohols in the intra- and intermolecular binding sites of GLIC. Our results demonstrate two independent binding sites for alcohols and anesthetics in GLIC, an inhibitory intrasubunit site and a potentiating intersubunit site. For example, the free energies of binding show that the wild-type inhibition by desflurane correlates with superior intrasubunit binding of desflurane (intra: -21.8 ± 0.3 kJ/mol versus inter: -14.4 ± 0.6 kJ/mol), while the potentiation-enhancing mutation makes desflurane intersubunit binding superior to intrasubunit binding (intra: -19.7 ± 0.4 kJ/mol versus inter: -23.2 ± 0.5 kJ/mol). Similarly, binding affinities of n-alcohols are increased in the intersubunit site by the mutation, correlating with the increased potentiation of GLIC by n-alcohols.
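The comparison of the two sites can be made explicit with a few lines of arithmetic. The sketch below takes the desflurane binding free energies quoted above and computes, for each channel variant, the free energy difference between the sites and the corresponding ratio of binding constants, exp(-ΔΔG/RT). The temperature of 300 K and all names are assumptions made for illustration; the reported uncertainties are ignored.

```python
import math

R_KJ_PER_MOL_K = 0.0083144626  # gas constant in kJ/(mol*K)

# Desflurane binding free energies from the abstract, in kJ/mol.
BINDING_FREE_ENERGIES = {
    "wild-type": {"intra": -21.8, "inter": -14.4},
    "mutant": {"intra": -19.7, "inter": -23.2},
}


def preferred_site(energies):
    """Return the site with the most favourable (lowest) binding free energy."""
    return min(energies, key=energies.get)


def site_preference(energies, temperature_k=300.0):
    """Return (ddG, ratio), where ddG = dG_inter - dG_intra in kJ/mol and
    ratio = K_inter / K_intra = exp(-ddG / RT).

    The temperature is an assumed value, not one stated in the abstract.
    """
    ddg = energies["inter"] - energies["intra"]
    ratio = math.exp(-ddg / (R_KJ_PER_MOL_K * temperature_k))
    return ddg, ratio


if __name__ == "__main__":
    for variant, energies in BINDING_FREE_ENERGIES.items():
        ddg, ratio = site_preference(energies)
        print(f"{variant}: prefers {preferred_site(energies)} site, "
              f"ddG = {ddg:+.1f} kJ/mol, K_inter/K_intra = {ratio:.3g}")
```

A ratio below one means the inhibitory intrasubunit site binds more strongly, and a ratio above one means the potentiating intersubunit site does, which matches the switch from inhibition to potentiation described for the mutant.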

In conclusion, we present a two-site model for the modulation of pLGICs with an inhibitory intrasubunit site and a potentiating intersubunit site. Computational prediction of the binding affinities gives quantitative support for the two-site model, demonstrating that differential binding to the two sites results in differential modulation of pLGICs.

Poster Number 6

If There’s an Order in All of This Disorder…: Structural Bioinformatics of the Human Spliceosomal Proteome

Iga Korneta1, Marcin Magnus1, Janusz M. Bujnicki1,2,*

1 Laboratory of Bioinformatics and Protein Engineering, International Institute of Molecular and Cell Biology, Warsaw, PL-02-109, Poland 2 Bioinformatics Laboratory, Institute of Molecular Biology and Biotechnology, Faculty of Biology, Adam Mickiewicz University, Poznań, PL-61-614 , Poland * [email protected]

The spliceosome is one of the largest molecular machines known. It performs the excision of introns from eukaryotic pre-mRNAs. In human cells it comprises five RNAs, over one hundred “core” proteins and more than one hundred additional associated proteins. The details of the spliceosome mechanism of action are unclear, because only a small fraction of spliceosomal proteins have been characterized structurally in high resolution. To aid structural and functional analyses of the spliceosomal proteins and complexes, and to provide a starting point for multiscale modeling, we carried out a comprehensive structural bioinformatics analysis of the entire spliceosomal proteome. First, we discovered that almost a half of the combined sequence of proteins abundant in the spliceosome is predicted to be intrinsically disordered, at least when the individual proteins are considered in isolation. The distribution of intrinsic order and disorder throughout the spliceosome is uneven, and is related to the various functions performed by the intrinsic disorder of the spliceosomal proteins in the complex. In particular, proteins involved in the secondary functions of the spliceosome, such as mRNA recognition, intron/exon definition and spliceosomal assembly and dynamics, are more disordered than proteins directly involved in assisting splicing catalysis. Conserved disordered regions in splicing proteins are evolutionarily younger and less widespread than ordered domains of essential splicing proteins at the core of the spliceosome, suggesting that disordered regions were added to a preexistent ordered functional core. The spliceosomal proteome contains a much higher amount of intrinsic disorder predicted to lack secondary structure than the proteome of the ribosome, another large RNP machine. This result agrees with the currently recognized different functions of proteins in these two complexes. 
For the ordered part of the spliceosomal proteome, we have carried out protein structure prediction. We identified new domains in spliceosomal proteins and predicted 3D folds for many previously known domains. We also established a non-redundant set of experimental models of spliceosomal proteins, as well as constructed in silico models for regions without an experimental structure. Altogether, over 90% of the ordered regions of the spliceosomal proteome can be represented structurally with a high degree of confidence. The combined set of structural models for the entire spliceosomal proteome is available for download from the SpliProt3D database (http://iimcb.genesilico.pl/SpliProt3D). Finally, we analyzed the reduced spliceosomal proteome of the intron-poor organism Giardia lamblia, and as a result, we proposed a candidate set of ordered structural regions necessary for a functional spliceosome. The results of this work enable multiscale modeling of the structure and dynamics of the entire spliceosome and its subcomplexes and will have a profound impact on the understanding of the molecular mechanism of mRNA splicing.

Poster Number 7

COMPREHENSIVE ANALYSIS OF UNIDENTIFIED LC-MS FEATURES FOR INVESTIGATING PROTEINS DIVERSITY IN HIGH-THROUGHPUT PROTEOMICS EXPERIMENTS

A.L. Chernobrovkin*, V.G. Zgoda, A.V. Lisitsa and A.I. Archakov
Institute of Biomedical Chemistry RAMS, Moscow, Russia
*Corresponding author

Key words: single amino-acid polymorphisms; LC-MS; protein identification

Motivation and Aim More than 65 thousand nsSNPs are known to exist in the human genome, and more than 20% of them are associated with diseases. However, the vast majority of annotated nsSNPs have not yet been observed at the protein level. Investigating disease-related nsSNPs at the protein level can shed light on the molecular nature of diseases and provide additional information for molecular biomarker discovery.

Methods and Algorithms According to a recent estimate, only small proteomes can be analyzed properly using high-accuracy LC-MS without MS/MS for peptide identification [1]. Within the human proteome, only 20% of peptides can be properly identified using accurate parent mass and retention time data alone. Here we propose a new strategy for the analysis of unidentified LC-MS features, which significantly increases the sequence coverage of proteins identified using MS/MS data and reveals protein variants caused by translation of non-synonymous nucleotide polymorphisms. The method uses accurate m/z and retention time data to assign theoretical peptides of proteins identified by MS/MS to the unidentified LC-MS features. As an additional resource for resolving ambiguity in feature annotation, we use quantitative data on protein abundance changes during cell differentiation.

Results There were 1370 proteins identified in HL60 cells using LC-MS/MS (LTQ Orbitrap Velos, Thermo Scientific) analysis of tryptically digested cell lysates. Quantitative analysis was performed using Progenesis LC-MS software and revealed 300 proteins whose abundance changed more than 3-fold during cell differentiation. LC-MS chromatograms were reanalyzed to select those features that could be matched to the tryptic peptides of the selected proteins and their variants. This procedure gives a two- to three-fold increase in the sequence coverage of the selected proteins.
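
The assignment of theoretical peptides to unidentified features by accurate m/z and retention time can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 5 ppm mass tolerance and the 2-minute retention time window are assumptions chosen for illustration only.

```python
PROTON_MASS = 1.007276  # mass of a proton in Da


def mz_from_mass(neutral_mass, charge):
    """m/z of a peptide with the given neutral monoisotopic mass and charge."""
    return (neutral_mass + charge * PROTON_MASS) / charge


def ppm_error(observed_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def match_features(features, peptides, ppm_tol=5.0, rt_tol=2.0):
    """Assign theoretical peptides to LC-MS features.

    features: list of (mz, charge, retention_time)
    peptides: list of (name, neutral_mass, predicted_retention_time)
    Returns a list of (feature_index, peptide_name) pairs.
    The tolerances are hypothetical defaults, not values from the abstract.
    """
    matches = []
    for idx, (mz, charge, rt) in enumerate(features):
        for name, mass, predicted_rt in peptides:
            mass_ok = abs(ppm_error(mz, mz_from_mass(mass, charge))) <= ppm_tol
            rt_ok = abs(rt - predicted_rt) <= rt_tol
            if mass_ok and rt_ok:
                matches.append((idx, name))
    return matches
```

A feature that matches more than one candidate peptide remains ambiguous in this sketch; the abstract resolves such cases with quantitative abundance data, which is not modeled here.
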
Additionally, we observed 38 features that match 17 SAP-specific proteotypic peptides of identified proteins.

Conclusion The proposed approach makes it possible to decrease the number of unassigned features in LC-MS based proteomics experiments. Assigning additional features to previously identified proteins increases protein sequence coverage and reveals variant-specific proteotypic peptides.

References
1. P. Bochet et al. (2010) Fragmentation-free LC-MS can identify hundreds of proteins, Proteomics, 11(1): 22-32.

Poster Number 8

Structural bioinformatics analysis of pre-mRNA editing complex in Trypanosoma brucei

Anna Czerwoniec1, Joanna Kasprzak1, Patrycja Bytner1, Janusz M. Bujnicki1,2

1 Bioinformatics Laboratory, Institute of Molecular Biology and Biotechnology, Adam Mickiewicz University, Umultowska 89, PL-61-614 Poznan, Poland 2 Laboratory of Bioinformatics and Protein Engineering, International Institute of Molecular and Cell Biology, Ks. Trojdena 4, PL-02-190 Warsaw, Poland


Key words structural bioinformatics, pre-mRNA editing complex, Trypanosoma brucei

Abstract Mitochondrial pre-mRNA in trypanosomatid kinetoplastids undergoes an editing process to become a translatable molecule. Insertion and deletion of uridine nucleotides are catalyzed by up to 20 proteins acting in a series of catalytic steps. Despite intensive research on editing complexes, their complete structure and the interactions between their components remain unknown. Here we present a structural analysis of the ~20S pre-mRNA editing complexes of Trypanosoma brucei. We built homology models for components of the ~20S complexes and gathered information about disordered regions of the proteins and about macromolecular interactions between individual elements and within whole editing complexes. We then used PyRy3D, software developed in our group, to build and visualize very low-resolution 3D models of large macromolecular complexes fitted into density maps. The procedure represents components as experimental structures (e.g. X-ray or NMR models), structural models (e.g. homology models) or flexible shapes, and applies a Monte Carlo approach to find solutions fulfilling experimental restraints. All generated models were clustered, scored and ranked, and the best complexes are presented. The results provide information about macromolecular interactions in pre-mRNA editing complexes.

Acknowledgments This analysis was funded by the Polish Ministry of Science and Higher Education (grant to AC - number 0083/IP1/2011/71, grant to JK - N N301 123138).

Poster Number 9

The Triform algorithm: improved sensitivity and specificity in ChIP-Seq peak finding

Karl Kornacker1, Morten Beck Rye2, Tony Håndstad2, and Finn Drabløs2

1Division of Sensory Biophysics, Ohio State University, Columbus, Ohio, USA 2Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology (NTNU), NO-7491 Trondheim, Norway

Chromatin immunoprecipitation combined with high-throughput sequencing (ChIP-Seq) is the most frequently used method to identify the binding sites of transcription factors. Active binding sites can be seen as peaks in enrichment profiles when the sequencing reads are mapped to a reference genome. However, the profiles are normally noisy, making it challenging to identify all significantly enriched regions in a reliable way, and with an acceptable false discovery rate.

We have developed the Triform algorithm, an improved approach to automatic peak finding in ChIP-Seq enrichment profiles for transcription factors. The method uses model-free statistics to identify peak-like distributions of sequencing reads, taking advantage of improved peak definition in combination with known characteristics of ChIP-Seq data. The statistical test in Triform is fully nonparametric, i.e. free from any assumed relationships or fitted parameters. In particular, the test is free from any assumed background model and is therefore more robust than model-based tests, which depend on locally uniform background models and fitted background parameters.

Triform outperforms several existing methods (i.e. MACS, Meta, QuEST, PeakRanger, PICS, FindPeaks, and TPic) in the identification of representative peak profiles in curated benchmark data sets for the transcription factors NRSF/REST, SRF and MAX [1]. We also show that Triform in many cases is able to identify peaks that are more consistent with biological function, compared with other methods. In particular, we test for properties that are significantly associated with peak regions identified by Triform, MACS, Meta, QuEST, PeakRanger and TPic, using statistical overrepresentation analysis. Finally, we show that Triform can be used to generate novel information on transcription factor binding in repeat regions, which represents a particular challenge in many ChIP-Seq experiments.

1. Rye MB, Saetrom P, Drablos F: A manually curated ChIP-seq benchmark demonstrates room for improvement in current peak-finder programs. Nucleic Acids Res 2011, 39(4):e25.

Poster Number 10

HAMP domains: implications for transmembrane signal transduction

Stanisław Dunin-Horkawicz (a,b), Andrei Lupas (b)

(a) International Institute of Molecular and Cell Biology, Warsaw, Poland
(b) Max Planck Institute for Developmental Biology, Tuebingen, Germany

Homodimeric receptors with one or two transmembrane (TM) segments per monomer are universal to life and represent the largest and most diverse group of cellular TM receptors. They frequently share domain types across phyla and, in some cases, have been recombined experimentally into functional chimeras (e.g., the bacterial aspartate chemoreceptor with the human insulin receptor), suggesting that they have a common mechanism. We have proposed a model for transduction mechanism by axial helix rotation, based on the structure of a widespread domain, HAMP, that frequently occurs in direct continuation of the last TM segment. Here we show by statistical analysis that HAMP domain sequences have biophysical properties compatible with the two conformations proposed by the model. The analysis also identifies networks of coevolving residues, which allow the mechanism to subdivide into individual steps. The most extended of these networks is specific for membrane-bound HAMP domains and most likely accepts the signal from the TM helices. In a classification based on sequence clustering, these HAMPs form a central supercluster, surrounded by smaller clusters of divergent HAMPs, which typically combine into arrays of up to 31 consecutive copies and accept conformational input from other HAMP domains.

Poster Number 11

Allele specific expression changes after induction of inflammation

Recent advances in RNA and DNA sequencing technology have enabled a more detailed picture of gene expression and genomic differences to emerge. One particularly interesting aspect is the difference in expression between the two alleles of a gene within a single individual, one inherited from the mother and one from the father. Any such allele-specific expression (ASE) could indicate an allele-specific cis-acting genetic factor. ASE thereby provides an efficient means to explore the functional effects of genomic variation and can help in identifying functional variants in the extensive conserved non-coding part of the genome. In this study we assessed ASE in human white blood cells with and without treatment with the immune-inducing chemical LPS by performing RNA-seq on several individuals. This allowed us to study ASE of transcripts that are potentially of special importance in inflammation. Further, to find candidate haplotypes responsible for observed allelic differences, we conducted whole-genome genotyping of the RNA source subjects. Preliminary results indicate that about 5% of all genes show ASE. Searching for variants where a change in allele specificity was induced by the treatment, we detected a total of 117 unique significant variants among all individuals, of which ten were found in two or more individuals. To our knowledge, ASE analysis coupled with differential expression analysis of inflammation-induced cells has not previously been done.

Poster Number 12

Statistical assessment of gene group crosstalk enrichment in networks

Oliver Frings1,2, Theodore McCormack1,2, Andrey Alexeyenko1,3, Erik L.L. Sonnhammer1,2,4

1Stockholm Bioinformatics Centre, Science for Life Laboratory, Box 1031, SE-17121 Solna, Sweden. 2Department of Biochemistry and Biophysics, Stockholm University 3School of Biotechnology, Royal Institute of Technology 4Swedish eScience Research Center

Abstract

Motivation Analyzing groups of functionally coupled genes or proteins in the context of global interaction networks has become an important part of bioinformatic analysis. Typically, one wants to analyze the crosstalk, that is, the extent of connectivity between or within functional groups. However, this is only meaningful if statistical significance of the measured crosstalk enrichment is assessed.

Results We present CrossTalkZ, a statistical method and software to assess the significance of crosstalk enrichment between pairs of gene or protein groups in large undirected biological networks. We demonstrate that the standard z-score is generally an appropriate and unbiased statistic. We further evaluate the ability of four different methods to recover crosstalk within known biological pathways and estimate the confidence of the findings. We conclude that the methods preserving second-order topological properties perform the best for crosstalk analyses.
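
The idea of scoring crosstalk with a z-score against randomized networks can be sketched as follows. This is not the CrossTalkZ code (which is implemented in C++): the function names are hypothetical, and the null model shown is a plain degree-preserving edge swap, which is simpler than the second-order-preserving randomizations the abstract favours.

```python
import random
import statistics


def count_links(edges, group_a, group_b):
    """Number of edges with one endpoint in group_a and the other in group_b."""
    a, b = set(group_a), set(group_b)
    return sum(1 for u, v in edges if (u in a and v in b) or (u in b and v in a))


def rewire(edges, n_swaps, rng):
    """Randomize an undirected simple graph while keeping every node degree."""
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if len({a, b, c, d}) < 4:
            continue  # swap would create a self-loop
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in present or new2 in present:
            continue  # swap would create a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edges[i], edges[j] = (a, d), (c, b)
    return edges


def crosstalk_z(edges, group_a, group_b, n_random=200, seed=0):
    """Return (observed links, mean under the null, z-score)."""
    rng = random.Random(seed)
    observed = count_links(edges, group_a, group_b)
    null = [
        count_links(rewire(edges, 10 * len(edges), rng), group_a, group_b)
        for _ in range(n_random)
    ]
    mu = statistics.mean(null)
    sigma = statistics.pstdev(null)
    z = (observed - mu) / sigma if sigma > 0 else float("nan")
    return observed, mu, z
```

The z-score is only interpretable if the null distribution is roughly normal, which is why the software described here also reports a test of normality.
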

Availability and Implementation CrossTalkZ (available at http://sonnhammer.sbc.su.se/download/software/CrossTalkZ/) is implemented in C++ and is fast, accepts various input file formats, and produces a number of statistics. These include z-score, p-value, false discovery rate, and a test of normality for the random distribution.

Poster Number 13

Associate Professor David E. Gloriam
University of Copenhagen, Department of Drug Design and Pharmacology, Universitetsparken 2, 2100 Copenhagen

Chemogenomic Discovery of Allosteric Antagonists at the GPRC6A Receptor

We have integrated chemogenomic ligand inference, homology modeling, compound synthesis, and pharmacological mechanism-of-action studies to discover the most selective GPRC6A allosteric antagonists reported to date (1). GPRC6A is a Family C G protein-coupled receptor recently discovered and deorphanized by the Bräuner-Osborne group at the University of Copenhagen. Three compounds with at least ~3-fold selectivity for GPRC6A were discovered, which represent a significant step forward compared with the previously published GPRC6A antagonists, calindol and NPS 2143, which instead are ~30-fold selective for the calcium-sensing receptor. The antagonists constitute novel research tools for investigating the signaling mechanism of the GPRC6A receptor at the cellular level and serve as initial ligands for further optimization of potency and selectivity, enabling future ex vivo/in vivo pharmacological studies.

Our chemogenomic lead identification is, to our knowledge, the first ligand inference between two different GPCR families, Families A and C. The unprecedented inference of pharmacological activity across GPCR families provides proof-of-concept for in silico approaches against Family C targets based on Family A templates, greatly expanding the prospects of successful drug design and discovery. Furthermore, ongoing work on the application of the chemogenomic method to a large number of orphan receptors and drug targets will be described. Finally, a novel bioinformatic method for the identification of endogenous peptide ligands will be presented.

(1) Gloriam, D. E. et al., Chem. Biol. 2011, 11, 1489-1498.

Poster Number 14

CCdeep - a bioinformatics tool for chromatin conformation data analysis.

David Gomez-Cabrero1,2, Davide Cittaro3, Deborah Farmer4, Alejandro Woodbridge5, Jesper Tegnér1, Elia Stupka3

1Unit of Computational Medicine, Center for Molecular Medicine, Department of Medicine, Karolinska University Hospital, Solna, Sweden. 2BILS, Bioinformatics Infrastructure for Life Sciences, Sweden 3San Raffaele Scientific Institute, Center for Translational Genomics and Bioinformatics, Italy 4Division of Infection and Immunity, University College of London, United Kingdom 5Department of Microbiology, Tumor and Cell Biology, Karolinska Institutet, Solna, Sweden

Next Generation Sequencing (NGS) technologies make it possible to address biological questions at a new level of resolution, resulting in large volumes of data which, as a rule, require sophisticated yet manageable bioinformatics pipelines for their analysis and understanding. One recent example is the analysis of Chromatin Conformation (CC) data capturing the chromosome structure and chromosome-chromosome interactions. Recent experimental identification of CC has been performed by what is referred to as the Chromosome Conformation Capture (3C) technique. Here the contact probabilities between specific loci are quantified, and recent developments allow the quantification of multiple loci, thus setting the stage for genome-wide mapping of pairwise contacts. However, the analysis of the data provided by 3C or Hi-C faces two major challenges. First, the pre-processing and normalization of the data is key, since the experimental procedures have inherent biases and experimental artifacts; even though experimental methods are under development, proper methods for pre-processing and normalization are urgently needed. The second challenge is how to generate biological conclusions from the data. To illustrate, in Hi-C experiments the "number of reads" in a sequencing sample and the "number of possible chromosomal interactions" are of approximately the same order of magnitude for human samples, and furthermore the mean and median number of reads per possible interaction is not larger than one. Hence, distinguishing signal from noise under these conditions is difficult. Finally, to the best of our knowledge there is not yet any published software allowing researchers to manipulate and analyze their own CC data. We have therefore designed CCdeep, a software package written in C++ that incorporates tools for pre-processing, normalization and bioinformatics analysis of CC data.
The tool supports four stages of analysis. The first stage filters the reads and identifies CC loci of low quality (judged, for instance, by mapping quality for reads and by mappability for CC loci). In the second stage CCdeep summarizes the reads into locus-to-locus interactions by assigning reads to loci. The third stage implements three algorithms: (a) a normalization algorithm (similar to (4)), (b) an interaction identification algorithm and (c) demarcation of physical domains (see (3)). A fourth stage allows the integration of the CC data with other data types such as ChIP-Seq data. CCdeep allows researchers to define their own parameters for the analysis, but it also provides suggestions based on samples of the data. The summary outputs are prepared to allow the generation of explanatory plots.
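
The first two stages (read filtering and summarizing reads into locus-to-locus interactions) can be illustrated with a minimal sketch. It is not CCdeep's code: the function names are hypothetical, loci are assumed to be fixed-size genomic bins rather than, for example, restriction fragments, and the mapping-quality threshold of 30 is an arbitrary illustration.

```python
from collections import Counter


def locus_of(chrom, pos, bin_size):
    """Map a genomic position to a locus, here simply a fixed-size bin."""
    return (chrom, pos // bin_size)


def summarize_contacts(read_pairs, bin_size=1_000_000, min_mapq=30):
    """Count read pairs per unordered locus pair.

    read_pairs: iterable of (chrom1, pos1, mapq1, chrom2, pos2, mapq2).
    Pairs in which either read falls below min_mapq are discarded.
    """
    contacts = Counter()
    for c1, p1, q1, c2, p2, q2 in read_pairs:
        if min(q1, q2) < min_mapq:
            continue
        a = locus_of(c1, p1, bin_size)
        b = locus_of(c2, p2, bin_size)
        key = (a, b) if a <= b else (b, a)  # contacts are symmetric
        contacts[key] += 1
    return contacts
```

With roughly as many reads as possible locus pairs, most entries of such a table are zero or one, which is the sparsity problem described above.
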

Poster Number 15

Title: Topology prediction and three-dimensional modeling of single-chain and multi-chain transmembrane β-barrel proteins
Authors: Sikander Hayat and Arne Elofsson
Affiliation: Dept. Biochem. & Biophy., Stockholm University, SciLifeLab, Stockholm, Sweden

Transmembrane β-barrels play a major role in the normal functioning of a cell, for example in the translocation and insertion of other proteins and the transport of substrates across the membrane. Further, both single-chain and multi-chain transmembrane β-barrels are key constituents of the Type V secretion system in gram-negative bacteria and are relevant anti-microbial drug targets. Given that it is difficult to crystallize membrane proteins to determine their 3D structure experimentally, it is necessary to develop computational methods to predict their topology and three-dimensional structure. The barrel region of single-chain transmembrane β-barrels is formed by a single protein chain, and the number of β-strands varies from 8 to 24. The barrel of multi-chain transmembrane β-barrels, in contrast, consists of at least 3 chains that contribute an equal number of strands to form a barrel.

We recently developed computational methods for the topology prediction (BOCTOPUS [1]) and three-dimensional modeling (tobmodel [2]) of single-chain transmembrane β-barrels. Briefly, BOCTOPUS is a two-stage topology predictor that employs a support vector machine and a hidden Markov model layer to account for local and global residue preferences. BOCTOPUS takes a position-specific scoring matrix as input and outputs the topology (i=inner loop, o=outer loop, M=β-strand). BOCTOPUS predicts the correct number of strands in 30.1±1.5 out of 36 proteins in the dataset. Further, the correct topology is predicted for 25.4±2.0 proteins, which is slightly better than the other methods [1].
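
A topology in the i/o/M alphabet described above is easy to post-process. The following sketch, which is not part of BOCTOPUS and uses hypothetical function names, extracts the β-strand segments from a topology string and checks whether a prediction has the same number of strands as a reference.

```python
import re


def strands(topology):
    """Return (start, end) pairs, 0-based with end exclusive, for each run of 'M'."""
    return [(m.start(), m.end()) for m in re.finditer(r"M+", topology)]


def same_strand_count(predicted, reference):
    """True if both topology strings contain the same number of strands."""
    return len(strands(predicted)) == len(strands(reference))
```

For example, `strands("iiMMMMooMMMMii")` yields two segments, `(2, 6)` and `(8, 12)`.
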

For three-dimensional modeling, BOCTOPUS [1] is first used to obtain alternative topologies for the given sequence. Then, multiple Cα models of the transmembrane β-barrel region are generated for different tilts of the β-strands for all obtained topologies [2]. A novel z-coordinate predictor called ZPRED3 is used to predict the distance of residues from the membrane center. The top-ranking model is then chosen based on the minimum difference between the predicted z-coordinates and the z-coordinates obtained from the generated models [2]. Tobmodel is compared with TMBpro [4] and 3D-SPoT [3]. Models obtained from TMBpro [4] and tobmodel have an average RMSD of 8.79 and 7.24 Å, respectively. The average TM_Score for TMBpro models is 0.56, slightly higher than for top-ranking tobmodel models (0.43). The average RMSD when the correct topology is available is 5.86 Å and 4.10 Å for tobmodel and 3D-SPoT [3], respectively. However, we believe that tobmodel can be a useful tool for topology prediction and 3D modeling of transmembrane β-barrels. In the future, more advanced model selection methods will be developed to select the best possible model. Further, BOCTOPUS and other methods cannot predict the topology of multi-chain transmembrane β-barrels with high accuracy. Thus, we are currently working on the topology prediction and three-dimensional modeling of multi-chain transmembrane β-barrels. BOCTOPUS and tobmodel are freely available at boctopus.cbr.su.se and tobmodel.cbr.su.se
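
The model selection step can be sketched as picking the candidate whose per-residue z-coordinates deviate least from the predicted ones. This is an illustration only: the abstract does not state how the difference is measured, so the mean absolute difference used here, and the function names, are assumptions.

```python
def z_difference(predicted_z, model_z):
    """Mean absolute difference between predicted and model z-coordinates."""
    if len(predicted_z) != len(model_z):
        raise ValueError("coordinate lists must have the same length")
    return sum(abs(p - m) for p, m in zip(predicted_z, model_z)) / len(predicted_z)


def pick_model(predicted_z, models):
    """Return the name of the model closest to the predicted z-coordinates.

    models: dict mapping a model name to its list of per-residue z-coordinates.
    """
    return min(models, key=lambda name: z_difference(predicted_z, models[name]))
```
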

1. Hayat, S. & Elofsson, A., Bioinformatics, 28, 516–522, (2012).
2. Hayat, S. & Elofsson, A., ISMB Proceedings, (2012).
3. Naveed, H., Xu, Y., Jackups, R., and Liang, J., Journal of the Am. Chem. Society, 134, 1775–1781, (2012).
4. Randall, A., Cheng, J., Sweredoski, M., and Baldi, P., Bioinformatics, 24, 513–520, (2008).

Poster Number 16

Luisa Hugerth1, Daniel Lundin1, Ino DeBruijn1, Daniel Herlemann2, Anders F Andersson1

1) Science for Life Laboratory, School of Biotechnology, KTH Royal Institute of Technology, Tomtebodavägen 23 B, SE-17165 Solna, Sweden 2) Leibniz Institute for Baltic Sea Research Warnemünde, Seestraße 15, 18119 Rostock

Systems Biology of Baltic Sea Microbial Food Webs

The Baltic Sea represents one of the world’s largest brackish water reservoirs. It faces large yearly variations in temperature and nutrient levels and is highly affected by human activities, such as overfishing and eutrophication. As in all environments, microbes are key drivers of the fluxes of nutrients and energy in the Baltic Sea. The extremely wide variety of microbes in natural environments defeats any attempt to catalogue their diversity through culture or microscopy. In practice, only deep sequencing can provide enough data for a thorough inventory of microbial stocks.

While being drivers of nutrient and energy fluxes, microbes are also greatly affected by these same conditions. Therefore, along the length and depth of the Baltic Sea, which represent gradients of salinity and dissolved oxygen, bacterial communities are highly structured (Herlemann, 2011). Communities also experience strong fluctuations over the course of a year, but show recurrent annual patterns (Andersson, 2010).

Bacterial communities are controlled both by the environment and by interactions with other microbes. In particular, single-celled eukaryotes (protists) that feed on bacteria are important elements in the energy flux from bacteria to higher organisms. To investigate environmental protist variety and dynamics, we performed an extensive in silico evaluation of over 55,000 eukaryotic ribosomal RNA sequences, selecting the primer sites that will allow us both to amplify the widest possible variety of eukaryotes and to extract the highest proportion of unique sequences from short paired Illumina reads. A highly resolved temporal series of bacterial and protist communities and physico-chemical parameters will allow us to infer the network structure of food webs in the Baltic Sea environment.

However, even with high temporal resolution, identifying microbes will not reveal much about the biological and climatic factors shaping an environment unless this information is coupled to knowledge about individual microbes’ metabolic capabilities. To this end, we have initiated the work of reconstructing the genomes of the most abundant Baltic Sea microbial populations through de novo assembly of shotgun metagenomic sequencing reads. In a first study we reconstructed, by 454 sequencing, the genome of a highly abundant bacterial population in Baltic surface waters belonging to the Verrucomicrobia. Unsupervised binning of metagenomic genome fragments (contigs) using tetra-nucleotide frequencies and a self-organizing map, and verification with marker genes, resulted in a cluster of contigs that represents a near complete genome of the Verrucomicrobium. The draft genome sequence of this organism, which lacks cultured relatives, revealed a typical aerobic heterotrophic metabolism but also several glycoside hydrolases that allow the use of complex carbon molecules as a carbon source. Its high abundance in the Baltic Sea and the presence of potential cellulase genes in its genome suggest an important role of this organism in organic matter decomposition in this environment. To enable assembly of lower-abundance organisms, we are currently optimizing assembly protocols for Illumina-based metagenomics.
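
The tetra-nucleotide frequency vectors used as input for binning can be computed as in the following minimal sketch. The function name is hypothetical, and two simplifications are assumed: windows containing ambiguous bases are skipped, and reverse-complement tetramers are not merged, which real binning pipelines often do. The self-organizing map itself is not shown.

```python
from collections import Counter
from itertools import product

# All 256 tetranucleotides in a fixed order, so vectors are comparable.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetra_freq(seq):
    """Return the 256-dimensional tetranucleotide frequency vector of a contig."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[k] for k in KMERS)  # ignores windows with N etc.
    if total == 0:
        return [0.0] * len(KMERS)
    return [counts[k] / total for k in KMERS]
```

Contigs from the same genome tend to have similar vectors, which is what allows an unsupervised method to group them without reference sequences.
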

Andersson AF, Riemann L, Bertilsson S. Pyrosequencing reveals contrasting seasonal dynamics of taxa within Baltic Sea bacterioplankton communities. ISME J, 2010. 4(2): p. 171-81.

Herlemann DPR, Labrenz M, Jürgens K, Bertilsson S, Waniek JJ, Andersson AF. Transitions in bacterial communities along the 2000 km salinity gradient of the Baltic Sea. ISME J, 2011. Apr 7.

Poster Number 17

NETWORK BASED ANALYSIS OF IN-DEPTH PROTEOMICS DATA TO ASSESS FBW7 CELLULAR TARGETS AND FUNCTIONS

Hultin Rosenberg L1, Branca R1, Forshed J1 and Lehtiö J1.

1. Clinical Proteomics Mass Spectrometry, Science For Life Laboratory, Karolinska Institutet, Karolinska University Hospital, Stockholm, Sweden.

Background The omics research fields have so far mainly failed to deliver biomarkers for clinical and therapeutic use. Traditionally, biomarkers have been selected by differential expression analysis, scoring each gene or protein for how well its expression pattern discriminates between groups of samples. However, biological processes are driven by functional modules rather than individual genes or proteins, so it is necessary to understand how the states of single units jointly determine the higher-level state of these functional modules. To enhance the generation of biologically and clinically relevant information from expression data and to enable better models of disease and healthy phenotypes, systems-based approaches are necessary. In systems biology the focus is on complex interactions between components in biological systems. Several recent studies have shown, for example, that the predictive performance of gene expression data can be improved by incorporating interactome data [1,2]. These studies revealed that changes in network activity can be more predictive of clinical outcome in cancer patients than the expression of individual genes.

Aim In the current study, a network-based approach will be developed and coupled to protein quantities from mass spectrometry data. The idea is to shift the focus from individual proteins showing differential expression to whole protein subnetworks with altered activity or regulation between conditions. The aim is to identify protein subnetworks that are altered or deregulated between wild type and Fbw7 knockout samples in order to define new targets of Fbw7 and improve the understanding of its function. Fbw7 is a known tumor suppressor that targets several oncogenes for ubiquitin-mediated proteasomal degradation. Disruption of the Fbw7 gene is associated with embryonic lethality, genetic instability and tumorigenesis; however, the full extent of Fbw7 targets and functions is still poorly understood. By studying protein expression in the context of protein-protein interaction networks, we hope to detect differences on a biological system level even when they are not detectable on a protein-by-protein level.

Methods Mass spectrometry (TMT IEF-LC-MS/MS) is applied to identify and quantify proteins in samples from the HCT116 colon cancer cell line, wild type and Fbw7 knockout. The protein quantities are merged with protein-protein interaction data from the Human Protein Reference Database (HPRD). Using a heuristic search algorithm, the protein-protein interaction network is searched for subnetworks with a significant difference in activity between phenotypic classes. The scoring of subnetworks will be based on different measures of subnetwork state: average expression activity as well as variance between proteins in the subnetwork. The highest-scoring subnetworks and corresponding measures will be fed to a multivariate model to select the subnetworks most predictive of phenotypic outcome. The identified subnetworks will also be studied in terms of enriched functions and pathways to characterize them further.
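
One common form of heuristic subnetwork search is greedy expansion from a seed protein, sketched below. The abstract does not specify which heuristic or score will be used, so everything here is an assumption for illustration: the function names are hypothetical, the score is the absolute mean log-ratio (knockout versus wild type) over the subnetwork, and the variance-based measure mentioned above is not included.

```python
import statistics


def activity(nodes, log_ratio):
    """Average expression activity: mean log-ratio over the subnetwork."""
    return statistics.mean(log_ratio[n] for n in nodes)


def greedy_subnetwork(seed, neighbors, log_ratio, max_size=10):
    """Grow a subnetwork from `seed`, adding the neighbor that most improves the score.

    neighbors: dict mapping a protein to the proteins it interacts with.
    log_ratio: dict mapping a quantified protein to its log abundance ratio.
    Returns (set of proteins, score); stops when no neighbor improves the score.
    """
    sub = {seed}
    score = abs(activity(sub, log_ratio))
    while len(sub) < max_size:
        frontier = {
            m for n in sub for m in neighbors.get(n, ()) if m in log_ratio
        } - sub
        best, best_score = None, score
        for cand in sorted(frontier):  # sorted for reproducible tie-breaking
            s = abs(activity(sub | {cand}, log_ratio))
            if s > best_score:
                best, best_score = cand, s
        if best is None:
            break
        sub.add(best)
        score = best_score
    return sub, score
```

In practice such a search would be run from every quantified protein and the resulting scores compared against scores obtained after permuting sample labels, which this sketch does not attempt.
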

1. Chuang, HY. et al. Network-based classification of breast cancer metastasis. Mol. Syst. Biol. 3, 140 (2007)
2. Taylor, IW. et al. Dynamic modularity in protein interaction networks predicts breast cancer outcome. Nature Biotech. 27, 2 (2009)

Poster Number 18


Lukasz Huminiecki, DBB, Stockholm University

Abstract

Whole genome duplication (WGD) is a special case of gene duplication, observed rarely in animals, whereby all genes duplicate simultaneously through polyploidisation. Two rounds of WGD (2R-WGD) occurred at the base of vertebrates, giving rise to an enormous wave of genetic novelty, but a systematic analysis of functional consequences of this event has not yet been performed.

We show that 2R-WGD affected an overwhelming majority (74%) of signalling genes, in particular developmental pathways involving receptor tyrosine kinases, Wnt and transforming growth factor-β ligands, G protein-coupled receptors and the apoptosis pathway. 2R-retained genes, in contrast to tandem duplicates, were enriched in protein interaction domains and multifunctional signalling modules of Ras and mitogen-activated protein kinase cascades. 2R-WGD had a fundamental impact on the cell-cycle machinery, redefined molecular building blocks of the neuronal synapse, and was formative for vertebrate brains. We investigated 2R-associated nodes in the context of the human signalling network, as well as in an inferred ancestral pre-2R (AP2R) network, and found that hubs (particularly involving negative regulation) were preferentially retained, with high connectivity driving retention. Finally, microarrays and proteomics demonstrated a trend for gradual paralog expression divergence independent of the duplication mechanism, but inferred ancestral expression states suggested preferential subfunctionalisation among 2R-ohnologs (2ROs).

The 2R event left an indelible imprint on vertebrate signalling and the cell cycle. We show that 2R-WGD preferentially retained genes are associated with higher organismal complexity (for example, locomotion, nervous system, morphogenesis), while genes associated with basic cellular functions (for example, translation, replication, splicing, recombination; with the notable exception of cell cycle) tended to be excluded. 2R-WGD set the stage for the emergence of key vertebrate functional novelties (such as complex brains, circulatory system, heart, bone, cartilage, musculature and adipose tissue). A full explanation of the impact of 2R on evolution, function and the flow of information in vertebrate signalling networks is likely to have practical consequences for regenerative medicine, stem cell therapies and cancer treatment.

Poster Number 19

Genomic variation of mouse microRNAs

Katherine Icay1,2*, Tessa Sipilä1,2,3*, Dario Greco4, Iiris Hovatta1,2,3,5

1Research Programs Unit, Molecular Neurology, Biomedicum-Helsinki, University of Helsinki, Finland 2Department of Medical Genetics, Haartman Institute, University of Helsinki, Finland 3Department of Mental Health and Substance Abuse Services, National Institute for Health and Welfare, Helsinki, Finland 4Department of Bioscience and Nutrition, Karolinska Institutet, Sweden 5Institute of Molecular Medicine Finland (FIMM), University of Helsinki, Finland

The last decade of genetics and biomedical research has seen an explosion of interest in microRNAs (miRNAs), a set of small non-protein coding molecules with functional roles in the post-transcriptional regulation of gene expression. MiRNAs are characterized in all higher organisms and tissues. A single miRNA can target hundreds of genes, often with similar functions and/or biological processes, and regulate their expression by inhibiting the translation of their mRNAs. Consequently, abnormal miRNA expression has been observed in numerous diseases including psychiatric disorders. We hypothesize that genetic variation within miRNA genes and their putative regulatory regions could affect the biological function of miRNAs, thus resulting in phenotypic differences.

The well-characterized biochemical and/or behavioural phenotypes of inbred mouse strains make them ideal models for the study of a wide range of phenotypes. We have previously used a panel of inbred mouse strains and carried out gene expression profiling to identify genes that regulate anxiety-like behaviour. Keane et al. (2011) recently published genome sequences of 17 inbred mouse strains, including the most commonly used laboratory strains and wild-derived strains. We utilized this publicly available dataset (http://www.sanger.ac.uk/resources/mouse/genomes/) and performed a genome-wide analysis of SNPs and structural variations that could affect the biological function of miRNAs (i.e. within hairpin, mature, seed and putative regulatory regions) in different inbred mouse strains.

We observed, as expected, genomic variations occurring less frequently within miRNA genes compared to the rest of the genome. Interestingly, we also observed that a miRNA’s genomic location and membership in a miRNA cluster or family significantly influenced the occurrence of polymorphisms. Moreover, we observed SNPs in miRNA seed regions altering the set of predicted target genes of a miRNA by over 90% and, consequently, the biological processes and pathways potentially regulated by a miRNA. Additionally, miRNAs having human and/or rat orthologs were more likely to be conserved and not contain genetic variation.
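The effect of a seed-region SNP on predicted targets, described in the paragraph above, can be illustrated with a toy seed-match predictor. This is only a sketch under stated assumptions: it treats a gene as a target if its 3'UTR contains the reverse complement of miRNA positions 2-8, which is far simpler than real target prediction tools, and it measures change as one minus the Jaccard overlap of the two target sets. The sequences, gene names and function names are hypothetical and are not taken from the study.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def seed_site(mature_mirna: str) -> str:
    """Return the mRNA site (5'->3') complementary to miRNA positions 2-8."""
    seed = mature_mirna[1:8]
    return "".join(COMPLEMENT[base] for base in reversed(seed))


def predicted_targets(mature_mirna: str, utrs: dict) -> set:
    """Toy predictor: a gene is a target if its 3'UTR contains the seed site."""
    site = seed_site(mature_mirna)
    return {gene for gene, utr in utrs.items() if site in utr}


def target_turnover(reference: str, variant: str, utrs: dict) -> float:
    """Fraction of the combined target set that is not shared (1 - Jaccard)."""
    ref_targets = predicted_targets(reference, utrs)
    var_targets = predicted_targets(variant, utrs)
    union = ref_targets | var_targets
    if not union:
        return 0.0
    return 1.0 - len(ref_targets & var_targets) / len(union)


# Hypothetical toy data: the variant carries a G->A SNP at seed position 4.
reference_mirna = "UGAGGUAGUAGGUUGUAUAGUU"
variant_mirna = "UGAAGUAGUAGGUUGUAUAGUU"
toy_utrs = {
    "geneA": "AAACUACCUCAAA",
    "geneB": "GGGCUACUUCGGG",
    "geneC": "CUACCUCCUACUUC",
}
turnover = target_turnover(reference_mirna, variant_mirna, toy_utrs)
```

In this toy data the reference and variant miRNAs share only one of three predicted targets, so a single seed SNP already changes most of the predicted target set.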

In conclusion, these findings provide a valuable characterization of miRNAs with sequence information to be used in the consideration and design of future experimental studies. We aim to further investigate whether SNPs affect miRNA expression levels using miRNA-seq data (derived from the hippocampus of five mouse strains), with special interest in finding correlations to anxiety-like behaviour.

Poster Number 20

Copernicus: Enabling large scale computing as a workflow

The amount of available compute resources has grown vastly in recent years, yet these resources are often underutilized even though many problems could put them to good use. As an example from the field of molecular dynamics, adaptive sampling algorithms such as Markov state modelling or free energy calculations consist of many short (100-1000) simulations that gather statistics, followed by an adaptive sampling step that guides the simulations of the next iteration. Although such a workflow is simple to define, these types of problems can easily utilize thousands of cores, generate massive amounts of data and require something more than a queue. Copernicus is a platform that enables large scale computing to be defined as a workflow. The platform takes care of breaking the work down into tasks and distributing them to available compute resources, all in a secure and fault-tolerant manner. Its overlay P2P network can utilize a wide variety of heterogeneous compute resources such as desktops, clusters and cloud compute instances, and automatic resource awareness makes sure that the best resources are used for the defined job. As a proof of concept, the folding of the villin headpiece was performed by combining molecular simulations with Markov state modelling for kinetic clustering and statistical model building. This combination made it possible to identify the native folded state without any prior knowledge within 46 hours, utilizing a total of 5736 cores on a Cray XT6 and an AMD Istanbul cluster. By combining simulations with statistical model building, parallelization was achieved both at a fine-grained level and at an algorithmic level, resulting in much stronger scaling. The Copernicus platform is built in a general manner, and its plugin architecture allows any executable to be used in a workflow, making it ideal for any type of large scale statistical computation and data processing.

Poster Number 21

Cross species comparison of C/EBPα, PPARγ, DHS and chromatin state in mouse and human adipocytes highlights factors important for retention of PPARγ binding

Mette Jørgensen2†, Søren F Schmidt1†, Yun Chen2, Ronni Nielsen1, Albin Sandelin2* and Susanne Mandrup1*

1 Department of Biochemistry and Molecular Biology, University of Southern Denmark, Campusvej 55, DK-5230 Odense M, Denmark 2 The Bioinformatics Centre, Department of Biology and Biomedical Research and Innovation Centre, Copenhagen University, Ole Maaløs Vej 5, DK-2200, Copenhagen N, Denmark * Corresponding authors: Albin Sandelin and Susanne Mandrup † Equal contributors

The transcription factors peroxisome proliferator activated receptor γ (PPARγ) and CCAAT/enhancer binding protein α (C/EBPα) are key transcriptional regulators of adipocyte differentiation and function. In this study we have mapped all binding sites of C/EBPα and PPARγ in human SGBS adipocytes and compared these with the genome-wide profiles from mouse adipocytes to systematically investigate which biological features correlate with retention of sites in orthologous regions between mouse and human. Despite a limited interspecies retention of binding sites (~20%), several biological features make sites more likely to be retained. First, co-binding of PPARγ and C/EBPα in mouse is the most powerful predictor of retention of the corresponding binding sites in human. Second, vicinity to genes highly upregulated during adipogenesis significantly increases retention. Third, the presence of C/EBPα consensus sites correlates with retention of both factors, indicating that C/EBPα facilitates recruitment of PPARγ. Fourth, retention correlates with overall sequence conservation within the binding regions independent of C/EBPα and PPARγ sequence patterns, indicating that other transcription factors work cooperatively with these two key transcription factors. Fifth, we show that binding sites that are highly accessible in preadipocytes (based on publicly available DHS data) are more likely to be retained. Sixth, PPARγ sites are more likely to be retained if they have high H3K27ac levels in either preadipocytes, adipocytes or both. Thus, the total H3K27ac levels in the PPARγ binding regions seem to be more important for retention than the development of acetylation during adipogenesis. This study provides a comprehensive and systematic analysis of which biological features affect retention of binding sites between human and mouse.
Specifically, we show that the binding of C/EBPα and PPARγ in adipocytes has evolved in a highly interdependent manner, indicating significant cooperativity between these two transcription factors.

Poster Number 22

Resistance mutations in environmental bacterial communities living under different antibiotic selection pressures

Anna Johnning1, Erik Kristiansson2, Birgitta Weijdegård1, D.G. Joakim Larsson1

1Institute of Neuroscience and Physiology, Sahlgrenska Academy at University of Gothenburg, Gothenburg, Sweden, 2Department of Mathematical Sciences, Chalmers University of Technology/University of Gothenburg, Gothenburg, Sweden

Antibiotic resistance is a pressing concern for the health care sector globally. Resistant pathogenic bacteria can cause refractory infections which lead to added suffering, greater risk of spreading of the disease and death. Recently, pollution with antibiotics in the environment has been recognized as a potential driver of microbial resistance. Several human pathogens, such as Escherichia coli and Salmonella spp., can spread through contaminated water. Therefore, exposing environmental communities of pathogenic bacteria to sufficiently high concentrations of antibiotics could select for resistant strains and hence pose a risk to human health.

We have sampled river sediment up- and downstream from an Indian treatment plant receiving industrial effluent from pharmaceutical production, and from a regular Swedish sewage treatment plant. We have previously shown that the Indian river receiving the treated effluent is highly polluted with fluoroquinolone antibiotics downstream from the treatment plant, but some are also detected upstream. The Indian sediment samples hence represent a gradient of fluoroquinolone pollution (ranging from 914 to 5.24 µg ciprofloxacin/g organic matter), while no fluoroquinolones were detected in the Swedish samples. Fluoroquinolones target the essential enzymes DNA gyrase and topoisomerase IV, encoded by the two gene pairs gyrA and gyrB, and parC and parE, respectively. Certain mutations within these genes, especially in gyrA and parC, have been linked to a lowered susceptibility to fluoroquinolones.

To study whether the abundance of resistance mutations in the Escherichia and Salmonella communities residing in the river sediments depends on the level of fluoroquinolone exposure, we designed primers targeting gyrA and parC in these genera. The resulting amplicons were sequenced using massively parallel pyrosequencing, and any deviations from the amino acid sequence of the type strain were analyzed using the GS Amplicon Variant Analyzer from 454. For Escherichia, the resistance mutations S83L and D87N in gyrA, and S80I and E84V or E84G in parC, could be detected at all sampling sites, co-occurring in the same sequence read with varying frequency. There was no apparent correlation between the level of fluoroquinolone pollution and the mutation abundance, except that the most polluted river site showed the highest abundance of the aforementioned double mutations in both genes. For Salmonella, we detected resistance mutations S83F and D87N in gyrA, and Y57S in parC in all samples. In the most polluted sample, the mentioned parC mutation was often coupled with mutation E84K. Single mutations in gyrA or parC are not sufficient to provide high-level resistance to fluoroquinolones. The sediment directly downstream from the Indian wastewater treatment plant appears to be the only investigated site where the selection pressure was sufficiently high to promote bacteria with such double mutations.

Poster Number 23

Statistical challenges of comparative metagenomics

Viktor Jonsson, Olle Nerman, Erik Kristiansson

Mathematical Sciences, University of Gothenburg and Chalmers University of Technology, 412 96 Göteborg, Sweden.

In metagenomics the whole joint genome of microbial communities is analyzed. Samples are taken directly from the environment, and many organisms that cannot be cultivated in the laboratory can therefore be investigated. In comparative metagenomics the difference between samples is studied by quantifying and comparing gene abundance. However, there are many statistical challenges associated with comparative metagenomics that can, if not correctly handled, result in a substantial decrease in power and data reliability. On this poster we give a summary of some of these challenges and present a first look at new, improved statistical methods for comparative metagenomics. The main challenge in statistical analysis of metagenomics data is the substantial variation. The enormous diversity of most microbial communities results in high biological variability between different metagenomes. Technical variability is also introduced by sequencing errors, the limited length of the generated DNA fragments and imprecise matching of reads with the correct gene function. In addition, most metagenomes are heavily undersampled, as even modern massively parallel DNA sequencing techniques can only sequence a fraction of the total DNA content available in most samples. Finally, the observed data are discrete (counts of gene occurrences) and high dimensional (many genes tested simultaneously), which makes many standard statistical methods unsuitable. For example, we show that the standard t-test performs poorly under these conditions. Furthermore, we present ongoing work on a new method for statistical comparison of metagenomes. The method is built on hierarchical Bayesian modeling within the framework of a generalized linear model. Markov chain Monte Carlo (MCMC) will be used to fit the model to the data. The main benefit is the sharing of variance between genes, which makes the variance estimates more stable.
Using a generalized linear model as a basis allows for flexibility in the choice of experimental design. Together, these parts will form a data-driven statistical model that will improve the potential of comparative metagenomics.
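The idea of sharing variance between genes can be illustrated with a much simpler device than the hierarchical Bayesian model described above. The sketch below is not the authors' method: it only shrinks each gene's sample variance towards the average variance over all genes, weighted by degrees of freedom. The function name, the `prior_df` parameter and the toy counts are assumptions made for illustration.

```python
from statistics import mean, variance


def shrunken_variances(counts_by_gene: dict, prior_df: float = 4.0) -> dict:
    """Shrink per-gene sample variances towards the pooled (mean) variance.

    counts_by_gene maps a gene name to its counts across replicate samples.
    prior_df is a hypothetical tuning parameter: the larger it is, the more
    weight the pooled variance receives relative to each gene's own estimate.
    """
    sample_var = {gene: variance(values) for gene, values in counts_by_gene.items()}
    pooled = mean(sample_var.values())
    shrunk = {}
    for gene, values in counts_by_gene.items():
        df = len(values) - 1  # degrees of freedom of the gene's own estimate
        shrunk[gene] = (prior_df * pooled + df * sample_var[gene]) / (prior_df + df)
    return shrunk


# Hypothetical toy counts: three replicates for each of two genes.
toy_counts = {"geneA": [1, 1, 1], "geneB": [0, 2, 4]}
stabilised = shrunken_variances(toy_counts)
```

With only three replicates, geneA has a sample variance of exactly zero, which would make any test statistic explode. After shrinkage both genes receive moderate, non-zero variance estimates, which is the stabilising effect the abstract refers to.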

Poster Number 24

De novo assembly and annotation of the grey reindeer lichen (Cladonia rangiferina) transcriptome

Sini Junttila and Stephen Rudd

Turku Centre for Biotechnology, University of Turku and Åbo Akademi University, Tykistökatu 6, 20520 Turku, Finland

Lichen is a symbiotic relationship between a fungus and an alga, and these organisms have a remarkable ability to survive in some of the harshest climates on earth. They can endure frequent drying and wetting and are able to survive in the desiccated state for long periods at a time. Although molecular biological and genetic resources have been established for the systematic study and classification of lichens, there are no published lichen reference genomes available and the sequences available in public databases are very limited. We report the de novo assembly and annotation of the transcriptome of the grey reindeer lichen, Cladonia rangiferina, using high-throughput next generation sequencing and traditional Sanger sequencing data. High quality sequence reads from a Roche GS FLX sequencing run and Sanger sequencing were de novo assembled. The genome of origin for the lichen sequences was determined and the assembled sequences were annotated using BLASTX analysis against the non-redundant database. The C. rangiferina transcriptome was further characterised by functional annotation of the sequences against GO and KEGG databases. Our results present the first transcriptome sequencing and de novo assembly of any lichen species, describe the ongoing molecular processes and the most active pathways in C. rangiferina, and bring a significant increase to publicly available lichen sequence information. These data provide a first look into the molecular nature of the lichen symbiosis and characterise the transcriptional space of this remarkable organism. These data will also enable further studies aimed at deciphering the genetic mechanisms behind lichen desiccation tolerance.

Poster Number 25

PyRy3D: a software tool for modelling of large macromolecular complexes

Joanna M. Kasprzak1,*, Wojciech Potrzebowski2, Mateusz Dobrychłop1, Janusz M. Bujnicki1,2

1 Faculty of Biology, Adam Mickiewicz University, ul. Umultowska 89, 61-614 Poznan, POLAND 2 International Institute of Molecular and Cell Biology in Warsaw, ul. Ks. Trojdena 4, 02-109 Warsaw, POLAND

* presenting author

One of the major challenges in structural biology is to determine the structures of macromolecular complexes and to understand their function and mechanism of action. However, compared to structure determination of the individual components, structural characterization of macromolecular assemblies is very difficult. To maximize completeness, accuracy and efficiency of structure determination for large macromolecular complexes, a hybrid computational approach is required that is able to incorporate spatial information from a variety of experimental methods (such as X-ray crystallography, NMR, cryo-EM, cross-linking and mass spectrometry) into the modelling procedure. For many biological complexes such an approach might become the only possibility to retrieve structural details essential for planning further experiments, e.g. in order to explain the mechanism of action. We developed PyRy3D, a method for building and visualizing low-resolution models of large macromolecular complexes. The components can be represented as rigid bodies (e.g. macromolecular structures determined by X-ray crystallography or NMR, theoretical models, or abstract shapes) or as flexible shapes (e.g. disordered regions or parts of protein or nucleic acid sequence with unknown structure). Spatial restraints are used to identify components interacting with each other, and to pack them tightly into contours of the entire complex (e.g. cryo-EM density maps or ab initio reconstructions from SAXS or SANS methods). Such an approach enables creation of low-resolution models even for very large macromolecular complexes with components of unknown 3D structure. Our model building procedure applies a Monte Carlo approach to sample the space of solutions fulfilling experimental restraints.
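The general idea of Monte Carlo sampling under spatial restraints can be sketched as follows. This is a minimal illustration, not PyRy3D's implementation or API: components are reduced to single points, restraints are hypothetical target distances between pairs of components, the score is the sum of squared restraint violations, and moves are accepted with the standard Metropolis criterion. All names and parameter values are assumptions.

```python
import math
import random


def restraint_score(coords: dict, restraints: list) -> float:
    """Sum of squared deviations from the target pairwise distances."""
    return sum(
        (math.dist(coords[a], coords[b]) - target) ** 2
        for a, b, target in restraints
    )


def monte_carlo_fit(coords, restraints, steps=5000, step_size=1.0,
                    temperature=0.5, seed=0):
    """Metropolis sampling of component positions; returns the best state seen."""
    rng = random.Random(seed)
    current = {name: tuple(xyz) for name, xyz in coords.items()}
    score = restraint_score(current, restraints)
    best, best_score = dict(current), score
    names = sorted(current)
    for _ in range(steps):
        # Propose a random displacement of one randomly chosen component.
        name = rng.choice(names)
        moved = tuple(x + rng.uniform(-step_size, step_size) for x in current[name])
        trial = dict(current)
        trial[name] = moved
        trial_score = restraint_score(trial, restraints)
        delta = trial_score - score
        # Always accept improvements; accept worse states with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, score = trial, trial_score
            if score < best_score:
                best, best_score = dict(current), score
    return best, best_score


# Hypothetical toy complex: three components that all start at the origin.
start = {"A": (0.0, 0.0, 0.0), "B": (0.0, 0.0, 0.0), "C": (0.0, 0.0, 0.0)}
toy_restraints = [("A", "B", 3.0), ("B", "C", 4.0), ("A", "C", 5.0)]
model, model_score = monte_carlo_fit(start, toy_restraints)
```

A real modelling tool would additionally score clashes between components and the fit to a density map or envelope. Only the sampling loop is shown here.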

Acknowledgements: This analysis was funded by the Polish Ministry of Science and Higher Education grant N N301 123138 to JMK, and by the European Research Council (StG grant RNA+P=123D to JMB). JMK has been a scholarship-holder of Adam Mickiewicz University Foundation for 2011. JMB has been supported by the "Ideas for Poland" fellowship from the Foundation for Polish Science.

Poster Number 26

RNA Specificity of RNA Recognition Motif (RRM) Domains

Deepak Kumar1,*, Joanna M. Kasprzak1, Janusz M. Bujnicki2,1

1 Laboratory of Structural Bioinformatics, Institute of Molecular Biology and Biotechnology, Collegium Biologicum, Adam Mickiewicz University, ul. Umultowska 89, 61-614 Poznan, Poland. 2 Laboratory of Bioinformatics and Protein Engineering, International Institute of Molecular and Cell Biology in Warsaw, ul. Ks. Trojdena 4, 02-109 Warsaw, Poland.

* presenting author

The RNA-recognition motif (RRM), also known as RBD (RNA binding domain) or RNP (ribonucleoprotein domain) is the most abundant RNA-binding domain in higher vertebrates and is the most extensively studied RNA-binding domain, both in terms of structure and biochemistry. RRM-containing proteins are involved in most post-transcriptional gene expression processes (i.e. mRNA and rRNA processing, RNA export and stability).

The mechanism of protein and RNA recognition by RRMs is not clear owing to the high variability of interactions. To elucidate sequence-structure-function relationships in the RRM family, a comprehensive bioinformatics analysis was carried out. An extensive database search was performed to identify all proteins with the RRM domain. Clustering analysis on the basis of sequence similarity revealed subfamilies of closely related sequences that are likely to share a similar function and specificity. The analysis was performed for two sets of data: full-length sequences and sequences limited to RRM domains only. We analyzed relations between groups of related RRMs by phylogenetic patterns and by calculation of structure- and sequence-based trees for representative members of the RRM family. Based on these results, we inferred the phylogenetic tree and suggested a scenario for the evolutionary origin of RRMs. Molecular dynamics simulations and statistical potentials are being applied to the RRMs to study RRM-RNA interactions.

Acknowledgements: Deepak Kumar has been supported by the International PhD school grant from the Foundation for Polish Science (grant MPD/2010/3). J.M.K. has been supported by the Polish Ministry of Science and Higher Education (grants 0067/P01/2010/70 and N N301 123138) and by the Foundation for Polish Science (grant POMOST C/58). J.M.K. has been a scholarship-holder of the Adam Mickiewicz University Foundation for 2011. J.M.B. has been supported by the FNP (TEAM/2009-4/2 and "Ideas for Poland" fellowship).

Poster Number 27

LocusVu: Abstract

Mayank Kumar, Christian Spaniol, Volkhard Helms

May 14, 2012

Here we present LocusVu, a novel, easy to use software tool that can analyze large sets of genomic loci data, e.g. from NGS experiments, and perform statistics on them. On the back-end, the tool is linked to the UCSC Genome Browser via its MySQL interface. The front-end is a Java Swing based GUI that lets the user interact with the data in a simple and interactive manner. The tool takes as input a list of genomic loci (positions on the chromosome) and fetches attributes (e.g. gene name, cytogenetic band, repeats information, etc.) for each of these loci. It then presents this information in tabular form, where each locus is listed with its corresponding attributes (the user is free to choose among various attributes). One can then perform many operations on this information, which include but are not restricted to: viewing the N neighboring upstream and downstream genes (the user can choose the value of N dynamically); drawing pie charts and bar charts to obtain a graphical summary of the data; viewing many datasets in the same window; and comparing different datasets to draw comparative-genomic conclusions. The tool's ready connection to UCSC gives it a direct advantage: it does not require the maintenance of a local copy of large databases, while ensuring that the user always has access to the most up-to-date data. Present day tools with similar functionality are either interactive (requiring mouse clicks, etc. and thus slow and tedious) or non-interactive (one can do batch submissions, but then loses the interactivity). LocusVu instead gives the user the power to do batch submissions without taking away the interactivity. LocusVu also lets the user perform statistics on the generated data, which gives it the distinct advantage of analyzing large sets of data in a relatively short time.
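The "N neighboring genes" operation mentioned above can be sketched with a binary search over sorted gene start positions. This is an illustrative assumption, not LocusVu's code: genes are represented only by a start coordinate and a name on a single chromosome, "upstream" simply means a lower coordinate (strand is ignored), and the gene names are hypothetical.

```python
from bisect import bisect_left


def neighbouring_genes(genes, position, n):
    """Return (upstream, downstream) lists of up to n gene names around a locus.

    genes is a list of (start_coordinate, gene_name) tuples for one chromosome.
    """
    ordered = sorted(genes)
    starts = [start for start, _ in ordered]
    # Index of the first gene starting at or after the query position.
    i = bisect_left(starts, position)
    upstream = [name for _, name in ordered[max(0, i - n):i]]
    downstream = [name for _, name in ordered[i:i + n]]
    return upstream, downstream


# Hypothetical toy annotation for one chromosome.
toy_genes = [(900, "g4"), (100, "g1"), (500, "g3"), (300, "g2")]
up, down = neighbouring_genes(toy_genes, position=400, n=2)
```

Sorting once and reusing the sorted list for many loci keeps each lookup logarithmic in the number of genes, which matters for batch submissions of large locus lists.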

Poster Number 28

An ensemble-based feature selection algorithm for identifying candidate metastasis marker genes in endometrial cancer

Kanthida Kusonmano1,2,3, Elisabeth Wik2,3, Helga Salvesen2,3, Kjell Petersen1

1Computational Biology Unit, Uni Computing, Uni Research AS, Bergen, Norway 2Department of Obstetrics and Gynecology, Haukeland University Hospital, Bergen, Norway 3Department of Clinical Medicine, University of Bergen, Bergen, Norway

Abstract: Endometrial cancer is a malignant growth of the endometrium, the lining of the uterus. It is the most common pelvic gynecological malignancy. Although the majority of patients are treated at an early stage, about one fourth of patients in Norway suffer from distant metastases, which are largely incurable and lead to death. Although most studies that comprehensively profile malignant tissues are based on primary lesions, metastatic lesions may be more relevant for defining targets for new therapeutics against systemic disease. A metastatic lesion of cancer cells disseminated to remote sites will have certain cell-biological and molecular properties that may differ from those of the primary tumor. Nowadays, with the availability of high-throughput technologies, the development of more specific metastasis diagnostic markers to define the most relevant therapeutic targets at the molecular level is of broad interest. However, there are only a small number of studies on metastasis in general and in endometrial cancer in particular. In this study, we want to identify metastasis signature genes in endometrial cancer, which show distinct changes or have high discriminatory ability between transcriptional profiles of 122 primary tumors and 19 metastatic lesions based on microarray data. We propose an ensemble-based feature (marker or gene in this context) selection method for identifying candidate marker genes of interest. Instead of obtaining differently ranked feature sets from different feature selection techniques, we try to identify a consensus feature set among various methods, e.g. t-test, Significance Analysis of Microarrays (SAM), Information Gain and ReliefF. The algorithm provides the features common to the different feature selection methods, taking ranking order into consideration. Ensemble approaches are widely used for classification algorithms and have been shown to provide more robust results than applying only a single method.
The selected feature subsets will subsequently give better confidence in the selection of biologically relevant markers for further validation studies.
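One way to combine several ranked feature lists into a consensus set, in the spirit of the approach described above, is sketched below. The exact algorithm is not given in the abstract, so this is an assumption: features are kept only if they appear in the top k of every method, and the survivors are ordered by their mean rank across methods. The method names and gene identifiers are hypothetical.

```python
def consensus_features(rankings: dict, top_k: int) -> list:
    """Return features found in every method's top_k, ordered by mean rank.

    rankings maps a method name to its list of features, best first.
    """
    if not rankings:
        return []
    top_sets = [set(ranked[:top_k]) for ranked in rankings.values()]
    common = set.intersection(*top_sets)

    def mean_rank(feature):
        # Average zero-based position of the feature over all methods.
        return sum(ranked.index(feature) for ranked in rankings.values()) / len(rankings)

    # Ties in mean rank are broken alphabetically to keep the output deterministic.
    return sorted(common, key=lambda feature: (mean_rank(feature), feature))


# Hypothetical rankings from three feature selection methods.
toy_rankings = {
    "t_test": ["g1", "g2", "g3", "g4"],
    "sam": ["g2", "g1", "g4", "g3"],
    "relieff": ["g2", "g3", "g1", "g4"],
}
selected = consensus_features(toy_rankings, top_k=3)
```

Requiring agreement across all methods is deliberately strict. A looser variant could keep features selected by a majority of methods instead.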

Keywords: Feature selection, Biomarker identification, Endometrial cancer, Ensemble method

Poster Number 29

Genome and transcriptome of venomous marine snail Conus consors.

Age Brauer1, Reidar Andreson1, Silja Laht1, Lauris Kaplinski1, Aleksander Sudakov1, Mikk Eelmets1, Maido Remm1, CONCO Consortium2

1 Estonian Biocentre, Riia 23C, 51010 Tartu, Estonia, 2 http://www.conco.eu

Conus consors is a fish-hunting snail that uses venom to paralyze its prey. The venom consists of a mixture of neurotoxins called conopeptides. Conopeptides block different ion channels with high specificity. Some conopeptides are used as medications, and several others are in drug development. The main goal of sequencing the Conus consors genome and venom duct transcriptome was to find new conopeptides. From the venom duct transcriptome, 53 conopeptides were discovered (published by Terrat et al., 2012). Of these, 47 were also found in the genome assembly. In addition, 33 conopeptides not present in the transcriptomes were discovered from the genome assembly.

Only a few mollusk genomes have been sequenced. The C. consors genome is the largest of these to be assembled so far.

Species              Length of the genome  Raw sequence coverage  Length of assembled sequences  N50 (bp)  Nr of core genes (max=458)  Core gene coverage
Lottia gigantea      0.5 Gb                8.9x                   360 Mbp                        1870055   452                         80 %
Aplysia californica  1.8 Gb                9.9x                   716 Mbp                        264327    451                         77 %
Conus bullatus       3.0 Gb                3.0x                   201 Mbp                        182       317                         12.1 %
Conus consors        3.0 Gb                6.0x                   1393 Mbp                       599       457                         88.5 %
Pinctada fucata      1.15 Gb               40.0x                  1413 Mbp                       1629      457                         85 %

To compare these very different genome assemblies more fairly, we used eukaryotic core gene coverage as one of the parameters. A set of 458 eukaryotic core genes that should exist in all eukaryotes was searched against the assembled genomes (http://korflab.ucdavis.edu/Datasets/cegma/). In addition, we used the mitochondrial genome to choose the best C. consors genome assembly. The mitochondrial genome had been sequenced and assembled independently. We compared how much of the mitochondrial genome was present in each assembly and in how many contigs/scaffolds. The values ranged from 6 to 100% and from 4 to 251 scaffolds. The assembly chosen for annotation had 100% of the mitochondrial genome present in 4 scaffolds.
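Two of the quantities used above can be sketched in a few lines: the N50 statistic reported in the table, and the choice of an assembly by mitochondrial genome completeness. The N50 definition is the standard one (the length of the shortest sequence among the longest sequences that together cover at least half of the assembly). The selection rule, which prefers the highest mitochondrial fraction and then the fewest scaffolds, is an assumption based on the description, and the candidate assembly names are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("at least one sequence length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def pick_assembly(candidates):
    """Pick the assembly with the most complete, least fragmented mitochondrial genome.

    candidates is a list of (name, mito_fraction, n_scaffolds) tuples.
    """
    best = max(candidates, key=lambda c: (c[1], -c[2]))
    return best[0]


# Hypothetical toy inputs.
toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
toy_candidates = [
    ("assembly_1", 0.06, 251),
    ("assembly_2", 1.00, 4),
    ("assembly_3", 1.00, 9),
]
toy_n50 = n50(toy_lengths)
chosen = pick_assembly(toy_candidates)
```

Because N50 depends strongly on how fragmented an assembly is, it is best read together with a completeness measure such as core gene coverage, as the abstract does.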

Poster Number 30

Accurate extension of multiple sequence alignments

Ari Löytynoja1, Albert J. Vilella2 and Nick Goldman2

1Institute of Biotechnology, University of Helsinki, Finland 2EMBL-European Bioinformatics Institute, Hinxton, UK

Accurate multiple alignment is demanding, and extension of existing alignments with new data is often an attractive option: addition of new sequences without re-computation of the full alignment saves time and resources, especially when the amount of data added is small relative to the full alignment size; extension of existing alignments also retains the relative matching of reference sequences and thus ensures that downstream analyses depending on certain features of the alignment will not be broken and that the need for manual re-annotation is minimised. The benefits of alignment extension are especially significant in the analysis of fragmented sequences such as those coming from next-generation metagenomics. Popular progressive alignment methods designed for global alignment struggle with short sequence fragments that do not all overlap with each other and contain little information to anchor them in their correct context. In evolutionary analyses of metagenomic data, multiple alignment is therefore often performed with the HMMER package [1], which first generates a profile HMM of a pre-defined reference alignment, then aligns the sequence fragments against this profile and finally maps the against-profile alignments to the original reference sequences. A limitation of profile-based methods is that they do not incorporate and use phylogenetic information and are affected by the composition of the reference alignment and the phylogenetic positions of query sequences within it. We have developed a method for phylogeny-aware alignment of partial-order sequence graphs and apply it to the extension of alignments with new data [2]. Our new method, called PAGAN [3], infers ancestral sequence history for the reference alignment and adds new sequences by aligning them against extant sequences or inferred ancestral sequences in their phylogenetic context, either to pre-defined positions or by finding the best placement for sequences of unknown origin.
Unlike profile-based alternatives, PAGAN considers the phylogenetic relatedness of the sequences and is not affected by inclusion of more diverged sequences in the reference set. Our analyses show that PAGAN outperforms alternative methods for alignment extension and provides superior accuracy for both DNA and protein data, the improvement being especially large for fragmented sequences. PAGAN-generated alignments of noisy NGS sequences are accurate enough for the use of RNA-seq data in evolutionary analyses, while the method also scales up to analyses of large metagenomic data sets. The concepts developed for progressive alignment of sequence graphs can be extended to phylogeny-aware alignment refinement and co-estimation of sequence alignment and phylogeny.

[1] S Eddy. HMMER 3.0 (http://hmmer.org).
[2] A Löytynoja, AJ Vilella, and N Goldman. Accurate extension of multiple sequence alignments using a phylogeny-aware graph algorithm. Bioinformatics, accepted.
[3] A Löytynoja. PAGAN (http://code.google.com/p/pagan-msa).

Poster Number 31

Analysis of functional divergence in the EFG I and EFG II subfamilies

Author(s): Tõnu Margus1,2, Maido Remm 1,2 and Tanel Tenson1

Affiliation(s): 1University of Tartu, 2Estonian Biocentre,

Elongation factor G (EFG) is an indispensable protein whose primary function is to catalyze translocation in protein synthesis. EFG duplications in bacteria form four subfamilies: EFG I, EFG II, spdEFG1 and spdEFG2. The four EFG subfamilies are characterized by genome context conservation, evolutionary speed and by their indispensability. We have described in depth, for the first time, the EFG II subfamily (e.g., Thermus thermophilus EFG-2). This differs from the EFG I subfamily (e.g., Escherichi coli EFG) by its high levels of primary sequence divergence. To study the EFG II, we analyzed conservation of domains and motifs and identified differentially conserved positions between the EFG I and EFG II. The main EFG II specific characteristics are: low conserved functionally important GTPase domain; absence of trGTPase family specific consensus RGITI in the G2 motif; and six differentially conserved positions. The latter are related to substantial changes in the physical- chemical properties indicating EFG II specific functional changes. Interestingly, the differentially conserved positions were found within the most divergent domains of EFG II (GTPase domain and domain II). Moreover, three of these positions were located in the GTPase domain consensus elements P-loop and G2 motif. This location strongly suggests that one part of the EFG II specific functional peculiarities are associated with changes in GTP/GDP binding and hydrolysis conditions. The mapping of differentially conserved positions onto the tertiary structure revealed that another three positions in domain II point towards different interaction partners of the domain. This means that the nature of interactions between GTPase domain, ribosome 16S rRNA (h5 & h15) and domain III, are different in EFG I and EFG II. The presence of these characteristics, amongst the otherwise highly divergent sequences of EFG II, is consistent with functional peculiarities unique to this subfamily. 
The nature of these interactions is a suitable subject for investigation in future experimental designs targeting differentially conserved amino acid residues within EFG II.

Poster Number 32

Exploratory Metagenomic Analysis of Antibiotic Resistance Genes in Bacterial Communities

Paula Andrea Martinez*, Viktor Jonsson, Fredrik Boulund, Erik Kristiansson

Division of Mathematical Statistics, Department of Mathematical Sciences, Chalmers University of Technology and University of Gothenburg

* E-mail: [email protected]

The increasing prevalence of antibiotic-resistant bacteria has become a notorious threat to human health. Bacteria become resistant through resistance genes that can move between cells using horizontal gene transfer. Antibiotics are naturally produced by microorganisms in the environment and therefore bacterial communities maintain a large collection of resistance genes (the resistome). The diversity and mobility of the environmental resistome is however not well studied and further research into these issues is warranted.

The aim of this project is to explore the environmental resistome and to characterize the abundance of known resistance genes in the environment. We used 98 gigabytes of publicly available data from the community database for metagenomic data, CAMERA, including more than 650 study sites around the world. Based on this data, we identified several common antibiotic resistance gene families spread across different environments, of which the beta-lactamase TEM was the most abundant (having 41.7 % occurrence in 347 sites). We also compared different sites by clustering, and found that the resistome is highly variable. However, similarities were also found between geographically close sites and between sites from similar environments. For example, environments contaminated with antibiotics showed similarities in their resistome abundance. Additionally, we clustered the resistome, observing groups of antibiotic resistance genes with similar abundance patterns between the sites. Several of these groups could be associated with genetically linked co-resistance through known horizontally transferred elements.
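A site comparison of this kind can be sketched as follows. This is only an illustration: the abstract does not state which distance or clustering method was used, so the Bray-Curtis dissimilarity, the site names and the abundance values below are all assumptions.

```python
from itertools import combinations


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance vectors (0 = identical)."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(x + y for x, y in zip(a, b))
    return num / den if den else 0.0


def closest_sites(profiles):
    """Return the pair of site names whose resistome profiles are most similar."""
    return min(
        combinations(sorted(profiles), 2),
        key=lambda pair: bray_curtis(profiles[pair[0]], profiles[pair[1]]),
    )


# Hypothetical abundances of three resistance gene families per site.
profiles = {
    "river_A": [10, 2, 0],
    "river_B": [9, 3, 0],
    "soil_C": [1, 0, 8],
}
print(closest_sites(profiles))
```

A full analysis would feed the pairwise dissimilarities into a hierarchical clustering routine; the pairwise step above is the part that determines which sites end up grouped together.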

We conclude that metagenomics is a powerful tool for identifying antibiotic resistance genes in uncultured bacteria.

Keywords: metagenomics, environmental bacterial communities, antibiotic resistance, resistome, next generation sequencing NGS.

Poster Number 33

A web tool for discovering protein-protein interactions using sequence information

Dorota Matelska a,b, Robert B. Russell a

a Cell Networks, University of Heidelberg, Im Neuenheimer Feld 267, 69120 Heidelberg, Germany b International Institute of Molecular and Cell Biology, 4 Ks. Trojdena Street, 02-109 Warsaw, Poland

Proteins fulfill their function as part of large molecular machines that are coordinated by regulatory interaction networks. Currently, details of interaction interfaces are captured only in high-resolution 3D structures of protein complexes. Initial efforts to classify such molecular details of interaction interfaces resulted in the development of databases of domain-domain and domain-peptide interactions. However, despite the growing number of bioinformatic tools making use of protein sequences, there is no meta-service that incorporates local sequence features and uses them to search for interaction interfaces in a set of proteins.

We present a web tool that integrates various structural and sequence data and predicts possible modes of interaction between given proteins. Predictions of protein domains, sequence motifs, and homology to known structures are used to reveal several types of interfaces, i.e. domain-domain, domain-peptide and peptide-peptide interactions. Moreover, we introduce a probability measure for the possible modes and visualize them using an intuitive graph model.

Poster Number 34

MoiraiSP: a novel mitochondrial cleavage site predictor

Yoshinori Fukasawa, Szu-Chin Fu, Junko Tsuji, Noriyuki Sakiyama, Kenichiro Imai and Paul Horton

A large fraction of mitochondrial proteins are cleaved upon entry into the mitochondria, but prediction of this cleavage is still challenging. In recent years, large-scale mitochondrial proteomics research has provided large data sets of mitochondrial protein cleavage sites [1, 2]. We present MoiraiSP (Mitochondrial matrix targeting Signal Predictor), a novel mitochondrial cleavage site predictor trained on recent proteomics data. To prepare our dataset, we needed to mark intermediate cleavage events; fortunately, Vögtle et al. [1] provide data from knockout experiments determining cleavage by Oct1 and Icp55. However, many cleavage sites remained which do not follow the observation that almost all known MPP (Mitochondrial Processing Peptidase) cleavage sites occur with arginine in the -2 position (the “R-2 rule” [3]). Although explaining those cleavage sites is an interesting scientific question, for training of our predictor we filtered out all cleavage sites which do not follow the R-2 rule. We trained an SVM (LIBSVM) classifier to predict MPP cleavage sites. For the classification task, we defined several feature types. For each protease we trained a profile HMM (HMMER2) to learn the local sequence patterns near its cleavage sites, and treated the likelihood ratio score of the HMM as a feature for the SVM. To reflect distance from the N-terminus, we trained a mixture model of Γ distributions, and additionally used physico-chemical properties such as net charge and the number of charged residues. When predicting, the SVM model is used to scan the sequence for MPP cleavage sites. Secondary cleavage events are predicted using the Oct1 and Icp55 profile HMMs, respectively (Figure 1A). We developed a separate version of MoiraiSP for plants, because they exhibit important differences from yeast, including generally longer MTSs [2] and the lack of an Oct1 protease.
Due to the relatively small plant dataset, we used the yeast-trained profiles to model the sequence preferences around the MPP site but retrained the Icp55 profile on plant data. We also modeled the distance from the MPP cleavage site to the original N-terminus with a mixture model of Γ distributions fit to the plant data. Using 10-fold cross-validation on a non-redundant dataset, we estimated the performance of MoiraiSP and compared it to two previous predictors [4, 5], as summarized in Table 1 and Figure 1. MCC stands for Matthews correlation coefficient and shows the performance of classification between cleaved and non-cleaved mitochondrial proteins. The preliminary results indicate that, having the advantage of a large training dataset of cleavage sites, MoiraiSP makes more accurate predictions than previous methods.
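Two ingredients of the description above are simple enough to sketch: the R-2 rule used to filter training sites, and the Matthews correlation coefficient used for evaluation. The function names, the index convention and the example sequence are illustrative assumptions, not part of MoiraiSP.

```python
import math


def follows_r2_rule(sequence, cleavage_index):
    """True if arginine (R) sits two residues before the cleavage site.

    cleavage_index is the 0-based index of the first residue of the mature
    protein (an assumed convention), so position -2 is
    sequence[cleavage_index - 2].
    """
    return cleavage_index >= 2 and sequence[cleavage_index - 2] == "R"


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical presequence; cleavage between index 9 and 10.
print(follows_r2_rule("MLSLRQSIRFFK", 10))
print(round(mcc(tp=50, tn=40, fp=5, fn=5), 3))
```

The MCC ranges from -1 to 1 and stays informative when the cleaved and non-cleaved classes differ in size, which is one reason it is commonly reported for this kind of classifier.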

Table 1. Preliminary classification result for yeast dataset [1]

        MoiraiSP         TargetP [4]    MitoProtII [5]
MCC     0.794 ± 0.094    0.582          0.552

Figure 1. (A) Flow of MoiraiSP. (B) Result of cleavage site prediction in the yeast dataset [1]. The y-axis shows the fraction of cleaved proteins whose cleavage position is correctly predicted, and the x-axis shows the accepted distance between the predicted site and the actual site determined by experiments. (C) Result for the plant dataset [2].

References:
[1] Vögtle, F.-N. et al., Cell, 139, 428-39, 2009.
[2] Huang, S. et al., Plant Physiology, 150, 1272-85, 2009.
[3] Gakh, O., Biochimica et Biophysica Acta (BBA) - Molecular Cell Research, 1592, 63-77, 2002.
[4] Emanuelsson, O. et al., Journal of Molecular Biology, 300, 1005-16, 2000.
[5] Claros, M.G. and Vincens, P., European Journal of Biochemistry / FEBS, 241, 779-86, 1996.

Poster Number 35


Nanomechanics of proteins at synaptic junction

W. Nowak*, K.Mikulska, R.Jakubowski, L.Pepłowski, J.Strzelecki

Theoretical Molecular Biophysics Group, Institute of Physics, N. Copernicus University, Grudziądzka 5, 87-100 Toruń, Poland * presenting author ([email protected])

Mechanical stability of the synaptic junction is of paramount importance for proper functioning of the brain. Out of hundreds of proteins present in the cleft, some form pairs linking pre- and post-synaptic neurons, for example neuroligins (NLG) and neurexins (NRX). Others, such as contactins (CNTN), contribute to proper functioning of Ranvier nodes and early formation of a neuronal network. A better understanding of the nanomechanics of these modular proteins may help in the future to engineer molecular machines working in a neural membrane or the extracellular matrix environment. Moreover, recent genetic studies indicate that mutations in genes coding for these proteins lead to severe diseases such as autism. In order to better understand the mechanical properties and stability of synaptic junction proteins we combined single-molecule experimental techniques, such as Atomic Force Microscopy (AFM), and theoretical methods, such as Steered Molecular Dynamics (SMD) simulations [1,2], to unfold selected modular, adhesive proteins such as CNTN4 [3,4]. Computer simulations of mechanical unfolding, despite the known mismatch with experimental timescales, provide information on intra-molecular interactions critical for protein functionality. In this presentation, for the first time, we will show results of SMD unfolding of the whole CNTN4 protein (100 ns timescale, 10 modules) and an enforced dissociation of a NRX-NLG pair in the presence (and absence) of calcium ions. Problems arising when natural interactions of protein modules with other signaling molecules are affected by molecular strain will be discussed. We believe that our computational approach, together with bioinformatical analysis of homologous systems, provides new scientific data on the biological role of these abundant proteins.

Acknowledgements: Support from Polish Funds for Science (grant No. N N202 262038 and the nationwide license for Accelrys software) is acknowledged. Calculations were performed at the Computational Center TASK in Gdansk and UMK Torun. UMK grants (2011) to KM and JS are also acknowledged.

References:
1. W. Nowak, P. Marszalek, 2005, Molecular Dynamics Simulations of Single Molecule Atomic Force Microscope Experiments, Current Trends in Computational Chemistry, 47-83.
2. Ł. Pepłowski, M. Sikora, W. Nowak, M. Cieplak, 2011, Molecular jamming - The cystine slipknot mechanical clamp in all-atom simulations, J. Chem. Phys., 134: 085102-1 - 085102-14.
3. K. Mikulska, Ł. Pepłowski, W. Nowak, 2011, Nanomechanics of Ig-like domains of human contactin (BIG-2), J. Mol. Model. (Springer), 17: 2313-2323.
4. K. Mikulska, J. Strzelecki, A. Balter, W. Nowak, 2012, Nanomechanical unfolding of α-neurexin – a major component of the synaptic junction, Chem. Phys. Lett. (Elsevier), 521: 134-137.

Poster Number 36

StoreBioinfo - high capacity storage for Life Sciences projects in Norway

Kjell Petersen1, H. Sagehaug7, S. Omholt8, K.S. Jakobsen9, N.P. Willassen6, F. Drabløs2, E. Hovig3,4 and I. Jonassen1,5

1Computational Biology Unit, Uni Computing, Uni Research AS, Bergen.
2Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology (NTNU), Trondheim.
3Medical Informatics and Department of Tumor Biology, Norwegian Radium Hospital.
4Institute for Informatics, University of Oslo.
5Department of Informatics, University of Bergen.
6Department of Molecular Biotechnology, Institute of Medical Biology, Faculty of Medicine, University of Tromsø.
7Parallab, Uni Computing, Uni Research AS, Bergen.
8Centre for Integrative Genetics, Norwegian University of Life Sciences (UMB), Ås.
9Centre for Ecological and Evolutionary Synthesis (CEES), University of Oslo.

Abstract:

The overall aim of the StoreBioinfo project is to provide life science users with integrated access to NorStore storage resources and Notur computational resources. Notur is the Norwegian national metacenter for computational resources as well as the coordinator of the NorStore project for high capacity storage of scientific data.

The StoreBioinfo project originates from the Norwegian FUGE Bioinformatics platform, extended with representatives from UMB at Ås and CEES at UiO. The platform is coordinated from the Computational Biology Unit at the University of Bergen.

The main deliverables of the StoreBioinfo project are to i) manage a large block quota of storage on behalf of the Life Sciences community in Norway, and ii) develop e-services for better integration of NorStore and Notur resources in the tools of the Bioinformatics platform, and the Life Sciences community in general.

This work presents the operation of the StoreBioinfo project as well as the developed infrastructure solutions in the project.

Keywords: Large scale data storage, structured storage, sharing project data, interactive data access, programmatic data access.

Poster Number 37

Transcriptome analysis reveals genes involved in early cone setting

Tree products are among Sweden's largest exports. Tree selection programs to increase the benefit have started but are progressing slowly due to the long generation time of trees. This is especially true for the Norway spruce, which sets cones after 20 years. We are interested in identifying the genes involved in cone setting and using those to reduce the generation time of Norway spruce. A naturally occurring mutant of the Norway spruce called Acrocona sets cones after just four years. We have collected samples from both Norway spruce and Acrocona during development in order to identify the genes involved in early cone setting.

Since the genome is large, more than 20 gigabases, no genome or transcriptome sequence for Norway spruce is available. We have, by performing de novo assembly on RNAseq data, created a transcriptome covering approximately 80 percent of all transcripts in spruce. By comparing the transcriptomes of Norway spruce and Acrocona we are generating an Acrocona-specific SNP library. We have further analyzed the differential expression patterns of the transcripts across cell types and time points to identify genes involved in cone setting. Among other results we found one transcription factor, known to be important in flower development in other plants, which we hypothesize to be one of the key players in the initiation of cone setting. The poster will explain the pipeline and the results that led to this hypothesis.

Poster Number 38

Screening for few but independent biomarker candidates by a genetic algorithm

Dirk Repsilber1, Lena Scheubert2 and Georg Fuellen2

1) Leibniz Institute for Farm Animal Biology, Dummerstorf, Germany
2) Institute for Biostatistics and Informatics in Medicine and Ageing Research, University of Rostock, Germany

We present an approach for screening OMICs datasets explicitly for small discriminating biosignatures. A genetic algorithm is used together with a statistical learning algorithm combined in a wrapper, enabling us to reward small biosignatures in particular. Interestingly, the resulting signatures appear to be enriched in features which contribute independently to the observed patterns. Features selected by this method are compared to features selected by other common algorithms, with respect to pairwise mutual information among the top candidates. Two examples of schoolbook-like 2D gene expression patterns are presented from two datasets, involving mouse gene expression data on pluripotency as well as human brain tissue gene expression data related to Alzheimer’s disease. We suggest that our approach successfully proposes small biosignature candidates by eliminating redundancy in the resulting sets of features.
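The pairwise mutual information used to compare candidate features can be sketched as below. This is a minimal illustration that assumes the expression values have already been discretized (for example into low/high); the abstract does not say how mutual information was estimated.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Mutual information (in bits) between two discrete feature vectors."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log2((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in pxy.items()
    )


# Hypothetical discretized expression of three genes across four samples.
gene_a = [0, 0, 1, 1]
gene_b = [0, 0, 1, 1]   # redundant with gene_a
gene_c = [0, 1, 0, 1]   # independent of gene_a
print(mutual_information(gene_a, gene_b), mutual_information(gene_a, gene_c))
```

Low mutual information among the selected features is what one would expect from a signature whose members contribute independently rather than redundantly.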

Scheubert, L., Schmidt, R., Repsilber, D., Lustrek, M. & Füllen, G. (2011). Learning biomarkers of pluripotent stem cells in mouse. DNA Research, 18, 233–251.

Poster Number 39

HumLoc: An Integrated Service for Coarse-Grained Subcellular Localization

Arcadio Rubio1, Bernard de Bono2,3, Henrik Nielsen1 and Ramneek Gupta1

1Center for Biological Sequence Analysis, Technical University of Denmark
2European Bioinformatics Institute
3Auckland Bioengineering Institute, University of Auckland

Background

Proteins are directed to cellular compartments by peptide sequences that act as targeting signals. Mislocalization due to disrupted signaling caused by sequence mutations is likely to have a major impact on protein function, as well as on physiological processes that such a function brokers. Localization models achieve good predictive performance for most individual cell compartments, but fail to scale to many transport signals simultaneously or to integrate existing annotations.

Methods

We describe HumLoc, a protein subcellular annotation pipeline aimed at Homo sapiens and closely related mammalian organisms. This tool enriches expert-curated localization information from UniProt with machine learning predictions to maximize coverage and provide a one-stop shop. Integration of both types of information is achieved by mapping compartments to a 3-element ontology (extracellular, cell membrane or intracellular) which eliminates granularity differences in annotations and makes prediction more tractable.
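The mapping to a 3-element ontology might look roughly like the sketch below. The compartment names, their assignments and the rule of preferring curated annotations over predictions are assumptions made for illustration; they are not taken from HumLoc itself.

```python
# Hypothetical mapping from fine-grained compartments to the three coarse
# classes; the entries are illustrative, not HumLoc's actual table.
COARSE_CLASS = {
    "secreted": "extracellular",
    "extracellular matrix": "extracellular",
    "plasma membrane": "cell membrane",
    "nucleus": "intracellular",
    "cytoplasm": "intracellular",
    "mitochondrion": "intracellular",
}


def coarse_localization(annotations, predicted=None):
    """Map curated annotations to coarse classes; fall back to a prediction.

    Using the prediction only when no curated annotation maps is an assumed
    integration rule.
    """
    classes = {
        COARSE_CLASS[a.lower()] for a in annotations if a.lower() in COARSE_CLASS
    }
    if classes:
        return sorted(classes)
    return [predicted] if predicted else []


print(coarse_localization(["Nucleus", "Cytoplasm"]))
print(coarse_localization([], predicted="cell membrane"))
```

Collapsing annotations in this way removes granularity differences: "Nucleus" and "Cytoplasm" become the same label, so curated and predicted localizations can be compared directly.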

Results

The prediction pipeline of HumLoc achieves an estimated 83% correct classification rate assigning proteins to the 3-element ontology. It is of special interest for the interpretation of GWAS results. Given a set of SNPs, HumLoc can be used to filter those which are predicted to alter localization, potentially leading to a disease. We present such a set of germline and somatic mutations, in addition to some general findings about mislocalization SNPs.

Availability

A preliminary web interface as well as a web service can be accessed at http://cbs.dtu.dk/cgi-bin/humloc-2.0.cgi. These allow browsing existing subcellular annotations and precomputed predictions. Furthermore, the submission of novel sequences is also possible.

Poster Number 40

IMPROVED GAP SIZE ESTIMATION FOR SCAFFOLDING ALGORITHMS

KRISTOFFER SAHLIN, NATHANIEL STREET, JOAKIM LUNDEBERG, AND LARS ARVESTAD

Abstract. Motivation: One of the important steps of genome assembly is scaffolding, in which contigs are linked using information from read-pairs. Scaffolding provides estimates of the order, relative orientation and distance between contigs. We have found that contig distance estimates are generally strongly biased and based on false assumptions. Since erroneous distance estimates can mislead subsequent analysis, it is important to provide unbiased estimation of contig distance. Results: We show that state-of-the-art programs for scaffolding are using an incorrect model of gap size estimation. We discuss why current ML estimators are biased and describe the different cases of bias we are facing. Furthermore, we provide a model for the distribution of reads that span a gap, and derive the ML equation for the gap length. We motivate why this ML estimate is sound and show empirically that it outperforms gap estimators in popular scaffolding programs. Our results have consequences for scaffolding software, for structural variation detection, and for library insert-size estimation as is commonly performed by read aligners. A new scaffolding tool (BESST) is also presented. BESST has a new way of inferring spurious links between contigs based on the gap estimator mentioned above.
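One source of bias can be illustrated with a toy model. The sketch below is not the authors' model or their ML estimator: it simply assumes that a read pair with insert size x can be placed across a gap in (x - gap - 2 * read_len + 1) ways, so longer inserts are over-represented among gap-spanning pairs, and then evaluates a naive estimator (library mean minus mean observed distance) exactly.

```python
def spanning_weights(insert_probs, true_gap, read_len):
    """Insert-size distribution among pairs that span the gap (toy model)."""
    weights = {}
    for x, p in insert_probs.items():
        placements = x - true_gap - 2 * read_len + 1  # assumed placement count
        if placements > 0:
            weights[x] = p * placements
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}


def naive_gap_estimate(insert_probs, true_gap, read_len):
    """Expected value of (library mean) - (mean distance seen on the contigs)."""
    mean_all = sum(x * p for x, p in insert_probs.items())
    span = spanning_weights(insert_probs, true_gap, read_len)
    mean_observed = sum((x - true_gap) * p for x, p in span.items())
    return mean_all - mean_observed


# Hypothetical library: insert sizes uniform on 300..500, reads of 50 bp.
insert_probs = {x: 1 / 201 for x in range(300, 501)}
print(naive_gap_estimate(insert_probs, true_gap=100, read_len=50))
```

Under these assumptions the naive estimate falls below the true gap of 100, because the pairs that happen to span the gap have larger inserts on average than the library as a whole.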

KTH Royal Institute of Technology, Science for Life Laboratory, School of Computer Science and Communication, Solna, Sweden.
Umeå Plant Science Centre, Department of Plant Physiology, Umeå University, Sweden.
KTH Royal Institute of Technology, Science for Life Laboratory, School of Biotechnology, Division of Gene Technology, Solna, Sweden.
Swedish eScience Research Centre (SeRC), Department of Numerical Analysis and Computing Science, Stockholm University.
KTH Royal Institute of Technology, Science for Life Laboratory, School of Computer Science and Communication, Solna, Sweden.
E-mail addresses: [email protected].

Poster Number 41

Title:

Comparative interactomics with FunCoup

Abstract:

FunCoup (http://FunCoup.sbc.su.se) is a database that maintains and visualizes global gene/protein networks of functional coupling that have been constructed by Bayesian integration of diverse high-throughput data. FunCoup achieves high coverage by orthology-based integration of data sources from different model organisms and from different platforms. Network links are annotated with confidence scores in support of different kinds of interactions: physical interaction, protein complex membership, metabolic, or signaling link. The current release, version 2.0, integrates 70 large-scale experimental datasets of such diverse types as: mRNA expression, protein expression, sub-cellular localization, protein-protein interaction, miRNA-mRNA targeting, transcription factor binding, phylogenetic profile, genetic interaction, and domain-domain interaction. A total of 22 million links has been predicted for 11 different species including the major model organisms. The FunCoup website allows query-based analysis of conserved subnetworks in multiple species.

Poster Number 42

Tracking a complete voltage-sensor cycle with metal-ion bridges and modulation through neurotoxins

Ulrike Henrion‡, Jakob Renhorn‡, Sara I. Börjesson‡, Erin M. Nelson‡, Christine S. Schwaiger†*, Pär Bjelkmar†, Björn Wallner‡, Erik Lindahl†, and Fredrik Elinder‡

† KTH Royal Institute of Technology, Stockholm, Sweden
‡ Linköping University, Sweden
* [email protected]

Voltage-gated ion channels open and close in response to changes in membrane potential, thereby enabling electrical signaling in excitable cells. The voltage sensitivity is conferred through four voltage-sensor domains (VSDs) where positively charged residues in the fourth transmembrane segment (S4) sense the potential. While an open state is known from the Kv1.2/2.1 X-ray structure, the conformational changes underlying voltage sensing have not been resolved. We present 20 additional interactions in one open and four different closed conformations based on metal-ion bridges between all four segments of the VSD in the voltage-gated Shaker K channel. A subset of the experimental constraints was used to generate Rosetta models of the conformations that were subjected to molecular simulation and tested against the remaining constraints. This achieves a detailed model of intermediate conformations during VSD gating. The results provide molecular insight into the transition, suggesting that S4 slides at least 12 Å along its axis to open the channel, with a 3₁₀-helix region present that moves in sequence in S4 in order to occupy the same position in space opposite F290 from the open through the first three closed states. Additionally, we study how neurotoxins, such as Hanatoxin1, can modulate the gating process. Through free energy calculations and further electrophysiological experiments we try to understand why the toxin stabilizes the resting over the activated state and identify the important residues ensuring the toxin’s high binding affinity.

Poster Number 43

Combining de Bruijn graph, overlaps graph and microassembly for de novo genome assembly

Anton Alexandrov, Sergey Kazakov, Sergey Melnikov, Alexey Sergushichev1, Anatoly Shalyto, Fedor Tsarev

St. Petersburg National Research University of Information Technologies, Mechanics and Optics, Genome Assembly Algorithms Laboratory, 197101, Kronverksky pr. 49, St. Petersburg, Russia

In this paper we present a method for de novo genome assembly that splits the process into three stages: quasicontig assembly; contig assembly from quasicontigs; and contig postprocessing with microassembly. We assembled the E. coli genome from the Illumina Genome Analyzer paired-end read library SRR001665 (160-fold coverage, insert size of about 200 bp) and obtained 247 contigs with an N50 size of 53720, covering 98% of the reference genome. The first stage uses a de Bruijn graph built from all the input data. For each pair of reads we search for a path connecting the reads’ starting k-mers, assuming that the reads are directed inwards. To do so, we search for all paths connecting these k-mers whose lengths are bounded above and below by a priori limits on the insert size. This is done by a pair of simultaneous breadth-first searches starting from the k-mers. If all paths found have the same length and are similar to each other, then we have a sequence likely to be in the genome. We call these sequences quasicontigs, as they are far from being contigs but are longer than raw reads. The second stage uses the previously assembled quasicontigs. First, short ones are discarded to reduce the input to a reasonable size, e.g. 10-fold coverage can be kept. Then contigs are assembled with an algorithm based on the overlap-layout-consensus (OLC) approach. The third stage is similar to OLC and scaffolding: we try to order the contigs and fill the gaps between them. First, all of the paired-end reads are aligned to the contigs using Bowtie (reads in a pair are aligned independently). If both reads in a pair are aligned, but to different contigs, the reads are called bridging and the contigs are called bridged (see Figure). For every pair of bridged contigs we can infer their order from the orientations of the alignments of the bridging reads.
After that, all pairs of reads with at least one read aligned to one of these contigs are used to build a relatively small de Bruijn graph (hence the name microassembly).
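The path search used in the first and third stages can be sketched as follows. This is a simplified illustration rather than the authors' implementation: it runs a single forward breadth-first search instead of two simultaneous ones, ignores reverse complements and sequencing errors, and the function names and toy reads are made up.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each k-mer to the set of k-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph


def bounded_paths(graph, start, end, min_len, max_len):
    """Spell every sequence from start to end with min_len <= length <= max_len."""
    k = len(start)
    found = []
    frontier = [start]  # all sequences of the current length, one BFS level
    for length in range(k, max_len + 1):
        next_frontier = []
        for seq in frontier:
            last = seq[-k:]
            if last == end and length >= min_len:
                found.append(seq)
            for nxt in graph.get(last, ()):
                next_frontier.append(seq + nxt[-1])
        frontier = next_frontier
    return found


# Toy reads; the length bounds play the role of the a priori insert-size limits.
graph = build_de_bruijn(["ACGTAC", "GTACGG"], k=3)
print(bounded_paths(graph, "ACG", "CGG", min_len=6, max_len=10))
```

In this toy graph the k-mer ACG branches and sits on a cycle, so several paths exist in principle; the length bounds leave exactly one, which mirrors how the insert-size limits are used to decide whether a quasicontig is unambiguous.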

Figure. Contigs A and B are bridged, reads a1 and a2 are bridging, pairs (b1, b2) and (c1, c2) can be used for microassembly.

As the graph is small and “local”, we are likely to find a path connecting the reads in a bridging pair using the same technique as in the first stage of the whole algorithm (quasicontig assembly). This path gives us the distance between the contigs and a filling sequence. Once the distance is determined (exactly, unlike in scaffolding) we have a layout task similar to that of the second stage. On the E. coli dataset, the first stage produced about 10 million quasicontigs with a total size of 2 Gbp. This data was then truncated to 175 Mbp. After the second stage there were 525 contigs with an N50 size of 17804 and a maximum size of 73908. After the third stage there were 247 contigs with an N50 size of 53720 and a maximum size of 167319.
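For reference, the N50 statistic quoted above can be computed as in the following sketch, which uses the common definition (the length of the contig at which the running total, taken from the longest contig downwards, first reaches half of the total assembly size). The example lengths are made up.

```python
def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


print(n50([2, 3, 4, 5, 6]))
```

Because N50 depends only on the length distribution, it rewards merging contigs, which is why it rises between the second and third stages as gaps between bridged contigs are filled.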

1 Corresponding author. Email: [email protected]

Poster Number 44

Computational analysis of membrane protein topology evolution

Nanjiang Shu and Arne Elofsson

Department of Biochemistry and Biophysics, Stockholm University, 10691 Stockholm, Sweden

Traditionally, the topology of integral membrane proteins (IMPs) is viewed as a bundle of helices roughly perpendicular to the membrane plane. Further, the topology within a protein family has been believed to be quite conserved in evolution. However, recent observations provide a much more complex picture of the structure of IMPs: homologous proteins can adopt opposite orientations, and internal duplications are common (von Heijne, 2006). Here, we aim to gain deep insights into how the topology varies within protein families on a genomic scale.

By analyzing tens of thousands of pairs of homologous IMPs at various sequence identities, we observed that the fraction of pairs with identical topology increases with the sequence identity (see Figure 1). Moreover, we noticed that about 4% of pairs adopt inverted topology across the sequence identity ranges. This probably indicates that proteins with dual topology are widespread in protein families. Homologous pairs with shifted topology are common when sequence identity is low. However, their fraction drops quickly as the sequence identity increases. The high occurrence of shifted topology in the low sequence identity range is most probably caused by unreliable alignment when sequence identity is low. A small fraction of pairs have different topologies just because one is a duplicated form of the other. Such duplications make up about 1% of all cases (shown in cyan in Figure 1).

Furthermore, we analyzed topology evolution in all families of transmembrane proteins by multiple comparison of topologies. Figure 2 shows an example of the phylogenetic tree of the family Cytochrome C assembly protein (Pfam ID: PF01578), highlighted by topology comparison. The tree clearly shows the evolution of the topology: in this family, for example, roughly 7 major topology changes are found in the phylogenetic tree. We found that the families Histidine kinase (Pfam ID: PF07730), PAP2 superfamily (Pfam ID: PF01569) and CAAX amino terminal protease family (Pfam ID: PF02517) are the families with the most varied topology. They are probably the most evolved families.

From our study, we discovered that topology variations such as topology shifting, inversion and duplication exist extensively in protein families.

Figure 1: Topology comparison versus sequence identity. The relationships of the compared topologies are categorized in the following groups: identical (IDT), inverted (INV), shifted (SHIFT), inverted and shifted (INV_SHIFT), duplicated (DUP) and other different topology (Others). The numbers above each bar are the number of pairs included in the statistics at each sequence identity range. Topology was predicted by TOPCONS (Bernsel et al., 2009). x-axis: sequence identity (0-10, 10-20, ..., 90-100); y-axis: percentages of topology comparison classes. Number of cases per sequence identity range: 5597, 25555, 28043, 28848, 28878, 28905, 17929, 8417, 4349, 3798.

[Figure 2: phylogenetic tree of the family Cytochrome C assembly protein (Pfam ID: PF01578), highlighted by topology comparison. Legend: numTM, cmpclass, NtermState, cluster.]

References

148254513 316931825 17146398589898862 296113455 114330090 148652611 92116033 296106440 297537638 188576673226941010 302878979 227823748 121998151114331624291614252257092357 31978687119436647724166403031110562771909606 241206748 171058233 18857585319429141482702338 56477421237654645121603497 325294058 224583264291283447308048025 53802370114320840124266404 153007453 90021516294142898 283786007 119900219 256368562 261821978254785938 300113509 260598762152997413 110635688 288935069 71280528288941892 319898304 308187654300717721 319784210 319776653120554872152982910192360395 17071888915601875261868646322831922294504673157371623307546315120554334163858388 1901506875242466126913980183647060226945107152980890218892647 315127284 53804697 320155695209695708 300113512 11994664054308145 188575856 11083369725682220819615786856460209145298379 182682084 188576861 194366474 319786874 114320838 121998148 171058230 124266406 226941007 297537635 311112098 71909603 184200137 257092360 239918273 171463988 308178243 163841439 89898865 50954031 288941889 170782902 302878982 28493726 157283889 291614255 256831764 121603499 269794058 114330092 269955255 82702341 229821730 114331627 256826202 290958313 119716429 56477418 256397179 312200090 119900222 117927463 237654648 29756392572163100 311105630 269128623 264678148 271962146 296268264 241664027 296106437 194291417 312141071 54027129 148653963 262201034 296112400 315442395296138425 110833733 256822205 296394098 257054374 192359499 300857705 300782431256380695 134103357 90021513 258651172284992937 152997416254785935 159040025 307546318 315501132 296132650 291298351 120554875 300859261297572017317124999167630032 21889265083647063 219669376 226945110 83591023 260893727 163858386 188585746 260598765 94967780 158321386 152983049 297617841 288935066 260893906 224585617261821981 187732266 322421892 283786010 322421446 269139804 302038055 294504676 118581778 322831919 296132644 157371626 322418130 319789850 322418153 307718894 300717724 296132546 
308187657 296132938 297570490 116749024 118579683 169827293212638879 56460212 296133716 297529551 296134347 222151386 23099275226312003 158523247218781276 51891874 320354696 190150690 169828198 321311785 51892868 52424658 85858416224368201 15601872 229916171 320355201 170718886 171060789 51246932 320155692261868649 Bernsel, A., Viklund, H., Hennerdal, A., and Elofsson, A. (2009). TOP- 54308143 319776656 302341771 209695711 160901019 218781889 262199946 319761548 121611739222109907 116622283 121603577 220916564 162449894 319791906 71279652 94969592 294142903 124265627163859236 206889887 308048020 152981762 310820648 196157871 311109559134096287 108760340 145298376 300309497194290988 315127287 241664498 322435845 119946643 312795813 319778686 74316203 258405587 3449984322694201191774590 25682952894986765 320106344 78484395 187734606 217969948114321998 261856748 224368993 118580999 289207225 288817691289548408225847895218779119158521117116748142 320105063 32035406651245533 21866692615607018 322418178 302036218 297568234 294055944189220200 313669019 319789488 297568745 325294751188585748

85857935 182413906 195953423 317153248 302344540 32476107 325106616 325295449319789338

291288796 313680474 206889203 225848627 146329354 313672451 317052494 225849843291279207 193212403 226311389 193215987 310643597 194336062 15605652 195952539 194334344 313673930 189499891

288817385 321312348

23099524 CONS: consensus prediction of membrane protein topology. Nucleic 297529190212638452

319957090296273147 225850978 289548322 268679121 159903538 307720937152990931 16329589 34558438 166362931 319957122 268679338315123645 313682516 169829411 317052491 186685993 34556815 220909823 152992898 170078526 37521591 34557356 315452579 22299158 158334881 268679347 258511806 255087388 256822049 224372605

145356516 222151552 196156498 295134171 305667530 298208529 42523110 120436160 256820587 325298461 315126106 305666692 150025477325288045 313205906313203346 308048456 313676828 260062664 7525084 145297734294139725 319892716 150007282 320155361 260060851

256422791 294675423 54310153 220916858 209694238 119946974 71277916

193212069

320105099

237809534 269140195 157369095322831428

261855287 182414636 194337750 299771375 313206425 288942106

182416211 53804559 189500965 189499622 294505010 182414404 150006785 220916957 194334643 56460829 218889975

256819940 300114673 148652403 187736408 261820437 307132205 188578080 114321088 226946000 52424573

92115133

182680698296113558 253990680 190150708 188995612 289207610 121997248 194364927 307546455 300718049 261867192 308187877 300722259197284281 283781948 120555224 85058522 170718342 319786593 288934013260598990 238899330 311278431 224584531 15603049 46446728 192361942 325111302 302879900 291283883 254785307 78485341 110833657

297539517 291615195 297621219 90020848 83644604

91774876 Acids Res., 37(Web Server issue), W465–468. 313201906

212218939 114331693

119899522

56477902 32475464 217969598

71906281

257094572 152997773

34499287

74318390

226941399

117925956

312795095

241664228

313669228

163858645

300309718

311108980

134095938

152981551

83311463

288958351 319779086

171463018

296135008

89902112

124268403

171057068

319791707

121608472

222112107

319764322 160896953

121606269 von Heijne, G. (2006). Membrane-protein topology. Nat. Rev. Mol. 264680209 Cell Biol., 7(12), 909–918.

Figure 2: Phylogenetic tree of the Cytochrome C assembly protein family (Pfam ID: PF01578). From innermost to outermost, the circles represent topology relationship categories (as in Figure 1), inside/outside status of the N-terminus (red for inside, blue for outside), topology clustering, and number of transmembrane helices, respectively. Topologies were predicted by TOPCONS.

Poster Number 45

ULTRA RAPID, ACCURATE QUALITY ASSESSMENT OF PROTEIN STRUCTURE MODELS

Marcin J. Skwark1,2 & Arne Elofsson1,2*
Dept. of Biochemistry and Biophysics1, Stockholm University. Science for Life Laboratory2, Stockholm, Sweden
*[email protected]

INTRODUCTION Prediction of the 3D structure of proteins is one of the major goals of contemporary bioinformatics. Each predicted model needs an independent measure of its correctness. This is the role of Model Quality Assessment programs (MQAPs). Traditionally, MQAPs focused on evaluating structural features of predicted models to assess how likely a model is to resemble a native protein structure. These approaches are capable of detecting non-physical model conformations, but not of discriminating which of two biophysically feasible structure models is in the correct conformation. The advent of consensus methods alleviated this problem. Consensus methods are based on the premise that, among different models of the same protein, the one that is most similar to the others is most likely to be correct.

Most consensus-based MQAPs rely on structural superposition, which is computationally expensive and therefore makes consensus approaches infeasible for larger model ensembles. Additionally, structural superposition does not account for the conformational flexibility of proteins.

METHOD The approach presented in this work does not rely on structural superposition, but rather on comparison of inter-atom distance matrices. It is at least as effective at selecting the most accurate models from the model ensemble as world-leading consensus methods. The increase in selection accuracy is particularly notable for more difficult targets, where there is no evident largest cluster of structurally similar models [1]. Because it uses a streaming computing platform (an off-the-shelf CUDA-compatible GPU), it obtains at least a 10-fold speed-up in comparison to the other approaches, with no upper bound on the number of models in the ensemble or on the model size.
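The idea of scoring models by comparing distance matrices instead of superimposing structures can be sketched in a few lines. This is a minimal CPU illustration, not the PconsD implementation: the Gaussian similarity `exp(-(d_a - d_b)^2)`, the use of one point per residue, and all function names are assumptions made for the example.

```python
import math
from itertools import combinations


def distance_matrix(coords):
    """Pairwise Euclidean distances between the points of one model."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]


def pair_similarity(da, db):
    """Mean similarity of corresponding distances in two models (1.0 = identical).

    The Gaussian kernel used here is an assumed, illustrative choice.
    """
    n = len(da)
    total = 0.0
    count = 0
    for i, j in combinations(range(n), 2):
        diff = da[i][j] - db[i][j]
        total += math.exp(-diff * diff)
        count += 1
    return total / count


def consensus_scores(models):
    """Score each model by its mean similarity to all other models."""
    mats = [distance_matrix(m) for m in models]
    scores = []
    for a, ma in enumerate(mats):
        others = [pair_similarity(ma, mb) for b, mb in enumerate(mats) if b != a]
        scores.append(sum(others) / len(others))
    return scores
```

Because only internal distances are compared, a translated or rotated copy of a model scores as identical to the original without any superposition step. Each model pair is independent of the others, which is the kind of workload that maps well onto a GPU.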

RESULTS The method presented in this work, PconsD Q-score (a GPU implementation of a distance-driven quality metric), is significantly faster than other model quality assessment methods for non-trivial targets. It is up to 60 times faster than ModFOLDclustQ [2], another method relying on the same principle, and approximately 8-10 times faster than superposition-based methods such as Pcons [3]. Additionally, it scales very well with both model length and the number of models (see Figures 1 and 2).

Increased performance allows for much shorter turnaround times, thus enabling options not feasible with other approaches (e.g. iterative modelling, nearly real-time assessment etc.).

While it is intuitively obvious that quality assessment methods based on distance matrix comparison do not correlate perfectly with superposition-based metrics, PconsD has been demonstrated to outperform superposition-based methods in its capability to select the most native-like decoy in the set.

REFERENCES
1. Ben-David M. et al. Proteins 26(Suppl 9)7, 50-76 (2009)
2. McGuffin, L.J. & Roche, D.B. Bioinformatics 26(2), 182-186 (2009)
3. Wallner B. et al. Nucleic Acids Research 35(Suppl 2), W369-W374 (2007)

Poster Number 46

Optimal Sparsity Criteria for Network Inference

Network inference is an intense area of research in systems biology. Most contemporary inference methods rely on a sparsity parameter, which we call zeta, to obtain sparse network estimates. Since small changes in zeta can lead to very different networks, it is crucial to set this parameter correctly. We here propose a method for optimizing zeta that maximizes the accuracy of the predicted network for any given inference method and data set. Our procedure is based on leave-one-out cross optimization and selection of the zeta value that minimizes the prediction error. We demonstrate that our zeta optimization method gives accurate predictions of the network structure for two widely used inference algorithms, Glmnet and NIR, provided that the data are informative enough. We also use a simple least squares approximation algorithm with a link strength threshold cutoff to demonstrate the effect of our method. Our results hence show how to improve the experimental workflow from data to meaningful transcriptional networks.

Poster Number 47

PoSSuM: a database of known and potential ligand-binding sites in proteins

Jun-Ichi Ito1,2,4, Yasuo Tabei3, Kana Shimizu2, Koji Tsuda2,3 and Kentaro Tomii1,2

1. Department of Computational Biology, Graduate School of Frontier Sciences, The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, Chiba 277-8568, Japan
2. Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-ku, Tokyo 135-0064, Japan
3. Minato Discrete Structure Manipulation System Project, ERATO, JST, Sapporo 060-0814, Japan
4. Present address: National Institute of Biomedical Innovation (NIBIO), Saito-Asagi, Ibaraki, Japan

We proposed an ultrafast alignment-free method that can compare over 1 million ligand-binding sites in the Protein Data Bank (PDB) [1]. In our method, ligand-binding sites are first encoded as feature vectors based on their physicochemical and geometric properties. Once the ligand-binding sites have been converted to bit strings, called structural sketches, which are obtained by random projections of the feature vectors, a multiple sorting method is applied to enumerate all similar pairs in terms of the Hamming distance. We created a new database, called Pocket Similarity Search using Multiple-sketchsorts (PoSSuM), to compile all similar pairs detected using our method [2]. As the source dataset, we concatenated the following two sets: 226,630 small molecule-binding sites obtained from protein–ligand complexes in the PDB, and 3,134,413 potential ligand-binding sites identified using an existing pocket detection algorithm. We applied our method to all-pair similarity searches for the 3.4 million known and potential ligand-binding sites. Consequently, we discovered ca. 24 million similar binding sites, which is the largest-scale study of binding site comparison for PDB entries ever reported. We provide those results as a relational database including all the discovered pairs with annotations of various types such as CATH, SCOP, EC numbers, and Gene Ontology (GO) terms. Users can therefore easily scrutinize similar ligand-binding sites between proteins with different folds, or similar sites between enzymes with different EC numbers. Users can also browse superpositions of similar sites with the Jmol viewer. Our database is expected to be useful for annotation of protein functions and rapid screening of target proteins in drug design. The PoSSuM database is available for use by researchers at http://possum.cbrc.jp/PoSSuM/.
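The step from feature vectors to bit-string sketches compared by Hamming distance can be illustrated as follows. This is a minimal sketch assuming sign-of-random-projection hashing with Gaussian hyperplanes; the function names, the number of bits, and the brute-force pair enumeration are illustrative assumptions. The abstract describes a multiple sorting method for the enumeration, which is not reproduced here.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def sketch(vec, hyperplanes):
    """Encode a feature vector as an integer whose bits are projection signs."""
    bits = 0
    for k, plane in enumerate(hyperplanes):
        if sum(p * v for p, v in zip(plane, vec)) >= 0.0:
            bits |= 1 << k
    return bits


def hamming(a, b):
    """Number of differing bits between two sketches."""
    return bin(a ^ b).count("1")


def similar_pairs(sketches, max_dist):
    """Brute-force enumeration of sketch pairs within max_dist bits.

    Quadratic in the number of sketches; shown only to define the output
    that a sorting-based enumeration would produce far more efficiently.
    """
    pairs = []
    for i in range(len(sketches)):
        for j in range(i + 1, len(sketches)):
            if hamming(sketches[i], sketches[j]) <= max_dist:
                pairs.append((i, j))
    return pairs
```

Vectors pointing in similar directions fall on the same side of most hyperplanes, so a small Hamming distance between sketches indicates a small angle between the original feature vectors.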

References:

[1] Ito et al., Proteins. (2012) 80 (3): 747-763.

[2] Ito et al., Nucl. Acids Res. (2012) 40 (D1): D541-D548.

Poster Number 48

ERNE: a multi-purpose alignment package

Francesco Vezzi,∗ Cristian Del Fabbro,† Alexandru I. Tomescu,‡ Nicola Prezza,§ and Alberto Policriti¶

May 15, 2012

Abstract

String alignment against a reference genome is the first and most important phase in every (re-)sequencing project based on Next Generation Sequencing data. The importance of this problem is demonstrated by the large number of tools (i.e., aligners) designed to tackle it. Aligners must be able to handle a broad variety of reads (different lengths, paired reads, etc.), obtained from a wide range of organisms (from short viral genomes up to plant genomes gigabase pairs long), and sequenced for different purposes (DNA-seq, RNA-seq, BS-seq, etc.). In practice, different problems are solved by different tools, obliging researchers to use more than one program to align different types of reads. This situation gives rise to several problems, all stemming from the fact that users must become familiar with different tools and learn how to tune a large number of parameters for each of them. Moreover, different tools can handle similar problems in different ways (e.g., reads mapping to multiple positions) or output alignments in different (often non-standard) formats.

We present ERNE (Extended Randomized Numerical alignEr), a short string alignment package whose goal is to provide an all-inclusive set of tools to handle short (NGS-like) reads. ERNE comprises ERNE-MAP (core alignment tool/algorithm), ERNE-DMAP (distributed version of the aligner), ERNE-BS5 (aligner for bisulfite-treated reads), and ERNE-VISUAL (graphical user interface).

ERNE-MAP (ERNE MAPper) is a high-performance, sensitive hash-based aligner: it implements a Hamming-aware hash function able to handle mismatches, extending the approach originally proposed by Rabin and Karp. ERNE-MAP handles paired reads, allows both gapped and un-gapped alignments, and outputs alignments in the standard SAM/BAM format. Moreover, it can align RNA-seq reads, taking care of reads that span exon junctions.

ERNE-DMAP (ERNE Distributed MAPer) was designed to tackle the main computational bottleneck of all classical parallel implementations of aligners: references longer than 4 Gbp. ERNE-DMAP distributes the alignment computation over a cluster of computers using the OpenMPI protocol. Its implementation allows the genome to be split across nodes: the maximum allowed reference length depends only on the number of available nodes in the cluster. The computation is based on a PIPELINE model, but we are developing a new, faster approach based on point-to-point intercommunication.

ERNE-BS5 (ERNE-BiSulfite 5, the newest) has been developed to efficiently map bisulfite-treated reads against large genomes (e.g., human). To achieve this goal we have implemented three different ideas: 1) we use a weighted context-aware Hamming distance to identify a T coming from an unmethylated C context, 2) we use a 5-letter alphabet for storing methylation information, and 3) we use an iterative process to position multiple-hit reads starting from a preliminary map built using single hits. The map is corrected and extended at each cycle using the alignments added in the previous step. ERNE-BS5 implements an improved (xor-based) hash function that we plan to integrate into ERNE-MAP and ERNE-DMAP.

To ease interaction with the various components of the tool, we developed a Graphical User Interface (GUI) dubbed ERNE-VISUAL. ERNE executables and source code are freely downloadable at http://erne.sourceforge.net/.
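The first ERNE-BS5 idea, a Hamming distance that tolerates a read T aligned to a reference C, can be illustrated with a deliberately simplified sketch. This is not ERNE's weighted, context-aware distance or its hash-based search: it is an unweighted variant with a brute-force scan of the forward strand only, and the function names are hypothetical.

```python
def bisulfite_hamming(read, ref):
    """Count mismatches, treating a read T over a reference C as a match.

    Bisulfite treatment converts unmethylated C to T, so this particular
    substitution is expected and is not penalized.
    """
    if len(read) != len(ref):
        raise ValueError("read and reference window must have equal length")
    mismatches = 0
    for r, g in zip(read, ref):
        if r == g:
            continue
        if r == "T" and g == "C":
            continue  # consistent with an unmethylated, converted C
        mismatches += 1
    return mismatches


def best_positions(read, genome, max_mismatches):
    """Return (position, distance) for every window within the mismatch limit."""
    hits = []
    for pos in range(len(genome) - len(read) + 1):
        d = bisulfite_hamming(read, genome[pos:pos + len(read)])
        if d <= max_mismatches:
            hits.append((pos, d))
    return hits
```

The distance is asymmetric by design: a read C over a reference T is still a mismatch, because bisulfite conversion never produces a C.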

∗[email protected] - KTH Royal Institute of Technology, Science for Life Laboratory, School of Computer Science and Communication, Solna, Sweden
†[email protected] - Applied Genomics Institute - Italy
‡[email protected] - University of Udine - Italy
§[email protected] - University of Udine - Italy
¶[email protected] - University of Udine and Applied Genomics Institute - Italy

Poster Number 49

Accurate prediction of protein enzymatic class by N-to-1 Neural Networks

Viola Volpato1,2, Alessandro Adelfio1,2 and Gianluca Pollastri1,2
1School of Computer Science and Informatics, University College Dublin, Ireland
2Complex and Adaptive Systems Laboratory, University College Dublin, Ireland

Genome sequencing projects and high-throughput experimental procedures have recently produced a rapid growth in protein databases, but only a small fraction of known sequences have had a function determined by experimental means. Moreover, the prediction of protein functions remains problematic to date: when two proteins lack significant sequence homology, it is hard to transfer functional annotations reliably, and divergent/convergent evolutionary events make this task even more complex [1]. Accordingly, one of the most important challenges of bioinformatics at present is to develop accurate computational methods capable of determining or accurately predicting protein functions and enhancing the annotation of sequence databases in order to expand our knowledge of the mechanisms of life [2]. Since protein structures are known for less than 1% of known protein sequences, most proteins of newly sequenced genomes have to be characterized by their amino-acid sequences alone [2].

We present a novel ab initio N-to-1 Neural Network predictor based on the architecture of SCLpred, which was developed to predict subcellular localization [3]. Our model, trained on a large, curated database of over 6,000 non-redundant proteins, can classify proteins, solely based on their sequences, into one of six classes extracted from the Enzyme Commission (EC) classification scheme. In addition, in order to exploit evolutionary information that is effective at detecting functionally significant residue patterns (e.g. active-site residues/portions) and can be harnessed for enzymatic class prediction, we represent each input sequence position by the residue frequencies derived from multiple sequence alignments instead of using single protein sequences. The model is capable of approximating non-linear functions mapping sequences to features and features to classes in a two-step prediction. As the model operates on the full sequence and not on predefined features, in the first step all motifs of a predefined length (31 residues in this work) are considered and are compressed by an N-to-1 Neural Network into a feature vector which is automatically determined during training. In the second step, the vectorial outputs of all networks are added up and the resulting feature vector is input to the final network to produce the enzymatic class prediction.

We test our predictor in 10-fold cross-validation and obtain state-of-the-art results, with 96% correct classification and 86% generalized correlation. All six classes are predicted with a specificity of at least 80% and false positive rates never exceeding 7%. It has been reported that even for pairs of enzymes with over 70% residue identity in the optimal alignment, more than 30% do not belong to the same class (first EC number) [4], underlining the difficulty of obtaining accurate predictions based on sequence identity. However, the overall accuracy of our method in predicting the main enzymatic classes is very high for the datasets used, in which sequence identity is below 30% for any two proteins. We also compare our method to ProtFun [5], which has been shown to be one of the most accurate methods for function prediction. Although comparisons on different datasets are to be taken with caution, we obtain performances exceeding those of ProtFun by over 10% while also considerably reducing false positive rates.

In conclusion, the high classification performance we achieve suggests that the network is able to recognize those functionally conserved portions of enzymatic sequences that are related to the chemical reactions on which the EC classification scheme is based. We are therefore now analysing trained networks to mine the motifs that are most informative for the prediction, and hence likely to be functionally relevant, in order to provide them as a public database.
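The two-step prediction described above (per-motif compression, summation, final classification) can be sketched as a forward pass. This is only an illustration of the data flow under stated assumptions: the weights are random rather than trained, each position is one-hot encoded instead of using alignment-derived residue frequencies, each network is a single layer, and the feature dimension of 8 and all names are hypothetical.

```python
import math
import random

AMINO = "ACDEFGHIKLMNPQRSTVWY"
WINDOW = 31      # motif length stated in the abstract
N_FEATURES = 8   # assumed size of the learned feature vector
N_CLASSES = 6    # main EC classes


def encode(seq):
    """One-hot encode each residue (stand-in for alignment frequencies)."""
    return [[1.0 if aa == a else 0.0 for a in AMINO] for aa in seq]


def windows(profile, size):
    """Yield a flattened, zero-padded window centred on every position."""
    half = size // 2
    pad = [[0.0] * len(AMINO)] * half
    padded = pad + profile + pad
    for i in range(len(profile)):
        yield [x for col in padded[i:i + size] for x in col]


def dense(weights, inputs, activation):
    """Single fully connected layer without bias."""
    return [activation(sum(w * x for w, x in zip(row, inputs))) for row in weights]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def init_weights(seed=0):
    """Random (untrained) weights for the motif network and the final network."""
    rng = random.Random(seed)
    n_in = WINDOW * len(AMINO)
    w_motif = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(N_FEATURES)]
    w_out = [[rng.uniform(-0.1, 0.1) for _ in range(N_FEATURES)] for _ in range(N_CLASSES)]
    return w_motif, w_out


def predict(seq, w_motif, w_out):
    """Step 1: compress each motif; sum over positions. Step 2: classify."""
    total = [0.0] * len(w_motif)
    for win in windows(encode(seq), WINDOW):
        feat = dense(w_motif, win, math.tanh)
        total = [t + f for t, f in zip(total, feat)]
    return softmax(dense(w_out, total, lambda v: v))
```

Summing the per-motif outputs is what makes the mapping "N-to-1": a sequence of any length is reduced to one fixed-size feature vector before the final classification.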

References

1. Ganfornina MD, Sanchez D: Generation of evolutionary novelty by functional shift. BioEssays 1999, 21:432–439.

2. Whisstock JC, Lesk AM: Prediction of protein function from protein sequence and structure. Quarterly Reviews of Biophysics 2003, 36:307–340.

3. Mooney C, Wang YH, Pollastri G: SCLpred: protein subcellular localization prediction by N-to-1 neural networks. Bioinformatics 2011, 27(20):2812–9.

4. Rost B: Enzyme function less conserved than anticipated. J. Mol. Biol. 2002, 318:595–608.

5. Jensen LJ, Gupta R, Blom N, Devos D, Tamames J, Kesmir C, Nielsen H, Stærfeldt HH, Rapacki K, Workman C, Andersen CAF, Knudsen S, Krogh A, Valencia A, Brunak S: Prediction of human protein function from post-translational modifications and localization features. J. Mol. Biol. 2002, 319:1257–1265.

Poster Number 50

Microsecond simulations of membrane-proteins as a global ligand-docking method

Wesén, B.

Most drugs on the market target membrane-proteins, but their exact molecular functions are rarely known. Anesthetics are thought to act on membrane-bound ion channels in the nervous system, and better knowledge of their exact sites of action and functioning would accelerate the optimization of current drugs and the design of new ones.

To find sites of action for a drug, automatic docking algorithms can heuristically search a protein, although these are commonly optimized for globular proteins and do not perform as well with membrane-proteins of complex topology. Free-energy perturbation (FEP) methods can accurately determine the binding affinity of a drug in a specific, small site, but the site has to be given explicitly, which introduces bias and manual work. Hence, for analyzing membrane-protein / ligand combinations that are not well known, a global, unbiased and unguided method is desired that can report putative sites and probable poses and automatically seed more detailed FEP calculations.

In this work, we apply microsecond-scale simulations of the prokaryotic pH-gated ion channel GLIC (from Gloeobacter violaceus) together with the anesthetic desflurane in order to evaluate the feasibility of this method. The trajectories of the ligands and water are aligned to the protein and occupancy maps are built, which are automatically analyzed for hotspots. Some of the sites of high occupancy correspond to a known, crystallographically resolved binding site of desflurane on GLIC near residues I201 and I202, while others have been suggested in previous studies but not thoroughly analyzed.
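Building an occupancy map and picking hotspots from it can be sketched as follows. This is a minimal illustration, assuming the ligand coordinates have already been aligned to the protein frame and are supplied per trajectory frame; the voxel size, the once-per-frame counting rule, the threshold and the function names are assumptions for the example.

```python
from collections import Counter


def occupancy_map(frames, voxel=2.0):
    """Fraction of frames in which each grid voxel contains a ligand atom.

    frames: list of frames, each a list of (x, y, z) ligand positions that
    are assumed to be already aligned to the protein.
    """
    counts = Counter()
    for positions in frames:
        seen = set()  # count a voxel at most once per frame
        for x, y, z in positions:
            seen.add((int(x // voxel), int(y // voxel), int(z // voxel)))
        counts.update(seen)
    n = len(frames)
    return {v: c / n for v, c in counts.items()}


def hotspots(occupancy, min_fraction):
    """Voxels occupied in at least min_fraction of the frames."""
    return sorted(v for v, f in occupancy.items() if f >= min_fraction)
```

For example, with four frames in which one voxel is visited every time and two others only once, `hotspots(occupancy_map(frames), 0.5)` returns only the persistently occupied voxel. Neighbouring hotspot voxels could then be merged into candidate sites for follow-up calculations.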

The ongoing project will add automatic spawning of FEP calculations for each identified site of high occupancy, in order to assign affinity values to the individual sites. Also, by correlating the ligand occupancy in the identified binding-site regions with other dynamical properties of the system, further analysis is possible, for example of the modulation of hydrogen-bonding patterns among structurally important parts of the protein backbone, of salt bridges between the channel subunits, and of the tilt angle or other motions of the major parts of the protein.

Poster Number 51

Where do anesthetics bind: Modeling of human ligand-gated ion channels.

Ozge Yoluk 1,2, Erik Lindahl1,2
1 Theoretical & Computational Biophysics, Royal Institute of Technology, Stockholm, Sweden
2 Center for Biomembrane Research, Stockholm University, Stockholm, Sweden

Anesthesia is crucial in all fields of medicine, allowing patients to undergo surgery without stress and pain. However, the adverse effects of anesthetics are quite strong and occasionally even lethal. Anesthetics with fewer side effects would increase the confidence of patients, lowering discomfort and stress before procedures. To be able to identify potential anesthetics with fewer side effects, it is important to fully understand how anesthetics work at the molecular level. The main targets of anesthetics in the nervous system are ligand-gated ion channels (LGICs). There are still no X-ray structures of human LGICs available, due to difficulties with overexpression and crystallization. However, studies on prokaryotic homologues of ligand-gated ion channels contribute greatly to our knowledge of human LGICs (e.g. GABAAR, a GABA-gated ion channel that transports Cl⁻ ions into the cell). A recent eukaryotic structure (the X-ray structure of GluCl from C. elegans) with higher similarity in structure and function has opened up another door to help us study human LGICs. However, the presence of co-crystallized ligands in the structure is a major concern for computational studies, as they might force the channel into a particular state. As part of our modeling efforts on human LGICs, we are therefore performing simulations with and without ligands in order to analyze the stability and the native state of the structure. In particular, GluCl remains stable in the presence of the drug ivermectin, and the subunits preserve the distance they have in the crystal structure (~9 Å). When simulated without ivermectin, the subunits are closer (~7 Å) and the pore is partially collapsed. The resulting trajectories are used to construct different models of the human GABAAR that should correspond to states without ligands. We are also carrying out further tests with these different models to understand the role of the subunit distance and binding pockets for anesthetics in human LGICs.

Poster Number 52

The metagenomics of the dead: taxonomic and functional annotation methods for analysis of ancient datasets

Katarzyna Zaremba-Niedzwiedzka1 and Siv G.E. Andersson1

1Department of Molecular Evolution, Biomedical Center, Uppsala University, Uppsala, Sweden;

Metagenomics allows insights into the microbiology of unique samples, such as fossils and mummies. Many questions about microbes in ancient samples remain unanswered, including basic ones such as which bacteria are present, what their source is, and whether they are ancient or modern. The Neandertal genome has recently been determined from DNA extracted from a 38,000-year-old fossil. Mammalian DNA accounted for only a few percent of the DNA sample, and the large majority, 80%, remained uncharacterized. The main difficulties in the bioinformatic analysis of such data are the short read length and the unknown reference genomes. We test the performance of different annotation methods using both artificially generated metagenomes and real sequencing reads of known origin. Full dataset annotation by BLAST performed poorly in the tests, especially in nucleotide searches, but even protein-based searches suffered from the lack of very closely related reference sequences. Testing ribosomal RNA based taxonomic methods revealed tRNAs adjacent to rRNA genes as the main source of false positive hits. We propose a modification of the lowest common ancestor assignment procedure that prevents high-level taxonomic assignment due to single misclassified sequences in the databases. BLAST-based assignments served as the starting point for assembly in subsets of reads with limited diversity, to prevent chimeras. Ultimately, identified sequences of interest were analyzed by phylogenies to confirm annotation and taxonomic position. Finally, we developed a substitution calculation procedure, based on comparison of individual reads to the consensus sequences, that allowed recognition of the typical ancient substitution patterns. We applied these bioinformatic approaches to the analysis of metagenome sequence data generated from DNA extracted from the remains of the Neandertal.
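One way a lowest common ancestor (LCA) assignment can be made robust to a single misclassified database sequence is to require only a minimum fraction of hits to agree at each rank. The sketch below illustrates that general idea; it is an assumed variant, not necessarily the modification proposed in this work, and the 80% support threshold and function name are hypothetical.

```python
from collections import Counter


def robust_lca(lineages, min_support=0.8):
    """Deepest lineage supported by at least min_support of all hits.

    lineages: one tuple per database hit, ordered from root to leaf.
    With min_support=1.0 this reduces to the strict LCA.
    """
    assigned = []
    n = len(lineages)
    depth = 0
    while True:
        prefix = tuple(assigned)
        counts = Counter(
            lin[depth]
            for lin in lineages
            if len(lin) > depth and tuple(lin[:depth]) == prefix
        )
        if not counts:
            break  # no hit extends the lineage any further
        taxon, c = counts.most_common(1)[0]
        if c / n < min_support:
            break  # not enough agreement at this rank
        assigned.append(taxon)
        depth += 1
    return tuple(assigned)
```

With nine hits to a Streptomyces lineage and one stray archaeal hit, the strict LCA collapses to the root (an empty tuple here), whereas the thresholded version still returns the full Streptomyces lineage.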
Our taxonomic profiling analyses suggest that Actinobacteria dominate the Neandertal metagenome and that a single species of Streptomyces accounted for 25% of all bacterial sequences in the data. Estimates of substitution frequencies from assembled rRNA sequence data verified the typical ancient features of the Neandertal sequences, and suggested a modern origin for the bacterial reads. Streptomyces-like collagenase genes were present at similar genome-equivalents as the rRNA genes, indicating that they are derived from the same genome. We hypothesize that actinobacteria of the genus Streptomyces have been enriched inside the Neandertal bones because of their ability to degrade collagen and obtain nutrients from this otherwise nutrient-poor environment. The approach used here paves the way for similar metagenomic studies of microbial communities associated with the remains of extinct organisms.

Poster Number 53

A biophysical model to infer canonical and non-canonical microRNA-target interactions

Mihaela Zavolan, Mohsen Khorshid, Jean Hausser, Erik van Nimwegen
Biozentrum, University of Basel and Swiss Institute of Bioinformatics, Klingelbergstrasse 50-70, CH-4056 Basel, Switzerland

miRNAs are a large class of regulators of gene expression, acting at the post-transcriptional level to modulate the stability of mRNA targets and their rate of translation into proteins. Concerted experimental and computational approaches revealed that in mammals, 7-8 nucleotides of perfect complementarity between the miRNA 5' end and the target mRNA are frequently sufficient to elicit a response, typically measured in terms of mRNA degradation. Many putative binding sites, however, do not seem to elicit an effect, and some binding sites that do not have perfect complementarity to the corresponding miRNA "seed" region have been found to be effective in mRNA destabilization. Further progress in understanding the determinants of miRNA-dependent regulation will likely depend on the availability of much more comprehensive data sets of bona fide miRNA binding sites. Such data sets may be obtained with recently developed experimental methods relying on Argonaute (Ago) protein crosslinking and immunoprecipitation (CLIP) [1–4] if one were able to identify the miRNA that guided the interaction of Ago with the isolated RNA site. To address this need, we developed a quantitative, biophysical model for comparatively evaluating the likelihood of interaction of individual miRNAs with a given site. We inferred the model's parameters from Ago2 CLIP data and found that they largely reflect previously uncovered principles of miRNA-target interaction. Application of this model to various Ago2 CLIP data sets enabled us to identify a substantial number of miRNA binding sites that are non-canonical, yet effective in mRNA destabilization upon miRNA transfection. The degree of mRNA destabilization correlates well with the predicted affinities of binding of these sites to miRNAs, indicating that our model enables the discovery of binding sites of individual miRNAs through Ago-CLIP. Combining Ago-CLIP with mRNA and protein expression measurements, we are unraveling the kinetics and mechanism of action of miRNAs.
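For contrast with the biophysical model, the canonical rule mentioned at the start of the abstract (perfect complementarity to the miRNA 5' end) can be checked with a simple string search. This sketch covers only canonical 7-nucleotide seed matches; taking the seed as miRNA nucleotides 2-8 is the usual convention and is an assumption here, as are the function names. It does not implement the authors' model and cannot find non-canonical sites.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def seed_site(mirna, start=2, end=8):
    """Reverse complement of the miRNA seed (1-based positions start..end)."""
    seed = mirna[start - 1:end]
    return "".join(COMPLEMENT[b] for b in reversed(seed))


def find_seed_matches(mirna, mrna):
    """Start positions (0-based) of perfect seed matches in the mRNA."""
    site = seed_site(mirna)
    hits = []
    pos = mrna.find(site)
    while pos != -1:
        hits.append(pos)
        pos = mrna.find(site, pos + 1)
    return hits
```

For the let-7a sequence `UGAGGUAGUAGGUUGUAUAGUU`, the seed is `GAGGUAG` and the searched target motif is `CUACCUC`. Sites without such a match are exactly the cases where a quantitative model of binding affinity is needed.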

References

1. Chi SW, Zang JB, Mele A, Darnell RB (2009) Argonaute HITS-CLIP decodes microRNA-mRNA interaction maps. Nature 460: 479-86.
2. Zisoulis DG, Lovci MT, Wilbert ML, Hutt KR, Liang TY, et al. (2010) Comprehensive discovery of endogenous Argonaute binding sites in Caenorhabditis elegans. Nature structural & molecular biology 17: 173–9.
3. Hafner M, Landthaler M, Burger L, Khorshid M, Hausser J, et al. (2010) Transcriptome-wide Identification of RNA-Binding Protein and MicroRNA Target Sites by PAR-CLIP. Cell 141: 129–141.
4. Kishore S, Jaskiewicz L, Burger L, Hausser J, Khorshid M, et al. (2011) A quantitative analysis of CLIP methods for identifying binding sites of RNA-binding proteins. Nature methods 8: 559–564.