Sevilla, 21-24 September 2014

Book of Abstracts


Index

Keynote Lectures

Oral Presentations per Topic: Highlights, Metagenomics, Integrative Biology, Medical Informatics, Phylogeny / Evolution, Structure / Function, Student Symposium

Posters per Topic: Highlights, Metagenomics, Integrative Biology, Medical Informatics, Phylogeny / Evolution, Structure / Function, Student Symposium

Keynotes

K1-01 The gut microbiome - A new target for understanding, diagnosing and treating disease
Jeroen Raes
VIB - K.U. Leuven, Leuven, BE

The functioning of the human body constitutes a complex interplay of human processes and ‘services’ rendered to us by the 1000 trillion microbial cells we carry. Disruption of this natural microbial flora is linked to infection, autoimmune diseases and cancer, but detailed knowledge about our microbial component remains scarce.

Recent technological advances such as metagenomics and next-generation sequencing permit the study of the various microbiota of the human body at a previously unseen scale. These advances have allowed the initiation of the International Human Microbiome Project, aiming at genomically characterizing the totality of human-associated microorganisms (the “microbiome”).

Here, I will present our work on characterizing the human intestinal flora based upon the analysis of high-throughput meta-omics (metagenomics, metatranscriptomics, metaproteomics) data. I will show how the healthy gut flora can be classified into “enterotypes” that are independent of host nationality, age, BMI and gender. I will also show how metagenome-wide association studies (MGWAS) can lead to the detection of diagnostic markers for host properties and disease (e.g. in IBD, diabetes and obesity), and aid in further understanding of how gut flora disturbances contribute to these pathologies. Finally, I will illustrate how gut microbiota-based treatment strategies are emerging, for example through Faecal Microbiota Transplantation (FMT).

References:

Hildebrand F et al. (2013) Inflammation-associated enterotypes, host genotype, cage and interindividual effects drive gut microbiota variation in common laboratory mice. Genome Biol 14(1):R4

Qin et al. (2012) A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature 490:55-60

Arumugam*, Raes* et al. (2011) Enterotypes of the human gut microbiome. Nature 473:174-180

K2-01 Cellular resolution models for gene regulation in fly embryos
Nick Luscombe
Cancer Research UK London Research Institute, London, UK

Transcriptional control ensures genes are expressed in the right amounts at the correct times and locations. Understanding quantitatively how regulatory systems convert input signals to appropriate outputs remains a challenge. For the first time, we successfully model even skipped (eve) stripes 2 and 3+7 across the entire fly embryo at cellular resolution. A straightforward statistical relationship explains how transcription factor (TF) concentrations define eve’s complex spatial expression, without the need for pairwise interactions or cross-regulatory dynamics. Simulating thousands of TF combinations, we recover known regulators and suggest new candidates. Finally, we accurately predict the intricate effects of perturbations including TF mutations and misexpression. Our approach imposes minimal assumptions about regulatory function; instead we infer underlying mechanisms from models that best fit the data, like the lack of TF-specific thresholds and the positional value of homotypic interactions. Our study provides a general and quantitative method for elucidating the regulation of diverse biological systems.


K3-01 Functions of miRNAs within gene expression regulatory networks
Mihaela Zavolan
Biozentrum - University of Basel, Basel, CH

Among the many mechanisms that regulate gene expression, miRNAs have emerged in the past decade as an important class of post-transcriptional regulators of mRNA decay and protein translation. Through complementarity involving 7-8 nucleotides at their 5’ end, miRNAs guide Argonaute proteins to ‘canonical’ target mRNAs. Recent studies have suggested that miRNA-induced target degradation can give rise to additional behaviors. These include the threshold-linear response of the targets to their transcriptional induction, reduction of the ‘noise’ in target expression and induction of correlations in the expression of the targets of a given miRNA. Here I will discuss experimental and computational approaches to studying these behaviors, including single cell gene expression profiling, as well as the insights that were derived about the functions of miRNAs in the regulation of gene expression.

K4-01 Why are individuals different?
Ben Lehner
Centre for Genomic Regulation, Barcelona, ES

We study the causes of phenotypic variation amongst individuals, including the distribution and effects of genetic variation, somatic mutations and epigenetic differences (stochastic/environmental influences). I will present some of our recent work on how inherited genetic variation influences dynamic processes, on the causes of phenotypic variation in the absence of genetic variation, and on somatic mutation processes in human cancers.

Oral presentations: Highlights

H1-01 The Functional Topography of the Arabidopsis Genome Is Organized in a Reduced Number of Linear Motifs of Chromatin States
Joana Sequeira-Mendes1, Irene Araguez1, Ramon Peiró1, Raul Mendez-Giraldez1, Xiaoyou Zhang2, Steven Jacobsen2, Ugo Bastolla1, Crisanto Gutierrez1
1Centro de Biología Molecular Severo Ochoa, Madrid, ES, 2University of California Los Angeles, Los Angeles, US

Chromatin is of major relevance for gene expression, cell division, and differentiation. Here, we determined the landscape of Arabidopsis thaliana chromatin states using 16 features, including DNA sequence, CG methylation, histone variants, and modifications. The combinatorial complexity of chromatin can be reduced to nine states that describe chromatin with high resolution and robustness. Each chromatin state has a strong propensity to associate with a subset of other states defining a discrete number of chromatin motifs. These topographical relationships revealed that an intergenic state, characterized by H3K27me3 and slightly enriched in activation marks, physically separates the canonical Polycomb chromatin and two heterochromatin states from the rest of the euchromatin domains. Genomic elements are distinguished by specific chromatin states: four states span genes from transcriptional start sites (TSS) to termination sites and two contain regulatory regions upstream of TSS. Polycomb regions and the rest of the euchromatin can be connected by two major chromatin paths. Sequential chromatin immunoprecipitation experiments demonstrated the occurrence of H3K27me3 and H3K4me3 in the same chromatin fiber, within a two to three nucleosome size range. Our data provide insight into the Arabidopsis genome topography and the establishment of gene expression patterns, specification of DNA replication origins, and definition of chromatin domains.
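
The observation that each chromatin state tends to neighbour only a subset of other states can be illustrated with a minimal Python sketch that counts how often one state is followed by a different state along a linear genome. The function names, the state labels and the toy sequence are hypothetical and are not taken from the study, whose actual analysis may differ.

```python
from collections import Counter

def neighbour_counts(states):
    """Count transitions between consecutive bins that carry different states."""
    counts = Counter()
    for left, right in zip(states, states[1:]):
        if left != right:  # runs of the same state are not transitions
            counts[(left, right)] += 1
    return counts

def neighbour_propensity(states):
    """For each state, the fraction of its outgoing transitions per neighbour."""
    counts = neighbour_counts(states)
    totals = Counter()
    for (left, _), n in counts.items():
        totals[left] += n
    return {pair: n / totals[pair[0]] for pair, n in counts.items()}

# Hypothetical state labels for consecutive genomic bins (illustration only).
toy_states = [1, 1, 3, 3, 7, 6, 7, 6, 5, 5, 4, 2, 4, 2]
toy_propensity = neighbour_propensity(toy_states)
```

In this toy sequence, state 7 is always followed by state 6, whereas state 6 is followed by states 7 and 5 equally often; a strongly skewed propensity of this kind is what a linear motif of states would look like.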

H1-02 Do long non-coding RNAs make proteins?
Jorge Ruiz-Orera1, Xavier Messeguer2, Juan A Subirana3, M. Mar Albà4
1Evolutionary Genomics Group, Research Programme on Biomedical Informatics (GRIB) - Hospital del Mar Research Institute (IMIM) - Universitat Pompeu Fabra (UPF), Barcelona, ES, 2Universitat Politècnica de Catalunya (UPC), Barcelona, Barcelona, ES, 3Evolutionary Genomics Group, Research Programme on Biomedical Informatics (GRIB) - Hospital del Mar Research Institute (IMIM) - Universitat Pompeu Fabra (UPF), Real Academia de Ciències i Arts de Barcelona (RACAB), Barcelona, ES, 4Evolutionary Genomics Group, Research Programme on Biomedical Informatics (GRIB) - Hospital del Mar Research Institute (IMIM) - Universitat Pompeu Fabra (UPF), Catalan Institution for Research and Advanced Studies (ICREA), Barcelona, ES

Deep transcriptome sequencing has revealed the existence of thousands of transcripts that lack conserved open reading frames and which have been termed long non-coding RNAs (lncRNAs). The majority of these transcripts are expressed at low levels, are relatively short, and do not yet have a known function. Motivated by the existence of ribosome profiling data for several species, we have investigated whether lncRNAs are scanned by ribosomes and compared the properties of any putatively translated open reading frames (ORFs) to different sets of coding and non-coding sequences. As many lncRNAs are lineage-specific, a relevant comparison is against young protein-coding genes, such as known primate-specific proteins. The ribosome profiling data from human, mouse, zebrafish, fruit fly, Arabidopsis and yeast strongly indicate that many lncRNAs are translated. We have found that ribosome density in the ORFs present in lncRNAs is high and contrasts sharply with the 3’UTR region, in which very often there is no detectable ribosome binding, as happens in bona fide protein-coding genes. Remarkably, ORFs in lncRNAs strongly resemble those in known young protein-coding genes: they are short, the coding score is higher than for random sequences but lower than for evolutionarily conserved protein-coding genes, and selective constraints, measured using the non-synonymous to synonymous polymorphism ratio, are relatively weak. The new peptides produced by lncRNAs may acquire useful functions and continue to evolve under selection. Taken together, these findings strongly suggest that lncRNAs play an important role in the evolution of new proteins.


H1-05 Multiple evidence strands suggest that there may be as few as 19 000 human protein-coding genes
Iakes Ezkurdia1, David Juan2, Jose Manuel Rodriguez3, Adam Frankish4, Mark Diekhans5, Jennifer Harrow6, Jesus Vazquez7, Alfonso Valencia8, Michael Tress2
1Unidad de Proteomica, Centro Nacional de Investigaciones Cardiovasculares, CNIC, Melchor Fernandez Almagro, 3, Madrid, ES, 2Structural Biology and Bioinformatics Programme, Spanish National Cancer Research Centre (CNIO), Melchor Fernandez Almagro, 3, Madrid, ES, 3National Bioinformatics Institute (INB), Spanish National Cancer Research Centre (CNIO), Melchor Fernandez Almagro, 3, Madrid, ES, 4Wellcome Trust Sanger Institute, Wellcome Trust Campus, Hinxton, Cambridge, UK, 5Center for Biomolecular Science and Engineering, School of Engineering, University of California Santa Cruz (UCSC), 1156 High Street, Santa Cruz, US, 6Wellcome Trust Sanger Institute, Wellcome Trust Campus, Hinxton, Cambridge, UK, 7Laboratorio de Proteomica Cardiovascular, Centro Nacional de Investigaciones Cardiovasculares, CNIC, Melchor Fernandez Almagro, 3, Madrid, ES, 8Structural Biology and Bioinformatics Programme and National Bioinformatics Institute (INB), Spanish National Cancer Research Centre (CNIO), Melchor Fernandez Almagro, 3, Madrid, ES

Determining the full complement of protein-coding genes is a key goal of genome annotation. The most powerful approach for confirming protein-coding potential is the detection of cellular protein expression through peptide mass spectrometry (MS) experiments. Here, we mapped peptides detected in seven large-scale proteomics studies to almost 60% of the protein-coding genes in the GENCODE annotation of the human genome. We found a strong relationship between detection in proteomics experiments and both gene family age and cross-species conservation. Most of the genes for which we detected peptides were highly conserved. We found peptides for >96% of genes that evolved before bilateria. At the opposite end of the scale, we identified almost no peptides for genes that have appeared since primates, for genes that did not have any protein-like features or for genes with poor cross-species conservation.

These results motivated us to describe a set of 2001 potential non-coding genes based on features such as weak conservation, a lack of protein features, or ambiguous annotations from major databases, all of which correlated with low peptide detection across the seven experiments. We identified peptides for just 3% of these genes. We show that many of these genes behave more like non-coding genes than protein-coding genes and suggest that most are unlikely to code for proteins under normal circumstances. We believe that their inclusion in the human protein-coding gene catalogue should be revised as part of the ongoing human genome annotation effort.

H2-01 Accurate characterization of complex structural variation in cancer by using a reference-free approach
Valentí Moncunill1, Santiago Gonzalez1, Silvia Bea2, Itziar Salaverria2, Cristina Royo2, Laura Martinez1, Montserrat Piuggross1, Maia Segura-Wang3, Romina Royo4, Josep L Gelpi4, Ivo Gut5, Carlos Lopez-Otin6, Modesto Orozco1, Jan Korbel3, Elias Campo2, Xose Puente6, David Torrents7
1Joint IRB-BSC Program in Computational Biology, BSC, Life Sc, Barcelona, ES, 2Hospital Clínic (IDIBAPS), Dept of Pathology, Barcelona, ES, 3European Molecular Biology Laboratory, Genome. Biol, Heidelberg, DE, 4Joint IRB-BSC Program in Computational Biology, BSC, Life Sc. & Comput. Bioinf., INB, BSC, Barcelona, ES, 5Centro Nacional Análisis Genómico (CNAG) - Centro Regulación Genómica (CRG), Barcelona, ES, 6Univ. de Oviedo - IUOPA, Dpt. Biochem. Biol. Molec, Oviedo, ES, 7Joint IRB-BSC Program in Computational Biology, BSC, Life - Sc. Inst. Catalana de Recerca i Estudis Avançats, Barcelona, ES

The development of high-throughput sequencing technologies has changed our understanding of cancer. However, despite the increasing demand to identify the genetic alterations in tumor cells, the accurate characterization of somatic structural variants in cancer still remains a challenge. Current strategies depend on the alignment of reads to a reference genome, a step that restricts the complete definition of structural variation. We developed a reference-independent approach called SMUFIN (Somatic MUtation FINder), which is able to accurately identify all types of somatic variation, from substitutions to large structural variants (SVs), at base pair resolution.

Novel features of SMUFIN compared to existing strategies include: (i) the direct comparison of normal and tumor reads without the need to generate mapped BAM files, avoiding conflicts derived from the inefficient mapping of tumor reads carrying differences and a potential contamination with germline variants; (ii) the detection, in a single run, of SNVs and SVs with no limitations in type or size; (iii) the identification of all variants at base pair resolution; (iv) the description of the exact change in the tumor, including the sequence at both sides of all breakpoints detected; and (v) the simplicity of execution, as it is provided as a single binary executable file.


Performance tests showed average sensitivity of 92% and 74% for SNVs and SVs, with specificities of 95% and 91%, respectively. Analysis of two aggressive forms of solid and hematological tumors revealed that this procedure identifies breakpoints associated with chromothripsis and chromoplexy with specificities above 90%. Taken together, SMUFIN constitutes the first reference-free and integrated solution for an accurate and complete characterization of somatic variation in cancer.
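
For readers less familiar with these metrics, the following minimal Python sketch shows how sensitivity and specificity are conventionally computed from counts of correct and incorrect calls. The counts are hypothetical and chosen only for illustration; they are not the benchmark data behind the figures above. Variant-calling benchmarks sometimes report precision, TP / (TP + FP), under the name specificity, and which definition was used here is not stated in the abstract.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of real variants that were called: TP / (TP + FN)."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Fraction of non-variant sites left uncalled: TN / (TN + FP)."""
    return true_neg / (true_neg + false_pos)

def precision(true_pos, false_pos):
    """Fraction of calls that are real variants: TP / (TP + FP)."""
    return true_pos / (true_pos + false_pos)

# Hypothetical confusion counts (illustration only).
toy = {"tp": 80, "fn": 20, "tn": 900, "fp": 100}
toy_sensitivity = sensitivity(toy["tp"], toy["fn"])
toy_specificity = specificity(toy["tn"], toy["fp"])
toy_precision = precision(toy["tp"], toy["fp"])
```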

H2-02 Target Prediction for an Open Access Set of Compounds Active against Mycobacterium tuberculosis
Francisco Martínez-Jiménez1, George Papadatos2, Lun Yang3, Iain M. Wallace2, Vinod Kumar3, Ursula Pieper4, Andrej Sali4, James Brown3, John Overington2, Marc A Martí-Renom1
1Centro Nacional Análisis Genómico / Centro Regulación Genómica, Barcelona, ES, 2European Molecular Biology Laboratory – European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, UK, 3Computational Biology, Quantitative Sciences, GlaxoSmithKline, Collegeville, Pennsylvania, US, 4Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, US

Mycobacterium tuberculosis, the causative agent of tuberculosis (TB), infects an estimated two billion people worldwide and is the leading cause of mortality due to infectious disease. The development of new anti-TB therapeutics is required because of the emergence of multi-drug-resistant strains as well as co-infection with other pathogens, especially HIV. Recently, the pharmaceutical company GlaxoSmithKline published the results of a high-throughput screen (HTS) of their two million compound library for anti-mycobacterial phenotypes. The screen revealed 776 compounds with significant activity against the M. tuberculosis H37Rv strain, including a subset of 177 prioritized compounds with high potency and low in vitro cytotoxicity. The next major challenge is the identification of the target proteins. Here, we use a computational approach that integrates historical bioassay data, chemical properties and structural comparisons of selected compounds to propose their potential targets in M. tuberculosis. We predicted 139 target-compound links, providing a necessary basis for further studies to characterize the mode of action of these compounds. The results from our analysis, including the predicted structural models, are available to the wider scientific community in the open source mode, to encourage further development of novel TB therapeutics.

H2-03 Emergence of the human DNA Damage Response Network
Aida Arcas1, Oscar Fernandez-Capetillo2, Ildefonso Cases3, Ana M Rojas4
1Instituto de Neurociencias, Alicante, ES, 2Centro Nacional de Investigaciones Oncologicas, Madrid, ES, 3Genomics and Bioinformatics Platform of Andalusia. Sevilla, Sevilla, ES, 4Instituto de Biomedicina de Sevilla (IBIS-HUVR-CSIC-US), Sevilla, ES

The DNA Damage Response (DDR) is a crucial signaling network that preserves the integrity of the genome. To understand how its elements have been assembled together in humans, we performed comparative genomic analyses in selected species to trace back their emergence using systematic phylogenetic analyses and estimated gene ages. The emergence of the contribution of post-translational modifications to the complex regulation of DDR was also investigated. This is the first time a systematic analysis has focused on the evolution of DDR sub-networks as a whole. Our results indicate that a DDR core, mostly constructed around metabolic activities, appeared soon after the emergence of eukaryotes, and that additional regulatory capacities appeared later through complex evolutionary processes. Potential key post-translational modifications were also in place then, with interacting pairs preferentially appearing at the same evolutionary time, although modifications often led to the subsequent acquisition of new targets. We also found extensive gene loss in essential modules of the regulatory network in fungi, plants and arthropods, which is important for their validation as model organisms for DDR studies.

Published in Molecular Biology and Evolution. 31:940-61

Oral presentations: Metagenomics

A7-01 Metapasta: scalable tool for microbial community profiling
Evdokim Kovach, Alexey Alekhin, Marina Manrique, Pablo Pareja-Tobes, Eduardo Pareja, Raquel Tobes, Eduardo Pareja-Tobes
Oh no sequences! Research Group, Era7 bioinformatics, Granada, ES

Metapasta is an open-source, fast and horizontally scalable tool for community profiling based on the analysis of 16S metagenomics data. It is entirely -based and specifically designed to take advantage of it: it performs the community profiling of a sample starting from raw Illumina reads in approximately 1 hour, and needs approximately the same time to do the same for hundreds of samples. It uses BLAST or LAST, but other mapping solutions can be integrated. The taxonomic assignment is done using a best-hit and a lowest-common-ancestor paradigm, taking the NCBI taxonomy as reference. As output, Metapasta generates the frequencies of all the identified taxa in any of the samples in tab-separated value text files. This output includes direct assignment frequencies and cumulative frequencies based on the hierarchical structure of the taxonomy tree. The report format can be configured using a DSL similar to spreadsheet formulas. PDF files with the assigned taxonomy tree can be rendered. Metapasta is an open-source tool available under the AGPLv3 license.
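
The lowest-common-ancestor assignment and the cumulative frequencies mentioned above can be sketched in a few lines of Python. This is an illustrative sketch only, not Metapasta's code (which is written in Scala): the function names, the child-to-parent dictionary and the toy taxonomy are assumptions made for the example, and a real run would use the NCBI taxonomy.

```python
def lineage(taxon, parent):
    """Path from a taxon up to the root, starting with the taxon itself."""
    path = [taxon]
    while parent.get(taxon) is not None:
        taxon = parent[taxon]
        path.append(taxon)
    return path

def lowest_common_ancestor(taxa, parent):
    """Deepest taxon shared by the lineages of all the given taxa."""
    lineages = [lineage(t, parent) for t in taxa]
    shared = set(lineages[0]).intersection(*lineages[1:])
    # Walking upward from the first taxon, the first shared node is the deepest.
    for node in lineages[0]:
        if node in shared:
            return node
    return None

def cumulative_counts(direct, parent):
    """Add each taxon's direct count to the taxon and to all its ancestors."""
    totals = {}
    for taxon, n in direct.items():
        for node in lineage(taxon, parent):
            totals[node] = totals.get(node, 0) + n
    return totals

# Hypothetical miniature taxonomy (child -> parent) and direct read counts.
toy_parent = {
    "root": None,
    "Bacteria": "root",
    "Firmicutes": "Bacteria",
    "Bacteroidetes": "Bacteria",
    "Lactobacillus": "Firmicutes",
    "Clostridium": "Firmicutes",
}
toy_direct = {"Lactobacillus": 3, "Clostridium": 2, "Bacteroidetes": 5}
toy_cumulative = cumulative_counts(toy_direct, toy_parent)
```

A read whose hits fall in both Lactobacillus and Clostridium would be assigned to Firmicutes under this scheme, while a read with a single best hit keeps that hit's taxon.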

Methods: Metapasta is implemented in Scala and based on (). The graph data platform Bio4j (www.bio4j.com) is used for retrieving taxonomy-related information and the tool Compota (http://ohnosequences.com/compota) is used for distributing and coordinating compute tasks.

Funding: This project is funded in part by the ITN FP7 project INTERCROSSING (Grant 289974).

A7-02 Text Mining for Metabolic Reactions in Bacteria: the TeBactEn System
Martin Krallinger1, Andres Cañada2, Victor de la Torre3, Alfonso Valencia3
1Structural Computational Biology Group, Spanish National Cancer Research Centre (CNIO), Madrid, ES, 2National Bioinformatic Institute Unit, Spanish National Cancer Research Centre (CNIO), c. Melchor Fernández Almagro 3, 28029 Madrid, Spain, Madrid, ES, 3Structural Computational Biology Group, Spanish National Cancer Research Centre (CNIO), c. Melchor Fernández Almagro 3, 28029 Madrid, Spain, Madrid, ES

TeBactEn is a tool designed to facilitate the retrieval, extraction and annotation of bacterial enzymatic reactions and pathways from the literature. The system contains three different data collections, namely (a) a compilation of articles derived from the Microme database, i.e. articles (abstracts and full text articles) that had been used for manual annotation of bacterial pathways, (b) a set that covers abstracts from the entire PubMed database that are relevant to bacteria and finally (c) a collection of abstracts and full text articles that are relevant for a list of bacteria of special interest to metabolic reactions, facilitating a more exhaustive extraction of enzymes particularly for these bacteria. For all three TeBactEn data collections, an exhaustive recognition of mentions of all species and taxonomic entities was carried out. TeBactEn covers all the main steps relevant for the automatic extraction and ranking of metabolism relations from the literature and allows enhanced access to and annotation of related information:

1. Identification of metabolism-relevant articles.

2. Detection of the bio-entities involved in biochemical reactions: enzymes, compounds and organisms.

3. Extraction of weighted (ranked) relationships between these bio-entities.

4. An interface to browse this information and to construct a manually curated database of metabolism reactions.

5. Support for quick manual literature curation.

6. The option to normalize/ground bio-entity mentions to other knowledgebases like UniProt and ChEBI.

The system is available at: http://tebacten.bioinfo.cnio.es


A7-03 GET_HOMOLOGUES, a Versatile Software Package for Scalable and Robust Microbial Pan-genome Analysis
Bruno Contreras-Moreira1, Pablo Vinuesa2
1Estación Experimental de Aula Dei (EEAD-CSIC) and Fundación ARAID, Zaragoza, ES, 2Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, MX

GET_HOMOLOGUES is an open source software package that builds upon popular orthology-calling approaches, making highly customizable and detailed pan-genome analyses of microorganisms accessible to non-bioinformaticians. It can cluster homologous gene families using the bidirectional best-hit, COGtriangles or OrthoMCL clustering algorithms. Clustering stringency can be adjusted by scanning the domain-composition of proteins using the HMMER3 package, by imposing desired pair-wise alignment coverage cut-offs or by selecting only syntenic genes. Resulting homologous gene families can be made even more robust by computing consensus clusters from those generated by any combination of the clustering algorithms and filtering criteria. Auxiliary scripts make the construction, interrogation and graphical display of core and pan-genome sets easy to perform. Exponential and binomial mixture models can be fitted to the data to estimate theoretical core and pan-genome sizes, and high quality graphics generated. Furthermore, pan-genome trees can be easily computed and basic comparative genomics performed to identify lineage-specific genes or gene family expansions. The software is designed to take advantage of modern multiprocessor personal computers as well as computer clusters to parallelize time-consuming tasks. To demonstrate some of these capabilities, we survey a set of 50 Streptococcus genomes annotated in the Orthologous Matrix (OMA) Browser as a benchmark case. The package can be downloaded at http://www.eead.csic.es/compbio/soft/gethoms.php and http://maya.ccg.unam.mx/soft/gethoms.php.
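
Two of the ideas named above, bidirectional best hits and core versus pan-genome sets, can be illustrated with a minimal Python sketch. It is not GET_HOMOLOGUES code and does not use its file formats or options: the function names, the best-hit dictionaries and the toy clusters are hypothetical, and in practice the best hits would come from a sequence search such as BLAST.

```python
def bidirectional_best_hits(best_a_to_b, best_b_to_a):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    return {(a, b) for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a}

def core_and_pan_genome(clusters, genomes):
    """Split gene families into the core set and the full pan-genome.

    clusters maps a family name to the set of genomes that contain a member.
    The core holds families present in every genome; the pan-genome holds all.
    """
    genomes = set(genomes)
    pan = set(clusters)
    core = {name for name, present_in in clusters.items()
            if genomes <= set(present_in)}
    return core, pan

# Hypothetical best hits between the genes of two genomes A and B.
toy_a_to_b = {"a1": "b1", "a2": "b3", "a3": "b2"}
toy_b_to_a = {"b1": "a1", "b2": "a3", "b3": "a9"}
toy_pairs = bidirectional_best_hits(toy_a_to_b, toy_b_to_a)

# Hypothetical gene families and the genomes in which each one occurs.
toy_clusters = {
    "fam1": {"gA", "gB", "gC"},
    "fam2": {"gA", "gB"},
    "fam3": {"gC"},
}
toy_core, toy_pan = core_and_pan_genome(toy_clusters, ["gA", "gB", "gC"])
```

Here a2 and b3 are not paired because b3's best hit is a different gene, and only fam1 belongs to the core because it is the only family found in all three genomes.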

Oral presentations: Integrative Biology

B7-01 Inverse Comorbidity between Central Nervous System Disorders and Cancers interpreted by Transcriptomic Meta-analyses
Kristina Ibáñez1, César Boullosa1, Rafael Tabarés-Seisdedos2, Alfonso Valencia1, Anaïs Baudot3
1Structural Biology and Biocomputing Programme, Spanish National Cancer Research Centre (CNIO), Madrid, ES, 2Department of Medicine, University of Valencia, CIBERSAM, INCLIVA, Valencia, ES, 3Marseilles Institute of Mathematics (I2M), CNRS-AMU, Marseille, FR

Epidemiological evidence points to a lower-than-expected risk of cancer in patients with certain Central Nervous System (CNS) disorders. Such inverse comorbidity could arise, for instance, from environmental factors or drug treatments, or be related to disease diagnosis. We and others hypothesized that the inverse comorbidities could also be driven by genetic factors common to both sets of complex diseases (Roe et al. Neurology (2013), Tabarés-Seisdedos et al. Lancet Oncology (2011)). In this context, we recently published the first evidence of gene expression deregulations in opposite directions in inversely comorbid diseases (Ibáñez et al. PLoS Genetics 2014 Feb 20;10(2)).

I will describe in this presentation the results obtained by meta-analyzing gene deregulations in 3 CNS disorders (Alzheimer’s, Parkinson’s and Schizophrenia) and 3 cancer types (Lung, Prostate, Colorectal). Strikingly, the comparison of the deregulations in both sets of diseases showed that a significant number of genes and pathways up-regulated in CNS disorders are down-regulated in Cancers, and vice versa.

For instance, the Pin1 gene and the p53 pathway, which were previously suggested to be implicated in inverse comorbidity because of their key role in cell fate decision (Roe et al. Neurology (2013)), are down-regulated in the CNS disorders and up-regulated in Cancers. Novelties include the proteasome pathway, up-regulated in the 3 cancer types while down-regulated in the 3 CNS disorders, and the metallothioneins MT1X, MT2A and MT1M, down-regulated in the 3 cancer types while up-regulated in the 3 CNS disorders.

Overall, our analysis points to specific molecular processes whose up-regulation could increase the risk of CNS disorders while reducing the risk of Cancer, whereas the down-regulation of another set of molecular processes would be implicated in a reduced risk of CNS disorders and an increased risk of Cancers. Finally, I will present my future projects, grounded on this initial analysis, which consider mutation and polymorphism information and the integration of all inverse comorbidity-related data in large-scale protein-protein and drug-target interaction networks for the initiation of drug repurposing strategies.
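
The comparison described in this abstract, looking for genes deregulated in opposite directions in the two sets of diseases, can be sketched as follows. This is a simplified illustration under stated assumptions, not the published meta-analysis: each disease is reduced to a hypothetical dictionary of gene to direction (+1 up, -1 down), a gene counts only if its direction is the same in every disease of a set, and no statistics are applied. The toy values for PIN1 and MT2A merely mirror the directions stated in the text, and GENE_X and GENE_Y are invented placeholders.

```python
def consistent_direction(profiles, gene):
    """+1 or -1 if the gene has that direction in every profile, else 0."""
    signs = {profile.get(gene, 0) for profile in profiles}
    if signs == {1}:
        return 1
    if signs == {-1}:
        return -1
    return 0

def inversely_deregulated(cns_profiles, cancer_profiles):
    """Genes consistently up in one disease set and down in the other."""
    genes = set().union(*cns_profiles, *cancer_profiles)
    result = {}
    for gene in sorted(genes):
        cns = consistent_direction(cns_profiles, gene)
        cancer = consistent_direction(cancer_profiles, gene)
        if cns * cancer == -1:  # non-zero and opposite signs
            result[gene] = (cns, cancer)
    return result

# Hypothetical per-disease directions (+1 up-regulated, -1 down-regulated).
toy_cns = [
    {"PIN1": -1, "MT2A": 1, "GENE_X": 1, "GENE_Y": 1},
    {"PIN1": -1, "MT2A": 1, "GENE_X": 1, "GENE_Y": -1},
    {"PIN1": -1, "MT2A": 1, "GENE_X": 1},
]
toy_cancer = [
    {"PIN1": 1, "MT2A": -1, "GENE_X": 1, "GENE_Y": -1},
    {"PIN1": 1, "MT2A": -1, "GENE_X": 1, "GENE_Y": -1},
    {"PIN1": 1, "MT2A": -1, "GENE_X": 1, "GENE_Y": -1},
]
toy_inverse = inversely_deregulated(toy_cns, toy_cancer)
```

GENE_X is excluded because it moves in the same direction in both sets, and GENE_Y because its direction is not consistent across the CNS profiles.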

B7-02 Characterisation of the neural stem cell gene regulatory network identifies Olig2 as a multi-functional regulator of self-renewal
Juan L. Mateo1, Debbie L. C. van den Berg2, Maximilian Haeussler3, Daniela Drechsel2, Zachary B. Gaber2, Diogo S. Castro4, Paul Robson5, Gregory E. Crawford6, Paul Flicek7, Laurence Ettwiller1, Joachim Wittbrodt1, François Guillemot2, Ben Martynoga2
1Centre for Organismal Studies (COS), University of Heidelberg, Heidelberg, DE, 2Division of Molecular Neurobiology, MRC-National Institute for Medical Research, London, UK, 3Faculty of Life Sciences, University of Manchester, Manchester, UK, 4Instituto Gulbenkian de Ciência, Oeiras, PT, 5Developmental Cellomics Laboratory, Genome Institute of Singapore, Singapore, SG, 6Institute of Genome Sciences & Policy, Duke University, Durham, US, 7European Molecular Biology Laboratory – European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Hinxton, UK

The gene regulatory network (GRN) that supports neural stem cell (NS cell) self-renewal has so far been poorly characterised. Knowledge of the central transcription factors (TFs), the non-coding gene regulatory regions that they bind to and the genes whose expression they modulate will be crucial in unlocking the full therapeutic potential of these cells. Here, we use DNase-seq in combination with analysis of histone modifications to identify multiple classes of epigenetically and functionally distinct cis-regulatory elements (CREs). Through motif analysis and ChIP-seq we identify several of the crucial TF regulators of NS cells. At the core of the network are TFs of the basic helix-loop-helix (bHLH), Nuclear Factor I (NFI), Sox and Fox families, with CREs often densely bound by several of these different TFs. We use machine learning to highlight several crucial regulatory features of the network that underpin NS cell self-renewal and multipotency. We validate our predictions by functional analysis of the bHLH TF Olig2. This TF makes an important contribution to NS cell self-renewal by concurrently activating pro-proliferation genes and preventing the untimely activation of genes promoting neuronal differentiation and stem cell quiescence.


B7-03 Less is more or the more the merrier: Improvements in network-based function prediction by removing nodes and adding QTL information
Joachim Bargsten1, Gabino Sanchez-Perez1, Jan-Peter Nap1, Stefano Toppo2, Aalt-Jan van Dijk1
1Applied Bioinformatics, Plant Research International, Wageningen UR, Wageningen, NL, 2Department of Molecular Medicine, University of Padova, Italy, Padova, IT

Recently, we developed a network-based function prediction method, Bayesian Markov Random Field (BMRF), that was the best algorithm for human and Arabidopsis in the CAFA experiment (Critical Assessment of Function Annotation). However, BMRF function prediction is less accurate in species with only a limited set of experimental annotations. We have improved BMRF with sequence-based methods (Argot2 and Blast2Go), which generate initial predictions that are subsequently used as seed annotations in the network for BMRF. With these combined methods, we participated in the next round of CAFA (CAFA2).

Additionally, we present two novel strategies to improve network-based function predictions. First, we demonstrate that “pruning” highly connected nodes in a network has a positive effect on prediction performance. This is related to the tendency of hubs to have lower similarity to their neighbours compared to less well connected nodes. Second, we integrated Quantitative Trait Locus (QTL) and Genome Wide Association Study (GWAS) data with our function predictions. These data indicate that a genome region is associated, with a certain probability, with a trait and/or disease. Such genome regions can potentially be large (tens to hundreds of genes), depending on the experimental setup. We demonstrate that the use of QTL/GWAS data leads to improved gene function prediction performance, as assessed using experimental annotations available after the predictions were made. In addition, predicted gene functions can be used to prioritize candidate genes in QTL regions, i.e., to find the most likely causal candidate gene underlying the trait of interest.
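
The “pruning” of highly connected nodes can be illustrated with a minimal Python sketch that removes every node whose degree exceeds a cut-off, together with its edges. This is only one simple way to prune hubs: the function name, the degree cut-off and the toy network are assumptions made for the example, and the criterion actually used with BMRF is not described in this abstract.

```python
from collections import Counter

def prune_hubs(edges, max_degree):
    """Remove nodes with degree above max_degree and all edges touching them.

    edges is a list of undirected (node_a, node_b) pairs.
    Returns the remaining edges and the set of removed hub nodes.
    """
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    hubs = {node for node, d in degree.items() if d > max_degree}
    kept = [(a, b) for a, b in edges if a not in hubs and b not in hubs]
    return kept, hubs

# Hypothetical network: one hub linked to four genes that form two pairs.
toy_edges = [
    ("hub", "g1"), ("hub", "g2"), ("hub", "g3"), ("hub", "g4"),
    ("g1", "g2"), ("g3", "g4"),
]
toy_kept, toy_hubs = prune_hubs(toy_edges, max_degree=3)
```

After pruning, only the two gene pairs remain connected, so a neighbour-based prediction for g1 would rely on g2 alone rather than on everything reachable through the hub.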

B8-01 Exploring hibernation in mammals through transcriptomics: the case of the hibernating lemur Sheena L. Faherty1, José Luis Villanueva-Cañas2, M. Mar Albà2, Anne D. Yoder1 1Department of Biology, Duke University, Durham, NC 27708, USA, Durham, US, 2Evolutionary Genomics Group, Research Programme on Biomedical Informatics (GRIB), Hospital del Mar Research Institute (IMIM), Universitat Pompeu Fabra (UPF), Barcelona 08003, Spain, Barcelona, ES

Hibernation is a complex physiological response that some mammalian species employ to evade energetic demands. Hibernators conserve energy by essentially “shutting down” physiological processes: metabolic rate is severely depressed, body temperature plummets to ambient levels, and brain activity is greatly diminished (reviewed in Carey, Andrews, & Martin, 2003). In recent years the study of the molecular processes involved in mammalian hibernation has shifted from investigating a few carefully selected candidate genes to large-scale analysis of differential gene expression. The lemur Cheirogaleus medius belongs to a unique group of primates known to hibernate (genus Cheirogaleus). It hibernates under tropical conditions (Dausmann et al., 2004), which shows that hibernation is triggered by a lack of sufficient resources rather than by low temperature. It has a characteristic tail rich in adipose tissue that is burnt during hibernation, a period that can last up to 7 months.

The Duke Lemur Center in North Carolina has a small population of C. medius that hibernates during fall. Using non-invasive techniques we were able to take samples of adipose tissue while the animals were in full summer activity and in each of the 3 phases of hibernation (preparation, entrance and torpor) for expression studies using deep RNA sequencing (RNA-seq). With the sequencing reads we performed de novo transcript assembly in the absence of a reference genome. We generated a list of genes that were differentially expressed between conditions and identified several key catabolic enzymes that were over-expressed during hibernation. We have now gathered samples from two sister species (C. crossleyi, C. sibreei) at selected field sites in Madagascar in order to study this phenomenon in the wild for the first time.

References

Carey HV, Andrews MT, Martin SL. Mammalian hibernation: cellular and molecular responses to depressed metabolism and low temperature. Physiological Reviews 2003; 83(4). doi:10.1152/physrev.00008.2003

Dausmann KH, Glos J, Ganzhorn JU, Heldmaier G. Hibernation in a tropical primate. Nature 2004; 429 (June).

Page 11 Oral presentations Integrative Biology

B8-02 Fundamental physical cellular constraints drive self-organization of tissues Daniel Sánchez Gutiérrez Instituto de Biomedicina de Sevilla (IBiS), Sevilla, ES

Natural patterns such as fractals, spirals or tessellations have intrigued mathematicians and biologists for decades. These evolutionarily conserved structures emerge from the physical properties of soft living matter. Geometrical concepts have been widely applied to understand the basis of tissue architecture and remodelling. A clear example is the stereotyped polygon distribution found in very diverse proliferating epithelia among metazoans. Currently, it is accepted that the conserved distribution is a mathematical consequence of cell divisions, via a probabilistic Markov chain, in conjunction with cell rearrangements. We have used simple geometric concepts based on Voronoi tessellations to investigate the organization of diverse tissues, from Drosophila epithelia to human muscles. We show that the conserved polygon distribution is not exclusive to proliferating tissues. On the contrary, “relaxed” Voronoi tessellations and non-proliferative polygonal tissues also present the stereotyped distribution. Our results demonstrate that the distribution of cell areas dictates the frequency of polygons in these tissues. Increasing cell-size heterogeneity makes the tissue deviate from the conserved polygon distribution. We will present real and simulated data that explain the physical nature of this cellular constraint, which is able to drive the organization of diverse tissue structures.
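The notion of a polygon distribution in a Voronoi tessellation can be made concrete with a small sketch. It assumes a pixel-based (discrete) approximation of the Voronoi diagram rather than an exact geometric construction. No relaxation step is applied, and the grid size, seed count and function names are illustrative choices, not the method used in the study.

```python
import random
from collections import Counter


def discrete_voronoi(seeds, size):
    """Label each pixel of a size x size grid with the index of its nearest seed."""
    grid = []
    for y in range(size):
        row = []
        for x in range(size):
            nearest = min(range(len(seeds)),
                          key=lambda i: (seeds[i][0] - x) ** 2 + (seeds[i][1] - y) ** 2)
            row.append(nearest)
        grid.append(row)
    return grid


def neighbour_counts(grid):
    """Count neighbouring cells per cell, skipping cells that touch the grid border."""
    size = len(grid)
    neighbours, border = {}, set()
    for y in range(size):
        for x in range(size):
            cell = grid[y][x]
            neighbours.setdefault(cell, set())
            if x in (0, size - 1) or y in (0, size - 1):
                border.add(cell)
            for dx, dy in ((1, 0), (0, 1)):
                nx, ny = x + dx, y + dy
                if nx < size and ny < size and grid[ny][nx] != cell:
                    other = grid[ny][nx]
                    neighbours[cell].add(other)
                    neighbours.setdefault(other, set()).add(cell)
    return {c: len(n) for c, n in neighbours.items() if c not in border}


def polygon_distribution(counts):
    """Fraction of cells having each number of sides (= number of neighbours)."""
    total = len(counts)
    tally = Counter(counts.values())
    return {sides: n / total for sides, n in sorted(tally.items())}


random.seed(1)
seeds = [(random.uniform(0, 60), random.uniform(0, 60)) for _ in range(40)]
distribution = polygon_distribution(neighbour_counts(discrete_voronoi(seeds, 60)))
print(distribution)
```

Comparing such a distribution between seed sets with homogeneous and heterogeneous cell areas is one way to probe the dependence on cell-size heterogeneity described above.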

B8-03 Metabolomics and aging. The effect of mitochondrial prohibitin complex on the C. elegans metabolome Artur Bastos Lourenço1, Celia Muñoz Jiménez1, Mónica Venegas Calerón2, Mary Doherty3, Phillip Whitfield3, Marta Artal Sanz1 1Andalusian Centre for Developmental Biology (CABD), CSIC-Universidad Pablo de Olavide, Carretera de Utrera km 1, Seville, Spain, Sevilla, ES, 2Instituto de la Grasa, Sevilla, ES, 3Department of Diabetes and Cardiovascular Science, University of the Highlands and Islands, Inverness, UK

The nematode C. elegans has been extensively used to gain insights into the complex process of aging. Several studies have stressed the existence of a tight link between aging and metabolism. Indeed, marked alterations in cellular energy metabolism are a universal hallmark of the aging process. Consistently, the biogenesis and function of mitochondria, the energy-generating organelles, are primary longevity determinants [1].

The mitochondrial prohibitin complex, composed of two proteins, PHB1 and PHB2, which bind to each other to form a heterodimeric building block assembled into a ring-like macromolecular structure at the inner mitochondrial membrane, was shown to affect longevity. In particular, while shortening the lifespan of otherwise wild-type animals, prohibitin deficiency increases the lifespan of mutants in the insulin signalling pathway (e.g. daf-2(e1370)) [2]. Prohibitin deficiency was also shown to affect ATP levels, fat content and mitochondrial proliferation in a genetic-background- and age-specific manner [2]. These findings suggest that prohibitin deficiency may have a broader effect on the metabolome, which might, at least partially, underlie its impact on the aging process. We used gas chromatography coupled to a flame ionization detector (GC-FID) and 1H-NMR spectroscopy to gain molecular insights into the effect of prohibitin deficiency on the C. elegans metabolome. The free fatty acid (GC-FID) and 1H-NMR profiles of wild-type (N2) animals at both L4 and young adult (YA) stages revealed a clear distinction between control and phb-1 or phb-2 (RNAi), with the difference more pronounced at the YA stage. Furthermore, the GC-FID and 1H-NMR data clearly distinguished between N2 and daf-2(e1370) mutants, in both control (RNAi) and phb-1 (RNAi), at the YA stage. We are currently undertaking mass spectrometry (MS)-based metabolomic approaches aiming to identify the metabolites that change in a prohibitin-dependent manner and/or in a genetic-background-dependent manner. Our ultimate goal is to pinpoint the metabolic pathways, and more specifically the players, that might be involved in how the mitochondrial prohibitin complex affects longevity.

[1] Balaban, R. S., Nemoto, S. & Finkel, T. Mitochondria, oxidants, and aging. Cell 120, 483–495 (2005).

[2] Artal-Sanz, M. & Tavernarakis, N. Prohibitin couples diapause signalling to mitochondrial metabolism during ageing in C. elegans. Nature 461, 793–797 (2009).

Page 12 Oral presentations Medical Informatics

C7-01 IntSide: a web server for the chemical and biological examination of drug side effects Teresa Juan-Blanco, Miquel Duran-Frigola, Patrick Aloy IRB Barcelona, Barcelona, ES

Drug side effects (SEs) are one of the main health threats worldwide, and an important obstacle in drug development. Understanding how adverse reactions occur requires knowledge of drug mechanisms at the molecular level. Despite recent advances, there is still a need for tools and methods that facilitate side effect identification.

Very recently, we presented a top-down approach to identify chemical and biological drug features that may be involved in the development of adverse drug reactions (Duran-Frigola & Aloy, 2013). We delimited the chemical and biological space for each compound by gathering molecular properties from major biomedical resources and carried out an enrichment analysis, associating more than 1,000 SEs with molecular features. On the biological side, we considered drug targets and off-targets, pathways, molecular functions and biological processes. From a chemical viewpoint, we included molecular fingerprints, scaffolds and chemical entities.
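As an illustration of what an enrichment analysis between a side effect and a molecular feature can look like, the sketch below computes a one-sided hypergeometric p-value. The abstract does not name the statistical test, so the choice of test, the function name and the toy numbers are assumptions.

```python
from math import comb


def hypergeom_enrichment_p(k, K, n, N):
    """P(X >= k): chance of seeing at least k feature-carrying drugs among the
    n drugs that cause a side effect, when K of all N drugs carry the feature."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


# Toy numbers (hypothetical): 10 drugs, 4 hit a given target, 3 cause the side
# effect, and all 3 of those hit the target.
p_value = hypergeom_enrichment_p(k=3, K=4, n=3, N=10)
print(round(p_value, 4))
```

With more than 1,000 side effects and many features tested, such p-values would normally need a multiple-testing correction before any association is reported.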

Here, we introduce a web server, named IntSide, which automates this analysis and enables quick and easy access to our findings. Moreover, we further extend the method by integrating additional biological information, such as protein interactions and disease-related genes, to facilitate mechanistic interpretations. IntSide is available at http://intside.irbbarcelona.org/.

Reference: Duran-Frigola M, Aloy P. Analysis of chemical and biological features yields mechanistic insights into drug side effects. Chemistry & Biology. 2013;20(4):594-603.

C7-02 The EGA as a resource for medical and clinical informatics Jordi Rambla Fundació Centre de Regulació Genòmica (CRG), Barcelona, ES

The European Genome-phenome Archive (EGA, http://ega.crg.eu) is a repository of about 800 studies from humans, jointly managed by EBI and CRG.

It includes studies from many different diseases and technologies, and from consortia like the ICGC or the UK10K, allowing other research teams to leverage these studies for new assays or to complement their own.

The EGA provides long-term, secure, controlled-access storage for publicly funded studies and clinical analyses.

In this talk we will show which tools and assets the EGA offers and how research teams are leveraging them.


C8-01 Cancer, drugs and expression signatures Héctor Tejero, Fátima Al-Shahrour Spanish National Cancer Research Centre, CNIO, Madrid, ES

In cancer treatment, personalized therapy consists of giving each patient the right drug according to the characteristics of their tumor. Several projects (CCLE, GDSC, NCI60) have studied the pharmacological response of a large number of cancer cell lines to several drugs and have also carried out a deep molecular characterization of these cell lines. In this work, we studied the determinants of drug response using the gene expression of those cell lines, grouped by expression signatures. To do this, we used gene sets from several collections, giving special importance to oncogenic expression signatures. Oncogenic expression signatures are obtained empirically from the study of genetic perturbations of great importance in cancer. Their use makes it possible to infer the activation state of a given oncogene or tumor suppressor from the changes it induces in the cell transcriptome. Using this approach we have found new drug response biomarkers based on oncogenic expression signatures for most of the drugs studied. Interestingly, two clusters of drugs with opposite sensitivity and resistance biomarkers have been observed. The study of the genomic determinants of drug response may be of great importance in the near future for making current and developing antitumor therapies, targeted or cytotoxic, more effective through better adaptation to the specific molecular characteristics of the tumor.

C8-02 The landscape of alternative splicing alterations in human cancer Sebestyén Endre1, Singh Babita1, Gael Pérez Alamancos1, Amadís Pagès1, Eduardo Eyras2 1Computational Genomics, Universitat Pompeu Fabra, Dr. Aiguader 88, E08003, Barcelona, ES, 2Computational Genomics, Universitat Pompeu Fabra, Dr. Aiguader 88, E08003, Catalan Institution for Research and Advanced Studies, Passeig Lluís Companys 23, E08010, Barcelona, ES

Alternative splicing (AS) enables genes to produce multiple RNA and protein isoforms with different functions and is regulated by a large number of splicing factors (SF) that bind to specific sites on the pre-mRNA and serve as splicing enhancers or repressors. Several human diseases, including cancer, can be affected by this process. In fact, almost all biological processes important during the neoplastic transformation are profoundly influenced by alternative splicing [1].

However, the specific role of many SFs and AS events is still not well established. To characterize the role of alternative splicing in cancer, we analyzed DNA and RNA sequencing data from The Cancer Genome Atlas (TCGA) for 9 different cancer types for which we had paired normal and tumor samples. First, we studied the differential expression patterns of a set of splicing factors that bind RNA. Interestingly, we found that splicing factors show expression changes more frequently than somatic mutations in tumors, and many of them show no mutations at all. In total we obtained 68 splicing factor expression changes, 18 of them consistent across several tumors, including breast, colon, kidney, lung and prostate cancer.

We further investigated the association between splicing factor expression and the enrichment of their binding sites in regulated AS events calculated in the different cancer types. We found SF binding motifs enriched in three or more tumor samples, including motifs for QKI, RBFOX2, MBNL1, RBM4, CELF5, HNRNPK, SRSF1, SRSF7 and SRSF10. Additionally, we found a strong association between SF expression changes and inclusion changes of splicing events with the presence of the motif binding sites.

In conclusion, we have found many splicing factors and alternative splicing events showing common patterns of regulation in almost all cancer types, and others specific to certain cancers. This analysis can lead to the identification of novel prognostic and therapeutic targets, besides helping to understand the general mechanisms of tumor transformation and the role of alternative splicing in it.

References:

1) David, C. J., & Manley, J. L. (2010). Alternative pre-mRNA splicing regulation in cancer: pathways and programs unhinged. Genes & Development, 24(21), 2343-2364.


C8-03 In silico prescription of anticancer drugs to cohorts of 28 tumor types reveals novel targeting opportunities Carlota Rubio-Perez1, David Tamborero1, Michael P. Schroeder1, Albert A. Antolín2, Jordi Deu-Pons1, Christian Perez-Llamas1, Jordi Mestres2, Abel Gonzalez-Perez1, Nuria Lopez-Bigas1 1Research Unit on Biomedical Informatics, Department of Experimental and Health Sciences, Universitat Pompeu Fabra, Dr. Aiguader 88, Barcelona, Spain, Barcelona, ES, 2Systems Pharmacology, Research Program on Biomedical Informatics, IMIM Hospital del Mar Medical Research Institute and Universitat Pompeu Fabra, Doctor Aiguader 88, 08003 Barcelona, Catalonia, Spain, Barcelona, ES

The development of targeted therapies against altered driver proteins holds the promise of selectively and efficiently eliminating cancer cells. Nevertheless, the applicability of this strategy and the limitations of the oncogene addiction hypothesis are not clear. Here, we present the first large-scale therapeutic landscape of cancer as it stands today in a cohort of 6,795 samples covering 28 tumor types.

To discover mutations actionable through direct targeting, we first comprehensively identified cancer driver genes (CDs) by detecting complementary signals of positive selection in the pattern of their mutations across the tumor cohorts. Next, we detected which of these CDs contained activating mutations and which ones lost their function upon mutation. Third, we systematically gathered all available information on drugs, whether FDA approved or in clinical or preclinical stages, designed to inhibit the function of the proteins encoded by these genes. By combining these results, we developed in silico drug prescription, a novel approach to determine which of the drugs could benefit each individual tumor.

In all, we identified 461 CDs acting in one or more of the tumor types; 255 of them were predicted to act via gain-of-function or unknown mechanisms and thus were further evaluated in order to identify drugs for their direct targeting. We found that only 4 CDs, which appear mutated in 197 samples (2.9%), are directly targeted according to clinical guidelines by ten FDA approved drugs. The fraction of samples that could benefit from FDA approved drugs would increase to 12.7% when considering repurposing opportunities, and up to 23.9% when taking into account drugs currently undergoing clinical trials. Interestingly, we found good target candidates for drug development: 15 CDs tightly bound by pre-clinical small molecules and 68 CDs potentially suitable for molecule binding, covering an additional 29% of the samples.

In summary, direct targeting of mutated oncogenes by currently available drugs can benefit only a small subset of tumors, although the figures can be extended by repurposing strategies. The present study provides data that could be used for prioritizing the targets of novel drugs and approaches, as well as for the design of panels for early detection and diagnosis.
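The coverage figures quoted in this abstract can be checked with a few lines of arithmetic. Only the cohort size (6,795), the 197 directly targeted samples and the percentages come from the text. The sample counts derived from the percentages below are back-calculated approximations, not numbers reported by the authors.

```python
COHORT = 6795  # total samples in the cohort, as stated in the abstract


def percent_of_cohort(samples, total=COHORT):
    """Share of the cohort, in percent, rounded to one decimal place."""
    return round(100 * samples / total, 1)


def implied_samples(percent, total=COHORT):
    """Approximate sample count implied by a quoted percentage (back-calculated)."""
    return round(percent / 100 * total)


direct = percent_of_cohort(197)      # samples with a directly targeted driver
repurposed = implied_samples(12.7)   # adding repurposing opportunities
in_trials = implied_samples(23.9)    # adding drugs in clinical trials
print(direct, repurposed, in_trials)
```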

Page 15 Oral presentations Phylogeny / Evolution

D7-01 Comparative genomics of Citrus chloroplast genomes José Carbonell-Caballero1, Roberto Alonso-Valero1, Victoria Ibañez2, Javier Terol2, Manuel Talón2, Joaquín Dopazo3 1Systems Genomics Lab, Centro de Investigación Príncipe Felipe (CIPF), Valencia, ES, 2IVIA, centro de genómica, Valencia, ES, 3Functional Genomics Node (INB); BIER CIBER de Enfermedades Raras (CIBERER); Systems Genomics Lab, Centro de Investigación Príncipe Felipe (CIPF), Valencia, ES

Chloroplasts are organelles of prokaryotic origin within plant cells in which the photosynthetic machinery resides. As a remnant of their prokaryotic origin, chloroplasts have their own genome along with the corresponding transcriptional and translational machinery to express their genetic information. Chloroplast genomes of plants are known to be highly conserved in both gene order and gene content, with a substitution rate much lower than in nuclear DNA, and significantly lower still in the inverted repeat regions than in the single copy regions. In this study, 34 samples from 27 Citrus species representative of the Citrus genus have been selected, sequenced and compared. The selection covers the three main documented groups (citrons, pummelos and mandarins), which are described as citrus founders. We used approximately 100 kb of the 160 kb citrus chloroplast genome, which includes 133 genes (89 protein-coding, 4 rRNAs and 30 distinct tRNAs), to infer the phylogeny of the Citrus genus using the selected species. We obtained SNVs and indels and their distribution along the coding, intronic and intergenic regions. We also detected a remarkable level of heteroplasmy, whose evolutionary origin could be traced back using the phylogeny. Additionally, we found 4 fragile and recurrent regions affected by several structural variation events. Furthermore, the selective pressure on chloroplast genes was analyzed along the phylogenetic tree, where 3 genes showed signs of positive selection. Finally, we used previous estimations of divergence times to calibrate the phylogenetic tree and date the main unknown speciation events, such as the pummelo-mandarin separation.

D7-02 The genomes salad: applying phylogenomics in the green kingdom Salvador Capella-Gutierrez1, Toni Gabaldon2 1Bioinformatics and Genomics Programme. Centre for Genomic Regulation (CRG), Barcelona, ES | Universitat Pompeu Fabra (UPF), Barcelona, ES | Yeast and Basidiomycete Research Group. CBS Fungal Biodiversity Centre, Utrecht, NL, 2Bioinformatics and Genomics Programme. Centre for Genomic Regulation (CRG), Barcelona, ES | Universitat Pompeu Fabra (UPF), Barcelona, ES | Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, ES

With a rough estimate of 300,000 different species, plants constitute one of the most diverse groups of organisms. Despite their major role in oxygenation and as a food source for animals, including humans, few complete plant genomes had been sequenced until very recently. In recent years, the publication of plant genomes has grown rapidly. From their analyses it is clear that many of them are complex in terms of genome size, number of genes, presence of alternative transcripts and a high percentage of repetitive elements. With up to 80% of repetitive regions, partially due to multiple polyploidization events and the activity of transposable elements, the identification and comparison of gene sets across species is not easy. Phylogenomics, the intersection of evolutionary studies and genomics, can provide an appropriate framework to identify and classify differences in gene content across species.

Here, we will present the results of using a highly accurate phylogenomics pipeline in the context of a number of genome sequencing projects. We have contributed to understanding different aspects of plant evolution across the whole kingdom, from red algae (Chondrus crispus) to sugar beet (Beta vulgaris) and melon (Cucumis melo), among others. We will describe how large collections of single-gene phylogenies (i.e. phylomes) can help to define a stable gene set and to identify recently expanded transposable elements. Moreover, we have studied the presence or absence of resistance genes and the impact of gene family expansions associated with an increase in sugar production. We have also traced the origin of many chloroplast genes back to a cyanobacterial ancestor and contributed to identifying potential domestication genes for bean in the context of a complex evolutionary scenario with two independent domestication events within the same species.


D7-03 Genome-wide comparison, and intrastrain variation, in genome-integrated Human Herpesvirus 6 Marco Telford1, Gabriel Santpere1, Arcadi Navarro2 1Institute of Evolutionary Biology (Universitat Pompeu Fabra-CSIC), PRBB, Doctor Aiguader 88, 08003, Barcelona, Catalonia, Spain, Barcelona, ES, 2Institute of Evolutionary Biology (Universitat Pompeu Fabra-CSIC), PRBB, Doctor Aiguader 88, 08003, Barcelona, Catalonia, Spain; Centre de Regulació Genòmica (CRG), Barcelona, Catalonia, Spain; National Institute for Bioinformatics (INB), Barcelona, Catalonia, Spain; Institució Catalana de Recerca i Estudis Avançats (ICREA), Catalonia, Spain, Barcelona, ES

Human herpesvirus 6 (HHV-6), although considered a single virus in the past, has recently been defined as a worldwide-spread entity composed of two different strains (HHV-6A and HHV-6B) that share genome structure and have highly similar nucleotide sequences. Even though the genetic variability between the two HHV-6 strains has been a central issue of past studies, the intrastrain variability has not yet been fully assessed at a genome-wide level, mainly because few complete sequenced genomes are available today. Here we scanned the 1000 Genomes Project data in search of genome-integrated HHV-6 sequences by mapping the reads that did not map to the human genome against HHV-6 references. This resulted in 3 individuals presenting genome-integrated HHV-6A, and 8 presenting HHV-6B, from whom we extracted the nucleotide sequences of the virus to analyze the variability within and between the strains. In addition, because of the low coverage of the available sequence data, we performed a viral genome capture on LCLs (Lymphoblastoid Cell Lines) derived from the 1000 Genomes Project individuals presenting HHV-6 integration, resulting in high-coverage, high-quality data. An overall analysis showed much higher variability in HHV-6A than in HHV-6B, with exons being the class of genomic element in which this difference is most marked. Both Principal Component and phylogenetic analyses show a solid cluster of the Asian strains in HHV-6A, while poor structure is found within HHV-6B, with the exception of two individuals that turned out to be directly related (first degree). The lack of a world population structure in HHV-6B could be evidence of the absence of efficient barriers to gene flow, and of the high infectivity of the virus. Recombination analysis showed signs of a possible alternative scenario for HHV-6B, in which the population structure is not absent, but partially masked by recombination events that occurred during the history of the virus. As no genome-wide study with a large number of HHV-6 genomes has been done before, the present results are important because they hint strongly at the extant intrastrain variability of these viruses.

D7-04 Species-centered coevolutionary networks as a source of species-specific functional information David Juan, Alfonso Valencia Structural Biology and BioComputing Programme, Spanish National Cancer Research Center (CNIO), Madrid, ES

In the last few years, major advances have been made in the field of protein coevolution [1]. In particular, protein-protein coevolution can provide high-quality predictions of protein functional associations [2]. Coevolution-based methods provide a promising source of information for enriching our knowledge of the fast-growing set of fully sequenced bacterial species. Despite the large number of such genomes and the high-quality annotation available for a few of them, species-specific automatic functional annotation remains challenging. Here, we address the potential of species-focused protein coevolutionary networks for providing species-specific information for very different bacterial species. For this, we obtained species-focused protein coevolutionary networks based on specifically selected sets of evolutionarily related species. Our results confirm the predictive power of these coevolutionary networks in very different evolutionary scenarios. Moreover, we observed that evolutionarily distant species show very different “species-specific” coevolutionary networks. We explored the potential of this species-centered coevolution for understanding species-specific functional phenomena.

1. de Juan D, Pazos F, Valencia A. Nat Rev Genet. 2013;14(4):249-61.

2. Juan D, Pazos F, Valencia A. Proc Natl Acad Sci U S A. 2008;105:934-939.


D7-05 Parallels between demographic history and genome evolution in the critically endangered Iberian lynx Federico Abascal1, Fernando Cruz2, Begoña Martínez-Cruz3, Miriam Rubio-Camarillo1, Sophia Derdak2, Tyler Allioto2, Alfonso Valencia1, José A. Godoy3 1Structural Computational Biology Group, Spanish National Cancer Research Centre (CNIO), Madrid, ES, 2Centre Nacional d'Anàlisi Genòmica (CNAG), Barcelona, ES, 3Dept. of Integrative Ecology (EBD-CSIC), Sevilla, ES

The Iberian lynx is considered the most endangered felid in the world. With the double aim of understanding how drastic population declines have shaped its genome and improving future conservation strategies, the Iberian lynx genome consortium has sequenced the genomes of 11 Iberian and 1 Eurasian lynx.

Demographic history inference revealed that the two lynx species shared two ancient population bottlenecks (1-0.4 Mya, 60-40 Kya) and that the Iberian lynx was affected by a more recent bottleneck (ca. 400 y.b.p.), on top of the documented decline during the second half of the 20th century. The accumulated drift and inbreeding associated with these bottlenecks have shaped genomic variation patterns in many ways, leading to low overall diversity, extensive linkage disequilibrium, long runs of homozygosity, and reduced chromosome X to autosomes diversity ratios. High ratios of non-synonymous to synonymous diversity (πN/πS) and substitutions (dN/dS) in coding sequences indicate that the relaxation of purifying selection has resulted in the accumulation of deleterious variants and a high genetic load.

By combining genomic data from domestic cat, tiger, and Iberian and Eurasian lynx, we were able to trace genome evolution at each particular branch of the tree. Patterns of mutation show large differences among branches, especially in terms of weak-to-strong (A/T→G/C; hereafter W→S) mutation biases. W→S bias is much higher along the lineages of cat and the ancestor of the two lynxes. Then, after Eurasian and Iberian lynx speciation, a drastic reduction in W→S bias is observed. High W→S biases are mostly related to GC-biased gene conversion (gBGC), a process acting during meiotic recombination that has been shown to be inefficient in the absence of heterozygosity. Interestingly, the dynamics of transposable elements (TE) also show remarkable differences among felids, revealing the expansion of some families in the two lynx species. TE expansion might be due to higher TE activity or, more likely, to higher fixation rates due to less efficient purging of TE insertions.

The extensive signs of genetic erosion in the Iberian lynx raise concerns about the likely impact on fitness and adaptive potential and suggest the need for genetic management to be integrated into current conservation strategies.
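The weak-to-strong (W→S) bias discussed in this abstract can be illustrated with a small sketch that classifies substitutions by ancestral and derived base. The abstract does not define the statistic used, so the definition below (the share of W→S among W→S plus S→W changes) and the function names are assumptions.

```python
from collections import Counter

WEAK = set("AT")    # bases forming weak (two hydrogen bond) pairs
STRONG = set("GC")  # bases forming strong (three hydrogen bond) pairs


def classify(ancestral, derived):
    """Label a substitution as W>S, S>W or other (W>W and S>S changes)."""
    if ancestral in WEAK and derived in STRONG:
        return "W>S"
    if ancestral in STRONG and derived in WEAK:
        return "S>W"
    return "other"


def ws_bias(substitutions):
    """Share of W>S among all W>S and S>W substitutions; None if there are none."""
    counts = Counter(classify(a, d) for a, d in substitutions)
    informative = counts["W>S"] + counts["S>W"]
    return counts["W>S"] / informative if informative else None


# Hypothetical substitutions inferred on one branch: (ancestral base, derived base).
branch = [("A", "G"), ("T", "C"), ("G", "A"), ("A", "T")]
bias = ws_bias(branch)
print(bias)
```

Computing this value per branch of the tree is one simple way to compare lineages, although it does not by itself separate gBGC from other sources of mutation bias.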

Page 18 Oral presentations Structure / Function

E7-01 Structural features of the 5-colors Drosophila chromatin types Davide Bau1,2, François Serra1,2, Guillaume Filion2,3, Marc Marti-Renom1,2,4 1Centre Nacional d’Anàlisi Genòmica (CNAG), Barcelona, ES, 2Gene Regulation, Stem Cells and Cancer Program, Centre de Regulació Genòmica (CRG), Barcelona, ES, 3Universitat Pompeu Fabra (UPF), Barcelona, ES, 4Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, ES

Advances in genomic technologies and the development of new analytical methods (e.g. Hi-C) have provided better insights into how the genome is organized inside the cell nucleus. Recently, it has been shown that chromatin is organized in Topologically Associating Domains (TADs), large interacting domains that are conserved among different cell types. The Drosophila genome is also folded into TADs, which are packaged into a mosaic of five principal chromatin types, each defined by a unique combination of proteins. The five types of chromatin differ substantially in their genome coverage, numbers of domains, and numbers of genes [1]. To determine whether these TADs correspond to functional domains defined by epigenetic marks, Hou et al. [2] examined the composition of chromatin types within physical domains, following the 5-colors classification described in [1]. To find out whether these “chromatin color blocks” have characteristic structural features, we studied the relationship between the 3D architecture of selected regions of the Drosophila genome and their chromatin color. Using Hi-C data at 10 Kb resolution, we found that the analyzed regions have structural features characteristic of their functional signatures. Although with the present data resolution it is not possible to unambiguously distinguish between different chromatin types by simple comparison of their structural features, our results show that different chromatin types have specific structural characteristics that correlate with their functional roles, with active and inactive chromatin types showing significantly different structural characteristics.

E7-02 AnaBlast: searching for ancient coding signals to identify novel genes and fossil regions within genomic sequences Juan Jiménez, María Gallardo, Antonio J. Pérez Pulido Universidad Pablo de Olavide, Sevilla, ES

Currently, new whole genomes are routinely sequenced and automatically annotated. Computational gene prediction tools are able to discover most of the coding genes, especially when they retain sequence similarity to other sequences in the public databases. But they fail with specific non-conserved genes or when the coding sequence has errors. Thus, uncovering new genes and completing the structural annotation of genomes usually requires experimental tests with organisms or samples grown under different conditions.

We have developed a simple computational tool, here named AnaBlast (standing for Ancestral-sequence Analysis through a Blast-based strategy), that may identify ancient coding regions in sequenced genomes. AnaBlast is based on an algorithm designed to search for ancient patterns in protein sequences using non-redundant databases. It uses a customized Blast search with a very high expected value (E-value) as threshold, and it gives priority to identity over similarity in sequence relations. The resulting profiles, in which small patterns accumulate, highlight coding signals that can be initially assigned to new genes as well as to fossil regions belonging to ancient genes.

To test the developed algorithm we used the fission yeast Schizosaccharomyces pombe, which has an extensively annotated genome. To try to identify new S. pombe genes, we performed AnaBlast analysis of all six-frame translations (ignoring stop codons) of the DNA sequences between every two annotated exons in the entire genome. AnaBlast profiles highlighted more than 100 significant regions. A detailed analysis of each of these regions allowed us to identify one new pseudogene and nine putative new genes, six of which were validated by expression analysis. The results were also validated by evolutionary evidence from the analysis of non-synonymous and synonymous substitutions, and by matching the predictions with the results of RNA-Seq experiments from the literature.
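The six-frame translation step mentioned above can be illustrated with a minimal Python sketch. It is not the AnaBlast implementation: the function names are hypothetical, the standard genetic code is assumed, and "ignoring stop codons" is interpreted here as reading through stops, which are kept as `*` in the output.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: nuclear code).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate codon by codon, reading through stop codons ('*').

    Codons containing ambiguous bases are reported as 'X'.
    """
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to translations."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translations("ATGAAATAG").items():
        print(label, protein)
```

Each of the six translations of an intergenic or intronic segment could then be used as a query in a similarity search.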

Page 19 Oral presentations Structure / Function

In conclusion, this new approach provides a powerful tool to uncover novel genes in already annotated genomes, as well as to highlight fossil sequences, a unique property of AnaBlast that may help to better understand the evolutionary origin of annotated genomes. In the future, we plan to use AnaBlast to search for new genes and pseudogenes in metazoan genomes and to study the evolutionary history of their coding sequences.

E7-03 Protein structure features inferred from sequence can change significantly with database growth over time Inmaculada Yruela1, Bruno Contreras-Moreira2 1Estación Experimental de Aula Dei, Consejo Superior de Investigaciones Científicas (EEAD-CSIC), Avda. Montañana, 1005, 50059 Zaragoza, Spain. Instituto de Biocomputación y Física de Sistemas Complejos (BIFI), Universidad de Zaragoza, Mariano Esquillor, Edificio I+D, Zaragoza, ES, 2Estación Experimental de Aula Dei, Consejo Superior de Investigaciones Científicas (EEAD-CSIC), Avda. Montañana, 1005, 50059 Zaragoza, Spain. Instituto de Biocomputación y Física de Sistemas Complejos (BIFI), Universidad de Zaragoza, Mariano Esquillor, Edificio I+D. Fundación ARAID, Zaragoza, ES

Protein structure predictions based on sequence normally use a reference database to carry out sequence similarity calculations. Typical references, such as nr or nt, are continuously updated sequence sets, and therefore their volume increases over time, as do their annotations, fueled by the ever-growing number of genomes and proteomes published in the last decade. In this work we measure to what extent the recent growth of a fairly standard sequence database, uniref90, affects the prediction of structural features of proteins, namely secondary structure, intrinsic disorder and the phenotype of point mutations. The results of widely used predictive software (PSIPRED, DISOPRED and SIFT, respectively), which relies on sequence searches to produce results, were compared using two snapshots of uniref90 from 2010 and 2014. For this purpose four different sets of proteins from the model organisms Escherichia coli, Arabidopsis thaliana and Homo sapiens, plus a subset of high quality structures from the Protein Data Bank (PDB), were analyzed. Our results indicate that the 2010 and 2014 predictions vary strongly (Q3 = 0.29 - 0.42) in the case of SIFT, suggesting that this algorithm probably overestimates the effect of point mutations that insert rare amino acids in poorly characterized protein families. Such variations were less pronounced, although significant, in the case of DISOPRED (Q3 = 0.83 - 0.89), and much lower with PSIPRED (Q3 = 0.92). Therefore, it can be concluded that, out of the three tested features, secondary structure predictions are the most stable over time. Furthermore, by comparing PSIPRED results with the corresponding secondary structure assignments of high resolution PDB structures, it was found that 2010 and 2014 predictions also co
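As an illustration of how two prediction runs could be compared, the following Python sketch computes a per-residue agreement score. It assumes that Q3 in this abstract denotes the fraction of positions assigned the same state by the 2010-based and 2014-based predictions; the abstract does not define it, and the function name and example strings are hypothetical.

```python
def q3_agreement(pred_old, pred_new):
    """Fraction of positions with identical state labels in two predictions.

    Both arguments are equal-length strings with one state per residue,
    e.g. 'H' (helix), 'E' (strand) and 'C' (coil) for secondary structure.
    """
    if len(pred_old) != len(pred_new):
        raise ValueError("predictions must have the same length")
    if not pred_old:
        raise ValueError("predictions must not be empty")
    matches = sum(a == b for a, b in zip(pred_old, pred_new))
    return matches / len(pred_old)


if __name__ == "__main__":
    # Hypothetical predictions for the same 9-residue protein.
    print(q3_agreement("HHHEEECCC", "HHHEECCCC"))
```

The same function applies to any per-residue labelling, for example ordered versus disordered states.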

E7-04 Scoring docking conformations using structural alignment of protein-protein interfaces Sergio Mares-Sámano, Didier Barradas-Bautista, Juan Fernández-Recio Barcelona Supercomputing Center, Life Sciences Department, Barcelona, ES

Identification of near-native conformations from the vast number of poses generated by docking programmes remains a critical challenge. Recent studies have shown that the protein-protein interface architectures existing in nature are limited and that interface templates are available to model nearly all protein complexes. These findings suggest that a near-native docking conformation, unlike an incorrect pose, is more likely to have a close structural analogue in the space of solved interface structures. It is therefore highly attractive to use the degree of structural similarity of docking conformations with respect to solved interfaces as an indicator of the quality of the predicted complex. Here we show an approach that scores docking solutions by structurally aligning the interfaces of the docking conformations against a diverse, high-resolution library of complex interfaces. We assessed the method on a decoy benchmark of docking conformations covering 11 complexes for which near-native solutions can be found. The approach structurally aligns the interface of the decoy conformations against the DOCKGROUND library containing 5050 protein complexes.

To evaluate the optimal size of the interface structures involved in the alignment, we used two cutoff values, 5 and 12 Å, across the interface. The method ranks docking conformations based on the best alignment for each decoy structure, as judged by the highest TM-score. We found that the 12 Å interface set yielded better quality alignments (i.e. higher TM-scores), which was reflected in a higher success rate for the benchmark dataset than that obtained with the 5 Å cutoff. We also show that, although this method finds structural analogues even for incorrect docking models, near-native conformations consistently show higher TM-score values. In a comparative analysis we found that the performance of this approach, in terms of success rate, is similar to that of pyDock, a state-of-the-art scoring function. Since the two methods rely on different underlying principles to perform the scoring calculation, we believe that they could be combined to increase the chance of identifying near-native structures.
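The ranking step described above can be sketched in Python as follows. This is only an illustration: the TM-scores are assumed to have been computed beforehand by a structural alignment program, all identifiers and values are hypothetical, and the success-rate definition (a complex counts as a success if a near-native decoy appears among the top N ranked poses) is an assumption, since the abstract does not state the criterion used.

```python
def rank_decoys(tm_scores):
    """Rank decoys by their best TM-score against the interface library.

    tm_scores maps a decoy identifier to the list of TM-scores obtained by
    aligning its interface against every library interface.
    """
    best = {decoy: max(scores) for decoy, scores in tm_scores.items() if scores}
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


def success_rate(ranked_by_complex, near_native, top_n=10):
    """Fraction of complexes with a near-native decoy in the top_n poses."""
    if not ranked_by_complex:
        raise ValueError("no complexes to evaluate")
    hits = 0
    for complex_id, ranked in ranked_by_complex.items():
        top = {decoy for decoy, _ in ranked[:top_n]}
        if top & near_native.get(complex_id, set()):
            hits += 1
    return hits / len(ranked_by_complex)


if __name__ == "__main__":
    # Hypothetical TM-scores for three decoys of one complex.
    scores = {"decoy_1": [0.31, 0.42], "decoy_2": [0.55, 0.71], "decoy_3": [0.48]}
    ranked = rank_decoys(scores)
    print(ranked)
    print(success_rate({"complex_A": ranked}, {"complex_A": {"decoy_2"}}, top_n=1))
```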

E7-05 Computational exploration of the binding mode of the heme-dependent activator YC-1 into the active catalytic site of soluble guanylate cyclase Luis Agulló1, Ignasi Buch2, Hugo Gutierrez de Teran3, Gianni de Fabritis2, David Garcia-Dorado4, Jordi Villà-Freixa1 1Bioinformatics and Medical Statistics Research Group, Escola Politecnica Superior, Universitat de Vic, Vic, ES, 2Research Program on Biomedical Informatics (GRIB), IMIM (Hospital del Mar Medical Research Institute), Barcelona, ES, 3Fundacion Publica Galega de Medicina Xenomica. Complejo Hospitalario Universitario de Santiago (CHUS), Santiago de Compostela, ES, 4Laboratory of Experimental Cardiology, Vall Hebron University Hospital and Research Institute, Barcelona, ES

Soluble guanylate cyclase (sGC), the main target of nitric oxide (NO), has been proven to have a significant role in coronary artery disease, pulmonary hypertension, erectile dysfunction and myocardial infarction. Several drugs that increase the activity of this enzyme are now in the clinical phase of development: some of them are heme-dependent and might interact with the catalytic domain, and others are heme-independent and supposedly bind to the sensory domain. The absence of reliable structural information is one of the factors that have precluded knowledge of the precise site of interaction of these molecules and of the mechanism of activation of the enzyme. Homology models of the catalytic domain of sGC in the “inactive” or “active” conformation were constructed using, for the β-chain, the recently published crystal structure of a non-physiological homodimer of β subunits of human guanylate cyclase (2WZ1), for the α-chain, a similar domain of the green alga Chlamydomonas reinhardtii (3ET6) and, for monomer arrangement, the sGC “inactive” structure (3ET6) or the “active” catalytic domain of adenylate cyclase (1CJU). Molecular dynamics simulations of about 1 μs each were run on all relevant models (NAMD/ACEMD, Amber99SB). In the different trajectories, the sGC conformation varied between 1CJU-like and 3ET6-like structures. One of these trajectories maintained extremely stable relative positions of the amino acids in the catalytic site, very similar to those described in 1CJU. The observed conformational transitions suggest a possible mechanism for the transmission of the cooperativity signal between the pseudo-symmetric and the catalytic site, in which Arg-592 (α-chain), Arg-539 (β-chain) and the loop β2-β3 seem to play a critical role.
Docking of YC-1, a classic heme-dependent activator, to all frames of this trajectory, followed by calculation of absolute binding free energies with the linear interaction energy (LIE) method for selected poses, revealed one potential binding site located between the pseudo-symmetric and catalytic sites, just over the loop β2-β3. This site would be compatible with the binding of a second GTP or an inhibitory ATP to the pseudo-symmetric site.

Page 21 Oral presentations Student Symposium

S1-01 nAnnoLyze: ligand-target prediction by structural network biology Francisco Martínez-Jiménez1, Marc A. Marti-Renom2 1Centro Nacional Análisis Genómico (CNAG) Centro de Regulación Genómica (CRG), Barcelona, ES, 2Centro Nacional Análisis Genómico (CNAG) Centro de Regulación Genómica (CRG), Barcelona, ES

Target identification is essential for the drug optimization process, the identification of drug-drug interactions, dosage adjustment and the anticipation of side effects. Specifically, structural details are essential to understand the compound-protein relationship. Here, we present nAnnoLyze, a method for target identification that relies on the biological premise that structurally similar binding sites tend to bind similar ligands. nAnnoLyze integrates structural information into a bipartite network and makes use of the network features to predict structurally detailed compound-protein interactions at proteome scale.

The method was benchmarked on a dataset of 6,282 pairs of known interacting ligand-targets, reaching an AUC of 0.96 when using the drug names and an AUC of 0.70 when using anonymous compounds. nAnnoLyze has already been used in an open source drug discovery initiative against Mycobacterium tuberculosis (MTB) [1]. Furthermore, we have applied the method to the human proteome by predicting interactions for all the compounds in the DrugBank database with any human protein with either a known or a predicted three-dimensional structure.

Finally, we illustrate the applicability of nAnnoLyze with several examples of new target identification for known drugs against human diseases. We provide not only the link but also the structural localization of the interaction.

The method and all the DrugBank predictions for both human and MTB proteomes are available online through our webserver http://nannolyze.cnag.cat.

1. Martinez-Jimenez, F., et al. (2013). “Target prediction for an open access set of compounds active against Mycobacterium tuberculosis.” PLoS Comput Biol 9(10): e1003253.

S1-02 BIGO: A tool to improve gene enrichment analysis in collections of genes Aurelio López Fernández, Domingo Savio Rodríguez Baena Universidad Pablo de Olavide, Sevilla, ES

In recent years, many methods have been developed to analyze the massive data derived from measurements of hundreds or thousands of genes. These genes are grouped into collections of genes sharing some functionally relevant characteristic.

Gene enrichment analysis allows gene collections to be validated by means of previous biological knowledge. This analysis measures the relation between the genes of a given collection and the annotated biological terms stored in a biological database such as Gene Ontology.

BIGO is a novel software tool designed to improve gene enrichment analysis by providing new information that helps to narrow down the results and to generate new conclusions. Specifically, BIGO generates a ranking of biological terms and a graph that shows the hidden relationships between different gene collections.

BIGO processes the output of the gene enrichment analysis of a given group of gene collections. This output includes a list of the biological terms and the genes of each collection that are associated with them. As a first result of this processing, a ranking of all biological terms included in the gene enrichment analysis is generated. The ranking is sorted in ascending order of the number of gene collections in which each biological term has been detected. The first terms of this ranking allow us to focus the conclusions of the biological study on those terms that really distinguish one group of genes from the others. In addition, the last terms of the ranking are stop-words, biological functions considered very generic since they appear in a large number of genes, so they should not be taken into account when drawing conclusions.

From this ranking, BIGO generates a graph that represents the relations between the gene collections, based on the number of biological terms shared by these collections. Every node is a gene collection. A weighted edge joins two nodes if the corresponding collections share biological terms, and its weight is the number of shared terms. This graph can reveal collections of genes that are well-defined and independent from each other. Closely related groups are also of interest when they do not share a high proportion of their genes.
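The two outputs described above, the term ranking and the shared-term graph, can be sketched in Python under simple assumptions. This is not BIGO's code: the input is assumed to be a mapping from each gene collection to the set of enriched terms reported for it, and the function names and term labels are hypothetical.

```python
from collections import Counter
from itertools import combinations


def rank_terms(enrichment):
    """Rank terms in ascending order of the number of collections containing them.

    Terms found in few collections come first; very generic terms
    ("stop-words") end up at the bottom of the ranking.
    """
    counts = Counter(term for terms in enrichment.values() for term in set(terms))
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))


def shared_term_graph(enrichment):
    """Return weighted edges {(collection_a, collection_b): shared_term_count}."""
    edges = {}
    for a, b in combinations(sorted(enrichment), 2):
        shared = len(set(enrichment[a]) & set(enrichment[b]))
        if shared:
            edges[(a, b)] = shared
    return edges


if __name__ == "__main__":
    # Hypothetical enrichment output for three gene collections.
    enrichment = {
        "collection_1": {"term_a", "term_b", "term_x"},
        "collection_2": {"term_b", "term_x"},
        "collection_3": {"term_x"},
    }
    print(rank_terms(enrichment))
    print(shared_term_graph(enrichment))
```

In this toy input, `term_x` appears in every collection and therefore lands last in the ranking, as a stop-word would.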

In conclusion, BIGO enhances the understanding of the results obtained by a gene grouping technique, providing new information that makes possible the generation of useful knowledge.

S1-04 “Bioinformática con Ñ v1.0”: a collaborative project of young Spanish scientists to write a complete book about Bioinformatics Alvaro Sebastian1, Alberto Pascual-García2, Federico Abascal3, Jacobo Aguirre4, Eduardo Andrés-León5, Djordje Bajic6, Davide Baú7, Juan A. Bueren-Calabuig8, Álvaro Cortés-Cabrera2, Iván Dotu9, José M. Fernández10, Helena G. Dos Santos2, Beatriz García-Jiménez11, Raúl Guantes12, Iker Irisarri13, Natalia Jiménez-Lozano14, Javier Klett2, Raúl Méndez2, Antonio Morreale2, Almudena Perona2, Michael Stich15, Sonia Tarazona16, Inmaculada Yruela17, Rafael Zardoya17 1Adam Mickiewicz University (AMU), Poznan, PL, 2Centro de Biología Molecular Severo Ochoa (CBM), Madrid, ES, 3Centro Nacional de Investigaciones Oncológicas (CNIO), Madrid, ES, 4Centro de Astrobiología (CSIC-INTA), Madrid, ES, 5Instituto de Biomedicina de Sevilla (IBIS), Sevilla, ES, 6Centro Nacional de Biotecnología (CNB), Madrid, ES, 7Centro Nacional de Análisis Genómicos (CNAG), Center for Genomic Regulation (CRG), Barcelona, ES, 8Universidad de Florida, Florida, US, 9Boston College, Boston, US, 10Centro Nacional de Investigaciones Oncológicas (CNIO), Instituto Nacional de Bioinformática (INB), Madrid, ES, 11Centro de Biotecnología y Genómica de Plantas (UPM-INIA), Madrid, ES, 12Universidad Autónoma de Madrid (UAM), Madrid, ES, 13Museo Nacional de Ciencias Naturales (CSIC), Madrid, ES, University of Konstanz, Konstanz, DE, 14BULL (España) SA, Madrid, ES, 15Aston University, Birmingham, UK, 16Centro de Investigación Príncipe Felipe (CIPF), Universidad Politécnica de Valencia (UPV), Valencia, ES, 17Estación Experimental de Aula Dei (EEAD-CSIC), Zaragoza, ES

Bioinformatics is a discipline in which the number and variety of subjects is steadily growing. From an educational perspective, this represents an important challenge. There are still few educational programs covering these subjects and, in turn, there is little educational material. In addition, most of the available material is written in English, which represents an additional bottleneck in the acquisition of new knowledge for non-native English speakers.

Here we present a project aiming to provide specialized educational bibliography on Bioinformatics for Spanish speakers. The idea of writing a book in Spanish covering the most important topics in the field of Bioinformatics was born at the XIth Spanish Symposium on Bioinformatics in Barcelona two years ago. Different scientists have been involved in the project, from senior scientists to PhD students from different countries. Each of the chapters in the book has been written by specialists in the field, and the whole book has been edited in LaTeX format so that it can be readily updated. The result consists of more than 500 pages covering the following subjects: biomedical databases, sequence analysis, phylogeny and evolution, structural biology (including diverse topics such as docking, virtual screening and molecular dynamics), statistics and R, systems biology, programming skills, data mining, parallel computation, bibliography management and scientific article writing.

The book will be available in printed format for universities and research institutions, and a digital version will be added to the European Multimedia Bioinformatics Educational Resource (EMBER). The book is intended to be the beginning of an open project in which all the chapters can be updated and new topics can be incorporated in future versions. The current version of the book can be accessed online at http://goo.gl/UYG0o7.

S1-05 Computational prediction of microRNA targets in plant genomes Manuel Reis1, Nuno Mendes2, Ana Teresa Freitas3 1IST/UL, Av. Rovisco Pais, 1, 1049-001 Lisboa - Portugal ; KDBIO/INESC-ID, Rua Alves Redol, 9, 1000-029, Lisbon, PT, 2KDBIO/INESC-ID, Rua Alves Redol, 9, 1000-029 Lisboa - Portugal ; IBET, Av. Republica, Qta. do Marquês, 2780-157 Oeiras - Portugal, Lisbon, PT, 3IST/UL, Av. Rovisco Pais, 1, 1049-001 Lisboa - Portugal ; KDBIO/INESC-ID, Rua Alves Redol, 9, 1000-029 Lisboa - Portugal, Lisbon, PT

MicroRNAs (miRNAs) are posttranscriptional regulators and act by binding to sites in their target messenger RNAs (mRNAs). They are present in nearly all eukaryotes, in particular in plants, where they play important roles in developmental and stress response processes by targeting mRNAs for cleavage or translational repression.

MiRNAs have been shown to have a crucial role in gene expression regulation, but so far only a few miRNA targets in plants have been experimentally validated. Based on the number of annotated genes, the number of experimentally validated miRNAs and the fact that one miRNA often regulates multiple genes, a long list of yet unidentified targets is to be expected.

The developments in high-throughput sequencing technologies have produced large datasets of miRNA sequences and the demand for tools that analyze such data has consequently increased. However, most existing tools for target prediction were designed for animal miRNAs which differ significantly from plant miRNAs in the target recognition process. Furthermore, existing tools for target prediction in plants are unreliable, and the demand for more effective tools persists.

Here we present a plant miRNA target prediction tool that features three important aspects: (i) reverse complementarity matching between a target transcript and a miRNA, using an adapted scoring scheme, (ii) target-site accessibility evaluation, by calculating the opening energy of the secondary structure around the miRNA target site on the mRNA, and (iii) the possibility of incorporating additional evaluation methods.

The developed program consists of two main modules: the first is a variant of the Smith-Waterman alignment that allows the prediction of multiple optimal targets, and the second calculates the opening energy for the secondary structure around the miRNA target site, giving us a measure of accessibility. For accessibility calculations we use RNAplfold, which is part of the ViennaRNA Package. The scores of the complementarity and accessibility modules are compared against an empirical distribution obtained from a set of randomized sequences based on the dinucleotide frequencies of the original transcript, thus obtaining a pair of p-values for each miRNA/mRNA pair.
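The idea of scoring a candidate site and comparing it against randomized sequences can be sketched in Python as below. This is a deliberately simplified stand-in, not the tool described here: an ungapped match count replaces the Smith-Waterman variant, a plain shuffle of the transcript (preserving only mononucleotide composition) replaces the dinucleotide-based randomization, and all names, parameters and sequences are hypothetical.

```python
import random

COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Return the reverse complement of an upper-case RNA string."""
    return rna.translate(COMPLEMENT)[::-1]


def best_site_score(mirna, transcript):
    """Best ungapped match count between the transcript and the miRNA's
    reverse complement over all windows (simplification: no gaps, no G:U pairs)."""
    site = reverse_complement(mirna)
    width = len(site)
    best = 0
    for start in range(len(transcript) - width + 1):
        window = transcript[start:start + width]
        best = max(best, sum(x == y for x, y in zip(window, site)))
    return best


def empirical_p_value(mirna, transcript, n_shuffles=200, seed=0):
    """Fraction of shuffled transcripts scoring at least as well as the original.

    Uses the (k + 1) / (n + 1) estimator so the p-value is never exactly zero.
    """
    rng = random.Random(seed)
    observed = best_site_score(mirna, transcript)
    letters = list(transcript)
    at_least_as_good = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        if best_site_score(mirna, "".join(letters)) >= observed:
            at_least_as_good += 1
    return (at_least_as_good + 1) / (n_shuffles + 1)


if __name__ == "__main__":
    mirna = "UGACAGAAGAGAGUGAGCAC"  # illustrative 20-nt sequence
    transcript = "AAAA" + reverse_complement(mirna) + "CCCC"
    print(best_site_score(mirna, transcript))
    print(empirical_p_value(mirna, transcript))
```

An accessibility score could be given a second p-value in the same way, by recomputing it on each randomized sequence.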

We also performed a systematic evaluation of two existing target prediction tools for plants, TAPIR and psRNATarget, and compared their results against validated targets by analysing their sensitivity and specificity.

Page 24 Posters Highlights

H1-03 Late-replicating CNVs as a source of new genes David Juan1, Daniel Rico1, Tomas Marques-Bonet2, Óscar Fernández-Capetillo3, Alfonso Valencia1 1Structural Biology and BioComputing Programme, Spanish National Cancer Research Center (CNIO), Madrid, ES, 2Institut Catala de Recerca i Estudis Avancats (ICREA) and Institut de Biologia Evolutiva (UPF/CSIC), Barcelona, ES, 3Genomic Instability Group, Spanish National Cancer Research Centre (CNIO), Madrid, ES

Asynchronous replication of the genome has been associated with different rates of point mutation and copy number variation (CNV) in human populations. Here, our aim was to investigate whether the bias in the generation of CNV that is associated with DNA replication timing might have conditioned the birth of new protein-coding genes during evolution. We show that genes that were duplicated during primate evolution are more commonly found among the human genes located in late-replicating CNV regions. We traced the relationship between replication timing and the evolutionary age of duplicated genes. Strikingly, we found that there is a significant enrichment of evolutionarily younger duplicates in late-replicating regions of the human and mouse genomes. Indeed, the presence of duplicates in late-replicating regions gradually decreases as the evolutionary time since duplication extends. Our results suggest that the accumulation of recent duplications in late-replicating CNV regions is an active process influencing genome evolution.

H1-04 ngsCAT: a tool to assess the efficiency of targeted enrichment sequencing Francisco J. López-Domingo1, Javier P. Florido1, Antonio Rueda1, Joaquín Dopazo2, Javier Santoyo-López3 1Genomics and Bioinformatics Platform of Andalusia (GBPA), Sevilla, ES, 2Computational Genomics Department, Centro de Investigación Príncipe Felipe, Valencia, ES, 3Edinburgh Genomics, Ashworth Laboratories, The University of Edinburgh, Edinburgh, UK

Targeted enrichment sequencing by next-generation sequencing is a common approach to interrogate specific loci or the whole exome in the human genome. The efficiency and the lack of bias in the enrichment process need to be assessed as a quality control step before performing downstream analysis of the sequence data. Tools that can report on the sensitivity, specificity, uniformity and other enrichment-specific features are needed. We have implemented the next-generation sequencing data Capture Assessment Tool (ngsCAT), a tool that takes the information of the mapped reads and the coordinates of the targeted regions as input files, and generates a report with metrics and figures that allow the evaluation of the efficiency of the enrichment process. The tool can also take as input the information of two samples, allowing the comparison of two different experiments. Documentation and downloads for ngsCAT can be found at http://www.bioinfomgp.org/ngscat
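To make the kind of metrics mentioned above concrete, here is a minimal Python sketch that derives three simple quantities from mapped-read intervals and target coordinates. It does not reproduce ngsCAT or its exact definitions: the function name, the half-open `(chrom, start, end)` interval convention and the three metrics chosen (fraction of reads on target, fraction of target bases covered at a minimum depth, mean depth on target) are assumptions made for illustration.

```python
def enrichment_metrics(reads, targets, min_depth=1):
    """Compute simple capture metrics from (chrom, start, end) intervals.

    Intervals are half-open: start is included, end is excluded.
    A per-base dictionary keeps the sketch short; it is only practical
    for small target sets.
    """
    if not reads or not targets:
        raise ValueError("reads and targets must not be empty")

    # Depth counter for every targeted base.
    depth = {}
    for chrom, start, end in targets:
        for pos in range(start, end):
            depth[(chrom, pos)] = 0

    on_target_reads = 0
    for chrom, start, end in reads:
        overlaps_target = False
        for pos in range(start, end):
            if (chrom, pos) in depth:
                depth[(chrom, pos)] += 1
                overlaps_target = True
        on_target_reads += overlaps_target

    covered = sum(d >= min_depth for d in depth.values())
    return {
        "on_target_fraction": on_target_reads / len(reads),
        "covered_fraction": covered / len(depth),
        "mean_depth": sum(depth.values()) / len(depth),
    }


if __name__ == "__main__":
    # Hypothetical target region and four mapped reads.
    targets = [("chr1", 100, 110)]
    reads = [("chr1", 95, 105), ("chr1", 100, 110),
             ("chr1", 200, 210), ("chr2", 100, 110)]
    print(enrichment_metrics(reads, targets))
```

Running the function on two samples and comparing the resulting dictionaries would mimic, in a very reduced form, the two-experiment comparison described above.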

H1-06 footprintDB: a database of transcription factors with annotated cis elements and binding interfaces Álvaro Sebastián1, Bruno Contreras-Moreira2 1Laboratory of Computational and Structural Biology, Department of Genetics and Plant Production, Estación Experimental de Aula Dei-CSIC, Av. Montañana 1005, Zaragoza, ES, 2Laboratory of Computational and Structural Biology, Department of Genetics and Plant Production, Estación Experimental de Aula Dei-CSIC, Av. Montañana 1005 and Fundación ARAID, C/ María de Luna, 11, planta 1ª, Edificio CEEI Aragón, Zaragoza, ES

Motivation: Traditional and high throughput techniques for determining transcription factor binding specificities are generating large volumes of data of uneven quality, which are scattered across individual databases.

Results: FootprintDB integrates some of the most comprehensive freely available libraries of curated DNA binding sites (DBSs), and systematically annotates the binding interfaces of the corresponding transcription factors (TFs). The first release contains 2422 unique TF sequences, 10112 DBSs and 3662 DNA motifs. A survey of the included data sources, organisms and TF families was performed together with the proprietary database TRANSFAC, finding that footprintDB has a similar coverage of multicellular organisms, while also containing bacterial regulatory data. A search engine has been designed that drives the prediction of DNA motifs for input TFs, or conversely of TF sequences that might recognize input regulatory sequences, by comparison with database entries. Such predictions can also be extended to a single proteome chosen by the user, and results are ranked in terms of interface similarity. Benchmark experiments with bacterial, plant and human data were performed to measure the predictive power of footprintDB searches, which were able to correctly recover 10%, 55% and 90% of the tested sequences, respectively. Correctly predicted TFs had a higher interface similarity than the average, confirming its diagnostic value.

Availability: Website implemented in PHP, Perl, MySQL and Apache. Freely available from http://floresta.eead.csic.es/footprintdb

H1-07 MAPI: a software framework for distributed biomedical applications Tor Johan Mikael Karlsson1, Oswaldo Trelles2 1Parque Tecnológico de Ciencias de la Salud, Avenida de la Innovación, nº 1, Armilla, Granada, ES, 2ETSI Informatica, Campus Teatinos, Universidad de Málaga, Malaga, ES

Background: The number of web-based resources (databases, tools etc.) in biomedicine has increased, but the integrated usage of those resources is complex due to differences in access protocols and data formats. However, distributed data processing is becoming inevitable in several domains, in particular in biomedicine, where researchers face rapidly increasing data sizes. This big data is difficult to process locally because of the large processing, memory and storage capacity required.

Results: This manuscript describes a framework, called MAPI, which provides a uniform representation of resources available online, in particular Web Services. The framework enhances their interoperability and collaborative use by enabling uniform, remote access. The framework functionality is organized in modules that can be combined and configured in different ways to fulfil specific development requirements.

Conclusions: The framework has been tested in the biomedical application domain, where it has been the basis for developing several clients that are able to integrate different web resources. The MAPI binaries and documentation are freely available at http://www.bitlab-es.com/mapi under the Creative Commons Attribution-No Derivative Works 2.5 Spain License. The MAPI source code is available by request (GPL v3 license).

Page 26 Posters Metagenomics

A5-01 An evaluation of the Earth Microbiome Project approach (amplicon region and bioinformatic pipeline) for eukaryotic biodiversity assessment Mikel Aguirre1, David Abad1, Aitor Albaina1, Iratxe Zarraonaindia2, Unai Aldalur1, Andone Estonba1 1Department of Genetics, Physical Anthropology & Animal Physiology. Faculty of Science and Technology. University of the Basque Country, Spain, Leioa, ES, 2Argonne National Laboratory, Argonne, IL 60439, U.S.A, Chicago, US

The Earth Microbiome Project (EMP) is a proposed massively multidisciplinary effort to analyse microbial communities across the globe (http://www.earthmicrobiome.org/); biodiversity is measured by applying massively parallel sequencing of certain barcodes (also known as metabarcoding) to DNA extracts from environmental samples. For marine eukaryotic communities, such as phytoplankton and zooplankton, EMP has proposed an 18S rRNA region that, after sequencing and bioinformatic analysis with the QIIME pipeline (http://qiime.org/), is to be the gold standard in eukaryotic biodiversity assessment worldwide. However, neither the discriminatory power of the targeted region nor the efficiency of the proposed pipeline has been assessed properly. To address this, we have evaluated in silico the discriminatory power of the EMP region for plankton taxonomy by retrieving all publicly available 18S rRNA sequences, followed by alignment and polymorphism calling using a combination of existing and in-house scripts; we have also included sequences generated in situ for key species not represented in public databases. We have further tested this by comparing the biodiversity estimates for a selection of Bay of Biscay plankton samples obtained by both the EMP's metabarcoding approach (Illumina 2x150 paired-end reads of the 18S rRNA amplicon) and classic (light microscopy) taxonomy. In this analysis, apart from natural (collected in situ) plankton samples, we included some so-called mock samples comprising a known number of previously sorted individuals. Finally, we compared the results obtained by 1) the standard QIIME pipeline and 2) a new pipeline adding an external trimming step and an alternative method for merging paired-end reads. The results obtained are the first step towards the implementation of metabarcoding as an automated plankton biodiversity monitoring tool in European environmental policies.

Page 27 Posters Integrative Biology

B1-01 Integrative analysis of genomic and epigenomic data to discover new MPNST drivers Bernat Gel1, Ernest Terribas1, Tapan Mehta2, Josep Biayna1, Juana Fernandez-Rodriguez3, Ignacio Blanco4, Peggy Wallace5, Nancy Ratner6, Conxi Lázaro3, Eduard Serra1 1Institut de Medicina Predictiva i Personalitzada del Càncer (IMMPC), Badalona, ES, 2University of Alabama at Birmingham, Birmingham, US, 3Institut Català d'Oncologia (ICO) – IDIBELL, L'Hospitalet de Llobregat, ES, 4Hospital Germans Trias i Pujol, Badalona, ES, 5University of Florida, Gainesville, US, 6Cincinnati Children's Hospital Research Foundation, Cincinnati Children's Hospital Medical Center, Cincinnati, US

Background and aims

Malignant Peripheral Nerve Sheath Tumors (MPNSTs) are one of the major clinical complications of NF1 patients and the leading cause of NF1-related mortality. MPNSTs have a poor prognosis, with a 5-year survival rate of 21%. Our aim is to identify genes and mechanisms driving the development and progression of MPNSTs by integrating different sources of genomic and epigenomic data.

Methods

We generated SNP-array data for 14 primary MPNSTs and 5 cell lines, sequenced the exome of 10 MPNSTs and 5 cell lines, generated DNA-methylation data with Illumina 450K arrays for 4 cell lines and used ChIP-on-chip to study three histone marks of 3 cell lines. In addition, we used the expression data from the NF1 Microarray Consortium to identify transcriptional imbalances (TIs), parts of the genome presenting regional up- or down-regulation of genes.

Results and Conclusions

The genomic analysis of MPNSTs and derived cell lines revealed highly altered genomes, with a high degree of hyperploidy and LOH and a global landscape of genomic alterations consistent with an origin in a catastrophic event. Gene expression levels were heavily influenced by somatic copy-number alterations (SCNA). In fact, overexpression TIs were strongly associated with recurrent copy-number gains (P<0.0005). Underexpression TIs, however, proved to be the result of a combination of regional alterations at the genomic and epigenomic levels. In addition, although our data are still preliminary, we have observed strong differences between sporadic and NF1-related MPNSTs in genome structure, DNA methylation levels and the frequency and type of point mutations, suggesting a different biological status. We have performed an integrative analysis of genomic and epigenomic MPNST data together with pathway and functional data to uncover potential driver genes and mechanisms for tumor progression. Using different heuristic data integration strategies, we have identified a number of candidates, which are currently under experimental validation.

B2-01 The Barcelona Supercomputing Center contribution to the PanCancer project for comprehensive cross-tumor genome analysis Romina Royo1, David Ocaña1, Javier Bartolomé1, David Vicente1, Ana Milovanovic1, Valentí Moncunill1, Josep Lluís Gelpí2, David Torrents3 1Barcelona Supercomputing Center, Barcelona, ES, 2Barcelona Supercomputing Center and Universitat de Barcelona, Barcelona, ES, 3Barcelona Supercomputing Center and Inst. Catalana de Recerca i Estudis Avançats, Barcelona, ES

The evolution of current sequencing technology has made it possible to undertake in-depth studies of the DNA alterations that trigger diseases. In the case of cancer, where the nature of the disease varies among patients, the analysis of genome pairs from normal and tumor cells of the same individual allows us to identify the specific mutations that may be responsible for the origin and evolution of the tumor in that specific patient. Following this rationale, the International Cancer Genome Consortium (ICGC) aimed to identify the genetic basis of cancer by analyzing up to 500 genome pairs from more than 50 cancer types.

As a natural evolution, the ICGC has launched a new worldwide initiative, called PanCancer, in which 4,000 genomes will be analyzed to search for the genetic basis of processes common to different cancer types. In contrast to the original ICGC projects, the rationale of the PanCancer project is to provide a uniform analysis of the whole dataset, so that a coherent picture of different cancer types can be obtained. Only a small set of centers participates in the project as computing providers: the University of Chicago, the European Bioinformatics Institute, the University of Tokyo, the Electronic and Telecom Research Institute, the German Cancer Research Center, and the Barcelona Supercomputing Center, coordinated by the Ontario Institute for Cancer Research. After an initial alignment and variant-calling phase, all results will be synchronized between the computing centers to offer, in the last phase of the project, a uniform data environment for further analysis.

This project constitutes a perfect opportunity to analyze, develop, and evaluate standard protocols for data transmission and large-scale genomic analysis involving several centers worldwide. The challenge of this project is its distributed nature: several computing centers need to coordinate to run the alignment and variant-calling workflows and finally offer a uniform environment to data users. A cloud-based model, which can provide a uniform computational environment irrespective of the provider, has been chosen. Individual VMs will be set up and configured and then distributed to the centers where the data is located. This will ensure homogeneity between all centers, but it implies not only challenges for computing capacities but also the need to find agreement between working and computing philosophies, which have to widen beyond strict high-performance computing models.

B2-02 BIER platform: analyzing and understanding genomic and biomedical data Francisco García-García1, Alejandro Alemán1, Joaquín Dopazo2 1Systems Genomics Lab, Centro de Investigación Príncipe Felipe (CIPF), Valencia, ES, 2Functional Genomics Node (INB); BIER CIBER de Enfermedades Raras (CIBERER); Systems Genomics Lab, Centro de Investigación Príncipe Felipe (CIPF), Valencia, ES

INTRODUCTION

BiER (Bioinformatics Platform for Rare Diseases; http://www.ciberer.es/bier) is a transversal working group whose function is to provide the experimental and clinical groups of CIBERER with the bioinformatic and technological support needed for the integration, analysis and interpretation of biomedical data (structural and functional genomics, modeling and molecular dynamics, metabolism, gene-phenotype/disease relationship networks).

METHODS

BiER has designed pipelines for the analysis of genomics and transcriptomics sequencing data and developed web tools to analyze and prioritize genes or mutations for diseases. This bioinformatic and technological support includes advice on experimental design, analysis strategy and interpretation of data. Several training activities were carried out to facilitate the understanding and management of data.

RESULTS

Scientific collaborations took place with 19 CIBERER groups: 173 exomes were analyzed in 94 different families. After including new methods in the pipeline, we reanalyzed 72 of the previous exomes to refine the selection of candidate variants. Recent publications include the discovery of two new mutations in the BCKDK gene, responsible for a neurobehavioral deficit in pediatric patients, and of new mutations in different genes causing inherited retinal dystrophies and metabolic diseases.

Several web tools were generated to analyze and improve the management of results:

1. BiERapp. A web-based interactive framework to assist in the prioritization of disease candidate genes in whole-exome sequencing studies.

2. ExomeServer. Created to provide the scientific and medical community with information about genetic variability in the Spanish population. It is useful for filtering polymorphisms and local variants.

3. TEAM. A web tool for the design and management of panels of genes for targeted enrichment and massive sequencing for clinical applications.

CONCLUSIONS

Interaction between the research groups and the BiER platform has been an important factor in designing and tuning the web tools for analyzing and interpreting sequencing data.

The results obtained from the analyses have provided a better understanding of the genomic data of these diseases, as well as the detection of biomarkers that can be used in prevention, diagnosis and the design of clinical therapies.

B2-03 A new bioinformatics pipeline to address the most common requirements in RNA-seq data analysis Osvaldo Graña1, Miriam Rubio-Camarillo2, Florentino Fdez-Riverola3, David G. Pisano1, Daniel González-Peña3 1Bioinformatics Unit, Structural Biology and BioComputing Programme. Spanish National Cancer Research Centre (CNIO), Madrid, ES, 2Structural Computational Biology Group, Structural Biology and Biocomputing Programme, Spanish National Cancer Research Centre (CNIO), Madrid, ES, 3ESEI: Escuela Superior de Ingeniería Informática, University of Vigo, Ourense, ES

For an experimental laboratory that performs RNA-seq experiments and, more specifically, for a dedicated bioinformatics facility, it is mandatory to answer common scientific questions quickly and automatically. Bioinformatics pipelines are used in this context since they provide a comprehensive set of functionalities in a flexible and reconfigurable environment. We have developed a new pipeline that embeds some of the most common tools in the field of RNA sequencing (1,2). Our pipeline includes the possibility of: (i) assessing sequencing quality and checking for possible contamination, (ii) trimming low-quality nucleotides at both 3’ and 5’ ends, (iii) performing random down-sampling of reads in case the different libraries are not properly balanced, (iv) aligning reads against the reference transcriptome and/or the reference genome, (v) providing information about the proportion of reads covering the different genomic regions and the alignment percentages, (vi) assembling transcripts and quantifying their abundance, (vii) calibrating transcript expression across the samples when spike-in controls are available, (viii) calculating levels of correlation among the samples in the experiment, (ix) performing differential expression tests of genes/isoforms between conditions, (x) predicting gene fusions and (xi) creating files that can be uploaded directly to genome browsers such as the UCSC Genome Browser or Ensembl in order to visualize alignments. The pipeline will be integrated in the RUbioSeq (3) software suite, which will facilitate its maintenance and its execution in HPC environments.

References:

1. Li H, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009 Aug 15;25(16):2078-9.

2. Trapnell C, et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc. 2012 Mar 1;7(3):562-78.

3. Rubio-Camarillo M, et al. RUbioSeq: a suite of parallelized pipelines to automate exome variation and bisulfite-seq analyses. Bioinformatics. 2013 Jul 1;29(13):1687-9.
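Step (iii) of the pipeline, random down-sampling of unbalanced libraries, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the pipeline's own code: reads are represented as plain strings, the function names are hypothetical, and reservoir sampling is used here only because it needs a single pass over each library.

```python
import random


def downsample_reads(reads, target, seed=0):
    """Draw `target` reads uniformly at random in one pass (reservoir sampling)."""
    rng = random.Random(seed)
    reservoir = []
    for index, read in enumerate(reads):
        if index < target:
            reservoir.append(read)
        else:
            # Replace a stored read with probability target / (index + 1).
            slot = rng.randint(0, index)
            if slot < target:
                reservoir[slot] = read
    return reservoir


def balance_libraries(libraries, seed=0):
    """Down-sample every library to the size of the smallest one."""
    target = min(len(reads) for reads in libraries.values())
    return {
        name: downsample_reads(reads, target, seed)
        for name, reads in libraries.items()
    }


# Invented toy libraries with very different depths.
libraries = {
    "control": [f"read_c{i}" for i in range(1000)],
    "treated": [f"read_t{i}" for i in range(250)],
}
balanced = balance_libraries(libraries)
print({name: len(reads) for name, reads in balanced.items()})
```

Fixing the seed makes the sub-sample reproducible, which matters when the same libraries are re-analysed later.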

B2-04 PsyGeNET: a curated resource on associations between genes and psychiatric disorders Alba Gutiérrez1, Solène Grosdidier1, Olga Valverde1, Marta Torrens2, Àlex Bravo1, Janet Piñero1, Ferran Sanz1, Laura Ines Furlong1 1Hospital del Mar Medical Research Institute (IMIM), Barcelona (Spain) Department of Experimental and Health Sciences, Universitat Pompeu Fabra, Barcelona (Spain), Barcelona, ES, 2Institute of Neuropsychiatry and Addiction, Parc de Salut Mar; Department of Psychiatry, Universitat Autònoma de Barcelona, Barcelona, ES

Comorbidity is the norm among common mental disorders, as more than 50% of affected people meet criteria for multiple diseases. Among these comorbidities, the coexistence of mood and substance use disorders (SUD) is attracting growing interest in the scientific community because of its high prevalence and its association with greater severity of illness and a higher rate of recurrence for both disorders. In particular, alcohol and cocaine dependence are frequently associated with depression. Several mechanisms have been proposed to explain the coexistence of diseases in one patient, such as pathophysiological and clinical factors, drug side effects, lifestyle, lack of access to health services, as well as differences in socioeconomic status. Recently, it has been proposed that comorbidities may have a common genetic origin. In particular, the genes associated with different diseases that co-occur in a given patient, or the proteins encoded by these genes, might be shared by the co-occurring diseases or might be involved in the same biological processes. We present here PsyGeNET (Psychiatric disorders Gene association NETwork), a new resource that integrates information on psychiatric disorders and their genes, offering exploratory tools for the analysis of gene-disease associations and psychiatric disease comorbidities. In this first version of the database, we focused on three psychiatric disorders: depression, alcohol use disorder and cocaine addiction. PsyGeNET, composed of a database and a set of analysis tools, is the result of integrating information from DisGeNET (http://www.disgenet.org/) with data extracted from the literature by in-house text mining tools (http://ibi.imim.es/befree/). An important aspect of PsyGeNET is that the data identified by text mining were reviewed and selected by two experts in the field of psychiatry before entering the database. A web-based annotation tool was used to assist the curation process. A comparison with other available resources indicates that PsyGeNET is unique regarding coverage and data quality. According to PsyGeNET, the three psychiatric disorders might be linked at the molecular level through shared genes and proteins. All in all, due to its special focus on psychiatric diseases, its comprehensiveness and the high quality of its data, PsyGeNET represents a valuable resource for the analysis of the molecular underpinnings of psychiatric disorders and their comorbidities.

B2-05 Bio4j: the bioinformatics data platform Pablo Pareja-Tobes, Alexey Alekhin, Evdokim Kovach, Marina Manrique, Eduardo Pareja, Raquel Tobes, Eduardo Pareja-Tobes Oh no sequences! Research Group, Era7 bioinformatics, Granada, ES

Bio4j (http://bio4j.com) is a cloud-based high-performance bioinformatics graph database. It is one of the first and most important graph databases for biological data with a special focus on data integration: it integrates most data available in UniProtKB (Swiss-Prot + TrEMBL), Gene Ontology (GO), UniRef (50, 90, 100), RefSeq, NCBI taxonomy, and the ExPASy Enzyme DB. All this data is organized in Bio4j in a semantically equivalent graph structure. It allows many different types of nodes and relationships, making it well suited to highly interconnected, complex biological data. Graph databases allow fast local access to all the elements related to each entity through the edges that connect them with others. So, from a performance point of view, queries that would be impossible to perform with a standard relational database take no more than a couple of seconds with Bio4j.
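The idea of fast local access through typed edges can be illustrated with a small in-memory sketch. This is not the Bio4j API: the `TinyGraph` class, the edge-type names and the identifiers below are all hypothetical and serve only to show how a query becomes a short walk along edges instead of a series of table joins.

```python
from collections import defaultdict


class TinyGraph:
    """Toy property-graph stand-in: typed, directed adjacency lists."""

    def __init__(self):
        # node -> edge type -> list of target nodes
        self._out = defaultdict(lambda: defaultdict(list))

    def add_edge(self, source, edge_type, target):
        self._out[source][edge_type].append(target)

    def neighbours(self, node, edge_type):
        # Local lookup: cost depends on the node's own edges, not on graph size.
        return list(self._out.get(node, {}).get(edge_type, []))


def go_terms_for_taxon(graph, taxon):
    """Collect the GO terms annotated on every protein of a taxon."""
    terms = set()
    for protein in graph.neighbours(taxon, "HAS_PROTEIN"):
        terms.update(graph.neighbours(protein, "ANNOTATED_WITH"))
    return sorted(terms)


# Invented identifiers, for illustration only.
g = TinyGraph()
g.add_edge("taxon:T1", "HAS_PROTEIN", "protein:P1")
g.add_edge("taxon:T1", "HAS_PROTEIN", "protein:P2")
g.add_edge("protein:P1", "ANNOTATED_WITH", "go:A")
g.add_edge("protein:P2", "ANNOTATED_WITH", "go:A")
g.add_edge("protein:P2", "ANNOTATED_WITH", "go:B")

terms = go_terms_for_taxon(g, "taxon:T1")
print(terms)
```

Each hop only touches the edges stored with the current node, which is the property the abstract attributes to graph databases.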

Bio4j is in active development and grows rapidly: it now includes 1,216,993,547 relationships and 190,625,351 nodes, which is close to tripling the figures from one year ago. A flexible module system based on Statika is provided with Bio4j, enabling the user to build and deploy only the modules needed for the analysis.

Bio4j is now based on an abstract domain model which decouples the inner database implementation from the relationships among the entities themselves. This allowed us to have a default implementation using Blueprints, the de facto standard for graph data modeling, thus making the domain model independent of the choice of database technology. Building on that, we now offer binary distributions for Neo4j and TitanDB backends, yielding a dramatic increase in performance through backend-specific optimizations, such as vertex-local edge-typed indexes in TitanDB.

Bio4j is open source, available under the AGPLv3 license.

This project is funded in part by the ITN FP7 project INTERCROSSING (Grant 289974).

B4-01 A workflow with dedicated tools for preparing reference transcriptomes from non-model organisms has evidenced important biological information Hicham Benzekri1, Pedro Seoane1, Rosario Carmona2, Rocío Bautista2, Darío Guerrero-Fernández2, Noé Fernández-Pozo1, M. Gonzalo Claros1 1Departamento de Biología Molecular y Bioquímica, Plataforma Andaluza de Bioinformática. Universidad de Málaga, Málaga, ES, 2Plataforma Andaluza de Bioinformática, Universidad de Málaga, Málaga, ES

Constructing transcriptomes of non-model organisms is a common task nowadays due to the advent of low-cost sequencing platforms. Non-model organisms require a de novo assembly strategy since no reference is available. Moreover, DNA or RNA from non-model species usually comes from natural, highly heterozygous populations, adding further complexity to the assembly process. To facilitate the otherwise cumbersome task of obtaining the best transcriptome, we present here an automatic, reliable pipeline that uses public and newly designed tools to do so, handling both short and long reads. It starts with our pre-processing software SeqTrimNext, which extracts reliable reads and removes any uncertain sequence that could obscure the final result. Two different algorithms (usually MRA3 and Euler-SR, or Oases) are used to provide different sets of contigs, which are then simplified using CD-HIT. Mapping with Bowtie2 is used to discard artefactual contigs. Reliable contigs are analysed with our software Full-LengtherNext to discard non-coding sequences, split chimeras, detect transcripts containing complete proteins, provide an overview of the transcriptome, and sort putative new or species-specific transcripts. Other valuable features of Full-LengtherNext are the selection of the closest orthologue from a model species and the extraction of a reference transcriptome that can be further used for RNA-seq studies. Finally, descriptions, GO, InterPro, KEGG and EC codes are added using Sma3 and AutoFact, and a set of microsatellite markers is obtained with MREPS. The pipeline has been automated with AutFlow, a framework developed in our laboratory to automate long, repetitive tasks, which allows decisions (such as which assembly is best) to be taken during execution without human intervention. This strategy has already been used to assemble several transcriptomes and to provide functional characterisation of pollen tube genes (olive tree), genes related to eye development (sole) and resistance to blight (bean), and a comparison of gene family sizes (pine). Moreover, many genes have been cloned in pine, sole and olive tree based on the sequences revealed by our pipeline.

B4-02 Topological network properties for protein domain relationships enhancement Joan Segura, Carlos Oscar Sorzano, Jesus Cuenca-Alba, Jose Maria Carazo Garcia Centro Nacional de Biotecnologia-CSIC, Madrid, ES

Experimental methods to analyse specific protein interactions can provide very accurate descriptions of the binding regions, interacting residues, etc. However, these methods require complex and time-consuming assays and thus cannot cope, in terms of throughput, with the vast amount of interactomic data coming from a large range of high-throughput technologies. Therefore, the number of protein interactions that can be explained with current experimental knowledge covers only a small fraction of all possible ones. In this work we describe a new approach to infer interactions between protein domains (DDI, for Domain-Domain Interaction) based on the topology of protein interaction networks. We have defined two novel metrics that quantify the degree of cohesiveness between two sets of nodes within a network. These metrics measure the proportion of interacting nodes between two sets of proteins and the fraction of common neighbours. This approach extends previous work where homolog coefficients were first defined around network nodes and later around edges (1). Moreover, we have shown that when the selected sets are the proteins in the network that contain a particular domain, the scores can be used to distinguish between interacting and non-interacting domain pairs. Different databases were used to implement the proposed approach. The protein interaction networks were collected from the STRING database (2), and the Pfam classification (3) was then used to define the protein domains of the network nodes. The method was implemented in a web application that includes the interactomic and structural information used during the implementation. The application allows the evaluation of potential interactions between sets of domains. Moreover, when structural information is available, the application can display binding sites, interacting residues or possible poses for the different domains and their interactions. (1) Goldberg DS et al. Proc Natl Acad Sci. 200
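The abstract does not give formulas for the two cohesiveness metrics, so the sketch below shows only one plausible reading of them, and the definitions are assumptions made for this example: the first score is the fraction of cross-set protein pairs joined by an edge, and the second is the Jaccard overlap between the neighbourhoods of the two sets. The function names and the toy network are likewise hypothetical.

```python
from collections import defaultdict


def interacting_fraction(edges, set_a, set_b):
    """Fraction of pairs (a in set_a, b in set_b, a != b) joined by an edge."""
    edge_set = {frozenset(edge) for edge in edges}
    pairs = {frozenset((a, b)) for a in set_a for b in set_b if a != b}
    if not pairs:
        return 0.0
    return sum(1 for pair in pairs if pair in edge_set) / len(pairs)


def common_neighbour_fraction(edges, set_a, set_b):
    """Jaccard overlap between the neighbourhoods of the two node sets."""
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    near_a = set().union(*(neighbours[node] for node in set_a))
    near_b = set().union(*(neighbours[node] for node in set_b))
    union = near_a | near_b
    if not union:
        return 0.0
    return len(near_a & near_b) / len(union)


# Invented toy network: set_a stands for proteins carrying one domain,
# set_b for proteins carrying another.
edges = [("p1", "p3"), ("p1", "p4"), ("p2", "p3"), ("p5", "p1"), ("p5", "p3")]
set_a = {"p1", "p2"}
set_b = {"p3", "p4"}

direct = interacting_fraction(edges, set_a, set_b)
shared = common_neighbour_fraction(edges, set_a, set_b)
print(direct, shared)
```

Under this reading, a domain pair whose protein sets score high on both measures would be a candidate interacting pair.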

B4-03 Elucidating the gene regulatory network governing the eye development in medaka Juan L. Mateo, Ina Weisswange, Beate Wittbrodt, Joachim Wittbrodt Centre for Organismal Studies (COS), University of Heidelberg, Heidelberg, DE

The eye, as part of the central nervous system, represents an invaluable system to study neurogenesis. In addition, the eye of some animals, such as fish, has the remarkable ability to regenerate and grow continuously even during adulthood. The mechanisms by which the delicate balance between proliferation and differentiation is achieved, both during development and in the adult organism, are currently not completely understood. We aim to unravel the principles of such control at the transcriptional level using medaka, a Japanese killifish, as a model system. We have developed a bioinformatic pipeline using high-throughput sequencing data, RNA-seq and ChIP-seq, in order to identify the key genes involved in eye development, on the one hand, and putative CREs (Cis-Regulatory Elements), on the other. Integrating these data at the next level allows us to determine a network of interactions between transcription factors and the association of distal regulatory elements, or enhancers, with proximal ones, or promoters. In this way we will be able to elucidate the transcriptional regulatory network that governs the trade-off between proliferation and differentiation. We validate these predictions in vivo in medaka embryos, taking advantage of an enhancer assay with site-specific integration previously developed in the lab, which makes it possible to evaluate CREs directly in injected embryos.

B4-04 Computational strategies for a more accurate microRNA target prediction Dannys Jorge Martinez-Herrera, Daniel Tabas-Madrid, Carlos Oscar Sanchez-Sorzano, Alberto Pascual-Montano Centro Nacional de Biotecnologia-CSIC, Madrid, ES

MicroRNAs have a great impact on protein output in both plants and animals. These ~22-nucleotide-long molecules represent another layer of regulation of gene expression, at the post-transcriptional level. In plants, miRNAs bind mostly to the coding sequence of mRNAs, with perfect or almost perfect complementarity. The 5’ region of animal miRNAs (the “seed”) is of high importance for binding their sites, which are mostly located in the 3’UTRs of mRNAs. However, there is still much to learn about the mode of action of miRNAs. We propose four different methods to improve the prediction of miRNA targets. First, we developed a statistical approach to combine a large set of available prediction algorithms and assign each interaction a credibility measure related to experimentally validated interactions. All scored interactions for each miRNA can be queried at m3RNA (http://m3rna.cnb.csic.es/). We also predict targets using the sequences of miRNAs and 3’UTRs. For this we filter the output of five complementary algorithms, based on biological data regarding the mode of action of miRNAs. Putative sites for each miRNA must first be predicted by more than one algorithm to occur at the same position of the 3’UTR. Using transcriptomics data, we also determine the proportion of downregulated genes in the whole genome and compare it to the proportion of downregulated genes among the targets of each miRNA, in order to discover miRNAs with different ratios. Their significance is evaluated by a Wilson approximation to the hypothesis test of equality of two proportions following a binomial distribution. Finally, we determine whether repressed mRNAs are enriched in putative miRNA target sites by calculating the average number of sites per 1 kb of 3’UTR sequence. In this last method we include the calculation of the miRNA-target binding free energy to better estimate the probability of predicting a real pair. Unlike most algorithms for miRNA target prediction, we evaluate the interaction of the microRNA with the whole mRNA sequence. This allows us to perform the predictions in a nature-like way. This work has been funded by the Children’s Tumor Foundation, the PRB2-ISCIII platform supported by grant PT13/0001 and the Government of Madrid (P2010/BMD-2305).
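The last measure, the average number of putative sites per 1 kb of 3’UTR sequence, can be sketched in a few lines. The site definition used here (an exact match to the reverse complement of miRNA nucleotides 2-8) is an assumption made for this example, as are the function names and the toy sequences; the abstract's own method combines several prediction algorithms and a binding free energy term, which this sketch does not attempt to reproduce.

```python
# Complement table for the RNA alphabet.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def seed_site(mirna):
    """Target-site motif: reverse complement of miRNA nucleotides 2-8."""
    seed = mirna.upper().replace("T", "U")[1:8]
    return seed.translate(COMPLEMENT)[::-1]


def count_sites(utr, site):
    """Count (possibly overlapping) exact occurrences of `site` in a 3'UTR."""
    utr = utr.upper().replace("T", "U")
    width = len(site)
    return sum(1 for i in range(len(utr) - width + 1) if utr[i:i + width] == site)


def sites_per_kb(utrs, mirna):
    """Average number of seed-match sites per 1 kb of 3'UTR sequence."""
    total_length = sum(len(utr) for utr in utrs)
    if total_length == 0:
        return 0.0
    site = seed_site(mirna)
    total_sites = sum(count_sites(utr, site) for utr in utrs)
    return 1000.0 * total_sites / total_length


# Toy inputs: one 22-nt miRNA and two 50-nt UTRs, only the first of which
# contains the seed-match motif.
example_mirna = "UGAGGUAGUAGGUUGUAUAGUU"
example_utrs = ["A" * 10 + "CUACCUC" + "A" * 33, "G" * 50]

density = sites_per_kb(example_utrs, example_mirna)
print(seed_site(example_mirna), density)
```

Comparing this density between repressed mRNAs and a background set is what would indicate enrichment.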

B4-05 Proteogenomics Dashboard for the Human Proteome Project Daniel Tabas-Madrid1, Joao Alves-Cruzeiro1, Victor Segura2, Elizabeth Guruceaga2, Vital Vialas1, Gorka Prieto3, Fernando Corrales2, Juan Pablo Albar1, Alberto Pascual-Montano1 1Centro Nacional de Biotecnologia-CSIC, Madrid, ES, 2Centro de Investigación Médica Aplicada - Universidad de Navarra, Pamplona, ES, 3Universidad del País Vasco, Bilbao, ES

The Human Proteome Project (HPP) aims to map the entire human proteome in a systematic way. Two of the programs to achieve this goal are the Chromosome-based HPP (C-HPP), which characterizes the human proteome on a chromosome-by-chromosome basis, and the Biology/Disease HPP (B/D-HPP), which provides a framework for the coordination of biology- and disease-based contributions. These projects specifically study the uncharacterized products of known protein-coding genes and the variants generated by alternative splicing and coding SNPs (Single Nucleotide Polymorphisms), and also pursue a comprehensive characterization of PTMs (Post-Translational Modifications). In this work we have followed the strategy of analogous genomics projects such as the Encyclopedia of DNA Elements (ENCODE), which provides a vast amount of data on experiments in different human cell lines and reports them in an intuitive, interactive web-based dashboard. We have therefore developed a proteomics-based dashboard named dasHPPboard that collects and reports the experiments produced by the HPP consortium. A logical first step has been the integration of the data produced by the Spanish contribution to the HPP project (chromosome 16). The dashboard includes results of shotgun and MRM (Multiple Reaction Monitoring) proteomics experiments, PTM information and a proteogenomics study of the Cancer Cell Line Encyclopedia (CCLE). We have also processed the ENCODE and Human Body Map (HBM) transcriptomics data to identify cell lines with high expression levels of protein-coding genes, especially those classified as “missing”, for which no strong proteomic evidence is available, allowing the selection of cell lines or tissues in which to conduct the proteomics studies. We produce and make available for download the alternative peptide databases built using novel junctions and SNPs derived from RNA-Seq data, to be used for protein identification in the same cell line or tissue. We expect the dashboard to become the central place for all experiments produced and collected by the C-HPP project, allowing the community to quickly explore and find the wide range of experiments produced. The dashboard can be freely accessed at: http://sphppdashboard.cnb.csic.es This work has been funded by the PRB2-ISCIII platform supported by grant PT13/0001, the Government of Madrid (P2010/BMD-2305) and the Children’s Tumor Foundation.

B4-06 Systemic approaches to predict novel protein functional associations Ian Morilla1, Juan A.G. Ranea2 1Swiss Institute of Bioinformatics, University of Zurich, Zurich, CH, 2Department of Molecular Biology and Biochemistry-CIBER de Enfermedades Raras, University of Malaga, Málaga, ES

The advent of high-throughput assays enables experimentalists to identify molecular subsets whose function is associated with biological systems of interest, in complete organisms. Nevertheless, factors such as experimental constraints or researchers’ limitations in processing these complex data sometimes introduce inaccuracies that may result in some key molecular players being missed. Hence, accurate functional characterization of the large datasets returned by high-throughput experiments remains a major challenge in Systems Biology.

We evaluated the potential to build accurate and comprehensive protein interaction models by means of meta-statistical integration of different computational methods and PPI data resources, which contain valuable information on real biological networks. These models show that many regions of the human protein-protein interaction maps are still uncharted, constituting the ‘dark matter’ of many functional systems. The “dark matter” term, borrowed from astronomy, refers to those protein associations that can be predicted but have not been experimentally characterized yet (Ranea et al., 2010).

We collaborate actively with experimental groups to demonstrate the utility of different bio-computational prediction approaches in finding novel components of, and associations with, the biological systems they are studying. There have been some successful outcomes to date, such as the characterization of new spindle components (Rojas et al., 2012) and the discovery of new proteins involved in the chromosome condensation that occurs during cell division (Hériché et al., 2014). We believe that successful practical collaborations with experimental groups are the most effective way of encouraging higher confidence in predictive models.

References:

- Ranea JA, Morilla I, Lees JG, Reid AJ, Yeats C, Clegg AB, Sanchez-Jimenez F, Orengo C. Finding the “dark matter” in human and yeast protein network prediction and modelling. PLoS Comput Biol. 6(9). pii: e1000945. (2010).

- Rojas AM, Santamaria A, Malik R, et. al. …, Orengo C, Valencia A, Ranea JA. Uncovering the molecular machinery of the human spindle--an integration of wet and dry systems biology. PLoS One. 7(3):e31813. (2012).

- Hériché JK, Lees JG, Morilla I, et al…. Ranea JA, Orengo C, Ellenberg J. Integration of biological data by kernels on graph nodes allows prediction of new genes involved in mitotic chromosome condensation. Mol Biol Cell. pii: mbc.E13-04-0221. (2014).

B4-07 ChlamyNET, a software tool for the exploration of co-expression patterns in the transcriptome of Chlamydomonas reinhardtii Francisco J. Romero-Campero1, Ignacio Pérez-Hurtado1, F. Javier Sánchez-Ortiz1, José M. Romero2, Federico Valverde2 1Departamento de Ciencias de la Computación e Inteligencia Artificial, Universidad de Sevilla, Sevilla, ES, 2Instituto de Bioquímica Vegetal y Fotosíntesis (CSIC - Universidad de Sevilla), Sevilla, ES

Chlamydomonas reinhardtii is a reference model organism for algal genomics and physiological studies. It is of special interest in the study of the evolution of regulatory pathways between algae and higher plants (Serrano et al., 2009). Additionally, Chlamydomonas has recently gained attention as a potential source for biodiesel production. The sequencing of its genome (Merchant et al., 2007) and the accumulation of RNA-seq data in public databases (Wheeler et al., 2005) have allowed researchers to analyse its transcriptome under different physiological conditions. Up to now these studies have remained fragmented, so integrative approaches based on molecular systems biology methodologies are needed to reveal novel global properties of its transcriptome.

In order to integrate all this fragmented knowledge we have constructed a gene co-expression network based on RNA-seq data and developed a web tool called ChlamyNET for the exploration of the Chlamydomonas transcriptome. Topological analysis of ChlamyNET showed that it is a scale-free and small-world network, which suggests that the Chlamydomonas transcriptome has relevant characteristics related to error tolerance, vulnerability and information propagation. Clustering techniques applied to ChlamyNET identified a central cluster where the most authoritative hub genes are located, interconnecting key biological processes such as light signaling and protein phosphorylation with carbon/nitrogen metabolism and metabolite transport. Analyses performed using ChlamyNET have revealed an apparent photoperiodic control of starch synthesis, lipid metabolism and cell cycle.
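A gene co-expression network of the kind described above can be sketched by linking genes whose expression profiles are strongly correlated and then inspecting node degrees to find hubs. The abstract does not state which correlation measure or cut-off ChlamyNET uses, so the Pearson coefficient, the 0.9 threshold, the function names and the expression values below are all assumptions made for this illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return cov / sqrt(var_x * var_y)


def coexpression_network(expression, threshold=0.9):
    """Link every gene pair whose absolute correlation reaches the threshold."""
    edges = set()
    for gene_1, gene_2 in combinations(sorted(expression), 2):
        if abs(pearson(expression[gene_1], expression[gene_2])) >= threshold:
            edges.add((gene_1, gene_2))
    return edges


def node_degrees(expression, edges):
    """Number of co-expression partners per gene (hubs have high degree)."""
    degree = {gene: 0 for gene in expression}
    for gene_1, gene_2 in edges:
        degree[gene_1] += 1
        degree[gene_2] += 1
    return degree


# Invented expression values across four conditions.
expression = {
    "gene_a": [1.0, 2.0, 3.0, 4.0],
    "gene_b": [2.0, 4.0, 6.0, 8.0],
    "gene_c": [4.0, 3.0, 2.0, 1.0],
    "gene_d": [1.0, 3.0, 1.0, 3.0],
}
edges = coexpression_network(expression)
degree = node_degrees(expression, edges)
print(sorted(edges), degree)
```

On a genome-scale network, the distribution of these degrees is what one would examine to judge whether the network looks scale-free.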

References:

1. Merchant S, Prochnik S, Vallon O, Harris E, Karpowicz S, Witman G, Terry A, Salamov A, Fritz-Laylin L, Marechal-Drouard L, et al.: The Chlamydomonas genome reveals the evolution of key animal and plant functions. Science 2007, 318(5848):245–250.

2. Serrano G, Herrera-Palau R, Romero JM, Serrano A, Coupland G, Valverde F: Chlamydomonas CONSTANS and the evolution of plant photoperiodic signaling. Current Biology 2009, 19(5):359-368.

3. Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Church DM, DiCuccio M, Edgar R, Federhen S, Helmberg W, Kenton DL, Khovayko O, Lipman DJ, Madden TL, Maglott DR, Ostell J, Pontius JU, Pruitt KD, Schuler GD, Schriml LM, Sequeira E, Sherry ST, Sirotkin K: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2005, 33:D39-D45.

B5-01 STATegra EMS: an Experiment Management System for complex next-generation omics experiments Rafael Hernández de Diego1, Noemi Boix Chova2, Imad Abugessaisa2, Jesper Tegner2, David Gómez Cabrero2, Ana Conesa Cegarra1 1Genomics of Gene Expression Lab, Centro de Investigación Príncipe Felipe, Valencia, ES, 2Computational Medicine, Karolinska Institute, Stockholm, SE

High-throughput sequencing and NGS-based assays have gained popularity as methodologies to study different levels of genome organization. One of the advantages of NGS experiments is that little or no a priori genome knowledge is required, which makes them universally applicable to the study of model and non-model organisms. The decreasing costs and commercial service availability of sequencing have put the technology within reach of most laboratories, which can now use one or more NGS assays in their research projects.

While the number of samples and replicates in these experiments is usually modest, it can quickly grow to several dozen samples, which then require standardized annotation, storage and management of preprocessing steps.

As part of the STATegra project, we have developed an Experiment Management System for omics experiments that include different types of NGS-based assays, proteomics and metabolomics data. The system specifically supports sequencing experiments such as RNA-seq, miRNA-seq, ChIP-seq, Methyl-seq and DNase-seq, and can easily be extended to support additional sequencing assays. It uses free, open source software technologies, such as Java Servlets, the Sencha EXT JS framework, the MySQL relational database system and the Apache Tomcat Servlet engine. The STATegra EMS is experiment-centric rather than sample-centric and has been conceived to support experiment annotation at research labs that perform many different types of NGS-based assays and may work with different sequencing platforms. The system supports metadata annotation with controlled vocabularies, batch import, and annotation and storage of the different processing steps of analysis pipelines, from raw data to ready-to-use measurements.

More info: http://stategra.eu/stategraems

Published: http://www.biomedcentral.com/1752-0509/8/S2/S9

B5-02 Drug activity prediction using mechanism-based biomarkers Alicia Amadoz1, Patricia Sebastián-León2, Joaquín Dopazo1 1Systems Genomics Lab, Centro de Investigación Príncipe Felipe (CIPF), Valencia, ES, 2Genomics of Gene Expression Lab, Centro de Investigación Príncipe Felipe (CIPF), Valencia, ES

Drug development and validation is an expensive and time-consuming process. Several computational approaches are being used to identify interesting compounds that could be good candidates for further clinical trials. In this sense, much effort has been directed towards the identification of genomic biomarkers related to cancer drug response.


However, the identification of such biomarkers is complicated by the high levels of cancer heterogeneity. Here, we present prediction results of cancer drug sensitivity from 12 human tumour cell lines and 7 compounds using novel mechanism-based biomarkers. A major challenge in the development of cancer therapies is to discover the molecular mechanisms of drug action. Knowledge of the specific molecular alterations caused by drugs would suggest more effective and personalized treatment strategies. Signaling pathways provide a biological framework to quantify the functional activity of the cell. Consequently, the activation status of stimulus-response circuits can be used as a richly informative biomarker that provides mechanistic explanations for the molecular basis of drug effect [1]. In the present work, two public gene expression datasets [2,3] were used to connect the signaling activity profiles of cancer cell lines with the concentration at which the drug response reached an absolute inhibition of 50% (IC50). Prediction models were obtained per cancer and drug using our proposed methodology (in preparation) with one of the public datasets and were validated using the second one. We found that predictions based on mechanism-based biomarkers were accurate and that the suggested molecular mechanisms had been reported in previous studies.

References:

[1] Sebastián-León et al. (2013). Inferring the functional effect of gene expression changes in signaling pathways. Nucleic Acids Research, 41(W1):W213-W217.

[2] Garnett et al. (2012). Systematic identification of genomic markers of drug sensitivity in cancer cells. Nature, 483(7391):570-5.

[3] Barretina et al. (2012). The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature, 483(7391):603-7.

B5-03 BiERapp: A web-based interactive framework for the prioritization of disease candidate genes in whole exome sequencing studies Alejandro Alemán1, Francisco García1, Francisco Salavert1, Ignacio Medina2, Joaquín Dopazo3 1BIER CIBER de Enfermedades Raras (CIBERER); Systems Genomics Lab, Centro de Investigación Príncipe Felipe (CIPF), Valencia, ES, 2Systems Genomics Lab, Centro de Investigación Príncipe Felipe (CIPF); European Molecular Biology Laboratory – European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, UK, 3Functional Genomics Node (INB); BIER CIBER de Enfermedades Raras (CIBERER); Systems Genomics Lab, Centro de Investigación Príncipe Felipe (CIPF), Valencia, ES

Whole-exome sequencing has become a fundamental tool for the discovery of disease-related genes in familial diseases and the identification of somatic driver variants in cancer. However, finding the causal mutation among the enormous background of individual variability in a small number of samples is still a big challenge. Here we describe a web-based tool, BiERapp, which efficiently helps in the identification of causative variants in familial and sporadic genetic diseases. The program reads lists of predicted variants (nucleotide substitutions and indels) in affected individuals or tumor samples and controls. In family studies, different modes of inheritance can easily be defined to filter out variants that do not segregate with the disease along the family. Moreover, BiERapp integrates additional information, such as allelic frequencies in the general population and the most popular damaging scores, to further narrow down the number of putative variants in successive filtering steps. BiERapp provides an interactive and user-friendly interface that implements the filtering strategy used in a large-scale genomic project carried out by the Spanish Network for Research in Rare Diseases (CIBERER), in which more than 800 exomes have been analyzed. BiERapp is freely available at: http://bierapp.babelomics.org/
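The successive filtering described above can be illustrated with a small Python sketch. It is not BiERapp's implementation: the data layout, function names, variant identifiers and the 1% frequency cut-off are all hypothetical, and only two simple inheritance modes are modelled. Genotypes are encoded as the number of alternate alleles carried by each sample (0, 1 or 2).

```python
def segregates_recessive(genotypes, affected, unaffected):
    """Affected samples are homozygous for the variant; unaffected are not."""
    return (all(genotypes[s] == 2 for s in affected)
            and all(genotypes[s] < 2 for s in unaffected))


def segregates_dominant(genotypes, affected, unaffected):
    """Affected samples carry the variant; unaffected samples do not."""
    return (all(genotypes[s] >= 1 for s in affected)
            and all(genotypes[s] == 0 for s in unaffected))


def prioritize(variants, affected, unaffected, mode="recessive", max_af=0.01):
    """Keep variants that segregate with the disease and are rare."""
    checks = {"recessive": segregates_recessive,
              "dominant": segregates_dominant}
    check = checks[mode]
    kept = []
    for variant in variants:
        # Step 1: the variant must follow the chosen inheritance mode.
        if not check(variant["genotypes"], affected, unaffected):
            continue
        # Step 2: the variant must be rare in the general population.
        if variant["population_af"] > max_af:
            continue
        kept.append(variant["id"])
    return kept


# Hypothetical trio: an affected child and two unaffected parents.
variants = [
    {"id": "chr1:1000A>G", "population_af": 0.001,
     "genotypes": {"child": 2, "mother": 1, "father": 1}},
    {"id": "chr2:2000C>T", "population_af": 0.20,
     "genotypes": {"child": 2, "mother": 1, "father": 1}},
    {"id": "chr3:3000G>A", "population_af": 0.0005,
     "genotypes": {"child": 1, "mother": 1, "father": 0}},
]
candidates = prioritize(variants, affected=["child"],
                        unaffected=["mother", "father"], mode="recessive")
```

Here only the first variant survives: the second is too common in the population and the third does not segregate under a recessive model. A further step filtering on a damaging score would follow the same pattern.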


B5-04 CellMaps Francisco Salavert-Torres1, Luz Garcia-Alonso1, Marta Bleda1, Nacho Medina2, Joaquin Dopazo3 1BIER CIBER de Enfermedades Raras (CIBERER); Systems Genomics Lab, Centro de Investigación Príncipe Felipe (CIPF), Valencia, ES, 2Systems Genomics Lab, Centro de Investigación Príncipe Felipe (CIPF); European Molecular Biology Laboratory – European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, UK, 3Functional Genomics Node (INB); BIER CIBER de Enfermedades Raras (CIBERER); Systems Genomics Lab, Centro de Investigación Príncipe Felipe (CIPF), Valencia, ES

Genes, proteins and regulatory elements operate in an intricate network of interactions. A new paradigm has emerged to study these biological systems. This holistic paradigm aims to understand how the interactions of the components of biological systems give rise to function and how they contribute to phenotypes and diseases. In recent years several network analysis tools have been developed. The most popular and widely used, Cytoscape, is written in Java and runs as a local desktop application. However, the continuous increase of biological knowledge, the size of the networks and the computational complexity of the analyses make analysis on a local machine increasingly difficult. First, the analysis of big networks demands high computational and memory resources. Second, Next-Generation Sequencing (NGS) data can easily reach terabytes in size and are usually stored in the cloud. Third, systems biology analysis requires the integration of different types of biological information. Such data are usually spread across many different databases that need to be integrated and updated on a regular basis.

CellMaps is an open web-based application for visualization and network analysis. CellMaps solves the above-mentioned problems by following the new cloud-based software paradigm, which can be seen as an evolution of the client/server architecture. First, analyses are implemented using High-Performance Computing (HPC) techniques and executed on remote high-end cluster machines. Second, analyses are performed close to the data, so no data transfer or local storage is needed, as data are kept on cloud servers. Third, CellMaps makes extensive use of CellBase (Bleda et al., NAR web server 2012) web services, which query a comprehensive database containing heterogeneous biological information.

CellMaps has a user-friendly interface implemented in JavaScript using the new HTML5 and SVG standards; therefore, it runs in modern web browsers without requiring the installation of any Flash plug-in or Java Applet. Network data can be imported and displayed using the most common formats (SIF, SBML, dot). Users can also import an attribute file containing metadata about nodes and edges that can be used to filter and select sub-networks. In addition, users can import biological information from CellBase web services, such as Reactome pathways or IntAct protein-protein interaction data, and perform some network analyses.

B5-05 BioSWR: WSDL 2.0 model-based Semantic Web Services Registry Dmitry Repchevsky1, Josep Lluís Gelpí2 1Barcelona Supercomputing Center, Barcelona, ES, 2Barcelona Supercomputing Center and Universitat de Barcelona, Barcelona, ES

The arrival of the Internet brought new ways of processing and organizing biological information. The simplicity of web-based applications quickly made them a popular means to access biological tools and resources. Nevertheless, while traditional web applications are convenient for end users, they pose serious limitations for automatic data processing and integration. The need for an automatic method of communication in a network as intrinsically heterogeneous as the Internet brought Web services onto the scene. The Life Sciences community rapidly embraced the technology, making thousands of web services available.

The spectacular growth in the number of bioinformatics web services requires a way to discover those that match given criteria. Although this problem has already been tackled by popular web services catalogues such as BioCatalogue and the Embrace Web Services Registry, we present a novel semantic approach based on the latest W3C standards.

BioSWR is a semantic web services registry based on the Web Services Description Language (WSDL) 2.0 ontology and is especially targeted at the Life Science community. The registry provides a web-based interface for web services registration, querying and annotation, and is also accessible programmatically via a REST API or using the SPARQL Protocol and RDF Query Language.

While providing a semantic approach to web services descriptions, BioSWR also supports conventional, XML-based description languages such as WSDL 1.1 / 2.0 and the Web Application Description Language (WADL). In order to encompass more services, BioSWR supports BioMoby web services by embedding their semantic definitions into WSDL 2.0 descriptors.

Semantic representation greatly simplifies web services annotation and querying. BioSWR integrates the EMBRACE Data and Methods (EDAM) ontology as a source of semantic annotations and provides SPARQL Update language support to manage them.

A simple RESTful API facilitates BioSWR integration with other bioinformatics tools. Integration with the Taverna workflow management and execution tool is under development.

Our team actively works to extend its web services collection and encourages other providers to contribute.

B5-06 A cloud infrastructure for plant genomics Javier Alvarez1, Laia Codó2, Romina Royo2, Rosa Maria Badia1, Josep Lluís Gelpí3 1Barcelona Supercomputing Center - Computer Dept, Barcelona, ES, 2Barcelona Supercomputing Center - Life Dept. Computational Bioinformatics node - National Institute of Bioinformatics, Barcelona, ES, 3Barcelona Supercomputing Center. National Institute of Bioinformatics. Dept. Biochemistry and Molecular Biology - University of Barcelona, Barcelona, ES

TransPLANT (Trans-national Infrastructure for Plant Genomic Science - http://www.transplantdb.eu) is a European consortium established to design, implement, deploy and operate the software infrastructure critical to the future needs of plant genomics. Here we present the computational environment designed and built to offer a platform for programmatic and interactive access to plant genomic data and applications.

Although tools for genomic analysis are available, plant genomes pose additional difficulty because of their larger size and complexity (e.g. polyploidy, large amounts of repeated sequences). Downstream analyses require a large series of constantly evolving tools, some of them highly computationally intensive, together with a significant amount of manual operation by experts. For these reasons, a specific computational platform has been implemented. The requirements of the platform include flexibility, the ability to be installed next to data producers so that data transfer and possible data privacy issues are minimized, and multiscale capability so that a single tool can be executed both at low scale and at HPC level.

The strategy chosen for the transPLANT consortium is a virtualized environment. Tools are provided as a collection of virtual machines, ranging from well-known bioinformatics tools to workflows and software produced within the project. The cloud middleware OpenNebula (http://www.opennebula.org) is responsible for managing the hardware resources and the virtualized environments in a way that is transparent to the user. The infrastructure is powered by a multiscale programming model (COMPSs - http://www.bsc.es/computer-sciences/grid-computing/comp-superscalar), which makes it possible to take advantage of HPC, grid or cloud-based distributed computing without being forced to develop specific HPC software. An interface to OpenNebula has been implemented to allow the virtualized resources to be adjusted to the requirements of the workflows managed by COMPSs. End users can access the platform programmatically through SOAP-based web services using the programming model PMES and its Java API or, alternatively, through a web interface (the Dashboard, http://transplantdb.bsc.es/pmes/). The whole infrastructure is provided as a set of installable packages, so that data providers will be able to offer a uniform computational front end to access and analyze data.


B5-07 Integrated Gene Set Analysis for microRNA Studies Francisco García-García1, Joaquín Panadero2, Joaquín Dopazo3, David Montaner1 1Systems Genomics Lab, Centro de Investigación Príncipe Felipe (CIPF), Valencia, ES, 2Genómetra, Valencia, ES, 3Functional Genomics Node (INB); BIER CIBER de Enfermedades Raras (CIBERER); Systems Genomics Lab, Centro de Investigación Príncipe Felipe (CIPF), Valencia, ES

INTRODUCTION From a systems biology perspective, gene set analysis (GSA) allows us to understand the molecular basis of genome-scale experiments. Gene set methods are much more sensitive than single enrichment methods in detecting gene sets (defined as sets of genes with a common annotation) with a joint implication in a genomic experiment. However, there are currently no GSA methods tailored to the miRNA context. In this work we present a novel approach to the functional interpretation of miRNA studies which keeps the advantages of GSA.

METHODS We downloaded 20 datasets from The Cancer Genome Atlas (http://cancergenome.nih.gov/), containing tumoral and normal samples. Differential expression analysis was carried out for mRNA and miRNA levels (Bioconductor library edgeR). Information from miRNAs was transferred to gene level by adding their effects, generating a new index which ranks genes according to their differential inhibition by miRNA activity across biological conditions. Given this ranking statistic of the genes, we applied logistic regression models for GSA to each functional class. P-values were corrected for multiple testing using the Benjamini and Yekutieli method.
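For readers unfamiliar with the multiple testing step, the Benjamini and Yekutieli adjustment can be sketched in a few lines of Python. This is a generic illustration of the published procedure, not the authors' code (the abstract indicates an R/Bioconductor workflow); the function name and the example p-values are hypothetical. With m tests and c(m) = 1 + 1/2 + ... + 1/m, the adjusted value for the p-value of rank i is the minimum over ranks j >= i of p(j) · m · c(m) / j, capped at 1.

```python
def benjamini_yekutieli(pvalues):
    """Return Benjamini-Yekutieli adjusted p-values in the input order."""
    m = len(pvalues)
    if m == 0:
        return []
    # Harmonic correction factor that makes the procedure valid
    # under arbitrary dependence between tests.
    c_m = sum(1.0 / i for i in range(1, m + 1))
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        value = pvalues[idx] * m * c_m / rank
        running_min = min(running_min, value)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values for four gene sets.
raw = [0.01, 0.04, 0.03, 0.5]
adjusted = benjamini_yekutieli(raw)
```

The adjustment is more conservative than Benjamini-Hochberg by the factor c(m), which is the price paid for not assuming independence between gene sets that share genes.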

RESULTS This new approach has allowed us to obtain a genomic functional profile for different cancers from miRNA data. In our study we used Gene Ontology terms (http://www.geneontology.org/) to define gene sets, obtaining detailed functional results for each ontology (biological process, cellular component and molecular function).

CONCLUSIONS This method may be successfully applied in genomic functional profiling, transferring miRNA data to gene level so that GSA can be properly applied. Functional results take advantage of the knowledge already available in biological databases and can help to understand large-scale experiments from a systems biology perspective.

B5-08 Extraction of relations between genes and diseases from text and large-scale data analysis: implications for translational research Àlex Bravo, Janet Piñero, Núria Queralt, Michael Rautschka, Laura I. Furlong Research Program on Biomedical Informatics (GRIB), Department of Experimental and Health Sciences, Universitat Pompeu Fabra, IMIM (Hospital del Mar Medical Research Institute), Dr. Aiguader, 88 Barcelona, Spain, Barcelona, ES

Current biomedical research needs to leverage and exploit the large amount of information reported in publications. Automated text mining approaches, in particular those aimed at finding relationships between entities, are key for the identification of actionable knowledge from free text repositories. We report on the development of the BeFree system, aimed at identifying relationships between biomedical entities with a special focus on genes and their associated diseases. By exploiting morpho-syntactic information from the text, BeFree performs competitively not only in the identification of gene-disease relationships, but also for drug-disease and drug-target associations. The application of BeFree to real-case scenarios shows its potential for extracting relevant information for translational research. We show the value of the gene-disease associations extracted by BeFree through a number of analyses and through integration with other data sources. For instance, BeFree is able to identify genes associated with one of the most prevalent diseases, depression, which are not present in public databases. Moreover, large-scale extraction and analysis of gene-disease associations provided interesting insights into the kind of information that can be found in the literature, and raised challenges regarding data prioritization and curation. We found that only a small proportion of the gene-disease associations discovered by text mining is collected in expert-curated databases. Thus, there is a pressing need to find alternative strategies to manual curation to review, prioritize and curate text-mining data and incorporate it into domain-specific databases. We present our strategy for data prioritization and discuss its implications for supporting biomedical research and applications.


B5-09 Predicting aberrant recombination events in T-cell precursors using genetic and epigenetic data Marie Bonnet, Daniel Sobral, Joana Silva, Jocelyne Demengeot Instituto Gulbenkian de Ciência, Oeiras, PT

The generation of the diversity that characterizes the adaptive immune system depends on the intricate mechanism of V-D-J recombination, where the genome is physically altered in a highly regulated way to generate a functional gene, unique to each cell. Erroneous, non-productive recombination events are almost always removed (by cell death) or silenced (by allelic inactivation). Nonetheless, the extreme effects of recombination make it a potential source of malignancy, and there are known cases of cancers resulting from aberrant recombination events. It is therefore important to be able to identify factors that facilitate such aberrant events. Recent studies showed that recombination depends on both genetic and epigenetic factors. Using the Tcr-beta locus, we are trying to build a model that would enable us to predict the recombination potential of any genomic region, based on its genetic (sequence) and epigenetic data. For this we are taking advantage of a multitude of epigenetic datasets that have recently been released for T-cell precursors. To assess the validity of our models, we perform targeted sequencing of the Tcr-beta locus to check for aberrant recombination events in vivo. We have already been able to detect rare but consistent aberrant recombination events in the Tcr-beta locus. Nonetheless, we have so far been unable to find satisfactory models to fully explain our observations.

B5-10 COPABI: a Computational Platform for Automation on the Genome-Scale Metabolic Models Reconstruction Raymari Reyes1, Maria Siurana2, Daniel Gamermann3, Arnau Montagud4, Julián Triana1, Ramón Jaime1, Victor M. Nina2, David Fuente2, Yarlenis Pacheco1, Javier F. Urchueguia2, Pedro Fernández de Córdoba2 1Universidad Pinar del Río “Hermanos Saíz Montes de Oca”, Pinar del Río, CU, 2Universitat Politècnica de València, Valencia, ES, 3Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre, BR, 4U900 - Systems Biology of Cancer, Institut Curie, Paris, FR

Currently, the reconstruction of genome-scale metabolic models is a non-automated, interactive process based on decision making. This lengthy process usually requires a full year of one person’s work in order to satisfactorily collect, analyze and validate the list of all metabolic reactions present in a specific organism. To write this list, one has to go manually through a huge amount of genomic, metabolomic and physiological information. At present, there is no optimal algorithm that automatically goes through all this information and generates the models while taking into account the probabilistic criteria of unicity and completeness that a biologist would consider. This work presents the automation of a methodology for the reconstruction of genome-scale metabolic models for any organism. The methodology is the automated version of the steps implemented manually for the reconstruction of the genome-scale metabolic model of a photosynthetic organism, Synechocystis sp. PCC6803. The reconstruction steps are implemented in a computational platform (COPABI) that generates the models using the probabilistic algorithms that have been developed. To validate the robustness of the developed algorithm, the metabolic models of several organisms generated by the platform have been studied together with published models that have been manually curated. Network properties of the different models, such as connectivity and average shortest path length, have been compared and analyzed.

Key words: systems biology, genome-scale metabolic models, connectivity, metabolic networks.

Acknowledgements: The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement number 308518 (CyanoFactory), from the Spanish Ministerio de Educación Cultura y Deporte grant FPU12/05873 through the program FPU and from the Universitat Politècnica de València grant Contratos Predoctorales FPI 2013.


B5-11 CyanoFactoryKB – An open-source web-based software program for constructing model organism databases for Synechocystis sp. PCC 6803 Gabriel Kind1, Arnau Montagud2, Maria Siurana3, Victor M. Nina3, Eric Zuchantke1, David Fuente3, J. Alberto Conejero3, Julian Triana4, Pedro Fernández de Córdoba3, Javier F. Urchueguia3, Röbbe Wünschiers1 1University of Applied Sciences Mittweida/Germany, Mittweida, DE, 2U900 - Systems Biology of Cancer, Institut Curie, Paris, FR, 3Universitat Politècnica de València, Valencia, ES, 4Universidad Pinar del Río “Hermanos Saíz Montes de Oca”, Pinar del Río, CU

Nowadays, there are many efforts to design comprehensive systems that provide the information needed for constructing model organism databases. One of them is WholeCellKB, which provides an extensive and customizable data model that describes the structure and function of each gene, protein, reaction and pathway [1]. The philosophy of this kind of system is to have a robust database of the organism with features such as data integration, cross-linking, modeling and, finally, dissemination [2]. CyanoFactory is a European research project whose main interest is to develop strategies to enhance the production of renewable energy and biofuels. Within this project, the nodes of Mittweida and Valencia have several tasks, one of which is to use techniques from systems biology and synthetic biology to improve hydrogen production in the organism Synechocystis sp. PCC6803. To automate the use of these techniques we are developing different tools, among them CyanoView, CyanoMaps and CyanoDesign. All these are integrated into CyanoFactoryKB, which is capable of providing complete information on the organism, making various calculations including metabolic fluxes and presenting the results to the end user. As part of this project, we have participated in the development of CyanoDesign, a web-based tool that mainly allows the study of the flux distribution over the metabolic network and the generation and evaluation of in silico mutants. Currently, this tool is under development and has a basic but functional prototype that allows integrating information and analysis from the whole KB. Furthermore, one of our efforts is to develop a usable tool; to this end we are studying many ways to create a comprehensible and friendly interface.

References:

1. Karr JR*, Sanghvi JC*, Macklin DN, Gutschow MV, Jacobs JM, Bolival B, Assad-Garcia N, Glass JI, Covert MW. A Whole-Cell Computational Model Predicts Phenotype from Genotype. Cell 150, 389-401 (2012)

2. Karr JR, Sanghvi JC, Macklin DN, Arora A, Covert MW. WholeCellKB: Pathway/Genome Databases for Comprehensive Whole-Cell Models. Nucleic Acids Res 41, D787-D792 (2013)

Acknowledgments: The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement number 308518 (CyanoFactory), from the Spanish MECD grant FPU12/5873 through the program FPU and from the UPV grant Contratos Predoctorales FPI 2013.

B5-12 High-throughput information integrated with in-silico predictions to identify key participants in host-pathogen interactions Daniel Poglayen1, Ana Garcia2, Jascha Casadio1, Oriol Fornés1, Javier Garcia-Garcia1, Guy Zinman3, Manuel Alejandro Marín-López1, Ziv Bar-Joseph3, Heribert Hirt2, Judith Klein-Seetharaman4, Baldo Oliva1 1Structural Bioinformatics Laboratory, Universitat Pompeu Fabra, Barcelona, ES, 2Unité de Recherche en Genomique Végétale, URGV, Evry, FR, 3System Biology Group – Carnegie Mellon University, Pittsburgh, US, 4Department of Computational Biology – School of Medicine, University of Pittsburgh, Pittsburgh, US

We implemented a multidisciplinary approach that integrates the analysis of data from high-throughput technologies with information from in-silico PPI prediction methods and is capable of identifying key genes/proteins during Salmonella infection of its host.

First, microarray data are analysed and genes with similar behaviour are clustered together using the STEM algorithm and software. Based on what is known about the infection mechanisms of these bacteria, the next step is the analysis of the putative pathways between proteins in the plasma membrane and the set of known and clustered Transcription Factors (TFs) of the host. Because some of the Salmonella effectors are known, our analysis highlights the pathways that involve predicted interactions of the clustered TFs with both known effectors and hypothetical new ones. To predict these putative interactions between proteins based on homologs found in PPI databases, we used the BIPS server. In addition, we identify TFs that, in principle, can regulate the behaviour of all the genes in the same cluster, called putative Main Regulators (MRs), and we also address the “remote homology hypothesis”, in which a bacterial protein has the potential to act directly as a host TF. We applied GUILD, a message-passing algorithm, in both directions: from bacterial effectors to host proteins and from host TFs, with particular attention to putative MRs, to bacterial proteins. Finally, we analysed results retrieved with the SDREM software in order to identify response pathways to the bacterial infection.

We applied the described method to short time series data derived from Salmonella-infected Arabidopsis. The results predict a crucial role for the proteins WRKY18, WRKY40 and WRKY60. We validated these results by performing a qPCR experiment on a triple mutant lacking these WRKYs. We could not test other pathways involved in the regulation, nor single wrky mutants, but we could confirm that the three proteins do play a role in the transcriptional regulation in response to Salmonella, at least at early stages of the infection (2h).

B5-13 Identifying bioinformatics sub-workflows using automated biomedical ontology annotations Beatriz García-Jiménez, Mark D. Wilkinson Center for Plant Biotechnology and Genomics (CBGP, UPM - INIA), Pozuelo de Alarcón (Madrid), ES

Scientific workflows are formal representations of the sequence of steps in a scientific methodology. Their use is growing due to the necessity of reproducible research. Workflows can be shared and reused, in whole or in part; however, workflow repositories, such as myExperiment, do not facilitate the discovery of sub-workflows relevant to a task.

Sub-workflows represent fragments of scientific knowledge that could be used as “modules” to assemble new workflows, or repair broken ones. We propose that sub-workflows can be identified by patterns of semantic similarities, revealed by clustering subgraphs of exhaustively-fragmented workflows whose services are ontologically annotated. Service annotations are derived from our previous work [http://arxiv.org/abs/1407.0165] and have a quality comparable to manually curated bioinformatics resources. Ontology-based annotations enable the discovery of related terms by their ontological connections, even when their lexical descriptions are highly disparate (e.g. BLAST and FASTA). In addition, we offer a rich cluster interpretation, which associates with each pattern: 1) the ontology annotations together with their frequencies and 2) the possible combinations of services satisfying each pattern.

Preliminary results using bioinformatics-related workflows from myExperiment confirm the viability of our approach to sub-workflow clustering. Using 13 assorted OBO Foundry ontologies (such as EDAM, MESH or OBI), we obtain between 150 and 2600 automatically-annotated subgraphs (depending on the ontology) of 2 or 3 nodes each. Calculating the semantic similarity among these annotations, we derive clusters of workflow fragments, ranging from 8 to 36 clusters, increasing with the available number of subgraphs. For example, a known and commonly-executed bioinformatics process consists of a sequence/structure similarity search (with BLAST, FASTA, Tcoffee, etc.), followed by an id/result retrieval of the best matches. This pattern is identified by our system, using semantic annotations from EDAM, with a high performance of 78% (in terms of Silhouette coefficient). We suggest that automatically-identified modules such as these, derived from legacy workflows designed by experts, could be applied to simplify the design of new workflows in emerging domains of biological investigation, such as NGS and metagenomics analysis, since these basic conceptual units of analysis are likely to be shared between legacy and new workflow designs.
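The silhouette coefficient used above to score a cluster can be computed from any pairwise distance. The following is a minimal Python sketch under stated assumptions: the distance (one minus the Jaccard similarity of annotation term sets) is a simple stand-in for the semantic similarity measure, and the fragment annotations are invented for illustration; neither is taken from this work.

```python
def jaccard_distance(a, b):
    """Distance between two sets of ontology terms (assumed stand-in measure)."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def silhouette(items, labels, dist):
    """Mean silhouette coefficient over all items (needs at least two clusters).

    s(i) = (b - a) / max(a, b), where a is the mean distance to the item's
    own cluster and b the lowest mean distance to any other cluster.
    Items in singleton clusters score 0 by convention.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)
            continue
        a = sum(dist(items[i], items[j]) for j in own) / len(own)
        b = min(
            sum(dist(items[i], items[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical annotated workflow fragments (term names are made up).
fragments = [
    {"similarity_search", "sequence"},
    {"similarity_search", "sequence", "protein"},
    {"retrieval", "identifier"},
    {"retrieval", "identifier", "database"},
]
labels = [0, 0, 1, 1]
score = silhouette(fragments, labels, jaccard_distance)
```

Values close to 1 indicate compact, well-separated clusters; values near 0 indicate overlapping ones.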

Page 43 Posters Integrative Biology

B5-14 Self Organizing Maps based approach to identify protein patterns related to learning in control and mouse models for Down syndrome Clara Higuera1, Katheleen J. Gardiner2, Krzysztof J. Cios3 1Departamento de Bioquímica y Biología Molecular I, Facultad de Ciencias Químicas and Departamento de Ingeniería del Software e Inteligencia Artificial, Facultad de Informática, Universidad Complutense de Madrid, Madrid, ES, 2Linda Crnic Institute for Down Syndrome; Department of Pediatrics; Colorado School of Public Health; Department of Biochemistry and Molecular Genetics; Human Medical Genetics and Genomics, and Neuroscience Programs, University of Colorado, School of Medicine, Mail Stop 8608, 12700 E 19th Avenue, Aurora, Colorado 80045, US, 3Department of Computer Science, Virginia Commonwealth University, Richmond, Virginia, US, and IITiS, Polish Academy of Sciences, Gliwice, PL

Down syndrome (DS) is the most common genetic cause of learning/memory deficits. It is due to an extra copy of the long arm of human chromosome 21 and the consequent increased expression of the genes it encodes. No pharmacotherapies for learning deficits in DS are available. Because of its high incidence, there is considerable interest in preclinical evaluation of potential drugs in mouse models. In this work, protein expression data from brains of control mice and the Ts65Dn mouse model of DS exposed to context fear conditioning (CFC) were analyzed. Protein expression levels were measured in mice with and without treatment with the drug memantine. Control mice successfully learn the CFC task, while Ts65Dn mice fail to learn unless they are treated with the drug. We designed an approach based on the unsupervised clustering method Self Organizing Maps (SOM) and the Wilcoxon rank-sum test to extract information from these data. There were two main goals: 1) to determine if, based on the expression levels of 77 proteins, SOM would automatically cluster mice into genetic and treatment-specific clusters, and 2) to identify subsets of proteins that best discriminate between clusters/classes and also to define genetic and treatment-specific candidate protein responses. This approach identified protein responses that discriminate successful learning in control mice from failed learning in the Ts65Dn mice, and that critically respond to drug treatment. These results suggest that this approach, applied to additional datasets, can help to identify protein abnormalities in DS mice, and those proteins that need to be altered by drug treatments to facilitate the rescue of learning deficits.

B5-15 Bioinformatics with mobile devices Sergio Díaz del Pino1, Óscar Torreño2, Tor Johan Mikael Karlsson3, Oswaldo Trelles1, Juan Falgueras4 1Department of Computer Architecture University of Malaga, Malaga, ES, 2Advanced Computing Technologies, RISC Software GmbH, Hagenberg, AT, 3Integromics S.L, Granada, ES, 4Department of Computer Sciences and Languages University of Malaga, Malaga, ES

Introduction

Mobile platforms are continuously growing in popularity and importance in every aspect of our everyday lives. Bioinformatics and biomedical applications should not fall behind this trend. These platforms offer ubiquitous access, and give their users results when they really need them. However, mobile application development has its own challenges (e.g. limited screen size and storage). We have developed a lightweight, platform-independent mobile application that allows bioinformaticians to browse Web Services repositories and to invoke the services.

Methods

We use MAPI (Ramirez, S. et al., 2011), which allows us to browse multiple service repositories, invoke their services, manage their input data and retrieve results. MAPI also takes care of the data format standardization that allows us to connect different services.

The backbone of the implementation is based on open web technologies, such as JavaScript, HTML5 and CSS3, and protocols like SOAP and REST.

The user interface has been implemented using table-views for browsing the repositories, and with dynamically adaptive views for the invocation of services, in accordance with their repository definition. These interfaces have been specially designed keeping in mind the way users interact with mobile devices (i.e. touch paradigm).

Results and discussion

To illustrate the usage of the application we present two exercises: the first runs a BLAST service, and the second is a more complex exercise running a full pipeline that performs a homology search and phylogenetic study (http://goo.gl/wX35YA), including several services such as BLAST (Altschul et al., 1990), CLUSTAL (Larkin et al., 2007) and custom software.

The exercises start with the user selecting the service, then the system dynamically generates the service parameters interface and adapts it to the screen size. Service invocation, monitoring and results retrieval complete the exercise.

The use of open web technologies gives us a platform-independent application, in contrast with native development languages, which usually target one specific platform (e.g. Android, iOS, etc.). The use of web technologies also allows access across a wide range of devices, including desktop web browsers, through a responsive design.

In our opinion, this application represents a step forward in ubiquitous access to bioinformatics services, making them easier for researchers to reach.

B5-16 A KINASE SCREEN REVEALS GENES WITH DIFFERENTIAL EFFECT ON FAT METABOLISM IN INSULIN SIGNALLING MUTANTS UPON PROHIBITIN DEPLETION Marta Artal Sanz Andalusian Centre for Developmental Biology (CABD), CSIC-Universidad Pablo de Olavide, Carretera de Utrera km 1, Seville, Spain, Seville, ES

Manipulations of mitochondrial activity affect the lifespan of many organisms, and mitochondrial function has been implicated in many age-related human disorders. The insulin/IGF-like signalling (IIS) pathway and mitochondrial function are known to influence lifespan across phyla. It was generally believed that mitochondria affect lifespan independently of the IIS pathway in C. elegans. The recently discovered role of the mitochondrial prohibitin proteins in ageing challenges this. Prohibitins are conserved mitochondrial proteins composed of two subunits, PHB-1 and PHB-2, which form a ring-like structure at the inner mitochondrial membrane. Interestingly, prohibitin depletion shows opposite effects on ageing: it shortens the lifespan of wild-type worms while it dramatically extends the lifespan of animals under reduced IIS conditions. Moreover, prohibitin depletion affects fat content in a genetic-background and age-specific manner. These findings indicate a novel mechanism regulating mitochondrial function with opposing effects on fat metabolism and ageing [1].

C. elegans stores fat in droplets in its intestinal and hypodermal cells. Owing to the worms' transparent bodies, these fat stores can be easily visualized by feeding worms with the vital dye Nile Red. To better understand the function of prohibitins in ageing regulation, we exploit the fat phenotype to identify genetic interactions and to elucidate the signalling pathways involved in the metabolic response to PHB depletion in wild-type and IIS mutants by performing large-scale RNAi screens. The identified genes will be clustered according to their phenotypes in the different genetic backgrounds and used to create genetic networks in order to study interactions.

Here, we present results obtained during a kinase pilot screen in which we came across candidates causing a differential increase or decrease in Nile Red content depending on the genetic background. These genes could be involved in the differential effect of prohibitins on fat metabolism and longevity.

References:

1. Artal-Sanz, M and Tavernarakis, N (2009). Prohibitin couples diapause signalling to mitochondrial metabolism during ageing in C. elegans. Nature 461, 793-797.

B5-17 Statika: managing bioinformatics tools and resources in the cloud Alexey Alekhin, Evdokim Kovach, Pablo Pareja-Tobes, Marina Manrique, Eduardo Pareja, Raquel Tobes, Eduardo Pareja-Tobes Oh no sequences! Research Group, Era7 bioinformatics, Granada, ES

Next Generation Sequencing has revolutionized the bioinformatics landscape, reshaping fields such as genomics and transcriptomics, by offering huge amounts of data about previously inaccessible domains in a cheap and scalable way. Thus, biological data analysis demands, more than ever, high performance computing architectures. Cloud Computing, a comparable breakthrough in the IT world, holds promise as the foundation on which a solution could be built (as already demonstrated by pioneering efforts such as Galaxy or CloudBioLinux). It provides a well-suited framework for high throughput data analysis: architectures can be deployed with as much computing capacity as needed, scaled horizontally, and scaled down in real time to match the computing needs, under the pay-as-you-go model.

However, fast and cost-effective data analysis in the cloud at such scale remains elusive. High throughput analysis, where a lot of resources are to be used and paid for, critically needs the ability to manage both tools and data in a robust, reproducible and automated way. Because bioinformatics analyses often rest on a complex and unstable chain of dependencies among tools and data, knowing beforehand that all the resources to be used are properly configured is invaluable.

Statika (http://ohnosequences.com/statika) aims to be a basic tool for the declaration and automated deployment of composable cloud infrastructures for the bioinformatics space. Using Statika, data, tools and infrastructure are treated on an equal basis with an expressive domain-specific language that allows the user to express complex dependency relationships. Statika will automatically check for possible version conflicts and choose a safe resource creation order.
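The two checks mentioned above, version conflicts and a safe creation order, can be illustrated with a small Python sketch. This is not Statika's domain-specific language or its implementation; the resource names, the version pins and both helper functions are hypothetical, and the ordering simply relies on the standard-library topological sort.

```python
from graphlib import TopologicalSorter


def creation_order(dependencies):
    """Return an order in which every resource comes after its dependencies.

    `dependencies` maps a resource name to the set of names it depends on.
    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(dependencies).static_order())


def version_conflicts(requirements):
    """Find resources that different dependents pin to different versions.

    `requirements` maps a dependent to {resource: required_version}.
    """
    seen = {}
    for pins in requirements.values():
        for resource, version in pins.items():
            seen.setdefault(resource, set()).add(version)
    return {res: vers for res, vers in seen.items() if len(vers) > 1}


# Hypothetical resources: a database needs a tool and a dataset;
# the tool in turn needs a runtime.
deps = {
    "database": {"tool", "dataset"},
    "tool": {"runtime"},
    "dataset": set(),
    "runtime": set(),
}
order = creation_order(deps)

# Hypothetical pins: two dependents ask for different runtime versions.
pins = {
    "database": {"runtime": "1.7"},
    "tool": {"runtime": "1.6"},
}
conflicts = version_conflicts(pins)
```

In this toy case `order` places `runtime` before `tool` and both before `database`, and `conflicts` reports the two incompatible `runtime` pins before anything is deployed.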

Statika has been applied in different scenarios: from a cloud-based system for scalable and composable parallel computations in the bioinformatics domain, as in the Nispero tool, to modular automated deployments of complex databases such as Bio4j. Bio4j (bio4j.com) is a graph database integrating all data from key resources in the bioinformatics data space, including UniProt, Gene Ontology, the NCBI Taxonomy and UniRef. We use Statika internally for the integration and automated deployment of all sorts of bioinformatics tools and data.

Statika is open source, available under the AGPLv3 license.

This project is funded in part by the ITN FP7 project INTERCROSSING (Grant 289974).

B5-18 A model of transcriptional regulation networks that explains co-expressions using a parsimony criterion Vicente Acuña, Andrés Aravena, Alejandro Maass Centro de Modelamiento Matemático, Universidad de Chile, Santiago, CL

A regulatory network is usually modelled as a graph where vertices represent genes and arcs connect regulator genes to their target genes. Putative regulatory networks are usually obtained by analyzing the genomic sequence of an organism to predict which genes code for transcription factors and which ones have binding sites in their promoter region. Unfortunately, this approach produces many false positive arcs, often generating huge putative networks, with a size many times larger than expected. For example, in our test case using E. coli data, we obtain a putative regulatory network having 25,604 arcs, more than 15 times the number of experimentally validated arcs.

We propose a model to predict more realistic subnetworks by including information derived from gene expression data. Since co-expressed genes should share a common regulator (either direct or via a regulation cascade), we can select from the putative network a parsimonious subset of confident arcs that satisfy this requirement. The weights of the arcs are derived from the statistical confidence of the binding site predictions. We prove that the regulatory networks satisfying this criterion are the solution of a non-trivial combinatorial problem. We propose a formulation which is solvable in a reasonable time for the relevant cases.

To validate our model we used E. coli genomic and transcriptomic data to build a genomic-scale putative regulatory network and to determine co-expressions between all genes by using standard tools. The sub-network satisfying our parsimony criterion contains only 19% of the original arcs while providing common regulators to each co-expressed pair of genes. We verified that the model kept 66% of the arcs that have been experimentally validated, showing a strong bias to select true regulations. The average number of regulations per gene and the role played by the global regulators were similar to those described in the literature.

We also show that the model can be applied to small sets of specific genes to unveil the mechanism that explains their transcriptional behavior. Moreover, the model allows the inclusion of experimental regulatory evidence, in the form of high confidence arcs, significantly improving the network prediction.
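The parsimony criterion can be made concrete on a toy network. The Python sketch below is only an illustration under stated assumptions: it searches all arc subsets exhaustively (feasible for a handful of arcs, and not the formulation proposed in this work), treats a lower arc cost as a more confident binding-site prediction, and counts as a common regulator any gene that reaches both members of a pair through one or more arcs. Gene names and costs are invented.

```python
from itertools import combinations


def regulators(gene, arcs):
    """All genes that reach `gene` through one or more arcs (regulation cascade)."""
    found, stack = set(), [gene]
    while stack:
        target = stack.pop()
        for src, dst in arcs:
            if dst == target and src not in found:
                found.add(src)
                stack.append(src)
    return found


def explains(arcs, pairs):
    """True if every co-expressed pair shares at least one regulator."""
    return all(regulators(a, arcs) & regulators(b, arcs) for a, b in pairs)


def parsimonious_subnetwork(weighted_arcs, pairs):
    """Cheapest arc subset explaining all pairs, by exhaustive search.

    `weighted_arcs` maps (regulator, target) to a cost (assumed convention:
    lower cost = more confident prediction). Exponential in the number of
    arcs, so only usable on toy inputs.
    """
    arcs = list(weighted_arcs)
    best, best_cost = None, float("inf")
    for size in range(len(arcs) + 1):
        for subset in combinations(arcs, size):
            cost = sum(weighted_arcs[arc] for arc in subset)
            if cost < best_cost and explains(subset, pairs):
                best, best_cost = set(subset), cost
    return best, best_cost


# Toy putative network: tf1 and tf2 both have predicted sites upstream of g1 and g2.
putative = {
    ("tf1", "g1"): 1.0,
    ("tf1", "g2"): 1.0,
    ("tf2", "g1"): 3.0,
    ("tf2", "g2"): 3.0,
    ("tf2", "g3"): 2.0,
}
subnetwork, cost = parsimonious_subnetwork(putative, [("g1", "g2")])
```

Here the co-expression of `g1` and `g2` is explained by the two `tf1` arcs alone, so the three `tf2` arcs are discarded from the putative network.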

B5-19 An automatic workflow for microRNASeq analysis Rocio Nuñez Torres1, Eduardo Andrés León1, Ana M. Rojas Mendoza2 1Instituto de Biomedicina de Sevilla (IBiS), Hospital Universitario Virgen del Rocio/CSIC/Universidad de Sevilla, 41013 Seville, Spain. Computational Biology and Bioinformatics., ES Current address: Unidad de Enfermedades Infecciosas y Microbiología Clínica. Hospital Universitario Virgen de Valme/ Instituto de Biomedicina de Sevilla (IBIS), Seville, ES, 2Instituto de Biomedicina de Sevilla (IBiS), Hospital Universitario Virgen del Rocio/CSIC/Universidad de Sevilla, 41013 Seville, Spain. Computational Biology and Bioinformatics., ES, Seville, ES

In the past few years, the study of microRNAs (miRNAs) has attracted attention due to their important role in post-transcriptional fine-tuning of gene expression. Altered expression of miRNAs has been associated with several pathological conditions such as cancer or infectious diseases.

The development of Next Generation Sequencing technologies has enabled novel approaches for expression studies of miRNAs using Small RNASeq technology. Due to crucial differences between standard RNASeq and Small RNASeq analyses, we have developed a miRNASeq analysis workflow, which automatically performs several analysis processes using state-of-the-art software. Briefly, our process includes: (a) quality analysis of the reads (by FASTQ); (b) adapter removal using Cutadapt [1] or Reaper [2], with a computational prediction of the adapter sequence by Minion [2] if the adapter information is not available; (c) alignment to the reference genome by Bowtie1/2 [3-4] (indexing is included within the pipeline); (d) read quantification by the desired feature (premiRNA, mature miRNA...) using Htseq-Count [5]; (e) quality analysis to determine the correlation among replicates using graphical approaches, such as PCA, MDS or hierarchical clustering; and (f) Differential Expression Analysis (DEA) using EdgeR [6] and/or NOISeq [7]. Our pipeline processes standard sequencing files (fastq format), running several analyses in parallel and producing a results file (tsv format) with DEA features for each experimental condition evaluated, plus an additional quality report file.

The pipeline presented here has been successfully applied to analyze miRNASeq data obtained from a time course experiment performed in the MCF7 cell line in hypoxic conditions [8] presenting analogous results. This workflow has been established by the Computational Biology and Bioinformatics group at IBIS to perform their miRNASeq analyses.

References:

1. Martin M. EMBnet.journal. 2011; 17(1):10-12.

2. Davis MP, et al. Methods. 2013; 63(1):41-9.

3. Langmead B, et al. Genome Biol. 2009;10(3):R25.

4. Langmead B et al. Nature Methods. 2012; 9, 357–359.

5. Anders S, et al. bioRxiv preprint. 2014.

6. Robinson MD, et al. Bioinformatics. 2010; 26, -1.

7. Tarazona S, et al. Genome research. 2011; 21(12), 4436.

8. Camps C, et al. Mol Cancer. 2014;13:28.

B5-20 regioneR: an R package for the management and statistical comparison of genomic regions Anna Díez-Villanueva, Bernat Gel, Marcus Buschbeck, Miguel A. Peinado, Eduard Serra, Roberto Malinverni Institut de Medicina Predictiva i Personalitzada del Càncer (IMMPC), Badalona, ES

Management and analysis of regional genomic information is increasingly important in biological studies, either as the main outcome of an analysis or as an additional layer in a dataset. Statistically assessing the spatial relations between region sets is a fundamental part of their analysis, but so far, the available options are lacking or limited in scope.

Here we present regioneR, an R package built on top of Bioconductor's genomic region functions with two main aims: (1) to offer a basic set of region-manipulating functions with a simple interface and (2) to create a statistical framework based on customizable permutation tests to assess the relations between genomic region sets.

The core part of the package is a permutation test specifically designed to evaluate the relations between sets of genomic regions. All functions are prepared to work with a genome and a mask, either custom or automatically loaded from BSgenome, and custom masks can be used to deal with complex analyses. The randomization and evaluation functions are fully customizable and users can define their own functions. For example, in addition to the included evaluation functions dealing with overlaps, distances and base-level values, it is possible to evaluate other relevant information such as GC content, methylation levels or position within the chromosomes. It is even possible to change the randomization process to take into account the structural complexity of the genome using alternative randomization strategies. In addition, the included plotting functionality creates publication-ready graphics representing the results of permutation tests.

Besides its easy-to-use design, regioneR is a customizable and powerful tool to manage and analyze sets of regions, and a useful addition to the NGS and genome wide analysis toolbox.
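The idea behind the permutation test can be sketched in a few lines of Python. This is not regioneR's R interface: the function names are hypothetical, regions live on a single chromosome without a mask, the randomization places each region uniformly at random, and the evaluation function simply counts overlaps. Both functions are swappable, which mirrors the customization described above.

```python
import random
import statistics


def count_overlaps(a_regions, b_regions):
    """Number of regions in A that overlap at least one region in B.

    Regions are half-open (start, end) tuples on a single chromosome.
    """
    return sum(
        any(a_start < b_end and b_start < a_end for b_start, b_end in b_regions)
        for a_start, a_end in a_regions
    )


def randomize(regions, genome_length, rng):
    """Place each region at a uniform random position, keeping its length."""
    shuffled = []
    for start, end in regions:
        length = end - start
        new_start = rng.randrange(genome_length - length + 1)
        shuffled.append((new_start, new_start + length))
    return shuffled


def permutation_test(a_regions, b_regions, genome_length, n_perm=1000, seed=0):
    """Empirical p-value and z-score for 'A overlaps B more than expected'."""
    rng = random.Random(seed)
    observed = count_overlaps(a_regions, b_regions)
    null = [
        count_overlaps(randomize(a_regions, genome_length, rng), b_regions)
        for _ in range(n_perm)
    ]
    # Add-one correction keeps the p-value strictly above zero.
    p_value = (sum(n >= observed for n in null) + 1) / (n_perm + 1)
    spread = statistics.pstdev(null)
    z_score = (observed - statistics.mean(null)) / spread if spread else float("nan")
    return observed, p_value, z_score


# Toy data: every A region sits inside the single B region.
b = [(1000, 1200)]
a = [(1010, 1020), (1050, 1060), (1100, 1110), (1150, 1160), (1180, 1190)]
observed, p_value, z_score = permutation_test(a, b, genome_length=100_000)
```

With the toy data all five A regions overlap B, while randomly placed regions almost never do, so the empirical p-value is small.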

Page 48 Posters Medical Informatics

C2-01 TaxaTox: System-level assessment of taxa-specific differences in drug toxicity Pablo Carbonell, Ferran Sanz

Research Program on Biomedical Informatics (GRIB), Department of Experimental and Health Sciences, Universitat Pompeu Fabra, IMIM (Hospital del Mar Medical Research Institute), Dr. Aiguader, 88 Barcelona, Spain, Barcelona, ES

Assessing toxicity is a key step in drug development to increase the quality of drug candidates, ensure lower attrition rates, and reduce the number of animals used in preclinical studies [1]. Moreover, beyond human health, there is a growing concern over potential unintended effects of drugs on the environment [2]. To address these issues, predictive in silico models of toxicity can contribute to the identification of adverse outcome pathways, providing understanding of mechanisms of action [3, 4]. One challenge associated with the building of such toxicity models, however, is to delimit their power to predict interspecies toxicity both in model organisms used for the assessment of human toxicity and in the diverse species considered in ecological risk management. Typical studies have shown agreement in toxic effects between animal species and humans of 50-60% [5], and even less than 50% with bacterial data [6].

Here, we investigated interspecies toxicity differences by comparing predicted chemical toxicity in microorganisms from a previously developed quantitative structure-activity relationship (QSAR) model [7] with data from in vivo toxicity assays measured in model organisms [8, 9]. Next, we integrated knowledge about toxicity mechanisms for the data set with omics information at various levels (molecular, network, cell, tissue, organ) in order to extract sets of rules defining a particular drug and/or dose as taxa-wise toxic or non-toxic. The resulting system-level rules were then used to develop predictive models for assessment of interspecies drug toxicity differences. Results showed that such models could in some cases provide better understanding of toxicity mechanisms, as in the identification of critical steps associated with metabolic pathway perturbation.

Acknowledgments

This work was supported by UPFellows program and the Marie Curie COFUND program.

References

1. Krewski et al. J. Toxicol. Environ. Health. B. Crit. Rev. 2010, 13:51–138.

2. Boxall. EMBO Rep. 2004, 5:1110–1116.

3. Gleeson et al. Curr. Pharm. Des. 2012, 18:1266–1291.

4. Vinken. Toxicology 2013, 312:158–165.

5. Hartung. Nature 2009, 460:208–212.

6. Devillers et al. Ecotox. Model. 2009, 2:85–115.

7. Planson et al. Biotechnol. Bioeng 2012, 109:846–850.

8. Kavlock et al. Chem. Res. Toxicol. 2012, 25:1287–1302.

9. Briggs et al. Int. J. Mol. Sci. 2012, 13:3820–3846.

C2-02 NFFinder: An online bioinformatics tool for searching similar transcriptomics experiments in the context of neurofibromatosis Javier Setoain, Monica Franch, Marta Martinez, Daniel Tabas-Madrid, Alberto Pascual-Montano

Centro Nacional de Biotecnologia-CSIC, Madrid, ES

Drug repositioning or repurposing is the idea of identifying and using known drugs because of their capability to target diseases other than those for which they were originally designed. There are several effective strategies for drug repositioning with focus on the drug, the target or the disease. Taking advantage of gene expression databases, our group has addressed the development of a new bioinformatics tool to look for potential drugs that can serve as candidates in the context of Neurofibromatosis (NF) drug discovery. We have designed NFFinder, a publicly available system which takes the transcriptomic profile of an experiment of interest as input, screens existing public data (GEO, ArrayExpress, Cmap and DrugMatrix) and identifies existing conditions (diseases, drugs, treated cell lines, among others) that produce similar or opposite phenotypic experiments, suggesting new hypothetical ways for drug repositioning by recovering connections between genes, drugs and diseases related to the same biological processes. This output will be combined with complementary functional information to complete the global picture, in order to understand the disease processes and to identify potential existing target drugs. Using expression data related to the disease, we can use NFFinder to identify other diseases with similar gene expression profiles; drugs treating these other diseases could also potentially be used for NF. We can also find drugs with opposite expression profiles, suggesting that these drugs and compounds might revert the NF patient's phenotype and are, therefore, repurposing candidates. Additionally, this tool might help us to identify experts in other fields whose studies share common expression programs, biological functions or disease pathways with NF. These experts could contribute to improving the knowledge of this pathology.
To the best of our knowledge, NFFinder is the first academic tool aimed at solving problems in the NF context, contributing to identify potential existing target drugs and to understand the global disease processes. Therefore we consider it of high scientific, social and economic interest. Grant acknowledgement: CTF award: “In silico accelerated identification of associated pathologies and drugs: A drug repurposing approach”. This work has been funded by the Children’s Tumor Foundation, the PRB2-ISCIII platform supported by grant PT13/0001 and the Government of Madrid (P2010/BMD-2305).

C2-03 Mutational load in signaling pathways of human populations and its functional consequences Rosa D. Hernansaiz1, Patricia Sebastián-León1, Joaquín Dopazo2

1Principe Felipe Research Center, Valencia, ES, 2Principe Felipe Research Center; BIER CIBER de Enfermedades Raras (CIBERER); Functional Genomics Node, (INB) at CIPF, Valencia, ES

Signaling pathways constitute a formal representation of the existing knowledge on the consequences that the combined effect of gene activity has on cell functionality in response to different stimuli. A non-negligible number of genes in these pathways are affected by the extensive mutational load recently uncovered by large-scale genome sequencing projects. Nevertheless, to what extent such variation affects signaling pathways remains unknown. Whole exome sequencing (WES) data of 1,092 individuals belonging to 14 populations from The 1000 Genomes Project have been used to derive a catalog of deleterious variants in genes involved in human signaling pathways. Probabilistic models of signal transmission, along with gene expression data on 66 tissues, were used to analyze the effect that the deleterious variants found in the normal population have on the functionality of the different pathways studied. We have produced a comprehensive catalog of the effects that naturally occurring deleterious mutations cause in different pathways, measured in different populations and in 66 different tissues. The proportion of stimulus-response signaling circuits active in all the tissues is around 5% of the total number. Very frequently, genes carrying deleterious mutations ultimately have no effect on signal transmission in the pathways in which they are located.

C2-04 Neurofibromatosis gene expression meta-analysis Mònica Franch, Marta Martínez, Alberto Pascual-Montano

Centro Nacional de Biotecnologia-CSIC, Madrid, ES

Neurofibromatosis is an autosomal dominant disease caused in humans by deficiencies in one of the neurofibromin genes, NF1 or NF2. Patients may develop different anomalies in skin, eyes, skeleton, and cardiovascular, endocrine and nervous systems. In the peripheral nervous system, disorders typically manifest as benign neurofibromas that eventually may degenerate to malignant peripheral nerve sheath tumors (MPNST).

Individual gene expression studies based on microarrays were carried out in the past decade to identify subsets of genes, known as gene signatures, differentially expressed among patient groups. Those studies yielded limited results because they focused on a specific narrow range of samples (Miller et al. 2009). To increase the knowledge on Neurofibromatosis molecular determinants, our team has undertaken a global meta-analysis project to identify and update gene signatures associated with the different phenotypic alterations characterizing the disease.

To accomplish this objective, we inspected the public databases GEO and ArrayExpress for Neurofibromatosis gene expression experiments. We selected more than 250 transcriptional samples including both microarray and RNA-seq analyses derived from human and mouse tissues. We collected the samples in a global matrix and subtyped them in three phenotypic groups. Two of these groups, NF1 and NF2, show different benign symptoms and involve tissues affected by a deficiency in NF1 and NF2 genes, respectively. The third group, MPNST, includes tissues showing malignant degeneration regardless of the locus affected. In order to dri

C2-05 Different diseases associate with mutations in different biological features Eduard Porta1, Ana Rojas2, Ildefonso Cases3

1Sanford-Burnham Institute, La Jolla, CA, USA, La Jolla, US, 2Institute of Biomedicine of Seville (IBIS-HUVR-CSIC), Sevilla, ES, 3Genomics and Bioinformatics Platform of Andalusia, Sevilla, ES

Do different diseases associate preferentially with mutations affecting different biological features? To answer this question, we have devised a three-level definition of biological feature: function, domain and motif. Instead of the commonly used method of comparing disease-associated mutations with neutral, frequent or simulated random mutations, we developed a disease vs. diseases enrichment procedure. We extracted non-synonymous mutations related to disease from public repositories, normalized these data using ontologies reflecting these three levels of resolution, and searched for significant associations between feature terms and disease terms using all disease-associated mutations as background. We identified significant associations at the three levels of resolution. While some were obvious, others revealed subtle differences between related diseases. This phenomenon can be observed at the three levels of resolution, and even at the sub-protein-domain one. At the level of protein domain, we also observed that similar diseases had similar mutation profiles. Finally, at the level of motifs, we discovered 6 motifs negatively correlated with cancer. Three of them pointed to the importance of endoplasmic reticulum stress in this disease, and the other three to the role of unstructured regions in cancer.
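A disease vs. diseases enrichment of this kind can be sketched as a one-sided hypergeometric test in which the background is the full set of disease-associated mutations. The abstract does not name the statistical test, so the choice of test, the function names and the counts below are all assumptions made for illustration; multiple-testing correction is left out.

```python
from math import comb


def hypergeom_sf(k, population, successes, draws):
    """P(X >= k) when drawing `draws` items without replacement from a
    population of size `population` containing `successes` marked items."""
    upper = min(successes, draws)
    total = comb(population, draws)
    return sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    ) / total


def feature_enrichment(mutations, disease, feature):
    """Enrichment p-value of `feature` among the mutations of `disease`,
    using all disease-associated mutations as the background.

    `mutations` is a list of (disease, feature) pairs, one per mutation.
    """
    population = len(mutations)
    with_feature = sum(1 for _, f in mutations if f == feature)
    in_disease = sum(1 for d, _ in mutations if d == disease)
    both = sum(1 for d, f in mutations if d == disease and f == feature)
    return hypergeom_sf(both, population, with_feature, in_disease)


# Made-up counts: 4 of the 5 kinase-domain mutations fall in disease "D1".
toy = (
    [("D1", "kinase")] * 4 + [("D1", "other")] * 1
    + [("D2", "kinase")] * 1 + [("D2", "other")] * 14
)
p = feature_enrichment(toy, "D1", "kinase")
```

Because the background contains only disease mutations, a small p-value says that the feature is over-represented in one disease relative to the others, not relative to neutral variation.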

C2-06 DisGeNET: a discovery platform for the exploration of human diseases and their genes Janet Piñero, Núria Queralt-Rosinach, Àlex Bravo, Jordi Deu-Pons, Ferran Sanz, Laura I. Furlong

Hospital del Mar Medical Research Institute (IMIM), Pompeu Fabra University (UPF), Barcelona, ES

Researchers of the genetic determinants of human disease currently face two main hurdles: the large volume of information that connects genomic elements to disease phenotypes, and the fragmentation of this information across resources that employ different vocabularies and standards. Integrative platforms are therefore essential to gather and homogeneously annotate clinically relevant information on the genetic causes of diseases. In keeping with this spirit, we have developed DisGeNET (www.disgenet.org), a discovery platform that integrates human gene-disease relationships from several public sources, as well as from text-mining the biomedical literature. DisGeNET is one of the largest repositories of gene-disease relationships currently available to researchers, containing more than 300,000 associations between 13,172 diseases and 16,666 genes. Besides compiling information from several expert-curated data sources, DisGeNET contains a unique repository of gene-disease associations obtained by text mining biomedical publications using the BeFree system (http://ibi.imim.es/befree/), which exploits syntactic and semantic information to find relations between biomedical entities. DisGeNET allows prioritization of gene-disease associations based on data provenance by using the DisGeNET score (http://www.disgenet.org/web/DisGeNET/v2.1/dbinfo#score). The user can explore the information by using standard disease and protein classifications (e.g. MeSH and Panther), and inspect specific sentences describing a gene-disease association. The information in DisGeNET can be accessed in several ways that include a user-friendly search and browse web interface, a Cytoscape plugin for network analysis and data visualization, and a SPARQL endpoint that enables browsing DisGeNET data as linked data in the Semantic Web. DisGeNET data is available for download, either as text files or as an SQLite database.
Lists of genes or diseases provided by the user can be annotated with DisGeNET data using the web interface or the plugin. In addition, by using the platform, customized queries in R, Perl, Python and bash scripts can be automatically generated and saved by the user, allowing them to reproduce their analysis or incorporate it in their own programs. This makes DisGeNET a tool of choice for a broad variety of users, from those with basic informatics skills, such as clinicians and bench biologists, to hard-core bioinformaticians.

C2-07

Integrated variant annotation and filtering using R/Bioconductor and the VariantFiltering package Dei M. Elurbe1, Montserrat Milà2, Robert Castelo3

1Center for Molecular and Biomolecular Informatics, Radboud University Nijmegen Medical Centre, Nijmegen, NL, 2Dept. of Biochemistry and Molecular Genetics, Hospital Clínic de Barcelona, Barcelona, ES, 3Dept. of Experimental and Health Sciences, Universitat Pompeu Fabra, Barcelona, ES

The steady decrease in DNA sequencing costs facilitates the adoption of whole-exome sequencing (WES) in clinical genetic testing settings. The identification of disease-causing rare non-synonymous coding variants from WES data is straightforward with currently available software and databases. However, the larger number of diseased individuals being genetically profiled using WES technology also increases the number of pathogenic variants that remain uncharacterized. This typically happens with variants that do not appear in curated databases and occur in non-coding regions, often having a reduced penetrance in the population (Cutting, 2014).

Identifying such pathogenic non-coding variants of reduced penetrance requires approaches that integrate multiple annotation sources and filtering strategies. One of the software platforms that can potentially offer the required flexibility and interoperability for this goal is the R/Bioconductor project and its DNA variant analysis infrastructure (Obenchain et al., 2014). On top of this infrastructure, we have built the VariantFiltering package (Elurbe et al., in preparation) with the aim of facilitating the annotation and filtering of both coding and non-coding genetic variants. The main features of this software are: 1. integration of multiple annotation sources tracing provenance; 2. programmatic filtering with multiple strategies for both coding and non-coding variants, such as inheritance model, protein damage potential, minimum allele frequency, gene and nucleotide conservation, (cryptic) splice site strength, etc., which can be extended by the user; 3. minimization of end-user scripting tasks with an interactive shiny web app.

References

Cutting GR. Annotating DNA variants is the next major goal for human genetics. Am J Hum Genet, 94:5-10, 2014.

Elurbe DM, Milà M and Castelo R. VariantFiltering: filter coding and non-coding genetic variants, in preparation. Software available at http://www.bioconductor.org/packages/release/bioc/html/VariantFiltering.html

Obenchain V, Lawrence M, Carey V et al. VariantAnnotation: a Bioconductor package for exploration and annotation of genetic variants. Bioinformatics, 2014.

C2-08 The problem of human samples heterogeneity: discovery of differential biomarkers in heterogeneous clinical cohorts Francisco José Campos-Laborie1, Beatriz Rosón1, Alberto Risueño2, Celia Fontanillo2, Matthew W Throtter2, Javier De Las Rivas1

1Centro de Investigación del Cáncer (CIC - IBMCC), Universidad de Salamanca, Salamanca, ES, 2Celgene Institute for Translational Research Europe (CITRE), Sevilla, Sevilla, ES

Many current bioinformatic approaches applied to large-scale ‘-omic’ data to detect differences between patient sample subgroups (classes), and thereby identify putative clinical biomarkers, focus on testing for significant mean differences in the multiple features quantified (e.g., SAM or LIMMA differential expression algorithms). Heterogeneous clinical populations, when studied via high-throughput genomic assay platforms (e.g. genome-wide expression analyses), challenge such methods because they capture multiple sources of undefined variance between individual samples in the same class (i.e. between individuals of the same clinical category). This variance may blur small but important differences related to the contrast of interest, and hide or impede the discovery of critical features that mark phenotypic differences between patients in the clinical classes compared.

This work aims to develop and apply new methods to overcome the problem described above, based on the following hypothesis: given a subset of relatively small but biologically relevant differences that reflect phenotypic subgroups of clinical interest (e.g., clinically observed differences relating to positive or negative prognosis), there should exist an associated subset of altered molecular features, e.g. genes or metagenes, that mark such phenotypes and are quantifiable via an appropriate genome-wide assay. To test this premise, we have developed a recursive heuristic algorithm to identify genes that mark differences between closely related biological states or pathological subtypes. The method uses combinatorial sampling without replacement to select multiple sample subsets from each category defined a priori in the cohorts studied. This sampling enables exploration of differential signatures while considering (i) variation between individuals that does not relate to phenotypic subgroups of interest and (ii) possible errors in the class labels assigned to each individual.
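The subset-sampling idea can be illustrated with a minimal sketch. It is not the authors' recursive algorithm: the function names, the use of a simple mean difference for one gene, and the sign-consistency summary are all assumptions made for illustration only.

```python
import random
from statistics import mean

def subsample_mean_differences(class_a, class_b, subset_size, n_subsets, seed=0):
    """Draw subsets without replacement from each class and record,
    for one gene, the difference of subset means (A minus B)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_subsets):
        sub_a = rng.sample(class_a, subset_size)  # no sample repeated within a subset
        sub_b = rng.sample(class_b, subset_size)
        diffs.append(mean(sub_a) - mean(sub_b))
    return diffs

def fraction_consistent(diffs):
    """Fraction of subsets whose difference agrees in sign with the majority.
    Values near 1.0 suggest a signal that survives sample heterogeneity."""
    positive = sum(d > 0 for d in diffs)
    return max(positive, len(diffs) - positive) / len(diffs)
```

A gene whose difference keeps the same sign across many subsets is less likely to be driven by a few atypical or mislabelled individuals than one whose sign flips from subset to subset.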

We compare our approach with recently published methods that attempt to tackle the same problem, including DIDS (Detection of Imbalanced Differential Signal) [1], MOST (Maximum-Ordered Subset T-statistics applied to outlier analysis) [2] and COPA (Cancer Outlier Profile Analysis)


C2-09 Systemic approach to polypharmacology based on drug-domain association networks Aurelio A. Moya-García1, Juan A. G. Ranea2

1Dept. of Structural and Molecular Biology, University College London, Gower Street, WC1E 6BT, London, UK, 2Department of Molecular Biology and Biochemistry-CIBER de Enfermedades Raras, University of Malaga, Málaga, ES

In recent years drug discovery has been driven by the quest for “magic bullets”, drugs that act selectively on a single target. This assumption has been concurrent with a decrease in the translation of drug candidates into effective therapies. The actual situation, illustrated by the concept of polypharmacology, is that there are many drugs for each target and a single drug can affect multiple targets. Thus, on the one hand, networks and the modules within them can be considered as drug targets and, on the other hand, “magic shotguns” (i.e. drugs able to bind multiple targets with low specificity) are a new paradigm in drug discovery.

In addition to targeting multiple protein targets in a network it may also be important to target specific domains within them. Most eukaryotic proteins comprise multiple domains, and domains play a fundamental role in drug polypharmacology. Promiscuous drugs are often associated with particular protein domains. Because domains are the fundamental blocks of protein structure and are combined to form different proteins, they are likely to be the druggable entity in a drug target.

In this communication we show our advances in unravelling the interactions between drugs and protein domains. We explore the role of protein domains as drug targets in signalling networks, which can lead to new structure-based target identification and drug discovery strategies.

References:

- Moya-García AA, Ranea JA. Insights into polypharmacology from drug-domain associations. Bioinformatics. 29(16):1934-1937. 2013.

- Sanchez-Jimenez F, Reyes-Palomares A, Moya-Garcia AA, Ranea JA, Medina MA. Biocomputational resources useful for drug discovery against compartmentalized targets. Curr Pharm Des. 20(2):293-300. 2014.

- Reyes-Palomares A, Rodríguez-López R, Ranea JA, Sánchez Jiménez F, Medina MA. Global analysis of the human pathophenotypic similarity gene network merges disease module components. PLoS One. 8(2):e56653. 2013.

C2-10 Accelerating GWAS Epistatic Interaction Analysis Methods Alex Upton, Priscill Orue, Oswaldo Trelles

Department of Computer Architecture, University of Malaga, Málaga, ES

This work introduces a user-friendly application allowing two categories of users, clinicians and bioinformaticians, to analyse GWAS genotype/phenotype correlations using computationally accelerated epistatic interaction models. The objective is to optimise the application of state-of-the-art methods to examine pairwise epistatic effects on the causes of complex disease using high performance computing, in order to detect biologically relevant pathways and potential genetic biomarkers.


It is widely agreed that complex diseases are typically caused by joint effects of multiple genetic variations, rather than a single genetic variation (Anunciação et al., 2013). Multi-SNP interactions, also known as epistatic interactions, have the potential to provide information about causes of complex diseases, and build on GWAS studies that look at associations between single SNPs and phenotypes. Genes can be mapped to the SNPs that are identified for downstream analysis, aiding in the identification of functional enrichment for disease using tools such as ClueGO (Bindea et al., 2009) and GOEast (Zheng et al., 2008).

Due to the large number of interactions that have to be calculated, running these epistatic interaction models on standard hardware is not practical. To illustrate: a relatively small GWAS dataset, with 100,000 SNPs that pass quality control, has approximately 5 x 10^9 pairwise interactions. Using the FaST-LMM epistatic interaction model (Lippert et al., 2011), it would take approximately two years to calculate these pairwise interactions on a desktop computer. As such, this does not present a viable tool for researchers.
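The pair count quoted above follows from the binomial coefficient C(n, 2) = n(n - 1)/2, which for n = 100,000 SNPs gives roughly 5 x 10^9 pairs. The short sketch below reproduces that arithmetic. The throughput figure used in the usage note is a hypothetical value chosen only to show how a run time on the order of two years could arise; it is not a measured FaST-LMM benchmark.

```python
import math

SECONDS_PER_DAY = 86_400

def pairwise_tests(n_snps):
    """Number of unordered SNP pairs, C(n, 2) = n * (n - 1) / 2."""
    return math.comb(n_snps, 2)

def estimated_days(n_snps, tests_per_second):
    """Serial run time in days at an assumed (hypothetical) throughput."""
    return pairwise_tests(n_snps) / tests_per_second / SECONDS_PER_DAY

print(pairwise_tests(100_000))  # 4999950000
```

At an assumed 80 pairwise tests per second, `estimated_days(100_000, 80)` comes out at a little over 700 days. Because the count grows quadratically, doubling the number of SNPs roughly quadruples the work.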

High performance computing supports deployment of epistatic models across multiple cores, thereby accelerating them. As the majority of these models are deployed using command line interfaces, this work proposes pipelining the applications using a Java-based GUI to make them more accessible. Java 1.8 SE provides the fork/join framework, enabling the implementation of parallel computing in applications (Oracle, 2014). This work therefore builds on existing epistatic models by both accelerating them and making them more accessible. Initially, two different types of model are used: the linear regression model BOOST (Wan et al., 2010), and the linear mixed model FaST-LMM (Lippert et al., 2011). Here, we present a scaled-down prototype version of our application that can be deployed on a typical desktop computer to

C4-01 Analyzing SNPs, CNVs, inversions and mosaicisms association studies using Affymetrix CytoScan technologies Carles Hernandez-Ferrer1, Ines Quintela3, Katharina Danielski2, Angel Carracedo3, Luis Pérez-Jurado4, Juan R Gonzalez1

1Centre de Recerca en Epidemiologia Ambiental, Parc de Recerca Biomèdica de Barcelona, Doctor Aiguader, 88 | 08003 Barcelona, Barcelona, ES, 2Affymetrix, UK Ltd, High Wycombe, UK, 3Centro Nacional de Genotipado-ISCIII. Universidade de Santiago de Compostela, Santiago de Compostela, ES, 4Universitat Pompeu Fabra, Departament de Ciències Experimentals i de la Salut, Unitat de Recerca en Genètica, Barcelona, ES

Genome-Wide Association Studies (GWAS) interrogate a large number of genetic variants (SNPs) with high-throughput technologies. To date, GWAS have led to many scientific discoveries in diseases such as cancer and asthma, among others. Nonetheless, SNPs have explained relatively little of the total heritability of complex diseases. To overcome this difficulty, some researchers are analysing other structural variants (SVs), such as copy number variants (CNVs), mosaicisms or inversions, in complex diseases. In the past five years, companies such as Affymetrix and Illumina have produced dense SNP arrays that make it possible to genotype many markers in a single assay. For SV studies, custom arrays and disease-specific arrays have been developed. An example is the Affymetrix CytoScan family, which includes a high-density array (CytoScan HD) and a lighter version (CytoScan 750K). This family of arrays was designed to provide a broad overview of the whole genome, since the arrays include markers for constitutional and cancer genes as well as OMIM and RefSeq genes. The most common software for analysing CytoScan data is the Chromosome Analysis Suite (ChAS). Despite its benefits, the use of ad hoc software has some limitations. For that reason, an R package called affy2sv has been created. The package retains the advantages and functionality provided by ChAS and adds new functionality that makes it possible to analyse CytoScan HD data with other existing tools (PLINK, PennCNV, MAD, GADA, ...) and to visualize the data. affy2sv will therefore facilitate the analysis of CytoScan data in SNP, CNV, mosaicism or inversion association studies by using, for instance, pipelines in the R environment. New features are illustrated by analysing two cohorts of 624 individuals from Toronto and Nijmegen for which CytoScan HD array data were obtained.


C4-02 Analysis of non-synonymous variability in Human population Antonio Rueda1, Javier P.Florido1, Francisco J. López-Domingo1, Eva Fernández1, Javier Santoyo-López2, Joaquin Dopazo3

1Genomics and Bioinformatics Platform of Andalusia (GBPA), Sevilla, ES, 2Edinburgh Genomics, Ashworth Laboratories, The University of Edinburgh, Edinburgh, UK, 3Computational Genomics Department, Centro de Investigación Príncipe Felipe, Valencia, ES

The use of Next-Generation Sequencing (NGS) technologies is becoming common practice for variant discovery in the field of human genetic disease research. However, NGS-based genome sequencing studies, and even exome sequencing projects, generate a huge amount of variant data that needs filtering, classification, annotation and prioritization in order to determine a small subset of risk variants. For this purpose, variant frequencies provided by publicly available tools and databases such as the Exome Sequencing Project (ESP), conservation methods (PhyloP, GERP++) and deleteriousness prediction tools (SIFT, PolyPhen) have been shown to be very useful. However, in many cases the subset of putative variants contains a great number of variants that are not related to the disease but are predicted by the different in silico tools to be disease-causing. In this sense, results of the Medical Genome Project (MGP, http://www.medicalgenomeproject.com) allowed us to identify regions within the coding part of the genome that, regardless of the disease under study, consistently accumulate non-synonymous variants that are not related to the disease but that prediction tools consistently classify as disease-causing.

We hypothesize that such highly variable regions in coding sequences are related to polymorphic regions with low functional impact. Based on this hypothesis, we have focused on the analysis of non-synonymous variants using data from the ESP, the 1000 Genomes Project and our control population from MGP. For each data set, the degree of heterozygosity is calculated throughout the coding regions in the genome, using windows of a specific size, obtaining a distribution of Heterozygosity Raw Scores (HRS). From this distribution, a Heterozygosity Normalized Score (HNS) is obtained for each coding position in RefSeq genes, based on the percentile of its corresponding window in the HRS distribution. After analysing the data, we have found a strong correlation between genomic regions that accumulate a great number of non-synonymous variants in the population and the type of functionality of such regions. Thus, this new mutation score, HNS, can be very useful for filtering variants not related to the disease from candidate variant lists in both clinical and functional studies.
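The window-then-percentile construction described above can be sketched as follows. The abstract does not give the exact formulas, so this example makes several assumptions: heterozygosity at a position is taken as the expected value 2p(1 - p) for an alternate-allele frequency p, the raw score of a window is the mean over its positions, windows are non-overlapping, and the normalized score is the percentile rank of the window in the distribution of raw scores. All function names are hypothetical.

```python
from bisect import bisect_right

def window_raw_scores(allele_freqs, window):
    """Mean expected heterozygosity, 2p(1 - p), for each non-overlapping window."""
    het = [2 * p * (1 - p) for p in allele_freqs]
    scores = []
    for start in range(0, len(het), window):
        chunk = het[start:start + window]  # the last window may be shorter
        scores.append(sum(chunk) / len(chunk))
    return scores

def normalized_scores(raw_scores):
    """Percentile (0-100] of each window's raw score within the whole distribution."""
    ordered = sorted(raw_scores)
    n = len(ordered)
    return [100.0 * bisect_right(ordered, score) / n for score in raw_scores]
```

Under these assumptions, every coding position inherits the normalized score of the window that contains it, and a candidate variant falling in a window with a very high percentile would be flagged as lying in a region that tolerates non-synonymous variation.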

C5-01 GWImp-COMPSs: An Integrated Framework for Genome-wide Imputation and Association Studies on High-Performance Computers and Cloud Environments Friman Sánchez1, Silvia Bonàs1, Marta Guindo1, Carlos Díaz2, Enric Tejedor2, Rosa Badia2, Josep Mercader1, David Torrents1

1Barcelona Supercomputing Center, Life Sciences Department, Computational Genomics Group, Barcelona, ES, 2Barcelona Supercomputing Center - Computer Sciences Department, Grid Computing group, Barcelona, ES

Genome-wide association studies (GWAS) have been a successful methodology for identifying hundreds of associations between common genetic variants and human complex traits and diseases. Additionally, genotype imputation, the process of inferring non-genotyped genetic variants based on a denser reference panel of haplotypes, has become a key approach for improving the power of GWAS, fine mapping of known association regions and allowing large GWA meta-analyses from data sets using different genotyping platforms. However, whole-genome imputation and association testing with increasingly larger reference panels represent a high computational burden and still face important limitations. First, they require the combination of several tools working in coordination as a workflow, each tool having different requirements of performance, parallelism, memory usage, configuration parameters, etc. Second, imputation and GWAS workflows are not easily deployable across the variety of current distributed computing infrastructures (e.g., high-performance computing clusters, grids, clouds, etc.) and usually involve complex setup work and code adaptation. We developed GWImp-COMPSs, a research tool to phase, impute genotypes and perform association testing that requires minimal configuration and delivers optimal and robust results. GWImp-COMPSs works on top of the COMPSs framework, which steers the parallelization of the application and makes it portable between different computing infrastructures, ranging from simple desktop computers to distributed computing infrastructures such as HPC clusters, grids and clouds, without software modifications. GWImp-COMPSs allows merging the results from different reference panels, such as UK10K and 1000 Genomes, and generates graphical and summary outputs that give non-expert users accessible information for the biological interpretation of the results. With GWImp-COMPSs we were able to perform whole-genome imputation in a total of 6000 cases and controls with the UK10K reference panel, plus association testing, in less than 15 hours and without any user intervention, using 23 nodes on the MareNostrum III supercomputer. GWImp-COMPSs also represents an ideal tool for large multi-center GWAS consortia, allowing whole-genome imputation and association testing in a standardized way across different institutions, without the need to share individual-level data.

C5-02 A pharmacological study of the TCGA Pan-Cancer analysis using a new resource that links molecular profiles with therapeutic options Elena Piñeiro-Yáñez, Hector Tejero, Javier Perales-Patón, Manuel Hidalgo, Alfonso Valencia, Fátima Al-Shahrour

Spanish National Cancer Research Centre, CNIO, Madrid, ES

The main goal of personalized medicine is to provide the right therapeutic treatment for a given patient based on the individual’s molecular profile. This is an important challenge in cancer, a disease characterized by great complexity and heterogeneity. To this end it is essential to identify the genomic background that defines each individual and shows the distinctive features that point to a specific treatment. Although different types of biomarkers exist for this purpose, the most commonly used are those at the genetic level, and some of these biomarkers are also the precise target of existing drugs. This drug-gene correlation is therefore essential to define a personalized treatment.

In this sense, many databases store these reported drug-gene associations. At the same time, drug sensitivity studies in genomically characterized cancer cell lines have been carried out, such as those accomplished by the Cancer Therapeutics Response Portal (CTRP) and the Genomics of Drug Sensitivity in Cancer (GDSC). Here we present a tool we have developed that integrates all this information, adding curated annotations about approval status and therapeutic area. Along with this, a score is provided that reflects the association evidence and the suitability of a possible treatment according to drug availability.

We have applied this resource to The Cancer Genome Atlas (TCGA) data and performed a therapeutic analysis at the individual level, matching the drug information with the multidimensional genomic data of the Pan-Cancer analysis. To that end we have integrated information about somatic mutations, copy number variations (CNV) and gene expression of each patient. The combination of both resources allows us to define and improve the therapeutic strategies as well as to discover new and potential pharmacological agents.


C5-03 Quality assessment and data analysis pipeline for multilocus genotyping using next-generation amplicon sequencing Alvaro Sebastian1, Michal Stuglik2, Jacek Radwan1

1Evolutionary Biology Group, Faculty of Biology, Adam Mickiewicz University (https://sites.google.com/site/evobiolab), Poznan, PL, 2Institute of Environmental Sciences, Jagiellonian University, Krakow, PL

Next-generation sequencing (NGS) technologies are revolutionizing the fields of evolutionary biology and personalized medicine as powerful tools for amplicon sequencing. Using combinations of primers and barcodes it is possible to sequence targeted gene regions with deep coverage for hundreds, even thousands, of individuals in a single experiment. This is the case for the major histocompatibility complex (MHC), a multilocus gene family where amplicon sequencing can be used for high-throughput genotyping [1]. The utility of these techniques is limited by the intrinsically high error rates of NGS methods [2] and by other error sources such as polymerase amplification errors or chimeras.

We designed a two-step analysis pipeline to extract reliable results from amplicon sequencing data: i) quality assessment of the sequencing data using the software ampliQC (Amplicon Quality Control tool), ii) demultiplexing, correction and classification of amplicons with jMHC [4] and ampliSAS (Amplicon Sequence ASsignment tool). ampliQC retrieves amplicon statistics based on high-quality primer-read alignments [3]: number and length frequency of amplicons, nucleotide error rates (indels and mismatches), primer and barcode frequencies, primer errors and a matrix of amplicon-sample assignments. jMHC [4] and ampliSAS classify reads by barcodes (samples) and amplicons, correct sequencing errors and chimeras, remove low-frequency reads and retrieve the real allele sequences for each sample and locus.
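Two of the steps named above, assigning reads to samples by barcode and removing low-frequency reads, can be sketched in a few lines. This is not the jMHC or ampliSAS implementation: the function names, exact-prefix barcode matching with no error tolerance, and the single frequency threshold are simplifying assumptions for illustration.

```python
from collections import Counter, defaultdict

def demultiplex(reads, barcodes):
    """Group reads by their leading barcode.

    barcodes maps barcode sequence -> sample name (assumed prefix-free).
    Returns {sample: Counter({sequence_without_barcode: read_count})}.
    Reads that match no barcode are discarded.
    """
    by_sample = defaultdict(Counter)
    for read in reads:
        for barcode, sample in barcodes.items():
            if read.startswith(barcode):
                by_sample[sample][read[len(barcode):]] += 1
                break
    return by_sample

def filter_low_frequency(counts, min_fraction):
    """Keep sequences that make up at least min_fraction of a sample's reads."""
    total = sum(counts.values())
    return {seq: n for seq, n in counts.items() if n / total >= min_fraction}
```

In real data the threshold has to be chosen with care, because rare sequencing errors and chimeras must be separated from true alleles that happen to amplify poorly.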

1 Lank, S. M. et al. Ultra-high resolution HLA genotyping and allele discovery by highly multiplexed cDNA amplicon pyrosequencing. BMC Genomics 13, 378 (2012).

2 Babik, W., Taberlet, P., Ejsmond, M. J. & Radwan, J. New generation sequencers as a tool for genotyping of highly polymorphic multilocus MHC system. Mol. Ecol. Resour. 9, 713–9 (2009).

3 Rizk, G. & Lavenier, D. GASSST: global alignment short sequence search tool. Bioinformatics 26, 2534–40 (2010).

4 Stuglik, M. T., Radwan, J. & Babik, W. jMHC: software assistant for multilocus genotyping of gene families using next-generation amplicon sequencing. Mol. Ecol. Resour. 11, 739–42 (2011).

C5-04 Detection of Large Copy Number Variation Algorithm using Read-Depth of Coverage in Fitted Panels of Genes Kristina Ibáñez1, Juan Carlos Silla-Castro1, Pablo Lapunzina2, Angela Del Pozo1

1Bioinformatics Section. Institute of Medical and Molecular Genetics (INGEMM), Hospital Universitario La Paz, Madrid, ES, 2Institute of Medical and Molecular Genetics (INGEMM), Hospital Universitario La Paz. Centro de Investigaciones Biomédicas en Red de Enfermedades Raras (CIBERER), Madrid, ES

Next-Generation Sequencing (NGS) technologies are currently used to scan the genome for SNPs, indels and structural variants, in studies that range from targeted genes in custom panels to whole exomes and genomes. Even though structural variations such as Copy Number Variations (CNVs), which include insertions, deletions and duplications, are routinely identified in the clinical domain by array CGH (Comparative Genomic Hybridization) or SNP arrays, there are no “best practice” guidelines in the NGS community for assessing either the validity of CNV detection in NGS data or the clinical utility of the methods.


Several computational tools for CNV detection have been developed in recent years. Frequently, they are exome-oriented methods or are specialized for a particular NGS platform. This motivated us to develop an approach suitable for identifying CNVs in different kinds of gene panels and sequencing platforms.

We present here a methodology for CNV detection designed for tailored panels of candidate genes, including whole-exome data. Using aligned DNA reads, our algorithm calls copy number losses and gains for each target region based on previously normalized read Depth of Coverage. Our approach includes GC-content adjustment to correct for the Illumina platform’s bias, local and global normalization of reads, and separate handling of the sex chromosomes. A permutation test across samples is used for significance testing. This methodology has been validated on three sample sets: (i) 11 breast cancer samples, confirming a deletion of 6 exons of BRCA1 (>4 kb); (ii) MODY diabetes samples, confirming a deletion of an allele in HNF1B (>3 kb) and a gain of 2 exons (600 bp) in HNF1B, using in this case two different custom panels with 28 and 16 samples respectively (two-fold validation); and (iii) 12 samples associated with a cardiopathy syndrome, confirming a deletion of an allele in GATA4 (>1 kb), TBX1 (>2 kb) and CRKL (1 kb). These samples were captured using in-house customized gene panels for breast cancer, MODY and cardiopathy syndromes, respectively.
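The core read-depth idea, comparing each sample's normalized coverage per target against the rest of the cohort, can be sketched as below. This is a deliberately reduced illustration and not the authors' algorithm: it omits the GC-content adjustment, the local normalization, the sex-chromosome handling and the permutation test, and the pseudocount, the cohort median reference and the log2 thresholds are all assumptions.

```python
import math
from statistics import median

def log2_ratios(depths, pseudocount=1.0):
    """depths: {sample: [read depth per target region]}.

    Each sample is scaled by its total depth (global normalization), then each
    target is compared with the cohort median for that target.
    Returns {sample: [log2 ratio per target]}.
    """
    normalized = {}
    for sample, values in depths.items():
        shifted = [d + pseudocount for d in values]  # avoids log2(0)
        total = sum(shifted)
        normalized[sample] = [d / total for d in shifted]
    n_targets = len(next(iter(depths.values())))
    reference = [median(normalized[s][t] for s in normalized)
                 for t in range(n_targets)]
    return {s: [math.log2(v[t] / reference[t]) for t in range(n_targets)]
            for s, v in normalized.items()}

def call_cnv(ratios, loss=-0.6, gain=0.4):
    """Label each target using hypothetical log2-ratio cut-offs."""
    return ["loss" if r <= loss else "gain" if r >= gain else "neutral"
            for r in ratios]
```

A heterozygous deletion is expected to roughly halve the depth of the affected targets (a log2 ratio near -1), which is why a cohort-based reference needs most samples to be copy-neutral at each target.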

Our results suggest accurate detection of CNVs larger than 600 bases with high specificity (around 90%); estimation of the sensitivity is in progress.

C5-05 Next Generation Sequencing in Clinical Practice: Challenges and Promises in a Cohort of Endocrine Patients Angela del Pozo1, Juan Carlos Silla-Castro1, Kristina Ibañez1, Angel Campos-Barro2, Jose Carlos Moreno2, Karen Heath2, Julian Nevado2, Victoria Eugenia F. Montaño2, Elena Vallespín2, Pablo Lapunzina3

1Bioinformatics Section. Institute of Medical and Molecular Genetics (INGEMM), Hospital Universitario La Paz, Madrid, ES, 2Institute of Medical and Molecular Genetics (INGEMM), Hospital Universitario La Paz, Madrid, ES, 3Institute of Medical and Molecular Genetics (INGEMM), Hospital Universitario La Paz. Centro de Investigaciones Biomédicas en Red de Enfermedades Raras (CIBERER), Madrid, ES

Motivation: The rapid evolution of sequencing technologies over the last 5 years has revolutionized the field of genetics, especially clinical practice. The cost reduction, the simplification of experimental protocols and the possibility of inspecting a set of genes simultaneously promise a high diagnostic performance. However, the implementation of NGS-based tests in clinical laboratories is a challenge, because there are no diagnostic standards that guarantee the four main criteria proposed by the ACCE model for evaluating a genetic test: analytical validity, clinical validity, clinical utility, and associated ethical, legal and social implications. Beyond the clinical aspects, some additional methodological issues emerge that should be dealt with within the bioinformatics community. The limits and validity of the experimental design and of the bioinformatics tools should be established in order to estimate the sensitivity and specificity of the technique. Aspects such as the targeted capture kit, region of interest, quality metrics, variant calling/filtering strategies or selective functional annotation are crucial to determine the utility and interpretability of the data.

Approach: The Bioinformatics Section of the Institute of Medical and Molecular Genetics (INGEMM), together with the Genomic and Clinical services, has designed a study to outline a reference framework for deciding on the suitability of the different choices provided by commercial kits in a clinical context. Patients with known mutations within the ENDOSCREEN project, which focuses on endocrine genetic diseases, were selected. The analysis was structured in two sets of patients: group A, comprising 28 patients with a MODY diabetes phenotype, and group B, comprising 28 patients with several thyroid disorders. The targeted DNA regions were captured with two commercial selective capture kits (Roche NimbleGen SeqCap EZ Library and Illumina Nextera®) and the samples were sequenced in three ways: A) custom gene panel, B) commercial disease panel and C) whole-exome sequencing (WES). The Illumina MiSeq platform was used for all the experiments. In total, 48 samples were generated from each of the groups. This pool of sequenced DNA enables the analysis and validation of the mutational profile of the patients in at least two different captures and panel designs. In addition, the WES samples would provide secondary findings that better explain some complex phenotypes.

Results: The performance results of the capture kits are presented, as well as the scoring according to abundance, homogeneity and reproducibility of the sequenced DNA. Known mutations are validated at a rate of 90-95%, depending on the disease and the capture kit, and the proportion of patients with additional events reaches 20% in group A and rises to 50% in group B.

C5-06 LimTox: Text Mining Application for Toxicology Andres Cañada1, Florian Leitner2, Miguel Vazquez3, Alfonso Valencia3

1National Bioinformatic Institute Unit, Spanish National Cancer Research Centre (CNIO), c. Melchor Fernández Almagro 3, 28029 Madrid, ES, 2Universidad Politécnica de Madrid, Madrid, ES, 3Structural Computational Biology Group, Spanish National Cancer Research Centre (CNIO), c. Melchor Fernández Almagro 3, 28029 Madrid, ES

The LimTox system is the first text mining approach that extracts associations between compounds and a particular toxicological end point at various levels of granularity and evidence types, all inspired by the content of toxicology reports. It integrates direct ranking of associations between compounds and hepatotoxicity through a combination of heterogeneous complementary strategies, from term co-mention, rules and patterns to machine-learning-based text classification. It also provides indirect associations to hepatotoxicity through the extraction of relations reflecting the effect of compounds at the level of metabolism and on liver enzymes.

To determine whether the detected compound/drug mentions are directly associated with hepatotoxicity we used several strategies. One approach relied on building a lexical resource of terms relevant to this particular toxic end point and then retrieving co-mentions at the sentence level (term co-mention). Another strategy relied on rules that analyze the context in which compounds and adverse effects are mentioned (rule matching). We also explored textual pattern-based approaches that made use of n-gram statistics and part-of-speech tagging to define the actual extraction patterns (pattern matching). Moreover, the result of the sentence and document classification system can be used for more efficient retrieval of hepatotoxicity evidence for a given search term or chemical substance (interpretation and validation). In addition to the resulting toxicology search engine, we have used context-based scoring to prioritize compounds by their association with hepatotoxicity.
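Of the strategies listed above, sentence-level term co-mention is the simplest to illustrate. The sketch below is not LimTox code: the function name, the naive regular-expression sentence splitter and the case-insensitive substring matching are assumptions, and the example sentences in the usage note are invented.

```python
import re

def sentence_comentions(text, compounds, toxicity_terms):
    """Return (compound, term, sentence) triples for every compound and
    toxicity term that occur in the same sentence."""
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    hits = []
    for sentence in sentences:
        lowered = sentence.lower()
        for compound in compounds:
            if compound.lower() not in lowered:
                continue
            for term in toxicity_terms:
                if term.lower() in lowered:
                    hits.append((compound, term, sentence))
    return hits
```

For the toy text "Acetaminophen overdose can cause liver injury. Aspirin was well tolerated." with compounds `["acetaminophen", "aspirin"]` and terms `["liver injury"]`, only the first sentence yields a hit. Co-mention alone cannot tell an asserted effect from a negated one, which is one reason to complement it with the rule-, pattern- and classifier-based strategies.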

The LimTox system is available at: http://limtox.bioinfo.cnio.es

C5-07 Usability tests on bioinformatics mobile applications Noura Chelbah1, Sergio Díaz2, Johan Karlsson3, Oswaldo Trelles2, Juan Antonio Falgueras Cano4

1RISC Software GmbH, Linz, AT, 2Computer Architecture Department, University of Malaga, Málaga, ES, 3Integromics S.L., Madrid, ES, 4Department of Languages and Computational Sciences, University of Malaga, Málaga, ES

Introduction

Over a million mobile apps are currently available, covering different user needs. Despite that, within the area of bioinformatics there is a particular dearth of them, mostly due to the lack of convenient user interfaces and precise protocols. Bioinformatics often involves complex data sources and long processes, two things mobile apps are not good at. Limitations in memory (RAM and mass storage) and processor power have to be considered when developing such apps.

Thorough usability testing based on the needs of final users is paramount for bioinformatics apps to ensure their accessibility.

Methods

First, we have developed a native iOS app and, after some testing, a universally accessible HTML5 app that runs on a server and is able to access, enact and retrieve service results using MAPI. Second, the interface is being adapted from previously well-tested desktop interfaces. The most challenging part is the customization of dialogs for each service, to ensure efficient usage of the mobile app. Third, we are running a battery of tests on a group of bioinformaticians.

Results and discussion

Several experiments have been devised. Each is described in such a way that the final user can quickly advance step by step by following simple hints while understanding the goals of the apps.

The corresponding user tests have been performed and analyzed. Users have followed the simplest initial guidelines and have been recorded, interviewed and monitored following standards from the literature.

Two usability exercises are performed, related to solving common bioinformatics problems: 1. BLAST, and 2. homology search and phylogenetic study on a given amino acid sequence. For the selected tasks the user chooses the parameter values and, after enacting the tasks, the application keeps the user informed about their progress. For these tests the users represent bioinformaticians, biologists, physicians and researchers in the field.

These usability tests reinforce the need to develop new mobile applications for bioinformatics and data management, taking into account the mobile contexts in which they may be used.

C5-08 Providing interactive visualization on-demand for big data Tor Johan Mikael Karlsson, Juan Elvira, Miguel Hernández Martos

Parque Tecnológico de Ciencias de la Salud, Avenida de la Innovación, nº 1, Armilla, ES

Large-scale genomic projects using high-throughput technologies are producing massive data sets, thus immersing bioinformatics in what has been named the “Big Data” problem. Data storage, processing and visualization are becoming real concerns. In the Mr.SymBioMath Marie-Curie project [1], we employ parallel and, in particular, cloud computing as a promising approach to tackle these issues.

Due to the complexity and high dimensionality of genomic data, visual analytics tools have proven to be essential for exploring such massive datasets and extracting value from them in an intuitive and interactive way. OmicsOffice Suite [2] from Integromics S.L. provides a comprehensive set of advanced statistical tools and visual analytics workflows targeting diverse genomics technologies such as microarrays, NGS and qPCR. OmicsOffice Suite tools and workflows are built as extensions of the Tibco Spotfire visual analytics platform.

Connecting visual analytics tools, like OmicsOffice/Spotfire, to automated analysis pipelines and workflow engines like Galaxy [3] enables users to uncover valuable results faster than with traditional business intelligence platforms. Normally, visual analytics tools require a lot of user interaction in order to create the data visualizations. Therefore, one of the main obstacles to overcome when integrating visualization tools is to preserve the high throughput and reproducibility of the workflows. A platform-independent automation and integration API can reduce, if not completely avoid, user interaction throughout the entire analysis path, from raw data to the final result visualizations.

We report the development of a RESTful web-service API deployed as a Spotfire extension. This API allows external clients (such as Galaxy) to submit jobs consisting of IronPython scripts, where each script has full access to the Spotfire API. Due to security concerns, calls to the API are authenticated before being executed. The API queues incoming jobs to ensure that only one job is executed at a time. The API also allows clients to cancel jobs in the queue, check their status (waiting, running, error, finished) and finally download the results. The visualizations are published as platform-independent web pages.
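The job life cycle described above (authenticated submission, a single job executed at a time, cancellation, status checks, result download) can be sketched as a small in-memory model in Python. This is only an illustration under stated assumptions: the class and method names, the token-based check, the extra `CANCELLED` state and the `executor` callback are hypothetical, and nothing here reflects the actual Spotfire extension or its HTTP interface.

```python
from collections import deque
from enum import Enum


class Status(Enum):
    WAITING = "waiting"
    RUNNING = "running"
    ERROR = "error"
    FINISHED = "finished"
    CANCELLED = "cancelled"  # assumption: not one of the four states named in the abstract


class JobQueue:
    """In-memory model of a one-job-at-a-time queue (hypothetical design)."""

    def __init__(self, valid_tokens):
        self._tokens = set(valid_tokens)
        self._pending = deque()  # ids of waiting jobs, oldest first
        self._jobs = {}          # job_id -> {"script", "status", "result"}
        self._next_id = 1

    def _check(self, token):
        # Every client call is authenticated before anything else happens.
        if token not in self._tokens:
            raise PermissionError("unauthenticated call")

    def submit(self, token, script):
        self._check(token)
        job_id = self._next_id
        self._next_id += 1
        self._jobs[job_id] = {"script": script, "status": Status.WAITING, "result": None}
        self._pending.append(job_id)
        return job_id

    def cancel(self, token, job_id):
        """Cancel a job that is still waiting; return False otherwise."""
        self._check(token)
        job = self._jobs[job_id]
        if job["status"] is not Status.WAITING:
            return False
        self._pending.remove(job_id)
        job["status"] = Status.CANCELLED
        return True

    def status(self, token, job_id):
        self._check(token)
        return self._jobs[job_id]["status"]

    def result(self, token, job_id):
        self._check(token)
        return self._jobs[job_id]["result"]

    def run_next(self, executor):
        """Run the oldest waiting job to completion; return its id, or None if idle."""
        if not self._pending:
            return None
        job_id = self._pending.popleft()
        job = self._jobs[job_id]
        job["status"] = Status.RUNNING
        try:
            job["result"] = executor(job["script"])
            job["status"] = Status.FINISHED
        except Exception as exc:  # a failing script marks the job as errored
            job["result"] = str(exc)
            job["status"] = Status.ERROR
        return job_id
```

Because `run_next` is synchronous, a second job can only start after the first has finished or failed, which is one simple way to obtain the "one job at a time" guarantee; a real service would typically run this loop in a dedicated worker.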

[1] http://www.mrsymbiomath.eu

[2] https://www.integromics.com/

[3] http://galaxyproject.org/

C5-09 A pipeline for the analysis of gene/microRNA expression data from microfluidic cards Ferran Briansó1, Alex Sánchez2

1Statistics and Bioinformatics Unit (UEB) Vall d’Hebron Institut de Recerca (VHIR) , Barcelona, ES, 2Statistics and Bioinformatics Unit (UEB). Vall d’Hebron Institut de Recerca (VHIR) Statistics Department. University of Barcelona, Barcelona, ES

The post-genomic age has been identified with certain types of ‘omics data that have been used thoroughly in thousands of studies. First came microarrays, which tend to be replaced by RNA-seq. In parallel with these technologies and their evolution, another one has been used to study gene expression: RTqPCR. Although initially it was mainly used to confirm or reject microarray findings, microfluidic card technology, which allows dozens to hundreds of RTqPCRs to be performed in parallel, is becoming very popular as a complement or even an alternative to other techniques, especially for some types of studies such as microRNA studies, which have become very popular and involve a smaller number of targets than microarrays. In spite of having fewer targets than microarrays or RNA-seq, the analysis of RTqPCR data suffers from the same issues as other high-throughput techniques: its success depends strongly on an appropriate experimental design, it may need more or less sophisticated pre-processing, such as different types of normalization, and the analysis of the data has to account for these complexities as well as for a probably moderate sample size. These issues are not new in the field of ‘omics data analysis and most of them may be addressed using the appropriate statistical and bioinformatic tools. For this purpose there are commercial options such as GenEx or StatMiner, and open-source tools such as QPCR, a Java web-based application, or the Bioconductor package HTqPCR. This package addresses many of the issues described here but requires, as with many Bioconductor packages, manual programming of the analysis steps. Besides this, there are some aspects, such as importing data from different formats or using different normalization methods, where this package is not flexible enough. We have developed an R package, BP-PCR, which addresses some of these issues, allowing a certain automation of the whole process. The only thing the user has to do is to define the study by setting the values of a set of parameters describing all the analysis steps. Once this is done, the pipeline can be executed, and an easy-to-complete template with the analysis report and an HTML page giving access to the results are generated. This allows the user to go from the study definition to the resulting reports without having to write any analysis-specific R code, and ensures the reproducibility of any analysis performed because, jointly with the repor

C5-10 NETWORK INTERPRETATION OF INDEPENDENT COMPONENTS EXTRACTED FROM CANCER TRANSCRIPTOMES U Kairov1, A Zinovyev2, A. Molenov1, TA Karpenyuk3, A Akilzhanova4, Ye.M. Ramanculov4, Zh. Zhumadilov4

1Center for Life Sciences, Nazarbayev University, Astana, Kazakhstan, Astana, KZ, 2Institute Curie, Paris, FR, 3Kazakh National University after Al-Farabi, Almaty, KZ, 4National Center for Biotechnology of the Republic of Kazakhstan, Astana, KZ

Key words: Independent Component Analysis, microarrays, transcriptome, gene network

Motivation and Aim: High-throughput genomic technologies, and particularly microarray technology, have had a major impact on the study of cancer. The huge amount of microarray data requires the application of reproducible statistical approaches. In our study we aimed to apply Independent Component Analysis (ICA) [1] to perform a meta-analysis of cancer gene expression data and extract meaningful molecular signals in the form of meta-gene networks.

Methods and Algorithms: We used raw microarray data (*.CEL files) from different cancer datasets (GSE1456, GSE2034, GSE2990, GSE3494, GSE20685, GSE31210, GSE17951, GSE9891) in the Gene Expression Omnibus database [2]. The microarrays were normalized by GCRMA and processed using R 2.8.1 software [3] and Matlab 2009b [4]. The Matlab version of the Icasso package [5], with an ICA algorithm implementation, was used to analyse the independent components. Construction and visualization of gene networks and graphs were performed using Cytoscape [6], the BiNoM plug-in [7] and the HPRD database [8].

Results: We identified from 6 to 8 reproducible components in all cancer datasets. We developed a graph-based approach to the meta-analysis and interpretation of these independent components, such that each of them was associated with a small gene network. Using analysis of these networks, we provided a tentative interpretation of the stably reproducible components. Thus, we found that various factors such as proliferation, immune response, and contamination of tumor cells by lymphocytes and normal tissues affect gene expression in cancer.

References:

1. P. Comon (1994): Independent Component Analysis: a new concept? Signal Processing, 36(3):287–314.

2. http://www.ncbi.nlm.nih.gov/geo/

3. http://www.bioconductor.org/

4. http://www.mathworks.com/

5. J. Himberg, A. Hyvarinen and F. Esposito (2004): Validating the independent components of neuroimaging time series via clustering and visualization. Neuroimage, 22(3):1214-1222.

6. M. Cline, M. Smoot, E. Cerami et al. (2007): Integration of biological networks and gene expression data using Cytoscape. Nature Protocols, 2, 2366-2382.

7. A. Zinovyev, E. Viara, L. Calzone, E. Barillot (2008): BiNoM: a Cytoscape plugin for manipulating and analyzing biological networks. Bioinformatics, 24(6):876-877.

8. Prasad, T. S. K. et al. (2009): Human Protein Reference Database - 2009 Update. Nucleic Acids Research, 37, D767-72.

C5-11 Whole transcriptomes from Kazakhstani esophageal cancer patients: first bioinformatics analysis results of NGS data U Kairov1, S Rakhimova1, A Molkenov1, Seungbok Lee2, Jong-Il Kim2, Jeong-sun Seo2, Zh Zhumadilov1, A Akilzhanova1

1Department of Genomic and Personalized Medicine, Center for Life Sciences, Nazarbayev University, Astana, KZ, 2Ilchun Genomic Medicine Institute, Seoul National University, Seoul, KR

Esophageal cancer (EC) is among the 10 most common and fatal malignancies in the world, presenting a marked geographic variation in incidence rates between and within different countries. Esophageal squamous cell carcinoma (ESCC) is the predominant type of esophageal cancer worldwide, comprising almost 95% of cases. While ESCC is prevalent in the developing world, esophageal adenocarcinoma is commonly seen in developed countries, usually in association with Barrett’s esophagus. Traditionally, Kazakhs have habits such as drinking hot boiled tea and eating in a lying position, which give rise to recurrent traumatic lesions of the esophagus and esophagitis. In spite of its higher prevalence, ESCC has not been studied as intensively as esophageal adenocarcinoma. The molecular mechanisms contributing to the initiation and progression of ESCC are still poorly understood. The lack of sensitive and specific biomarkers for the diagnosis of ESCC emphasizes the need for more studies on ESCC tumorigenesis.

Materials and methods. Forty-two tissue samples (21 normal and 21 cancer samples) from Kazakh patients with esophageal cancer were collected for RNA preparation. Total RNA was isolated using the Takara RNA Isolation kit and purified with the Qiagen RNA Purification kit. Extracted RNA was assessed for quality and quantified using an RNA 6000 Nano LabChip on a 2100 Bioanalyzer (Agilent Inc.). Whole-transcriptome sequencing was performed on 10 tissue samples (5 normal and 5 cancer samples) from 5 Kazakh individuals with esophageal cancer using the Illumina HiSeq2000 next-generation sequencing platform. All generated *.bcl files were simultaneously converted and demultiplexed using the bcl2fastq application. RNA-seq data were aligned with Tophat2. Gene expression profiling was performed using the HTSeq tool and differentially expressed genes were determined with DESeq. The MSigDB and KEGG Pathway databases were queried to determine biological functions and interactions.

Results. The sequence alignment, gene expression profiling and determination of differentially expressed genes were completed for ten tissue samples from five Kazakh individuals. After paired analysis we found 287 down-regulated and 192 up-regulated genes between cancer and normal tissue. From the MSigDB and KEGG Pathway databases we found 10 and 4; 10 and 10 overlapping gene sets in the up- and down-regulated lists of genes, respectively.

Page 63 Posters Phylogeny / Evolution

D2-01 The Human DNA Damage Response Network database of proteins Eduardo Andrés León1, Ildefonso Cases2, Ana Rojas Mendoza1 1Instituto de Biomedicina de Sevilla (IBiS), Hospital Universitario Virgen del Rocio/CSIC/Universidad de Sevilla, 41013 Seville, Spain. Computational Biology and Bioinformatics, Seville, ES, 2Medical Genome Project, Andalusian Center for Human Genomic Sequencing, c/ Albert Einstein s/n. Plta. Baja, Sevilla, 41092, Spain, Seville, ES

The DNA damage response (DDR) is an essential signaling network that protects the integrity of the genome. This network is built upon a repertoire of distinct but often overlapping sub-networks, where sometimes the same components have different roles in precise spatial and temporal scenarios. Perturbations of this network produce genomic instability, which is inherently related to aging (Fernandez-Capetillo 2010), disease (Ciccia and Elledge 2010), and cancer, reviewed in (Lukas, Lukas et al. 2011).

Despite its importance, evolutionary studies addressing the emergence of this network have been restricted to a few protein families (On, Xiong et al. 2010). In this line, we have recently provided the largest systematic analysis of the human DDR network and have analyzed its evolutionary properties (Arcas, Fernandez-Capetillo et al. 2014).

To complement this study, we have built a resource to explore these data in an evolutionary context.

From this database, it is possible to select genes according to their function in a particular pathway or network, according to the post-translational modifications (PTMs) in which the gene acts as a target or a modifier, and also by sequence similarity. When searching for PTMs, affected residues and links to PubMed articles are provided.

In this tool, it is easy to find DDR proteins that emerged at the same age, are involved in the same networks/pathways, or are affected by similar PTM modifiers. When a DDR protein is selected, a detailed view displays information regarding its emergence and conservation across 47 species, the overall agreement with the taxonomic tree, and the position of a post-translationally modified residue in a structure, when available.

This resource is available at: http://ddr.cbbio.es

REFERENCES.

Arcas, A., O. Fernandez-Capetillo, I. Cases and A. M. Rojas (2014). “Emergence and evolutionary analysis of the human DDR network: implications in comparative genomics and downstream analyses.” Mol Biol Evol 31(4): 940-961.

Ciccia, A. and S. J. Elledge (2010). “The DNA damage response: making it safe to play with knives.” Mol Cell 40(2): 179-204.

Fernandez-Capetillo, O. (2010). “Intrauterine programming of ageing.” EMBO Rep 11(1): 32-36.

Lukas, J., C. Lukas and J. Bartek (2011). “More than just a focus: The chromatin response to DNA damage and its role in genome integrity maintenance.” Nat Cell Biol

D2-02 An integrative evolution theory of histo-blood group ABO and related genes Fumiichiro Yamamoto , ES

The ABO system is one of the most important blood group systems in transfusion/transplantation medicine. However, the evolutionary significance of the ABO gene and its polymorphism has remained unknown. We took an integrative approach to gain insights into the significance of the evolutionary process of ABO genes, including those related not only phylogenetically but also functionally. We experimentally created a code table correlating amino acid sequence motifs of the ABO gene-encoded glycosyltransferases with GalNAc (A)/galactose (B) specificity, and assigned A/B specificity to individual ABO genes from various species, thus going beyond simple sequence comparison. Together with genome information and phylogenetic analyses, this assignment revealed the early appearance of A and B gene sequences in evolution and the non-allelic presence of both genes in some animal species. We argue that evolution may have suppressed the establishment of two independent, functional A and B genes in most vertebrates and promoted A/B conversion through amino acid substitutions and/or recombination; that A/B allelism should have existed in common ancestors of primates; and that bacterial ABO genes evolved through horizontal and vertical gene transmission into two separate groups encoding glycosyltransferases with distinct sugar specificities.

D2-03 150 Tomato Genome Project Uncovers Structural Variation in the Tomato Clade Saulo Aflitos1, Gabino Sanchez-Perez1, Sandra Smit2, Jan van Haarst1, Elio Schijlen1, Sander Peters1 1Applied Bioinformatics, Plant Research International, Wageningen UR, Wageningen, NL, 2Department of Bioinformatics, Wageningen UR, Wageningen, NL

84 genotypes, including 10 old varieties, 42 landraces and 30 wild relatives of tomato, have been sequenced using Illumina HiSeq as representatives of the four major phylogenetic groups in the tomato clade (www.tomatogenome.net). Furthermore, three wild species, Solanum arcanum, Solanum habrochaites and Solanum pennellii, have been sequenced using both 454 and Illumina HiSeq, with the goal of constructing new reference genomes. Using these sequencing data, patterns of structural variation have been identified.

We studied allelic variation in the twelve phylogenetic groups of the tomato clade and identified patterns of structural variation, such as duplications and deletions, and patterns of introgression. Ratios of synonymous and non-synonymous SNPs have been determined. In addition, a collection of tag SNPs, useful for further genetic studies in tomato, has been identified. The identified structural variation has been uploaded into a JBrowse Genome Browser and will be publicly available via www.tomatogenome.net. Larger structural variations have been studied by comparing synteny between the four available reference genomes for the tomato clade. The data provide a valuable resource to facilitate tomato genetics and breeding.

D2-04 AFLPsim: an R package for simulation and genome scan of dominant markers in hybridizing populations Juan Luis García-Castaño, Francisco Balao Departamento de Biología Vegetal y Ecología, Universidad de Sevilla, Ap-1095, 41080, Sevilla, ES

Amplified Fragment Length Polymorphisms (AFLPs) have been used very successfully in the identification of hybrids and of outlier loci presumably under selection. However, to date, very few programs have been specifically designed to test for selection in hybrids using dominant markers. Additionally, simulators of dominant markers are very scarce and they do not usually take hybridization into account. Here, we describe AFLPsim, a software package (written for use in R) that is designed to overcome these limitations by implementing a dominant marker simulator of hybridization and some genome scan algorithms.

Simulating hybridization and demographic evolution. Our software generates diploid hybrid genotypes under the Hardy-Weinberg equilibrium hypotheses and Mendelian inheritance of markers. We consider phenotypic directional selection on the dominant allele, i.e. we modify the frequency of those individuals bearing a selected fragment regardless of whether they are homozygous or heterozygous. Additionally, we implement a modified version of the demographic evolution model in hybrid zones developed by Epifanio & Philipp (2000).
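As an illustration of the neutral case only, the following Python sketch simulates a dominant marker in F1 hybrids under Hardy-Weinberg equilibrium and Mendelian inheritance. It is not AFLPsim code (AFLPsim is an R package): the function names are hypothetical, selection and the demographic model are left out, and the only population-genetic facts used are that a band is absent solely in null homozygotes, so that a parental band frequency f corresponds to a null-allele frequency q = sqrt(1 - f).

```python
import math
import random


def null_allele_freq(band_freq):
    """Null (band-absent) allele frequency q under HWE, where band_freq = 1 - q**2."""
    return math.sqrt(1.0 - band_freq)


def expected_f1_band_freq(band_freq_a, band_freq_b):
    """Expected band frequency in F1 hybrids of parental populations A and B.

    An F1 individual lacks the band only if it receives a null allele
    from both parents, which happens with probability q_a * q_b.
    """
    return 1.0 - null_allele_freq(band_freq_a) * null_allele_freq(band_freq_b)


def simulate_f1(band_freq_a, band_freq_b, n, rng):
    """Simulate n F1 phenotypes (1 = band present, 0 = band absent)."""
    q_a = null_allele_freq(band_freq_a)
    q_b = null_allele_freq(band_freq_b)
    phenotypes = []
    for _ in range(n):
        null_from_a = rng.random() < q_a  # gamete from a parent of population A
        null_from_b = rng.random() < q_b  # gamete from a parent of population B
        phenotypes.append(0 if (null_from_a and null_from_b) else 1)
    return phenotypes
```

For example, `simulate_f1(0.75, 0.36, 1000, random.Random(1))` draws 1000 hybrid phenotypes whose band frequency should scatter around `expected_f1_band_freq(0.75, 0.36)`.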

Genomic scan. For statistically seeking outlier loci in different hybrid classes, AFLPsim performs a search based on binomial tests that assess any significant deviation between the observed and the expected frequencies for each marker. This search is extended by calculating the sqrt(1 - α) confidence intervals for the parental frequencies and not only the average values. Finally, the False Discovery Rate (FDR) correction is used.
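A minimal Python sketch of the two generic ingredients named above, an exact binomial test per marker followed by a False Discovery Rate correction, is given below. It is an illustration under assumptions rather than the AFLPsim algorithm: the two-sided p-value definition and the Benjamini-Hochberg procedure are standard choices picked here, the function names are hypothetical, and the confidence-interval extension for parental frequencies is not reproduced.

```python
import math


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success probability p."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)


def binom_test_two_sided(k, n, p):
    """Exact two-sided p-value: total probability of outcomes that are
    no more likely than the observed count k."""
    observed = binom_pmf(k, n, p)
    threshold = observed * (1.0 + 1e-7)  # tolerance for floating-point ties
    total = sum(pmf for pmf in (binom_pmf(i, n, p) for i in range(n + 1))
                if pmf <= threshold)
    return min(1.0, total)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

In use, `k` would be the number of hybrids showing a fragment, `n` the number of hybrids scored, and `p` the frequency expected under neutrality; markers whose adjusted p-value falls below the chosen FDR level are flagged as outliers.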

Graphics. Moreover, AFLPsim contains functions that produce graphics to visualise, among others, the expected frequencies under neutrality for loci under selection across different hybrid classes and the results of the demographic evolution model in a hybrid zone.

Availability and implementation. The AFLPsim package is freely available on CRAN from http://cran.r-project.org. A development version is also available in a GitHub repository (https://github.com/fbalao/AFLPsim).

Reference cited. Epifanio, J. & D. Philipp. 2000. Reviews in Fish Biology and Fisheries 10: 339–354.

D3-01 A genome wide exploration of the pleiotropic theory of senescence. Are human disease and senescence the result of natural selection? Juan Antonio Rodriguez Institute of Evolutionary Biology (Universitat Pompeu Fabra-CSIC), PRBB, Doctor Aiguader 88, 08003, Barcelona, Catalonia, Spain, Barcelona, ES

Changing demographic patterns and the ageing of the world’s population have spurred interest in the causes and mechanisms of senescence. Senescence has long been a mystery, with no single universally accepted theory accounting for its ultimate evolutionary causes (if indeed these causes exist). Perhaps the most popular of the evolutionary explanations proposed so far is the pleiotropic theory of senescence, suggested by G. Williams in 1957¹. This theory states that mutations conferring risk for traits that are damaging for the organism late in life (e.g. after the fertile stage) might be maintained in a population if they are advantageous early in life, when they can result in increased reproductive success.

In humans, this theory is consistent with evidence coming from certain genes, from specific conditions or from the life-long reproductive patterns of a few animal models. However, an exhaustive assessment of the impact of all these pleiotropic effects in the senescence of our species has not yet been carried out.

Using public metadata from Genome-Wide Association Studies (GWAS)², we quantified the global extent and evolutionary implications for our species of the kind of early-late age antagonistic pleiotropy predicted by the theory. Diseases were split into early- and late-onset conditions, and pleiotropies were computed among the SNPs reported to be associated with the diseases.
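One simple way to read this step is as a grouping problem: collect, for every SNP, the early-onset and late-onset diseases it is reported for, and keep the SNPs that appear in both classes. The Python sketch below shows that reading only. The function name, the input layout and the example identifiers are hypothetical, and the sketch does not test whether the effects are antagonistic, which would require allele-level risk directions that the abstract does not detail.

```python
from collections import defaultdict


def early_late_pleiotropies(associations, onset_class):
    """Find SNPs associated with both early- and late-onset diseases.

    associations: iterable of (snp_id, disease) pairs.
    onset_class:  dict mapping each disease to "early" or "late".
    Returns {snp_id: (sorted early diseases, sorted late diseases)}.
    """
    by_snp = defaultdict(lambda: {"early": set(), "late": set()})
    for snp, disease in associations:
        by_snp[snp][onset_class[disease]].add(disease)
    return {
        snp: (sorted(groups["early"]), sorted(groups["late"]))
        for snp, groups in by_snp.items()
        if groups["early"] and groups["late"]
    }
```

A SNP reported for a single disease, or only for diseases of one onset class, is dropped, so the result contains only candidate early-late pleiotropies.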

Our preliminary results are two-fold. First, they reveal some non-trivial antagonistic pleiotropies that may be relevant to the diagnosis and treatment of age-related pathologies. Second, and more interestingly in evolutionary terms, we observe a significant excess of early-late antagonistic pleiotropy in our genomes, some of which present the signature of natural selection.

At the time of submission, we are examining the consistency of the signatures of natural selection around pleiotropies with their putative role in the evolution of senescence.

References:

1. Williams, G. (1957). Pleiotropy, natural selection, and the evolution of senescence.

2. Hindorff, LA et al., (2009) – NHGRI GWAS Catalog: https://www.genome.gov/26525384

D4-01 orthoFinder: a new automated tool for searching orthologous proteins useful for functional annotation Pablo Mier, Antonio J. Pérez-Pulido Universidad Pablo de Olavide, Sevilla, ES

Homology refers to a common evolutionary origin between two sequences that have diverged and that present common functional and structural characteristics. Searching for homologous proteins, especially those arising from speciation events, called orthologues, is a practical way to functionally annotate sequences or to carry out evolutionary studies, since the evolutionary history of orthologues can show the evolution of species. Accurate identification of orthologous sequences is a continuing challenge in bioinformatics due to the accumulation of evolutionary processes such as gene loss, duplication and horizontal transfer. A series of protocols, computational tools and databases are available for searching for homologues or orthologues of a query sequence of interest. However, tools are not trouble-free, databases usually contain obsolete entries and protocols are time-consuming. To overcome these problems, we have developed a new tool for the automatic search of homologous and orthologous proteins, named orthoFinder, which solves the multidomain problem along with the problems related to the accumulation of evolutionary processes. It is implemented as an accessible and easy-to-use web application, designed to be used by non-expert users. The algorithm has been successfully tested with heterogeneous sets of proteins, yielding good sensitivity and specificity values, and it gives functional information about the results which can be useful for annotation processes.

D5-01 The study of the evolution of compartmentalisation using the YRG protein family indicate three endosymbiosis events Pablo Mier1, Miguel A. Andrade-Navarro2, Emmanuel Reynaud3 1Universidad Pablo de Olavide, Sevilla, ES, 2Max-Delbrück Center for Molecular Medicine, Berlin, DE, 3University College Dublin, Dublin, IE

Compartmentalisation is a key feature of eukaryotic cells, but its evolution remains poorly understood. The YlqF Related GTPase (YRG) protein family is a unique family of circularly permuted GTPases, and its proteins are present in Bacteria, Archaea and Eukarya. The YRG family seems to have 10 subfamilies with several well-defined cellular location characteristics. We describe the identification of 352 YRG proteins in 171 different proteomes. By assigning each protein to its related compartment, it is possible to follow the evolution of each cellular compartment along with the evolution of the species in time and compartmentalisation structure. The evolution of the YRG protein family can be related to the evolution of compartmentalisation in Eukarya, as well as to the origin of each compartment, including the nucleus. Based on our analysis, we observed an archaeal YRG protein for the eukaryotic nuclear and nucleolar YRG proteins. This supports an archaeal origin of the nucleus and hence the first endosymbiosis event in eukaryote compartmentalisation. Similarly, the mitochondrial and plastid proteins, including those from secondary and tertiary endosymbiotic events, evolved out of bacterial proteins. Our analysis indicates three main endosymbiosis events in the evolution of compartmentalisation, leading to the appearance of Eukarya. First, an archaeal endosymbiosis led to the formation of the nucleus and explains the chimeric structure of the eukaryotic genome. Second, a bacterial endosymbiosis allowed for the creation of the mitochondrial compartment and the increase in energy potential in eukaryotes. Finally, the endosymbiosis of a cyanobacterium initiated the formation of a plastid compartment that can be followed even after secondary and tertiary endosymbiosis.

Overall, our analysis of the YRG protein family allows for an analysis of the evolution of compartmentalisation at the protein sequence level for each member of the family, and also reflects the known evolution of compartmentalisation in specific species such as parasites and Fungi.

D5-02 The Dna Damage Response: Domain-based analysis of its components Aida Arcas1, Ildefonso Cases2, Ana M. Rojas3 1Instituto de Neurociencias, Alicante, ES, 2Genomics and Bioinformatics Platform of Andalusia, Sevilla, ES, 3Instituto de Biomedicina de Sevilla (IBIS-HUVR-CSIC-US), Seville, ES

The DNA damage response (DDR) is a crucial signaling network that preserves the integrity of the genome. Although extensive work has been conducted on particular proteins of the DDR, few evolutionary studies have been done to understand the origin of these proteins and to provide insightful clues into how this concerted system of pathways has been acquired in eukaryotes [1].

In this work we study at a domain level the orthologous sequences of 118 human DDR proteins to establish the domain repertoire involved in the DDR, to analyze the conservation of domains in different organisms, and to determine the acquisition of novel functions due to diverse domain architectures reflecting differences at the species level. Also, we intend to identify whether there are domains enriched in DDR-related functions.

We have identified DDR orthologous sequences with InParanoid [2] in a comprehensive set of species. The domain composition of the orthologous DDR sequences was analyzed using HMMER [3] and the Pfam database [4]. Also, manual identification of remote homologous domains in orthologous proteins without detected Pfam domains was performed. We constructed phylogenetic protein and domain profiles and clustered them to identify proteins and domains, respectively, that have appeared consistently in evolution. In addition, domain enrichment analyses were performed and the distribution of domains in DDR functional tiers was analysed.

Our results show that most components of the DDR appear to be specific to the eukaryotic lineage. This specificity is related to the acquisition of novel domains that increase the pathways' complexity in terms of fine-tuning and extend the interaction repertoire of DDR proteins to cross-talk with closely related pathways. Also, over the course of evolution, lineage-specific and domain rearrangement events may have introduced novel functions in various organisms, mainly in plants.

[1] Arcas A, Fernández-Capetillo O, Cases I and Rojas AM. Mol Biol Evol. 31(4):940-61, (2014).

[2] Remm M, Storm CE, and Sonnhammer EL J Mol Biol 314, 1041-1052, (2001).

[3] Eddy, SRA. Genome Inform 23, 205-211, (2009).

[4] Finn, R. D. et al. Nucleic Acids Res 38, D211-222, (2010).

D5-03 Gene expression alterations after polyploidization brings divergence in sibling Dactylorhiza allopolyploids Francisco Balao1, Daniel Jacob Diehl2, María Teresa Lorenzo1, Mikael Hedrén3, Ovidiu Paun2 1Departamento de Biología Vegetal y Ecología, Universidad de Sevilla, Ap-1095, Sevilla, ES, 2Department of Botany and Biodiversity Research, University of Vienna, Vienna, AT, 3Lund University, Lund, SE

Hybridization and polyploidization are central processes in plant evolution and speciation. Immediately following a polyploidization and/or a hybridization event, a genome undergoes adjustments in organization and function, thereby influencing the adaptive success and the evolutionary fate of the resulting lineages. Most allopolyploids have multiple origins, but the long-term significance of iterative allopolyploid evolution is not fully understood. We investigate here gene expression alterations in the ecologically divergent, sibling allopolyploids Dactylorhiza majalis and D. traunsteineri, together with representatives of their diploid parents, aiming to understand their importance to the ecological properties of the polyploids. Using high-throughput RNA sequencing (RNA-seq), we first assembled a unique reference transcriptome by combining data from one individual of each of the diploid parental species, excluding any redundancy. Our RNA-seq experiment read in total close to 1,000 billion nucleotides. The transcriptomic data from 26 Dactylorhiza individuals were mapped against the reference, and gene expression was quantified using the CLC Genomics Workbench. Differential gene expression was estimated using DESeq2 and edgeR in Bioconductor. We observe higher dominance of the D. fuchsii expression pattern in both polyploids. Our results point to dominance of patterns inherited from the parent with the higher expression level in the polyploids as the main mechanism of differential expression, with only a few transcripts exhibiting novel expression patterns. BLAST2GO analysis demonstrates that significantly differentially expressed genes between D. traunsteineri and D. majalis include some of ecological relevance, and that the pattern of expression of these genes differs between the polyploids, which could bring about the divergence observed between sibling allopolyploids.

Page 69 Posters Phylogeny / Evolution

D5-04 Characterizing Breaking Points and Synteny Blocks using SVM Jose Antonio Arjona Medina1, Noura Chelbat2, Oscar Torreño Tirado1, Oswaldo Trelles2 1Advanced Computing Technologies, RISC Software GmbH, Hagenberg, AT, 2Department of Computer Architecture University of Malaga, Malaga, ES

The characterization of genome regions is in the spotlight of comparative genomics. Efforts have been made to describe the regions composing the genome, among them synteny blocks (SBs) and breaking points (BPs). SBs are defined as continuous conserved regions with little or no “role” in chromosomal rearrangements. BPs are segments flanking conserved regions and are considered responsible for chromosomal rearrangements.

Here we present an approach for characterizing BPs and SBs with a supervised classification method. Our rationale is to find common patterns that could be used to classify unknown sequences into BPs and SBs.

We use an SVM endowed with different sequence kernels to train a classifier that distinguishes between the two classes composing the data set (BPs versus SBs, i.e. non-BPs). We considered the standard linear spectrum kernel for different choices of the pattern length parameter K=1,…,12, both with and without normalization. We also considered the quadratic spectrum kernel with K=1,…,6, again with and without normalization. Leave-one-out cross-validation was used to assess the generalization performance of the classifiers, and the weights of all features used by the SVM were extracted to determine which sequence patterns are indicative of each class.
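As an illustration of the kernels named above, the following is a minimal Python sketch of a spectrum kernel over K-mer counts, with optional normalization. It is not the authors' implementation: the function names are hypothetical, and the quadratic kernel is taken here to be the square of the linear kernel, which is only one plausible reading since the abstract does not define it.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k):
    """Count every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(x, y, k, normalize=False):
    """Linear spectrum kernel: dot product of the two k-mer count vectors."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    value = sum(n * cy[kmer] for kmer, n in cx.items())
    if not normalize:
        return float(value)
    # Cosine-style normalization: k(x, y) / sqrt(k(x, x) * k(y, y)).
    norm = sqrt(sum(n * n for n in cx.values()) * sum(n * n for n in cy.values()))
    return value / norm if norm else 0.0


def quadratic_spectrum_kernel(x, y, k, normalize=False):
    """Assumed quadratic variant: the linear kernel value squared."""
    return spectrum_kernel(x, y, k, normalize) ** 2
```

A kernel matrix built from such a function could be passed to any SVM implementation that accepts precomputed kernels; with normalization, every sequence has a self-similarity of 1, which removes the effect of sequence length.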

For the classification task, the data sets consist of sequences representing SBs and BPs in Salmonella enterica strains (NC_003198.1 and NC_003197.1). SBs are obtained through the HSPs (High Score Segment Pairs) pipeline (Torreno et al., 2013). BPs are represented by sequences flanking the ends of each SB and sequences located in the middle of two consecutive SBs. All sequences fulfill a minimum length requirement of 2000 bp. Additional tests were performed with data sets covering SB and BP sequences with lengths ranging from 21 bp to 2.2 Mbp from the human, mouse, rat and dog genomes.

In our first attempt, the best classification results reached an accuracy of 77%, obtained with the normalized linear spectrum kernel using indicative patterns of length K = 10.

An initial analysis of the results suggests that a higher presence of thymines (T) is indicative of BPs, whereas SBs are richer in guanines (G). This matches the microhomology rearrangement regions reported as patterns within Double-Strand Breaks (DSBs) by Girirajan et al.

Page 70 Posters Structure / Function

E1-01 Network curation of genome-scale models through the structural properties of a Metamodel Miguel Ponce-de-Leon1, Juli Pereto2, Francisco Montero1 1Departamento de Bioquímica y Biología Molecular I, Facultad de Ciencias Químicas, Universidad Complutense de Madrid, Madrid, ES, 2Departament de Bioquímica i Biologia Molecular and Institut Cavanilles de Biodiversitat i Biologia Evolutiva, Universitat de València, Paterna, ES

In order to study the phenotypic capabilities of organisms, the reconstruction of genome-scale metabolic models (GSM) from their functionally annotated genomes has proved to be a useful tool in systems biology. A first draft of a metabolic model usually leads to an incomplete network. Under the assumptions of Constraint-Based Modeling (CBM), many of the proposed reactions are functionally blocked, and many metabolites appear as gaps. We have recently developed a general method for the identification of such gap metabolites, as well as for the detection of the so-called Unconnected Modules (UM), defined as isolated sets of blocked reactions connected through gap metabolites. From here the curation of a GSM can be tackled in a straightforward manner, as recently described. This curation process relies on the use of databases of biochemical reactions such as SEED, BiGG or MetaCyc, among others. A metabolic database can be considered a metamodel or global network, because it may include reactions found among many different organisms across the tree of life. In this sense, the metabolism of a particular organism may be regarded as a subnetwork of this metamodel. The analysis of a metamodel by means of a framework such as CBM can lead to the detection of structural or invariant properties. These properties include Reaction Subsets (RS) and other coupling relations, as well as super-essential reactions. Interestingly, some of these structural properties may be extended to particular organisms and used for the curation of the corresponding metabolic model. In this communication we present a novel method to reconstruct and curate GSM which combines Flux Coupling Analysis (FCA) and Gapfilling on a curated metamodel. FCA was performed over the network in order to detect the reaction subsets and the directional coupling relations. From these coupling relationships it is possible to detect missing reactions in the subnetwork; these reactions are candidates to resolve the gaps of the organism whose metabolic network we are trying to infer. The remaining blocked reactions in the model were unblocked by means of Gapfilling. This method has been tested on a data set of 130 GSM.
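The notion of a gap metabolite can be illustrated with a small Python sketch. It is a simplified, purely topological check (a metabolite that is never produced or never consumed) on a hypothetical toy network, not the authors' method; a full analysis under CBM would also propagate blocking through the network, for example via flux variability analysis.

```python
def find_gap_metabolites(reactions):
    """Return metabolites that are never produced or never consumed.

    `reactions` maps a reaction id to (substrates, products, reversible).
    """
    produced, consumed = set(), set()
    for substrates, products, reversible in reactions.values():
        consumed.update(substrates)
        produced.update(products)
        if reversible:
            # A reversible reaction can run in either direction.
            consumed.update(products)
            produced.update(substrates)
    return (produced | consumed) - (produced & consumed)


def blocked_by_gaps(reactions, gaps):
    """Reactions touching a gap metabolite cannot carry steady-state flux."""
    return {
        rid for rid, (substrates, products, _) in reactions.items()
        if gaps & (set(substrates) | set(products))
    }


# Hypothetical toy network: A is taken up, B is secreted, C is a dead end.
toy = {
    "EX_A": ([], ["A"], False),
    "R1": (["A"], ["B"], False),
    "R2": (["B"], ["C"], False),
    "EX_B": (["B"], [], False),
}
gaps = find_gap_metabolites(toy)
blocked = blocked_by_gaps(toy, gaps)
```

In the toy network, C is produced by R2 but never consumed, so R2 cannot carry flux at steady state; a gap-filling step would look in the metamodel for a reaction that consumes C.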

E1-02 Maximum-likelihood phylogenetic computations with selection on protein folding stability Miguel Arenas, Agustin Sanchez-Cobos, Ugo Bastolla Centro de Biología Molecular Severo Ochoa, Madrid, ES

We present a physically inspired model of protein evolution with selection on the thermodynamic stability of the native state, but in which protein sites evolve independently, as in substitution models used for phylogenetic reconstructions, which allows the computation of the likelihood of phylogenetic trees.

The constraint that the native state must be thermodynamically stable is imposed in a self-consistent way, in the same spirit as Mean-Field models of statistical physics. The resulting model is a Mean-Field structurally constrained substitution model (SCS) of protein evolution. We tested this model on a large data set of proteins, finding that the ensembles of sequences generated with this model are on average thermodynamically stable, that they distribute hydrophobicity and substitution rates across the protein sequence as expected, in the sense that internal sites are more hydrophobic and evolve more slowly than surface sites, and that they yield favorable likelihood for the wild-type protein sequence in the Protein Data Bank. We are currently testing the performance of this model with phylogenetic reconstruction algorithms.

E2-01 From annotated transcriptome to a browseable database in the search for olive tree allergens Rosario Carmona1, Adoración Zafra2, Pedro Seoane3, Antonio Jesús Castro2, Juan De Dios Alché2, M. Gonzalo Claros3 1Plataforma Andaluza de Bioinformática, Universidad de Málaga, Málaga, ES, 2Department of Biochemistry, Cell and Molecular Biology of Plants. Estación Experimental del Zaidín. CSIC, Granada, ES, 3Departamento de Biología Molecular y Bioquímica, Plataforma Andaluza de Bioinformática. Universidad de Málaga, Málaga, ES

The olive tree (Olea europaea L.) is an important crop plant in Spain and the Mediterranean basin, and its pollen represents an important source of allergens. Although some transcriptomes have been obtained from vegetative tissues, the peculiarity of reproductive tissues in terms of gene expression deserves a dedicated study, not only for biological reasons but also in the search for allergens. For this purpose, Sanger sequences and Roche/454 reads were obtained from pollen and stigma at different maturation and developmental stages. After pre-processing, sequences were assembled and annotated using a complex workflow explained in another communication at this congress (Hicham Benzekri et al.), including the corresponding orthologues in Arabidopsis thaliana from the TAIR and RefSeq databases. Using AutoFlow, an automated workflow has been constructed to import the annotated transcriptome (73,245 transcripts) into a freely accessible, web-browseable database called ReprOlive (http://reprolive.eez.csic.es). It is based on the Ruby on Rails framework connected to a MySQL database, making it scalable, maintainable and expandable. Any sequence or annotation set shown on-screen can be downloaded by the scientific community. Partial transcriptomes from pollen and stigma can also be browsed and downloaded. Retrieval mechanisms for sequences and gene annotations are provided. Mapping and visualising annotated enzymes on KEGG pathways is also possible, revealing that most transcripts involved in pathways such as ascorbate and glutathione metabolism or flavonoid biosynthesis have been completely sequenced. From the complete set of transcripts (contigs), a representative transcriptome consisting of 31,257 transcripts has been identified, providing valuable information for future microarray or RNA-seq studies. Functional annotations of this transcriptome now offer the possibility of exploring the presence of already described allergens in the olive pollen. Moreover, comparison with dedicated allergen databases (e.g. Allergome: http://www.allergome.org) allows assessing the putative occurrence of not yet described allergens, which may form part of the complex allergen profile of the pollen of this species.

E2-02 The isoform with the most conserved protein features is the major protein isoform in the cell Federico Abascal1, Iakes Ezkurdia2, Angela del Pozo3, Jose Manuel Rodriguez4, Michael Tress1, Alfonso Valencia1, Jesus Vazquez5 1Structural Biology and Bioinformatics Programme and National Bioinformatics Institute (INB), Spanish National Cancer Research Centre (CNIO), Melchor Fernandez Almagro, 3, Madrid, ES, 2Unidad de Proteomica, Centro Nacional de Investigaciones Cardiovasculares, CNIC, Melchor Fernandez Almagro, 3, Madrid, ES, 3Institute of Medical and Molecular Genetics (INGEMM), Hospital Universitario La Paz. Centro de Investigaciones Biomédicas en Red de Enfermedades Raras (CIBERER), Madrid, ES, 4National Bioinformatic Institute Unit, Spanish National Cancer Research Centre (CNIO), c. Melchor Fernández Almagro 3, 28029 Madrid, Spain, Madrid, ES, 5Laboratorio de Proteomica Cardiovascular, Centro Nacional de Investigaciones Cardiovasculares, CNIC, Melchor Fernandez Almagro, 3, Madrid, ES

BACKGROUND - Studies repeatedly demonstrate the expression of a wide range of alternatively spliced transcripts in eukaryotic cells. Although many studies have attempted to address the equivalent diversity at the protein level, the variety of methods and criteria used for splice isoform identification has generated confusing and contradictory results for the identification of cellular protein isoforms.

RESULTS - We have analysed the peptides from 8 large-scale human proteomics analyses with a rigorous strategy for peptide search and identification. We map peptides to 60% of the protein coding genes in the human genome, yet find convincing evidence for alternatively spliced proteins for fewer than 250 protein coding genes. For the vast majority of the genes identified in the proteomics experiments the peptide evidence mapped to just a single protein isoform. We found that the dominant isoform identified in the proteomics analyses was also almost always the isoform with the most conserved protein features and could be predicted using the APPRIS database. Where we were able to identify alternative splice isoforms, many of the alternative protein isoforms were only subtly different from the main isoform and almost all maintained their Pfam domain composition.

Parallel experiments carried out with mouse proteomic datasets confirmed a similar pattern, the prevalence of a single dominant isoform and the conservation of functional domains for those few alternative splicing events detected at the protein level. Many of the alternative isoforms that we identified in mouse were also found in the human experiments, and almost half of these were generated by the splicing of mutually exclusive homologous exons.

CONCLUSIONS - Our results suggest that the expression of alternative splice isoforms is subject to some level of cellular control, that most genes have a single dominant protein isoform and that this isoform can be determined from an analysis of the structure, function and conservation of the annotated splice variants.

E3-01 Whole genome sequencing of Mycobacterium tuberculosis in Kazakhstan: first sequence results of two clinical isolates Ulykbek Kairov1, Askhat Molkenov1, Ulan Kozhamkulov1, Saule Rakhimova1, Ayken Askapuli1, Maxat Zhabagin1, Zhannur Abilova1, Ainur Akhmetova1, Dauren Yerezhepov1, Aliya Abilmazhinova1, Venera Bismilda2, Lyailya Chingisova2, Zhaxybay Zhumadilov1, Ainur Akilzhanova1 1Center for Life Sciences, Nazarbayev University, Astana, KZ, 2National Center of Tuberculosis Problems of the Republic of Kazakhstan, Almaty, KZ

Objective. The project aims to create the prerequisites for a personalized approach to the diagnosis and treatment of tuberculosis (TB) by identifying and comparing the whole genome sequences of M. tuberculosis strains isolated in Kazakhstan. Analysis of whole genome sequences obtained using next-generation sequencing technology will clarify the factors that cause the formation of highly virulent strains of M. tuberculosis, the evolution of local strains, and genetic markers of drug resistance.

Methods. Material collection from 50 patients, sputum extraction and determination of drug sensitivity were performed in the reference laboratory of the “National Center of Tuberculosis Problems”, Almaty, Kazakhstan. DNA libraries for whole genome sequencing were prepared from DNA of isolates. Whole genome sequencing was performed on the Roche 454 GS FLX+ next-generation sequencing platform at the Center for Life Sciences, Nazarbayev University, Astana, Kazakhstan. The sequencing reads from two isolates were assembled into contigs using GS De Novo Assembler. All alignments were done against the M. tuberculosis reference strain H37Rv using GS Reference Mapper.

Results. Whole genome sequencing was performed for two M. tuberculosis isolates, MTB-476 and MTB-489. For MTB-476, 96 Mbp were generated with an average read length of 520 bp and approximately 21.8X coverage; for MTB-489, 104.2 Mbp were generated with an average read length of 589 bp and approximately 23.7X coverage. The genome of MTB-476 consists of 257 contigs, 4204 CDS, 46 tRNAs and 3 rRNAs. MTB-489 has 187 contigs, 4183 CDS, 45 tRNAs and 3 rRNAs. The genome assemblies were submitted to NCBI GenBank and are available for public access under the accession numbers AZBA00000000 and AZAZ00000000. Further work is being conducted on detailed analysis of the whole-genome results and on genotyping of M. tuberculosis isolates circulating in the territory of Kazakhstan.
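The coverage figures can be related to the sequencing yield with a short calculation: mean depth is the number of sequenced bases divided by the genome length. The sketch below assumes the H37Rv reference length of 4,411,532 bp, which is not stated in the abstract, and the function name is hypothetical; it gives values close to the reported coverages.

```python
# Assumed H37Rv reference genome length in bp; not stated in the abstract.
H37RV_GENOME_BP = 4_411_532


def mean_coverage(total_bases, genome_length=H37RV_GENOME_BP):
    """Average sequencing depth = sequenced bases / genome length."""
    return total_bases / genome_length


cov_476 = mean_coverage(96.0e6)    # MTB-476: 96 Mbp sequenced
cov_489 = mean_coverage(104.2e6)   # MTB-489: 104.2 Mbp sequenced
print(f"MTB-476: {cov_476:.1f}X, MTB-489: {cov_489:.1f}X")
```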

E3-02 First Kazakh whole genomes: report of NGS data Ainur Akilzhanova1, Ulykbek Kairov1, Saule Rakhimova1, Askhat Molkenov1, Arang Rhie2, Jong-Il Kim2, Jeong-Sun Seo2, Zhaxybay Zhumadilov1 1Center for Life Sciences, Nazarbayev University, Astana, KZ, 2Ilchun Genomic Medicine Institute, Seoul National University, Seoul, KR

Human genome sequencing will underpin human biology and medicine in the next century, providing a single, essential reference to all genetic information. Extraordinary technological advances and the declining cost of DNA sequencing have made whole genome sequencing (WGS) a widely accessible test for numerous indications. The international project “Genetic architecture of Kazakh population” aims to determine the complete DNA sequence of Kazakh individuals. Next generation sequencing is a powerful tool for genetic analysis, and will enable us to uncover the association of loci at specific sites in the genome with many disease traits.

Methods. The first WGS was performed on 6 Kazakh individuals on the Illumina HiSeq2000 next-generation sequencing platform using the TruSeq SBS Kit v3. All generated *.bcl files were simultaneously converted and demultiplexed using the bcl2fastq application. Sequence reads were aligned against the human hg19 reference genome using bwa-mem. Sorting, removal of intermediate files, assembly of *.bam files and marking of duplicates were performed using the PicardTools package. The GATK HaplotypeCaller tool was used for variant calling. The ClinVar, SNPedia and Cosmic databases were processed to identify clinical genomic variants in the 6 Kazakh whole genomes. The Java Runtime Environment and the R Bioconductor package were installed for raw data processing and for running program scripts.

Results. Sequence alignment and mapping to the reference genome hg19 were completed for each of the 6 healthy Kazakh individuals. From 87,308,581,400 to 107,526,741,301 total base pairs were sequenced, with an average coverage of 29.85. From 98.85 to 99.58 % of base pairs were mapped, with 96.07 % properly paired on average. Het/Hom and Ti/Tv ratios for each whole genome ranged from 1.35 to 1.52 and from 2.07 to 2.08, respectively. We compared each genome with the existing clinical databases ClinVar, SNPedia and Cosmic and found from 20 to 25, from 269 to 288 and from 7 to 12 SNP records, respectively. The availability of reference Kazakh genome sequences provides the basis for studying the nature of sequence variation, particularly single nucleotide polymorphisms.
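The two quality ratios quoted above can be computed from a list of called variants as follows. This is a minimal sketch with hypothetical function names and input shapes, (ref, alt) base pairs and VCF-style genotype strings; it is not taken from the pipeline described in the abstract.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """Transitions swap purine with purine or pyrimidine with pyrimidine."""
    return ({ref, alt} <= PURINES) or ({ref, alt} <= PYRIMIDINES)


def titv_ratio(snps):
    """Ti/Tv ratio for an iterable of (ref, alt) single-base substitutions."""
    ti = tv = 0
    for ref, alt in snps:
        if ref == alt:
            continue  # not a substitution
        if is_transition(ref, alt):
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else float("inf")


def het_hom_ratio(genotypes):
    """Het/Hom ratio from genotype strings such as '0/1' or '1/1'."""
    het = sum(1 for g in genotypes if g in ("0/1", "1/0", "0|1", "1|0"))
    hom = sum(1 for g in genotypes if g in ("1/1", "1|1"))
    return het / hom if hom else float("inf")
```

Whole-genome Ti/Tv ratios for human samples are commonly expected to be a little above 2, so values of 2.07 to 2.08 are in line with that expectation.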

Conclusion. The first whole genome sequencing of Kazakh individuals was performed. We identified SNPs that are associated with different susceptibilities. Further WGS studies of the Kazakh population are needed to identify possible unique genetic variants in Kazakhs.

E4-1 Dissecting Domain-Specific Evolutionary Pressure Profiles of Transient Receptor Potential Vanilloid Subfamily Members 1 to 4 Pablo Doñate Macian, Alex Perálvarez Marín UAB, Cerdanyola del valles, ES

The transient receptor potential vanilloid family includes 6 members split into two groups. The first group, non-calcium-selective, comprises four ion channels (TRPV1, TRPV2, TRPV3 and TRPV4). The second, calcium-selective, comprises two channels (TRPV5 and TRPV6). Both groups are represented within the vertebrate subphylum. TRPV ion channels are involved in several sensory and physiological processes, being related to sensation and adaptation to the environment. TRPV homologs, called inactive and nanchung, are already present in invertebrates, but it was in the vertebrate subphylum that the TRPV subfamily expanded by gene duplication and differentiation of the original genes. Upon expansion, TRPV members undergo strong evolutionary pressure. Using multiple sequence alignments as the source for evolutionary, bioinformatic and statistical analyses, we have analyzed the evolutionary profiles of the non-calcium-selective TRPVs (TRPV1, TRPV2, TRPV3 and TRPV4). We have analyzed the selective pressure on specific protein domains, observing a common selective pressure trend for TRPV channels. Through a more detailed analysis we have identified evolutionary constraints involved in subunit contact at the transmembrane domain level. Through evolutionary comparison, we have translated specific channel structural information such as the transmembrane topology, and the interaction between the membrane proximal domain and the TRP . We have also identified potential common regulatory domains among all TRPV1-4 members, such as protein-protein, lipid-protein and vesicle trafficking domains.

E5-01 Understanding protein recognition using structural features Manuel A. Marín-López, Joan Planas-Iglesias, Jaume Bonet, Baldo Oliva Structural Bioinformatics Laboratory, Universitat Pompeu Fabra, Barcelona, ES

Protein-protein interactions (PPIs) play a crucial role in many cell processes. Thus, understanding the molecular mechanism of protein recognition is a critical challenge in molecular biology. Previous work in this field shows that not only the binding region but also the rest of the protein is involved in the interaction, suggesting a funnel-like recognition model as responsible for facilitating the interaction process. Furthermore, we have previously shown that three-dimensional local structural features (groups of protein loops) define characteristic patterns (interaction signatures) that can be used to predict whether two proteins will interact or not. A notable trait of this prediction system is that interaction signatures can be denoted as favouring or disfavouring depending on their role in promoting molecular binding. Here, we use such features to determine differences between the binding interface and the rest of the protein surface in known PPIs. In particular, we study three different groups of protein-protein interfaces: i) native interfaces (the actual binding patches of the interacting pairs), ii) partial interfaces (the docking between the binding patch of one protein and a non-interacting patch of the interacting partner), and iii) back-to-back interfaces (the docking between non-interacting patches of both interacting proteins). Our results show that native interfaces present a slightly higher proportion of favouring signatures than the other two groups. Notably, we also show that the interaction signatures in partial interfaces are less favoured than the ones observed in back-to-back interfaces. We hypothesise that this phenomenon is related to the dynamics of the molecular association process. Back-to-back interfaces preserve the exposure of the real interacting patches (thus allowing the formation of a native interface), while in a partial interface one interacting patch is sequestered and becomes unavailable to form a native interaction. According to this reasoning, partial interfaces represent a major obstacle to the formation of the real interaction and should be prevented or released, and hence unfavourable signals should characterize them more than the other two groups. In comparison, although back-to-back interfaces also represent a wrong interacting conformation, they still expose the binding patches of both partners and may represent an opportunity for the native conformation to occur.

E5-02 KARTES: a 3D spatial genome browser Mike Goodstadt1, Marc A. Marti-Renom2 1Genome Biology Group, Centre Nacional d’Anàlisi Genòmica (CNAG); Gene Regulation, Stem Cells and Cancer Program, Centre de Regulació Genòmica (CRG), Barcelona, ES, 2Genome Biology Group, Centre Nacional d’Anàlisi Genòmica (CNAG); Gene Regulation, Stem Cells and Cancer Program, Centre de Regulació Genòmica (CRG); Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, ES

The genome is conventionally represented and explored as a linear sequence; however, chromatin is a dynamic structure within the cell nucleus [1]. Chromatin architecture has been shown to have key roles in defining and controlling cell functions such as gene expression and regulation. The spatial organization and dynamics of chromatin can be computationally modeled using software such as TADbit [2], which relies mainly on 3C-based data [1]. The extent, detail and utility of these types of data present challenges in representing and interacting with such large, multi-scale models. Current genome browsers offer a limited overview of cell-wide activity, do not include spatial data and do not provide an adequate end-user experience. New web-based, open-source, cross-platform technologies provide the opportunity to address these issues with increasing ease, stability and security. Despite the limitations of three-dimensional (3D) representation due to occlusion, subjectivity and interpretation inherent in the human system, 3D visualizations can assist navigability, comprehension and discovery by leveraging human visual processing, spatial reasoning and creativity. Here, we describe a new Web-based browser that aims at visualizing the genome in all its dimensions (linear to 3D). KARTES integrates the ‘1D’ genome sequence, its ‘2D’ aligned annotations and its ‘3D’ models to give a more complete vision of the forms and interactions.

1. Dekker, J., M.A. Marti-Renom, and L.A. Mirny, Exploring the three-dimensional organization of genomes: interpreting chromatin interaction data. Nat Rev Genet, 2013. 14(6): p. 390-403.

2. Bau, D. and M.A. Marti-Renom, Structure determination of genomic domains by satisfaction of spatial restraints. Chromosome Res, 2011. 19(1): p. 25-35.

E5-03 Water-omics: high throughput analysis of protein-solvent interaction from molecular dynamics simulations Adam Hospital1, Modesto Orozco2, Josep Lluís Gelpí2 1Molecular Modeling and Bioinformatics Group; Institute for Research in Biomedicine; Instituto Nacional de Bioinformática, Barcelona, ES, 2Molecular Modeling and Bioinformatics Group; Institute for Research in Biomedicine; Instituto Nacional de Bioinformática; Department of Biochemistry and Molecular Biology, University of Barcelona, Barcelona, ES

Hydration around proteins is described by mining our library of around 1,800 molecular dynamics trajectories of proteins in explicit solvent (MoDEL [1]). Analysis of the trajectories of more than 16 million water molecules on the most representative protein folds provides a picture of unprecedented quality of the solvent environment around folded proteins. Results suggest a much more dynamic behavior of the solvent-protein interactions than expected. For instance, there is a very limited correlation between low-mobility water molecules in the simulation and crystallographic water molecules, usually taken as a reference for structural water. Analysis of Mean Residence Times (MRT) agrees with that conclusion. MRTs generally lie in the range of picoseconds, only reaching the nanosecond range for water molecules placed within protein pockets. Water molecules do interact closely with protein atoms, but they only remain in the first solvent layer for a limited amount of time. On the other hand, solvent also plays a significant role in the stability and dynamics of hydrogen bonds (HBs). The HB stability study reveals that only 40% of them can be considered stable (formation dG better than -2 kcal/mol). Water also acts actively in the breaking of protein HBs, mainly by solvating the individual atoms and, to a much lesser extent, by forming water bridges. Overall, this study provides a systematic methodology for the analysis of solvent-protein interactions that can be generally applied.

[1] Meyer, T., D'Abramo, M., Hospital, A., Rueda, M., Ferrer-Costa, C., Pérez, A., Carrillo, O., Camps, J., Fenollosa, C., Repchevsky, D., Gelpí, J.L., Orozco, M. (2010) MoDEL (Molecular Dynamics Extended Library): A database of atomistic molecular dynamics trajectories. Structure, 18, 1399-1409.
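To make the Mean Residence Time concrete, here is a minimal Python sketch that estimates it for one water molecule from a per-frame flag indicating whether the molecule lies in the first solvent layer. The function names, the input format and the run-length estimator are assumptions made for illustration; the abstract does not say how MRT was computed, and published analyses often fit a survival function instead.

```python
def residence_runs(in_shell):
    """Lengths (in frames) of each uninterrupted stay in the first shell."""
    runs, current = [], 0
    for present in in_shell:
        if present:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def mean_residence_time(in_shell, dt_ps):
    """Mean stay duration in picoseconds; 0.0 if the water never enters."""
    runs = residence_runs(in_shell)
    return dt_ps * sum(runs) / len(runs) if runs else 0.0
```

For example, with frames saved every 2 ps, the flag sequence 1,1,0,1,0,0,1,1,1 contains stays of 2, 1 and 3 frames, giving a mean residence time of 4 ps.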

E5-04 A new pipeline to analyze RNA-Seq data applied to determine the influence of the stress induced kinases ATM and p38 MAPK on transcription M.D. Stobbe1, N. Trempolec1, E. Planet2, T.H. Stracker1, A.R. Nebreda1, D. Rossell3 1IRB Barcelona, Barcelona, ES, 2Ecole polytechnique fédérale de Lausanne, Lausanne, CH, 3University of Warwick, Coventry, UK

Deficiency of the kinase ATM leads to the disease Ataxia-telangiectasia, characterized by neurodegeneration and predisposition to lymphoma. ATM loss increases reactive oxygen species and activates the p38 MAPK pathway that has been shown to influence several pathological outcomes of the disease in animal models. As both ATM and p38 MAPK activities can regulate transcription, we hypothesized that they may differentially affect the same targets. To address this, we performed RNA sequencing from primary cells lacking either ATM or different p38 MAPK family members and wild type cells in unperturbed conditions or stressed by ionizing radiation. To identify the influence of ATM and p38 MAPK on transcript expression and to determine if common transcripts were affected, we developed a new pipeline for RNA-Seq data analysis.

In the first step of the pipeline we aligned the reads to the mouse genome using TopHat2 [1]. Next, we used the novel algorithm Casper [2] to estimate expression at the isoform level. Casper avoids loss of information in RNA-Seq data summarization to deliver more precise estimates, particularly for lowly expressed isoforms. We then carefully preprocessed the data to take into account differences in sequencing depth, batch effects, and genetic differences between cell cultures. After this, GaGa [3] was applied to compare the groups and detect differentially expressed transcripts. GaGa analyzes all transcripts and all samples of the groups jointly, giving more statistical power to detect true differences in an experiment with a small sample size. Moreover, we avoid common misinterpretations arising from doing all pairwise comparisons between groups, where negative findings caused by lack of sensitivity are often interpreted as proving that no differences exist. In the final step of the pipeline we required the detected differences to be at least a 2-fold change, in order to focus on the most strongly affected transcripts.
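The final filtering step can be sketched in Python as follows. This only illustrates a 2-fold threshold applied in either direction to expression values on a linear scale; the function names and the input dictionary are hypothetical, and the statistical calls made by Casper and GaGa are not reproduced.

```python
def fold_change(expr_a, expr_b):
    """Magnitude of change between two non-negative expression values (>= 1)."""
    hi, lo = max(expr_a, expr_b), min(expr_a, expr_b)
    if lo == 0:
        # A transcript absent in one group counts as an unbounded change.
        return float("inf") if hi > 0 else 1.0
    return hi / lo


def filter_by_fold_change(candidates, min_fold=2.0):
    """Keep transcript ids whose group means differ by at least `min_fold`.

    `candidates` maps a transcript id to (mean_group_a, mean_group_b).
    """
    return {
        tid for tid, (a, b) in candidates.items()
        if fold_change(a, b) >= min_fold
    }
```

Applying the threshold symmetrically keeps both up-regulated and down-regulated transcripts, so a transcript at half the reference level passes just as one at double the level does.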

Using this pipeline, we detected 347 differentially expressed transcripts in a comparison of the effect of the genotype in unperturbed conditions. For the difference in response to irradiation across the different genotypes, we detected 341 differentially expressed transcripts. Current work is focusing on the experimental validation of the predictions made and understanding their biological relevance.

References

[1] Kim D et al Genome Biol, 2013 Vol 14

[2] Rossell D et al Ann Appl Stat, 2014 Vol 8

[3] Rossell D Ann Appl Stat, 2009 Vol 3

E5-05 Optimization of protein-protein docking for predicting Fc-protein binding modes Mark Agostino1, Ricardo Mancera1, Paul Ramsland2, Juan Fernandez-Recio3 1Curtin University, Perth, AU, 2Burnet Institute, Melbourne, AU, 3Barcelona Supercomputing Center, Barcelona, ES

Protein recognition of the antibody crystallizable fragment, or Fc, is an integral part of the immune response to antigens. Pathogens produce proteins that bind Fc in order to evade the immune response. The structural characterization of the determinants of Fc-protein association is essential to improve our understanding of the immune system at the molecular level and to develop new therapeutic agents. However, very few structures of Fc-protein complexes are available, limiting the level of structural understanding of protein recognition at this site. In this study, a protein-protein docking protocol is developed and optimized for studying Fc-protein recognition at the CH2-CH3 site of Fc. The protocol utilizes a combination of Fc-based restraints, defined by Fc regions contacted in the known Fc-protein complexes, with an optimized scoring strategy. The protocol is capable of identifying a suitably accurate pose for the evaluation cases within a set of 30 poses, and is robust to the use of homology models. Using the optimized approach, the binding modes of several human and pathogenic proteins with IgA Fc were proposed, including Fcα/μR, TRIM21, and streptococcal M and β proteins. Together with knowledge of the experimentally determined IgA Fc complexes with SSL7 and FcαRI, the proposed binding modes allowed the determination of a «pharmacophore» model for protein binding to IgA Fc, as well as further highlighting the key IgA Fc residues involved in protein recognition. This structural knowledge will be valuable for structure-based design of molecules to modulate IgA Fc-protein interactions, as well as for the design of molecules for the purification of therapeutic IgA.

E5-06 Structural characterization of carbohydrate binding promiscuity of Euonymus europaeus lectin Mark Agostino1, Tamir Dingjan2, Tony Velkov2, Spencer Williams3, Elizabeth Yuriev2, Paul Ramsland4 1Curtin University, Perth, AU, 2Monash Institute of Pharmaceutical Sciences, Parkville, AU, 3Bio21 Institute, University of Melbourne, Parkville, AU, 4Burnet Institute, Melbourne, AU

Euonymus europaeus lectin (EEL) is a carbohydrate-binding protein derived from the fruit of the European spindle tree. Despite being first purified in the mid-1970s and sold commercially, precise details of its structure and mechanism of carbohydrate recognition remain elusive. Fluorescence titrations demonstrate that EEL binds to a wide range of carbohydrates, including blood group-related carbohydrates, mannose-terminating carbohydrates, chitotriose and α-sialic acid, although affinity appears strongest for H-like (fucose-terminating) carbohydrates. To investigate the binding promiscuity of EEL, a homology model of EEL was prepared and molecular docking simulations were performed. Due to the low sequence identity of EEL with other structurally characterized proteins, protein homologs were sought using the HHPred server, which uses hidden Markov models (HMMs) to identify relationships between sequence, structure and function. The HMM-based comparison identified that the best templates for EEL were proteins featuring a ricin B-like fold, a fold featuring three putative carbohydrate-binding sites (α, β, γ). The EEL model was prepared using different templates for the protein core and the binding sites; for the binding sites, the complex of actinohivin with Manα(1→3)Man was used. Redocking of the modelled ligands at each of the three sites suggested that the γ site was the most likely carbohydrate-binding site of EEL. The γ site was subsequently optimized for binding to α-L-fucose. Likely binding modes for the tested carbohydrates were determined using constrained docking and RMSD comparison to the fucose template. Furthermore, a relationship between the experimental binding energy and the predicted binding energy of the selected poses could be determined and optimized. In conjunction with recent crystallographic analyses of carbohydrate binding to langerin and DC-SIGN, this study highlights the importance of understanding the structural similarities of monosaccharides and provides a basis for exploiting such knowledge in elucidating carbohydrate-protein recognition.
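The pose selection by RMSD comparison to a template, mentioned above, can be illustrated with a minimal sketch. It assumes that each pose and the template are given as equally ordered lists of matched atom coordinates and that no superposition is needed because all poses share the receptor frame. The function names and the 2.0 Å cutoff are hypothetical and are not taken from the abstract.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally ordered lists of (x, y, z) atom coordinates.

    No superposition is performed: docking poses are assumed to share the
    receptor's coordinate frame.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equally long")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


def select_poses(poses, template, cutoff=2.0):
    """Return (rmsd, pose_index) pairs within the cutoff, best first.

    The cutoff value is an illustrative assumption.
    """
    scored = [(rmsd(pose, template), index) for index, pose in enumerate(poses)]
    return sorted(pair for pair in scored if pair[0] <= cutoff)


# Toy data: a two-atom template and two poses shifted along z.
template = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
near_pose = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
far_pose = [(0.0, 0.0, 3.0), (1.0, 0.0, 3.0)]
selected = select_poses([far_pose, near_pose], template)
```

In this toy case only the pose shifted by 1 Å passes the cutoff. A real workflow would restrict the comparison to the atoms shared between each ligand and the fucose template.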

E5-07 Structural Studies on Chondroitin/Dermatan Sulfate mixed synthetic oligosaccharides by NMR and Molecular Dynamics Cristina Solera, Giuseppe Macchione, Susana Maza, Jose Luis de Paz, Pedro Manuel Nieto Glycosystems Laboratory, Instituto de Investigaciones Químicas (IIQ), CSIC, Sevilla, ES

We are involved in a long-term project on the interaction of glycosaminoglycans (GAGs) and signalling proteins. In this context, synthetic oligosaccharides are useful tools for the establishment of structure-activity relationships for specific sequences and the design of mimetics that potentially modulate the biological functions of the natural product. Chondroitin sulfate (CS) and dermatan sulfate (DS) are linear, polyanionic polysaccharides that belong to the GAG family. Their chains are composed of repeating disaccharide units of ᴅ-glucuronic acid (GlcA)-β(1→3)-N-acetyl-ᴅ-galactosamine (GalNAc)-β(1→4) or ʟ-iduronic acid (IdoA)-α(1→3)-N-acetyl-ᴅ-galactosamine (GalNAc)-β(1→4), respectively, that may present sulfate groups at various positions. This structural heterogeneity, in terms of sulfation pattern, can be understood as an inherent capacity to encode information and control a wide variety of biological processes. We have performed an exhaustive NMR study to obtain structural restraints that allow deriving 3D models of the synthetic tetrasaccharides (CS and mixed DS/CS) by MD simulations. Here we report the results of this approach. The global 3D shapes obtained are similar elongated forms, but in the cases where iduronate residues are present they display the characteristic conformational equilibrium $^{1}C_{4}$ – $^{2}S_{O}$.

E5-08 Importance of the polarity of the glycosaminoglycan chain on the interaction with FGF-1 Juan C. Muñoz-García1, M. José García-Jiménez1, Paula Carrero1, Ángeles Canales2, Jesús Jiménez-Barbero2, Manuel Martín-Lomas3, Anne Imberty4, José L. de Paz1, Jesús Angulo1, Hugues Lortat–Jacob5, Pedro M. Nieto1 1Glycosystems Laboratory, Instituto de Investigaciones Químicas (IIQ), CSIC, Sevilla, ES, 2Centro de Investigaciones Biológicas, CSIC, Madrid, ES, 3CIC biomaGUNE, Biofunctional Nanomaterials Unit, San Sebastian, ES, 4CERMAV-CNRS, Grenoble, FR, 5Institute de Biologie Structural Jean Pierre Ebel, Grenoble, FR

Heparin-like saccharides play an essential role in binding to FGF-1 and to its membrane receptor FGFR, forming a ternary complex. This complex is responsible for the internalization of the signal, via the dimerization of the intracellular regions of the receptor. To investigate the ability of the hexasaccharides to interact with FGF-1, IC50 values were determined from SPR competition experiments. The affinity order differs from the previously reported data for mitogenic activity. A potential reason is the different sulfation pattern of the glycosaminoglycan and the geometry around the glycosidic linkages for the two directionalities of the chain (non-reducing to reducing end and vice versa). To find a satisfactory explanation for the different activities of the saccharides, a molecular docking protocol was employed to analyze the possible molecular interactions between the inactive saccharide (Hexa3) and FGF-1.

Figure 1. Hexa3.

Initially, we ran an unrestrained molecular dynamics trajectory of 500 ns and obtained different conformations. The backbone of the most representative one was manually superimposed onto the 2erm structure. We then performed docking calculations with Glide, first using the Induced Fit Docking protocol under standard conditions and then subjecting the results to a run of Single Precision Docking. In this case, the focus was put on the three residues of the triad, leading to a displaced sequence. The impossibility of assembling a complex with the complete set of charged interactions between FGF-1 and the hexasaccharide led us to conclude that the correct polarity of the GAG chain is essential for the interaction with the growth factor. At the moment, we are studying another hexasaccharide (Hexa4S) that is important for forming the ternary complex and for intracellular signalling. We carried out MD (500 ns) and tar-MD (time-averaged restrained molecular dynamics) simulations of 20 ns to determine the conformations involved in the pseudorotational equilibrium of the IdoA2S residue in the equatorial region of the Cremer-Pople sphere.

Figure 2. Hexa4S.


E5-09 Assessments in Bioinformatics and the need of continuous evaluations José M. Fernández, Michael Tress, Alfonso Valencia INB-GN2, Structural and Computational Biology Programme, CNIO, Madrid, Spain, Madrid, ES

The second edition of the CAFA (Critical Assessment of Functional Annotation) challenge had 54 participating groups interested in submitting their predictions for 100816 targets from selected organisms and 1301 enzymes from EFI. The main idea behind it is to evaluate the state of the art in software from the functional annotation community. Five months passed between the start of the challenge and the submission deadline, but more time than that is needed to assess the participating prediction systems accurately and in enough depth. Constraints on available human resources and time usually make the results of this and many other challenges and assessments outdated or even obsolete.

Bioinformatics communities are used to critical assessments and challenges as a common way to inspect and strengthen the state of the art in their tools. One of the outcomes of these challenges is the set of well-balanced gold-standard data sets generated for the assessments, which are later used, along with the assessment results, as a baseline to compare new developments. As technology and techniques continue to evolve while a challenge is running, the assessment results received at the end of each challenge (CAFA, CASP, CAPRI, BioCreAtIvE, etc.) are snapshots of the past. Because these competitions need so many resources, many of them are held only sporadically. Even worse, very few gold-standard data sets and assessments of existing tools are available to profile new developments in the different bioinformatics research fields, compared to the size and growth of the biological data sources commonly used in life sciences research.

Bioinformatics communities need to evaluate their systems periodically, shortening the periods between assessments as much as possible, so that they receive real feedback about the state of the art. The only way to achieve this is to automate as much of the evaluation work as possible by building continuous evaluation systems (CES). Building on that idea, our group has designed GOPHER, a modular CES initially designed for the evaluation of function prediction methods and servers, which can be set up for other domains.

Page 80 Posters Student Symposium

S1-03 A new pipeline to analyze RNA-Seq data applied to determine the influence of the stress induced kinases ATM and p38 MAPK on transcription M.D. Stobbe1, N. Trempolec1, E. Planet2, T.H. Stracker1, A.R. Nebreda1, D. Rossell3 1IRB Barcelona, Barcelona, ES, 2Ecole polytechnique fédérale de Lausanne, Lausanne, CH, 3University of Warwick, Coventry, UK

Deficiency of the kinase ATM leads to the disease Ataxia-telangiectasia, characterized by neurodegeneration and predisposition to lymphoma. ATM loss increases reactive oxygen species and activates the p38 MAPK pathway that has been shown to influence several pathological outcomes of the disease in animal models. As both ATM and p38 MAPK activities can regulate transcription, we hypothesized that they may differentially affect the same targets. To address this, we performed RNA sequencing from primary cells lacking either ATM or different p38 MAPK family members and wild type cells in unperturbed condition or stressed by ionizing radiation. To identify the influence of ATM and p38 MAPK on transcript expression and to determine if common transcripts were affected, we developed a new pipeline for RNA-Seq data analysis.

In the first step of the pipeline we aligned the reads to the mouse genome using TopHat2 [1]. Next, we used the novel algorithm Casper [2] to estimate expression at the isoform level. Casper avoids loss of information in RNA-Seq data summarization to deliver more precise estimates, particularly for lowly expressed isoforms. We then carefully preprocessed the data to take into account differences in sequencing depth, batch effects, and genetic differences between cell cultures. After this, GaGa [3] was applied to compare the groups and detect differentially expressed transcripts. GaGa analyzes all transcripts and all samples of the groups jointly, giving more statistical power to detect true differences in an experiment with a small sample size. Moreover, we avoid common misinterpretations arising from doing all pairwise comparisons between groups, where negative findings caused by lack of sensitivity are often interpreted as proving that no differences exist. In the final step of the pipeline we required the detected differences to be at least a 2-fold change, to focus on the most strongly affected transcripts.
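The final filtering step of the pipeline can be sketched as follows. This is only an illustration of a 2-fold change threshold applied to transcripts already flagged as differentially expressed. The function names, the input layout (transcript identifier plus mean expression in two groups) and the pseudocount of 1.0 are assumptions and are not described in the abstract.

```python
import math


def log2_fold_change(expr_a, expr_b, pseudocount=1.0):
    """log2 ratio of two expression values.

    The pseudocount (an assumption of this sketch) avoids division by zero
    for transcripts that are not expressed in one group.
    """
    return math.log2((expr_a + pseudocount) / (expr_b + pseudocount))


def filter_by_fold_change(candidates, min_fold=2.0, pseudocount=1.0):
    """Keep candidates whose absolute fold change is at least min_fold.

    candidates: iterable of (transcript_id, mean_expr_group_a, mean_expr_group_b)
    for transcripts already called differentially expressed upstream.
    """
    threshold = math.log2(min_fold)
    kept = []
    for transcript_id, expr_a, expr_b in candidates:
        lfc = log2_fold_change(expr_a, expr_b, pseudocount)
        if abs(lfc) >= threshold:
            kept.append((transcript_id, lfc))
    return kept


# Toy values: t1 is up 4-fold, t2 changes by less than 2-fold, t3 is down 4-fold.
toy_candidates = [("t1", 7.0, 1.0), ("t2", 3.0, 2.0), ("t3", 0.0, 3.0)]
strong_changes = filter_by_fold_change(toy_candidates)
```

Working on the log2 scale treats up- and down-regulation symmetrically, so a single threshold covers both directions.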

Using this pipeline, we detected 347 differentially expressed transcripts in a comparison of the effect of the genotype in unperturbed conditions. For the difference in response to irradiation across the different genotypes, we detected 341 differentially expressed transcripts. Current work is focusing on the experimental validation of the predictions made and understanding their biological relevance.

References

[1] Kim D et al Genome Biol, 2013 Vol 14

[2] Rossell D et al Ann Appl Stat, 2014 Vol 8

[3] Rossell D Ann Appl Stat, 2009 Vol 3


S1-06 GoldBinch: a scatter search-based biclustering of gene expression data algorithm that integrates biological knowledge with functional annotations Juan A. Nepomuceno1, Alicia Troncoso2, Isabel A. Nepomuceno-Chamorro1, Jesús S. Aguilar-Ruiz2 1Departamento de Lenguajes y Sistemas Informáticos. Universidad de Sevilla, Sevilla, ES, 2Departamento de Informática, Universidad Pablo de Olavide, Sevilla, ES

Biclustering is an unsupervised machine learning technique that searches for patterns in gene expression data matrices. The most important difference with respect to traditional clustering is that biclustering algorithms aim to cluster simultaneously on genes and conditions, rather than focusing solely on either one. These algorithms are based on the assumption that co-expressed genes are co-regulated genes. Recently, this assumption has been reformulated, because the co-expression of a group of genes may result from independent activation under the same experimental condition rather than from a shared regulatory regime. For this reason, introducing prior knowledge into these algorithms is a key point. Lately, the integration of biological information stored in repositories and databases has been proposed in other fields, such as traditional clustering or classification. GoldBinch is a biclustering algorithm that integrates biological knowledge to find biclusters. This integration is carried out by means of the fitness function in a scatter search scheme. Scatter search is a population-based metaheuristic that performs the optimization process with a small set of solutions instead of the complete population of solutions used in other population-based metaheuristics, such as genetic algorithms. Two different functions are analyzed in this work in order to incorporate biological information into the search process. In addition to the gene expression data matrix, the input of the algorithm is a direct annotation file, which relates each gene to a set of terms from a biological repository where the gene is annotated. To evaluate the proposed algorithm, three experiments have been carried out. As an initial step, the algorithm is applied with and without biological information in order to show its performance. Moreover, the two functions proposed here to integrate biological information are analyzed and compared. Finally, it is shown that GoldBinch obtains better results than other classical biclustering algorithms typically used as benchmarks.
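To make the idea of a knowledge-aware fitness function concrete, the sketch below combines an expression-coherence term with an annotation term. It is not one of the two GoldBinch functions, which the abstract does not specify. The coherence measure used here is the mean squared residue commonly used in biclustering, and the annotation term, the weight and all names are hypothetical. Lower values are better.

```python
from collections import Counter


def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue of the submatrix given by row and column indices."""
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(rows), len(cols)
    row_means = [sum(row) / n_cols for row in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = sum(
        (sub[i][j] - row_means[i] - col_means[j] + overall) ** 2
        for i in range(n_rows)
        for j in range(n_cols)
    )
    return total / (n_rows * n_cols)


def annotation_coherence(genes, annotations):
    """Fraction of genes sharing the most frequent annotation term (0.0 to 1.0)."""
    counts = Counter(term for gene in genes for term in set(annotations.get(gene, ())))
    if not counts:
        return 0.0
    return counts.most_common(1)[0][1] / len(genes)


def fitness(matrix, rows, cols, gene_names, annotations, weight=1.0):
    """Hypothetical fitness: expression residue plus a penalty for mixed annotations."""
    genes = [gene_names[i] for i in rows]
    penalty = 1.0 - annotation_coherence(genes, annotations)
    return mean_squared_residue(matrix, rows, cols) + weight * penalty


# Toy data with made-up gene names and annotation terms.
expression = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 1.0, 9.0]]
gene_names = ["g1", "g2", "g3"]
annotations = {"g1": ["termA", "termB"], "g2": ["termA"], "g3": ["termC"]}
good = fitness(expression, [0, 1], [0, 1, 2], gene_names, annotations)
poor = fitness(expression, [0, 2], [0, 1, 2], gene_names, annotations)
```

In a scatter search scheme, a function of this kind would be evaluated on each candidate bicluster in the small reference set, so that solutions with coherent expression and shared annotations are preferred.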

S1-07 MECoMaP: a Multiobjective Evolutionary Contact Map Predictor Alfonso E. Márquez-Chamorro1, Gualberto Asencio-Cortés1, Juan A. Nepomuceno2, Jesús S. Aguilar-Ruiz1 1Departamento de Informática, Universidad Pablo de Olavide, Sevilla, ES, 2Departamento de Lenguajes y Sistemas Informáticos. Universidad de Sevilla, Sevilla, ES

One of the main topics in Structural Bioinformatics is the protein inter-residue contact prediction problem. Although multiple approaches have been developed in recent years, this problem is still far from being solved. We present an evolutionary approach to protein inter-residue contact prediction called MECoMaP. Our algorithm returns a set of decision rules that determine the specific characteristics of a residue-residue contact. The rules used by our algorithm are based on structural features, such as protein secondary structure and solvent accessibility, and on physico-chemical properties of amino acids, as well as evolutionary information (PSSM). Efficiency is also improved by using an AVL tree to classify the training examples used by the evolutionary algorithm. The results obtained show better accuracy rates than other similar approaches.


S1-08 Compota: a cloud-computing based Scala tool especially suited for bioinformatics pipelines Evdokim Kovach, Alexey Alekhin, Marina Manrique, Pablo Pareja-Tobes, Eduardo Pareja, Raquel Tobes, Eduardo Pareja-Tobes Oh no sequences! Research Group, Era7 bioinformatics, Granada, ES

Compota is a Scala library for declaring stateless computations and scaling them using cloud computing, in particular a combination of services from AWS (Amazon Web Services). Compota relies on the EC2 service (Elastic Compute Cloud) to carry out the computations, on the S3 service (Simple Storage Service) and DynamoDB for data storage, and on SQS (Simple Queue Service) and SNS (Simple Notification Service) for communication between the different system components. Compota consists of a set of independent components (nisperos), each of which has:

• a set of workers that performs computations specific to each nispero

• a manager instance that is in charge of deploying and undeploying the group of workers

• an input SQS queue where input data for workers will be stored

• an output SQS queue where workers will publish results.

Nisperos can be composed as follows: the output queue of one nispero can be used as the input queue of another nispero, so that the computations of the nisperos are applied sequentially to the input data, as a composition of functions. This composition allows building complex bioinformatics pipelines. The first application of Compota was Metapasta, a microbial community profiling tool (https://github.com/ohnosequences/metapasta). Compota is an open-source project released under the AGPLv3 license.

The source code is available at https://github.com/ohnosequences/compota. This project is funded in part by the ITN FP7 project INTERCROSSING (Grant 289974).
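The queue-based composition of nisperos described above can be illustrated with a small in-memory sketch. It is written in Python rather than Scala, does not use the Compota API or any AWS service, and replaces SQS queues with local deques. The class and function names are hypothetical.

```python
from collections import deque


class Nispero:
    """Toy stand-in for a nispero: a stateless computation between two queues."""

    def __init__(self, name, compute, input_queue, output_queue):
        self.name = name
        self.compute = compute            # stateless function applied to each message
        self.input_queue = input_queue    # local deque standing in for an SQS queue
        self.output_queue = output_queue

    def run(self):
        """Drain the input queue, publishing one result per message."""
        while self.input_queue:
            message = self.input_queue.popleft()
            self.output_queue.append(self.compute(message))


def run_pipeline(nisperos):
    """Run the nisperos in order; each one feeds the next through a shared queue."""
    for nispero in nisperos:
        nispero.run()


# Two nisperos chained through a shared middle queue.
reads = deque(["acgt", "ttga"])
upper_cased = deque()
lengths = deque()
to_upper = Nispero("to_upper", str.upper, reads, upper_cased)
measure = Nispero("measure", len, upper_cased, lengths)
run_pipeline([to_upper, measure])
```

Because the output queue of `to_upper` is the input queue of `measure`, the result for each read equals `len(str.upper(read))`, which mirrors the composition of functions mentioned in the abstract. In the real system, the workers of each nispero would run concurrently on separate instances.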
