EMBnet.journal Volume 21 Supplement A December 2015

Next Generation : a look into the future Final Conference & MC Meeting of COST Action BM1006 16-17 March 2015 Bratislava, Slovakia http://seqahead.eu/bratislava_2015

ESF provides COST is supported the Cost Office by the EU RTD through an EC Framework contract Programme Editorial/Content EMBnet.journal 21.A Editorial Contents The key task of COST Action BM1006, SeqAhead, Editorial...... 2 Next Generation Sequencing (NGS) Data Analysis COST Action BM1006 (SeqAhead) closing Network, was, as its name suggests, networking; conference...... 3 but SeqAhead also emphasised the dissemina- Scientific Programme...... 5 tion of knowledge. During the four years of the Keynote Lectures...... 9 Action, SeqAhead surpassed every expectation: Oral Presentations ...... 13 with members participating from 29 European Posters...... 25 countries, plus one international partner from South Africa, the Management Committee mem- bership reads like a “who’s-who” of European NGS research. This EMBnet.journal Conference Supplement clearly shows that during the four years of SeqAhead’s existence, the Action members ac- tively shared software and experiences, and col- laborated in numerous projects spanning diverse EMBnet.journal Executive Editorial topics – from those at the biological end of the spectrum to those at the informatics end. Board SeqAhead was acknowledged in more than Erik Bongcam-Rudloff, Department of Animal Breeding and Genetics, SLU, SE 100 peer-reviewed articles, demonstrating that [email protected] the Action’s activities benefited a large and wide audience. Teresa K. Attwood, Faculty of Life Sciences and School of Computer Science, University of Manchester, UK [email protected]

EMBnet.journal Editorial Board Domenica D’Elia, Institute for Biomedical Technologies, CNR, Bari, IT [email protected] Andreas Gisel, Institute for Biomedical Technologies, CNR, Bari, IT [email protected]

Laurent Falquet, University of Fribourg, Fribourg, CH [email protected] Pedro Fernandes, Instituto Gulbenkian. PT [email protected] Lubos Klucar, Institute of Molecular Biology, SAS Bratislava, SK [email protected] Vicky Schneider-Gricar, The Genome Analysis Centre (TGAC) Norwich, UK [email protected]

Cover picture: Valparaiso, Chile, 2015 [© Erik Bongcam-Rudloff]

page 2 (not for indexing) EMBnet.journal 21.A Reports e860

COST Action BM1006 (SeqAhead) During the conference, keynote and selected speakers presented the latest NGS technologies closing conference “Next Generation and their applications, discussed storage and Sequencing: a look into the future” analysis systems, and presented the training available and still required to enable researchers to utilise these technologies in new fields of re- search. The event opened with a short welcome from the local organiser, Dr. Lubos Klucar (IMB SAS, Bratislava (SK)), followed by an introduction by the SeqAhead Chair, Prof. Erik Bongcam-Rudloff (SLU, Uppsala (SE)). Fourteen speakers presented their work and discussed new technology highlights in the field, including the latest minION device Lubos Klucar 1, Teresa K. Attwood 2, Erik presented by Ola Wallerman (SLU, Uppsala (SE)). Bongcam-Rudloff 3  The final SeqAhead MC meeting took place 1Institute of Molecular Biology, Slovak Academy of Sciences, during the second day, where the principal Bratislava, Slovakia achievements from the four years of the Action 2 University of Manchester, Manchester, United Kingdom were presented and discussed. The impact of 3Swedish University of Agricultural Sciences, Uppsala, Sweden the project has been multi-faceted, ranging Klucar L et al. (2015) EMBnet.journal 21(Suppl A), e860. http:// from the very abstract ‘improved healthcare’, dx.doi.org/10.14806/ej.21.A.860 owing to better trained NGS researchers in medi- cal diagnostic facilities, to the very concrete, in The closing conference and final Management terms of the research described in the nearly 100 Committee (MC) meeting of COST Action peer-reviewed journal articles that acknowledge BM1006 (SeqAhead) took place over two days, SeqAhead. 16-17 March 2015, in Bratislava (SK). Fifty par- Doubtless, the most tangible result of ticipants from 20 European countries gathered SeqAhead is the large number of students/scien- at the event to discuss and share experiences, tists who gained access to high-quality training in know-how related to new technologies, and key the emerging field of NGS, training that is not yet developments in the field of Next Generation part of most formal university curricula, and was Sequencing (NGS) research.

SeqAhead COST conference participants.

page 3 (not for indexing) e860 Reports EMBnet.journal 21.A certainly not available to scientists who complet- ed their studies a few years ago. SeqAhead organised more than 30 work- shops, four conferences with more than 150 par- ticipants each, and several schools, each with 25-60 students. In total, almost 2,000 scientists participated in these activities, the impact of which has been immediate for NGS research in Europe, and will continue to be felt for many years. As already mentioned, SeqAhead partners have published about 100 articles on NGS and Poster session. related topics, and many more publications are expected beyond the end of the project. Such articles help disseminate the scientific activities entific interests in formal and informal settings, catalysed by SeqAhead and, we hope, will have and in physical and virtual (video) meetings. a lasting impact: healthcare is a major financial These meetings led to the formation of several burden in most European countries, so better large European collaborations, such as the (now 1 NGS science in hospitals is likely to have both so- complete) AllBio project and the newly awarded 2 cietal and economic benefits. COST Action CHARME , which is to kick-off in 2016. SeqAhead brought together researchers from AllBio, in turn, triggered the establishment of the 3 across Europe, allowing them to discuss their sci- recently initiated EU project PROLIFIC . Members of SeqAhead also won an IRSES Marie Curie pro- ject aiming to create an NGS network between the Americas and Europe4. We expect these new projects to have a lasting impact in many areas of life-science re- search, we are proud to have been part of the successful network that seeded them, and we wish them success in the years ahead. Finally, our thanks go out to all who made SeqAhead such a warmly collaborative and productive pro- ject! Lubos Klucar (Local Organiser), Teresa K. Attwood (Action Vice Chair) and Erik Bongcam- Jim Dowling (KTH - Royal Institute of Technology) having talk Rudloff (Action Chair) on the biobankcloud project (www.biobankcloud.com).

1 www.allbioinformatics.eu/doku.php 2 www.cost.eu/COST _ Actions/ca/CA15110 3 euprolific.eu/index.php/the-project 4 www.deann.eu

page 4 (not for indexing) Scientific Programme

COST Action BM1006 - Final Conference 16-17 March 2015, Bratislava, Slovakia EMBnet.journal 21.A Scientific Programme 16 March COST Conference “NGS: a look into the future” 08:30-09:00 Registration and Coffee 09:00-09:20 Welcome from organisers NGS Technologies and their Applications 09:20-10:00 Keynote Lecture Session Chair: Erik Bongcam-Rudloff Power and Limitations of RNA-Seq: Findings from the SEQC (MAQC-III) consortium Paweł Łabaj Chair of Research Group, Boku University Vienna, Austria 10:00-10:30 Coffee Break & postesr set-up 10:30-12:00 Oral Presentations Reproducible Research in the era of Next Generation Sequencing: current

approaches, examples and future perspectives Claudia Angelini, Dario Righelli, Francesco Russo Current status of nanopore sequencing using the minION device – from full length cDNA sequencing to genome assembly improvements Ola Wallerman Deep insights into Mecp2-driven transcriptional (de)regulation at embryonic developmental stage through RNA-Seq data analysis Kumar Parijat Tripathi, Maurizio D'Esposito, Mario R Guarracino, Marcella Vacca Exploring the activity of microorganisms in the forest soil using metatranscriptomics Petr Baldrian 12:00-13:00 Lunch Storage and Analysis Systems 13:00-13:40 Keynote Lecture Session Chair: Ralf Herwig Population-Scale Genomics with BiobankCloud and Hops/Hadoop Jim Dowling KTH - Royal Institute of Technology, Stockholm, Sweden 13:40- 15:00 Oral Presentations Creating a successful facility for large-scale extraction of DNA, an example from the Swedish biobank initiative BBMRI.se James Thompson Using neural networks to filter predicted errors in NGS Data Dimitar Vassilev, Milko Krachunov DIANA-TarBase v7: indexing thousands experimentally supported miRNA:mRNA interactions Dimitra Karagkouni Biobanking for the future, how to prepare for the next generation of next generation sequencing Tomas Klingström 15:00-15:30 Coffee Break 15:30-16:10 Keynote Lecture Session Chair: Claudia Angelini Data deluge demands training tsunami Eija Korpelainen CSC, Espo, Finland 16:10-17:10 Oral Presentations Introducing Meta2genomics: the search for the “micro-bee” Jose R Valverde page 6 (not for indexing) 21.A EMBnet.journal

16:10-17:10 Oral Presentations Introducing Meta2genomics: the search for the “micro-bee” Jose R Valverde Information-theoretic approach for detection of differential splicing from RNA-seq data Ralf Herwig iMir: An innovative and complete pipeline for smallRNA-Seq data analysis Giorgio Giurato, Antonio Rinaldi, Adnan Hashim, Giovanni Nassa, Maria Ravo, Francesca Rizzo, Roberta Tarallo, Angela Cordella, Giovanna Marchese, Domenico Memoli, Alessandro Weisz 17:10-17:20 Closing remarks 17:20-18:00 Poster session 19:30 Dinner

17 March SeqAhead Management Committee Meeting 08:30-09:00 Morning Coffee 09:00-09:15 Welcome and introduction 09:15-10:00 Business meeting 10:00-10:30 Coffee Break 10:30-12:00 Business meeting 12:00 Lunch

Chairs and Conference Committees Scientific Committee Prof. Teresa Attwood (University of Manchester, Manchester, United Kingdom) Dr. Claudia Angelini (CNR, Naples, Italy) Dr. Erik Bongcam-rudloff (SLU, Uppsala, Sweden) Dr. Laurent Falquet (University of Fribourg, Fribourg, Switzerland) Prof. Jacques van Helden (AMU, Marseille, France) Mr. Oliver Hunewald (CRP-Sante, Luxembourg, Luxembourg) Dr. Lubos Klucar (IMB SAS, Bratislava, Slovakia) Prof. Claude Muller (CRP-Sante, Luxembourg, Luxembourg) Dr. Guy Perriere (Université Claude Bernard, Lyon, France) Dr. Jose Ramon Valverde (CNB/CSIC, Madrid, Spain) Dr. Dimitar Vassilev (Agro Bio Institute, Sofia, Bulgaria) Dr. Vicky Schneider (TGAC, Norwich, United Kingdom) Scientific Organisers Prof. Teresa Attwood (University of Manchester, Manchester, United Kingdom) Dr. Erik Bongcam-Rudloff (SLU, Uppsala, Sweden) Dr. Lubos Klucar (IMB SAS, Bratislava, Slovakia) Session Chairs Dr. Erik Bongcam-Rudloff (SLU, Uppsala, Sweden) Dr. Ralf Herwig (Max-Planck-Institute for Molecular Genetics, Berlin, Germany) Dr. Claudia Angelini (CNR, Naples, Italy)

page 7 (not for indexing) EMBnet.journal 21.A List of presentations (presenting authors)

Keynote Lectures Information-theoretic approach for detection of Power and limitations of RNA-Seq: findings from the differential splicing from RNA-seq data SEQC (MAQC-III) Consortium Ralf Herwig...... 23 Paweł Piotr Łabaj...... 10 iMir: an innovative and complete pipeline for Population-scale genomics with BiobankCloud and smallRNA-Seq data analysis Hops/Hadoop Giorgio Giurato...... 24 Jim Dowling...... 11 Posters Data deluge demands a training tsunami Microbial analysis of ovine cheese by next Eija Korpelainen...... 12 generation sequencing Vladimir Kmet...... 26 Oral Presenations Reproducible Research in the era of Next Generation PACMAN: PacBio Methylation Analyzer Sequencing: current approaches, examples and Laurent Falquet...... 27 future perspectives NORM-SYS - harmonizing standardization processes Claudia Angelini...... 14 for model and data exchange in systems biology Current status of nanopore sequencing using the Babette Regierer...... 28 MinION device – from full length cDNA sequencing to Targeted amplicon sequencing in genetic genome assembly improvements diagnostics of patients with cystic fibrosis Ola Wallerman...... 15 Jelena Kusic-Tisma...... 29 Deep insights into Mecp2-driven transcriptional (de) Reference leaf transcriptomes for potato cultivars: regulation at embryonic developmental stage Desiree and PW363 through RNA-Seq data analysis Maja Zagorščak...... 30 Kumar Parijat Tripathi...... 16 NGS as a tool for investigating spread of zoonotic Exploring the activity of microorganisms in the forest bacteria: dealing with source and outbreak genetic soil using metatranscriptomics variation in a spatio-temporal context Petr Baldrian...... 17 Robert Söderlund...... 31 Creating a successful facility for large-scale Biobanks and future emerging technologies: new extraction of DNA, an example from the Swedish approaches, new pre-analytical challenges Biobank initiative BBMRI.se Eva Ortega-Paino...... 32 James Thompson...... 18 MetLab: an In silico experimental design, simulation Using neural networks to filter predicted errors in NGS and validation tool for viral metagenomics studies data Hadrien Gourlé...... 33 Dimitar Vassilev...... 19 Towards SNP calling in polyploid genomes DIANA-TarBase v7: indexing hundreds of thousands experimentally supported miRNA:mRNA interactions Anatoliy Dimitrov...... 34 Dimitra Karagkouni ...... 20 Gene set enrichment analysis of neuroendocrine system of the silkworm Bombyx mori Biobanking for the future, how to prepare for the next generation of Next Generation Sequencing Gabor Beke...... 35 Tomas Klingström...... 21 The NonCode aReNA DB: a non-redundant and integrated collection of non-coding RNAs Introducing Meta2genomics: the search for the “micro-bee” Flavio Licciulli...... 36 José R. Valverde...... 22

page 8 (not for indexing) Keynote Lectures e831 Keynote Lectures EMBnet.journal 21.A

Power and limitations of RNA-Seq: findings from the SEQC (MAQC-III) Consortium

Paweł Piotr Łabaj Chair of Bioinformatics Research Group, Boku University Vienna, Vienna, Austria Łabaj PP (2015) EMBnet.journal 21(Suppl A), e831. http://dx.doi.org/10.14806/ej.21.A.831 We present primary results from the Sequencing if specific filters are used. Comparisons with mi- Quality Control (SEQC) project by US-FDA MAQC croarrays identified complementary strengths, consortium. Here we present a multi-centre with RNA-Seq at sufficient read-depth detecting cross-platform study introducing a landmark differential expression more sensitively, and mi- RNA-Seq reference dataset comprising 30 billion croarrays achieving higher rank-reproducibility. reads. In addition to NGS also several microar- Measurement performance depends on the ray and qPCR platforms were examined. The platform and data analysis pipeline, and varia- study design supports large variety of comple- tion is large for transcript-level profiling. On the mentary benchmark metrics by featuring known other hand, even at read-depths >100 million, mixtures, high-dynamic range ERCC-spikes, as we find thousands of novel junctions, with good well as nested replication structure. With no inde- agreement between platforms, and with qPCR pendent ‘gold standard’ feasible, these built-in validation-rates >80%. We also have shown that truths support an objective assessment of per- the modelling approaches for inferring alterna- formance and are critical for the development tive transcripts expression-levels from read counts and validation of novel or improved algorithms along a gene can be applied to probes along and data processing pipelines. We find that a gene in high-density next-generation micro- measurements of relative expression are accu- arrays. This has advantages in quantitative tran- rate and reproducible across sites and platforms script-resolved expression profiling.

page 10 (not for indexing) EMBnet.journal 21.A Keynote Lectures e825

Population-scale genomics with BiobankCloud and Hops/Hadoop

Jim Dowling KTH - Royal Institute of Technology, Stockholm, Sweden Dowling J (2015) EMBnet.journal 21(Suppl A), e825. http://dx.doi.org/10.14806/ej.21.A.825 Recent advances in the cost, throughput, and of data and it supports a number of Hadoop- speed of Next-Generation Sequencing (NGS) based data analytics platforms for scale-out pro- technology have resulted in huge growth in both cessing of NGS data, such as Cuneiform/HiWAY. the volume and velocity of genomic data avail- BiobankCloud supports multi-tenancy for able for processing. studies with sensitive data. A project-based Large NGS projects are now starting with the model for authentication has been incorporated goal of population-based genomics, that is, the into the Hadoop ecosystem, and studies can analysis of genomes of tens or even hundreds co-exist on the same platform without users ac- of thousands of individuals. Population-based cidentally or maliciously accessing each others genomics needs petabyte-scale data analyt- study data. The multi-tenancy support is based ics platforms. BiobankCloud is an open-source around a user interface that brings together, in framework, based on Hadoop1, that supports a single platform, the owners of NGS data with the scalable and secure storage and analysis of bioinformaticians, the researchers who typically NGS data, alongside traditional Biobank meta- analyze NGS data. data. The platform scales to manage petabytes

1 hadoop.apache.org/

page 11 (not for indexing) e833 Keynote Lectures EMBnet.journal 21.A

Data deluge demands a training tsunami

Eija Korpelainen CSC - IT Center for Science, Espoo, Finland Korpelainen E (2015) EMBnet.journal 21(Suppl A), e833. http://dx.doi.org/10.14806/ej.21.A.833 Modern sequencing technologies open unprec- gravated by the fact that life science research- edented possibilities for life science research. As ers typically have a bio/medical background new sequencing applications or “seqs” become and very little experience in programming and available and the cost is going down, sequenc- statistics. Taken together, the need for training ing has become the measurement technol- in up-to-date data analysis skills is massive, but ogy of choice for more and more researchers. luckily international efforts are already underway However, in order to exploit sequencing to its to improve the situation. The truly global effort in full extent, substantial data analysis skills are re- this regard is GOBLET1, the Global Organisation quired. This is a bottleneck for several reasons: for Bioinformatics Learning, Education & Training. each “seq” requires its own analysis methods, This talk discusses several aspect of training, such methods are changing rapidly as the field hasn’t as who should be trained, what should be taught matured yet, and the sheer volume of the data and how, who should provide the training, and poses technical challenges. The situation is ag- who will train the trainers.

1 mygoblet.org

page 12 (not for indexing) Oral Presentations e808 Oral Presentations EMBnet.journal 21.A

Reproducible Research in the era of Next Generation Sequencing: current approaches, examples and future perspectives Claudia Angelini , Dario Righelli, Francesco Russo Istituto per le Applicazioni del Calcolo “M. Picone”, Napoli, Italy Angelini C et al. (2015) EMBnet.journal 21(Suppl A), e808. http://dx.doi.org/10.14806/ej.21.A.808 Next Generation Sequencing (NGS) has revolu- Bioconductor packages. In particular, we show tionised the way of thinking and of performing how RR can be incorporated within a graphi- biomedical research. It is now possible to inves- cal user-friendly interface and how such tools tigate biological aspects of cell functionalities can automatically generate executable analy- and to understand previously unexplored disease sis reports. Then, we describe how it is possible etiologies by analyzing multi-omic data. Several to speed up repetitive and computational ex- computational (open-source) tools have been pensive function calls by using results stored in developed to analyze NGS data. Despite the a cache memory (the latter point is crucial for technological advances, the way in which data the analysis of “Big Data” as those collected with analysis is performed and described in most of NGS experiments). As a concrete working exam- scientific papers does not facilitate the repro- ple of our approach, we illustrate the advances ducibility of scientific results. Complex analyses we have introduced in RNASeqGUI (Russo and are often poorly described and the lack of the Angelini, 2014). technical information impedes the possibility for References a researcher to reproduce the results available in Nekrutenko A, Taylor J (2012) Next-generation sequencing literature (Nekrutenko and Taylor, 2012). Therefore, data interpretation: enhancing reproducibility and acces- the problem of the “Reproducible Research”, sibility. Nature Reviews Genetics 13, 667-672. http://dx.doi. here-denoted RR, (Stodden et al., 2014) is emerg- org/10.1038/nrg3305 ing as a serious issue for all Life Sciences. Stodden V, Leisch F, Peng RD (2014) Implementing Reproducible Research. In: Chapman & Hall/CRC The R In this work, we first introduce the concept of Series. RR, its main benefits and challenges. Then, we Russo F, Angelini C (2014) RNASeqGUI: A GUI for analyzing discuss a simple way to develop novel compu- RNA-seq data. Bioinformatics 30(17), 2514-2516. http:// tational tools that implement RR by using R and dx.doi.org/10.1093/bioinformatics/btu308

page 14 (not for indexing) EMBnet.journal 21.A Oral Presentations e819

Current status of nanopore sequencing using the MinION device – from full length cDNA sequencing to genome assembly improvements Ola Wallerman Uppsala University and Swedish University of Agricultural Sciences, Uppsala, Sweden Wallerman O (2015) EMBnet.journal 21(Suppl A), e819. http://dx.doi.org/10.14806/ej.21.A.819 The MinION is a small device that may transform tion in a 13 kb transcript. From Illumina RNA-seq the way sequencing is done by taking the se- after knockout of ZBED6 we observe a drastic quencer out of the core labs. It detects the cur- change of IGF2 expression, and also a potential rent through nanopores as single DNA strands shift in promotor usage. As a member of MAP I passes through in order to decode the sequence plan use the MinION to analyse isoforms of these and is capable of generating sequences longer transcripts from full-length cDNA sequencing, but than any other platform currently on the market. given the nature of the reads it has a great po- The MinION is under active development in an tential also to be used for improvement of ge- early-access program (MAP). Like other single nome assemblies. As a first experiment to get a molecule sequencers, the accuracy of one- better understanding of the system and its reads pass reads is lower than that of the amplification- I generated long genomic reads from the hon- based sequencers but through recent improve- eybee Apis Mellifera, which is used as a model ments to chemistry and basecalling it is now organism in the lab and here I will describe these possible to generate long reads with over 90% results and discuss the current status and future alignment accuracy. perspectives of nanopre sequencing in terms of We are studing the transcription factor ZBED6, data quality and throughput. which is a repressor of IGF2 in muscle tissue. It is co-expressed with ZC3H11A through intron reten-

Figure 1. The MinION device is capable of sequencing long unamplified genomic samples with high accuracy. BLAST align- ment of an 8.2 kb read shows that most errors are due to gaps from skipped bases.

page 15 (not for indexing) e816 Oral Presentations EMBnet.journal 21.A

Deep insights into Mecp2-driven transcriptional (de)regulation at embryonic developmental stage through RNA-Seq data analysis Kumar Parijat Tripathi 1 , Maurizio D’Esposito 2, Mario R. Guarracino 1, Marcella Vacca 2 1Genomic, Proteomic and Transcriptomic Laboratory, National Research Council of Italy (CNR), Institute for High- Performance Computing and Networking (ICAR), Napoli, Italy 2Institute of Genetics and Biophysics - ABT, Napoli, Italy Tripathi KP et al. (2015) EMBnet.journal 21(Suppl A), e816. http://dx.doi.org/10.14806/ej.21.A.816 Rett Syndrome (RTT, MIM 312750) is a progressive rons with the dataset from treated versus un- X-linked neurodevelopmental disorder due to treated WT permits to distinguish transcripts influ- mutation of the Mecp2 gene (encoding the tran- enced only by the specific genotype from those scription regulator methyl-CpG binding protein responding to the compound. A subset of latter 2) (Amir et al., 1999). How mutations in Mecp2 category rescues a level of physiological expres- lead to the neuropathological signs of RTT is still sion as consequence of chemical treatment. unknown and there is no cure for this devastat- Using in-house built computational pipeline ing disorder. RTT typically manifests months after “Transcriptator” (automated computation pipe- birth (by 5 weeks in null male mice), arguing that line to annotate assembled reads) (Tripathi et al., key embryonic and perinatal developmental 2014) we annotated the functional as well gene steps take place normally in affected individu- ontological terms associated with this different als. Despite this, we find an altered transcrip- set of transcripts. Specific functional terms re- tome and a significantly lower primary neurotic lated to SMART, PIR superfamily, InterPro domains branching in Mecp2 null cortical neurons disso- and KEGG pathways are enriched in each set of ciated from single embryos (embryonic day 15) transcripts. To understand more clearly the pre- compared to wild type cultures. The pharmaco- cise mechanism of the used drug, we checked logical stimulation of a serotonin receptor nor- for both specific and common functional anno- malizes defective neurite branching and, par- tation terms for the different categories of tran- tially, transcriptional deregulation due to MeCP2 scripts. We also show distribution of Panther path- loss. To understand the biological mechanism ways and Biological process GO terms between behind transcriptional remodeling under the in- transcripts responding to the treatment (rescued fluence of Mecp2 knocking out, we need large- and newly recruited genes). scale study of the transcriptional response of null References cortical neurons before and after treatment with Amir RE, Van den Veyver IB, Wan M, Tran CQ, Francke U et al. serotonin receptor stimulator. RNA-Seq is a new (1999) Rett syndrome is caused by mutations in X-linked tool, which utilises high-throughput sequencing MECP2, encoding methyl-CpG-binding protein 2. Nat to measure RNA transcript counts at an extraor- Genet 23, 185-188. dinary accuracy. It provides quantitative means Trapnell C, Roberts A, Goff L, Pertea G, Kim D et al. (2012) Differential gene and transcript expression analy- of exploring the transcriptome of an organism of sis of RNA-seq experiments with TopHat and Cufflinks. interest.Total RNA was extracted from wild type Nature Protocols 7, 562–578. http://dx.doi.org/10.1038/ and Mecp2 null cortical neurons, sequenced nprot.2012.016 and analysed with a gene specific approach Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R et al. (2013) TopHat2: accurate alignment of transcriptomes in the pres- using Tuxedo pipeline, i.e., Tophat2 (Kim et al., ence of insertions, deletions and gene fusions. Genome 2013) and cufflinks package (Trapnell et al., Biology 14:R36. http://dx.doi.org/10.1186/gb-2013-14-4-r36 2012). The intersection of differentially expressed Tripathi KP et al. (2014) Transcriptator: computational pipeline genes dataset from untreated Mecp2 null neu- to annotate transcripts and assembled reads from RNA- seq data. Lecture Notes in Bioinformatics, In Press.

page 16 (not for indexing) EMBnet.journal 21.A Oral Presentations e811

Exploring the activity of microorganisms in the forest soil using metatranscriptomics Petr Baldrian Institute of Microbiology of the ASCR, Prague, Czech Republic Baldrian P (2015) EMBnet.journal 21(Suppl A), e811. http://dx.doi.org/10.14806/ej.21.A.811 Understanding the ecology of forest soils is very changes in taxon-specific microbial transcription important because they belong to the larg- profiles. These differences were more significant est carbon sinks globally. Metatranscriptomics in soil than in litter. Most importantly, fungal con- seems to be perfectly suitable for the explora- tribution to total microbial transcription in soil de- tion of microbial involvement in soil processes creased from 33% in summer to 16% in winter. but until recently it was technically difficult to use In particular, the activity of the ectomycorrhizal it successfully for these highly complex environ- fungi that quantitatively dominate this environ- ments. Here we used metatranscriptomics com- ment was reduced in winter. The results indicate bined with microbial community analysis to de- that plant photosynthetic production was likely scribe the roles of individual microbial taxa in the the major driver of changes in the functioning of coniferous forest soil in two contrasting seasons. microbial communities in the studied ecosystem Both the microbial community composition and across seasons. Technically, the annotation of the pool of microbial transcripts were found to functions and taxonomic affiliations is relatively be highly diverse and characterised by the high precise for bacteria but less reliable for the fungi abundance and activity of fungi. The differences and archaea, due to the lower number of fully in seasonal functioning of the ecosystem con- sequenced and annotated genomes. These sisted of a combination of moderate changes in limitations will hopefully decrease in the future microbial community composition and profound along with the advances in microbial genomics.

page 17 (not for indexing) e830 Oral Presentations EMBnet.journal 21.A

Creating a successful facility for large-scale extraction of DNA, an example from the Swedish Biobank initiative BBMRI.se James Thompson Karolinska Institutet Biobank, Stockholm, Sweden Thompson J (2015) EMBnet.journal 21(Suppl A), e830. http://dx.doi.org/10.14806/ej.21.A.830

Driven by the needs of several large popula- have extracted DNA from over 100 000 individu- tion cohorts, the Biobanking and BioMolecular als. Our early experience clearly demonstrated Resource Infrastructure of Sweden (BBMRI.se) has the potential of the new systems in speed and established, and now operates, a large-scale cost. But we also experienced several difficult facility for the rapid purification of high quality challenges with the approach. Genotyping and DNA from human whole blood at the KI Biobank. NGS platforms are very sensitive to DNA qual- The facility has a throughput of up to 1000 ity parameters. After gaining significant process whole blood samples per day on two parallel improvements we now see study designs that automated systems. Automation and modern use genotyping and NGS delivering results that extraction chemistry has given many benefits, promise to transform healthcare. and since starting the operation in May 2011, we

page 18 (not for indexing) EMBnet.journal 21.A Oral Presentations e827

Using neural networks to filter predicted errors in NGS data Milko Krachunov 1, Ognyan Kulev 1, Maria Nisheva 1, Valeria Simeonova 1, Deyan Peychev 2, Dimitar Vassilev 2  1Faculty of Mathematics and Informatics, Sofia University “St. Kliment Ohridski”, Sofia, Bulgaria 2AgroBioInstitute, Bioinformatics group, Sofia, Bulgaria Krachunov M et al. (2015) EMBnet.journal 21(Suppl A), e827. http://dx.doi.org/10.14806/ej.21.A.827 The amount of sequencing errors produced by of false positives. Due to the classification accu- NGS technologies is low, but not negligible. Some racy of the network, there is a net decrease in studies, such as single nucleotide polymorphism the number of false positives without a significant (SNP) calling in metagenomics, are very sensi- increase in the number of false negatives. The tive to any noise present in the sequencing data, combination significantly increases the accura- and would greatly benefit from precise error de- cy of the sub-optimal approach. A 46-61% de- tection techniques to discover incorrect bases, crease in the number of predicted errors, all from without flagging the real variation in the data as incorrectly identified errors, is observed when the erroneous. There are still no general solutions to neural network is applied over a set predicted the error detection problem that can be applied with frequencies and thresholds. to studies like these. Acknowledgements As part of this work, a neural network is trained The study is supported by the National Science to recognise errors with a high accuracy in a Fund of Bulgaria within the “Methods for Data dataset containing high natural variation. The Analysis and Knowledge Discovery in Big neural network is then applied to a set of possible Sequencing Datasets” project under contract errors that were predicted with a sub-optimal de- DFNI-I02/7 of 12.12.2014. tection algorithm that produces a high amount

page 19 (not for indexing) e824 Oral Presentations EMBnet.journal 21.A

DIANA-TarBase v7: indexing hundreds of thousands experimentally supported miRNA:mRNA interactions Dimitra Karagkouni 1 , Ioannis S. Vlachos 1, Maria D. Paraskevopoulou 1, Georgios Georgakilas 1, Thanasis Vergoulis 2, Theodore Dalamagas 2, Artemis G. Hatzigeorgiou 1 1DIANA-Lab, Department of Electrical &Computer Engineering, University of Thessaly, Thessaly, Greece 2‘Athena’ Research and Innovation Center, Athenas, Greece Karagkouni D et al. (2015) EMBnet.journal 21(Suppl A), e824. http://dx.doi.org/10.14806/ej.21.A.824 microRNAs are short non-coding RNAs which experimental conditions including cell/tissue act as potent post-transcriptional regulators. type and treatment. The new interface provides Accurate identification and cataloging of miRNA also advanced information ranging from the targets is crucial to understanding their function. binding site location, as identified experimentally Currently, hundreds of thousands of miRNA:gene as well as in silico, to the primer sequences used interactions have been experimentally identified. for cloning experiments. Numerous wet lab methodologies have been DIANA-TarBase v7.0 is the first relevant data- developed, enabling the validation of predict- base which breaks the barrier of hundreds of ed miRNA interactions or the high-throughput thousands entries by indexing more than half a screening and identification of novel miRNA tar- million interactions in 24 species, 9–250 times gets. However, this wealth of information is frag- more than any other manually curated data- mented and hidden in thousands of manuscripts base. This wealth of information can be utilised and raw next generation sequencing data sets. for exploratory studies and can significantly DIANA-TarBase v7.01 aims to provide for the boost the understanding of miRNA:mRNA col- first time hundreds of thousands of high-quali- laboration. ty manually curated experimentally validated Acknowledgements miRNA:gene interactions. A text-mining pipeline This research has been co-financed by the has been implemented for the identification European Social Fund – ESF and National of all the advanced in experimental method- Resources, Greek national funds through the ologies articles which have been subjected to “Operational Program Education and Lifelong manual curation. Learning” of the National Strategic Reference DIANA-TarBase v7.0 has been significantly ex- Framework (NSRF) - Research Funding Program: tended with richer meta-data and detailed infor- Thales. mation for each interaction, while the interface now supports advanced real-time queering and References result filtering. The database enables users to Vlachos IS, Paraskevopoulou MD, Karagkouni D, Georgakilas G, Vergoulis T et al. (2015) DIANA-TarBase v7.0: index- easily identify positive or negative experimental ing more than half a million experimentally supported results, the utilised experimental methodology, miRNA:mRNA interactions. Nucleic Acids Res. 43 (D1), D153-D159. http://dx.doi.org/10.1093/nar/gku1215

1 www.microrna.gr/tarbase

page 20 (not for indexing) EMBnet.journal 21.A Oral Presentations e815

Biobanking for the future, how to prepare for the next generation of Next Generation Sequencing Tomas Klingström Swedish University of Agricultural Sciences, Uppsala, Sweden Klingström T (2015) EMBnet.journal 21(Suppl A), e815. http://dx.doi.org/10.14806/ej.21.A.815 In biobanking the quality of samples is defined To properly support next generation se- depending on their usage. A piece of formalin- quencing it is therefore especially important that fixed, paraffin-embedded tissue may therefore biobanks consider the following challenges: be considered of very high quality when used for • technical challenges - DNA is often regarded immune histochemistry, despite heavy crosslink- as stable and easy to work with. But NGS with ing and fragmentation of nucleic acids. This, long read lengths, epigenetic studies and combined with ethical and legal questions on in- single cell sequencing are all areas where formed consent, makes it important for biobanks current or upcoming technologies will great- to prepare their collections for future technology. ly challenge the quality management of The technology watch of the BBMRI-LPC 2015 biobanks; will focus on the pre-analytical, ethical and data • ethical challenges - every sample may be a management issues that may prevent biobanks genetic sample if the research is innovative from providing samples that fully take advantage and meticulous enough. This greatly increas- of the next generation of sequencing technol- es the requirements on how biobanks handle ogy. Biobanks are a long term commitment, but donor consent and privacy when collecting by preparing for the next generation of technol- biospecimen; ogy, biobanks provide researchers with immedi- • data management - despite huge cost re- ate access to material for testing scientific theo- duction, NGS remains the costliest step in the ries that would otherwise take years or decades process. Data access to ensure reuse and re- to evaluate. peatability is therefore important.

page 21 (not for indexing) e835 Oral Presentations EMBnet.journal 21.A

Introducing Meta2genomics: the search for the “micro-bee” José R. Valverde CSIC, Centro Nacional de Biotecnología, Madrid, Spain Valverde JR (2015) EMBnet.journal 21(Suppl A), e835. http://dx.doi.org/10.14806/ej.21.A.835 We will describe what may be called meta- proval or renewal (or its deny thereof) of the use of metagenomics (or Meta2genomics), a novel ap- herbicidal preparations in the EU and elsewhere. proach to mine metagenomic data and how An orthornormal approach is used in the bi- we apply it to search for microbial sentinels of ological security field. Here, the problem is not contamination, the microbial equivalent of bees knowing and preventing the effect of specific (or “micro-bee”). substances, but preventing and detecting the re- One of the main challenges of Next lease of biological or toxic agents, and reacting Generation Sequencing (NGS) is its pervasive use accordingly to minimize their impact on humans for afterthought research: mining of existing ex- and the environment at large. We are interested periments developed with a specific aim for the in the development of cost-effective advisory pursuit of novel, unrelated quests that were not policies that may help professionals in the de- foreseen in advance. tection of potentially harmful spills, specially in The last years have seen a tremendous in- developing countries. crease in the availability of metagenomic data Here we describe our quest for a sensitive test collected from soil samples under a variety of that may help detect contamination chang- soil textures, locations, circumstances and treat- es in the soil through the identification of the ments. All these experiments were carried out “micro-bee”, a soil microorganism taxon that with specific aims in mind: understand the vari- may act as a sentinel, developing in the pro- ability of the rhizobacterial community under cess what can be called meta-metagenomics specific environmental conditions. (or Meta2genomics): mining many unrelated In parallel with these advances, and partly metagenomics experiments to identify a com- thanks to some of these studies, society has mon trend of interest. In effect, this requires to become increasingly aware of the impact that select multiple soil metagenomics experiments human activities have on the environment, and from the Sequence Read Archive, re-analyse is trying to adapt policies and operation proce- each of them towards our ends, and compare dures to better exploit this newly acquired knowl- them for common trends that meet our end cri- edge. As an example, studies of the impact of teria, developing new planning, analysis, com- herbicide treatments on soil rhizobacteria has parison and statistical technologies in the pro- been one of the factors considered in the ap- cess.

page 22 (not for indexing) EMBnet.journal 21.A Oral Presentations e828

Information-theoretic approach for detection of differential splicing from RNA-seq data Axel Rasche, Ralf Herwig  Max-Planck-Institute for Molecular Genetics, Dep. Computational Molecular Biology, Berlin, Germany Rasche A and Herwig R (2015) EMBnet.journal 21(Suppl A), e828. http://dx.doi.org/10.14806/ej.21.A.828 The computational prediction of alternative date our workflow we challenged it with publicly splicing from high-throughput sequencing data available sequencing data derived from human is inherently difficult and necessitates robust sta- tissues, and conducted a comparison with eight tistical measures because the differential splic- alternative computational methods. In order to ing signal is overlaid by influencing factors such judge the performance of the different methods as gene expression differences and simultane- we constructed a benchmark data set of true ous expression of multiple isoforms, among oth- positive splicing events across different tissues, ers. In this work we describe ARH-seq (Rasche et agglomerated from public databases, and al., 2014), a discovery tool for differential splicing show that ARH-seq is an accurate, computation- in case-control studies, that is based on the in- ally fast and high-performing method for detect- formation-theoretic concept of entropy. ARH-seq ing differential splicing events. works on high-throughput sequencing data and References is an extension of the ARH method that was origi- Rasche A, Herwig R (2010) ARH: predicting splice vari- nally developed for exon microarrays (Rasche ants from genome-wide data with modified entropy. and Herwig, 2010). We show that the method Bioinformatics 26(1), 84-90. http://dx.doi.org/10.1093/bioin- has inherent features, such as independence of formatics/btp626 transcript exon number and independence of Rasche A, Lienhard M, Yaspo ML, Lehrach H, Herwig R (2014) ARH-seq: identification of differential splicing in RNA- differential expression, what makes it particularly seq data. Nucleic Acids Res 42(14), e110. http://dx.doi. suited for detecting alternative splicing events org/10.1093/nar/gku495 from sequencing data. In order to test and vali-

page 23 (not for indexing) e810 Oral Presentations EMBnet.journal 21.A iMir: an innovative and complete pipeline for smallRNA-Seq data analysis Giorgio Giurato 1, Antonio Rinaldi 1, Adnan Hashim 1, Giovanni Nassa 1, Maria Ravo 1, Francesca Rizzo 1, Roberta Tarallo 1, Angela Cordella 2, Giovanna Marchese 2, Domenico Memoli 1, Alessandro Weisz 1  1Department of Medicine and Surgery, Laboratory of Molecular Medicine and Genomics, University of Salerno, Baronissi, Italy 2Genomix4Life Srl, Spin-Off of the University of Salerno, Baronissi, Italy Giurato G et al. (2015) EMBnet.journal 21(Suppl A), e810. http://dx.doi.org/10.14806/ej.21.A.810

Next-generation sequencing allows researchers recent high-performance next generation se- to gage the depth and variation of small non- quencers. This pipeline allowed us to investigate coding RNA populations, comprising miRNAs, piRNA expression patterns in rat liver and their piRNAs, tRNAs and other regulatory small tran- modulation during regenerative proliferation scripts. The accurate analysis of smallRNA-Seq (Rizzo et al., 2014) and to identify >100 human data remain a non-trivial computational prob- piRNAs in breast cancer, some of which showing lem, requiring implementation of multiple statis- significant differences in expression in mammary tical and bioinformatics tools. Here we present epithelial compared to cancer cells or in normal iMir (Giurato et al., 2013), a modular pipeline for respect to cancerous mammary tissues (Hashim comprehensive analysis of smallRNA-Seq data, et al., 2014), and in endometrial hyperplasia and comprising specific tools for adapter trimming, cancer (Ravo et al., 2015). quality filtering, differential expression analysis, Acknowledgements biological target prediction and other useful op- Work supported by: Italian Ministry of Health tions by integrating multiple open source mod- (Grant Young Researcher GR-2011–02350476 to ules and resources in an automated workflow M.R.), Italian Ministry for Education, University and (Figure 1). Research (Grants PRIN 2010LC747T to A.W. and FIRB RBFR12W5V5 _ 003 to R.T.), Italian Association for Cancer Research (Grants IG 13176 to A.W.), National Research Council Flagship Project Interomics. G.N. is supported by a ‘Mario e Valeria Rindi’ fellowship of the Italian Foundation for Cancer Research, A.R. is a PhD student of the Research Doctorate ‘Molecular and Translational Oncology and Innovative Medical-Surgical Technologies’, University of Catanzaro ‘Magna Graecia’. References Giurato G, De Filippo MR, Rinaldi A, Hashim A, Nassa G, et al. (2013) iMir: an integrated pipeline for high-throughput analysis of small non-coding RNA data obtained by smallRNA-Seq. BMC Bioinformatics 14:362. http://dx.doi. org/10.1186/1471-2105-14-362 Hashim A, Rizzo F, Marchese G, Ravo M, Tarallo R, et al. (2014) Figure 1. The pipeline accepts NGS data as input and then RNA sequencing identifies specific PIWI-interacting small proceeds automatically to perform several independent non-coding RNA expression patterns in breast cancer. analysis, most of them can be selected or excluded ac- Oncotarget 5(20), 9901-9910. cording to the user’s needs. Ravo M, Cordella A, Rinaldi A, Bruno G, Alexandrova E et al. (2015) Small non coding RNA deregulation in endometrial carcinogenesis. Oncotarget (Epub ahead of print). iMir is based on reliable, flexible and fully Rizzo F, Hashim A, Marchese G, Ravo M, Tarallo R et al. (2014) automated workflow, allowing to rapidly and Timed regulation of P-element-induced wimpy testis- efficiently analyze high-throughput smallRNA- interacting RNA expression during rat liver regeneration. Seq data, such as those produced by the most Hepatology 60(3), 798-806. http://dx.doi.org/10.1002/ hep.27267

page 24 (not for indexing) Posters e806 Posters EMBnet.journal 21.A

Microbial analysis of ovine cheese by next generation sequencing Vladimir Kmet , Dobroslava Bujnakova Institute of Animal Physiology, Slovak Academy of Sciences, Košice, Slovakia Kmet V and Bujnakova D (2015) EMBnet.journal 21(Suppl A), e806. http://dx.doi.org/10.14806/ej.21.A.806 Ovine cheese is a good source for isolation of al., 2014). There were identified probiotic pro- wild lactobacilli-potential starter cultures. This teins as an alpha amylase (PF00128), peptidase study aimed to analyse microbiota of the ovine (PF01433), catalase (PF00199), heat shock protein cheese (Slovak Bryndza) and to investigate the 33 (PF01430). Nevertheless, there was discrep- presence of Lactobacillus antibiotic resistance, ancy between Lb. plantarum ampicillin sensitiv- virulence or probiotic genes by pyrosequencing. ity and the presence of serine beta-lactamase The V1-V3 regions of the 16S ribosomal DNA like superfamily (PF00144). No virulence factors were amplifed from different ovine cheeses us- were detected. Results indicated new properties ing PCR. In all samples, the microbial popula- of lactobacilli, which were not occurred by phe- tions consisted of Lactobacillus helveticus, Lb. notyping testing. acidophilus, Lb. plantarum, Lb. rhamnosus, Lb. Acknowledgements brevis; Lactococcus lactis, L. raffinolactis, L. This work was supported by Slovak grant VEGA garviae; Enterococcus italicus and E. cameliae; No. 2/0014/13. Streptococcus salivarius, St. thermophilus, St. ca- balli and St. ferus. References Furthermore, the genomes of selected Lb. D’Auria G, Džunkova M, Moya A, Tomaška M, Kološta M, Kmet V (2014). Genome Sequence of Lactobacillus plantarum plantarum, Lb. brevis, Lb. paracasei were pyrose- 19L3, a Strain Proposed as a Starter Culture for Slovenska quenced. The assembly of L. plantarum resulted Bryndza Ovine Cheese. Genome Announc. 2(2):e00292- in 203 contigs longer than 1,000 bp (D’Auria et 14. http://dx.doi.org/10.1128/genomeA.00292-14

page 26 (not for indexing) EMBnet.journal 21.A Posters e807

PACMAN: PacBio Methylation Analyzer Laurent Falquet 1,2 , Alexis Loetscher 1 1University of Fribourg, Fribourg, Switzerland 2Swiss Institute of Bioinformatics, Lausanne, Switzerland Falquet L and Loetscher A (2015) EMBnet.journal 21(Suppl A), e807. http://dx.doi.org/10.14806/ej.21.A.807 Pacific Biosciences sequencing allows for the motifs file in GFF format (from the SMRT pipeline). simultaneous detection of DNA methylation in The counts of each motif are calculated accord- particular m6A and m4C (Flusberg et al., 2010). ing to a customisable sliding window and nor- The motifs around the methylation sites are malised by their expected frequency. PACMAN provided by the SMRT pipeline1 in GFF format. generates a publication quality image of the However the visualization of these motifs methyl- selected methylated motifs counts, locations ated or non-methylated, for example on a bac- and non-methylated locations, on one or both terial genome, is not part of the pipeline. Circos strands of the DNA. PACMAN is available on this is a software package for producing publication web site: http://www.unifr.ch/bugfri/pacman. quality images of large scale data(Krzywinski References et al., 2009), however mastering the numerous Flusberg BA, Webster DR, Lee JH, Travers KJ, Olivares EC configuration files needed, requires extra skills et al. (2010) Direct detection of DNA methylation during not easy to acquire for biologists. We developed single-molecule, real-time sequencing. Nat. Methods 7, a tool written in Perl and an associated web site 461–465. http://dx.doi.org/10.1038/nmeth.1459 called PACMAN (PacBio Methylation Analyzer) al- Krzywinski M, Schein J, Birol I, Connors J, Gascoyne R et al. (2009) Circos: An information aesthetic for comparative lowing users to easily create images with Circos genomics. Genome Res. 19, 1639–1645. http://dx.doi. in a user-friendly interface. The required files are org/10.1101/gr.092759.109 1) the genome or draft in FASTA format and 2) the

1 www.pacb.com/devnet/

page 27 (not for indexing) e813 Posters EMBnet.journal 21.A

NORM-SYS - harmonizing standardization processes for model and data exchange in systems biology Babette Regierer 1, Susanne Hollmann 2, Martin Golebiewski 3  1LifeGlimmer GmbH, Berlin, Germany 2University of Potsdam, Potsdam, Germany 3Heidelberg Institute for Theoretical Studies (HITS), Heidelberg, Germany Regierer B et al. (2015) EMBnet.journal 21(Suppl A), e813. http://dx.doi.org/10.14806/ej.21.A.813 The rapid development of modern life sci- their long-standing professional experience in ence technologies, such as Next Generation the formal standardization process; or represent- Sequencing (NGS) techniques, allows the genera- atives of scientific journals and research funding tion of biological data with increasing speed and agencies. precision. All these data have to be accessed, NORM-SYS1 (Normalization and standardiza- processed, integrated, shared, analysed and tion for the exchange of models and data in compared. Hence, standards for data, resulting systems biology research) is a new project that computer models and applied workflows have started in October 2014 aiming at enhancing become a critical issue specifically in distributed and promoting the formal normalisation of ex- and interdisciplinary approaches like systems bi- isting community standards for computational ology. Different stakeholder groups need to be modelling in systems biology in close collabora- engaged in the standardisation activities to allow tion with relevant stakeholder groups and grass- an efficient and fast process of standardisation root standardisation initiatives. One major goal and adoption of the developed standards: re- of NORM-SYS is to develop a concept for the searchers from both, academia and industries, transformation of existing standards into certified with their existing grass-root standardisation com- specifications or norms to achieve a more effec- munities; as well as the official standardisation tive transfer of systems biology research results bodies (e.g. DIN in Germany, CEN/CENELEC at into applications. European level, or ISO at international level) with

1 normsys.de/

page 28 (not for indexing) EMBnet.journal 21.A Posters e814

Targeted amplicon sequencing in genetic diagnostics of patients with cystic fibrosis Jelena Kusic-Tisma , Aleksandra Divac Rankov, Mila Ljujic, Nadja Pejanovic, Bojana Stanic, Dragica Radojkovic Institute of Molecular Genetics and Genetic Engineering, Belgrade, Serbia Kusic-Tisma J et al (2015) EMBnet.journal 21(Suppl A), e814. http://dx.doi.org/10.14806/ej.21.A.814 Cystic fibrosis (CF), one of the most com- a MiSeq instrument (Illumina). Sequencing data mon autosomal recessive genetic disorders in were evaluated using the software Sequence Caucasians, is caused by mutations in the gene Pilot (JSI Medical Systems). Identification of dis- encoding the CF transmembrane conductance ease-relevant CFTR variants was assessed based regulator (CFTR). Almost 2000 variants in CFTR on the CFTR database1 and CFTR2 website2. gene with variable clinical significance have The NGS sequencing data correctly con- been identified so far. firmed 18 germline variants previously detected In this study we performed targeted re- in our laboratory. Four out of 16 additionally iden- sequencing of CFTR gene in 24 CF patients in tified variants were classified as disease-causing which one or no mutations were identified af- mutations according to the literature data, thus ter analysis of seven most common variants enabling us confirmation of clinical CF diag- (c.1521 _ 1523delCTT, c.489+1G>T, c.1624G>T, nosis in three patients. The NGS technology in c.1652G>A, c.1657C>T, c.1585-1G>A, combination with a well-characterised clinically c.3909C>G) or after sequencing of the whole relevant genomic variation database is a good coding sequence of CFTR gene. Library pool alternative for a time consuming stepwise testing was generated using multiplex PCR amplification of genes with large allelic heterogeneity such as (MASTR, Multiplicom) followed by sequencing on CFTR.

1 www.genet.sickkids.on.ca/app 2 www.cftr2.org/

page 29 (not for indexing) e817 Posters EMBnet.journal 21.A

Reference leaftTranscriptomes for potato cultivars: Desiree and PW363 Maja Zagorščak 1 , Marko Petek 1, Mohamed Zouine 2, Kristina Gruden 1 1National Institute of Biology - Department of Biotechnology and Systems Biology, Ljubljana, Slovenia 2Ecole Nationale Supérieure Agronomique de Toulouse, Toulouse, France Zagorščak M et al. (2015) EMBnet.journal 21(Suppl A), e817. http://dx.doi.org/10.14806/ej.21.A.817 The rapid development of modern life sci- available from CLC Genomics Workbench 7.0.31. ence technologies, such as Next Generation De novo assembly was performed using Trinity Sequencing (NGS) techniques, allows the gen- (Grabherr et al., 2 011). eration of biological data with increasing speed The resulting two sets of preliminary transcrip- and precision. Most potato cultivars are highly tomes were then merged using the CD-HIT clus- heterozygous tetraploids with high genetic vari- tering algorithm (Limin et al., 2012) and merged ability while being susceptible to pathogens, with existing gene models with BLAST against pests and inbreeding depression. To bypass stNIB-v1. The presumed novel clusters were an- polyploidy related sequencing problems, Potato notated using SwissProt, PGSC DM v3.4 super- Genome Sequencing Consortium (PGSC, 2011) scaffolds and NCBI-nt databases. Initial potato sequenced a double monoploid derived from S. pangenome containing 35609 genes was ex- tuberosum group Phureja. panded with 24999 potential new transcripts, In order to avoid problems, such as discrimi- and will serve to further expand knowledge on nating between paralogous genes, divergence the potato pathogen interactions. and expression bias between the reference ge- References nome and potato cultivars, and to identify traits Limin F, Beifang N, Zhengwei Z, Sitao W, Weizhong L (2012) that are not present in initially sequenced gen- CD-HIT: accelerated for clustering the next generation se- otype, RNA-sequencing for cv. Desiree and cv. quencing data. Bioinformatics 28(23), 3150-3152. http:// PW363 leaves was conducted on Illumina NGS dx.doi.org/10.1093/bioinformatics/bts565 platform. In house generated raw reads were Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, et al. (2011) Full-length transcriptome assembly from RNA-seq complemented with data already deposited in data without a reference genome. Nature Biotechnology the NCBI database and stNIB-v1 S. tuberosum 29(7), 644-652. http://dx.doi.org/10.1038/nbt.1883 gene groups, which included two genome as- Ramšak Ž, Baebler Š, Rotter A, Korbar M, Mozetič I, et al. semblies and two EST datasets (Ramšak et al., (2014) GoMapMan: integration, consolidation and visuali- zation of plant gene annotations within the MapMan on- 2014). The preliminary transcriptomes were as- tology. Nucleic Acids Research 42(D1), D1167-D1175. http:// sembled using both hybrid and de novo assem- dx.doi.org/10.1093/nar/gkt1056 bly approaches. The hybrid approach, com- PGSC: The Potato Genome Sequencing Consortium (2011) bining genome-guided and de novo RNA-Seq Genome sequence and analysis of the tuber crop potato. assembly, was implemented using the pipeline Nature 475, 189-195. http://dx.doi.org/10.1038/nature10158

1 www.clcbio.com

page 30 (not for indexing) EMBnet.journal 21.A Posters e818

NGS as a tool for investigating spread of zoonotic bacteria: dealing with source and outbreak genetic variation in a spatio-temporal context Robert Söderlund 1,2 , Tomas Jinnerot 2, Adrien Janssens 1, Erik Bongcam-Rudloff 1 1Swedish University of Agricultural Sciences, Uppsala, Sweden 2National Veterinary Institute, Sweden Söderlund R et al. (2015) EMBnet.journal 21(Suppl A), e818. http://dx.doi.org/10.14806/ej.21.A.818

Next generation sequencing (NGS) is increasingly We are currently conducting a series of stud- the method of choice for tracing transmission of ies to better define the source of genetic varia- zoonotic bacterial disease agents; infections that tion found in bacterial populations in animal res- are spread to humans from animals. Establishing ervoirs and animal feed, as well as the variation the link to a source of infection is dependent on found between bacterial isolates collected from an implicit or explicit definition of a source popu- a single sampling occasion. We are also study- lation from which the outbreak strain originates. ing the spatial distribution of endemic clones in The link is also based on assumptions about the regions with dense livestock populations, and the relevance of a sampling occasion for the time clonal dynamics in such regions. Our focus is on the infection event took place. enterohaemorrhagic E. coli (EHEC) and selected While these problems are the same regard- serotypes of Salmonella. The results are providing less of the method used, they are magnified by us with insights that will improve future outbreak the high resolution of NGS-based typing. In par- investigations by guiding the sampling and anal- ticular, exact genotypic matches are rare lead- ysis strategy as well as the interpretation of data. ing to a much greater challenge in comparing Acknowledgements genetic distances, drawing conclusions from This work was financed by the Swedish Board them and communicating the quantitative re- of Agriculture and the Elsa and Ivar Sandberg sults to a non-specialist audience. Foundation.

page 31 (not for indexing) e822 Posters EMBnet.journal 21.A

Biobanks and future emerging technologies: new approaches, new pre- analytical challenges Eva Ortega-Paino 1 , Tomas Klingström 2, Johanna Ekström 1 1BBMRI.se Service Center for Southern Sweden, Medicon Village (406), Lund University, Lund, Sweden 2SLU Global Bioinformatics Centre, Uppsala University, Uppsala, Sweden Ortega-Paino E et al. (2015) EMBnet.journal 21(Suppl A), e822. http://dx.doi.org/10.14806/ej.21.A.822

Establishing a Biobank is a major long-term com- tions where proteomics and nucleic acid based mitment and samples collected today must be assays are combined. But to achieve this it is relevant for next generation techniques far into necessary to create biobank cohorts with suffi- the future. cient collection, quality and storage conditions. It is therefore crucial for biobanks to be “fu- Acknowledgements ture compatible” and rely on sampling and Funding for this work has been provided by storage methods to maximize the future value BBMRI.se. of the collections. In it’s most basic form such future compatibility may be limited to careful References sample management such as a strict adher- International Society for Biological and Environmental Repositories (2012) 2012 Best Practices for Repositories: ence to quality management and best prac- Collection, Storage, Retrieval, and Distribution of Biological tices as laid down by authoritative sources such Materials for Research. Biopreservation and Biobanking as ISBER (International Society for Biological and 10(2), 81-161. http://dx.doi.org/10.1089/bio.2012.1022 Environmental Repositories, 2012) guidelines and Malentacchi F, Pazzagli M, Simi L, Orlando C. et al. (2014) BRISQ (Moore et al., 2011; 2012; 2013), among SPIDIA-RNA: second external quality assessment for the pre-analytical phase of blood samples used for RNA others. based analyses. PLoS One 9(11):e112293. http://dx.doi. This approach will be sufficient, if adhered to, org/10.1371/journal.pone.0112293 for most current generation sequencing tech- Malentacchi F, Pazzagli M, Simi L, Orlando C, Wyrich R et niques where read-lengths are limited to 150- al. (2013) SPIDIA-DNA: an External Quality Assessment for 500 bp and if biobank samples intended for se- the pre-analytical phase of blood samples used for DNA- based analyses. Clin Chim Acta. 424, 274-286. http:// quencing carry an abundance of RNA or DNA. dx.doi.org/10.1016/j.cca.2013.05.012 But single cell sequencing of circulating tumor Moore HM, Kelly A, Jewel SD, McShane LM et al. (2011) cells, DNA methylation analysis and proteog- Biospecimen reporting for improved study quality (BRISQ). enomic studies, where DNA, RNA and protein J Proteome Res. 10(8), 3429-38. http://dx.doi.org/10.1021/ pr200021n from the same sample is analysed (Nesvizhskii, Moore HM, Kelly A, McShane LM, Vaught J (2012) Biospecimen 2014), require new standards for sample collec- reporting for improved study quality (BRISQ). Clin Chim tion and storage. Acta. 413(15-16), 1305. http://dx.doi.org/10.1016/j.cca.2012 Therefore, and looking at the future, it would .04.013 be very desirable to run studies similar to SPIDIA- Moore HM, Kelly A, McShane LM, Vaught J (2013) Biospecimen reporting for improved study quality (BRISQ). Transfusion RNA (Malentacchi et al., 2014) and SPIDIA-DNA 53(7):e1. ht t p://d x .d o i.o rg /10.1111/ t r f.12281 (Malentacchi, 2013) to ensure that samples can Nesvizhskii AI (2014) Proteogenomics: concepts, applica- be used for proteogenomics and other applica- tions and computational strategies. Nature Methods 11(11) , 1114 -1125. http://dx.doi.org/10.1038/nmeth.3144

page 32 (not for indexing) EMBnet.journal 21.A Posters e823

MetLab: an in silico experimental design, simulation and validation tool for viral metagenomics studies Martin Norling 1, Oskar E. Karlsson 2, Hadrien Gourlé 2, Erik Bongcam-Rudloff 2, Juliette Hayer 2  1Bioinformatics Infrastructure for Life Sciences, Uppsala, Sweden 2SLU Global Bioinformatics Center, Department of Animal Breeding and Genetics (HGEN), SLU, Uppsala, Sweden Norling M et al. (2015) EMBnet.journal 21(Suppl A), e823. http://dx.doi.org/10.14806/ej.21.A.823 The use of metagenomics for detection and gists within veterinary medicine with an easy to characterisation of viruses has several difficult use tool for designing, simulating and analyzing steps, including high throughput sequencing viral metagenomes. The results presented here and bioinformatics analysis. We present MetLab, include a comprehensive benchmark towards a new program aimed at providing scientists with other suitable software, with emphasis on detec- a simple baseline for experimental design and tion of viruses as well as speed of applications. analysis of viral metagenomics. MetLab is packaged as one comprehensive MetLab aims to provide support in planning software package, readily available for Linux the sequencing, by estimating coverage by and OSX users. implementing an adaptation of Stevens’ theo- Acknowledgements rem (Wendl et al. 2013). It also provides scientists Funding: Formas project number 2012-586. with several pipelines aimed at simplifying the analysis of viral metagenomes, including qual- References ity control, assembly and taxonomic binning. We Wendl MC, Kota K, Weinstock GM and Mitreva M (2013) Coverage theories for metagenomic DNA sequencing also implement a tool for simulating metagen- based on a generalization of Stevens’ theorem. J. Math. omics datasets from a number of sequencing Biol. 67(5) , 1141–1161. http://dx.doi.org/10.1007/s00285-012- platforms. The overall aim is to provide virolo- 0586-x

page 33 (not for indexing) e826 Posters EMBnet.journal 21.A

Towards SNP calling in polyploid genomes Anatoliy Dimitrov 1 , Milko Krachunov 1, Ognyan Kulev 1, Jérôme Salse 2, Irena Avdjieva 3, Dimitar Vassilev 3 1Faculty of Mathematics and Informatics, Sofia University “St. Kliment Ohridski”, Sofia, Bulgaria 2INRA-UBP GDEC, Clermont-Ferrand, France 3AgroBioInstitute, Bioinformatics group, Sofia, Bulgaria Dimitrov et al. (2015) EMBnet.journal 21(Suppl A), e826. http://dx.doi.org/10.14806/ej.21.A.826 Polyploid genomes such as hexaploid wheat on the amount of observed variation between pose significant study challenges. The detection the pairs of aligned reads. A secondary filter, rely- of SNPs is one of the problems that are difficult ing on machine learning, is also proposed. This to solve. The existence of multiple sub-genomes is expected to lead to an increase in the num- introduces a significant amount of natural noise ber of correctly predicted SNPs that would have for the SNP calling algorithms, and the expect- been otherwise incorrectly identified as errors. ed occurrence rates of such SNPs is significantly A 56% reduction in the predicted errors, and a lower than in diploid genomes. This makes them 66% increase in the number of predicted SNPs more difficult to distinguish from errors. are observed. Using the assumption that, unlike SNPs, errors Acknowledgements are randomly distributed across sub-genomes, The study is supported by the National Science we try to reduce the set of predicted errors by ex- Fund of Bulgaria within the “Methods for Data cluding rare bases that are clustered together in Analysis and Knowledge Discovery in Big reads with high probability of coming from a dis- Sequencing Datasets” project under contract tinct sub-genome. This is accomplished through DFNI-I02/7 of 12.12.2014. the application of fuzzy clustering that is based

page 34 (not for indexing) EMBnet.journal 21.A Posters e829

Gene set enrichment analysis of neuroendocrine system of the silkworm Bombyx mori Gabor Beke 1 , Matej Stano 1, Ivana Daubnerova 2, Dusan Zitnan 2, Lubos Klucar 1 1Institute of Molecular Biology, Slovak Academy of Sciences, Bratislava, Slovakia 2Institute of Zoology, Slovak Academy of Sciences, Bratislava, Slovakia Beke G et al. (2015) EMBnet.journal 21(Suppl A), e829. http://dx.doi.org/10.14806/ej.21.A.829

Silkworm Bombyx mori represents a model organ- tool based on an online expectation–maximi- ism for studying the neuroendocrine system in in- zation algorithm. Obtained data were conse- vertebrates. This system of nervous organs and quently analysed in several gene set enrichment endocrine glands regulates large number of life packages, including topGO and gplots (both R functions including movement, ecdysis, courting packages), where highly expressed genes were and mating. We have analysed strand-specific clustered and visualised, according to the func- Illumina RNA-seq data from B. mori samples orig- tional GO enrichment analysis, and clustered by inated from different endocrine glands of both gene expression level. sexes and several developmental stages. Reads Acknowledgements were mapped to the B. mori transcriptome using The study is supported by the APVV-0827-11, the Bowtie 2 aligner. Since the silkworm genome VEGA 2/0164/15, IMTS 26230120002 and IMTS is abundant in repetitive sequences (Mita et al., 26210120002 grants. 2004), the mapping allowing unlimited multi- mappings (required by eXpress) was extremly References computing demanding. Transcript level RNA-seq Mita K et al. (2004) The genome sequence of silkworm, Bombyx mori. DNA Res. 11(1), 27-35. http://dx.doi. quantification was performed using the eXpress org/10.1093/dnares/11.1.27

page 35 (not for indexing) e834 Posters EMBnet.journal 21.A

The NonCode aReNA DB: a non-redundant and integrated collection of non- coding RNAs Giorgio De Caro 1, Arianna Consiglio 1, Domenica D’Elia 1 , Andreas Gisel 1,2, Giorgio Grillo 1, Sabino Liuni 1, Angelica Tulipano 1, Flavio Licciulli 1 1CNR, Institute for Biomedical Technologies, Bari, Italy 2International Institute of Tropical Agriculture (IITA), Ibadan, Nigeria De Caro G et al. (2015) EMBnet.journal 21(Suppl A), e834. http://dx.doi.org/10.14806/ej.21.A.834 The recent availability of high throughput tech- as a web-resource at http://ncrnadb.ba.itb.cnr. nologies, like next generation sequencing (NGS) it/. Sequences have been classified in diverse platforms, has provided the scientific community biotypes and associated to Sequence Ontology with an unprecedented opportunity for large- terms. The database can be queried by using scale analysis of genome in a large number of multi-criteria and ontological search through an organisms. However, among others, one of the easy-to-use web interface, and data exported most challenging task for bioinformaticians is to as non-redundant collections of transcripts an- develop tools that provide biologists with an easy notated in VEGA, ENSEMBL, RefSeq, miRBase, access to curated and non-redundant collec- GtRNAdb and piRNABank. The database is up- tions of sequence data. dated through an automatic pipeline and last Non-coding RNAs, for a long time believed update was on January 2015. Presently NonCode to be not-functional, are emerging as the most aReNA DB contains 134,908 human ncRNAs clas- large and important family of gene regulators. sified in 24 biotypes, and next update will include NonCode aReNA Database is a comprehensive transcripts of Mus musculus and Arabidopsis thal- and non-redundant source of manually curated iana. and automatically annotated ncRNA transcripts. Acknowledgements Originally developed as a component of a big- This work was supported by the Italian MIUR ger project, composed by a datawarehouse for Flagship Project “Epigen”. the functional annotation of ncRNAs from NGS data, NonCode aReNA DB is currently available

page 36 (not for indexing) 21.A EMBnet.journal

Organisational Members of EMBnet

Biocomputing Group SIB Belozersky Institute, Moscow, Russia Swiss Institute of Bioinformatics, Lausanne, Switzerland

BMC TGAC Uppsala Biomedical Centre, Computing Department, The Genome Analysis Centre, Norwich, United Uppsala, Sweden Kingdom

Centre of Bioinformatics UMB SAV Peking University, Beijing, China Institute of Molecular Biology, Slovak Academy of Sciences, Bratislava, Slovakia CMBI Radboud University, Nijmegen, The Netherlands UMBER Faculty of Life Sciences, The University of Manchester, CNR Manchester, United Kingdom Institute for Biomedical Technologies, Bioinformatics and Genomic Group, Bari, Italy

CSC for more information visit our Web site Espoo, Finland www.EMBnet.org

EMBL-EBI Hinxton, Cambridge, United Kingdom

Facultad de Ciencias Exactas y Naturales Universidad de Buenos Aires, Buenos Aires, Argentina EMBnet.journal Institute of Biochemistry Molecular Biology and Biotechnology, University of ISSN 1023-4144 Colombo, Colombo, Sri Lanka Dear reader, Instituto de Biotecnología If you have any comments or suggestions regarding this Universidad Nacional de Colombia, Edificio Manuel journal we would be very glad to hear from you. If you have Ancizar, Bogota, Colombia a tip you feel we can publish then please let us know. Before submitting your contribution read the “Instructions for authors” at http://journal.EMBnet.org/index.php/EMBnetnews/about Instituto Gulbenkian de Ciencia and send your manuscript and supplementary files using our Centro Portugues de Bioinformatica, Oeiras, Portugal on-line submission system at http://journal.EMBnet.org/index. php/EMBnetnews/about/submissions#onlineSubmissions. ITICO Information Technology Infrastructure for Collaborative Past issues are available as PDF files from the Web site: Organizations, United Kingdom http://journal.EMBnet.org/index.php/EMBnetnews/issue/archive

KEMRI Publisher: EMBnet Stichting p/a Wellcome Trust Research Programme, Kilifi, Kenya CMBI Radboud University Nijmegen Medical Centre Lab. Nacional de Computação Científica 6581 GB Nijmegen Lab. de Bioinformática, Petrópolis, Rio de Janeiro, The Netherlands Brazil Email: [email protected] LCSB Tel: +46-18-67 21 21 University of Luxembourg, Luxembourg, Luxembourg

ReNaBi French bioinformatics platforms network, France