Number 104 January 2016 ERCIM NEWS www.ercim.eu

Special theme Tackling Big Data in the Life Sciences

Also in this issue:

Research and Society: “Women in ICT Research and Education” Editorial Information Contents

JoINT ERCIM ACTIoNS ERCIM News is the magazine of ERCIM. Published quarterly, it reports on joint actions of the ERCIM partners, and aims to 4 International Workshop on Computational Intelligence reflect the contribution made by ERCIM to the European for Multimedia Understanding Community in Information Technology and Applied by Davide Moroni, Maria Trocan and Ales Prochazka Mathematics. Through short articles and news items, it pro - vides a forum for the exchange of information between the insti - CIM Working Group “Dependable tutes and also with the wider scientific community. This issue 5 Activities of the ER has a circulation of about 6,000 printed copies and is also Embedded Software-Intensive Systems” available online. by Erwin Schoitsch

ERCIM News is published by ERCIM EEIG BP 93, F-06902 Sophia Antipolis Cedex, France RESEARCh AND SoCIETy Tel: +33 4 9238 5010, E-mail: [email protected] Director: Jérôme Chailloux This section “Women in ICT Research and Education” has ISSN 0926-4981 been coordinated by Lynda Hardman, CWI

Editorial Board: 6 “Because it’s 2016” – Introduction Central editor: by Lynda Hardman Peter Kunz, ERCIM office ([email protected]) Local Editors: 7 Women in Informatics Research and Education : Erwin Schoitsch ([email protected]) by Jane Hillston Belgium:Benoît Michel ([email protected]) Cyprus: Ioannis Krikidis ([email protected]) 8 Reflections from a Leadership Program for Women Czech Republic:Michal Haindl ([email protected]) in Scientific Positions France: Steve Kremer ([email protected]) by Katja Abrahamsson, Letizia Jaccheri and Harald Øverby : Michael Krapp ([email protected]) Greece: Eleni Orphanoudakis ([email protected]), 10 The Girl Project Ada: A Successful Project for Artemios Voyiatzis ([email protected]) Recruiting Female Students to the ICT-Studies at NTNU : Erzsébet Csuhaj-Varjú ([email protected]) by Line Berg Italy: Carol Peters ([email protected]) Luxembourg: Thomas Tamisier ([email protected]) 12 Inspire, Network, Encourage – womENcourage 2015 Norway: Poul Heegaard ([email protected]) by Reyyan Ayfer and Bev Bachmayer Poland: Hung Son Nguyen ([email protected]) Portugal: Joaquim Jorge ([email protected]) 13 Integrating Gender in Research Institutions in Greece : Silvia Abrahão ([email protected]) by Pavlos Hatzopoulos, Nelli Kambouri and Kathy Kikis- Sweden: Kersti Hedman ([email protected]) Papadakis Switzerland: Harry Rudin ([email protected]) The Netherlands: Annette Kik ([email protected]) 14 Women in IT Security Research: The Case of SBA W3C: Marie-Claire Forgue ([email protected]) Research by Veronika Nowak Contributions Contributions should be submitted to the local editor of your country SPECIAL ThEME

Copyright notice The special theme section “Tackling Big Data in the Life All authors, as identified in each article, retain copyright of their Sciences” has been coordinated by Roeland Merks, CWI work and Marie-France Sagot, Inria.

Advertising Introduction to the Special Theme For current advertising rates and conditions, see 16 Tackling Big Data in the Life Sciences http://ercim-news.ercim.eu/ or contact [email protected] by Roeland Merks and Marie-France Sagot

ERCIM News online edition 17 Networks to the Rescue – From Big “Omics” Data to The online edition is published at Targeted Hypotheses http://ercim-news.ercim.eu/ by Gunnar Klau

Subscription 18 Interactive Pay-As-You-Go-Integration of Life Science Subscribe to ERCIM News by sending an email to Data: The HUMIT Approach [email protected] or by filling out the form at the by Christoph Quix, Thomas Berlage and Matthias Jarke ERCIM News website: http://ercim-news.ercim.eu/ 20 Eliminating Blind Spots in Genetic Variant Discovery Next issue by Alexander Schönhuth and Tobias Marschall April 2016, Special theme: Planning and Logistics ERCIM NEWS 104 Janaury 2016 RESEARCh AND INNoVATIoN 21 Computational Estimation of Chromosome Structure by Claudia Caudai and Emanuele Salerno This section features news about research activities and innovative developments from European 22 Modelling Approaches to Inform the Control and research institutes Management of Invasive Seaweeds by James T. Murphy, Mark Johnson and Frédérique Viard 44 Secure and Privacy-Preserving Identity Management in the Cloud 24 Kbdock – Searching and Organising the Structural by Bernd Zwattendorfer, Stephan Krenn and Space of Protein-Protein Interactions Thomas Lorünser by Marie-Dominique Devignes, Malika Smaïl-Tabbone and David Ritchie 46 PRISMACLOUD – Privacy and Security Maintaining Services in the Cloud 25 The Source of the Data Flood: Sequencing Technologies by Daniel Slamanig, Agi Karyda and Thomas by Alberto Magi, Nadia Pisanti and Lorenzo Tattini Lorünser

27 Big Data in Support of the Digital Cancer Patient 47 TREDISEC: Towards Realizing a Truly Secure by Haridimos Kondylakis, Lefteris Koumakis, Manolis and Trustworthy Cloud Tsiknakis, Kostas Marias and Stephan Kiefer by Beatriz Gallego-Nicasio Crespo, Melek Önen and Ghassan Karame 28 Towards an On-board Personal Data Mining Framework For P4 Medicine 48 Smart Devices for the Home of the Future: A by Mohamed Boukhebouze, Stéphane Mouton and Jimmy New Model for Collaboration Nsenga by Daniele Spoladore, Gianfranco Modoni and Marco Sacco 30 Can Data-driven Self-Management Reduce Low Back Pain? 50 FOODWEB – Studying Food Consumption and by Kerstin Bach, Paul Jarle Mork and Agnar Aamodt Production Patterns on the Web by Christoph Trattner, Tomasz Kuśmierczyk and 31 Twitter can Help to Find Adverse Drug Reactions Kjetil Nørvåg by Mark Cieliebak, Dominic Egger and Fatih Uzdilli 51 Robots Learn Actions and Cooperative Tasks by 32 Trust for the “Doctor in the Loop” Imitation by Peter Kieseberg, Edgar Weippl and Andreas Holzinger by Maria Koskinopoulou and Panos Trahanias

34 Big Data Takes on Prostate Cancer 52 Unlocking the Secrets of the Dynamics of by Erwan Zerhouni, Bogdan Prisacari, Qing Zhong, Peter Earth’s Interior Wild and Maria Gabrani by Costas Bekas

35 Mining Electronic Health Records to Validate Knowledge in Pharmacogenomics EVENTS, IN BRIEf by Adrien Coulet and Malika Smaïl-Tabbone Announcement 36 Modelling the Growth of Blood Vessels in Health and 54 EMC² Summit in conjunction with the CPS Disease Week 2016 by Elisabeth G. Rens, Sonja E. M. Boas and Roeland M.H. Merks 54 More on IFIP TC6’s Open Digital Library

38 Modelling? Using Standards Can Help You Book by Brett G. Olivier and Bas Teusink 54 Online Social Networks: Human Cognitive Constraints in Facebook and Twitter Personal 39 Understanding Metadata to Exploit Life Sciences Open Graphs Data Datasets by Paulo Carvalho, Patrik Hitzelberger and Gilles Venturini In Brief 55 Lynda Hardman President Informatics Europe 40 Management of Big and Open Data in the Life Cycle Assessment of Ecosystem Services 55 QuSoft Research Center for Quantum Software by Benedetto Rugani, Paulo Carvalho and Benoit Othoniel Launched

42 WITDOM: Empowering Privacy and Security in Non- 55 ERCIM “Alain Bensoussan” Fellowship trusted Environments Programme by Juan Ramón Troncoso-Pastoriza and Elsa Prieto Pérez

ERCIM NEWS 104 Janaury 2016 3 Joint ERCIM Actions

Workshop lecture hall inside the Strahov Monastery in Prague and Dr. Joan Lasenby, invited speaker from the Engineering Department of the University of Cambridge, UK (photos taken by VIC CVUT).

• Prof. Jonathan Chambers, Newcastle University, UK: Dig - International Workshop ital Signal Processing and Assistive Technologies. on Computational 41 papers were accepted for oral presentation by the pro - gramme chairs, Prof. Jonathan Chambers, Prof. Enis Cetin Intelligence for and Dr. Oldrich Vysata and the workshop featured a set of sessions focussing on current hot topics. A Data Fusion in Multimedia Understanding Biomedicine session stressed the role of computational intel - ligence in putting together data from different biosensors in by Davide Moroni, Maria Trocan and Ales Prochazka the diagnosis of disorders, treatment monitoring, rehabilita - tion and analysis of sport activities. A session was dedicated Over 80 researchers from academia and industry to Ambient Assisted Living (AAL) and Intelligent Tools for attended a workshop on Computational Intelligence for Disabled and Elderly People, where the focus was on how Multimedia Understanding organized by the ERCIM computational intelligence can be used to provide personal - MUSCLE Working Group in Prague, 29-30 October 2015. ized guidance that can empower people and foster a healthier lifestyle. Classical biological data and signal processing was The workshop was held in close collaboration with the the main theme of the BIODAT session, which featured sev - University of Chemistry and Technology and the Czech eral talks on EEG processing, including microsleep and spin - Technical University in Prague. Prof. Vladimir Marik and dles analysis. Finally, a session on Big and Linked Data was Dr. Davide Moroni were honorary chairs. included in which aspects related to smart cities, social media analytics and semantic data stream processing were Among the topics addressed by the MUSCLE working addressed. group, this year it was decided to give special emphasis to biomedical applications. Biological signals and images pose The workshop was held in the historical halls of the Strahov significant challenges to multimedia understanding, which, Monastery and the social programme included a visit to the if addressed with success, would result in advances in both famous library of the monastery. biomedical science and in signal and image processing and recognition. A detailed analysis of raw biomedical data is The complete conference programme and videorecords of essential if their full potential is to be exploited in bioengi - invited talks can be found at (http://iwcim.isep.fr/). All the neering, control and robotic systems. Thanks to contributions accepted papers are available through IEEE Xplore. from several related research communities, the workshop was able to cover and discuss many interesting aspects of this Links: domain, ranging from human motion and rehabilitation to http://wiki.ercim.eu/wg/MUSCLE/ speech recognition, from polysomnography to cell image http://iwcim.isep.fr/ analysis. The relevance of techniques able to extract the most salient features affecting the biomedical domain from the Please contact: ever increasing data deluge, blend them successfully, Davide Moroni, ISTI-CNR, Italy emerged clearly, and was discussed in the lectures delivered E-mail: [email protected] by the distinguished invited speakers of this edition: Maria Trocan, ISEP, France • Dr. Saeid Sanei, Surrey University, UK: Man-Machine E-mail: [email protected] Interaction and Motion Modelling • Dr. Joan Lasenby, University of Cambridge, UK: Appli - Ales Prochazka, UCT & CTU, Czech Republic cations of Dimensionality Reduction Techniques in Med - E-mail: [email protected] ical Data Analysis • Prof. Danilo Mandic, Imperial College, UK: Computa - tional Intelligence and Wearable Body-Sensor Networks

4 ERCIM NEWS 104 January 2016 • R5-COP – “Reconfigurable ROS-based Resilient Reason - Activities of the ERCIM ing Robotic Cooperating Systems” (http://www.r5-cop.eu/ ) • MBAT – “Combined Model-based Analysis and Testing of Working Group “Dependable Embedded Systems” (http://www.mbat-artemis.eu), • nSafeCer – “Safety Certification of Software-intensive Sys - Embedded Software- tems with Reusable Components” (http://www.safecer.eu). Intensive Systems” ARTEMIS was one of the European, industry-driven research initiatives and is now part of the ECSEL Joint by Erwin Schoitsch Technology Initiative (JTI). The current ARTEMIS projects will continue according to the ARTEMIS rules, but managed The ERCIM Working Group "Dependable Embedded by the ECSEL JU. The workshop was very successful and Software-Intensive Systems" (DES WG) co-organized two led to interesting discussions, providing a good overview on events in 2016: a workshop at the 2015 SAFECOMP current work and future perspectives. Details are available conference in Delft on 22 September and a special on request from Erwin Schoitsch. The proceedings have been session on "Teaching, Education and Training for published by Springer as SAFECOMP Workshop Dependable Embedded Cyber-Physical Systems" at the Proceedings LNCS 9338. Abstracts and contents are avail - Euromicro SEAA/DSD Conference on 26-28 August 2015 able at: http://www.springer.com/de/book/9783319242484. in Funchal, Madeira, Portugal. Special Session at SEAA/DSD 2015 ERCIM/EWICS/ARTEMIS Workshop The ERCIM/ARTEMIS/Euromicro Special Session TET- The already well established ERCIM/EWICS/ARTEMIS DEC (Teaching, Education and Training for Dependable Workshop on Dependable Embedded Cyber-physical Embedded Cyber-Physical Systems) was held during the Systems and Systems-of-Systems” (DECSoS) was held as a SEAA/DSD 2015 conference in Funchal, Madeira, Portugal full-day workshop on 22 September in conjunction with the on 26-28 August. The special session covered five presenta - SAFECOMP 2015 conference in Delft, The Netherlands. tions and a concluding plenary discussion. The presentations in the first session focused partly on “E&T Use Cases” in Since it was established in 1979 by the European Workshop ARTEMIS projects and their impact on industrial and uni - on Industrial Computer Systems, Technical Committee 7 on versity education and training. They included: Reliability, Safety and Security (EWICS TC7), SAFECOMP • An introduction to the topic and an overview (Introduction has contributed to the progress of the state of the art in to TET-DEC Session – An E&T Use Case in a European dependable application of computers in safety-related and project) (Erwin Schoitsch), safety-critical systems. SAFECOMP (www.safecomp.org) is • “Reuse in Safety Critical Systems: Educational Use Case Final an annual event covering the experience and new trends in the Results” (Miren Illarramendi Rezabal, Leire Etxeberria and areas of safety, security and reliability of critical computer Xabier Elkorobarrutia Letona from Mondragon University). applications. It provides ample opportunity to exchange insights and experience on emerging methods, approaches The second session discussed more general concepts around and practical solutions. The next SAFECOMP 2016 will take how to teach particular topics: place in Trondheim, Norway, 20-23 September 2016. It is • What Training is Needed by Practicing Engineers Who being organized by Amund Skavhaug, NTNU, supported by Create Cyberphysical Systems? (Christopher Scaffidi), Erwin Schoitsch, AIT, both co-chairs of the ERCIM DES WG • Teaching Software Safety to Master Students (Clara Benac Earle, Ana María Fernández-Soriano, Lars-Ake The DECSoS Workshop included: an overview and introduc - Fredlund and Elena Gómez-Martínez) and tion to the Cyber-physical Systems (CPS) initiatives of the • Applying property-based testing in teaching safety-critical European Union; a presentation of the ARTEMIS projects; system programming (Lars-Ake Fredlund, Angel Herranz and discussions by ERCIM and EWICS. Erwin Schoitsch and Julio Mariño). also presented the following four sessions: • Introduction, and Safety & Cyber-Security Co-assessment The plenary discussion sparked some new insights and rec - (3 papers) ommendations about how to successfully motivate and • Robotics and Motion Control (3 papers) engage in teaching and education and encourage use of • Modelling, Testing and Verification (3 papers) embedded CPS. Input from overseas (the US) provided dif - • Dependability and Scalability (2 papers). ferent views in this area. One recommendation made was to stimulate interest within industry by focusing on existing The workshop was co-hosted by the following Embedded work in connection with the training contents, so managers Systems/CPS projects, which were represented by papers can see direct benefits in terms of project outputs as a result and several other presentations: of the training. The Proceedings have been published by • CRYSTAL – “Critical Systems Engineering Factories” IEEE CPS and are available via IEEE Explore. Details are (http://www.crystal-artemis.eu) available on request from [email protected]. • ARROWHEAD1 – “Ahead of the Future” (http://www.arrowhead.eu/), Please contact: • EMC2 – “Embedded Multi-Core systems for Mixed Crit - Erwin Schoitsch, AIT Austrian Institute of Technology, icality applications in dynamic and changeable real-time ERCIM DES WG co-chair environments” (http://www.artemis-emc2.eu/ ) E-mail: [email protected]

ERCIM NEWS 104 January 2016 5 Research and Society

Women in ICT Research and Education

“Because it’s 2016” a very good place to start. A number of large companies, such as Google, Oracle, Bloomberg, Cisco and Facebook, are by the guest editor Lynda Hardman aware of the necessity to attract and keep women in the field. They spend considerable effort sponsoring events to attract The informatics and mathematics communities suffer from young women and girls and Facebook have even put their an affliction common to many technical and scientific fields, unconscious bias training online [3]. that fewer than 30% of those who choose to study and con - tinue their profession in the field are women. This leads to a Perhaps we will learn that even when gender bias has been masculine-oriented culture that unwittingly discriminates removed, we still don't have a minimum level of 30% women against women. in the community. We have to be very aware that this will still tend to a masculine style environment and is likely to While as a discipline we cannot change the culture in which push the more feminine values out and will require effort to we live, we can at least be aware of it and take appropriate sustain good gender representation. measures. The elephant in the room is “the assumed and indisputable gender neutrality of the practice of science” [1]. I am extremely grateful to our authors for this special sec - tion. They took precious time away from their research and Awareness is key. Particularly that we all hold values that are teaching activities that contribute to building their academic invisible from within the culture in which we operate. These CV. All too often it is the women, who are few and far can be seemingly irrelevant attributes, such as height, where, between, who spend their time on these activities. Attracting for example, USA presidents are taller than average [2], or and keeping women is the responsibility of men and women voice pitch, where a lower pitch increases your perceived alike. Such activities need to become mainstream in a depart - quality as a leader. mental portfolio and assigned time in the same way as other responsibilities. While plenty of excellent women are highly active in infor - matics and mathematics research they are not always as vis - The contributions explain a number of different initiatives ible as they should be. This can be for various reasons, such from different groups, including ACM-W Europe, as the topics on which they work are not seen as “core”. Informatics Europe, and Athena SWAN in the UK. Some are Whatever the underlying cause, they tend not to be invited to actions that management can take, for example Informatics give keynote talks, so that the notion that women are not able Europe and Athena SWAN, whereas others are more oriented to reach the top of their field is perpetuated. to women themselves, such as ACM-W Europe. The contri - butions give pointers to best practices that you can imple - It is no individual's conscious intention that women are dis - ment in your local context. Most measures that can be taken criminated against, so providing unconscious bias training is to redress the bias, such as open recruiting and promotion

6 ERCIM NEWS 104 January 2016 procedures, shared parental responsibilities, will help many ment, collecting best practices already implemented at insti - men as well. The “fight your way to the top” culture is tutions from across Europe. The handy guide makes sugges - stressful for both genders. tions of small changes that can have a large impact for a department aiming to recruit, support and encourage female Europe, however, is a mix of many different cultures where, colleagues to get the most from their careers. The booklet is in some, gender equality is a way of life and, in others, the endorsed by the European Commission and features a fore - culture has yet to recognise that there even is a problem that word by Neelie Kroes, who was Vice-President of the needs to be addressed. Communication across Europe, par - European Commission, responsible for the Digital Agenda at ticularly demonstrable best practices, often from our more the time. The booklet is freely available at the Informatics northern colleagues, can allow us to share best practices Europe web site. across the continent. If you are interested in becoming more involved with initiatives then please feel free to contact When possible WIRE works jointly with ACM-WE, a myself or one of the authors of the contributions. standing committee of ACM Europe, which has a remit to cultivate and inspire people about the opportunities in com - The recently instated Prime Minister of Canada, Justin puter science and clear pathways throughout Europe for Trudeau, presented his gender balanced government on 4th women in computing. In particular, WIRE was proud to November 2015 [4]. When asked why he felt gender parity sponsor the ACM WomENcourage event that took place in was important his reply was, “Because it’s 2015.” Sweden in September 2015.

Further reading: To further raise the profile of gender issues, in 2016 WIRE is Do it yourself – how biased are you? launching a new award in order to recognize and celebrate https://implicit.harvard.edu the achievements of departments who have taken action to attract, retain and promote women. The Informatics Europe References: Minerva Informatics Equality Award will be presented for [1] ERCIM News 104 article by Hatzopoulos et al. p.13 the first time in October 2016. Nominations will be sought [2] https://en.wikipedia.org/wiki/Heights_of_presidents in the spring and the closing date will be 1st May 2016. _and _pr esidential_candidates_of_the_United_States Judged by an international panel, the award will be given to a [3] https://managingbias.fb.com/ department that is able to demonstrate benefits to female [4] http://www.huffingtonpost.ca/2015/11/04/justin-trudeau- staff or students through implemented policies and initia - gender-parity-because-it-s-2015_n_8474386.html tives. On a three-year cycle, the award will focus on dif - ferent stages of the career pipeline: Please concact: • Encouraging female students to enroll in computer sci - Lynda Hardman ence/informatics degree programmes at undergraduate or President Informatics Europe 2016-2017 postgraduate levels; CWI and University Utrecht, The Netherlands • Supporting the transition for PhD students or postdoctoral E-mail: [email protected] researchers into faculty positions; • Developing the careers of female faculty, including reten - tion and promotion.

For the inaugural award in 2016 we will focus on initiatives Women in Informatics to support the career development of female faculty. WIRE has also started work on a web resource, which pro - Research and Education vides links to other organizations and initiatives around the world that can provide support and inspiration for those by Jane Hillston seeking to improve gender balance within an academic envi - ronment. This is an ongoing project but already the resource Informatics Europe is a membership organization for has more than 20 links to stimulating and informative pages, university departments and research laboratories including national initiatives such as the Athena SWAN aiming to create a strong common voice to safeguard, awards in the UK. shape and stimulate quality research and teaching in Informatics in Europe. There are currently over 90 For over a decade, the Equality Challenge Unit (ECU), has member institutions across 25 countries. The been granting Athena SWAN awards to higher education and organization has a number of working groups, including research institutions in the UK, and their departments, if they WIRE, Women in ICST Research and Education, which are able to demonstrate commitment to advancing the careers works to promote actions that contribute to improve of women in science, technology, engineering, mathematics gender balance in Information and Communication and medicine (STEMM). Sciences and Technologies. Athena SWAN awards in the UK In 2013 WIRE published a booklet entitled “More Women in The Equality Challenge Unit is a government-funded body Informatics Research and Education” and circulated it which supports universities and other research institutions to widely within Europe. The booklet is a compact source of build an inclusive culture, promoting diversity and removing clear and simple advice aimed at deans and heads of depart - barriers to progression and success. The Athena SWAN

ERCIM NEWS 104 January 2016 7 Research and Society

charter was established in 2005 for STEMM subjects and in May 2015 was expanded to encompass work undertaken in Reflections from arts, humanities, social sciences, business and law and to promote gender equality more broadly. a Leadership Program Across all disciplines, the existence of the Athena SWAN for Women in Scientific awards has had a profound impact on the level of awareness of diversity and equality issues in UK universities, particu - Positions larly when it was suggested that funding bodies would make future awards conditional on the host department holding an by Katja Abrahamsson, Letizia Jaccheri and Harald award (this has not happened yet). Moreover, a department Øverby or institution that achieves an award cannot then rest on its laurels, as awards must be renewed every three years with This paper presents insights from a leadership demonstrable progress against a previously submitted action programme for women in academic positions in plan. As a discipline computer science and informatics were Computing Sciences with the goal of helping to advance a little slower than some disciplines to get involved, but there their scientific careers. This is a joint programme has been growing interest in recent years and more than 25 between the Department of Computer and Information computer science/informatics departments now hold awards. Science, and the Department of Telematics at The British Computer Society has sought to encourage more Norwegian University of Science and Technology (NTNU). departments to increase their engagement in the Athena SWAN process, for example by organising a successful It has long been recognized that more women need to be “CignetS” meeting in December 2014 to provide collabora - attracted to Computer Science (CS) programs globally [1]. tive support for computing departments preparing applica - However, overwhelming research evidence shows that the tions for the first time. Most of the awards in computer sci - trend has taken the opposite direction. In 1984, 37% of ence/informatics are currently at the bronze level but there Computer Science degrees in the U.S. were awarded to are two silver awards and more aspired to. The model is one women and in 2014 the number dropped to 14 percent [2]. of inevitable rivalry and competition between departments of Currently in Norway, only 26% of professor positions are the same discipline, and between disciplines, but academics held by women [L1]. Gender balance in academic positions are naturally competitive and in this case it’s all in a good is a recognized goal in Norwegian society. cause! NTNU has been making systematic and successful efforts to Links: increase female student recruitment (see article by Line Berg http://www.informatics-europe.org in this issue). Table 1 lists the percentage of women in person years at the Faculty of Information Technology, Mathematics Women in ICST Research and Education working group: and Electrical Engineering (IME), and for the two depart - http://www.informatics-europe.org/services/women-in-icst- ments in this program. There is a significant gap between the research-and-education.html number of women in temporary and permanent academic positions. The leadership program will therefore focus on Booklet: “More Women in Informatics Research and temporary staff to increase their awareness of the importance Education”: of strategic career planning and development. http://www.informatics-europe.org/images/documents/more- women-in-informatics-research-and-education_2013.pdf The goals of our programme, which was launched in 2014, include: enhancing the leadership and strategic skills of the Equality Challenge Unit: http://www.ecu.ac.uk participants; encouraging women to participate more in net - working in order to take a firm stand in scientific positions; Athena SWAN awards: and encouraging women to take top science and technology http://www.ecu.ac.uk/equality-charters/athena-swan/ leadership positions and act as mentors to fellow women. Previous experience from other leadership training pro - grammes is that such programmes are effective in empow - Please contact: ering women to take leadership positions, and focusing on Jane Hillston women already in an academic position within the fields of University of Edinburgh, Scotland, UK Telematics and Computer Science department represents a E-mail: [email protected] key opportunity (World Bank, 2012) that this program will seek to utilize.

The leadership programme involves professional training to be able to manage the gender biases that are embedded in the current system, and be more strategic in scientific career development and advancement. As an example, a training session with an organizational psychologist on "Leadership, gender, and culture” was organized this year. Another important focus is to strengthen internal and external net - working. The programme therefore collaborates with other

8 ERCIM NEWS 104 January 2016 Position category Faculty IME Department CS Department Telematics Table 1: Person years for the Faculty and Departments: Source http://dbh.nsd.uib.no Person Person Person % Women % Women % Women years years years Temporary PhD 244.3 19.8 41.1 21.5 17.6 38.5 and Post Docs Research and 165.3 13.5 42.1 14 14.4 8.3 Teaching Positions similar programmes locally and internationally and invites be able to continue, and extend the program, long term stable highly recognized international female scientists, such as Dr. funding is needed. This has been a major challenge for sim - Leila Alem from Australia and Prof. Gabriella Dodero from ilar projects, and there is a strong need to provide more Italy, who gave inspirational talks and acted as mentors for funding opportunities for programmes with an aim of our programme participants. enhancing women’s advancement in their scientific careers [3]. Therefore, in order to improve the current situation and The programme has also participated in the Grace Hopper have a sustainable impact, further actions need to be taken Conference (GHC) together with over 12,000 other female and more gender balance programmes need to be adopted. colleagues. NTNU’s goal was to attract prospective PhD stu - dents and Post Docs for our programs as well as to network Links: with future colleagues. Attracting new female employees is [L1] Women discriminated in the top positions in the Nor - tough with competition from high-profile companies such as wegian academia (In Norwegian), Dagens Perspektiv: Facebook and Google who emphasise their strong interest in http://www.dagensperspektiv.no/2015/kvinner-diskrim - hiring women in computer science positions. ineres-pa-toppen-i-norsk-akademia [L2] Gender equality and women's empowerment: A criti - As our programme is still at an early stage, it will be a while cal analysis of the third millennium development goal until the full benefit can be assessed. However, based on the 1: http://doi.org/10.1080/13552070512331332273 feedback from the participants and interest shown by other CS units, the programme is on the right track. New CS units References: from NTNU will join the programme in 2016. The partici - [1] K.A. Frenkel: “Women and computing”, Communica - pants appreciate the extra effort that is put into supporting tions of the ACM, 33, 34–46. doi:10.1145/92755.92756, them in their career advancement. As an example, Kerstin 1990. Bach, one of the program participants, said that the pro - [2] B. S. Zweben and B. Bizot: "2012 Taulbee Survey", gramme had been an inspiration and helped her to reflect on Computing Research News, 25(5), 11–60, 2013. where she stands in her career, and how she can support more [3] Waylen, G. (2013). Informal Institutions, Institutional women to aim at top leadership positions in the field of sci - Change, and Gender Equality. Political Research Quar - ence. terly, 67(1), 212–223. doi:10.1177/1065912913510360.

The next step for the programme is to involve more men in Please contact: order to increase awareness of the challenges raised, increase Letizia Jaccheri the attractiveness of PhD programmes in CS and Telematics, NTNU, Norway and to offer more mentoring support for women. However, to E-mail: [email protected]

NTNU's booth at GHC Career Fair in Houston Texas, October 2015.

ERCIM NEWS 104 January 2016 9 Research and Society

The Girl Project Ada: facts Recruitment events A Successful Project The Girls’ Day: • Recruitment to technology and natural sciences stud - for Recruiting female ies Students to the Technology Camp (organized by Ada): • Recruitment to ICT studies ICT-Studies at NTNU • Lectures (examples from 2015): “Online with hack - ers, spies and scammers”, “How to win the Oscars by Line Berg with a degree in mathematics”, “Learning from games” (by the inventor of Kahoot Professor Alf Ten years ago, girls represented only 16% of the Inge Wang) applicants to Science and Technology studies at the • Workshops (examples from 2015); “Introduction to Norwegian University of Science and Technology (NTNU). hacking”, “Build your own computer”, “Code your This year that percentage share exceeded 30%. In the own game”, “Programming ”, “Build your own same period the absolute number of female applicants robot”, “How to get an easier everyday through The more than tripled. This is a consequence of focused and Internet Of Things ”, “Would you like to fly a drone targeted work from NTNU, which has required long-term with your brain?” and “Code a traffic light” dedication and channelling of resources – both time • 70% of participants at the Technology Camp and the and money. Girls’ Day later apply to a science and technology- study at NTNU The main goal of the Girl Project Ada [1] is to increase the • A subtitled film from the Technology Camp 2013: number of women graduating with a master’s degree in ICT. https://www.youtube.com/watch?v=VWXKCg3X- To achieve this, two main approaches are taken: organising XQ recruitment events to promote ICT studies to girls at high school; and operating career networks with professional and Career network social networking amongst female ICT students. • for electronics systems design, cybernetics and robotics, mathematics, communication technology, Recruitment events computer science and informatics students There are two annual recruiting events targeting girls only: • companies: 22 ICT companies (including Facebook, Girls’ Day and the Technology Camp. ABB, Accenture, Acando, BEKK Capgemini, Telenor, EY, Cisco, Kongsberg, Silicon Labs, Nordic The purpose of Girls’ Day, a joint venture organised by three Semiconductor, Sopra Steria, Netlight, Schlumberg - faculties at NTNU, is to inform attendees about science and er, Norsk Elektro Optik, Skatteetaten, Visma, Mesan, technology-studies offered at the university. Girls from the Marintek.) final year of high school are invited to apply. Two hundred • Annual funding 2.1 million NOK (233, 000 Euros) and fifty girls get to spend two days at NTNU; listening to • funded by : 40% from the university gender equality presentations and meeting students who talk about their own fund, 35% from the faculty of Information Technol - study program. Girls’ day has been held annually since the ogy, Mathematics and Electrical Engineering and early 1990s. 25% from the business sponsors.

The Technology Camp [1] aims to promote, and recruit female students to, ICT studies at NTNU, including studies in computer science, informatics, communication tech - nology, electronics systems design, mathematics, and cyber - netics and robotics. We invite every female high school student who is majoring in mathematics to apply to the Technology Camp. Through a selection process, which emphasises a student’s written motivation, grades and national distribution, the 160 best applicants are invited to come to Trondheim and NTNU for a three-day visit to meet students and professors and to learn about ICT. It is also important to give them a feel for how their life as a student will be, through a tour of the city and campus, guided by current female students. They also meet ICT students and representatives from ICT companies. Experience has shown us that one of the most important factors influencing recruitment and motivation is role models. In order to break the stereotype that ICT is only for guys, the most effective way is to make sure that high Figure 1: Number of female applicants for master of science and school girls get to meet the female ICT students who are technology.

10 ERCIM NEWS 104 January 2016 The 160 participants at the Technologycamp at NTNU 1-3 November 2015. Photo: Kai T Dragland. already studying at NTNU, and alumni female ICT students feel for what a working day/typical employer is, and from NTNU who now have exciting jobs in ICT. There is no strengthen friendships within their peer group. Surveys have better way than credible female role models to sell the mes - shown that this has a great impact on preventing drop-outs. sage that ICT studies are suited for girls. Two days before enrolment to NTNU, The Girl Project Ada organizes a welcome day [4] for new female students to The Technology Camp participants get to listen to exciting, assure a positive and welcoming start. We introduce them to fun and motivating presentations and lectures, and engaging current students and student associations and give them a workshops. We spend a lot of resources planning an exciting guided tour of campus with more information about the Girl program for the Technology Camp. Examples of lectures Project Ada. and workshop topics in 2015 are shown in the fact box. Throughout the semester, in addition to organizing career- About 70% of the participants at the Technology Camp and network gatherings and the Oslo-trip for the first year stu - the Girl´s Day later apply to a science and technology-study dents, we organize a variety of courses, including coding at NTNU. Surveys among current students at NTNU who evenings with companies, social networking lunches and participated in the Technology Camp show that this was cru - mountain hikes. The female ICT students also have their cial for their decision to study ICT; “I would never have own ICT lab “Cybele” – girls only, no boys allowed! started communication technology studies if it had not been for the Technology Camp “ (third year Communication Links: Technology student). [L1] http://www.ntnu.edu/girls [L2] https://www.youtube.com/watch?v=VWXKCg3X-XQ Career Network [L3] https://www.ntnu.edu/girls/career_networks The career networks [3] are collaborative projects between [L4] Photos from the Welcome day: the Girl Project Ada and a number of companies in the busi - https://www.flickr.com/photos/ntnuime/sets/721576459 ness community. The aim of the career networks is to pro - 48372010/ mote networking between female students and relevant busi - ness associates in the private sector. The Girl Project Ada Please contact: career networks are for electronics, cybernetics, and ICT stu - Line Berg dents. There are four annual gatherings where students and NTNU, Norway representatives of the companies meet for professional and E-mail: [email protected] social networking. The majority of female ICT students at NTNU end up having their first job with one of the compa - nies in the career network (see fact box). Once a year the Girl Project Ada takes 90 female first year students for a three-day trip to Oslo where we visit the com - panies in the career networks. This allows students to get a

ERCIM NEWS 104 January 2016 11 Research and Society

Women in Computing strengthened their commitment to Inspire, Network, their career. Additionally, an attendee at womENcourage 2015 stated she learned during one of the case studies “That Encourage – no matter what issue appears during our career, there will always be a way to solve it and people who support women womENcourage 2015 in STEM.” Being part of a community and knowing that there will always be someone there to offer support is very by Reyyan Ayfer and Bev Bachmayer fortifying and helps individuals stay in their chosen field. ACM-WE knows that womENCourage celebrations provide ACM President Alex Wolf welcomed the participants at a unique opportunity for students, young researchers and the opening of ACM-W Europe womENcourage 2015 early practitioners to network and realise that they are not Celebration of Women in Computing by effectually alone: there are others like them, they are “part of a tech - pointing out: “We would not be here at all if there was nology community.” Therefore, ACM-W Europe and the not a problem in our community. I want to make sure entire ACM family work to bring inspiring speakers and pan - that all of us understand that this is not a women’s ellists from across Europe and around the world to offer problem! This is a community problem! And it takes women both a sense of community and inspiration to everyone to solve this problem.” increase their commitment to their career.

Alex stated the obvious in his opening talk. This problem has The 2015 celebration included many activities: a career fair, been widely discussed, specifically, research shows that keynote talks, panel discussions, technical paper sessions, “Computer science as a discipline has not done well in workshops, case studies, unconference sessions, and a attracting and retaining women and minority practitioners in hackathon. any proportional scale to population representation: the per - centage of BS degrees conferred to women peaked in 1986 Keynotes are the most inspiring part of a celebration. The and is still on a downturn.” [1] Confirming the point of the key learnings and exemplary presentation of accomplished community problem: “While the gender issue is often per - computer scientists positively influences young students and ceived as a women’s issue, only when the majority of the practitioners [3]. WomENcourage 2015 attendees were for - deans and heads of European informatics departments see tunate to hear from Vicki Hanson, Asa Cajander and Marie- this as both a concern and an opportunity will there be any Paule Cani. significant change.” [2] “More Women in ICT Research and Education” by IE-WIRE made recommendations that help Vicki Hanson’s keynote address focused on her research, administrators and deans with ideas and best practices so highlighting inclusive technologies. Researchers in the field they can recruit, inspire and encourage their current and of accessibility seek to create devices and software to future employees and students. ACM-WE raises awareness address the needs of people who through age or disability, among womENcourage attendees about these recommenda - face exclusion from full societal participation. In creating tions, which represent the minimal expectations of industry accessible technology and novel accessibility tools, research and academia. has not only facilitated digital interactions and quality of life needs for many, but has also served to advance the field of Celebrations are effective in supporting women in tech computing more generally. careers, and research shows that even small “regional” cele - brations are effective in inspiring women and giving them a Åsa’s Cajander’s talk “On Grit and Being the Token Figure” sense of community [3]. ACM-WE survey results from brought up a range of challenges that most women in the womENcourage 2014 mirrored these results: 94% of respon - audience have experienced. Åsa pointed out that many kinds dents believed that their attendance at the Celebration of of grit are needed for a woman to survive in her current envi -

ACM-W Europe womENcourage 2015 participants.

12 ERCIM NEWS 104 January 2016 ronment, and she confirmed that women are not alone when faced with such circumstances. Asa said “Change comes Integrating Gender slowly, but it is changing, we all need to continue to encourage these changes.” in Research Institutions Through examples and videos Marie-Paule explained why in Greece graphical design is essential for human creativity, including the fact that graphical simulation is difficult to control, and by Pavlos Hatzopoulos, Nelli Kambouri she described the new methodology which is emerging to and Kathy Kikis-Papadakis ease the interactive design of complex shapes, objects and characters, possibly in motion. A research project on "Gender, Science and Innovation in Greece" is focusing on gender issues in research WomENcourage 2015 hosted three panel discussions on institutions. Its recommendations were recently Open Source as a Career Path, Teaching and Learning, Out publicised by the Educational Research and Evaluation of the Ordinary Jobs. Through the paper sessions students (ERE) Group of the Foundation for Research and were given the opportunity to present and share their current Technology – FORTH. research work and receive valuable feedback. The unconfer - ence sessions effectuated interesting discussions on subjects “The progress of Greece in integrating gender in research suggested by their participants who were also actively institutions is slow. Without being able to enjoy the benefits involved in the discussions: feminism in computing, women of labor rights and without having any privileges as a worker, in computing: work-life balance, imposter syndrome, data you are still expected to perform all your obligations as a mining, and learning analytics. “The workshops were bril - researcher”. This quote from an interview with a young liant, I really enjoyed the hands-on experience” were the female biologist could well be one of the principal conclu - words of one participant, echoing the sentiments of many. sions of the research project. The research addressed the gen - All workshops taught participants something new from basic dering processes that shape the research sector in Greece, electronics to creating wearables, to understanding open focusing on gendered institutional structures and practices source code. The Hackathon had participants from 17 dif - along with the analysis of gender hierarchies and gender dis - ferent countries. The winning teams’ projects were “Free crimination in research institutions as these are articulated in Fall” (a device to attach to wrist or belt which alerts assis - the contemporary context of the Greek economic crisis. tance if the person suddenly falls), “SafeHome” (a portable CO2 and gas alarm system), and “Be My Eyes” (a device that The absence of an institutional framework for gender helps visually impaired people.) Public research institutes in Greece have not adopted policies or strategies for integrating gender in their work, precluding, ACM-WE is inviting everyone to attend the future in effect, complaints about gender discrimination and barring womENcourage events in September, 2016 in Linz, Austria, the empowerment of women working in this sector. The find - and in September, 2017 in Barcelona, Spain. ings of the research show that, overall, there is quite a limited awareness of gender equality in the research sector in Links: Greece. In many cases, the assumed and indisputable gender ACM: http://www.acm.org/ neutrality of the practice of science is believed to act as the ACM-W: http://women.acm.org/ ultimate safeguard against the imposition and reproduction ACM-Europe: http://europe.acm.org/ of gender hierarchies in the research sector. ACM-WE: http://europe.acm.org/acm-w-europe.html womENcourage2014: In reality, however, women researchers in Greece face signif - http://womencourage.acm.org/archive/2014/index.html icant obstacles to career advancement in public research womENcourage2015: http://womencourage.acm.org/ institutions, particularly women with extended caring responsibilities. These emanate, on the one hand, from the References: dual workload faced by these women researchers, and on the [1] B.Nelson: “The Data On Diversity,” Communications other, from the dominance of networks of male scientists, of the ACM, Vol. 57, No. 11, Pages 86-95, 2014, especially in the areas of decision-making and institutional http://dx.doi.org/10.1145/2597886 strategy development (Katsiaboura and Vogt, 2012). As a [2] L. Hardman: “More Women in Informatics Research result, few women tend to reach high-rank positions and high and Education”, 2013, http://kwz.me/RT career grades, while most women tend to avoid standing as [3] G. Townsend, K. Sloan: “An Effective Alternative to the candidates for these positions in order to avoid the work Grace Hopper Celebration” in Proc. of the 46th ACM overload. Indeed, the extensive interviews conducted during Technical Symposium on Computer Science Education the project showed that women researchers in Greece pre - – SIGCSE ‘15, 2015, dominantly believe that their gender is a barrier to their http://dx.doi.org/10.1145/2676723.26772332015 recruitment, to their selection as leaders of research teams, and to their career advancement. Please contact: Reyyan Ayfer, Bev Bachmayer The management of research teams is another major issue, ACM Europe which is not regulated or monitored by the Greek research E-mail: [email protected] system. We can distinguish three approaches (Schiebinger, 2008): (a) the masculinist approach, where private and

ERCIM NEWS 104 January 2016 13 Research and Society

family issues are silenced altogether and science is defined Link: as a process of knowledge production without gender (b) http://www.genderinscience.gr the feminist approach, where the private sphere is inte - grated in the life of research teams and acts as a factor for References: developing a climate of cooperation amongst research team [1] Gianna Katsiaboura and Annete Vogt, eds: “Gender studies members and for producing scientific results and (c) the and science: problems, tasks, perspectives”, Nissos 2012. intersectional approach, where the categories of gender, [2] Londa Schiebinger: “Gendered Innovations in Science social class, nation and race are taken into account in and Engineering”, Stanford University Press, 2008 recruiting members of research teams and in establishing relations of good cooperation amongst them. In Greece, the Please contact: first, masculinist approach is overwhelmingly dominant, Pavlos Hatzopoulos, Nelli Kambouri, Kathy Kikis- and this often leads to gross violations of labour rights of Papadakis, IACM-FORTH, E-mail: [email protected], precarious female researchers. [email protected], [email protected]

In research centres of the Greek private sector the systems of career advancement and payroll grades are linked to criteria of productivity rather than scientific excellence. In private research institutions gender inequalities are thus primarily Women in IT Security enacted through the division of scientific work into "female" and "male" areas, which correspond to differentiated work Research: schedules, job responsibilities and salaries. In this system, conducting research is constructed as a "female" occupation The Case of SBA Research that remains very poorly paid in comparison to “male” mana - gerial roles. Job promotion and wage increases are thus by Veronika Nowak directly associated with internal mobility schemes from research to management areas, which are predominantly not The low percentage of women in ICT professional fields accessible to women researchers. is a well-documented issue. With IT courses still being dominated by male students – for example, only 16% of Promoting gender equality in research: policy challenges students in TU Wien’s informatics programs are female Women researchers cannot easily challenge gender discrimi - – it is still a challenge for organizations to increase nation in their workplace, since Greece lacks a comprehen - numbers of female employees. How is a research sive legal framework in respect to gender equality in centre like SBA Research (Vienna, Austria) performing research organizations. Even where legal stipulations are in in this regard? place – for example, legislation regarding equal representa - tion of women on scientific management committees – these SBA Research is a research centre for IT security, located in are only observed on an ad hoc basis without any action Vienna, Austria. It was founded in 2006 and currently taken against noncompliant institutions. The lack of a com - employs a staff of approximately 100. It is Austria’s biggest prehensive legal framework leads to the multiplication of research centre, exclusively addressing issues of IT security, gender inequalities, the intensification of direct and indirect and working closely with several Austrian academic (e.g., discrimination, and the prolongation of offensive behaviour TU Wien, University of Vienna, Graz University of towards women scientists stemming from dominant mas - Technology) and numerous industrial partners. culinist cultures. SBA Research has always employed women, but in the early In this context, the integration of gender in research institu - days of the company, women largely occupied administra - tions in Greece is a critical policy challenge. Positive policy tive roles. Although some “female Key Researchers” joined actions that can promote gender equality in the research the team early on, there was no targeted recruitment of sector could include: “female Junior Researchers”. Nevertheless, the share of • The adoption of gender action plans for the promotion of women on SBA Research’s staff has developed nicely over gender equality by all research actors in the private and recent years (Figure 1). After a slight drop from 2010 to 2012 public sectors. – mainly due to transfers to partner institutions or universi - • The adoption of statutes on the operation of research ties – the percentage of women amongst SBA Research’s teams, which will include in their central axes the dimen - employees is increasing. It currently matches the average sions of gender, social class, race, and ethnicity. reported by a survey performed on behalf of the Austrian • The provision of substantial training on the management Ministry for Transport, Innovation and Technology of research groups for men and women researchers. (BMVIT) [1], analysing data from non-university-research • The enactment of statutes on the labour of students and organizations (25-27% share of women). precarious workers in research organizations with a goal to respect labor rights, including rights related to gender The federal funding programme FEMtech [1] was particu - equality. larly beneficial in increasing SBA Research’s recruitment of • The financing and support of networking structures and women. The programme, funded by the Austrian Ministry initiatives among female scientists. for Transport, Innovation and Technology (BMVIT), aims to promote the entry of young female scientists into a career in

14 ERCIM NEWS 104 January 2016 ICT and to further women’s participation in research and cally at girls. We furthermore regard our current and former technology (industry and non-university-research). SBA female employees teaching at universities and universities of Research has so far hosted 15 FEMtech interns, five of applied sciences as role models, showing students that there whom ended up joining SBA Research as permanent is a professional future for women in ICT. employees, while the remainder returned to their universities to finish their Bachelor of Science and Master of Science In summary, SBA Research considers the current situation to studies. Those staying with SBA Research became junior be much better than even five years ago; nevertheless, a lot researchers while pursuing their master’s and Ph.D. degrees; remains to be done. In order to increase the number of one was recently selected as “FEMtech’s Expert of the women working in ICT, it seems feasible to implement fur - Month”. Another intern joined the consulting department of ther strategies to attract undergraduate students to computer SBA Research, transferring her scientific knowledge to prac - science programs in academia and female professionals to a tical applications in an industrial context. career in the respective fields. In addition to public funding programmes, it might be helpful to adjust the scope of ICT The recruitment of female employees remains a challenge, projects or conferences [3]. owing in part to the still relatively low numbers of female compared with male ICT students. For example, at the nearby SBA Research will continue to actively encourage women to TU Wien, females represented only 1,139 out of 7,173 stu - first study and then work in ICT by making use of respective dents enrolled in study programs in the field of informatics funding schemes and offering a company culture favourable for the autumn 2014 semester. Also, a recent study commis - to integrating the professional and personal goals of female sioned by the Kuratorium Sicheres Österreich (Austrian and male researchers. Safety Board) regarding women and cybersecurity shows that although 63% of female students are interested in the topic of cybersecurity, only 24% actually thought about working in this field and just 5% already did so [2]. This raises the questions: what can institutions like SBA Research do firstly, to become known as a possible employer and secondly, to attract female applicants in particular? So far, SBA Research has been a regular partici - pant in job fairs of Austria’s universities of applied sciences and awareness initia - tives such as the “Young Researcher’s Day” which is organized by the Austrian Computer Society.

SBA Research strives to support female and male employees alike in their careers. Both male and female employees are Figure 1: Changes in percentage of female SBA Research staff over three years. invited to apply for and have been granted sabbaticals as well as research stays abroad, e.g., in Boston and Tokyo. The matters of family and children have become an increas - Links: ingly relevant topic amongst SBA Research’s staff. Since SBA Research “Women and IT Security Research”, 2015: 2012 in particular, increasing numbers of employees have http://kwz.me/Ro founded families, putting the issue of maternity and paternity FEMtech programme: leave, as well as part-time employment for young parents, on http://www.femtech.at/en/femtech.html the table. In addition to maternity leave, SBA Research has a techNIKE: http://kwz.me/RO high number of fathers taking time off to be with their chil - dren, in accordance with the respective Austrian legislation. References: Also, previously established working conditions facilitating [1] http://www.femtech.at/sites/default/files/Slide_Gleich - a good work-life balance (e.g., flexible time agreement) have stellungserhebung_0.pdf proven to be an asset to the employees – including manage - [2] Kuratorium Sicheres Österreich. Female Cyber Mentor - ment – when coordinating work and family matters. ing, 2015 (to appear). [3] C. Alvarado and E. Judson, Using Targeted Conferences SBA Research is also taking active steps to increase the visi - to Recruit Women into Computer Science. Communica - bility of women in ICT. In 2015, the second edition of our tions of the ACM 57(3):70-77, 2014. brochure “Women and IT Security Research” was issued, featuring 24 female employees and alumnae of SBA Please contact: Research in all kinds of professional fields. Female Veronika Nowak researchers also take time to participate in initiatives such as SBA Research, Vienna, Austria TU Wien’s techNIKE – a series of workshops aimed specifi - E-mail: [email protected]

ERCIM NEWS 104 January 2016 15 Special Theme: Tackling Big Data in the Life Sciences

Introduction to the Special Theme Tackling Big Data in the Life Sciences by the guest editors Roeland Merks and Marie-France Sagot

The Life Sciences are traditionally a descriptive science, in which both data collection and data analysis both play a central role. The latest decennia have seen major technical advances, which have made it possible to collect biological data at an unprecedented scale. Even more than the speed at which new data are acquired, the very complexity of what they represent makes it particularly diffi - cult to make sense of them. Ultimately, biological data science should further the understanding of biological mechanisms and yield useful predictions, to improve individual health care or public health or to predict useful environmental interferences.

Biologists routinely measure DNA sequences, gene expression, metabolic profiles, and protein levels, infer molecular structures, visualize the positions and shape of cells in developing organs and whole embryos over time, rapidly screen the phenotypic effects of gene mutations, or they monitor the posi - tions of species and individual animals and plants in whole ecosystems or herds, to name just a few.

The resulting datasets yield new insight into biological problems, while at the same time they pose new challenges to mathematicians and informaticians. How do we store these large bodies of data in an effi - cient manner and make sure they remain accessible in the future, while at the same time preserving pri - vacy? How do we search through the data, spot interesting patterns and extract the biologically relevant information? How do we compare data of different conditions or species? How do we integrate data from different sources and across biological levels of organization to make new predictions? Can we then use the resulting patterns to solve the inverse problem and derive meaningful dynamical mathe - matical models from kinetic datasets? How do we model complex and/or large biological systems, e.g., whole cells, embryos, plants or ecosystems? What are the challenges to multiscale modeling?

Given the great variety of topics that such general questions cover, this Special Theme of ERCIM News could only highlight a few, in a selection of nineteen papers. These provide an overview of the latest techniques of data acquisition at the molecular level, address problems of data standardisation and integration, and propose sophisticated algorithms for analysing complex genetic features. Such algorithms help to infer the rules for predictive, dynamical modelling, by characterising genetic inter - actions, as well as molecular and cellular structures from noisy and incomplete data. They also pro - vide key data for modelling multiscale systems at the multicellular or ecosystem level. Alongside data mining approaches, such dynamical models are a useful aid for proposing new control strategies, e.g., in personalised medicine or to help improve medical diagnostics. A further selection of articles dis - cusses visualisation methods for big and noisy data, or suggest to make use of new communication techniques, such as Twitter, to help in quick management of health. Overall these approaches illus - trate the breadth and the beauty of the mathematical and computational approaches that have been developed to cope with the challenges that biology poses to data science. These challenges will only continue to grow in the foreseeable future, as the volume, quality and types of biological data keep on expanding.

Please contact: Roeland Merks, CWI, The Netherlands [email protected]

Marie-France Sagot, Inria, France [email protected]

16 Networks to the Rescue – from Big “omics” Data to Targeted hypotheses by Gunnar Klau

CWI researchers are developing the Heinz family of algorithms to explore big life sciences data in the context of biological networks. Their methods recently pointed to a novel hypothesis about how viruses hijack signalling pathways.

High-throughput technologies are gen - standing paper award for achieving a • BioNet [L2] is an R package that pro - erating more and more measurements of breakthrough in computing optimal vides an easy-to-use interface to cellular processes in health and disease, subnetwork modules by introducing Heinz. Provided with raw data, e.g., leading to ever-increasing amounts of and exploiting a relation to graph from RNA-Seq measurements, it “omics” data. The main challenge is theory. Since then, the method has been generates the input score files needed now to interpret and understand these improved and several variations have by Heinz. The scores are based on a massive data sets in order to ultimately been presented: statistically sound decomposition of answer important biological and bio - • Heinz, the workhorse method, is cur - p-values describing the measure - medical questions. This is a complex rently at version 2.0 [L1]. It takes as ments into signal and noise compo - challenge, however, because important input a set of gene scores and finds an nents. signals are hidden by high noise levels optimal active subnetwork module • Heinz has been adapted to answer a and heterogeneity of samples and dis - with respect to these scores. This is a variety of research questions involv - eases. Inspecting big life sciences data in connected subnetwork where the sum ing the interpretation of differential the context of biological networks helps of the gene scores is maximal. Find - gene expression, GWAS, to address this challenge. To this end, ing such a module is an NP-hard metabolomics, proteomics and researchers at CWI in Amsterdam are problem. Heinz computes provably metagenomics data. developing the Heinz family of algo - optimal modules using advanced • xHeinz [L3] is a recent addition that rithms to analyse and explore genome- techniques from mathematical opti - computes conserved cross-species scale measurements within biological mization. While Heinz 1.0 exploited modules. In a cooperation with the networks. the close relation of the underlying Netherlands Cancer Institute (NKI), Maximum-Weight Connected Sub - xHeinz was used to provide evidence The Heinz family of algorithms graph (MWCS) problem to the Prize- that the differentiation process of the The Heinz project started in 2007 as a Collecting Steiner Tree (PCST) prob - recently discovered Th17 cell type, collaboration with University of lem and relied on PCST codes, Heinz which plays an important role for the Würzburg’s Biocenter. The first algo - 2.0 directly solves MWCS using a immune system, is conserved rithmic prototype, Heinz 1.0 [1], was recursive graph-decomposition between mouse and human [2]. presented in 2008 at ISMB, the premier scheme into bi- and tri-connected • eXamine, a visual analytics app for conference on computational biology. components and a dedicated branch- exploring Heinz results in the popular There, the work received the out - and-cut algorithm. Cytoscape visualization platform,

Figure 1: Optimal Heinz module along with enriched functional and pathway categories using eXamine.

ERCIM NEWS 104 January 2016 17 Special Theme: Tackling Big Data in the Life Sciences

was developed in a cooperation with and pathway categories using eXamine. References: Eindhoven University of Technology. Using the tools from the Heinz family, [1] M. Dittrich et al.: “Identifying The tool makes it easy to explore the researchers have been able to for - functional modules in protein-pro - annotated Heinz modules, for exam - mulate a new hypothesis about tein interaction networks: an inte - ple, in the context of Gene Ontology deregulated signalling of β- catenin by grated exact approach”, Bioinfor - or pathway enrichment. viral receptor proteins. Parts of this new matics 24(13):i223-i231, 2008. hypothesis have now been verified [2] M. El-Kebir et al.: “xHeinz: an All software in the Heinz family is open experimentally and have led to targeted algorithm for mining cross-species source. follow-up studies, which are currently network modules under a flexible under way. conservation model”, Bioinformat - Case study on virally deregulated s ics 31(19):3147-55, 2015. ignalling Future [3] K. Dinkla et al.: “eXamine: Explor - The Human Cytomegalovirus (HCMV) Current research includes the applica - ing annotated modules in net - is a specific type of herpes virus with a tion to cancer genomics data. Here, the works”, BMC Bioinformatics high prevalence of 60% among humans. task is to extract subnetworks that show 15:201, 2014. The interplay of HCMV infections with exclusive mutation patterns in the sam - many diseases, including cancer, is an ples. A long term research goal is to Please contact: important topic of biomedical research. move towards more dynamic descrip - Gunnar W. Klau In a collaboration within the tions of cellular mechanisms. CWI, currently Fulbright Visiting Amsterdam Institute for Molecules, Professor at Brown University Medicines and Systems, Heinz and Links: Tel: +31 20 592 4012 eXamine were used to study a module [L1] http://software.cwi.nl/heinz E-mail: [email protected] that is activated by an HCMV-encoded [L2] http://bionet.bioapps.biozentrum. G-protein coupled receptor. See Figure uni-wuerzburg.de 1 for an illustration of the optimal Heinz [L3] http://software.cwi.nl/xheinz module along with enriched functional

Interactive Pay-As-you-Go-Integration of Life Science Data: The hUMIT Approach

by Christoph Quix, Thomas Berlage and Matthias Jarke

Biomedical research applies data-intensive methods for drug discovery, such as high-content analysis, in which a huge amount of substances are investigated in a completely automated way. The increasing amount of data generated by such methods poses a major challenge for the integration and detailed analysis of the data, since, in order to gain new insights, the data need to be linked to other datasets from previous studies, similar experiments, or external data sources. Owing to its heterogeneity and complexity, however, the integration of research data is a long and tedious task. The HUMIT project aims to develop an innovative methodology for the integration of life science data, which applies an interactive and incremental approach.

Life science research institutes conduct analysis requirements are frequently Further data processing is often done high-content experiments investigating changing in a research environment. in scripting languages such as R or new active substances or with the aim of The a-priori construction of integrated MATLAB by reading and writing the detecting the causes of diseases such as schemas for scientific data is not pos - data from/to files. Therefore, the Alzheimer’s or Parkinson’s. Data from sible, because the detailed structure of HUMIT system currently targets files past experiments may contain valuable the data to be accumulated in future as data sources, but database systems information that could also be helpful experiments cannot be known in and other types of data sources can be for current research questions. advance. Thus, a very flexible and integrated into the system later without Furthermore, the internal data needs to adaptable data management system is changing the basic architecture. In a be compared with other datasets pro - required in which the data can be stored first step, a shallow extraction will be vided by research institutes around the irrespective of its structure. done for the input files in which the world in order to validate the results of metadata of the source files is the experiments. Ideally, all data would The core idea of the HUMIT project is extracted and loaded into the data lake. be integrated in a comprehensive data - illustrated in Figure 1. Data manage - “Data lake” is a new buzzword which base with a nicely integrated schema ment in life science is often file-based, refers to systems in which the data that covers all aspects of research data. as the devices (e.g., an automated from the sources is copied, retaining its However, it is impossible to construct microscope or a reader device) in a lab original structure, to a repository [3]. such a solution since schemas and generate files as their main output. This idea is also applied in HUMIT:

18 ERCIM NEWS 104 January 2016 Figure 1: The HUMIT system.

the source files will be copied to the system is available. Metadata can be [3] I. Terrizzano and Peter M. Schwarz repository and stored with a bi-direc - extracted from various file types and is and Mary Roth and John E. Colino: tional link to their metadata. The stored in a generic metamodel, which is Data Wrangling: The Challenging objects in the repository are immutable based on our experiences in developing Yourney from the Wild to the Lake, and only an append operation will be the GeRoMe metamodel [1]. The in Proc. of CIDR, 2015. supported (no updates or deletes), as framework for metadata extraction is reproducibility and traceability is an extensible and can be easily adapted to Please contact: important requirement for data man - new file types. The design of the inter - Christoph Quix, Thomas Berlage, agement in life science. active data exploration component is Matthias Jarke currently underway and will be one of Fraunhofer Institute for Applied The metadata will be provided to the the core components of the system. Information Technology FIT, Germany user for exploration. By browsing E-mail: through the metadata repository, the The HUMIT project [L1] is coordinated [email protected], user can see what kind of data is avail - by the Fraunhofer Institute for Applied [email protected], able and incrementally construct a data Information Technology FIT and [email protected] mart in which data for her specific funded by the German Federal Ministry requirements are collected. Data inte - of Education and Research. Further par - gration takes place at this stage. This ticipants are the Fraunhofer Institute for type of integration is referred to as Molecular Biology and Applied “pay-as-you-go-integration” [2], since Ecology IME and the German Center only the part of the data lake required for Neurodegenerative Diseases for a particular application is integrated, (DZNE) as life science research insti - and the integration is done while the tutes, and soventec GmbH as industry user is exploring the data. The extrac - partner. The project started in March tion of the detailed data from the source 2015 and will be funded for three years. files (known as “Deep Extraction”) will also be done at this stage if the data is Link: required in the data mart. During the [L1] http://www.humit.de construction of the data mart, the user will be supported in the definition of References: data integration mappings by (semi- [1] D. Kensche, C. Quix, X. Li, Y. Li, )automatic tools (e.g., for schema M. Jarke: Generic schema mappings matching or entity resolution). Existing for composition and query answer - mappings can be also exploited for the ing. Data Knowl. Eng., Vol. 68, no. definition of new mappings. 7, pp. 599-621, 2009. [2] A.D. Sarma, X. Dong, A.Y. Halevy: The requirements analysis for the Bootstrapping pay-as-you-go data project has been completed and a first integration systems, in Proc. of version of the metadata management SIGMOD, pp. 861-874, 2008.

ERCIM NEWS 104 January 2016 19 Special Theme: Tackling Big Data in the Life Sciences

Eliminating Blind Spots in Genetic Variant Discovery

by Alexander Schönhuth and Tobias Marschall

Detecting genetic variants is like spotting tiny sequential differences among gigantic amounts of text fragment data. This explains why some variants are extremely hard to detect or have even formed blind spots of discovery. At CWI, we have worked on developing new tools to eliminate some of these blind spots. As a result, many previously undiscoverable genetic variants now form part of an exhaustive variant catalogue based on the Genome of the Netherlands project data.

In 2007, the advent of "next-generation Note that, in the absence of adequate disease risks with DNA sequence vari - sequencing" technologies revolution - methods to detect them, the first ques - ation, and personalized medicine in ized the field of genomics. It finally tion to ask is: do these variants even general. became affordable to analyse large exist in nature? numbers of individual genomes, by At CWI, as members of the Genome of breaking the corresponding DNA into The presence of possible blind spots has the Netherlands project, we have suc - fragments and sequencing those frag - not kept researchers from analysing ceeded in eliminating a prominent dis - ments, yielding “sequencing reads”. All these gigantic haystacks of sequence covery blind spot, thereby contributing of this is now happening at surprisingly fragments. A prominent example of large numbers of previously undiscov - – nearly outrageously – low cost and such an effort is the "Genome of the erable genetic variants. We achieved high speed. Advances in terms of cost Netherlands" project [2], which has this by reversing a common variant dis - and speed, paired with the relatively aimed at providing an exhaustive sum - covery workflow – usually, large short length of the fragments (in com - mary of genetic variation for a consis - amounts of seemingly ordinary looking parison to “first-generation sequenc - tent population. Launched in 2010, it is sequence fragments are removed, which ing”) comes at a price, however. First, both one of the earliest population-scale turns a big into a small data problem the rapid pile-up of sequencing reads sequencing projects, and still one of the and renders fragment analysis a lot makes for a genuine “big data” prob - largest of its kind -- overall, the frag - easier. In contrast, we process all data lem. Second, the reduced fragment ment data amounts to about 60 ter - [1]: in other words, instead of removing length yields even more complex scien - abytes. The analysis of sequencing data large amounts of hay and, with it, con - tific riddles than in “first-generation is further enhanced by sequencing siderable amounts of needles that are sequencing” times. Overall, the result - related individuals – either family trios too tiny to be easily spotted, we re- ing computational problems are now or (twin) quartets – which allows the arrange the entire haystack such that harder both from theoretical and practi - researchers to study transmission of even the tiny needles stick out. We have cal points of view. Despite – or possibly variants and variant formation within developed a "statistical magnet" that owing to – the incredible mass of data, one generation [3]. The resulting cata - pulls the tiny needles to the surface. certain genetic variants stubbornly logue of variants establishes an invalu - resist detection and form blind spots of able resource, not only for the Dutch, The key to success has been the devel - genetic variant discovery due to experi - but also for closely related European opment of an ultra-fast algorithm that mental and statistical limitations. populations regarding association of empowers the application of this

Point mutations (SNPs) CCCAGCACTTTGGGAGGCCAAGGTGGGGGGAGGAAATTGCTTAAGCCCAGGAGT CCCAGCACTTTGGGAGGTCAAGGTGGGGGGAGGAAATAGCTTAAGCCCAGGAGT extract Deletion DNA CCCAGCACTTTGGGAGGCCAAGGTGGGGGGAGGAAATTGCTTAAGCCCAGGAGT S t r

CCCAGCACTTTGGGAGGCCAAGGTGGGAATTGCTTAAGCCCAGGAGT u c t u ls Insertion r el t a c en l

CCCAGCACTTTGGGAGGCCAAGGTGGGGGGAGGAAATTGCTTAAGCCCAGGAGT m v g a fra r i

CCCAGCACTTTGGGAGTTATGCCAAGGTGGGGGGAGGAAATTGCTTAAGCCCAGGAGT a n

t one "paired-end read" s

Translocation (

S per fragment

CCCAGCACTTTGGGAGGCCAAGGTGGGGGGAGGAAATTGCTTAAGCCCAGGAGT V CCAGCACTTT TGCTTAAGCC s ) CCCAGCACTTTGGGAGAGGTGGGGGGAGGAAATGCCATGCTTAAGCCCAGGAGT DNA fragments Further variants: inversions, duplications, ...

Figure 1: Left: Different classes of genetic variants in human genomes. Right: Next-generation sequencing, only after breaking up DNA in small fragments, one can read the DNA – however, deletions and insertions of length 30-200 letters now are very difficult to spot. We have eliminated this blind spot in discovery by developing new algorithms.

20 ERCIM NEWS 104 January 2016 magnet even on such massive amounts Links: Please contact: of sequence fragments. In summary, the http://homepages.cwi.nl/~as Alexander Schönhuth combination of a sound statistical http://www.nlgenome.nl CWI, The Netherlands machinery with a highly engineered E-mail: [email protected] algorithm allows for implementation of References: a reversed discovery workflow. [1] T. Marschall, I. Hajirasouliha, A. Tobias Marschall was a postdoc at Schönhuth: “MATE-CLEVER: CWI from 2011-2014. Since 2014, he As a result, the Genome of the Mendelian-inheritance-aware dis - holds an appointment as assistant Netherlands project is the first of its covery and genotyping of midsize professor at the Center for kind to exhaustively report on the corre - and long indels”, Bioinformatics Bioinformatics at Saarland University sponding class of genetic variants, pre - 29(24):3143-3150, 2013. and the Max Planck Institute for viously termed “twilight zone deletions [2 ] The Genome of the Netherlands Informatics in Saarbrücken, and insertions”, but which now enjoy Consortium: “Whole-genome Germany somewhat more daylight. sequence variation, population structure and demographic history In future work, we are also planning to of the Dutch population”, Nature eliminate this blind spot in somatic Genetics 46(8):818-825, 2014. variant discovery, which will likely [3] W. Kloosterman, et al.: “Character - reveal large amounts of so far unde - istics of de novo structural changes tected cancer-causing genetic variants, in the human genome”, Genome and will hopefully shed considerable Research 25:792-801, 2015. light on cancer biology as well.

Computational Estimation of Chromosome Structure by Claudia Caudai and Emanuele Salerno

Within the framework of the national Flagship Project InterOmics, researchers at ISTI-CNR are developing algorithms to reconstruct the chromosome structure from "chromosome conformation capture" data. One algorithm being tested has already produced interesting results. Unlike most popular techniques, it does not derive a classical distance-to-geometry problem from the original contact data, and applies an efficient multiresolution approach to the genome under study.

High-throughput DNA sequencing has data carry crucial information about the priate to us, since a high contact fre - enabled a number of recent techniques 3D structure of the chromosomes: quency may well mean that the two (Chromosome Conformation Capture understanding how DNA is structured fragments are close, but the converse is and similar) by which the entire genome spatially is a step towards under - not necessarily true: two fragments that of a homogeneous population of cells standing how DNA works. are seldom in contact are not necessarily can be split into high-resolution frag - physically far from each other. ments, and the number of times any In recent years, a number of techniques Furthermore, we checked the topolog - fragment is found in contact with any for 3D reconstruction have been devel - ical consistency of the distance systems other fragment can be counted. In oped, and the results have been vari - obtained from real data, and found that human cells, the 46 chromosomes con - ously correlated with the available bio - these are often severely incompatible tain about three billion base pairs (3 logical knowledge. A popular strategy with Euclidean geometry [1]. Gbp), for a total length of about 2 m, fit - to infer a structure from contact fre - ting in a nucleus with a radius of 5 to 10 quencies is to transform the number of For these reasons, we chose to avoid a microns. As a typical size for the indi - times any fragment pair is found in con - direct contact-to-distance step in our vidual DNA fragments is 4 kbp, up to tact into the distance between the com - technique. Another problem we had to about 750,000 fragments can be pro - ponents of that pair. This can be done face when trying to estimate the chro - duced from the entire human genome. using a number of deterministic or mosome structure was the above-men - This means that there are more than 280 probabilistic laws, and is justified intu - tioned size of the data record, and the billion possible fragment pairs. Even if itively, since two fragments that are related computational burden. The solu - the genomic resolution is substantially often found in contact are likely to be tion we propose exploits the existence lowered, the resulting data records are spatially close. Once the distances have of isolated genomic regions (the always very large, and need to be treated been derived, structure estimation can Topological Association Domains, or by extremely efficient, accurate proce - be solved as a distance-to-geometry TADs) characterized internally by dures. The computational effort needed problem. However, translating contacts highly interacting fragments, and by is worthwhile, however, as the contact into distances does not seem appro - relatively poor interactions with any

ERCIM NEWS 104 January 2016 21 Special Theme: Tackling Big Data in the Life Sciences

other segment of the genome. This allows us to isolate each TAD and reconstruct its structure from the rele - vant data set, independently of the rest of the genome, then lower the resolu - tion, considering each TAD as a single chain element, and then take the weaker interactions between TAD pairs into account, in a sort of recursive, multires - olution approach.

The result is an algorithm (CHROM - STRUCT [2]) characterized by: Figure 1: Left: Contact frequency matrix for a segment of the long arm of chromosome 1 (q • A new modified-bead-chain model of from 150.28 Mbp to 179.44 Mbp) from human lymphoblastoid cells GM06990, in logarithmic the chromosomes; colour scale. Data from [3]; genomic resolution 100 kbp. The highlighted diagonal blocks • A set of geometrical constraints pro - define our maximum-resolution TADs. Right: one of our reconstructed structures, consisting of ducing solutions with consistent a chain with 292 beads. shapes and sizes; • A likelihood function that does not from chromosome 1, at 100 kbp resolu - Mathematical Models in Biology, contain target distances derived from tion, see Figure 1). The geometrical fea - Springer-Verlag, to appear, DOI: the contact frequencies – in the pres - tures of many of our results correlate 10.1007/978-3-319-23497-7_12. ent version, this likelihood is sampled positively with known functional fea - [2] C. Caudai, et al.: “Inferring 3D by a Monte Carlo strategy to estimate tures of the cells considered in our tests. chromatin structure using a multi - a number of feasible structures for To conclude our research, and to be able scale approach based on quater - each TAD; to assess our results against more nions”, BMC Bioinformatics, Vol. • A recursive framework to associate detailed biological properties, we still 16, 2015, p. 234-244, DOI: the structure of each reconstructed need to remove the experimental biases 10.1186/s12859-015-0667-0 TAD with the shape and the size of a from the raw data, and then try our [3] E. Lieberman-Aiden, et al., “Com - single bead in a lower-resolution strategy on larger parts of (or an entire) prehensive mapping of long-range chain, whose structure, in turn, is genome. interactions reveals folding princi - estimated on the basis of an appropri - ples of the human genome”, Sci - ately binned data set; Link: ence, Vol. 326, 2009; pp. 289-293, • A recursive framework to build the InterOmics Flagship Project: DOI: 10.1126/science.1181369. final structure from the partial results http://www.interomics.eu/web/guest/home at the different levels of genomic res - Please contact: olution. References: Claudia Caudai, Emanuele Salerno [1] C. Caudai, et al.: “A statistical ISTI-CNR, Italy So far, we have tested our algorithm on approach to infer 3D chromatin E-mail: [email protected], part of the human genome (29.2 Mbp structure”, in V. Zazzu et al. (Eds.), [email protected]

Modelling Approaches to Inform the Control and Management of Invasive Seaweeds

by James T. Murphy, Mark Johnson and Frédérique Viard

Invasive non-native plant and animal species are one of the greatest threats to biodiversity on a global scale. In this collaborative European project, we use a computer modelling approach (in association with field studies, ecological experiments and molecular work) to study the impact of an important invasive seaweed species (Undaria pinnatifida) on native biodiversity in European coastal waters under variable climatic conditions.

The introduction of non-native species their population dynamics and invasion one of the world's 100 worst invasive can transform habitats and disrupt success are poorly understood making it species according to the Global ecosystems resulting in serious environ - difficult to develop optimal manage - Invasive Species Database [1]. This mental and economic consequences. ment strategies at both a local and inter - species is the focus of a collaborative Non-native seaweeds represent one of national scale. European research project supported by the largest groups of marine invasive an Irish-Research Council-Marie Curie organisms in Europe. However, often The Asian kelp species Undaria pinnat - Actions co-funded Elevate international the fundamental processes that affect ifida (Wakame) has been nominated as career development fellowship (2013-

22 ERCIM NEWS 104 January 2016 16). It involves a close collaboration • Can we predict the potential range of Simulations have been carried out, between the Ryan Institute at the invasive seaweeds under present and using environmental parameters repre - National University of Ireland, Galway future climatic conditions? sentative of Brittany, France, in order to and the Station Biologique de Roscoff, validate the model against field data France (jointly operated by the French The model framework was developed in collected by researchers at the Station National Centre for Scientific Research the C++ programming language and is Biologique de Roscoff, France. Initial (CNRS) and the Pierre & Marie Curie fully parallelisable to take advantage of results are promising and indicate that University, Paris (UPMC)). high performance computing resources the model can accurately predict the [2]. It represents a robust and adaptable growth dynamics of an U. pinnatifida The aim of this project is to develop a tool to simulate spatially and temporally population in situ and allow us to trace robust, scalable agent-based modelling heterogeneous phenomena in a coastal back the behaviour of the population to tool to investigate expansion rates, environment. A detailed model of the the complex interactions at the indi - abundance and ecological functioning life cycle of U. pinnatifida (including vidual level. Future work will involve of invasive seaweeds. The agent-based the distinct microscopic gametophyte investigations into how changing cli - (or individual-based) modelling para - and macroscopic sporophyte stages) matic conditions may affect the range digm is a useful approach for modelling was then built upon this basic frame - expansion of this species in the future. algal populations as it allows the large- work (see Figure 1). scale population dynamics to be derived The problem of biological invasion and from simple rules dictating the growth Quantitative data from the literature on resulting biodiversity loss in coastal and interactions of the individual mem - the responses of the individual life habitats is a complex question that bers of the population. stages to environmental factors such as cannot be answered by purely theoret - light and temperate were used in order ical or empirical means. This project Some of the questions that we aim to to build up a mechanistic model of the aims to tackle this question using the explore, by combining a theoretical and overall population growth. For combined toolsets of both fields of field-based approach, include: example, photosynthesis-irradiance study and to integrate them into a • What are the main life-history traits curves are used to represent the effect of coherent strategy for control and main - responsible for the invasive potential sunlight levels on growth, and thermal tenance of biodiversity in coastal habi - of seaweeds? performance curves are incorporated to tats. This type of multi-disciplinary • Can we determine the dispersal account for the effect of water tempera - research will help to inform decision potential and thus expansion rates ture on development. This allows us to processes in relation to the control/man - and patterns of invasion? integrate the effects of different envi - agement of invasive species in the • How does the interplay between nat - ronmental conditions on growth marine environment at both a national ural and human-assisted dispersal dynamics and make predictions about and regional scale. affect the expansion rates of marine potential expansion into new habitat invaders? ranges. Links: http://www.sb-roscoff.fr/en/divco- diversity-and-connectivity-coastal- marine-landscapes/team http://www.ryaninstitute.ie

References: [1] IUCN: “Global Invasive Species Database”, ed: IUCN Species Sur - vival Commission. [2] J. T. Murphy, M. P. Johnson, F. Viard: “An agent-based modelling approach to biological invasion by macroalgae in european coastal environments”, in Proc. of the European Conference on Complex Systems 2014, Springer, 2015.

Please contact: James T. Murphy, Adaptation & Diversity in Marine Environment (UMR 7144 CNRS- UPMC), Station Biologique de Roscoff, France Tel: +33 2 98 29 56 57 E-mail: [email protected] Figure 1: A schematic representation of the annual life cycle of the invasive kelp species https://www.researchgate.net/profile/ Undaria pinnatifida, which alternates between the microscopic gametophyte stages and the James_Murphy35 sporophyte stage (which can grow up to 3 metres in length).

ERCIM NEWS 104 January 2016 23 Special Theme: Tackling Big Data in the Life Sciences

Kbdock – Searching and Organising the Structural Space of Protein-Protein Interactions

by Marie-Dominique Devignes, Malika Smaïl-Tabbone and David Ritchie

Big data is a recurring problem in structural bioinformatics where even a single experimentally determined protein structure can contain several different interacting protein domains and often involves many tens of thousands of 3D atomic coordinates. If we consider all protein structures that have ever been solved, the immense structural space of protein-protein interactions needs to be organised systematically in order to make sense of the many functional and evolutionary relationships that exist between different protein families and their interactions. This article describes some new developments in Kbdock, a knowledge-based approach for classifying and annotating protein interactions at the protein domain level.

Protein-protein interactions (PPIs) are sequences and similar biological prop- from nearly 240,000 DDI instances fundamental biophysical interactions. erties. We instigated Kbdock (Kb for extracted from the PDB. As illustrated Our understanding of many biological “knowledge-based”) to address the in Figure 1A, the Kbdock resource can processes relies on the 3D-modelling of problem of PPI modelling and classifi- be queried by Pfam domain to visualise PPIs, and there is a growing need to be cation at the domain level. In fact, as in the DDI network involving this able to classify and analyse the struc- many other complex scientific domains, domain. Otherwise, Kbdock can return tural repertoire of known PPIs using big data approaches in the life sciences the list of DDI instances corresponding computational modelling and analysis can only become viable by explicitly to a given pair of domains contained in techniques. There are currently over considering and making use of prior two interacting proteins, even when 105,000 experimentally determined 3D knowledge. there is little or no similarity with the structures of proteins and protein com- query proteins. Moreover, calculating plexes in the publicly available Protein Essentially, Kbdock is a dedicated rela- 3D super-positions of all DDI instances Data Bank (PDB). Each entry is com- tional database which combines the involving a given domain (Figure 1B) posed of the 3D coordinates of several Pfam domain classification with coor- enabled us to perform a spatial clus- thousands (sometimes even millions) of dinate data from the PDB to analyse tering of all DDIs involving that atoms belonging to one or more linear and model domain-domain interactions domain, thereby identifying a discrete chains of amino-acid residues. By (DDIs) in 3D space. The detection of a number of “domain family binding analysing multiple protein structures DDI instance relies on the computation sites” (DFBSs) [1] on the domain of and their amino-acid sequences, it can of the spatial distance between the sur- interest. This gives us a new and orig- be seen that many protein chains are face residues of the domain instances inal kind of knowledge unit, which is modular, being composed of one or found in each PDB entry. In the latest essential when studying the structure more structural domains. Thus, protein release of Kbdock, after duplicate or and specificity of protein binding sites domains may be considered as “knowl- near-duplicate interactions are on a large scale [2]. The notion of edge units” because they represent removed, a total of 5,139 distinct non- DFBSs also led us to propose a case- abstract classes that group together pro- redundant DDIs involving two dif- based reasoning approach to the teins with similar amino-acid sub- ferent domains have been identified problem of how to retrieve the best

Figure 1: Kbdock answers for a given domain (PF02943: FeThRed_B or Ferredoxin-Thioredoxin Reductase beta chain): (A) the graph of DDIs around this domain in Kbdock (depth = 2); (B) the superposed 3D DDI instances involving this domain.

24 ERCIM NEWS 104 January 2016 available template for modelling pro - using conventional sequence-based or http://www.ebi.ac.uk/uniprot tein-protein “docking” interactions. structure-based comparison techniques. ExPASy: http://enzyme.expasy.org

The Kbdock project is run as a collabo - We are currently working to link References: ration between the Capsid and KBdock's structural domain binding [1] A. W. Ghoorah et al.:“Spatial clus - Orpailleur teams at the Loria/Inria site classification with the widely used tering of protein binding sites for research center in Nancy. It is funded ExPASy Enzyme Classification template based protein docking”, and supported by Inria, the CNRS, and scheme. In order to achieve this, we are Bioinformatics, 2011 Oct 15; the University of Lorraine, as well as developing efficient data-mining 27(20):2820-7. specific ANR (“Agence Nationale pour approaches to process the millions of [2] A. W. Ghoorah et al.: “A structure- la Recherche”) grants. The Kbdock pro - sequence-function associations that are based classification and analysis of gram is available through its online now available in large molecular protein domain family binding sites interface. It may also be queried pro - biology databases, such as Swiss-Prot and their interactions”, Biology grammatically by expert users in order and TrEMBL, which together build the (Basel), 2015 Apr 9; 4(2):327-43. to execute complex or specialised UniProt Knowledgebase at the [3] D. W. Ritchie et al: “Fast protein queries. Recent developments to European Bioinformatics Institute. structure alignment using Gaussian Kbdock make use of a novel protein overlap scoring of backbone pep - structure alignment algorithm called Links: tide fragment similarity”, Bioinfor - “Kpax” that we have developed [3]. http://kbdock.loria.fr matics. 2012 Dec 15; 28(24):3274- This allows queries in Kbdock to span http://kpax.loria.fr, http://hex.loria.fr 81. structural neighbours of the retrieved Protein Data Bank: DDIs, thus allowing Kbdock to search http://www.rcsb.org/pdb/home/home.do Please contact: over more distant regions of protein Pfam domain classification: David Ritchie structure space and to propose protein http://pfam.xfam.org/ Inria, France docking templates that cannot be found UniProt Knowledgebase: E-mail [email protected]

The Source of the Data flood: Sequencing Technologies by Alberto Magi, Nadia Pisanti and Lorenzo Tattini

Where does this huge amount of data come from? What are the costs of producing it? The answers to these questions lie in the impressive development of sequencing technologies, which have opened up many research opportunities and challenges, some of which are described in this issue. DNA sequencing is the process of “reading” a DNA fragment (referred to as a “read”) and determining the exact order of DNA bases (the four possible nucleotides, that are Adenine, Guanine, Cytosine, and Thymine) that compose a given DNA strand. Research in biology and medicine has been revolutionised and accelerated by the advances of DNA and even RNA sequencing biotechnologies.

The sequencing of the first human Sanger his second Nobel Prize in 1980. these platforms are based on various genome was completed in 2003 (though a Sanger sequencing (SSeq) is a implementations of cyclic-array draft was already complete by 2001). This sequencing-by-synthesis technique sequencing. SGS methods (in particular achievement cost 100 million US dollars since it requires DNA polymerase Illumina and Roche) are widely used and took 13 years. Now, only 12 years enzymes to produce the observable for the investigation of the whole spec - later, the cost for the equivalent process output by means of nucleotide radiola - trum of genomic variants from single has dropped to just over $1,000 and takes belling or fluorescent dyes. nucleotide variants (SNVs) to structural just a few days. As a consequence, variants (SVs). Other implementations sequencing the human exome (the coding From around 2005-2006, second-gener - of SGS include SOLiD and Complete part of the genome), for example, or the ation sequencing (SGS) produced a dra - Genomics' nanoball sequencing. whole genome, has become common matic cost reduction, and from that Notably, Ion Torrent was the first plat - practice in medicine, and genomic point on, we have seen a growing diffu - form based on non-optical (i.e., electro - sequencing of many other species has sion of sequencing machines that have chemical) methods paved the way to new challenges and revolutionised clinical research and research fields in the life sciences. practice, as well as molecular biology Today’s low cost of sequencing allows investigations in genomics [1]. The any reasonably funded lab to sequence The first breakthrough in the history of higher error rate (compared with SSeq) multiple genomes, thus raising new DNA sequencing was the development is balanced out by the impressive clinic and research issues. Sequencing of Frederick Sanger's method using throughput of SGS platforms. Though several individuals of the same species, chain-termination with dideoxy they still rely on sequence-by-synthesis for example, is now common for per - nucleotides in 1977, which earned (using chemi- or photo-luminescence), sonalised medicine and for under -

ERCIM NEWS 104 January 2016 25 Special Theme: Tackling Big Data in the Life Sciences

Figure 1: USB-sized Nanopore MinION.

standing how we differ genetically from base-pairs. When we have a robust (i.e., References: each other. Also, RNA-Seq – that is low error rates) technology capable of [1] T. P. Niedringhaus et al.: “Land - sequencing genes that are transcribed for such long reads – which will probably scape of next-generation sequenc - proteins synthesis – allows us to take a be soon – we will certainly witness new ing technologies”, Analytical (possibly comparative) picture of which challenges in genomics. chemistry 83.12 (2011): 4327- genes are expressed under certain condi - 4341. tions. Moreover, sequencing several The machine costs and sizes vary con - [2] C. S. Pareek, R. Smoczynski, A. strands of the same species allows us to siderably, as does the throughput (that Tretyn: “Sequencing technologies investigate intra-species mutations that is, the amount sequenced base-pairs and genome sequencing”, Journal involve the mobile element of the per hour), even within the same gener - of applied genetics 52.4, 2011, genomes. Finally, “metagenomics” ation. In general, both price and size 413-435. [3] G. F. Schneider, C. Dekker: “DNA sequencing with nanopores”, Method Cost per Base ($/Mbp) Read Length Error Rate Nature biotechnology 30.4, 2012, 326-328. SSEQ 400 (up to 2007) 300-1000 10 -5 -10 -2 Please contact: SGS 0.015 (2015) 2 -2 O(10 ) 10 Nadia Pisanti TGS 0.5 (PacBio), 0.65 (Nanopore) O(10 3) 10 -1 University of Pisa and Inria E-mail: [email protected]

studies the microbiology of genetic grow with the throughput. Machines material that is recovered from non-culti - range from the huge and expensive vated environments (e.g., soil, gut, sea- Illumina HiSeq (as big as a closet) to depths) and sequenced. the smaller (desktop-sized) Illumina MiSeq and Ion Torrent, and the even A new revolution is currently underway smaller USB-sized Nanopore MinION, with “third-generation sequencing” shown in Figure 1, passing through the (TGS) techniques [2]. These platforms desktop sized ones. Other performance are based on single molecule real time parameters can instead be grouped sequencing, a single DNA molecule according to generation: Table 1 sequencing approach. While PacBio reports the cost per base of sequencing, platforms exploit an optical detection the length of fragments that can be method, Oxford Nanopore Technolo - output, and the error rate for each tech - gies are based on ionic current measure - nology generation. ments [3]. Both platforms show high error rates, though the length of the reads produced is up to thousands of

26 ERCIM NEWS 104 January 2016 Big Data in Support of the Digital Cancer Patient by Haridimos Kondylakis, Lefteris Koumakis, Manolis Tsiknakis, Kostas Marias and Stephan Kiefer

The iManageCancer project is developing a data management infrastructure for a cancer specific self-management platform designed according to the patients’ needs.

A recent report by the eHealth Task mats, from textual documents and web electronic and patient health records, Force entitled “Redesigning health in tables to well-defined relational data assembling them into ontological Europe for 2020” focuses on how to and APIs (variety). Furthermore, they descriptions relevant to the practice of achieve a vision of affordable, less intru - pertain to ambiguous semantics and systems medicine? And how can we sive and more personalized care, ulti - quality standards resulting from dif - manage the problems associated with mately, increasing quality of life as well ferent collection processes across sites large scale medical data? as lowering mortality. Such a vision (veracity). The vast amount of data gen - depends on the application of ICT and erated and collected comes in so many The high level data management archi - the use of data, and requires a radical different streams and forms — from tecture the iManageCancer is shown in redesign of e-health to meet these chal - physician notes, personal health Figure 1. Within iManageCancer, lenges. Two levers for change as identi - records, images from patient scans, Apache Cassandra is used as an instan - fied by the report are “liberate the data” health conversations in social media tiation of a “data lake” concept to store and “connect up everything”. Fully (variability), to continuous streaming this vast amount of raw data in its native exposing, integrating, linking and information collected from wearables format including structured, semi-struc - exploring health data will have a and other monitoring devices (velocity). tured and unstructured data. Cassandra tremendous impact on improving the These data, if used to their full potential, is an open-source, peer-to-peer and key integrated diagnosis, treatment and pre - may have a tremendous impact on value based store, where data are stored vention of disease in individuals. In healthcare, delivering better outcomes in key spaces, has built-in support for addition it will allow for the secondary at a lower cost (value). As such, key the Hadoop implementation of use of care data for research trans - questions to address include: How can MapReduce and advanced replication forming the way in which care is pro - we develop optimal frameworks for functions and is currently considered vided [1]. large-scale data-sharing? How can we state of the art for real-time big data exploit and curate data from various analysis. On top of the Cassandra, the The iManageCancer H2020 EU project, started in February 2015, aims to pro - vide a cancer specific self-management platform designed according to the needs of patient groups while focusing, in parallel, on the wellbeing of the cancer patient with special emphasis on avoidance, early detection and manage - ment of adverse events of cancer therapy but also, importantly, on psycho-emo - tional evaluation and self-motivated goals. The platform will be centered in a Personal Health Record which will regu - larly monitor the psycho-emotional status of the patient and will periodically record the everyday life experiences of the cancer patient in terms of side effects of therapy, while different groups of patients and their families will share information through diaries, and clini - cians are provided with clinical informa - tion.

The data collected in this context are complex, with hundreds of attributes per patient record that will continually evolve as new types of calculations and analysis/assessment results are added to the record over time (volume). In addi - tion, data exist in many different for - Figure 1: The data management architecture of the iManageCancer platform.

ERCIM NEWS 104 January 2016 27 Special Theme: Tackling Big Data in the Life Sciences

Semantic Integration Layer [2] pushes their ability to handle enormous data Links: selected data to the Semantic Data sets and their “schema-less” nature. But Redesigning health in Europe for Warehouse. Using this architecture we the limited flexibility of their query 2020: http://ec.europa.eu/digital- can select which of the available data mechanisms is a real barrier for any agenda/en/news/eu-task-force-ehealth- should be semantically linked and inte - application that has not predetermined redesigning-health-europe-2020 grated by establishing the appropriate access use cases. The Semantic Ware - http://imanagecancer.eu/ mappings to a modular ontology [3]. house component in the iManageCancer http://www.myhealthavatar.eu/ Then these data are queried, trans - platform fills these gaps by effectively http://cassandra.apache.org/ formed into triples and loaded to the providing a semantically enriched and Semantic Warehouse where they are search optimized index to the unstruc - References: available for further reasoning and tured contents of the Cassandra reposi - [1] H. Kondylakis et al.: “Digital querying. A benefit of the approach is tory. Therefore, our approach tries to Patient: Personalized and Transla - that we can recreate from scratch the offer best of both worlds: efficient per - tional Data Management through resulting triples at any time. However, sistence and availability of heteroge - the MyHealthAvatar EU Project”, for reasons of efficiency the data inte - neous data, and semantic integration and EMBC, 2015. gration engine periodically transforms searching of the “essence” of the [2] H. Kondylakis and D. Plexousakis: only the newly inserted information by ingested information. “Exelixis: Evolving Ontology- checking the data timestamps. Based Data Integration System”, A key next step is to develop the SIGMOD/PODS, 2011. In addition to off-line transformation, eHealth services on top of this data [3] H. Kondylakis et al.: “Agents, on-line transformation is also possible management infrastructure and to test Models and Semantic Integration by issuing transformation events the whole platform in a real-world con - in support of Personal eHealth through an event bus. As such, our text in two clinical pilots, one for chil - Knowledge Spaces”, WISE 2014. architecture adopts a variation of the dren starting in 2016 and one for adults command-query responsibility segrega - starting in 2017. Big Data management Please contact: tion principle where, in order to update is undoubtedly an important area for Haridimos Kondylakis information, one uses a different model healthcare, and it will only become FORTH-ICS, Greece to the model that one is using to read. more critical as healthcare delivery con - E-mail: [email protected] We choose to store the original data tinues to grapple with current chal - using NoSQL technologies owing to lenges.

Towards an on-board Personal Data Mining framework for P4 Medicine

by Mohamed Boukhebouze, Stéphane Mouton and Jimmy Nsenga

A personal on-board data-mining framework that relies on wearable devices and supports on-board data stream mining can help with disease prediction, risk prevention, personalized intervention and patient participation in healthcare. Such an architecture, which allows continuous monitoring and real-time decision-making, can help people living with diseases such as epilepsy.

The emergence of systems biology, big scale in order to discover hidden pat - mobility and safety of patients and data tools and wearable computers is terns and make effective decisions. The enhance their quality of life. transforming medicine from a reactive wearable devices can perform ubiqui - paradigm that focuses on treating dis - tous collections of a large subset of per - Despite the great progress in wearable ease, to “P4 medicine” that focuses on sonal data and be used as interface to technology and ubiquitous computing, predicting diseases, preventing the risks data mining algorithms. The latter are many issues must be addressed in order of developing diseases, personalizing executed remotely on cloud-based to implement efficient on-board data interventions based on the patient con - infrastructure that provides high per - mining for P4 medicine, including: text and participation of the patients in formance processing resources [2]. • Resource limitation. Wearable the management of their own health [1]. However, for P4 medicine to fully devices are limited in terms of com - This new paradigm pledges to improve accomplish its goals, wearable devices putational power, energy consump - healthcare through early detection of dis - must also execute offline data stream tion, storage and memory capacity ease and helping patients to make deci - mining. Indeed, the on-board data and bandwidth. sions about prevention and treatment. mining execution allows continuous • Context and resource changes. Ubiq - monitoring and real-time processing, uitous data stream mining has to deal For the P4 medicine approach to be suc - which can help with early disease detec - with patient situation changes and cessful, personal (patient-specific) data tion and prevention. Therefore, on- device disconnection. should be collected and mined at a large board execution could improve the

28 ERCIM NEWS 104 January 2016 Figure 1: An architecture overview of the device part of the Framework for P4 Medicine.

• Data quality. Despite accrued sophis - rithms concurrently over the network our academic partner LISA Laboratory – tication of current sensors, accuracy of wearable devices. This approach Université Libre de Bruxelles. These of gathered data is not always helps to address the computational notifications contribute to improve the ensured. Furthermore, variable data constraints and energy efficiency safety and quality of life of individuals rates can cause problems [2]. issues by distributing data processing affected by refractory epilepsy. across devices. Furthermore, the framework can provide To deal with these issues, we propose an • A context-aware and resource-aware context-aware recommendations to on-board personal data-mining frame - adaptation strategy is used to detect reduce seizure risk by analysing the trig - work for P4 medicine, currently under changes of patient situation or gering factors (e.g., stress level), conse - development in our research centre, devices and automatically adjust the quently helping to reduce seizure fre - relying on a network of wearable parameters of data mining algo - quency. Finally, the collected data could devices to monitor the patient in real- rithms. This approach is also used to be used by epilepsy specialists to better time based on P4 medicine principles: dynamically distribute the load of identify appropriate treatment. • Predicting the risk of developing dis - data mining algorithms based on eases or faintness (e.g., epilepsy resource availability. Links: seizure, heart attack) based on well- • A probabilistic data mining approach P4 Medicine Institute: defined patterns. The framework noti - is used to deal with data quality http://www.p4mi.org fies patient once a risk is identified. issues by computing data variation LISA Laboratory: lisa.ulb.ac.be • Preventing certain diseases or faint - and uncertainty. In this way, risk pre - ness by providing recommendations diction and prevention recommenda - References: that help to avoid or reduce the risk of tions should be improved because the [1] L. Hood and S. H. Friend: “Predic - developing them. full range of possible outcomes can tive, personalized, preventive, par - • Personalizing the prediction and pre - be taken into account. ticipatory (P4) cancer medicine,” vention recommendations by taking Nature Reviews Clinical Oncology, into account the context of patients To validate our proposed framework, we vol. 8, no. 3, pp. 184–187, 2011. (profile, medical history, actual envi - are considering epilepsy as a case study – [2] M.M. Gaber, J.B. Gomes, and F. ronment, etc.). in particular, refractory epilepsy patients Stahl: “Pocket data mining”. • Participation of the patients in the who do not respond to conventional Springer Book, 2014, ISBN: learning and refinement of the predic - treatment. Thirty percent of people with 9783319027111, pp. 1–112. tion and prevention models by pro - epilepsy have refractory epilepsy, [3] a [3] P. Kwan, and M.J. Brodie: “Early viding feedback about issued notifi - very debilitating form of the disease identification of refractory cation and recommendations. since the seizures are random and not epilepsy.” New England Journal of controllable with medication. The pro - Medicine, 2000, 342(5), 314-319. The proposed framework deals with the posed on-board data-mining framework on-board processing issues by com - can be used to continuously monitor Please contact: bining three approaches (see Figure 1): refractory epilepsy patients and notify Mohamed Boukhebouze, • A distributed data mining approach is them of seizure risks based on early CETIC, Belgium used to execute data mining algo - detection patterns that are developed by [email protected]

ERCIM NEWS 104 January 2016 29 Special Theme: Tackling Big Data in the Life Sciences

Can Data-driven Self-Management Reduce Low Back Pain?

by Kerstin Bach, Paul Jarle Mork and Agnar Aamodt

A new Horizon 2020 research and innovation project will start the development of the SELFBACK decision support system for self-management of low back pain in January 2016.

Low back pain is a common reason for Health Leads from the Netherlands as apps are available on the market, to the activity limitation, sick leave, and dis - well as the National Research Centre best of our knowledge, none of these ability. It is the fourth most common for the Working Environment in apps have documented effects by scien - diagnosis (after upper respiratory infec - Denmark and the University of tific publications and none include an tion, hypertension, and coughing) seen Southern Denmark. active decision support system. In con - in primary care [1]. An expert group trast, we will document the effective - concluded that the most effective Self-management in the form of phys - ness of the SELFBACK decision sup - approach to manage non-specific low ical activity and strength/stretching port system by conducting a nine-month back pain is to discourage bed rest, to exercises constitute the core component randomized control trial. use over-the-counter pain killers in the in the management of non-specific low acute stage if necessary (e.g., to be able back pain; however, adherence to self- The SELFBACK system will constitute to sleep), reassure the patient about the management programs is poor because a data-driven, predictive decision sup - favourable prognosis, advise the patient it is difficult to make lifestyle modifica - port system that uses the Case-Based to stay active, and advise strength and/or tions with little or no additional support. Reasoning (CBR) methodology to cap - stretching exercise to prevent recurrence Moreover, the way the self-manage - ture and reuse patient cases in order to [2]. ment advice is presented, and whether suggest the most suitable activity plans the advice is followed up by reinforce - and goals for an individual patient. In January 2016 a new project funded by ment have substantial impact on adher - Based on personal preferences and clin - the European Commission’s Horizon ence to a self-management plan and ical guidelines, the SELFBACK system 2020 research and innovation pro - progress of symptoms. In the SELF - will guide patients to adjust their gramme will start the development of BACK project we will develop and doc - lifestyle in order to optimize their man - the SELFBACK decision support ument an easy-to-use decision support agement of low back pain, based on system for self-management of low back system to facilitate, improve and rein - patient’s activity data collected using a pain. The consortium includes the force self-management of non-specific wristband. The incoming data will be Norwegian University of Science and low back pain. The decision support analysed to classify the patient’s activi - Technology (coordinator), the system will be conveyed to the patient ties and matched against the proposed University of Glasgow and the Robert via a smart-phone app in the form of activities in order to follow-up and Gordon University in Aberdeen from the advice for self-management. Even advise the patient. The challenge of the UK, the SMEs Kiolis from France and though there are about 300 pain-related data analysis is activity pattern detec - tion, and matching it against existing patient profiles in order to suggest activity goals aiming at the most favourable outcome for the patient. Therefore we will enrich the data stream and the resulting patterns with the patient's symptom state, symptom progression and goal-setting.

Figure 1 gives an overview of the SELFBACK proposed data and deci - sion process model that contains five processing modules, of which four modules target self-management by the patient while the fifth module targets the possibility for co-decision making between the patient and a clinician.

The initial screening module (1) acquires basic patient data, which is sent to the SELFBACK server. This data will be collected through a web - page and will provide the starting point Figure 1: SELFBACK data and decision process model. for assessing the patient's status and for

30 ERCIM NEWS 104 January 2016 running the self-management planning Overall, the aim of the SELFBACK Please contact: module (4) for the first time. The plan system is to solve the problems of rein - Kerstin Bach, Paul Jarle Mork, Agnar for self-management will be updated forcement and adherence to a self-man - Aamodt and revised as more patient data is agement plan by offering an evidence- Norwegian University of Science and added. The activity logging module (2) based system that allows personalized Technology runs continuously since it provides data follow-up and advice to the patient. E-Mail: [email protected], from the wristband regarding the [email protected], patient's physical activity and sleep pat - Link: [email protected] terns. The main goal of this module is to SELFBACK project: verify whether the patient follows the http://research.idi.ntnu.no/selfback/ suggestions in the recommended plan (e.g., for physical activity). As for the References: initial screening data, the activity log [1] P. Wändell et al.: “Most common data is also sent to the server, and diseases diagnosed in primary care becomes input to the periodic run of the in Stockholm”, Sweden, in 2011. self-management planning module (4). Fam Pract 30:506-513, 2013. The user interaction module (3) is basi - [2] M. van Tulder et al.: “Chapter 3. cally a question/answering module, typ - European guidelines for the man - ically initiated by the patient when using agement of acute nonspecific low the SELFBACK app and browsing back pain in primary care. Eur through the given information. Spine J 15:s169-s191, 2006.

Twitter can help to find Adverse Drug Reactions by Mark Cieliebak, Dominic Egger and Fatih Uzdilli

Drugs are great! We all need and use drugs every now and then. But they can have unwanted side- effects, referred to as “adverse drug reactions” (ADRs). Although drug manufacturers run extensive clinical trials to identify these ADRs, there are still over two million serious ADRs in the U.S. every year – and more than 100,000 patients in the U.S. die due to drug reactions, according to the U.S. Food and Drug Administration (FDA) [1]. For this reason, we are searching for innovative and effective ways to find ADRs.

Identifying ADRs is an important task to automatically find relevant messages, fully applied, for instance, in market for drug manufacturers, government to “understand” their content, and to monitoring, customer support and agencies, and public health. One way to extract structured data about the drugs social media analysis. identify them before a drug goes to and (unwanted) reactions. market is through clinical trials. State-of-the-art Governments worldwide also have A typical approach for ADR detection Our ADR system, which implements diverse surveillance programs in order uses Natural Language Processing the technologies shown above, achieves to identify ADRs once the drugs are in (NLP) to analyze tweets automatically. a success rate of 32% (measured in F1- use by consumers. For example, official Input for the system is the entire stream score). This is comparable to other aca - websites such as MedWatch allow both of Twitter messages. Each individual demic ADR systems: in an open inter - patients and drug providers to submit tweet is analyzed, using a classification national competition this year, even the ADRs manually. However, only a very system as shown in Figure 1: The tweet best systems achieved only a success small fraction of all ADRs is submitted is preprocessed and a set of relevant rate of approximately 40% [2]. to these systems – experts estimate that properties (“features”) is extracted. over 90% of all reactions go unreported. Then, a classifier decides whether the For a preliminary evaluation on real- tweet mentions an ADR. This classifier world data, we applied our ADR system Twitter can help! is based on machine learning and was to the full Twitter stream. The low pre - On the other hand, there are millions of trained beforehand on thousands of cision of the system resulted in 20% of messages on Twitter that discuss med - sample tweets that were tagged by all tweets being classified as ADR. This ications and their side-effects. These humans. Finally, a system for named is way too high; there are not that many messages contain data on drug usage in entity extraction is used to output a drug ADR tweets in the Twitter stream. For much larger test sets than any clinical name and associated ADRs. this reason, we pre-filtered the stream trial will ever have. Inspired by this, with a list of 1678 drug names. Out of research teams worldwide, including our This approach is similar to technologies about 50 million tweets, this resulted in team at Zurich University of Applied for sentiment analysis, which decide 13,000 tweets referencing drugs. Using Sciences, are beginning to utilize these whether a tweet is positive or negative. the ADR system on this reduced set messages for ADR detection. The goal is Sentiment analysis is already success - yielded 2800 tweets. We expect to find

ERCIM NEWS 104 January 2016 31 Special Theme: Tackling Big Data in the Life Sciences

Figure 1: Typical system for ADR detection using machine learning.

more than 60% of these to be true ADR improved existing benchmark scores Prospective Studies”, JAMA tweets. there. Applying these technologies to 279(15):1200-1205, 1998. ADR detection will probably help to [2] Pacific Symposium on Biocomput - Future improvements increase the detection rate significantly. ing: http://psb.stanford.edu/work - Automatic detection of ADRs on In addition, our team is working on a shop/wkshp-smm/ Twitter (or other social media channels) system that not only analyzes the text of [3] R. Ginn et al.: “Mining Twitter for is still a very young discipline, which a tweet, but also its context: the timeline Adverse Drug Reaction Mentions: only started some five years ago. There of the user, other messages in the same A Corpus and Classification are only a few teams working on the geographic or temporal context etc. Benchmark”, BioTxtM, 2014. topic at the moment, and a first large- This will allow us to “step back” from scale benchmark dataset was only pub - an isolated event (a single tweet) and Please contact: lished in 2014 [3]. However, we expect see the “whole picture” of the discourse Mark Cieliebak a significant improvement in detection on Twitter. School of Engineering rates in the near future, owing in part to Zurich University of Applied Sciences the existence of several new technolo - References: (ZHAW) gies in machine learning, such as word [1] J. Lazarou, B.H. Pomeranz, and Tel: +41 58 934 72 39 embedding and deep learning. These P.N. Corey: “Incidence of Adverse E-mail: [email protected] have already been successfully applied Drug Reactions in Hospitalized to other text analysis tasks and have Patients: A Meta-analysis of

Trust for the “Doctor in the Loop”

by Peter Kieseberg, Edgar Weippl and Andreas Holzinger

The "doctor in the loop" is a new paradigm in information driven medicine, picturing the doctor as authority inside a loop supplying an expert system with data and information. Before this paradigm is implemented in real environments, the trustworthiness of the system must be assured.

The “doctor in the loop” is a new para - knowledge, but the inclusion of data on ingly frequent use of knowledge dis - digm in information driven medicine, actual patients, as well as treatment covery for the enhancement of medical picturing the doctor as authority inside a results and possible additional (side-) treatments together with the “human in loop with an expert system in order to effects that relate to previous decisions the loop” concept (see [1], for instance): support the (automated) decision of this semi-automated system. The expert knowledge of the doctor is making with expert knowledge. This incorporated into "intelligent" systems information not only includes support in The concept of the "doctor in the loop" (e.g., using interactive machine pattern finding and supplying external is basically an extension of the increas - learning) and enriched with additional

32 ERCIM NEWS 104 January 2016 information and expert know-how. multitude of internal mechanisms that best practices and guidelines in the area Using machine learning algorithms, have various functions, such as of data driven medical research. medical knowledge and optimal treat - enhancing performance. This approach ments are identified. This knowledge is utilizes the transaction mechanism for Special focus in CBmed then fed back to the doctor to assist providing transaction safety (i.e., if a CBmed [L2] is an interdisciplinary him/her (see Figure 1). crash occurs, the database is brought research centre, founded in 2014, with back into a well-defined state) in order the aim of providing biomarker to protect the information sent to the research, mainly in the areas cancer, Manipulation Security and Trust system by the doctor against manipula - metabolism and inflammation. One of The implementation of the doctor in the tion, even if the manipulation is carried the core features of this centre is the loop concept has met several challenges out by the administrator of the database. tight incorporation of ICT as a hori - – both of a technical nature and in other zontal research area that provides data areas. One challenge is gaining the Legal Issues and techniques to all other projects. acceptance of such systems by doctors Legal issues are often overlooked by This does not only apply to applications themselves, who are often not purely technical solutions, but are vital associated with the doctor in the loop, researchers, but medical practitioners. for any doctor that is actually partici - but also other areas, such as data and privacy protection, the development of new data mining techniques and tools for the efficient analysis of large amount of “–omics” data.

For the CBmed consortium, the model of the doctor in the loop offers abundant possibilities, especially in the area of data driven research. Considering the recent surge in big data related research and developed tools, this approach is expected to be one of the major drivers in medical research in the years to come.

Links: [L1]: https://rd-alliance.org/rda-work - ing-group-data-security-and-trust- wgdst-p6-bof-session.html Figure 1: The Doctor in the Loop. [L2]: http://www.cbmed.org/en/

References: While privacy and security in biomed - pating in such an approach as expert. [1] W. S. Levine, ed.: “The control ical data applications have been dis - Important questions that need to be handbook”, CRC press, 1996. cussed extensively in recent years (see addressed include: What information [2] P. Kieseberg, H. Hobel, S. [2] for an overview), the topic of trust can be shared between the doctor and Schrittwieser, E. Weippl, A. has been neglected. Nevertheless, it is the expert system, and what levels of Holzinger, “Protecting anonymity in very important that the trustworthiness data and privacy protection need to be data-driven biomedical science”, in of the systems can be guaranteed in applied? And who is responsible for the “Interactive Knowledge Discovery order to make the abstract concept of a correctness of the results derived from and Data Mining in Biomedical doctor in the loop practicable. the combination of human knowledge Informatics” (pp. 301-316), and machine learning algorithms? Springer Berlin Heidelberg, 2014. In this medical context, it is absolutely Although this is an important field, no [3] P. Kieseberg, J. Schantl, crucial to ensure that the information guidelines are currently available. The P.Fruehwirt and E. R. Weippl, A. sent by the doctor to the machine issue is further complicated by the dif - Holzinger: “Witnesses for the learning algorithms cannot be manipu - ferences between national legislations Doctor in the Loop," in 2015 lated following its submission. In order even between member states of the International Conference on Brain to guarantee this, a new approach for European Union. Defining workflows Informatics & Health (BIH), 2015. protecting the underlying data based on that clinical doctors can reliably apply a hash chain has been proposed explic - without the fear of prosecution lies thus Please contact: itly for doctor in the loop systems (see in the focus of the planned RDA Peter Kieseberg [3]). This approach takes advantage of (Research Data Alliance) Workgroup SBA Research, Vienna, Austria the fact that large amounts of data are “Security and Trust” that held its “Birds E-mail: [email protected] typically stored in databases. While of Feather” session on the 6th RDA these are often identified with the Plenary on September 24th in Paris abstract data model, they are in reality [L1]. One of the major goals of this complex software systems providing a workgroup is to draw together a set of

ERCIM NEWS 104 January 2016 33 Special Theme: Tackling Big Data in the Life Sciences

Big Data Takes on Prostate Cancer

by Erwan Zerhouni, Bogdan Prisacari, Qing Zhong, Peter Wild and Maria Gabrani

Most men, by the time they reach 80 years of age, get prostate cancer. The treatment is usually an operation or irradiation, which sometimes has complications. However, not every tumour is aggressive, in which case there is no urgent need to remove it. Ascertaining whether a tumour is aggressive or insignificant is difficult, but analysis of big data shows great promise in helping in this process.

Prostate cancer (PC) represents the such as immunohistochemistry (IHC) (IHC) stained tissue images. IHC assays second leading cause of cancer related and in situ hybridization (ISH) enable provide significant information with deaths in the Western world. PC is typi - the evidencing of molecular expression respect to cellular heterogeneity and cally diagnosed on the basis of increased patterns through multicolour visualiza - disease progression [1], and are thus levels of serum protein PSA (prostate tion. Such techniques are commonly used to identify patients most likely to specific antigen) together with digital used for targeted treatment response respond to targeted therapy. However, rectal examination, and is confirmed by estimation and monitoring, increasing tissue morphology may have different prostate needle biopsies. However, PSA the need for high-throughput and unbi - molecular signatures (see Figure 1) and biopsies often fail to distinguish ased digital solutions. owing to genetics and other parameters, between clinically indolent and aggres - such as environment and lifestyle. sive forms, leading to overtreatment Emerging initiatives in moving hospi - Currently, IHC image analysis focuses such as unnecessary prostactectomies tals into the digital era use bright-field on the staining intensity performed and irradiation that sometimes greatly and fluorescence scanners to convert mostly in a manual and thus low deteriorates a patient’s quality of life. glass slides of tissue specimens and throughput and biased way. Emerging needle biopsies to virtual microscopy computational techniques use metrics, Currently, pathologists assess patient images of very high quality. Slide such as the H-score, or the Aperio biopsies and tissue resections under a images are huge, with several thousand metric [2]. Recent studies, however, microscope, leading to diagnoses that pixels per axis, turning digital image show that to tailor a patient's treatment are affected by subjective judgment and analysis into a big data problem. and to monitor treatment progression, intra- and inter-observer variability. The Integration of digitized tissue speci - the staining intensity needs to be corre - procedure is time consuming and hence mens with an image analysis workflow lated to the grade of disease; that is, to low-throughput, and hospitals may gen - allows objective and high-throughput the morphological and cellular architec - erate hundreds or even thousands of evaluation of tissue slides, transforming ture that define cancer and many dis - tissue samples per day. This number is surgical pathology into a truly quantita - eases. expected to increase sharply, as the tive and data-driven science. World Health Organization predicts the In the developed framework, we use a number of cancer diagnoses to increase We have developed a computational pre-trained convolutional network to by 70% in the next two decades. framework to extract unique protein extract features that capture both mor - Moreover, novel staining technologies, signatures from immunohistochemistry phology and colour at image patches around identified stained cells. We gen - erate a feature dictionary, whose size is data dependent. To represent tissue and tumour heterogeneity, we capture spa - tial layout information using a commute time matrices approach. We then use the commute time matrix to compute a unique signature per protein and disease grade. The goal of this work is to eval - uate whether the expression of the pro - teins can be used as pathogenesis predi - cator and quantify the relative impor - tance of each protein at the different dis - ease grades. To this end, we evaluate them in the task of classification indi - vidually and in combination. For the individual evaluation, we use a random forest classifier. To evaluate the collec - tive contribution of the proteins, we use a multiple kernel learning (MKL) Figure 1: IHC prostate images indicating the difficulty of sample analysis. Emphasizing the approach. We tested the proposed PTEN (top) and ERG (bottom) proteins. Left to right: normal, low, and intermediate grade framework on a PC tissue dataset and stages of cancer development. Figure taken from Zerhouni et al [3]. demonstrated the efficacy of the derived

34 ERCIM NEWS 104 January 2016 protein signatures for both disease strat - Links: [2] A. Rizzardi et al.: “Quantitative ification and quantification of the rela - http://www.zurich.ibm.com/mcs/systems comparison of immunohistochemi - tive importance of each protein [3]. _biology cal staining measured by digital http://www.wildlab.ch/ image analysis versus pathologist The technique already shows promise, visual scoring”, Diagnostic Pathol - and we expect even greater accuracy in References: ogy 2012, 7:42. our future tuning of these techniques. [1] Q. Zhong et. al.: “Computational [3] E. Zerhouni et al.: “A computational Owing to the limited number of images profiling of heterogeneity reveals framework for disease grading in our dataset, training a CNN was not high concordance between mor - using protein signatures”, IEEE feasible. To this end, we are currently phology- and proteomics-based ISBI 2016 (accepted). investigating the use of a convolutional methods in prostate cancer”, DGP auto-encoder for unsupervised feature 2015. Please contact: extraction. Furthermore, we plan to test Maria Gabrani the framework on larger datasets and IBM Research Zurich more proteins. E-mail: [email protected]

Mining Electronic health Records to Validate Knowledge in Pharmacogenomics by Adrien Coulet and Malika Smaïl-Tabbone

Most of the state of the art in pharmacogenomics (PGx) is based on a bank of knowledge resulting from sporadic observations, and so is not considered to be statistically valid. The PractiKPharma project is mining data from electronic health record repositories, and composing novel cohorts of patients for confirming (or moderating) pharmacogenomics knowledge on the basis of observations made in clinical practice.

Pharmacogenomics (PGx) studies how fulfil statistics validation standards and tories make it possible to assemble individual gene variations cause vari - are difficult to reproduce because of the cohorts of patients to study the impact ability in drug responses. A state of the rarity of gene variations studied of gene variations on drug responses on art of PGx is available and constitutes a (making it hard to recruit sufficiently the basis of practice-based data [2]. basis for implementing personalized large cohorts) and of the multifactorial medicine, i.e., medical treatment tai - aspects of drug responses [1]. The In October, 2015, The French National lored to each patient taking into account increasing use of electronic health Research Agency (ANR) committed his/her genomic context. However, most records (EHRs) generates large reposi - funding to a project called of the state of the art in this domain is tories that offer new opportunities, such PractiKPharma (Practice-based evidence not yet validated, and consequently not as composing patient cohorts for the for actioning Knowledge in yet applicable to medicine. Most infor - study of clinical hypotheses hard to test Pharmacogenomics). The project aims mation results from studies that do not experimentally. Typically, EHR reposi - to validate or moderate PGx state-of- the-art (SOTA) knowledge on the basis of practice-based evidence, i.e., knowl - edge extracted from EHRs. Units of knowledge in PGx typically have the form of ternary relationships gene variant-drug-adverse event, and can be formalized to varying extents using bio - medical ontologies. To achieve this goal, the PractiKPharma consortium will focus on four objectives, illustrated in Figure 1: (1) to extract SOTA knowl - edge from PGx databases and literature; (2) to extract observational knowledge (i.e., knowledge extracted from observa - tional data) from EHRs; (3) to compare knowledge units extracted from these two origins, to confirm or moderate SOTA knowledge, with the goal of enabling personalized medicine; (4) Finally, to emphasize newly confirmed Figure 1: Outline of the four objectives of the ANR PractiKPharma project. knowledge, omics databases will be

ERCIM NEWS 104 January 2016 35 Special Theme: Tackling Big Data in the Life Sciences

investigated for molecular mechanisms • enabling multilingual semantic anno - Link: that underlie and explain drug adverse tation of EHRs; PractiKPharma project: events. This investigation will use and • methods for representing and com - http://practikpharma.loria.fr contribute to the biomedical Linked paring SOTA and observational Open Data [3]. knowledge; References: • a database that maps genotypes to [1] J. P.A. Ioannidis: “To replicate or The PractiKPharma consortium com - quantitative traits to facilitate the not to replicate: The case of pharma - prises four academic partners: two com - study of PGx with EHRs; cogenetic studies”, Circulation: Car - puter science laboratories, the LORIA • the completion and connection of diovascular Genetics, 6:413–8, 2013. (Laboratoire Lorrain de Recherche en Linked Open Data related to PGx; [2] A. Neuraz et al.: “Phenome-wide Informatique et ses Applications, • methods for hypothesizing on mecha - association studies on a quantita - Nancy, France) in Nancy, the LIRMM nisms of adverse events and validated tive trait: Application to TPMT (Laboratoire d’Informatique, de Robot - PGx knowledge units. enzyme activity and thiopurine ique et de Microélectronique de Mont - therapy in pharmacogenomics, pellier, France) in Montpellier; and two Overall, the final goal of PLoS Computational Biology, University Hospitals, the HEGP (The PractiKPharma is to provide clinicians 9(12), 2013. Georges Pompidou European Hospital, with actionable PGx knowledge to [3] K. Dalleau, N. Coumba Ndiaye, A. Assistance Publique – Hôpitaux de Paris establish guidelines that when imple - Coulet: “Suggesting valid pharma - (AP-HP), France) in Paris, specialized mented in personalized medicine will cogenes by mining linked data”, in in EHRs management and pharmacoge - reduce drug adverse events, and Proc. of the Semantic Web Applica - nomics, and the CHU Saint-Etienne improve the quality of clinical care. tions and Tools for the Life Sciences (University Hospital (CHU) of Saint (SWAT4LS) Conference, 2015. Etienne, France), specialized in pharma - The PractiKPharma consortium will covigilance. hire two PhD students, two post-docs, Please contact: one pharmacist and one research engi - Adrien Coulet, LORIA, Nancy, France, PractiKPharma will have impacts both neer to contribute to the project. In addi - Tel: +33 3 54 95 86 38 in computer science and biomedicine, tion, PractiKPharma will foster collabo - E-mail: [email protected] including: ration and dissemination throughout the Malika Smaïl-Tabbone, LORIA, • the development of novel methods EU to take into consideration EHRs of Nancy, France for knowledge extraction from text various populations and to establish Tel: + 33 3 83 59 20 65 and EHRs; international projects. E-mail: [email protected]

Modelling the Growth of Blood Vessels in health and Disease

by Elisabeth G. Rens, Sonja E. M. Boas and Roeland M.H. Merks

Throughout our lives our blood vessels form new capillaries whose insufficient or excessive growth is a key factor in disease. During wound healing, insufficient growth of capillaries limits the supply of oxygen and nutrients to the new tissue. Tumours often attract capillaries, giving them their own blood supply and a route for further spread over the body. With the help of biological and medical colleagues our team develops mathematical models that recapitulate how cells can construct new blood vessels. These models are helping us to develop new ideas about how to stimulate or stop the growth of new blood vessels.

In collaboration with the two academic sels, and they secrete new matrix pro - structures, such as networks and blood hospitals in Amsterdam our research teins to strengthen themselves. Our col - vessel sprouts. Our models describe the group at CWI develops mathematical laborators in the academic hospitals behaviour of the individual cells: how models to help unravel the growth of mimic angiogenesis, by culturing they move, the strength of their adhe - capillaries, the smallest blood vessels in endothelial cells inside gels of extracel - sion to adjacent cells, their shape, and our body. The building blocks of capil - lular matrix proteins. In response to their response to signals from adjacent laries are endothelial cells, a cell type growth factors and other signals in the cells and from the extracellular matrix. that covers the inner walls of all our extracellular matrix, the cells aggregate If we understand how cells must behave blood vessels, and the extracellular and organize into networks of blood- to form a new blood vessel, perhaps matrix, a scaffolding of structural pro - vessel-like sprouts. then we can also learn how to use phar - tein fibres that keeps cells and tissues maceuticals to fine-tune that behaviour together. Together with auxiliary cell The goal of our mathematical modelling so the cells do what we want them to do. types, the endothelial cells remodel the project is to find out the rules that cells If we could do this for a relatively extracellular matrix to form blood ves - must follow in order to form biological simple mechanism like blood vessel

36 ERCIM NEWS 104 January 2016 from cell to cell. Recently we have started to explore how such mechanical cell-cell signalling can help coordinate the behaviour of endothelial cells. We extended our models of angiogenesis with a finite-element model of the A BC extracellular matrix [2]. The endothelial cells apply contractile forces on the matrix. The finite-element model then calculates the strains resulting from these forces, which, in turn, affects the membrane movements of the cells. These relatively few assumptions repro - duced observations on how matrix stiff - D EF ness determines the shape of single cells, how adjacent cells interact with Figure 1: Models of blood vessel growth. (A-C) Growth of a sprout and “overtaking” of the tip one another, and how cells form vas - cell. Colour scale indicates state of the embedded genetic network models; red: high expression cular-like sprouts and networks of tip cell marker Dll4; blue: low Dll4 expression. Figure modified from Ref. [1]. (D-F). Model (Figure 1). of mechanical interaction between cells and extracellular matrix. Due to the cell-matrix interactions, the cell elongated (D), they align with adjacent cells (E), and they form networks Now that our models have given us (F). Figure by EGR, reprinted in modified form from “Merks. ITM Web of Conferences, in some first clues on how cells build press” according to the model described in Ref. [2]. Figures A-F reproduced under the terms of blood vessels, we are taking these the Creative Commons Attribution 4.0 International License insights back to our experimental col - (http://creativecommons.org/licenses/by/4.0/). laborators. The basic understanding of angiogenesis captured by our models is employed in more specific models of growth, this might give us new ideas for molecular regulation of this phenom - diseases, e.g., diabetic retinopathy. The other steps in the development of multi - enon. We discovered that such tip cell models give the experimentalists new cellular organisms. overtaking occurs naturally in our ideas on how to “nudge” the endothelial models, as a side effect of the cell cells to do, or not do, what they want. In an initial set of models, we assumed behaviours responsible for sprouting. Of course, biology still has many sur - that cells attract one another via chem - We integrated the molecular networks prises in store for us; rarely do experi - ical signals. It is not difficult to see that responsible for tip-stalk cell differentia - ments agree with our models in every this rule will not suffice for forming net - tion inside each of the simulated detail. Frustrating as this may sound, works: It simply makes the cells accu - endothelial cells, and could reproduce this is actually when our models mulate into small aggregates. To our many of the experimental observations become most useful. Mismatches surprise, however, the same rules could that were taken as evidence for the between model and experiment high - make cells form networks if the cells molecular regulation of tip cell over - light the areas in vascular biology that had an elongated shape. This makes the taking [1] (Figure 1). Thus our models are not yet fully understood. aggregates stretch out, and connect to suggest that the molecular networks adjacent aggregates such that a network may be there to ensure that the cell that Links: emerges. But the cells did not need to be happens to end up in front takes on the Group website: elongated: if instead the cells became role of leader. http://biomodel.project.cwi.nl insensitive to the attractive signals at the positions of their membranes where Despite the insights we acquired from References: they touched one another, they also our initial models, they ignored a key [1] S. E. M. Boas and R. M. H. Merks: organized into networks. component in angiogenesis. Most cells “Tip cell overtaking occurs as a in our body are surrounded by a fibrous side effect of sprouting in computa - Our recent work employs these two or jelly material called the extracellular tional models of angiogenesis,” models to gain greater insight into the matrix, or “matrix” for short. It has a BMC Systems Biology 9:86, 2015 mechanisms of angiogenesis. Our initial structural role – the matrix gives our [2 R. F. M. van Oers, E. G. Rens, D. models ignored that in actual angiogen - bones their strength and keeps cells J. LaValley, et al.: “Mechanical esis, two types of endothelial cells are together – but recently it has become Cell-Matrix Feedback Explains involved: the tip cells that lead the clear that it also acts as a medium for Pairwise and Collective Endothe - sprouts, and the stalk cells that form the cell-cell signalling. Cells deposit lial Cell Behavior In Vitro“, PLoS body of the sprout by ``stalking’’ the tip growth factors into the matrix, which Computational Biol., 10(8): cells. Recent experimental observations other cells find later on, as if cells leave e1003774, 2014. have shown that one of the stalk cells “written” notes for one another. Also, regularly replaces the tip cell, much like cells apply mechanical forces on the Please contact: migratory birds take turns in the leading matrix, and they respond to forces they Roeland M.H. Merks position in front of the “V”. Much feel in their surroundings. In this way CWI, The Netherlands experimental work has focused on the the matrix can relay mechanical signals E-mail: [email protected]

ERCIM NEWS 104 January 2016 37 Special Theme: Tackling Big Data in the Life Sciences

Modelling? Using Standards Can help you

by Brett G. Olivier and Bas Teusink

Almost everything one does relies on the use of standards, from vehicle safety, to smartphone design. Here we use an idealised scenario to illustrate these standards for modelling.

Standards play an important role in our consensus yeast metabolic network multiple files, such as the two source everyday life; they define how our appli - which he downloads in the SBML Level models as well as the Python scripts ances communicate with one another 3 format. used in the analysis. In addition it can and the address format of our last email. include metadata about these files, the While ubiquitous in experimental sci - SBML Level 3 has a modular design analysis itself as well as the software ences, the application of standards to and while the language core is analo - used to generate the results. The COM - mathematical modelling in the life sci - gous to SBML Level 2, its functionality BINE archive is therefore an ideal way ences is relatively new. With its integral can be extended through the use of of storing and exchanging entire model - use of computer modelling many of packages. In this case the model makes ling experiments. these standardisation efforts have taken use of the Flux Balance Constraints root in the relatively new field of sys - (FBC) package which adds elements Unfortunately, this is not yet a typical tems biology. used in constraint-based modelling such day for the majority of modellers in sys - as flux bounds and objective functions tems biology. The challenge is to With the field now moving to big data [1]. In addition packages exist for hier - expand and implement support for open and whole-cell modelling, standardisa - archical model composition as well as standards in as many tools as possible tion becomes even more important. The the definition of qualitative and spatial thereby enabling the development of a “whole cell model” will consist of many models. new in silico biology. sub-models that are in themselves data that need to be exchanged between To get a feel for his newly acquired Links: researchers and groups – in ways not model, he loads the model into the Modelling Software: originally envisioned by their original online CBM modelling system FAME http://teusinklab.nl/modelling- authors. Beyond model exchange, the methods-and-tools use of standards enables the develop - COmputational Modeling in BIology ment and usage of interoperable simula - NEtwork (COMBINE): tion software. This allows ensembles of http://co.mbine.org software with a diverse range of func - Biomodels database: tionality to be combined together. To http://biomodels.net illustrate this, let us consider an “ideal” SBML: http://sbml.org day in the life of a computational biolo - COPASI: http://copasi.org gist asked to investigate yeast glycolysis Yeast metabolic network in a “growing cell”. http://yeast.sourceforge.net FAME: http://f-a-m-e.org/ Deciding that this model will combine a CBMPy: http://cbmpy.sourceforge.net kinetic model of glycolysis with a Figure 1: The use of open standards for PySCeS: http://pysces.sourceforge.net genome-scale metabolic model, he first encoding models enables the integration of searches the Biomodels database and multiple models and tools. References: finds Teusink’s detailed kinetic model of [1] B.G. Olivier and F.T. Bergmann: glycolysis and downloads it in the “The Systems Biology Markup Systems Biology Markup Language Language (SBML) Level 3 Pack - (SBML) Level 2 format. and starts modelling. Deciding to com - age: Flux Balance Constraints”, bine both these models in a single Journal of integrative bioinformat - SBML is a widely used model-encoding analysis he loads them into two dif - ics, 12, 269, 2015. standard; using XML it aims to describe ferent SBML compatible, Python based [2] F. T. Bergmann et al.: “COMBINE biological processes in a tool inde - tools. Using CBMPy for the genome- archive and OMEX format: one pendent way. In addition to describing scale model and PySCeS for the kinetic file to share all information to the model components (e.g., metabo - model allows him to develop a new reproduce a modeling project”, lites, reactions, kinetic laws, parameters algorithm that combines both models BMC Bioinformatics, 15, 369, and compartments) it also provides a and tools in a single analysis. Having 2014. framework for their detailed annotation. succeeded, all that is left is to back up his work and send the analysis and Please contact: Loading the model into the SBML com - result to his colleagues. For this he Brett G. Olivier pliant COPASI software allows him to makes use of a COMBINE archive. Systems Bioinformatics view the model and its annotation and Vrije Universiteit Amsterdam, immediately start investigating its The COMBINE archive is a single file The Netherlands behaviour. Next he decides to use the in the OMEX format [2]. It can contain E-mail: [email protected]

38 ERCIM NEWS 104 January 2016 Understanding Metadata to Exploit Life Sciences open Data Datasets by Paulo Carvalho, Patrik Hitzelberger and Gilles Venturini

Open data (OD) contributes to the spread and publication of life sciences data on the Web. Searching and filtering OD datasets, however, can be challenging since the metadata that accompany the datasets are often incomplete or even non-existent. Even when metadata are present and complete, interpretation can be complicated owing to the quantity, variety and languages used. We present a visual solution to help users understand existing metadata in order to exploit and reuse OD datasets – in particular, OD life sciences datasets.

The Environmental Research and reuse [1], it is often non-existent or of several data sources – potentially from Innovation Department (ERIN) of the low quality. Sometimes metadata is different countries with metadata repre - Luxembourg Institute of Science and defined, but not always. When it is sented in different languages – are Technology (LIST) conducts research defined, it may be incomplete. implicated. and activities in environmental science Furthermore, there is no single common (including biology, ecology and other standard used to specify metadata: each We propose a visual solution to avoid areas of study) using advanced tools for data provider can choose how metadata such problems and to help users under - big data analytics and visualization. will be represented, and may or may not stand metadata and to search for and ERIN’s e-science unit is currently inves - follow a specific metadata standard, for find specific datasets. Why a visual tigating how open data (OD) datasets, example: Dublin Core Metadata solution? Data visualization can rapidly containing data pertaining to environ - Initiative (DCMI), Data Catalogue assimilate and recognize large amounts mental science and related areas, may be Vocabulary (DCAT), Metadata Objects of information [2]. The datasets and reused. The quality of metadata that Description Schema (MODS). This lack metadata information are stored in a accompanies such datasets is often poor. of consistency increases the difficulty database previously filled by an OD Despite metadata being essential for OD of exploiting metadata, especially when dataset downloader script. We assume that this task has been successfully done: it is not the focus of our work. The metadata-mapper offers an overview of every piece of metadata obtained from the datasets. The way the information is displayed is organised in three levels: • The first level organises the metadata according to whether or not the meta - Figure 1: Initial state of the metadata-mapper. data are assigned to a given research criteria. Research criteria are neces - sary to search datasets, e.g., to enable the user to search a dataset by theme, by date of publication, by language, etc. • The second level refers to the type of metadata value. Every metadata is formed by key-value pairs. Different types of metadata can exist. The value might be a number, a name, a date, an email-address, etc. • Finally, to improve visualization and understanding of all metadata keys, the metadata are organised by name. Several groups exist: they contain the metadata whose key starts with a given letter belonging to the group.

In addition to providing a better under - standing of existing metadata, our meta - data-mapper solution permits the user to establish links between metadata and Figure 2: The metadata of 543 datasets obtained from http://data.gov.au (search term: search criteria. This means that a spe - "biology"). cific metadata is used to apply a search

ERCIM NEWS 104 January 2016 39 Special Theme: Tackling Big Data in the Life Sciences

based on a precise criterion, for If it is impossible to understand and use or whether further modifications need instance: imagine we have a search cri - the metadata associated with individual to be made. teria “Language”. The user can create a datasets, the datasets cannot be used, link between this criteria and metadata and are effectively useless. It is vital References: “Language” and “Country”. The solu - that we are able to unlock the potential [1] N. Houssos, B. Jörg, B. Matthews: tion will convert this relation into a and value of OD datasets. The meta - “A multi-level metadata approach query similar to: “SELECT * FROM data-mapper solution delivers a support for a Public Sector Information Datasets WHERE Language = ? tool in order to understand metadata data infrastructure”, In Proc. of the UNION SELECT * FROM Datasets delivered with datasets but also to link 11th International Conference on WHERE Country = ?”. metadata with research criteria. This is Current Research Information Sys - only a first step towards making meta - tems, pp. 19-31, 2012. Figure 2 demonstrates the result of the data more usable and to permit the reuse [2] S. Card, J. D. Mackinlay, B. Shnei - chart showing all the metadata related of OD datasets. This solution will give derman: “Information visualization. to the datasets found and downloaded the user a global picture of all existing Human-computer interaction: from the http://data.gov.au OD portal metadata, their meaning, and how they design issues, solutions, and appli - after searching the term “biology”. can be used to find datasets. The visual cations, 181, 2009. Five hundred and forty three different approach enables large amounts of datasets were found and downloaded. metadata to be shown at one time. Please contact: 17,435 metadata sets were obtained However, the solution has to be tested Paulo Carvalho along with the datasets. These 17,435 with large numbers of datasets in order Luxembourg Institute of Science and metadata sets contain 34 different to test whether the methodology can Technology keys. deal with massive amounts of metadata E-mail: [email protected]

Management of Big and open Data in the Life Cycle Assessment of Ecosystem Services

by Benedetto Rugani, Paulo Carvalho and Benoit Othoniel

When defined, metadata information that accompanies Big and Open Data (OD) datasets may be hard to understand and exploit. A visual approach can support metadata re-use in integrated ecological-economic modelling. A method that allows specific model datasets to be regularly and consistently updated may improve their readability for use in the Life Cycle Assessment (LCA) modelling of ecosystem services.

“Ecosystem services” is a term used to to develop a novel approach to simulate rating, managing and running a large describe the benefits humans get from the value of biodiversity and ecosystem amount of data (mostly OD), such as the Earth’s natural capital (e.g., forest services within the LCA methodology. georeferenced information, and ecolog - biomass, water purification, soil quality, This work is being conducted by ical/biophysical and process life cycle etc.). These flows reflect the interaction researchers within the RDI Unit on Life data. These need to be stored in the among all capitals of biosphere and Cycle Sustainability and Risk model to describe all the relationships anthroposphere. Complex interdiscipli - Assessment (LiSRA) at the between modules, sub-models and com - nary modelling is necessary to study Environmental Research & Innovation partments of MIMES, e.g., to describe these interactions and to assess (ERIN) Department of the Luxembourg the interactions between the emission of ecosystem service values, thus a big data Institute of Science and Technology pollutants and wastes from energy pro - (BD) collection and elaboration effort is (LIST). The team is adapting MIMES duction or transport systems and their often required. BD have also been put to (Multi-scale Integrated Model of effects on the receiving ecological and good use in the Life Cycle Assessment Ecosystem Services) [2] to the LCA human compartments, i.e., air, soil, (LCA) framework [1], which is one of framework. The main challenge is to water, and the human body. the most robust and commonly applied model the nexus among multiple ele - methodologies to evaluate the impact of ments of the biosphere and the anthro - The global version of MIMES already human activities on ecosystems. posphere at different geographical and contains thousands of variables and However, there is no consensus on how temporal scales, generating a holistic process modelling parameters. to evaluate ecosystem services in LCA. integrated representation of the local However, improving the existing This is due, in part, to the lack of con - (e.g., country scale of Luxembourg), MIMES relationships and model trans - sensus on the standardization and use of multi-regional (e.g., Europe) and global parency requires the incorporation of data and models, which can be broadly relationships between economic and additional “heavy” datasets and meta - retrieved from OD sources. ecological systems. data layers, respectively. As a conse - quence, the number of model reference The aim of the VALUES project (see at: One of the major tasks of VALUES is to variables is expected to increase in the http://www.list.lu/en/project/values/) is solve the problem of collecting, elabo - range of 40-70% by the end of

40 ERCIM NEWS 104 January 2016 Figure 1: Simplified approach for data management in the VALUES project.

VALUES in 2017. On one hand, a cer - meaning (e.g., bottom-left box in Links: tain level of model aggregation needs to Figure 1 represents the metadata http://www.list.lu/en/project/values/ be maintained or achieved to avoid existing for the EU dataset on the asso - http://sourceforge.net/projects/mimes/ computational issues. On the other ciation of species and habitat types to http://www.simulistics.com/ hand, a large amount of datasets and ecosystems: database sources needs to be harmo - http://www.eea.europa.eu/data-and- References: nized in one unique modelling frame - maps/data/linkages-of-species-and- [1] J. Cooper, et al.: “Big Data in Life work. habitat/). The LiSRA and e-Science Cycle Assessment”, Journal of RDI Units of ERIN/LIST are currently Industrial Ecology, vol. 17, pp. As shown in Figure 1, the complexity of working to overcome such data model - 796-799, 2013. sub-models in MIMES can easily be ling obstacles [3]. [2] R. Boumans et al.: “The Multiscale increased when amplifying the resolu - Integrated Model of Ecosystem tion of assessment or the detail of By considering MIMES as test-bed case Services (MIMES): Simulating the process connections. Moreover, a large to develop a methodology of OD man - interactions of coupled human and amount of metadata information must agement and metadata visualization, natural systems”, Ecosystem Serv - be stored in the modular system of com - VALUES can be optimized to guarantee ices, vol. 12, pp. 30-41, 2015. partmental relationships across vari - the success of modelling outputs and [3] P. Carvalho, P. Hitzelberger, and G. ables, and then regularly upgraded with increased system transparency. Venturini: “Understanding Open information collected from the same or Furthermore, the use of OD collection Data CSV File Structures for other BD-like sources (e.g., statistical and visualization techniques in Reuse”, in ERCIM News 100 (Jan - datasets, climate measurements, biolog - VALUES can support future expansions uary 2015), ed ERCIM EEIG, ical engineering databases). Most of the and cross-fertilizations of such IT 2015, p. 36. metadata information is then screened, approaches to access BD within the modified and linked to upstream .CSV broader LCA framework. This project Please contact: files containing the original input represents a tangible example that may Benedetto Rugani datasets necessary to run the model. pave the way towards more effective Luxembourg Institute of Science and and transparent management of meta - Technology (LIST) However, metadata is often missing or data in complex life cycle process unit Tel: +352 2758885039 difficult to understand, posing addi - databases, making a valuable contribu - E-mail: [email protected] tional problems related to the trans - tion to current state-of-the-art practice parency of the modelling approach. and tools for LCA [1]. These complications limit the usability of the data, and the reproducibility of the models that need constant updates. When defined, metadata information may be set up in different languages, under various data types and using dif - ferent keys to represent the same

ERCIM NEWS 104 January 2016 41 Special Theme: Tackling Big Data in the Life Sciences

WITDoM: Empowering Privacy and Security in Non-trusted Environments

by Juan Ramón Troncoso-Pastoriza and Elsa Prieto Pérez

The WITDOM project (empoWering prIvacy and securiTy in non-trusteD environments) develops innovative technical solutions for secure and privacy-preserving processing of genomic and financial data in untrusted environments.

The advent of outsourced and distrib - WITDOM aims to develop three main efficient lattice cryptosystems for uted processing environments like the innovations in this area. Firstly, a novel homomorphic processing, allowing for Cloud is prompting fundamental trans - framework for end-to-end security and faster and more resource-efficient formations in whole ICT ecosystems, privacy that will guarantee efficient and encrypted-domain processing [1]; accu - while bringing new opportunities to verifiable provision of privacy in the rate and effective secure signal pro - stakeholders in terms of the availability context of ICT services owned by third- cessing to cope with the marriage and rational use of physical resources party providers of distributed pro - between cryptography and the ubiqui - with large-scale savings in IT invest - cessing and storage. This framework tous signal processing operations when ments. Conversely, it also poses new will maximize independence from dealing with sensitive signals [2]; effi - security challenges, especially for stated security and privacy commit - cient privacy-enhancing technologies ensuring robust protection of privacy ments by respective providers, and min - for obfuscation, noise addition, and integrity of personal information, imize the current need for blind trust anonymization and data masking, meas - which are essential for societal accept - from clients, based solely on written uring and achieving quantitative guar - ance of new ICT schemes, services and consents. The initial contribution of this antees that the processed data will be solutions. framework builds upon a requirements unidentifiable and the produced results elicitation methodology, synergizing unlinkable; and last, but not least, effi - In this context, the WITDOM project the results of the recently ended FP7 cient and scalable integrity and consis - focuses on developing innovative solu - project PRIPARE with a co-design tency verification techniques to pre - tions for truly efficient and practical pri - process to fuse privacy into the first serve fork-linearizability on data vacy enhancing techniques and efficient stages of the systems design process, accessed and modified by several users signal and data processing in the materializing a true Privacy-by-Design on outsourced data stores [3]. The main encrypted domain for outsourced envi - methodology. strength in our approach is that these ronments. WITDOM’s main goal is to innovations are not applied independ - produce a framework for end-to-end The second dimension of innovations ently or autonomously, but in an end-to- protection of data in untrusted environ - deals with the actual design and devel - end secure and private architecture that ments, with a particular focus on data- opment of tools and technologies for defines a platform that can deploy pri - outsourcing scenarios where new efficient privacy protection of data out - vacy-preserving services on outsourced threats, vulnerabilities and risks due to sourced and processed in untrusted data with quantifiable and assessable new uses require end-to-end security environments. These techniques can be technological guarantees. Promising solutions that will withstand progress for categorized according to the four main initial results have already been pro - the lifetime of applications they support. research areas addressed in WITDOM: duced within WITDOM’s roadmap, which will continue to advance the state of the art in these fields for the next two years.

Finally, the third dimension of WITDOM’s innovations deals with the instantiation of the developed frame - work, platform and tools in two care - fully chosen use-case scenarios, for which privacy is essential owing to the sensitivity of the involved data, and where privacy and confidentiality con - straints are a true barrier to using out - sourced architectures and Cloud-based deployments. The first use-case is a health scenario that involves out - sourcing genetic data processes and workflows for large research analyses and individual clinical analyses; genetic data is extremely sensitive, and Figure 1: WITDOM generic architecture. genomic privacy has become a hot topic

42 ERCIM NEWS 104 January 2016 for research and innovation, to which The WITDOM project is a Research References: WITDOM contributes by focusing on and Innovation Action funded by the [1] D.D. Chen et al.: “High-Speed solutions for outsourced processing of European Commission’s Horizon 2020 Polynomial Multiplication Archi - genetic data. The second scenario deals programme and the Swiss State tecture for Ring-LWE and SHE with outsourced financial analyses Secretariat for Education, Research and Cryptosystems”, IEEE Trans. on based on the management of both cus - Innovation. This project, started in Circuits and Systems 62-I(1): 157- tomers’ data and finance data, to enable January 2015, joins together a multidis - 166 (2015). credit risk calculations and card transac - ciplinary consortium comprising [2] J. R. Troncoso-Pastoriza and F. tion fraud detection deployed as out - Universities, research centres, strong Pérez-González: “Secure Signal sourced secure financial services over industrial stakeholders, and end-users: Processing in the Cloud: enabling private and public Cloud environments. Atos(Spain) as the project coordinator, technologies for privacy-preserving the University of Vigo (Spain) as the multimedia cloud processing”, Research and innovation in this field technical coordinator, Katholieke IEEE Signal Process. Mag. 30(2): cannot ignore the impact of data protec - Universiteit of Leuven (Belgium), IBM 29-41 (2013). tion regulations and directives on the Research (Switzerland), Fondazione [3] M. Brandenburger, C. Cachin evolution of Cloud-related environ - Centro San Raffaele (Italy), BBVA and N. Knežević: “Don't trust the ments and in the processing of personal (Spain), Gradiant (Spain), and the cloud, verify: integrity and consis - and sensitive data. Therefore, Ospedale San Raffaele (Italy). tency for cloud object stores”, WITDOM innovations are built upon a ACM SYSTOR 2015: 16:1-16:11. legal assessment and validation of the Links: evolving European Data Protection http://witdom.eu/ Please contact: Regulation, linking legal and ethical https://twitter.com/W1TD0M Juan Ramón Troncoso-Pastoriza requirements with technological means https://www.linkedin.com/grp/home?gi University of Vigo to guarantee their enforcement. d=8257514 E-mail: [email protected]

Call for Participation ;9CJ8%%mmm$[YcjX(&',$eh]mmm$[YcjX(&',$eh] ECCB 2016 – '&j^;;khef[Wd99ed\[h[dY[e edCCWj^[cWj_YWbWWdZJJ^[eh[j_YWb88_ebe]o 15th European WdZ77ddkWbCC[[j_d]e e\jj^[IIeY_[jo\\ehCCWj^[cWj_YWb8 8_ebe]o Conference Dejj_d]^Wc"'''#'+@@kbo((&', on Computational Biology

The Hague, Netherlands, 3-7 September 2016

ECCB 2016 will welcome scientists working in a variety of disciplines, including bioinformatics, computational biology, biology, medicine, and systems biology. Participating in ECCB 2016 will be the perfect opportunity to keep pace with cutting edge research, and to network with members of ECCB community.

More information: http://www.eccb2016.org/

ERCIM NEWS 104 January 2016 43 Research and Innovation

Secure and Privacy- Preserving Identity Management in the Cloud

by Bernd Zwattendorfer, Stephan Krenn and Thomas Lorünser

CREDENTIAL is an EU H2020 funded research project that is developing, testing, and showcasing innovative cloud-based services for storing, managing, and sharing European digital identity information and other highly critical personal data with a demonstrably higher level of security and privacy than other current solutions. This is Research and achieved by advancing novel cryptographic technologies and improving strong authentication mechanisms.

Innovation Digital identity management (IdM) is an essential tool for managing access to information technology (IT) resources and is an enabler for seamless interaction between systems, organizations, and end users in the future. However, in order to be fully and broadly accepted, IdM must involve secure identification and authentication processes and protect user privacy. This is especially true for high-assurance applica - tion domains such as e-Business, e-Government, or e-Health, which have a superior demand for security and privacy due to the harm a potential breach or identity theft could cause.

Identity management is currently experiencing a paradigm shift, and existing solutions fall short in many aspects when it comes to federated and heterogeneous environments. In the past, IdM was mainly a local issue and most organiza - tions operated their own, custom-tailored identity manage - ment systems within the organization’s domain boundaries. The use of external IdM systems was the exceptional case. Today, we often see mixed systems, mainly because of the increasing use of distributed and inter-connected applica - tions integrating internal and external components, e.g., as in hybrid cloud applications. This situation leads to frag - mented, non-standard authentication situations on the IdM level and causes high administrative costs compared with integrated solutions. Many “identity islands” have to be managed outside the corporate IT environment, and the advantages of integrated identity and access management (IAM) solutions are lost. Important features like single sign- on as well as easy and centralized provisioning/de-provi - sioning, audit, and control of identities are not possible any - more. For these reasons there exists a strong demand for the development and integration of trustworthy IAM systems. Ideally these systems would provide the necessary security and privacy guarantees aspired in federated business envi - ronments with the strongest guarantees possible, by cryp - tography.

The transformation in the identity management world goes hand in hand with the tremendous shift to cloud computing that has shaped the ICT world during recent years. By now, numerous IdM systems and solutions are available as cloud services, providing identity services to applications operated both in closed domains and in the public cloud. This service model is often referred to as Identity (and Access)

ERCIM NEWS 104 January 2016 Figure 1: Privacy preserving IAM workflow.

Management as a Service (IDMaaS). Popular examples for Links: cloud IDMaaS providers are big companies from the sectors Website: https://credential.eu/ of social networks (Facebook, LinkedIn), search engines Twitter: @CredentialH2020 (Google), business solutions (Microsoft, Salesforce), and http://twitter.com/CredentialH2020 online retailers (Amazon). However, no satisfactory LinkedIn: https://linkedin.com/in/credential approaches currently exist which allow the storage and CORDIS: sharing of identity data by service providers in a privacy pre - https://cordis.europa.eu/project/rcn/194869_en.html serving manner – meaning without the identity provider learning the credentials and associated data. References: [1] B. Zwattendorfer, D. Slamanig: “Design Strategies for a The vision of CREDENTIAL is to fill this gap and develop a Privacy-Friendly Austrian eID System in the Public more trustworthy solution by combining secure and efficient Cloud”, Computers & Security 2014. identity management technologies with cryptography for [2] D. Slamanig, K. Stranacher, B. Zwattendorfer: “User- cloud computing [1,2,3]. Users will be able to store identity Centric Identity as a Service-Architecture for eIDs with data in a cloud-based IDMaaS system of an identity provider Selective Attribute Disclosure”, in Proc. of SACMAT such that the confidentiality and authenticity of the data is 2014. upheld even from the provider. Now, if a user wants to access [3] B. Zwattendorfer, A. Tauber: “Secure Cloud Authenti - a specific service at a different provider or from the enter - cation using eIDs”, in Proc. of IEEE CCIS 2012. prise environment, she can advise the identity provider to select specific data items and re-encrypt them for the service Please contact: provider such that (after transmission) the service provider Bernd Zwattendorfer remains the only party capable of accessing the data items in Graz University of Technology plain text. Tel: +43 (0) 316 8735574 E-Mail: [email protected] In comparison to current IDMaaS systems, which have full access to the identity data they are hosting, the CREDEN - Stephan Krenn TIAL solution will significantly improve the privacy of AIT Austrian Institute of Technology GmbH cloud identity service users, whilst maintaining a high degree Tel: +43 (0) 664 88256006 of usability in order to motivate secure handling of services. E-mail: [email protected] Data will be protected with end-to-end encryption, while the authentication of the users against the identity service provider will be secured with efficient and strong state-of- the-art multifactor authentication mechanisms within a con - sistent and holistic security approach.

CREDENTIAL had its kick-off in October 2015 and will last for three years. The consortium is currently doing a technology assessment and a requirements elicitation for a secure and privacy-preserving IDMaaS solution in the cloud. The CREDENTIAL consortium consists of a bal - anced team from seven EU countries, including six industry partners, two applied research organizations, three universi - ties, and one SME.

ERCIM NEWS 104 January 2016 45 Research and Innovation

PRISMACLoUD – Privacy user privacy, we apply privacy enhancing technologies (e.g., attribute-based anonymous credentials) to implement data and Security Maintaining minimization strategies and access privacy. In addition, we are interested in efficient data anonymization algorithms to Services in the Cloud anonymize large datasets to facilitate privacy-friendly data sharing and third party use. by Daniel Slamanig, Agi Karyda and Thomas Lorünser To assure the practical relevance of the developments within The EU Horizon 2020 PRISMACLOUD research project is PRISMACLOUD, the aforementioned efforts are accompa - dedicated to enabling secure and trustworthy cloud- nied by non-cryptographic research topics considered essen - based services by improving and adopting novel tools tial for the commercial success of the project results. We will from cryptographic research. provide secure and efficient software implementations of core technologies and Cloud computing is seen as a major showcase them in selected testbeds. Three growth area in ICT, with a 2013 forecast different use-cases from different applica - from International Data Corporation pre - tion domains will be used to demonstrate dicting that worldwide spending on public and evaluate the potential of the project cloud services will exceed USD $107 bil - outcome, i.e., demonstrate a measurable lion in 2017. Despite these predictions, the increase in service level security and pri - practical adoption of cloud computing vacy. Furthermore, novel human-com - technologies may be greatly hindered by puter interaction (HCI) guidelines, inadequate security and privacy protec - including HCI design patterns for usable tion. Some fundamental properties of cryptography and protocols for the cloud, (public) cloud computing, such as being an will help to design services that respect open platform, its anytime and anywhere the users’ needs and will therefore maxi - accessibility, as well as the intrinsic multi- mize acceptance of the technology. In tenancy, introduce new security threats, leading to tremendous order to use the developed methods properly in novel applica - risk for personal and sensitive data. Studies from the EU tion scenarios after the project, a holistic security framework Agency for Network and Information security (ENISA) show and accompanying usage patterns will be prepared in support that security and privacy concerns represent major stumbling of service developers. Finally, a vital goal of the project is for blocks for cloud adoption within European industry. the results to be incorporated into standards related to cloud security, and we will actively participate in various s tandardi - The need to protect the security and privacy of the data in the zation bodies in the second phase of the project. cloud is therefore a critical issue. The strongest level of pro - tection can be achieved through cryptography. However, The PRISMACLOUD project has been running since although many recent advances in cryptography have yielded February 2015 and is a 42 month project that receives promising tools for cloud computing, far more work is funding from the European Union’s Horizon 2020 Research required to transform theoretical techniques into practical and Innovation programme under grant agreement No solutions for the cloud. PRISMACLOUD is contributing to 644962. The project is coordinated by AIT and its consor - this transformation by developing tools that allow the next tium consists of 16 partners from academia and industry generation of cryptographically secured cloud services to be from nine different countries. built with security and privacy incorporated by design and from end to end. Links: Website: https://prismacloud.eu PRISMACLOUD’s research, and resulting developments LinkedIn: https://linkedin.com/in/prismacloud [1], are based on the following objectives. On the one hand Twitter: @prismacloud, http://twitter.com/prismacloud we focus on confidentiality of data, which is considered CORDIS: http://cordis.europa.eu/project/rcn/194266_en.html absolutely essential when outsourcing data into the cloud. In particular, we target the development of secure distributed References: cloud storage systems (i.e., the cloud-of-cloud approach [2]) [1] T. Lorünser et al.: “Towards a New Paradigm for Priva - as well as encryption and tokenization solutions for legacy cy and Security in Cloud Services”, Cyber Security and applications already running in the cloud. Secondly, we are Privacy, Vol 530 of CCIS, Springer, 2015. putting significant effort in verifiability features for the [2] T. Lorünser, A. Happe, D. Slamanig: “ARCHISTAR: cloud. Thereby, we focus on cryptographic means (such as Towards Secure and Robust Cloud Based Data Shar - verifiable computing and malleable signatures [3]) to protect ing”, CloudCom 2015, IEEE, 2015. the integrity and authenticity of dynamic data in cloud-based [3] D. Derler et al.: “A General Framework for Redactable workflows and computational tasks. Moreover, we are also Signatures and New Constructions”, 18th International focusing on cryptographic means (e.g., graph signatures) that Conference on Information Security and Cryptology, allow auditors to attest or certify the cloud infrastructure and LNCS, Springer, 2015. thus help providers to increase the transparency for cus - tomers without revealing internal information about the con - Please contact: figurations. Thirdly, the privacy of users interacting with a Thomas Lorünser, AIT Austrian Institute of Technology GmbH cloud environment requires adequate protection. To protect Tel: +43 664 8157857, E-mail: [email protected]

46 ERCIM NEWS 104 January 2016 TREDISEC: Towards sions (e.g., users’ private information illegally accessed and used for blackmail [2]). Moreover, companies are no longer Realizing a Truly Secure happy to rely on standard cloud security solutions, instead demanding full control over the security and integrity mech - and Trustworthy Cloud anisms employed to protect their data across its lifecycle. As this data is typically a very valuable asset, security and pri - by Beatriz Gallego-Nicasio Crespo, Melek Önen and vacy emerge as key features in cloud offerings. TREDISEC Ghassan Karame addresses these issues by providing a set of security primi - tives that enhance the resilience of existing cloud infrastruc - The main goal of the TREDISEC project is to increase tures against attacks and vulnerabilities, protecting data end- trust in cloud technology by providing solutions that to-end, and thus making secure and trustworthy cloud sys - enhance data security and provide strong privacy tems a reality. guarantees, without hampering the efficiency-and- reduced-cost attractiveness of existing cloud systems. Achieving end-to-end security within the cloud environment is nevertheless not a straightforward task. Indeed, end-to-end The cloud is the go-to technology for companies and busi - security is at odds with current functionalities offered by the nesses these days. Cloud adoption is growing exponentially cloud, as shown in Figure 1. For example, while protecting each year, fueled by new functionalities and capabilities. cloud customers’ data at rest usually requires data encryption Although it has many advantages, the cloud also has draw - solutions, such a security service inherently refrains cloud backs and security-related issues that can make customers services from offering standard APIs for efficiently pro - shy away from adopting it. Large-scale adoption of the cloud cessing these encrypted data. by enterprises and SMEs is hampered by serious concerns about the security and availability of data stored in the cloud. TREDISEC addresses security and privacy issues by ana - These concerns have been further exacerbated by recent data lyzing, designing and implementing a set of cloud security leak scandals (e.g., corporate data stolen [1], resulting in mil - primitives that are integrated naturally with existing cloud lions of dollars in losses in just one day) and privacy intru - capabilities and functionalities, such as multi-tenancy or storage efficiency. Among other capa - bilities, these primitives support data reduction, enable secure data pro - cessing, enhance data availability and integrity and ensure user isolation and confidentiality in multi-tenant systems. Therefore, the innovation potential of TREDISEC covers: • Deduplication on encrypted and multi-tenant data • Means to verify the integrity and availability of multi-tenant data in presence of storage efficiency • Secure deletion of multi-tenant data in presence of deduplication • Storage efficiency in presence of securely outsourced database man - agement services • Secure outsourced analytics/process - ing in a multi-tenant environment • Trustworthy, consistent access control for multi-tenancy settings • Distributed enforcement of access control policies.

Addressing these innovations is only the first step of the project. The subsequent, and more challenging step, is to con - sider scenarios with multiple functional and security requirements to evaluate the effectiveness of our approach. These security primitives will be integrated and validated in real use case scenarios designed by four partners of the project, representing diverse business goals (for more detail about use cases see the Figure 1: The TREDISEC approach. Links section). Namely, our results will

ERCIM NEWS 104 January 2016 47 Research and Innovation

be validated onto (1) the Greek Research and Technology Smart Devices for the Network’s computation and storage cloud (used by the University of Athens), (2) the cloud storage service provided Home of the Future: by Arsys, (3) Morpho’s identification solutions using bio- metric data that are stored in the cloud , and finally (4) SAP A New Model for use case focuses on migration of legacy databases into a secure cloud, which is an important concern for small, mid- Collaboration sized and large enterprises that wish to move their day-to-day business processes (e.g., ERP, CRM, HR) to the cloud. by Daniele Spoladore, Gianfranco Modoni and Marco Sacco We believe that the results of TREDISEC will have a big impact in business (both large companies and SMEs), An ontology based approach that aims at enhancing allowing them to achieve greater business throughput, and interoperability between home devices and services is lowering the barriers to enter new markets. We are fully being developed by the Italian “Design For All” research aware of the problem of adoption of new technologies by project. companies and believe that TREDISEC will cover at least three of the top five barriers: security, data protection, trust, Contemporary design is characterized by a major paradigm data access, and portability [3]. shift: from traditional design focused on the “average man” to Universal Design, which takes into account a wide variety This paper is a joint work of the TREDISEC project consor- of possible human needs [1]. This paradigm is also applied to tium. The TREDISEC project is funded by the H2020 the field of Ambient Assisted Living (AAL) in the design of Framework of the European Commission under grant agree- smart homes, living environments capable of anticipating ment no. 644412. In this project, nine renowned research and responding to the needs of their inhabitants through tai- institutions and industrial players with balanced expertise in lored services provided by diverse devices (e.g. sensors and all technical aspects of both security and cloud are working actuators). However, these smart objects are managed by dif- together to address the challenges of the project: Atos Spain ferent software, based on their specific data format. Thus, the (project coordinator), NEC Europe, IBM Research, ETH home of the future is currently characterized by a wide Zurich, Eurecom, Arsys Internet, Greek Research and variety of data, which hinders efficient interactions between Technology Network, SAP SE and Morpho (SAFRAN the devices involved. In order to address this issue, the group). Design For All research project, co-funded by the Italian Ministry for Education, University and Research within the Links: cluster of initiatives for Technologies for Ambient Assisted TREDISEC official website: http://tredisec.eu Living, is developing semantic interoperability so that dif- TREDISEC Use Cases Definition: ferent systems can share and exploit the same information. http://tredisec.eu/content/use-cases The main goal of the project is to develop a software archi- References: tecture that supports the design phase for future smart homes [1] Deadline – Sony hack: A Timeline (Last access: suitable for any kind of user, enabling distributed devices to 05.11.15), http://kwz.me/QY adapt to and react with the context. A platform denoted the [2] Tripwire – The state of security. The Ashley Madison Virtual Home Framework (VHF) [2] integrates knowledge Hack – A Timeline (Last access: 05.11.15), about the domestic environment, the smart objects and the http://kwz.me/QB users. Semantic Web technologies have been adopted to for- [3] European Commission:“Quantitative Estimates of the mally describe this information (including the many linking Demand for Cloud Computing in Europe and the Likely elements) in an ontology, which is a “formal specification of Barriers to Up-take”, 2012, http://kwz.me/Qr a shared conceptualization” based on first-order logic lan- guages (RDF and OWL). The ontology approach provides a Please contact: holistic view of the smart home as a whole, considering the Beatriz Gallego-Nicasio Crespo physical dimensions, the users involved, and their evolution Atos, Spain over the time. It also allows the use of reasoning tools, able Tel: +34 912148800 to derive new knowledge about the concepts and their rela- E-mail: [email protected] tionships, thanks to inferencing rules specified in the Semantic Web Rule Language (SWRL).

The semantic model developed, called the Virtual Home Data Model (VHDM), provides a consistent representation of several knowledge domains; it is composed of several modules including: a) the Physiology model, to keep track of users’ medical conditions over time; b) the Smart Object Model, which provides a description of the relationships between appliances and related functionalities; c) the Domestic Environment, which includes information on thermo-hygrometric conditions and air and light quality.

48 ERCIM NEWS 104 January 2016 Figure 1: A fragment of the taxonomy derived from ICF.

A valid starting point for the design of the VHDM has been a cloud-based platform (Microsoft Azure), which guarantees study of the various reference models available in the litera - an efficient management of the overall framework’s hori - ture covering the relevant knowledge domains. For example, zontal scalability. the Physiology model is based on the International Classification of Functioning, Disability and Health (ICF) Link: (Figure 1); the Smart Object model refers to the existing http://www.d4all.eu/en/ Units Ontology, which represents units of measurement and related concepts, in order to support measurements in several References: fields (e.g. physiological data on the user, environmental [1] R. L. Mace, G. J. Hardie, J. P. Place: “Accessible Envi - data and data related to the smart objects) (Figure 2). ronments: Toward Universal Design”, 1991. [2] M. Sacco, E.G. Caldarola, G. Modoni, W. Terkaj: “Sup - A key topic was the selection of a valid database that could porting the Design of AAL through a SW Integration manage and reason over large amounts of semantic data. A Framework: The D4All Project”, Universal Access in survey of the state of the art of existing semantic reposito - HCI, Springer, 2014. ries was carried out stipulating reasoning capability as the [3 ] G. Modoni, M. Sacco, W. Terkaj: “A survey of RDF main criteria of evaluation. On the basis of this qualitative store solutions”, in Proc. of ICE, 2014. analysis [3], Stardog was finally adopted as the RDF-store and SPARQL end point for the validation of the frame - Please contact: work, since, compared to other solutions, this allows a Daniele Spoladore, Gianfranco Modoni higher level of expressivity to represent the inferencing ITIA-CNR, Italy rules. The semantic repository has been installed on a E-mail: {daniele.spoladore, gianfranco.modoni}@itia.cnr.it

Figure 2: An excerpt of the Smart Object model.

ERCIM NEWS 104 January 2016 49 Research and Innovation

over the course of a year (see Figure 1) and over a week. As a fooDWEB – Studying proxy for consumption we used ratings provided by the users to a given recipe, and “recipe uploads” were used a proxy for food Consumption production. Another interesting finding was that some food categories are remaining significantly more interesting to the and Production Patterns users over time than others – e.g. vegetarian recipes have a very short life time, while spring food stays interesting to on the Web people over a longer period. This finding is important since it suggests that temporal effects significantly influence food by Christoph Trattner, Tomasz Kuśmierczyk and Kjetil preferences, and researchers or engineers who are devel - Nørvåg oping recommendation services in such platforms could, for example, take this into account. Food is a central part of our daily lives and is one of the most important factors that shape our health and well- Currently, we are in the process of obtaining data from other being. Although research on individual food preferences online community platforms. We are using this to investigate has existed as a well-established field of study for several several interesting variables, including cultural differences, decades, there has been very little attention devoted to user characteristics – such as age and gender – and many understanding recipe consumption and production more. The project is continuing to grow and current contribu - patterns on the World Wide Web. To contribute to tors include researchers from the Know-Center (Austria), knowledge in this area, we recently initiated “FOODWEB”, NTNU (Norway), L3S (Germany), Freie University of a research project that studies traces of users online with Bolzano (Italy), and TU-München (Germany). the goal of better understanding how we 1

consume and produce food. . 6 ] 9 g . [

Contrary to classic nutrition research, our 5 s n methodology in the FOODWEB project is rather i e 7 t . 5 o

pragmatic. Rather than relying on time con - r p

suming and expensive user surveys, which are 5 . commonly used in nutrition research, we intend to 5

mine traces of users online (e.g. from online food 2 2 ] g [

forums). We then analyze this data to draw con - 1 s 2 e

clusions about a population of users (e.g. in a t a 0 r 2

country or a region) or at the individual level. The d y h 9 1 project started in October 2014 with support from o b r 8 ERCIM (in the form of an Alain Bensoussan a 1 Fellowship to the first author of this article, who c was hosted by NTNU, Norway). 0 ] 6 l 2 a c

In order to conduct our research, we created a k [

0 y

dataset by mining kochbar.de, one of the largest 5 g 2 r

online food communities. Like other related web - e n e

sites, people can upload recipes, rate and com - 0 4 ment on recipes, join or create groups to share 2 l l l l l t t t t t r r r r r n n n n n u u u u u c c c c c p p p p p a a a a a J J J J J A O A O A O A O A O J J J J cooking experiences or create social links to J follow each other. What makes kochbar.de inter - 2010 2011 2012 2013 2014 esting as a test bed is not only their huge user base (and therefore the coverage of a huge population), Figure 1: Seasonal trends in online recipe production (red plot) and but also the rich amount of metadata – which is consumption (black plot) for proteins, carbohydrates and calories per 100g. usually not the case when conducting survey- based research. Interesting information available includes detailed descriptions of the ingredients used and the References: preparation steps needed to cook a meal, detailed nutrition [1] T. Kuśmierczyk, C. Trattner, K. Nørvåg: “Temporality facts per serving, and detailed user information, such as the in Online Food Recipe Consumption and Production”, user’s gender and region. This dataset encompasses more In Proc. of WWW’15, 2015. than 400,000 recipes published between 2010 and 2014, and [2] T. Kuśmierczyk, C. Trattner, K. Nørvåg: “Temporal over 200,000 users. Patterns in Online Food Innovation”, in Proc. of Temp - Web'201, 2015. Our project started with an exploratory analysis of the dataset and several research questions. Two initial papers Please contact: were published at WWW’2015 [1] and TempWeb’2015 [2], Christoph Trattner with interesting findings on temporal effects in how we con - Know-Center, Graz, Austria sume and produce food online. Interesting findings included E-mail: [email protected] temporal effects in terms of production and consumption Christoph Trattner was an ERCIM Fellow hosted by NTNU

50 ERCIM NEWS 104 January 2016 the learned task in an executable robotic behavior. An appro - Robots Learn Actions priate latent space representation is employed that effectively encodes demonstrated actions. A transformation from the and Cooperative Tasks actor’s latent space to the robot’s latent space dictates the action to be executed by the robotic system. Learned actions by Imitation are in turn used in human-robot collaboration (HRC) scenarios to facilitate fluent cooperation between two agents by Maria Koskinopoulou and Panos Trahanias – the human and the robotic system [2].

Imitation learning, which involves a robot observing and We conducted experiments with an actual robotic arm in sce - copying a human demonstrator, empowers a robot to narios involving reaching movements and a set of HRC “learn” new actions, and thus improve robot tasks. The tasks comprised a human opening a closed com - performance in tasks that require human-robot partment and removing an object from it, then the robot was collaboration (HRC). We formulate a latent space expected to perform the correct action to properly close the representation of observed behaviors, and associate compartment. Types of compartments tested included: this representation with the corresponding one for drawers, closets and cabinets. The respective actions that the target robotic behaviors. Effectively, a mapping of robotic systems should perform consisted of “forward observed to reproduced actions is devised that pushing” (closing a drawer), “sideward reaching” (closing an abstracts action variations and differences between the open closet door) and “reaching from the top” (closing a cab - human and robotic manipulators. inet). Furthermore, the robots were trained with two addi - tional actions, namely: “hello-waving” and “sideward Learning new behaviors enables robots to expand their reper - pointing”. Those additional actions were included in order to toire of actions and effectively learn new tasks. This can be furnish the robots with a richer repertoire of actions and formulated as a problem of learning mappings between hence operate more realistically in the HRC scenarios. world states and actions. Such mappings, called policies, enable a robot to select actions based on the current world In these experiments, the robotic arm is expected to repro - state. The development of such policies by hand is often very duce a demonstrated act, taught in different demo-sessions. challenging, so machine learning techniques have been pro - The robotic arm should be able to reproduce the action posed to cope with the inherent complexity of the task. One regardless of the trajectory to reach the goal, the hand config - distinct approach to policy learning is “Learning from uration, the trajectory plan of a reaching scenario or the Demonstration” (LfD), also referred in the literature as reaching velocity, etc. The end effect of the learning process Imitation Learning or Programming by Demonstration is the accomplishment of task-reproduction by the robot. Our (PbD) [1]. experiments are developed using the robotic arm JACO designed by Kinova Robotics [L1], and have also been tested Our research in the LfD field is directed towards furnishing on the humanoid robot NAO by Aldebaran [L2]. robots with task-learning capabilities, given a dataset of dif - ferent demonstrations. Task learning is treated generically in Actual demonstrations may be performed in di fferent ways. our approach, in that we attempt to abstract from the differ - In our study, we recorded RGB-D videos of a human per - ences present in individual demonstrations and summarize forming the respective motion act. From each video, we

Figure 1: The formatted actor’s and robot’s latent spaces, as derived from the respective learnt human motion acts, result in the final robotic motion reproduction.

ERCIM NEWS 104 January 2016 51 Research and Innovation

extract the {x,y,z} coordinates of each joint, previously marked with prominent (yellow) markers. By tracking three Unlocking the Secrets joints of the demonstrator, we form a 9D representation of the observed world. The derived trajectories of human of the Dynamics motion acts are subsequently represented in a way that is readily amenable to mapping in the manipulator’s space. The of Earth’s Interior Gaussian Process Latent Variable Model (GP-LVM) repre - sentation is employed in order to derive an intermediate by Costas Bekas space, termed latent space, which abstracts the demonstrated behavior [3]. The latent space features only the essential “Mantle convection” is the term used to describe the characteristics of the demonstration space: those that have to transport of heat from the interior of the Earth to its be transferred to the robot-execution space, in order to learn surface. Mantle convection drives plate tectonics – the the task. Effectively, the mapping of the observed space to motion of plates that are formed from the Earth’s crust the actuator’s space is accomplished via an intermediate and upper mantle – and is a fundamental driver of mapping in the latent spaces (Figure 1). earthquakes and volcanism. We modelled mantle convection using massively parallel computation. Our initial results regarding the latent space representation are particularly encouraging. We performed many demon - In 2013, a group of IBM and university scientists won the strations of left-handed reaching and pushing movements. ACM Gordon Bell Prize for work on cavitation [1]. Interestingly, the resulting representations assume overlap - Completely against conventional wisdom, their work ping trajectories that are distinguishable, facilitating compact demonstrated that, with advanced algorithmic re-engi - encoding of the behavior. We are currently working on neering, it is possible to achieve close to theoretical peak per - deriving the analogous representation for actions performed formance on millions of cores for computational fluid by the robotic arm, as well as associating the two representa - dynamics simulations. In the meantime, Professors Omar tions via learning techniques. Ghattas (University of Texas at Austin), Michael Gurnis (California institute of Technology), and Georg Stadler (New In summary, we have set the ground for research work in York University) and their groups were collaborating on cre - Robotic LfD and have pursued initial work towards building ating a very high resolution/high fidelity solver for convec - the latent representation of demonstrated actions. In this con - tion in Earth’s mantle. The breakthrough represented by the text, we’re aiming at capitalizing on the latent space repre - Bell prize was the spark that brought together the Texas- sentation to come up with robust mappings of the observed Caltech-NYU team with the group of Costas Bekas at IBM to space to the robot–actuator’s space. Preliminary results indi - form a collaboration aimed at achieving unprecedented sca - cate that the adopted representation in LfD tasks is appro - leout of the mantle convection solver. priate. Extreme scaleout would be a decisive step towards the goals Links: of the original research, which aims to address such funda - [L1] http://kinovarobotics.com/products/jaco-robotics/ mental questions as: What are the main drivers of plate [L2] https://www.aldebaran.com/en/humanoid-robot/nao- motion – negative buoyancy or convective shear traction? robot What is the key process governing the occurrence of earth - quakes – the material properties between the plates or the References: tectonic stress? [2]. [1] S. Schaal, A. Ijspeert, A.Billard: “Computational approaches to motor learning by imitation” Philosophi - A series of key advances in geodynamic modelling, numer - cal Transactions of the Royal Society of London. Series ical algorithms and massively scalable solvers that were B, Biological Sciences, 537–547, 2003. developed over the last several years by the academic mem - [2] M. Koskinopoulou, S. Piperakis, P. Trahanias: “Learn - bers of the team was combined with deep algorithm re-engi - ing from Demonstration Facilitates Human-Robot Col - neering and massively parallel optimization knowhow from laborative Task Execution”, in Proc. 11th ACM/IEEE IBM Research, to achieve unparalleled scalability. This Intl. Conf. Human-Robot Interaction (HRI 2016), New result offers geophysicists powerful tools to improve their Zealand, March 2016. understanding of the forces behind the origin of earthquakes [3] N.D. Lawrence: “Gaussian Process Latent Variable and volcanism [2]. More broadly, the combination of Models for Visualisation of High Dimensional Data”, advanced algorithms and the massively parallel optimiza - Advances in Neural Information Processing Systems tions demonstrated in the work applies much more generally 16, 329–336, 2004. and will enable scientists in multiple fields to make new dis - coveries and make it possible for industries to greatly reduce Please contact: the time it takes to invent, test, and bring new products to Maria Koskinopoulou and Panos Trahanias market – including, for example, new materials and energy ICS-FORTH,, Greece sources. E-mail: {mkosk, trahania}@ics.forth.gr The problem of modeling mantle convection at a global scale with realistic material properties is notoriously difficult. The problem is characterized by an extremely wide range of length scales that need to be resolved—from less than a kilo -

52 ERCIM NEWS 104 January 2016 metre at tectonic plate boundaries to the 40,000 kilometre Sequoia comprises 1.6 million processor cores capable of a circumference of the Earth. Similarly, the rock that com - theoretical peak performance of 20.1 petaflops. A huge chal - prises the mantle has a viscosity (which represents resistance lenge is efficiently scaling to all of the1,572,864 processor to shearing) that varies by over six orders of magnitude. The cores of the full Sequoia. The overall scale-up efficiency grid on which the computations were carried out thus had to using all 96 racks was 96% compared to the single rack run. be adapted locally to the length scale and viscosity variation This is a record-setting achievement and demonstrates that, present. Finally, the partial differential equations describing contrary to widely-held beliefs, numerically optimal implicit mantle convection are highly nonlinear. The result was a solvers can indeed almost perfectly scale to millions of cores, model with 602 billion equations and variables – the solution with careful algorithmic design and massive scaleout of which could be undertaken on only the most powerful knowhow. supercomputers. Understanding mantle convection has been named one of the Rather than using simpler explicit solvers, which are easier “10 Grand Research Questions in Earth Sciences” by the to scale on large supercomputers (since they involve only U.S. National Academies. The authors of [2] hope their work nearest neighbour communication) but at the same time are provides earth scientists with a valuable new tool to help extremely inefficient (as they require a very large number them pursue this grand quest. The work won the 2015 ACM of iterations to converge), the team achieved extreme sca - Gordon Bell Prize. leout to millions of cores for an innovative implicit solver, previously developed by the Texas and NYU members of Many colleagues contributed to this work, too many to be the team. This required extensive algorithmic implementa - included as authors here but all are heartily acknowledged: tion re-engineering and massively parallel adaptation as Johann Rudi, A. Cristiano I. Malossi, Tobin Isaac, Georg unforeseen bottlenecks arise when millions of cores are Stadler, Michael Gurnis, Peter W. J. Staar, Yves Ineichen, used. This success allowed the simulation of mantle con - Costas Bekas, Alessandro Curioni and Omar Ghattas. vection with an unprecedented level of fidelity and resolu - tion. Global high-resolution models such as this are critical Links: for understanding the forces driving and resisting global http://www.zurich.ibm.com/mcs/compsci/parallel/ tectonic plate motion. http://www.zurich.ibm.com/mcs/compsci/engineering/ https://www.youtube.com/watch?v=ADoo3jsHqw8 Combining advances in mathematics, algorithms and mas - sively parallel computing, simulations with the optimized References: mantle convection model demonstrated super-efficiency in [1] D. Rossinelli et al.: “11 PFLOP/s simulations of cloud running on two of the most powerful high-performance com - cavitation collapse”, Proc. of SC2013, Paper 3, Denver, puters in the world [2]. Development and science runs uti - Colorade, 2015. lized the Blue Gene/Q system at CCI, Rensselaer [2] J. Rudi et al.: “An Extreme-Scale Implicit Solver for Com - Polytechnic Institute and the Stampede system at the Texas plex PDEs: Highly Heterogeneous Flow in Earth’s Man - Advanced Computing Center; Initial massive scaling runs tle”, in Proc. of SC2015, Paper 5, Austin Texas, 2015. were made on the Juqueen Blue Gene/Q supercomputer at Juelich Supercomputing Centre in Germany and then further Please contact: scaled up to run on Sequoia, a Blue Gene/Q machine owned Costas Bekas, IBM Research – Zurich, Switzerland by the U.S. Lawrence Livermore National Laboratory. E-mail: [email protected]

Figure 1: Establishing a variable grid for the Earth’s crust and mantel.

ERCIM NEWS 104 January 2016 53 Events

Call for Papers • architectures and platforms for (European Workshop on Industrial embedded (cyber-physical) systems Computer Systems, TC7, Safety, EMC² Summit in • application Models and Design Tools Reliability and Security), and co-hosted for Mixed-Critical, Multi-Core CPS by the ARTEMIS projects ARROW - conjunction with the • dynamic runtime environments and HEAD, CRYSTAL and the standardiza - services tion Innovation Action CP-SETIS. CPS Week 2016 • multi-core hardware architectures and concepts Deadlines: 2016, Vienna, Austria, 11 April 2016 • system design platform, tools, models • Full paper submission: 7 February and interoperability 2016 EMC² – “Embedded Multi-Core systems • applications of multi-core cyber-physi - • Paper acceptance notification: 28 for Mixed Criticality applications in cal systems: avionics, automotive, February 2016 dynamic and changeable real-time envi - space, cross-domain and other applica - • Camera ready paper submission: 10 ronments” is an ARTEMIS/ECSEL Joint tions March 2016 Undertaking project in the ARTEMIS • safety and security co-engineering in Innovation Pilot Programme “Computing open dynamic CPS More information: platforms for embedded systems” (see • next generation embedded/cyber- http://www.artemis- http://www.artemis-emc2.eu/). EMC² physical systems emc2.eu/events/2016041116-emc2- will present work in progress and inter - • standardization, qualification and certi - summit-2016-at-cps-week-2016-vien - mediate results of the ongoing project fication issues of complex critical CPS na-austria-call-for-papers/ and invites researchers and industry to contribute with papers from their current The EMC² Summit is scientifically co- Please contact: work in the areas addressed and related sponsored by the ERCIM Dependable Erwin Schoitsch, AIT Austrian Institute projects. Topics of interest include, but Embedded Software-intensive Systems of Technology GmbH are not limited to: Working Group and EWICS TC7 [email protected]

More on IfIP TC6’s open Digital Book Announcement Library online Social Networks: human

As announced in the July issue of ERCIM News, the IFIP Cognitive Constraints in facebook (International Federation for Information Processing) Technical Committee 6’s (TC6’s) open digital library: and Twitter Personal Graphs http://dl.ifip.org/ is up and running. TC6 deals with Communication Systems and organizes a number of confer - V. Arnaboldi, A. Passarella, M. Conti, R.I.M. Dunbar ences in this area each year. Proceedings of some 30 TC6 conferences are available in the library. The library also This book provides new insights gives access to many other IFIP conferences so that the into the structural properties of overall number of conferences covered is more than 140. In personal online social networks contrast to other digital libraries, IFIP TC6 does not charge and the mechanisms underpin - authors for getting their papers published therein. ning human online social behavior. At the time of writing the most downloaded paper is from IFIP’s Technical Committee 2’s conference on Open Source As the availability of digital com - Systems which took place in Como in June, 2006: munication data generated by “Evolution of Open Source Communities” by Michael social media is revolutionizing Weiss, Gabriella Mouroiu and Ping Zhao. the field of social networks (http://dl.ifip.org/db/conf/oss/oss2006/WeissMZ06.pdf).It analysis, the text discusses the has been cited 35 times, according to Google. use of large- scale datasets to study the structural properties of online ego networks, to During the month of November, the most accessed confer - compare them with the properties of general human social ence was CNSM2015, the 11th International Conference on networks, and to highlight additional properties. Users will Network and Service Management which was held in find the data collected and conclusions drawn useful during Barcelona from 9 to 13 November. design or research service initiatives that involve online and mobile social network environments. Links: http://dl.ifip.org/ The book is intended for researchers and students active in http://dl.ifip.org/db/conf/cnsm/index.html the field of social network analysis, and professionals in the field of Online Social Networking service design. Please contact: Harry Rudin, Swiss Representative to IFIP TC6 Elsevier, Software Engineering and Programming, October [email protected] 2015, 116 pages. ISBN: 9780128030233

54 ERCIM NEWS 104 January 2016 In Brief

QuSoft Research Center Lynda hardman for Quantum Software President Informatics Europe Launched Lynda Hardman (CWI) has been elected President of the Board of Informatics Europe, the European association of computer science depart - QuSoft, the first research center dedicated to f f o ments and research laboratories, starting quantum software, was launched on 3 h s n o r B

from January 2016. ERCIM and Informatics December 2015. This joint initiative of CWI, b o B

: o

t Europe have agreed to increase cooperation the University of Amsterdam and VU o h P on activities that are relevant to both organi - University will be located at Amsterdam sations. Hardman succeeds Carlo Ghezzi Science Park in The Netherlands. QuSoft’s (Politecnico di Milano, Italy). The election mission is to develop new protocols, algo - took place at the annual General Assembly rithms and applications that can be run on during the 11th European Computer Science small and medium-sized prototypes of a Summit in Vienna, Austria. quantum computer. Its main focus is the devel - opment of quantum software, requiring funda - Hardman has been serving as member of the mentally different techniques and approaches Informatics Europe Board since 2012. She from conventional software. QuSoft will founded the Informatics Europe Working specifically target few-qubit applications, Group on Women in Informatics and was quantum testing and debugging, quantum responsible for the publication of the booklet “More Women in architecture and quantum cryptography. Informatics Research and Education”. She regularly speaks at events to inspire and encourage a new generation of female talent in computer sci - Directors are Harry Buhrman (CWI and UvA) ence. and Kareljan Schoutens (UvA). The new center will be hosted by CWI. It aims to grow “Informatics research is a driving force of European economy and to 35 to 40 researchers and will be home to society“, Hardman says. “It is my honour to be able to serve the commu - four research groups, each with numerous nity in my role as president of Informatics Europe. At a European scale we PhD students and postdocs. The QuSoft initia - can increase the quality of informatics research, ensure that educational tive has been warmly welcomed internation - innovations can be more easily disseminated, and also strengthen the ally and enjoys strong support from the development of commercial cooperation across Europe.“ European and North American quantum com - puting communities. The structural funding of Lynda Hardman is a member of the management team at CWI, a senior QuSoft allows for further strengthening of the researcher in the Information Access research group and coordinator of existing ties with quantum computing the CWI Information theme. She is a professor in Multimedia Discourse research organizations worldwide and Interaction at Utrecht University. Her research focuses on human-com - researchers from leading computer industry puter interaction. In 2014 she was named Distinguished Scientist by partners, including Google, Microsoft and ID ACM, for her individual contribution to the field of computing. Quantique. More information: http://www.qusoft.org More information: http://kwz.me/Rm

ERCIM “Alain Bensoussan” fellowship Programme

ERCIM offers fellowships for PhD holders from all over the world. Topics cover most disciplines in Computer Science, Information Technology, and Applied Mathematics. Fellowships are of 12-month duration, spent in one ERCIM member institute. Fellowships are pro - posed according to the needs of the member institutes and the avail - able funding.

Application deadlines: 30 April and 30 September.

More information: http://fellowship.ercim.eu/

ERCIM NEWS 104 January 2016 55 ERCIM – the European Research Consortium for Informatics and Mathematics is an organisa - tion dedicated to the advancement of European research and development, in information technology and applied mathematics. Its member institutions aim to foster collaborative work within the European research community and to increase co-operation with European industry.

ERCIM is the European host of the World Wide Web Consortium.

Consiglio Nazionale delle Ricerche I.S.I. – Industrial Systems Institute Area della Ricerca CNR di Pisa Patras Science Park building Via G. Moruzzi 1, 56124 Pisa, Italy Platani, Patras, Greece, GR-26504 http://www.iit.cnr.it/ http://www.isi.gr/

Norwegian University of Science and Technology Faculty of Information Technology, Mathematics and Electri - Czech Research Consortium cal Engineering, N 7491 Trondheim, Norway for Informatics and Mathematics http://www.ntnu.no/ FI MU, Botanicka 68a, CZ-602 00 Brno, Czech Republic http://www.utia.cas.cz/CRCIM/home.html

SBA Research gGmbH Favoritenstraße 16, 1040 Wien http://www.sba-research.org/ Centrum Wiskunde & Informatica Science Park 123, NL-1098 XG Amsterdam, The Netherlands http://www.cwi.nl/ SICS Swedish ICT Box 1263, SE-164 29 Kista, Sweden http://www.sics.se/

Fonds National de la Recherche 6, rue Antoine de Saint-Exupéry, B.P. 1777 L-1017 Luxembourg-Kirchberg Spanish Research Consortium for Informatics and Mathematics http://www.fnr.lu/ D3301, Facultad de Informática, Universidad Politécnica de Madrid 28660 Boadilla del Monte, Madrid, Spain, http://www.sparcim.es/

FWO F.R.S.-FNRS Magyar Tudományos Akadémia Egmontstraat 5 rue d’Egmont 5 Számítástechnikai és Automatizálási Kutató Intézet B-1000 Brussels, Belgium B-1000 Brussels, Belgium P.O. Box 63, H-1518 Budapest, Hungary http://www.fwo.be/ http://www.fnrs.be/ http://www.sztaki.hu/

University of Cyprus P.O. Box 20537 Foundation for Research and Technology – Hellas 1678 Nicosia, Cyprus Institute of Computer Science http://www.cs.ucy.ac.cy/ P.O. Box 1385, GR-71110 Heraklion, Crete, Greece FORTH http://www.ics.forth.gr/

University of Southampton University Road Southampton SO17 1BJ, United Kingdom http://www.southampton.ac.uk/ Fraunhofer ICT Group Anna-Louisa-Karsch-Str. 2 10178 Berlin, Germany http://www.iuk.fraunhofer.de/ Universty of Warsaw Faculty of Mathematics, Informatics and Mechanics Banacha 2, 02-097 Warsaw, Poland http://www.mimuw.edu.pl/

INESC c/o INESC Porto, Campus da FEUP, Rua Dr. Roberto Frias, nº 378, Universty of Wroclaw 4200-465 Porto, Portugal Institute of Computer Science Joliot-Curie 15, 50–383 Wroclaw, Poland http://www.ii.uni.wroc.pl/

Institut National de Recherche en Informatique VTT Technical Research Centre of Finland Ltd et en Automatique PO Box 1000 B.P. 105, F-78153 Le Chesnay, France FIN-02044 VTT, Finland http://www.inria.fr/ http://www.vttresearch.com

Subscribe to ERCIM News and order back copies at http://ercim-news.ercim.eu/