Ivelize Rocha Bernardo Promoting Interoperability of Biodiversity


Universidade Estadual de Campinas
Instituto de Computação

Ivelize Rocha Bernardo

Promoting Interoperability of Biodiversity Spreadsheets via Purpose Recognition
Promovendo Interoperabilidade de Planilhas de Biodiversidade através do Reconhecimento de Propósito

Campinas, 2017

Dissertation presented to the Institute of Computing of the University of Campinas in partial fulfillment of the requirements for the degree of Doctor in Computer Science.

Supervisor: Prof. Dr. André Santanchè

This copy corresponds to the final version of the thesis defended by Ivelize Rocha Bernardo and supervised by Prof. Dr. André Santanchè.

Funding agency and grant number: FAPESP, 2012/16159-6

Cataloging record
Universidade Estadual de Campinas, Biblioteca do Instituto de Matemática, Estatística e Computação Científica
Ana Regina Machado - CRB 8/5467

Bernardo, Ivelize Rocha, 1982-
B456p Promoting interoperability of biodiversity spreadsheets via purpose recognition / Ivelize Rocha Bernardo. Campinas, SP: [s.n.], 2017.
Supervisor: André Santanchè.
Thesis (doctorate), Universidade Estadual de Campinas, Instituto de Computação.
1. Biodiversity. 2. Biodiversity - Databases. 3. Machine learning. 4. Electronic spreadsheets. 5. Semantic integration (Computer systems). I. Santanchè, André, 1968-. II. Universidade Estadual de Campinas. Instituto de Computação. III. Title.
Information for the Digital Library
Title in another language: Promovendo interoperabilidade de planilhas de biodiversidade através do reconhecimento de propósito
Keywords in English: Biodiversity; Biodiversity - Databases; Machine learning; Electronic spreadsheets; Semantic integration (Computer systems)
Area of concentration: Computer Science
Degree: Doctor in Computer Science
Examining committee: André Santanchè [Supervisor], Antonio Mauro Saraiva, José Laurindo Campos dos Santos, Flavio Antonio Maës Santos, Julio Cesar dos Reis
Defense date: 24-10-2017
Graduate program: Computer Science

Examining committee (affiliations):
  • Prof. Dr. André Santanchè, Instituto de Computação - Unicamp
  • Prof. Dr. Antonio Mauro Saraiva, Escola Politécnica - USP
  • Prof. Dr. José Laurindo Campos dos Santos, Coordenação de Ação Estratégica - INPA
  • Prof. Dr. Flavio Antonio Maës Santos, Instituto de Biologia - Unicamp
  • Prof. Dr. Julio Cesar Dos Reis, Instituto de Computação - Unicamp

The defense minutes, with the respective signatures of the committee members, are filed in the student's academic record.

Campinas, October 24, 2017

“He who loves practice without theory is like the sailor who boards ship without a rudder and compass and never knows where he may cast.” (Leonardo da Vinci, 1452-1519)

Acknowledgements

If I arrived here, it is because every person who crossed my life brought me a new experience of self-improvement, and for them, I do not have words to say thank you!

How can I thank my advisor, Prof. Dr. André Santanchè, enough for his guidance, dedication, and encouragement throughout the project? I also thank Prof. Dr. Claudia Bauzer Medeiros, Prof. Dr. Helio Pedrini, Profa. Dra. Maria Cecília Calani Baranauskas, Dra. Debora Pignatari Drucker, and Dra. Talita Soares Reis, who collaborated on this research work, making it even better.

How can I say thank you to Prof. Dr. Alvaro A. Fernandes for allowing me to have one of the most amazing experiences of my life? I will never forget the opportunity he offered me to develop my research project for a year at the University of Manchester with him. Prof. Dr. Norman Paton, who together with Alvaro inspired me with their wise questions and helped me improve greatly as a researcher. Prof. Dr. Carole Goble and everyone who works with her, I cannot say how pleased I am to have had the opportunity to receive your advice and help. I have learned a lot from you all. Thank you very much for everything!

My friends from Manchester, who made my life more colorful during winter. My friends from LIS-Unicamp, who loved me even when the weather was low. Bianca, for having this huge heart; I will never forget what you did for me. Gi, Artemis, and Shella, who welcomed me so well. Helo, because "life is not a Cartesian plane!".

My father, who has inspired me with his strength and his optimistic way of facing obstacles. My mother, for her endless dedication, her love, and her patience, and for encouraging me to face the challenges. Mom, you showed me that life could be unexpected and that, sometimes, we just need to take the next step to open a world of opportunities. My sister, for being my best friend; without her, life would not be so full of love and such pure complicity. My grandma, for teaching me the love of knowledge.

My friends Eddy and Bia, for encouraging me so much, especially in these last months; I can't say what this doctorate would have been without you both. Lucas and Mau, for always supporting me and saving me by offering their love, their time, their house, and the love of their dog, Yoshi.
Lilian and Ale, for sharing their smiles, their thoughts, their couch, their love. Vania, André, and Lettys, who have been in my life and always bring me so many good things. Chris, for encouraging me to follow my way, and for always making me smile. Special thanks to my friend Guilherme, who is no longer with us, but whom I am going to remember for the rest of my life. Guys, you all definitely make my life worthwhile. All professors, staff, and colleagues at UNICAMP and the University of Manchester.

This work was developed at UNICAMP and partially at the School of Computer Science at the University of Manchester, and was financed by FAPESP (2014/21963-4) and FAPESP (2012/16159-6). I am pleased to acknowledge them. The opinions expressed in this work do not necessarily reflect those of the funding agencies.

Resumo

There are many initiatives to promote "intelligent openness" or the "FAIR principles" for data, that is, ways of making data findable, accessible, interoperable, and reusable. However, in the biodiversity domain, it is still common for biologists to produce their data in ad hoc and heterogeneous formats. Compliance with a standard imposes on them an upfront cost of restructuring and annotating their data. This research addresses this scenario with a focus on spreadsheets. It contributes a technique to automatically produce semantic annotations on data extracted from spreadsheets, exploiting the way attributes are organized in their schemas to infer their purpose. The resulting semantic data can be integrated, articulated, and manipulated according to their purpose, in an incremental and exploratory approach, allowing biologists to navigate and interact with an interconnected network of biodiversity data.

Abstract

There are many initiatives to promote "intelligent openness" or the "FAIR principles" for data, i.e., ways to make data Findable, Accessible, Interoperable, and Reusable.
They rely on compliance with reference schemas, common standards, or ontologies. However, in the biodiversity domain, it is still common for biologists to produce their data in ad hoc and heterogeneous formats. Compliance with a standard imposes on them an upfront cost of restructuring and annotating their data. This research addresses this scenario, focusing on spreadsheets. It presents our technique to automatically produce semantic annotations on data extracted from spreadsheets, exploiting the way that attributes are arranged in their schemas to infer their purpose. Elements of the resulting semantic dataset can be integrated, articulated, and handled according to their purpose, in an incremental and exploratory approach, allowing biologists to navigate and interact with an interconnected network of biodiversity data.

List of Figures

2.1 Fields characterization
2.2 Terms by schema of initial lines
2.3 SciSpread - Proportions among fields of catalog spreadsheets
2.4 Survey - Proportions among fields of catalog spreadsheets
2.5 SciSpread - Proportions among fields of event spreadsheets
2.6 Survey - Proportions among fields of event spreadsheets
2.7 Comparative terms quantities between spreadsheets category
2.8 Comparative terms location between spreadsheets nature
2.9 Spreadsheet 1 - used in the Survey
2.10 Spreadsheet 2 - used in the Survey
2.11 Comparative results about spreadsheets classification
2.12 Spreadsheet 3 - used in the survey
2.13 Conceptual model for catalog spreadsheets annotated with qualifiers
3.1 Biodiversity data grouped by purpose [35]
3.2 Set of spreadsheets used by scientists to record biodiversity data
3.3 Spreadsheets of our survey [9] analysing how the organization of attributes influences the interpretation of a spreadsheet
3.4 Operations for biodiversity data sets according to their purpose [35]
3.5 System
Recommended publications
  • Understanding Semantic Aware Grid Middleware for E-Science
    Computing and Informatics, Vol. 27, 2008, 93–118. Understanding Semantic Aware Grid Middleware for e-Science. Pinar Alper, Carole Goble (School of Computer Science, University of Manchester, Manchester, UK; e-mail: {penpecip, carole}@cs.man.ac.uk); Oscar Corcho (School of Computer Science, University of Manchester, Manchester, UK, and Facultad de Informática, Universidad Politécnica de Madrid, Boadilla del Monte, ES; e-mail: [email protected]). Revised manuscript received 11 January 2007.
    Abstract. In this paper we analyze several semantic-aware Grid middleware services used in e-Science applications. We describe them according to a common analysis framework, so as to find their commonalities and their distinguishing features. As a result of this analysis we categorize these services into three groups: information services, data access services and decision support services. We make comparisons and provide additional conclusions that are useful to understand better how these services have been developed and deployed, and how similar services would be developed in the future, mainly in the context of e-Science applications.
    Keywords: Semantic grid, middleware, e-science
    1 Introduction. The Science 2020 report [40] stresses the importance of understanding and managing the semantics of data used in scientific applications as one of the key enablers of future e-Science. This involves aspects like understanding metadata, ensuring data quality and accuracy, dealing with data provenance (where and how it was produced), etc. The report also stresses the fact that metadata is not simply for human consumption, but primarily used by tools that perform data integration and exploit web services and workflows that transform the data, compute new derived data, etc.
  • Description Logics Emerge from Ivory Towers
    Description Logics Emerge from Ivory Towers Deborah L. McGuinness Stanford University, Stanford, CA, 94305 [email protected] Abstract: Description logic (DL) has existed as a field for a few decades yet somewhat recently have appeared to transform from an area of academic interest to an area of broad interest. This paper provides a brief historical perspective of description logic developments that have impacted their usability beyond just in universities and research labs and provides one perspective on the topic. Description logics (previously called terminological logics and KL-ONE-like systems) started with a motivation of providing a formal foundation for semantic networks. The first implemented DL system – KL-ONE – grew out of Brachman’s thesis [Brachman, 1977]. This work was influenced by the work on frame systems but was focused on providing a foundation for building term meanings in a semantically meaningful and unambiguous manner. It rejected the notion of maintaining an ever growing (seemingly adhoc) vocabulary of link and node names seen in semantic networks and instead embraced the notion of a fixed set of domain-independent “epistemological primitives” that could be used to construct complex, structured object descriptions. It included constructs such as “defines-an-attribute-of” as a built-in construct and expected terms like “has-employee” to be higher-level terms built up from the epistemological primitives. Higher level terms such as “has-employee” and “has-part-time-employee” could be related automatically based on term definitions instead of requiring a user to place links between them. In its original incarnation, this led to maintaining the motivation of semantic networks of providing broad expressive capabilities (since people wanted to be able to represent natural language applications) coupled with the motivation of providing a foundation of building blocks that could be used in a principled and well-defined manner.
  • Open PHACTS: Semantic Interoperability for Drug Discovery
    Drug Discovery Today, Volume 17, Numbers 21/22, November 2012. Reviews: Keynote Review. Open PHACTS: semantic interoperability for drug discovery.
    Authors: Antony J. Williams (1), Lee Harland (2), Paul Groth (3), Stephen Pettifer (4), Christine Chichester (5,6), Egon L. Willighagen (7), Chris T. Evelo (6,7), Niklas Blomberg (8), Gerhard Ecker (9), Carole Goble (4) and Barend Mons (6).
    Affiliations: (1) Royal Society of Chemistry, ChemSpider, US Office, 904 Tamaras Circle, Wake Forest, NC 27587, USA; (2) Connected Discovery Ltd., 27 Old Gloucester Street, London, WC1N 3AX, UK; (3) VU University Amsterdam, Room T-365, De Boelelaan 1081a, 1081 HV Amsterdam, The Netherlands; (4) School of Computer Science, The University of Manchester, Oxford Road, Manchester M13 9PL, UK; (5) Swiss Institute of Bioinformatics, CMU, Rue Michel-Servet 1, 1211 Geneva 4, Switzerland; (6) Netherlands Bioinformatics Center, P. O. Box 9101, 6500 HB Nijmegen, and Leiden University Medical Center, The Netherlands; (7) Department of Bioinformatics – BiGCaT, Maastricht University, The Netherlands; (8) Respiratory & Inflammation iMed, AstraZeneca R&D Mölndal, S-431 83 Mölndal, Sweden; (9) University of Vienna, Department of Medicinal Chemistry, Althanstraße 14, 1090 Wien, Austria.
    Author biographies: Antony J. Williams graduated with a PhD in chemistry as an NMR spectroscopist. Antony Williams is currently VP, Strategic development for ChemSpider at the Royal Society of Chemistry. He has written chapters for many books and authored >140 peer reviewed papers and book chapters on NMR, predictive ADME methods, internet-based tools, crowdsourcing and database curation. He is an active blogger and participant in the Internet chemistry network as @ChemConnector. Lee Harland is the Founder & Chief Technical Officer of ConnectedDiscovery, a company established to promote and manage precompetitive collaboration within the life science industry.
    Open PHACTS is a public–private partnership between academia,
  • The Fourth Paradigm
    About The Fourth Paradigm. This book presents the first broad look at the rapidly emerging field of data-intensive science, with the goal of influencing the worldwide scientific and computing research communities and inspiring the next generation of scientists. Increasingly, scientific breakthroughs will be powered by advanced computing capabilities that help researchers manipulate and explore massive datasets. The speed at which any given scientific discipline advances will depend on how well its researchers collaborate with one another, and with technologists, in areas of eScience such as databases, workflow management, visualization, and cloud-computing technologies. This collection of essays expands on the vision of pioneering computer scientist Jim Gray for a new, fourth paradigm of discovery based on data-intensive science and offers insights into how it can be fully realized.
    “The impact of Jim Gray’s thinking is continuing to get people to think in a new way about how data and software are redefining what it means to do science.” (Bill Gates)
    “I often tell people working in eScience that they aren’t in this field because they are visionaries or super-intelligent; it’s because they care about science and they are alive now. It is about technology changing the world, and science taking advantage of it, to do more and do better.” (Rhys Francis, Australian eResearch Infrastructure Council)
    “One of the greatest challenges for 21st-century science is how we respond to this new era of data-intensive
  • Data Curation + Process Curation = Data Integration + Science
    Briefings in Bioinformatics, Vol. 9, No. 6, 506-517. doi:10.1093/bib/bbn034. Advance Access publication December 6, 2008. Data curation + process curation = data integration + science. Carole Goble, Robert Stevens, Duncan Hull, Katy Wolstencroft and Rodrigo Lopez. Submitted: 16th May 2008; Received (in revised form): 25th July 2008.
    Abstract. In bioinformatics, we are familiar with the idea of curated data as a prerequisite for data integration. We neglect, often to our cost, the curation and cataloguing of the processes that we use to integrate and analyse our data. Programmatic access to services, for data and processes, means that compositions of services can be made that represent the in silico experiments or processes that bioinformaticians perform. Data integration through workflows depends on being able to know what services exist and where to find those services. The large number of services and the operations they perform, their arbitrary naming and lack of documentation, however, mean that they can be difficult to use. The workflows themselves are composite processes that could be pooled and reused but only if they too can be found and understood. Thus appropriate curation, including semantic mark-up, would enable processes to be found, maintained and consequently used more easily. This broader view on semantic annotation is vital for full data integration that is necessary for the modern scientific analyses in biology. This article will brief the community on the current state of the art and the current challenges for process curation, both within and without the Life Sciences.
  • Social Networking Site for Researchers Aims to Make Academic Papers a Thing of the Past 16 July 2009
    Social networking site for researchers aims to make academic papers a thing of the past. 16 July 2009.
    myExperiment, the social networking site for scientists, has set out to challenge traditional ideas of academic publishing as it enters a new phase of funding. The site has just received a further £250,000 funding from the Joint Information Systems Committee (JISC) as part of the JISC Information Environment programme to improve scholarly communication in contemporary research practice.
    According to Professor David De Roure at the University of Southampton’s School of Electronics and Computer Science, who has developed the site jointly with Professor Carole Goble at the University of Manchester, researchers will in the future be sharing new forms of “Research Objects” rather than academic publications. Research Objects contain everything needed to understand and reuse a piece of research, including workflows, data, research outputs and provenance information. They provide a systematic and unbiased approach to research, essential when researchers are faced with a deluge of data.
    ‘We are introducing new approaches to make research more reproducible, reusable and reliable,’ Professor De Roure said. ‘Research Objects are self-contained pieces of reproducible research which we will share in the future like papers are shared today.’
    The myExperiment Enhancement project will integrate myExperiment with the established EPrints research repository in Southampton and Manchester’s new e-Scholar institutional repository.
    and traditional ideas of repositories,’ said Professor Carole Goble. ‘myExperiment paves the way for the next generation of researchers to do new research using new research methods.’
    In its first year, the myExperiment.org website has attracted thousands of users worldwide and established the largest public collection of its kind.
    More information: www.myexperiment.org
    Source: University of Southampton
  • FAIR Computational Workflows
    Implementation article. Related to other papers in this special issue: 29 (p285); 8 (p78). Addressing FAIR principles: F1, F2, F3, F4, A1, A1.1, A1.2, A2, I1, I2, I3, R1, R1.1, R1.2, R1.3.
    FAIR Computational Workflows. Carole Goble (1,†), Sarah Cohen-Boulakia (2), Stian Soiland-Reyes (1,4), Daniel Garijo (3), Yolanda Gil (3), Michael R. Crusoe (4), Kristian Peters (5) & Daniel Schober (5).
    Affiliations: (1) Department of Computer Science, The University of Manchester, Oxford Road, Manchester M13 9PL, UK; (2) Laboratoire de Recherche en Informatique, CNRS, Université Paris-Saclay, Batiment 650, Université Paris-Sud, 91405 ORSAY Cedex, France; (3) Information Sciences Institute, University of Southern California, Marina Del Rey CA 90292, USA; (4) Common Workflow Language project, Software Freedom Conservancy, Inc. 137 Montague St STE 380, NY 11201-3548, USA; (5) Leibniz Institute of Plant Biochemistry (IPB Halle), Department of Biochemistry of Plant Interactions, Weinberg 3, 06120 Halle (Saale), Germany.
    Keywords: Computational workflow; Reproducibility; Software; FAIR data; Provenance
    Citation: C. Goble, S. Cohen-Boulakia, S. Soiland-Reyes, D. Garijo, Y. Gil, M.R. Crusoe, K. Peters & D. Schober. FAIR computational workflows. Data Intelligence 2(2020), 108–121. doi: 10.1162/dint_a_00033
    Abstract. Computational workflows describe the complex multi-step methods that are used for data collection, data preparation, analytics, predictive modelling, and simulation that lead to new data products. They can inherently contribute to the FAIR data principles: by processing data according to established metadata; by creating metadata themselves during the processing of data; and by tracking and recording data provenance. These properties aid data quality assessment and contribute to secondary data usage. Moreover, workflows are digital objects in their own right.
  • Hosts: Monash Eresearch Centre and Messagelab Seminar :The Long Tail Scientist Presenter: Prof Carole Goble, Computer Science
    Hosts: Monash eResearch Centre and MessageLab
    Seminar: The Long Tail Scientist
    Presenter: Prof Carole Goble, Computer Science, University of Manchester
    Venue: Seminar Room 135, Building 26 Clayton
    Time and Date: Wed 3 August 2011, 5-6pm
    Abstract: Big science with big, coordinated and collaborative programmes – the Large Hadron Collider, the Sloan Sky Survey, the Human Genome and its successor the 1000 Genomes project – hogs headlines and fascinates funders. But this big science makes up a small fraction of research being done. Whilst the big journals – Nature, Science – are often the first to publish breakthrough research, work in a vast array of smaller journals still contributes to scientific knowledge. Every day, PhD students and post-docs are slaving away in small labs building up the bulk of scientific data. In disciplines like chemistry, biology and astronomy they are taking advantage of the multitude of public datasets and analytical tools to make their own investigations. Jim Downing at the Unilever Centre for Molecular Informatics was one of the first to coin the term of “Long Tail Science” – that large numbers of small researcher units is an important concept. We are not just standing on the shoulders of a few giants but standing on the shoulders of a multitude of the average sized. Ten years ago I started up the myGrid e-Science project (http://www.mygrid.org.uk) specifically to help the long tail bioinformatician, and later other long tail scientists from other disciplines. And it turns out that the software, services and methods we develop and deploy (the Taverna workflow system, myExperiment, BioCatalogue, SysMO-SEEK, MethodBox) apply just as well to the big science projects.
  • The Rise of Bioinformatics and the in Silico Experiment Has Revolutionised the Life Sciences
    Getting Serious about a Community Bio-Service Catalogue Carole Goble and Katy Wolstencroft The Open Middleware Infrastructure Institute, The School of Computer Science The University of Manchester, UK [email protected], [email protected] http://www.omii.ac.uk The rise of bioinformatics and the in silico experiment has revolutionised the Life Sciences. Biologists now share a global community with rich, publicly accessible data resources and analysis tools; currently 850 databases are publicly web-accessible [1]. To get the most value out of these resources, however, they need to be able to integrate, interrogate and mine heterogeneous data and associated knowledge from distributed sources. myGrid (http://www.mygrid.org.uk) has developed the Taverna workbench which is a platform for accessing these distributed resources and providing the mechanical means of interoperating between them using workflows [2]. Workflows are an embodiment of the experimental protocol, to be repeated, reused, inspected and shared to improve and disseminate experimental best practice [3]. To enable scientists to design workflows, they have to discover services (and other prior workflows) and understand how to invoke them. Taverna enables access to 3000+ services that can become steps in a workflow: a bewildering, and increasing, number. To invoke a service the scientist must know the format of the input(s) the service is expecting. To combine services, they must also know the output formats. The heterogeneity of the bioinformatics domain and a lack of standard data format(s) means that describing services with simple typing is impractical. Describing the syntactic interface does not provide enough information for the user to successfully invoke the service.
  • Benchmarking Workflow Discovery: A Case Study From Bioinformatics
    Concurrency and Computation: Practice and Experience. Concurrency Computat.: Pract. Exper. 2000; 00:1-7. Benchmarking Workflow Discovery: A Case Study From Bioinformatics. Antoon Goderis (1), Paul Fisher (1), Andrew Gibson (1,3), Franck Tanoh (1), Katy Wolstencroft (1), David De Roure (2), Carole Goble (1,*).
    Affiliations: (1) School of Computer Science, The University of Manchester, Manchester M13 9PL, United Kingdom; (2) School of Electronics and Computer Science, University of Southampton, Southampton SO17 1BJ, United Kingdom; (3) Swammerdam Institute for Life Sciences, Universiteit van Amsterdam, Amsterdam, The Netherlands.
    Summary. Automation in science is increasingly marked by the use of workflow technology. The sharing of workflows through repositories supports the verifiability, reproducibility and extensibility of computational experiments. However, the subsequent discovery of workflows remains a challenge, both from a sociological and technological viewpoint. Based on a survey with participants from 19 laboratories, we investigate current practices in workflow sharing, re-use and discovery amongst life scientists chiefly using the Taverna workflow management system. To address their perceived lack of effective workflow discovery tools, we go on to develop benchmarks for the evaluation of discovery tools, drawing on a series of practical exercises. We demonstrate the value of the benchmarks on two tools: one using graph matching, the other relying on text clustering.
    Key words: Scientific Workflow, Bioinformatics, Discovery, Benchmark, Taverna, myExperiment
    *Correspondence to: [email protected]
    Copyright © 2000 John Wiley & Sons, Ltd.
    1. Introduction. The process of scientific research has a crucial social element: it involves the sharing and publication of protocols and experimental procedures so that results can be reproduced and properly interpreted, and so that others may re-use, repurpose and extend protocols to support the advancement of science.
  • Professor Carole Goble Dr. John Brooke Summary of Talk
    Enabling Grid and e-Science Projects in the North-West. Professor Carole Goble, Dr. John Brooke. http://www.esnw.ac.uk
    Summary of talk
    • Role and structure of ESNW
    • Hub and spoke model
    • Networking and AccessGrid
    • Two examples of applications
    • Future strategy
    Enabling Science in the North West. ESNW Strategy (diagram labels): Bio-Medical; Physics and Astronomy; Social science; Chemistry; Semantic and Knowledge Technologies; Database Technologies; Generic Grid and e-Science technologies.
    ESNW links and outreach (diagram labels): RealityGrid; NERC projects; myGrid; MRC projects; ESRC projects; Text Mining; JISC projects; DTI projects; Local NWDA e-Science; GOSC; NW-Grid.
    ESNW Strategy: An Umbrella to NW e-Science Activities (diagram labels): Infrastructure; DAI; UK activities; e-Science pilots; EU e-Science.
    ESNW Structure (organisation chart labels): Industrial Reps; Faculty of Science; Manchester Computing; Regional Reps; Steering Committee; Development Advisors; Co-Directors; Grid Architect; Manager; Management Team; Admin Support; Technical Support.
    Rolling out ESNW: Hub and Spoke
    • £3.18 million UoM investment supplements £1.7 million DTI/EPSRC
    • Virtual presence in spoke nodes
      – Access Grid: deployment of 6 AGs, more funded by other means
      – High performance comms fabric, implemented in July 2004
    • Spoke node strategy
      – PhysicsGrid refurb, re-equip & comms case
      – National Institute for Bio-Health Informatics (funded by NWDA)
      – National Centre for e-Social Science
      – National Centre for Text Mining
    • Core Staff
    (Diagram labels: NaCTEM; MIB; NIHBI; Physics Grid; Kilburn Building; Medical School; Jodrell Bank; Teaching Hospitals; Social Science; Chemistry; NCeSS.)
    Multi-site collaboration
    • Multicast
    • AG uses multicast networking
    • Enables more efficient use of bandwidth
    • Multicast deployed over JANET core and at most e-Science centres
    Other hub and spoke models: Hub at Manchester, nodes have specialisms e.g.
  • Anchors in Shifting Sand: the Primacy of Method in the Web of Data
    Anchors in Shifting Sand: the Primacy of Method in the Web of Data. David De Roure (School of Electronics and Computer Science, University of Southampton, Southampton SO17 1BJ, UK, +44 23 8059 2418, [email protected]); Carole Goble (School of Computer Science, The University of Manchester, Manchester M13 9PL, UK, +44 161 275 6195, [email protected]).
    Abstract. The wealth of new government and scientific data appearing on the Web is to be welcomed and makes it possible for citizens and scientists to interpret evidence and obtain new insights. But how will they do this, and how will people trust the results? We suggest the Linked Data Web must embrace the “methods” by which results are obtained as well as the results themselves. By making methods first class citizens, results can be explained, interpreted and assessed, and the methods themselves can be shared, discussed, reused and repurposed. We present the myExperiment.org website, a social network of people sharing reusable methods for processing research data, and make some
    In the interest of reproducibility we must prima facie focus on making explicit the method by which results are generated. Methods can then be first-class Web citizens in our emerging scientific practice. For example we can create a “pipeline”, “script”, “mashup”, “workflow”, “query” or “business process” to generate a result based on data sources, and this provides our route to repeatability, reproducibility and reuse. We get the latest results, and we better understand the provenance of our results so that they can be explained, interpreted, trusted and reused by others.
    Crucially, those working with the data also benefit from shareable methods.