Ivelize Rocha Bernardo Promoting Interoperability of Biodiversity

University of Campinas (Universidade Estadual de Campinas)
Institute of Computing (Instituto de Computação)

Ivelize Rocha Bernardo

Promoting Interoperability of Biodiversity Spreadsheets via Purpose Recognition
Promovendo Interoperabilidade de Planilhas de Biodiversidade através do Reconhecimento de Propósito

Dissertation presented to the Institute of Computing of the University of Campinas in partial fulfillment of the requirements for the degree of Doctor in Computer Science.

Supervisor: Prof. Dr. André Santanchè

This copy corresponds to the final version of the thesis defended by Ivelize Rocha Bernardo and supervised by Prof. Dr. André Santanchè.

Campinas, 2017

Funding agency and grant number: FAPESP, 2012/16159-6

Catalog record
University of Campinas, Library of the Institute of Mathematics, Statistics and Scientific Computing
Ana Regina Machado - CRB 8/5467

Bernardo, Ivelize Rocha, 1982-
B456p Promoting interoperability of biodiversity spreadsheets via purpose recognition / Ivelize Rocha Bernardo. – Campinas, SP : [s.n.], 2017.
Supervisor: André Santanchè.
Thesis (doctorate) – University of Campinas, Institute of Computing.
1. Biodiversity. 2. Biodiversity - Databases. 3. Machine learning. 4. Electronic spreadsheets. 5. Semantic integration (Computer systems). I. Santanchè, André, 1968-. II. University of Campinas. Institute of Computing. III. Title.
Information for the Digital Library
Title in another language: Promovendo interoperabilidade de planilhas de biodiversidade através do reconhecimento de propósito
Keywords in English: Biodiversity; Biodiversity - Databases; Machine learning; Electronic spreadsheets; Semantic integration (Computer systems)
Area of concentration: Computer Science
Degree: Doctor in Computer Science
Examination committee: André Santanchè [Supervisor]; Antonio Mauro Saraiva; José Laurindo Campos dos Santos; Flavio Antonio Maës Santos; Julio Cesar dos Reis
Defense date: 24-10-2017
Graduate program: Computer Science

Examination Committee:
• Prof. Dr. André Santanchè, Instituto de Computação - Unicamp
• Prof. Dr. Antonio Mauro Saraiva, Escola Politécnica - USP
• Prof. Dr. José Laurindo Campos dos Santos, Coordenação de Ação Estratégica - INPA
• Prof. Dr. Flavio Antonio Maës Santos, Instituto de Biologia - Unicamp
• Prof. Dr. Julio Cesar dos Reis, Instituto de Computação - Unicamp

The defense minutes, with the respective signatures of the committee members, are filed in the student's academic record.

Campinas, October 24, 2017

“He who loves practice without theory is like the sailor who boards ship without a rudder and compass and never knows where he may cast.” (Leonardo da Vinci (1452-1519))

Acknowledgements

If I have arrived here, it is because every person who crossed my path brought me a new experience of self-improvement, and I do not have words to thank them!

How can I thank my advisor, Prof. Dr. André Santanchè, enough for his guidance, dedication, and encouragement throughout the project? Moreover, Prof. Dr. Claudia Bauzer Medeiros, Prof. Dr.
Helio Pedrini, Prof. Dr. Maria Cecília Calani Baranauskas, Dr. Debora Pignatari Drucker, and Dr. Talita Soares Reis collaborated on this research work, making it even better.

How can I thank Prof. Dr. Alvaro A. Fernandes for giving me one of the most amazing experiences of my life? I will never forget the opportunity he offered me to develop my research project for a year at the University of Manchester with him. Prof. Dr. Norman Paton, together with Alvaro, inspired me with his wise questions and helped me improve greatly as a researcher. Prof. Dr. Carole Goble and everyone who works with her: I cannot say how pleased I am to have had the opportunity to receive your advice and help. I have learned a lot from you all. Thank you very much for everything!

My friends from Manchester, who made my life more colorful during winter. My friends from LIS-Unicamp, who loved me even when the weather was low. Bianca, for having this huge heart; I will never forget what you did for me. Gi, Artemis, and Shella, who welcomed me so well. Helo, because "life is not a cartesian plan!".

My father, who has inspired me with his strength and his optimistic way of facing obstacles. My mother, for her endless dedication, her love, and her patience, and for encouraging me to face the challenges. Mom, you showed me that life can be unexpected and that, sometimes, we just need to take the next step to open a world of opportunities. My sister, for being my best friend; without her, life would not be so full of love and pure complicity. My grandma, for teaching me the love of knowledge.

My friends Eddy and Bia, for encouraging me so much, especially in these last months; I cannot say what this doctorate would have been without you both. Lucas and Mau, for always supporting me and saving me by offering their love, their time, their house, and the love of their dog, Yoshi.
Lilian and Ale, for sharing their smiles, their thoughts, their couch, and their love. Vania, André, and Lettys, who have been in my life, always bringing me so many good things. Chris, for encouraging me to follow my way and for always making me smile. Special thanks to my friend Guilherme, who is no longer with us, but whom I am going to remember for the rest of my life. Guys, you all definitely make my life worthwhile.

All professors, staff, and colleagues at UNICAMP and the University of Manchester.

This work was developed at UNICAMP and partially at the School of Computer Science at the University of Manchester, and was financed by FAPESP (2014/21963-4) and FAPESP (2012/16159-6). I am pleased to acknowledge them. The opinions expressed in this work do not necessarily reflect those of the funding agencies.

Resumo

There are many initiatives to promote "intelligent openness" or the "FAIR principles" for data, that is, ways of making data available, accessible, interoperable, and reusable. However, in the biodiversity domain, it is still common for biologists to produce their data in ad hoc and heterogeneous formats. Compliance with a standard imposes on them an upfront cost of restructuring and annotating their data. This research addresses this scenario with a focus on spreadsheets. It contributes a technique to automatically produce semantic annotations on data extracted from spreadsheets, exploiting the way attributes are organized in their schemas to infer their purpose. The resulting semantic data can be integrated, articulated, and manipulated according to their purpose, in an incremental and exploratory approach, allowing biologists to navigate and interact with an interconnected network of biodiversity data.

Abstract

Many initiatives promote "intelligent openness" or the "FAIR principles" for data, i.e., ways to make data Findable, Accessible, Interoperable, and Reusable.
They rely on compliance with reference schemas, common standards, or ontologies. However, in the biodiversity domain, biologists still commonly produce their data in ad hoc and heterogeneous formats. Complying with a standard imposes on them an upfront cost of restructuring and annotating their data. This research addresses this scenario, focusing on spreadsheets. It presents our technique to automatically produce semantic annotations for data extracted from spreadsheets, exploiting the way attributes are arranged in their schemas to infer their purpose. Elements of the resulting semantic dataset can be integrated, articulated, and handled according to their purpose, in an incremental and exploratory approach, allowing biologists to navigate and interact with an interconnected network of biodiversity data.

List of Figures

2.1 Fields characterization
2.2 Terms by schema of initial lines
2.3 SciSpread - Proportions among fields of catalog spreadsheets
2.4 Survey - Proportions among fields of catalog spreadsheets
2.5 SciSpread - Proportions among fields of event spreadsheets
2.6 Survey - Proportions among fields of event spreadsheets
2.7 Comparative terms quantities between spreadsheets category
2.8 Comparative terms location between spreadsheets nature
2.9 Spreadsheet 1 - used in the Survey
2.10 Spreadsheet 2 - used in the Survey
2.11 Comparative results about spreadsheets classification
2.12 Spreadsheet 3 - used in the Survey
2.13 Conceptual model for catalog spreadsheets annotated with qualifiers
3.1 Biodiversity data grouped by purpose [35]
3.2 Set of spreadsheets used by scientists to record biodiversity data
3.3 Spreadsheets of our survey [9] analysing how the organization of attributes influences the interpretation of a spreadsheet
3.4 Operations for biodiversity data sets according to their purpose [35]
3.5 System

