<<

A Survey of Informatics: Concepts, Practices, and Challenges

Luiz M. R. Gadelha Jr.1* Pedro C. de Siracusa1 Artur Ziviani1 Eduardo Couto Dalcin2 Helen Michelle Affe2 Marinez de Siqueira2 Luís Alexandre Estevão da Silva2 Douglas A. Augusto3 Eduardo Krempser3 Marcia Chame3 Raquel Lopes Costa4 Pedro Milet Meirelles5 and Fabiano Thompson6

1National Laboratory for Scientific Computing, Petrópolis, 2Friedrich-Schiller-University Jena, Jena, Germany 2Rio de Janeiro Botanical Garden, Rio de Janeiro, Brazil 3Oswaldo Cruz Foundation, Rio de Janeiro, Brazil 4National Institute of Cancer, Rio de Janeiro, Brazil 5Federal University of Bahia, Salvador, Brazil 6Federal University of Rio de Janeiro, Rio de Janeiro, Brazil

Abstract The unprecedented size of the human population, along with its associated economic activities, have an ever increasing impact on global environments. Across the world, countries are concerned about the growing resource consumption and the capacity of to provide them. To effectively conserve biodiversity, it is essential to make indicators and knowledge openly available to decision-makers in ways that they can effectively use them. The development and deployment of mechanisms to produce these indicators depend on having access to trustworthy data from field surveys and automated sensors, biological collections, molec- ular data, and historic academic literature. The transformation of this raw data into synthesized information that is fit for use requires going through many refinement steps. The methodologies and techniques used to manage and analyze this data comprise an area often called biodiversity informatics (or e-Biodiversity). Bio- diversity data follows a life cycle consisting of planning, collection, certification, description, preservation, discovery, integration, and analysis. Researchers, whether producers or consumers of biodiversity data, will likely perform activities related to at least one of these steps. This article explores each stage of the life cycle of biodiversity data, discussing its methodologies, tools, and challenges.

1 Introduction arXiv:1810.00224v2 [q-bio.PE] 7 Dec 2020 Humanity is increasingly influencing global environments [195]. In many countries, it has raised govern- mental concern about the unbalance between resource consumption by these activities and the capacity of ecosystems to provide resources. This unbalance has resulted, for instance, in loss of forest cover in many places, extinction of , and decreased availability of . Humans rely on services in various activities. These services, such as food and water, are a result of processes that occur within these ecosystems. Various studies show that there is a strong relationship between human activities, global changes, biodiversity, ecosystem processes, and ecosystem services. Chapin et al. [58] observe that biodiver- sity variables, such as the number of species present, the number of individuals of each species, and which species are present, along with the types of interactions (e.g. trophic, competitive, symbiotic) that are taking place between these species, determine the species traits that affect ecosystem processes. These traits can

*E-mail address: [email protected]

1 be defined as characteristics or attributes of species that are expressed by genes or affected by the environ- ment. They also observe that global changes, often triggered by humans, such as invasive species, increased atmospheric carbon dioxide, and land-use change can significantly alter these biodiversity variables and, consequently, the expression of species traits. This, in turn, affects ecosystem processes and their resulting services, which can have negative impacts on human development. Changes in these ecosystem services that are due to changes in biodiversity can sometimes be non-linear and stochastic, which can pose a significant risk to humans. Similar conclusions have been reached in other surveys on the relationship between biodi- versity, ecosystem functioning, and ecosystem services [53, 143]. Cardinale et al. [53] observe that after a species become extinct, the resulting changes to ecological processes strongly depend on which traits were eliminated. Hooper et al. [143] observe that is as significant to ecosystem change as the direct effects of global changes, such as elevated carbon dioxide in the atmosphere and ozone depletion. This, in turn, affects critical ecosystem services for the local population, such as food production, air quality, and fresh water.

Water ... Pest control

Food Recreation

Human Activities Ecosystem Services

Global Changes Ecosystem Processes

Increased CO2 Species evenness Species richness in atmosphere

Species interactions Species composition Land-use change ......

Figure 1: Relationship between global changes and biodiversity (modified from [58]).

A major effort to address the problem was started in 1992 during the Earth Summit in Rio de Janeiro with the signature of the Convention on Biological Diversity (CBD) [56], a legally-binding international treaty. Its main objectives are the conservation of biodiversity, including ecosystems, species, and genetic resources, and their sustainable and fair use. Countries are required to elaborate and execute a strategy for biodiversity conservation, known as a National Biodiversity Strategy and Action Plan (NBSAP), and to put in place mechanisms to monitor and assess the implementation of this strategy. They should periodically report their progress on implementing their NBSAPs. Brazil, for instance, has advanced considerably in the cre- ation of protected areas but recent expansion of agribusiness, mining, and hydroelectric power projects have threatened these advances [94]. The Strategic Plan for Biodiversity 2011-2020 defines actions to be taken by countries to achieve a set of twenty targets by 2020, known as the Aichi Biodiversity Targets. It should be observed that the United Nations General Assembly declared 2011-2020 the United Nations Decade on Biodi- versity. In 2012, Intergovernmental Platform on Biodiversity and Ecosystem Services (IPBES) was created to allow for closer cooperation between scientists and policy makers on assessing the status of biodiversity and

2 ecosystem services and their relationship. To meet targets on biodiversity conservation, Balmford et al. [19] observe that it is essential to make indicators and knowledge openly available to decision-makers in ways that they can effectively use them. The Group on Earth Observations Biodiversity Observation Network (GEO BON) proposed a set of twenty-two Essential Biodiversity Variables (EBVs) [211, 221] that should allow for monitoring and evaluating biodiversity change. The development and deployment of mechanisms to produce these indicators depend on having access to trustworthy data from field surveys and automated sensors, biological collections, molecular data, and historic academic literature. Peterson [215], however, shows that there are information gaps across thematic and geographical areas, suggesting that there should be funding and training for institutions and personnel working on biodiversity analysis to allow for evaluating EBVs globally. The transformation of raw data into synthesized data that is fit for use requires going through many refinement steps. One should assess its quality [59] by evaluating its taxonomic, geographical, and temporal accuracy. In many cases, the geographic coverage of the data is limited requiring, for instance, the use of species’ distribution models [218] to estimate the likelihood of a particular species to occur in some geographic region. The methodologies and techniques used to manage and analyze this data comprise an area often called biodiversity informatics (or e-Biodiversity) [34, 248, 117, 124, 138, 156]. Guralnick and Hill [117], for instance, propose the concept of a global biodiversity map that would record global patterns of biodiversity and how they change over time, derived various data sources such as , biodiversity literature, biological collections, and DNA sequence databases. On top of such map, various analyses, such as species richness and distributions, can be periodically updated using most recent data. Hardisty et al. [124] recently gathered requirements for biodiversity informatics systems. They include, for instance, an aggregator for taxonomic names that integrates the distinct existing checklists; the use of persistent identifiers to reference datasets, scientific workflows, and scientists; use of linked data and ontologies to enable integrating data on different aspects of biodiversity; advancing data quality and fitness-for-use evaluation methodologies. In this survey, we give an overview of this research area covering its main concepts, practices, and some of the existing challenges. According to the approach proposed in [183], biodiversity data follows a life cycle consisting of the following stages: planning, collection, certification, description, preservation, discovery, in- tegration, and analysis. It is worth noting that after the analysis activity, new biodiversity data management cycles may be triggered as a result. In this survey, we group these steps in two main stages: data management and analysis and synthesis. Such steps are illustrated in Figure 2. Researchers, whether producers or con- sumers of biodiversity data will likely perform activities related to at least one of these steps. The remainder of this article addresses each stage of the life cycle of biodiversity data, describing their methodologies, tools, recommendations, and challenges. In Table 1, we provide an overview of various tools and databases surveyed along with which steps of the e-Biodiversity life cycle they approach. In Table 2, we describe computational techniques used in each step of the e-Biodiversity life cycle that will be explored in this work. Throughout this work, we use definitions from the International Code of Nomenclature for algae, fungi and plants (ICN) [176]. This document outlines a set of rules and guidelines for scientifically naming and grouping plants, fungi, and algae, consisting of a universally adopted reference by the botanical scientific community. Nomenclature best-practices for other groups of organisms are governed by other (though simi- lar) documents. Within the domain of biology, is, in a general sense, the of classification of organisms. Organisms are classified according to their shared characteristics and grouped at distinct levels of specificity (or taxonomic ranks) using a hierarchical system, in which groups that are more specific are nested within broader ones. A taxonomic rank refers to the level of the taxonomic hierarchy at which a group of organisms is defined. The most relevant ranks adopted in (in descending hierarchical ) are Kingdom, Phylum (or Division), Class, Order, Family, , Species. The taxonomic resolution of a biological sample is the rank of the most specific taxonomic determination that has been assigned to it. For instance, if a sample has been determined up to the level of species, this rank is also its taxonomic resolution. As taxa relate to each other in a tree-like hierarchical structure (with each child taxon having exactly one parent, while a parent taxon can have one or more children), taxonomic identities of a specimen at ranks higher than its resolution can be directly determined. Although this term is not included in the ICN document, we use this

3 Table 1: A selection of e-Biodiversity databases and tools classified according to target life-cycle step: data planning and collection (DC), data quality and fitness-for-use (DQ), data description (DD), data preservation and publication (DP), data discovery and integration (DI), and computational modeling and data analysis (CM).

Tool or database name Reference DC DQ DD DP DI CM DMPTool [261] X Morpho [134] X X X X Metacat [30] X X X X X Catalogue of Life [231] X X X X X BHL [120] X X X X X eBird [263] X X X X X eMammal [129] X X X X X Brazilian Flora 2020 [98] X X X X X BRAHMS [95] X X X X X Jabot [242] X X X X X MorphoBank [201] X X X X X BaMBa [178] X X X X X Atlas of Living [26] X X X X X X speciesLink [50] X X X X X X GenBank [27] X X X X X X MG-RAST [181] X X X X X X BOLD [222] X X X X X X iPlant [110] X X X X X X VoSeq [208] X X X X X X SISS-Geo [57] X X X X X X BioGeomancer [118] X Taxamatch [224] X Geospatial Data Quality [200] X TNRS [42] X Kurator [190] X BioVel [125] X X X EU-Brazil OpenBio [11] X X X Model-R [232] X X X GGBN [84] X X X BioCollections [239] X X X GBIF [86] X X X X DataONE [182] X X X X OBIS [113] X X X X SiBBr [22] X X X X IPT [229] X X Scratchpads [247] X X X WholeTale [44] X X Maxent [218] X ConsNet [66] X OpenModeller [252] X SPAdes [20] X SAHM [189] X HipMer [107] X SUPER-FOCUS [241] X

4

Figure 2: Biodiversity Life Cycle.

definition throughout this text. A taxon is a taxonomic group of organisms at the level of any rank, which are considered by professional taxonomists to form a taxonomic unit. Species is one of the taxonomic ranks in which organisms can be classified. It is regarded to be a basic unit of taxonomic classification, although organisms can be further classified in lower-hierarchy taxonomic ranks (i.e., infraspecific ranks). Differently from other ranks, the name of a species is composed using a binomial nomenclature system, composed of the name of the genus followed by a specific epithet, e.g. Caryocar brasiliense, Myrcia guianensis, or Solanum lycocarpum. When botanists sample organisms in the field, they either collect part of the organism (e.g. a branch of a tree), the entire organism, or multiple individuals of the same type. Any of these collected biological materials is an evidence of the existence of a particular organism at some place and time and should be properly deposited in a biological collection for being preserved as a reference. A specimen is defined as one such evidence and refers to a punctual observation of a single kind of organism. Although a specimen could be classified by a taxonomist as being a representative of a given species, this is not a requirement for it to be included in scientific collections. Although taxonomists classify specimens in a best effort manner (the most taxonomically precise as possible), sometimes only higher ranks can be determined. The highest taxonomic rank at which the specimen could be identified is known as its taxonomic resolution. After properly deposited in a biological collection, each record receives a taxonomic identification that assigns the individual to a taxon. Physical specimens stored in biological collections (also referred to as vouchers) are often associated with complementary information, either annotated by the responsible collectors during the collecting act; or an-

5 Table 2: Computational techniques used in the e-Biodiversity life cycle: string processing (SP), metadata management (MD), conceptual modeling (CM), semantic web (SW), deep learning (DL), machine learning (ML), statistics (ST), geographical information systems (GS), graph theory (GT), parallel and (PD), web services (WS). e-Biodiversity life cycle step SP MD CM SW DL ML ST GS GT Data Planning and Collection X X Data Quality X X X X Data Description X Data Publication X Data Discovery and Integration X Ecological Niche Modeling X X X X Biodiversity Networks X Biodiversity Genomics X X X X Wildlife Health Analysis X X X X

notated at later stages, after the specimen is deposited in the collection [59]. Information from the collection event include the date, time, and the geographic location where the specimen was collected; the names of the collectors who were involved in the collection event; and eventual field notes describing contextual re- marks, such as weather conditions, habitat features, or the sampling method used. Another crucial piece of information is the taxonomic identity of the specimen, which can be determined by the collectors themselves or by professional taxonomists once the biological material is incorporated to the collection (although some materials eventually remain unidentified). The taxonomic identity of a specimen includes not only the taxon name assigned to the sample, but also its nomenclatural status and authorship, the name of the person who has provided the identification, and in some cases, information regarding the certainty of identification. As the taxonomic identity of a specimen can be re-evaluated by specialists several times after the first determination (though it requires that the investigator has access to the physical specimen), a history of determinations for specimens is usually stored in a collection. Vouchered specimens, together with their associated data, is what scientifically testifies a punctual observation of a species by a collector, at some location and at some point in time, and is thus referred to as a species occurrence. e-Biodiversity, or mainly known as “Biodiversity Informatics”, relates to the use of information technology (IT) to support the needs of understanding biodiversity, by organizing knowledge about individual biolog- ical organisms and the ecological systems they form. Over time, biodiversity informatics will deliver an increasingly interconnected digital resource supporting scientific research of the natural world [138]. The application of IT to support biodiversity research starts during the 1960s, tackling mainly two different as- pects: handling scientific collections in museums and botanic gardens, and providing a fundamental tool for the development of the Numerical Taxonomy [212, 290, 243, 246, 250, 171, 251, 254, 230, 74, 234]. In the 1970’s the use of computer to produce identification keys was explored [122], along with the expansion of their use in the management of herbarium records [31, 276, 280]. In the same decade, we also have the initiative of the Flora of North America Program to “create a computer data bank of taxonomic information about the vascular plants of North America north of Mexico.” [154]. The decade of the 1980’s is considered the “golden era” of the biodiversity informatics, characterized by a significant intellectual and bibliographic productivity and the sharing of ideas in scientific events. As result, iconic publications come to light, such as [7, 1]. In 1985, the foundations of a key organization in the development of the biodiversity informatics were launched: The Taxonomic Database Working Group (TDWG). The first meeting of TDWG was held at the Conservatoire et Jardin Botaniques, in Geneva, from 28th to 30th September 1985. In the first meetings, it became clear the need for elements of standardization for taxonomic databases and means for data exchange [266, 36]. In the same decade, we have witnessed the emergence of the first standards for data transfer [147, 281, 36, 70] and the establishment of the first institutional databases and information systems, such as

6 “TROPICOS, run by the Missouri Botanical Garden; ILDIS, an international system covering legumes; BONAP, the Biota of North America Program; and CITES, Cactaceae list run at the Royal Botanic Gardens Kew. For microorganisms the impetus seems to have come from the culture collections and information industry; both BIOSIS and DSM (Deutsche Sammlung von Microorganismen und Zellkulturen) have established master lists for bacteria and others are available for yeasts and certain fungi” [35]. At the end of the 1980s and beginning of the 1990s, the efforts for modeling taxonomic databases un- der the nomenclature rules and the dynamic of the classification became the concern of some researchers [24, 204, 8, 9, 137]. It is important to notice that it was during this period that small computers became more popular and affordable, added to the availability of database software and high-level programming lan- guages, which made possible the creation of experimental databases dedicated to floristic and monographic data, as also morphological and chemical data, among others [7]. In September 1990, an interdisciplinary workshop, organized by University of California and funded by NSF, united different professionals to discuss the benefits of modern computer techniques - expert workstation for , identification, phylogenetic trees, databases, and geographical information systems – for systematic biology [97]. In October 1990, a symposium called “Designs for a Global Plant Species Database”, held in Delphi, Greece, focused in dis- cussing different approaches for “creating and operating a global plant species information system – a data system that would provide international access to data accumulated on all of the world’s plants.” [35]. In 1992, the CBD, in its article 7, item D, says: “Maintain and organize, by any mechanism data, derived from identification and monitoring activities pursuant to subparagraphs (a), (b) and (c) above.” In the same convention, the article 17 was dedicated to the “Exchange of Information”, evocating the parties to facilitate the exchange of information of “technical, scientific and socio-economic research, as well as information on training and surveying programmes, specialized knowledge, indigenous and traditional knowledge as such and in combination with the technologies referred to in Article 16, paragraph 1. It shall also, where feasi- ble, include repatriation of information” [56]. This convention becomes the cornerstone of the Biodiversity Informatics area, coining its name as the result of the affiliation between five agencies, forming the “Cana- dian Biodiversity Informatics Consortium”, to implement the CBD in Canada [32]. In 1999, a new milestone was set with the findings of the final report of the OECD Megascience Forum - Working Group on Biological Informatics, which understood that “An international mechanism is needed to make biodiversity data and information accessible worldwide”, and “recommends that the governments of OECD countries establish and support a distributed system of interlinked and interoperable modules (databases, software and networking tools, search engines, analytical algorithms, etc.) that together will form a Global Biodiversity Information Facility (GBIF)” [32]. In September 2000 “ for Biodiversity” reached notoriety becoming cover of a special issue of the Science Magazine, consolidating itself as an area of research [262]. In 2004, the Global Biodiversity Information Facility (GBIF), an Internet-accessible interoperable network of biodiversity databases and information technology tools - went online with a prototype data portal1 for simultaneously accessing data from the world’s natural history collections, herbaria, culture collections, and observational databases [86]. In July 2012, around 100 experts from a wide variety of disciplines gathered in Copenhagen to the Global Biodiversity Informatics Conference (GBIC). The conference sets out a framework - The Global Biodiversity Informatics Outlook, launched in October 2013, to harness the immense power of information technology and an open data culture to gather unprecedented evidence about biodiversity and to inform better decisions [138]. The remaining of this article is organized as follows. In section 2, we describe the main steps of managing biodiversity data, including the tasks involved in planning and collecting biodiversity data (subsection 2.1), data quality and fitness-for-use issues in biodiversity (subsection 2.2), the main metadata standards and tools for biodiversity (subsection 2.3), the standards, tools, and systems that support publishing and preserving biodiversity data (subsection 2.4), and techniques used to integrate biodiversity data from different sources and for discovering it (subsection 2.5). In section 3, we discuss existing tools for analysis and synthesis of biodiversity data, including computational models and areas of application. Finally, in section 4, we conclude the survey by presenting some of the current challenges of managing and analyzing biodiversity data.

1http://www.gbif.org

7 2 Data Management

From planning and collecting biodiversity data to making it fit-for-use there are many steps that need to be followed, including data planning and collection, data quality and fitness-for-use, data description, data preservation and publication, and data discovery and integration. These steps comprise a biodiversity data management life-cycle, Figure 2 (top-right), that we present in this section.

2.1 Data Planning and Collection

The various steps of the biodiversity life cycle usually comprise a Data Management Plan (DMP) [183] of biodiversity research activities. Some research funding agencies in countries like the United States require the submission of a DMP in submissions to calls for research proposals. Many funding agencies require that proposals should include a data management plan. A DMP is usually comprised of: which data will be collected; which formats or standards will be used for this data; which metadata will be provided and in which standard or format; what are the policies for data usage and sharing; how data will be stored and how it will be preserved in the long-term; and how data management will be funded. The DMP Tool2, for instance, is an online tool that supports designing and implementing a data management plan. Biodiversity is concerned with the variety of living organisms, which can be measured in many different ways and scales, from a record of an organism observed in a geographical location at a particular date (a species occurrence) [288] to the relative abundance of species in a water sample collected at a long-term ecological research site [184]. Omics also present many opportunities for exploring biodiversity. Molecular data from environmental samples [286, 228], for instance, can be analyzed in metagenomics studies to identify functional traits and the taxonomic classification of organisms present. Biodiversity data can be collected in various ways: biosensor networks, field expeditions, observations made by citizen scientists, among others. In the collection process, it is important to use unique identifiers for project, sampling event, sampling area, and protocol used [259]. These identifiers will later allow the data collected to be stored in biodiversity databases consistently. Whenever possible, the terms should follow a controlled vocabulary or ontology, such as the Biodiversity Collections Ontology (BCO) [279]. Next, we list common sources of data that are used to describe and analyze biodiversity:

• Species Occurrences. Species occurrences are one of the most frequently available types of data concerning biodiversity. The main attributes of a species occurrence are given by: a taxon, which is defined as a group of one or more populations of an organism or organisms that form a unit; a location; and a date of occurrence. Species occurrence records originate from different sources. In order to facilitate the management and improve the accessibility of such information, most institutions currently maintain it organized in digital spreadsheets or in relational database systems, while also keeping references to the physical specimens they refer to. Some institutions are even deploying efforts towards digitizing the physical specimens. Hardisty et al. [124] observe that only about 10% of natural history collections are digitized and that tools are required to accelerate the process. In addition to specimens from biological collections, human observations are another source of species occurrences records. These observations take place, for instance, during field expeditions or even through citizen science initiatives (eBird [263], iNaturalist3). In some cases, species are maintained in the culture of living organisms, as is the case of various collections of fungi and other organisms.

• Species Checklists. Surveys are often performed within a geographic region, such as a continent [272], a country, or a national park, to determine which species are present in them. These surveys usually result in a list of taxon names called a species checklist. They might also be restricted to a particular kingdom or . Forzza et al. [98], for instance, describe how the Brazilian Flora List published in 2010 was assembled, which involved aggregating information about species vouchers from herbarium

2https://dmptool.org/ 3http://www.inaturalist.org

8 information systems and having taxonomists to review it. The Catalogue of Life4 aggregates more than 100 species checklists and contains information of about 1.6 million species.

• Sample-based and Observational Data. Sample-based data is collected during events, which may be one-time or periodical, typically involves environmental data and has a wide range and diversity of measurements. They may involve, for instance, abiotic measurements and population surveys in different temporal and spatial scales in transects, grids and plots [172]. A typical context in which they are collected are the Long-Term Ecological Research (LTER) projects [184] projects. Because of the heterogeneity of ecological data [225], there is not a controlled vocabulary that is widely used. Some initiatives in this direction include ontologies such as ENVO, OBOE, and BCO [279]. The most common tools for publishing ecological data rely on metadata to describe tabular datasets that comprise them. Such metadata allows general information, such as dataset owner identification, geographic, temporal, and taxonomic coverages, to be recorded, facilitating their interpretation by users. Metadata also allows textually describing the meaning of each column of a tabular dataset. Later in this article, the Ecological Metadata Language [93], a metadata standard for ecological datasets, will be described.

• Molecular Data. The analysis of DNA, RNA and proteins have various applications to the study of biodiversity. The ge- nomic sequences obtained directly from environmental samples containing communities of microorgan- isms, i.e. metagenomes [228], for instance, provide important information to analyze their taxonomic and functional characteristics. Biological sequences can support taxonomists as well [265] in identi- fying species. Taxonomists can also use small genomic or gene regions to assess biological diversity accross all domains of life. The Barcode of Life [222, 258] project, for example, analyzes and standard- izes small regions of genes to help in identifying species. Some systems, such as VoSeq [208], allow for connecting vouchers present in biological collections to DNA sequences present in genomic databases. Guralnick and Hill [117] observe that diversity can be more precisely measured, when compared to simply counting the number of species, by how species are phylogenetically related. As examples, they assess conservation priority of North American using their phylogenetic distinctness and extinc- tion risk and analyze the dispersal of the influenza A virus also using phylogenetic analysis.

• Academic Literature. A vast amount of information about the biodiversity is available in the academic literature. Field expeditions syntheses are often available only in scientific papers. Data related to sam- pling, collection, and their analysis have often not been propagated to biodiversity databases. Some initiatives, such as the Biodiversity Heritage Library (BHL) [120], aim to use technologies, such as op- tical character recognition, to extract this information from scientific articles and make them available in public databases.

• Images and Videos. Field expeditions to conduct sampling often involve the production of images and videos that support the analysis of the studied sites. In the following sections, we describe Audubon Core5, a controlled vocabulary for describing multimedia resources associated with sampling and species occurrence data.

• Remote Sensing. According to Turner et al. [271], most remote-sensing instruments do not have enough resolution to gather information about organisms but there were advances that enabled some aspects of biodiversity to be observed, such as differentiating species assemblages and tree species [67]. They also argue that, when instrument resolution is insufficient for direct observation, indirect methods can be applied to estimate species distributions and richness. Pettorelli et al. [217] observe that many EBVs could be derived from satellite remote sensing, which can provide global-scale regular monitoring. Some of these potential EBVs include, for instance, vegetation height and area index. It is also observed that raw satellite data could be processed by scientific workflows, including tasks such as statistical analysis and classification algorithms, to generate EBVs.

4http://www.catalogueoflife.org 5https://terms.tdwg.org/wiki/Audubon_Core

9 2.2 Data Quality and Fitness-for-Use

Although biodiversity scientists have undoubtedly benefited from to massive volumes of species occurrence data from many biological collections, there are some caveats that must be accounted for, before using data for modeling. Data is not always adequate for investigating every aspect of natural systems, using inadequate data for studying specific aspects of biological diversity can lead to erroneous or misleading results [59], and investigators must be aware of the inherent limitations of their data before formulating their questions. The availability of detailed information is still very scarce for most known organisms. This scenario, referred to as the Wallacean Shortfall [169], is even more critical in megadiverse countries, which still remain largely unexplored for many regions and taxonomic groups [248]. The lack of sufficient data for threatened species is even more concerning, as designing effective programs for their conservation require knowledge on their geographic distribution and ecological requirements. This shortage of data, combined with the non-systematic sampling and insufficient quality limits the use of data from biological collections for many intended applications, many of which require an intensive amount of data to be available [115]. Failing to account for the inherent limitations of such data while posing and investigating their hypotheses, researchers may obtain erroneous or misleading results, eventually impacting the success of management policies that rely on such information [59]. A definition for data quality based on its fitness for the intended use was first proposed in the context of geographical information systems [65], and became widely adopted by the biodiversity informatics com- munity. According to this definition, quality is not an absolute attribute of a dataset but is rather given by its potential to provide users with valuable information, in specific contexts. Assessing quality attributes of data is a fundamental step for any applications that might use it, and requires that users previously delimit the purpose, scope, and requirements of their investigation. Data is considered to be of high quality if it is suitable for supporting a given investigation. Depending on the application, users might need to improve the fitness of the data they have in hand, which is part of the data quality management process. Loss of qual- ity in biodiversity data can occur during multiple stages of its life cycle [59], including the moment of the recording event, its preparation before it is incorporated in the collection, its documentation, digitalization, and storage. Soberón and Peterson [248] list common issues regarding biodiversity data quality. Specimens of biological collections, from which a considerable amount of species occurrence data is extracted, may have incorrect or outdated taxonomic identifications. Biological taxonomy is constantly changing to accommodate new knowledge about species. It is also possible that there are georeferencing errors due to annotation error or instrument inaccuracy. In old records, due to the unavailability of mechanisms for accurate assessment of location, it is common to find only textual descriptions about where a specimen was collected. One important aspect that often limits the usability of primary data from biological collections concerns the way in which it is gathered in the field. In general, most species occurrence data composing biological collections derive from exploratory field expeditions, in which organisms are recorded in a non-systematic observational fashion by different collectors, using different methods and at distinct circumstances (though records resulting from experimental studies are eventually incorporated in museums as well). As a result, the distribution of the sampling effort in such datasets is uneven and rarely quantified, leading to sampling biases. Building models without accounting for biases in data has been observed to strongly impact their per- formance, leading to spurious results which can be misinterpreted and, ultimately, lead to wrong decisions. For instance, assessing patterns of species richness from species occurrence datasets has been shown to be particularly challenging due to geographical bias in data [144, 223], as higher diversities tend to be observed at more accessible sites due to higher sampling effort. As defined by [64], biases are uniform shifts in mea- sured values, resulting from systematic errors that are introduced by some measurement system. They are expressed as unrealistic tendencies in data, and can usually be mitigated with the adoption of random sam- pling designs. Sampling bias in biodiversity data can be classified into several distinct categories, depending on the aspect of data under investigation [76]. Not all taxa are quantitatively represented in biological collections in the same proportions as they occur in natural systems leading to taxonomic bias. Collection sites are not randomly selected in geographic space, nor they are all sampled to the same extent. As features of the landscape make some areas more accessible for collection activities than others [135], geographic bias arises as a consequence of non-uniform collecting effort

10 in geographic space. Some regions that are more accessible being thoroughly sampled (such as areas near urban centers, roadsides, and margins of rivers); while others that are more inaccessible, such as , being only poorly or not sampled at all. Geographic bias is also observed at broader scales. A compilation of the representativity of plants in GBIF by [180] has shown that among the most representative countries and regions are the United States (mainly the west ), Central America, countries in Europe (including the Nordic countries), Australia, Japan, and New Zealand. The patterns of the recording activities of collectors are not uniform over time. Instead, collectors often show preferences towards performing field work in periods when they can get more productive, have more financial resources, or can find more organisms of their interests, leading to temporal bias. One of the objectives of CBD is to establish a global knowledge network on taxonomy [124]. Taxonomic concepts [28] are often incorrectly modeled in biodiversity databases. Berendsohn [29] developed a concep- tual database model for the International Organisation for Plant Information covering the different aspects and concepts that are present in taxonomy. Several tools can be used to reduce or eliminate species misiden- tification. For instance, official species catalogs are available online for taxon querying, such as the Catalog of Life, the World Register of Marine Species6, and the Brazilian Flora Species List [98]. These can be used to support taxonomic data quality assessment of occurrence records. Most of these catalogs are also accessible via application programming interfaces () available via the web, allowing the automation of this type of assessment with scripts or applications. Dalcin [75] investigates data quality in taxonomic databases, proposing quality metrics and techniques for error prevention, detection, and correction. He explores which dimensions are present when evaluating taxonomic data quality. Accuracy would measure how correct and reliable the data is. Believability is defined as how true and credible data is. Completeness measures to what extent attributes or data are missing. Consistency is defined as the absence of contradictions in the data. Flexibility is the capacity of representation of data to change in order to accommodate new requirements. Relevance is given by how applicable and helpful data is to a particular task. Timeliness how up-to-date the data is. Some experiments are conducted on detecting spelling errors in scientific names, which may be caused by insertions, deletions, substitutions, transpositions, or combination of the previous. Two types of algorithms are applied, phonetic algorithms, such as Soundex [141], are based on pronunciation similarity. String similarity algorithms, such as Levenshtein distance [160], were also evaluated. Levenshtein algorithm showed a high incidence of false alarms. Phonetic algorithms had lower execution time than string-similarity ones. More recently, Rees [224] observes that taxonomic names can contain errors due to misspelling which can lead to failure in retrieving data. He proposes Taxamatch, a method for approximate matching of taxonomic names. It uses a modified version of the Damerau-Levenshtein Distance [278] algorithm for genus and species name matching and a phonetic algorithm for authority matching. Experiments showed that the method is able to identify close to 100% of errors in taxon scientific names. [124] observes that there are studies about biodiversity that do not require naming organisms. For instance, metagenomic studies concentrate on analyzing samples to classify them according to functional traits identified through sequence alignment with genomic databases. For collections that have digital images of their specimens available, a promising approach is to use deep learning techniques [236] to automate species identification [39]. Regarding georeferencing problems, Guralnick [117] mentions the importance of determining the geo- referencing uncertainty of occurrence records and its impact on the scale at which studies can be performed. Tools like BioGeomancer [118] and Geolocate7 try to infer what the geographic coordinates of an occurrence of species from a textual location description. Otegui and Guralnick [200] propose a web API that performs simple consistency checks in occurrence records, such as coordinates with zero value, disagreeing coordinates and country identification, and inverted coordinates. Veiga et al. [274] propose a framework for biodiversity data quality assessment and management that allows for users to define their data quality requirements and when a particular dataset is fit-for-use in a standardized manner. Data quality assessment is given by the evaluation of fitness for use of a dataset for some application. Data quality management is defined as the process of improving the fitness-for-use of a dataset.

6http://www.marinespecies.org/ 7http://www.museum.tulane.edu/geolocate/

11 The framework is given by three main components: DQ Needs, DQ Solutions, DQ Report. DQ Needs supports the definition of the intended use for a dataset, the respective data quality dimensions, acceptable criteria for data quality measurements in these dimensions; and activities to improve data quality. DQ Solutions describe mechanisms that support meeting the requirements defined in the DQ Needs component, such as tools that implement techniques to improve data quality measurements in some dimension. The DQ Report component describes the dataset that is being assessed and managed by the framework and assertions on this dataset describing measurements or amendments applied to it as specified in the other components. The authors envision a Fitness for Use Backbone that would implement these components and where participants could share their data quality requirements and tools. More recently, Morris et al. [190] have extended Kurator [81], a library of data curation scientific workflows, to report data quality in terms of the data quality framework proposed by Veiga et al. [274].

2.3 Data Description In the description step, metadata is produced to describe biodiversity data. This metadata is essential for users to interpret datasets they download. In this section, we describe the standards, practices, and recom- mendations for documenting and describing biodiversity data.

2.3.1 Ecological Metadata Language The Ecological Metadata Language (EML) [93] is a metadata standard originally developed for the descrip- tion of ecological data. It is also used currently to describe datasets about species observations. The standard has several profiles with their respective fields that can be used to define the attributes of a dataset. A sci- entific description profile contains fields such as the creator, geographic coverage (geographicCoverage), tem- poral coverage (temporalCoverage), taxonomic coverage (taxonomicCoverage), and sampling protocol (sam- pling) used. This profile is used to define attributes of the dataset as a whole. The data representation profile, through the dataTable entity, allows for describing the attributes of a tabular dataset. One can define the data types of such attributes, such as dates and numerical values, as well as their constraints, such as minimum and maximum values. Used together, the scientific description and the data representation profiles can provide good quality documentation for a dataset, supporting users to interpret it in a meaningful way. EML metadata is expressed with the XML language, illustrated in Listing 1. Normally, biodiversity databases provide tools for editing and producing metadata in the EML standard in a more user-friendly way through a graphical interface. The data repository DataONE [182], for example, allows users to provide metadata through a graphical tool called Morpho [134]. The same repository has also a web interface called Metacat [30], which allows for loading tabular ecological data in free format doc- umented with the EML standard. The EML standard is also used to describe datasets on species occurrences and sampling events, as will be described in the following section.

2.4 Data Preservation and Publication In the preservation stage, biodiversity datasets are published in some database, such as DataONE and GBIF, where they will be available to the scientific community. These databases adopt practices of curation and management of the data aiming its preservation and availability in the long term. There are several possible procedures for publication, in this section standards and procedures for loading a set of data to a biodiversity database will be described. The publication workflow of the main current repositories will also be described. For better management of biological collections [235], several systems for this purpose have been de- veloped in the last decades. Among the common features in this category of software are the management of specimens, control of determination history, taxonomy, images associated with specimens, bibliographic references, curatorial management activities, user management, reports tracking the evolution of collections, printing labels in varied sizes and data quality. These systems also have specific controls for the different types of scientific collections, such as exsicates, bromeliads, DNA, photo library, fruits, fungi, woods, orchids, seeds, in vitro, living, zoological, conservation, and related materials. Among the main ones are the BRAHMS

12 Baseline assessment of mesophotic reefs of West South Atlantic seamounts based on water quality, microbial diversity, benthic cover and <a href="/tags/Fish/" rel="tag">fish</a> biomass data Fabiano Thompson Federal University of Rio de Janeiro Assistant Professor \protect\vrule width0pt\protect\href{http://www.microbiologia.biologia.ufrj.br Seamounts are considered important ... Vitoria Trindade Chain -38.875 -16.375 -17.125 -21.375 2009-03-13 2009-03-22 Listing 1: Part of EML metadata in XML.

[95], used in more than 80 countries, allowing work with botanical collections, Specify8, which is used in more than 500 institutions worldwide for over 30 years, managing collections of flora and fauna; the Emu9 proprietary software for managing collections, including botanical ones; BG-Base10 also with more than 3 decades of use, widely used in botanical gardens and arboretums. Jabot [242] is used since 2005 in the Rio de Janeiro Botanical Garden and started to be shared in the model of with more than forty herbaria in Brazil.

2.4.1 Darwin Core

Darwin Core [282] is a standard for representing and sharing biodiversity data. The standard consists of a list of terms related to biodiversity and their definitions. Darwin Core (DwC) discussions, evolution, and main- tenance are conducted by TDWG (Biodiversity Information Standards), an association for the development and promotion of standards for recording and exchanging biodiversity data. DwC emerged as a term profile in the 1998 Species Analyst system developed by the University of Kansas for the management of biological collections. In 2002, it was adopted for the exchange of information in Mammal Networked Information System (MaNIS), a distributed system composed of several institutions that maintain biological collections of mammals. In 2009, the DwC standardization process was started, which was ratified in October of the same year at the TDWG annual meeting. DwC is based on the Dublin Core11 standard, taking advantage of

8http://specifysoftware.org 9https://emu.axiell.com 10http://www.bg-base.com 11http://dublincore.org/

13 its terms for resource description such as type, modified and license, and complementing them with specific biodiversity terms, such as catalogNumber and scientificName. The DwC vocabulary terms are organized as follows. The classes indicate the categories or entities defined in the standard. Examples of classes are: Event, Location and Taxon. Each class has a set of properties, which are its attributes. For example, the Location class has attributes such as country and decimalLatitude. Finally, values can be assigned to properties, such as “Chile”, −33.61 for the country and decimalLatitude properties, respectively. It is worth noting that it is recommended that, whenever possible, the values come from some controlled vocabulary, in the case of textual values, or some formatting standard, in case of numerical or temporal values. For example, species names from some recognized list of species, such as the Catalog of Life. Table 3 illustrates the representation of species occurrence data with DwC. These records come from a dataset published through the Brazilian Marine Biodiversity Database (BaMBa) [178] in GBIF12.

Table 3: Species occurrences represented with DwC. id eventDate decimalLatitude decimalLongitude scientificName 6 2002-08-01 -20.805828 -37.761231 Alectis ciliaris (Bloch, 1787) 118 2002-08-01 -22.382222 -37.587500 Balistes vetula (Linnaeus, 1758) 141 2002-08-01 -19.848744 -38.134635 crysos (Mitchill, 1815) 507 2002-08-01 -20.525417 -29.310350 Thunnus obesus (Lowe, 1839)

Normally, a set of data in the DwC format is accompanied by metadata, which is defined in the EML (Eco- logical Metadata Language) standard [93]. In EML, fields such as the title, authors, geographic and temporal coverage of the dataset are found, which help users interpret datasets formatted in the DwC standard. Like relational databases, datasets that follow the DwC format can contain multiple tables that are related through properties that are common to all of them. Such an organization allows, for example, sample data to be expressed also in this standard. Tables 2 and 3 illustrate this type of data organization to represent species sampling. Table 2 contains the sampling events, four in total. The eventId column contains an identifier for each event. The other columns describe the event date, latitude, and longitude, respectively. Table 3 contains counts of organisms for each event. The eventId column, describes which event in Table 1 the counts refer to. For example, the first two rows in the table refer to the event that has identifier 1, which is associated with a sampling performed on March 18, 2009.

Table 4: Sampling events represented with DwC. eventId eventDate decimalLatitude decimalLongitude 1 2009-03-18 -20.51 -38.07 2 2009-03-18 -20.57 -34.80 3 2011-02-11 -20.50 -25.35 4 2011-02-11 -20.47 -24.80

Biodiversity datasets formatted with DwC can be published in global-scale biodiversity databases such as the GBIF [85]. GBIF acts as a central registry and aggregator for datasets published by its national and organizational nodes using the Integrated Publishing Toolkit (IPT) [229]. The publishing workflow includes the following steps: (i) mapping the internal representation of biodiversity data to DwC and extracting it; (ii) adding EML metadata describing the biodiversity data; (iii) packaging both EML metadata and DwC- formatted data to a Darwin Core Archive (DwC-A); (iv) GBIF and national biodiversity aggregators, such as the Brazilian Biodiversity Information System (SiBBr) [105, 22], harvest DwC-A and ingest them into their databases. This process is illustrated in Figure 3.

12http://www.gbif.org/dataset/1edcfe6d-da55-4d59-b30e-468cd21f8b0b

14 Table 5: Species occurrences related to the events in Table 4.

eventId organismQuantity scientificName 1 2 Alectis ciliaris (Bloch, 1787) 1 5 Balistes vetula (Linnaeus, 1758) 2 1 Thunnus obesus (Lowe, 1839) ......

Figure 3: Data publication using IPT.

Many research groups have tabular data about biodiversity stored in various formats and do not have the resources to format them according to DwC. Different approaches in , coupled with distinct research traditions, both in their subdisciplines and in related fields, lead to the production of highly heterogeneous data. Such data can be, among others, counts of individuals, measures of environmental variables or repre- sentations of ecological processes. The terminologies used also vary according to the research line, as well as how to structure the data digitally [150]. The EML metadata standard was also adopted for describing the ecological datasets. The datasets themselves, due to the heterogeneity, are published in the original format, through spreadsheets or textual files with values separated by commas. In these cases, Metacat [30] can support data publication and preservation. It is responsible for receiving, storing and disseminating datasets of, for example, Long-Term Ecological Research (LTER) [184]. The Brazilian Marine Biodiversity Database (BaMBa) [178], for instance, was developed to store large datasets from integrated holistic studies, including physico-chemical, microbiological, benthic and fish parameters. BaMBa is linked to SiBBr and has instances of both IPT and Metacat, making it possible to publish data using the workflows previously described. The publication of data, and its consequent preservation, is a contribution to the scientific community as a whole. The data can be reused by other scientists who can explore them from other points of view. For the publisher of the data, benefits can also be observed. A recent study [220] shows that articles that provide the data used in their analyses in public repositories tend to have a larger number of citations. Data publication and preservation can also be helpful when biological collections are lost due to disasters. In September 2, 2018, there was a catastrophic fire in the Brazilian National Museum, destroying the vast majority of approx- imately 20 million items in its collections comprising areas such as archeology, anthropology, zoology, and botany. Many destroyed items belonged to biological collections, including one on invertebrates. Through data publication on GBIF13, the museum was able to preserve information about many specimens, 269,660 records were available in September 20, 2018, many of these containing images.

13https://www.gbif.org/publisher/4205110f-3f0f-40d8-bd0f-2fa71bc827b5

15 2.4.2 Other Standards

Access to Biological Collection Data (ABCD) provides a common definition for content data from living col- lections, natural history collections, and observation datasets. It also offers a detailed treatment of provider rights and copyright statements. In many cases, it defines elements for both highly atomized and less struc- tured data to encourage potential providers to take part in information networks even if their collection databases are less atomized or not normalized [116]. Currently, ABCD is used to publish data, for instance, in GBIF. Multi-media resources can provide reliable evidence for the occurrence of a taxon in a particular place and time, and there is a growing recognition that a biodiversity-related multimedia object could be used as a “primary biodiversity record” if the metadata associated with the object is available and has high quality. Therefore, Audubon Core [191] standard came to fill the gap of a standard related to multimedia resources, as digital or physical artifacts that normally comprise more than text. These include photographs, artwork, drawings, sound, video, animations, and presentation materials, as well as interactive online media such as species identification tools. The Audubon Core is a set of vocabularies designed to represent metadata for biodiversity multimedia resources and collections. These vocabularies aim to represent information that will help to determine whether a particular resource or collection will be fit for some particular biodiversity science application before acquiring the media. Among others, the vocabularies address such concerns as the management of the media and collections, descriptions of their content, their taxonomic, geographic, and temporal coverage, and the appropriate ways to retrieve, attribute and reproduce them [191]. Plinian Core [227, 203] is a set of vocabulary terms that can be used to describe all kind of properties related to taxa and in its actual version, Plinian Core incorporates a number of elements already defined within standards in use, such as EML and DwC.

2.5 Data Discovery and Integration

The search for data to perform biodiversity analysis and synthesis research is still a challenging task. The most recent developments have occurred with the emergence of databases that aggregate datasets at global and national scales such as GBIF [85], DataONE [182], SiBBr [105] and BaMBa [178]. The use of metadata and data publishing standards allows institutions to map the internal representations of this information to a format that is clearly specified and can be consumed and processed automatically by machines. Biodiver- sity information aggregation databases allow datasets to be geographically, taxonomically, and temporally searched. Languages and data analysis environments, such as R and Python, already have packages and libraries that are integrated with the repositories and aggregators of biodiversity data. rgbif14, for example, is a package for R that allows searching and retrieving records directly from GBIF, with pygbif15 being its analog for Python. Often scientists need to combine data from different sources into integrative research. For example, physico-chemical data can be combined with metagenomic data to try to establish correlations that explain some ecosystem processes and implications such effectiveness of marine protected areas [45], contribution of sea-mounts [177]. The activity of combining data from different sources is called data integration and is one of the most active areas of research on scientific data management [5, 186]. Existing biodiversity databases have advanced by establishing standards for metadata, such as EML [93], and for data such as DwC. However, these are limited to defining controlled vocabularies, consisting of standardized terms in each of the themes. A more sophisticated approach, involving not only the definition of terms but also the relationships between them and rules of inference, which are called ontologies, is the subject of the Semantic Web research area. Some initiatives in this direction in the area of biodiversity and ecology include ontologies such as the Environment Ontology (ENVO) and the Biodiversity Collections Ontology (BCO) [279]. Ontologies allow cross-referencing of different domains (Linked Data) and semantic queries, providing a data integration tool considerably more powerful than the current ones.

14https://cran.r-project.org/web/packages/rgbif/ 15https://recology.info/2015/11/pygbif/

16 3 Data Analysis and Synthesis

In this section, we present some examples where biodiversity data is analyzed along with the computational methods used. These main biological examples are related with ecological niche modeling, network science, biodiversity genomics, wildlife health monitoring. We also explore methods for interconnecting various computational tasks, i.e. biodiversity workflow management, and for keeping track of data derivation in these workflows with the purpose of enabling analysis and synthesis reproducibility.

3.1 Ecological Niche Modeling Ecological Niche Modeling (ENM) is used to predict the potential geographic distribution of a given species, based on environmental factors [216]. A niche-based model represents an approximation of the ecological (realized) niche of a species in the environmental dimensions analyzed [218, 216]. This model is made using a family of statistical tools to analyze the environmental information associated with the occurrence points (geographic coordinates), generating maps with an indication of geographic areas with the environmental suitability of the modeled species [88, 111]. Different studies include ENM with the objective of analyzing, for example, the potential distribution of invasive species (e.g. [214]), with indication of vulnerable areas, the distribution of species in scenarios of (e.g. [245, 268, 207, 14, 283, 15], dissemination of infectious diseases (e.g. [71]), and selecting conservation areas (e.g. [17, 90, 63, 205]). The concept of niche is defined, according to Chase and Leibold [61], as environmental conditions that meet the minimum requirements of a species so that its birth rate is higher than its mortality rate. There are three main factors that determine the niche of a species: abiotic (environmental) conditions, biotic conditions, such as species interactions, and dispersal capacity [249]. These are illustrated by the BAM diagram which depicts the biotic factors (B), the abiotic factors (A) and the mobility (M) [216]. The Ecological Niche Modeling involves three general steps: (1) Pre-processing, (2) Modeling and (3) Post-processing. In the pre-processing stage, after the acquisition of species occurrence records (biotic vari- ables), in databases such as SpeciesLink [50] and GBIF [86], the coordinates must be verified, to eliminate positional uncertainty and inconsistencies, such as inverted longitude and latitude. Abiotic variables can be downloaded, for example, in climatology databases such as Worldclim [136] and CHELSA [151]. The en- vironmental layers are downloaded and converted so they can be used as input to model algorithms along with occurrence points. These datasets are usually in raster format, that is, a grid of two-dimensional cells, where each cell has a value. The resolution of the dataset is given by the size of a cell and the smaller the cell, the higher the resolution. In this step, we also verify the sample bias, using a) spatial filters - to remove points very close geographically, in order to select points with a minimum geographical distance between them [193, 40, 273]. This procedure aims to reduce the spatial bias effects of points. And b) environmental filters - to remove points with low variation in environmental layers, avoiding environmen- tally close points [168, 273]. This procedure aims to reduce the environmental autocorrelation of the data, and the use of clean data generates models with greater predictive power [47, 157, 4]. The modeling step consists in the application of algorithms and the occurrence and biotic and abiotic data raised in the pre- processing phase, for the creation of the models. Various algorithms are used in ecological niche modeling, some based on machine learning or environmental envelopes. Machine learning algorithms include Maxent [218], Genetic Algorithm for Rule-Set Production (GARP) [260], both require only presence points. Other algorithms include Generalized Linear Models (GLM) [175], Generalized Additive Models (GAM) [126], and Random Forest [43]. Post-processing is the evaluation stage of the generated model. The validation of a model is made from the comparison of the generated results, against distribution data of the species not used in the modeling process. Among the post-processing procedures, analyses are performed to increase reliability, or to reduce the uncertainty of the models generated by different algorithms [108]. A consensus model is generated from means of combined projection techniques, where high suitability areas coincide in most of the models generated for a given species [13]. The calculation of the area under the curve (AUC) is the most used validation test [108], but a number of others are applied to evaluate the performance of the models, such as Kappa (Cohen’s Kappa Statistic), True Skill Statistics (TSS), Lowest Presence Threshold (LPT) [10, 87, 206, 166] and more complex strategies with a combination of maps, considering, for example,

17 geographic barriers, deforested areas [12, 165, 206], and multidimensional analyses [79]. A framework for scalable and reproducible ecological niche modeling, Model-R [232] was developed with the objective of unifying pre-existing ecological niche modeling tools in a web interface that automates processing steps and modeling performance. This tool includes packages related to data pre-processing like retrieve and clean- ing data (e.g. RJabot, a feature that allows you to search for and retrieve Jabot [242] occurrence data), multi-projection tools that can be applied to different temporal and spatial datasets and post-processing tools linked to the generated models. The Model-R has seven algorithms implemented and available for modeling: BIOCLIM, Mahalanobis distance, Maxent, GLM, RandomForest, SVM, and DOMAIN. The algorithms, as well as the entire modeling process, can be parameterized using command line tools or through the web interface. Figure 4 describes typical ENM steps.

Pre-processing

Occurrence data Abiotic data Abiotic data correlation selection and retrieval selection and retrieval analysis and filtering

Post-processing Modeling

Algorithm configuration and execution Model projection and evaluation

Figure 4: Typical ENM steps comprising (1) pre-processing (occurrence data selection and retrieval, abiotic data selection and retrieval, and abiotic data correlation analysis and filtering), (2) modeling (algorithm configuration and execution), and (3) post-processing (model projection and evaluation).

So far, ecological niche modeling has relied mostly on abiotic variables. A challenge is to incorporate biotic information as well, such as species interactions. According to [216], these are difficult to gather since they often require surveys at local scales. One experiment using biotic information was performed by [132] incorporating mutualism information to model four species, which improves the accuracy of prediction considerably.

18 3.2 Network Science Applied to Biodiversity

Network Science refers to a relatively new domain of scientific investigation, aiming to describe emergent properties and patterns from complex systems of interacting entities. Such relational systems are naturally represented as networks, in which interactions are represented as pairwise connections (links) between enti- ties (nodes) and assume particular semantics depending on the of the modeled phenomenon. The rise of this field is strongly associated with recent advances of information technology, which provided scientists with novel tools for collecting, storing, and processing data from many knowledge domains more efficiently and in larger scales. Although a variety of networked systems in many disciplines had been studied long before that, technological advances allowed us to model real-world systems in much more detail, from large volumes data that are often public or easily accessible to the investigator. Network science [21, 196] has been applied to model networked systems in a variety of knowledge domains, including the Internet, scien- tific collaboration networks, and ecological networks just to cite a few [6]. Given their relational structure, network models are formally represented as graphs. Network modeling has been widely adopted in the con- text of biodiversity research, especially for investigating ecological and evolutionary aspects of ecosystems and natural communities. Efforts toward this goal have led to the creation of the field of network ecology, which has undergone a noticeable growth over the last few years [41]. Network ecology has traditionally focused on describing general aspects of the entangled networks by which organisms interact. As ecological interactions are regarded as key processes modeling ecosystems functioning and structure, unraveling their architecture and dynamics is essential for understanding a variety of ecosystem features, such as stability and energy flow. Interaction networks can be broadly classified as food webs, host-parasitoid webs, or mutualistic webs [146], being food webs the first ones described in literature, since two classical papers by [164, 199]. Besides ecological interactions, network thinking has also been applied for modeling other aspects of natural systems. Patterns of movement can be investigated in a structured way, for instance, by means of movement networks [149]. These networks represent geographical space as a set of discrete and interconnected locations, forming a mesh of possible routes through which (or groups of animals) travel. Links between each pair of locations are weighted according to their geographical connectivity. Animal movement is thus regarded as dynamic processes composed of sequences of discrete movement steps running through the network structure. As the spatial feature is key in this type of network, they are also referred to as spatial networks [23]. Others have applied network science to investigate biogeographical patterns, such as species co-occurrence. The so-called co-occurrence networks model species associations in terms of their geographical distributions, such that species which are often observed occurring together in the same set of localities are considered to be strongly connected to each other. Similarly to other networked systems, co-occurrence networks are composed of a majority of the species holding co-occurrence links to very few others, whilst only a few species are connected to many others [16]. Co-occurrence network analysis has been used for many ap- plications in biodiversity studies, such as for selecting subsets of species to be used as surrogates for the characterization of biological communities [270]; for assessing the resilience of biotic communities towards climate change [16]; and for identifying modularity (clusters of overlapping species ranges) in biological communities from animal-location bipartite networks [267]. The social network analytics framework has also been applied in some biodiversity studies, though in most cases for modeling animal social behavior [92]. An alternative perspective is to look at communities of biodiversity data producers and consumers, in order to better understand the myriad of contexts in which data is collected, shared, and used. Mapping data flow within the community of biodiversity informatics initiatives, for instance, could help to prioritize and to improve the coordination of collaborative actions, leading to more effective biodiversity data-based policies [33]. Further, important scientific communities and gaps could be identified and characterized by exploring collaborative paper authoring networks and scientific topic networks [41]. Another interesting example of a collaboration network in biodiversity is given in [114], where a correspondence network of 19th-20th century botanists was structured from digitized data from the British Herbaria. Botanists composing this network corresponded with each other by exchanging specimens, a practice that has led to the formation of exchange clubs. Many aspects regarding the particular ways botanists used to work as well as the roles they assumed could be investigated with the aid of exchange

19 networks. Finally, a better understanding of the factors and processes influencing the composition of species occur- rence datasets would be invaluable for improving data usability, especially for species distribution modeling [76]. As biological collections are typically composed of an ensemble of opportunistic species occurrence records, each of which having been gathered in a particular context by a different collection team, their datasets do not necessarily reflect the biological diversity from the areas in which the collections are physi- cally located. Rather, they best reflect the interests of their most active and relevant collectors, i.e. those who have contributed to the collection to larger extents.

3.3 Biodiversity Genomics The DNA is the universal code of all organism in the Earth, present from simple species, such as bacteria and viruses, to more complex organisms, such as the vertebrates. The complete DNA sequence of an organism defines its . Due to the advances in new sequencing technologies, called Next Generation Sequenc- ing (NGS), unknown have been sequenced, assembled and deposited in public data repositories of molecular data. This data is growing fast because of the decreased cost of NGS and increased capacity of computational infrastructures [257, 159]. The advances in genomic information bring results in many appli- cation areas to society [161], including the production of valuable bioproducts at industrial scale (biofuels, bioenergy, cellulose fibers, gum chemicals, oils and resins), biomonitoring of species (for instance, viruses in epidemiological surveillance [142]), new drugs (vaccine design [128] and protein therapeutics [158]). Molecular approaches are becoming one the most relevant tools to support the taxonomist in species identification [130]. The community is looking for the genes or regions of the genome as DNA barcode candidates, such as COI (cytochrome C oxidase I), used to identify animals (mammals, insects, fishes); ITS (internal transcribed spacer) for fungi; ribosomal RNA (rRNA) – subunit 16S for identify bacteria; matK (maturase K) and rbcL (ribulose-bisphosphate carboxylase) to identify plants. BOLD (Barcode of Life Data System) [222] provides an integrated bioinformatics platform that assists in the acquisition, storage, analysis and publication of DNA barcode records. It is developed and hosted by the International Barcode of Life project (iBOL), one of the largest biodiversity genomics initiative ever executed. Hundreds of biodiversity scientists, bioinformaticians and technologists from 25 nations are working together to construct a richly parameterized DNA barcode reference library that will be the foundation for a DNA-based identification system for all multi-cellular life [202]. In general, the macroscopic species are identified and cataloged by morphological aspects and receive a voucher identifier which contains information about this species and metadata (geographical location, collector, collection date, etc.). The Global Genome Biodiversity Network (GGBN) [84, 83] data portal, for instance, stores information about vouchered collections of DNA or tissue samples. Museums and institutions are joining efforts to link their biological collection with genetic data as nucleotide sequences from particular genes. Genbank (from NCBI), DDBJ (DNA Databank of Japan), ENA (European Nucleotide Archive), which are part of the International Nucleotide Sequence Databases Collaboration (INSDC), are the most popular repositories for nucleotide sequences. Although the voucher tag is present in Genbank since 1998, it re- mained poorly used [237]. Recently, the BioCollections database from NCBI connected specimen vouchers to sequence records in GenBank [239]. Other interesting area comprised the recover of the DNA of the extinct species, denominated ancient DNA (or aDNA) which provides resources to understand the evolutionary process. The reconstruction of aDNA involves material derived from archaeological specimens, mummified tissues, preserved plant remains, and from other environments, such as permafrost and sediments [46]. Until now, most of the extinct species sequenced belong to mammalian and ancient humans [48]. The experimental difficulty and the challenges in this area are related to the quality of the material (which degrade with time) and contamination. Considering microscopic organisms, NGS opens a new world of capabilities reducing the need for cul- ture isolation of microscopic species such as bacteria, viruses, and Archaea. This field is known as metage- nomics, which is defined as the analysis of sequences taken from environmental samples, which are called metagenomes [286]. Sequencing these metagenomes produces fragments of sequences, i.e. sequence reads, of organisms that are present in the environmental samples, which may belong to multiple species, and

20 are considered extremely challenging to analyze from a computational perspective. These sequences are usually filtered to exclude those that belong to taxons that are not of interest. The resulting datasets are described using metadata standards, such as MIxS [289], for supporting data discovery and mining. Next, the overlapping sequence reads are used to obtain longer sequences called contigs in a process known as assembly. Metagenome assembly is often based on de Bruijn graphs [69]. Such graphs are built using se- quence reads as labels to edges that connect vertices labeled by the prefixes and suffixes of the respective sequence reads. The assembly problem is then reduced to the problem of finding a Eulerian cycle in a de Bruijn graph, i.e. a cycle that visits every graph edge exactly once. Implementations of assemblers that use de Bruijn graphs include SPAdes [20] and HipMer [107]. The later is a parallel implementation for high per- formance computing platforms using the Partitioned Global Adress Space [287, 277] programming model. There are several important biological discoveries based on complete and near complete genomes assembled from metagenomes. For instance, the possible ancestor of mitochondria and the possible ancestor of the first eukaryotic cells were proposed from reconstructed phylogenies from those genomes [173, 89, 292]. After assembly, the resulting sequences can be analyzed for identifying genes using either sequence alignment, for genes with homologs present in public databases, or through ab initio gene prediction using, for instance, hidden Markov models. Other types of analyses involving metagenomes include evaluation of species di- versity and functional annotation [286]. Various tools compose these different analyses into metagenomic workflows. MG-RAST [181, 284], for instance, is a web portal that provides metagenomic dataset analysis workflows containing activities such as quality control, similarity-based annotation, and functional and tax- onomic profiling. SUPER-FOCUS [241] also produces functional and taxonomic profiles from metagenomic datasets. However, its organism identification is based on alignment-free techniques used by the FOCUS [240] tool. Metagenomics can support various environmental studies such as the analysis of diseases. Garcia et al. [106] identified taxonomic groups that were more abundant in Mussismilia braziliensis affected by the white plague disease when compared to healthy corals of the same species. Integrating data from metagenomics with data from other aspects of biodiversity, such as species populations and environ- mental monitoring, is still a challenging task. More recently, there were efforts to integrate these standards [269]. Ongoing efforts for improving data integration in bioinformatics and biodiversity are using semantic web techniques [279]. These efforts are essential in supporting integrative ecosystem studies, such as [177], where different attributes of the ecosystem found in the mesophotic reefs of the Vitória-Trindade seamount chain were correlated to infer its properties. As the computational challenges in this area, we can point questions related to storage, recovery, and integration of the information; conceptual modeling, ontology and semantic representation of the molecular domain. Furthermore, there are usually multiple computational activities in bioinformatics analyses including filtering, normalization, and annotation. Efforts to ensure reproducibility [68] of these analyses involve (but are not limited to) task composition tools (scripts [18], pipelines, scientific workflows [163] and software containers [38]), web-based software platforms, such as Galaxy [25], commonly used applications, and source code available in repositories such as Github. We explore these issues in more detail in subsection 3.6.

3.4 Wildlife Health Monitoring A comprehensive approach for wildlife health monitoring involves many challenges that need to be addressed in order to result in a globally effective mechanism for diseases prevention, such as: the difficulty and limited access to wild, mostly uninhabited areas; how to overcome the high diversity and complexity of parasites, vectors, hosts and disease ecology; the methodology and infrastructure for properly collecting, storing and managing georeferenced high-quality data; how to integrate specialists from different areas to handle data, species and distinct socioenvironmental contexts; the research on knowledge extraction from data-driven models to understand, identify and predict risks to ultimately convey relevant information to society; and finally, how to sensitize decision-makers about the importance of monitoring as well as to engage the population as committed citizen scientists. The challenges are yet more acute in megadiverse countries which, in addition to biodiversity richness, usually also have to cope with vast territorial distances and sociocultural diversity. He et al. [129] present the eMammal framework for wildlife monitoring supported by citizen scientists.

21 Animal images collected with camera traps are sent to its database where visual animal recognition tech- niques are applied. The species identification recommendations generated are reviewed by citizen scientists and, subsequently, by experts. The resulting validated records are made available to wildlife and ecological researchers. eBird [263] also leverages the capability of citizen scientists to gather bird observation records. Automated data quality filters are used to support species identifications performed by citizen scientists. In Brazil, there is still the difficulty of building a mechanism that is not impaired by its large territorial exten- sion and its poorly integrated sectoral policies. The Brazilian Wildlife Health Information System (SISS-Geo) [57] is a platform for collaborative monitoring that intends to overcome the challenges in wildlife health. It aims integration and participation of various segments of society, encompassing: the registration of oc- currences by citizen scientists; the reliable diagnosis of pathogens from the laboratory and expert networks; and computational and mathematical challenges in analytical and predictive systems, knowledge extraction, data integration and visualization, and geographic information systems. It has been successfully applied to support decision-making on recent wildlife health events, such as a recent Yellow Fever epizooty [72, 188]. By automating the search for occurrence patterns, the information reaches more efficiently citizens na- tionwide, from the general population through experts, as well as provides the opportunity for the acquisition of knowledge about the possible patterns and parameters that contribute to the occurrence of diseases. In the medium- and long-term it also builds the capacity of researchers to develop complex modeling in the ecology of diseases that can possibly exploit geographic information in order to improve accuracy. Moreover, occurrence patterns yield data that can assist national policy on health and on biodiversity conservation. Machine learning has been used for image analysis in wildlife monitoring, such as for automated species classification. As mentioned earlier, in [129] the authors describe how species recognition is tackled within the eMammal cyber-infrastructure from camera-trap digital images. Once an animal (or group of) crosses the motion sensor and triggers the camera, the resulting sequence of captured images is processed in order to detect and segment the animal from the natural scene. Detection is the task of identifying the bounding box within the animals lie on the image, whereas segmentation is separating the animals (foreground) from the scene (background). Both tasks are challenging in this context—sometimes even for humans—as the animals are quite camouflaged by the heavy amount of natural elements in the wild. The approach developed to tackle this object-cutting problem [226] takes into consideration multiple image frames from the captured sequence: a standard background-foreground classifier is applied to each frame, however, the locally obtained information is fused across all frames in a collaborative manner; this process is repeated iteratively until the refinement converges. According to the authors, this novel technique led to an improvement in the average segmentation precision of near 15% over the state-of-the-art algorithm. After the background-foreground image segmentation, an even more challenging task takes place: the animal species recognition. Here, the segmented image patches of animals are fed into previously trained machine-learning models—built based on existing labeled images—in order to classify them by species. Since these patches usually contain some elements from the background scene, i.e. the segmentation is not perfect, the recognition model has to be able to cope with a good deal of noise; to make things harder, the model also needs to recognize animals at different poses. Which machine-learning algorithm is the most adequate to train such recognition models depends mainly on the amount of available training data [62]. In summary, conventional supervised classification algorithms are recommended for small training datasets whereas deep-learning algorithms best fit the case of an abundance of labeled data. In [62], using a training dataset of 14346 images of 20 animal species, a deep convolutional neural network (DCNN) achieved an accuracy of 38%, while a Bag-of-Words model (BoW) achieved 33%. Regardless of the learning algorithm, these species recognition results are still disappointing and of limited use. From an optimistic point of view, though, it is expected that DCNN will perform better with larger training datasets as this architecture is known for its high learning capacity [62]. For instance, in [112] the authors report an accuracy of 88.9% on a large dataset of near 1 million images containing 26 wild animal species. Moreover, there is evidence that increasing the deepness of the neural network architecture leads to higher performance even with no additional training data [112]. Automatically identifying animals species from images seems to be the current trend in wildlife moni- toring. Another great work in this vein was recently presented in [198], where the authors, motivated by the need to eliminate the burden of manual labeling by specialists and volunteers, proposed a deep neural network (DNN) to not only identify animals but also to count them and describe their behavior. As discussed

22 earlier, large datasets are required in order to harness the full potential of DNNs. By using the Snapshot Serengeti (SS) dataset which contains a total of 10.8 million classified images of 48 species to train the deep-learning model, an impressive accuracy of 93.8% was obtained. The task of recognition was divided into two stages: in the first stage, a model was trained exclusively to separate images containing at least one animal from images without animals. Then, the second DNN model was trained to take the resulting images with animals (only a quarter of the total) and perform the extraction of information, that is, identification of species, number of animals and their characteristics. Instead of resorting to different models for each task of the second stage, the authors opted for training a single model. The reasoning behind this choice is that (i) learning related tasks simultaneously is more efficient and (ii) a single model has advantages of a reduced amount of parameters and therefore total complexity. In the article, nine DNN architectures were tested, as well as the ensemble model consisting of all them. For the task of detecting images that contain animals, the best accuracy of 96.8% was achieved by the Very Deep Convolutional Network architecture (VGG) [244]. Re- garding the species identification, the ensemble of DNNs obtained the highest score, 94.9%, followed by the Deep Residual Learning architecture (ResNet) [127], with 93.8%. Interestingly, when considered the relaxed definition of accuracy as being the correct answer in the top-5 guesses by the DNN model, which would also greatly save human labor in manual labeling, these scores further improve to 99.1% and 98.8%, respectively. Counting the number of animals in the image showed to be a hard task for DNNs. Again, the ensemble of DNNs model got the best score, correctly predicting the number of animals 63.1% of the time. If allowed to approximately count the number of animals up to an error of ±1, the accuracy climbs to 84.7%. In this task, the best individual architecture was again ResNet, achieving a score of 62.8% and 83.6% for the exact and approximate accuracies, respectively. Finally, the task of describing the characteristics of the animals aimed to detect the following attributes: standing, resting, moving, eating, and whether young animals are present. Note that this is a non-exclusive multilabel classification task, since one or more animals may exhibit multiple attributes. In this task, the ensemble of DNNs model obtained 76.2% accuracy, and the second best, ResNet, got 75.6%. All these enthusiastic results put Deep Learning Networks as the current state-of-the-art when it comes to pattern recognition in wildlife monitoring. This translates to a significant reduction of human labor and also opens a lot of new applications in wildlife monitoring, especially for those small projects that cannot recourse to volunteers and experts for labeling. We are approaching the stage where machine-learning models are as good as humans in species identification, and they can even excel humans in that task, eventually.

3.5 Other Biodiversity Data Analysis Applications

The difficulty in the extraction of knowledge from large databases [123], such as GBIF, demands other meth- ods to access and manage biological data [145]. As biodiversity databases have observed a substantial in- crease in data, knowledge extraction from them has become a challenge [82]. In this sense, Hochachka [139] argues that for the development of ecological analyses where there is little prior knowledge and hypotheses are not clearly developed, exploratory analyses with data mining techniques [162] are more appropriate than the confirmatory analyses, that is designed to test hypotheses or estimate model parameters. In ecology, some research using data mining was conducted by Spehn and Korner [253]. Pino-Mejiás et al. [219] used classification algorithms for predicting the potential habitat of species; a decision tree algorithm was also used for forest growing stock modeling [77]; Kumar et al. [155] used cluster analysis to identify regions with similar ecological conditions, Flügge et al. [96] used multivariate spatial associations for grouping species into disjunct sets with similar co-association values. One of the many possibilities of using data mining was investigated by Silva [241], which developed a methodology to allow for the application of association anal- ysis for extracting patterns of co-occurrence from a dataset from the 50ha Forest Dynamics Project on Barro Colorado Island, finding patterns of positive and negative correlation. To do this, association analysis was applied with the Apriori algorithm [3]. Ciarlegio et al. [66] proposed ConsNet, a software for designing conservation area networks using tabu search [109] with multi-criteria objectives.

23 3.6 Biodiversity Workflow Management and Reproducibility Scientific data is being produced at an exponential growth rate by increasingly available scientific sensors. This, coupled with sophisticated computational models that process this data, has demanded new techniques [133] for managing computational scientific experiments in a scalable and reproducible way. Wilson et al. [285] propose best practices for managing scientific computations. These include: recording datasets, programs, libraries, and parameters used, including their respective identifiers or versions, to enable better reproducibility; and using high-level languages for programming and moving to lower-level languages only when performance improvement is necessary. These experiments are often specified as scientific workflows [78, 238, 163], which are given by a composition of computational tasks that exchange data through pro- duction and consumption relationships. A scientific workflow management system (SWMS) provides features such as fault-tolerance, scalable execution, scalable data management, data dependency tracking, and prove- nance recording, that greatly reduce the complexity of managing the life-cycle of these experiments [174]. Scientific workflows are often provided through research data portals, Chard et al. [60] present a design pattern for such portals for data-intensive scientific problems. Reproducibility [209] is an essential property in science. In computational research, it can be a challenging task since one might need vasts amounts of data or supercomputing resources to reproduce a result. Sandve et al. [233] propose rules that can be fol- lowed to better support reproducibility, including recording the steps that were executed to obtain a result, archiving programs that were used in a computational experiment, and versioning the scripts and workflows used. Meng et al. [179] propose a framework that tackles reproducibility by providing features for sand- boxing and preserving computational environments. A combination of containers [38, 121] and tools for intercepting system calls is used in order to achieve preservation. Many computational experiments have a detailed record of their execution, such as the datasets used and computational tasks used. enable easier verification of results. These records describe the [101, 52] of the computational experiment. It can support reproducibility and validation of e-Science experiments. Miles et al. [185], for instance, propose an architecture for validation if e-Science experiments based on both provenance assertions and ontologies. DataONE, for instance, included support for tracking the provenance of their datasets [51]. Biodiversity follows the same trend of rapidly increasing production of data found in other areas of sci- ence. Currently, biodiversity data is being integrated at a global scale through initiatives such as GBIF [85]. Techniques for analysis and synthesis of biodiversity data, such as ENM [216, 88], are widely used. These analyses typically employ several different applications executed in a loosely coupled manner, being a typical use case for scientific workflow management tools [167]. Next, we list some works related to scientific work- flows and reproducibility in biodiversity. Pennington et al. [210] describe the implementation of species dis- tribution modeling (SDM) scientific workflows using Kepler [170]. Their approach allows for easy manage- ment of structural aspects of the scientific workflow, such as easily replacing application components. They also developed application components for data transformation and pre-processing, geospatial processing, and semantic annotation of processes. These experiments use occurrence data from the Mammal Networked Information System (MaNIS)16 and future climatological scenarios from IPCC to predict the climate-change impact on more than 2000 species. Morisette et al. [189] present the Software for Assisted Habitat Modeling (SAHM) that allows for managing the various steps of SDM, including pre- and post-processing activities. The implementation is coupled to the Vistrails [102] scientific workflow management system, which sup- ports provenance management. Talbert et al. [264] also describe SAHM and analyze the data management challenges of SDM using scientific workflows. Amaral et. al [11] present the EUBrazilOpenBio Hybrid Data Infrastructure which implements cloud services for the biodiversity domain such as taxonomic mapping and resolution and SDM. Scientific workflows are supported both with DAGMan [73] and EasyGrid AMS [37]. They evaluate the execution of SDM on cloud computing resources showing good performance. Candela et al. [49] give a detailed description of an integrated cloud-based environment for SDM of the of EUBrazilOpen- Bio Hybrid Data Infrastructure which includes components for retrieving species occurrences, environmental layers, and execution of various models for predicting species distributions. Some SDM applications and workflows are available through web portals, such as the Biodiversity Virtual e-Laboratory (BioVel) [125]. BioVel [125] offers a web-based environment for managing scientific workflows for biodiversity. Various

16http://manisnet.org

24 pre-defined activities are available in its interface: geographic and temporal selection of occurrences (BioS- TIF), data cleaning, taxonomic name resolution, ecological niche modeling algorithms, population modeling, ecosystem modeling, and metagenomics and applications [275]. iPlant [110] is a computational research infrastructure, or , for plant science. Its ap- plications include the Tree of Life to produce phylogenetic trees of all green plant species; and Genotype to Phenotype to predict plant phenotypes from their genetic data. Kurator [81] is a software package for the Kepler [170] scientific workflow management system that supports composing various data curation activi- ties into scientific workflows. Pre-built activities include georeferencing, scientific name, and flowering time validators. Provenance is recorded to document all the transformations activities that data went through caused by the various data curation activities. Nguyen et al. [197] developed scientific workflows for assess- ing ecosystem risk based on IUCN guidelines [152] that use five rule-based criteria to assign one of eight risk categories that range from least concern (LC) to collapsed (CO). The assessment is performed in two phases. First, a stochastic ecosystem model is executed for the Meso-American Reef Ecosystem risk assessment by predicting future reef properties under diverse scenarios. This step was implemented both in Nimrod/G [2] and Spark [291], for comparative purposes. The Spark version had a considerably better performance in terms of computing time. Next, a workflow was implemented in the Kepler scientific workflow management system [170] to execute the IUCN ecosystem risk assessment methodology using the results of the stochastic ecosystem model execution and applying its five rule-based criteria. Reproducibility is an important property in this process since risk assessment is often re-executed and its results need to be discussed by experts and decision makers [119]. Cohen-Boulakia et al. [68] explore the use of scientific workflows for reproducibility of computational experiments in the life sciences. They analyze scientific workflows techniques and systems and evaluate to what extent they support reproducibility requirements in life science applications. Plant phenotyping, which evaluates how plants respond to different environmental conditions by monitoring their traits, was one of the use cases. For instance, keeping track of several different tool versions used in a workflow and their respective compatibility is one of its reproducibility requirements. They define different levels of re- producibility in the workflow context. Considering two scientific workflows A and B and assuming A has already been executed. When B is executed: repeatability is achieved when B contains exactly the same com- ponents of A; replicability is obtained when B uses similar [255] input components of A and both executions reach the same conclusion; reproducibility happens when both executions lead to the same scientific conclu- sion; reusability is observed when the specification B contains the specification of A. The authors analyze three workflow aspects from the reproducibility perspective: workflow specification, workflow execution, and workflow context and runtime environment. Workflow specifications can support better reusability through common specification languages, such as CWL17 (Common Workflow Language), and annotations. Assessing workflows similarity is critical for reuse but progress is still needed in solving this problem. Recording and analyzing workflow execution details can be supported by provenance information. While most systems sup- port the PROV [187] standard, visualizing and analyzing large provenance datasets is still challenging. Also, preserving the runtime environment is still a challenge that is being addressed with virtualization technolo- gies [121]. WholeTale [44], for instance, is a computational environment that has reproducibility features. It has components for data collection, identity management, data publication, and interfaces to analytical tools, called frontends. These frontends will manipulate data and can be given, for instance, by interactive notebooks such as Jupyter18. The system is integrated with DataONE [182], users can search and retrieve datasets from it. Frontends are packaged as Docker containers [38] that can be executed on high perfor- mance computing resources. The interaction between the datasets and analytical tools is documented and recorded in a metadata management system. This allows for reproducing the entire computational research performed, from data retrieval to data analysis and its outputs, including the computational environments used.

17http://www.commonwl.org/ 18http://jupyter.org

25 4 e-Biodiversity Challenges and Concluding Remarks

The acceleration of global changes requires a constant assessment of their impacts on biodiversity and, con- sequently, on ecosystem services that are essential to human survival. Some areas of the globe, e.g. the south Atlantic , remain highly understudied and therefore their biodiversity underestimated. A better under- standing of marine biodiversity could be achieved with help of e-Biodiversity to leverage surveys to uncover novel species and systems, such as the Great Amazon Reef [99]. To address this problem, biodiversity data must be systematically collected and analyzed. In this context, e-Biodiversity, or Biodiversity Informatics, is an essential collection of methodologies, tools, and techniques to achieve this goal. The EBVs [211] were proposed as a set of indicators that would allow for systematic monitoring of biodiversity. However, the pro- duction of these indicators is still a challenge [215], in particular regarding the existence of information gaps that can prevent global-scale inferences on the state of biodiversity. These inferences provide essential input to decision-makers in devising governmental policies toward meeting global targets on biodiversity conserva- tion, such as the Aichi Biodiversity Targets. In this section, we describe existing challenges for e-Biodiversity to become a systematic and global-scale tool for monitoring and making inferences about biodiversity. As described in subsection 2.2, the availability of detailed information about most organisms is still very scarce [213]. This hinders the usage of this data in many biodiversity data analysis applications, such as ecological niche modeling. Furthermore, we described other issues with biodiversity data, such as biases and frequent taxonomic and georeferencing errors. Therefore, one of the challenges of e-Biodiversity is not only to increase the amount of available data, filling some of the existing gaps but also to reduce its bias and improve its quality. Some promising work addressing these issues are listed next: • Heidorn [131] observes that data from smaller scientific projects is rarely available to other scientists even though its aggregated size and value for research is considerable. This phenomenon is denom- inated the long tail of science. In biodiversity and ecology, some progress has been achieved through projects such as GBIF and DataONE, that receive a considerable amount of their datasets from small re- search groups. One issue with making these datasets available is the effort required to map the concepts present in them to standard vocabulary terms used in major biodiversity databases. Entity resolution techniques [153] have the potential to assist and speed up these record linkage routines. • Data collection could be substantially intensified by applying artificial intelligence methods for au- tomating specimen identification, some preliminary work in this direction include the application of deep learning techniques for species identification in herbarium sheets [54, 55]. • Remote sensing provides the opportunity to regularly observe the Earth and, therefore, could benefit biodiversity monitoring by increasing the amount of data collected. It can also be a valuable tool to observe areas that are difficult to access through field expeditions. One of the pioneering works in this area was proposed by Holden and Ledrew [140] by using hyperspectral remote sensing to monitor coral reefs. Clark et al. [67] identified tree species using remote sensing images. Fretwell et al. [103] were able to use satellite-based remote sensing to survey the Emperor Penguin on a global scale. • As observed in section 2.2, assessing the quality of a dataset is a critical step for any subsequent anal- ysis and synthesis activity that might use it. Users should establish the intended use of datasets in their research. Determining if a dataset is fit for a particular use is still a challenge in e-Biodiversity since records available in public databases contain various types of errors. A promising approach was proposed by Veiga et al. [274] comprised of a framework for biodiversity data quality assessment and management that allows for users to define their data quality requirements and when a particular dataset is fit-for-use in a standardized manner. Morris et al. [190] made some progress by imple- menting a library of small data quality assessment routines that can be composed into more complex workflows, to report data quality in terms of the framework proposed by Veiga et al. [274]. In the Global Biodiversity Information Outlook [138] report, produced by leading biodiversity informatics researchers, a number of areas of biodiversity informatics of limited or minimal progress were identified and can be considered research challenges. Biological system modeling was considered an area of research in

26 biodiversity informatics with minimal progress. Advances in this area could be comprised of computational, or in-silico models or simulations ranging from single organisms to entire ecosystems. Current temporal and spatial modeling in biodiversity, such as ecological niche modeling, described in subsection 3.1, rely on species occurrence data. More fine-grained modeling would require incorporating species trait data. Cardinale et al. [53], for instance, advocated the development of new predictive models that take into account species interactions to predict the impact of biodiversity on ecosystem processes based on species traits. Areas of limited progress identified by the Global Biodiversity Information Outlook included:

• Automated remote-sensed observation has the potential to enable observation of biodiversity in large and remote areas. More recently, preliminary work was conducted on defining biodiversity indicators that could be derived from images collected by satellite remote sensing [217] and processed using statistical analysis and classification algorithms.

• Identifying trends and making predictions about biodiversity could determine future trends in biodi- versity under different global change scenarios. Ongoing research in this area includes predicting how climate change will affect species distributions [245, 268, 207, 14, 283, 15] and zoonotic diseases [91].

• Providing access to aggregate species trait data, consisting of data on species characteristics and their in- teractions. Data is incomplete, there is not much data about relative abundances of species, their traits, and on how they interact. This information is needed for creating better models to study ecosystem pro- cesses. This type of data can enable more complex biological systems modeling, such as evolutionary inference. MorphoBank [201], for instance, allows for scientists to upload images with the morphology of organisms with associated data.

Scientists often need to combine multiple sources of data [150, 104] in biodiversity analysis and synthesis activities. Although there are many gaps in biodiversity data, such as the reduced availability of species trait data, there are many machine-readable and freely available, i.e. open, datasets [225] from areas such as re- mote sensing, socioeconomics, and climatology, that can be integrated into biodiversity studies. Open data is widely available online, including data provided by many governments. However, it is highly heterogeneous, dispersed in multiple sources, and may not provide metadata or schema. Metadata, described in section 2.3, is helpful in discovering datasets and in integrating them when dataset attribute definitions are provided, as it is possible with EML [93]. Semantic web [279], as described in section 2.5, can also support data integration through the use of various existing ontologies for biodiversity and other domains. However, their increased usefulness depends on the widespread adoption of ontologies and metadata standards by data providers, a process that is still underway. A promising approach to overcome these limitations has been to use machine learning techniques to support open data integration activities [186, 80], such as entity matching [192, 194]. These recently proposed techniques could be leveraged and extended for integrating biodiversity and other related datasets. Trends, indicators, and facts derived from biodiversity data analysis and synthesis activities might be used for guiding governmental decision making in critical areas such as conservation area planning, impact assessment of large construction work projects, and zoonotic disease prevention. Since such decisions can have large-scale impacts in society, being able to trace back the processes involved in reaching them is an essential property. Therefore, it is important to use methodologies and techniques that are reproducible [209, 233, 68, 148] when executing these activities. Some initial advances were achieved in projects such as DataONE [51] and WholeTale [44] by recording the provenance of biodiversity datasets and of their analysis and synthesis. However, reproducibility in computational science, in general, is still a challenge. Reproducibility tool usability is still considered low [100] making it difficult or time-consuming for scientists to keep track of the data derivation steps in their analyses. Another challenge in reproducing a scientific workflow is finding similar scientific workflows and verifying if they reach similar conclusions [68]. Starlinger et al. [256] advanced in addressing this problem by proposing a layer decomposition of scientific workflows into an ordered representation of its activities suitable for comparison. Providing reproducible frameworks for biodiversity analysis and synthesis activities would enable better decision traceability and validation of trends and indicators produced by them.

27 Acknowledgements

The work is partially supported by CAPES, CNPq, and FAPERJ.

References

[1] Lois A. Abbott, Frank A. Bisby, and David J. Rogers. Taxonomic Analysis in Biology: Computers, Models, and Databases. Columbia University Press, 1985. [2] D. Abramson, J. Giddy, and L. Kotler. High performance parametric modeling with Nimrod/G: killer application for the global grid? In Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000, pages 520–528. IEEE Comput. Soc, 2000. [3] Rakesh Agarwal and Ramakrishnan Srikant. Fast algorithms for mining association rules. In Proc. of the 20th VLDB Conference, pages 487–499, 1994. [4] Matthew E. Aiello-Lammens, Robert A. Boria, Aleksandar Radosavljevic, Bruno Vilela, and Robert P. Anderson. spThin: an R package for spatial thinning of species occurrence records for use in ecological niche models. Ecography, 38(5):541–545, 5 2015. [5] Anastasia Ailamaki, Verena Kantere, and Debabrata Dash. Managing scientific data. Communications of the ACM, 53(6):68, 6 2010. [6] Réka Albert and Albert-László Barabási. Statistical mechanics of complex networks. Reviews of Modern Physics, 74(1):47–97, 1 2002. [7] R. Allkin, F. A. Bisby, and Systematics Association. Databases in systematics. Academic Press, 1984. [8] R. Allkin and R. J. White. Data management models for biological classification. In H. H. Bock, editor, Classification and related methods of data analysis, pages 653–660. Elsevier Science Publishers B.V., 1988. [9] Robert Allkin, Richard J. White, and Peter J. Winfield. Handling the taxonomic structure of biological data. Mathematical and Computer Modelling, 16(6-7):1–9, 6 1992. [10] Omri Allouche, Asaf Tsoar, and Ronen Kadmon. Assessing the accuracy of species distribution models: prevalence, kappa and the true skill statistic (TSS). Journal of Applied Ecology, 43(6):1223–1232, 9 2006. [11] Rafael Amaral, Rosa M. Badia, Ignacio Blanquer, Ricardo Braga-Neto, Leonardo Candela, Donatella Castelli, Christina Flann, Re- nato De Giovanni, William A. Gray, Andrew Jones, Daniele Lezzi, Pasquale Pagano, Vanderlei Perez-Canhos, Francisco Quevedo, Roger Rafanell, Vinod Rebello, Mariane S. Sousa-Baena, and Erik Torres. Supporting biodiversity studies with the EUBrazilOpen- Bio Hybrid Data Infrastructure. Concurrency and Computation: Practice and Experience, 27(2):376–394, 2 2015. [12] Robert P Anderson, Daniel Lew, and A.Townsend Peterson. Evaluating predictive models of species’ distributions: criteria for selecting optimal models. Ecological Modelling, 162(3):211–232, 4 2003. [13] M. Araujo and M New. Ensemble forecasting of species distributions. Trends in Ecology & Evolution, 22(1):42–47, 1 2007. [14] M.B. Araújo, D. Nogués-Bravo, I. Reginster, M. Rounsevell, and R.J. Whittaker. Exposure of European biodiversity to changes in human-induced pressures. Environmental Science & Policy, 11(1):38–45, 2 2008. [15] Miguel B. Araújo and A. Townsend Peterson. Uses and misuses of bioclimatic envelope modeling. Ecology, 93(7):1527–1539, 7 2012. [16] Miguel B. Araújo, Alejandro Rozenfeld, Carsten Rahbek, and Pablo A. Marquet. Using species co-occurrence networks to assess the impacts of climate change. Ecography, 34(6):897–908, 12 2011. [17] Miguel B Araújo and Paul H Williams. Selecting areas for species persistence using occurrence data. Biological Conservation, 96(3):331–345, 12 2000. [18] Yadu Babuji, Kyle Chard, Ian Foster, Daniel S. Katz, Michael Wilde, Anna Woodard, and Justin Wozniak. Parsl : Scalable Parallel Scripting in Python. In 10th International Workshop on Science Gateways (IWSG 2018), 2018. [19] Andrew Balmford, Leon Bennun, Ben Ten Brink, David Cooper, Isabelle M Côte, Peter Crane, Andrew Dobson, Nigel Dudley, Ian Dutton, Rhys E Green, Richard D Gregory, Jeremy Harrison, Elizabeth T Kennedy, Claire Kremen, Nigel Leader-Williams, Thomas E Lovejoy, Georgina Mace, Robert May, Phillipe Mayaux, Paul Morling, Joanna Phillips, Kent Redford, Taylor H Ricketts, Jon Paul Rodríguez, M Sanjayan, Peter J Schei, Albert S van Jaarsveld, and Bruno A Walther. Ecology: The Convention on Biological Diversity’s 2010 target. Science (New York, N.Y.), 307(5707):212–3, 1 2005. [20] Anton Bankevich, Sergey Nurk, Dmitry Antipov, Alexey A. Gurevich, Mikhail Dvorkin, Alexander S. Kulikov, Valery M. Lesin, Sergey I. Nikolenko, Son Pham, Andrey D. Prjibelski, Alexey V. Pyshkin, Alexander V. Sirotkin, Nikolay Vyahhi, Glenn Tesler, Max A. Alekseyev, and Pavel A. Pevzner. SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Se- quencing. Journal of , 19(5):455–477, 5 2012. [21] Albert-László Barabási. Network Science. Cambridge University Press, 2016. [22] Clara Baringo Fonseca, Luiza Correa, Nayara Soto, and Rafael Sacramento. SiBBr: Envisioning the spatial distribution of Brazilian biodiversity records. Proceedings of TDWG, 1:e19966, 8 2017.

28 [23] Jordi Bascompte. Networks in ecology. Basic and Applied Ecology, 8(6):485–490, 11 2007. [24] J. B. Beach, S. Pramanik, and J. H. Beaman. Hierarchic Taxonomic Databases. In R. Fortuner, editor, Advances in Computer Methods for Systematic Biology, pages 241–256. Johns Hopkins University Press, 1993. [25] Oscar C Bedoya-Reina, Aakrosh Ratan, Richard Burhans, Hie Lim Kim, Belinda Giardine, Cathy Riemer, Qunhua Li, Thomas L Olson, Thomas P Loughran, Bridgett M Vonholdt, George H Perry, Stephan C Schuster, and Webb Miller. Galaxy tools to study genome diversity. GigaScience, 2(1):17, 1 2013. [26] Lee Belbin and Kristen J Williams. Towards a national bio-environmental data facility: experiences from the Atlas of Living Australia. International Journal of Geographical Information Science, 30(1):108–125, 1 2016. [27] Dennis A Benson, Mark Cavanaugh, Karen Clark, Ilene Karsch-Mizrachi, David J Lipman, James Ostell, and Eric W Sayers. GenBank. Nucleic acids research, 41(Database issue):36–42, 1 2013. [28] Walter G. Berendsohn. The Concept of "Potential Taxa" in Databases. Taxon, 44(2):207–212, 1995. [29] Walter G. Berendsohn. A Taxonomic Information Model for Botanical Databases: The IOPI Model. Taxon, 46(2):283–309, 1997. [30] C. Berkley, M. Jones, J. Bojilova, and D. Higgins. Metacat: a schema-independent XML database system. In Proceedings Thirteenth International Conference on Scientific and Statistical Database Management. SSDBM 2001, pages 171–179. IEEE Comput. Soc, 2001. [31] R. E. Beschel and J. H. Soper. The automation and standardization of certain herbarium procedures. Canadian Journal of Botany, 48(3):547–554, 3 1970. [32] Botanic Garden BGBM and Botanical Museum Berlin-Dahlem. "Biodiversity Informatics", The Term. http://www.bgbm.org/biodivinf/TheTerm.htm, 2010. [33] Heather Bingham, Michel Doudin, Lauren Weatherdon, Katherine Despot-Belmonte, Florian Wetzel, Quentin Groom, Edward Lewis, Eugenie Regan, Ward Appeltans, Anton Güntsch, Patricia Mergen, Donat Agosti, Lyubomir Penev, Anke Hoffmann, Hannu Saarenmaa, Gary Geller, Kidong Kim, HyeJin Kim, Anne-Sophie Archambeau, Christoph Häuser, Dirk Schmeller, Ilse Geijzendorf- fer, Antonio García Camacho, Carlos Guerra, Tim Robertson, Veljo Runnel, Nils Valland, and Corinne Martin. The Biodiversity Informatics Landscape: Elements, Connections and Opportunities. Research Ideas and Outcomes, 3:e14059, 6 2017. [34] F. A. Bisby. The Quiet Revolution: Biodiversity Informatics and the Internet. Science, 289(5488):2309–2312, 9 2000. [35] F. A. Bisby, G. F. Russell, and R. J. Pankhurst. Designs for a global plant species information system. Published for the Systematics Association by Clarendon Press, 1993. [36] Frank A. Bisby, D. A. Sutton, and G. F. Russell. Report of the Third Meeting of the Taxonomic Databases Working Group, Edinburgh, 19-21 October 1987. Huntia, 8(2):93–108, 1992. [37] Cristina Boeres and Vinod E. F. Rebello. EasyGrid: towards a framework for the automatic Grid enabling of legacy MPI applica- tions. Concurrency and Computation: Practice and Experience, 16(5):425–432, 4 2004. [38] Carl Boettiger. An introduction to Docker for reproducible research. ACM SIGOPS Operating Systems Review, 49(1):71–79, 1 2015. [39] Pierre Bonnet, Hervé Goëau, Siang Thye Hang, Mario Lasseck, Milan Šulc, Valéry Malécot, Philippe Jauzein, Jean-Claude Melet, Christian You, and Alexis Joly. Plant Identification: Experts vs. Machines in the Era of Deep Learning. In Multimedia Tools and Applications for Environmental & Biodiversity Informatics, pages 131–149. Springer International Publishing, Cham, 2018. [40] Robert A. Boria, Link E. Olson, Steven M. Goodman, and Robert P. Anderson. Spatial filtering to reduce sampling bias can improve the performance of ecological niche models. Ecological Modelling, 275:73–77, 3 2014. [41] Stuart R. Borrett, James Moody, and Achim Edelmann. The rise of Network Ecology: Maps of the topic diversity and scientific collaboration. Ecological Modelling, 293:111–127, 12 2014. [42] Brad Boyle, Nicole Hopkins, Zhenyuan Lu, Juan Antonio Raygoza Garay, Dmitry Mozzherin, Tony Rees, Naim Matasci, Martha L Narro, William H Piel, Sheldon J McKay, Sonya Lowry, Chris Freeland, Robert K Peet, and Brian J Enquist. The taxonomic name resolution service: an online tool for automated standardization of plant names. BMC bioinformatics, 14(1):16, 1 2013. [43] Leo Breiman. Random Forests. Machine Learning, 45(1):5–32, 2001. [44] Adam Brinckman, Kyle Chard, Niall Gaffney, Mihael Hategan, Matthew B. Jones, Kacper , Sivakumar Kulasekaran, Bertram Ludäscher, Bryce D. Mecum, Jarek Nabrzyski, Victoria Stodden, Ian J. Taylor, Matthew J. Turk, and Kandace Turner. Computing environments for reproducibility: Capturing the “Whole Tale”. Future Generation Computer Systems, 2 2018. [45] Thiago Bruce, Pedro M Meirelles, Gizele Garcia, Rodolfo Paranhos, Carlos E Rezende, Rodrigo L de Moura, Ronaldo-Francini Filho, Ericka O C Coni, Ana Tereza Vasconcelos, Gilberto Amado Filho, Mark Hatay, Robert Schmieder, Robert Edwards, Elizabeth Dinsdale, and Fabiano L Thompson. Abrolhos Bank Reef Health Evaluated by Means of Water Quality, Microbial Diversity, Benthic Cover, and Fish Biomass Data. PLoS ONE, 7(6):e36687, 2012. [46] Andrew S. Burrell, Todd R. Disotell, and Christina M. Bergey. The use of museum specimens with high-throughput DNA se- quencers. Journal of Human Evolution, 79:35–44, 2 2015. [47] Justin M. Calabrese, Grégoire Certain, Casper Kraan, and Carsten F. Dormann. Stacking species distribution models and adjusting bias by linking them to macroecological models. Global Ecology and Biogeography, 23(1):99–112, 1 2014. [48] Kevin L. Campbell and Michael Hofreiter. New Life for Ancient DNA. Scientific American, 307(2):46–51, 7 2012.

29 [49] Leonardo Candela, Donatella Castelli, Gianpaolo Coro, Pasquale Pagano, and Fabio Sinibaldi. Species distribution modeling in the cloud. Concurrency and Computation: Practice and Experience, 28(4):1056–1079, 7 2016. [50] Dora A. L. Canhos, Mariane S. Sousa-Baena, Sidnei de Souza, Leonor C. Maia, João R. Stehmann, Vanderlei P. Canhos, Renato De Giovanni, Maria B. M. Bonacelli, Wouter Los, and A. Townsend Peterson. The Importance of Biodiversity E-infrastructures for Megadiverse Countries. PLOS Biology, 13(7):e1002204, 7 2015. [51] Yang Cao, Christopher Jones, Víctor Cuevas-Vicenttín, Matthew B. Jones, Bertram Ludäscher, Timothy McPhillips, Paolo Missier, Christopher Schwalm, Peter Slaughter, Dave Vieglais, Lauren Walker, and Yaxing Wei. DataONE: A Data Federation with Prove- nance Support. In Provenance and Annotation of Data and Processes. IPAW 2016. Lecture Notes in Computer Science, vol 9672, pages 230–234. Springer, 2016. [52] Lucian Carata, Sherif Akoush, Nikilesh Balakrishnan, Thomas Bytheway, Ripduman Sohan, Margo Selter, and Andy Hopper. A primer on provenance. Communications of the ACM, 57(5):52–60, 5 2014. [53] Bradley J Cardinale, J Emmett Duffy, Andrew Gonzalez, David U Hooper, Charles Perrings, Patrick Venail, Anita Narwani, Georgina M Mace, David Tilman, David A Wardle, Ann P Kinzig, Gretchen C Daily, Michel Loreau, James B Grace, Anne Lari- gauderie, Diane S Srivastava, and Shahid Naeem. Biodiversity loss and its impact on humanity. Nature, 486(7401):59–67, 6 2012. [54] Jose Carranza-Rojas, Herve Goeau, Pierre Bonnet, Erick Mata-Montero, and Alexis Joly. Going deeper in the automated identifi- cation of Herbarium specimens. BMC Evolutionary Biology, 17(1):181, 12 2017. [55] Jose Carranza-Rojas, Alexis Joly, Hervé Goëau, Erick Mata-Montero, and Pierre Bonnet. Automated Identification of Herbarium Specimens at Different Taxonomic Levels. In Multimedia Tools and Applications for Environmental & Biodiversity Informatics, pages 151–167. Springer International Publishing, Cham, 2018. [56] Convention on Biological Diversity CBD. Text of the Convention. https://www.cbd.int/convention/text/default.shtml, 1992. [57] Marcia Chame, Helio J. C. Barbosa, Luiz M. R. Gadelha, Douglas A. Augusto, Eduardo Krempser, and Livia Abdalla. SISS-Geo: Leveraging Citizen Science to Monitor Wildlife Health Risks in Brazil. Technical report, 3 2018. [58] F S Chapin, E S Zavaleta, V T Eviner, R L Naylor, P M Vitousek, H L Reynolds, D U Hooper, S Lavorel, O E Sala, S E Hobbie, M C Mack, and S Díaz. Consequences of changing biodiversity. Nature, 405(6783):234–42, 5 2000. [59] Arthur D. Chapman. Principles and Methods of Data Cleaning - Primary Species and Species-Occurence Data. Technical report, Global Biodiversity Information Facility, 2005. [60] Kyle Chard, Eli Dart, Ian Foster, David Shifflett, Steven Tuecke, and Jason Williams. The Modern Research Data Portal: a design pattern for networked, data-intensive science. PeerJ Computer Science, 4:e144, 2018. [61] Jonathan Chase and Mathew Leibold. Ecological niches: linking classical and contemporary approaches. University of Chicago Press, 2003. [62] Guobin Chen, Tony X. Han, Zhihai He, Roland Kays, and Tavis Forrester. Deep convolutional neural network based species recognition for wild animal monitoring. In 2014 IEEE International Conference on Image Processing (ICIP), pages 858–862. IEEE, 10 2014. [63] Youhua Chen. Conservation biogeography of the snake family Colubridae of China. North-Western Journal of Zoology, 5(2):251– 262, 2009. [64] N. R. Chrisman. The error component in spatial data. In Paul A. Longley, Michael F. Goodchild, David J. Maguire, and David W. Rhind, editors, Geographic Information Systems, chapter 12, pages 165–174. Wiley, 1991. [65] Nicholas R. Chrisman. Part 2: Issues and Problems Relating to Cartographic Data Use, Exchange and Transfer: The Role Of Quality Information In The Long-Term Functioning Of A Geographic Information System. Cartographica: The International Journal for Geographic Information and Geovisualization, 21(2-3):79–88, 10 1984. [66] Michael Ciarleglio, J. Wesley Barnes, and Sahotra Sarkar. ConsNet: new software for the selection of conservation area networks with spatial and multi-criteria analyses. Ecography, 32(2):205–209, 4 2009. [67] M Clark, D Roberts, and D Clark. Hyperspectral discrimination of tropical rain forest tree species at leaf to crown scales. Remote Sensing of Environment, 96(3-4):375–398, 6 2005. [68] Sarah Cohen-Boulakia, Khalid Belhajjame, Olivier Collin, Jérôme Chopard, Christine Froidevaux, Alban Gaignard, Konrad Hinsen, Pierre Larmande, Yvan Le Bras, Frédéric Lemoine, Fabien Mareuil, Hervé Ménager, Christophe Pradal, and Christophe Blanchet. Scientific workflows for computational reproducibility in the life sciences: Status, challenges and opportunities. Future Generation Computer Systems, 75:284–298, 10 2017. [69] Phillip E C Compeau, Pavel A Pevzner, and Glenn Tesler. How to apply de Bruijn graphs to genome assembly. Nature Biotechnology, 29(11):987–991, 2011. [70] Barry J. Conn. HISPID3 : Herbarium information standards and protocols for interchange of data - Version 3. Royal Botanic Gardens, Sydney, 1996. [71] Jane Costa, A Townsend Peterson, and C Ben Beard. Ecologic niche modeling and differentiation of populations of Triatoma brasiliensis neiva, 1911, the most important Chagas’ disease vector in northeastern Brazil (hemiptera, reduviidae, triatominae). The American Journal of Tropical Medicine and Hygiene, 67(5):516–520, 11 2002.

30 [72] Dinair Couto-Lima, Yoann Madec, Maria Ignez Bersot, Stephanie Silva Campos, Monique de Albuquerque Motta, Flávia Bar- reto dos Santos, Marie Vazeille, Pedro Fernando da Costa Vasconcelos, Ricardo Lourenço-de Oliveira, and Anna-Bella Failloux. Potential risk of re-emergence of urban transmission of Yellow Fever virus in Brazil facilitated by competent Aedes populations. Scientific Reports, 7(1):4848, 12 2017. [73] Peter Couvares, Tevfik Kosar, Alain Roy, Jeff Weber, and Kent Wenger. Workflow management in Condor. In Workflows for e-Science, pages 357–375. Springer, 2007. [74] Theodore J. Crovello. Problems in the Use of Electronic Data Processing in Biological Collections. Taxon, 16(6):481, 12 1967. [75] Eduardo Couto Dalcin. Data Quality Concepts and Techniques Applied to Taxonomic Databases. PhD , University of Southamp- ton, 2005. [76] Barnabas H. Daru, Daniel S. Park, Richard Primack, Charles G. Willis, David S Barrington, Timothy J. S. Whitfeld, Tristram G Seidler, Patrick W. Sweeney, David R. Foster, Aaron M. Ellison, and Charles C. Davis. Widespread sampling biases in herbaria revealed from large-scale digitization. bioRxiv, 2017. [77] Marko Debeljak, Aleš Poljanec, and Bernard Ženko. Modelling forest growing stock from inventory data: A data mining approach. Ecological Indicators, 41:30–39, 6 2014. [78] Ewa Deelman, Dennis Gannon, Matthew Shields, and Ian Taylor. Workflows and e-Science: An overview of workflow system features and capabilities. Future Generation Computer Systems, 25(5):528–540, 5 2009. [79] José Alexandre F. Diniz-Filho, Luis Mauricio Bini, Thiago Fernando Rangel, Rafael D. Loyola, Christian Hof, David Nogués-Bravo, and Miguel B. Araújo. Partitioning and mapping uncertainties in ensembles of forecasts of species turnover under climate change. Ecography, 32(6):897–906, 12 2009. [80] Xin Luna Dong and Theodoros Rekatsinas. Data integration and machine learning. Proceedings of the VLDB Endowment, 11(12):2094–2097, 8 2018. [81] L. Dou, G. Cao, P.J. Morris, R.A. Morris, B. Ludäscher, J.A. Macklin, and J. Hanken. Kurator: A Kepler Package for Data Curation Workflows. Procedia Computer Science, 9:1614–1619, 2012. [82] Lisa W. Drew. Are We Losing the Science of Taxonomy? BioScience, 61(12):942–946, 12 2011. [83] G Droege, K Barker, O Seberg, J Coddington, E Benson, W G Berendsohn, B Bunk, C Butler, E M Cawsey, J Deck, M Döring, P Flemons, B Gemeinholzer, A Güntsch, T Hollowell, P Kelbert, I Kostadinov, R Kottmann, R T Lawlor, C Lyal, J Mackenzie- Dodds, C Meyer, D Mulcahy, S Y Nussbeck, É O’Tuama, T Orrell, G Petersen, T Robertson, C Söhngen, J Whitacre, J Wieczorek, P Yilmaz, H Zetzsche, Y Zhang, and X Zhou. The Global Genome Biodiversity Network (GGBN) Data Standard specification. Database : the journal of biological databases and curation, 2016:baw125, 2016. [84] Gabriele Droege, Katharine Barker, Jonas J Astrin, Paul Bartels, Carol Butler, David Cantrill, Jonathan Coddington, Félix For- est, Birgit Gemeinholzer, Donald Hobern, Jacqueline Mackenzie-Dodds, Éamonn Ó Tuama, Gitte Petersen, Oris Sanjur, David Schindel, and Ole Seberg. The Global Genome Biodiversity Network (GGBN) Data Portal. Nucleic acids research, 42(Database issue):607–12, 1 2014. [85] J. L. Edwards. Interoperability of Biodiversity Databases: Biodiversity Information on Every Desktop. Science, 289(5488):2312– 2314, 9 2000. [86] James L. Edwards. Research and Societal Benefits of the Global Biodiversity Information Facility. BioScience, 54(6):486, 6 2004. [87] Jane Elith, Catherine H. Graham, Robert P. Anderson, Miroslav Dudík, Simon , Antoine Guisan, Robert J. Hijmans, Falk Huettmann, John R. Leathwick, Anthony Lehmann, Jin Li, Lucia G. Lohmann, Bette A. Loiselle, Glenn Manion, Craig Moritz, Miguel Nakamura, Yoshinori Nakazawa, Jacob McC. M. Overton, A. Townsend Peterson, Steven J. Phillips, Karen Richardson, Ricardo Scachetti-Pereira, Robert E. Schapire, Jorge Soberón, Stephen Williams, Mary S. Wisz, and Niklaus E. Zimmermann. Novel methods improve prediction of species’ distributions from occurrence data. Ecography, 29(2):129–151, 4 2006. [88] Jane Elith and John R. Leathwick. Species Distribution Models: Ecological Explanation and Prediction Across Space and Time. Annual Review of Ecology, Evolution, and Systematics, 40(1):677–697, 12 2009. [89] Laura Eme, Anja Spang, Jonathan Lombard, Courtney W. Stairs, and Thijs J. G. Ettema. Archaea and the origin of eukaryotes. Nature Reviews Microbiology, 15(12):711–723, 11 2017. [90] Robin Engler, Antoine Guisan, and Luca Rechsteiner. An improved approach for predicting the distribution of rare and endangered species from occurrence and pseudo-absence data. Journal of Applied Ecology, 41(2):263–274, 4 2004. [91] Agustín Estrada-Peña, Richard S Ostfeld, A Townsend Peterson, Robert Poulin, and José de la Fuente. Effects of environmental change on zoonotic disease risk: an ecological primer. Trends in parasitology, 30(4):205–14, 4 2014. [92] Katherine Faust. Animal Social Networks. In The SAGE Handbook of Social Network Analysis, chapter 11, pages 146–166. SAGE Publications, 2011. [93] Eric H. Fegraus, Sandy Andelman, Matthew B. Jones, and Mark Schildhauer. Maximizing the Value of Ecological Data with Structured Metadata: An Introduction to Ecological Metadata Language (EML) and Principles for Metadata Creation. Bulletin of the Ecological Society of America, 86(3):158–168, 7 2005. [94] J Ferreira, L E O C Aragão, J Barlow, P Barreto, E Berenguer, M Bustamante, T A Gardner, A C Lees, A Lima, J Louzada, R Pardini, L Parry, C A Peres, P S Pompeu, M Tabarelli, and J Zuanon. Environment and Development. Brazil’s environmental leadership at risk. Science (New York, N.Y.), 346(6210):706–7, 11 2014.

31 [95] Denis Filer. BRAHMS - Botanical Research and Herbarium Management System: Training Guide and Introductory Course. , 2013. [96] Anton J. Flügge, Sofia C. Olhede, and David J. Murrell. A method to detect subcommunities from multivariate spatial associations. Methods in Ecology and Evolution, 5(11):1214–1224, 11 2014. [97] R. Fortuner. Advances in Computer Methods for Systematic Biology: Artificial Intelligence, Databases, Computer Vision. The Johns Hopkins University Press, 1993. [98] Rafaela C. Forzza, José Fernando A. Baumgratz, Carlos Eduardo M. Bicudo, Dora A. L. Canhos, Anibal A. Carvalho, Marcus A. Nadruz Coelho, Andrea F. Costa, Denise P. Costa, Michael G. Hopkins, Paula M. Leitman, Lucia G. Lohmann, Eimear Nic Lughadha, Leonor Costa Maia, Gustavo Martinelli, Mariângela Menezes, Marli Pires Morim, Ariane Luna Peixoto, José R. Pirani, Jefferson Prado, Luciano P. Queiroz, Sidnei Souza, Vinicius Castro Souza, João R. Stehmann, Lana S. Sylvestre, Bruno M. T. Walter, and Daniela C. Zappi. New Brazilian Floristic List Highlights Conservation Challenges. BioScience, 62(1):39–45, 1 2012. [99] Ronaldo B. Francini-Filho, Nils E. Asp, Eduardo Siegle, John Hocevar, Kenneth Lowyck, Nilo D’Avila, Agnaldo A. Vasconcelos, Ricardo Baitelo, Carlos E. Rezende, Claudia Y. Omachi, Cristiane C. Thompson, and Fabiano L. Thompson. Perspectives on the Great Amazon Reef: Extension, Biodiversity, and Threats. Frontiers in Marine Science, 5, 4 2018. [100] Juliana Freire and Fernando Chirigati. Provenance and the Different Flavors of Computational Reproducibility. Bulletin of the Technical Committee on Data Engineering, 41(1):15–26, 2018. [101] Juliana Freire, David Koop, Emanuele Santos, and Cl Silva. Provenance for Computational Tasks: A Survey. Computing in Science & Engineering, 10(3):11–21, 5 2008. [102] Juliana Freire, Cláudio T Silva, Steven P Callahan, Emanuele Santos, Carlos E Scheidegger, and Huy T Vo. Managing rapidly- evolving scientific workflows. In Provenance and Annotation of Data, pages 10–18. Springer, 2006. [103] Peter T. Fretwell, Michelle A. LaRue, Paul Morin, Gerald L. Kooyman, Barbara Wienecke, Norman Ratcliffe, Adrian J. Fox, Andrew H. Fleming, Claire Porter, and Phil N. Trathan. An Emperor Penguin Population Estimate: The First Global, Synoptic Survey of a Species from Space. PLoS ONE, 7(4):e33751, 4 2012. [104] Ei Fujioka, Connie Y. Kot, Bryan P. Wallace, Benjamin D. Best, Jerry Moxley, Jesse Cleary, Ben Donnelly, and Patrick N. Halpin. Data Integration for Conservation: Leveraging Multiple Data Types to Advance Ecological Assessments and Habitat Modeling for Marine Megavertebrates using OBIS-SEAMAP. Ecological Informatics, 20:13–26, 1 2014. [105] Luiz Gadelha, Pedro Guimarães, Ana Maria Moura, Debora P. Drucker, Eduardo Dalcin, Guilherme Gall, Jurandir Tavares, Daniele Palazzi, Maira Poltosi, Fabio Porto, Francisco Moura, and Wagner Vieira Leo. SiBBr: Uma Infraestrutura para Coleta, Integração e Análise de Dados sobre a Biodiversidade Brasileira. In VIII Brazilian e-Science Workshop (BRESCI 2014). Proc. XXXIV Congress of the Brazilian Computer Society, 2014. [106] Gizele D Garcia, Gustavo B Gregoracci, Eidy de O Santos, Pedro M Meirelles, Genivaldo G Z Silva, Rob Edwards, Tomoo Sawabe, Kazuyoshi Gotoh, Shota Nakamura, Tetsuya Iida, Rodrigo L de Moura, and Fabiano L Thompson. Metagenomic analysis of healthy and white plague-affected Mussismilia braziliensis corals. Microbial ecology, 65(4):1076–86, 5 2013. [107] Evangelos Georganas, Aydın Buluç, Jarrod Chapman, Steven Hofmeyr, Chaitanya Aluru, Rob Egan, Leonid Oliker, Daniel Rokhsar, and Katherine Yelick. HipMer: an extreme-scale de novo genome assembler. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC ’15, pages 1–11, New York, New York, USA, 2015. ACM Press. [108] Tereza Cristina Giannini, Marinez F. Siqueira, André L. Acosta, Francisco C.C. Barreto, Antonio M. Saraiva, and Isabel Alves-dos Santos. Desafios atuais da modelagem preditiva de distribuição de espécies, 11 2012. [109] Fred Glover. Future paths for integer programming and links to artificial intelligence. Computers & Operations Research, 13(5):533–549, 1 1986. [110] S. A Goff, M. Vaughn, S. McKay, E. Lyons, A. E Stapleton, D. Gessler, N. Matasci, L. Wang, M. Hanlon, A. Lenards, A. Muir, N. Merchant, S. Lowry, S. Mock, M. Helmke, A. Kubach, M. Narro, N. Hopkins, D. Micklos, U. Hilgert, M Gonzales, C. Jordan, E. Skidmore, R. Dooley, J. Cazes, R. McLay, Z. Lu, S. Pasternak, L. Koesterke, W. H. Piel, R. Grene, C. Noutsos, K. Gendler, X. Feng, C. Tang, M. Lent, S. J. Kim, K. Kvilekval, B.S. Manjunath, V. Tannen, A. Stamatakis, M. J. Sanderson, S. M. Welch, K. Cranston, P. Soltis, D. Soltis, B. O’Meara, C. Ane, T. Brutnell, D. J. Kleibenstein, J. W. White, J. Leebens-Mack, M. J. Donoghue, E. P. Spalding, T. J. Vision, C. R. Myers, D. Lowenthal, B. J. Enquist, B. Boyle, A. Akoglu, G. Andrews, S. Ram, D. Ware, L. Stein, and D. Stanzione. The iPlant Collaborative: Cyberinfrastructure for Plant Biology, 2 2011. [111] Vitor H. F. Gomes, Stéphanie D. IJff, Niels Raes, Iêda Leão Amaral, Rafael P. Salomão, Luiz de Souza Coelho, Francisca Dionízia de Almeida Matos, Carolina V. Castilho, Diogenes de Andrade Lima Filho, Dairon Cárdenas López, Juan Ernesto Guevara, William E. Magnusson, Oliver L. Phillips, Florian Wittmann, Marcelo de Jesus Veiga Carim, Maria Pires Martins, Mariana Vic- tória Irume, Daniel Sabatier, Jean-François Molino, Olaf S. Bánki, José Renan da Silva Guimarães, Nigel C. A. Pitman, Maria Teresa Fernandez Piedade, Abel Monteagudo Mendoza, Bruno Garcia Luize, Eduardo Martins Venticinque, Evlyn Márcia Moraes de Leão Novo, Percy Núñez Vargas, Thiago Sanna Freire Silva, Angelo Gilberto Manzatto, John Terborgh, Neidiane Farias Costa Reis, Juan Carlos Montero, Katia Regina Casula, Beatriz S. Marimon, Ben-Hur Marimon, Euridice N. Honorio Coronado, Ted R. Feldpausch, Alvaro Duque, Charles Eugene Zartman, Nicolás Castaño Arboleda, Timothy J. Killeen, Bonifacio Mostacedo, Rodolfo Vasquez, Jochen Schöngart, Rafael L. Assis, Marcelo Brilhante Medeiros, Marcelo Fragomeni Simon, Ana Andrade, William F. Laurance, José Luís Camargo, Layon O. Demarchi, Susan G. W. Laurance, Emanuelle de Sousa Farias, Henrique Eduardo Men- donça Nascimento, Juan David Cardenas Revilla, Adriano Quaresma, Flavia R. C. Costa, Ima Célia Guimarães Vieira, Bruno

32 Barçante Ladvocat Cintra, Hernán Castellanos, Roel Brienen, Pablo R. Stevenson, Yuri Feitosa, Joost F. Duivenvoorden, Gerardo A. Aymard C., Hugo F. Mogollón, Natalia Targhetta, James A. Comiskey, Alberto Vicentini, Aline Lopes, Gabriel Damasco, Nállarett Dávila, Roosevelt García-Villacorta, Carolina Levis, Juliana Schietti, Priscila Souza, Thaise Emilio, Alfonso Alonso, David Neill, Francisco Dallmeier, Leandro Valle Ferreira, Alejandro Araujo-Murakami, Daniel Praia, Dário Dantas do Amaral, Fernanda An- tunes Carvalho, Fernanda Coelho de Souza, Kenneth Feeley, Luzmila Arroyo, Marcelo Petratti Pansonato, Rogerio Gribel, Boris Villa, Juan Carlos Licona, Paul V. A. Fine, Carlos Cerón, Chris Baraloto, Eliana M. Jimenez, Juliana Stropp, Julien Engel, Marcos Silveira, Maria Cristina Peñuela Mora, Pascal Petronelli, Paul Maas, Raquel Thomas-Caesar, Terry W. Henkel, Doug Daly, Mar- cos Ríos Paredes, Tim R. Baker, Alfredo Fuentes, Carlos A. Peres, Jerome Chave, Jose Luis Marcelo Pena, Kyle G. Dexter, Miles R. Silman, Peter Møller Jørgensen, Toby Pennington, Anthony Di Fiore, Fernando Cornejo Valverde, Juan Fernando Phillips, Gonzalo Rivas-Torres, Patricio von Hildebrand, Tinde R. van Andel, Ademir R. Ruschel, Adriana Prieto, Agustín Rudas, Bruce Hoffman, César I. A. Vela, Edelcilio Marques Barbosa, Egleé L. Zent, George Pepe Gallardo Gonzales, Hilda Paulette Dávila Doza, Ires Paula de Andrade Miranda, Jean-Louis Guillaumet, Linder Felipe Mozombite Pinto, Luiz Carlos de Matos Bonates, Natalino Silva, Ri- cardo Zárate Gómez, Stanford Zent, Therany Gonzales, Vincent A. Vos, Yadvinder Malhi, Alexandre A. Oliveira, Angela Cano, Bianca Weiss Albuquerque, Corine Vriesendorp, Diego Felipe Correa, Emilio Vilanova Torre, Geertje van der Heijden, Hirma Ramirez-Angulo, José Ferreira Ramos, Kenneth R. Young, Maira Rocha, Marcelo Trindade Nascimento, Maria Natalia Umaña Medina, Milton Tirado, Ophelia Wang, Rodrigo Sierra, Armando Torres-Lezama, Casimiro Mendoza, Cid Ferreira, Cláudia Baider, Daniel Villarroel, Henrik Balslev, Italo Mesones, Ligia Estela Urrego Giraldo, Luisa Fernanda Casas, Manuel Augusto Ahuite Reategui, Reynaldo Linares-Palomino, Roderick Zagt, Sasha Cárdenas, William Farfan-Rios, Adeilza Felipe Sampaio, Daniela Pauletto, Elvis H. Valderrama Sandoval, Freddy Ramirez Arevalo, Isau Huamantupa-Chuquimaco, Karina Garcia-Cabrera, Lionel Hernandez, Luis Valenzuela Gamarra, Miguel N. Alexiades, Susamar Pansini, Walter Palacios Cuenca, William Milliken, Joana Ricardo, Gabriela Lopez-Gonzalez, Edwin Pos, and Hans ter Steege. Species Distribution Modelling: Contrasting presence-only models with plot abundance data. Scientific Reports, 8(1):1003, 12 2018. [112] Alexander Gomez, Augusto Salazar, and Francisco Vargas. Towards Automatic Wild Animal Monitoring: Identification of Animal Species in Camera-trap Images using Very Deep Convolutional Neural Networks. 3 2016. [113] J. Grassle. The Ocean Biogeographic Information System (OBIS): An on-line, worldwide atlas for accessing, modeling and mapping marine biological data in a multidimensional geographic context. , 13(3):5–7, 2000. [114] Q. J. Groom, C. O’Reilly, and T. Humphrey. Herbarium specimens reveal the exchange network of British and Irish botanists, 1856–1932. New Journal of Botany, 4(2):95–103, 8 2014. [115] A. Guisan, N. E. Zimmermann, J. Elith, C. H. Graham, S. Phillips, and A. T. Peterson. What Matters for Predicting the Occurrences of Trees: Techniques, Data, or Species’ Characteristics? Ecological Monographs, 77(4):615–630, 11 2007. [116] Anton Güntsch, Walter G. Berendsohn, and Patricia Mergen. The BioCASE Project - a Biological Collections Access Service for Europe. Ferrantia, 51:103–108, 2007. [117] Robert Guralnick and Andrew Hill. Biodiversity informatics: automated approaches for documenting global biodiversity patterns and processes. Bioinformatics (Oxford, England), 25(4):421–428, 2 2009. [118] Robert P Guralnick, John Wieczorek, Reed Beaman, and Robert J Hijmans. BioGeomancer: automated georeferencing to map the world’s biodiversity data. PLoS biology, 4(11):e381, 11 2006. [119] Siddeswara Guru, Ivan C. Hanigan, Hoang Anh Nguyen, Emma Burns, John Stein, Wade Blanchard, David Lindenmayer, and Tim Clancy. Development of a cloud-based platform for reproducible science: A case study of an IUCN Red List of Ecosystems Assessment. Ecological Informatics, 36:221–230, 11 2016. [120] N. E. Gwinn and C. Rinaldo. The Biodiversity Heritage Library: sharing biodiversity literature with the world. IFLA Journal, 35(1):25–34, 3 2009. [121] Jack S. Hale, Lizao Li, Christopher N. Richardson, and Garth N. Wells. Containers for Portable, Productive, and Performant Scientific Computing. Computing in Science & Engineering, 19(6):40–50, 11 2017. [122] A. V. Hall. A Computer-Based System for Forming Identification Keys. Taxon, 19(1):12, 2 1970. [123] Jiawei Han, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann, 3rd edition, 2011. [124] Alex Hardisty, Dave Roberts, Wouter Addink, Bart Aelterman, Donat Agosti, Linda Amaral-Zettler, Arturo H Ariño, Christos Arvanitidis, Thierry Backeljau, Nicolas Bailly, Lee Belbin, Walter Berendsohn, Nic Bertrand, Neil Caithness, David Campbell, Guy Cochrane, Noël Conruyt, Alastair Culham, Christian Damgaard, Neil Davies, Bruno Fady, Sarah Faulwetter, Alan Feest, Dawn Field, Eric Garnier, Guntram Geser, Jack Gilbert, Grosche, David Grosser, Bénédicte Herbinet, Donald Hobern, Andrew Jones, Yde de Jong, David King, , Hanna Koivula, Wouter Los, Chris Meyer, Robert A Morris, Norman Morrison, David Morse, Matthias Obst, Evagelos Pafilis, Larry M Page, Roderic Page, Thomas Pape, Cynthia Parr, Alan Paton, David Patterson, Elisabeth Paymal, Lyubomir Penev, Marc Pollet, Richard Pyle, Eckhard von Raab-Straube, Vincent Robert, Tim Robertson, Olivier Rovellotti, Hannu Saarenmaa, Peter Schalk, Joop Schaminee, Paul Schofield, Andy Sier, Soraya Sierra, Vince , Edwin van Spronsen, Simon Thornton-Wood, Peter van Tienderen, Jan van Tol, Eamonn O’Tuama, Peter Uetz, Lea Vaas, Régine Vignes Lebbe, Todd Vision, Duong Vu, Aaike De Wever, Richard White, Kathy Willis, and Fiona Young. A decadal view of biodiversity informatics: challenges and priorities. BMC ecology, 13(1):16, 1 2013. [125] Alex R. Hardisty, Finn Bacall, Niall Beard, Maria-Paula Balcázar-Vargas, Bachir Balech, Zoltán Barcza, Sarah J. Bourlat, Renato De Giovanni, Yde de Jong, Francesca De Leo, Laura Dobor, Giacinto Donvito, Donal Fellows, Antonio Fernandez Guerra, Nuno Ferreira, Yuliya Fetyukova, Bruno Fosso, Jonathan Giddy, Carole Goble, Anton Güntsch, Robert Haines, Vera Hernández Ernst,

33 Hannes Hettling, Dóra Hidy, Ferenc Horváth, Dóra Ittzés, Péter Ittzés, Andrew Jones, Renzo Kottmann, Robert Kulawik, Sonja Lei- denberger, Päivi Lyytikäinen-Saarenmaa, Cherian Mathew, Norman Morrison, Aleksandra Nenadic, Abraham Nieva de la Hidalga, Matthias Obst, Gerard Oostermeijer, Elisabeth Paymal, Graziano Pesole, Salvatore Pinto, Axel Poigné, Francisco Quevedo Fernan- dez, Monica Santamaria, Hannu Saarenmaa, Gergely Sipos, Karl-Heinz Sylla, Marko Tähtinen, Saverio Vicario, Rutger Aldo Vos, Alan R. Williams, and Pelin Yilmaz. BioVeL: a virtual laboratory for data analysis and modelling in biodiversity science and ecology. BMC Ecology, 16(1):49, 2016. [126] T. J. Hastie and R. Tibshirani. Generalized additive models. Chapman and Hall, 1990. [127] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. 12 2015. [128] Linling He and Jiang Zhu. Computational tools for epitope vaccine design and evaluation. Current Opinion in Virology, 11:103– 112, 4 2015. [129] Zhihai He, Roland Kays, Zhi Zhang, Guanghan Ning, Chen Huang, Tony X. Han, Josh Millspaugh, Tavis Forrester, and William McShea. Visual Informatics Tools for Supporting Large-Scale Collaborative Wildlife Monitoring with Citizen Scientists. IEEE Circuits and Systems Magazine, 16(1):73–86, 2016. [130] P. D. N. Hebert, A. Cywinska, S. L. Ball, and J. R. DeWaard. Biological identifications through DNA barcodes. Proceedings of the Royal Society B: Biological Sciences, 270(1512):313–321, 2 2003. [131] P. Bryan Heidorn. Shedding Light on the Dark Data in the Long Tail of Science. Library Trends, 57(2):280–299, 2008. [132] Risto K. Heikkinen, Miska Luoto, Raimo Virkkala, Richard G. Pearson, and Jan-Hendrik Körber. Biotic interactions improve prediction of boreal bird distributions at macro-scales. Global Ecology and Biogeography, 16(6):754–763, 11 2007. [133] Tony Hey, Stewart Tansley, and Kristin Tolle. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, 2009. [134] D. Higgins, C. Berkley, and M.B. Jones. Managing heterogeneous ecological data using Morpho. In Proceedings 14th International Conference on Scientific and Statistical Database Management, pages 69–76. IEEE Comput. Soc, 2002. [135] R. J. Hijmans, K. A. Garrett, Z. Huamán, D. P. Zhang, M. Schreuder, and M. Bonierbale. Assessing the Geographic Representa- tiveness of Genebank Collections: the Case of Bolivian Wild Potatoes. Conservation Biology, 14(6):1755–1765, 7 2008. [136] Robert J. Hijmans, Susan E. Cameron, Juan L. Parra, Peter G. Jones, and Andy Jarvis. Very high resolution interpolated climate surfaces for global land areas. International Journal of Climatology, 25(15):1965–1978, 12 2005. [137] C. M. Hine, K.M. Becker, and A. Doroszenko. Nomenclatural instability from the information scientists’ point of view: problems and possibilities. In D. L. Hawksworth, editor, Improving the Stability of Names: Needs and Options. International Association for , 1991. [138] Donald Hobern, Alberto Apostolico, Elizabeth Arnaud, Juan Carlos Bello, Dora Canhos, Gregoire Dubois, Dawn Field, Enrique García, Alex Hardisty, Jerry Harrison, Bryan Heidorn, Leonard Krishtalka, Erick Mata, Roderick Page, Cynthia Parr, Jeff Price, and Selwyn Willoughby. Global Biodiversity Information Outlook - Delivering Biodiversity Knowledge in the Information Age. Technical report, GBIF Secretariat, 2013. [139] Wesley M. Hochachka, Rich Caruana, Daniel Fink, Art Munson, Mirek Riedewald, Daria Sorokina, and Steve Kelling. Data-Mining Discovery of Pattern and Process in Ecological Systems. Journal of Wildlife Management, 71(7):2427, 2007. [140] H. Holden and E. Ledrew. Hyperspectral identification of features. International Journal of Remote Sensing, 20(13):2545–2563, 1 1999. [141] D. Holmes and M.C. McCabe. Improving precision and recall for Soundex retrieval. In Proceedings. International Conference on Information Technology: Coding and Computing, pages 22–26. IEEE Comput. Soc, 2002. [142] Edward C. Holmes. Evolutionary History and Phylogeography of Human Viruses. Annual Review of Microbiology, 62(1):307–328, 10 2008. [143] David U Hooper, E Carol Adair, Bradley J Cardinale, Jarrett E K Byrnes, Bruce A Hungate, Kristin L Matulich, Andrew Gonzalez, J Emmett Duffy, Lars Gamfeldt, and Mary I O’Connor. A global synthesis reveals biodiversity loss as a major driver of ecosystem change. Nature, 486(7401):105–8, 6 2012. [144] Joaquín Hortal, Jorge M. Lobo, and Alberto Jiménez-Valverde. Limitations of Biodiversity Databases: Case Study on Seed-Plant Diversity in Tenerife, Canary Islands. Conservation Biology, 21(3):853–863, 6 2007. [145] Doug Howe, Maria Costanzo, Petra Fey, Takashi Gojobori, Linda Hannick, Winston Hide, David P. Hill, Renate Kania, Mary Schaeffer, Susan St Pierre, Simon Twigger, Owen White, and Seung Yon Rhee. Big Data: The future of biocuration. Nature, 455(7209):47–50, 9 2008. [146] Thomas C. Ings, José M. Montoya, Jordi Bascompte, Nico Blüthgen, Lee Brown, Carsten F. Dormann, François Edwards, David Figueroa, Ute Jacob, J. Iwan Jones, Rasmus B. Lauridsen, Mark E. Ledger, Hannah M. Lewis, Jens M. Olesen, F.J. Frank van Veen, Phil H. Warren, and Guy Woodward. Review: Ecological networks - beyond food webs. Journal of Animal Ecology, 78(1):253–269, 1 2009. [147] IUCN-BGCS. The International Transfer Format for Botanic Garden Plant Records. Plant Taxonomic Database Standards, no. 1. http://www.huntbotanical.org/publications/show.php?123. Hunt Institute for Botanical Documentation, 1987. [148] Peter Ivie and Douglas Thain. Reproducibility in Scientific Computing. ACM Computing Surveys, 51(3):1–36, 7 2018.

34 [149] David M.P. Jacoby and Robin Freeman. Emerging Network-Based Tools in Movement Ecology. Trends in Ecology & Evolution, 31(4):301–314, 4 2016. [150] Matthew B. Jones, Mark P. Schildhauer, O.J. Reichman, and Shawn Bowers. The New Bioinformatics: Integrating Ecological Data from the Gene to the . Annual Review of Ecology, Evolution, and Systematics, 37(1):519–544, 12 2006. [151] Dirk Nikolaus Karger, Olaf Conrad, Jürgen Böhner, Tobias Kawohl, Holger Kreft, Rodrigo Wilber Soria-Auza, Niklaus E. Zimmer- mann, H. Peter Linder, and Michael Kessler. Climatologies at high resolution for the earth’s land surface areas. Scientific Data, 4:170122, 9 2017. [152] David A. Keith, Jon Paul Rodríguez, Kathryn M. Rodríguez-Clark, Emily Nicholson, Kaisu Aapala, Alfonso Alonso, Marianne Asmussen, Steven Bachman, Alberto Basset, Edmund G. Barrow, John S. Benson, Melanie J. Bishop, Ronald Bonifacio, Thomas M. Brooks, Mark A. Burgman, Patrick Comer, Francisco A. Comín, Franz Essl, Don -Langendoen, Peter G. Fairweather, Robert J. Holdaway, Michael Jennings, Richard T. Kingsford, Rebecca E. Lester, Ralph Mac Nally, Michael A. McCarthy, Justin Moat, María A. Oliveira-Miranda, Phil Pisanu, Brigitte Poulin, Tracey J. Regan, Uwe Riecken, Mark D. Spalding, and Sergio Zambrano- Martínez. Scientific Foundations for an IUCN Red List of Ecosystems. PLoS ONE, 8(5):e62111, 5 2013. [153] Hanna Köpcke, Andreas Thor, and Erhard Rahm. Evaluation of entity resolution approaches on real-world match problems. Proceedings of the VLDB Endowment, 3(1-2):484–493, 9 2010. [154] Harriet Meadow Krauss. The Information System Design for the Flora North America Program. Brittonia, 25(2):119–134, 1973. [155] Jitendra Kumar, Richard T. Mills, Forrest M. Hoffman, and William W. Hargrove. Parallel k-Means Clustering for Quantitative Delineation Using Large Data Sets. Procedia Computer Science, 4:1602–1611, 2011. [156] John La Salle, Kristen J Williams, Craig Moritz, F. Essl, S. Dullinger, W. Rabitsch, PE. Hulme, P. Pyšek, JRU. Wilson, DM. Richard- son, A. Marques, S. Díaz, HM. Pereira, WD. Kissling, AK. Skidmore, The biodiversity informatics community, WK. Michener, JD. Nichols, EG. Cooch, JM. Nichols, JR. Sauer, J. Silvertown, JL. Dickinson, B. Zuckerberg, DN. Bonter, JB. Losos, DPC. Peters, KM. Havstad, J. Cushing, C. Tweedie, O. Fuentes, N. Villanueva-Rosales, S. Vicario, B. Balech, G. Donvito, P. Notarangelo, G. Pesole, L. Belbin, KJ. Williams, J. Wieczorek, D. Bloom, R. Guralnick, S. Blum, M. Döring, R. Giovanni, T. Robertson, D. Vieglais, H. Constable, R. Guralnick, J. Wieczorek, C. Spencer, AT. Peterson, GBIF, W. Jetz, JM. McPherson, RP. Guralnick, PDN. Hebert, JR. deWaard, EV. Zakharov, SWJ. Prosser, JE. Sones, JTA. McKeown, B. Mantle, J. La Salle, EM. Lemmon, AR. Lemmon, MR. Jones, JM. Good, BC. Schlick-Steiner, W. Arthofer, FM. Steiner, RA. Catullo, S. Ferrier, AA. Hoffmann, DF. Rosauer, S. Ferrier, KJ. Williams, G. Manion, JS. Keogh, SW. Laffan, J. Kattge, RG. Beutel, S-Q. Ge, T. Hörnschemeyer, RG. Beutel, L. Vilhelmsen, N. Akkari, D. Koon-Bong Cheung, H. Enghoff, P. Stoev, CV. Nguyen, DR. Lovell, M. Adcock, J. La Salle, BL. Mantle, JL. Salle, N. Fisher, P. Flemons, P. Berents, J. La Salle, QD. Wheeler, P. Jackway, SL. Winterton, D. Hobern, D. Lovell, JN. Thompson, JM. Tylianakis, RK. Didham, J. Bascompte, DA. Wardle, RJ. Morris, OT. Lewis, HCJ. Godfray, FJ. Frank van Veen, RJ. Morris, HCJ. Godfray, JH. Poelen, JD. Simons, CJ. Mungall, K. de Queiroz, RR. Sokal, S. Ratnasingham, PDN. Hebert, C. Moritz, CE. González- Orozco, T. Laity, JT. Miller, G. Jolley-Rogers, G. Jolley-Rogers, T. Varghese, P. Harvey, N. dos Remedios, JT. Miller, MJ. Grundy, RAV. Rossel, RD. Searle, PL. Wilson, C. Chen, LJ. Gregory, RA. Viscarra Rossel, C. Chen, MJ. Grundy, R. Searle, D. Clifford, PH. Campbell, MF. Goodchild, WK. Michener, KJ. Williams, L. Belbin, MP. Austin, J. Stein, S. Ferrier, W. Hallgren, C. Reed, K. Buehler, L. McKee, B. Smith, P. Buttigieg, N. Morrison, B. Smith, C. Mungall, S. Lewis, TE. Consortium, AT. Peterson, J. Soberón, L. Kr- ishtalka, A. Basset, W. Los, L. Belbin, C. McDonald, TH. Booth, KJ. Williams, L. Belbin, TH. Booth, S. Ferrier, VA. Funk, KS. Richardson, S. Ferrier, DP. Faith, PA. Walker, P. Flemons, R. Guralnick, J. Krieger, A. Ranipeta, D. Neufeld, L. Belbin, J. Daly, T. Hirsch, D. Hobern, J. LaSalle, A. Jones, R. White, E. Orme, RT. Fielding, RN. Taylor, CH. Saslis-Lagoudakis, X. Hua, E. Bui, C. Moray, L. Bromham, DAL. Canhos, B. , B. Gemeinholzer, A. Treloar, H. Zauner, and EC. Moylan. Biodiversity analysis in the digital era. Philosophical transactions of the Royal Society of London. Series B, Biological sciences, 371(1702):534–547, 9 2016. [157] José J. Lahoz-Monfort, Gurutzeta Guillera-Arroita, and Brendan A. Wintle. Imperfect detection impacts the performance of species distribution models. Global Ecology and Biogeography, 23(4):504–515, 4 2014. [158] Benjamin Leader, Quentin J. Baca, and David E. Golan. Protein therapeutics: a summary and pharmacological classification. Nature Reviews Drug Discovery, 7(1):21–39, 1 2008. [159] Christopher T. Lee and Rommie E. Amaro. Exascale Computing: A New Dawn for Computational Biology. Computing in Science & Engineering, 20(5):18–25, 9 2018. [160] Vladimir Levenshtein. Binary Codes Capable of Correcting Deletions, Insertions, and Reversals. Soviet Physics - Doklady, 10(8):707–710, 1966. [161] Harris A. Lewin, Gene E. Robinson, W. John Kress, William J. Baker, Jonathan Coddington, Keith A. Crandall, Richard Durbin, Scott V. Edwards, Félix Forest, M. Thomas P. Gilbert, Melissa M. Goldstein, Igor V. Grigoriev, Kevin J. Hackett, David Haussler, Erich D. Jarvis, Warren E. Johnson, Aristides Patrinos, Stephen Richards, Juan Carlos Castilla-Rubio, Marie-Anne van Sluys, Pamela S. Soltis, Xun Xu, Huanming Yang, and Guojie Zhang. Earth BioGenome Project: Sequencing life for the future of life. Proceedings of the National Academy of Sciences, 115(17):4325–4333, 4 2018. [162] Shu-Hsien Liao, Pei-Hui Chu, and Pei-Yuan Hsiao. Data mining techniques and applications – A decade review from 2000 to 2011. Expert Systems with Applications, 39(12):11303–11311, 9 2012. [163] Chee Sun Liew, Malcolm P. Atkinson, Michelle Galea, Tan Fong Ang, Paul Martin, and Jano I. Van Hemert. Scientific Workflows: Moving Across Paradigms. ACM Computing Surveys, 49(4):1–39, 12 2016. [164] Raymond L. Lindeman. The Trophic-Dynamic Aspect of Ecology. Ecology, 23(4):399–417, 10 1942.

35 [165] Canran Liu, Pam M. Berry, Terence P. Dawson, and Richard G. Pearson. Selecting thresholds of occurrence in the prediction of species distributions. Ecography, 28(3):385–393, 6 2005. [166] Canran Liu, Matt White, and Graeme Newell. Measuring and comparing the accuracy of species distribution models with presence-absence data. Ecography, 34(2):232–243, 4 2011. [167] Ji Liu, Esther Pacitti, Patrick Valduriez, and Marta Mattoso. A Survey of Data-Intensive Scientific Workflow Management. Journal of Grid Computing, 13(4):457–493, 3 2015. [168] Bette A. Loiselle, Peter M. Jørgensen, Trisha Consiglio, Iván Jiménez, John G. Blake, Lúcia G. Lohmann, and Olga Martha Montiel. Predicting species distributions from herbarium collections: does climate bias in collection sampling influence model outcomes? Journal of Biogeography, pages 070908043732003–???, 9 2007. [169] Mark Lomolino. Conservation biogeography. In Frontiers in Biogeography: New Directions in the Geography of Nature, pages 293–296. Sinauer, 2004. [170] Bertram Ludäscher, Ilkay Altintas, Chad Berkley, Dan Higgins, Efrat Jaeger, Matthew Jones, Edward A. Lee, Jing Tao, and Yang Zhao. Scientific workflow management and the Kepler system. Concurrency and Computation: Practice and Experience, 18(10):1039–1065, 8 2006. [171] Robert D. MacDonald. Electronic Data Processing Methods for Botanical Garden and Arboretum Records. Taxon, 15(8):291–295, 11 1966. [172] William Magnusson, Ricardo Braga-Neto, Flávia Pezzini, Fabrício Baccaro, Helena Bergallo, Jerry Penha, Domingos Rodrigues, Luciano M. Verdade, Albertina Lima, Ana Luísa Albernaz, Jean-Marc Hero, Ben Lawson, Carolina Castilho, Débora Drucker, Elisabeth Franklin, Fernando Mendonça, Flávia Costa, Graciliano Galdino, Guy Castley, Jansen Zuanon, Julio do Vale, José Laurindo Campos dos Santos, Regina Luizão, Renato Cintra, Reinaldo I. Barbosa, Antônio Lisboa, Rodrigo V. Koblitz, Cátia Nunes da Cunha, and Antonio R. Mendes Pontes. Biodiversity and Integrated Environmental Monitoring. Áttema Editorial, 2013. [173] Joran Martijn, Julian Vosseberg, Lionel Guy, Pierre Offre, and Thijs J. G. Ettema. Deep mitochondrial origin outside the sampled alphaproteobacteria. Nature, 557(7703):101–105, 5 2018. [174] Marta Mattoso, Claudia Werner, Guilherme Horta Travassos, Vanessa Braganholo, Eduardo Ogasawara, Daniel Oliveira, Sergio Cruz, Wallace Martinho, and Leonardo Murta. Towards supporting the life cycle of large scale scientific experiments. International Journal of Business Process Integration and Management, 5 2010. [175] P. McCullagh and J. Nelder. Generalized Linear Models. Chapman and Hall/CRC, second edition, 1989. [176] J. McNeill. International code of nomenclature for algae, fungi and plants (Melbourne code): adopted by the Eighteenth International Botanical Congress Melbourne. Koeltz Scientific Books, 2012. [177] Pedro M. Meirelles, Gilberto M. Amado-Filho, Guilherme H. Pereira-Filho, Hudson T. Pinheiro, Rodrigo L. de Moura, Jean- Christophe Joyeux, Eric F. Mazzei, Alex C. Bastos, Robert A. Edwards, Elizabeth Dinsdale, Rodolfo Paranhos, Eidy O. Santos, Tetsuya Iida, Kazuyoshi Gotoh, Shota Nakamura, Tomoo Sawabe, Carlos E. Rezende, Luiz M. R. Gadelha, Ronaldo B. Francini- Filho, Cristiane Thompson, and Fabiano L. Thompson. Baseline Assessment of Mesophotic Reefs of the Vitória-Trindade Seamount Chain Based on Water Quality, Microbial Diversity, Benthic Cover and Fish Biomass Data. PLOS ONE, 10(6):e0130084, 6 2015. [178] Pedro Milet Meirelles, Luiz M. R. Gadelha Jr., Ronaldo Bastos Francini-Filho, Rodrigo de Moura Leão, Gilberto Menezes Amado- Filho, Alex Cardoso Bastos, Rodolfo Pinheiro da Rocha Paranhos, Carlos Eduardo Rezende, Jean Swings, Eduardo Siegle, Nils Edvin Asp Neto, Sigrid Neumann Leitão, Ricardo Coutinho, Marta Mattoso, Paulo S. Salomon, Rogério A. B. Valle, Renato Crespo Pereira, Ricardo Henrique Kruger, Cristiane Thompson, and Fabiano L. Thompson. BaMBa: towards the integrated management of Brazilian marine environmental data. Database, 2015, 1 2015. [179] Haiyan Meng, Rupa Kommineni, Quan Pham, Robert Gardner, Tanu Malik, and Douglas Thain. An invariant framework for conducting reproducible computational science. Journal of Computational Science, 9:137–142, 7 2015. [180] Carsten Meyer, Patrick Weigelt, and Holger Kreft. Multidimensional biases, gaps and uncertainties in global plant occurrence information. Ecology Letters, 19(8):992–1006, 8 2016. [181] F Meyer, D Paarmann, M D’Souza, R Olson, E M Glass, M Kubal, T Paczian, A Rodriguez, R Stevens, A Wilke, J Wilkening, and R A Edwards. The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC bioinformatics, 9:386, 1 2008. [182] William K. Michener, Suzie Allard, Amber Budden, Robert B. Cook, Kimberly Douglass, Mike Frame, Steve Kelling, Rebecca Koskela, Carol Tenopir, and David A. Vieglais. Participatory design of DataONE—Enabling cyberinfrastructure for the biological and environmental sciences. Ecological Informatics, 11:5–15, 9 2012. [183] William K Michener and Matthew B Jones. Ecoinformatics: supporting ecology as a data-intensive science. Trends in ecology & evolution, 27(2):85–93, 2 2012. [184] William K. Michener, John Porter, Mark Servilla, and Kristin Vanderbilt. Long term ecological research and information manage- ment. Ecological Informatics, 6(1):13–24, 1 2011. [185] Simon Miles, Sylvia C. Wong, Weijian Fang, Paul Groth, Klaus-Peter Zauner, and Luc Moreau. Provenance-based validation of e-science experiments. Web Semantics: Science, Services and Agents on the World Wide Web, 5(1):28–38, 3 2007. [186] Renée J. Miller. Open Data Integration. Proceedings of the VLDB Endowment, 11(12):2130–2139, 2018.

36 [187] Luc Moreau, Paul Groth, James Cheney, Timothy Lebo, and Simon Miles. The rationale of PROV. Web Semantics: Science, Services and Agents on the World Wide Web, 35:235–257, 12 2015. [188] Andres Moreira-Soto, Maria Celeste Torres, Marcos Cesar Lima de Mendonça, Maria Angelica Mares-Guia, Cintia Damas- ceno dos Santos Rodrigues, Allison , Carolina Cardoso dos Santos, Eliane Saraiva Machado Araújo, Carlo Fischer, Rita Maria Ribeiro Nogueira, Christian Drosten, Patricia Carvalho Sequeira, Jan Felix Drexler, and Ana Maria Bispo de Filippis. Evidence for Multiple Sylvatic Transmission Cycles During the 2016-2017 Yellow Fever Virus Outbreak, Brazil. Clinical Microbiology and Infection, 2 2018. [189] Jeffrey T. Morisette, Catherine S. Jarnevich, Tracy R. Holcombe, Colin B. Talbert, Drew Ignizio, Marian K. Talbert, Claudio Silva, David Koop, Alan Swanson, and Nicholas E. Young. VisTrails SAHM: visualization and workflow management for species habitat modeling. Ecography, 36(2):129–135, 2 2013. [190] Paul J. Morris, James Hanken, David Lowery, Bertram Ludäscher, James Macklin, Timothy McPhillips, John Wieczorek, and Qian Zhang. Kurator: Tools for Improving Fitness for Use of Biodiversity Data. Biodiversity Information Science and Standards, 2:e26539, 6 2018. [191] Robert A Morris, Vijay Barve, Mihail Carausu, Vishwas Chavan, José Cuadra, Chris Freeland, Gregor Hagedorn, Patrick Leary, Dimitry Mozzherin, Annette Olson, Gregory Riccardi, Ivan Teage, and Greg Whitbread. Discovery and publishing of primary biodiversity data associated with multimedia resources: The Audubon Core strategies and approaches. Biodiversity Informatics, 8(2), 7 2013. [192] Sidharth Mudgal, Han Li, Theodoros Rekatsinas, AnHai Doan, Youngchoon Park, Ganesh Krishnan, Rohit Deep, Esteban Arcaute, and Vijay Raghavendra. Deep Learning for Entity Matching. In Proceedings of the 2018 International Conference on Management of Data - SIGMOD ’18, pages 19–34, New York, New York, USA, 2018. ACM Press. [193] Babak Naimi, Andrew K. Skidmore, Thomas A. Groen, and Nicholas A. S. Hamm. Spatial autocorrelation in predictors reduces the impact of positional uncertainty in occurrence data on species distribution modelling. Journal of Biogeography, 38(8):1497–1509, 8 2011. [194] Fatemeh Nargesian, Erkang Zhu, Ken Q. Pu, and Renée J. Miller. Table union search on open data. Proceedings of the VLDB Endowment, 11(7):813–825, 3 2018. [195] Tim Newbold, Lawrence N. Hudson, Samantha L. L. Hill, Sara Contu, Igor Lysenko, Rebecca A. Senior, Luca Börger, Dominic J. Bennett, Argyrios Choimes, Ben Collen, Julie Day, Adriana De Palma, Sandra Díaz, Susy Echeverria-Londoño, Melanie J. Edgar, Anat Feldman, Morgan Garon, Michelle L. K. Harrison, Tamera Alhusseini, Daniel J. Ingram, Yuval Itescu, Jens Kattge, Victoria Kemp, Lucinda Kirkpatrick, Michael Kleyer, David Laginha Pinto Correia, Callum D. Martin, Shai Meiri, Maria Novosolov, Yuan Pan, Helen R. P. Phillips, Drew W. Purves, Alexandra Robinson, Jake Simpson, Sean L. Tuck, Evan Weiher, Hannah J. White, Robert M. Ewers, Georgina M. Mace, Jörn P. W. Scharlemann, and Andy Purvis. Global effects of land use on local terrestrial biodiversity. Nature, 520(7545):45–50, 1 2015. [196] Mark Newman. Networks: An Introduction. , 2010. [197] Hoang Anh Nguyen, Lucie Bland, Tristan Roberts, Siddeswara Guru, Minh Dinh, and David Abramson. A Computational Pipeline for the IUCN Risk Assessment for Meso-American Reef Ecosystem. In 2017 IEEE 13th International Conference on e-Science (e-Science), pages 286–294, 2017. [198] Mohammad Sadegh Norouzzadeh, Anh Nguyen, Margaret Kosmala, Alexandra Swanson, Meredith S. Palmer, Craig Packer, and Jeff Clune. Automatically identifying, counting, and describing wild animals in camera-trap images with deep learning. Proceed- ings of the National Academy of Sciences, 115(25):E5716–E5725, 6 2018. [199] Howard T. Odum. Primary Production in Flowing Waters. Limnology and Oceanography, 1(2):102–117, 4 1956. [200] Javier Otegui and Robert P Guralnick. The Geospatial Data Quality REST API for Primary Biodiversity Data. Bioinformatics (Oxford, England), pages btw057–, 2 2016. [201] Maureen A. O’Leary and Seth Kaufman. MorphoBank: phylophenomics in the “cloud”. Cladistics, 27(5):529–537, 10 2011. [202] Roderic D M Page. Biodiversity informatics: the challenge of linking data and the role of shared identifiers. Briefings in bioinfor- matics, 9(5):345–54, 9 2008. [203] Francisco Pando. How species interactions are managed in Plinian Core: Status and questions. Proceedings of TDWG, 1:e20556, 8 2017. [204] R. J. Pankhurst. Taxonomic databases: the PANDORA system. In R. Fortuner, editor, Advances in Computer Methods for Systematic Biology, pages 230–240. Johns Hopkins University Press, 1993. [205] Richard G. Pearson. Species’ Sistribution Modeling for Conservation Educators and Practiotioners. Lessons in Conservation, 3:54–89, 2010. [206] Richard G. Pearson, Christopher J. Raxworthy, Miguel Nakamura, and A. Townsend Peterson. Predicting species distributions from small numbers of occurrence records: a test case using cryptic geckos in Madagascar. Journal of Biogeography, 34(1):102– 117, 9 2006. [207] Richard G. Pearson, Wilfried Thuiller, Miguel B. Araújo, Enrique Martinez-Meyer, Lluís Brotons, Colin McClean, Lera Miles, Pedro Segurado, Terence P. Dawson, and David C. Lees. Model-based uncertainty in species range prediction. Journal of Biogeography, 33(10):1704–1711, 10 2006.

37 [208] Carlos Peña and Tobias Malm. VoSeq: A voucher and DNA sequence web application. PLoS ONE, 7(6):1–4, 2012. [209] Roger D. Peng. Reproducible Research in Computational Science. Science, 334(6060):1226–1227, 2011. [210] Deana D Pennington, Dan Higgins, A Townsend Peterson, Matthew B Jones, Bertram Ludäscher, and Shawn Bowers. Ecological Niche Modeling Using the Kepler Workflow System. In Ian J Taylor, Ewa Deelman, Dennis B Gannon, and Matthew Shields, editors, Workflows for e-Science, pages 91–108. Springer London, 1 2007. [211] H M Pereira, S Ferrier, M Walters, G N Geller, R H G Jongman, R J Scholes, M W Bruford, N Brummitt, S H M Butchart, A C Cardoso, N C Coops, E Dulloo, D P Faith, J Freyhof, R D Gregory, C Heip, R Höft, G Hurtt, W Jetz, D S Karp, M A McGeoch, D Obura, Y Onoda, N Pettorelli, B Reyers, R Sayre, J P W Scharlemann, S N Stuart, E Turak, M Walpole, and M Wegmann. Ecology. Essential biodiversity variables. Science (New York, N.Y.), 339(6117):277–8, 1 2013. [212] F. H. Perring. Data-Processing for the Atlas of the British Flora. Taxon, 12(5):183–190, 6 1963. [213] A. Townsend Peterson. Uses and Requirements of Ecological Niche Models and Related Distributional Models. Biodiversity Informatics, 3, 12 2006. [214] A. Townsend Peterson and C. Richard Robins. Using Ecological-Niche Modeling to Predict Barred Owl Invasions with Implications for Spotted Owl Conservation. Conservation Biology, 17(4):1161–1165, 8 2003. [215] A. Townsend Peterson and Jorge Soberón. Essential biodiversity variables are not global. Biodiversity and Conservation, 2017. [216] A. Townsend Peterson, Jorge Soberón, Richard G. Pearson, Robert P. Anderson, Enrique Martínez-Meyer, Miguel Nakamura, and Miguel B. Araújo. Ecological Niches and Geographic Distributions. Princeton University Press, 2011. [217] Nathalie Pettorelli, Martin Wegmann, Andrew Skidmore, Sander Mücher, Terence P. Dawson, Miguel Fernandez, Richard Lucas, Michael E. Schaepman, Tiejun Wang, Brian O’Connor, Robert H.G. Jongman, Pieter Kempeneers, Ruth Sonnenschein, Allison K. Leidner, Monika Böhm, Kate S. He, Harini Nagendra, Grégoire Dubois, Temilola Fatoyinbo, Matthew C. Hansen, Marc Paganini, Helen M. de Klerk, Gregory P. Asner, Jeremy T. Kerr, Anna B. Estes, Dirk S. Schmeller, Uta Heiden, Duccio Rocchini, Henrique M. Pereira, Eren Turak, Nestor Fernandez, Angela Lausch, Moses A. Cho, Domingo Alcaraz-Segura, Mélodie A. McGeoch, Woody Turner, Andreas Mueller, Véronique St-Louis, Johannes Penner, Petteri Vihervaara, Alan Belward, Belinda Reyers, and Gary N. Geller. Framing the concept of satellite remote sensing essential biodiversity variables: challenges and future directions. Remote Sensing in Ecology and Conservation, pages n/a–n/a, 3 2016. [218] Steven J. Phillips, Robert P. Anderson, and Robert E. Schapire. Maximum entropy modeling of species geographic distributions. Ecological Modelling, 190(3-4):231–259, 1 2006. [219] Rafael Pino-Mejías, María Dolores Cubiles-de-la Vega, María Anaya-Romero, Antonio Pascual-Acosta, Antonio Jordán-López, and Nicolás Bellinfante-Crocci. Predicting the potential habitat of oaks with data mining models and the R system. Environmental Modelling & Software, 25(7):826–836, 7 2010. [220] Heather A Piwowar and Todd J Vision. Data reuse and the open data citation advantage. PeerJ, 1:e175, 1 2013. [221] Vânia Proença, Laura Jane Martin, Henrique Miguel Pereira, Miguel Fernandez, Louise McRae, Jayne Belnap, Monika Böhm, Neil Brummitt, Jaime García-Moreno, Richard D. Gregory, João Pradinho Honrado, Norbert Jürgens, Michael Opige, Dirk S. Schmeller, Patrícia Tiago, and Chris A.M. van Swaay. Global biodiversity monitoring: From data sources to Essential Biodiversity Variables. Biological Conservation, 213:256–263, 9 2017. [222] Sujeevan Ratnasingham and Paul D N Hebert. bold: The Barcode of Life Data System (http://www.barcodinglife.org). Molecular ecology notes, 7(3):355–364, 5 2007. [223] Sushma Reddy and Liliana M. Dávalos. Geographical sampling bias and its implications for conservation priorities in Africa. Journal of Biogeography, 30(11):1719–1727, 11 2003. [224] Tony Rees. Taxamatch, an Algorithm for Near (’Fuzzy’) Matching of Scientific Names in Taxonomic Databases. PloS one, 9(9):e107510, 1 2014. [225] O J Reichman, Matthew B Jones, and Mark P Schildhauer. Challenges and opportunities of open data in ecology. Science (New York, N.Y.), 331(6018):703–5, 2 2011. [226] Xiaobo Ren, Tony X. Han, and Zhihai He. Ensemble Video Object Cut in Highly Dynamic Scenes. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, pages 1947–1954. IEEE, 6 2013. [227] Santiago Martínez de la Riva, Carmen Quesada, Francisco Pando de la Hoz, Bárbara Ayala-Orozco, Ángela M. Suárez-Mayorga, María Mora, Danny Vélez, Jose Trinidad Mendoza Gaytan, Daniel Lins da Silva, Patricia Koleff, Maria Esther Quintero Rivero, and Aurelio Sanabria. Plinian Core: integrating information about species. In Biodiversity Information Standards - TDWG 2013 Annual Conference, 2013. [228] Robert J Robbins, Linda Amaral-Zettler, Holly Bik, Stan Blum, James Edwards, Dawn Field, George Garrity, Jack A Gilbert, Renzo Kottmann, Leonard Krishtalka, Hilmar Lapp, Carolyn Lawrence, Norman Morrison, Eamonn Ó Tuama, Cynthia Parr, Inigo San Gil, David Schindel, Lynn Schriml, David Vieglas, and John Wooley. RCN4GSC Workshop Report: Managing Data at the Interface of Biodiversity and (Meta)Genomics, March 2011. Standards in genomic sciences, 7(1):159–65, 10 2012. [229] Tim Robertson, Markus Döring, Robert Guralnick, David Bloom, John Wieczorek, Kyle Braak, Javier Otegui, Laura Russell, and Peter Desmet. The GBIF Integrated Publishing Toolkit: Facilitating the Efficient Publishing of Biodiversity Data on the Internet. PLoS ONE, 9(8):e102623, 8 2014.

38 [230] David J. Rogers, Henry S. Fleming, and George Eastbrook. Use of computers in studies of taxonomy and evolution. In Evolutionary Biology, vol. 1, pages 169–196. 1967. [231] Yuri Roskov, Thomas Kunze, L Paglinawan, T Orrell, D Nicolson, Alastair Culham, N Bailly, P Kirk, T Bourgoin, G Baillargeon, and others. Species 2000 & ITIS Catalogue of Life, 2013 Annual Checklist. 2013. [232] Andrea Sánchez-Tapia, Marinez Ferreira de Siqueira, Rafael Oliveira Lima, Felipe Sodré M. Barros, Guilherme M. Gall, Luiz M. R. Gadelha, Luís Alexandre E. da Silva, and Carla Osthoff. Model-R: A Framework for Scalable and Reproducible Ecological Niche Modeling. In High Performance Computing: 4th Latin American Conference, CARLA 2017. Communications in Computer and Information Science, vol. 796, pages 218–232. Springer, 2018. [233] Geir Kjetil Sandve, Anton Nekrutenko, James Taylor, and Eivind Hovig. Ten Simple Rules for Reproducible Computational Research. PLoS Computational Biology, 9(10):e1003285, 10 2013. [234] Lenore Sarasan. Why museum computer projects fail. Museum News, 59(4):40–49, 1981. [235] David E. Schindel and Joseph A. Cook. The next generation of natural history collections. PLOS Biology, 16(7):e2006125, 7 2018. [236] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015. [237] C. L. Schoch, B. Robbertse, V. Robert, D. Vu, G. Cardinali, L. Irinyi, W. Meyer, R. H. Nilsson, K. Hughes, A. N. Miller, P. M. Kirk, K. Abarenkov, M. C. Aime, H. A. Ariyawansa, M. Bidartondo, T. Boekhout, B. Buyck, Q. Cai, J. Chen, A. Crespo, P. W. Crous, U. Damm, Z. W. De Beer, B. T. M. Dentinger, P. K. Divakar, M. Duenas, N. Feau, K. Fliegerova, M. A. Garcia, Z.-W. Ge, G. W. Griffith, J. Z. Groenewald, M. Groenewald, M. Grube, M. Gryzenhout, C. Gueidan, L. Guo, S. Hambleton, R. Hamelin, K. Hansen, V. Hofstetter, S.-B. Hong, J. Houbraken, K. D. Hyde, P. Inderbitzin, P. R. Johnston, S. C. Karunarathna, U. Koljalg, G. M. Kovacs, E. Kraichak, K. Krizsan, C. P. Kurtzman, K.-H. Larsson, S. Leavitt, P. M. Letcher, K. Liimatainen, J.-K. Liu, D. J. Lodge, J. Jennifer Luangsa-ard, H. T. Lumbsch, S. S. N. Maharachchikumbura, D. Manamgoda, M. P. Martin, A. M. Minnis, J.-M. Moncalvo, G. Mule, K. K. Nakasone, T. Niskanen, I. Olariaga, T. Papp, T. Petkovits, R. Pino-Bodas, M. J. Powell, H. A. Raja, D. Redecker, J. M. Sarmiento-Ramirez, K. A. Seifert, B. Shrestha, S. Stenroos, B. Stielow, S.-O. Suh, K. Tanaka, L. Tedersoo, M. T. Telleria, D. Udayanga, W. A. Untereiner, J. Dieguez Uribeondo, K. V. Subbarao, C. Vagvolgyi, C. Visagie, K. Voigt, D. M. Walker, B. S. Weir, M. Weiss, N. N. Wijayawardene, M. J. Wingfield, J. P. Xu, Z. L. Yang, N. Zhang, W.-Y. Zhuang, and S. Federhen. Finding needles in haystacks: linking scientific names, reference specimens and molecular data for Fungi. Database, 2014:bau061– bau061, 6 2014. [238] Ashley Shade and Tracy K. Teal. Computing Workflows for Biologists: A Roadmap. PLOS Biology, 13(11):e1002303, 11 2015. [239] Shobha Sharma, Stacy Ciufo, Elena Starchenko, Dakshesh Darji, Larry Chlumsky, Ilene Karsch-Mizrachi, and Conrad L Schoch. The NCBI BioCollections Database. Database, 2018, 1 2018. [240] Genivaldo Gueiros Z Silva, Daniel A Cuevas, Bas E Dutilh, and Robert A Edwards. FOCUS: an alignment-free model to identify organisms in metagenomes using non-negative least squares. PeerJ, 2:e425, 2014. [241] Genivaldo Gueiros Z. Silva, Kevin T. Green, Bas E. Dutilh, and Robert A. Edwards. SUPER-FOCUS: a tool for agile functional analysis of shotgun metagenomic data. Bioinformatics, 32(3):354–361, 2 2016. [242] Luís Alexandre Estevão da Silva, Claudio Nicoletti de Fraga, Thaís Moreira Hidalgo de Almeida, Marcos Gonzalez, Rafael Oliveira Lima, Mônica Sousa da Rocha, Ernani Bellon, Rafael da Silva Ribeiro, Felipe Alves de Oliveira, Leonardo da Silva Clemente, Ulises Rodrigo Magdalena, Erika von Sohsten Medeiros, Rafaela Campostrini Forzza, Luís Alexandre Estevão da Silva, Claudio Nicoletti de Fraga, Thaís Moreira Hidalgo de Almeida, Marcos Gonzalez, Rafael Oliveira Lima, Mônica Sousa da Rocha, Er- nani Bellon, Rafael da Silva Ribeiro, Felipe Alves de Oliveira, Leonardo da Silva Clemente, Ulises Rodrigo Magdalena, Erika von Sohsten Medeiros, and Rafaela Campostrini Forzza. Jabot - Sistema de Gerenciamento de Coleções Botânicas: a experiência de uma década de desenvolvimento e avanços. Rodriguésia, 68(2):391–410, 6 2017. [243] P. C. Silva. Machine Data Processing and Plant Taxonomy. Science, 152(3724):934–935, 5 1966. [244] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. 9 2014. [245] Marinez Ferreira de Siqueira and Andrew Townsend Peterson. Consequences of global climate change for geographic distributions of cerrado tree species. Biota Neotropica, 3(2):1–14, 2003. [246] A. C. Smith. Advice to Administrators of Systematic Collections. Taxon, 15(6):201–205, 7 1966. [247] Vincent S Smith, Simon D Rycroft, Irina Brake, Ben Scott, Edward Baker, Laurence Livermore, Vladimir Blagoderov, and David Roberts. Scratchpads 2.0: a Virtual Research Environment supporting scholarly collaboration, communication and data publica- tion in biodiversity science. ZooKeys, 150(150):53–70, 1 2011. [248] Jorge Soberón and A Townsend Peterson. Biodiversity informatics: managing and applying primary biodiversity data. Philosoph- ical transactions of the Royal Society of London. Series B, Biological sciences, 359(1444):689–98, 4 2004. [249] Jorge M. Soberón. Niche and area of distribution modeling: a population ecology perspective. Ecography, 33(1):159–167, 2 2010. [250] Robert R. Sokal. Numerical Taxonomy. Scientific American, 215(6):106–116, 12 1966. [251] Robert R. Sokal and P. H. A. Sneath. Efficiency in Taxonomy. Taxon, 15(1):1–21, 1 1966. [252] Mauro Enrique Souza Muñoz, Renato Giovanni, Marinez Ferreira Siqueira, Tim Sutton, Peter Brewer, Ricardo Scachetti Pereira, Dora Ann Lange Canhos, and Vanderlei Perez Canhos. openModeller: a generic approach to species’ potential distribution modelling. GeoInformatica, 15(1):111–135, 8 2009.

39 [253] Eva M. Spehn and Christian Korner. Data mining for global trends in mountain biodiversity. CRC Press, 2009. [254] Donald F. Squires. Data Processing and Museum Collections: A Problem for the Present. Curator: The Museum Journal, 9(3):216– 227, 9 1966. [255] Johannes Starlinger, Bryan Brancotte, Sarah Cohen-Boulakia, and Ulf Leser. Similarity search for scientific workflows. Proceedings of the VLDB Endowment, 7(12):1143–1154, 8 2014. [256] Johannes Starlinger, Sarah Cohen-Boulakia, Sanjeev Khanna, Susan B. Davidson, and Ulf Leser. Effective and efficient similarity search in scientific workflow repositories. Future Generation Computer Systems, 56:584–594, 3 2016. [257] Zachary D. Stephens, Skylar Y. Lee, Faraz Faghri, Roy H. Campbell, Chengxiang Zhai, Miles J. Efron, Ravishankar Iyer, Michael C. Schatz, Saurabh Sinha, and Gene E. Robinson. Big Data: Astronomical or Genomical? PLOS Biology, 13(7):e1002195, 7 2015. [258] MY Stockle and PDN Hebert. Barcode of life. Scientific American, 2008. [259] Karen I. Stocks, Nancy Jacobsen Stout, and Timothy M. Shank. Information Management Strategies for Deep-Sea Biology. In Malcolm R. Clark, Mireille Consalvey, and Ashley A. Rowden, editors, Biological Sampling in the Deep Sea, pages 368–385. Wiley Blackwell, 2016. [260] David Stockwell. The GARP modelling system: problems and solutions to automated spatial prediction. International Journal of Geographical Information Science, 13(2):143–158, 3 1999. [261] Carly Strasser, Stephen Abrams, and Patricia Cruse. DMPTool 2: Expanding Functionality for Better Data Management Planning. International Journal of Digital Curation, 9(1):324–330, 2014. [262] A Sugden and E Pennisi. Diversity digitized. Science (New York, N.Y.), 289(5488):2305, 9 2000. [263] Brian L. Sullivan, Jocelyn L. Aycrigg, Jessie H. Barry, Rick E. Bonney, Nicholas Bruns, Caren B. Cooper, Theo Damoulas, André A. Dhondt, Tom Dietterich, Andrew Farnsworth, Daniel Fink, John W. Fitzpatrick, Thomas Fredericks, Jeff Gerbracht, Carla Gomes, Wesley M. Hochachka, Marshall J. Iliff, Carl Lagoze, Frank A. La Sorte, Matthew Merrifield, Will Morris, Tina B. Phillips, Mark Reynolds, Amanda D. Rodewald, Kenneth V. Rosenberg, Nancy M. Trautmann, Andrea Wiggins, David W. Winkler, Weng-Keen Wong, Christopher L. Wood, Jun Yu, and Steve Kelling. The eBird enterprise: An integrated approach to development and application of citizen science. Biological Conservation, 169:31–40, 1 2014. [264] Colin Talbert, Marian Talbert, Jeff Morisette, and David Koop. Data Management Challenges in Species Distribution Modeling. IEEE Bulletin of the Technical Committee on Data Engineering, 36(4):31–40, 2013. [265] Diethard Tautz, Peter Arctander, Alessandro Minelli, Richard H. Thomas, and Alfried P. Vogler. A plea for DNA taxonomy. Trends in Ecology & Evolution, 18(2):70–74, 2 2003. [266] Biodiversity Information Standards TDWG. TDWG History. http://www.tdwg.org/about-tdwg/history/, 2017. [267] Elisa Thébault. Identifying compartments in presence-absence matrices and bipartite networks: insights into modularity mea- sures. Journal of Biogeography, 40(4):759–768, 4 2013. [268] Chris D. Thomas, Alison Cameron, Rhys E. Green, Michel Bakkenes, Linda J. Beaumont, Yvonne C. Collingham, Barend F. N. Erasmus, Marinez Ferreira de Siqueira, Alan Grainger, Lee Hannah, Lesley Hughes, Brian Huntley, Albert S. van Jaarsveld, Guy F. Midgley, Lera Miles, Miguel A. Ortega-Huerta, A. Townsend Peterson, Oliver L. Phillips, and Stephen E. Williams. Extinction risk from climate change. Nature, 427(6970):145–148, 1 2004. [269] Eamonn Ó Tuama, John Deck, Gabriel Dröge, Markus Döring, Dawn Field, Renzo Kottmann, Juncai Ma, Hiroshi Mori, Norman Morrison, Peter Sterk, Hideaki Sugawara, John Wieczorek, Linhuan Wu, and Pelin Yilmaz. Meeting Report: Hackathon-Workshop on Darwin Core and MIxS Standards Alignment (February 2012). Standards in genomic sciences, 7(1):166–70, 10 2012. [270] Ayesha I. T. Tulloch, Iadine Chadès, Yann Dujardin, Martin J. Westgate, Peter W. Lane, and David Lindenmayer. Dynamic species co-occurrence networks require dynamic biodiversity surrogates. Ecography, 39(12):1185–1196, 12 2016. [271] Woody Turner, Sacha Spector, Ned Gardiner, Matthew Fladeland, Eleanor Sterling, and Marc Steininger. Remote sensing for biodiversity science and conservation. Trends in Ecology & Evolution, 18(6):306–314, 6 2003. [272] Carmen Ulloa Ulloa, Pedro Acevedo-rodríguez, Stephan Beck, Manuel J Belgrano, Rodrigo Bernal, Paul E Berry, Lois Brako, Marcela Celis, Gerrit Davidse, Rafaela C. Forzza, S. Robbert Gradstein, Omaira Hokche, Blanca León, Susana León-yánez, Robert E Magill, David A Neill, Michael Nee, Peter H Raven, Heather Stimmel, Mark T Strong, José L Villaseñor, James L Zarucchi, Fernando O Zuloaga, and Peter M Jørgensen. An integrated assessment of the vascular plant species of the Americas. Science, 358(6370):1–5, 2017. [273] Sara Varela, Robert P. Anderson, Raúl García-Valdés, and Federico Fernández-González. Environmental filters reduce the effects of sampling bias and improve predictions of ecological niche models. Ecography, pages no–no, 1 2014. [274] Allan Koch Veiga, Antonio Mauro Saraiva, Arthur David Chapman, Paul John Morris, Christian Gendreau, Dmitry Schigel, and Tim James Robertson. A conceptual framework for quality assessment and management of biodiversity data. PLOS ONE, 12(6):e0178731, 6 2017. [275] Saverio Vicario, Bachir Balech, Giacinto Donvito, Pasquale Notarangelo, and Graziano Pesole. The BioVel Project: Robust phylo- genetic workflows running on the GRID, 11 2012. [276] Dale H. Vitt, Diana G. Horton, and Peter Johnston. A Computer Program for Printing Herbarium Labels. Taxon, 26(4):425, 8 1977.

40 [277] Mattias De Wael, Stefan Marr, Bruno De Fraine, Tom Van Cutsem, and Wolfgang De Meuter. Partitioned Global Address Space Languages. ACM Computing Surveys (CSUR), 47(4):62, 5 2015. [278] Robert A. Wagner and Roy Lowrance. An Extension of the String-to-String Correction Problem. Journal of the ACM, 22(2):177– 183, 4 1975. [279] Ramona L. Walls, John Deck, Robert Guralnick, Steve Baskauf, Reed Beaman, Stanley Blum, Shawn Bowers, Pier Luigi Buttigieg, Neil Davies, Dag Endresen, Maria Alejandra Gandolfo, Robert Hanner, Alyssa Janning, Leonard Krishtalka, Andréa Matsunaga, Peter Midford, Norman Morrison, Éamonn Ó. Tuama, Mark Schildhauer, Barry Smith, Brian J. Stucky, Andrea Thomer, John Wieczorek, Jamie Whitacre, and John Wooley. Semantics in Support of Biodiversity Knowledge Discovery: An Introduction to the Biological Collections Ontology and Related Ontologies. PLoS ONE, 9(3):e89606, 3 2014. [280] Clifford M. Wetmore. Herbarium Computerization at the University of Minnesota. Systematic Botany, 4(4):339, 1979. [281] Richard J. White and Robert Allkin. Language for the definition and exchange of biological data sets. Mathematical and Computer Modelling, 16(6-7):199–223, 6 1992. [282] John Wieczorek, David Bloom, Robert Guralnick, Stan Blum, Markus Döring, Renato Giovanni, Tim Robertson, and David Vieglais. Darwin Core: an evolving community-developed biodiversity data standard. PloS one, 7(1):e29715, 1 2012. [283] J. A. Wiens, D. Stralberg, D. Jongsomjit, C. A. Howell, and M. A. Snyder. Niches, models, and climate change: Assessing the assumptions and uncertainties. Proceedings of the National Academy of Sciences, 106(Supplement_2):19729–19736, 11 2009. [284] Andreas Wilke, Jared Bischof, Wolfgang Gerlach, Elizabeth Glass, Travis Harrison, Kevin P. Keegan, Tobias Paczian, William L. Trimble, Saurabh Bagchi, Ananth Grama, Somali Chaterji, and Folker Meyer. The MG-RAST metagenomics database and portal in 2015. Nucleic Acids Research, 44(D1):D590–D594, 1 2016. [285] Greg Wilson, D A Aruliah, C Titus Brown, Neil P Chue Hong, Matt Davis, Richard T Guy, Steven H D Haddock, Kathryn D Huff, Ian M Mitchell, Mark D Plumbley, Ben Waugh, Ethan P White, and Paul Wilson. Best practices for scientific computing. PLoS biology, 12(1):e1001745, 1 2014. [286] John C Wooley, Adam Godzik, and Iddo Friedberg. A primer on metagenomics. PLoS computational biology, 6(2):e1000667, 2 2010. [287] Katherine Yelick, Parry Husbands, Costin Iancu, Amir Kamil, Rajesh Nishtala, Jimmy Su, Michael Welcome, Tong Wen, Dan Bonachea, Wei-Yu Chen, Phillip Colella, Kaushik Datta, Jason Duell, Susan L. Graham, Paul Hargrove, and Paul Hilfinger. Pro- ductivity and performance using partitioned global address space languages. In Proceedings of the 2007 international workshop on Parallel symbolic computation - PASCO ’07, page 24, New York, New York, USA, 2007. ACM Press. [288] Chris Yesson, Peter W Brewer, Tim Sutton, Neil Caithness, Jaspreet S Pahwa, Mikhaila Burgess, W Alec Gray, Richard J White, Andrew C Jones, Frank A Bisby, and Alastair Culham. How global is the global biodiversity information facility? PloS one, 2(11):e1124, 1 2007. [289] Pelin Yilmaz, Renzo Kottmann, Dawn Field, Rob Knight, James R Cole, Linda Amaral-Zettler, Jack A Gilbert, Ilene Karsch- Mizrachi, Anjanette Johnston, Guy Cochrane, Robert Vaughan, Christopher Hunter, Joonhong Park, Norman Morrison, Philippe Rocca-Serra, Peter Sterk, Manimozhiyan Arumugam, Mark Bailey, Laura Baumgartner, Bruce W Birren, Martin J Blaser, Vivien Bonazzi, Tim Booth, Peer Bork, Frederic D Bushman, Pier Luigi Buttigieg, Patrick S G Chain, Emily Charlson, Elizabeth K Costello, Heather Huot-Creasy, Peter Dawyndt, Todd DeSantis, Noah Fierer, Jed A Fuhrman, Rachel E Gallery, Dirk Gevers, Richard A Gibbs, Inigo San Gil, Antonio Gonzalez, Jeffrey I Gordon, Robert Guralnick, Wolfgang Hankeln, Sarah Highlander, Philip Hugenholtz, Janet Jansson, Andrew L Kau, Scott T Kelley, Jerry Kennedy, Dan Knights, Omry Koren, Justin Kuczynski, Nikos Kyrpides, Robert Larsen, Christian L Lauber, Teresa Legg, Ruth E Ley, Catherine A Lozupone, Wolfgang Ludwig, Donna Lyons, Eamonn Maguire, Barbara A Methé, Folker Meyer, Brian Muegge, Sara Nakielny, Karen E Nelson, Diana Nemergut, Josh D Neufeld, Lindsay K New- bold, Anna E Oliver, Norman R Pace, Giriprakash Palanisamy, Jörg Peplies, Joseph Petrosino, Lita Proctor, Elmar Pruesse, Chris- tian Quast, Jeroen Raes, Sujeevan Ratnasingham, Jacques Ravel, David A Relman, Susanna Assunta-Sansone, Patrick D Schloss, Lynn Schriml, Rohini Sinha, Michelle I Smith, Erica Sodergren, Aymé Spo, Jesse Stombaugh, James M Tiedje, Doyle V Ward, George M Weinstock, Doug Wendel, Owen White, Andrew Whiteley, Andreas Wilke, Jennifer R Wortman, Tanya Yatsunenko, and Frank Oliver Glöckner. Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications. Nature biotechnology, 29(5):415–20, 5 2011. [290] Ellis L. Yochelson. Nomenclature in the Machine Age. Systematic Zoology, 15(1):88–91, 3 1966. [291] Matei Zaharia, Michael J. Franklin, Ali Ghodsi, Joseph Gonzalez, Scott Shenker, Ion Stoica, Reynold S. Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, and Shivaram Venkataraman. Apache Spark: a unified engine for big data processing. Communications of the ACM, 59(11):56–65, 10 2016. [292] Katarzyna Zaremba-Niedzwiedzka, Eva F. Caceres, Jimmy H. Saw, Disa Bäckström, Lina Juzokaite, Emmelien Vancaester, Ki- ley W. Seitz, Karthik Anantharaman, Piotr Starnawski, Kasper U. Kjeldsen, Matthew B. Stott, Takuro Nunoura, Jillian F. Banfield, Andreas Schramm, Brett J. Baker, Anja Spang, and Thijs J. G. Ettema. Asgard archaea illuminate the origin of eukaryotic cellular complexity. Nature, 541(7637):353–358, 1 2017.

41