GLOBAL BIODIVERSITY INFORMATION FACILITY

Building and Utilizing the GBIF Online “Metacollection”
The Butterflies of Canada: A Case Study

Larry Speers
GMBA Data Mining Workshop, Kazbegi, Georgia, July 25-28, 2006
WWW.GBIF.ORG

“Taken collectively, the plant and animal specimens in the world’s museum collections, combined with recent observational and monitoring data, provide our most complete picture of the biological diversity of the planet.”

History of Collections Development

• Collection growth has not been strategic but has been dependent on the:
  • Taxonomic interests of individual staff members present at any particular time
  • Changing emphasis and interests of funding agencies
  • Opportunities for staff to participate in various collecting activities/expeditions
  • National interests - changing political situations
  • Personal networks of individual staff members for the exchange of material

Impact of Historic Development

• Critical material is often not located in local collections.
• It is impossible to predict the holdings of any collection, whether taxonomically, temporally or geographically.
• Each collection has only a small portion of the relevant material that is needed to address most scientific questions.

Building the Online “Metacollection”

GBIF is an international scientific co-operative project based on a multilateral agreement (MoU) between countries, economies and international organisations, dedicated to:
• establishing an interoperable, distributed network of databases containing scientific biodiversity information
• with initial focus on species- and specimen-level data,
• with links to molecular, genetic and ecosystems levels

GBIF's role in biodiversity information networks

1. The GBIF Network handles primary biodiversity data

Taxonomic Names
Class: Insecta
Order: Lepidoptera
Family: Pyralidae
Genus: Ostrinia Hübner, 1825
Species: Ostrinia nubilalis (Hübner, 1796)
Synonym: Pyralis nubilalis Hübner, 1796
Vernacular (EN): European Corn-borer
Vernacular (DE): Maiszünsler
Vernacular (ES): Piral del maíz
Vernacular (FR): Pyrale du maïs

Sequence Data
Locus: AAL35331
Definition: acyl-CoA Z/E11 desaturase
1 mvpyattadg hpekdecfed...

Taxonomic Descriptions
Diagnosis: Wingspan 26-30mm; sexually dimorphic; male: forewings ochreous to dark brown; female: forewings pale yellow; …

Digital Literature and Web Resources
Pheromones of Ostrinia
http://www.nysaes.cornell.edu/fst/faculty/acree/pheronet/phlist/ostrinia.html

Ecological Interactions
Foodplant: Zea mais L. 1753
Family: Gramineae

Specimens and Observations
Collection: DGH Lepidoptera
Record id: DGHEUR_003217
Country: France
Coordinates: 03.047˚E 48.730˚N
Date: 28 June 2003
Collector: Donald Hobern

Abiotic Data
Average Rainfall
Location: 48.82°N 2.29°E
Jan Feb Mar Apr ...
182.3 120.6 158.1 204.9 ...

2. Users and applications need data structured according to standards

Observation record formatted using the Darwin Core:

2003-06-08; DGH; DGH Lepidoptera; DGHEUR_0002976;
Dichomeris marginella (Fabricius, 1781); O;
Animalia; Lepidoptera; Gelechiidae; Dichomeris marginella; (Fabricius, 1781);
Donald Hobern; Donald Hobern;
2003; 06; 08;
Europe; Denmark; Gentofte Amt; Merianvej, Hellerup;
12.538; 55.737; 100;
1; 1; in Skinner trap

3. Web services support the exchange of structured data

[Diagram labels: Heterogeneous Databases; Standardised Web Services; Structured Data; Internet; Users]
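As a rough illustration of what "structured data" exchanged by a web service might look like, the sketch below serialises the observation record from the Darwin Core slide to XML and reads it back. The element names are assumptions (they follow modern Darwin Core style terms, since the tags used on the original slide are not available), and `to_xml` / `from_xml` are hypothetical helper names, not part of any GBIF tool.

```python
import xml.etree.ElementTree as ET

# Values come from the observation record on the Darwin Core slide.
# ASSUMPTION: the element names below are modern Darwin Core style terms,
# chosen for illustration; the slide's own tag names are not known.
record = {
    "institutionCode": "DGH",
    "collectionCode": "DGH Lepidoptera",
    "catalogNumber": "DGHEUR_0002976",
    "scientificName": "Dichomeris marginella (Fabricius, 1781)",
    "kingdom": "Animalia",
    "order": "Lepidoptera",
    "family": "Gelechiidae",
    "recordedBy": "Donald Hobern",
    "eventDate": "2003-06-08",
    "country": "Denmark",
    "locality": "Merianvej, Hellerup",
    "decimalLongitude": "12.538",
    "decimalLatitude": "55.737",
}


def to_xml(rec, root_tag="record"):
    """Serialise a flat field -> value mapping as one XML element per field."""
    root = ET.Element(root_tag)
    for name, value in rec.items():
        ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")


def from_xml(text):
    """Parse the XML produced by to_xml back into a dictionary."""
    return {child.tag: child.text for child in ET.fromstring(text)}


xml_text = to_xml(record)
print(xml_text)
```

Because every provider emits the same element names, a consumer can parse records from different databases with the same code, which is the point the slide makes about standards.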

4. GBIF Data Nodes provide biodiversity web services

Museum A
• Specimens: Flowering Plants of Africa
• Specimens: Proteaceae of the World
• Taxon Names: Proteaceae of the World

Observer Network B
• Observations: Birds of Central America
• Observations: Butterflies of Belize
• Checklist: Birds of Belize

Museum C
• Specimens: Mammals of North Europe
• Taxon Names: Mammals of the World
• Further Links: Mammals

University D
• Specimens: Bacteria Cultures
• Taxon Names: Bacteria
• Further Links: Bacteria

[Diagram: the four Data Nodes above arranged around the GBIF Network]

5. The GBIF Network maintains a central registry of Data Nodes

Data Node           Type of data          Taxon             Region           Records
Museum A            Specimen/Observation  Flowering Plants  Africa            327000
                    Specimen/Observation  Proteaceae        World              23000
                    Taxonomic Names       Proteaceae        World               1500
Observer Network B  Specimen/Observation  Birds             Central America    68500
                    Specimen/Observation  Butterflies       Belize              4200
                    Name List             Birds             Belize               587
Museum C            Specimen/Observation  Mammals           North Europe        1800
                    Taxonomic Names       Mammals           World               8000
                    General Resources     Mammals           World                600
University D        Specimen/Observation  Bacteria          World               1200
                    Taxonomic Names       Bacteria          World               5000
                    General Resources     Bacteria          World                400

6. GBIF maintains an index to biodiversity data
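A registry like the one in section 5 can be pictured as a list of entries that an index or portal filters to decide which Data Nodes to ask. The following is a minimal sketch under that assumption; `RegistryEntry`, `find_entries` and `nodes_holding` are hypothetical names, and the rows are the example figures from the registry table, not real GBIF holdings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegistryEntry:
    node: str
    data_type: str
    taxon: str
    region: str
    records: int


# Example rows copied from the registry table in section 5.
REGISTRY = [
    RegistryEntry("Museum A", "Specimen/Observation", "Flowering Plants", "Africa", 327000),
    RegistryEntry("Museum A", "Specimen/Observation", "Proteaceae", "World", 23000),
    RegistryEntry("Museum A", "Taxonomic Names", "Proteaceae", "World", 1500),
    RegistryEntry("Observer Network B", "Specimen/Observation", "Birds", "Central America", 68500),
    RegistryEntry("Observer Network B", "Specimen/Observation", "Butterflies", "Belize", 4200),
    RegistryEntry("Observer Network B", "Name List", "Birds", "Belize", 587),
    RegistryEntry("Museum C", "Specimen/Observation", "Mammals", "North Europe", 1800),
    RegistryEntry("Museum C", "Taxonomic Names", "Mammals", "World", 8000),
    RegistryEntry("Museum C", "General Resources", "Mammals", "World", 600),
    RegistryEntry("University D", "Specimen/Observation", "Bacteria", "World", 1200),
    RegistryEntry("University D", "Taxonomic Names", "Bacteria", "World", 5000),
    RegistryEntry("University D", "General Resources", "Bacteria", "World", 400),
]


def find_entries(registry, taxon=None, region=None, data_type=None):
    """Return registry entries matching every criterion that is not None."""
    matches = []
    for entry in registry:
        if taxon is not None and entry.taxon != taxon:
            continue
        if region is not None and entry.region != region:
            continue
        if data_type is not None and entry.data_type != data_type:
            continue
        matches.append(entry)
    return matches


def nodes_holding(registry, **criteria):
    """Return the sorted names of Data Nodes with at least one matching entry."""
    return sorted({entry.node for entry in find_entries(registry, **criteria)})


print(nodes_holding(REGISTRY, taxon="Mammals"))
print(sum(e.records for e in find_entries(REGISTRY, region="Belize")))
```

Exact string matching keeps the sketch short; a real index would also need to resolve taxon names and nested regions, which is where the name services mentioned in the slides would come in.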

[Diagram labels: User requests; GBIF Data Nodes; Biodiversity Data Access; Biodiversity Data Index; Taxonomic Name Service (ECAT); Catalogue of Life; Specimen Data; Observation Data; Name Lists; Links to other data]

7. The GBIF Portal offers a gateway to data

Request to the GBIF Portal: Show specimen records for Erinaceus europaeus

Responses from Data Nodes: 6 records; 35 records; 17 records; 0 records

GBIF Portal result: 58 records:
1. Museum A Paris
2. Museum A Nice
3. Museum A Paris
4. Museum A Avignon
5. Museum A Avignon
6. Museum A Marseille
7. Observer B Norwich
8. Observer B Norwich
9. Observer B Southampton
. . .

8. GBIF Participant Nodes can offer tailored information

Request to the GBIF France Portal: Show occurrence of Hérisson d’Europe

Geographic Services: Show specimen records for Erinaceus europaeus from France

GBIF: 58 GBIF records:
1. Museum A Paris
2. Museum A Nice
3. Museum A Paris
4. Museum A Avignon
5. Museum A Avignon
6. Museum A Marseille
7. Observer B Norwich
8. Observer B Norwich
9. Observer B Southampton
. . .
58. Museum C Toulouse

GBIF France Portal: 26 records:
1. Museum A Paris
2. Museum A Nice
3. Museum A Paris
4. Museum A Avignon
5. Museum A Avignon
6. Museum A Marseille
23. Observer B Calais
29. Observer B Paris
. . .
58. Museum C Toulouse

Growth in Data Sharing, Feb 2004 - Jul 2005

[Chart: Records (in millions) and Providers, by month, Feb-04 to Jul-05]

All GBIF tools are…

• Open source, open access
• Free
• Supported (helpdesk available)
• See www.gbif.org/serv/gbif-tools

Growth rate actually needed…

Actual and Needed Data Growth Rates

[Chart: Providers, Potential Providers, Records and Potential Records over time; axes: Data Providers and Data Records (in millions)]

Butterflies of Canada Project

Lessons Learned


What do we mean by ‘Data Quality’?

An essential or distinguishing characteristic necessary for [spatial] data to be fit for use. SDTS 02/92

The general intent of describing the quality of a particular dataset or record is to describe the fitness of that dataset or record for a particular use that one may have in mind for the data. Chrisman 1991

Slide compliments A. Chapman, Amsterdam, May 2004
[Photo: Jellyfish, San Diego, USA]

What do we mean by “fitness for use”?

Fitness for use
– Does species ‘x’ occur in Tasmania?
– Does species ‘x’ occur in National Park ‘y’?

[Photo: Tierra del Fuego, Argentina]

Error

Error is inescapable and it should be recognised as a fundamental dimension of data. Chrisman 1991

[Photo: Bolax gummifera, Argentina]

Geographic outliers - GIS

• Country, State, named district, etc.
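A simple version of this check compares a record's coordinates with the area named in its Country field. The sketch below is illustrative only: the bounding boxes are rough values assumed for the example (a real check would use proper boundary polygons in a GIS), and `check_record` is a hypothetical helper. The two "ok" coordinates are the France and Denmark records shown earlier in these slides.

```python
# ASSUMPTION: rough, illustrative bounding boxes as
# (min_lon, min_lat, max_lon, max_lat) in decimal degrees.
COUNTRY_BOXES = {
    "Denmark": (8.0, 54.5, 15.3, 57.8),
    "France": (-5.2, 41.3, 9.6, 51.2),
}


def inside_box(box, lon, lat):
    """True if the point lies within the bounding box (edges included)."""
    min_lon, min_lat, max_lon, max_lat = box
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


def check_record(country, lon, lat, boxes=COUNTRY_BOXES):
    """Classify a geocode against the country named in the same record."""
    box = boxes.get(country)
    if box is None:
        return "unknown country"
    if inside_box(box, lon, lat):
        return "ok"
    # A common data-entry error: latitude and longitude entered the wrong way round.
    if inside_box(box, lat, lon):
        return "possible lat/lon swap"
    return "outlier"


print(check_record("Denmark", 12.538, 55.737))
print(check_record("Denmark", 55.737, 12.538))
print(check_record("France", 12.538, 55.737))
```

A bounding box is deliberately coarse: it flags gross errors cheaply, and records that pass can still be tested against State or named-district boundaries.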

Gazetteer of Brazilian localities

Geographic outliers - GIS

[Photo: Asplenium bulbiferum, New Zealand]

Data Error – leads to Uncertainty

• Species
  – Names
  – Geocode
  – Altitude
  – Collectors
  – Dates

[Photo: Crab, Florianopolis, Brazil]

Reducing error

1. Error prevention
2. Error detection

Methods for geocode validation

• Internal Database Checks
• External Database Checks
• Outliers in Geographic Space - GIS
• Outliers in Environmental Space - Models
• Statistical outliers

[Photo: Butterfly, Florida, USA]

Internal/External Database Checks

• Logical inconsistencies within the database
• Checking one field against another
  – Text location vs geocode or District/State
• Checking one database against another
  – Gazetteers
  – DEM
  – Collectors

[Photo: Magellanic Penguin, Argentina]

Acacia orites - 19 records - 9 Temperature parameters

[Chart: Temperature (C), axis 0 to 35, plotted for the temperature parameters]

Reverse Jack-knife

[Photo: Acacia dealbata, Australia]

Outliers in climate space

(T=0.95(√n)+0.2) where ‘n’ is the number of records
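To make the formula concrete, the short sketch below computes T for a given number of records. Only the expression for T comes from the slide; the function names are hypothetical, and the test statistic that is compared against T is not given here, so it is passed in as a plain number rather than computed.

```python
import math


def critical_value(n):
    """T = 0.95 * sqrt(n) + 0.2, where n is the number of records (from the slide)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 0.95 * math.sqrt(n) + 0.2


def exceeds_critical_value(statistic, n):
    """True if an outlier statistic is larger than T for n records.

    ASSUMPTION: how the statistic is computed is outside this sketch;
    the slide gives only the threshold.
    """
    return statistic > critical_value(n)


# For the 19 Acacia orites records mentioned on the earlier slide:
print(round(critical_value(19), 2))
```

For the 19 Acacia orites records this gives a T of about 4.34. Because T grows with the square root of n, larger samples need a more extreme value before a record is flagged.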

[Photo: Acacia dealbata, Australia]

What can you do for the Metacollection?

• Help develop the global online ‘metacollection’ by sharing your data through the GBIF network
• Ensure that newly collected data is properly geo-referenced
• Review and modify your current data management practices to ensure that future research is not burdened with the same inefficiencies that impede present-day researchers
