2013 Rubenstein Research Fellows

Papilionoidea of the World: Evaluation and validation of EOL and BHL data for Hesperiidae

JR Ferrer-Paris, AY Sánchez-Mercado, C Lozano, L Zambrano, J Soto, J Baettig and P Ortega

Centro de Estudios Botánicos y Agroforestales, Instituto Venezolano de Investigaciones Científicas

Report EOLR.r.2013.10, available at the PoW home page. Version of 13 November 2013. CC BY-NC 3.0, some rights reserved.

Abstract

We evaluate the representativeness of two open data sources for the butterfly family Hesperiidae, which comprises almost 20 % of the known species of butterflies (Papilionoidea). First we built a taxonomic checklist from available information and ordered the species lists according to a preliminary phylogeny. Checklists are based on the most up-to-date and complete synonymic lists and catalogues available in public sources, and phylogenies are based on approximate phylogenies for several clades within the family. For each species we retrieved all available text data objects from the Encyclopedia of Life (EOL) and all pages from the Biodiversity Heritage Library (BHL). We then analysed the distribution of data objects, pages and records per species and the representativeness of each data source across the phylogeny, and compared them with the results obtained for other families. Hesperiidae are poorly represented in both sources: available content was generally less abundant and less rich than for Papilionidae and Pieridae, and not much better than for Riodinidae. Differences between hesperiid subfamilies were evident but not extreme, with Eudaminae, Coeliadinae and Trapezitinae slightly better represented. In EOL, the main contributors are phylogenetic and molecular data providers, and fewer data objects deal with biology and ecology. In BHL many species names have matching pages, but biological information is more difficult to locate due to high rates of false positives. Targeted manual search and validation of EOL data objects provided an important set of complementary hostplant records, including records for 64 additional hesperiid species and 141 additional hostplant species, which represents an increase of between 6 and 8 % in the compilation of hostplant associations for the family.

1. Checklist and phylogeny

First we built a taxonomic checklist and ordered the species lists according to an approximate phylogeny of the Hesperiidae. This family represents almost 20 % of the known species of Papilionoidea [3]. Details for this section are available at the PoW home page under A working checklist of butterfly species.

For a long time, Hesperiidae was considered a superfamily of its own and a sister group to all the Papilionoidea [6], but it is now considered part of the extended Papilionoidea clade [4]. There is no single complete checklist for all Hesperiidae of the world, so we combined regional checklists for the Neotropical, Nearctic, Afrotropical and Australian faunas with information from other sources for the rest of the world [7, 8, 9, 10, 14]. Our provisional compilation includes 3992 species.

Seven subfamilies are recognized within Hesperiidae [1]: the Euschemoninae, Coeliadinae and Eudaminae take basal positions, while the closely related Hesperiinae, Trapezitinae and Heteropterinae form the most derived clade, and the Pyrginae are placed in an intermediate position.

> library(ape)  # provides read.tree() and the plot method for trees
> plot(read.tree(text="(Coeliadinae,(Euschemoninae,(Eudaminae,(Pyrginae,(Heteropterinae,(Trapezitinae,Hesperiinae))))));"))

[Figure: cladogram of the seven subfamilies of Hesperiidae. Tip labels, from top to bottom: Hesperiinae, Trapezitinae, Heteropterinae, Pyrginae, Eudaminae, Euschemoninae, Coeliadinae.]

The very diverse and heterogeneous Hesperiinae and Pyrginae represent the main bulk of genera and species:

  Subfamily        Genera  Species
1 Coeliadinae           8       89
2 Eudaminae            50      430
3 Euschemoninae         1        1
4 Hesperiinae         316     2055
5 Heteropterinae       11      182
6 Pyrginae            147     1126
7 Trapezitinae         18       75
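As a quick arithmetic check on the table above, the share of the two largest subfamilies can be computed directly from the counts. This is only a sketch in R: the vector names are ours, and it assumes that the `val` column of the original output holds species counts.

```r
# Genus and species counts per subfamily, copied from the table above.
# Assumption: the 'val' column is the number of species.
genera  <- c(Coeliadinae = 8, Eudaminae = 50, Euschemoninae = 1,
             Hesperiinae = 316, Heteropterinae = 11, Pyrginae = 147,
             Trapezitinae = 18)
species <- c(Coeliadinae = 89, Eudaminae = 430, Euschemoninae = 1,
             Hesperiinae = 2055, Heteropterinae = 182, Pyrginae = 1126,
             Trapezitinae = 75)

big <- c("Hesperiinae", "Pyrginae")

# Percentage of all genera and species that belong to the two subfamilies
share_genera  <- round(100 * sum(genera[big])  / sum(genera), 1)
share_species <- round(100 * sum(species[big]) / sum(species), 1)
```

With these counts, Hesperiinae and Pyrginae together hold about 84 % of the genera and about 80 % of the species listed in the table.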

We use an approximate genus-level phylogeny based on the Tree of Life web page [1], which is itself based mostly on work by Warren et al. [11, 12], and includes 551 genera. Relationships within subfamilies are still unresolved for many genera.

[Figure: genus-level cladogram of Hesperiidae (551 genera), with tips labelled by genus and clades labelled by subfamily: Coeliadinae, Eudaminae, Euschemoninae, Hesperiinae, Heteropterinae, Pyrginae, Trapezitinae.]

The most species-rich genera for the principal subfamilies are:

Subfamily Coeliadinae ::

        Hasora    Bibasis  Coeliades
            40         20         17

Subfamily Eudaminae ::

       Urbanus  Astraptes      Aguna
            36         29         26

Subfamily Pyrginae ::

Celaenorrhinus  Staphylus     Pyrgus
           119         55         48

Subfamily Heteropterinae ::

         Dalla  Metisella     Piruna
            95         23         20

Subfamily Hesperiinae ::

         Halpe   Thoressa  Potanthus
            54         38         37

We use three metrics derived from phylogenetic community analysis [5, 13] to measure the phylogenetic representativeness of data from each source. Phylogenetic species richness (PSR) is related to the number of taxa in a sample (SR), but accounts for the decrease in variance due to phylogenetic relatedness. Phylogenetic species evenness (PSE) is a measure of phylogenetic variability that incorporates the effect of relative species abundance (here, the number of data objects, pages or records). Higher values represent more similar abundances for all taxa, but the maximum value of one is only possible when the species considered are completely unrelated (star phylogeny). The mean pairwise distance (MPD) is the mean phylogenetic distance between two individuals (here data objects, pages or records) taken at random from a sample.
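To make these definitions concrete, here is a minimal base R sketch of PSR and PSE following the formulas of Helmus et al. [5]. The toy correlation matrix, the taxon names and the function names are illustrative assumptions and not the code used for this report; for real analyses the picante package provides `psr()`, `pse()` and `mpd()`.

```r
# Toy phylogenetic correlation matrix for four hypothetical taxa:
# A and B are sister taxa, C and D are sister taxa, and the two pairs
# are unrelated (correlation 0).
C <- matrix(c(1.0, 0.5, 0.0, 0.0,
              0.5, 1.0, 0.0, 0.0,
              0.0, 0.0, 1.0, 0.5,
              0.0, 0.0, 0.5, 1.0),
            nrow = 4, byrow = TRUE,
            dimnames = list(LETTERS[1:4], LETTERS[1:4]))

# Phylogenetic species variability: 1 for a star phylogeny, lower when
# the taxa are more closely related.
psv <- function(C) {
  n <- nrow(C)
  (n * sum(diag(C)) - sum(C)) / (n * (n - 1))
}

# Phylogenetic species richness: number of taxa discounted by relatedness.
psr <- function(C) nrow(C) * psv(C)

# Phylogenetic species evenness: PSV weighted by abundances M
# (here: number of data objects, pages or records per taxon).
pse <- function(C, M) {
  m <- sum(M)
  num <- m * sum(diag(C) * M) - as.numeric(t(M) %*% C %*% M)
  num / (m^2 - mean(M) * m)
}

even   <- c(A = 5, B = 5, C = 5, D = 5)    # equal abundances
uneven <- c(A = 17, B = 1, C = 1, D = 1)   # one dominant taxon

psr(C)           # four taxa count as about 3.33 independent taxa
pse(C, even)     # equals psv(C) when abundances are equal
pse(C, uneven)   # lower, because one taxon dominates
```

In this toy case PSE drops from about 0.83 to 0.3 when a single taxon holds most of the records, which is the kind of imbalance the metric is meant to expose.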

2. Encyclopedia of Life, EOL

We used the EOL API to retrieve information from the Encyclopedia of Life for each species in our checklist. Details about the protocol used are available at the PoW home page under EOL data search. In this version we assume that the corresponding services handle synonymies correctly, and thus we did not retrieve information for alternative names that might be present in some data sources; this would probably be desirable in the future. Many names in our checklist returned matches for several taxon concepts within EOL hierarchies, but some names returned no match in EOL, or returned a name match but no text object.

The percentage of names matched was 88 %, but the percentage of species with one or more text data objects was only 49.3 % for Hesperiidae. These totals are lower than the corresponding values previously found for Papilionidae and Pieridae, but higher than for Riodinidae [2].

We used a log-count ranked plot to compare the distribution of the number of EOL text data objects among species in Hesperiidae with the results for the other three families evaluated. We further divided the ranks by the number of species with data to make the distributions comparable between families. Data points were slightly jittered to allow the visualization of several overlapping cases.

[Figure: log-count ranked plot of EOL text data objects per species (y axis, log scale, 1 to 100) against species rank / total number of species (x axis, 0 to 1) for Papilionidae, Hesperiidae, Pieridae and Riodinidae.]
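The rescaling of ranks described above can be sketched in a few lines of base R. The function name `scaled_rank` and the counts are hypothetical and serve only to illustrate the idea of dividing each rank by the number of species with data.

```r
# Rank species by number of data objects and rescale ranks to (0, 1]
# so that families with different numbers of species can be overlaid.
scaled_rank <- function(counts) {
  counts <- sort(counts[counts > 0], decreasing = TRUE)
  data.frame(rank  = seq_along(counts) / length(counts),
             count = as.vector(counts))
}

# Hypothetical counts of text data objects for two small "families"
fam1 <- scaled_rank(c(60, 12, 5, 2, 1, 1, 0, 0))
fam2 <- scaled_rank(c(8, 3, 1, 1))

# Log-scaled y axis; jitter() separates overlapping points.
plot(jitter(fam1$rank, amount = 0.01), fam1$count, log = "y",
     xlim = c(0, 1),
     xlab = "Species rank / total number of species",
     ylab = "Text data objects per species")
points(jitter(fam2$rank, amount = 0.01), fam2$count, pch = 2)
```

Species without any data object are dropped before ranking, so the x axis always ends at 1 regardless of family size.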

The distribution of information across species follows a log-normal distribution, with a few abundant, common or biologically interesting species having much more available information, and a large number of poorly represented species. This pattern is more striking in Hesperiidae than in the other three families previously evaluated, with very few species having more than 50 text data objects, and more than 50 % of the species having only one data object. Most of the species with more than 50 text data objects are in the Hesperiinae, and one is in the Eudaminae. Eudaminae also have a higher median value than most of the other subfamilies, although most species have fewer than six data objects.

[Figure: number of EOL text data objects per species (log scale, 1 to 100) by subfamily: Pyrginae, Eudaminae, Coeliadinae, Hesperiinae, Trapezitinae, Heteropterinae, Euschemoninae.]

2.1. Quality of available data

We further use two indirect measures of the quality of the data. EOL provides “richness scores” for the content associated with each species; these measure the diversity of content in terms of the variety of sources and the quantity of data, among other things. We also use the size of the data objects (log of text length) as a proxy for quality. Both metrics suggest that the quality of content for Hesperiidae is lower than for Papilionidae and Pieridae, but slightly better than for Riodinidae.

[Figure: two panels comparing Papilionidae, Hesperiidae, Pieridae and Riodinidae: EOL richness score, and length of data objects.]

There were important differences between subfamilies, with Trapezitinae and Eudaminae having on average richer content than the other subfamilies, although the species with the best content come mainly from Hesperiinae, with a few species from Pyrginae and Eudaminae.

[Figure: EOL richness score (log scale) by subfamily: Pyrginae, Eudaminae, Coeliadinae, Hesperiinae, Trapezitinae, Heteropterinae, Euschemoninae.]

2.2. Agents

We also evaluated which sources contribute more information to EOL for Hesperiidae. For this family most data objects are in English, with only modest contributions in other languages.

> table(EOLo.hspr$lan)

  de   en   es   fr   nl
   1 7710   28    4    3

NatureServe and the Barcode of Life (BOLD) are the two most important sources for Hesperiidae and the other three families examined, followed by Wikipedia. INBio and IUCN are very important for other families but not for Hesperiidae, while Discover Life (DL), Tree of Life (ToL), North American Butterfly Knowledge Network (NABKN) and Plant-Caterpillar-Parasitoid Interactions (PCPI) provide more content for Hesperiidae than for other families.

             NatureServe  BOLD  Wikipedia  INBio  Others  IUCN  UAM   DL  ToL  NABKN   BA
Papilionidae         432   550        424    129     102   141   70   37   63     48   12
Hesperiidae         2825  2097       1146      0     286    62  227  290  190    183  102
Pieridae             797   794        484    201      86   180  126   60   38     79  186
Riodinidae           367   407         82    355      64    46    0   26   61     29    0

             PCPI  ADW  ARKive  LepTree  IABIN
Papilionidae   27   99      70        6      1
Hesperiidae   154   67      21       73     28
Pieridae       36   37      49       12     19
Riodinidae     53    0       0       15      0

Within Hesperiidae, NatureServe, BOLD, and several of the other agents contribute more content for Hesperiinae, Pyrginae and Eudaminae, while Wikipedia seems to be the main source for the smaller subfamilies like Coeliadinae, Heteropterinae, and Trapezitinae, but contributes less to Eudaminae.

               NatureServe  BOLD  Wikipedia   DL  Others  UAM  ToL  NABKN  PCPI  BA
Coeliadinae              0    26         42    0       0    0    2      0     0   0
Eudaminae              220   571         50   56      38   14   30      0    38   0
Euschemoninae            0     1          1    0       0    0    2      0     0   0
Hesperiinae           2053   708        540  156     168  137   95    183    65  33
Heteropterinae          76    52        128    4       3    7    2      0     0   0
Pyrginae               469   686        308   73      75   69   53      0    51  59
Trapezitinae             0    51         73    0       1    0    4      0     0   0

               LepTree  ADW  IUCN  IABIN  ARKive  INBio
Coeliadinae          0    0     0      0       0      0
Eudaminae           21   13     0      2       0      0
Euschemoninae        0    0     0      0       0      0
Hesperiinae         28   54     4      8      14      0
Heteropterinae       1    0     2      9       7      0
Pyrginae            23    0    54      9       0      0
Trapezitinae         0    0     2      0       0      0
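Contingency tables like the two above can be produced with `table()`, the same base R function used earlier for the language counts. The sketch below uses a small made-up data frame; the column names `subfamily` and `agent` are assumptions, since the structure of the `EOLo.hspr` object is not described in this report.

```r
# Hypothetical data: one row per EOL text data object
objs <- data.frame(
  subfamily = c("Hesperiinae", "Hesperiinae", "Pyrginae",
                "Pyrginae", "Coeliadinae", "Hesperiinae"),
  agent     = c("NatureServe", "BOLD", "BOLD",
                "NatureServe", "Wikipedia", "NatureServe")
)

# Rows are subfamilies, columns are content providers
tab <- table(objs$subfamily, objs$agent)

# Order the columns by total contribution, largest first
tab <- tab[, order(colSums(tab), decreasing = TRUE)]
tab
```

Ordering the columns by their totals gives the same left-to-right layout as the tables in this section, with the largest contributor first.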

We also mapped the contribution of the different sources (number of text objects) onto the phylogeny of the Hesperiidae. Three resources with a phylogenetic or genetic focus have very good representation along the whole cladogram of the Hesperiidae: ToL, LepTree and BOLD. Wikipedia has some evident gaps in several genera of Hesperiinae, Pyrginae and Pyrrhopyginae. All other sources have rather patchy coverage, with a marked bias toward Pyrgus, Erynnis, Agathymus, Amblyscirtes, Euphyes, Hesperia, Polites and Atrytonopsis.

[Figure: number of EOL text data objects per source mapped onto the genus-level phylogeny of Hesperiidae. Sources shown: DL, Others, UAM, NatureServe, ARKive, NABKN, ADW, BA, IUCN, BOLD, PCPI, LepTree, ToL, Wikipedia, IABIN.]

For Hesperiidae, Wikipedia, BOLD and ToL have a better representation of all phylogenetic groups, with a high number of genera and the highest values of phylogenetic richness, evenness and MPD. Only LepTree achieves a similar level of PSE and MPD with a smaller number of included genera.

Source        SR         PSR       vars        MPD        PSE
BOLD         369  255.879190  5.3765502  1.4381735  0.7210408
Wikipedia    233  160.092163  5.9625074  1.3492748  0.6775453
ToL          168  116.615046  5.2059573  1.3993681  0.7038738
DL            99   66.645417  3.6695095  1.2951248  0.6541702
Others        96   62.715943  3.5855872  1.2125916  0.6126779
NatureServe   65   41.887955  2.6350063  1.0481854  0.5322817
LepTree       62   44.205902  2.5349600  1.4041440  0.7135814
PCPI          37   25.358182  1.6464064  1.3274990  0.6821870
NABKN         28    6.710034  1.3029831  0.3569919  0.1851069
UAM           16    9.395879  0.8275591  1.0985855  0.5859123
IABIN         11    7.513455  0.6257207  1.2215399  0.6718469
IUCN           7    4.806667  0.4671107  0.4919951  0.2869971
BA             7    3.860000  0.4671107  0.8925932  0.5206794
ADW            4    2.018182  0.3631625  0.6375342  0.4250228
ARKive         3    1.583636  0.3440774  0.7038384  0.5278788

2.3. Extracting hostplant records from EOL data objects

Some details of the data extraction protocol for EOL data objects are available at the PoW home page under EOL data validation. We found that only a small proportion of text data objects are dedicated exclusively to butterfly-hostplant associations. In fact only 2 % of the data objects for Hesperiidae refer to Trophic strategy, Hostplant, Associations or Foodplant in their title; similarly low values were found in other families. However, up to 16.5 % of the text data objects for Hesperiidae contained keywords associated with hostplant records. For Hesperiidae there were more matches with keywords like Feed on, Host and Plant, but also with their most common resource, grasses:

Attracted to        Egg       Eier   Fabaceae    Feeding    Feed on
          47        175          1         21         46        445

    Feeds on  Foodplant    grasses       Host Host Plant     Larvae
          16        178        161        307        169        628

      Larval     Legume     Ovipos      Plant     Planta    Poaceae
         127         26         21        647          7         31
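Keyword counts of this kind can be obtained with `grepl()`. The following sketch uses three invented text strings and a handful of the keywords above; the case-insensitive substring matching is an assumption, as the exact matching rules are documented only in the PoW protocol pages.

```r
# Hypothetical text data objects (one string per object)
texts <- c("Larvae feed on grasses (Poaceae).",
           "Host plant: legumes (Fabaceae).",
           "Adults fly from May to July.")

# A few of the keywords listed in the table above
keywords <- c("Feed on", "Host", "Host Plant", "grasses", "Poaceae")

# Number of objects matching each keyword.
# Assumption: case-insensitive substring matching.
hits <- sapply(keywords, function(k)
  sum(grepl(k, texts, ignore.case = TRUE)))

# Share of objects with at least one keyword match
any_hit <- sapply(texts, function(tx)
  any(sapply(keywords, grepl, x = tx, ignore.case = TRUE)))
round(100 * mean(any_hit), 1)
```

With substring matching, any object counted under Host Plant is also counted under Host, which is consistent with the Host count being at least as large as the Host Plant count in the table.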

3. Biodiversity Heritage Library, BHL

We used the BHL API to retrieve information from the Biodiversity Heritage Library for each species in our checklist. Details about the search protocol used are available at the PoW home page under BHL data search. In this version we assume that the corresponding services handle synonyms correctly, and thus we did not retrieve information for alternative names that might be present in some data sources; this would probably be desirable in the future.

We found matches in BHL for 57.3 % of the species of Hesperiidae in our checklist, with slightly lower percentages for Coeliadinae and Hesperiinae, and higher percentages for Eudaminae, Heteropterinae and Trapezitinae:

  Subfamily        Species with BHL matches (%)
1 Coeliadinae                              54.9
2 Eudaminae                                67.4
3 Euschemoninae                           100.0
4 Hesperiinae                              54.4
5 Heteropterinae                           65.8
6 Pyrginae                                 57.1
7 Trapezitinae                             62.7

However, the distribution of the number of matched pages per species is strongly skewed, with very few species having more than 50 matches and almost a thousand species with one or no matches. This distribution is much more skewed than for Papilionidae or Pieridae, but slightly more even than for Riodinidae.

[Figure: log-count ranked plot of BHL pages per species (y axis, log scale, 5e-01 to 5e+03) against species rank / total number of species (x axis, 0 to 1) for Papilionidae, Hesperiidae, Pieridae and Riodinidae.]

Within Hesperiidae, the pattern is very similar for the big subfamilies, and slightly more skewed for Eudaminae, but extremely uneven for Heteropterinae, while some species of Coeliadinae and the genus Euschemon appear relatively more often in BHL pages than corresponding species in other subfamilies.

[Figure: log-count ranked plot of BHL pages per species (y axis, log scale, 5e-01 to 5e+03) against species rank / total number of species (x axis, 0 to 1) for the subfamilies Coeliadinae, Eudaminae, Euschemoninae, Hesperiinae, Heteropterinae, Pyrginae and Trapezitinae.]

3.1. Extracting hostplant records from BHL pages

Some details of the data extraction protocol for BHL are available at the PoW home page under BHL data validation. We searched the downloaded OCR pages for a list of keywords associated with hostplant records. However, a large proportion of the matched pages were not text pages but advertisements, indices and bibliography pages, which are not categorized within BHL. We therefore used a classification tree to estimate page type from text metrics such as the number of digits, line breaks and alphabetic characters, and the number and diversity of words. The classification tree was trained with a subset of pages with known content (see Ferrer-Paris and Sánchez-Mercado [2] for details). We then applied it to all the OCR pages with matches for Hesperiidae, and found that approximately two thirds were predicted to be real text pages:

  ads  index  reference  text
  770   3988        438  6343

But only 18.8 % of the pages are both classified as text and contain at least one keyword:

              keyword.match
is.text.page  FALSE  TRUE
       FALSE   3953  3586
       TRUE    1834  2166

This is lower than the corresponding percentages for Papilionidae (22 %) and Pieridae (28.6 %), but still higher than for Riodinidae (13.9 %). Up to 25 of the 26 keywords used were matched at least once. Egg, Larvae and Plant were the most frequent for Hesperiidae, while the corresponding German terms Eier, Raupen and Pflanze, and other keywords like Feeding that were common for other families, are less represented in Hesperiidae pages.

   Keyword       Papilionidae  Pieridae  Riodinidae  Hesperiidae
 1 Aliment                 35       117           2           13
 2 Attracted to            73       139          10           70
 3 Catterpillar             0         0           0            1
 4 Egg                   1846      3783         151         1072
 5 Eier                  1811      2479          65          138
 6 Fabaceae                 3        20           7           44
 7 Feeding                959      2076         103          567
 8 Feed on                287       334          23          130
 9 Feeds on               365       341          37          258
10 Foodplant              483       701          75          387
11 grasses                109       263           9          262
12 Hospedera                0         2           0            0
13 Host                   751      1334          79          508
14 Host Plant             179       170          18           93
15 Larvae                2281      5596         223         1359
16 Larval                 825      1469          95          535
17 Legume                   9       100           2           46
18 Nahrung                165       401           8           23
19 Ovipos                 535       813          63          316
20 Pflanze                686      1372          23          119
21 Plant                 3428      6644         296         1994
22 Planta                 544      1054          39          209
23 Poaceae                  4         6           1           30
24 Raupen                2025      2879          66          148
25 Recurso                 18        22           0            9

4. Contribution of EOL and BHL data as sources of hostplant records

We mapped the data available from both sources onto the phylogeny of Hesperiidae and compared them with the available information about hostplant associations for the family. Here we use the database compiled previously by Ferrer-Paris et al. [3] to summarize current knowledge about the group, and compare it with the number of EOL data objects selected by simple keyword matching and the number of BHL pages that were classified as text pages and selected by simple keyword matching.

Most genera have data in EOL, BHL and the compilation of hostplants (59.3 %), and only 12 genera are not represented in any of these sources. BHL and EOL had similar values for all four metrics of phylogenetic representativeness; the compilation of hostplant records has a lower number of genera and lower phylogenetic richness (PSR), but higher values of MPD and evenness.

       SR       PSR      vars       MPD        PSE
EOL   489  332.3680  2.421940  1.323811  0.6632621
BHL   482  326.0034  2.657059  1.341425  0.6721068
HOST  361  249.5992  5.492270  1.453305  0.7286711

The previous compilation has hostplant records for 361 genera of Hesperiidae, while EOL and BHL appear to have additional information for the following genera:

Genus            EOLdataObjects  BHLpages
Allora                        2         1
Paracogia                     0         1
Nerula                        0         1
Ectomis                       0         1
Metardaris                    0         1
Sarbia                        0         1
Capila                        0         3
Coladenia                     0         3
Leucochitonea                 1         3
Darpa                         0         1
Exometoeca                    0         1
Odina                         0         2
Tapena                        0         4
Onenses                       0         2
Plumbago                      0         1
Zobera                        0         2
Pseudodrephalys               0         2
Haemactis                     0         2
Myrinia                       0         1
Viola                         3        13
Gorgopas                      0         1
Conognathus                   0         2
Dalla                         0        19
Argopteron                    0         2
Rachelia                      1         0
Neohesperilla                 4         0
Ochus                         0         1
Sebastonyma                   0         1
Alera                         0         2
Apostictopterus               0         1
Arnetta                       0         1
Ge                            0         1
Melphina                      1         0
Meza                          4         0
Pardaleodes                   3         2
Ploetzia                      1         0
Prada                         0         2
Rhabdomantis                  1         0
Teniorhinus                   4         0
Zographetus                   0         1
Mimene                        0         1
Adopaeoides                   1         0
Telles                        3         0
Virga                         0         2
Vidius                        0         2
Adlerodea                     0         1
Eprius                        0         1
Pamba                         0         1
Repens                        8         3
Styriodes                     0         1
Halotus                       0         2
Mucia                         0         1
Stinga                        0         2
Buzyges                       0         1
Cravera                       0         1
Phemiades                     0         3
Propertius                    0         3
Pyrrhocalles                  0         1
Racta                         1         0

4.1. Manual validation of selected EOL text data objects

We selected 1203 EOL pages for manual validation of association records for genera poorly or not represented in the previous compilation. There were hostplant records in 30.5 % of these pages, which were associated with 504 species of Hesperiidae. The plant names and terms were extracted manually, and then their validity and taxonomic rank were checked automatically against family, genus and species names from The Plant List (www.theplantlist.org) and the Angiosperm Phylogeny Website (http://www.mobot.org/MOBOT/research/APweb/).

We found a total of 1788 records. This includes 46 general records (with ambiguous common names or general terms) and 143 records that remain unverified (probably misspellings, transcription errors or nomenclatural problems). A further 823 records refer to valid plant species names, and 184 additional names were identified as synonyms and replaced by the accepted names. There are 234 additional records with a valid generic name but an invalid or unresolved specific epithet. Finally, there were 351 records with a valid plant genus name and 7 with a valid plant family name.

The 377 Hesperiidae species with valid hostplant records in the selected EOL text data objects included 64 species that were not recorded in our previous compilation, indicating a 6.1 % increase in the number of species with records for the family with only 16.9 % of the EOL text data objects manually validated. Similarly, the 552 plant species reported in the validated records included 141 species that were not previously recorded for this family, indicating a 7.7 % increase. However, manual validation was targeted at the butterfly genera poorly or not represented in the previous compilation, and therefore we would not expect these numbers to increase much more with an exhaustive manual validation.
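The record categories reported above add up to the stated total, as the following short R tally shows. The category labels are ours; the counts are taken from the text.

```r
# Record counts by validation outcome, as reported in the text above
records <- c(general_terms                  = 46,
             unverified                     = 143,
             valid_species                  = 823,
             synonym_replaced               = 184,
             genus_valid_epithet_unresolved = 234,
             genus_only                     = 351,
             family_only                    = 7)

total <- sum(records)   # 1788, matching the reported total

# Share of records resolved to an accepted plant species name
species_level <- records[["valid_species"]] + records[["synonym_replaced"]]
round(100 * species_level / total, 1)   # 56.3
```

Just over half of the extracted records could therefore be tied to an accepted plant species name, either directly or after replacing a synonym.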

4.2. Manual validation of selected BHL pages

Manual validation of hostplant records in BHL has been much less efficient due to delays in the classification of page types, a high proportion of false positive keyword matches, and the general and heterogeneous nature of text pages extracted from different historical sources, as opposed to the more concise and formatted content of EOL text data objects.

We have manually reviewed 3098 of 57123 BHL pages with associated butterfly names for the four families (Papilionidae, Hesperiidae, Pieridae and Riodinidae). This first selection was based on text characteristics (high probability of being a text page), presence of keywords related to hostplant associations, and presence of scientific names matching butterfly names with few hostplant records in the previous compilation. However, this selection procedure still returns a large proportion of pages without useful content. So far we have found 348 pages with hostplant records for butterflies, including only 9 pages with hostplant records for Hesperiidae. Validation of plant and butterfly names in this small set of pages has not provided new association records for Hesperiidae.
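The text characteristics used to separate text pages from indices and advertisements (digits, line breaks, alphabetic characters, number and diversity of words) can be computed along the following lines. This is a base R sketch with a hypothetical function name and two invented pages; it only derives candidate predictors and does not reproduce the classification tree described in [2].

```r
# Compute simple text metrics for one OCR page (a single string).
page_metrics <- function(txt) {
  chars <- strsplit(txt, "")[[1]]
  words <- tolower(regmatches(txt, gregexpr("[[:alpha:]]+", txt))[[1]])
  c(n_digits = sum(chars %in% as.character(0:9)),   # digit characters
    n_alpha  = sum(grepl("[[:alpha:]]", chars)),    # alphabetic characters
    n_breaks = sum(chars == "\n"),                  # line breaks
    n_words  = length(words),                       # number of words
    n_unique = length(unique(words)))               # diversity of words
}

# Invented examples: a prose page and an index-like page
text_page  <- "Larvae feed on grasses.\nThe larvae rest in leaf shelters."
index_page <- "Hesperia 12, 34, 56\nPyrgus 78, 90\nUrbanus 101"

rbind(text = page_metrics(text_page), index = page_metrics(index_page))
```

The index-like page has many digits relative to its few words, while the prose page has none, which is the kind of contrast a classification tree can exploit.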

5. Conclusions

Hesperiidae can be considered a poorly studied group of butterflies. Available content for Hesperiidae was generally less abundant and less rich than for Papilionidae and Pieridae, but still slightly better than for Riodinidae. Within Hesperiidae, differences between subfamilies were evident but not extreme, with Eudaminae, Coeliadinae and Trapezitinae slightly better represented than other subfamilies, and large variation within each subfamily.

In EOL, the main contributors are phylogenetic and molecular data providers. This is most probably due to a combination of scarce knowledge about the biology and ecology of the group and the increasing interest in describing and resolving phylogenetic relationships within this complex group of species. Searching for hostplant associations in EOL is greatly improved by keyword searches, although the false positive rate is relatively high. However, access to concrete data objects allows fast manual validation of most records, and an important amount of useful and new information can be gathered with some effort.

BHL provides a large number of matches for Hesperiidae species names. However, classifying useful content is more difficult. We suspect that a large proportion of matches refer to indices, reference lists or bibliographies, and commercial pages in older journals. We further narrowed the search for hostplant associations by searching for keywords and for butterfly species poorly represented in previous compilations, but this provided no new records. Manual validation of this information is very slow due to the large amount of redundant information and the lack of concrete records for many species.

References

[1] Andrew V. Z. Brower and Andrew Warren. Hesperiidae Latreille 1809. Skippers. http://tolweb.org/Hesperiidae/12028/2008.04.07 in The Tree of Life Web Project, http://tolweb.org/, 2008.

[2] J. R. Ferrer-Paris and A. Y. Sánchez-Mercado. Papilionoidea of the world: Evaluation and validation of EOL, BHL and GBIF data for Papilionidae, Pieridae and Riodinidae. Technical Report EOLR.r.2013.08, Centro de Estudios Botánicos y Agroforestales, Instituto Venezolano de Investigaciones Científicas, Maracaibo, Estado Zulia, August 2013. Available at the PoW home page.

[3] José R. Ferrer-Paris, Ada Sánchez-Mercado, Ángel L. Viloria, and John Donaldson. Congruence and diversity of butterfly-host plant associations at higher taxonomic levels. PLoS ONE, 8(5):e63570, 05 2013. doi: 10.1371/journal.pone.0063570. URL http://dx.doi.org/10.1371%2Fjournal.pone.0063570.

[4] Maria Heikkilä, Lauri Kaila, Marko Mutanen, Carlos Peña, and Niklas Wahlberg. Cretaceous origin and repeated tertiary diversification of the redefined butterflies. Proceedings of the Royal Society B: Biological Sciences, 279(1731):1093–1099, 2012. doi: 10.1098/rspb.2011.1430. URL http://rspb.royalsocietypublishing.org/content/279/1731/1093.abstract.

[5] M.R. Helmus, T.J. Bland, C.K. Williams, and A.R. Ives. Phylogenetic measures of biodiversity. American Naturalist, pages E68–E83, 2007.

[6] Niels P. Kristensen, Malcolm J. Scoble, and Ole Karsholt. Lepidoptera phylogeny and systematics: the state of inventorying moth and butterfly diversity. Zootaxa, 1668:699–747, 2007.

[7] Gerardo Lamas. Atlas of Neotropical Lepidoptera. CheckList: Part 4A Hesperioidea- Papilionoidea. Scientific Publishers, 2004.

[8] Gerardo Lamas. Australian checklist. Available on-line from the Taxome project web- page, 2006.

[9] Jonathan Pelham. A skeleton checklist of the butterflies of the united states and canada. Provisional word-processed checklist available on-line from the Taxome project webpage, 2006.

[10] Markku Savela. Lepidoptera and some other life forms. Available on-line at http://www.nic.funet.fi/pub/sci/bio/life/intro.html, 1999-2013.

[11] A. D. Warren, J. R. Ogawa, and A. V. Z. Brower. Phylogenetic relationships of subfamilies and circumscription of tribes in the family Hesperiidae (Lepidoptera: Hesperioidea). Cladistics, 24:642–676, 2008.

[12] A. D. Warren, J. R. Ogawa, and A. V. Z. Brower. Revised classification of the family hesperiidae (lepidoptera: Hesperioidea) based on combined molecular and morphological data. Systematic Entomology, 34:467–523, 2009.

[13] C. Webb, D. Ackerly, M. McPeek, and M. Donoghue. Phylogenies and community ecology. Annual Review of Ecology and Systematics, 33:475–505, 2002.

[14] Mark C. Williams. Afrotropical checklist. Available on-line from the Taxome project webpage, 2003.
