Utah State University DigitalCommons@USU

Ecology Center Publications Ecology Center

8-8-2019

What We (Don’t) Know About Global Diversity

William K. Cornwell University of New South Wales

William D. Pearse Utah State University

Rhiannon L. Dalrymple University of New South Wales

Amy E. Zanne George Washington University

Follow this and additional works at: https://digitalcommons.usu.edu/eco_pubs

Part of the Ecology and Evolutionary Biology Commons

Recommended Citation Cornwell, W.K., Pearse, W.D., Dalrymple, R.L. and Zanne, A.E. (2019), What we (don't) know about global plant diversity. Ecography, 42: 1819-1831. doi:10.1111/ecog.04481

This Article is brought to you for free and open access by the Ecology Center at DigitalCommons@USU. It has been accepted for inclusion in Ecology Center Publications by an authorized administrator of DigitalCommons@USU. For more information, please contact [email protected]. What we (don’t) know about global plant diversity

William K. Cornwell1, William D. Pearse2, Rhiannon L. Dalrymple1, and Amy E. Zanne3

1 Evolution & Ecology Research Centre, School of Biological, Earth and Environmental Sciences, UNSW

Sydney, Australia

2 Department of Biology & Ecology Center, Utah State University, Logan, UT, USA

3 Department of Biological Sciences, George Washington University, Washington, DC, USA

Corresponding author email: [email protected]

ORCiD ids and emails:

William K. Cornwell: 0000-0003-4080-4073, [email protected]

William D. Pearse: 0000-0002-6241-3164, [email protected]

Rhiannon L. Dalrymple: 0000-0002-2660-1424, [email protected]

Amy E. Zanne: 0000-0001-6379-9452, [email protected]

1 Abstract

2 The era of big biodiversity data has led to rapid, exciting advances in the theoretical and applied biological, ecological

3 and conservation sciences. While large genetic, geographic and trait databases are available, these are neither complete

4 nor random samples of the globe. Gaps and biases in these databases reduce our inferential and predictive power, and

5 this incompleteness is even more worrisome because we are ignorant of both its kind and magnitude. We performed a

6 comprehensive examination of the taxonomic and spatial sampling in the most complete current databases for plant

7 genes, locations, and functional traits. To do this, we downloaded data from The Plant List (), the Global

8 Biodiversity Information Facility (locations), TRY (traits) and GenBank (genes). Only 17.7% of the world’s described

9 and accepted land plant species feature in all three databases, meaning that more than 82% of known plant biodiversity

10 lacks representation in at least one database. Species coverage is highest for location data and lowest for genetic data.

11 and orchids stand out taxonomically and the equatorial region stands out spatially as poorly represented in

12 all databases. We have highlighted a number of clades and regions about which we know little functionally, spatially

13 and genetically, on which we should set research targets. The scientific community should recognize and reward the

14 significant value, both for biodiversity science and conservation, of filling in these gaps in our knowledge of the plant

15 tree of life.

16 Keywords: Bias, functional trait, genetic, knowledge gaps, macro-ecology, spatial, taxonomic

17 Introduction

18 In perhaps the first plant trait database, Theophrastus of Eresus (371–287 BC), a student of Plato and Aristotle, assembled

19 and published a complete list of the woody and herbaceous growing in his garden (Theophrastus 1916). Since then

20 the nature and scale of plant databases have changed, but the goal of organized, accessible information about the world’s

21 plants remains the same. For the first time, after more than 2000 years of research, global versions of Theophrastus’ list

22 are now on the horizon.

23

24 Global biological databases have two goals: (1) provide the best global snapshots of biodiversity available at a given

25 time, and (2) organize their underlying data to facilitate their intersection with other datasets to address research

26 questions. From decades of hard work and accumulated knowledge, we know more now than we ever have about

27 biodiversity. The hurdles stopping us from attaining our goal of available, comprehensive global plant databases remain

28 daunting but also increasingly surmountable. In this effort, we believe it is worthwhile, periodically, to take stock of

29 what information we have and what we still lack. Here we ask: how much do we know about global plant diversity, and

30 what and where are the most conspicuous blank pages in our encyclopedia of plant life?

31

32 There are four primary classes of global biological data: taxonomic, geographic, genetic, and trait. Together they provide

33 the most complete picture currently available of where and how organisms make a living across the globe and how they

34 have evolved through time, with taxonomy providing the link between the other three types of primarily species-level

35 information. Each of these data types is complex in its own way. Two classes, taxonomic, and trait, require taxon-based

36 specialized knowledge—botanists are required to build and curate such datasets [e.g., The Angiosperm Phylogeny Website

37 http://www. mobot.org/MOBOT/research/APweb/, The Plant List http://www.theplantlist.org/, The State of the World’s

38 Plants Report RBG Kew (Willis 2017) , and TRY (Kattge et al. 2011)]. In contrast, data within genetic and geographic

39 databases share similar structure and curation protocols across all organisms, and as such, data for plants, animals, and

40 microbes can be usefully kept in one place. General databases for all organisms exist for genes (GenBank; Benson et al.

41 2013), phylogenetic relationships based on those genes (The Open Tree of Life; Hinchliff et al. 2015) and spatial

42 observations (GBIF; http://gbif. org/). Here, using The Plant List as our taxonomic backbone, we explore the data available

43 in GenBank, GBIF, and TRY, as these are the most complete current databases with fundamental, disaggregated data which

44 has been collected about global plants (König et al. 2019).

45

46 These three databases (GenBank, GBIF, and TRY) were assembled from numerous research projects with diverse

47 original goals and data availability; this diversity, while valid within each effort, creates global-scale biases in terms of 2 48 which species feature in a particular database and which do not, which in turn effects the usability of data (Beck et al.

49 2013, 2014, Hinchliff and Smith 2014, Gratton et al. 2017). The globally uneven distribution of research funding may

50 play a role in creating biases in data collection by limiting researchers’ efforts to certain geographical regions or taxa

51 (particularly to species of commercial or cultural value). Furthermore, biodiversity is not evenly distributed across the

52 globe. For instance, the high alpha and beta-diversity and commonness of rare species in many tropical ecosystems

53 means that the sampling effort required to capture some percentage of the species present during a sampling trip at low

54 latitudes is much more intensive than at high latitudes. Global patterns in biodiversity, combined with patchy sampling

55 effort, has left us with gaps in biodiversity data. Finally, non-random gaps within a given database can intersect with

56 gaps in other databases in surprising ways, reducing the number of species about which we have a comprehensive

57 understanding. The nature and source of gaps in each kind of information, and the interactions and consequences of

58 these shortfalls have been clearly conceptually defined (Hortal et al. 2015). There have been some recent assessments of

59 the current gaps in data, including outlines of gaps in phylogenetic data for fungi (Nilsson et al. 2016) and across the tree

60 of life (Hinchliff et al. 2015), and in spatial data for birds (Team eBird, July 2014) and plants (Meyer et al. 2016, Stropp

61 et al. 2016). However, the relative size and overlap of gaps in these data types, and taxonomic and spatial biases in their

62 intersection has never been defined. Thus, our main aim is to examine these biases, gaps, and overlaps of gaps in plants

63 across multiple data sources, starting with the most basic definition of species coverage: presence or absence.

64

65 Such data paucity matters. Lack of data is a major impediment to conservation: we cannot save a species if we do not

66 know where it is or what it needs to live. Data are crucial to identifying conservation crises (Rodrigues et al. 2006,

67 Cardoso et al. 2011, Keith 2015). For example, high-resolution remote-sensing deforestation maps (Hansen et al. 2013),

68 when combined with even small amounts of location data, can identify rare species at risk of extinction and worthy of

69 conservation efforts. A great deal of resources are allocated to conservation actions for some broadly-covered species

70 (i.e., species for which we have taxonomic, genetic, spatial, and trait data) that are at risk; in contrast, poorly-covered

71 species may be going extinct before conservation biologists are able to interrogate a geographic database to determine

72 where they are found or a trait database to determine its form or characteristics should they go looking for them. It is our

73 hope that an assessment of gaps in plant data across geographic space and the extant branches of the plant phylogeny

74 will spur the sharing and/or further collection of data where they are most sorely needed.

75

76 By isolating the key spatial and taxonomic knowledge gaps in plant data and the intersections between them, we hope to

77 increase our chances of filling in the most important missing pages in the book of plant life as efficiently as possible. To

78 quantify and understand these gaps, we present an analysis from our recent downloads of global plant databases. Here, 3 79 we quantify (1) the species coverage in each of three databases (genes: GenBank, traits: TRY, and spatial locations:

80 GBIF) databases, and (2) the intersection of these databases, reporting the proportion of broadly-covered species (in all

81 three databases) in contrast to patchily-covered species (in two or fewer databases) and poorly-covered species (in zero

82 databases). Finally, we (3) identify the phylogenetic clades and geographical regions for which we have different

83 amounts of trait and genetic data, allowing us to contrast the clades and parts of the world for which we have relatively

84 poor and broad coverage. While plant scientists may all agree that there are gaps in our knowledge of plants (Meyer et

85 al. 2016, Gallagher et al. 2019, Saatkamp et al. 2019), they are unlikely to know where the most gaping holes occur (as

86 found in systematic surveys of botanical trait knowledge; see FitzJohn et al. 2014). By highlighting the clades and

87 regions in most urgent need of attention, we hope to turn what appears an insurmountable task into a manageable

88 checklist of gaps to be filled.

89

90 Materials and Methods

91 Working across databases requires a common taxonomy: to combine information about species from one database with

92 another necessitates a common taxonomy. We use The Plant List (TPL; http://www.theplantlist.org/) to link species

93 information across the databases we explore. At the time of writing, TPL (v1.1) records 350,699 names as “accepted”

94 species of land plants. The taxonomic scope of this analysis (and of TPL) is embryophytes (land plants), including the

95 bryophytes (, hornworts, and liverworts), lycophytes, monilophytes (ferns and allies), and spermatophytes (seed

96 plants including gymnosperms and angiosperms). We emphasize that our use of TPL is conservative, as by making use

97 only of species that are “accepted” we exclude less studied lineages in need of additional taxonomic attention. Further,

98 variation in how sub-specific epithets are handled among databases limits our analysis to species binomials, and so we do

99 not address cases where databases contain data on different sub-species or varieties of the same species. Continuing to

100 resolve taxonomic controversies in a globally accepted, versioned, and updateable list will help interoperability of these

101 databases and benefit biodiversity science.

102 Databases

103 We examined the state of three databases: GenBank (genetic; Benson et al. 2013), the Global Biodiversity Information

104 Facility (GBIF—geographic; http://www.gbif.org/), and TRY (functional traits; Kattge et al. 2011). These databases

105 were selected as we consider them to be the three most complete databases on plant biodiversity currently available for

4 106 the most important aspects of plant biology. While there may be more data available for certain species from other,

107 smaller, non-associated platforms, we present an overview of coverage and intersection of the data from these sources

108 only as they represent the most readily-available and commonly-utilized resources for plant scientists. Below we

109 describe how each of the databases (genetic, geographic, and functional trait) were downloaded and analyzed. Each

110 database organization does its own synonymy updating, so we did not attempt to improve on those steps.

111

112

113 Data were downloaded on the 21st of May, 2018. All of our code and analyses are available online

114 (https://github.com/cornwell-lab-unsw/sTEP_overlap). The online access of our data and code allows for others to

115 adapt our work for other purposes and permits future readers to compare how our knowledge of plants continues to

116 change over time. We note that data quality is an issue with all three of these global databases. The quality of the data

117 within these global collaborations continues to improve but errors remain. Improving data quality is an important and

118 ongoing effort (Franklin et al. 2017, Cicero et al. 2017, Veiga et al. 2017) that will likely never end.

119

120

121 Geographic data

122 We obtained a complete set of GBIF observations with georeferences (i.e., the observation had a location associated

123 with it) and had no “spatial issues”. Removing any records with “spatial issues” means our records were free from

124 obvious issues (e.g., latitude/longitude location swaps or mislabelled identifications over water). In our May 2018

125 download of GBIF, we obtained 174,464,135 observations (tab-delimited file-size 101 GB). There were many

126 incomplete identifications (e.g., Carex sp.) in the database, which were removed in our processing. Once these data were

127 synchronized to the TPL species list, we were left with a total of 150,270,519 observations for 257,266 species.

128

129 Trait data

130 We used the species-based summary of TRY (Kattge et al. 2011) data available online (https://www.try-

131 db.org/dnld/TryAccSpecies.txt) to obtain a list of all species for which any trait data were available in TRY. After

132 synchronizing the species names to TPL, the database contained 124,414 species for which we have data on at least

133 one trait.

134 Genetic data

5 135 We compiled a list of all taxa from NCBI’s taxonomy (which is used in all GenBank searches) as green plants (Plantae).

136 To do this, we downloaded NCBI’s complete taxonomy network, then extracted all named terminal nodes (i.e., taxa)

137 subtending from the ‘green plants’ node (number 33090 at that time). After synchronizing the species names to TPL, the

138 database contained 108,616 species.

139

140 Analysis

141 Our first and second aims were to quantify the coverage and intersection of the three databases (TRY, GenBank, GBIF).

142 We examined the presence and absence of records for each species in TPL for each database, allowing us to take stock

143 of the species coverage and overlap of available data. We thus identified how many species are broadly-covered (i.e.

144 have more than zero data in all three databases), patchily-covered (i.e. have more than zero data in two or fewer

145 databases) and poorly-covered (i.e. have zero data in all three databases). With this approach, species that are present in

146 all three databases with at least a single record in each are considered "broadly-covered". These definitions of broadly-

147 and patchily-covered do not mean that these species are particularly well represented in any given database. There is no

148 single metric by which we could determine whether a species is ‘sufficiently’ covered in a given database, since

149 different research questions have different data requirements; we instead focus on describing the presence or absence of

150 any data at all in a given database.

151

152 Our third aim was to examine the taxonomic and geographic distribution patterns in sampling in all three databases. To

153 do this, we identified the phylogenetic clades and geographical regions for which we have some trait and genetic data,

154 allowing us to contrast the clades and parts of the world for which we have relatively poor coverage. At its core, our

155 approach quantified the extent of deviation from completely unbiased sampling. As with all analysis at this scale,

156 significance testing would be of little use: all patterns described would be highly significant because of the large sample

157 sizes involved. Instead we sought to quantify the extent to which sampling is uneven across taxonomy and geography.

158

159 For taxonomy, we analyzed all families for the degree to which sampling (species coverage) is non-randomly higher or lower

160 than the overall rate across the entire database.

161

6 162 A general method that searches for these clades is presented by Cornwell et al. (2014). We modified that method to use the

163 G-statistic (Sokal and Rohlf 1981) as our analysis focuses on presence/absence of data (a binary variable), unlike the

164 continuous data in Cornwell et al. (2014) (code for this analysis is available at www.github.com/traitecoevo/ ksi). The

165 key feature of this approach is that it statistically balances the size of the clade with its deviation from the mean sampling

166 rate (i.e., presence of records for species in that clade in a given database) across all land plants. Thus, our results are not

167 driven by (true) variation in the diversity of the families. This approach allowed us to rank families by the extent to which

168 they deviate from the mean sampling rate for all land plants and discover both the clades that are the most under sampled

169 and those that are the most over sampled. We call this our "distinctiveness index".

170

171 For geography, we took two approaches to determine spatial data gaps in trait and genetic databases: one ("observation”)

172 considers all GBIF observations, while the other (“species”) first calculates the median latitude for each species. The

173 observation (N=150,270,519) or species (N=257,266) was then classified as either with or without trait data, with or

174 without genetic data, and with or without both trait and genetic data. We fit six generalized additive models for very large

175 datasets (function "BAM" in Wood 2017) with the only predictor being latitude. The model estimates the probability of

176 an observation or species having trait data, genetic data, or both, at a given latitude. Because fitting generalized additive

177 models is memory intensive, even with 128 GB of RAM, random sub-sampling of the observation database down to

178 10,000,000 points was necessary. This sub-sampling did not affect the result, however, with repeated samples producing

179 identical fits.

180

181 Because research agendas and research funding are often geographically constrained, we sought to develop a way to

182 make data collection priority lists at geographic scales smaller that the globe. To do this, we split the raw GBIF

183 observations by continent (ESRI 2010), but this could also be done at the country scale or even the scale of a specific

184 reserve. We then isolated species that are only found in one biogeographic region, that is, species endemic to that

185 particular region. Considering each continent separately, we use the approach described by Cornwell et al. (2014) to

186 quantify the most disproportionately under-sampled and over-sampled families within each region. This approach was

187 designed to identify region-specific gaps in data and a region-specific priority list for future data collection. Note that

188 this analysis excludes both species with distributions on multiple continents and species that are invasive somewhere,

189 but we expect these species to be in the broadly-covered category.

7 190 Results

191 Of the accepted plant species of the world, 17.7% (61,902 species) are broadly-covered in the datasets, meaning we

192 know something about their genetic make-up, something about their functional traits, and something about their geo-

193 referenced location (Table 1). These species make up the "broadly-covered" section of the Venn diagram (Figure 1),

194 where genetic data, trait data and geographic data overlap.

195

196 More than half of all plant species - 55.6% - are patchily-covered across the datasets, meaning they are present in two or

197 fewer datasets (Table 1; Figure 1). Building on centuries of collections and a global network of museums and herbaria,

198 GBIF has by far the most complete species coverage, with occurrence data for 257,266 species (73.3% of accepted

199 diversity). In TRY, trait data are available for 124,414 species (35.5% of accepted diversity). The database with the least

200 coverage is GenBank, where genetic data are available for 108,616 species (31.0% of accepted diversity).

201

202 For 26.7% of plant species we have no information other than their names, there being no data available for them in these

203 datasets. That is, these species are poorly-covered and we have zero genetic, location or functional trait data in the

204 databases considered (Figure 1).

205

206 Where we know a lot

207 All large data sources have areas of strength and weakness across the plant phylogeny. There are several hotspots in our

208 knowledge, and the families that are broadly-covered (that is, having higher than average proportion of member species

209 occupying the "broadly-covered" section of Figure 1) tend to be widespread and of economic importance either as food

210 or resource crops, or as cultivars (Table 3; Figure 3). These include the Poaceae (grasses), Pinaceae (pines, spruces, firs),

211 Solanaceae (nightshades, with tomatoes, potatoes, peppers and eggplants), Moraceae (mulberries, breadfruit, jackfruit,

212 figs), Eurphorbiaceae (rubber, spurge cultivars), Proteaceae (Macadamia, silky oak, cultivars), Iridaceae (irises, saffron)

213 and Polemonaceae (Phlox). There are broadly-covered families which do not have obvious commercial value but may

214 simply have been of interest for research and exploration by particular research groups: two that came out in the top 10

215 are the Zamiaceae (cycads) and Orobanchaceae (broomrapes, mostly parasitic). Other areas of the phylogeny with good

216 sampling across all datasets are the gymnosperms and early diverging angiosperms (Figure 3), perhaps because they

217 were identified by Darwin as informative with respect to the "abominable mystery" of the origin of angiosperms

218 (Darwin and Seward 1903). Some of the orders with high coverage (Figure 3) are that way because of their unusually 8 219 small clade size. For instance, Trochodendrales and Acorales have full coverage — both orders are comprised of only

220 two accepted species, while Ginkgoales consists of a single, commercially-important, accepted species.

221

222 The spatial patterns in genetic and trait data availability (Figure 2) are evidence of the non-random distribution of

223 research sites and unequal sampling efforts across latitudes. There are more data available in GenBank and TRY for the

224 northern hemisphere than the southern hemisphere and both temperate regions are better sampled than the equatorial

225 region (Figure 2). The availability of both trait and genetic data is best around 45° N. If you were to take a random GBIF

226 observation from that latitude you would have a greater than 95% chance of accessing trait data for that species, a 90%

227 chance of there being genetic data, and almost 90% chance of there being both genetic and trait data available. If you

228 were to select a species at random at this latitude, there would be a 65% chance of there being trait data, a 60% chance

229 of there being genetic records for this species, but only 45% chance of both data types being available (Figure 2).

230

231 For many accepted species, there are 10 or fewer geographical observance records and no genetic or trait data (Figure 4).

232 The species for which there are upwards of 1000 spatial records in GBIF include the dominant trees and shrubs of North

233 America, Europe, and South East Australia. As these species are prevalent in regions where many scientists live and work,

234 they are well documented in numerous ways and make up much of the "broadly-covered" section of Figure 1.

235

236 Where we don’t know a lot

237 While the databases all lack a certain number of species contributing to the known extant diversity, there are plant

238 families for which we have little data across all databases (Table 4, Figure 3). Most of the poorly-covered families, those

239 families that are made up of species that are undersampled geographically, genetically, and functionally, are or

240 liverwort families, but Orchidaceae, Asteraceae and Begoniaceae also earn this title. When we consider the taxonomic

241 weaknesses of databases individually, we find that they share similarities in the distribution of data gaps (Table 4). GBIF

242 is extremely poorly sampled for mosses and liverworts—all of the top ten most geographically undersampled families

243 come from these two clades. Moss and liverwort families are also undersampled for both traits and genes, nine and

244 seven of the top ten most undersampled families in these data sets come from these clades, respectively. The traits of

245 Orchidaceae remain extremely undersampled despite the clade’s widespread cultivation and intense horticultural use.

246 For genetic data, the other families that make the top ten most undersampled list are all fairly diverse and widespread

247 tropical understory families: Begoniaceae, Acanthaceae, and Piperaceae.

9 248

249 We identify the three family groups for which we have the greatest gaps in data for endemic species of each of the

250 different continents (Table 2). As well as seeing similar overall patterns in data gaps to those outlined above, there are

251 some families which made multiple appearances. More data are needed for Rubiaceae endemics of Asia, Australia, NZ

252 and Oceania, and South America; for Asteraceae endemics of Europe, Asia and Africa; for Lamiaceae endemics of

253 Europe and Australia; and for Araceae endemics of Australia and South America.

254

255 The low latitudes are hotspots for plant diversity, home to some of the most diverse ecosystems on the planet, and yet

256 we find ourselves with a paucity of plant data for these tropical regions. It is in the equatorial region where we find the

257 least species coverage of genetic and trait data. For instance, if one were to select a single species at random at a latitude

258 of 0°, there would only be a 40% chance of data for that species being available in TRY, a mere 27% chance of there

259 being data for that species in GenBank (dotted lines; Figure 2), and less than a 20% chance of both trait and genetic data

260 being available.

261 Discussion

262 Shortfalls in our knowledge of global plant diversity

263 Our knowledge of plants is incomplete in a number of well-defined ways (see Hortal et al. 2015), four of which are

264 pertinent to the database resources we have analyzed here. First, given the rate at which species are being discovered, we

265 are quite confident that we have not sampled and described all plant life on Earth—this is the Linnean shortfall in our

266 knowledge of biodiversity (Lomolino 2004). Second, we lack basic taxonomic or phylogenetic information on the

267 placement of many plants in the tree of life—this is the Darwinian shortfall (Diniz-Filho et al. 2013). Third, we do not

268 know the distribution (range) of many plants on Earth—this is the Wallacean shortfall (Lomolino 2004). Fourth, we do

269 not know about the functional traits of many plants—this is the Raunkiæran shortfall (Hortal et al. 2015). A major result

270 from our study is that, in terms of the presence or absence of readily-available data for plant species, the Wallacean

271 shortfall (26.7%) is currently far less severe than the Darwinian (69%) or Raunkiæran (64.5%) shortfalls.

272

10 273 Methods within some lines of scientific inquiry lend themselves well to systematic sampling. Remote sensing, for

274 example, surveys every pixel of the planet for an extensive range of abiotic and biotic features. Similarly, despite the

275 uneven distribution of biodiversity and the challenge in getting to remote areas, taxonomy has an inherent goal to

276 describe every species on earth. The Wallacean shortfall is less severe than others due mainly to its strong link to

277 taxonomy related resources — GBIF has had great success in encouraging herbaria and museum collections to

278 participate in systematic centralization of data from collections; at the time of writing, 59 countries were contributing

279 data. Nonetheless, there is more to do: the gap in our geographic records of the plant kingdom currently stands at 26.7%,

280 meaning that there are 93,433 accepted species for which we do not know even a single location in the world they grow

281 or have grown. This is literally lost diversity data—for the species to have been described taxonomically, it was

282 collected, but the geographic information (the location where that specimen was collected) has been lost.

283

284 As we lack important information about where they live, we also then lack information about their ecology and

285 evolution. Most worryingly, without even basic geographic information, any true assessment of the conservation status

286 of these species is impossible. However, as more of the world’s herbarium collections become digitized and uploaded

287 directly to GBIF, or through intermediary initiatives, we do expect that this shortfall will decline relatively rapidly.

288 Indeed, comparison of the current coverage of GenBank, GBIF and TRY to that from a preliminary analysis for this

289 paper in 2015 indicated that the Wallacean shortfall is decreasing the fastest (see Appendix 1).

290

291 Relative to the Wallacean shortfall, the Raunkiæran and Darwininan shortfalls are more severe. The explicit goal of

292 complete sampling of plant diversity for either trait or genetic data has never been clearly stated perhaps because it was

293 so distant, and the result is incomplete, patchy coverage of plant species in these datasets. The Darwinian shortfall is the

294 most severe of the three: we have a complete lack of genetic information for 69% of known plant species. Worryingly,

295 this dataset has shown the slowest pace of growth in species coverage over a 3-year period (see Appendix 1), likely

296 because it is much more laborious and expensive to produce a single record of genetic information than for either of the

297 other databases. The Raunkiæran shortfall stands at 64.5% of accepted diversity, which is a considerable gap in our plant

298 knowledge given the current focus on trait-based analyses in the scientific exploration of ecological and evolutionary

299 processes. There are other taxonomic groups, including birds and mammals, for which complete sampling has been an

300 explicit aim, and as such, relevant trait datasets have high-to-complete species coverage (e.g. Jones et al. 2009, Wilman

301 et al. 2014). Despite the daunting diversity of land plants, we are now at the point where complete sampling through

302 targeted campaigns is a feasible goal.

11 303

304 Perhaps the most worrying of these shortfalls is the Linnean as it is hard to know anything about what we do not know;

305 these are the species for which we know literally nothing, and they do not even have an accepted name. Estimations are

306 that the number of un-named plant species may be 10-20% of the number of known plant species (Joppa et al. 2011) and

307 that the majority may inhabit diverse, tropical regions (Wheeler et al. 2004). These may have never been collected

308 (collection of new species in the field may be the greatest rate-limiting step towards mapping the whole tree of plant life;

309 May 2004), or they may have been collected but misidentified or are yet to be described. If they have been collected,

310 there may be associated genetic, functional and spatial data but without a name as a link, this information is lost to us. It

311 can be difficult for taxonomists to keep up with the influx of new specimens needing description (and the revisions that

312 new specimens can warrant), and there may not be the expertise or financial support available to do so - often resulting

313 in significant lag time between collection and publication of descriptions with accepted scientific names (Wheeler et al.,

314 2004, Bebber et al. 2010). With new information and taxonomic effort, the number of accepted species will no doubt

315 rise in the coming years (see Mora et al. 2011); however, the training of taxonomists and the financial support of their

316 research in both new and existing collections is crucial to the reduction of the Linnean shortfall.

317

318 Each of these four gaps in our knowledge of plant diversity combines to create what we suggest is a fifth shortfall in our

319 knowledge of global plant diversity: we do not know anything about the geography, anything about the genes, nor

320 anything about the traits for the vast majority of known plant species. We term this the Venn shortfall, since it relates to

321 a lack of intersection among databases. The Venn shortfall is a critical issue that affects both biodiversity science and

322 conservation. How can we assess conservation risk of a species if all the information we have about them is a name? As

323 species that are broadly-covered across databases are more likely to be used in research as there is data sufficiency

324 requirements, patchily- and poorly-covered species will become relatively less and less studied over time (see Appendix

325 2). The narrow overlap among the different dimensions of our knowledge of plant diversity means that much research

326 and policy-decision making is limited to data on a relatively small (17.7% of accepted diversity) and non-random subset

327 of plants species, and we know that when data are biased or extremely sparse we are at risk of missing or misidentifying

328 patterns or processes (Hortal et al. 2015). If we acknowledge the Venn shortfall as a serious problem facing conservation

329 biodiversity science, then we can treat this as an important goal worthy of attention and funding.

330

331 The assessments of data gaps and biases that we have here described are based on the presence or absence of an

332 accepted species in the databases. However, the presence of any data about a species does not equate to usefulness for

333 research. The issues of data quality and sufficiency are both complex, even when considering only one form of data, as 12 334 what is deemed sufficient and of high enough quality is entirely context and question dependent. For instance, two

335 recent Nature papers constructed (with great effort) subsets of these data that were sufficient for their research purposes:

336 Zanne et al. (2014) built the largest timetree for plants and were able to find sufficient data for 28,838 TPL accepted

337 species (using a patchy matrix based on 7 genes, 8.2% of accepted species; 32,223 total species considered); Díaz et al.

338 (2016) assembled data on 6 key functional traits for 46,085 TPL accepted species (13% of accepted species), although

339 values for all 6 traits were only available for 2,214 species (0.063% of accepted species; See Appendix 2). Further, data

340 on plant traits is not evenly spread across plant tissues, with much more coverage for leaf traits as compared to root

341 traits. The leaf trait with the highest coverage in the TRY database (Kattge et al. 2011, see https://www.try-

342 db.org/de/de.php; 30296, number of leaflets per leaf) shows almost an order of magnitude greater coverage than the

343 most sampled root trait (5535, root type/root architecture), and it is worth noting that no root traits were included in the

344 Díaz et al. (2016) synthesis. While the GBIF spatial observation data has the highest species coverage of databases we

345 considered, data limitations, errors and uncertainties cumulatively limit an understanding of individual species’

346 distributions: there are only accessible digital information of high enough quality to estimate distribution for

347 approximately a quarter of plant species (Meyer et al. 2016). As such, it is not feasible for us to quantify the availability

348 and overlap in sufficient data for all general research purposes, so we have begun to address this by quantifying the gaps

349 in terms of there being any data at all. Once we apply ourselves to fill the spatial and taxonomic gaps in species

350 coverage, and close the Venn shortfall, the fact will remain that more data will be needed to enable research using data

351 on many of the world’s plant species.

352

353 Closing the gaps

354 While there can never be a complete substitute for directly measured data, temporary solutions have been proposed to

355 leverage existing data to circumvent the shortfalls we describe above. These include phyndr, an algorithm that

356 maximizes the overlap between phylogenetic and species’ data (i.e., trait and spatial; Pennell et al. 2016), Phylomatic

357 which uses taxonomic data to fill-in missing phylogenetic data (Webb and Donoghue 2005), and approaches which use

358 taxonomic information to “gap-fill” our knowledge of plant functional diversity (FitzJohn et al. 2014, Taugourdeau et al.

359 2014, Schrodt et al. 2015). As human pressure on plant diversity increases through time, such techniques may be our

360 best hope of estimating policy-relevant information immediately (König et al. 2019), but they also introduce uncertainty,

361 and dealing with this uncertainty in downstream analyses may be nontrivial. Leaving aside the difficulties involved in

362 imputing data (and validating those imputations with enough real data), most simple models of trait evolution for plants

363 are far from adequate (Pennell et al. 2015) and heterogeneity in evolutionary processes among the branches of the tree is 13 364 common (Eastman et al. 2011). Such evolutionary methods can be powerful for filling in small gaps where information

365 on a species can be estimated based on its close relatives, but they are of limited use in large clades with poor data

366 coverage. Most worryingly, because gap-filling in most cases uses taxonomic or phylogenetic information to infer

367 species values, there is the danger of circularity in many contexts, such as evolutionary and biogeographic studies. For

368 goals that are specific to the unsampled species— such as, what is the species’ ecology and its conservation risk—

369 imputation will simply not be informative given the current patchy data. Such approaches cannot supersede the

370 collection of novel data, but they do help us make the most of what data we do have.

371

372 Unlike the Linnean shortfall, targeted programs may reduce the magnitude of the Venn, Darwinian, Wallacean and

373 Raunkiæran shortfalls. Held within the world’s > 3000 herbaria are almost 390 million specimens (Thiers 2017) and

374 considerable efforts are underway to increase availability of data from these collections through large digitization and

375 sharing initiatives (Pyke and Ehrlich 2010, James et al. 2018). The number of species for which we have only names and

376 no other data is large (13.2%), however, these species must have had museum/herbarium records at some time. If they

377 still exist and are findable, these specimens contain sufficient data to address multiple shortfalls at once: there are

378 collection location data (albeit for older specimens this tends to be at a course scale; Pyke and Ehrlich 2010), and the

379 vouchers themselves can be sources of genetic and trait data (Corney et al. 2012, Li et al. 2015, Younis et al. 2018). To

380 some extent this is ongoing, but current progress is haphazard. To be maximally efficient with our sampling efforts, it is

381 worth (1) prioritizing herbaria that are known to have collections of undersampled taxa, especially those that are missing

382 from all three of the databases, and (2) gathering data from specimens in locations known to be effected by habitat loss

383 as quickly as possible (as these specimens might be the only collections that we ever obtain).

384

385 Our survey of global plant diversity leads us to make four observations that can inform a more optimal strategy for

386 filling in knowledge gaps. First, grant programs can have direct impacts on the future of the shortfalls we have

387 demonstrated by incentivizing work which creates overlapping species data, and such opportunities should be expanded.

388 We suggest that funding programs should support researchers collecting and sharing data on undersampled regions or

389 clades of the world. Second, publication of results from genetic data has long required centralized deposition of raw

390 sequences with meta-data; A similar approach within collectors of locations and trait data would broaden the base of

391 contributors and rapidly increase availability of both spatial and trait data (Schneider et al. 2018). Third, one clear bias is

392 that many broadly-covered families are those with some commercial value. However, species do not need to have direct

393 commodity or cultural usefulnesses to be important (Ehrlich and Wilson 1991). There is immense basic and applied

394 value in novel data on undersampled families (orchids, Begoniaceae, Asteraceae, and most families see Table 14 395 4) in addressing biases in our datasets to future biodiversity and conservation science. Last, we need to intensify genetic

396 and trait sampling efforts in undersampled regions: spatial biases are evident in the availability of both genetic and

397 functional trait data.

398

399 Conclusions

400 Despite shortfalls (Linnaen, Darwinian, Wallacean, Raunkiæran, Venn) within and across data sources, there are more

401 reasons than ever to be optimistic about the future of our knowledge of global plant diversity. Biodiversity sciences are

402 progressing in leaps and bounds, empowered by the sharing of large amounts of data. However, the intersection of data

403 among our plant databases is particularly patchy and non-random, and as we move towards a more complete picture of

404 global functional diversity, more data are clearly needed. By taking a targeted approach to expanding and improving the

405 status of our data foundations for such work, we can progress towards an ever more complete picture of the many facets

406 of plant biodiversity, and greater inferential and predictive power from our global plant data.

407

408 Gap-filling is also a goal among those who study other taxa: interestingly eBird — an ornithologic citizen science effort —

409 performed a gap analysis similar to ours in 2014 and were able to generate gap-filling enthusiasm, which led to the sharing of

410 data on 227 out of the 421 missing bird species (see https://ebird.org/news/ebirds-missing-species/ and

411 https://ebird.org/news/ebirds-missing-species-v2/). The sheer number of missing species is much larger for plants, but

412 publicizing target lists is a good first step, and with enough of a reward, progress could be rapid. Let us begin!

413

414 Declarations

415 Acknowledgments

416 This research includes computations using the Linux computational cluster Katana supported by the Faculty of Science,

417 UNSW Sydney. This research was initiated as part of the sTEP working group funded by sDIV, the Synthesis Centre of

418 the German Centre for Integrative Biodiversity Research Halle-Jena-Leipzig. Thanks go to two anonymous reviewers

419 for their comments on this work. WDP and the Pearse lab are funded by NSF ABI-1759965, NSF EF-1802605, and

420 USDA Forest Service agreement 18-CS-11046000-041.

15 421 Author contributions

422 WKC and AEZ conceived of the initial ideas and questions. WKC and WDP designed and performed the analyses. All

423 authors wrote and revised the manuscript.

424

425 Tables

426 Table 1: Count of accepted plant species in each of the database and overlap of species among two- and three-way

427 combinations of databases.

Genbank GBIF TRY All three Genbank 108616 100507 63783 GBIF 100507 257266 114067 TRY 63783 114067 124414 All three 61902 428

429

16 430 Table 2: The three most undersampled families for species endemic to each continent. Note that endemicity in this analysis is

431 based on GBIF observations and may be an underestimate due to species that are invasive elsewhere.

432

Proportion Estimated Distinctiveness Continent Database Family of species continental index sampled endemics Africa TRY Acanthaceae 0.097 963 227.263 TRY Asteraceae 0.195 2513 137.427 TRY Aizoaceae 0.149 955 113.331 Genbank Acanthaceae 0.173 963 87.737 Genbank Begoniaceae 0.009 111 70.565 Genbank Lauraceae 0.061 163 59.395 Asia TRY Orchidaceae 0.050 3085 1094.982 TRY Gesneriaceae 0.028 617 269.321 TRY Asteraceae 0.150 2486 216.418 Genbank Orchidaceae 0.284 3085 149.851 Genbank Begoniaceae 0.051 234 144.026 Genbank Rubiaceae 0.247 1352 118.264 Australia TRY Orchidaceae 0.079 557 295.071 TRY Lamiaceae 0.149 261 72.413 TRY Araceae 0.024 42 32.383 Genbank Orchidaceae 0.217 557 56.275 Genbank Frankeniaceae 0.000 32 28.407 Genbank Rubiaceae 0.144 111 25.815 Europe TRY Asteraceae 0.262 1382 92.533 TRY Orchidaceae 0.103 213 79.508 TRY Lamiaceae 0.192 193 28.876 Genbank Asteraceae 0.158 1382 258.282 Genbank Rosaceae 0.197 615 54.249 Genbank Onagraceae 0.000 24 18.678 North America TRY Crassulaceae 0.181 215 119.096 TRY Pottiaceae 0.000 47 73.190 TRY Bryaceae 0.000 46 71.631 Genbank Myrtaceae 0.064 297 154.550 Genbank Begoniaceae 0.009 109 90.513 Genbank Primulaceae 0.093 215 87.264 NZ and TRY Orchidaceae 0.077 195 84.651 Oceania TRY Rubiaceae 0.153 235 48.578 TRY Pandanaceae 0.051 59 30.951 Genbank Phyllanthaceae 0.098 82 29.027 Genbank Euphorbiaceae 0.059 51 25.326 Genbank Orchidaceae 0.200 195 22.924 South TRY Orthotrichaceae 0.000 128 182.369 America TRY Pottiaceae 0.000 127 180.941 TRY Dicranaceae 0.000 109 155.250 Genbank Rubiaceae 0.148 1923 174.428

17 Genbank Begoniaceae23 0.018 219 107.190 Genbank Araceae 0.129 848 103.845 433

434

435

18 436 Table 3: The most well-sampled families for each database, as well as those which are broadly-covered across all three

437 databases, i.e., having a disproportionately high proportion of species in the “broadly-covered” overlap section of the Venn in

438 Figure 1.

Proportion Estimated Distinctiveness Database Family sampled diversity index TRY Poaceae 0.77 11470 8648.92 Begoniaceae 0.96 1601 2754.26 Euphorbiaceae 0.67 6547 2670.13 Orobanchaceae 0.91 1614 2218.14 Santalaceae 0.90 992 1302.79 Moraceae 0.85 1217 1263.58 Loranthaceae 0.91 887 1185.14 Urticaceae 0.74 1465 895.42 Proteaceae 0.76 1253 854.95 Iridaceae 0.61 2315 644.84 Genbank Brassicaceae 0.53 4065 895.56 Restionaceae 0.84 529 635.86 Apiaceae 0.52 3270 631.14 Liliaceae 0.65 745 365.97 Pinaceae 0.87 255 349.81 Poaceae 0.39 11470 339.32 Zamiaceae 0.88 216 312.67 Polemoniaceae 0.71 455 302.12 Plantaginaceae 0.51 1614 296.57 Ericaceae 0.44 3553 282.12 GBIF Proteaceae 0.96 1253 458.72 Solanaceae 0.88 2677 337.89 Myrtaceae 0.83 5970 335.46 Ericaceae 0.85 3553 258.46 Passifloraceae 0.93 932 247.73 Malvaceae 0.83 4506 232.37 Chrysobalanaceae 0.97 535 230.45 Melastomataceae 0.83 4078 226.11 Iridaceae 0.86 2315 222.26 Piperaceae 0.85 2658 217.26 All three Poaceae 0.31 11470 1319.29 Pinaceae 0.84 255 529.78 Euphorbiaceae 0.27 6547 385.33 Orobanchaceae 0.38 1614 382.65 Solanaceae 0.33 2677 373.78 Moraceae 0.41 1217 365.32 Polemoniaceae 0.56 455 333.91 Zamiaceae 0.71 216 296.21 Iridaceae 0.32 2315 287.26 Proteaceae 0.37 1253 274.77 439

440

441

19 442 Table 4: The most disproportionately undersampled families within each database. Those families which are undersampled

443 across the board are "poorly-covered", having the lowest proportion of representatives in the central overlap section the

444 Venn diagram.

Proportion Estimated Distinctiveness Database Family sampled diversity index try Orchidaceae 0.19 27782 4131.81 Pottiaceae 0.0003 3169 2775.82 Hypnaceae 0.0004 2519 2200.85 Lejeuneaceae 0.0004 2269 1980.10 Bryaceae 0 2103 1849.77 Dicranaceae 0.003 2116 1775.56 Sematophyllaceae 0.001 1629 1403.18 Orthotrichaceae 0.002 1272 1078.43 Brachytheciaceae 0.005 1102 909.56 Hookeriaceae 0.002 831 703.53 genbank Orchidaceae 0.22 27782 1179.95 Hypnaceae 0.05 2519 1048.10 Pottiaceae 0.09 3169 902.62 Begoniaceae 0.02 1601 899.25 Bryaceae 0.07 2103 753.64 Sematophyllaceae 0.09 1629 457.60 Dicranaceae 0.12 2116 423.91 Orthotrichaceae 0.10 1272 316.40 Fissidentaceae 0.07 817 289.72 Acanthaceae 0.19 3948 277.58 gbif Orchidaceae 0.57 27782 3833.24 Hypnaceae 0.40 2519 1249.90 Pottiaceae 0.46 3169 1058.46 Sematophyllaceae 0.42 1629 703.30 Bryaceae 0.46 2103 695.73 Dicranaceae 0.52 2116 455.56 Lejeuneaceae 0.54 2269 394.40 Brachytheciaceae 0.49 1102 288.83 Neckeraceae 0.46 816 273.87 Fissidentaceae 0.49 817 217.23 All three Orchidaceae 0.09 27782 1848.76 Pottiaceae 0.0003 3169 1214.53 Hypnaceae 0.0004 2519 961.78 Lejeuneaceae 0.0004 2269 864.75 Bryaceae 0 2103 814.59 Dicranaceae 0.002 2116 764.57 Sematophyllaceae 0.0006 1629 616.81 Orthotrichaceae 0.0008 1272 478.85 Begoniaceae 0.02 1601 423.88 Asteraceae 0.14 32920 405.18 445

20 446 Figure legends

447 Figure 1: Shortfalls in our knowledge of the plant kingdom. (1) The Linnean shortfall (diversity for which we have no names)

448 is the area outside of the "accepted taxonomic names" footprint. (2) The Darwinian shortfall (diversity for which we have

449 zero genetic information in GenBank) is the area outside of the "Genetic data" footprint. (3) The Wallacean shortfall

450 (diversity for which we have zero location information in GBIF) is the area outside of the "Geographic data" footprint. (4)

451 The Raunkiæran shortfall (diversity for which we have zero records of functional trait data in TRY) is the area outside of our

452 "Trait data" coverage. (5) The Venn shortfall, a culmination of these first four shortfalls, is the area outside of where these

453 four kinds of knowledge intersect - the area of overlap where species are considered "broadly-covered".

454

455 Figure 2: Generalized additive models for the relationship between latitude and availability of plant data. The left panel

456 models individual GBIF observations (N=150, 270, 519); the right panel models species median values for latitude (N=257,

457 266). Presented are the probabilities at different latitudes that there would be genetic or trait data available to match to a

458 given GBIF observation (left panel) or a given species (right panel). Line thickness is indicative of sample size. Note that

459 tropical areas (especially just south of the equator) have the poorest data coverage, and just north of 40° N is where we find

460 the best coverage of both functional trait and genetic plant data.

461

462 Figure 3: The proportion of species in each order of the tree of life for which we have functional trait, genetic and location

463 data, and the proportion which are well-sampled, “broadly-covered” across all three types of data available. Phylogeny from

464 Zanne et al. (2014).

465

466 Figure 4: Ranking of the number of occurrences in GBIF for plant species; colors indicate the presence of trait and genetic

467 data. Species with many spatial records also tend to have genetic and trait data available. Species with few GBIF records are

468 less likely to be covered by the other databases. Note the log-scaled x-axis.

21 469 References

470 Bebber, D. P. et al. 2010. Herbaria are a major frontier for species discovery. – Proceedings of the National Academy of

471 Sciences 107: 22169–22171.

472 Beck, J. et al. 2013. Online solutions and the ‘wallacean shortfall’: what does GBIF contribute to our knowledge of species’

473 ranges? – Diversity and Distributions 19: 1043– 1050.

474 Beck, J. et al. 2014. Spatial bias in the GBIF database and its effect on modeling species’ geographic distributions. –

475 Ecological Informatics 19: 10–15.

476 Benson, D. A. et al. 2013. Genbank. – Nucleic Acids Research 41: D36–D42.

477 Cardoso, P. et al. 2011. The seven impediments in invertebrate conservation and how to overcome them. – Biological

478 Conservation 144: 2647 – 2655.

479 Cicero, C. et al. 2017. Biodiversity informatics and data quality on a global scale 1. – The Extended Specimen: Emerging

480 Frontiers in Collections-Based Ornithological Research .

481 Corney, D. P. et al. 2012. Automatic extraction of leaf characters from herbarium specimens. – Taxon 61: 231–244.

482 Cornwell, W. K. et al. 2014. Functional distinctiveness of major plant lineages. – Journal of Ecology 102: 345–356.

483 Darwin, F. & Seward, A. C. eds. 1903. More letters of Charles Darwin. A record of his work in a series of hitherto

484 unpublished letters. London: John Murray. Volume 2.

485 Díaz, S. et al. 2016. The global spectrum of plant form and function. – Nature 529: 167.

486 Diniz-Filho, J. A. F. et al. 2013. Darwinian shortfalls in biodiversity conservation. – Trends in Ecology & Evolution 28: 689–

487 695.

488 Eastman, J. M. et al. 2011. A novel comparative method for identifying shifts in the rate of character evolution on trees. –

489 Evolution 65: 3578–3589.

490 Ehrlich, P. R. and Wilson, E. O. 1991. Biodiversity studies: Science and policy. – Science

491 253: 758–762.

492 Environmental Research Systems Institute (ESRI). 2010. World Continents from ESRI data and maps CD. – Redlands, CA.

22 493 FitzJohn, R. G. et al. 2014. How much of the world is woody? – Journal of Ecology 102: 1266–1272.

494 Franklin, J. et al. 2017. Big data for forecasting the impacts of global change on plant communities. – Global Ecology and

495 Biogeography 26: 6–17.

496 Gallagher, R. et al. 2019. The Open Traits Network: Using Open Science principles to accelerate trait-based science across

497 the Tree of Life. EcoEvoRxiv. doi:10.32942/osf.io/kac45.

498 Gratton, P. et al. 2017. A world of sequences: can we use georeferenced nucleotide databases for a robust automated

499 phylogeography? – Journal of Biogeography 44: 475-486.

500 Hansen, M. C. et al. 2013. High-resolution global maps of 21st-century forest cover change. – Science 342: 850–853.

501 Hinchliff, C. E. and Smith, S. A. 2014. Some limitations of public sequence data for phylogenetic inference (in plants). –

502 PLoS One 9: e98986.

503 Hinchliff, C. E. et al. 2015. Synthesis of phylogeny and taxonomy into a comprehensive tree of life. – Proceedings of the

504 National Academy of Sciences 112: 12764–12769.

505 Hinchliff, C. E. et al. 2015b. Synthesis of phylogeny and taxonomy into a comprehensive tree of life. – Proceedings of the

506 National Academy of Sciences 112: 12764–12769.Hortal, J. et al. 2015. Seven shortfalls that beset large-scale knowledge

507 of biodiversity. – Annual Review of Ecology, Evolution, and Systematics 46: 523–549.

508 James, S. A. et al. 2018. Herbarium data: Global biodiversity and societal botanical needs for novel research. – Applications

509 in Plant Sciences 6: e1024.

510 Jones, K. E. et al. 2009. PanTHERIA: a species-level database of life history, ecology, and geography of extant and recently

511 extinct mammals. – Ecology 90: 2648–2648.

512 Joppa, L. N. et al. 2011. How many species of flowering plants are there? – Proceedings of the Royal Society of London B:

513 Biological Sciences 278: 554–559.

514 Kattge, J. et al. 2011. Try - a global database of plant traits. – Global Change Biology 17: 2905–2935.

515 Keith, D. A. 2015. Assessing and managing risks to ecosystem biodiversity. – Austral Ecology 40: 337–346.

516 König, C., Weigelt, P., Schrader, J., Taylor, A., Kattge, J., & Kreft, H. 2019. Biodiversity data integration--The significance

517 of data resolution and domain. – PLoS biology 17: e3000183.

518 Li, X. et al. 2015. Plant dna barcoding: from gene to genome. – Biological Reviews 90: 157–166.

23 519 Lomolino, M. V. 2004. Conservation biogeography. – In: Lomolino, M. V. and Heaney, L.R. (eds.), Frontiers of

520 Biogeography: new directions in the geography of nature. Sinauer Associates, Inc., pp. 293–296.

521 May, R. M. 2004. Tomorrow’s taxonomy: Collecting new species in the field will remain the rate-limiting step. –

522 Philosophical Transactions: Biological Sciences 359: 733–734.

523 Meyer, C. et al. 2016. Multidimensional biases, gaps and uncertainties in global plant occurrence information. – Ecology

524 Letters 19: 992–1006.

525 Mora, C. et al. 2011. How many species are there on earth and in the ocean? – PLoS Biol 9: e1001127.

526 Nilsson, R. H. et al. 2016. Top 50 most wanted fungi. – MycoKeys 12: 29–40.

527 Pennell, M. W. et al. 2015. Model adequacy and the macroevolution of angiosperm functional traits. – The American

528 Naturalist 186: E33–E50.

529 Pennell, M. W. et al. 2016. A simple approach for maximizing the overlap of phylogenetic and comparative data. – Methods

530 in Ecology and Evolution 7: 751–758.

531 Pyke, G. H. and Ehrlich, P. R. 2010. Biological collections and ecological/environmental research: a review, some

532 observations and a look to the future. – Biological Reviews 85: 247–266.

533 Rodrigues, A. S. et al. 2006. The value of the IUCN red list for conservation. – Trends in Ecology & Evolution 21: 71–76.

534 Saatkamp, A. , Cochrane, A. , Commander, L. , Guja, L. , Jimenez‐Alfaro, B. , Larson, J. , Nicotra, A. , Poschlod, P. ,

535 Silveira, F. A., Cross, A. , Dalziell, E. L., Dickie, J. , Erickson, T. E., Fidelis, A. , Fuchs, A. , Golos, P. J., Hope, M. ,

536 Lewandrowski, W. , Merritt, D. J., Miller, B. P., Miller, R. , Offord, C. A., Ooi, M. K., Satyanti, A. , Sommerville, K. D.,

537 Tangney, R. , Tomlinson, S. , Turner, S. and Walck, J. L. (2019). A research agenda for seed‐trait functional ecology. New

538 Phytologist, 221(4), 1764-1775.

539 Schneider, F. D. et al. 2018. Towards an ecological trait-data standard. – bioRxiv 328302

540 Schrodt, F. et al. 2015. Bhpmf–a hierarchical bayesian approach to gap-filling and trait prediction for macroecology and

541 functional biogeography. – Global Ecology and Biogeography 24: 1510–1521.

542 Sokal, R. and Rohlf, F. 1981. The principles and practice of statistics in biological research. 2 edn. – Freeman, WH, New

543 York.

24 544 Stropp, J. et al. 2016. Mapping ignorance: 300 years of collecting flowering plants in Africa. – Global Ecology and

545 Biogeography 25: 1085–1096.

546 Taugourdeau, S. et al. 2014. Filling the gap in functional trait databases: use of ecological hypotheses to replace missing data.

547 – Ecology and Evolution 4: 944–958.

548 Team eBird July 2014. ebird’s missing species. – Website https://ebird.org/news/ebirds-missing-species/ [accessed 20

549 September 2017].

550 Theophrastus 1916. Enquiry Into Plants, Translated by A.F. Hort. – Harvard University Press.

551 Thiers, B. 2017. The world’s herbaria 2016: A summary report based on data from index herbariorum. – Website

552 http://sweetgum. nybg. org/science/ih/ [accessed 9 September 2017] .

553 Veiga, A. K. et al. 2017. A conceptual framework for quality assessment and management of biodiversity data. – PloS one

554 12: e0178731.

555 Webb, C. O. and Donoghue, M. J. 2005. Phylomatic: tree assembly for applied phylogenetics. – Molecular Ecology Notes 5:

556 181–183.

557 Wheeler, Q. D. et al. 2004. Taxonomy: impediment or expedient? – Science 303: 285.

558 Willis, K.J. (ed.) 2017. State of the World’s Plants 2017. Report. Royal Botanic Gardens, Kew.

559 Wilman, H. et al. 2014. Eltontraits 1.0: Species-level foraging attributes of the world’s birds and mammals. – Ecology 95:

560 2027–2027.

561 Wood, S. 2017. Generalized Additive Models: An Introduction with R. 2 edn. – Chapman and Hall/CRC.

562 Younis, S. et al. 2018. Taxon and trait recognition from digitized herbarium specimens using deep convolutional neural

563 networks. – Botany Letters 0: 1–7.

564 Zanne, A. E. et al. 2014. Three keys to the radiation of angiosperms into freezing environments. – Nature 506: 89–92.

25 Supplementary Material

Cornwell et al. What we (don’t) know about global plant diversity. Ecography

Appendix 1

Preliminary work for this manuscript began in March of 2015. At that time data were downloaded in a similar manner to that presented in the main document. Table S1 shows the species coverage of each dataset at that time. While the data downloads of 2015 and 2018 are not perfectly comparable (due to small changes in data handling/cleaning both by us and by the databases), the changes in species coverage over time are indicative of the relative rate at which shortfalls in our knowledge of plant diversity are being addressed.

Table S1: Coverage of accepted plant species in the three databases in data downloads from March 2015

Genbank GBIF TRY All three Genbank 97921 81988 42832 GBIF 81988 217681 65535 TRY 42832 65535 74327 All three 39650

Appendix 2

Venn diagram showing the relative size and overlap of data used by Zanne et al. (2014) and Díaz et al. (2016). The species were included in these studies because there was sufficient data available for their particular research aims. Total number of species for Zanne et al. (2014): 28,838. Total number of species for Díaz et al. (2016): 2214. Only 1713 species were included in both studies, highlighting the fact that data sufficiency is entirely context dependant. Most of the species that were included in these studies fall within out "broadly-covered" category: 25,235 (79 %) of those used by Zanne et al.

(2014) and 1,952 (88 %) of those used by Díaz et al. (2016). This demonstrates how species that are broadly-covered across databases become more and more studied over time relative to patchily- and poorly-covered species, and how closing the

Venn shortfall will impact the nature of the species-pools which are suitable for inclusion in future global plant research projects.

Figure S1: Relative sizes and overlap of the data for which there were sufficient data for inclusion in Zanne et al. (2014) and

Díaz et al. (2016).

2