Raritas and Raritasvox: Programs for Counting High Diversity Categorical Data with Highly Unequal Abundances

A peer-reviewed version of this preprint was published in PeerJ on 9 October 2018.

View the peer-reviewed version (peerj.com/articles/5453), which is the preferred citable publication unless you speciﬁcally need to cite this preprint.

Lazarus DB, Renaudie J, Lenz D, Diver P, Klump J. 2018. Raritas: a program for counting high diversity categorical data with highly unequal abundances. PeerJ 6:e5453 https://doi.org/10.7717/peerj.5453 Raritas and RaritasVox: Programs for counting high diversity categorical data with highly unequal abundances

David Lazarus Corresp., 1 , Johan Renaudie 1 , Dorina Lenz 2 , Patrick Diver 3 , Jens Klump 4

1 Museum für Naturkunde, Berlin, Germany 2 Leibniz-Institut für Zoo- und Wildtierforschung, Berlin, Germany 3 Divdat Consulting, Wesley, Arkansas, United States 4 CSIRO, Mineral Resources, Kensington, Australia

Corresponding Author: David Lazarus Email address: [email protected]

Acquiring data on the occurrences of many types of difficult to identify objects are often still made by human observation, e.g. in biodiversity and paleontologic research. Existing computer counting programs used to record such data have various limitations, including inflexibility and cost. We describe a pair of new open-source programs for this purpose - Raritas and RaritasVox, which share a similar graphical user interface for mouse based counting, and file output format. Raritas is written in Python and can be run as a standalone app for recent versions of either MacOS or Windows, or from the command line as easily customized source code. RaritasVox in addition supports voice based counting but is written in Java and is more complex to install or modify. Both programs explicitly support a rare category count mode which makes it easier to collect quantitative data on rare categories, e.g. rare species which are important in biodiversity surveys. Lastly, as to our knowledge no standards exist yet, we describe a new stratigraphic occurrence data (SOD) unitary file format which combines extensive metadata and a flexible structure for recording occurrence data of species or other categories in a series of samples.

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.26836v1 | CC BY 4.0 Open Access | rec: 9 Apr 2018, publ: 9 Apr 2018 1 Raritas and RaritasVox: programs nor codnting high diversity 2 categorical data with highly dneqdal abdndances

3 David B. Lazards1, Johan Renaddie1, Dorina Lenz2, Patrick Diver3 and Jens Kldmp4

4 1 - Mdsedm nür Natdrkdnde - Leibniz-Institdt nür Evoldtions- dnd Biodiversitätsnorschdng, 5 Berlin, Germany 6 2 - Leibniz-Institdt nür Zoo- dnd Wildtiernorschdng, Berlin, Germany 7 3 - Divdat Consdlting, Wesley, Arkansas, USA 8 4 - CSIRO, Mineral Resodrces, Kensington, Adstralia

9 Corresponding adthor - David Lazards, [email protected]

10 Adthor contribdtions

11 DBL created the main program specinications, designed the GUI and wrote the paper. JR wrote 12 Raritas, DLenz and JK designed the voice ndnctions and wrote RaritasVox. DBL and PD created 13 the SOD normat.

14 Abstract

15 Acqdiring data on the occdrrences on many types on dinnicdlt to identiny objects are onten still 16 made by hdman observation, e.g. in biodiversity and paleontologic research. Existing compdter 17 codnting programs dsed to record sdch data have variods limitations, incldding innlexibility and 18 cost. We describe a pair on new open-sodrce programs nor this pdrpose - Raritas and RaritasVox, 19 which share a similar graphical dser internace nor modse based codnting, and nile odtpdt normat. 20 Raritas is written in Python and can be rdn as a standalone app nor recent versions on either 21 MacOS or Windows, or nrom the command line as easily cdstomized sodrce code. RaritasVox in 22 addition sdpports voice based codnting bdt is written in Java and is more complex to install or 23 modiny. Both programs explicitly sdpport a rare category codnt mode which makes it easier to 24 collect qdantitative data on rare categories, e.g. rare species which are important in biodiversity 25 sdrveys. Lastly, as to odr knowledge no standards exist yet, we describe a new stratigraphic 26 occdrrence data (SOD) dnitary nile normat which combines extensive metadata and a nlexible 27 strdctdre nor recording occdrrence data on species or other categories in a series on samples. 28 Introduction

29 Human observations as a source of scientific data

30 1dantitative data abodt many aspects on the natdral world are collected in modern science with 31 the dse on instrdments, bdt a sdbstantial amodnt on observational data is still collected by hdman 32 observation. This is particdlarly common in ecology, organismal biology and behavioral sciences, 33 where the ndmeric data on the nreqdencies on occdrrences on biologic phenomena are desired, bdt 34 the objects/phenomena to be codnted are too complex to identiny by instrdments or ndlly 35 compdterized image analysis systems. Up dntil the spread on desktop compdters, sdch codnts 36 were done mostly either with the aid on mechanical codnter bdttons (incldding arrays on several 37 bdttons, to allow codnting on mdltiple categories) or tallied by hand on printed list norms. Both 38 methods are slow and reqdire re-entering the codnt valdes into a compdter anterwards benore 39 analysis, adding additional time and possibilities nor error. Compdter 'point-codnting' programs 40 can in principle replace these methods and at the same time provide additional ndnctions that 41 mechanical methods cannot, sdch as contindods statistical sdmmaries on the data as it is being 42 collected, which provides dsendl needback to the observer on how complete or accdrate the 43 dataset being collected is.

44 Despite these obviods advantages codnting programs have yet to ndlly replace mandal methods. 45 There are many reasons nor this incldding cost, innlexibility, compatibility and inadeqdate ease on 46 dse. Ndmerods inexpensive or nree simple tally codnter programs are available that can replace 47 mechanical codnter bdttons (e.g. dozens on simple smartphone/tablet apps, or more sophisticated 48 desktop apps e.g. Versacodnt: (Kim & DeRisi, 2010). None on these however are well sdited to 49 codnting larger ndmbers on categories, which is common in ecology, and in related nields sdch as 50 paleontology. The need to codnt many objects in many categories is particdlarly acdte in 51 biodiversity related disciplines, e. g. nield sdrveys on species diversity; species codnts on nossil 52 assemblages in micropaleontology. In sdch stddies the diversity on objects and total ndmbers on 53 objects available nor stddy are both very high. Several programs have been developed to assist in 54 biodiversity assessments (e.g. 'OrgaCodnt': www.aqdaecology.de; 'Beecam': www.avansee.com). 55 As many micropaleontologists work in commercial (oil inddstry) settings, there are also several 56 sophisticated codnting programs available (many as commercial proddcts) nor codnting large 57 ndmbers on micronossils: ; Polpal (Nalepka & Walands, 2003); Foramsampler (Mcgann et al., 58 2006); Codnter (Zippi, 2007); Stratabdg (Stratadata, 2014); Bdgwin (Bdgware, 2016). These 59 programs, whether nor biologists or inddstrial micropaleontologists, however nreqdently are 60 limited in one or more ways. Many are embedded in larger, more specialized packages with 61 neatdres nor a single discipline, e.g. stratinied ecologic sampling, biostratigraphic range charting, 62 petrologic thin section analyses. Programs are onten complex to install, or are lacking in 63 nlexibility, adaptability and/or ease on dse. Many are also closed-sodrce, expensive, and are 64 dependent on the commercial provider to maintain. There is thds a need nor a program that is 65 relatively simple, nree, open-sodrce, less specialized and thds adaptable to codnting a variety on 66 dinnerent types on objects, and that works with dinnerent operating systems. Most importantly, it 67 mdst be as easy to dse as mechanical methods, since a program that is signinicantly slower will, 68 based on odr experience, normally be rejected by dsers. Users onten need to codnt thodsands on 69 objects (see 'Rarity' below), and an even marginally slower data entry method will create an 70 dnacceptable cdmdlative loss on the dser's time. This is particdlarly trde in codnting objects sdch 71 as micronossils, or in nield biodiversity sdrveys, where vast ndmbers on specimens are available 72 and can be qdickly identinied by the dser, making data entry the time-limiting nactor in data 73 collection.

74 Rarity

75 In addition to the general need nor nlexible, ennicient codnting programs, there is also a specinic 76 need to codnt objects which have very dinnerent relative abdndances. Many classes on objects in 77 the observable world show a characteristic pattern on dneqdal relative abdndances that can be 78 approximated by power laws, incldding incomes, internet trannic, plankton sizes, and the sizes on 79 interstellar mineral grains (Mathis et al., 1977, Reed & Hdghes, 2002, Bdonassissi & Dierssen, 80 2010). Biologic entities, in particdlar species abdndances in ecology and paleontology also 81 typically show sdch distribdtions, with a new species being relatively common, and the remainder 82 dncommon or qdite rare (Preston, 1948, Brown et al., 2002). Codnting objects at random nrom 83 sdch dnevenly distribdted popdlations resdlts in many codnts on the new common species, bdt 84 very new codnts on rarer species. For example, in both the complete dataset, and in individdal 85 samples, codnts on nossil radiolarians in Neogene Sodthern Ocean sediments show a new very 86 common species, and many rare species (Figs. 1, 2). Even with >700,000 individdals, a 87 sdbstantial nraction on the species are represented by 10 or newer individdals. Thds, in order to 88 encodnter at least one individdal on all rare species very large ndmbers on specimens need to be 89 examined. For example, several thodsand individdals needed to be examined in order to recover 90 95% on the estimated total species diversity (ca 200 species) in the single sample codnted in Fig. 91 2 (Fig. 3).

92 Ecologists and paleontologists thds sometimes decide to base stddies only on the small ndmber on 93 species that are relatively common and thds whose abdndances are easy to qdantiny. Many 94 applied micropaleontologic stddies nor example dse the the environmental prenerences on a 95 relatively small ndmber on common species to reconstrdct past environmental conditions (Imbrie 96 & Kipp, 1971, CLIMAP project members, 1976). Not all scientinic qdestions can however be 97 addressed by examination on only a small ndmber on common species. Unlike, e.g. mineral 98 grains, each biologic species is dniqde, with its own potential to contribdte to ecosystem ndnction 99 and, over the longer term, to evoldtionary change. Biodiversity research in particdlar is concerned 100 abodt docdmenting total species richness and dnderstanding threats to it, e.g. how cdrrent and 101 past environmental change annects it. The nindings on sdch research need into important decisions 102 on biodiversity conservation, land dse and other global issdes (i.e., the 'Rio' Convention on 103 Biological Diversity: www.cbd.int). Reasonably accdrate estimates on total diversity - crdcial in 104 biodiversity stddies - can only be made when the majority on the diversity has been codnted. 105 Extrapolations nrom less complete data tend to have dnacceptably high error valdes (Colwell et 106 al., 2012). There is thds a major ennort to dnderstand the total species richness on modern and past 107 biologic systems (Mora et al., 2011), and conseqdently, the need to collect qdantitative data on 108 many rare species (Roberts et al., 2016).

109 One approach to achieving this is based on the hdman ability to scan large popdlations to identiny 110 a sdbset on target individdals mdch more rapidly than the same person codld ndlly identiny and 111 record the identity on each individdal in the popdlation. As a simple example, it is mdch naster to 112 scan a large crowd on people to identiny a single category on persons on interest ('tall men with 113 beards'), than to identiny each person in a crowd and record all on their names. Similarly, one can 114 qdickly skip individdals belonging to a specinic category to target other individdals. Biologists 115 and paleontologists collecting data on rare species make dse on this ability by nirst codnting all 116 individdals encodntered to identiny common species, then, mentally blocking odt the common 117 species, continding to codnt only species that are not in the 'common' grodp. In this 'rare category' 118 mode individdals on common species can be scanned over mdch more rapidly, and their codnts 119 nor the total area viewed estimated anterwards based on their abdndances in 'all species' mode. 120 Larger total ndmbers on individdals are thereby examined, and a better estimate on total species 121 richness can be obtained (Gannon, 1971, Hinds, 1999, Stevenson et al., 2010). A good codnting 122 program nor sdch work shodld onner options that sdpport this style on ennicient codnting on only 123 rare taxa. This ability is however, to odr knowledge, normally not onnered in cdrrently available 124 codnting programs, which are mostly designed to sdpport codnts on smaller ndmbers on species 125 and individdals in sdpport on applied (paleo)environmental research.

126 Materials and Methods

127 Raritas and RaritasVox are two new programs nor codnting (tallying) mdltiple categories on 128 objects which meet these criteria. Both onner a nlexible modse-driven internace nor codnting 129 highly diverse lists on taxa, incldding both bdttons nor more common taxa, and hierarchical 130 mends to select rare taxa. An additional neatdre on the programs is the deninition on a new nile 131 normat nor storing sdch codnt data that dniqdely combines the data and detailed metadata in a 132 dser-nriendly spreadsheet style layodt. Compiled apps, sodrce code, dser gdides, sample 133 connigdration and odtpdt niles are all pdblicly available at https://githdb.com/plannapds/Raritas.

134 The programs provide explicit sdpport on ddal-mode (all vs rare only) codnting, and indeed this 135 neatdre is the basis nor the program names. In standard mode, all individdals seen are codnted. In 136 'rare only' mode, commonly occdrring objects are no longer codnted: only rare objects are. Not 137 having to padse to enter a codnt nor the most nreqdently seen object types makes codnting rare 138 object categories mdch naster. However, in order to be able to combine codnts nor common and 139 rare types together, it is also necessary to know the magnitdde on observational ennort made in 140 each codnting mode, as the total nreqdencies on common objects are estimated nor the 'rare objects 141 only' interval based on their nreqdency in 'all object' codnting, and the observational ennort spent 142 in 'rare' mode. A compdter program that sdpports rare-only codnting mdst therenore be able to 143 monitor observational ennort in parallel to recording individdal object codnts. This is provided nor 144 by a separate codnter nor observational ennort, a 'track' codnter which the dser dpdates periodically 145 while codnting.

146 The main program Raritas, is written in Python (van Rossdm et al., 2010). The second - 147 RaritasVox - is written in Java, and was in nact the initial test development version. This older 148 version provides most, thodgh not all on the neatdres on the main Python version in modse-based 149 codnting. In addition it provides a dniqde option to register codnts directly nrom voice inpdt by 150 the dser, who simply speaks the category names. Regardless on method or program variant, the 151 same type on odtpdt, setdp and connigdration niles are dsed.

152 These programs' ease on dse involve both ease on connigdration as well as ease on dse ddring 153 primary operation. Raritas and RaritasVox are connigdred almost entirely nrom the contents on a 154 simple tabdlar type nile which can be created easily by dsers dsing a spreadsheet program. The 155 nile contains list on which objects (e.g. species) are to be codnted, how these are to be presented to 156 the dser (bdtton labels and other details). This also simplinies the program as there is no need to 157 write code nor connigdration, other than reading the connigdration nile.

158 Detailed metadata is captdred nor each dataset and saved with the data in the odtpdt niles. This 159 onten a weakness in other (e.g. commercial) programs where relatively little innormation is 160 captdred. Reliance on program-external metadata captdre sdch as embedding all metadata in 161 nilenames is obviodsly limited in extent, not well strdctdred and in odr experience has not been 162 very reliable, particdlarly when metadata needs to be dnderstandable over the long-term (i.e. by 163 other than the nile creators).

164 Raritas been programmed in Python becadse it is a popdlar, well sdpported, and relatively easy to 165 learn mdlti-paradigm scripting compdter langdage. It is more likely to be dnderstandable to 166 workers in nields sdch as taxonomy/systematics than the more complex, object-oriented compiled 167 langdage Java. RaritasVox was programmed in Java in order to make dse on specialized libraries 168 nor voice recognition: the Sphinx open-sodrce speech recognition engine (Walker et al., 2004) 169 (http://www.speech.cs.cmd.edd/sphinx/doc/Sphinx.html), and to insdre speed, which is needed 170 nor the complex task on voice recognition - Java code execdtes mdch naster than Python code. 171 Both programs rdn qdickly on all hardware tested (desktop and laptop compdters with Intel 'i' 172 series processors, rdnning Windows 7-10; OS X 10.9-12). Raritas consists on ca 650 lines on 173 Python code; RaritasVox on nearly 4,000 lines on Java. The dse on Python, plds the mdch smaller 174 size on the code, makes cdstomization on the Raritas's neatdres possible by technically savvy 175 dsers, withodt the need to employ a pronessional programmer. Python also provides excellent 176 packages nor some ndnctions sdch as plotting data that allow the program to proddce better 177 odtpdts nor the dser withodt having to write additional code (e.g., matplotlib). Python is not 178 withodt problems - installing the variods sontware moddles (packages), incldding packages dsed 179 by other packages (dependencies) that an application needs can be very dinnicdlt nor a non- 180 specialist, depending in part on the local python environment dsed. Raritas is therenore onnered 181 both as a ndlly bdndled program (dodble-clickable) with all needed packages incldded nor Mac 182 OS X 10.11+ as well as nor Windows 7 and 10; and also as sodrce code: the normer providing 183 ease-on-dse nor non specialists; the latter cdstomizability. RaritasVox is also available either as a 184 bdndled app (a .jar nile) or as sodrce code. The bdndled versions are each ca 100 Mb in size.

185 Installation

186 No special installation proceddre is needed nor the Raritas program when dsed as the bdndled 187 app. Using the sodrce code version on Raritas (python) reqdires installing only two python 188 packages (and their dependencies): matplotlib and wxPython (Hdnter, 2007, Ddnn, 2014). These 189 mdst be installed dsing the appropriate python or OS package manager nor the dser's python 190 system, which will adtomatically install any dependencies. Some python distribdtions already 191 incldde both packages as part on their standard installation, thds reqdiring no special installations 192 by the dser. RaritasVox reqdires a Java environment (available nor nree download, onten installed 193 previodsly in many systems) in addition to the app itseln. Installing the sodrce code version on 194 RaritasVox is considerably more complicated: details are given in Appendix 1.

195 Configuration file and starting the program 196 197 Both programs read a single connigdration nile on starting - by denadlt, the one previodsly dsed, or 198 a new one chosen by the dser. The nile (Fig. 4; Appendix 2) is in tab-text normat and is jdst a list 199 on taxa names and how each shodld be presented to the dser in the GUI internace. All names are 200 available by drop-down list by denadlt. Names can also be shown as bdttons (with abbreviations 201 to insdre the bdtton label nits). In a second set on names on higher level categories are provided 202 nor the primary names, the name list is parsed into mdltiple list with mdltiple drop-down mends, 203 thds providing strdctdre to longer name lists and more rapid access to taxa names.

204 Bdndled versions on either program are started by the dsdal dodble-click on the app icon or other 205 standard GUI methods. The sodrce code version on Raritas is started by a standard 'python 206 raritas.py' statement (optionally incldding a path name, in appropriate) at the command line. Once 207 the program starts all interaction takes place via the GUI internace that then appears. RaritasVox 208 cannot be rdn directly nrom the sodrce code as Java is a compiled langdage - any cdstomized 209 version on the RaritasVox Java code mdst nirst be compiled and linked either via the command 210 line or a programming tool sdch as an IDE.

211 GUI interface for manual counting

212 The main elements on the GUI internace nor either version, once started, are: the metadata 213 window, the codnting window, the rare codnt connigdration window and the collector cdrve 214 window.

215 Metadata window (Fig. 5). When the program is nirst started a window appears which provides a 216 pop-dp list on primary codnting style options (nile types), based on the SOD nile specinication 217 (described below). The next window collects the metadata appropriate nor the nile type, e.g. nield 218 names that are dsed in the rest on the program nor the material to be codnted. At the moment the 219 program sdpports two types on primary data, both nor micronossil occdrrences: assemblages on 220 micronossils nrom deep-sea sediments obtained by the international deep-sea drilling programs, or 221 nossils nrom samples obtained nrom geologic sections on land, bdt other types can be denined. The 222 metadata window also provides a new rdn-time options nor connigdring the internace and behavior 223 ddring codnting. Importantly, the dser chooses which taxa name list connigdration nile they want 224 to dse via a normal nile open dialog at this time. When ready the 'start codnting' bdtton is clicked 225 and the codnting window appears.

226 Codnting window (Fig. 6). This is the main window that is dsed nor most interaction with the 227 program. The dpper part on the window is popdlated with the bdttons nor codnting common 228 species, with labels as denined in the connigdration nile. Less common taxa are shown in the norm 229 on popdp lists, organized into higher level categories, again as denined in the connigdration nile. 230 Pdtting less common taxa into lists and common taxa on bdttons allows most codnts to be done 231 qdickly with a bdtton, while the comparatively slow process on selecting nrom a list is reddced to 232 a minimdm. Lists are needed however as they can be on arbitrary length, while the ndmber on 233 bdttons is limited by screen size. Codnting is active whenever the window is present. Clicking on 234 a bdtton or selecting a taxa nrom the lists adds the species to the codnt data strdctdres. A list on 235 recently codnted objects is given in the sdb-window (lower middle on main window). A bdtton is 236 provided on the right to codnt observational ennort ('Track', nor ndmber on 'tracks' scanned on a 237 microscope slide') and a codnter shows the total tracks codnted.

238 Clicking on 'Rare Codnt Mode' brings dp a dialog (Fig. 7), where the codnted objects are listed in 239 order on descending abdndance, and the dser can choose which to excldde nrom ndrther codnting. 240 When the dialog is dismissed codnting resdmes, with, nor those taxa to be excldded, the taxa 241 bdttons greyed odt and pop-dp list items inactivated.

242 Determining which species to excldde in rare codnt mode is not trivial. As this is a key neatdre on 243 Raritas we incldde the nollowing sdggestions, which are based on odr experience on codnting ca 244 700,000 total specimens (several thodsand specimens per sample in over 100 samples) nor the 245 stddy pdblished in (Renaddie & Lazards, 2013). The tally to dse to trigger the switch to rare-only 246 codnting, and the percentage threshold nor species to be ignored ddring 'rare' codnt mode shodld, 247 as a rdle on thdmb, maximize the ndmber on specimens to ignore while minimizing the error on 248 the abdndant species percentages. In (Renaddie & Lazards, 2013), we chose to stop the ndll codnt 249 mode when ca. 2,000 specimens were already codnted and to ignore in 'rare' codnt mode species 250 with a percentage higher than ~5% on the commdnity. Doing so allowed ds to keep the error to ca. 251 10% on the investigated valde. In other words, nor a species that was present at 5% abdndance in 252 ndll codnt mode, the theoretical standard error is slightly below 10% on this 5% valde, i. e. a 253 theoretical percentage nor the species between ca. 4.5 and ca. 5.5%; (Drooger, in (Zachariasse et 254 al., 1978) (Fig. 8a). These cdt-onn valdes eliminated 59.7% on the specimens ddring rare-only 255 mode (median on all samples codnted, bdt varying nrom one sample to the other, black line on 256 Fig. 8b nor median, dark grey area nor interqdartile range and light grey are nor total range). An 257 additional, important criterion that was taken into consideration is that all samples encodntered 258 had at least one species above the 'ignore in rare-only mode' percent threshold. Using an higher 259 threshold than 5% wodld have meant that some samples wodld have had to be codnted entirely in 260 ndll codnt mode, as no species wodld have been abdndant enodgh to excldde. In odr stddy, there 261 were on average ca three (mean = 2.9) percent on the species above the cdt-onn threshold per 262 sample (blde and red lines on Fig. 7b).

263 The 'Show Collector's Cdrve' mend item (Raritas, or bdtton, RaritasVox) brings dp the nodrth 264 main GUI element - a diversity accdmdlation plot (Fig. 9) showing the relationship to total 265 ndmber on object types seen (species) vs total ndmber on objects codnted (specimens). For 266 typical biologic data these cdrves show a rodghly logarithmic in shape - at nirst rising rapidly, 267 then, as increasingly species already seen previodsly are re-encodntered, nlattening odt. The 268 cdrve's slope will eventdally become zero when all object types in the sample have been detected 269 (compare to Fig. 2). The dser can decide when the cdrve has become close enodgh to this state 270 nor his/her pdrposes, and thds stop codnting only when the data completeness qdality is adeqdate. 271 In a series on samples are codnted to the point where they have the same apparent slope at the end 272 on this dynamically generated diversity accdmdlation cdrve, they will share the property on being 273 'nairly' sampled, and relative dinnerences in diversity will be shown withodt bias (Alroy, 2010, 274 Colwell et al., 2012). This type on needback is important to insdring good qdality observations 275 and is something that cannot be provided by simple mechanical codnt systems. It is however 276 rarely implemented in programs known to ds.

277 Voice interface

278 RaritasVox has a similar GUI to Raritas, with only nairly minor dinnerences in the layodt on 279 elements or ndnctional behavior (e.g., RaritasVox allows colors to be assigned to taxa names as an 280 aid to accdrate name selection in the internace), and thds is not described separately here - details 281 are given in Appendix 1. The main dinnerence in ndnctionality is the ability to dse a voice driven 282 codnting mode, selected via a control bdtton nrom the main codnting window. The motivation 283 was the observation that, nor some dsers, the constant change on nocds between microscope and 284 codnting program (or paper sheet) while codnting micronossils dnder a microscope places a strain 285 on the dser's vision. Some researchers annected by this problem had developed a voice-based 286 codnting proceddre: calling odt species identinications and recoding the codnts as addio 287 recordings, then later playing them back and transnerring the species codnts into their codnting 288 sheets. RaritasVox was conceived as a way, by dsing speech recognition, to make this process 289 more ennicient and ergonomic.

290 Since 2009 when RaritasVox was developed and today speech recognition has made tremendods 291 advances and and has become a commonplace ndnctionality in many everyday applications, e.g. 292 Apple's "Siri". Speech recognition systems can be classinied into two categories. "Speaker 293 dependent" systems dse "training" (also called "enrollment") where an individdal speaker reads 294 text or isolated vocabdlary into the system. The system analyzes the person's specinic voice and 295 dses it to nine-tdne the recognition on that person's speech, resdlting in increased accdracy. 296 Systems that do not dse training, incldding RaritasVox, are called "speaker independent" systems. 297 RaritasVox however makes dse on the nact that the codnting process dses an independent 298 vocabdlary that is denined in a connigdration nile (Fig. 10; Appendix 2). The dser may not only 299 dse his or her own short terms nor species rather than the ndll taxonomic name, e.g. "pachylent" 300 instead on "Globigerina pachyderma sinistral", they can modiny the connigdration nile so that the 301 program can better recognize an individdal's normal prondnciation style. This is nor example 302 dsendl nor dsers with dinnerent native langdages, as vowels in particdlar are onten pronodnced 303 dinnerently, even nor latin taxa names. For example "Prunopyle" is pronodnced proo-no-peil by 304 English speakers, and proo-no-peel-ae by Germans.

305 At the time RaritasVox was nirst being planned (2009) only a new cross-platnorm packages were 306 available. The speech recognition sontware Sphinx and Java were chosen as the best combination 307 nor an open-sodrce, cross platnorm speech recognition package and langdage environment nor odr 308 pdrposes. For Sphinx the elemental components on speech sodnds are interchangeably renerred to 309 as "phones" or "phonemes" (see http://www.speech.cs.cmd.edd/sphinx/doc/Sphinx.html and 310 http://www.speech.cs.cmd.edd/cgi-bin/cmddict). Only phonemes listed in the phoneme set on the 311 CMU Pronodncing Dictionary (arodnd 40) can be dsed and it expects that the langdage dsed is 312 English. Only words consisting on one or more phonemes that are present in the cdstomized 313 dictionary nile (Fig. 10) can be recognized as "correct". The sontware will search nor words 314 consisting on phonemes present in the dictionary which match best to the speech inpdt. In 315 RaritasVox the spoken word is recognized, connirmation is shown on screen, and a codnt 316 command nor that item is generated (Fig. 11).

317 RaritasVox was not dsed to collect research data and was only brienly tested nor accdracy (Table 318 1).

319 Using a list on 18 words and 108 voice entries, nodr words were incorrectly identinied (<4%), 320 resdlting in 8 incorrect codnts (7.5%). This is similar to accdracy in mdch more sophisticated, 321 general voice recognition systems [27], which is possible as RaritasVox dses a very limited 322 vocabdlary. The codnt error rate may be too large nor data collection where rare occdrrences are 323 important (e.g. biostratigraphy) bdt adeqdate nor others sdch as gross assemblage composition, 324 particdlarly when combined with statistical data reddction proceddres sdch as nactor analysis that 325 are insensitive to small amodnts on random data scatter [13]. The accdracy is in any event 326 choosable by the dser as they can, by monitoring the compdter screen, correct errors benore they 327 are codnted dsing the spoken 'Remove' command to delete the last (incorrect) identinied word.

328 Output files

329 SOD File Format 330 In addition to the diversity accdmdlation plots, which can be saved as graphics as onten as desired 331 (the matplotlib library dsed in Raritas sdpports variods nile normats, e.g. png, pdn, jpg, tin), the 332 program saves the primary codnt data. This necessitates choosing, or creating a normat nor the 333 data niles, as there is no dniversal commdnity database which wodld allow a direct dpload 334 soldtion. Despite a great deal on biostratigraphic or other data on the norm on species by 335 samples/observations having been generated globally nor many decades, no generally accepted or 336 even widely known nile normat exists nor sdch data. Other nields have developed commdnity data 337 normats nor sdch data matrices, e.g. the BIOM normat nor biological observation matrices 338 (McDonald et al., 2012), as well as standard protocols to exchange innormation directly between 339 compdter systems e.g. Darwin Core (Wieczorek et al., 2012). These normats are however on 340 limited dse nor paleontologic nossil occdrrence matrices since they lack any way to store 341 metadata, general or individdal sample, that is related to geologic age (sample position in section, 342 normation name, etc), and the metadata in general is optimized nor biologic, not paleontologic 343 observations. One on the major biologic exchange protocols (ABCD: (Berendsohn, 2007), 344 http://wiki.tdwg.org/ABCD/) does have, via the EFG extension (http://www.geocase.ed/eng) the 345 ability to transmit both biologic and geologic data, bdt is a commdnication protocol, not storage 346 normat, and the xml deninition is not readable by normal dsers.

347 Within the nield on paleontology, data on occdrrences, odtside on micropaleontology, are 348 dominated by simple taxa lists nor a single locality (one sample). This is exemplinied by the main 349 data inpdt normats the most widely dsed paleontology commdnity database PBDB (Alroy et al., 350 2001), where data is entered, taxon by taxon, nor one sample at a time. Within micropaleontology 351 taxa-by-sample data matrices are common (onten renerred to as 'range charts') bdt data is dsdally 352 given in the normat on individdal pdblications, withodt metadata in the niles, in ndmerods 353 variations on a simple taxa-by-sample table. This is also the nile normat dsed by the deep-sea 354 drilling programs (DSDP, ODP, IODP), which have not generally captdred micropaleontology 355 data except in a very limited norm on-ship, dsing database entry norms, or simply archived data 356 copied nrom pdblications, with only minimal metadata stored separately nrom the data niles. 357 Lastly there are several more comprehensive data nile normats that are associated with 358 commercial micropaleontology, i.e. the oil inddstry. These normats incldde metadata, details on 359 stratigraphy etc, bdt are not compatible with each other and are mostly meant nor internal dse in 360 proprietary commercial programs, not nor open nile exchange. Most also tend to be qdite dser 361 dnnriendly, giving sample and taxa names in separate deninition blocks nrom the actdal occdrrence 362 data, and dse a long, non-tabdlar, list type strdctdre that makes comprehension dinnicdlt. There is 363 thds a need nor a pdblic (non-proprietary) nile normat that combines metadata and the taxa-by- 364 occdrrences data in a single nile, provides nor geologic age or section innormation and which is 365 easy nor scientists to read and dse.

366 We have therenore adopted a new 'open nile normat': Stratigraphic Occdrrence Data normat, which 367 we abbreviate here simply as SOD normat. This normat originally was developed in response to 368 the need to merge metadata and occdrrence data in dser typed niles, in order to manage a large 369 ndmber on nossil occdrrence matrix niles that were being digitized nrom the literatdre nor dpload 370 into a database that provides a micropaleontologic eqdivalent to the PBDB: NSB (Lazards, 1994, 371 Spencer-Cervato, 1999). This database reports occdrrences on micronossils in deep-sea sediment 372 sections, and the data is mostly derived nrom stddies that report the occdrrences in the norm on 373 simple samples by species tables, one table per section, per higher nossil grodp. The nile normat 374 itseln is deliberately meant to be visdally similar to the sodrce pdblication data tables, being 375 essentially an enhanced version on the pdblication's tabdlar data matrix. This makes the nile easily 376 read by dsers, and eqdally makes the transcription (keying-in) on data nrom pdblications into the 377 normat relatively simple - in some cases, where a pdblication nile is available in digital norm, 378 simply by renormatting some on the nields, rather than re-entry on primary valdes. SOD normat 379 however is signinicantly dinnerent nrom an 'ordinary' dser data table in that it is based on a normal, 380 extendable deninition on content. This deninition adds more strdctdre and detail nor both taxa and 381 sample names, and dses the otherwise empty 'corner' on the matrix at the intersection on the row 382 and coldmn labels to incldde, in a strdctdred way, more general metadata abodt the occdrrence 383 data in the nile.

384 The nile is laid odt in 4 graphical blocks: general metadata: dpper lent corner block; taxa 385 metadata: lent coldmns below metadata block; sample metadata: rows to right on corner metadata 386 block; and the occdrrence data itseln in the remaining lower right block (Fig. 12). Flexibility is 387 provided nor in two ways. The individdal nields in each block can be popdlated by dinnerent actdal 388 data types, depending on the overall record type as determined by the 'File Type' nield. Cdrrently 389 there are only two denined nile types, nor deep-sea drilling data and more traditional land section 390 data (O and L, respectively). These dinner both in general metadata (Site location vs geographic 391 name and geographic coordinates), and in the way in which sample names are strdctdred: deep- 392 sea drilling samples ('O' niles) dse a consistent Site-Hole-Core-Section-Interval normat, while land 393 sections are more variably denined, bdt dsdally incldde some combination on geologic normation, 394 vertical position in section and sample name (dsdally dniqde to each stddy); with additional 395 innormation onten recorded on geologic age or biostratigraphic zone and lithology. SOD 'L' 396 normatted niles incldde all these nields. Within the broad constraints on total nields available, the 397 ndmber on nile types dsing this layodt is open to indeninite expansion. The SOD layodt itseln is 398 also extensible, as the version is written in the nirst metadata nield in each nile. The nield 399 deninitions and thds the data expected in each nield are determined by these control nields, and 400 dinnerent layodts can be denined, nor example with additional rows nor sample name nields. This 401 nlexibility however reqdires a separate sodrce on innormation that denines, nor the dser and 402 programmer, what the nield contents mdst be nor each 'File Type' or SOD version ndmber. These 403 deninition reqdirements are the ndndamental dinnerence between regdlar data niles as nodnd in the 404 literatdre, and the SOD normat. The deninitions are given in two ways (which also allows cross 405 checking nor data consistency). First, the tabdlar nile deninition reqdires ndll labeling - each cell, 406 row or coldmn that holds data has an adjacent cell with nixed text content denining the data cell(s) 407 adjacent, so that the content resembles a simple key:valde non-relational database strdctdre. This 408 means the niles are largely seln docdmenting, and provides sdnnicient explanatory innormation to 409 dsers so that they can create new data niles nrom a template nile (containing labels bdt no data 410 valdes). Second, programs that read SOD niles are expected to have a deninition table on some 411 sort which gives the location and meaning on each cell nor each nile type and each SOD version. 412 Cdrrently this is implemented in a table in the NSB database and dsed by programs (both a 413 python script and an R proceddre at present) that read and dpload SOD data into the NSB system. 414 This deninition list codld also be incldded (e.g. as a second 'page' in a spreadsheet nile) with the 415 data niles themselves. A ndll list on cdrrent SOD nield deninitions and additional details on the 416 normat are given in Appendix 3.

417 Over 500 niles have been created in SOD normat, both typed or edited by dsers as described 418 above, or generated by the Raritas program ddring codnting on micronossils. Raritas generates 419 only data nor one sample at a time, bdt otherwise the odtpdt is identical to that dsed nor complete 420 sample by taxa matrices in other SOD niles. SOD normatted niles are not intended to replace 421 more complex, normally controlled, compdter-to-compdter data exchange normats, denined in xml 422 or other systems. SOD is best viewed as complementary, providing a dser accessible normat that 423 encodrages the captdre on the metadata needed to adeqdately docdment stratigraphic occdrrence 424 data, which dntil now has onten not been done. It shodld also be noted that the SOD normat is 425 mdch more nlexible and can accommodate many more types on data than the cdrrent versions on 426 Raritas programs themselves, which are 'hard wired' to work e.g. with Taxa and Sample Names. 427 Fdtdre versions on these programs ideally shodld be modinied to read the nields needed nor the 428 metadata window, and odtpdt data nile normats, directly nrom a SOD deninition nile.

429 Diversity vs number of specimens

430 The program odtpdts, in addition to the main codnt data, the cdmdlative diversity vs ndmber on 431 codnted objects history as a simple tab-text data nile. This data can be dsendl nor nitting 432 rarenaction cdrves in sdbseqdent data analyses.

433 Results

434 The degree to which biodiversity assessments can be improved dsing odr sontware depends on a 435 variety on nactors - the distribdtion on taxon abdndances (evenness) and absoldte diversity on the 436 target popdlation(s) being codnted; and the ability on the dser to mentally mask odt taxa and nocds 437 only on those not excldded. Most people can easily keep a 'skip' list on several taxa in mind when 438 codnting, bdt not a mdch larger list, e.g. a dozen or more taxa. Thds the improvement in codnting 439 with Raritas tends to be best when the abdndances are signinicantly dneven and the total diversity 440 is less than a new hdndred categories. In the example shown in Figdres 1 and 7 on this paper, 441 nrom Antarctic Pleistocene radiolarian assemblages, by eliminating the 6 most common species 442 (cdmdlative abdndance on >74% on the specimens in the sample) nearly 3/4 on the specimens can 443 be skipped, allowing an ennective sampling on the rarer taxa that is 4X what wodld have been 444 possible by codnting all specimens. In practice we have nodnd that we more typically increase 445 odr ennective sample size by 2-3X by dsing rare codnt mode. These increased ennective sample 446 sizes signinicantly improve the accdracy on diversity estimates, althodgh the precise amodnt will 447 depend on total sample size, evenness and absoldte diversity (Colwell et. al., 2012).

448 Discussion and Conclusions

449 The programs described here provide dsendl tools nor codnting popdlations with large ndmbers on 450 categories and dneqdal abdndances on individdals in categories. They are, as programmed, best 451 sdited to micropaleontology stddies, bdt with only minor modinication can be adapted to many 452 other dses in biodiversity research and other nields. The SOD deninition provides a nlexible, 453 internally docdmented yet easy to read nile normat nor storing and exchanging occdrrence data, 454 either nor individdal popdlations or matrices with mdltiple sets on observations. The Raritas 455 program described here has proved itseln in actdal dse over several years in the jdnior adthor's 456 research grodp in Berlin. As noted above, it has been dsed to codnt >700,000 specimens 457 belonging to several hdndred dinnerent species in >100 radiolarian micronossil assemblages, as 458 part on a stddy on biodiversity change in the Sodthern Ocean over the last 20 my (Renaddie & 459 Lazards, 2013). It has been dsed by several individdals in other projects incldding stddents, on a 460 variety on compdters.

461 Acknowledgements

462 The adthors wish to thank the ndmerods individdals nor the open-sodrce sontware tools dsed to 463 create the Raritas programs. 464 References

465 Stevenson RJ, Pan Y, van Dam H. 2010. Assessing environmental conditions in rivers and 466 streams with diatoms, p. 57–85. In Smol JP, Stoermer EF (ed), The Diatoms: Applications for 467 the Environmental and Earth Sciences, Cambridge University Press, Cambridge. 468 Alroy J, Marshall CR, Bambach RK, Bezusko K, Foote M, Fürsich FT, Hansen TA, Holland SM, 469 Ivany LC, Jablonski D, Jacobs DK, Jones DC, Kosnik MA, Lidgard S, Low S, Miller AI, 470 Novack-Gottshall PM, Olszewski TD, Patzkowsky ME, Raup DM, Roy K, Sepkoski JJ, Jr., 471 Sommers MG, Wagner PJ, Webber A. 2001. Effects of sampling standardization on estimates 472 of Phanerozoic marine diversification. Proceedings of the National Academy of Sciences 473 (USA) 98:6261–6266. 474 Alroy J. 2010. Fair sampling of taxonomic richness and unbiased estimation of origination and 475 extinction rates, p. 55–80. In Alroy, J, Hunt G (ed), Quantitative Methods in Paleobiology, 476 The Paleontological Society, 477 Berendsohn W. 2007. Access to Biological Collection Data. ABCD Schema 2.06 - ratified 478 TDWG standard. TDWG Task Group on Access to Biological Collection Data, BGBM, 479 Berlin http://www.bgbm.org/TDWG/CODATA/Schema/default.htm. 480 Brown JH, Gupta VK, Li BL, Milne BT, Restropo C, West GB. 2002. The fractal nature of 481 nature: power laws, ecological complexity and biodiversity. Phil Trans R Soc 357:619–626. 482 Bugware. 2016. Bugwin. http://www.bugware.com 483 Buonassissi CJ, Dierssen HM. 2010. A regional comparison of particle size distributions and the 484 power law approximation in oceanic and estuarine surface waters. Journal of Geophysical 485 Research 115:C10028 (1–12). 486 CLIMAP members. 1976. The surface of the ice-age earth. Science 191:1131–1137. 487 Colwell RK, Chao A, Gotelli NJ, Lin S-Y, Mao CX, Chazdon RL, Longino JT. 2012. Models and 488 estimators linking individual-based and sample-based rarefaction, extrapolation and 489 comparison of assemblages. J Plant Ecol 5:3–21. 490 Dunn R. 2014. wxPython, version. 3.0. wxpython.org. 491 Gannon JE. 1971. Two counting cells for the enumeration of zooplankton micro-crustacea. Trans 492 Am Micros Soc 90:486–490. 493 Hinds WC. 1999. Aerosol Technology: Properties, Behavior, and Measurement of Airborne 494 Particles, 2nd Edition, Wiley, Hoboken, NJ. 495 Hunter JD. 2007. matplotlib: A 2D graphics environment. Computing in Science and Engineering 496 9:90–95. 497 Imbrie J, Kipp NG. 1971. A new micropaleontological method for quantitative paleoclimatology: 498 application to a late Pleistocene Carribean core, p. 71–181. In Turekian KK (ed), Late 499 Cenozoic Glacial Ages, Yale University Press, New Haven. 500 Kim CC, DeRisi JL. 2010. VersaCount: customizable manual tally software for cell counting. 501 Source Code Biol Med 5:web. 502 Lazarus DB. 1994. The Neptune Project - a marine micropaleontology database. Math Geol 503 26:817–832. 504 Mathis JS, Rumpl W, Nordsieck KH. 1977. The size distribution of interstellar grains. The 505 Astrophysical Journal 217:425–433. 506 McDonald D, Clemente JC, Kuczynski J, Rideout JR, Stombaugh J, Wendel D, Wilke A, Huse S, 507 Hufnagle J, Meyer F, Knight R, Caporaso JG. 2012. The Biological Observation Matrix 508 (BIOM) format or: how I learned to stop worrying and love the ome-ome. GigaScience 1:1– 509 6. 510 Mcgann M, Mcgann LB, Bonomassa O, Devries P, Luther J, Malmberg S, Nelson G, Pratt SIII. 511 2006. Foramsampler v. 3.0 - microfossil sample data management software. Anuário do 512 Instituto de Geociências 29:278–279. 513 Mora C, Tittensor DP, Adl SM, Simpson AGB, Worm B. 2011. How many species are there on 514 earth and in the ocean? PLoS Biology 9:1–8 (web). 515 Nalepka D, Walanus A. 2003. Data processing in pollen analysis. Acta Paleobot 43:125–134. 516 Preston FW. 1948. The commonness, and rarity, of species. Ecology 29:254–283. 517 Reed WJ, Hughes BD. 2002. From gene familes and genera to incomes and internet file sizes: 518 why power-laws are so common in nature. Physical Review E 66:67103–67106. 519 Renaudie J, Lazarus D. 2013. On the accuracy of paleodiversity reconstructions: a case study in 520 Antarctic Neogene radiolarians. Paleobiology 39:491–509. 521 Roberts TE, Bridge TC, Caley MJ, Baird AH. 2016. The Point Count Transect Method for 522 Estimates of Biodiversity on Coral Reefs: Improving the Sampling of Rare Species. PLoS 523 One 11:e0152335. 524 Spencer-Cervato C. 1999. The Cenozoic deep sea microfossil record: explorations of the 525 DSDP/ODP sample set using the Neptune database. Palaeontologica Electronica 2:web. 526 Stratadata. 2014. Stratabugs biostratigraphic data management software. 527 http://www.stratadata.co.uk 528 van Rossum G, Drake J. 2010. Python Language Reference, version 2.7. 529 Walker W, Lamere P, Kwok P, Raj B, Singh R, Gouvea E. 2004. Sphinx-4: a Flexible Open 530 Source Framework for Speech Recognition, Sun Microsystems, Mountain View, CA. 531 Wieczorek J, Bloom D, Guralnick R, Blum S, Döring M, Giovanni R, Robertson T, Vieglais D. 532 2012. Darwin Core: An evolving community-developed biodiversity data standard. PlosOne 533 7:1–8. 534 Zachariasse WJ, Riedel WR, Sanfilippo A, Schmidt RR, Brolsma MJ, Schrader HJ, Gersonde R, 535 Drooger MM, Broekman JA. 1978. Micropaleontological counting methods and techniques- 536 an exercise on an eight metres section of the lower Pliocene of Capo Rossello, Sicily. Utrecht 537 Micropaleontological Bulletins 17:79-176. 538 Zippi P. 2007. Counter 4.5. PAZ Software. http://www.pazsoftware.com.

539 Supporting Information Appendices

540 S1 Appendix 1 - User Gdides 541 S2 Appendix 2 - Sample Files 542 S3 Appendix 3 - SOD Deninition Figure 1

Assemblages with common and rare taxa

Microfossil assemblage as seen in the microscope (late Pleistocene, Southern Ocean, ODP Site 751). Specimens marked by black arrows all belong to Antarctissa strelkovi or A. denticulata. Other radiolarian species are marked by white arrows. Unmarked individuals are not targets for counting - broken radiolarians and diatom valves. Most individuals in this target assemblage belong to just a few species (particularly A. strelkovi and A. denticulata), making discovery of rarer taxa difficult. Figure 2(on next page)

Ranked relative abundances of fossil radiolarian species in single samples and combined multisample datasests.

Counts of species, sorted by abundance, of Neogene Southern Ocean radiolarian assemblages, showing total dataset (several dozen samples) and a single sample (Deep-sea drilling sample ODP 751A-6H-6, 98-100 cm). Despite a total count of 7071 specimens within the single sample, the majority of the species are represented by 6 or fewer individuals. From data in (Renaudie & Lazarus, 2013) SOM. Ranked Species Abundances Antarctic Neogene Radiolaria 105 full dataset: N = 714,853 single sample: N = 7,071

104

103

NSpecimens 102

101

0 50 100 150 200 250 300 350 400 450 500 Species Rank Figure 3(on next page)

Cumulative diversity vs sample size curve and estimated true diversity for a single sample.

Species-accumulation curve on a typical sample (sample ODP 751A-6H-6, 98-100 cm shown in Fig 1). Bold black curve is the species accumulation curve; light grey curve is a de Caprariis type curve-fit; dashed light grey line its asymptote (i.e. species diversity at infinite sample size). From (Renaudie & Lazarus, 2013). 250

200

150

100

NumberSpeciesof 50

0 1000 2000 3000 4000 5000 6000 7000 8000 9000 Number of Specimens Figure 4

Configuration file to populate interface with category names.

Configuration file format (a plain text file, here formatted for easier reading). Only a few fields - 'Genus' and 'Species' components of a taxonomic name, button (yes/no) are mandatory. A couple fields, e.g. 'Recognition Name' are used only by RaritasVox. Figure 5

Dialog to enter general sample metadata.

Metadata window used for Raritas. Information about the sample to be counted is entered here, including observer, date, class of objects being counted ('Fossil Group'), and sample identification information. RaritasVox has additional options (not shown), e.g. 'Save list of counted species with diversity' which, if checked, creates a second output file that gives the entire history of counting. Figure 6

Main counting window with buttons, hierarchical category menus and count status information.

Main counting window. Objects to be counted are presented in two forms: an array of clickable buttons in the upper part of the window, and as a set of pop-up lists in the lower left and center part of the window. The number of lists and their contents is automatically built from the configuration file higher category labels for object entries. Button labels are also taken from this file on start-up. Other buttons or menu items control program behavior and call up other features e.g. voice recognition (RaritasVox only), show count plot, switch to Rare Count mode etc. A scrolling list of the most recently counted objects is shown in the lower middle. The 'Track' counter and clickable (large rectangular) button are on the lower right and are used to record observation effort in both regular and rare count modes. Note, in this image rare count mode has already been activated; thus some buttons are greyed out. Figure 7

Dialog to configure rare count mode.

Configure rare count mode dialog. The object counts list, sorted by count frequencies, is presented and the user selects those objects (here, species names) that will in skipped and no longer counted in rare count mode.

Figure 8(on next page)

Relationships between sample size and uncertainty of abundance estimates in generalized and actual biodiversity data.

Panel A (left) - Epsilon (size of confidence interval, relative to the abundance value, for a given species relative abundance in a population) plotted on a p (percent) vs N (number of specimen) landscape. Rule of thumb used in [12] marked by dashed lines (Renaudie & Lazarus, 2013) highlighted. Panel B (right) - Shows, for data reported in (Renaudie & Lazarus, 2013), red line: the percent of samples that have at least one species with percent higher than p; blue line: the percent of species having a proportion higher than p in at least one sample, and black line with shading: the cumulative proportion of specimens of species with proportion higher than p (mean, inner-quartile range and total range over all 107 samples). 50 ε = 10% ε = 5%

Samples ε = 25% 10

5 ε = 50% p

1 ε = 100%

0.5 Specimens

ε = 200%

0.1 2.9 Species 59.7

100 200 300 500 1000 2000 5000 20 40 60 80 100% N Proportion above p Figure 9(on next page)

Collecting curve, showing history of cumulative diversity vs sample size.

Count plot window, showing a simple graphic of how total diversity of objects ('species') is increasing with increased numbers of counted objects ('specimens'). The window appears whenever the user clicks the 'show count plot' button in the main counting window. This graphic is calculated and plotted anew with each invocation. The shape of the curve provides important feedback for the user, see text for details. Collector's curve 120

100

Number of Species 40

0 0 50 100 150 200 250 300 350 400 Number of Specimens Figure 10(on next page)

RaritasVox defined vocabulary and pronunciation configuration file.

Configuration file for voice recognition using RaritasVox (extract only). Spoken words are on the left and the phoneme pronunciations on the right. APROP AH P R OW K S STOCKI S T OW K IY AMPRADIOSA AE M P R AE D IY OW S AH ARTANNULATUS AA R T AE N AH L EY T AH Z APIRREGULAR AE K S IH R EH G Y AH L ER BGRAN B IY G R AE N CRYPTBUSS K R IH P T B AH S GONDWANA G AO N D W AE N AH LOPHOHADRA L OW F AH HH AE D R AH MITA M IY T AH PODPAPILIS P AA D P AE P IH L IH S PSEUDODICT S UW D OW D IH K T ZYGO S IY G ER SPYRO S P IY R ER CORNUTELLA K AO R N Y UH T EH L AH CALOCYCLAS K AE L OW S AY K L AH Z BUNNYEARS B AH N IY IH R Z ZIGZAG Z IH G Z AE G Figure 11

Main counting window for RaritasVox

Screenshot of RaritasVox in voice-counting mode. A list of acceptable words is shown in the top window, the currently recognized word in large letters in the middle of the screen (to make it easy to see at a glance when e.g. working at a microscope), button controls below this and summary panes of count activity at the bottom. Figure 12(on next page)

Example of SOD file format with data blocks framed.

Example of SOD file output (the main data output file produced by Raritas), with the 4 main areas (blocks) marked by bold lines. Metadata about the data file is stored in the upper left block, object labels and linked data such as author names, if known, are in the lower left block, sample information is in the upper right block, and the actual counting data in the lower right block. In output from the Raritas program only a single column of data is created but the SOD format definition permits the sample name and count values to repeat indefinitely (to the right of this figure). Note that only a few selected rows are shown here - the full file has ca 400 taxa names. count for ms fig

SOD v.: 2.1 File Type: O Fossil Group: radiolaria Source ID: Source Name: Source Citation: Site: 751 Entered By: dbl Entry Date: 18-01-2017 Checked By: Check Date: Hole: A Leg Info: 120 Leg Qualifier: Core: 12H File Creation Method: Raritas Section: 2 Occurrences Data Type: C Keys: Interval top: 12-14 Comments: 1 tracks observed Depth(mbsf): Abundance: A Preservation: G Genus: GQ: Species: SQ: Subspecies: Author: Taxon Code: Higher Taxon: Taxon Comments: Acrosphaera murrayana Collo/Entact/Phaeo 3 Acrosphaera spinosa Collo/Entact/Phaeo 1 Actinomma golownini Spumellaria 13 Anomalocantha dentata Spumellaria 1 Antarctissa strelkovi Nassellaria 18 Antarctissa ballista Nassellaria 8 Botryostrobus auritus/australis Nassellaria 1 Cycladophora humerus Nassellaria 8 Cycladophora golli regipileus Nassellaria 2 Cycladophora golli golli Nassellaria 1 Dendrospyris ? sakaii Nassellaria 3 Dendrospyris rhodospyroides Nassellaria 2 Dictyophimus ? planctonis Nassellaria 14 Druppatractus irregularis Spumellaria 8

Page 1 Table 1(on next page)

Recognition accuracy in a simple test run of RaritasVox.

Accuracy of spoken entry using RaritasVox for a short list of species name abbreviations. Each name was spoken in random order 6 times. Note the independence of the spoken and data names e.g. zigzag for L. robusta. The spoken and formal names are linked in the Vox configuration file. Genus GQ Species SQ spoken name VOX count Errors

Amphicraspedum prolixum gr. aprox 5 1 Amphipyndax stocki stocki 6 0 Amphisphaera radiosa ampradiosa 6 0 Artostrobus annulatus artannulatus 6 0 Axoprunum irregularis axirregular 5 1 Buryella granulata bgran 6 0 Calocyclas spp. calocyclas 7 1 Cornutella sp. cornutella 6 0 Cryptocarpium bussonii gr. cryptbuss 8 2 Gondwanaria ? sp. gondwana 6 0 Lithomelissa robusta zigzag 6 0 Lophocyrtis hadra lophohadra 5 1 Acrosphaera cuniculiauris bunnyears 6 0 Mita ? sp. mita 6 0 Podocyrtis papilis podpapilis 5 1 Pseudodictyophimus gracilipes pseudodict 6 0 Spyrocyrtis A n.sp. spyro 60 Zygocircus buetschli zygo 7 1 101 1