
Automated image-based taxon identification using deep learning and citizen-science contributions

Miroslav Valan

ISBN 978-91-7911-416-9

Department of Zoology

Doctoral Thesis in Systematic Zoology at Stockholm University, Sweden 2021

Automated image-based taxon identification using deep learning and citizen-science contributions

Miroslav Valan

Academic dissertation for the Degree of Doctor of Philosophy in Systematic Zoology at Stockholm University, to be publicly defended on Wednesday 10 March 2021 at 14.00 in Vivi Täckholmsalen (Q-salen), NPQ-huset, Svante Arrhenius väg 20.

Abstract

The sixth mass extinction is well under way, with biodiversity disappearing at unprecedented rates in terms of species richness and biomass. At the same time, given the current pace, we would need the next two centuries to complete the inventory of life on Earth, and this is only one of the necessary steps toward monitoring and conservation of species. Clearly, there is an urgent need to accelerate the inventory and the taxonomic research required to identify and describe the remaining species, a critical bottleneck. Arguably, leveraging recent technological innovations is our best chance to speed up taxonomic research. Given that taxonomy has been and still is notably visual, and the recent breakthroughs in computer vision and machine learning, it seems that the time is ripe to explore to what extent we can accelerate morphology-based taxonomy using these advances in artificial intelligence. Unfortunately, these so-called deep learning systems often require substantial computational resources, large volumes of labeled training data and sophisticated technical support, which are rarely available to taxonomists. This thesis is devoted to addressing these challenges. In paper I and paper II, we focus on developing an easy-to-use ('off-the-shelf') solution to automated image-based taxon identification, which is at the same time reliable, inexpensive, and generally applicable. This enables taxonomists to build their own automated identification systems without prohibitive investments in imaging and computation. Our proposed solution utilizes a technique called feature transfer, in which a pretrained convolutional neural network (CNN) is used to obtain image representations ("deep features") for a taxonomic task of interest. Then, these features are used to train a simpler system, such as a linear support vector machine classifier. In paper I we optimized parameters for feature transfer on a range of challenging taxonomic tasks, from the identification of insects to higher groups (even when they are likely to belong to subgroups that have not been seen previously) to the identification of visually similar species that are difficult to separate for human experts. In paper II, we applied the optimal approach from paper I to a new set of tasks, including a task unsolvable by humans: separating specimens by sex from images of body parts that were not previously known to show any sexual dimorphism. Papers I and II demonstrate that off-the-shelf solutions often provide impressive identification performance while at the same time requiring minimal technical skills. In paper III, we show that phylogenetic information describing evolutionary relationships among organisms can be used to improve the performance of AI systems for taxon identification. Systems trained with phylogenetic information do as well as or better than standard systems in terms of common identification performance metrics. At the same time, the errors they make are less wrong in a biological sense, and thus more acceptable to humans. Finally, in paper IV we describe our experience from running a large-scale citizen science project organized in summer 2018, the Swedish Ladybird Project, to collect images for training automated identification systems for ladybird beetles. The project engaged more than 15,000 school children, who contributed over 5,000 images and over 15,000 hours of effort.
The project demonstrates the potential of targeted citizen science efforts in collecting the required image sets for training automated taxonomic identification systems for new groups of organisms, while providing many positive educational and societal side effects.

Stockholm 2021 http://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-189460

ISBN 978-91-7911-416-9 ISBN 978-91-7911-417-6

Department of Zoology

Stockholm University, 106 91 Stockholm

AUTOMATED IMAGE-BASED TAXON IDENTIFICATION USING DEEP LEARNING AND CITIZEN-SCIENCE CONTRIBUTIONS

Miroslav Valan

Automated image-based taxon identification using deep learning and citizen-science contributions

Miroslav Valan

©Miroslav Valan, Stockholm University 2021

ISBN print 978-91-7911-416-9 ISBN PDF 978-91-7911-417-6

Printed in Sweden by Universitetsservice US-AB, Stockholm 2021

Dedicated to my daughter Mia and son Teodor, without whom this thesis would have been completed two years ago.

Acknowledgment

I would like to express my deepest gratitude to my supervisor Fredrik Ronquist for his patience, unconditional support and understanding. I am in his debt for supporting my choices and for finding a way to finish this thesis without too much hassle. I have received much support from my co-supervisors Atsuto Maki, Karoly Makonyi and Nuria Albet-Tores. I would also like to thank my numerous colleagues from SU and NRM, and especially Savantic, which I have considered my second home during the last five years. I am very grateful to all my coauthors and to those who in one way or another were involved in the work that resulted in this thesis. Thank you all. I am indebted to my wife Vlasta for her love, patience, encouragement, endless support, sacrifice and forgiveness during the most important journey: life. I want to thank my parents, Dragica and Mirko, and my brother Dragoslav. Lastly and most importantly, I want to thank my children for being an inexhaustible source of joy and happiness. And the best thing is that we are just starting. To my 3-year-old son Teodor, for making me feel stronger, more organized and better at prioritizing what matters the most in life. You taught me a lesson I once knew: you can succeed only after you try, so never stop trying. To my giggly and miraculous baby girl Mia. I love seeing you happy for no obvious reason. So, just keep giggling.

Abstract

The sixth mass extinction is well under way, with biodiversity disappearing at unprecedented rates in terms of species richness and biomass. At the same time, given the current pace, we would need the next two centuries to complete the inventory of life on Earth, and this is only one of the necessary steps toward monitoring and conservation of species. Clearly, there is an urgent need to accelerate the taxonomic research required to identify and describe the remaining species. Arguably, leveraging recent technological innovations is our best chance to speed up taxonomic research. Given that taxonomy has been and still is notably visual, and the recent breakthroughs in computer vision and machine learning, it seems that the time is ripe to explore to what extent we can accelerate morphology-based taxonomy using these advances in artificial intelligence (AI). Unfortunately, these so-called deep learning systems often require substantial computational resources, large volumes of labeled training data and sophisticated technical support, none of which are readily available to taxonomists. This thesis is devoted to addressing these challenges. In paper I and paper II, we focus on developing an easy-to-use ('off-the-shelf') solution to automated image-based taxon identification, which is at the same time reliable, inexpensive, and generally applicable. Such a system would enable taxonomists to build their own automated identification systems without advanced technical skills or prohibitive investments in imaging or computation. Our proposed solution utilizes a technique called feature transfer, in which a pretrained convolutional neural network is used to obtain image representations ("deep features") for a taxonomic task of interest. Then, these features are used to train a simpler system, such as a linear support vector machine classifier. In paper I we optimized

parameters for feature transfer on a range of challenging taxonomic tasks, from the identification of insects to higher groups (even when they are likely to belong to subgroups that have not been seen previously) to the identification of visually similar species that are difficult to separate for human experts. We find that it is possible to find a solution that performs very well across all of these tasks. In paper II, we applied the optimal approach from paper I to a new set of tasks, including a task unsolvable by humans: separating specimens by sex from images of body parts that were not previously known to show any sexual dimorphism. Papers I and II demonstrate that an off-the-shelf solution can provide impressive identification performance while at the same time requiring minimal technical skills. In paper III, we show that information describing evolutionary relationships among organisms can be used to improve the performance of AI systems for taxon identification. Systems trained with taxonomic or phylogenetic information do as well as or better than standard systems in terms of generally accepted identification performance metrics. At the same time, the errors they make are less wrong in a biological sense, and thus more acceptable to humans. Finally, in paper IV we describe our experience from running a large-scale citizen science project organized in summer 2018, the Swedish Ladybird Project, to collect images for training automated identification systems for ladybird beetles. The project engaged more than 15,000 school children, who contributed over 5,000 images and over 15,000 hours of effort. The project demonstrates the potential of targeted citizen science efforts in collecting the required image sets for training automated taxonomic identification systems for new groups of organisms, while providing many positive educational and societal side effects.

Abstrakt (Swedish summary)

We are in the midst of the sixth mass extinction, and biodiversity is disappearing at a furious pace. Species are being lost forever, and total biomass is steadily declining. At the same time, at the current pace we would need two hundred years to complete the inventory of life on Earth, and this is only one of the necessary steps toward monitoring and conserving biodiversity. Given this, it is clear that we must try to accelerate the taxonomic research required to identify and describe the remaining species. Leveraging the technological advances of recent years is probably our best chance of doing so. Given that taxonomy has been, and still is, based largely on visual characters, and that great progress has been made in recent years in computer vision and machine learning, it is high time to explore to what extent we can accelerate morphology-based taxonomy with the help of artificial intelligence (AI). The latest advances build on so-called deep learning, which often requires substantial computational resources, large volumes of training data and considerable technological expertise. These resources are rarely available to taxonomists. The research presented in this thesis aims to address these problems. In paper I and paper II we focus on developing an easy-to-use, off-the-shelf solution for automated image-based taxon identification that is at the same time reliable, accessible and generally applicable. Such a standard system would make it possible for taxonomists to build their own automated identification systems without prohibitive investments in computational resources or in accumulating large digital image databases. Our solution uses a technique in which the elements or features perceived by an advanced neural network (a convolutional neural network), trained for a generic image classification task, are extracted from images intended for another task, a taxonomic identification task. The extracted image features can then be used to train a simpler classification system, for example a so-called support vector machine. We optimized the parameters for this type of system on a range of challenging taxonomic tasks, from identification of insects to higher taxa (even when they likely belong to subgroups not seen previously) to identification of visually similar species that are difficult for human experts to tell apart. We found that it was possible to design such a system so that it performed well on all of these tasks. In paper II we applied the optimal system from paper I to a new set of tasks, including a task that cannot be solved by humans: separating males from females based on images of body parts that were not previously known to show any sexual dimorphism. Papers I and II show that it is possible to develop standard solutions that give impressive identification performance in the finished identification systems while requiring minimal technical skills of the user. In paper III we show that information describing evolutionary relationships among organisms can be used to improve the performance of AI systems for taxonomic identification. Systems trained with taxonomic or phylogenetic information perform as well as or better than standard systems when evaluated with generally accepted performance metrics. At the same time, the errors they make are less wrong in a biological sense and thus more acceptable to humans. Finally, in paper IV we describe our experience of running a large-scale citizen-science project organized in the summer of 2018, the Swedish Ladybird Project (Nyckelpigeförsöket), to collect images for training AI systems for the identification of ladybird beetles. The project engaged more than 15,000 school children, who contributed over 5,000 images and over 15,000 hours of work. The project shows the enormous potential of engaging citizen scientists in collecting the images needed to train AI systems for automated identification of new groups of animals and plants. At the same time, such projects can have many positive side effects. Not least, they can awaken the public's curiosity about biodiversity and its interest in preserving it for the future.

Author's contributions

The thesis is based on the following articles, which are referred to in the text by their Roman numerals:

I. Valan, M., Makonyi, K., Maki, A., Vondráček, D., & Ronquist, F. (2019). Automated taxonomic identification of insects with expert-level accuracy using effective feature transfer from convolutional networks. Systematic Biology, Volume 68, Issue 6, November 2019, Pages 876–895, https://doi.org/10.1093/sysbio/syz014.

II. Valan, M., Vondráček, D., & Ronquist, F. Awakening taxonomist's third eye: exploring the utility of computer vision and deep learning in systematics. Submitted.

III. Valan, M., Nylander, A. A. J. & Ronquist, F. AI-Phy: improving automated image-based identification of biological organisms using phylogenetic information. Manuscript.

IV. Valan, M., Bergman, M., Forshage, M. & Ronquist, F. The Swedish Ladybird Project: Engaging 15,000 school children in improving AI identification of ladybird beetles. Manuscript.

Candidate contributions to thesis articles*

Type of contribution     Paper I   Paper II   Paper III   Paper IV
Conceived the study         A         A          A           A
Designed the study          A         A          A           A
Collected the data          A         A          A           A
Analyzed the data           A         A          A           A
Manuscript preparation      A         A          A           A

Table 1: Contribution explanation: A - Substantial: took the lead role and performed the majority of the work. B - Significant: provided a significant contribution to the work. C - Minor: contributed in some way, but the contribution was limited.

Contents

Abstract
Author contributions to thesis articles
Abbreviations
1 Introduction
  1.1 Convolutional neural networks and deep learning
  1.2 Aim of the current thesis
2 Summary of papers
  2.1 Paper I
    2.1.1 Material and methods
    2.1.2 Datasets
    2.1.3 Experiments
    2.1.4 Results
  2.2 Paper II
    2.2.1 Material and methods
    2.2.2 Results
  2.3 Paper III
    2.3.1 Material and methods
    2.3.2 Results
  2.4 Paper IV
    2.4.1 Design of the study. Material and methods
    2.4.2 Results
3 Discussion
4 Concluding remarks and thoughts on not so far future
References

Abbreviations

A list of abbreviations used in the thesis:

AI - Artificial Intelligence
ATI - Automated Taxonomic Identification system
CAM - Class Activation Maps
CNN - Convolutional Neural Network
CS - Citizen Science
DL - Deep Learning
GBIF - Global Biodiversity Information Facility
GPU - Graphical Processing Unit
FC - Fully Connected
LR - Logistic Regression
LS - Label Smoothing
PLS - Phylogenetic Label Smoothing
SLP2018 - Swedish Ladybird Project
SVC - Support Vector Classifier
SVM - Support Vector Machine
TLS - Taxonomic Label Smoothing

Chapter 1

Introduction

An understanding of the natural world and what’s in it is a source of not only a great curiosity but great fulfillment.

David Attenborough

Biodiversity is under unprecedented pressure due to climate change and the influence of humans. Based on the alarming rates at which species are disappearing, it is more than obvious that the sixth mass extinction is under way (Ehrlich, 1995; Laliberte and Ripple, 2004; Dirzo et al., 2014; Ripple et al., 2014; Maxwell et al., 2016; Ceballos et al., 2017). Precious life forms are lost before we become aware of their existence; forms that took evolution millions of years to create. If we knew what we have and what we may lose, it would be easier to convince decision-makers to take appropriate action to stop this devastating loss of biodiversity. The scientific field charged with the task of describing and classifying life on Earth is taxonomy, an endeavor that is as old as humans. Since the very beginning, we have aimed to understand the world around us; we observed, compared, tried to understand and drew conclusions; then we passed the knowledge on to the coming generations, first in oral and later in written form. It is easy to imagine how food (i.e. other living beings, such as plants and fungi) was at the top of our priorities; we had to learn what is edible and tasty

and what not, so we probably relied on information about anatomical features to distinguish one form of life from another. Much later, this became more structured, and different forms of life started to be compared based on the same body parts, or the absence or presence of particular morphological structures. The first written descriptions of different species were composed by compiling such observations of characters. This is considered the beginning of descriptive taxonomy. During the 18th century, Carl Linnaeus, a Swedish botanist, zoologist and taxonomist, established universally accepted conventions for classifying nature within a nested hierarchy and for the naming of organisms. Today, this system is still in use and is known as Linnaean taxonomy or modern taxonomy. Taxonomy remained predominantly descriptive until the mid-20th century, when it became more quantitative thanks to developments in statistics. Data such as length, width, angles, counts and ratios, combined with multivariate statistical methods, provided a deeper understanding of patterns in the biological world. This marked the beginning of traditional morphometrics (Marcus, 1990). In the 1980s, taxonomists applied approaches to quantify and analyse variation in shape, based on coordinates of outlines or landmarks (known as geometric morphometrics; Rohlf and Marcus, 1993). These were useful for graphical visualisation and/or statistical analyses, but they were also used in building some of the first systems for automated taxon identification (see below). Throughout its historical development, it has become increasingly clear that taxonomy is more than just a descriptive scientific discipline; it is a fundamental science on which other sciences, such as ecology, evolution and conservation, rely. In an important sense, taxonomy represents the world's scientific frontier, marking the boundary between the known and the unknown in our discovery of life forms. Unfortunately, taxonomic research is still slow in expanding this frontier. At the current pace, it is expected that it will take many years to describe all species of biological organisms on the planet. The gaps in our taxonomic knowledge and the shortage of taxonomic expertise are known as the taxonomic impediment (Agnarsson and Kuntner, 2007; Walter and Winterton, 2007; Rodman and Cody, 2003; Ebach et al., 2011; Coleman, 2015). Clearly, accelerating taxonomic research would benefit a wide range of immensely important decisions that our

civilization needs to make in the very near future. One possible approach to combating the taxonomic impediment would be to build sophisticated automated taxon identification systems (ATIs). ATIs could help in two ways. First, they could take care of routine identifications, freeing up the time of taxonomic experts so that they could focus on more challenging and critical tasks in expanding our knowledge of biodiversity. Second, sophisticated ATIs could also directly help in the process of identifying and describing new life forms. Until recently, however, ATIs were not particularly effective in solving these tasks. An important reason for this is that they were based on hand-crafted features. For example, if the purpose were to identify insects, relevant features might be the wing venation patterns, the positions of wing vein junctions, or the outlines of the whole body. After human experts identified some potentially informative features, these features would then have to be identified in images manually or through automated procedures that were specifically designed for the task at hand (Arbuckle et al., 2001; Feng et al., 2016; Francoy et al., 2008; Gauld et al., 2000; Lytle et al., 2010; O'Neill, 2007; Schröder et al., 1995; Steinhage et al., 2007; Tofilski, 2007, 2004; Watson et al., 2003; Weeks et al., 1999a,b, 1997). Some of the ATIs developed using these techniques have shown great performance (Martineau et al., 2016), but the approach is difficult to generalize because it requires knowledge of programming and image analysis (to formalize manual procedures or code automatic ones for feature extraction), of machine learning (to build an appropriate classifier) and of the task itself (expertise on the taxa of interest). In short, this approach does not generalize well: for every new task we need to consider the factors that determine the best target features, and then hand-craft procedures to encode those features. For these reasons, such ATIs have been presented for only a few groups. Note that a considerable amount of human effort must be spent before we can even evaluate whether it is feasible to solve the identification task at hand using this approach.

12 1.1 Convolutional neural networks and deep learning

In recent years, more general approaches to image classification have developed greatly (LeCun et al., 2015; Schmidhuber, 2015). This is part of a general trend in computer science towards more sophisticated and intelligent systems, that is, towards more sophisticated artificial intelligence (AI). The trend is driven by improved algorithms, rapidly increasing amounts of data, and faster and cheaper computation. In the field of computer vision, the development has been particularly fast in recent years with the introduction of more complex and sophisticated artificial neural networks, known as convolutional neural networks (CNNs), and the training of advanced (deep) versions of these networks with massive amounts of data, also known as deep learning (DL). The dramatic progress in computer vision has also been enabled by the development of graphical processing units (GPUs), adding a considerable amount of cheap processing power to modern computer systems. The first super-human performance of GPU-powered CNNs in an image classification task (Cireşan et al., 2011) was reported in 2011 in a traffic sign competition (Stallkamp et al., 2011). The breakthrough came in 2012, when a CNN architecture called AlexNet (Krizhevsky et al., 2012) out-competed all other systems in the ImageNet Large Scale Visual Recognition Challenge (Russakovsky et al., 2015), a larger and more popular image classification challenge. The good news about DL performance spread quickly, and we soon witnessed successful applications in other research areas, such as face verification (Taigman et al., 2014), object localisation (Tompson et al., 2015), image and video translation into natural language (Karpathy and Fei-Fei, 2015), language translation (Sutskever et al., 2014; Jean et al., 2015), speech recognition (Sainath et al., 2013; Hinton et al., 2012; Zhang and Zong, 2015) and question-answer problems (Kumar et al., 2016). The core of every CNN architecture is a set of convolutional (conv) layers, hence the name convolutional neural network (Fukushima, 1979, 1980; Fukushima et al., 1983; LeCun et al., 1989). The convolutional part of a CNN enables automatic feature learning; it works as a "feature extractor". The resulting features are then fed through one or more

fully connected (FC) layers, which deal with the classification task. The FC layers in principle correspond to a traditional multi-layer perceptron (Rosenblatt, 1957), which is a simple fully-connected feed-forward artificial neural network. Learning in a CNN is possible thanks to the backpropagation algorithm (Kelley, 1960; Linnainmaa, 1976; Werbos, 1982; Rumelhart et al., 1986; Schmidhuber, 2014) and gradient-based optimization (Robbins and Monro, 1951; Kiefer et al., 1952; Bottou et al., 2018). Most of the CNNs used today also contain other layers, such as pooling layers (for dimensionality reduction) (Fukushima, 1979, 1980), normalization layers (e.g. BatchNorm (Ioffe and Szegedy, 2015), which helps stabilize training) or regularization layers (Hanson, 1990; Srivastava et al., 2014) (which help address over-fitting), among many others.
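To make these building blocks concrete, the following is a minimal, hypothetical PyTorch sketch (not code from the thesis): a small CNN with convolutional, pooling, normalization, regularization and fully connected layers, trained for a single step with backpropagation and gradient-based optimization. All names and sizes are illustrative.

```python
import torch
import torch.nn as nn

# A small, hypothetical CNN: two conv blocks (feature extractor) + FC head (classifier).
class TinyCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),   # convolutional layer
            nn.BatchNorm2d(32),                            # normalization (stabilizes training)
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                               # pooling (dimensionality reduction)
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5),                               # regularization (reduces over-fitting)
            nn.Linear(64 * 56 * 56, num_classes),          # fully connected (FC) output layer
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = TinyCNN(num_classes=10)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)    # gradient-based optimization

images = torch.randn(4, 3, 224, 224)                        # dummy batch of RGB images
labels = torch.randint(0, 10, (4,))

logits = model(images)                                      # forward pass
loss = criterion(logits, labels)
loss.backward()                                             # backpropagation of gradients
optimizer.step()                                            # parameter update
```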

Figure 1.1: Architecture of VGG16 (Simonyan and Zisserman, 2014), a simple modern CNN. VGG16 consists of five convolutional blocks, each block consisting of two or three convolutional layers (green) followed by a MaxPooling layer (red). These blocks are followed by three layers of fully connected neurons (gray), the last of which consists of a vector of length 1000. Each element in this vector corresponds to a unique category in the ImageNet dataset (Russakovsky et al., 2015) for which this architecture was initially built. Adapted from paper I.

To better understand the basic structure of a CNN, consider Figure 1.1 illustrating a well known deep CNN architecture, VGG16 (Simonyan and Zisserman, 2014). This

architecture is simple, yet very powerful, and therefore one of the best studied. It has also become one of the most commonly utilized architectures for addressing various research questions. VGG16 consists of five convolutional blocks followed by two FC hidden layers and the output layer (also FC), where the number of nodes corresponds to the number of categories the network is trained for. Each convolutional block is made of convolutional (conv) layers followed by a MaxPooling layer. Every conv layer in the VGG family is made of 3x3 filters. The number of layers in each block and the number of filters in each layer vary, so we have 2x64, 2x128, 3x256, 3x512, and 3x512 "layers x filters" respectively for the five convolutional blocks. Note that some recent CNNs have more than a hundred layers, including dozens of convolutional layers, with much more complex architectures. VGG16 uses max pooling with a kernel of size 2x2 and a stride of 2, taking only the maximum value within the kernel (other options would be the average, sum, etc.). This reduces the width and height of the feature matrix by a factor of two, and the total amount of data by a factor of four. Unlike a conv layer, where nodes are connected to the input image or previous layer only through a local region of the same size as the corresponding kernel, the nodes in a fully connected (FC) layer are connected to every node in the previous layer (as in a simple multi-layer perceptron). Modern CNNs often require large sets of labeled images for successful supervised learning. Recently, it has been discovered that features learned by a CNN that has been trained on a generic image classification task (source task) can be beneficial in solving a more specialized problem (target task) using a technique called transfer learning (Caruana, 1995; Bengio, 2012; Yosinski et al., 2014; Azizpour et al., 2016). Transfer learning works primarily because a fair amount of relevant low-level features (edges, corners, etc.) are likely similar between source and target tasks. Intermediate (think of eyes, noses, mouths) and high-level (e.g. heads, legs) features are more specialized, and their usefulness depends on the distance between the source and target tasks. There are two variants of transfer learning: feature transfer and fine-tuning. In feature transfer, a pretrained CNN serves as an automated feature extractor (Azizpour et al., 2016; Donahue et al., 2014; Oquab et al., 2014; Sharif Razavian et al., 2014; Zeiler and

Fergus, 2014; Zheng et al., 2016). Each image is fed through a pretrained CNN, and its representation (feature vector) is extracted from one of the layers of the CNN, capturing low- to high-level image features. Then, these features are used to train a simpler machine learning system, such as a Support Vector Machine (SVM) (Cortes and Vapnik, 1995), a Logistic Regression (LR) (Cox, 1958), a Random Forest (Breiman, 2001) or Gradient Boosting (Friedman, 2001). This approach is usually computationally more efficient, and it can benefit from the properties of the chosen classifier (e.g. SVMs tend to be resistant to overfitting, outliers and class imbalance, while LR is simple, intuitive and efficient to train). Taking a pretrained CNN (or part of it) as the initialization for training a new model is known as fine-tuning. Fine-tuning tends to work well when the specialized task is similar to the original task (Yosinski et al., 2014). Compared to training a CNN from scratch, fine-tuning reduces the hunger for data and improves convergence speed, but it may require a fair amount of computational power. In fine-tuning, the images have to be run through the CNN in a forward pass, and then the computed derivatives from the predictions have to be backpropagated to modify the filters (the latter is the more computationally expensive part). This process of alternating forward and backward passes has to be repeated until the model converges. There is also the problem of defining appropriate learning hyperparameters in order to enable sufficient flexibility in learning the new task while avoiding overfitting.
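As an illustration of the two variants (a hypothetical sketch, not code from the papers in this thesis), the snippet below uses torchvision's VGG16 checkpoint pretrained on ImageNet: feature transfer runs images through the frozen convolutional layers, pools the result into fixed-length vectors and fits a linear SVM on them, while fine-tuning replaces the last FC layer and continues training the whole network. The data, number of classes and learning rate are placeholders; with torchvision >= 0.13 the pretrained weights can be requested with the string shown, while older versions use pretrained=True.

```python
import torch
import torch.nn as nn
from torchvision import models
from sklearn.svm import LinearSVC

# Pretrained VGG16; 'features' holds the five convolutional blocks.
vgg = models.vgg16(weights="IMAGENET1K_V1")
vgg.eval()

# --- Variant 1: feature transfer --------------------------------------------
# Run images through the frozen conv layers, pool to a fixed-length vector,
# and train a simple classifier (here a linear SVM) on those "deep features".
def deep_features(batch: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        fmap = vgg.features(batch)                 # (N, 512, H, W) after block c5
        return fmap.mean(dim=(2, 3))               # global average pooling -> (N, 512)

train_images = torch.randn(32, 3, 224, 224)        # placeholder image batch
train_labels = torch.randint(0, 5, (32,)).numpy()  # placeholder labels, 5 classes
svm = LinearSVC().fit(deep_features(train_images).numpy(), train_labels)

# --- Variant 2: fine-tuning --------------------------------------------------
# Replace the last FC layer with one sized for the new task and keep training
# the network itself (forward + backward passes) with a small learning rate.
vgg_ft = models.vgg16(weights="IMAGENET1K_V1")
vgg_ft.classifier[6] = nn.Linear(4096, 5)           # new output layer for 5 classes
optimizer = torch.optim.SGD(vgg_ft.parameters(), lr=1e-4, momentum=0.9)
criterion = nn.CrossEntropyLoss()

logits = vgg_ft(train_images)                        # forward pass
loss = criterion(logits, torch.as_tensor(train_labels))
loss.backward()                                      # backpropagate to modify the filters
optimizer.step()
```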

1.2 Aim of the current thesis

With the breakthroughs in deep learning and computer vision outlined above, it is now possible to meet the requirements for highly competent ATIs (Wu et al., 2019; Hansen et al., 2020; Joly et al., 2018; Van Horn et al., 2018; Cui et al., 2018), which can help accelerate taxonomic research. Given a sufficient number of training examples and their labels (e.g. a species name obtained from a taxonomic expert or with DNA sequencing), these new systems learn to identify features important for identification directly from images, without any interference from humans; that is, there is no need for an expert

to indicate what is informative, the system finds the relevant image features by itself. However, as indicated above, a limiting factor is access to sufficient amounts of training data, which could be a serious challenge for most species identification tasks. There are various reasons for this. Firstly, the species abundances are usually imbalanced: there are often a few common species, while the majority of species are rarely seen and almost never photographed (or collected). Secondly, the number of images, for those species that are photographed (or collected), is hugely imbalanced towards more attractive groups or subgroups. Among insects, for instance, butterflies are wildly popular targets for nature photographers, while small midges, flies or parasitic wasps are almost never photographed regardless of how common they are. The popularity may also vary among morphs or life stages; for instance, butterfly eggs and pupae are photographed much less often than adult butterflies. Thus, collecting enough images of all species and relevant morphs to be able to train a state-of-the-art AI system may be a daunting task. In addition to the challenge of putting together an adequate training set, another serious challenge in training such a state-of-the-art AI system on a dedicated taxonomic task is that it requires advanced technical skills that most taxonomists lack. In this thesis, I address these challenges. The main focus of the thesis has been on insects because they are diverse, challenging to identify, and many groups of insects are poorly studied. In fact, more than half of the known species on Earth are insects (over a million according to Zhang (2011)), and many scientists suggest that the species we know today are just a fraction of what is left to be discovered (Mora et al., 2011; Stork et al., 2015; Novotny et al., 2002). For illustration, consider estimates of the number of undescribed species of all chordates together (15,000), plants (80,000) and insects (4,000,000) (Chapman et al., 2009). The disparity is even greater if we base the comparison on how much we know about their physiology, behaviour, and spatial and temporal distributions. Despite these knowledge gaps, insects play many important roles in our ecosystems, both beneficial ones, for instance as pollinators of crops, and less favorable ones, for instance as pests, invasive species or even vectors of disease. The enormous diversity of insects, the shortage of taxonomic expertise (Gaston and May, 1992; Gaston and O'Neill, 2004), and the importance of many

insect species in our ecosystems combine to emphasize the need for accelerating taxonomic research on insects and the potential use for ATIs in doing so. Nevertheless, the findings presented in the thesis are general and should apply to image-based identification of any group of organisms with AI. An important goal of the current thesis has been to develop techniques enabling taxonomists to build their own sophisticated ATIs using reliable and computationally inexpensive approaches, and without prohibitive investments in imaging (papers I and II). In paper I, we explored methods that might allow taxonomists to develop ATI systems even when the available image data and machine learning expertise are limited. Specifically, we focused on feature transfer, as previous work has indicated that features obtained from pretrained CNNs are a good starting point for most visual recognition tasks (Sharif Razavian et al., 2014). A CNN pretrained on a general image classification task was used as an automated feature extractor, and the extracted features were then used in training a simpler classification system for the taxonomic task at hand. By optimizing the feature extraction protocol, we were able to develop high-performing ATIs for a range of taxonomic identification tasks using fairly limited image sets as training data. Specifically, we looked at two challenging types of tasks: (1) identification of insects to higher groups, even when they are likely to belong to subgroups that have not been seen previously; and (2) identification of visually similar species that are difficult to separate for human experts. For the first type of task, we looked at the identification of images of Diptera and Coleoptera from the iDigBio repository to higher taxonomic groups. For the second type of task, we looked at the identification of beetles of the genus Oxythyrea to species, using a dataset assembled for the paper, and of stonefly larvae to species, using a previously published dataset. In paper II, we aimed to address some questions on automated identification that are frequently asked by insect taxonomists: Which techniques are best suited for a quick start on an ATI project? How much data is needed? What image resolution is needed? Is it possible to tackle identification tasks that are unsolvable by humans? To answer these questions, we created two novel datasets of 10 visually similar species of the flower chafer

genus Oxythyrea. The best performing system found in paper I was then used as an off-the-shelf solution and applied to these datasets in several experiments designed to answer the questions. In addition, we repeated the same experiments using some state-of-the-art approaches in image recognition. We show that our off-the-shelf system, while offering an "easy-to-use, instant-return" approach, is often sufficient for testing interesting hypotheses. In fact, the identification performance of ATIs based on the off-the-shelf system was not too far from that of state-of-the-art approaches in our experiments, and it provided similar insights (feasibility, misidentification patterns, etc.) compared to the more advanced systems. We even demonstrate that our off-the-shelf approach can be successfully used on a challenging task that appears unsolvable to humans. It is well known that CNNs occasionally make catastrophic errors, e.g. mistaking one category for a completely unrelated category, a mistake that humans would be very unlikely to make. We address this in a biological setting in paper III by leveraging a recently introduced technique called label smoothing (Szegedy et al., 2016). Specifically, we propose label smoothing based on taxonomic information (taxonomic label smoothing) or on distances between species in a reference phylogeny (phylogenetic label smoothing). We show that networks trained with taxonomic or phylogenetic information perform at least as well as standard systems on common performance metrics (accuracy, top-3 accuracy, macro F1 score), while making errors that are more acceptable to humans and less wrong in an objective biological sense. We validated our proposed techniques on two empirical examples (38,000 outdoor images of 83 species of snakes, and 2,600 habitus images of 153 species of butterflies and moths). As mentioned above, CNNs typically require large training sets of accurately labeled images. Assembling such training sets for developing ATIs could be addressed by soliciting help from citizen scientists. We explored this in paper IV. In the Swedish Ladybird Project (SLP2018), we engaged more than 15,000 Swedish school children in collecting photos of ladybird beetles (Coccinellidae). The children collected more than 5,000 photos of 30 species of coccinellids. This is almost as many coccinellid images as the rest of the world contributed to the Global Biodiversity Information Facility

(GBIF) portal during the same period, the summer of 2018. We found that adding the SLP2018 images to the GBIF data resulted in improved ATI model performance across various evaluation metrics for all but the most common ladybird species.
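To make the idea behind taxonomic and phylogenetic label smoothing from paper III more concrete, here is a small, hypothetical sketch; the exact weighting scheme used in the paper may differ, and the inverse-distance weighting and the helper name smoothed_targets are purely illustrative. Instead of a one-hot target, each training image receives a soft target in which the smoothing mass is distributed over the other species according to how close they are to the true species in a taxonomy- or phylogeny-derived distance matrix.

```python
import numpy as np

def smoothed_targets(true_idx: int, dist: np.ndarray, eps: float = 0.1) -> np.ndarray:
    """Soft target for one training example.

    dist[i, j] is a taxonomic or phylogenetic distance between species i and j
    (e.g. patristic distance on a reference tree); eps is the total probability
    mass moved away from the true label, as in standard label smoothing.
    """
    n = dist.shape[0]
    target = np.zeros(n)
    target[true_idx] = 1.0 - eps                  # most mass stays on the true species

    d = dist[true_idx].astype(float).copy()
    d[true_idx] = np.inf                          # exclude the true species itself
    closeness = 1.0 / d                           # closer relatives get more mass (assumption)
    target += eps * closeness / closeness.sum()
    return target

# Toy example with three species: A and B are sister species, C is distant.
distances = np.array([[0.0, 1.0, 4.0],
                      [1.0, 0.0, 4.0],
                      [4.0, 4.0, 0.0]])
print(smoothed_targets(true_idx=0, dist=distances, eps=0.1))
# -> [0.90, 0.08, 0.02]: misidentifying A as its sister B is penalized
#    less than misidentifying A as the distantly related C.
```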

Chapter 2

Summary of papers

"Begin at the beginning," the King said, very gravely, "and go on till you come to the end: then stop."

Lewis Carroll

2.1 Paper I

2.1.1 Material and methods

Our experiments in paper I were designed to find optimal feature extraction settings for various taxonomic identification tasks and training datasets using a single feed-forward pass through a pretrained CNN. Recent work has indicated that these so-called deep features, although the extraction of them has been learned on a general image classification task, are very robust and, in combination with simple classifiers such as SVMs (Cortes and Vapnik, 1995), can yield results on par with or better than state-of-the-art results obtained with hand-crafted features (Azizpour et al., 2016; Sharif Razavian et al., 2014; Donahue et al., 2014; Oquab et al., 2014; Zeiler and Fergus, 2014; Zheng et al., 2016). A well known CNN architecture, VGG16 (Simonyan and Zisserman, 2014), and its publicly available checkpoint pretrained on the ImageNet task (Simonyan and Zisserman,

2014), were utilized across all our experiments. Our experiments were based on features extracted after each conv block, and we refer to these features as c1-c5, respectively. The FC layers were excluded because they depend on the image input size. In our experiments, we investigated the effects of input image size, pooling strategy (max vs average), feature depth (features from different layers), normalization (l2 and/or signed square root), feature fusion, non-global pooling and image aspect ratio.
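As a rough illustration of this extraction protocol (a sketch, not the original implementation), the snippet below collects globally average-pooled features after each of the five convolutional blocks of a pretrained VGG16. The block boundaries are assumed to be the MaxPool layer indices of torchvision's VGG16 implementation.

```python
import torch
from torchvision import models

vgg = models.vgg16(weights="IMAGENET1K_V1").features.eval()

# Indices of the MaxPool layers that close conv blocks c1-c5 in torchvision's VGG16.
BLOCK_ENDS = {4: "c1", 9: "c2", 16: "c3", 23: "c4", 30: "c5"}

def block_features(batch: torch.Tensor) -> dict:
    """Return {block name: globally average-pooled feature vectors} for a batch."""
    feats, x = {}, batch
    with torch.no_grad():
        for idx, layer in enumerate(vgg):
            x = layer(x)
            if idx in BLOCK_ENDS:
                feats[BLOCK_ENDS[idx]] = x.mean(dim=(2, 3))  # (N, channels)
    return feats

batch = torch.randn(2, 3, 416, 416)          # 416x416 input, as in the optimal setting
for name, vec in block_features(batch).items():
    print(name, tuple(vec.shape))             # c1: (2, 64) ... c5: (2, 512)
```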

2.1.2 Datasets

To find optimal hyperparameters for feature extraction, we created four datasets representing two types of challenging taxonomic tasks: (1) identifying insects to higher groups when they are likely to belong to subgroups that have not been seen previously; and (2) identifying visually similar species that are difficult to separate even for experts. Three of the four datasets (D1-D3) were assembled specifically for this paper (Table 2.1). The first two datasets (head view of flies and top view of beetles, D1 and D2 respectively) were designed to investigate how far this approach can get us when assigning novel species to known higher taxonomic categories. The remaining two datasets were used to investigate whether the same techniques would be able to discriminate among visually very similar species (top view of sibling beetle species, D3, and stonefly larvae in different life stages, D4). Images from all four datasets were taken in lab settings. They all had a uniform background (the same uniform background across all images in D3-D4) and small amounts of image noise (pins, dust, labels, scales, measurements). In all datasets but D4, objects were large, centered and shared almost the same orientation (imaged in a standard taxonomic imaging procedure).

2.1.3 Experiments


Table 2.1: Datasets used in paper I. Datasets D1 and D2 are used for the task of assigning novel species to known higher taxonomic categories, and the other two datasets for the task of separating specimens of visually similar species. In all datasets, the images were taken in lab settings with a uniform background, large centered objects (not in all images in D4), the same object orientation (except D4) and a small amount of background noise (pins, dust, labels, scales, measurements). Stars (*) indicate datasets composed for paper I using images obtained from www.idigbio.org. Adapted from paper I.

ID   Insect       Categories    Images per taxon   View   Source
D1   Flies        11 families   24-159             face   *
D2   Beetles      14 families   18-900             top    *
D3   Beetles      3 species     40-205             top    This study
D4   Stoneflies   9 species     107-505            top    Lytle et al. (2010)

Impact of image size. Previous work demonstrated that concatenating features from images of different scales (image sizes) could improve performance on fine-grained classification tasks (Takeki et al., 2016). However, in order to obtain features from the same image at different scales, one needs to execute multiple feed-forward passes, which increases the computational cost. Instead, we opted for finding the optimal input size for a single feed-forward pass. In this experiment we restricted our attention to c5.

Impact of pooling strategy. Global pooling (Lin et al., 2013) is a common way to reduce dimensionality of deep features. Despite several recently proposed alternatives, the two most common pooling strategies are still global max pooling and global average pooling. We experimented with both pooling strategies and with a simple combination of the two (concatenation). As in the previous experiment, we used c5 features only.
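In a hypothetical PyTorch illustration (not the original code), the three pooling variants compared here amount to the following operations on a batch of c5 feature maps:

```python
import torch

fmap = torch.randn(8, 512, 13, 13)           # c5 feature maps for a batch of 8 images

avg = fmap.mean(dim=(2, 3))                   # global average pooling   -> (8, 512)
mx = fmap.amax(dim=(2, 3))                    # global max pooling       -> (8, 512)
both = torch.cat([avg, mx], dim=1)            # concatenation of the two -> (8, 1024)
```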

Impact of feature depth. According to Azizpour et al. (2016), one of the most important factors for the transferability of pretrained features is the distance between the target and the source tasks. If the task is to separate breeds of dogs, then we may expect the layers toward the end (FC layers) to perform best. This is because the source dataset, ImageNet, has many dog categories, so the later layers have probably learned so-called high-level features (think of body parts and their shapes: legs, head). In contrast, if the task is to separate two visually similar beetle species that differ only in small details, such as the degree of hairiness (corresponding to fine-grained differences in image texture), then we may want to focus on features from earlier (conv) layers. To investigate how feature depth affects performance on our taxonomic identification tasks, we compared features extracted from all five convolutional blocks, c1-c5.

Impact of feature normalization. Reducing the variance of the elements in the feature vectors is known to facilitate classification. We experimented with two common normalization techniques: l2-normalization and signed square root normalization, as in Arandjelovic and Zisserman (2013).
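For reference, the two normalizations can be written as follows (a small illustrative sketch; the feature array is a placeholder):

```python
import numpy as np

def signed_sqrt(x: np.ndarray) -> np.ndarray:
    """Signed square-root normalization: sign(x) * sqrt(|x|), element-wise."""
    return np.sign(x) * np.sqrt(np.abs(x))

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each feature vector (row) to unit Euclidean length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

features = np.random.randn(8, 1024)           # e.g. fused, pooled deep features
features = signed_sqrt(features)              # applied alone or followed by l2_normalize
```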

Impact of feature fusion. The advantage of combining features from different layers is demonstrated in Zheng et al. (2016). Unlike their work, we only tested fusion of features from the conv blocks (c1-c5) to avoid dependency on image input size.

Impact of non-global pooling. Feature matrices of the intermediate layers are large. The total size is equal to HxWxF, where H is the height, W is the width and F is the depth (the number of filters) of the convolutional block. As the first two dimensions (H, W) of the feature matrices depend on the image input size, and the number of filters is large, some dimensionality reduction is necessary when extracting features from intermediate conv layers. Global pooling decreases the feature matrix to a vector of size 1x1xF. This minimizes the computational cost of classifier training and prevents overfitting, and it is also known to result in better performance than simply flattening the raw feature matrices (Zheng et al., 2016).

We investigated the effect of intermediate levels of dimensionality reduction. Specifically, we reduced raw feature matrices to matrices of sizes 2x2xF, 4x4xF, 7x7xF, 14x14xF and 28x28xF, which were then flattened. These intermediate levels of dimensionality reduction increase computational cost but potentially preserve more information.
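One possible way to realize these intermediate pooling levels is PyTorch's adaptive pooling, sketched below (illustrative only; the original implementation may differ):

```python
import torch
import torch.nn.functional as F

fmap = torch.randn(8, 512, 26, 26)                    # e.g. c4 features for 416x416 input

pooled_4x4 = F.adaptive_avg_pool2d(fmap, (4, 4))      # non-global pooling -> (8, 512, 4, 4)
flat = pooled_4x4.flatten(start_dim=1)                # flattened -> (8, 512 * 4 * 4)

global_vec = F.adaptive_avg_pool2d(fmap, (1, 1)).flatten(1)   # global pooling -> (8, 512)
```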

Impact of image aspect ratio. We maintained the image aspect ratio across all the experiments described above. The images were symmetrically padded with random or uniform pixels, which preserved the object aspect ratio but caused some loss of information due to the added uninformative pixels. An alternative procedure is to preserve all image information by resizing without padding, at the cost of distorting the objects. In this experiment, we compared these two approaches to examine whether it was more important to maintain the aspect ratio or to preserve image information.
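The two preprocessing options can be sketched with PIL as follows (illustrative only; the file name, padding color and target size are assumptions):

```python
from PIL import Image, ImageOps

img = Image.open("specimen.jpg")                       # hypothetical input image

# Option 1: pad to a square with uniform pixels, preserving the object's aspect ratio.
padded = ImageOps.pad(img, (416, 416), color=(128, 128, 128))

# Option 2: resize directly to the target size, distorting the object but
# keeping all image pixels informative.
resized = img.resize((416, 416))
```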

Classifier and evaluation of classification performance

The extracted features were fed into an SVM, specifically a one-vs-all support vector classifier (SVC) (Cortes and Vapnik, 1995). This classifier is a common choice for these types of applications because it is memory efficient (it uses only the support vectors), and because it works well with high-dimensional spaces (Vapnik, 2000) and with unbalanced datasets (He and Garcia, 2009). We validated our results using a tenfold stratified random sampling strategy without replacement. In each iteration, one subset was used as the test set, while the classifier was trained on the remaining nine. As the evaluation metric we used accuracy, averaged across the individual subsets. A similar validation strategy was utilized across the other experiments in this thesis unless otherwise noted.
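A minimal scikit-learn sketch of this evaluation setup is shown below. The feature and label arrays are placeholders, and the added feature scaling step is an assumption included only for numerical stability; the exact classifier settings used in paper I may differ.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

features = np.random.randn(500, 2560)         # pooled deep features (placeholder)
labels = np.random.randint(0, 10, 500)        # taxon labels (placeholder)

# One-vs-rest linear support vector classifier, evaluated with stratified
# ten-fold cross-validation; accuracy is averaged across the ten test folds.
clf = make_pipeline(StandardScaler(), LinearSVC())
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, features, labels, cv=cv, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```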

2.1.4 Results

Evaluation of individual steps


Figure 2.1: We show A) the effect of image size and pooling strategy (left); B) the effect of feature depth, normalization and feature fusion (center); and C) the effect of non-global pooling on one of our datasets, D3 (right). Adapted from paper I.

Impact of image size. The first step in our experiments was to find an appropriate image size that would perform well across tasks and datasets. We focused on c5 features and assessed the performance for several input sizes (Fig. 2.1, left). We found that the accuracy increased up to a size of 416x416 on most of the datasets, and that in some cases using even larger images resulted in worse performance. Thus, we decided to proceed with a 416x416 input image size.

Impact of pooling strategy. In our experiments, global average pooling yielded better results than global max pooling. This could be explained by the fact that, in our datasets, objects occupy a large portion of the image. Thus, by averaging, we allow more object information to affect the final feature state than if we simply take a single value (the max), as in max pooling. Concatenating the two feature vectors obtained from the two different pooling strategies yielded results that were intermediate between the two separate results.

Impact of feature depth. Our results show that c4 features perform the best, while c3 yields results that are comparable with those of c5.

Impact of feature normalization. Signed square root normalization increased the performance but, somewhat surprisingly, we found that l2-normalization had a negative effect on accuracy, regardless of whether it was performed with or without signed square root normalization.

Impact of feature fusion. Feature fusion further improved the accuracy. The only exception was on D3, where fusion marginally fell behind the single best layer (a difference of only one image in accuracy).

Impact of non-global pooling. In this experiment we obtained further improvements on three out of four datasets. For D1-D3, we found that pooling down to 4x4xF yielded the best results, surpassing the globally pooled features. The most evident advantage of an intermediate level of pooling was on D3. We did not see any improvements at all on D4. This was the only dataset with non-centered and occasionally small objects, which could potentially result in having many receptive fields with no information about the object. In the three remaining datasets (D1-D3), the objects were always large, presumably generating a significant number of receptive fields in later CNN layers.

Impact of image aspect ratio. In all experiments reported above, the image aspect ratio was maintained. However, we were able to slightly improve the best performing model from the previous experiment by resizing images without maintaining the aspect ratio. Obviously, such resizing distorts objects but it also provides more information because there are no added uninformative pixels.

Performance on related tasks.

To validate our conclusions, we evaluated our optimal solution on several recently published biological image classification tasks (Table 2.2). Our approach performed about as well as or better than the previously published methods.

Table 2.2: Comparison of the performance of our method on some recently published biological image classification tasks. We used input images of size 416x416 (160x160 for Pollen23), global average pooling, fusion of c1-c5 features, and signed square root normalization. Accuracy is reported using tenfold cross validation, except for EcuadorMoths, where we used the same partitioning as in the original study. The result on EcuadorMoths in parentheses is from a single c4 block. (fm = families; sp = species). Adapted from paper I.

Dataset        Classes   Ours     Others   Reference                      Method
ClearedLeaf    19 fm     88.7     71       Wilf et al. (2016)             SIFT + SVM
CRLeaves       255 sp    94.67    51       Carranza-Rojas et al. (2017)   finetuned InceptionV3
EcuadorMoths   675 sp    55.4     55.7     Rodner et al. (2015)           AlexNet + SVM
EcuadorMoths   675 sp    (58.2)   57.2     Wei et al. (2017)              VGG16 + SCDA + SVM
Flavia         32 sp     99.95    99.6     Sun et al. (2017)              ResNet26
Pollen23       23 sp     94.8     64       Gonçalves et al. (2016)        CST + BOW


2.2 Paper II

2.2.1 Material and methods

For the experiments in paper II, we created two image datasets of 10 visually similar species of flower chafer beetles of the genus Oxythyrea. For the first dataset, we collected images using a standardized taxonomic imaging setup, in which images with different depths of field were stacked together into a single high-resolution image. For the second dataset, the same specimens were photographed in a much simpler and faster way using only a smartphone and a cheap 2$ attachable lens (see Figure 2.2). We first experimented with images of different habitus views (dorsal and ventral).

Figure 2.2: Datasets of ten Oxythyrea species used in paper II (columns: O. abigail, O. albopicta, O. cinctella, O. dulcis, O. funesta, O. gronbechi, O. noemi, O. pantherina, O. subcalva, O. tripolitana). The first row shows example images from dataset B, collected with a smartphone and a cheap 2$ attachable lens; the remaining rows show example images from dataset A, collected in a standardized taxonomic imaging setting. Dataset A contains images of dorsal and ventral habitus, including images of both sexes. Note that Oxythyrea beetles show sexual dimorphism only on their abdomen (ventral view). Adapted from paper II.

Humans often find one habitus view more useful for identification than the other, and we investigated whether this was also true for the ATIs we developed. Then, we investigated how "quick and dirty" image collection using a smartphone and a cheap attachable lens compares with the time-consuming standard taxonomic imaging setting. Lastly, we explored whether the same approach was applicable to tasks unsolvable by humans. Here, we experimented with separating Oxythyrea specimens by sex using images of the dorsal view. According to previous work on Oxythyrea, these species do not show any sexual dimorphism in this view. The experiments in paper II were based on the optimal parameters from paper I. Specifically, we used VGG16 (Simonyan and Zisserman, 2014) pretrained on ImageNet (Russakovsky et al., 2015) for feature extraction and an SVC (Cortes and Vapnik, 1995) for classification. Features from all five convolutional blocks were reduced using global average pooling, then concatenated and normalized using the signed square-root method. The only difference was the input image size, which was set to a smaller size (224x224) in order to speed up the experiments. The validation approach was the same as in paper I. In addition to the off-the-shelf approach developed in paper I, we repeated the same experiments using some current state-of-the-art techniques in image recognition. Specifically, we utilized a well-known CNN, SE-ResNeXt101-32x4d (Hu et al., 2018), and fine-tuned a publicly available checkpoint pretrained on the ImageNet dataset (Russakovsky et al., 2015). This architecture is a variant of ResNets (He et al., 2016) (networks with residual modules) with many improvements added subsequently, such as "cardinality" (Xie et al., 2017) and a squeeze-and-excitation block (Hu et al., 2018). The learning rate (lr) was adjusted using the one-cycle policy (Smith, 2018, 2015; Smith and Topin, 2017), with a maximum lr set to 0.0006. We set the batch size to 12 and back-propagated accumulated gradients every two iterations for a total of 12 epochs. As our optimization strategy, we used an adaptive learning rate optimization algorithm called Adam (Kingma and Ba, 2014) with a momentum of 0.9. Regularization was done using i) a dropout (Srivastava et al., 2014) layer (0.5) inserted before the last layer; ii) label smoothing as our classification loss function (Szegedy et al., 2016); and iii) augmentation (zooming, flipping, shifting,

30 brightness, lighting, contrast and mixup (Zhang et al., 2017)). Lastly, with Class Activation Maps (CAM) (Zhou et al., 2016), we visualized so-called heat maps for relevant category- specific regions of images. These heat maps light up the regions that are important for the AI system in identifying an image as belonging to a particular category.
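To make the recipe above concrete, the following is a minimal PyTorch sketch of this style of fine-tuning (one-cycle learning rate schedule with a maximum lr of 0.0006, Adam, a dropout layer before the classifier, label smoothing as the loss, and gradient accumulation every two iterations). It is not the code used in paper II: the ResNet50 backbone stands in for SE-ResNeXt101-32x4d, the FakeData dataset stands in for the Oxythyrea images, and the mixup and geometric augmentations are omitted.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

# Stand-in data: replace FakeData with the actual specimen image folders.
train_set = datasets.FakeData(size=120, image_size=(3, 224, 224), num_classes=10,
                              transform=transforms.ToTensor())
train_loader = DataLoader(train_set, batch_size=12, shuffle=True)

# Stand-in backbone (paper II fine-tuned SE-ResNeXt101-32x4d); on older
# torchvision versions use models.resnet50(pretrained=True) instead.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Sequential(nn.Dropout(p=0.5),                      # dropout before the last layer
                         nn.Linear(model.fc.in_features, 10))

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)             # label-smoothed classification loss
optimizer = torch.optim.Adam(model.parameters(), lr=6e-4, betas=(0.9, 0.999))

epochs, accum_steps = 12, 2
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=6e-4, epochs=epochs,
    steps_per_epoch=len(train_loader) // accum_steps)            # one-cycle policy

for epoch in range(epochs):
    for step, (images, targets) in enumerate(train_loader):
        loss = criterion(model(images), targets) / accum_steps
        loss.backward()                                          # accumulate gradients
        if (step + 1) % accum_steps == 0:                        # update every two iterations
            optimizer.step()
            optimizer.zero_grad()
            scheduler.step()
```

The smoothing value (0.1) and the exact hyperparameters are illustrative; the augmentations listed above would normally be added through the dataset transforms.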

2.2.2 Results

Off-the-shelf approach

Our results using the off-the-shelf system suggest that either habitus view (dorsal or ventral) can be used successfully to identify the ten species of Oxythyrea. However, better accuracy is achieved with images of the dorsal habitus compared to the ventral (3x smaller error rate). This finding corresponds to human perception of the difficulty of the task. When combining information from both views, we observed only slight improvements. This is likely caused by the above-mentioned difference in error rate between the views. If we had combined information from similarly performing views, we would have expected a greater positive impact of the combination. The images collected by a smartphone and a cheap attachable lens performed almost as well as the high-resolution images. Although humans find such images clearly inferior to high-resolution images and difficult to use for identification, they are apparently often sufficient for machine identification. A possible reason for this is that the images fed to current AI systems for image classification are reduced in size (often to around 224x224 pixels). After reduction in size, the high-resolution taxonomic images are probably quite similar to the smartphone ones, possibly explaining why they do not result in significantly better identification performance. The off-the-shelf feature transfer solution detected sexual dimorphism in the dorsal habitus of at least some Oxythyrea species. In two species, O. albopicta and O. noemi, the model seemed to be able to identify most of the males correctly, while the sex identifications for female specimens were comparable to random guessing.

Figure 2.3: Class Activation Maps for the specimens from the species/sex task (only selected specimens are depicted). Adapted from paper II.

Beyond the off-the-shelf solution

As expected, fine-tuning yielded better results than the off-the-shelf solution, drastically reducing the error rates (2-5x). This approach also allowed us to compute heat maps, which made it possible to compare machine reasoning about the identification task to the reasoning of taxonomists. The heat maps showed that the model was often focusing on the pronotum. In species considered easier to identify (O. albopicta, O. cinctella), this was the only region that was highlighted, while for more difficult species (e.g. O. dulcis, O. noemi) the model used information from a wider region including the whole elytra. According to the heat maps, O. tripolitana was easily identified using information from the small region between the pronotum and the scutellum. This species has an accumulation of setae in this region, which until now has not been regarded as a reliable morphological character by taxonomists.

When the task was sorting specimens to sex alongside species based on the dorsal habitus, the model again focused on the pronotum and sometimes on the elytra. It is not clear exactly what the discriminating feature is. Looking at the heat maps from the ventral side, for six species the model easily recognized the median region, where the pattern of white dots was present only in males (Fig. 2.3). In the remaining species, this pattern was not present in either sex. However, the same region was still highlighted. The reason was likely a sex-specific groove present in the median part of the abdomen, which is clearly visible if the abdomen is examined from the side and with appropriate illumination. However, this groove was impossible for humans to see in most of the images of the ventral habitus used in the experiment.
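For readers who want to reproduce this kind of visualization, the sketch below shows how a class activation map in the sense of Zhou et al. (2016) can be computed for any CNN that ends in global average pooling followed by a single linear classifier. The ResNet50 backbone and the helper names are illustrative assumptions, not the exact model or code from paper II.

```python
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).eval()

feature_maps = {}
def hook(module, inputs, output):
    feature_maps["last_conv"] = output            # shape: (1, C, h, w)

model.layer4.register_forward_hook(hook)          # last convolutional block

def class_activation_map(image_tensor, class_idx=None):
    """Return a CAM for one preprocessed image of shape (1, 3, H, W)."""
    with torch.no_grad():
        logits = model(image_tensor)
    if class_idx is None:
        class_idx = logits.argmax(dim=1).item()
    fmap = feature_maps["last_conv"][0]            # (C, h, w)
    weights = model.fc.weight[class_idx]           # (C,) classifier weights for this class
    cam = torch.einsum("c,chw->hw", weights, fmap) # weighted sum of feature maps
    cam = F.relu(cam)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)     # normalize to [0, 1]
    cam = F.interpolate(cam[None, None], size=image_tensor.shape[-2:],
                        mode="bilinear", align_corners=False)[0, 0]
    return cam, class_idx

# Toy call with a random tensor; in practice use an ImageNet-normalized specimen photo.
cam, cls = class_activation_map(torch.rand(1, 3, 224, 224))
```

The resulting map can be overlaid on the input image to produce heat maps of the kind shown in Fig. 2.3.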

2.3 Paper III

2.3.1 Material and methods

In paper III, we explored the utility of taxonomic or phylogenetic information in training and evaluating CNNs for identification of biological species, with the aim of improving these systems so that they make fewer catastrophic errors. Specifically, we included the biological information during training by adjusting targets with label smoothing (LS) based on taxonomic information (taxonomic label smoothing - TLS) or distances between species in a reference phylogeny (phylogenetic label smoothing - PLS). Similarly to paper II, we used fine-tuning from a pretrained checkpoint, and we compared our approach against two well-established baselines: one-hot encoding and standard LS (Szegedy et al., 2016). In one-hot encoding, categories are represented as binary vectors of length equal to the number of unique categories, with 1 for the correct target and 0 for the other categories. LS is a weighted average of the one-hot encoded labels and the uniform probability distribution over labels. In both scenarios, networks are optimized toward targets using a distribution in which categories are equally distant from each other. In such a setting, a network is equally penalized for every misidentification it makes. With our approach, hierarchical information based on biological relatedness is incorporated in the target encoding. This results in the network being less penalized for misidentifying categories that are closely related biologically, and more penalized for mixing up distant categories.
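As a small numeric illustration of the two baseline encodings (the class count, the correct index and the smoothing value below are arbitrary choices, not values from paper III):

```python
import numpy as np

n_classes, correct, eps = 5, 2, 0.1        # eps: smoothing value (illustrative)

one_hot = np.eye(n_classes)[correct]       # [0, 0, 1, 0, 0]
uniform = np.full(n_classes, 1 / n_classes)
label_smoothing = (1 - eps) * one_hot + eps * uniform
# -> [0.02, 0.02, 0.92, 0.02, 0.02]; every wrong class receives the same small mass,
#    so all misidentifications are penalized equally during training.
```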

Figure 2.4: Systems optimized toward one-hot or LS (Szegedy et al., 2016) targets assume all categories are equally distant from each other and hence all errors are penalized equally. We propose to smooth the targets using hierarchical biological relatedness information (taxonomy or phylogeny) so that systems are penalized more for erroneous identifications that are farther away from the correct category.

Our approach differs from one-hot encoding and standard label smoothing only in the target encoding. We use a weighted average of one-hot encoded labels and a non-uniform distribution over labels representing hierarchical biological information about the categories, based on taxonomic ranks, anagenetic distance, or cladogenetic distance (Figure 2.4). For taxonomic rank, we counted the number of shared taxonomic levels (genus and family) between the correct target and each of the other categories. For anagenetic-distance smoothing, we used the branch lengths separating the correct target and each of the other categories on the reference tree as the distance measure. For cladogenetic-distance smoothing, we instead used the number of edges separating the categories on the reference tree as the distance measure. Lastly, we normalized the resulting values by subtracting them from the maximum value (so that closely related categories had the highest values), and then we normalized the values, that is, we divided the values by the sum over all categories so that the distribution summed to 1, as in one-hot labels. For all hierarchical smoothing methods, we explored several mix-in proportions (smoothing values), β, of hierarchical information to binary information (β ∈ {0.025, 0.05, 0.1, 0.2, 0.4}) on two image datasets: 38,000 outdoor images of 93 species of snakes and 2,600 habitus images of 153 species of Lepidoptera (butterflies and moths) obtained from GBIF (2020).

In addition to accuracy, used in papers I-II, here we used two more common evaluation metrics: Top-N-accuracy or topN: if the correct category is among the N highest probabilities, we count the answer as correct (in this study N=3); and f1 score macro: the mean over categories of the per-category f1 score, where f1 is the harmonic mean of recall and precision; recall is the ratio of correctly predicted positive observations to all observations in the actual category, and precision is the ratio of correctly predicted positive observations to the total predicted positive observations. All three of these evaluation metrics assume that all errors are equally bad. For that reason we also measured the accuracy at the genus and family levels, and for the snake dataset we report the accuracy of predicting a relevant biological trait, namely whether a snake species is venomous or not.
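The sketch below shows one way the hierarchically smoothed targets described above can be constructed from a pairwise distance matrix (branch lengths for anagenetic distance, edge counts for cladogenetic distance, or distances derived from shared taxonomic ranks). The function name, the toy distance matrix, and the choice to keep the correct category's own smoothing mass at zero are our illustrative assumptions rather than details taken from paper III.

```python
import numpy as np

def hierarchical_targets(dist, correct, beta):
    """Smooth a one-hot target using one row of a pairwise distance matrix.

    dist:    (n, n) distances between categories (e.g. branch lengths for PLS).
    correct: index of the true category.
    beta:    total smoothing mass mixed into the one-hot target.
    """
    n = dist.shape[0]
    d = dist[correct].astype(float)
    sim = d.max() - d                # subtract from the maximum: close relatives score high
    sim[correct] = 0.0               # the correct class keeps its one-hot mass (assumption)
    sim = sim / sim.sum()            # normalize so the smoothing distribution sums to 1
    one_hot = np.eye(n)[correct]
    return (1 - beta) * one_hot + beta * sim

# Toy example: 4 species where species 0 and 1 are sister taxa.
dist = np.array([[0, 1, 4, 6],
                 [1, 0, 4, 6],
                 [4, 4, 0, 6],
                 [6, 6, 6, 0]], dtype=float)
print(hierarchical_targets(dist, correct=0, beta=0.1).round(3))
```

With β = 0.1, the sister taxon of the correct species receives most of the smoothing mass, and the most distant species receives none, so misidentifications of close relatives are penalized less than misidentifications of distant ones.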

2.3.2 Results

First, we evaluated the experiments with standard evaluation metrics. On the first dataset, one-hot encoding gave better results than LS on accuracy and top3-accuracy, but slightly worse on f1 scores with macro averaging. Systems trained using phylogenetically informed targets (TLS or PLS) with small smoothing values (0.025 ≤ β ≤ 0.1) gave slightly better results than both benchmarks, with the exception of anagenetic-distance smoothing, which performed on par with the benchmarks. On the second dataset, the LS benchmark consistently gave better results than the one-hot benchmark. Results from experiments with TLS and PLS were consistently better than the one-hot benchmark. When compared to the second benchmark, LS, the inclusion of hierarchical information (TLS or PLS) with intermediate smoothing values (0.05 ≤ β ≤ 0.4) gave similar accuracy and f1 scores, while it often gave better top3 scores.

Our phylogenetically or taxonomically informed approach performed better than both benchmarks on evaluation metrics that take the hierarchical information into account. Specifically, we found that TLS or PLS (based on anagenetic or cladogenetic distances), across a range of different smoothing values, yielded better results on the accuracy of identifications at the genus or family level, or the accuracy of predicting an important biological trait (a snake being venomous). Thus, even if the systems trained using TLS or PLS made the same number of errors as the benchmark reference systems, the errors tended to be less serious in that the misidentified categories involved organisms that were more closely related to each other.

2.4 Paper IV

2.4.1 Design of the study, material and methods

The aim of paper IV was to describe lessons learned during the Swedish Ladybird Project 2018 (SLP2018), a citizen-science project focused on collecting smartphone images in order to develop an ATI tool for Swedish species of ladybird beetles. The first phase, the citizen-science (CS) part, was organized in the summer of 2018. Initially, we aimed the project at schoolchildren (ages 6-16), but after the initial press release the project caught the attention of many local and national media and attracted a lot of interest from potential participants. Therefore, we decided to extend the project to include the interested public in general, and preschool kids (up to age 6) in particular. We offered teachers the option to register in advance to allow direct communication with the project team and to receive additional support. The preregistered classes were also provided with an “experiment kit”, which included a guide for teachers, a macro lens for mobile devices, and a recently published comprehensive field guide to Swedish ladybird species. In total, 700 experiment kits were distributed among participants. The aim was to provide one kit per 15 participating kids, so larger classes or sets of classes received more than one kit. Contributions were submitted through an app specifically made for this project. For the identification of the ladybird species in the collected images, we relied on experts. After completion of the project, an evaluation survey was sent out to the registered participants.

For comparison purposes, we downloaded all the images of Swedish ladybirds available through the Global Biodiversity Information Facility (GBIF, 2020). The taxonomic identifications of the images provided by GBIF were taken at face value. GBIF images contributed in 2018 were used for the evaluation of ATIs (”GBIF2018”), while the remaining images (collected before 2018; ”GBIF training”) were used as an additional training set that we could compare to the image set collected through the SLP2018 project. Specifically, to evaluate the SLP2018 contribution, we trained networks on SLP2018, GBIF training and SLP2018+GBIF training, and compared the identification performance of the resulting ATIs. We fine-tuned a well-known and lightweight architecture, ResNet50 (He et al., 2016), from a publicly available checkpoint pretrained on ImageNet (Russakovsky et al., 2015). The fine-tuning procedure resembled the one used in paper II but with different hyperparameters. As our evaluation metrics we used accuracy, as described in paper I, and f1 score macro, as described in paper III. In addition to these common metrics, we used f1 score macro on subsets of species created based on the number of images per species on GBIF: the two dominant species, Harmonia axyridis and Coccinella septempunctata; other common species (ranked 3-10 in abundance); and the remaining species.
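A minimal sketch of this evaluation scheme using scikit-learn is shown below; the species lists and predictions are dummies, and in practice y_true and y_pred would be the expert labels and ATI predictions for the GBIF2018 evaluation images.

```python
from sklearn.metrics import accuracy_score, f1_score

# Dummy labels standing in for expert identifications and model predictions.
y_true = ["C. septempunctata", "C. septempunctata", "H. axyridis",
          "A. bipunctata", "P. vigintiduopunctata"]
y_pred = ["C. septempunctata", "H. axyridis", "H. axyridis",
          "A. bipunctata", "A. bipunctata"]

subsets = {
    "dominant": ["C. septempunctata", "H. axyridis"],         # ranks 1-2 on GBIF
    "common":   ["A. bipunctata", "P. vigintiduopunctata"],   # stand-ins for ranks 3-10
}

print("accuracy:", accuracy_score(y_true, y_pred))
print("F1 macro (all):", f1_score(y_true, y_pred, average="macro", zero_division=0))
for name, labels in subsets.items():
    # 'labels' restricts the macro average to the species in the subset.
    print(f"F1 macro ({name}):",
          f1_score(y_true, y_pred, labels=labels, average="macro", zero_division=0))
```

Because macro averaging weights every species equally, this breakdown reveals whether an added training set helps the rare species rather than just the two dominant ones.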

2.4.2 Results

Almost 400 teachers registered to participate in the project. According to the 24 replies we received in the evaluation survey, many of them supervised more than one class. On average, the number of students per registered teacher was 31.5 (range 16-67), or in total around 12,000 children. If we consider unregistered participants, which also included schools and preschools, and expeditions for kids organized by amateur entomologists and natural history museums, we estimate that the total number of children participating in the project was around 15,000. The registered teachers were instructed to spend 1-2 hours of effort on field work. The survey revealed that most of the groups had searched for ladybirds multiple times (up to 28 times), and most of them (80%) invested up to 2 hours on the task on each field trip. Perhaps even more astonishing was that some groups, presumably preschool children, had spent orders of magnitude more time than expected (replies included “every day”, “all the time throughout the project period” and “many times”). In total, we received over 5,000 images for almost half of the Swedish ladybird species (30/71).

Figure 2.5: Geographical distribution of the contributed images. Most of the images came from the most populated counties, Stockholm, Skåne and Västra Götaland (a). In (b) we show the locations where images were taken (each dot represents a single image) and in (c) we show the normalized contribution per capita for each of the 21 counties (darker is more). Adapted from paper IV.

The first images were submitted in early June, followed by the first peak during the first three weeks of summer (weeks 26-28). The summer was calm until weeks 35-38, which brought the biggest contributions. The contributions were received from all of the 21 counties of Sweden. The most populated counties, Stockholm, Västra Götaland and Skåne, contributed most of the images. However, when accounting for population size, Gotland stood out with 7 times as many contributions per capita as the country’s average. The majority of the insects in the contributed photos (68%) were identified as Coccinella septempunctata. Similar patterns were found in the individual months, with C. septempunctata comprising between 60 and 80% of the monthly contributions. In addition to C. septempunctata, only three more species were represented by more than 3% of the total number of images: Psyllobora vigintiduopunctata (7.7%), Harmonia axyridis (5.8%) and Adalia bipunctata (5.4%). The experiments designed to analyze the contribution of SLP2018 in the context of developing ATIs showed that the estimated value of the SLP2018 data heavily depended on the choice of evaluation metric. The SLP2018 data, when compared to the GBIF data, did not contribute significantly to improving the accuracy or top3 scores. These metrics favor majority categories (such as C. septempunctata and H. axyridis in our study), and these were already well represented in the GBIF dataset, with 80% of all the ladybird images on GBIF belonging to one of these two species. However, adding SLP2018 data to GBIF training notably improved our second metric, F1 score with macro averaging. The improvement came from minority categories, and we could see that by comparing F1 scores over three subsets of species. The scores for the first subset, comprising the two dominant species (C. septempunctata and H. axyridis), remained largely unchanged. The scores for the second subset, containing the remainder of the common species, were slightly but clearly improved, while the scores for the last subset, containing the rare (minority) species, benefited the most from the addition of the SLP2018 images.

Chapter 3

Discussion

The scientists of today think deeply instead of clearly. One must be sane to think clearly, but one can think deeply and be quite insane.

Nikola Tesla

Our results from paper I show that there is considerable room for optimization of current DL techniques so that they generate better ATIs in settings that are typically encountered by taxonomists. Surprisingly, our experiments in paper I show that it is possible to find strategies that significantly boost the performance of ATI systems across a range of taxonomic tasks and datasets, and these strategies seem to be successful also on related tasks. Specifically, the results obtained in paper I are based on a computationally efficient approach, namely feature transfer from a single feed-forward pass through a publicly available checkpoint of the VGG16 (Simonyan and Zisserman, 2014) system pretrained on the ImageNet classification task (Russakovsky et al., 2015). When compared to a baseline application of this approach, we were able to significantly improve the performance of the resulting ATI by introducing each of the following modifications or additions: a larger image input size (416x416), global average pooling, fusion of features from all five convolutional blocks, and signed square-root normalization.
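For concreteness, the following is a sketch of this kind of optimized feature-transfer pipeline (VGG16 features from all five convolutional blocks at a 416x416 input, global average pooling, concatenation, signed square-root normalization, and a linear support vector classifier). The hook-based implementation and the toy data are ours; details of the preprocessing and classifier settings in paper I may differ.

```python
import numpy as np
import torch
from torch import nn
from torchvision import models
from sklearn.svm import LinearSVC

# VGG16 pretrained on ImageNet; we collect one feature vector per convolutional
# block by hooking the five max-pooling layers and applying global average pooling.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()
pooled = []
def make_hook(store):
    def hook(module, inputs, output):
        store.append(torch.mean(output, dim=(2, 3)))   # global average pooling
    return hook
for m in vgg.features:
    if isinstance(m, nn.MaxPool2d):
        m.register_forward_hook(make_hook(pooled))

@torch.no_grad()
def deep_features(batch):
    """batch: (N, 3, 416, 416) tensor, ImageNet-normalized in practice."""
    pooled.clear()
    vgg.features(batch)                                # forward through the conv blocks only
    feats = torch.cat(pooled, dim=1).numpy()           # fuse the five blocks (64+128+256+512+512)
    return np.sign(feats) * np.sqrt(np.abs(feats))     # signed square-root normalization

# Toy usage with random tensors standing in for preprocessed specimen images.
X = deep_features(torch.rand(4, 3, 416, 416))
y = np.array([0, 0, 1, 1])
clf = LinearSVC().fit(X, y)                            # linear support vector classifier
print(clf.predict(X[:2]))
```

Pooling to 2x2 or 4x4 feature matrices instead of 1x1, as discussed below, amounts to replacing the global mean in the hook with torch.nn.functional.adaptive_avg_pool2d(output, 2) or (output, 4) and flattening the result.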

Our results also indicate that it may be possible to further improve identification performance by optimizing the dimensionality reduction. In our experiments, global pooling (1x1xF) was not always optimal; pooling to small but not minimal feature matrices (2x2xF or 4x4xF) often resulted in better performance. Specifically, intermediate pooling resulted in significant improvements on three out of four datasets. It turned out that intermediate-level features extracted from the c4 layer were the most useful ones for the taxonomic tasks we tackled in paper I. According to Zheng et al. (2016), this is not the case for generic image classification tasks (e.g. Caltech-101 (Li Fei-Fei et al., 2006), Caltech-256 (Griffin et al., 2007), and PASCAL VOC07 (Everingham et al., 2015)), nor for popular fine-grained identification tasks (Bird-200-2011 (Wah et al., 2011) and Flower-102 (Nilsback and Zisserman, 2008)), where the most discriminating layers are the first fully connected layer and c5, respectively. A possible reason for this discrepancy could be that our images are more distant from the ImageNet dataset than any of the above-mentioned image datasets, or that differences between categories in our datasets are even more subtle than those in the previous fine-grained identification tasks.

From the perspective of insect taxonomy, perhaps the most interesting insight from paper I is that, with optimized feature transfer, it is possible to develop high-performance ATIs for a wide range of tasks without expertise in image analysis or machine learning. Developing ATIs with identification accuracy on par with or exceeding that of human experts clearly seems to be within reach for non-experts. Thus, many taxonomists should be able to leverage the latest advances in CNNs and deep learning themselves to develop high-performance ATIs for just about any classification task of interest in insect systematics.

In paper II, we examined the potential of the system with the optimal feature transfer protocol identified in paper I in testing a set of new taxonomic hypotheses. With no further optimization, our off-the-shelf solution from paper I proved to be very useful in tackling several challenging identification tasks, some of which were previously thought to be unsolvable by humans. A particularly striking example is the identification of sex from images of the dorsal habitus of Oxythyrea beetles; sexual dimorphism in dorsal habitus has not been noted previously in these beetles.

It is perhaps not surprising that the amount of data is important for good performance of machine identification (the more examples, the better; paper I), or that the difficulty increases with the amount of morphological variation within categories (sexual dimorphism, life stages, age, etc.; papers I, II). Thus, collecting a sufficient number of accurately labeled images for developing a sufficiently competent ATI is important, but may be a difficult task, particularly since these image sets often have to be assembled essentially from scratch for many taxonomic tasks. Fortunately, our results (paper II) indicate that, at least for some tasks, images collected with a smartphone and a cheap attachable lens are quite sufficient. Thus, photographing specimens with a smartphone sometimes offers an interesting alternative to the time-consuming procedure of collecting high-resolution images in a standardized taxonomic imaging setup. We estimated that by using a smartphone, one may collect 40x more images in the same amount of time as that required for the traditional setup. It is worth noting that an ATI based on a larger set of smartphone images may actually perform better than a system trained on a smaller number of high-resolution images. The smartphone approach is also much easier to scale, as almost everyone carries a smartphone today, while the setup used for high-resolution imaging of insects is mostly restricted to entomologists at larger institutions and requires a substantial investment in equipment.

In both papers I and II, we observed that machines and humans tend to find the same or similar tasks challenging. Good examples are separating visually similar species (sibling species) (papers I, II) or placing a species in the correct higher taxonomic group (family or tribe in our case) when the subgroup it belongs to (species and genus in our case) has not been seen before (paper I). However, we found occasional cases where machines made catastrophic errors that no human would make, something we tried to address in paper III (see below).

An off-the-shelf solution can be sufficient for developing an ATI with adequate performance for many tasks, as shown in papers I-II, but even when this is not the case, such a solution is useful in testing whether an identification task is feasible at all, and whether it is worth spending more resources on developing a more sophisticated system. Going beyond off-the-shelf and utilizing state-of-the-art techniques can improve AI performance, but can also give additional insights, as illustrated by our heat maps (class activation maps). This technique highlights the regions of images containing the features that a model finds informative for a given task (Fig. 2.3). Our experiments (paper II) indicate that human experts often find features occurring in the highlighted regions useful for species discrimination. However, these heat maps are not always easy to interpret. They do not tell us exactly what the discriminating features are but rather indicate where they are located. For example, consider an image of a beetle specimen, and assume that a region of the specimen with a lot of hairs is highlighted by the heat map (as in our experiments). This does not tell us what feature of the hairs is important for discrimination (their number, distribution, thickness, length or something else). In fact, we cannot even be sure that the discriminating feature has anything to do with the hairs; it could be some underlying feature, such as the shape or size of the hairy body part. Therefore, we should not expect the heat maps to point exactly to an interesting morphological feature but rather to give us hints of where to look. Then, it is up to the expert to hypothesize what the actual discriminating feature is.

Another interesting phenomenon that we were able to demonstrate with the heat maps was the fact that CNNs learn from whole images, including information that may be irrelevant for the task. Specifically, the heat maps for one of the species studied in paper II highlighted the region where the pin is located. It turned out that all the imaged specimens of this species were collected by a single entomologist, who was pinning them from a non-standard angle, and this entomologist did not contribute any specimens of the other species. This is an excellent example of how bias can be introduced in image training sets.

Intuitively, one might expect that more sophisticated training of AI systems could reduce the risk of such irrelevant biases, or the risk of making catastrophic errors, that is, misidentifying categories that are completely unrelated. In paper III we explore this idea by proposing to utilize phylogenetic relationships among biological organisms (TLS and PLS) in the training of ATIs. Our solution is inspired by the recently introduced LS (Szegedy et al., 2016) technique, which uses a weighted average of one-hot targets and a uniform distribution over labels during training. However, unlike both LS and one-hot encoding, TLS and PLS use target encodings based on non-uniform distributions. Specifically, they mix one-hot encoding with a small proportion of a distribution based on a hierarchical scoring scheme reflecting either taxonomic (TLS) or phylogenetic (PLS) information. This rewards the system being trained for predicting the correct category, while punishing it for errors in proportion to how much these errors violate what we know about the biological relationships among the categories. The technique we propose could easily be applied also in other cases where there is a natural hierarchy describing the similarity relations among the categories. As demonstrated by our results, systems trained in this way tend to make less serious errors. For this reason, hierarchically aware systems might be preferred in many practical applications over standard systems, even when they make more errors in total. Examples may include cases where the cost of making an error is very high, such as misidentifying a venomous snake as a non-venomous one. The results presented in paper III indicate that systems trained with phylogenetic information often perform on par with or better than baseline systems in terms of common evaluation metrics, such as accuracy, topK-accuracy and f1 score macro. When evaluated with custom metrics that take the biological context into account (accuracy at the genus and family levels, or accuracy in predicting a biological trait), we observed that the systems trained using TLS or PLS often outperformed both of the baseline systems. In our experiments, a range of smoothing ratios (β values) was tested, and although 0.05 and 0.1 seemed to perform well on both datasets, we noticed that applying the same smoothing ratio across tasks might be suboptimal given the current methodological setup. The reason for this is possibly the difference in the number of categories across tasks and the fact that we distribute the total smoothing value β across all categories. As a result, tasks with a larger number of categories have a smaller average smoothing value per category. A possible solution could be to distribute the smoothing only across a fixed number of the most related categories and keep the remaining categories at zero in the smoothing distribution, as sketched below.
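A sketch of that suggestion, restricting the smoothing mass to the k most closely related categories, might look as follows; the weighting of the retained categories is one possible choice, not something evaluated in paper III.

```python
import numpy as np

def topk_hierarchical_targets(dist, correct, beta, k=5):
    """Spread the smoothing mass beta over only the k closest relatives of the
    correct category; all other categories stay at exactly zero."""
    n = dist.shape[0]
    d = dist[correct].astype(float)
    closeness = d.max() - d                  # closer relatives get higher scores
    closeness[correct] = 0.0                 # correct class keeps its one-hot mass
    keep = np.argsort(-closeness)[:k]        # indices of the k most related categories
    weights = np.zeros(n)
    weights[keep] = closeness[keep]
    if weights.sum() == 0:                   # degenerate case: fall back to uniform over k
        weights[keep] = 1.0
    weights /= weights.sum()
    return (1.0 - beta) * np.eye(n)[correct] + beta * weights
```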

Assembling sufficiently large training sets of accurately labeled images will remain a challenge in the development of ATIs for many organism groups. Using citizen-science contributions is an obvious possibility, and our attempt in this direction (paper IV) exceeded all expectations. We estimated that approximately 15,000 children participated in the project. Given our conservative estimate that each child devoted 1 hour to the task, this translates to a contribution of approximately 15,000 hours (375 weeks or 83 person months) of effort to the project. If a single person were required to perform this task alone, it would take that person more than nine years to complete it (assuming roughly 1,600 annual working hours in Sweden; Giattino et al., 2013).

Over 5,000 images were received in two major waves, one in the early summer and one in the late summer. These waves were anticipated as a result of the design of the project, which closely followed the biology of ladybirds in Sweden. Adult ladybirds become active in May-June, searching for food and mates; they then slow down until the end of the summer, when the new generation emerges and joins the overwintered adults in the search for food to build up reserves for the upcoming winter. The natural life-history fluctuations in adult beetle activity were probably further accentuated by the timing of project activities. The first wave coincided with the project kick-off and the launch of an extensive project marketing campaign. We did have participants that contributed images during the whole summer (preschools, field stations, museums and groups of amateur entomologists), but the image contributions were fewer during this period. Finally, the majority of the contributed images were received in the second wave in the late summer, when schoolchildren returned to school and teachers had scheduled project activities in the field.

The results were impressive in terms of both species coverage and taxonomic composition. In total, 30 of the 71 naturally occurring coccinellid species were photographed. The most common species were photographed the most, while some rare species were completely absent from the contributed images. An interesting finding was that Harmonia axyridis, an invasive species that was first recorded from Sweden a decade ago (Brown et al., 2007), was the third most commonly photographed species. As one might expect, we noted some biases toward species that thrive in urban environments and against those that occur in habitats less suitable for field excursions with children or in less populated parts of the country (northern species).

The 5,119 images collected by the SLP2018 project represent a significant expansion of what was at the time available through GBIF. SLP2018 images increased the number of Swedish images of ladybird beetles by a factor of 50. The number of SLP2018 images approached the number of images submitted that year to GBIF from around the world (7,264), and represents 21% of all GBIF images, collected over the last 20 years from all over the globe, of ladybird beetle species occurring in Sweden. These numbers clearly show the potential of citizen-science projects in quickly assembling sizeable sets of images for suitable organism groups. Our experiments also confirm that the SLP2018 data provided valuable information for training ATIs. The increase in identification accuracy was particularly obvious for the less common species, the ones for which the number of images available previously was insufficient for adequate training of ATIs.

It is interesting to speculate on how useful citizen-science projects might be in general for assembling suitable image sets for training ATIs. Clearly, the most significant boosts are to be expected for organism groups and species for which there are few existing images. The group also needs to be fairly easy for the participating citizen scientists to identify and to photograph. We think that a fair number of groups of insects, other invertebrates, fungi and plants may fit these criteria. For some of the more obscure groups, it may sometimes be possible to find more specialized amateur naturalists who are willing to contribute, while other groups are suitable for larger-scale projects targeting non-specialists, as in the SLP2018 project. An interesting lesson is the outsize contribution of the images of the rare species and of the rarely photographed subgroups, such as some of the subfamilies of Coccinellidae that were poorly represented in both the SLP2018 and GBIF image sets. It may well be possible to increase the effectiveness of citizen-science projects by directing the attention of participants towards these subgroups or species, perhaps by considering various point reward systems. Clearly, citizen science provides many valuable opportunities for advancing the development of ATIs. At the same time, these projects can also help raise awareness of the value of biodiversity and have many other positive societal side effects.

Chapter 4

Concluding remarks and thoughts on the not-so-distant future

The present is theirs; the future, for which I really worked, is mine.

Nikola Tesla

The research presented in this thesis demonstrates how current AI technologies, specifically CNNs and DL, can solve several types of challenging taxonomic identification tasks (papers I-II). We managed to develop an optimized feature extraction protocol that was successful in tackling a broad range of taxonomic identification tasks, making it possible to offer an easy-to-use, instant-return approach for taxonomists interested in investigating the feasibility of using AI to address particular identification problems. Specifically, we demonstrated the usefulness of our optimized solution in: (1) identifying insects to higher groups when they are likely to belong to subgroups that have not been seen previously; (2) identifying visually similar species that are difficult to separate even for experts; and (3) solving some identification tasks that appear unsolvable to humans, such as detecting the sex of a specimen when there is no prior evidence in the literature of sexual dimorphism.

We also showed that going beyond such an off-the-shelf solution by utilizing state-of-the-art AI techniques can yield further improvements in identification accuracy while providing additional insights. A particularly striking example of the latter is provided by the heat maps presented in paper II, highlighting regions that are particularly important for the ATI in discriminating among categories of imaged objects. This type of approach may well turn out to be a powerful tool that can guide experts in identifying the informative morphological features that are critical for both taxonomic and phylogenetic research.

In paper III, we demonstrated that by leveraging taxonomic or phylogenetic information we can train systems that are as good as standard systems in terms of common evaluation metrics (accuracy, topK accuracy, f1 score macro), but make errors that are less wrong in an objective biological sense. An obvious research direction inspired by this paper would be to explore the use of CNNs for phylogenetic inference. My initial experiments show that models optimized only toward targets based on phylogenetic information are quite good at approximating the position of species in the tree of life. Of course, this requires that a reference phylogeny is available for training. However, there are numerous ways in which such phylogeny-trained AI systems could be useful. For instance, a system trained using an approximate or incomplete reference phylogeny could learn to distinguish phylogenetically important features, and thus refine the phylogeny in the way it perceives similarities among images. It should also be able to place unseen species in a phylogenetic context more accurately than other systems. It may also turn out that AI “knowledge” about what image features structure the phylogeny of one group of organisms can transfer to related groups. If so, then an AI system trained to reconstruct phylogeny may well be able to infer the phylogeny of a group of organisms it has not seen previously using only images of those organisms.

We already mentioned that the credit for recent developments in computer vision goes to faster and cheaper computational power, improved algorithms, and increasing amounts of image training data available on the internet. We are witnessing constant, fast-paced improvements in the first two, regardless of what happens in the taxonomic community. However, the taxonomic community could contribute substantially to helping the field make even bigger strides forward by generating sufficient amounts of accurately labeled image data for the training of AI systems. An obvious way to generate more data is to use CS efforts similar to our project described in paper IV. However, the CS path does not seem a viable solution to the problem of accumulating adequate training sets for a significant number of insect species. Some organism groups are simply too difficult for citizen scientists to find in the field, or to photograph in such a way that it will be possible to identify the imaged species. In some cases one may be able to extend the limits of what can be done by focusing on specialized amateur naturalists, by training amateur naturalists to work with new groups, or by eliciting the help of CS in imaging existing collections of specimens. However, in the end, accumulating sufficient training data has to be achieved using other approaches for many organism groups.

An interesting alternative approach to CS would be to use camera traps for accumulating images for training. Camera traps have the additional advantage that they can also be used for continuous monitoring. There is already considerable experience with camera traps for mammals, and with developing AI techniques for these data. For example, in the Snapshot Serengeti project (Swanson et al., 2015), CNN-based models have now been developed that are better than humans in terms of accuracy at recognizing common species (zebra, lion, elephant) (Norouzzadeh et al., 2018; Willi et al., 2019). There are also several successful examples targeting insects, mainly driven by industrial applications (e.g. see the recent review paper on automated monitoring of pests (Cardim Ferreira Lima et al., 2020) and references therein). However, as in the CS approach, we need to find ways to generate labels for the images we collect. For instance, by the year 2015, Snapshot Serengeti had accumulated over 1,500,000 labeled images by using an army of 100,000 volunteers. In this project, it was found that having an image labeled by one volunteer gave only 85% accuracy in the identification. This improved to 95% and 98% accuracy when using the voting consensus of 10 and 20 volunteers, respectively (see Swanson et al. (2015) for details of the setting). A single expert could label images with 96.6% accuracy, CNNs alone reached 97% accuracy, and CNN+expert in a human-in-the-loop approach reached 99.8% accuracy. Unfortunately, insect identification is considerably more challenging for humans than mammal identification for several reasons, including the enormous diversity of insects and the fact that becoming an expert on even a single group can require a lifetime of training.
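To illustrate the consensus idea (and not the actual Snapshot Serengeti aggregation pipeline, which was more elaborate), plurality voting over volunteer labels can be as simple as:

```python
from collections import Counter

def consensus_label(votes):
    """Return the plurality label among volunteer votes and its support."""
    counts = Counter(votes)
    label, n = counts.most_common(1)[0]
    return label, n / len(votes)

# Ten volunteer identifications for one camera-trap image (made-up example).
votes = ["zebra"] * 7 + ["wildebeest"] * 2 + ["buffalo"]
print(consensus_label(votes))   # ('zebra', 0.7)
```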

Thus, entomologists need to be more creative than mammalogists in order to get over this hump caused by the need for sizeable collections of accurately labeled training data.

Most of the challenges we find in building AI identification systems for taxonomy are not unique to our field. In fact, we can count on constant advancements in DL and computer vision to also benefit the construction of ATIs. For instance, there has recently been significant progress in utilizing unlabeled images in training (Chen et al., 2020; He et al., 2020), which considerably reduces the need for labeled data at the expense of computational resources. Because computational resources are getting cheaper and faster, while human labor costs for labeling are increasing, this seems a particularly promising future direction for the development of ATIs and many other AI applications. Another research area of potential interest to taxonomists, where the DL community is very active, is so-called automated machine learning (AutoML) (see He et al. (2021) and references therein). The idea is to build solutions based on cutting-edge techniques but without any human involvement, thus completely removing the need for machine learning expertise. In other words, the idea is to completely automate data preparation, feature engineering, hyperparameter optimization, and neural architecture search. There are many other promising ideas currently pursued by the DL community, which will, sooner or later, affect many aspects of taxonomy and transform the way we work with species discovery, description and identification.

My impression is that the time is ripe for biologists experimenting with AI techniques to set bigger goals that could provide more substantial benefits and bring more transformative changes. For instance, imagine a robot with a camera, which could capture specimens, automatically detect them and recognize most of the species, collect behavioural data, and then, depending on a predefined set of rules, select specimens to be collected and preserved for future study, eliminate specimens of invasive or harmful species, and release the remainder. How about nano-robots small enough that millions could fit on the head of a single insect pin? If nano-robots can flow through our bloodstreams to detect and cure diseases, then it might be possible to have similar nano-robots fly around in the wild, detect an interesting specimen, collect behavioural data and sample tissue for genetic analysis in a non-destructive way. The recent progress in technology gives me confidence that these types of systems might be feasible in the near future, and I hope I will be able to contribute to the journey towards building them.

In conclusion, today’s ”machines” can see, sometimes better than we do; they learn much faster; they never forget; and their knowledge is transferable to new tasks in similar domains. Technology already enables us to automate many aspects of taxonomic work, and it is only a matter of time until we are able to put together pieces of technology already perfected in various other industries to achieve even more impressive feats. Taking advantage of these opportunities will enable us to progress more forcefully than ever toward the goal of saving the Earth. Despite all the challenges, I think the future is bright, as those challenges are nothing but new opportunities.

Bibliography

Agnarsson,I. and Kuntner,M. Taxonomy in a changing world: seeking solutions for a science in crisis. Systematic Biology, 56(3):531–539, 2007.

Arandjelovic,R. and Zisserman,A. All about VLAD. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 1578–1585. cv-foundation.org, 2013.

Arbuckle,T., Schröder,S., Steinhage,V., and Wittmann,D. Biodiversity informatics in action: identification and monitoring of bee species using ABIS. In Proc. 15th Int. Symp. Informatics for Environmental Protection, volume 1, pages 425–430. enviroinfo.eu, 2001.

Azizpour,H., Razavian,A. S., Sullivan,J., Maki,A., and Carlsson,S. Factors of transferability for a generic ConvNet representation. IEEE Trans. Pattern Anal. Mach. Intell., 38(9):1790–1802, September 2016.

Bengio,Y. Deep learning of representations for unsupervised and transfer learning. In Proceedings of ICML workshop on unsupervised and transfer learning, pages 17–36, 2012.

Bottou,L., Curtis,F. E., and Nocedal,J. Optimization methods for large-scale machine learning. Siam Review, 60(2):223–311, 2018.

Breiman,L. Random forests. Machine learning, 45(1):5–32, 2001.

Brown,P., Adriaens,T., Bathon,H., Cuppen,J., Goldarazena,A., Hägg,T., Kenis,M., Klausnitzer,B., Kovář,I., Loomans,A., et al. Harmonia axyridis in Europe: spread and distribution of a non-native coccinellid. In From Biological Control to Invasion: the Ladybird Harmonia axyridis as a Model Species, pages 5–21. Springer, 2007.

Cardim Ferreira Lima,M., Damascena de Almeida Leandro,M. E., Valero,C., Pereira Coronel,L. C., and Gonçalves Bazzo,C. O. Automatic detection and monitoring of insect pests—a review. Agriculture, 10(5):161, 2020.

Carranza-Rojas,J., Goeau,H., Bonnet,P., Mata-Montero,E., and Joly,A. Going deeper in the automated identification of herbarium specimens. BMC Evol. Biol., 17(1):181, August 2017.

Caruana,R. Learning many related tasks at the same time with backpropagation. In Advances in neural information processing systems, pages 657–664, 1995.

Ceballos,G., Ehrlich,P. R., and Dirzo,R. Biological annihilation via the ongoing sixth mass extinction signaled by vertebrate population losses and declines. Proceedings of the national academy of sciences, 114(30):E6089–E6096, 2017.

Chapman,A. D. et al. Numbers of living species in Australia and the world. 2009.

Giattino,C., Ortiz-Ospina,E., and Roser,M. Working hours. Our World in Data, 2013. https://ourworldindata.org/working-hours.

Chen,T., Kornblith,S., Norouzi,M., and Hinton,G. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.

Cireşan,D., Meier,U., Masci,J., and Schmidhuber,J. A committee of neural networks for traffic sign classification. In The 2011 international joint conference on neural networks, pages 1918–1921. IEEE, 2011.

Coleman,C. O. Taxonomy in times of the taxonomic impediment–examples from the community of experts on amphipod crustaceans. Journal of Crustacean Biology, 35(6):729–740, 2015.

Cortes,C. and Vapnik,V. Support-vector networks. Mach. Learn., 20(3):273–297, September 1995.

Cox,D. R. The regression analysis of binary sequences. Journal of the Royal Statistical Society: Series B (Methodological), 20(2):215–232, 1958.

Cui,Y., Song,Y., Sun,C., Howard,A., and Belongie,S. Large scale fine-grained categorization and domain-specific transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

Dirzo,R., Young,H. S., Galetti,M., Ceballos,G., Isaac,N. J., and Collen,B. Defaunation in the anthropocene. science, 345(6195):401–406, 2014.

Donahue,J., Jia,Y., Vinyals,O., Hoffman,J., Zhang,N., Tzeng,E., and Darrell,T. DeCAF: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning, pages 647–655. jmlr.org, January 2014.

Ebach,M. C., Valdecasas,A. G., and Wheeler,Q. D. Impediments to taxonomy and users of taxonomy: accessibility and impact evaluation. Cladistics, 27(5):550–557, 2011.

Ehrlich,P. R. The scale of human enterprise and biodiversity loss. Extinction rates, 1995.

Everingham,M., Eslami,S. M. A., Van-Gool,L., Williams,C. K. I., Winn,J., and Zisserman,A. The Pascal Visual Object Classes Challenge: A Retrospective. International Journal of Computer Vision, 111(1):98–136, 2015.

Feng,L., Bhanu,B., and Heraty,J. A software system for automated identification and retrieval of moth images based on wing attributes. Pattern Recognit., 51:225–241, March 2016.

Francoy,T. M., Wittmann,D., Drauschke,M., Müller,S., Steinhage,V., Bezerra-Laure,M. A. F., De Jong,D., and Gonçalves,L. S. Identification of africanized honey bees through wing morphometrics: two fast and efficient procedures. Apidologie, 39(5):488–494, September 2008.

Friedman,J. H. Greedy function approximation: a gradient boosting machine. Annals of statistics, pages 1189–1232, 2001.

Fukushima,K. Neocognitron: a self organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological cybernetics, 36(4):193–202, 1980. URL http://www.ncbi.nlm.nih.gov/pubmed/7370364.

Fukushima,K. Neural network model for a mechanism of pattern recognition unaffected by shift in position-neocognitron. IEICE Technical Report, A, 62(10):658–665, 1979.

Fukushima,K., Miyake,S., and Ito,T. Neocognitron: A neural network model for a mechanism of visual pattern recognition. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13(5):826–834, sep 1983. ISSN 0018-9472. doi: 10.1109/TSMC.1983.6313076. URL http://ieeexplore.ieee.org/document/6313076/.

Gaston,K. J. and O’Neill,M. A. Automated species identification: why not? Philosophical Transactions of the Royal Society B: Biological Sciences, 359(1444):655–667, apr 2004. ISSN 0962-8436. doi: 10.1098/rstb.2003.1442. URL http://rstb.royalsocietypublishing.org/content/359/1444/655.

Gaston,K. J. and May,R. M. Taxonomy of taxonomists. Nature, 356(6367):281–282, mar 1992. ISSN 0028-0836. doi: 10.1038/356281a0. URL http://dx.doi.org/10.1038/356281a0.

Gauld,I. D., O’Neill,M. A., Gaston,K. J., and Others. Driving miss daisy: the performance of an automated insect identification system. Hymenoptera: evolution, biodiversity and biological control, pages 303–312, 2000.

GBIF. Home page. Available from: https://www.gbif.org, 2020. Accessed: November 2020.

Gonçalves,A. B., Souza,J. S., Silva,G. G. d., Cereda,M. P., Pott,A., Naka,M. H., and Pistori,H. Feature extraction and machine learning for the classification of brazilian savannah pollen grains. PLoS One, 11(6):e0157044, June 2016.

Griffin,G., Holub,A., and Perona,P. Caltech-256 Object Category Dataset. Technical report, 2007.

Hansen,O. L., Svenning,J.-C., Olsen,K., Dupont,S., Garner,B. H., Iosifidis,A., Price,B. W., and Høye,T. T. Species-level image classification with convolutional neural network enables insect identification from habitus images. Ecology and evolution, 10(2):737–747, 2020.

Hanson,S. J. A stochastic version of the delta rule. Physica D: Nonlinear Phenomena, 42 (1-3):265–272, 1990.

He,H. and Garcia,E. A. Learning from imbalanced data. IEEE Transactions on knowledge and data engineering, 21(9):1263–1284, 2009.

He,K., Zhang,X., Ren,S., and Sun,J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell., 37(9):1904–1916, September 2015.

He,K., Zhang,X., Ren,S., and Sun,J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

He,K., Fan,H., Wu,Y., Xie,S., and Girshick,R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.

He,X., Zhao,K., and Chu,X. Automl: A survey of the state-of-the-art. Knowledge-Based Systems, 212:106622, 2021. ISSN 0950-7051. doi: https://doi.org/10.1016/j.knosys.2020.106622. URL http://www.sciencedirect.com/science/article/pii/S0950705120307516.

Hinton,G., Deng,L., Yu,D., Dahl,G., Mohamed,A.-r., Jaitly,N., Senior,A., Vanhoucke,V., Nguyen,P., Sainath,T., and Kingsbury,B. Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Processing Magazine, 29(6):82–97, nov 2012. ISSN 1053-5888. doi: 10.1109/MSP.2012.2205597. URL http://ieeexplore.ieee.org/document/6296526/.

Hu,J., Shen,L., and Sun,G. Squeeze-and-excitation networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

Ioffe,S. and Szegedy,C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

Jean,S., Cho,K., and Memisevic,R. On Using Very Large Target Vocabulary for Neural Machine Translation. In ACL-IJCNLP, pages 1–10, 2015.

Joly,A., Goëau,H., Botella,C., Glotin,H., Bonnet,P., Vellinga,W.-P., Planqué,R., and Müller,H. Overview of lifeclef 2018: a large-scale evaluation of species identification and recommendation algorithms in the era of ai. In International Conference of the Cross-Language Evaluation Forum for European Languages, pages 247–266. Springer, 2018.

Karpathy,A. and Fei-Fei,L. Deep Visual-Semantic Alignments for Generating Image Descriptions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3128–3137, 2015.

Kelley,H. J. Gradient theory of optimal flight paths. Ars Journal, 30(10):947–954, 1960.

Kiefer,J., Wolfowitz,J., et al. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, 23(3):462–466, 1952.

Kingma,D. P. and Ba,J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Krizhevsky,A., Sutskever,I., and Hinton,G. E. ImageNet Classification with Deep Convolutional Neural Networks. Advances In Neural Information Processing Systems, pages 1–9, 2012. ISSN 10495258.

Kumar,A., Irsoy,O., Ondruska,P., Iyyer,M., Bradbury,J., Gulrajani,I., Zhong,V., Paulus,R., and Socher,R. Ask Me Anything: Dynamic Memory Networks for Natural Language Processing. jun 2016. URL http://arxiv.org/abs/1506.07285.

Laliberte,A. S. and Ripple,W. J. Range contractions of north american carnivores and ungulates. BioScience, 54(2):123–138, 2004.

LeCun,Y., Boser,B., Denker,J. S., Henderson,D., Howard,R. E., Hubbard,W., and Jackel,L. D. Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation, 1(4):541–551, dec 1989. ISSN 0899-7667. doi: 10.1162/neco.1989.1.4.541. URL http://www.mitpressjournals.org/doi/abs/10.1162/neco.1989.1.4.541.

LeCun,Y., Bengio,Y., and Hinton,G. Deep learning. nature, 521(7553):436–444, 2015.

Li Fei-Fei, Fergus,R., and Perona,P. One-shot learning of object categories. In IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 28, pages 594–611, apr 2006. doi: 10.1109/TPAMI.2006.79. URL http://ieeexplore.ieee.org/document/1597116/.

Lin,M., Chen,Q., and Yan,S. Network in network. December 2013.

Linnainmaa,S. Taylor expansion of the accumulated rounding error. BIT Numerical Mathematics, 16(2):146–160, 1976.

Lytle,D. A., Martínez-Muñoz,G., Zhang,W., Larios,N., Shapiro,L., Paasch,R., Moldenke,A., Mortensen,E. N., Todorovic,S., and Dietterich,T. G. Automated processing and identification of benthic invertebrate samples. J. North Am. Benthol. Soc., 29(3):867–874, June 2010.

Marcus,L. F. Traditional morphometrics. Proceedings of the Michigan Morphometrics Workshop, 2:77–122, 1990.

Martineau,M., Conte,D., Raveaux,R., Arnault,I., Munier,D., and Venturini,G. A survey on image-based insects classification. Pattern Recognit., 2016.

Maxwell,S. L., Fuller,R. A., Brooks,T. M., and Watson,J. E. Biodiversity: The ravages of guns, nets and bulldozers. Nature News, 536(7615):143, 2016.

Mora,C., Tittensor,D. P., Adl,S., Simpson,A. G., and Worm,B. How many species are there on earth and in the ocean? PLoS biology, 9(8):e1001127, 2011.

Nilsback,M.-E. and Zisserman,A. Automated flower classification over a large number of classes. In Indian Conference on Computer Vision, Graphics and Image Processing, Dec 2008.

Norouzzadeh,M. S., Nguyen,A., Kosmala,M., Swanson,A., Palmer,M. S., Packer,C., and Clune,J. Automatically identifying, counting, and describing wild animals in camera-trap images with deep learning. Proceedings of the National Academy of Sciences, 115(25):E5716–E5725, 2018.

Novotny,V., Basset,Y., Miller,S. E., Weiblen,G. D., Bremer,B., Cizek,L., and Drozd,P. Low host specificity of herbivorous insects in a tropical forest. Nature, 416(6883):841, 2002.

O’Neill,M. A. DAISY: a practical tool for semi-automated species identification. Automated taxon identification in systematics: theory, approaches, and applications. CRC Press/Taylor & Francis Group, Boca Raton/Florida, pages 101–114, 2007.

Oquab,M., Bottou,L., Laptev,I., and Sivic,J. Learning and transferring mid-level image representations using convolutional neural networks. Proc. IEEE, 2014.

Ripple,W. J., Estes,J. A., Beschta,R. L., Wilmers,C. C., Ritchie,E. G., Hebblewhite,M., Berger,J., Elmhagen,B., Letnic,M., Nelson,M. P., et al. Status and ecological effects of the world’s largest carnivores. Science, 343(6167), 2014.

Robbins,H. and Monro,S. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.

Rodman,J. E. and Cody,J. H. The taxonomic impediment overcome: NSF’s partnerships for enhancing expertise in taxonomy (PEET) as a model. Systematic biology, 52(3):428–435, 2003.

Rodner,E., Simon,M., Brehm,G., Pietsch,S., Wolfgang Wägele,J., and Denzler,J. Fine-grained recognition datasets for biodiversity analysis. July 2015.

Rohlf,J. F. and Marcus,L. F. A revolution in morphometrics. Trends in Ecology & Evolution, 8(4):129–132, apr 1993. ISSN 0169-5347. doi: 10.1016/0169-5347(93)90024-J.

Rosenblatt, F. The perceptron: a perceiving and recognizing automaton (Project Para). Cornell Aeronautical Laboratory, 1957.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis., 115(3):211–252, December 2015.

Sainath, T. N., Mohamed, A.-r., Kingsbury, B., and Ramabhadran, B. Deep convolutional neural networks for LVCSR. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 8614–8618. IEEE, May 2013. ISBN 978-1-4799-0356-6. doi: 10.1109/ICASSP.2013.6639347. URL http://ieeexplore.ieee.org/document/6639347/.

Schmidhuber,J. Who invented backpropagation? More in [DL2], 2014.

Schmidhuber,J. Deep learning in neural networks: An overview. Neural networks, 61: 85–117, 2015.

Schröder, S., Drescher, W., Steinhage, V., and Kastenholz, B. An automated method for the identification of bee species (Hymenoptera: Apoidea). In Proc. Intern. Symp. on Conserving Europe's Bees. Int. Bee Research Ass. & Linnean Society, London, pages 6–7, 1995.

Sharif Razavian, A., Azizpour, H., Sullivan, J., and Carlsson, S. CNN features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 806–813, 2014.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, September 2014.

Smith,L. N. No more pesky learning rate guessing games. CoRR, abs/1506.01186, 2015. URL http://arxiv.org/abs/1506.01186.

Smith,L. N. A disciplined approach to neural network hyper-parameters: Part 1–learning rate, batch size, momentum, and weight decay. arXiv preprint arXiv:1803.09820, 2018.

Smith, L. N. and Topin, N. Super-convergence: Very fast training of residual networks using large learning rates. CoRR, abs/1708.07120, 2017. URL http://arxiv.org/abs/1708.07120.

Srivastava,N., Hinton,G., Krizhevsky,A., Sutskever,I., and Salakhutdinov,R. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.

Stallkamp, J., Schlipsing, M., Salmen, J., and Igel, C. The German traffic sign recognition benchmark: a multi-class classification competition. In The 2011 International Joint Conference on Neural Networks, pages 1453–1460. IEEE, 2011.

Steinhage,V., Schr¨oder,S., Cremers,A. B., and Lampe,K.-H. Automated extraction and analysis of morphological features for species identification. In Automated taxon identifi- cation in systematics: Theory, approaches and applications, pages 115–129. CRC Press, 2007.

Stork, N. E., McBroom, J., Gely, C., and Hamilton, A. J. New approaches narrow global species estimates for beetles, insects, and terrestrial arthropods. Proceedings of the National Academy of Sciences, 112(24):7519–7523, 2015.

Sun,Y., Liu,Y., Wang,G., and Zhang,H. Deep learning for plant identification in natural environment. Comput. Intell. Neurosci., 2017:7361042, May 2017.

Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.

Swanson, A., Kosmala, M., Lintott, C., Simpson, R., Smith, A., and Packer, C. Snapshot Serengeti, high-frequency annotated camera trap images of 40 mammalian species in an African savanna. Scientific Data, 2(1):1–14, 2015.

Szegedy,C., Vanhoucke,V., Ioffe,S., Shlens,J., and Wojna,Z. Rethinking the inception archi- tecture for computer vision. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

Taigman, Y., Yang, M., Ranzato, M., and Wolf, L. DeepFace: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1701–1708, 2014.

Takeki,A., Trinh,T. T., Yoshihashi,R., Kawakami,R., Iida,M., and Naemura,T. Combining deep features for object detection at various scales: finding small birds in landscape images. IPSJ transactions on computer vision and applications, 8(1):1–7, 2016.

Tofilski, A. DrawWing, a program for numerical description of insect wings. J. Insect Sci., 4:17, May 2004.

Tofilski,A. Automatic measurement of honeybee wings. In Automated Taxon Identification in Systematics: Theory, Approaches and Applications, pages 277–288. CRC Press, 2007.

Tompson, J., Goroshin, R., Jain, A., LeCun, Y., and Bregler, C. Efficient object localization using convolutional networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 648–656, 2015.

Van Horn, G., Mac Aodha, O., Song, Y., Cui, Y., Sun, C., Shepard, A., Adam, H., Perona, P., and Belongie, S. The iNaturalist species classification and detection dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

Vapnik,V. N. The nature of statistical learning theory. Statistics for Engineering and Information Science. Springer-Verlag, New York, 2000.

Wah, C., Branson, S., Welinder, P., Perona, P., and Belongie, S. The Caltech-UCSD Birds-200-2011 dataset. Technical report, California Institute of Technology, 2011.

Walter,D. E. and Winterton,S. Keys and the crisis in taxonomy: extinction or reinvention? Annu. Rev. Entomol., 52:193–208, 2007.

Watson, A. T., O'Neill, M. A., and Kitching, I. J. Automated identification of live moths (Macrolepidoptera) using digital automated identification system (DAISY). Syst. Biodivers., 1(3):287–300, 2003.

Weeks,P. J. D., Gauld,I. D., Gaston,K. J., and O’Neill,M. A. Automating the identification of insects: a new solution to an old problem. Bull. Entomol. Res., 87(02):203–211, 1997.

Weeks,P. J. D., O’Neill,M. A., Gaston,K. J., and Gauld,I. D. Species–identification of wasps using principal component associative memories. Image Vis. Comput., 17(12): 861–866, October 1999a.

Weeks,P. J. D., O’Neill,M. A., Gaston,K. J., and Gauld,I. D. Automating insect identifi- cation: exploring the limitations of a prototype system. J. Appl. Entomol., 123(1):1–8, January 1999b.

Wei, X.-S., Luo, J.-H., Wu, J., and Zhou, Z.-H. Selective convolutional descriptor aggregation for fine-grained image retrieval. IEEE Trans. Image Process., 26(6):2868–2881, June 2017.

Werbos,P. J. Applications of advances in nonlinear sensitivity analysis. In System modeling and optimization, pages 762–770. Springer, 1982.

Wilf,P., Zhang,S., Chikkerur,S., Little,S. A., Wing,S. L., and Serre,T. Computer vision cracks the leaf code. Proc. Natl. Acad. Sci. U. S. A., 113(12):3305–3310, March 2016.

Willi,M., Pitman,R. T., Cardoso,A. W., Locke,C., Swanson,A., Boyer,A., Veldthuis,M., and Fortson,L. Identifying species in camera trap images using deep learning and citizen science. Methods in Ecology and Evolution, 10(1):80–91, 2019.

Wu, X., Zhan, C., Lai, Y., Cheng, M., and Yang, J. IP102: A large-scale benchmark dataset for insect pest recognition. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8779–8788, 2019. doi: 10.1109/CVPR.2019.00899.

Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1492–1500, 2017.

Yosinski,J., Clune,J., Bengio,Y., and Lipson,H. How transferable are features in deep neural networks? In Advances in neural information processing systems, pages 3320– 3328, 2014.

Zeiler,M. D. and Fergus,R. Visualizing and Understanding Convolutional Networks. Computer Vision – ECCV 2014, 8689, 2014. doi: 10.1007/978-3-319-10590-1. URL http://link.springer.com/10.1007/978-3-319-10590-1.

Zhang,H., Cisse,M., Dauphin,Y. N., and Lopez-Paz,D. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.

Zhang, J. and Zong, C. Deep neural networks in machine translation: An overview. IEEE Intelligent Systems, 30(5):16–25, September 2015. doi: 10.1109/MIS.2015.69. URL http://ieeexplore.ieee.org/document/7243232/.

Zhang,Z.-Q. Animal biodiversity: An outline of higher-level classification and survey of taxonomic richness, volume 3148. Magnolia Press, 2011.

Zheng,L., Zhao,Y., Wang,S., Wang,J., and Tian,Q. Good practice in CNN feature transfer. April 2016.

Zhou,B., Khosla,A., Lapedriza,A., Oliva,A., and Torralba,A. Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2921–2929, 2016.

I

Syst. Biol. 68(6):876–895, 2019
© The Author(s) 2019. Published by Oxford University Press on behalf of the Society of Systematic Biologists. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
DOI: 10.1093/sysbio/syz014
Advance Access publication March 2, 2019

Automated Taxonomic Identification of Insects with Expert-Level Accuracy Using Effective Feature Transfer from Convolutional Networks

Miroslav Valan 1,2,3,*, Karoly Makonyi 1,4, Atsuto Maki 5, Dominik Vondráček 6,7, and Fredrik Ronquist 2

1 Savantic AB, Rosenlundsgatan 52, 118 63 Stockholm, Sweden; 2 Department of Bioinformatics and Genetics, Swedish Museum of Natural History, Frescativägen 40, 114 18 Stockholm, Sweden; 3 Department of Zoology, Stockholm University, Universitetsvägen 10, 114 18 Stockholm, Sweden; 4 Disciplinary Domain of Science and Technology, Physics, Department of Physics and Astronomy, Nuclear Physics, Uppsala University, 751 20 Uppsala, Sweden; 5 School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology, Stockholm, SE-10044, Sweden; 6 Department of Zoology, Faculty of Science, Charles University in Prague, Viničná 7, CZ-128 43 Praha 2, Czech Republic; 7 Department of Entomology, National Museum, Cirkusová 1740, CZ-193 00 Praha 9 - Horní Počernice, Czech Republic

* Correspondence to be sent to: Department of Bioinformatics and Genetics, Swedish Museum of Natural History, Frescativägen 40, 114 18 Stockholm, Sweden; E-mail: [email protected].

Received 21 February 2018; reviews returned 13 February 2019; accepted 20 February 2019
Associate Editor: Thomas Buckley

Abstract.—Rapid and reliable identification of insects is important in many contexts, from the detection of disease vectors and invasive species to the sorting of material from biodiversity inventories. Because of the shortage of adequate expertise, there has long been an interest in developing automated systems for this task. Previous attempts have been based on laborious and complex handcrafted extraction of image features, but in recent years it has been shown that sophisticated convolutional neural networks (CNNs) can learn to extract relevant features automatically, without human intervention. Unfortunately, reaching expert-level accuracy in CNN identifications requires substantial computational power and huge training data sets, which are often not available for taxonomic tasks. This can be addressed using feature transfer: a CNN that has been pretrained on a generic image classification task is exposed to the taxonomic images of interest, and information about its perception of those images is used in training a simpler, dedicated identification system. Here, we develop an effective method of CNN feature transfer, which achieves expert-level accuracy in taxonomic identification of insects with training sets of 100 images or less per category, depending on the nature of data set. Specifically, we extract rich representations of intermediate to high-level image features from the CNN architecture VGG16 pretrained on the ImageNet data set. This information is submitted to a linear support vector machine classifier, which is trained on the target problem. We tested the performance of our approach on two types of challenging taxonomic tasks: 1) identifying insects to higher groups when they are likely to belong to subgroups that have not been seen previously and 2) identifying visually similar species that are difficult to separate even for experts. For the first task, our approach reached >92% accuracy on one data set (884 face images of 11 families of Diptera, all specimens representing unique species), and >96% accuracy on another (2936 dorsal habitus images of 14 families of Coleoptera, over 90% of specimens belonging to unique species). For the second task, our approach outperformed a leading taxonomic expert on one data set (339 images of three species of the Coleoptera genus Oxythyrea; 97% accuracy), and both humans and traditional automated identification systems on another data set (3845 images of nine species of Plecoptera larvae; 98.6 % accuracy). Reanalyzing several biological image identification tasks studied in the recent literature, we show that our approach is broadly applicable and provides significant improvements over previous methods, whether based on dedicated CNNs, CNN feature transfer, or more traditional techniques. Thus, our method, which is easy to apply, can be highly successful in developing automated taxonomic identification systems even when training data sets are small and computational budgets limited. We conclude by briefly discussing some promising CNN- based research directions in morphological systematics opened up by the success of these techniques in providing accurate diagnostic tools.

Rapid and reliable identification of insects, either to as orders, but already at the family level the task species or to higher taxonomic groups, is important becomes quite challenging, even for experts, unless in many contexts. Insects form a large portion of the we restrict the problem to a particular life stage, biological diversity of our planet, and progress in the geographic region, or insect order. Generally speaking, understanding of the composition and functioning of the the lower the taxonomic level, the more challenging the planet’s ecosystems is partly dependent on our ability to identification task becomes (Fig. 1). At the species level, effectively find and identify the insects that inhabit them. reliable identification may require years of training and There is also a need for easy and accurate identification specialization on one particular insect taxon. Such expert of insects in addressing concerns related to human food taxonomists are often in short demand, especially for and health. Such applications include the detection of groups that are not showy and attractive, and their time insects that are pests of crops (FAO 2015), disease vectors could be better spent than on routine identifications. (WTO 2014), or invasive species (GISD 2017). For these reasons, there has long been an interest in Identifying insects is hard because of their immense developing automated image-based systems for insect species diversity [more than 1.02 million species identification (Schröder et al. 1995; Weeks et al. 1997, described to date (Zhang 2011)] and the significant 1999a, 1999b; Gauld et al. 2000; Arbuckle et al. 2001; variation within species due to sex, color morph, Watson et al. 2003; Tofilski 2004, 2007; ONeill 2007; life stage, etc. With some training, one can learn Steinhage et al. 2007, Francoy et al. 2008; Yang et al. 2015; how to distinguish higher taxonomic groups, such Feng et al. 2016; Martineau et al. 2017). Common to all


huge increase in the volume of data used for training have generated spectacular advances in recent years. easy class These developments, in turn, would not have been possible without the extra computational power brought by modern graphical processing units (GPUs). In contrast to traditional approaches of machine learning, requiring handcrafted feature extraction, DL and CNNs enable end-to-end learning from a set order of training data. In end-to-end learning, the input consists of labeled raw data, such as images, nothing else. The images may even represent different views, body parts, or life stages—the CNN automatically finds the relevant set of features for the task at hand. CNNs have been particularly successful in image family classification tasks, where large labeled training sets are available for supervised learning. The first super- human performance of GPU-powered CNNs (Cire¸san et al. 2011) was reported in 2011 in a traffic sign competition (Stallkamp et al. 2011). The breakthrough came in 2012, when a CNN architecture called AlexNet genus (Krizhevsky et al. 2012) outcompeted all other systems in the ImageNet Large Scale Visual Recognition Challenge (Russakovsky et al. 2015), at the time involving 1.3 million images divided into 1000 categories, such as “lion,” “cup,” “car wheel,” and different breeds of cats and dogs. Since then, CNN performance has improved hard species significantly thanks to the development of deeper, more complex neural network architectures, and the use of larger data sets for training. Open-source licensing of DL development frameworks has triggered further FIGURE 1. A schematic illustration of the taxonomy of insects. The full methodological advances by attracting a vast developer tree is organized into hierarchical ranks; it contains approximately 1.02 community. million known species and several millions that remain to be described. Training a complex CNN from scratch to performance Classifying a specimen to a group of higher rank, such as order, is usually relatively easy with a modest amount of training. The challenge levels that are on par with humans requires a huge set and amount of required expertise increases considerably (transition of labeled images and consumes a significant amount from green to red) as the taxonomic rank is lowered. of computational resources, which means that it is not realistic currently to train a dedicated CNN for most such systems designed to date is that they depend on image classification tasks. However, in the last few years, handcrafted feature extraction. “Handcrafted” or “hand- it has been discovered that one can take advantage engineered” are standard terms in machine learning and of a CNN that has been trained on a generic image computer vision referring to the application of some classification task in solving a more specialized problem process, like an algorithm or a manual procedure, to using a technique called transfer learning (Caruana 1995; extract relevant features for identification from the raw Bengio 2011; Yosinski et al. 2014; Azizpour et al. 2016). data (images in our case). Examples of features that This reduces the computational burden and also makes have been used for taxonomic identification include the it possible to benefit from the power of a sophisticated wing venation pattern, the relative position of wing vein CNN even when the training set for the task at hand is junctions, and the outline of the wing or of the whole moderate to small. body. 
Although many of these systems achieve good Two variants of transfer learning have been tried. identification performance, the need for special feature In the first, fine-tuning, the pretrained CNN is slightly extraction tailored to each task has limited their use in modified by fine-tuning model parameters such that the practice. CNN can solve the specialized task. Fine-tuning tends In recent years, deep learning (DL) and convolutional to work well when the specialized task is similar to the neural networks (CNNs) have emerged as the most original task (Yosinski et al. 2014), but it may require a effective approaches to a range of problems in automated fair amount of training data and computational power. classification (LeCun et al. 2015; Schmidhuber 2015), It is also susceptible to overfitting on the specialized task and computer vision is one of the fields where these when the data sets are small because it may incorrectly techniques have had a transformative impact. The basic associate a rare category with an irrelevant feature, such ideas have been around for a long time (Fukushima 1979, as a special type of background, which just happens 1980; Fukushima et al. 1983) but a significant increase in to be present in the few images of that category in the the complexity and size of the neural networks and a training set. 878 SYSTEMATIC BIOLOGY VOL. 68
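To make the fine-tuning variant concrete, here is a minimal Keras sketch, assuming tensorflow.keras; the class count, head layers, and learning rates are illustrative placeholders, and this study itself relies on the feature-transfer variant described next rather than on fine-tuning.

```python
# Minimal fine-tuning sketch (illustrative only; this study uses feature
# transfer, not fine-tuning). Assumes tensorflow.keras is available.
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

NUM_CLASSES = 11  # placeholder, e.g., 11 fly families

# Pretrained convolutional base without the ImageNet-specific top layers.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # first train only the new classification head

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=..., validation_data=...)

# Once the head has converged, unfreeze only the last convolutional block and
# continue training with a much smaller learning rate ("fine-tuning" proper).
base.trainable = True
for layer in base.layers:
    layer.trainable = layer.name.startswith("block5")
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```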

TABLE 1. Comparison of the performance of some automated realistic-size image sets and computational budgets image identification systems prior to CNNs and some recent state- available to systematists. The article represents one of-the-art CNN-based methods on two popular fine-grained data sets (i.e., data sets with categories that are similar to each other), of the first applications of CNN feature transfer to Bird-200-2011 (Wah et al. 2011), and Flower-102 (Nilsback and challenging and realistic taxonomic tasks, where a high Zisserman 2008) level of identification accuracy is expected. In contrast Methods Bird Flower References to previous studies, all independent identifications used here for training and validation have been provided by Pre-CNN methods Color+SIFT 26.7 81.3 (Khan et al., 2013) taxonomic experts with access to the imaged specimens. GMaxPooling 33.3 84.6 (Murray and Perronnin, 2014) Thus, the experts have been able to examine characters that are critical for identification but that are not visible CNN-based techniques CNNaug-SVM 61.8 86.8 (Razavian et al., 2014) in the images, such as details of the ventral side of MsML 67.9 89.5 (Qian et al., 2014) specimens imaged from above. The experts have also Fusion CNN 76.4 95.6 (Zheng et al., 2016) had access to collection data, which often facilitates Bilinear CNN 84.1 — (Lin et al., 2015) identification. Refined CNN 86.4 — (Zhang et al., 2017) We examined two types of challenging taxonomic Note: All CNN-based methods used pretrained VGG16 and transfer tasks: 1) identification to higher groups when many learning (Simonyan and Zisserman 2014). Numbers indicate the specimens are likely to belong to subgroups that have percentage of correctly identified images in the predefined test set, not been seen previously and 2) identification of visually which was not used during training. similar species that are difficult to separate even for experts. For the first task, we assembled two data sets The second variant of transfer learning is known as consisting of diverse images of Diptera faces and the feature transfer, and involves the use of the pretrained dorsal habitus of Coleoptera, respectively. For the second CNN as an automated feature extractor (Donahue et al. task, we used images of three closely related species 2014; Oquab et al. 2014; Razavian et al. 2014; Zeiler and of the Coleoptera genus Oxythyrea, and of nine species Fergus 2014; Azizpour et al. 2016; Zheng et al. 2016). The of Plecoptera larvae (Lytle et al. 2010). Training of pretrained CNN is exposed to the training set for the the automated identification system was based entirely specialized task, and information is then extracted from on the original images; no preprocessing was used the intermediate layers of the CNN, capturing low- to to help the computer identify features significant for high-level image features; see description of the CNN identification. layer architecture below). The feature information is then In all our experiments, we utilized the CNN used to train a simpler machine learning system, such architecture VGG16 with weights pretrained on the as a support vector machine (SVM) (Cortes and Vapnik ImageNet data set (Simonyan and Zisserman 2014)for 1995), on the more specialized task. Feature transfer in feature extraction, and a linear SVM (Cortes and Vapnik combination with SVMs tends to work better than fine- 1995) for classification. 
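The pipeline just outlined, a pretrained VGG16 used purely as a feature extractor followed by a linear SVM, can be sketched as follows. This is a minimal sketch assuming tensorflow.keras and scikit-learn, not the authors' released code; the 416x416 input size and global average pooling anticipate the settings selected later in the article.

```python
# Minimal feature-transfer sketch: pretrained VGG16 as a fixed feature
# extractor, linear SVM as the classifier. Not the authors' released code.
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Global average pooling of the last convolutional block ("bottleneck" features).
extractor = VGG16(weights="imagenet", include_top=False, pooling="avg",
                  input_shape=(416, 416, 3))

def extract_features(images: np.ndarray) -> np.ndarray:
    """images: array of shape (n, 416, 416, 3), RGB, pixel values 0-255."""
    return extractor.predict(preprocess_input(images.astype("float32")), verbose=0)

# Assuming image arrays and labels have been prepared elsewhere:
# train_feats = extract_features(train_images)
# test_feats = extract_features(test_images)
# clf = LinearSVC().fit(train_feats, train_labels)   # default settings
# print("accuracy:", accuracy_score(test_labels, clf.predict(test_feats)))
```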
Our work focused on optimizing tuning when the specialized task is different from the feature extraction techniques to reach high levels of original task. It is computationally more efficient, works identification accuracy. We also analyzed the errors for smaller image sets, and SVMs are less susceptible made by the automated identification system in order to overfitting when working with imbalanced data sets, to understand the limitations of our approach. Finally, that is, data sets where some categories are represented to validate the generality of our findings, we tested by very few examples (He and Garcia 2009). our optimized system on several other biological image Sophisticated CNNs and transfer learning have been classification tasks studied in the recent literature on used successfully in recent years to improve the automated identification. classification of some biological image data sets, such as “Caltech-UCSD Birds-200-2011” (Birds-200-2011) (Wah et al. 2011) (200 species, 40–60 images per species) and “102 Category Flower Data set” (Flowers-102) (Nilsback MATERIALS AND METHODS and Zisserman 2008) (102 flower species commonly occurring in the UK, 40–258 images per species) (Table 1). Data Sets and Baselines Similar but larger data sets contributed by citizen Data sets. We used four image data sets scientists are explored in several ongoing projects, such (D1–D4) for this study (Table 2, Fig. 2, as Merlin Bird ID (Van Horn et al. 2015), Pl@ntNet (Joly Supplementary Material available on Dryad at et al. 2014) and iNaturalist (web application available http://dx.doi.org/10.5061/dryad.20ch6p5. The first at http://www.inaturalist.org). These data sets involve two data sets (D1–D2) consist of taxonomically diverse outdoor images of species that are usually easy to images that were used in testing whether the automated separate for humans, at least with some training, and the system could learn to correctly circumscribe diverse automated identification systems do not quite compete higher groups based on exemplar images, and then in accuracy with human experts yet. use that knowledge in classifying images of previously The main purpose of the current article is to explore unseen subtaxa of these groups. Data set D1 contains 884 the extent to which CNN feature transfer can be images of Diptera faces, representing 884 unique species, used in developing accurate diagnostic tools given 231 genera, and 11 families. Data set D2 has 2936 images 2019 VALAN ET AL.—AUTOMATED TAXONOMIC IDENTIFICATION OF INSECTS 879

TABLE 2. Data sets used in this study

Data set  Taxon       Level    No. of taxa  Images per taxon (range)  Image size (height×width)  Part of insect      Source
D1        Flies       family   11           24–159                    946×900                    Face—frontal view   www.idigbio.org
D2        Beetles     family   14           18–900                    1811×1187                  Body—dorsal view    www.idigbio.org
D3        Beetles     species  3            40–205                    633×404                    Body—dorsal view    this study
D4        Stoneflies  species  9            107–505                   960×1280                   Body—variable view  Lytle et al. 2010

FIGURE 2. Sample images of each of the four data sets. For each data set, we show two or three different image categories. The example images illustrate some of the taxon diversity and variation in image capture settings within each category. of the dorsal habitus of Coleoptera, all belonging to to this study; it consists of 339 images of the dorsal unique specimens that together represent >1932 species, habitus of three closely related species (Sabatinelli 1981) >593 genera, and 14 families (352 images identified only of the genus Oxythyrea: O. funesta, O. pantherina, and to genus and 339 only to tribe or family). Both D1 and O. subcalva (Coleoptera: Scarabaeidae). The depicted D2 were used to test family-level identification. Subsets specimens are from four different countries of the south- of D2 were also used to test discriminative power at west Mediterranean region where these taxa occur. the tribal level (data set D2A consisting of 21 tribes of Only O. funesta has a wider distribution, extending Curculionidae with 14 or more images each (average into Europe and Asia as well. One of the authors 32); in total 675 images belonging to 109 unique genera) of the current article (DV) worked extensively on the and at the genus level (data set D2B consisting of images taxonomy of this genus, based both on morphological of two genera, Diplotaxis [100 images] and Phyllophaga and genetic data (Vondráˇcek2011; Vondráˇcek et al. 2017). [121 images], both from the tribe Melolonthini of the Current knowledge does not allow specimens of the family Scarabaeidae; in total, D2B contains images three species studied here from North African localities of 132 unique species; in two-thirds of the cases the to be identified with 100 % accuracy based only on images represent a single example of the species or the dorsal view of adults. Additional information from one of two examples). D1 and D2 were assembled from the ventral side of the specimen is also needed; the images downloaded from iDigBio (http://idigbio.org) shape of the genitalia and the exact collection locality as described in the Supplementary Material available are also helpful in identifying problematic specimens on Dryad. The identifications provided by iDigBio were (Sabatinelli 1981; Mikši´c 1982; Baraud 1992). For D3, assumed to be correct (but see Results for D2). the identifications provided by DV after examination The last two data sets (D3–D4) were used to test of the imaged specimens, and the locality data were the success of the automated system in discriminating assumed to be correct. Morphological species concepts between visually similar species that are difficult to were independently validated by genetic sequencing of separate even for human experts. Data set D3 is original other specimens. 880 SYSTEMATIC BIOLOGY VOL. 68

Data set D4 was taken from a previous study of and determination labels, scales, measurements, glue automated identification systems (Lytle et al. 2010) and cards, etc.) or on the specimens (pins, pin verdigris, consists of 3845 images of nine species of Plecoptera dust etc.). larvae representing nine different genera and seven Data preprocessing. All three color channels (red, families. The identification system developed in the green, and blue) of the images were used. Before original study (Lytle et al. 2010) was based on various feeding images into CNNs, they are commonly resized handcrafted techniques to find image features of to squares with VGG16 having default input size of interest, such as corners, edges, and outlines. These 224×224 pixels. Since D1, D2 and D3 images represent features were then described with scale-invariant feature standard views but vary in image aspect ratio (height transform (SIFT) descriptors (Lowe 2004), which were vs. width), resizing them to squares could lead to subsequently submitted to a random forest classifier uneven distortion of objects from different categories, (Breiman 2001). potentially resulting in information loss, or introducing In contrast to the other data sets, the specimens of D4 bias that could interfere with training. Therefore, we are not portrayed in a standardized way. The position performed the main experiments using images where and orientation of the specimens vary randomly, and the aspect ratio (height vs. width) was preserved. The specimens may be damaged (missing legs or antennae). best performing model was subsequently tested also on All larval stages are included, and the specimens vary images that were resized to squares, even if it affected the dramatically in size. On one extreme, the specimen may aspect ratio. be so small that it only occupies a small part of the image; To preserve aspect ratio, we first computed the average on the other extreme, only part of the specimen may be image height and width for each data set. Then we visible (Lytle et al. 2010). The variation in D4 makes more resized images such that one dimension of the image information about each category available, which should would fit the average size, and the other would be less make the identification engine more reliable and robust. than or equal to the average. If the latter was less than On the other hand, different object sizes and viewing the average, we would center the image and then add angles may require the system to learn a wider range of random pixel values across all channels around the distinguishing features, making the identification task original image to fill the square frame. To preserve a more challenging. It is not obvious which of these effects minimalistic approach, no segmentation (cropping of the will be stronger. image) was performed, and the background was kept on Even though all species in D4 belong to different all images. genera, two of the species ( californica and Training and validation sets. We divided each data Doroneuria baumanni) are very similar and quite set into ten subsets using stratified random sampling, challenging to separate. The original study reported the ensuring that the proportion of images belonging to each identification accuracy achieved by humans, including category was the same in all subsets. 
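The aspect-preserving preprocessing described above can be sketched as follows, using Pillow and NumPy. The centring and the random-pixel fill follow the text; the single fixed target size is a simplification, since the article uses the per-data-set average height and width as the reference dimension.

```python
# Resize while preserving aspect ratio, centre the image on a square canvas,
# and fill the border with random pixel values across all channels.
import numpy as np
from PIL import Image

def resize_preserving_aspect(img: Image.Image, target: int = 416) -> np.ndarray:
    img = img.convert("RGB")
    w, h = img.size
    scale = target / max(w, h)                       # fit the larger side
    new_w, new_h = int(round(w * scale)), int(round(h * scale))
    resized = np.asarray(img.resize((new_w, new_h), Image.BILINEAR))

    canvas = np.random.randint(0, 256, size=(target, target, 3), dtype=np.uint8)
    top, left = (target - new_h) // 2, (target - new_w) // 2
    canvas[top:top + new_h, left:left + new_w] = resized
    return canvas
```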
We then applied some entomologists, in separating sample images of our procedure to the images belonging to nine of the these two species after having been trained on separating subsets and used the last subset as the test image set. The them using 50 images of each species. We compared reported accuracy values represent averages across ten the human performance to the performance of our repetitions of this procedure, each one using a different method in separating the same two species in the entire subset as the test set. data set, which contains 490 images of Calineuria and The images of D1–D3 all represent unique individuals, 532 of Doroneuria). The true identifications provided but D4 contains multiple images (4–5) of each specimen, in the original study were based on examination of varying in zooming level, pose, and orientation of the the actual specimens by two independent taxonomic specimen (Lytle et al. 2010). The main results we report experts, which had to agree on the determination; these here do not take this into account; thus, the test set was identifications are assumed here to be correct. likely to contain images of specimens that had already Image characteristics. The images in D1-D3 represent been seen during training. To investigate whether this standardized views, in which specimens are oriented affected the overall performance of our method, we in a similar way and occupy a large portion of repeated the procedure with a new partitioning of D4 the image area. However, the exact orientation and into subsets in which all images of the same specimen position of appendages vary slightly, and some of the were kept in the same subset, thus ensuring that none of imaged specimens are partly damaged. In D4, both the specimens imaged in the training set had been seen specimen orientation and size vary considerably as previously by the system. described above (Fig. 2). Images from all data sets were acquired in lab settings. Imaging conditions vary significantly in D1 and D2, since these images have Model different origins, but are standardized in D3 and D4. We used VGG16 for feature extraction. VGG16 The backgrounds are generally uniform, particularly is known for its generalization capabilities and is in D3 and D4, but the color is often manipulated in commonly used as a base model for feature extraction D1–D2 to get better contrast. In D2, some images have (Caruana 1995; Bengio 2011; Yosinski et al. 2014). VGG16 been manipulated to remove the background completely. has a simpler structure compared to other current Noise may be present both in the background (collection architectures, such as Microsofts ResNets (He et al. 2016) 2019 VALAN ET AL.—AUTOMATED TAXONOMIC IDENTIFICATION OF INSECTS 881
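The two evaluation protocols just described, stratified ten-fold partitioning by image and the stricter partitioning that keeps all images of a specimen in the same subset, can be sketched with scikit-learn as below. GroupKFold is one way to implement the specimen-level constraint; the article does not name the exact routine, and the arrays are dummy placeholders.

```python
# Stratified ten-fold split by image vs. a split that keeps all images of the
# same specimen together (so no training specimen reappears in the test set).
import numpy as np
from sklearn.model_selection import StratifiedKFold, GroupKFold

n_images = 200
features = np.random.default_rng(0).normal(size=(n_images, 512))  # dummy features
labels = np.arange(n_images) % 9            # dummy labels for nine species
specimen_ids = np.arange(n_images) // 4     # dummy IDs, ~4 images per specimen

# Protocol 1: class proportions preserved in every subset.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(features, labels):
    pass  # train on features[train_idx], evaluate on features[test_idx]

# Protocol 2: images grouped by specimen, one group never split across subsets.
gkf = GroupKFold(n_splits=10)
for train_idx, test_idx in gkf.split(features, labels, groups=specimen_ids):
    pass
```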

FIGURE 3. CNN model architecture. We used the CNN model VGG16 (Simonyan and Zisserman 2014) for feature extraction. The model consists of five convolutional blocks, each block consisting of two or three convolutional layers (green) followed by a MaxPooling layer (red). These blocks are followed by three layers of fully connected neurons (gray), the last of which consists of a vector of length 1000. Each element in this vector corresponds to a unique category in the ImageNet Challenge. The convolutional layers are obtained by applying different filters of size 3×3 to the input from the preceding layer. The MaxPooling layer reduces the dimensions of the output matrix, known as the feature matrix, to half the width and half the height using spatial pooling, taking the maximum of the input values. or Google’s Inception v3 (Szegedy et al. 2016b), Inception bottleneck features, but feature matrices from deeper v4 (Szegedy et al. 2016a), and Xception (Chollet 2016). layers can also be used. It is standard practice to apply However, these more complex models also tend to learn global pooling when extracting information from a more domain-specific features, which makes them more feature matrix (Lin et al. 2013). Global pooling will challenging to use for feature extraction on unrelated reduce a feature matrix of dimension HxWxF—where tasks. H is height, W is width, and F is the number of filters— Specifically, we used VGG16 (Simonyan and to a matrix of dimension 1×1×F, that is, a vector Zisserman 2014) implemented in keras (Chollet of length F. This may result in information loss, but 2015) with a TensorFlow backend (Abadi et al. 2016). some reduction in dimensionality is needed, especially In VGG16, the input images are processed in five when using deeper feature matrices or when working convolutional blocks that we will refer to as c1– c5, with large input images. For instance, an image of size respectively (Fig. 3). A block is made up of two or three 416×416 would produce a feature matrix of dimension convolutional layers, followed by a MaxPooling layer. 52×52×256 in block c3 of VGG16, corresponding to Each convolutional layer applies a number of 3×3 filters almost 700,000 features; this would be very difficult to to the input. The number of different filters used in the process for a classifier. Global pooling would reduce convolutional layers vary across blocks: in the original this feature matrix to a more manageable vector of VGG16 architecture used here, they are 64, 128, 256, 512, just 256 feature elements. To optimize transfer learning and 512 for convolutional block c1-c5, respectively. In for the taxonomic identification tasks we examined, the MaxPooling layer, the matrix is reduced to half the we experimented with different methods of feature width and half the height by spatial pooling, taking the extraction and pooling as described below. maximum value across a 2×2 filter. Thus, an original Impact of image size and pooling strategy.To image of 224×224 pixels (the standard size used by examine the impact of image size and pooling strategy, VGG16) is reduced to an output matrix of height × we focused on bottleneck features. We analyzed the width 7×7 (224/25=7) after having passed through performance across all four data sets using input all five blocks. However, in the process the matrix has images being 128, 224, 320, 416, or 512 pixels wide, gained in depth through the application of filters in each and using two different pooling strategies: global max of the convolutional layers. 
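The block structure and the successive halving of the feature matrices can be checked directly in Keras. The sketch below, assuming tensorflow.keras, prints the output shape of each convolutional block for a 224x224 input and builds a multi-output extractor that returns all five feature matrices at once.

```python
# Inspect the c1-c5 feature matrices of VGG16 for a 224x224 input.
from tensorflow.keras.applications import VGG16
from tensorflow.keras.models import Model

vgg = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
block_outputs = [vgg.get_layer(f"block{i}_pool").output for i in range(1, 6)]
feature_extractor = Model(inputs=vgg.input, outputs=block_outputs)

for i, t in enumerate(block_outputs, start=1):
    print(f"c{i}:", t.shape)
# c1: (None, 112, 112, 64)   c2: (None, 56, 56, 128)   c3: (None, 28, 28, 256)
# c4: (None, 14, 14, 512)    c5: (None, 7, 7, 512)
# feature_extractor.predict(batch) returns one feature matrix per block.
```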
Thus, in block c5 the total pooling (using the maximum value for each filter layer) output dimension is 7×7×512, where 512 is the number and global average pooling (using the average value). of filters in this block. The output of each block is known There are other dimensionality reduction strategies that as its feature matrix, and the feature matrix of c5 is known could have been used, but these two pooling strategies as the bottleneck features. The last convolutional block completely dominate current CNN applications. is succeeded by three fully connected layers, the last Impact of features from different layers, multilayer of which provides the output related to the ImageNet feature fusion and normalization. Using the optimal classification task (Fig. 3). image size and pooling strategy from the previous step, we compared the discriminative potential of features from deeper convolutional blocks (c1, c2, c3, c4) with Feature Extraction that of the commonly used bottleneck features of c5. When using a CNN for feature extraction, it is common We did not attempt to extract features from the final to focus on the fully connected final layers or on the fully connected layers, as these top layers tend to learn 882 SYSTEMATIC BIOLOGY VOL. 68 high-level semantics specific for the task the model is c3+c4+c5. Before training the classifier on the data, we trained on. In the ImageNet challenge, such semantics applied signed square root normalization. might include the presence or absence of typical cat or dog features, for example. Therefore, one might expect that features extracted from deeper layers of Classifier and Evaluation of Identification performance the CNN would give better results when the model is Classifier. As a classifier for the extracted features, we used as a feature extractor for unrelated classification used an SVM method (Cortes and Vapnik 1995), which is tasks (Zheng et al. 2016). This should be particularly a common choice for these applications (Donahue et al. true for tasks where one attempts to separate visually 2014; Oquab et al. 2014; Razavian et al. 2014; Zeiler and similar categories, so-called fine-grained tasks, such as Fergus 2014; Azizpour et al. 2016; Zheng et al. 2016). the taxonomic tasks we focus on in this article. SVM performance is at least on par with other methods We also investigated the effect of feature fusion, that have been studied so far (e.g., Azizpour et al. 2016). introduced in Zheng et al. (2016); it involves SVMs are supervised learning models that map inputs concatenating feature vectors from several convolutional to points in high-dimensional space in such a way that blocks. Finally, we examined the effect of signed square the separate categories are divided by gaps that are as root normalization and 12-normalization as introduced wide and clear as possible. Once the SVM classifier is in Arandjelovi´c and Zisserman (2013). Normalization trained, new examples are mapped into the same space procedures are used in this context to reduce the and predicted to belong to a category based on which variance of the elements in the feature vector. Reducing side of the gaps they fall. the influence of extreme values is known to improve the SVMs are memory efficient because they only use performance of SVMs. a subset of training points (support vectors). They Impact of preserving more features. 
Previous generalize well to high dimensional spaces (Vapnik and research has shown that globally pooled features Chapelle 2000) and are suitable for small data sets where generally give better performance than raw features in the number of dimensions is greater than the number of feature extraction (Zheng et al. 2016). However, it is not examples (Vapnik 1999). SVMs are also good at learning known whether intermediate levels of dimensionality from imbalanced data sets (He and Garcia 2009), even reduction might give even better performance than though the performance tends to be lower on categories either of these two extremes. To investigate this, we with few examples. applied intermediate levels of pooling to the feature Specifically, we used the linear support vector matrices from the three best performing convolutional classification algorithm (LinearSVC) implemented in blocks identified in the previous step. the scikit-learn Python package (Pedregosa et al. 2011) Input images of the size selected in the first step, under default settings. LinearSVC uses a one-versus-all 416×416, generate feature matrices of dimensions 52× strategy, which scales well (linearly) with the number 52×256, 26×26×512, and 13×13×512 for c3, c4, and of categories. Other SVM implementations use a one- c5, respectively. If all features from the output matrix versus-one scheme, the time complexity of which of c4, say, were used, a total of 26×26×512 = 346,112 increases with the square of the number of categories. features would be obtained. In contrast, global pooling A disadvantage of LinearSVC is that it does not provide would yield a 1D vector of only 1×1×512 = 512 features. probability estimates. To explore intermediate levels of data compression, we Measuring identification performance. The applied pooling of various degrees to obtain NxNxF identification performance was tested on the four feature matrices, each of which was then reshaped and main data sets and the data subsets described above. All flattened to a 1D feature vector. To standardize the accuracies reported were calculated as the proportion dimensionality reduction procedure, we first generated of predicted labels (identifications) that exactly 26×26×F feature matrices for each of c3, c4, and c5 as match the corresponding true labels. Error rate (or follows. For c3, the output feature matrix of size 52×52× misclassification rate) is the complement of accuracy, 256 was average-pooled to 26×26×256 applying a 2×2 that is, the proportion of predicted labels that do not filter with stride two. The features of c4 already had the match the true labels. correct dimension (26×26×512), while the features of c5 Impact of sample size. To examine the influence of the were extracted before the last MaxPooling layer to retain image sample on identification performance, we studied the input size 26×26×512 of c5. the relation between the identification accuracy for each To each of these 26×26×F feature matrices, we then category and the number and types of images of that added zero-padding to get feature matrices of size 28× category in D1 and D2. We also created random subsets 28×F. Using spatial pooling of various degrees, the of D4 with smaller number of images per species than matrices were then reduced to matrices of size 14×14 ×F, the original data set. 7×7×F,4×4×F, and 2×2×F. Each of these matrices Recently published image identification challenges. 
was reshaped and flattened to a one-dimensional feature To validate the generality of our findings, we explored vector, and its performance compared that of the 1×1×F the performance of our approach on several recently feature vector resulting from global pooling. We also published biological image data sets used to develop investigated the effect of combining intermediate-level automated identification systems for tasks similar to the pooling with feature fusion for c3+c4, c3+c5, c4+c5, and ones studied here. We did not optimize a model for each 2019 VALAN ET AL.—AUTOMATED TAXONOMIC IDENTIFICATION OF INSECTS 883 of these data sets; instead, we used a standard model that significantly affected by image size and pooling strategy performed well across data sets D1–D4 according to the (Fig. 4). The use of global average pooling yielded results from our optimization experiments. notably better results than global max pooling (Fig. 4a). Specifically, we used the following five data sets to This may be related to the fact that the imaged specimens test the general performance of our standard model: 1) in our data sets typically occupy a large fraction of ClearedLeaf, consisting of 7597 images of cleared leaves the image, such that features across the entire image from 6949 species and 2001 genera (Wilf et al. 2016); 2) contribute to identification. The difference between CRLeaves, containing 7262 leaf scans of 255 species of average pooling and max pooling was slightly smaller in plants from Costa Rica (Mata-Montero and Carranza- D4, which might be explained by this data set containing Rojas 2015); 3) EcuadorMoths, containing 2120 images a substantial number of images of small specimens of 675 moth species, all genetically verified but only occupying only a tiny part of the picture frame. some properly morphologically studied and described, Input images of size 416×416 performed better than belonging to the family Geometridae (Brehm et al. 2013); other image sizes when global max pooling was used, 4) Flavia, with 1907 images of leaves from 32 plant species with the exception of D3 where 320×320 images were (Wu et al. 2007); and 5) Pollen23, covering the pollen best. When global average pooling was used instead, grains of 23 Brazilian plant species (805 images in total) the optimal image size was still 416×416 for D3 and (Gonçalves et al. 2016). For ClearedLeaf, we used one D4, but the largest image size (512×512) yielded a of the identification challenges in the original study, slight improvement over 416×416 images for D1 and involving families with 100+ images each. D2. Based on these results, we decided to proceed with Scripting and visualization. The python scripts used the remaining optimization steps using input images of for the experiments are provided as Supplementary size 416×416 and global average pooling, unless noted Material available on Dryad. Visualizations were otherwise. generated using the plotly package (Plotly Technologies Impact of features from different layers, multilayer 2015). Image feature space was visualized using t- feature fusion, and normalization. Using c4 features distributed stochastic neighbor embedding (t-SNE) resulted in the best identification results across all data (Maaten and Hinton 2008), a popular technique for sets (Fig. 4b). Features from c4 consistently outperformed exploring high dimensional data in 2D or 3D space the features from c5; sometimes, c5 features were also (Fig. 12). It is an unsupervised method where the outperformed by c3 features. 
In D3, high identification identities of the images are not provided during training. accuracy was already obtained with features from c2. The identities are only used subsequently to assign This might be due to the fact that the classes in different colors to the data points. D3, three sibling species, are differentiated by minute details (quantity, location, and size of the white spots on the elytra and on the pronotum, and the amount RESULTS of setae), which are presumably well represented by feature matrices close to the raw image. Signed square Feature Extraction root normalization consistently improved identification Impact of image size and pooling strategy. Our accuracy (Fig. 4b), while adding 12-normalization results showed that identification performance was deteriorated performance.


FIGURE 4. a) Impact of image size and average versus max global pooling of c5 features on identification accuracy. Data sets are separated by color, and the best result for each data set is indicated with a cross (X). Global average pooling was always better than global max pooling. In general, 416×416 tended to be the best image size. It was the optimal size for two out of four data sets (global average pooling) or three out of four data sets (global max pooling). b) Impact of feature depth, normalization and feature fusion on identification accuracy. Input images of size 416×416 and average global pooling was used in all cases. Data set colors and indication of the settings giving the best identification performance as in (a); note that the performance of feature fusion here is slightly different from that in (a) because of minor differences in the protocol between the experiments. Normalization and feature fusion generally improved identification accuracy, although c4 features tended to outperform features from all other individual layers. 884 SYSTEMATIC BIOLOGY VOL. 68

The fused features generated better identification preserving more features (2×2×F, 4×4×For7×7×F) results than features from individual convolutional from the best performing convolutional block yielded blocks in all data sets except D3, where the c4 features identification results that were competitive with those alone were slightly better than fused features (5.57% from fused, globally pooled features (1×1×F). In the against 6.48% error rate). Feature fusion was particularly case of D3, the nonglobal features from c3, c4, and effective for D1, where the error rate was reduced by c5 all outperformed the best result from the previous one-third over that of c4 features alone (from 11.43% to experiments, cutting the error rate in half (from 5.57% 8%). The effect on D2 was also significant (a reduction to 2.64%). It is also interesting to note that c3 achieved by 14%, from 5.48% to 4.73% error rate). Even for D4, identification accuracies that were comparable to those where identification accuracies were generally very high, of c4 when more fine-grained feature information was feature fusion resulted in a significant reduction in error retained. An explanation for this result could be that rate (from 0.86% to 0.34%). differences between the three sibling species in D3 Using global max pooling instead of global average are due to small details represented by only a few pooling resulted in reduced identification accuracy, pixels each, information that could easily be lost in but otherwise gave similar results with respect to dimensionality reduction. the relative performances of both individual and Impact of aspect ratio. All results reported above fused feature matrices. Combining features from global were obtained with images where the aspect ratio average and global max pooling did not improve the was maintained, even if it involved introduction of results; instead, the identification performance was uninformative pixels to fill out the square input intermediate between that of each method applied frame typically used with VGG16. To inspect how this separately (Supplementary Fig. 1). affects identification performance, we exposed the best Impact of preserving more features. For three out of performing models to images where the input frame four data sets, the best results were not obtained with had been filled completely even if that distorted the fusion of globally pooled features (Fig. 5). For D1–D3, proportions of the imaged objects. The performance was


FIGURE 5. Impact of preserving more features during feature extraction on identification accuracy for data sets D1 (a), D2 (b), D3 (c), and D4 (d). We used input images of size 416×416 in all cases, and we pooled features from each of the last three convolutional blocks (c3–c5) into matrices of dimension N×N×F, where F is the number of filters of the corresponding convolutional block and N was set to 1 (global average pooling), 2, 4, 7, 14, or 28. The dashed black line indicates the best performing model from previous experiments (Fig. 4b). For D2 and D4 we stopped at N = 14 because of the computational cost involved in computing the value for N = 28, and since the accuracy had already dropped significantly at N = 14. Because of the computational cost, we computed the performance of fused feature matrices only up to N = 7 (D1 and D3) or N = 4 (D2 and D4). On data sets D1, D2, and D3, the optimum identification performance was somewhere between globally pooled and nonpooled features. For D4, global pooling yielded the best identification performance.
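The intermediate ("non-global") pooling compared in Figure 5 can be sketched in plain NumPy as below. The zero-padding to 28×28 and the grid sizes follow the Materials and Methods; average pooling and the function name are assumptions for illustration.

```python
# Pool one HxWxF feature matrix to an NxNxF grid (N in {1, 2, 4, 7, 14, 28})
# after zero-padding to 28x28, then flatten it into a 1-D feature vector.
import numpy as np

def pool_to_grid(feature_matrix: np.ndarray, n: int) -> np.ndarray:
    h, w, f = feature_matrix.shape            # e.g. 26 x 26 x 512 for c4/c5
    size = 28
    padded = np.zeros((size, size, f), dtype=feature_matrix.dtype)
    top, left = (size - h) // 2, (size - w) // 2
    padded[top:top + h, left:left + w] = feature_matrix

    block = size // n                         # non-overlapping pooling windows
    pooled = padded.reshape(n, block, n, block, f).mean(axis=(1, 3))
    return pooled.reshape(-1)                 # vector of length n * n * f
```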

similar to that in the original experiments, with only slight changes in accuracy (+0.38%, +0.27%, +0.77%, and +0.03% for D1–D4, respectively).

FIGURE 6. Reduction in error rate (%) throughout our optimization steps for data sets D1–D4, using input images of size 416×416 (steps compared: max pooling, avg pooling, best conv layer, normalized, fused c1–c5 features, non-global pooling). The minimum error rates (and top identification accuracies) for D1–D3 were achieved after applying nonglobal feature pooling in the last step (red bar). Note the drastic improvement that came with nonglobal feature pooling in D3, the most fine-grained data set studied here. The best result for D4 was achieved already in the penultimate step, by fusing globally pooled features from c1 to c5.

Taxonomic Identification

Task 1—Identifying Insects to Higher Groups

Data set D1—Frontal view of the head of Diptera specimens. The best model achieved a 7.35% error rate when discriminating between the 11 families in D1 (Fig. 9). As expected, families represented by few specimens were more difficult for the model to identify correctly than those represented by many exemplars (Fig. 7). Interestingly, the Pipunculidae were the most easily identified family, even though they were only represented by 30 images. The vernacular name of this family is big-headed flies, referring to the big eyes that cover nearly the entire head. This highly distinctive feature makes it particularly easy to identify members of this family from head images. Among the misclassified images, the second best guess of the model was correct in approximately two out of three cases (Fig. 9), indicating that the model was usually not completely off even in these cases. Almost one-third (29%) of D1 images are from genera represented by only one, two, or three examples. The model was able to learn from at most two examples from such a genus, and often none. As one might expect, our findings showed that the families with a small average number of exemplars per genus were also the ones that were most difficult for the model to learn to identify correctly (see Fig. 8).

Data set D2—Families of Coleoptera. The best model had an error rate of 3.95% (116/2936) when discriminating between the 14 beetle families in D2 (Fig. 9). As for D1, the number of examples per category affected the result significantly (Fig. 7), with more examples giving higher identification accuracy. A notable outlier is the family Tenebrionidae, which had an unexpectedly low identification accuracy (85%) given the sample size (158 images). Tenebrionidae is a cosmopolitan and highly diverse family with over 20,000 species. Tenebrionids vary significantly in shape, color, and morphological features, and the family is well known to contain species that superficially look similar to typical representatives of other families of Coleoptera. Another case that stands out is the family Chrysomelidae (70% accuracy) (Fig. 7). The poor identification performance for this family was likely related to the small number of examples (44 images), and the fact that these examples covered a significant amount of the morphological diversity in the family (37,000 species described to date). The Chrysomelidae examples for which we had at least a genus name (40 of 44 images) are scattered over 25 genera, 11 tribes, and 7 subfamilies. Interestingly, one of the subfamilies is better represented than the others, the Cassidinae (14 examples from 10 genera and 3 tribes); only one of these images was misclassified by the best model.


FIGURE 7. Effect of sample size on identification accuracy. a) Scatter plot of identification accuracies against sample sizes for all families from D1 and D2. Note that accuracies varied more across families at small sample sizes. Accuracies were also affected by how well the sample covered the variation in each family; such effects appear to explain many of the outliers (arrows; see text). b) The effect of the number of images per species on the performance of the best model on D4. Each point represents the average across ten random subsamples of the original data set.

FIGURE 8. Effect of sample nature on identification accuracy. Scatter plot of identification accuracies against the number of example images per genus for all families from D1 and D2. The more examples there were on average of each genus in the family, the more likely it was that the model had seen something similar to the image being classified, lowering the misclassification rate.

As for D2, the most challenging images were those belonging to genera likely to have few or no examples in the training set (Fig. 8). For the majority of the misclassified images, the second best guess of the model was correct (Fig. 9).

After we had completed the experiments, we discovered that seven of the D2 images had been erroneously classified by the experts submitting them to iDigBio. This means that the training set we used was not clean but contained 0.2% noise on average. Apparently, this did not seem to be a significant obstacle in developing a model with high identification performance. When the erroneously labeled images were in the test set, the best model identified four of them to the right family. Of the remaining three specimens, one does not belong to any of the 14 families studied here, and another one is a poorly preserved specimen for which we were not able to establish the correct family identification ourselves with any degree of confidence.

Data set D2A—Tribes of Curculionidae. Applying the best model for D2 on this task, we reached an overall identification accuracy of >79%. The tribes with fewer examples were generally more challenging to identify correctly. The accuracy dropped to 42% when no example of the genus was included in the training set, and increased to more than 90% for genera with >15 examples.

D2B—Genera of Melolonthini. The best model for D2 was able to separate the two chosen genera of Melolonthini with an error rate of 2.7%. The misclassified images were predominantly from species that had not been seen by the model (5 of 6 misclassified images were the only exemplar of the species). For species with more than three example images, the error rate dropped to 1.6%, compared to 3.2% for species represented by only one example image. The two genera in D2B are from different subtribes, and experts can easily distinguish them by looking at features of the abdomen and the labrum. However, none of these characters is visible in images of the dorsal habitus, making the task of identifying the genera from the D2B images quite challenging even for experts. We asked one of the leading experts on Melolonthini, Aleš Bezděk, to identify the images in D2B to genus; his error rate was 4.1%.

Task 2—Identifying Visually Similar Species

D3—Closely related species of Oxythyrea. The best model had a misclassification rate of only 2.67%. Oxythyrea funesta was the most challenging species to identify, with five misclassified images (5.2% error rate), followed by O. pantherina with three misclassified images (1.5% error rate) and O. subcalva with one misclassified image (2.5% error rate). Some of the misclassified images are indeed challenging (Fig. 10). For instance, the left two misclassified specimens of O. funesta would probably be identified by most humans as O. pantherina, the species that our model suggested. Similarly, all three misclassified specimens of O. pantherina are more similar to O. funesta, the identification suggested by our model, than to typical O. pantherina. D.V., who studied the genus for 8 years, estimates that he can reliably identify up to 90% of the specimens of these three species based on dorsal habitus alone.

FIGURE 9. Proportion of correct identifications from the top three suggestions of the model for data sets D1 (a), D2 (b), D3 (c), and D4 (d); the first suggestion was correct for 92.3% (D1), 96% (D2), 97.3% (D3), and 99.7% (D4) of the images. When the most probable identification (first suggestion) of the model was erroneous, the second most probable identification was usually correct, indicating that the model was not too far off.
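The "second suggestion" statistics summarized in Figure 9 can be computed from the ranked class scores of the trained classifier. The sketch below shows one way to do this with a scikit-learn linear SVM; X_test, y_test and the fitted clf are placeholders, and the function name is ours.

# Sketch: accuracy of the top-k suggestions from a (multiclass) linear SVM.
# X_test and y_test are placeholder deep-feature vectors and labels.
import numpy as np
from sklearn.svm import LinearSVC

def top_k_accuracy(clf, X_test, y_test, k=2):
    scores = clf.decision_function(X_test)             # shape (n_samples, n_classes)
    ranked = np.argsort(scores, axis=1)[:, ::-1][:, :k]
    hits = [y in clf.classes_[idx] for y, idx in zip(y_test, ranked)]
    return float(np.mean(hits))

# Example use (after clf = LinearSVC().fit(X_train, y_train)):
# top_k_accuracy(clf, X_test, y_test, k=1)   # first suggestion only
# top_k_accuracy(clf, X_test, y_test, k=2)   # first or second suggestion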

FIGURE 10. Examples of misclassified images from D3 (Oxythyrea). Next to the misclassified images of O. funesta (5 images), O. pantherina (3 images), and O. subcalva (1 image) we show typical representatives of the three Oxythyrea species. Note, for instance, that the first two misclassified specimens of O. funesta are quite similar to typical specimens of O. pantherina, the identification suggested by our model.

FIGURE 11. Examples of misclassified images from D4 (Plecoptera larvae). Note that most misclassified specimens are small or photographed from angles making them difficult to identify.

When challenged with the D3 images, D.V. was able to identify 95% of the images correctly, but some of those identifications were aided by the fact that he recognized the imaged specimens and remembered what species he had identified them to when he photographed them.

D4—Species of Plecoptera larvae. The original study reported a 5.5% error rate for this data set using a system based on handcrafted feature extraction (Lowe 2004). The best of our models had an error rate of only 0.34%. On four of the nine species it reached 100% accuracy, including the species with the smallest number of examples and the highest misclassification rate in the original study (10.1%). In the images misclassified by our model (Fig. 11), it is noticeable that the specimen is often turned toward the side, unlike the majority of the images where the specimen is seen from above. Many of the misclassified specimens are also small, occupying a small portion of the picture frame. However, even for the misclassified images, the model mostly (9 of 13 cases) suggested the correct taxonomic identification with the second highest probability (Fig. 9).

Larios et al. (2008) trained 26 people, most of whom had some prior training in entomology, to distinguish Calineuria californica from Doroneuria baumanni (Fig. 2). Of the nine species, these are the ones that are most difficult to separate from each other. Participants were first trained on 50 random samples from each species, and then tested on another set of 50 images. The study reported an error rate of 22%. These two species were also the ones that were most often confused by our model: all misclassified images of C. californica were classified as D. baumanni, and vice versa. However, our model misclassified only four images of C. californica (0.82% error rate) and four images of D. baumanni (0.75% error rate), even though it was faced with the harder task of classifying the images into one of nine species instead of just separating these two species. The original study reported misclassification rates of 6.7% and 5.1% for C. californica and D. baumanni, respectively.

D4 contains multiple images (4–5) of the same specimen, varying in zoom, pose, and orientation (Lytle et al. 2010). In the original experiment, we did not keep these images together. Thus, the test set would typically contain images of specimens that had already been seen during training, albeit in a different magnification, and seen from a different angle and in a different pose. When all images of the same specimen were kept in the same fold, such that test images never included specimens that had been seen before, the accuracy of the best model remained high (98.62%, corresponding to a 1.38% error rate). Under these more stringent conditions, the majority of misclassified images belonged to specimens that differ markedly from typical representatives of the species. For instance, one specimen of Isogenoides is much bigger and darker (probably representing a later larval stage) than the other specimens; three of the five images of this specimen were misclassified. One specimen of Hesperoperla pacifica is pale while others are dark (four of five images misclassified) and one Zapata specimen is completely curved and photographed from a lateral view, which is not the case for the other specimens (five of five images misclassified). Misclassified C. californica and D. baumanni have parts occluded, are photographed from a lateral view, are unusually curved, or are really small and pale. It is interesting to note that all but eight of the 774 imaged specimens (1%) had the majority of their images correctly identified by our model.

Performance on Related Tasks

To examine the generality of our results, we tested the performance of our approach on five related biological image identification challenges that have been examined in recent studies of automated identification systems. We did not optimize our model for each of these data sets. Instead, we used settings that performed well across all of the data sets D1–D4. Specifically, unless noted otherwise, we used input images of size 416×416, global average pooling, fusion of c1–c5 features, and signed square root normalization.

ClearedLeaf. The original study (Wilf et al. 2016) reported results on various identification tasks involving this data set, using manual extraction of SIFT descriptors (see Lowe 2004). We randomly picked one of the tasks, identifying families with 100+ images each. The original study reported 72.14% accuracy on this task; with our approach, we reached an identification accuracy of >88%.

CRLeaves. Using fine-tuned Inception v3 (Szegedy et al. 2016b) pretrained on ImageNet, Carranza-Rojas et al. (2017) report 51% accuracy in identifying the 7262 images of CRLeaves to the 255 included plant species. They also tried fine-tuning on a bigger plant data set (253,733 images) followed by fine-tuning on target tasks such as CRLeaves. Although this approach improved accuracies by 5% on other data sets examined in their study, it actually lowered the performance on CRLeaves (49.1% accuracy). Our model achieved an identification accuracy of >94%.

EcuadorMoths. This data set of 675 geometrid moth species (2120 images) is extremely challenging because of the visual similarity of the species and the small number of images per species. Previous studies have reported 55.7% accuracy (Rodner et al. 2015) and 57.19% accuracy (Wei et al. 2017). Both studies were based on models pretrained on ImageNet (Russakovsky et al. 2015). The first utilized feature extraction by globally pooling bottleneck features from AlexNet (Krizhevsky et al. 2012). The second utilized VGG16 (Simonyan and Zisserman 2014), introducing a novel approach the authors called selective convolutional descriptor aggregation (SCDA) for localizing the main object and removing noise. Wei et al. (2017) show that the SCDA approach outperforms 11 other CNN methods on EcuadorMoths. They used a training set of 1445 images, containing on average only two images per species. Using the same partitioning into training and test data sets, our model achieved 55.4% identification accuracy, slightly worse than both previous studies. However, features from block c4 alone outperformed the default settings (58.2%). This is similar to the results we found for D3, which is also a task with visually similar and closely related species.

Flavia. The original study used handcrafted feature extraction and reported 90.3% identification accuracy (Wu et al. 2007). This was subsequently improved with other traditional techniques (93.2%, 97.2%) (Kulkarni et al. 2013; Kadir 2014). The performance was recently pushed even further with deep belief networks (99.37% accuracy) (Liu and Kan 2016) and by several CNN-based approaches (97.9% and 99.65% accuracy; Barré et al. 2017 and Sun et al. 2017, respectively). Our base model misidentified only one image in Flavia, achieving 99.95% accuracy.

Pollen23. The original study reported 64% accuracy on this task with the best-performing model (Gonçalves et al. 2016). It was based on a combination of handcrafted features, such as color, shape and texture, and a method known as "bag of words" (Yang et al. 2007). The performance was comparable to that of beekeepers, who were trained for this particular task; they achieved 67% accuracy. Images from this data set are small (approximately one half of the images have width or height <224 pixels, the default input size for VGG16), so we decided to use 160×160 as the size of our input images. Our model achieved >94% identification accuracy on this task.
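As a rough, self-contained illustration of the default protocol just described (416×416 input, global average pooling of features from the five VGG16 convolutional blocks, feature fusion, and signed square-root normalization, followed by a linear SVM), the sketch below shows one possible implementation with Keras and scikit-learn. Layer names and helper functions are our assumptions; the scripts in the project repository may differ in detail.

# Sketch of the default feature-extraction protocol: global average pooling of
# VGG16 blocks c1-c5, feature fusion, signed square-root normalization, linear SVM.
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing import image
from sklearn.svm import LinearSVC

BLOCK_OUTPUTS = ["block1_pool", "block2_pool", "block3_pool", "block4_pool", "block5_pool"]
base = VGG16(weights="imagenet", include_top=False, input_shape=(416, 416, 3))
extractor = Model(base.input, [base.get_layer(n).output for n in BLOCK_OUTPUTS])

def fused_features(img_paths, size=416):
    feats = []
    for path in img_paths:
        x = image.img_to_array(image.load_img(path, target_size=(size, size)))
        maps = extractor.predict(preprocess_input(x[np.newaxis]), verbose=0)
        gap = np.concatenate([m.mean(axis=(1, 2)).ravel() for m in maps])   # c1-c5 fusion
        feats.append(np.sign(gap) * np.sqrt(np.abs(gap)))                   # signed square root
    return np.vstack(feats)

# Training and evaluation (train_paths, train_labels, test_paths, test_labels are placeholders):
# clf = LinearSVC().fit(fused_features(train_paths), train_labels)
# accuracy = clf.score(fused_features(test_paths), test_labels)

For small-image data sets such as Pollen23, the same sketch can be run with a smaller input size (size=160), as described above.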

TABLE 3. Experiments on recently published biological image classification tasks

Data set       Categories     Our method (acc., %)       Original study (acc., %)   Method                   Reference
ClearedLeaf    19 families    88.7                       71                         SIFT + SVM               Wilf et al., 2016
CRLeaves       255 species    94.67                      51                         fine-tuned Inception v3  Carranza-Rojas et al., 2017
EcuadorMoths   675 species    55.4 (58.2 with c4 only)   55.7                       AlexNet + SVM            Rodner et al., 2015
                                                         57.19                      VGG16 + SCDA + SVM       Wei et al., 2017
Flavia         32 species     99.95                      99.65                      ResNet26                 Sun et al., 2017
Pollen23       23 species     94.8                       64                         CST + BOW                Gonçalves et al., 2016

Note: We used input images of size 416×416 (160×160 for Pollen23), global average pooling, fusion of c1–c5 features, and signed square root normalization. Accuracy is reported using 10-fold cross validation except for EcuadorMoths, where we used the same partitioning as in the original study. In the published table, bold font indicates the best identification performance.
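The note to Table 3 mentions 10-fold cross-validation, and the D4 experiments described above additionally kept all images of the same specimen in a single fold. A hedged sketch of both evaluation schemes with scikit-learn (X, y and specimen_ids are placeholders for the extracted features, labels, and specimen identifiers):

# Sketch: 10-fold cross-validated accuracy; with specimen IDs, a grouped splitter
# guarantees that images of one specimen never occur in both training and test folds.
from sklearn.model_selection import StratifiedKFold, GroupKFold, cross_val_score
from sklearn.svm import LinearSVC

def cv_accuracy(X, y, specimen_ids=None, n_splits=10):
    if specimen_ids is None:
        cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
        scores = cross_val_score(LinearSVC(), X, y, cv=cv)
    else:
        cv = GroupKFold(n_splits=n_splits)
        scores = cross_val_score(LinearSVC(), X, y, cv=cv, groups=specimen_ids)
    return scores.mean()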


FIGURE 12. Visualization of the c4 feature space of data sets D1 (a), D2 (b), D3 (c), and D4 (d) using t-SNE (Maaten and Hinton 2008). The c4 features contain a considerable amount of information that is useful in separating image categories. Note that some categories are more isolated and easier to identify than others. Note also that it is common for categories to consist of several disjunct sets of images. These disjunct sets are likely to represent images that differ from each other in features that are irrelevant for identification, such as background color or life stage.
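Maps like those in Figure 12 can be generated directly from the extracted features. A minimal sketch with scikit-learn and matplotlib, where c4_features (one row per image) and labels are placeholders for the pooled c4 features and their taxon labels:

# Sketch: 2-D t-SNE embedding of pooled c4 features, one color per category.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_feature_space(c4_features, labels):
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(c4_features)
    for taxon in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == taxon]
        plt.scatter(emb[idx, 0], emb[idx, 1], s=8, label=taxon)
    plt.legend(fontsize="small", markerscale=2)
    plt.axis("off")
    plt.show()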

DISCUSSION

CNN Feature Transfer

The ultimate goal of the machine learning community is to build systems that can perceive and execute highly complex tasks without prior knowledge and with minimal if any human intervention (Bengio and LeCun 2007). In recent years, DL and CNNs have contributed to spectacular advances in this direction in computer vision. Image classification tasks can now be solved with a very high degree of accuracy using systems that support end-to-end learning (LeCun et al. 2015).

In this study, our aim was to explore the extent to which systematists can build on these breakthroughs in developing automated taxonomic identification systems. Clearly, we currently lack both sufficiently large training sets and enough computational resources to train state-of-the-art CNNs from scratch on taxonomic identification tasks. Therefore, some form of transfer learning is needed, and feature transfer has recently emerged as the method of choice.

Perhaps the most important reason for this is that feature transfer is more computationally efficient than fine-tuning. In fine-tuning, the images have to be run through the CNN in a forward pass, and then the computed derivatives from the predictions have to be backpropagated to modify the CNN, which is computationally expensive. Even worse, this process of alternating forward and backward passes has to be repeated until our model converges, which only happens if we have chosen the right parameters. Choosing the right parameters is not trivial; it requires machine learning knowledge. In feature extraction, the image is only passed forward through the pretrained CNN once; no backpropagation or fine-tuning of the model is needed. The only parameter tuning that is needed in feature extraction is the adaptation of the classifier, which is fast because the classifier focuses entirely on the smaller training set for the target task.

Feature transfer is known to outperform hand-engineered features on some tasks (Donahue et al. 2014), and fine-tuning often further helps the performance (Li and Hoiem 2016). However, when fine-tuning on small or unbalanced data sets, overfitting is a serious concern. This is usually tackled with aggressive data augmentation, which may include geometric (cropping, flipping, rotating, shifting, and distorting) or color (shading, shifting color channels, or changing the brightness) transformations. Both the original images and their duplicates (D instances) are then used in training the network, increasing the size of the training set from N to N(D+1) images. This further increases the computational cost of fine-tuning.

For these reasons, the combination of a pretrained CNN, usually VGG16, for feature extraction and linear SVMs for classification is a common choice in recent computer vision studies (Donahue et al. 2014; Razavian et al. 2014; Azizpour et al. 2016; Zheng et al. 2016). Nevertheless, our results show that there is still room for improvement. Specifically, by experimenting with the feature extraction strategy, our results show that the identification accuracy can be improved significantly compared to the basic off-the-shelf approach that involves global max pooling of bottleneck features from input images of size 224×224 (Figs. 4–6).

Our finding that global average pooling performs better than global max pooling (Fig. 4a) is congruent with the results of a recent study applying feature transfer to fine-grained classification tasks (Zheng et al. 2016). A possible explanation for the better performance of average pooling is that the objects in our images tend to be big and occupy a large part of the picture frame. Thus, features from all parts of the image contribute to identification, which they can do through the average but not the max value. This explanation is consistent with the observation that the difference between average and max pooling was smaller for D4, containing a number of images of smaller objects (Fig. 4a).

Our results suggest that there may be an advantage of using intermediate-level features compared to bottleneck features or features from the final layers. These findings contrast with those from generic classification tasks, such as Caltech-101 (Fei-Fei et al. 2006), Caltech-256 (Griffin et al. 2007), and PASCAL VOC07 (Everingham et al. 2010), where the first fully connected layer has been found to be the most discriminative layer. It also differs from the results of previous studies of fine-grained identification tasks, such as Bird-200-2011 (Wah et al. 2011) and Flower-102 (Nilsback and Zisserman 2008), where it has been found that c5 is the most discriminative layer (Zheng et al. 2016). One possible explanation for these differences is that our data sets are more fine-grained than any of the data sets examined in these studies. However, the results we obtained on other recently published data sets suggest that fusion of intermediate-level features will perform well on a broad range of biological image classification tasks.

Our results also indicate that preserving more features than the standard off-the-shelf approach may give better identification performance (Fig. 4b). Preserving more features increases the computational cost, but this may be a price worth paying when working with challenging fine-grained classification tasks, or when the identification accuracy is more important than the computational cost of training the classifier. The advantage of preserving more features is most evident on D3, which is the most fine-grained of all the data sets studied here (Fig. 6).

Recent research has shown that CNNs trained from scratch on the large ImageNet data set perform well despite the presence of small amounts of noise (0.3%) (Russakovsky et al. 2015). Our results for D2 (0.2% noise) indicate that feature transfer from pretrained CNNs is also robust to a similar level of noise in the training set for the target task. This facilitates practical use of methods based on CNN feature transfer, as it can be exceedingly difficult to completely eliminate mislabeled data from training sets even when identifications are provided by experts.

It is somewhat surprising that the optimal feature extraction strategy turned out to be so similar across all tasks and data sets (D1–D4) examined in detail here, despite the fact that they are so varied. It is true that D1–D3 share several important features: the object comprises a major part of the image and the images are standardized with respect to viewing angle and orientation of the specimen. However, these features are not shared by D4. The tasks are also different in that D1–D2 involve the recognition of variable higher taxa, while D3–D4 involve the identification of more homogeneous categories with fine-grained differences among them. The only striking features that are shared across all problems examined here are the small training sets, the clean images (few distracting objects), and the high labeling (and expected identification) accuracy. Thus, it might be possible to find a single feature extraction protocol that would perform well across a broad class of biological identification problems with similar properties. This prediction is supported by the results from applying a standard feature extraction protocol inspired by our results for D1–D4 on a range of recently studied biological image classification problems (Table 3). Without task-specific optimization, our approach managed to beat the state-of-the-art identification accuracy for almost all data sets, often with a substantial margin.

Undoubtedly, further work can help improve the performance of the method described here. For instance, preprocessing images by implementing object segmentation (separation of objects from the background) has been shown to improve the performance of image classification in general (Rabinovich et al. 2007), and it seems likely it will also help improve our CNN feature extraction protocol. Other interesting techniques that have emerged recently, and that should be tried in the CNN feature extraction framework, include: assembling multiple architectures as in bilinear CNNs (Lin et al. 2015); using multiple images as inputs as in Siamese CNNs (Zbontar and LeCun 2016); and identifying informative image parts with HSnet search prior to feeding them into the CNN (Lam et al. 2017).

It may also be possible to improve identification performance by rethinking taxonomic tasks. For instance, preliminary experiments that we performed indicate that identification performance can be increased by subdividing the categories of interest before training the system. In taxonomic identification, appropriate subdivisions might include sex (male or female), life stage, or color morph. For instance, assume that we are trying to separate two closely related species, A and B, in a group with distinct sexual dimorphism. Then it might be a good idea to train the system on separating males from females at the same time as it is trained to separate the species, as this will help the system in finding the sex-specific species differences between A and B. Another idea is to combine images representing different viewing angles of the same specimen in the training set. For instance, the system might be trained on separating combined images showing both the dorsal and ventral side of the same specimen.

Taxonomic Identification

Taxonomic identification tasks differ in several ways from more typical image classification problems. For instance, it is common that taxonomic image sets use a standard view, that specimens are oriented in the same way and occupy a large part of the frame, and that there are few disturbing foreign elements in the images. These are all features that should make it easier for machine learning algorithms to learn the classification task, since they reduce the variation within image categories. If images are more heterogeneous, like our data set D4, one expects more training data to be needed for the same level of identification accuracy. Interestingly, our results suggest that the extra difficulties introduced by more heterogeneous images are not as dramatic as one might think. Using roughly 400 images per category, we actually achieved better identification accuracies for the heterogeneous D4 than we did with the less challenging D1–D3 using 100–200 images per category.

Taxonomic identification tasks differ widely in the morphological variation one expects within categories, and the degree to which it is possible to cover that variation in the training set. It is perhaps not surprising that the task of identifying higher taxonomic groups is more challenging than the separation of species, even if the species are very similar to each other. Our data sets used for the identification of higher groups (D1–D2) are particularly challenging because a large fraction of the images (all in D1) belong to unique species. Thus, the model needs to be able to place specimens in the correct higher group even though it has not seen the species before, and possibly not the genus it belongs to either. This requires the model to achieve a good understanding of the diagnostic features of the higher taxonomic groups. In view of these challenges, the accuracies we report here on the higher-group identification tasks (>92% for D1 and >96% for D2) are quite impressive, especially given the small number of images per category in the training set. It requires a decent amount of training for a human to reach the same levels of identification accuracy. Given larger training sets, it seems likely that the automated systems would be able to compete successfully with experts on these tasks.

On the fine-grained tasks (D3–D4, D2B), our system clearly reaches or exceeds expert-level accuracies even with the small to modest training sets used here. On D3, the best taxonomists in the world could beat the automated system, but only if they had access to images of the ventral side of the specimens and the locality of collecting. For a fair comparison, however, one would then also have to make ventral-view images available in the training set used for the automated system, which might tip the race again in the favor of machines.

On task D2B (separating two genera of Scarabaeidae), our automated system outperformed the expert we consulted, and on D4 it achieved significantly higher identification accuracies in separating the most challenging species pair than reported previously for 26 entomologists trained on the task (Larios et al. 2008) (>98% vs. 78%). The human identification accuracies reported by Larios et al. may seem low, but previous studies have shown that humans are prone to make errors, regardless of whether they are experts or nonexperts (MacLeod et al. 2010; Culverhouse et al. 2014; Austen et al. 2016 and references cited therein). The errors can be due to a variety of reasons: fatigue, boredom, monotony, imperfect short-term memory, etc. (Culverhouse 2007).

In many ways, our results demonstrate wide-ranging similarities between humans and machines in learning and performing taxonomic identification tasks. For instance, the amount of training is clearly important in improving machine performance (Fig. 7). Machine identification accuracies are significantly lower for groups that have not been seen much previously, and for entirely novel taxa (Fig. 8). Taxa that are unusually variable, such as the Tenebrionidae, are difficult to learn, while uniform groups that have striking diagnostic features, such as Pipunculidae, will be picked up quickly (Fig. 7).

It is also easy to recognize human-like patterns in the erroneous machine identifications. For instance, the D4 specimens that are misclassified by the automated system are unusually small, occluded, or imaged from a difficult angle (Fig. 11). These are specimens that humans would also have difficulties identifying correctly. Similarly, the errors in machine identification of Oxythyrea species are similar to the errors one might expect humans to make (Fig. 10).

Our results clearly demonstrate that modern machine learning algorithms are powerful enough to address even the most challenging taxonomic identification tasks. For instance, it appears worthwhile to test whether the method described here would also be able to identify cryptic species that are extremely challenging or impossible for human taxonomists to separate, such as bumblebee species of the lucorum complex (Scriven et al. 2015). In addressing such tasks, however, several problems might occur due to the difficulties of image annotation and the model's "hunger for examples." For instance, humans might not be able to provide accurate identifications without preparation of the genitalia or extraction of DNA for sequencing. Thus, accurate identification of the specimens may leave them in a condition making them unsuited for imaging, so unless they were imaged prior to identification, the specimens cannot be included in training sets.

An interesting aspect of taxonomic identification is that it falls into two distinct subdisciplines: field identification and identification of museum specimens. Even though the two are clearly separated, it is also obvious that knowledge developed in one domain will be useful in the other. Transferring knowledge between such related applications is known as domain adaptation in computer vision (Csurka 2017). This is a very active research topic, and although current techniques are less mature than the image classification algorithms used in this article, we might expect significant progress in the near future. It seems likely that citizen-science projects like iNaturalist and e-Bird will be able to generate large training sets of field photos of animals and plants in the near future, allowing the development of powerful automated systems for field identification. Efficient domain adaptation should make it possible to extract the machine knowledge about organism identification accumulated in such projects and use it in developing better systems for taxonomic identification of museum specimens. It may also be useful in many cases to transfer machine knowledge in the other direction.

CONCLUSION

In conclusion, our results show that effective feature extraction from pretrained CNNs provides a convenient and efficient method for developing fully automated taxonomic identification systems. The general effectiveness of our approach on a wide range of challenging identification tasks suggests that the method can be applied with success even without task-specific optimizations. The method is robust to noise in the training set and yields good performance already with small numbers of images per category. Using the provided code, it should be easy to apply the method to new identification tasks with only some basic skills in python scripting.

With the advent of DL and efficient feature transfer from pretrained CNNs, the field of automated taxonomic identification has entered a completely new and exciting phase. Clearly, routine tasks of sorting and identifying biological specimens could be replaced already today in many situations by systems based on existing CNN technology. Actually, it may be time now to look beyond the obvious application of CNNs to routine identification, and start thinking about completely new ways in which CNNs could help boost systematics research.

One interesting application may be online discovery of new taxa as more and more of the world's natural history collections get imaged and published on the web. Although detection of images representing potential new categories remains a challenging problem in computer vision, one can imagine a future in which a taxonomist would be alerted by an automated system when images of specimens that potentially represent new species of her group become available.

Using the high-level semantics of CNNs to generate natural-language descriptions of image features is another active research field in computer vision (Donahue et al. 2015; Karpathy and Fei-Fei 2015; Xu et al. 2015) that may have a dramatic impact on systematics. In theory, such systems could completely automate the generation of natural-language descriptions of new species.

The CNN-based automated identification systems clearly develop a powerful high-level understanding of the tasks with which they are challenged. If that understanding of taxonomic identification tasks could be communicated effectively to biologists, it could replace traditional identification keys. Perhaps the best way to do this would be through visualizations of morphospace that are optimized for identification. In other words, morphospace as it is organized by a CNN trained on a relevant identification task. It is not difficult to imagine ways in which such approaches might generate identification tools that would be highly efficient and very appealing to humans.

One could also imagine CNN-based systems for automating morphological phylogenetic analysis. Throughout the history of evolutionary biology, comparative anatomists and paleontologists have demonstrated that it is possible for the human brain to reconstruct phylogenies by intuitively evaluating a large body of morphological facts. In many cases, these morphology-based hypotheses have withstood the test of time even though they were challenged by many of the early molecular phylogenetic analyses. If it is possible for humans, it should also be possible for CNNs to learn how to infer phylogeny from morphology documented in images.

It is important for systematists to encourage and accelerate the collection and publishing of accurately labeled digital image sets, as the shortage of suitable training data is one of the major hurdles today for the development of computer-assisted taxonomic identification. The large museum digitization efforts that are under way in several countries will clearly benefit the development of automated identification systems, but more such projects are needed. The new identification systems, in turn, will help to significantly increase the impact of these imaging projects by using them to generate novel discoveries in taxonomy and systematics.

Clearly, morphology-based taxonomy and systematics are facing a period of transformative changes. By taking advantage of the new opportunities that are opening up thanks to developments in computer vision, systematists and taxonomists could accelerate the discovery of the planet's biological diversity, free up more time to sophisticated research tasks, and facilitate the study of classical evolutionary problems, such as the evolution of morphological characters and the dating of divergence times using fossils.

SUPPLEMENTARY MATERIAL

Data available from the Dryad Digital Repository: http://dx.doi.org/10.5061/dryad.20ch6p5 and https://github.com/valanm/off-the-shelf-insect-identification.

AUTHOR CONTRIBUTIONS

M.V., A.M., and F.R. conceived the idea; M.V., K.M., A.M., D.V., and F.R. designed the study; M.V. compiled and filtered data sets D1 and D2 and D.V. accumulated D3; M.V. wrote the scripts, performed analyses, interpreted results, and prepared the first draft. All authors revised and commented drafts at different stages and contributed to the final version of the manuscript.

FUNDING

This work was supported by the European Union's Horizon 2020 research and innovation program under the Marie Sklodowska-Curie [642241 to M.V.]; from Charles University [SVV 260 434/2018], from Ministry of Culture of the Czech Republic [DKRVO 2018/14, National Museum, 00023272], and from European Community Research Infrastructure Action under the Synthesys Project [FR-TAF-5869 (http://www.synthesys.info/)] to D.V.; and from the Swedish Research Council [2014-05901 to F.R.].

ACKNOWLEDGMENTS

The authors thank Claes Orsholm from Savantic AB, Johannes Bergsten and Allison Hsiang from the Swedish Museum of Natural History, and Viktor Senderov from the Bulgarian Academy of Sciences for their valuable comments, Jiří Skuhrovec from the Crop Research Institute in Prague for his help with specimen IDs, Aleš Bezděk from the Biology Centre CAS (Institute of Entomology) in České Budějovice for his attempt to compete with the machine in the Diplotaxis versus Phyllophaga task, and Petr Šípek from Charles University (Faculty of Science) for his valuable discussion on the genus Oxythyrea. M.V. and D.V. are grateful to Martin Fikáček from the National Museum in Prague and Alexey Solodovnikov from the Natural History Museum of Denmark for initiating the collaboration that resulted in this publication.

REFERENCES

Abadi M., Agarwal A., Barham P., Brevdo E., Chen Z., Citro C., Corrado G.S., Davis A., Dean J., Devin M., Ghemawat S., Goodfellow I., Harp A., Irving G., Isard M., Jia Y., Jozefowicz R., Kaiser L., Kudlur M., Levenberg J., Mane D., Monga R., Moore S., Murray D., Olah C., Schuster M., Shlens J., Steiner B., Sutskever I., Talwar K., Tucker P., Vanhoucke V., Vasudevan V., Viegas F., Vinyals O., Warden P., Wattenberg M., Wicke M., Yu Y., Zheng X. 2016. TensorFlow: large-scale machine learning on heterogeneous distributed systems. Available from: arXiv:1603.04467.
Arandjelović R., Zisserman A. 2013. All about VLAD. 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland: IEEE.
Arbuckle T., Schröder S., Steinhage V., Wittmann D. 2001. Biodiversity informatics in action: identification and monitoring of bee species using ABIS. Proc. 15th Int. Symp. Informatics for Environmental Protection, Zürich, Switzerland: ETH. vol. 1. p. 425–430.
Austen G.E., Bindemann M., Griffiths R.A., Roberts D.L. 2016. Species identification by experts and non-experts: comparing images from field guides. Sci. Rep. 6:33634.
Azizpour H., Razavian A.S., Sullivan J., Maki A., Carlsson S. 2016. Factors of transferability for a generic ConvNet representation. IEEE Trans. Pattern Anal. Mach. Intell. 38:1790–1802.
Baraud J. 1992. Coléoptères Scarabaeoidea d'Europe. Fédération Française des Sociétés de Sciences Naturelles & Société Linnéenne de Lyon, Faune de France, 78:1–856.
Barker E., Barker W., Burr W., Polk W., Smid M. 2012. Recommendation for key management part 1: general (revision 3). NIST Spec. Pub. 800:1–147.
Barré P., Stöver B.C., Müller K.F., Steinhage V. 2017. LeafNet: a computer vision system for automatic plant species identification. Ecol. Inform. 40:50–56.
Bengio Y. 2011. Deep learning of representations for unsupervised and transfer learning. Proceedings of ICML Workshop on Unsupervised and Transfer Learning. Bellevue: IEEE. p. 17–36.
Bengio Y., LeCun Y. 2007. Scaling learning algorithms towards AI. In: Bottou L, Chapelle D, DeCoste D, Weston J, editors. Large-scale kernel machines. Cambridge MA: MIT Press. vol. 34. p. 1–41.
Brehm G., Strutzenberger P., Fiedler K. 2013. Phylogenetic diversity of geometrid moths decreases with elevation in the tropical Andes. Ecography 36:1247–1253.
Breiman L. 2001. Random forests. Mach. Learn. 45:5–32.
Carranza-Rojas J., Goeau H., Bonnet P., Mata-Montero E., Joly A. 2017. Going deeper in the automated identification of herbarium specimens. BMC Evol. Biol. 17:181.
Caruana R. 1995. Learning many related tasks at the same time with backpropagation. In: Tesauro G, Touretzky DS, Leen TK, editors. Advances in neural information processing systems 7. Cambridge MA: MIT Press. p. 657–664.
Chollet F. 2015. Keras. GitHub. https://github.com/fchollet/keras.
Chollet F. 2016. Xception: deep learning with depthwise separable convolutions. Available from: arXiv:1610.02357.
Cireşan D., Meier U., Masci J., Schmidhuber J. 2011. A committee of neural networks for traffic sign classification. The 2011 International Joint Conference on Neural Networks. San Jose: IEEE. p. 1918–1921.

Cortes C., Vapnik V. 1995. Support-vector networks. Mach. Learn. Karpathy A., Fei-Fei L. 2015. Deep visual-semantic alignments for 20:273–297. generating image descriptions. Proceedings of the IEEE Conference Csurka G. 2017, Domain Adaptation for Visual Applications: A on Computer Vision and Pattern Recognition. Boston: IEEE. p. 3128– Comprehensive Survey. Available from: arXiv:1702.05374 3137. Culverhouse P. F. 2007. Natural object categorization: man versus Khan R., de Weijer J.v., Khan F.S., Muselet D., Ducottet C., Barat machine. In: MacLeod, N., editor. Automated taxon identification C. 2013. Discriminative color descriptors. 2013 IEEE Conference in systematics: theory, approaches and applications. Boca Raton, on Computer Vision and Pattern Recognition. Portland: IEEE. Florida: CRC Press. p. 25–45. p. 2866–2873. Culverhouse P. F., Macleod N., Williams R., Benfield M.C., Lopes R.M., Kolbert E. 2014. The sixth extinction: an unnatural history. New York: Picheral M. 2014. An empirical assessment of the consistency of A&C Black. 319 pp. taxonomic identifications. Mar. Biol. Res. 10:73–84. Krizhevsky A., Sutskever I., Hinton G.E. 2012. ImageNet classification Donahue J., Hendricks L.A., Guadarrama S,. Rohrbach M., with deep convolutional neural networks. In: Pereira, F., Burges, Venugopalan S., Saenko K., Darrell T. 2015. Long-term recurrent C. J.C., Bottou, L., Weinberger, K.Q., editors. Advances in neural convolutional networks for visual recognition and description. information processing systems 25. Lake Tahoe: IEEE. p. 1097–1105. Proceedings of the IEEE Conference on Computer Vision and Kulkarni A.H., Rai H.M., Jahagirdar K.A., Upparamani P.S. 2013. A Pattern Recognition. Boston: IEEE. p. 2625–2634. leaf recognition technique for plant classification using RBPNN and Donahue J., Jia Y., Vinyals O., Hoffman J., Zhang N., Tzeng E., Darrell Zernike moments. Int. J. Adv. Res. Comput. Commun. Eng. 2:984– T. 2014. DeCAF: a deep convolutional activation feature for generic 988. visual recognition. International Conference on Machine Learning. Larios N., Deng H., Zhang W., Sarpola M., Yuen J., Paasch R., Moldenke Columbus: IEEE. p. 647–655. A., Lytle D.A., Correa S.R., Mortensen E.N., Shapiro L.G., Dietterich Everingham M., Van Gool L., Williams C. K.I., Winn J., Zisserman T.G. 2008. Automated insect identification through concatenated A. 2010. The pascal visual object classes (VOC) challenge. Int. J. histograms of local appearance features: feature vector generation Comput. Vis. 88:303–338. and region detection for deformable objects. Mach. Vis. Appl. Food and Agriculture Organization of the United Nations. 19:105–123. Plant pests and diseases. http://www.fao.org/emergencies/ Lam M., Mahasseni B., Todorovic S. 2017 Fine-grained recognition as emergency-types/plant-pests-and-diseases/en/. Accessed HSnet search for informative image parts. The IEEE Conference on October 12, 2017. Computer Vision and Pattern Recognition (CVPR). Honolulu: IEEE. Fei-Fei L., Fergus R., Perona P. 2006. One-shot learning of object LeCun Y., Bengio Y., Hinton G. 2015. Deep learning. Nature 521:436– categories. IEEE Trans. Pattern Anal. Mach. Intell. 28:594–611. 444. Feng L., Bhanu B., Heraty J. 2016. A software system for automated Li Z., Hoiem D. 2016. Learning without forgetting. Available from: identification and retrieval of moth images based on wing attributes. arXiv:1606.09282. Pattern Recognit. 51:225–241. Lin M., Chen Q., Yan S. 2013. Network in network. 
Available from: Francoy T.M., Wittmann D., Drauschke M., Müller S., Steinhage arXiv:1312.4400. V., Bezerra-Laure M. A.F., De Jong D., Gonçalves L.S. 2008. Lin T.-Y., RoyChowdhury A., Maji S. 2015. Bilinear CNNs for fine- Identification of africanized honey bees through wing grained visual recognition. Proceedings of the IEEE International morphometrics: two fast and efficient procedures. Apidologie Conference on Computer Vision. Boston: IEEE. p. 1449–1457. 39:488–494. Liu N., Kan J.-M. 2016. Plant leaf identification based on the multi- Fukushima K. 1979. Neural network model for a mechanism of feature fusion and deep belief networks method. J. Beijing For. Univ. pattern recognition unaffected by shift in position—neocognitron. 38:110–119 (in Japanese with English abstract). Trans. IECE Jpn. (A) 62- Lowe D.G. 2004. Distinctive image features from scale-invariant A(10):658–665. keypoints. Int. J. Comput. Vis. 60:91–110. Fukushima K. 1980. Neocognitron: a self-organizing neural network Lytle D.A., Martínez-Muñoz G., Zhang W., Larios N., Shapiro L., model for a mechanism of pattern recognition unaffected by shift Paasch R., Moldenke A., Mortensen E.N., Todorovic S., Dietterich in position. Biol. Cybern. 36:193–202. T.G. 2010. Automated processing and identification of benthic Fukushima K., Miyake S., Ito T. 1983. Neocognitron: a neural network invertebrate samples. J. North Am. Benthol. Soc. 29:867–874. model for a mechanism of visual pattern recognition. IEEE Trans. Maaten L. van der, Hinton G. 2008. Visualizing data using t-SNE. J. Syst. Man Cybern. SMC 13:826–834. Mach. Learn. Res. 9:2579–2605. Gauld I.D., ONeill M.A., Gaston K.J., Others 2000. Driving miss MacLeod N., Benfield M., Culverhouse P. 2010. Time to automate daisy: the performance of an automated insect identification system. identification. Nature 467:154–155. In: Austin AD, Dowton M, editors. Hymenoptera: evolution, Martineau M., Conte D., Raveaux R., Arnault I., Munier D., Venturini biodiversity and biological control. Collingwood, VIC, Australia: G. 2017. A survey on image-based insects classification. Pattern CSIRO. p. 303–312. Recognit. 65:273–284. Global Invasive Species Database. (2017). Available from: http://www. Mata-Montero E., Carranza-Rojas J. 2015. A texture and curvature iucngisd.org/gisd/100_worst.php. Accessed on 12-10-2017. bimodal leaf recognition model for identification of Costa Rican Gonçalves A.B., Souza J.S., da Silva G.G., Cereda M.P., Pott A., Naka plant species. 2015 Latin American Computing Conference (CLEI). M.H., Pistori H. 2016. Feature extraction and machine learning for Arequipa, Peru: IEEE. p. 1–12. the classification of Brazilian Savannah pollen grains. PLoS One Mikši´c R. 1982. Monographie der Cetoniinae der Paläarktischen 11:e0157044. und Orientalischen Region (Coleoptera, Lamellicornia). Band 3. Griffin G., Holub A., Perona P.2007.Caltech-256 object category dataset. Forstinstitut in Sarajevo. 530 pp. Pasadena (CA): California Institute of Technology. Murray N., Perronnin F. 2014. Generalized max pooling. Available He H., Garcia E.A. 2009. Learning from imbalanced data. IEEE Trans. from: arXiv:1406.0312. Knowl. Data Eng. 21:1263–1284. Nilsback M.E., Zisserman A. 2008. Automated flower classification He K., Zhang X., Ren S., Sun, J. 2016. Deep residual learning over a large number of classes. 2008 Sixth Indian Conference on for image recognition. Proceedings of the IEEE Conference on Computer Vision, Graphics Image Processing. Bhubaneswar, India: Computer Vision and Pattern Recognition. 
Las Vegas: IEEE. IEEE. p. 722–729. p. 770–778. ONeill M.A. 2007. DAISY: a practical tool for semi-automated species Joly A., Goëau H., Bonnet P., Baki´c V., Barbe J., Selmi S., Yahiaoui I., identification. Automated taxon identification in systematics: Carré J., Mouysset E., Molino J.-F., Boujemaa N., Barthélémy D. 2014. theory, approaches, and applications. Boca Raton, FL: CRC Interactive plant identification based on social image data. Ecol. Press/Taylor & Francis Group. p. 101–114. Inform. 23:22–34. Oquab M., Bottou L., Laptev I., Sivic J. 2014. Learning and Kadir A. 2014. A model of plant identification system using transferring mid-level image representations using convolutional GLCM, lacunarity and shen features. Available from: arXiv:1410. neural networks. Proceedings of the IEEE conference on computer 0969. vision and pattern recognition. Columbus: IEEE. p. 17171724. 2019 VALAN ET AL.—AUTOMATED TAXONOMIC IDENTIFICATION OF INSECTS 895

Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel Vapnik V.N. 1999. An overview of statistical learning theory. IEEE O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., Vanderplas Trans. Neural Netw. 10:988–999. J., Passos A., Cournapeau D., Brucher M., Perrot M., Duchesnay É. Vapnik V., Chapelle O. 2000. Bounds on error expectation for support 2011. Scikit-learn: machine learning in python. J. Mach. Learn. Res. vector machines. Neural Comput. 12:20132036. 12:2825–2830. Vondráˇcek D. 2011. Population structure of flower chafer Plotly Technologies, I. 2015. Collaborative data science. Montral, QC: Oxythyrea funesta (Poda, 1761) and phylogeny of the genus Plotly Technologies Inc. Oxythyrea Mulsant, 1842. (Diploma thesis). Available from Qian Q., Jin R., Zhu S., Lin Y. 2014. Fine-Grained visual categorization https://dspace.cuni.cz/handle/20.500.11956/41477. via multi-stage metric learning. Available from: arXiv:1402.0453. Vondráˇcek D., Hadjiconstantis M., Šipek P.2017.Phylogeny of the genus Rabinovich A., Vedaldi A., Belongie S.J. 2007.Does image segmentation Oxythyrea using molecular, ecological and morphological data from improve object categorization? San Diego: UCSD CSE Department. adults and larvae (Coleoptera: Scarabaeidae: Cetoniinae). pp. 857– Tech. Rep. CS2007-090. p. 1–9 858. In: Seidel M., Arriaga-VarelaE., Vondráˇcek D., editors. Abstracts Razavian A. S., Azizpour H., Sullivan J., Carlsson, S. 2014. CNN features of the Immature Beetles Meeting 2017,October 5–6th, Prague, Czech off-the-shelf: an astounding baseline for recognition. Proceedings of Republic. Acta Entomol. Mus. Natl. Pragae 57:835–859. the IEEE Conference on Computer Vision and Pattern Recognition Wah C., Branson S., Welinder P., Perona P., Belongie S. 2011. The Workshops. Columbus: IEEE. p. 806–813. Caltech-UCSD birds-200-2011 dataset. Pasadena (CA): California Rodner E., Simon M., Brehm G., Pietsch S., Wägele J.W., Denzler J. Institute of Technology. 2015. Fine-grained recognition datasets for biodiversity analysis. Watson A.T., O’Neill M.A., Kitching I.J. 2003. Automated identification Available from: arXiv:1507.00913. of live moths (macrolepidoptera) using digital automated Russakovsky O., Deng J., Su H., Krause J., Satheesh S., Ma S., Huang identification system (DAISY). Syst. Biodivers. 1:287–300. Z., Karpathy A., Khosla A., Bernstein M., Berg A.C., Fei-Fei L. 2015. Weeks P., Gauld I.D., Gaston K.J., O’Neill M.A. 1997. Automating the ImageNet large scale visual recognition challenge. Int. J. Comput. identification of insects: a new solution to an old problem. Bull. Vis. 115:211–252. Entomol. Res. 87:203–211. Sabatinelli G. 1981. Le Oxythyrea Muls. del Mediterraneo: studi Weeks P. J.D., O’Neill M.A., Gaston K.J., Gauld I.D. 1999a. Species– morfologici sistematici (Coleoptera, Scarabaeoidae). Fragm. identification of wasps using principal component associative Entomol. 16:45–60. memories. Image Vis. Comput. 17:861–866. Schröder S., Drescher W., Steinhage V., Kastenholz B. 1995. Weeks P. J.D., ONeill M.A., Gaston K.J., Gauld, I.D. 1999b. Automating An automated method for the identification of bee species insect identification: exploring the limitations of a prototype system. (Hymenoptera: Apoidea). Proc. Intern. Symp. on Conserving J. Appl. Entomol. 123:1–8. Europe’s Bees. London (UK): Int. Bee Research Ass. & Linnean Wei X.-S., Luo J.-H., Wu J., Zhou Z.-H. 2017. Selective convolutional Society, London. p. 6–7. descriptor aggregation for fine-grained image retrieval. IEEE Trans. 
Scriven J.J., Woodall L.C., Tinsley M.C., Knight M.E., Williams P.H., Image Process. 26:2868–2881. Carolan J.C., Brown M.J.F., Goulson D. 2015. Revealing the hidden Wilf P., Zhang S., Chikkerur S., Little S.A., Wing S.L., Serre T. 2016. niches of cryptic bumblebees in Great Britain: implications for Computer vision cracks the leaf code. Proc. Natl. Acad. Sci. USA conservation. Biol. Conserv. 182:126–133. 113:3305–3310. Simonyan K., Zisserman, A. 2014. Very deep convolutional networks World Health Organization. 2014. A global brief on vector-borne for large-scale image recognition. Available from: arXiv:1409.1556. diseases. WHO/DCO/WHD/2014.1. Schmidhuber J. 2015. Deep learning in neural networks: an overview. Wu S.G., Bao F.S., Xu E.Y., Wang Y.X., Chang Y.F., Xiang Q.L. 2007. A Neural Netw. 61:85–117. leaf recognition algorithm for plant classification using probabilistic Steinhage V., Schröder S., Cremers A.B., Lampe K.-H. 2007. neural network. 2007 IEEE International Symposium on Signal Automated extraction and analysis of morphological features for Processing and Information Technology., Cairo, Egypt: IEEE. p. 11– species identification. In: MacLeod N, editor. Automated taxon 16. identification in systematics: theory, approaches and applications. Xu K., Ba J., Kiros R., Cho K., Courville A., Salakhudinov R., Zemel Boca Raton, Florida: CRC Press. p. 115–129. R., Bengio Y. 2015. Show, attend and tell: neural image caption Stallkamp J., Schlipsing M., Salmen J., Igel C. 2011. The German generation with visual attention. Lile, France: International Machine Traffic Sign Recognition Benchmark: a multi-class classification Learning Society (IMLS). p. 2048–2057. competition. The 2011 International Joint Conference on Neural Yang J., Jiang Y.-G., Hauptmann A.G., Ngo C.-W. 2007. Evaluating bag- Networks. San Jose: IEEE. p. 1453–1460. of-visual-words representations in scene classification. Proceedings Sun Y., Liu Y., Wang G., Zhang H. 2017. Deep learning for plant of the International Workshop on Multimedia Information identification in natural environment. Comput. Intell. Neurosci. Retrieval. Augsburg, Germany: IEEE. p. 197–206. 2017:7361042. Yang H.-P., Ma C.-S., Wen H., Zhan Q.-B., Wang X.-L. 2015. A tool for Szegedy C., Ioffe S., Vanhoucke V., Alemi A. 2016a. Inception- developing an automatic insect identification system based on wing v4, Inception-ResNet and the impact of residual connections on outlines. Sci. Rep. 5:12786. learning. Available from: arXiv:1602.07261. Yosinski J., Clune J., Bengio Y., Lipson H. 2014. How transferable are Szegedy C., Vanhoucke V.,Ioffe S., Shlens J., Wojna Z. 2016b. Rethinking features in deep neural networks? In Ghahramani, Z., Welling, M., the inception architecture for computer vision. Proceedings of the Cortes, C., Lawrence, N.D., and Weinberger, K.Q., (eds): Advances IEEE Conference on Computer Vision and Pattern Recognition. Las in Neural Information Processing Systems 27. Curran Associates, Vegas: IEEE. p. 2818–2826. Inc. 3320–3328. Tofilski A. 2004. DrawWing, a program for numerical description of Zbontar J., LeCun Y. 2016. Stereo matching by training a convolutional insect wings. J. Insect Sci. 4:17. neural network to compare image patches. J. Mach. Learn. Res. 17:2. Tofilski A. 2007. Automatic measurement of honeybee wings. In: Zeiler M.D., Fergus R. 2014. Visualizing and understanding MacLeod N, editor. Automated taxon identification in systematics: convolutional networks. Comput. Vis. ECCV 2014. 818833. theory, approaches and applications. 
Boca Raton, Florida: CRC Zhang Z.-Q. 2011. Animal biodiversity: an outline of higher-level Press. p. 277–288. classification and survey of taxonomic richness. Zootaxa 3148:1–237. Van Horn G., Branson S., Farrell R., Haber S., Barry J., Ipeirotis P., Zhang W., Yan J., Shi W., Feng T., Deng D. 2017. Refining Perona P., Belongie S. 2015. Building a bird recognition App and deep convolutional features for improving fine-grained image large scale dataset with citizen scientists: the fine print in fine- recognition. J. VLSI Signal Process. Syst. Signal Image Video grained dataset collection. Proceedings of the IEEE Conference on Technol. 2017:27. Computer Vision and Pattern Recognition. Boston: IEEE. p. 595– Zheng L., Zhao Y., Wang S., Wang J., Tian Q. 2016. Good practice in 604. CNN feature transfer. Available from: arXiv:1604.00133 II

Awakening taxonomist's third eye: exploring the utility of computer vision and deep learning in insect systematics

Miroslav Valan1,2,3,*, Dominik Vondráček4,5, Fredrik Ronquist2

1 Savantic AB, Rosenlundsgatan 52, 118 63 Stockholm, Sweden 2 Department of Bioinformatics and Genetics, Swedish Museum of Natural History, Frescativägen 40, 114 18 Stockholm, Sweden 3 Department of Zoology, Stockholm University, Universitetsvägen 10, 114 18 Stockholm, Sweden 4 Department of Zoology, Faculty of Science, Charles University in Prague, Viničná 7, CZ-128 43 Praha 2, Czech Republic 5 Department of Entomology, National Museum, Cirkusová 1740, CZ-193 00 Praha 9 - Horní Počernice, Czech Republic

* [email protected]

Abstract

Automating taxonomic identification has been a dream for many insect systematists, as this could free up time for more challenging scientific endeavours, such as circumscribing and describing new species or inferring phylogeny. The last decades have seen several attempts to develop systems for automated identification, but developing them has required significant skills and effort, the systems have only been able to handle specific tasks, and they have rarely been able to compete in performance with human taxonomists. With recent advances in computer vision and machine learning, this situation is now rapidly changing. Here, we illustrate several practical perspectives on the current state of the art in automated identification using a case study, the discrimination of ten species of the flower chafer beetle genus Oxythyrea, and an imagined conversation between an insect taxonomist and an expert in artificial intelligence. Modern automated identification is based on training convolutional neural networks (CNNs) using extremely large training sets, so-called deep learning. In previous work, we have shown that the size of the training sets required for insect identification can be reduced significantly by using CNNs pretrained on general image classification tasks. Here we demonstrate that high-resolution images are not needed; adequate training sets for insect identification can be collected cheaply and rapidly using smartphone cameras with an attachable zoom lens. Furthermore, we demonstrate that easy-to-use off-the-shelf solutions are often sufficient even for challenging identification tasks that are impossible for humans, such as separating the sexes from the dorsal habitus in these beetles. Finally, we demonstrate recent techniques that can highlight the morphological features used by the trained systems in discriminating species. This can be helpful in developing a better understanding of how the species differ. We hope that these results will encourage insect systematists to explore some of the many exciting opportunities that modern AI tools offer.

Introduction

Andrew Ng, co-founder of Coursera, founder of Google Brain and former Baidu Chief Scientist, has argued that artificial intelligence (AI) is "the new electricity" [17]. In the early 20th century, electrification transformed transportation, agriculture and manufacturing, permanently changing human societies. A century later, we can see how AI is starting to have a similar, revolutionary impact on a range of societal sectors, from finance to the automotive industry. AI is used everywhere: from product-recommendation systems to search engines and virtual voice assistants. In fact, we encounter AI on a daily basis, often without even realizing it. Everyone knows that self-driving cars are powered by AI, but it is less commonly acknowledged that manual operation of a vehicle now typically relies on AI-powered navigation, path optimization and travel time estimation. The AI revolution is generating interest in a wide range of fields, and systematic entomology is not an exception.

An important reason for the current AI hype is the recent progress in computer vision made possible by the development of convolutional neural networks (CNNs) that learn complex tasks from very large training sets, that is, through deep learning. Deep learning of complex visual tasks by CNNs would not have been possible without significant recent improvements in learning algorithms, easy access to big data (>80% of the data on the web is visual, consisting of videos or images), and the availability of cheap and fast computation (particularly the development of graphical processing power driven by the gaming industry). Given enough training data, CNNs can learn on their own which features to use to discriminate between different categories of images (a task often referred to as image classification). Even more astonishingly, once these features are learned for one task, the CNN can use its feature-recognition skills in tackling a new but related task, thus eliminating the need to train a dedicated CNN from scratch for every new task [2,8,14,24,39]. For instance, a CNN that can distinguish between cats, dogs and bicycles can more easily learn to distinguish between different species of insects [33]. This is particularly important for visual identification tasks where it is difficult to gather sufficiently large image sets to train dedicated CNNs from scratch, such as most insect identification tasks.

Insect systematics is to a great extent based on visual data, and it is increasingly clear that we can use state-of-the-art artificial intelligence (AI) techniques like CNNs and deep learning to automate many routine identification tasks that are currently performed by humans, without compromising identification accuracy [15, 20, 33, 35]. The aim of this paper is not to show how CNNs perform compared to humans or traditional machine learning approaches on yet another image classification task. Instead, we focus on some questions that are frequently asked by members of the systematic entomology community, who wonder about the feasibility of applying these new AI techniques to their tasks of interest. We will address these questions in the form of an imagined discussion between a taxonomist and an AI expert, focused on a case study, namely the discrimination of ten flower chafer species of the beetle genus Oxythyrea. This pilot example allows us to illustrate several important points, but also to explore questions and ideas that are relevant to insect identification in particular. We hope that the discussion itself, as well as some of the new results we present here, will encourage more entomologists to enrich their toolbox with these easy-to-learn, instant-return AI techniques.

Off-the-shelf solutions often sufficient

Taxonomist (T): Developing automated systems for insect identification has proven challenging in the past. What makes these new methods different?
AI expert (A): Previous systems had to be engineered for a specific task. For instance, landmarks in the images had to be identified prior to analysis, or the systems had to be tailor-made to identify salient features for the task at hand. The new methods learn the task from scratch, including the identification of important discriminating features, entirely on their own. The only thing that is required is a large set of images with known identities, which can be used in training the systems.
Automated identification systems developed for large citizen-science projects, like iNaturalist, Merlin-Bird (for e-Bird) or Pl@ntNet, exemplify the new approach well. Over time, the citizen scientists engaged in these projects accumulate large sets of images, many of which are reliably identified. These huge sets of identified images provide ideal data for training AI models for automated taxon identification. The training of the CNN systems on these data sets requires considerable computational resources. However, the systems that are produced are optimized for mobile devices, with a small memory footprint and fast inference times, which comes at the expense of accuracy. This means that it would be possible to train larger, deeper and more complex CNN systems that would perform even better in terms of accuracy than these systems, which already have impressive performance.
T: So, if I wanted to apply these techniques to an insect identification task I am interested in, I would need a huge training set and vast computational resources?
A: No, I do not think so. These super systems are impressive, but developing them requires a fair amount of skills and resources. For most identification tasks, off-the-shelf solutions are quite sufficient. By an "off-the-shelf solution", I mean a CNN that has been trained on a generic image classification task. This CNN can be used for feature extraction, and the features then used in training a simpler identification system [2, 8, 24, 39] for a more specialized task, such as insect identification [33].
In this context, off-the-shelf means that the AI software is not developed for a single, specialized task. The software is ready to use without modifications on a different task, bypassing the need for machine learning expertise to tune the software's hyperparameters.

T: What if we have some (let us say 1%) misidentified specimens in the training data? Can we build models robust enough to deal with that?
A: CNNs are clearly robust enough to deal with low levels of label noise. For instance, ImageNet, a cornerstone benchmark of recent image classification systems, has 0.3% label noise [21], and similar noise levels (0.2%) were shown to have no adverse effect in some experiments on insect identification [33]. Of course, training sets on cryptic species may be considerably more noisy. In fact, in those cases we may not even be able to determine the noise level accurately after the training set has been collected because, unlike the cases cited above, visually inspecting images of cryptic species would probably not suffice to say whether a specimen was correctly identified.
Learning from noisy data is an active area of research, and studies have shown that it is sometimes possible to successfully combat up to 50% noise [1,12,13,18,31,34,36,41]. However, adopting a specialized method for noisy data is probably not necessary for most taxonomic identification tasks. After all, natural history collections are assembled by experts, and we might expect the percentage of misidentifications to be low enough in most cases that it causes no problems for standard CNNs.

T: How about cryptic species or other challenging tasks? Is it possible to train an AI algorithm to perform tasks that humans cannot do? If so, would an off-the-shelf solution really be sufficient then?
A: To my knowledge, nobody has tried to use CNNs to discriminate cryptic species. A dedicated CNN trained specifically for such a task would probably have a good chance of succeeding, but even an off-the-shelf approach might generate enough insights to decide whether automated identification is feasible. In fact, the biggest challenge here

Fig. 1. Many people think that the major challenge in developing AI systems, like systems for automated taxon identification, is coding (left). In reality, the major challenge is to assemble appropriate training data (right). (Image courtesy: Hackernoon)

is to collect enough data, not to select the best algorithm, unlike what many people think (Fig. 1). If I understand correctly, identifying challenging taxa requires careful examination of each specimen, possibly involving dissection of genitalia or even DNA sequencing. Therefore, assembling a sufficiently large set of correctly identified images to train a dedicated CNN will be tedious and demand more resources than are probably available in most cases. Clearly, we cannot rely on standard citizen science projects to accumulate appropriate training sets for cryptic species, because the contributors are unlikely to have either the required skills or the opportunity to examine the specimens in sufficient detail to arrive at a correct identification. To get a better understanding, why don't we pick a task you are interested in and try an off-the-shelf solution (see Methods for details) that requires no particular technical skills (machine learning or programming), and see what we can accomplish?

A case study

T: Let us develop an identification system for the species of the flower chafer beetle genus Oxythyrea. I have worked with this genus for a long time, so I know it well. It comprises ten species [6], which are quite similar in habitus (Fig. 2), although they are well defined by a set of morphological features, including the dorsal and ventral patterns of white spots, as well as the shape of the male aedeagus and endophallic sclerites [3, 4, 16, 22, 23]. Species are also genetically distinct (Vondráček, unpublished data). To separate them based on morphology, the dorsal habitus is slightly more helpful than the ventral habitus, but both are needed for reliable identification. In some cases, it is also necessary to have access to locality, collecting date and biotope information. For instance, some O. funesta, O. pantherina, and O. subcalva specimens from North Africa can only be identified reliably if you have access to information about the biotope where they were collected, especially if they are females so that you cannot examine the shape of the aedeagus. O. funesta is tied to humid, O. pantherina to semi-arid, and O. subcalva to arid biotopes. Specimens from Europe are easier to identify because O. funesta is the only species in this group that occurs there, as far as we know. Another challenging set of species is O. dulcis and O. abigail. They are very similar but geographically well separated, which facilitates identification. The morphology of the male genitalia is also

quite different, especially the shape of the endophallic sclerites.
A: As a baseline, let us try the dorsal view only, without location information or any other information that could facilitate identification. These constraints should make the task more challenging for the AI system. We can add information from the ventral habitus later, and see if it improves the performance of the algorithm.

T: How many images do we need for the training set? And how about image quality? Do we need high-resolution images?
A: A few dozen images per taxon is usually enough for a proof-of-concept system. Let us aim for that. High-resolution images are not required, and although they could help, my guess is that it would not have a significant impact if we chose low-resolution images instead. Due to the heavy computation involved in training, almost all state-of-the-art CNNs by default resize input images to 299x299 or 224x224 pixels (1000+ pixels is considered high resolution here). Regarding the number of images, you could in theory start with as few as one training example per category (one-shot learning) and still get a reasonably high-performing CNN model. With only one image per category, however, the model might struggle when test images deviate from the training images in ways that are not relevant for identification. It would also be difficult to gain insights into where, why and how the model fails with such a small training and testing set.
A: When building a CNN model for taxon identification, the performance will depend to a large extent on how the categories and the training examples are distributed in morphological space. Individuals within the same species can vary due to sex, color morph, life stage, etc. A small training set may not contain all the major variants, in which case performance will suffer. With more images you increase the likelihood that the relevant morphs are seen in the training phase. By increasing the sample size, it is also easier for the CNN model to form high-level abstractions, "concepts" if you wish, which define the type of variation one might expect to see among individuals belonging to the same species, and between individuals belonging to different species. For instance, specimens preserved or photographed with a certain technique, say against a blue or white background, may stand out from all other specimens, but this pattern may be irrelevant for separating the species. Certain morphs may also be more challenging to identify than others, and reliable discrimination may require special features specific to those morphs. An example would be males and females; it may well be that males have to be separated based on other features than females.

A: In Oxythyrea, can you distinguish the sexes easily? If so, which sex is more challenging to determine to species, males or females?
T: There are sex-specific differences in the white pattern on the ventral side, which make it easy to identify the sex, but not in all ten species. This works in only six of them. For the other four species, you need to look for the specific tiny groove on the abdomen, which is present only in males, but very difficult to find. Usually it is necessary to examine the abdomen from different angles using a stereomicroscope to be sure. The dorsal habitus in these ten species does not exhibit any sexual dimorphism; at least, sexual dimorphism has not been observed nor hypothesized in the literature.
A: Let us use this then to test whether an off-the-shelf system can do something that is impossible, or at least very difficult, for humans. It can serve as a model for the potential to identify cryptic species with off-the-shelf solutions. I am not saying that the task of separating the sexes of Oxythyrea from the dorsal habitus is super interesting in itself.
T: I am saying it is [laughing]. In fact, I do not know if this is even feasible, but let us try and see what happens. Imagine how helpful it could be if we succeeded. You have hundreds, maybe thousands of these beetles glued on cards across many collections

in European museums, and you are not able to unglue them because it is typically forbidden to do that. You have no idea whether you are looking at a male or female specimen. Nevertheless, knowing the sex may be crucial for many types of analyses, such as standard morphometric analyses. Furthermore, say you are allowed to unglue a few select specimens to examine the male genitalia, which includes preparation of the aedeagus. If your analyses could separate the sexes based on dorsal habitus alone, then only male specimens would have to be unglued and the females could be left untouched, instead of randomly ungluing specimens until you find a male. So you would not risk damaging the females during the process of ungluing, which can happen sometimes.
This all sounds very interesting. I will start collecting the images. I will go for high-resolution photos using my standard imaging setup.
A: [one week later...] Have you assembled the images yet?
T: No, I am sorry. I can only do around 50 specimens per day for both dorsal and ventral views, if I focus entirely on this task. Unfortunately, I have other things to do as well, so I will need another week or so. My top pace may not be sustainable in the long term anyway, if I were to consider a more ambitious project comprising more species. And the cost is substantial. I saw a study of a mass digitization project indicating that an average staff member can generate 40 images per day, resulting in a labor cost of 3.99 euros per image [32].
A: With that approach it will take ages and cost a fortune to digitize all the collections across the globe! I need some sample images to start experimenting, so I will just photograph some Oxythyrea specimens in a "quick and dirty" way with a smartphone and an attachable $2 zoom lens (Fig. 2B). We can compare the results obtained with these images to the results obtained with your high-resolution images.


Fig. 2. Datasets of ten Oxythyrea species. In the first row we show example images of dataset B, collected in a "quick and dirty" way using a smartphone and a cheap $2 attachable zoom lens. The remaining four rows are example images from dataset A, a standardized taxonomic imaging setting, where the first two rows are dorsal habitus and the last two are ventral habitus. Note that Oxythyrea beetles show sexual dimorphism only on their abdomen. More about specimen acquisition, identification and imaging can be found in Material and Methods.

Baseline

Fig. 3. Confusion matrices for different imaging techniques and image combinations. We show results when high-resolution images of dorsal (top left) and ventral habitus (bottom left) are used alone; when features from dorsal and ventral habitus are concatenated (top right); and when features are obtained from smartphone images of the dorsal habitus (bottom right). Rows show true categories and the number of images per species (N), and columns show predicted categories. Colors and numbers in the matrix indicate the proportion of predictions, normalized so that the sum in each row is equal to 1. (abi = O. abigail, alb = O. albopicta, cin = O. cinctella, dul = O. dulcis, fun = O. funesta, gro = O. gronbechi, noe = O. noemi, pan = O. pantherina, sub = O. subcalva, tri = O. tripolitana)

A: [another week later...] Now I have the results from the baseline study based on your high-resolution images (see Methods for details). I summarize the performance of the AI system in the form of a confusion matrix, which specifies how often different categories are confused by the identification system (Fig. 3).
T: Wow, these results already surpass all my expectations! Human experts (there is only a handful of us) would never have been able to match those numbers if the identification had to be based on the dorsal habitus only (Fig. 3, top left). The identification accuracy based on the ventral habitus would also be very difficult for human experts to beat.
It is interesting that the AI system finds the dorsal view (95%) better than the ventral (88%) for species identification; it is the same for human experts. However, it is surprising that the combination of both views does not boost the performance of the AI system. Actually, the accuracy got slightly worse for some species (e.g. O. pantherina and O. funesta). Could this be some sort of artifact?
A: It is true that the accuracy decreases for some species, but note that it increases for others. In total, two more images were correctly identified compared to the best of the single-view datasets, that is, a slight improvement. However, this improvement is in the range of "being lucky". One possible reason for the lack of a significant boost in identification accuracy is that we achieved high accuracy (about 95%) with the dorsal habitus alone. For further improvement, we would have needed somewhat better features or performance from the other habitus. However, the ventral habitus had a 3x higher error rate than the dorsal habitus. Combining images of different views would likely bring more gain when the difference in error rate is not as high as in this case, when you have similarly performing views, or when you have more limited training sets (see Fig. 4).

Fig. 4. Effect of sample size and habitus view on identification accuracy. Scatter plot of identification accuracies against sample sizes for all species from dorsal (blue) and ventral habitus (green), and when features from dorsal and ventral habitus are concatenated (red). Each point represents the average across a hundred random subsamples of the original dataset. Note that accuracies are higher for the dorsal habitus compared to the ventral, but combining them outperforms both single views.
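For readers who want to produce this kind of summary themselves, the following is a minimal sketch in Python with scikit-learn of how a row-normalized confusion matrix like those in Fig. 3 can be computed from predictions gathered across test folds. The label lists y_true and y_pred are hypothetical placeholders, not our actual results.

    # Minimal sketch: row-normalized confusion matrix, as plotted in Fig. 3.
    # y_true and y_pred are hypothetical placeholders for the true and
    # predicted species labels gathered across all cross-validation folds.
    import numpy as np
    from sklearn.metrics import confusion_matrix

    species = ["abi", "alb", "cin", "dul", "fun", "gro", "noe", "pan", "sub", "tri"]
    y_true = ["abi", "abi", "alb", "fun", "pan"]   # placeholders
    y_pred = ["abi", "fun", "alb", "fun", "sub"]   # placeholders

    counts = confusion_matrix(y_true, y_pred, labels=species)   # raw counts
    row_sums = counts.sum(axis=1, keepdims=True)
    normalized = counts / np.where(row_sums == 0, 1, row_sums)  # each non-empty row sums to 1

The normalized matrix can then be plotted as a heat map, with each row showing how the images of one true species were distributed over the predicted species.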

T: I noticed that O. abigail, O. tripolitana and O. funesta seem to be the most challenging for the AI system to identify in both views. The first two are likely challenging because we photographed so few specimens; these are rare species and there are not many specimens available in collections. The third, O. funesta, is indeed hard to separate from O. pantherina and O. subcalva, also for human experts. It is interesting that the model also often confuses these three species in both views. Four out of six erroneous predictions based on dorsal habitus mistake one of these species for one of the other two. The challenge of separating these species is even more apparent for the ventral habitus, where there are 15 such cases out of 17 total misidentifications.

Lab-based vs phone photos

A: Now, let us look at the results obtained with my smartphone images.
T: I must say I am surprised. The phone images are blurry and hardly usable for experts, but surprisingly they seem to be sufficient for a CNN model to match the performance we obtained with high-resolution images (94.6% vs 95.2% identification accuracy for dorsal habitus images). Moreover, the conclusions we reached based on high-resolution images still apply. O. abigail and O. tripolitana are the most challenging to identify, and species in the group consisting of O. funesta, O. pantherina and O. subcalva are more likely to be confused with each other than with the remaining species from the genus (seven misidentified images from these three species are confused with one of the other two species).
A: In my opinion, we should focus on the cost in terms of time and equipment needed to digitize the same number of specimens. One hour was enough to photograph all the specimens with my cell phone and a cheap, attachable zoom lens, while imaging the same specimens at high resolution took you a whole two weeks. Just photographing the dorsal view would have taken you one week. And your photographic equipment is a lot more expensive than my setup. Of course, high-resolution images are more useful than blurry smartphone images in many contexts, but if the purpose is to build a good automated identification system, then the smartphone approach is clearly worth considering, especially since it allows you to assemble much larger training sets.

Male vs female - a task unsolvable by humans

A: Let us now look at the task of separating males from females.
T: This is really interesting! I assure you, it is impossible for experts to identify the sex from the dorsal view, even when using geometric morphometrics (Vondráček, unpublished data using O. dulcis as a model organism). There is no evidence in the literature that anyone has noted or hypothesized sexual dimorphism in the dorsal habitus of any of the Oxythyrea species. Yet, according to your results (Fig. 5), such differences do exist, at least in some species (O. cinctella, O. dulcis and O. tripolitana). An expert can identify the sex from the ventral habitus mainly because males and females have different patterns of white marks in many species (Fig. 2). For the species without colour dimorphism, it is necessary to identify the presence or absence of a very thin groove in the medial part of the abdomen, going from the first segment to the fourth (similar in position to the four white spots in O. funesta; Fig. 2). The groove is clearly seen if you look at the abdomen from the side under appropriate illumination, but it is hardly visible in the ventral habitus images we collected. This is because we used a strict ventral view, and the lighting we used made the groove difficult to see because of the shiny midpart of the abdomen. Therefore, it makes sense that the model had difficulties in some cases; for instance, the identification of sex from ventral habitus images of O. gronbechi was close to a random guess.
Another interesting pattern that I noted in these results is the case of O. albopicta and O. noemi, where the model seems to be able to tell whether a specimen is a male, but for female specimens it predicts sex close to the rate of a random guess. This suggests to me some form of sex-limited polymorphism, that is, that females exhibit polymorphism but males do not.
It is quite remarkable that in some cases the machine was able to distinguish both sexes from the dorsal view alone, but in some similar cases it was not able to do so, e.g. O. cinctella vs. O. gronbechi, which are very closely related species with a similar type of sexual dimorphism. Overall, it is still very difficult for the machine to sort these specimens into males and females from only the dorsal view (e.g. O. pantherina, O. subcalva).
A: "Absence of evidence is not evidence of absence". If all it takes to tell the sex of a specimen is to turn it upside down, and the sexual dimorphism is not obvious from the dorsal side, then that is what most people will do: turn the specimen around. So maybe it is not that surprising that the sexual dimorphism in the dorsal habitus has been overlooked in this group. However, your morphometric study indicates that it is actually quite challenging to find what the dimorphic features are. I guess one of the most frustrating aspects of difficult discrimination tasks is that it can take an expert many, many hours to find pertinent differences, and there is no guarantee that they exist. Now we have a tool that could give us some insight as to whether it is worth looking for those differences, and with which specimens we should start. We can assume that adding data and/or training a dedicated model would provide more information on whether discriminating features exist and where they can be found.


Fig. 5. Confusion matrix for identification of species and their sex. Experiments are based on either dorsal (left) or ventral habitus (right) alone.

Beyond off-the-shelf solutions

T: What sort of improvements can we expect with systems that go beyond off-the-shelf approaches?
A: First of all, it is highly likely that we could improve the performance in terms of accuracy. In fact, I ran some experiments to verify that this is the case for the problems we looked at (see Fig. 6, Fig. S1 and Fig. S2). However, due to the improved accuracy, some of the insights we gained from our simpler experiments might be hidden. For instance, if only 1-2 specimens are misidentified, the confusion matrix no longer contains any useful information about which species are more challenging for the system to identify, as the misidentifications could be coincidental.
I want to stress that off-the-shelf systems are often sufficient to solve even challenging tasks that appear unsolvable to humans, as we have seen above. Few entomologists will have access to an AI expert; even machine learning labs currently struggle to attract relevant AI expertise. Fortunately, however, it is perfectly feasible for entomologists to gain the necessary expertise to run off-the-shelf systems on their own. Even if your goal is to develop a species identification tool that is as accurate as possible, first exploring an off-the-shelf approach might generate enough insight to help you decide whether the task is feasible and to what extent further investment of time and resources is justifiable.
A: Having said this, there are cases when a dedicated CNN is clearly preferable. For instance, with such an approach you are more likely to compute accurate activation maps, that is, maps that pinpoint the image areas that are used by the system to discriminate among categories.

T: So this technique can be used to identify which parts of Oxythyrea specimens are important in separating species? It would be interesting to see whether the machine is using the same, or at least similar, morphological features to those that taxonomists find useful.
A: Yes, we can prepare "heat maps", which show the areas in the image on which the machine is focusing. I generated such heat maps for the dorsal and ventral views from the high-resolution images, as well as the dorsal views from the cellphone pictures (Fig. 7).

Fig. 6. Off-the-shelf (blue) approach in comparison with dedicated CNN (orange). We trained dedicated CNNs for each of our experiments. Note that instead of accuracy we used error rate as our metric (the complement of accuracy, that is, 1 - accuracy). The error rate gives a better appreciation of how much improvement training a dedicated CNN brings. The error rate is reduced by 55% to 80% for the experiments we tried.

T: The heat maps for the high-resolution images are fascinating! It seems that the machine is focusing on the same morphological details as expert taxonomists. The most informative part in the dorsal habitus view is the pattern of white marks on the pronotum, and we can see that the machine is focusing there for almost all species. For the easiest species, it is the only part (e.g. O. albopicta, O. cinctella); apparently, this region contains enough information to distinguish them. When it gets more difficult, the machine starts using a wider region, including the distal part of the elytra (e.g. O. dulcis, O. noemi), to separate the species. For the most difficult species (e.g. O. abigail, O. funesta, O. subcalva), the machine looks at the whole elytral region. Interestingly, for O. tripolitana the machine mostly focuses on the part between the pronotum and scutellum, where there is an accumulation of shorter setae in this species. This has not been an important character for taxonomists, at least not until now. For O. gronbechi, the machine was checking not only the pronotum but the pin as well. This appears to be a case where the machine is misled by biases in the training set. All specimens of O. gronbechi were collected a long time ago by an entomologist who pinned them from a different angle than is usual. Of course, this means that the model we trained for this task might have problems with specimens of this species that are pinned in the standard way.
In the smartphone pictures, the machine is using much wider regions for species discrimination. Although these pictures are enough for the machine to provide accurate identifications, as we have seen, it apparently needs to check wider regions of the images to be sure of the correct identification.
For the ventral habitus, it seems that the machine is using the amount of setation, the presence and size of the white spots on the sides of the abdomen and probably also the shape of the mesometaventral processus. The latter is typical for the subfamily Cetoniinae, and it is known that the shape can be useful for identifying specimens of some groups in the subfamily to species. However, for Oxythyrea, no differences in the shape of the processus had been noted previously.


Fig. 7. Class Activation Maps for ten Oxythyrea species. This technique highlights category-specific regions of images by producing heat maps [40]. The selected specimens are exemplary representatives of the respective species.
Let us now look at the case where the machine is sorting specimens to sex as well as species (Fig. 8). For the dorsal view, the heat maps are very similar to those for the first task, where the machine was separating the specimens only to species, but it seems that sometimes the region is a little bit wider and more complicated. Usually the machine is focusing on several smaller parts in species that are more challenging to identify. Maybe to check the overall shape of the body and some specific structures?
If we check the results based on the ventral side, it gets more interesting. We can see that the machine is mainly focusing on the patterns of white marks, much more so than in the simple "species task". We can see this for the species with easily visible dimorphism, where it is checking exactly the median regions with the spots that are only present in males (e.g. O. funesta, O. tripolitana). For those species with less distinct dimorphism, it nevertheless checks the median region of the abdomen, probably in an attempt to identify the presence or absence of the median groove mentioned above. Although the groove is often difficult to detect with the naked eye in these images, it is quite visible in some cases, for instance in the male specimen of O. cinctella (Fig. 8).
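For readers curious how such heat maps can be generated, the following is a minimal sketch of Class Activation Maps [40] in Python/PyTorch. It assumes a network that ends in global average pooling followed by a single linear layer and uses torchvision-style attribute names (layer4, fc); these names, and the variables model, image and class_idx, are illustrative assumptions rather than our exact implementation.

    # Minimal sketch of Class Activation Maps (CAM) [40] for a network ending in
    # global average pooling and a single linear classifier. The attribute names
    # (layer4, fc) follow torchvision's ResNet convention and are assumptions.
    import torch
    import torch.nn.functional as F

    def class_activation_map(model, image, class_idx):
        model.eval()
        captured = {}

        def hook(module, inputs, output):           # grab the last conv feature maps
            captured["maps"] = output.detach()

        handle = model.layer4.register_forward_hook(hook)
        with torch.no_grad():
            model(image.unsqueeze(0))                # image: tensor of shape (3, H, W)
        handle.remove()

        fmaps = captured["maps"][0]                  # (C, h, w)
        weights = model.fc.weight[class_idx]         # (C,) classifier weights for the class
        cam = torch.einsum("c,chw->hw", weights, fmaps)
        cam = F.relu(cam)                            # keep only positive evidence
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
        return F.interpolate(cam[None, None], size=image.shape[1:],
                             mode="bilinear", align_corners=False)[0, 0]

The resulting map can then be overlaid on the original image, as in Figs. 7 and 8, to see which regions drove the prediction for a given category.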


Fig. 8. Class Activation Maps for the specimens from the species/sex task (only select species are depicted).

AI and the future of insect taxonomy

T: Are there no limits to what these AI systems can do? Will taxonomists be jobless in the future?
A: AI will certainly replace humans in many tasks, and many professions may disappear. At the same time, AI will create a plethora of new opportunities. There is a common belief in the AI community that today, or in the near future, AI can replace humans on tasks that take less than a second to observe, understand and execute. The same applies to all tasks that can be partitioned into several one-second tasks. This has been shown to be true already for many well-defined tasks of this type. An example would be the routine identification of common taxa, a task that taxonomists often find time-consuming and that takes time from more challenging problems, such as circumscribing and describing new taxa, or inferring phylogeny and studying evolution.
Today's AI does not think, it usually cannot explain its decisions and, above all, it is not creative. These are qualities that every scientist should have, and taxonomists are no exception. AI should be considered just a tool, which could help taxonomists to be faster and more efficient in gathering knowledge. It is up to scientists to postulate hypotheses, test them, draw conclusions and share what has been discovered with the rest of society.
I think we should turn the question around, and ask instead how taxonomists can benefit from AI. I hope the experiments that we described here show that modern AI is not all about developing automated identification systems. These tools can be quite helpful for insect taxonomists in many other ways. As we have seen, they can be used to help identify the features that best distinguish closely related species. They can also be used to find out whether it is possible to distinguish sibling species morphologically. Hopefully, our discussion will inspire many insect systematists to explore the full range of opportunities offered by modern AI tools.

Material and Methods

Specimen acquisition and identification. For this study we created two datasets. Dataset A consists of 434 adult specimens belonging to 10 species of the genus Oxythyrea. The specimens were obtained from the National Museum of Natural History in Prague, the National Museum of Natural History in Paris and the Zoologisches Forschungsmuseum Alexander Koenig in Bonn. We managed to obtain 50 specimens of all but two species (O. abigail and O. tripolitana), with both sexes equally represented whenever possible. All specimens were examined and identified by DV. The identifications were also verified by sequencing at least one specimen from each of the collecting events (same day, collector, plant etc.) represented by the photographed specimens.
Dataset B consists of 455 images of adult Oxythyrea specimens belonging to 9 species (O. gronbechi was not photographed). The specimens are typically about 10–15 mm in size. All specimens were examined and identified by DV and deposited at the National Museum of Natural History in Prague and the National Museum of Natural History in Paris.

Specimen imaging. Specimens in dataset A were photographed in a highly standardized setting from both the dorsal and the ventral view using a high-resolution digital camera and additional illumination. Specimens were removed from the drawers, the labels beneath them were removed, the specimens were cleaned of dust and finally placed above the same uniform background, so that the variation between images was kept to a minimum. All images have the same aspect ratio, that is, the proportional relationship between width and height is the same for all images in Dataset A. The imaging took approximately two weeks.
For dataset B we used a smartphone and a cheap $2 attachable zoom lens without additional illumination. Each photo was taken by hovering over the specimen, which was not moved from the drawer. It took one person one hour to photograph all 455 specimens using this technique.

Methods

Off-the-shelf system

As our off-the-shelf system, we used VGG16 [25] pretrained on ImageNet [21] for fixed feature extraction and a linear SVM [7] for classification. This approach is often better than traditional approaches with hand-crafted features, and it is widely used in machine learning for image classification tasks with limited training data [2, 8, 24, 39]. No preprocessing is performed on the images; we used images of size 224x224 pixels. Following [33], features from all five convolutional blocks were first pooled using the global average, then concatenated and normalized using the signed square-root method. We used a one-vs-all support vector classifier (SVC), which means that a separate classifier is trained for every category. This classifier approach is robust to overfitting, high-dimensional feature vectors and small datasets, and it works well with imbalanced datasets. We used the default settings implemented in the Python package scikit-learn [19].
In the Supplementary Information, we give instructions on how insect systematists who are new to AI tools can get this system up and running and apply it to their favorite tasks in a few simple steps.
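As an illustration of this pipeline, the sketch below extracts globally averaged "deep features" from the five convolutional blocks of an ImageNet-pretrained VGG16, applies the signed square root, and trains a linear one-vs-rest support vector classifier. It is a minimal sketch under the assumption that Keras and scikit-learn are used; the variables train_images, train_labels and test_images are illustrative placeholders rather than our exact code.

    # Minimal sketch of the off-the-shelf pipeline: ImageNet-pretrained VGG16 as a
    # fixed feature extractor, global average pooling of the five convolutional
    # blocks, signed square-root normalization and a linear one-vs-rest SVC.
    import numpy as np
    from tensorflow.keras.applications import VGG16
    from tensorflow.keras.applications.vgg16 import preprocess_input
    from tensorflow.keras.layers import GlobalAveragePooling2D
    from tensorflow.keras.models import Model
    from sklearn.svm import LinearSVC

    base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
    block_outputs = [base.get_layer("block%d_pool" % i).output for i in range(1, 6)]
    extractor = Model(base.input, [GlobalAveragePooling2D()(o) for o in block_outputs])

    def deep_features(images):
        # images: array of shape (N, 224, 224, 3), RGB values in 0-255
        pooled = extractor.predict(preprocess_input(images.astype("float32")))
        feats = np.concatenate(pooled, axis=1)          # concatenate the five blocks
        return np.sign(feats) * np.sqrt(np.abs(feats))  # signed square root

    # classifier = LinearSVC().fit(deep_features(train_images), train_labels)
    # predicted = classifier.predict(deep_features(test_images))

Because the CNN weights are never updated, feature extraction only has to be run once per image, after which training the linear classifier takes seconds even on a laptop.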

More advanced system

For the more advanced system, we used a pretrained SE-ResNeXt101-32x4d [10], for which the last layer was replaced by a randomly initialized layer with N outputs (the number of categories). This architecture is based on residual modules (ResNets [9]), improved by adding "cardinality" [37] and squeeze-and-excitation blocks [10]. We used images of size 512x352 pixels. The learning rate (lr) was changed using the one-cycle policy, following [26–28], with a maximum lr set to 0.0006. We trained our model for 12 epochs using batches of 12 and back-propagating accumulated gradients every two iterations. We chose the Adam [11] optimization strategy with a momentum of 0.9. We adopted label smoothing regularization as our classification loss function [30]. In addition, we inserted a dropout [29] layer (0.5) before the last layer as additional regularization. During training we used standard augmentation (zooming, flipping, shifting, brightness, lighting and contrast) and mixup [38] for the experiments on species and sex. To highlight category-specific regions of images, we used a technique for producing heat maps named Class Activation Maps (CAM) [40].
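A minimal sketch of this training recipe in Python/PyTorch is shown below, covering the replaced classifier head with dropout, Adam, the one-cycle learning-rate schedule, label smoothing and gradient accumulation over two iterations; data augmentation and mixup are omitted for brevity. The backbone is assumed to expose a fc attribute, and train_loader is a placeholder, so this is an illustration of the recipe rather than our exact code.

    # Minimal sketch of the dedicated-CNN training recipe described above.
    # `model` is a pretrained backbone with a `fc` classifier attribute (an
    # assumption); `train_loader` yields (images, labels) batches of size 12.
    import torch
    import torch.nn as nn

    def train(model, train_loader, num_classes, epochs=12, max_lr=6e-4, accum=2):
        in_feats = model.fc.in_features
        model.fc = nn.Sequential(nn.Dropout(0.5),             # extra regularization
                                 nn.Linear(in_feats, num_classes))

        optimizer = torch.optim.Adam(model.parameters(), lr=max_lr, betas=(0.9, 0.999))
        scheduler = torch.optim.lr_scheduler.OneCycleLR(
            optimizer, max_lr=max_lr, epochs=epochs,
            steps_per_epoch=len(train_loader) // accum)
        criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # label smoothing loss

        model.train()
        for _ in range(epochs):
            optimizer.zero_grad()
            for step, (images, labels) in enumerate(train_loader, start=1):
                loss = criterion(model(images), labels) / accum
                loss.backward()                                # accumulate gradients
                if step % accum == 0:                          # update every two iterations
                    optimizer.step()
                    optimizer.zero_grad()
                    scheduler.step()
        return model

Gradient accumulation lets a small physical batch of 12 behave like a larger effective batch, which is useful when 512x352-pixel images do not fit on the GPU in bigger batches.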

Cross validation and metrics

Cross validation. We used tenfold cross validation. We ensured that the number of examples of each species in each fold was proportional to its representation in the whole dataset. Each fold was tested on a model trained on the remaining nine folds. Finally, we averaged and reported the accuracy across all folds. A pseudo-random number generator was used, starting with the same random state (555) in all experiments [5].
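A minimal sketch of this stratified tenfold procedure with scikit-learn is given below; features, labels and make_classifier are placeholders for the deep features, the species labels and a function returning a fresh classifier.

    # Minimal sketch of stratified tenfold cross validation with a fixed random
    # state (555) so that the folds are reproducible.
    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    def cross_validate(features, labels, make_classifier, seed=555):
        skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
        fold_accuracies = []
        for train_idx, test_idx in skf.split(features, labels):
            clf = make_classifier()
            clf.fit(features[train_idx], labels[train_idx])
            fold_accuracies.append(clf.score(features[test_idx], labels[test_idx]))
        return float(np.mean(fold_accuracies))   # average accuracy across folds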

Metrics. As a measure of performance in all our experiments, we used accuracy or recall. Both represent the fraction of images that are correctly identified, either overall (accuracy) or per category (recall).

Acknowledgements

This work was supported by the European Union's Horizon 2020 research and innovation program under the Marie Sklodowska-Curie grant agreement No. 642241 to MV, by the Ministry of Culture of the Czech Republic (DKRVO 2019-2023/5.I.a, National Museum, 00023272), Charles University in Prague (SVV 260 434/2019) and by the European Community Research Infrastructure Action under the Synthesys Project [FR-TAF-5869 (http://www.synthesys.info/)] to DV.

Author contributions statement

MV, DV and FR conceived the experiments; MV carried out the experiments and analysed the results; DV identified the specimens. DV photographed the specimens for Dataset A and MV for Dataset B. MV wrote the first draft. All authors reviewed the manuscript and contributed to the final version.

Additional information

Competing financial interests. The authors declare that they have no competing interests.

Supporting Information

In this section we provide complementary figures that show the results obtained after training a dedicated CNN, a more advanced system. The details about the training procedure and the techniques we used are provided in Material and Methods in the main section of the paper. However, first we provide a more detailed explanation of how to use an off-the-shelf solution for a task of your own interest.

Using an off-the-shelf system

An off-the-shelf AI system is typically provided as a Python script or a set of such scripts, which you run on your local computer. Python is a modern programming language, which is popular for many computational tasks. If you have never used Python before, we recommend that you start by installing Anaconda (https://www.anaconda.com/), which comes with Python and contains many convenient tools for running Python scripts and programs. To download Anaconda, follow the instructions on www.anaconda.com. In short, after you visit the Anaconda download website, it will automatically recognize your local operating system (Windows, macOS, Linux) and provide you with a few installation options. In case your computer has a 32-bit architecture (which is rare nowadays), you should choose one of the two 32-bit installations; if you have a computer with a 64-bit architecture, all of the options will work for you. Do not get confused by the Python version in the installation name, as this relates only to whether you want Python 2 or Python 3 to be your default Python programming environment. Regardless of which version you choose, you can create and run environments with both Python versions.
To obtain your own copy of the scripts needed to run the off-the-shelf system, you can clone them from the GitHub repository (available after publication) using git (see the git documentation for installation instructions) and the following command: git clone https://github.com/valanm/off_the_shelf_example. An alternative would be to simply download all the scripts as a single zip file and then extract (unzip) them to your preferred location.
In programming, it is often the case that fully functional code for operations that we need in a given task has already been written and made available by others, so that we do not need to write it ourselves from scratch. In the Python community, such software is distributed as packages. Our off-the-shelf script makes use of some of these packages; they are called "dependencies". To fulfill these dependencies you should open your command line, navigate to your project directory (the directory where you extracted the scripts needed to run the off-the-shelf system; it includes a file named "environments.yml") and then copy and run the following line in your command line environment: conda env create --name envname --file=environments.yml, where "envname" is any name you choose for the environment in which you want to run your experiment.
Now place your images into a new folder called "data" inside the folder where you plan to run your experiment. The data folder should be structured into subfolders, each subfolder carrying the name of the category of the images placed in it (an example layout is shown below). The image file names do not matter, but the file format does, as the script looks for files with the following extensions (and expects them to be formatted accordingly): PNG, JPG, BMP, PPM or TIF.
Now you are ready to run your off-the-shelf system on your task. Do this by activating your environment (conda activate envname) and giving the command python experiment.py. The script will then loop through the data, process it, shuffle it, train the AI model and store the results (predictions for all data and accuracy/error rate). To exit the environment, give the command conda deactivate.
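For example, a data folder for the species task might be laid out as follows; the folder and file names here are purely illustrative, and only the subfolder names, which serve as the category labels, matter:

    data/
        O_abigail/
            specimen_001.jpg
            specimen_002.jpg
        O_albopicta/
            specimen_001.jpg
            specimen_002.jpg
        ...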

Complementary figures

Here we show the confusion matrices for the identification of species (Fig. S1) and the identification of species and sex (Fig. S2) when a dedicated CNN is trained for the tasks discussed in the main paper.


Fig. S1. Confusion matrices for the identification of species from different views and imaging settings when a dedicated CNN is trained. We show results when high-resolution images of dorsal (top left) and ventral habitus (bottom left) are used alone; when dorsal and ventral habitus images are combined (top right); and when features are obtained from smartphone images of the dorsal habitus (bottom right). Rows show true categories and the number of images per species (N), and columns show predicted categories. Colors and numbers in the matrix indicate the proportion of predictions, normalized so that the sum in each row is equal to 1. (abi = O. abigail, alb = O. albopicta, cin = O. cinctella, dul = O. dulcis, fun = O. funesta, gro = O. gronbechi, noe = O. noemi, pan = O. pantherina, sub = O. subcalva, tri = O. tripolitana)


Fig. S2. Confusion matrices for the identification of species and sex when a dedicated CNN is trained. Experiments are based on either dorsal (left) or ventral habitus (right) images alone.

References

1. S. Azadi, J. Feng, S. Jegelka, and T. Darrell. Auxiliary image regularization for deep cnns with noisy labels. arXiv preprint arXiv:1511.07069, 2015. 2. H. Azizpour, A. S. Razavian, J. Sullivan, A. Maki, and S. Carlsson. Factors of transferability for a generic ConvNet representation. IEEE Trans. Pattern Anal. Mach. Intell., 38(9):1790–1802, Sept. 2016.

3. J. Baraud. Coléoptères Scarabaeoidea: faune du Nord de l'Afrique du Maroc au Sinaï. Encyclopédie Entomologique. Éditions Lechevalier, 1985.
4. J. Baraud. Coléoptères Scarabaeoidea d'Europe, volume 78. Fédération Française des Sociétés de Sciences Naturelles & Société Linnéenne de Lyon, Faune de France, 1992.

5. E. Barker, W. Barker, W. Burr, W. Polk, and M. Smid. Recommendation for key management part 1: General (revision 3). NIST Spec. Pub., 800(57):1–147, 2012.

6. A. Bezděk. Subfamily Cetoniinae. In: Löbl, I. & Löbl, D. (Eds.), Catalogue of Palaearctic Coleoptera. Scarabaeoidea—Scirtoidea—Dascilloidea—Buprestoidea—Byrrhoidea. Revised and Updated Edition, 3:367–412, 2016.

7. C. Cortes and V. Vapnik. Support-vector networks. Mach. Learn., 20(3):273–297, 1 Sept. 1995. 8. J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning, pages 647–655. jmlr.org, 27 Jan. 2014.

9. K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

10. J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

11. D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

12. K.-H. Lee, X. He, L. Zhang, and L. Yang. Cleannet: Transfer learning for scalable image classifier training with label noise. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5447–5456, 2018.

13. Y. Li, J. Yang, Y. Song, L. Cao, J. Luo, and L.-J. Li. Learning from noisy labels with distillation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1910–1918, 2017.

14. Z. Li and D. Hoiem. Learning without forgetting. IEEE Transactions on Pattern Analysis & Machine Intelligence, (12):2935–2947, 2018.

15. A. C. R. Marques, M. M. Raimundo, E. M. B. Cavalheiro, L. F. Salles, C. Lyra, and F. J. Von Zuben. Ant genera identification using an ensemble of convolutional neural networks. PloS one, 13(1):e0192011, 2018.

16. R. Mikšić. Monographie der Cetoniinae der Paläarktischen und Orientalischen Region. Coleoptera, Lamellicornia. Band 3. Systematischer Teil. Cetoniini I Teil, volume 14. Forstinstitut in Sarajevo, 1982.

17. A. Ng. Why AI is the new electricity. Nikkei Asian Review Online, 27 Oct. 2016. 18. G. Patrini, A. Rozza, A. Krishna Menon, R. Nock, and L. Qu. Making deep neural networks robust to label noise: A loss correction approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1944–1952, 2017.

19. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and É. Duchesnay. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res., 12(Oct):2825–2830, 2011.

20. E. Rodner, M. Simon, G. Brehm, S. Pietsch, J. W. Wägele, and J. Denzler. Fine-grained recognition datasets for biodiversity analysis. arXiv preprint arXiv:1507.00913, 2015.

21. O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis., 115(3):211–252, 1 Dec. 2015.

22. G. Sabatinelli. Le Oxythyrea Muls. del Mediterraneo: studi morfologici sistematici (Coleoptera, Scarabaeoidea). Fragmenta Entomologica, 16:45–60, 1981.

23. G. Sabatinelli. Studi sul genere Oxythyrea Muls: note sulle specie del gruppo cinctella (Schaum)(Scarabaeidae Cetoniimae). Bollettino della Societa Entomolog- ica Italiana, 116:102–104, 1984.

20/22 24. A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 806–813. cv-foundation.org, 2014. 25. K. Simonyan and A. Zisserman. Very deep convolutional networks for Large-Scale image recognition. arXiv:, 4 Sept. 2014. 26. L. N. Smith. No more pesky learning rate guessing games. CoRR, abs/1506.01186, 2015.

27. L. N. Smith. A disciplined approach to neural network hyper-parameters: Part 1–learning rate, batch size, momentum, and weight decay. arXiv preprint arXiv:1803.09820, 2018. 28. L. N. Smith and N. Topin. Super-convergence: Very fast training of residual networks using large learning rates. CoRR, abs/1708.07120, 2017.

29. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.

30. C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016. 31. D. Tanaka, D. Ikami, T. Yamasaki, and K. Aizawa. Joint optimization framework for learning with noisy labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5552–5560, 2018.

32. R. Tegelberg, J. Haapala, T. Mononen, M. Pajari, and H. Saarenmaa. The development of a digitising service centre for natural history collections. ZooKeys, (209):75, 2012. 33. M. Valan, K. Makonyi, A. Maki, D. Vondr´aˇcek, and F. Ronquist. Automated taxonomic identification of insects with expert-level accuracy using effective feature transfer from convolutional networks. Systematic biology, 2019.

34. A. Veit, N. Alldrin, G. Chechik, I. Krasin, A. Gupta, and S. Belongie. Learning from noisy large-scale datasets with minimal supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 839–847, 2017.

35. X.-S. Wei, J.-H. Luo, J. Wu, and Z.-H. Zhou. Selective convolutional descrip- tor aggregation for fine-grained image retrieval. IEEE Transactions on Image Processing, 26(6):2868–2881, 2017. 36. T. Xiao, T. Xia, Y. Yang, C. Huang, and X. Wang. Learning from massive noisy labeled data for image classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2691–2699, 2015.

37. S. Xie, R. Girshick, P. Doll´ar, Z. Tu, and K. He. Aggregated residual transforma- tions for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1492–1500, 2017.

38. H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.

21/22 39. L. Zheng, Y. Zhao, S. Wang, J. Wang, and Q. Tian. Good practice in CNN feature transfer. arXiv:, 1 Apr. 2016.

40. B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2921–2929, 2016.

41. B. Zhuang, L. Liu, Y. Li, C. Shen, and I. Reid. Attend in groups: a weakly- supervised deep learning framework for learning from web data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1878–1887, 2017.

III

AI-Phy: improving automated image-based identification of biological organisms using phylogenetic information

Miroslav Valan a,b,c,1, Johan A. A. Nylander b, and Fredrik Ronquist b

a Savantic AB, Rosenlundsgatan 52, 118 63 Stockholm, Sweden; b Department of Bioinformatics and Genetics, Swedish Museum of Natural History, Frescativägen 40, 114 18 Stockholm, Sweden; c Department of Zoology, Stockholm University, Universitetsvägen 10, 114 18 Stockholm, Sweden

This manuscript was compiled on December 8, 2020

Computer vision has made dramatic progress in recent years on image classification tasks. There are now numerous success stories, where supervised learning of convolutional neural networks using large training sets has resulted in impressive classification performance. Nevertheless, these systems occasionally make grave errors that humans would never make. One reason for this may be that the training algorithms are over-simplified. They typically use binary scores, 1 for the correct target and 0 otherwise. Recently it has been shown that label smoothing, that is, distributing a small portion of the score equally on all "wrong" categories, can improve performance in some cases. Both of these methods assume that image categories are equally distant, but we often have background knowledge about the similarity relations between them, which can be useful in developing more sophisticated methods. Here, we explore the utility of phylogenetic information in training and evaluating CNNs for identification of biological species. Specifically, we propose label smoothing based on taxonomic information (taxonomic label smoothing) or distances between species in a reference phylogeny (phylogenetic label smoothing). Using two empirical examples (38,000 images of 83 species of snakes, and 2,600 images of 153 species of butterflies and moths), we show that networks trained with phylogenetic information perform at least as well on common performance metrics as standard systems, while making errors that are more acceptable to humans and less wrong in an objective biological sense. We argue that this is likely to make the systems more robust and better at placing unseen categories or unusual images correctly. We demonstrate the potential power of the approach by showing that a phylogenetically informed system is better at identifying venomous snakes than a system trained using standard methods.

Fig. 1. Networks optimized toward hard (one-hot) targets or soft labels with uniform label smoothing (3) assume all categories are equally distant from each other and hence all errors are penalized equally. We propose to smooth the targets using hierarchical information (taxonomy or phylogeny) so that networks are penalized more for erroneous identifications that are farther away from the correct category.

Significance Statement

The development of AI systems for image classification has so far mainly involved training algorithms and performance criteria that treat imaged object categories as equidistant from each other. However, we often know something about the similarity relations between the categories, and one might expect that this information could be useful. Here, we show that phylogenetic information describing evolutionary relationships among organisms can indeed be used to improve both training and evaluation of AI systems for image-based identification of biological species.

MV conceived the idea, designed the experiments, obtained image data sets, wrote scripts, performed analyses, interpreted results and wrote the first draft. FR continuously provided feedback and supervision at all stages. JN obtained phylogenetic trees. All authors contributed to the final version of the manuscript.

No competing interests to declare.

1 E-mail: [email protected]

Accurate identification of organisms is important in many fields, such as ecology, conservation, pest control or in the monitoring of invasive species or disease vectors. It is also costly, as it is tedious, time-consuming and requires human experts that take a long time to train. Recently, a lot of progress has been made toward automated identification systems as a result of the rise of deep learning and the introduction of convolutional neural networks (CNNs) (1, 2). Given example images as training inputs and species names as targets, these networks can learn on their own what features in images are important in order to minimize the error or loss, measured as the difference between predicted and target values (species names in this particular example). However, these networks occasionally make errors that no human would do. For instance, a network trained to distinguish between three categories, a dog and two Lepidoptera species, may confuse some dog images with Lepidoptera and vice versa, which humans would hardly do.

One potential reason for this could be the fact that CNNs are optimized toward targets without incorporating any information about the similarity relations among categories. The targets are often one-hot encoded, which means that if there are N possible unique categories, we would encode target i as a one-dimensional vector of length N (1xN) of 0s, with element i flipped to 1. During training, we would then optimize a network to minimize the difference between the predicted distribution and the one-hot target distribution, assumed to be the "true" distribution.

Fig. 2. Sample images of each of the two data sets. For each data set, we show four different image categories. The example images illustrate some of the taxon diversity and variation in image capture settings within each category.

Recently, a technique named label smoothing (LS) (3) was found to improve generalization by preventing the network from becoming overconfident. LS uses 'soft targets', which are a weighted average of the 'hard targets' (the one-hot distributions) and the uniform distribution over labels.

In both one-hot encoding and LS, the target distributions of any pair of categories differ in exactly two positions. In other words, the training algorithm assumes that all categories are equally distant from each other. Therefore, the misidentifications of a Lepidoptera species as another Lepidoptera species or as a dog are considered equally wrong. We know from real life scenarios that this is incorrect. Some inner relationships among categories typically exist, such that not all identification errors should be treated equally (see Fig. 1). If the categories are biological species, then there is a natural reference system that expresses the relations between them, namely their phylogeny (4). If we could leverage this information while training automated identification systems, we might expect the systems to do better. In particular, we might expect them to make errors that are more acceptable to humans and that are less serious from a biological perspective. The main purpose of this paper is to introduce such phylogenetically informed label smoothing.

One possible way to include phylogenetic information during the training of CNNs would be to incorporate one or more higher taxonomic ranks, such as genus, tribe or family, as additional one-hot encoded targets, replacing the uniform smoothing distribution in LS. This might be referred to as taxonomic label smoothing (TLS). TLS provides the network with more information, but it also introduces the incorrect assumption that all higher taxonomic ranks within the same group are equally distant from each other. Even the notion that the units at some taxonomic rank, say genus, should be comparable to each other cannot be taken for granted. There are simply no universally accepted objective criteria which define what makes a taxonomic group. A well studied group, such as birds, is typically categorized into finer units than a poorly studied group, such as some group of wasps or flies. This can be confirmed by looking at genetic information: the genetic variation within an insect genus may be many times as large as that within a bird genus. Even within a restricted group of organisms, the criteria used to recognize a group at a particular taxonomic level may vary across lineages.

The phylogeny itself provides a more powerful source of information about how species are related to each other. It directly captures information about the relationships among species, and it does not make false assumptions that groups at the same taxonomic rank level are comparable or equally distant from each other. We refer to methods that rely directly on the phylogeny as phylogenetic label smoothing (PLS). In PLS, we match the categories in the learning task to leaves in a reference phylogeny, and define phylogenetically informed distributions for target encoding based either on the anagenetic distance (total branch length) or the cladogenetic distance (number of branches) between the leaves/categories. We then optimize networks toward a weighted average of one-hot encoded and phylogenetically informed target distributions in a manner similar to LS.

To explore the effectiveness of TLS and PLS, we experimented with two data sets (Fig. 2). The first data set (D1) is more challenging; it consists of 37,774 outdoor images of 83 species of snakes from 43 genera and 10 families. The second data set (D2) is smaller but easier to learn from; it contains 2,593 images of preserved Lepidoptera specimens of 153 species belonging to 99 genera and 10 families. Our results show that TLS- and PLS-trained systems perform on par with or better than comparable standard systems when their ability to discriminate species is evaluated using common metrics that do not take the nature of the errors into account. However, they perform considerably better than standard systems when errors are evaluated against how serious they are in a phylogenetic perspective. We demonstrate the importance of the latter by showing that the phylogenetically informed systems trained on snake images do considerably better than comparable standard systems in predicting if an image portrays a venomous species. We predict that phylogenetically informed systems will be superior to standard systems in training efficiency, and in classifying images with unexpected features or of new species that have not been encountered previously.

Results

In our experiments with TLS and PLS, we tried different ratios between hierarchical information and hard targets. Specifically, we set the smoothing value (the proportion of the mix-in information) to β = {0.025, 0.05, 0.1, 0.2, 0.4}, and compared the performance of the resulting systems with two benchmarks: a) one-hot encoding and b) LS with β = 0.1 (3). Using standard identification performance metrics, one-hot encoding gave better results than LS for D1 on accuracy and top3, but slightly worse on f1 scores (Fig. 3, Table SI1). TLS and PLS with small smoothing values (0.025 ≤ β ≤ 0.1) slightly outperformed both benchmarks, except for PLS with anagenetic distances, which performed on par with the benchmarks.

On data set D2, the LS benchmark consistently outperformed the hard-target benchmark. Systems trained using phylogenetically informed targets (TLS or PLS) consistently outperformed systems trained using hard targets.

Fig. 3. Impact of using phylogenetically informed target distributions on identification performance, assessed using common evaluation metrics. We used two data sets: D1 - outdoor images of snakes (top) and D2 - images of preserved Lepidoptera specimens (bottom). Phylogenetic information (taxonomic ranks, anagenetic and cladogenetic distances) was introduced using label smoothing based on a weighted average of one-hot target distributions and distributions based on the phylogenetic information. We experimented with different smoothing values, β; specifically we set β = {0.025, 0.05, 0.1, 0.2, 0.4}. The dashed lines indicate the baseline results (mean) obtained when networks are trained without hierarchical information, using either hard (one-hot) targets (blue) or 'flat' label smoothing (3) (orange). For every experiment we ran 10 trials with different random initialization and evaluated them using common classification metrics (accuracy, top3, f1 score macro). Exactly the same training parameters were used in all experiments, except for the targets toward which the models were optimized.

When compared to the second benchmark, LS, the inclusion of hierarchical information (TLS or PLS) with intermediate smoothing values (0.05 ≤ β ≤ 0.4) gave similar accuracy and f1 scores, while it often gave better top3.

When evaluated against more sophisticated performance metrics, such as the accuracy of identifications at the genus or family level, or the accuracy of predicting an important biological trait (a snake being venomous), the phylogenetically informed systems performed better than both benchmarks across a range of different smoothing values (Figure 4).

During these experiments, we also noted that phylogenetically informed targets resulted in systems that converged more rapidly, that is, the CNNs learned the classification task more quickly with TLS and PLS targets than if they were trained using standard targets.

Discussion

Although CNNs and deep learning have achieved impressive performance on many image classification tasks, these systems still make occasional errors that are unacceptable in many practical applications, errors that no human would do. One potential reason for this is that the target distributions commonly used during supervised learning of CNNs fail to penalize errors depending on how serious they are. Intuitively, one might expect differential scoring of errors in the target distributions to result in faster learning, and in systems that make fewer errors, particularly serious errors. In many practical situations, a system that makes fewer serious errors might be better than one that is more often exactly right but is at increased risk of making grave errors.

Our proposed solution is based on a generalization of the recently introduced LS technique (3). Although we develop it here for including hierarchical information about the phylogenetic relationships among biological organisms, the same ideas could be used to introduce any distance-based measure of the similarity relations among the categories of interest during CNN training. An important note is that our approach differs from LS only in using different targets. Therefore, the computational cost of our approach is the same as for LS or standard one-hot targets, and it can be used in all contexts where these standard targets can be used.

Fig. 4. Impact of using phylogenetically informed target distributions on identification performance, assessed using custom evaluation metrics. The same experiments as in Figure 3 are shown, evaluated using custom metrics that take the nature of the errors into account. The first two metrics simply measure the accuracy of identifications at the genus and family levels. The third metric measures the accuracy of predicting a biological trait, in this case whether an imaged snake is venomous (only applied to D1).

Specifically, the results we present here indicate that encoding phylogenetic information into the targets toward which we optimize CNN models can improve the performance of automated systems for the taxonomic identification of biological organisms. When the identification performance is assessed using common metrics that do not differentiate among errors, the systems trained using phylogenetic information perform on par with or better than systems trained using standard targets (3). When the nature of the errors is taken into account, the phylogenetically informed systems typically perform better, particularly when the label smoothing value is optimized (4).

We note that the systems trained using taxonomic information (TLS) tended to perform slightly better than the systems trained using phylogenetic distances (PLS) in our experiments, despite the expected superiority of the latter. However, the custom metrics we used here were biased toward TLS in that they were based on taxonomic information (genus or family). More sophisticated measures of the seriousness of errors from a biological standpoint, for instance based on phylogenetic distance, would presumably yield a different outcome. We also note that the biological trait we examined here (whether or not a snake is venomous) appears to be strongly determined by the family to which a species belongs. This may explain why, in our experiments, the systems trained using TLS targets, which pay particular attention to the family level, are more successful in predicting this trait than systems trained using the more sophisticated PLS targets. Across a broader spectrum of biological traits, we expect the average performance of PLS-trained systems to be better than that of TLS-trained systems.

We found that the optimal smoothing value for TLS and PLS varied across data sets. For D1, smaller ratios (0.025-0.1) yielded better results compared to D2, where larger values were better (0.05-0.4). This difference appears to be related to the number of categories N in the data set (D1 has 83 and D2 has 153 categories). We distribute the total smoothing value β across all categories. Therefore, when there are fewer categories, the smoothing value for each category will be higher on average, and the effect on training will be larger compared to a data set with more categories. It may well be possible to develop an understanding of how to control for this effect, such that it is possible to standardize phylogenetic smoothing values across data sets with different numbers of categories.

In our experiments, the largest smoothing value (0.7) resulted in degraded performance of both TLS and PLS on all standard metrics when compared to the benchmarks or to smaller smoothing values (results not shown). The reason for this is apparently that networks trained under such extreme regimes primarily minimize the phylogenetic error (70% of the target weight), and this partly conflicts with the aim of correctly identifying the species (encoded by the hard target, 30% of the target weight).

Several important topics remain to be explored. Systems trained using phylogenetically informed targets might be expected to learn classification tasks faster than systems trained using standard targets. We noted a tendency for this to be the case in our experiments, but this needs to be more thoroughly explored using experiments specifically designed to look at training efficiency. Because phylogenetically informed systems tend to make fewer serious identification mistakes, one might expect them to perform better than standard systems on difficult tasks such as the classification of images belonging to novel categories that were not included in the training set, images displaying organisms with unusual features like mutants or color morphs, or images with distracting or unusual background noise. These topics all remain to be explored in future studies.

In summary, we propose a more sophisticated training regimen for CNNs, which uses differential weighting of errors depending on the similarity relations between target categories. Our approach should result in systems that make errors that are more acceptable to humans and that are less serious in some objective sense, which can be essential in many contexts. We show how this approach can be used to improve CNNs for automated identification of biological species by taking advantage of similarity relations defined by a reference phylogeny. We hope that these ideas could lead to improvement in the training of CNNs on image classification tasks in many other domains where it is possible to differentiate errors in a similar way.

Materials and Methods

Data sets. We used two image data sets for this study. The first data set (D1) consists of 37,774 images from 83 species of snakes (on average 406 images per species). This data set is a subset (approximately 1/4) of a data set used for the Snake Identification Challenge hosted on the aicrowd platform (link, 3rd round). We set the maximum number of images to 500 per species and we reduced the number of species to the 83 species for which we were able to obtain genetic data. The species belong to 44 genera and 10 families. The second data set (D2) was obtained from GBIF (5) and it contains 2,593 images from 153 species of Lepidoptera (butterflies and moths). These species belong to 99 genera and 10 families. All images in D2 are contributions from a single institution.

Image characteristics. The images in D1 are contributed by citizen scientists and presumably taken with mobile devices. Therefore, the imaging equipment and conditions vary a lot (see Figure 2). The snakes are often photographed from a distance, and they are often to a great extent hidden behind rocks or vegetation or under the water. The images in D2 are of preserved museum specimens. Thus, specimens occupy a large portion of the image area and are oriented in a similar way. Unlike in D1, D2 images have little background noise, which allows neural networks to learn classification tasks from a relatively small number of images. Because people were photographing objects of interest, it is logical to assume they are all centered. Therefore, in our experiments we used a square central crop with maximum size (min(image_height, image_width)).

Training and validation sets. For each experiment, we randomly assigned a subset of the data to the validation set, and used the remaining subset for training. The split was stratified, which means that the ratio of images per species is preserved in both the training and validation set. Specifically, we utilized the StratifiedShuffleSplit function from the scikit-learn package (6). For D1, which is a larger and more challenging data set (outdoor images), we used a 70:30 split for training and validation, respectively. For D2 we used a 30:70 split in order to have a larger validation set. The validation set was kept aside during training and it was used to evaluate the performance of the networks after the training had finished. In order to minimize the effect of chance, we performed the same experiment 5 times using 5 different random seeds and report average results. For every random seed a new validation set was created. In addition, we used test-time augmentation, which means that for evaluation we combined predictions from the original image as well as 8 slightly altered copies of the image. This technique is often used because it improves scores on metrics, but our main motivation here was to stabilize the results.

Architectures. We utilized a well known and light convolutional neural network (CNN), ResNet 50 (7). We fine-tuned a publicly available checkpoint pretrained on the publicly available ImageNet dataset (8). The only modification we introduced compared to the official implementation is that we changed the number of outputs of the last layer to the number of species in our data sets. We used 0.5 dropout (9) and standard augmentations (flip, zoom, rotate, lighting and brightness) for regularization in order to prevent overfitting to the training data. In all experiments we used the Adam optimizer (10) and the one-cycle training policy (11–13). We trained our networks with a batch size of 64, and with 5 and 15 epochs for D1 and D2, respectively (note that D2 had only 779 images in the training set). The maximum learning rate was 3e-4, with a warm start for the first 5% of the training followed by learning rate decay with cosine annealing and a minimum learning rate of 1e-5.

Evaluation metrics. We evaluated our experiments using three common evaluation metrics: a) accuracy - the ratio of correctly predicted observations to total observations; b) top-N-accuracy or topN - if the correct category is among the N highest probabilities, we count the answer as correct (in this study N=3); and c) F1 score macro - a weighted average of recall and precision, where recall is the ratio of correctly predicted positive observations to all observations in the actual category, and precision is the ratio of correctly predicted positive observations to the total predicted positive observations. The first metric, accuracy, favors the majority category and is well aligned with the commonly used cross entropy loss function. The F1 score macro metric values each category equally and is often used with imbalanced data sets.

The common evaluation metrics assume that all errors are equally bad. In order to understand the nature of the errors better, we created several custom metrics. Specifically, we measured the accuracy at the genus and family levels to investigate how often an erroneously predicted species is misidentified as another species of the same genus or family. For the snake data set, we also report the accuracy of detecting whether a species is venomous or not. The information on whether a species is venomous or not was not provided during training and is learned as a byproduct.

Reference trees. The phylogenetically informed target encoding was based on a reference tree. As reference tree for data set D1, we used a phylogeny based on a maximum likelihood analysis of a multi-gene data set for both snakes and lizards (14). Species names not present in D1 were pruned from the reference tree, while preserving the backbone structure and keeping the branch lengths. Due to differences in the taxonomy used, some species names differed between D1 and the reference tree. In those cases, names were adjusted according to the current use in The Reptile Database (15). For data set D2, we used a phylogenetic tree based on a large-scale multi-omics data set (16). Species names not present in D2 were pruned from the reference tree, while preserving the backbone structure and keeping the branch lengths. Tree manipulations were done using the R-package ape (17). See SI-XXX for the scripts used.

Target encoding. We used two common target encoding schemes as our baselines: one-hot encoding and label smoothing. In one-hot encoding (also called 'hard targets') of target i among N possibly unique categories, we use a one-dimensional vector of zeros of length N, with element i flipped to 1. LS (3) (a variant of 'soft targets') uses a weighted average of the hard-target distribution and the uniform probability distribution over labels. LS improves generalization by preventing the network from becoming overconfident (3) but also improves network calibration (18).

For all other experiments we also used soft targets, but unlike LS, the one-hot distribution was mixed with a non-uniform distribution over labels representing hierarchical information about the categories. Specifically, we used three different types of hierarchical information: taxonomic rank, anagenetic distance, and cladogenetic distance. For taxonomic rank, we counted the number of shared taxonomic levels (genus and family) between the correct target and each of the other categories. For anagenetic-distance smoothing, we computed the distance between the correct target and each of the other categories using the sum of the branch lengths separating them on the reference tree. We then subtracted the distances from the maximum value and normalized the resulting values by dividing them with the sum over all categories. The cladogenetic-distance smoothing distribution was computed in a similar way, but using the number of edges separating the categories as the distance measure.

For all hierarchical smoothing methods, we explored different mixture proportions, β. Specifically, we used β = {0.025, 0.05, 0.1, 0.2, 0.4}.
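As an illustration of this target construction, here is a minimal sketch (not the authors' released code) that builds phylogenetically informed soft targets from a precomputed pairwise patristic (anagenetic) distance matrix, following the recipe described above; the function name, array names and the toy distance matrix are assumptions for illustration only.

```python
# Minimal sketch of phylogenetic label smoothing targets, assuming a
# precomputed pairwise patristic distance matrix D (species x species).
# The toy values below are illustrative, not taken from the paper's data.
import numpy as np

def pls_targets(dist_matrix, true_idx, beta):
    """Mix a one-hot target with a distance-derived distribution (mixing weight beta)."""
    d = dist_matrix[true_idx].astype(float)
    sim = d.max() - d                       # closer species get larger values
    smooth = sim / sim.sum()                # normalize to a probability distribution
    one_hot = np.zeros_like(smooth)
    one_hot[true_idx] = 1.0
    return (1.0 - beta) * one_hot + beta * smooth

# Toy patristic distances between three species (symmetric, zero diagonal).
D = np.array([[0.0, 1.0, 4.0],
              [1.0, 0.0, 4.0],
              [4.0, 4.0, 0.0]])

target = pls_targets(D, true_idx=0, beta=0.1)
print(target, target.sum())  # sums to 1; the close relative receives more mass than the distant one
```

The same helper applies to cladogenetic smoothing if the matrix holds edge counts rather than branch-length sums.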

Loss function. All target encoding schemes we used ensure that the sum over categories is 1 for all targets. This makes it possible to use a Softmax output with common classification loss functions such as cross-entropy or the Kullback-Leibler (KL) divergence. In our experiments we utilized the KL divergence.
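As an illustration of this choice, the sketch below (assuming PyTorch; it is not the authors' training code, and the example tensors are placeholders) computes the KL divergence between soft targets like those sketched above and the network's softmax outputs.

```python
# Sketch: KL-divergence loss between soft targets and network outputs (PyTorch).
import torch
import torch.nn.functional as F

def soft_target_kl_loss(logits, soft_targets):
    """KL(soft_targets || softmax(logits)), averaged over the batch."""
    log_probs = F.log_softmax(logits, dim=1)
    return F.kl_div(log_probs, soft_targets, reduction="batchmean")

logits = torch.randn(2, 3)                            # batch of 2, 3 categories
targets = torch.tensor([[0.957, 0.043, 0.000],        # e.g. PLS-smoothed targets
                        [0.050, 0.900, 0.050]])
print(soft_target_kl_loss(logits, targets))
```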

Data availability. The data sets D1 and D2 are deposited on the aicrowd platform (https://www.aicrowd.com/challenges/snake-species-identification-challenge/) and GBIF (https://doi.org/10.15468/dl.7eqt9y), respectively.

ACKNOWLEDGMENTS. This work was supported by the European Union's Horizon 2020 research and innovation program under the Marie Sklodowska-Curie grant agreement No. 642241 (to FR and MV).

1. J Schmidhuber, Deep learning in neural networks: An overview. Neural Networks 61, 85–117 (2015).
2. Y LeCun, Y Bengio, G Hinton, Deep learning. Nature 521, 436–444 (2015).
3. C Szegedy, V Vanhoucke, S Ioffe, J Shlens, Z Wojna, Rethinking the inception architecture for computer vision in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016).
4. W Hennig, Phylogenetic Systematics (University of Illinois Press) (1966).
5. GBIF (Occurrence Download, https://doi.org/10.15468/dl.7eqt9y) (2020). Accessed: November 2020.
6. F Pedregosa, et al., Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
7. K He, X Zhang, S Ren, J Sun, Deep residual learning for image recognition in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016).
8. O Russakovsky, et al., ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115, 211–252 (2015).
9. N Srivastava, G Hinton, A Krizhevsky, I Sutskever, R Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014).
10. DP Kingma, J Ba, Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
11. LN Smith, No more pesky learning rate guessing games. CoRR abs/1506.01186 (2015).
12. LN Smith, A disciplined approach to neural network hyper-parameters: Part 1 - learning rate, batch size, momentum, and weight decay. arXiv preprint arXiv:1803.09820 (2018).
13. LN Smith, N Topin, Super-convergence: Very fast training of residual networks using large learning rates. CoRR abs/1708.07120 (2017).
14. RA Pyron, FT Burbrink, JJ Wiens, A phylogeny and revised classification of Squamata, including 4161 species of lizards and snakes. BMC Evol. Biol. 13, 93 (2013).
15. P Uetz, P Freed, J Hošek, The Reptile Database (http://www.reptile-database.org) (2020). Accessed: March 2020.
16. D Chesters, Construction of a species-level tree of life for the insects and utility in taxonomic profiling (2016).
17. E Paradis, K Schliep, ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R (2019).
18. R Müller, S Kornblith, GE Hinton, When does label smoothing help? in Advances in Neural Information Processing Systems, eds. H Wallach, et al. (Curran Associates, Inc.), Vol. 32, pp. 4694–4703 (2019).

Supplementary Information

Table 1. The results from experiments on D1. We report mean averages of 10 runs and differences against the first benchmark (in parentheses).

experiment accuracy top3 f1_macro genus family venom
0_one-hot 61.38 79.84 60.78 72.02 80.55 87.55
1_LS 61.28 (-0.1) 79.81 (-0.03) 60.86 (0.08) 71.97 (-0.05) 80.47 (-0.08) 87.37 (-0.18)
2_TLS_0.025 61.42 (0.04) 79.95 (0.11) 61.09 (0.31) 72.28 (0.26) 80.89 (0.34) 87.64 (0.09)
2_TLS_0.05 61.39 (0.01) 79.88 (0.04) 61.05 (0.27) 72.26 (0.24) 80.85 (0.3) 87.68 (0.13)
2_TLS_0.1 61.25 (-0.13) 79.83 (-0.01) 60.85 (0.07) 72.23 (0.21) 81.01 (0.46) 87.74 (0.19)
2_TLS_0.2 60.68 (-0.7) 79.47 (-0.37) 60.21 (-0.57) 72.01 (-0.01) 81.12 (0.57) 87.77 (0.22)
2_TLS_0.4 58.98 (-2.4) 77.75 (-2.09) 58.33 (-2.45) 71.13 (-0.89) 80.87 (0.32) 87.65 (0.1)
3_PLS_ana_0.025 61.3 (-0.08) 79.81 (-0.03) 60.93 (0.15) 72.06 (0.04) 80.62 (0.07) 87.51 (-0.04)
3_PLS_ana_0.05 61.3 (-0.08) 79.85 (0.01) 60.94 (0.16) 72 (-0.02) 80.59 (0.04) 87.46 (-0.09)
3_PLS_ana_0.1 61.18 (-0.2) 79.89 (0.05) 60.76 (-0.02) 71.87 (-0.15) 80.53 (-0.02) 87.46 (-0.09)
3_PLS_ana_0.2 60.93 (-0.45) 79.63 (-0.21) 60.46 (-0.32) 71.74 (-0.28) 80.53 (-0.02) 87.41 (-0.14)
3_PLS_ana_0.4 59.57 (-1.81) 78.66 (-1.18) 58.84 (-1.94) 70.62 (-1.4) 79.92 (-0.63) 86.96 (-0.59)
3_PLS_clado_0.025 61.37 (-0.01) 79.91 (0.07) 61.02 (0.24) 72.08 (0.06) 80.67 (0.12) 87.57 (0.02)
3_PLS_clado_0.05 61.29 (-0.09) 79.87 (0.03) 60.9 (0.12) 72.04 (0.02) 80.54 (-0.01) 87.51 (-0.04)
3_PLS_clado_0.1 61.19 (-0.19) 79.84 (0.0) 60.76 (-0.02) 71.91 (-0.11) 80.44 (-0.11) 87.45 (-0.1)
3_PLS_clado_0.2 60.85 (-0.53) 79.5 (-0.34) 60.41 (-0.37) 71.48 (-0.54) 80.16 (-0.39) 87.16 (-0.39)
3_PLS_clado_0.4 59.51 (-1.87) 78.71 (-1.13) 58.85 (-1.93) 70.31 (-1.71) 79.46 (-1.09) 86.74 (-0.81)

Table 2. The results from experiments on D2. The same experiments and evaluation metrics as in Table 1 are shown, without the biological trait metric.

experiment accuracy top3 f1_macro genus family
0_one-hot 89.22 97.1 87.62 94.08 99.1
1_LS 89.26 (0.04) 97.3 (0.2) 87.62 (0.0) 94.1 (0.02) 99.18 (0.08)
2_TLS_0.025 89.58 (0.36) 97.22 (0.12) 87.96 (0.34) 94.38 (0.3) 99.18 (0.08)
2_TLS_0.05 89.44 (0.22) 97.3 (0.2) 87.92 (0.3) 94.34 (0.26) 99.28 (0.18)
2_TLS_0.1 89.48 (0.26) 97.36 (0.26) 87.8 (0.18) 94.38 (0.3) 99.34 (0.24)
2_TLS_0.2 89.78 (0.56) 97.34 (0.24) 88.1 (0.48) 94.84 (0.76) 99.32 (0.22)
2_TLS_0.4 88.84 (-0.38) 97.38 (0.28) 86.82 (-0.8) 94.82 (0.74) 99.38 (0.28)
3_PLS_ana_0.025 89.36 (0.14) 97.2 (0.1) 87.78 (0.16) 94.18 (0.1) 99.12 (0.02)
3_PLS_ana_0.05 89.48 (0.26) 97.06 (-0.04) 87.96 (0.34) 94.3 (0.22) 99.12 (0.02)
3_PLS_ana_0.1 89.26 (0.04) 97.3 (0.2) 87.64 (0.02) 94.24 (0.16) 99.2 (0.1)
3_PLS_ana_0.2 89.1 (-0.12) 97.14 (0.04) 87.44 (-0.18) 94.16 (0.08) 99.24 (0.14)
3_PLS_ana_0.4 89.34 (0.12) 97.32 (0.22) 87.72 (0.1) 94.22 (0.14) 99.24 (0.14)
3_PLS_clado_0.025 89.5 (0.28) 97.06 (-0.04) 87.96 (0.34) 94.14 (0.06) 99.12 (0.02)
3_PLS_clado_0.05 89.54 (0.32) 97.16 (0.06) 87.9 (0.28) 94.2 (0.12) 99.18 (0.08)
3_PLS_clado_0.1 89.2 (-0.02) 97.24 (0.14) 87.6 (-0.02) 94.22 (0.14) 99.22 (0.12)
3_PLS_clado_0.2 89.5 (0.28) 97.18 (0.08) 87.82 (0.2) 94.24 (0.16) 99.22 (0.12)
3_PLS_clado_0.4 89.16 (-0.06) 97.24 (0.14) 87.48 (-0.14) 94.24 (0.16) 99.26 (0.16)


IV

The Swedish Ladybird Project: Engaging 15,000 school children in improving AI identification of ladybird beetles

Miroslav Valan 1,2,3,*, Martin Bergman 4, Mattias Forshage 5 and Fredrik Ronquist 2,3

1 Savantic AB, Rosenlundsgatan 52, 118 63 Stockholm, Sweden

2 Department of Bioinformatics and Genetics, Swedish Museum of Natural History, P. O. Box 50007, SE-104 05 Stockholm, Sweden

3 Department of Zoology, Stockholm University, SE-106 91 Stockholm, Sweden

4 Vetenskap & Allmänhet (Public & Science), P.O. Box 5073, SE-102 42 Stockholm, Sweden

5 Department of Zoology, Swedish Museum of Natural History, Box 50007, SE-104 05 Stockholm, Sweden

* [email protected]

Abstract

Citizen scientists contribute large volumes of critical biodiversity data but the taxonomic coverage is limited and the species determinations are not always reliable. Both problems could be addressed, at least in part, by developing better identification tools. Thanks to the dramatic progress in AI-based computer vision technology, the interest in automated species identification systems has exploded in recent years. However, these techniques often require large training sets of accurately labeled images, which are unavailable for most organism groups. In the Swedish Ladybird Project, we engaged school children in collecting photos of one of these groups, ladybird beetles (Coccinellidae), to improve AI identification tools. We estimate that more than 15,000 school children participated in the project. Children were provided with an app for submitting photos and macro lenses fitting onto mobile device cameras, while their teachers received instructions and educational materials focused on the diversity and biology of ladybird beetles. Over the summer of 2018, participants collected more than 5,000 photos of 30 species of coccinellids. The project added substantially to the openly-licensed ladybird images available previously from the GBIF portal, more than doubling the number of images for four species. Adding the project images to the GBIF data improved AI identification accuracy for all but the most common ladybird species. We conclude that citizen-science projects targeting teachers and school children can be an effective way of improving AI identification systems for biodiversity while providing many positive educational and societal side effects.


Introduction

During the last 75 years, and especially in the 21st century, we have witnessed a dramatic increase in the number and scale of citizen science (CS) projects (Theobald et al. 2015). This is particularly true for biology and environmental science, disciplines that attract the interest of many citizen scientists. For instance, there is a long tradition of amateur naturalists, such as bird watchers, reporting occurrence data for their favorite groups. Since the early 2000's, such occurrence data have been aggregated online together with data from professional sources by the Global Biodiversity Information Facility (GBIF). The volume of CS data has grown more rapidly than that of other data sources, and since 2016 they constitute more than half of the GBIF occurrence records (Waller 2019).

CS projects enable professional scientists to collect data on larger scales, both temporally and spatially, than would otherwise have been possible. In some fields, this has been critical in allowing professional scientists to generate data and results fast enough to support timely policy decisions (Theobald et al. 2015; Pimm et al. 2014; Erwin and Johnson 2000). Some examples from the biodiversity field include monitoring of biodiversity for conservation purposes (McKinley et al. 2017; Ellwood, Crimmins, and Miller-Rushing 2017); recording changes in the distribution of organisms that have direct or indirect health implications on humans, such as disease vectors (Palmer et al. 2017; Collantes et al. 2015, 2016); or detecting the presence of invasive species threatening local biodiversity (Crall et al. 2010; Delaney et al. 2008; Mori et al. 2019).


Despite their value, CS biodiversity data come with some important limitations. Compared to professional sources, the taxonomic coverage of CS data is quite narrow (Waller 2019), and the species identifications may not always be reliable. Both of these shortcomings could be addressed at least in part by providing citizen scientists with better identification tools. Thanks to the dramatic progress in AI-based computer vision technology in recent years, the development of automated identification tools for citizen scientists in the biodiversity field has exploded, driven in part by yearly competitions organized in collaboration with major CS platforms like iNaturalist and Pl@ntNet (Joly et al. 2014; Van Horn et al. 2018). These tools rely on deep learning, which typically requires large training sets of accurately labeled images. CS platforms can often provide such data for popular organism groups, such as birds. However, to broaden the taxonomic scope of CS contributions, it would be necessary to develop automated identification tools also for organism groups that are outside the current CS focus, but that might become popular CS targets once such tools exist.

Here, we describe lessons learned from a CS project focused on collecting images and developing accurate species identification tools for one such group, ladybird beetles in Sweden. The ladybirds or ladybugs (Coccinellidae) are a well-known family of beetles (Coleoptera), with 71 species in Sweden (Wärmling 2017) and some 6,000 species worldwide (Slipinski, Leschen, and Lawrence 2011). They are common, visually striking, decently sized, and usually slow-moving, so they are easy to observe for amateurs and well-known to the general public, with plenty of folklore connected with them. More than half of the species are readily recognised as members of the family by most people (exceptions include the typically small, dull, black or elongate members of the subfamilies Scymninae and Coccidulinae). Despite these attractive features, ladybird beetles do not belong to the most popular CS targets, and there is only a limited number of ladybird images available through GBIF. Choosing ladybirds as a target group was also timely, since an affordable and accessible field guide to all Swedish species had just been published (Wärmling 2017).

The Swedish Ladybird Project (SLP2018), which was run in 2018, engaged 15,000 school children and teachers. Equipped with guides, lenses and mobile device cameras, each school class dedicated an average of one to two hours of effort to the project, and some teachers and children continued their contributions afterwards. The project generated over 5,000 photos covering 30 ladybird species. This significantly expanded the previously available GBIF image set, more than doubling the number of images for four species. We found that the SLP2018 data substantially improved the accuracy of automated identification tools except for the most common species. As a byproduct, the project generated an accurate snapshot of the distribution of an invasive species, Harmonia axyridis. We conclude that CS projects targeting teachers and school children can be an effective way of improving AI identification systems, while providing many positive educational and societal side effects.


Methods

The SLP2018 project was divided into two distinct phases: citizen science (phase A) and data analysis (phase B). In phase A, we aimed to collect data (images, geo-coordinates, time and date) on ladybirds using contributions from schoolchildren. In phase B, we used the collected images in developing automated identification tools. We relied on experts to identify the ladybird species in the collected images, and then we developed automated identification tools using state-of-the-art machine learning methods trained with or without the contributed images.

Phase A - Citizen Science

The CS part of the project was organized in the summer of 2018 as a Swedish Mass Experiment, an annual CS project coordinated by the Swedish non-profit organization Public & Science ("Vetenskap & Allmänhet"). The Swedish Mass Experiment is one of the activities at "ForskarFredag", a Swedish annual science festival, which is a part of the larger Europe-wide science festival "European Researchers' Night". In the project (called "Nyckelpigeförsöket 2018" in Swedish; see more information on the project's website (Nyckelpigeförsöket 2018)), we asked participants to contribute photographs of ladybirds submitted through a mobile app specifically designed for this project by the company BioNote (Figure 1). The app had a user interface in Swedish and it was available in versions for iOS (iPhone and iPad) and for Android (phones and tablets). It was possible to use it offline, and in that case images and metadata would be uploaded to the database when the wifi connection to the Internet was restored. Participants had the option to provide a tentative name for the species they photographed.

The timing of the project activities was designed in part based on the life history of ladybird beetles. In Sweden, adult ladybirds leave their overwintering sites in late spring. They stay active searching for food and mates until early summer, when eggs are laid. During the summer, eggs hatch into larvae. The latter develop through four distinct stages, each separated from the preceding one by a skin molt, and the larvae finally transform into pupae. Then, in late summer the new generation of adult ladybirds emerges, joining the survivors of the parent generation in the search for food to build up reserves for overwintering. Thus, late summer is arguably the best period for observing adult ladybird beetles in the field. For these reasons, we scheduled the project kick-off in late spring to allow for field work in late summer and early fall.

Recruitment. The project was initiated as a CS project aimed at schoolchildren (ages 6-16). In April 2018, the project was launched and schools were invited to register to participate in the project. A press release was sent out to national and local media, generating media coverage of the project by morning shows on national TV, national and local radio, and national and local newspapers. Encouraged by the interest from the target audience and the general public, we decided early on to expand the project to also include preschool children (aged up to 6) and to allow participation of private individuals.


Registration allowed us to coordinate the project, to communicate with participants (teachers and private individuals) and to provide instructions and support.

After completion of the project, we also sent an Evaluation Survey to all registered participants (Supplementary Material, Evaluation Survey). However, registration was not a requirement to participate in the project, and we also made it possible for non-registered users to contribute to the project by sending in images. Thus, the total number of participants in the project can only be approximated.

Experiment kit. Each pre-registered teacher was offered an "experiment kit" (Figure 1), which included a guide for teachers (also available online; see Supplementary Material, Guide for Teachers), a snap-on macro lens for smartphones and other mobile devices, and an accessible but comprehensive field guide to Swedish ladybird species, published just before the project kickoff (Wärmling 2017). The guide for teachers included information on the general biology of ladybirds, information about biodiversity and AI, but also practical guidelines on how to find ladybirds in the wild.

In total, 700 experiment kits were sent out, with the ambition of providing one kit per 15 schoolchildren on average. The experiment kits were not required for participation but were provided as additional resources for experimentation and learning. For school children, we provided a simplified identification guide to common Swedish ladybirds in electronic form (Supplementary Material, Guide for Teachers). The material we provided was intended to help minimize the requirements for participation, so that we could also include preschool children aged five or six in the project. The material was designed with help from teachers and expert entomologists, and it was presented to the children in late May and early June prior to school holidays.

Engagement. School teachers were instructed to invest at least 1-2 hours of field work with their students on the project during the period from mid August, when the fall semester starts, to the end of September. Preschools and private individuals were given no specific instructions about the time or amount of effort to spend on the project. Some school teachers initiated the project already in the spring and encouraged students to contribute ladybird images during the summer break.

Costs and other risks for participants. Participation in the project was free. All costs were fully covered by project grants (see Acknowledgements), including creation of manuals, purchase of lenses and books, and development of and access to a free project app available for Android and iOS devices (Figure 1). Every pupil in Sweden is provided with access to a tablet by their school, making it possible for all school children to participate regardless of whether they own a private device.

Participants were not placed in any situation associated with increased risk of mental or bodily harm. The field work occurred in close proximity to their schools, such as school yards, city parks or nature reserves.

Legal issues and privacy concerns. Alongside the images, we collected metadata: geo-coordinates, date, time, and randomly generated unique identifiers for users and events (one or more images taken in a sequence). We specifically requested that no other information should be collected and deposited with the submitted data. School children were not individually registered to the project, and they could not be identified through the app. To avoid privacy concerns, the data were kept private during the project period, and we ensured that the publicly released data cannot be linked back to the participants.
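To illustrate what such an anonymized submission could look like, the sketch below shows one possible record structure; the field names and layout are illustrative assumptions, not the app's actual schema.

```python
# Sketch of the kind of anonymized metadata record the app could submit with
# each photo; field names and structure are illustrative, not the actual schema.
import uuid
from datetime import datetime, timezone

def make_submission_record(lat, lon, user_id, event_id, tentative_name=None):
    """Metadata for one submitted image: location, time, and random identifiers only."""
    return {
        "user_id": str(user_id),              # randomly generated, not linked to a person
        "event_id": str(event_id),            # groups images taken in one sequence
        "latitude": lat,
        "longitude": lon,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "tentative_species": tentative_name,  # optional suggestion from the participant
    }

record = make_submission_record(59.33, 18.07, uuid.uuid4(), uuid.uuid4(),
                                tentative_name="Coccinella septempunctata")
print(record)
```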

Phase B - Data Analysis

Species identification. To enable faster identification, we encouraged participants to provide tentative species names (Swedish or scientific names) for images they submitted. For images that did not include tentative names, we generated suggested identifications using an AI model trained on images from GBIF (GBIF.org).

Next, we organized the images into directories for each tentative species. We also created a separate directory called "other" for the images that were challenging for the AI model to identify, and for images that appeared corrupted, images of children's faces, non-ladybird images and images of ladybird pupae, larvae and eggs. We restricted our attention to images of adults, as they are more easily identified to species based on images than earlier instars. Finally, all images were examined individually by a competent taxonomist (MF), who verified or corrected the identifications as needed. We refer to this set in the data analysis sections as the "SLP2018 images", the "SLP2018 data", or simply "SLP2018".


Image sets for evaluating the SLP2018 contribution. In order to understand the value of the SLP2018 data for training automated identification systems, we compared it to GBIF data. GBIF is the largest collection of open-sourced biodiversity data, including images. However, by mid September 2018 it only included 99 images of ladybird beetles taken in Sweden. Therefore, to increase the size of the GBIF image set, we expanded it to include all GBIF images of Swedish ladybird species, even if the images were taken in other countries. We included only the species previously recorded in Sweden. For instance, if images A and B from species A' and B', respectively, were available from Brazil but only species A' was recorded from Sweden, then only image A was used. We only used GBIF images tagged as "human observation". The images obtained from GBIF were assumed to be correctly identified. In total, the GBIF data included 23,224 ladybird images.

In the experiments, we used the GBIF images contributed in 2018 (until mid September; the "GBIF2018 data") as the evaluation set. In total, this set included 7,264 images. The remaining GBIF images (those contributed before 2018) were used for training (the "GBIF_training data"). In total, GBIF_training included 15,960 images. The size and taxonomic coverage of all image sets are given in Table 1 (see also Supplementary Table S1).
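As an illustration of this filtering step, the sketch below (not the project's actual pipeline) reduces a GBIF occurrence download to human-observation records of Swedish ladybird species and splits them by year into training and evaluation sets; the file name, column names and the truncated species list are assumptions based on typical GBIF Darwin Core exports.

```python
# Sketch: filter a GBIF occurrence download to Swedish coccinellid species
# recorded as human observations, then split by year (pre-2018 = training,
# 2018 = evaluation). File and column names follow typical GBIF exports and
# are assumptions, as is the (truncated) species list.
import pandas as pd

swedish_species = {"Coccinella septempunctata", "Harmonia axyridis"}  # ... full list elsewhere

occ = pd.read_csv("occurrence.txt", sep="\t", low_memory=False)
occ = occ[occ["basisOfRecord"] == "HUMAN_OBSERVATION"]
occ = occ[occ["species"].isin(swedish_species)]

gbif_training = occ[occ["year"] < 2018]
gbif_2018 = occ[occ["year"] == 2018]   # evaluation set (up to mid September)
print(len(gbif_training), len(gbif_2018))
```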

Evaluation experiments. To assess the value of the SLP2018 data, we compared automated identification models trained on SLP2018 data alone with models trained on GBIF_training data alone, or on combined SLP2018 and GBIF_training data. In all cases, we used GBIF2018 for evaluation. We used several evaluation metrics to measure the identification performance of the models, all implemented in the scikit-learn package (Pedregosa et al. 2011). Specifically we used:

● Accuracy - the ratio of correctly predicted observations to the total observations. Accuracy is the most intuitive metric but it can be misleading when dealing with imbalanced datasets. For instance, if our model needs to distinguish between two categories A and B, and we have 100 observations with imbalance 90:10, then predicting everything as A gives you 90% accuracy. If your aim is to correctly predict as many observations as possible then accuracy is a good performance indicator. However, if you are interested in a minority category then some other metric is usually a better choice.

● F1 score - the weighted average of precision and recall, where recall is the ratio of correctly predicted positive observations to all observations in the actual category and precision is the ratio of correctly predicted positive observations to the total predicted positive observations. We used "macro" averaging, which means that the metric is computed for each category independently and then averaged, hence it treats every class equally regardless of the number of observations. For these reasons, the F1 score (with macro averaging) is considered a better metric than accuracy for imbalanced datasets.


In addition to these common metrics, we also used F1 scores computed on species subsets. The subsets were created based on the number of images per species on GBIF. Subset A (dominant species) was composed of the two dominant species, Harmonia axyridis and Coccinella septempunctata. Subset B (common species) was composed of the species ranked 3-10, and subset C (minority species) was composed of the remaining species.

Our three datasets, GBIF_training, GBIF2018 and SLP2018, together included images of 46 species (n=46), but the list of species present in each of them differs: GBIF_training has n=45, GBIF2018 has n=36 and SLP2018 has n=30 (Table 1; Supplementary Table S1). Across all our experiments, we set the number of possible species to 46. This means that the evaluation of a model trained on SLP2018 (n=30) included some species that the model was not exposed to during training. At the same time, the evaluation set did not necessarily have examples from all the species present in SLP2018. This is a real-life scenario, and we opted for this more realistic approach instead of simplifying the testing to a toy problem, where only the species present in all three datasets would be included. To allow a more detailed analysis of the results, we also monitored the individual F1 scores for all species included in the SLP2018 dataset. In addition, we report the performance of the models in detecting the invasive species, H. axyridis; the metric used here is accuracy.
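As a concrete illustration of these metrics, the toy example below (not the study’s data) shows how accuracy and macro-averaged F1 scores, including F1 scores restricted to species subsets, can be computed with scikit-learn; the species names, counts and subsets are invented.

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Invented ground truth and predictions for three species with a strong imbalance.
y_true = np.array(["C_septempunctata"] * 90 + ["H_axyridis"] * 8 + ["A_bipunctata"] * 2)
y_pred = np.array(["C_septempunctata"] * 100)   # a model that always predicts the majority class

all_species = ["C_septempunctata", "H_axyridis", "A_bipunctata"]   # in the study: all 46 possible species

print(accuracy_score(y_true, y_pred))                                                   # 0.90, despite two species never being predicted
print(f1_score(y_true, y_pred, labels=all_species, average="macro", zero_division=0))   # ~0.32

# Subset F1 scores are obtained by restricting the macro average to a subset of labels,
# analogous to subsets A-C above.
print(f1_score(y_true, y_pred, labels=["C_septempunctata"], average="macro", zero_division=0))             # ~0.95
print(f1_score(y_true, y_pred, labels=["H_axyridis", "A_bipunctata"], average="macro", zero_division=0))   # 0.0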

Automated image-based identification systems. For automated image-based identification we utilized convolutional neural networks (CNNs). We used initialization from publicly available checkpoints pretrained on the ImageNet dataset (Russakovsky et al. 2015). Specifically, we utilized a standard architecture, ResNet-50 (He et al. 2015), and standard training procedures. We did not optimize the system for a particular task or for high performance according to a particular evaluation metric. To minimize the effect of randomness, for each of the three experiments we report average results from three runs initiated with different random seeds, while keeping everything else unchanged.
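For concreteness, the sketch below shows one way such a training run could be set up. The choice of PyTorch/torchvision, the directory layout (one folder per species), the batch size and the optimizer settings are assumptions for illustration; the paper itself only specifies ResNet-50 with ImageNet initialization, cross-entropy loss, 512-pixel inputs and 10 epochs (see Table 5).

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

NUM_SPECIES = 46   # the label space used across all experiments

def train_one_model(train_dir, seed):
    torch.manual_seed(seed)                      # three seeds were used and the results averaged
    device = "cuda" if torch.cuda.is_available() else "cpu"

    transform = transforms.Compose([
        transforms.Resize((512, 512)),           # 512-pixel inputs, as in Table 5
        transforms.ToTensor(),
    ])
    dataset = datasets.ImageFolder(train_dir, transform=transform)   # one folder per species
    loader = DataLoader(dataset, batch_size=16, shuffle=True, num_workers=4)

    # ImageNet-pretrained ResNet-50 (requires a recent torchvision), with the
    # final layer replaced to output 46 species.
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    model.fc = nn.Linear(model.fc.in_features, NUM_SPECIES)
    model = model.to(device)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    criterion = nn.CrossEntropyLoss()            # default loss; class imbalance not addressed

    model.train()
    for epoch in range(10):                      # 10 epochs, as in Table 5
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model

Three such runs with different seeds would then be evaluated on GBIF2018 and their metrics averaged, as described above.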

Data and code availability​. All the data, code and pipelines needed to reproduce the results are available in the open repository accompanying the paper (URL to be added here). For the GBIF data, we provide DOI links to the downloaded occurrence data sets; the links to the original images are available through those DOI links on GBIF.org.

Results

In total, 372 school and preschool teachers from across Sweden registered to participate in the project. In addition, 104 individual participants registered. An evaluation survey was sent out to all registered participants after the completion of the project, but only 24 replies were received (Supplementary Material, Survey Report). Of the respondents, 2 were individual participants, 4 were preschool teachers, and 18 were school teachers. The preschool and school teachers reported that they had supervised from 16 to 67 (average 31.5) school children who participated in the project. If the answers are representative, then the registered teachers represent about 12,000 preschool and school children. This approximately corresponds to an estimate of the number of participants based on the number of experiment kits sent out (700) and the average number of school children per kit (about 15), suggesting that some 10,500 children had access to an experiment kit. With unregistered participants added to these numbers, we estimate that at least 15,000 school children contributed to the project.

Based on the interactions of participants with project staff, we estimate that the number of individual participants (registered and non-registered) was around 300.

The survey of registered participants indicated that social media were important in recruitment; almost half of the respondents (45%) said that they had heard about the project in social media (Supplementary Material, Evaluation Survey). Many respondents had also learned about the project because they had participated in previous Swedish Mass Experiments (25%) or because they had visited the website of the annual Swedish science festival “Forskarfredag” (21%). The most important reasons for participating included the opportunity to contribute to a research project (75% of respondents), the fact that the project involved interesting topics (50% of respondents) and that it fit nicely in the school curriculum (38% of respondents).

The survey respondents indicated that they had been searching for ladybird beetles on 2 to 28 occasions (some did not give exact numbers, see below), and that they usually had spent up to 1 hour (46%) or 1-2 hours (33%) on the task each time.

Some indicated that the children had also been searching for ladybird beetles outside of school hours (Supplementary Material, Evaluation Survey). Some, presumably preschool classes in many cases, had obviously spent a lot of time looking for ladybird beetles (replies included “every day”, “all the time throughout the project period” and “many times”).

The respondents included teachers of preschool children (ages 5–6) all the way to high-school students (ages 16–19). The most frequent age groups were grade 4 (ages 10–11; 41%), grade 6 (ages 12–13; 23%) and preschool (ages 1–5; 14%). More than half of the respondents (58%) replied that it was rather difficult or very difficult to find ladybird beetles, while 33% reported that it was rather easy. The respondents reported that they found from 0 to more than 100 ladybird beetles in total.

In total, we received 5,119 images of adult ladybird beetles (Figure 2) and 298 other images from at least 639 unique user accounts. An account corresponds to a single registered individual or a registered school or preschool class and not necessarily to a single unique user. In both cases multiple devices could be used. Over half of the accounts contributed 1–3 images, while a few contributed more than 100 images (Supplementary Table S2).

One third of the off-target images (Supplementary Table S3) portrayed other insects, mostly insects that are similar to ladybird beetles. These included true bug nymphs, such as the red and black juveniles of the stinkbug Zicrona caerulea (a round species) and the firebug Pyrrhocoris apterus (a more elongate species). It is remarkable that other red and black species, such as the even more common Lygaeus equestris, were not among the submitted images. Among non-ladybird beetles, the species Endomychus coccineus was common (closely related to ladybirds and rather similar in habitus), but so were members of the families Chrysomelidae and Nitidulidae. Of Chrysomelidae, it was mostly more or less red and black species of the genera Gonioctena and Chrysomela that were mistaken for ladybird beetles, but surprisingly also some of the metallic blue ones. Of Nitidulidae, it was mostly the red and black but quite elongated Glischrochilus species, but also some round but yellowish species, such as Cychramus luteus. Most of the remaining off-target images portrayed ladybird larvae or pupae, adult ladybird beetles that could not be identified to species, or represented apparent failed attempts to photograph ladybird beetles. Only 5 images were completely off-target.

The first ladybird images were submitted in early June, after some of the teachers had presented the project to their students (Figure 3). This was followed by a first wave of contributions spanning the first three weeks of summer (weeks 26–28). After that, contributions slowed over the summer until weeks 35–38, which brought the biggest peak in contributions. This is when the majority of teachers scheduled time for field work in the project. It also coincides with the predicted peak in the abundance of adult ladybird beetles.

The majority of contributed ladybird photos (68%) were identified as Coccinella septempunctata (Figure 4; Table 2). The proportion of images belonging to this species was similar over the season; C. septempunctata comprised 60–80% of the monthly contributions. It is interesting to note that, in addition to C. septempunctata, only three more species were represented by more than 3% of the total number of images. These species were: Psyllobora vigintiduopunctata (7.7%), Harmonia axyridis (5.8%) and Adalia bipunctata (5.4%) (Figure 4). Psyllobora vigintiduopunctata was clearly the second most common species in August and September, while the majority of A. bipunctata images were collected in June. The majority of contributions were made in morning hours (Figure 5).

We received contributions from all of the 21 counties of Sweden (Figure 6, Table 3). Most of the images were taken in the most populated counties: Stockholm, Skåne and Västra Götaland. In terms of per capita contributions, the county of Gotland stands out. It contributed around 7 times as many images per person as the national average.

In Figure 7 and Table 4, we compare the SLP2018 image set with the total GBIF image set (GBIF_training+GBIF2018). Note that the proportion of GBIF images contributed from Sweden is only 0.41% (99 of 23,224 images). All of the species for which we received images in SLP2018 were previously recorded in Sweden and had images on GBIF. However, for only one half (15) of those species did GBIF contain images from Sweden. For four species, the number of images in SLP2018 surpassed the total GBIF numbers: Psyllobora vigintiduopunctata (393 vs 354), Chilocorus renipustulatus (64 vs 39), Myrrha octodecimguttata (52 vs 38) and Cynegetis impunctata (28 vs 17). In total, the SLP2018 images cover almost half of the Swedish fauna (30 of 71 species; Table 2). The representation is uneven across subfamilies of Coccinellidae: Chilocorinae (2 of 5, 40%), Coccidulinae (0 of 5, 0%), Coccinellinae (24 of 35, 69%), Epilachninae (2 of 2, 100%), and Scymninae (2 of 24, 8%). In general, the taxa that were best represented are also the ones most easily recognized as ladybird beetles by amateurs. The taxonomic composition of the SLP2018 image set appears to be strongly biased towards common species, even though some more or less rare species are included, such as Calvia decemguttata, Hippodamia variegata and Nephus quadrimaculatus. Species with a northern distribution in Sweden (e.g. Anisosticta strigata, Coccinella trifasciata and Adalia conglomerata) are conspicuously rare or absent in the material. There is also an apparent bias towards species that are common in gardens, meadows and deciduous forests, while some common coniferous-forest and dry-heath species are less commonly photographed, and most of the wetland species (such as Anisosticta novemdecimpunctata and Hippodamia tredecimpunctata) are notably rare in the images.

The estimated value of the SLP2018 data for the development of automated species identification depends heavily on the choice of evaluation metric and the associated perspective on system performance (Table 5). The gain in overall accuracy by adding the SLP2018 image set to the GBIF_training set is small. The reason for this is that the accuracy metric favors majority categories (such as Coccinella septempunctata and Harmonia axyridis), and these were already well represented in the GBIF dataset. Around 80% of the ladybird images on GBIF are from these two species (Table 4).

In contrast, adding SLP2018 data to the GBIF_training data significantly improved our second metric, the F1 score with macro averaging (Table 5). The change is due to improvements in the identification of minority categories (less common species). This can be clearly seen by comparing F1 scores over the three species subsets A–C: the scores for subset A (Table 5, F1_A), including the two dominant species, remained almost the same after adding the SLP2018 images; the scores for subset B (Table 5, F1_B), representing the species ranked 3–10 in abundance, increased slightly; and the scores for subset C (Table 5, F1_C), including the remaining (rare) species, increased substantially.

As one might expect, adding SLP2018 data to GBIF data improved the performance of the automated identification systems for the species present in SLP2018 (Table 5, F1_S+). However, somewhat unexpectedly, there were also gains in the identification performance for species that were absent from SLP2018 (Table 5, F1_S-). A possible explanation for this is that the addition of more visual information for a species A, which is in the SLP2018 set, may help the model to distinguish it from a related species B, which is not in the set. As a result, the addition of species A images can indirectly improve the identification accuracy for species B. The performance of the models in detecting the only invasive species among Swedish ladybird beetles, Harmonia axyridis, was quite good despite the fact that they had not been trained specifically for this task. The accuracy in identifying H. axyridis was ~97% when the model was trained on SLP2018+GBIF_training data (Table 5, H. axyridis).

Discussion

Citizen Science

The success of the SLP2018 project appears to have been due to several factors. The survey indicated that the marketing of the project through social media was important, as well as the fact that it was part of the well-established annual Swedish Mass Experiment and the annual Swedish science festival “Forskarfredag”. The topics touched upon by the project (AI, biodiversity, and ladybirds) also appear to have been attractive, as well as the fact that the project fit well in the school curriculum. We also think that the suggested timing of the field part, in the early part of the fall semester, made participation convenient for school teachers.

Because we did not require registration for participation in the project, it is impossible to know how many people contributed. However, as detailed above, a reasonable estimate is that approximately 15,000 school children were involved while the number of individual participants was small in comparison. Based on the instructions we sent out to teachers and on the data from the survey, a conservative estimate is that the children spent at least 1 hour each searching for ladybird beetles. In total, the school children thus contributed at least 15,000 hours (375 weeks or 83 person months) of effort to the project.

The contributions we received came in two waves, the first in the beginning of the summer and the second in the beginning of the fall semester. The first wave was presumably the result of the project marketing campaign. Undoubtedly, some participants were eager to start contributing as soon as they had joined the project.

During the summer we continued to receive contributions at a lower level. Judging from the feedback we received, these contributions came to a large extent from preschools, which contributed images of ladybird beetles throughout the summer. Summer contributions also came from field stations, museums and groups of amateur entomologists who had organized tours and walks into nature to seek and photograph ladybirds. The second and larger wave of contributions occurred shortly after the start of the autumn semester, when children returned to school. This also coincided with the expected peak in the abundance of ladybird beetles, and with the suggested timing of the field part of the experiment in the instructions we gave to teachers. Thus, school children appear to have mainly participated in the early fall, while preschools and individual participants contributed throughout the summer. These differences in timing between the contributions from different groups are an important factor to consider in designing CS projects similar to SLP2018.

Given that we asked for images of only one particular family of insects, the ladybirds, and that the chances of finding them depend a lot on weather and habitat, we are quite pleased with the return on the participants' effort. Unfortunately, it seems that a number of school classes did not find a single ladybird beetle, but all groups had the chance to learn about ladybirds and their life history, and about the other research topics involved.

Coccinellid biology

Because we had no control over the sampling intensity across the season or across regions, it is difficult to say much about the spatial and temporal variation in absolute coccinellid species abundances. For instance, the fact that the majority of contributions came in the morning hours could be explained by the diurnal activity rhythm of the beetles, but also by the fact that the majority of teachers scheduled their field expeditions in the mornings (8:00–11:00 AM). The large number of contributions from Gotland appears to be explained partly by the fact that it is a popular summer destination, and partly by some dedicated teachers and individuals (one third of the Gotland images came from the small village of Fårösund, especially from the area around the local school and kindergarten).

Some systematic sampling biases are obvious. For instance, we received few contributions from northern Sweden, presumably because the human population is less dense there. The small number of northern contributions could also, in part, be due to the fact that the ladybird season is shorter there, resulting in a brief time window for participation during the fall semester. Species that thrive close to humans appear to be overrepresented, while species that are more difficult to identify as ladybird beetles, or that occur in habitats that are less attractive for field excursions (wet habitats, for instance), are underrepresented.

Despite these sampling biases, the spatial and temporal variation in relative species abundances is probably informative. For instance, our data clearly suggest that during the 2018 season, Psyllobora vigintiduopunctata was most abundant in the fall (August and September), while A. bipunctata peaked in June.

An interesting finding is that Harmonia axyridis, an invasive species, was the third most commonly photographed species. The species was first recorded in Sweden a little over a decade ago (Brown et al. 2008). The destructive effect of the species on native ecosystems is well documented (Roy et al. 2012; Brown et al. 2018; Brown and Roy 2018; Werenkraut, Baudino, and Roy 2020; Roy and Wajnberg 2008), and the SLP2018 data thus cause some concern. At the same time, they show the effectiveness of CS projects in monitoring this species.


The collected image set

The roughly 5,000 on-target images collected by the SLP2018 project represented a very significant expansion of what was available at the time through GBIF. SLP2018 increased the number of Swedish images of ladybird beetles by a factor of 50, and it represented fully 21% of all GBIF images of ladybird species occurring in Sweden, collected over the last 20 years from all over the globe. In 2018, the SLP2018 data approached the number of images submitted to GBIF from around the world (5,119 versus 7,264). The coverage of the Swedish fauna was also quite impressive, with 42% of the naturally occurring species included (30 of 71).

The proportion of submitted images that did not represent actual ladybirds was remarkably small, although not negligible. We think that the educational materials we provided contributed to this high degree of precision in the effort. The fact that ladybird beetles are fairly easy to identify as such, and familiar to many people, probably also contributed to a large extent. Thus, similar CS projects targeting other organism groups may have to spend more effort on instructional material or training, and may nevertheless experience considerably larger proportions of off-target images.

An interesting note is that the off-target images represent a valuable resource for the training of AI identification systems, because they show insects or other objects that humans easily mistake for the target organisms, and these images could also be confusing for machines. Thus, it would be quite valuable if a platform like GBIF provided images that were (initially) mistakenly identified, together with the initial taxonomic determination.

The SLP2018 numbers show that CS projects, in combination with expert validation of the species identifications, have the potential to significantly grow the available image sets for supervised learning of AI identification tools. We see no reason why the SLP2018 approach would not work for other groups of organisms that are neglected by existing CS projects, including many groups of beetles, moths, and other reasonably large arthropods that could be recognized and photographed by citizen scientists.

Value for Automated Identification

Assessing the value of data like the SLP2018 images for training automated identification tools is not straightforward. Ideally, one would first define a specific task, then train the AI models specifically for that task with or without the data, and then apply an appropriate evaluation criterion. The assessed value would undoubtedly vary depending on the task, the model, and the evaluation criterion. Here, we used simple experiments to explore the value of the data in a few common settings. Specifically, we looked at the performance in: (a) correctly identifying as many images as possible (evaluation metric: accuracy); (b) correctly identifying rare species (evaluation metric: F1 score with macro averaging); and (c) detecting an invasive species (Harmonia axyridis). We did not extensively optimize our models for any of these tasks or metrics. Instead, we trained a standard CNN for a moderate number of epochs, using a default loss function (cross entropy).

Our results suggest that the SLP2018 images do not noticeably improve accuracy (a score influenced mostly by the most common species, ​Coccinella septempunctata ​and Harmonia axyridis​). This is seen both in overall accuracy and in the accuracy of identifying the invasive ​H. axyridis​. However, the SLP2018 images substantially increase the ability to identify rare species. This is clearly shown by the scores of the three species subsets A–C. The positive effect on F1 scores increases substantially when moving from subset A (dominant species), over subset B (common species) to subset C (rare species).

An important note here is that our experimental setup was not ideal for assessing the value of the SLP2018 data, as the evaluation set (GBIF2018) had a different species composition than the SLP2018 data. In fact, the GBIF2018 composition was more similar to that of the GBIF_training data, biasing the results against SLP2018 and in favor of GBIF_training. With an evaluation set more similar to SLP2018, the positive effects would probably have been larger. More importantly, the assessed effects would presumably have more accurately reflected the value of the data for the development of identification tools targeting Swedish CS platforms.

A factor that could affect the results in either direction is that the GBIF images are more heterogeneous than the SLP2018 images. We only selected GBIF images tagged as “human observation”, which should refer to field photographs. However, while browsing through the GBIF images we spotted several examples of pinned specimens from natural history collections. Because the colors of pinned specimens may fade with age, and because of other differences between museum specimens and specimens photographed alive in the field, this could have affected the performance of the models trained only on GBIF data; the effect could have been either positive or negative. It could also have affected the performance of the SLP2018-trained models, as they were evaluated against these GBIF image sets. In the latter case, the effect is more likely to have been negative.

As for identifying the ladybird species, polymorphic species and rare species clearly posed particular difficulties, as could be expected, but there was also a seemingly random element, with some species showing unexpectedly high or unexpectedly low accuracy. Of course, very common species are easier to learn, but Coccinella septempunctata is often difficult to distinguish from C. magnifica and, in suboptimal pictures, also from C. quinquepunctata. In fact, for a large number of the submitted pictures, the identification as C. septempunctata could not be strictly verified.

Conclusion

Automated identification tools based on AI clearly have the potential to significantly expand the taxonomic coverage and improve the reliability of species identifications in CS platforms for biodiversity observations. However, most of the existing AI projects rely on image sets and associated annotations that have accumulated for popular organism groups like birds and flowering plants. Our project shows the potential of targeted CS efforts in collecting the required image sets for providing accurate identification tools for organism groups that have so far not received the attention from citizen scientists that they deserve.

Author contributions

M.V., M.B., and F.R. conceived the idea; M.V., M.B., and F.R. designed the study; M.B. led the citizen-science effort and M.V. assisted; M.F. identified specimens from images and interpreted results; M.V. compiled and filtered the data set from GBIF, wrote the scripts, performed analyses, interpreted results, and prepared the first draft. All authors contributed to the final version of the manuscript.

Acknowledgements

The authors thank Johannes Bergsten and Katarina Deneberg from the Swedish Museum of Natural History for their contribution to Teachers’ Night, and Alexandre Antonelli for his valuable comments during the project. We also thank numerous other staff members from Savantic AB, the Swedish Museum of Natural History, and the Swedish non-profit Science & Public for their support and valuable discussions before, during and after the project. Lastly, and most importantly, we thank all 15,000 schoolchildren, their teachers and other participants who devoted their time to this project.

Funding

This work was supported by the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No. 642241 (to FR and MV), and by Formas grant agreement No. 2017-12267 (to MB and MV). The project was also supported by Savantic AB, the Swedish Museum of Natural History, and the Swedish non-profit Science & Public.


Tables:

Table 1. Image sets used for training and evaluation​. Size and species coverage of the image sets used in the current study for training and evaluation of automated identification systems. For a complete list of the included species and the number of images for each species, see Supplementary Table S1.

SLP2018 GBIF_training GBIF2018 GBIF (total)

Total images 5119 15960 7264 23224

Number of taxa 30 45 36 46


Table 2. Species breakdown of the SLP2018 dataset​. The number of images (percent of total in parenthesis) for the 30 species of ladybird beetles that were photographed at least once in the SLP2018 project.

Species Number of images

Coccinella septempunctata 3484 (68.06)

Psyllobora vigintiduopunctata 393 (7.68)

Harmonia axyridis 297 (5.80)

Adalia bipunctata 278 (5.43)

Propylea quatuordecimpunctata 108 (2.11)

Chilocorus renipustulatus 64 (1.25)

Calvia quatuordecimguttata 63 (1.23)

Myrrha octodecimguttata 52 (1.02)

Coccinella undecimpunctata 50 (0.98)

Coccinella quinquepunctata 48 (0.94)

Halyzia sedecimguttata 40 (0.78)

Tytthaspis sedecimpunctata 34 (0.66)

Anatis ocellata 33 (0.64)

Cynegetis impunctata 28 (0.55)

Adalia decempunctata 28 (0.55)

Coccinula quatuordecimpustulata 24 (0.47)

Exochomus quadripustulatus 22 (0.43)

Hippodamia variegata 16 (0.31)

Myzia oblongoguttata 14 (0.27)

Hippodamia septemmaculata 14 (0.27)

Subcoccinella vigintiquatuorpunctata 8 (0.16)

Calvia decemguttata 6 (0.12)

Hippodamia tredecimpunctata 4 (0.08)


Nephus quadrimaculatus 3 (0.06)

Ceratomegilla notata 2 (0.04)

Scymnus suturalis 2 (0.04)

Sospita vigintiguttata 1 (0.02)

Adalia conglomerata 1 (0.02)

Coccinella hieroglyphica 1 (0.02)

Anisosticta novemdecimpunctata 1 (0.02)


Table 3. The image contributions received from each Swedish county. We show the number of contributed images, the contribution per capita (images per 1 million inhabitants), and the normalized contribution per capita (ratio to the national average). Only images with geographical coordinates are included (82% of SLP2018 images).

County Number of images Images per capita Normalized per capita

Värmland 14 49 0.12

Södermanland 15 51 0.13

Norrbotten 17 67 0.17

Kronoberg 14 70 0.17

Västerbotten 37 137 0.33

Kalmar 34 139 0.34

Halland 47 144 0.35

Jönköping 90 251 0.61

Gävleborg 80 280 0.68

Jämtland 39 300 0.73

Västra Götaland 563 332 0.81

Uppsala 123 333 0.81

Örebro 101 337 0.82

Västmanland 112 413 1

Blekinge 75 470 1.14

Västernorrland 118 479 1.16

Dalarna 157 548 1.33

Östergötland 252 550 1.33

Stockholm 1288 558 1.35

Skåne 840 624 1.51

Gotland 168 2867 6.94


Table 4. Number of images per species in SLP2018 and GBIF image sets. For GBIF, we show the number of occurrences with images from the World and from Sweden, collected until mid-September 2018. We limit the comparison to images of species recorded in Sweden. The list is sorted in descending order based on the number of GBIF images; the rank of each species in SLP2018 is shown in parentheses. The table includes a few GBIF images with download problems, which were not included in the GBIF2018 or GBIF_training sets (see Table 1).

Species GBIF-World GBIF-Sweden SLP2018

Harmonia axyridis 11112 14 297 (3)

Coccinella septempunctata 7982 25 3484 (1)

Adalia bipunctata 731 6 278 (4)

Propylea quatuordecimpunctata 710 7 108 (5)

Hippodamia variegata 501 8 16 (19)

Psyllobora vigintiduopunctata 354 16 393 (2)

Cryptolaemus montrouzieri 283

Coccinella undecimpunctata 283 6 50 (9)

Coccinella trifasciata 235

Halyzia sedecimguttata 223 1 40 (11)

Calvia quatuordecimguttata 199 3 63 (7)

Adalia decempunctata 135 1 28 (14)

Exochomus quadripustulatus 121 22 (17)

Myzia oblongoguttata 105 1 14 (20)

Coccinella quinquepunctata 94 3 48 (10)

Coccinula quatuordecimpustulata 93 24 (18)

Chilocorus bipustulatus 92

Anatis ocellata 80 1 33 (13)

Aphidecta obliterata 67

Hippodamia tredecimpunctata 56 4 (24)


Coccinella hieroglyphica 56 2 1

Subcoccinella vigintiquatuorpunctata 46 8 (22)

Tytthaspis sedecimpunctata 44 34 (12)

Oenopia conglobata 42

Chilocorus renipustulatus 39 64 (6)

Myrrha octodecimguttata 38 52 (8)

Anisosticta novemdecimpunctata 37 1 (29)

Ceratomegilla notata 32 2 (26)

Calvia decemguttata 31 5 6 (23)

Harmonia quadripunctata 30

Coccinella magnifica 22

Vibidia duodecimguttata 21

Cynegetis impunctata 17 28 (15)

Scymnus suturalis 12 2 (27)

Parexochomus nigromaculatus 12

Hippodamia septemmaculata 12 14 (21)

Coccidula rufa 11

Rhyzobius chrysomeloides 8

Nephus quadrimaculatus 7 3 (25)

Sospita vigintiguttata 5 1 (30)

Platynaspis luteorubra 5

Scymnus haemorrhoidalis 3

Scymnus nigrinus 3

Adalia conglomerata 1 1 (28)

Coccidula scutellata 1

Scymnus ferrugatus 1

Total images 23992 99 5119


Comparison to total (%) 82.14 0.34 17.52

Comparison to GBIF (%) 100 0.41 21.34


Table 5. Analysis of the contribution of SLP2018 images when training a model for ladybird identification. For each of the three datasets - SLP2018, GBIF_training and GBIF_training+SLP2018 - we report averages of three runs initialized with different random seeds. We utilized ResNet-50 with an input size of 512 pixels, trained for 10 epochs; class imbalance was not addressed (for details, see Methods). The abbreviated column names are: Acc - accuracy; F1 - F1 score with macro averaging; F1_A - F1 for the dominant species (C. septempunctata and H. axyridis); F1_B - F1 for the common species (ranked 3-10); F1_C - F1 for the minority species; F1_S+ - F1 for species present in SLP2018; F1_S- - F1 for species absent from SLP2018; H. axyridis - accuracy in detecting H. axyridis.

Dataset Acc F1 F1_A F1_B F1_C F1_S+ F1_S- H. axyridis

SLP2018 71.5 16.9 79.3 40.2 5.7 24.7 0.0 81.6

GBIF_training 93.1 47.6 96.1 80.3 35.1 55.7 30.4 96.3

GBIF_training+SLP2018 93.6 51.6 96.4 82.4 39.4 59.8 33.7 96.8


Figures:

Figure 1. Experiment kit. The first 700 pre-registered school classes were provided with an experiment kit, which included a book with current knowledge on Swedish ladybirds and a smartphone lens. All participants had online access to a teacher’s guide, a simplified identification guide for common Swedish ladybirds, and an app (Android and iOS, for tablets and smartphones) for submitting contributions. (Photo courtesy of Lotta Tomasson.)


Figure 2.​ Example images.


Figure 3. Image contributions over time. In panel (a) we show the number of images contributed during the project. The first pictures were submitted in early June, followed by a first peak during the first three weeks of summer, when the project was broadly marketed. Another peak coincided with the start of school, when school children devoted two hours to organized field work. In panel (b) we show the proportion of monthly contributions of each ladybird species.


Figure 4. ​Species composition of the contributed images over time (a) and in total (b).

Most of the contributed images (68%) belonged to ​Coccinella septempunctata​, followed by ​Psyllobora vigintiduopunctata ​(7.7%), ​Harmonia axyridis (5.8%)​ and ​Adalia bipunctata (5.4%). These 4 species together comprise ~87% of submitted images.


Figure 5. Number of images contributed per hour. The majority of contributions came during morning hours when teachers scheduled field expeditions.


Figure 6. Geographical distribution of the contributed images. Most of the images came from the most populated counties: Stockholm, Skåne and Västra Götaland (a). In (b) we show the locations where images were taken (each dot represents a single image), and in (c) we show the normalized contribution per capita for each of the 21 counties (darker shading indicates more images per capita).


References:

Brown, PMJ, Adriaens, T, Bathon, H, Cuppen, J, Goldarazena, A, Hägg, T, Kenis, M, et al. 2008. “Harmonia axyridis in Europe: Spread and Distribution of a Non-Native Coccinellid.” From Biological Control to Invasion: The Ladybird Harmonia axyridis as a Model Species. https://doi.org/10.1007/978-1-4020-6939-0_2.
Brown, PMJ and Roy, HE. 2018. “Native Ladybird Decline Caused by the Invasive Harlequin Ladybird Harmonia axyridis: Evidence from a Long-Term Field Study.” Insect Conservation and Diversity. https://doi.org/10.1111/icad.12266.
Brown, PMJ, Roy, DB, Harrower, C, Dean, HJ, Rorke, SL and Roy, HE. 2018. “Spread of a Model Invasive Alien Species, the Harlequin Ladybird Harmonia axyridis in Britain and Ireland.” Scientific Data. https://doi.org/10.1038/sdata.2018.239.
Collantes, F, Delacour, S, Alarcón-Elbal, PM, Ruiz-Arrondo, I, Delgado, JA, Torrell-Sorio, A, Bengoa, M, et al. 2015. “Review of Ten-Years Presence of Aedes albopictus in Spain 2004–2014: Known Distribution and Public Health Concerns.” Parasites & Vectors. https://doi.org/10.1186/s13071-015-1262-y.
Collantes, F, Delacour, S, Delgado, JA, Bengoa, M, Torrell-Sorio, A, Guinea, H, Ruiz, S, Lucientes, J and Alert, M. 2016. “Updating the Known Distribution of Aedes albopictus (Skuse, 1894) in Spain 2015.” Acta Tropica 164 (December): 64–68.
Crall, AW, Newman, GJ, Jarnevich, CS, Stohlgren, TJ, Waller, DM and Graham, J. 2010. “Improving and Integrating Data on Invasive Species Collected by Citizen Scientists.” Biological Invasions. https://doi.org/10.1007/s10530-010-9740-9.
Delaney, DG, Sperling, CD, Adams, CS and Leung, B. 2008. “Marine Invasive Species: Validation of Citizen Science and Implications for National Monitoring Networks.” Biological Invasions. https://doi.org/10.1007/s10530-007-9114-0.
Ellwood, ER, Crimmins, TM and Miller-Rushing, AJ. 2017. “Citizen Science and Conservation: Recommendations for a Rapidly Moving Field.” Biological Conservation. https://doi.org/10.1016/j.biocon.2016.10.014.
Erwin, TL and Johnson, PJ. 2000. “Naming Species, a New Paradigm for Crisis Management in Taxonomy: Rapid Journal Validation of Scientific Names Enhanced with More Complete Descriptions on the Internet.” The Coleopterists Bulletin. https://doi.org/10.1649/0010-065x(2000)054[0269:nsanpf]2.0.co;2.
GBIF.org (24th September 2018) GBIF Occurrence Download. https://doi.org/10.15468/dl.e78x7y.
Groom, Q, Weatherdon, L and Geijzendorffer, IR. 2017. “Is Citizen Science an Open Science in the Case of Biodiversity Observations?” Journal of Applied Ecology. https://doi.org/10.1111/1365-2664.12767.
Guide for teachers. Available at https://forskarfredag.se/filer/ff2018-nyckelpigeforsoket-lararhandledning.pdf. (Last accessed 1 December 2020).
Hampton, SE, Strasser, CA, Tewksbury, JJ, Gram, WK, Budden, AE, Batcheller, AL, Duke, CS and Porter, JH. 2013. “Big Data and the Future of Ecology.” Frontiers in Ecology and the Environment 11 (3): 156–62.
He, K, Zhang, X, Ren, S and Sun, J. 2015. “Deep Residual Learning for Image Recognition.” arXiv Preprint arXiv:1512.03385.
iNaturalist. Available at https://www.inaturalist.org. (Last accessed 1 December 2020).
Joly, A, Goëau, H, Glotin, H, Spampinato, C, Bonnet, P, Vellinga, WP, Planque, R, Rauber, A, Fisher, R and Müller, H. 2014. “LifeCLEF 2014: Multimedia Life Species Identification Challenges.” Lecture Notes in Computer Science. https://doi.org/10.1007/978-3-319-11382-1_20.
McKinley, DC, Miller-Rushing, AJ, Ballard, HL, Bonney, R, Brown, H, Cook-Patton, SC, Evans, DM, et al. 2017. “Citizen Science Can Improve Conservation Science, Natural Resource Management, and Environmental Protection.” Biological Conservation. https://doi.org/10.1016/j.biocon.2016.05.015.
Mori, E, Menchetti, M, Camporesi, A, Cavigioli, L, Tabarelli de Fatis, K and Girardello, M. 2019. “License to Kill? Domestic Cats Affect a Wide Range of Native Fauna in a Highly Biodiverse Mediterranean Country.” Frontiers in Ecology and Evolution. https://doi.org/10.3389/fevo.2019.00477.
Nyckelpigeförsöket 2018. Available at https://forskarfredag.se/forskarfredags-massexperiment/nyckelpigeforsoket/ (Last accessed 1 December 2020).
Palmer, JRB, Oltra, A, Collantes, F, Delgado, JA, Lucientes, J, Delacour, S, Bengoa, M, Eritja, R and Bartumeus, F. 2017. “Citizen Science Provides a Reliable and Scalable Tool to Track Disease-Carrying Mosquitoes.” Nature Communications 8 (1): 916.
Pedregosa, F, Varoquaux, G, Gramfort, A, Michel, V, Thirion, B, Grisel, O, Blondel, M, et al. 2011. “Scikit-Learn: Machine Learning in Python.” Journal of Machine Learning Research: JMLR 12 (Oct): 2825–30.
Pimm, SL, Jenkins, CN, Abell, R, Brooks, TM, Gittleman, JL, Joppa, LN, Raven, PH, Roberts, CM and Sexton, JO. 2014. “The Biodiversity of Species and Their Rates of Extinction, Distribution, and Protection.” Science. https://doi.org/10.1126/science.1246752.
Pl@ntNet. Available at https://identify.plantnet.org. (Last accessed 1 December 2020).
Practical guide. Available at https://forskarfredag.se/filer/ff2018-nyckelpigeforsoket-praktiskhandledning.pdf. (Last accessed 1 December 2020).
Roy, HE, Adriaens, T, Isaac, NJB, Kenis, M, Onkelinx, T, San Martin, G, Brown, PMJ, et al. 2012. “Invasive Alien Predator Causes Rapid Declines of Native European Ladybirds.” Diversity and Distributions. https://doi.org/10.1111/j.1472-4642.2012.00883.x.
Roy, H and Wajnberg, E. 2008. “From Biological Control to Invasion: The Ladybird Harmonia axyridis as a Model Species.” From Biological Control to Invasion: The Ladybird Harmonia axyridis as a Model Species. https://doi.org/10.1007/978-1-4020-6939-0_1.
Russakovsky, O, Deng, J, Su, H, Krause, J, Satheesh, S, Ma, S, Huang, Z, Karpathy, A, Khosla, A, Bernstein, M, et al. 2015. “ImageNet Large Scale Visual Recognition Challenge.” International Journal of Computer Vision 115 (3): 211–52.
Slipinski, SA, Leschen, RAB and Lawrence, JF. 2011. “Order Coleoptera Linnaeus, 1758. In: Zhang, Z.-Q. (Ed.) Animal Biodiversity: An Outline of Higher-Level Classification and Survey of Taxonomic Richness.” Zootaxa. https://doi.org/10.11646/zootaxa.3148.1.39.
Theobald, EJ, Ettinger, AK, Burgess, HK, DeBey, LB, Schmidt, NR, Froehlich, HE, Wagner, C, Lambers, HR, Tewksbury, J, Harsch, MA and Parrish, JK. 2015. “Global Change and Local Solutions: Tapping the Unrealized Potential of Citizen Science for Biodiversity Research.” Biological Conservation. https://doi.org/10.1016/j.biocon.2014.10.021.
Van Horn, G, Mac Aodha, O, Song, Y, Cui, Y, Sun, C, Shepard, A, Adam, H, Perona, P and Belongie, S. 2018. “The iNaturalist Species Classification and Detection Dataset.” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. https://doi.org/10.1109/cvpr.2018.00914.
Waller, J. 2019. “Will Citizen Science Take Over?” GBIF data blog post, 2019-01-21. Available at https://data-blog.gbif.org/post/gbif-citizen-science-data/ (Last accessed 1 December 2020).
Werenkraut, V, Baudino, F and Roy, HE. 2020. “Citizen Science Reveals the Distribution of the Invasive Harlequin Ladybird (Harmonia axyridis Pallas) in Argentina.” Biological Invasions. https://doi.org/10.1007/s10530-020-02312-7.