Artificial Neural Networks for Fly Identification

Biologia, Bratislava, 62/4: 462—469, 2007 Section Zoology DOI: 10.2478/s11756-007-0089-1 Artificial neural networks for fly identification: A case study from the genera Tachina and Ectophasia (Diptera, Tachinidae) Jaromír Vaňhara1*, Natália Muráriková1, Igor Malenovský2 &JosefHavel3* 1Masaryk University, Faculty of Science, Institute of Botany and Zoology, Kotlářská 2,CZ-61137 Brno, Czech Republic; e-mail: [email protected] 2Moravian Museum, Department of Entomology, Hviezdoslavova 29a, CZ-62700 Brno, Czech Republic 3Masaryk University, Faculty of Science, Department of Analytical Chemistry, Kotlářská 2,CZ-61137 Brno, Czech Republic; e-mail: [email protected] Abstract: The classification methodology based on morphometric data and supervised artificial neural networks (ANN) was tested on five fly species of the parasitoid genera Tachina and Ectophasia (Diptera, Tachinidae). Objects were initially photographed, then digitalized; consequently the picture was scaled and measured by means of an image analyser. The 16 variables used for classification included length of different wing veins or their parts and width of antennal segments. The sex was found to have some influence on the data and was included in the study as another input variable. Better and reliable classification was obtained when data from both the right and left wings were entered, the data from one wing were however found to be sufficient. The prediction success (correct identification of unknown test samples) varied from 88 to 100% throughout the study depending especially on the number of specimens in the training set. Classification of the studied Diptera species using ANN is possible assuming a sufficiently high number (tens) of specimens of each species is available for the ANN training. The methodology proposed is quite general and can be applied for all biological objects where it is possible to define adequate diagnostic characters and create the appropriate database. Key words: artificial neural networks; species identification; Diptera; Tachinidae; Tachina; Ectophasia; parasitoids Introduction the identification process, especially if coupled with image analysis (Weeks & Gaston 1997; Gaston & O’Neill In the last century, artificial neural network (ANN) 2004). This was also documented in the case study by computation was developed having been inspired by Do et al. (1999) who successfully applied ANN to iden- neurobiology and the way the human brain works. tify six spider species from the family Lycosidae us- Nowadays, numerous and wide applications are found ing transformed digital images of female genitalia. The in all branches of science. There are extensive appli- ANN have also performed the treatment of bioacoustic cations of ANN in chemistry, biochemistry as well as data, e.g., Chesmore (2004) used them for automated in biology. ANN is used, e.g., in microbiology, ecology, identification of four species of British Orthoptera from hydrobiology, entomology, and even in forensic investi- sound recordings. Hernández-Borges et al. (2004) iden- gations. tified three species of limpets (Mollusca, Patellogas- The use of ANN in taxonomy is rather rare, even tropoda, Patella) in the coastal areas of the Canary though in this discipline they clearly provide a powerful Islands using ANN and chemotaxonomic data (the con- pattern recognition and data analysis tool as pointed tent of different aliphatic hydrocarbons). out already by Weeks & Gaston (1997). Their ability to Quite often, in some taxonomic groups the charac- learn patterns in multivariate data makes them ideally ters used for the discrimination of species are not un- suited to identification problems. ambiguously distinct. They may overlap, be variable Identification of species has always been an object or even missing and the species might then be diffi- of general interest and often an important task in sys- cult to separate. Studying the flowering plant genus tematic biological disciplines, such as botany, zoology, Lithops (Aizoaceae), Clark (2003) suggested that, in as well as entomology. Up to now, the traditional meth- such cases, ANN can be used as advisory tools for tax- ods for the identification of species are mostly based onomists. This is particularly useful since experts in dif- on great encyclopaedic knowledge, memory, and many ferent groups often work alone and an unbiased second long-years experience of an expert in the field. ANN, opinion on the identification of a “difficult” specimen however, have a great potential to partly automate is valuable. Frequently, the difficulties in the identifi- * Corresponding authors: J. Vaňhara (Diptera), J. Havel (ANN) c 2007 Institute of Zoology, Slovak Academy of Sciences ANN tachinid identification 463 cation pertain to some developmental stage or sex. For INPUT w example, the Neotropical Phlebotomine sand fly species ij of the Lutzomyia intermedia (Lutz & Neiva, 1912) com- OUTPUT plex (Diptera, Psychodidae) can be easily distinguished by females but their males are very similar. Marcondes & Borges (2000) succeeded in reliably distinguishing the males using ANN and a set of morphometric characters and thus brought the first application of ANN in input i out j Diptera. One of the most taxonomically problematic families of Diptera are the Tachinidae (McAlpine 1989), w also being one of the largest fly families, with about ij 8,200 species worldwide. Among the species 1,550 oc- cur in the Palaearctic Region and nearly 880 in Europe (Tschorsnig & Richter 1998; Tschorsnig et al. 2005). Ta- HIDDEN LAYERS chinids are parasitoids of insects and some other arthro- Fig. 1. Construction of ANN from four layers, i.e., input, output pods, and may be regarded as economically beneficial and two hidden layers. as they also prey on many agricultural and forest pests. The Central European fauna was treated in the mono- graph by Tschorsnig & Herting (1994). Some persistent tionally constructed with three or more layers, i.e., in- doubts and taxonomic problems concern, e.g., the gen- put, output and hidden layers (Fig. 1). era Tachina and Ectophasia, due to a great morpholog- Each layer has a different number of nodes. The ical variability of individual species in Central Europe input layer receives the information about the system (Vaňhara et al. 2004). Often, this causes an uncertain (the nodes of this layer are simple distributive nodes and ambiguous identification of some specimens, which which do not alter the input value at all). The hidden can be otherwise solved out only by much experience layer processes the information initiated at the input, and a good reference collection. while the output layer is the observable response or be- The aim of our study is to select an alternative haviour. The inputs, inputi, multiplied by connection set of diagnostic characters and evaluate the possibility weights wij, are first summed and then passed through of supervised ANN to identify individual specimens in a transfer function to produce the output, outi. The de- these two model genera of Tachinidae. termination of the appropriate number of hidden layers and number of hidden nodes in each layer is one of the Theory of ANN most critical tasks in ANN design. Unlike the input and output layers, one starts with no prior knowledge of the ANN represent sophisticated computational modelling number and size of hidden layers. tools which can be used to solve a wide variety of The use of ANN consists of two steps: “Training” complex problems. The attractiveness of ANN in biol- and “Prediction”. The “Training” consists first of defin- ogy comes from their capability to learn and/or model ing input and output data to the network. It is usually very complex systems which allows for the possibility necessary to scale the data or normalize it to the net- of them being used as a tool for classification. There- works paradigm. This data is referred to as the training fore, their actual potential in this branch of science is set. In this training phase, where actual data must be high. A big contrast between conventional computer used, the optimum structure, weight coefficients and bi- programs and ANN is reflected in the fact that the for- ases of the network are searched for. The training is con- mer can only accomplish those tasks for which they sidered complete when the neural networks achieved the were specifically designed, while ANN is a kind of gen- desired statistical accuracy as it produces the required eralised learning machine which can, in principle, learn outputs for a given sequence of inputs. A good criterion almost anything. ANN’s theory has been widely dis- to find the correct network structure and therefore to cussedinpreviousliterature. Good overview of ANN stop the learning process is to minimize the root mean principles can be found in the monographs by Car- square error (RMS) ling (1992), Fausett (1994), Bishop (1995), Patterson (1996); the theory of different networks has been re- N M 2 i=1 j=1(yij − outij ) viewed by Zupan & Gasteiger (1991). Here we will only RMS = briefly describe the basic idea of this methodology. N × M ANN is a computational model formed from a cer- tain number of single units, artificial neurones or nodes, where yij is the element of the matrix (N × M)forthe connected with coefficients (weights), wij, which con- training set or test set, and outij is the element of the stitute the neural structure. Many different neural net- output matrix (N ×M) of the neural network, where N work architectures can be used. One of the most com- is the number of variables in the matrix and M is the mon is the feed forward supervised neural network of number of samples. RMS gives a single number which multi-layer perceptrons (MLP). The MLP is conven- summarizes the overall error. 464 J. Vaňhara et al. A B Fig. 2. Measured wing (A 1–14) and antennal (B 15–16) characters in Tachinidae. 1(cs1 part) – part of costal section cs1;2(cs2) – length of costal section cs2;3(cs3) – costal section cs3;4(cs4)–cs4;5(cs5)–cs5;6(R1) –lengthofR1;7(R2+3)–lengthofR2+3;8(R4+5)–R4+5; 9(M part1) – medial vein M between bm-cu and its bend; 10(M part2) – length of post angular vein (the rest of M); 11(CuA1 part) – CuA1 between bm-cu and dm-cu; 12(dm-cu) – length of dm-cu; 13(r-m) – length of r-m; 14(bm-cu) – length of bm-cu.

Artificial Neural Networks for Fly Identification

Details

Download

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

Support