Machine Vision and Applications manuscript No. (will be inserted by the editor)

Automated Insect Identification through Concatenated Histograms of Local Appearance Features: Feature Vector Generation and Region Detection for Deformable Objects

Natalia Larios · Hongli Deng · Wei Zhang · Matt Sarpola · Jenny Yuen · Robert Paasch · Andrew Moldenke · David A. Lytle · Salvador Ruiz Correa · Eric N. Mortensen · Linda G. Shapiro · Thomas G. Dietterich

Received: date / Accepted: date

Abstract This paper describes a computer vision approach to automated rapid-throughput taxonomic identification of stonefly larvae. The long-term goal of this research is to develop a cost-effective method for environmental monitoring based on automated identification of indicator species. Recognition of stonefly larvae is challenging because they are highly articulated, they exhibit a high degree of intraspecies variation in size and color, and some species are difficult to distinguish visually, despite prominent dorsal patterning. The stoneflies are imaged via an apparatus that manipulates the specimens into the field of view of a microscope so that images are obtained under highly repeatable conditions. The images are then classified through a process that involves (a) identification of regions of interest, (b) representation of those regions as SIFT vectors [25], (c) classification of the SIFT vectors into learned "features" to form a histogram of detected features, and (d) classification of the feature histogram via state-of-the-art ensemble classification algorithms. Steps (a) to (c) compose the concatenated feature histogram (CFH) method. We apply three region detectors for part (a), including a newly developed principal curvature-based region (PCBR) detector. This detector finds stable regions of high curvature via a watershed segmentation algorithm. We compute a separate dictionary of learned features for each region detector, and then concatenate the histograms prior to the final classification step.

We evaluate this classification methodology on a task of discriminating among four stonefly taxa, two of which, Calineuria and Doroneuria, are difficult even for experts to discriminate. The results show that the combination of all three detectors gives four-class accuracy of 82% and three-class accuracy (pooling Calineuria and Doroneuria) of 95%. Each region detector makes a valuable contribution. In particular, our new PCBR detector discriminates Calineuria and Doroneuria much better than the other detectors.

Keywords classification · object recognition · interest operators · region detectors · SIFT descriptor

Natalia Larios
University of Washington, Department of Electrical Engineering
E-mail: [email protected]

H. Deng · W. Zhang · E. Mortensen · T. G. Dietterich
Oregon State University, School of Electrical Engineering and Computer Science
E-mail: {deng, zhangwe, enm, tgd}@eecs.oregonstate.edu

Matt Sarpola · Robert Paasch
Oregon State University, Department of Mechanical Engineering
E-mail: [email protected]

Andrew Moldenke
Oregon State University, Department of Botany and Plant Pathology
E-mail: [email protected]

David A. Lytle
Oregon State University, Department of Zoology
E-mail: [email protected]

Jenny Yuen
Massachusetts Institute of Technology, Computer Science and AI Laboratory
E-mail: [email protected]

Salvador Ruiz Correa
Children's National Medical Center, Department of Diagnostic Imaging and Radiology
E-mail: [email protected]

Linda G. Shapiro
University of Washington, Department of Computer Science and Engineering
E-mail: [email protected]

1 Introduction

There are many environmental science applications that could benefit from inexpensive computer vision methods for automated population counting of insects and other small arthropods. At present, only a handful of projects can justify the expense of having expert entomologists manually classify field-collected specimens to obtain measurements of arthropod populations.

The goal of our research is to develop general-purpose computer vision methods, and associated mechanical hardware, for rapid-throughput image capture, classification, and sorting of small arthropod specimens. If such methods can be made sufficiently accurate and inexpensive, they could have a positive impact on environmental monitoring and ecological science [14,18,8].

The focus of our initial effort is the automated recognition of stonefly (Plecoptera) larvae for the biomonitoring of freshwater stream health. Stream quality measurement could be significantly advanced if an economically practical method were available for monitoring insect populations in stream substrates. Population counts of stonefly larvae and other aquatic insects inhabiting stream substrates are known to be a sensitive and robust indicator of stream health and water quality [17]. Because these insects live in the stream, they integrate water quality over time. Hence, they provide a more reliable measure of stream health than single-time-point chemical measurements. Aquatic insects are especially useful as biomonitors because (a) they are found in nearly all running-water habitats, (b) their large species diversity offers a wide range of responses to water quality change, (c) the taxonomy of most groups is well known and identification keys are available, (d) responses of many species to different types of pollution have been established, and (e) data analysis methods for aquatic insect communities are available [6]. Because of these advantages, biomonitoring using aquatic insects has been employed by federal, state, local, tribal, and private resource managers to track changes in river and stream health and to establish baseline criteria for water quality standards. Collection of aquatic insect samples for biomonitoring is inexpensive and requires relatively little technical training. However, the sorting and identification of insect specimens can be extremely time consuming and requires substantial technical expertise. Thus, aquatic insect identification is a major technical bottleneck for large-scale implementation of biomonitoring.

Larval stoneflies are especially important for biomonitoring because they are sensitive to reductions in water quality caused by thermal pollution, eutrophication, sedimentation, and chemical pollution. On a scale of organic pollution tolerance from 0 to 10, with 10 being the most tolerant, most stonefly taxa have a value of 0, 1, or 2 [17]. Because of their low tolerance to pollution, change in stonefly abundance or taxonomic composition is often the first indication of water quality degradation. Most biomonitoring programs identify stoneflies to the taxonomic resolution of Family, although when expertise is available, Genus-level (and occasionally Species-level) identification is possible. Unfortunately, because of constraints on time, budgets, and availability of expertise, some biomonitoring programs fail to resolve stoneflies (as well as other taxa) below the level of Order. This results in a considerable loss of information and, potentially, in the failure to detect changes in water quality.

Besides its practical importance, the automated identification of stoneflies raises many fundamental computer vision challenges. Stonefly larvae are highly-articulated objects with many sub-parts (legs, antennae, tails, wing pads, etc.) and many degrees of freedom. Some taxa exhibit interesting patterns on their dorsal sides, but others are not patterned. Some taxa are distinctive; others are very difficult to identify. Finally, as the larvae repeatedly molt, their size and color change. Immediately after molting, they are light colored, and then they gradually darken. This variation in size, color, and pose means that simple computer vision methods that rely on placing all objects in a standard pose cannot be applied here. Instead, we need methods that can handle significant variation in pose, size, and coloration.

To address these challenges, we have adopted the bag-of-features approach [15,9,32]. This approach extracts a bag of region-based "features" from the image without regard to their relative spatial arrangement. These features are then summarized as a feature vector and classified via state-of-the-art machine learning methods. The primary advantage of this approach is that it is invariant to changes in pose and scale as long as the features can be reliably detected. Furthermore, with an appropriate choice of classifier, not all features need to be detected in order to achieve high classification accuracy. Hence, even if some features are occluded or fail to be detected, the method can still succeed. An additional advantage is that only weak supervision (at the level of entire images) is necessary during training.

A potential drawback of this approach is that it ignores some parts of the image, and hence loses some potentially useful information. In addition, it does not capture the spatial relationships among the detected regions. We believe that this loss of spatial information is unimportant in this application, because all stoneflies share the same body plan and, hence, the spatial layout of the detected features provides very little discriminative information.

The bag-of-features approach involves five phases: (a) region detection, (b) region description, (c) region classification into features, (d) combination of detected features into a feature vector, and (e) final classification of the feature vector. For region detection, we employ three different interest operators: (a) the Hessian-affine detector [29], (b) the Kadir entropy detector [21], and (c) a new detector that we have developed called the principal curvature-based region detector (PCBR). The combination of these three detectors gives better performance than any single detector or pair of detectors, and the combination was critical to achieving good classification rates.

All detected regions are described using Lowe's SIFT representation [25]. At training time, a Gaussian mixture model (GMM) is fit to the set of SIFT vectors, and each mixture component is taken to define a feature. The GMM can be interpreted as a classifier that, given a new SIFT vector, can compute the mixture component most likely to have generated that vector. Hence, at classification time, each SIFT vector is assigned to the most likely feature (i.e., mixture component). A histogram consisting of the number of SIFT vectors assigned to each feature is formed. A separate GMM, set of features, and feature vector is created for each of the three region detectors and each of the stonefly taxa. These feature vectors are then concatenated prior to classification. The steps mentioned above form the concatenated feature histogram (CFH) method, which allows the use of general classifiers from the machine learning literature. The final labeling of the specimens is performed by an ensemble of logistic model trees [23], where each tree has one vote.

The rest of the paper is organized as follows. Section 2 discusses existing systems for insect recognition as well as relevant work in generic object recognition in computer vision. Section 3 introduces our PCBR detector and its underlying algorithms. In Section 4, we describe our insect recognition system, including the apparatus for manipulating and photographing the specimens and the algorithms for feature extraction, learning, and classification. Section 5 presents a series of experiments to evaluate the effectiveness of our classification system and discusses the results of those experiments. Finally, Section 6 draws some conclusions about the overall performance of our system and the prospects for rapid-throughput insect population counting.

2 Related Work

We divide our discussion of related work into two parts. First, we review related work in insect identification systems. Then we discuss work in generic object recognition.

2.1 Automated Insect Identification Systems

A few other research groups have developed systems that apply computer vision methods to discriminate among a defined set of insect species.

2.1.1 Automated Bee Identification System (ABIS).

The ABIS system [1] performs identification of bees from forewing features. Each bee is manually positioned, and a photograph of its forewing is obtained in a standard pose. From this image, the wing venation is identified, and a set of key wing cells (areas between veins) are determined. These are used to align and scale the images. Then geometric features (lengths, angles, and areas) are computed. In addition, appearance features are computed from small image patches that have been normalized and smoothed. Classification is performed using Support Vector Machines and Kernel Discriminant Analysis.

This project has obtained very good results, even when discriminating between bee species that are known to be hard to classify. It has also overcome its initial requirement of expert interaction with the image for feature extraction, although it still has the restriction of complex user interaction to manipulate the specimen for the capture of the wing image. The ABIS feature extraction algorithm incorporates prior expert knowledge about wing venation. This facilitates the bee classification task but makes the system very specialized, precluding a straightforward application to other insect identification tasks.

2.1.2 Digital Automated Identification SYstem (DAISY).

DAISY [31] is a general-purpose identification system that has been applied to several arthropod identification tasks, including mosquitoes (Culex p. molestus vs. Culex p. pipiens), palaeartic ceratopogonid biting midges, ophionines (parasites of lepidoptera), parasitic wasps in the genus Enicospilus, and hawk-moths (Sphingidae) of the genus Xylophanes. Unlike our system, DAISY requires user interaction for image capture and segmentation, because specimens must be aligned in the images. This might hamper DAISY's throughput and make its application infeasible in some monitoring tasks where the identification of large samples is required.

In its first version, DAISY built on the progress made in human face detection and recognition via eigen-images [41]. Identification proceeded by determining how well a specimen correlated with an optimal linear combination of the principal components of each class. This approach was shown to be too computationally expensive and error-prone.

In its second version, the core classification engine is based on a random n-tuple classifier (NNC) [26] and plastic self-organizing maps (PSOM). It employs a pattern-to-pattern correlation algorithm called the normalized vector difference (NVD) algorithm. DAISY is capable of handling hundreds of taxa and delivering the identifications in seconds. It also makes possible the addition of new species with only a small computational cost. On the other hand, the use of NNC imposes the requirement of adding enough instances of each species: species with high intra-class variability require many training instances to cover their whole appearance range.

2.1.3 SPecies IDentification, Automated and web accessible (SPIDA-web).

SPIDA-web [8] is an automated species identification system that applies neural networks to wavelet-encoded images. The SPIDA-web prototype has been tested on the spider family Trochanteriidae (119 species in 15 genera) using images of the external genitalia.

SPIDA-web's feature vector is built from a subset of the components of the wavelet transform using the Daubechies 4 wavelet. The spider specimen has to be manipulated by hand, and the image capture, preprocessing, and region selection also require direct user interaction. The images are oriented, normalized, and scaled into a 128 × 128 square prior to analysis. The specimens are classified in a hierarchical manner, first to genus and then to species. The classification engine is composed of a trained neural network for each species in the group. Preliminary results for females indicate that SPIDA is able to classify images to genus level with 95-100% accuracy. The results of species-level classification still have room for improvement, most likely due to the lack of enough training samples.

2.1.4 Summary of previous insect identification work.

This brief review shows that existing approaches rely on manual manipulation and image capture of each specimen. Some systems also require the user to manually identify key image features. To our knowledge, no system exists that identifies insects in a completely automated way, from the manipulation of the specimens to the final labeling. The goal of our research is to achieve full rapid-throughput automation, which we believe is essential to supporting routine bio-monitoring activities. One key to doing this is to exploit recent developments in generic object recognition, which we now discuss.

2.2 Generic Object Recognition

The past decade has seen the emergence of new approaches to object-class recognition based on region detectors, local features, and machine learning. Current methods are able to perform object recognition tasks in images taken in non-controlled environments with variability in the position and orientation of the objects, with cluttered backgrounds, and with some degree of occlusion. Furthermore, these methods only require supervision at the level of whole images; the position and orientation of the object in each training image does not need to be specified. These approaches compare favorably with previous global-feature approaches, for example [34,40].

The local feature approaches begin by applying an interest operator to identify "interesting regions". These regions must be reliably detected, in the sense that the same region can be found in images taken under different lighting conditions, viewing angles, and object poses. Further, for generic object recognition, these detected regions must be robust to variation from one object to another within the same generic class. Additionally, the regions must be informative, that is, they must capture properties that allow objects in different object classes to be discriminated from one another. Special effort has been put into the development of affine-invariant region detectors to achieve robustness to moderate changes in viewing angle. Current affine-invariant region detectors can be divided into two categories: intensity-based detectors and structure-based detectors. The intensity-based region detectors include the Harris-corner detector [16], the Hessian-affine detector [28,29], the maximally stable extremal region detector (MSER) [27], the intensity extrema-based region detector (IBR) [42], and the entropy-based region detector [21]. Structure-based detectors include the edge-based region detector (EBR) [43] and the scale-invariant shape feature (SISF) detector [19].

Upon detection, each region must then be characterized as a vector of features. Several methods have been employed for this purpose, but by far the most widely-used region representation is David Lowe's 128-dimensional SIFT descriptor [25], which is based on histograms of local intensity gradients. Other region descriptors can be computed, including image patches (possibly after smoothing and down-sampling), photometric invariants, and various intensity statistics (mean, variance, skewness, kurtosis).

Once the image has been converted into a collection of vectors, where each vector is associated with a particular region in the image, two general classes of methods have been developed for predicting the object class from this information. The first approach is known as the "bag of features" approach, because it disregards the spatial relationships among the SIFT vectors and treats them as an un-ordered bag of feature vectors. The second approach is known as the "constellation method", because it attempts to capture and exploit the spatial relationships among the detected regions. (Strictly speaking, the term constellation model refers to the series of models developed by Burl, Weber and Perona [5].)

In the bag-of-features approach, the standard method is to take all of the SIFT vectors from the training data and cluster them (possibly preceded by a dimensionality-reduction step such as PCA). Each resulting cluster is taken to define a "keyword", and these keywords are collected into a codebook or dictionary [7,10,20]. The dictionary can then be applied to map each SIFT vector into a keyword, and therefore, to map the bag of SIFT features into a bag of keywords.

The final step is to train a classifier to assign the correct class label to the bag of keywords. The most direct way to do this is to convert the bag into a feature vector and apply standard machine learning methods such as AdaBoost [12]. One simple method is to compute a histogram where the i-th element corresponds to the number of occurrences in the image of the i-th keyword.

Another classification strategy is to employ distance-based learning algorithms such as the nearest-neighbor method. This involves defining a distance measure between two bags of keywords, such as the minimum distance between all keywords from one bag and all keywords from the other bag.

Given a new image to classify, the process of finding interesting regions, representing them as SIFT vectors, mapping those to keywords, and classifying the resulting bags of keywords is repeated.

In the constellation method, several techniques have been applied for exploiting the spatial layout of the detected regions. The star-shaped model [24,37] is a common choice, because it is easy to train and evaluate. Fergus et al. [11] employ a generative model that describes the (x, y) distribution of the regions as a 2-dimensional Gaussian distribution. More complex methods apply discriminative graphical models to capture the relations between the detected regions [2,22,35].

3 Principal Curvature-Based Region Detector

Before describing our stonefly recognition system, we first introduce our new Principal Curvature-Based Region (PCBR) detector. This detector is of independent interest, and we have demonstrated elsewhere that it can be applied to a wide range of object recognition problems [45].

The PCBR detector grew out of earlier experiments that apply Steger's "curvilinear" detector [39] to the stonefly images. The curvilinear detector finds line structures (either curved or straight) such as roads in aerial or satellite images or blood vessels in medical scans. When applied to stonefly images, the detector provides a kind of sketch of the characteristic patterning that appears on the insects' dorsal side. Further, these curvilinear structures can be detected over a range of viewpoints, scales, and illumination changes.

However, in order to produce features that readily map to image regions, which can then be used to build a descriptor (such as SIFT), our PCBR detector ultimately uses only the first steps of Steger's curvilinear detector process: computing the principal eigenvalue of the Hessian matrix at each pixel. We note that since both the Hessian matrix and the related second moment matrix quantify a pixel's local image geometry, they have also been applied in several other interest operators, such as the Harris [16], Harris-affine [30], and Hessian-affine [29] detectors, to find image positions where the local image geometry is changing in more than one direction. Likewise, Lowe's maximal difference-of-Gaussian (DoG) detector [25] also uses components of the Hessian matrix (or at least approximates the sum of the diagonal elements) to find points of interest. However, our PCBR detector is quite different from these other methods. Rather than finding interest "points", our method applies a watershed segmentation to the principal curvature image to find "regions" that are robust to various image transformations. As such, our PCBR detector combines differential geometry, as used by the Harris- and Hessian-affine interest point detectors, with concepts found in region-based structure detectors such as EBR [43] or SISF [19].

3.1 A Curvature-Based Region Detector

Given an input image (Figure 1a), our PCBR region detector can be summarized as follows (a code sketch of the complete pipeline appears after the list):

1. Compute the Hessian matrix image describing each pixel's local image curvature.

2. Form the principal curvature image by extracting the largest positive eigenvalue from each pixel's Hessian matrix (Figure 1b).

3. Apply a grayscale morphological closing on the principal curvature image to remove noise, and threshold the resulting image to obtain a "clean" binary principal curvature image (Figure 1c).

4. Segment the clean image into regions using the watershed transform (Figures 1d and 1e).

5. Fit an ellipse to each watershed region to produce the detected interest regions (Figure 1f).

Each of these steps is detailed in the following paragraphs.
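The following sketch is our illustration of the five steps, not the authors' released implementation. It strings the pipeline together with SciPy and scikit-image primitives; the fixed threshold in step 3 is a placeholder for the eigenvector-flow hysteresis described in Section 3.3.

```python
# A minimal sketch of the PCBR pipeline (steps 1-5), assuming a grayscale
# image scaled to [0, 1]. This is an illustration, not the authors' code.
import numpy as np
from scipy.ndimage import gaussian_filter, grey_closing
from skimage.measure import label, regionprops
from skimage.segmentation import watershed

def pcbr_regions(image, sigma_d=2.0, threshold=0.04):
    # Step 1: Hessian entries from derivatives at Gaussian scale sigma_d.
    smoothed = gaussian_filter(image, sigma_d)
    Ix, Iy = np.gradient(smoothed)
    Ixx, Ixy = np.gradient(Ix)
    _, Iyy = np.gradient(Iy)

    # Step 2: principal curvature image, the largest positive eigenvalue
    # of each pixel's 2x2 Hessian (Eq. 4 in Section 3.2).
    half_trace = (Ixx + Iyy) / 2.0
    disc = np.sqrt(((Ixx - Iyy) / 2.0) ** 2 + Ixy ** 2)
    P = np.maximum(half_trace + disc, 0.0)

    # Step 3: grayscale morphological closing, then a threshold to get a
    # "clean" binary image (a fixed cutoff stands in for hysteresis).
    binary = grey_closing(P, size=(5, 5)) > threshold

    # Step 4: watershed transform; 0-valued pixels seed catchment basins.
    markers = label(~binary)
    segments = watershed(binary.astype(float), markers)

    # Step 5: ellipse (second-moment) parameters for each watershed region.
    return [(r.centroid, r.orientation,
             r.major_axis_length, r.minor_axis_length)
            for r in regionprops(segments)]
```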


Fig. 1 Regions defined by principal curvature. (a) The original, (b) principal curvature, and (c) cleaned binary images. The resulting (d) boundaries and (e) regions that result from applying the watershed transform to (c). (f) The final detected regions created by fitting an ellipse to each region.

3.2 Principal Curvature Image

There are two types of structures that have high curvature in one direction: edges and curvilinear structures. Viewing an image as an intensity surface, the curvilinear structure detector looks for ridges and valleys of this surface. These correspond to white lines on black backgrounds or black lines on white backgrounds. The width of the detected line is determined by the Gaussian scale used to smooth the image (see Eq. 1 below). Ridges and valleys have large curvature in one direction; edges have high curvature in one direction and low curvature in the orthogonal direction; and corners (or highly curved ridges and valleys) have high curvature in two directions. The shape characteristics of the surface can be described by the Hessian matrix, which is given by

$$H(\mathbf{x}, \sigma_D) = \begin{bmatrix} I_{xx}(\mathbf{x}, \sigma_D) & I_{xy}(\mathbf{x}, \sigma_D) \\ I_{xy}(\mathbf{x}, \sigma_D) & I_{yy}(\mathbf{x}, \sigma_D) \end{bmatrix} \qquad (1)$$

where $I_{xx}$, $I_{xy}$, and $I_{yy}$ are the second-order partial derivatives of the image and $\sigma_D$ is the Gaussian scale at which the second partial derivatives of the image are computed.

The interest point detectors mentioned previously [16,30,29] apply the Harris measure (or a similar metric [25]) to determine a point's saliency. The Harris measure is given by

$$\det(A) - k \cdot \operatorname{tr}^2(A) > \text{threshold} \qquad (2)$$

where det is the determinant, tr is the trace, and the matrix A is either the Hessian matrix H (for the Hessian-affine detector) or the second moment matrix,

$$M = \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}, \qquad (3)$$

for the Harris or Harris-affine detectors. The constant k is typically between 0.03 and 0.06, with 0.04 being very common. The Harris measure penalizes (i.e., produces low values for) "long" structures for which the first or second derivative in one particular orientation is very small. One advantage of the Harris metric is that it does not require explicit computation of the eigenvalues or eigenvectors. However, computing the eigenvalues and eigenvectors of a 2 × 2 matrix requires only a single Jacobi rotation to eliminate the off-diagonal term, $I_{xy}$, as noted by Steger [39].

Our PCBR detector complements the previous interest point detectors. We abandon the Harris measure and exploit those very long structures as detection cues. The principal curvature image is given by either

$$P(\mathbf{x}) = \max(\lambda_1(\mathbf{x}), 0) \qquad (4)$$

or

$$P(\mathbf{x}) = \min(\lambda_2(\mathbf{x}), 0) \qquad (5)$$

where $\lambda_1(\mathbf{x})$ and $\lambda_2(\mathbf{x})$ are the maximum and minimum eigenvalues, respectively, of H at $\mathbf{x}$. Eq. 4 provides a high response only for dark lines on a light background (or on the dark side of edges), while Eq. 5 is used to detect light lines against a darker background. We do not take the largest absolute eigenvalue, since that would produce two responses for each edge. For our stonefly project, we have found that the patterning on the stonefly dorsal side is better characterized by the dark lines, and as such we apply Eq. 4. Figure 1(b) shows an eigenvalue image that results from applying Eq. 4 to the grayscale image derived from Fig. 1(a). We utilize the principal curvature image to find the stable regions via watershed segmentation [44].
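For reference (standard linear algebra, not spelled out in the paper), the eigenvalues of the symmetric 2 × 2 Hessian in Eq. 1 have a simple closed form, so Eqs. 4 and 5 can be evaluated with pixelwise array operations rather than an explicit per-pixel eigendecomposition:

$$\lambda_{1,2}(\mathbf{x}) = \frac{I_{xx} + I_{yy}}{2} \pm \sqrt{\left(\frac{I_{xx} - I_{yy}}{2}\right)^2 + I_{xy}^2}.$$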
Fig. 2 (a) Watershed segmentation of the original eigenvalue image (Fig. 1b). (b) Detection results using the "clean" principal curvature image (Fig. 1c).

3.3 Watershed Segmentation

Our detector depends on a robust watershed segmentation. A main problem with segmentation via the watershed transform is its sensitivity to noise and image variations. Figure 2(a) shows the result of applying the watershed algorithm directly to the eigenvalue image (shown in Fig. 1(b)). Many of the small regions are due to noise or other small, unstable image variations. To achieve a more stable watershed segmentation, we first apply a grayscale morphological closing followed by hysteresis thresholding. The grayscale morphological closing operation is defined as

$$f \bullet b = (f \oplus b) \ominus b \qquad (6)$$

where f is the image (P from Eq. 4 for our application), b is a disk-shaped structuring element, and $\oplus$ and $\ominus$ denote grayscale dilation and erosion, respectively. The closing operation removes the small "potholes" in the principal curvature terrain, thus eliminating many local minima that result from noise and would otherwise produce watershed catchment basins.

However, beyond the small (in terms of area of influence) local minima, there are other minima that have larger zones of influence and are not reclaimed by the morphological closing. Some of these minima should indeed be minima, since they have a very low principal curvature response. However, other minima have a high response but are surrounded by even higher peaks in the principal curvature terrain. A primary cause of these high "dips" between ridges is that the Gaussian scale used to compute the Hessian matrix is not large enough to match the thickness of the line structure; hence the second derivative operator produces principal curvature responses that tend toward the center of the thick line but do not quite meet up. One solution to this problem is to use a multiscale approach and try to estimate the best scale to apply at each pixel. Unfortunately, this would require that the Hessian be applied at many scales to find the single characteristic scale for each pixel. Instead, we choose to compute the Hessian at just a few scales ($\sigma_D$ = 1, 2, 4) and then use eigenvector-flow hysteresis thresholding to fill in the gaps between scales.

For eigenvector-flow hysteresis thresholding, we have a high and a low threshold, just as in traditional hysteresis thresholding. For this application, we have set the high threshold at 0.04 to indicate strong principal curvature response. Pixels with a strong response act as seeds that expand out to include connected pixels that are above the low threshold. Unlike traditional hysteresis thresholding, our low threshold is a function of the support each pixel's major eigenvector receives from neighboring pixels. Of course, we want the low threshold to be high enough to avoid over-segmentation and low enough to prevent ridge lines from fracturing. As such, we choose our low threshold on a per-pixel basis by comparing the direction of the major (or minor) eigenvector to the direction of the adjacent pixels' major (or minor) eigenvectors.
This can be done by simply taking the absolute value of the inner (or dot) product of a pixel's normalized eigenvector with that of each neighbor. The inner product is 1 for vectors pointing in the same direction and 0 for orthogonal vectors. If the average dot product over all neighbors is high enough, we set the low-to-high threshold ratio to 0.2 (giving an absolute low threshold of 0.04 · 0.2 = 0.008); otherwise the low-to-high ratio is 0.7 (for an absolute low threshold of 0.028). These ratios were chosen based on experiments with hundreds of stonefly images.

Figure 3 illustrates how the eigenvector flow supports an otherwise weak region. The red arrows are the major eigenvectors, and the yellow arrows are the minor eigenvectors. To improve visibility, we draw them at every fourth pixel. At the point indicated by the large white arrow, we see that the eigenvalue magnitudes are small and the ridge there is almost invisible. Nonetheless, the directions of the eigenvectors are quite uniform. This eigenvector-based active thresholding process yields better performance in building continuous ridges and in filling in scale gaps between ridges, which results in more stable regions (Fig. 2(b)).

The final step is to perform the watershed transform on the clean binary image. Since the image is binary, all black (or 0-valued) pixels become catchment basins, and the midline of the thresholded white ridge pixels potentially becomes a watershed line if it separates two distinct catchment basins. After performing the watershed transform, the resulting segmented regions are fit with ellipses, via PCA, that have the same second moments as the watershed regions. These ellipses then define the final interest regions of the PCBR detector (Fig. 1(f)).

Fig. 3 Illustration of how the eigenvector flow is used to support weak principal curvature response.
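The per-pixel threshold selection can be sketched as follows. The code is our reconstruction: the high threshold of 0.04 and the 0.2/0.7 ratios follow the paper, but the cutoff on the average dot product (0.9 here) is our assumption, since the paper does not report the exact value used to decide when eigenvector support is "high enough".

```python
# A sketch of eigenvector-flow hysteresis thresholding. P is the principal
# curvature image; ex, ey hold each pixel's unit major eigenvector. The
# support cutoff (0.9) is an assumption; high = 0.04 and the 0.2 / 0.7
# ratios follow the paper.
import numpy as np
from scipy.ndimage import label

def eigenvector_flow_threshold(P, ex, ey, high=0.04, support=0.9):
    # Average |dot product| between each pixel's eigenvector and the
    # eigenvectors of its eight neighbors.
    agree = np.zeros_like(P)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dx or dy:
                nx = np.roll(np.roll(ex, dy, axis=0), dx, axis=1)
                ny = np.roll(np.roll(ey, dy, axis=0), dx, axis=1)
                agree += np.abs(ex * nx + ey * ny)
    agree /= 8.0

    # Uniform flow earns the permissive ratio (0.2 * high = 0.008);
    # otherwise the conservative ratio applies (0.7 * high = 0.028).
    low = np.where(agree > support, 0.2 * high, 0.7 * high)

    # Seeds are strong responses; keep connected components of the
    # low-threshold mask that contain at least one seed.
    candidates = P > low
    labels, _ = label(candidates)
    seeded = np.unique(labels[P > high])
    return candidates & np.isin(labels, seeded[seeded > 0])
```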

Fig. 4 (a) Prototype mirror and transportation apparatus. (b) Entire stonefly transportation and imaging setup, with microscope and attached digital camera, light boxes, and computer-controlled pumps for transporting and rotating the specimen.

4 Stonefly Identification System

The goal of our work is to provide a rapid-throughput system for classifying stonefly larvae to the species level. To achieve this, we have developed a system that combines a mechanical apparatus for manipulating and photographing the specimens with a software system for processing and classifying the resulting images. We now describe each of these components in turn.

4.1 Semi-Automated Mechanical Manipulation and Imaging of Stonefly Larvae

The purpose of the hardware system is to speed up the image capture process in order to make bio-monitoring viable and to reduce variability during image capture. To achieve consistent, repeatable image capture, we have designed and constructed a software-controlled mechanical stonefly larval transport and imaging apparatus that positions specimens under a microscope, rotates them to obtain a dorsal view, and photographs them with a high-resolution digital camera.

Figure 4 shows the mechanical apparatus. The stoneflies are kept in alcohol (70% ethanol) at all times, and therefore, the apparatus consists of two alcohol reservoirs connected by an alcohol-filled tube (having a diamond cross-section). To photograph a specimen, it is manually inserted into the acrylic well shown at the right edge of the figure and then pumped through the tube. Infrared detectors positioned part way along the tube detect the passage of the specimen and cut off the pumps. Then a side fluid jet "captures" the specimen in the field of view of the microscope. When power to this jet is cut off, the specimen settles to the bottom of the tube, where it can be photographed. The side jet can be activated repeatedly to spin the specimen to obtain different views. Once a suitable image has been obtained (a decision currently made by the human operator), the specimen is then pumped out of the tube and into the plexiglass well at the left edge of the figure. For this project, a "suitable image" is one that gives a good back (dorsal side) view of the specimen. In future work, we plan to construct a "dorsal view detector" to automatically determine when a good dorsal image has been obtained. In addition, future versions of the apparatus will physically sort each specimen into an appropriate bin based on the output of the recognizer.

Figure 4(b) shows the apparatus in place under the microscope. Each photograph taken by the camera captures two images at a 90 degree separation via a set of mirrors. The original purpose of this was to support 3D reconstruction of the specimens, but for the work described in this paper, it doubles the probability of obtaining a good dorsal view in each shot.

All images are captured using a QImaging MicroPublisher 5.0 RTV 5 megapixel color digital camera. The digital camera is attached to a Leica MZ9.5 high-performance stereo microscope at 0.63x magnification. We use a 0.32 objective on the microscope to increase the field of view, depth of field, and working distance. Illumination is provided by gooseneck light guides powered by Volpi V-Lux 1000 cold light sources. Diffusers installed on the guides reduce glare, specular reflections, and hard shadows. Care was taken in the design of the apparatus to minimize the creation of bubbles in the alcohol, as these could confuse the recognizer.

With this apparatus, we can image a few tens of specimens per hour. Figure 6 shows some example images obtained using this stonefly imaging assembly.

4.2 Training and Classification

Our approach to classification of stonefly larvae follows closely the "bag of features" approach but with several modifications and extensions. Figure 5 gives an overall picture of the data flow during training and classification, and Tables 1, 2, and 3 provide pseudo-code for our method. We now provide a detailed description.

The training process requires two sets of images, one for defining the dictionaries and one for training the classifier. In addition, to assess the accuracy of the learned classifier, we need a holdout test data set. Therefore, we begin by partitioning the data at random into three subsets: clustering, training, and testing.

Table 1 Dictionary Construction. D is the number of region detectors (3 in our case), and K is the number of stonefly taxa to be recognized (4 in our case).

Dictionary Construction
  For each detector d = 1,...,D
    For each class k = 1,...,K
      Let $S_{d,k}$ be the set of SIFT vectors that results from applying detector d to all cluster images from class k.
      Fit a Gaussian mixture model to $S_{d,k}$ to obtain a set of mixture components $\{C_{d,k,\ell}\}$, $\ell = 1,...,L$. The GMM estimates the probability of each SIFT vector $s \in S_{d,k}$ as
        $$P(s) = \sum_{\ell=1}^{L} C_{d,k,\ell}(s \mid \mu_{d,k,\ell}, \Sigma_{d,k,\ell})\, P(\ell),$$
      where $C_{d,k,\ell}$ is a multivariate Gaussian distribution with mean $\mu_{d,k,\ell}$ and diagonal covariance matrix $\Sigma_{d,k,\ell}$.
      Define the keyword mapping function $\mathrm{key}_{d,k}(s) = \arg\max_{\ell} C_{d,k,\ell}(s \mid \mu_{d,k,\ell}, \Sigma_{d,k,\ell})$.
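The construction in Table 1 can be reproduced with an off-the-shelf EM implementation. The sketch below (ours, not the authors' code) uses scikit-learn's GaussianMixture with diagonal covariances and k-means initialization, mirroring the setup described in Section 4.2; the keyword assignment ignores the mixture weights P(ℓ), as specified above.

```python
# A sketch of Table 1's dictionary construction for one (detector, class)
# pair, using scikit-learn in place of the authors' EM implementation.
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def build_dictionary(sift_vectors, n_keywords):
    """sift_vectors: (n, 128) array of SIFT descriptors from one detector
    applied to one taxon's clustering images."""
    gmm = GaussianMixture(n_components=n_keywords,
                          covariance_type="diag",   # diagonal Sigma_{d,k,l}
                          init_params="kmeans",     # EM initialized from k-means
                          max_iter=100)
    return gmm.fit(sift_vectors)

def keyword(gmm, s):
    """key_{d,k}(s): the component with the highest Gaussian density for s,
    disregarding the mixture probabilities P(l)."""
    densities = [multivariate_normal.logpdf(s, mean=mu, cov=np.diag(var))
                 for mu, var in zip(gmm.means_, gmm.covariances_)]
    return int(np.argmax(densities))
```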

Table 2 Feature Vector Construction

Feature Vector Construction
  To construct a feature vector for an image:
  For each detector d = 1,...,D
    For each class k = 1,...,K
      Let $H_{d,k}$ be the keyword histogram for detector d and class k.
      Initialize $H_{d,k}[\ell] = 0$ for $\ell = 1,...,L$.
      For each SIFT vector s detected by detector d, increment $H_{d,k}[\mathrm{key}_{d,k}(s)]$.
  Let H be the concatenation of the $H_{d,k}$ histograms for all d and k.
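Given the fitted dictionaries, Table 2 reduces to a few lines of code. This sketch (ours, reusing keyword() from the previous sketch) assumes dictionaries[d][k] is the GMM for detector d and class k; since the per-class keyword counts may differ, the final vector length is the sum of all histogram lengths.

```python
# A sketch of Table 2's feature-vector construction: one keyword histogram
# per (detector, class) dictionary, concatenated into the final vector H.
import numpy as np

def feature_vector(regions_by_detector, dictionaries):
    """regions_by_detector[d]: list of SIFT vectors found in the image by
    detector d; dictionaries[d][k]: fitted GMM for detector d, class k."""
    histograms = []
    for d, sift_vectors in sorted(regions_by_detector.items()):
        for k, gmm in sorted(dictionaries[d].items()):
            h = np.zeros(gmm.n_components)
            for s in sift_vectors:
                h[keyword(gmm, s)] += 1.0   # keyword() from the GMM sketch
            histograms.append(h)
    return np.concatenate(histograms)       # length A = sum over d,k of L
```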

Fig. 5 Object recognition system overview: feature generation and classification components.

As mentioned previously, we apply three region detectors to each image: (a) the Hessian-affine detector [29], (b) the Kadir entropy detector [21], and (c) our PCBR detector. We use the Hessian-affine detector implementation available from Mikolajczyk¹ with a detection threshold of 1000. For the Kadir entropy detector, we use the binary code made available by the author² and set the scale search range between 25 and 45 pixels, with the saliency threshold at 58. All the parameters for the two detectors mentioned above are obtained empirically by modifying the default values in order to obtain reasonable regions. For the PCBR detector, we detect in three scales with $\sigma_D$ = 1, 2, 4. The higher value in hysteresis thresholding is 0.04. The two ratios applied to get the lower thresholds are 0.2 and 0.7, producing low thresholds of 0.008 and 0.028, respectively. Each detected region is represented by a SIFT vector using Mikolajczyk's modification to the binary code distributed by David Lowe [25].

Table 3 Training and Classification. B is the number of bootstrap iterations (i.e., the size of the classifier ensemble).

Training
  Let $T = \{(H_i, y_i)\}$, $i = 1,...,N$ be the set of N training examples, where $H_i$ is the concatenated histogram for training image i and $y_i$ is the corresponding class label (i.e., stonefly species).
  For bootstrap replicate b = 1,...,B:
    Construct training set $T_b$ by sampling N training examples randomly with replacement from T.
    Let $LMT_b$ be the logistic model tree fitted to $T_b$.

Classification
  Given a test image, let H be the concatenated histogram resulting from feature vector construction.
  Let votes[k] = 0 be the number of votes for class k.
  For b = 1,...,B:
    Let $\hat{y}_b$ be the class predicted by $LMT_b$ applied to H.
    Increment votes[$\hat{y}_b$].
  Let $\hat{y} = \arg\max_k$ votes[k] be the class with the most votes.
  Predict $\hat{y}$.
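Table 3's training and voting loops are plain bagging. The sketch below illustrates them; since no standard library ships logistic model trees, scikit-learn's LogisticRegression stands in for the LMT base learner (an acknowledged substitution), and labels are assumed to be encoded as 0..K-1.

```python
# A sketch of Table 3's bagged ensemble with majority voting. The LMT base
# learner is replaced by LogisticRegression as a stand-in.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_ensemble(H, y, B, seed=0):
    """H: (N, A) concatenated histograms; y: (N,) labels in 0..K-1."""
    rng = np.random.default_rng(seed)
    ensemble = []
    for _ in range(B):
        idx = rng.integers(0, len(y), size=len(y))  # bootstrap replicate T_b
        ensemble.append(LogisticRegression(max_iter=1000).fit(H[idx], y[idx]))
    return ensemble

def predict(ensemble, h, n_classes):
    votes = np.zeros(n_classes, dtype=int)
    for model in ensemble:                  # each LMT_b casts one vote
        votes[int(model.predict(h.reshape(1, -1))[0])] += 1
    return int(np.argmax(votes))            # class with the most votes
```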

¹ www.robots.ox.ac.uk/~vgg/research/affine/
² www.robots.ox.ac.uk/~timork/salscale.html

We then construct a separate dictionary for each region detector d and each class k. Let $S_{d,k}$ be the SIFT descriptors for the regions found by detector d in all cluster-set images from class k. We fit a Gaussian mixture model (GMM) to $S_{d,k}$ via the Expectation-Maximization (EM) algorithm. A GMM with L components has the form

$$p(s) = \sum_{\ell=1}^{L} C_{d,k,\ell}(s \mid \mu_{d,k,\ell}, \Sigma_{d,k,\ell})\, P(\ell) \qquad (7)$$

where s denotes a SIFT vector and the component probability distribution $C_{d,k,\ell}$ is a multivariate Gaussian density function with mean $\mu_{d,k,\ell}$ and covariance matrix $\Sigma_{d,k,\ell}$ (constrained to be diagonal). Each fitted component of the GMM defines one of L keywords. Given a new SIFT vector s, we compute the corresponding keyword $\ell = \mathrm{key}_{d,k}(s)$ by finding the $\ell$ that maximizes $p(s \mid \mu_{d,k,\ell}, \Sigma_{d,k,\ell})$. Note that we disregard the mixture probabilities $P(\ell)$. This is equivalent to mapping s to the nearest cluster center $\mu_{d,k,\ell}$ under the Mahalanobis distance defined by $\Sigma_{d,k,\ell}$.

We initialize EM by fitting each GMM component to each cluster obtained by the k-means algorithm. The k-means algorithm is initialized by picking random elements. The EM algorithm iterates until the change in the fitted GMM error from the previous iteration is less than 0.05% or until a defined number of iterations is reached. In practice, learning of the mixture almost always reaches the first stopping criterion (the change in error is less than 0.05%).

After building the keyword dictionaries, we next construct a set of training examples by applying the three region detectors to each training image.
We characterize each region found by detector d with a SIFT descriptor and then map the SIFT vector to the nearest keyword (as described above) for each class k using $\mathrm{key}_{d,k}$. We accumulate the keywords to form a histogram $H_{d,k}$ and concatenate these histograms to produce the final feature vector. With D detectors, K classes, and L mixture components, the number of attributes A in the final feature vector (i.e., the concatenated histogram) is D · K · L.

Upon constructing the set of training examples, we next learn the classifier. We employ a state-of-the-art ensemble classification method: bagged logistic model trees. Bagging [3] is a general method for constructing an ensemble of classifiers. Given a set T of labeled training examples and a desired ensemble size B, it constructs B bootstrap replicate training sets $T_b$, b = 1,...,B. Each bootstrap replicate is a training set of size |T| constructed by sampling uniformly with replacement from T. The learning algorithm is then applied to each of these replicate training sets $T_b$ to produce a classifier $LMT_b$. To predict the class of a new image, each $LMT_b$ is applied to the new image and the predictions vote to determine the overall classification. The ensemble of LMTs interacts only with the feature vectors generated by the CFH method.

Our chosen learning algorithm is the logistic model tree (LMT) method of Landwehr, Hall, and Frank [23]. An LMT has the structure of a decision tree where each leaf node contains a logistic regression classifier. Each internal node tests the value of one chosen feature from the feature vector against a threshold and branches to the left child if the value is less than the threshold and to the right child if the value is greater than or equal to the threshold. LMTs are fit by the standard top-down divide-and-conquer method employed by CART [4] and C4.5 [36]. At each node in the decision tree, the algorithm must decide whether to introduce a split at that point or make the node into a leaf (and fit a logistic regression model). This choice is made by a one-step lookahead search in which all possible features and thresholds are evaluated to see which one will result in the best improvement in the fit to the training data. In standard decision trees, efficient purity measures such as the GINI index or the information gain can be employed to predict the quality of the split. In LMTs, it is instead necessary to fit a logistic regression model to the training examples that belong to each branch of the proposed split. This is computationally expensive, although the expense is substantially reduced via a clever incremental algorithm based on logit-boost [13]. Thorough benchmarking experiments show that LMTs give robust state-of-the-art performance [23].

5 Experiments and Results

We now describe the series of experiments carried out to evaluate our system. We first discuss the data set and show some example images to demonstrate the difficulty of the task. Then we present the series of experiments and discuss the results.

5.1 Stonefly Dataset

We collected 263 specimens of four stonefly taxa from freshwater streams in the mid-Willamette Valley and Cascade Range of Oregon: the species Calineuria californica (Banks), the species Doroneuria baumanni Stark & Baumann, the species Hesperoperla pacifica (Banks), and the genus Yoraperla. Each specimen was independently classified by two experts, and only specimens that were classified identically by both experts were considered in the study. Each specimen was placed in its own vial with an assigned control number and then photographed using the apparatus described in Section 4. Approximately ten photos were obtained of each specimen, which yields 20 individual images. These were then manually examined, and all images that gave a dorsal view within 30 degrees of vertical were selected for analysis. Table 4 summarizes the number of specimens and dorsal images obtained.

Table 4 Specimens and images employed in the study

Taxon          Specimens   Images
Calineuria            85      400
Doroneuria            91      463
Hesperoperla          58      253
Yoraperla             29      124

Table 5 Partitions for 3-fold cross-validation

Partition   # Specimens   # Images
1                    87        413
2                    97        411
3                    79        416

A potential flaw in our procedure is that the specimen vials tended to be grouped together by taxon (i.e., several Calineuria together, then several Doroneuria, etc.), so that in any given photo session, most of the specimens being photographed belong to a single taxon. This could introduce some implicit cues (e.g., lighting, bubbles, scratches) that might permit the learning algorithm to "cheat". The apparatus constrains the lighting so that it is very consistent in all sessions. We did detect some bubbles in the images. In cases where the region detectors found those bubbles, we manually removed the detections to ensure that they were not influencing the results.

Figure 6 shows some of the images collected for the study. Note the variety of colors, sizes, and poses. Note also that Yoraperla, which is in the family Peltoperlidae, is quite distinctive in color and shape. The other three taxa, which are all in the family Perlidae, are quite similar to each other, and the first two (Calineuria and Doroneuria) are exceedingly difficult to distinguish. This is emphasized in Figure 7, which shows closeup dorsal views. To verify the difficulty of discriminating these two taxa, we conducted an informal study that tested the ability of humans to discriminate between them. A total of 26 students and faculty from Oregon State University were allowed to train on 50 randomly-selected images of Calineuria and Doroneuria, and were subsequently tested with another 50 images. Most of the subjects (21) had some prior entomological experience. The mean score was 78.6% correctly identified (std. dev. = 8.4). There was no statistical difference between the performance of entomologists and non-entomologists (Wilcoxon two-sample test [38], W = 57.5, p ≤ 0.5365).

Given the characteristics of the taxa, we defined three discrimination tasks, which we term CDHY, JtHY, and CD, as follows:

CDHY: Discriminate among all four taxa.

JtHY: Merge Calineuria and Doroneuria to define a single class, and then discriminate among the resulting three classes.

CD: Focus on discriminating only between Calineuria and Doroneuria.

The CDHY task assesses the overall performance of the system. The JtHY task is most relevant to biomonitoring, since Calineuria and Doroneuria have identical pollution tolerance levels; hence, discriminating between them is not critical for our application. Finally, the CD task presents a very challenging object recognition problem, so it is interesting to see how well our method can do when it focuses only on this two-class problem.

Performance on all three tasks is evaluated via three-fold cross-validation. The images are randomly partitioned into three equal-sized sets under the constraint that all images of any given specimen must be placed in the same partition. In addition, to the extent possible, the partitions are stratified so that the class frequencies are the same across the three partitions. Table 5 gives the number of specimens and images in each partition. In each "fold" of the cross-validation, one partition serves as the clustering data set for defining the dictionaries, a second partition serves as the training data set, and the third partition serves as the test set.
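The specimen-level constraint is what scikit-learn calls grouped cross-validation. A brief sketch (ours; note that GroupKFold does not stratify by class, so it only approximates the stratification described above):

```python
# A sketch of the specimen-constrained three-way split: every image of a
# given specimen lands in exactly one partition.
import numpy as np
from sklearn.model_selection import GroupKFold

def three_partitions(n_images, specimen_ids):
    """specimen_ids: array of length n_images naming each image's specimen."""
    splitter = GroupKFold(n_splits=3)
    X = np.zeros((n_images, 1))              # features are irrelevant here
    return [test for _, test in splitter.split(X, groups=specimen_ids)]

# In each fold, one partition defines the keyword dictionaries (clustering),
# one trains the LMT ensemble, and the third is held out for testing.
```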

Our approach requires specification of the following parameters:

– the number L of mixture components in the Gaussian mixture model for each dictionary,

– the number B of bootstrap replicates for bagging,

– the minimum number M of training examples in the leaves of the logistic model trees, and

– the number I of iterations of logit boost employed for training the logistic model trees.

These parameters are set as follows. L is determined for each species through a series of EM fitting procedures. We increment the number of mixture components until the GMM is capable of modeling the data distribution, that is, until the GMM achieves a relative fitting error below 5% in fewer than 100 EM iterations. The resulting values of L are 90, 90, 85, and 65 for Calineuria, Doroneuria, Hesperoperla, and Yoraperla, respectively. Likewise, B is determined by evaluating a series of bagging ensembles with different numbers of classifiers on the same training set. The number of classifiers in each ensemble is incremented by two until the training error starts to increase, at which point B is simply assigned to be five less than that number. The reason we assign B to be five less than the number that causes the training error to increase, rather than simply assigning it to the largest number that produces the lowest error, is that the smaller number of bootstrap replicates helps to avoid overfitting. Table 6 shows the value of B for each of the three tasks. The minimum number M of instances that each leaf in the LMT requires to avoid pruning is set to 15, which is the default value for the LMT implementation recommended by the authors. The number of logit boost iterations I is set by internal cross-validation within the training set while the LMT is being induced.

Table 6 Number of bagging iterations for each experiment.

Experiment        Bagging Iterations (B)
4-species: CDHY                       20
3-species: JtHY                       20
2-species: CD                         18

Fig. 6 Example images of different stonefly larvae species. (a) Calineuria, (b) Doroneuria, (c) Hesperoperla, and (d) Yoraperla.

Fig. 7 Images visually comparing Calineuria and Doroneuria. (a) Calineuria, (b) Doroneuria, (c) Calineuria detail, and (d) Doroneuria detail.

5.2 Results

We designed our experiments to achieve two goals. First, we wanted to see how well the CFH method (with three region detectors) coupled with an ensemble of LMTs performs on the three recognition tasks. To establish a basis for evaluation, we also apply the method of Opelt et al. [33], which is currently one of the best object recognition systems. Second, we wanted to evaluate how each of the three region detectors affects the performance of the system. To achieve this second goal, we trained our system using 7 different configurations, corresponding to training with all three detectors, all pairs of detectors, and each individual detector.

5.2.1 Overall Results

Table 7 shows the classification rates achieved by the CFH method on the three discrimination tasks; Tables 8, 9, and 10 show the confusion matrices for the three tasks.

Table 7 Percentage of images correctly classified by our system with all three region detectors, along with a 95% confidence interval.

Task   Accuracy [%]
CDHY   82.42 ± 2.12
JtHY   95.40 ± 1.16
CD     79.37 ± 2.70

Table 8 CDHY confusion matrix of the combined Kadir, Hessian-affine, and PCBR detectors

predicted as ⇒   Cal.   Dor.   Hes.   Yor.
Calineuria        315     79      6      0
Doroneuria         80    381      2      0
Hesperoperla       24     22    203      4
Yoraperla           1      0      0    123

Table 9 JtHY confusion matrix of the combined Kadir, Hessian-affine, and PCBR detectors

predicted as ⇒   Joint CD   Hes.   Yor.
Joint CD               857      5      1
Hesperoperla            46    203      4
Yoraperla                0      1    123

Table 10 CD confusion matrix of the combined Kadir, Hessian-affine, and PCBR detectors

predicted as ⇒   Calineuria   Doroneuria
Calineuria              304           96
Doroneuria               82          381
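As a consistency check (our arithmetic, using the 1240 test images of Table 4, of which 863 are Calineuria or Doroneuria), the headline accuracies in Table 7 can be recovered directly from the confusion matrices:

$$\frac{315+381+203+123}{1240} = 82.42\%, \qquad \frac{857+203+123}{1240} = 95.40\%, \qquad \frac{304+381}{863} = 79.37\%.$$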

Table 11 Comparison of CD classification rates (in %) using Opelt's method and our system with different combinations of detectors. A √ indicates the detector(s) used.

Hessian-Affine   Kadir Entropy   PCBR   |   Opelt [33]   Ours
√                –               –      |   60.59        70.10
–                √               –      |   62.63        70.34
–                –               √      |   67.86        79.03
√                √               √      |   70.10        79.37

On the CDHY task, our system achieves 82% correct classifications. The confusion matrix shows that it achieves near-perfect recognition of Yoraperla. It also recognizes Hesperoperla very well, with only a few images misclassified as Calineuria or Doroneuria. As expected, the main difficulty is to discriminate Calineuria and Doroneuria. When these two classes are pooled in the JtHY task, performance reaches 95% correct, which is excellent. It is interesting to note that if we had applied the four-way classifier and then pooled the predictions of the classifiers, the 3-class performance would have been slightly better (95.48% versus 95.08%). The difference is that in the JtHY task, we learn a combined dictionary for the merged Calineuria and Doroneuria (CD) class, whereas in the 4-class task, each taxon has its own dictionaries.

A similar phenomenon occurs in the 2-class CD task. Our method attains a 79% correct classification rate when trained on only these two classes. If instead we applied the CDHY classifiers and treated predictions for Hesperoperla and Yoraperla as errors, the performance would be slightly better (79.61% versus 79.37%). These differences are not statistically significant, but they do suggest that in future work it might be useful to build separate dictionaries and classifiers for groups within each taxon (e.g., first cluster by size and color) and then map the resulting predictions back to the 4-class task. On this binary classification task, our method attains 79% correct classification, which is approximately equal to the mean for human subjects with some prior experience.

Our system is capable of assigning a confidence measure to each of the existing categories. We performed a series of experiments where the species assignment is thresholded by the difference between the two highest confidence measures. In this series, we vary the threshold from 0 to 1. If the difference is higher than the defined threshold, the label with the highest confidence is assigned; otherwise the specimen is declared "uncertain". Figure 8 plots accuracy against the rejection rate. The curves show that if we reject around 30% of the specimens, all the tasks reach an accuracy higher than 90%, even the CD task.
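The rejection rule is simple to state in code. This sketch (ours) assumes the ensemble's per-class vote fractions are available as a confidence vector; sweeping the threshold from 0 to 1 and plotting the accuracy of the accepted specimens against the fraction rejected yields curves like those in Fig. 8.

```python
# A sketch of the confidence-difference rejection rule used for Fig. 8.
import numpy as np

def classify_or_reject(confidences, threshold):
    """confidences: per-class scores for one specimen (e.g., vote fractions).
    Returns a class index, or None to flag the specimen as "uncertain"."""
    top, runner_up = np.sort(confidences)[-2:][::-1]
    if top - runner_up > threshold:
        return int(np.argmax(confidences))
    return None
```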
In differences are not statistically significant, but they do particular, the complexity of computing Opelt’s feature- suggest that in future work it might be useful to build to-image distance matrix is O(T 2R2D), where T is the separate dictionaries and classifiers for groups within number of training images, R is the maximum number of each taxon (e.g., first cluster by size and color) and then detected image regions in a single image, and D = 128 is map the resulting predictions back to the 4-class task. the SIFT vector dimension. The total number of detected On this binary classification task, our method attains training regions, T · R, is easily greater than 20, 000) in 79% correct classification, which is approximately equal this application. On the other hand, training our sys- to the mean for human subjects with some prior experi- tem is much faster. The complexity of building the LMT ence. ensemble classifier (which dominates the training com- Our system is capable of giving a confidence mea- putation) is O(T · A · I), where A is the number of his- sure to each of the existing categories. We performed togram attributes and I is the number of LMT induction a series of experiments where the species assignment is iterations (typically in the hundreds). 14 Natalia Larios et al.
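To make the dictionary-and-histogram pipeline discussed above concrete, the following sketch shows one plausible realization of the concatenated-histogram construction. This is our illustration only: it assumes k-means codebooks and hypothetical helper names, whereas the paper's actual clustering procedure and dictionary sizes may differ.

    import numpy as np
    from sklearn.cluster import KMeans

    def learn_dictionary(train_descriptors, k=100):
        """Cluster training SIFT descriptors (n x 128) into k 'learned
        features'; one dictionary is built per region detector."""
        return KMeans(n_clusters=k, n_init=10).fit(train_descriptors)

    def feature_histogram(dictionary, descriptors):
        """Map each descriptor to its nearest cluster center and count."""
        words = dictionary.predict(descriptors)
        hist = np.bincount(words, minlength=dictionary.n_clusters)
        return hist / max(hist.sum(), 1)  # normalize to frequencies

    def concatenated_feature_histogram(dictionaries, descriptors_by_detector):
        """The CFH attribute vector: per-detector histograms, concatenated,
        forming the input to the ensemble classifier."""
        return np.concatenate([feature_histogram(d, x) for d, x
                               in zip(dictionaries, descriptors_by_detector)])

The attribute count A in the training-cost estimate above is simply the length of this concatenated vector, i.e., the sum of the per-detector dictionary sizes.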

[Figure 8: three accuracy/rejection plots, one per task (left to right: CD, CDHY, JtHY), each plotting Correctly Classified [%] against Rejected [%].]

Fig. 8 Accuracy/Rejection curves for the three experiments with all the detectors combined while changing the confidence- difference threshold
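For concreteness, here is a minimal sketch of the confidence-difference rejection rule behind these curves (our reconstruction of the procedure described above; the function names are hypothetical):

    import numpy as np

    def predict_or_reject(class_confidences, threshold):
        """Assign the top-scoring class only if its confidence exceeds the
        runner-up's by more than `threshold`; otherwise the specimen is
        declared 'uncertain' and routed to manual identification."""
        order = np.argsort(class_confidences)[::-1]
        margin = class_confidences[order[0]] - class_confidences[order[1]]
        return order[0] if margin > threshold else "uncertain"

    def accuracy_rejection_curve(all_confidences, true_labels, steps=101):
        """Sweep the threshold from 0 to 1, recording the rejection rate
        and the accuracy over the specimens that were not rejected."""
        curve = []
        for t in np.linspace(0.0, 1.0, steps):
            preds = [predict_or_reject(c, t) for c in all_confidences]
            kept = [(p, y) for p, y in zip(preds, true_labels)
                    if p != "uncertain"]
            rejected = 100.0 * (1.0 - len(kept) / len(true_labels))
            accuracy = (100.0 * np.mean([p == y for p, y in kept])
                        if kept else 100.0)
            curve.append((rejected, accuracy))
        return curve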

5.2.2 Results for Multiple Region Detectors

Table 12 Classification rates using our system with different combinations of detectors. A √ indicates the detector(s) used.

    Hessian Affine   Kadir Entropy   PCBR    CDHY   JtHY     CD
          √                                  73.14  90.32  70.10
                           √                 70.64  90.56  70.34
                                      √      71.69  86.21  79.03
          √                √                 78.14  94.19  74.16
          √                           √      80.48  93.79  78.68
                           √          √      78.31  92.09  68.83
          √                √          √      82.42  95.40  79.37

Table 12 summarizes the results of applying all combinations of one, two, and three detectors to the CDHY, JtHY, and CD tasks. The first three lines show that each detector has unique strengths when applied alone: the Hessian-affine detector works best on the 4-class CDHY task, the Kadir detector is best on the 3-class JtHY task, and the PCBR detector gives the best 2-class CD results. In the pairwise experiments, the Hessian-affine and PCBR detectors appear to complement each other well. The best pairwise result on the JtHY task is obtained by the Kadir–Hessian pair, which appears to be better suited to tasks that require an overall assessment of shape. Finally, the combination of all three detectors gives the best results on each task.

To understand the region detector results, it is helpful to look at their behaviors. Figures 9 and 10 show the regions found by each detector on selected Calineuria and Doroneuria specimens. The detectors behave in quite different ways. The PCBR detector is very stable, although it does not always identify all of the relevant regions. The Kadir detector is also stable, but it finds a very large number of regions, most of which are not relevant. The Hessian-affine detector finds very good small-scale regions, but its larger-scale detections are not useful for classification. The PCBR detector focuses on the interior of the specimens, whereas the other detectors (especially Kadir) tend to find points on the edges between the specimens and the background. In addition to concentrating on the interior, the regions found by the PCBR detector are more "meaningful" in that they correspond better to body parts. This may explain why the PCBR detector did a better job on the CD task.
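The intuition behind PCBR can be caricatured in a few lines. The toy sketch below is ours alone (it omits the multi-scale analysis and stability selection of the real detector): it thresholds the dominant eigenvalue of the image Hessian, i.e., the principal curvature, and lets a watershed transform group the resulting ridge pixels into candidate regions.

    import numpy as np
    from scipy import ndimage
    from skimage.segmentation import watershed

    def toy_principal_curvature_regions(image, sigma=2.0, rel_thresh=0.05):
        """Toy PCBR: per-pixel dominant Hessian eigenvalue, thresholded,
        then grouped into regions by watershed segmentation."""
        img = ndimage.gaussian_filter(image.astype(float), sigma)
        # Second derivatives forming the 2x2 Hessian [[Ixx, Ixy], [Ixy, Iyy]]
        Ixx = ndimage.sobel(ndimage.sobel(img, axis=1), axis=1)
        Iyy = ndimage.sobel(ndimage.sobel(img, axis=0), axis=0)
        Ixy = ndimage.sobel(ndimage.sobel(img, axis=0), axis=1)
        # Closed-form eigenvalues of a symmetric 2x2 matrix
        half_trace = 0.5 * (Ixx + Iyy)
        disc = np.sqrt((0.5 * (Ixx - Iyy)) ** 2 + Ixy ** 2)
        curvature = np.maximum(np.abs(half_trace + disc),
                               np.abs(half_trace - disc))
        ridges = curvature > rel_thresh * curvature.max()
        markers, _ = ndimage.label(ridges)   # connected ridge components
        # Flooding the inverted curvature map groups ridges into regions
        return watershed(-curvature, markers, mask=ridges)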
6 Conclusions and Future Work

This paper has presented a combined hardware-software system for rapid-throughput classification of stonefly larvae. The goal of the system is to perform cost-effective bio-monitoring of freshwater streams. To this end, the mechanical apparatus is capable of nearly unassisted manipulation and imaging of stonefly specimens while also obtaining consistently high-quality images. The generic object recognition algorithms attain classification accuracy that is sufficiently good (82% for 4 classes; 95% for 3 classes) to support the application. By rejecting for manual classification those specimens on which the classifier's confidence is not high enough, the system can reach an accuracy above 90% on all of the defined tasks while sending only a reasonable 30% of the samples for further processing.

We compared our CFH method to Opelt's related state-of-the-art method on the most difficult task, discriminating Calineuria from Doroneuria. The CFH method always achieved better performance. It is also worth noting that human subjects with some prior experience, using the same images, reached an accuracy only equal to that of our method. Finally, we described a new region detector, the principal curvature-based region (PCBR) detector. Our experiments demonstrated that PCBR is particularly useful for discriminating between the two visually similar species and, as such, provides an important contribution in attaining greater accuracy.

There are a few details that must be addressed before the system is ready for field testing. First, the mechanical apparatus must be modified to include mechanisms for sorting the specimens into bins after they have been photographed and classified. Second, we need to develop an algorithm for determining whether a good dorsal image of the specimen has been obtained.

Fig. 9 Visual comparison of the regions output by the three detectors on three Calineuria specimens: (a) Hessian-affine, (b) Kadir Entropy, (c) PCBR

We are currently exploring several methods for this, including training the classifier described in this paper for the task. Third, we need to evaluate the performance of the system on a broader range of taxa. A practical bio-monitoring system for the Willamette Valley will need to be able to recognize around 8 stonefly taxa. Finally, we need to develop methods for dealing with specimens that are not stoneflies or that do not belong to any of the taxa that the system is trained to recognize. We are studying SIFT-based density estimation techniques for this purpose.

Beyond freshwater stream bio-monitoring, there are many other potential applications for rapid-throughput arthropod recognition systems. One area that we are studying involves automated population counts of soil mesofauna for soil biodiversity studies. Soil mesofauna are small arthropods (mites, spiders, pseudo-scorpions, etc.) that live in soils. There are upwards of 2000 species, and the study of their interactions and population dynamics is critical for understanding soil ecology and soil responses to different land uses. In our future work, we will test the hypothesis that the methods described in this paper, when combined with additional techniques for shape analysis and classification, will be sufficient to build a useful system for classifying soil mesofauna.

Acknowledgements We wish to thank Andreas Opelt for providing the Matlab code of his PAMI'06 method for the comparison experiment. We also wish to thank Asako Yamamuro and Justin Miles for their assistance with stonefly identification for the dataset.

References

1. Arbuckle, T., Schroder, S., Steinhage, V., Wittmann, D.: Biodiversity informatics in action: identification and monitoring of bee species using ABIS. In: Proc. 15th Int. Symp. Informatics for Environmental Protection, vol. 1, pp. 425–430. Zurich (2001)
2. Bouchard, G., Triggs, B.: Hierarchical part-based visual object categorization. In: IEEE Conference on Computer Vision & Pattern Recognition, pp. I 710–715 (2005). URL http://lear.inrialpes.fr/pubs/2005/BT05
3. Breiman, L.: Bagging predictors. Machine Learning 24(2), 123–140 (1996). URL citeseer.ist.psu.edu/breiman96bagging.html
4. Breiman, L., Friedman, J., Olshen, R., Stone, C.: Classification and Regression Trees. Chapman & Hall, New York, NY, USA (1984)
5. Burl, M., Weber, M., Perona, P.: A probabilistic approach to object recognition using local photometry and global geometry. In: Proc. ECCV, pp. 628–641 (1998)
6. Carter, J., Resh, V., Hannaford, M., Myers, M.: Macroinvertebrates as biotic indicators of environmental quality. In: F. Hauer and G. Lamberti, editors. Methods in Stream Ecology. Academic Press, San Diego (2006)
7. Csurka, G., Dance, C., Fan, L., Williamowski, J., Bray, C.: Visual categorization with bags of keypoints. ECCV'04 workshop on Statistical Learning in Computer Vision pp. 59–74 (2004)
8. Do, M., Harp, J., Norris, K.: A test of a pattern recognition system for identification of spiders. Bulletin of Entomological Research 89(3), 217–224 (1999)
9. Dorko, G., Schmid, C.: Object class recognition using discriminative local features. PAMI submitted (2004)
10. Dorkó, G., Schmid, C.: Object class recognition using discriminative local features (2005). URL http://lear.inrialpes.fr/pubs/2005/DS05. Accepted under major revisions to IEEE Transactions on Pattern Analysis and Machine Intelligence, updated 13 September

Fig. 10 Visual comparison of the regions output by the three detectors on four Doroneuria specimens: (a) Hessian-affine, (b) Kadir Entropy, (c) PCBR

11. Fergus, R., Perona, P., Zisserman, A.: Object class recognition by unsupervised scale-invariant learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 264–271. Madison, Wisconsin (2003)
12. Freund, Y., Schapire, R.E.: Experiments with a new boosting algorithm. In: International Conference on Machine Learning, pp. 148–156 (1996). URL citeseer.ist.psu.edu/freund96experiments.html
13. Friedman, J., Hastie, T., Tibshirani, R.: Additive logistic regression: a statistical view of boosting (1998). URL citeseer.ist.psu.edu/friedman98additive.html
14. Gaston, K.J., O'Neill, M.A.: Automated species identification: why not? Philosophical Transactions of the Royal Society B: Biological Sciences 359(1444), 655–667 (2004)
15. Csurka, G., Bray, C., Fan, L.: Visual categorization with bags of keypoints. ECCV workshop (2004)
16. Harris, C., Stephens, M.: A combined corner and edge detector. Alvey Vision Conference pp. 147–151 (1988)
17. Hilsenhoff, W.L.: Rapid field assessment of organic pollution with a family level biotic index. Journal of the North American Benthological Society 7, 65–68 (1988)
18. Hopkins, G.W., Freckleton, R.P.: Declines in the numbers of amateur and professional taxonomists: implications for conservation. Animal Conservation 5(3), 245–249 (2002)
19. Jurie, F., Schmid, C.: Scale-invariant shape features for recognition of object categories. CVPR 2, 90–96 (2004)
20. Jurie, F., Triggs, B.: Creating efficient codebooks for visual recognition. In: ICCV '05: Proceedings of the Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1, pp. 604–610. IEEE Computer Society, Washington, DC, USA (2005). DOI http://dx.doi.org/10.1109/ICCV.2005.66
21. Kadir, T., Zisserman, A., Brady, M.: An affine invariant salient region detector. In: European Conference on Computer Vision (ECCV04), pp. 228–241 (2004)
22. Kumar, S., August, J., Hebert, M.: Exploiting inference for approximate parameter learning in discriminative fields: An empirical study. In: 5th International Workshop, EMMCVPR 2005, pp. 153–168. Springer-Verlag, St. Augustine, Florida (2005)
23. Landwehr, N., Hall, M., Frank, E.: Logistic model trees. Mach. Learn. 59(1-2), 161–205 (2005). DOI http://dx.doi.org/10.1007/s10994-005-0466-3
24. Leibe, B., Seemann, E., Schiele, B.: Pedestrian detection in crowded scenes. In: CVPR '05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) - Volume 1, pp. 878–885. IEEE Computer Society, Washington, DC, USA (2005). DOI http://dx.doi.org/10.1109/CVPR.2005.272
25. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision 60(2), 91–110 (2004). DOI http://dx.doi.org/10.1023/B:VISI.0000029664.99615.94
26. Lucas, S.: Face recognition with continuous n-tuple classifier. In: Proc. British Machine Vision Conference, pp. 222–231. Essex (1997)
27. Matas, J., Chum, O., Urban, M., Pajdla, T.: Robust wide-baseline stereo from maximally stable extremal regions. Image and Vision Computing 22(10), 761–767 (2004)
28. Mikolajczyk, K., Schmid, C.: An affine invariant interest point detector. ECCV 1(1), 128–142 (2002)
29. Mikolajczyk, K., Schmid, C.: Scale and affine invariant interest point detectors. IJCV 60(1), 63–86 (2004)
30. Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Matas, J., Schaffalitzky, F., Kadir, T., Gool, L.V.: A comparison of affine region detectors. IJCV (2005)
31. O'Neill, M.A., Gauld, I.D., Gaston, K.J., Weeks, P.: Daisy: an automated invertebrate identification system using holistic vision techniques. In: Proc. Inaugural Meeting BioNET-INTERNATIONAL Group for Computer-Aided Taxonomy (BIGCAT), pp. 13–22. Egham (2000)
32. Opelt, A., Fussenegger, M., Pinz, A., Auer, P.: Weak hypotheses and boosting for generic object detection and recognition. In: 8th European Conference on Computer Vision, vol. 2, pp. 71–84. Prague, Czech Republic (2004)

33. Opelt, A., Pinz, A., Fussenegger, M., Auer, P.: Generic object recognition with boosting. IEEE Trans. Pattern Anal. Mach. Intell. 28(3), 416–431 (2006)
34. Papageorgiou, C., Poggio, T.: A trainable system for object detection. Int. J. Comput. Vision 38(1), 15–33 (2000)
35. Quattoni, A., Collins, M., Darrell, T.: Conditional random fields for object recognition. In: Proc. NIPS 2004. MIT Press, Cambridge, MA (2005)
36. Quinlan, J.R.: C4.5: programs for machine learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1993)
37. Shotton, J., Blake, A., Cipolla, R.: Contour-based learning for object detection. In: ICCV '05: Proceedings of the Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1, pp. 503–510. IEEE Computer Society, Washington, DC, USA (2005). DOI http://dx.doi.org/10.1109/ICCV.2005.63
38. Sokal, R.R., Rohlf, F.J.: Biometry, 3rd edn. W. H. Freeman & Co. (1995)
39. Steger, C.: An unbiased detector of curvilinear structures. PAMI 20(2), 113–125 (1998)
40. Sung, K.K., Poggio, T.: Example-based learning for view-based human face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(1), 39–51 (1998)
41. Turk, M.A., Pentland, A.P.: Face recognition using eigenfaces. In: Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, pp. 586–591 (1991)
42. Tuytelaars, T., Gool, L.V.: Wide baseline stereo matching based on local, affinely invariant regions. BMVC pp. 412–425 (2000)
43. Tuytelaars, T., Gool, L.V.: Matching widely separated views based on affine invariant regions. IJCV 59(1), 61–85 (2004)
44. Vincent, L., Soille, P.: Watersheds in digital spaces: An efficient algorithm based on immersion simulations. PAMI 13(6), 583–598 (1991)
45. Zhang, W., Deng, H., Dietterich, T.G., Mortensen, E.N.: A hierarchical object recognition system based on multi-scale principal curvature regions. International Conference on Pattern Recognition pp. 1475–1490 (2006)

Hongli Deng received a BE degree in 1992 from Wuhan University of Technology and an ME degree in 1999 from Sichuan University in China. He is a PhD candidate in the Electrical Engineering and Computer Science department of Oregon State University. His research interests are local image feature detection, description and registration.

Wei Zhang received the BS and MS degrees from Xi'an Jiaotong University, China, in 2001 and 2004. Currently, he is a PhD candidate and research assistant at Oregon State University. His research interests include machine learning, computer vision and object recognition.

Matt Sarpola graduated from Oregon State University with his MS in mechanical engineering in 2005 and currently does new product research and development for Videx in Corvallis, OR.

Natalia Larios received the BS in Computer Engineering from the Universidad Nacional Autonoma de Mexico in 2003. She is currently a graduate student and research assistant in the Electrical Engineering Department at the University of Washington. Her research interests include computer vision, object recognition and image retrieval employing machine learning and probabilistic modeling.

Jenny Yuen is currently a graduate student at the Massachusetts Institute of Technology Computer Science and Artificial Intelligence Laboratory, where she is working on computer vision for event recognition and detection. She has a BS in Computer Science from the University of Washington.

Robert Paasch is an Associate Professor in the Department of Mechanical Engineering at Oregon State University, and currently holds the Boeing Professorship in Mechanical Design. He received his Ph.D. in Mechanical Engineering from the University of California at Berkeley in 1990. His research interests include automated monitoring and diagnosis, probabilistic and robust design, and design theory and methodology.

Salvador Ruiz Correa was born in Mexico City, Mexico, in 1966. He received the B.S. and M.S. degrees in Electrical Engineering from the Universidad Nacional Autonoma de Mexico, in 1990 and 1994, respectively, and the Ph.D. degree in Electrical Engineering from the University of Washington, Seattle, in 2004. Dr. Ruiz Correa has worked as a post-doctoral fellow in imaging informatics at Children's Hospital and Regional Medical Center in Seattle, WA, and at Children's National Medical Center in Washington, D.C. from 2004 to the present. Currently, he is the director of the Pediatric Imaging Research Laboratory at CNMC. His research interests include computer vision, biomedical image processing, pattern recognition, robotics and imaging informatics. Dr. Ruiz is an associate member of the IEEE and a reviewer for IEEE PAMI, the International Journal of Computer Vision, Pattern Recognition and IEEE Transactions on Multimedia.

Andrew R. Moldenke is an ecologist in the Department of Botany and Plant Pathology at Oregon State University. His research interests center on the interactions between arthropods and plants, in particular pollination and nutrient recycling. He is especially concerned about the inability of scientists to develop appropriate protocols for measuring critical changes in biodiversity as the climate changes, due to the sheer numbers of arthropod species and the limited number of trained taxonomists.

Eric N. Mortensen is an Assistant Professor in the School of Electrical Engineering and Computer Science at Oregon State University. He received his Ph.D. in Computer Science from Brigham Young University in 2000. His research interests include interactive vision, image and video segmentation and editing, feature detection/description/matching, object recognition, and image-based modeling. He is a member of the Institute of Electrical and Electronics Engineers (IEEE) and the Association for Computing Machinery (ACM).

Linda G. Shapiro is Professor of Computer Science and Engineering and of Electrical Engineering at the University of Washington. Her research interests include computer vision, image database systems, artificial intelligence, pattern recognition, and robotics. She is a Fellow of the IEEE, a Fellow of the IAPR, and a past Chair of the IEEE Computer Society Technical Committee on Pattern Analysis and Machine Intelligence. She has served as Editor-in-Chief of CVGIP: Image Understanding, Associate Editor of IEEE Transactions on Pattern Analysis and Machine Intelligence, and Chair or Program Chair of numerous computer vision conferences. She has co-authored three textbooks, one on data structures and two on computer vision.

Dave A. Lytle is currently an Assistant Professor of Zoology at Oregon State University. He received his PhD in Ecology and Evolutionary Biology from Cornell University in 2000. He has been a D.H. Smith Postdoctoral Fellow and a Postdoctoral Associate at the University of Chicago. His research interests include the use of evolutionary ecology to understand how aquatic organisms are shaped.

Thomas G. Dietterich is Professor and Director of Intelligent Systems in the School of Electrical Engineering and Computer Science at Oregon State University, where he joined the faculty in 1985. In 1987, he was named a Presidential Young Investigator for the NSF. In 1990, he published, with Dr. Jude Shavlik, the book entitled Readings in Machine Learning, and he also served as the Technical Program Co-Chair of the National Conference on Artificial Intelligence (AAAI-90). From 1992 to 1998 he held the position of Executive Editor of the journal Machine Learning. He is a Fellow of the American Association for Artificial Intelligence (1994) and of the Association for Computing Machinery (2003). In 2000, he co-founded a new, free electronic journal: The Journal of Machine Learning Research. He served as Technical Program Chair of the Neural Information Processing Systems (NIPS) conference in 2000 and General Chair in 2001. He is currently President of the International Machine Learning Society, and he also serves on the Board of Trustees of the NIPS Foundation.