Identification of species images based on mitochondrial DNA

Lior Wolf & Yoni Donner

School of Computer Science, Tel-Aviv University, Israel

The appearance of an animal species is a complex phenotype partially encoded in its genome.

Previous work on linking genotype to visually-identifiable phenotypes has focused on univariate or low-dimensional traits such as eye color1, principal variations in skeletal structure2 and height3, as well as on the discovery of specific genes that contribute to these traits. Here, we go beyond single traits to the full genotype-phenotype analysis of photographs and illustrations of animal species. We address the problems of (1) identification and (2) synthesis of images of previously unseen species using genetic data. We demonstrate that both these problems are feasible: in a multiple choice test, our algorithm identifies with high accuracy the correct image of previously unseen fish, birds and ants, based only on a short gene sequence; additionally, using the same sequence we are able to approximate the contours of unseen fish. Our predictions are based on correlative phenotype-genotype links rather than on specific gene targeting, and they employ the cytochrome c oxidase I mitochondrial gene, which is assumed to have little causal influence on appearance. Such correlative links enable the use of high-dimensional phenotypes in genetic research, and applications may range from forensics to personalized medical treatment.

There is little doubt that appearance is to a large extent influenced by heritage, yet the actual mechanisms that determine our looks may be difficult to unravel since they are likely to involve complex pathways consisting of a multitude of genes. This, however, does not eliminate the possibility of predicting appearance according to genetic data. In order to predict it is sufficient to discover statistical correlations between genetic markers and images, without having to describe the exact mechanism at work.

Figure 1: At first glance, it may seem improbable that mitochondrial genes can be used to identify images, since they do not affect appearance directly. These genes may, however, be correlated with appearance due to the common evolutionary lineage shared by all genes. The figure demonstrates correlations that are created by an event of population split, in which a common ancestral population splits into populations A and B. The two colors correspond to genes of different functions. After the split, new variants of genes created in one population do not migrate to the other and appear concurrently with new variants of other genes, thus yielding correlations between functionally independent genes.

Moreover, correlations between genotype and phenotype can be identified even in the absence of a direct causal relationship. The genes directly responsible for appearance are inherited together with other genes that are involved in completely different functions, thus generating many intricate interdependencies that can be exploited statistically. This distinction between function and correlation, as illustrated in Figure 1, is demonstrated acutely in our work, since we employ a mitochondrial gene, which is unlikely to directly determine visual appearance.

Mitochondrial DNA (mtDNA) is normally inherited unchanged from the mother, with very limited recombination compared to nuclear DNA, yet its mutation rate is higher than that of nuclear DNA4, resulting in low variance within species and high variance between different species, thus making it a promising candidate for species identification5. In addition, mtDNA contains very few introns and is easy to collect, which further increases its suitability for this purpose. In theory, many mitochondrial loci can be used; in practice, the mitochondrial gene cytochrome c oxidase I (COI) has been repeatedly employed for DNA barcoding and its discriminative effectiveness has been demonstrated in several species6, 7. While the ability of COI to serve as a universal barcode has been doubted8, 9, in this work we use COI sequence data for all experiments, mainly due to the high availability of COI sequences for many species of the same taxonomic groups, publicly available from the Barcode of Life Database (BOLD)10.

Three data sets are employed: Fishes of Australia11, Ant Diversity in Northern Madagascar12, and the Birds of North America - Phase II project13. By locating matching images we constructed three data sets containing multiple genotype-phenotype pairs (M, P), where M stands for the COI gene sequence and P is a single image of an animal of that species. No fish or ant species is associated with more than one image, while the bird data set contains several genetic markers and images for each species. The images were extracted from several sources. A total of 157 fish species illustrations with varying style and quality were retrieved from FishBase (http://filaman.ifm-geomar.de/home.htm). Images of 26 relevant ant species were available from AntWeb (http://www.antweb.org/), with each ant photographed from a profile view, a head view and a dorsal view. The fish and ant images were matched to the BOLD record by the species name. The BOLD repository itself contains 125 wing images of sequenced birds covering 25 different species. Links to the images used in our experiments can be found in the accompanying Supplementary Information; some examples are shown in Figure 2.

The ultimate challenge for learning the relations between genotype and phenotype is the successful prediction of appearance based on genetic markers, i.e., given a training set {(Mi, Pi)}, i = 1..N, of N matching markers and images, and a marker Mnew of a new organism, to generate an image P̂new such that P̂new is a good approximation of the actual appearance of the organism. While we have partially successful results predicting the appearance of fish (see below), this task is difficult due to the high dimensionality of the target space, the relatively small size of the training data set and the low relevance of the genetic markers used, which provide only partial correlative information.

A more accessible task is to identify the image of an animal of the same species out of k candidate images P1, ..., Pk given its genotype M, where the genetic marker and the species in the candidate images are all previously unseen. Note that the ability to predict the image P̂ that corresponds to the genetic marker M implies the ability to identify without difficulty the correct image as the one closest visually to P̂. The converse is not true: the correlative nature of the learned genotype-phenotype relations can be used to solve the selection task without providing the ability to synthesize an image.

Indeed, our experiments demonstrate that the COI gene by itself provides sufficient information to allow, in the case of two alternatives (k = 2), correct identification at considerably higher success rates than the 50% chance performance.

Throughout this work, linear models are used for prediction. First, both the sequences of the COI gene (M) and the images (P) are represented as vectors m and p respectively, of lengths nm and np. To describe the genomic data this way, we represent the n nucleotides of each sequence (n is constant per dataset, and its value varies between 576 for ants and 669 for birds) as a vector of dimension nm = 4n of real numbers between 0 and 1. Each element marks the appearance of a specific nucleotide (A, G, C or T) at a specific location in the COI sequences of the species. In cases where there are multiple measurements in the BOLD database for a particular species, the elements mark the relative frequency of a nucleic acid of a certain type. In this representation, the dot product of two gene vectors is the expected number of agreements between the sequences. This representation has been demonstrated to be effective for predicting univariate phenotypes14. An alternative representation based on the Kimura two-parameter model (K2P)15 was also tested, with nearly identical results (see Supplementary Information). For the representation of images as vectors we use a bag-of-SIFT-features representation (see Box 1), where keypoints are sampled uniformly across all image locations. For all data sets np = 11,111.
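As a concrete illustration, this encoding can be sketched in a few lines (a hypothetical reconstruction; the function name and the skipping of gap/ambiguity characters are our assumptions, not details from the original implementation):

```python
import numpy as np

NUCS = "AGCT"

def encode_sequences(seqs):
    """Encode a species' aligned COI sequences as a 4n-dimensional vector.

    Entry 4*pos + j holds the relative frequency of nucleotide NUCS[j]
    at alignment position pos across the available sequences, so the
    dot product of two species vectors is the expected number of
    positions at which their sequences agree.
    """
    n = len(seqs[0])
    m = np.zeros(4 * n)
    for seq in seqs:
        for pos, nuc in enumerate(seq):
            if nuc in NUCS:  # assumption: skip gaps and ambiguity codes
                m[4 * pos + NUCS.index(nuc)] += 1.0
    return m / len(seqs)
```

For species with a single sequence the dot product simply counts matching positions; with multiple sequences per species it becomes the expected number of agreements, as described above.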

For the identification task, we first subtract the mean value from all markers and images. Then we learn from the training set a pair of linear transformations Tm, Tp that transform markers and images to a common vector space R^min(nm, np), so that the transformed vectors Tm(mi) and Tp(pi), i = 1..N, of matching training markers and images are similar in the sense of a small Euclidean distance. Put differently, we seek linear transformations Tm, Tp that minimize

(1/N) Σ_{i=1}^{N} ‖Tm(mi) − Tp(pi)‖².   (1)

Additional constraints need to be provided to avoid trivial or singular solutions, namely that the components of the resulting vectors be pairwise uncorrelated and of unit variance:

Σ_{i=1}^{N} xi xiᵀ = Σ_{i=1}^{N} yi yiᵀ = I,   (2)

where xi = Tm(mi) − (1/N) Σ_{j=1}^{N} Tm(mj) and yi = Tp(pi) − (1/N) Σ_{j=1}^{N} Tp(pj). This problem can be solved through Canonical Correlation Analysis16 (CCA). Please refer to the Supplementary Information for details.

Since the feature vectors for both genes and images are of dimensions significantly higher than the number of training samples, statistical regularization must be used to avoid overfitting. We use the regularized version of CCA suggested by Vinod17. Generally, two regularization parameters need to be determined: ηm and ηp. We use a single regularization parameter η instead, as follows. Let X = [x1 x2 . . . xN] and Y = [y1 y2 . . . yN] (xi and yi are defined following Eq. 2), and denote by λm, λp the largest eigenvalues of XXᵀ and YYᵀ. We set ηm = ηλm and similarly ηp = ηλp. This way of choosing the regularization parameters is invariant to scale and can be used uniformly across all data sets.
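The regularized CCA step, with the scale-invariant choice of ηm and ηp just described, might be sketched as follows. The whitening-plus-SVD formulation and all names here are our own assumptions; the authors' actual implementation may differ:

```python
import numpy as np

def sqrtm_psd(A):
    """Symmetric PSD matrix square root via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

def regularized_cca(X, Y, eta=0.05):
    """Regularized CCA for centered data matrices X (nm x N), Y (np x N).

    The ridge terms are eta times the largest eigenvalue of XX^T and
    YY^T respectively, so the choice is invariant to rescaling either
    view.  Returns projections Tm, Tp (rows are canonical directions)
    and the canonical correlations.
    """
    Cxx, Cyy = X @ X.T, Y @ Y.T
    ex = eta * np.linalg.eigvalsh(Cxx)[-1]
    ey = eta * np.linalg.eigvalsh(Cyy)[-1]
    Rx = np.linalg.inv(sqrtm_psd(Cxx + ex * np.eye(Cxx.shape[0])))
    Ry = np.linalg.inv(sqrtm_psd(Cyy + ey * np.eye(Cyy.shape[0])))
    U, s, Vt = np.linalg.svd(Rx @ X @ Y.T @ Ry)
    k = min(X.shape[0], Y.shape[0])
    return U[:, :k].T @ Rx, Vt[:k, :] @ Ry, s[:k]
```

On real data, where nm and np far exceed N, the computation would be carried out in a reduced basis; the sketch assumes the data matrices are small enough to handle directly.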

After learning the transformations Tm, Tp from the training set, the identification task consists of choosing the one image from a set {p1, ..., pk} of previously unseen images that is most likely to match the previously unseen genetic marker mnew. We select argmin_t D(Tm(mnew), Tp(pt)), where D(u, v) is a Mahalanobis-like distance function (see Methods).
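In code, the selection rule amounts to a one-line argmin (an illustrative sketch; `dist` stands in for the distance D, and plain Euclidean distance is used in the example below for brevity):

```python
import numpy as np

def identify(Tm, Tp, m_new, candidate_images, dist):
    """Return the index of the candidate image whose transformed
    representation is closest to the transformed genetic marker."""
    u = Tm @ m_new
    return int(np.argmin([dist(u, Tp @ p) for p in candidate_images]))
```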

Experiments are performed using multiple random training and test splits, and in each trial 90% of the examples are used for training (see Methods). In the bird-wing data-set there are several samples for each species, and to avoid trivial identification where we match an image of a previously seen species, no species appears in both the training and test sets, and the choice is always between members of two different previously unseen species. Figure 3(top) shows the identification results for the fish, bird-wing and ant data sets for a regularization parameter of η = 0.05. The mean identification accuracy is 90.5% for fish, 72% for bird wings and up to 63.8% for the profile ant views.

The varying performance level can be partly accounted for by the training set sizes (when using as many training images in the fish experiments as in the ant experiments, performance drops to 77%), as well as by the increased variance due to smaller test set size in the ant experiments. Figure 3(bottom) shows results for varying values of η. As can be seen, the performance level is stable across a large range of the regularization parameter.

Next, we validate the need to learn directly the connections between the genotype and the phenotype, instead of the alternative of analyzing each separately and using the analysis for identification. First, the performance of a simple Nearest Neighbor method is evaluated: given a new genetic marker mnew, it chooses out of the existing markers the closest one, argmin_{i=1..N} ‖mi − mnew‖, and selects the candidate image most similar to the corresponding image pi: argmin_k ‖pk − pi‖. As can be seen in Figure 3, this method performs, on average, about 20% worse than the above CCA method for fish (p < 10^-7). Second, we construct phylogenetic trees for the genotype and the phenotype using the UPGMA method18.
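The Nearest Neighbor baseline can be sketched as follows (an illustrative reconstruction; the names are ours):

```python
import numpy as np

def nn_baseline(train_markers, train_images, m_new, candidate_images):
    """Nearest Neighbor identification: find the training marker closest
    to m_new, then pick the candidate image closest to that marker's
    paired training image."""
    i = int(np.argmin([np.linalg.norm(m - m_new) for m in train_markers]))
    return int(np.argmin([np.linalg.norm(p - train_images[i])
                          for p in candidate_images]))
```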

As expected, due to the multitude of parameters that affect images, the image-based phylogenetic tree differs significantly from the gene-based one (see Supplementary Information). Attempting to employ these trees for identification (in two different ways, see Methods) produces significantly worse performance on fish and birds than our CCA-based method; the ant data set is too small to judge significance. This suggests that learning genotype and phenotype together is more effective than analyzing each separately.

The above experiments were designed to evaluate inter-class identification. For the bird data set, several genetic markers and images from members of the same species are available, which enables additional intra-species identification experiments. For such experiments the relevance of same-species examples is much larger than that of examples from other classes: the non-causality assumption implies that the intra-class phenotype-genotype correlations differ from the inter-class correlations. Inter-species correlation may provide valuable information regarding the correlations that exist between elements in the genotype or phenotype vector, but probably cannot predict the relevance of an intra-species SNP to some visual trait. Since each bird species has only 2-7 samples, our identification experiments do not perform better than chance. We therefore designed a task easier than multiple-choice identification: at each test we provide a pair of genetic markers (from the same species) and a pair of images, and matching is done for both pairs simultaneously, providing additional evidence (see Methods for more details). In this second experiment 63% correct matching was achieved (p < 0.00003).

For the task of fish image synthesis we focus on predicting the outline (contour) of the fish. We first identified a subset of 93 fish images that have the same number of fins, and therefore share similar topologies. 35 control points corresponding to visually identifiable locations in the fish images were then marked manually on each image (see Supplementary Information). These 35 2D control points constitute the shape model, which is represented as a 70-dimensional vector si, i = 1..93. In addition, for each of the 93 fish, a simple ad-hoc automatic segmentation algorithm was run to identify the set of points defining its contour.

In each image synthesis experiment, a random partition of 82 training and 11 testing images is created. Linear ridge regression19 is used to learn, from the training set, the mapping between the genetic markers mi and the shape model si. The regularization value of the ridge regression is set, as above, to some parameter (η = 0.05) times the largest eigenvalue of the covariance matrix of mi. To synthesize the predicted contour for an unseen genotype mnew, we predict its shape snew according to the learned regression formula, and we also find the training example mnearest which is closest genetically to mnew. We then warp the contour of this nearest species to create the predicted contour.

This warping is performed by applying a thin-plate-spline warp20 which is computed in accordance with the matching points snearest and snew. Results for the testing fish of one run are shown in Figure 4.
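The regression step can be sketched in closed form (an assumed implementation: the ridge constant is tied to the largest eigenvalue of the marker covariance as in the text, but variable names are ours and the thin-plate-spline warp is omitted):

```python
import numpy as np

def fit_shape_regression(M, S, eta=0.05):
    """Ridge regression from marker vectors (rows of M) to shape
    vectors (rows of S).  The ridge constant is eta times the largest
    eigenvalue of M^T M; predict a new shape with m_new @ W."""
    C = M.T @ M
    lam = eta * np.linalg.eigvalsh(C)[-1]
    return np.linalg.solve(C + lam * np.eye(C.shape[0]), M.T @ S)
```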

To evaluate the quality of the contour reconstruction results, we applied a simple distance measure between the original and predicted contours. For each point on the original contour we measure the distance to the closest point on the predicted one, and we average this score over all contour points and all 11 testing contours. The mean error obtained for the contour prediction is 3.4% of the image width (standard deviation of 0.6%, measured across 100 repetitions), compared to 6.3% (SD 5%) obtained by randomly picking one of the training contours as the prediction.
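This error measure can be written compactly (a sketch assuming contours are given as (k, 2) point arrays; normalization by image width is left to the caller):

```python
import numpy as np

def contour_error(original, predicted):
    """Mean, over points of the original contour, of the Euclidean
    distance to the closest point of the predicted contour."""
    d = np.linalg.norm(original[:, None, :] - predicted[None, :, :], axis=2)
    return float(d.min(axis=1).mean())
```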

The experiments presented in our work show that images can be identified, and even predicted to some extent, by examining small sequences of an animal's genome. The use of the COI gene makes our results more definite by eliminating the possibility of the causal genotype-phenotype relations that have been the focus of previous work. Several applications could be made possible by further development of such capabilities, such as predicting the appearance of extinct animals for which tissues are available, and forensic applications including the identification of suspects by matching DNA evidence to surveillance video, or even the synthesis of their portraits from such evidence. The methods employed here can be transferred to non-vision-related tasks as well. For example, given a data set of successfully recovered patients, where for each patient genetic markers and records of treatment are available, models suggesting the suitability of treatments for new patients may be constructed without identifying the underlying mechanisms or focusing on relevant genes.

Box 1

Modern image representations.

During the last few years, considerable progress has been made in the development of efficient visual representations serving as the core of several accurate computer vision systems. For example, real-time systems now exist that can detect specific classes of objects, such as people and cars, within complex images.

Often, an image is represented by a set of high-dimensional feature vectors, which describe its appearance at various locations. Among the most popular is the SIFT descriptor21, which captures the distribution of local image-edge orientations; it is designed to be invariant to scaling and rigid image transforms, and is almost unaffected by local appearance perturbations.

One paradigm that has been proven effective despite its simplicity is the "bag-of-features" paradigm22, 23. In this paradigm, image descriptors are extracted from each training image at visually distinct or, alternatively, at random image locations. Next, the vector space of image descriptors is partitioned by applying a clustering algorithm to all descriptors arising from the training images. Each image (training or new) is then represented by the number of descriptors extracted from it that belong to each partition of the vector space.
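The histogramming step of the paradigm can be sketched as follows (descriptor extraction and the clustering that produces the vocabulary are omitted; in practice the descriptors are SIFT vectors and the vocabulary comes from (hierarchical) k-means):

```python
import numpy as np

def bag_of_features(descriptors, centroids):
    """Represent an image by the normalized counts of its local
    descriptors over a fixed vocabulary: each descriptor is assigned
    to its nearest centroid, and the per-centroid counts are returned
    as a histogram summing to one."""
    d = np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=2)
    hist = np.bincount(d.argmin(axis=1), minlength=len(centroids)).astype(float)
    return hist / hist.sum()
```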

Other paradigms for visual recognition include hierarchical neuroscience-motivated systems24, 25. These systems are designed to have a layered structure, in which invariance and selectivity are increased toward the top of the hierarchy. In the Supplementary Information, we describe experiments with the system of C1 descriptors26 that show a performance level similar to that of the bag-of-features paradigm.

Methods Summary

Data sets The Barcode of Life Database contains 192 unique species in the Fishes of Australia project. Fish images were retrieved from FishBase (http://filaman.ifm-geomar.de/home.htm). We found reasonable illustrations for 157 out of the 192 species. Note that the illustrations vary in style and quality.

Ant images were retrieved from AntWeb (http://www.antweb.org/). Out of 86 ant species sequenced in the Ant Diversity in Northern Madagascar Barcode of Life project, we matched 26 entries with images in the AntWeb site. For each ant there are three photographs: head, profile and dorsal.

Bird sequences and images were taken from the Birds of North America - Phase II project at the Barcode of Life Database13. We used images of the birds' wings, taken from a lateral view. Out of 657 unique species with sequences, only 25 have such wing images. Several sequences and images are available for each species, spanning a total of 125 sequences and matching images. We flipped the images such that all bird wings point left, and cropped them such that all the paper signs visible in the original images were removed.

Experimental procedure The experimental results presented in this work are evaluated using holdout-type cross validation: in each of the trials of an experiment, 90% of the data were chosen randomly to be used as the training set, and the remaining 10% were used as the test set. When comparing the influence of various parameters on the results of the same data set, care was taken to use the same divisions into training and test data in each experiment. When using the bird wings data set, no single species appears in both the training and test sets. In the identification experiments, for each genetic marker in the test set, an experiment is performed for each combination of the true image and another test image, and the results are averaged over all pairs. The number of trials in an experiment varies based on the size of the test set, since a smaller test set implies that more trials are needed to reach the same level of statistical significance. 50 trials were used for fish, and 100 trials for birds and ants due to the smaller sizes of the corresponding data sets.

References

[1] Sturm, R. A. & Frudakis, T. N. Eye colour: portals into pigmentation genes and ancestry. Trends in Genetics 20, 327–332 (2004).

[2] Chase, K. et al. Genetic basis for systems of skeletal quantitative traits: Principal component analysis of the canid skeleton. PNAS 99, 9930–9935 (2002).

[3] Sutter, N. B. et al. A Single IGF1 Allele Is a Major Determinant of Small Size in Dogs. Science 316, 112–115 (2007).

[4] Brown, W. M., George, M. & Wilson, A. C. Rapid Evolution of Animal Mitochondrial DNA. PNAS 76, 1967–1971 (1979).

[5] Avise, J. C. et al. Intraspecific Phylogeography: The Mitochondrial DNA Bridge Between Population Genetics and Systematics. Annual Review of Ecology and Systematics 18, 489–522 (1987).

[6] Hebert, P. D. N., Cywinska, A., Ball, S. L. & deWaard, J. R. Biological identifications through DNA barcodes. Proceedings of The Royal Society Biological Sciences 270, 313–321 (2003).

[7] Hebert, P. D. N., Ratnasingham, S. & deWaard, J. R. Barcoding animal life: cytochrome c oxidase subunit 1 divergences among closely related species. Proceedings of the Royal Society B: Biological Sciences 270, S96–S99 (2003).

[8] Will, K. W. & Rubinoff, D. Myth of the molecule: DNA barcodes for species cannot replace morphology for identification and classification. Cladistics 20, 47–55 (2004).

[9] Gregory, T. R. DNA barcoding does not compete with taxonomy. Nature 434, 1067 (2005).

[10] Ratnasingham, S. & Hebert, P. D. N. bold: The Barcode of Life Data System (http://www.barcodinglife.org). Molecular Ecology Notes 7, 355–364 (2007).

[11] Ward, R. D., Zemlak, T. S., Innes, B. H., Last, P. R. & Hebert, P. D. N. DNA barcoding Australia’s fish species. Philosophical Transactions of The Royal Society Biological Sciences 360, 1847–1857 (2005).

[12] Smith, M. A., Fisher, B. L. & Hebert, P. D. N. DNA barcoding for effective biodiversity assessment of a hyperdiverse arthropod group: the ants of Madagascar. Philosophical Transactions of The Royal Society Biological Sciences 360, 1825–1834 (2005).

[13] Kerr, K. C. et al. Comprehensive DNA barcode coverage of North American birds. Molecular Ecology Notes 7, 535–543 (2007).

[14] Yosef, N. et al. Prediction of phenotype information from genotype data. Submitted (by unrelated authors), PLoS Computational Biology (2007).

[15] Kimura, M. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol 16, 111–120 (1980).

[16] Hotelling, H. Relations Between Two Sets of Variates. Biometrika 28, 321–377 (1936).

[17] Vinod, H. D. Canonical ridge and econometrics of joint production. Journal of Econometrics 4, 147–166 (1976).

[18] Sokal, R. R. & Michener, C. D. A statistical method for evaluating systematic relationships. Univ. Kans. Sci. Bull. 38, 1409–1438 (1958).

[19] Hoerl, A. E. & Kennard, R. Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12, 55–67 (1970).

[20] Bookstein, F. Principal warps: thin-plate splines and the decomposition of deformations. IEEE Transactions on Pattern Analysis and Machine Intelligence 11, 567–585 (1989).

[21] Lowe, D. G. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision 60, 91–110 (2004).

[22] Leung, T. & Malik, J. Representing and Recognizing the Visual Appearance of Materials using Three-dimensional Textons. International Journal of Computer Vision 43, 29–44 (2001).

[23] Fergus, R., Perona, P. & Zisserman, A. Object class recognition by unsupervised scale-invariant learning. In Proceedings. 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, II–264–II–271 (2003).

[24] LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE 86, 2278–2324 (1998).

[25] Riesenhuber, M. & Poggio, T. Hierarchical models of object recognition in cortex. Nature Neuroscience 2, 1019–1025 (1999).

[26] Serre, T., Wolf, L., Bileschi, S., Riesenhuber, M. & Poggio, T. Robust object recognition with cortex-like mechanisms. IEEE Trans Pattern Anal Mach Intell 29, 411–426 (2007).

[27] Nister, D. & Stewenius, H. Scalable Recognition with a Vocabulary Tree. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2161– 2168 (2006).

[28] Nowak, E., Jurie, F. & Triggs, B. Sampling strategies for bag-of-features image classification. In European Conference on Computer Vision. Springer (2006).

[29] Beveridge, J. R., Bolme, D., Draper, B. & Teixeira, M. The CSU Face Identification Evaluation System. Machine Vision and Applications 16, 128–138 (2004).

[30] Goh, C., Bogan, A. A., Joachimiak, M., Walther, D. & Cohen, F. E. Co-evolution of proteins with their interaction partners. Journal of Molecular Biology 299, 283–293 (2000).

Methods

Mapping images to vectors The visual descriptors of the images are computed using the bag-of-SIFT implementation of Andrea Vedaldi, available at http://vision.ucla.edu/~vedaldi/code/bag/bag.html. This implementation uses hierarchical K-means27 for partitioning the descriptor space. Keypoints are selected at random locations28.

Matching pairs of genes and images The bird intra-species identification task presents a challenge more difficult than the inter-species identification tests. Assuming that all instances of a species descend from a small population of homogeneous ancestors, correlations between genotypes and phenotypes learned from inter-species data do not allow intra-species identification. In the bird data set, each species is represented by a handful of examples. To make recognition easier, we design the following cross-test.

The task consists of deciding the correct matching between two sequences {m1, m2} and two images {p1, p2}. The transformations Tm and Tp are learned from the training set as before, and both pairs of test markers and images are transformed. Then distances for both matchings are computed by adding the corresponding sequence-image distances:

dist({(m1, p1), (m2, p2)}) = D(Tm(m1),Tp(p1)) + D(Tm(m2),Tp(p2))

dist({(m1, p2), (m2, p1)}) = D(Tm(m1),Tp(p2)) + D(Tm(m2),Tp(p1))

The matching for which the above distance is smaller is chosen.
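The decision rule can be sketched as follows (assuming the two markers and two images have already been mapped to the common space by Tm and Tp; `dist` stands in for the distance D):

```python
def match_pairs(u1, u2, v1, v2, dist):
    """Compare the two possible matchings of transformed markers
    (u1, u2) to transformed images (v1, v2) by their summed distances;
    return True when the matching {(u1, v1), (u2, v2)} is the smaller."""
    return dist(u1, v1) + dist(u2, v2) < dist(u1, v2) + dist(u2, v1)
```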

Distance metrics As the distance metric for the transformed vectors, a metric similar to the commonly-used Mahalanobis distance29 is employed, in which each element of the vectors is multiplied by the corresponding correlation coefficient, thus giving more weight to better-correlated features. Let u and v be the transformed vectors; we define û and v̂ by ûi = λi ui and v̂i = λi vi, where λi is the i'th singular value of the matrix M (see Supplementary Information), and then use the standard cosine distance between û and v̂, thus D(u, v) = 1 − (û · v̂)/(‖û‖ ‖v̂‖). Other distance metrics were also considered and an experimental comparison is included in the Supplementary Information.
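This distance can be sketched as follows (assuming the λi are supplied as a vector):

```python
import numpy as np

def weighted_cosine_distance(u, v, lambdas):
    """D(u, v) = 1 - (u_hat . v_hat) / (||u_hat|| ||v_hat||), where
    u_hat_i = lambdas[i] * u[i]; components associated with larger
    singular values therefore contribute more to the distance."""
    uh, vh = lambdas * u, lambdas * v
    return 1.0 - float(uh @ vh) / (np.linalg.norm(uh) * np.linalg.norm(vh))
```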

Phylogenetic-tree based predictions Two forms of phylogenetic-tree based identification were used for comparison. In the first method, a phylogenetic tree is constructed using the UPGMA method for the pairs of genetic markers and images (mi, pi) in the training set, where the distances are determined by the Euclidean distance between the vector representations of the gene sequences (there is no need to use an alignment distance since the sequences in BOLD are aligned for each data set). Then, given the new marker mnew, the node v = (mi, pi) in the tree that minimizes ‖mi − mnew‖ is found, as well as the nodes uk = (mi_k, pi_k) that minimize ‖pi_k − pnew_k‖ for each of the new images {pnew_k}. The chosen image pk is that for which the tree distance between v and uk is minimal.

The second method is based on a similarity measure between phylogenetic trees, where the similarity between two trees is defined as the linear correlation between the two corresponding distance matrices30; thus the trees need not be constructed in order to compute their similarity. The image chosen by this method maximizes the tree similarity between the trees corresponding to the genetic markers {mi}_{i=1..N} ∪ {mnew} and the images {pi}_{i=1..N} ∪ {pnew_k}.
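This similarity can be sketched as follows (assuming precomputed pairwise-distance matrices):

```python
import numpy as np

def tree_similarity(D1, D2):
    """Linear correlation between two pairwise-distance matrices over
    the same taxa; the phylogenetic trees they induce never need to be
    built.  D1, D2 are symmetric (n, n) arrays with zero diagonals."""
    iu = np.triu_indices(D1.shape[0], k=1)  # upper triangle, off-diagonal
    return float(np.corrcoef(D1[iu], D2[iu])[0, 1])
```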

Figure 2: Examples of images used in our experiments. (top) Illustrations of fish from FishBase: Acanthopagrus butcheri, Argyrops spinifer, Carcharhinus dussumieri and Epinephelus rivulatus. (middle) Ant images from head, dorsal and profile views, retrieved from AntWeb: Anochetus mad01, Cataulacus mad02, Terataner mad01 and Tetraponera mad02. (bottom) Wing images available in the BOLD database: Anas americana, Mergus serrator, Aythya marila and Mergus merganser.

[Figure 3 panels. Top: bar chart of percent correct identification for the CCA, NN, phylogenetic-tree-distance and phylogenetic-tree-similarity methods on the Wing, Fish, Ants profile, Ants dorsal and Ants head data sets. Bottom: percent correct identification as a function of eta (10^-7 to 10) for the fish, bird wings and ants data sets.]

Figure 3: (top) Comparison of several identification methods, for fish (n = 50), bird wings (n = 100) and ants (n = 100). (bottom) The effect of changing the regularization parameter on the performance of the CCA method. The error bars depict SD.

Figure 4: Prediction of fish contours based on their COI gene. (a) The original unseen fish image; (b) the extracted contour; (c) the synthesized contour. The fish species shown are (top to bottom) Thunnus orientalis, Otolithes ruber, Epinephelus multinotatus, Epinephelus malabaricus, Caprodon longimanus, Variola louti, Pentaceros decacanthus, Auxis thazard, Epinephelus ergastularius, Platax batavianus, Epinephelus ongus. The remaining 82 fish were used for training.
