Identification of species images based on mitochondrial DNA

Lior Wolf & Yoni Donner

School of Computer Science, Tel-Aviv University, Israel

The appearance of an animal species is a complex phenotype partially encoded in its genome.

Previous work on linking genotype to visually-identifiable phenotypes has focused on univariate or low-dimensional traits such as eye color1, principal variations in skeletal structure2 and height3, as well as on the discovery of specific genes that contribute to these traits. Here, we go beyond single traits to the full genotype-phenotype analysis of photographs and illustrations of animal species. We address the problems of (1) identification and (2) synthesis of images of previously unseen species using genetic data. We demonstrate that both these problems are feasible: in a multiple choice test, our algorithm identifies with high accuracy the correct image of previously unseen fish, birds and ants, based only on a short gene sequence; additionally, using the same sequence we are able to approximate the contours of unseen fish. Our predictions are based on correlative phenotype-genotype links rather than on specific gene targeting, and they employ the cytochrome c oxidase I mitochondrial gene, which is assumed to have little causal influence on appearance. Such correlative links enable the use of high-dimensional phenotypes in genetic research, and applications may range from forensics to personalized medical treatment.

There is little doubt that appearance is to a large extent influenced by heritage, yet the actual mechanisms that determine our looks may be difficult to unravel since they are likely to involve complex pathways consisting of a multitude of genes. This, however, does not eliminate the possibility of predicting appearance according to genetic data. In order to predict it is sufficient to discover statistical correlations between genetic markers and images, without having to describe the exact mechanism at work.

Figure 1: At first glance, it may seem improbable that mitochondrial genes can be used to identify images, since they do not affect appearance directly. These genes may, however, be correlated with appearance due to the common evolutionary lineage shared by all genes. The figure demonstrates correlations that are created by an event of population split, in which a common ancestral population splits into populations A and B. The two colors correspond to genes of different functions. After the split, new variants of genes created in one population do not migrate to the other and appear concurrently with new variants of other genes, thus yielding correlations between functionally independent genes.

Moreover, correlations between genotype and phenotype can be identified even in the absence of a direct causal relationship. The genes directly responsible for appearance are inherited together with other genes that are involved in completely different functions, thus generating many intricate interdependencies that can be exploited statistically. This distinction between function and correlation, as illustrated in Figure 1, is demonstrated acutely in our work, since we employ a mitochondrial gene, which is unlikely to directly determine visual appearance.

Mitochondrial DNA (mtDNA) is normally inherited unchanged from the mother, with very limited recombination compared to nuclear DNA, yet its mutation rate is higher than that of nuclear DNA4, resulting in low variance within species and high variance between different species, thus making it a promising candidate for species identification5. In addition, mtDNA contains very few introns and is easy to collect, which further increases its suitability for this purpose. In theory, many mitochondrial loci can be used; in practice, the mitochondrial gene cytochrome c oxidase I (COI) has been repeatedly employed for DNA barcoding and its discriminative effectiveness has been demonstrated in several species6, 7. While the ability of COI to serve as a universal barcode has been doubted8, 9, in this work we use COI sequence data for all experiments, mainly due to the high availability of COI sequences for many species of the same taxonomic groups, publicly available from the Barcode of Life Database (BOLD)10.

Three data sets are employed: Fishes of Australia11, Ant Diversity in Northern Madagascar12, and the Birds of North America - Phase II project13. By locating matching images we constructed three data sets containing multiple genotype-phenotype pairs (M, P), where M stands for the COI gene sequence and P is a single image of an animal of that species. No fish or ant species is associated with more than one image, while the bird data set contains several genetic markers and images for each species. The images were extracted from several sources. A total of 157 fish species illustrations with varying style and quality were retrieved from FishBase (http://filaman.ifm-geomar.de/home.htm). Images of 26 relevant ant species were available from AntWeb (http://www.antweb.org/), with each ant photographed from a profile view, a head view and a dorsal view. The fish and ant images were matched to the BOLD record by the species name. The BOLD repository itself contains 125 wing images of sequenced birds covering 25 different species. Links to the images used in our experiments can be found in the accompanying Supplementary Information; some examples are shown in Figure 2.

The ultimate challenge for learning the relations between genotype and phenotype is the successful prediction of appearance based on genetic markers, i.e., given a training set {(Mi, Pi)}, i = 1..N, of N matching markers and images, and a marker Mnew of a new organism, to generate an image P̂new such that P̂new is a good approximation of the actual appearance of the organism. While we have partially successful results predicting the appearance of fish (see below), this task is difficult due to the high dimensionality of the target space, the relatively small size of the training data set and the low relevance of the genetic markers used, which provide only partial correlative information.

A more accessible task is to identify the image of an animal of the same species out of k candidate images P1, ..., Pk given its genotype M, where the genetic marker and the species in the candidate images are all previously unseen. Note that the ability to predict the image P̂ that corresponds to the genetic marker M implies the ability to identify without difficulty the correct image as the one closest visually to P̂. The converse is not true: the correlative nature of the learned genotype-phenotype relations can be used to solve the selection task without providing the ability to synthesize an image.

Indeed, our experiments demonstrate that the COI gene by itself provides sufficient information to allow, in the case of two alternatives (k = 2), correct identification at considerably higher success rates than the 50% chance performance.

Throughout this work, linear models are used for prediction. First, both the sequences of the COI gene (M) and the images (P) are represented as vectors m and p respectively, of lengths nm and np. To describe the genomic data this way, we represent the n nucleotides of each sequence (n is constant per dataset, and its value varies between 576 for ants and 669 for birds) as a vector of dimension nm = 4n of real numbers between 0 and 1. Each element marks the appearance of a specific nucleotide (A, G, C or T) at a specific location in the COI sequences of the species. In cases where there are multiple measurements in the BOLD database for a particular species, the elements mark the relative frequency of a nucleic acid of a certain type. In this representation, the dot product of two gene vectors is the expected number of agreements between the sequences. This representation has been demonstrated to be effective for predicting univariate phenotypes14. An alternative representation based on the Kimura two-parameter model (K2P)15 was also tested, with nearly identical results (see Supplementary Information). For the representation of images as vectors we use a bag-of-SIFT-features representation (see Box 1), where keypoints are sampled uniformly across all image locations. For all data sets np = 11,111.
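As a concrete illustration, this encoding can be sketched in a few lines (a hypothetical reconstruction; the function name and the skipping of gap/ambiguity characters are our assumptions, not details from the original implementation):

```python
import numpy as np

NUCS = "AGCT"

def encode_sequences(seqs):
    """Encode a species' aligned COI sequences as a 4n-dimensional vector.

    Entry 4*pos + j holds the relative frequency of nucleotide NUCS[j]
    at alignment position pos across the available sequences, so the
    dot product of two species vectors is the expected number of
    positions at which their sequences agree.
    """
    n = len(seqs[0])
    m = np.zeros(4 * n)
    for seq in seqs:
        for pos, nuc in enumerate(seq):
            if nuc in NUCS:  # assumption: skip gaps and ambiguity codes
                m[4 * pos + NUCS.index(nuc)] += 1.0
    return m / len(seqs)
```

For species with a single sequence the dot product simply counts matching positions; with multiple sequences per species it becomes the expected number of agreements, as described above.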

For the identification task, we first subtract the mean value from all markers and images. Then we learn from the training set a pair of linear transformations Tm, Tp that transform markers and images to a common vector space R^min(nm, np), so that the transformed vectors Tm(mi) and Tp(pi), i = 1..N, of matching training markers and images are similar in the sense of a small Euclidean distance. Put differently, we seek linear transformations Tm, Tp that minimize

(1/N) Σ_{i=1}^{N} ‖Tm(mi) − Tp(pi)‖².   (1)

Additional constraints need to be provided to avoid trivial or singular solutions, namely that the components of the resulting vectors be pairwise uncorrelated and of unit variance:

Σ_{i=1}^{N} xi xiᵀ = Σ_{i=1}^{N} yi yiᵀ = I,   (2)

where xi = Tm(mi) − (1/N) Σ_{j=1}^{N} Tm(mj) and yi = Tp(pi) − (1/N) Σ_{j=1}^{N} Tp(pj). This problem can be solved through Canonical Correlation Analysis16 (CCA). Please refer to the Supplementary Information for details.

Since the feature vectors for both genes and images are of dimensions significantly higher than the number of training samples, statistical regularization must be used to avoid overfitting. We use the regularized version of CCA suggested by Vinod17. Generally, two regularization parameters need to be determined: ηm and ηp. We use a single regularization parameter η instead, as follows. Let X = [x1 x2 . . . xN] and Y = [y1 y2 . . . yN] (xi and yi are defined following Eq. 2), and denote by λm, λp the largest eigenvalues of XXᵀ and YYᵀ. We set ηm = ηλm and similarly ηp = ηλp. This way of choosing the regularization parameters is invariant to scale and can be used uniformly across all data sets.
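The regularized CCA step, with the scale-invariant choice of ηm and ηp just described, might be sketched as follows. The whitening-plus-SVD formulation and all names here are our own assumptions; the authors' actual implementation may differ:

```python
import numpy as np

def sqrtm_psd(A):
    """Symmetric PSD matrix square root via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

def regularized_cca(X, Y, eta=0.05):
    """Regularized CCA for centered data matrices X (nm x N), Y (np x N).

    The ridge terms are eta times the largest eigenvalue of XX^T and
    YY^T respectively, so the choice is invariant to rescaling either
    view.  Returns projections Tm, Tp (rows are canonical directions)
    and the canonical correlations.
    """
    Cxx, Cyy = X @ X.T, Y @ Y.T
    ex = eta * np.linalg.eigvalsh(Cxx)[-1]
    ey = eta * np.linalg.eigvalsh(Cyy)[-1]
    Rx = np.linalg.inv(sqrtm_psd(Cxx + ex * np.eye(Cxx.shape[0])))
    Ry = np.linalg.inv(sqrtm_psd(Cyy + ey * np.eye(Cyy.shape[0])))
    U, s, Vt = np.linalg.svd(Rx @ X @ Y.T @ Ry)
    k = min(X.shape[0], Y.shape[0])
    return U[:, :k].T @ Rx, Vt[:k, :] @ Ry, s[:k]
```

On real data, where nm and np far exceed N, the computation would be carried out in a reduced basis; the sketch assumes the data matrices are small enough to handle directly.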

After learning the transformations Tm, Tp from the training set, the identification task consists of choosing the one image from a set {p1, ..., pk} of previously unseen images that is most likely to match the previously unseen genetic marker mnew. We select argmin_t D(Tm(mnew), Tp(pt)), where D(u, v) is a Mahalanobis-like distance function (see Methods).
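In code, the selection rule amounts to a one-line argmin (an illustrative sketch; `dist` stands in for the distance D, and plain Euclidean distance is used in the example below for brevity):

```python
import numpy as np

def identify(Tm, Tp, m_new, candidate_images, dist):
    """Return the index of the candidate image whose transformed
    representation is closest to the transformed genetic marker."""
    u = Tm @ m_new
    return int(np.argmin([dist(u, Tp @ p) for p in candidate_images]))
```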

Experiments are performed using multiple random training and test splits, and in each trial 90% of the examples are used for training (see Methods). In the bird-wing data-set there are several samples for each species, and to avoid trivial identification where we match an image of a previously seen species, no species appears in both the training and test sets, and the choice is always between members of two different previously unseen species. Figure 3(top) shows the identification results for the fish, bird-wing and ant data sets for a regularization parameter of η = 0.05. The mean identification accuracy is 90.5% for fish, 72% for bird wings and up to 63.8% for the profile ant views.

The varying performance level can be partly accounted for by the training set sizes (when using as many training images in the fish experiments as in the ant experiments, performance drops to 77%), as well as by the increased variance due to smaller test set size in the ant experiments. Figure 3(bottom) shows results for varying values of η. As can be seen, the performance level is stable across a large range of the regularization parameter.

Next, we validate the need to learn directly the connections between the genotype and the phenotype, instead of the alternative of analyzing each separately and using the analysis for identification. First, the performance of a simple Nearest Neighbor method is evaluated: given a new genetic marker mnew, it chooses out of the existing markers the closest one, argmin_{i=1..N} ‖mi − mnew‖, and selects the candidate image most similar to the corresponding image pi: argmin_k ‖pk − pi‖. As can be seen in Figure 3, this method performs, on average, about 20% worse than the above CCA method for fish (p < 10^-7). Second, we construct phylogenetic trees for the genotype and the phenotype using the UPGMA method18.
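The Nearest Neighbor baseline can be sketched as follows (an illustrative reconstruction; the names are ours):

```python
import numpy as np

def nn_baseline(train_markers, train_images, m_new, candidate_images):
    """Nearest Neighbor identification: find the training marker closest
    to m_new, then pick the candidate image closest to that marker's
    paired training image."""
    i = int(np.argmin([np.linalg.norm(m - m_new) for m in train_markers]))
    return int(np.argmin([np.linalg.norm(p - train_images[i])
                          for p in candidate_images]))
```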

As expected, due to the multitude of parameters that affect images, the image-based phylogenetic tree differs significantly from the gene-based one (see Supplementary Information). Attempting to employ these trees for identification (in two different ways, see Methods) produces significantly worse performance on fish and birds than our CCA-based method; the ant data set is too small to judge significance. This suggests that learning genotype and phenotype together is more effective than analyzing each separately.

The above experiments were designed to evaluate inter-class identification. For the bird data set, several genetic markers and images from members of the same species are available, which enables additional intra-species identification experiments. For such experiments the relevance of same-species examples is much larger than that of examples from other classes: the non-causality assumption implies that the intra-class phenotype-genotype correlations differ from the inter-class correlations. Inter-species correlation may provide valuable information regarding the correlations that exist between elements in the genotype or phenotype vector, but probably cannot predict the relevance of an intra-species SNP to some visual trait. Since each bird species has only 2-7 samples, our identification experiments do not perform better than chance. We therefore designed a task easier than multiple-choice identification: at each test we provide a pair of genetic markers (from the same species) and a pair of images, and matching is done for both pairs simultaneously, providing additional evidence (see Methods for more details). In this second experiment 63% correct matching was achieved (p < 0.00003).

For the task of fish image synthesis we focus on predicting the outline (contour) of the fish. We first identified a subset of 93 fish images that have the same number of fins, and therefore share similar topologies. 35 control points corresponding to visually identifiable locations in the fish images were then marked manually on each image (see Supplementary Information). These 35 2D control points constitute the shape model, which is represented as a 70-dimensional vector si, i = 1..93. In addition, for each of the 93 fish, a simple ad-hoc automatic segmentation algorithm was run to identify the set of points defining its contour.

In each image synthesis experiment, a random partition of 82 training and 11 testing images is created. Linear ridge regression19 is used to learn, from the training set, the mapping between the genetic markers mi and the shape model si. The regularization value of the ridge regression is set, as above, to some parameter (η = 0.05) times the largest eigenvalue of the covariance matrix of mi. To synthesize the predicted contour for an unseen genotype mnew, we predict its shape snew according to the learned regression formula, and we also find the training example mnearest which is closest genetically to mnew. We then warp the contour of this nearest species to create the predicted contour.

This warping is performed by applying a thin-plate-spline warp20 which is computed in accordance with the matching points snearest and snew. Results for the testing fish of one run are shown in Figure 4.
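The regression step can be sketched in closed form (an assumed implementation: the ridge constant is tied to the largest eigenvalue of the marker covariance as in the text, but variable names are ours and the thin-plate-spline warp is omitted):

```python
import numpy as np

def fit_shape_regression(M, S, eta=0.05):
    """Ridge regression from marker vectors (rows of M) to shape
    vectors (rows of S).  The ridge constant is eta times the largest
    eigenvalue of M^T M; predict a new shape with m_new @ W."""
    C = M.T @ M
    lam = eta * np.linalg.eigvalsh(C)[-1]
    return np.linalg.solve(C + lam * np.eye(C.shape[0]), M.T @ S)
```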

To evaluate the quality of the contour reconstruction results, we applied a simple distance measure between the original and predicted contours. For each point on the original contour we measure the distance to the closest point on the predicted one, and we average this score over all contour points and all 11 testing contours. The mean error obtained for the contour prediction is 3.4% of the image width (standard deviation of 0.6%, measured across 100 repetitions), compared to 6.3% (SD 5%) obtained by randomly picking one of the training contours as the prediction.
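This error measure can be written compactly (a sketch assuming contours are given as (k, 2) point arrays; normalization by image width is left to the caller):

```python
import numpy as np

def contour_error(original, predicted):
    """Mean, over points of the original contour, of the Euclidean
    distance to the closest point of the predicted contour."""
    d = np.linalg.norm(original[:, None, :] - predicted[None, :, :], axis=2)
    return float(d.min(axis=1).mean())
```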

The experiments presented in our work show that images can be identified, and even predicted to some extent, by examining small sequences of an animal's genome. The use of the COI gene makes our results more definite by eliminating the possibility of the causal genotype-phenotype relations that have been the focus of previous work. Several applications could be made possible by further development of such capabilities, such as predicting the appearance of extinct animals for which tissues are available, and forensic applications including the identification of suspects by matching DNA evidence to surveillance video, or even the synthesis of their portraits from such evidence. The methods employed here can be transferred to non-vision-related tasks as well. For example, given a data set of successfully recovered patients, where for each patient genetic markers and records of treatment are available, models suggesting the suitability of treatments for new patients may be constructed without identifying the underlying mechanisms or focusing on relevant genes.

Box 1

Modern image representations.

During the last few years, considerable progress has been made in the development of efficient visual representations serving as the core of several accurate computer vision systems. For example, real-time systems now exist that can detect specific classes of objects, such as people and cars, within complex images.

Often, an image is represented by a set of high-dimensional feature vectors, which describe its appearance at various locations. Among the most popular is the SIFT descriptor21, which captures the distribution of local image-edge orientations; it is designed to be invariant to scaling and rigid image transforms, and is almost unaffected by local appearance perturbations.

One paradigm that has been proven effective despite its simplicity is the "bag-of-features" paradigm22, 23. In this paradigm, image descriptors are extracted from each training image at visually distinct or, alternatively, at random image locations. Next, the vector space of image descriptors is partitioned by applying a clustering algorithm to all descriptors arising from the training images. Each image (training or new) is then represented by the number of descriptors extracted from it that belong to each partition of the vector space.
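The histogramming step of the paradigm can be sketched as follows (descriptor extraction and the clustering that produces the vocabulary are omitted; in practice the descriptors are SIFT vectors and the vocabulary comes from (hierarchical) k-means):

```python
import numpy as np

def bag_of_features(descriptors, centroids):
    """Represent an image by the normalized counts of its local
    descriptors over a fixed vocabulary: each descriptor is assigned
    to its nearest centroid, and the per-centroid counts are returned
    as a histogram summing to one."""
    d = np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=2)
    hist = np.bincount(d.argmin(axis=1), minlength=len(centroids)).astype(float)
    return hist / hist.sum()
```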

Other paradigms for visual recognition include hierarchical neuroscience-motivated systems24, 25. These systems are designed to have a layered structure, in which invariance and selectivity are increased toward the top of the hierarchy. In the Supplementary Information, we describe experiments with the system of C1 descriptors26 that show a performance level similar to that of the bag-of-features paradigm.

Methods Summary

Data sets The Barcode of Life Database contains 192 unique species in the Fishes of Australia project. Fish images were retrieved from FishBase (http://filaman.ifm-geomar.de/home.htm). We found reasonable illustrations for 157 out of the 192 species. Note that the illustrations vary in style and quality.

Ant images were retrieved from AntWeb (http://www.antweb.org/). Out of 86 ant species sequenced in the Ant Diversity in Northern Madagascar Barcode of Life project, we matched 26 entries with images in the AntWeb site. For each ant there are three photographs: head, profile and dorsal.

Bird sequences and images were taken from the Birds of North America - Phase II project at the Barcode of Life Database13. We used images of the birds' wings, taken from a lateral view. Out of 657 unique species with sequences, only 25 have such wing images. Several sequences and images are available for each species, spanning a total of 125 sequences and matching images. We flipped the images such that all bird wings point left, and cropped them such that all the paper signs visible in the original images were removed.

Experimental procedure The experimental results presented in this work are evaluated using holdout-type cross validation: in each of the trials of an experiment, 90% of the data were chosen randomly to be used as the training set, and the remaining 10% were used as the test set. When comparing the influence of various parameters on the results of the same data set, care was taken to use the same divisions into training and test data in each experiment. When using the bird wings data set, no single species appears in both the training and test sets. In the identification experiments, for each genetic marker in the test set, an experiment is performed for each combination of the true image and another test image, and the results are averaged over all pairs. The number of trials in an experiment varies based on the size of the test set, since a smaller test set implies that more trials are needed to reach the same level of statistical significance. 50 trials were used for fish, and 100 trials for birds and ants due to the smaller sizes of the corresponding data sets.

References

[1] Sturm, R. A. & Frudakis, T. N. Eye colour: portals into pigmentation genes and ancestry. Trends in Genetics 20, 327–332 (2004).

[2] Chase, K. et al. Genetic basis for systems of skeletal quantitative traits: Principal component analysis of the canid skeleton. PNAS 99, 9930–9935 (2002).

[3] Sutter, N. B. et al. A Single IGF1 Allele Is a Major Determinant of Small Size in Dogs. Science 316, 112–115 (2007).

[4] Brown, W. M., George, M. & Wilson, A. C. Rapid Evolution of Animal Mitochondrial DNA. PNAS 76, 1967–1971 (1979).

[5] Avise, J. C. et al. Intraspecific Phylogeography: The Mitochondrial DNA Bridge Between Population Genetics and Systematics. Annual Review of Ecology and Systematics 18, 489–522 (1987).

[6] Hebert, P. D. N., Cywinska, A., Ball, S. L. & deWaard, J. R. Biological identifications through DNA barcodes. Proceedings of The Royal Society Biological Sciences 270, 313–321 (2003).

[7] Hebert, P. D. N., Ratnasingham, S. & deWaard, J. R. Barcoding animal life: cytochrome c oxidase subunit 1 divergences among closely related species. Proceedings of the Royal Society B: Biological Sciences 270, S96–S99 (2003).

[8] Will, K. W. & Rubinoff, D. Myth of the molecule: DNA barcodes for species cannot replace morphology for identification and classification. Cladistics 20, 47–55 (2004).

[9] Gregory, T. R. DNA barcoding does not compete with taxonomy. Nature 434, 1067 (2005).

[10] Ratnasingham, S. & Hebert, P. D. N. bold: The Barcode of Life Data System (http://www.barcodinglife.org). Molecular Ecology Notes 7, 355–364 (2007).

[11] Ward, R. D., Zemlak, T. S., Innes, B. H., Last, P. R. & Hebert, P. D. N. DNA barcoding Australia’s fish species. Philosophical Transactions of The Royal Society Biological Sciences 360, 1847–1857 (2005).

[12] Smith, M. A., Fisher, B. L. & Hebert, P. D. N. DNA barcoding for effective biodiversity assessment of a hyperdiverse arthropod group: the ants of Madagascar. Philosophical Transactions of The Royal Society Biological Sciences 360, 1825–1834 (2005).

[13] Kerr, K. C. et al. Comprehensive DNA barcode coverage of North American birds. Molecular Ecology Notes 7, 535–543 (2007).

[14] Yosef, N. et al. Prediction of phenotype information from genotype data. Submitted (by unrelated authors), PLoS Computational Biology (2007).

[15] Kimura, M. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol 16, 111–120 (1980).

[16] Hotelling, H. Relations Between Two Sets of Variates. Biometrika 28, 321–377 (1936).

[17] Vinod, H. D. Canonical ridge and econometrics of joint production. Journal of Econometrics 4, 147–166 (1976).

[18] Sokal, R. R. & Michener, C. D. A statistical method for evaluating systematic relationships. Univ. Kans. Sci. Bull. 38, 1409–1438 (1958).

[19] Hoerl, A. E. & Kennard, R. Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12, 55–67 (1970).

[20] Bookstein, F. Principal warps: thin-plate splines and the decomposition of deformations. IEEE Transactions on Pattern Analysis and Machine Intelligence 11, 567–585 (1989).

[21] Lowe, D. G. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision 60, 91–110 (2004).

[22] Leung, T. & Malik, J. Representing and Recognizing the Visual Appearance of Materials using Three-dimensional Textons. International Journal of Computer Vision 43, 29–44 (2001).

[23] Fergus, R., Perona, P. & Zisserman, A. Object class recognition by unsupervised scale-invariant learning. In Proceedings. 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, II–264–II–271 (2003).

[24] LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE 86, 2278–2324 (1998).

[25] Riesenhuber, M. & Poggio, T. Hierarchical models of object recognition in cortex. Nature Neuroscience 2, 1019–1025 (1999).

[26] Serre, T., Wolf, L., Bileschi, S., Riesenhuber, M. & Poggio, T. Robust object recognition with cortex-like mechanisms. IEEE Trans Pattern Anal Mach Intell 29, 411–426 (2007).

[27] Nister, D. & Stewenius, H. Scalable Recognition with a Vocabulary Tree. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2161– 2168 (2006).

[28] Nowak, E., Jurie, F. & Triggs, B. Sampling strategies for bag-of-features image classification. In European Conference on Computer Vision. Springer (2006).

[29] Beveridge, J. R., Bolme, D., Draper, B. & Teixeira, M. The CSU Face Identification Evaluation System. Machine Vision and Applications 16, 128–138 (2004).

[30] Goh, C., Bogan, A. A., Joachimiak, M., Walther, D. & Cohen, F. E. Co-evolution of proteins with their interaction partners. Journal of Molecular Biology 299, 283–293 (2000).

Methods

Mapping images to vectors The visual descriptors of the images are computed using the bag-of-SIFT implementation of Andrea Vedaldi, available at http://vision.ucla.edu/~vedaldi/code/bag/bag.html. This implementation uses hierarchical K-means27 for partitioning the descriptor space. Keypoints are selected at random locations28.

Matching pairs of genes and images The bird intra-species identification task presents a challenge more difficult than the inter-species identification tests. Assuming that all instances of a species descend from a small population of homogeneous ancestors, correlations between genotypes and phenotypes learned from inter-species data do not allow intra-species identification. In the bird data set, each species is represented by a handful of examples. To make recognition easier, we design the following cross-test.

The task consists of deciding the correct matching between two sequences {m1, m2} and two images {p1, p2}. The transformations Tm and Tp are learned from the training set as before, and both pairs of test markers and images are transformed. Then distances for both matchings are computed by adding the corresponding sequence-image distances:

dist({(m1, p1), (m2, p2)}) = D(Tm(m1),Tp(p1)) + D(Tm(m2),Tp(p2))

dist({(m1, p2), (m2, p1)}) = D(Tm(m1),Tp(p2)) + D(Tm(m2),Tp(p1))

The matching for which the above distance is smaller is chosen.
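The decision rule can be sketched as follows (assuming the two markers and two images have already been mapped to the common space by Tm and Tp; `dist` stands in for the distance D):

```python
def match_pairs(u1, u2, v1, v2, dist):
    """Compare the two possible matchings of transformed markers
    (u1, u2) to transformed images (v1, v2) by their summed distances;
    return True when the matching {(u1, v1), (u2, v2)} is the smaller."""
    return dist(u1, v1) + dist(u2, v2) < dist(u1, v2) + dist(u2, v1)
```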

Distance metrics As the distance metric for the transformed vectors, a metric similar to the commonly-used Mahalanobis distance29 is employed, in which each element of the vectors is multiplied by the corresponding correlation coefficient, thus giving more weight to better-correlated features. Let u and v be the transformed vectors; we define û and v̂ by ûi = λi ui and v̂i = λi vi, where λi is the i'th singular value of the matrix M (see Supplementary Information), and then use the standard cosine distance between û and v̂, thus D(u, v) = 1 − (û · v̂)/(‖û‖ ‖v̂‖). Other distance metrics were also considered and an experimental comparison is included in the Supplementary Information.
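This distance can be sketched as follows (assuming the λi are supplied as a vector):

```python
import numpy as np

def weighted_cosine_distance(u, v, lambdas):
    """D(u, v) = 1 - (u_hat . v_hat) / (||u_hat|| ||v_hat||), where
    u_hat_i = lambdas[i] * u[i]; components associated with larger
    singular values therefore contribute more to the distance."""
    uh, vh = lambdas * u, lambdas * v
    return 1.0 - float(uh @ vh) / (np.linalg.norm(uh) * np.linalg.norm(vh))
```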

Phylogenetic-tree based predictions Two forms of phylogenetic-tree based identification were used for comparison. In the first method, a phylogenetic tree is constructed using the UPGMA method for the pairs of genetic markers and images (mi, pi) in the training set, where the distances are determined by the Euclidean distance between the vector representations of the gene sequences (there is no need to use an alignment distance since the sequences in BOLD are aligned for each data set). Then, given the new marker mnew, the node v = (mi, pi) in the tree that minimizes ‖mi − mnew‖ is found, as well as the nodes uk = (mi_k, pi_k) that minimize ‖pi_k − pnew_k‖ for each of the new images {pnew_k}. The chosen image pk is that for which the tree distance between v and uk is minimal.

The second method is based on a similarity measure between phylogenetic trees, where the similarity between two trees is defined as the linear correlation between the two corresponding distance matrices30; thus the trees need not be constructed in order to compute their similarity. The image chosen by this method maximizes the tree similarity between the trees corresponding to the genetic markers {mi}_{i=1..N} ∪ {mnew} and the images {pi}_{i=1..N} ∪ {pnew_k}.
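This similarity can be sketched as follows (assuming precomputed pairwise-distance matrices):

```python
import numpy as np

def tree_similarity(D1, D2):
    """Linear correlation between two pairwise-distance matrices over
    the same taxa; the phylogenetic trees they induce never need to be
    built.  D1, D2 are symmetric (n, n) arrays with zero diagonals."""
    iu = np.triu_indices(D1.shape[0], k=1)  # upper triangle, off-diagonal
    return float(np.corrcoef(D1[iu], D2[iu])[0, 1])
```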

Figure 2: Examples of images used in our experiments. (top) Illustrations of fish from FishBase: Acanthopagrus butcheri, Argyrops spinifer, Carcharhinus dussumieri and Epinephelus rivulatus. (middle) Ant images from head, dorsal and profile views, retrieved from AntWeb: Anochetus mad01, Cataulacus mad02, Terataner mad01 and Tetraponera mad02. (bottom) Wing images available in the BOLD database: Anas americana, Mergus serrator, Aythya marila and Mergus merganser.

[Figure 3 panels. Top: bar chart of percent correct identification for the CCA, NN, phylogenetic-tree-distance and phylogenetic-tree-similarity methods on the Wing, Fish, Ants profile, Ants dorsal and Ants head data sets. Bottom: percent correct identification as a function of eta (10^-7 to 10) for the fish, bird wings and ants data sets.]

Figure 3: (top) Comparison of several identification methods, for fish (n = 50), bird wings (n = 100) and ants (n = 100). (bottom) The effect of changing the regularization parameter on the performance of the CCA method. The error bars depict SD.

Figure 4: Prediction of fish contours based on their COI gene. (a) The original unseen fish image; (b) the extracted contour; (c) the synthesized contour. The fish species shown are (top to bottom) Thunnus orientalis, Otolithes ruber, Epinephelus multinotatus, Epinephelus malabaricus, Caprodon longimanus, Variola louti, Pentaceros decacanthus, Auxis thazard, Epinephelus ergastularius, Platax batavianus, Epinephelus ongus. The remaining 82 fish were used for training.
