bioRxiv preprint doi: https://doi.org/10.1101/338442; this version posted June 4, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. GENETICS | INVESTIGATION
Dimension Reduction and Visualization for Single-copy Alignments via Generalized PCA.
AB Rohrlach∗,†, Nigel Bean∗,†, Gary Glonek∗, Barbara Holland‡, Ray Tobler§, Jonathan Tuke∗,† and Alan Cooper§
∗School of Mathematical Sciences, University of Adelaide, Adelaide, South Australia, 5005, Australia
†ARC Centre of Excellence for Mathematical and Statistical Frontiers, University of Adelaide, Adelaide, South Australia, 5005, Australia
‡School of Natural Sciences (Mathematics), University of Tasmania, Hobart, Tasmania 7001, Australia
§Australian Centre for Ancient DNA, School of Biological Sciences, University of Adelaide, Adelaide, South Australia, 5005, Australia
ABSTRACT Principal components analysis (PCA) has been one of the most widely used exploration tools in genomic data analysis since its introduction in 1978 (Menozzi et al. 1978). PCA allows similarities between individuals to be efficiently calculated and visualized, optimally in two dimensions. While PCA is well suited to analyses concerned with autosomal DNA, no analogue for PCA exists for the analysis and visualization of non-autosomal DNA. In this paper we introduce a statistically valid method for the analysis of single-copy sequence data. We then show that tests for relationships between genetic information and qualitative and quantitative characteristics can be implemented in a rigorous statistical framework. We motivate the use of our method with examples from empirical data.
KEYWORDS Dimension Reduction, mtDNA, Population Genetics
Genetics, Vol. XXX, XXXX–XXXX June 2018
Copyright © 2018 by the Genetics Society of America
doi: 10.1534/genetics.XXX.XXXXXX
Manuscript compiled: Friday 1st June, 2018
1 Corresponding author: School of Mathematical Sciences, University of Adelaide, SA, 5005. E-mail: [email protected]

Introduction

An important feature of any genetic analysis can be detecting whether a sample comes from a structured population. Demographic structure can take many forms. For example, samples may be taken from geographically isolated subpopulations (Tobler et al. 2017), from subpopulations along a migration route (Novembre and Stephens 2008) or from temporally separated population replacement events (Posth et al. 2016). In some cases it can be of interest to discover that no geographic structure exists at all, leading to the exploration of social structure (Van Gremberghe et al. 2011).

A popular form of unsupervised data exploration is principal components analysis (PCA) (Pearson 1901). PCA is a dimension reduction technique that takes $n$ $p$-dimensional vectors and, using linear combinations of the original vectors, finds $\min(n - 1, p)$ $p$-dimensional basis vectors. The new vectors are ordered by the amount of variability explained by each 'principal dimension'. Often the first few dimensions are used to visualise points in the new transformed space.

PCA is a non-parametric, hypothesis-free exploratory technique, making it a particularly attractive analytical tool. However, PCA does require that the vectors of information are quantitative variables. Clearly sequence characters are not quantitative random variables, and so a transformation must be applied to raw sequence data before PCA can be directly applied (Patterson et al. 2006). However, we are aware of no such suitable transformation for DNA sequences that are non-biallelic, and in particular, haploid DNA such as mitochondrial DNA (mtDNA) or Y chromosome sequences. Instead, we suggest the application of multiple correspondence analysis (MCA) directly to the sequence characters.

MCA is an adaptation of PCA where categorical variables (in this case Single Nucleotide Polymorphisms: SNPs) are converted into binary variables denoting the presence or absence of each level of the variables (in this case alleles) (Jolliffe 2002). Unlike PCA, MCA can be applied to any number of alleles. Our method makes the assumption that SNP inheritance is random, i.e. that the underlying phylogenetic tree is a star tree. One could test whether alleles appear to occur independently by investigating a contingency table of pairwise allele counts for an alignment, and then applying a chi-squared test. Since one would almost always overwhelmingly reject the null hypothesis, the result of a chi-squared test would be of no interest. However, the matrix of signed residuals under this assumption forms the basis of the transformation from sequence data to continuous data.

In this paper we aim to show that MCA is a statistically powerful method for the analysis of non-autosomal DNA. We show that MCA has many properties that are analogous to PCA, and is hence immediately intuitive to researchers with experience using PCA. We demonstrate that PCA only quantifies the relationships between rows (individuals), while MCA quantifies the relationships between the rows (individuals) and also the columns (SNPs) simultaneously. For this reason we can quantify and visualise relationships between individuals as in PCA, and also quantify and visualise the relationship between SNPs, and between SNPs and individuals simultaneously in the same dimensions.

We show that results obtained from MCA correspond to the results obtained from mtDNA phylogenetic trees in a meaningful way, and that demographic structure can be detected using these results. We explore an alignment of African mtDNA from haplogroups L0, L1, L2, L4 and L5 to show that our method produces valid and easily interpretable results by reproducing mtDNA macro-haplogroups via clustering. We also explore an alignment of modern and ancient thylacine mtDNA from a mainland and island population. We show that thylacine genetic signals are highly correlated with longitude, and identify a possible ancestral migration route. Finally we explore an alignment of Western Australian Ghost Bat mtDNA to show that genetic diversity can be almost completely explained by discrete cave locations.

Materials and Methods

The transformation of genomic data to continuous coordinates

Consider an $n \times p$ alignment $A$ of mtDNA, where $A_{ij} \in \{A, C, G, T\}$, filtered to remove invariant sites. The $n$ rows represent sequenced individuals, denoted $\{a_1, \cdots, a_n\}$, and the $p$ columns represent single nucleotide polymorphisms (SNPs), denoted $s_1, \cdots, s_p$. Note that each SNP can take between two and four forms, and we say that $s_j$ has $|s_j|$ levels.

Consider each of the $Q = \sum_{j=1}^{p} |s_j|$ different allelic forms of the $p$ SNPs, ordered (without loss of generality) numerically by position and then, within SNPs, lexicographically by nucleotide. We can define an $n \times Q$ indicator matrix $X$, such that $X_{ik}$ equals one if individual $a_i$ has the allele at the position indicated by the $k$th column name, and zero otherwise, for $k = 1, \cdots, Q$ (see Figure 1). Note that for a SNP with $|s_j|$ levels, there are only $|s_j| - 1$ linearly independent columns of information in the $X$ matrix (since if an individual does not have any of the first $|s_j| - 1$ forms of the allele, they must have the remaining allele). Hence, in total there are only $Q - p$ linearly independent columns. Finally, we can also calculate a contingency table of pairwise marker combinations $B = X^{T}X$ (see Figure 1); this matrix is discussed later.

            SNP1  SNP2  SNP3
        a1  A     G     C
    A = a2  A     T     C
        a3  C     T     G
        a4  C     T     G

                 ↓

            SNP1A  SNP1C  SNP2G  SNP2T  SNP3C  SNP3G
        a1  1      0      1      0      1      0
    X = a2  1      0      0      1      1      0
        a3  0      1      0      1      0      1
        a4  0      1      0      1      0      1

                 ↓

               SNP1A  SNP1C  SNP2G  SNP2T  SNP3C  SNP3G
        SNP1A  2      0      1      1      2      0
        SNP1C  0      2      0      2      0      2
    B = SNP2G  1      0      1      0      1      0
        SNP2T  1      2      0      3      1      2
        SNP3C  2      0      1      1      2      0
        SNP3G  0      2      0      2      0      2

Figure 1 A transformation from raw sequence alignment $A$, to an indicator matrix $X$, and a Burt table $B = X^{T}X$.

Let $N = \sum_{i=1}^{n} \sum_{j=1}^{Q} x_{ij}$, $r = \frac{1}{N} X 1_Q$ and $c = \frac{1}{N} X^{T} 1_n$, where $1_k$ is a $k \times 1$ vector of ones, and define $D_r = \mathrm{diag}(r)$ and $D_c = \mathrm{diag}(c)$. We can define a new $n \times Q$ matrix, as a function of $X$,

$$f(X) = D_r^{-1/2} \left( \frac{1}{N} X - r c^{T} \right) D_c^{-1/2}. \tag{1}$$

On $f(X)$ we perform a compact singular value decomposition (SVD) so that $f(X) = U \Sigma V^{T}$. Because only $Q - p$ columns of $X$ are linearly independent, the diagonal matrix of singular values, $\Sigma$, has at most $J = Q - p$ non-zero entries, and we need only consider these dimensions. Following this reasoning, $U$ and $V$ are truncated to be matrices of dimensions $n \times J$ and $Q \times J$, respectively. From the diagonal matrix $\Sigma$ we may also obtain the percentage of inertia (analogous to variability in PCA) explained by each of the first $J$ principal dimensions, which is proportional to the squared singular values.

The standard row and column coordinates, defined as $F^{*} = D_r^{-1/2} U$ and $G^{*} = D_c^{-1/2} V$ respectively, are the unscaled row and column factor scores that do not account for the proportion of inertia in principal dimensions. A natural choice for scaling the standard row coordinates is to post-multiply each dimension by the associated singular value. Hence the relative spread of points in each dimension is proportional to the amount of inertia captured by each dimension. From the standard row scores we obtain the transformed coordinates, also called the 'row factor scores', and denoted $F$, of the individuals in the alignment $A$ in 'genetic space' via

$$F = F^{*} \Sigma. \tag{2}$$

The distances between individuals calculated from these coordinates will respect three properties:

1. If two individuals have the same DNA sequence, they will have identical coordinates.
2. If two individuals share many alleles, they will be closer than two individuals that do not.
3. Individuals that share rare alleles will be closer still.

It is important to note that the pairwise distances between individuals calculated from the matrix $F$ differ from classical pairwise genetic differences in two important ways. First, one need not assume a model of sequence evolution to find the matrix $F$. Second, classical pairwise genetic distances are calculated on only two sequences at a time, and so do not take into account the rarity of alleles. Our method uses the complete alignment to calculate the matrix $F$, and gives greater weight to rarer alleles.

The choice of rescaling for the standard column coordinates, with respect to the standard row coordinates, depends on the desired properties of the resulting column factor scores. We propose rescaling the standard column coordinates by the squares of the singular values, such that the column factor scores are

$$G = G^{*} \Sigma^{2}.$$

This rescaling of the standard column coordinates yields a desirable property for comparing the coordinates of individuals and alleles. The coordinates for any allele can be found at the centroid of the coordinates of the individuals that carry that allele (proof given in Appendix A). A special case of this property is that if an individual uniquely carries an allele, then that allele shares exactly the same coordinates as the individual (proof given in Appendix A).

There is a second, equivalent way to consider the method we have proposed. It is known that the row and column factor scores from $B = X^{T}X$ will be the same as the factor scores obtained from $X$ (Greenacre 2007). The transformation $f(B)$ (found in a similar way as in Equation 1, but with appropriate dimensions for $r$, $c$ and a recalculated normalising constant) would yield a $Q \times Q$ matrix, $R$, of the form $R = (\rho_{ij})$. $R$ is a matrix of the correlations for detecting linkage disequilibrium with multiple alleles, where if $\rho_{ij} \neq 0$, then the loci associated with alleles $i$ and $j$ are in linkage disequilibrium (Zaykin et al. 2008). While mtDNA does not undergo recombination, our method also attempts to identify groups of alleles that occur together more than expected just by random chance, and hence the individuals that carry these alleles.

As with PCA, we can use the principal coordinates to visualize the relationships between individuals. However, our method also allows us to visualize the relationships between SNPs, and between individuals and SNPs. We can also look at the pairwise distances between individuals, and SNPs, in genetic space.

Figure 2 A biplot of the first two principal dimensions for the sequences as shown in Figure 1. Individuals are in black, SNPs are in red and projected coordinates for supplementary variables Size (Big and Small) and Location (Location 1 and Location 2) are shown in purple. The new sequence 'CTC' is projected onto the dimensions and given in green. Euclidean distances between individuals are given in blue.

Figure 2 shows the relationship between the sequences as shown in Figure 1. Since $a_3$ and $a_4$ have identical sequences, they have the same coordinates in genetic space. As $a_1$ shares no similarity with $a_3$ or $a_4$, they are the furthest apart. However, $a_2$ shares one allele with $a_3$ and $a_4$, and two alleles with $a_1$, and hence is more closely 'attracted' to $a_1$. Due to this 'attraction' to individuals with similar SNP profiles, the term 'inertia' is used in the place of 'variance'.

Note the relationship between individual coordinates and SNP coordinates. Since $a_3$ and $a_4$ are the only individuals with 'SNP1_C' and 'SNP3_G', these alleles share the coordinates of $a_3$ and $a_4$ (the same can be said of $a_1$ and 'SNP2_G'). 'SNP1_A' is shared by $a_1$ and $a_2$, and so falls exactly at the mid-point of the two points. However, 'SNP2_T' is shared by $a_2$ and by both $a_3$ and $a_4$, and so lies only one-third of the way along the line connecting $a_3$ and $a_4$ to $a_2$.

Note that in Figure 1, if an individual has an 'A' at the first site, then they always have a 'C' in the third position. Similarly, if an individual has a 'C' at the first site, then they always have a 'G' in the third position. Hence the SNP at the third site provides no new information about the nature of the relationships between individuals, since one can infer the third SNP given the first SNP. For this reason, the first two principal dimensions capture 100% of the inertia, and reducing the dimensionality of the transformed genetic space results in no loss of information about the structure of the relationships between individuals.

It is possible to project new sequences onto the genetic space defined by an MCA. The new sequence must have one of the allelic forms for every SNP from the original alignment. For example, consider an alignment of new sequences of dimension $m \times p$ denoted $H$, with corresponding $m \times Q$ indicator matrix $i_H$ (see Figure 3).
          SNP1  SNP2  SNP3
    H =   C     T     C

                 ↓

          SNP1A  SNP1C  SNP2G  SNP2T  SNP3C  SNP3G
    iH =  0      1      0      1      1      0

Figure 3 A transformation from a new raw sequence alignment H, to an indicator matrix iH to be projected onto existing MCA dimensions.

             Size   Location
         a1  Big    1
    W =  a2  Big    2
         a3  Small  2
         a4  Small  2

                 ↓

             Size_Big  Size_Small  Location_1  Location_2
         a1  1         0           1           0
    jW = a2  1         0           0           1
         a3  0         1           0           1
         a4  0         1           0           1
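Projection of the new sequence in Figure 3 might be sketched as follows, in Python with NumPy. The projection formula is not given in this part of the text, so the sketch assumes the standard correspondence-analysis transition formula: the row profile of the new indicator row multiplied by the standard column coordinates $G^{*}$, restricted to dimensions with non-zero singular values. It further assumes that supplementary categories such as Size and Location are placed at the centroid of the individuals in each category, by analogy with the allele scaling. The names `one_hot` and `project_sequences` are illustrative only.

```python
import numpy as np

def one_hot(rows, levels):
    """Indicator matrix with one column per (variable index, level) pair."""
    return np.array([[1.0 if row[j] == lev else 0.0 for j, lev in levels]
                     for row in rows])

# Toy alignment from Figure 1 and the allele columns of its indicator matrix.
A = ["AGC", "ATC", "CTG", "CTG"]
alleles = [(0, "A"), (0, "C"), (1, "G"), (1, "T"), (2, "C"), (2, "G")]
X = one_hot(A, alleles)

# MCA of X, keeping only dimensions with non-zero singular values.
N = X.sum()
r, c = X.sum(axis=1) / N, X.sum(axis=0) / N
S = (X / N - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, s, Vt = np.linalg.svd(S, full_matrices=False)
keep = s > 1e-10
U, s, V = U[:, keep], s[keep], Vt[keep].T
F = (U / np.sqrt(r)[:, None]) * s              # row factor scores
G_std = V / np.sqrt(c)[:, None]                # standard column coordinates G*

def project_sequences(seqs):
    """Assumed transition formula: row profiles times standard column coordinates."""
    iH = one_hot(seqs, alleles)
    profiles = iH / iH.sum(axis=1, keepdims=True)
    return profiles @ G_std

new_coords = project_sequences(["CTC"])        # the new sequence H from Figure 3

# Supplementary variables (assumption: category point = centroid of its individuals).
W = [("Big", "1"), ("Big", "2"), ("Small", "2"), ("Small", "2")]
levels_W = [(0, "Big"), (0, "Small"), (1, "1"), (1, "2")]
jW = one_hot(W, levels_W)
supp_coords = (jW.T @ F) / jW.sum(axis=0)[:, None]
```

One sanity check under these assumptions is that projecting the original sequences with `project_sequences(A)` should return their own row factor scores `F`, since the transition formula reproduces the coordinates of rows already in the analysis.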