bioRxiv preprint doi: https://doi.org/10.1101/338442; this version posted June 4, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

GENETICS | INVESTIGATION

Dimension Reduction and Visualization for Single-copy Alignments via Generalized PCA.

AB Rohrlach∗,†, Nigel Bean∗,†, Gary Glonek∗, Barbara Holland‡, Ray Tobler§, Jonathan Tuke∗,† and Alan Cooper§

∗School of Mathematical Sciences, University of Adelaide, Adelaide, South Australia, 5005, Australia
†ARC Centre for Excellence for Mathematical and Statistical Frontiers, University of Adelaide, Adelaide, South Australia, 5005, Australia
‡School of Natural Sciences (Mathematics), University of Tasmania, Hobart, Tasmania 7001, Australia
§Australian Centre for Ancient DNA, School of Biological Sciences, University of Adelaide, Adelaide, South Australia, 5005, Australia

ABSTRACT Principal components analysis (PCA) has been one of the most widely used exploration tools in genomic data analysis since its introduction in 1978 (Menozzi et al. 1978). PCA allows similarities between individuals to be efficiently calculated and visualized, optimally in two dimensions. While PCA is well suited to analyses concerned with autosomal DNA, no analogue for PCA exists for the analysis and visualization of non-autosomal DNA. In this paper we introduce a statistically valid method for the analysis of single-copy sequence data. We then show that tests for relationships between genetic information and qualitative and quantitative characteristics can be implemented in a rigorous statistical framework. We motivate the use of our method with examples from empirical data.

KEYWORDS Dimension Reduction, mtDNA, Population Genetics

Copyright © 2018 by the Genetics Society of America
doi: 10.1534/genetics.XXX.XXXXXX
Manuscript compiled: Friday 1st June, 2018
1Corresponding author: School of Mathematical Sciences, University of Adelaide, SA, 5005. E-mail: [email protected]

Introduction

An important feature of any genetic analysis can be detecting whether a sample comes from a structured population. Demographic structure can take many forms. For example, samples may be taken from geographically isolated subpopulations (Tobler et al. 2017), from subpopulations along a migration route (Novembre and Stephens 2008) or from temporally separated population replacement events (Posth et al. 2016). In some cases it can be of interest to discover that no geographic structure exists at all, leading to the exploration of social structure (Van Gremberghe et al. 2011).

A popular form of unsupervised data exploration is principal components analysis (PCA) (Pearson 1901). PCA is a dimension reduction technique that takes $n$ $p$-dimensional vectors and, using linear combinations of the original vectors, finds $\min(n - 1, p)$ $p$-dimensional basis vectors. The new vectors are ordered by the amount of variability explained by each ‘principal dimension’. Often the first few dimensions are used to visualise points in the new transformed space.

PCA is a non-parametric, hypothesis-free exploratory technique, making it a particularly attractive analytical tool. However, PCA does require that the vectors of information are quantitative variables. Clearly sequence characters are not quantitative random variables, and so a transformation must be applied to raw sequence data before PCA can be directly applied (Patterson et al. 2006). However, we are aware of no such suitable transformation for DNA sequences that are non-biallelic, and in particular, haploid DNA such as mitochondrial DNA (mtDNA) or Y chromosome sequences. Instead, we suggest the application of multiple correspondence analysis (MCA) directly to the sequence characters.

MCA is an adaptation of PCA where categorical variables (in this case Single Nucleotide Polymorphisms: SNPs) are converted into binary variables denoting the presence or absence of each level of the variables (in this case alleles) (Jolliffe 2002). Unlike PCA, MCA can be applied to any number of alleles. Our method makes the assumption that SNP inheritance is random, i.e. that the underlying phylogenetic tree is a star tree. One could test whether alleles appear to occur independently by investigating a contingency table of pairwise allele counts for an alignment, and then applying a chi-squared test. Since one would almost always overwhelmingly reject the null hypothesis, the result of a chi-squared test would be of no interest. However, the matrix of signed residuals under this assumption forms the basis of the transformation from sequence data to continuous data.

In this paper we aim to show that MCA is a statistically powerful method for the analysis of non-autosomal DNA. We show that MCA has many properties that are analogous to PCA, and is hence immediately intuitive to researchers with experience using PCA. We demonstrate that PCA only quantifies the relationships between rows (individuals), while MCA quantifies the relationships between the rows (individuals) and also the columns (SNPs) simultaneously. For this reason we can quantify and visualise relationships between individuals as in PCA, and also quantify and visualise the relationship between SNPs, and between SNPs and individuals simultaneously in the same dimensions.

We show that results obtained from MCA correspond to the results obtained from mtDNA phylogenetic trees in a meaningful way, and that demographic structure can be detected using these results. We explore an alignment of African mtDNA from haplogroups L0, L1, L2, L4 and L5 to show that our method produces valid and easily interpretable results by reproducing mtDNA macro-haplogroups via clustering. We also explore an alignment of modern and ancient thylacine mtDNA from a mainland and island population. We show that thylacine genetic signals are highly correlated with longitude, and identify a possible ancestral migration route. Finally we explore an alignment of Western Australian Ghost Bat mtDNA to show that genetic diversity can be almost completely explained by discrete cave locations.

Materials and Methods

The transformation of genomic data to continuous coordinates

Consider an $n \times p$ alignment $A$ of mtDNA, where $A_{ij} \in \{A, C, G, T\}$, filtered to remove homogeneous sites. The $n$ rows represent sequenced individuals, denoted $\{a_1, \cdots, a_n\}$, and the $p$ columns represent single nucleotide polymorphisms (SNPs), denoted $\{s_1, \cdots, s_p\}$. Note that each of the SNPs can take between two and four forms, and we say that $s_j$ has $|s_j|$ levels.

Consider each of the $Q = \sum_{j=1}^{p} |s_j|$ different allelic forms of the $p$ SNPs, ordered (without loss of generality) numerically by position, then within SNPs, lexicographically by nucleotide. We can define an $n \times Q$ indicator matrix $X$, such that $X_{ik}$ equals one if individual $a_i$ has the allele at the position indicated by the $k$th column name, for $k = 1, \cdots, Q$ (see Figure 1). Note that for a SNP with $|s_j|$ levels, there are only $|s_j| - 1$ linearly independent columns of information in the $X$ matrix (since if an individual does not have any of the first $|s_j| - 1$ forms of the allele, they must have the remaining allele). Hence, in total there are only $Q - p$ linearly independent columns. Finally, we can also calculate a contingency table of pairwise marker combinations $B = X^T X$ (see Figure 1); this matrix is discussed later in the process.

A (raw sequence alignment):
      SNP1  SNP2  SNP3
a1    A     G     C
a2    A     T     C
a3    C     T     G
a4    C     T     G

X (indicator matrix):
      SNP1A  SNP1C  SNP2G  SNP2T  SNP3C  SNP3G
a1    1      0      1      0      1      0
a2    1      0      0      1      1      0
a3    0      1      0      1      0      1
a4    0      1      0      1      0      1

B (Burt table):
        SNP1A  SNP1C  SNP2G  SNP2T  SNP3C  SNP3G
SNP1A   2      0      1      1      2      0
SNP1C   0      2      0      2      0      2
SNP2G   1      0      1      0      1      0
SNP2T   1      2      0      3      1      2
SNP3C   2      0      1      1      2      0
SNP3G   0      2      0      2      0      2

Figure 1 A transformation from raw sequence alignment $A$, to an indicator matrix $X$, and a Burt table $B = X^T X$.

Let $N = \sum_{i=1}^{n} \sum_{j=1}^{Q} x_{ij}$, $r = \frac{1}{N} X 1_Q$ and $c = \frac{1}{N} X^T 1_n$, where $1_k$ is a $k \times 1$ vector of ones, and define $D_r = \mathrm{diag}(r)$ and $D_c = \mathrm{diag}(c)$. We can define a new $n \times Q$ matrix, as a function of $X$,

$$f(X) = D_r^{-1/2} \left( \frac{1}{N} X - r c^T \right) D_c^{-1/2}. \qquad (1)$$

On $f(X)$ we perform a compact singular value decomposition (SVD) so that $f(X) = U \Sigma V^T$. Due to the above number of linearly independent columns, the diagonal matrix of singular values, $\Sigma$, will have at most $J = Q - p$ non-zero entries, and we need only consider these dimensions. Following this reasoning, $U$ and $V$ are truncated to be matrices of dimensions $n \times J$ and $Q \times J$, respectively. From the diagonal matrix $\Sigma$ we may also obtain the percentage of inertia (analogous to variability in PCA) explained by each of the first $J$ principal dimensions, which are proportional to the singular values.

The standard row and column coordinates, defined as $F^* = D_r^{-1/2} U$ and $G^* = D_c^{-1/2} V$ respectively, are the unscaled row and column factor scores that do not account for the proportion of inertia in principal dimensions. A natural choice for scaling the standard row coordinates is to post-multiply each dimension by the associated singular value. Hence the relative spread of points in each dimension is proportional to the amount of inertia captured by each dimension.

From the standard row scores we obtain the transformed coordinates, also called the ‘row factor scores’, and denoted $F$, of the individuals in the alignment $A$ in ‘genetic space’ via

$$F = F^* \Sigma. \qquad (2)$$

The distances between individuals calculated from these coordinates will respect three properties:

1. If two individuals have the same DNA sequence, they will have identical coordinates.
2. If two individuals share many alleles, they will be closer than two individuals that do not.
3. Individuals that share rare alleles will be closer still.

It is important to note that the pairwise distances between individuals calculated from the matrix $F$ differ from classical pairwise genetic differences in two important ways. First, one need not assume a model of sequence evolution to find the matrix $F$. Second, classical pairwise genetic distances are calculated on only two sequences at a time, and so do not take into account the rarity of alleles. Our method uses the complete alignment to calculate the matrix $F$, and gives greater weight to rarer alleles.

The choice of rescaling for the standard column coordinates, with respect to the standard row coordinates, depends on the desired properties of the resulting column factor scores. We propose rescaling the standard column coordinates by the squares of the singular values, such that the column factor scores are

$$G = G^* \Sigma^2.$$

This rescaling of the standard column coordinates yields a desirable property for comparing the coordinates of individuals and alleles. The coordinates for any allele can be found at the centroid of the coordinates of the individuals that carry that allele (proof given in Appendix A). A special case of this property is that if an individual uniquely carries an allele, then that allele shares exactly the same coordinates as the individual (proof given in Appendix A).

There is a second, equivalent way to consider the method we have proposed. It is known that the row and column factor scores from $B = X^T X$ will be the same as the factor scores obtained from $X$ (Greenacre 2007). The transformation $f(B)$ (found in a similar way as in Equation 1, but with appropriate dimensions for $r$, $c$ and a recalculated normalising constant) would yield a $Q \times Q$ matrix of the form $R = [\rho_{ij}]$. $R$ is a matrix of the correlations for detecting linkage disequilibrium with multiple alleles, where if $\rho_{ij} \neq 0$, then the loci associated with alleles $i$ and $j$ are in linkage disequilibrium (Zaykin et al. 2008). While mtDNA does not undergo recombination, our method also attempts to identify groups of alleles that occur together more than expected just by random chance, and hence the individuals that carry these alleles.

As with PCA, we can use the principal coordinates to visualize the relationships between individuals. However, our method also allows us to visualize the relationships between SNPs, and between individuals and SNPs. We can also look at the pairwise distances between individuals, and SNPs, in genetic space.

Figure 2 A biplot of the first two principal dimensions for the sequences as shown in Figure 1. Individuals are in black, SNPs are in red and projected coordinates for supplementary variable Size (Big and Small) and Location (Location 1 and Location 2) are shown in purple. The new sequence ‘CTC’ is projected onto the dimensions and given in green. Euclidean distances between individuals are given in blue.

Figure 2 shows the relationship between the sequences as shown in Figure 1. Since $a_3$ and $a_4$ have identical sequences, they have the same coordinates in gene space. As $a_1$ shares no similarity with $a_3$ or $a_4$, they are the furthest apart. However, $a_2$ shares one SNP with $a_3$ and $a_4$, and two SNPs with a single individual $a_1$, and hence is more closely ‘attracted’ to $a_1$. Due to this ‘attraction’ to individuals with similar SNP profiles, the term ‘inertia’ is used in the place of ‘variance’.

Note the relationship between individual coordinates and SNP coordinates. Since $a_3$ and $a_4$ are the only individuals with ‘SNP1_C’ and ‘SNP3_G’, they share the same coordinates (the same can be said of $a_1$ and ‘SNP2_G’). ‘SNP1_A’ is shared by $a_1$ and $a_2$, and so falls exactly at the mid-point of the two points. However, ‘SNP2_T’ is shared by $a_2$ and by both $a_3$ and $a_4$, and so lies only one-third of the way along the line connecting $a_3$ and $a_4$ to $a_2$.

Note that in Figure 1, if an individual has an ‘A’ at the first site, then they always have a ‘C’ in the third position. Similarly, if an individual has a ‘C’ at the first site, then they always have a ‘G’ in the third position. Hence the SNPs at the first and third sites provide no new information about the nature of the relationships between individuals since one can infer the third SNP, given the nature of the first SNP. For this reason, the first two principal dimensions capture 100% of the inertia, and reducing the dimensionality of the transformed genetic space results in no loss of information about the structure of the relationships between individuals.

It is possible to project new sequences onto the genetic space defined by an MCA. The new sequence must have one of the allelic forms for every SNP from the original alignment. For example, consider an alignment of new sequences of dimension $m \times p$ denoted $H$, with corresponding $m \times Q$ indicator matrix $i_H$ (see Figure 3).
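The transformation described in this section can be tried on the four-sequence alignment of Figure 1. The following is a minimal Python sketch, not the authors' implementation. It assumes NumPy, reads Equation 1 as $D_r^{-1/2}(\frac{1}{N}X - rc^T)D_c^{-1/2}$, and uses hypothetical helper names (`indicator_matrix`, `mca`) and an arbitrary tolerance of `1e-10` for deciding which singular values count as non-zero.

```python
import numpy as np

# Toy alignment from Figure 1 (rows a1..a4, columns SNP1..SNP3).
alignment = ["AGC", "ATC", "CTG", "CTG"]


def indicator_matrix(seqs):
    """Return the n x Q indicator matrix X and its column names."""
    names, columns = [], []
    for j in range(len(seqs[0])):
        site = [s[j] for s in seqs]
        for allele in sorted(set(site)):  # lexicographic within each SNP
            names.append(f"SNP{j + 1}_{allele}")
            columns.append([1.0 if a == allele else 0.0 for a in site])
    return np.array(columns).T, names


def mca(X, tol=1e-10):
    """Return row factor scores F, column factor scores G, singular values."""
    N = X.sum()
    r = X.sum(axis=1) / N  # row masses
    c = X.sum(axis=0) / N  # column masses
    # Standardised residuals; assumes the (1/N) X - r c^T reading of Equation 1.
    S = (X / N - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    keep = sigma > tol  # tol is an assumption: drop numerically zero dimensions
    U, sigma, V = U[:, keep], sigma[keep], Vt[keep].T
    F = (U / np.sqrt(r)[:, None]) * sigma        # F = D_r^{-1/2} U Sigma
    G = (V / np.sqrt(c)[:, None]) * sigma ** 2   # G = D_c^{-1/2} V Sigma^2
    return F, G, sigma


X, names = indicator_matrix(alignment)
B = X.T @ X  # Burt table
F, G, sigma = mca(X)

# Each allele should sit at the centroid of the individuals that carry it.
for k, name in enumerate(names):
    carriers = X[:, k] == 1
    print(name, np.allclose(G[k], F[carriers].mean(axis=0)))
```

The signs of the coordinates depend on the SVD routine, so only relative positions (for example, that the rows for a3 and a4 coincide) should be compared with Figure 2.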


H (new raw sequence alignment):
      SNP1  SNP2  SNP3
      C     T     C

i_H (indicator matrix):
      SNP1A  SNP1C  SNP2G  SNP2T  SNP3C  SNP3G
      0      1      0      1      1      0

Figure 3 A transformation from a new raw sequence alignment $H$, to an indicator matrix $i_H$ to be projected onto existing MCA dimensions.

For the matrix $s_H = \mathrm{diag}(i_H 1_Q)$, coordinates for the new sequences can be found (Abdi and Valentin 2007) via

$$G_H^T = \left[ s_H^{-1} i_H \right] D_c^{-1/2} V.$$

In Figure 2 we project the new sequence given in Figure 3 onto the first two principal dimensions as given by the analysis of the alignment $A$ from Figure 1. Note that this sequence is an equal ‘mix’ of the sequences $a_2$ and $a_3$, and so falls halfway along the line connecting the two sequences. While this makes mathematical and intuitive sense, this mixing of two sequences makes no sense for non-recombining DNA.

While this method could be used for pseudo-haploid DNA where this type of interpretation does make sense, there is value in projecting new sequences onto the principal dimensions for non-recombining DNA. For example, when data contains many SNPs, principal dimensions may represent haplogroups with a collection of diagnostic SNPs. Projecting ancient samples, for example, would include individuals ancestral to individuals from the alignment that have not acquired more recent SNPs. These projected points might fall along the line connecting the origin to the group, with individuals that carry fewer diagnostic SNPs closer to the origin.

Finally, we may project the ‘average’ coordinates of some qualitative supplementary variable. Imagine we have $r$ such variables, with a total of $R$ levels. Let $W$ be the $n \times r$ matrix of supplementary information, with corresponding $n \times R$ indicator matrix $j_W$ (see Figure 4).

W (supplementary qualitative information):
      Size   Location
a1    Big    1
a2    Big    2
a3    Small  2
a4    Small  2

j_W (indicator matrix):
      Size_Big  Size_Small  Location_1  Location_2
a1    1         0           1           0
a2    1         0           0           1
a3    0         1           0           1
a4    0         1           0           1

Figure 4 A transformation from a matrix of supplementary qualitative information $W$, to an indicator matrix $j_W$ to be projected onto existing MCA dimensions.

Following a similar method for projecting new sequences, for the matrix $s_W = \mathrm{diag}(1_n^T j_W)$, whose diagonal holds the number of individuals at each level, coordinates for the average qualitative supplementary variables can be found via

$$F_W^T = \left[ s_W^{-1} j_W^T \right] D_r^{-1/2} U \Sigma.$$

The projected coordinates for the supplementary variables are an estimate of the average coordinates for individuals with the given levels of the qualitative supplementary variables. For example, in Figure 4, if you were to imagine that we could sample all individuals from the true population with level ‘Big’ for variable ‘Size’, then we believe that the centre of the cluster of points would fall at approximately (0.875, −0.1083). Note that the calculated coordinates are based on inertia, and are called ‘barycentres’ rather than centroids.

In Figure 2 we see that only sequences $a_3$ and $a_4$ have Size ‘Small’, and since they share coordinates, the barycentre for ‘Small’ also shares this coordinate. Sequences $a_1$ and $a_2$ both have Size ‘Big’, and so the barycentre for ‘Big’ falls halfway along the line connecting their coordinates. Similarly, $a_1$ is the only individual found at ‘Location 1’, and so the barycentre for ‘Location 1’ shares the coordinates of $a_1$. However, $a_2$, $a_3$ and $a_4$ were all found at ‘Location 2’, and so the barycentre for ‘Location 2’ can be found two thirds of the way along the line connecting $a_2$ to $a_3$ and $a_4$. Notice also that ‘Location 2’ is found exclusively when there is a T at SNP2, and so they also share coordinates.
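Both projections above can be checked on the toy data of Figures 1, 3 and 4. The sketch below is a hedged illustration, not the authors' code. It assumes NumPy, repeats a compact MCA of the Figure 1 indicator matrix under the assumption that Equation 1 uses $\frac{1}{N}X - rc^T$, and introduces hypothetical helper names (`project_sequences`, `barycentres`). Supplementary barycentres are computed as the mean row factor score of the individuals at each level, which is what the worked example in the text describes.

```python
import numpy as np

# Indicator matrix X from Figure 1 (rows a1..a4; columns SNP1_A, SNP1_C,
# SNP2_G, SNP2_T, SNP3_C, SNP3_G).
X = np.array([[1, 0, 1, 0, 1, 0],
              [1, 0, 0, 1, 1, 0],
              [0, 1, 0, 1, 0, 1],
              [0, 1, 0, 1, 0, 1]], dtype=float)
# Indicator row i_H for the new sequence 'CTC' (Figure 3).
i_H = np.array([[0, 1, 0, 1, 1, 0]], dtype=float)
# Indicator matrix j_W from Figure 4: Size_Big, Size_Small, Location_1, Location_2.
j_W = np.array([[1, 0, 1, 0],
                [1, 0, 0, 1],
                [0, 1, 0, 1],
                [0, 1, 0, 1]], dtype=float)

# Compact MCA of X, keeping only dimensions with non-zero singular values
# (the 1e-10 cut-off is an assumption).
N = X.sum()
r, c = X.sum(axis=1) / N, X.sum(axis=0) / N
S = (X / N - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
keep = sigma > 1e-10
U, sigma, V = U[:, keep], sigma[keep], Vt[keep].T
F = (U / np.sqrt(r)[:, None]) * sigma  # row factor scores of a1..a4


def project_sequences(i_new):
    """Rows of s_H^{-1} i_H D_c^{-1/2} V: one projected point per new sequence."""
    profiles = i_new / i_new.sum(axis=1, keepdims=True)
    return profiles @ (V / np.sqrt(c)[:, None])


def barycentres(j_sup):
    """Rows of s_W^{-1} j_W^T F: the mean row factor score for each level."""
    counts = j_sup.sum(axis=0)
    return (j_sup.T @ F) / counts[:, None]


new_point = project_sequences(i_H)  # projected 'CTC'
level_points = barycentres(j_W)     # Big, Small, Location 1, Location 2

print(np.allclose(new_point[0], (F[1] + F[2]) / 2))     # halfway between a2 and a3
print(np.allclose(level_points[0], (F[0] + F[1]) / 2))  # 'Big' halfway between a1 and a2
```

Truncating to the non-zero dimensions matters here: the singular vectors returned for zero singular values are arbitrary, and projecting onto them would move new points off the plane that contains the original individuals.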


Once a coordinate representation of the re- be thought of as the proportion of variability lationship between individuals has been con- explained by a qualitative variable. For a one- structed, we can examine relationships between dimensional response variable Y, η2 is equiva- individuals in this ‘genetic space’, and compare lent to R2, the coefficient of determination for a them to characteristics that have been recorded linear model with the qualitative variable as the for individuals. These ‘supplementary variables’ sole predictor variable. In the case of multiple are anything recorded about sampled individu- dimensional response variables, an analogous als that were not used in the alignment table (i.e. η2 may be calculated (Breiman and Friedman any non-SNP data). Of particular interest are de- 1997). mographic variables, such as country of origin, A permutation test can be used to find if η2 spatial coordinates on a landscape or morpho- is significantly greater than for a random re- logical characters, for example. labelling of the population. For each of the T Here we give three examples that illustrate permutations of the group labellings, we calcu- 2 the ability of the method to produce biologi- late ηt , the correlation ratio calculated for the cally meaningful and intuitive results in a rig- tth permutation. An empirical p-value of the orous statistical framework, using previously form (r + 1)/(T + 1) is calculated, where r is the published empirical data sets. total number of permuted samples yielding a greater correlation ratio than the observed sam- Correlation tests for continuous supplemen- ple (Davison and Hinkley 1997). tary variables Identifying relationships between coordinates Data Availability in genetic space and continuous supplementary Sequence alignments in fasta format are avail- variables is intuitively simple. One could sim- able upon request. 
Files S1, S2 and S3 con- ply calculate the Pearson correlation coefficient tains all of the available supplementary infor- for each continuous supplementary variable, fol- mation available for the Human mitochondrial lowed by an exact test for a significantly non- sequence alignment, Thylacine sequence align- zero coefficient. ment and Ghost Bat sequence alignment, respec- It should be noted for a principal compo- tively nents analysis of spatially structured sequence data that has undergone recombination, that Results the top two principal components are expected to be highly correlated with perpendicular geo- Coordinates in Gene Space and Dissimilarity graphic axes (Novembre and Stephens 2008). In Matrices for Haplotype Identification the case of mtDNA, or any other recombination- The L-haplogroups represent the earliest evolu- free sequence data, this assumption cannot be tion in modern human history, with the most made. More extreme axis values can be inter- recent common ancestor (MRCA) of the L- preted as the accumulation of more and more haplogroups being the MRCA of all humans. of a unique set of SNPs that characterize some Hence, our method should be able to recover partition of the most-related tips of a tree. structure in the form of clusters of the major haplogroups L0, L1, L2, L4 and L5. To test this, Correlation tests for categorical supplemen- we analysed a custom alignment from several tary variables published studies involving African sampled Supplementary categorical variables which ex- mtDNA (Torroni et al. 2006; Behar et al. 2008; plain significant proportions of the structure of Costa et al. 2009; Batini et al. 2011; Barbieri et al. individuals in gene space can be identified, and 2013). We randomly chose sequences from these their effect quantified, using the correlation ra- studies from sub-haplogroups L0d, L0k, L1c, tio η2 (Brown 2008). The correlation ratio η2 can L2a, L4 and L5a. We aimed to include 20 sam-

6 Rohrlach et al. bioRxiv preprint doi: https://doi.org/10.1101/338442; this version posted June 4, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. ples per haplogroup, although we included only 9 from L5a, as this was all that was available at the time of writing, and 10 from L4 to delib- erately introduce further sampling asymmetry, resulting in an alignment of 79 individuals (see Table S1 for the file list of Genbank accession numbers and haplotype assignments). We aligned our sequences to the revised Cam- bridge Reference Sequence (Andrews et al. 1999) using MAFFT v7.310 (Katoh and Standley 2013). Haplogroups were determined using Haplogrep v2.1.0 (Weissensteiner et al. 2016). Aligned se- quences were filtered to remove any homoge- neous sites. MCA was performed on the remain- ing 281 SNPs. The first two principal dimen- sions captured 50.93% of the total inertia. That is, 50.93% of the variability in the 16,569 dimen- Figure 5 Scatter plots of the first six principal sional space (the number of base pairs in the dimensions and phylogenetic reconstruction sequences) can be observed in the first two prin- for the L-haplogroup alignment. cipal dimensions. We reconstructed a phylogenetic tree to com- pare the topology with our results. A Tamura- L0k sub-haplogroup. Nei model, with invariant sites and a gamma L1 then separates (top left quadrant) from L2, distribution with five classes was selected as L4 and L5 (bottom left quadrant), which is the the best model of sequence evolution using next major split in the human mtDNA tree. L5 ModelGenerator v0.85 (Keane et al. 2004). We is also separated from L2 and L4, and this is the used Beast v1.8.3 (Bouckaert et al. 2014) to con- next major split. 
Finally, although it is not as struct the phylogenetic tree using an MCMC pronounced as the previous separations, L2 and chain of length 5 × 109, logging parameters L4 separate, and this is the final major split. every 10,000 states. The first 5 × 108 states The third dimension (Figure 5, panel B) shows were discarded as burn-in, and the remaining a clear distinction between L5 (positive coordi- trees were used to find a consensus tree using nates) and L2 and L4 (negative coordinates). The treeannotator v1.8.4 (Bouckaert et al. 2014). fourth dimension (Figure 5, panel B) separates Convergence was assessed through trace plots L0d (positive) coordinates from L0k (negative of posterior distributions. The branches of the coordinates). Dimension 5 (Figure 5, panel C) consensus tree are in evolutionary time (relative finds a separation between L2 (negative coor- mutation rate µ = 1) as we are only interested dinates) and L4 (positive coordinates). Finally in the topology of the tree as a means of com- dimension 6 (Figure 5, panel C) separates L1c1 parison with the results of the MCA (see Figure from the remaining L1c individuals. The remain- 5). ing dimensions further identify splits in the tree, In the first two principal dimensions (Figure though this is not included in Figure 5. For this 5, panel A), L0 (bottom right quadrant) is visibly reason, when performing clustering we include separated from the remaining haplogroups, and all principal dimensions. this makes sense, with L0 being the most diver- We performed hierarchical agglomerative gent human mtDNA haplogroup. The deep split clustering on the coordinates from the MCA us- within L0, between L0d and L0k can be observed ing the R-package cluster v2.0.6 (Kaufman here, with the 10 furthest points representing the and Rousseeuw 2009). The choice of termination

Dimension Reduction and Visualization for Single-copy Alignments via Generalized PCA. 7 bioRxiv preprint doi: https://doi.org/10.1101/338442; this version posted June 4, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. point for identifying clusters is arbitrary, and in decline, was extinct on the mainland and was our case we cease identifying clusters when a only found in Tasmania. cluster of size one is suggested. From museum samples we use sequence data The clustering algorithm respected the config- from three samples from south-west Western uration of the points in the first two principal di- Australia, three samples from the Nullarbor mensions. The first cluster identified was L0, fol- Plain in Western Australia, six samples from Tas- lowed by L1 and L5. L0d and L0k are separated mania and one sample from New South Wales into two clusters, followed by the split between (see Figure 6)(White et al. 2017). Samples were L2 and L4, which reflects the greater divergence removed if the longitude or latitude were un- time for the respective haplogroups (Gonder known, or if the sampling age was unknown. To et al. 2007). The fifth cluster separation of L2 avoid artificial inflation of signal from geograph- and L4 represents the final major haplogroup ical coordinates, for sequences found in the same according to the current nomenclature. location, a single representative was randomly The remaining clusters all respect the sub- selected. In total, 13 individuals were analysed haplogroup structure of the mtDNA tree, iden- (see Table S2 for supplementary variables and tifying sub-haplogroups for each of L0, L1, L2, Genbank accession numbers). L4 and L5. The clusters identified here are spe- Sequences were aligned using MAFFT cific to this dataset, i.e. 
it may be the case that v7.310 (Katoh and Standley 2013). The align- if more than one sequence from the haplogroup ment was filtered to remove homogeneous and L1c2b2 were included, then we may have identi- missing sites, and a total of 113 SNPs were in- fied L1c2b2 as a cluster. cluded in the MCA. It is worth noting that our clustering sug- From the MCA row factor scores the first prin- gests that the current nomenclature for human cipal dimension, which captured 62.62% of the mtDNA may not reflect statistically significant total inertia, correlates strongly with longitude groups, but rather just the sequence of histor- (r = 0.9467235, p = 9.517 × 10−7), suggesting a ically discovered diagnostic SNPs in densely possible migration gradient (Menozzi et al. 1978). sampled haplogroups. For example, the split Gradients are not expected to be strictly linear between L0d and L0k appears more significant for principal component maps, and the same than the split between L2 and L4 in both the can be assumed for MCA maps (Novembre and MCA and the phylogenetic tree. However, the Stephens 2008). sample sizes here are not large enough to re- To investigate the relationship between ge- fute the nomenclature, although the method pro- ography and the MCA coordinates, a multi- vides a clear way forward to revise this. response linear model was used. Multi-response Overall, the method has clearly shown that linear models are similar to standard linear mod- we can identify a tree like structure in the data, els, but allow for more than one response vari- and that just the first two principal dimensions able to be collectively modelled by the same set were able to visualise the haplotype structure in of explanatory variables (Berridge and Crouch- the data. ley 2011). A multi-response model was fitted to the data to predict latitude and longitude us- Application of method for continuous supple- ing principal dimension 1 (PD1). 
Polynomial mentary variables models of varying degrees were fit and the best The thylacine (Thylacinus cynocephalus) is an Aus- model was quadratic (using AIC), with R2 val- tralian marsupial carnivore most famous for its ues of 0.9334 and 0.9075 between longitude and recent extinction due to human hunting (Har- latitude respectively. ris 1808; White et al. 2017). By the time of the A ‘predicted’ migration route can be projected arrival of Europeans to Australia, the thylacine onto the geographical map suggesting a coastal had already undergone a significant population route was taken along the south of Australia (see


Figure 6 Sample location and sample IDs for thylacine mtDNA. The black line shows the predicted geographic locations given the observed range of principal dimension 1 coordinates.

However, the extremely small sample size means the results are limited, as the MRCA of the sample is not necessarily closely related to the MRCA of the population, and there is little reason to believe that ancestral thylacine populations remained in the areas they originally inhabited.

Application of method for categorical supplementary variables

The ghost bat (Macroderma gigas) is a native Australian bat endemic to the Northern Pilbara and Kimberley in Western Australia, and to some regions of the Northern Territory and Queensland (Armstrong and Anstee 2000). The ghost bat is currently listed as vulnerable by the International Union for Conservation of Nature.

Figure 7 Sample location and sizes for ghost bat mtDNA.

Ghost bats are found in discrete populations within cave-based colonies. Blast mining disrupts, and in some cases destroys, cave complexes, displacing resident bat populations. Conservationists wish to understand the phylogeographic distribution of ghost bats in order to determine whether the destruction of a single cave colony significantly reduces genetic diversity. Gene flow between colonies would indicate a reduced impact on ghost bat diversity, whereas a highly structured population would indicate a need to reduce the effects of blast mining and colony disruption.

We focus on samples collected from four colonies on the northern side of the Hamersley Range (Bamboo, Callawa, Lalla Rookh, Klondyke), two on the southern side (Rhodes, Silvergrass), and one colony in the Kimberley (Tunnel Creek) (see Figure 7). The Hamersley Range contains the twenty highest peaks in Western Australia, and so forms a significant geographical boundary for bats to cross. All colonies are represented by one sampled cave, with the exception of the Rhodes colony, which is represented by four closely sampled caves (see Table S3 for supplementary variables and ID numbers).

We filtered an alignment of 257 bp of the mtDNA HVR, from 137 individuals, to remove homogeneous sites, and MCA was performed on the remaining 25 SNPs. For each individual, the colony and population (North, South, Kimberley) were recorded and treated as categorical supplementary variables. Longitude and latitude were also recorded and kept as quantitative supplementary variables.

In Figure 8 we present the squared-correlation plot for all supplementary variables and SNPs. The x and y coordinates of points in this plot give squared correlation values between the first two principal dimensions and each of the supplementary variables and SNPs. The further to the right of the plot a variable or SNP name is, the more highly correlated it is with Dimension 1. Similarly, the further to the top of the plot a variable or SNP is, the more highly correlated it is with Dimension 2. Immediately we see that latitude and longitude are not as strongly correlated with the first two principal dimensions as population and colony.

Figure 8 The correlation plot for all variables and SNPs for the ghost bat alignment.

Calculating the $\eta^2$ values for population structure and colony structure yields $\eta^2_{\mathrm{pop}} = 0.8332$ ($p < 9.999 \times 10^{-6}$) and $\eta^2_{\mathrm{col}} = 0.8888$ ($p < 9.999 \times 10^{-6}$), respectively.

Clearly then, population explains a large proportion of the variability of the points in genetic space; however, colony explains a greater proportion of the total variance, and thus has a larger $\eta^2$ value. Clearly colony explains a significant proportion of the structure of the individuals in genetic space, since the first two principal dimensions explained a total of 85.03% of the total inertia.

The first principal dimension visualises the split between the Kimberley and Pilbara colonies (except for one Pilbara individual within the Kimberley samples), and the second principal dimension visualises the divide between the North and South colonies within the Pilbara region. In fact, if one places a boundary representing the Hamersley Range on the y-axis at $-0.35$ (the dashed line in Figure 9), only one individual from the Northern Pilbara sample lies below the boundary, and only one Southern Pilbara sample lies above the boundary.

Since $\eta^2_{\mathrm{col}} > \eta^2_{\mathrm{pop}}$, this suggests that colony better explains the structure of the genetic coordinates. This can be seen in Figure 9, where we observe that five of the seven colonies form distinct clusters, with the exception of: a single Klondyke individual in the Tunnel Creek colony, a Callawa individual in the Rhodes colony, and a Silvergrass individual in the Callawa colony. The remaining two colonies, Klondyke and Bamboo, cluster together in the top left of the plot. A further four individuals from Klondyke form a separate cluster, genetically nearer the southern colonies. This may represent a recent migration from the southern colonies, or potentially members of the founding population for the southern colonies.

Figure 9 The scatter plot of the first two principal dimensions for the ghost bat alignment.

Without further information we cannot explain why these two colonies cluster together. It is worth noting that the Bamboo Creek site is a recently abandoned complex of mines (dismantled in 1962), whereas mining operations in the Klondyke mine area have increased drastically since 1955. It is possible that there has been a recent blending of the two colonies as bats have been displaced from places like Klondyke, and found refuge in caves left from abandoned mining efforts, like Bamboo Creek. However, our results still indicate a significant colony-based structure outside of these two colonies. This colony-based structure further strengthens the argument for more protection for roosting colonies from blast mining.
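The $\eta^2$ statistics reported for population and colony can be illustrated with a short sketch. Several things here are assumptions rather than the authors' procedure: $\eta^2$ is taken to be the between-group sum of squares divided by the total sum of squares, pooled over the retained principal dimensions; the p-value is obtained by permuting group labels; and the function names and toy coordinates are hypothetical.

```python
import random


def eta_squared(coords, labels):
    """Between-group sum of squares divided by total sum of squares."""
    dims = range(len(coords[0]))
    n = len(coords)
    grand = [sum(p[d] for p in coords) / n for d in dims]
    total = sum((p[d] - grand[d]) ** 2 for p in coords for d in dims)
    groups = {}
    for point, label in zip(coords, labels):
        groups.setdefault(label, []).append(point)
    between = 0.0
    for members in groups.values():
        centroid = [sum(p[d] for p in members) / len(members) for d in dims]
        between += len(members) * sum((centroid[d] - grand[d]) ** 2 for d in dims)
    return between / total


def permutation_p_value(coords, labels, n_perm=999, seed=0):
    """Share of label shufflings with eta squared at least as large as observed."""
    rng = random.Random(seed)
    observed = eta_squared(coords, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Small tolerance so ties are not lost to floating-point rounding.
        if eta_squared(coords, shuffled) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical coordinates on two principal dimensions for four individuals.
coords = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]
colony = ["A", "A", "B", "B"]
print(eta_squared(coords, colony))
print(permutation_p_value(coords, colony, n_perm=200, seed=1))
```

With a scheme of this kind the smallest attainable p-value is $1/(n_{\mathrm{perm}} + 1)$, so a bound reported as an inequality reflects the number of permutations used rather than an exact tail probability.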


Finally, we give an example of identifying potential diagnostic SNPs using our method. From Figure 8 we also see that we can identify potentially diagnostic SNPs. Given that Dimension 1 is explained by the geographical separation of the Kimberley and Pilbara populations, as all but one of the individuals with a positive first principal dimension coordinate are from the Kimberley, SNPs that are highly correlated with this dimension may be diagnostic for one of the regions. For example, SNP18 can be found only in Kimberley ghost bats (with the exception of the one Klondyke individual). If we had decided to use clustering to identify haplogroups, then SNP18 could be considered a diagnostic SNP, from this limited sample.

Discussion

MCA provides a powerful method for unsupervised exploration of single-copy DNA. Our method is analogous to a PCA of classical allele correlation values for detecting linkage disequilibrium. We have shown that $p$-dimensional single-copy DNA can be transformed into coordinates in genetic space, analogous to the way in which diploid DNA is transformed via PCA in many genetic studies. One of the attractive features of our approach is the parallel with PCA, making the interpretation of results natural for researchers experienced with PCA.

Our method allows for the coordinates of supplementary variables to be calculated and visualized in the same coordinate space as for individuals and SNPs, and for the relationships between the supplementary variables and principal dimensions to be quantified. Like PCA, additional sequences can be projected onto the coordinate space that has been calculated from an alignment of interest.

Dimension reduction can be performed, reducing potentially massive numbers of SNPs into far fewer dimensions with potentially little reduction in information, leading to informative visualization of high-dimensional data. Unlike PCA, our method is able to simultaneously investigate the relationships between individuals and SNPs. This extra information can lead to the detection of diagnostic SNPs, and potentially SNPs that are correlated with supplementary variables of interest such as habitat or phenotypic traits.

Our method was able to detect known haplotype structure, showing that the results of MCA are biologically meaningful. Similarly, the fact that our method can be reformulated as a PCA of the linkage disequilibrium table for multiple loci indicates that applications to recombining DNA may also be useful for detecting population structure.

Using techniques from classical statistics, our method was also able to efficiently visualise the strength of the relationships between supplementary information and empirical sequence data. Finally, using standard polynomial regression techniques, our method was able to identify a possible migration route for geographically distributed sequence data.

Literature Cited

Abdi, H. and D. Valentin, 2007 Multiple Correspondence Analysis. Encyclopedia of Measurement and Statistics. pp. 651–657.
Andrews, R. M., I. Kubacka, P. F. Chinnery, R. N. Lightowlers, D. M. Turnbull, et al., 1999 Reanalysis and revision of the Cambridge reference sequence for human mitochondrial DNA. Nature Genetics 23: 147–147.
Armstrong, K. N. and S. D. Anstee, 2000 The ghost bat in the Pilbara: 100 years on. Australian Mammalogy 22: 93–101.
Barbieri, C., M. Vicente, J. Rocha, S. W. Mpoloka, M. Stoneking, et al., 2013 Ancient substructure in early mtDNA lineages of southern Africa. The American Journal of Human Genetics 92: 285–292.
Batini, C., J. Lopes, D. M. Behar, F. Calafell, L. B. Jorde, et al., 2011 Insights into the demographic history of African Pygmies from complete mitochondrial genomes. Molecular Biology and Evolution 28: 1099–1110.
Behar, D. M., E. Metspalu, T. Kivisild, S. Rosset, S. Tzur, et al., 2008 Counting the founders: the matrilineal genetic ancestry of the Jewish Diaspora. PloS One 3: e2062.


Berridge, D. M. and R. Crouchley, 2011 Multivariate generalized linear mixed models using R. CRC Press.
Bouckaert, R., J. Heled, D. Kühnert, T. Vaughan, C.-H. Wu, et al., 2014 BEAST 2: a software platform for Bayesian evolutionary analysis. PLoS Computational Biology 10: e1003537.
Breiman, L. and J. H. Friedman, 1997 Predicting multivariate responses in multiple linear regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 59: 3–54.
Brown, J. D., 2008 Effect Size and Eta Squared. JALT Testing & Evaluation SIG News.
Costa, M. D., L. Cherni, V. Fernandes, F. Freitas, A. B. A. el Gaaied, et al., 2009 Data from complete mtDNA sequencing of Tunisian centenarians: testing haplogroup association and the “golden mean” to longevity. Mechanisms of Ageing and Development 130: 222–226.
Davison, A. C. and D. V. Hinkley, 1997 Bootstrap Methods and their Application, volume 1. Cambridge University Press.
Gonder, M. K., H. M. Mortensen, F. A. Reed, A. de Sousa, and S. A. Tishkoff, 2007 Whole-mtDNA genome sequence analysis of ancient African lineages. Molecular Biology and Evolution 24: 757–768.
Greenacre, M., 2007 Correspondence Analysis in Practice. CRC Press.
Harris, G. P., 1808 XI. Description of two new Species of Didelphis from Van Diemen’s Land. Transactions of the Linnean Society of London 9: 174–178.
Jolliffe, I., 2002 Principal Component Analysis. Wiley Online Library.
Katoh, K. and D. M. Standley, 2013 MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular Biology and Evolution 30: 772–780.
Kaufman, L. and P. J. Rousseeuw, 2009 Finding Groups in Data: an Introduction to Cluster Analysis, volume 344. John Wiley & Sons.
Keane, T., T. Naughton, and J. McInerney, 2004 Modelgenerator: amino acid and nucleotide substitution model selection. National University of Ireland, Maynooth, Ireland 34.
Menozzi, P., A. Piazza, and L. Cavalli-Sforza, 1978 Synthetic Maps of Human Gene Frequencies in Europeans. Science 201: 786–792.
Novembre, J. and M. Stephens, 2008 Interpreting principal component analyses of spatial population genetic variation. Nature Genetics 40: 646–649.
Patterson, N., A. Price, and D. Reich, 2006 Population Structure and Eigenanalysis. PLoS Genetics 2: e190.
Pearson, K., 1901 LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2: 559–572.
Posth, C., G. Renaud, A. Mittnik, D. G. Drucker, H. Rougier, et al., 2016 Pleistocene mitochondrial genomes suggest a single major dispersal of non-Africans and a Late Glacial population turnover in Europe. Current Biology 26: 827–833.
Tobler, R., A. Rohrlach, J. Soubrier, P. Bover, B. Llamas, et al., 2017 Aboriginal mitogenomes reveal 50,000 years of regionalism in Australia. Nature 544: 180.
Torroni, A., A. Achilli, V. Macaulay, M. Richards, and H.-J. Bandelt, 2006 Harvesting the fruit of the human mtDNA tree. TRENDS in Genetics 22: 339–345.
Van Gremberghe, I., F. Leliaert, J. Mergeay, P. Vanormelingen, K. Van der Gucht, et al., 2011 Lack of phylogeographic structure in the freshwater cyanobacterium Microcystis aeruginosa suggests global dispersal. PloS One 6: e19561.
Weissensteiner, H., D. Pacher, A. Kloss-Brandstätter, L. Forer, G. Specht, et al., 2016 HaploGrep 2: mitochondrial haplogroup classification in the era of high-throughput sequencing. Nucleic Acids Research pp. W58–W63.
White, L. C., K. J. Mitchell, and J. J. Austin, 2017 Ancient Mitochondrial Genomes Reveal the Demographic History and Phylogeography of the Extinct, Enigmatic Thylacine (Thylacinus Cynocephalus). Journal of Biogeography.
Zaykin, D. V., A. Pudovkin, and B. S. Weir, 2008 Correlation-based inference for linkage disequilibrium with multiple alleles. Genetics 180: 533–545.

Appendix A

We aim to show that our choice of scaling for the row and column factor scores yields the property that if an individual uniquely carries an allele, then the individual and the allele share the same coordinates. To do this we investigate properties of the indicator matrix $X$, where the columns have been permuted to make the first column the identifying allele and the rows permuted to make the first row the identified individual (without loss of generality). To avoid carrying constants, we assume that $X$ has already been normalized to have grand sum one.

Result 1: If there is an allele that uniquely identifies an individual, then the individual and the allele have the same coordinates if the standard row factor scores are scaled by the singular values, and the standard column factor scores are scaled by the squared singular values.

Proof: Let $X$ be an $n \times Q$ matrix such that

$$
X = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1Q} \\
0 & x_{22} & \cdots & x_{2Q} \\
\vdots & \vdots & \ddots & \vdots \\
0 & x_{n2} & \cdots & x_{nQ}
\end{pmatrix}, \tag{3}
$$

such that $\sum_{i=1}^{n} \sum_{j=1}^{Q} x_{ij} = 1$ and $x_{ij} \geq 0 \ \forall i, j$.

Let $\mathbf{1}_k$ be a $k \times 1$ vector of ones, and let

$$r = X \mathbf{1}_Q = (r_1, \cdots, r_n)^T$$

and

$$c = X^T \mathbf{1}_n = (c_1, \cdots, c_Q)^T$$

be the strictly positive row and column sums of $X$ respectively, and let

$$D_r = \operatorname{diag}(r) \quad \text{and} \quad D_c = \operatorname{diag}(c).$$

We begin by showing that the SVD of the matrix

$$A = D_r^{-1/2} X D_c^{-1/2}, \tag{4}$$

has a particular structure. We then exploit this to show the required result for the matrix

$$C = D_r^{-1/2} \left( X - r c^T \right) D_c^{-1/2}, \tag{5}$$

namely that for our choice of row and column standard coordinate scaling, the identifying allele and the identified individual share the same scaled factor scores.

Let $A$ have a compact singular value decomposition (SVD) of the form

$$A = U_A \Sigma_A V_A^T. \tag{6}$$

Consider the matrix product

$$M = D_c^{-1/2} A^T U_A. \tag{7}$$

Since $U_A \Sigma_A V_A^T$ is an SVD, $U_A$ is a unitary matrix, and so $U_A^T U_A = I_n$ (where $I_k$ is the $k \times k$ identity matrix). Substituting Equation (6) into Equation (7) gives

$$
\begin{aligned}
M &= D_c^{-1/2} A^T U_A \\
&= D_c^{-1/2} \left( U_A \Sigma_A V_A^T \right)^T U_A \\
&= D_c^{-1/2} V_A \Sigma_A^T U_A^T U_A \\
&= D_c^{-1/2} V_A \Sigma_A.
\end{aligned}
$$

Hence, the first row of $M$ is $c_1^{-1/2} v_1 \Sigma_A$, where $v_1$ is the first row of $V_A$.

However, instead substituting Equation (4) into Equation (7) yields

$$
\begin{aligned}
M &= D_c^{-1/2} A^T U_A \\
&= D_c^{-1/2} \left( D_r^{-1/2} X D_c^{-1/2} \right)^T U_A \\
&= D_c^{-1} X^T D_r^{-1/2} U_A.
\end{aligned}
$$

Note that since $x_{21} = \cdots = x_{n1} = 0$, then $c_1 = x_{11}$, and hence the first row of $D_c^{-1} X^T D_r^{-1/2}$ will equal

$$c_1^{-1} r_1^{-1/2} \left( x_{11}, 0, \cdots, 0 \right) = \left( r_1^{-1/2}, 0, \cdots, 0 \right).$$

So the first row of $M$ is also $r_1^{-1/2} u_1$, where $u_1$ is the first row of $U_A$.


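This shows that

$$r_1^{-1/2} u_1 = c_1^{-1/2} v_1 \Sigma_A, \tag{8}$$

which are the first rows of the row and column scores of the SVD of $D_r^{-1/2} X D_c^{-1/2}$, where $X$ is of the form given in Equation (3).

To extend this result to the SVD of the matrix $C$ in Equation (5), first note that

$$
\begin{aligned}
A^T D_r^{-1/2} r &= D_c^{-1/2} X^T D_r^{-1} r \\
&= D_c^{-1/2} X^T \mathbf{1}_n \\
&= D_c^{-1/2} c
\end{aligned}
$$

and

$$
\begin{aligned}
A D_c^{-1/2} c &= D_r^{-1/2} X D_c^{-1} c \\
&= D_r^{-1/2} X \mathbf{1}_Q \\
&= D_r^{-1/2} r.
\end{aligned}
$$

This shows that the SVD of $A$ has a singular value of 1, with left and right singular vectors $D_r^{-1/2} r$ and $D_c^{-1/2} c$, respectively.

We now show that an SVD of $C$, denoted $C = U_C \Sigma_C V_C^T$, can be augmented by these singular vectors and the singular value 1 to construct an SVD for $A$.

Consider new matrices

$$
U_* = \begin{bmatrix} U_C & D_r^{-1/2} r \end{bmatrix}, \quad
V_* = \begin{bmatrix} V_C & D_c^{-1/2} c \end{bmatrix},
$$

and

$$
\Sigma_* = \begin{bmatrix} \Sigma_C & \mathbf{0}_n \\ \mathbf{0}_Q^T & 1 \end{bmatrix},
$$

where $\mathbf{0}_k$ is a $k \times 1$ vector of zeros. Next we show that $U_*$ and $V_*$ are unitary matrices and that $\Sigma_*$ is a rectangular diagonal matrix with non-negative real numbers on the diagonal.

$$
\begin{aligned}
r^T D_r^{-1/2} U_C \Sigma_C V_C^T &= r^T D_r^{-1/2} D_r^{-1/2} \left( X - r c^T \right) D_c^{-1/2} \\
&= r^T D_r^{-1} \left( X - r c^T \right) D_c^{-1/2} \\
&= \mathbf{1}_n^T \left( X - r c^T \right) D_c^{-1/2} \\
&= \left( \mathbf{1}_n^T X - \mathbf{1}_n^T r c^T \right) D_c^{-1/2} \\
&= \left( c^T - c^T \right) D_c^{-1/2} \\
&= \mathbf{0}_Q^T D_c^{-1/2} \\
&= \mathbf{0}_Q^T.
\end{aligned}
$$

Since $\Sigma_C$ is a diagonal matrix with positive diagonal entries, we know that $\Sigma_C^{-1}$ exists. Further, $V_C^{-1} = V_C^T$, so this implies that

$$
r^T D_r^{-1/2} U_C \Sigma_C V_C^T = \mathbf{0}_Q^T
\implies r^T D_r^{-1/2} U_C = \mathbf{0}_Q^T.
$$

Similarly, it can be shown that

$$c^T D_c^{-1/2} V_C = \mathbf{0}_n^T.$$

It follows then that

$$
U_*^T U_* =
\begin{bmatrix}
U_C^T U_C & U_C^T D_r^{-1/2} r \\
r^T D_r^{-1/2} U_C & r^T D_r^{-1} r
\end{bmatrix}
=
\begin{bmatrix}
I_n & \mathbf{0}_Q \\
\mathbf{0}_Q^T & 1
\end{bmatrix}
= I_{n+1},
$$

and that

$$
V_*^T V_* =
\begin{bmatrix}
V_C^T V_C & V_C^T D_c^{-1/2} c \\
c^T D_c^{-1/2} V_C & c^T D_c^{-1} c
\end{bmatrix}
=
\begin{bmatrix}
I_Q & \mathbf{0}_n \\
\mathbf{0}_n^T & 1
\end{bmatrix}
= I_{Q+1}.
$$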


Note that

$$
\begin{aligned}
U_* \Sigma_* V_*^T &= U_C \Sigma_C V_C^T + D_r^{-1/2} r c^T D_c^{-1/2} \\
&= D_r^{-1/2} \left( X - r c^T \right) D_c^{-1/2} + D_r^{-1/2} r c^T D_c^{-1/2} \\
&= D_r^{-1/2} X D_c^{-1/2} \\
&= A.
\end{aligned}
$$

Therefore, since $\Sigma_*$ is a rectangular diagonal matrix with positive diagonal entries, and $U_*$ and $V_*$ are unitary matrices, it must be that $U_* \Sigma_* V_*^T$ is an SVD of $A$.

Thus we have two representations for the compact SVD of $A$. Hence they are equivalent. However, from Equation (8), we know that the first row and column factor scores for the SVD of $A$ are equal, and are given by

$$r_1^{-1/2} u_1^* = c_1^{-1/2} v_1^* \Sigma_*.$$

Hence any sub-vectors that are constructed by removing corresponding elements from the vectors $r_1^{-1/2} u_1^*$ and $c_1^{-1/2} v_1^* \Sigma_*$ will also be equal, specifically

$$r_1^{-1/2} u_1^C = c_1^{-1/2} v_1^C \Sigma_C,$$

where $u_1^C$ is the first row of $U_C$, and $v_1^C$ is the first row of $V_C$. $\square$

Result 2: If a single allele identifies a group of $m$ individuals, then the column factor score for the allele is the centroid of the row factor scores of the identified individuals, if the standard row factor scores are scaled by the singular values and the standard column factor scores are scaled by the squared singular values.

Proof: If

$$
X = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1Q} \\
\vdots & \vdots & \ddots & \vdots \\
x_{m1} & x_{m2} & \cdots & x_{mQ} \\
0 & x_{(m+1)2} & \cdots & x_{(m+1)Q} \\
\vdots & \vdots & \ddots & \vdots \\
0 & x_{n2} & \cdots & x_{nQ}
\end{pmatrix},
$$

where $x_{11} = \cdots = x_{m1} > 0$, and $x_{(m+1)1} = \cdots = x_{n1} = 0$, and $\sum_{i=1}^{n} \sum_{j=1}^{Q} x_{ij} = 1$ and $x_{ij} \geq 0 \ \forall i, j$. As previously, the first row of $M = D_c^{-1/2} A^T U_A$ is $c_1^{-1/2} v_1 \Sigma_A$. However, the sum of the first column of $X$ is $m x_{11}$, and hence the first row of $D_c^{-1} X^T D_r^{-1/2}$ is now

$$
\frac{1}{m x_{11}} \left( x_{11} r_1^{-1/2}, \cdots, x_{m1} r_m^{-1/2}, 0, \cdots, 0 \right)
= \frac{1}{m} \left( r_1^{-1/2}, \cdots, r_m^{-1/2}, 0, \cdots, 0 \right),
$$

and it follows that the first row of $M$ is also

$$\frac{1}{m} \sum_{i=1}^{m} r_i^{-1/2} u_i,$$

where $u_i$ is the $i$th row of $U_A$, yielding that

$$c_1^{-1/2} v_1 \Sigma_A = \frac{1}{m} \sum_{i=1}^{m} r_i^{-1/2} u_i.$$

Following the same argument as before, it is also true that this is the case for the SVD of $C = D_r^{-1/2} \left( X - r c^T \right) D_c^{-1/2}$. Hence the identifying allele has column factor score equal to the centroid of the row factor scores for the identified individuals. $\square$
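As a concrete check of Result 1, the following worked example may help. The $2 \times 2$ matrix $X$ is an illustrative assumption chosen so that the arithmetic can be done by hand; it is not taken from any of the alignments analysed in the paper. Take

$$
X = \frac{1}{4} \begin{pmatrix} 1 & 1 \\ 0 & 2 \end{pmatrix}, \qquad
r = \left( \tfrac{1}{2}, \tfrac{1}{2} \right)^T, \qquad
c = \left( \tfrac{1}{4}, \tfrac{3}{4} \right)^T,
$$

so that the first allele is carried only by the first individual. Then

$$
X - r c^T = \frac{1}{8} \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}, \qquad
C = D_r^{-1/2} \left( X - r c^T \right) D_c^{-1/2}
= \begin{pmatrix} \tfrac{\sqrt{2}}{4} & -\tfrac{\sqrt{2}}{4\sqrt{3}} \\ -\tfrac{\sqrt{2}}{4} & \tfrac{\sqrt{2}}{4\sqrt{3}} \end{pmatrix}.
$$

This matrix has rank one, with $C = \sigma\, u\, v^T$ for

$$
\sigma = \frac{1}{\sqrt{3}}, \qquad
u = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ -1 \end{pmatrix}, \qquad
v = \begin{pmatrix} \tfrac{\sqrt{3}}{2} \\ -\tfrac{1}{2} \end{pmatrix}.
$$

Scaling the standard row score by $\sigma$ and the standard column score by $\sigma^2$ gives, for the identified individual and the identifying allele,

$$
r_1^{-1/2} u_1 \sigma = \sqrt{2} \cdot \frac{1}{\sqrt{2}} \cdot \frac{1}{\sqrt{3}} = \frac{1}{\sqrt{3}},
\qquad
c_1^{-1/2} v_1 \sigma^2 = 2 \cdot \frac{\sqrt{3}}{2} \cdot \frac{1}{3} = \frac{1}{\sqrt{3}},
$$

so the two coordinates coincide, as Result 1 states. By contrast, the second allele is carried by both individuals, and its coordinate $c_2^{-1/2} v_2 \sigma^2 = -\tfrac{1}{3\sqrt{3}}$ differs from the second individual's coordinate $r_2^{-1/2} u_2 \sigma = -\tfrac{1}{\sqrt{3}}$. Flipping the sign of both $u$ and $v$, which is the usual SVD sign ambiguity, changes neither conclusion.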
