Dimension Reduction and Visualization for Single-Copy Alignments Via Generalized PCA

bioRxiv preprint doi: https://doi.org/10.1101/338442; this version posted June 4, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

GENETICS | INVESTIGATION

AB Rohrlach∗,†, Nigel Bean∗,†, Gary Glonek∗, Barbara Holland‡, Ray Tobler§, Jonathan Tuke∗,† and Alan Cooper§

∗School of Mathematical Sciences, University of Adelaide, Adelaide, South Australia, 5005, Australia
†ARC Centre for Excellence for Mathematical and Statistical Frontiers, University of Adelaide, Adelaide, South Australia, 5005, Australia
‡School of Natural Sciences (Mathematics), University of Tasmania, Hobart, Tasmania 7001, Australia
§Australian Centre for Ancient DNA, School of Biological Sciences, University of Adelaide, Adelaide, South Australia, 5005, Australia

Copyright © 2018 by the Genetics Society of America. doi: 10.1534/genetics.XXX.XXXXXX. Manuscript compiled: Friday 1st June, 2018. Genetics, Vol. XXX, XXXX–XXXX, June 2018.
Corresponding author: School of Mathematical Sciences, University of Adelaide, SA, 5005. E-mail: [email protected]

ABSTRACT Principal components analysis (PCA) has been one of the most widely used exploration tools in genomic data analysis since its introduction in 1978 (Menozzi et al. 1978). PCA allows similarities between individuals to be efficiently calculated and visualized, optimally in two dimensions. While PCA is well suited to analyses concerned with autosomal DNA, no analogue for PCA exists for the analysis and visualization of non-autosomal DNA. In this paper we introduce a statistically valid method for the analysis of single-copy sequence data. We then show that tests for relationships between genetic information and qualitative and quantitative characteristics can be implemented in a rigorous statistical framework. We motivate the use of our method with examples from empirical data.

KEYWORDS Dimension Reduction, mtDNA, Population Genetics

Introduction

An important feature of any genetic analysis can be detecting whether a sample comes from a structured population. Demographic structure can take many forms. For example, samples may be taken from geographically isolated subpopulations (Tobler et al. 2017), from subpopulations along a migration route (Novembre and Stephens 2008) or from temporally separated population replacement events (Posth et al. 2016). In some cases it can be of interest to discover that no geographic structure exists at all, leading to the exploration of social structure (Van Gremberghe et al. 2011).

A popular form of unsupervised data exploration is principal components analysis (PCA) (Pearson 1901). PCA is a dimension reduction technique that takes $n$ $p$-dimensional vectors and, using linear combinations of the original vectors, finds $\min(n - 1, p)$ $p$-dimensional basis vectors. The new vectors are ordered by the amount of variability explained by each 'principal dimension'. Often the first few dimensions are used to visualise points in the new transformed space.

PCA is a non-parametric, hypothesis-free exploratory technique, making it a particularly attractive analytical tool. However, PCA does require that the vectors of information are quantitative variables. Clearly sequence characters are not quantitative random variables, and so a transformation must be applied to raw sequence data before PCA can be directly applied (Patterson et al. 2006). However, we are aware of no such suitable transformation for DNA sequences that are non-biallelic, and in particular, haploid DNA such as mitochondrial DNA (mtDNA) or Y chromosome sequences. Instead, we suggest the application of multiple correspondence analysis (MCA) directly to the sequence characters.

MCA is an adaptation of PCA where categorical variables (in this case Single Nucleotide Polymorphisms: SNPs) are converted into binary variables denoting the presence or absence of each level of the variables (in this case alleles) (Jolliffe 2002). Unlike PCA, MCA can be applied to any number of alleles. Our method makes the assumption that SNP inheritance is random, i.e. that the underlying phylogenetic tree is a star tree. One could test whether alleles appear to occur independently by investigating a contingency table of pairwise allele counts for an alignment, and then applying a chi-squared test. Since one would almost always overwhelmingly reject the null hypothesis, the result of a chi-squared test would be of no interest. However, the matrix of signed residuals under this assumption forms the basis of the transformation from sequence data to continuous data.

In this paper we aim to show that MCA is a statistically powerful method for the analysis of non-autosomal DNA. We show that MCA has many properties that are analogous to PCA, and is hence immediately intuitive to researchers with experience using PCA. We demonstrate that PCA only quantifies the relationships between rows (individuals), while MCA quantifies the relationships between the rows (individuals) and also the columns (SNPs) simultaneously. For this reason we can quantify and visualise relationships between individuals as in PCA, and also quantify and visualise the relationship between SNPs, and between SNPs and individuals, simultaneously in the same dimensions.

We show that results obtained from MCA correspond to the results obtained from mtDNA phylogenetic trees in a meaningful way, and that demographic structure can be detected using these results. We explore an alignment of African mtDNA from haplogroups L0, L1, L2, L4 and L5 to show that our method produces valid and easily interpretable results by reproducing mtDNA macro-haplogroups via clustering. We also explore an alignment of modern and ancient thylacine mtDNA from a mainland and island population. We show that thylacine genetic signals are highly correlated with longitude, and identify a possible ancestral migration route. Finally we explore an alignment of Western Australian Ghost Bat mtDNA to show that genetic diversity can be almost completely explained by discrete cave locations.

Materials and Methods

The transformation of genomic data to continuous coordinates

Consider an $n \times p$ alignment $A$ of mtDNA, where $A_{ij} \in \{A, C, G, T\}$, filtered to remove homozygous sites. The $n$ rows represent sequenced individuals, denoted $\{a_1, \cdots, a_n\}$, and the $p$ columns represent single nucleotide polymorphisms (SNPs), denoted $s_1, \cdots, s_p$. Note that each of the SNPs can take between two and four forms, and we say that $s_j$ has $|s_j|$ levels.

Consider each of the $Q = \sum_{j=1}^{p} |s_j|$ different allelic forms of the $p$ SNPs, ordered (without loss of generality) numerically by position, then within SNPs, lexicographically by nucleotide. We can define an $n \times Q$ indicator matrix $X$, such that $X_{ik}$ equals one if individual $a_i$ has the allele at the position indicated by the $k$th column name, for $k = 1, \cdots, Q$ (see Figure 1). Note that for a SNP with $|s_j|$ levels, there are only $|s_j| - 1$ linearly independent columns of information in the $X$ matrix (since if an individual does not have any of the first $|s_j| - 1$ forms of the allele, they must have the remaining allele). Hence, in total there are only $Q - p$ linearly independent columns. Finally, we can also calculate a contingency table of pairwise marker combinations $B = X^T X$ (see Figure 1); this matrix is discussed later in the process.

Let $N = \sum_{i=1}^{n} \sum_{j=1}^{Q} x_{ij}$, $r = \frac{1}{N} X 1_Q$ and $c = \frac{1}{N} X^T 1_n$, where $1_k$ is a $k \times 1$ vector of ones, and define $D_r = \mathrm{diag}(r)$ and $D_c = \mathrm{diag}(c)$.

A (raw sequence alignment):

      SNP1  SNP2  SNP3
  a1  A     G     C
  a2  A     T     C
  a3  C     T     G
  a4  C     T     G

X (indicator matrix):

      SNP1A  SNP1C  SNP2G  SNP2T  SNP3C  SNP3G
  a1  1      0      1      0      1      0
  a2  1      0      0      1      1      0
  a3  0      1      0      1      0      1
  a4  0      1      0      1      0      1

B (Burt table):

         SNP1A  SNP1C  SNP2G  SNP2T  SNP3C  SNP3G
  SNP1A  2      0      1      1      2      0
  SNP1C  0      2      0      2      0      2
  SNP2G  1      0      1      0      1      0
  SNP2T  1      2      0      3      1      2
  SNP3C  2      0      1      1      2      0
  SNP3G  0      2      0      2      0      2

Figure 1 A transformation from raw sequence alignment $A$, to an indicator matrix $X$, and a Burt table $B = X^T X$.

[...] is to post multiply each dimension by the associated singular value. Hence the relative spread of points in each dimension is proportional to the amount of inertia captured by each dimension.

From the standard row scores we obtain the transformed coordinates, also called the 'row factor scores', and denoted $F$, of the individuals in the alignment $A$ in 'genetic space' via

$F = F^{*} S.$ (2)

The distances between individuals calculated from these coordinates will respect three properties:

1. If two individuals have the same DNA sequence, they will have identical coordinates.

2. If two individuals share many alleles, they will be closer than two individuals that do not.

3. Individuals that share rare alleles will be [...]
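As an illustration only, the construction of the indicator matrix X, the Burt table B and the masses r and c can be sketched in Python with NumPy, using the four-individual alignment of Figure 1. The function names `indicator_matrix` and `row_factor_scores` are hypothetical and do not come from the paper. The standard row scores F* are not defined in the surviving text, so the sketch assumes the usual correspondence analysis construction (a singular value decomposition of the standardised residuals of X/N, with F* = D_r^{-1/2} U); this may differ in detail from the authors' implementation.

```python
import numpy as np

# Figure 1 alignment: rows are individuals a1..a4, columns are SNP1..SNP3.
alignment = ["AGC", "ATC", "CTG", "CTG"]


def indicator_matrix(seqs):
    """Return (X, column_names) with one 0/1 column per observed allele."""
    columns, names = [], []
    for j in range(len(seqs[0])):
        site = [s[j] for s in seqs]
        for allele in sorted(set(site)):  # lexicographic order within a SNP
            columns.append([1 if a == allele else 0 for a in site])
            names.append(f"SNP{j + 1}{allele}")
    return np.array(columns).T, names


def row_factor_scores(X):
    """Row factor scores, ASSUMING the standard CA construction for F*."""
    N = X.sum()
    r = X.sum(axis=1) / N               # row masses, r = X 1_Q / N
    c = X.sum(axis=0) / N               # column masses, c = X^T 1_n / N
    P = X / N
    # Standardised residuals: D_r^{-1/2} (P - r c^T) D_c^{-1/2}
    R = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, s, _ = np.linalg.svd(R, full_matrices=False)
    F_star = U / np.sqrt(r)[:, None]    # assumed standard row scores
    return F_star * s                   # F = F* S, with S = diag(s)


X, names = indicator_matrix(alignment)
B = X.T @ X                             # Burt table of pairwise allele counts
F = row_factor_scores(X)
```

Here `B` reproduces the Burt table of Figure 1, and the rows of `F` for a3 and a4, which carry the same sequence, coincide up to floating-point error, in line with the first property listed above.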
