Simultaneous Dimension Reduction and Adjustment for Confounding Variation


Zhixiang Lin (a), Can Yang (b), Ying Zhu (c,d), John Duchi (a,e), Yao Fu (f), Yong Wang (g), Bai Jiang (a), Mahdi Zamanighomi (a), Xuming Xu (d), Mingfeng Li (d), Nenad Sestan (d,h,i), Hongyu Zhao (c,1), and Wing Hung Wong (a,j,1)

(a) Department of Statistics, Stanford University, Stanford, CA 94305; (b) Department of Mathematics, Hong Kong Baptist University, Kowloon Tong, Hong Kong; (c) Department of Biostatistics, Yale School of Public Health, New Haven, CT 06520; (d) Department of Neuroscience, Kavli Institute for Neuroscience, Yale School of Medicine, New Haven, CT 06510; (e) Department of Electrical Engineering, Stanford University, Stanford, CA 94305; (f) Program of Computational Biology & Bioinformatics, Yale University, New Haven, CT 06511; (g) Academy of Mathematics & Systems Science, Chinese Academy of Sciences, Beijing 100080, China; (h) Department of Genetics, Yale School of Medicine, New Haven, CT 06510; (i) Department of Psychiatry, Section of Comparative Medicine, Program in Cellular Neuroscience, Neurodegeneration and Repair, Yale School of Medicine, New Haven, CT 06510; and (j) Department of Health Research & Policy, Stanford University, Stanford, CA 94305

Contributed by Wing Hung Wong, October 21, 2016 (sent for review April 19, 2016; reviewed by Rafael Irizarry and Fengzhu Sun)

14662–14667 | PNAS | December 20, 2016 | vol. 113 | no. 51 | www.pnas.org/cgi/doi/10.1073/pnas.1617317113

Dimension reduction methods are commonly applied to high-throughput biological datasets. However, the results can be hindered by confounding factors, either biological or technical in origin. In this study, we extend principal component analysis (PCA) to propose AC-PCA for simultaneous dimension reduction and adjustment for confounding (AC) variation. We show that AC-PCA can adjust for (i) variations across individual donors present in a human brain exon array dataset and (ii) variations of different species in a model organism ENCODE RNA sequencing dataset. Our approach is able to recover the anatomical structure of neocortical regions and to capture the shared variation among species during embryonic development. For gene selection purposes, we extend AC-PCA with sparsity constraints and propose and implement an efficient algorithm. The methods developed in this paper can also be applied to more general settings. The R package and MATLAB source code are available at https://github.com/linzx06/AC-PCA.

dimension reduction | confounding variation | transcriptome

Significance

With the advancement in high-throughput technologies, analyzing high-dimensional data has become a common task. Dimension reduction methods have been applied to visualize and identify dominant patterns in high-dimensional data. Confounding factors, commonly observed in high-throughput biological experiments, can affect the performance of these methods, and other downstream analysis. Here, we develop a method by coupling dimension reduction with the adjustment for confounder effects. Our method is able to capture the underlying patterns, as demonstrated by a human brain exon array dataset, a model organism ENCODE RNA sequencing dataset, and simulations.

Author contributions: Z.L., N.S., H.Z., and W.H.W. designed research; Z.L. performed research; C.Y., J.D., Y.F., Y.W., B.J., and M.Z. contributed new reagents/analytic tools; Z.L., Y.Z., X.X., and M.L. analyzed data; and Z.L., C.Y., Y.Z., Y.F., Y.W., N.S., H.Z., and W.H.W. wrote the paper.

Reviewers: R.I., Harvard University; and F.S., University of Southern California.

The authors declare no conflict of interest.

(1) To whom correspondence may be addressed. Email: [email protected] or [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1617317113/-/DCSupplemental.

Dimension reduction methods, such as multidimensional scaling (MDS) and principal component analysis (PCA), are commonly applied in high-throughput biological datasets to visualize data in a low-dimensional space, identify dominant patterns, and extract relevant features (1–6). MDS aims to place each sample in a lower-dimensional space such that the between-sample distances are preserved as much as possible (7). PCA seeks the linear combinations of the original variables such that the derived variables capture maximal variance (8). One advantage of PCA is that the principal components (PCs) are more interpretable by checking the loadings of the variables.

Confounding factors, either biological or technical in origin, are commonly observed in high-throughput biological experiments. Various methods have been proposed to estimate the confounding variation, for example, regression models on known confounding factors (9) and factor models and surrogate variable analysis for unobserved confounding factors (10–15). However, limited work has been done in the context of dimension reduction. Confounding variation can affect PC-based visualization of the data points because it may obscure the desired biological variation, and it can also affect the loading of the variables in the PCs.

Here we extend PCA to propose AC-PCA for simultaneous dimension reduction and adjustment for confounding (AC) variation. We introduce a class of penalty functions in PCA, which encourages the PCs to be invariant to the confounding variation. We demonstrate the performance of AC-PCA through its application to a human brain development exon array dataset (4), a model organism ENCODE (modENCODE) RNA sequencing (RNA-Seq) dataset (16, 17), and simulated data. We also implemented AC-PCA with sparsity constraints to enable variable/gene selection and better interpretation of the PCs.

Results

AC-PCA in a General Form. Let $X$ denote the $N \times p$ data matrix, where $N$ is the number of observations and $p$ is the number of variables/genes. $X$ is centered by column. Let $x_{(i)}$ denote the $i$th observation. Let $v$ denote a $p$-dimensional vector and $t_i = x_{(i)} \cdot v$ denote the projection induced by $v$. The quantity $\sum_{i=1}^{N} t_i^2 = \sum_{i=1}^{N} (x_{(i)} \cdot v)^2 = v^T X^T X v$ is proportional to the total variation after the projection, and classical PCA seeks $v$ that maximizes it. The dimension extracted this way can be misleading if there is confounding variation (Results, A Motivating Example). Let $Y$ denote an $N \times l$ confounder matrix, representing $l$ confounders, and let $K = YY^T$. $Y$ is centered by column. We choose $Y$ so that $v^T X^T K X v$ represents the confounding variation in $t$. Because we are not interested in the subspace exhibiting the confounding variation, this suggests the following modification of PCA:

$$\underset{v \in \mathbb{R}^p}{\text{maximize}} \;\; v^T X^T X v - \lambda\, v^T X^T K X v \quad \text{subject to } \|v\|_2^2 \le 1, \tag{1}$$

where the tuning parameter $\lambda \ge 0$ controls the strength of regularization. If $\lambda = 0$, this is classical PCA; when $\lambda$ is large enough, we are restricting the subspace to be orthogonal to the columns of $X^T Y$. The confounder matrix $Y$ is user-defined, depending on the data structure and the assumptions on the confounding variation. We provide several examples below on how to choose $Y$. Denote $Z = X^T X - \lambda X^T K X$; problem (1) can be solved directly by implementing eigendecomposition on $Z$. Methods for choosing $\lambda$ and for assessing the statistical significance of the extracted dimensions are presented in Materials and Methods.

In PCA, the loadings for the variables are typically nonzero. For better interpretation of the PCs, a sparse solution for $v$ can be achieved by adding an $\ell_1$ constraint:

$$\underset{v \in \mathbb{R}^p}{\text{maximize}} \;\; v^T X^T X v \quad \text{subject to } v^T X^T K X v \le c_1,\;\; \|v\|_1 \le c_2,\;\; \|v\|_2^2 \le 1, \tag{2}$$

where $c_1$ is a constant depending on $\lambda$ and $c_2$ is the sparsity parameter.

A Motivating Example. To motivate the problem, we first conducted PCA on a subset of samples from the human brain exon array dataset (4) (Fig. 1). The samples come from 10 brain regions in six donors and each donor tends to form a cluster. There are no clear patterns among the regions in the first 20 PCs (SI Appendix, Fig. S1). The variation of individual donors makes it challenging to extract the variation across brain regions. This dataset will be discussed in more detail later.

[Figure 1: PCA scatter plot of the samples, PC 1 (horizontal axis) versus PC 2 (vertical axis); points are labeled by brain region letter.]

Fig. 1. An example from the human brain exon array data, window 5. The labels represent the brain regions. Abbreviations for the brain regions are medial prefrontal cortex [MFC (F)], orbital prefrontal cortex [OFC (O)], ventrolateral prefrontal cortex [VFC (V)], dorsolateral prefrontal cortex [DFC (D)], primary motor cortex [M1C (M)], primary somatosensory cortex [S1C (S)], posterior inferior parietal cortex [IPC (P)], primary auditory cortex [A1C (A)], posterior superior temporal cortex [STC (T)], and inferior temporal cortex [ITC (I)]. Each color represents a donor. In some donors, samples from the left and/or right hemispheres are available. Samples in the right hemisphere are labeled in bold and italic.

For simplicity, we assume that there are no missing data. Let $X^{(i)}$ represent the $b \times p$ matrix for the gene expression levels of donor $i$, where $b$ is the number of brain regions and $p$ is the number of genes. By stacking the rows of $X^{(1)}, \cdots, X^{(n)}$, we obtain the $N \times p$ data matrix $X$, where $N = n \times b$. We propose the following objective function to adjust for the variation of donors:

$$\underset{v \in \mathbb{R}^p}{\text{maximize}} \;\; v^T X^T X v - \lambda \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} v^T \big(X^{(i)} - X^{(j)}\big)^T \big(X^{(i)} - X^{(j)}\big) v \quad \text{subject to } \|v\|_2^2 \le 1. \tag{3}$$

Formula 3 is a special case of formula 1 (SI Appendix). The penalty term in formula 3 encourages the projection of the same region across donors to be similar. There are missing values in the brain data, and formula 3 can be modified to handle missing samples (SI Appendix). More implementation details are discussed in Materials and Methods. The penalty can artificially induce the appearance of clusters. We provide guidelines on how to detect overcorrection in SI Appendix.

The Analysis of Variance Interpretation. For a number of observations, assume that the samples can be divided into $K$ groups. Let $t_{jk}$ denote the projection (induced by $v$) for the $j$th observation in group $k$, $j = 1, \cdots, n_k$, and let $\bar{t}_{\cdot k}$ and $\bar{t}_{\cdot\cdot}$ denote the group mean and grand mean, correspondingly. The total sum of squares, $SS_T = \sum_{k=1}^{K} \sum_{j=1}^{n_k} (t_{jk} - \bar{t}_{\cdot\cdot})^2$, can be divided into the between-groups sum of squares, $SS_B = \sum_{k=1}^{K} n_k (\bar{t}_{\cdot k} - \bar{t}_{\cdot\cdot})^2$, and the remaining sum of squares, $SS_R = SS_T - SS_B$. Consider the samples for a region $r$ in the brain data. Let the donor labels represent the groups, so that $n_k = 1, \forall k$, because there are no replicates for a donor, and let $SS_B^{(r)}$ denote the between-groups sum of squares for region $r$. Because $X$ is centered, $v^T X^T X v$ is the total sum of squares $SS_T$ of all samples. For samples in all of the regions, it can be shown that the objective function in formula 3 can be rewritten as

$$\underset{v \in \mathbb{R}^p}{\text{maximize}} \;\; SS_T - \lambda\, n \sum_{r=1}^{b} SS_B^{(r)} \quad \text{subject to } \|v\|_2^2 \le 1,$$

so it penalizes the donor-to-donor variation in the projected data. In general, $Y$ can be designed such that the penalty term represents the between-groups sum of squares, when the between-groups variation is undesirable.

Simulations. We evaluated AC-PCA in simulations mimicking the brain data, and we set $n = 5$, $b = 10$, and $p = 400$ in the simulations. We assumed that $X^{(i)} = \Omega + \Gamma^{(i)} + \text{noise}$, where $\Omega$ is the low rank component shared among donors, $\Gamma^{(i)}$ is the donor-specific component, and the noise is Gaussian. $\Omega$ is unknown and our goal is to capture the shared component $\Omega$. Consider performing PCA on the pooled samples from multiple donors. When $\Gamma^{(i)} = 0$, the first several PCs can capture $\Omega$. When $\Gamma^{(i)} \ne 0$, the PCs can be affected by $\Gamma^{(i)}$, which causes samples from the same donor to cluster, as we have seen in the brain data. We further assumed that $\Gamma^{(i)} = \Lambda_1^{(i)} + \Lambda_2^{(i)}$: The donor's effect is the same in all regions within a donor in $\Lambda_1^{(i)}$, whereas it is different in $\Lambda_2^{(i)}$. $\Lambda_2^{(i)}$ allows for more complicated donor's effect and we considered two settings (Materials and Methods): (i) only a random subset of regions (three random regions in each donor) is affected and (ii) the latent structure in $\Lambda_2^{(i)}$ is correlated with that in $\Omega$. The results for one representative run and for 100 runs are shown in Fig. 2 and SI Appendix, Fig. S2. We compared AC-PCA with ComBat (9) and SVA (10, 14), where PCA was implemented after removing the confounder effects (SI Appendix). Compared with ComBat and SVA, both the projected data ("PC") and the loading of genes ("PC loading") in AC-PCA tend to be more correlated with those in the shared component. To see how the other methods may fail, we calculated the correlations of the first two PCs with PC1 in $\Lambda_1$ and $\Lambda_2$ (SI Appendix, Figs. S3 and S4). ComBat adjusts for $\Lambda_1$ well and cannot adjust for $\Lambda_2$. This is as expected because ComBat assumes that the donor's effect is similar in all regions within a donor, when using donor labels

STATISTICS | SYSTEMS BIOLOGY

Fig. 2. Comparison of AC-PCA with PCA, ComBat (9), and SVA (10, 14) on simulated data. (A and C) Settings 1 and 2, one representative run. Each color represents a donor. (B) Setting 1, correlation with PCA on the shared component Ω. The correlations of the first two PCs in Ω with the matched PCs in the three methods were calculated, and the distribution for 100 runs is shown. We calculated the Pearson's correlation for the projected data ("PC") and the Spearman's rank correlation for the loading of genes ("PC loading"). The dot in the violin plot indicates the median of the distribution. (D) Setting 1, distribution of the distance between the left-out sample and the retained samples, the first two PCs.
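The general form (formula 1) with the donor penalty (formula 3) can be illustrated on toy data resembling the first simulation setting. The following is a minimal sketch, not the authors' R or MATLAB implementation: the function names, the way the confounder matrix Y is encoded (one column per donor pair and region, with entries +1 and -1), the noise levels, and the value of λ are all assumptions made for illustration. With this encoding, $X^T Y Y^T X$ equals the pairwise donor penalty, so the solution is obtained by eigendecomposition of $X^T X - \lambda X^T Y Y^T X$.

```python
import numpy as np
from itertools import combinations


def donor_pair_confounder(n_donors, n_regions):
    """Hypothetical Y: one column per (donor pair, region), entries +1 and -1.

    Rows are ordered donor by donor, so sample (donor i, region r) is row
    i * n_regions + r. Each column sums to zero, so Y is centered by column.
    """
    cols = []
    for i, j in combinations(range(n_donors), 2):
        for r in range(n_regions):
            col = np.zeros(n_donors * n_regions)
            col[i * n_regions + r] = 1.0
            col[j * n_regions + r] = -1.0
            cols.append(col)
    return np.column_stack(cols)


def ac_pca(X, Y, lam, n_components=2):
    """Scores and loadings from the top eigenvectors of X^T X - lam * X^T (Y Y^T) X."""
    Xc = X - X.mean(axis=0)                  # center X by column
    YtX = Y.T @ Xc                           # l x p
    Z = Xc.T @ Xc - lam * (YtX.T @ YtX)      # p x p, symmetric
    evals, evecs = np.linalg.eigh(Z)         # eigenvalues in ascending order
    V = evecs[:, ::-1][:, :n_components]     # largest eigenvalues first
    return Xc @ V, V


def between_group_share(t, groups):
    """SS_B / SS_T of a projected vector t for the given group labels."""
    grand = t.mean()
    ss_t = np.sum((t - grand) ** 2)
    ss_b = sum(np.sum(groups == g) * (t[groups == g].mean() - grand) ** 2
               for g in np.unique(groups))
    return ss_b / ss_t


# Toy data: shared rank-2 component + donor shift (same in all regions) + noise.
rng = np.random.default_rng(0)
n, b, p = 5, 10, 400
w1 = np.linspace(-1.0, 1.0, b)
w2 = w1 ** 2 - np.mean(w1 ** 2)
omega = np.column_stack([w1, w2]) @ rng.normal(size=(2, p))
shift = rng.normal(scale=2.0, size=(n, p))
X = np.vstack([omega + shift[i] + rng.normal(scale=0.5, size=(b, p))
               for i in range(n)])
donor = np.repeat(np.arange(n), b)
region = np.tile(np.arange(b), n)

Y = donor_pair_confounder(n, b)
pca_scores, _ = ac_pca(X, Y, lam=0.0)   # lam = 0 reduces to classical PCA
ac_scores, _ = ac_pca(X, Y, lam=5.0)

print("donor share of PC1, PCA:    ", between_group_share(pca_scores[:, 0], donor))
print("donor share of PC1, AC-PCA: ", between_group_share(ac_scores[:, 0], donor))
print("region share of PC1, AC-PCA:", between_group_share(ac_scores[:, 0], region))
```

In this toy setting the first PC of classical PCA is expected to be dominated by between-donor variation, whereas the first AC-PCA component should vary mainly across regions. The share function is the $SS_B / SS_T$ ratio from the analysis of variance interpretation.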

for the adjustment. Compared with ComBat, SVA adjusts for $\Lambda_2$ better but not as well for $\Lambda_1$. AC-PCA adjusts for both $\Lambda_1$ and $\Lambda_2$ well. Simulations for $\Gamma^{(i)} = \Lambda_1^{(i)}$ or $\Lambda_2^{(i)}$ alone and other settings are provided in SI Appendix.

In addition to identifying dominant patterns in the data, PCA has been used to detect abnormal samples, potentially caused by mislabeling. To see whether AC-PCA can detect a mislabeled sample, we performed leave-one-out cross-validation (CV): (i) We performed AC-PCA on the retained samples and used the eigenvectors to calculate the projection for the left-out sample, (ii) we calculated the distance of the left-out sample with each of the retained samples, and (iii) we iterated i and ii through all of the samples. In the first two PCs, the left-out sample tends to be closer to the retained samples with the same region label (Fig. 2D and SI Appendix, Figs. S5 and S6). The two distributions, same region label vs. different region labels, are well separated in the first two PCs, especially for PC1. For a left-out sample, by comparing its distances to samples that have the same region label vs. different region labels in the retained set, we are likely to identify whether it is mislabeled.

Application to the Human Brain Exon Array Data. The human brain exon array dataset (4) includes the transcriptomes of 16 brain regions comprising 11 areas of the neocortex and 5 other regions. In the analysis, we used samples from 10 regions in the neocortex. Primary visual cortex (V1C) was excluded from the analysis because the distinct nature of this area relative to other neocortical regions tended to compress the other 10 regions into a single cluster. We sorted the donors by age and defined nine time windows by grouping samples from every six donors (SI Appendix, Table S1). Samples within a time window are relatively homogeneous in time, except for window 4, in which the donor's effect is likely driven by age (SI Appendix, Fig. S7). Samples in window 5 were used for demonstration in Fig. 1. In window 5, when we applied formula 3 to adjust for the confounding effects from individual donors, samples from the same neocortical region tended to cluster together, and we were able to recover the anatomical structure of neocortex (Fig. 3 A and B). ComBat and SVA were able to remove some donors' effect, because samples no longer cluster by donors after the adjustment. However, no clear interregional patterns were identified by the two methods. One limitation of our method is that the age effect is not distinguished from the donor's effect. It is challenging to distinguish between the two effects because the donor labels are highly confounded with age (SI Appendix, Fig. S7).

We compared the eigenvalues in the brain data versus permutation by shuffling the region labels in each donor (Fig. 3C). The first three PCs are likely to be significant. When we shuffled all samples across donors, the trend was similar (SI Appendix, Fig. S8). We also compared the variance explained by the PCs in the brain data versus permutation, and the trend was similar (SI Appendix, Fig. S9). A parallel evidence for the significance of the PCs is achieved through CV. In addition to the leave-one-out CV presented in the simulation section, we considered leave-one-donor-out CV, where all samples within a donor are left out. We iterated through all donors and calculated the distance between all pairs of samples: one from the left-out donor and the other one from the retained samples. The CV result is consistent with the eigenvalue and variance results, because the left-out samples tend to be closer to the retained samples

with the same region label in the first three PCs (SI Appendix, Figs. S10 and S11). When we combined the first three PCs, the Euclidean distance for the pair of samples with the same region label versus different region labels tended to be more separated, compared with that for each PC individually (Fig. 3D and SI Appendix, Figs. S10–S12). In fact, if we predict the region label of a testing sample by assigning it to the k-closest clusters of regions in the training set based on the first three PCs, the prediction accuracies are 43% (k = 1), 67% (k = 2), and 85% (k = 3) for leave one donor out and 32% (k = 1), 65% (k = 2), and 80% (k = 3) for leave one sample out, whereas the expected accuracies for random guess are 10% (k = 1), 20% (k = 2), and 30% (k = 3). Because the confounding effect of the left-out donor cannot induce any bias in the clustering of regions, this result shows that the penalty in AC-PCA enables the learning of reduced dimension with significantly reduced confounder influence. In simulation settings 1 and 2, the three criteria (eigenvalue, variance, and CV) give consistent estimates for the number of significant PCs, which equals the number of latent factors in Ω (SI Appendix, Figs. S5 and S6).

[Figure 3, panels A–G.]

Fig. 3. Application of AC-PCA to the human brain exon array data. (A) Comparisons of AC-PCA with ComBat (9) and SVA (10, 14), window 5. Abbreviations for the brain regions are medial prefrontal cortex [MFC (F)], orbital prefrontal cortex [OFC (O)], ventrolateral prefrontal cortex [VFC (V)], dorsolateral prefrontal cortex [DFC (D)], primary motor cortex [M1C (M)], primary somatosensory cortex [S1C (S)], posterior inferior parietal cortex [IPC (P)], primary auditory cortex [A1C (A)], posterior superior temporal cortex [STC (T)], and inferior temporal cortex [ITC (I)]. (B) Representative fetal human brain, lateral surface of hemisphere. MFC is not visible on the lateral surface. (C) The eigenvalues in the brain data (red cross) vs. permutated data (boxplot). The boxplots represent the distributions for 100 permutations. (D) Distribution of the Euclidean distance between all pairs of samples: one from the left-out donor and the other one from the retained samples. (E) Number of genes selected in the sparse PCs and the interregional variation explained by the regular PCs. Interregional variation is calculated to be the sum of the variance across regions among the total individuals. (F) Expression levels for the genes with top loadings in windows 1 and 3. The top three genes are shown and each point represents the median over the individuals. (G) Conservation (dN/dS) scores for the genes in PC1 and PC2. "All" represents the total 17,568 genes in the exon array experiment. "All-5000" represents the 5,000 genes used in the analysis. "Essential" represents a list of 337 genes that are functionally conserved and essential, obtained from the Database of Essential Genes (DEG) version 5.0 (18). "Window" is abbreviated as "W."

The visualization results for the other windows are shown in SI Appendix, Figs. S13 and S14. In summary, all methods performed reasonably well in windows 1 to 3; no clear interregional patterns were identified in ComBat and SVA in windows 4 to 9, whereas the patterns identified by AC-PCA tend to agree with the regions' physical locations. When PCA is performed separately for each donor and hemisphere, the gross structure tends to be consistent within a time window, but the pattern is distorted by a high level of noise (SI Appendix, Fig. S15). Next, we explored the temporal dynamics of the PCs in the brain data (SI Appendix, Fig. S16). The pattern is similar from windows 1 to 5, with PC1 representing the frontal-to-temporal gradient, which follows the contour of the developing cortex (5), and PC2 representing the dorsal-to-ventral gradient. Starting from window 6, these two components reversed order. The interregional variation explained by the first two PCs decreases close to birth (window 4) and then increases in later time windows (SI Appendix, Fig. S17), similar to the "hourglass" pattern previously reported based on cross-species comparison and differential expression (19, 20).

We then implemented AC-PCA with sparsity constraints to select genes associated with the PCs. The number of genes with nonzero loadings is shown in Fig. 3E, along with the interregional variation explained in the regular PCs. Interestingly, the trends tend to be consistent: When the regular PC explains more variation, more genes are selected in the corresponding sparse PC. To produce more stringent and comparable gene lists, we chose the sparsity parameter such that 200 genes are selected in each window. The overlap of gene lists across windows is moderate (SI Appendix, Fig. S18) and, as expected, the overlap with the first window decreases over time. The overlap between adjacent windows tends to be larger in later time windows, indicating that interregional differences become stable. Genes with the largest loadings demonstrate interesting spatial patterns (Fig. 3F and SI Appendix, Fig. S19). In windows 1 and 3, the top genes in PC1 follow the frontal-to-temporal gradient, whereas in PC2 they tend to follow the dorsal-to-ventral gradient. A brief overview of the functions of these genes is shown in SI Appendix, Table S2. We compared the gene level results between SVA and AC-PCA (SI Appendix, Fig. S19). In the later time windows, especially windows 5, 6, and 7, AC-PCA tends to select genes with larger interregional variation.

Finally, we demonstrate the functional conservation of the 200 genes selected in PC1 and PC2. These genes tend to have low dN/dS scores for human versus macaque comparison, even lower than the complete list of all essential genes (Fig. 3G). In the human versus mouse comparison, we observed a similar trend (SI Appendix, Fig. S20). Parallel to the cross-species conservation, we also observed that these genes tend to have low heterozygosity scores, a measure of functional conservation in human (SI Appendix, Fig. S21).

Application to the modENCODE RNA-Seq Data. The modENCODE project generates the transcriptional landscapes for model organisms during development (16, 17). In the analysis, we used the time-course RNA-Seq data for fly and worm embryonic development. For the fly data, samples were taken in 12 time windows during embryonic development: 0 to 2, 2 to 4, ···, 22 to 24 h; for the worm data, samples were taken every 30 min during embryonic development: 0, 0.5, ···, 4, 5, ···, 12 h, where the sample from 4.5 h is missing, resulting in 24 samples.

We first conducted PCA on fly and worm separately, as shown in Fig. 4A. Although the temporal patterns share some similarity, the projections for fly and worm are different. The genes with top loadings in fly have different temporal dynamics in worm, especially for PC2 (Fig. 4B). We also conducted PCA on fly and worm jointly, which reveals the variations of different species (Fig. 4C).

[Figure 4, panels A–D. Panel titles: (A) "fly, PCA" and "worm, PCA"; (B) "fly, PCA, PC1" and "fly, PCA, PC2" (genes shown: CycC, CG9548, janA, betaTub60D, dys, NtR); (C) "PCA" and "AC-PCA"; (D) "AC-PCA, PC1" and "AC-PCA, PC2" (genes shown: Shaw, CG1732, Mec2, CG7997, scf, CG31915). Axes: PC 1 versus PC 2 in A and C; log2(expression level + 1) versus embryonic time (early embryo → late embryo) for fly and worm in B and D.]

Fig. 4. AC-PCA captures shared variation in fly and worm modENCODE RNA-Seq data, embryonic stage. (A) PCA on fly and worm separately. Each data point represents a time window (fly) or a time point (worm), in the unit of hours. (B) Expression levels of the top three genes in fly PCA. (C) PCA and AC-PCA on fly and worm jointly. (D) Expression levels of the top three genes in AC-PCA.

Let $X^{(f)}$ and $X^{(w)}$ represent the data matrices for fly and worm, correspondingly. Let $X_t^{(f)}$ and $X_t^{(w)}$ denote the data for time window $t$ and time point $t$, correspondingly. $X_{10}^{(w)}$ is missing. Let $X$ represent the data matrix for both species, by stacking the rows in $X^{(f)}$ and $X^{(w)}$. We propose the following objective function to adjust for the variation of species:

$$\underset{v \in \mathbb{R}^p}{\text{maximize}} \;\; v^T X^T X v - \lambda \sum_{t=1}^{12} v^T \big(X_t^{(f)} - f(X^{(w)}, t)\big)^T \big(X_t^{(f)} - f(X^{(w)}, t)\big) v \quad \text{subject to } \|v\|_2^2 \le 1, \tag{4}$$

where

$$f(X^{(w)}, t) = \begin{cases} \frac{1}{2}\big(X_{2t-1}^{(w)} + X_{2t+1}^{(w)}\big), & \text{if } t = 5 \\ \frac{1}{3}\big(X_{2t-1}^{(w)} + X_{2t}^{(w)} + X_{2t+1}^{(w)}\big), & \text{otherwise.} \end{cases}$$

Formula 4 is a special case of the general form (SI Appendix). To incorporate the difference in the length of the embryonic stages, we shrink the projection of $X_t^{(f)}$ toward the mean of $X_{2t-1}^{(w)}$, $X_{2t}^{(w)}$, and $X_{2t+1}^{(w)}$ after the projection. Li et al. (21) aligned fly and worm development based on stage-associated genes using the same dataset. We did not implement their alignment results, to keep the analysis unsupervised.

Formula 4 is able to capture the shared variation among fly and worm (Fig. 4C). The selected genes tend to have consistent and smooth temporal patterns in both species (Fig. 4D). PCA on fly and worm jointly cannot capture the direction of

PC2 in AC-PCA, in which the gene expression levels peak in the middle embryonic stage. The other PCs in PCA are shown in SI Appendix, Fig. S22. After using ComBat with the species labels for the adjustment, PCA still cannot capture that direction (SI Appendix, Fig. S23).

Other Applications. Two additional simulation examples are provided in SI Appendix, where we implemented other formulations of AC-PCA.

Discussion

Confounding variation can affect the performance of dimension reduction methods, and hence the visualization and interpretation of the results. In this study, we have proposed a class of penalty functions in PCA for simultaneous dimension reduction and adjustment for confounding variation. In formula 1, we implemented a linear kernel on $Y$ (i.e., $YY^T$), and other general kernels can be used as well. The additive property of kernels enables a further generalization, through the combination of different kernels, to adjust for multiple types of confounders' effects simultaneously.

The application of AC-PCA is not limited to transcriptome datasets. Dimension reduction methods have been applied to other types of genomics data for various purposes, such as feature extraction for methylation prediction (22), classifying yeast mutants using metabolic footprinting (23), classifying immune cells using DNA methylome (24), and others. AC-PCA is applicable to these datasets to capture the desired variation, adjust for potential confounders, and select the relevant features. AC-PCA can serve as an exploratory tool and be combined with other methods. For example, the extracted features can be implemented in regression models.

Materials and Methods

AC-PCA Adjusting for Variations of Individual Donors. We treated the left and right hemispheres from the same donor as two different individuals when implementing formula 3. We implemented all methods (AC-PCA, PCA, ComBat, and SVA) separately for each time window. To formulate the penalty term in formula 3, we need to know the source of confounding variation. However, the donor labels are not necessarily required. Only the labels for the primary variables of interest (i.e., the region labels) are needed, because we are penalizing every pair of samples in the same brain region (SI Appendix, SI Materials and Methods). The connection of formula 3 with canonical correlation analysis is shown in SI Appendix, SI Materials and Methods.

AC-PCA with Sparse Loading. See SI Appendix, SI Materials and Methods.

Multiple PCs. [...]

Tuning λ. [...] of the between-groups sum of squares, we choose the smallest λ such that [...]. [...] tend to decrease (SI Appendix, Fig. S24). If [...] variation is "small" compared with the total variation in the projected data [...]. In other designs of [...]. In the human brain dataset, we fixed λ = 5 for a better comparison over the time windows. In data analysis, [...] captured by AC-PCA are robust to a wide range of λ (SI Appendix, Fig. S25).

Significance of the PCs. [...] by checking whether the PCs are significant. To achieve this, we compare the eigenvalues and the variance explained by the PCs in the original data vs. permutation, where we permutate the rows in [...] the same.

Simulations. [...] For visualization purpose, we assumed that it is smooth and has rank 2. [...] represents a [...] specific component [...] are generated from Uniform[0, 2], and the other entries are set to 0. [...] s is the strength of confounding variation, and we set [...] and shuffle three entries. More simulation results are shown in [...].

Data Preprocessing. [...]

ACKNOWLEDGMENTS. We thank Angel Rubio and Joey Arthur for useful discussions, and Matthew W. State for the partial financial support of Z.L. This work was partially supported by National Science Foundation Grant DMS-1106738 and National Institutes of Health Grants R01 GM59507 and P01 CA154295 (to Z.L. and H.Z.); National Institutes of Health Grants R01 HG007834 and R01 GM109836 (to Z.L., Y.W., B.J., M.Z., and W.H.W.); National Science Funding of China Grant 61501389; Hong Kong Research Grant Council Grants 22302815 and 12316116; Hong Kong Baptist University Grants FRG2/14-15/069 and FRG2/15-16/011 (to C.Y.); and National Institutes of Health Grants P50 MH106934 and U01 MH103339 (to Y.Z., X.X., M.L., and N.S.). All computations were performed at the Yale University Biomedical High Performance Computing Center.

1. Sharov AA, et al. (2003) Transcriptome analysis of mouse stem cells and early embryos. PLoS Biol 1(3):E74.
2. Ringnér M (2008) What is principal component analysis? Nat Biotechnol 26(3):303–304.
3. Giordano TJ, et al. (2009) Molecular classification and prognostication of adrenocortical tumors by transcriptome profiling. Clin Cancer Res 15(2):668–676.
4. Kang HJ, et al. (2011) Spatio-temporal transcriptome of the human brain. Nature 478(7370):483–489.
5. Miller JA, et al. (2014) Transcriptional landscape of the prenatal human brain. Nature 508(7495):199–206.
6. Darmanis S, et al. (2015) A survey of human brain transcriptome diversity at the single cell level. Proc Natl Acad Sci USA 112(23):7285–7290.
7. Kruskal JB, Wish M (1978) Multidimensional Scaling. Quantitative Applications in the Social Sciences (SAGE, Thousand Oaks, CA), Vol 11.
8. Jolliffe I (2002) Principal Component Analysis (Wiley, New York).
9. Johnson WE, Li C, Rabinovic A (2007) Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8(1):118–127.
10. Leek JT, Storey JD (2007) Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet 3(9):1724–1735.
11. Gagnon-Bartsch JA, Speed TP (2012) Using control genes to correct for unwanted variation in microarray data. Biostatistics 13(3):539–552.
12. Leek JT, Johnson WE, Parker HS, Jaffe AE, Storey JD (2012) The SVA package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics 28(6):882–883.
13. Yang C, Wang L, Zhang S, Zhao H (2013) Accounting for non-genetic factors by low-rank representation and sparse regression for eQTL mapping. Bioinformatics 29(8):1026–1034.
14. Parker HS, Bravo HC, Leek JT (2014) Removing batch effects for prediction problems
15. Risso D, Ngai J, Speed TP, Dudoit S (2014) Normalization of RNA-seq data using factor
16. Celniker SE, et al. (2009) Unlocking the secrets of the genome.
17. Gerstein MB, et al. (2014) Comparative analysis of the transcriptome across distant
18. Zhang R, Lin Y (2009) DEG 5.0, a database of essential genes in both prokaryotes and
19. Pletikos M, et al. (2014) Temporal specification and bilaterality of human neocortical
20. Lin Z, et al. (2015) A markov random field-based approach to characterizing
21. Li JJ, Huang H, Bickel PJ, Brenner SE (2014) Comparison of
22. Das R, et al. (2006) Computational prediction of methylation status in human genomic sequences.
23. Allen J, et al. (2003) High-throughput classification of yeast mutants for functional genomics using metabolic footprinting. 692–696.
24. Kulis M, et al. (2012) Epigenomic analysis detects widespread gene-body DNA hypomethylation in chronic lymphocytic leukemia. 1236–1242.
Res gans data. 9(1):429–451. transcriptome spatial–temporal using development brain human expression. gene topographic eukaryotes. species. 927–930. samples. or genes control of analysis analysis. variable surrogate frozen with × w ≤ 2 24(7):1086–1101. p λ. c sa is ) eeomna tgs ise,adclsb oECD N-e data. RNA-seq modENCODE by cells and tissues, stages, developmental 1 5i h C htw r neetdi,s h ofudn vari- confounding the so in, interested are we that PCs the in 0.05 × arxadterw in rows the and matrix and Nature When p w µ R(λ) arxwt l 1s. all with matrix b×1 rcNt cdSiUSA Sci Acad Natl Proc arcs eeae from generated matrices, b nstig1 econsidered we 1, setting In uli cd Res Acids Nucleic 2 c See B = X 2 × i . efis set first We . N ∼ (i 512(7515):445–448. ) λ K ≤ (1, PNAS See arx ersnigteltn tutr fteshared the of structure latent the representing matrix, 2 IApni,S aeil n Methods. and Materials SI Appendix, SI + Ω = , nrae rm0 h ratio the 0, from increases a egetrta n ecos h smallest the choose we and 1 than greater be can R(λ) Γ N 0.05R(λ See · · · (0, (i IApni,S aeil n Methods. and Materials SI Appendix, SI (0, ) , Λ = etakYxa i o the for Qiu Yixuan thank We | 0.25 o fixed a For IApni,S aeil n Methods. and Materials SI Appendix, SI b) 5.I etn ,teol ifrnefo setting from difference only the 2, setting In 0.25). αΓ eebr2,2016 20, December 0 1 (i and 7Spl1):D455–D458. 37(Suppl Neuron (i = ) · B ) Λ + Σ i + ) entc htteoealpten cap- patterns overall the that notice We 0). ob qa to equal be to h ,where ), 103(28):10713–10716. w r eeae from generated are (i 2 (i a Biotechnol Nat 1 ) B ) 81(2):321–332. h hrdcomponent shared The . where , eeaut h infiac fthe of significance the evaluate we λ, stenormalized the is i PeerJ IAppendix SI sa is N n arx nwihtreentries three which in matrix, b×1 (0, Σ = 2:e561. v ij Λ 5, T I p = X R(λ) | 1 (i ). b ) T w 32(9):896–902. α o.113 vol. = α KXv exp(−(w = nsmltosadreal and simulations in λs 1 . 
= = sasaa niaigthe indicating scalar a is 1r n hnrnol pick randomly then and 0 and 10, .melanogaster D. i 3 a h interpretation the has v .Teetisin entries The 2.5. and a Biotechnol Nat T ihma and 0 mean with µ, o etns1ad2. and 1 settings for N X RSpectra T | i (0, 1 a Genet Nat KXv Λ p − Nature o 51 no. 2 (i I = Ω p = ) X .Tedonor- The ). w /v = n plStat Appl Ann n keep and 0.Frthe For 400. λ j 1 T B ) Wh. X uhthat such 2 package, and 459(7249): i s /4). | T i Genome Xv The . r 44(11): 14667 .ele- C. i W 21(6): and will h = (i Y is 1 )
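As an illustration of the rule for tuning λ in Materials and Methods, the following Python sketch computes the ratio R(λ) = v^T X^T K X v / v^T X^T X v on a grid and returns the smallest λ with R(λ) ≤ 0.05. It is a minimal sketch under stated assumptions and not the authors' R or MATLAB implementation. It assumes that the AC-PCA direction for a given λ is the leading eigenvector of X^T X − λ X^T K X, which is not restated in this section. The function names, the toy data (4 donors, 10 regions, 6 variables), and the λ grid are all hypothetical.

```python
import numpy as np

def acpca_direction(X, K, lam):
    """Unit-norm leading eigenvector of X^T X - lam * X^T K X.

    Assumption: this penalized eigenproblem stands in for the AC-PCA
    objective; it is not taken from this section of the paper.
    """
    M = X.T @ X - lam * (X.T @ K @ X)
    M = (M + M.T) / 2.0                    # symmetrize against round-off
    _, eigvecs = np.linalg.eigh(M)         # eigenvalues in ascending order
    return eigvecs[:, -1]

def confounding_ratio(X, K, v):
    """R = v^T X^T K X v / v^T X^T X v for a projection direction v."""
    t = X @ v
    return float(t @ K @ t) / float(t @ t)

def choose_lambda(X, K, grid, threshold=0.05):
    """Smallest lambda on the grid whose direction has R <= threshold."""
    for lam in sorted(grid):
        v = acpca_direction(X, K, lam)
        if confounding_ratio(X, K, v) <= threshold:
            return lam, v
    return None, None

# Hypothetical toy data: donor offsets confound variable 0,
# while a shared regional pattern lives on variable 1.
rng = np.random.default_rng(0)
n_donors, n_regions, p = 4, 10, 6
donor = np.repeat(np.arange(n_donors), n_regions)
X = 0.05 * rng.standard_normal((n_donors * n_regions, p))
X[:, 0] += np.array([-3.0, -1.0, 1.0, 3.0])[donor]
X[:, 1] += np.tile(np.linspace(-1.0, 1.0, n_regions), n_donors)
X -= X.mean(axis=0)                        # X is centered by column

# Scaled donor indicators, centered by column, so that v^T X^T K X v is
# the between-donor sum of squares of the projection and R lies in [0, 1].
Y = np.eye(n_donors)[donor] / np.sqrt(n_regions)
Y -= Y.mean(axis=0)
K = Y @ Y.T

lam_star, v_star = choose_lambda(X, K, grid=[0.0, 0.25, 0.5, 0.75, 1.0, 2.0, 5.0])
```

In this toy design the unpenalized direction (λ = 0) is dominated by the donor offsets, so its ratio is close to 1. Once λ is large enough for the penalty to cancel the donor variance, the leading direction switches to the shared regional pattern and the ratio falls below the threshold. For kernels where R(λ) can exceed 1, the same loop could use the relative threshold 0.05R(λ = 0) instead.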
