Alignment and Integration of Spatial Transcriptomics Data
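Ron Zeira¹, Max Land¹, and Benjamin J. Raphael¹
¹Department of Computer Science, Princeton University, Princeton NJ 08544, USA

S1 Supplementary methods

S1.1 Proof of Theorem 1

Theorem 1. Let $\bar{X} = \sum_q \lambda_q X^{(q)} \Pi^{(q)T} \operatorname{diag}\left(\frac{1}{g}\right)$ and $X = WH$. We have
\[
S(W, H) = \sum_i g_i \, c(x_{\cdot i}, \bar{x}_{\cdot i}) + \tau
\]
where $c(u, v) = \|u - v\|^2$ or $c(u, v) = \mathrm{KL}(v \| u) = \sum_l \left( v_l \log \frac{v_l}{u_l} - v_l + u_l \right)$, and $\tau$ is a constant that does not depend on $W, H$.

Proof. We first prove the theorem for the Euclidean distance $c(u, v) = \|u - v\|^2$. We write the objective function explicitly and simplify it using $\sum_j \pi^{(q)}_{ij} = g_i$ and $\sum_q \lambda_q = 1$:
\[
\begin{aligned}
S(W, H) &= \sum_q \lambda_q \sum_i \sum_j \left\| x_{\cdot i} - x^{(q)}_{\cdot j} \right\|^2 \pi^{(q)}_{ij} \\
&= \sum_q \lambda_q \sum_i \sum_j \left( x_{\cdot i}^T x_{\cdot i} \pi^{(q)}_{ij} - 2 x_{\cdot i}^T x^{(q)}_{\cdot j} \pi^{(q)}_{ij} \right) + \beta \\
&= \sum_i x_{\cdot i}^T x_{\cdot i} \sum_q \lambda_q \sum_j \pi^{(q)}_{ij} - 2 \sum_i x_{\cdot i}^T \sum_q \lambda_q \sum_j x^{(q)}_{\cdot j} \pi^{(q)}_{ij} + \beta \\
&= \sum_i g_i \, x_{\cdot i}^T x_{\cdot i} - 2 \operatorname{Tr}\Big( X^T \sum_q \lambda_q X^{(q)} \Pi^{(q)T} \Big) + \beta \\
&= \operatorname{Tr}\big( X^T X \operatorname{diag}(g) \big) - 2 \operatorname{Tr}\Big( X^T \sum_q \lambda_q X^{(q)} \Pi^{(q)T} \operatorname{diag}\left(\tfrac{1}{g}\right) \operatorname{diag}(g) \Big) + \beta \\
&= \operatorname{Tr}\big( (X^T X - 2 X^T \bar{X}) \operatorname{diag}(g) \big) + \beta \\
&= \sum_i g_i \left\| x_{\cdot i} - \bar{x}_{\cdot i} \right\|^2 + \beta'
\end{aligned}
\]
where $\beta$ and $\beta'$ are constants that do not depend on $W, H$.

Next, we prove the theorem for the KL divergence $c(u, v) = \mathrm{KL}(v \| u) = \sum_l \left( v_l \log \frac{v_l}{u_l} - v_l + u_l \right)$. Again, we write the objective function explicitly and simplify it:
\[
\begin{aligned}
S(W, H) &= \sum_q \lambda_q \sum_i \sum_j \Big[ \sum_l \left( x_{li} - x^{(q)}_{lj} - x^{(q)}_{lj} \log(x_{li}) + x^{(q)}_{lj} \log(x^{(q)}_{lj}) \right) \Big] \pi^{(q)}_{ij} \\
&= \sum_q \lambda_q \sum_i \sum_j \Big[ \sum_l \left( x_{li} - x^{(q)}_{lj} \log(x_{li}) \right) \Big] \pi^{(q)}_{ij} + \gamma \\
&= \sum_i \sum_l \Big[ x_{li} \sum_q \lambda_q \sum_j \pi^{(q)}_{ij} - \log(x_{li}) \sum_j \sum_q \lambda_q x^{(q)}_{lj} \pi^{(q)}_{ij} \Big] + \gamma \\
&= \sum_i g_i \sum_l \Big[ x_{li} - \bar{x}_{li} \log(x_{li}) \Big] + \gamma \\
&= \sum_i g_i \, \mathrm{KL}(\bar{x}_{\cdot i} \| x_{\cdot i}) + \gamma'
\end{aligned}
\]
where $\gamma$ and $\gamma'$ are constants that do not depend on $W, H$.

S1.2 Finding optimal rotation for spatial coordinates

In this section, we seek a rotation and translation of the spatial coordinates of one layer that minimize the distances to the spatial coordinates of the other layer, given a mapping.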
The problem of finding a rotation and translation that minimize the distances between matched sets of points is a well-known problem in several research fields [6, 3]. In 2d the problem is often called Procrustes analysis, a more general linear algebra problem is called the Orthogonal Procrustes problem, and the vector weighted version is called Wahba's problem [6]. In chemistry/biology the solution to the 3d problem is called the Kabsch algorithm [3]. The 2d solution is based on finding the rotation angle, while the general case (which also works in 2d) looks for a rotation matrix and thus also supports reflection. Our problem is a variation of this problem since we have a probabilistic alignment between the spots given by the mapping $\Pi$.

Problem S1. Given ST layers with spatial coordinates $Z \in \mathbb{R}^{2 \times n}$ and $W \in \mathbb{R}^{2 \times n'}$ and a mapping $\Pi \in \Gamma(g, g')$, find a vector $t \in \mathbb{R}^2$ and a rotation matrix $R \in \mathbb{R}^{2 \times 2}$ that minimize:
\[
Q(t, R) = \sum_{i,j} \pi_{ij} \left\| z_{\cdot i} - R w_{\cdot j} - t \right\|^2. \tag{S1}
\]

We first show that we can assume that no translation is needed ($t = 0$) by centering the spatial coordinates $Z$ and $W$. Assuming $R$ is fixed, we can find the optimal translation by taking the derivative of $Q$ with respect to $t$ and setting it to zero:
\[
\begin{aligned}
\frac{\partial Q}{\partial t} &= -2 \sum_{i,j} \pi_{ij} \left( z_{\cdot i} - R w_{\cdot j} - t \right) \\
&= -2 \sum_i z_{\cdot i} \sum_j \pi_{ij} + 2 R \sum_j w_{\cdot j} \sum_i \pi_{ij} + 2 t \sum_{i,j} \pi_{ij} \\
&= -2 \sum_i z_{\cdot i} g_i + 2 R \sum_j w_{\cdot j} g'_j + 2 t = 0
\end{aligned}
\]
We have $\hat{t} = Zg - RWg'$. By replacing the spatial coordinates $z_{\cdot i}$ with $z_{\cdot i} - Zg$ and the spatial coordinates $w_{\cdot j}$ with $w_{\cdot j} - Wg'$ we get $Q = \sum_{i,j} \pi_{ij} \left\| z_{\cdot i} - R w_{\cdot j} \right\|^2$. Therefore, centering both sets of spatial coordinates removes the need to find a translation, and we are only left with finding the optimal rotation.

We rewrite the objective $Q$ in matrix notation, using $R^T R = I$:
\[
\begin{aligned}
Q &= \sum_{i,j} \pi_{ij} \left\| z_{\cdot i} - R w_{\cdot j} \right\|^2 \\
&= \sum_{i,j} \pi_{ij} \left( z_{\cdot i}^T z_{\cdot i} + w_{\cdot j}^T R^T R w_{\cdot j} - z_{\cdot i}^T R w_{\cdot j} - w_{\cdot j}^T R^T z_{\cdot i} \right) \\
&= -2 \sum_{i,j} \pi_{ij} \, z_{\cdot i}^T R w_{\cdot j} + \alpha \\
&= -2 \operatorname{Tr}(Z^T R W \Pi^T) + \alpha \\
&= -2 \operatorname{Tr}(R W \Pi^T Z^T) + \alpha
\end{aligned}
\]
where $\alpha$ is a constant independent of $R$.
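The centering argument and the trace form of the objective can be checked numerically. The following is a minimal sketch on synthetic data, not part of the PASTE implementation: the helper names `weighted_cost`, `rotation`, and `alpha` are hypothetical, the random inputs are assumptions for illustration, and the translation is written with the rotation applied to the second layer's centroid, $\hat{t} = Zg - RWg'$.

```python
import numpy as np

def weighted_cost(Z, W, Pi, R, t):
    """Q(t, R) = sum_ij pi_ij * ||z_i - R w_j - t||^2 for Z (2 x n), W (2 x n')."""
    diff = Z[:, :, None] - (R @ W)[:, None, :] - t[:, None, None]  # shape (2, n, n')
    return float(np.sum(Pi * np.sum(diff ** 2, axis=0)))

def rotation(theta):
    """2D rotation matrix for angle theta (radians)."""
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

# Synthetic inputs (assumptions for illustration only).
rng = np.random.default_rng(0)
Z = rng.normal(size=(2, 6))
W = rng.normal(size=(2, 5))
Pi = rng.random((6, 5))
Pi /= Pi.sum()                       # mapping with total mass 1
g, g_prime = Pi.sum(axis=1), Pi.sum(axis=0)

R = rotation(0.7)
t_hat = Z @ g - R @ (W @ g_prime)    # optimal translation for this fixed R

# Centering both layers reproduces Q(t_hat, R) with t = 0.
Zc = Z - (Z @ g)[:, None]
Wc = W - (W @ g_prime)[:, None]
q_translated = weighted_cost(Z, W, Pi, R, t_hat)
q_centered = weighted_cost(Zc, Wc, Pi, R, np.zeros(2))

# Any other translation gives a larger cost for the same R.
q_perturbed = weighted_cost(Z, W, Pi, R, t_hat + np.array([0.1, -0.2]))

def alpha(theta):
    """Q + 2 Tr(R W Pi^T Z^T) on centered data; should not depend on theta."""
    Rt = rotation(theta)
    trace_term = np.trace(Rt @ Wc @ Pi.T @ Zc.T)
    return weighted_cost(Zc, Wc, Pi, Rt, np.zeros(2)) + 2 * trace_term

alphas = [alpha(th) for th in (0.0, 0.7, 2.0)]
print(q_translated, q_centered, q_perturbed, alphas)
```

The two costs `q_translated` and `q_centered` should agree up to floating-point error, and the entries of `alphas` should be equal, which is what makes the trace term the only part of $Q$ that depends on $R$.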
We find the optimal rotation $R$ that minimizes $Q$ using SVD, similar to the solution to Wahba's problem [4]. Let $U \Sigma V^T$ be the SVD of $W \Pi^T Z^T$. We have
\[
Q = -2 \operatorname{Tr}(R U \Sigma V^T) + \alpha = -2 \operatorname{Tr}(\Sigma V^T R U) + \alpha
\]
Notice that $\Sigma$ is a nonnegative diagonal matrix and $V^T R U$ is an orthonormal matrix. Therefore, the objective $Q$ is minimized when $\operatorname{Tr}(\Sigma V^T R U)$ is maximal, which is attained when $V^T R U = I$. We have $R = V U^T$. We note that $R$ may also perform a reflection in addition to a rotation.

An alternative derivation for the 2d case is similar to Procrustes analysis. We write the rotation matrix as a function of the rotation angle $\theta$:
\[
R(\theta) = \begin{pmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{pmatrix}
\]
Taking the derivative of $Q$ with respect to $\theta$ and setting it to zero gives:
\[
\begin{aligned}
\frac{\partial Q}{\partial \theta} &= -2 \operatorname{Tr}\left( \frac{\partial R(\theta)}{\partial \theta} W \Pi^T Z^T \right) \\
&= -2 \operatorname{Tr}\left( \begin{pmatrix} -\sin(\theta) & -\cos(\theta) \\ \cos(\theta) & -\sin(\theta) \end{pmatrix} W \Pi^T Z^T \right) = 0
\end{aligned}
\]
Dividing by $\cos(\theta)$ and solving for $\theta$ we have:
\[
\hat{\theta} = \arctan \left( \frac{\operatorname{Tr}\left( \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix} W \Pi^T Z^T \right)}{\operatorname{Tr}(W \Pi^T Z^T)} \right)
\]

S2 Supplementary results

S2.1 Comparison of PASTE to Scanorama on ST alignment simulation

We compared our results to Scanorama [1], a single-cell RNA-seq (scRNA-seq) integration method. Scanorama integrates gene expression information by resolving noise and batch effects between two or more datasets. Scanorama is not designed to align cells from RNA-seq, though it does rely on inferring nearest neighbors between cells in the given datasets. To directly compare Scanorama with PASTE, we calculated an alignment between spots of the different layers by finding a mapping that minimizes the Wasserstein optimal transport distance, where the transportation cost between the spots is taken as the Euclidean distance between the spots in the integrated gene expression datasets from Scanorama. We see that PASTE outperforms alignment based on gene expression corrected by Scanorama (Figure S3). In fact, Scanorama performs slightly worse than our pairwise alignment on the original gene expression data alone.

S2.2 Spatial entropy definition

The spatial entropy is computed as follows.
Given a graph with vertex labels (e.g., cluster labels), the spatial entropy is the Shannon entropy of the distribution of the unordered pairs of cluster labels on the edges of the graph. Specifically, let $G = (V, E)$ be a graph where $V$ is the set of spots and where there is an edge $(i, k) \in E$ between every pair $(i, k)$ of spots adjacent on the array. Let $K = \{1, 2, \ldots, k\}$ be a set of $k$ cluster labels and let $\ell : V \to K$ be the spot cluster assignment. We define a categorical variable $C = \{\{a, b\} : (a, b) \in K \times K\}$ which describes every distinct unordered pair of cluster labels. The spatial entropy is calculated as
\[
H(G) = H(C \mid E) = -\sum_{c \in C} P(c \mid E) \log P(c \mid E), \quad \text{where } P(c \mid E) = \frac{c}{|E|},
\]
with the numerator denoting the number of edges in $E$ whose endpoint labels form the pair $c$. A low value of spatial entropy indicates that many adjacent spots have the same cluster label, while a large spatial entropy indicates that many adjacent spots have different cluster labels.

S2.3 Supplementary plots

[Figure: four panels labeled Layer 1, Layer 2, Layer 3, Layer 4]
Figure S1: Spatial organization of breast cancer ST layers from [5].

[Figure: line plot; y-axis "% of Spots Correctly Aligned", x-axis "level"; series: Mixed, Gene Exp Only, Spatial Only]
Figure S2: PASTE results for pairwise alignment of a simulated ST layer with layer 1 of the breast cancer dataset [5] with varying levels of α. The coverage variability factor for the simulated ST layer was set at η = 10.

Figure S3: PASTE results on pairwise alignment of simulated ST layers based on four layers of the breast cancer dataset [5]. Each value is an average over 10 simulations.

[Figure: four panels labeled (a) Original, (b) Rotation π/6, (c) Rotation π/3, (d) Rotation 2π/3]
Figure S4: Spatial organization of spots used in the center layer alignment simulation of layer 2 from the breast cancer dataset [5]. (a) Original spatial organization of spots in layer 2 of the breast cancer dataset. (b)-(d) Simulated spatial structures obtained by rotating (a) by π/6, π/3, and 2π/3, respectively.
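As an illustration of how a rotation such as the π/6 case in Figure S4 could be recovered with the SVD solution of Section S1.2, here is a minimal sketch. It is not the PASTE implementation: the function name `optimal_rotation` is hypothetical, the points are synthetic, and the uniform one-to-one mapping `Pi` is an assumption chosen so that the true rotation is known.

```python
import numpy as np

def optimal_rotation(Z, W, Pi):
    """Return (R, t) minimizing sum_ij pi_ij ||z_i - R w_j - t||^2 over orthogonal R.

    Z is 2 x n, W is 2 x n', Pi is n x n'. As noted in Section S1.2,
    the returned R may include a reflection.
    """
    total = Pi.sum()                      # equals 1 for a probabilistic mapping
    z_mean = Z @ Pi.sum(axis=1) / total   # weighted centroid of the first layer
    w_mean = W @ Pi.sum(axis=0) / total   # weighted centroid of the second layer
    Zc = Z - z_mean[:, None]
    Wc = W - w_mean[:, None]
    # SVD of W Pi^T Z^T = U Sigma V^T; the minimizer is R = V U^T.
    U, _, Vt = np.linalg.svd(Wc @ Pi.T @ Zc.T)
    R = Vt.T @ U.T
    t = z_mean - R @ w_mean
    return R, t

# Synthetic example: rotate random points by pi/6 and shift them.
theta = np.pi / 6
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
t_true = np.array([3.0, -1.0])
rng = np.random.default_rng(1)
W = rng.normal(size=(2, 50))
Z = R_true @ W + t_true[:, None]
Pi = np.eye(50) / 50                      # assumed one-to-one uniform mapping

R_hat, t_hat = optimal_rotation(Z, W, Pi)
theta_hat = np.arctan2(R_hat[1, 0], R_hat[0, 0])
print(theta_hat, t_hat)
```

With an exact one-to-one mapping and noise-free coordinates, `R_hat` and `t_hat` should match the rotation and translation used to generate `Z`. With a diffuse mapping the same code applies, but the recovered transform is only a weighted least-squares fit.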
Figure S5: PASTE results on center layer integration of simulated ST layers based on four layers of the breast cancer dataset [5].
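The spatial entropy of Section S2.2 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the function name `spatial_entropy` is hypothetical, edges are given as pairs of spot indices, and the natural logarithm is assumed because the text does not specify a base.

```python
import math
from collections import Counter

def spatial_entropy(edges, labels):
    """Shannon entropy of unordered cluster-label pairs over the edges of the spot graph.

    edges: list of (i, k) pairs of adjacent spots.
    labels: mapping (list or dict) from spot index to cluster label.
    """
    # Sorting each pair makes (a, b) and (b, a) count as the same unordered pair.
    pair_counts = Counter(tuple(sorted((labels[i], labels[k]))) for i, k in edges)
    total = len(edges)
    return -sum((n / total) * math.log(n / total) for n in pair_counts.values())

# Four spots on a path: 0 - 1 - 2 - 3.
edges = [(0, 1), (1, 2), (2, 3)]
h_uniform = spatial_entropy(edges, [0, 0, 0, 0])   # every edge has the pair (0, 0)
h_mixed = spatial_entropy(edges, [0, 0, 1, 1])     # pairs (0, 0), (0, 1), (1, 1)
print(h_uniform, h_mixed)
```

In this toy example the uniformly labeled path has zero entropy, while the path with three distinct label pairs has entropy log 3, in line with the interpretation that lower values indicate more spatially coherent clusters.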