Multi-Omics Integration and Regulatory Inference for Unpaired Single-Cell

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.22.457275; this version posted August 24, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-ND 4.0 International license.

Multi-omics integration and regulatory inference for unpaired single-cell data with a graph-linked unified embedding framework

Zhi-Jie Cao1, Ge Gao1,*

1 Biomedical Pioneering Innovation Center (BIOPIC), Beijing Advanced Innovation Center for Genomics (ICG), Center for Bioinformatics (CBI), and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing, 100871, China

* To whom correspondence should be addressed. Tel: +86-010-62755206; Email: [email protected]


Abstract

With the ever-increasing amount of single-cell multi-omics data accumulated during the past years, effective and efficient computational integration is becoming a serious challenge. One major obstacle of unpaired multi-omics integration is the feature discrepancies among omics layers. Here, we propose a computational framework called GLUE (graph-linked unified embedding), which utilizes accessible prior knowledge about regulatory interactions to bridge the gaps between feature spaces. Systematic benchmarks demonstrated that GLUE is accurate, robust and scalable. We further employed GLUE for various challenging tasks, including triple-omics integration, model-based regulatory inference and multi-omics human cell atlas construction (over millions of cells) and found that GLUE achieved superior performance for each task. As a generalizable framework, GLUE features a modular design that can be flexibly extended and enhanced for new analysis tasks. The full package is available online at https://github.com/gao-lab/GLUE for the community.


Introduction

Recent technological advances in single-cell sequencing have enabled the probing of regulatory maps through multiple omics layers, such as chromatin accessibility (scATAC-seq [1, 2]), DNA methylation (snmC-seq [3], sci-MET [4]) and the transcriptome (scRNA-seq [5, 6]), offering a unique opportunity to unveil the underlying regulatory bases for the functionalities of diverse cell types [7]. While simultaneous assays have recently emerged [8-11], different omics are usually measured independently and produce unpaired data, which calls for effective and efficient in silico multi-omics integration [12, 13].

Computationally, one major obstacle faced when integrating unpaired multi-omics data is the distinct feature spaces of different modalities (e.g., accessible chromatin regions in scATAC-seq vs. genes in scRNA-seq) [14]. A quick fix is to convert multimodality data into one common feature space based on prior information and apply single-omics data integration methods [15, 16]. Such explicit “feature conversion” is straightforward, but has been reported to result in significant information loss [17]. Algorithms based on coupled matrix factorization circumvent explicit conversion but hardly handle more than two omics layers [18, 19]. An alternative option is to match cells from different omics layers via nonlinear manifold alignment, which removes the requirement of prior knowledge completely and could reduce inter-modality information loss in theory [20, 21]; however, this technique has mostly been applied to continuous, trajectory-like manifolds rather than atlases.

The ever-increasing volume of data is another serious challenge [22]. Recently developed technologies can routinely generate datasets at the scale of millions of cells [23-25], whereas current integration methods have only been applied to datasets with much smaller volumes [15, 16, 18-21]. To catch up with the growth in data throughput, computational integration methods should be designed with scalability in mind.

Here, we introduce GLUE (graph-linked unified embedding), a modular framework for integrating unpaired single-cell multi-omics data and inferring regulatory interactions simultaneously. By modeling the regulatory interactions across omics layers explicitly, GLUE bridges the gaps between various omics-specific feature spaces in a biologically intuitive manner. Systematic benchmarks and case studies demonstrate that GLUE is accurate, robust and scalable for heterogeneous single-cell multi-omics data. Furthermore, GLUE is designed as a generalizable framework that allows for easy extension and quick adoption to particular scenarios in a modular manner. GLUE is publicly accessible at https://github.com/gao-lab/GLUE.

Results

Integrating unpaired single-cell multi-omics data via graph-guided embeddings

Inspired by previous works, we model cell states as low-dimensional cell embeddings learned through variational autoencoders [26, 27]. Given their intrinsic differences in biological nature and assay technology, each omics layer is equipped with a separate autoencoder that uses a probabilistic generative model tailored to the layer-specific feature space (Fig. 1, Methods).

[Figure 1 schematic omitted. For each omics layer k (scRNA-seq, scATAC-seq, snmC-seq), an encoder maps the data X_k (N_k cells by |V_k| features) to cell embeddings U_k of shared dimension m, and a decoder reconstructs X_k from U_k V_k^T. The knowledge-based guidance graph G = (V, E) is reconstructed from V V^T, and a discriminator acts on the cell embeddings.]

Fig. 1 Architecture of the GLUE framework.
GLUE employs omics-specific variational autoencoders to learn low-dimensional cell embeddings from each omics layer. The data dimensionality and generative distribution can differ across omics layers, but the cell embedding dimensions are shared. A graph variational autoencoder is used to learn feature embeddings from the prior knowledge-based guidance graph; these embeddings are then used as data decoder parameters. The feature embeddings effectively link the omics-specific autoencoders to ensure a consistent embedding orientation. Last, an omics discriminator is employed to align the cell embeddings of different omics layers via adversarial learning.
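The coupling described in the caption can be illustrated with a small shape-level sketch. It assumes plain inner-product decoders, as suggested by the U V^T labels in Fig. 1, and uses random matrices in place of trained encoders; the layer names, sizes and embedding dimension are hypothetical, and the generative distributions and the discriminator are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 4                                  # shared embedding dimension (hypothetical value)
n_cells = {"rna": 5, "atac": 6}        # N_k: number of cells in each layer
n_feats = {"rna": 3, "atac": 7}        # |V_k|: number of features in each layer

# Cell embeddings U_k (N_k x m); in GLUE these come from layer-specific encoders.
U = {k: rng.normal(size=(n, m)) for k, n in n_cells.items()}
# Feature embeddings V_k (|V_k| x m); in GLUE these come from the graph encoder.
V = {k: rng.normal(size=(n, m)) for k, n in n_feats.items()}

# Each data decoder scores cell-feature pairs by an inner product, U_k V_k^T.
recon = {k: U[k] @ V[k].T for k in U}

# The graph decoder scores feature-feature pairs by V V^T over all vertices.
V_all = np.vstack([V["rna"], V["atac"]])
graph_logits = V_all @ V_all.T
```

Because every decoder reads from the same feature embeddings, features that are linked in the graph pull the layer-specific cell embeddings toward a consistent orientation.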


Taking advantage of prior biological knowledge, we propose the use of a knowledge-based graph (“guidance graph”) that explicitly models cross-layer regulatory interactions for linking layer-specific feature spaces; the vertices in the graph correspond to the features of different omics layers, and edges represent signed regulatory interactions. For example, when integrating scRNA-seq and scATAC-seq data, the vertices are genes and accessible chromatin regions (i.e., ATAC peaks), and a positive edge can be connected between an accessible region and its putative downstream gene. Then, adversarial multimodal alignment is performed as an iterative optimization procedure, guided by feature embeddings encoded from the graph [28] (Fig. 1, Methods). Notably, when the iterative process converges, the graph can be refined with inputs from the alignment procedure and used for data-oriented regulatory inference (see below for more details).
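As a minimal sketch of such a graph, the snippet below links peaks to genes with positive edges using a simple interval-overlap rule. The rule, the `Interval` type and the function names are assumptions made for illustration (gene intervals are taken to already include any promoter extension, and strand is ignored); they are not taken from the GLUE package.

```python
from typing import List, NamedTuple, Tuple


class Interval(NamedTuple):
    name: str
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open intervals on the same chromosome share a base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def build_guidance_graph(
    genes: List[Interval], peaks: List[Interval], sign: int = 1
) -> List[Tuple[str, str, int]]:
    """Return signed (peak, gene, sign) edges for every overlapping pair."""
    return [
        (peak.name, gene.name, sign)
        for gene in genes
        for peak in peaks
        if overlaps(gene, peak)
    ]


genes = [Interval("geneA", "chr1", 1000, 5000)]
peaks = [
    Interval("peak1", "chr1", 4500, 5200),  # overlaps geneA
    Interval("peak2", "chr1", 9000, 9500),  # same chromosome, no overlap
    Interval("peak3", "chr2", 1200, 1500),  # different chromosome
]
edges = build_guidance_graph(genes, peaks)
```

The quadratic loop is adequate for a toy example; a genome-scale graph would need an interval index instead.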

Systematic benchmarks demonstrate superior alignment accuracy and robustness over existing methods

We first benchmarked GLUE against multiple popular unpaired multi-omics integration methods [15, 16, 20, 29] using gold-standard datasets generated by recent simultaneous scRNA-seq and scATAC-seq technologies [8, 9, 30].

At the cell type level, an integration method should match the corresponding cell types from different omics layers, producing cell embeddings where the cell types are clearly distinguishable and the omics layers are well mixed. Compared to other methods, GLUE achieved the highest overall cell type resolution (as quantified by mean average precision) and layer mixing (as quantified by the Seurat alignment score [31]) simultaneously (Fig. 2a); these results were also validated by UMAP visualization of the aligned cell embeddings (Supplementary Fig. 1-3).

An optimal integration method should produce accurate alignments not only at the cell type level but also at finer scales. Exploiting the ground truth cell-to-cell correspondence between scRNA-seq and scATAC-seq, we further quantified single-cell level alignment error via the FOSCTTM (fraction of samples closer than the true match) metric [32]. On all three datasets, GLUE achieved the lowest FOSCTTM, decreasing the alignment error by large margins compared to the second-best method on each dataset (Fig. 2b; the decreases were 3.6-fold for SNARE-seq, 1.74-fold for SHARE-seq, and 1.4-fold for 10x Multiome).
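To make the FOSCTTM metric concrete, the sketch below shows one common formulation. It assumes Euclidean distance in the shared embedding, averages over both directions, and normalizes by the total number of cells; the exact variant used in the benchmarks may differ, and the function name is illustrative.

```python
import numpy as np


def foscttm(x: np.ndarray, y: np.ndarray) -> float:
    """Fraction of samples closer than the true match, averaged over both layers.

    x, y: (n_cells, dim) embeddings; row i of x and row i of y are the same cell.
    """
    # Pairwise squared Euclidean distances between the two layers.
    d = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    true = np.diag(d)                          # distance to the true match
    frac_x = (d < true[:, None]).mean(axis=1)  # per x cell: y cells closer than its match
    frac_y = (d < true[None, :]).mean(axis=0)  # per y cell: x cells closer than its match
    return float((frac_x.mean() + frac_y.mean()) / 2)


# Toy check: the second layer lists the same three points in reversed order.
x = np.array([[0.0], [1.0], [2.0]])
y = x[::-1].copy()
score_perfect = foscttm(x, x)     # every cell sits on its true match
score_reversed = foscttm(x, y)    # the two outer cells are each beaten by two others
```

A value of 0 means every cell is nearest to its true counterpart, and values near 0.5 correspond to a random alignment. The dense distance matrix limits this sketch to modest cell numbers.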


[Figure 2 plots omitted. All panels cover the SNARE-seq, SHARE-seq and 10x Multiome datasets; the methods compared are UnionCom, LIGER, bindSC, Seurat v3 and GLUE. Axes: a, mean average precision vs. Seurat alignment score; b, FOSCTTM per dataset; c, increase in FOSCTTM vs. corruption rate; d, FOSCTTM vs. subsample size.]

Fig. 2 Systematic benchmarks on gold-standard datasets.
a, Cell type resolution (quantified by mean average precision) vs. omics layer mixing (quantified by Seurat alignment score) for different integration methods. b, Single-cell level alignment error (quantified by FOSCTTM) of different integration methods. UnionCom failed to run on the SHARE-seq dataset due to memory overflow. c, Increases in FOSCTTM at different prior knowledge corruption rates for integration methods that rely on prior feature interactions. d, FOSCTTM values of different integration methods on subsampled datasets. The error bars indicate mean ± s.d.

During the evaluation described above, we adopted a standard schema (ATAC peaks were linked to RNA genes if they overlapped in the gene body or proximal promoter regions) to construct the guidance graph for GLUE and to perform feature conversion for other conversion-based methods. Given that our current knowledge about the regulatory interactions is still far from perfect, a useful integration method must be robust to such inaccuracies. Thus, we further assessed the methods’ robustness to corruption of regulatory interactions by randomly replacing varying fractions of existing interactions with nonexistent ones. For all three datasets, GLUE exhibited the smallest performance changes even at high corruption rates (Fig. 2c), suggesting its superior robustness.

Given its neural network-based nature, GLUE may suffer from undertraining when working with small datasets. Thus, we repeated the evaluations using subsampled datasets of various sizes. GLUE remained the top-ranking method with as few as 2,000 cells, but the alignment error increased more steeply when the data volume decreased to less than 1,000 cells (Fig. 2d). Additionally, we also noted that the performance of GLUE was robust for a wide range of hyperparameter settings (Supplementary Fig. 4).
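The corruption procedure can be sketched as follows. This is one plausible reading of "randomly replacing varying fractions of existing interactions with nonexistent ones": edges are treated as unsigned (peak, gene) pairs and replacements are drawn uniformly from all absent pairs. The function name and sampling details are assumptions, not the benchmark code.

```python
import random
from typing import List, Sequence, Tuple

Edge = Tuple[str, str]


def corrupt_edges(
    edges: Sequence[Edge],
    sources: Sequence[str],
    targets: Sequence[str],
    rate: float,
    seed: int = 0,
) -> List[Edge]:
    """Replace a fraction `rate` of edges with random pairs absent from `edges`."""
    rng = random.Random(seed)
    edges = list(edges)
    existing = set(edges)
    n_replace = round(rate * len(edges))
    if len(sources) * len(targets) - len(existing) < n_replace:
        raise ValueError("not enough nonexistent pairs to sample from")
    # Choose which existing edges to drop.
    drop = set(rng.sample(range(len(edges)), n_replace))
    kept = [e for i, e in enumerate(edges) if i not in drop]
    # Rejection-sample the same number of nonexistent pairs.
    added = set()
    while len(added) < n_replace:
        cand = (rng.choice(sources), rng.choice(targets))
        if cand not in existing:
            added.add(cand)
    return kept + sorted(added)


peaks = ["p1", "p2", "p3", "p4"]
genes = ["g1", "g2", "g3", "g4"]
true_edges = [("p1", "g1"), ("p2", "g2"), ("p3", "g3"), ("p4", "g4")]
corrupted = corrupt_edges(true_edges, peaks, genes, rate=0.5, seed=1)
```

Keeping the number of edges fixed means that changes in alignment error reflect the quality of the prior rather than the density of the graph.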


GLUE enables effective triple-omics integration

Benefitting from a modular design and scalable adversarial alignment, GLUE readily extends to more than two omics layers. As a case study, we used GLUE to integrate three distinct omics layers of neuronal cells in the adult mouse cortex, including gene expression [33], chromatin accessibility [34], and DNA methylation [3].

Unlike chromatin accessibility, gene body DNA methylation generally shows a negative correlation with gene expression in neuronal cells [35]. GLUE natively supports the mixture of regulatory effects by modeling edge signs in the guidance graph. Such a strategy avoids data inversion, which is required by previous methods [16] and can break data sparsity and the underlying distribution. For the triple-omics guidance graph, we linked gene body mCH and mCG levels to genes via negative edges, while the positive edges between accessible regions and genes remained the same.

The GLUE alignment successfully revealed a shared manifold of cell states across the three omics layers (Fig. 3a-d). We observed highly significant marker overlap (Fig. 3e, three-way Fisher’s exact test [36], FDR < 2×10^-15) for 12 out of the 14 mapped cell types (Supplementary Fig. 5, 6, Methods), indicating reliable alignment. Interestingly, we found that GLUE alignment helped improve the effects of cell typing in all omics layers, including the further partitioning of the scRNA-seq “MGE” cluster into Pvalb+ (“mPv”) and Sst+ (“mSst”) subtypes (highlighted with green circles/flows in Fig. 3, Supplementary Fig. 5), the partitioning of the scRNA-seq “CGE” cluster and scATAC-seq “Vip” cluster into Vip+ (“mVip”) and Ndnf+ (“mNdnf”) subtypes (highlighted with dark blue circles/flows in Fig. 3, Supplementary Fig. 5), and the identification of snmC-seq “mDL-3” cells and a subset of scATAC-seq “L6 IT” cells as claustrum cells (highlighted with light blue circles/flows in Fig. 3, Supplementary Fig. 5).
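The role of edge signs can be illustrated with a small sketch. It assumes a logistic inner-product graph decoder in which the sign multiplies the inner product of the two feature embeddings; this functional form is an assumption for illustration and is not stated in the text above.

```python
import numpy as np


def edge_log_likelihood(v_i: np.ndarray, v_j: np.ndarray, sign: int) -> float:
    """Log-probability of a signed edge under an assumed logistic decoder.

    A positive edge rewards aligned embeddings; a negative edge (for example,
    gene body methylation to gene) rewards anti-aligned embeddings.
    """
    logit = sign * float(np.dot(v_i, v_j))
    return float(-np.log1p(np.exp(-logit)))


gene = np.array([1.0, 0.0])
aligned = np.array([1.0, 0.0])
opposed = np.array([-1.0, 0.0])

ll_negative_opposed = edge_log_likelihood(opposed, gene, sign=-1)
ll_negative_aligned = edge_log_likelihood(aligned, gene, sign=-1)
```

Under this form the methylation data can stay on their original scale, because the negative relationship is carried by the graph rather than by inverting the measurements.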


[Figure 3 plots omitted. Panels a-d are UMAP1 vs. UMAP2 scatter plots; panel e plots -log10 FDR per cell type; panel f plots gene expression R^2 for mCH, mCG, ATAC and the combined layers.]

Fig. 3 Triple-omics integration of the mouse cortex.
a-c, UMAP visualizations of the integrated cell embeddings for a, scRNA-seq, b, snmC-seq, and c, scATAC-seq, colored by the original cell types. Cells aligning with “mPv” and “mSst” are highlighted with green circles. Cells aligning with “mNdnf” and “mVip” are highlighted with dark blue circles. Cells aligning with “mDL-3” are highlighted with light blue circles. d, UMAP visualizations of the integrated cell embeddings for all cells, colored by omics layers. e, Significance of marker gene overlap for each cell type across all three omics layers (three-way Fisher’s exact test [36]). The dashed vertical line indicates FDR = 0.01. We observed highly significant marker overlap (FDR < 2×10^-15) for 12 out of the 14 cell types, indicating reliable alignment. For the remaining 2 cell types, “mDL-1” had marginally significant marker overlap with FDR = 0.001, while the “mIn-1” cells in snmC-seq did not properly align with the scRNA-seq or scATAC-seq cells. f, Coefficient of determination (R^2) for predicting gene expression based on each epigenetic layer as well as the combination of all layers. The box plots indicate the medians (centerlines), means (triangles), 1st and 3rd quartiles (hinges), and minima and maxima (whiskers).

Such triple-omics integration also sheds light on the quantitative contributions of different epigenetic regulation mechanisms (Methods). Among mCH, mCG and chromatin accessibility, we found that the mCH level had the highest predictive power for gene expression in cortical neurons (average R^2 = 0.188). When all epigenetic layers were considered, the expression predictability increased further (average R^2 = 0.238), suggesting the presence of nonredundant contributions (Fig. 3f). Among the neurons of different layers, DNA methylation (especially mCH) exhibited slightly higher predictability for gene expression in deeper layers than in superficial layers, whereas the reverse situation held for chromatin accessibility (Supplementary Fig. 7a). Across all genes, the predictability of gene expression was generally correlated among the different epigenetic layers (Supplementary Fig. 7b). We also observed varying associations with gene characteristics. For example, mCH had higher expression predictability for longer genes, which was consistent with previous studies [16, 37], while chromatin accessibility contributed more to genes with higher expression variability (Supplementary Fig. 7c).
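The comparison between single-layer and combined predictability can be sketched with an ordinary least-squares fit. The regression model, the in-sample evaluation and the synthetic data below are assumptions for illustration; the actual procedure is described in Methods and may differ.

```python
import numpy as np


def r_squared(features: np.ndarray, target: np.ndarray) -> float:
    """In-sample R^2 of an ordinary least-squares fit with an intercept."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((target - target.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


# Synthetic example for one gene across 200 aligned cell groups (hypothetical data):
# expression falls with methylation and rises with accessibility.
rng = np.random.default_rng(0)
mch = rng.normal(size=200)
atac = rng.normal(size=200)
expr = -1.0 * mch + 0.5 * atac + rng.normal(scale=1.0, size=200)

r2_mch = r_squared(mch, expr)
r2_atac = r_squared(atac, expr)
r2_all = r_squared(np.column_stack([mch, atac]), expr)
```

With nested least-squares models the in-sample R^2 of the combined fit can never fall below that of a single layer, so held-out evaluation would be needed before reading any gain as a nonredundant contribution.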


Model-based regulatory inference with GLUE

The incorporation of a graph explicitly modeling regulatory interactions in GLUE further enables a Bayesian-like approach that combines prior knowledge and observed data for posterior regulatory inference. Specifically, since the feature embeddings are designed to reconstruct the knowledge-based guidance graph and single-cell multi-omics data simultaneously (Fig. 1), their cosine similarities should reflect information from both aspects, which we adopt as “regulatory scores”.

As a demonstration, we employed the official PBMC (peripheral blood mononuclear cell) Multiome dataset from 10x [30] and fed it to GLUE as unpaired scRNA-seq and scATAC-seq data (Methods). To capture remote cis-regulatory interactions, we employed a long-range guidance graph connecting ATAC peaks and RNA genes within 150 kb windows weighted by a power-law function that models chromatin contact probability [38, 39]. Visualization of cell embeddings confirmed that the GLUE alignment was correct and accurate (Supplementary Fig. 8a, b). As expected, we found that the regulatory score was negatively correlated with genomic distance (Fig. 4a) and positively correlated with the empirical peak-gene correlation (computed with paired cells, Fig. 4b), with robustness across different random seeds (Supplementary Fig. 8c).

To further assess whether the score reflected actual cis-regulatory interactions, we compared it with external evidence, including pcHi-C [40] and eQTL [41]. The GLUE regulatory score was higher for pcHi-C-supported peak-gene pairs in all distance ranges (Fig. 4a) and was a better predictor of pcHi-C interactions than empirical peak-gene correlations (Fig. 4b, c), as well as Cicero [39], the coaccessibility-based regulatory prediction method (Fig. 4c). The same held for eQTL (Supplementary Fig. 8d-f).

The GLUE framework also allows additional regulatory evidence, such as pcHi-C, to be incorporated intuitively via the guidance graph. Thus, we further trained new models with a composite guidance graph containing distance-weighted interactions as well as pcHi-C- and eQTL-supported interactions (Supplementary Fig. 9). While the multi-omics alignment was insensitive to these changes, the GLUE-derived TF-target gene network (Methods) showed more significant agreement with manually curated connections in the TRRUST v2 database [42] than individual evidence-based networks (Supplementary Fig. 9e, Supplementary Fig. 10, Supplementary Table 3).
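Two quantities in this section lend themselves to a short sketch: the regulatory score, defined above as the cosine similarity between feature embeddings, and a power-law edge weight that decays with genomic distance. The cosine computation follows the text directly; the scale and exponent of the power law are placeholder values chosen for illustration, since the exact function is given in Methods rather than here.

```python
import numpy as np


def regulatory_score(peak_emb: np.ndarray, gene_emb: np.ndarray) -> np.ndarray:
    """Cosine similarity between every peak embedding and every gene embedding."""
    p = peak_emb / np.linalg.norm(peak_emb, axis=1, keepdims=True)
    g = gene_emb / np.linalg.norm(gene_emb, axis=1, keepdims=True)
    return p @ g.T  # shape: (n_peaks, n_genes), values in [-1, 1]


def distance_weight(dist_bp, scale_bp: float = 1000.0, exponent: float = -0.75):
    """Power-law decay with genomic distance (scale and exponent are assumed values)."""
    dist = np.asarray(dist_bp, dtype=float)
    return ((dist + scale_bp) / scale_bp) ** exponent


peak_emb = np.array([[1.0, 0.0]])
gene_emb = np.array([[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]])
scores = regulatory_score(peak_emb, gene_emb)
weights = distance_weight([0, 1_000, 150_000])
```

The weight equals 1 at zero distance and decreases monotonically, so distal peaks enter the prior with less support and must earn a high regulatory score from the data.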


[Figure 4 plots omitted. Panel a: GLUE regulatory score by genomic distance bin (0-25 kb to 125-150 kb), grouped by pcHi-C support. Panel b: GLUE regulatory score vs. Spearman correlation, colored by pcHi-C support. Panel c: ROC curves (TPR vs. FPR) for pcHi-C prediction, with Cicero AUC = 0.548, Spearman AUC = 0.555 and GLUE AUC = 0.636. Panels d, e: genome tracks around NCF2 (chr1) and CD83 (chr6) showing GLUE, pcHi-C, eQTL, ATAC, TF ChIP and gene annotation tracks.]

Fig. 4 Model-based regulatory inference in PBMC.
a, GLUE regulatory scores for peak-gene pairs across different genomic ranges, grouped by whether they had pcHi-C support. The box plots indicate the medians (centerlines), means (triangles), 1st and 3rd quartiles (hinges), and minima and maxima (whiskers). b, Comparison between the GLUE regulatory scores and the empirical peak-gene correlations computed on paired cells. Peak-gene pairs are colored by whether they had pcHi-C support. c, ROC (receiver operating characteristic) curves for predicting pcHi-C interactions based on different peak-gene association scores. d, e, GLUE-identified cis-regulatory interactions for d, NCF2, and e, CD83, along with individual regulatory evidence. SPI1 (highlighted with a green box) is a known regulator of NCF2.
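The AUC values summarizing the ROC curves in panel c can be computed from scores and binary labels alone. The sketch below uses the rank-based (Mann-Whitney) formulation, which equals the area under the ROC curve; the function name and toy inputs are illustrative and unrelated to the PBMC data.

```python
import numpy as np


def roc_auc(scores, labels) -> float:
    """Area under the ROC curve via the rank-based (Mann-Whitney) formulation."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    # Fraction of (positive, negative) pairs ranked correctly; ties count half.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))


# Toy example: 1 marks a peak-gene pair with external support (hypothetical labels).
auc_perfect = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
auc_partial = roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

An AUC of 0.5 corresponds to a score that ranks supported and unsupported pairs at random, which is the baseline against which the reported values should be read.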

Interestingly, we noticed that the GLUE-inferred cis-regulatory interactions could provide new hints about the regulatory mechanisms of known TF-target pairs. For example, SPI1 is a known regulator of the NCF2 gene, and both are highly expressed in monocytes (Supplementary Fig. 11a, b). GLUE identified three remote regulatory peaks for NCF2 with various pieces of evidence, i.e., ~120 kb downstream, ~25 kb downstream, and ~20 kb upstream from the TSS (transcription start site) (Fig. 4d), all of which were bound by SPI1. Meanwhile, most putative regulatory interactions were previously unknown. For example, CD83 was linked with two regulatory peaks (~25 kb upstream from the TSS), which were enriched for the binding of three TFs (BCL11A, PAX5, and RELB; Fig. 4e). While CD83 was highly expressed in both monocytes and B cells, the inferred TFs showed more constrained expression patterns (Supplementary Fig. 11c-f), suggesting that its active regulators might differ per cell type. Supplementary Fig. 12 shows more examples of GLUE-inferred regulatory interactions.


Atlas-scale integration over millions of cells with GLUE

As technologies continue to evolve, the throughput of single-cell experiments is constantly increasing. Recent studies have generated human cell atlases for gene expression [24] and chromatin accessibility [25] containing millions of cells. The integration of these atlases poses a significant challenge to computational methods due to the sheer volume of data, extensive heterogeneity, low coverage per cell, and unbalanced cell type compositions, and has yet to be accomplished at the single-cell level.

Implemented as parametric neural networks with minibatch optimization, GLUE delivers superior scalability with a sublinear time cost, promising its applicability at the atlas scale, which is barely possible for other methods (Supplementary Fig. 13a). Using an efficient multistage training strategy for GLUE (Methods), we successfully integrated the gene expression and chromatin accessibility data into a unified multi-omics human cell atlas (Fig. 5).

[Figure 5 plots omitted. Panels a (colored by omics layer: scATAC-seq, scRNA-seq) and b (colored by cell type) are UMAP1 vs. UMAP2 scatter plots; the cell type legend is not reproduced here.]

Fig. 5 Integration of a multi-omics human cell atlas.
UMAP visualizations of the integrated cell embeddings, colored by a, omics layers, and b, cell types. The pink circles highlight cells labeled as “Excitatory neurons” in scRNA-seq but “Astrocytes” in scATAC-seq. The blue circles highlight cells labeled as “Astrocytes” in scRNA-seq but “Astrocytes/Oligodendrocytes” in scATAC-seq. The brown circles highlight cells labeled as “Oligodendrocytes” in scRNA-seq but “Astrocytes/Oligodendrocytes” in scATAC-seq.

While the aligned atlas was largely consistent with the original annotations [25] (Supplementary Fig. 13c-e), we also noticed several discrepancies. For example, cells originally annotated as “Astrocytes” in scATAC-seq were aligned to an “Excitatory neurons” cluster in scRNA-seq (highlighted with pink circles/flows in Supplementary Fig. 13). Further inspection revealed that canonical radial glial (RG) markers such as PAX6, HES1, and HOPX [43, 44] were actively transcribed in this cluster, both in the RNA and ATAC domain (Supplementary Fig. 14), with chromatin priming [9] also detected at both neuronal and glial markers (Supplementary Fig. 15-17), suggesting that the cluster consists of multipotent neural progenitors (likely RGs) rather than excitatory neurons or astrocytes as originally annotated. GLUE-based integration also resolved several scATAC-seq clusters that were ambiguously annotated. For example, the “Astrocytes/Oligodendrocytes” cluster was split into two halves and aligned to the “Astrocytes” and “Oligodendrocytes” clusters of scRNA-seq (highlighted with blue and brown circles/flows in Supplementary Fig. 13, respectively), which was also supported by marker expression and accessibility (Supplementary Fig. 16, 17). These results demonstrate the unique value of atlas-scale multi-omics integration.
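Annotation discrepancies of this kind can be surfaced by tabulating, for each cell in one layer, the label of its nearest neighbor in the other layer within the shared embedding. The sketch below is a generic nearest-neighbor cross-tabulation under that assumption; it is not the procedure used to produce the supplementary figures, and the function name and toy data are hypothetical.

```python
from collections import Counter

import numpy as np


def nearest_label_crosstab(query_emb, query_labels, ref_emb, ref_labels) -> Counter:
    """Count (query label, nearest reference label) pairs in a shared embedding."""
    # Dense squared Euclidean distances; suitable for small examples only.
    d = ((query_emb[:, None, :] - ref_emb[None, :, :]) ** 2).sum(axis=-1)
    nearest = d.argmin(axis=1)
    return Counter((query_labels[i], ref_labels[j]) for i, j in enumerate(nearest))


# Toy example: two scRNA-seq cells queried against three scATAC-seq cells.
rna_emb = np.array([[0.0, 0.0], [5.0, 5.0]])
rna_labels = ["Astrocytes", "Excitatory neurons"]
atac_emb = np.array([[0.1, 0.0], [5.0, 4.9], [9.0, 9.0]])
atac_labels = ["Astrocytes", "Astrocytes", "Other"]

table = nearest_label_crosstab(rna_emb, rna_labels, atac_emb, atac_labels)
```

Off-diagonal entries of the table, such as the ("Excitatory neurons", "Astrocytes") pair here, point to clusters worth inspecting with marker genes. At atlas scale the dense distance matrix would have to be replaced by an approximate nearest-neighbor index.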

Discussion

Combining omics-specific autoencoders with graph-based coupling and adversarial alignment, we designed and implemented the GLUE framework for unpaired single-cell multi-omics data integration with superior accuracy and robustness. By modeling regulatory interactions across omics layers explicitly, GLUE uniquely supports model-based regulatory inference for unpaired multi-omics datasets, exhibiting even higher reliability than regular correlation analysis on paired datasets (notably, in a Bayesian interpretation, the GLUE regulatory inference can be seen as a posterior estimate, which can be continuously refined upon the arrival of new data). Furthermore, benefitting from a neural network-based design, GLUE enables notable scalability for whole-atlas alignment over millions of unpaired cells, which remains a serious challenge for in silico integration. In fact, we also attempted to perform integration using the popular Seurat v3 method (Methods), but the result was far from optimal (Supplementary Fig. 18).


Unpaired multi-omics integration, also referred to as diagonal integration [14], shares some conceptual similarities with batch effect correction [45], as both call for the alignment of unpaired cells in certain data representations. Nonetheless, the former is significantly more challenging because of the distinct, omics-specific feature spaces. While completely unsupervised multi-omics integration has been proposed [20, 21], such an approach is exceedingly difficult and has largely been limited to aligning continuous trajectories. For general-case multi-omics integration, additional prior knowledge is necessary. At the omics feature level, presumed feature interactions have been used via feature conversion [15, 16] or coupled matrix factorization [18, 19]. At the cell level, known cell types have also been used via (semi)supervised learning [46, 47], but this approach is substantially limited in applicability, since such supervision is typically unavailable and, in many cases, obtaining it is the very purpose of multi-omics integration [25]. Notably, one of these methods was proposed with a similar autoencoder architecture and adversarial alignment [47], but it relied on matched cell types or clusters to orient the alignment. In fact, GLUE shares more conceptual similarity with the coupled matrix factorization methods, but with superior accuracy, robustness and scalability, which mostly benefit from its deep generative model-based design.

As a generalizable framework, GLUE features a modular design, where the data and graph autoencoders are independently configurable.

• The data autoencoders in GLUE are customizable with appropriate generative models that conform to omics-specific data distributions. In the current work, we used the negative binomial distribution for scRNA-seq and scATAC-seq, and the zero-inflated log-normal distribution for snmC-seq (Methods). Nevertheless, generative distributions can be easily reconfigured to accommodate other omics layers, such as protein abundance [48] and histone modification [49], and to adopt new advances in data modeling techniques [50].

• The guidance graphs used in GLUE have currently been limited to multipartite graphs, containing only edges between the features of different layers. Nonetheless, graphs, as intuitive and flexible representations of regulatory knowledge, can embody more complex regulatory patterns, including within-modality interactions, non-feature vertices, and multi-relations. Beyond canonical graph convolution, more advanced graph neural network architectures [51-53] may also be adopted to extract richer information from the regulatory graph.

Recent advances in experimental multi-omics technologies have increased the availability of paired data [8-11, 30]. While most of the current simultaneous multi-omics protocols still suffer from lower data quality or throughput than that of single-omics methods [54], paired cells can be highly informative in anchoring different omics layers and should be utilized in conjunction with unpaired cells whenever available. It is straightforward to extend the GLUE framework to incorporate such pairing information, e.g., by adding another loss term that penalizes the embedding distances between paired cells [55]. Such an extension may ultimately lead to a solution for the general case of mosaic integration [14].

Apart from multi-omics integration, we also note that the GLUE framework could be suitable for cross-species integration, especially when distal species are concerned and one-to-one orthologs are limited. Specifically, we may compile all orthologs into a GLUE guidance graph and perform integration without explicit ortholog conversion. Under that setting, the GLUE approach could also be conceptually connected to a recent work called SAMap [56].

We believe that GLUE, as a modular and generalizable framework, creates an unprecedented opportunity towards effectively delineating gene regulatory maps via large-scale multi-omics integration at single-cell resolution. The whole package of GLUE, along with tutorials and demo cases, is available online at https://github.com/gao-lab/GLUE for the community.


Methods

The GLUE framework

We assume that there are $K$ different omics layers to be integrated, each with a distinct feature set $\mathcal{V}_k, k = 1, 2, \ldots, K$. For example, in scRNA-seq, $\mathcal{V}_k$ is the set of genes, while in scATAC-seq, $\mathcal{V}_k$ is the set of chromatin regions. The data spaces of different omics layers are denoted as $\mathcal{X}_k \subseteq \mathbb{R}^{|\mathcal{V}_k|}$ with varying dimensionalities. We use $\mathbf{x}_k^{(n)} \in \mathcal{X}_k, n = 1, 2, \ldots, N_k$ to denote cells from the $k$th omics layer and $x_{ki}^{(n)}, i \in \mathcal{V}_k$ to denote the observed value of feature $i$ of the $k$th layer in the $n$th cell. $N_k$ is the sample size of the $k$th layer. Notably, the cells from different omics layers are unpaired and can have different sample sizes. To avoid cluttering, we drop the superscript $(n)$ when referring to an arbitrary cell.

We model the observed data from different omics layers as generated by a low-dimensional latent variable (i.e., cell embedding) $\mathbf{u} \in \mathbb{R}^m$:

$$p(\mathbf{x}_k; \theta_k) = \int p(\mathbf{x}_k \mid \mathbf{u}; \theta_k)\, p(\mathbf{u})\, d\mathbf{u} \tag{1}$$

where $p(\mathbf{u})$ is the prior distribution of the latent variable, $p(\mathbf{x}_k \mid \mathbf{u}; \theta_k)$ are learnable generative distributions (i.e., data decoders), and $\theta_k$ denotes learnable parameters in the decoders. The cell latent variable $\mathbf{u}$ is shared across different omics layers. In other words, $\mathbf{u}$ represents the common cell states underlying all omics observations, while the observed data from each layer are generated by a specific type of measurement of the underlying cell states.

With the introduction of variational posteriors $q(\mathbf{u} \mid \mathbf{x}_k; \phi_k)$ (i.e., data encoders, where $\phi_k$ are learnable parameters in the encoders), model fitting can be efficiently performed by maximizing the following evidence lower bounds:

$$\mathcal{L}_{\mathcal{X}_k}(\theta_k, \phi_k) = \mathbb{E}_{\mathbf{x}_k \sim p_{\text{data}}(\mathbf{x}_k)} \left[ \mathbb{E}_{\mathbf{u} \sim q(\mathbf{u} \mid \mathbf{x}_k; \phi_k)} \log p(\mathbf{x}_k \mid \mathbf{u}; \theta_k) - \mathrm{KL}\left( q(\mathbf{u} \mid \mathbf{x}_k; \phi_k) \,\|\, p(\mathbf{u}) \right) \right] \tag{2}$$

Since different autoencoders are independently parameterized and trained on separate data, the cell embeddings learned for different omics layers could have inconsistent semantic meanings unless they are linked properly.


To link the autoencoders, we propose a guidance graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, which incorporates prior knowledge about the regulatory interactions among features at distinct omics layers, where $\mathcal{V} = \bigcup_{k=1}^{K} \mathcal{V}_k$ is the universal feature set and $\mathcal{E} = \{(i, j) \mid i, j \in \mathcal{V}\}$ is the set of edges. Each edge is also associated with signs and weights, which are denoted as $s_{ij}$ and $w_{ij}$, respectively. We require that $w_{ij} \in (0, 1]$, which can be interpreted as interaction credibility, and that $s_{ij} \in \{-1, 1\}$, which specifies the sign of the regulatory interaction. For example, an ATAC peak located near the promoter of a gene is usually assumed to positively regulate its expression, so they can be connected with a positive edge ($s_{ij} = 1$). Meanwhile, DNA methylation in the gene promoter is usually assumed to suppress expression, so they can be connected with a negative edge ($s_{ij} = -1$). In addition to the connections between features, self-loops are also added for numerical stability, with $s_{ii} = 1, w_{ii} = 1, \forall i \in \mathcal{V}$.

We treat the guidance graph as an observed variable and model it as generated by low-dimensional feature latent variables (i.e., feature embeddings) $\mathbf{v}_i \in \mathbb{R}^m, i \in \mathcal{V}$. Furthermore, differing from the previous model, we now model $\mathbf{x}_k$ as generated by the combination of feature latent variables $\mathbf{v}_i \in \mathbb{R}^m, i \in \mathcal{V}_k$ and the cell latent variable $\mathbf{u} \in \mathbb{R}^m$. For convenience, we introduce the notation $\mathbf{V} \in \mathbb{R}^{m \times |\mathcal{V}|}$, which combines all feature embeddings into a single matrix. The model likelihood can thus be written as:

$$p(\mathbf{x}_k, \mathcal{G}; \theta_k, \theta_\mathcal{G}) = \int p(\mathbf{x}_k \mid \mathbf{u}, \mathbf{V}; \theta_k)\, p(\mathcal{G} \mid \mathbf{V}; \theta_\mathcal{G})\, p(\mathbf{u})\, p(\mathbf{V})\, d\mathbf{u}\, d\mathbf{V} \tag{3}$$

where $p(\mathbf{x}_k \mid \mathbf{u}, \mathbf{V}; \theta_k)$ and $p(\mathcal{G} \mid \mathbf{V}; \theta_\mathcal{G})$ are learnable generative distributions for the omics data (i.e., data decoders) and knowledge graph (i.e., graph decoder), respectively. $\theta_k$ and $\theta_\mathcal{G}$ are learnable parameters in the decoders. $p(\mathbf{u})$ and $p(\mathbf{V})$ are the prior distributions of the cell latent variable and feature latent variables, respectively, which are fixed as standard normal distributions for simplicity:

$$p(\mathbf{u}) = \mathcal{N}(\mathbf{u}; \mathbf{0}, \mathbf{I}_m) \tag{4}$$

$$p(\mathbf{v}_i) = \mathcal{N}(\mathbf{v}_i; \mathbf{0}, \mathbf{I}_m), \quad p(\mathbf{V}) = \prod_{i \in \mathcal{V}} p(\mathbf{v}_i) \tag{5}$$

though alternatives may also be used [57]. For convenience, we also introduce the notation $\mathbf{V}_k \in \mathbb{R}^{m \times |\mathcal{V}_k|}$, which contains only feature embeddings in the $k$th omics layer, and $\mathbf{u}_k$, which emphasizes that the cell embedding is from a cell in the $k$th omics layer.

The graph likelihood $p(\mathcal{G} \mid \mathbf{V}; \theta_\mathcal{G})$ (i.e., graph decoder) is defined as:


$$p(\mathcal{G} \mid \mathbf{V}; \theta_\mathcal{G}) = \mathbb{E}_{i, j \sim p(i, j; w_{ij})} \left[ \sigma\left( s_{ij} \cdot \mathbf{v}_i^\top \mathbf{v}_j \right) \cdot \mathbb{E}_{j' \sim p_{\text{ns}}(j' \mid i)} \left( 1 - \sigma\left( s_{ij} \cdot \mathbf{v}_i^\top \mathbf{v}_{j'} \right) \right) \right] \tag{6}$$

where $\sigma$ is the sigmoid function and $p_{\text{ns}}$ is a negative sampling distribution [58]. In other words, we first sample the edges $(i, j)$ with probabilities proportional to the edge weights and then sample vertices $j'$ that are not connected to $i$ and treat them as if $s_{ij'} = s_{ij}$. When maximizing the graph likelihood, the inner products between features are maximized or minimized (per edge sign) based on the Bernoulli distribution. For example, ATAC peaks located near the promoter of a gene would be encouraged to have similar embeddings to that of the gene, while DNA methylation in the gene promoter would be encouraged to have a dissimilar embedding to that of the gene.
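The sampling procedure described above can be illustrated with a small Monte Carlo sketch. This is not the scglue implementation: the function name `graph_likelihood_estimate`, the toy vertex names, and the uniform choice of negative samples are assumptions made for illustration (the text only states that negatives are drawn from vertices not connected to $i$).

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def graph_likelihood_estimate(emb, edges, n_samples=200, n_neg=2, seed=0):
    """Monte Carlo estimate of an edge objective of the form in Eq. 6.

    emb:   dict mapping vertex -> embedding (list of floats)
    edges: list of (i, j, sign, weight) tuples from the guidance graph
    Negative samples are drawn uniformly (an assumption of this sketch).
    """
    rng = random.Random(seed)
    vertices = sorted(emb)
    # A vertex is never its own negative sample (the graph has self-loops).
    neighbors = {v: {v} for v in vertices}
    for i, j, _, _ in edges:
        neighbors[i].add(j)
        neighbors[j].add(i)
    weights = [w for _, _, _, w in edges]

    total = 0.0
    for _ in range(n_samples):
        # Sample an edge with probability proportional to its weight.
        i, j, sign, _ = rng.choices(edges, weights=weights, k=1)[0]
        # Sample vertices not connected to i; they inherit the edge sign.
        candidates = [v for v in vertices if v not in neighbors[i]]
        negatives = [rng.choice(candidates) for _ in range(n_neg)]
        pos = sigmoid(sign * dot(emb[i], emb[j]))
        neg = sum(1.0 - sigmoid(sign * dot(emb[i], emb[v]))
                  for v in negatives) / n_neg
        total += pos * neg
    return total / n_samples


# Toy graph: a peak activates the gene, promoter methylation represses it.
edges = [("gene", "peak", 1, 1.0), ("gene", "mC", -1, 0.5)]
consistent = {"gene": [1.0, 0.0], "peak": [1.0, 0.0], "mC": [-1.0, 0.0],
              "x1": [0.0, 1.0], "x2": [0.0, -1.0]}
# Same graph, but embeddings that contradict the edge signs.
flipped = dict(consistent, peak=[-1.0, 0.0], mC=[1.0, 0.0])

print(graph_likelihood_estimate(consistent, edges))
print(graph_likelihood_estimate(flipped, edges))
```

In this toy case the embeddings that agree with the edge signs score higher than the contradictory ones, which is the behavior the graph decoder is meant to reward.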

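The data likelihoods $p(\mathbf{x}_k \mid \mathbf{u}, \mathbf{V}; \theta_k)$ (i.e., data decoders) in Eq. 3 are built upon the inner product between the cell embedding $\mathbf{u}$ and feature embeddings $\mathbf{V}_k$. Thus, analogous to the loading matrix in principal component analysis (PCA), the feature embeddings $\mathbf{V}_k$ confer semantic meanings for the cell embedding space. As $\mathbf{V}_k$ are modulated by interactions among omics features in the guidance graph, the semantic meanings become linked. The exact formulation of data likelihood depends on the omics data distribution. For example, for count-based scRNA-seq and scATAC-seq data, we used the negative binomial (NB) distribution:

$$p(\mathbf{x}_k \mid \mathbf{u}, \mathbf{V}; \theta_k) = \prod_{i \in \mathcal{V}_k} \mathrm{NB}(x_{ki}; \mu_i, \theta_i) \tag{7}$$

$$\mathrm{NB}(x_{ki}; \mu_i, \theta_i) = \frac{\Gamma(x_{ki} + \theta_i)}{\Gamma(\theta_i)\, \Gamma(x_{ki} + 1)} \left( \frac{\mu_i}{\theta_i + \mu_i} \right)^{x_{ki}} \left( \frac{\theta_i}{\theta_i + \mu_i} \right)^{\theta_i} \tag{8}$$

$$\mu_i = \mathrm{Softmax}_i\left( \boldsymbol{\alpha} \odot \mathbf{V}_k^\top \mathbf{u} + \boldsymbol{\beta} \right) \cdot \sum_{j \in \mathcal{V}_k} x_{kj} \tag{9}$$

where $\boldsymbol{\mu}, \boldsymbol{\theta} \in \mathbb{R}_+^{|\mathcal{V}_k|}$ are the mean and dispersion of the NB distribution, respectively, and $\boldsymbol{\alpha} \in \mathbb{R}_+^{|\mathcal{V}_k|}, \boldsymbol{\beta} \in \mathbb{R}^{|\mathcal{V}_k|}$ are scaling and bias factors. $\odot$ is the Hadamard product. $\mathrm{Softmax}_i$ represents the $i$th dimension of the softmax output. The set of learnable parameters is $\theta_k = \{\boldsymbol{\theta}, \boldsymbol{\alpha}, \boldsymbol{\beta}\}$. Analogously, many other distributions can also be supported, as long as we can parameterize the means of the distributions by feature-cell inner products.

For efficient inference and optimization, we introduce the following factorized variational posterior:

$$q(\mathbf{u}, \mathbf{V} \mid \mathbf{x}_k, \mathcal{G}; \phi_k, \phi_\mathcal{G}) = q(\mathbf{u} \mid \mathbf{x}_k; \phi_k) \cdot q(\mathbf{V} \mid \mathcal{G}; \phi_\mathcal{G}) \tag{10}$$

The graph variational posterior $q(\mathbf{V} \mid \mathcal{G}; \phi_\mathcal{G})$ (i.e., graph encoder) is modeled as diagonal-covariance normal distributions parameterized by a graph convolutional network [59]: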


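$$q(\mathbf{V} \mid \mathcal{G}; \phi_\mathcal{G}) = \prod_{i \in \mathcal{V}} q(\mathbf{v}_i \mid \mathcal{G}; \phi_\mathcal{G}) \tag{11}$$

$$q(\mathbf{v}_i \mid \mathcal{G}; \phi_\mathcal{G}) = \mathcal{N}\left( \mathbf{v}_i; \mathrm{GCN}_{\mu_i}(\mathcal{G}; \phi_\mathcal{G}), \mathrm{GCN}_{\sigma_i^2}(\mathcal{G}; \phi_\mathcal{G}) \right) \tag{12}$$

where $\phi_\mathcal{G}$ represents the learnable parameters in the GCN encoder.

The variational data posteriors $q(\mathbf{u} \mid \mathbf{x}_k; \phi_k)$ (i.e., data encoders) are modeled as diagonal-covariance normal distributions parameterized by multilayer perceptron (MLP) neural networks:

$$q(\mathbf{u} \mid \mathbf{x}_k; \phi_k) = \mathcal{N}\left( \mathbf{u}; \mathrm{MLP}_{k,\mu}(\mathbf{x}_k; \phi_k), \mathrm{MLP}_{k,\sigma^2}(\mathbf{x}_k; \phi_k) \right) \tag{13}$$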

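where $\phi_k$ is the set of learnable parameters in the MLP encoder of the $k$th omics layer.

Model fitting can then be performed by maximizing the following evidence lower bound:

$$\sum_{k=1}^{K} \mathbb{E}_{\mathbf{x}_k \sim p_{\text{data}}(\mathbf{x}_k)} \left[ \mathbb{E}_{\mathbf{u} \sim q(\mathbf{u} \mid \mathbf{x}_k; \phi_k),\, \mathbf{V} \sim q(\mathbf{V} \mid \mathcal{G}; \phi_\mathcal{G})} \log p(\mathbf{x}_k \mid \mathbf{u}, \mathbf{V}; \theta_k)\, p(\mathcal{G} \mid \mathbf{V}; \theta_\mathcal{G}) - \mathrm{KL}\left( q(\mathbf{u} \mid \mathbf{x}_k; \phi_k)\, q(\mathbf{V} \mid \mathcal{G}; \phi_\mathcal{G}) \,\|\, p(\mathbf{u})\, p(\mathbf{V}) \right) \right] \tag{14}$$

which can be further rearranged into the following form:

$$K \cdot \mathcal{L}_\mathcal{G}(\theta_\mathcal{G}, \phi_\mathcal{G}) + \sum_{k=1}^{K} \mathcal{L}_{\mathcal{X}_k}(\theta_k, \phi_k, \phi_\mathcal{G}) \tag{15}$$

where we have

$$\mathcal{L}_{\mathcal{X}_k}(\theta_k, \phi_k, \phi_\mathcal{G}) = \mathbb{E}_{\mathbf{x}_k \sim p_{\text{data}}(\mathbf{x}_k)} \left[ \mathbb{E}_{\mathbf{u} \sim q(\mathbf{u} \mid \mathbf{x}_k; \phi_k),\, \mathbf{V} \sim q(\mathbf{V} \mid \mathcal{G}; \phi_\mathcal{G})} \log p(\mathbf{x}_k \mid \mathbf{u}, \mathbf{V}; \theta_k) - \mathrm{KL}\left( q(\mathbf{u} \mid \mathbf{x}_k; \phi_k) \,\|\, p(\mathbf{u}) \right) \right] \tag{16}$$

$$\mathcal{L}_\mathcal{G}(\theta_\mathcal{G}, \phi_\mathcal{G}) = \mathbb{E}_{\mathbf{V} \sim q(\mathbf{V} \mid \mathcal{G}; \phi_\mathcal{G})} \log p(\mathcal{G} \mid \mathbf{V}; \theta_\mathcal{G}) - \mathrm{KL}\left( q(\mathbf{V} \mid \mathcal{G}; \phi_\mathcal{G}) \,\|\, p(\mathbf{V}) \right) \tag{17}$$

Below, for convenience, we denote the union of all encoder parameters as $\phi = \left( \bigcup_{k=1}^{K} \phi_k \right) \cup \phi_\mathcal{G}$ and the union of all decoder parameters as $\theta = \left( \bigcup_{k=1}^{K} \theta_k \right) \cup \theta_\mathcal{G}$.

To ensure the proper alignment of different omics layers, we use the adversarial alignment strategy [27, 60]. A discriminator $\mathrm{D}$ with a $K$-dimensional softmax output is introduced, which predicts the omics layers of cells based on their embeddings $\mathbf{u}$. The discriminator $\mathrm{D}$ is trained by minimizing the multiclass classification cross entropy:

$$\mathcal{L}_{\mathrm{D}}(\phi, \psi) = -\frac{1}{K} \sum_{k=1}^{K} \mathbb{E}_{\mathbf{x}_k \sim p_{\text{data}}(\mathbf{x}_k)}\, \mathbb{E}_{\mathbf{u} \sim q(\mathbf{u} \mid \mathbf{x}_k; \phi_k)} \log \mathrm{D}_k(\mathbf{u}; \psi) \tag{18}$$

where $\mathrm{D}_k$ represents the $k$th dimension of the discriminator output and $\psi$ is the set of learnable parameters in the discriminator. The data encoders can then be trained in the opposite direction to fool the discriminator, ultimately leading to the alignment of cell embeddings from different omics layers [61].

The overall training objective of GLUE thus consists of:

$$\min_{\psi}\; \lambda_{\mathrm{D}} \cdot \mathcal{L}_{\mathrm{D}}(\phi, \psi) \tag{19}$$

$$\max_{\theta, \phi}\; \lambda_{\mathrm{D}} \cdot \mathcal{L}_{\mathrm{D}}(\phi, \psi) + \lambda_\mathcal{G} K \cdot \mathcal{L}_\mathcal{G}(\theta_\mathcal{G}, \phi_\mathcal{G}) + \sum_{k=1}^{K} \mathcal{L}_{\mathcal{X}_k}(\theta_k, \phi_k, \phi_\mathcal{G}) \tag{20}$$

The two hyperparameters $\lambda_{\mathrm{D}}$ and $\lambda_\mathcal{G}$ control the contributions of adversarial alignment and graph-based feature embedding, respectively. We use stochastic gradient descent (SGD) to train the GLUE model. Each SGD iteration is divided into two steps. In the first step, the discriminator is updated according to objective Eq. 19. In the second step, the data and graph autoencoders are updated according to Eq. 20. The RMSprop optimizer with no momentum term is employed to ensure the stability of adversarial training.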

Implementation details

We applied linear dimensionality reduction using canonical methods such as PCA (for scRNA-seq) or LSI (latent semantic indexing, for scATAC-seq) as the first transformation layers of the data encoders (note that the decoders were still fitted in the original feature spaces). This effectively reduced model size and enabled a modular input, so advanced dimensionality reduction or batch effect correction methods can also be used instead as preprocessing steps for GLUE integration.

To ensure stable alignment, we used batch normalization in the data encoder layers and employed additive noise annealing. Specifically, zero-mean Gaussian noise $\boldsymbol{\epsilon}$, with covariance scaled by a parameter $\tau$, was added to the cell embeddings $\mathbf{u}$ before passing them to the discriminator. The parameter $\tau$ controls the noise level, which starts at $\tau = 1$ and decreases linearly per epoch until reaching 0 (i.e., noise annealing). The number of annealing epochs was set automatically based on the data size and learning rate to match a learning progress equivalent to 4,000 iterations at a learning rate of 0.002.

During model training, 10% of the cells were used as the validation set. In the final stage of training, the learning rate would be reduced by factors of 10 if the validation loss did not improve for consecutive epochs. Training would be terminated if the validation loss still did not improve for consecutive epochs. The patience for learning rate reduction, the patience for training termination, and the maximal number of training epochs were automatically set based on the data size and learning rate to match a learning progress equivalent to 1,000, 2,000, and 16,000 iterations at a learning rate of 0.002, respectively.

For all benchmarks and case studies with GLUE, we used the default hyperparameters unless explicitly stated. The set of default hyperparameters is presented in Supplementary Fig. 4.
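The additive noise annealing described in this section can be sketched as a simple linear schedule. The helper names `noise_level` and `add_annealed_noise`, the zero-based epoch index, and the identity base covariance (so that the per-dimension standard deviation is the square root of the noise level) are assumptions of this sketch, not details confirmed by the text.

```python
import math
import random


def noise_level(epoch, anneal_epochs):
    """Linear decay from 1 at epoch 0 to 0 at `anneal_epochs` and beyond."""
    if anneal_epochs <= 0:
        return 0.0
    return max(1.0 - epoch / anneal_epochs, 0.0)


def add_annealed_noise(u, epoch, anneal_epochs, rng):
    """Add zero-mean Gaussian noise to one cell embedding `u`.

    Assumes an identity base covariance scaled by the noise level,
    which is an illustrative choice only.
    """
    std = math.sqrt(noise_level(epoch, anneal_epochs))
    return [x + std * rng.gauss(0.0, 1.0) for x in u]


rng = random.Random(0)
embedding = [0.5, -1.2, 0.3]
for epoch in (0, 5, 10):
    print(epoch, noise_level(epoch, 10),
          add_annealed_noise(embedding, epoch, 10, rng))
```

Once the schedule reaches 0, the discriminator sees the unperturbed embeddings.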

Systematic benchmarks

UnionCom [20] and GLUE were executed using the Python packages “unioncom” (v0.3.0) and “scglue” (v0.1.0), respectively. LIGER [16], bindSC [29], and Seurat v3 [15] were executed using the R packages “rliger” (v0.5.0), “bindSC” (v1.0.0), and “Seurat” (v4.0.2), respectively. For each method, we used the default hyperparameter settings and data preprocessing steps as recommended. For the scRNA-seq data, 2,000 highly variable genes were selected using the Seurat “vst” method. To construct the guidance graph, we connected ATAC peaks with RNA genes via positive edges if they overlapped in either the gene body or proximal promoter regions (defined as 2 kb upstream from the TSS). For the methods that require feature conversion (LIGER, bindSC, and Seurat v3), we converted the scATAC-seq data to gene-level activity scores by summing up counts in the ATAC peaks connected to specific genes in the guidance graph.

MAP (mean average precision) was used to evaluate the cell type resolution. Supposing that the cell type of the $i$th cell is $y^{(i)}$ and that the cell types of its $K$ ordered nearest neighbors are $y_1^{(i)}, y_2^{(i)}, \ldots, y_K^{(i)}$, the MAP is then defined as follows:

$$\mathrm{MAP} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{AP}^{(i)} \tag{21}$$

$$\mathrm{AP}^{(i)} = \begin{cases} \dfrac{\sum_{k=1}^{K} \mathbb{1}_{y^{(i)} = y_k^{(i)}} \cdot \dfrac{\sum_{j=1}^{k} \mathbb{1}_{y^{(i)} = y_j^{(i)}}}{k}}{\sum_{k=1}^{K} \mathbb{1}_{y^{(i)} = y_k^{(i)}}}, & \text{if } \sum_{k=1}^{K} \mathbb{1}_{y^{(i)} = y_k^{(i)}} > 0 \\ 0, & \text{otherwise} \end{cases} \tag{22}$$

where $\mathbb{1}_{y^{(i)} = y_k^{(i)}}$ is an indicator function that equals 1 if $y^{(i)} = y_k^{(i)}$ and 0 otherwise, and $N$ is the total number of cells. For each cell, AP (average precision) computes the average cell type precision up to each cell type-matched neighbor, and MAP is the average AP across all cells. We set $K$ to 1% of the total number of cells in each dataset. MAP has a range of 0 to 1, and higher values indicate better cell type resolution.
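The MAP definition in Eqs. 21 and 22 can be sketched directly in a few lines. The function names are hypothetical, and the brute-force Euclidean neighbor search that excludes the query cell itself is an assumption of this sketch; the text does not specify how neighbors were searched.

```python
def average_precision(label, neighbor_labels):
    """AP for one cell, given the labels of its ordered nearest neighbors."""
    hits = 0
    precisions = []
    for rank, neighbor_label in enumerate(neighbor_labels, start=1):
        if neighbor_label == label:
            hits += 1
            # Precision among the first `rank` neighbors, at a matched rank.
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_average_precision(embeddings, labels, k):
    """MAP over all cells using brute-force k-nearest neighbors."""
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: squared_distance(embeddings[i], embeddings[j]))
        total += average_precision(labels[i], [labels[j] for j in others[:k]])
    return total / n


cells = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
print(mean_average_precision(cells, ["a", "a", "b", "b"], k=1))
print(mean_average_precision(cells, ["a", "b", "a", "b"], k=1))
```

The first call uses labels that follow the two spatial clusters, while the second interleaves the labels so that every nearest neighbor has a different cell type.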


SAS (Seurat alignment score) was used to evaluate the extent of mixing among distinct omics layers and was computed as described in the original paper [31]:

$$\mathrm{SAS} = 1 - \frac{\bar{x} - \frac{K}{N}}{K - \frac{K}{N}} \tag{23}$$

where $\bar{x}$ is the average number of cells from the same omics layer among the $K$ nearest neighbors (different layers were first subsampled to the same size), and $N$ is the number of omics layers. SAS has a range of 0 to 1, and higher values indicate better mixing.

FOSCTTM (fraction of samples closer than the true match) [32] was used to evaluate the single-cell level alignment accuracy. It was computed on two datasets with known cell-to-cell pairings. Suppose that each dataset contains $N$ cells, and that the cells are sorted in the same order, i.e., the $i$th cell in the first dataset is paired with the $i$th cell in the second dataset. Denote $\mathbf{x}$ and $\mathbf{y}$ as the cell embeddings of the first and second dataset, respectively. The FOSCTTM is then defined as:

$$\mathrm{FOSCTTM} = \frac{1}{2N} \left( \sum_{i=1}^{N} \frac{n_1^{(i)}}{N} + \sum_{i=1}^{N} \frac{n_2^{(i)}}{N} \right) \tag{24}$$

$$n_1^{(i)} = \left| \left\{ j \mid d(\mathbf{x}_j, \mathbf{y}_i) < d(\mathbf{x}_i, \mathbf{y}_i) \right\} \right| \tag{25}$$

$$n_2^{(i)} = \left| \left\{ j \mid d(\mathbf{x}_i, \mathbf{y}_j) < d(\mathbf{x}_i, \mathbf{y}_i) \right\} \right| \tag{26}$$

where $n_1^{(i)}$ and $n_2^{(i)}$ are the number of cells in the first and second dataset, respectively, that are closer to the $i$th cell than their true matches in the opposite dataset. $d$ is the Euclidean distance.

For the baseline benchmark, each method was run 8 times with different random seeds, except for bindSC, which has a deterministic implementation and was run only once. For the guidance corruption benchmark, we removed the specified proportions of existing peak-gene interactions and added equal numbers of nonexistent interactions, so the total number of interactions remained unchanged. Of note, feature conversion was also repeated using the corrupted guidance graphs. The corruption procedure was repeated 8 times with different random seeds. For the subsampling benchmark, the scRNA-seq and scATAC-seq cells were subsampled in pairs (so FOSCTTM could still be computed). The subsampling process was also repeated 8 times with different random seeds.
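As an illustration of the FOSCTTM definition in Eqs. 24 to 26, the following brute-force sketch computes the metric for two small paired embedding sets. The function name `foscttm` is hypothetical, and the quadratic-time loops are chosen for readability rather than efficiency.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def foscttm(x, y):
    """Fraction of samples closer than the true match.

    x[i] and y[i] are the embeddings of the same cell in the two datasets.
    """
    n = len(x)
    total = 0.0
    for i in range(n):
        d_true = euclidean(x[i], y[i])
        # Cells of the first dataset closer to y[i] than its true match x[i].
        n1 = sum(1 for j in range(n) if euclidean(x[j], y[i]) < d_true)
        # Cells of the second dataset closer to x[i] than its true match y[i].
        n2 = sum(1 for j in range(n) if euclidean(x[i], y[j]) < d_true)
        total += n1 / n + n2 / n
    return total / (2 * n)


aligned_x = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
aligned_y = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
swapped_x = [[0.0, 0.0], [1.0, 0.0]]
swapped_y = [[1.0, 0.0], [0.0, 0.0]]
print(foscttm(aligned_x, aligned_y))
print(foscttm(swapped_x, swapped_y))
```

Lower values indicate better single-cell alignment: a perfectly aligned pair of datasets scores 0, while the swapped toy example, in which every cell sits on top of the wrong partner, scores 0.5.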


For the systematic scalability test (Supplementary Fig. 13a), all methods were run on a Linux workstation with 40 CPU cores (two Intel Xeon Silver 4210 chips), 250 GB of RAM, and NVIDIA GeForce RTX 2080 Ti GPUs. Only a single GPU card was used when training GLUE.

Triple-omics integration

The scRNA-seq and scATAC-seq data were handled as previously described (see the section “Systematic benchmarks”). Due to low coverage per single-C site, the snmC-seq data were converted to average methylation levels in gene bodies. The mCH and mCG levels were quantified separately, resulting in 2 features per gene. The gene methylation levels were normalized by the global methylation level per cell. An initial dimensionality reduction was performed using PCA (see the section “Implementation details”). For the triple-omics guidance graph, the mCH and mCG levels were connected to the corresponding genes with negative edges.

The normalized methylation levels were positive, with dropouts corresponding to the genes that were not covered in single cells. As such, we used the zero-inflated log-normal (ZILN) distribution for the data decoder:

$$p(\mathbf{x}_k \mid \mathbf{u}, \mathbf{V}; \theta_k) = \prod_{i \in \mathcal{V}_k} \mathrm{ZILN}(x_{ki}; \mu_i, \sigma_i, \delta_i) \tag{27}$$

$$\mathrm{ZILN}(x_{ki}; \mu_i, \sigma_i, \delta_i) = \begin{cases} \dfrac{1 - \delta_i}{x_{ki}\, \sigma_i \sqrt{2\pi}} \exp\left( -\dfrac{(\log x_{ki} - \mu_i)^2}{2 \sigma_i^2} \right), & x_{ki} > 0 \\ \delta_i, & x_{ki} = 0 \end{cases} \tag{28}$$

$$\boldsymbol{\mu} = \boldsymbol{\alpha} \odot \mathbf{V}_k^\top \mathbf{u} + \boldsymbol{\beta} \tag{29}$$

where $\boldsymbol{\mu} \in \mathbb{R}^{|\mathcal{V}_k|}$, $\boldsymbol{\sigma} \in \mathbb{R}_+^{|\mathcal{V}_k|}$, $\boldsymbol{\delta} \in (0, 1)^{|\mathcal{V}_k|}$ are the log-scale mean, log-scale standard deviation and zero-inflation parameters of the ZILN distribution, respectively, and $\boldsymbol{\alpha} \in \mathbb{R}_+^{|\mathcal{V}_k|}, \boldsymbol{\beta} \in \mathbb{R}^{|\mathcal{V}_k|}$ are scaling and bias factors.

To unify the cell type labels, we performed a nearest neighbor-based label transfer with the snmC-seq dataset as a reference. The 5 nearest neighbors in snmC-seq were identified for each scRNA-seq and scATAC-seq cell in the aligned embedding space, and majority voting was used to determine the transferred label. To verify whether the alignment was correct, we tested for significant overlap in cell type marker genes. The features of all omics layers were first converted to genes. Then, for each omics layer, the cell type markers were identified using the one-vs.-rest Wilcoxon rank-sum test with the following criteria: FDR < 0.05 and log-fold change > 0 for scRNA-seq/scATAC-seq; FDR < 0.05 and log-fold change < 0 for snmC-seq. The significance of marker overlap was determined by the three-way Fisher’s exact test [36].

To perform correlation and regression analysis after the integration, we clustered all cells from the three omics layers using fine-scale k-means (k = 200). Then, for each omics layer, the cells in each cluster were aggregated into a metacell by summing their expression/accessibility counts or averaging their DNA methylation levels. Therefore, the metacells were established as paired samples, based on which feature correlation and regression analyses could be conducted.
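The metacell aggregation step above can be sketched as follows, assuming the joint k-means cluster assignments have already been computed on the aligned embeddings. The function name `aggregate_metacells` and the toy matrices are hypothetical; the clustering itself is outside the scope of this sketch.

```python
def aggregate_metacells(values, clusters, how="sum"):
    """Aggregate per-cell feature vectors into one vector per cluster.

    values:   list of per-cell feature vectors from a single omics layer
    clusters: cluster id of each cell (shared across omics layers)
    how:      "sum" for count data, "mean" for methylation levels
    """
    if how not in ("sum", "mean"):
        raise ValueError("how must be 'sum' or 'mean'")
    groups = {}
    for row, cluster in zip(values, clusters):
        groups.setdefault(cluster, []).append(row)
    metacells = {}
    for cluster, rows in groups.items():
        totals = [sum(column) for column in zip(*rows)]
        if how == "sum":
            metacells[cluster] = totals
        else:
            metacells[cluster] = [t / len(rows) for t in totals]
    return metacells


# Toy example: cluster ids are shared, so metacells are paired across layers.
rna_counts = [[1, 0], [2, 3], [0, 4]]
rna_clusters = [0, 0, 1]
mc_levels = [[0.25, 0.5], [0.75, 1.0], [0.5, 0.5]]
mc_clusters = [0, 0, 1]

rna_meta = aggregate_metacells(rna_counts, rna_clusters, how="sum")
mc_meta = aggregate_metacells(mc_levels, mc_clusters, how="mean")
print(rna_meta)
print(mc_meta)
```

Because both layers are keyed by the same cluster ids, the resulting metacells can be treated as paired samples for correlation or regression.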

Model-based cis-regulatory inference

To ensure consistency of cell types, we first selected the overlapping cell types between the 10x Multiome and pcHi-C data. The remaining cell types included T cells, B cells and monocytes. The eQTL data were used as is, because they were not cell type-specific. For scRNA-seq, we selected 6,000 highly variable genes. For the initial regulatory inference, the guidance graph was constructed by connecting RNA genes with ATAC peaks within 150 kb of the gene promoters (defined as 2 kb upstream from the TSS); the graph was weighted by a power-law function $w_{ij} = (d_{ij} + 1)^{-0.75}$ ($d_{ij}$ is the genomic distance in kb), which has been proposed to model the probability of chromatin contact [38, 39].

To incorporate the regulatory evidence of pcHi-C and eQTL, we anchored all evidence to that between the ATAC peaks and RNA genes. A peak-gene pair was considered supported by pcHi-C if (1) the gene promoter was within 1 kb of a bait fragment, (2) the peak was within 1 kb of an other-end fragment, and (3) significant contact was identified between the bait and the other-end fragment in pcHi-C. The pcHi-C-supported peak-gene interactions were weighted by multiplying the promoter-to-bait and the peak-to-other-end power-law weights (see above). If a peak-gene pair was supported by multiple pcHi-C contacts, the weights were summed and clipped at a maximum of 1. A peak-gene pair was considered supported by eQTL if (1) the peak overlapped an eQTL locus and (2) the locus was associated with the expression of the gene. The eQTL-supported peak-gene interactions were assigned weights of 1. The composite guidance graph was constructed by adding the pcHi-C- and eQTL-supported interactions to the previous distance-based interactions, allowing for multi-edges.
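The edge weighting rules above can be sketched as two small helpers. The function names and the input layout (one distance pair per supporting pcHi-C contact) are assumptions for illustration, and the exponent of -0.75 is the value taken for this sketch.

```python
def power_law_weight(distance_kb, exponent=-0.75):
    """Distance-decay weight for a peak-gene pair; distance in kb."""
    return (distance_kb + 1.0) ** exponent


def pchic_weight(contacts):
    """Weight of a pcHi-C-supported peak-gene pair.

    contacts: list of (promoter_to_bait_kb, peak_to_other_end_kb) tuples,
              one per supporting pcHi-C contact (hypothetical layout).
    Per-contact weights are multiplied, then summed and clipped at 1.
    """
    total = sum(power_law_weight(a) * power_law_weight(b) for a, b in contacts)
    return min(total, 1.0)


print(power_law_weight(0.0))    # a peak at the promoter
print(power_law_weight(15.0))   # a peak 15 kb away
print(pchic_weight([(0.2, 0.5)]))
print(pchic_weight([(0.0, 0.0), (0.1, 0.3)]))  # multiple contacts, clipped
```

Under these rules a peak directly at the promoter receives the maximal weight of 1, and the weight decays smoothly with genomic distance.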


For regulatory inference, only peak-gene pairs within 150 kb in distance were considered. The GLUE training process was repeated 4 times with different random seeds. For each repeat, the peak-gene regulatory score was computed as the cosine similarity between the feature embeddings. The final regulatory inference was obtained by averaging the regulatory scores across the 4 repeats.
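The scoring step above amounts to a cosine similarity averaged over training repeats, which can be sketched as follows. The function names and the toy two-dimensional embeddings are hypothetical.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def regulatory_score(gene_embeddings, peak_embeddings):
    """Average the per-repeat cosine similarity of one peak-gene pair.

    gene_embeddings[r] and peak_embeddings[r] are the feature embeddings
    of the gene and the peak learned in training repeat r.
    """
    scores = [cosine_similarity(g, p)
              for g, p in zip(gene_embeddings, peak_embeddings)]
    return sum(scores) / len(scores)


# Toy example with 2 repeats (the text uses 4).
gene = [[1.0, 0.0], [1.0, 0.0]]
peak = [[1.0, 0.0], [0.0, 1.0]]
print(regulatory_score(gene, peak))
```

Averaging across repeats reduces the influence of any single random initialization on the final score.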

TF-target gene regulatory inference

We employed the SCENIC workflow [62] to construct a TF-gene regulatory network from the inferred peak-gene regulatory interactions. Briefly, the SCENIC workflow first constructs a gene coexpression network based on the scRNA-seq data, and then uses external cis-regulatory evidence to filter out false positives. SCENIC accepts cis-regulatory evidence in the form of gene rankings per TF, i.e., genes with higher TF enrichment levels in their regulatory regions are ranked higher. To construct the rankings based on our inferred peak-gene interactions, we first overlapped the ENCODE TF ChIP peaks [63] with the ATAC peaks and counted the number of ChIP peaks for each TF in each ATAC peak. Since different genes can have different numbers of connected ATAC peaks, and the ATAC peaks vary in length (longer peaks can contain more ChIP peaks by chance), we devised a sampling-based approach to evaluate TF enrichment. Specifically, for each gene, we randomly sampled 1,000 sets of ATAC peaks that matched the connected ATAC peaks in both number and length distribution. We counted the numbers of TF ChIP peaks in these random ATAC peaks as null distributions. For each TF in each gene, an empirical P value could then be computed by comparing the observed number of ChIP peaks to the null distribution. Finally, we ranked the genes by the empirical P values for each TF, producing the cis-regulatory rankings used by SCENIC. Since peak-gene-based inference is mainly focused on remote regulatory regions, proximal promoters could be missed. As such, we provided SCENIC with both the above peak-based and proximal promoter-based cis-regulatory rankings.
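A minimal sketch of the sampling-based enrichment test is given below for a single TF and a single gene. Several details are assumptions made for illustration and are not stated in the text: matching the length distribution by drawing each random peak from the same fixed-width length bin, sampling with replacement, and adding a pseudocount of 1 to the empirical P value. All names are hypothetical.

```python
import random
from collections import defaultdict


def empirical_p_value(connected, chip_counts, lengths,
                      n_samples=1000, bin_size=200, seed=0):
    """Empirical P value of TF ChIP enrichment in a gene's connected peaks.

    connected:   ids of the ATAC peaks connected to the gene
    chip_counts: dict peak id -> number of ChIP peaks of the TF in that peak
    lengths:     dict peak id -> peak length in bp (covers all ATAC peaks)
    """
    rng = random.Random(seed)
    # Group all peaks by length bin so that random sets match in length.
    bins = defaultdict(list)
    for peak, length in lengths.items():
        bins[length // bin_size].append(peak)

    observed = sum(chip_counts[p] for p in connected)
    n_extreme = 0
    for _ in range(n_samples):
        # One random peak per connected peak, drawn from the same length bin.
        null = sum(chip_counts[rng.choice(bins[lengths[p] // bin_size])]
                   for p in connected)
        if null >= observed:
            n_extreme += 1
    # Pseudocount keeps the estimate strictly positive (an assumption).
    return (n_extreme + 1) / (n_samples + 1)


# Toy data: 20 peaks of equal length; only the gene's two peaks carry ChIP hits.
peaks = ["p%d" % i for i in range(20)]
lengths = {p: 500 for p in peaks}
chip_counts = {p: 0 for p in peaks}
chip_counts["p0"] = 5
chip_counts["p1"] = 5

print(empirical_p_value(["p0", "p1"], chip_counts, lengths))
print(empirical_p_value(["p2", "p3"], chip_counts, lengths))
```

Genes would then be ranked per TF by these P values, with smaller values ranked higher.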

Integration for the human multi-omics atlas

The scRNA-seq and scATAC-seq atlases have highly unbalanced cell type compositions, which is primarily caused by differences in organ sampling sizes (Supplementary Fig. 13b). Although cell types are unknown during real-world analyses, organ sources are typically available and can be utilized to help balance the integration process. To perform organ-balanced data preprocessing, we first subsampled each omics layer to match the organ compositions. For the scRNA-seq data, 4,000 highly variable genes were selected using the organ-balanced subsample. Then, for the initial dimensionality reduction, we fitted PCA (scRNA-seq) and LSI (scATAC-seq) on the organ-balanced subsample and applied the projection to the full data. The PCA/LSI coordinates were used as the first transformation layer in the GLUE data encoders (see the section “Implementation details”), as well as for metacell aggregation (see below). The guidance graph was constructed as described previously (see the section “Systematic benchmarks”).

The two atlases consist of large numbers of cells but with low coverage per cell. To alleviate dropout and increase the training speed simultaneously, we designed a multistage training strategy, where the GLUE model was pretrained on aggregated metacells and then fine-tuned on the original single cells. Specifically, in the first stage, we clustered the cells in each omics layer using fine-scaled k-means (k = 100,000 for scRNA-seq and k = 40,000 for scATAC-seq). To balance the organ compositions at the same time, k-means centroids were fitted on the previous organ-balanced subsample and then applied to the full data. The cells in each k-means cluster were aggregated into a metacell by summing their expression/accessibility counts and averaging their PCA/LSI coordinates. GLUE was then pretrained on the aggregated metacells without adversarial alignment, which roughly oriented the cell embeddings but did not actually align them. To better utilize the large data size, the hidden layer dimensionality was doubled to 512 from the default 256.

In the second stage, GLUE was fine-tuned on the full single-cell data with weighted adversarial alignment (see below). As shown in previous work [27], pure adversarial alignment amounts to minimizing a generalized form of Jensen–Shannon divergence among the cell embedding distributions of different omics layers:

$$\frac{1}{K} \sum_{k=1}^{K} \mathrm{KL}\left( q_k(\mathbf{u}) \,\Big\|\, \frac{1}{K} \sum_{k'=1}^{K} q_{k'}(\mathbf{u}) \right) \tag{30}$$

where $q_k(\mathbf{u}) = \mathbb{E}_{\mathbf{x}_k \sim p_{\text{data}}(\mathbf{x}_k)}\, q(\mathbf{u} \mid \mathbf{x}_k; \phi_k)$ represents the marginal cell embedding distribution of the $k$th layer. Without other loss terms, Eq. 30 converges at perfect alignment, i.e., when $q_i(\mathbf{u}) = q_j(\mathbf{u}), \forall i \neq j$. This can be problematic when cell type compositions differ dramatically across different layers, as was the case here. To address this issue, we added cell-specific weights $w^{(n)}$ to the discriminator loss in Eq. 18:

$$\mathcal{L}_{\mathrm{D}}(\phi, \psi) = -\frac{1}{K} \sum_{k=1}^{K} \frac{1}{W_k} \sum_{n=1}^{N_k} w^{(n)} \cdot \mathbb{E}_{\mathbf{u} \sim q(\mathbf{u} \mid \mathbf{x}_k^{(n)}; \phi_k)} \log \mathrm{D}_k(\mathbf{u}; \psi) \tag{31}$$


where the normalizer $W_k = \sum_{n=1}^{N_k} w_k^{(n)}$. The adversarial alignment still amounts to minimizing Eq. 30 but with weighted marginal cell embedding distributions $q_k(\mathbf{u}) = \frac{1}{W_k} \sum_{n=1}^{N_k} w_k^{(n)}\, q(\mathbf{u} \mid \mathbf{x}_k^{(n)}; \phi_k)$. By assigning appropriate weights to balance the cell distributions across different layers, the optimum of $q_i(\mathbf{u}) = q_j(\mathbf{u}), \forall i \neq j$ could be much closer to the desired alignment. For example, a viable choice for the cell weights is organ-balancing weights. Suppose that the organ proportions in scRNA-seq and scATAC-seq are $a_1, a_2, \ldots, a_m$ and $b_1, b_2, \ldots, b_m$ ($m$ is the number of organs, $\sum_i a_i = 1$, $\sum_i b_i = 1$), respectively. We can weight the RNA cells in the $i$th organ by $\sqrt{b_i / a_i}$ and weight the ATAC cells in the $i$th organ by $\sqrt{a_i / b_i}$, so that the $i$th organ has a balanced accumulative contribution of $\sqrt{a_i b_i}$, regardless of omics layers. However, in case there are major differences among the cell type compositions within the same organ, organ-level balancing can be insufficient. As such, we designed a method to compute cell type-level balancing weights in an unsupervised manner. For each omics layer, we clustered the cell embeddings using Louvain clustering and matched the clusters in different layers via cosine similarity. The population size of each cluster was distributed to its matched counterparts by cosine similarity. Subsequently, for each omics layer, we could obtain the proportion of each cluster $a_i$ and its effective proportion $b_i$ in the opposite layer (by normalizing the effective population size received from the opposite layer). Balancing weights could then be computed as above. The weight-balanced alignment proved effective in aligning the highly skewed data (Fig. 5).

For a comparison with other integration methods, we also tried Seurat v3, which was the second-best method in our previous benchmark in terms of accuracy. Since Seurat v3 could not scale to the full dataset (Supplementary Fig. 13a), we only managed to apply it to the aggregated data, as used in the first stage of GLUE training, but the result was suboptimal (Supplementary Fig. 18).
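The metacell aggregation used in the first training stage can be sketched as follows. This is a minimal illustration under stated assumptions rather than the scglue implementation: the function name `aggregate_metacells`, the toy arrays and the brute-force nearest-centroid assignment are hypothetical, and the centroids are taken as given (in the procedure described above they come from k-means fitted on the organ-balanced subsample).

```python
import numpy as np


def aggregate_metacells(counts, coords, centroids):
    """Assign cells to fixed centroids, then sum counts and average coordinates.

    counts:    (n_cells, n_features) raw expression/accessibility counts
    coords:    (n_cells, n_dims) PCA/LSI coordinates
    centroids: (k, n_dims) centroids fitted beforehand on a balanced subsample
    """
    # Squared Euclidean distance from every cell to every centroid
    # (brute force; an illustrative choice that only suits small inputs).
    dist = ((coords[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dist.argmin(axis=1)

    k = centroids.shape[0]
    sizes = np.bincount(labels, minlength=k)

    # Accumulate counts and coordinates per cluster.
    meta_counts = np.zeros((k, counts.shape[1]), dtype=counts.dtype)
    np.add.at(meta_counts, labels, counts)
    coord_sums = np.zeros((k, coords.shape[1]), dtype=float)
    np.add.at(coord_sums, labels, coords)

    # Drop centroids that received no cells, then average the coordinates.
    # Note that `labels` still indexes the original centroids.
    keep = sizes > 0
    meta_coords = coord_sums[keep] / sizes[keep][:, None]
    return meta_counts[keep], meta_coords, labels


# Toy data: 6 cells, 3 features, 2 reduced dimensions, 2 centroids.
counts = np.array([[1, 0, 2], [0, 1, 0], [3, 0, 0],
                   [0, 2, 1], [1, 1, 1], [0, 0, 4]])
coords = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.1],
                   [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])

meta_counts, meta_coords, labels = aggregate_metacells(counts, coords, centroids)
```

Summing counts preserves the total signal of each cluster, which is what reduces dropout, while averaging the reduced coordinates keeps the metacells in the same PCA/LSI space as the single cells.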
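The weighted discriminator loss can be read as a per-layer weighted average of log discriminator outputs, normalized by the sum of the weights in that layer and then averaged over layers. The sketch below illustrates only that arithmetic. It assumes that $D_k(\mathbf{u})$ is the discriminator's predicted probability for layer $k$ and that the expectation is approximated with one sampled embedding per cell; the function name and the toy numbers are hypothetical.

```python
import math


def weighted_discriminator_loss(disc_probs, weights):
    """Weighted discriminator loss averaged over the omics layers.

    disc_probs[k][n]: D_k(u) for one embedding sampled for cell n of layer k
                      (assumed to be a probability in (0, 1])
    weights[k][n]:    cell-specific weight w_k^(n)
    """
    n_layers = len(disc_probs)
    total = 0.0
    for probs, w in zip(disc_probs, weights):
        normalizer = sum(w)  # W_k, the sum of weights in layer k
        total += sum(wi * math.log(p) for wi, p in zip(w, probs)) / normalizer
    return -total / n_layers


# Hypothetical discriminator outputs for two layers with 2 and 3 cells.
disc_probs = [[0.9, 0.6], [0.8, 0.5, 0.7]]
uniform = [[1.0, 1.0], [1.0, 1.0, 1.0]]
skewed = [[3.0, 1.0], [1.0, 1.0, 2.0]]

loss_uniform = weighted_discriminator_loss(disc_probs, uniform)
loss_skewed = weighted_discriminator_loss(disc_probs, skewed)
```

With uniform weights the expression reduces to the unweighted per-layer mean, and because each layer is divided by its own normalizer, rescaling all weights in a layer by a constant leaves the loss unchanged.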
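As a worked example of organ-level balancing, the sketch below assumes square-root weights: RNA cells in organ $i$ are weighted by $\sqrt{b_i/a_i}$ and ATAC cells by $\sqrt{a_i/b_i}$, where $a_i$ and $b_i$ are the organ proportions in the two layers. The function names and the proportions are hypothetical, and every organ is assumed to be present in both layers so that no proportion is zero.

```python
import math


def organ_balancing_weights(a, b):
    """Per-organ cell weights for two layers with organ proportions a and b."""
    if not (math.isclose(sum(a), 1.0) and math.isclose(sum(b), 1.0)):
        raise ValueError("organ proportions must sum to 1 in each layer")
    w_rna = [math.sqrt(bi / ai) for ai, bi in zip(a, b)]
    w_atac = [math.sqrt(ai / bi) for ai, bi in zip(a, b)]
    return w_rna, w_atac


def normalized_contributions(proportions, weights):
    """Share of the weighted embedding distribution contributed by each organ."""
    raw = [p * w for p, w in zip(proportions, weights)]
    total = sum(raw)  # per-cell analogue of the normalizer W_k
    return [r / total for r in raw]


# Hypothetical, strongly skewed organ compositions for three organs.
a = [0.7, 0.2, 0.1]  # scRNA-seq
b = [0.1, 0.3, 0.6]  # scATAC-seq

w_rna, w_atac = organ_balancing_weights(a, b)
rna_share = normalized_contributions(a, w_rna)
atac_share = normalized_contributions(b, w_atac)
```

Under these weights each organ contributes $a_i \sqrt{b_i/a_i} = b_i \sqrt{a_i/b_i} = \sqrt{a_i b_i}$ before normalization, so `rna_share` and `atac_share` coincide even though the raw compositions differ strongly.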


Data availability

All datasets used in this study are already published and were obtained from public data repositories. See Supplementary Table 1 for detailed information, including access codes and URLs. All benchmark data are available in Supplementary Table 2.

Code availability

The GLUE framework was implemented in the “scglue” Python package, which is available at https://github.com/gao-lab/GLUE. For reproducibility, the scripts for all benchmarks and case studies were assembled using Snakemake, which is also available in the above repository.

Acknowledgments

The authors would like to thank Drs. Zemin Zhang, Xiaoliang Sunney Xie, Letian Tao, Cheng Li, Jian Lu (at Peking University) and Yang Ding (at the Beijing Institute of Radiation Medicine) for their helpful discussions and comments during the study, as well as the authors of the datasets used in this work for their kind help.

This work was supported by funds from the National Key Research and Development Program (2016YFC0901603), the China 863 Program (2015AA020108), and the State Key Laboratory of Protein and Plant Gene Research and the Beijing Advanced Innovation Center for Genomics (ICG) at Peking University. The research of G.G. was supported in part by the National Program for Support of Top-notch Young Professionals.

Part of the analysis was performed on the Computing Platform of the Center for Life Sciences of Peking University and supported by the High-performance Computing Platform of Peking University.

Author contributions

G.G. conceived the study and supervised the research; Z.J.C. designed and implemented the computational framework and conducted benchmarks and case studies, with guidance from G.G.; Z.J.C. and G.G. wrote the manuscript.


Competing interests

The authors declare no competing interests.


References

1. Cusanovich, D.A. et al. Multiplex single cell profiling of chromatin accessibility by combinatorial cellular indexing. Science 348, 910–914 (2015).
2. Chen, X., Miragaia, R.J., Natarajan, K.N. & Teichmann, S.A. A rapid and robust method for single cell chromatin accessibility profiling. Nat. Commun. 9, 5345 (2018).
3. Luo, C. et al. Single-cell methylomes identify neuronal subtypes and regulatory elements in mammalian cortex. Science 357, 600–604 (2017).
4. Mulqueen, R.M. et al. Highly scalable generation of DNA methylation profiles in single cells. Nat. Biotechnol. 36, 428–431 (2018).
5. Picelli, S. et al. Smart-seq2 for sensitive full-length transcriptome profiling in single cells. Nat. Methods 10, 1096–1098 (2013).
6. Zheng, G.X. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017).
7. Packer, J. & Trapnell, C. Single-cell multi-omics: An engine for new quantitative models of gene regulation. Trends Genet. 34, 653–665 (2018).
8. Chen, S., Lake, B.B. & Zhang, K. High-throughput sequencing of the transcriptome and chromatin accessibility in the same cell. Nat. Biotechnol. 37, 1452–1457 (2019).
9. Ma, S. et al. Chromatin potential identified by shared single-cell profiling of RNA and chromatin. Cell 183, 1103–1116 (2020).
10. Clark, S.J. et al. scNMT-seq enables joint profiling of chromatin accessibility DNA methylation and transcription in single cells. Nat. Commun. 9, 781 (2018).
11. Wang, Y. et al. Single-cell multiomics sequencing reveals the functional regulatory landscape of early embryos. Nat. Commun. 12, 1247 (2021).
12. Lake, B.B. et al. Integrative single-cell analysis of transcriptional and epigenetic states in the human adult brain. Nat. Biotechnol. 36, 70–80 (2018).
13. Bravo Gonzalez-Blas, C. et al. Identification of genomic enhancers through spatial integration of single-cell transcriptomics and epigenomics. Mol. Syst. Biol. 16, e9438 (2020).
14. Argelaguet, R., Cuomo, A.S.E., Stegle, O. & Marioni, J.C. Computational principles and challenges in single-cell data integration. Nat. Biotechnol. (2021).
15. Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902 (2019).
16. Welch, J.D. et al. Single-cell multi-omic integration compares and contrasts features of brain cell identity. Cell 177, 1873–1887 (2019).
17. Chen, H. et al. Assessment of computational methods for the analysis of single-cell ATAC-seq data. Genome Biol. 20, 241 (2019).
18. Duren, Z. et al. Integrative analysis of single-cell genomics data by coupled nonnegative matrix factorizations. Proc. Natl. Acad. Sci. U. S. A. 115, 7723–7728 (2018).
19. Zeng, W. et al. DC3 is a method for deconvolution and coupled clustering from bulk and single-cell genomics data. Nat. Commun. 10, 4613 (2019).
20. Cao, K., Bai, X., Hong, Y. & Wan, L. Unsupervised topological alignment for single-cell multi-omics integration. Bioinformatics 36, i48–i56 (2020).
21. Demetci, P., Santorella, R., Sandstede, B., Noble, W.S. & Singh, R. Gromov-Wasserstein optimal transport to align single-cell multi-omics data. Preprint at https://www.biorxiv.org/content/10.1101/2020.04.28.066787 (2020).
22. Svensson, V., Vento-Tormo, R. & Teichmann, S.A. Exponential scaling of single-cell RNA-seq in the past decade. Nat. Protoc. 13, 599–604 (2018).


23. Kozareva, V. et al. A transcriptomic atlas of the mouse cerebellum reveals regional specializations and novel cell types. Preprint at http://biorxiv.org/content/early/2020/03/05/2020.03.04.976407.abstract (2020).
24. Cao, J. et al. A human cell atlas of fetal gene expression. Science 370, eaba7721 (2020).
25. Domcke, S. et al. A human cell atlas of fetal chromatin accessibility. Science 370, eaba7612 (2020).
26. Lopez, R., Regier, J., Cole, M.B., Jordan, M.I. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).
27. Cao, Z.J., Wei, L., Lu, S., Yang, D.C. & Gao, G. Searching large-scale scRNA-seq databases via unbiased cell embedding with Cell BLAST. Nat. Commun. 11, 3458 (2020).
28. Kipf, T.N. & Welling, M. Variational graph auto-encoders. Preprint at https://arxiv.org/abs/1611.07308 (2016).
29. Dou, J. et al. Unbiased integration of single cell multi-omics data. Preprint at https://www.biorxiv.org/content/10.1101/2020.12.11.422014 (2020).
30. 10x Genomics. PBMC from a healthy donor, single cell multiome ATAC gene expression demonstration data by Cell Ranger ARC 1.0.0. https://support.10xgenomics.com/single-cell-multiome-atac-gex/datasets/1.0.0/pbmc_granulocyte_sorted_10k (2020).
31. Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).
32. Singh, R. et al. Unsupervised manifold alignment for single-cell multi-omics data. In Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. (ACM, Virtual Event, USA, 2020).
33. Saunders, A. et al. Molecular diversity and specializations among the cells of the adult mouse brain. Cell 174, 1015–1030 (2018).
34. 10x Genomics. Fresh cortex from adult mouse brain (v1), single cell ATAC demonstration data by Cell Ranger 1.1.0. https://support.10xgenomics.com/single-cell-atac/datasets/1.1.0/atac_v1_adult_brain_fresh_5k (2019).
35. Mo, A. et al. Epigenomic signatures of neuronal diversity in the mammalian brain. Neuron 86, 1369–1384 (2015).
36. Wang, M., Zhao, Y. & Zhang, B. Efficient test and visualization of multi-set intersections. Sci. Rep. 5, 16923 (2015).
37. Gabel, H.W. et al. Disruption of DNA-methylation-dependent long gene repression in Rett syndrome. Nature 522, 89–93 (2015).
38. Dekker, J., Marti-Renom, M.A. & Mirny, L.A. Exploring the three-dimensional organization of genomes: Interpreting chromatin interaction data. Nat. Rev. Genet. 14, 390–403 (2013).
39. Pliner, H.A. et al. Cicero predicts cis-regulatory DNA interactions from single-cell chromatin accessibility data. Mol. Cell 71, 858–871 (2018).
40. Javierre, B.M. et al. Lineage-specific genome architecture links enhancers and non-coding disease variants to target gene promoters. Cell 167, 1369–1384 (2016).
41. Aguet, F. et al. Genetic effects on gene expression across human tissues. Nature 550, 204–213 (2017).
42. Han, H. et al. TRRUST v2: An expanded reference database of human and mouse transcriptional regulatory interactions. Nucleic Acids Res. 46, D380–D386 (2018).
43. Thomsen, E.R. et al. Fixed single-cell transcriptomic characterization of human radial glial diversity. Nat. Methods 13, 87–93 (2016).
44. Pollen, A.A. et al. Molecular identity of human outer radial glia during cortical development. Cell 163, 55–67 (2015).
45. Tran, H.T.N. et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol. 21, 12 (2020).


46. Stark, S.G. et al. SCIM: Universal single-cell matching with unpaired feature sets. Bioinformatics 36, i919–i927 (2020).
47. Yang, K.D. et al. Multi-domain translation between single-cell imaging and sequencing data using autoencoders. Nat. Commun. 12, 31 (2021).
48. Bandura, D.R. et al. Mass cytometry: Technique for real time single cell multitarget immunoassay based on inductively coupled plasma time-of-flight mass spectrometry. Anal. Chem. 81, 6813–6822 (2009).
49. Bartosovic, M., Kabbe, M. & Castelo-Branco, G. Single-cell CUT&Tag profiles histone modifications and transcription factors in complex tissues. Nat. Biotechnol. 39, 825–835 (2021).
50. Ashuach, T., Reidenbach, D.A., Gayoso, A. & Yosef, N. PeakVI: A deep generative model for single cell chromatin accessibility analysis. Preprint at https://www.biorxiv.org/content/10.1101/2021.04.29.442020 (2021).
51. Hamilton, W., Ying, Z. & Leskovec, J. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems. (eds. I. Guyon et al.) 1024–1034 (Curran Associates, Inc., Long Beach, CA, USA, 2017).
52. Veličković, P. et al. Graph attention networks. Preprint at https://arxiv.org/abs/1710.10903 (2017).
53. Vashishth, S., Sanyal, S., Nitin, V. & Talukdar, P. Composition-based multi-relational graph convolutional networks. In Proceedings of the 8th International Conference on Learning Representations. (Addis Ababa, Ethiopia, 2020).
54. Stuart, T. & Satija, R. Integrative single-cell analysis. Nat. Rev. Genet. 20, 257–272 (2019).
55. Amodio, M. & Krishnaswamy, S. MAGAN: Aligning biological manifolds. In Proceedings of the 35th International Conference on Machine Learning. (eds. J.G. Dy & A. Krause) 215–223 (PMLR, Stockholm, Sweden, 2018).
56. Tarashansky, A.J. et al. Mapping single-cell atlases throughout metazoa unravels cell type evolution. eLife 10, e66747 (2021).
57. Ding, J. & Regev, A. Deep generative model embedding of single-cell RNA-seq profiles on hyperspheres and hyperbolic spaces. Nat. Commun. 12, 2554 (2021).
58. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. & Dean, J. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. (eds. C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani & K.Q. Weinberger) 3111–3119 (Curran Associates, Inc., Lake Tahoe, NV, USA, 2013).
59. Kipf, T.N. & Welling, M. Semi-supervised classification with graph convolutional networks. In Proceedings of the 5th International Conference on Learning Representations. (Toulon, France, 2017).
60. Dincer, A.B., Janizek, J.D. & Lee, S.-I. Adversarial deconfounding autoencoder for learning robust gene expression embeddings. Bioinformatics 36, i573–i582 (2020).
61. Goodfellow, I. et al. Generative adversarial nets. In Advances in Neural Information Processing Systems. (eds. Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence & K.Q. Weinberger) 2672–2680 (Curran Associates, Inc., Montreal, Quebec, Canada, 2014).
62. Aibar, S. et al. SCENIC: Single-cell regulatory network inference and clustering. Nat. Methods 14, 1083–1086 (2017).
63. Davis, C.A. et al. The encyclopedia of DNA elements (ENCODE): Data portal update. Nucleic Acids Res. 46, D794–D801 (2018).


Supplementary Figures

[Figure: UMAP panels for GLUE, Seurat v3, bindSC, LIGER and UnionCom, each colored by cell type and by omics layer (scATAC-seq, scRNA-seq).]

Supplementary Fig. 1 UMAP visualizations of the cell embeddings in the SNARE-seq dataset aligned with different integration methods.


[Figure: UMAP panels for GLUE, Seurat v3, bindSC and LIGER, each colored by cell type and by omics layer (scATAC-seq, scRNA-seq); the UnionCom panels are marked N.A.]

Supplementary Fig. 2 UMAP visualizations of the cell embeddings in the SHARE-seq dataset aligned with different integration methods.
UnionCom failed to run because of memory overflow.


[Figure: UMAP panels for GLUE, Seurat v3, bindSC, LIGER and UnionCom, each colored by cell type and by omics layer (scATAC-seq, scRNA-seq).]

Supplementary Fig. 3 UMAP visualizations of the cell embeddings in the 10x Multiome dataset aligned with different integration methods.


[Figure: FOSCTTM (y axis) against hyperparameter value (x axis) for the SNARE-seq, SHARE-seq and 10x Multiome datasets, one panel per hyperparameter. Values tested: Dimensionality 10, 30, 50, 100, 200; Preprocessing dimensionality 30, 50, 100, 200, 400; Hidden layer depth 0, 1, 2, 3, 4; Hidden layer dimensionality 64, 128, 256, 512, 1024; Dropout 0, 0.1, 0.2, 0.3, 0.5; Lambda graph 0.001, 0.005, 0.02, 0.1, 0.5; Lambda align 0.001, 0.005, 0.02, 0.1, 0.5; Negative sampling rate 2, 5, 10, 20, 50.]

Supplementary Fig. 4 FOSCTTM of GLUE under different hyperparameter settings.
“Dimensionality” denotes the cell embedding dimensionality. “Preprocessing dimensionality” is the reduced dimensionality used for the first transformation layers of the data encoders (see Methods). “Hidden layer depth” is the number of hidden layers in the data encoders and modality discriminator. “Hidden layer dimensionality” is the dimensionality of hidden layers in the data encoders and modality discriminator. “Dropout” is the dropout rate of hidden layers in data encoders and modality discriminator. “Lambda graph” is the weight of the graph loss ($\lambda_{\mathcal{G}}$). “Lambda align” is the weight of the adversarial alignment ($\lambda_D$). “Negative sampling rate” is the number of empirical samples used in negative edge sampling (samples from $p_{\mathrm{ns}}$). For each hyperparameter, the center value is the default. To control computational cost, one hyperparameter was varied at a time, with all others set to their default values. The performance of GLUE was robust across a wide range of hyperparameter settings, except for failed alignments in which the adversarial alignment weight was too low or no hidden layers were used in the neural networks (equivalently a linear model with insufficient capacity).


[Figure: panels a–c and d–f are UMAP plots (axes UMAP1, UMAP2) for scRNA-seq, snmC-seq and scATAC-seq; panels g–i are alluvial diagrams. Common cell types in d–f: mL2/3, mL4, mL5-1, mDL-1, mDL-2, mL5-2, mL6-1, mL6-2, mDL-3, mIn-1, mVip, mNdnf, mPv, mSst.]

Supplementary Fig. 5 Triple-omics alignment and label transfer.
a-c, UMAP visualizations of the cell embeddings before integration for a, scRNA-seq, b, snmC-seq, and c, scATAC-seq, colored by the original cell types. d-f, UMAP visualizations of the integrated cell embeddings for d, scRNA-seq, e, snmC-seq, and f, scATAC-seq, colored by the unified cell types (labels transferred from snmC-seq). g-i, Alluvial diagrams comparing the original cell types and unified cell types for g, scRNA-seq, h, snmC-seq, and i, scATAC-seq. The original cell types are to the left, and the unified cell types are to the right. Cells relabeled to “mPv” and “mSst” are highlighted with green flows. Cells relabeled to “mNdnf” and “mVip” are highlighted with dark blue flows. Cells relabeled to “mDL-3” are highlighted with light blue flows.


[Figure: panels a–c show, for each cell type (mL2/3, mL4, mL5-1, mDL-1, mDL-2, mL5-2, mL6-1, mL6-2, mDL-3, mVip, mNdnf, mPv, mSst), the mean expression, mean methylation and mean accessibility in group (color scale 0.0 to 1.0) of the same set of marker genes.]

Supplementary Fig. 6 Consensus cell type markers across three omics layers.
a, Expression in scRNA-seq. b, Gene body DNA methylation in snmC-seq. c, Chromatin accessibility in scATAC-seq. Note that an inverted color bar is used for DNA methylation.


[Figure: a, box plots of $R^2$ by omics layer (mCH, mCG, ATAC), colored by cell type (mL2/3, mL4, mL5-1, mDL-2, mL6-2), annotated left to right with β = 7.72e-03, 2.37e-03, -2.61e-03 and P = 5.92e-22, 7.08e-05, 8.52e-07; b, mCH, mCG and ATAC plotted against gene expression; c, association by omics layer (mCH, mCG, ATAC) for the gene statistics length, expression mean and expression variability.]

Supplementary Fig. 7 Epigenetic contributions to gene expression.
a, Coefficient of determination ($R^2$) for predicting gene expression based on each epigenetic layer in different cell types. The box plots indicate the medians (centerlines), means (triangles), 1st and 3rd quartiles (hinges), and minima and maxima (whiskers). Above each epigenetic layer, the linear regression slope ($\beta$) and its P value are displayed, with the cell type as the regressor. b, Pearson’s correlations between different epigenetic states and gene expression. c, Associations (defined as the absolute values of Pearson’s correlations) between the correlations in b and different gene characteristics.


[Figure: a, b, UMAP plots (axes UMAP1, UMAP2) colored by cell type (CD4 Naive, CD4 TCM, CD4 TEM, CD8 Naive, CD8 TEM_1, CD8 TEM_2, CD14 Mono, CD16 Mono, Memory B, Naive B) and by omics layer (scATAC-seq, scRNA-seq); c, pairwise comparison of regulatory scores for seeds 0 to 3, with r between 0.960 and 0.966; d, GLUE regulatory score by genomic distance (0-25 kb to 125-150 kb), grouped by eQTL support; e, GLUE regulatory score against Spearman correlation; f, eQTL prediction ROC curves (TPR against FPR): Cicero (AUC = 0.534), Spearman (AUC = 0.569), GLUE (AUC = 0.619).]

Supplementary Fig. 8 Model-based regulatory inference using distance-based power-law interactions as guidance.
a, b, UMAP visualizations of the integrated cell embeddings colored by a, cell types, and b, omics layers. c, Pearson’s correlation coefficients for the GLUE regulatory scores across different random seeds. d, GLUE regulatory scores for peak-gene pairs across different genomic ranges, grouped by whether they had eQTL support. The box plots indicate the medians (centerlines), means (triangles), 1st and 3rd quartiles (hinges), and minima and maxima (whiskers). e, Comparison between the GLUE regulatory scores and the empirical peak-gene correlations computed on paired cells. Peak-gene pairs are colored by whether they had eQTL support. f, ROC (receiver operating characteristic) curves for predicting eQTL interactions based on different peak-gene association scores.


[Figure: a, b, UMAP plots (axes UMAP1, UMAP2) colored by cell type (CD4 Naive, CD4 TCM, CD4 TEM, CD8 Naive, CD8 TEM_1, CD8 TEM_2, CD14 Mono, CD16 Mono, Memory B, Naive B) and by omics layer (scATAC-seq, scRNA-seq); c, d, GLUE regulatory score against Spearman correlation, colored by pcHi-C and eQTL support; e, -log10 P-value by cis-regulatory region (Distance, pcHi-C, eQTL, Correlation, GLUE).]

Supplementary Fig. 9 Model-based regulatory inference using a combination of distance-based, eQTL and pcHi-C interactions as guidance.
a, b, UMAP visualizations of the integrated cell embeddings colored by a, cell types, and b, omics layers. c, d, Comparison between the GLUE regulatory scores and the empirical peak-gene correlations computed on paired cells. The peak-gene pairs are colored by whether they had c, pcHi-C support or d, eQTL support. Red dashed lines indicate the 75th percentile of the GLUE regulatory scores, which was used as a cutoff. The GLUE-identified interactions mostly exhibited positive empirical correlations and covered the majority of pcHi-C and eQTL support, while the selection of peak-gene pairs based solely on empirical correlation would lead to much lower external support. e, Consistency between the TF-target gene networks constructed with different peak-gene association methods and the manually curated connections in the TRRUST v2 database (Fisher’s exact test).


AL445430.2 EIF2AK3 P2RX5 TNS3 ATP2B1 AKIRIN2 PIP5K1BAL079338.1 AC107884.1 MYO1D FANCD2 FCRL2 LIPA STON2 LPGAT1 TCL1A IL7 AC119396.1 FCHSD2 C12orf42 MIR155HG SUPT3H PACSIN2 XYLT1 CD200 LHFPL2 CIITA ADGRG5 MNAT1 SLC25A13 ZNF41 POU2F2 LINC02785 C5 AC007384.1 TLR10AC060234.3 PMEPA1 PPP1R16B AL133467.2 RPS12 TATDN3 CD83 LAMC1 AC008014.1 IKZF3 AL354740.1 NMNAT1 TMEM252-DT CCDC50 AC006504.5 CD72 CAMK1D ENTPD1 PPP3R1 INSR TMED8 AL355076.2 RPS27 PXK GPM6A LINC00494 AL009177.1 NFKBIA CSNK1G1 GAS7 WDFY2 NEK6 LAT2 CCR6 LINC01754 CNR1 AC007381.1 AL121935.2 RPL13A AKNA ZNF77 AC008462.1 CLEC7A PDE8A TNFRSF13C DAPP1LINC01374 MCPH1-AS1 KCNQ5 MBD2 CLEC17A AL121888.1ZNF821 DNMBP-AS1 AL139021.1 LINC02798 GBE1 ICAM4 LRSAM1 COL19A1 VPREB3 IGHE CHL1 TCL6 ITLN1 ANXA1 CUEDC1 CDKN1A MEF2C ALPK1 FCRLA AL592429.2 IGHA2 WDFY4 BX284613.2 GALNT14 AC109446.3 CCDC138 FRAS1 MEF2B ADGRB3 TBC1D12 KLF6 WARS2 MARK1 AL139020.1 CLVS1 CERNA2 TMEM156 TPT1 TRIM25 ACSL4 METTL25 PTK2B ENAH IRAK2 FMO5 ZNF425 GNA12 TCP10L2 ZNF764 RLF WSB1 TMEM154 THRB NID1 MS4A1 ITPR1 ZNF608 TMEM117 PTPRG CC2D2B AC022364.1 SLC7A7 FAM177B LINC01184 IGHV5-51 TPST1 AC080132.1 CDC42BPA NR4A2 FGD4 KIAA0930 ANKRD33B RHEX TRAK1 MOB3B SAT1 C20orf194 VMP1 EAF2 PLEKHG1 IGHG4 LY86 GRK5 DUSP1 AL158801.2 KSR1 BCL11A SMIM14 TCL1B AC011290.1 SPATC1 DZIP3 HBA1 AC100803.2 CWC27 DENND3 PELI1 ARAP2 DCLK2 EPB41L4A SLC2A5 HSH2D AC005037.1 CCDC122 GRIK4 FAM81A CXCL10 PLEKHG2 ZFP36L1 SORT1 AC009522.1 CLECL1 LINC02202 ITGAL IGKV4-1 AL590068.4 LINC01588 ZNF367 PTPRJ GSAP UVRAG BTLA STAP1 LINC01182 FAM30A BTNL9 PEAK1 EPS8 HMGA2 AC020907.3 ST8SIA4 TET2 VAV3-AS1 LINC00886 CLNK AC083837.1 SIGLEC6 AL132780.2 HSPA1B IL15RA NKAPL COLCA2 GNB5 GAB1 AL359317.1 P4HA1 ADAP2 NCOA4 CHD9 PLEKHA8 HMGA1P4 TESK2 SNAP25MACC1 UGT8 MEF2C-AS1 RHBDD1 AC104781.2 AL354696.2 AC092171.4 POC1B-AS1 CTSZ PTK2 SMAD3 AFF3 LINC02550 EPB41L2 HVCN1 USP28 KLHL7 AC073140.1 GCH1 CLIP2SH3RF1 DOP1B BLNKCD22 BMP2K SERAC1 SMCO4 MTPN CD74 AUTS2 LINC01237 PCDH9 HLA-DQB1 HIST1H1D GBP5 DHRS7B IGHM 
SLC15A2 BLK MARCH3 ZNF229 LINC01480 DDX19B ZNF800 CEBPB LYN HLA-DPB1 USTBEND4 LINC00926 AC099063.1 EPSTI1 XRCC4 BCL2A1 CD86 HLA-DMA EFHD2 SWAP70 LINC01857 ANKAR SCN3A SPRY1 PSMD5 GBP1 IFIT3 PTPN6 CD93 RASSF4 SCIMP SIPA1L3 DHTKD1 GBP2 SIRPD SH3BGRL NAGK KMO STRBP LINC02397 AC111006.1 EFCAB11 IGFBP6 IRF1 IGSF6 NEDD9 NIBAN3 TRIO RAB30 CCSER1 STK33 C15orf65 AL360178.1 ZNF786 MB21D2 AC069243.1 ATP1B3 SLC28A3 CD300C FGR PIK3AP1 STX7 EML6 KLHL14 IGHG3 FRG1-DT AC242426.2 FAM110A LMO2 GRN CPB2-AS1 TLR7 ADAM28 IGHDPOU2AF1 LINC01685 KCNG1 XKR6 MCM8 LMTK2 PECAM1 SCARB2 HLA-DMB GNG7 PAX5 IFIH1 SOD2 CASP1 TYROBP CFD FTH1 SYT17 GEN1 RAPGEF5 KCNIP2-AS1 GPR160 GPR137B IFNGR1 BTK FCER2 SREBF2 AC002480.2 BARD1 SOAT1 SLC16A3 CX3CR1 LINC02185 AC011476.3 ATL1 PRKN AC018695.2 USP25 TUT7 LINC02085 LIMD1 AP000787.1 CLIC4DTX1 HRK TIMP3 GSTCD NABP1 WARS MPEG1 AC092546.1 RAB3D RHBDF2 TPD52 AC025287.4 SLC30A4 AHNAK GSTP1 AP002892.1 FCRL5 MIPEP SNX25 HLA-DRA GBP4 FTCDNL1 SEZ6L HLA-DOB ACTB SNX8 MRVI1-AS1 AL590101.1 TMX4 AC024619.3 CD68 CHIT1 LYPLAL1-DT ARID5B AP001636.3 NTNG1 EMILIN2 FILIP1L ZNF860 FAM111B HDX AIF1 GOLIM4 BANK1 IRF4 AC103831.1 ANGPT2 AC009313.1 ETV4 LCP1 TLR4 PDE6D CHPT1 IFIT1 CAMK2D ZEB1 MPP6 IGLC1 TNFSF10 RNF135 S100A11 MAP3K8 CD24 ZNF423 CD300LF CD33 B4GALT6 ITGAX TSHZ2 AC097662.1 AC133961.1 LARGE1 AC025164.1 TMEM51 MYOF ABI3 LIX1 GPR183 PEG10 SDK1 ZHX3 GALNT1 PPCDC CYBB AC132219.1 CD69 CPSF3 CDC42EP4 LILRA6 GINS1 PSCA AC091978.1 LINC02777 SPI1 SCPEP1 PPM1L GLYATL1 AK9 SIMC1 AC025048.4 HLA-DRB1 TPM3 POLE2 TNFAIP8L3 CEP112 OSBPL10 ZFP62 SLC4A8 AL022318.5 CBFA2T3 LGALSL-DT KIF20B AC008083.3 LILRB2 CSF1R VASH2 IGHGP SCN2A CPNE5 AL050309.1 H2AFZ FIG4 PRAM1 FCGR2A PRR11 DTWD1 ANKRD55 DNMBP HECTD2 NR2E3 PTPN18 INPP5B TNFSF13 CARMIL1 IGLC2 AC008883.2 BIRC5 SIGLEC10 CLIC2 DOK2 HOTAIRM1 PCBD2 CBLN3 MYBL2 DLG2 BLM AC008083.2 LACTB C1orf162 WDR11-AS1 GMEB1 GDPD1 AC011611.4 TTYH3 CFP PPM1F CCDC191 ZNF569 TRAF3 OAS1 MS4A7 CAAP1 ATP6V1B2 ITPK1 ORAI2 AC004540.1 ARHGAP19 LCTL CHEK1 
LINC01762 DOK3 NLRC4 SLC18A1 TBC1D9 CMC1 SRC AC064805.1 DMXL2 RRM1 OSBPL11 C3AR1 CMKLR1 ZRANB3 NAP1L1 MCM4 CD300E FGL2 CTSS HLA-DRB5 MIR3142HG RNF122 TOE1 GPRC5D-AS1 CTNND1 KDM1B FCN1 SFMBT2 CXCL16 ESR2 HIST1H3C ARID3A CENPM SLC16A1 WDR76 EFCAB6 SLC1A4 ABHD15-AS1 TBC1D8 BID ALOX5 IFITM3 DIAPH3 CDC42EP3 RNF144B ST7-AS2 VRK1 DSCC1 HIST1H3G FEN1 ZBTB12 AC117834.1 DUSP6 SAMD9L ILKIFITM2 SLC39A5 CDCA7 SLC44A5 LILRA2 FCER1G LPCAT2 CCDC6 RTTN CPLANE1 CENPP GINS2 MCM3 ABHD18 PLPPR4 NCF2 TRIP4 AC011257.1 RPL10 ZNF385A MIS18BP1 HIST1H1B MMS22L CDT1 PAX6 ADGRE2 GRK3 AL355881.1 LILRB1 MBOAT1 BMP4 CDC7 TYMSOS HIST1H4H EPM2A ANXA4 FANCL PSEN1 TLR8 TBC1D32 SMC4 SMC2 CDC6 THEMIS2 POLR2G MCM6 HIST1H2BB CEP152 CSTA ARAP1 IFI30 CALML4 SIPA1L2 CERS4 GAPDH ODC1 GMNN PRIM2 AP1S2 ITGA4 SH2D3C PRKCE PPP1R9A BUB1B LMNB1 KDM2B GABARAP LILRA1 FANCI TUBB MCM10 HLA-DPA1 LRRC25 LST1 GP6 POLQ ORC1 SH3BGRL3 KCTD12 HBQ1 TK1 FANCM E4F1 VASP LILRA5 CCL4L2 CCDC170 HIST1H3F ASF1B CD300A ALDH3B1 SLC10A7 WWOX NPDC1 HJURP PTGIR TMSB10 TUBA1B CNBD2 CCND2 MCM2 CPPED1 IKBIP E2F8 AIG1 ATAD5 XYLB POLA1 FIRRE SLC16A6 KNTC1 TICAM2 SPN ZCWPW1 CLSPN UBE2S NDC80 AL031602.2 MNT E2F7 CPED1 SPC24 CCDC167 HIST1H4C SUFU SLC20A1 HMGB2 NCAPG2 ZNF143 AC006001.2 ARHGEF11 ZNF701 LARP1B SAMD13 PKMYT1 MCM7 CDCA5 QSER1 LINC01376 KIF14 DHX32 AGPAT4 ARL15 CEP290 RBL1 TYMS RUSC2 SLC27A6 AHCYL2 IQCH CENPU STIL WDHD1 PLK4 TIPIN TTC21A ALMS1 CAMKMT RMI1 SSX2IP HIST1H3B TEX2 H2AFX DUT VKORC1L1 DTL C22orf34 SRSF3 STMN1 FBXO5 KHSRP CPEB2 ABL1 SLC35D1 BRCA2 FBXO36 E2F2 MND1 XRCC2 MAML2 SHCBP1 NDC1 CHD2 CTPS2 FLNA RFC3 ATP8B4 UBTD2 MAP3K13 KIFC1 STX18 RAD18 UBE2O AC137630.1 ASPM PATL2 ISG15 NFKB2 PLAGL1 ACTA2 FAM222B CCDC34 RAD51 RILPL2 NFXL1 SEPTIN11 AC040169.1 SWT1 SOGA3 AC019077.1 TMSB4X NIBAN1 PIGX GPHN SLC35G1 ARHGEF12 CCNB2 TMTC3 AP002495.1 UBXN10-AS1 EZH2 TMEM38B WDSUB1 GPATCH2 GALM DIXDC1 EEF1A1 SHLD2 RB1 EML5 RBM17 RBM15-AS1 RSU1 NDFIP2 TRIM71 KDM4A-AS1 TRIM62 AL670729.3 ZNRF2 REEP3 LINC01301 POLR3B RPS18 APLF PTPN13 HLF ZNF7 
TRAF3IP2-AS1 CRTC1 NUDCD1 RARG GLRA1 USP31 PERP ACACA AC007216.4

LINC02846 BICD1 ZNF829 PIWIL2 ZNF585B NEO1 DEPDC4 CADM1 ACER2 FBXO32 ZNF792 GABPB1 LINC01715 FAM78B PDGFD CFAP299 1 2 Supplementary Fig. 10 Inferred TF-target gene regulatory network in PBMC. 3


[Figure: six panels (a-f), one per gene, showing expression level by cell type. Cell types on the x-axis: CD4 Naive, CD4 TCM, CD4 TEM, CD8 Naive, CD8 TEM_1, CD8 TEM_2, Naive B, Memory B, CD14 Mono, CD16 Mono.]

Supplementary Fig. 11 TF-target gene expression levels in different cell types.
a, b, Expression levels of a, NCF2, and its regulator b, SPI1 in different cell types. c-f, Expression levels of c, CD83, and its inferred regulators d, BCL11A, e, PAX5, and f, RELB in different cell types.


[Figure: six genome-browser panels (a-f), one per target gene. Tracks in each panel: GLUE, pcHi-C, eQTL, ATAC, TF ChIP, and genes; x-axis, genomic position (kb).]

Supplementary Fig. 12 Regulatory inference and evidence per gene.
GLUE-identified cis-regulatory interactions for a, FCER2, b, IFIT3, c, ITGAX, d, KLRD1, e, IL2RB, and f, GBP2, along with individual pieces of regulatory evidence. For each gene, the ChIP-seq tracks correspond to inferred TF regulators. Known regulators are highlighted with green boxes.


[Figure: panel a, time (s) against number of cells, with fitted slopes LIGER (β = 1.218), bindSC (β = 1.279), Seurat v3 (β = 1.717), and GLUE (β = 0.671); panel b, organ compositions, scATAC-seq against scRNA-seq; panels c and d, UMAP1 against UMAP2 colored by scRNA-seq and scATAC-seq cell type; panel e, alluvial diagram.]

Supplementary Fig. 13 Scalability benchmarking and integration of a multi-omics human cell atlas.
a, Time costs of different methods on subsampled atlases of varying sizes. The time costs and cell numbers are both plotted in log-scale, so the slope β represents the exponent in original scale, i.e., β = 2 represents quadratic scalability, β = 1 represents linear scalability, and β < 1 represents sublinear scalability. b, Organ compositions in scRNA-seq and scATAC-seq. c, UMAP visualizations of the integrated cell embeddings showing only the scRNA-seq cells, colored by the original cell types. d, UMAP visualizations of the integrated cell embeddings showing only the scATAC-seq cells, colored by the original cell types. e, Alluvial diagram comparing the NNLS (non-negative least squares)-based and GLUE-based cell type annotations. The original NNLS-based annotations are to the left, and the GLUE-based annotations are to the right. Cells originally labeled as “Astrocytes” but mapped to “Excitatory neurons” are highlighted with pink flows. Cells originally labeled as “Astrocytes/Oligodendrocytes” but mapped to “Astrocytes” are highlighted with blue flows. Cells originally labeled as “Astrocytes/Oligodendrocytes” but mapped to “Oligodendrocytes” are highlighted with brown flows.
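The relation between the log-log slope and the scaling exponent in panel a can be made concrete. If run time follows t = c · n^β, then log t = log c + β · log n, so an ordinary least-squares line fitted to the log-transformed values has slope β. The following is a minimal sketch of that fit; the function name `loglog_slope` and the timings are illustrative assumptions, not the benchmark code or data of the study.

```python
import math

def loglog_slope(n_cells, seconds):
    """Least-squares slope of log(seconds) against log(n_cells)."""
    xs = [math.log(n) for n in n_cells]
    ys = [math.log(t) for t in seconds]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy / sxx

# Synthetic timings (assumed for illustration): t = 0.01 * n ** 1.5
sizes = [30_000, 100_000, 300_000]
times = [0.01 * n ** 1.5 for n in sizes]
beta = loglog_slope(sizes, times)  # recovers the exponent used above
```

A slope below 1 under this fit means that a tenfold increase in cell number raises the run time by less than tenfold.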


[Figure: UMAP panels (UMAP1 against UMAP2) for PAX6, SOX2, HES1, VIM, HOPX, LIFR, PDGFD, and GLI3, shown for scRNA-seq and scATAC-seq.]

Supplementary Fig. 14 Gene expression and chromatin accessibility patterns of neural progenitor markers in cerebrum cells.
Putative neural progenitors are highlighted with pink arrows, astrocytes are highlighted with blue arrows, and oligodendrocytes are highlighted with brown arrows. SOX2 and VIM were not detected in scATAC-seq, due to limitations in ATAC gene-level score calculation (no peaks overlap gene body and 2 kb upstream from TSS).
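The limitation noted in this legend can be illustrated with a small sketch. It assumes that a gene-level ATAC score is obtained by summing the counts of peaks that overlap the gene body extended 2 kb upstream of the TSS; the summation, the function names, and the coordinates are hypothetical and serve only to show why a gene with no overlapping peak receives no score.

```python
def gene_window(start, end, strand, upstream=2000):
    """Gene body extended `upstream` bp beyond the TSS (0-based, half-open)."""
    if strand == "+":
        # TSS is at `start`; upstream means lower coordinates.
        return max(0, start - upstream), end
    # Minus strand: TSS is at `end`; upstream means higher coordinates.
    return start, end + upstream

def gene_activity(gene, peaks, upstream=2000):
    """Sum counts of peaks overlapping the window; None if no peak overlaps."""
    chrom, start, end, strand = gene
    w_start, w_end = gene_window(start, end, strand, upstream)
    hits = [count for (p_chrom, p_start, p_end, count) in peaks
            if p_chrom == chrom and p_start < w_end and p_end > w_start]
    return sum(hits) if hits else None

# Hypothetical peaks: (chromosome, start, end, count)
peaks = [
    ("chr1", 8_500, 9_000, 3),    # within 2 kb upstream of gene_a's TSS
    ("chr1", 14_000, 14_500, 2),  # inside gene_a's body
    ("chr1", 20_000, 20_500, 7),  # outside both windows
    ("chr2", 9_000, 9_500, 4),    # different chromosome
]
gene_a = ("chr1", 10_000, 15_000, "+")
gene_b = ("chr1", 30_000, 32_000, "-")

score_a = gene_activity(gene_a, peaks)  # two overlapping peaks contribute
score_b = gene_activity(gene_b, peaks)  # no overlapping peak, so no score
```

Returning `None` rather than 0 keeps "not detected" distinct from "detected with zero accessibility", which matches how the legend describes the affected genes.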


[Figure: UMAP panels (UMAP1 against UMAP2) for EOMES, SOX5, RUNX1T1, EPHA3, FAM19A1, LMO3, SATB2, and UNC5D, shown for scRNA-seq and scATAC-seq.]

Supplementary Fig. 15 Gene expression and chromatin accessibility patterns of excitatory neuron markers in cerebrum cells.
Putative neural progenitors are highlighted with pink arrows, astrocytes are highlighted with blue arrows, and oligodendrocytes are highlighted with brown arrows. EOMES was not detected in scATAC-seq, due to limitations in ATAC gene-level score calculation (no peaks overlap gene body and 2 kb upstream from TSS).


[Figure: UMAP panels (UMAP1 against UMAP2) for SLC1A3, AQP4, PAX3, GFAP, RFX4, PTN, MMD2, and ATP1A2, shown for scRNA-seq and scATAC-seq.]

Supplementary Fig. 16 Gene expression and chromatin accessibility patterns of astrocyte markers in cerebrum cells.
Putative neural progenitors are highlighted with pink arrows, astrocytes are highlighted with blue arrows, and oligodendrocytes are highlighted with brown arrows.


[Figure: UMAP panels (UMAP1 against UMAP2) for SOX10, LHFPL3, PCDH15, SCN1A, ZNF365, PDGFRA, BRINP3, and PDE4B, shown for scRNA-seq and scATAC-seq.]

Supplementary Fig. 17 Gene expression and chromatin accessibility patterns of oligodendrocyte markers in cerebrum cells.
Putative neural progenitors are highlighted with pink arrows, astrocytes are highlighted with blue arrows, and oligodendrocytes are highlighted with brown arrows.


[Figure: two UMAP panels (UMAP1 against UMAP2). Panel a, "Omics layer", with legend entries scATAC-seq and scRNA-seq; panel b, "Cell type", with one legend entry per annotated cell type of the atlas.]

Supplementary Fig. 18 UMAP visualization of the Seurat v3 integration of the aggregated multi-omics human cell atlas.
Coembeddings of the aggregated metacells colored by a, omics layers, and b, cell types.


Supplementary Table Legends

Supplementary Table 1 Public datasets used in the study.

Supplementary Table 2 Detailed benchmark data.

Supplementary Table 3 Regulatory interactions in the GLUE-derived TF-target gene network.