Dissecting heterogeneous cell populations across drug and disease conditions with PopAlign

Sisi Chena,b,1 , Paul Rivauda,b , Jong H. Parka,b, Tiffany Tsoua,b , Emeric Charlesc, John R. Haliburtond, Flavia Pichiorrie, and Matt Thomsona,b,1

aDivision of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125; bBeckman Center for Single-Cell Profiling and Engineering, California Institute of Technology, Pasadena, CA 91125; cDepartment of Molecular and Cell Biology, University of California Berkeley, Berkeley, CA 94720; dAugmenta Bioworks Inc., Menlo Park, CA 94025; and eDepartment of Hematologic Malignancies Translational Science, City of Hope, Monrovia, CA 91016

Edited by Jonathan S. Weissman, University of California, San Francisco, CA, and approved September 25, 2020 (received for review March 31, 2020)

Single-cell measurement techniques can now probe expres- The key conceptual advance underlying PopAlign is repre- sion in heterogeneous cell populations from the human body sentational: we model the distribution of gene-expression states across a range of environmental and physiological conditions. within a heterogeneous cell population using a probabilistic However, new mathematical and computational methods are mixture model that we infer from single-cell data. PopAlign required to represent and analyze gene-expression changes that identifies and represents subpopulations of cells as indepen- occur in complex mixtures of single cells as they respond to sig- dent Gaussian densities within a reduced gene-expression nals, drugs, or disease states. Here, we introduce a mathematical space identified by orthogonal nonnegative matrix factorization modeling platform, PopAlign, that automatically identifies sub- (oNMF). PopAlign, then, makes quantitative statistical align- populations of cells within a heterogeneous mixture and tracks ments between subpopulations across samples, and thus enables gene-expression and cell-abundance changes across subpopula- targeted and quantitative comparisons in gene-expression state tions by constructing and comparing probabilistic models. Prob- and cellular abundance. Probabilistic modeling is enabled by a abilistic models provide a low-error, compressed representation low-dimensional representation of cell state in terms of a set of of single-cell data that enables efficient large-scale computa- gene-expression features learned from data (6–8). tions. We apply PopAlign to analyze the impact of 40 different Critically, PopAlign fulfills a fundamental need for compara- immunomodulatory compounds on a heterogeneous population tive analysis methods that can scale to hundreds of experimental of donor-derived human immune cells as well as patient-specific samples. Fundamentally, PopAlign runtime scales linearly with disease signatures in multiple myeloma. PopAlign scales to com- the number of samples because computations are performed on parisons involving tens to hundreds of samples, enabling large- probabilistic models rather than on raw single-cell data. Prob- scale studies of natural and engineered cell populations as they abilistic models provide a reduced representation of single-cell respond to drugs, signals, or physiological change. data, reducing the memory footprint of a typical 10,000-cell experimental sample by 50- to 100-fold. Further, downstream computations, including population alignment, are performed on single-cell genomics | probabilistic models | single cell mRNA-seq

Introduction Significance All physiological processes in the body are driven by hetero- geneous populations of single cells (1–3). Single-cell measure- Many physiological processes are driven by changes across ment technologies can now profile gene expression in thousands heterogeneous populations of cells. However, we currently of cells from heterogeneous cell populations across different lack a conceptual framework for comparing single-cell tran- tissues, physiological conditions, and disease states. However, scriptional data collected from populations as they respond converting single-cell data into models that provide a population- to perturbations. Here, we develop a framework, called level understanding of processes like an immune response PopAlign, that models a cell population as a probability dis- to infection or cancer progression remains a fundamental tribution in gene-expression space. Individual subpopulations challenge. are represented as local densities that are statistically aligned Many single-cell analysis tools have been designed to provide across samples. The models present an integrated represen- in-depth characterization of cell states and developmental trajec- tation that allows comparisons at multiple scales, from gene- tories within a single-cell population. However, once clusters or expression programs to cell state to population structure. The trajectories are identified, existing methods do not provide a for- models are memory-efficient and can be scaled across hun- mal way to compare different populations or samples with each dreds of samples, which we demonstrate using public data other. In this paper, we introduce a computational framework, and data from human immune cells responding to drugs and PopAlign, that was designed to provide an integrated represen- disease. tation of the cell population within a sample, so that samples can be compared at different scales of representation, ranging from Author contributions: S.C., P.R., F.P., and M.T. designed research; S.C., P.R., J.H.P., T.T., E.C., J.R.H., and M.T. performed research; S.C., P.R., and M.T. contributed new gene-expression programs, to cell states, to the structure of the reagents/analytic tools; S.C., P.R., and M.T. analyzed data; S.C. and M.T. wrote the paper; entire cell population. PopAlign identifies, aligns, and tracks sub- and F.P. provided clinical guidance and context for data interpretation.y populations of single cells within a heterogeneous cell population Competing interest statement: S.C., M.T., and P.R. have filed a US and Patent Cooperation profiled by single-cell RNA sequencing (scRNA-seq) (2, 4, 5). Treaty patent for the PopAlign computational framework.y Mathematically, PopAlign constructs a low-dimensional proba- This article is a PNAS Direct Submission.y bilistic model of each cell population across a series of samples. Published under the PNAS license.y PopAlign 1) automatically identifies and models subpopulations 1 To whom correspondence may be addressed. Email: [email protected] or of cells; 2) aligns cellular subpopulations across experimental [email protected] conditions (signaling, disease); and 3) quantifies changes in cell This article contains supporting information online at https://www.pnas.org/lookup/suppl/ abundance and gene expression for all aligned subpopulations of doi:10.1073/pnas.2005990117/-/DCSupplemental.y cells. First published October 30, 2020.

28784–28794 | PNAS | November 17, 2020 | vol. 117 | no. 46 www.pnas.org/cgi/doi/10.1073/pnas.2005990117 Downloaded by guest on September 27, 2021 Downloaded by guest on September 27, 2021 olwn lgmn,teprmtr faindsbouainpisaecmae oietf upplto-pcfi hfsin shifts subpopulation-specific identify to al. et compared Chen are pairs subpopulation aligned of parameters the alignment, abundance cellular Following densities (D) Gaussian divergence. local Jeffreys of of mixture set (µ a a state as as expression states vector gene-expression gene-expression of each distribution representing an of by as collection “Reference” cell data a a each from are input data each the gene-expression which single-cell of sample, input “Test” Users dimensionality one (A) samples. least experimental at single-cell and many sample across expression, gene and abundance cell-state 1. Fig. forward, Moving samples. patient MM of from signatures progression treatment-specific disease and used level we general Finally, population extract mixture. to a the within PopAlign at types cell hits specific biggest for also the and and discover cells to immune PopAlign human used immunomod- primary from to 40 states applied of compounds screen cell ulatory experimental an meaningful performed We biologically data. identify representational high and have accuracy models diverse probabilistic dis- a The human states. across and ease states experiments, drug-perturbation cell tissues, track of and show range identify of We can comparison [MM]). PopAlign myeloma a includ- (multiple that and cells, disease drugs to blood immunomodulatory patients healthy peripheral of as screen human (14) a Muris) on ing (Tabula experiments survey as mouse-tissue well a from datasets analyzing 12 for metrics statistical changes. whole-population quantitative and representation subpopulation and formalized a cell-state provides it of because samples num- small of analyzing bers for capabilities However, unique populations offers conditions. also cell experimental PopAlign different of of hundreds interrogation sam- across new convenient using allow by generated specifically technologies Multiplexing data (11–13). or technologies multiplexing surveys 10) ple for (9, generated tissues those diseased as of such samples, many with datasets memory. in datasets storing single-cell requires raw and many between compute-intensive of computations is pairwise which on cells, neigh- individual rely stochastic (tSNE) t-distributed embedding or data bor (Louvain) on single-cell clustering based from methods by (clusters) contrast, either features By geometric magnitude. of computa- of extraction of order number an the by reducing tions often themselves, models the ABCD easse h cuayadgnrlt fPplg using PopAlign of generality and accuracy the assessed We large-scale analyzing to well-suited particularly is PopAlign umr fPplg rmwr.Pplg rvdsasaal ehdfrdcntutn uniaiecagsi ouainsrcue including structure, population in changes quantitative deconstructing for method scalable a provides PopAlign framework. PopAlign of Summary dmninlvco fcoefficients of vector m-dimensional i ,adpplto ped(Σ spread population and ), ∆ hfsi engn-xrsinstate gene-expression mean in shifts , i .(C ). Each ) c o ahsml,Pplg siae o-iesoa rbblsi oe htrpeet the represents that model probabilistic low-dimensional a estimates PopAlign sample, each For (B) . N i test n hfsi upplto shape subpopulation in shifts and ∆µ, nteTs ouaini lge oteclosest the to aligned is population Test the in dmninlgn-xrsinvectors gene-expression n-dimensional .Tehigh-dimensional (n The 1B). gene-expression (Fig. dis- cells of gene-expression of the set nature each of for model tribution probabilistic a constructed computations Appendix, downstream (SI for log-transformed then and capture single-cell in scaled species and mRNA each of g abundance the quantifies e.g., vectors, (g gene-expression of set a 1 disease (Fig. mRNA-seq and cell drugs, signals, (D to respond they popula- as heterogeneous conditions. cells gene-expression in of analyze changes to tions applied population-structure parameter be and and can alignment, probabilistic PopAlign steps: model analysis. aligned three construction, has between model method expression mixture The shifts gene 1). quantifies (Fig. and 2) populations abundance then reference and cell-state (a population), cells in test single aligns a of and and populations population identifies paired across 1) states that cell (PopAlign) Proba- framework putational with Models. Populations Mixture Cell bilistic Heterogeneous Represents PopAlign Results tissue hetero- human on primary experi- from perturbations large-scale extracted samples. genetic of populations and cell analysis drugs geneous the of for screens stage mental the sets PopAlign N i and 1 ocmaeterfrneadts-elppltos efirst we populations, test-cell and reference the compare To population reference a cells, of populations two consider We ref ihprmtr noigsbouainaudne( abundance subpopulation encoding parameters with , g n etpplto (D population test a and ) 2 , k . . . stenme fpoldsnl el.W normalized We cells. single profiled of number the is , m g g n eeepeso etrs(m features gene-expression aaNormalization). Data ) oacutfrtcnclvraiiyi transcript in variability technical for account to PNAS san is ∆Σ. | n oebr1,2020 17, November .Poln fec ouaingenerates population each of Profiling A). dmninlgn-xrsinvco that vector gene-expression -dimensional edvlpamteaia n com- and mathematical a develop We hw ssnl os oAinrdcsthe reduces PopAlign dots. single as shown g, N i ref test nteRfrnesml yminimizing by sample Reference the in ∼ ,ta r rfie ihsingle- with profiled are that ), pc ae the makes space 000) 20, = D | test 10 o.117 vol. = − 0,tu representing thus 20), {g i } | i k =1 o 46 no. i ,ma gene- mean ), where , | 28785 g =

BIOPHYSICS AND COMPUTATIONAL BIOLOGY inference and interpretation of probabilistic models challenging. sity in gene-feature space, and we compare cell populations Therefore, we represented each cell, not as a vector of , but by asking how the density of cells shifts across experimental as a vector of gene-expression programs or gene-expression fea- conditions. tures that were extracted from the data, so that each single cell was represented as a vector c = (c1, c2 ··· cm ) of m feature coef- Statistical Alignment of Cellular Subpopulations between Samples. ficients, ci , which weighted the magnitude of gene-expression To compare the test and reference models, we “aligned” each test programs in a given cell (SI Appendix, Extraction of Gene Feature mixture component in the test population model, φi (c) ∈ test ref Vectors Using Matrix Factorization). {φi (c)}, to a mixture component, {φi (c)}, in the reference We extracted these gene features using a particular matrix- population model (Fig. 1C). Alignment was performed by find- factorization method called orthogonal nonnegative matrix fac- ing the “closest” reference mixture component in gene-feature torization (oNMF) (15) that produces a useful set of features space (SI Appendix, Alignment of Models). Mathematically, to because all vectors are positive and composed of largely nonover- define closeness, we used the Jeffreys divergence, a statistical lapping genes (SI Appendix, Figs. S1B, S2, and S3). This allowed metric of similarity on probability distributions. We chose the us to naturally think of a cell’s transcriptional state as a linear Jeffreys divergence over other metrics because it is symmetric sum of different positive gene-expression programs (16). Other while also having a convenient parametric form (SI Appendix, methods like principal components analysis (PCA) can be more Model Interpretation through Parameter Analysis). test test ref difficult to interpret because they produce vectors which contain Specifically, for each φi ∈ {φi (c)}, we find a φj ∈ contributions from overlapping sets of genes. If two features, f1 ref {φj (c)} , the closest mixture component in the reference set: and f2, contain similar sets of genes, then representations which use these features can hide underlying cell-state similarity. oNMF test ref forces shared genes to be separated into their own program, so, in arg min DJD (φi (c) k φj (c)), [2] φref(c)∈{φref(c)} the example above, shared genes become a third program f3, and j j the two cell types can be represented as f1 + f3 and f2 + f3. ref Choosing the number of features to use involves balancing a where the minimization is performed over each {φi (c)} in the tradeoff between accuracy and dimensionality. To provide a prin- set of reference mixture components, and DJD is the Jeffreys cipled way to choose the number of features, we constructed a divergence (19). Intuitively, for each test mixture component, we loss function f (m) that places a penalty on high values of m (SI find the reference mixture component φj that is closest in terms Appendix, Figs. S4 and S5 and Extraction of Gene Feature Vectors) of position and shape in feature space. For each alignment, we and uses the function to identify a local minimum. The PopAlign can calculate an explicit P value from an empirical null distribu- package allows users to modulate the exact choice of m by tun- tion P(DJD ) that estimates the probability of observing a given ing the loss function, thus constructing a coarse- or fine-grained value of DJD in an empirical set of all subpopulation pairs within representation that is appropriate for the exact use case. a single-cell tissue database (SI Appendix, Scoring Alignments). Following dimensionality reduction, for a given cell popula- Alignments identify subpopulations with maximal transcrip- tion, we think of cell states as being sampled from an underlying tional-state similarity across samples. Since transcriptional-state joint probability distribution over this feature space, P(c), that similarity can arise even in the absence of a direct lineage or iden- specifies the probability of observing a specific combination of tity relationship, alignments do not guarantee cell-type identity, gene-expression features/programs, c, in the cell population. We but, rather, highlight predicted relationships that should be inter- estimated a probabilistic model, P test(c) and P ref(c), for the ref- preted in the context of prior knowledge and can be explored erence and test-cell populations that intrinsically factored each through further downstream analysis. population into a set of distinct subpopulations, each repre- The directionality of alignment can impact the results and sented by a Gaussian probability density (density depicted as highlight different classes of phenomena. For instance, suppose individual “clouds” in Fig. 1B): a cell state in the reference population splits into two progeny branches in a test sample. If we align to the reference, both l branches in the test sample will align to the same reference cell test X test P (c) = wi φi (c) [1] state. However, reversing the directionality will give only one i=1 alignment—the original cell state will align only with its closest test progeny. Both procedures are useful because they address dif- where φi (c) = N (c; µi , Σi ), ferent questions; the first allows us to identify all cell states that where N (c; µi , Σi ) are multivariate normal distributions with align to a particular cell state, and the second allows us to iden- weight wi , centroids µi , and covariance matrices Σi . The dis- tify only the most similar cell state. For this reason, we offer the test tributions φi (c)= N (c; µi , Σi ), mixture components, represent option to align in both directions, from test samples to reference, individual subpopulations of cells; l is the number of Gaus- and the reverse. sian densities in the model. We estimated the parameters of Additionally, we can perform alignments two-way, which is the mixture model ({µi , Σi , wi }) from single-cell data using the useful for identifying these branching events or even missing expectation-maximization algorithm (17, 18) with an additional populations. For instance, if a cell state is present in the test step to merge redundant mixture components to compensate for sample, but not in the reference sample, the alignment results fitting instabilities (SI Appendix, Merging of Redundant Mixture will change depending on the directionality. An alignment will Components). be found going one way, but not the other way. In this case, we The parameters associated with each Gaussian density, run the alignment procedure in both directions and only retain (µi , Σi , wi ), have a natural correspondence to the biological alignments in which the aligned pair are each other’s best match. structure and semantics of a cellular subpopulation. The rela- Alignments which are only found one way are flagged to indi- tive abundance of each subpopulation corresponds to the weight cate a missing population or branching event. Thus, two-way wi ∈ [0, 1]; the average cell gene-expression state of each sub- alignments are useful for providing a stringent assessment of population corresponds to the m-dimensional Gaussian cen- similar cell states across samples and also for flagging poten- troid vector µi , and the shape or spread of the subpopulation tially interesting orphaned populations and branching events. is captured by the covariance matrix Σi . Intuitively, the local These settings offer a suite of different approaches that have Gaussian densities provide a natural “language” for compar- their uses in different contexts (SI Appendix, Directionality of isons between samples. Each Gaussian is a region of high den- Alignment).

28786 | www.pnas.org/cgi/doi/10.1073/pnas.2005990117 Chen et al. Downloaded by guest on September 27, 2021 Downloaded by guest on September 27, 2021 itiuino umdL rosacross errors L1 summed of al. Distribution et (D) Chen 0.75. of bandwidth a with 0.188 points is data error all Mean using projections. (iii by (C random built right (i 500 data. is the data for estimate experimental on the projections kernel-density displayed the 2D The of and projection. (KDE) in computed 2D estimate each structures is kernel-density of KDE geometric axis a and qualitative between model error the the quantifying between replicate error for data plots model-generated projection (1D) the one-dimensional projection, in each shown projections In feature-space component. 3D and 2D same (A, projection (3D) three-dimensional an into projected 2. Fig. aligned each gene-expression for sub- pair, Mathematically, identified state, mixture 1D). the (Fig. gene-expression mod- cells across of test in abundances populations cellular and shifts and reference track covariance, the to between els parameters differences quantitative mixture analyzed in we Parameters. alignment, Model mixture Mixture lowing through Shifts Cell-State Tracking riig(i.2.Pplg oesgnrt ytei data synthetic generate models (model- PopAlign model from synthetic 2). out the held the (Fig. data comparing of experimental training raw by Accuracy to assessed S9). data the generated) Fig. be of Appendix, can decomposition (SI models interpretable states cell and underlying accurate an produce collected data with samples single-cell tissue (14, contains different Muris) 12 study (Tabula from Muris study Tabula public The recent a 21). from range tissues wide mouse a across of models probabilistic aligned and constructed Tis- Mouse Disparate across sues. States Cell environmental Aligns and Identifies or PopAlign perturbations pairs drug population. mixture cell of underlying all the impact for on changes the parameters in assess shifts to these calculated We and the (20); Analysis) using ( variance matrices including gene-expression component, in metric, mixture Forstner each changes of and shape rotations the in shifts tifies where {µ AC c c 10 2 o l ise nlzd h rbblsi itr models mixture probabilistic the analyzed, tissues all For j test marrow Bone Experimental single-celldata c , ots h cuayadgnrlt fPplg,w first we PopAlign, of generality and accuracy the test To 1 Σ ∆µ xeietldt o ,6 oemro cells marrow bone 3,267 for data Experimental (A) accuracy. quantitative and qualitative high with data experimental represent models PopAlign j test i IAppendix, SI , w esrssit nma eeexpression; gene mean in shifts measures j test (φ m ecalculate: we }, i test D = C ∆w 0dmninloM etr pc.Te2 lt hwsnl el rjce ln NFfauepis( pairs feature oNMF along projected cells single show plots 2D The space. feature oNMF 10-dimensional , c naindmxuecmoetcovariance component mixture aligned on , φ ∆ ∆ ∆ 5 j ref i Σ w µ ) unie hfsi elsaeabundance. cell-state in shifts quantifies oe nepeaintruhParameter through Interpretation Model i i i = c = = ihparameters with 1 |w D kµ C i ref i ref (Σ Inset − − i ref w µ , j Σ oe-eeae aafrthe for data Model-generated (B) plots. 3D and 2D between axis shared a denotes axis blue The shown. is ) test ∼ j test j test 0 000 40, c |, k 2 ), 2 , {µ c B (B, A. 9 i ref el total. cells , Model-generated data c Σ 1 Inset i ref ∆ , ± w faGusa mixture Gaussian a of (µ) centroid the denotes circle maroon each projection, 3D the In ) Σ 0.066. i ref i } quan- Fol- and [3] [5] [4] c 5 and ieisgtit h neligcmoiino ise shedding tissue, a of composition underlying the into insight specifi- 3H give (Fig. found macrophages as programs limb-muscle Sequestration, in gene-expression cally Ion tissue-specific Metal as pro- and common well ER-Trafficking (w reveal as we muscle such macrophages, limb as grams (w such the gland types, mammary cell in shared the abundant in rare highly but are relative cells muscle limb in lial the differences in rare scale but (w gland, tissue prevalent highly are revealed cells T also abundance. and S4), including Fig. tissues, two (P iden- (P the we between cells 3E), types (Fig. B muscle cell limb “common” to tified gland mammary composition. tissue aligning of By comparisons high-level enables PopAlign Muris Tabula by provided S10). all type Fig. single-cell Appendix, across (SI a Broadly, for classified muscle). S9), nents Fig. limb Appendix, endothelial (SI gland) (in models cells, tissue mammary macrophages stem (in and mesenchymal cells cells, cells, T muscle skeletal and and macrophages, 3 cells, (Fig. known Tab- including luminal extracts the types, PopAlign by cell tissues, supplied resident example labels tissue In by project. defined Muris as ula “type” single-cell a of components, mixture PopAlign indi- by biologically represented a subpopulations vidual into cellular populations of cell set interpretable the decompose models ture , Appendix (SI subsampling 50% at observed S8 Fig. error the two-dimensional of to or equal error tSNE quantitative 3 in Error and with data 2 varia- projections tissue (Figs. statistical feature the and in (2D) structures found geometric tions the replicates that c 1 hog lgmn fmdlcmoet costissues, across components model of alignment Through mix- the representation, accurate an providing to addition In .03,admcohgs(P macrophages and 0.0013), = vi .Temria Ddsrbtos(iv, distributions 1D marginal The ). .Teqatttv ro cos2 rjcin sroughly is projections 2D across error quantitative The ). ). φ i h itr opnns(i.3 (Fig. components mixture the (c), .06,Tcls(P cells T 0.0006), = PNAS c 2 cell μ A and | c oebr1,2020 17, November 9 B -20 D )i)iii) ii) i) v )vi) v) iv) 0 P(x) and 1 10 -10 0.00 0.05 % of 2D projections |P(KDE)-P(model)| P(model) P(KDE) 0 1 2 6 7 8 3 4 5 1 10 -10 0 ...... 0.6 0.5 0.4 0.3 0.2 0.1 0.0 2 020 020020 and IAppendix, SI {φ w IAppendix , (SI 0.0076) 0.004, = 0.00 0.01 0.02 i and v, iv (c)} .0) nohla cells endothelial 0.001), = 0 = 0 = otemdl(ii model the to ) 70% -20 | 0 .Between 3G). (Fig. .06) ;endothe- 3G); (Fig. .05) 0.00 0.05 omnycnancells contain commonly vi c o.117 vol. i ftemxuecompo- mixture the of 0 = r umdaogthe along summed are ) , .Pplg a,thus, can, PopAlign ). 1 10 -10 0 C c j ,adasnl selected single a and ), and .3 nlsso Model of Analysis 0.01 0.02 ntemammary the in C adm2 and 2D Random ) | aa cells, basal D) -20 o 46 no. and 0 and 0.00 0.05 0 v .The D). 0 = .TeL1 The ). | ∼ 0.0025 0.0050 28787 .32), 18% x

BIOPHYSICS AND COMPUTATIONAL BIOLOGY ABC

D

EF G

H

Fig. 3. Probabilistic models identify, align, and dissect cellular subpopulations across disparate tissues. (A and B) For two tissues, mammary gland (A) and limb muscle (B), experimental single-cell data (black) are plotted together with PopAlign model-generated data (teal) using a 2D tSNE transformation. For each tissue, mixture model centroids (µ) are indicated by a numbered black dot. (C and D) Heatmaps for mammary gland (C) and limb muscle (D) showing the percentage of cells associated with each mixture component that have a specific cell annotation label supplied by Tabula Muris. Columns (but not rows) sum to one. (E) Null distribution of Jeffreys divergence (JD) using all possible pairs of mixture components within each model. Threshold for P < 0.05 is indicated by a vertical line. (F) Alignments between mixture component centroids (µ) from the reference population (mammary gland) and the test population (limb muscle) are shown as connecting lines. All m-dimensional µ vectors are transformed by using PCA and plotted by using the first three principal components. The width of each line is inversely proportional to the P value associated with the alignment (see key). (G) We ranked aligned subpopulations in terms of maximum ∆ and show the top two pairs (T cells and endothelial cells) that are highlighted in F with a gray dotted line. T cells are highly abundant in mammary gland, while endothelial cells are highly abundant in limb muscle. (H) Comparing subpopulation centroids (µ) for macrophages in terms of annotated oNMF features. Macrophages in mammary gland and limb muscle share common features (Left), but two features (Right) are expressed at higher levels specifically in limb-muscle macrophages. Coefficient values for each feature have been normalized by the maximum value across the column. Corresponding alignments are highlighted in F with a red ellipse.

light onto principles of tissue organization with respect to tissue iments, including comparisons of disparate tissues that do not function. contain overlapping populations. PopAlign achieves generality because it aligns subpopulations by performing a local com- PopAlign Can Perform Global Comparisons of Cell State across Tens putation for each test subpopulation (i.e., the minimization of to Hundreds of Samples. We tested the ability of PopAlign to com- Jeffreys divergence relative to reference subpopulations), that pare large numbers of samples, using synthetic collections of can be accepted or rejected using a hypothesis test. Other samples bootstrapped from Tabula Muris data survey. We found methods for comparing samples across experiments essentially that PopAlign runtime scales linearly with sample number and perform batch correction to align multiple datasets, before pool- can analyze 100 samples in approximately 100 min on a typical ing data and jointly identifying clusters (22, 23). For example, workstation with eight cores and 64 GB RAM (Fig. 4A). By first canonical correlation analysis (CCA) finds a global linear trans- building models, PopAlign front-loads the computation to pro- formation of the data that minimizes transcriptional differences duce a low-error (Fig. 2) representation of the data that achieves between samples. Not only are many batch-correction meth- a 50- to 100-fold reduction in the memory footprint. Memory ods computationally expensive (Fig. 4A), they can also require efficiency speeds up downstream tasks, such as the calculation of overlapping subpopulations for interpretation, thus limiting the pairwise divergences between subpopulations (Fig. 4B) necessary generality of the approach. for aligning them across samples. In the 12-sample Tabula Muris comparison, PopAlign uncov- Applying PopAlign to compare all 12 tissues of Tabula Muris ered meaningful signatures of cell distributions and gene- shows that the method is general across many types of exper- expression patterns that reflect and expand upon known biology.

28788 | www.pnas.org/cgi/doi/10.1073/pnas.2005990117 Chen et al. Downloaded by guest on September 27, 2021 Downloaded by guest on September 27, 2021 hne al. et Chen popu- macrophage in phagocytic) 4C). highly (Fig. (e.g., are activation gland macrophages gene-program mammary of lung the patterns distinct in found abundant also The very We are destroyed. observation cells and the T as engulfed such that results, be surprising highlights must also vas- that analysis macrophages bacteria highly and and in muscle), debris prevalent limb 4F most and (Fig. are (kidney tissues 4E) mature cularized to (Fig. respectively], known cells (25), are spleen endothelial they and (24) where thymus 4D) organs [the (Fig. cells developmentally in B and abundant 4C) most (Fig. cells are T that found we example, For arrows). and font red with (highlighted pathways immune key with (E cells endothelial (D), 4. Fig. uppltoscutrda arpae,dslyn iseadcl-yelbl xrce rmTbl ui noain.(C ( annotations. similarity abundances Muris very pairwise Tabula corresponding on a and from applied gland), extracted of when (mammary labels Heatmap even model cell-type where (B) clusters, reference scales and exp(−JD), RAM). cell-type-specific a CCA tissue as to cogent GB while displaying aligned defined identify 64 macrophages, samples, is can cores, as of PopAlign metric (eight clustered that similarity number subpopulations workstations demonstrates The the typical tissues types. with on 12 tissue linearly all performed disparate scales from were subpopulations PopAlign tests to between (red). Benchmarking applied metric when method Muris. error Tabula alignment out-of-memory survey CCA-based an tissue encounters vs. and (blue) time PopAlign polynomial with for samples of number F E D C AB Δ Δ Δ Δ oAincnpromgoa oprsn fcl ttsars oest udeso xeietlsmls ( samples. experimental of hundreds to dozens across states cell of comparisons global perform can PopAlign r ihypeaeti h ug hc accumulates which lung, the in prevalent highly are ) ,admcohgs(F macrophages and ), o arpae cosaltsusso aito nfaue associated features in variation show tissues all across macrophages for (µ) state gene-expression Mean (G) ). G JD steJfry iegnebtentosubpopulations. two between divergence Jeffreys the is ltosfo h ua oya hyrsodt environmental to respond they as body pop- human cell the heterogeneous from study to ulations is PopAlign of application key Drugs. A of Impacts Cell-Type-Specific and Universal Identifies PopAlign not populations is cell and overlapping samples for gene samples. of between requirements and numbers large abundance by to in constrained scales shifts that meaningful expression frame- extracting computational for results efficient an These work is (26). PopAlign macrophages that reports among demonstrate previous diversity with functional consistent 4G), of (Fig. tissues across lations > ape.Smlsaebosrpe rmal1 ape ftemouse- the of samples 12 all from bootstrapped are Samples samples. 8 r lte o eetdsbouain lsie sTcls(C cells T as classified subpopulations selected for plotted are ) PNAS | oebr1,2020 17, November opttoa utm versus runtime Computational A) | –F o.117 vol. oesfraltsusare tissues all for Models ) B, | Inset o 46 no. highlights ,Bcells B ), | 28789

BIOPHYSICS AND COMPUTATIONAL BIOLOGY change, drug treatments, and disease. The human immune sys- T cells by up-regulating genes that reduce oxidative stress could tem is an important application domain for PopAlign as an potentially be useful in modulating T cell behavior for other extremely heterogeneous physiological system that is central for diseases. disease and cell-engineering applications (2, 27–30). Being able PopAlign allows us to rapidly identify cell-type-specific effects to analyze the effects of different drugs on complex immune-cell of drugs. Identification of the most impactful drugs would be populations, and understand how they affect cell function, is fun- difficult using common visualization approaches like tSNE (SI damentally important to our ability to design drug therapies for Appendix, Fig. S12) or uniform manifold approximation and disease treatment. Thus, we performed an analysis of commer- projection (UMAP) (37), which show qualitative changes (see cially available immunological compounds on human immune highlighted conditions), but are not quantifiable due to the cells and used PopAlign to discover how these compounds alter nonlinear embedding. Here, using a small screen of 40 drugs specific cellular subtypes. from an immunomodulatory compound library, we were able We performed our screen using 40 drugs (Fig. 5A) from a com- to use PopAlign to discover universal and cell-type-specific mercially available compound library (Selleck Chem) on periph- mechanisms of drugs, including the observation that glucocor- eral blood mononuclear cells (PBMCs) from a healthy 22-y-old ticoids broadly down-regulate motility genes and dexrazoxane male donor. PBMCs normally contain a mixture of different specifically impacts T cells by up-regulating prosurvival genes. immune cell types, but our model revealed that blood samples Understanding the cell-type-specific impacts of drugs, which from this particular donor were dominated by monocytes (18%) have so far been obscured, will be integral for designing precision and T cells (82%) (Fig. 5B). therapeutics that have targeted effects within a heterogeneous We first identified hits at a high level by ranking drugs based on tissue. how similar the drug-exposed populations are to the unperturbed control populations (six independent replicates). Statistically, we PopAlign Finds General and Treatment-Specific Signatures of MM. could define “hits” as drugs which have a negative log-likelihood Given the success of the PopAlign framework in extracting cell- ratio metric (SI Appendix, Ranking Populations) that lies below type-specific responses in the immune drug-response data, we the control range (gray box). Within this group, high-ranking applied the method to study underlying changes in cell state due drugs include a group of glucocorticoids (compounds labeled to a disease process. As a model system, we applied PopAlign to in orange, Fig. 5B), as well as mTOR inhibitors (pink) and compare human PBMC samples from healthy donors to patients alprostadil (a prostaglandin) (purple). being treated for MM. MM is an incurable malignancy of blood Many immune-regulating drugs are known to be broadly sup- plasma cells in the bone marrow. Both the disease and associated pressive or activating, but their cell-type-specific effects are not treatments result in broad disruptions in cell function across the very well understood. By quantifying and ranking these shifts immune system (38–41), further contributing to disease progres- across specific cell types, we found that 26 drugs exerted signif- sion and treatment relapse. In MM patients, immune cells with icant gene-expression shifts (∆ ) on monocytes (Fig 5C), while disrupted phenotypes can be detected in the peripheral blood 14 drugs exerted significant effects on T cells (false discovery rate (40, 42, 43). An ability to monitor disease progression and treat- [FDR]-corrected P values < 0.05 ) (Fig 5D). Of these drugs, eight ment in the peripheral blood could therefore provide a powerful drugs (highlighted in color) impacted both cell types (Fig. 5 C new strategy for making clinical decisions. and D). Most drugs either did not affect abundances (∆w ≈ 0) or We obtained samples of frozen PBMCs from two healthy increased monocyte abundance up to 5% (SI Appendix, Fig. S11 and four MM patients undergoing various stages of treatment A and B). (SI Appendix, Table S1). We profiled > 5, 000 cells from each The ability to find the transcriptional impacts of genes that patient and constructed and aligned probabilistic models to one are universal across cell types can reveal important insights reference healthy population (Fig. 6A). into a drug’s fundamental mechanisms. In our screen, we dis- PopAlign identified several common global signatures in the covered that, although drug-responsive genes were mostly cell- MM samples at the level of cell-type abundance and gene expres- type-specific (Fig 5E), for some drugs, up to 15% of impacted sion. Across all samples, we found previously known signatures genes were shared between cell types (SI Appendix, Supplemen- of MM, including a deficiency in B cells (40, 44, 45), an expan- tary File 1, which supplies differentially expressed genes for sion of monocyte/myeloid derived cells (42), and, critically, new all drugs/cell types). For example, budesonide up-regulated 11 impairments in T cell functions. genes and down-regulated 14 genes in both T cells and mono- Plotting ∆w across all patients, we found high-level changes cytes (Fig 5F). The overlapping down-regulated genes included in subpopulation abundances, which are known to be prog- many genes associated with -based motility—such as actin nostic of disease progression (43). We found that all MM genes (beta-actin [ACTB] and ACTG1), an antiadhesion pep- patients experienced a contraction in B cell numbers (Fig. 6B), tide (CD52), a myosin-interacting (CD74) (31), and an and two out of four saw a dramatic expansion (∆w  0.2) of actin-sequestering protein (TSMB10) (32). This result is consis- monocytes (Fig. 6C). Changes in T cell levels, however, can tent with earlier observations that glucocorticoids impede T cell be highly variable. Most patients saw a reduction in effector polarization and motility (33) and monocyte migratory behavior T cells (Fig. 6D) and no change in resting T cells (Fig. 6E). (34) and suggests that broad leukocyte motility deficits may be However, outlier patient MM4 had a large increase in effec- partly responsible for the general immunosuppressive effects of tor T cells (Fig. 6D; ∆w = 0.2) and a complete elimination glucocorticoids. of resting T cells (Fig. 6E; ∆w = 0.2). For this patient, who Our analyses also allowed us to discover a highly T cell-specific was receiving a thalidomide-derived drug therapy, these devia- drug, dexrazoxane, which exerted the largest changes on T cell tions are consistent with thalidomide’s known stimulatory effects state (mean ∆µ = 2.64, P = 2.54e-5; Fig. 5G), but no changes in on T cells (46). monocytes (mean ∆µ = 0.29, P = 1; Fig 5H). Dexrazoxane did In patients with apparently normal abundances (i.e., ∆w are not generate any differentially expressed genes in monocytes (Fig small), uncovering subpopulation-specific changes in transcrip- 5E). We found that in T cells, dexrazoxane up-regulated many tion can point to specific modes of immune dysfunction. We cell-survival genes, including antioxidant enzymes (GPX4 and used PopAlign to find that monocyte subpopulations in patients PRDX1) and CORO1A, which is essential for T cell survival (35) acquire immunosuppressive phenotypes, evidenced by upregu- (Fig 5I). Dexrazoxane is normally used as a chemoprotectant lated expression of CD11b and CD33. Both genes are specific agent to reduce toxic side effects of chemotherapy on cardiac markers of myeloid-derived suppressor cells (47), which are tissue (36). Our finding that dexrazoxane specifically impacts negative regulators of immune function associated with cancer.

28790 | www.pnas.org/cgi/doi/10.1073/pnas.2005990117 Chen et al. Downloaded by guest on September 27, 2021 Downloaded by guest on September 27, 2021 ihrsett hi lge upplto nCR1 ahsalbakdtrpeet eaaebosrpe oe ul rmarnol hsnsub- chosen randomly a from built model bootstrapped mean separate the a indicates represents dot dot large black The small data. Each same CTRL1. the in of subpopulation (80%) aligned sample their to respect with for (∆µ) shifts gene-expression cell T (G) top. the at shown are monocytes and cells T (H both controls. impact (F to that drugs. relative genes 95% top-ranking dexrazoxane to expressed The response Differentially box: in drug. Gray perturbed dual-impact indicated. are a that are genes drug overlapping each and against monocyte-specific, replicates control six the FDR-corrected (LLR). metric line, ratio Dashed log-likelihood indicated. are the drug using each against population ( control weights Abundance to components. similarity principal two first the onto jected 5. Fig. M crdhgl o ohmeoddrvdsprso cell al. et suppressor Chen myeloid-derived both for patient highly except patients scored all that saw MM3 we values CD33, gene-expression and CD11b mean both for monocyte-specific the plotting By cells. T dexrazoxane-exposed in PRDX1 and CORO1A, GPX4, of up-regulation CD EF A # genes ∆μ 100 Total #genesperturbed Monocytes 0 CTRL1 edrn fGusa itr oe o TL pro- CTRL1 for model mixture Gaussian of Rendering (A) drugs. immunomodulatory of impacts cell-type-specific and universal identifies PopAlign

-1 -6 -14 PC2 weighted log(probability) Monocytes Overlap Monocyte T cells -12 w=0.82 95% CI pval=0.05 -10 PC1 -8 T cells w=0.18 -4 16% 51% 33% I ( controls. to relative dexrazoxane for (∆µ) shifts gene-expression Monocyte ) B

Drugs LLR P = from CTRL Most different 95% CI 5 (C 0.05. el Mono T cells Dual-impact drug:Budesonide

L1-distance and CXCR4, CSF1,IL8 CLEC12A, FABP5, IL1RN,CD9 CD163, VSIG4,S100A9,MMP19 NKG7, IL32,CD37,CCL5 IL7R, BTG1,FTL,CXCR4, TSC22D3, TXNIP,ZFAS1, NACA ACTB, ACTG1, CD52, CD74, TMSB10 CD74, CD52, ACTB, ACTG1, eeepeso hfs( shifts Gene-expression D) ∆µ rgrnigbsdo population-level on based ranking Drug (B) circle. the of size the by represented are ) n sclrdby colored is and Io h oto en ahdline: Dashed mean. control the of CI in yial aeapo rgoi,udrcrn h edto need the patients. underscoring in prognosis, populations MDSC poor monitor a have typically tions 6F (Fig. markers (MDSC) p-val>0.05 T cells o rgepsdmnct (C monocyte drug-exposed for ∆µ) −log( P ausuigone-sample using values P PNAS ∆μ au) FDR-corrected value). e-eeL-itnemtisaesonfrbudesonide, for shown are metrics L1-distance Per-gene ) GH |

CTRL2 T cells

CTRL6 mTOR inhibitor prostaglandin glucocorticoid oebr1,2020 17, November CTRL3 T cellspecificdrug:Dexrazoxane CTRL1 CTRL4 CTRL5 .Ptet ihhg DCpopula- MDSC high with Patients ). Dexrazoxane

∆μ eeepeso itiuin showing distributions Gene-expression ) P = P 5 (E 0.05. t ts ftesxcnrlreplicates control six the of -test ausuigone-sample using values CTRL2 control calcineurin inhibitor other Dexrazoxane Mono

subpopulations (D) cell T and ) CTRL5

| CTRL4 CTRL6 o.117 vol. ubro cell-specific, T of Number ) CTRL3 CTRL1 Most similar log(g+1) I to CTRL | CORO1A PRDX1 GPX4 o 46 no. Dexrazoxane CTRL2 t | etof test 28791

BIOPHYSICS AND COMPUTATIONAL BIOLOGY A Healthy1 Healthy2 MM1 MM2 MM3 MM4 (ref) (test) (test) (test) (test) (test)

B cells 4 2 6 1 2 4 1 2 4 5 3 3 5 Monocytes 5 1 2 2

MHC Class II 3 5 1 4 2 1 4 Naive T 3 3 1 5 3 6 7 6 Cytotoxic RAGE 7 Effector T 5 4 lymphocyte receptor 8 killing pathway Monocytes BCB cells Monocytes D Effector T EFNaive T 0.12 MM2 0 0.3 0.2 0.5 0.1 MM4

0.2 MM1 0.08 -0.05 0 0 Δ Δ Δ Δ log(g+1) MM3 0.1

mean(CD11b) 0.06 Healthy1 -0.2 -0.5 Healthy2 -0.1 0 0.04 0.02 0.04 0.06 0.08 0.1 MM1MM2MM3MM4 MM1MM2MM3MM4 MM1MM2MM3MM4 MM1MM2MM3MM4 mean(CD33) Healthy1Healthy2 Healthy1Healthy2 Healthy1Healthy2 Healthy1Healthy2 log(g +1) Naive T G 0.05 H Healthy1-3 ACTB PFN1 0.04 Healthy2-7 IJ5 3 0.03 MM1-3 MM2-3 4 0.02 MM3-4 2 Naive T |Δμ| (a.u.) 0.01 MM3-6 3 MM4 No Mixture Component 0.15 0 None 2 log (g +1)

0.1 log(g +1) 1

MM4 Healthy1-4 a.u. MM1-3MM2-3MM3-4MM3-6 1 Healthy2-8 0.05 Healthy2-7Effector T MM1-5 0.05 MM2-5 0 0 0 MM3-7 0.04

Effector T MM4-2 MM1-5MM2-5MM3-7MM4-2MM4-6 MM1-5MM2-5MM3-7MM4-2MM4-6 ealthy1-4ealthy2-8 0.03 MM4-6 H H healthy1-4healthy2-8 0.02 Effector T

|Δμ| (a.u.) 0.01 Ribosome Ribosome RibosomeRibosome Ribosome 0 MHCMHC Class Class II I AHSP Pathway Leukocyte Motility MM1-5MM2-5MM3-7MM4-2MM4-6 Translation Initiation Translation Initiiation B Cell Survival / CD40LRage Receptor Binding Healthy2-8 Iron Uptake and TransportProtein Localization to ER Cytotoxic Lymphocyte Killing

Fig. 6. Discovering signatures of disease and treatment in PBMCs from MM patients. (A) Experimental single-cell mRNA-seq data from two healthy donors and four MM patients (MM1 to 4-) are projected into 16-dimensional gene-feature space. The 3D plots show single cells in a subset of three gene features that highlight separation between different immune-cell types. Mixture model centroids (µ) are indicated by a numbered black dot. (B–E) Subpopulations in test samples are aligned to the reference (healthy1), and changes in abundance (∆ ) are plotted for B cells (B), monocytes (C), effector T cells (D), and naive T cells (E). (F) Mean gene-expression levels for two markers of MDSCs, CD33 and CD11b, are plotted for all monocyte subpopulations. Error bars denote CI of the mean. (G) |∆µ| for naive T cell and effector T cell populations relative to healthy1. A.u., arbitrary units. (H) Heatmap of mixture component µ vectors in terms of feature coefficients ci for aligned naive and effector T cells across samples. MM subpopulations exhibit reduced expression of two features (red font): leukocyte motility and cytotoxic lymphocyte killing. (I) Distribution of ACTB expression for all effector T cell subpopulations across samples. Violin shows distribution, and mean is denoted by white circle. (J) Distribution of PFN1 expression for all effector T cell subpopulations across samples. For single gene plots—F, I, and J—units are in terms of normalized and log-transformed gene expression (log(gˆ + 1)).

Importantly, we also found that naive and effector T cells in the context of disease progression or drug response can pro- across all MM patients had transcriptional defects in pathways vide insight into treatment efficacy and can form the basis of essential for T cell function. By plotting ∆µ, we show that a personalized medicine approach. Our framework provides a both populations of T cells experience large mean transcrip- highly scalable way of extracting, aligning, and comparing these tional shifts, compared to T cells from our second healthy donor, disease signatures, across many patients at one time. healthy2 (Fig. 6G). By examining the µ0s in terms of gene- expression features (Fig. 6H), we found that in MM, T cells Discussion reduce their expression of two key features—leukocyte motil- In this paper, we introduce PopAlign, a computational and math- ity and cytotoxic lymphocyte killing. Surprisingly, the impact on ematical framework for tracking changes in gene-expression motility is apparent even on the expression of ACTB (Fig. 6I), state and cell abundance in heterogeneous cell populations a core subunit of the actin , which was the top hit in across experimental conditions. The central advance in the the leukocyte motility feature. We find similar declines in the dis- method is a probabilistic modeling framework that represents tribution of Perforin 1 (PFN1), a pore-forming cytolytic protein a cell population as a mixture of Gaussian probability densi- that was found as a top hit in the cytotoxic lymphocyte program ties within a low-dimensional space of gene-expression features. (Fig. 6J). Models are aligned and compared across experimental samples, Our analysis establishes that we can extract consistent and also and by analyzing shifts in model parameters, we can pinpoint patient-specific transcriptional signatures of human disease and gene-expression and cell-abundance changes in individual cell treatment response from PBMCs. Interpreting these signatures populations.

28792 | www.pnas.org/cgi/doi/10.1073/pnas.2005990117 Chen et al. Downloaded by guest on September 27, 2021 Downloaded by guest on September 27, 2021 4 .R Quake R. S. 14. Kang M. H. single- 13. multiplexed Highly Pachter, L. Thomson, M. Chen, S. Park, Hwee J. Gehring, J. 12. McGinnis S. C. 11. Galen van P. 10. hne al. et Chen n fdu obntost ra ua ies sa essentially an as disease phenomena. human treat population-level target- to single-cell combinations to drug of lead ing could and insights microenvironment Such cellular on the types. treatments as immune-cell well drug as of states, impact cell the diseased revealing and tissue within states diseased transcriptional a inter- of treatment spectrum guiding result the for exposing This PopAlign by ventions of samples. use patient potential T three a in to in points defect motility potential and a dis- activation exposing of cell patients, MM signatures in treatment cell-type-specific ease identified disease single-cell PopAlign deconstructing for states. and identify- targets for drug/signal PopAlign of ing immune power human potential the the highlight from we datasets system, dis- to human PopAlign of applying treatment By and ease. analysis single-cell for workbench a cell-type-specific human and to resting universal us on drugs. both allowed of screens identify mechanisms scalability and drug This cells complex large-scale shot.” immune a from “one on data in even analyze responses, cells, drug of to learn us mixture scalably allows representation and subpopu- probabilistic quickly to for Our used metrics alignment. be statistical lation can quantitative which and calculate interpreted in explicitly be distribution object can mathematical parameters probability discrete whose a a define conceptu- as we space, By population gene-expression samples. single-cell statistical across a well-defined comparisons alizing making lack for approaches metrics such cells. of Fundamentally, subpopulations extract to analysis cluster-based heuristic cell of range tissue. wide tissue, a the within a impact types disease single- treatments how within drug understanding a for like types building and interventions disease for cell human of essential of picture be cell variety will inter- wide models to population-level due a arise between human neurodegeneration Since actions and populations. cancer designed cell like complex explicitly diseases within is changes track PopAlign to methods. analytical single-cell .H Mathys H. 9. Recovering analysis?: component principal “Robust Wright, J. Ma, Y. in Li, X. Gaussians” Candes, E. of 8. mixtures “Learning Dasgupta, gene S. in 7. dimensionality Low Thomson, M. El-Samad, H. Bhatnagar, R. Heimberg, G. 6. Macosko Z. E. 5. Zheng Y. ecosystem. X. cancer G. new A 4. Horning, J. S. tran- 3. Single-cell Teichmann, A. S. Regev, A. Rozenblatt-Rosen, cancer: O. for Stubbington, J. therapy M. Ecological Axelrod, 2. E. D. Axelrod, R. McGregor, N. Pienta, J. K. 1. nteftr,w oeta oAincnb sda atof part a as used be can PopAlign that hope we future, the In on rely methods analysis single-cell existing Mathematically, existing over advance conceptual a constitutes PopAlign rmidvda iecetsaTbl ui.boxvhttps://doi.org/10.1101/237446 bioRxiv Muris. Tabula 2018). March a (29 creates mice individual from variation. genetic elRAsqb N lgncetd agn fclua . cellular of tagging (2020). 35–38 oligonucleotide 38, DNA by RNA-seq (8 cell https://doi.org/10.1101/387241 bioRxiv indices. 2018). lipid-tagged August using sequencing immunity. and progression (2019). 332–337 (SAM) in Workshop Processing errors” Signal sparse from matrices low-rank Science Computer of from Foundations programs transcriptional of extraction accurate sequencing. the shallow enables data expression droplets. nanoliter using cells Commun. Nat. disease. and health in system (2017). immune the explore to scriptomics novel for opportunities new suggests treatments. paradigm cancer ecosystem an using tumors Defining igecl rncitmcaayi fAzemrsdisease. Alzheimer’s of analysis transcriptomic Single-cell al., et igecl rncitmccaatrzto f2 rasadtissues and organs 20 of characterization transcriptomic Single-cell al., et utpee rpe igecl N-euniguignatural using RNA-sequencing single-cell droplet Multiplexed al., et igecl N-e eel M irrhe eeatt disease to relevant hierarchies AML reveals RNA-seq Single-cell al., et ihyprle eoewd xrsinpoln findividual of profiling expression genome-wide parallel Highly al., et asvl aalldgtltasrpinlpoln fsnl cells. single of profiling transcriptional digital parallel Massively al., et 44 (2017). 14049 8, UT-e:Saal apemlilxn o igecl RNA single-cell for multiplexing sample Scalable MULTI-seq: al., et a.Biotechnol. Nat. rnl Oncol. Transl. elSyst. Cell Cell Cell 3–5 (2016). 239–250 2, 2518.2 (2019). 1265–1281.e24 176, 5–6 (2008). 158–164 1, IE,Psaaa,N,19) p 634–644. pp. 1999), NJ, Piscataway, (IEEE, 2211 (2015). 1202–1214 161, 99 (2018). 89–94 36, IE,Psaaa,N,21) p 201–204. pp. 2010), NJ, Piscataway, (IEEE, Science 00IE esrAryadMultichannel and Array Sensor IEEE 2010 13(2017). 1103 355, 0hAna ypsu on Symposium Annual 40th Science a.Biotechnol. Nat. Nature 58–63 358, 570, 7 .Mrh,C Weaver, C. Murphy, K. 27. Gautier L. E. 26. B-lymphocytes. splenic thymus. of the functions in Immunological selection Brown, single-cell A. cell T in 25. of effects mechanisms The Batch “Single- Takayanagi, H. Marioni, Takaba, C. H. from J. 24. Morgan, D. Data M. Lun, T. Pisco, A. Haghverdi, L. A. 23. Batson, transcrip- single-cell Integrating Satija, R. Papalexi, J. E. Smibert, P. Hoffman, P. Butler, A. Webber, 22. J. Botvinnik, O. 21. F W. 20. Jeffreys, H. 19. data incomplete MacKay, from C. likelihood J. Maximum D. Rubin, 18. B. D. Laird, M. N. factorization. matrix Dempster, P. non-negative A. by objects 17. of parts the Learning Seung, H. Lee, D. for 16. t-factorizations matrix nonnegative “Orthogonal Park. H. Wei, P. Tao, L. Ding, C. 15. 8 .A i,C .Jn,Tepicpe fegneigimn el otetcancer. treat to cells immune engineering of principles The June, H. C. Lim, A. W. 28. hr n a uc onainadteHrtg eia eerhInstitute. Insti- Research Medical David Beckman Heritage the the and the and Foundation labora- by at Curci supported Kay was performed M.T. McGinnis and M.T. Center. Shurl was Engineering figure the and Chris work Profiling for Single-Cell This of Strazhnik discussions; tute illustrations. Inna-Marie members and and and guidance; editing and feedback experimental for Khazaei, helpful Patterson Tami for Gehring, tory Jase Pool, ACKNOWLEDGMENTS. GitHub, at found be can com/thomsonlab/popalign. 3, Python in software implemented package, The (48). (https://doi.org/10.6084/m9.figshare.11837097) Figshare and #17-0727), from Availability. (Reference Data exempt approval is required. and work not review was this consent IRB Institu- informed that for Technology determined requirement has of the (IRB) Institute board California Review The tional Hemacare. from mercially 3 single-cell Approval. using Study suspension lanes single-cell Genomics oligos 10X a lipid-modified two into Multiseq on using reagents. dissociated running by were before at multiplexed and (11) cells used TrypLE exposure, were using 40 by of Drugs to h SelleckChem. exposed small- 24 by and inflammation-related plate sold and 96-well 1 library immunology- a compound the of CO molecule well from in each selected RPMI-1640 into drugs in seeded rested were and cells thawed 37 Multiseq. at were incubator Using Hemacare from RNA-Seq sourced Single-Cell 3 single-cell Multiplexed sample, using each lane For Genomics 10X RPMI-1640. a in into cells/mL reagents. loaded were 1e6 cells Memorial to 17,400 Park resuspended Roswell were warm 37 Cells at in (RPMI-1640) thawed Medium were Institute PBMCs Samples. cryopreserved Donor Healthy ples and MM for RNA-Seq Single-Cell parameter through interpretation in model provided is and analysis, models, of model of alignment analysis number, error, feature of selection normalization, data including Framework. Mathematical Methods ocnrtosi PI14 ls1%ftlbvn eu.After serum. bovine fetal 10% plus RPMI-1640 in concentrations µM Nature (2006). 126–135 pp. Mining, Data and Discovery Knowledge in clustering” htudri h dniyaddvriyo os isemacrophages. tissue mouse (2012). of 1118–1128 diversity 13, and identity the underlie that (1992). 395–417 neighbors. Immunol. nearest mutual matching by corrected Biotechnol. Dataset. are data Figshare. RNA-sequencing (v2).” emulsion species. (2018). and 411–420 technologies, conditions, different microfluidic across data tomic from 2019. May data 1 Eds. Accessed . Schwarze, https://doi.org/10.6084/m9.figshare.5968960.v3 RNA-seq S. V. Krumm, cell W. F. Grafarend, W. 299–309. E. pp. 2003), Millennium, Germany, Berlin, 3rd (Springer, the of lenge 1998). 2003). UK, Cambridge, Press, University algorithm. EM the via 2–4 (2017). 724–740 168, 2016). rte,B onn Amti o oainemtie”in matrices” covariance for metric “A Moonen, B. orstner, ¨ 8–9 (1999). 788–791 401, 85P1 (2017). P805–P816 38, 2–2 (2018). 421–427 36, eeepeso rfie n rncitoa euaoypathways regulatory transcriptional and profiles Gene-expression al., et h hoyo Probability of Theory The rceig fte1t C IKDItrainlCneec on Conference International SIGKDD ACM 12th the of Proceedings ◦ nomto hoy neec n erigAlgorithms Learning and Inference Theory, Information PNAS l tde eepromdo BC bandcom- obtained PBMCs on performed were studies All o 6hbfr rgepsr.Atrrsig 200,000 resting, After exposure. drug before h 16 for C igecl eeepeso aahv endpstdin deposited been have data gene-expression Single-cell –8(1977). 1–38 B, Soc. Stat. Roy. J. IAppendix SI aea’ Immunobiology Janeway’s | etakJsi os rcCo,Allan-Hermann Chow, Eric Bois, Justin thank We oebr1,2020 17, November icsino h ahmtclframework, mathematical the of Discussion . ◦ n eltda 300 at pelleted and C Ofr nvriyPes xod UK, Oxford, Press, University (Oxford | CCPes oaRtn L d 9, ed. FL, Raton, Boca Press, (CRC o.117 vol. rorsre PBMCs Cryopreserved l ua elsam- cell human All rt e.Immunol. Rev. Crit. | eds—h Chal- Geodesy—The a.Biotechnol. Nat. o 46 no. × https://github. a.Immunol. Nat. g o min. 2 for (Cambridge | Trends 28793 0 0 Nat. Cell 36, 11, v3 v2 2

BIOPHYSICS AND COMPUTATIONAL BIOLOGY 29. K. Newton, V. M. Dixit, Signaling in innate immunity and inflammation. Cold Spring 40. L. M. Pilarski, E. Joy Andrews, M. J. Mant, B. A. Ruether, Humoral immune deficiency Harb. Perspect. Biol. 4, a006049 (2012). in multiple myeloma patients due to compromised B-cell function. J. Clin. Immunol. 30. A. C. Villani, S. Sarkizova, N. Hacohen, Systems immunology: Learning the rules of the 6, 491–501 (1986). immune system. Annu. Rev. Immunol. 36, 813–842 (2018). 41. M. Bolzoni et al., IL21R expressing CD14+CD16+ monocytes expand in multiple 31. G. Faure-Andre´ et al., Regulation of dendritic cell migration by CD74, the MHC class myeloma patients leading to increased osteoclasts. Haematologica 102, 773–784 II-associated invariant chain. Science 322, 1705–1710 (2008). (2017). 32. X. Zhang et al., Thymosin beta 10 is a key regulator of tumorigenesis and 42. C. Botta, A. Gulla,` P. Correale, P. Tagliaferri, P. Tassone, Myeloid-derived suppressor metastasis and a novel serum marker in breast cancer. Breast Cancer Res. 19, 15 cells in multiple myeloma: Pre-clinical research and translational opportunities. Front. (2017). Oncol. 4, 348 (2014). 33. N. Muller,¨ H. J. Fischer, D. Tischner, J. van den Brandt, H. M. Reichardt, Glucocorticoids 43. T. Dosani et al., Significance of the absolute lymphocyte/monocyte ratio as a prognos- induce effector T cell depolarization via ERM proteins, thereby impeding migration tic immune biomarker in newly diagnosed multiple myeloma. Blood Canc. J. 7, e579 and APC conjugation. J. Immunol. 190, 4360–4370 (2013). (2017). 34. J. Ehrchen et al., Glucocorticoids induce differentiation of a specifically activated, 44. A. C. Rawstron et al., B-lymphocyte suppression in multiple myeloma is a reversible anti-inflammatory subtype of human monocytes. Blood 109, 1265–1274 (2007). phenomenon specific to normal B-cell progenitors and plasma cell precursors. Br. J. 35. P. Mueller et al., Regulation of T cell survival through coronin-1-mediated gener- Haematol. 100, 176–183 (1998). ation of inositol-1,4,5-trisphosphate and calcium mobilization after T cell receptor 45. R. J. Pessoa-Magalhaes et al., Analysis of the immune system of multiple myeloma triggering. Nat. Immunol. 9, 424–431 (2008). patients achieving long-term disease control, by multidimensional flow cytometry. 36. S. E. Lipshultz et al., The effect of dexrazoxane on myocardial injury in doxorubicin- Haematologica 98, 79–86 (2013). treated children with acute lymphoblastic leukemia. N. Engl. J. Med. 351, 145–153 46. M. Winqvist et al., In vivo effects of lenalidomide on T cell proliferation and immune (2004). checkpoint molecules in patients with advanced stage CLL: Results from a phase II 37. L. McInnes, J. Healy, UMAP: Uniform manifold approximation and projection for study. Blood 126, 4164 (2015). dimension reduction. arXiv:1802.03426 (9 February 2018). 47. V. Bronte et al., Recommendations for myeloid-derived suppressor cell nomenclature 38. S. K. Kumar et al., Multiple myeloma. Nat. Rev. Dis. Primers 3, 17046 (2017). and characterization standards. Nat. Commun. 7, 12150 (2016). 39. E. Malek et al., Myeloid-derived suppressor cells: The green light for myeloma 48. S. Chen, PopAlign Data. Figshare. Dataset. https://doi.org/10.6084/m9.figshare. immune escape. Blood Rev. 30, 341–348 (2016). 11837097.v3. Deposited 9 November 2020.

28794 | www.pnas.org/cgi/doi/10.1073/pnas.2005990117 Chen et al. Downloaded by guest on September 27, 2021