Statistical analysis for scRNAseq data

Cathy Maugis-Rabusseau

[email protected]

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 1 / 52 Plan

1 Introduction

2 Feature selection / extraction

3 Dimension reduction

4 Single cell clustering

5 Pseudotime analysis

6 Differential analysis

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 2 / 52 scRNA-seq data

n cells, G genes: n ≤ G or n ≈ G =⇒ high dimensionality

Measures: xij = expression of the gene j for the cell i ∈ N

Technical and biological noise

High variability

Zero-inflated data =⇒ "sparsity" (≥ 80% of zeros per raw, dropouts)

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 3 / 52 Biological questions

Are there distinct subpopulations of cells?

For each cell type, what are the marker genes?

How visualize the cells?

Are there continuums of differentiation / activation cell states?

...

Rostom et al, FEBS 2017 C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 4 / 52 Statistical analysis

Clustering of cells

Variable (gene) selection in learning or differential analysis (hypothesis testing)

Reduction dimension

Network inference

...

Rostom et al, FEBS 2017 C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 5 / 52 Some bio-info-stat. pipelines/workflows

[Juliá et al., 2015] Sincell (Bioconductor/R package)

https://bioconductor.org/packages/release/bioc/html/sincell.html

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 6 / 52 Some bio-info-stat. pipelines/workflows

[Juliá et al., 2015] Sincell (Bioconductor/R package)

[Poirion et al., 2016]

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 6 / 52 Some bio-info-stat. pipelines/workflows

[Juliá et al., 2015] Sincell (Bioconductor/R package)

[Poirion et al., 2016]

[Wolf et al., 2018] SCANPY

https://github.com/theislab/Scanpy

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 6 / 52 Some bio-info-stat. pipelines/workflows

[Juliá et al., 2015] Sincell (Bioconductor/R package)

[Poirion et al., 2016]

[Wolf et al., 2018] SCANPY

[Guo et al., 2015] SINCERA:A Single-Cell RNA-Seq Analytic Pipeline

https://github.com/xu-lab/SINCERA

https://research.cchmc.org/pbge/sincera.html

Fig 1. Schematic Workflow. The analytic pipeline consists of three main components: pre-processing, cell type identification, and cell type specific gene signature and driving force identification. doi:10.1371/journal.pcbi.1004575.g001

addressing rare cell types that are not forming clusters with other cells by adjusting ds y 1 i ð Þ¼ in a separate study. The cell specificity filter is defined by a cell specificity index s, which is modified from the C.Maugis-Rabusseau (IMT/INSA) ti Statistical analysis for scRNAseq data 6 / 52 calculation of tissue specificity index in [42]. Es min Es x s ij À ð i Þ ij ¼ max Es min Es ð i ÞÀ ð i Þ Ns 1 1 x s ð Þ j 1 ij ts ¼ ð À Þ i ¼ X qs 1 À s fi s ti denotes the cell speci city of gene i in sample s, Eij is the expression of gene i in cell j in s,

PLOS Computational Biology | DOI:10.1371/journal.pcbi.1004575 November 24, 2015 4 / 28 Some bio-info-stat. pipelines/workflows

[Juliá et al., 2015] Sincell (Bioconductor/R package)

[Poirion et al., 2016]

[Wolf et al., 2018] SCANPY

[Guo et al., 2015] SINCERA:

[Lun et al., 2016] Workflow Package : simpleSingleCell

https://bioconductor.org/packages/release/workflows/html/simpleSingleCell.html

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 6 / 52 Some bio-info-stat. pipelines/workflows

[Juliá et al., 2015] Sincell (Bioconductor/R package)

[Poirion et al., 2016]

[Wolf et al., 2018] SCANPY

[Guo et al., 2015] SINCERA:

[Lun et al., 2016] Workflow Package : simpleSingleCell

[Satija et al., 2015] SEURAT:

https://satijalab.org/seurat/ ...

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 6 / 52 Plan

1 Introduction

2 Feature selection / extraction

3 Dimension reduction

4 Single cell clustering

5 Pseudotime analysis

6 Differential analysis

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 7 / 52 Feature (gene) extraction

Simple filtering criteria : see e.g [Lun et al., 2016],[Soneson and Robinson, 2018] Filtering of lowly expressed genes: genes expressed in <τ % of cells genes with a mean average of expression <τ

Dropout-based feature selection M3Drop, [Andrews and Hemberg, 2018]

Based on the Michaelis-Menten function

S Pdropout = 1 − KM + S where S = mean expression Pdropout = dropout rate

MLE to obtain the global KM across all genes

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 8 / 52 Highly Variable Genes (HVG)

[Brennecke et al., 2013] Fits a quadratic model (gamma generalized linear model) to the relationship between mean expression and the coefficient of variation squared (CV2) χ2 test is used to find genes signif. above the curve Implemented in M3Drop package

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 9 / 52 Highly Variable Genes (HVG)

[Brennecke et al., 2013] [Kim et al., 2015]

Uses spike-ins to estimate parameters related to technical variance and estimates gene-specific biological variability by substracting the estimated technical variance from the total variance.

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 9 / 52 Highly Variable Genes (HVG)

[Brennecke et al., 2013] [Kim et al., 2015] [Vallejos et al., 2015]

BASiCS = Bayesian Analysis of Single-Cell Sequencing Data Models spike-ins and endogenous genes simultaneously as two Poisson-Gamma hierarchical models

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 9 / 52 Highly correlated genes

Gene-gene correlation:

Calculate the gene-gene correlation matrix ρ = (ρij )i,j=1,...,G

Evaluate the correlation magnitude for each gene : ρ˜i = max|ρij | j

Take the top few thousand genes having the highest correlation magnitude

PCA loadings: Select the genes with high PCA loadings

...

Non adapted for batch effects

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 10 / 52 Plan

1 Introduction

2 Feature selection / extraction

3 Dimension reduction

4 Single cell clustering

5 Pseudotime analysis

6 Differential analysis

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 11 / 52 Objectives

Minimize curse of dimensionality

Allow visualization

Reduce computational time

....

But attention to the interpretations after!

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 12 / 52 Principal component analysis (PCA)

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 13 / 52 Principal component analysis (PCA)

Diagonalization of the covariance (or correlation) matrix

Linear transformations: meta-variables = linear combinations of the genes

Capture the dimensions with higher variance

Fast deterministic procedure

Sparse-PCA : PCA + gene selection

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 13 / 52 Extensions of PCA for scRNAseq data

[Pierson and Yau, 2015] : ZIFA (Zero Inflated Factor Analysis) Deals with the large number of zero-values in scRNASeq data Relationship between the dropout rate p0 and the mean level of non-zero expression (log read count) µ:

2 p0 = exp(−λµ )

ZIFA adopts a latent variable model and uses an EM algorithm for the parameter estimation Python software : https://github.com/epierson9/ZIFA

[Risso et al., 2017] : ZINB-WaVE = Zero-Inflated Negative Binomial Model for RNA-Seq Data a method similar to PCA based on a zero- inflated negative binomial model instead of a Gaussian model https://bioconductor.org/packages/release/bioc/html/zinbwave.html

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 14 / 52 Extensions of PCA

[Lin et al., 2017] CIDR (https://github.com/VCCRI/CIDR)

1 Preliminary, log(xij + 1)

2 Identification of dropout candidates. (CIDR finds a sample-dependent threshold that separates the zero peak from the rest of the expression distribution for each cell)

3 Estimation of the relationship between dropout rate and levels (non-linear least-squares regression to fit a decreasing logistic function to the data)

4 Calculation of dissimilarity between the imputed gene expression profiles for every pairs of single cells

5 PCoA using the CIDR dissimilarity matrix

6 Clustering (CAH) using the first few principal coordinates

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 15 / 52 Example of t-SNE plot

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 16 / 52 t-SNE

Reduce a dataset to 2 dimensions Non-linear dimension reduction technique Want to preserve the neighborhood "Don’t interpret distances in t-SNE plots"

https://constantamateur.github.io/2018-01-02-tSNE/

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 17 / 52 t-SNE

Reduce a dataset to 2 dimensions Non-linear dimension reduction technique Want to preserve the neighborhood "Don’t interpret distances in t-SNE plots"

G INPUT : X = (x1,..., xn) with xi ∈ R (High dimensional data) 2 OUTPUT: Y = (y1,..., yn) with yi ∈ R ( Low dimensional data)

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 17 / 52 Stochastic Neighbor Embedding (SNE) Converting the high-dimensional Euclidian distances into conditional probabilities (=similarities) Similarity of points in high-dimension

exp(−kx − x k2/2σ2) p = i j i Gaussian distrib. j|i P 2 2 k6=i exp(−kxi − xk k /2σi ) Similarity of points in low dimension exp(−ky − y k2) = i j qj|i P 2 Gaussian distrib. k6=i exp(−kyi − yk k ) Cost function: Kullback-Leibler divergence   X X X pj|i C(Y) = KL(P |Q ) = p ln i i j|i q i i j j|i Minimize the cost function using gradient descent

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 18 / 52 Symmetric SNE

Pairwise similarities: exp(−kx − x k2/2σ2) = i j pij P 2 2 Gaussian distrib. k6=l exp(−kxk − xl k /2σ ) exp(−ky − y k2) = i j qij P 2 Gaussian distrib. k6=l exp(−kyk − yl k ) In practice, p | + p | p = i j j i ij 2n H(P ) + perplexity (effective nb of neighbors) Perp(Pi ) = 2 i (link to σi , H(Pi ) = Shannon entropy) Minimize the cost function   X X pij C(Y) = pij ln qij i j

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 19 / 52 t-SNE

Use Student (heavy tail) distribution than Gaussian in low-dimensional space:

(1 + ky − y k2)−1 = i j qij P 2 −1 k6=l (1 + kyk − yl k )

Cost function   X X pij C(Y) = pij ln qij i j

Large pij modeled by small qij : large penalty

Small pij modeled by large qij : small penalty t-SNE: mainly preserves local similarity structure of the data

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 20 / 52 t-SNE

In practice, t-SNE is used on the first components of PCA to reduce the time calculation t-SNE is a stochastic algorithm t-SNE is implemented in various programming languages (see https://lvdmaaten.github.io/tsne/)

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 21 / 52 Other methods

ICA = Independent Component Analysis Computational method for separating a multivariate signal into additive subcomponents. This is done by assuming that the subcomponents are non-Gaussian signals and that they are statistically independent from each other.

Diffusion maps Computes a family of embeddings of a data set into Euclidean space whose coordinates can be computed from the eigenvectors and eigenvalues of a diffusion operator on the data. ...

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 22 / 52 Plan

1 Introduction

2 Feature selection / extraction

3 Dimension reduction

4 Single cell clustering

5 Pseudotime analysis

6 Differential analysis

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 23 / 52 What is clustering?

Goal : organizing cells into groups whose members are similar in some way

Fundamental question : What is meant by "similar cells"?

The number of clusters is unknown

Typical clustering methods : Hierarchical clustering (CAH), Kmeans clustering, Graph-based clustering Model-based clustering ...

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 24 / 52 Hierarchical clustering

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 25 / 52 Hierarchical clustering

Choose a (dis)-similarity between data points and a linkage criterion (similarity between clusters)

Agglomerative : starts with all data points as individual clusters and joins the most similar ones in a bottom-up approach

Divisive : starts with all data points in one large cluster and splits it into two at each step (a top-down approach)

A dendrogram representing the decisions at each merge/division of clusters

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 25 / 52 Hierarchical clustering

(dis)-similarity between data points : Euclidian distance 1-correlation, 1-correlation2, .... Dissimilarity between the imputed gene expression profiles (CIDR), ....

Linkage criteria : Ward, complete linkage, average linkage, ...

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 26 / 52 K-means

1 Starts with random selection of cluster centers 2 Assigns each data points to the nearest cluster 3 Calculates the new cluster centers 4 Repeats steps 2-3 until no more changes occur

Require a distance between data points Fuzzy Kmeans, ...

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 27 / 52 Graph-based clustering

Objects (cells) are represented as nodes

Assign a weight to each branch between two nodes x and x˜

w(x, x˜) = distance(x, x˜)

Clustering

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 28 / 52 Graph-based clustering

Minimal Spanning (MST) = a connected sub-graph with minimal weight that contains all nodes and has no cycle (ex: Prim’s algo, Kruskal’s algo)

Different strategies to delete some branchs

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 29 / 52 Model-based clustering

Data are distributed from an unknown distribution s

s is estimated by a finite mixture:

Data are organized into K subpopulations

Each subpopulation is distributed from fk (.|αk ) ⇒ Thus the population is distributed from a mixture of these subdistributions

K K X K X f (.|θK ) = πk fk (.|αk ) with (π1, . . . , πK ) ∈]0, 1[ , πk = 1 k=1 k=1

Parameter vector: θK = (π1, . . . , πK , α1, . . . , αK )

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 30 / 52 Model-based clustering

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 31 / 52 Model-based clustering

Model collection:

( K ) ? p X ∀K ∈ N , SK = x ∈ R 7→ f (x|θK ) = πk fk (.|αk ) k=1 ⇒ Choice of the model collection

For each model SK , the mixture is determined which best fits the ˆ data: f (.|θK )

ˆ ⇒ Need a parameter estimation algorithm (θK )

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 32 / 52 Model-based clustering

ˆ ˆ ˆ Choose the "best" mixture among f (.|θ2), f (.|θ3),..., f (.|θKmax )

ˆ ˆ ⇒ Need a model selection criterion to determine K (and f (.|θKˆ )). Clustering : Maximum A Posteriori (MAP) rule A cell is assigned in the cluster for which it has the highest probability of belonging.

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 33 / 52 Some clustering softwares

Software Description Ref Kmeans on different dimensions of PCA (on log-counts) + CAH on consensus SC3 matrix [Kiselev et al., 2017] https://bioconductor.org/packages/release/bioc/html/SC3.html Combines PCA on log-counts, k-means and "iterative" hierarchical clustering pcaReduce https://github.com/JustinaZ/pcaReduce [Žurauskiene˙ and Yau, 2016] CAH using Pearson corr. and average linkage on z-score (prelim. data transf.) SINCERA [Guo et al., 2015] https://research.cchmc.org/pbge/sincera.html Shared Nearest Neighbours graph + clique method SNN-Cliq [Xu and Su, 2015] http://bioinfo.uncc.edu/SNNCliq/ PCA + Nearest Neighbor graph clustering Seurat [Butler et al., 2018] https://github.com/satijalab/seurat PCA (diss. CIDR on counts, dropouts imputation) + CAH CIDR [Lin et al., 2017] https://github.com/VCCRI/CIDR Ensemble clustering using SC3, CIDR, Seurat et t-SNE Kmeans SAFE [Yang et al., 2017] https://github.com/yycunc/SAFEclustering Dropout imputation and clustering. mixtures of multivariate log-normal and BISCUIT Bayesian inference [Prabhakaran et al., 2016] https://github.com/sandhya212/BISCUIT_SingleCell_IMM_ICML_2016 Kernel-based similarity learning (S) + t-SNE on S + Kmeans SIMLR [Wang et al., 2017] https://bioconductor.org/packages/release/bioc/html/SIMLR.html Kmedoids clustering on similarity matrix of Pearson’s correlation coeff. RaceID [Grün et al., 2015] https://cran.r-project.org/web/packages/RaceID/ Biclustering based on sorting points into neighborhoods (SPIN) backSpin [Zeisel et al., 2015] ... https://github.com/linnarsson-lab/BackSPIN

Reviews : [Duò et al., 2018], [Freytag et al., 2018]

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 34 / 52 bioRxivSC3 preprint first posted online Jan. 13, 2016; doi: http://dx.doi.org/10.1101/036558. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.

[Kiselev et al., 2017]

https://bioconductor.org/packages/release/bioc/html/SC3.html

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 35 / 52

Figure 1. The SC3 framework for consensus clustering. (a) Overview of clustering with SC3 ​ ​ framework (see Methods). A total of 6D clusterings are obtained, where D is the total number of ​ ​ dimensions d1, …, dD considered. These clusterings are then combined through a consensus ​ ​ ​ ​ ​ step to increase accuracy and robustness. Here, the consensus step is exemplified using the Treutlein data: the binary matrices (Methods) corresponding to each clustering are averaged, and the resulting matrix is segmented using hierarchical clustering up to the k­th hierarchical ​ level (k = 5 in this example). (b) Published datasets used to set SC3 parameters. N is the ​ ​ ​ ​ number of cells in a dataset; k is the number of clusters originally identified by the authors 9,14,17,18,20–23,63–66 ​ ;​ Units: RPKM is Reads Per Kilobase of transcript per Million mapped reads, RPM is Reads Per Million mapped reads, FPKM is Fragments Per Kilobase of transcript per Million mapped reads, TPM is Transcripts Per Million mapped reads. (c) Testing the distances, ​ ​ nonlinear transformations and d range. Median of ARI over 100 realizations of the SC3 ​ clustering for six gold standard datasets (Biase, Yan, Goolam, Kolodziejczyk, Deng and Pollen,

ŽurauskieneandYau˙ BMC Bioinformatics (2016) 17:140 Page 3 of 11

we could recapitulate such results fully automatically in merge operations are used, the whole process can be an unsupervised fashion without prior knowledge of cell repeated to obtain a number of alternative clusterings. type markers that was used in these studies. This ques- This will be useful for assessing the stability of the cluster- tion is important in study conditions where there maybe ing results. Algorithm 1 gives a pseudo-code description. little or unreliable prior knowledge of cell types. We will test our methods against existing approaches but note that many existing approaches typically utilise different Algorithm 1: The pcaReduce algorithm. Here combinations of standard and Xn×d is a gene expression matrix with n cells clustering algorithms. Therefore, in our investigations, (given in rows) and d genes (in columns); q is the instead of using these packages directly, we will explore number of dimensions – effectively this refers to the utility of the constituents components (which might the number of levels in the hierarchy; Y is a score ŽurauskieneandYau˙ BMC Bioinformatics (2016) 17:140 Page 3 of 11 µ be shared between approaches). matrix, which is the output of PCA algorithm; ij and !ij definition are given in Eq. (1); (i) and (ii) MethodspcaReduce denote two different merging settings: (i) merg- ( ) Let Xn d denote a gene expression matrix, where n ing is based on largest probability P i, j value; (ii) × merging is based on sampling according to P(i, j) we could recapitulate such results fully automatically in ismerge the number operations of cells are used, measured the whole across processd number can of be ... distribution. an unsupervised fashion without prior knowledge of cell genes;repeated i.e. each to obtain cell xi a={ numberxi1, of, x alternativeid} is a d-dimensional clusterings. Input: Xn d and q ; type markers that was used in these studies. This ques- object.This will Further be useful assume for assessing that Yn×q thedenotes stability a scoreof the matrix, cluster- × tion is important in study conditions where there maybe obtaineding results. after Algorithm projecting 1 gives data intoa pseudo-code first q principle description. direc- Output: a collection of q clusterings; 1 Y PCA(X ,dim=q); little or unreliable prior knowledge of cell types. We will tions, and Yi denotes a subset of cells, Yi Y. ←− n×d ⊂ µ ! kmeans test our methods against existing approaches but note Our clustering algorithm begins by performing a 2 ( , ) ←− (Y, K = q + 1); K-meansAlgorithm clustering 1: The operationpcaReduce onalgorithm. the projection Here of the 3 Q q 1; that many existing approaches typically utilise different ←− −... combinations of standard dimensionality reduction and originalXn d is gene a gene expression expression matrix, matrixXn d,tothetop with n cellsK 1 4 for r 1, , Q do × × − = ( ) clustering algorithms. Therefore, in our investigations, principal(given in directions. rows) and Thed genes number (in columns); of initialq clustersis the K is 5 for all possible pairs i, j in ...... instead of using these packages directly, we will explore setnumber to a sufficiently of dimensions large – value, effectively say 30, this to refers ensure to most 1, , K 1, , K do { ( )} × { } µ ! the utility of the constituents components (which might cellthe types number will of be levels captured. in the Once hierarchy; the initialY is a clusters score are 6 P i, j ←− p Yi ∪ Yj| ij, ij ; ( µ be shared between approaches). determined,matrix, which we is take the twooutput subsets of PCAY algorithm;i, Yj)thatoriginateij 7 end for ! " ( ) fromand a! pairij definition of clusters are(i, givenj) respectively, in Eq. (1); and (i) and calculate (ii) the 8 (i) either choose pair i, j with largest Methods probabilitydenote two for different those observations merging settings: to be merged (i) merg- together, probability P(i, j) and merge clusters i and j; ( µ ! ) ( ) 9 (ii) or sample a pair (i, j) with probability Let Xn d denote a gene expression matrix, where n p ingYi, isYj basedij, onij . largest We assume probability thatP thei, j probabilityvalue; (ii) den- × { }| ( ) ( ) is the number of cells measured across d number of sitymerging function is based is a multivariate on sampling Gaussian according with to P meani, j and P i, j and merge clusters i and j; ... covariancedistribution. matrix given by: 10 q q 1; genes; i.e. each cell xi ={xi1, , xid} is a d-dimensional ←− − 11 Y Yn q (i.e. remove last dimension); object. Further assume that Yn q denotes a score matrix, Input: Xn×d and q ; ← × × n nj µ ! obtained after projecting data into first q principle direc- µOutput:i a collectionµ of qµclusterings; 12 update ( , ); ij = i + j, 1 Y ni PCAnj (X n,dim=i njq); 13 K K 1; tions, and Yi denotes a subset of cells, Yi ⊂ Y. ←− + n×d + (1) ← − µ ! n kmeansnj 14 Our clustering algorithm begins by performing a 2!( , ) ←−i ! (Y, K! = q + 1); end for [Žurauskienij e˙ and Yau,i 2016] j, K-means clustering operation on the projection of the 3 Q= ni q nj1; + ni nj ←− +−... + original gene expression matrix, Xn d,tothetopK 1 https://github.com/JustinaZ/pcaReduce4 for r 1, , Q do × − ( =) (µ µ ) (! (!)) principal directions. The number of initial clusters K is where5 ni,fornj ,all possiblei, j and pairsi,i, jj indenote the sizes, cen- Figure 2 gives an illustration of our method using an ...... set to a sufficiently large value, say 30, to ensure most troids and{1, covariances, K} × {1, of the, K} clustersdo i and j respec- autoencoder representation. Autoencoders are feedfor- ( ) µ ! ( ) cell types will be captured. Once the initial clusters are tively.6 We repeatP i, j this←− forp allYi possible∪ Yj| ij, pairsij ; i, j .Wethen ward neural networks that accept d-dimensional input determined, we take two subsets (Yi, Yj)thatoriginate choose7 toend merge for two clusters by either (i) picking the pair data and report d-dimensional output data. The d nodes C.Maugis-Rabusseau (IMT/INSA)! "Statistical analysis for scRNAseq data 36 / 52 from a pair of clusters (i, j) respectively, and calculate the that8 has the(i) either highest choose probability pair (i, j or) with (ii) largest sampling a pair in the input and output network layers are connected probability for those observations to be merged together, of clustersprobability to mergeP in(i, proportionj) and merge to clusters their (normalised)i and j; via one or more hidden layers. Data transformations are ( ) p( Yi, Yj µij, !ij). We assume that the probability den- merged9 probabilities.(ii) or sample The a pair numberi, j with of probability clusters will now applied between each layer of the network. If a hid- { }| ( ) sity function is a multivariate Gaussian with mean and decrease toP iK, j −and1. merge We then clusters projecti and thej; data matrix on den layer has fewer nodes than its predecessor then covariance matrix given by: to the10 firstq ←−K −q2− principal1; directions, i.e. removing the the information from the previous layer in the autoen- ( 11 ) Y Y (i.e. remove last dimension); K − 1 -st principal← n×q component that explains the lowest coder network is forced into a lower dimensional form ni nj degree12 ofupdate variance (µ, ! in); the data, removing this dimen- hence performing dimensionality reduction. Each hidden µij µi µj, = ni nj + ni nj sion13 fromK the existingK 1; cluster centroids and covariance layer encodes a reduced dimensional representation of + + (1) ← − ni nj matrices.14 end for the input data. In an autoencoder, the parameters gov- ! ! ! , ij = i + j The above clustering operation is then repeated so that erning the data transformations between the layers are ni + nj ni + nj after every merge operation we remove a principal direc- fitted to minimise the mean-squared error between the where (ni, nj), (µi, µj) and (!i, !j) denote the sizes, cen- tionFigure until only2 gives a single an illustration cluster remains. of our If method sampling-based using an original input data and the output representation. It can troids and covariances of the clusters i and j respec- autoencoder representation. Autoencoders are feedfor- tively. We repeat this for all possible pairs (i, j).Wethen ward neural networks that accept d-dimensional input choose to merge two clusters by either (i) picking the pair data and report d-dimensional output data. The d nodes that has the highest probability or (ii) sampling a pair in the input and output network layers are connected of clusters to merge in proportion to their (normalised) via one or more hidden layers. Data transformations are merged probabilities. The number of clusters will now applied between each layer of the network. If a hid- decrease to K − 1. We then project the data matrix on den layer has fewer nodes than its predecessor then to the first K 2 principal directions, i.e. removing the the information from the previous layer in the autoen- ( ) − K − 1 -st principal component that explains the lowest coder network is forced into a lower dimensional form degree of variance in the data, removing this dimen- hence performing dimensionality reduction. Each hidden sion from the existing cluster centroids and covariance layer encodes a reduced dimensional representation of matrices. the input data. In an autoencoder, the parameters gov- The above clustering operation is then repeated so that erning the data transformations between the layers are after every merge operation we remove a principal direc- fitted to minimise the mean-squared error between the tion until only a single cluster remains. If sampling-based original input data and the output representation. It can BISCUIT Generative Model with Technical Variation

Latent counts which we want to recover Variation Without Technical

Observation Variation With Technical

26

[Prabhakaran et al., 2016]

https://github.com/sandhya212/BISCUIT_SingleCell_IMM_ICML_2016

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 37 / 52 BISCUIT BISCUIT (Bayesian Inference for Single-cell ClUstering and ImpuTing)

Data-driven priors to adapt to different datasets Global Distributions of Observed Data

Hyper-priors

Hyper-parameters

Priors set based on observed lib size Cluster-specific distribution parameters Cell-specific parameters

Observed gene 27 Prabhakaran*, Azizi* ICML 2016 expression per cell j

[Prabhakaran et al., 2016]

https://github.com/sandhya212/BISCUIT_SingleCell_IMM_ICML_2016

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 37 / 52 bioRxiv preprint first posted online Nov. 7, 2017; doi: http://dx.doi.org/10.1101/215723. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. SAFE

Individual clusterings SAFE-clustering

HGPA HGPA clustering k-way partitioning using shmetis Expression matrix of SC3 scRNA-seq data Hypergraph n single cells n single cells H1 H2 H3 H4 MCLA 0. 00 0 0. 00 0 0. 00 0 0 0. 00 0 0 0. 00 0 0. 00 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0. 00 0 0. 00 0 0. 00 0 0 0. 00 0 0 0. 00 0 0. 00 0 CIDR 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0. 00 0 0. 00 0 0. 00 0 0 0. 00 0 0 0. 00 0 0. 00 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0. 00 0 9. 12 4 0. 00 0 0 9. 46 4 0 0. 00 0 0. 00 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0. 00 0 0. 00 0 0. 00 0 0 0. 00 0 0 0. 00 0 0. 00 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0. 00 0 0. 00 0 0. 00 0 0 0. 00 0 0 0. 00 0 0. 00 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 k-way partitioning ANMI 0. 00 0 0. 00 0 0. 00 0 0 0. 00 0 0 0. 00 0 0. 00 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 Jaccard similarity Final cluster ensemble 0. 00 0 0. 00 0 0. 00 0 0 0. 00 0 0 0. 00 0 0. 00 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 MCLA clustering 0. 00 0 0. 00 0 0. 00 0 0 0. 00 0 0 0. 00 0 0. 00 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 using gpmetis calculation Normalization 0. 00 0 0. 00 0 0. 00 0 0 0. 00 0 0 0. 00 0 0. 00 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 matrix calculation 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 9. 13 6 0. 00 0 0. 00 0 0 1 0. 4 63 0 0 .0 00 9 .6 43 Seurat 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 Association index Genes Genes 0. 00 0 0. 00 0 0. 00 0 0 0. 00 0 0 0. 00 0 0. 00 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 9. 13 6 0. 00 0 0. 00 0 0 0. 00 0 0 9. 31 5 0. 00 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0. 00 0 0. 00 0 0. 00 0 0 0. 00 0 0 0. 00 0 0. 00 0 cells single 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 calculation 0. 00 0 0. 00 0 0. 00 0 0 0. 00 0 0 0. 00 0 0. 00 0 n 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0. 00 0 0. 00 0 0. 00 0 0 0. 00 0 0 0. 00 0 0. 00 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 9. 13 6 0. 00 0 9. 62 6 0 9. 46 4 0 0. 00 0 0. 00 0 Similarity matrix 0. 00 0 0. 00 0 0. 00 0 0 0. 00 0 0 0. 00 0 0. 00 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0. 00 0 0. 00 0 0. 00 0 0 0. 00 0 0 0. 00 0 0. 00 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 calculation t-SNE + k-means h hyperedges CSPA

k-way partitioning using gpmetis CSPA clustering

Figure 1. Overview of SAFE-clustering. Log-transformed expression matrix of scRNA-seq data are first clustered using four state-of-the-art methods, SC3, CIDR, Seurat and t-SNE + k-means; and then individual solutions are combined using one of the three hypergraph-based partitioning algorithms: hypergraph partitioning algorithm (HGPA), meta-cluster algorithm (MCLA) and cluster-based similarity partitioning algorithm (CSPA) to produce consensus clustering.

[Yang et al., 2017]

https://github.com/yycunc/SAFEclustering

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 38 / 52

Remarks

What clustering method to choose?

What is meant by "similar cells"?

How to choose the number of clusters?

Consistency between several methods, consensus of clusterings

Is it important to cluster all the cells? Fuzzy clustering versions

Ideas: biclustering, clustering + gene selection, ...

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 39 / 52 Plan

1 Introduction

2 Feature selection / extraction

3 Dimension reduction

4 Single cell clustering

5 Pseudotime analysis

6 Differential analysis

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 40 / 52 Pseudotime analysis

Trajectory inference aims to reconstruct a cellular dynamic process

2 main steps: Dimensionality reduction step: PCA, t-SNE, .... or graph-based techniques .... or clustering methods

Trajectory modelling step

Reviews: [Cannoodt et al., 2016]: comparison of 10 methods

[Saelens et al., 2018]: comparison of 29 (among 59 existing) methods https://github.com/dynverse/dynverse

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 41 / 52 Pseudotime analysis

Eur. J. Immunol. 2016. 46: 2496–2506 HIGHLIGHTS 2499

Figure 2. TI methods use several common building blocks and can be organized in a unifying modu- lar framework. Overall, TI consists of two steps. In the first step dimen- sionality reduction techniques such as manifold learning, clustering, or graph-based methods are used to convert the dataset to a more simpli- fied representation. This representa- tion of the data then allows the tra- jectory itself to be more easily mod- elled in a second step. In this step, the trajectory itself is found within the data using both graph-based and curve-based approaches, after which the cells themselves can be ordered using a variety of methods.

sets of genes or cells are grouped together and subsequently sim- the k nearest neighbors (KNN) for each cell (Fig. 2). For cytom- plified to one representative point per group. These representative etry data the Euclidean distance is used, whereas for transcrip- point sets provide a good approximation of the different cells or tomics data the cosine distance is used, as it is scale independent.

C.Maugis-Rabusseaugenes in the dataset, (IMT/INSA) speed upStatistical the downstream analysis for analysis scRNAseq and data can Many subgraphs are bootstrapped,42 / 52 by sampling L out of K edges even increase the interpretability of the results by reducing bio- per node, after which the shortest path distance between each logical and technical noise. cell and a starting cell is calculated. Waypoints are used to align the orderings of the cells in the different bootstraps, and a con- sensus distance value is calculated for each of the cells. Because Trajectory modeling step the shortest path algorithm is sensitive to noise in the expres- sion data, Wanderlust uses an ensemble of trajectories to average A trajectory can be represented using several formalisms. Many out the noise and thus increase its robustness (Fig. 3). As addi- TI methods use graph-based techniques, where a simplified graph tional input, Wanderlust requires specifying a starting cell, which representation is used as input to find a path through a series is used as a prior to further improve the robustness of the method, of nodes in the graph. These nodes can correspond to individual although this introduces a bias. cells or groups of cells, and different path-finding algorithms are Wishbone [40] is based on Wanderlust, but is able to detect a used by different algorithms. A number of methods rely on the bifurcation point in the trajectory. It also requires the end user to prior definition of a “starting cell” by the user [27, 29, 39, 40, 44]. specify a starting cell from which the algorithm will infer the tra- This cell is supposed to be representative for cells at the start of jectory. Wishbone uses a combination of PCA and diffusion maps the underlying dynamic process (e.g., the most immature cell in to reduce the dimensionality of the dataset to a few components the case of a cell developmental process), and is then used as a (Fig. 2). The distance between each cell is used to construct a KNN reference cell to compare all other cells against. Other methods graph, which is repeatedly sampled by selecting L out of K edges. [30, 42] look for the longest connected path in a sparsified graph, The shortest path distance from each cell to the starting cell is and then project all cells onto that path to order cells according to calculated, and waypoints are used to align the orderings of the the underlying dynamic process. cells in the different bootstraps. The disagreement between pairs Asecondwaytorepresentatrajectoryisbyconstructinga of trajectories is investigated to identify a bifurcation event, and curve [31, 41], or a series of curves, such that the whole trajectory each cell is mapped to the bifurcating trajectory (Fig. 3). Wish- optimally explains the variation between the individual cells. An bone uses several strategies to improve its robustness: (i) use of example of such an algorithm is principal curves, which extends an ensemble of inferred trajectories, (ii) use of a starting cell, and PCA by looking for nonlinear principal directions in the data such (iii) use of the nonlinear dimensionality reduction technique dif- that the curve is the average of all points that are projected onto it. fusion maps [45]. In theory, the detection of bifurcations could be applied iteratively in order to produce a multiple branching tra- Characterization of TI methods jectory instead of a bifurcating one, although this would increase the complexity of the problem at hand. TI methods make use of a wide variety of components SLICER [44] is a KNN-based TI method that is able to infer (Fig. 2), each with its own advantages and shortcomings branched trajectories while requiring little input from the end (Fig. 3), an overview of which is given in the next sections. user. It uses an alpha-hull to estimate the k parameter for locally Wanderlust [39] is one of the pioneering TI methods. While linear embedding dimensionality reduction. Locally linear embed- originally developed for cytometry data, it is also adopted for ding is used to compute the KNN graph, and to find the most single-cell transcriptomics usage. Wanderlust first creates a graph “extreme” cells (Fig. 2). The user then has to specify which of consisting of edges between each cell and its neighbors by finding these extreme cells is a starting cell, and from that cell the shortest

C ⃝ 2016 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.eji-journal.eu Pseudotime analysis

2500 Robrecht Cannoodt et al. Eur. J. Immunol. 2016. 46: 2496–2506

Figure 3. Each TI method has a unique set of strengths and weaknesses. The level of bias of each method indicates the amount of external information each method requires: , none; , some; -, alot.Thescalabilitywasdeterminedbyexecutingeachmethodonaninsilicodataset + ± containing 100 cells (resp. genes) and an increasing number of genes (resp. cells) until the execution time exceeded 10 s: , >10 000; , <10 000; + ± -, <1000. Code and documentation denotes the quality of the code and documentation thereof provided. The parameter ease-of-use denotes how much is expected of the end user to tune the parameters. C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 43 / 52 path to all other cells is determined. The geodesic entropy is used cluster centers are used to reduce the effect of noise while cal- to find branches; high values thereof indicate the existence of a culating the MST, although slight changes in the center locations branching point. SLICER helps the end user by automatically esti- might have a large impact on the MST and therefore the inferred mating its parameters and by determining good candidate starting trajectory. cells. Allowing the end user to select a starting cell through visual Another MST-based method is TSCAN [42], which is concep- inspection introduces a minimal bias effect in the results (Fig. 3). tually very similar to Monocle and Waterfall. TSCAN also uses As for now, it is the only technique to offer fully branched TI with PCA to reduce the dimensionality, but uses Mclust, a clustering minimal bias in the outputted results, which can also scale well to algorithm that models the datasets as a mixture of normal dis- thousands of genes and cells. tributions, to cluster the cells while automatically determining Monocle [30] is an alternative method to detect branched the number of clusters using the Bayesian information criterion trajectories. Monocle uses ICA as its initial dimensionality reduc- (Fig. 2). It calculates an MST through the cluster centers from tion step (Fig. 2), but first performs a differential expression test which a trajectory is inferred by determining the longest connected between the different cell populations, as ICA does not scale well path through the tree. Finally, all the cells are projected to the with an increasing number of genes. After the dimensionality has nearest point on the trajectory, creating the final ordering. TSCAN been reduced, a minimal spanning tree (MST) is calculated, after is also one of the few completely unsupervised methods, as it auto- which the longest connected path(s) is determined within the matically determines both start and end cell clusters like Monocle graph. Each cell is then assigned to the nearest point in the inferred does, while avoiding any prior filtering in the gene dimension. It trajectory. Calculating an MST between individual cells is typi- has a high ease of use, as all of its parameters are automatically cally very sensitive to noise (Fig. 3). By reducing the number of determined (Fig. 3). Like Monocle, it also finds the longest path in genes using a differential expression test, the execution time of an MST, but does so more reliably since the cells were first grouped ICA is decreased and potentially improves the robustness of the into clusters. However, similar to Waterfall, slight changes in MST. However, this introduces a heavy bias toward existing pop- center locations might have a large impact on the trajectory ulation groupings and away from possible heterogeneities within found. subpopulations. In a recent paper, Monocle was used to success- SCUBA [41] infers branching trajectories from time-series fully recover the continuous development from endothelial cells experiments. For whole-transcriptome expression data, it first to hematopoietic stem cells from scRNA-seq data [46]. applies PCA. The cells in the initial cell state are clustered using Waterfall [27] is also an MST-based method, but calculates an K-means, with the number of clusters automatically determined MST between clusters of cells instead of individual cells. Waterfall using the gap statistic. For each consecutive time point, the cells uses PCA to reduce the dimensionality, and clusters the cells using are mapped to one of the clusters in the earlier time points. Each K-means. An MST is calculated between the cluster centers, and group of cells is then clustered into two clusters with K-means, and the cluster with the lowest value in the first component is selected the gap statistic is used to determine whether a bifurcation occurs as the starting node. The longest path from the starting node is at this stage or not (Fig. 2). Finally, a penalized likelihood func- used as a trajectory, and each cell is perpendicularly projected to tion is used to optimize the trajectory by reassigning cells to more the path, creating the final ordering of the cells (Fig. 2). Waterfall fitting clusters. The model generated by SCUBA is unique in that it is one of the few completely unsupervised methods (Fig. 3). The does not directly estimate a pseudotime for every cell but instead

C ⃝ 2016 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.eji-journal.eu bioRxiv preprint first posted online Mar. 5, 2018; doi: http://dx.doi.org/10.1101/276907. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license.

Evaluation of trajectory inference methods

PseudotimeAn overview of the main analysis results from this study is shown in Figure 2. This includes an overview of the results obtained from the method characterisation (Figure 2a), the benchmarking evaluation (Figure 2b), and the quality control evaluation (Figure 2c).

a Method characterisation b Benchmark c Quality control

Priors Per metric Per source Per trajectory type Execution Practicality Categories

Type Topo. Startconstr.End StatesGenes Score CorrelationEdge flipRF MSEReal SyntheticLinearBifurcationConvergenceMultifurcationRootedDAG tree Cycle Graph Average% ErroredtimeError reason Score DeveloperUser friendly friendlyFuture-proofAvailabilityBehaviourCode assuranceCode qualityDocumentationPaper

Slingshot free id 2s 2%

TSCAN free 14s 0%

Monocle DDRTree free # 49s 5%

SCORPIUS fixed 29s 0%

cellTree Maptpx free id id 17m 0%

cellTree Gibbs free id id 2h 5%

Embeddr fixed 32s 0%

SLICE free id id id 2m 6%

reCAT fixed 38m 9%

Wishbone param # id 45s 19%

MFA param # 30m 0%

Waterfall fixed 9s 0%

cellTree VEM free id id 2m 1%

Monocle ICA param # 1m 13%

SCUBA free id id 9m 1%

Topslam fixed 10m 8%

Phenopath fixed 12m 0%

SCOUP param id # id 1h 25%

DPT fixed id 51s 0%

Ouijaflow fixed id 3m 6%

Wanderlust fixed 44s 17%

GPfates param # 12m 25%

StemID free 39m 17%

Sincell free 15s 6%

Mpath free id id 3s 30%

SLICER free id id 7m 0%

SCIMITAR fixed insufficient data 2h 70%

Ouija fixed id insufficient data 5h 88%

Pseudogp fixed insufficient data 5h 90%

Topology constraints Prior Benchmark score Error reason QC score free inferred by algorithm id optional cell/gene ids Memory limit exceeded parameter determined by parameter # optional amount Time limit exceeded low high low high fixed fixed by algorithm id required cell/gene ids Dataset-specific error Stochastic error Generated on 2018-03-05 # required amount

Figure 2 Overview of the results on the evaluation of 29 TI methods. a) The methods were characterised according to the C.Maugis-Rabusseaumost complex (IMT/INSA)trajectory type they canStatistical infer, whether analysis or not the for inferred scRNAseq topology datais constrained by the algorithm or a parameter, 44 / 52 and which prior information a method requires or can optionally use. b) The methods are ordered according to the overall score achieved in the benchmark evaluation. Also shown are the aggregated scores per metric, source and trajectory type, as well as the average execution time across all datasets and the percentage of executions in which no output was produced. c) Overall performance in the quality control evaluation is highly variable, even amongst the highest ranked methods according to the benchmark evaluation. Also listed are the quality control scores aggregated according to practicality and the different categories.

Having ordered all methods by their overall benchmarking score, we found that Slingshot predicted the most accurate trajec- tories, followed by TSCAN and Monocle DDRTree. When we looked at the benchmark scores per trajectory type, Slingshot was the only method that performed well across most trajectory types. However, we found that several methods were specialised in predicting specific trajectory types; for example SCORPIUS for linear trajectories, reCAT for cycles, and Monocle DDRTree for trees.

We observed a high correlation (0.7-0.9) between results originating from real datasets versus those originating from synthetic datasets (Supplementary Figure 6). This confirms both the relevance of the synthetic data and the accuracy of the gold stan-

5 Plan

1 Introduction

2 Feature selection / extraction

3 Dimension reduction

4 Single cell clustering

5 Pseudotime analysis

6 Differential analysis

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 45 / 52 Methods for differential analysis

Review: [Soneson and Robinson, 2018]

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 46 / 52 Methods for differential analysis

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 47 / 52 Some other interesting resources

List of software packages for single-cell data analysis, including RNA-seq: https://github.com/seandavi/awesome-single-cell

Course of Hemberg’s lab:

https://hemberg-lab.github.io/scRNA.seq.course/index.html

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 48 / 52 ReferencesI Andrews, T. S. and Hemberg, M. (2018). Identifying cell populations with scrnaseq. Molecular aspects of medicine, 59:114–122.

Brennecke, P., Anders, S., Kim, J. K., Kołodziejczyk, A. A., Zhang, X., Proserpio, V., Baying, B., Benes, V., Teichmann, S. A., Marioni, J. C., et al. (2013). Accounting for technical noise in single-cell rna-seq experiments. Nature methods, 10(11):1093.

Butler, A., Hoffman, P., Smibert, P., Papalexi, E., and Satija, R. (2018). Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nature biotechnology, 36(5):411.

Cannoodt, R., Saelens, W., and Saeys, Y. (2016). Computational methods for trajectory inference from single-cell transcriptomics. European journal of immunology, 46(11):2496–2506.

Duò, A., Robinson, M. D., and Soneson, C. (2018). A systematic performance evaluation of clustering methods for single-cell rna-seq data. F1000Research, 7.

Freytag, S., Tian, L., Lönnstedt, I., Ng, M., and Bahlo, M. (2018). Comparison of clustering tools in r for medium-sized 10x genomics single-cell rna-sequencing data. F1000Research, 7.

Grün, D., Lyubimova, A., Kester, L., Wiebrands, K., Basak, O., Sasaki, N., Clevers, H., and van Oudenaarden, A. (2015). Single-cell messenger rna sequencing reveals rare intestinal cell types. Nature, 525(7568):251.

Guo, M., Wang, H., Potter, S. S., Whitsett, J. A., and Xu, Y. (2015). Sincera: a pipeline for single-cell rna-seq profiling analysis. PLoS computational biology, 11(11):e1004575.

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 49 / 52 ReferencesII

Juliá, M., Telenti, A., and Rausell, A. (2015). Sincell: an r/bioconductor package for statistical assessment of cell-state hierarchies from single-cell rna-seq. Bioinformatics, 31(20):3380–3382.

Kim, J. K., Kolodziejczyk, A. A., Ilicic, T., Teichmann, S. A., and Marioni, J. C. (2015). Characterizing noise structure in single-cell rna-seq distinguishes genuine from technical stochastic allelic expression. Nature communications, 6:8687.

Kiselev, V. Y., Kirschner, K., Schaub, M. T., Andrews, T., Yiu, A., Chandra, T., Natarajan, K. N., Reik, W., Barahona, M., Green, A. R., and Hemberg, M. (2017). Sc3: consensus clustering of single-cell rna-seq data. Nature Methods, 14:483 EP –.

Lin, P., Troup, M., and Ho, J. W. (2017). Cidr: Ultrafast and accurate clustering through imputation for single-cell rna-seq data. Genome biology, 18(1):59.

Lun, A. T., McCarthy, D. J., and Marioni, J. C. (2016). A step-by-step workflow for low-level analysis of single-cell rna-seq data with bioconductor. F1000Research, 5.

Pierson, E. and Yau, C. (2015). Zifa: Dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome biology, 16(1):241.

Poirion, O. B., Zhu, X., Ching, T., and Garmire, L. (2016). Single-cell transcriptomics bioinformatics and computational challenges. Frontiers in genetics, 7:163.

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 50 / 52 References III

Prabhakaran, S., Azizi, E., Carr, A., and Pe’er, D. (2016). Dirichlet process mixture model for correcting technical variation in single-cell gene expression data. In International Conference on Machine Learning, pages 1070–1079.

Risso, D., Perraudeau, F., Gribkova, S., Dudoit, S., and Vert, J.-P. (2017). Zinb-wave: A general and flexible method for signal extraction from single-cell rna-seq data. bioRxiv. Saelens, W., Cannoodt, R., Todorov, H., and Saeys, Y. (2018). A comparison of single-cell trajectory inference methods: towards more accurate and robust tools. bioRxiv, page 276907.

Satija, R., Farrell, J. A., Gennert, D., Schier, A. F., and Regev, A. (2015). Spatial reconstruction of single-cell gene expression data. Nature biotechnology, 33(5):495.

Soneson, C. and Robinson, M. (2018). Bias, robustness and scability in single-cell differential expression analysis. Nature Methods, 15:255–261.

Vallejos, C., Marioni, J., and Richardson, S. (2015). Basics: Bayesian analysis of single-cell sequencing data. PLOS COMPUTATIONAL BIOLOGY, 11.

Wang, B., Zhu, J., Pierson, E., Ramazzotti, D., and Batzoglou, S. (2017). Visualization and analysis of single-cell rna-seq data by kernel-based similarity learning. Nature methods, 14(4):414.

Wolf, F. A., Angerer, P., and Theis, F. J. (2018). Scanpy: large-scale single-cell gene expression data analysis. Genome biology, 19(1):15.

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 51 / 52 ReferencesIV

Xu, C. and Su, Z. (2015). Identification of cell types from single-cell transcriptomes using a novel clustering method. Bioinformatics, 31(12):1974–1980.

Yang, Y., Huh, R., Culpepper, H. W., Lin, Y., Love, M. I., and Li, Y. (2017). SAFE-clustering: Single-cell aggregated (from ensemble) clustering for single-cell RNA-seq data. bioRxiv. Zeisel, A., Muñoz-Manchado, A. B., Codeluppi, S., Lönnerberg, P., La Manno, G., Juréus, A., Marques, S., Munguba, H., He, L., Betsholtz, C., et al. (2015). Cell types in the mouse cortex and hippocampus revealed by single-cell rna-seq. Science, 347(6226):1138–1142.

Žurauskiene,˙ J. and Yau, C. (2016). pcareduce: hierarchical clustering of single cell transcriptional profiles. BMC Bioinformatics, 17.

C.Maugis-Rabusseau (IMT/INSA) Statistical analysis for scRNAseq data 52 / 52