Genetics: Early Online, published on October 11, 2016 as 10.1534/genetics.116.193714

Detecting sources of transcriptional heterogeneity in large-scale RNA-Seq data sets

Brian C. Searle, Rachel M. Gittelman, Ohad Manor, and Joshua M. Akey
Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA

Copyright 2016.

RUNNING TITLE
Detecting sources of heterogeneity

KEY WORDS
GTEx Consortium; gene expression normalization; random forest classification; transcriptional heterogeneity

CORRESPONDING AUTHOR
Joshua M. Akey, PhD
[email protected]
Department of Genome Sciences
University of Washington School of Medicine
Box 355065
1705 NE Pacific Street
Seattle, WA 98195

ABSTRACT

Gene expression levels are dynamic molecular phenotypes that respond to biological, environmental, and technical perturbations. Here we use a novel replicate-classifier approach for discovering transcriptional signatures and apply it to the Genotype-Tissue Expression (GTEx) data set. We identified many factors contributing to expression heterogeneity, such as collection center and ischemia time, and our approach of scoring replicate classifiers allows us to statistically stratify these factors by effect strength. Strikingly, from transcriptional expression in blood alone we detect markers that help predict heart disease and stroke in some patients. Our results illustrate the challenges and opportunities of interpreting patterns of transcriptional variation in large-scale data sets.

INTRODUCTION

Unlike previous large-scale tissue-specific (FANTOM et al. 2015) or cell-type-specific (ENCODE et al. 2012) expression data sets, the GTEx Project (GTEx et al. 2015) is unique in the breadth of tissue types sampled from the same individuals. The GTEx Consortium has previously demonstrated that tissue-specific gene expression signatures are preserved in postmortem samples using hierarchical clustering (Melé et al. 2015), which groups samples by gene expression in a data-driven way to identify hidden structure in the data. While hierarchical clustering is effective at identifying the greatest global source of variation, it does not capture more subtle sources of variation. For example, in the context of the GTEx Project, hierarchical clustering largely captures gene expression variation due to tissue type, but less effectively captures the influence of confounding factors such as age or sex.

Using the GTEx pilot data freeze version 4, we attempted to recapitulate the results of hierarchical clustering using supervised Random Forest (RF) classification (Breiman 2001). Unlike hierarchical clustering, RF uses sample-type annotations in a training data set to create decision trees whose nodes correspond to genes whose expression levels distinguish between tissue types. Although RF classification typically considers a single classifier per classification task, we randomly generated replicate classifiers to statistically assess how well two groups can be distinguished. This approach is markedly distinct from hierarchical clustering or PCA and enables statistical uncertainty to be rigorously quantified. These analyses reveal strong transcriptional signatures that contribute to patterns of expression heterogeneity in the GTEx data. More broadly, our results highlight that a deeper understanding of the determinants of transcriptional variation enables insights into the biological factors that govern variation in gene expression among tissues and individuals.
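To make the replicate-classifier idea concrete, the following minimal sketch (Python with scikit-learn, on synthetic data with illustrative parameter choices rather than the actual GTEx pipeline settings) repeatedly retrains a one-vs-rest Random Forest on random train/test splits and compares the distribution of held-out ROC-AUC scores with a label-permuted background.

```python
# Minimal sketch of the replicate-classifier idea: rather than reporting a
# single accuracy, retrain the classifier many times on random splits and
# treat the held-out ROC-AUC scores as a distribution. Data, sizes, and
# settings below are illustrative placeholders, not the GTEx pipeline.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def replicate_auc(X, y, n_replicates=100):
    """Held-out ROC-AUC scores from repeatedly retrained replicate RF classifiers."""
    scores = []
    for seed in range(n_replicates):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, stratify=y, random_state=seed)
        rf = RandomForestClassifier(n_estimators=100, criterion="entropy",
                                    random_state=seed).fit(X_tr, y_tr)
        scores.append(roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1]))
    return np.asarray(scores)

rng = np.random.default_rng(0)
X = rng.lognormal(size=(200, 500))    # toy stand-in for a normalized expression matrix
y = rng.integers(0, 2, size=200)      # 1 = target group, 0 = everything else

observed = replicate_auc(X, y)                      # scores for the real labels
background = replicate_auc(X, rng.permutation(y))   # scores after permuting the labels
print(np.median(observed), np.median(background))   # both near 0.5 for this random toy data
```

Comparing the two score distributions, rather than two point estimates, is what allows the strength of a factor to be stratified statistically.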
MATERIALS AND METHODS

Normalization and Data Curation:

We first removed samples of non-European descent and summed counts for technical replicates. We normalized expression profiles to the upper quartile (Bullard et al. 2010) and removed globally weakly responding genes (13.7%) for which no tissue type had more than 10 samples with at least 5 counts. We also removed D-statistic outlier samples ('t Hoen et al. 2013) (approximately 3%) if they correlated poorly across all genes (Pearson’s correlation <0.5) with more than half of the samples within the same tissue type. This resulted in a data set containing 17,702 genes and 1,821 total samples from 165 donors across the 23 tissue types we considered. The genes, samples, and donors are enumerated in Table S3. We used normalized gene expression results downloaded from the GTEx website to interrogate the effect of DESeq/PEER normalization on the strength of cofactors.

Random Forest classifiers for predicting tissue type:

Random Forest is an ensemble method for classifying groups using a collection of decision trees. We chose RF because, unlike most machine learning approaches, RF classification is robust to large numbers of features and high feature redundancy, making it well suited to gene expression data sets. Additionally, the decision trees inside an RF are generally easy to interpret, which we exploit to identify split-point gene signatures. That said, for the tissue-classification task alone, the approach we present here would likely work well with a different classification method as its foundation.

Each tree in the forest is trained on a bootstrapped selection of “in-bag” samples (a random subset drawn with replacement). Typically, a decision tree is trained at each node to determine the feature (of N total features) that best splits the samples between two classes using the entire feature set; in an RF, by contrast, each split is chosen from a subset of roughly the square root of N randomly selected features. These two levels of randomness help buffer against over-fitting in the presence of a large number of features. RF prediction for a sample is essentially a voting system in which the prediction is the majority-vote classification across all of the trees.

In this study we used entropy as the measure of information gain when selecting decision points. Decision split-points that were already biased towards a 90% or greater decision were eliminated to improve generalization. For each forest, approximately 37% of the samples are not selected (i.e., are “out-of-bag”) in each bootstrapped sample group. Although RF tree pruning is uncommon because of efficiency concerns, we used these out-of-bag samples to prune leaves that lowered classification accuracy on “unseen” training-set samples, further improving generalization.

Unlike SVM or logistic regression methods, which produce unique, global solutions, RF classifiers are affected by random starting conditions. Each time we trained an RF classifier we selected a different random starting point and a different subset of training data, which consequently produced slightly different performances. We took advantage of this by generating 100 “one-vs-rest” (Bishop 2006) binary RF classifiers for a given tissue type, where each classifier operated as a “technical” replicate. Each RF was aggregated across 100 weak-predictor decision trees; this number of weak predictors was the point at which we found the ROC-AUC had reliably converged.
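For orientation, the forest configuration described above maps roughly onto the options of a standard implementation; the sketch below shows one such mapping in scikit-learn, on synthetic placeholder data. The two custom steps described above, discarding split points already biased towards a 90% or greater decision and pruning leaves with out-of-bag samples, have no direct equivalent in this library and would require additional tree post-processing.

```python
# Rough mapping of the forest configuration described above onto scikit-learn.
# The custom split filtering and out-of-bag leaf pruning from the text are not
# reproduced here; data and sizes are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,      # 100 weak-predictor decision trees per forest
    criterion="entropy",   # entropy-based information gain at each split point
    max_features="sqrt",   # each split drawn from ~sqrt(N) randomly selected features
    bootstrap=True,        # each tree trained on an "in-bag" resample of the samples
    oob_score=True,        # ~37% "out-of-bag" samples give a built-in accuracy check
    n_jobs=-1,
)

rng = np.random.default_rng(1)
X = rng.lognormal(size=(150, 300))   # toy normalized expression matrix (samples x genes)
y = rng.integers(0, 2, size=150)     # 1 = target tissue, 0 = background tissues
rf.fit(X, y)
print(rf.oob_score_)                 # accuracy estimated on each tree's left-out samples
```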
We used all of the M query samples (specific to the classified tissue type) in our training/testing sample pool, together with M/2 non-target samples from each of the background tissue types, to maintain an even distribution for classifier comparison. Each forest was generated using 80% of the samples randomly selected from the sample pool for each tissue type; the remaining 20% of the samples were reserved exclusively to evaluate the classification accuracy of the classifier. We limited the number of non-target samples in the testing set to no more than 90% of the total testing pool; this percentage is relatively high in order to ensure sufficient background tissue diversity. To speed up classification, we trimmed each feature list before classification to the top 1% of genes that best separated the 80% of randomly selected training samples, as judged by a Mann-Whitney U test. Finally, we performed ROC-AUC integration using the trapezoidal rule and calculated confidence intervals around median ROC-AUC values using the medians of 100 bootstrapped sample sets.

Because we generated 100 replicate RF classifiers per tissue, we can identify critical decision genes by counting the number of times each gene is used as a decision split-point. Owing to the decision-tree splitting procedure, the absolute number of times a gene can be used for splitting scales with the number of samples; however, tissue-specific genes are used repeatedly over the 100 replicate classifiers, and the relative number of repeats can indicate key tissue-specific decision split-points. For each tissue type, we ran Gene Ontology (GO) enrichment analysis on the top 100 decision split-point genes using the online PANTHER Overrepresentation Test (Mi et al. 2013) (release 20150430) and the Homo sapiens GO Ontology database (released 2015-06-06). We required a stringent Bonferroni-corrected p-value <0.05 for Biological Process GO enrichment, where the number of independent tests is calculated as the number of ontology classes with at least two genes in the reference list.

Random Forest classifiers for predicting blood-specific signatures:

Blood-specific markers for identifying sex, collection center, and ischemia time were identified using binary Random Forests, while classification of donor death was performed with one-vs-rest Random Forests, using a system similar to that used for predicting tissue type. We broke each factor down into the lowest possible number of classes, specifically to limit signal dilution across the 165 donor blood samples. A randomly guessing classifier should produce a ROC-AUC score of 0.5; to verify this, for each classifier we randomly permuted the sample labels and calculated a background ROC-AUC.
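As one possible implementation of the split-point tallying described above for identifying critical decision genes, the sketch below walks the fitted trees of replicate forests and counts how often each gene is used as a split point; gene names, data, and settings are illustrative stand-ins rather than the GTEx analysis itself.

```python
# Sketch of counting decision split-point genes across replicate forests.
# Internal nodes of a fitted scikit-learn tree store the index of the feature
# they split on; leaves are marked with -2. Everything here is synthetic.
from collections import Counter
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
genes = [f"gene_{i}" for i in range(300)]
X = rng.lognormal(size=(150, len(genes)))   # toy expression matrix (samples x genes)
y = rng.integers(0, 2, size=150)            # 1 = target class, 0 = background

split_counts = Counter()
for seed in range(100):                     # 100 replicate classifiers, as in the text
    X_tr, _, y_tr, _ = train_test_split(X, y, test_size=0.2,
                                        stratify=y, random_state=seed)
    rf = RandomForestClassifier(n_estimators=100, criterion="entropy",
                                random_state=seed).fit(X_tr, y_tr)
    for tree in rf.estimators_:
        node_features = tree.tree_.feature      # one entry per node; -2 marks a leaf
        split_counts.update(genes[f] for f in node_features if f >= 0)

# Genes used most often across the replicate forests are candidate decision split-points.
print(split_counts.most_common(10))
```

Genes ranked this way could then be taken forward to an overrepresentation test, along the lines of the GO analysis described above.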