Genetics: Early Online, published on October 11, 2016 as 10.1534/genetics.116.193714
Detecting sources of transcriptional heterogeneity in large-scale RNA-Seq data sets
Brian C. Searle, Rachel M. Gittelman, Ohad Manor, and Joshua M. Akey
Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA
Copyright 2016.
RUNNING TITLE
Detecting sources of heterogeneity
KEY WORDS
GTEx Consortium; gene expression normalization; random forest classification; transcriptional
heterogeneity
CORRESPONDING AUTHOR
Joshua M. Akey, PhD
[email protected]
Department of Genome Sciences
University of Washington School of Medicine
Box 355065
1705 NE Pacific Street
Seattle, WA 98195
ABSTRACT
Gene expression levels are dynamic molecular phenotypes that respond to biological, environmental, and technical perturbations. Here we use a novel replicate classifier approach for discovering transcriptional signatures and apply it to the Genotype-Tissue Expression (GTEx) data set. We identified many factors contributing to expression heterogeneity, such as collection center and ischemia time, and our approach of scoring replicate classifiers allows us to statistically stratify these factors by effect strength. Strikingly, from transcriptional expression in blood alone we detect markers that help predict heart disease and stroke in some patients. Our results illustrate the challenges and opportunities of interpreting patterns of transcriptional variation in large-scale data sets.
INTRODUCTION
Unlike previous large-scale tissue-specific (FANTOM et al. 2015) or cell-type-specific (ENCODE et al. 2012) expression data sets, the GTEx Project (GTEx et al. 2015) is unique in the breadth of tissue types sampled from the same individuals. The GTEx Consortium has previously demonstrated that tissue-specific gene expression signatures are preserved in postmortem samples using hierarchical clustering (Melé et al. 2015), which groups samples by gene expression using a data-driven approach to identify hidden structure in the data. While hierarchical clustering is effective at identifying the greatest global source of variation, it does not capture more subtle sources of variation. For example, in the context of the GTEx Project, hierarchical clustering largely captures gene expression variation due to tissue type, but less effectively captures the influence of confounding factors like age or sex.
Using the GTEx pilot data freeze version 4, we attempted to recapitulate the results of hierarchical clustering using supervised Random Forest (RF) classification (Breiman 2001). Unlike hierarchical clustering, RF uses sample type annotations in a training data set to create decision trees where the nodes correspond to genes whose expression levels distinguish between tissue types. Although RF classification typically considers a single classifier per classification task, we randomly generated replicate classifiers to statistically assess how well two groups can be distinguished. This approach is markedly distinct from hierarchical clustering or principal component analysis (PCA) and enables statistical uncertainty to be rigorously quantified. These analyses reveal strong transcriptional signatures that contribute to patterns of expression heterogeneity in the GTEx data. More broadly, our results highlight that a deeper understanding of the determinants of transcriptional variation enables insights into the biological factors that govern variation in gene expression among tissues and individuals.
MATERIALS AND METHODS
Normalization and Data Curation:
We first removed samples of non-European descent and summed counts for technical replicates. We normalized expression profiles to the upper quartile (Bullard et al. 2010) and removed globally weakly responding genes (13.7%), defined as genes for which no tissue type had more than 10 samples with at least 5 counts. We also removed D-statistic outlier samples ('t Hoen et al. 2013) (approximately 3%) if they correlated poorly across all genes with over half the samples within the tissue type (Pearson’s correlation scores <0.5). This resulted in a data set containing 17,702 genes and 1,821 total samples from 165 donors across the 23 tissue types we considered. The genes, samples, and donors are enumerated in Table S3. We used normalized gene expression results downloaded from the GTEx website to interrogate the effect of DESeq/PEER normalization on the strength of cofactors.
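The normalization and gene filtering steps above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names, the convention of taking the upper quartile over nonzero counts, and the rescaling to the mean upper quartile across samples are all assumptions. Only the filter thresholds (more than 10 samples with at least 5 counts in some tissue type) come from the text.

```python
from statistics import mean, quantiles

def upper_quartile(counts):
    """75th percentile of the nonzero counts in one sample (assumed convention)."""
    nonzero = [c for c in counts if c > 0]
    return quantiles(nonzero, n=4, method="inclusive")[2]

def upper_quartile_normalize(samples):
    """Rescale every sample so that all samples share the same upper quartile.

    `samples` maps a (hypothetical) sample name to its list of gene counts.
    """
    uqs = {name: upper_quartile(counts) for name, counts in samples.items()}
    target = mean(list(uqs.values()))  # assumed common scale
    return {
        name: [c * target / uqs[name] for c in counts]
        for name, counts in samples.items()
    }

def keep_gene(counts_by_tissue, min_samples=10, min_count=5):
    """Keep a gene if some tissue has MORE than `min_samples` samples
    with at least `min_count` counts (thresholds taken from the text)."""
    return any(
        sum(c >= min_count for c in counts) > min_samples
        for counts in counts_by_tissue.values()
    )
```

Under these assumptions, two samples that differ only by sequencing depth end up with identical normalized profiles, and a gene is dropped only when it fails the count threshold in every tissue type.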
Random Forest classifiers for predicting tissue type:
Random Forest is an ensemble method for classifying groups using a collection of decision trees. In this work we chose to use RF because, unlike most machine learning approaches, RF classification is robust in the face of large numbers of features and high feature redundancy, making it well suited to classifying gene expression data sets. Additionally, the decision trees inside a RF are generally easy to interpret, which we exploit to identify split-point gene signatures. Realistically, considering only the issue of tissue classification, the approach we present here would still likely work well with a different classification method as its foundation.
Each tree in the forest is trained using a bootstrapped selection of “in-bag” samples (a random subset drawn with replacement). A conventional decision tree searches the entire set of N features at each node for the feature that best splits the samples between two classes. In RF, by contrast, each split is typically chosen from a random subset of approximately the square root of N features. These two levels of randomness help buffer against over-fitting in the presence of a high number of features. RF prediction for a sample is essentially a voting system where the prediction is the majority vote classification across all of the trees.
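The two sources of randomness and the voting step described above can be illustrated with a short sketch. The function names are hypothetical and no trees are actually trained here; the point is only to show bootstrap sampling, the square-root feature subset, and majority voting. A bootstrap of n samples leaves each sample out with probability (1 - 1/n)^n, which approaches e^-1 (about 37%) for large n.

```python
import math
import random
from collections import Counter

def bootstrap_indices(n_samples, rng):
    """Draw n_samples indices with replacement: the "in-bag" set for one tree."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def candidate_features(n_features, rng):
    """Pick floor(sqrt(N)) distinct features to consider at a single split."""
    k = max(1, math.isqrt(n_features))
    return rng.sample(range(n_features), k)

def majority_vote(tree_predictions):
    """Forest prediction: the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(0)
in_bag = bootstrap_indices(1000, rng)
# Fraction of samples never drawn, i.e. "out-of-bag" for this tree.
oob_fraction = 1 - len(set(in_bag)) / 1000
```

With the 17,702 genes retained in this study, `candidate_features` would consider 133 genes per split if no further feature trimming were applied.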
In this study we used entropy as a measure of information gain when selecting decision points. Decision split-points that were already biased towards a 90% or greater decision were eliminated to improve generalization. For each tree, approximately 37% of the samples are not selected (i.e. “out-of-bag”) in the bootstrapped sample group. Although tree pruning is uncommon in RF because of efficiency concerns, we use these out-of-bag samples to prune leaves that lower classification accuracy in “unseen” training set samples to help improve generalization.
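As an illustration of entropy-based split selection, the sketch below computes the information gain of a single expression threshold. The function names are hypothetical, and `is_nearly_pure` encodes one possible reading of the 90% rule (do not split a node that is already at least 90% one class); the text does not specify the exact implementation.

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / n
        result -= p * math.log2(p)
    return result

def information_gain(values, labels, threshold):
    """Entropy reduction from splitting samples at `values <= threshold`."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing gains nothing
    n = len(labels)
    children = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - children

def is_nearly_pure(labels, cutoff=0.9):
    """Assumed reading of the 90% rule: one class already holds >= cutoff."""
    top = max(labels.count(cls) for cls in set(labels))
    return top / len(labels) >= cutoff
```

A gene whose expression cleanly separates two equally sized classes yields the maximum gain of 1 bit, while a gene unrelated to the class labels yields a gain near zero.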
Unlike support vector machine (SVM) or logistic regression methods that produce unique, global solutions, RF classifiers are affected by random starting conditions. Each time we trained a RF classifier we selected a different random starting point and a different subset of training data, which consequently produced slightly different performance. We took advantage of this by generating 100 “one-vs-rest” (Bishop 2006) binary RF classifiers for a given tissue type, where each classifier operated as a “technical” replicate.
Each RF was aggregated across 100 weak predictor decision trees. This number of weak predictors was the point at which we found that the ROC-AUC had reliably converged. We used all of the M query samples (specific to the classified tissue type) in our training/testing sample pool and M/2 non-target samples from each of the background tissue types to maintain an even distribution for classifier comparison. Each forest was generated using 80% of the samples randomly selected from the sample pool for each tissue type. The remaining 20% of the samples were reserved exclusively to evaluate the classification accuracy of our classifier. We limited the number of non-target samples in the testing set to no more than 90% of the total testing pool. This percentage is relatively high to ensure sufficient background tissue diversity.
To speed up classification, we trimmed each feature list before classification to the top 1% of genes that best separated the 80% randomly selected training samples, as ranked by a Mann-Whitney U test. Finally, we performed ROC-AUC integration calculations using the trapezoidal rule and calculated confidence intervals around median ROC-AUC values using medians of 100 bootstrapped sample sets.
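The ROC-AUC integration and the bootstrapped confidence interval can be sketched as below. The function names are hypothetical, and the percentile method used to turn the 100 bootstrapped medians into an interval is an assumption; the text states only that medians of 100 bootstrapped sample sets were used.

```python
import random
from statistics import median

def roc_auc(scores, labels):
    """Area under the ROC curve by the trapezoidal rule.

    `labels` are 1 (target class) or 0; higher scores mean "more target".
    """
    pos = sum(labels)
    neg = len(labels) - pos
    # Sweep the threshold from the highest score down, grouping tied scores.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    auc = 0.0
    tp = fp = 0
    prev_tpr = prev_fpr = 0.0
    i = 0
    while i < len(ranked):
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        tpr, fpr = tp / pos, fp / neg
        auc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2  # one trapezoid
        prev_tpr, prev_fpr = tpr, fpr
        i = j
    return auc

def bootstrap_median_ci(values, n_boot=100, alpha=0.05, seed=0):
    """Percentile interval (assumed method) for the median of `values`."""
    rng = random.Random(seed)
    medians = sorted(
        median(rng.choices(values, k=len(values))) for _ in range(n_boot)
    )
    low = medians[int(alpha / 2 * n_boot)]
    high = medians[int((1 - alpha / 2) * n_boot) - 1]
    return low, high
```

In this setting `values` would be the 100 ROC-AUC scores from the replicate classifiers for one tissue type, and the returned pair brackets their median.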
Since we generated 100 replicate RF classifiers per tissue, we can determine critical decision genes by counting the number of times each gene is used as a decision split-point. Due to the decision tree splitting procedure, the actual number of times a gene can be used for splitting scales with the number of samples. However, tissue-specific genes are used repeatedly over the 100 replicate classifiers, and the relative number of repeats can indicate key tissue-specific decision split-points. For each tissue type, we ran Gene Ontology (GO) enrichment analysis on the top 100 decision split-point genes using the online PANTHER Overrepresentation Test (Mi et al. 2013) (release 20150430) with the Homo sapiens GO Ontology database (released 2015-06-06). We required a stringent Bonferroni-corrected p-value <0.05 for Biological Process GO enrichment. The number of independent tests is calculated as the number of ontology classes with at least two genes in the reference list.
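Counting split-point usage across replicates reduces to a simple tally. The sketch below assumes each replicate classifier has already been flattened into a list of the genes used at its split nodes; the function names and the placeholder gene names in the usage note are hypothetical.

```python
from collections import Counter

def rank_split_genes(replicate_split_genes, top_n=100):
    """Tally how often each gene is used as a split-point across replicates.

    `replicate_split_genes` holds one list of gene names per replicate
    classifier; a gene appears once for every node that splits on it.
    """
    counts = Counter()
    for split_genes in replicate_split_genes:
        counts.update(split_genes)
    return counts.most_common(top_n)

def bonferroni(p_value, n_tests):
    """Bonferroni-corrected p-value, capped at 1."""
    return min(1.0, p_value * n_tests)
```

For example, `rank_split_genes([["GENE_A", "GENE_B", "GENE_A"], ["GENE_A"]], top_n=1)` returns `[("GENE_A", 3)]`. For the GO enrichment step, `n_tests` would be the number of ontology classes with at least two genes in the reference list.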
Random Forest classifier for predicting blood-specific signatures:
Blood-specific markers for identifying sex, collection center, and ischemia time were identified using binary Random Forests, while classification of donor death was performed with one-vs-rest Random Forests using a system similar to that for predicting tissue type. We broke each factor down into the lowest possible number of classes specifically to limit signal dilution across the 165 donor blood samples. A randomly guessing classifier should produce a ROC-AUC score of 0.5. To verify this, for each classifier we randomly permuted the sample labels and calculated a background ROC-AUC. From these two pools of ROC-AUC scores, we performed a two-tailed t-test assuming unequal variances to estimate a p-value indicating how significant the classification success is for each factor.
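The comparison between observed and label-permuted ROC-AUC pools can be sketched with Welch's t statistic. The function names are hypothetical. Only the statistic and its Welch-Satterthwaite degrees of freedom are computed here, because the Python standard library has no Student t distribution.

```python
import math
import random
from statistics import mean, variance

def permuted_labels(labels, rng):
    """Shuffled copy of the labels, used to train a null classifier."""
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled

def welch_t(observed, background):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(observed) / len(observed)
    vb = variance(background) / len(background)
    t = (mean(observed) - mean(background)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (
        va ** 2 / (len(observed) - 1) + vb ** 2 / (len(background) - 1)
    )
    return t, df
```

Converting `t` and `df` into a two-tailed p-value requires the t distribution; if SciPy is available, `scipy.stats.ttest_ind(observed, background, equal_var=False)` performs the whole test in one call.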
The names of the three collection centers are hidden using codes (“B1,” “C1,” and “D1”). We dropped the D1 center from this analysis because that center only collected samples from nine of the donors used in our work. The remaining two centers, B1 (83 donors) and C1 (73 donors), were carried forward for RF analysis.
The GTEx coding of “cause of death” annotations can be ambiguous. Before any analysis took place, an independent researcher re-coded “cause of death” for us based on three annotation variables: “deathCategory”, “deathCause”, and “deathClass” (Table S3). In particular, care was exercised to differentiate the most common annotation “Cerebrovascular diseases”, which mixes stroke (generally annotated as CVA or “cerebral vascular accident” in deathCause) with brain trauma (generally annotated as ICH or “intracranial hemorrhage”).
Causes of death are heavily confounded with donor age. We attempted to regress out the effect of age using a linear model (Figure S6) but found this produced unexpected results, leading us to consider non-parametric normalization. To account for the effect of age we grouped donors by age into ten-year blocks. For each gene, we removed the median expression for the gene/age block and multiplied back the median of medians to keep the normalized expression counts close to their original levels. We used the same cofactor normalization approach to disentangle collection center from ischemia time. Cofactor normalization was not applied to the DESeq/PEER normalized results.
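One way to read the age-block normalization is sketched below for a single gene. This is an assumption-laden illustration: “removed the median” is interpreted here as dividing by the block median, because the text says the median of medians is “multiplied back”; a subtractive variant would subtract the block median and add the median of medians instead. The function names are hypothetical.

```python
from collections import defaultdict
from statistics import median

def age_block(age, width=10):
    """Lower bound of the ten-year block containing `age` (e.g. 47 -> 40)."""
    return (age // width) * width

def normalize_gene_by_age(expression, ages, width=10):
    """Rescale one gene so that every age block shares the same median.

    Assumed ratio form: value / block median * median of block medians.
    """
    blocks = defaultdict(list)
    for value, age in zip(expression, ages):
        blocks[age_block(age, width)].append(value)
    block_medians = {block: median(vals) for block, vals in blocks.items()}
    grand = median(block_medians.values())  # the "median of medians"
    normalized = []
    for value, age in zip(expression, ages):
        m = block_medians[age_block(age, width)]
        # Leave values untouched when a block median is zero (assumed guard).
        normalized.append(value / m * grand if m > 0 else value)
    return normalized
```

The same function could be applied with collection center or a binned ischemia time in place of the age block to sketch the cofactor normalization used to separate those two factors.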
Data Availability:
The GTEx pilot data (version 4, dbGaP Accession: phs000424.v4.p1) we analyzed can be downloaded from NCBI at https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000424.v1.p1. The GTEx sample annotation tables are available through the GTEx Portal, http://gtexportal.org/home/datasets. Code implementing the random forest data analysis methods presented here is available at https://bitbucket.org/searleb/gtex-heterogeneity-classifier.
RESULTS AND DISCUSSION
Previous analyses of the GTEx data showed that tissue type could be accurately predicted from gene expression data for many, but not all, tissue types (Melé et al. 2015). We first attempted to demonstrate the strength of tissue type gene expression signatures in the GTEx data using a novel classification algorithm. Briefly, for each tissue type we trained 100 RF classifiers to separate that tissue from the others, and calculated the median Receiver Operator Characteristic (ROC) area under the curve (AUC) score as an estimate of prediction accuracy. We further examined the RFs to quantify how frequently a given gene is used as a decision split-point across all of the replicates. Genes that are frequently used as tissue-determining decision points are likely to be important signatures for tissue type, even if their specific differential expression levels are not significant. This approach differs from DESeq (Anders and Huber 2010) and other differential expression tools based on sequence count data for individual genes in that split-point analysis can identify transcription signatures that act in concert to distinguish tissue expression. To our knowledge, this work represents the first use of replicate RFs to enrich in silico for significant gene expression signatures.
Gene expression signatures determined by RF classification can essentially perfectly differentiate blood, muscle, lung, skin, heart, adrenal gland, liver, blood vessel, and nerve tissues with lower quartile ROC-AUCs>0.99 (Figure 1a). While Melé et al. (2015) were able to reasonably differentiate the first eight tissue types using hierarchical clustering, nerve tissue was not easily distinguishable from other tissue types. To investigate this disparity, we performed GO enrichment of the top 100 RF decision split genes for nerve tissue (Table S1) and identified 31 genes involved in nervous system development at a Bonferroni corrected (BC) p-value=9.2x10^-5. These genes include 11 involved in axon ensheathment/myelination (BC p-value=4.7x10^-9) and 10 involved in glial cell differentiation (BC p-value=1.1x10^-5) that confidently differentiate nerve tissue from other tissues in the GTEx data (Table S2).
Some tissues defy precise classification, as indicated by lower median ROC-AUCs (Figure 1a). For example, although GO enrichment of the top 100 split genes for stomach tissues indicates 17 signature genes involved in digestion (BC p-value=3.5x10^-16), stomach sample classification produced median ROC-AUC scores of only 0.82. When we clustered samples using Euclidean distance, these genes showed two distinct groups of expression within stomach tissue (Figure 1b), where several tissue samples show little or no expression of any of the 17 digestion-related genes. As this heterogeneity does not appear correlated with donor characteristics or sample processing factors, we speculate that it may be related to whether or not the stomach was empty at the time of donor death.
Having demonstrated our ability to uncover expression subtypes within stomach tissue, we hypothesized that other sources of gene expression heterogeneity could be found in the GTEx data. The GTEx project samples are richly annotated with covariate data, which we used to identify additional factors that contribute to gene expression variability (Figure 1c). Here, we focused on expression data from blood, which had the largest sample size among all available tissues, and generated RF classifiers for a number of potential factors that contribute to expression heterogeneity among individuals. For example, sex-specific genes are clearly present in blood transcript levels (median ROC-AUC>0.99). As one would expect, all of the top 20 RF decision split genes are found on either the X or Y chromosome (e.g. XIST).
The GTEx Consortium noted that gene expression is affected by sample preparation conditions (GTEx et al. 2015), and we therefore evaluated the influence of sample collection center and total ischemia time (defined as the time after donor death but before the first tissue sample was extracted). Although sample collection center and ischemia time are correlated (Figure S1), their effects are separable. Specifically, when analyzing center-related signatures, we removed ischemia time biases using median-subtraction, and vice versa when analyzing ischemia-related signatures. We removed one of the three centers from this analysis because of small sample size. From the remaining two centers, we were able to determine with a high degree of accuracy which hospital collected the sample using only blood gene expression data (median ROC-AUC=0.97). We permuted the annotations for each RF, recalculated the distribution of ROC-AUCs, and computed a two-tailed t-test with unequal sample variance, which showed that the prediction accuracy was highly significant (p-value=7.6x10^-69).
Ischemia time can also be determined from blood samples using a similar approach. In the GTEx data, ischemia time was highly variable (Figure S2). The median ischemia time was 4.8 hours, with quartiles ranging from 2.9 to 11.4 hours. Although only 54% of the donor bodies were annotated for refrigeration status prior to sample extraction, among those that were, samples processed in less than 6 hours were generally not refrigerated, while those processed after more than 6 hours were. Indeed, replicate RF classifiers easily distinguished blood transcription patterns between quickly and slowly processed donors (median ROC-AUC=0.95, t-test p-value=1.1x10^-81).
We next examined our ability to discern the cause of death from blood-derived gene expression levels. We grouped donors into four categories based on the reported cause of death: “brain anoxia” (primarily strokes, but also asphyxiation), “heart disease” (primarily myocardial infarctions), “brain trauma” (primarily gunshot and transportation accidents), and “other” (see Table S3). The manner of donor death is heavily confounded with donor age (Figure S3). To account for this, we normalized for age-related gene expression using median subtraction. While the cause of death transcriptional signature is weaker than that of sex, center, or ischemia time, RF classification can significantly differentiate donors that died of “brain anoxia” in blood samples (median ROC-AUC=0.67; t-test p-value=1.0x10^-15; Figure S4a). Heart disease patients are more difficult to classify (Figure S4b), but a modestly significant expression signature is present (median ROC-AUC=0.57; t-test p-value<1.9x10^-2). Our approach does not differentiate “brain trauma” donors from the other groups (Figure S4c), most likely because the time between onset and death is often too short for injury-specific transcripts to build up in the circulatory system.
Although our RF approach highlights the impact of confounding variables on expression heterogeneity, the GTEx Consortium used well-established methods for mitigating these effects in their analyses. This included explicitly modeling several known confounders as well as learning additional hidden covariates using PEER (Stegle et al. 2010). PEER was informed about sex as a known cofactor, and as expected it is extremely effective at removing sex-related expression signals, eliminating our ability to predict sex (Figure 1d). PEER is also effective at attenuating unknown covariates if their influence is weak. While brain anoxia donors could still be differentiated in PEER corrected transcriptional data, the effect size was reduced (median ROC-AUC=0.56, t-test p-value=2.41x10^-3) and signatures for heart disease were eliminated. Since within this study sex did not bias the likelihood of brain anoxia or heart disease (Figure S5), this difference may indicate PEER's ability to suppress some unknown covariates. That said, strong unknown covariates, such as center (median ROC-AUC>0.99) and ischemia time (median ROC-AUC=0.97), remained. These results highlight the importance of carefully collecting covariate data when interpreting gene expression variation.
Finally, we investigated significantly over-represented RF decision genes to provide potential insights into the underlying biology driving cause of death signatures. We identified three significant RF decision split-points attributed to brain anoxia: CCR8 (e-value=1.7x10^-7), PRC1 (e-value=4.4x10^-3), and FGB (e-value=1.5x10^-2), shown in Figure 2a and 2c. The chemokine receptor CCR8 (Offner et al. 2006) and the cytokinesis regulator PRC1 (Liu et al. 2014) are upregulated in vivo in the brains of stroke-injured rodents, while FGB is associated with stroke in a recent cross-study meta-analysis (Zhang and Luo 2015). The recapitulation of these markers in the GTEx data underlines the power of our RF approach, and the fact that we can identify them from blood samples (and not just brain tissue) is remarkable. Conversely, commonly used FDR-corrected Mann-Whitney U tests only identify CCR8 as weakly significant (BC p-value=4.4x10^-2), which suggests that RF split gene enrichment using FDR-corrected expected-value estimation may be more efficient at identifying certain gene signatures. Similarly, the top split-point gene associated with heart disease is PCDHA10 (e-value=9.1x10^-3, Figure 2b and 2d), which encodes a protocadherin cell-cell adhesion protein. Circulating cell adhesion markers have previously been implicated as a predictive measure for patients who died from coronary artery disease (Blankenberg et al. 2001). Conversely, no individual gene changed significantly between these groups when using FDR-corrected Mann-Whitney U tests.
Large-scale, multi-site projects are inherently complex and confounding factors are unavoidable in studies of natural populations because not every factor that could potentially influence gene expression can be controlled for. In particular, most human studies necessarily use tissue samples from deceased individuals, which can affect gene expression inferences. The use of replicate classifiers and RF split point analysis will enable more rigorous inferences into the causes and consequences of transcriptional variation.
ACKNOWLEDGEMENTS
We thank L. Pino and the Akey laboratory for helpful feedback related to this work, and M. Foster for help coding GTEx annotations. This work was supported by NIH/NHGRI grant 5U01HG007591 to JMA. BCS was supported by NIH/NHGRI training grant T32HG00035. We thank the donors and donor families for making the GTEx Project possible and the GTEx Consortium for sharing its data.
[Figure 1 image not reproduced. Panels a, c, and d: y-axis “Average ROC-AUC Accuracy” (0.45 to 1.00). Panel a x-axis, tissue types: Adipose Tissue, Adrenal Gland, Blood, Blood Vessel, Brain, Breast, Colon, Esophagus, Heart, Liver, Lung, Muscle, Nerve, Ovary, Pancreas, Prostate, Skin, Spleen, Stomach, Testis, Thyroid, Uterus, Vagina. Panels c and d x-axis, factors: Sex, Collection Center, Ischemia Time, Brain Anoxia, Heart Disease, Brain Trauma. Panel b: heat map of stomach samples (columns labeled by GTEx donor ID; relative color scale from row min to row max) for the enriched digestion genes GKN1, PGA5, TFF2, AKR1B10, CTSE, VSIG1, CAPN8, CAPN9, SSTR1, CCKAR, PRSS3, CHRM3, CCKBR, CHIA, PGA4, MUC6, and PGC.]
Figure 1: Classification accuracy for tissue types and confounding factors in the GTEx project. a) We generated 100 “technical replicate” Random Forest classifiers for each tissue type and calculated median receiver operator characteristic (ROC) area under the curve (AUC) scores. Scores are between 1.0 (perfect classification) and 0.5 (random guessing). Median AUC scores are shown as blue dots with bootstrapped median 95% confidence intervals. For each classifier we also permuted the labels and recalculated the ROC-AUCs (red dots) to provide an unbiased null score for each tissue type. Relatively low ROC-AUCs exposed digestion related gene expression heterogeneity in stomach samples (b). RF classifier accuracy also indicates confounding factors in blood samples, including cause of death when (c) normalizing only for run bias or (d) after DESeq/PEER normalization with sex as a cofactor.
[Figure 2 image not reproduced. Panels a and b: histograms with x-axis “Split-Point Enrichment Score” and y-axis “# of Genes at each score”. Panels c and d: x-axis “Split-Point Enrichment Score” and y-axis “Log10 (# of Genes at Each Score)”, with the two fitted lines y = -0.08651x + 3.181 (R² = 0.9736) and y = -0.08569x + 3.014 (R² = 0.9456). Labeled genes: CCR8 (e=1.7x10^-7), PRC1 (e=4.4x10^-3), and FGB (e=1.5x10^-2) in panel c; PCDHA10 (e=9.1x10^-3) in panel d.]
Figure 2: FDR corrected expected values (e-value) for signature genes. Brain anoxia (a) and heart disease (b) split-point score distributions where each bar represents the cumulative number of genes that achieved that score. These distributions behave similarly to the Gumbel extreme-value distribution explored in the sequence homology literature for BLAST (Karlin and Altschul 1993) e-value scoring. E-values for brain anoxia (c) and heart disease (d) samples are calculated by fitting a log-linear slope to the tail of the distribution of genes that achieve split-point enrichment scores between 10 and 25. The shaded blue area indicates e-value<0.05 and significant genes in that region are shown in red.
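The log-linear tail fit described in the Figure 2 caption can be sketched with an ordinary least-squares line through log10(gene count) versus split-point score. The function names are hypothetical, the fit window of 10 to 25 comes from the caption, and the FDR correction mentioned in the caption is not reproduced here.

```python
import math

def fit_log_linear_tail(score_counts, lo=10, hi=25):
    """Least-squares fit of log10(count) against score for lo <= score <= hi.

    `score_counts` maps a split-point enrichment score to the (cumulative)
    number of genes reaching that score. Returns (slope, intercept).
    """
    points = [
        (score, math.log10(count))
        for score, count in score_counts.items()
        if lo <= score <= hi and count > 0
    ]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def e_value(score, slope, intercept):
    """Expected number of genes reaching `score` under the fitted tail."""
    return 10 ** (slope * score + intercept)
```

Extrapolating the fitted line to a high score gives the number of genes expected to reach that score by chance; a gene whose score corresponds to an e-value below 0.05 would fall in the shaded region of the figure.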
REFERENCES
Anders, S. and Huber W., 2010 Differential expression analysis for sequence count data. Genome Biol. 11(10): R106.
Bishop, C. M., 2006 Pattern Recognition and Machine Learning. Springer. pp. 182-183.
Blankenberg, S., Rupprecht H. J., Bickel C., Peetz D., Hafner G., et al., 2001 Circulating Cell Adhesion Molecules and Death in Patients With Coronary Artery Disease. Circulation 104: 1336-1342.
Breiman, L., 2001 Random Forests. Machine Learning 45(1): 5–32.
Bullard, J. H., Purdom E., Hansen K. D., and Dudoit S., 2010 Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics 11: 94.
ENCODE Project Consortium, 2012 An integrated encyclopedia of DNA elements in the human genome. Nature 489(7414): 57-74.
FANTOM Consortium and the RIKEN PMI and CLST (DGT), Forrest A. R., Kawaji H., Rehli M., Baillie J. K., et al., 2015 A promoter-level mammalian expression atlas. Nature 507: 462–470.
GTEx Consortium, 2015 The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science 348(6235): 648-660.
Karlin, S. and Altschul S. F., 1993 Applications and statistics for multiple high-scoring segments in molecular sequences. Proc. Natl. Acad. Sci. 90(12): 5873-5877.
Liu, K., Ding L., Li Y., Yang H., Zhao C., et al., 2014 Neuronal necrosis is regulated by a conserved chromatin-modifying cascade. Proc. Natl. Acad. Sci. 111(38): 13960–13965.
Melé, M., Ferreira P. G., Reverter F., DeLuca D. S., Monlong J., et al., 2015 The human transcriptome across tissues and individuals. Science 348(6235): 660-665.
Mi, H., Muruganujan A., Casagrande J. T., and Thomas P. D., 2013 Large-scale gene function analysis with the PANTHER classification system. Nat. Protoc. 8(8): 1551-1566.
Offner, H., Subramanian S., Parker S. M., Afentoulis M. E., Vandenbark A. A., et al., 2006 Experimental stroke induces massive, rapid activation of the peripheral immune system. J. Cereb. Blood. Flow. Metab. 26(5): 654-665.
Stegle, O., Parts L., Durbin R., and Winn J., 2010 A Bayesian framework to account for complex non-genetic factors in gene expression levels greatly increases power in eQTL studies. PLoS Comput. Biol. 6(5): e1000770.
't Hoen, P. A., Friedländer M. R., Almlöf J., Sammeth M., Pulyakhina I., et al., 2013 Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories. Nat. Biotechnol. 31(11): 1015-1022.
Zhang, X. F. and Luo T. Y., 2015 Association between the FGB gene polymorphism and ischemic stroke: a meta-analysis. Genet. Mol. Res. 14(1): 1741-1747.