Genetics: Early Online, published on October 11, 2016 as 10.1534/genetics.116.193714

Detecting sources of transcriptional heterogeneity in large-scale RNA-Seq data sets

Brian C. Searle, Rachel M. Gittelman, Ohad Manor, and Joshua M. Akey

Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA

Copyright 2016.

RUNNING TITLE

Detecting sources of heterogeneity

KEY WORDS

GTEx Consortium; expression normalization; random forest classification; transcriptional heterogeneity

CORRESPONDING AUTHOR

Joshua M. Akey, PhD
[email protected]
Department of Genome Sciences
University of Washington School of Medicine
Box 355065
1705 NE Pacific Street
Seattle, WA 98195

ABSTRACT

Gene expression levels are dynamic molecular phenotypes that respond to biological, environmental, and technical perturbations. Here we use a novel replicate classifier approach for discovering transcriptional signatures and apply it to the Genotype-Tissue Expression (GTEx) data set. We identified many factors contributing to expression heterogeneity, such as collection center and ischemia time, and our approach of scoring replicate classifiers allows us to statistically stratify these factors by effect strength. Strikingly, from transcriptional expression in blood alone we detect markers that help predict heart disease and stroke in some patients. Our results illustrate the challenges and opportunities of interpreting patterns of transcriptional variation in large-scale data sets.

INTRODUCTION

Unlike previous large-scale tissue (FANTOM et al. 2015) or cell-type (ENCODE et al. 2012) specific expression data sets, the GTEx Project (GTEx et al. 2015) is unique in the breadth of tissue types sampled from the same individuals. The GTEx Consortium has previously demonstrated that tissue-specific gene expression signatures are preserved in postmortem samples using hierarchical clustering (Melé et al. 2015), which groups samples by gene expression using a data-driven approach to identify hidden structure in the data. While hierarchical clustering is effective at identifying the greatest global source of variation, it does not capture more subtle sources of variation. For example, in the context of the GTEx Project, hierarchical clustering largely captures gene expression variation due to tissue type, but less effectively captures the influence of confounding factors like age or sex.

Using the GTEx pilot data freeze version 4, we attempted to recapitulate the results of hierarchical clustering using supervised Random Forest (RF) classification (Breiman 2001). Unlike hierarchical clustering, RF uses sample type annotations in a training data set to create decision trees where the nodes correspond to genes whose expression levels distinguish between tissue types. Although RF classification typically considers a single classifier per classification task, we randomly generated replicate classifiers to statistically assess how well two groups can be distinguished. This approach is markedly distinct from hierarchical clustering or PCA and enables statistical uncertainty to be rigorously quantified. These analyses reveal strong transcriptional signatures that contribute to patterns of expression heterogeneity in the GTEx data. More broadly, our results highlight that a deeper understanding of the determinants of transcriptional variation enables insights into the biological factors that govern variation in gene expression among tissues and individuals.

MATERIALS AND METHODS

Normalization and Data Curating:

We first removed samples of non-European descent and summed counts for technical replicates. We normalized expression profiles to the upper quartile (Bullard et al. 2010) and removed globally weak responding genes (13.7%) where no tissue type had more than 10 samples with at least 5 counts. We also removed D-statistic outlier samples ('t Hoen et al. 2013) (approximately 3%) if they correlated poorly across all genes to over half the samples within the tissue type (Pearson’s correlation scores <0.5). This resulted in a data set containing 17,702 genes and 1,821 total samples from 165 donors across the 23 tissue types we considered. The genes, samples, and donors are enumerated in Table S3. We used normalized gene expression results downloaded from the GTEx website to interrogate the effect of DESeq/PEER normalization on the strength of cofactors.
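As an illustration of the normalization and filtering steps described above, the following Python sketch scales each sample by its upper quartile and applies the "more than 10 samples with at least 5 counts" rule to a single gene. The function names, the dictionary-based data layout, the choice to compute the quartile over nonzero counts, and the rescaling to the mean upper quartile are all assumptions made for illustration; they are not taken from the authors' code.

```python
from statistics import mean, quantiles

def upper_quartile(counts):
    """75th percentile of the nonzero counts in one sample (assumed convention)."""
    nonzero = [c for c in counts if c > 0]
    return quantiles(nonzero, n=4, method="inclusive")[2]

def upper_quartile_normalize(samples):
    """samples: dict sample_id -> list of raw counts (same gene order in each).
    Scales every sample so its upper quartile equals the mean upper quartile."""
    uq = {s: upper_quartile(counts) for s, counts in samples.items()}
    target = mean(list(uq.values()))
    return {s: [c * target / uq[s] for c in counts]
            for s, counts in samples.items()}

def keep_gene(counts_by_tissue, min_samples=10, min_count=5):
    """counts_by_tissue: dict tissue -> list of counts for ONE gene.
    Keep the gene if some tissue has MORE than min_samples samples
    with at least min_count counts."""
    return any(sum(c >= min_count for c in counts) > min_samples
               for counts in counts_by_tissue.values())
```

Under these assumptions, two samples that differ only by sequencing depth (for example, one with exactly twice the counts of the other) map to the same normalized profile.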

Random Forest classifiers for predicting tissue type:

Random Forest is an ensemble method for classifying groups using a collection of decision trees. In this work we chose to use RF because unlike most machine learning approaches, RF classification is robust in the face of large numbers of features and high feature redundancy, making it ideal for classifying gene expression data sets. Additionally, the decision trees inside a RF are generally easy to interpret, which we exploit to identify split-point gene signatures. Realistically, considering only the issue of tissue classification, the approach we present here would still likely work well with a different classification method as its foundation.

Each tree in the forest is trained using a bootstrapped selection of “in-bag” samples (a random subset with replacement). Typically decision trees are trained at each node to determine the feature (of N total features) that best splits the samples between two classes using the entire feature set. However, in RF typically each split is chosen from a subset of the square root of N randomly selected features. These two levels of randomness help buffer against over-fitting in the presence of a high number of features. RF prediction for a sample is essentially a voting system where the prediction is the majority vote classification across all of the trees.

In this study we used entropy as a measure of information gain when selecting decision points. Decision split-points that were already biased towards a 90% or greater decision were eliminated to improve generalization. For each forest approximately 37% of the samples are not selected (i.e. “out-of-bag”) in each bootstrapped sample group. While RF tree pruning is uncommon because of efficiency concerns, we use these out-of-bag samples to prune leaves that lower classification accuracy in “unseen” training set samples to help improve generalization.
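The figure of approximately 37% follows from sampling with replacement. As a brief worked derivation (a standard result, stated here under the assumption that each bootstrap draws n samples from a pool of n training samples), the probability that a given sample is never drawn is:

```latex
P(\text{sample } i \text{ is out-of-bag})
  = \left(1 - \frac{1}{n}\right)^{n}
  \;\xrightarrow{\; n \to \infty \;}\; e^{-1} \approx 0.368
```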

Unlike SVM or logistic regression methods that produce unique, global solutions, RF classifiers are affected by random starting conditions. Each time we trained a RF classifier we selected a different random starting point and a different subset of training data, which consequently produced slightly different performances. We took advantage of this by generating 100 “one-vs-rest” (Bishop 2006) binary RF classifiers for a given tissue type where each classifier operated as a “technical” replicate.

Each RF was aggregated across 100 weak predictor decision trees. This number of weak predictors was the point at which we found the ROC-AUC was guaranteed to have converged. We used all of the M query samples (specific to the classified tissue type) in our training/testing sample pool and M/2 non-target samples from each of the background tissue types to maintain an even distribution for classifier comparison. Each forest was generated using 80% of the samples randomly selected from the sample pool for each tissue type. The corresponding 20% of the samples were reserved exclusively to evaluate the classification accuracy of our classifier. We limited the number of non-target samples in the testing set to be no more than 90% of the total testing pool. This percentage is relatively high to ensure sufficient background tissue diversity. In an effort to speed up the process of classification, we trimmed each feature list before classification to the top 1% of genes that separated the 80% randomly selected training samples using a Mann-Whitney U test. Finally, we performed ROC-AUC integration calculations using the trapezoidal rule and calculated confidence intervals around median ROC-AUC values using medians of 100 bootstrapped sample sets.
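The scoring step at the end of this procedure can be sketched as follows: ROC-AUC by trapezoidal integration, then a bootstrapped confidence interval around the median of the replicate scores. This is a minimal illustration in standard-library Python; the function names, the percentile-style interval, and the handling of tied scores are assumptions rather than details given in the text.

```python
import random
from statistics import median

def roc_auc(labels, scores):
    """ROC-AUC by the trapezoidal rule. labels: 1 = target, 0 = non-target."""
    pos = sum(labels)
    neg = len(labels) - pos
    # Sweep the decision threshold from high to low, grouping tied scores.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    auc, tp, fp, prev_tpr, prev_fpr = 0.0, 0, 0, 0.0, 0.0
    i = 0
    while i < len(ranked):
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        tpr, fpr = tp / pos, fp / neg
        auc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0  # trapezoid area
        prev_tpr, prev_fpr = tpr, fpr
        i = j
    return auc

def bootstrap_median_ci(values, n_boot=100, alpha=0.05, seed=0):
    """Median of replicate scores with a percentile interval taken from the
    medians of n_boot resampled sets (an assumed way to form the interval)."""
    rng = random.Random(seed)
    meds = sorted(median(rng.choices(values, k=len(values)))
                  for _ in range(n_boot))
    lo = meds[int((alpha / 2) * (n_boot - 1))]
    hi = meds[int((1 - alpha / 2) * (n_boot - 1))]
    return median(values), lo, hi
```

In use, `values` would hold the 100 ROC-AUC scores from the replicate classifiers for one tissue type, each computed on that replicate's held-out 20% of samples.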

Since we generated 100 replicate RF classifiers per tissue, we can determine critical decision genes by counting the number of times each gene is used as a decision split-point. Due to the decision tree splitting procedure the actual number of times a gene can be used for splitting scales with the number of samples. However, tissue-specific genes are used repeatedly over the 100 replicate classifiers, and the relative number of repeats can indicate key tissue-specific decision split-points. For each tissue type, we ran Gene Ontology (GO) enrichment analysis using the online PANTHER Overrepresentation Test (Mi et al. 2013) (release 20150430) using the Homo sapiens GO Ontology database (Released 2015-06-06) on the top 100 decision split-point genes. We required a stringent <0.05 Bonferroni corrected p-value for Biological Process GO enrichment. The number of independent tests is calculated as the number of ontology classes with at least two genes in the reference list.
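The split-point counting described above can be sketched in a few lines. The representation used here (each replicate forest as a list of trees, each tree as a list of the gene names at its decision nodes) is a hypothetical data layout chosen for illustration, and counting every node occurrence is one possible reading of "the number of times each gene is used".

```python
from collections import Counter

def split_point_counts(replicate_forests):
    """replicate_forests: list of forests; each forest is a list of trees;
    each tree is a list of gene names used at its decision split-points.
    Returns how often each gene was used across all replicates."""
    counts = Counter()
    for forest in replicate_forests:
        for tree in forest:
            counts.update(tree)
    return counts

def top_split_genes(replicate_forests, k=100):
    """The k most frequently used split-point genes, most frequent first."""
    ranked = split_point_counts(replicate_forests).most_common(k)
    return [gene for gene, _ in ranked]
```

A list such as the output of `top_split_genes(forests, k=100)` is the kind of input that would then be submitted for GO enrichment.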

Random Forest classifier for predicting blood-specific signatures:

Blood-specific markers for identifying sex, collection center, and ischemia time were identified using binary Random Forests, while classification of donor death was performed with one-vs-rest Random Forests using a similar system to that for predicting tissue type. We broke each factor down into the lowest number of possible classes specifically to limit signal dilution across the 165 donor blood samples. A randomly guessing classifier should produce a ROC-AUC score of 0.5. To verify this, for each classifier we randomly permuted the sample labels and calculated a background ROC-AUC. From these two pools of ROC-AUC scores, we calculated a two-tailed t-test with samples of unequal variance to estimate a p-value indicating how significant the classification success is for each factor.
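The comparison between observed and permuted-label ROC-AUC pools can be sketched as follows. This standard-library example shuffles labels to build the null and computes the unequal-variance (Welch) t statistic with its Welch-Satterthwaite degrees of freedom; the function names are illustrative. Converting the statistic to a two-tailed p-value requires the Student's t distribution, which is omitted here (for example, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the full test).

```python
import random
from math import sqrt
from statistics import mean, variance

def permute_labels(labels, seed=None):
    """Return a shuffled copy of the sample labels for a null classifier."""
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)
    return shuffled

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    for two pools of ROC-AUC scores with unequal variances."""
    va = variance(a) / len(a)  # squared standard error of pool a
    vb = variance(b) / len(b)  # squared standard error of pool b
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

Here `a` would be the ROC-AUC scores of the replicate classifiers and `b` the scores obtained after retraining on permuted labels.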

The names of the three collection centers are hidden using codes (“B1,” “C1,” and “D1”). We dropped the D1 center from this analysis because that center only collected samples from nine of the donors used in our work. The remaining two centers B1 (83 donors) and C1 (73 donors) were carried forward for RF analysis.

The GTEx coding of “cause of death” annotations can be ambiguous. Before any analysis took place, an independent researcher re-coded “cause of death” for us based on three annotation variables: “deathCategory”, “deathCause”, and “deathClass” (Table S3). In particular, care was exercised to differentiate the most common annotation “Cerebrovascular diseases”, which mixes stroke (generally annotated as CVA or “cerebral vascular accident” in deathCause) with brain trauma (generally annotated as ICH or “intracranial hemorrhage”).

Causes of death are heavily conflated with donor age. We attempted to regress out the effect of age using a linear model (Figure S6) but found this produced unexpected results, leading us to consider non-parametric normalization. To account for the effect of age we grouped donors by age into ten-year blocks. For each gene, we removed the median expression for the gene/age block and multiplied back the median of medians to keep the normalized expression counts near to their original levels. We used the same cofactor normalization approach to disentangle collection center from ischemia time. Cofactor normalization was not applied to the DESeq/PEER normalized results.
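One way to read the cofactor normalization described above is sketched below for a single gene. This sketch assumes that "removed" means dividing each value by its age-block median before multiplying by the median of the block medians; a subtract-then-add variant would follow the same structure. The function names and the ten-year binning by integer division are illustrative choices, and block medians are assumed to be nonzero.

```python
from statistics import median

def age_block(age, width=10):
    """Lower bound of the ten-year block containing this age (e.g. 25 -> 20)."""
    return (age // width) * width

def cofactor_normalize(expression, ages, width=10):
    """expression: values of ONE gene across donors; ages: matching donor ages.
    Divides each value by its age-block median, then multiplies by the median
    of the block medians so values stay near their original scale.
    Assumes every block median is greater than zero."""
    blocks = {}
    for value, age in zip(expression, ages):
        blocks.setdefault(age_block(age, width), []).append(value)
    block_median = {b: median(v) for b, v in blocks.items()}
    center = median(list(block_median.values()))
    return [value / block_median[age_block(age, width)] * center
            for value, age in zip(expression, ages)]
```

After this step every age block shares the same median for the gene, so a classifier can no longer separate groups using an age-driven shift in that gene alone.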

Data Availability:

The GTEx pilot data (version 4, dbGaP Accession: phs000424.v4.p1) we analyzed can be downloaded from NCBI at https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000424.v1.p1. The GTEx sample annotation tables are available through the GTEx Portal, http://gtexportal.org/home/datasets. Code implementing the random forest data analysis methods presented here is available at https://bitbucket.org/searleb/gtex-heterogeneity-classifier.

RESULTS AND DISCUSSION

Previous analyses of the GTEx data showed that tissue type could be accurately predicted from gene expression data for many, but not all, cell types (Melé et al. 2015). We first attempted to demonstrate the strength of tissue type gene expression signatures in the GTEx data using a novel classification algorithm. Briefly, for each tissue type we trained 100 RF classifiers to separate that tissue from the others, and calculated the median Receiver Operator Characteristic (ROC) area under the curve (AUC) score as an estimate of prediction accuracy. We further examined the RFs to quantify the frequency that a given gene is used as a decision split-point across all of the replicates. Genes that are frequently used as tissue-determining decision points are likely to be important signatures for tissue type, even if their specific differential expression levels are not significant. This approach differs from DESeq (Anders and Huber 2010) and other differential expression tools based on sequence count data for individual genes in that split-point analysis can identify transcription signatures that act in concert to distinguish tissue expression. To our knowledge, this work represents the first use of replicate RFs to in silico enrich for significant gene expression signatures.

Gene expression signatures determined by RF classification can essentially perfectly differentiate blood, muscle, lung, skin, heart, adrenal gland, liver, blood vessel, and nerve tissues with lower quartile ROC-AUCs>0.99 (Figure 1a). While Melé et al. (2015) were able to reasonably differentiate the first eight tissue types using hierarchical clustering, nerve tissue was not easily distinguishable from other tissue types. To investigate this disparity, we performed GO enrichment of the top 100 RF decision split genes for nerve tissue (Table S1) and identified 31 genes involved in nervous system development at a Bonferroni corrected (BC) p-value=9.2x10^-5. These genes include 11 involved in axon ensheathment/myelination (BC p-value=4.7x10^-9) and 10 involved in glial cell differentiation (BC p-value=1.1x10^-5) that confidently differentiate nerve tissue from other tissues in the GTEx data (Table S2).

Some tissues defy precise classification, as indicated by lower median ROC-AUCs (Figure 1a). For example, although GO enrichment of the top 100 split genes for stomach tissues indicates 17 signature genes involved in digestion (BC p-value=3.5x10^-16), stomach sample classification produced median ROC-AUC scores of only 0.82. When we clustered samples using Euclidean distance, these genes show two distinct groups of expression within stomach tissue (Figure 1b) where several tissue samples show little or no expression in any of the 17 digestion-related genes. As this heterogeneity does not appear correlated with donor characteristics or sample processing factors, we speculate that it may be related to whether or not the stomach was empty at the time of donor death.

Having demonstrated our ability to uncover expression subtypes within stomach tissue, we hypothesized that other sources of gene expression heterogeneity could be found in the GTEx data. The GTEx project samples are richly annotated with covariate data, which we used to identify additional factors that contribute to gene expression variability (Figure 1c). Here, we focused on expression data from blood, which had the largest sample size among all available tissues, and generated RF classifiers for a number of potential factors that contribute to expression heterogeneity among individuals. For example, sex-specific genes are clearly present in blood transcript levels (median ROC-AUC>0.99). As one would expect, all of the top 20 RF decision split genes are found on either the X or Y chromosome (e.g. XIST).

The GTEx Consortium noted that gene expression is affected by sample preparation conditions (GTEx et al. 2015) and we therefore evaluated the influence of sample collection center and total ischemia time (defined as the time after donor death but before the first tissue sample was extracted). Although sample collection center and ischemia time are correlated (Figure S1), their effects are separable. Specifically, when analyzing center-related signatures, we removed ischemia time biases using median-subtraction, and vice versa when analyzing ischemia-related signatures. We removed one of the three centers from this analysis because of small sample size. From the remaining two centers, we were able to determine with a high degree of accuracy which hospital collected the sample using only blood gene expression data (median ROC-AUC=0.97). We permuted the annotations for each RF, recalculated the distribution of ROC-AUCs, and computed a two-tailed t-test with unequal sample variance, which showed that the prediction accuracy was highly significant (p-value=7.6x10^-69).

Ischemic time can also be determined from blood samples using a similar approach. In the GTEx data, ischemia time was highly variable (Figure S2). The median ischemia time was 4.8 hours, with quartiles ranging from 2.9 to 11.4 hours. Although only 54% of the donor bodies were annotated for refrigeration status prior to sample extraction, of those that were, in general samples processed in less than 6 hours were not refrigerated, while those processed longer than 6 hours were. Indeed, replicate RF classifiers easily distinguished blood transcription patterns between quickly and slowly processed donors (median ROC-AUC=0.95, t-test p-value=1.1x10^-81).

We next examined our ability to discern the cause of death from blood-derived gene expression levels. We grouped donors into four categories based on the reported cause of death: “brain anoxia” (primarily strokes, but also asphyxiation), “heart disease” (primarily myocardial infarctions), “brain trauma” (primarily gunshot and transportation accidents), and “other” (see Table S3). The manner of donor death is heavily conflated with donor age (Figure S3). To account for this, we normalized for age-related gene expression using median subtraction. While the cause of death transcriptional signature is weaker than that of sex, center, or ischemia time, RF classification can significantly differentiate donors that died of “brain anoxia” in blood samples (median ROC-AUC=0.67; t-test p-value=1.0x10^-15; Figure S4a). Heart disease patients are more difficult to classify (Figure S4b), but a modestly significant expression signature is present (median ROC-AUC=0.57; t-test p-value<1.9x10^-2). Our approach does not differentiate “brain trauma” donors from the other groups (Figure S4c), most likely because the time between onset and death is often too short for injury-specific transcripts to build up in the circulatory system.

Although our RF approach highlights the impact of confounding variables on expression heterogeneity, the GTEx Consortium used well-established methods for mitigating these effects in their analyses. This included explicitly modeling several known confounders as well as learning additional hidden covariates using PEER (Stegle et al. 2010). PEER was informed about sex as a known cofactor, and as expected it is extremely effective at removing sex-related expression signals, eliminating our ability to predict sex (Figure 1d). PEER is also effective at attenuating unknown covariates if their influence is weak. While brain anoxia donors could still be differentiated in PEER corrected transcriptional data, the effect size was reduced (median ROC-AUC=0.56, t-test p-value=2.41x10^-3) and signatures for heart disease were eliminated. Since within this study gender did not bias the likelihood of brain anoxia or heart disease (Figure S5), this difference may indicate PEER's ability to suppress some unknown covariates. That said, strong unknown covariates, such as center (median ROC-AUC>0.99) and ischemic time (median ROC-AUC=0.97), remained. These results highlight the importance of carefully collecting covariate data when interpreting gene expression variation.

Finally, we investigated significantly over-represented RF decision genes to provide potential insights into the underlying biology driving cause of death signatures. We identified three significant RF decision split-points attributed to brain anoxia: CCR8 (e-value=1.7x10^-7), PRC1 (e-value=4.4x10^-3), and FGB (e-value=1.5x10^-2), shown in Figure 2a and 2c. The chemokine CCR8 (Offner et al. 2006) and the cytokinesis regulator PRC1 (Liu et al. 2014) are upregulated in vivo in the brains of stroke-injured rodents, while FGB is associated with stroke in a recent cross-study meta-analysis (Zhang and Luo 2015). The recapitulation of these markers in the GTEx data underlines the power of our RF approach, and the fact that we can identify them from blood samples (and not just brain tissue) is remarkable. Conversely, commonly used FDR-corrected Mann-Whitney U tests only identify CCR8 as weakly significant (BC p-value=4.4x10^-2), which suggests that RF split gene enrichment using FDR-corrected expected-value estimation may be more efficient at identifying certain gene signatures. Similarly, the top split-point gene associated with heart disease is PCDHA10 (e-value=9.1x10^-3, Figure 2b and 2d), which encodes a protocadherin cell-cell adhesion protein. Circulating cell adhesion markers have previously been implicated as a predictive measure for patients who died from coronary artery disease (Blankenberg et al. 2001). Conversely, no individual gene changed significantly between these groups when using FDR-corrected Mann-Whitney U tests.

Large-scale, multi-site projects are inherently complex and confounding factors are unavoidable in studies of natural populations because not every factor that could potentially influence gene expression can be controlled for. In particular, most human studies necessarily use tissue samples from deceased individuals, which can affect gene expression inferences. The use of replicate classifiers and RF split point analysis will enable more rigorous inferences into the causes and consequences of transcriptional variation.

ACKNOWLEDGEMENTS

We thank L. Pino and the Akey laboratory for helpful feedback related to this work, and M. Foster for help coding GTEx annotations. This work was supported by NIH/NHGRI grant 5U01HG007591 to JMA. BCS was supported by NIH/NHGRI training grant T32HG00035. We thank the donors and donor families for making the GTEx Project possible and the GTEx Consortium for sharing its data.

[Figure 1 image not reproduced. Panels a, c, and d plot average ROC-AUC accuracy (y-axis, 0.45 to 1.00). Panel a covers the tissue types skin, liver, lung, brain, heart, colon, testis, ovary, nerve, blood, breast, uterus, vagina, spleen, muscle, thyroid, prostate, stomach, pancreas, esophagus, blood vessel, adrenal gland, and adipose tissue. Panels c and d cover the blood factors sex, collection center, ischemia time, brain anoxia, brain trauma, and heart disease. Panel b is a heat map (relative scale, row min to row max) of stomach samples, labeled by GTEx donor ID, against the enriched digestion genes GKN1, PGA4, PGA5, MUC6, TFF2, PGC, AKR1B10, CTSE, VSIG1, CAPN8, CAPN9, SSTR1, CCKAR, CCKBR, PRSS3, CHRM3, and CHIA.]

Figure 1: Classification accuracy for tissue types and confounding factors in the GTEx project. a) We generated 100 “technical replicate” Random Forest classifiers for each tissue type and calculated median receiver operator characteristic (ROC) area under the curve (AUC) scores. Scores are between 1.0 (perfect classification) and 0.5 (random guessing). Median AUC scores are shown as blue dots with bootstrapped median 95% confidence intervals. For each classifier we also permuted the labels and recalculated the ROC-AUCs (red dots) to provide an unbiased null score for each tissue type. Relatively low ROC-AUCs exposed digestion related gene expression heterogeneity in stomach samples (b). RF classifier accuracy also indicates confounding factors in blood samples, including cause of death when (c) normalizing only for run bias or (d) after DESeq/PEER normalization with sex as a cofactor.

[Figure 2 image not reproduced. Panels a and b are histograms of the number of genes at each split-point enrichment score (x-axis: Split-Point Enrichment Score). Panels c and d plot log10(number of genes at each score) against split-point enrichment score, with fitted lines y = -0.08651x + 3.181 (R² = 0.9736) and y = -0.08569x + 3.014 (R² = 0.9456). Labeled genes: CCR8 (e=1.7x10^-7), PRC1 (e=4.4x10^-3), FGB (e=1.5x10^-2), and PCDHA10 (e=9.1x10^-3).]

Figure 2: FDR corrected expected values (e-value) for signature genes. Brain anoxia (a) and heart disease (b) split-point score distributions where each bar represents the cumulative number of genes that achieved that score. These distributions behave similarly to the Gumbel extreme-value distribution explored in the literature for BLAST (Karlin and Altschul 1993) e-value scoring. E-values for brain anoxia (c) and heart disease (d) samples are calculated by fitting a log-linear slope to the tail of the distribution of genes that achieve split-point enrichment scores between 10 and 25. The shaded blue area indicates e-value<0.05 and significant genes in that region are shown in red.
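The e-value calculation described in the Figure 2 caption can be sketched as a log-linear fit followed by extrapolation. The sketch below fits log10(gene count) against split-point enrichment score over the window 10 to 25 by ordinary least squares and evaluates the fitted line at a higher score. The function names and the dictionary input are illustrative assumptions, and any further FDR correction step is not shown.

```python
from math import log10

def fit_log_linear_tail(score_counts, lo=10, hi=25):
    """score_counts: dict score -> number of genes at that score (as plotted).
    Least-squares fit of log10(count) against score for lo <= score <= hi.
    Returns (slope, intercept)."""
    pts = [(s, log10(c)) for s, c in score_counts.items()
           if lo <= s <= hi and c > 0]
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    slope = sxy / sxx
    return slope, my - slope * mx

def e_value(score, slope, intercept):
    """Expected number of genes reaching this score under the fitted tail."""
    return 10 ** (slope * score + intercept)
```

Because the fit is made on the bulk of the distribution (scores 10 to 25), a gene whose score lies far beyond that window receives a small e-value, which is the sense in which it is flagged as an outlier.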

REFERENCES

Anders, S. and Huber W., 2010 Differential expression analysis for sequence count data. Genome Biol. 11(10): R106.

Bishop, C. M., 2006 Pattern Recognition and Machine Learning. Springer. pp. 182-183.

Blankenberg, S., Rupprecht H. J., Bickel C., Peetz D., Hafner G., et al., 2001 Circulating Cell Adhesion Molecules and Death in Patients With Coronary Artery Disease. Circulation 104: 1336-1342.

Breiman, L., 2001 Random Forests. Machine Learning 45(1): 5–32.

Bullard, J. H., Purdom E., Hansen K. D., and Dudoit S., 2010 Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics 11: 94.

ENCODE Project Consortium, 2012 An integrated encyclopedia of DNA elements in the human genome. Nature 489(7414): 57-74.

FANTOM Consortium and the RIKEN PMI and CLST (DGT), Forrest A. R., Kawaji H., Rehli M., Baillie J. K., et al., 2015 A promoter-level mammalian expression atlas. Nature 507: 462–470.

GTEx Consortium, 2015 The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science 348(6235): 648-660.

Karlin, S. and Altschul S. F., 1993 Applications and statistics for multiple high-scoring segments in molecular sequences. Proc. Natl. Acad. Sci. 90(12): 5873-5877.

Liu, K., Ding L., Li Y., Yang H., Zhao C., et al., 2014 Neuronal necrosis is regulated by a conserved chromatin-modifying cascade. Proc. Natl. Acad. Sci. 111(38): 13960–13965.

Melé, M., Ferreira P. G., Reverter F., DeLuca D. S., Monlong J., et al., 2015 The human transcriptome across tissues and individuals. Science 348(6235): 660-665.

Mi, H., Muruganujan A., Casagrande J. T., and Thomas P. D., 2013 Large-scale gene function analysis with the PANTHER classification system. Nat. Protoc. 8(8): 1551-1566.

Offner, H., Subramanian S., Parker S. M., Afentoulis M. E., Vandenbark A. A., et al., 2006 Experimental stroke induces massive, rapid activation of the peripheral immune system. J. Cereb. Blood Flow Metab. 26(5): 654-665.

Stegle, O., Parts L., Durbin R., and Winn J., 2010 A Bayesian framework to account for complex non-genetic factors in gene expression levels greatly increases power in eQTL studies. PLoS Comput. Biol. 6(5): e1000770.

't Hoen, P. A., Friedländer M. R., Almlöf J., Sammeth M., Pulyakhina I., et al., 2013 Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories. Nat. Biotechnol. 31(11): 1015-1022.

Zhang, X. F. and Luo T. Y., 2015 Association between the FGB gene polymorphism and ischemic stroke: a meta-analysis. Genet. Mol. Res. 14(1): 1741-1747.
