Running Head: PREDICTING CELL TYPE BALANCE

7. Supporting Information

Supplementary Material for:

INFERENCE OF CELL TYPE COMPOSITION FROM BRAIN TRANSCRIPTOMIC DATASETS ILLUMINATES THE EFFECTS OF AGE, MANNER OF DEATH, DISSECTION, AND PSYCHIATRIC DIAGNOSIS

*Megan Hastings Hagenauer, Ph.D.1, Anton Schulmann, M.D.2, Jun Z. Li, Ph.D.3, Marquis P. Vawter, Ph.D.4, David M. Walsh, Psy.D.4, Robert C. Thompson, Ph.D.1, Cortney A. Turner, Ph.D.1, William E. Bunney, M.D.4, Richard M. Myers, Ph.D.5, Jack D. Barchas, M.D.6, Alan F. Schatzberg, M.D.7, Stanley J. Watson, M.D., Ph.D.1, Huda Akil, Ph.D.1


7.1 Additional Methods: Detailed Preprocessing Methods for the Transcriptomic Datasets

7.1.1 RNA-Seq data from Purified Cell Types (GSE52564 and GSE67835)

As initial validation, we used our method to predict sample cell type identity in two RNA-Seq datasets derived from purified cell types: one derived from purified cortical cell types in mice (n=17: two samples per cell type and 3 whole brain samples: GSE52564) (18), and one derived from 466 single cells dissociated from freshly-resected human cortex (GSE67835) (2). The RNA-Seq data that we downloaded from GEO were already in the format of FPKM values (Fragments Per Kilobase of exon model per million mapped fragments) (18) or counts per gene (2). To stabilize the variance in the data, we used a log(2) transformation (after adding 1), and then filtered out the data for any genes that completely lacked variation across samples (sd=0; final gene count: GSE52564: 17148, GSE67835: 21627). Then we applied the methods now found in the BrainInABlender package, and examined the correlation between each of the cell type indices and sample cell type identity (excluding the fetal, whole brain, and “hybrid” cells). The code for these analyses can be found at https://github.com/hagenaue/CellTypeAnalyses_Darmanis and https://github.com/hagenaue/CellTypeAnalyses_Zhang.
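The variance-stabilizing transformation and the zero-variance filter described above can be sketched as follows. This is a minimal illustration in Python rather than the authors' code (which is in the repositories listed above); the function names and the toy expression values are hypothetical and are not taken from GSE52564 or GSE67835.

```python
import math
from statistics import pstdev


def log2_plus_one(expression):
    """Apply log2(x + 1) to every value in a {gene: [value per sample]} dict."""
    return {gene: [math.log2(value + 1.0) for value in values]
            for gene, values in expression.items()}


def drop_invariant_genes(expression):
    """Keep only genes whose standard deviation across samples is non-zero."""
    return {gene: values for gene, values in expression.items()
            if pstdev(values) > 0.0}


# Hypothetical toy values for three samples (illustration only).
toy_fpkm = {
    "GeneA": [0.0, 3.0, 7.0],
    "GeneB": [5.0, 5.0, 5.0],  # no variation across samples, so it is removed
    "GeneC": [1.0, 0.0, 15.0],
}
transformed = drop_invariant_genes(log2_plus_one(toy_fpkm))
```

Adding 1 before taking the logarithm keeps zero values defined, which is why the filter is applied afterwards on the transformed scale.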


7.1.2 Microarray Data from Artificial Cell Mixtures (GSE19380)

As further validation, we determined whether relative cell type balance could be accurately deciphered from microarray data for samples containing artificially-generated mixtures of cultured cells from the cerebral cortices of rat P1 pups (GSE19380, (12)). The microarray profiling was performed using an Affymetrix Rat Genome 230 2.0 Array. According to the methods on GEO, these data had already undergone probeset summarization and normalization using robust multi-array averaging (RMA, affy package (87)), including background subtraction, summarization by median polish, log(2) transformation, and quantile normalization. We then predicted the cell content of each sample from the microarray data using BrainInABlender, and correlated these predictions with the actual percent of each cell type found in the mixtures (Figure 2 in main text). The code for these analyses can be found at: https://github.com/hagenaue/CellTypeAnalyses_KuhnMixtures/tree/master.
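The comparison between predicted cell content and the known mixture percentages can be illustrated with a squared Pearson correlation. The following is a minimal sketch under stated assumptions: the function name `r_squared` and the toy values are hypothetical, and the authors' actual analysis is the one in the repository above.

```python
from statistics import mean


def r_squared(predicted, actual):
    """Squared Pearson correlation between two equal-length sequences."""
    mean_p, mean_a = mean(predicted), mean(actual)
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    return (cov * cov) / (var_p * var_a)


# Hypothetical cell type index values and mixture percentages (illustration only).
toy_index = [1.0, 2.0, 3.0, 4.0]
toy_percent = [1.0, 3.0, 2.0, 4.0]
toy_r2 = r_squared(toy_index, toy_percent)
```

Because the cell type index is a relative measure, a correlation is a natural summary: it is unaffected by the scale or offset of the index.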

7.1.3 Pritzker Dorsolateral Prefrontal Cortex Microarray Dataset (GSE92538)

The original dataset included tissue from 172 high-quality human post-mortem brains donated to the Brain Donor Program at the University of California, Irvine with the consent of the next of kin. Frozen coronal slabs were macro-dissected to obtain dorsolateral prefrontal cortex samples. Clinical information was obtained from medical examiners, coroners’ medical records, and a family member. Patients were diagnosed with either Major Depressive Disorder, Bipolar Disorder, or Schizophrenia by consensus based on criteria from the Diagnostic and Statistical Manual of Mental Disorders (88). Due to the extended nature of this study, this sample collection occurred in waves (“cohorts”) over a period of many years. This research was overseen and approved by the University of Michigan Institutional Review Board (IRB # HUM00043530, Pritzker Neuropsychiatric Disorders Research Consortium (2001-0826)) and the University of California Irvine (UCI) Institutional Review Board (IRB# 1997-74).

As described previously (28,62), total RNA from these samples was then distributed to laboratories at three different institutions (University of Michigan (UM), University of California-Davis (UCD), University of California-Irvine (UCI)) for hybridization to either Affymetrix HT-U133A or HT-U133Plus-v2 chips (1-5 replicates per sample, n=367). Before conducting the current analysis, the subset of probes found on both the Affymetrix HT-U133A and HT-U133Plus-v2 chips was extracted, reannotated for probe-to-transcript correspondence (89), summarized using RMA (87), log(2)-transformed, quantile normalized, and gender-checked. Then, 15 batches of highly-correlated samples were identified that were defined by a combination of cohort, chip, and laboratory (Suppl. Figure 1).

| Batch# | Site | Chip | Cohort | Control | BP | MDD | SCHIZ |
|---|---|---|---|---|---|---|---|
| 1 | UCD | U133A | Dep Cohort 1 & 2 | 20 | 9 | 11 | 0 |
| 2 | UCD | U133A | Dep Cohort 3 | 11 | 6 | 5 | 0 |
| 3 | UCD | U133A | Dep Cohort 4 | 16 | 4 | 7 | 0 |
| 4 | UCD | U133Plus2 | Dep Cohort 5 | 13 | 5 | 10 | 0 |
| 5 | UCD | U133A | Schiz Cohort 1 | 9 | 0 | 0 | 9 |
| 6 | UCD | U133Plus2 | Schiz Cohort 1 | 8 | 0 | 0 | 8 |
| 7 | UCD | U133Plus2 | Schiz Cohort 2 | 8 | 0 | 0 | 10 |
| 8 | UCI | U133A | Schiz Cohort 1 | 9 | 0 | 0 | 9 |
| 9 | UM | U133A | Dep Cohort 1 | 16 | 10 | 9 | 0 |
| 10 | UM | U133A | Dep Cohort 2 | 3 | 2 | 5 | 0 |
| 11 | UM | U133A | Dep Cohort 3 & 4 | 27 | 11 | 11 | 0 |
| 12 | UM | U133Plus2 | Dep Cohort 5 | 13 | 5 | 10 | 0 |
| 13 | UM | U133Plus2 | Dep Cohort 6 | 7 | 2 | 9 | 3 |
| 14 | UM | U133A | Schiz Cohort 1 | 9 | 0 | 0 | 9 |
| 15 | UM | U133Plus2 | Schiz Cohort 2 | 9 | 0 | 0 | 10 |

Suppl. Figure 1. The number of microarray chips run in each batch, defined by processing site, Affymetrix chip type, and sample collection cohort. Samples from the four diagnostic categories (Control, Bipolar Disorder, Major Depressive Disorder, Schizophrenia) were unevenly distributed across batches.

Samples that exhibited markedly low average sample-sample correlation coefficients (<0.85: outliers) were removed from the dataset, including data from one batch that exhibited overall low sample-sample correlation coefficients with other batches and duplicate microarrays. The batch effects were then subtracted out using median-centering (detailed procedure: (62)) and the replicate samples were averaged for each subject. Our current analyses began with this sample-level summary data (GSE92538). We further removed data from any subjects lacking information regarding critical pre- or post-mortem variables necessary for our analysis (final sample size: n=157). The code for these analyses can be found at https://github.com/hagenaue/CellTypeAnalyses_PritzkerAffyDLPFC.
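The outlier screen and the median-centering step described above can be sketched as follows. This is a hedged illustration, not the published procedure (which is detailed in (62)): it assumes that "median-centering" means subtracting, gene by gene, the median of each batch, and all function names and toy values are hypothetical.

```python
from statistics import mean, median


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    mean_x, mean_y = mean(x), mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def low_correlation_samples(samples, threshold=0.85):
    """Flag samples whose average correlation with all other samples is below threshold."""
    flagged = []
    for name, profile in samples.items():
        others = [pearson(profile, other)
                  for other_name, other in samples.items() if other_name != name]
        if mean(others) < threshold:
            flagged.append(name)
    return flagged


def median_center_by_batch(samples, batch_of):
    """Assumed form of median-centering: subtract each batch's per-gene median."""
    centered = {}
    for batch in set(batch_of.values()):
        members = [name for name in samples if batch_of[name] == batch]
        n_genes = len(samples[members[0]])
        batch_median = [median(samples[name][g] for name in members)
                        for g in range(n_genes)]
        for name in members:
            centered[name] = [value - m
                              for value, m in zip(samples[name], batch_median)]
    return centered


# Hypothetical toy profiles (illustration only).
toy_profiles = {
    "good1": [1.0, 2.0, 3.0, 4.0],
    "good2": [2.0, 4.0, 6.0, 8.0],
    "good3": [1.5, 2.5, 3.5, 4.5],
    "good4": [0.0, 1.0, 2.0, 3.0],
    "odd": [3.5, 0.5, 1.5, 6.5],
}
flagged = low_correlation_samples(toy_profiles)

toy_batched = {"a1": [1.0, 2.0], "a2": [3.0, 6.0], "a3": [5.0, 4.0],
               "b1": [10.0, 10.0], "b2": [12.0, 14.0]}
toy_batch_of = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B"}
centered = median_center_by_batch(toy_batched, toy_batch_of)
```

Medians are less sensitive than means to a few extreme chips, which is one plausible reason to prefer them when batches are small.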


7.1.4 Allen Brain Atlas Cross-Regional Microarray Dataset

The Allen Brain Atlas microarray data were downloaded from http://human.brain-map.org/microarray/search in December 2015. This microarray survey was performed in brain-specific batches, with multiple batches per subject. To remove technical variation across batches, a variety of normalization procedures had been performed by the original authors both within and across batches using internal controls, as well as across subjects (90). The dataset available for download had already been log(2)-transformed and converted to z-scores using the average and standard deviation for each probe. These normalization procedures were designed to remove technical artifacts while best preserving cross-regional effects in the data, but the full information about relative levels of expression within an individual sample was unavailable and the effects of subject-level variables (such as age and pH) were likely to be de-emphasized due to the inability to fully separate out subject and batch during the normalization process. The 30,000 probes mapped onto 18,787 unique genes (as determined by gene symbol). The code for these analyses can be found at https://github.com/hagenaue/CellTypeAnalyses_AllenBrainAtlas.


7.1.5 Human Cortical Microarray Dataset GSE53987 (described in Lanz et al. (30))

The full publicly-available dataset GSE53987 (30) contained Affymetrix U133Plus2 microarray data from 205 post-mortem human brain samples from three brain regions: the DLPFC (Brodmann Area 46, focusing on gray matter only (Lanz T.A., personal communication)), the hippocampus, and the striatum. These samples were collected by the University of Pittsburgh brain bank. For the purposes of our current analysis, we only downloaded the microarray .CEL files for the dorsolateral prefrontal cortex samples. We summarized these data with RMA (87) using a custom up-to-date chip definition file (.cdf) to define probe-to-transcript correspondence (“hgu133plus2hsentrezgcdf_19.0.0.tar.gz” from http://nmg-r.bioinformatics.nl/NuGO_R.html (89)). This process included background subtraction, log(2)-transformation, and quantile normalization. Gene Symbol annotation for probeset gene ids was provided by the R package org.Hs.eg.db. To control for technical variation, the sample processing batches were estimated using the microarray chip scan dates extracted from the .CEL files (using the function protocolData in the GEOquery package (91)), but all chips for the DLPFC were scanned on the same date. RNA degradation was estimated using the R package AffyRNADegradation (33). During quality control, two samples were removed: GSM1304979 had a range of sample-sample correlations that was unusually low (median=0.978) compared to the range for the dataset as a whole (median: 0.993), and GSM1304953 appeared to be falsely identified as female (signal for XIST<7). The code for these analyses can be found at: https://github.com/hagenaue/CellTypeAnalyses_LanzHumanDLPFC/tree/master.


7.1.6 Human Cortical Microarray Dataset GSE21138 (described in Narayan et al. (32))

The publicly-available dataset GSE21138 (32) contained Affymetrix U133Plus2 microarray data from 59 post-mortem human brain samples from the DLPFC (Brodmann Area 46, gray matter only (Thomas E.A., personal communication)) collected by the Mental Health Research Institute in Victoria, Australia. The procedures for data download and pre-processing were identical to those used above for GSE53987 with a few minor exceptions. In particular, there were six separate scan dates associated with the microarray .CEL files, but one of these scan dates was not included as a co-variate in our analyses because it had an n=1 (“06/14/06”). During quality control, the data for two subjects were removed because they appeared to be falsely-identified as male (XIST>7, GSM528839 & GSM528840) or falsely-identified as female (XIST<7, GSM528880). Data for two more subjects were removed as outliers due to having an unusually low range of sample-sample correlations (GSM528866, GSM528873) as compared to the dataset as a whole. The code for these analyses can be found at: https://github.com/hagenaue/CellTypeAnalyses_NarayanHumanDLPFC.
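The XIST-based check used during quality control for GSE53987 and GSE21138 can be sketched as below. This is a minimal illustration assuming that the XIST signal is on the log(2) scale produced by RMA and that a single cutoff of 7 separates the sexes, as the thresholds quoted above suggest; the function name, sample names, and values are hypothetical.

```python
def flag_sex_mismatches(xist_signal, reported_sex, threshold=7.0):
    """Return samples whose XIST signal disagrees with the reported sex.

    XIST above the threshold is treated as female ("F"), otherwise male ("M").
    """
    mismatched = []
    for sample, signal in xist_signal.items():
        predicted = "F" if signal > threshold else "M"
        if predicted != reported_sex[sample]:
            mismatched.append(sample)
    return sorted(mismatched)


# Hypothetical samples (illustration only; not GEO accessions).
toy_xist = {"S1": 9.2, "S2": 5.1, "S3": 8.7, "S4": 4.9}
toy_reported = {"S1": "F", "S2": "M", "S3": "M", "S4": "F"}
mismatches = flag_sex_mismatches(toy_xist, toy_reported)
```

Here S3 would be flagged as falsely identified as male (XIST>7) and S4 as falsely identified as female (XIST<7).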


7.1.7 Human Cortical Microarray Dataset GSE21935 (described in Barnes et al. (31))

The publicly-available dataset GSE21935 (31) contained Affymetrix U133Plus2 microarray data from 42 post-mortem human brain samples from the temporal cortex (Brodmann Area 22) collected at the Charing Cross campus of the Imperial College of London. The procedures for data download and pre-processing were identical to those used above for GSE53987 with a few minor exceptions. In particular, there were two separate scan dates associated with the microarray .CEL files, but they were closely spaced (6/25/04 vs. 6/29/04) and we did not find any strong association between scan date and the top principal components of variation in the data, so we opted to not include scan date as a co-variate in our statistical models. Quality control did not identify any problematic samples. The code for these analyses can be found at: https://github.com/hagenaue/CellTypeAnalyses_BarnesHumanCortex/tree/master.


7.1.8 CommonMind Consortium Human Cortical RNA-Seq Dataset (described in Fromer et al. (34))

The CommonMind Consortium (CMC) RNA-seq dataset profiled prefrontal cortex samples from 603 individuals (34) collected at three brain banks: Mount Sinai School of Medicine, University of Pittsburgh, and University of Pennsylvania. This dataset was downloaded as GRCh37-aligned bam files from the CommonMind Consortium Knowledge Portal (https://www.synapse.org/CMC). Tophat-aligned bam files were converted back to fastq format and mapped to GRCh38 using HISAT2 (92) with default settings. Reads mapping uniquely to exons were then counted using subread featureCounts with Ensembl transcript models. Cell type indices were calculated using logCPM values, and analysis of differential gene expression was performed using the limma/voom method (93) with observed precision weights in a weighted least squares linear regression. Prior to upload, poor quality samples from the original dataset (34) had already been removed (<50 million reads, RIN<5.5) and replaced with higher quality samples. We further excluded data from 10 replicates and 89 individuals with incomplete demographic data (missing pH; final sample size: n=514). The dataset was further filtered using an expression threshold (CPM>1 in at least 50 individuals) which reduced the dataset to data from around 17,000 genes. The code for these analyses can be found at https://github.com/aschulmann/CMC_celltype_index.
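The expression threshold described above (CPM>1 in at least 50 individuals) can be sketched as follows. This is an illustrative Python version, not the authors' R pipeline: the function names are hypothetical, CPM is computed here from each sample's total count, and the toy data use a lowered `min_samples` so that three samples suffice.

```python
def counts_per_million(counts):
    """Convert {gene: [count per sample]} to CPM using each sample's library size."""
    n_samples = len(next(iter(counts.values())))
    library_sizes = [sum(values[i] for values in counts.values())
                     for i in range(n_samples)]
    return {gene: [1e6 * count / size
                   for count, size in zip(values, library_sizes)]
            for gene, values in counts.items()}


def filter_expressed(cpm, min_cpm=1.0, min_samples=50):
    """Keep genes with CPM above min_cpm in at least min_samples samples."""
    return {gene: values for gene, values in cpm.items()
            if sum(value > min_cpm for value in values) >= min_samples}


# Hypothetical counts for three samples, each with a library size of 1,000,000.
toy_counts = {
    "GeneA": [999990, 999995, 999998],
    "GeneB": [10, 0, 2],   # CPM > 1 in two samples
    "GeneC": [0, 5, 0],    # CPM > 1 in only one sample
}
toy_cpm = counts_per_million(toy_counts)
kept = filter_expressed(toy_cpm, min_cpm=1.0, min_samples=2)
```

Requiring the threshold in a minimum number of individuals, rather than on average, retains genes that are expressed in only a subset of samples.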


7.2 Additional Validation for the BrainInABlender Method

7.2.1 BrainInABlender Can Predict Relative Cell Content in Datasets from Purified Cells and In Silico Mixtures

As initial validation, we used the BrainInABlender method to predict the relative balance of cell types in samples with known cell content (purified cells and artificial cell mixtures). To do this analysis, we used two RNA-Seq datasets: one derived from purified cortical cell types in mice (GSE52564) (18), and one derived from human single-cell RNA-Seq (GSE67835) (2). In general, we found that the statistical cell type indices easily predicted the cell type identities of the original samples (Suppl. Figure 2). The correlation between each of the cell type indices and their respective cell type was very strong within the mouse purified, pooled cell type dataset (R² between 0.41-0.90) and moderate in the noisier human single-cell dataset (R² between 0.14-0.67), but typically still much higher than the correlation with other cell types. This was true regardless of the publication from which the cell type specific genes were derived: cell type specific gene lists derived from publications using different species (human vs. mouse), platforms (microarray vs. RNA-Seq), methodologies (fluorescent cell sorting vs. suspension), or statistical stringency all performed fairly equivalently, with some minor exceptions. Notably, the cell type indices derived from the cell type specific gene lists in Doyle et al. ((15), originally identified using TRAP methodology) tended to perform poorly. In both validation datasets, the cell type index Oligodendrocyte_All_Doyle_Cell_2008 did not properly predict the cell identity of the samples, and Neuron_Neuron_CCK_Doyle_Cell_2008 and Neuron_Interneuron_CORT_Doyle_Cell_2008 were elevated in non-neuronal cell types. Later, we found that other cell type specific gene lists from the Doyle et al. study (15) included a high percentage of genes that appeared non-specific to their respective cell type (Suppl. Figure 3; Neuron_CorticoSpinal_Doyle_Cell_2008, Neuron_CorticoStriatal_Doyle_Cell_2008), leading us to remove all cell type specific gene lists derived from this publication from later versions of BrainInABlender, although we found that it did not dramatically alter our results. In general, the cell type indices associated with immature oligodendrocytes were also somewhat inconsistent. For example, neither of the immature oligodendrocyte cell type indices derived from gene lists in Zhang et al. (18) could predict OPC sample cell identity in the human single cell dataset (2) (R²<0.02), perhaps due to differences in developmental stage and culture conditions. As would be expected, the cell type indices derived from cell type specific genes identified by the same publication that produced the test datasets (2,18) were (by definition) superb predictors of the sample cell identity in their own dataset, and were thus excluded from later validation analyses.

| Cell type index | Zhang_Astrocyte | Zhang_Endothelial | Zhang_Microglia | Zhang_Neuron | Zhang_Oligodendrocyte | Zhang_NFO | Zhang_OPC | Darmanis_Astrocyte | Darmanis_Endothelial | Darmanis_Microglia | Darmanis_Neuron | Darmanis_Oligodendrocyte | Darmanis_OPC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Astrocyte_All_Cahoy_JNeuro_2008 | 0.84 | 0.09 | 0.04 | 0.00 | 0.05 | 0.09 | 0.03 | 0.67 | 0.01 | 0.04 | 0.08 | 0.08 | 0.01 |
| Astrocyte_All_Darmanis_PNAS_2015 | 0.67 | 0.12 | 0.04 | 0.02 | 0.12 | 0.10 | 0.07 | 0.87 | 0.02 | 0.02 | 0.17 | 0.05 | 0.02 |
| Astrocyte_All_Doyle_Cell_2008 | 0.77 | 0.10 | 0.02 | 0.01 | 0.10 | 0.07 | 0.07 | 0.48 | 0.04 | 0.01 | 0.09 | 0.02 | 0.01 |
| Astrocyte_All_Zeisel_Science_2015 | 0.74 | 0.09 | 0.09 | 0.01 | 0.12 | 0.01 | 0.09 | 0.25 | 0.04 | 0.11 | 0.00 | 0.04 | 0.02 |
| Astrocyte_All_Zhang_JNeuro_2014 | 0.90 | 0.08 | 0.06 | 0.00 | 0.05 | 0.07 | 0.00 | 0.28 | 0.01 | 0.05 | 0.00 | 0.10 | 0.02 |
| Endothelial_All_Daneman_PLOS_2010 | 0.02 | 0.99 | 0.02 | 0.02 | 0.05 | 0.04 | 0.02 | 0.01 | 0.27 | 0.01 | 0.00 | 0.02 | 0.01 |
| Endothelial_All_Darmanis_PNAS_2015 | 0.01 | 0.90 | 0.00 | 0.09 | 0.08 | 0.09 | 0.00 | 0.01 | 0.84 | 0.00 | 0.05 | 0.01 | 0.00 |
| Endothelial_All_Zeisel_Science_2015 | 0.00 | 0.90 | 0.02 | 0.01 | 0.19 | 0.04 | 0.00 | 0.00 | 0.14 | 0.03 | 0.01 | 0.02 | 0.03 |
| Endothelial_All_Zhang_JNeuro_2014 | 0.02 | 0.99 | 0.03 | 0.02 | 0.05 | 0.04 | 0.02 | 0.03 | 0.25 | 0.02 | 0.00 | 0.00 | 0.01 |
| Microglia_All_Darmanis_PNAS_2015 | 0.08 | 0.03 | 0.88 | 0.05 | 0.02 | 0.06 | 0.02 | 0.02 | 0.01 | 0.70 | 0.06 | 0.01 | 0.04 |
| Microglia_All_Zeisel_Science_2015 | 0.03 | 0.01 | 0.87 | 0.05 | 0.13 | 0.05 | 0.02 | 0.02 | 0.00 | 0.30 | 0.00 | 0.04 | 0.01 |
| Microglia_All_Zhang_JNeuro_2014 | 0.05 | 0.02 | 0.95 | 0.05 | 0.04 | 0.05 | 0.00 | 0.01 | 0.00 | 0.48 | 0.04 | 0.02 | 0.01 |
| Neuron_All_Cahoy_JNeuro_2008 | 0.06 | 0.06 | 0.05 | 0.87 | 0.03 | 0.05 | 0.03 | 0.06 | 0.07 | 0.09 | 0.64 | 0.10 | 0.04 |
| Neuron_All_Darmanis_PNAS_2015 | 0.03 | 0.14 | 0.06 | 0.74 | 0.03 | 0.04 | 0.10 | 0.18 | 0.05 | 0.08 | 0.79 | 0.08 | 0.03 |
| Neuron_All_Zhang_JNeuro_2014 | 0.02 | 0.04 | 0.03 | 0.99 | 0.03 | 0.04 | 0.01 | 0.08 | 0.03 | 0.06 | 0.37 | 0.02 | 0.04 |
| Neuron_CorticoSpinal_Doyle_Cell_2008 | 0.01 | 0.01 | 0.23 | 0.61 | 0.20 | 0.01 | 0.07 | 0.02 | 0.01 | 0.06 | 0.17 | 0.03 | 0.01 |
| Neuron_CorticoStriatal_Doyle_Cell_2008 | 0.00 | 0.01 | 0.11 | 0.40 | 0.14 | 0.29 | 0.00 | 0.01 | 0.00 | 0.00 | 0.03 | 0.01 | 0.00 |
| Neuron_CorticoThalamic_Doyle_Cell_2008 | 0.01 | 0.12 | 0.02 | 0.62 | 0.25 | 0.11 | 0.03 | 0.03 | 0.01 | 0.00 | 0.20 | 0.10 | 0.00 |
| Neuron_Glutamate_Sugino_NatNeuro_2006 | 0.06 | 0.03 | 0.07 | 0.38 | 0.07 | 0.03 | 0.26 | 0.04 | 0.01 | 0.07 | 0.25 | 0.02 | 0.02 |
| Neuron_Pyramidal_Cortical_Zeisel_Science_2015 | 0.02 | 0.06 | 0.04 | 0.70 | 0.14 | 0.03 | 0.12 | 0.04 | 0.04 | 0.06 | 0.39 | 0.06 | 0.03 |
| Neuron_GABA_Sugino_NatNeuro_2006 | 0.01 | 0.01 | 0.27 | 0.49 | 0.23 | 0.01 | 0.13 | 0.04 | 0.05 | 0.09 | 0.40 | 0.04 | 0.02 |
| Neuron_Interneuron_CORT_Doyle_Cell_2008 | 0.02 | 0.01 | 0.12 | 0.49 | 0.40 | 0.01 | 0.09 | 0.02 | 0.02 | 0.05 | 0.17 | 0.01 | 0.02 |
| Neuron_Interneuron_Zeisel_Science_2015 | 0.00 | 0.10 | 0.11 | 0.68 | 0.15 | 0.00 | 0.11 | 0.06 | 0.06 | 0.09 | 0.49 | 0.04 | 0.04 |
| Neuron_Neuron_CCK_Doyle_Cell_2008 | 0.02 | 0.54 | 0.05 | 0.00 | 0.12 | 0.19 | 0.11 | 0.04 | 0.01 | 0.01 | 0.09 | 0.00 | 0.01 |
| Neuron_Neuron_PNOC_Doyle_Cell_2008 | 0.05 | 0.03 | 0.01 | 0.97 | 0.04 | 0.07 | 0.00 | 0.07 | 0.04 | 0.05 | 0.41 | 0.03 | 0.05 |
| Oligodendrocyte_All_Cahoy_JNeuro_2008 | 0.17 | 0.08 | 0.03 | 0.12 | 0.41 | 0.36 | 0.00 | 0.04 | 0.00 | 0.01 | 0.11 | 0.60 | 0.00 |
| Oligodendrocyte_All_Doyle_Cell_2008 | 0.02 | 0.00 | 0.22 | 0.09 | 0.15 | 0.05 | 0.61 | 0.00 | 0.01 | 0.05 | 0.06 | 0.01 | 0.01 |
| Oligodendrocyte_All_Zeisel_Science_2015 | 0.05 | 0.07 | 0.18 | 0.10 | 0.41 | 0.34 | 0.00 | 0.01 | 0.03 | 0.10 | 0.00 | 0.21 | 0.03 |
| Oligodendrocyte_Mature_Darmanis_PNAS_2015 | 0.02 | 0.17 | 0.09 | 0.12 | 0.48 | 0.26 | 0.00 | 0.04 | 0.00 | 0.00 | 0.13 | 0.77 | 0.00 |
| Oligodendrocyte_Mature_Doyle_Cell_2008 | 0.13 | 0.05 | 0.16 | 0.07 | 0.52 | 0.23 | 0.00 | 0.04 | 0.02 | 0.02 | 0.05 | 0.29 | 0.00 |
| Oligodendrocyte_Myelinating_Zhang_JNeuro_2014 | 0.13 | 0.04 | 0.06 | 0.06 | 0.67 | 0.17 | 0.03 | 0.02 | 0.00 | 0.01 | 0.07 | 0.48 | 0.01 |
| Oligodendrocyte_Newly-Formed_Zhang_JNeuro_2014 | 0.05 | 0.08 | 0.28 | 0.00 | 0.02 | 0.70 | 0.02 | 0.05 | 0.02 | 0.08 | 0.26 | 0.01 | 0.02 |
| Oligodendrocyte_Progenitor Cell_Darmanis_PNAS_2015 | 0.01 | 0.10 | 0.05 | 0.01 | 0.22 | 0.14 | 0.63 | 0.03 | 0.00 | 0.00 | 0.03 | 0.01 | 0.68 |
| Oligodendrocyte_Progenitor Cell_Zhang_JNeuro_2014 | 0.02 | 0.12 | 0.15 | 0.02 | 0.14 | 0.08 | 0.64 | 0.05 | 0.02 | 0.08 | 0.21 | 0.04 | 0.01 |

Suppl. Figure 2. Cell type indices successfully predict sample cell type in purified cell type RNA-Seq data. Cell type indices derived from cell type specific transcripts originating from publications using different species, methodologies, and platforms could successfully predict the sample cell types within two RNA-Seq datasets (mouse purified cells (Zhang et al. (18)) and human single cells (Darmanis et al. (2))). Depicted in the table are the R² values indicating how much of the variance within any particular cell type index (row) is explained by a particular sample cell type (column, NFO: “newly-formed oligodendrocyte”). The cell type indices are named after their origin (primary cell type, subtype, and publication), and the primary cell type category is further identified by color (lavender: astrocytes, orange: endothelial, green: microglia, yellow: mural, purple: neuron_all, blue: neuron_projection, red: neuron_interneuron, pink: oligodendrocyte, white: oligodendrocyte progenitor cell (OPC)).

For further analyses, individual cell type indices were averaged within each of the primary cell type categories to produce ten consolidated primary cell-type indices for each sample. To perform this consolidation, we also removed any transcripts that were identified as “cell type specific” to multiple primary cell type categories (Suppl. Figure 3).

A.

| Reference Publications: Cell Type Specific Gene Lists | # Genes | # Genes (Specific) | % Specific |
|---|---|---|---|
| Astrocyte_All_Cahoy_JNeuro_2008 | 73 | 64 | 88% |
| Astrocyte_All_Darmanis_PNAS_2015 | 21 | 20 | 95% |
| Astrocyte_All_Doyle_Cell_2008 | 25 | 25 | 100% |
| Astrocyte_All_Zeisel_Science_2015 | 240 | 216 | 90% |
| Astrocyte_All_Zhang_JNeuro_2014 | 40 | 32 | 80% |
| Endothelial_All_Daneman_PLOS_2010 | 49 | 43 | 88% |
| Endothelial_All_Darmanis_PNAS_2015 | 21 | 19 | 90% |
| Endothelial_All_Zeisel_Science_2015 | 353 | 319 | 90% |
| Endothelial_All_Zhang_JNeuro_2014 | 40 | 30 | 75% |
| Microglia_All_Darmanis_PNAS_2015 | 21 | 19 | 90% |
| Microglia_All_Zeisel_Science_2015 | 436 | 396 | 91% |
| Microglia_All_Zhang_JNeuro_2014 | 40 | 37 | 93% |
| Mural_All_Zeisel_Science_2015 | 155 | 146 | 94% |
| Mural_Pericyte_Zhang_JNeuro_2014 | 40 | 28 | 70% |
| Mural_Vascular_Daneman_PLOS_2010 | 50 | 33 | 66% |
| Neuron_All_Cahoy_JNeuro_2008 | 80 | 57 | 71% |
| Neuron_All_Darmanis_PNAS_2015 | 21 | 16 | 76% |
| Neuron_All_Zhang_JNeuro_2014 | 40 | 27 | 68% |
| Neuron_CorticoSpinal_Doyle_Cell_2008 | 25 | 16 | 64% |
| Neuron_CorticoStriatal_Doyle_Cell_2008 | 25 | 9 | 36% |
| Neuron_CorticoThalamic_Doyle_Cell_2008 | 25 | 15 | 60% |
| Neuron_Glutamate_Sugino_NatNeuro_2006 | 67 | 59 | 88% |
| Neuron_Pyramidal_Cortical_Zeisel_Science_2015 | 294 | 258 | 88% |
| Neuron_GABA_Sugino_NatNeuro_2006 | 32 | 28 | 88% |
| Neuron_Interneuron_CORT_Doyle_Cell_2008 | 25 | 20 | 80% |
| Neuron_Interneuron_Zeisel_Science_2015 | 365 | 328 | 90% |
| Neuron_Neuron_CCK_Doyle_Cell_2008 | 25 | 18 | 72% |
| Neuron_Neuron_PNOC_Doyle_Cell_2008 | 24 | 17 | 71% |
| Oligodendrocyte_All_Cahoy_JNeuro_2008 | 50 | 48 | 96% |
| Oligodendrocyte_All_Doyle_Cell_2008 | 25 | 20 | 80% |
| Oligodendrocyte_All_Zeisel_Science_2015 | 453 | 421 | 93% |
| Oligodendrocyte_Mature_Darmanis_PNAS_2015 | 21 | 20 | 95% |
| Oligodendrocyte_Mature_Doyle_Cell_2008 | 25 | 19 | 76% |
| Oligodendrocyte_Myelinating_Zhang_JNeuro_2014 | 40 | 40 | 100% |
| Oligodendrocyte_Newly-Formed_Zhang_JNeuro_2014 | 39 | 25 | 64% |
| Oligodendrocyte_Progenitor Cell_Darmanis_PNAS_2015 | 21 | 12 | 57% |
| Oligodendrocyte_Progenitor Cell_Zhang_JNeuro_2014 | 40 | 26 | 65% |
| RBC_All_GeneCardSearch_Hemoglobin_ErythrocyteSpecific | 17 | 10 | 59% |

B.

Columns, left to right (same order as the rows): Astrocyte_All_Cahoy_JNeuro_2008, Astrocyte_All_Darmanis_PNAS_2015, Astrocyte_All_Doyle_Cell_2008, Astrocyte_All_Zeisel_Science_2015, Astrocyte_All_Zhang_JNeuro_2014, Endothelial_All_Daneman_PLOS_2010, Endothelial_All_Darmanis_PNAS_2015, Endothelial_All_Zeisel_Science_2015, Endothelial_All_Zhang_JNeuro_2014, Microglia_All_Darmanis_PNAS_2015, Microglia_All_Zeisel_Science_2015, Microglia_All_Zhang_JNeuro_2014, Mural_All_Zeisel_Science_2015, Mural_Pericyte_Zhang_JNeuro_2014, Mural_Vascular_Daneman_PLOS_2010, Neuron_All_Cahoy_JNeuro_2008, Neuron_All_Darmanis_PNAS_2015, Neuron_All_Zhang_JNeuro_2014, Neuron_CorticoSpinal_Doyle_Cell_2008, Neuron_CorticoStriatal_Doyle_Cell_2008, Neuron_CorticoThalamic_Doyle_Cell_2008, Neuron_Glutamate_Sugino_NatNeuro_2006, Neuron_Pyramidal_Cortical_Zeisel_Science_2015, Neuron_GABA_Sugino_NatNeuro_2006, Neuron_Interneuron_CORT_Doyle_Cell_2008, Neuron_Interneuron_Zeisel_Science_2015, Neuron_Neuron_CCK_Doyle_Cell_2008, Neuron_Neuron_PNOC_Doyle_Cell_2008, Oligodendrocyte_All_Cahoy_JNeuro_2008, Oligodendrocyte_All_Doyle_Cell_2008, Oligodendrocyte_All_Zeisel_Science_2015, Oligodendrocyte_Mature_Darmanis_PNAS_2015, Oligodendrocyte_Mature_Doyle_Cell_2008, Oligodendrocyte_Myelinating_Zhang_JNeuro_2014, Oligodendrocyte_Newly-Formed_Zhang_JNeuro_2014, Oligodendrocyte_Progenitor Cell_Darmanis_PNAS_2015, Oligodendrocyte_Progenitor Cell_Zhang_JNeuro_2014, RBC_All_GeneCardSearch_Hemoglobin_ErythrocyteSpecific

Astrocyte_All_Cahoy_JNeuro_2008: 100% 10% 8% 21% 23% 0% 0% 3% 0% 0% 3% 1% 0% 1% 4% 0% 0% 0% 0% 0% 0% 0% 3% 0% 0% 3% 0% 0% 0% 0% 0% 0% 0% 0% 0% 1% 0% 0%
Astrocyte_All_Darmanis_PNAS_2015: 33% 100% 0% 14% 5% 0% 0% 0% 0% 0% 0% 0% 0% 5% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0%
Astrocyte_All_Doyle_Cell_2008: 24% 0% 100% 12% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0%
Astrocyte_All_Zeisel_Science_2015: 6% 1% 1% 100% 3% 0% 0% 1% 0% 0% 3% 0% 0% 2% 0% 0% 0% 0% 0% 0% 0% 0% 1% 0% 0% 1% 0% 0% 0% 0% 3% 0% 1% 0% 0% 0% 0% 0%
Astrocyte_All_Zhang_JNeuro_2014: 43% 3% 0% 18% 100% 0% 0% 0% 0% 0% 0% 0% 0% 3% 10% 0% 0% 0% 0% 0% 0% 0% 5% 0% 0% 3% 0% 0% 0% 0% 0% 0% 0% 0% 0% 3% 0% 0%
Endothelial_All_Daneman_PLOS_2010: 0% 0% 0% 0% 0% 100% 2% 59% 10% 0% 2% 0% 2% 0% 0% 0% 0% 0% 6% 2% 0% 0% 0% 0% 2% 0% 2% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0%
Endothelial_All_Darmanis_PNAS_2015: 0% 0% 0% 0% 0% 5% 100% 5% 5% 0% 0% 0% 5% 0% 5% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0%
Endothelial_All_Zeisel_Science_2015: 1% 0% 0% 1% 0% 8% 0% 100% 8% 0% 2% 0% 0% 0% 1% 0% 0% 0% 3% 0% 0% 1% 1% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 1% 0%
Endothelial_All_Zhang_JNeuro_2014: 0% 0% 0% 0% 0% 13% 3% 70% 100% 0% 0% 3% 0% 0% 0% 3% 0% 0% 18% 0% 0% 0% 0% 0% 3% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0%
Microglia_All_Darmanis_PNAS_2015: 0% 0% 0% 0% 0% 0% 0% 0% 0% 100% 62% 19% 0% 5% 0% 0% 0% 0% 0% 5% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0%
Microglia_All_Zeisel_Science_2015: 1% 0% 0% 2% 0% 0% 0% 2% 0% 3% 100% 7% 0% 1% 1% 0% 0% 0% 0% 3% 0% 0% 1% 0% 0% 0% 0% 0% 0% 0% 2% 0% 0% 0% 0% 0% 0% 0%
Microglia_All_Zhang_JNeuro_2014: 3% 0% 0% 0% 0% 0% 0% 3% 3% 10% 68% 100% 0% 3% 0% 0% 0% 0% 0% 3% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0%
Mural_All_Zeisel_Science_2015: 0% 0% 0% 1% 0% 1% 1% 1% 0% 0% 1% 0% 100% 3% 2% 0% 0% 0% 0% 1% 0% 0% 0% 0% 1% 1% 0% 1% 0% 0% 1% 0% 1% 0% 0% 1% 1% 0%
Mural_Pericyte_Zhang_JNeuro_2014: 3% 3% 0% 15% 3% 0% 0% 3% 0% 3% 3% 3% 10% 100% 10% 0% 0% 0% 0% 3% 0% 0% 10% 0% 0% 0% 0% 0% 0% 0% 3% 0% 0% 0% 0% 0% 13% 0%
Mural_Vascular_Daneman_PLOS_2010: 6% 0% 0% 2% 8% 0% 2% 2% 0% 0% 6% 0% 6% 8% 100% 2% 0% 0% 0% 4% 0% 0% 6% 0% 0% 0% 0% 0% 0% 2% 2% 0% 2% 0% 0% 2% 2% 0%
Neuron_All_Cahoy_JNeuro_2008: 0% 0% 0% 0% 0% 0% 0% 1% 1% 0% 1% 0% 0% 0% 1% 100% 13% 9% 1% 0% 3% 3% 9% 0% 0% 6% 1% 5% 0% 0% 3% 0% 0% 0% 1% 0% 0% 0%
Neuron_All_Darmanis_PNAS_2015: 0% 0% 0% 5% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 43% 100% 10% 0% 0% 0% 0% 0% 0% 0% 19% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0%
Neuron_All_Zhang_JNeuro_2014: 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 18% 5% 100% 0% 0% 0% 0% 8% 3% 3% 25% 0% 5% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0%
Neuron_CorticoSpinal_Doyle_Cell_2008: 0% 0% 0% 0% 0% 12% 0% 28% 20% 0% 0% 0% 0% 0% 0% 8% 0% 0% 100% 0% 4% 0% 16% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0%
Neuron_CorticoStriatal_Doyle_Cell_2008: 0% 0% 0% 0% 0% 4% 0% 4% 0% 4% 52% 4% 4% 8% 12% 0% 0% 0% 0% 100% 0% 4% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0%
Neuron_CorticoThalamic_Doyle_Cell_2008: 0% 0% 0% 0% 0% 0% 0% 8% 0% 0% 0% 0% 0% 0% 0% 16% 0% 0% 4% 0% 100% 0% 16% 0% 0% 0% 20% 0% 0% 0% 4% 0% 0% 0% 0% 4% 0% 0%
Neuron_Glutamate_Sugino_NatNeuro_2006: 0% 0% 0% 0% 0% 0% 0% 3% 0% 0% 1% 0% 0% 0% 0% 3% 0% 0% 0% 1% 0% 100% 4% 0% 0% 0% 1% 0% 0% 0% 3% 0% 0% 0% 3% 0% 0% 0%
Neuron_Pyramidal_Cortical_Zeisel_Science_2015: 1% 0% 0% 1% 1% 0% 0% 1% 0% 0% 1% 0% 0% 0% 1% 2% 0% 1% 1% 0% 1% 1% 100% 0% 0% 2% 0% 0% 0% 0% 0% 0% 0% 0% 0% 1% 1% 1%
Neuron_GABA_Sugino_NatNeuro_2006: 0% 0% 0% 3% 0% 0% 0% 3% 0% 0% 0% 0% 0% 0% 0% 0% 0% 3% 0% 0% 0% 0% 3% 100% 0% 38% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0%
Neuron_Interneuron_CORT_Doyle_Cell_2008: 0% 0% 0% 0% 0% 4% 0% 0% 4% 0% 0% 0% 4% 0% 0% 0% 0% 4% 0% 0% 0% 0% 4% 0% 100% 12% 12% 0% 0% 0% 4% 0% 0% 0% 0% 0% 0% 0%
Neuron_Interneuron_Zeisel_Science_2015: 1% 0% 0% 1% 0% 0% 0% 0% 0% 0% 1% 0% 0% 0% 0% 1% 1% 3% 0% 0% 0% 0% 2% 4% 1% 100% 0% 3% 0% 0% 1% 0% 0% 0% 0% 0% 1% 0%
Neuron_Neuron_CCK_Doyle_Cell_2008: 0% 0% 0% 0% 0% 4% 0% 4% 0% 0% 4% 0% 0% 0% 0% 4% 0% 0% 0% 0% 8% 4% 0% 0% 12% 0% 100% 0% 0% 0% 4% 0% 0% 0% 0% 0% 0% 0%
Neuron_Neuron_PNOC_Doyle_Cell_2008: 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 4% 0% 0% 17% 0% 8% 0% 0% 0% 0% 4% 0% 0% 46% 0% 100% 0% 0% 0% 0% 0% 0% 0% 4% 0% 0%
Oligodendrocyte_All_Cahoy_JNeuro_2008: 0% 0% 0% 2% 0% 0% 0% 0% 0% 0% 2% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 100% 4% 48% 12% 6% 20% 2% 0% 0% 0%
Oligodendrocyte_All_Doyle_Cell_2008: 0% 0% 0% 4% 0% 0% 0% 0% 0% 0% 4% 0% 0% 0% 4% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 8% 100% 4% 0% 0% 0% 4% 0% 8% 0%
Oligodendrocyte_All_Zeisel_Science_2015: 0% 0% 0% 2% 0% 0% 0% 0% 0% 0% 2% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 5% 0% 100% 2% 2% 4% 2% 0% 0% 0%
Oligodendrocyte_Mature_Darmanis_PNAS_2015: 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 5% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 29% 0% 48% 100% 5% 24% 0% 0% 0% 0%
Oligodendrocyte_Mature_Doyle_Cell_2008: 0% 0% 0% 8% 0% 0% 0% 4% 0% 0% 8% 0% 4% 0% 4% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 12% 0% 40% 4% 100% 8% 0% 0% 4% 0%
Oligodendrocyte_Myelinating_Zhang_JNeuro_2014: 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 25% 0% 48% 13% 5% 100% 0% 0% 0% 0%
Oligodendrocyte_Newly-Formed_Zhang_JNeuro_2014: 0% 0% 0% 3% 0% 0% 0% 3% 0% 0% 3% 0% 0% 0% 0% 3% 0% 0% 0% 0% 0% 5% 3% 0% 0% 0% 0% 0% 3% 3% 23% 0% 0% 0% 100% 3% 0% 0%
Oligodendrocyte_Progenitor Cell_Darmanis_PNAS_2015: 5% 0% 0% 5% 5% 0% 0% 5% 0% 0% 0% 0% 10% 0% 5% 0% 0% 0% 0% 0% 5% 0% 10% 0% 0% 5% 0% 5% 0% 0% 5% 0% 0% 0% 5% 100% 19% 0%
Oligodendrocyte_Progenitor Cell_Zhang_JNeuro_2014: 0% 0% 0% 3% 0% 0% 0% 5% 0% 0% 0% 0% 3% 5% 3% 0% 0% 0% 0% 0% 0% 0% 5% 0% 0% 13% 0% 0% 0% 5% 0% 0% 3% 0% 0% 10% 100% 0%
RBC_All_GeneCardSearch_Hemoglobin_ErythrocyteSpecific: 0% 0% 0% 6% 0% 0% 0% 18% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 12% 0% 0% 6% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 100%

Suppl. Figure 3. Identifying non-specific “cell-type specific genes”. Within BrainInABlender, the data from genes that were identified as being specific to more than one category of cell type (e.g., a gene previously identified as being “specifically expressed” in both microglia and endothelial cells) were removed before averaging the individual cell type indices within each of the ten primary categories to create the consolidated cell-type indices used throughout our paper. In both figure panels, the primary categories of cell type are color-coded similar to Suppl. Figure 1. A) The number of genes identified as cell type specific in each publication in our database vs. the percentage that were found to be truly specific to that cell type (i.e., not identified as "specific" to another category of cell type in a different publication). B) The percentage overlap between each list of cell type specific genes (row) and all other lists of cell type specific genes (columns). The matrix is color-coded with a gradient from blue (indicating 0% overlap) to red (indicating 100% overlap). The denominator in the percentage overlap equation was the cell type category specified by the row. Note that many of the genes identified as “non-specific” are shared by similar cell type categories (e.g., Oligodendrocytes vs. Oligodendrocyte_Immature), but two gene lists derived from Doyle et al. (15) appeared to include non-specific genes from highly divergent categories: Neuron_CorticoSpinal_Doyle_Cell_2008 strongly overlaps with genes identified as specific to endothelial cells, and Neuron_CorticoStriatal_Doyle_Cell_2008 strongly overlaps with genes identified as specific to microglia. In later versions of BrainInABlender, we have removed gene lists derived from Doyle et al. (15) for this reason.
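
To make the overlap calculation and the filtering step described in this caption concrete, the following is a minimal Python sketch under stated assumptions. It is not the BrainInABlender code: the function names (percent_overlap, remove_nonspecific), the list names, and the gene symbols G1 to G6 are hypothetical and chosen only for illustration. The denominator of each percentage is the size of the row's gene list, as stated in the caption.

```python
def percent_overlap(gene_lists):
    """Percentage of each row list's genes that also appear in each column list.

    The denominator is the size of the row list, so the matrix is not symmetric.
    """
    overlap = {}
    for row_name, row_genes in gene_lists.items():
        row_set = set(row_genes)
        overlap[row_name] = {
            col_name: 100.0 * len(row_set & set(col_genes)) / len(row_set)
            for col_name, col_genes in gene_lists.items()
        }
    return overlap


def remove_nonspecific(gene_lists, category_of):
    """Drop genes that appear in lists from more than one primary category."""
    categories_per_gene = {}
    for list_name, genes in gene_lists.items():
        for gene in genes:
            categories_per_gene.setdefault(gene, set()).add(category_of[list_name])
    return {
        list_name: [g for g in genes if len(categories_per_gene[g]) == 1]
        for list_name, genes in gene_lists.items()
    }


# Hypothetical toy gene lists (not taken from the database used in the paper).
gene_lists = {
    "Microglia_PubA": ["G1", "G2", "G3", "G4"],
    "Microglia_PubB": ["G1", "G2", "G5"],
    "Endothelial_PubA": ["G4", "G6"],
}
category_of = {
    "Microglia_PubA": "Microglia",
    "Microglia_PubB": "Microglia",
    "Endothelial_PubA": "Endothelial",
}

overlap = percent_overlap(gene_lists)
filtered = remove_nonspecific(gene_lists, category_of)
```

In this toy example, G4 is listed under both a microglial and an endothelial list and is therefore removed from both, whereas G1 and G2 are retained because they are shared only between two lists from the same primary category.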


Next, to estimate the limitations and noise inherent in our technique, we also constructed in silico mixtures of 100 cells with known percentages of each cell type by randomly sampling from each purified cell dataset (with replacement). We found that the consolidated cell type indices strongly predicted the percentage of their respective cell type included in our artificial mixtures in a linear manner across a range of values likely to encompass the true proportion of these cells in our cortical samples. The amount of noise present in these predictions varied by data type, with the predictions generated from single-cell data having substantially more noise than those generated from pooled, purified cells, but most of the data (+/- 1 stdev) still fell within +/- 5% of the prediction (Suppl. Figure 4). Therefore, we concluded that cell type indices are a relatively simple way to predict relative cell type balance across samples.

[Suppl. Figure 4 plots: five scatterplots of cell type index (y-axes: Astrocyte Index, Neuron Index, Oligodendrocyte Index, Microglia Index, Endothelial Index) against the percentage of that cell type in the mixture (x-axes: Mixture (% Astrocyte), 15 to 40; Mixture (% Neuron), 25 to 50; Mixture (% Oligodendrocyte), 15 to 40; Mixture (% Microglia), 5 to 30; Mixture (% Endothelial), 0 to 25).]

Suppl. Figure 4. Cell type indices successfully predict the percentage of cells of a particular type in artificial mixtures of 100 cells created using single-cell RNA-Seq data. Depicted are the cell type indices (y-axis) calculated for mixed cell samples generated in silico using random sampling (with replacement) from a human single cell RNA-Seq dataset (2). Each sample contains 100 cells total, with a designated percentage of the cell type of interest (x-axis), with the percentages designed to roughly span what might be found in cortical tissue samples. The black best fit line is accompanied by the standard error of the regression (gray), and the green and red lines are visual guides to help illustrate a 5% increase in the cell type of interest. The results from a similar analysis using the smaller mouse purified, pooled cell type RNA-Seq dataset (18) showed the same trends but with half as much variability.
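
The construction of the in silico mixtures can be sketched as follows. This is a minimal Python illustration, not the code used for the analysis. Several details are assumptions made here because the text does not specify them: the non-target cells are drawn from the pooled cells of all other types, the mixture profile is a simple average of the sampled cells' expression values, and the names make_mixture, NEURON_MARKER, and ASTRO_MARKER are hypothetical.

```python
import random


def make_mixture(cells_by_type, target_type, percent, n_cells=100, rng=None):
    """Average the expression of n_cells sampled with replacement.

    `percent` of the cells are drawn from `target_type`; the remainder are
    drawn from the pooled cells of all other types (an assumption made here).
    """
    rng = rng or random.Random()
    n_target = round(n_cells * percent / 100)
    others = [cell for cell_type, cells in cells_by_type.items()
              if cell_type != target_type for cell in cells]
    sampled = (rng.choices(cells_by_type[target_type], k=n_target)
               + rng.choices(others, k=n_cells - n_target))
    genes = sampled[0].keys()
    return {gene: sum(cell[gene] for cell in sampled) / n_cells for gene in genes}


# Hypothetical toy data: each cell is a dict of gene -> expression value.
cells_by_type = {
    "neurons": [{"NEURON_MARKER": 10.0, "ASTRO_MARKER": 0.0}] * 3,
    "astrocytes": [{"NEURON_MARKER": 0.0, "ASTRO_MARKER": 10.0}] * 3,
}

rng = random.Random(0)  # fixed seed so the sketch is reproducible
mixtures = {pct: make_mixture(cells_by_type, "neurons", pct, rng=rng)
            for pct in (25, 30, 35, 40)}
```

With real single-cell data, the cells within a type differ from one another, so repeated mixtures at the same percentage would scatter around the trend line; that scatter is the noise discussed in the text.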


7.2.2 Comparison of Our Method vs. PSEA: Predicting Cell Identity in a Single-Cell RNA-Seq Dataset

Although we developed our method independently to address microarray analysis questions that arose within the Pritzker Neuropsychiatric Consortium, we later discovered that it was quite similar to the technique of population-specific expression analysis (PSEA, (12)), with several notable differences. Similar to our method, PSEA is a carefully validated analysis method which aims to estimate cell type-differentiated disease effects from microarray data derived from brain tissue of heterogeneous composition, and approaches this problem by including the averaged, normalized expression of cell type specific markers within a larger linear model that is used to estimate differential expression in microarray data (10–12). Analyses using PSEA similarly indicated that individual variability in neuronal, astrocytic, oligodendrocytic, and microglial cell content was sufficient to account for substantial variability in the vast majority of probesets in microarray data from human brain samples, even within non-diseased samples (12). The differences between our techniques are mostly due to the recent growth of the literature documenting cell type specific expression in brain cell types. PSEA uses a very small set of markers (4-7) to represent each cell type, and screens these markers for tight co-expression within the dataset of interest, since co-expression networks have been previously demonstrated to often represent cell type signatures in the data (94). This is essential for the analysis of microarray data for brain regions that have not been well characterized for cell type specific expression (e.g., the substantia nigra), but risks closely tracking variability in a particular cell function instead of cell content (as described in our results related to aging). Our analysis focused on the well-studied cortex, thus enabling us to utilize hundreds of cell type specific markers derived from a variety of experimental techniques.

Our manner of normalizing data also differs: PSEA normalizes the expression values for each gene by dividing by the average expression of that gene across samples, whereas we use z-score normalization, both at the level of the individual transcript and later at the level of the gene-level summary data. Due to the dependence of PSEA on ratios, genes that have average expression values that are close to zero can end up with normalized values that are extremely high for a handful of samples. For microarray data, this form of normalization should function well because log2 expression values rarely drop below 5. However, within RNA-Seq, counts of zero are common, and therefore we suspected that the ratio form of normalization used by PSEA might not function optimally for this data type.
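
The contrast between the two forms of normalization can be illustrated with a small Python sketch. The two helper functions implement only the operations named in the text (division by the gene's mean across samples, and a z-score); they are not the PSEA or BrainInABlender implementations, and the expression values are invented for illustration.

```python
from statistics import mean, stdev


def ratio_normalize(values):
    """Divide each value by the gene's mean across samples (ratio form)."""
    average = mean(values)
    return [v / average for v in values]


def zscore_normalize(values):
    """Center on the gene's mean and scale by its standard deviation."""
    average, spread = mean(values), stdev(values)
    return [(v - average) / spread for v in values]


# Hypothetical gene detected in only one of ten cells (RNA-Seq-like counts).
sparse_rnaseq = [0, 0, 0, 0, 0, 0, 0, 0, 0, 5]
# Hypothetical microarray-like log2 signal, well above zero in every sample.
microarray_log2 = [7.8, 8.1, 8.0, 7.9, 8.2, 8.0, 7.9, 8.1, 8.0, 8.3]

for name, values in [("sparse RNA-Seq", sparse_rnaseq),
                     ("microarray log2", microarray_log2)]:
    print(name,
          "max ratio:", round(max(ratio_normalize(values)), 2),
          "max z-score:", round(max(zscore_normalize(values)), 2))
```

For the sparse gene, the single expressing cell receives a ratio of 10 (ten times the mean), while its z-score stays below 3. For the microarray-like gene, all ratios stay within a few percent of 1. When such normalized values are averaged across marker genes, a few sparse genes can therefore dominate a ratio-based signal far more than a z-score-based one.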

Therefore, we decided to run a head-to-head comparison of our method and PSEA using the human single-cell RNA-Seq dataset (2). To make the comparison as interpretable as possible, we used the same list of cell type specific genes for both methods (the genes in our database used to construct BrainInABlender’s consolidated cell type indices). In order to avoid circular reasoning, we also excluded any cell type specific genes that had originally been identified by the publication that provided the test dataset (2). Then we used the marker() function from the PSEA package to calculate the “Reference Signal” for the most common primary categories of cell types (astrocytes, endothelial cells, microglia, mature oligodendrocytes, and neurons). For our method, we used a procedure similar to BrainInABlender. We applied a z-score transformation to the data for each gene (mean=0, sd=1), and then either averaged by the primary cell type category (to conduct an analysis paralleling PSEA), or averaged the data from the cell type specific genes identified by each publication, followed by averaging by primary cell type category (to create consolidated cell type indices like BrainInABlender).
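
The two-stage averaging described above (z-score per gene, average within each publication's gene list, then average across publications within a primary category) can be sketched in Python as follows. This is an illustration under assumptions rather than the BrainInABlender code: the function name cell_type_indices, the data layout, and the gene and publication names (N1, N2, A1, FLAT, PubA, PubB) are hypothetical. Genes without variation are skipped, in line with the sd=0 filter described in the preprocessing methods.

```python
from statistics import mean, stdev


def zscore(values):
    """Z-score one gene across samples (mean=0, sd=1)."""
    average, spread = mean(values), stdev(values)
    return [(v - average) / spread for v in values]


def cell_type_indices(expression, gene_lists):
    """expression: {gene: [value per sample]}
    gene_lists: {(category, publication): [genes]}

    Returns the publication-level indices and the consolidated indices.
    """
    # Z-score each gene, skipping genes that lack variation across samples.
    z = {g: zscore(v) for g, v in expression.items() if stdev(v) > 0}
    n_samples = len(next(iter(expression.values())))

    # Stage 1: average the z-scores within each publication's gene list.
    per_publication = {}
    for key, genes in gene_lists.items():
        present = [g for g in genes if g in z]
        if present:
            per_publication[key] = [mean(z[g][i] for g in present)
                                    for i in range(n_samples)]

    # Stage 2: average the publication-level indices within each category.
    consolidated = {}
    for category in sorted({cat for cat, _ in per_publication}):
        members = [idx for (cat, _), idx in per_publication.items()
                   if cat == category]
        consolidated[category] = [mean(m[i] for m in members)
                                  for i in range(n_samples)]
    return per_publication, consolidated


# Hypothetical toy data: four samples, three informative genes, one flat gene.
expression = {
    "N1": [1.0, 2.0, 3.0, 4.0],
    "N2": [2.0, 4.0, 6.0, 8.0],
    "A1": [4.0, 3.0, 2.0, 1.0],
    "FLAT": [5.0, 5.0, 5.0, 5.0],
}
gene_lists = {
    ("Neuron", "PubA"): ["N1"],
    ("Neuron", "PubB"): ["N2", "FLAT"],
    ("Astrocyte", "PubA"): ["A1"],
}

per_publication, consolidated = cell_type_indices(expression, gene_lists)
```

Averaging by publication first gives each publication's gene list equal weight in the consolidated index, regardless of how many genes it contributes; the simplified variant described in the text would instead pool all genes for a category into a single average.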

To compare the efficacy of each method, we ran a linear model to determine the percentage of variation in the population “reference signal” (PSEA) or “cell type index” (our method) accounted for by the cell type identities assigned to each cell in the original publication (2). We found that both the population reference signals (PSEA) and cell type indices (our method) for each cell were strongly related to their previously assigned cell type identity, but in general the relationship was stronger when using our method: on average, 34% of the variation in the PSEA reference signal for each cell type was accounted for by cell identity, whereas an average of either 45% or 49% of the variation in our cell type indices was accounted for by cell identity using either the simplified or consolidated versions of our method, respectively (Suppl. Figure 5 A). An illustration of this improvement can be found in Suppl. Figure 5 B: note the presence of extreme outliers in the PSEA population reference signal. We conclude that the simple use of a different normalization method is sufficient to make our method more effective at predicting cell type balance in some datasets. We also find that averaging the predictions drawn from the cell type specific genes identified by multiple publications into a consolidated index produces some additional improvement.

The method for deriving a statistical cell type signal determines the strength of the relationship with cell identity

A. The percentage of the variation in a statistically-derived cell type signal accounted for by cell identity (Darmanis Data Set)

Method of deriving a statistical cell type signal:

Signal from cell type      PSEA (mean signal    Our Cell Type Indices    Our Cell Type Indices: After
specific genes for:        ratio average)       (z-score average)        first averaging by Publication
Astrocytes                 34%                  52%                      57%
Oligodendrocytes           38%                  45%                      50%
Microglia                  36%                  42%                      51%
Endothelial                30%                  28%                      33%
Neurons                    33%                  57%                      53%

[Suppl. Figure 5, panel B: three boxplot panels of the predicted neuronal signal by Sample Cell Type (x-axis: astrocytes, endothelial, microglia, neurons, oligodendrocytes, OPC). Y-axes: PSEA, Neuron_All Reference Signal; Our Cell Type Index, Neuron_All Index; Our Cell Type Index: First Averaging By Publication, Neuron_All Index. Panel note: The same cell type specific genes were used in all methods.]


Suppl. Figure 5. The method for deriving predicted relative cell content determines the strength of the relationship with sample cell type. Depicted is a comparison of the efficacy of three different manners of predicting the relative cell content of samples in a human single-cell RNA-seq dataset (2): 1) the “population reference signal” generated by PSEA, 2) a simplified version of our method that is meant to be relatively analogous to PSEA (a simple average of the z-score-transformed data for all genes specific to a particular cell type in our database), 3) the version of our method used in this manuscript, which consolidates the predictions derived from the cell type specific genes identified in different publications. A. For each of these methods of predicting relative cell content (columns), the table provides the percentage of variation (R²) that is accounted for by the original cell type identities of the samples provided by the publication (2) for predictions for each of the major cell types (rows). Overall, there is a strong relationship between the predictions generated by all methods and sample cell type identity, but the method used in this manuscript produces predictions that best fit sample cell type. B. As an example, boxplots illustrate the distribution of each of the predictions for neuronal content across samples identified as different cell types in the original publication ((2), x-axis). Note the presence of several extreme outliers (red arrows) in the predictions produced by PSEA; a similar pattern was seen for all other cell types.
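
For a linear model with a single categorical predictor (cell identity), the percentage of variation accounted for is the R² of a one-way model, which equals one minus the ratio of within-group to total sum of squares. The following Python sketch shows that calculation on invented values; the function name r_squared_by_group and the toy signals are hypothetical, and this is not the code used to produce the table.

```python
from statistics import mean


def r_squared_by_group(signal, labels):
    """R² of a linear model with one categorical predictor (group labels).

    Computed as 1 - SS_within / SS_total.
    """
    grand_mean = mean(signal)
    ss_total = sum((x - grand_mean) ** 2 for x in signal)
    groups = {}
    for x, label in zip(signal, labels):
        groups.setdefault(label, []).append(x)
    ss_within = sum((x - mean(values)) ** 2
                    for values in groups.values() for x in values)
    return 1 - ss_within / ss_total


# Hypothetical neuronal signal for three neurons and three astrocytes.
labels = ["neuron", "neuron", "neuron", "astrocyte", "astrocyte", "astrocyte"]
clean_signal = [1.0, 1.2, 0.8, 0.0, 0.1, -0.1]
# The same signal with one extreme value, as a ratio-based signal might produce.
outlier_signal = [1.0, 1.2, 0.8, 0.0, 0.1, 9.0]

r2_clean = r_squared_by_group(clean_signal, labels)
r2_outlier = r_squared_by_group(outlier_signal, labels)
```

In this toy example a single extreme value is enough to pull R² down sharply, which is consistent with the interpretation that the outliers in the PSEA reference signal weaken its relationship with cell identity.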


Using similar methodology, we also calculated the population “reference signal” with PSEA for microarray data from artificially-created mixtures of cultured cells (GSE19380). The results strongly tracked the actual cell content of the mixed samples (Suppl. Figure 6) in a manner that was not strikingly better or worse than the predictions made using BrainInABlender (Figure 2). This again suggests that the ratio-based normalization used in PSEA is specifically problematic for low-count RNA-Seq data, whereas results derived from microarray data are unaffected.


[Suppl. Figure 6 plots: four scatterplots of Cell Type Reference Signal (y-axis) against actual cell content (x-axis, 0.0 to 1.0). Neuron: R-Squared 0.88 (x-axis: Actual % Neurons). Astrocyte: R-Squared 0.79 (x-axis: Actual % Astrocytes). Oligodendrocyte: R-Squared 0.68 (Mature), 0.72 (Immature) (x-axis: Actual % Oligodendrocytes). Microglia: R-Squared 0.62 (x-axis: Actual % Microglia).]

Suppl. Figure 6. Relative cell content predictions made using PSEA and our cell type specific gene lists. Using a microarray dataset derived from samples that contained artificially-generated mixtures of cultured cells (GSE19380; (12)), we found that the relative cell content predictions (“cell type reference signal”) produced by PSEA closely reflected actual known content, similar to the predictions made by BrainInABlender (Figure 2).


7.2.3 Does the Reference Dataset Matter? There is a Strong Convergence of Cell Content Predictions Derived from Cell Type Specific Transcripts Identified by Different Publications

Similar to what we observed during our validation analyses using data from purified cell types, we found that the predicted cell content for the post-mortem human cortical samples (“cell type indices”) was similar regardless of the methodology used to generate the cell type specific gene lists used in the predictions. Within all five of the human cortical transcriptomic datasets, there was a strong positive correlation between cell type indices representing the same cell type, even when the predictions were derived using cell type specific gene lists from different species, cell type purification strategies, and platforms. Clustering within broad cell type categories was clear using visual inspection of the correlation matrices (Suppl. Figure 7), hierarchical clustering, or consensus clustering (ConsensusClusterPlus: (95)) and persisted even after removing data from genes identified as cell type specific in multiple publications (e.g., genes identified as astrocyte-specific in both Cahoy_Astrocyte and Zhang_Astrocyte; Suppl. Figure 8). In some datasets, the cell type indices for support cell subcategories clustered cleanly, and in others they were difficult to fully differentiate (Suppl. Figure 7). Clustering was not able to reliably discern neuronal subcategories (interneurons, projection neurons) in any dataset. Similar to our previous validation analyses, oligodendrocyte progenitor cell indices derived from different publications did not strongly correlate with each other, perhaps due to heterogeneity in the progenitor cell types sampled by the original publications.
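
The correlation matrices referred to above are pairwise Pearson correlations between cell type indices across samples. A minimal Python sketch of that calculation follows; the index names and values are hypothetical and serve only to show the expected pattern (indices for the same cell type from different publications correlate positively). The hierarchical and consensus clustering steps are not reproduced here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread = sqrt(sum((a - mean_x) ** 2 for a in x)
                  * sum((b - mean_y) ** 2 for b in y))
    return covariance / spread


def correlation_matrix(indices):
    """indices: {index name: [value per sample]} -> nested dict of correlations."""
    return {row: {col: pearson(indices[row], indices[col]) for col in indices}
            for row in indices}


# Hypothetical cell type indices for five samples.
indices = {
    "Astrocyte_PubA": [0.1, 0.5, -0.3, 0.8, -0.6],
    "Astrocyte_PubB": [0.2, 0.4, -0.2, 0.9, -0.7],
    "Neuron_PubA": [-0.2, -0.4, 0.3, -0.7, 0.5],
}

matrix = correlation_matrix(indices)
```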

[Suppl. Figure 7, panel A: correlation matrix image, not recoverable as text. Labels printed in the figure: A., D.; Pia & Neurovasculature, Neurons, Astrocytes, Mature Oligodendrocytes.]

B. Correlation matrix of cell type indices; rows and columns are in the same order. Block labels printed in the figure: Astrocytes, Microglia, Endothelial & Mural, Neurons, Mature Oligodendrocytes.

Column order: Astrocyte_All_Cahoy_JNeuro_2008; Astrocyte_All_Darmanis_PNAS_2015; Astrocyte_All_Doyle_Cell_2008; Astrocyte_All_Zeisel_Science_2015; Astrocyte_All_Zhang_JNeuro_2014; Microglia_All_Darmanis_PNAS_2015; Microglia_All_Zeisel_Science_2015; Microglia_All_Zhang_JNeuro_2014; Endothelial_All_Daneman_PLOS_2010; Endothelial_All_Darmanis_PNAS_2015; Endothelial_All_Zeisel_Science_2015; Endothelial_All_Zhang_JNeuro_2014; Mural_All_Zeisel_Science_2015; Mural_Pericyte_Zhang_JNeuro_2014; Mural_Vascular_Daneman_PLOS_2010; Neuron_All_Cahoy_JNeuro_2008; Neuron_All_Darmanis_PNAS_2015; Neuron_All_Zhang_JNeuro_2014; Neuron_CorticoSpinal_Doyle_Cell_2008; Neuron_CorticoStriatal_Doyle_Cell_2008; Neuron_CorticoThalamic_Doyle_Cell_2008; Neuron_Glutamate_Sugino_NatNeuro_2006; Neuron_Pyramidal_Cortical_Zeisel_Science_2015; Neuron_GABA_Sugino_NatNeuro_2006; Neuron_Interneuron_CORT_Doyle_Cell_2008; Neuron_Interneuron_Zeisel_Science_2015; Neuron_Neuron_CCK_Doyle_Cell_2008; Neuron_Neuron_PNOC_Doyle_Cell_2008; Oligodendrocyte_All_Cahoy_JNeuro_2008; Oligodendrocyte_All_Doyle_Cell_2008; Oligodendrocyte_All_Zeisel_Science_2015; Oligodendrocyte_Mature_Darmanis_PNAS_2015; Oligodendrocyte_Mature_Doyle_Cell_2008; Oligodendrocyte_Myelinating_Zhang_JNeuro_2014; Oligodendrocyte_Newly-Formed_Zhang_JNeuro_2014; Oligodendrocyte_Progenitor Cell_Darmanis_PNAS_2015; Oligodendrocyte_Progenitor Cell_Zhang_JNeuro_2014; RBC_All_GeneCardSearch_Hemoglobin_ErythrocyteSpecific

Astrocyte_All_Cahoy_JNeuro_2008 1.0 0.9 1.0 0.9 0.8 -0.1 0.2 0.2 0.0 0.2 0.4 0.1 0.3 0.2 0.1 -0.6 -0.8 -0.1 -0.1 -0.1 0.1 -0.6 -0.6 -0.5 -0.2 -0.5 -0.5 -0.3 0.0 -0.2 0.2 -0.1 0.1 0.2 -0.5 0.5 -0.2 -0.1
Astrocyte_All_Darmanis_PNAS_2015 0.9 1.0 0.9 0.9 0.7 -0.1 0.2 0.1 0.0 0.2 0.4 0.1 0.2 0.1 0.1 -0.7 -0.8 -0.2 -0.1 -0.2 0.0 -0.6 -0.7 -0.5 -0.2 -0.6 -0.6 -0.4 0.1 -0.2 0.3 0.0 0.1 0.2 -0.5 0.4 -0.3 -0.2
Astrocyte_All_Doyle_Cell_2008 1.0 0.9 1.0 0.9 0.7 -0.1 0.2 0.2 0.0 0.2 0.4 0.1 0.3 0.1 0.1 -0.6 -0.8 -0.1 -0.1 -0.1 0.1 -0.6 -0.6 -0.6 -0.3 -0.5 -0.5 -0.3 0.1 -0.1 0.2 0.0 0.1 0.2 -0.5 0.4 -0.1 -0.1
Astrocyte_All_Zeisel_Science_2015 0.9 0.9 0.9 1.0 0.6 -0.2 0.1 0.0 0.1 0.3 0.6 0.2 0.3 0.3 0.2 -0.6 -0.7 0.0 0.1 -0.1 0.3 -0.4 -0.5 -0.5 -0.3 -0.4 -0.6 -0.4 0.1 -0.1 0.3 -0.1 0.2 0.3 -0.4 0.4 -0.1 -0.2
Astrocyte_All_Zhang_JNeuro_2014 0.8 0.7 0.7 0.6 1.0 0.1 0.4 0.4 0.2 0.4 0.5 0.3 0.4 0.3 0.2 -0.6 -0.7 -0.1 0.0 0.1 0.2 -0.5 -0.4 -0.6 -0.2 -0.5 -0.3 -0.1 -0.1 -0.1 -0.1 -0.2 -0.1 0.1 -0.5 0.7 0.0 0.3
Microglia_All_Darmanis_PNAS_2015 -0.1 -0.1 -0.1 -0.2 0.1 1.0 0.8 0.6 0.3 0.2 0.1 0.3 0.0 -0.1 0.0 -0.2 -0.1 -0.4 -0.1 0.1 -0.2 -0.2 -0.2 0.0 0.1 -0.3 0.3 0.3 0.2 0.1 -0.1 0.2 0.0 0.1 -0.4 0.1 -0.2 0.0
Microglia_All_Zeisel_Science_2015 0.2 0.2 0.2 0.1 0.4 0.8 1.0 0.7 0.5 0.5 0.5 0.6 0.4 0.2 0.1 -0.5 -0.5 -0.3 -0.1 0.1 0.0 -0.4 -0.2 -0.4 -0.1 -0.4 0.1 0.1 0.3 0.0 0.0 0.1 0.1 0.2 -0.4 0.4 -0.1 0.2
Microglia_All_Zhang_JNeuro_2014 0.2 0.1 0.2 0.0 0.4 0.6 0.7 1.0 0.2 0.3 0.2 0.3 0.1 0.2 -0.1 -0.3 -0.3 -0.2 -0.2 0.0 -0.2 -0.3 -0.2 -0.4 -0.1 -0.4 0.1 0.0 0.1 0.0 -0.1 0.0 0.0 0.1 -0.3 0.2 -0.1 0.4
Endothelial_All_Daneman_PLOS_2010 0.0 0.0 0.0 0.1 0.2 0.3 0.5 0.2 1.0 0.7 0.7 0.6 0.7 0.4 0.3 -0.2 -0.2 0.0 0.2 0.3 0.5 0.0 0.2 -0.3 -0.2 0.1 0.1 0.3 0.0 0.0 -0.1 -0.2 0.0 0.0 -0.1 0.3 0.3 0.2
Endothelial_All_Darmanis_PNAS_2015 0.2 0.2 0.2 0.3 0.4 0.2 0.5 0.3 0.7 1.0 0.8 0.8 0.7 0.7 0.4 -0.4 -0.4 -0.1 0.1 0.1 0.5 -0.1 0.0 -0.4 -0.4 -0.2 -0.1 0.0 0.1 0.1 0.1 -0.1 0.2 0.2 -0.1 0.4 0.1 0.1
Endothelial_All_Zeisel_Science_2015 0.4 0.4 0.4 0.6 0.5 0.1 0.5 0.2 0.7 0.8 1.0 0.8 0.8 0.6 0.5 -0.6 -0.6 -0.1 0.1 0.1 0.5 -0.3 -0.2 -0.5 -0.4 -0.2 -0.3 -0.2 0.2 0.0 0.2 0.0 0.2 0.3 -0.2 0.5 0.1 0.1
Endothelial_All_Zhang_JNeuro_2014 0.1 0.1 0.1 0.2 0.3 0.3 0.6 0.3 0.6 0.8 0.8 1.0 0.7 0.6 0.3 -0.4 -0.5 -0.1 0.0 -0.1 0.3 -0.3 0.0 -0.5 -0.3 -0.2 -0.1 0.0 0.2 0.1 0.1 0.0 0.2 0.2 -0.1 0.5 0.1 0.2
Mural_All_Zeisel_Science_2015 0.3 0.2 0.3 0.3 0.4 0.0 0.4 0.1 0.7 0.7 0.8 0.7 1.0 0.6 0.5 -0.3 -0.3 0.1 0.0 0.2 0.5 -0.1 0.1 -0.5 -0.4 0.0 -0.1 -0.1 -0.1 0.0 -0.1 -0.3 0.0 0.1 -0.1 0.4 0.2 0.2
Mural_Pericyte_Zhang_JNeuro_2014 0.2 0.1 0.1 0.3 0.3 -0.1 0.2 0.2 0.4 0.7 0.6 0.6 0.6 1.0 0.3 -0.2 -0.3 0.1 0.1 -0.1 0.3 0.0 0.1 -0.3 -0.3 0.0 -0.1 -0.2 0.0 0.1 0.1 -0.1 0.3 0.2 0.0 0.3 0.1 0.2
Mural_Vascular_Daneman_PLOS_2010 0.1 0.1 0.1 0.2 0.2 0.0 0.1 -0.1 0.3 0.4 0.5 0.3 0.5 0.3 1.0 -0.3 -0.2 -0.3 -0.2 0.2 0.1 -0.2 -0.2 -0.1 -0.2 -0.1 -0.2 0.0 0.0 0.0 0.1 0.0 0.0 0.1 0.0 0.1 0.0 -0.1
Neuron_All_Cahoy_JNeuro_2008 -0.6 -0.7 -0.6 -0.6 -0.6 -0.2 -0.5 -0.3 -0.2 -0.4 -0.6 -0.4 -0.3 -0.2 -0.3 1.0 0.9 0.6 0.2 0.2 0.1 0.8 0.7 0.6 0.2 0.8 0.4 0.3 -0.5 0.2 -0.4 -0.3 -0.2 -0.5 0.6 -0.3 0.5 0.1
Neuron_All_Darmanis_PNAS_2015 -0.8 -0.8 -0.8 -0.7 -0.7 -0.1 -0.5 -0.3 -0.2 -0.4 -0.6 -0.5 -0.3 -0.3 -0.2 0.9 1.0 0.3 0.1 0.1 -0.1 0.7 0.6 0.7 0.3 0.7 0.3 0.3 -0.3 0.0 -0.4 -0.2 -0.3 -0.4 0.6 -0.4 0.2 0.1
Neuron_All_Zhang_JNeuro_2014 -0.1 -0.2 -0.1 0.0 -0.1 -0.4 -0.3 -0.2 0.0 -0.1 -0.1 -0.1 0.1 0.1 -0.3 0.6 0.3 1.0 0.3 0.1 0.4 0.5 0.6 -0.1 -0.1 0.6 0.1 0.0 -0.3 0.3 -0.2 -0.4 -0.1 -0.1 0.5 0.1 0.6 0.3
Neuron_CorticoSpinal_Doyle_Cell_2008 -0.1 -0.1 -0.1 0.1 0.0 -0.1 -0.1 -0.2 0.2 0.1 0.1 0.0 0.0 0.1 -0.2 0.2 0.1 0.3 1.0 0.0 0.4 0.3 0.4 0.0 0.0 0.3 0.1 0.0 -0.1 0.2 0.1 -0.1 0.0 0.0 0.2 0.1 0.3 0.1
Neuron_CorticoStriatal_Doyle_Cell_2008 -0.1 -0.2 -0.1 -0.1 0.1 0.1 0.1 0.0 0.3 0.1 0.1 -0.1 0.2 -0.1 0.2 0.2 0.1 0.1 0.0 1.0 0.3 0.4 0.2 0.1 0.0 0.3 0.2 0.4 -0.5 0.0 -0.5 -0.5 -0.4 -0.5 0.1 0.1 0.5 0.0
Neuron_CorticoThalamic_Doyle_Cell_2008 0.1 0.0 0.1 0.3 0.2 -0.2 0.0 -0.2 0.5 0.5 0.5 0.3 0.5 0.3 0.1 0.1 -0.1 0.4 0.4 0.3 1.0 0.3 0.4 -0.2 -0.3 0.3 -0.2 0.1 -0.2 0.2 -0.1 -0.3 -0.1 -0.1 0.1 0.3 0.5 0.1
Neuron_Glutamate_Sugino_NatNeuro_2006 -0.6 -0.6 -0.6 -0.4 -0.5 -0.2 -0.4 -0.3 0.0 -0.1 -0.3 -0.3 -0.1 0.0 -0.2 0.8 0.7 0.5 0.3 0.4 0.3 1.0 0.8 0.4 0.1 0.8 0.4 0.3 -0.4 0.3 -0.2 -0.3 -0.1 -0.3 0.6 -0.3 0.6 0.0
Neuron_Pyramidal_Cortical_Zeisel_Science_2015 -0.6 -0.7 -0.6 -0.5 -0.4 -0.2 -0.2 -0.2 0.2 0.0 -0.2 0.0 0.1 0.1 -0.2 0.7 0.6 0.6 0.4 0.2 0.4 0.8 1.0 0.2 0.1 0.8 0.4 0.3 -0.4 0.2 -0.4 -0.4 -0.2 -0.3 0.6 -0.1 0.5 0.4
Neuron_GABA_Sugino_NatNeuro_2006 -0.5 -0.5 -0.6 -0.5 -0.6 0.0 -0.4 -0.4 -0.3 -0.4 -0.5 -0.5 -0.5 -0.3 -0.1 0.6 0.7 -0.1 0.0 0.1 -0.2 0.4 0.2 1.0 0.2 0.5 0.2 0.3 -0.1 0.0 0.0 0.1 -0.1 -0.2 0.2 -0.5 -0.1 -0.3
Neuron_Interneuron_CORT_Doyle_Cell_2008 -0.2 -0.2 -0.3 -0.3 -0.2 0.1 -0.1 -0.1 -0.2 -0.4 -0.4 -0.3 -0.4 -0.3 -0.2 0.2 0.3 -0.1 0.0 0.0 -0.3 0.1 0.1 0.2 1.0 0.1 0.3 0.1 0.1 -0.1 0.1 0.2 0.0 0.1 0.0 -0.3 -0.2 0.2
Neuron_Interneuron_Zeisel_Science_2015 -0.5 -0.6 -0.5 -0.4 -0.5 -0.3 -0.4 -0.4 0.1 -0.2 -0.2 -0.2 0.0 0.0 -0.1 0.8 0.7 0.6 0.3 0.3 0.3 0.8 0.8 0.5 0.1 1.0 0.4 0.4 -0.5 0.2 -0.2 -0.4 -0.2 -0.4 0.6 -0.2 0.5 0.1
Neuron_Neuron_CCK_Doyle_Cell_2008 -0.5 -0.6 -0.5 -0.6 -0.3 0.3 0.1 0.1 0.1 -0.1 -0.3 -0.1 -0.1 -0.1 -0.2 0.4 0.3 0.1 0.1 0.2 -0.2 0.4 0.4 0.2 0.3 0.4 1.0 0.5 -0.2 0.1 -0.2 0.0 -0.1 -0.3 0.1 -0.3 0.2 0.2
Neuron_Neuron_PNOC_Doyle_Cell_2008 -0.3 -0.4 -0.3 -0.4 -0.1 0.3 0.1 0.0 0.3 0.0 -0.2 0.0 -0.1 -0.2 0.0 0.3 0.3 0.0 0.0 0.4 0.1 0.3 0.3 0.3 0.1 0.4 0.5 1.0 -0.1 0.0 -0.3 0.0 -0.1 -0.3 0.1 0.0 0.3 0.1
Oligodendrocyte_All_Cahoy_JNeuro_2008 0.0 0.1 0.1 0.1 -0.1 0.2 0.3 0.1 0.0 0.1 0.2 0.2 -0.1 0.0 0.0 -0.5 -0.3 -0.3 -0.1 -0.5 -0.2 -0.4 -0.4 -0.1 0.1 -0.5 -0.2 -0.1 1.0 0.1 0.8 0.9 0.7 0.9 -0.2 0.0 -0.4 -0.1
Oligodendrocyte_All_Doyle_Cell_2008 -0.2 -0.2 -0.1 -0.1 -0.1 0.1 0.0 0.0 0.0 0.1 0.0 0.1 0.0 0.1 0.0 0.2 0.0 0.3 0.2 0.0 0.2 0.3 0.2 0.0 -0.1 0.2 0.1 0.0 0.1 1.0 0.1 0.1 0.2 0.1 0.1 -0.1 0.4 -0.1
Oligodendrocyte_All_Zeisel_Science_2015 0.2 0.3 0.2 0.3 -0.1 -0.1 0.0 -0.1 -0.1 0.1 0.2 0.1 -0.1 0.1 0.1 -0.4 -0.4 -0.2 0.1 -0.5 -0.1 -0.2 -0.4 0.0 0.1 -0.2 -0.2 -0.3 0.8 0.1 1.0 0.8 0.7 0.9 -0.1 -0.1 -0.3 -0.3
Oligodendrocyte_Mature_Darmanis_PNAS_2015 -0.1 0.0 0.0 -0.1 -0.2 0.2 0.1 0.0 -0.2 -0.1 0.0 0.0 -0.3 -0.1 0.0 -0.3 -0.2 -0.4 -0.1 -0.5 -0.3 -0.3 -0.4 0.1 0.2 -0.4 0.0 0.0 0.9 0.1 0.8 1.0 0.7 0.8 -0.2 -0.2 -0.4 -0.2
Oligodendrocyte_Mature_Doyle_Cell_2008 0.1 0.1 0.1 0.2 -0.1 0.0 0.1 0.0 0.0 0.2 0.2 0.2 0.0 0.3 0.0 -0.2 -0.3 -0.1 0.0 -0.4 -0.1 -0.1 -0.2 -0.1 0.0 -0.2 -0.1 -0.1 0.7 0.2 0.7 0.7 1.0 0.7 -0.1 0.0 -0.1 -0.1
Oligodendrocyte_Myelinating_Zhang_JNeuro_2014 0.2 0.2 0.2 0.3 0.1 0.1 0.2 0.1 0.0 0.2 0.3 0.2 0.1 0.2 0.1 -0.5 -0.4 -0.1 0.0 -0.5 -0.1 -0.3 -0.3 -0.2 0.1 -0.4 -0.3 -0.3 0.9 0.1 0.9 0.8 0.7 1.0 -0.2 0.1 -0.3 0.0
Oligodendrocyte_Newly-Formed_Zhang_JNeuro_2014 -0.5 -0.5 -0.5 -0.4 -0.5 -0.4 -0.4 -0.3 -0.1 -0.1 -0.2 -0.1 -0.1 0.0 0.0 0.6 0.6 0.5 0.2 0.1 0.1 0.6 0.6 0.2 0.0 0.6 0.1 0.1 -0.2 0.1 -0.1 -0.2 -0.1 -0.2 1.0 -0.2 0.4 0.1
Oligodendrocyte_Progenitor Cell_Darmanis_PNAS_2015 0.5 0.4 0.4 0.4 0.7 0.1 0.4 0.2 0.3 0.4 0.5 0.5 0.4 0.3 0.1 -0.3 -0.4 0.1 0.1 0.1 0.3 -0.3 -0.1 -0.5 -0.3 -0.2 -0.3 0.0 0.0 -0.1 -0.1 -0.2 0.0 0.1 -0.2 1.0 0.2 0.3
Oligodendrocyte_Progenitor Cell_Zhang_JNeuro_2014 -0.2 -0.3 -0.1 -0.1 0.0 -0.2 -0.1 -0.1 0.3 0.1 0.1 0.1 0.2 0.1 0.0 0.5 0.2 0.6 0.3 0.5 0.5 0.6 0.5 -0.1 -0.2 0.5 0.2 0.3 -0.4 0.4 -0.3 -0.4 -0.1 -0.3 0.4 0.2 1.0 0.1
RBC_All_GeneCardSearch_Hemoglobin_ErythrocyteSpecific -0.1 -0.2 -0.1 -0.2 0.3 0.0 0.2 0.4 0.2 0.1 0.1 0.2 0.2 0.2 -0.1 0.1 0.1 0.3 0.1 0.0 0.1 0.0 0.4 -0.3 0.2 0.1 0.2 0.1 -0.1 -0.1 -0.3 -0.2 -0.1 0.0 0.1 0.3 0.1 1.0

C.
Correlation matrix of cell type indices; rows and columns are in the same order. Block labels printed in the figure: Astrocytes, Microglia, Endothelial & Mural, Neurons, Mature Oligodendrocytes.

Column order: Microglia_All_Darmanis_PNAS_2015; Microglia_All_Zeisel_Science_2015; Microglia_All_Zhang_JNeuro_2014; Astrocyte_All_Cahoy_JNeuro_2008; Astrocyte_All_Darmanis_PNAS_2015; Astrocyte_All_Zeisel_Science_2015; Astrocyte_All_Zhang_JNeuro_2014; Endothelial_All_Daneman_PLOS_2010; Endothelial_All_Darmanis_PNAS_2015; Endothelial_All_Zeisel_Science_2015; Endothelial_All_Zhang_JNeuro_2014; Mural_All_Zeisel_Science_2015; Mural_Pericyte_Zhang_JNeuro_2014; Mural_Vascular_Daneman_PLOS_2010; Neuron_All_Cahoy_JNeuro_2008; Neuron_All_Darmanis_PNAS_2015; Neuron_All_Zhang_JNeuro_2014; Neuron_GABA_Sugino_NatNeuro_2006; Neuron_Interneuron_Zeisel_Science_2015; Neuron_Glutamate_Sugino_NatNeuro_2006; Neuron_Pyramidal_Cortical_Zeisel_Science_2015; Oligodendrocyte_All_Cahoy_JNeuro_2008; Oligodendrocyte_All_Zeisel_Science_2015; Oligodendrocyte_Mature_Darmanis_PNAS_2015; Oligodendrocyte_Myelinating_Zhang_JNeuro_2014; Oligodendrocyte_Newly-Formed_Zhang_JNeuro_2014; Oligodendrocyte_Progenitor Cell_Darmanis_PNAS_2015; Oligodendrocyte_Progenitor Cell_Zhang_JNeuro_2014; RBC_All_GeneCardSearch_Hemoglobin_ErythrocyteSpecific

Microglia_All_Darmanis_PNAS_2015 1.0 0.9 0.9 0.2 0.2 0.3 0.1 0.3 0.4 0.5 0.4 0.4 0.2 0.2 -0.1 -0.2 -0.1 0.1 0.0 0.0 -0.1 0.5 0.5 0.5 0.5 0.1 0.1 0.0 0.1
Microglia_All_Zeisel_Science_2015 0.9 1.0 0.9 0.4 0.3 0.6 0.3 0.6 0.6 0.8 0.7 0.6 0.5 0.4 0.0 -0.2 0.0 0.2 0.2 0.1 0.1 0.5 0.6 0.4 0.5 0.2 0.2 0.2 0.2
Microglia_All_Zhang_JNeuro_2014 0.9 0.9 1.0 0.4 0.3 0.4 0.3 0.5 0.6 0.7 0.6 0.6 0.4 0.3 -0.2 -0.3 0.0 0.0 0.1 0.0 0.0 0.5 0.5 0.3 0.6 0.0 0.0 0.1 0.3
Astrocyte_All_Cahoy_JNeuro_2008 0.2 0.4 0.4 1.0 0.8 0.8 0.8 0.6 0.5 0.7 0.6 0.6 0.6 0.6 0.1 -0.1 0.2 0.2 0.3 0.3 0.2 0.2 0.4 0.1 0.4 0.1 0.3 0.5 0.2
Astrocyte_All_Darmanis_PNAS_2015 0.2 0.3 0.3 0.8 1.0 0.7 0.5 0.4 0.4 0.5 0.5 0.4 0.4 0.4 -0.2 -0.3 -0.2 0.1 0.0 0.0 -0.1 0.3 0.4 0.2 0.3 -0.1 0.2 0.2 0.1
Astrocyte_All_Zeisel_Science_2015 0.3 0.6 0.4 0.8 0.7 1.0 0.7 0.7 0.5 0.8 0.8 0.7 0.6 0.8 0.3 0.1 0.3 0.5 0.6 0.5 0.5 0.3 0.7 0.2 0.5 0.4 0.3 0.7 0.2
Astrocyte_All_Zhang_JNeuro_2014 0.1 0.3 0.3 0.8 0.5 0.7 1.0 0.6 0.5 0.6 0.6 0.7 0.7 0.7 0.2 0.1 0.4 0.3 0.4 0.4 0.4 0.0 0.3 -0.2 0.2 0.2 0.3 0.6 0.2
Endothelial_All_Daneman_PLOS_2010 0.3 0.6 0.5 0.6 0.4 0.7 0.6 1.0 0.8 0.9 0.9 0.8 0.7 0.7 0.1 0.0 0.2 0.3 0.4 0.4 0.3 0.2 0.5 0.0 0.3 0.3 0.2 0.6 0.3
Endothelial_All_Darmanis_PNAS_2015 0.4 0.6 0.6 0.5 0.4 0.5 0.5 0.8 1.0 0.8 0.8 0.7 0.8 0.6 -0.2 -0.3 -0.1 0.0 0.0 0.1 0.0 0.3 0.4 0.1 0.3 0.1 0.3 0.3 0.3
Endothelial_All_Zeisel_Science_2015 0.5 0.8 0.7 0.7 0.5 0.8 0.6 0.9 0.8 1.0 0.9 0.8 0.7 0.7 0.1 -0.1 0.2 0.4 0.4 0.4 0.3 0.4 0.7 0.2 0.5 0.3 0.3 0.6 0.2
Endothelial_All_Zhang_JNeuro_2014 0.4 0.7 0.6 0.6 0.5 0.8 0.6 0.9 0.8 0.9 1.0 0.8 0.7 0.7 0.1 0.0 0.2 0.4 0.4 0.4 0.4 0.4 0.7 0.2 0.5 0.3 0.1 0.6 0.3
Mural_All_Zeisel_Science_2015 0.4 0.6 0.6 0.6 0.4 0.7 0.7 0.8 0.7 0.8 0.8 1.0 0.8 0.8 0.2 0.0 0.3 0.3 0.4 0.4 0.4 0.2 0.5 0.0 0.4 0.3 0.3 0.6 0.3
Mural_Pericyte_Zhang_JNeuro_2014 0.2 0.5 0.4 0.6 0.4 0.6 0.7 0.7 0.8 0.7 0.7 0.8 1.0 0.7 0.1 -0.1 0.2 0.2 0.3 0.3 0.3 0.1 0.3 -0.1 0.3 0.2 0.3 0.5 0.3
Mural_Vascular_Daneman_PLOS_2010 0.2 0.4 0.3 0.6 0.4 0.8 0.7 0.7 0.6 0.7 0.7 0.8 0.7 1.0 0.5 0.3 0.5 0.6 0.7 0.6 0.6 0.0 0.5 -0.1 0.1 0.6 0.3 0.8 0.2
Neuron_All_Cahoy_JNeuro_2008 -0.1 0.0 -0.2 0.1 -0.2 0.3 0.2 0.1 -0.2 0.1 0.1 0.2 0.1 0.5 1.0 1.0 0.8 0.9 0.9 0.9 0.9 -0.3 0.3 -0.2 -0.2 0.9 0.0 0.7 -0.1
Neuron_All_Darmanis_PNAS_2015 -0.2 -0.2 -0.3 -0.1 -0.3 0.1 0.1 0.0 -0.3 -0.1 0.0 0.0 -0.1 0.3 1.0 1.0 0.8 0.8 0.8 0.8 0.8 -0.3 0.2 -0.2 -0.3 0.8 -0.1 0.5 -0.1
Neuron_All_Zhang_JNeuro_2014 -0.1 0.0 0.0 0.2 -0.2 0.3 0.4 0.2 -0.1 0.2 0.2 0.3 0.2 0.5 0.8 0.8 1.0 0.6 0.9 0.8 0.9 -0.1 0.3 -0.2 0.1 0.7 -0.1 0.7 0.2
Neuron_GABA_Sugino_NatNeuro_2006 0.1 0.2 0.0 0.2 0.1 0.5 0.3 0.3 0.0 0.4 0.4 0.3 0.2 0.6 0.9 0.8 0.6 1.0 0.9 0.9 0.8 -0.1 0.6 0.0 -0.1 0.9 0.1 0.8 -0.1
Neuron_Interneuron_Zeisel_Science_2015 0.0 0.2 0.1 0.3 0.0 0.6 0.4 0.4 0.0 0.4 0.4 0.4 0.3 0.7 0.9 0.8 0.9 0.9 1.0 1.0 1.0 -0.1 0.5 -0.1 0.0 0.9 0.0 0.9 0.1
Neuron_Glutamate_Sugino_NatNeuro_2006 0.0 0.1 0.0 0.3 0.0 0.5 0.4 0.4 0.1 0.4 0.4 0.4 0.3 0.6 0.9 0.8 0.8 0.9 1.0 1.0 1.0 -0.2 0.5 -0.2 0.0 0.9 0.0 0.8 0.1
Neuron_Pyramidal_Cortical_Zeisel_Science_2015 -0.1 0.1 0.0 0.2 -0.1 0.5 0.4 0.3 0.0 0.3 0.4 0.4 0.3 0.6 0.9 0.8 0.9 0.8 1.0 1.0 1.0 -0.2 0.3 -0.3 -0.1 0.8 0.0 0.8 0.1
Oligodendrocyte_All_Cahoy_JNeuro_2008 0.5 0.5 0.5 0.2 0.3 0.3 0.0 0.2 0.3 0.4 0.4 0.2 0.1 0.0 -0.3 -0.3 -0.1 -0.1 -0.1 -0.2 -0.2 1.0 0.7 0.9 0.9 -0.1 0.0 0.0 0.2
Oligodendrocyte_All_Zeisel_Science_2015 0.5 0.6 0.5 0.4 0.4 0.7 0.3 0.5 0.4 0.7 0.7 0.5 0.3 0.5 0.3 0.2 0.3 0.6 0.5 0.5 0.3 0.7 1.0 0.7 0.7 0.5 0.1 0.5 0.1
Oligodendrocyte_Mature_Darmanis_PNAS_2015 0.5 0.4 0.3 0.1 0.2 0.2 -0.2 0.0 0.1 0.2 0.2 0.0 -0.1 -0.1 -0.2 -0.2 -0.2 0.0 -0.1 -0.2 -0.3 0.9 0.7 1.0 0.8 0.0 0.0 -0.1 0.1
Oligodendrocyte_Myelinating_Zhang_JNeuro_2014 0.5 0.5 0.6 0.4 0.3 0.5 0.2 0.3 0.3 0.5 0.5 0.4 0.3 0.1 -0.2 -0.3 0.1 -0.1 0.0 0.0 -0.1 0.9 0.7 0.8 1.0 0.0 0.0 0.1 0.3
Oligodendrocyte_Newly-Formed_Zhang_JNeuro_2014 0.1 0.2 0.0 0.1 -0.1 0.4 0.2 0.3 0.1 0.3 0.3 0.3 0.2 0.6 0.9 0.8 0.7 0.9 0.9 0.9 0.8 -0.1 0.5 0.0 0.0 1.0 0.1 0.7 -0.1
Oligodendrocyte_Progenitor Cell_Darmanis_PNAS_2015 0.1 0.2 0.0 0.3 0.2 0.3 0.3 0.2 0.3 0.3 0.1 0.3 0.3 0.3 0.0 -0.1 -0.1 0.1 0.0 0.0 0.0 0.0 0.1 0.0 0.0 0.1 1.0 0.3 -0.1
Oligodendrocyte_Progenitor Cell_Zhang_JNeuro_2014 0.0 0.2 0.1 0.5 0.2 0.7 0.6 0.6 0.3 0.6 0.6 0.6 0.5 0.8 0.7 0.5 0.7 0.8 0.9 0.8 0.8 0.0 0.5 -0.1 0.1 0.7 0.3 1.0 0.1
RBC_All_GeneCardSearch_Hemoglobin_ErythrocyteSpecific 0.1 0.2 0.3 0.2 0.1 0.2 0.2 0.3 0.3 0.2 0.3 0.3 0.3 0.2 -0.1 -0.1 0.2 -0.1 0.1 0.1 0.1 0.2 0.1 0.1 0.3 -0.1 -0.1 0.1 1.0


Suppl. Figure 7. There is a convergence of cell content predictions derived from cell type specific transcripts identified by different publications. A. The similarity of different cell type indices in the Pritzker cortical dataset can be visualized using a correlation matrix. Within this matrix, correlations can range from a strong negative correlation (-1, blue) to a strong positive correlation (1, red); therefore, a large block of pink/red correlations is indicative of cell type indices that tend to be enriched in the same samples. The axis labels for cell type indices representing the same category of cell are color-coded similar to Suppl. Figure 2. The number of probes included in each index is present in the far left column (also color-coded, with green indicating few probes and red indicating many probes). B-C. Examples of the cell type index correlation matrices from the replication cortical datasets: B. Lanz et al. (GSE53987), C. CMC RNA-Seq.

Pia/&/ Neurons Neurovasculature

Astrocytes

Mature/Oligodendrocytes 1373

1374 Suppl. Figure 8. The convergence of cell content predictions derived from cell type specific transcripts 1375 originating from different publications remains after removing overlapping transcripts. This figure 1376 follows the format of Suppl. Figure 7(Pritzker dataset), but uses cell type indices calculated following 1377 removal of any genes identified as present in more than one index. The similarity of different cell type 1378 indices are visualized using a correlation matrix, with color-coded labels for cell type indices 1379 representing the same category of cell.
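The overlap-removal step described for Suppl. Figure 8 can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the code used for the figure: the function name `remove_shared_genes` and the three toy gene lists are hypothetical, and the real index names and memberships come from the database of cell type specific genes.

```python
from collections import Counter


def remove_shared_genes(index_genes):
    """Drop every gene that appears in more than one cell type index.

    index_genes maps an index name to the set of gene symbols it contains.
    """
    # Count in how many indices each gene symbol occurs.
    counts = Counter(
        gene for genes in index_genes.values() for gene in set(genes)
    )
    # Keep only the genes that are unique to a single index.
    return {
        name: {gene for gene in genes if counts[gene] == 1}
        for name, genes in index_genes.items()
    }


# Hypothetical toy lists: the memberships are made up for illustration only.
toy_indices = {
    "Astrocyte_A": {"GFAP", "AQP4", "SLC1A3"},
    "Astrocyte_B": {"GFAP", "ALDH1L1"},
    "Neuron_A": {"SYT1", "SNAP25"},
}
filtered = remove_shared_genes(toy_indices)
for name in sorted(filtered):
    print(name, sorted(filtered[name]))
```

Indices recomputed from the filtered lists share no transcripts, so any remaining correlation between them cannot be attributed to shared genes.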

7.2.4 Cell Type Indices Predict Other Transcripts Known to Be Enriched in Specific Cell Types

To identify other transcripts important to cell type specific functions in the human cortex, we ran a linear model on the signal from each gene probeset in the Pritzker microarray dataset that included each of the ten consolidated primary cell type indices as well as the six traditional co-variates (“Model 5”, Equation 4). Suppl. Figure 9 shows the 10 most significant gene probesets positively associated with each cell type within the model.
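The per-probeset model described above can be illustrated with a small ordinary least squares fit. The sketch below is only an assumption-laden toy: it uses two hypothetical cell type indices and one covariate (age) instead of the ten indices and six co-variates of Model 5, the data are simulated without noise, and the helper `ols_fit` is an illustrative name, not part of any package used in the paper.

```python
def ols_fit(design, response):
    """Least-squares coefficients via the normal equations (X'X) b = X'y.

    design: list of rows, each row a list of predictor values
            (include a leading 1.0 for the intercept).
    response: list of observed values, one per row.
    """
    n_cols = len(design[0])
    # Build the augmented matrix [X'X | X'y].
    aug = []
    for i in range(n_cols):
        row = [sum(r[i] * r[j] for r in design) for j in range(n_cols)]
        row.append(sum(r[i] * y for r, y in zip(design, response)))
        aug.append(row)
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n_cols):
        pivot = max(range(col, n_cols), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        if abs(aug[col][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n_cols):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n_cols] for i in range(n_cols)]


# Hypothetical data for six samples (values are made up).
astrocyte_index = [0.2, -0.5, 1.0, 0.3, -1.2, 0.8]
neuron_index = [-0.1, 0.7, -0.9, 0.4, 1.1, -0.6]
age = [45.0, 60.0, 38.0, 72.0, 55.0, 50.0]

# Simulated probeset signal: intercept, astrocyte, neuron, and age effects.
true_coefs = [2.0, 1.5, -0.5, 0.1]
design = [[1.0, a, n, g] for a, n, g in zip(astrocyte_index, neuron_index, age)]
signal = [sum(c * x for c, x in zip(true_coefs, row)) for row in design]

coefs = ols_fit(design, signal)
print([round(c, 4) for c in coefs])
```

In a real analysis this fit would be repeated for every probeset, and the probesets would then be ranked by the significance of each cell type index term.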

Suppl. Figure 9. The top 10 transcripts associated with each cell type index include those previously identified as cell type enriched in the literature. Transcripts are identified by official gene symbol. Yellow labels identify transcripts included in the original cell type index; orange labels identify transcripts that were previously identified as cell type enriched in the literature but were not included in the database used to create the index. Please note that not all of the genes listed in the top ten list associated with the Red Blood Cell index would survive a traditional threshold for false discovery (q<0.05).

Many of the top gene probesets that we found to be related to each of the cell type indices have already been associated with that cell type in previous publications, validating our methodology. Importantly, this is true even when the genes were not included in the original list of cell type specific genes used to generate the index. For example, we found that HLA-E and EPAS1 were both strongly associated with our endothelial index, and both are known to be involved in endothelial cell activation (96,97). NOTCH2, one of the top astrocyte-related genes, promotes the astrocytic cell lineage (98), and APOE is primarily secreted by astrocytes in the central nervous system (99). One of the top interneuron genes, LHX6, is specifically enriched in parvalbumin-containing interneurons (2). Another top interneuron gene, ERBB4, controls the development of GABA circuitry in the cortex (100). Many top neuron-related genes relate to synaptic function (SYT1, SYNGR3, NRXN1; http://www.genecards.org/). The top projection neuron-related gene, PDE2A, is preferentially expressed in cortical pyramidal neurons (101), and KIF21B is a kinesin found in the dendrites of pyramidal neurons (102). We also rediscovered probesets representing genes that were listed as alternative orthologs to those included in our original cell type specific gene lists (oligodendrocytes: EVI2A vs. CTD-2370N5.3, microglia: LAIR1 vs. LAIR2, mural cells: COL18A1 vs. COL15A1, ACTA2 vs. ACTG1). Altogether, these results suggest that our cell type indices were associated with the variability of transcripts in the cortex that represented particular cell types and could re-identify known cell type specific markers.

7.3 Additional Figures and Results

7.3.1 Inferred Cell Type Composition Explains a Large Percentage of the Sample-Sample Variability in Microarray Data from Macro-Dissected Human Cortical Tissue

Within the four non-Pritzker human cortical tissue datasets, the relationships between the top principal components of variation and the consolidated cell type indices were similarly strong (Suppl. Figure 10), even though these datasets had received less preprocessing to remove the effects of technical variation. Within the GSE21935 dataset (31) the first principal component of variation (PC1) accounted for 37% of the variation in the dataset and seemed to represent a gradient running from samples with high support cell content (PC1 vs. endothelial index: R²=0.85, p<3.6e-18, PC1 vs. astrocyte index: R²=0.67, p<3.6e-11) to samples with high neuronal content (PC1 vs. neuron_all index: R²=0.85, p<3.9e-18). Within GSE53987 (30), which had samples derived exclusively from gray-matter-only dissections, PC1 accounted for 13% of the variation in the dataset and was highly correlated with predicted astrocyte content (PC1 vs. astrocyte index: R²=0.80, p<4.6e-24). In GSE21138 (39), which also had samples derived exclusively from gray-matter-only dissections, PC1 accounted for 23% of the variation in the dataset and was strongly related to technical variation (batch), but PC2, which accounted for 14% of the variation in the dataset, again represented a gradient from samples with high support cell content to high neuronal content (PC2 vs. astrocyte: R²=0.56, p<8.3e-11, PC2 vs. neuron_all: R²=0.54, p<2.3e-10). Finally, within the CMC RNA-Seq dataset, PC1 accounted for 16% of the variation in the dataset and was highly correlated with projection neuron content (PC1 vs. Neuron_Projection: R²=0.54, p=5.77e-104).

[Figure: four scatterplots. A. Barnes et al., 2011 (GSE21935): PC1 vs. Endothelial Index, r-squared: 0.85, p-value: 3.59e-18. B. Lanz et al., 2015 (GSE53987): PC1 vs. Astrocyte Index, r-squared: 0.8, p-value: 4.6e-24. C. Narayan et al. 2008 (GSE21138): PC2 vs. Astrocyte Index, r-squared: 0.56, p-value: 8.3e-11. D. CommonMind Consortium: PC1 vs. Projection Neuron Index, r-squared: 0.54, p-value: 5.77e-104.]

Suppl. Figure 10. Replication: Cell content predictions explain a large percentage of the variability in microarray and RNA-Seq data derived from the human cortex. The results shown above illustrate the strongest relationship between the top principal components of variation (PC1 or PC2) and cell type in each of the four human replication datasets discussed in the paper: A) GSE21935: Barnes et al. 2011; B) GSE53987: Lanz et al. 2015; C) GSE21138: Narayan et al. 2008; D) CommonMind Consortium.

When digging deeper, we found that none of the original 38 publication-specific cell type indices were noticeably superior to the consolidated indices when predicting the principal components of variation in the datasets. Human-derived indices did not outperform mouse-derived indices, and indices derived from studies using stricter definitions of cell type specificity (fold enrichment cut-off in Table 1) did not outperform less strict indices.
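The R² values reported in this section relate a principal component to a single cell type index. For a one-predictor linear relationship, R² equals the squared Pearson correlation, which can be sketched as below. The function name `r_squared` and the four-sample toy vectors are hypothetical and stand in for real PC scores and index values.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products around the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("both sequences must vary across samples")
    return (sxy * sxy) / (sxx * syy)


# Hypothetical PC1 scores and astrocyte index values for four samples.
pc1_scores = [1.0, 2.0, 3.0, 4.0]
astrocyte_index = [1.0, 3.0, 2.0, 4.0]
toy_r2 = r_squared(pc1_scores, astrocyte_index)
print(round(toy_r2, 2))
```

Comparing a publication-specific index with a consolidated index then amounts to computing this quantity for each index against the same principal component.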

7.3.2. It is Difficult to Discriminate Between Changes in Cell Type Balance and Cell-Type Specific Function

[Figure: bar chart titled "Neuron-Related Genes: Top 20 Functional Clusters", with panel labels A and B. The y-axis shows the average relationship with age ("Beta", negative indicates down-regulation), with ticks from -0.0175 to 0.005; asterisks mark individual clusters. Legible x-axis cluster labels include Synapse/Cell Junction, Anchoring junction / Adherens junction, Cation Channel Complex/Ion Channel Complex, Neuron Projection/Dendrite, Synaptic Transmission/Neurological System Process, Regulation of (Neurological) System Process, Neuron Differentiation/Neuron Projection Development, Cell fraction / Membrane fraction, Plasma Membrane (Part), and Cell motion / Cell migration.]

Suppl. Figure 11. The predicted decrease in neuronal cell content in relationship to age is unlikely to be fully explained by synaptic atrophy. Within the list of neuron-specific genes, 240 functional clusters were identified using DAVID. A) The genes in 19 out of the top 20 functional clusters showed decreased expression with age on average, as determined within a linear model that controlled for traditional confounds (“Model 2”). Depicted is the average effect of age +/-SE for each cluster (asterisks: p<0.05, blue=down-regulation, red=up-regulation). Overall, 76% of all 240 functional clusters showed a negative relationship with age on average (Suppl. Table 4). B) We blindly chose 29 functional clusters that were clearly related to dendritic/axonal functions and 41 functional clusters that seemed distinctly unrelated to dendritic/axonal functions. Transcripts from both classifications showed an average decrease in expression with age (p=9.197e-05, p=0.008756, respectively), but the decrease was larger for transcripts associated with dendritic/axonal-related functions (p=0.02339). Depicted is the average effect of age +/-SE for each classification of cluster.

7.3.3. Previously-documented Psychiatric Effects on Cortical Gene Expression within Particular Cortical Cell Types or Within Macro-Dissected Prefrontal Cortex

Validation Datasets | # of Genes | Method | Brain Bank | # of Subjects | Brain Region | Co-Variates: Controlled? | Co-Variates: Balanced? | Statistical Stringency

Schizophrenia Effects In Particular Cortical Cell Types:
Reviewed in Lewis & Sweet (2009) | 7 | ICC/in situ hybridization | Variable, often PITT | Variable | Prefrontal cortex | Variable | Variable | Variable
Arion et al. (2015) | 41 | LCM-Microarray: Pyramidal Neurons (Layers 3 & 5) | PITT | 72 | BA9 | Direction of effect evaluated, but covariates not included in final model. | Sex, Age, PMI, pH, RIN, tissue storage time, race | Top 40 (FDR<0.1 in both layers, Table 2A), Top 2 in Table 2B, FDR<10E-17 for Layer5)
Pietersen et al. (2014) | 47 | LCM-Microarray: PVALB Interneurons | HBTRC (MacLean) | 16 | BA42 | Batch. Considered effects of Sex, Age, PMI but not included in final model. | Sex, Age, PMI, pH not significantly different | Top 47 (FDR<0.01, FC>2, Table 3)
Mauney et al. (2015) | 35 | LCM-Microarray: Oligodendrocyte Precursors | HBTRC (MacLean) | 18 | BA9 | None reported | Sex, Age, PMI, pH not | Top 35 (FDR<0.001, Table S2)

Psychiatric Effects in Macro-dissected Prefrontal Cortex:
Mistry et al. (2013) | 126 | Meta-analysis of microarray data: Schizophrenia effects | Stanley Foundation, HBTRC (MacLean), PITT, CCHPC, MSSM, MHRI | 306 | BA9, BA10, BA46 | Model selection procedure included Batch, Age, pH, Study | Sex, PMI | FDR<0.1 (Table S2)
Choi et al. (2011) | 367 | Meta-analysis of microarray data: Bipolar effects | Stanley Foundation | 83 | BA46 (grey matter only) | Batch (Scan Date), pH, Psychosis, Medication at TOD | Age, BMI, PMI not reported | FDR<0.05, FC>1.3 (Table S1)

Suppl. Figure 12. Gene lists used to assess whether controlling for cell type while performing differential expression analyses enhances the detection of previously-documented psychiatric effects on cortical gene expression. These lists include genes with documented relationships to psychiatric illness in either 1) particular cortical cell types or 2) macro-dissected cortex. The full lists can be found in Suppl. Table 7. Abbreviations: LCM: Laser Capture Microdissection, PVALB: Parvalbumin, BA: Brodmann’s Area, PMI: Post-mortem interval, FDR: False discovery rate (or q-value), Brain Banks: PITT (University of Pittsburgh), HBTRC (Harvard Brain Tissue Resource Center), CCHPC (Charing Cross Hospital Prospective Collection), MSSM (Mount Sinai Icahn School of Medicine), MHRI (Mental Health Research Institute Australia).

7.3.4 The Top Diagnosis-Related Genes Identified by Models that Include Cell Content Predictions Pinpoint Known Risk Candidates

Although the inclusion of predicted cell type balance in our model occasionally improved our ability to detect previously-identified relationships with diagnosis, most relationships still went undetected in the Pritzker dataset and none of the diagnosis relationships survived standard p-value corrections for multiple comparisons. This could be due to a variety of factors, including microarray platform and probe sensitivity. Therefore, we decided to ask a complementary question: Of the top diagnosis relationships that we see in our dataset, how many have been previously observed in the literature? If including predicted cell type balance in our models improves the signal-to-noise ratio of our analyses, then we would expect that the top diagnosis-related genes in our dataset would be more likely to overlap with previous findings. To perform this comparison in an unbiased and efficient manner, we limited our search to PubMed, using as search terms only the respective human gene symbol and diagnosis (“Schizophrenia”, “Bipolar”, or “Depression”). For the genes related to MDD in our dataset, we also expanded the search to include two highly-correlated traits that are more quantifiable: “Anxiety” and “Suicide”. We then restricted our results to studies using human subjects.

Before controlling for cell type, when using a traditional model (“Model 2”) we found that only one of the top 10 genes related to diagnosis (FOS: (103,104)) or the presence or absence of psychiatric illness (ALDH1A1: (105)) had been previously noted in the human literature. In contrast, when we used a model that included the five most prevalent cortical cell types (“Model 4”), we found that five of the top 10 genes associated with Schizophrenia had been previously identified in the literature (ARHGEF2: (106), DOC2A: (107), FBXO9: (78), GRM1: (108,109); CEBPA: (84)), as had three of the top 10 genes associated with Bipolar Disorder (ALDH1A1: (105), SNAP25: (110), NRN1: (111); Suppl. Figure 13). This was an enrichment in overlap with the literature when compared to 100 randomly-selected genes in the dataset subjected to the same protocol (Schizophrenia: 5/10 vs. 7/100, p=0.0012; Bipolar: 3/10 vs. 8/100, p=0.0610); the enrichment was significant for Schizophrenia and a trend for Bipolar Disorder. Likewise, if we replaced diagnosis with a term representing the general presence or absence of a psychiatric illness, we found that four of the top 10 genes had been previously identified in the literature (ALDH1A1: (105); HBS1L: (4); HIVEP2: (112), FBXO9: (78), Suppl. Figure 14), and 9/10 of the top genes were actually significant with an FDR<0.05 when using permutation-based methods (using the R function lmp{lmPerm}, iterations=9999). The top 10 genes associated with psychiatric illness in models selected using forward/backward stepwise model selection (criterion=BIC) similarly included five that had been previously identified in the literature (PRSS16: (113), GRM1: (108,109); ALDH1A1: (105); SNAP25: (110); HIVEP2: (112)), a significant improvement in overlap with the literature over what is seen in 100 randomly-selected genes in the dataset subjected to the same protocol (Fisher’s exact test: 5/10 vs. 15/100, p=0.0168).
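The overlap comparisons above can be reproduced with a short two-sided Fisher's exact test. The sketch below uses only the Python standard library; the function name `fisher_exact_two_sided` is illustrative. The paragraph names Fisher's exact test only for the 5/10 vs. 15/100 comparison, so applying the same test to the Schizophrenia and Bipolar comparisons is an assumption.

```python
from math import comb


def fisher_exact_two_sided(hits_a, total_a, hits_b, total_b):
    """Two-sided Fisher's exact test comparing hits_a/total_a with hits_b/total_b."""
    total = total_a + total_b
    hits = hits_a + hits_b
    denom = comb(total, total_a)

    def prob(k):
        # Hypergeometric probability of k hits in group A with fixed margins.
        return comb(hits, k) * comb(total - hits, total_a - k) / denom

    lowest = max(0, total_a - (total - hits))
    highest = min(total_a, hits)
    observed = prob(hits_a)
    # Sum all tables that are no more probable than the observed table.
    return sum(
        p
        for p in (prob(k) for k in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )


# Counts taken from the text: top-10 overlap vs. 100 randomly-selected genes.
p_stepwise = fisher_exact_two_sided(5, 10, 15, 100)
p_schizophrenia = fisher_exact_two_sided(5, 10, 7, 100)
p_bipolar = fisher_exact_two_sided(3, 10, 8, 100)
print(round(p_stepwise, 4), round(p_schizophrenia, 4), round(p_bipolar, 4))
```

Under these assumptions, the three results should land close to the p-values reported in the text (0.0168, 0.0012, and 0.0610).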

Top Genes Associated with Schizophrenia: Model 4: Diagnosis + 5 Prevalent Cell Model 5: Diagnosis + All Cell Types & Model 2: Diagnosis + Confounds Types & Confounds Confounds Gene Symbol Beta Pval FDR Gene Symbol Beta Pval FDR Gene Symbol Beta Pval FDR CTRC -0.13 1.00E-04 4.75E-01 ARHGEF2 -0.12 3.96E-05 2.66E-01 ID1 -0.54 3.68E-05 2.22E-01 DMP1 -0.06 1.37E-04 4.75E-01 DOC2A 0.18 4.55E-05 2.66E-01 DOC2A 0.17 6.26E-05 2.22E-01 HHLA1 -0.40 1.70E-04 4.75E-01 ID1 -0.53 6.69E-05 2.66E-01 ARHGEF2 -0.12 6.78E-05 2.22E-01 PITPNB -0.13 1.96E-04 4.75E-01 PITPNB -0.12 8.87E-05 2.66E-01 PITPNB -0.13 7.41E-05 2.22E-01 DHX32 -0.16 2.61E-04 4.75E-01 FBXO9 -0.16 2.53E-04 4.48E-01 PMP22 -0.24 1.67E-04 3.64E-01 ID1 -0.51 2.73E-04 4.75E-01 CTRC -0.10 3.12E-04 4.48E-01 CRYBB1 -0.10 2.34E-04 3.64E-01 CRYBB1 -0.12 3.04E-04 4.75E-01 GPR63 0.12 4.28E-04 4.48E-01 NPPA -0.14 2.65E-04 3.64E-01 ZNF91 -0.29 3.26E-04 4.75E-01 GRM1 0.07 4.71E-04 4.48E-01 CTRC -0.10 2.68E-04 3.64E-01 FAM127B 0.15 3.84E-04 4.75E-01 DHX32 -0.13 4.79E-04 4.48E-01 PHLDB1 -0.17 4.41E-04 3.64E-01 NPPA -0.17 3.98E-04 4.75E-01 CEBPA 0.15 5.70E-04 4.48E-01 FGFR2 -0.16 4.49E-04 3.64E-01

Top Genes Associated with Bipolar Disorder: Model 4: Diagnosis + 5 Prevalent Cell Model 5: Diagnosis + All Cell Types & Model 2: Diagnosis + Confounds Types & Confounds Confounds Gene Symbol Beta Pval FDR Gene Symbol Beta Pval FDR Gene Symbol Beta Pval FDR NDUFS5 -0.15 7.77E-04 1.00E+00 ALDH1A1 -0.37 7.57E-05 9.06E-01 ALDH1A1 -0.40 3.05E-05 2.21E-01 ZNF593 0.16 1.20E-03 1.00E+00 SNAP25 -0.20 3.59E-04 1.00E+00 SNAP25 -0.17 3.69E-05 2.21E-01 LRRK1 0.10 1.64E-03 1.00E+00 G3BP1 0.14 7.61E-04 1.00E+00 CHST1 0.22 4.33E-04 9.98E-01 G3BP1 0.13 1.71E-03 1.00E+00 NDUFS5 -0.15 8.07E-04 1.00E+00 TRA2A -0.15 5.78E-04 9.98E-01 OR7C1 -0.08 1.85E-03 1.00E+00 ZNF593 0.16 1.05E-03 1.00E+00 G3BP1 0.14 6.58E-04 9.98E-01 NARS -0.08 1.89E-03 1.00E+00 NARS -0.08 1.07E-03 1.00E+00 ANGEL2 -0.09 7.27E-04 9.98E-01 FOS -0.63 2.00E-03 1.00E+00 CHST1 0.21 1.09E-03 1.00E+00 NARS -0.08 1.24E-03 9.98E-01 PITPNB -0.10 2.22E-03 1.00E+00 PITPNB -0.10 1.11E-03 1.00E+00 LRRK1 0.10 1.33E-03 9.98E-01 GIT2 -0.05 2.44E-03 1.00E+00 TXNDC5 0.14 1.33E-03 1.00E+00 KCTD2 0.10 1.34E-03 9.98E-01 UTY 0.04 2.87E-03 1.00E+00 NRN1 -0.13 1.42E-03 1.00E+00 TXNDC5 0.13 1.41E-03 9.98E-01

Top Genes Associated with MDD:

Model 2: Diagnosis + Confounds
Gene Symbol   Beta    Pval       FDR
BRD4           0.12   7.10E-05   4.29E-01
PRPH2          0.21   7.16E-05   4.29E-01
MED24          0.15   2.08E-04   7.94E-01
SPRY2         -0.21   3.20E-04   7.94E-01
PRSS16         0.11   3.31E-04   7.94E-01
HEY2          -0.15   6.04E-04   9.16E-01
NEURL          0.11   6.40E-04   9.16E-01
NKAIN1         0.11   1.15E-03   9.16E-01
GGA3           0.09   1.59E-03   9.16E-01
VENTXP1       -0.03   1.60E-03   9.16E-01

Model 4: Diagnosis + 5 Prevalent Cell Types & Confounds
Gene Symbol   Beta    Pval       FDR
PRPH2          0.21   6.96E-05   8.34E-01
BRD4           0.11   2.12E-04   9.99E-01
BAP1           0.11   2.77E-04   9.99E-01
PRSS16         0.10   5.10E-04   9.99E-01
ARL4D         -0.13   7.57E-04   9.99E-01
MED24          0.14   7.86E-04   9.99E-01
NKAIN1         0.11   7.97E-04   9.99E-01
REC8           0.12   8.57E-04   9.99E-01
FZD2           0.08   9.54E-04   9.99E-01
KCNN2         -0.14   1.19E-03   9.99E-01

Model 5: Diagnosis + All Cell Types & Confounds
Gene Symbol   Beta    Pval       FDR
BRD4           0.12   4.06E-05   4.86E-01
PRPH2          0.20   1.20E-04   5.49E-01
FZD2           0.08   1.37E-04   5.49E-01
BAP1           0.11   4.30E-04   9.99E-01
REC8           0.13   5.80E-04   9.99E-01
ARL4D         -0.13   9.74E-04   9.99E-01
MED24          0.13   1.06E-03   9.99E-01
PRSS16         0.09   1.29E-03   9.99E-01
HBS1L         -0.18   1.33E-03   9.99E-01
NKAIN1         0.10   1.40E-03   9.99E-01


Suppl. Figure 13. When analyzing the full Pritzker dataset, the top genes associated with diagnosis in models that include cell content predictions include genes previously identified in the literature. Depicted are the top 10 genes associated with diagnosis using three different models of increasing complexity, along with their betas (magnitude and direction of effect within the model; blue=downregulation, pink=upregulation), nominal p-values, and p-values adjusted to control the false discovery rate (FDR or q-value). Gene symbols that are bolded and highlighted yellow have been previously detected in the human literature in association with their respective diagnosis in papers identified using the PubMed search terms “Schizophrenia” (Row 1) and “Bipolar” (Row 2). None of the top genes associated with major depressive disorder in any of the three models were found to be associated with “Depression”, “Anxiety”, or “Suicide” on PubMed (Row 3).

Top Genes Associated with Psychiatric Illness:

Model 2: Psychiatric + Confounds
Gene Symbol   Beta    Pval       FDR
CLIP2          0.16   2.18E-04   8.71E-01
FAM127B        0.11   2.29E-04   8.71E-01
MED24          0.11   3.91E-04   8.71E-01
R3HDM2        -0.16   4.43E-04   8.71E-01
MAP7D1         0.11   4.61E-04   8.71E-01
ALDH1A1       -0.30   5.34E-04   8.71E-01
PITPNB        -0.08   6.38E-04   8.71E-01
FANCC         -0.08   7.08E-04   8.71E-01
TTC31          0.08   8.57E-04   8.71E-01
BTG2          -0.16   8.87E-04   8.71E-01

Model 4: Psychiatric + 5 Prevalent Cell Types & Confounds
Gene Symbol   Beta    Pval       FDR
ARL4D         -0.12   1.26E-04   6.37E-01
ALDH1A1       -0.24   3.02E-04   6.37E-01
CLIP2          0.14   4.06E-04   6.37E-01
HBS1L         -0.16   4.21E-04   6.37E-01
MICALL2        0.09   4.57E-04   6.37E-01
PITPNB        -0.08   4.69E-04   6.37E-01
DNAJB2         0.12   4.80E-04   6.37E-01
PPP6C         -0.07   5.75E-04   6.37E-01
HIVEP2        -0.12   5.91E-04   6.37E-01
FBXO9         -0.10   7.58E-04   6.37E-01

Model 5: Psychiatric + All Cell Types & Confounds
Gene Symbol   Beta    Pval       FDR
ARL4D         -0.12   9.91E-05   5.12E-01
MICALL2        0.08   2.07E-04   5.12E-01
HIVEP2        -0.11   2.41E-04   5.12E-01
SNAP25        -0.10   3.72E-04   5.12E-01
TRA2A         -0.11   3.79E-04   5.12E-01
CLIP2          0.14   3.79E-04   5.12E-01
FZD2           0.06   4.06E-04   5.12E-01
ALDH1A1       -0.22   4.33E-04   5.12E-01
SMARCD3        0.12   4.46E-04   5.12E-01
CHST1          0.16   4.47E-04   5.12E-01

Stepwise Regression: Top Genes Associated with Psychiatric Illness
Gene Symbol   Beta    Pval
MED24          0.13   1.83E-05
CLIP2          0.17   4.74E-05
PRSS16         0.10   8.86E-05
GRM1           0.05   1.11E-04
ALDH1A1       -0.23   1.28E-04
ARL4D         -0.11   1.37E-04
SNAP25        -0.11   1.39E-04
CHST1          0.16   1.45E-04
HIVEP2        -0.12   1.53E-04
TTC31          0.08   1.67E-04

Stepwise Regression: Top Genes Associated with Suicide
Gene Symbol   Beta    Pval
DGKE           0.035  1.81E-05
UNKL           0.106  2.40E-05
C11orf95       0.17   6.56E-05
TUBB6          0.162  9.41E-05
NEK3          -0.08   1.72E-04
ZNF592         0.158  2.27E-04
FAM98A        -0.11   3.00E-04
SPSB1          0.087  3.01E-04
CEBPB          0.249  4.04E-04
CHST11         0.069  4.18E-04

Suppl. Figure 14. When analyzing the full dataset, the top genes associated with psychiatric illness in models that include cell content predictions include genes previously identified in the literature. Depicted are the top 10 genes associated with psychiatric illness using three different models of increasing complexity, or associated with psychiatric illness or suicide in models chosen using stepwise regression. Notably, the results from stepwise regression for the diagnosis term are not included in this figure because the term was only included in the model for eight genes total (DHX32, ID1, CSRP1, AKR1B10, TBPL1, HIST1H4F, SETD3, GAL). Formatting follows that of Suppl. Figure 13. Note that the p-values associated with stepwise regression are likely to be optimistic due to overfitting. Gene symbols that are bolded and highlighted yellow have been previously detected in the human literature using the PubMed search terms “Schizophrenia”, “Bipolar”, “Depression”, “Anxiety”, or “Suicide”.


Together, we conclude that including cell content predictions in the analysis of macro-dissected microarray data can sometimes improve the sensitivity of the assay for detecting altered gene expression in relationship to psychiatric illness, especially if the dataset is confounded with dissection variation.


8 Supplementary Tables

Suppl. Table 1. Master Database of Cortical Cell Type Specific Gene Expression. A single spreadsheet listing the genes defined as having cell type specific expression in our manuscript, including the species, age of the subjects, and brain region from which the cells were purified, the platform used to measure transcript, the statistical criteria and reference cell types used to define “cell type specific expression”, the gene symbol or orthologous gene symbol in mouse/human (depending on the species used in the original experiment), and citation. If a gene was identified as having cell type specific expression in multiple experiments, there is an entry for each experiment; thus the full 3383 rows included in the spreadsheet do not represent 3383 individual cell type specific genes. A web version of this spreadsheet, kept interactively up-to-date, can be found at https://sites.google.com/a/umich.edu/megan-hastings-hagenauer/home/cell-type-analysis.

Suppl. Table 2. The average cell type indices for all 160 brain regions included in the Allen Brain Atlas dataset. This Excel file contains two worksheets. The first includes the average consolidated cell type index for 10 primary cell types for all 160 brain regions included in the Allen Brain Atlas. The second worksheet contains the standard error (SE) for the averages in the first worksheet.

Suppl. Table 3. Output for the analyses of cell type vs. subject variables for all datasets. The first spreadsheet provides the output from the meta-analysis for each cell type vs. subject variable combination (“b” = the estimated effect, provided in the units for the variable, e.g., the effect of one year of age, or the effect of one hour of PMI; “SE” = standard error; “p-value” = nominal p-value; “BH_adj_P-value (q-value)” = the p-value corrected for multiple comparisons). The second spreadsheet includes the T-statistics for all cell type vs. subject variable combinations for all datasets.
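The “BH_adj_P-value (q-value)” column refers to the Benjamini-Hochberg correction. As an illustration only, the adjustment can be sketched in plain Python as follows. The function name `bh_adjust` and the example p-values are hypothetical; this is not the authors' code, which is not shown in this section.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order.

    Illustrative sketch only; the name and interface are assumptions.
    """
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, scaling each by m / rank
    # and taking a cumulative minimum so the q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical nominal p-values, not taken from the supplementary tables.
    example = [0.01, 0.04, 0.03, 0.005]
    print(bh_adjust(example))
```

The cumulative minimum is why several neighbouring genes in a ranked list can share exactly the same adjusted value.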

Suppl. Table 4. Functions associated with genes identified as having neuron-specific expression. Column A of the Excel spreadsheet is a list of general physiological functions that were identified by DAVID as associated with our list of neuron-specific genes (relative to the full list of probesets included in the microarray). We named each functional cluster by the top two functions included in it. Column B indicates whether an experimenter blindly categorized the functional cluster as being clearly related or unrelated to synaptic function. The next four columns are output from DAVID indicating how strongly the functional cluster was enriched in our neuron-specific genes (mean fold enrichment, top p-value, top Bonferroni-corrected p-value, and top BH-corrected p-value). The number of genes from each functional cluster included in our results is listed in column G. Columns H-J indicate the mean, standard deviation, and standard error for the betas for Age for each gene included in the cluster. The betas indicate the strength and direction of the association with Age as determined within a larger linear model controlling for traditional confounds (“Model 2”). Columns K-M indicate whether, on average, the age-related betas for the genes in that cluster are statistically different from 0 as determined by a Welch’s t-test (t-stat, df, p-value). The final column indicates what percentage of the genes included in the cluster have a negative relationship (b) with age.
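The per-cluster summaries described for columns H-M can be sketched as below. This is a hedged illustration, not the authors' procedure: the function name `summarize_cluster_betas` and the example betas are hypothetical, and the t-statistic shown is the ordinary one-sample statistic against 0 (an assumption; the table legend names a Welch's t-test). The p-value is omitted because it needs a t-distribution that the Python standard library does not provide.

```python
import math
import statistics


def summarize_cluster_betas(betas):
    """Summarize the Age betas for the genes in one functional cluster.

    Hypothetical helper for illustration; requires at least two betas.
    """
    n = len(betas)
    if n < 2:
        raise ValueError("need at least two betas to estimate a standard deviation")
    mean = statistics.mean(betas)
    sd = statistics.stdev(betas)           # sample standard deviation (n - 1)
    se = sd / math.sqrt(n)                 # standard error of the mean
    # One-sample t-statistic testing whether the mean beta differs from 0
    # (undefined if all betas are identical, so report NaN in that case).
    t_stat = mean / se if se > 0 else float("nan")
    pct_negative = 100.0 * sum(1 for b in betas if b < 0) / n
    return {
        "n": n,
        "mean": mean,
        "sd": sd,
        "se": se,
        "t_stat": t_stat,
        "df": n - 1,
        "pct_negative": pct_negative,
    }


if __name__ == "__main__":
    # Hypothetical betas, not values from Suppl. Table 4.
    print(summarize_cluster_betas([-0.2, -0.1, 0.1, -0.4]))
```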

Suppl. Table 5. A .gmt file created using our database of cell type specific genes for use with Gene Set Enrichment Analysis (GSEA). This file should be in the correct format for usage with either GSEA (http://software.broadinstitute.org/gsea/index.jsp) or similar software.
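For readers unfamiliar with the .gmt layout, each line is tab-delimited: a gene set name, a description field, and then the member genes. A minimal sketch of producing such text from a dictionary of gene lists is given below. The function name `format_gmt`, the set names, and the placeholder gene symbols are all hypothetical and do not come from the authors' database.

```python
def format_gmt(gene_sets, description="na"):
    """Render {set_name: [gene, ...]} as .gmt text (one tab-delimited line per set).

    Hypothetical helper for illustration; duplicate genes within a set are
    dropped while preserving their first-seen order.
    """
    lines = []
    for name, genes in gene_sets.items():
        unique_genes = list(dict.fromkeys(genes))
        lines.append("\t".join([name, description] + unique_genes))
    return "\n".join(lines) + "\n" if lines else ""


if __name__ == "__main__":
    # Placeholder set names and gene symbols, for illustration only.
    example_sets = {
        "CellTypeA_Example": ["GENE_A", "GENE_B", "GENE_A"],
        "CellTypeB_Example": ["GENE_C"],
    }
    print(format_gmt(example_sets), end="")
```

Writing the returned string to a file with a `.gmt` extension gives a file that GSEA-style tools can read, assuming the gene identifiers match those used in the ranked gene list.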

Suppl. Table 6. Performing Gene Set Enrichment Analysis (fGSEA) using a .gmt file that includes traditional functional gene sets and cell type specific gene lists indicates that genes that are differentially expressed in association with a wide variety of subject variables tend to have cell type specific expression. fGSEA was performed using the results from a differential expression analysis performed on the Pritzker dataset using a model that included diagnosis, pH, agonal factor, age, PMI, and sex. The gene set enrichment results for each variable are included as their own worksheet in the file.


Suppl. Table 7. Previously-identified relationships between gene expression and psychiatric illness in the human cortex in either particular cell types or macro-dissected cortex. We used this database of previously-identified effects to determine whether controlling for cell type while performing differential expression analyses increased our ability to observe previously-documented effects. The effects were extracted from the publications described in Suppl. Figure 12.

Suppl. Table 8. The relationship between diagnosis and all probesets in the Pritzker Dorsolateral Prefrontal Cortex dataset as assessed using models of increasing complexity. For all probesets in the dataset, the spreadsheets for Model #2 and Model #4 include the b for all variables in the model (“Beta”: magnitude and direction of the association; positive associations=pink, negative associations=blue), the p-value (“Pval_nominal”), and the p-value adjusted for multiple comparisons using the Benjamini-Hochberg method (“BH_Adj”, also called FDR or q-value), both labeled with green indicating more significant relationships and red indicating less significant relationships. There are also summary spreadsheets that include just the results for Bipolar Disorder and Schizophrenia for Models #1-5. In these spreadsheets, the formatting is slightly different: T-statistics are provided, the b is called “LogFC”, and the BH_Adj p-value is called “adj.P.Val”.

Suppl. Table 9. The relationship between diagnosis and all genes in the CMC RNA-Seq dataset as assessed using models of increasing complexity. There are two summary spreadsheets that include the results for Bipolar Disorder and Schizophrenia for Models #1-5. For all genes in the dataset, each spreadsheet includes the b (“LogFC”: magnitude and direction of the association), the T-statistic, the p-value (“Pval_nominal”), and the p-value adjusted for multiple comparisons using the Benjamini-Hochberg method (“adj.P.Val”: FDR or q-value) for the effect of diagnosis in each model (#M1-M5).

Suppl. Table 10. The relationship between diagnosis and all probesets in the Barnes et al. (GSE21935) microarray dataset as assessed using models of increasing complexity. There is a summary spreadsheet that includes the results for Schizophrenia for Models #1-5. The spreadsheet follows the same format as Suppl. Table 9.

Suppl. Table 11. The relationship between diagnosis and all probesets in the Lanz et al. (GSE53987) microarray dataset as assessed using models of increasing complexity. There are two summary spreadsheets that include the results for Bipolar Disorder and Schizophrenia for Models #1-5. The spreadsheets follow the same format as Suppl. Table 9.

Suppl. Table 12. The relationship between diagnosis and all probesets in the Narayan et al. (GSE21138) microarray dataset as assessed using models of increasing complexity. There is a summary spreadsheet that includes the results for Schizophrenia for Models #1-5. The spreadsheet follows the same format as Suppl. Table 9.


9 References for Supplementary Section

87. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostat Oxf Engl. 2003 Apr;4(2):249–64.

88. American Psychiatric Association. Diagnostic and Statistical Manual of Mental Disorders (DSM-IV-TR). 4th ed. Washington, D.C.: American Psychiatric Association; 2000.

89. Dai M, Wang P, Boyd AD, Kostov G, Athey B, Jones EG, et al. Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data. Nucleic Acids Res. 2005;33(20):e175.

90. Allen Brain Atlas. Technical White Paper: Microarray Data Normalization, v.1 [Internet]. 2013. Available from: help.brain-map.org

91. Davis S, Meltzer PS. GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor. Bioinforma Oxf Engl. 2007 Jul 15;23(14):1846–7.

92. Kim D, Langmead B, Salzberg SL. HISAT: a fast spliced aligner with low memory requirements. Nat Methods. 2015 Apr;12(4):357–60.

93. Law CW, Chen Y, Shi W, Smyth GK. voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 2014 Feb 3;15(2):R29.

94. Oldham MC, Konopka G, Iwamoto K, Langfelder P, Kato T, Horvath S, et al. Functional organization of the transcriptome in human brain. Nat Neurosci. 2008 Nov;11(11):1271–82.

95. Wilkerson MD, Hayes DN. ConsensusClusterPlus: a class discovery tool with confidence assessments and item tracking. Bioinforma Oxf Engl. 2010 Jun 15;26(12):1572–3.

96. Coupel S, Moreau A, Hamidou M, Horejsi V, Soulillou J-P, Charreau B. Expression and release of soluble HLA-E is an immunoregulatory feature of endothelial cell activation. Blood. 2007 Apr 1;109(7):2806–14.

97. Tian H, McKnight SL, Russell DW. Endothelial PAS domain 1 (EPAS1), a transcription factor selectively expressed in endothelial cells. Genes Dev. 1997 Jan 1;11(1):72–82.

98. Tchorz JS, Tome M, Cloëtta D, Sivasankaran B, Grzmil M, Huber RM, et al. Constitutive Notch2 signaling in neural stem cells promotes tumorigenic features and astroglial lineage entry. Cell Death Dis. 2012;3:e325.

99. Boyles JK, Pitas RE, Wilson E, Mahley RW, Taylor JM. Apolipoprotein E associated with astrocytic glia of the central nervous system and with nonmyelinating glia of the peripheral nervous system. J Clin Invest. 1985 Oct;76(4):1501–13.

100. Fazzari P, Paternain AV, Valiente M, Pla R, Luján R, Lloyd K, et al. Control of cortical GABA circuitry development by Nrg1 and ErbB4 signalling. Nature. 2010 Apr 29;464(7293):1376–80.

101. Stephenson DT, Coskran TM, Kelly MP, Kleiman RJ, Morton D, O’Neill SM, et al. The distribution of phosphodiesterase 2A in the rat brain. Neuroscience. 2012 Dec 13;226:145–55.


102. Marszalek JR, Weiner JA, Farlow SJ, Chun J, Goldstein LS. Novel dendritic kinesin sorting identified by different process targeting of two related kinesins: KIF21A and KIF21B. J Cell Biol. 1999 May 3;145(3):469–79.

103. Rao JS, Harry GJ, Rapoport SI, Kim HW. Increased excitotoxicity and neuroinflammatory markers in postmortem frontal cortex from bipolar disorder patients. Mol Psychiatry. 2010 Apr;15(4):384–92.

104. Spiliotaki M, Salpeas V, Malitas P, Alevizos V, Moutsatsou P. Altered glucocorticoid receptor signaling cascade in lymphocytes of bipolar disorder patients. Psychoneuroendocrinology. 2006 Jul;31(6):748–60.

105. Le-Niculescu H, Patel SD, Bhat M, Kuczenski R, Faraone SV, Tsuang MT, et al. Convergent functional genomics of genome-wide association data for bipolar disorder: comprehensive identification of candidate genes, pathways and mechanisms. Am J Med Genet Part B Neuropsychiatr Genet Off Publ Int Soc Psychiatr Genet. 2009 Mar 5;150B(2):155–81.

106. Konopaske GT, Subburaju S, Coyle JT, Benes FM. Altered prefrontal cortical MARCKS and PPP1R9A mRNA expression in schizophrenia and bipolar disorder. Schizophr Res. 2015 May;164(1–3):100–8.

107. Glessner JT, Reilly MP, Kim CE, Takahashi N, Albano A, Hou C, et al. Strong synaptic transmission impact by copy number variations in schizophrenia. Proc Natl Acad Sci U S A. 2010 Jun 8;107(23):10584–9.

108. Ayoub MA, Angelicheva D, Vile D, Chandler D, Morar B, Cavanaugh JA, et al. Deleterious GRM1 mutations in schizophrenia. PloS One. 2012;7(3):e32849.

109. Frank RAW, McRae AF, Pocklington AJ, van de Lagemaat LN, Navarro P, Croning MDR, et al. Clustered coding variants in the glutamate receptor complexes of individuals with schizophrenia and bipolar disorder. PloS One. 2011;6(4):e19011.

110. Etain B, Dumaine A, Mathieu F, Chevalier F, Henry C, Kahn J-P, et al. A SNAP25 promoter variant is associated with early-onset bipolar disorder and a high expression level in brain. Mol Psychiatry. 2010 Jul;15(7):748–55.

111. Fatjó-Vilas M, Prats C, Pomarol-Clotet E, Lázaro L, Moreno C, González-Ortega I, et al. Involvement of NRN1 gene in schizophrenia-spectrum and bipolar disorders and its impact on age at onset and cognitive functioning. World J Biol Psychiatry Off J World Fed Soc Biol Psychiatry. 2016;17(2):129–39.

112. Volk DW, Chitrapu A, Edelson JR, Roman KM, Moroco AE, Lewis DA. Molecular mechanisms and timing of cortical immune activation in schizophrenia. Am J Psychiatry. 2015 Nov 1;172(11):1112–21.

113. Girgenti MJ, LoTurco JJ, Maher BJ. ZNF804a regulates expression of the schizophrenia-associated genes PRSS16, COMT, PDE4B, and DRD2. PloS One. 2012;7(2):e32404.
