Cell Type Manuscript Rev 180328 Justmaintext

Running Head: PREDICTING CELL TYPE BALANCE 1012 7. Supporting Information 1013 1014 1015 Supplementary Material for: 1016 1017 INFERENCE OF CELL TYPE COMPOSITION FROM HUMAN BRAIN TRANSCRIPTOMIC 1018 DATASETS ILLUMINATES THE EFFECTS OF AGE, MANNER OF DEATH, DISSECTION, 1019 AND PSYCHIATRIC DIAGNOSIS 1020 *Megan Hastings Hagenauer, Ph.D.1, Anton Schulmann, M.D.2, Jun Z. Li, Ph.D.3, Marquis P. Vawter, 1021 Ph.D.4, David M. Walsh, Psy.D.4, Robert C. Thompson, Ph.D.1, Cortney A. Turner, Ph.D.1, William E. 1022 Bunney, M.D.4, Richard M. Myers, Ph.D.5, Jack D. Barchas, M.D.6, Alan F. Schatzberg, M.D.7, Stanley J. 1023 Watson, M.D., Ph.D.1, Huda Akil, Ph.D.1 1024 42 Running Head: PREDICTING CELL TYPE BALANCE 1025 7.1 Additional Methods: Detailed Preprocessing Methods for the Transcriptomic Datasets 1026 1027 7.1.1 RNA-Seq data from Purified Cell Types (GSE52564 and GSE67835) 1028 As initial validation, we used our method to predict sample cell type identity in two RNA-Seq 1029 datasets derived from purified cell types: one derived from from purified cortical cell types in mice 1030 (n=17: two samples per cell type and 3 whole brain samples: GSE52564) (18), and one derived from 466 1031 single-cells dissociated from freshly-resected human cortex (GSE67835) (2). The RNA-Seq data that we 1032 downloaded from GEO was already in the format of FPKM values (Fragments Per Kilobase of exon 1033 model per million mapped fragments) (18) or counts per gene (2). To stabilize the variance in the data, we 1034 used a log(2) transformation (after adding 1), and then filtered out the data for any genes that completely 1035 lacked variation across samples (sd=0; final gene count: GSE52564: 17148, GSE67835: 21627). Then we 1036 applied the methods now found in the BrainInABlender package, and examined the correlation between 1037 each of the cell type indices and sample cell type identity (excluding the fetal, whole brain, and “hybrid” 1038 cells). The code for these analyses can be found at 1039 https://github.com/hagenaue/CellTypeAnalyses_Darmanis and 1040 https://github.com/hagenaue/CellTypeAnalyses_Zhang. 1041 1042 7.1.2 Microarray Data from Artificial Cell Mixtures (GSE19380) 1043 As further validation, we determined whether relative cell type balance could be accurately 1044 deciphered from microarray data for samples containing artificially-generated mixtures of cultured cells 1045 from the cerebral cortices of rat P1 pups (GSE19380, (12)). The microarray profiling was performed 1046 using a Affymetrix Rat Genome 230 2.0 Array. According to the methods on GEO, this data had already 1047 undergone probeset summarization and normalization using robust multi-array averaging (RMA, affy 1048 package (87)), including background subtraction, summarization by median polish, log(2) transformation, 1049 and quantile normalization. We then predicted the cell content of each sample from the microarray data 43 Running Head: PREDICTING CELL TYPE BALANCE 1050 using BrainInABlender, and correlated these predictions with the actual percent of each cell type found in 1051 the mixtures (Figure 2 in main text). The code for these analyses can be found at: 1052 https://github.com/hagenaue/CellTypeAnalyses_KuhnMixtures/tree/master. 1053 1054 7.1.3 Pritzker Dorsolateral Prefrontal Cortex Microarray Dataset (GSE92538) 1055 The original dataset included tissue from 172 high-quality human post-mortem brains donated to 1056 the Brain Donor Program at the University of California, Irvine with the consent of the next of kin. 1057 Frozen coronal slabs were macro-dissected to obtain dorsolateral prefrontal cortex samples. Clinical 1058 information was obtained from medical examiners, coroners’ medical records, and a family member. 1059 Patients were diagnosed with either Major Depressive Disorder, Bipolar Disorder, or Schizophrenia by 1060 consensus based on criteria from the Diagnostic and Statistical Manual of Mental Disorders (88). Due to 1061 the extended nature of this study, this sample collection occurred in waves (“cohorts”) over a period of 1062 many years. This research was overseen and approved by the University of Michigan Institutional Review 1063 Board (IRB # HUM00043530, Pritzker Neuropsychiatric Disorders Research Consortium (2001-0826)) 1064 and the University of California Irvine (UCI) Institutional Review Board (IRB# 1997-74). 1065 As described previously (28,62), total RNA from these samples was then distributed to 1066 laboratories at three different institutions (University of Michigan (UM), University of California-Davis 1067 (UCD), University of California-Irvine (UCI)) for hybridization to either Affymetrix HT-U133A or HT- 1068 U133Plus-v2 chips (1-5 replicates per sample, n=367). Before conducting the current analysis, the subset 1069 of probes found on both the Affymetrix HT-U133A and HT-U133Plus-v2 chips was extracted, 1070 reannotated for probe-to-transcript correspondance (89), summarized using RMA (87), log(2)- 1071 transformed, quantile normalized, and gender-checked. Then, 15 batches of highly-correlated samples 1072 were identified that were defined by a combination of cohort, chip, and laboratory (Suppl. Figure 1). 44 Running Head: PREDICTING CELL TYPE BALANCE Batch# Site Chip Cohort Control BP MDD SCHIZ 1 UCD U133A Dep Cohort 1 & 2 20 9 11 0 2 UCD U133A Dep Cohort 3 11 6 5 0 3 UCD U133A Dep Cohort 4 16 4 7 0 4 UCD U133Plus2 Dep Cohort 5 13 5 10 0 5 UCD U133A Schiz Cohort 1 9 0 0 9 6 UCD U133Plus2 Schiz Cohort 1 8 0 0 8 7 UCD U133Plus2 Schiz Cohort 2 8 0 0 10 8 UCI U133A Schiz Cohort 1 9 0 0 9 9 UM U133A Dep Cohort 1 16 10 9 0 10 UM U133A Dep Cohort 2 3 2 5 0 11 UM U133A Dep Cohort 3 & 4 27 11 11 0 12 UM U133Plus2 Dep Cohort 5 13 5 10 0 13 UM U133Plus2 Dep Cohort 6 7 2 9 3 14 UM U133A Schiz Cohort 1 9 0 0 9 1073 15 UM U133Plus2 Schiz Cohort 2 9 0 0 10 1074 Suppl. Figure 1. The number of microarray chips run in each batch, defined by processing site, 1075 Affymetrix chip type, and sample collection cohort. Samples from the four diagnostic categories 1076 (Control, Bipolar Disorder, Major Depressive Disorder, Schizophrenia) were unevenly distributed across 1077 batches. 1078 Samples that exhibited markedly low average sample-sample correlation coefficients (<0.85: 1079 outliers) were removed from the dataset, including data from one batch that exhibited overall low sample- 1080 sample correlation coefficients with other batches and duplicate microarrays. The batch effects were then 1081 subtracted out using median-centering (detailed procedure: (62)) and the replicate samples were averaged 1082 for each subject. Our current analyses began with this sample-level summary gene expression data 1083 (GSE92538). We further removed data from any subjects lacking information regarding critical pre- or 1084 post-mortem variables necessary for our analysis (final sample size: n=157). The code for these analyses 1085 can be found at https://github.com/hagenaue/CellTypeAnalyses_PritzkerAffyDLPFC. 1086 1087 7.1.4 Allen Brain Atlas Cross-Regional Microarray Dataset 1088 The Allen Brain Atlas microarray data was downloaded from http://human.brain- 1089 map.org/microarray/search in December 2015. This microarray survey was performed in brain-specific 1090 batches, with multiple batches per subject. To remove technical variation across batches, a variety of 1091 normalization procedures had been performed by the original authors both within and across batches 1092 using internal controls, as well as across subjects (90). The dataset available for download had already 45 Running Head: PREDICTING CELL TYPE BALANCE 1093 been log(2)-transformed and converted to z-scores using the average and standard deviation for each 1094 probe. These normalization procedures were designed to remove technical artifacts while best preserving 1095 cross-regional effects in the data, but the full information about relative levels of expression within an 1096 individual sample were unavailable and the effects of subject-level variables (such as age and pH) were 1097 likely to be de-emphasized due to the inability to fully separate out subject and batch during the 1098 normalization process. The 30,000 probes mapped onto 18,787 unique genes (as determined by gene 1099 symbol). The code for these analyses can be found at 1100 https://github.com/hagenaue/CellTypeAnalyses_AllenBrainAtlas. 1101 1102 7.1.5 Human Cortical Microarray Dataset GSE53987 (described in Lanz et al. (30)) 1103 The full publicly-available dataset GSE53987 (30) contained Affymetrix U133Plus2 microarray 1104 data from 205 post-mortem human brain samples from three brain regions: the DLPFC (Brodmann Area 1105 46, focusing on gray matter only (Lanz T.A., personal communication)), the hippocampus, and the 1106 striatum. These samples were collected by the University of Pittsburgh brain bank. For the purposes of 1107 our current analysis, we only downloaded the microarray .CEL files for the dorsolateral prefrontal cortex 1108 samples. We summarized these data with RMA (87) using a custom up-to-date chip definition file (.cdf) 1109 to define probe-to-transcript correspondence (“hgu133plus2hsentrezgcdf_19.0.0.tar.gz” from http://nmg- 1110 r.bioinformatics.nl/NuGO_R.html (89)). This process included background subtraction, log(2)- 1111 transformation, and quantile normalization. Gene Symbol annotation for probeset Entrez gene ids were 1112 provided by the R package org.Hs.eg.db. To control for technical variation, the sample processing batches 1113 were estimated using the microarray chip scan dates extracted from the .CEL files (using the function 1114 protocolData in the GEOquery package (91)), but all chips for the DLPFC were scanned on the same 1115 date. RNA degradation was estimated using the R package AffyRNADegradation (33). During quality 1116 control, two samples were removed - GSM1304979 had a range of sample-sample correlations that was 1117 unusually low compared (median=0.978) compared to the range for the dataset as a whole (median: 46 Running Head: PREDICTING CELL TYPE BALANCE 1118 0.993) and GSM1304953 appeared to be falsely identified as female (signal for XIST<7).

Cell Type Manuscript Rev 180328 Justmaintext

A Computational Approach for Defining a Signature of Β-Cell Golgi Stress in Diabetes Mellitus

Building the Vertebrate Codex Using the Gene Breaking Protein Trap Library

Noelia Díaz Blanco

Genome-Wide DNA Methylation Map of Human Neutrophils Reveals Widespread Inter-Individual Epigenetic Variation

Identification of Inherited Retinal Disease-Associated Genetic Variants in 11 Candidate Genes

Genome-Wide Linkage and Association Study Implicates the 10Q26 Region As a Major Genetic Contributor to Primary Nonsyndromic

UNIVERSITY of CALGARY a Novel Neurodevelopmental Disorder Within the Hutterite Population Maps to 16P and a Possible Causative M

DEAH-Box Polypeptide 32 Promotes Hepatocellular Carcinoma Progression Via Activating the Β-Catenin Pathway

DHX32 Expression Suggests a Role in Lymphocyte Differentiation

The Human Gene Connectome As a Map of Short Cuts for Morbid Allele Discovery

Assessment of Network Module Identification Across Complex Diseases

Comparative Analysis of the Ubiquitin-Proteasome System in Homo Sapiens and Saccharomyces Cerevisiae