Supporting Information

Shay et al. 10.1073/pnas.1222738110 (www.pnas.org/cgi/content/short/1222738110)

SI Materials and Methods

Tissue Atlas Dataset. Data of the human and mouse normal tissues were downloaded from http://biogps.org/downloads/, files U133AGNF1B.gcrma.avg_ann and GNF1M_plus_macrophage_small.bioGPS_ann. Those datasets include 32 comparable tissues. Genes were matched by symbol to ENSEMBL COMPARA version 63, resulting in 10,777 one-to-one ortholog pairs. When more than one probeset had the same gene symbol, only the probeset with the highest mean expression was used. Data were log2-transformed. There were 5,594 human genes with at least one value above five (after log2 transform) and 6,191 mouse genes with at least one value above nine, including 4,059 one-to-one ortholog pairs that were included in all subsequent analysis. The thresholds five and nine were chosen by examining the bimodal gene expression distribution for all probesets, because they separated the two modes of the distribution.

Conservation of expression (COE) was calculated as the Pearson correlation coefficient between the values in comparable samples. For calculating correlation between samples, each gene was standardized in each species’ dataset separately.

Estimating the Number of Conserved Genes. Assuming that there are two groups of genes, conserved and nonconserved, the nonconserved genes’ COE distribution, denoted $F_0$, should be similar to the ImmGen-DMAP permutation distribution (Fig. 2B, black). The COE distribution of conserved genes, denoted $F_{\text{conserved}}$, should be concentrated in the positive correlation region. The observed total distribution of COE values for all genes, denoted $F_{\text{all-genes}}$, is obtained as a mixture of the COE distribution of nonconserved genes $F_0$ (estimated by the permutation-based distribution) and the (unknown) COE distribution of conserved genes $F_{\text{conserved}}$:

\[ F_{\text{all-genes}} = \alpha F_{\text{conserved}} + (1 - \alpha) F_0 . \]

We next estimated the mixture parameter $\alpha$, representing the fraction of genes with conserved COE, in two ways.

First, we used a nonparametric method applicable for any distribution. From above, we must have $F_{\text{all-genes}}(x) \ge (1 - \alpha) F_0(x)$ for any $x$. Thus, we can estimate the maximal possible fraction of the nonconserved genes $(1 - \alpha)$ by calculating the ratio between the ImmGen-DMAP and ImmGen-DMAP permutation distributions, $F_{\text{all-genes}}(x)/F_0(x)$, for any value of $x$ (Fig. 2B, black and dark blue curves, respectively). To avoid errors resulting from inaccurate estimates of $F_{\text{all-genes}}(x)$ and $F_0(x)$ from data that may skew our estimator for $1 - \alpha$, we took the minimal ratio $F_{\text{all-genes}}(x)/F_0(x)$ for values of $x$ on the negative region ($x < 0$) and excluded the left tail (keeping $x > -0.8$), which gave us a minimal ratio of $1 - \alpha = 0.3$ (Fig. 2B, black arrows). Thus, the nonparametric estimator indicates that at least $\alpha = 1 - 0.3 = 70\%$ of the genes have conserved COE scores.

Similarly, for the activated T cells, we can estimate the maximal possible fraction of the nonconserved genes $(1 - \alpha)$ from the ratio between the human–mouse and human–mouse permutation distributions for any value of $x$ (Fig. S5A, blue and dashed blue curves, respectively). To avoid errors resulting from inaccurate estimates of $F_{\text{all-genes}}(x)$ and $F_0(x)$ from data that may skew our estimator for $1 - \alpha$, we took the left tail, which is most populated ($x = -1$), and it gave us a minimal ratio of $1 - \alpha = 0.36$. Thus, the nonparametric estimator indicates that at least $\alpha = 1 - 0.36 = 64\%$ of the genes have conserved COE scores.

Alternatively, we estimated the fraction of genes that are conserved by modeling the COE distribution as a mixture of two Gaussians (Fig. S2D). Fitting a mixture of Gaussians model to the COE data, where the conserved genes’ Gaussian mean and SD are known, the conserved genes’ Gaussian (black) contained $\alpha = 51\%$ of the genes, and the nonconserved genes’ Gaussian (green plot) explains $1 - \alpha = 49\%$ of the genes (Fig. S2D). We used the Matlab function MixtureOfGaussiansGivenInit (O.Z.; http://www.broadinstitute.org/~orzuk/matlab/libs/stats/mog/MixtureOfGaussiansGivenInit.m) to estimate the mixture of Gaussians, with mean and SD set as the mean and SD of the ImmGen-DMAP permutation distribution.

Genes with Divergent Expression. We defined genes with divergent expression as having a COE < 0.25 and at least a fourfold change between the maximal lineage mean and the minimal lineage mean in both human and mouse. We used these strict criteria to remove any bad probesets and any genes with only slight differences in expression between lineages, which may be because of other confounders. For those different genes highlighted in Known and Previously Undescribed Differences in Individual Genes as divergently expressed (IL15, ETS2, TMEM176A, TMEM176B, SNN, and TLR1), we attempted to rule out the possibility that the differences stem from probeset design issues. For the human genes, the Affymetrix probes match sequences in the genes’ 3′ UTRs (except for TMEM176B, whose probes are mapped to several exons and 3′ UTR), and the probeset GeneAnnot (1) (http://genecards.weizmann.ac.il/geneannot) score is one (best). For the mouse genes, the probes’ sequences are mapped to different positions in the gene region.

Activation and Profiling of Human CD4+ T Cells. For stimulation of human cells, CD4+ T cells were isolated from whole blood by negative selection using RosetteSep human CD4+ T-cell enrichment cocktail (Stemcell Technologies) and RosetteSep density medium gradient centrifugation and then stored frozen. On the day of activation, CD4+ T cells from 14 individuals were separately thawed, resuspended in RPMI supplemented with 10% FCS, and plated at 50,000 cells/well in a 96-well plate. Cells were either left untreated as a baseline control or stimulated with bead-bound anti-CD3 and anti-CD28 at a bead-to-cell ratio of 1:1 for 4 or 48 h. At each time point, a second step of purification of CD4+ T cells was performed with magnetic positive selection using the Invitrogen Dynal CD4 Positive Isolation Kit (96-well format) before Trizol extraction of RNA.

Isolated RNA was amplified and prepared for hybridization to the Affymetrix HuGene ST1.0 Array using the GeneChip Whole Transcript (WT) Sense Target Labeling Assay. Raw data were normalized using the Robust Multichip Average algorithm in the “Expression File Creator” module in GenePattern (2). Only 11,229 genes with expression that exceeded 285 (ImmGen threshold setting procedure; http://www.immgen.org/Protocols/ImmGen QC Documentation_ALL-DataGeneration_0612.pdf) in at least two samples were analyzed. A paired t test between the 14 samples from the 14 donors was used to identify differentially expressed genes between each of the two samples of the same donor in consecutive time points, with a false discovery rate of 5% and a median fold change of at least two.

Activation and Profiling of Mouse CD4+ T Cells. For stimulation of mouse cells, CD4+ T cells from pooled spleen and lymph nodes of male C57BL/6foxp3-gfp mice were purified to >90% purity (Dynal CD4 Negative Isolation Kit), and 2 × 10^5 cells were cultured in 96-well flat-bottom plates in 200 μL RPMI-1640 and 10% FCS together with beads conjugated with anti-CD3 and -CD28 at a cell:bead ratio of 1:2. Cells were harvested at 1, 4, 20, or 48 h; Tconv and Treg were sorted as DAPI−CD45R−CD8a−CD11b/c−CD4+ and GFP−, and the 1- and 4-h or 20- and 48-h lysates were pooled as early or late time points, respectively. Three pools per time point were created. Lysates from sorted cells were processed for profiling on ST1.0 arrays per ImmGen protocol.

Only 11,812 genes with expression that exceeded 155 (ImmGen threshold setting procedure; http://www.immgen.org/Protocols/ImmGen QC Documentation_ALL-DataGeneration_0612.pdf) in at least two samples were analyzed. A t test between the three samples was used to identify differentially expressed genes between each pair of consecutive time points, with a false discovery rate of 10% and a mean fold change of at least two.

Motif Scanning. We scanned promoters of human and mouse genes for enriched motifs. We downloaded promoter sequences for human (hg18) and mouse (mm9) from the University of California, Santa Cruz, Genome Browser website (http://hgdownload.cse.ucsc.edu/downloads.html). For each gene, we scanned the region starting from −1,000 bp upstream of the transcription start site (TSS) and ending at +200 bp downstream of the TSS. We represented the nucleotide at position $j$ (relative to −1,000 bp from the TSS) for gene $i$ as $S_{i,j}$. We represented each cis-regulatory element by a position weight matrix (PWM). We compiled a set of 1,651 PWMs from the TRANSFAC matrix database (3) v8.3, JASPAR (4) Version 2008, and experimentally determined PWMs (5, 6). For the $k$th motif, we denote its PWM by $P_k$, a matrix of size $4 \times L_k$, where $L_k$ is the length of the motif and $P_k(i,j)$ represents the probability of encountering the nucleotide $j$ ($j$ = ‘A,’ ‘C,’ ‘G,’ or ‘T’) at the $i$th position. For each gene $i$, position along the promoter $j$, and PWM $k$, we computed the local motif-matching score logarithm of the odds (LOD), $\mathrm{LOD}(i,j,k)$, defined as the log-likelihood ratio (LOD score) for observing the sequence given the PWM vs. a given random genomic background:

\[ \mathrm{LOD}(i,j,k) = \sum_{r=1}^{L_k} \left[ \log_2 P_k\left(r, S_{i,j+r-1}\right) - \log_2 P_b\left(S_{i,j+r-1}\right) \right] . \]

Genomic background was determined as $P_b(\text{‘A’}) = P_b(\text{‘T’}) = 0.3$ and $P_b(\text{‘C’}) = P_b(\text{‘G’}) = 0.2$, representing the nucleotide composition in the mouse and human genomes. We then found the best motif instance over the entire promoter region, defined as

\[ \text{MAX-LOD}(i,k) = \max_j \mathrm{LOD}(i,j,k) . \]

Motif Scoring Threshold. We automatically computed a PWM-specific threshold per species by using the information content of each motif. The information content for the $k$th motif is defined as

\[ IC_k = -\sum_{i=1}^{L_k} \sum_{j=1}^{4} P_k(i,j) \log_2 P_k(i,j) . \]

We defined the PWM-specific threshold for the $k$th motif as $\tau_k$, the $1 - 2^{-IC_k}$ quantile of the PWM LODs distribution across all genes’ promoters. We considered a hit for the $k$th motif at the $i$th gene if the best score, $\text{MAX-LOD}(i,k)$, exceeded the threshold $\tau_k$.

Motif Enrichment in Lineage Signatures. For each lineage $L$ and each motif $k$, we computed the P value for enrichment, $p_e(L,k)$, of the motif in the induced and repressed signatures of the lineage compared with a background set comprised of the filtered one-to-one orthologs. An enrichment of a motif in a signature results in higher than expected MAX-LOD scores for the genes in this module. To capture this effect, we computed the P value by comparing the scores $\text{MAX-LOD}(i,k)$ for all genes $i$ in the lineage $L$ with the scores for the entire set of filtered one-to-one orthologs by performing a one-sided rank-sum test. We then employed a false discovery rate of 5% on the entire matrix of P values $p_e(L,k)$ and declared as significant hits all P values passing this procedure.

For identifying putative divergent regulation based on cis-regulatory elements, we looked only at lineages in which the Pearson correlation coefficient was >0.7 to ensure that most motifs behave similarly. We then fitted a linear model to the P values of enrichment in each lineage in human and mouse using Matlab function LinearModel.fit, and only motifs that have a standardized residual more than five were considered. Of these motifs, we selected those motifs that had an enrichment P value < 0.001 in one species and a P value > 0.5 in the other species.

ImmGen Consortium Members

Paul Monach^a, Susan A. Shinton^b, Richard R. Hardy^b, Radu Jianu^c, David Laidlaw^c, Jim Collins^d, Roi Gazit^e, Brian S. Garrison^e, Derrick J. Rossi^e, Kavitha Narayan^f, Katelyn Sylvia^f, Joonsoo Kang^f, Anne Fletcher^g, Kutlu Elpek^g, Angelique Bellemare-Pelletier^g, Deepali Malhotra^g, Shannon Turley^g, Adam J. Best^h, Jamie Knell^h, Ananda Goldrath^h, Vladimir Jojic^i, Daphne Koller^i, Tal Shay^j, Aviv Regev^j, Nadia Cohen^k, Patrick Brennan^k, Michael Brenner^k, Taras Kreslavsky^k, Natalie A. Bezman^l, Joseph C. Sun^l, Charlie C. Kim^l, Lewis L. Lanier^l, Jennifer Miller^m, Brian Brown^m, Miriam Merad^m, Emmanuel L. Gautier^m,n, Claudia Jakubzick^m, Gwendalyn J. Randolph^m,n, Francis Kim^o, Tata Nageswara Rao^o, Amy Wagers^o, Tracy Heng^p, Michio Painter^p, Jeffrey Ericson^p, Scott Davis^p, Ayla Ergun^p, Michael Mingueneau^p, Diane Mathis^p, and Christophe Benoist^p

^a Department of Medicine, Boston University, Boston, MA 02118;
^b Fox Chase Center, Philadelphia, PA 19111;
^c Computer Science Department, Brown University, Providence, RI 02912;
^d Department of Biomedical Engineering, Howard Hughes Medical Institute, Boston University, Boston, MA 02215;
^e Immune Diseases Institute, Children’s Hospital, Boston, MA 02115;
^f Department of Pathology, University of Massachusetts Medical School, Worcester, MA 01655;
^g Dana-Farber Cancer Institute and Harvard Medical School, Boston, MA 02215;
^h Division of Biological Sciences, University of California at San Diego, La Jolla, CA 92093;
^i Computer Science Department, Stanford University, Stanford, CA 94025;
^j Broad Institute, Cambridge, MA 02141;
^k Division of Rheumatology, Immunology and Allergy, Brigham and Women’s Hospital, Boston, MA 02467;
^l Department of Microbiology and Immunology, University of California, San Francisco, CA 94143;
^m Icahn Medical Institute, Mount Sinai Hospital, New York, NY 10029;
^n Department of Pathology and Immunology, Washington University, St. Louis, MO 63110;
^o Joslin Diabetes Center, Boston, MA 02215; and
^p Division of Immunology, Department of Microbiology and Immunobiology, Harvard Medical School, Boston, MA 02115
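As a rough illustration of the scoring described under Motif Scanning and Motif Scoring Threshold, the following Python sketch computes LOD(i,j,k), MAX-LOD(i,k), the information content as written in the text, and a quantile threshold. It is not the authors' code. The function names, the representation of a PWM as a list of per-position dictionaries, the requirement that PWM entries used in scoring be strictly positive, and the nearest-rank quantile rule are all assumptions made for the example.

```python
import math

# Background nucleotide frequencies as stated under Motif Scanning.
BACKGROUND = {"A": 0.3, "T": 0.3, "C": 0.2, "G": 0.2}


def lod(seq, j, pwm):
    """Log2-odds score of the window of seq starting at 0-based offset j.

    pwm is a list with one dict per motif position, mapping a nucleotide to
    its probability (hypothetical layout; entries must be > 0 here).
    """
    return sum(
        math.log2(column[seq[j + r]]) - math.log2(BACKGROUND[seq[j + r]])
        for r, column in enumerate(pwm)
    )


def max_lod(seq, pwm):
    """Best window score over all start positions in the promoter sequence."""
    return max(lod(seq, j, pwm) for j in range(len(seq) - len(pwm) + 1))


def information_content(pwm):
    """IC_k as written in the text: -sum P log2 P, with 0 log 0 taken as 0."""
    return -sum(p * math.log2(p) for column in pwm for p in column.values() if p > 0)


def threshold(scores, ic):
    """Empirical (1 - 2^-IC) quantile of scores, nearest-rank (an assumption)."""
    q = 1.0 - 2.0 ** (-ic)
    ranked = sorted(scores)
    index = min(len(ranked) - 1, max(0, math.ceil(q * len(ranked)) - 1))
    return ranked[index]


# Toy two-position motif preferring "AG" (illustrative values only).
toy_pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
best = max_lod("TTAGTT", toy_pwm)
```

A gene would then count as a hit for motif k when its `max_lod` value exceeds the `threshold` computed from the scores of all promoters for that motif.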

1. Chalifa-Caspi V, et al. (2004) GeneAnnot: Comprehensive two-way linking between oligonucleotide array probesets and GeneCards genes. Bioinformatics 20(9):1457–1458.
2. Reich M, et al. (2006) GenePattern 2.0. Nat Genet 38(5):500–501.
3. Matys V, et al. (2006) TRANSFAC and its module TRANSCompel: Transcriptional gene regulation in eukaryotes. Nucleic Acids Res 34(Database issue):(Suppl 1):D108–D110.
4. Sandelin A, Alkema W, Engström P, Wasserman WW, Lenhard B (2004) JASPAR: An open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res 32(Database issue):(Suppl 1):D91–D94.
5. Badis G, et al. (2009) Diversity and complexity in DNA recognition by transcription factors. Science 324(5935):1720–1723.
6. Berger MF, et al. (2008) Variation in homeodomain DNA binding revealed by high-resolution analysis of sequence preferences. Cell 133(7):1266–1276.

[Fig. S1 image omitted. (A) Human lineage tree running from HSC through CMP, MEP, and GMP to mature cell types, grouped as ERY, MEGA, GN, MONO / MAC, DC, B-cells, NK-cells, and T-cells. (B) Mouse lineage tree of fetal and adult populations, with a legend for non-intersecting edges and putative edges, including stromal cells and grouped as GN, MO / MAC, DC, B-cells, NK-cells, and T-cells.]

Fig. S1. Lineage trees of the compared compendia. (A) Lineage tree of the hematopoietic cell types profiled in human (1). Figure adapted from the human compendium paper (1). DC, dendritic cells; ERY, erythroid cells; GN, granulocytes; MEGA, megakaryocytes; MONO/MAC, monocytes/macrophages. (B) Lineage tree of the immune cell types profiled in mouse by the ImmGen consortium (www.immgen.org).

1. Novershtern N, et al. (2011) Densely interconnected transcriptional circuits control cell states in human hematopoiesis. Cell 144(2):296–309.

[Fig. S2 image omitted. (A) Heat maps of cell surface marker expression, Human and Mouse, columns HSPC, GN, MO, DC, B, NK, T; color scale −2 to 2. (B) Correlation matrix between human and mouse lineages (HSPC, GN, MO, DC, B, NK, T); color scale −1 to 1. (C) Correlation matrix between human and mouse tissues. (D) Histogram of COE values; x axis, Human–mouse correlation; y axis, Genes fraction; curves, 0.49*Component 1, 0.51*Component 2, and 0.49*Component 1+0.51*Component 2. (E) Heat maps, Human and Mouse, columns HSPC, GN, MO, DC, B, NK, T; color scale −2 to 2.]

Fig. S2. High conservation between matched human and mouse hematopoietic lineages and tissues. (A) Mean-centered expression values (red/blue color scale on the right) in (Left) human and (Right) mouse of the cell surface markers (rows) used in cell sorting in samples from the seven common lineages (columns; color bar on the bottom). Genes are sorted by human lineage with maximal expression. The expression of most markers is conserved, with few (e.g., PROM1) showing different lineage expression patterns in the two species. (B) Pearson correlation coefficients (yellow/purple color bar on the bottom) between mean expression in human and mouse immune lineages measured in both compendia. (C) Pearson correlation coefficients (yellow/purple color bar in B) between mean expression in tissues from the tissue atlas (1). Samples are in rows and columns. Red lines separate species; black lines mark correlation between corresponding lineages/samples between species. (D) Fitting of a mixture of Gaussians to the observed COE distribution. Shown is the observed COE distribution (blue), the two Gaussians representing the conserved (green) and nonconserved (black) genes, and their combined fit (red) of the observed distribution. (E) Expression profiles (color scale on the left) in (Left) human and (Right) mouse of the 50 ortholog pairs with the highest COE values. Genes are mean-centered and sorted by the human lineage with maximal expression.

1. Su AI, et al. (2004) A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci USA 101(16):6062–6067.
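The two-Gaussian fit shown in Fig. S2D, in which one component is tied to the mean and SD of the permutation distribution, can be illustrated with a small expectation-maximization loop. This is only a sketch under stated assumptions: the authors used the Matlab function MixtureOfGaussiansGivenInit, whose internals are not described here, so the plain EM below is a stand-in technique rather than a port. The function names, starting values, iteration count, and the synthetic data are all hypothetical.

```python
import math
import random


def normal_pdf(x, mu, sd):
    """Density of a Gaussian with mean mu and standard deviation sd."""
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_mixture_fixed_null(values, null_mu, null_sd, n_iter=150):
    """EM for alpha*N(mu, sd) + (1 - alpha)*N(null_mu, null_sd).

    The null component is held fixed (assumed to come from the permutation
    distribution); only alpha, mu, and sd of the free component are updated.
    """
    alpha, mu, sd = 0.5, max(values) / 2.0, null_sd  # crude start (assumption)
    for _ in range(n_iter):
        # E step: posterior weight of the free component for each value.
        resp = []
        for x in values:
            a = alpha * normal_pdf(x, mu, sd)
            b = (1.0 - alpha) * normal_pdf(x, null_mu, null_sd)
            resp.append(a / (a + b))
        # M step: update the mixing weight and the free component only.
        total = sum(resp)
        alpha = total / len(values)
        mu = sum(r * x for r, x in zip(resp, values)) / total
        var = sum(r * (x - mu) ** 2 for r, x in zip(resp, values)) / total
        sd = max(math.sqrt(var), 1e-6)
    return alpha, mu, sd


# Synthetic stand-in for COE values: half null-like, half shifted positive.
rng = random.Random(0)
toy = [rng.gauss(0.0, 0.35) for _ in range(2000)]
toy += [rng.gauss(0.6, 0.15) for _ in range(2000)]
alpha, mu, sd = fit_mixture_fixed_null(toy, null_mu=0.0, null_sd=0.35)
```

On real data, `null_mu` and `null_sd` would be taken from the permutation-based COE values, and `alpha` would play the role of the conserved fraction.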

[Fig. S3 image omitted. (A) Heat map of THY1, KLRK1, FLT3, CD38, CD2, NCAM1, CD8A, DPP4, NT5E, and CD27 with their mouse orthologs; columns grouped by lineage (H, GN, MO, DC, B, NK, T) for Human (DMAP), Human spleen, and Mouse; color scale −2 to 2. (B) Heat map of selected human/mouse ortholog pairs; genes marked with asterisks are IL15, ETS2, TMEM176B, TMEM176A, JUN, SNN, and TLR1.]

Fig. S3. Differentially expressed genes between human and mouse. (A) Expression profiles (color scale on the bottom) in the human blood cell compendium (black), human splenocytes (gray), and mouse (white) of genes that were previously reported to be differentially expressed between human and mouse and show either the same differential expression pattern in our datasets or a distinct pattern in our datasets and are validated below. Shown are the same data as in Fig. 3A, except that samples are first grouped by cell lineage. (B) Selected genes with different expression patterns between human blood (black) and spleen (gray) and mouse (white) immune cell types that were not previously reported. Shown are mean-centered expression values of genes sorted by mouse lineage with maximal expression level. These data are the same data as in Fig. 3B, except that samples are first grouped by cell lineage. Genes discussed in the text are marked with bold and asterisks.

[Fig. S4 image omitted. Heat maps, Human and Mouse, columns HSPC, GN, MO, DC, B, NK, T; color scale −2 to 2; row groups labeled Conserved, Neofunctionalization, and Divergence. Orthogroups shown: FTL (Ftl1, Ftl2); KLRB1 (Klrb1a, Klrb1c, Klrb1b, Klrb1f); OAS1 (Oas1c, Oas1b, Oas1g, Oas1a); SCD (Scd2, Scd1); MNDA (Ifi204, Ifi205); CLEC4A (Clec4a1, Clec4a3, Clec4a4, Clec4a2, Clec4b2).]

Fig. S4. Conservation, neofunctionalization, and subfunctionalization of gene expression after gene duplication. Shown are mean-centered expression values (red/blue color bar on the right) of selected one-to-many orthogroups in (Left; a single ortholog) human and (Right; multiple paralogs) mouse discussed in the text.

[Fig. S5 image omitted. (A) Histogram; x axis, COE (−1 to 1); y axis, Genes fraction (0 to 0.4); curves, Human–mouse, Human–mouse permutation, DMAP–ImmGen, and DMAP–ImmGen permutation. (B) Heat maps, Human and Mouse, columns grouped as Unstimulated, Early, and Late; color scale −2 to 2.]

Fig. S5. Conserved transcriptional profiles in activated CD4 T cells between human and mouse. (A) Distribution of COE values for CD4 T-cell activation (solid blue) and ImmGen-DMAP (solid red; based on every possible three of seven common lineages) compared to their corresponding null distribution (dotted blue and red) created by random pairings of one-to-one orthologs. (B) Conservation of differentially expressed patterns. Shown are all the significantly differentially expressed human genes (Left) sorted by their expression patterns (schematically showing change over time on the left of the heat maps) and (Right) the expression of their orthologous genes in mouse.
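The COE distributions and permutation nulls compared in Fig. S5A, together with the nonparametric bound on the nonconserved fraction described in the methods, can be sketched in a few lines of Python. This is an illustrative outline rather than the authors' pipeline: the function names, the dictionary-based data layout, the 20-bin histogram, and the default window of −0.8 to 0 are assumptions chosen for the example, and genes with constant expression (zero variance) are not handled.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length value lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def coe_scores(human, mouse, pairs):
    """COE per ortholog pair.

    human, mouse: dict mapping gene -> expression over matched samples.
    pairs: list of (human_gene, mouse_gene) one-to-one orthologs.
    """
    return [pearson(human[h], mouse[m]) for h, m in pairs]


def permuted_pairs(pairs, rng):
    """Random re-pairing of orthologs, used to build a null COE distribution."""
    mouse_genes = [m for _, m in pairs]
    rng.shuffle(mouse_genes)
    return [(h, m) for (h, _), m in zip(pairs, mouse_genes)]


def histogram(values, n_bins=20):
    """Fraction of values in each equal-width bin on [-1, 1]."""
    counts = [0] * n_bins
    for v in values:
        index = min(int((v + 1.0) / 2.0 * n_bins), n_bins - 1)
        counts[index] += 1
    return [c / len(values) for c in counts]


def nonconserved_upper_bound(f_all, f_null, centers, lo=-0.8, hi=0.0):
    """Minimal ratio f_all/f_null over bins with lo < center < hi.

    Because f_all(x) >= (1 - alpha) * f_null(x), this minimum bounds 1 - alpha.
    """
    ratios = [a / n for a, n, c in zip(f_all, f_null, centers) if lo < c < hi and n > 0]
    return min(ratios)
```

In use, `coe_scores` would be called once with the true ortholog pairs and once with `permuted_pairs(pairs, random.Random(seed))`, both results passed through `histogram`, and one minus the value from `nonconserved_upper_bound` read as a lower bound on the conserved fraction.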

Other Supporting Information Files

Dataset S1 (DOCX) Dataset S2 (XLSX) Dataset S3 (XLSX) Dataset S4 (XLSX)
