Supporting Information
Shay et al. 10.1073/pnas.1222738110 (www.pnas.org/cgi/content/short/1222738110)

SI Materials and Methods

Tissue Atlas Dataset. Data of the human and mouse normal tissues were downloaded from http://biogps.org/downloads/, files U133AGNF1B.gcrma.avg_ann and GNF1M_plus_macrophage_small.bioGPS_ann. Those datasets include 32 comparable tissues. Genes were matched by gene symbol to ENSEMBL COMPARA version 63, resulting in 10,777 one-to-one ortholog pairs. When more than one probeset had the same gene symbol, only the probeset with the highest mean expression was used. Data were log2-transformed. There were 5,594 human genes with at least one value above five (after log2 transform) and 6,191 mouse genes with at least one value above nine, including 4,059 one-to-one ortholog pairs that were included in all subsequent analysis. The thresholds five and nine were chosen by examining the bimodal gene expression distribution for all probesets, because they separated the two modes of the distribution.

Conservation of expression (COE) was calculated as the Pearson correlation coefficient between the gene expression values in comparable samples. For calculating correlation between samples, each gene was standardized in each species' dataset separately.

Estimating the Number of Conserved Genes. Assuming that there are two groups of genes, conserved and nonconserved, the nonconserved genes' COE distribution, denoted $F_0$, should be similar to the ImmGen-DMAP permutation distribution (Fig. 2B, black). The COE distribution of conserved genes, denoted $F_{\text{conserved}}$, should be concentrated in the positive correlation region. The observed total distribution of COE values for all genes, denoted $F_{\text{all-genes}}$, is obtained as a mixture of the COE distribution of nonconserved genes $F_0$ (estimated by the permutation-based distribution) and the (unknown) COE distribution of conserved genes $F_{\text{conserved}}$:

$$F_{\text{all-genes}} = \alpha F_{\text{conserved}} + (1 - \alpha) F_0.$$

We next estimated the mixture parameter α, representing the fraction of genes with conserved COE, in two ways.

First, we used a nonparametric method applicable for any distribution. From above, we must have $F_{\text{all-genes}}(x) \geq (1 - \alpha) F_0(x)$ for any x. Thus, we can estimate the maximal possible fraction of the nonconserved genes (1 − α) by calculating the ratio between the ImmGen-DMAP and ImmGen-DMAP permutation distributions, $F_{\text{all-genes}}(x)/F_0(x)$, for any value of x (Fig. 2B, black and dark blue curves, respectively). To avoid errors resulting from inaccurate estimates of $F_{\text{all-genes}}(x)$ and $F_0(x)$ from data that may skew our estimator for 1 − α, we took the minimal ratio $F_{\text{all-genes}}(x)/F_0(x)$ for values of x on the negative region (x < 0) and excluded the left tail (x > −0.8), which gave us a minimal ratio of 1 − α = 0.3 (Fig. 2B, black arrows). Thus, the nonparametric estimator indicates that at least α = 1 − 0.3 = 70% of the genes have conserved COE scores.

Similarly, for the activated T cells, we can estimate the maximal possible fraction of the nonconserved genes (1 − α) from the ratio between the human–mouse and human–mouse permutation distributions for any value of x (Fig. S5A, blue and dashed blue curves, respectively). To avoid errors resulting from inaccurate estimates of $F_{\text{all-genes}}(x)$ and $F_0(x)$ from data that may skew our estimator for 1 − α, we took the left tail, which is most populated (x = −1), and it gave us a minimal ratio of 1 − α = 0.36. Thus, the nonparametric estimator indicates that at least α = 1 − 0.36 = 64% of the genes have conserved COE scores.

Alternatively, we estimated the fraction of genes that are conserved by modeling the COE distribution as a mixture of two Gaussians (Fig. S2D). Fitting a mixture of Gaussians model to the COE data, where the conserved genes' Gaussian mean and SD are known, the conserved genes' Gaussian (black) contained α = 51% of the genes, and the nonconserved genes' Gaussian (green plot) explains 1 − α = 49% of the genes (Fig. S2D). We used the Matlab function MixtureOfGaussiansGivenInit (O.Z.; http://www.broadinstitute.org/~orzuk/matlab/libs/stats/mog/MixtureOfGaussiansGivenInit.m) to estimate the mixture of Gaussians, with mean and SD set as the mean and SD of the ImmGen-DMAP permutation distribution.

Genes with Divergent Expression. We defined genes with divergent expression as having a COE < 0.25 and at least a fourfold change between the maximal lineage mean and the minimal lineage mean in both human and mouse. We used these strict criteria to remove any bad probesets and any genes with only slight differences in expression between lineages, which may be due to other confounders. For the genes highlighted in Known and Previously Undescribed Differences in Individual Genes as divergently expressed (IL15, ETS2, TMEM176A, TMEM176B, SNN, and TLR1), we attempted to rule out the possibility that the differences stem from probeset design issues. For the human genes, the Affymetrix probes match sequences in the genes' 3′ UTRs (except for TMEM176B, whose probes are mapped to several exons and the 3′ UTR), and the probeset GeneAnnot (1) (http://genecards.weizmann.ac.il/geneannot) score is one (best). For the mouse genes, the probes' sequences are mapped to different positions in the gene region.

Activation and Profiling of Human CD4+ T Cells. For stimulation of human cells, CD4+ T cells were isolated from whole blood by negative selection using RosetteSep human CD4+ T-cell enrichment cocktail (Stemcell Technologies) and RosetteSep density medium gradient centrifugation and then stored frozen. On the day of activation, CD4+ T cells from 14 individuals were separately thawed, resuspended in RPMI supplemented with 10% FCS, and plated at 50,000 cells/well in a 96-well plate. Cells were either left untreated as a baseline control or stimulated with bead-bound anti-CD3 and anti-CD28 at a bead-to-cell ratio of 1:1 for 4 or 48 h. At each time point, a second step of purification of CD4+ T cells was performed with magnetic positive selection using the Invitrogen Dynal CD4 Positive Isolation Kit (96-well format) before Trizol extraction of RNA.

Isolated RNA was amplified and prepared for hybridization to the Affymetrix HuGene ST1.0 Array using the GeneChip Whole Transcript (WT) Sense Target Labeling Assay. Raw data were normalized using the Robust Multichip Average algorithm in the "Expression File Creator" module in GenePattern (2). Only 11,229 genes with expression that exceeded 285 (ImmGen threshold setting procedure; http://www.immgen.org/Protocols/ImmGen QC Documentation_ALL-DataGeneration_0612.pdf) in at least two samples were analyzed. A paired t test between the 14 samples from the 14 donors was used to identify differentially expressed genes between each of the two samples of the same donor in consecutive time points, with a false discovery rate of 5% and a median fold change of at least two.

Activation and Profiling of Mouse CD4+ T Cells. For stimulation of mouse cells, CD4+ T cells from pooled spleen and lymph nodes of male C57BL/6foxp3-gfp mice were purified to >90% purity (Dynal CD4 Negative Isolation Kit), and 2 × 10⁵ cells were cultured in 96-well flat-bottom plates in 200 μL RPMI-1640 and 10% FCS together with beads conjugated with anti-CD3 and -CD28 at a cell:bead ratio of 1:2. Cells were harvested at 1, 4, 20, or 48 h; Tconv and Treg were sorted as DAPI−CD45R−CD8a−CD11b/c−CD4+ and GFP−; and the 1- and 4-h or 20- and 48-h lysates were pooled as early or late time points, respectively. Three pools per time point were created. Lysates from sorted cells were processed for profiling on ST1.0 arrays per ImmGen protocol.

Only 11,812 genes with expression that exceeded 155 (ImmGen threshold setting procedure; http://www.immgen.org/Protocols/ImmGen QC Documentation_ALL-DataGeneration_0612.pdf) in at least two samples were analyzed. A t test between the three samples was used to identify differentially expressed genes between each pair of consecutive time points, with a false discovery rate of 10% and a mean fold change of at least two.

Motif Scanning. We scanned promoters of human and mouse genes for enriched motifs. We downloaded promoter sequences for human (hg18) and mouse (mm9) from the University of California, Santa Cruz, Genome Browser website (http://hgdownload.cse.ucsc.edu/downloads.html). For each gene, we scanned the region

Motif Enrichment in Lineage Signatures. For each lineage L and each motif k, we computed the P value for enrichment, pe(L,k), of the motif in the induced and repressed signatures of the lineage compared with a background set comprised of the filtered one-to-one orthologs. An enrichment of a motif in a signature results in higher than expected MAX-LOD scores for the genes in this module; to capture this effect, we computed the P value by comparing the scores MAX-LOD(i,k) for all genes i in the lineage L with the scores for the entire set of filtered one-to-one orthologs by performing a one-sided rank-sum test. We then employed a false discovery rate of 5% on the entire matrix of P values pe(L,k) and declared as significant hits all P values passing this procedure.

For identifying putative divergent regulation based on cis-regulatory elements, we looked only at lineages in which the Pearson correlation coefficient was >0.7 to ensure that most motifs behave similarly. We then fitted a linear model to the P values of enrichment in each lineage in human and mouse using the Matlab function LinearModel.fit, and only motifs that have a standardized residual of more than five were considered.
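
As a rough illustration of the enrichment test described in Motif Enrichment in Lineage Signatures, the following Python sketch computes a one-sided rank-sum P value for one lineage and motif and then applies false discovery rate control to a set of P values. It is not the code used in the study, which was run in Matlab. The function names, the normal approximation to the rank-sum statistic, the choice of the Benjamini-Hochberg procedure for FDR control (the text does not name a procedure), and the toy scores are all assumptions made for illustration.

```python
import math


def rank_sum_greater(sample, background):
    """One-sided rank-sum P value that `sample` scores exceed `background`.

    Assumption: normal approximation without a tie correction to the
    variance; an exact test could be substituted for small samples.
    """
    n1, n2 = len(sample), len(background)
    pooled = sorted([(v, 0) for v in sample] + [(v, 1) for v in background])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        # Extend j over a run of tied values so they share a midrank.
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        midrank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for t in range(i, j + 1):
            ranks[t] = midrank
        i = j + 1
    # Rank sum of the lineage (sample) group.
    w = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability


def benjamini_hochberg(pvalues, fdr=0.05):
    """Flag P values that pass Benjamini-Hochberg control at level `fdr`."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda idx: pvalues[idx])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= fdr * rank / m:
            cutoff_rank = rank  # largest rank meeting the step-up criterion
    significant = [False] * m
    for idx in order[:cutoff_rank]:
        significant[idx] = True
    return significant


# Hypothetical MAX-LOD scores for one motif: lineage genes vs. other genes.
lineage_scores = [5.1, 6.3, 7.2, 6.8, 5.9]
background_scores = [1.0, 2.1, 1.5, 3.2, 2.8, 0.9, 2.2, 1.7, 3.0, 2.5]
p_value = rank_sum_greater(lineage_scores, background_scores)

# Hypothetical flattened matrix of P values over (lineage, motif) pairs.
hits = benjamini_hochberg([0.001, 0.2, 0.03, 0.04, 0.9], fdr=0.05)
```

In the text the lineage genes are compared with the entire set of filtered one-to-one orthologs, which includes the lineage genes themselves; the sketch takes the background as a separate list for simplicity, so the full ortholog set would be passed as `background` to mirror the description more closely.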