Supporting Information

Lee et al. 10.1073/pnas.1309293111

SI Materials and Methods

Genome Sequence and Annotation. We obtained the mouse genome sequence via the BSgenome.Mmusculus.UCSC.mm9 package in BioConductor (1). We downloaded the corresponding genome annotation coordinates directly from www.genome.ucsc.edu (version mm9).

Information Content of Locus Expression Signatures. To assess how much information about downstream transcriptional regulation was contained in a given signature, without the need to specify a particular regulatory mechanism, we summed the squares of the t-values $t_{gm}$ corresponding to the regression coefficients $\beta_{gm}$:

$$\chi^2_m = \sum_g t_{gm}^2 .$$

To determine the statistical significance of the $\chi^2$ statistic, we constructed a null distribution as follows. We performed 100 independent permutations by randomizing the expression levels across genes for each tumor. We did not permute insertion loci because we wanted to preserve the correlation structure between insertion loci. For the randomized data sets, we performed multiple linear regression to calculate the t-value and corresponding $\chi^2$ statistic for each locus with the same method as for the actual data set. The mean $\chi^2$ statistic averaged over the 100 randomized data sets for each locus is shown in Fig. S1C as a purple bar.

To investigate the fraction of the variance in expression levels of each tumor that is accounted for by our locus expression signatures, we performed multiple linear regression of the mRNA expression levels for each held-out tumor on the locus expression signature (LES) matrix constructed using all other tumors and calculated coefficients of determination ($R^2$). We used either (i) the 13 insertion loci that occur in at least 10 tumors or (ii) the 87 insertion loci that occur in at least 3 tumors. For comparison, we also performed multiple linear regression using only the 25% most variable genes. To construct a null distribution, we used 100 independent random permutations of all genes for each tumor.

Permutation to Calculate False Discovery Rate. To calculate false discovery rates (FDR) in each analysis, we performed permutations of locus expression signatures across genes for each locus. Then, we applied the same procedure as used in each analysis to calculate statistics such as the t-value or P value for the randomized data sets. The FDR corresponding to a given P value threshold was computed as the ratio of the number of associations with a P value below the threshold, averaged over 1,000 randomized data sets, to the number of associations with a P value below the threshold for the real data set. For the t-value, we instead counted the associations whose absolute t-value exceeded a given t-value threshold.

Forward Selection of Gene Ontology Categories. For each gene ontology (GO) category, we applied the Wilcoxon–Mann–Whitney (WMW) test to detect differences in distribution between the locus expression signature values of genes within the GO category and those of the other genes. At each step, we subtracted the mean signature value of the genes in the gene set with the lowest P value from all genes in that gene set. The P values were then recalculated, and the procedure was repeated until even the most significantly regulated gene group had $P > 10^{-5}$, which corresponds to an FDR < 0.1%. Statistical significance was determined by performing independent random permutation of the signature values for each gene. The FDR corresponding to a given P value threshold was computed as the ratio of the number of GO categories with a P value below the threshold, averaged over 50 randomized data sets, to the number of GO categories with a P value below the threshold for the real data set. A 1% FDR based on the empirical permutation test corresponds to a WMW test $P < 10^{-4}$.

Low-Complexity Sequence Features (Fig. S4). To eliminate the potential confounding contribution from low-complexity sequence features to LESs, we calculated the frequency of each base and the CpG dinucleotide across the transcribed region for each gene. Next, we computed the residuals from a multiple linear regression of each LES on these five frequencies (without an intercept).

We calculated DNA base composition and CpG content for 1-kb windows up to 200 kb upstream or downstream. The base composition indicator variable $N^{\mathrm{up}}_{gbi}$ for base b, gene g, and distance i from the transcription start site (TSS) was defined as follows:

$$N^{\mathrm{up}}_{gbi} = \begin{cases} 1 & \text{if base at locus } i \text{ is } b \\ 0 & \text{otherwise.} \end{cases}$$

The base composition $N^{\mathrm{down}}_{gbi}$ for the downstream sequence is calculated with the same procedure using the downstream sequence. The CpG content $N^{\mathrm{up}}_{g,\mathrm{CpG},i}$ and $N^{\mathrm{down}}_{g,\mathrm{CpG},i}$ for upstream and downstream sequences was calculated using the same procedure.

To calculate the base composition of the upstream sequence for each window, $N^{\mathrm{window,up}}_{gbw}$, we divided the 200-kb upstream sequence into 1-kb window intervals. For each such window, we calculated the upstream base composition as follows:

$$N^{\mathrm{window,up}}_{gbw} = \sum_{i \in w} N^{\mathrm{up}}_{gbi} .$$

Here, w represents the wth window of the upstream sequence for gene g. We also calculated the downstream base composition $N^{\mathrm{window,down}}_{gbw}$ and the CpG content $N^{\mathrm{window,up}}_{g,\mathrm{CpG},w}$ and $N^{\mathrm{window,down}}_{g,\mathrm{CpG},w}$ using the same procedure. We measured coefficients of determination ($R^2$) by regressing the LESs on all low-complexity sequence features for window w (without an intercept). We used the residuals of this model fit in further transcription factor (TF)-locus association analyses.

TF Binding Affinity Profiles. We used the convert2psam utility from the REDUCE Suite version 2.0 software package (www.bussemakerlab.org) to convert each position weight matrix (PWM) from JASPAR to a position-specific affinity matrix (PSAM) (2); pseudocounts equal to 1 were added to the PWM at each position, and the resulting base counts were divided by that of the most frequent base at each position to get an estimate for the relative affinity associated with each point mutation away from the optimal binding sequence. The resulting PSAM collection was used to compute a weighted promoter affinity for each gene. All putative individual binding sites in the genomic region from 200 kb upstream to 200 kb downstream of the TSS of each gene with a predicted relative affinity of at least 0.1 were identified and scored using the AffinityProfile utility in the REDUCE Suite.

Inferring Length Scale Parameters. For each choice of the regulatory scale parameter $\lambda$ in the range from 1,000 to 100,000 base pairs, we obtained a total weighted upstream affinity by summing the affinity of all upstream binding sites using a weight $\exp(-d/\lambda)$, where d is the (absolute) distance of a given binding site from the TSS. Then, we computed TF-specific and locus-specific parameters $\lambda^{\mathrm{up}}_{\phi m}$ that maximized the correlation coefficient between the total weighted upstream affinity and each LES, resulting in an optimized total weighted affinity. An analogous procedure was performed for the downstream sequence. The sum of the upstream and downstream total weighted affinities was used for mapping the locus-TF network and the drug-TF-locus network.

Myc Validation of Result. We downloaded gene expression profiles obtained by ref. 3 for transgenic mice that conditionally express the human MYC cDNA in T-cell lymphocytes (GEO accession number GSE10200). In this transgenic mouse, doxycycline treatment suppresses MYC expression. We used the two most extreme doxycycline concentrations of 0 and 20 ng/mL. To obtain an estimate for the differential expression level in response to inactivation of Myc, we subtracted the treatment/reference log2-ratio at 0 ng/mL from that at 20 ng/mL. These values served as the dependent variable in the regression on TF affinity profiles.

Mapping Drug-Locus Associations. Genome-wide mRNA expression data for cultured human cells treated with bioactive small molecules were downloaded from the Connectivity Map website (www.broadinstitute.org/cmap/). This collection contains 7,056 expression profiles for 1,309 distinct compounds. The experiments were carried out on two different Affymetrix GeneChip designs (HG-U133A and HT-HG_U133A) and in four different cell lines (the breast cancer epithelial cell line MCF7, the pros- [...] probes mapping to the same mouse RefSeq ID, resulting in 9,757 genes shared between both data sets.

To obtain robust results, we filtered out noninformative genes using two criteria. First, only mouse genes showing a high variance across tumors (upper 50th percentile) were retained. Second, we deleted human genes whose expression was detected in neither treatment nor control. Next, we calculated averages of gene expression levels across profiles for the same drug in different cell types, resulting in 1,309 drug signatures. Genome-wide linear regression of each of these on the locus expression signatures was performed. To determine the statistical significance of each putative drug-locus association, we performed 100 random permutations of drug signatures and repeated the analysis. A 1% FDR corresponded to a regression coefficient whose t-value has an absolute value >7.

Statistical significance for TF-drug associations was also determined by performing 100 independent random permutations of drug response profiles, resulting in a 5% FDR for family-level and individual PSAMs at $P < 3.0 \times 10^{-3}$ and $P < 6.3 \times 10^{-5}$, respectively. We adopted the same statistical significance criterion for the drug-locus associations and TF-locus associations as in previous analyses.

Human Mutation Expression Signatures. The acute myeloid leukemia data set was downloaded from The Cancer Genome Atlas data portal (https://tcga-data.nci.nih.gov/tcga/tcgaHome2.jsp). We downloaded level 3 gene expression levels (Affymetrix HG-U133 platform) and level 2 somatic mutation data for 197 acute myeloid leukemia tumor samples and found that both data types were available for 194 tumor samples.
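The permutation-based FDR described under Permutation to Calculate False Discovery Rate can be illustrated with a minimal sketch. The function name `empirical_fdr`, the toy P values, the handling of an empty real hit list, and the cap at 1.0 are all assumptions made for illustration and are not taken from the original analysis.

```python
def empirical_fdr(real_pvalues, null_pvalue_sets, threshold):
    """Estimate the FDR at a P value threshold from permuted data sets.

    FDR = (mean number of null associations with P < threshold)
          / (number of real associations with P < threshold)
    """
    real_hits = sum(1 for p in real_pvalues if p < threshold)
    if real_hits == 0:
        # Assumption: report 0.0 when nothing passes in the real data.
        return 0.0
    null_hits = [sum(1 for p in pvals if p < threshold)
                 for pvals in null_pvalue_sets]
    mean_null_hits = sum(null_hits) / len(null_hits)
    # Assumption: cap at 1.0 so the estimate stays a valid proportion.
    return min(1.0, mean_null_hits / real_hits)


# Toy data (hypothetical): one real data set and two randomized data sets.
real = [0.001, 0.002, 0.2, 0.5]
nulls = [[0.001, 0.3, 0.4, 0.9], [0.2, 0.3, 0.6, 0.8]]
fdr = empirical_fdr(real, nulls, threshold=0.01)
```

For a t-value threshold, the same ratio would apply with the comparison replaced by a test on the absolute t-value exceeding the threshold.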
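The PWM-to-PSAM conversion described under TF Binding Affinity Profiles (add a pseudocount of 1, then divide by the count of the most frequent base at each position) can be sketched as follows. This is not the convert2psam implementation; the function names and the toy count matrix are hypothetical, and scoring a site as the product of per-position relative affinities is an assumption about how a PSAM is typically applied.

```python
BASES = "ACGT"


def pwm_to_psam(count_columns, pseudocount=1.0):
    """Convert per-position base counts into relative affinities.

    count_columns: list of dicts mapping each base to its count.
    Each column is scaled so the most frequent base has affinity 1.0.
    """
    psam = []
    for column in count_columns:
        adjusted = {b: column[b] + pseudocount for b in BASES}
        top = max(adjusted.values())
        psam.append({b: adjusted[b] / top for b in BASES})
    return psam


def relative_affinity(psam, site):
    """Assumption: site affinity is the product of per-position values."""
    affinity = 1.0
    for column, base in zip(psam, site):
        affinity *= column[base]
    return affinity


# Toy two-position count matrix (hypothetical values).
counts = [
    {"A": 9, "C": 0, "G": 0, "T": 1},
    {"A": 0, "C": 4, "G": 4, "T": 0},
]
psam = pwm_to_psam(counts)
optimal = relative_affinity(psam, "AC")
mutant = relative_affinity(psam, "TA")
```

Under this sketch the optimal sequence scores 1.0 by construction, so a cutoff such as the relative affinity of at least 0.1 mentioned above is expressed relative to the best possible site.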
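The distance weighting described under Inferring Length Scale Parameters can be sketched as a grid search over candidate length scales. The helper names, the candidate grid, and the toy binding sites are hypothetical; choosing the scale by the signed Pearson correlation is an assumption, since the text only states that the correlation coefficient was maximized.

```python
import math


def weighted_affinity(sites, lam):
    """Sum site affinities weighted by exp(-|d| / lam).

    sites: list of (distance_from_tss_bp, affinity) pairs.
    """
    return sum(a * math.exp(-abs(d) / lam) for d, a in sites)


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def best_length_scale(sites_per_gene, les, lambdas):
    """Return (lam, r) for the candidate with the highest correlation."""
    best = None
    for lam in lambdas:
        totals = [weighted_affinity(s, lam) for s in sites_per_gene]
        r = pearson(totals, les)
        if best is None or r > best[1]:
            best = (lam, r)
    return best


# Toy example (hypothetical): one binding site per gene, and a synthetic
# signature generated with a true length scale of 10,000 bp.
sites_per_gene = [[(500, 1.0)], [(5000, 1.0)], [(50000, 1.0)], [(20000, 1.0)]]
les = [weighted_affinity(s, 10000.0) for s in sites_per_gene]
best_lam, best_r = best_length_scale(sites_per_gene, les,
                                     [1000.0, 10000.0, 100000.0])
```

In this toy setup the search recovers the length scale used to generate the signature; with real data, the same loop would be run separately for each TF and locus and for the upstream and downstream sequence.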
