Widespread Bidirectional Promoters Are the Major Source of Cryptic Transcripts in Yeast

Vol 457 | 19 February 2009 | doi:10.1038/nature07747

LETTERS

Helen Neil¹, Christophe Malabat¹, Yves d’Aubenton-Carafa², Zhenyu Xu³, Lars M. Steinmetz³ & Alain Jacquier¹

¹Institut Pasteur, Unité de Génétique des Interactions Macromoléculaires, CNRS, URA2171, 75015 Paris, France. ²Centre de Génétique Moléculaire, CNRS, Allée de la Terrasse, 91198 Gif-sur-Yvette, France. ³European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany.

©2009 Macmillan Publishers Limited. All rights reserved.

Pervasive and hidden transcription is widespread in eukaryotes (refs 1–4), but its global level, the mechanisms from which it originates and its functional significance are unclear. Cryptic unstable transcripts (CUTs) were recently described as a principal class of RNA polymerase II transcripts in Saccharomyces cerevisiae (ref. 5). These transcripts are targeted for degradation immediately after synthesis by the action of the Nrd1–exosome–TRAMP complexes (refs 6, 7). Although CUT degradation mechanisms have been analysed in detail, the genome-wide distribution at the nucleotide resolution and the prevalence of CUTs are unknown. Here we report the first high-resolution genomic map of CUTs in yeast, revealing a class of potentially functional CUTs and the intrinsic bidirectional nature of eukaryotic promoters. An RNA fraction highly enriched in CUTs was analysed by a 3′ Long-SAGE (serial analysis of gene expression) approach adapted to deep sequencing. The resulting detailed genomic map of CUTs revealed that they derive from extremely widespread and very well defined transcription units and do not result from unspecific transcriptional noise. Moreover, the transcription of CUTs predominantly arises within nucleosome-free regions, most of which correspond to promoter regions of bona fide genes. Some of the CUTs start upstream from messenger RNAs and overlap their 5′ end. Our study of glycolysis genes, as well as recent results from the literature (refs 8–11), indicate that such concurrent transcription is potentially associated with regulatory mechanisms. Our data reveal numerous new CUTs with such a potential regulatory role. However, most of the identified CUTs corresponded to transcripts divergent from the promoter regions of genes, indicating that they represent by-products of divergent transcription occurring at many and possibly most promoters. Eukaryotic promoter regions are thus intrinsically bidirectional, a fundamental property that escaped previous analyses because in most cases divergent transcription generates short-lived unstable transcripts present at very low steady-state levels.

To gain insight on the possible role of CUTs, we determined their complete genomic organization. Because CUTs are capped and mainly degraded by the nuclear exosome and the TRAMP (Trf4–Air1/Air2–Mtr4 polyadenylation) complexes (ref. 5), an RNA fraction highly enriched for CUTs (CUT fraction) was prepared by tandem affinity purification (TAP; ref. 12; see Methods) from a strain in which (1) Cbp20 (also known as Cbc2), a component of the nuclear cap-binding complex, was TAP-tagged and (2) two components of the nuclear degradation machineries were missing (Rrp6 and Trf4 (also known as Pap2)). Supplementary Fig. 1 shows that the model CUT NEL025c was enriched several thousand fold in this fraction relative to wild-type total RNA. To analyse these transcripts, we used an improved 3′ Long-SAGE technique (ref. 13; adapted to 454 Roche pyro-sequencing) because it allowed the characterization of the highly heterogeneous 3′ ends of the CUTs that result from their Nrd1-dependent transcription termination (refs 6, 7) as well as the distinction between overlapping yet distinct transcripts, which proved critical for the detection of sense CUTs that often overlap mRNA 5′ ends (see below). Two barcoded libraries were constructed, the first one from a wild-type poly(A) RNA fraction as a control and the second one from the CUT fraction (polyadenylated in vitro for 3′ SAGE synthesis), giving rise to 48,118 and 67,022 tags, respectively (see Supplementary Fig. 2 and Methods).

The distribution of the sequenced 3′ SAGE tags along the yeast genome was non-random (P value < 10⁻¹⁶ for both libraries, chi-squared test), but instead highly organized in clusters (see Supplementary Methods for cluster determination and Supplementary Table 3 for a detailed description of the clusters). Analysis of these clusters (Supplementary Fig. 3) indicates that CUTs are distinct from mRNAs yet derive from well-defined transcription units and not from random transcriptional noise.

We defined 1,779 clusters with at least four tags from the CUT fraction (class I, as defined in Supplementary Fig. 3). Most clusters (1,496; ‘CUTs’ in Supplementary Table 3) do not correspond to annotated features (open reading frames (ORFs) and non-coding RNAs (ncRNAs)). Northern blot analyses of a few revealed heterogeneous transcripts with a small average size of between 200 and 500 nucleotides (nt; Fig. 1a) and Nrd1-dependent transcription termination (Supplementary Fig. 4), two characteristics of previously known CUTs (refs 5–7). Among the remaining clusters from this class, 106 could be assigned to non-coding RNA precursors, and 134 clusters, located within intron-containing pre-mRNAs, probably represent degradation products of these transcripts, as suggested from previous studies (refs 14, 15; ‘ncRNAs’ and ‘pre-mRNAs’ in Supplementary Table 3). Forty-three clusters remained unclassified (mostly found within repetitive elements; ‘others’ in Supplementary Table 3). Transcription start sites (TSSs) were precisely mapped for 68 of the CUTs (Supplementary Table 4 combining 5′ RACE data from Supplementary Fig. 5 and RNase H experiment data from Fig. 2, Supplementary Fig. 6 and data not shown). The 5′ ends of these CUTs were usually heterogeneous, with multiple TSSs. Taking the most distal TSS identified as the reference for the 5′ end and the maximum 3′ SAGE density as the reference for the 3′ end positions, it gave a mean size of 258 ± 89 nt (± standard deviation) for the CUTs. In addition, to complete the global visualization of these transcripts, we used the same CUT and poly(A) RNA fractions to perform tiling array hybridizations with a modified procedure adapted from ref. 16 (see Supplementary Methods). These data validated the identification of CUTs performed with the SAGE experiment and confirmed that these transcripts are short. However, technical reasons, specific to the nature of the CUT samples, prevented the use of these data for global statistical analyses (see Supplementary Methods). These data are thus provided for illustration purposes only and can be visualized, together with the SAGE data, at http://www.pasteur.fr/recherche/unites/Gim/genepy/sage.html. Notably, the number of CUTs newly identified was comparable to the number of mRNAs identified from the poly(A) fraction using the same criteria and with an equivalent number of sequences (Supplementary Figs 3 and 7), indicating that CUT transcription units are widespread in the yeast genome, possibly to an extent similar to mRNAs, although their transcription level is probably globally lower.

[Figure 1 graphic not reproduced. Panel a: northern blot, lanes 1–6, with size markers in nt. Panel b: x axis, distance from start of features (nt), from –1,000 to 1,000; y axis, number of clusters, plotted separately for the sense and antisense directions; curves for S CUTs, A CUTs and poly(A). Panel c: number of CUT clusters per type of intergenic region: tandem, 934 (tandem sense (TS), 198; tandem antisense (TA), 736); divergent (D), 449; convergent (C), 103.]

Figure 1 | Genome-wide analyses of 3′ SAGE-tag clusters. a, Northern blot analyses of new CUTs. RNAs from Cbp20–TAP purification (100 ng), performed with the rrp6Δ depl-trf4 strain (LMA587) grown in YPD supplemented with doxycyclin, were separated on a 5% polyacrylamide gel and analysed with different riboprobes: (1) CGR088wTa2, (2) CER176wTa3-D, (3) CLR263wTa3-B, (4) CDR213wTa3-E, (5) CDR164cD2 and (6) CGL235wTa2 (primers are listed in Supplementary Table 2). Marker sizes are in nucleotides. b, Distribution of tag clusters relative to the start of features. The distances between the start of features (start codons for ORFs) and the nearest CUT 3′ SAGE-tag clusters (orange curve) or mRNA 3′ SAGE-tag clusters (blue curve) were computed. The figure shows the smoothed distribution of these distances, on either DNA strand, centred on the start of feature. The orange arrows labelled S CUTs and A CUTs symbolize sense and antisense CUTs, respectively, relative to the downstream feature (blue arrow). Note that the peak of the blue curve in the sense direction corresponds to the ends of tandem upstream mRNAs. c, Classification of CUT clusters according to the type of associated genomic intergenic regions. The genomic organization of CUTs (orange arrows) relative to surrounding features (in blue) is schematized. The number of CUTs in each of the four categories defined is indicated on the right. Note that 10 out of the 1,496 clusters were far from an intergenic region and appear to initiate within genes, in antisense (called ‘internal antisense’ in Supplementary Table 3), and are not present on the scheme.

To study the overall genomic organization of the CUTs, we analysed the distribution of the CUT 3′ SAGE clusters relative to the start and end of annotated genomic features (ORFs and ncRNAs).

[...] the same TSS. It is likely that other CUTs identified here reflect the same type of mechanism suppressing the expression of other genes. However, in most cases that we analysed by RACE or RNase H experiments, TSSs upstream from the mRNA TSS could be specifically detected in the CUT fraction. We thus predict that CUTs sharing their TSS with the mRNA, as exemplified by Nrd1, do not constitute the most frequent class. We and others (refs 8, 11) have recently shown that several genes of the nucleotide biosynthesis pathway (URA2, URA8, IMD2, IMD3, ADE12) are regulated by a previously unknown mechanism in [...]
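As a rough illustration of the positional analysis of CUT 3′ SAGE clusters relative to feature starts, the Python sketch below assigns a tag cluster to the nearest feature start. It reports a signed distance (negative meaning upstream of the feature, in the feature's own orientation) together with a sense or antisense label. The `Feature` and `Cluster` types, the function name `relative_position` and the example coordinates are hypothetical. The procedure actually used for Fig. 1b is described in the Supplementary Methods and may differ.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Feature:
    name: str
    start: int   # coordinate of the feature start (start codon for an ORF)
    strand: str  # "+" or "-"


@dataclass(frozen=True)
class Cluster:
    position: int  # representative coordinate of the tag cluster
    strand: str    # "+" or "-"


def relative_position(
    cluster: Cluster, features: List[Feature]
) -> Optional[Tuple[str, int, str]]:
    """Return (feature name, signed distance, orientation) for the nearest feature start."""
    if not features:
        return None
    # Nearest feature start on either strand (hypothetical choice for this sketch).
    nearest = min(features, key=lambda f: abs(cluster.position - f.start))
    offset = cluster.position - nearest.start
    # Express the distance in the feature's frame: negative values are upstream.
    distance = offset if nearest.strand == "+" else -offset
    orientation = "sense" if cluster.strand == nearest.strand else "antisense"
    return nearest.name, distance, orientation


# Made-up coordinates, for illustration only.
features = [Feature("GENE_A", 1000, "+"), Feature("GENE_B", 5000, "-")]
print(relative_position(Cluster(800, "-"), features))   # ('GENE_A', -200, 'antisense')
print(relative_position(Cluster(5150, "-"), features))  # ('GENE_B', -150, 'sense')
```

In this toy example the first cluster lies 200 nt upstream of GENE_A on the opposite strand, which is the arrangement expected for a transcript divergent from that promoter. Collecting such signed distances over all clusters and smoothing them would give a distribution of the kind shown in Fig. 1b.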

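The cluster definition used in the text (at least four tags from the CUT fraction) can be sketched in Python as a simple grouping of neighbouring tags on the same strand. Only the minimum of four tags comes from the text. The maximum gap of 50 nt between neighbouring tags, the names `cluster_tags`, `max_gap` and `min_tags`, and the example tags are assumptions made for illustration. The authors' own cluster determination is given in the Supplementary Methods.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Tag = Tuple[str, str, int]                 # (chromosome, strand, position)
TagCluster = Tuple[str, str, int, int, int]  # (chromosome, strand, start, end, tag count)


def cluster_tags(tags: List[Tag], max_gap: int = 50, min_tags: int = 4) -> List[TagCluster]:
    """Group same-strand tags separated by at most max_gap nt; keep groups with >= min_tags tags."""
    by_strand: Dict[Tuple[str, str], List[int]] = defaultdict(list)
    for chrom, strand, pos in tags:
        by_strand[(chrom, strand)].append(pos)

    clusters: List[TagCluster] = []
    for (chrom, strand), positions in sorted(by_strand.items()):
        positions.sort()
        start = prev = positions[0]
        count = 1
        for pos in positions[1:]:
            if pos - prev <= max_gap:
                count += 1  # tag extends the current group
            else:
                if count >= min_tags:
                    clusters.append((chrom, strand, start, prev, count))
                start, count = pos, 1  # open a new group
            prev = pos
        if count >= min_tags:
            clusters.append((chrom, strand, start, prev, count))
    return clusters


# Made-up tags, for illustration only.
tags = [
    ("chrI", "+", 100), ("chrI", "+", 110), ("chrI", "+", 130), ("chrI", "+", 160),
    ("chrI", "+", 900),
    ("chrI", "-", 120), ("chrI", "-", 125),
]
print(cluster_tags(tags))  # [('chrI', '+', 100, 160, 4)]
```

Strands are handled separately so that overlapping sense and antisense transcripts are not merged. The isolated tag at position 900 and the two minus-strand tags fall below the four-tag threshold and are discarded.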

Modulation of mRNA and lncRNA Expression Dynamics by the Set2–Rpd3S Pathway

A Computational and Evolutionary Approach to Understanding Cryptic Unstable Transcripts in Yeast

GeneChip® Mouse Tiling 2.0R Array Set Is a Seven-Array Set Designed for Chromatin Immunoprecipitation (ChIP) Experiments

Systematic Evaluation of Variability in ChIP-chip Experiments Using Predefined DNA Targets

Modeling DNA Methylation Tiling Array Data

GeneChip Arabidopsis Tiling 1.0R Array

The Paf1 Complex Broadly Impacts the Transcriptome of Saccharomyces cerevisiae

Statistical Methods for Affymetrix Tiling Array Data

A Two-Stage Approach for Estimating the Effect of DNA Methylation on Differential Expression Using Tiling Array Technology

Model-Based Analysis of Tiling-Arrays for ChIP-chip

A Wave of Nascent Transcription on Activated Human Genes

At-TAX: a Whole Genome Tiling Array Resource for Developmental Expression Analysis and Transcript Identification in Arabidopsis