REVIEWS

Genome-wide transcription and the implications for genomic organization

Philipp Kapranov, Aaron T. Willingham and Thomas R. Gingeras

Abstract | Recent evidence of genome-wide transcription in several species indicates that the amount of transcription that occurs cannot be entirely accounted for by current sets of genome-wide annotations. Evidence indicates that most of both strands of the human genome might be transcribed, implying extensive overlap of transcriptional units and regulatory elements. These observations suggest that genomic architecture is not colinear, but is instead interleaved and modular, and that the same genomic sequences are multifunctional: that is, used for multiple independently regulated transcripts and as regulatory regions. What are the implications and consequences of such an interleaved genomic architecture in terms of increased information content, transcriptional complexity, evolution and disease states?

Affymetrix, Inc., Santa Clara, California 95051, USA. Correspondence to T.R.G. e-mail: [email protected] doi:10.1038/nrg2083

The description of the lac operon in 1961 by Jacob and Monod¹ established a conceptual model of gene organization by which a DNA sequence is neatly split into separate regulatory and protein-coding portions, with the protein-coding portion of a gene preceded by a defined region of DNA that regulates its transcriptional initiation and followed by a functional stretch of DNA that controls its termination. This simple but elegant model has been supported by a wealth of biochemical and genetic data and has consequently become engraved in all thinking regarding genomic organization for nearly every species.

A simplistic extension of this model is that a region in a genome usually has just one function; so, a genome consists of a linear arrangement of different functional elements that are interspersed with non-functional elements. For example, a region of DNA can be either a promoter or an exon, but usually not both. The advent of genome-wide techniques for studying transcription has enabled transcriptome studies on an unprecedented scale (BOX 1). What emerges is that a genomic region can be used for different purposes and that different functional elements can co-locate in the same region in a genome. This observation prompts us to re-evaluate the current dogma, which can be referred to as the ‘colinear’ model, and indicates an alternative model for genomic organization. This ‘interleaved’ model reflects the observation that multiple functional elements can overlap in the same genomic space. Here we discuss recent empirical data supporting this and consider the implications, advantages and challenges of this new model of genomic architecture.

Glossary
Tiling array: A microarray design in which the probes are selected to interrogate a genome with a consistent, pre-determined spacing between each probe.

Emerging genomic architecture
In-depth analyses of the transcriptional outputs of the human, mouse, fly and other genomes from a range of experimental approaches (TABLE 1; BOX 1) suggest that the information content of a genome is complex, and that this complexity manifests itself at two levels. The fraction of a genome that is used as an information carrier is much higher than previously expected, and much of the unannotated transcription, the so-called ‘transcriptional dark matter’², remains to be characterized. Unbiased transcriptome profiling using tiling arrays for ten human chromosomes revealed that 56% of the transcribed base pairs in cytosolic polyadenylated RNA (the cytosol contains the most mature, processed RNAs) do not correspond to annotated exons of protein-coding genes, mRNAs or ESTs³. The complexity of nuclear transcriptomes is much higher: fivefold more transcribed base pairs are detected in nuclear RNA than in cytosolic RNA³, and approximately 80% corresponds to the unannotated portion of the genome³. In total, ~15% of all interrogated base pairs can be detected as RNA molecules (either in the cytosol or in the nucleus) in a single human cell line. This is in contrast to a total of 1–2% of base pairs that correspond to the exons of all the annotated protein-coding genes³. These data strongly indicate that a significant portion of the human genome can be transcribed. Estimates made by the Encyclopedia Of DNA Elements (ENCODE) consortium⁴, a large multidisciplinary and collaborative effort to characterize the regulatory landscape of ~1% of the human genome, suggest that

NATURE REVIEWS | GENETICS VOLUME 8 | JUNE 2007 | 1


this is in fact the case. Depending on which empirical data sets are included in the estimate, as much as 93% of the genomic sequences in the surveyed ENCODE regions seem capable of being transcribed⁵. This estimate is derived from the union of all intronic and exonic sequences detected by several empirical RNA-mapping technologies in multiple biological samples. A surprisingly large number of unannotated transcripts⁶,⁷ or novel isoforms of protein-coding genes⁵ for which primary structures have been elucidated by sequencing do not seem to encode proteins. These transcripts are often referred to as non-coding RNAs (ncRNAs). This is a putative designation, as it is possible that some might in fact encode short proteins or peptides. The term ‘transcript of unknown function’ (TUF)³,⁸ has been proposed by the ENCODE consortium to denote such putative non-coding molecules, thus reserving the label ‘ncRNA’ for those RNAs for which there is some functional evidence.

Box 1 | Technologies for mapping RNA expression

The methods for analysing structure and expression levels of the RNAs described in this Review can be broadly classified into two groups: sequencing-based and hybridization-based approaches.

Sequencing-based approaches. These approaches rely on obtaining direct information about the order of nucleotides in an RNA molecule. They can be further subdivided into methods that involve sequencing of full-length or nearly full-length RNAs, or sequencing of short portions of RNAs, typically derived from the 3′ (SAGE)¹¹³, 5′ (CAGE)⁹⁶ or both termini (PET)⁹⁷ of the corresponding RNAs. Before sequencing, RNAs are converted into cDNAs that can be further processed to generate truncated cDNAs that contain only short sequences or ‘tags’ (typically ~14–22 bases) that represent the sequences from either one of the two termini of the original RNA. Generation of tags significantly increases throughput, which in turn significantly increases depth of coverage. Sequencing-based methods provide the most detailed information about the structure of an RNA molecule, but they have a much lower throughput than hybridization-based methods. Therefore, full-length cDNA sequencing is typically used to catalogue exemplars of different RNA molecules, rather than as a means to comprehensively count the number of molecules in a sample. However, sequencing of short tags of cDNAs, such as CAGE, SAGE and PET tags, has greatly benefited from the increases in throughput and parallelism of sequence-readout methods, and is now used to identify the 5′ and 3′ termini of RNA molecules and estimate their abundance.

Hybridization-based approaches. These methods rely on measuring the magnitude of hybridization of a probe to its target in a complex background, relative to the signal from the background or from control probes. In hybridization-based detection, a probe would detect all molecules that contain regions of complementarity to that probe. If RNA molecules are not separated (using traditional gel-based techniques, that is, northern blots) before hybridization, the net hybridization signal is the sum of the signals from all the different molecules that can hybridize to a probe. Compared with the sequencing-based methods, the main advantages of hybridization-based methods are higher throughput and depth of sampling. Throughput is increased by the absence of the requirement for molecular separation and library construction. The depth of sampling is a key difference between the two types of method: sequencing-based methods provide information about one RNA molecule; although they have benefited from increases in parallelism and throughput, owing to practical limitations, they can provide information about only 10⁵–10⁷ RNA molecules in each sample, equivalent to the RNA content of 1–30 cells. Hybridization-based methods are intrinsically able to interrogate all RNA molecules in a given sample.

Technology innovations. Both types of method have seen technical innovations to increase parallelism and throughput of the technologies. Sequencing of short pieces of nucleic acids has been most susceptible to the increase in parallelism owing to the advent of such technologies as pyrosequencing, massively parallel signature sequencing (MPSS) and others¹¹⁴. The increased parallelism in hybridization-based technologies largely stems from the advent of high-density microarrays, in which large numbers of probes (as many as 10⁵–10⁶ probes for different sequences) can be spotted or synthesized on surfaces as small as a square inch¹¹⁵–¹¹⁷.

Continuing challenges. Most of what is currently known about the sequences of human transcripts is limited to RNA species that are both polyadenylated and long (more than 200 nucleotides). This bias results from technical issues, such as the ease with which polyadenylated RNAs can be purified, and from the conceptual bias that stems from the idea that non-polyadenylated RNAs are unlikely to be functional. Recent studies that use tiling arrays³ together with earlier studies that used association kinetics⁸³,⁸⁴ suggest that the population of non-polyadenylated transcripts is vast, and its complexity exceeds that of the polyadenylated RNAs. Short RNAs represent another heavily underappreciated component of the transcriptome. On the basis of sequencing of libraries of short RNAs¹⁰⁰–¹⁰⁸ and tiling array profiling of short RNA populations⁹, it seems that the short RNA transcriptome is probably at least as complex as that of the long RNAs, if not more so. Compartmentalized RNAs are yet another relatively unexplored domain of the transcriptome. Most genome-wide surveys of RNAs have been limited to RNA populations isolated from whole-cell extracts or the cytosolic RNA fraction. Recent results from tiling array profiling of the nuclear transcriptome indicate that this specific subcellular compartment, which accounts for 15–25% of the total cellular RNA, contains an RNA population that is five times more complex than the cytosolic population; most of these RNAs remain to be sequenced³.

Multifunctional usage of the same genomic space is common. Overlapping transcripts can be produced from the same or opposite strands of DNA. The regions of overlap of transcripts from opposite strands can include the exons that are present in mature RNAs, or be mostly confined to the introns. This is exemplified by the phosphatidylserine decarboxylase (PISD) gene, which has at least nine overlapping independent transcripts in its genic boundary (FIG. 1). Many detected transcripts contain both exonic and intronic portions of PISD, and the 5′ termini of several of these overlapping transcripts are positioned proximal to an empirically determined MYC binding site. Additionally, the same genomic sequences can be shared by both long and short RNAs, suggesting that the function of some of the overlapping coding and non-coding transcripts is to produce short RNAs⁹ (see below).

Overlapping transcripts that are made from the same DNA strand can either be functional isoforms (for example, produced by alternative splicing or processing) or lack apparent common functional characteristics (for example, protein-coding potential), despite sharing the same genomic space.

Antisense transcription. Given the extent of transcriptional overlap, it follows that much of it is antisense to protein-coding loci. FIGURE 1 illustrates this for the PISD gene, which has nearly as many distinct antisense as sense transcripts. Estimates of the extent of overlapping sense–antisense transcription across the whole genome vary; the highest so far comes from sequence analysis of 158,807 full-length mouse cDNA clones¹⁰. In this case, the authors reported antisense transcription for 72% of all transcriptional units, with 18,021 (87%) of protein-coding and 13,401 (58.7%) of non-coding transcriptional units having an antisense transcript. Large-scale sequencing of libraries of short sequence tags near the 3′ ends of RNAs (LongSAGE tags) revealed that 9,804 human mRNAs contain an antisense transcript¹¹. Analysis of a randomly selected subset of transcripts detected using RACE/tiling arrays found that 61% of all human transcribed regions have


a counterpart on the opposite strand³. Furthermore, a recent study showed that sense–antisense pairing is also prevalent in other eukaryotic species, with many sense–antisense pairs being conserved in evolution¹².

Table 1 | Evidence for widespread transcription

Method | Organism | Refs
Abundance of polysome-associated polyA RNA | Homo sapiens | 82
Nucleic acid hybridization reassociation kinetics (Cot curves) | Strongylocentrotus purpuratus (sea urchin) | 83
Nucleic acid hybridization reassociation kinetics (Cot curves) | H. sapiens | 84
Transcription of satellite DNA on chromosomes in oogenesis | Triturus cristatus carnifex (crested newt) | 85
Abundance of 5′-capped RNAs versus those with 3′ polyA | Cricetulus griseus (hamster) | 86
Tiling arrays
Whole genome | Escherichia coli | 87
Whole genome | Arabidopsis thaliana | 88
Whole genome | Drosophila melanogaster | 15,89
Whole genome | H. sapiens | 90
Whole genome | Oryza sativa (rice) | 91
Whole genome | S. purpuratus | 92
Whole genome, high resolution | Saccharomyces cerevisiae | 93
Whole genome, high resolution | H. sapiens | 9
Ten chromosomes, high resolution | H. sapiens | 3
Chromosome 21–22 | H. sapiens | 94
Chromosome 22 | H. sapiens | 95
Sequencing
CAGE tags | H. sapiens and Mus musculus | 7
CAGE tags to identify promoters | H. sapiens and M. musculus | 13,96
PET tags | M. musculus | 97
SAGE tags | H. sapiens | 98
LongSAGE tags | H. sapiens | 11,99
Short RNAs | Caenorhabditis elegans | 100–102
Short RNAs | A. thaliana | 103
Testes-specific short RNAs | H. sapiens, M. musculus and Rattus norvegicus | 104–108
MPSS | H. sapiens | 109
Human Inversion, 41,118 full-length cDNAs | H. sapiens | 6
FANTOM2, 60,770 full-length cDNAs | M. musculus | 110
FANTOM3, 102,281 full-length cDNAs | M. musculus | 7
Chromatin IP
ChIP–chip: p53, Sp1, cMyc | H. sapiens | 23
ChIP–chip: NFκB | H. sapiens | 24
ChIP–chip: RNA Pol II | H. sapiens | 111
PET-sequencing-based: p53 | H. sapiens | 112

CAGE, cap analysis of gene expression; ChIP–chip, chromatin immunoprecipitation and hybridization to DNA microarray; FANTOM, Functional Annotation of Mouse database; MPSS, massively parallel signature sequencing; PET, paired-end ditag; polyA, polyadenylated; SAGE, serial analysis of gene expression.

Overlapping, same-strand transcription. cDNA sequencing has also revealed significant complexity in transcripts that overlap on the same strand (FIG. 1). Analysis of mouse full-length cDNA clones showed that, on average, 7.6 transcripts can be grouped into a single transcriptional unit on the basis of overlapping genomic location and shared orientation⁷. Furthermore, the majority of transcriptional units (65%) are alternatively spliced⁷. Sequencing of more than 12 million short sequence tags that represented the 5′ termini of RNA polymerase II transcripts (CAGE tags) from multiple human and mouse RNA samples revealed a propensity for transcriptional start sites (TSSs) to mark internal exons and 3′ UTRs of genes, suggesting that there are multiple additional internal initiation sites within the loci of known genes¹³.

Long-range interconnected transcription. A further level of genomic architectural complexity is revealed by the recently observed long-range interactions in the genome. Mapping the 5′ ends of 399 genes that lie in the regions chosen by ENCODE has revealed that many use alternative 5′ ends that lie tens and hundreds of kilobases away from the annotated 5′ ends⁵,¹⁴. Often, other genes are located between the body of a gene and the distal, previously unannotated, 5′ ends, suggesting that a primary transcript that is initiated at a distal 5′ end must traverse intervening loci on both strands before reaching the exons to which it is connected in the mature transcript.

Similar results were observed for Drosophila melanogaster, in which distal 5′ ends were found to be on average 20,360 bp from the annotated 5′ ends, and sometimes as far away as 135,000 bp¹⁵. For comparison, the average length of an annotated intron in a D. melanogaster gene is 1,158 bp¹⁵. The biological importance of these distal regulatory elements is underscored by the observation that many of the distal 5′ ends in D. melanogaster transcripts map to the sites of P-element insertions that result in well documented fly phenotypes. Because of the genomic distances involved, the effects of many of these P-element insertions were previously unconnected with the related distal gene loci, despite the often dramatic phenotypes.

As a consequence of such long-range transcription, multiple exons from previously characterized, separate protein-coding genes can be joined, creating novel spliced transcripts. An example of this is provided by transcripts that originate at a distal 5′ end and join previously unannotated exons with those of the caveolin 1 (CAV1) and caveolin 2 (CAV2) genes (FIG. 2). Recent reports show that such transcriptional fusions of neighbouring genes are surprisingly common in the human genome¹⁶–¹⁸. One of the main implications of these findings is that exons that have been considered to be discrete modules of a specific gene, or at least a genomic locus, must now be considered as more general functional modules that can be joined together in multiple RNA molecules.

Although many of the observed fusion transcripts contain appreciable ORFs, it remains to be determined whether such molecules are translated or whether they function in some as yet unknown fashion. For example, distal 5′ ends and transcriptional fusions might allow


[Figure 1 diagram. Tracks: overlapping transcripts 1–5, sense to PISD (including an alternative 5′ end); annotated form of PISD; overlapping transcripts 6–9, antisense to PISD; MYC binding regions. Axis: chromosome 22, positions 30,340,000–30,352,000.]

Figure 1 | Overlapping transcriptional architecture: the PISD example. Five transcripts (in purple) that overlap the RefSeq-annotated form (in blue) of the phosphatidylserine decarboxylase gene (PISD) on the same strand, and four transcripts (in pink) that overlap the gene on the opposite strand are shown. The overlapping transcripts were characterized using a combination of RACE and tiling arrays¹⁶. Binding sites of the transcriptional factor MYC (in green) were determined using a ChIP–chip assay²³. The coordinates are taken from the hs.NCBIv35 version of the genome.
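The sense and antisense grouping used in Figure 1 can be illustrated with a minimal Python sketch. It assumes each transcript is reduced to a single genomic span with a strand, which ignores exon structure; the `Transcript` class, the function names and all coordinates below are hypothetical and are not taken from the PISD data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    """A transcript reduced to one genomic span (hypothetical simplification)."""
    name: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def overlaps(a: Transcript, b: Transcript) -> bool:
    """True if the two closed intervals share at least one base."""
    return a.start <= b.end and b.start <= a.end


def classify(reference: Transcript, other: Transcript) -> str:
    """Label `other` relative to `reference`: 'sense', 'antisense' or 'none'."""
    if not overlaps(reference, other):
        return "none"
    return "sense" if other.strand == reference.strand else "antisense"


def summarize(reference: Transcript, others: list) -> dict:
    """Count how many transcripts fall into each overlap class."""
    counts = {"sense": 0, "antisense": 0, "none": 0}
    for transcript in others:
        counts[classify(reference, transcript)] += 1
    return counts


# Invented coordinates, for illustration only.
annotated = Transcript("annotated", 30_344_000, 30_350_000, "-")
others = [
    Transcript("t1", 30_341_000, 30_345_000, "-"),  # same strand, overlapping
    Transcript("t2", 30_349_000, 30_352_000, "+"),  # opposite strand, overlapping
    Transcript("t3", 30_340_000, 30_342_000, "+"),  # no overlap
]
summary = summarize(annotated, others)
print(summary)
```

A fuller treatment would compare exon-level intervals, so that overlaps confined to introns could be distinguished from those that include exons present in mature RNAs.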

the combinatorial usage of novel promoters and regulatory DNA regions, or provide intronic RNA, which can be subsequently processed into regulatory RNAs (see discussion below). So, genomic proximity is not always required for exons of different genes to be incorporated into the same RNA molecules.

The mechanisms that generate these recently described fusion transcripts are unclear. Candidate mechanisms include processing of long primary transcripts or splicing between different RNA species (trans-splicing). Trans-splicing is thought to be less common than transcriptional gene fusion, although naturally occurring examples of trans-splicing have been documented for several mammalian genes¹⁹,²⁰. In some instances, the degree of trans-splicing can be high. For example, cloning and sequencing of spliced MYC isoforms from a human cell line that overexpresses this gene revealed trans-splicing to 33 different genes on 14 different chromosomes²¹. The level of trans-splicing can also be high during adenovirus infections, indicating that the virus might overwhelm the endogenous trans-splicing regulatory mechanisms²². Overall, although the frequency of trans-splicing to each RNA molecule is low, it is more readily detectable in highly expressed transcripts. As with gene fusions, it is not clear whether these trans-spliced RNA molecules represent functional entities, or whether they are by-products of processes such as transcription of separate genes that lie in close proximity to each other, resulting in occasional trans-splicing between their nascent transcripts¹⁹.

Widespread occurrence of promoter regions. A high degree of overlapping transcription implies the existence of regulatory regions other than the canonical locations at the 5′ ends of annotated genes. Unbiased efforts to map transcription factor binding sites found that only 22% of MYC and SP1 binding sites lie proximal to the 5′ ends of well annotated genes, whereas 36% lie within gene boundaries²³. Another 24% map to intergenic regions. A significant fraction of the sites that were found to lie proximal to the 3′ exons, and that were present in regions that are annotated as internal exons or introns of protein-coding genes, were also seen in the analysis of CAGE tag data¹³. Comparable results were obtained for another transcription factor, NFκB; approximately 28% of NFκB binding sites were found at the 5′ ends of genes and 22% were found more than 50 kb away from known genes²⁴. In addition, the same genomic sequences were observed to host overlapping promoters that regulate divergent transcripts in the mammalian genome. Consistent with these recent observations, 10% of all human genes lie ‘head-to-head’ (that is, they are transcribed in opposite directions, with their promoters being closest to each other), less than 1,000 bp apart, and are regulated by the same genomic sequence²⁵. In many such cases, the sequence elements that regulate the divergent genes are shared²⁵.

The architecture of the eukaryotic transcriptome is clearly much more complex than could have been anticipated in terms of the number of nucleotides that are transcribed and the final arrangements of nucleotides that are present in mature processed RNA molecules. This complexity makes one reconsider the current linear model of genomic organization, and ask what possible advantages such an interleaved genomic organization might offer.

Glossary
SAGE: Serial analysis of gene expression; a technique for mapping the 3′ ends of transcripts.
PET: Paired-end ditag; a method that extracts 36-bp signatures with 18 bp from the 5′ end and another 18 bp from the 3′ end of each cDNA.
Pyrosequencing: A method for DNA sequencing in which the inorganic pyrophosphate (PPi) that is released from a nucleoside triphosphate on DNA chain elongation is detected by a bioluminometric assay.
Massively parallel signature sequencing: A sequencing procedure that allows the reading, in parallel, of short sequence segments of about 17 or 12 nucleotides long, from hundreds of thousands of microbead-attached cDNAs.
LongSAGE: Long serial analysis of gene expression; a method that allows for the cloning of 20-nucleotide SAGE tags.

Implications and advantages
Compactly organized and highly interleaved genomes have typically been associated with viruses and microorganisms, for which genome size is limited by the size of the viral particle or cell²⁶,²⁷. Such constraints are not known to operate on eukaryotic genomes, which are many orders of magnitude larger. Given the potential problems that are presented by use of the same genomic space for multiple purposes in megabase and gigabase genomes,


[Figure 2 diagram. Tracks: known protein-coding genes CAV1 and CAV2; fusion transcript 1; fusion transcript 2. Genomic span ~295 kb. Axis: chromosome 7, positions 115,550,000–115,800,000.]

Figure 2 | Fusion transcripts combining exons of different genes and unannotated regions. Two different transcripts combine novel 5′ exons with selected exons of caveolin 1 (CAV1) and caveolin 2 (CAV2). The exons of the two fusion transcripts (GenBank accession numbers EF179101 and EF179102) and the CAV1 and CAV2 mRNAs are shown as vertical bars. Introns are represented as horizontal lines; slanted lines indicate a gap of ~200 kb, used to simplify the depiction of this genomic region. The coordinates are taken from the hs.NCBIv35 version of the genome.

what are the implications and consequences of such a complex genomic organization for a eukaryotic cell?

Increasing protein-coding transcript diversity. One obvious benefit of sharing DNA sequence among different transcripts is the production of diverse protein species from relatively few protein-coding domains (exons). The most prevalent mechanisms to generate mRNA diversity are alternative splicing, alternative initiation of transcription, alternative polyadenylation, gene fusions and trans-splicing. The first three processes are common in the higher eukaryotes that have been studied⁷,¹³,²⁸,²⁹. Analysis of the available genomic annotations, including ESTs, indicates that at least 40–65% of mammalian protein-coding genes could be alternatively spliced⁷,²⁸,²⁹, with ~70% of splicing events occurring in the coding sequences of mRNA²⁹. In the mouse, 58% of protein-coding transcriptional units use two or more alternative promoters¹³. Because genome annotation databases are likely to miss information about protein-coding genes that are not highly expressed, these estimates are bound to be conservative. An analysis of 399 human protein-coding loci within ENCODE⁴ genomic regions indicated that 90% have either a previously unannotated exon or a new TSS⁵,¹⁴. Taken together, these different mechanisms significantly increase the diversity of both transcripts and proteins by using a modular approach to build novel protein and non-protein transcripts from a collection of distal and even non-linear constituents.

Glossary
RACE/tiling arrays: An unbiased, high-throughput method to identify the extents of DNA products from rapid amplification of cDNA ends (RACE) reactions by hybridizing them to tiling arrays.
CAGE: Cap analysis of gene expression; a technique for mapping the 5′ ends of transcripts.
P element: A member of a family of transposable elements that are widely used as the basis of tools for mutating and manipulating the genome of Drosophila melanogaster.
ChIP–chip: A method that combines chromatin immunoprecipitation with microarray technology to identify in vivo targets of a transcription factor.
MicroRNA: A form of ssRNA, typically 20–25 nucleotides long, that is thought to regulate the expression of other genes, either through inhibiting protein translation or degrading a target mRNA transcript through a process that is similar to RNAi.
snoRNA: A type of small RNA, the functions of which include RNA cleavage and specification of sites of ribose methylation and pseudouridylation.

Using RNA transcripts as regulatory agents. The potential of RNA as a regulatory molecule because of its ability to reversibly bind to virtually any other RNA or DNA molecule by nucleotide complementarity was recognized decades ago¹,³⁰. Since then, a significant number of RNA-based regulatory systems have been characterized, and many classes of RNA regulators have been discovered in species ranging from viruses to mammals⁸,³¹–³⁶. RNA has been implicated in the control of RNA stability, gene expression, tissue and cellular development, RNA modification, chromatin organization, alternative splicing, subcellular localization of proteins, heat shock sensing and other processes⁸,³²–³⁷.

Functional ncRNAs range in size from ~22 bp miRNAs to ~18 kb XIST (X-inactive-specific transcript) and ~108 kb AIR (antisense IGF2R RNA) ncRNAs. Such tremendous size variation, coupled with the growing realization that long ncRNAs might exceed protein-coding mRNAs in number if not functional diversity, highlights the underappreciated importance of ncRNAs in the cell (reviewed in REFS 8,38). Below we discuss the possible advantages of using RNA-based regulatory systems, of the sort envisioned by Mattick³⁸–⁴¹, in the context of the interleaved genomic organization.

A key element in using RNA as a regulatory agent is its full or partial sequence complementarity to the target³²,³⁵,⁴²,⁴³. One way to facilitate this is to transcribe multiple overlapping independent RNAs that contain the same genomic regions but in the context of different transcripts. As such, the intended interactions might be more likely to occur because they involve cis (caused by production of RNAs from the opposite strands of the same DNA sequence) rather than trans (caused by interactions between RNAs that are produced from different regions of the genome) events. The obvious advantage of cis-based RNA signalling is the localization of interacting RNA components, which could also simplify the task of localizing various participating protein components to the site of the RNA–RNA or RNA–DNA interactions (for example, mediation of target RNA transcription and subsequent stability or regulation of its interaction with ribosomes)³³–³⁵. The utilization of RNAs or portions of RNAs as trans-targets is not precluded by such a strategy, and such interactions would be expected to evolve to increase the overall efficiency of such an approach (see FIG. 3 for a hypothetical model).

So far, the known effectors of trans-RNA-based signalling have been mostly limited to RNAs that are <200 nucleotides long, and that can be referred to as short RNAs. Two prominent classes of regulatory short RNAs, microRNAs (miRNAs) and small nucleolar RNAs (snoRNAs), are produced from longer precursor RNA molecules³²,⁴²,⁴⁴,⁴⁵. A sizeable fraction of known miRNAs and almost all snoRNAs are found within annotated genes or non-coding transcriptional units⁴⁵–⁴⁷. In many

NATURE REVIEWS | GENETICS VOLUME 8 | JUNE 2007 | 5


[Figure 3 schematic. Labelled elements: regulatory region 1; protein-coding gene 1; transcripts of unknown function (TUF); regulatory region 2; protein-coding gene 2; miRNA; unannotated short RNA; nucleus; cytosol; mature mRNA 1 and mature mRNA 2 (5′ to 3′); translation.]

Figure 3 | RNA-based signalling pathways. A microRNA (miRNA) is encoded by DNA sequences that lie within an intron (blue) of protein-coding gene 1. Expression of the miRNA is coupled to expression of the ‘host’ gene. The miRNA is processed from a long precursor RNA, which contains the intronic sequences, in the nucleus and then fully matures in the cytosol. The mature miRNA is believed to influence expression and stability of other mRNAs (potentially including that of its own host gene) in trans (indicated by broken arrows in the cytosol), through partial nucleotide complementarity with the 3′ UTR sequences of target mRNAs. A similar strategy is outlined for a hypothetical unannotated short RNA (miRNA or other short RNA) that is encoded by a transcript of unknown function. A large number of previously unannotated stable short RNAs have been discovered in mammals9,103–107, Caenorhabditis elegans100,101 and Arabidopsis thaliana103. By analogy with known short RNAs, other short RNAs probably act by regulating gene expression in cis or trans modes (trans modes are indicated by broken arrows in the nucleus). Novel short RNAs could target regulatory regions or parts of mRNAs. Other short RNAs could potentially be found within the annotated boundaries of known genes, possibly encoded by overlapping transcripts that are regulated by their own promoters (not shown).
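The partial complementarity between a short RNA and the 3′ UTR of a target mRNA, as described in the caption above, can be pictured as a search for a short stretch of the UTR that base-pairs with part of the short RNA. The following is a minimal sketch of that idea, not a target-prediction method: the 7-nucleotide window taken from positions 2–8 of the short RNA, the restriction to Watson–Crick pairs, and both sequences are assumptions made purely for illustration.

```python
# Toy illustration: find UTR positions that are perfectly complementary
# to a short "seed" window of a short RNA. All sequences are made up.

RNA_COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Return the Watson-Crick reverse complement of an RNA string."""
    return "".join(RNA_COMPLEMENT[base] for base in reversed(rna))


def seed_match_sites(short_rna, utr, seed_start=1, seed_len=7):
    """Return 0-based UTR positions where the UTR pairs with the seed.

    seed_start=1 and seed_len=7 select nucleotides 2-8 of the short RNA
    (an assumption for this sketch, not a claim about any real pathway).
    """
    seed = short_rna[seed_start:seed_start + seed_len]
    target = reverse_complement(seed)
    return [
        i
        for i in range(len(utr) - seed_len + 1)
        if utr[i:i + seed_len] == target
    ]


# Hypothetical short RNA and hypothetical 3' UTR containing two sites.
short_rna = "UGAGGUAGUAGGUUGUAUAGUU"
utr = "AAG" + "CUACCUC" + "AUUU" + "CUACCUC" + "GG"

sites = seed_match_sites(short_rna, utr)
print(sites)
```

Because such a short window is enough to define a site, even a toy UTR can contain several candidate positions, which gives some intuition for how one short RNA could have many potential targets.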

cases, these short RNAs are produced from the intronic sequences of these genes or from separate overlapping transcripts that have been cleaved by ribonucleases and associated factors42,44,45. Co-expression analysis of miRNAs and their host genes in humans48 strongly suggested that the first mechanism is likely to be responsible for the production of human regulatory short RNAs. However, a comprehensive analysis of the expression of 127 ncRNAs, including snoRNAs and small nuclear RNAs (snRNAs), in Caenorhabditis elegans showed that the intronic ncRNAs might form two classes: those that are co-regulated with the expression of the genes in which they lie and those that are apparently regulated by their own promoters49.

Once it has been processed from a longer transcript, a mature short RNA can act in cis or trans to regulate other RNA molecules. Two hypothetical scenarios for this mode of regulation, one involving a known miRNA pathway and one including a hypothetical, as yet unannotated, short RNA, are shown in FIG. 3. The potential for global genome regulation by this RNA-mediated signal is significant, as regulatory RNA–RNA interactions could be effected by short stretches of full or partial complementarity. Each miRNA has been estimated to potentially regulate 200 targets on average, and thousands of human genes could be regulated by RNA–RNA duplexes as short as 7 bp50–52. It has also been estimated that as many as 20–30% of human genes are regulated by miRNAs51,53,54 and an equally large number of genes are under selective pressure to avoid complementarity with miRNAs55. The ability to couple the expression of such short RNA regulators to the expression of protein-coding genes by processing them from spliced introns has obvious evolutionary advantages if the RNA regulators and their host genes are involved in the same pathways in a cell, like the snoRNAs and their host genes39,40,45, or if short RNA transcription per se is used as a means of signalling the expression status of the host genes39,40. Additional plasticity in the timing of expression could be achieved by expressing the internal short RNA regulators from independently regulated internal promoters. These aspects make both strategies attractive, even in the context of the same locus, so an overlapping transcriptional architecture would be advantageous.

Although several families of short RNA regulators have been identified, an ever-growing number of individual long ncRNAs have also been found to function in various key cellular processes. One of the most rapidly evolving human genes identified to date, highly accelerated region 1A (HAR1F), is transcribed into an ncRNA that is expressed during brain cortical development56. Long putative ncRNAs have been found as markers for hepatocellular carcinomas57 and cell fate during mammary gland development58. Long ncRNAs have also been directly linked to regulation of transcription factors such as calcium-sensing nuclear factor of activated T-cells (NFAT)59 and the fly homeotic gene Ultrabithorax (Ubx) (through an interaction with the histone methyltransferase ASH1)60.

Using transcription as a regulatory process. Overlapping transcription could also indicate a different type of regulation that does not rely on RNA-mediated signalling, but rather on the act of transcription per se. It has been widely proposed that passage of the RNA pol II complex through a region of DNA might remove the histones, thereby resetting its chromatin structure. Consistent with this model, long-range transcription has been observed at several genomic loci. Many discrete DNA elements controlling the expression of genes that are separated from them by tens or hundreds of kilobases have been identified in eukaryotic genomes61–64. Such long-range elements can enhance or suppress expression of one or more genes. Elements that increase gene expression can be separated into enhancers and locus control regions (LCRs), whereas elements that suppress gene expression can be classed as silencers or elements that control the imprinted status of a locus (imprint control centres). Transcription has a role in the functional activities of each of these long-range elements.

Long-range DNA elements have been found in association with ncRNAs60,65–67. Some of these ncRNAs exceed 100 kb in length and span multiple genes68. Effects of non-coding transcripts on the expression of overlapping and neighbouring genes were directly investigated in four human and mouse loci: β-globin, CD79b-GH, the KCNQ1 cluster and the IGF2R cluster69–72. When DNA elements that cause transcriptional termination were inserted near the promoter elements of these ncRNAs, long-range transcription was abrogated, leading to either loss of imprinting (KCNQ1 and IGF2R clusters) or decreased gene expression for loci that were positively regulated by a distal LCR (GH) or enhancer element (β-globin).

These studies show that ncRNA transcription is a major regulatory mechanism of long-range control elements. However, they do not distinguish between the importance of transcription per se and the production of functional RNA products. The model of transcription as a linear, ‘snow-plow’ regulator is challenged by the observation that, in the case of the two imprinted clusters KCNQ1 and IGF2R, genes that did not directly overlap the ncRNAs were as affected as those that did. Therefore, it has been suggested that the ncRNA transcription could alter the state of other long-range DNA elements, which could subsequently silence the genes in the imprinted clusters that do not directly overlap with the long ncRNAs73.

Co-regulation of different genes by the same DNA elements might be yet another example of the advantages of the interleaved genomic organization. Analysis of human head-to-head genes revealed that expression levels were correlated across many published microarray datasets, with a high degree of statistical significance25. Interestingly, functional analysis of ten randomly selected bidirectional promoters showed that most contain shared elements that regulate expression from both strands, rather than housing two non-overlapping promoters that function in opposite directions25. Additionally, one promoter sequence can regulate more than one downstream gene, as suggested by the prevalence of transcriptional gene fusions in the human genome16–18.

snRNA
A small RNA molecule that functions in the nucleus by guiding the assembly of macromolecular complexes on the target RNA to allow site-specific modifications or processing reactions to occur.

Locus control region
A cis-acting sequence that organizes a gene cluster into an active chromatin block and enhances transcription.

Evolutionary implications
Implications at the macro-scale: global genomic organization. If the interleaved genomic organization is evolutionarily advantageous, one would expect to see conservation of genomic architecture and transcript organization (for example, that the same two transcripts would overlap or that internal promoters would overlap in syntenic exons of orthologous genes in more than one species).

The data to support this come from analysis of the conservation of patterns of overlapping genes and transcripts in prokaryotes and eukaryotes. Perhaps the most unexpected observation comes from the microbial genomes, in which overlapping transcription is relatively common26,27. In these cases, the frequency of overlapping transcripts does not seem to correlate with genome compactness as measured by the distance between the genes or by the proportion of coding sequences in a genome74. Interestingly, microbial genes that overlap tend to have more orthologues in other species and therefore seem to be more evolutionarily conserved74, yet, surprisingly, 70% overlap by less than 15 bp74.

Although the sense–antisense overlap arrangement seems to be less common than the sense–sense overlap in microbial genomes, the first tends to be longer, suggesting a different type of relationship than in the case of genes that overlap on the same strand, perhaps arising from sense–antisense RNA complementarity74. Together, these observations indicate that the overlapping arrangement of genes could have potential regulatory roles beyond merely helping to reduce genome size74.

The patterns of overlapping genes seem to be conserved in mammals. Analysis of ten sequenced animal genomes showed that, of the 3,915 human sense–antisense pairs, at least 313 are preserved in the mouse12; several human sense–antisense pairs are also conserved in evolutionarily distant organisms such as frogs12.

The reasons for conserving such sense–antisense pairing could involve gene regulation. Indeed, expression analysis in 16 human tissues showed that genes in such pairs tend to either be co-expressed or have an inverse correlation in levels of expression75. The same study also showed that sense–antisense pairs that fit these expression patterns tend to be conserved between humans and mice75. Interestingly, a study of a pair of yeast genes, GAL10 and GAL7, revealed a drastic decrease in gene expression caused by converging transcription complexes in vivo76. However, an inverse correlation of


[Figure 4 schematic. Upper panel: 3′ UTR of SFI1 on the (+) strand and 3′ UTR of PISD on the (–) strand, with the PhastCons score track over chromosome 22 coordinates 30,339,050–30,339,250, marking the conserved region of overlap between the 2 genes. Lower panel: SFI1 gene on the (+) strand and PISD gene on the (–) strand, with the PhastCons score track over chromosome 22 coordinates 30,336,000–30,342,000.]

Figure 4 | Conservation of an overlapping region of two genes on opposite strands. Conservation of the overlapping region of spindle assembly associated sfi1 homologue (SFI1) and phosphatidylserine decarboxylase (PISD) is as high as that of the protein-coding exons (represented by wider boxes). The untranslated portions of each mRNA are presented as narrower boxes. The conservation score of each nucleotide (PhastCons score) is represented by Vertebrate Conservation scores on the scale of 0.2–1.0. The annotations and conservation scores were loaded from the UCSC Genome Browser (University of California at Santa Cruz)118,119. The coordinates are taken from the hs.NCBIv35 version of the genome.
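One reason a shared region can be so constrained is that a single substitution is read by every element that uses it. The SFI1–PISD overlap in Figure 4 involves 3′ UTRs, but the effect is easiest to see in the simplest case, two overlapping ORFs of the kind found in bacterial and viral genomes. The sketch below is an illustration under stated assumptions: the 15-nucleotide sequence is invented, both reading frames are assumed to lie on the same strand one nucleotide apart, and the standard genetic code is assumed.

```python
# Toy illustration: one substitution can be synonymous in one reading
# frame and non-synonymous in an overlapping frame. Sequence is made up.

BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # first base T
    "LLLLPPPPHHQQRRRR"  # first base C
    "IIIMTTTTNNKKSSRR"  # first base A
    "VVVVAAAADDEEGGGG"  # first base G
)
# Standard genetic code, built from the conventional TCAG ordering.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna, frame):
    """Translate complete codons of dna starting at offset `frame`."""
    return "".join(
        CODON_TABLE[dna[i:i + 3]] for i in range(frame, len(dna) - 2, 3)
    )


def substitution_effects(dna, pos, new_base, frames=(0, 1)):
    """Classify a single substitution separately in each reading frame."""
    mutant = dna[:pos] + new_base + dna[pos + 1:]
    return {
        frame: "synonymous"
        if translate(dna, frame) == translate(mutant, frame)
        else "non-synonymous"
        for frame in frames
    }


# Hypothetical sequence read in frame 0 and in an overlapping frame 1.
dna = "ATGGCTGGACCATAA"
# Position 5 is the third base of codon GCT in frame 0 and the second
# base of codon CTG in frame 1.
effects = substitution_effects(dna, pos=5, new_base="C")
print(effects)
```

In this toy case the T-to-C change leaves the frame-0 protein unchanged but replaces a leucine with a proline in frame 1, so selection on the second frame alone would be enough to resist the change.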

sense–antisense expression levels is not universal, as several other reports show positive correlation between levels of sense and antisense transcripts10,23,77,78. Thus, use of the same region of DNA as a template for overlapping transcription does not seem to predetermine an expression relationship.

Implications at the micro-scale: nucleotide level. Although interleaved transcriptional architecture occurs at some loci and is conserved between distant species, it represents an added constraint on the ability of a genome to evolve in the region of overlap. It is generally assumed that the rate at which a base changes or a region of DNA accepts insertions or deletions is inversely related to the importance of its biological function. The degree of base-pair conservation is expected to be proportional to the number of functional elements that use it. An example of this is provided by higher conservation scores in the overlapping region of the 3′ UTRs of the human genes spindle assembly associated sfi1 homologue (SFI1) and PISD, which are located on the opposite strands of the genome (FIG. 4).

Sequence conservation is higher at and around the overlapping region compared with the intronic and UTR sequences that do not overlap, and is as high as that of the protein-coding portions of the mRNAs (FIG. 4). Furthermore, a whole-genome analysis of the sites of short- and long-RNA production in human tissue culture cells indicates that nucleotides that are shared by long and short RNAs tend to be more conserved than those that are shared by only long RNAs9. Influencing this is the overall fitness of an organism, which might depend on how essential the function of each overlapping element is. If a region or a base pair has more than one function, all of these might be affected by a single nucleotide change.

Perhaps the most apparent example of this effect is the case of overlapping ORFs, which are common in bacterial and viral genomes. In such cases, a base-pair change affects both ORFs, which explains why the rates of change in the region of overlap tend to be reduced26,27,79. However, the degree to which a nucleotide change affects any given function might be different. For example, a nucleotide substitution could be synonymous in one ORF and non-synonymous in an overlapping ORF26. In cases in which different types of functional elements overlap (for example, an exon and a promoter), a nucleotide substitution could bring about an amino-acid change if it occurs in the first two positions of a codon, and change promoter strength if it occurs in a crucial promoter element.

Implications for the interpretation of nucleotide changes. Mutations at non-annotated genomic sites, such as intronic regions that are distal from splice sites, can affect fitness if they occur at internal promoter regions, in an exon of an overlapping transcript, or in a short RNA. Thus, a phenotype that is associated with


a DNA sequence change could be a sum of the phenotypes caused by the change in all elements that share this sequence. Furthermore, if the overlapping transcripts have non-redundant functions, the phenotypic effect of a sequence change in a shared exon could manifest itself more severely than a change in an exon that is unique to only one such isoform. The magnitude of the phenotypic effect of such a mutation might not only be proportional to a direct biochemical effect of that change in any one element, but also to the combined effects of disruption of numerous overlapping transcripts. Moreover, a mutation that does not seem to affect the function of one element (that is, synonymous substitutions in protein-coding regions) might affect other elements that share that sequence.

A mutation can affect a gene that has been annotated as distal or that is separated from the site of mutation by intervening genes if the mutation occurs at an alternative 5′ end of that gene. For example, a significant fraction of unannotated transcripts in the D. melanogaster transcriptome might be identified as unannotated extensions of known protein-coding genes15. A number of these distal 5′ ends map to the sites of lethal mutations caused by P-element insertions. Genetic complementation crosses between the fly lines that contained these distal 5′ P-element insertions and those that carried transposon insertions in annotated exons of the corresponding genes confirmed the functional relationship between such distal 5′ ends and the genes to which they were connected15. The discovery of many unannotated and distal TSSs suggests that a similar mechanism might also occur in human disease, raising the possibility that many more SNPs could be associated with human disorders. Overall, it is not uncommon for sequence polymorphisms that lie in ‘non-coding’ regions to be associated with certain disease conditions. Interpretation of such SNPs is complicated by the fact that our knowledge of the genome and its organization is far from complete.

The complex phenotypic implications of the interleaved genomic architecture are highlighted by the example of a SNP that is associated with a predisposition to an autoimmune thyroid disease; the SNP lies in an intron of ZFAT, which encodes a zinc-finger protein80. The SNP coincides with the 3′ UTR of a truncated form of ZFAT (TR-ZFAT) and the promoter of an overlapping transcript (SAS-ZFAT), which lies in an antisense orientation to TR-ZFAT. The RNA levels of TR-ZFAT are positively correlated with the SNP, whereas the levels of SAS-ZFAT are negatively correlated. However, the SNP does not directly affect the stability of TR-ZFAT mRNA; instead, the levels of TR-ZFAT RNA seem to be downregulated by the SAS-ZFAT antisense transcript80. Therefore, the SNP probably influences expression of an isoform of the gene that carries the intron in which it resides, by influencing expression of an antisense transcript80.

Evolutionary conservation scores
A quantitative measure of evolutionary relationships derived from comparative analysis of genomic DNA sequences from multiple species.

Conclusions
The continuing pace of the discovery of novel transcripts and regulatory regions strongly suggests that regions of the genome that are considered ‘non-coding’ or ‘non-functional’ might be alternatively regarded as ‘currently unannotated’. However, the functional annotation of a genome will not be limited to charting the function of the unannotated genomic space; it will also provide new functions for those regions that are already annotated. The final organization of transcribed nucleotides into mature RNAs represents another level of complexity that has been significantly underappreciated until now. The purposes of interleaved transcription could include combinatorial usage of DNA regulatory regions, increasing protein diversity, clearing DNA of existing chromatin marks, facilitating regulation of gene expression through RNA–RNA or RNA–DNA interactions, and creating long RNAs that either function themselves as long ncRNAs or serve as the progenitors of short RNAs.

An interleaved genomic organization poses important mechanistic challenges for the cell. One involves the steric issues that stem from using the same DNA molecules for multiple functions. The overlap of functionally important sequence motifs must be resolved in time and space for this organization to work properly. Another challenge is the need to compartmentalize RNA or mask RNAs that could potentially form long double-stranded regions, to prevent RNA–RNA interactions that could prompt apoptosis. Despite these challenges, the existence of this interleaved genomic organization seems to have clear evolutionary advantages. Perhaps the clearest indication of this comes from microbial genomes, in which overlapping gene organization is fairly common but does not correlate with genomic compactness. The functional annotations of the human and other eukaryotic genomes are far from complete; however, initial observations suggest that an interleaved organization is recapitulated in these more complex systems.

A concerted effort to create a catalogue of a diverse variety of functional elements has been initiated by the National Human Genome Research Institute (NHGRI), USA, under the auspices of the ENCODE project4. This catalogue includes maps of sites of transcription, TSSs, DNase hypersensitive genomic regions, promoters, transcription regulatory elements, origins of replication and other elements, derived from multiple human samples that were selected from diverse developmental origins or from a time course of differentiation. The availability of various maps of different functional elements, combined with evolutionary conservation scores for individual base pairs81 (made possible by the availability of several genome sequences), should provide fascinating insights into the interrelationship between the rate of evolutionary change and interleaved genomic organization.

Similar projects to catalogue functional regions in model organisms such as D. melanogaster and C. elegans have been proposed, to be followed by efforts to decipher the biological roles and importance of the newly identified functional elements. Therefore, the scientific community is now faced with a unique opportunity to consider and organize multidisciplinary programmes to achieve these aims.


1. Jacob, F. & Monod, J. Genetic regulatory mechanisms in the synthesis of proteins. J. Mol. Biol. 3, 318–356 (1961).
A seminal work on the regulation of gene expression; the first to suggest that RNA could have a role.
2. Johnson, J. M., Edwards, S., Shoemaker, D. & Schadt, E. E. Dark matter in the genome: evidence of widespread transcription detected by microarray tiling experiments. Trends Genet. 21, 93–102 (2005).
3. Cheng, J. et al. Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution. Science 308, 1149–1154 (2005).
4. ENCODE Project Consortium. The ENCODE (ENCyclopedia Of DNA Elements) project. Science 306, 636–640 (2004).
5. ENCODE Project Consortium. The ENCODE pilot project: identification and analysis of functional elements in 1% of the human genome. Nature (in the press).
6. Imanishi, T. et al. Integrative annotation of 21,037 human genes validated by full-length cDNA clones. PLoS Biol. 2, e162 (2004).
7. Carninci, P. et al. The transcriptional landscape of the mammalian genome. Science 309, 1559–1563 (2005).
This reference provides an unparalleled insight into the complexity of the mouse transcriptome on the basis of sequencing of full-length cDNAs and cDNA tags.
8. Willingham, A. T. & Gingeras, T. R. TUF love for ‘junk’ DNA. Cell 125, 1215–1220 (2006).
9. Kapranov, P. et al. Genome-wide RNA maps reveal interlaced transcript architecture, new classes of RNAs and possible function for pervasive transcription. Science (in the press).
10. Katayama, S. et al. Antisense transcription in the mammalian transcriptome. Science 309, 1564–1566 (2005).
11. Ge, X., Wu, Q., Jung, Y. C., Chen, J. & Wang, S. M. A large quantity of novel human antisense transcripts detected by LongSAGE. Bioinformatics 22, 2475–2479 (2006).
12. Zhang, Y., Liu, X. S., Liu, Q. R. & Wei, L. Genome-wide in silico identification and analysis of cis natural antisense transcripts (cis-NATs) in ten species. Nucleic Acids Res. 34, 3465–3475 (2006).
13. Carninci, P. et al. Genome-wide analysis of mammalian promoter architecture and evolution. Nature Genet. 38, 626–635 (2006).
14. Denoeud, F. et al. Prominent use of distal 5′ transcription start sites and discovery of a large number of additional exons in ENCODE regions (in the press).
15. Manak, J. R. et al. Biological function of unannotated transcription during the early development of Drosophila melanogaster. Nature Genet. 38, 1151–1158 (2006).
16. Kapranov, P. et al. Examples of the complex architecture of the human transcriptome revealed by RACE and high-density tiling arrays. Genome Res. 15, 987–997 (2005).
17. Parra, G. et al. Tandem chimerism as a means to increase protein complexity in the human genome. Genome Res. 16, 37–44 (2006).
18. Akiva, P. et al. Transcription-mediated gene fusion in the human genome. Genome Res. 16, 30–36 (2006).
References 14–18 were the first studies to detail developmental and tissue- or cell-type-specific regulatory regions that are distal from the genes they regulate, often utilizing promoters and exons from upstream genes to form chimeric versions of well annotated protein-coding transcripts.
19. Horiuchi, T. & Aigaki, T. Alternative trans-splicing: a novel mode of pre-mRNA processing. Biol. Cell 98, 135–140 (2006).
20. Finta, C., Warner, S. C. & Zaphiropoulos, P. G. Intergenic mRNAs. Minor gene products or tools of diversity? Histol. Histopathol. 17, 677–682 (2002).
21. Chen, C. et al. High frequency trans-splicing in a cell line producing spliced and polyadenylated RNA
24. Martone, R. et al. Distribution of NF-κB-binding sites across human chromosome 22. Proc. Natl Acad. Sci. USA 100, 12247–12252 (2003).
References 23 and 24 represent the first reports of the unbiased profiling of transcription factor binding sites and provide the first comprehensive evidence for the utilization of promoters in non-canonical genomic locations.
25. Trinklein, N. D. et al. An abundance of bidirectional promoters in the human genome. Genome Res. 14, 62–66 (2004).
26. Krakauer, D. C. Stability and evolution of overlapping genes. Evolution 54, 731–739 (2000).
27. Shcherbakov, D. V. & Garber, M. B. Overlapping genes in bacterial and bacteriophage genomes. Mol. Biol. (Mosk) 34, 572–583 (2000) (in Russian).
28. Sharov, A. A., Dudekula, D. B. & Ko, M. S. Genome-wide assembly and analysis of alternative transcripts in mouse. Genome Res. 15, 748–754 (2005).
29. Zavolan, M. et al. Impact of alternative initiation, splicing, and termination on the diversity of the mRNA transcripts encoded by the mouse transcriptome. Genome Res. 13, 1290–1300 (2003).
30. Britten, R. J. & Davidson, E. H. Gene regulation for higher cells: a theory. Science 165, 349–357 (1969).
31. Gupta, A., Gartner, J. J., Sethupathy, P., Hatzigeorgiou, A. G. & Fraser, N. W. Anti-apoptotic function of a microRNA encoded by the HSV-1 latency-associated transcript. Nature 442, 82–85 (2006).
32. Zamore, P. D. & Haley, B. Ribo-gnome: the big world of small RNAs. Science 309, 1519–1524 (2005).
33. Mattick, J. S. & Makunin, I. V. Small regulatory RNAs in mammals. Hum. Mol. Genet. 14, R121–R132 (2005).
34. Mattick, J. S. & Makunin, I. V. Non-coding RNA. Hum. Mol. Genet. 15, R17–R29 (2006).
35. Storz, G., Altuvia, S. & Wassarman, K. M. An abundance of RNA regulators. Annu. Rev. Biochem. 74, 199–217 (2005).
36. Goodrich, J. A. & Kugel, J. F. Non-coding-RNA regulators of RNA polymerase II transcription. Nature Rev. Mol. Cell Biol. 7, 612–616 (2006).
37. Prasanth, K. V. & Spector, D. L. Eukaryotic regulatory RNAs: an answer to the ‘genome complexity’ conundrum. Genes Dev. 21, 11–42 (2007).
A comprehensive review of ncRNAs.
38. Mattick, J. S. RNA regulation: a new genetics? Nature Rev. Genet. 5, 316–323 (2004).
39. Mattick, J. S. Introns: evolution and function. Curr. Opin. Genet. Dev. 4, 823–831 (1994).
40. Mattick, J. S. Non-coding RNAs: the architects of eukaryotic complexity. EMBO Rep. 2, 986–991 (2001).
41. Mattick, J. S. Challenging the dogma: the hidden layer of non-protein-coding RNAs in complex organisms. BioEssays 25, 930–939 (2003).
References 38–41 review the concept of RNA as a carrier of information in the cell.
42. Kim, V. N. & Nam, J. W. Genomics of microRNA. Trends Genet. 22, 165–173 (2006).
43. Kiss, T. Small nucleolar RNAs: an abundant group of noncoding RNAs with diverse cellular functions. Cell 109, 145–148 (2002).
44. Bartel, D. P. MicroRNAs: genomics, biogenesis, mechanism, and function. Cell 116, 281–297 (2004).
45. Filipowicz, W. & Pogacic, V. Biogenesis of small nucleolar ribonucleoproteins. Curr. Opin. Cell Biol. 14, 319–327 (2002).
46. Huang, Z. P. et al. Genome-wide analyses of two families of snoRNA genes from Drosophila melanogaster, demonstrating the extensive utilization of introns for coding of snoRNAs. RNA 11, 1303–1316 (2005).
47. Rodriguez, A., Griffiths-Jones, S., Ashurst, J. L. & Bradley, A. Identification of mammalian microRNA host genes and transcription units. Genome Res. 14, 1902–1910 (2004).
48. Baskerville, S. & Bartel, D. P. Microarray profiling of microRNAs reveals frequent coexpression with neighboring miRNAs and host genes.
52. Lewis, B. P., Shih, I. H., Jones-Rhoades, M. W., Bartel, D. P. & Burge, C. B. Prediction of mammalian microRNA targets. Cell 115, 787–798 (2003).
53. Xie, X. et al. Systematic discovery of regulatory motifs in human promoters and 3′ UTRs by comparison of several mammals. Nature 434, 338–345 (2005).
54. Lim, L. P. et al. Microarray analysis shows that some microRNAs downregulate large numbers of target mRNAs. Nature 433, 769–773 (2005).
55. Farh, K. K. et al. The widespread impact of mammalian MicroRNAs on mRNA repression and evolution. Science 310, 1817–1821 (2005).
56. Pollard, K. S. et al. An RNA gene expressed during cortical development evolved rapidly in humans. Nature 443, 167–172 (2006).
57. Lin, R., Maeda, S., Liu, C., Karin, M. & Edgington, T. S. A large noncoding RNA is a marker for murine hepatocellular carcinomas and a spectrum of human carcinomas. Oncogene 26, 851–858 (2006).
58. Ginger, M. R. et al. A noncoding RNA is a potential marker of cell fate during mammary gland development. Proc. Natl Acad. Sci. USA 103, 5781–5786 (2006).
59. Willingham, A. T. et al. A strategy for probing the function of noncoding RNAs finds a repressor of NFAT. Science 309, 1570–1573 (2005).
An example of the use of high-throughput technologies to elucidate the function of human ncRNAs.
60. Sanchez-Elsner, T., Gou, D., Kremmer, E. & Sauer, F. Noncoding RNAs of trithorax response elements recruit Drosophila ASH1 to Ultrabithorax. Science 311, 1118–1123 (2006).
61. Dean, A. On a chromosome far, far away: LCRs and gene expression. Trends Genet. 22, 38–45 (2006).
62. Li, Q., Peterson, K. R., Fang, X. & Stamatoyannopoulos, G. Locus control regions. Blood 100, 3077–3086 (2002).
63. Lewis, A. & Reik, W. How imprinting centres work. Cytogenet. Genome Res. 113, 81–89 (2006).
64. Zuniga, A. Globalisation reaches gene regulation: the case for vertebrate limb development. Curr. Opin. Genet. Dev. 15, 403–409 (2005).
65. Ling, J., Baibakov, B., Pi, W., Emerson, B. M. & Tuan, D. The HS2 enhancer of the β-globin locus control region initiates synthesis of non-coding, polyadenylated RNAs independent of a cis-linked globin promoter. J. Mol. Biol. 350, 883–896 (2005).
66. Masternak, K., Peyraud, N., Krawczyk, M., Barras, E. & Reith, W. Chromatin remodeling and extragenic transcription at the MHC class II locus control region. Nature Immunol. 4, 132–137 (2003).
67. Ashe, H. L., Monks, J., Wijgerde, M., Fraser, P. & Proudfoot, N. J. Intergenic transcription and transinduction of the human β-globin locus. Genes Dev. 11, 2494–2509 (1997).
68. O’Neill, M. J. The influence of non-coding RNAs on allele-specific gene expression in mammals. Hum. Mol. Genet. 14, R113–R120 (2005).
69. Sleutels, F., Zwart, R. & Barlow, D. P. The non-coding Air RNA is required for silencing autosomal imprinted genes. Nature 415, 810–813 (2002).
70. Mancini-Dinardo, D., Steele, S. J., Levorse, J. M., Ingram, R. S. & Tilghman, S. M. Elongation of the Kcnq1ot1 transcript is required for genomic imprinting of neighboring genes. Genes Dev. 20, 1268–1282 (2006).
71. Ho, Y., Elefant, F., Liebhaber, S. A. & Cooke, N. E. Locus control region transcription plays an active role in long-range gene activation. Mol. Cell 23, 365–375 (2006).
72. Ling, J. et al. HS2 enhancer function is blocked by a transcriptional terminator inserted between the enhancer and the promoter. J. Biol. Chem. 279, 51704–51713 (2004).
References 69–72 show that long-range transcription is required for gene activation and silencing.
73. Pauler, F. M. & Barlow, D. P. Imprinting mechanisms – it only takes two. Genes Dev. 20, 1203–1206
RNA 11, (2006). polymerase I transcripts from an rDNA–myc chimeric 241–247 (2005). 74. Johnson, Z. I. & Chisholm, S. W. Properties of gene. Nucleic Acids Res. 33, 2332–2342 (2005). 49. He, H. et al. Profiling Caenorhabditis elegans non- overlapping genes are conserved across microbial 22. Kikumori, T., Cote, G. J. & Gagel, R. F. Naturally coding RNA expression with a combined microarray. genomes. Genome Res. 14, 2268–2272 (2004). occurring heterologous trans-splicing of adenovirus Nucleic Acids Res. 34, 2976–2983 (2006). 75. Chen, J., Sun, M., Hurst, L. D., Carmichael, G. G. & RNA with host cellular transcripts during infection. 50. Krek, A. et al. Combinatorial microRNA target Rowley, J. D. Genome-wide analysis of coordinate FEBS Lett. 522, 41–46 (2002). predictions. Nature Genet. 37, 495–500 (2005). expression and evolution of human cis-encoded 23. Cawley, S. et al. Unbiased mapping of transcription 51. Lewis, B. P., Burge, C. B. & Bartel, D. P. Conserved sense–antisense transcripts. Trends Genet. 21, factor binding sites along human chromosomes 21 seed pairing, often flanked by adenosines, indicates 326–329 (2005). and 22 points to widespread regulation of noncoding that thousands of human genes are microRNA targets. 76. Prescott, E. M. & Proudfoot, N. J. Transcriptional RNAs. Cell 116, 499–509 (2004). Cell 120, 15–20 (2005). collision between convergent genes in budding yeast.



Proc. Natl Acad. Sci. USA 99, 8796–8801 (2002).
77. Jen, C. H., Michalopoulos, I., Westhead, D. R. & Meyer, P. Natural antisense transcripts with coding capacity in Arabidopsis may have a regulatory role that is not linked to double-stranded RNA degradation. Genome Biol. 6, R51 (2005).
78. Moorwood, K. et al. Antisense WT1 transcription parallels sense mRNA and protein expression in fetal kidney and can elevate protein levels in vitro. J. Pathol. 185, 352–359 (1998).
79. Miyata, T. & Yasunaga, T. Evolution of overlapping genes. Nature 272, 532–535 (1978).
80. Shirasawa, S. et al. SNPs in the promoter of a B cell-specific antisense transcript, SAS-ZFAT, determine susceptibility to autoimmune thyroid disease. Hum. Mol. Genet. 13, 2221–2231 (2004).
An example of an intronic SNP that causes predisposition to a disease by influencing the levels of an antisense transcript.
81. Siepel, A. et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034–1050 (2005).
82. Milcarek, C., Price, R. & Penman, S. The metabolism of a poly(A) minus mRNA fraction in HeLa cells. Cell 3, 1–10 (1974).
83. Hough, B. R., Smith, M. J., Britten, R. J. & Davidson, E. H. Sequence complexity of heterogeneous nuclear RNA in sea urchin embryos. Cell 5, 291–299 (1975).
84. Holland, C. A., Mayrand, S. & Pederson, T. Sequence complexity of nuclear and messenger RNA in HeLa cells. J. Mol. Biol. 138, 755–778 (1980).
85. Varley, J. M., Macgregor, H. C. & Erba, H. P. Satellite DNA is transcribed on lampbrush chromosomes. Nature 283, 686–688 (1980).
86. Salditt-Georgieff, M., Harpold, M. M., Wilson, M. C. & Darnell, J. E. Jr. Large heterogeneous nuclear ribonucleic acid has three times as many 5′ caps as polyadenylic acid segments, and most caps do not enter polyribosomes. Mol. Cell. Biol. 1, 179–187 (1981).
References 82–86 provide the first indications that a large fraction of the eukaryotic genome is transcribed, and that non-polyadenylated RNA is prevalent.
87. Selinger, D. W. et al. RNA expression analysis using a 30 base pair resolution Escherichia coli genome array. Nature Biotechnol. 18, 1262–1268 (2000).
88. Yamada, K. et al. Empirical analysis of transcriptional activity in the Arabidopsis genome. Science 302, 842–846 (2003).
89. Stolc, V. et al. A gene expression map for the euchromatic genome of Drosophila melanogaster. Science 306, 655–660 (2004).
90. Bertone, P. et al. Global identification of human transcribed sequences with genome tiling arrays. Science 306, 2242–2246 (2004).
91. Li, L. et al. Genome-wide transcription analyses in rice using tiling microarrays. Nature Genet. 38, 124–129 (2006).
92. Samanta, M. P. et al. The transcriptome of the sea urchin embryo. Science 314, 960–962 (2006).
93. David, L. et al. A high-resolution map of transcription in the yeast genome. Proc. Natl Acad. Sci. USA 103, 5320–5325 (2006).
94. Kapranov, P. et al. Large-scale transcriptional activity in chromosomes 21 and 22. Science 296, 916–919 (2002).
The first unbiased high-resolution microarray-based study of the genomics era, showing that the transcriptional complexity of human cytosolic polyadenylated RNA is up to an order of magnitude greater than can be explained by exons of known genes.
95. Rinn, J. L. et al. The transcriptional activity of human Chromosome 22. Genes Dev. 17, 529–540 (2003).
96. Shiraki, T. et al. Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proc. Natl Acad. Sci. USA 100, 15776–15781 (2003).
97. Ng, P. et al. Gene identification signature (GIS) analysis for transcriptome characterization and genome annotation. Nature Methods 2, 105–111 (2005).
98. Chen, J. et al. Identifying novel transcripts and novel genes in the human genome by using novel SAGE tags. Proc. Natl Acad. Sci. USA 99, 12257–12262 (2002).
99. Saha, S. et al. Using the transcriptome to annotate the genome. Nature Biotechnol. 20, 508–512 (2002).
100. Ambros, V., Lee, R. C., Lavanway, A., Williams, P. T. & Jewell, D. MicroRNAs and other tiny endogenous RNAs in C. elegans. Curr. Biol. 13, 807–818 (2003).
101. Deng, W. et al. Organization of the Caenorhabditis elegans small non-coding transcriptome: genomic features, biogenesis, and expression. Genome Res. 16, 20–29 (2006).
102. Ruby, J. G. et al. Large-scale sequencing reveals 21U-RNAs and additional microRNAs and endogenous siRNAs in C. elegans. Cell 127, 1193–1207 (2006).
103. Lu, C. et al. Elucidation of the small RNA component of the transcriptome. Science 309, 1567–1569 (2005).
104. Aravin, A. et al. A novel class of small RNAs bind to MILI protein in mouse testes. Nature 442, 203–207 (2006).
105. Girard, A., Sachidanandam, R., Hannon, G. J. & Carmell, M. A. A germline-specific class of small RNAs binds mammalian Piwi proteins. Nature 442, 199–202 (2006).
106. Grivna, S. T., Beyret, E., Wang, Z. & Lin, H. A novel class of small RNAs in mouse spermatogenic cells. Genes Dev. 20, 1709–1714 (2006).
107. Lau, N. C. et al. Characterization of the piRNA complex from rat testes. Science 313, 363–367 (2006).
108. Watanabe, T. et al. Identification and characterization of two novel classes of small RNAs in the mouse germline: retrotransposon-derived siRNAs in oocytes and germline small RNAs in testes. Genes Dev. 20, 1732–1743 (2006).
109. Jongeneel, C. V. et al. An atlas of human gene expression from massively parallel signature sequencing (MPSS). Genome Res. 15, 1007–1014 (2005).
110. Okazaki, Y. et al. Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs. Nature 420, 563–573 (2002).
111. Kim, T. H. et al. A high-resolution map of active promoters in the human genome. Nature 436, 876–880 (2005).
112. Wei, C. L. et al. A global map of p53 transcription-factor binding sites in the human genome. Cell 124, 207–219 (2006).
113. Velculescu, V. E., Zhang, L., Vogelstein, B. & Kinzler, K. W. Serial analysis of gene expression. Science 270, 484–487 (1995).
114. Metzker, M. L. Emerging technologies in DNA sequencing. Genome Res. 15, 1767–1776 (2005).
115. Elvidge, G. Microarray expression technology: from start to finish. Pharmacogenomics 7, 123–134 (2006).
116. Kapranov, P., Sementchenko, V. I. & Gingeras, T. R. Beyond expression profiling: next generation uses of high density oligonucleotide arrays. Brief. Funct. Genomic. Proteomic. 2, 47–56 (2003).
117. Mockler, T. C. et al. Applications of DNA tiling arrays for whole-genome analysis. Genomics 85, 1–15 (2005).
118. Karolchik, D. et al. The UCSC Genome Browser Database. Nucleic Acids Res. 31, 51–54 (2003).
119. Kent, W. J. et al. The human genome browser at UCSC. Genome Res. 12, 996–1006 (2002).

Acknowledgements
We apologize to the authors whose primary work has not been cited due to space constraints. Some of the work described in this Review has been funded in part with Federal Funds from the US National Cancer Institute and from the US National Human Genome Research Institute, and by Affymetrix. The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products or organizations imply endorsement by the US Government.

Competing interests statement
The authors declare competing financial interests: see web version for details.

DATABASES
The following terms in this article are linked online to:
Entrez Gene: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene
CAV1 | CAV2 | GAL7 | GAL10 | HAR1A | MYC | PISD | SFI1 | UBX | XIST | ZFAT
UniProtKB: http://ca.expasy.org/sprot
MYC | NFAT | NFκB | SP1

FURTHER INFORMATION
RNAdb database of non-coding RNAs: http://research.imb.uq.edu.au/rnadb
UCSC Genome Browser: http://www.genome.ucsc.edu
ENCODE: http://www.genome.ucsc.edu/ENCODE
Access to this links box is available online.


ONLINE ONLY

Author biographies
Philipp Kapranov did his graduate work in the Michigan State University (MSU)–Department of Energy (DOE) Plant Research Laboratory, USA, on isolation and characterization of RNAs that are specifically induced during legume–Rhizobia symbiotic nitrogen fixation, and obtained his Ph.D. from Michigan State University. He was a postdoctoral associate at Affymetrix, Inc. investigating the complexity of the human transcriptome and functions of non-coding RNAs. He is currently a senior scientist at Affylabs in Affymetrix, Inc. continuing this work.

Aarron T. Willingham obtained his Ph.D. in biology at the University of California at San Diego, USA, studying the molecular basis of Drosophila melanogaster touch and hearing. As a postdoctoral researcher at the Scripps Research Institute, California, USA, he combined high-throughput cell-based screening and siRNA technologies to investigate the function of human non-coding RNAs. He is currently a scientist at Affylabs, where he is combining these functional studies with microarray-based discoveries of novel ncRNAs.

Thomas R. Gingeras received his Ph.D. from New York University, USA, in biology and was a postdoctoral fellow at Cold Spring Harbor Laboratory, USA, in the laboratory of Richard J. Roberts. He is Vice President of Biological Sciences at Affymetrix, Inc. His current research focuses on the organization and architecture of eukaryotic genomes by mapping and characterizing the sites of transcription and regulation of RNA expression on genome-wide scales.

ToC blurb
Genome-wide analyses of transcriptional output in eukaryotes have revealed an unanticipated transcriptome complexity. These findings imply a complex, interleaved genomic organization, in which individual sequences carry multiple and overlapping informational content. The authors discuss the evidence for, and functional and evolutionary consequences of, this organization.

Online summary
• In-depth analyses of the transcriptional outputs of eukaryotic genomes suggest that the information content of a genome is complex, and that this complexity manifests itself at two levels: the fraction of the genome that is devoted to encoding functional elements is higher than expected, and multiple functional elements can exist in a single region.
• The architecture of the eukaryotic transcriptome is clearly much more complex than could have been anticipated in terms of the number of nucleotides that are transcribed and the final arrangements of nucleotides that are present in mature processed RNA molecules.
• The complexity of genomic organization suggests that the currently accepted model, by which each region of DNA carries a single discrete function, must be re-evaluated, and an interleaved model for the arrangement of functional elements is more likely to represent the informational content of eukaryotic genomes.
• Despite the potential problems that are presented by use of the same genomic space for multiple purposes, the following advantages are brought by this complex genomic organization: an increase in protein-coding transcript diversity; a widespread adoption of RNA transcripts as regulatory agents; and a reliance on transcription as a regulatory process.
• On a global level, an interleaved genomic organization of functional elements seems to be preserved in different kingdoms, and the arrangement of specific overlapping functional elements is preserved among different species. This suggests that such a model does indeed provide advantages throughout evolution.
• Mutations at non-canonical sites, such as intronic regions that lie distal from splice sites, can affect fitness if they involve internal promoter regions, an exon of an overlapping transcript or a short RNA.

Online Links
RNAdb database of non-coding RNAs: http://research.imb.uq.edu.au/rnadb
UCSC Genome Browser: http://www.genome.ucsc.edu
ENCODE: http://www.genome.ucsc.edu/ENCODE

Entrez
CAV1: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=full_report&list_uids=857
CAV2: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=full_report&list_uids=858
GAL7: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=full_report&list_uids=852306
GAL10: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=full_report&list_uids=852307
HAR1A: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=full_report&list_uids=768096
MYC: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=full_report&list_uids=4609
PISD: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=full_report&list_uids=23761
SFI1: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=full_report&list_uids=9814
UBX: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=full_report&list_uids=42034
XIST: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=full_report&list_uids=7503
ZFAT: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=full_report&list_uids=57623

UniProtKB
MYC: http://www.expasy.org/uniprot/P01106
NFAT: http://www.expasy.org/uniprot/O95644
NFκB: http://www.expasy.org/uniprot/Q7LBY6
SP1: http://www.expasy.org/uniprot/P08047


Competing Financial Interests
Philipp Kapranov, Aarron T. Willingham & Thomas R. Gingeras. Genome-wide transcription and the implications for genomic organization. Nature Reviews Genetics 8, XXX–XXX (2007); doi:XXXXX

The authors are employees of Affymetrix, Inc.
