
Genes, pseudogenes, and Alu sequence organization across human chromosomes 21 and 22

Chingfer Chen†, Andrew J. Gentles†, Jerzy Jurka‡, and Samuel Karlin†§

†Department of Mathematics, Stanford University, Stanford, CA 94305-2125; and ‡Genetic Information Research Institute, 2081 Landings Drive, Mountain View, CA 94043

Contributed by Samuel Karlin, December 21, 2001

Human chromosomes 21 and 22 (mainly the q-arms) were the first complete parts of the human genome released. Our analysis of genes, pseudogenes (Ψg), and Alu repeats across these chromosomes includes the following findings: the number of gene structures containing untranslated exons exceeds 25%; the terminal exon tends to be the largest among exons, whereas the initial intron tends to be the largest among introns; single-exon gene length is approximately the mean gene exon number times the mean internal exon length; processed Ψg lengths are on average approximately the same as single-exon gene length; and the G+C content and length of genes are uncorrelated. The counts and distribution of genes, Ψg, and Alu sequences and G+C variation are evaluated with respect to clusters and overdispersions. Other assessments concern comparisons of intergenic lengths, properties of Ψg sequences, and correlations between Alu and Ψg sequences.

Abbreviations: Chr21, chromosome 21; Chr22, chromosome 22; UTR, untranslated region; CDS, coding sequence structure; Ψg, pseudogenes; Fgc, G+C frequency; UTE, untranslated exon.

§To whom reprint requests should be addressed. E-mail: [email protected].

¶We assume that for genes where the coding sequence annotation agrees exactly with the complete gene structure annotation, no UTEs are present. The main results are unchanged even if this is not always correct; they would then represent lower bounds on the occurrence of UTEs.

The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked “advertisement” in accordance with 18 U.S.C. §1734 solely to indicate this fact.

Two "drafts" of the human genome have now been released: a public version (Human Genome Project) and the Celera version (1, 2). The first completely sequenced parts of the human genome included the euchromatic portions (q-arms) of chromosomes 21 and 22 (Chr21 and Chr22, respectively). A total of 34.55 Mb (about 97%) of Chr22q was sequenced in 12 contigs, and 33.6 Mb of Chr21q was sequenced in four contigs (3, 4). Neither p-arm of Chr21 and Chr22, mainly heterochromatin, was completely sequenced. The gene annotation available for Chr22 (as of March 6, 2001) is of two kinds: (i) complete gene structures specifying all exons and introns plus 5′ and 3′ untranslated regions (UTRs), and (ii) coding sequence structures (CDSs) restricted to exon regions translated into proteins and intervening introns. No CDS annotation is available for Chr21.

In this article we examine, among other things, the distribution of genes, pseudogenes (Ψg), repeats (mainly Alu elements), and G+C frequency (Fgc) variation. Comparisons, contrasts, and analysis of Chr21 and Chr22 will center on the following assessments: (i) correlations and associations of genes, Ψg, Alu counts, and Fgc variables; (ii) gene 5′ and 3′ intergenic lengths (see later text for precise definitions); (iii) numbers, lengths, and distribution of single-exon (intronless) genes; (iv) the distribution of genes with different exon numbers; (v) comparisons of intergenic lengths for consecutive pairs of genes with (−,−) orientations, (+,+) orientations, (−,+) divergent orientations, and (+,−) convergent orientations; (vi) the relative distribution of Alu and Ψg sequences in intergenic regions vs. introns; (vii) conspicuous genes (e.g., ribosomal protein genes) among Ψg sequences; (viii) the distribution of Ψg sequences associated with processed or small genes versus multiexon genes; (ix) the statistics of exons that are transcribed but not translated; and (x) to what extent genes, Ψg, and Alu sequences are clustered or overdispersed in Chr21 and Chr22.

There are at least three data annotations covering Chr21 and Chr22. They include the original Riken gene catalog of Chr21 (4), the Sanger Centre database of Chr22 (3), the University of California Santa Cruz (Golden Path) collection for Chr21 and Chr22, and REFSEQ, maintained by the National Center for Biotechnology Information, derived and extended from Golden Path. The sequence assemblies are virtually the same for each source. The known human genes, with recognized names, are in excellent (but not perfect) agreement across the data sets. However, there are many differences in annotation with respect to ORFs, predicted genes, matching spliced expressed sequence tags, and alternative splicings (5, 6). Our analysis concentrates on the Riken and Sanger Centre data but it appears to be consistent overall with the other data sets.

Chromosomal Counts of Genes, Ψg, and Alus

The Riken annotation of Chr21 (33.6 Mb) reports 214 complete gene structures, 53 Ψg, and 12,168 Alu elements (as of Jan. 16, 2001). On Chr22q (34.5 Mb), the Sanger annotation reports 552 genes, 145 Ψg, and 21,993 Alu elements. Thus for the same approximate euchromatin extent, Chr22 has more than twice as many gene structures as Chr21, almost twice as many Alu sequences, and 3-fold more Ψg, consistent with the greater overall Fgc of Chr22 (48%) compared with Chr21 (42%) (3, 4). Chromosomes with more genes have more accessible genomic DNA with respect to Ψg and Alu sequences, partly because of more transcriptional activity, so a key determinant in these counts is the greater gene density and greater G+C content in Chr22 versus Chr21. Along these lines, among human chromosomes Chr19 has the highest G+C content (overall 49%), the highest gene density, the highest CpG dinucleotide bias, and more CpG islands, and next in these contexts is Chr22 (1, 7). In Chr21, the aggregate length of intergenic regions is 24,851 kb and the aggregate intron length is 8,241 kb, a ratio of about 3:1. For Chr22 the corresponding ratio is 20,611 kb to 11,758 kb, about 2:1. These data are based on the gene structure annotation and exclude the Ig gene segments. Chr22 contains 118 λ-Ig gene segments (variable V segments). Five consecutive Ψg of the Ig κ-V region at about locations 1329337–1359121 of Chr22q are included. Excluding these Ig gene segments, in Chr22 the mean number of exons per gene is 7.1 (median 5.5). The mode is 98 genes attained for single-exon genes. Chr21 has mean exon number 8.5 (median 6) and the mode occurs for genes of three exons, with 39 such genes (see Fig. 1).

Numbers of Genes Containing Untranslated Exons (UTEs)

A total of 453 of the complete gene structures have their coding region specified in the CDS data set. Of these, 333 genes (73.5%) have no 5′ UTEs, 84 have a single 5′ UTE, 21 have two, seven have three, four have four, three have five, and one has eight.¶ A total of 403 (89%) genes have no 3′ UTEs, 36 have one, eight have two, three have three, two have five, and one has eight. These statistics are impressive for the proportion of genes (at least 25%) that possess UTEs. It is not known what kinds of controls these UTEs

PNAS | March 5, 2002 | vol. 99 | no. 5 | 2930–2935 | www.pnas.org/cgi/doi/10.1073/pnas.052692099

Fig. 1. The first three graphs indicate the number of genes in Chr21 and Chr22 with different numbers of exons. The last graph shows the number of genes with counts for 3′ and 5′ UTEs in Chr22 (there is no corresponding data set for Chr21).

portend. Some possibilities are: UTEs probably play a role in regulating export of mRNA from the nucleus, and 5′ UTEs with connecting introns participate in translation initiation; 3′ UTEs also may assist in mRNA stability and with polyadenylation linkers. 5′ UTEs putatively contribute to regulating alternative splicing and translation efficiency (8). It has been established in Drosophila that the 3′ UTR plays a functional role in cytoplasmic localizations of mRNA transcripts (9, 10). There are also examples of sequential processing activities governed by 5′ alternative promoters [e.g., ultrabithorax (11)]. In human, the protein coding sectors of G protein-coupled receptors are predominantly intronless but at least 18% of the underlying genes contain 5′ UTEs (12, 13). Sosinsky et al. (13) proffer an excellent discussion of olfactory G protein-coupled receptors with intronless coding regions that possess introns in their 5′ UTRs. These genes seem to involve retropositions, at least in their early evolutionary stages, and alternative splicing events using separate acceptor or donor splice sites of the same exon.

Do genes with greater numbers of exons and extended protein coding sequences tend to have more flanking UTEs? A correlation calculation yields no significant correlation between gene exon numbers and UTE counts and lengths.

What kinds of genes contain many UTEs (5′ and/or 3′)? Table 1 lists some examples of genes of Chr22 with five or more UTEs at either the 5′ or the 3′ end.

A gene in possession of one or more 5′ UTEs does not necessarily involve 3′ UTEs. A direct calculation shows that the flanking UTR exon counts are basically uncorrelated: correlation (5′ UTE, 3′

Table 1. Genes of Chr22 with five or more UTEs

Locus            No. of exons  No. of 5′ UTEs  5′ UTL, bp  No. of 3′ UTEs  3′ UTL, bp  Description

Genes with 5 or more 5′ UTEs
AC006285.5       9             5               714         0               3088        Homo sapiens MIL1 protein mRNA
GGT1             17            5               668         0               1713        Gamma-glutamyltransferase 1
HMG2L1           12            5               565         0               2155        High mobility group protein 2-like 1
LZTR1            21            8               860         0               1713        Leucine-zipper-like transcriptional regulator 1

Genes with 5 or more 3′ UTEs
DJ319F24.C22.1   11            0               0           5               981         Matches expressed sequence tag sequences
DJ671014.C22.2   13            2               275         5               661         Homo sapiens gamma-parvin mRNA
DJ402G11.C22.6   16            1               247         8               1423        Matches expressed sequence tag cluster

UTL, untranslated exon length.

UTE) = 0.006; correlation (5′ untranslated exon length, 3′ untranslated exon length) = 0.10.

Correlations of Genes, Ψg, Alu Counts, and Fgc Variables

We traversed Chr21 and Chr22 and compared the counts of genes, Ψg, Alu sequences, and the average Fgc in 25-kb, 50-kb, and 100-kb sliding windows with 5-kb displacements. The correlations between these variables are displayed in Table 2. The correlations are largely consistent with the familiar facts that gene density increases with Fgc (e.g., ref. 14), and that Alu sequences are predominantly G+C rich (15). Interestingly, the correlations increase with window size, probably as a consequence of the statistical law of large numbers. Explicitly, in Chr21, correlation (gene, Fgc: window size, W = 25 kb) = 0.32, correlation (gene, Fgc: W = 50 kb) = 0.43, and correlation (gene, Fgc: W = 100 kb) = 0.54. A corresponding pattern prevails in Chr22.

Apparently, because gene and Alu counts correlate positively with G+C levels, they correlate positively with each other. However, a manifest contrast between Chr21 and Chr22 is that Alu counts and Fgc values are positively correlated in Chr21 but uncorrelated in Chr22. Possible reasons are that there could be different target sites or sources for the Alu distributions in the two chromosomes, or that the Alu samples may differ sharply in their age composition and base composition. In both chromosomes, we also observe that Ψg locations are uncorrelated with gene locations. This finding could signify that Ψg sequences are generated randomly throughout the human genome and randomly inserted into the genome mostly by reverse transcription.

Table 2. Correlations among counts of genes, Ψg, Alu sequences, and Fgc

                         Chr21                     Chr22
Window size  Variable    Ψg     Alu    Fgc         Ψg     Alu    Fgc
25 kb        g           0.02   0.15   0.32        0.04   0.13   0.26
             Ψg                 0.09   −0.02              0.01   −0.10
             Alu                       0.31                      0.03
50 kb        g           0.05   0.23   0.43        0.09   0.22   0.29
             Ψg                 0.13   −0.03              0.01   −0.14
             Alu                       0.34                      0.08
100 kb       g           0.05   0.33   0.54        0.16   0.30   0.33
             Ψg                 0.16   −0.04              0.01   −0.18
             Alu                       0.37                      0.13

Comparison of Intergenic Lengths

For Chr21, we concentrated on intergenic regions that do not cross the three unsequenced gaps, also removing overlapping gene groups and excluding intergenic regions exceeding 1 Mb as outliers. A corresponding scheme was applied to study the intergenic regions of the largest five contigs in Chr22 (these contain 491 genes).

The 5′ extension of a gene is defined as the intergenic region extending from the 5′ end of the gene proceeding upstream to the next gene, which can be in either orientation (see Table 3). The 3′ extension refers to the intergenic region extending from the 3′ end of the gene proceeding downstream to the next gene. There are 190 consecutive pairs of genes in Chr21, which we divide into four groups (Table 4). There are 51 intergenic lengths for (−,−) gene pairs, where both genes share a negative orientation relative to the reported sequence. The median intergenic length is 35,568 bp. The group with (−,+) orientation comprises 48 pairs of genes, also called divergent pairs. In such an orientation, the promoter sequences of the two genes are roughly adjacent. The median intergenic length here is 73,116 bp. For (+,−) gene pairs (convergent pairs), there are 47 gene pairs with a common downstream intergenic separation of median length 22,077 bp. There are a total of 44 pairs of (+,+) genes with median intergenic length 28,905 bp. The median intergenic lengths, 35,568 bp for (−,−) and 28,905 bp for (+,+) gene pairs, differ by about 6,700 bp, which is within statistical fluctuation. The fact that divergent gene pairs show the greatest intergenic separation makes sense because the common intergenic region lies upstream of both genes and therefore holds more regulatory sequences, including the promoter and enhancer sequences of both genes. The convergent gene pairs generally have small intergenic separations. For Chr22, the corresponding results parallel those of Chr21.

Table 4. Comparisons of intergenic lengths

Region                                            Chr21 median, bp   Chr22 median, bp
5′ Extension                             ||→      46,979             18,397
3′ Extension                             →||      28,260             10,783
Intergenic regions of (−,−) gene pairs   ←||←     35,568             17,998
Intergenic regions of (−,+) gene pairs   ←||→     73,116             19,623
Intergenic regions of (+,−) gene pairs   →||←     22,077             5,814
Intergenic regions of (+,+) gene pairs   →||→     28,905             14,291

In Chr21, the intergenic lengths do not include the unsequenced gaps and overlapping gene groups. In Chr22, the intergenic lengths encompass the largest five contigs and exclude overlapping gene groups. Intergenic lengths of (−,−) are the intergenic lengths between two successive genes on the (−) strand. The other categories of (−,+), (+,−), and (+,+) are determined in the corresponding manner.

Table 4 suggests that 5′ regulatory regions are more extensive than 3′ regulatory regions. How is this affected by the extent of each gene and by the number of exons? Table 3 highlights longer lengths in 5′ regions (with the single exception of genes of four exons in Chr22, perhaps because of the small number of such genes).

Comparison of Lengths of Different Exon and Intron Types

Three types of exons (initial, internal, and terminal) are usually discriminated. The initial exons, which may play a role in transcription initiation, tend to be longer than internal exons (Tables 5 and 6). Internal exon lengths average about 150 bp and are reasonably constant for genes with at least five exons. The terminal exon length

Table 3. 5′ and 3′ extension lengths for genes of different exon counts

                            Chr21 median, bp                                       Chr22 median, bp
Genes                       Gene count  5′ Extension  Gene count  3′ Extension     Gene count  5′ Extension  Gene count  3′ Extension
Single exon genes           14          77,249        14          40,140           69          20,174        66          15,191
Genes of 2 exons            17          124,854       15          53,163           53          17,829        53          16,818
Genes of 3 exons            34          85,143        33          33,183           42          23,172        42          13,459
Genes of 4 exons            12          59,389        12          40,616           44          12,415        41          13,155
Genes of 5 exons and more   111         29,940        110         23,332           242         18,358        243         8,851

is relatively large and variable because such exons often contain 3′ UTR sequences.

The exon length tends to be greatest for single-exon genes in both chromosomes. Internal exon and intron lengths are generally the smallest in Chr21 (Table 5). In multiple-exon genes, the terminal exon length is generally longer than internal exon lengths. This is not true for intron lengths. In Chr22, the terminal intron length is generally shorter than the internal intron length and the largest intron is principally the initial one (Table 6). This applies to both the complete gene structure annotations and the CDS data, consonant with the impression that the first intron often carries some controls on transcription initiation and gene processing.

Table 5. Exon and intron lengths in gene structures

                          Chr21                   Chr22
                          Mean, bp   Median, bp   Mean, bp   Median, bp
Single exon gene length   1,209      674          1,322      947
Initial exon length       197        135          231        139
Internal exon length      158        129          142        120
Terminal exon length      784        365          1,009      653
Initial intron            13,311     3,844        8,928      2,592
Internal intron           4,423      1,845        3,510      1,312
Terminal intron           6,160      2,187        2,282      1,057

There are 180 genes in Chr21 and 389 genes in Chr22 with three or more exons. There are 141 genes in Chr21 and 341 genes in Chr22 with three or more introns.

Table 6. Chr22 coding region exon and intron lengths

Exon length, bp   Mean   Median     Intron length, bp   Mean    Median
Initial exon      162    101        Initial intron      7,876   2,706
Internal exon     138    121        Internal intron     3,071   1,271
Terminal exon     206    132        Terminal intron     2,402   1,050

In Chr22, there are 354 genes with three or more translated exons. There are 310 genes with three or more introns. Data are not available for Chr21.

Is there a correlation between gene length and G+C content? On the basis of isochore studies it is observed that high G+C regions are more dense with genes. However, from analysis of long genes in conjunction with expressed sequence tag data, it was suggested that long genes (i.e., genes with many exons) prefer DNA regions of reduced Fgc (16). We examined this hypothesis relative to the q-arms of Chr21 and Chr22. For the variables of exon number in gene structures we found for all genes correlation (exon no., G+C) = 0.021 (in Chr21) and −0.019 (in Chr22). For all genes with at least three exons, we ascertained correlation (exon no., mean internal exon length) = 0.082 (in Chr21) and −0.151 (in Chr22); and for all genes with at least four exons, we have correlation (exon no., mean internal intron length) = −0.073 (in Chr21) and −0.014 (in Chr22). These determinations effectively indicate that long genes are uncorrelated with respect to Fgc and with respect to internal exon and intron lengths.

Distinctive Features of Single-Exon Genes

Chr21 contains 15 single-exon (intronless) genes from a total of 214 genes (7%), with one located in an intron of another gene. Chr22 has 98 single-exon genes excluding the λ-Ig V gene segments. There are 13 single-exon genes located in intron regions of Chr22. Thus, in Chr22 the percent of single-exon genes, 98/552 = 17.8%, is significantly greater than the 7% in Chr21. Single-exon lengths are more than 2-fold longer than most exon lengths of multiexon genes (Tables 5 and 6).

In Chr21 and Chr22, the 5′ and 3′ extensions for single-exon genes generally exceed those of multiexon genes, and the 5′ extension length of a gene exceeds the 3′ intergenic length independent of exon numbers (Table 3). For example, the medians of the 5′ and 3′ extension lengths of the single-exon genes are 77,249 bp and 40,140 bp, respectively, in Chr21 and 20,174 bp and 15,191 bp in Chr22. Apparently, single-exon genes need more space to function properly. An evolutionary scenario may propose that most single-exon genes derive from a single intronless progenitor of recent evolutionary history with insufficient time to allow for gain of introns ("introns late" theory). This scenario putatively allows a rapid diversification in invertebrates, whereas vertebrates have acquired introns at a slower rate. A more likely possibility is that single-exon genes can be formed from fusions of exons (presumably by means of reverse transcription, transposition, or recombination). In this context, many single-exon genes need to be processed rapidly to achieve appropriate expression and for this reason avoid introns. An enticing observation is that in both chromosomes the mean single-exon gene length is close to the mean gene exon number times the mean internal exon length (Chr21: 1,209 ≈ 8.5 × 158; Chr22: 1,322 ≈ 7 × 142).

Distribution and Properties of Ψg Sequences

Ψg are nonfunctioning copies of genes that may result either from reverse transcription by means of an mRNA transcript (processed) or from gene duplication and subsequent disablement (17). A recent study of Ψg from Chr21 and Chr22 was set forth by Harrison et al. (18). Ψg sequences tend to be biased toward highly expressed genes. For example, many highly expressed ribosomal protein genes generate Ψg in eukaryotes. Clusters of ribosomal protein Ψg occur more frequently at the carboxyl end of Chr21 and Chr22, these regions also being somewhat higher in Fgc. Other frequent sources of Ψg include cytochrome subunits and membrane proteins (Table 7).

In Chr21, 49 Ψg are presumably processed into one exon each, whereas four have at least two exons; in Chr22, 123 Ψg are processed, whereas 22 involve two or more partially processed exons (eight consist of two exons, two of three exons, two of four exons, three of five exons, one of seven exons, two of eight exons, one of nine exons, two of 10 exons, and one of 15 exons). Table 7 displays all Ψg types that occur at least twice (see also ref. 18).

There are Ψg shared by both chromosomes. In this respect, the ribosomal protein gene Ψg are conspicuous. Thus, the 60S L23a has two copies in Chr21 and one copy in Chr22. One L10 Ψg is identified in Chr21 and one in Chr22. Table 8 presents some data on Ψg types that occur in both chromosomes.

Comparisons of Alu and Ψg Sequences

Alu sequences are found predominantly near the 5′ UTR of genes rather than the 3′ UTR. This makes sense because Alus are G+C rich and CpG islands tend to be located near the 5′ end of genes (19). Actually, the gene structure annotation of Chr22 estimates 540 extant CpG islands of which 248 overlap the 5′ end of genes (4). It is thought that for Alu sequences to survive under transposition, they fare best by targeting CpG islands. In this environment, Alus gain CpG dinucleotides (20).

How are Alu and Ψg distributed in intergenic regions versus introns, and how many Alu and Ψg sequences overlap with gene exons? Explicitly, in Chr21 there are 14 (of 12,168) Alu sequences that overlap exons, of which only four overlap internal exons. Also, there are 20 Alu sequences within or containing exon sequences and only four of these contact internal exons. The corresponding Alu count in Chr22 is 30 (of 21,993) that overlap exon sequences, of which 28 overlap boundary exons (cf. ref. 21). Also, there are 54 Alu sequences totally contained within or enveloping exon sequences and 46 Alu sequences in contact with boundary (mostly untranslated) exons. In Chr22, the same analysis was applied to the protein CDSs. The results reveal only two Alu sequences, both overlapping boundary exons. Also, one short internal exon (136 bp) is completely contained within an Alu sequence. There are no Ψg

Table 7. Pseudogene types with at least two occurrences

Chr21 Ψg types of at least two occurrences, indicating starting positions
  Ribosomal protein components, 17 occurrences: 4930891, 6631155, 7467697, 12311851, 14370507, 15947825, 22421026, 22673718, 22965074, 22999209, 23081472, 23117970, 23252636, 26075948, 26119393, 30480511, 33617159
  Cytochrome components (cytochrome p450 and cytochrome c subunits), 2 occurrences: 883308, 2527156

Chr22 Ψg types of at least two occurrences, indicating starting positions
  Ribosomal protein components, 26 occurrences: 1617744, 2429706, 3091507, 3645358, 10389068, 10853902, 14003472, 14035049, 14552659, 15436724, 15457833, 15714964, 19683973, 21032645, 23923189, 24403570, 26581122, 26896944, 27579196, 27776958, 29114107, 31431878, 31782364, 33006302, 33547793, 34546362
  GGT related (gamma-glutamyltransferase), 7 occurrences: 2622592, 2626735, 5131371, 6567692, 7583805, 8214982, 8599941
  Human membrane protein, 7 occurrences: 2700170, 2850276, 4618174, 5054329, 5210968, 8219216, 8624564
  Cytochrome c oxidase proteins, 3 occurrences; cytochrome p450, 2 occurrences: 19093074, 20019122, 22937160; 25944997, 25954605
  Immunoglobulin kappa variable region pG, 5 occurrences: 1329337, 1339353, 1346561, 1351060, 1358639
  Homeotic Drosophila homolog, 4 occurrences: 6267591, 6294253, 6348328, 6374843
  Mitochondrial precursor, 3 occurrences: 694062, 18487456, 22539612
  Transcriptional repressor, 3 occurrences: 2687511, 5069545, 5196078
  Human keratin type 1 cytoskeletal 18 (cytokeratin 18), 3 occurrences: 3110597, 4413964, 28405473
  Human NADH-ubiquinone oxidoreductase chain 1, 2 occurrences: 7919065, 19742955
  Phorbolin 1, 2 occurrences: 22766055, 22887693
  Similar to mouse tubulin alpha-3 alpha-7 chain, 2 occurrences: 4930817, 4993379
  Actin like-protein, 2 occurrences: 911820, 8653947
  IGLC immunoglobulin lambda light chain C region: 16098291, 16255043

sequences overlapping exon sequences in Chr21. In Chr22, there is a single Ψg that overlaps with an internal exon sequence and two Ψg are contained within boundary exon sequences. The Alu densities (counts/kb) in Chr21 for intergenic and intron regions are 0.33 and 0.47, respectively. In Chr22, the density numbers are 0.62 and 0.77, respectively, so in both Chr21 and Chr22 the Alu density is higher in introns than in intergenic regions. However, Ψg sequences prefer intergenic regions. Size of the sequence may be a decisive factor. The Ψg density values (counts/kb) are as follows: Chr21, 0.0018 (intergenic) and 0.0011 (intronic); Chr22, 0.0053 (intergenic) and 0.0028 (intronic). The foregoing data are organized in Tables 9 and 10.

What are the lengths of the different Ψg sequences? Of the 49 processed Ψg in Chr21, the mean length is 1,250 bp (940-bp median). The four Chr21 multiexon Ψg consist of three two-exon constructs and one of three exons. Explicitly, the two-exon constructs have exon-(intron)-exon lengths of 278-(75)-461 bp, 122-(309)-570 bp, and 185-(17)-110 bp, and the three-exon Ψg has lengths of 92-(68)-152-(1273)-104 bp. The small sizes of both exons and introns among the multiexon Ψg putatively reflect corrupted gene structures. It seems evident that most Ψg arise from processed multiexon genes. The mean length parallels that of single-exon genes. Chr22 contains 123 processed Ψg with an average length of 1,082 bp (median 744), roughly the same as in Chr21. The 22 multiexon Ψg of Chr22 have mean exon length of 182 bp (median 153), again strikingly small compared with the single-exon Ψg types. The mean exon number per multiexon Ψg is about five. The three longest Ψg have lengths of 19,168 bp, 16,318 bp, and 11,585 bp, and nine others have lengths in the range of 4 to 10 kb.

Distribution of Genes and Ψg Along the Chromosomes

Chr22 contains 26 Ψg in a 1.5-Mb region proximal to the centromere (18). This is unusually high. Genomic heterogeneity occurs broadly and on different scales. In probing the organization of a genome, the general problem arises of how to characterize anomalies in the spacings of markers in a long sequence of nucleotides or amino acids. These include properties of clustering/clumping (too many neighboring short spacings), overdispersion (too many long gaps between markers), and excessive evenness (too few short spacings and/or too few long gaps). Questions concerning the spacings in a marker array can be approached by consideration of the cumulative lengths of r consecutive distances along the marker array, where $R_i^{(r)}$ is the distance (number of letters) between marker i and marker i + r; these are designated r-scan lengths (e.g., ref. 22). The spans of the longest and shortest r-scans are useful statistics for detecting significant clumping, significant overdispersion, or excessive regularity in the spacings of the marker. The use of sums of r consecutive fragment lengths, rather than single (r = 1) fragment lengths, can provide sensitivity and better tolerate measurement errors.

We apply the r-scan test for r = 5 under 0.95 significance to analyze the distributions of genes in Chr21 and Chr22. Clusters are identified from significantly small five-scan intervals, and the G+C contents are calculated by masking out those intervals. A similar scheme is applied to determine regions of significant overdispersion. Clusters occur in relatively high G+C regions and overdispersed regions occur in comparatively low G+C regions. Specifically, in Chr21, there are three clusters and one overdispersed region (Table 11).

Table 8. Common Ψg types in Chr21 and Chr22

Common Ψg types in both Chr21 and Chr22                  Locations in Chr21              Locations in Chr22
60S ribosomal protein L23                                15947825, 22965074, 33617159    34546362
60S ribosomal protein L10                                14370507                        31782364
60S ribosomal protein L34                                22421026                        3091507
40S ribosomal protein S3                                 7467697                         14035049, 10389068
Human keratin type I cytoskeletal 18 (cytokeratin 18)    7462183                         3110597, 4413946, 28405473
Cytochrome c pseudogenes                                 2527156                         19093074, 20019122, 22937160
Cytochrome P450 subfamily IID                            883308                          25944997, 25954605
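The marker densities quoted for intergenic and intron regions are simply counts divided by region length in kb. The short Python sketch below recomputes them from the counts and lengths reported in Tables 9 and 10; the dictionary layout and the helper name `density_per_kb` are illustrative assumptions rather than anything defined in the annotation data sets.

```python
# Region lengths (kb) and marker counts as reported in Tables 9 and 10.
REGION_KB = {
    "Chr21": {"intergenic": 24851, "intron": 8241},
    "Chr22": {"intergenic": 20611, "intron": 11758},
}
COUNTS = {
    "Chr21": {
        "pseudogene": {"intergenic": 44, "intron": 9},
        "Alu": {"intergenic": 8250, "intron": 3884},
    },
    "Chr22": {
        "pseudogene": {"intergenic": 109, "intron": 33},
        "Alu": {"intergenic": 12841, "intron": 9068},
    },
}


def density_per_kb(count: int, length_kb: float) -> float:
    """Marker density expressed as counts per kb of sequence."""
    return count / length_kb


# Build a flat lookup keyed by (chromosome, marker, region).
densities = {
    (chrom, marker, region): density_per_kb(n, REGION_KB[chrom][region])
    for chrom, by_marker in COUNTS.items()
    for marker, by_region in by_marker.items()
    for region, n in by_region.items()
}

for key, value in sorted(densities.items()):
    print(key, round(value, 4))
```

Recomputed this way, the Alu density comes out higher in introns than in intergenic regions on both chromosomes, whereas the Ψg density is higher in intergenic regions, in line with the statements in the text.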

2934 ͉ www.pnas.org͞cgi͞doi͞10.1073͞pnas.052692099 Chen et al. Downloaded by guest on September 23, 2021 Table 9. Chr21 distribution of ⌿g and Alu sequences in Table 11. Distribution of genes intergenic regions and introns Location Size, Mb Gene no. Fgc Overlapping Chr21 Total Intergenic Intron with exons Cluster 20,351,850–20,590,675 0.24 8 0.4262 ⌿g counts 53 44 9 0 31,156,109–31,472,069 0.32 7 0.5549 Alu counts 12,168 8,250 3,884 34 33,062,659–33,372,115 0.31 7 0.5160 Length of region, kb 33,092 24,851 8,241 Overdispersion 1,997,831–12,535,699 10 21 0.3721 Marker density, count per kb Chr22 ⌿g 0.0016 0.0018 0.0011 Cluster 4,622,749–4,818,802 0.2 6 0.4837 Alu 0.3677 0.3321 0.4713 Overdispersion 10,527,428–12,563,741 2 10 0.4430 16,334,074–19,246,348 2.9 10 0.4336 There are 34 Alu sequences overlapping with exons. Eight of 34 overlap with internal exons and 26 overlap with boundary exons. average GϩC level of 0.42) and another seven ⌿g, including five ␬ ⌿ ⌿ successive Ig variable g, clustered between positions 1282766 We also applied the r-scan test to the set of ribosomal protein g ϩ ⌿ and 1359121 with an average G C of 0.41. An interesting obser- in both Chr21 and Chr22. We found that the ribosomal g are vation from the three ⌿g clusters is that the orientations of these distributed quite randomly in Chr22. However, the distribution is ⌿g are significantly nonrandom. For example, the 11 ⌿g in Chr21 not so random in Chr21. There is a region of 1 Mb (the expanse of ⌿ ϩ are all on the positive (reported) strand except for the first g. In 22,421,026–23,436,159 with an average G C level of 0.44), which Chr22, the seven ⌿g of the first cluster are also all on the positive ⌿ contains seven ribosomal protein g (17 in the whole chromo- strand and the seven ⌿g in the second cluster are all on the minus some). For the ⌿g distribution, in Chr21, there is a cluster in the strand except for the first ⌿g. 
0.8-Mb interval (region of 22,673,718–23,436,157 with an average G+C level of 0.44) containing 11 ⌿g; in Chr22, there is a cluster of seven ⌿g in a 0.1-Mb stretch (region of 283,333–371,454 with an

Concluding Comments

The median size and distribution of processed ⌿g are about the same as the length of single-exon genes. Also, the median range of single-exon genes is remarkably similar to the average internal exon length times the average number of exons per gene. These properties support the hypothesis that most single-exon genes derive from processed multiexon genes in dynamic regions. An analysis of Chr22 reveals that at least 25% of gene structures possess 5′ and 3′ UTEs. Many of these UTEs may have an important role in alternative splicing, as is the case with G protein-coupled receptor membrane proteins (13). The larger length for the 5′ extension region suggests that 5′ regulatory regions are more extensive than 3′ regulatory regions. The intergenic length of convergent orientation is also longer than the intergenic length of divergent orientations. ⌿g appear to derive predominantly from highly expressed genes, especially ribosomal protein genes and cytochrome c proteins. The largest exons and introns are foremostly the first or last exon or intron. The counts of genes are significantly correlated with G+C chromosomal content. As expected, in the presence of increased transcription activity, there are more genes, Alu sequences, and ⌿g numbers (cf. ref. 23).

We are grateful to Drs. E. Zuckerkandel, A. M. Campbell, B. E. Blaisdell, U. Francke, and D. Petrov for helpful discussions regarding this manuscript. This work was supported in part by National Institutes of Health Grants 5R01GM10452-36 and 5R01HG00335-14.

Table 10. Chr22 distribution of ⌿g and Alu sequences in intergenic regions and introns

                                Total     Intergenic   Intron    Overlapping with exons
⌿g counts                       145       109          33        3 (0)
Alu counts                      21,993    12,841       9,068     84 (3)
Length of region, kb            32,369    20,611       11,758
Marker density, count per kb
  ⌿g                            0.0045    0.0053       0.0028
  Alu                           0.6794    0.6230       0.7712

There are 84 Alu sequences overlapping with exons. Eleven of 84 overlap with internal exons and 73 overlap with boundary exons. There are only three Alu sequences overlapping with translated exons. Three ⌿g overlap with terminal exons but no ⌿g overlap with translated exons. The locations of the three overlapping pairs of gene and pseudogene are as follows: gene (novel gene): 26358694–26379157, ⌿g: 26378836–26386771; gene (Homo sapiens cDNA): 15058600–15105383, ⌿g (similar to H. sapiens angiotensin II receptor gene): 15103686–15103899; and gene (tissue inhibitor of metalloproteinase 3, related to Sorsby fundus dystrophy): 16700083–16761409, ⌿g: 16758391–16758808.
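The marker densities reported in Table 10 are simply counts divided by region length in kb. The following minimal Python sketch recomputes them from the counts and lengths given in the table; the names `REGION_KB`, `COUNTS`, `marker_density`, and `density_table` are illustrative choices, not part of the original analysis, and rounding to four decimals is assumed only to match the table's precision.

```python
# Counts and region lengths (in kb) copied from Table 10 (Chr22).
# All identifiers here are hypothetical names chosen for illustration.
REGION_KB = {"total": 32_369, "intergenic": 20_611, "intron": 11_758}

COUNTS = {
    "pseudogene": {"total": 145, "intergenic": 109, "intron": 33},
    "Alu": {"total": 21_993, "intergenic": 12_841, "intron": 9_068},
}


def marker_density(count, region_kb):
    """Return the number of markers per kb for a single region."""
    return count / region_kb


def density_table(counts=COUNTS, region_kb=REGION_KB, ndigits=4):
    """Return {marker: {region: density}} rounded to `ndigits` decimals."""
    return {
        marker: {
            region: round(marker_density(n, region_kb[region]), ndigits)
            for region, n in by_region.items()
        }
        for marker, by_region in counts.items()
    }


if __name__ == "__main__":
    for marker, row in density_table().items():
        print(marker, row)
```

Under these assumptions the recomputed values agree with the densities printed in Table 10: ⌿g are denser in intergenic regions than in introns, whereas Alu sequences are denser in introns.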

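The three overlapping gene and pseudogene pairs listed in the Table 10 footnote can be checked with a simple interval comparison. The sketch below assumes the coordinates are inclusive base-pair positions on Chr22 (the text does not state the convention), and the helper names `overlap_bp` and `is_contained` are hypothetical.

```python
# (label, gene interval, pseudogene interval), coordinates from the
# Table 10 footnote. ASSUMPTION: both interval ends are inclusive.
PAIRS = [
    ("novel gene", (26358694, 26379157), (26378836, 26386771)),
    ("Homo sapiens cDNA", (15058600, 15105383), (15103686, 15103899)),
    ("tissue inhibitor of metalloproteinase 3",
     (16700083, 16761409), (16758391, 16758808)),
]


def overlap_bp(a, b):
    """Return the overlap length in bp of two inclusive intervals (0 if disjoint)."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max(0, end - start + 1)


def is_contained(inner, outer):
    """Return True if `inner` lies entirely within `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


if __name__ == "__main__":
    for label, gene, pseudogene in PAIRS:
        print(label, overlap_bp(gene, pseudogene), is_contained(pseudogene, gene))
```

With the inclusive-coordinate assumption, the first pair overlaps only at the gene's end (322 bp), while the second and third pseudogenes lie wholly inside their genes (214 bp and 418 bp, respectively).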
1. International Human Genome Sequencing Consortium (2001) Nature (London) 409, 860–921.
2. Venter, J. C., Adams, M. D., Myers, E. W., Li, P. W., Mural, R. J., Sutton, G. G., Smith, H. O., Yandell, M., Evans, C. A., Holt, R. A., et al. (2001) Science 291, 1304–1351.
3. Dunham, I., Shimizu, N., Roe, B. A., Chissoe, S., Hunt, A. R., Collins, J. E., Bruskiewich, R., Beare, D. M., Clamp, M. & Smink, L. J. (1999) Nature (London) 402, 489–495.
4. Hattori, M., Fujiyama, A., Taylor, T. D., Watanabe, H., Yada, T., Park, H. S., Toyoda, A., Ishii, K., Totoki, Y. & Choi, D. K. (2000) Nature (London) 405, 311–319.
5. Reymond, A., Friedli, M., Henrichsen, C. N., Chapot, F., Deutsch, S., Ucla, C., Rossier, C., Lyle, R., Guipponi, M. & Antonarakis, S. E. (2001) Genomics 78, 46–54.
6. Antonarakis, S. E. (2001) Curr. Opin. Genet. Dev. 11, 241–246.
7. Gentles, A. J. & Karlin, S. (2001) Genome Res. 11, 540–546.
8. Huo, L. & Scarpulla, R. C. (1999) Gene 11, 213–224.
9. Macdonald, P. M. & Kerr, K. (1998) Mol. Cell. Biol. 18, 3788–3795.
10. Mancebo, R., Zhou, X. L., Shillinglaw, W., Henzel, W. & Macdonald, P. M. (2001) Mol. Cell. Biol. 21, 3462–3471.
11. Lopez, A. J. (1998) Annu. Rev. Genet. 32, 279–305.
12. Gentles, A. J. & Karlin, S. (1999) Trends Genet. 15, 47–49.
13. Sosinsky, A., Glusman, G. & Lancet, D. (2000) Genomics 70, 49–61.
14. Donofrio, G., Jabbari, K., Musto, H., Alvarez-Valin, F., Cruveiller, S. & Bernardi, G. (1999) Ann. N.Y. Acad. Sci. 870, 81–94.
15. Jurka, J. (1998) Curr. Opin. Struct. Biol. 8, 333–337.
16. Duret, L., Mouchiroud, D. & Gautier, C. (1995) J. Mol. Evol. 40, 308–317.
17. Vanin, E. F. (1985) Annu. Rev. Genet. 19, 253–272.
18. Harrison, P. M., Hegyi, H., Bertone, P., Echols, N., Johnson, T., Balasubramanian, S., Luscombe, N. & Gerstein, M. (2002) Genome Res. 12, 273–281.
19. Cross, S. H. & Bird, A. P. (1995) Curr. Opin. Genet. Dev. 5, 309–314.
20. Jurka, J. & Milosavljevic, A. (1991) J. Mol. Evol. 32, 105–121.
21. Batzer, M. A., Arcot, S. S., Phinney, J. W., Alegria-Hartman, M., Kass, D. H., Milligan, S. M., Kimpton, C., Gill, P., Hochmeister, M. & Ioannou, P. A. (1996) J. Mol. Evol. 42, 22–29.
22. Karlin, S. & Brendel, V. (1992) Science 257, 39–49.
23. Zhang, M. Q. (1998) Hum. Mol. Genet. 7, 919–932.

Chen et al. PNAS | March 5, 2002 | vol. 99 | no. 5 | 2935