Genes, pseudogenes, and Alu sequence organization across human chromosomes 21 and 22

Chingfer Chen†, Andrew J. Gentles†, Jerzy Jurka‡, and Samuel Karlin†§

†Department of Mathematics, Stanford University, Stanford, CA 94305-2125; and ‡Genetic Information Research Institute, 2081 Landings Drive, Mountain View, CA 94043

Contributed by Samuel Karlin, December 21, 2001

2930–2935 | PNAS | March 5, 2002 | vol. 99 | no. 5 | www.pnas.org/cgi/doi/10.1073/pnas.052692099 | GENETICS

Human chromosomes 21 and 22 (mainly the q-arms) were the first complete parts of the human genome released. Our analysis of genes, pseudogenes (ψg), and Alu repeats across these chromosomes includes the following findings: The number of gene structures containing untranslated exons exceeds 25%; the terminal exon tends to be the largest among exons, whereas the initial intron tends to be the largest among introns; single-exon gene length is approximately the mean gene exon number times the mean internal exon length; processed ψg lengths are on average approximately the same as single-exon gene length; and the G+C content and length of genes are uncorrelated. The counts and distribution of genes, ψg, and Alu sequences and G+C variation are evaluated with respect to clusters and overdispersions. Other assessments concern comparisons of intergenic lengths, properties of ψg sequences, and correlations between Alu and ψg sequences.

Abbreviations: Chr21, chromosome 21; Chr22, chromosome 22; UTR, untranslated region; CDS, coding sequence structure; ψg, pseudogenes; Fgc, G+C frequency; UTE, untranslated exon.

§To whom reprint requests should be addressed. E-mail: [email protected].

The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked “advertisement” in accordance with 18 U.S.C. §1734 solely to indicate this fact.

Two “drafts” of the human genome have now been released: a public version (Human Genome Project) and the Celera version (1, 2). The first completely sequenced parts of the human genome included the euchromatic portions (q-arms) of chromosomes 21 and 22 (Chr21 and Chr22, respectively). A total of 34.55 Mb (about 97%) of Chr22q was sequenced in 12 contigs, and 33.6 Mb of Chr21q was sequenced in four contigs (3, 4). Neither the p-arm of Chr21 nor that of Chr22, both mainly heterochromatin, was completely sequenced. The gene annotation available for Chr22 (as of March 6, 2001) is of two kinds: (i) complete gene structures specifying all exons and introns plus 5′ and 3′ untranslated regions (UTRs), and (ii) coding sequence structures (CDSs) restricted to exon regions translated into proteins and intervening introns. No CDS annotation is available for Chr21.

In this article we examine, among other things, the distribution of genes, pseudogenes (ψg), repeats (mainly Alu elements), and G+C frequency (Fgc) variation. Comparisons, contrasts, and analysis of Chr21 and Chr22 will center on the following assessments: (i) correlations and associations of genes, ψg, Alu counts, and Fgc variables; (ii) gene 5′ and 3′ intergenic lengths (see later text for precise definitions); (iii) numbers, lengths, and distribution of single-exon (intronless) genes; (iv) the distribution of genes with different exon numbers; (v) comparisons of intergenic lengths for consecutive pairs of genes with (−,−) orientations, (+,+) orientations, (−,+) divergent orientations, and (+,−) convergent orientations; (vi) the relative distribution of Alu and ψg sequences in intergenic regions vs. introns; (vii) conspicuous genes (e.g., ribosomal protein genes) among ψg sequences; (viii) the distribution of ψg sequences associated with processed or small genes versus multiexon genes; (ix) the statistics of exons that are transcribed but not translated; and (x) to what extent genes, ψg, and Alu sequences are clustered or overdispersed in Chr21 and Chr22.

There are at least three data annotations covering Chr21 and Chr22: the original Riken gene catalog of Chr21 (4), the Sanger Centre database of Chr22 (3), the University of California Santa Cruz (Golden Path) collection for Chr21 and Chr22, and REFSEQ, maintained by the National Center for Biotechnology Information, which is derived and extended from Golden Path. The sequence assemblies are virtually the same for each source. The known human genes, with recognized names, are in excellent (but not perfect) agreement across the data sets. However, there are many differences in annotation with respect to ORFs, predicted genes, matching spliced expressed sequence tags, and alternative splicings (5, 6). Our analysis concentrates on the Riken and Sanger Centre data but it appears to be consistent overall with the other data sets.

Chromosomal Counts of Genes, ψg, and Alus

The Riken annotation of Chr21 (33.6 Mb) reports 214 complete gene structures, 53 ψg, and 12,168 Alu elements (as of Jan. 16, 2001). On Chr22q (34.5 Mb), the Sanger annotation reports 552 genes, 145 ψg, and 21,993 Alu elements. Thus for the same approximate euchromatin extent, Chr22 has more than twice as many gene structures as Chr21, almost twice as many Alu sequences, and 3-fold more ψg, consistent with the greater overall Fgc of Chr22 (48%) compared with Chr21 (42%) (3, 4). Chromosomes with more genes have more accessible genomic DNA with respect to ψg and Alu sequences, partly because of more transcriptional activity, so a key determinant in these counts is the greater gene density and greater G+C content in Chr22 versus Chr21. Along these lines, among human chromosomes Chr19 has the highest G+C content (overall 49%), the highest gene density, the highest CpG dinucleotide bias, and more CpG islands, and next in these contexts is Chr22 (1, 7). In Chr21, the aggregate length of intergenic regions is 24,851 kb and the aggregate intron length is 8,241 kb, a ratio of about 3:1. For Chr22 the corresponding ratio is 20,611 kb to 11,758 kb, about 2:1. These data are based on the gene structure annotation and exclude the Ig gene segments.

Chr22 contains 118 λ-Ig gene segments (variable V segments). Five consecutive ψg of the Ig λ-V region at about locations 1329337–1359121 of Chr22q are included. Excluding these Ig gene segments, in Chr22 the mean number of exons per gene is 7.1 (median 5.5). The mode is attained for single-exon genes, with 98 such genes. Chr21 has mean exon number 8.5 (median 6) and the mode occurs for genes of three exons, with 39 such genes (see Fig. 1).

Fig. 1. The first three graphs indicate the number of genes in Chr21 and Chr22 with different numbers of exons. The last graph shows the number of genes with counts for 3′ and 5′ UTEs in Chr22 (there is no corresponding data set for Chr21).

Numbers of Genes Containing Untranslated Exons (UTEs)

A total of 453 of the complete gene structures have their coding region specified in the CDS data set. Of these, 333 genes (73.5%) have no 5′ UTEs, 84 have a single 5′ UTE, 21 have two, seven have three, four have four, three have five, and one has eight.¶ A total of 403 (89%) genes have no 3′ UTEs, 36 have one, eight have two, three have three, two have five, and one has eight. These statistics are impressive for the proportion of genes (at least 25%) that possess UTEs. It is not known what kinds of controls these UTEs portend. Some possibilities are that UTEs probably play a role in regulating export of mRNA from the nucleus; that 5′ UTEs with connecting introns participate in translation initiation; and that 3′ UTEs also may assist in mRNA stability and with polyadenylation linkers. 5′ UTEs putatively contribute to regulating alternative splicing and translation efficiency (8). It has been established in Drosophila that the 3′ UTR plays a functional role in cytoplasmic localizations of mRNA transcripts (9, 10). There are also examples of sequential processing activities governed by 5′ alternative promoters [e.g., ultrabithorax (11)]. In human, the protein coding sectors of G protein-coupled receptors are predominantly intronless but at least 18% of the underlying genes contain 5′ UTEs (12, 13). Sosinsky et al. (13) proffer an excellent discussion of olfactory G protein-coupled receptors with intronless coding regions that possess introns in their 5′ UTRs. These genes seem to involve retropositions, at least in their early evolutionary stages, and alternative splicing events using separate acceptor or donor splice sites of the same exon.

¶We assume that for genes where the coding sequence annotation agrees exactly with the complete gene structure annotation, no UTEs are present. The main results are unchanged even if this is not always correct; they would then represent lower bounds on the occurrence of UTEs.

Do genes with greater numbers of exons and extended protein coding sequences tend to have more flanking UTEs? A correlation calculation yields no significant correlation between gene exon numbers and UTE counts and lengths.

What kinds of genes contain many UTEs (5′ and/or 3′)? Table 1 lists some examples of genes of Chr22 with five or more UTEs at either the 5′ or the 3′ end.

A gene in possession of one or more 5′ UTEs does not necessarily involve 3′ UTEs. A direct calculation shows that the flanking UTR exon counts are basically uncorrelated: correlation (5′ UTE, 3′

Table 1. Genes of Chr22 with five or more UTEs

Locus            No. of exons  No. of 5′ UTEs  5′ UTL, bp  No. of 3′ UTEs  3′ UTL, bp  Description
Genes with 5 or more 5′ UTEs
AC006285.5       9             5               714         0               3088        Homo sapiens MIL1 protein mRNA
GGT1             17            5               668         0               1713        Gamma-glutamyltransferase 1
HMG2L1           12            5               565         0               2155        High mobility group protein 2-like 1
LZTR1            21            8               860         0               1713        Leucine-zipper-like transcriptional regulator 1
Genes with 5 or more 3′ UTEs
DJ319F24.C22.1   11            0               0           5               981         Matches expressed sequence tag sequences
DJ671014.C22.2   13            2               275         5               661         Homo sapiens gamma-parvin mRNA
DJ402G11.C22.6   16            1               247         8               1423        Matches expressed sequence tag cluster

UTL, untranslated exon length.
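
The UTE tallies and the intergenic-to-intron length ratios quoted in the text can be rechecked with a few lines of arithmetic. The sketch below is only an illustration: the dictionary and variable names are hypothetical, and the numbers are transcribed from the counts stated for the 453 Chr22 genes with CDS annotation and from the aggregate lengths given for Chr21 and Chr22.

```python
# Number of Chr22 genes having k untranslated exons (UTEs), keyed by k.
# Values are transcribed from the text; the names are illustrative only.
five_prime = {0: 333, 1: 84, 2: 21, 3: 7, 4: 4, 5: 3, 8: 1}
three_prime = {0: 403, 1: 36, 2: 8, 3: 3, 5: 2, 8: 1}


def summarize(counts):
    """Return (total genes, genes with >= 1 UTE, percent with >= 1 UTE)."""
    total = sum(counts.values())
    with_ute = total - counts[0]
    return total, with_ute, 100.0 * with_ute / total


total5, with5, pct5 = summarize(five_prime)
total3, with3, pct3 = summarize(three_prime)

# Aggregate intergenic length divided by aggregate intron length (both in kb).
ratio_chr21 = 24851 / 8241
ratio_chr22 = 20611 / 11758

print(f"5' UTEs: {with5} of {total5} genes ({pct5:.1f}%)")
print(f"3' UTEs: {with3} of {total3} genes ({pct3:.1f}%)")
print(f"intergenic:intron ratio, Chr21 {ratio_chr21:.2f}, Chr22 {ratio_chr22:.2f}")
```

Because the text does not say how many genes carry UTEs at both ends, the share of genes with at least one 5′ UTE serves here as a lower bound on the share of genes with any UTE, which is in line with the "at least 25%" statement.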