The DNA sequence and comparative analysis of human chromosome 20

P. Deloukas, L. H. Matthews, J. Ashurst, J. Burton, J. G. R. Gilbert, M. Jones, G. Stavrides, J. P. Almeida, A. K. Babbage, C. L. Bagguley, J. Bailey, K. F. Barlow, K. N. Bates, L. M. Beard, D. M. Beare, O. P. Beasley, C. P. Bird, S. E. Blakey, A. M. Bridgeman, A. J. Brown, D. Buck, W. Burrill, A. P. Butler, C. Carder, N. P. Carter, J. C. Chapman, M. Clamp, G. Clark, L. N. Clark, S. Y. Clark, C. M. Clee, S. Clegg, V. E. Cobley, R. E. Collier, R. Connor, N. R. Corby, A. Coulson, G. J. Coville, R. Deadman, P. Dhami, M. Dunn, A. G. Ellington, J. A. Frankland, A. Fraser, L. French, P. Garner, D. V. Grafham, C. Griffiths, M. N. D. Griffiths, R. Gwilliam, R. E. Hall, S. Hammond, J. L. Harley, P. D. Heath, S. Ho, J. L. Holden, P. J. Howden, E. Huckle, A. R. Hunt, S. E. Hunt, K. Jekosch, C. M. Johnson, D. Johnson, M. P. Kay, A. M. Kimberley, A. King, A. Knights, G. K. Laird, S. Lawlor, M. H. Lehvaslaiho, M. Leversha, C. Lloyd, D. M. Lloyd, J. D. Lovell, V. L. Marsh, S. L. Martin, L. J. McConnachie, K. McLay, A. A. McMurray, S. Milne, D. Mistry, M. J. F. Moore, J. C. Mullikin, T. Nickerson, K. Oliver, A. Parker, R. Patel, T. A. V. Pearce, A. I. Peck, B. J. C. T. Phillimore, S. R. Prathalingam, R. W. Plumb, H. Ramsay, C. M. Rice, M. T. Ross, C. E. Scott, H. K. Sehra, R. Shownkeen, S. Sims, C. D. Skuce, M. L. Smith, C. Soderlund, C. A. Steward, J. E. Sulston, M. Swann, N. Sycamore, R. Taylor, L. Tee, D. W. Thomas, A. Thorpe, A. Tracey, A. C. Tromans, M. Vaudin, M. Wall, J. M. Wallis, S. L. Whitehead, P. Whittaker, D. L. Willey, L. Williams, S. A. Williams, L. Wilming, P. W. Wray, T. Hubbard, R. M. Durbin, D. R. Bentley, S. Beck & J. Rogers

The Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, UK

The finished sequence of human chromosome 20 comprises 59,187,298 base pairs (bp) and represents 99.4% of the euchromatic DNA. A single contig of 26 megabases (Mb) spans the entire short arm, and five contigs separated by gaps totalling 320 kb span the long arm of this metacentric chromosome. An additional 234,339 bp of sequence has been determined within the pericentromeric region of the long arm. We annotated 727 genes and 168 pseudogenes in the sequence. About 64% of these genes have a 5′ and a 3′ untranslated region and a complete open reading frame. Comparative analysis of the sequence of chromosome 20 to whole-genome shotgun-sequence data of two other vertebrates, the mouse Mus musculus and the puffer fish Tetraodon nigroviridis, provides an independent measure of the efficiency of gene annotation, and indicates that this analysis may account for more than 95% of all coding exons and almost all genes.
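As a quick consistency check on the coverage figure quoted above, the short sketch below recomputes the fraction of euchromatin represented in finished sequence from the two numbers given in the abstract (59,187,298 bp of sequence and 320 kb of gaps). It is illustrative arithmetic only, not part of the original analysis; the constant and function names are hypothetical.

```python
# Figures quoted in the abstract; the names are illustrative, not from the paper.
EUCHROMATIC_SEQUENCE_BP = 59_187_298  # finished euchromatic sequence
GAP_ESTIMATE_BP = 320_000             # combined size estimate of the gaps (320 kb)


def euchromatic_coverage(sequenced_bp: int, gap_bp: int) -> float:
    """Fraction of the euchromatin that is represented in finished sequence."""
    return sequenced_bp / (sequenced_bp + gap_bp)


coverage = euchromatic_coverage(EUCHROMATIC_SEQUENCE_BP, GAP_ESTIMATE_BP)
print(f"Euchromatic coverage: {coverage:.2%}")
```

Treating the gap sizes as exact is an assumption; they are estimates, so the computed coverage is only as precise as those estimates.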

NATURE | VOL 414 | 20/27 DECEMBER 2001 | www.nature.com © 2001 Macmillan Magazines Ltd

The finished reference sequence of the human genome is now in sight, underpinned by the recently published working draft (refs 1, 2). From the outset of the Human Genome Project, the plan has been to determine the complete sequence of each chromosome to an accuracy of greater than 99.99%, and to cover more than 95% of the gene-containing part of the genome (the euchromatin). This finished 'gold' standard was defined and upheld on completion of the first two human chromosomes, 22 and 21 respectively. Here we report completion of the sequence of the first metacentric human chromosome, chromosome 20, to these standards. Analysis of the finished sequence has benefited from comparison with substantial new data sets that were not available at the time of the previous finished chromosome analyses. These include new collections of human and mouse messenger RNA sequences, the protein indices of the fully sequenced model organisms, and extensive shotgun-sequence data of two vertebrate genomes, those of the mouse and the puffer fish T. nigroviridis. As a result, we were able to assess the quality and completeness of human gene annotation by independent analyses. The application of new analytical tools has also enabled assessment of predictive methods to define transcription start sites and other features of gene structures, although these require further development and calibration with the finished annotated sequence.

Clone map and finished sequence

We identified a set of 629 minimally overlapping clones (the tiling path) that spans the euchromatic regions of the short (p) and long (q) arm of human chromosome 20. The tiling path consists of 455 P1-derived artificial chromosomes (PACs), 169 bacterial artificial chromosomes (BACs), 3 yeast artificial chromosomes (YACs), 1 cosmid and 1 polymerase chain reaction (PCR) product (Fig. 1). The euchromatic portion of the chromosome is represented in six contigs with one contig covering the entire p arm (Table 1). Boundaries between euchromatin and heterochromatin were identified by presence of satellite repeats in the sequence of clones located at the most distal and proximal ends, respectively, of the contigs flanking the centromere, and served as logical termination points for map construction. Clones located at the centromeric boundary of the p arm (Fig. 1) gave an additional signal at 20q11.1 upon fluorescent in situ hybridization (FISH) on metaphase chromosomes. We constructed an additional two-clone contig representing this duplication (Fig. 1) and postulate that it is located in the heterochromatic region of the q arm. In contrast to the p arm, four gaps remain in the clone map of the q arm. Three of them are clustered within a 1.2-Mb region at qtel (Fig. 1). We anticipate that the sequences in these gaps are unclonable to the host–vector systems used in this study, probably owing to the high guanine and cytosine (G+C) content of the sequence in this region. All four euchromatic gaps were sized by FISH of clones immediately flanking each gap to extended DNA fibres. No gap was estimated to be larger than 150 kb and all the gaps together account for no more than 320 kb of DNA (Table 1). Finally, we defined the location of both telomeres. At the end of the p arm (ptel), clone RP11-530N10 (EMBL accession code AL360078; Fig. 1) ends about 10 kb away from the block of subtelomeric repeats, which extends for 40–50 kb on the basis of the telomeric half-YAC yRM2005 (ref. 3 and H. Riethman, personal communication). A larger allelic variant of the subtelomeric repeat block is also known, half-YAC yA35 (ref. 4). At the end of the q arm (qtel), clone RP11-476I15 (AL137028; Fig. 1) contains part of the subtelomeric repeat block.

Table 1 Sequence contigs on chromosome 20

Contig                        Size (bp)     Size estimate (kb)
AL360078–AL358116             26,257,626
Centromere                                  ND
AL121723–AL512784             5,063,606
Gap                                         20
AL450465–AL450463             24,982,240
Gap                                         <50
AL391316–AL499627             1,147,210
Gap                                         <100
AL449263                      35,826
Gap                                         <150
AL450469–AL137028             1,700,790
Total euchromatic sequence    59,187,298
AL121762–AL441988             234,339
Total sequence determined     59,421,637

ND, not determined.

Each clone of the tiling path was subjected to random subcloning and sequencing. On the basis of internal and external (ref. 5) quality checks, we estimate the accuracy of our finished sequence to exceed 99.99%. Each clone has been finished according to the agreed international finishing standard for the human genome (http://genome.wustl.edu/gsc/Overview/finrules/hgfinrules.html). In total, we finished 59,421,637 bases in seven sequence contigs. The size of each sequence contig is given in Table 1; the largest one spans the 26,257,626 bp of the entire p arm. The four gaps account for 0.32 Mb (Table 1). Thus, the sequence covers 99.46% of the euchromatic part of chromosome 20, which spans 59.5 Mb. Our estimate for the total size of the chromosome, based on size estimates of 3 Mb for the centromere and 0.2 Mb for subtelomeric repeats, is 62.7 Mb, which is smaller than a previous estimate of 72 Mb (ref. 6).

Figure 1 The sequence map of human chromosome 20 and its features. The short (p) and the long (q) arm of the chromosome are depicted in the top and bottom panels, respectively. The features of each chromosome arm are shown from top to bottom as follows: (1) The finished sequence of each clone in the tiling path as a yellow line. Sequence positions are indicated in megabases along the x-axis of 'G+C content' (see 4, below). Eight of the clones were isolated and sequenced elsewhere, namely AC005808 (LBNL H136; BAC 185), AC005914 (LBNL H135; BAC 189), AC006076 (LBNL H133; PAC 12), AC004762 (LBNL H134; PAC 128), AC005220 (LBNL H80; BAC 99), AC004501 (LBNL H144; BAC 121) and AC004505 (LBNL H65; PAC 86C1) at the Joint Genome Institute (ref. 16) and AC006198 (RP11-3A1) at the Whitehead Institute (Massachusetts Institute of Technology Center for Genome Research). The centromere has been arbitrarily drawn to span 3 Mb. The exact location of contig AL121762–AL441988 in the heterochromatic region of the q arm is not known. Gaps in the map appear as greenish bars. The width of the bar represents the size estimate obtained by fibre FISH. (2) The location of genetic markers. (3) The distribution of the main types of repeats in the sequence. (4) Plot of the G+C content of the sequence. (5) Plot of the SNP density along the sequence. (6) The location of predicted CpG islands. (7) The location of the annotated gene structures. Right and left coloured arrows indicate gene structures on the + and - strand, respectively. The most 3′ end of each gene is drawn halfway along the arrowhead. Only the genes of the annotation group 1 (known; dark blue) and 2 (novel; blue) are named. When no gene symbol is available, the gene name used in the EMBL sequence submission file appears (for example, dJ583P15.4). CDS, protein-coding sequence.

Gene index of chromosome 20

The finished genomic sequence was first analysed for G+C content and CpG islands. Interspersed and simple tandem repeats in the sequence were then masked and the masked sequence was compared against protein, DNA and expressed sequence tags (ESTs) using BLASTX and BLASTN (ref. 7). In parallel, gene structures were predicted ab initio in the masked sequence on a clone-by-clone basis with the programs FGENESH (ref. 8) and GENSCAN (ref. 9).

A total of 895 gene structures was annotated in the finished sequence on the basis of human interpretation of the combined supportive evidence generated during sequence analysis (see Fig. 1). The structures were divided into five groups: (1) 335 'known' genes, that is, those that are identical to known human complementary DNA or protein sequences (all known genes were in the LocusLink database, http://www.ncbi.nlm.nih.gov/LocusLink); (2) 222 'novel genes', that is, those that have an open reading frame (ORF), are identical to human ESTs that splice into two or more exons, and/or have homology to known genes or proteins (all species); (3) 23 'novel transcripts', that is, genes as in 2 but for which a unique ORF cannot be determined; (4) 147 'putative genes', that is, sequences identical to human ESTs that splice into two or more exons but without an ORF; and (5) 168 'pseudogenes', that is, sequences homologous to known genes and proteins but with a disrupted ORF.

Excluding the pseudogenes, chromosome 20 has a gene density of 12.18 per Mb, which is intermediate to 6.71 (low) and 16.31 per Mb (high) reported for chromosome 21 and 22, respectively (refs 10, 11). We used the gene density of chromosomes 20, 21 and 22 from ref. 12 to adjust the number of genes on each of these chromosomes. The adjusted figures were then used to extrapolate a number of 31,500 genes for the whole genome, which is in agreement with recent estimates (refs 1, 2).

The analysis of chromosome 20 benefited from the availability of new large data sets to assist the gene annotation. These included human (for example, Genoscope) and mouse ESTs and 'full-length' cDNAs (for example, RIKEN mouse cDNA collection) as well as the protein indices of fully sequenced model organisms such as Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster and Arabidopsis thaliana. Some 81% of the 557 genes in groups 1 and 2 (and 64% of all the annotated genes) have a full ORF as defined by a starting ATG codon and the presence of a 5′ and a 3′ untranslated region (UTR). Often a stretch of sequence immediately preceding the starting ATG seems to be part of the ORF. When the supporting evidence (for example, ESTs) terminated within such a stretch, we did not annotate a 5′ UTR. Such genes were also included (8.9% of the 557 genes) in the above set.

The transcription start sites of most genes in the human genome are not yet known. We carried out several analyses to assist the annotation of the 5′ ends of as many genes on chromosome 20 as possible. Analysis of the unmasked sequence predicted a total of 660 CpG islands, of which 389 are located near (5 kb upstream or 1 kb downstream) the first exon of an annotated gene structure. Many of the remaining predicted CpG islands have intragenic locations, which in our view does not allow a direct correlation between the observed number of CpG islands and the number of genes on chromosome 20. Among the genes with complete structures, 303 (67%) are associated with a CpG island at their 5′ ends, which is in good agreement with the previously reported figure of 60% (ref. 13). We also scanned the sequence of chromosome 20 for putative transcription start (TS) sites using the probabilistic TS site detector program Eponine (T. Down, unpublished). Eponine is optimized for mammalian genomic DNA sequences and detects likely TS sites on the basis of the surrounding sequence (typically 500 bases upstream to 100 bases downstream). Multiple predictions are often clustered, suggesting alternative TS sites for a gene. Eponine has a detection sensitivity of 40%, on the basis of an analysis of human chromosome 22. We found 1,432 TS sites on chromosome 20, of which 492 (34%) are located within 2 kb of the first exon of an annotated gene. In the set of genes with complete structures, 402 TS sites are associated with 166 genes (37.5%), of which 159 (95.8%) have a CpG island at their 5′ end. So, Eponine predicts multiple TS sites per gene (the mean value is 2.42 in the 166 genes) and has a bias in predicting TS sites in genes associated with a CpG island at their 5′ end.

The 727 genes (that is, introns and exons) extend over a total of 25,213,914 bp (mean 34,682 bp per gene). Excluding expressed pseudogenes, 42.4% of the reported sequence of chromosome 20 is therefore transcribed. Exons account for only 2.43% of the sequence and the mean exon size is 283 bp. A summary per gene group is given in Table 2, which includes figures reported from the
analyses of chromosomes 21 and 22 (refs 10, 11). Gene size varies substantially, from 1,234,386 bp (gene C20orf133 (AL117333–AL049633), which is similar to a low-density lipoprotein-related protein, LRP16) to 339 bp (gene C20orf127 (AL121753)). Exon sizes are fairly constant, with the exception of 3′ terminal exons (for example, 8,181 bp in PTPRT), in contrast to intron sizes, which vary from 33 bp (C20orf97 (AL034548)) to 523,790 bp (CDH4).

Table 2 Structural characteristics of annotated gene structures

Chromosome, gene type            Mean size (kb)   Mean exon size (bp)   Mean number of exons
Chr20, known genes               51.3             294                   10.3
Chr21, known genes               57.0
Chr20, novel genes               25.1             278                   5.7
Chr20, putative genes            9.1              217                   2.5
Chr21, novel + putative genes    27.0
Chr20, genes                     34.7             283                   7.1
Chr21, genes                     39.0
Chr20, pseudogenes               1.9              499                   1.4
Chr20, all                       27.6             292                   6.0
Chr22, all                       19.2             266                   5.4

Structural characteristics of groups of annotated gene structures on chromosome 20 are shown, and compared with similar groups on chromosomes 21 and 22.

For 209 (29%) of the annotated genes, we found alternative splice forms. Alternative splicing can, for example, give rise to two distinct peptides by exclusion of the exons encoding a functional domain from one but not the other transcript. The transcript for the soluble form of attractin (ATRN; AL353193–AL132773) lacks the five exons that encode the transmembrane and cytoplasmic domains and are present in the transcript that encodes the membrane form of the protein. Splice variants may encode structurally unrelated peptides. A complex example of alternative splicing and genetic imprinting is found in the GNAS1 locus (AL132655–AL109840). NESP55 and XLAS are transcribed from distinct mono-allelic promoters located upstream of the bi-allelic promoter that drives the transcription of the gene for the α-subunit of the stimulatory guanine-nucleotide-binding protein Gs. Gsα is encoded by exons 1–13. A large G protein, XLαs, is generated by in-frame splicing of an upstream exon (bp 120,789–121,953; AL132655) to exon 2 (bp 39,075–39,119; AL121917) of Gsα. An additional exon located further upstream (bp 106,368–107,508; AL132655) splices again to exon 2 of Gsα but not in frame, giving rise to a structurally unrelated peptide, NESP55. An antisense transcript (dJ806M20.3.6; AL132655) postulated to regulate this imprinted region has also been reported (ref. 14). In total, we annotated six isoforms of GNAS1. One gene (PLCB4; AL121898–AL031652) was found to have the most isoforms (eight); in most cases of genes with alternative splicing (130 genes) we observed two isoforms. Typically, we annotated the longest possible terminal exon and did not create entries for alternative splice forms on the basis of alternative polyadenylation sites. If we exclude the putative genes that have mainly incomplete structures, then 35% of the genes (average of 1.65 transcripts per gene) show alternative splicing. This is in agreement with previous estimates (ref. 15). Analysis of chromosomes 19 and 22 (ref. 1), both gene-rich chromosomes, showed a higher extent of alternative splicing.

Protein index of chromosome 20

We analysed the proteome of chromosome 20 using InterProScan (http://www.ebi.ac.uk/interpro/scan.html) to look at the distribution of known protein domains. The InterPro database combines information on protein families, domains and functional sites from the databases Pfam, PRINTS, PROSITE, SMART and SWISS-PROT (see http://www.ebi.ac.uk/interpro for links). Of all proteins encoded on chromosome 20, 73.5% have an InterPro match and 30% are multidomain with an average of 2.1 distinct InterPro domains. As shown in Table 3, many of the most frequent domains in the chromosome 20 proteome rank in similar order as in the human proteome (ref. 1). There are, however, five domains for which chromosome 20 seems enriched. Four of them, namely the cysteine proteases inhibitor (IPR000010), the immunoglobulin subtype (IPR003599), the whey acidic protein (WAP)-type 'four-disulphide core' domain (IPR002221), and the pancreatic trypsin inhibitor (Kunitz/Bovine) domain (IPR002223), are found in proteins that are encoded by three gene clusters along the chromosome.

Table 3 Most common InterPro domains in the chromosome 20 proteome and their abundance in other species

                                  Abundance
Rank (genome rank)  InterPro code  Chr20  Hs   Dm   Ce   At     Sc   Name
1 (2)               IPR000822      26     576  341  205  169    53   Zinc finger, C2H2 type
2 (3)               IPR000719      12     481  230  419  1,033  116  Eukaryotic protein kinase
                    (IPR002290     12     316  158  219  856    113  Serine/threonine kinase family)
                    (IPR001245     11     193  82   121  477    4    Tyrosine kinase catalytic domain)
3 (ND)              IPR002965      11     190  175  62   178    0    Proline rich extensin
4 (ND)              IPR000010      9      20   4    3    7      0    Cysteine proteases inhibitor
                    (IPR003243     8      10   0    1    7      0    [...] and M)
5 (4)               IPR000276      9      373  84   368  0      0    Rhodopsin-like GPCR superfamily
6 (13)              IPR000561      9      234  86   137  42     1    EGF-like domain
7 (24)              IPR000008      8      111  42   53   99     11   C2 domain
8 (ND)              IPR000504      8      203  135  111  248    54   RNA-binding region RNP-1 (RNA recognition motif)
9 (1)               IPR003006      8      584  134  66   0      0    Immunoglobulin and major histocompatibility complex domain
                    (IPR003599     7      177  27   11   0      0    Immunoglobulin subtype)
10 (ND)             IPR000345      7      2    4    3    3      3    Cytochrome c family haem-binding site
11 (ND)             IPR001124      7      8    0    10   2      0    Lipid-binding serum glycoprotein
12 (ND)             IPR002221      7      6    4    7    0      0    WAP-type four-disulphide core domain
13 (ND)             IPR002223      6      13   22   38   0      0    Pancreatic trypsin inhibitor (Kunitz/Bovine) family

At the time of analysis, the InterPro database contained 18,149 Homo sapiens (Hs, incomplete), 13,843 Drosophila melanogaster (Dm), 18,581 Caenorhabditis elegans (Ce), 25,677 Arabidopsis thaliana (At) and 6,176 Saccharomyces cerevisiae (Sc) protein entries. Rows in parentheses correspond to InterPro domains that are 'children' (= members) of a broader InterPro domain. ND, not determined. The P-loop motif domain IPR001687 was excluded, owing to low specificity.

Functionally related gene clusters indicate probable ancestral gene-duplication events. The first cluster, at 1.5 Mb (Fig. 1), extends from AL109658 to AL034562 and includes genes with immunoglobulin and immunoglobulin-like domains that are involved in signal transduction and cell adhesion (SIRPB1, SIRPB2 and PTPNS1). The annotated genes PTPN1L and PTPNS1L2 and the pseudogenes dJ576H24.1 and dJ673D20.1 are new members of this gene family. Interestingly, the apparently functional gene PTPNS1L2 is located within a larger fragment of 33,048 bp (AL049634 and AL592544), which is an insertion type of polymorphism. The insertion allele is represented in the RP4 PAC library but not the RP11 BAC library. Using a panel of 174 Caucasians, we estimated that the frequency of the insertion allele is 37.3%. The second cluster, at 23.5 Mb (AL096677–AL121831; Fig. 1), comprises members of the cystatin gene family, which encode cysteine protease inhibitors with antibacterial and antiviral activities. An additional member, CST7, is located about 1 Mb distal of the main cluster (AL035661). Only two of the known cystatins are not on chromosome 20 (IPR003243; Table 3). We annotated three new members (CST8L, CSTL1 and CST9L) and two pseudogenes. The third cluster, at 43.5 Mb (AL049767–AL050348; Fig. 1), includes 11 genes that encode proteins with a WAP-type four-disulphide core domain (IPR002221) and/or a pancreatic trypsin inhibitor (Kunitz/Bovine) domain (IPR002223). A fourth cluster that includes genes for the semenogelins SEMG1 and SEMG2 (semen proteins involved in reproduction) is located within the third gene cluster between members PI3 and SLPI, which have only a WAP-type domain.

Chromosome landscape

The sequence of chromosome 20 has an average G+C content of 44.1%, which is slightly higher than the genome average of 41%. The distribution of the G+C content fluctuates along the chromosome, and regions with higher G+C have a higher gene density (Fig. 1). For example, the sequence from 49.5 to 54 Mb has an average G+C content of 41.1% and a gene density of only 4.9 genes per Mb, in contrast to the region between 60 and 62.5 Mb, which has 56.6% G+C content and a gene density of 28 genes per Mb. Given a sequence length and gene ratio of 1.25 and 1.65, respectively, between the q and the p arm, the q arm seems rich in genes. Gene density can drop as low as 1.54 genes per Mb, for instance in a 1.9-Mb region between AL136990 and AL139163. Interestingly, the largest genes, such as PTPRT, PLCB1 and dJ631M13.5, are located adjacent to or within gene-poor regions.

The repeat content of chromosome 20 is 42%. The distribution of the main classes of repeats (detailed in Supplementary Information) is shown in Fig. 1. Regions of high gene density seem enriched in short interspersed elements (SINEs).

Segmental duplications are another interesting feature of the genome. We compared the masked sequence of chromosome 20 with the rest of the genome and with itself to identify inter- and intrachromosomal duplications, respectively. The segments of chromosome 20 involved in interchromosomal duplications (Fig. 2) often contain pseudogenes; for example, at 6 Mb, AL359954 contains a pseudogene similar to TRDBP that maps to 1p36 (AL109811). The region at 53.9 Mb (Fig. 2) that is duplicated in chromosomes 21 and 22 was recently described as part of a breast cancer amplicon (ref. 16). A region of about 500 kb between 25.8 and 26.3 Mb is implicated in both types of segmental duplications. A core region of 100 kb that harbours a copy of exon 7 of the CFTR gene is duplicated on chromosome 20. The second copy is located in AL121762–AL441988 at the pericentromeric region of the q arm. Copies of the extended region seem to be present on chromosomes 9, 12, 15, 17 and 19 (Fig. 2). Secondary signals in the pericentromeric regions of these and other chromosomes were also observed on FISH analysis of clones AL078587 and AL121762. It will be interesting to investigate whether the gene structures annotated in AL121762 and AL441988 are expressed genes, particularly C20orf80, which is similar to the FRG1 gene. FRG1 is located 100 kb centromeric of the repeat units on chromosome 4q35, which are deleted in facioscapulohumeral muscular dystrophy. The region from 48.2 to 48.8 Mb is bordered by two copies of a 60-kb intrachromosomal duplication.

Figure 2 Duplication landscape of chromosome 20. Intrachromosomal and interchromosomal duplications are shown in blue and red, respectively. Each horizontal line represents 1 Mb of the sequence from the telomeric end of the short arm (top left) to the telomeric end of the long arm (bottom right). The gap indicates the centromeric region. Pairwise alignments generated by Exonerate and longer than 1 kb are shown.

The integrated Marshfield male, female and sex-averaged genetic maps of chromosome 20 (ref. 17) were aligned to the physical map (Fig. 3). The steepest increase in recombination frequency is observed between markers D20S178 and D20S176, which are both located in the region of duplication described above. A
region of very low recombination extends for about 20 Mb between markers D20S432 and D20S859. The rate of recombination in specific loci differs between the two sexes. Compared with that of the female, the rate of male recombination is higher along the p arm up to marker D20S432 and lower across the rest of the chromosome (Fig. 3).

Figure 3 Alignment of the genetic map of chromosome 20 to the physical map. The two maps are aligned from the telomeric end of the short arm to the telomeric end of the long arm. The position of each genetic marker on the female, the male and the sex-averaged genetic map is indicated. Axes: physical distance (Mb) against cumulative genetic distance (cM). Labelled markers: D20S427, D20S178, D20S176, D20S859 and D20S432.

Sequence variation

The definition of the common ancestral haplotypes that are present in the population relies on the availability of an extensive collection of single nucleotide polymorphisms (SNPs). We first placed 26,678 SNPs (deposited in the dbSNP database, http://www.ncbi.nlm.nih.gov/SNP) on the sequence of chromosome 20. Of those, 13,016 were derived from sequence analysis of clone overlaps by the program ssahaSNP (ref. 18). To recover additional SNPs in clone overlaps, we realigned all available clone-based shotgun sequences from chromosome 20 (including unfinished sequence in clone overlaps that was previously archived and therefore excluded from the earlier analysis) onto the finished sequence with ssahaSNP, and detected 11,050 SNPs (submitted to dbSNP). Merging the two data sets resulted in 32,763 unique SNPs on chromosome 20 (Fig. 1), of which 6,085 are new. In the unique set, there are 14,211 SNPs (43.4%) located within annotated genes and 3,061 of them are in exons.

Comparative analysis of chromosome 20

Functional features such as exons and regulatory elements have been conserved through evolution and there is compelling evidence that comparative genomic sequence analysis is a powerful tool in the quest to complete the structural annotation of the human genome. Two data sets were available in the public domain at the time of analysis: about 13 million sequence reads of a mouse whole-genome shotgun giving an estimated genome coverage of 2.3-fold (http://trace.ensembl.org), released by the Mouse Sequencing Consortium on 8 May 2001; and 816,262 single sequence reads from BAC and plasmid ends of the T. nigroviridis genome, totalling 663,839,518 bases and corresponding to 1.72 genome equivalents, generated at Genoscope. Thus we undertook the comparative analysis of the finished and annotated sequence of an entire human chromosome against two vertebrate genomes.

Mouse sequences were aligned to the sequence of chromosome 20 using Exonerate version 0.3d (Guy StC. Slater, unpublished). We obtained matches with 63,644 mouse sequences representing 12,041 regions of sequence conservation (RSC) along chromosome 20. Tetraodon sequences were aligned at Genoscope by Exofish ('exon finding by sequence homology'), which generates ecores (evolutionary conserved regions) (ref. 19). Matches were obtained with 2,992 ecores (available at http://www.genoscope.cns.fr/exofish). We first examined the annotated gene structures; 77.4% of the 727 genes and 89% of the 168 pseudogenes have at least one exon matched by a mouse RSC or Tetraodon ecore. This figure is much higher for the 557 genes in groups 1 and 2 (94%) than it is for the 'putative' gene structures in group 4 (33%). Furthermore, the two sets differ in the ratio of genes with only a mouse RSC to genes with both an RSC and ecore match: 1:3.8 and 1:0.2, respectively. These observations suggest that the putative gene structures may represent largely UTRs, which have sequences known to be less well conserved between species, and possibly genes that appeared later in evolution.

We then looked at matches outside annotated exons as a way to assess the completeness of the current annotation. Such matches may correspond to exonic sequences that have not been annotated in the present study owing to lack of supporting evidence (for example, EST, cDNA and protein homologies). Note that we did not use RSCs and ecores during the annotation process. We found 5,447 RSC and 207 ecore matches, and 60 of these non-exonic regions are conserved in all three species. Of all annotated exons (including pseudogenes), 2,050 (36.3%) contain a region conserved in all three species (in contrast to 0.2% of the annotated introns). Thus, we postulate that about 97.2% (2,050 / (2,050 + 60)) of all coding exons have been annotated in this study. A caveat is whether the set of annotated genes used in this analysis is representative of genes that appeared recently in evolution, as ecores are biased to more conserved genes; however, we consider that such an effect cannot be substantial.

The 639 and 4,808 RSC matches in annotated introns and intergenic regions, respectively, suggest that although the mouse data set provides better coverage (70% of all exons) than the ecores, exonic sequences cannot be readily identified by simple comparison at the DNA level.

GENSCAN can be used on small segments of genomic sequence to effectively evaluate the likelihood of that segment containing an exon. We performed a GENSCAN analysis on 'extended RSCs', which included 100 bp of human sequence either side of the RSC match, to divide them into those that were more likely to be coding regions of sequence conservation (cRSC) and those more likely to be noncoding. This analysis predicted 3,299 cRSCs and 8,836 noncoding RSCs; 65.7% of cRSCs match annotated exons (3.8% are within introns). As a result, the 874 cRSCs found between annotated genes form a set enriched in regions that may represent non-annotated exons (there are 4,748 RSC matches in intergenic regions).

Conclusion and medical implications

We sequenced the euchromatic portion of human chromosome 20 leaving four small gaps that account for no more than 320 kb. In the 59,421,637 bp of sequence, we annotated 727 gene structures, of which 64% are complete, and 168 pseudogenes. A comparison of this product with the draft assembly of chromosome 20 reported earlier this year (ref. 2) clearly shows the importance of generating a contiguous finished reference sequence for each human chromosome. Both the G+C and gene density plots of chromosome 20 peak between 60 and 62.5 Mb (Fig. 1) at the qtel region, which is in sharp contrast to the corresponding plots shown in Fig. 11 in ref. 2, which peak at least 12 Mb proximal of the telomere. Furthermore, the order in which genes are shown in the magnified part of Fig. 13 in ref. 2 is incorrect. For example, the gene OSBPL2 (oxysterol binding protein 2) and bB379O24.1 (GATA5 related) are located at 60.2–60.6 Mb and cannot map between PTPRT (protein tyrosine phosphatase, receptor type) at 41 Mb and ZNF217 (Kruppel-like zinc finger protein) at 51.7 Mb. The use of the clone map information was instrumental in resolving similar problems during the assembly of the chromosome 20 draft sequence.

NATURE | VOL 414 | 20/27 DECEMBER 2001 | www.nature.com © 2001 Macmillan Magazines Ltd 869 articles

The output of the comparative analysis of chromosome 20 from the mouse whole-genome shotgun and the ecores generated from the Tetraodon genomic sequence suggests that the current sets of human and mouse ESTs and 'full-length' cDNAs together with the proteomes of model organisms are adequate to allow the identification of the vast majority of human genes in the sequence. As expected, we found that comparative analysis can be used to reliably identify exonic sequences. The mouse shotgun data alone cannot be used reliably to postulate the number of non-annotated exons, owing to the overall higher degree of sequence conservation. The use of the two data sets together, however, provides an excellent tool for assisting the identification of new, and the completion of existing, gene structures. In the present study, the ability to identify regulatory elements in the sequence of chromosome 20 by comparison to the mouse sequence data can be substantiated only by anecdotal evidence. A three-way comparison with the addition of the genome sequence of a species more closely related to […] may hold the key in this endeavour (ref. 20).

Chromosome 20 is best known for harbouring the genes that cause Creutzfeldt–Jakob disease (PRNP) and severe combined immunodeficiency (ADA). However, the causes of the sporadic cases of Creutzfeldt–Jakob disease (80% of all cases) remain unknown, and no mutation in the ADA gene has been identified to explain the […] of ADA excess in haemolytic anaemia. Furthermore, there are still single-gene disorders mapped to chromosome 20 (http://www.ncbi.nlm.nih.gov/Omim) for which the underlying genetic defect is not known. The resources generated by the Human Genome Project have already been used to accelerate the cloning of disease genes on chromosome 20; the Alagille (JAG1; ref. 21), McKusick–Kaufman (MKKS; ref. 22), ICF (DNMT3B; ref. 23) and Hallervorden–Spatz (PANK2; ref. 24) syndromes are recent examples. The reported finished and annotated sequence and its variation will be a valuable tool in tackling not only the remaining single-gene diseases but also the multifactorial diseases that have been linked to chromosome 20, such as type 2 diabetes, obesity, cataract, eczema and Graves' disease. Evidence for a susceptibility locus for hereditary prostate cancer on 20q13 has also been reported (refs 25, 26). In addition to the sequence itself, the isolated clones used in the sequencing process constitute a unique resource in studying chromosome loss and/or amplification in various types of cancer. We have recently reported the refinement of a commonly deleted region (CDR) of 20q12-13.1 found in patients with myeloproliferative disorders and myelodysplastic syndromes (ref. 27). Others have reported the characterization of a breast cancer amplicon at 20q13.2 (ref. 16), whereas several studies have reported loss of heterozygosity across regions of 20q using comparative genomic hybridization (refs 28, 29).

Methods

Clone map and sequence assembly
Clone map construction is described in ref. 30. Mapped sequence tagged sites (STSs) for screening genomic PAC and BAC libraries were selected from the integrated radiation hybrid map constructed for chromosome 20 (http://www.sanger.ac.uk/cgi-bin/rhtop?chr=20), which harbours 1,493 STS-based markers. In regions with no clone coverage, screening was extended to the CEPH (Centre d'Etude du Polymorphisme Humain), ICRF (Imperial Cancer Research Fund) and ICI (Imperial Chemical Industries) YAC libraries and the LANL (Los Alamos National Laboratory) chromosome-20-specific cosmid library (links for the libraries can be found at http://www.hgmp.mrc.ac.uk/Biology/descriptions/genomic_libraries.html). For the shotgun phase, pUC plasmids with inserts of 1.4–2 kb were sequenced from both ends by the dideoxy chain termination method (ref. 31) with big dye terminator chemistry (ref. 32). Most of the reactions were analysed on ABI3700 capillary sequencing machines. The resulting data were processed by a suite of in-house programs (http://www.sanger.ac.uk/Software/sequencing) before assembly with the PHRED (refs 33, 34) and PHRAP (http://www.phrap.org) algorithms. For the finishing phase, we used the GAP4 program (ref. 35) to help assess, edit and select reactions, eliminate ambiguities and close sequence gaps. Sequence gaps were closed by a combination of primer walking, PCR, short/long insert sublibraries (ref. 36), sublibrary screening with oligonucleotides and, in rare cases, transposon sublibraries.

Sequence analysis tools
Interspersed and simple tandem repeats were identified with Repeatmasker (http://repeatmasker.genome.washington.edu) and etandem (http://www.emboss.org), respectively. BLAST 1.4 (default parameters and matrix; http://blast.wustl.edu) was used to identify initial matches, which were then re-aligned by EST_GENOME (ref. 37). BLASTN was used with a 65% similarity cutoff in the comparison against the RIKEN mouse cDNA set (ref. 38) instead of 95%, which is used when searching human ESTs, to find significant matches. In the unmasked sequence, CpG islands were predicted by searching for sequence segments that are at least 400 bp, have a G+C content greater than 50%, and an expected/observed CpG count of greater than 0.6. The completed analysis was assembled into contigs and visualized in AceDB (http://www.acedb.org), whereas an Ensembl (http://www.ensembl.org) database of the sequence assembly and the annotated genes was constructed and used for calculation of statistics and producing Fig. 1. In SNP analysis, only those regions of chromosome 20 where a SNP was detected by at least four reads were considered valid, since the depth of shotgun sequencing for these clones was greater than 4×. The phred-quality value of at least four of the reads at the SNP location had to be at least 30 (error probability of phred base calling 0.001 or less). Exonerate was run with an initial word length of 14 bp, gap penalties of 8 for opening a gap and 4 for extending one, and a score of 5 and -4 for DNA matches and mismatches, respectively.

Received 13 August; accepted 25 October 2001.

1. International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
2. Venter, J. C. et al. The sequence of the human genome. Science 291, 1304–1351 (2001).
3. Riethman, H. C. et al. Integration of telomere sequences with the draft human genome sequence. Nature 409, 948–951 (2001).
4. Chute, I., Le, Y., Ashley, T. & Dobson, M. J. The telomere-associated DNA from human chromosome 20p contains a pseudotelomere structure and shares sequences with the subtelomeric regions of 4q and 18p. Genomics 41, 51–60 (1997).
5. Felsenfeld, A., Peterson, J., Schloss, J. & Guyer, M. Assessing the quality of the DNA sequence from the Human Genome Project. Genome Res. 9, 1–4 (1999).
6. Morton, N. E. Parameters of the human genome. Proc. Natl Acad. Sci. USA 88, 7474–7476 (1991).
7. Altschul, S. F., Gish, W., Miller, W., Meyers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
8. Salamov, A. A. & Solovyev, V. V. Ab initio gene finding in Drosophila genomic DNA. Genome Res. 10, 516–522 (2000).
9. Burge, C. & Karlin, S. Prediction of gene structures in human genomic DNA. J. Mol. Biol. 268, 78–94 (1997).
10. Hattori, M. et al. The DNA sequence of human chromosome 21. The chromosome 21 mapping and sequencing consortium. Nature 405, 311–319 (2000).
11. Dunham, I. et al. The DNA sequence of human chromosome 22. Nature 402, 489–495 (1999).
12. Deloukas, P. et al. A physical map of 30,000 human genes. Science 282, 744–746 (1998).
13. Antequera, F. & Bird, A. Number of CpG islands and genes in human and mouse. Proc. Natl Acad. Sci. USA 90, 11995–11999 (1993).
14. Hayward, B. E. & Bonthron, D. T. An imprinted antisense transcript at the human GNAS1 locus. Hum. Mol. Genet. 9, 835–841 (2000).
15. Brett, D. et al. EST comparison indicates 38% of human mRNAs contain possible alternative splice forms. FEBS Lett. 474, 83–86 (2000).
16. Collins, C. et al. Comprehensive genome sequence analysis of a breast cancer amplicon. Genome Res. 11, 1034–1042 (2001).
17. Yu, A. et al. Comparison of human genetic and sequence-based physical maps. Nature 409, 951–953 (2001).
18. Ning, Z., Cox, A. J. & Mullikin, J. C. SSAHA: A fast search method for large DNA databases. Genome Res. 11, 1725–1729 (2001).
19. Roest Crollius, H. et al. Estimate of human gene number provided by genome-wide analysis using Tetraodon nigroviridis DNA sequence. Nature Genet. 25, 235–238 (2000).
20. Dubchak, I. et al. Active conservation of noncoding sequences revealed by three-way species comparisons. Genome Res. 10, 1304–1306 (2000).
21. Li, L. et al. Alagille syndrome is caused by mutations in human Jagged1, which encodes a ligand for Notch1. Nature Genet. 16, 243–251 (1997).
22. Stone, D. L. et al. Mutation of a gene encoding a putative chaperonin causes McKusick-Kaufman syndrome. Nature Genet. 25, 79–82 (2000).
23. Hansen, R. S. et al. The DNMT3B DNA methyltransferase gene is mutated in the ICF immunodeficiency syndrome. Proc. Natl Acad. Sci. USA 96, 14412–14417 (1999).
24. Zhou, B. et al. A novel pantothenate kinase gene (PANK2) is defective in Hallervorden-Spatz syndrome. Nature Genet. 28, 345–349 (2001).
25. Berry, R. et al. Evidence for a prostate cancer-susceptibility locus on chromosome 20. Am. J. Hum. Genet. 67, 82–91 (2000).
26. Zheng, S. L. et al. Evidence for a prostate cancer linkage to chromosome 20 in 159 hereditary prostate cancer families. Hum. Genet. 108, 430–435 (2001).
27. Bench, A. et al. Chromosome 20 deletions in myeloid malignancies: reduction of the common deleted region, generation of a PAC/BAC contig and identification of candidate genes. Oncogene 19, 3902–3913 (2000).
28. Bench, A. et al. A detailed physical and transcriptional map of the region of chromosome 20 that is deleted in myeloproliferative disorders and refinement of the common deleted region. Genomics 49, 351–362 (1998).
29. Rigaud, G. et al. High resolution allelotype of nonfunctional pancreatic endocrine tumors: Identification of two molecular subgroups with clinical implications. Cancer Res. 61, 285–292 (2001).
30. Bentley, D. R. et al. The physical maps for sequencing human chromosomes 1, 6, 9, 10, 13, 20 and X. Nature 409, 942–943 (2001).
31. Sanger, F., Nicklen, S. & Coulson, A. R. DNA sequencing with chain-terminating inhibitors. Proc. Natl Acad. Sci. USA 74, 5463–5467 (1977).
32. Rosenblum, B. B. et al. New dye-labeled terminators for improved DNA sequencing patterns. Nucleic Acids Res. 25, 4500–4504 (1997).
33. Ewing, B., Hillier, L., Wendl, M. C. & Green, P. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8, 175–185 (1998).


34. Ewing, B. & Green, P. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8, 186–194 (1998).
35. Bonfield, J. K., Smith, K. F. & Staden, R. A new DNA sequence assembly program. Nucleic Acids Res. 23, 4992–4999 (1995).
36. McMurray, A. A., Sulston, J. E. & Quail, M. A. Short-insert libraries as a method of problem solving in genome sequencing. Genome Res. 8, 562–566 (1998).
37. Mott, R. EST_GENOME: a program to align spliced DNA sequences to unspliced genomic DNA. Comput. Appl. Biosci. 13, 477–478 (1997).
38. The RIKEN Genome Exploration Research Group Phase II Team and the FANTOM Consortium. Functional annotation of a full-length mouse cDNA collection. Nature 409, 685–690 (2001).

Acknowledgements
All the reported DNA sequences have been deposited to the EMBL, GenBank or DDBJ databases. We thank H. Roest Crollius, O. Jaillon and the Tetraodon group at Genoscope for the Exofish data; E. Zdobnov for the InterPro analysis; the Ensembl team for assistance in the computational analysis; and the group of M. S. Jackson at the University of Newcastle upon Tyne for markers and clones that assisted the mapping of the pericentromeric duplication. Finally, we thank H. Wain, E. Bruford, M. Lush, R. Lovering, M. Wright, P. Shetty and S. Povey for help with […] and the Wellcome Trust for financial support.

Supplementary Information accompanies the paper on Nature's website (http://www.nature.com).

Correspondence and requests for material should be addressed to P.D. (e-mail: [email protected]).
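The numerical thresholds given under "Sequence analysis tools" can be made concrete with a short sketch. This is an illustration under stated assumptions, not the code used in the study: the function names are hypothetical; the phred relation P = 10^(-Q/10) is the standard definition (so Q = 30 gives 0.001, as quoted); and the CpG test assumes the conventional observed/expected CpG ratio, with expected = (number of C × number of G) / length, which may differ from the exact measure the authors applied.

```python
from typing import Sequence


def phred_error_probability(quality: float) -> float:
    """Standard phred relation: P(error) = 10 ** (-Q / 10)."""
    return 10 ** (-quality / 10)


def snp_is_valid(read_qualities: Sequence[int],
                 min_reads: int = 4,
                 min_quality: int = 30) -> bool:
    """True if at least `min_reads` reads at the SNP position
    have a phred quality of at least `min_quality`."""
    good_reads = sum(1 for q in read_qualities if q >= min_quality)
    return good_reads >= min_reads


def is_cpg_island(segment: str,
                  min_length: int = 400,
                  min_gc: float = 0.5,
                  min_ratio: float = 0.6) -> bool:
    """Apply the three thresholds from the text to one sequence segment.

    Assumption: the ratio tested is observed/expected CpG.
    """
    segment = segment.upper()
    length = len(segment)
    if length < min_length:
        return False
    c_count = segment.count("C")
    g_count = segment.count("G")
    if c_count == 0 or g_count == 0:
        return False
    gc_content = (c_count + g_count) / length
    observed_cpg = segment.count("CG")
    expected_cpg = c_count * g_count / length
    return gc_content > min_gc and observed_cpg / expected_cpg > min_ratio


if __name__ == "__main__":
    print(phred_error_probability(30))
    print(snp_is_valid([35, 30, 40, 31, 12]))
    print(is_cpg_island("CG" * 200))
```

A real CpG island scan would slide a window along the unmasked sequence and merge adjacent qualifying windows; the sketch only evaluates a single candidate segment.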

[Figure 1 (foldout map of chromosome 20; image not reproduced). Tracks plotted against position in Mb (0 to 62.5): contig tiling path (clone accession numbers); genetic markers; repeats (SINE, LINE, LTR, DNA, other); G+C content (scale 0.3 to 0.7); SNPs per 2,000 bp; CpG islands; and genes, classed as known, novel CDS, novel transcript, pseudogene or putative.]