BMC Genomics (BioMed Central)
Research article, Open Access

A transcription map of the 6p22.3 reading disability locus identifying candidate genes

Eric R Londin1, Haiying Meng2 and Jeffrey R Gruen*2

Address: 1Graduate Program in Genetics, State University of New York at Stony Brook, NY, USA and 2Yale Child Health Research Center, Department of Pediatrics, Yale University School of Medicine, New Haven, CT, USA

* Corresponding author

Published: 30 June 2003
Received: 22 April 2003
Accepted: 30 June 2003

BMC Genomics 2003, 4:25
This article is available from: http://www.biomedcentral.com/1471-2164/4/25

© 2003 Londin et al; licensee BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.

Keywords: reading disability, dyslexia, 6p22.3, in silico, ESTs

Abstract

Background: Reading disability (RD) is a common syndrome with a large genetic component. Chromosome 6 has been identified in several linkage studies as playing a significant role. A more recent study identified a peak of transmission disequilibrium to marker JA04 (G72384) on chromosome 6p22.3, suggesting that a gene is located near this marker.

Results: In silico cloning was used to identify possible candidate genes located near the JA04 marker. The 2 million base pairs of sequence surrounding JA04 was downloaded and searched against the dbEST database to identify ESTs. In total, 623 ESTs from 80 different tissues were identified and assembled into 153 putative coding regions from 19 genes and 2 pseudogenes encoded near JA04. The identified genes were tested for their tissue specific expression by RT-PCR.

Conclusions: In total, five possible candidate genes for RD and other diseases mapping to this region were identified.
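The assembly step summarized in the Results above, in which EST alignments on the genomic sequence are grouped into putative coding regions, can be illustrated with a minimal sketch. It assumes that each EST hit is reduced to a (start, end) interval on the genomic sequence and that overlapping intervals belong to the same region. The function name `merge_est_hits`, the `max_gap` parameter, and all coordinates are hypothetical and are not taken from the article, which does not specify the grouping procedure at this level of detail.

```python
from typing import List, Tuple

# (start, end) of one EST alignment on the genomic sequence, inclusive, in bp.
Interval = Tuple[int, int]


def merge_est_hits(hits: List[Interval], max_gap: int = 0) -> List[Interval]:
    """Merge EST alignment intervals that overlap or lie within max_gap bp.

    Assumption for illustration: hits that overlap on the genomic sequence
    are treated as evidence for the same putative coding region.
    """
    regions: List[Interval] = []
    for start, end in sorted(hits):
        if regions and start <= regions[-1][1] + max_gap:
            # Hit overlaps (or nearly touches) the current region: extend it.
            prev_start, prev_end = regions[-1]
            regions[-1] = (prev_start, max(prev_end, end))
        else:
            # Hit starts beyond the current region: open a new region.
            regions.append((start, end))
    return regions


# Hypothetical coordinates, for illustration only (not data from the article).
example_hits = [(100, 250), (200, 400), (1000, 1200), (1150, 1300), (5000, 5100)]
putative_regions = merge_est_hits(example_hits)
print(putative_regions)
```

Sorting the hits first means each interval only has to be compared with the most recently opened region, so the grouping takes a single pass. A real analysis would also need to consider strand, alignment quality, and splice junctions, none of which are modelled here.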
Background

Reading disability (RD), or dyslexia, is a common syndrome with a significant genetic component. Genetic linkage studies have identified five potential RD loci located on chromosomes 1 [1,2], 2 [3,4], 6 [5–9], 15 [7,10] and 18 [11]. The linkage to 6p has been the most often reproduced in independent samples, with five studies showing peaks of linkage to different regional marker sets [5–9]. We recently constructed a BAC/PAC contig of the 6p RD locus [12] and identified the precise location and order of 29 short tandem repeat (STR) markers spanning this region [13]. A subsequent study with this new marker panel identified a peak of transmission disequilibrium at marker JA04 (G72384) [14].

We searched the expressed sequence tag (EST) database, dbEST (http://www.ncbi.nlm.nih.gov/dbEST/index.html), with the genomic sequence corresponding to the peak of transmission disequilibrium at marker JA04. ESTs are partial and usually incomplete cDNA sequences prepared from various tissues. Presently, nearly four million ESTs (dbEST release 04/19/02) are catalogued in dbEST, representing about 80% or more of all human genes with at least one representative entry [15]. While not every gene is accounted for in dbEST, computer database searching, also known as in silico cloning, can identify new genes without physically manipulating DNA. These types of analyses can also characterize intron-exon boundaries, splice variants, tissue-specific expression levels, and gene homologies [16]. Furthermore, clustering ESTs together to form a contiguous sequence can predict putative open reading frames (ORFs). The first map of the human genome, which contained over 30,000 genes, was generated by mapping EST clusters to human-hamster radiation hybrid cell lines [17]. Despite their wide-ranging utility, ESTs have two inherent drawbacks: (1) they are based on single sequence reads, making them vulnerable to sequencing errors, and (2) they are generated from cDNA libraries that may contain unexpressed or incompletely spliced sequences derived from heterogeneous nuclear RNA or other artifacts. ORFs assembled from clustered ESTs could therefore contain both expressed and unexpressed sequences and should be treated with caution. Fortunately, the high redundancy of entries in dbEST permits the alignment of multiple ESTs for most genes, thus diminishing the effects of these drawbacks.

To identify candidate genes for RD and other disorders mapping to this region, we downloaded and searched the two million base pairs of genomic sequence surrounding the peak of transmission disequilibrium. In addition to RD, risk loci for Behçet's disease [18], inflammatory bowel disease (IBD3) [19,20], hypotrichosis simplex (HSS) [21], insulin-dependent type 1 diabetes mellitus [22], attention deficit hyperactivity disorder (ADHD) [23] and schizophrenia [24] have all been assigned to this general chromosomal location by genetic linkage analysis. Using in silico cloning we identified a total of 19 genes and 2 pseudogenes and mapped their precise physical location and direction of transcription. The expression pattern of each gene was characterized by examining the number of ESTs identified from various tissues as well as by qualitative RT-PCR with RNA from 20 different human tissues. This study also allowed us to test the usefulness of in silico cloning to identify and map new genes in a focussed region of the genome.

Results

Using the blastcl3 server at NCBI, we performed in silico cloning studies of the 6p RD locus to identify coding regions. In total, 623 ESTs from 80 different tissues were identified and aligned to 2 Mb of genomic sequence. These searches captured 153 putative coding regions from 19 genes and 2 pseudogenes concentrated in the central 1200 Kb, shown in detail in Figure 1, with base pair 1 at the 5-prime end of FLJ12671 and the region ending 1 Kb centromeric to the 3-prime end of RPS10. Short tandem repeat marker JA04, which identified the peak of transmission disequilibrium for RD phenotypes in previous studies [14], is at 540 Kb. The most telomeric 200 Kb and the most centromeric 600 Kb of genomic sequence are void of coding regions. Intergenic distances range from less than 1 Kb (KIAA0319 and TRAF) to 110 Kb (HNRPA1 and P24). Cytokeratin 8, transcribed telomere to centromere, is located in the intron between exons 1 and 2 of KIAA0319 (transcribed centromere to telomere). Table 1 lists the genes identified in Figure 1, their size in Kb, the NCBI accession number for the corresponding mRNA or cDNA, number of exons, genomic mapping source, and putative function.

Figure 1. Transmission disequilibrium and genetic linkage analyses of the 6p21.3 reading disability locus, regional STR markers and transcription map. [Upper panel: linkage (T score) and transmission disequilibrium (chi square) plotted across the regional STR markers, ordered pter to cen over 0 Mb to 10 Mb. Lower panel: the 1.2 Mb region, scaled 0 kb to 1200 kb, with the positions of the candidate genes.]

Thirteen genes were previously mapped to this region (RU2AS, MRS2L, GPLD1, SSADH, KIAA0319, TRAF, HT012, FLJ12619, Geminin, KIAA0386, Cyclophilin A, CMAH, and NUP50) and are present on the NCBI accession view (NT_017021) of 6p21.3-22. Our in silico studies identified an additional six genes (RPS10, FLJ12671, UBE2D2, AP3, Cytokeratin 8, and P24) and two pseudogenes (HNRPA1 and ASSP2) not represented on NT_017021. In addition, we identified a putative fifth exon for P24 from hypothalamus (BG715502) and pineal gland (AA363698) cDNA libraries, located between exons 2 and 3 in the 5-prime untranslated region and flanked by non-traditional splice donor and acceptor bases [25]. The novel exon was not present in the full-length mRNA sequence (AF418980).

The functions of the newly identified genes were inferred by their similarity to other known genes identified in the BLAST searches. Ribosomal protein S10 (RPS10) is 95% identical to the genomic and mRNA sequences of a known ribosomal gene. RPS10 is one of the many proteins that make up the ribosome macromolecule. Other highly homologous genes of RPS10 are RPS5, RPS9, RPS29, RPL5, RPL27a, and RPL28 [26]. FLJ12671 is a hypothetical gene with an unknown function identified by the NEDO human cDNA-sequencing project [27]. Blasting the FLJ12671 sequence against the nr database also identified hits on chromosomes 1 and 11, suggesting that this gene has been duplicated on several chromosomes. Ubiquitin conjugating enzyme E2D2 (UBE2D2) is a protein that targets abnormal or short-lived proteins for degradation by the 26S proteasome [28]. The EST sequences identified for the UBE2D2 gene on chromosome 6 are 93% identical to the human UBC4/5 gene located on chromosome 5, suggesting that this gene could be a duplicate as well. Adapter-related protein complex 3 (AP3) may be involved in intracellular protein transport [29]. Cytokeratin 8 is 95% identical to both the genomic and mRNA sequences of keratin. Vesicular membrane protein (P24), a previously characterized but unmapped gene [30], has been localized in intracellular organelles of highly differentiated neural cells and may have a role in the neural organelle transport system. ASSP2 is one of 12 pseudogenes of argininosuccinate synthetase encoded on 10