Prediction of the Coding Sequences of Unidentified Human Genes. XXII

DNA Research 8, 319–327 (2001)

Short Communication

Prediction of the Coding Sequences of Unidentified Human Genes. XXII. The Complete Sequences of 50 New cDNA Clones Which Code for Large Proteins

Takahiro Nagase,∗ Reiko Kikuno, and Osamu Ohara

Kazusa DNA Research Institute, 1532-3 Yana, Kisarazu, Chiba 292-0812, Japan

(Received 3 December 2001)

Abstract

As an extension of human cDNA projects for accumulating sequence information on the coding sequences of unidentified genes, we herein present the entire sequences of 50 cDNA clones, named KIAA1939–KIAA1988. cDNA clones to be entirely sequenced were selected by two approaches based on their protein-coding potentialities prior to sequencing: 10 cDNA clones were chosen because their encoding proteins had a molecular mass larger than 50 kDa in an in vitro transcription/translation system; the remaining 40 cDNA clones were selected because their putative proteins, as determined by analysis of the genomic sequences flanked by both the terminal sequences of cDNAs using the GENSCAN gene prediction program, were larger than 400 amino acid residues. According to the sequence data, the average sizes of the inserts and corresponding open reading frames of cDNA clones analyzed here were 4.6 kb and 1.9 kb (643 amino acid residues), respectively. From the results of homology and motif searches against the public databases, the functional categories of the 31 predicted gene products could be assigned; 25 of these predicted gene products (81%) were classified into proteins relating to cell signaling/communication, nucleic acid management, and cell structure/motility. The expression profiles of the genes were also studied in 10 human tissues, 8 brain regions, spinal cord, fetal brain and fetal liver by reverse transcription-coupled polymerase chain reaction, the products of which were quantified by enzyme-linked immunosorbent assay.
Key words: large proteins; cDNA sequencing; expression profile; brain; zinc finger proteins

Communicated by Michio Oishi
∗ To whom correspondence should be addressed. Tel. +81-438-52-3930, Fax. +81-438-52-3931, E-mail: [email protected]

The human genome draft sequence, a milestone in the study of human genomics, has provided a wealth of information in biology.¹,² In particular, the complete catalogs of human proteins and transcripts are expected to transform our understanding of various kinds of biological phenomena on a molecular basis. Toward this end, some groups are actively making efforts to annotate a set of proteins and transcripts from the genomic sequence. However, currently available computational methods used for gene prediction are not mature enough for this purpose. Actually, Hogenesch et al. reported that the gene sets predicted by Celera Genomics and by a public genome consortium share only a small number of novel genes, while both efforts predicted approximately the same number of genes.³ They concluded that an integrated approach combining computational predictions, human curation and experimental validation would be required to complete the description of the human proteome and transcriptome.³

The difficulty in annotating human transcripts from the genomic sequence in silico mainly lies in the fact that most protein-coding sequences (CDSs) in human genes are divided into small pieces along the genome by introns of various sizes. Analysis of cDNA is expected to circumvent this problem, because all the introns are removed in a mature form of human mRNA, at least in principle. In addition, cDNA clones also serve as reagents for functional analysis of human genes, thereby underscoring the importance of cDNA analysis to obtain the complete catalog of human proteins. Several groups have recently begun comprehensive cDNA sequencing projects for this purpose.⁴,⁵

We anticipated such a situation more than 7 years ago and pioneered a sequencing project of human cDNAs encoding relatively large proteins in order to accumulate information on the CDSs of unidentified genes. The total number of human cDNAs entirely sequenced by us has reached nearly 2000.⁶,⁷ In particular, we have focused our sequencing efforts on long cDNAs encoding large proteins because we are most interested in cDNA clones encoding multidomain proteins. To this end, we have used in vitro transcription/translation assays to select cDNA clones which can produce proteins with an apparent molecular mass of ≥ 50 kDa.⁸ After the human genome draft sequence became publicly available, we also selected cDNA clones which were expected, on the basis of predicted gene structure, to encode relatively large proteins by mapping the 5′ and 3′ expressed sequence tags (ESTs) of long cDNAs along the genomic sequence in silico.⁹ As an extension of the preceding reports, we herein present the predicted CDSs of 50 new cDNA clones which have the potential to code for large proteins. Together with the results of computer analysis of amino acid sequences of their predicted products, the expression profiles of the 50 new genes in various human tissues including brain regions are also explored.

1. Sequence Analysis and Prediction of CDSs in cDNA Clones

The cDNA clones used in this study were isolated from size-fractionated libraries harboring cDNAs with average sizes longer than 4 kb, which were derived from human fetal brain, adult whole brain, amygdala, hippocampus and cultured cell line KG-1.⁶,⁸ First, the cDNA clones with unidentified sequences at both ends were selected by BLAST search against the GenBank database (release 122.0) excluding ESTs and genomic sequences.¹⁰,¹¹ Then, 50 cDNA clones which seemed to have coding potentials for production of relatively large proteins were selected for sequencing in their entirety and were designated using our systematic gene identifier, KIAA plus a four-digit number: 10 cDNA clones (KIAA1966–KIAA1973, KIAA1987, and KIAA1988) were selected because they could produce proteins with an apparent molecular mass larger than 50 kDa in an in vitro transcription/translation system according to the method previously described;⁸ the remaining cDNA clones were identified by GENSCAN gene prediction analysis of the genomic regions flanked by the cDNA terminal sequences.¹² In brief, both terminal sequences of the cDNAs were subjected to BLAST search¹⁰ against the human genome draft sequences (ftp://ncbi.nlm.nih.gov/genomes/H_sapiens/) to find the genomic fragments corresponding to the cDNA sequences. After identification of genomic fragments that were found to be considerably similar to the cDNA sequence (E-value = 0.0 and sequence identity of 90% or greater), the GENSCAN program was used to predict a plausible gene structure of the genomic fragment.¹² In this way, the 40 cDNA clones (KIAA1939–KIAA1965 and KIAA1974–KIAA1986) for unidentified genes were selected because they were expected to encode proteins larger than 400 amino acid residues. Entire sequencing of these clones was performed according to the method previously described in detail.⁸

Regarding 15 cDNA clones (KIAA1974–KIAA1988), [...] by GeneMark analysis.¹³ Thus, we carefully checked whether the observed interruption of CDSs is spurious or not. The coding splits of three cDNA clones (KIAA1977, KIAA1981, and KIAA1983) were found to be spurious by direct sequencing of the major reverse transcription-coupled polymerase chain reaction (RT-PCR) products; these cDNA sequences were then revised according to the RT-PCR results. The split of the CDS in a cDNA clone of KIAA1982 was determined to be the result of a deletion of a single nucleotide, probably during reverse transcription, because the comparison of the cloned cDNA sequence with the corresponding genomic sequence indicated the presence of a single-base deletion, although we failed to obtain and analyze the RT-PCR products corresponding to this region. As observed in this case, frame-shift mutations in cDNAs caused by a one- or two-nucleotide insertion/deletion during reverse transcription occur frequently in regions with homopolymeric runs, as described previously.¹⁴ For these four cDNA clones, the revised sequences, not the actual cloned cDNA sequences, were deposited in the GenBank/EMBL/DDBJ databases and used for prediction of their CDSs unless otherwise stated. The differences between the cloned cDNA and the revised cDNA sequences are shown on our web site, HUGE (http://www.kazusa.or.jp/huge).¹⁵ In contrast to the cases described above, for the remaining 11 cDNA sequences we could not obtain and analyze RT-PCR products for the regions spanning the predicted CDS interruption. Thus, even though KIAA1974–KIAA1976, KIAA1978–KIAA1980, and KIAA1984–KIAA1988 have multiple predicted CDSs by GeneMark analysis as described above, only the longest CDSs are shown in Fig. 1.

Physical maps of the 50 cDNAs reported in this paper are shown in Fig. 1, in which the ORFs and the first ATG codons in their respective ORFs are indicated by solid boxes and triangles, respectively. Repeat sequences analyzed by the RepeatMasker program are also displayed in Fig. 1. In conclusion, the average size of the 50 cDNA sequences was 4.6 kb and that of the predicted CDSs was approximately 643 amino acid residues. Table 1 lists the lengths of inserts and the ORF lengths of the respective clones. Notably, clones for the KIAA1981 and KIAA1987 genes seemed to lack a region encoding their respective C-terminal portions because the 3′-ends of these cDNAs ended with a Not I site but did not have a dA-tail. These incomplete cDNA clones probably resulted from cleavage of the parental cDNAs at an internal Not I site, since all the cDNAs in our libraries were digested with Not I before ligation to a vector during cDNA library construction.⁸ As additional information on these KIAA genes, the chromosomal loci of the genes were determined by comparison with the human genome draft sequence (ftp://ncbi.nlm.nih.gov/genomes/H_sapiens/) (Table 1).
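
The GENSCAN-based selection rule described in this section (E-value = 0.0, sequence identity of 90% or greater, predicted protein larger than 400 amino acid residues) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the `GenomicHit` record, the function names, and the requirement that both cDNA ends hit the same genomic fragment are assumptions made for illustration, and the predicted protein length is assumed to have been read from GENSCAN output elsewhere.

```python
from dataclasses import dataclass


@dataclass
class GenomicHit:
    """Hypothetical record for one BLAST hit of a cDNA end on a genomic fragment."""
    fragment_id: str
    e_value: float
    percent_identity: float


def passes_similarity_filter(hit, min_identity=90.0):
    # Thresholds quoted in the text: E-value = 0.0 and identity of 90% or greater.
    return hit.e_value == 0.0 and hit.percent_identity >= min_identity


def select_clone(hits_5p, hits_3p, predicted_protein_length, min_residues=400):
    """Return True if a clone would be selected under the assumed rule.

    Assumption: the 5' and 3' terminal sequences must both match the same
    genomic fragment, so that the fragment region is flanked by the cDNA ends.
    """
    good_5p = {h.fragment_id for h in hits_5p if passes_similarity_filter(h)}
    good_3p = {h.fragment_id for h in hits_3p if passes_similarity_filter(h)}
    flanked = bool(good_5p & good_3p)
    # "Larger than 400 amino acid residues" is read here as a strict inequality.
    return flanked and predicted_protein_length > min_residues
```

For example, a clone whose two ends both hit a hypothetical fragment `frag1` at 95% identity or better with an E-value of 0.0, and whose predicted protein has 650 residues, would be selected; the same clone with a 400-residue prediction would not.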

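The effect of a single-nucleotide deletion in a homopolymeric run, as reported above for KIAA1982, can be sketched with a toy ORF scan. The code below is a simplified illustration and not GeneMark: it only searches the forward strand for the longest stretch from an ATG to the next in-frame stop codon, and the sequences `intact` and `mutant` are invented for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (start, end, frame) of the longest ATG-to-stop ORF, or None.

    start is the index of the first ATG in the ORF; end is the index just
    past the stop codon. Only the three forward-strand frames are scanned.
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG of a candidate ORF in this frame
            elif start is not None and codon in STOP_CODONS:
                if best is None or (i + 3 - start) > (best[1] - best[0]):
                    best = (start, i + 3, frame)
                start = None  # look for the next ORF in this frame
    return best


# Invented toy sequence containing a homopolymeric run of six A residues.
intact = "ATG" + "GCT" + "GCT" + "AAAAAA" + "GTG" + "ACC" + "GCT" + "TAA"

# Simulate the loss of one A from the run, as might happen during reverse transcription.
run_start = intact.index("AAAAAA")
mutant = intact[:run_start] + intact[run_start + 1:]

print(longest_orf(intact))
print(longest_orf(mutant))
```

In this toy case the deletion shifts the reading frame downstream of the run and brings a TGA codon into frame, so the ORF found in `mutant` ends earlier than the one in `intact`. Comparing a cloned cDNA with the corresponding genomic sequence, as the authors did, is what distinguishes such an artifact from a genuine interruption.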