DNA Research 8, 319–327 (2001) Short Communication

Prediction of the Coding Sequences of Unidentified Human Genes. XXII. The Complete Sequences of 50 New cDNA Clones Which Code for Large Proteins

Takahiro Nagase,∗ Reiko Kikuno, and Osamu Ohara

Kazusa DNA Research Institute, 1532-3 Yana, Kisarazu, Chiba 292-0812, Japan

(Received 3 December 2001)

Abstract

As an extension of human cDNA projects for accumulating sequence information on the coding sequences of unidentified genes, we herein present the entire sequences of 50 cDNA clones, named KIAA1939–KIAA1988. cDNA clones to be entirely sequenced were selected by two approaches based on their protein-coding potentialities prior to sequencing: 10 cDNA clones were chosen because their encoded proteins had a molecular mass larger than 50 kDa in an in vitro transcription/translation system; the remaining 40 cDNA clones were selected because their putative proteins, as determined by analysis of the genomic sequences flanked by both the terminal sequences of cDNAs using the GENSCAN prediction program, were larger than 400 amino acid residues. According to the sequence data, the average sizes of the inserts and corresponding open reading frames of cDNA clones analyzed here were 4.6 kb and 1.9 kb (643 amino acid residues), respectively. From the results of homology and motif searches against the public databases, the functional categories of the 31 predicted gene products could be assigned; 25 of these predicted gene products (81%) were classified into proteins relating to cell signaling/communication, nucleic acid management, and cell structure/motility. The expression profiles of the genes were also studied in 10 human tissues, 8 brain regions, spinal cord, fetal brain and fetal liver by reverse transcription-coupled polymerase chain reaction, the products of which were quantified by enzyme-linked immunosorbent assay.

Key words: large proteins; cDNA sequencing; expression profile; brain; zinc finger proteins

The human genome draft sequence, a milestone in the study of human genomics, has provided a wealth of information in biology.1,2 In particular, the complete catalogs of human proteins and transcripts are expected to transform our understanding of various kinds of biological phenomena on a molecular basis. Toward this end, some groups are actively making efforts to annotate a set of proteins and transcripts from the genomic sequence. However, currently available computational methods used for gene prediction are not mature enough for this purpose. Actually, Hogenesch et al. reported that the gene sets predicted by Celera Genomics and by a public genome consortium share only a small number of novel genes, while both efforts predicted approximately the same number of genes.3 They concluded that an integrated approach combining computational predictions, human curation and experimental validation would be required to complete the description of human proteome and transcriptome.3 The difficulty in annotating human transcripts from the genomic sequence in silico mainly lies in the fact that most protein-coding sequences (CDSs) in human genes are divided into small pieces along the genome by introns of various sizes. Analysis of cDNA is expected to circumvent this problem, because all the introns are removed in a mature form of human mRNA, at least in principle. In addition, cDNA clones also serve as reagents for functional analysis of human genes, thereby underscoring the importance of cDNA analysis to obtain the complete catalog of human proteins. Several groups have recently begun comprehensive cDNA sequencing projects for this purpose.4,5 We anticipated such a situation more than 7 years ago and pioneered a sequencing project of human cDNAs encoding relatively large proteins in order to accumulate information on the CDSs of unidentified genes. The total number of human cDNAs entirely sequenced by us has reached nearly 2000.6,7 In particular, we have focused our sequencing efforts on long cDNAs encoding large proteins because we are most interested in cDNA clones encoding multidomain proteins. To this end, we have used in vitro transcription/translation assays to select cDNA clones which can produce proteins with an apparent molecular mass of ≥ 50 kDa.8 After the human genome draft sequence became publicly available, we also selected cDNA clones which were expected, on the basis of predicted gene structure, to encode relatively large proteins by mapping the 5′ and 3′ expressed sequence tags (ESTs) of long cDNAs along the genomic sequence in silico.9 As an extension of the preceding reports, we herein present the predicted CDS of 50 new cDNA clones which have the potential to code for large proteins. Together with the results of computer analysis of amino acid sequences of their predicted products, the expression profiles of 50 new genes in various human tissues including brain regions are also explored.

Communicated by Michio Oishi
∗ To whom correspondence should be addressed. Tel. +81-438-52-3930, Fax. +81-438-52-3931, E-mail: [email protected]

1. Sequence Analysis and Prediction of CDSs in cDNA Clones

The cDNA clones used in this study were isolated from size-fractionated libraries harboring cDNAs with average sizes longer than 4 kb, which were derived from human fetal brain, adult whole brain, amygdala, hippocampus and cultured cell line KG-1.6,8 First, the cDNA clones with unidentified sequences at both ends were selected by BLAST search against the GenBank database (release 122.0) excluding ESTs and genomic sequences.10,11 Then, 50 cDNA clones which seemed to have coding potentials for production of relatively large proteins were selected for sequencing in their entirety and were designated using our systematic gene identifier, KIAA plus a four-digit number: 10 cDNA clones (KIAA1966–KIAA1973, KIAA1987, and KIAA1988) were selected because they could produce proteins with an apparent molecular mass larger than 50 kDa in an in vitro transcription/translation system according to the method previously described;8 the remaining cDNA clones were identified by GENSCAN gene prediction analysis of the genomic regions flanked by the cDNA terminal sequences.12 In brief, both terminal sequences of the cDNAs were subjected to BLAST search (ref. 10) against the human genome draft sequences (ftp://ncbi.nlm.nih.gov/genomes/H sapiens/) to find the genomic fragments corresponding to the cDNA sequences. After identification of genomic fragments that were found to be considerably similar to the cDNA sequence (E-value = 0.0 and sequence identity of 90% or greater), the GENSCAN program was used to predict a plausible gene structure of the genomic fragment.12 In this way, the 40 cDNA clones (KIAA1939–KIAA1965 and KIAA1974–KIAA1986) for unidentified genes were selected because they were expected to encode proteins larger than 400 amino acid residues. Entire sequencing of these clones was performed according to the method previously described in detail.8

Regarding 15 cDNA clones (KIAA1974–KIAA1988), multiple CDSs were detected in a single cDNA by GeneMark analysis.13 Thus, we carefully checked whether the observed interruption of CDSs is spurious or not. The coding splits of three cDNA clones (KIAA1977, KIAA1981, and KIAA1983) were found to be spurious by direct sequencing of the major reverse transcription-coupled polymerase chain reaction (RT-PCR) products; these cDNA sequences were then revised according to the RT-PCR results. The split of CDS in a cDNA clone of KIAA1982 was determined to be the result of a deletion of a single nucleotide, probably during reverse transcription, because the comparison of the cloned cDNA sequence with the corresponding genomic sequence indicated the presence of a single-base deletion, although we failed to obtain and analyze the RT-PCR products corresponding to this region. As observed in this case, frame-shift mutations in cDNAs caused by a one- or two-nucleotide insertion/deletion during reverse transcription occur frequently in regions with homopolymeric runs, as described previously.14 For these four cDNA clones, the revised sequences, not the actual cloned cDNA sequences, were deposited to the GenBank/EMBL/DDBJ databases and used for prediction of their CDSs unless otherwise stated. The differences between the cloned cDNA and the revised cDNA sequences are shown on our web site, HUGE (http://www.kazusa.or.jp/huge).15 In contrast to the cases described above, for the remaining 11 cDNA sequences we could not obtain and analyze RT-PCR products for the regions spanning the predicted CDS interruption. Thus, even though KIAA1974–KIAA1976, KIAA1978–KIAA1980, and KIAA1984–KIAA1988 have multiple predicted CDSs by GeneMark analysis as described above, only the longest CDSs are shown in Fig. 1.

Physical maps of the 50 cDNAs reported in this paper are shown in Fig. 1, in which the ORFs and the first ATG codons in their respective ORFs are indicated by solid boxes and triangles, respectively. Repeat sequences analyzed by the RepeatMasker program are also displayed in Fig. 1. In conclusion, the average size of the 50 cDNA sequences was 4.6 kb and that of the predicted CDSs was approximately 643 amino acid residues. Table 1 lists the lengths of inserts and the ORF lengths of the respective clones. Notably, clones for the KIAA1981 and KIAA1987 genes seemed to lack a region encoding their respective C-terminal portions because the 3′-ends of these cDNAs ended with a Not I site but did not have a dA-tail. These incomplete cDNA clones probably resulted from cleavage of the parental cDNAs at an internal Not I site since all the cDNAs in our libraries were digested with Not I before ligation to a vector during cDNA library construction.8 As additional information on these KIAA genes, the chromosomal loci of genes were determined by comparison with the human genome draft sequence (ftp://ncbi.nlm.nih.gov/genomes/H sapiens/) (Table 1).

[Figure 1 graphic: physical maps of the 50 cDNA clones in two panels (KIAA1939–KIAA1963 and KIAA1964–KIAA1988); horizontal scale 0–8 kb. Asterisks mark KIAA1974–KIAA1976, KIAA1978–KIAA1980, and KIAA1984–KIAA1988.]

Figure 1. Physical maps of cDNA clones analyzed. The physical maps shown here were constructed from the sequence data of respective cDNA clones or, when necessary, from the combination of cDNA clones and RT-PCR products. The horizontal scale represents the cDNA length in kb, and the gene numbers corresponding to respective cDNAs are given on the left. The ORFs and untranslated regions are shown by solid and open boxes, respectively. Regarding the KIAA gene numbers marked with an asterisk, only the largest CDSs predicted by GeneMark analysis are shown even though the multiple CDSs are predicted. Information on the multiple CDSs is available through our web site HUGE.15 The positions of the first ATG codons with or without the contexts of Kozak’s rule are indicated by solid and open triangles, respectively.27 RepeatMasker, a program that screens DNA sequences for interspersed repeats known to exist in mammalian genomes, was applied to detect repeat sequences in respective cDNA sequences (Smit, A. F. A. and Green, P., RepeatMasker at http://ftp.genome.washington.edu/RM/ RepeatMasker.html). Short interspersed nucleotide elements (SINEs) including Alu and MIRs sequences and other repetitive sequences thus detected are displayed by dotted and hatched boxes, respectively.

2. Prediction of Genomic Structures Corresponding to cDNA Sequences

To assign the genomic structure of the genes identified through cDNA analysis in this study, the cDNA sequences were subjected to BLAST search against the human genome draft sequences (ftp://ncbi.nlm.nih.gov/genomes/H sapiens/). The genomic regions corresponding to cDNAs were assigned by SIM4 (ref. 16) on the genomic fragments which were found to be highly similar to the cDNA sequences (E-value = 0.0 and sequence identity of 90% or greater). As shown in Table 1, 44 cDNA sequences could be aligned along the genomic sequences with complete coverage (‘perfect’ in Table 1), while 6 cDNA sequences had terminal or internal regions which could not be aligned (‘terminal no hit’ or ‘internal no hit’). Moreover, the integrity of the 3′ end of cDNAs could be evaluated by the sequences following the 3′-extreme end of the cDNAs in the corresponding genomic regions. Cloned cDNAs for 10 genes (KIAA1940–KIAA1942, KIAA1948, KIAA1951, KIAA1952, KIAA1955, KIAA1968, KIAA1983, and KIAA1988) were found to be synthesized by internal dT priming since canonical polyadenylation signal sequences were not observed and, instead, adenine-rich sequences were found in the genomic region just downstream from the 3′-terminal end of the cDNAs. The results of the genomic structure analysis of most of the KIAA genes are also available through our database, HUGE (http://www.kazusa.or.jp/huge).15

Table 1. Information of sequence data and chromosomal locations of the identified genes.

Columns: Gene number (KIAA); Accession number a); cDNA length (bp) b); ORF length (amino acid residues); Chromosomal location c); Correspondence with the human genome g); Source of cDNA h)

1939    AB075819  5782  1082  15  perfect          AA
1940    AB075820  5627   821   2  perfect          AA
1941    AB075821  6029  1233  22  internal no hit  AA
1942    AB075822  5549   445  19  perfect          FB
1943    AB075823  5047  1054   5  perfect          FB
1944    AB075824  5305   657  12  perfect          FB
1945    AB075825  5848   420   1  perfect          FB
1946    AB075826  5175   679   2  perfect          FB
1947    AB075827  5700   608  19  perfect          FB
1948    AB075828  5718   487  19  perfect          FB
1949    AB075829  4015   662   6  perfect          FB
1950    AB075830  4415   535   7  perfect          FB
1951    AB075831  3954   679  19  perfect          FB
1952    AB075832  2894   735   9  perfect          FB
1953    AB075833  3132   824  10  terminal no hit  FB
1954    AB075834  3846   468  16  terminal no hit  FB
1955    AB075835  3260   947  15  perfect          FB
1956    AB075836  3543   680  19  perfect          FB
1957    AB075837  2976   481  19  perfect          FB
1958    AB075838  7056   607   9  perfect          KG
1959    AB075839  4722   650  11  perfect          AB
1960    AB075840  4017   871   5  perfect          AB
1961    AB075841  4180   943   4  perfect          AB
1962    AB075842  3655   746   9  perfect          AB
1963    AB075843  6031   488   6  perfect          AH
1964    AB075844  5926   757  17  terminal no hit  AH
1965    AB075845  4299   565   7  perfect          AH
1966 d) AB075846  5142   480   4  perfect          FB
1967 d) AB075847  3579   818   8  perfect          FB
1968 d) AB075848  3334   860   9  perfect          FB
1969 d) AB075849  3146   595  19  perfect          FB
1970 d) AB075850  3110   519  16  perfect          FB
1971 d) AB075851  3637   830  15  perfect          FB
1972 d) AB075852  3473   576  16  perfect          FB
1973 d) AB075853  7770  1176   9  perfect          AH
1974    AB075854  5610   362  20  perfect          AA
1975    AB075855  4559   597  10  perfect          AA
1976    AB075856  6495   351   5  perfect          FB
1977 e) AB075857  3098   606  16  internal no hit  FB
1978    AB075858  4694   384  19  perfect          FB
1979    AB075859  5053   309  19  terminal no hit  FB
1980    AB075860  4173   329   3  perfect          FB
1981 e) AB075861  2588   862  19  perfect          FB
1982 f) AB075862  3501   599   4  perfect          FB
1983 e) AB075863  3250   418  18  perfect          FB
1984    AB075864  4207   339   9  perfect          AB
1985    AB075865  7327   955   5  perfect          AH
1986    AB075866  5949   334  19  perfect          AH
1987 d) AB075867  3228   357  16  perfect          FB
1988 d) AB075868  3741   362  16  perfect          FB

a) Accession numbers of DDBJ, EMBL, and GenBank databases.
b) Values excluding poly(A) sequences.
c) Chromosome numbers were determined from the results of BLAST search of cDNA clones against the human draft sequence (ftp://ncbi.nlm.nih.gov/genomes/H sapiens/).
d) cDNA clones were selected by in vitro transcription/translation experiments.
e) cDNA and ORF lengths were revised by direct analysis of the RT-PCR products.
f) cDNA and ORF lengths were revised according to the corresponding genomic sequence.
g) Correspondence between cDNAs and the human genome draft sequences in NCBI (ftp://ncbi.nlm.nih.gov/genomes/ H sapiens/) is indicated with the following categories: ‘perfect,’ whole sequence of cDNA hits with genomic sequence; ‘terminal no hit,’ terminal sequence of cDNA does not hit with genomic sequence; ‘internal no hit,’ internal sequence of cDNA does not hit with genomic sequence.
h) The sources of tissues from which the cDNAs were derived are indicated as AA, adult amygdala; AB, adult brain; AH, adult hippocampus; FB, fetal brain; and KG, human immature myeloid cell line KG-1.

3. Functional Classification of Predicted Gene Products

To classify the gene products predicted from the cDNA sequences according to their possible functions, the similarities of the sequences were examined against the following public databases: (1) non-redundant amino acid sequence database, nr, (ftp://ncbi.nlm.nih.gov/blast/db/nr.z), (2) databases of predicted protein sequences from yeast (Saccharomyces cerevisiae),17 nematoda (Caenorhabditis elegans)18 and fly (Drosophila melanogaster)19 genomes [genome-ftp://ncbi.nlm.nih.gov/genbank/genomes/S cerevisiae/ Chr/[I-XVI].faa, ftp.sanger.ac.uk:/pub/databases/C. elegans sequences/C elegans proteins 1998-10-16.pep and fly genome database (http://www.fruitfly.org/ sequence/sequence db/aa gadfly.dros)], (3) protein domain database (Pfam, release 6.6)20 and (4) our own database (HUGE, http://www.kazusa.or.jp/huge).15 The functions of 31 gene products are classified in Table 2. Among them, 21 gene products exhibited significant sequence similarity to functionally annotated proteins (Table 2-1), and the functions of the other 10 gene products were predicted based on the presence of functional motifs (Table 2-2). Notably, 25 gene products (81% of the genes functionally annotated as described above) were suggested to have functions relating to cell signaling/communication, nucleic acid management or cell structure/motility. Homology searches against the protein databases deduced from yeast, nematode and fly full genome sequences revealed that KIAA1942, KIAA1945, and KIAA1970 were homologues commonly shared by these three eukaryotic organisms and humans. KIAA1945 and KIAA1970 gene products had homology to ornithine decarboxylase and glutamyl-tRNA synthetase, respectively. The KIAA1942 gene product had no homology to any functionally annotated proteins although it might play a very basic role in these organisms (Table 3).
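The internal dT-priming criterion described in Section 2 (no canonical polyadenylation signal near the cDNA 3′ end, and an adenine-rich stretch in the genomic sequence immediately downstream) can be sketched as a simple predicate. The paper does not state window sizes or thresholds, so the 40-nt signal window, the 20-nt downstream window, the 70% adenine cutoff, and the inclusion of the ATTAAA variant alongside AATAAA are all assumptions chosen for illustration; the function name is hypothetical.

```python
# Illustrative heuristic only; all thresholds below are assumed, not from the paper.
POLYA_SIGNALS = ("AATAAA", "ATTAAA")  # canonical signal and a common variant


def looks_internally_primed(cdna_3prime, genomic_downstream,
                            signal_window=40, a_window=20, min_a_fraction=0.7):
    """Return True when a cDNA 3' end resembles an internal dT-priming event.

    cdna_3prime:        sequence at the 3' end of the cDNA (poly(A) removed)
    genomic_downstream: genomic sequence immediately after the cDNA 3' end
    """
    tail = cdna_3prime.upper()[-signal_window:]
    has_signal = any(signal in tail for signal in POLYA_SIGNALS)

    downstream = genomic_downstream.upper()[:a_window]
    a_fraction = downstream.count("A") / len(downstream) if downstream else 0.0

    # Internal priming: no signal upstream, A-rich template downstream.
    return (not has_signal) and a_fraction >= min_a_fraction
```

With these assumed settings, a cDNA end lacking a signal and followed by an A-rich genomic stretch is flagged, whereas a cDNA end that carries AATAAA is not.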

Table 2. Functional classifications of the gene products.

2-1. Predicted function based on homology search a)

Columns: Gene product; aa res.; nr ID; aa res. (homologue); % identity; % coverage c); Definition. Rows are grouped by function b).

Nucleic acid management
KIAA1947   608  AJ388557   873   54   87  mRNA for zinc finger protein, clone BC3 - dog
KIAA1948   487  S42077     548   73   93  zinc finger protein 30 - mouse
KIAA1951   679  B32891     651   35   48  finger protein 2, placental - human
KIAA1954   468  Q61967     636   85  100  Transcription initiation factor IIB 5 - Halobacterium sp.
KIAA1956   680  AJ276316   659   50   93  mRNA for zinc finger protein (ZNF304 gene) - human
KIAA1962   746  AF277901   733   55   99  zinc finger protein HIT-10 mRNA, complete cds - rat
KIAA1966   480  D78303     712   95  100  RNA splicing-related protein, complete cds - rat
KIAA1969   595  S35305    1191   71   90  zinc finger protein ZNF91 - human
KIAA1971   830  AF201390   983   37   90  p300 transcriptional cofactor JMY mRNA, complete cds - mouse
KIAA1979   309  U67082     591   48   73  KRAB-zinc finger protein KZF-1 mRNA, complete cds - rat
KIAA1982   599  S35305    1191   62   99  zinc finger protein ZNF91 - human

Cell signaling/communication
KIAA1953   824  O02668     935   35   95  Probable G protein-coupled receptor GPR6 - rat
KIAA1955   947  JE0110    1000   36   89  mitotic control protein dis3 homolog - human
KIAA1964   757  U85711     756   51   97  phospholipase C delta-1 mRNA, complete cds - mouse
KIAA1965   565  AF288223   751   35   90  Crossveinless 2 (CV-2) mRNA, complete cds - Drosophila melanogaster
KIAA1973  1176  T31068    1115   93   95  N-methyl-D-aspartate receptor homolog NMDAR-L - rat

Metabolism
KIAA1939  1082  AF067820  1164   42   90  ATPase II mRNA, partial cds - human
KIAA1945   420  AY050635   480   95   91  ornithine decarboxylase-like protein variant 2 mRNA, complete cds - human
KIAA1963   488  Q9NPZ5     323  100   66  BETA-1,3-GLUCURONYLTRANSFERASE 2 - human

Protein management
KIAA1970   519  T00743     313  100   56  glutamyl tRNA synthetase homolog - human
KIAA1974   362  AF218812   558   59   93  putative cytoplasmic aminopeptidase mRNA, complete cds - Drosophila melanogaster

a) Homology search was performed using the Smith-Waterman algorithm, using BioView Toolkit and GeneMatcher (revision 3.3, Paracel Inc. USA) against nr database (see text). The homologous protein with the highest score was listed, when it satisfied the following conditions: i) the protein was functionally annotated, ii) the aligned region exceeded 200 amino acid residues, and iii) percent identity in the aligned region was 30% or greater.
b) Function was classified based on the annotation of the entry of the homologous protein in the database.
c) The values represent the ratio of the length of the aligned region to the original length of the query sequence expressed as a percentage.

Table 2. Continued. 2-2. Predicted function by motif search a)

Columns: Gene product; aa res.; Pfam ID; E-value c); Definition. Rows are grouped by function b); indented rows are further hits for the gene product above.

Cell signaling/communication
KIAA1941  1233  PF02759  6.80E-58  RUN domain
                PF02759  1.00E-09  RUN domain
                PF00566  4.50E-14  TBC domain
KIAA1942   445  PF00400  6.60E-05  WD domain, G-beta repeat
                PF00400  1.30E-03  WD domain, G-beta repeat
                PF00400  1.30E-05  WD domain, G-beta repeat
KIAA1957   481  PF00168  9.60E-03  C2 domain
KIAA1959   650  PF00018  4.30E-05  SH3 domain
                PF00627  1.30E-04  UBA/TS-N domain
                PF00300  4.40E-04  Phosphoglycerate mutase family
KIAA1975   597  PF01412  4.60E-48  Putative GTP-ase activating protein for Arf
                PF00169  6.40E-17  PH domain
                PF00023  4.20E-09  Ank repeat
KIAA1988   362  PF00400  1.90E-01  WD domain, G-beta repeat
                PF00400  5.30E-02  WD domain, G-beta repeat

Nucleic acid management
KIAA1952   735  PF00096  9.50E-02  Zinc finger, C2H2 type
KIAA1972   576  PF00622  8.90E-26  SPRY domain
                PF00097  1.50E-02  Zinc finger, C3HC4 type (RING finger)
KIAA1978   384  PF00076  1.20E-14  RNA recognition motif
                PF00076  1.20E-01  RNA recognition motif

Protein management
KIAA1940   821  PF00646  3.80E-01  F-box domain
                PF01576  3.50E-01  Myosin tail

a) Motif search was performed using HMMER2.1.1 against the Pfam database (release 6.6).
b) Function was classified based on the annotation of the Pfam entry which was hit in the query sequence.
c) Only the entries with an expectation value (E-value) less than 1.0 were presented.

In this study, 9 KIAA genes were predicted to encode proteins containing multiple copies of the C2H2-type zinc finger domain. Interestingly, 6 of the 9 genes were found to be located on chromosome 19 (chr 19). An unusual abundance of the genes encoding such multiple zinc finger proteins on chr 19 has been observed by other workers.21–23 These observations urged us to examine how many multiple zinc finger proteins encoded by KIAA genes are located on chr 19. So far, we have identified 85 KIAA genes which encode zinc finger proteins, 22 of which were mapped to chr 19. Four newly identified zinc finger proteins on chr 19 (KIAA1947, KIAA1948, KIAA1956, and KIAA1969) contained tandemly repeated zinc finger domains in the

Table 3. Homologues of the newly identified genes found in various databasesa)

Columns: New gene; aa res. c); ID in database; aa res. (homologue); % identity; % coverage d); Comment e). Rows are grouped by database b); indented rows are further homologues of the new gene above.

HUGE
KIAA1939  1082  KIAA1137   933   68   86  haloacid dehalogenase-like hydrolase
KIAA1941  1233  KIAA0397  1016   50   95  RUN domain
KIAA1944   657  KIAA1906   593   52   92  unclassified
KIAA1947   608  KIAA1508   573   48   86  C2H2-type zinc finger proteins
                KIAA1956   680   53  100
                KIAA1982   599   45   86
KIAA1948   487  KIAA0961   530   77   93  C2H2-type zinc finger proteins
                KIAA1559   544   71   93
                KIAA1615   484   51   93
KIAA1954   468  KIAA1396   551   50   94  C2H2-type zinc finger proteins
                KIAA1827   469   45   97
                KIAA1508   573   41   94
                KIAA1473   574   49   82
                KIAA0628   540   47   83
                KIAA1559   544   50   83
                KIAA1829   557   50   84
                KIAA0961   530   50   81
                KIAA1198   553   45   91
KIAA1955   950  KIAA1008   935   35   91  RNB-like protein
KIAA1956   680  KIAA1508   573   51   84  C2H2-type zinc finger proteins
                KIAA1611   813   44   87
                KIAA1349   752   45   90
                KIAA0065   848   45   90
                KIAA1871   783   43   88
                KIAA0798   682   42   99
                KIAA0412   720   41   85
                KIAA0972   702   38   87
                KIAA1962   746   38   82
                KIAA1982   599   45   80
KIAA1961   943  KIAA1450  1139   50   99  unclassified
KIAA1962   746  KIAA0426   613   38   93  C2H2-type zinc finger proteins
                KIAA1015   841   34   94
KIAA1969   595  KIAA1473   574   74   90  C2H2-type zinc finger proteins
                KIAA1559   544   49   84
                KIAA0961   530   47   88
                KIAA0412   720   42   92
                KIAA1588   613   41   92
                KIAA0798   682   41   90
                KIAA1198   553   41   86
                KIAA1874   522   44   85
                KIAA1806   481   41   81
KIAA1982   599  KIAA1508   573   41   89  C2H2-type zinc finger proteins

yeast
KIAA1942   445  yeast.prot|6323779|   511   35   99  Rsa2p
KIAA1945   420  yeast.prot|6322664|   466   38   87  Ornithine decarboxylase; Spe1p
KIAA1955   950  yeast.prot|6324552|  1001   33   82  3'-5' exoribonuclease complex subunit; Dis3p
KIAA1970   519  yeast.prot|6324540|   536   36   93  Mitochondrial glutamyl-tRNA synthetase; Mse1p

C. elegans
KIAA1939  1082  Y49E10.11  1108   37   88  P-type ATPase
                F02C9.3    1064   30   85  ATPase
KIAA1942   445  Y54H5A.1    469   43   94  unclassified
KIAA1945   420  K11C4.4     422   37   91  odc-1 ornithine decarboxylase
KIAA1964   757  R05G6.8     751   37   89  phospholipase
                K10F12.3    895   32   95  phospholipase C
KIAA1970   519  T07A9.2     474   37   93  tRNA synthetase
KIAA1972   576  F16A11.1    673   48   86  Zinc finger, C3HC4 type (RING finger)

D. melanogaster
KIAA1939  1082  CG17034  1297   40   90  unclassified
                CG18419  1170   30   91  unclassified
                CG9981   1060   33   88  transporter
KIAA1942   445  CG12792   447   54   96  unclassified
KIAA1945   420  CG8721    394   38   88  unclassified
                CG8719    339   32   86  unclassified
KIAA1954   468  CG5245    501   33   89  transcription factor
                CG4360    556   31   82  transcription factor
KIAA1955   950  CG6413    982   33   90  unclassified
KIAA1959   650  CG13604   751   34   93  unclassified
KIAA1965   565  CG15671   665   34   90  unclassified
KIAA1970   519  CG4573    511   47   93  glutamate--tRNA ligase

nr
KIAA1939   1082  O43520    1251   55   95  Adenomatous polyposis coli protein - rat
KIAA1942*   445  BC002440   446  100  100  clone MGC:2600 IMAGE:3347366, mRNA, complete cds - human
KIAA1944    657  BC006896   638   33   94  clone MGC:11927 IMAGE:3599652, mRNA, complete cds - mouse
KIAA1945*   420  BC010449   460  100   91  clone MGC:18232 IMAGE:4156927, mRNA, complete cds - human
KIAA1947    608  AJ276316   659   50   97  mRNA for zinc finger protein (ZNF304 gene) - human
KIAA1948    487  AB060218   519   77   93  brain cDNA clone:QflA-14093, full insert sequence - Macaca fascicular
KIAA1951*   679  BC013013   670  100   99  clone MGC:4267 IMAGE:3531734, mRNA, complete cds - human
KIAA1953*   824  AK027375   942  100  100  cDNA FLJ14469 - human
KIAA1954    468  P52742     469   56   91  Transcription intermediary factor 1-gamma (TIF1-gamma) - human
KIAA1955    950  JE0110    1000   36   89  mitotic control protein dis3 homolog - human
KIAA1956    680  AK027616   575   65   82  cDNA FLJ14710 - human
KIAA1959*   650  BC007541   638  100   98  clone MGC:15437 IMAGE:2958242, mRNA, complete cds - human
KIAA1960    871  BC005761   760   56   97  clone MGC:11988 IMAGE:3601742, mRNA, complete cds - mouse
KIAA1962    746  AF277901   733   55   99  zinc finger protein HIT-10 mRNA, complete cds - rat
KIAA1964*   757  BC010668   613  100   81  clone MGC:9744 IMAGE:3854215, mRNA, complete cds - human
KIAA1969    595  Q9Y2Q1     535   71   88  ZINC FINGER PROTEIN 257 - human
KIAA1970    519  Q06560     505   38   93  POL polyprotein - Human T-cell leukemia virus type I
KIAA1971    830  AF201390   983   37   90  p300 transcriptional cofactor JMY mRNA, complete cds - mouse
KIAA1972*   576  BC013173   576  100  100  clone MGC:17340 IMAGE:4340287, mRNA, complete cds - human
KIAA1973   1176  T31068    1115   93   95  N-methyl-D-aspartate receptor homolog NMDAR-L - rat
KIAA1975    597  AF411132   663   90   97  MRIP2 (MRIP2) mRNA, complete cds - human
KIAA1977    606  AK012494   655   81   99  RIKEN full-length enriched library, clone:2700067D09 - mouse
KIAA1982    599  AK024442   705   58   94  mRNA for FLJ00032 protein, partial cds - human
KIAA1985    955  AK000363  1064   37   91  cDNA FLJ20356, clone HEP15821 - human

a) The definition of homologues used here was the proteins found in the databases satisfying the following conditions: i) the length ranged from 80% to 125% of the query sequence; ii) the ratio of the length of the aligned region to that of the original sequence of the query was 80% or greater; and iii) percent identity was 30% or greater. The method of homology search was the same as that explained in Table 2-1.
b) The following databases were used. HUGE, our cDNA-encoded protein database, nr database, and databases of predicted protein sequences from yeast, C. elegans and D. melanogaster.
c) The number of amino acid residues of the gene product.
d) The values represent the ratio of the length of the aligned region to the original length of the query sequence expressed as a percentage.
e) For entries from databases, C. elegans, yeast and nr, the annotations are listed. For D. melanogaster, predicted functions described in the FlyBase are listed. Regarding KIAA genes marked with asterisks, identical genes have been deposited in public databases during the preparation of this manuscript.
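The three-part homologue definition in footnote a) of Table 3 can be written as a small predicate. The sketch below is only an illustration of those stated thresholds; the function names are hypothetical, and the aligned length used in the coverage example is an assumed value, since the tables report coverage rather than aligned length.

```python
def coverage_pct(aligned_len, query_len):
    """Aligned-region length as a percentage of the query length (footnote d)."""
    return 100.0 * aligned_len / query_len


def is_homologue(query_len, subject_len, coverage, identity):
    """Apply the Table 3 criteria: length ratio 80-125%, coverage >= 80%, identity >= 30%."""
    length_ratio = subject_len / query_len
    return (0.80 <= length_ratio <= 1.25
            and coverage >= 80.0
            and identity >= 30.0)


# KIAA1939 (1082 aa) vs KIAA1137 (933 aa; 68% identity, 86% coverage), from Table 3.
kiaa1939_vs_kiaa1137 = is_homologue(1082, 933, 86, 68)

# KIAA1969 (595 aa) vs ZNF91 (1191 aa; 71% identity, 90% coverage), from Table 2-1.
# The subject is about twice the query length, so the length criterion fails.
kiaa1969_vs_znf91 = is_homologue(595, 1191, 90, 71)

# Assumed aligned length of 453 residues for a 487-residue query.
example_coverage = coverage_pct(453, 487)
```

The second case shows why a best hit can appear in Table 2-1, which has no length-ratio condition, and still be absent from Table 3.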

[Figure 2 graphic: unrooted phylogenetic tree with a domain diagram for each protein; domain scale 0–1000 a.a.; branch-length scale bar 0.1. Tip labels as listed in the figure: KIAA1611 (NT_011091), KIAA1827 (NT_011091), KIAA1473 (NT_026483), KIAA1969 (NT_026483), KIAA1198 (NT_011176), KIAA1588 (NT_025157), KIAA1806 (NT_025155), KIAA1396 (NT_025152), KIAA1829 (NT_011233), KIAA1615 (NT_011192), KIAA1559 (NT_011192), KIAA1948 (NT_011192), KIAA0961 (NT_011233), KIAA0412 (NT_011104), KIAA0798 (NT_011091), KIAA1947 (NT_011104), KIAA1956 (NT_011104), KIAA1508 (NT_011104).]

Figure 2. An unrooted phylogenetic tree of 18 KIAA zinc finger proteins that were encoded in chr 19. The length of each branch indicates the number of amino acid substitutions per residue site. For each KIAA protein, domains were identified using HMMer 2.1 against the Pfam database and are shown as well. Solid and shaded boxes represent the KRAB domain and C2H2-type zinc finger domain, respectively. The NCBI RefSeq ID of the genome fragment that encodes the corresponding KIAA protein is given in parenthesis after each KIAA number.
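Distance-based trees such as the one in Figure 2 start from a pairwise distance computed over alignment columns that contain no gaps. A minimal sketch of that preprocessing step follows. It uses the simple proportion of differing sites (p-distance) as a stand-in; the paper instead used PROTDIST with the Dayhoff PAM matrix and NEIGHBOR from Phylip, which this code does not reproduce. The function names and the toy alignment are hypothetical.

```python
def gap_free_columns(alignment):
    """Indices of alignment columns in which no sequence has a gap ('-')."""
    length = len(alignment[0])
    return [i for i in range(length) if all(seq[i] != "-" for seq in alignment)]


def p_distance(seq_a, seq_b, columns):
    """Fraction of the given columns at which two aligned sequences differ."""
    if not columns:
        raise ValueError("no gap-free columns to compare")
    differences = sum(1 for i in columns if seq_a[i] != seq_b[i])
    return differences / len(columns)


# Toy alignment of three equal-length sequences (not real KIAA data).
toy_alignment = ["MKT-LC", "MRTALC", "MKTA-C"]
columns = gap_free_columns(toy_alignment)
distance_matrix = [
    [p_distance(a, b, columns) for b in toy_alignment]
    for a in toy_alignment
]
```

A matrix of this shape is what a neighbor-joining program takes as input; a model-based distance such as Dayhoff PAM additionally corrects for multiple substitutions at the same site.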

C-terminal portion. Similar domain organizations were also found in 14 out of these 22 KIAA proteins (Fig. 2), and some pairs among the 18 KIAA proteins shared considerable sequence similarity for almost the entire regions (Table 3). All of them encoded 9–20 consecutive C2H2-type zinc finger domains in the C-terminal portions and were located on chr 19. A KRAB box domain was found in the N-terminal portion in 12 of the 18 KIAA proteins. To investigate the evolutionary relationships and to consider the process of evolution of these zinc finger proteins encoded on chr 19, we performed phylogenetic analysis for the 18 KIAA proteins mentioned above. When we compared the amino acid sequences of all the 18 KIAA proteins by CLUSTALW,24 we found that reliable sequence alignment was limited in the C-terminal region comprised of multiple zinc finger domains. However, the alignment contained 335 amino acid residue sites without gaps, which was considered to be long enough for further phylogenetic analysis. We then reconstructed an unrooted phylogenetic tree based on the sequence alignment by applying NEIGHBOR in the Phylip computer program package.25 The amino acid sequence distance was calculated by PROTDIST in Phylip using the Dayhoff PAM matrix. The result is shown in Fig. 2. Although the aligned amino acid sequences were rather divergent from each other, we found that the 18 KIAA proteins could be divided into four large clusters. The major type of KIAA protein that contained 13 zinc finger domains appeared in every cluster in the phylogenetic tree, which implied that it was an ancestral type of the protein. Genome mapping of KIAA genes indicated that the closely related genes in the phylogenetic tree were located in the same genome fragment. This observation suggests that the gene duplication of KIAA genes on chr

[Figure 3 image: color-coded expression matrix for KIAA1939–KIAA1988, shown in two panels (KIAA1939–KIAA1963 and KIAA1964–KIAA1988). Columns: heart, brain, lung, liver, skeletal muscle, kidney, pancreas, spleen, testis, ovary, amygdala, corpus callosum, cerebellum, caudate nucleus, hippocampus, substantia nigra, subthalamic nucleus, thalamus, spinal cord, fetal brain, and fetal liver. Color scale: <1, 10, 100, >1000.]

Figure 3. Expression profiles of 50 newly identified genes examined by RT-PCR ELISA. The tissue expression levels of the 50 human genes were analyzed by RT-PCR ELISA according to the methods previously described in detail.26 Gene names are given as KIAA numbers at the left side of each set of color codes. Tissue and brain region names are indicated above the top sets of color codes. A color conversion panel (shown at bottom) was used for displaying mRNA levels as color codes. The mRNA levels are expressed in equivalent amounts (fg) of the authentic cDNA plasmids in 1 ng of starting poly(A)+ RNAs. Besides 10 tissues, 9 regions of the adult central nervous system (amygdala, corpus callosum, cerebellum, caudate nucleus, hippocampus, substantia nigra, subthalamic nucleus, thalamus, and spinal cord) and fetal brain were included in the expression profiling. As a control, mRNA levels in fetal liver were also examined.
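The color conversion panel displays mRNA levels on a logarithmic scale marked <1, 10, 100, and >1000. As a rough illustration only, the sketch below bins a level (fg of cDNA plasmid equivalent per ng of poly(A)+ RNA) into decade classes. The exact bin boundaries, the labels, and the function name `level_bin` are assumptions, since the caption does not specify how the colors were assigned, and the sample values are hypothetical.

```python
def level_bin(fg_per_ng):
    """Return an assumed decade-bin label for an mRNA level.

    The level is in fg of authentic cDNA plasmid equivalent per ng of
    starting poly(A)+ RNA. The bin edges are an assumption based on the
    tick marks of the color panel (<1, 10, 100, >1000).
    """
    if fg_per_ng < 0:
        raise ValueError("mRNA level cannot be negative")
    if fg_per_ng < 1:
        return "<1"
    if fg_per_ng < 10:
        return "1-10"
    if fg_per_ng < 100:
        return "10-100"
    if fg_per_ng <= 1000:
        return "100-1000"
    return ">1000"


# Hypothetical levels for one gene across three tissues.
profile = {"heart": 0.4, "kidney": 35.0, "liver": 2400.0}
binned = {tissue: level_bin(value) for tissue, value in profile.items()}
```

Binning by decade keeps genes that differ by orders of magnitude visually distinct while ignoring small fluctuations within a decade.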

19 encoding zinc finger proteins often produced a new copy of the gene in relatively close proximity to the original ones, and functional divergence, like a change in the copy number of zinc finger domains, occurred later and rather frequently.

4. Expression Profiles of Predicted Genes

The tissue expression profiles of 50 human genes newly identified in this study are shown in Fig. 3. The expression profiles in 10 human tissues, 8 brain regions, spinal cord, and fetal liver and brain were determined by quantitative RT-PCR coupled with an enzyme-linked immunosorbent assay (ELISA) as described previously.26 The expression of KIAA1974, which encoded an aminopeptidase-like protein, was high in all tissues examined. KIAA1946 was highly expressed in all brain regions we examined. KIAA1964, which was highly expressed in heart and skeletal muscle in addition to all brain regions, seemed to encode a human homologue of mouse phospholipase C delta. KIAA1980 was predominantly expressed in kidney and spleen although its functions have not been predicted. Predominant expression of KIAA1960 and KIAA1984 in liver and fetal liver, respectively, suggested their essential functions in liver. These expression profiles provide us with important information for identifying biologically important genes characterized in this project.

Acknowledgements: This project was supported by grants from the Kazusa DNA Research Institute. We thank Tomomi Tajino, Keishi Ozawa, Tomomi Kato, Kazuhiro Sato, Akiko Ukigai, Kazuko Yamada, Kiyoe Sumi, Takashi Watanabe, Kozue Yamane, Naoko Shibano, Mina Waki, Nobue Kashima, and Sachiko Minorikawa for their technical assistance.

References

1. Venter, J. C., Adams, M. D., Myers, E. W. et al. 2001, The sequence of the human genome, Science, 291, 1304–1351.
2. International Human Genome Sequencing Consortium. 2001, Initial sequencing and analysis of the human genome, Nature, 409, 860–921.
3. Hogenesch, J. B., Ching, K. A., Batalov, S. et al. 2001, A comparison of the Celera and Ensembl predicted gene sets reveals little overlap in novel genes, Cell, 106, 413–415.
4. Strausberg, R. L., Feingold, E. A., Klausner, R. D., and Collins, F. S. 1999, The mammalian gene collection, Science, 286, 455–457.
5. Wiemann, S., Weil, B., Wellenreuther, R. et al. 2001, Toward a catalog of human genes and proteins: Sequencing and analysis of 500 novel complete protein coding human cDNAs, Genome Res., 11, 422–435.
6. Nomura, N., Miyajima, N., Sazuka, T. et al. 1994, Prediction of the coding sequences of unidentified human genes. I. The coding sequences of 40 new genes (KIAA0001–KIAA0040) deduced by analysis of randomly sampled cDNA clones from human immature myeloid cell line KG-1, DNA Res., 1, 27–35.
7. Nagase, T., Kikuno, R., and Ohara, O. 2001, Prediction of the coding sequences of unidentified human genes. XXI. The complete sequences of 60 new cDNA clones from brain which code for large proteins, DNA Res., 8, 179–187.
8. Ohara, O., Nagase, T., Ishikawa, K.-I. et al. 1997, Construction and characterization of human brain cDNA libraries suitable for analysis of cDNA clones encoding relatively large proteins, DNA Res., 4, 53–59.
9. Hirosawa, M., Nagase, T., Murahashi, Y., Kikuno, R., and Ohara, O. 2001, Identification of novel transcribed sequences on human chromosome 22 by expressed sequence tag mapping, DNA Res., 8, 1–9.
10. Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D. J. 1997, Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Res., 25, 3389–3402.
11. Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Rapp, B. A., and Wheeler, D. L. 2000, GenBank, Nucleic Acids Res., 28, 15–18.
12. Burge, C. and Karlin, S. 1997, Prediction of complete gene structures in human genomic DNA, J. Mol. Biol., 268, 78–94.
13. Borodovsky, M., McIninch, J. D., Koonin, E. V., Rudd, K. E., Medigue, C., and Danchin, A. 1995, Detection of new genes in a bacterial genome using Markov Models for three gene classes, Nucleic Acids Res., 23, 3554–3562.
14. Hirosawa, M., Ishikawa, K.-I., Nagase, T., and Ohara, O. 2000, Detection of spurious interruptions of protein-coding regions in cloned cDNA sequences by GeneMark analysis, Genome Res., 10, 1333–1341.
15. Kikuno, R., Nagase, T., Waki, M., and Ohara, O. HUGE: a database for human large proteins identified in the Kazusa cDNA sequencing project, Nucleic Acids Res., in press.
16. Florea, L., Hartzell, G., Zhang, Z., Rubin, G. M., and Miller, W. 1998, A computer program for aligning a cDNA sequence with a genomic DNA sequence, Genome Res., 8, 967–974.
17. Goffeau, A., Barrell, B. G., Bussey, H. et al. 1996, Life with 6000 genes, Science, 274, 546–567.
18. The C. elegans Sequencing Consortium. 1998, Genome sequence of the nematode, C. elegans: A platform for investigating biology, Science, 282, 2012–2018.
19. Adams, M. D., Celniker, S. E., Holt, R. A. et al. 2000, The genome sequence of Drosophila melanogaster, Science, 287, 2185–2195.
20. Bateman, A., Birney, E., Durbin, R. et al. 2000, The Pfam protein families database, Nucleic Acids Res., 28, 263–266.
21. Huebner, K., Druck, T., Croce, C., and Thiesen, H.-J. 1991, Twenty-seven nonoverlapping zinc finger cDNAs from human T cells map to nine different chromosomes with apparent clustering, Am. J. Hum. Genet., 48, 726–740.
22. Hoovers, J. M. N., Mannens, M., John, R. et al. 1992, High-resolution localization of 69 potential human zinc finger protein genes: A number are clustered, Genomics, 12, 254–263.
23. Lichter, P., Bray, P., Ride, T., Dawid, I. B., and Ward, D. C. 1992, Clustering of C2-H2 zinc finger motif sequences within telomeric and fragile site regions of human chromosomes, Genomics, 13, 999–1007.
24. Thompson, J. D., Higgins, D. G., and Gibson, T. J. 1994, CLUSTAL W: improving the sensitivity of progressive sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice, Nucleic Acids Res., 22, 4673–4680.
25. Felsenstein, J. 1996, Inferring phylogenies from protein sequences by parsimony, distance, and likelihood methods, Methods Enzymol., 266, 418–427.
26. Nagase, T., Ishikawa, K.-I., Suyama, M. et al. 1998, Prediction of the coding sequences of unidentified human genes. XI. The complete sequences of 100 new cDNA clones from brain which code for large proteins in vitro, DNA Res., 5, 277–286.
27. Kozak, M. 1996, Interpreting cDNA sequences: some insights from studies on translation, Mammalian Genome, 7, 563–574.