Identification of Seven Novel Genes and Pseudogenes
The Pharmacogenomics Journal (2004) 4, 54–65
© 2004 Nature Publishing Group All rights reserved 1470-269X/04 $25.00
www.nature.com/tpj

ORIGINAL ARTICLE

Human cytosolic sulfotransferase database mining: identification of seven novel genes and pseudogenes

RR Freimuth1,4, M Wiepert2, CG Chute2, ED Wieben3, RM Weinshilboum1

1Department of Molecular Pharmacology and Experimental Therapeutics, Mayo Graduate School-Mayo Clinic, Rochester, MN, USA; 2Department of Health Sciences Research, Mayo Graduate School-Mayo Clinic, Rochester, MN, USA; 3Department of Biochemistry and Molecular Biology, Mayo Graduate School-Mayo Clinic, Rochester, MN, USA

Correspondence: Dr R Weinshilboum, Department of Molecular Pharmacology and Experimental Therapeutics, Mayo Graduate School-Mayo Clinic, Rochester, MN 55905, USA. Tel: 507 284 2246 E-mail: [email protected]

4Current address: Department of Medicine, Division of Molecular Oncology, Washington University School of Medicine, St Louis, MO 63110, USA

Received: 28 May 2003
Revised: 30 September 2003
Accepted: 13 October 2003
Published online 16 December 2003

ABSTRACT

A total of 10 SULT genes are presently known to be expressed in human tissues. We performed a comprehensive genome-wide search for novel SULT genes using two different but complementary approaches, and developed a novel graphical display to aid in the annotation of the hits. Seven novel human SULT genes were identified, five of which were predicted to be pseudogenes, including two processed pseudogenes and three pseudogenes that contained introns. Those five pseudogenes represent the first unambiguous SULT pseudogenes described in any species. Expression-profiling studies were conducted for one novel gene, SULT6B1, and a series of alternatively spliced transcripts were identified in the human testis. SULT6B1 was also present in chimpanzee and gorilla, differing at only seven encoded amino-acid residues among the three species. The results of these database mining studies will aid in studies of the regulation of these SULT genes, provide insights into the evolution of this gene family in humans, and serve as a starting point for comparative genomic studies of SULT genes.

The Pharmacogenomics Journal (2004) 4, 54–65. doi:10.1038/sj.tpj.6500223

Keywords: sulfotransferase; SULT; sulfation; gene; pseudogene; annotation; database mining

INTRODUCTION

Sulfate conjugation is an important pathway in the biotransformation of many exogenous and endogenous compounds including drugs, other xenobiotics, neurotransmitters, and hormones.1,2 The cytosolic SULT enzymes that catalyze these reactions are members of a gene superfamily. Cytosolic SULTs are usually active as homodimers and share two highly conserved regions of the amino-acid sequence, termed Regions I and IV.1,3 These regions of sequence are important in binding the cosubstrate sulfate donor molecule 3′-phosphoadenosine 5′-phosphosulfate (PAPS).4–6 In addition, Region IV forms part of the dimerization interface region for these enzymes.7 SULTs can be classified into families whose members share at least 45% amino-acid sequence identity, and subfamilies that include enzymes that share at least 60% amino-acid sequence identity.3,8 It should be noted that the membrane-bound sulfotransferases which are located in the Golgi and catalyze the sulfation of macromolecules form a superfamily distinct from that of the cytosolic SULTs,9 and were not included in the present study.

In all, 10 cytosolic SULT genes are presently known to be expressed in human tissues: SULTs 1A1, 1A2, 1A3, 1B1, 1C1, 1C2, 1E1, 2A1, 2B1, and 4A1.3,10–15 While the cDNAs for some of those genes were cloned after purification and sequencing of the expressed protein, many were subsequently identified and cloned using homology-based techniques that took advantage of the highly conserved sequences in Regions I and IV. Eight of the human genes were characterized using PCR and DNA sequencing methods,12,16–21 but the genes for SULT1B1 and 4A1 were characterized by annotation of publicly available genomic sequence data generated by the Human Genome Project (unpublished observations22). All 10 of these genes share very similar gene structures, both with regard to the number and length of coding exons as well as the locations of splice junctions within the translated amino-acid sequence (Figure 1).

In the present study, we took advantage of the genomic sequence data available in public databases, and set out to perform a comprehensive genome-wide search for novel human SULT genes and complete annotation of the known SULT gene clusters. We used two different but complementary approaches to search the public sequence database. The first approach utilized explicit SULT query sequences and BLAST23 searches, followed by the use of two different, more sensitive sequence alignment programs. The second approach used the HMMER24 and GeneWise25 programs, both of which used a hidden Markov model (HMM) profile26 as a query. In addition, we developed a novel graphical display to view the results of the searches, thereby improving the efficiency of the manual annotation step.

A total of seven novel SULT genes were identified in addition to the 10 already known to be expressed in humans. Five of the seven were predicted to be pseudogenes based either on sequence mutations or exon rearrangement. Three of those five contained introns and two were processed pseudogenes, the first SULT processed pseudogenes to be described in any species. The remaining two genes, SULT1C3 and SULT6B1, contained no obvious inactivating mutations. Expression-profiling studies failed to detect any SULT1C3 transcripts; so this gene is probably more accurately described as a ‘putative’ gene. A series of alternatively spliced SULT6B1 transcripts were identified in the testis, confirming the expression of this gene in humans. SULT6B1 is also highly conserved in two primate species closely related to humans, the chimpanzee and gorilla. The results of this database mining exercise will aid in future studies of the regulation of these genes, will provide insights into the evolution of this gene family in humans, and will serve as a starting point for future comparative genomic studies of SULT genes.

RESULTS

BLAST and Sequence Alignment Searches

To perform the first search of the human genome sequence database, 17 SULT protein sequences were selected for use as query sequences in a TBLASTN search. Those query sequences included the 11 SULT isoforms encoded by the 10 genes known to be expressed in humans (SULT2B1 encodes two isoforms as a result of alternative splicing), four non-human SULT sequences that represented families or subfamilies that contained no known human orthologs or paralogs, and two different consensus sequences. These 17 TBLASTN searches produced a total of 1765 hits that were each annotated manually. In total, 996 hits (56%) corresponded to one of the 10 SULT genes known to be expressed in humans, 422 (24%) were hits to novel (previously uncharacterized) SULT genes, 71 (4%) were determined to be nonspecific, and 276 (16%) were ambiguous and retained for further analysis. The 276 ambiguous hits were reduced to 195 nonredundant hits that represented different regions of genomic sequence, as well as one hit from each of the 10 known human SULT genes. A 500 kb region surrounding each of those 195 hits was then searched using the FrameSearch and TFastA programs.
Approximately 10 360 searches were performed with those two programs, generating 39 125 hits (29 395 from TFastA and 9730 from FrameSearch). To aid in the annotation of those hits, a filter was developed that removed 32 798 hits (84%), leaving 6327 for further analysis. This filter is described in detail in the Materials and methods section. After plotting and analyzing those hits, seven novel SULT genes were identified and annotated.

After the novel SULT genes were annotated, we performed a second round of searches, using four of the seven novel SULT sequences as queries for TBLASTN. In all, 303 hits were returned and annotated. Of those, 256 (84%) corresponded to known SULT genes and eight hits were retained for further analysis. The follow-up analysis included 440 queries performed between the FrameSearch and TFastA programs, producing 1737 hits. Altogether, 1492 (86%) of those hits were eliminated by the filter. No additional SULT genes were identified after annotating the remaining 245 hits.

Figure 1 Previously-characterized human SULT gene structures. Structures for the 10 SULT genes that are known to be expressed in humans and that have been reported previously are illustrated. Black boxes represent coding region, white boxes represent UTR, and hatched boxes represent alternatively spliced regions. Lengths are listed for each exon (in bp) and for the overall gene (in kb).

Table 1 Human SULT gene chromosomal localizations

Gene                         Chromosomal localization
SULT1A1, SULT1A2, SULT1A3    16p11.2–12.1
SULT1D2P                     3q22.2

HMMER and GeneWise Searches

In parallel with the first approach, a separate but complementary strategy that utilized HMMs was also used to search for novel SULT genes. Initially, an HMM profile was constructed using the sequences of the 11 human SULT isoforms known to be expressed in humans.