The Pharmacogenomics Journal (2004) 4, 54–65 © 2004 Nature Publishing Group All rights reserved 1470-269X/04 $25.00 www.nature.com/tpj ORIGINAL ARTICLE

Human cytosolic sulfotransferase database mining: identification of seven novel genes and pseudogenes

RR Freimuth1,4, M Wiepert2, CG Chute2, ED Wieben3, RM Weinshilboum1

1Department of Molecular Pharmacology and Experimental Therapeutics, Mayo Graduate School-Mayo Clinic, Rochester, MN, USA; 2Department of Health Sciences Research, Mayo Graduate School-Mayo Clinic, Rochester, MN, USA; 3Department of Biochemistry and Molecular Biology, Mayo Graduate School-Mayo Clinic, Rochester, MN, USA

Correspondence: Dr R Weinshilboum, Department of Molecular Pharmacology and Experimental Therapeutics, Mayo Graduate School-Mayo Clinic, Rochester, MN 55905, USA. Tel: 507 284 2246 E-mail: [email protected]

4Current address: Department of Medicine, Division of Molecular Oncology, Washington University School of Medicine, St Louis, MO 63110, USA

Received: 28 May 2003. Revised: 30 September 2003. Accepted: 13 October 2003. Published online 16 December 2003.

Novel human SULT gene identification. RR Freimuth et al

ABSTRACT
A total of 10 SULT genes are presently known to be expressed in human tissues. We performed a comprehensive genome-wide search for novel SULT genes using two different but complementary approaches, and developed a novel graphical display to aid in the annotation of the hits. Seven novel human SULT genes were identified, five of which were predicted to be pseudogenes, including two processed pseudogenes and three pseudogenes that contained introns. Those five pseudogenes represent the first unambiguous SULT pseudogenes described in any species. Expression-profiling studies were conducted for one novel gene, SULT6B1, and a series of alternatively spliced transcripts were identified in the human testis. SULT6B1 was also present in chimpanzee and gorilla, differing at only seven encoded amino-acid residues among the three species. The results of these database mining studies will aid in studies of the regulation of these SULT genes, provide insights into the evolution of this gene family in humans, and serve as a starting point for comparative genomic studies of SULT genes.
The Pharmacogenomics Journal (2004) 4, 54–65. doi:10.1038/sj.tpj.6500223

Keywords: sulfotransferase; SULT; sulfation; gene; pseudogene; annotation; database mining

INTRODUCTION
Sulfate conjugation is an important pathway in the biotransformation of many exogenous and endogenous compounds including drugs, other xenobiotics, neurotransmitters, and hormones.1,2 The cytosolic SULT enzymes that catalyze these reactions are members of a gene superfamily. Cytosolic SULTs are usually active as homodimers and share two highly conserved regions of the amino-acid sequence, termed Regions I and IV.1,3 These regions of sequence are important in binding the cosubstrate sulfate donor molecule 3′-phosphoadenosine 5′-phosphosulfate (PAPS).4–6 In addition, Region IV forms part of the dimerization interface region for these enzymes.7 SULTs can be classified into families whose members share at least 45% amino-acid sequence identity, and subfamilies that include enzymes that share at least 60% amino-acid sequence identity.3,8 It should be noted that the membrane-bound sulfotransferases, which are located in the Golgi and catalyze the sulfation of macromolecules, form a superfamily distinct from that of the cytosolic SULTs,9 and were not included in the present study.

In all, 10 cytosolic SULT genes are presently known to be expressed in human tissues: SULTs 1A1, 1A2, 1A3, 1B1, 1C1, 1C2, 1E1, 2A1, 2B1, and 4A1.3,10–15 While the cDNAs for some of those genes were cloned after purification and sequencing of the expressed protein, many were subsequently identified and cloned using homology-based techniques that took advantage of the highly conserved sequences in Regions I and IV. Eight of the human genes were characterized using PCR and DNA sequencing methods,12,16–21 but the genes for SULT1B1 and 4A1 were characterized by annotation of publicly available genomic sequence data generated by the Human Genome Project (unpublished observations22). All 10 of these genes share very similar gene structures, both with regard to the number and length of coding exons as well as the locations of splice junctions within the translated amino-acid sequence (Figure 1).

In the present study, we took advantage of the genomic sequence data available in public databases, and set out to perform a comprehensive genome-wide search for novel human SULT genes and complete annotation of the known SULT gene clusters. We used two different but complementary approaches to search the public sequence database. The first approach utilized explicit SULT query sequences and BLAST23 searches, followed by the use of two different, more sensitive sequence alignment programs. The second approach used the HMMER24 and GeneWise25 programs, both of which used a hidden Markov model (HMM) profile26 as a query. In addition, we developed a novel graphical display to view the results of the searches, thereby improving the efficiency of the manual annotation step.

A total of seven novel SULT genes were identified in addition to the 10 already known to be expressed in humans. Five of the seven were predicted to be pseudogenes based either on sequence mutations or rearrangement. Three of those five contained introns and two were processed pseudogenes, the first SULT processed pseudogenes to be described in any species. The remaining two genes, SULT1C3 and SULT6B1, contained no obvious inactivating mutations. Expression-profiling studies failed to detect any SULT1C3 transcripts, so this gene is probably more accurately described as a 'putative' gene. A series of alternatively spliced SULT6B1 transcripts were identified in the testis, confirming the expression of this gene in humans. SULT6B1 is also highly conserved in two primate species closely related to humans, the chimpanzee and gorilla. The results of this database mining exercise will aid in future studies of the regulation of these genes, will provide insights into the evolution of this gene family in humans, and will serve as a starting point for future comparative genomic studies of SULT genes.

RESULTS
BLAST and Sequence Alignment Searches
To perform the first search of the human genome sequence database, 17 SULT protein sequences were selected for use as query sequences in a TBLASTN search. Those query sequences included the 11 SULT isoforms encoded by the 10 genes known to be expressed in humans (SULT2B1 encodes two isoforms as a result of alternative splicing), four non-human SULT sequences that represented families or subfamilies that contained no known human orthologs or paralogs, and two different consensus sequences. These 17 TBLASTN searches produced a total of 1765 hits that were each annotated manually. In total, 996 hits (56%) corresponded to one of the 10 SULT genes known to be expressed in humans, 422 (24%) were hits to novel (previously uncharacterized) SULT genes, 71 (4%) were determined to be nonspecific, and 276 (16%) were ambiguous and retained

for further analysis. The 276 ambiguous hits were reduced to 195 nonredundant hits that represented different regions of genomic sequence, as well as one hit from each of the 10 known human SULT genes. A 500 kb region surrounding each of those 195 hits was then searched using the FrameSearch and TFastA programs. Approximately 10 360 searches were performed with those two programs, generating 39 125 hits (29 395 from TFastA and 9730 from FrameSearch). To aid in the annotation of those hits, a filter was developed that removed 32 798 hits (84%), leaving 6327 for further analysis. This filter is described in detail in the Materials and methods section. After plotting and analyzing those hits, seven novel SULT genes were identified and annotated.

Figure 1 Previously-characterized human SULT gene structures. Structures for the 10 SULT genes that are known to be expressed in humans and that have been reported previously are illustrated. Black boxes represent coding region, white boxes represent UTR, and hatched boxes represent alternatively spliced regions. Lengths are listed for each exon (in bp) and for the overall gene (in kb).

After the novel SULT genes were annotated, we performed a second round of searches, using four of the seven novel SULT sequences as queries for TBLASTN. In all, 303 hits were returned and annotated. Of those, 256 (84%) corresponded to known SULT genes and eight hits were retained for further analysis. The follow-up analysis included 440 queries performed between the FrameSearch and TFastA programs, producing 1737 hits. Altogether, 1492 (86%) of those hits


were eliminated by the filter. No additional SULT genes were identified after annotating the remaining 245 hits.

HMMER and GeneWise Searches
In parallel with the first approach, a separate but complementary strategy that utilized HMMs was also used to search for novel SULT genes. Initially, an HMM profile was constructed using the sequences of the 11 human SULT isoforms known to be expressed in humans. That profile was calibrated and used with the HMMER program to search a six-fold translation of the human genome sequence, producing 181 hits. Annotation revealed that 111 hits (61%) corresponded to one of the 10 known or seven novel SULT genes identified using the initial strategy, 54 hits (30%) were nonspecific, and 16 (9%) were not readily classifiable and were included in the next step.

To verify the HMMER results and refine those sequence alignments, the GeneWise program was used to search all genomic contigs identified by the HMMER search that contained either one of the 17 SULT genes or one of the 16 unclassified hits. GeneWise produced a total of 384 hits. A total of 119 (31%) hits corresponded to SULT genes and 265 (69%) were nonspecific. All seven of the novel SULT genes that had been found during the BLAST-based approach were also identified using the HMM strategy, but no additional novel genes were observed.

Table 1 Human SULT gene chromosomal localizations

Gene                                      Chromosomal localization
SULT1A1, SULT1A2, SULT1A3                 16p11.2–12.1
SULT1B1, SULT1E1, SULT1D1P                4q13
SULT1C1, SULT1C2, SULT1C3, SULT1C1P       2q12.2
SULT2A1, SULT2B1, SULT2A1P                19q13.33–13.4
SULT4A1                                   22q13.31
SULT1D2P                                  3q22.2
SULT3A1P                                  14q12
SULT6B1                                   2p22.3

The chromosomal localizations for each of the 17 human SULT genes are listed. Obvious pseudogenes are indicated by a 'P' after their name. In most cases, experimental evidence supports the map locations listed in the table (see text for details). However, the locations of several of the novel genes are based only on map locations for the genomic contigs from which they were annotated.
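
The hit counts reported for the BLAST-based and HMM-based searches can be cross-checked with a few lines of arithmetic. The following is only a minimal sketch: the counts are copied from the text above, while the category labels and the names `SEARCHES` and `summarize` are shorthand introduced here for illustration and are not part of the original analysis pipeline.

```python
# Hit counts copied from the Results text; the category labels are shorthand.
SEARCHES = {
    "TBLASTN": {"known genes": 996, "novel genes": 422, "nonspecific": 71, "ambiguous": 276},
    "HMMER": {"SULT genes": 111, "nonspecific": 54, "unclassified": 16},
    "GeneWise": {"SULT genes": 119, "nonspecific": 265},
}


def summarize(counts):
    """Return the total number of hits and each category's share in whole percent."""
    total = sum(counts.values())
    shares = {label: round(100 * n / total) for label, n in counts.items()}
    return total, shares


for program, counts in SEARCHES.items():
    total, shares = summarize(counts)
    print(f"{program}: {total} hits, shares (%) = {shares}")

# First-round TFastA/FrameSearch hits before and after the annotation filter.
TFASTA_HITS, FRAMESEARCH_HITS, FILTERED_OUT = 29395, 9730, 32798
raw_hits = TFASTA_HITS + FRAMESEARCH_HITS
kept_hits = raw_hits - FILTERED_OUT
print(f"filter: {raw_hits} raw, {kept_hits} kept, "
      f"{round(100 * FILTERED_OUT / raw_hits)}% removed")
```

Under these assumptions, the category counts sum to the reported totals (1765, 181, and 384 hits) and the rounded shares agree with the percentages given in the text.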

Annotation of SULT Genes and Genomic Organization of SULT Gene Clusters
Following identification of the seven novel human SULT genes, each of those genes was annotated at the sequence level using a manual homology-based approach. Local sequence alignments, as well as the knowledge of conserved SULT gene structures (Figure 1), were employed during that annotation process. Of the total of 17 SULT genes that are now known, 13 are found in gene clusters on chromosomes 2q, 4q, 16p, and 19q (Table 1). Each of the gene structures will be described subsequently, organized on the basis of chromosomal location. Annotations for the seven novel genes were deposited in the GenBank Third Party Annotation database under TPA accession numbers BK001431–BK001437.

The Chromosome 2 Cluster and SULT6B1
The four human members of the SULT1C subfamily are located in a cluster at 2q12, and the SULT6B1 gene maps to 2p22.3. The gene structures for SULT1C1 (approximately 20 kb long) and SULT1C2 (about 10 kb in length) have been reported previously.21 Both the SULT1C1 and SULT1C2 genes contain seven exons that encode the protein (Figure 1). However, the SULT1C1 gene contains an additional 5′-noncoding exon as well as a rarely expressed alternatively spliced exon located within intron 3. The structures for both of those genes were confirmed during the present study. In addition to those two genes, this study also identified two other genes located within the SULT1C gene cluster: SULT1C3 and SULT1C1P.

The nine putative protein-coding exons of SULT1C3 span approximately 18 kb (Figure 2a). While SULT1C3 shares many features with other SULT1 family genes (Figure 1), the structure of this gene also contains unusual characteristics. First, the initial coding exon for SULT1C3 contains three in-frame ATG codons that would result in proteins 304, 294, or 284 amino acids in length, lengths similar to those of other known SULT isoforms. The sites that would produce the shortest two proteins conformed slightly better to the canonical translation initiation sequence27 than did the third site. Second, exons 7 and 8 are duplicated in SULT1C3 (Figure 2a), so alternative splicing might occur for exons 7a, 7b, 8a, and 8b. The peptide sequences encoded by exons 7a and 7b are 61.7% identical to each other, and the peptide sequences encoded by exons 8a and 8b are 78.4% identical. Each exon is flanked by GT-AG dinucleotides and maintains the ORF, so alternative splicing could potentially produce four SULT1C3 proteins. Each of these four potential allozymes was named according to the exons 7 and 8 that it contained: SULT1C3a (exons 7a and 8a), SULT1C3b (7a/8b), SULT1C3c (7b/8a), and SULT1C3d (7b/8b). A transcript containing exons 7b and 8a would require a nonstandard splicing process due to the genomic arrangement of those exons and, as a result, would seem quite unlikely. The three putative proteins encoded by SULT1C3a–d are 60–68% identical to the rat SULT1C1 sequence and 53–58% identical to human SULT1C1 and SULT1C2.

SULT1C1P contains at least seven exons and spans approximately 18 kb (Figure 2b). However, this gene has undergone rearrangement, since exons 2–5 are located 3′ to exons 6–8. In addition, this gene lacks an in-frame ATG start codon near the expected location in the predicted exon 2 sequence, and the splice junctions at the 3′ ends of exons 3, 6, and 7 are not GT. There is also a premature stop codon within exon 8, possibly resulting from a 1 bp deletion in a tyrosine codon conserved in most SULT genes. For these reasons, this gene was designated a pseudogene (indicated by the trailing P in the name) and, since its hypothetical translated sequence was most similar to SULT1C1, it was given SULT1C1 as its base name. The order of the four genes in the SULT1C cluster is cen-SULT1C3-SULT1C1-SULT1C1P-SULT1C2-ter. The genes are located approximately 23, 5, and 45 kb apart, respectively.

SULT6B1 is located near 2p22.3, on the opposite arm of chromosome 2 relative to the SULT1C gene cluster. SULT6B1 contains seven putative protein-coding exons that span approximately 21 kb, and has a structure very similar to those of other SULT genes (Figure 2a). Homology-based prediction of the location of the ATG start codon in exon 2 indicated that this exon is at least 199 bp in length. However, RACE and the expression-profiling experiments described subsequently indicated that the probable start codon was located much farther downstream than predicted, and revealed that exon 2 was only 106 bp in length. Those experiments also identified four noncoding exons upstream of exon 2, bringing the total length of SULT6B1 to approximately 29 kb. In addition, experiments designed to detect alternatively spliced transcripts identified a second exon 4-like sequence (termed exon 4b) located between exons 4 and 5. The predicted protein sequence of SULT6B1 was most similar to that of chicken SULT6A1 (52% identity, 63% similarity) and, based on the nomenclature that has been suggested for SULT genes,8 this novel SULT was classified as the first member of the SULT6B subfamily. Portions of this gene were subsequently annotated by NCBI's automated gene prediction software as 'similar to sulfotransferase family'.

Figure 2 Novel human SULT gene structures. Structures for the seven novel human SULT genes annotated in this study are illustrated. In addition to the information given in Figure 1, the lengths of introns (in bp), as well as the locations of mutations (arrows and checkered regions), are shown. Panel (a): Structures of the two novel genes that were not obvious pseudogenes are depicted. RACE was only performed for SULT6B1. Therefore, UTR data are available only for that gene. The 5′ untranslated exons for SULT6B1 are indicated by a dashed line and are depicted in Figure 3. Panel (b): Structures for the five pseudogenes identified during this study are illustrated. The two halves of SULT2A1P are on opposite strands, as indicated by the dashed line.
[Figure 2 image not reproduced. Panel (a), Novel SULT Genes: SULT1C3 (~18 kb), exons 2, 3, 4, 5, 6, 7a, 8a, 7b, 8b of >142, 129, 98, 127, 95, 181, >113, 181, >113 bp; SULT6B1 (~29 kb), exons 2, 3, 4, 4b, 5, 6, 7, 8 of 106, 113, 90, 84, 127, 95, 157, 246 bp. Panel (b), Novel SULT Pseudogenes: SULT1D2P (contains 5 stop codons and 3 frameshifts), SULT3A1P (contains 4 stop codons and a frameshift), SULT1C1P (~18 kb), SULT1D1P (~22 kb), SULT2A1P (~153 kb).]

The Chromosome 4 Cluster
Three SULT genes are located at chromosome 4q13: SULT1E1 and SULT1B1, both of which encode previously characterized proteins, and SULT1D1P, a gene identified during the present database searches. The gene structure for SULT1E1 was reported previously.17 SULT1E1 contains seven exons that encode the protein, as well as an additional upstream noncoding exon, and is approximately 18 kb in length (Figure 1). SULT1B1 also contains seven protein-coding exons that span 28 kb (Figure 1), but since no RACE data were reported for this gene, no noncoding regions could be identified.21

SULT1D1P shares homology with the conserved gene structures for other SULT1 family genes (Figure 2b), and is about 22 kb in length. A potential TATA box (TATTAA) is located 60 bp upstream of the putative translation initiation codon. Only one other human SULT gene, SULT1E1, contains a TATA box.17 A stop codon is present in exon 8 near the expected location, and seven polyadenylation signals are present within the subsequent 1800 bp. While those features suggested that this gene might be active, exon 2 contains two in-frame stop codons after the predicted ATG start codon, and the GT dinucleotides that comprise the anticipated splice sites at the ends of exons 2 and 4 have been mutated to GA and AT, respectively. Examples of GA–AG splicing have been reported.28,29 However, while it is known that introns may (rarely) splice using AT–AC pairs,29,30 to our knowledge, there are no examples of introns in humans that utilize AT–AG sequences.29 Therefore, SULT1D1P was classified as a pseudogene and assigned the name 1D1 because of the high homology of the hypothetical protein product to mouse SULT1D1 (80% identical, 85% similar). The SULT1D1P gene is found approximately 36 kb telomeric to SULT1B1, and the SULT1E1 gene is located about 28 kb telomeric to SULT1D1P.

The Chromosome 16 Cluster
Three SULT genes are located on chromosome 16p11.2–12.1: SULT1A1, SULT1A2, and SULT1A3. The structures of all three of these genes have been reported previously,16,19,20 and those structures were confirmed during the present


studies. All three genes contain multiple 5′-noncoding exons and seven exons that encode protein, with exon lengths that match the conserved gene structures found among other SULT1 family genes (Figure 1). The SULT1A genes are approximately 4.4, 5.1, and 7.3 kb in length, respectively. Therefore, they are much shorter than most other SULT genes. The assembly of this region of chromosome 16 is still incomplete, so no conclusions can presently be drawn with regard to the orientation of these genes relative to each other.

The Chromosome 19 Cluster
Two members of the human SULT2 family are located at chromosome 19q13.33. SULT2A1 is about 16 kb long, while SULT2B1 is approximately 48 kb in length (Figure 1). Both of these genes have six exons that encode the protein, but the SULT2B1 gene has an alternative initial coding exon that is alternatively transcribed and is spliced to a site within the ORF (Figure 1). Both of these gene structures were reported previously,12,18 and both were confirmed in the course of this study.

The third member of the human SULT2 family is the novel SULT2A1P pseudogene, located at 19q13.4. This gene contains an in-frame ATG codon at the expected start site in exon 2, but it lacks an identifiable exon 7 sequence (Figure 2b). Exon 6 contains a frame shift and is truncated by a MER85 repetitive sequence element, deleting the 5′-terminus of the highly conserved Region IV sequence found at this location in other SULT genes. Two copies of exon 4 were found that are 68% identical (76% similar) to each other on the basis of predicted amino-acid sequence. The splice site located at the 5′-end of exon 4a is CG rather than AG, and exon 4b contains a stop codon. Finally, the most recent assembly of this region indicates that exons 2 and 3 are located on the opposite strand and approximately 131 kb away from exons 4b, 4a, 5, and 6. Therefore, this gene was classified as a pseudogene and named 2A1P, since the hypothetical translation of the truncated gene is 56% identical (66% similar) to human SULT2A1. SULT2A1P is about 153 kb long and is located approximately 7.76 Mb telomeric to SULT2B1, which is in turn located about 665 kb telomeric to SULT2A1, making the SULT2 'cluster' by far the longest of the SULT gene clusters. Unlike the other SULT gene clusters, the orientation of the genes in the SULT2 cluster is variable. The SULT2A1 gene and exons 4–6 of SULT2A1P are on the same strand, opposite that of exons 2–3 of SULT2A1P and SULT2B1.

SULT4A1
SULT4A1 is located on chromosome 22q13.31. The gene spans 38 kb, making it one of the longest human SULT genes (Figure 1). Like SULT1B1, the structure of SULT4A1 had not been reported previously, but it was readily annotated during our studies. While SULT4A1 contains seven coding exons and shares some homology with the conserved SULT gene structures, the lengths of several of the exons differ significantly from those found in other SULT genes. 5′- and 3′-RACE studies have not been reported for this gene, so the number of noncoding exons is not known for SULT4A1.

SULT1D2P and SULT3A1P
SULT1D2P and SULT3A1P are located on chromosomes 3 and 14, respectively (Figure 2b). Both are processed pseudogenes and represent the first reported SULT processed pseudogenes in any species. SULT1D2P contains five stop codons and three frame shifts, as well as a 349 bp Alu element inserted approximately 297 bp into the sequence. The predicted encoded amino-acid sequence of SULT1D2P shares 69% identity (75% similarity) with the predicted translation of the human SULT1D1P gene. Human SULT3A1P contains four stop codons and a single frame shift. The predicted translation of the SULT3A1P pseudogene is 51% identical and 63% similar to that of mouse SULT3A1.

Expression Profiling of SULT1C3 and SULT6B1
Of the seven novel SULT genes which we identified, only SULT1C3 and SULT6B1 do not contain any obvious inactivating mutations (Figure 2a). Therefore, both computational and experimental approaches were used to perform preliminary expression profiling studies for these two genes. Initially, the predicted peptide sequences encoded by both genes were used as queries in a TBLASTN search of the EST database. No ESTs were identified, indicating that, at present, there is no in silico evidence for the expression of either SULT1C3 or SULT6B1. Therefore, we also searched for evidence of expression by using PCR and cDNAs from 20 different human tissues. No SULT1C3 products were detected, even after performing nested amplifications. Therefore, on the basis of our data, the SULT1C3 gene is more accurately described as a putative active gene. However, SULT6B1 products were detected when testicular cDNA was used as template. Therefore, cDNA from this tissue was used to perform 5′- and 3′-RACE analysis.

SULT6B1 RACE
Marathon-Ready™ human testis cDNA was used to perform RACE for SULT6B1. Analysis of 15 3′-RACE clones indicated that there were three functional polyadenylation signals, resulting in 3′-UTR lengths of 52, 80, and 113 bp. Initial attempts at 5′-RACE failed, so a new set of primers was designed that hybridized farther 3′-downstream from the predicted ATG translation initiation site. Products were generated successfully with the new primers, and 12 5′-RACE products were cloned and sequenced. Four exons were found upstream of the first predicted coding exon (Figure 3), one of which (exon 1B) utilized two different 3′ splice sites, creating a 'short' and 'extended' form of that exon. Three different transcription initiation sites were identified (upstream of exons 1D, 1C, and 1A). Finally, six different splicing patterns were observed, all of which spliced into the same location within the predicted exon 2 sequence (Figure 3). That splice site was located 3′ of both the computationally predicted translation initiation site and the first set of 5′-RACE primers, indicating that the in silico prediction of exon


2 extended too far upstream and that an alternate translation initiation site was used. Examples of each of the six splicing patterns have been submitted to GenBank (accession numbers AY289770–AY289775).

Figure 3 SULT6B1 5′-RACE results. The four exons identified by performing 5′-RACE with human testicular cDNA are indicated by white boxes, and the lengths of exons and introns are listed in bp. Six distinct splicing patterns were observed from at least three different sites of transcription initiation (exons 1D, 1C, and 1A). The location of the first in-frame ATG translation initiation codon in exon 2 is also shown.
[Figure 3 image not reproduced. Panel title: Human SULT6B1 5′-RACE Results; exons shown in the order 1D, 1C, 1B, 1A, 2.]

SULT6B1 Transcript Characterization
To characterize full-length SULT6B1 transcripts, primers located in the 5′- and 3′-UTRs were used to perform PCR with human testicular cDNA as template. Analysis of the products indicated that alternative splicing and exon skipping were common. In all, 11 different transcripts were characterized from 19 clones. 'Complete' transcripts consisting of coding exons 2–8 were found downstream of exons 1A (two clones), 1C–1B (two clones), 1D (three clones), and 1D–1B (one clone). Exon 5 was skipped (1D–1B–2–3–4–6–7–8) in a single transcript, as was exon 6 (1A–2–3–4–5–7–8). In addition, two transcripts were isolated that lacked exons 4, 5, and 6. Three clones contained exons 1D–2–3–7–8 and two clones contained exons 1D–1B–2–3–7–8. Finally, an additional exon (termed exon 4b) was identified. Exon 4b was flanked by GT–AG dinucleotides and was 84 bp in length. It maintained the open reading frame, but it also contained a stop codon. Exon 4b was identified in three different splice forms, two of which simultaneously lacked exon 6: 1A–2–3–4–4b–5–6–7–8 (two clones), 1A–2–3–4–4b–5–7–8 (one clone), and 1C–1B–2–3–4–4b–5–7–8 (one clone). Among all of these transcripts, only those that contained exons 2–8 or exons 2–3–7–8 remained in frame. Even though the transcripts that skip exons 4–6 include the conserved Region I and IV sequences common to SULT enzymes, it seems unlikely that the resulting peptide will have SULT activity.

Cloning SULT6B1 from Other Primate Species
Each exon of the SULT6B1 gene was amplified from chimpanzee and gorilla genomic DNA using PCR. The predicted amino-acid translations were compared to the predicted human sequence (Table 2). A total of 13 ORF nucleotide differences were found among the three species. Nine amino acids were altered, while four were unchanged. The predicted sequences of the human, chimpanzee, and gorilla enzymes were 97% identical to each other. The exon sequences from chimpanzee and gorilla have been deposited in GenBank (accession numbers AY289776–AY289789).

DISCUSSION
Sulfation is an important pathway in the biotransformation of many drugs as well as endogenous compounds such as hormones and neurotransmitters.1,2 Therefore, it is important to explore the possibility of pharmacogenetic variation in this pathway for drug metabolism. That process began

Table 2 SULT6B1 sequence differences among primates

Location in human SULT6B1            Nucleotide sequence

Nucleotide         Amino acid    Human    Gorilla    Chimpanzee
A148G              LYS50GLU      A        A          G
G274A              GLU92LYS      G        A          A
A314C              LYS105THR     A        C          A
A321G              THR107THR     A        G          G
G336C              LEU112PHE     G        C          C
C390G              PHE130LEU     C        G          C
T400C              PHE134LEU     T        T          C
G429C              ARG143SER     G        C          C
GCT(538–540)CCC    ALA180PRO     GCT      CCC        GCT
G609A              ALA203ALA     C        A          A
C636T              HIS212HIS     C        C          T
A651C              PRO217PRO     A        C          A
A697G              SER233GLY     A        A          G

The locations and sequences of the differences found in the predicted ORF sequences of human, gorilla, and chimpanzee SULT6B1 are listed. Nucleotide and amino-acid locations correspond to the human sequence.
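
The counts quoted in the text for Table 2 (13 ORF nucleotide differences, nine altering the encoded residue and four silent) can be tallied directly from the row labels. The sketch below is illustrative only: the labels are copied from the table, while `TABLE_2`, `parse_label`, and `codon_number` are hypothetical helper names, and the check assumes positions are numbered from the first nucleotide of the ORF.

```python
import re

# (ORF nucleotide position, amino-acid label) for each row of Table 2.
# For the GCT(538-540)CCC row, the first position of the codon is used.
TABLE_2 = [
    (148, "LYS50GLU"), (274, "GLU92LYS"), (314, "LYS105THR"), (321, "THR107THR"),
    (336, "LEU112PHE"), (390, "PHE130LEU"), (400, "PHE134LEU"), (429, "ARG143SER"),
    (538, "ALA180PRO"), (609, "ALA203ALA"), (636, "HIS212HIS"), (651, "PRO217PRO"),
    (697, "SER233GLY"),
]


def parse_label(label):
    """Split a label such as 'LYS50GLU' into ('LYS', 50, 'GLU')."""
    match = re.fullmatch(r"([A-Z]{3})(\d+)([A-Z]{3})", label)
    if match is None:
        raise ValueError(f"unexpected amino-acid label: {label}")
    return match.group(1), int(match.group(2)), match.group(3)


def codon_number(nucleotide_position):
    """1-based codon index for a 1-based ORF nucleotide position."""
    return (nucleotide_position + 2) // 3


altered, unchanged = [], []
for position, label in TABLE_2:
    before, _, after = parse_label(label)
    (unchanged if before == after else altered).append(label)

# Every nucleotide position should fall in the codon named by its label.
positions_consistent = all(
    codon_number(position) == parse_label(label)[1] for position, label in TABLE_2
)

print(len(TABLE_2), "differences:", len(altered), "altered,", len(unchanged), "unchanged")
print("nucleotide and codon numbering consistent:", positions_consistent)
```

The same loop could be extended to split the differences by species once the three nucleotide columns are added to each row.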


nearly two decades ago when biochemical genetic studies of Bubble size is % ID of Hit SULT1A1 and SULT1A3 demonstrated that levels of activity SULT1C1 SULT1C1P 10 SULT1C3 SULT1C2 for both of these SULT isoforms in the human platelet had 6 31–33 2 high heritability. More recently, a series of gene -2 resequencing and functional genomic studies have resulted -6

in the identification of functionally significant genetic polymorphisms for SULT1A1,34 SULT1A3,35 SULT1E1,36 and SULT2A1.37 One of the SULT1A1 polymorphisms has already been reported to be of pharmacogenetic significance for variation in the survival of breast cancer patients treated with tamoxifen.38 Therefore, it is important for pharmacogenetics that all human genes encoding cytosolic SULTs be identified and characterized. The present studies represent the first comprehensive database mining project conducted to identify and annotate SULT genes in humans. We identified seven novel SULT genes in addition to the 10 that were previously reported to be expressed in humans. We also found evidence for the expression of SULT6B1, one of the two novel genes that appeared most likely to be expressed in human tissues.

The strategy used to search the human genome database involved two different but complementary approaches: one that used explicit SULT sequences as queries, and the other that utilized HMM profiles. The first approach began by using the TBLASTN program to perform an initial screen of the genome. Genomic contigs containing hits of interest were then scanned using two additional programs that employ different alignment algorithms. The TFastA program was chosen for speed and, in some cases, for its increased sensitivity over TBLASTN. The FrameSearch program was used for its ability to perform a Smith and Waterman local alignment using a protein sequence as a query and all possible codons in a nucleotide database. To aid in the annotation of the large number of hits generated by use of those two programs, a graphical display was created that conveyed on a single graph multiple levels of information with regard to each hit. As shown in Figure 4, both programs identified most of the exons in all SULT genes. However, many hits were identified by only one of the two programs. For example, TFastA found most of the exons in the SULT1C1P gene, but FrameSearch found none. Therefore, our decision to use multiple alignment programs based on different algorithms and then combine the datasets prior to analysis greatly aided in the identification and annotation of novel SULT genes.

[Figure 4: two bubble plots. x-axis: Position in Contig (NT); y-axis: Exon and Strand; bubble size: percent identity of hit (top graph) or length of hit (bottom graph).]

Figure 4 Graphical display of sequence alignment results. A novel graphical display was developed to facilitate the annotation process. The figure illustrates the SULT1C cluster as an example of this display, with each hit obtained by the FrameSearch (shaded points) or TFastA (open points) programs plotted. Each hit was graphed according to its position within the genomic contig (x-axis) and the exon that was used as a query sequence to generate the hit (y-axis). Points above the x-axis are hits on the forward strand, while points below the x-axis represent hits on the reverse strand. The relative size of each data point indicates either the percent identity of the hit to the query exon sequence (top graph) or the length of the overall alignment (bottom graph). On these plots, hits corresponding to SULT genes form clear diagonals (circled in green) that are readily distinguishable from the nonspecific hits that are scattered throughout the plot. One exception is the SULT1C1P gene, which does not form a diagonal. That is a result of rearrangement of the exons in SULT1C1P (see Figure 2).

The second approach that was used to search the human genome used HMM profiles rather than explicit amino-acid sequences. HMMs contain position-specific scoring that contains information about the degree of conservation of each residue, making HMM searches more sensitive than pair-wise alignment methods like those used in the initial approach. A HMM profile was created using the sequences of the 11 human SULT enzymes known to be expressed (SULT2B1 produces two isoforms), and that profile was used with the HMMER program to search a six-fold translation of the human genome database. Genomic contigs containing hits of interest were identified and subsequently searched using the GeneWise program. GeneWise also uses HMM profiles, but it is able to search a nucleotide database directly. In addition, GeneWise can compensate for gaps in the alignment due to either sequencing error or introns.

Those searches resulted in the identification of several novel SULT genes, including SULT6B1, the first human SULT gene in family 6 and only the second member of that family to be identified in any species. A series of alternatively spliced SULT6B1 transcripts was identified in the testis, confirming the expression of this gene and forming the basis for future functional studies. In addition to SULT6B1, a previous report39 suggested that a third member of the SULT1C family existed in humans on the basis of partial exon sequence of a putative human homolog of rat SULT1C1, although the full sequence and structure of that gene remained unknown. The present studies have successfully identified and annotated the structure of that gene, SULT1C3. While SULT1C3 has the potential to be active on the basis of nucleotide sequence, no experimental evidence was found which supported the transcription of that gene in the tissues in the cDNA panel or in the MTE array. However, it is possible that SULT1C3 is expressed at a different time during development or under different physiological conditions than when the tissue samples for the cDNAs panel and the MTE array were prepared. Therefore, the question of whether this gene is expressed in humans remains open until SULT1C3 transcripts are detected. After we had identified and studied these two genes, both of them were deposited in GenBank as a result of patent filing WO 0164904 for SULT6B1 and patent filing WO 9172977 for SULT1C3.
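The six-fold translation used for the HMM-based search can be illustrated with a short sketch. This is not the authors' pipeline; it only shows, under the assumption of the standard genetic code, how one nucleotide sequence yields six protein reading frames (three per strand) that a protein-level search tool could then scan. The function names and the frame labels (`+1` … `-3`) are hypothetical and chosen for illustration.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G (assumption:
# nuclear code, no organism-specific deviations).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Map frame labels (+1..+3 forward, -1..-3 reverse) to protein strings."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

For a genome-scale search the same idea would be applied contig by contig, with stop codons (`*`) left in place so that hits can be mapped back to genomic coordinates.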

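The encoding used in the Figure 4 display can also be sketched in a few lines. The authors built their plots with Microsoft Excel and Visual Basic; the Python below is only an illustration of the mapping from a hit to a plotted point (contig position on the x-axis, signed exon number on the y-axis, bubble size from identity or length). The `Hit` record, its field names, and the helper `found_by_one_program` are hypothetical, and matching hits by exon and strand alone is a deliberate simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    program: str     # "TFastA" or "FrameSearch"
    exon: int        # number of the query exon that produced the hit
    strand: str      # "+" for forward, "-" for reverse
    position: int    # position of the hit within the contig (nt)
    identity: float  # percent identity to the query exon
    length: int      # length of the alignment


def to_point(hit, size_by="identity"):
    """Return (x, y, bubble_size, program) for one hit.

    Forward-strand hits get a positive y value and reverse-strand hits a
    negative one, so the two strands fall on opposite sides of the x-axis.
    """
    y = hit.exon if hit.strand == "+" else -hit.exon
    size = hit.identity if size_by == "identity" else hit.length
    return (hit.position, y, size, hit.program)


def found_by_one_program(hits):
    """List (exon, strand) pairs that were reported by only one program."""
    seen = {}
    for h in hits:
        seen.setdefault((h.exon, h.strand), set()).add(h.program)
    return sorted(key for key, programs in seen.items() if len(programs) == 1)
```

Plotting the resulting points for one contig window would reproduce the general layout of the display: exons of a single gene line up along a diagonal, while isolated points suggest nonspecific hits.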

While experimental data supported the computationally predicted gene structure for SULT6B1, 5′-RACE revealed that exon 2 was approximately half as long as originally predicted. This observation emphasizes the fact that even manually curated gene annotations require experimental confirmation and validation. 5′-RACE experiments identified four noncoding exons upstream of SULT6B1 exon 2, three of which served as sites of transcription initiation (Figure 3). We also obtained evidence for the existence of several alternatively spliced mRNAs, but further studies will be required to determine the function, if any, of the encoded protein. Nonetheless, several observations can be made with regard to the predicted SULT6B1 protein sequence. First, the initial in-frame ATG in exon 2 matches the Kozak consensus sequence for translation initiation,27 but that ATG would produce a protein that is approximately 20 amino acids shorter at the N-terminus than other human SULTs. Second, the predicted amino-acid sequence of SULT6B1 contains two gaps relative to other human SULTs, one located approximately 20 amino acids downstream of Region I and the other about five residues upstream of Region IV (Figure 5). The first gap is aligned with gaps

1A1   ~~~~~~~~~~ MELIQDTSRP PLEYVKG.VP LIKYFAEALG PLQS.FQARP DDLLISTYPK SGTTWVSQIL
1B1   ~~~~~~~~~~ MLSPKDILRK DLKLVHG.YP MTCAFASNWE KIEQ.FHSRP DDIVIATYPK SGTTWVSEII
1C1   ~~~~~~MALT SDLGKQI.K. .LKEVEGTLL QPA.TVDNWS QIQS.FEAKP DDLLICTYPK AGTTWIQEIV
1C2   MALHEMEDFT FDGTKRL.S. .VNYVKG.IL QPTDTCDIWD KIWN.FQAKP DDLLISTYPK AGTTWTQEIV
1C3a  MAKIEKNAPT MEKKPELFN. .IMEVDG.VP TLILSKEWWE KVCN.FQAKP DDLILATYPK SGTTWMHEIL
1C3d  MAKIEKNAPT MEKKPELFN. .IMEVDG.VP TLILSKEWWE KVCN.FQAKP DDLILATYPK SGTTWMHEIL
1E1   ~~~~~~~~~~ MNSELDYYEK .FEEVHG.IL MYKDFVKYWD NVEA.FQARP DDLVIATYPK SGTTWVSEIV
2A1   ~~~~~~~~~~ ~~~~~~MSDD FLWFEGIAFP TMGFRSETLR KVRDEFVIRD EDVIILTYPK SGTNWLAEIL
2B1a  ~~~~~MASPP PFHSQKLPGE YFRYKGVPFP VGLYSLESIS LAENTQDVRD DDIFIITYPK SGTTWMIEII
4A1   ~~MAESEAET PSTPGEFESK YFEFHGVRLP P..FCRGKME EIAN.FPVRP SDVWIVTYPK SGTSLLQEVV
6B1   ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~MCTSETFQ AL.DTFEARH DDIVLASYPK CGSNWILHIV

1A1   DMIYQGGDLE KCHRAPIFMR VPFLEFKAPG I.PSGMETLK DTPAPRLLKT HLPLALLPQT LLDQKVKVVY
1B1   DMILNDGDIE KCKRGFITEK VPMLEMTLPG LRTSGIEQLE KNPSPRIVKT HLPTDLLPKS FWENNCKMIY
1C1   DMIEQNGDVE KCQRAIIQHR HPFIEWARP. PQPSGVEKAK AMPSPRILKT HLSTQLLPPS FWENNCKFLY
1C2   ELIQNEGDVE KSKRAPTHQR FPFLEMKIP. SLGSGLEQAH AMPSPRILKT HLPFHLLPPS LLEKNCKIIY
1C3a  DMILNDGDVE KCKRAQTLDR HAFLELKFPH KEKPDLEFVL EMSSPQLIKT HLPSHLIPPS IWKENCKIVY
1C3d  DMILNDGDVE KCKRAQTLDR HAFLELKFPH KEKPDLEFVL EMSSPQLIKT HLPSHLIPPS IWKENCKIVY
1E1   YMIYKEGDVE KCKEDVIFNR IPFLECRKEN LM.NGVKQLD EMNSPRIVKT HLPPELLPAS FWEKDCKIIY
2A1   CLMHSKGDAK WIQSVPIWER SPWVESEI.. ....GYTALS ETESPRLFSS HLPIQLFPKS FFSSKAKVIY
2B1a  CLILKEGDPS WIRSVPIWER APWCETIV.. ....GAFSLP DQYSPRLMSS HLPIQIFTKA FFSSKAKVIY
4A1   YLVSQGADPD EIGLMNIDEQ LPVLEYPQP. ....GLDIIK ELTSPRLIKS HLPYRFLPSD LHNGDSKVIY
6B1   SELIYAVSKK KYK....YPE FPVLEC.... GDSEKYQRMK GFPSPRILAT HLHYDKLPGS IFENKAKILV

1A1   VARNAKDVAV SYYHFYHMAK VHPEPGTWDS FLEKFMVGEV SYGSWYQHVQ EWWELSRTHP VLYLFYEDMK
1B1   LARNAKDVSV SYYHFDLMNN LQPFPGTWEE YLEKFLTGKV AYGSWFTHVK NWWKKKEEHP ILFLYYEDMK
1C1   VARNAKDCMV SYYHFQRMNH MLPDPGTWEE YFETFINGKV VWGSWFDHVK GWWEMKDRHQ ILFLFYEDIK
1C2   VARNPKDNMV SYYHFQRMNK ALPAPGTWEE YFETFLAGKV CWGSWHEHVK GWWEAKDKHR ILYLFYEDMK
1C3a  VARNPKDCLV SYYHFHRMAS FMPDPQNLEE FYEKFMSGKV VGGSWFDHVK GWWAAKDMHR ILYLFYEDIK
1C3d  VARNPKDCLV SYYHFHRMAS FMPDPQNLEE FYEKFMSGKV VGGSWFDHVK GWWAAKDMHR ILYLFYEDIK
1E1   LCRNAKDVAV SFYYFFLMVA GHPNPGSFPE FVEKFMQGQV PYGSWYKHVK SWWEKGKSPR VLFLFYEDLK
2A1   LMRNPRDVLV SGYFFWKNMK FIKKPKSWEE YFEWFCQGTV LYGSWFDHIH GWMPMREEKN FLLLSYEELK
2B1a  MGRNPRDVVV SLYHYSKIAG QLKDPGTPDQ FLRDFLKGEV QFGSWFDHIK GWLRMKGKDN FLFITYEELQ
4A1   MARNPKDLVV SYYQFHRSLR TMSYRGTFQE FCRRFMNDKL GYGSWFEHVQ EFWEHRMDSN VLFLKYEDMH
6B1   IFRNPKDTAV SFLHFHNDVP DIPSYGSWDE FFRQFMKGQV SWGRYFDFAI NWNKHLDGDN VKFILYEDLK

1A1   ENPKREIQKI LEFVGRSLPE ETVDFMVQHT SFKEMKKNPM TNYTTVPQEF MDHSISPFM. RKGMAGDWKT
1B1   ENPKEEIKKI IRFLEKNLND EILDRIIHHT SFEVMKDNPL VNYTHLPTTV MDHSKSPFM. RKGTAGDWKN
1C1   RDPKHEIRKV MQFMGKKVDE TVLDKIVQET SFEKMKENPM TNRSTVSKSI LDQSISSFM. RKGTVGDWKN
1C2   KNPKHEIQKL AEFIGKKLDD KVLDKIVHYT SFDVMKQNPM ANYSSIPAEI MDHSISPFM. RKGAVGDWKK
1C3a  KNPKHEIHKV LEFLEKTWSG DVINKIVHHT SFDVMKDNPM ANHTAVPAHI FNHSISKFM. RKGMPGDWKN
1C3d  KDPKREIEKI LKFLEKDISE EILNKIIYHT SFDVMKQNPM TNYTTLPTSI MDHSISPFM. RKGMPGDWKN
1E1   EDIRKEVIKL IHFLERKPSE ELVDRIIHHT SFQEMKNNPS TNYTTLPDEI MNQKLSPFM. RKGITGDWKN
2A1   QDTGRTIEKI CQFLGKTLEP EELNLILKNS SFQSMKENKM SNYSLLSVDY VVD.KAQLL. RKGVSGDWKN
2B1a  QDLQGSVERI CGFLGRPLGK EALGSVVAHS TFSAMKANTM SNYTLLPPSL LDHRRGAFL. RKGVCGDWKN
4A1   RDLVTMVEQL ARFLGVSCDK AQLEALTEHC H...QLVDQC CNAEALP...... V. GRGRVGLWKD
6B1   ENLAAGIKQI AEFLGFFLTG EQIQTISVQS TFQAMRAKSQ DTHGAV...... GPFLF RKGKVALWKN

1A1   TFTVAQNERF DADYAKKMAG C..SLTFRSE L~~~~
1B1   YFTVAQNEKF DAIYETEMSK T..ALQFRTE I~~~~
1C1   HFTVAQNERF DEIYRRKMEG T..SINFCME L~~~~
1C2   HFTVAQNERF DEDYKKKMTD T..RLTFHFQ F~~~~
1C3a  HFTVALNENF DKHYEKKMAG S..TLNFCLE I~~~~
1C3d  YFTVAQNEEF DKDYQKKMAG S..TLTFRTE I~~~~
1E1   HFTVALNEKF DKHYEQQMKE S..TLKFRTE I~~~~
2A1   HFTVAQAEDF DKLFQEKMAD LP.RELFPWE ~~~~~
2B1a  HFTVAQSEAF DRAYRKQMRG MP...TFPWD E~~~~
4A1   IFTVSMNEKF DLVYKQKM.G KC.DLTFDFY L~~~~
6B1   LFSEIQNQEM DEKFKECLAG TSLGAKLKYE SYCQG

Figure 5 Human SULT sequences. An amino-acid sequence alignment of all SULT enzymes known to be expressed in humans. The highly conserved Regions I and IV are highlighted in blue and red, respectively. The dimerization interface region overlaps Region IV, and is underlined. The putative SULT1C3 gene might produce four separate isoforms as a result of alternative splicing. The sequences for two of those isoforms (1C3a and 1C3d) are shown, and those sequences can be used to assemble the sequences for 1C3b and 1C3c. The SULT2B1 gene produces two isoforms (2B1a and 2B1b) that differ only at the N terminus, so only SULT2B1a is shown. The SULT6B1 amino-acid sequence is based on the predicted translation of cDNAs.
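The dimerization interface region noted in the caption can be inspected directly in the alignment rows. The sketch below is illustrative only: it assumes the dimerization motif can be written as the pattern K-x3-T-V-x3-E (an assumption made here, not stated in this article), and the segments are the last block of the fourth alignment panel joined to the first block of the fifth panel for four of the rows. The names `MOTIF`, `SEGMENTS`, and `has_motif` are hypothetical.

```python
import re

# Assumed form of the dimerization motif: Lys, three residues, Thr-Val,
# three residues, Glu. This pattern is an assumption for illustration.
MOTIF = re.compile(r"K.{3}TV.{3}E")

# Residues copied from the Figure 5 alignment (end of panel 4 + start of
# panel 5) for a subset of the isoforms.
SEGMENTS = {
    "1A1": "RKGMAGDWKTTFTVAQNERF",
    "1E1": "RKGITGDWKNHFTVALNEKF",
    "4A1": "GRGRVGLWKDIFTVSMNEKF",
    "6B1": "RKGKVALWKNLFSEIQNQEM",
}


def has_motif(sequence):
    """True if the assumed motif occurs once alignment gaps ('.') are removed."""
    return MOTIF.search(sequence.replace(".", "")) is not None


def motif_report(segments):
    """Map each isoform label to whether the assumed motif is present."""
    return {name: has_motif(seq) for name, seq in segments.items()}
```

Under these assumptions the pattern is found in the 1A1, 1E1, and 4A1 segments but not in the 6B1 segment, where the corresponding stretch reads KNLFSEIQNQ.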


present in members of families 2 and 4, while the latter aligns with the gap present in human SULT4A1, which is predicted to enlarge the opening of the active site. Third, the dimerization motif present in other SULT sequences7 is altered in two places in human SULT6B1. Specifically, the Glu residue that forms an ion pair in other SULTs is a Gln residue in SULT6B1 (Figure 5). In addition, the conserved Val residue, which extends into the hydrophobic pocket in the dimer interface region, is a Glu residue in SULT6B1. This residue is also Glu in mouse SULT1E1, an isoform known to exist as a monomer,7 and studies have shown that this change prevents dimerization.7 Therefore, SULT6B1 may exist as a monomer.

Among the seven novel genes identified as a result of these studies, two, SULT1D2P and SULT3A1P, were processed pseudogenes, the first SULT processed pseudogenes described in any species. Furthermore, the SULT1C1P and SULT2A1P genes are the first unambiguous intron-containing SULT pseudogenes to be described in any species. While our expression profiling and RACE experiments for SULT1C3 and SULT6B1 were being performed, the structure of the SULT1D1P gene was reported following an independent database search.22 Our annotation confirms and refines the gene structures, map locations, and orientations described in that report. The SULT1D1P gene contains two stop codons in the putative exon 2 sequence. However, as a result of sequence divergence at the N-terminus of SULT enzymes, the initial coding exons of SULT genes are very difficult to predict accurately, so it is possible that an additional, functional exon 2 sequence exists farther upstream. Nonetheless, we concluded that SULT1D1P is a pseudogene. However, it must have been active in germline tissue at one time, since it has a corresponding processed pseudogene on chromosome 3. Unlike SULT1D1P, the ‘active’ SULT3A1 gene corresponding to the SULT3A1P processed pseudogene was not found, despite the fact that we searched with both the mouse SULT3A1 and predicted human SULT3A1 sequences. It is likely that the gene that gave rise to SULT3A1P has been deleted in the course of evolution.

Most human SULT genes are found in clusters, presumably formed through gene duplication events, although several solitary genes were also found. The orientation of all the genes in those clusters is head-to-tail, with the exception of the SULT2 cluster, which appears to have been subjected to at least one inversion. The annotations described in this study will provide additional insight into the evolution of the SULT gene family in humans and may aid in future studies of the regulation of these genes. In addition, these results will serve as a starting point for future comparative genomic studies of SULT genes in other species.

MATERIALS AND METHODS

Query Sequences
The protein query sequences used during the initial database searches were translated from the published cDNA sequences for human SULTs 1A1 (GenBank accession number L19999),40 1A2 (U28169),41 1A3 (L19956),42 1B1 (U95726),43 1C1 (U66036),11 1C2 (AF055584),44 1E1 (U08098),45 2A1 (U08024),46 2B1a (U92314),12 2B1b (U92315),12 and 4A1 (AF188698).47 To increase the specificity of the searches, the terminal 54 amino acids of SULT2B1, which compose the proline-rich tail unique to that isoform, were removed. Additional query sequences were mouse 1D1 (U32371),48 mouse 3A1 (AF026075), mouse 5A1 (AF026074), chicken 6A1 (AF033189),49 and the PFAM database50,51 consensus sequence for cytosolic sulfotransferases (pfam00685). A second consensus sequence was constructed from the 11 human SULT sequences using the GCG program Profilemake.

BLAST and Sequence Alignment Searches
To perform the initial screen of the human genome, 17 SULT sequences were selected as query sequences in a TBLASTN23 search using a local copy of the RefSeq database (October 2001). The query sequences included the 11 SULT isoforms known to be expressed in humans, four non-human SULT sequences (mouse 1D1, mouse 3A1, mouse 5A1, and chicken 6A1), and the two different consensus sequences. For those searches, the BLOSUM62 scoring matrix was used and the expect parameter was set to 50. The results were annotated manually. Hits that did not correspond to known or novel SULT genes, or which were not obviously nonspecific (alignments that were very short or shared very low similarity, or those that should have encompassed portions of the highly conserved Region I and IV sequences but did not display any homology over those regions), were retained for further analysis. Redundancy was eliminated from the set of retained hits by preserving only a single ‘marker’ hit for each SULT gene and, since a 500 kb window of sequence was searched in the subsequent step, only the best hit in any given 100 kb window of genomic sequence was retained when a single query produced multiple hits within that window.

A 500 kb window of sequence was identified for each of the nonredundant hits. To increase the specificity of the searches, the query sequence that produced each of those hits was then divided into exon fragments (for the human SULTs only). Each window was then searched with either each of the exon sequences (human query sequences) or the intact peptide sequence (non-human or consensus sequences) as the query sequence, using two programs, TFastA and FrameSearch. Both programs are from the GCG sequence analysis package,52 are more sensitive sequence alignment programs than TBLASTN, and are designed to search a nucleotide database for similarity to a protein query sequence. The TFastA program employs the Pearson and Lipman k-tuple algorithm, and searches each reading frame independently. FrameSearch utilizes the Smith and Waterman algorithm for local alignments, and compares the query sequence to all possible codons on both strands of the nucleotide sequence, enabling it to identify significant sequence similarity even when frame shifts are present.

The number of hits produced by the two alignment programs was too large to annotate manually. Therefore, a


filter was constructed using empirical data to reduce the number of hits that required annotation. The filter was designed after searching genomic contig sequences containing known SULT genes (positive controls) or random genomic sequence (negative controls), and analyzing the hits produced. Specifically, four SULT sequences (SULT1A1, SULT2A1, SULT4A1, and pfam00685) were each used to search three genomic contigs (NT_004610, NT_007874, and NT_008101) chosen at random, each approximately 500 kb in length. A total of 3080 hits were produced from those searches, and different methods of filtering the data were tested. The final filter criteria were found to be capable of eliminating most of the nonspecific hits, while retaining an adequate number of intermediate hits to allow the identification of novel SULT genes. In order for a hit to be retained for further analysis, it had to satisfy at least one of the three filter criteria. First, all hits at least 100 bp in length were retained so that long sequences with low percent identity were preserved. Second, all hits that shared at least 50% sequence identity to the query sequence and were at least 39 bp long were retained. That dual requirement eliminated the very short (but high percent identity) nonspecific hits, while still keeping all hits containing at least seven out of 13 matching amino acids. The third criterion, the product of the length and percent identity (LID), was designed to retain hits of both intermediate length and identity. Any hit with a LID score of at least 40 was retained. Of the 3080 hits from the random genomic contigs, 429 (14%) satisfied at least one of the three filter criteria. Therefore, approximately 12 hits are expected in every 500 kb of genomic sequence. This filter successfully reduced the number of hits that required manual annotation to a reasonable number, while still retaining a sufficient number of hits to allow the identification of novel SULT genes.

A visual display was developed to improve the efficiency of the manual annotation step. For each 500 kb window of sequence, all hits from the two searches that met or exceeded the filter criteria were plotted to illustrate graphically the quality and context of those hits (Figure 4). Four types of information were conveyed on each graph. The location of each hit within the sequence window was indicated on the x-axis. The exon used as a query sequence and whether the hit was on the forward or reverse strand were indicated on the y-axis. The program that produced the hit (color of data point) and either the length or percent identity of the hit (size of the data point) were also indicated. On these plots, hits corresponding to a single gene tend to cluster together and form a diagonal, whereas nonspecific hits are readily apparent due to their isolation and random scatter (Figure 4). Microsoft Excel was used to create the plots, and a Visual Basic program was used to automate the process for all searches.

Following analysis of the first round of hits, seven novel SULT genes were identified. Since these sequences were not included as queries, the predicted amino-acid translations of four of them (SULTs 1C3, 1D1P, 6B1, and 3A1P) were used as query sequences for a second round of searches of the genome sequence database, using the same method described above. Those four sequences were chosen because they either represented a member of a new subfamily in humans or because they appeared to be functional genes.

HMMER and GeneWise Searches
A complementary approach to the BLAST-based method described above was also used. That strategy relied on HMMs rather than individual and explicit query sequences. The HMMER (version 2.2 g)24 and GeneWise (version 2.2.0)25 programs were chosen to perform those searches. A local HMM profile26 was constructed from the sequences of the 11 human SULTs which are known to be expressed. Since HMMER requires a protein database, a six-fold translation of the human genome database was performed. The E and domE parameters were set to 0.01. After manual annotation of the HMMER hits, genomic contigs were identified that contained either a SULT gene or a hit that could not be unambiguously classified as nonspecific. GeneWise was then used to search those contigs in an attempt to refine the HMMER sequence alignments. The hits from the GeneWise searches were also annotated manually.

Annotation of SULT Genes
To annotate novel SULT genes, we took advantage of the high degree of conservation among SULT gene structures.3 Specifically, the expected lengths of exons, the locations of splice junctions within the predicted cDNA, and the presence of GT/AG dinucleotides flanking exons were considered when annotating exons. Each hit corresponding to a SULT exon was refined using the GCG local sequence alignment program Gap. This process was used to annotate and assemble each of the gene structures manually. The map location and structure of each gene was verified using the most recent assembly of the NCBI human genome database (build 32). The structures of previously published genes were confirmed in a similar fashion. For those genes, the BLAST and Gap programs were used to compare the genomic sequence for each gene to reference mRNA sequences.

Expression Profiling of SULT1C3 and SULT6B1
Since there were no obvious inactivating mutations in the ORF or splice junctions of either SULT1C3 or SULT6B1, they appeared to be potentially functional genes. In an attempt to find evidence for their expression in humans, the predicted amino-acid sequences of SULT1C3 (isoforms SULT1C3a and SULT1C3d) and SULT6B1 were used as queries for TBLASTN searches of the human EST database.53 The results of the searches were filtered such that only those hits that were at least 75% identical or that were more than 30 bp long with at least 90% similarity were retained. All alignments that met those criteria were examined manually.

To test experimentally for evidence of expression, 20 cDNAs (fetal lung, fetal brain, mammary gland, adrenal gland, and Multiple Tissue cDNA panels I and II) (Clontech, Palo Alto, CA) were used as template for the PCR with primers designed to hybridize with the predicted ORF sequences of SULT1C3 and SULT6B1 (primer sequences are available upon request). For SULT6B1, forward primers


located in exons 2 and 3 were paired with reverse primers in exons 8 and 7, respectively, to perform nested PCR. Products were detected with testicular cDNA as template. Therefore, Marathon-Ready™ human testis cDNA (Clontech) was used as template to characterize the transcripts further. For SULT1C3, the primers were located in exons 7 and 8, but no products were observed following PCR amplification using cDNA from 20 human tissues. Therefore, overlap extension54 was used to construct a probe consisting of exons 5, 6, 7b, and 8b. The probe was radioactively labeled with 32P-dCTP by random priming (Oligolabeling Kit, Pharmacia, Piscataway, NJ, USA) and used to probe a Human Multiple Tissue Expression (MTE™) Array (Clontech) containing RNA from 76 tissues. No signal was detected.

SULT6B1 RACE and Transcript Characterization
To identify the site(s) of transcription initiation for SULT6B1, Marathon-Ready™ human testis cDNA was used as template for 5′- and 3′-RACE. For 5′-RACE, reverse primers were designed near the 3′-end of the predicted exon 2 sequence and within exon 3. 3′-RACE was performed using a series of nested primers located in the predicted sequences of exons 7 and 8. The RACE products were cloned into pCR2.1 (Invitrogen, Carlsbad, CA, USA). To characterize the splicing patterns in full-length SULT6B1 transcripts, forward primers designed in exons 1D, 1C, and 1A were paired with reverse primers located in the 3′-UTR. The PCR products were cloned into pCR 3.1 Uni (Invitrogen). The Wizard Plus Miniprep DNA Purification System (Promega, Madison, WI, USA) was used to isolate the DNA prior to sequencing. Primer sequences are available upon request.

Cloning SULT6B1 From Other Primate Species
To determine if SULT6B1 was present in other primate genomes, intron-based primers were designed that flanked each exon of the human gene sequence (sequences available upon request). Those primers were used to PCR amplify each exon from gorilla and chimpanzee genomic DNA (Coriell Cell Repositories Primate Panel, repository numbers NG05251 and NG06939, respectively). The products were sequenced and alignments with human sequences were constructed with the GCG program Pileup.

DNA Sequencing and Analysis
DNA sequencing was performed in the Mayo Clinic Molecular Biology Core Facility with Applied Biosystems Model 377 DNA sequencers and BigDye™ (PerkinElmer) terminator sequencing chemistry. The University of Wisconsin GCG software package,52 Version 10.2, was used to analyze the nucleotide and protein sequences.

ACKNOWLEDGEMENTS
We thank Luanne Wussow for her assistance with the preparation of this manuscript, Michael Heldebrant and Harold Solbrig for their assistance with the GeneWise program, and Dr Rebecca Raftogianis Blanchard, Emily Aaronson, Richard Hwang, and Ray Mak for their contributions to the preliminary experimental characterization of novel genes. This work was supported in part by NIH grants RO1 GM35720 (RMW), UO1 GM61388 (RMW, CGC, EDW), and the Washington University Genome Analysis Training Program (NIH T32-HG00045) (RRF).

DUALITY OF INTEREST
Dr. Weinshilboum has either provided consulting services or presented seminars at Abbott Laboratories, Bristol-Myers Squibb, Eli Lilly, Johnson and Johnson, Roche and Merck, Inc. All fees and honoraria for these services and/or seminars were paid to Mayo Foundation. In addition, Drs. Wieben and Weinshilboum currently hold a peer-reviewed grant from Eli Lilly.

ABBREVIATIONS
HMM   hidden Markov model
MTC   multiple tissue cDNA
ORF   open reading frame
PAPS  3′-phosphoadenosine-5′-phosphosulfate
RACE  rapid amplification of cDNA ends
SULT  sulfotransferase
UTR   untranslated region

REFERENCES
1 Weinshilboum R, Otterness D. Conjugation–deconjugation reactions in drug metabolism and toxicity. Chapter 2, Sulfotransferase enzymes. In: Kauffman FC (ed) Handbook of Experimental Pharmacology Series, Vol 112 Springer-Verlag: Berlin, Heidelberg 1994; pp 45–78.
2 Falany CN. Enzymology of human cytosolic sulfotransferases. FASEB J 1997; 11: 206–216.
3 Weinshilboum RM, Otterness DM, Aksoy IA, Wood TC, Her C, Raftogianis RB. Sulfotransferase molecular biology: cDNAs and genes. FASEB J 1997; 11: 3–14.
4 Komatsu K, Driscoll WJ, Koh Y, Strott CA. A P-loop related motif (GxxGxxK) highly conserved in sulfotransferases is required for binding the activated sulfate donor. Biochem Biophys Res Commun 1994; 204: 1178–1185.
5 Driscoll WJ, Komatsu K, Strott CA. Proposed active site domain in sulfotransferase as determined by mutational analysis. Proc Natl Acad Sci USA 1995; 92: 12328–12332.
6 Marsolais F, Varin L. Identification of amino acid residues critical for catalysis and cosubstrate binding in the flavonol 3-sulfotransferase. J Biol Chem 1995; 270: 30458–30463.
7 Petrotchenko EV, Pedersen LC, Borchers CH, Tomer KB, Negishi M. The dimerization motif of cytosolic sulfotransferases. FEBS Lett 2001; 490: 39–43.
8 Raftogianis RB, Freimuth RR, Buck J, Weinshilboum RM, Coughtrie MWH. A proposed nomenclature system for the cytosolic sulfotransferase (SULT) superfamily. Pharmacogenetics 2004; in press.
9 Negishi M, Pedersen LG, Petrotchenko E, Shevtsov S, Gorokhov A, Kakuta Y et al. Structure and function of sulfotransferases. Arch Biochem Biophys 2001; 390: 149–157.
10 Fujita K, Nagata K, Ozawa S, Sasano H, Yamazoe Y. Molecular cloning and characterization of rat ST1B1 and human ST1B2 cDNAs, encoding thyroid sulfotransferases. J Biochem 1997; 122: 1052–1061.
11 Her C, Kaur GP, Athwal RS, Weinshilboum RM. Human sulfotransferase SULT1C1: cDNA cloning, tissue-specific expression, and chromosomal localization. Genomics 1997; 41: 467–470.
12 Her C, Wood TC, Eichler EE, Mohrenweiser HW, Ramagli LS, Siciliano MJ et al. Human hydroxysteroid sulfotransferase SULT2B1: two enzymes encoded by a single chromosome 19 gene. Genomics 1998; 53: 284–295.
13 Sakakibara Y, Yanagisawa K, Katafuchi J, Ringer DP, Takami Y, Nakayama T et al. Molecular cloning, expression, and characterization of novel human SULT1C sulfotransferases that catalyze the sulfonation of N-hydroxy-2-acetylaminofluorene. J Biol Chem 1998; 273: 33929–33935.


14 Wang J, Falany JL, Falany CN. Expression and characterization of a novel thyroid hormone-sulfating form of cytosolic sulfotransferase from human liver. Mol Pharmacol 1998; 53: 274–282.
15 Falany CN, Xie X, Wang J, Ferrer J, Falany JL. Molecular cloning and expression of novel sulphotransferase-like cDNAs from human and rat brain. Biochem J 2000; 346: 857–864.
16 Aksoy IA, Weinshilboum RM. Human thermolabile phenol sulfotransferase gene (STM): molecular cloning and structural characterization. Biochem Biophys Res Commun 1995; 208: 786–795.
17 Her C, Aksoy IA, Kimura S, Brandriff BF, Wasmuth JJ, Weinshilboum RM. Human estrogen sulfotransferase gene (STE): cloning, structure, and chromosomal localization. Genomics 1995; 29: 16–23.
18 Otterness DM, Her C, Aksoy S, Kimura S, Wieben ED, Weinshilboum RM. Human sulfotransferase gene: molecular cloning and structural characterization. DNA Cell Biol 1995; 14: 331–341.
19 Her C, Raftogianis R, Weinshilboum RM. Human phenol sulfotransferase STP2 gene: molecular cloning, structural characterization, and chromosomal localization. Genomics 1996; 33: 409–420.
20 Raftogianis RB, Her C, Weinshilboum RM. Human phenol sulfotransferase pharmacogenetics: STP1 gene cloning and structural characterization. Pharmacogenetics 1996; 6: 473–487.
21 Freimuth RR, Raftogianis RB, Wood TC, Moon E, Kim U-J, Xu J et al. Human sulfotransferases SULT1C1 and SULT1C2: cDNA characterization, gene cloning, and chromosomal localization. Genomics 2000; 65: 157–165.
22 Meinl W, Glatt H. Structure and localization of the human SULT1B1 gene: neighborhood to SULT1E1 and a SULT1D pseudogene. Biochem Biophys Res Commun 2001; 288: 855–862.
23 Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol 1990; 215: 403–410.
24 Eddy SR. HMMER: Profile Hidden Markov Models for Biological Sequence Analysis 2001. http://hmmer.wustl.edu.
25 Birney E, Copley R. Wise2 Software Package 2001. http://www.ebi.ac.uk/Wise2/.
26 Eddy SR. Profile hidden Markov models. Bioinformatics 1998; 14: 755–763.
27 Kozak M. An analysis of 5′-noncoding sequences from 699 vertebrate messenger RNAs. Nucleic Acids Res 1987; 15: 8125–8148.
28 Twigg SRF, Burns HD, Oldridge M, Heath JK, Wilkie AOM. Conserved use of a non-canonical 5′ splice site (/GA) in alternative splicing by fibroblast growth factor receptors 1, 2 and 3. Hum Molec Genet 1998; 7: 685–691.
29 Burset M, Seledtsov IA, Solovyev VV. Analysis of canonical and non-canonical splice sites in mammalian genomes. Nucleic Acids Res 2000; 28: 4364–4375.
30 Tarn W-Y, Steitz JA. A novel spliceosome containing U11, U12, and U5 snRNPs excises a minor class (AT–AC) intron in vitro. Cell 1996; 84: 801–811.
31 Reveley AM, Carter SMB, Reveley MA, Sandler M. A genetic study of platelet phenolsulfotransferase activity in normal and schizophrenic twins. J Psychiatr Res 1982/1983; 17: 303–307.
32 Price RA, Cox NJ, Spielman RS, Van Loon J, Maidak BL, Weinshilboum RM. Inheritance of human platelet thermolabile phenol sulfotransferase (TL PST) activity. Genet Epidemiol 1988; 5: 1–15.
33 Price RA, Spielman RS, Lucena AL, Van Loon JA, Maidak BL, Weinshilboum RM. Genetic polymorphism for human platelet thermostable phenol sulfotransferase (TS PST) activity. Genetics 1989; 122: 905–914.
34 Raftogianis RB, Wood TC, Otterness DM, Van Loon JA, Weinshilboum RM. Phenol sulfotransferase pharmacogenetics in humans: association of common SULT1A1 alleles with TS PST phenotype. Biochem Biophys Res Commun 1997; 239: 298–304.
35 Thomae BA, Rifki OF, Theobald MA, Eckloff BW, Wieben ED, Weinshilboum RM. Human catecholamine sulfotransferase (SULT1A3) pharmacogenetics: common functional genetic polymorphism. J Neurochem 2003; 87: 809–819.
36 Adjei AA, Thomae BA, Prondzinski JL, Eckloff BW, Wieben ED, Weinshilboum RM. Human estrogen sulfotransferase (SULT1E1) pharmacogenomics: gene resequencing and functional genomics. Br J Pharmacol 2003; 139: 1373–1382.
37 Thomae BA, Eckloff BW, Freimuth RR, Wieben ED, Weinshilboum RM. Human sulfotransferase SULT2A1 pharmacogenetics: genotype-to-phenotype studies. Pharmacogenomics J 2002; 2: 48–56.
38 Nowell S, Sweeney C, Winters M, Stone A, Lang NP, Hutchins LF et al. Association between sulfotransferase 1A1 genotype and survival of breast cancer patients receiving tamoxifen therapy. J Natl Cancer Inst 2002; 94: 1635–1640.
39 Nagata K, Yoshinari K, Ozawa S, Yamazoe Y. Arylamine activating sulfotransferase in liver. Mutat Res 1997; 376: 267–272.
40 Wilborn TW, Comer KA, Dooley TP, Reardon IM, Heinrikson RL, Falany CN. Sequence analysis and expression of the cDNA for the phenol-sulfating form of human liver phenol sulfotransferase. Mol Pharmacol 1993; 43: 70–77.
41 Zhu X, Veronese ME, Iocco P, McManus ME. cDNA cloning and expression of a new form of human aryl sulfotransferase. Int J Biochem Cell Biol 1996; 28: 565–571.
42 Zhu X, Veronese ME, Bernard CC, Sansom LN, McManus ME. Identification of two human brain aryl sulfotransferase cDNAs. Biochem Biophys Res Commun 1993; 195: 120–127.
43 Wang J, Falany JL, Falany CN. Expression and characterization of a novel thyroid hormone-sulfating form of cytosolic sulfotransferase from human liver. Mol Pharmacol 1998; 53: 274–282.
44 Sakakibara Y, Yanagisawa K, Katafuchi J, Ringer DP, Takami Y, Nakayama T et al. Molecular cloning, expression, and characterization of novel human SULT1C sulfotransferases that catalyze the sulfonation of N-hydroxy-2-acetylaminofluorene. J Biol Chem 1998; 273: 33929–33935.
45 Aksoy IA, Wood TC, Weinshilboum R. Human liver estrogen sulfotransferase: identification by cDNA cloning and expression. Biochem Biophys Res Commun 1994; 200: 1621–1629.
46 Otterness DM, Wieben ED, Wood TC, Watson WG, Madden BJ, McCormick DJ et al. Human liver dehydroepiandrosterone sulfotransferase: molecular cloning and expression of cDNA. Mol Pharmacol 1992; 41: 865–872.
47 Falany CN, Xie X, Wang J, Ferrer J, Falany JL. Molecular cloning and expression of novel sulphotransferase-like cDNAs from human and rat brain. Biochem J 2000; 346: 857–864.
48 Sakakibara Y, Yanagisawa K, Takami Y, Nakayama T, Suiko M, Liu MC. Molecular cloning, expression, and functional characterization of novel mouse sulfotransferases. Biochem Biophys Res Commun 1998; 247: 681–686.
49 Cao H, Agarwal SK, Burnside J. Cloning and expression of a novel chicken sulfotransferase cDNA regulated by GH. J Endocrinol 1999; 160: 491–500.
50 Sonnhammer EL, Eddy SR, Durbin R. Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins 1997; 28: 405–420.
51 Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, Eddy SR et al. The Pfam protein families database. Nucleic Acids Res 2002; 30: 276–280.
52 Devereux J, Haeberli P, Smithies O. A comprehensive set of sequence analysis programs for the VAX. Nucleic Acids Res 1984; 12: 387–395.
53 Boguski MS, Lowe TM, Tolstoshev CM. dbEST - database for ‘expressed sequence tags’. Nat Genet 1993; 4: 332–333.
54 Ho SN, Hunt HD, Horton RM, Pullen JK, Pease LR. Site-directed mutagenesis by overlap extension using the polymerase chain reaction. Gene 1989; 77: 51–59.
