<<

The Pharmacogenomics Journal (2001) 1, 193–203  2001 Nature Publishing Group All rights reserved 1470-269X/01 $15.00 www.nature.com/tpj ORIGINAL ARTICLE

Single nucleotide polymorphism identification in candidate systems of obesity

K Irizarry1,3 ABSTRACT GHu1,3 We have constructed a large panel of single nucleotide polymorphisms (SNP) 2 identified in 68 candidate for obesity. Our panel combines novel SNP M-L Wong identification methods based on EST data, with public SNP data from large- J Licinio2 scale genomic sequencing, to produce a total of 218 SNPs in the coding CJ Lee1,3 regions of obesity candidate genes, 178 SNPs in untranslated regions, and over 1000 intronic SNPs. These include new non-conservative amino acid 1Dept of Chemistry and Biochemistry, changes in thyroid receptor beta, esterase D, acid phosphatase 1. Our data University of California, Los Angeles, CA, USA; show evidence of negative selection among these polymorphisms implying 2Laboratory of Pharmacogenomics, Neuropsychiatric Institute, University of functional impacts of the non-conservative . Comparison of overlap California, Los Angeles, CA, USA; 3UCLA-DOE between SNPs identified independently from EST data vs genomic sequen- Laboratory for Structural Biology and cing indicate that together they may constitute about one half of the actual Molecular Medicine, University of California, total number of amino acid polymorphisms in these genes that are common Los Angeles, CA, USA in the human population (defined here as a population frequency Correspondence: above 5%). We have analyzed our polymorphism panel to construct a data- Dr CJ Lee, UCLA-DOELaboratory for base of detailed information about their location in the gene structure and Structural Biology and Molecular effect on coding, available on the web at http://www.bioinformat Medicine, University of California, Los ics.ucla.edu/snp/obesity. We believe this panel can serve as a valuable new Angeles, CA 90095–1570, USA Tel: +1 310 825 7374 resource for genetic and pharmacogenomic studies of the causes of obesity. Fax: +1 310 267 0248 The Pharmacogenomics Journal (2001) 1, 193–203. E-mail: leecȰmbi.ucla.edu Keywords: obesity; SNP; neuropeptides; body weight regulation: appetite; polymor- phisms

INTRODUCTION The draft sequence of the human has now been completed to 93% cover- age, with an estimated total of 32000 genes.1,2 Of those, 22000 have been ident- ified so far by stricter experimental criteria. This is likely to have profound and far-reaching impact on the , pharmacogenetics, and pharmacogenomics of complex diseases. The most interesting question for investigators today is, where to start? While genome-wide single-nucleotide polymorphism (SNP) map- ping has the potential to identify markers associated with treatment responses of common and complex disorders, use of this approach is still in its infancy and has not demonstrated success yet.3 Meanwhile, association approaches have examined associations of specific SNPs or small groups of SNPs to . With some notable exceptions, those efforts have also not yet borne fruit. An alternative approach would be to identify large batches of SNPs in candidate gene systems, that could be used for genomic and pharmacogenomic studies.4 Obesity provides a logical target for such an SNP discovery effort. Obesity is a common and complex disorder that represents the outcome of gene–environ- ment interactions. In the general population of the US, the prevalence of over- −2 Received: 2 March 2001 weight individuals (body mass index—BMI—between 25 and 29.9 kg m )is Revised: 17 July 2001 34.9% among women and 31.7% among men, and it more than doubles between Accepted: 18 July 2001 the ages of 20 and 55. The incidence of obesity (BMI equal to or higher than 30 SNP identification in candidate gene systems of obesity K Irizarry et al 194

kg m−2) has been conservatively estimated to be at 11%. RESULTS Among women, obesity is strongly associated with socio- Identification of SNPs in Obesity Candidate Genes economic status, being twice as common among those with We based our SNP discovery effort on the Human Gene lower socioeconomic status as it is among those with higher Glossary of the Human Obesity Gene Map (105 candidate status. Weight gain has been described as epidemic among obesity genes; http://www.obesity.chair.ulaval.ca/gloss children and adolescents. The data from the Third National ary.html), as a well-established set of genes for future genetic Health and Nutrition Examination Survey (NHANES III and pharmacogenetic studies of obesity.19 We were able to 1988–1994) indicate that approximately 14% of children map 75 genes from this set to EST clusters from Unigene by and 12% of adolescents are overweight. Among adults, matching gene identifiers. We used these 75 genes for EST- approximately 33% of men and 36% of women were over- based SNP identification as described in this paper. This set weight. Among women, 34% of non-Hispanic whites, 52% included genes encoding neuropeptides and their receptors, of non-Hispanic blacks, and 50% of Mexican Americans transporters, enzymes and regulatory involved in were overweight. Racial/ethnic group-specific variation neurotransmitter synthesis, reuptake, and signaling, as well among men was less than that among women.5 Obesity is as metabolic enzymes. also a well-recognized major risk factor for heart disease. The To identify SNPs in this set of genes, we searched through American Heart Association (AHA) has declared obesity as a the Unigene database of expressed sequence tags (EST) for overlapping cDNA sequence fragments of these genes from ‘major risk factor’ and an expert committee of the National different people. We assembled a total of 9293 sequences Institutes of Health/National Heart Lung Blood Institute (8942 ESTs, 222 mRNAs, and 129 ) for this candidate established obesity as an independent risk factor for heart gene set (Unigene release July 2000), representing 698 disease based on a meta-analysis of over 300 prospectively- libraries. Since most of these libraries were prepared from a controlled randomized trials in literature. A small fraction single individual’s DNA, and a few from multiple individ- of obese patients have been identified who have specific uals, these data represent sequence from more than 700 Mendelian gene mutations that appear to cause obesity. individuals. Sequences from about nine people on average These include mutations in the leptin, POMC, prohormone were available for each gene. From this set of 75 clusters, 45 6–13 convertase 1, and MCR4 receptor genes. contained at least one SNP. The pharmacogenomics of obesity is a new and understu- Based on this gene set, we identified 152 polymorphisms, died area. While some treatments for obesity can have a including 76 cSNPs causing 51 amino acid changes (Table positive outcome for a minority of patients, it is unknown 1). We obtained the original sequencing chromatogram data why some patients respond to pharmacological and/or for the ESTs, and performed a detailed analysis consisting of 14–17 behavioral interventions, while others do not. multiple sequence alignment of overlapping ESTs for each Considerable progress has been achieved on the neurobi- gene, removal of potential paralogous ESTs, and statistical ology of obesity. A variety of new and established candidate scoring of candidate SNPs (see Methods). We selected candi- genes are now known to interact at the central level, parti- date SNPs with a likelihood odds ratio greater than 103 in cularly in the hypothalamus, to regulate food intake, energy favor of a polymorphism as opposed to a sequencing error. expenditure and body weight. The interactions of those We limited this study to candidate SNPs with population gene products and their role in human energy homeostasis allele frequency greater than 1%, and at least one EST obser- have been one of the most exciting areas of biomedical vation of the polymorphic band with good chromatogram investigation in the last decade. For reviews see Refs 18– quality (see Methods), or an observation in a curated 24.18–24 The Human Obesity Gene Map provides an excellent mRNA sequence. resource of candidate genes that may be involved in human Our analysis also estimated population allele frequency obesity.19 We have based our work on this extensive listing for each SNP, reporting an expectation value and 95% rank of candidate genes, to identify polymorphisms that can confidence interval (see Table 2). Population allele fre- serve as useful tools for analyzing the genetic components quencies for the obesity gene SNPs ranged from 3% to 37% of obesity. We have constructed a large panel of single nucleotide Table 1 Obesity gene SNP discovery based on EST data polymorphisms (SNP) in these candidate obesity genes, combining novel SNP identification methods based on EST Analysis stage Genes SNPs data with public SNP data from large-scale genomic sequen- cing. We believe this panel can serve as a valuable new Human Obesity Map gene list 105 resource for genetic and pharmacogenomic studies of the Mapped to Unigene clusters by gene identifier 75 causes of obesity. It includes over 1400 SNPs in obesity can- SNPs identified in EST data 45 152 Untranslated region (UTR) SNPs 27 76 didate genes, as well as a database of detailed information Protein coding region SNPs (cSNPs) 28 76 about their location in the gene structure and effect on Non-conservative amino acid changes 13 22 protein coding, which will be available on the web at http://www.bioinformatics.ucla.edu/snp/obesity. This table summarizes the steps by which we identified SNPs starting from the Human Obesity Gene Map gene glossary, and categorized them by location and effect (see text).

The Pharmacogenomics Journal SNP identification in candidate gene systems of obesity K Irizarry et al 195

Table 2 Amino acid changes in genes believed to be involved in obesity

cSNP Amino acid effects mapped onto proteins Allele frequency Observed cSNP evidence

Gene Allele id lod pos maj min cod effect F low F high ESTs libs DNA mRNA qual

ORM1 25199 12.42 167 R C 1 −5 0.06 0.21 6 4 0 0 40 PGM1 21126 5.27 221 R C 1 −5 0.06 0.38 1 1 0 1 99 BF 328111 4.77 32 R W 1 −4 0.04 0.25 3 2 0 0 35 GYPA 536152 19.16 20 S L 2 −4 0.25 0.46 4 4 2 3 99 ACP1 362490 11.03 61 Q L 2 −3 0.08 0.22 4 3 0 3 99 ACP1 362475 5.56 60 G E 2 −3 0.07 0.21 4 3 0 3 99 ACP1 362622 15.47 70 P H 2 −3 0.1 0.23 5 4 0 3 99 ADRB2 30929 10.4 16 R G 1 −3 0.14 0.33 1 1 3 1 99 ESD 437381 12.69 205 G E 2 −3 0.03 0.09 6 5 0 0 59 THRB 585832 4.01 238 P R 2 −3 0.05 0.4 1 1 0 0 59 APOB 2791 6.69 3319 H D 1 −2 0.13 0.44 0 0 1 2 99 ACP1 362633 13.1 71 M T 2 −1 0.08 0.2 4 3 0 3 99 AGT 37530 23.02 268 M T 2 −1 0.14 0.4 7 4 0 0 47 AGT 37105 17.44 207 T M 2 −1 0.09 0.28 8 4 0 0 34 APOB 2937 7.94 3732 I T 2 −1 0.11 0.34 0 0 0 4 99 APOB 2804 7.32 3427 K T 2 −1 0.11 0.4 1 1 1 1 99 GLO1 353517 25.75 111 A E 2 −1 0.19 0.39 7 7 0 1 99 GYPA 536147 8.55 13 A E 2 −1 0.08 0.25 3 3 0 1 99 GYS1 7794 3.56 706 S R 1 −1 0.07 0.41 0 0 0 2 99 GYS1 7606 3.98 136 T I 2 −1 0.08 0.43 0 0 0 2 99 ORM1 22013 14.95 144 V F 1 −1 0.04 0.18 6 3 0 0 47 PON2 754162 50.3 311 S C 2 −1 0.27 0.42 16 15 0 1 99 ACP1 362185 11.52 47 T A 1 0 0.09 0.23 3 2 0 3 99 APOB 429 3.36 618 A V 2 0 0.08 0.43 0 0 0 2 99 GPC4 313929 4.14 442 A V 2 0 0.08 0.43 1 1 0 1 99 HSD3B1 267920 25.29 367 N T 2 0 0.41 0.63 7 4 0 5 99 LEPR 840643 6.35 85 T A 1 0 0.13 0.44 0 0 0 3 99 ORM1 20485 7.89 131 A G 2 0 0.04 0.17 6 3 0 0 38 PON2 751957 18.24 148 A G 2 0 0.19 0.43 5 5 0 1 99 ACP1 362447 4.34 59 R Q 2 1 0.07 0.21 4 3 0 3 99 ACP1 363031 43.57 106 R Q 2 1 0.28 0.43 15 12 2 2 99 APOA4 16924 3.32 380 Q H 3 1 0.04 0.25 0 0 1 1 99 APOB 3211 23.54 4338 N S 2 1 0.06 0.26 8 1 0 1 99 APOB 3173 8.19 4181 E K 1 1 0.16 0.42 1 1 0 4 99 APOB 3061 8.43 3949 F L 1 1 0.13 0.4 0 0 0 4 99 KEL 2337 5.17 698 H N 1 1 0.04 0.27 2 2 0 0 51 LEPR 840648 6.03 223 Q R 2 1 0.15 0.5 0 0 0 3 99 LIPC 115006 4.94 215 S N 2 1 0.08 0.31 0 0 1 2 99 ORM1 6193 150.15 38 R Q 2 1 0.37 0.58 34 6 1 2 99 ORM1 20362 5.58 130 L F 1 1 0.04 0.18 5 4 0 0 32 ORM1 26385 24.95 174 V M 1 1 0.12 0.28 9 6 0 0 59 KEL 2310 4.07 692 D E 3 2 0.04 0.27 2 2 0 0 34 ACP1 362544 11.95 64 M L 1 3 0.07 0.19 3 2 0 3 99 ADRB2 30939 3.96 27 Q E 1 3 0.04 0.19 1 1 1 1 99 APOA4 16898 4.76 279 R K 2 3 0.11 0.4 0 0 1 2 99 APOB 2811 7.16 3432 E Q 1 3 0.11 0.4 1 1 1 1 99 LEPR 840646 5.03 109 K R 2 3 0.14 0.49 0 0 0 3 99 ORM1 25381 11.67 170 K R 2 3 0.06 0.2 7 4 0 0 40 ACP1 362648 8.17 74 V I 1 4 0.07 0.2 4 3 0 3 99 APOB 3074 3.98 3964 Y F 2 4 0.08 0.31 0 0 0 3 99 ORM1 19878 3.31 128 Y F 2 4 0.03 0.17 4 3 0 0 32

Each non-synonymous cSNP identified from EST data is listed, in order of most to least disruptive amino acid replacements, with its log-odds score (lod), amino acid position (pos), major vs minor allele amino acid (maj, min), position of the SNP in the codon (cod), BLOSUM62 matrix score for the substitution (effect; negative numbers are more disruptive), low vs high population allele frequency estimates (F low, F high), numbers of ESTs, libraries, DNA and mRNA sequences contributing evidence for the minor allele, and highest PHRED quality score observed for the minor allele (qual).

www.nature.com/tpj SNP identification in candidate gene systems of obesity K Irizarry et al 196

in our data. We did not observe striking patterns within of functional selection (Figure 1c). Under a random these allele frequencies. For example, allele frequencies for model, 79% of amino acid replacements should synonymous vs non-synonymous cSNPs did not show stat- be non-conservative (vs 21% conservative). In our data, only istically significant differences. Unfortunately, our SNP data 44% of our detected cSNPs were non-conservative, vs 56% do not include any ethnicity information, since NIH has conservative, more than double that expected by random mandated that such information be stripped from public mutation. These data suggest negative selection has EST sequencing data. The dbEST libraries likely represent a occurred against mutations that were more likely to be struc- cosmopolitan but primarily Caucasian population. Since turally disruptive. SNP allele frequencies can vary widely across ethnic Indeed, for our non-conservative SNPs, there is already groups,25,26 this is a limitation and disadvantage of our data- substantial evidence of association with human diseases. Of set. the 22 non-conservative mutations we identified, a third (7 We have validated a significant fraction of these polymor- of 22) have already been reported in OMIM or SwissProt to phisms vs studies of independent human population be associated with diseases such as hypertension, coronary samples. Candidate SNPs from our EST-based approach have artery disease, plasma glucose levels, and body mass index been tested in 8–24 individuals from California (70% vali- (Table 3). While this is only evidence of correlation, not dation rate27) or Finland (80% validation rate) (G Hu, causation, this surprisingly high ‘hit rate’ increases the inter- unpublished data). For these obesity gene SNPs, we were est of this class of SNPs, especially since these conditions able to validate 25 of the 51 non-synonymous polymor- may be related to obesity. By contrast, we have found a phisms by comparison with amino acid mutations listed in reported disease association for only one of our conservative databases such as SwissProt29 and OMIM. Since these data- SNPs (ADRB2 Q27E). The observation that our non-con- bases by no means contain complete coverage of human servative SNPs are more likely to be associated with disease polymorphism, 100% validation is not expected. We have suggests they may themselves have a functional effect. also compared our EST-based SNPs for the candidate obesity genes vs polymorphisms independently identified by gen- Combining Obesity Gene SNPs from EST and Genomic omic sequencing.30 Overall, 39% of our obesity-gene SNPs Data were independently identified by public genomic sequen- We have also combined our EST-based SNP identification cing, as reported in the dbSNP database.31 results with public SNP data from genomic sequencing30 mapped to the candidate obesity gene set (Table 4). This Characterization of Obesity Gene Polymorphisms yields a total of 218 cSNPs in 48 obesity candidate genes, We have mapped these SNPs to the associated mRNA and and 178 UTR-SNPs. In addition, genomic sequencing has protein sequences where possible. Overall, 50% mapped to identified 1022 SNPs in the intronic regions of these genes. the protein coding region (cSNPs) (Figure 1). This is dramati- Although these polymorphisms are less likely to produce a cally higher than the frequency of cSNP identification from functional effect, they can still be useful markers for map- bulk genomic sequencing, which is typically only a few per- ping studies based on . cent.30 We aligned 76 cSNPs to their corresponding amino These results demonstrate the highly enriched value of acid positions in the protein sequences, representing 28 of EST-based cSNP discovery, vs conventional SNP discovery the candidate obesity genes. The remaining 76 SNPs were in from genomic sequencing (Figure 2). Out of the 1.4 million untranslated regions (UTR). We have catalogued the amino SNPs identified by genomic sequencing, only a few percent acid changes caused by cSNPs in these genes (Table 2). fall within genes.30 And of these, only a small fraction are The distribution of cSNPs in these genes shows evidence in exons or protein coding regions, as can be seen clearly in

Figure 1 Characterization of SNP effects on protein coding.

The Pharmacogenomics Journal SNP identification in candidate gene systems of obesity K Irizarry et al 197

Table 3 Known disease associations for obesity gene cSNPs

Gene Source pos maj min Phenotypic association

ADRB2 omim 16 R G susceptibility to nocturnal asthma35 ADRB2 omim 27 Q E susceptibility to obesity36 AGT swiss 207 T M associated with hypertension29 AGT swiss 268 M T associated with hypertension29 GYPA swiss 20 S L determines blood group M29 LEPR omim 223 Q R lower plasma leptin levels; effects body mass index and fat mass37,38 PON2 omim 148 A G associated with variations in fasting plasma glucose39 PON2 omim 311 C S associated with higher risk of coronary artery disease40

Of the non-synonymous SNPs identified by our study, a number correspond to amino acid changes already identified in association with human disease or other phenotypes. For each SNP, the amino acid position (pos), major and minor allele amino acids (maj, min), and phenotypic association are listed. the obesity gene set. Of the genomic SNPs mapped inside ified common SNPs in a wide range of genes that regulate these genes by dbSNP, just 11% were in coding region and metabolic rate and/or endocrine signaling, and a variety of 8% in UTR exons; the rest were intronic (Figure 1b). By con- other pathways identified in the Human Obesity Gene trast, for our EST-based SNP discovery, these numbers were Map.19 These data includes both a large number of SNPs in 50% and 50%, respectively. Moreover, despite the huge protein coding regions (220), and SNPs in untranslated investment in SNP identification using genomic sequence, regions of exons (177). Although much emphasis has been and the relatively small fraction of all SNPs that have been placed on SNPs that cause amino acid replacements, non- contributed by EST-based approaches (about 3.6% of the coding SNPs can also have important functional effects. Our total), for coding region SNPs, ESTs can make a major contri- dataset also contains more than 1000 intronic SNPs for these bution. The EST-based cSNPs described in this paper consti- genes, which may be useful for mapping studies. tute more than a third (35%) of all cSNPs reported for these While many articles have addressed the issue of specific obesity candidate genes (Figure 2b). For SNPs in untrans- linkages to obesity phenotypes, this is to our knowledge the lated regions of exons (UTR), the corresponding fraction is first attempt to identify a full set of over 1400 SNPs related also high (40%). to obesity, along with information relevant to their func- How complete is this polymorphism dataset? We have tional impact. We hope these data will serve as a public compared the overlap between our results and independent resource to the scientific community and facilitate future SNP identification from genomic sequencing, to assess the genetic and pharmacogenetic work in this area. Moreover, level of saturation of common SNPs by the public SNP the full functional characterization of these and related identification efforts. Our data can be used as an inde- SNPs, and their effects on food intake, energy metabolism, pendent estimator of the completeness of the public SNP and body weight phenotypes, may represent a new direction data. We compared 148 of our obesity gene SNPs with gen- for genomic research in obesity. omic SNPs from the dbSNP database. Fifty-seven (39%) were The dataset presented in this paper should be viewed independently identified by genomic sequence-based SNP mainly as a source of new SNPs for obesity research, not as discovery efforts, suggesting that the genomic SNP detection a complete catalog of polymorphism for this field. There are data cover about 40% of the common cSNPs in these genes. several reasons for this. First, the goal of this work was not Thus, the combined EST- and genomic-based SNP data prob- to incorporate known polymorphisms from the literature. ably constitute a bit more than half of the total SNPs in Our dataset is undoubtedly missing many known SNPs that these genes that are common in the human population are of interest to obesity researchers, particularly those that (allele frequency above 5%). Based on the approximately are rare in the human population or restricted to particular 150 amino acid changes identified so far in the obesity ethnic groups. Second, the small population samples genes, this would suggest a total of about 300 amino acid inherent in sequencing-based high-throughput SNP dis- changes common in the human population, in these genes. covery efforts mean that they primarily detect common polymorphisms. In this study, the average number of indi- Accessing the Obesity Candidate SNP Database viduals represented for a given gene was nine people, giving All of these SNP data (both EST-based and genomic-based) a poor chance of detecting a SNP whose allele frequency is are available as supplementary material online: http:// below 5%. Third, the available data are by no means a com- www.bioinformatics.ucla.edu/snp/obesity. In addition, we plete catalog even of common polymorphisms. Comparison have deposited our EST-based SNPs on the public dbSNP of our EST-based SNP detection with independent SNP database. detection based on genomic sequencing suggests that the total data probably comprise only about half of the SNPs DISCUSSION that are common (allele frequency above 5%). Fourth, these These data provide a novel resource of single nucleotide data represent a cosmopolitan population pool without eth- polymorphisms relevant to obesity research. We have ident- nicity information (due to NIH guidelines that forbid such

www.nature.com/tpj SNP identification in candidate gene systems of obesity K Irizarry et al 198

information), and are undoubtedly missing many polymor- joint probability model was constructed, conditioning on phisms that are specific to a given ethnic group. the true sequence context and a weighted local miscall The genetic substrate of obesity appears to involve a com- count—the number of ‘unassigned’ bases (ie bases called as bination of genes of smaller and larger effects. Obesity ‘N’ instead of A, G, C or T) within 25 nucleotides of the appears to involve many genes. Therefore, a successful strat- position under consideration. The proximity of N indicates egy to dissect the genetics and pharmacogenetics of this poor trace quality and is associated with increases in local common and complex disorder may require correlating spe- error rate of up to 100 fold. To construct this model, 346 cific haplotypes across multiple genes or chromosomal million nucleotides of EST data were analyzed. regions. While our dense map of SNP markers will hopefully provide a useful starting point, there are many unsolved SNP Identification challenges remaining for identifying such multigenic obes- To identify likely SNPs, single base mismatches were 3 ity haplotypes. reported from multiple sequence alignments produced by the programs PHRAP (P Green, http://genome. METHODS washington.edu) and POA41 for each Unigene cluster. Each Sequencing Chromatogram Analysis mismatch was reported with seven nucleotides of the local Trace data of EST sequences in Unigene were obtained from consensus sequence context, centered on the mismatch, Washington University (genome.wustl.edu), and processed along with the corresponding segments of each sequence 32,33 with PHRED (generously provided by Phil Green, Uni- which align with this region. The PHRED quality score (or versity of Washington) to produce base calls and quality fac- weighted miscall count) for each observed base was also tors. reported. To measure representative rates of sequencing error for To evaluate the strength of the evidence for a SNP at a 34 single-pass reads of EST sequences from dbEST, we ana- given position in a gene, we calculated the likelihood-ratio lyzed multiple sequence alignments of EST clusters, where of the observed sequences under a SNP model vs a pure sufficient data were present to yield a reliable consensus. sequencing error model: Specifically, we limited our analysis to regions where at least five sequences were aligned and identical and no more than p(obs͉SNP) SNPscore = two other sequences disagreed with this consensus. More- p(obs͉err) over, sequence positions which were farther than two resi- dues to the left or right of a position that matches consensus For the sequencing error model we sum the probability of were excluded from the analysis, since these can arise from the observations over all possible ‘true sequence’ contexts chimeric or divergent sequences. Approximately 241 million T, to take into account any uncertainty about whether the nucleotides of aligned EST reads met these criteria, and were consensus T* is actually the true sequence of the gene, analyzed to give rates of sequencing error, categorizing each which would undercut confidence in the SNP prediction: base as consensus, mismatch, or insert. Bases present in the p(obs͉err) = ͸ p(obs͉T)p(T) consensus but missing in an individual read were counted T as deletions. To construct a statistical model for distinguishing sequen- The SNP model can be formulated in terms of the true cing error from likely SNPs, we constructed a joint prob- sequence T of the gene (initially unknown), and an ‘alter- ability model conditioned on two separate components: the nate sequence’ TЈ differing only by a single nucleotide sub- observed quality factor for a given base, and the true stitution at the central position: (unobserved) local sequence context around that base. Although the quality factor in a given read is a function of p(obs͉SNP) = ͸͸ p(obs͉T, TЈ) p(TЈ͉T) p(T) that individual observation, and should be largely inde- T TЈ 1 pendent of the true sequence, we constructed a joint prob- Ն p(obs͉T*, TЈ) ͩ ͪ p(T*) ability model that takes both into account without 3 assuming conditional independence. Each observed base Taking a conservative approach to the SNP model, of was reported within the context of five nucleotides of the allowing only a specific polymorphism TЈ and the consensus consensus sequence, centered on the observed base, and its sequence T*asthe‘true sequence’,weassumethatp(TЈ͉T*) PHRED-assigned quality factor. Observations of all possible = 1/3 and p(T) = constant over the possible gene sequences error states (no error; substitution by A/T/G/C/N; insertion T, which cancels in the numerator and denominator of the of A/T/G/C/N; deletion) for all sequence contexts and qual- likelihood ratio, giving: ity factor combinations were counted. Approximately 241 million nucleotides of data were analyzed. Finally, the con- ͉ Ј Ն 1 p(obs T*, T ) ditional error rates were smoothed and interpolated in such SNPscore ͩ ͪ 3 ͸ ͉ a way that they converge to the rate predicted by the PHRED p(obs T) T quality score32,33 when the PHRED quality score is higher than 40. Finally, ignoring the constant factor, we define a log-odds Since trace data for all ESTs were not available, a second score for ranking candidate SNPs:

The Pharmacogenomics Journal SNP identification in candidate gene systems of obesity K Irizarry et al 199

Table 4 Total combined SNPs from EST data and from public genomic sequencing (HGBASE, SNP Consortium), for the obesity candidate gene set

Gene Title Cluster Location Intragenic SNPs

Intron UTR Coding Total

ACP1 acid phosphatase 1 Hs.75393 2p25 0 2 15 17 ADA adenosine deaminase Hs.1217 20q12–q13.11 49 0 3 52 ADRA2A adrenergic, alpha-2A-, receptor Hs.249159 10q25 0 2 0 2 ADRB2 adrenergic, beta-2-, receptor Hs.2551 5q31–q32 0 8 4 12 ADRB3 adrenergic, beta-3-, receptor Hs.2549 8p12–p11.2 4 6 0 10 AGT Angiotensinogen Hs.3697 1q42–q43 0 1 2 3 AK1 adenylate kinase-1 Hs.76240 9q34.1 15 3 1 19 APOA4 apolipoprotein A–IV Hs.1247 11q23–q23 8 1 13 22 APOB apolipoprotein B Hs.585 2p24–p23 20 2 37 59 APOD apoliproprotein D Hs.75736 3q27–qter 9 3 4 16 ASIP agouti signaling protein Hs.37006 20q11.2–q12 3 0 0 3 ATP1A1 ATPase, Na+/K+ transporting, alpha 1 Hs.190703 1p13 0 2 0 2 ATP1A2 ATPase, Na+/K+ transporting, alpha 2 Hs.34114 1q21–q23 0 2 0 2 ATP1B1 ATPase, Na+/K+ transporting, beta 1 Hs.78629 1q22–q25 45 8 0 53 BF B-Factor, properdin Hs.69771 6p21.3 4 0 6 10 CBFA2T1 Core-binding factor, alpha unit 2 (MTG8) Hs.31551 8q22 12 2 0 14 CCKBR cholecystokinin B receptor Hs.203 11p15.4–p15.4 0 2 2 4 CD36L1 CD36 antigen (collagen type I receptor, Hs.180616 12pter–qter 66 5 1 72 thrombospondinreceptor)-like 1 CPE Carboxypeptidase E Hs.75360 4q32 11 4 1 16 DCP1 Dipeptidyl carbozypeptidase 1 (angiotensin 1 Hs.250711 17q23 0 0 1 1 converting enzyme, ACE) DRD2 dopamine receptor D2 Hs.73893 11q22.2–q22.3 15 6 5 26 ESD esterase D Hs.82193 13q14.1–q14.2 0 0 1 1 FABP2 fatty acid binding protein 2 Hs.251811 4q28–q31 6 5 2 13 FGF13 Fibroblast Growth Factor 13 Hs.6540 Xq26.3 4 3 0 7 GCGR glucagon receptor Hs.208 17q25 0 0 1 1 GCK glucokinase Hs.1270 7p15–p13 14 0 0 14 GLO1 glyoxalase Hs.75207 6p21.2–21.1 12 2 2 16 GNAS1 gluanine nucleotide binding protein, alpha Hs.273385 20q13.2–q13.3 0 0 1 1 stimulating activity polypeptide 1 GNB3 guanine nucleotide binding protein (G protein), Hs.71642 12p13–p13 0 2 3 5 beta polypeptide 3 GPC3 Heparan Sulfate Proteoglycan (Glypican) 3 Hs.119651 Xq26.1–q26.1 31 1 0 32 GPC4 Heparan Sulfate Proteoglycan (Glypican) 4 Hs.58367 Xq26–q26 1 0 2 3 GYPA glycophorin (includes MN blood group) Hs.108694 4q28–q31 0 1 3 4 GYS1 glycogen synthase 1 (muscle) Hs.772 19q13.3 0 5 12 17 HSD3B1 hydroxy-delta-5-steroid dehydrogenase, 3 beta Hs.38586 1p13.1 0 2 7 9 HTR2C 5-hydroxytryptamine (serotonin) receptor 2C Hs.46362 Xq24 0 1 0 1 IGF1 insulin-like growth factor I Hs.85112 12q22–q23 10 1 0 11 IGF2 insulin-like growth factor II Hs.251694 11p15.5 2 11 1 14 INS insulin Hs.89832 11p15.5 2 1 0 3 INSR insulin receptor Hs.89695 19p13.3 23 4 9 36 IRS1 insulin receptor substrate 1 Hs.96063 2q36–q36 8 10 0 18 ISL1 ISL1 transcription factor, Hs.505 5q22.3 2 3 0 5 LIM/homeodomain (islet-1)

(Continued)

where the summation is over a Hidden Markov Model rep- = ͉ Ј − ͫ͸ ͉ ͬ LOD log(p(obs T*, T )] log p(obs T) resenting all possible alignments A of the observed sequence T i to the true sequence T. The HMM sums the probability To evaluate the sequencing error model, we treat each of all possible ways the observed sequence could have been observed sequence i as an independent observation, emitted from the true sequence via a stochastic process of sequencing error. The match states (M) correspond to letters ͉ = ͹ ͉ = ͹͸ ͉ p(obs T) p(obsi T) p(obsi, A T), of the true sequence T, and deletion (D), insertion (I) and i i A emission probabilities are derived from the observed fre-

www.nature.com/tpj SNP identification in candidate gene systems of obesity K Irizarry et al 200

Table 4 Continued

Gene Title Cluster Location Intragenic SNPs

Intron UTR Coding Total

KEL Kell blood group Hs.157 7q33 1 0 2 3 LDLR low-density lipoprotein receptor Hs.213289 19p13.2 6 9 12 27 LEP leptin Hs.194236 7q31–q32 2 0 2 4 LEPR leptin receptor Hs.226627 1p31 36 0 6 42 LIPC lipase, hepatic Hs.9994 15q2–q22 104 0 7 111 LIPE lipase, hormone sensitive Hs.95351 19q13.1–q13.2 3 1 1 5 LPL lipoprotein lipase Hs.180878 8p22 66 11 9 86 MC4R melanocortin 4 receptors Hs.247980 18q21.3 0 0 1 1 MYO9A myosin IXA Hs.23395 15q21–q25 17 0 0 17 NPY neuropeptide Y Hs.1832 7pter–q22 24 2 7 33 NPY1R neuropeptide Y receptor Y1 Hs.169266 4q31.3–q32 0 1 0 1 ORM1 Hs.572 9q32 19 0 14 33 PCSK1 protein convertase subtilisin/kexin type 1 Hs.78977 5q15–q21 65 5 4 74 PGD phosphogluconate dehydrogenase Hs.75888 1p36.2–36.13 0 11 1 12 PGM1 phosphoglucomutase 1 Hs.1869 1p31 8 6 1 15 PMM2 phosphomannomutase 2 Hs.154695 16p13 4 1 0 5 POMC proopiomelanocortin Hs.1897 2p23 2 2 0 4 PON2 paraoxonase 2 Hs.169857 7q21.3–q21.3 7 1 2 10 PPARG peroxisome proliferator-activated receptor, Hs.100724 3p25 49 0 3 52 SLC2A1 gamma Hs.169902 1p35–p31.3 39 6 2 47 solute carrier family 2 (facilitated glucose transporter), member 1 SLC2A4 solute carrier family 2 (facilitated glucose Hs.95958 17p13 6 2 2 10 transporter), member 4 SNRPN small nuclear ribonucleoprotein polypeptide N Hs.250727 15q11–q12 0 0 1 1 THRB thyroid hormone receptor, beta Hs.121503 3p24.1–p22 84 1 1 86 TUB Tubby (mouse) homologue Hs.54468 11p15.5 92 3 0 95 UCP1 uncoupling protein 1 Hs.249211 4q28–q31 3 0 0 3 UCP2 uncoupling protein 2 Hs.80658 11q13 2 5 0 7 UCP3 uncoupling protein 3 Hs.101337 11q13 7 0 1 8 Total 1022 178 218 1418

For each gene, we list the Unigene cluster ID (cluster), map location, and numbers of intronic SNPs, UTR exon SNPs, and protein coding region SNPs, and total SNPs. quencies of sequencing errors conditioned on local window of residues that align to these seven residues. The pij = = sequence context (a 5mer window of T) and the observed are calculated over the two dimensional matrix i 1%Lobs,j = PHRED quality score. This corresponds to treating the pro- 1%LT, where Lobs and LT ( 7) are the lengths of the observed per alignment of the observed sequence to the true sequence and true sequence windows, respectively. Using p00 = 1 as as uncertain, and therefore summing over all possibilities. the origin, This sum is calculated using dynamic programming with the ͸ ͉ = recursion relation p(obsi, A T) pLobsLT A = ͉ ⌬͉ pi,j p(oi tj, Qi) pi − 1, j − 1 + p( tj, Qi) pi, j − 1 Secondly, we treat the true sequence itself as uncertain, ͉ + p(insert oi tj, Qi) pi − 1, j and therefore sum p(obs͉T) over all possible true sequences. Since candidate SNPs are evaluated within a 7-nucleotide where oi is the observed letter at position i, and tj is a five- residue window from the true sequence T centered on pos- window centered on the mismatch, this involves consider- ing 47 = 16384 possible true sequences: ition j, Qi is the PHRED quality factor for position i in the observed sequence, ⌬ means deletion of the residue at pos- ͸͹͸ ͉ ition j in T, and insert o means insertion of o as an ‘extra p(obsi, A T) i i i letter’ immediately before position j in T. The probabilities T A p(o͉t,Q) constitute the detailed model of sequencing error The SNP model p(obs͉T*,TЈ) differs from the error model which we have produced for EST data as described above. T first of all at the central position, where a simple weighted is treated as a seven-residue window centred on the putative mixture of the consensus sequence T* and the putative SNP SNP position, and from the observed sequence we take the TЈ is used. Thus, at the central position of the HMM

The Pharmacogenomics Journal SNP identification in candidate gene systems of obesity K Irizarry et al 201

Figure 2 Contribution of EST-based approaches to coding region SNP discovery.

= − ͉ * ⌬͉ * pi,j (1 q)(p(oi tj , Qi) pi − 1, j − 1 + p( tj , Qi) pi,j − 1 rence of the putative SNP over distinct library sources.

* Ј Whereas sequencing errors should assort randomly over the + p(insert o ͉t , Q ) p − )+q (p(o ͉t , Q ) p − − i j i i 1, j i j i i 1, j 1 set of libraries, a genuine SNP should follow a pattern con- ⌬͉ Ј ͉ Ј + p( tj, Qi) pi,j − 1 + p(insert oi tj, Qi) pi − 1, j) sistent with diploid genetics. Specifically, the SNP should be observed predominantly 50% of the time from sequences where q is the expected allele frequency of TЈ. from a heterozygote library, and 100% of the time from Treating the ESTs as independent observations, we individuals homozygous for the rare allele. Thus, the fre- would calculate quency expected from a positive library (50%) is typically ͉ Ј = ͹ ͉ Ј = ͹͸ ͉ Ј much greater than the allele frequency q in the general p(obs T*, T ) p(obsi T*, T ) p(obsi, A T*, T ) i i A population. In this way library information for the observed sequences can be a powerful discriminant of real SNPs from cDNA Library Information the high background of sequencing error. However, under the SNP model the EST observations are not We have combined this library information with a Baye- independent, because they are drawn from a small number sian approach to estimate the true population allele fre- of cDNA libraries each representing (typically) one human quency q of an SNP, subject to the quantity and quality of individual. An important discriminant of genuine SNPs vs sequence data available for detecting it. The probability of sequencing error can be obtained from the pattern of occur- an individual sequence, conditional on the zygosity value

www.nature.com/tpj SNP identification in candidate gene systems of obesity K Irizarry et al 202

for its library z = 0, 1, 2 (copy number of the polymorphism Foundation award to JL, Stanley Foundation award to JL, NIH grants in the individual from whom the DNA was derived) is P50AT00151–020002 and R21MH/NS62777 to M-LW, NARSAD award to M-LW, support for KI from USPHS National Research Service Award ͉ Ј = ͸ ͉ Ј GM08375, and support for KI from NSF IGERT award number 9987641. p(obsi T*, T , z) p(obsi, A T*, T , qz) A = DUALITY OF INTEREST where qz z/2 and is used in place of the allele frequency None declared. q in the HMM for the central position, above. Thus for a = = heterozygote library (z 1) we use qz 0.5. For multiple observations from one library L, we sum over uncertainty ABBREVIATIONS about the true value of z in the library: SNP – Single Nucleotide Polymorphism cSNP – coding region Single Nucleotide Polymorphism ͉ Ј = ͸ ͹ ͉ Ј eSNP – expressed Single Nucleotide Polymorphism p(obsi෈L T*, T ) p(z) p(obsi T*, T , z) ෈ UTR – untranslated region z=0,1,2 i L This allows us to introduce a dependence on the popu- lation allele frequency q assuming Hardy–Weinberg equilib- REFERENCES rium 1 Venter JC et al. The sequence of the . Science 2001; 291: 1304–1351. 2 International Human Genome Sequencing Consortium. Initial sequen- 2 − p(z͉q) = ͩ ͪ qz (1 − q)2 z cing and analysis of the human genome. Nature 2001; 409: 860–921. z 3 Weiss KM, Terwilliger JD. How many diseases does it take to map a gene with SNPs? Nature Genet 2000; 26: 151–157. For observations from multiple libraries, we substitute this 4 Halushka MK et al. Patterns of single-nucleotide polymorphisms in can- expression for p(z) in the previous equation didate genes for blood-pressure homeostasis. Nature Genet 1999, 22: 239–246. ͉ Ј = ͹ ͉ Ј 5 Prevention C.f.D.C.a. Update: Prevalence of overweight among chil- p(obs T*, T , q) p(obsi෈L T*, T , q) L dren, adolescents, and adults—United States, 1988–1994. MMWR Weekly 1997; 46: 199–202. Integrating over q, 6 Vaisse C, Clement K, Guy-Grand B, Froguel P. A frameshift mutation in human MC4R is associated with a dominant form of obesity. Nature 1 ͉ Ј = ͹ ͉ Ј Genet 1998; 20: 113–114. ͵ ෈ p(obs T*, T ) p(obsi L T*, T , q) p(q) dq 7 Yeo GS et al. A frameshift mutation in MC4R associated with domi- 0 L nantly inherited human obesity. Nature Genet 1998; 20: 111–112. 8 Cle´ment K et al. A mutation in the human leptin receptor gene causes 1 2 = ͵ ͹ ͸ ͩ ͪ qz(1 − q)2−z ͹͸ obesity and pituitary dysfunction. Nature 1998; 392: 398–401. z ෈ 9 Ozata M, Ozdemir IC, Licinio J. Human leptin deficiency caused by a 0 L z=0,1,2 i L A missense mutation: multiple endocrine defects, decrased sympathetic ͉ Ј p(obsi, A T*, T , z)p(q)dq tone, and immune system dysfunction indicate new targets for leptin action, greater central than peripheral resistance to the effects of lep- In this paper we use the uninformative prior p(q) = 1 for all tin, and spontaneous correction of leptin-mediated defects. J Clin q. The p(q) distribution will be subject to more in depth Endocrinol Metab 1999; 84: 3686–3695. 10 Montague CT et al. Congenital leptin deficiency is associated with sev- analysis in follow-up works. The posterior distribution for q ere early-onset obesity in humans. Nature 1997; 387: 903–908. is therefore 11 Strobel A, Issad T, Camoin L, Ozata M, Strosberg AD. A leptin missense mutation associated with hypogonadism and morbid obesity. Nature ͹ ͉ Ј Genet 1998; 18: 213–215. p(obsi෈L T*, T , q) 12 Krude H et al. Severe early-onset obesity, adrenal insufficiency and red p(q͉T*, TЈ, obs) = L 1 hair pigmentation caused by POMC mutations in humans. Nature ͹ ͉ Ј Genet 1998; 19: 155–157. ͵ ෈ p(obsi L T*, T , q)dq 0 L 13 Jackson RS et al. Obesity and impaired prohormone processing associa- ted with mutations in the human prohormone convertase 1 gene. Nat- One additional effect must be considered. Some libraries ure Genet 1997; 16: 303–306. were constructed by pooling cDNAs from more than one 14 Scheen AJ, Lefebvre PJ. Pharmacological treatment of obesity: present individual. In this case we define the ploidy of the library P status. Int J Obes Relat Metab Disord 1999; 23 Suppl 1: 47–53. = 15 Stock MJ. Sibutramine: a review of the pharmacology of a novel anti- 2N, where N is the number of people from whom DNA obesity agent. Int J Obes Relat Metab Disord 1997; 21 Suppl 1: S25–29. was pooled. Then z can take values z = 0, 1, 2, % P. Also, 16 Van Gaal LF, Wauters MA, Peiffer FW, De Leeuw IH. Sibutramine and = ͉ qz z/P, and p(z q) is simply the binomial distribution, sub- fat distribution: is there a role for pharmacotherapy in stituting P for 2 in the equation for p(z͉q) above. We abdominal/visceral fat reduction? Int J Obes Relat Metab Disord 1998; 22 Suppl 1: S38–42. obtained detailed information on library construction and 17 Weintraub M, Rubio A, Golik A, Byrne L, Scheinbaum ML. Sibutramine pooling for 850 libraries used in dbEST, and determined P in weight control: a dose-ranging, efficacy study. Clin Pharmacol Ther for all 35 libraries annotated as pooled. 1991; 50: 330–337. 18 Flier JS, Maratos FE. Obesity and the hypothalamus: novel peptides for new pathways. Cells 1998; 92: 437–440. ACKNOWLEDGEMENTS 19 Chagnon YC, Perusse L. Weisnagel SJ, Rankinen T, Bouchard C. The This work was supported by the following grants and awards: Depart- human obesity gene map: the 1999 update. Obes Res 2000; 8:89– ment of Energy grant DEFG0387ER60615 to CL, Searle Scholar Award 117. to CL, NIH grants K30HL0426, R01DK58851, U01GM61394 to JL, Dana 20 Elmquist JK, Maratos-Flier E, Saper CB, Flier JS. Unraveling the central

The Pharmacogenomics Journal SNP identification in candidate gene systems of obesity K Irizarry et al 203

nervous system pathways underlying responses to leptin. Nature Neu- sequencer traces using phred. I. Accuracy assessment. Genome Res rosci 1998; 1: 445–450. 1998; 8: 175–185. 21 Woods SC, Schwartz MW, Baskin DG, Seeley RJ. Food intake and the 33 Ewing B, Green P. Base-calling of automated sequencer traces using regulation of body weight. Annu Rev Psychol 2000; 51: 255–277. phred. II. Error probabilities. Genome Res 1998; 8: 186–194. 22 Woods SC, Seeley RJ. Adiposity signals and the control of energy 34 Boguski MS, Lowe TM, Tolstoshev CM. dbEST—database for homeostasis. Nutrition 2000; 16: 894–902. ‘expressed sequence tags’. Nature Genet 1993; 4: 332–333. 23 Schwartz MW, Woods SC, Porte D, Seeley RJ, Baskin DG. Central ner- 35 Turki J, Pak J, Green SA, Martin RJ, Liggett SB. Genetic polymorphisms vous system control of food intake. Nature 2000; 404: 661–671. of the beta-2-adrenergic receptor in nocturnal and nonnocturnal 24 Thorburn AW, Proietto J. Neuropeptides, the hypothalamus and obes- asthma: evidence that gly16 correlates with the nocturnal phenotype. ity: insights into the central control of body weight. Pathology 1998; J Clin Invest 1995; 95: 1635–1641. 30: 229–236. 36 Large V et al. Human beta-2 adrenoceptor gene polymorphisms are 25 Nickerson DA et al. Sequence diversity and large-scale typing of SNPs highly frequent in obesity and associate with altered adipocyte beta- in the human apolipoprotein E gene. Genome Res 2000; 10: 1532– 2 adrenoceptor function. J Clin Invest 1997; 100: 3005–3013. 1545. 37 Thompson DB et al. Association of the GLN223ARG polymorphism in 26 Tang J et al. TAPI polymorphisms in several human ethnic groups: the leptin receptor gene with plasma leptin levels, insulin secretion characteristics, evolution, and genotyping strategies. Hum Immunol and NIDDM in Pima Indians. Diabetolia 1997; 40: A177. 2001; 62: 256–268. 38 Quinton ND, Lee AJ, Ross RJM, Eastell R, Blakemore AIF. A single nucle- 27 Irizarry K et al. Genome-wide analysis of single-nucleotide polymor- otide polymorphism (SNP) in the leptin receptor is associated with phisms in human expressed sequences. Nature Genet 2000; 26: BMI, fat mass and leptin levels in postmenopausal Caucasian women. 233–236. Hum Genet 2001; 108: 233–236. 28 Reference deleted in proof. 39 Hegele RA et al. Paraoxonase-2 gene (PON2) G148 variant associated 29 Bairoch A, Apweiler R. The SWISS-PROT protein sequence data bank with elevated fasting plasma glucose in noninsulin-dependent diabetes and its supplement TrEMBL in 1998. Nucleic Acids Res 1998; 26:38–42. mellitus. J Clin Endocr Metab 1997; 82: 3373–3377. 30 International SNP Map Working Group. A map of human genome 40 Sanghera DK, Aston CE, Saha N, Kamboh MI. DNA polymorphisms in sequence variation containing 1.42 million single nucleotide polymor- two paraoxonase genes (PON1 and PON2) are associated with the risk phisms. Nature 2001; 409: 928–933. of coronary heart disease. Am J Hum Genet 1998; 62:36–44. 31 Sherry ST, Ward M, Sirotkin K. dbSNP-database for single nucleotide polymorphisms and other classes of minor . Genome Reference Added in Proof Res 1999; 9: 677–679. 41 Lee C, Grasso C, Sharlow M. Multiple sequence alignment using 32 Ewing B, Hillier L, Wendl MC, Green P. Base-calling of automated partial order graphs. Bioinformatics (in press).

www.nature.com/tpj