Integrative analysis of single nucleotide polymorphisms and gene expression efficiently distinguishes samples from closely related ethnic populations Yang et al.

Yang et al. BMC Genomics 2012, 13:346 http://www.biomedcentral.com/1471-2164/13/346 (28 July 2012) Yang et al. BMC Genomics 2012, 13:346 http://www.biomedcentral.com/1471-2164/13/346

RESEARCH ARTICLE Open Access Integrative analysis of single nucleotide polymorphisms and gene expression efficiently distinguishes samples from closely related ethnic populations Hsin-Chou Yang1*, Pei-Li Wang1, Chien-Wei Lin1, Chien-Hsiun Chen2 and Chun-Houh Chen1

Abstract Background: Ancestry informative markers (AIMs) are a type of genetic marker that is informative for tracing the ancestral ethnicity of individuals. Application of AIMs has gained substantial attention in population genetics, forensic sciences, and medical genetics. Single nucleotide polymorphisms (SNPs), the materials of AIMs, are useful for classifying individuals from distinct continental origins but cannot discriminate individuals with subtle genetic differences from closely related ancestral lineages. Proof-of-principle studies have shown that gene expression (GE) also is a heritable human variation that exhibits differential intensity distributions among ethnic groups. GE supplies ethnic information supplemental to SNPs; this motivated us to integrate SNP and GE markers to construct AIM panels with a reduced number of required markers and provide high accuracy in ancestry inference. Few studies in the literature have considered GE in this aspect, and none have integrated SNP and GE markers to aid classification of samples from closely related ethnic populations. Results: We integrated a forward variable selection procedure into flexible discriminant analysis to identify key SNP and/or GE markers with the highest cross-validation prediction accuracy. By analyzing genome-wide SNP and/or GE markers in 210 independent samples from four ethnic groups in the HapMap II Project, we found that average testing accuracies for a majority of classification analyses were quite high, except for SNP-only analyses that were performed to discern study samples containing individuals from two close Asian populations. The average testing accuracies ranged from 0.53 to 0.79 for SNP-only analyses and increased to around 0.90 when GE markers were integrated together with SNP markers for the classification of samples from closely related Asian populations. Compared to GE-only analyses, integrative analyses of SNP and GE markers showed comparable testing accuracies and a reduced number of selected markers in AIM panels. Conclusions: Integrative analysis of SNP and GE markers provides high-accuracy and/or cost-effective classification results for assigning samples from closely related or distantly related ancestral lineages to their original ancestral populations. User-friendly BIASLESS (Biomarkers Identification and Samples Subdivision) software was developed as an efficient tool for selecting key SNP and/or GE markers and then building models for sample subdivision. BIASLESS was programmed in R and R-GUI and is available online at http://www.stat.sinica.edu.tw/hsinchou/ genetics/prediction/BIASLESS.htm. Keywords: Single nucleotide polymorphism (SNP), Allele frequency, Gene expression, HapMap, Classification analysis, Ancestry informative marker (AIM)

* Correspondence: [email protected] 1Institute of Statistical Science, Academia Sinica, 128, Academia Road, Section 2 Nankang, Taipei 115, Taiwan Full list of author information is available at the end of the article

© 2012 Yang et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Yang et al. BMC Genomics 2012, 13:346 Page 2 of 12 http://www.biomedcentral.com/1471-2164/13/346

Background very challenging to classify samples from closely related Ancestry informative markers (AIMs) are the genetic mar- ancestral lineage (e.g., two East-Asian populations such kers carrying ancestral information for classifying samples as Han Chinese and Japanese) using only a small num- from a specific population or various ethnic populations ber of SNPs. In an example of a previous classification [1-12]. AIMs have been applied to various study areas, in- study [20], the HapMap II Asian, African and European cluding population genetics, forensic sciences, medical samples were separated with a classification accuracy of genetics, and others. In population genetics, AIMs can be 0.97 based on 64 SNPs on average. The number of SNPs used to estimate the genetic diversity, population differen- increased to 84 but the classification accuracy was tiation, and admixture proportions and thereby provide a reduced to 0.84 on average if Han Chinese and Japanese more detailed understanding of the genetic background of samples were further regarded as samples from different study populations [2,4,5,7,13,14]. In forensic sciences, sub-Asian populations and classified with African and AIMs can be used to infer ancestral or continental origin European samples jointly. This difficulty to classify the and thereby assist with victim identification in a disaster samples from proximate populations could be overcome situation or criminal identification in a venue [2,15,16]. In by using a large number of genetic markers [3,21], while medical genetics, AIMs are useful for reducing false posi- genotyping cost will increase significantly. tives and false negatives in genetic association studies. On Gene expression (GE) microarray technology has the one hand, AIMs can assist in adjusting for potential advanced in the past 20 years. Previous studies have genetic substructures in a case–control association study shown that GE also is polymorphic and heritable vari- and thereby reduce false positives (i.e., diminish spurious ation in humans [22,23]. Importantly, GE exhibits differ- association) [3]. On the other hand, AIMs can also be used ent genetic/genomic profiles in different ethnic to construct homogeneous sample groups in a genetic as- populations [24]. Similar to SNP markers, GE markers sociation study and thereby reduce false negatives (i.e., di- may potentially provide ancestral information for dis- minish power loss) [2]. In addition, AIMs can provide criminating samples from different ethnic populations. complementary information for self-reported ethnicity. In Of note, the different natures of SNP and GE markers contrast to self-reported ethnicity, which reflects an indivi- may mean that GE provides information that is supple- dual’s environment and culture, AIM-determined ethnicity mentary to SNP information: GE markers are quantita- inferred from genetic markers reflects genetic inheritance tive attributes responsible for gene regulation, and SNP and make-up. In particular, self-reported ethnicity may be markers may act as semi-quantitative (e.g., a locus with challenged when samples have been recruited from a geo- an additive effect) or qualitative (e.g., a locus with a graphic region in which the residents are highly admixed dominant or recessive effect) variables that can be attrib- [9]. Therefore, AIM-determined ethnicity, rather than self- uted to DNA variation. Regarding the relationship be- reported ethnicity, is recommended for genetic studies; tween SNP and GE markers, the regulation of GE may admixture mapping using AIMs is especially suitable for be unrelated to DNA sequences, as with epigenetic highly admixed populations [17]. mechanisms [25], or it may be associated with SNPs, as Short tandem repeat polymorphisms (STRPs) and sin- with expression quantitative trait locus (eQTL) [22,26-28]; gle nucleotide polymorphisms (SNPs) are the most fre- nevertheless, even in the case of an eQTL, only a limited quently used genetic markers for AIMs, and each has its proportion of GE variation can be explained by the own strengths [1,9,18]. Genotyping platforms for eQTL. Therefore, in this study, we proposed that integra- genome-wide STRP and SNP markers have been estab- tive analysis of these two types of genetic markers (SNP lished but are not specific to AIMs, and this significantly and GE markers) may provide a more promising alterna- increases the average genotyping cost for AIMs. This ur- tive for construction of a high-accuracy and cost- gent need motivates the development of AIM panels that beneficial AIM panel than analysis of SNP or GE markers contain as much ancestral information as possible, while alone. To the best of our knowledge, few (if any) studies keeping the number of AIMs as low as possible. AIM in the literature have integrated GE markers with SNP panels with a small to moderate number of genetic mar- markers to aid in subdividing samples from different eth- kers have been constructed to discern samples from dif- nic populations. In this study, we investigated the per- ferent populations, including Europeans [12], East formance of SNP and GE markers in population genetics Asians [11], African Americans [17], and European and evaluated the plausibility of sample classification Americans [3,19], and different continents [2,4,5,9,16] at using the combined resources of SNP and GE data. a more reasonable price. Although a small to moderate number of SNPs or STRPs could provide promising discriminative power to Methods distinguish a large ethnic discrepancy (e.g., subdivision A flowchart is provided to summarize the materials and of samples from Asia, Africa, and Europe), it becomes analysis flow in this study (Figure 1). Yang et al. BMC Genomics 2012, 13:346 Page 3 of 12 http://www.biomedcentral.com/1471-2164/13/346

Samples and genotyping/gene expression experiments for the human genome [35,36]. Procedures for quantifi- In this study, we analyzed SNP and GE data in 210 inde- cation and normalization of GE levels are described in pendent samples from the International HapMap II Pro- Supporting Online Materials [35]. The normalized gene ject [29-32]. The samples encompassed 30 African expression data are publicly available in the Gene Ex- marriage pairs from Yoruba in Ibadan (YRI), 30 Cauca- pression Omnibus (GEO) database (http://www.ncbi. sian marriage pairs of European descent resided in Utah nlm.nih.gov/geo/) (Series accession number GSE6536). (CEU), and 90 Asian persons including 45 Han Chinese Annotation of SNP data from the Affymetrix 500 K and persons in Beijing (CHB) and 45 Japanese persons in Array 6.0 was derived from the NetAffx annotation up- Tokyo (JPT). All 210 samples were genotyped using date 30 (version: dbSNP Build 128), which is available both the Affymetrix Human Mapping 500 K and Array on the Affymetrix website (http://www.affymetrix.com/). 6.0 (Affymetrix Inc., Santa Clara, CA, USA). The two Annotation of GE probes was derived from the GEO an- SNP gene chips provided genotype data for 500,568 notation (accession number GPL2507; version: UCSC SNPs and 906,600 SNPs, respectively, on 23 pairs of HG 18), which is available in the GEO database. chromosomes for each individual. The Bayesian Robust Linear Model with Mahalanobis Distance Classifier Statistical methods and data analysis (BRLMM) [33] or Birdseed [34] were used for SNP This study classified samples in each of eight combina- genotype call analysis of data from the Affymetrix tions of the four ethnic populations: (1) “four popula- Human Mapping 500 K and Array 6.0, respectively. The tions”–CHB, JPT, CEU, and YRI; (2) “three genotype data are publicly available (http://hapmap.ncbi. populations”–Asian (CHB + JPT), Caucasian (CEU), nlm.nih.gov/). In addition, GE levels of the 210 HapMap and African (YRI); (3) “CHB and JPT”; (4) “CHB and samples were measured using Illumina’s Sentrix Human- YRI”; (5) “CHB and CEU”; (6) “JPT and YRI”; (7) “JPT 6 Expression BeadChip (Illumina Inc., San Diego, CA, and CEU”; and (8) “YRI and CEU”. Quality control of USA). Each bead chip provided 47,289 transcript probes SNP and GE markers was performed together with

• Data preparation: Study samples - 210 independent samples from the International HapMap II Project. The samples encompassed 30 African marriage pairs from Yoruba in Ibadan (YRI), 30 Caucasian marriage pairs European descent resided in Utah (CEU), 45 Han Chinese persons in Beijing (CHB), and 45 Japanese persons in Tokyo (JPT).

Genetic markers - All 210 samples were genotyped with both the Affymetrix Human Mapping 500K (500,568 SNPs) and Array 6.0 (906,600 SNPs), and gene expression (GE) were measured using Illumina’s Sentrix Human-6 Expression BeadChip (47,289 transcript probes).

• Eight combinations from four ethnic populations: (1) four populations – CHB, JPT, CEU, and YRI, (2) three populations – Asian (CHB+JPT), Caucasian (CEU), and African (YRI), (3) CHB and JPT, (4) CHB and YRI, (5) CHB and CEU, (6) JPT and YRI, (7) JPT and CEU, and (8) YRI and CEU.

• Quality Control (performed within each of the eight combinations): SNP quality control - SNPs were removed if: (a) genotyping call rate < 0.90, (b) minor allele frequency = 0, (c) Hardy-Weinberg disequilibrium (pFDR < 0.05), or (d) non-autosomal SNPs.

GE quality control - GE transcript probes were removed for: (a) non-RefSeq probes, (b) no- autosomal probes, or (c) probes without gene information.

• Ten-fold cross-validation classification analysis (performed within each of the eight combinations) Step 1- Randomly partition samples in each population into ten subsets. Step 4- The first three Step 2- Apply a forward selection procedure into an FDA to identify key steps were SNP and/or GE markers with the highest training accuracy based on repeated until training samples (in 9 of ten subsets). The model is regarded as a each of 10 candidate model. subsets has been analyzed Step 3- Calculate testing accuracy for the candidate model selected in as a testing Step 2 based on the testing samples (in the remaining subset). dataset.

Step 5- Pick up the model with highest testing accuracy or highest occurrence frequency among 10 candidate models.

Figure 1 Flowchart of this study. A summary of data preparation, study combinations of ethnic populations, quality control of genetic markers, and classification analysis are shown. Yang et al. BMC Genomics 2012, 13:346 Page 4 of 12 http://www.biomedcentral.com/1471-2164/13/346

analysis for each of the eight combinations of ethnic the remaining subset) and the testing accuracy then was populations. A poor-quality SNP was removed if its calculated. Fourth, the first three steps were repeated genotype call rate was lower than 0.9, its minor allele until each of the 10 subsets had been analyzed as a test- frequency was 0, or if Hardy-Weinberg equilibrium ing dataset, resulting in 10 classification candidate mod- (HWE) was violated, where departure from HWE was els. Finally, among the 10 classification models, the one defined as a p-value that was adjusted by a false discov- with the highest testing accuracy or highest cross- ery rate procedure [37] and that was lower than 0.05 in validation consistency was selected as the best classifica- a permutation-based HWE test [38]. Finally, SNPs on tion model. The aforementioned classification analysis sex chromosomes were removed. Quality control of GE was performed for each of the eight ethnic population markers removed 21,198 non-RefSeq probes (11,622 combinations using only GE markers (“GE-only ana- probes from UniGene and 9,576 probes from Gnomon). lysis”), only SNP markers on Affymetrix 500 K (“500 K- A total of 854 probes were removed from sex chromo- only analysis”), only SNP markers on Affymetrix Array somes, and 6,430 probes without gene information also 6.0 (“Array6.0-only analysis”), both GE markers and were removed. The total numbers of SNP and GE mar- SNPs on Affymetrix 500 K (“500 K + GE analysis”), and kers that remained after SNP and GE quality control are both GE markers and SNPs on Affymetrix Array6.0 shown (Additional file 1: Table S1). (“Array6.0 + GE analysis”). The analysis was performed To explore the genetic discrepancy and sample subdi- using our developed software, BIASLESS (Biomarkers visions among the four HapMap II populations, an Identification and Samples Subdivision), which can be exploratory unsupervised analysis was performed, fol- downloaded for free at http://www.stat.sinica.edu.tw/ lowed by an intensive supervised classification analysis. hsinchou/genetics/prediction/BIASLESS.htm. Both analyses used genome-wide SNP and GE markers. First, to understand whether genome-wide SNP and GE Results markers provide sufficient information for subdividing Unsupervised classification analysis using genome-wide samples in HapMap II populations, a preliminary un- SNP or GE markers supervised classification analysis was performed by The genome-wide SNP-based classification analysis drawing allele frequency biplots and gene expression clearly separated samples from ethnic populations using biplots based on genome-wide SNP and GE markers, re- allele frequency profiling of genome-wide SNPs interro- spectively. The analysis was performed using ALOHA gated on Affymetrix 500 K (Additional file 2: Figure S1) software [21], which is available on the ALOHA website or Affymetrix Array 6.0 (Figure 2). Samples from CHB, (http://www.stat.sinica.edu.tw/hsinchou/genetics/aloha/ JPT, CEU, and YRI were classified into three genetically ALOHA.htm). Afterward, intensive supervised classifica- distant ethnic groups, African, Caucasian, and Asian. tion analyses were performed to identify key SNP and/or The Asian group consisted of two genetically close GE markers to study subdivisions of samples from the populations (CHB and JPT) (Additional file 2: Figure HapMap II populations. A five-step discriminant analysis S1A and Figure 2A) that were separated further by was developed to identify key SNP and/or GE markers within-group analysis of Asian populations (Additional with the highest prediction accuracy for the separation file 2: Figure S1B and Figure 2B). All two-population of samples from different populations as follows. First, analyses accurately separated samples from different samples in each study population were randomly parti- populations (Additional file 2: Figures S1B–G and tioned into 10 subsets for cross-validation. Second, a Figure 2B–G). In general, the results of the Affymetrix flexible discriminant analysis (FDA) using optimal scor- 500 K and Affymetrix Array 6.0 analyses were very simi- ing [39] was applied to training sets (i.e., samples in nine lar (Additional file 2: Figure S1 and Figure 2). In contrast of 10 subsets). Given the existing markers in a classifica- to the genome-wide SNP-based analysis, a non- tion model, new SNP or GE markers with the maximum negligible proportion of samples could not be separated increment of training accuracy were added sequentially correctly with genome-wide GE markers (Figure 3). to the model. The marker with the minimum SSW/SSB These results were found not only for samples from the was selected if more than one marker or marker set had four populations (Figure 3A) but also for samples from the same training accuracy, where SSW and SSB indicate any two populations (Figure 3B –G). the within-population and between-population sum of squares for genotypic values or gene expression levels, Supervised classification analysis by selecting key respectively. The procedure continued until the training predictive SNP and/or GE markers from genome-wide accuracy reached 1.0 or its increment was less than a SNP and GE markers threshold such as 0.001 in this study. Third, genetic Ten classification models were established in each of the markers with the highest training accuracy were used to GE-only, 500 K-only, Array6.0-only, 500 K + GE, and classify individuals in the testing dataset (i.e., samples in Array6.0 + GE analyses, which independently identified a Yang et al. BMC Genomics 2012, 13:346 Page 5 of 12 http://www.biomedcentral.com/1471-2164/13/346

Figure 2 Classification of HapMap samples using whole-genome SNPs of the Affymetrix Array 6.0 set. All samples were superimposed onto a two-dimensional plane in an AF biplot. (A) CHB, JPT, CEU, and YRI, (B) CHB and JPT, (C) CHB and YRI, (D) CHB and CEU, (E) JPT and YRI, (F) JPT and CEU, and (G) YRI and CEU. Red line with a B symbol indicates CHB samples; blue line with a J symbol indicates JPT samples; gray line with a Y symbol indicates YRI samples; green line with an E symbol indicates CEU samples. small number of key predictive SNP and/or GE markers together with SNP markers for the classification of sam- to classify samples for each of the eight ethnic popula- ples from “CHB and JPT” and “four populations”, the tion combinations that we studied. The distributions of average testing accuracies increased to 0.89 and 0.92, re- testing accuracy and number of predictive markers are spectively, in the 500 K + GE analysis and to 0.92 and presented in box-whisker plots (Figure 4). The majority 0.91 in the Array6.0 + GE analysis. In comparison with of the classification analyses produced an average testing the integrative analyses of SNP and GE markers, the GE- accuracy, calculated over 10 cross-validation datasets, only analysis presented a larger variation of testing ac- greater than or close to 90%, with the exception of two curacy and required about twice the number of markers SNP-only analyses; the 500 K-only and Array6.0-only to accurately classify samples from “JPT and YRI”, “YRI analyses had relatively low testing accuracies for the and CEU”, “three populations” and “four populations”. classification of samples from two closely related ethnic We established the best classification models in the populations, CHB and JPT. In the 500 K-only analysis, GE-only, 500 K-only, Array6.0-only, 500 K + GE, and the average testing accuracies were only 0.53 and 0.70 Array6.0 + GE analyses for each of the eight ethnic popu- for classifying samples from “CHB and JPT” and from lation combinations we studied (Additional file 3: Table S2). “four populations”, respectively. Similarly, in the The best models of the integrative analysis of SNP Array6.0-only analysis, the average testing accuracies and GE markers attained a testing accuracy of 100% were only 0.70 and 0.79 for the classification of samples in all eight population combinations that we studied. from “CHB and JPT” and “four populations”, respect- Only a few markers were needed for good sample ively. However, if GE markers also were integrated classification. In the 500 K + GE analysis, the number Yang et al. BMC Genomics 2012, 13:346 Page 6 of 12 http://www.biomedcentral.com/1471-2164/13/346

Figure 3 Classification of HapMap samples using whole-genome gene expression. All samples were superimposed onto a two-dimensional plane in a GE biplot. (A) CHB, JPT, CEU, and YRI, (B) CHB and JPT, (C) CHB and YRI, (D) CHB and CEU, (E) JPT and YRI, (F) JPT and CEU, and (G) YRI and CEU. Red line with a B symbol indicates CHB samples; blue line with a J symbol indicates JPT samples; gray line with a Y symbol indicates YRI samples; green line with an E symbol indicates CEU samples. of predictive markers was five for “four populations”, (C/T) for “JPT and CEU”, and rs735480 (C/T) for “YRI three for “three populations”,threefor“CHB and and CEU” (Figure 5 and Additional file 4: Table S3). One JPT”,twofor“JPT and CEU”, and one for the or two SNPs already provided sufficient information for remaining population combinations; in the Array6.0 + classifying ethnically distant samples, but this was not GE analysis, the number of predictive markers required in the case for classifying samples from ethnically close the best model was five for “four populations”, three for populations such as CHB and JPT. In the latter situation, “three populations”,threefor“CHB and JPT”,andonefor the integrative analyses of SNP and GE markers indeed the remaining population combinations. provided much richer information than the SNP-only Notably, the best models in the 500 K + GE and analyses. The best model of the Affy500K + GE analysis, Array6.0 + GE analyses only required 1 or 2 SNPs to cor- which was composed of GI_4506928-S (on SH3GL1), rectly classify samples from genetically distant popula- GI_37540521-S (on OR13C5), and rs11986045, signifi- tions, including “CHB and YRI”, “CHB and CEU”, “JPT cantly improved the testing accuracy of the Affy500K and YRI”, “JPT and CEU”, and “YRI and CEU”.The analysis when classifying samples from CHB and JPT; results show the existence of ancestry informative or the testing accuracy of the best model increased from population-specific SNPs; namely, the SNP-only analysis 0.89 to 1 (Additional file 3: Table S2). The best model of already provided key information, and GE markers the Array6.0 + GE analysis, which was composed of were redundant in this situation, as follows: SNP rs11051 GI_4506928-S (on SH3GL1), GI_37540521-S (on (G/A) for “CHB and YRI”, rs489095 (T/C) for “CHB and OR13C5), and rs10485803, significantly improved the CEU”, rs6546753 (G/T) for “JPT and YRI”, rs6437783 testing accuracy of the Array6.0 analysis; the testing Yang et al. BMC Genomics 2012, 13:346 Page 7 of 12 http://www.biomedcentral.com/1471-2164/13/346

Figure 4 Test accuracy and number of genetic markers in the selected classification models in the 10-fold cross-validation procedure. In each of the eight ethnic population combinations, the testing accuracy and the number of genetic markers in the selected classification model in each of the 10-fold cross-validations are shown in the top and bottom panels, respectively. The top and bottom panels show box-whisker plots of the testing accuracy and the number of key predictive markers based on GE-only (green), 500 K-only (purple), Array6.0-only (red), 500 K + GE (orange), and Array6.0 + GE (yellow). accuracy of the best model increased from 0.89 to 1 org/) (Additional file 5: Figure S2). Programs, test exam- (Additional file 3: Table S2). ples, and the user guide are available at the BIASLESS All samples from the four study populations could also website (http://www.stat.sinica.edu.tw/hsinchou/genet- be classified correctly with the best models of the inte- ics/prediction/BIASLESS.htm). Before using BIASLESS grative analyses of SNP and GE markers. In comparison software, users are encouraged to read the user guide with the SNP-only and GE-only analyses, the best inte- for software installation, initialization, working director- grative model of SNP and GE markers used only four ies, functions, operation, and format of input/output SNP markers and one GE marker to perfectly classify data. BIASLESS is structured by the following five main samples from the four populations that we studied components: (Additional file 3: Table S2). The 500 K + GE analysis prioritized rs2736306, GI_41281459-S (on CENTB1), (1) Input/Output settings: Users can choose between rs12063564, rs2725379 (on PURG), and rs2250072 as two types of data input formats (for markers). Users the key predictive markers; the Array6.0 + GE analysis can click the browse buttons to specify the data identified rs6546753, GI_4506928-S (on SH3GL1), input directory (for markers), data input directory rs1986420, rs6560625 and rs12632185 as the key pre- (for a trait), and result output directory. All results dictive markers in the best classification model. will be saved automatically in the user-specified output directories. Users should fill in a notation or BIASLESS software code to indicate any missing values in their marker The developed classification algorithm is packaged into- data and trait data. BIASLESS software with a user friendly interface pro- (2) Cross-validation: (a) Seed for cross-validations: grammed in language R and R-GUI (http://www.r-project. users can use a random seed or provide a fixed seed Yang et al. BMC Genomics 2012, 13:346 Page 8 of 12 http://www.biomedcentral.com/1471-2164/13/346

Figure 5 Genotype frequencies of key predictive SNPs in the best model for the classification of samples from ethnically distant populations. Bar charts show the genotype frequencies of the key predictive SNPs in the 500 K-only (first field), Array6.0-only (second field), 500 K + GE (third field), and Array6.0 + GE (fourth field) analyses for classification of samples from “CHB and YRI” (first panel), “CHB and CEU” (second panel), “JPT and YRI” (third panel), “JPT and CEU” (fourth panel), and “YRI and CEU” (fifth panel). Legends for the two scenarios, a single key predictive SNP or two key predictive SNPs, are shown. For the scenario of a single key predictive SNP, the proportion of blue, green, and brown reflects the genotype frequencies of AA, AB and BB of the SNP, respectively. The proportions range from 0 to 1 and sum to 1. For the scenario of two key predictive SNPs, nine colors are used to indicate nine possible genotype combinations for two SNPs. Again, the proportion of colored bars indicates the frequencies of genotype combinations, and the proportions range from 0 to 1 and sum to 1.

(a real value between −65535 and 65535) for value of the accuracy) by cross-validation datasets. partitioning samples into the training dataset and (b) Parallel coordinates plot: the number of markers testing dataset. (b) Fold size: users can select the selected, training accuracy, and testing accuracy number of cross-validations (e.g., “10” denotes a over cross-validation datasets are presented in a 10-fold cross-validation). parallel coordinates plot. (c) Multidimensional (3) Marker selection (Stop if any of the following three scaling plot: the study samples and selected markers procedures stop): (a) Continue until the number of in the final classification model with the highest markers in the model increases to a specified value. testing accuracy are presented in a (b) Continue until the training/testing accuracy multidimensional scaling plot. (d) Stacked-bar/box- increases to a specified value. (c) Continue until the whisker plot: genotypic distributions (AA call – increment of training/testing accuracy is reduced to blue color, AB call – green color, and BB call – a specified value. brown color) of SNPs and expression distributions (4) Graphical output: (a) Overlay line graph: the curves (pink color) of genes selected in the classification of training accuracy and testing accuracy are model are displayed. (e) Sample misclassification overlaid in a plot (the horizontal axis is the number plot: states of correct classification or of markers in the model and the vertical axis is the misclassification in training and testing samples in Yang et al. BMC Genomics 2012, 13:346 Page 9 of 12 http://www.biomedcentral.com/1471-2164/13/346

each of 10-fold cross-validations are shown. Although GE markers, which are more variable com- Misclassification proportion across all cross- pared to SNPs, may change by population-specific food validations is shown for each individual in training preferences or environmental exposures, previous stud- and testing samples. Misclassification proportions ies did disclose the evidences of the genetic basis of glo- across all training samples and all testing samples bal GE [28,36,43,44]. Moreover, this study analyzed the also are shown respectively for each step of marker GE data from the total RNA samples extracted from selection in each of 10-fold cross-validations. (f) Epstein Barr virus (EBV)-transformed lymphoblastoid Marker impact plot: states of markers selected (red cell lines of study individuals [35]. The GE variation of color) or unselected (blue color) in training samples lymphoblastoid cell lines, which are important materials in each of 10-fold cross-validations are shown. The for dissecting genetic basis of GE variation of human figure also provides the selection times of the populations [23,26,27,35,36,45], reflects a substantially identified markers across all cross-validations and higher proportion of genetic effect compared to the ef- which step a marker is selected in a forward fect of food preferences or environmental exposures variable selection procedure. Graphical outputs in [46]. The finding of genetics of global GE can also be the test example of BIASLESS are provided supported by previous studies. An important genomic (Additional file 6: Figure S3). study of global GE variation validated the genetic contri- bution of the discrepancy of GE between Asian and Cau- Discussion casian samples, not an artifact due to life styles. This The concept conveyed by the proposed integrative ana- study showed that 24 Han Chinese residing in Los lysis of SNP and GE markers also is applicable to pre- Angeles had much more similar GE profiles to the 82 dicting disease status in biomedical studies and drug HapMap CHB + JPT samples than to the 60 HapMap response in studies. Genome-wide CEU samples [44]. The other important genomic study association studies that identify disease susceptibility of GE also uncovered the genetic contribution on global genes using a large number of SNPs suffer from the patterns of GE after adjusting potential confounding fac- problem of missing heritability and are limited in tors that may influence GE. This study analyzed GE data explaining the etiology of complex diseases [40-42]. of 270 individuals from four HapMap II populations and However, with the aid of GE, it is possible to increase found GE variation differentiated in population compari- the proportion of explained genetic variations which sons in agreement with earlier studies [36]. then elevates prediction accuracy. In view of the poten- The GE variation may also be influenced by the type tial importance of integrative analysis of SNP and GE of biological specimen, attributes related to the time and markers in the population genetics, forensic sciences, other circumstances of taking the biological samples, or and medical genetics, we developed BIASLESS software. GE microarray platform. This study provides a proof-of- BIASLESS, which is useful for selecting important pre- concept method for construction of AIM panels by inte- dictive marker sets from large numbers of biomarkers grating SNP and GE markers but the current results are for inferences of ethnic groups, disease groups, and drug still limited by the use of single cell type (lymphoblastoid response groups, is a free, publicly available, and user- cell lines), fixed time/circumstances of taking the bio- friendly analysis tool. logical samples, and single microarray platform (Illumi- The method and software introduced in this paper can na’s Sentrix Human-6 Expression BeadChip). More be used to construct high-accuracy and cost-beneficial investigations should be carried out to understand the AIM panels. Nevertheless, rather than the construction proportions of the identified AIMs specific to the cur- of AIM panels, the main focus of this paper is to intro- rently used conditions or transferable to more general duce an integrative analysis of SNP and GE markers for conditions. For practical applications, we also plan to in- the discrimination of samples from various populations, tegrate SNP and GE variation from global genomic stud- especially for closely related ancestral lineages. We don’t ies and construct larger reference database for intend for the AIMs identified in this study to take the normalizing GE data. SNP and GE markers will be inte- place of the AIMs found earlier for CEU, CHB, JPT, and grated to identify AIMs and establish robust discrimin- YRI populations. Some of the AIMs identified in this ant models using BIASLESS software. Biological study may be limited by the small to moderate number specimen from a tested individual are collected and used of samples in the HapMap II project; therefore, the gen- to genotype/measure the identified and confirmed AIMs. erality of the identified AIMs should be further exam- Finally, SNP genotypes and GE levels of the tested indi- ined by using more independent samples and confirmed vidual are plugged into the discriminant models to de- by biological verifications such as real-time reverse- termine the correct ethnic group. transcription polymerase chain reaction before the AIMs Regarding the supervised classification method, two applying to practical studies. points are important to discuss. First, we modified the Yang et al. BMC Genomics 2012, 13:346 Page 10 of 12 http://www.biomedcentral.com/1471-2164/13/346

efficient and broadly used FDA algorithm and integrated Conclusion forward variable selection and cross-validation proce- In conclusion, we recommend SNP-only analysis for dures with FDA to select key predictive markers from sample subdivision when the study samples come from enormous numbers of SNP and GE markers, and we ethnically distant populations such as Asian (CHB + JPT), then built accurate classification models for sample sub- African (YRI), and Caucasian (CEU); ancestry inform- division. Our supervised classification procedure pro- ative or population-specific SNPs provide sufficient in- vides multiple candidate models (e.g., 10 in a 10-fold formation for sample classification in this situation, but cross-validation). Choosing a model with the highest population-specific SNPs may not be available or may testing accuracy is recommended but should not be the be very hard to identify in ethnically close populations only criterion for model selection. Other optimal criteria such as Chinese (CHB) and Japanese (JPT). Quantitative and domain knowledge may need to be considered to GE data, which are more variable than qualitative SNP determine the best model that satisfying both statistical data, are useful for sample classification after properly properties and biological relevance. For example, the removing noisy GE markers. Note, however, that the cross-validation consistency of a model among all candi- GE-only analysis is still limited by slightly fluctuating date models may be used simultaneously, or accuracies and a larger number of predictive knowledge, biological relevance, and quality evaluation markers even when the samples are from ethnically dis- of genetic markers may also be integrated to assist in se- tant populations. However, GE data do reveal important lection of the final classification model. Second, there is classification information supplemental to SNP data. a very rich body of literature in the field of supervised Using an integration of SNP and GE markers, we estab- classification, including support vector mechanisms [47] lished classification models with a reduced number of and classification trees [48]. Different algorithms have markers to accurately assign samples to the correct eth- pros and cons in different study scenarios and data nic populations. Importantly, the genotyping cost is types. We are adding various classification algorithms to reduced because the number of required markers in an further enrich the BIASLESS software. AIM panel is significantly diminished after inclusion of This study analyzed the data in the HapMap II Project, ancestry informative GE markers. which contains only four populations, rather than the HapMap III Project, which contains 11 populations be- Availability and requirements cause GE data for the majority of samples in the Hap- The BIASLESS software, test examples, and user guide Map III Project are not available. However, the proposed can be downloaded from the BIASLESS website: http:// method and software can be applied in general to con- www.stat.sinica.edu.tw/hsinchou/genetics/prediction/ struct AIM panels for additional populations. The SNP BIASLESS.htm. data in this study came from two genotyping platforms: Project name: Biomarker identification and sample Affymetrix 500 K and Array6.0 SNP chips. The results of subdivision. the sample classification were similar, although the num- Project home page: http://www.stat.sinica.edu.tw/ – ber of SNPs interrogated on Affymetrix 500 K (~4 4.9 hsinchou/genetics/prediction/BIASLESS.htm. W hundred thousand SNPs after quality control) was only Operating system: MS Windows . – about half the number in Array 6.0 (~7 8.7 hundred Programming language: Language R and R-GUI. thousand SNPs after quality control), suggesting that the Other requirements: No. ancestral information in SNPs identified with Affymetrix Any restrictions to use by non-academics: On request Array6.0 is not more informative than that in SNPs and citation. identified with Affymetrix 500 K, with regard to the clas- sification of samples in the HapMap II Project. Recently, Additional files whole-genome sequencing technology, in comparison with SNP microarrays, has become more common and Additional file 1: Table S1. The total numbers of SNP and GE markers has promoted the identification of new common SNPs remaining in the analysis after quality control. This table summarizes the and rare variants. Novel population-specific or ancestry- number of SNP and GE markers in each analysis of the eight combinations of ethnic populations. After quality control, 18,807 GE informative variants may be identified, and more eQTLs markers, 403,067 – 486,092 SNPs in the Affymetrix Human Mapping 500 K that contribute to genetic variation of ancestry inform- set, and 700,682 – 868,434 SNPs in the Affymetrix Array 6.0 set remained. ative GE may become available. It will be interesting to The number of SNPs in the intersection of the Affymetrix Human – investigate if the bottleneck in a SNP-only analysis for Mapping 500 K and Array 6.0 sets was 385,493 469,057. Additional file 2: Figure S1. Classification of HapMap samples using discerning samples from closely related populations can whole-genome SNPs of Affymetrix Human Mapping 500 K set. All be overcome using highly dense common SNPs and rare samples were superimposed onto a two-dimensional plane in an allele variants from massive parallel sequencing in the 1000 frequency (AF) biplot. (A) CHB, JPT, CEU, and YRI, (B) CHB and JPT, (C) CHB and YRI, (D) CHB and CEU, (E) JPT and YRI, (F) JPT and CEU, and (G) Genomes Project [49]. Yang et al. BMC Genomics 2012, 13:346 Page 11 of 12 http://www.biomedcentral.com/1471-2164/13/346

YRI and CEU. Red line with a B symbol indicates CHB samples; blue line 2. Halder I, Shriver M, Thomas M, Fernandez JR, Frudakis T: A panel of with a J symbol indicates JPT samples; gray line with a Y symbol ancestry informative markers for estimating individual biogeographical indicates YRI samples; green line with an E symbol indicates CEU ancestry and admixture from four continents: utility and applications. – samples. Hum Mutat 2008, 29(5):648 658. 3. Seldin MF, Price AL: Application of ancestry informative markers to Additional file 3: Table S2. The best classification models. This table association studies in European Americans. PLoS Genet 2008, 4(1):e5. summarizes the key predictive markers, testing accuracy, cross-validation 4. Nassir R, Kosoy R, Tian C, White PA, Butler LM, Silva G, Kittles R, Alarcon- consistency (CVC), and number of genetic markers in the best Riquelme ME, Gregersen PK, Belmont JW, et al: An ancestry informative classification models in the analysis of each of the eight combinations of marker set for determining continental origin: validation and extension ethnic populations. using human genome diversity panels. BMC Genet 2009, 10:39. Additional file 4: Table S3. Genotype frequencies of key predictive 5. Kosoy R, Nassir R, Tian C, White PA, Butler LM, Silva G, Kittles R, SNPs in the best model for the classification of samples from ethnically Alarcon-Riquelme ME, Gregersen PK, Belmont JW, et al: Ancestry distant populations. This table summarizes genotype frequencies of the informative marker sets for determining continental origin and key predictive SNPs in the 500 K-only (first field), Array6.0-only (second admixture proportions in common populations in America. Hum field), 500 K + GE (third field) and Array6.0 + GE (fourth field) analyses for Mutat 2009, 30(1):69–78. the classification of samples from “CHB and YRI” (first panel), “CHB and 6. Myles S, Stoneking M, Timpson N: An assessment of the portability of CEU” (second panel), “JPT and YRI” (third panel), “JPT and CEU” (fourth ancestry informative markers between human populations. BMC Med panel), and “YRI and CEU” (fifth panel). The genotype frequencies also are Genomics 2009, 2:45. shown in Figure 5. 7. Paschou P, Lewis J, Javed A, Drineas P: Ancestry informative markers for Additional file 5: Figure S2. Interface of BIASLESS software. BIASLESS fine-scale individual assignment to worldwide populations. J Med Genet software programmed in R and R-GUI is a user-friendly tool for the 2010, 47(12):835–847. identification of key predictive markers to classify samples from different 8. Drineas P, Lewis J, Paschou P: Inferring geographic coordinates of origin populations/groups. for Europeans using small panels of ancestry informative markers. PLoS ONE 2010, 5(8):e11892. Additional file 6: Figure S3. Graphical outputs in BIASLESS software. 9. Londin ER, Keller MA, Maista C, Smith G, Mamounas LA, Zhang R, Madore BIASLESS software outputs six graphs including (A) overlay line graph, (B) SJ, Gwinn K, Corriveau RA: CoAIMs: a cost-effective panel of ancestry parallel coordinates plot, (C) multidimensional scaling plot, (D) stacked- informative markers for determining continental origins. PLoS ONE 2010, bar/box-whisker plot, (E) sample misclassification plot, and (F) marker 5(10):e13443. impact plot from the analysis of a test example (Detailed explanations to 10. Kidd JR, Friedlaender FR, Speed WC, Pakstis AJ, De La Vega FM, Kidd KK: these graphs can be seen in the User Guide of BIASLESS, which can be Analyses of a set of 128 ancestry informative single-nucleotide downloaded at http://www.stat.sinica.edu.tw/hsinchou/genetics/ polymorphisms in a global set of 119 population samples. Investig Genet prediction/BIASLESS.htm). 2011, 2(1):1. 11. Tian C, Kosoy R, Lee A, Ransom M, Belmont JW, Gregersen PK, Seldin MF: Abbreviations Analysis of East Asia genetic substructure using genome-wide SNP AF: Allele frequency; AIM: Ancestry informative marker; CEU: CEPH Utah arrays. PLoS ONE 2008, 3(12):e3862. residents; CHB: Han Chinese in Beijing; CVC: Cross-validation consistency; 12. Tian C, Plenge RM, Ransom M, Lee A, Villoslada P, Selmi C, Klareskog L, eQTL: Expression quantitative trait loci; FDA: Flexible discriminant analysis; Pulver AE, Qi L, Gregersen PK, et al: Analysis and application of GE: Gene expression; JPT: Japanese in Tokyo; BIASLESS: Biomarkers European genetic substructure using 300 K SNP information. PLoS Identification and Samples Subdivision; SNP: Single nucleotide Genet 2008, 4(1):e4. polymorphism; SSB: Sum of squares between populations; SSW: Sum of 13. Serre D, Paabo S: Evidence for gradients of human genetic diversity squares within population; STRP: Short tandem repeat polymorphism; within and among continents. Genome Res 2004, 14(9):1679–1685. YRI: Yoruba in Ibadan. 14. Pritchard JK, Stephens M, Donnelly P: Inference of population structure using multilocus genotype data. Genetics 2000, 155(2):945–959. Competing interests 15. Pomeroy R, Duncan G, Sunar-Reeder B, Ortenberg E, Ketchum M, Wasiluk H, The authors declare that they have no competing interests. Reeder D: A low-cost, high-throughput, automated single nucleotide polymorphism assay for forensic human DNA applications. AnBio 2009, Authors’ contributions 395(1):61–67. HCY conceived the study, developed the statistical methods, and prepared 16. Phillips C, Salas A, Sanchez JJ, Fondevila M, Gomez-Tato A, Alvarez-Dios J, the manuscript and software user guide. PLW and CWL developed the Calaza M, Casares-de-Cal M, Ballard D, Lareu MV, et al: Inferring ancestral BIASLESS software and analyzed the data with HCY. CHC and CHC origin using a single multiplex assay of ancestry-informative marker contributed to the discussion and reviewed the manuscript. All authors have SNPs. Forensic Sci Int Genet 2007, 1(3–4):273–280. read and approved the final manuscript. 17. Tian C, Hinds DA, Shigeta R, Kittles R, Ballinger DG, Seldin MF: A genomewide single-nucleotide-polymorphism panel with high ancestry Acknowledgements information for African American admixture mapping. Am J Hum Genet This work was supported by the Career Development Award of Academia 2006, 79(4):640–649. Sinica (AS-100-CDA-M03) and by research grants from National Science 18. Kayser M, de Knijff P: Improving human forensics through advances in Council of Taiwan (NSC 100-2314-B-001-005-MY3 and NSC 100-2319-B-010- genetics, genomics and molecular . Nat Rev Genet 2011, 12(3):179–192. 002). We also thank two anonymous reviewers for their instructive 19. Price AL, Butler J, Patterson N, Capelli C, Pascali VL, Scarnicci F, Ruiz-Linares comments that helped in preparing our revision. A, Groop L, Saetta AA, Korkolopoulou P, et al: Discerning the ancestry of European Americans in genetic association studies. PLoS Genet 2008, Author details 4(1):e236. 1Institute of Statistical Science, Academia Sinica, 128, Academia Road, Section 20. Zhou N, Wang L: Effective selection of informative SNPs and classification 2 Nankang, Taipei 115, Taiwan. 2Institute of Biomedical Sciences, Academia on the HapMap genotype data. BMC Bioinformatics 2007, 8:484. Sinica, Taipei, Taiwan. 21. Yang HC, Lin HC, Huang MC, Li LH, Pan WH, Wu JY, Chen YT: A new analysis tool for individual-level allele frequency for genomic studies. Received: 18 February 2012 Accepted: 16 July 2012 BMC Genomics 2010, 11(1):415. Published: 28 July 2012 22. Cheung VG, Spielman RS: The genetics of variation in gene expression. Nat Genet 2002, 32(Suppl):522–525. References 23. Cheung VG, Conlin LK, Weber TM, Arcaro M, Jen KY, Morley M, Spielman RS: 1. Rosenberg NA, Li LM, Ward R, Pritchard JK: Informativeness of genetic Natural variation in human gene expression assessed in lymphoblastoid markers for inference of ancestry. Am J Hum Genet 2003, 73(6):1402–1422. cells. Nat Genet 2003, 33(3):422–425. Yang et al. BMC Genomics 2012, 13:346 Page 12 of 12 http://www.biomedcentral.com/1471-2164/13/346

24. Storey JD, Madeoy J, Strout JL, Wurfel M, Ronald J, Akey JM: Gene- expression variation within and among human populations. Am J Hum Genet 2007, 80(3):502–509. 25. Portela A, Esteller M: Epigenetic modifications and human disease. Nat Biotechnol 2010, 28(10):1057–1068. 26. Morley M, Molony CM, Weber TM, Devlin JL, Ewens KG, Spielman RS, Cheung VG: Genetic analysis of genome-wide variation in human gene expression. Nature 2004, 430(7001):743–747. 27. Stranger BE, Forrest MS, Clark AG, Minichiello MJ, Deutsch S, Lyle R, Hunt S, Kahl B, Antonarakis SE, Tavare S, et al: Genome-wide associations of gene expression variation in humans. PLoS Genet 2005, 1(6):e78. 28. Rockman MV, Kruglyak L: Genetics of global gene expression. Nat Rev Genet 2006, 7(11):862–872. 29. The International HapMap Consortium: The International HapMap Project. Nature 2003, 426(6968):789–796. 30. The International HapMap Consortium: Integrating ethics and science in the international HapMap project. Nat Rev Genet 2004, 5(6):467–475. 31. The International HapMap Consortium: A haplotype map of the human genome. Nature 2005, 437(7063):1299–1320. 32. The International HapMap Consortium: A second generation human haplotype map of over 3.1 million SNPs. Nature 2007, 449(7164):851–861. 33. Affymetrix Inc: BRLMM: An improved genotype calling method for the GeneChip human mapping 500K array set; 2006. 34. Korn JM, Kuruvilla FG, McCarroll SA, Wysoker A, Nemesh J, Cawley S, Hubbell E, Veitch J, Collins PJ, Darvishi K, et al: Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat Genet 2008, 40(10):1253–1260. 35. Stranger BE, Forrest MS, Dunning M, Ingle CE, Beazley C, Thorne N, Redon R, Bird CP, de Grassi A, Lee C, et al: Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science 2007, 315(5813):848–853. 36. Stranger BE, Nica AC, Forrest MS, Dimas A, Bird CP, Beazley C, Ingle CE, Dunning M, Flicek P, Koller D, et al: Population genomics of human gene expression. Nat Genet 2007, 39(10):1217–1224. 37. Benjamini Y, Hochberg Y: Controlling the false discovery rate: A practical and powerful approach to multiple testing. J R Statist Soc B 1995, 57(1):289–300. 38. Guo SW, Thompson EA: Performing the exact test of Hardy-Weinberg proportion for multiple alleles. Biometrics 1992, 48(2):361–372. 39. Hastie T, Tibshirani R, Buja A: Flexible discriminant analysis by optimal scoring. J Amer Statist Ass 1994, 89(428):1255–1270. 40. Maher B: Personal genomes: The case of the missing heritability. Nature 2008, 456(7218):18–21. 41. Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, et al: Finding the missing heritability of complex diseases. Nature 2009, 461(7265):747–753. 42. Eichler EE, Flint J, Gibson G, Kong A, Leal SM, Moore JH, Nadeau JH: Missing heritability and strategies for finding the underlying causes of complex disease. Nat Rev Genet 2010, 11(6):446–450. 43. Emilsson V, Thorleifsson G, Zhang B, Leonardson AS, Zink F, Zhu J, Carlson S, Helgason A, Walters GB, Gunnarsdottir S, et al: Genetics of gene expression and its effect on disease. Nature 2008, 452(7186):423–U422. 44. Spielman RS, Bastone LA, Burdick JT, Morley M, Ewens WJ, Cheung VG: Common genetic variants account for differences in gene expression among ethnic groups. Nat Genet 2007, 39(2):226–231. 45. Cheung VG, Spielman RS, Ewens KG, Weber TM, Morley M, Burdick JT: Mapping determinants of human gene expression by regional and genome-wide association. Nature 2005, 437(7063):1365–1369. 46. Gibson G: The environmental contribution to gene expression profiles. Submit your next manuscript to BioMed Central Nat Rev Genet 2008, 9(8):575–581. and take full advantage of: 47. Cortes C, Vapnik V: Support-vector networks. MLear 1995, 20(3):273–297. 48. Breiman L, Friedman JH, Olshen RA, Stone CJ: Classification and Regression • Convenient online submission Trees. Belmont, CA: Wadsworth & Brooks/Cole Advanced Books & Software; • Thorough peer review 1984. 49. Siva N: . Nat Biotechnol 2008, 26(3):256. • No space constraints or color figure charges • Immediate publication on acceptance doi:10.1186/1471-2164-13-346 • Inclusion in PubMed, CAS, Scopus and Google Scholar Cite this article as: Yang et al.: Integrative analysis of single nucleotide polymorphisms and gene expression efficiently distinguishes samples • Research which is freely available for redistribution from closely related ethnic populations. BMC Genomics 2012 13:346. Submit your manuscript at www.biomedcentral.com/submit