Pattern Recognition Letters 31 (2010) 2097–2102

Identification and analysis of family-specific features derived from DNA and protein information

Ashish Anand a, Ganesan Pugalenthi a, Gary B. Fogel b, P.N. Suganthan a,*

a School of Electrical and Electronic Engineering, Nanyang Technological University, 50 Nanyang Avenue, 639798 Singapore, Singapore
b Natural Selection Inc., 9330 Scranton Road, Suite 150, San Diego, CA 92121, United States

Article history: Available online 17 October 2009

Keywords: Transcription factor; TF family-specific features; TF–TFBS interaction; Multi-class classification; TFBS; Feature selection

Abstract

A common approach for understanding the relationship between transcription factors (TFs) and transcription factor binding sites (TFBSs) is to use features at either the TF level or the DNA level. For a given TF family, features can be derived from the DNA-binding domains at the protein level as well as TF binding sites at the DNA sequence level. Here we investigate the relative importance of features from these different levels for main TF families to better understand: (1) family-specific features and (2) the proportion of features from either the DNA or protein level. We perform class-wise feature selection on TF families to identify important features for each family. Importance of the selected features is assessed in terms of predictive accuracy of assigning TFs and associated TFBSs to correct TF families. Evaluation of the best model on an independent test set resulted in a predictive accuracy of 90%. Analysis of the selected features used in the best model on a family-by-family basis shows congruence with the fact that interaction between TF proteins and TFBS in the DNA is quite family specific. Our analysis further suggests that: (1) this approach can be used to determine and better understand which features (at both the DNA and protein levels) are important to consider for each TF family, and (2) a similar approach to combine DNA and protein level features may be useful for other datasets where protein–DNA interaction is a key component of biological function.

© 2009 Elsevier B.V. All rights reserved.

* Corresponding author. Address: S2-B2a-21, School of EEE, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639 798, Singapore. Tel.: +65 6790 5404; fax: +65 6793 3318.
E-mail addresses: [email protected] (A. Anand), ganesan@ntu.edu.sg (G. Pugalenthi), [email protected] (G.B. Fogel), epnsugan@ntu.edu.sg (P.N. Suganthan).

1. Introduction

Protein–nucleic acid interactions play a central role in many cellular processes including transcription and translation. A key aspect of transcriptional regulation requires the binding of a class of proteins (e.g., transcription factors (TFs)) to cis-acting DNA regulatory sequences (e.g., transcription factor binding sites (TFBS)). Understanding the mechanisms of these interactions and identifying fundamental associations between each TF and its associated TFBSs remains a major challenge for both experimental and computational biology.

TF binding to DNA sequences depends largely on two factors: a DNA-binding domain present in the tertiary structure of the TF protein and a TFBS nucleotide sequence at the DNA level that is recognized by the TF. Despite obvious and important shared interactions, very few previous studies (Kaplan et al., 2005; Qian et al., 2006, 2007) have attempted to integrate features at both the protein and DNA levels for modeling, leaving open the question of which features (either DNA, protein, or some combination) might be more informative when making models of TF binding. Qian et al. (2007) integrated functional information of TF proteins and their DNA TFBSs to predict DNA-binding preferences. However, they did not specifically review which features at the DNA or protein levels were more or less useful for this purpose over all TF families or for individual TF families.

Motivation for this paper is based on the fact that the binding modes of TFs from different TF families to TFBSs are different at both the structure as well as sequence levels. Hence we focus on the identification of TF family-specific features which are most informative for each TF family. In particular, we consider TF proteins and their TFBS at the DNA level using TF families as defined in JASPAR (Vlieghe et al., 2006), and identify family-specific features using a method of class-wise feature selection. We perform an extensive study of the possible DNA-based features, protein-based features, and their combination to determine their importance in terms of highest accuracy TF family assignment, thus allowing us to identify TF family-specific features. It is important to mention that this study is not proposing a method to predict the structure family of a given TF and/or its associated binding sites

but demonstrates that the family-specific features at both TF and TFBS levels are important for their interaction and evaluates the selected important features both quantitatively and qualitatively. Quantitative evaluation is done in terms of predictive accuracy of models using these features for classification of TF and its associated binding sites into TF structure families. Qualitative analysis is done in terms of biological relevance of the selected features to the corresponding TF family.

2. Methods

2.1. Dataset

We use the same dataset as used in our earlier study (Anand et al., 2008b). We describe the dataset here again for the sake of completeness. JASPAR is the largest, curated, and open-access collection of eukaryotic TFBS profile matrices (Vlieghe et al., 2006). Within JASPAR, TFBSs are classified into 11 structural families. As part of our experimental design, we make use of only those families with four or more samples. Given this requirement, two TFBS-families (bZIP-cEBP and TRP (MYB)) are removed. The remaining 9 TFBS-families composed of a total of 55 TFs are used for modelling (Table 1).

Table 1
Transcription factor families from the JASPAR database. Abbreviations used in this paper for some TF families are provided in square brackets.

| TF family | Number of samples in JASPAR |
| ETS | 7 |
| bZIP-CREB | 4 |
| REL (a) | 5 |
| Nuclear Receptor [NR] | 8 |
| Forkhead [Fkh] | 4 |
| bHLH (zip) | 9 |
| MADS (b) | 5 |
| Homeobox [Hbox] | 7 |
| HMG | 6 |
| Total samples | 55 |

(a) REL indicates Rel Homology Domain (RHD).
(b) The “MADS” acronym refers to the genes in which the MADS-box was first identified: MCM1, AGAMOUS, DEFICIENS, and SRF.

2.1.1. Feature formulation

Both DNA-based and protein-based sequence features are used to represent each TF. Four main feature types and their combination are defined. These include: DNA features, DNA-physico features, protein features, and peptide features (each defined in further detail below). DNA-based features are the same as defined in our earlier study (Anand et al., 2008b) and we describe them below for completeness. DNA features are calculated using only known binding sites or DNA motifs obtained from the JASPAR database. Flanking sequences are avoided as no general consensus exists on the notion of an “ideal” flanking sequence length. Protein features are calculated using only experimentally determined or PFAM annotated DNA-binding domains of TFs. For each TF, a list of binding DNA motifs or sites is obtained from the JASPAR database. Features corresponding to each DNA motif are calculated and the average is taken to get a single feature vector representing each TF. We describe the four basic feature types below.

2.1.1.1. Basic features

2.1.1.1.1. DNA features. We use the same set of DNA features as defined in (Narlikar and Hartemink, 2006). A total of 1387 features are defined solely using TFBS sequence information. The classifier model using only this feature vector is referred to as the “DNA-model”. These features are:

(1) Frequency of subsequence features: Integers representing counts of all subsequences of length 1 (i.e., each of the 4 nt) to length 5 (i.e., each of the 4^5 possible nucleotide strings). These integers account for a total of 1364 entries in the vector, comprising the vast majority of possibly relevant features. The full 15-letter code is not considered as no consensus sequences in any form are considered as binding sites in this study.
(2) Ungapped palindrome features: Binary indicator variables denoting whether the binding site contains palindromic subsequences of half-length 3, 4, 5 or 6 that span the entire site (i.e., end-to-end), as well as those that do not span the entire site (i.e., are somewhere in the middle of the site).
(3) Gapped palindrome features: Binary indicator variables denoting whether the binding site contains gapped palindromic subsequences of half-length 3, 4, 5 or 6 that span the entire site (i.e., end-to-end), as well as those that do not span the entire site (i.e., are somewhere in the middle of the site). A gapped palindromic subsequence is one in which some non-palindromic nucleotides are inserted exactly in the middle of two otherwise palindromic halves.
(4) Special features: Binary indicator variables that denote the presence or absence of features that have been identified in the literature to be overrepresented in the binding sites of certain classes of TFs. These seven features were G..G (Wolfe et al., 2000), [GC]..[GC]..[GC] (Wolfe et al., 2000), AGGTCA | TGACCT (Zilliacus et al., 1995), CA..TG (Atchley and Fitch, 1997), TGA.*TCA (Mulder et al., 2003), and TAAT | ATTA (Pabo and Sauer, 1992). ‘.’ means presence of any single nucleotide, ‘.*’ means presence of at least one nucleotide, [XY] means presence of one of the letters X or Y, and “XYZ | ABC” means presence of one of the strings “XYZ” or “ABC”. The presence or absence of each of these features was considered as additional features.

2.1.1.1.2. DNA-Physico features. Conformational and physicochemical properties have been shown to affect the activity of cis-regulatory DNA elements (Ponomarenko et al., 1999; Anand et al., 2006). The mean values of 38 conformational and physicochemical properties of dinucleotides are downloaded from the Property subdirectory of the Activity database (Ponomarenko et al., 2001). For a given DNA site S = s_1, s_2, ..., s_i, ..., s_L of length L, a value representing each of the 38 features is calculated as:

$$x_q^i(s) = \frac{\sum_{j=1}^{L-1} P_q(s_j, s_{j+1})}{L} \tag{1}$$

where P_q is the qth property of dinucleotides (s_j, s_{j+1}). There are 38 such properties (Table 1 in Supplementary File 1) and when these are combined with the DNA feature set above, the result is a total of 1425 features. The classifier model based on this feature set is referred to as the “DNA-Physico model”.

2.1.1.1.3. Protein features. The classifier model using protein features is referred to as the “Protein-model”. These features are classified into three sub-groups as follows:

(1) Amino acid type and groups: The 20 AAs are classified into 11 groups based on the physicochemical properties (Pugalenthi et al., 2007). For each TF binding domain, the frequency of all 20 amino acids as well as their 11 groups are calculated. This generated 31 features.
(2) Physicochemical properties: A total of 13 physicochemical properties obtained from the AAindex database (Kawashima et al., 2008). The computed properties include molecular weight, hydration potential, refractivity, flexibility, melting point, optical activity, polarity, isoelectric points, and

normalized frequencies of parallel beta sheet, antiparallel beta sheet, and beta turn. The average value for each physico-chemical property was calculated from the sum of the physico-chemical property of all the amino acids in the sequence divided by the total number of amino acids in the sequence (13 features). Similarly, the average value for each physico-chemical property in helix (H), beta sheet (E), and coil (C) regions was calculated (3 × 13 = 39 features). This generated in total 52 features.
(3) Secondary structure: Secondary structure information for each TF is assigned using PSIPRED (McGuffin et al., 2000). The overall composition of H, E, and C (3 features) and the frequencies of 20 amino acids at helix, sheet, and coil regions were calculated (3 × 20 = 60 features).

All these features are concatenated to make a single feature vector of length 146.

2.1.1.1.4. Peptide features. A classifier using only peptide features is referred to as a “Peptide-model”. Frequencies of di- and tri-peptides are used to represent protein sequences for classification. To reduce the dimensionality of the feature space, amino acids are clustered into 11 groups with similar physicochemical or structural properties (Pugalenthi et al., 2007). All possible pairwise and triplet combinations are computed from the 11 groups and this resulted in 66 di-peptide and 1331 triplet combinations. The di- and tri-peptide frequencies are computed from each sequence and are represented by one or more pairwise and triplet combinations, respectively. All these features are concatenated to make a single feature vector of length 1397.

2.1.1.2. Compound features. Four sets of compound features are generated by combining one or more of the above basic features. The compound feature vectors are summarized in Table 2.

Table 2
Compound feature vectors.

| Compound feature vector | Type of basic features | Number of features |
| DNA-Physico-Protein | DNA-Physico and protein | 1571 |
| DNA-Physico-Peptide | DNA-Physico and peptide | 2822 |
| Protein–Peptide | Protein and peptide | 1543 |
| DNA-Physico-Protein-Peptide | All four basic | 2968 |

2.2. Algorithm

2.2.1. Background of SVM, OVA-SVM, and SVM-RFE

Support vector machines (SVMs) (Vapnik, 1998) belong to the family of margin-based classifiers. SVMs were originally designed to solve binary classification problems. Several algorithms have extended binary SVMs to multi-class problems (Kreßel, 1999; Weston and Watkins, 1999; Crammer and Singer, 2001; Lee et al., 2004). One-versus-all (OVA) is one simple and early extension of SVM to multi-class problems (Bottou et al., 1994).

SVM-recursive feature elimination (SVM-RFE) (Guyon et al., 2002) was originally proposed for binary classification problems. The method of SVM-RFE begins with the set of all features and selectively eliminates one feature at a time. Features are scored and ranked on the squared coefficients w_j^2 (j = 1, 2, ..., p) of the weight vector w. The feature with the smallest w_j^2 is eliminated in each iterative step. The procedure is repeated until a pre-determined number of features remain. This procedure can also be generalized to remove more than one feature per step (Guyon et al., 2002). SVM-RFE has also been extended in OVA fashion by many researchers (Ramaswamy et al., 2001; Rifkin et al., 2003; Chai and Domeniconi, 2004) for multi-class problems.

2.2.2. Approach

For our analysis, feature selection was performed using OVA-RFE. Selected features were then used with corresponding OVA-SVM classifiers. The resulting class-prediction was made using probability scores obtained from OVA-SVMs. Anand et al. (2008a) demonstrated that conversion of decision function values into probability scores increases predictive performance. Among the three methods of converting decision function values into probability scores that were evaluated in (Anand et al., 2008a), Platt’s approach (Platt, 1999) was determined to provide equivalent or improved predictive accuracy over several datasets. Thus for the purpose of the current investigation, we used Platt’s approach to convert the decision function values into probability scores. We used LibSVM (Chang and Lin, 2001) to implement OVA-RFE, OVA-SVM and to calculate probability scores.

2.3. Experimental design

During pre-processing, for each feature type, redundant features with identical values over all samples were removed. The remaining features were normalized to [−1, 1]. Table 2 in Supplementary File 1 lists the number of features before and after pre-processing. The performance of all models was assessed using k-fold external cross-validation (CV) following Ambroise and McLachlan (2002) to provide an unbiased estimate of generalization error. CVs were performed 100 times to provide reliable estimates of prediction accuracy. A linear kernel was used for the SVM and hence only one SVM parameter (C) required tuning. For each model, a range of C was evaluated {10^-5, 10^-4, 10^-3, 10^-2, 10^-1, 1}. Using all features, the C setting with the best average 4-fold CV accuracy over all 100 runs was selected as the most appropriate setting of C.

For feature selection, 3/4 of the data were considered for the feature selection process and the best features are re-evaluated in terms of performance on the remaining 1/4 of the data. Average 4-fold accuracy on the held-out testing data was calculated. This procedure of dividing the data into training for feature selection and testing for evaluation via 4-fold cross-validation was repeated 100 times keeping the class-wise proportion in all 4-fold partitions approximately the same (i.e., stratified partitioning). The average accuracy over all 100 stratifications was calculated to estimate the prediction accuracy of each model. Starting with all features, we successively eliminated 1% of the features at each OVA-RFE iteration until a minimum of 10 features were left.

3. Results and discussion

We first present the results in terms of classifier performance for prediction of TF family and then present a brief analysis of some of the important features identified for each TF family based on their contribution in classification.

3.1. Comparison and analysis of all models

The “DNA-Physico-Protein-Peptide” model combining four primary feature types obtained the best predictive accuracy of 94.18 ± 2.12% even when using as few as 39 features per class (Table 3 in Supplementary File 1). Predictive accuracies of >90% were generated using as few as 13 features per class. The difference in predictive accuracy obtained by the DNA-Physico-Protein-Peptide model using a small sample of features was quite significant compared to the accuracies obtained by other models using a similar number of features (see below for detailed discussion). This result indicates that features taken from vastly different aspects of the relationship between TFs and TFBSs can complement each other for increased predictive accuracy of TF family.

To verify this hypothesis, we performed two additional analyses. First, after fixing the number of features to 39 (see footnote 1), we reviewed all features which occurred at least 50% of the time over 100 runs of 4-fold cross-validation (providing a maximum possible representation for each feature of 400 times). Table 4 in Supplementary File 1 provides a listing of the class-wise statistics for each feature type. We have provided an additional supplementary file (Supplementary File 3), listing selected features (corresponding to Table 4 in Supplementary File 1). Note that the major proportions of most frequent features were DNA and peptide features. However, it should also be noted that the number of possible DNA features (n = 1305) and number of possible peptide features (n = 1363) were significantly larger than the number of possible protein features (n = 144). The result that both DNA and peptide features appear to be important might also simply be the result of sampling from overrepresented feature classes. For the second analysis, we reviewed each basic feature type and evaluated class-wise predictive accuracy. Such an analysis helps to identify key features that could be useful for particular TF families.

Footnote 1: We chose 39 features because the DNA-Physico-Protein-Peptide model obtained the best predictive accuracy using 39 features.

Table 5 in Supplementary File 1 summarizes the class-wise and overall mean predictive accuracies obtained by different models using 39 features per class. The DNA-Physico model performed well for the bZIP-CREB, REL and bHLH-ZIP families. In fact, for the bHLH-ZIP family, performance was statistically significantly better (proportion test, p-value = 3.44 × 10^-11) than the Protein-model and slightly better mean predictive accuracy was obtained by the DNA-Physico-Protein model. The Protein-model correctly classified the ETS-family with 100% accuracy, which was a statistically significant improvement when compared to the DNA-Physico model (proportion test, p-value < 2.2 × 10^-16). Combining the protein-features and DNA-Physico features led to almost perfect prediction of ETS-family members. Similarly significant improvement in predictive accuracy was obtained through the use of DNA-Physico-Protein features for the Nuclear-Receptor-family (proportion test, p-value < 2.2 × 10^-16), and Homeobox (proportion test, p-value = 1.99 × 10^-6) compared to the DNA-Physico model. The Peptide-model gave improved prediction accuracy for Nuclear Receptor and MADS families but significantly worse performance for bZIP-CREB, bHLH-ZIP and Homeobox families as compared to either DNA-Physico or Protein models. However, by combining peptide-features with DNA-Physico features, the DNA-Physico-Peptide model improved the overall mean predictive accuracy significantly (proportion test, p-value < 2.2 × 10^-16) compared to the DNA-Physico model. Significant improvement was also observed in the mean predictive accuracies of Fkh (proportion test, p-value < 2.2 × 10^-16) and MADS (proportion test, p-value < 2.2 × 10^-16) families when compared to the DNA-Physico-model. The “DNA-Physico-Protein-Peptide” model making use of all four features gave slightly improved or competitive mean predictive accuracies for all TF families when compared to the DNA-Physico-Protein or the DNA-Physico-Peptide model. Its performance was significantly better for the HMG family (proportion test, p-value = 6.15 × 10^-8) compared to the best result among DNA-Physico-Peptide and DNA-Physico-Protein models. This mainly contributed toward a better overall mean predictive accuracy when compared to DNA-Physico-Protein (Wilcoxon rank sum test, p-value < 2.2 × 10^-16), DNA-Physico-Peptide (Wilcoxon rank sum test, p-value = 2.52 × 10^-12) and DNA-Physico (Wilcoxon rank sum test, p-value < 2.2 × 10^-16) models.

In brief, the combined “DNA-Physico-Protein-Peptide” model provided improved predictive accuracies over all TF families with the exception of the Hbox family when compared to all other models. The DNA-Physico-Protein-Peptide model improved the predictive accuracy by 10% relative to the best performance obtained by other approaches. In light of the above analysis, we would like to focus discussion on HMG and Homeobox. HMG represents an ideal case where the significance of combining features from both levels can be observed easily. However, as compared to the other TF families, HMG was difficult to predict using the sequence features in the study. This difficulty requires further analysis to determine sequence features that might be HMG-specific. With regards to Homeobox, the protein-model resulted in the highest predictive accuracy. When comparing models that combined protein and other features (e.g., the DNA-Physico-Protein and Protein–Peptide models), we observed that simply using protein features led to the best results. One explanation for this could be that the additional features do not always lead to improved predictive accuracy and may only add noise to the model. Another explanation is that for this family, the DNA and peptide characteristics are not represented well by the current feature set.

3.2. Evaluation using independent test data

As an additional test of our modelling approach, we evaluated the classifier on an independent test set. First we extracted TFs and associated TFBSs for the nine TF families mentioned previously from TRANSFAC® Professional v11.4. The list of TFs from TRANSFAC® was compared to the list of TFs obtained from JASPAR used to generate the models. TFs which were not used for model development are selected for use as an independent test dataset (n = 258).

The average rank of each feature over all iterations and partitions of the CV experiment was calculated as follows. The ranking of each feature obtained in all iterations of feature selection by SVM-RFE was added and averaged over the number of iterations. This process was repeated over all partitions and the average ranks obtained at each partition were added and averaged over all partitions to obtain a final average rank for each feature. This can be summarized by the following equations:

$$R_p^j = \frac{\sum_{i=1}^{n} r_i^j}{n} \tag{2}$$

$$R^j = \frac{\sum_{p=1}^{P} R_p^j}{P} \tag{3}$$

where n is the number of feature selection iterations, P is the total number of partitions, r_i^j is the rank of feature j at iteration i of the pth partition, R_p^j is the average rank of feature j over the feature selection iterations of the pth partition, and R^j is the final average rank of feature j. Several experiments are performed with different numbers of top-ranked features {10, 20, 30, 40, 50, 100, 200, 300, 400, 500} generated previously when modeling the JASPAR data. As mentioned previously, the “DNA-Physico-Protein-Peptide” model making use of all four feature sets gave the best CV accuracy compared to other models. Fig. 1 shows that a predictive accuracy of 90% was obtained for this model on the independent test data using the 100 top-ranked features.

3.3. Analysis of selected top-ranked features for each TF family

The 100 top-ranked features for each TF family were reviewed and were provided as Supplementary File 2. The majority of features were family-specific, as the interactions of TFs–TFBSs as well as features related to classification of TF families were noted

to be very family dependent. A brief analysis of these features indicated their biological significance for each corresponding TF family.

Fig. 1. Average predictive accuracy as a function of the number of features used in each model.

Feature analysis provides insight into the various sequence features favored by each TF family. For example, the conserved triplet TGA is an important feature of the bZIP-CREB family. This triplet is a part of the CREB site “TGACGTCA” which is a typical consensus sequence recognized by members of the bZIP-CREB family (Fujii et al., 2000). Furthermore, two conserved penta-nucleotides, ATTCC for the REL family and TTGTT for the forkhead family, are present in the selected nucleotide-based features. ATTCC is known to be involved in TF binding in the REL family (Chen et al., 1998) and TTGTT is a portion of the forkhead DNA-binding motif (T G/A TTTGT) (Bell et al., 2007). Two palindromic features, “TGACGTCA” for the bZIP-CREB family and “CACGCGTG” for the bHLH family, are among the top-ranking features of the respective families. The top-ranking features from the ETS-family including GGAA, CGGAA, and GAA frequency (average number of occurrences of a subsequence in a given set of TFBSs) corresponded well with the earlier observation that the TFs from the ETS-family bind generally to purine-rich segments (Karim, 1990).

Features related to amino acid distribution were also identified as being important for some families. For example, tryptophan (W) composition was selected as an important ETS-family-specific feature. It is known that tryptophan plays a role in minor groove recognition (Werner et al., 1995). Cysteine (C) composition is noted to be an important feature of the nuclear receptor family. Our further analysis on protein sequences from this family shows that they have nine conserved cysteines. Features related to protein secondary structure (helix, sheet, and coil) or the frequencies of 20 AAs at helix, sheet, and coil regions are selected frequently. For example, REL TFs are all beta proteins consisting of immunoglobulin-like beta-barrel sub-domains and 10 out of 20 top-ranked features for this family are related to physico-chemical properties of helices, indicating that these features are important for the separation of REL TF families.

Tri-peptide features were also identified as being important for TF classification. The significance of selected tripeptides was assessed through comparison to family-specific fingerprints obtained from the PRINTS database (Attwood and Beck, 1994) and the literature. The amino acid sequence NXXAAXXCR (where X represents any amino acid) has been reported as a signature sequence of bZIP-CREB family proteins for DNA recognition (Fujii et al., 2000). Most tri-peptide features identified by our approach for this family are part of this signature motif. Analysis of TF DNA-binding domains from this family identified conserved amino acid sequences “YSY,” “FPY,” and “HNL” which occur as a family-specific signature for the forkhead domain as reported in PRINTS (Weigel and Jackle, 1990; Clark et al., 1993). Similar observations are also made for other TF families; however, a complete review of all features for each TF family is beyond the scope of this paper, which focuses mainly on the modelling approach used to allow such feature analysis for individual protein families.

4. Conclusions

In this paper, a wide variety of features from both the protein and DNA levels was integrated for TF family analysis. We performed an extensive study investigating the importance of different feature types (and their combination) in family-specific TF–TFBS interactions. Class-wise feature selection was performed to identify family-specific features and an SVM-based classifier was then used to evaluate these features to classify TFs (using protein-level information) and/or TFBSs (using DNA-level information) into nine possible TF families. Class-wise feature selection demonstrated that very different feature sets characterize each TF-family. Only a few features were shared by more than one TF family. This data shows coherence with the fact that interactions between TFs and associated TFBSs are very family specific, each requiring unique aspects of protein and DNA interaction.

The “DNA-Physico-Protein-Peptide-model” using all four feature types had improved or competitive accuracies in terms of both family-wise and overall predictive accuracy. This same model provided a predictive accuracy of 90% when evaluated on an independent test dataset. Taken together, this data supports the notion that ensembles of classifiers might be able to generate useful TF family assignments wherein each component classifier was developed specifically for each TF family using very different features.

We hope that this paper will initiate a new research direction where such features will be analyzed and integrated into models trying to understand various aspects of TF protein–DNA interactions. One such example is to study the interactions between multiple TFs and composite regulatory elements. Composite regulatory elements are closely situated subsequences in the promoter region of a DNA sequence that act as binding sites for different TFs. The proposed approach can also easily be extended to include flanking sequences around TFBSs and DNA-binding domains of TFs to determine the nature and importance of features from these areas. These applications remain to be evaluated and will be the focus of our future research.

Acknowledgement

The authors acknowledge financial support offered by the Agency for Science, Technology, and Research, Singapore (A*Star) under Grant #052 101 0020.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.patrec.2009.10.008.

References

Ambroise, C., McLachlan, G.J., 2002. Selection bias in gene extraction on the basis of microarray gene-expression data. Proc. Nat. Acad. Sci. USA 99, 6562–6566.
Anand, A. et al., 2006. Feature selection approach for quantitative prediction of transcriptional activities. In: IEEE Symposium on Computational Intelligence and Bioinformatics and Computational Biology, 2006.
Anand, A. et al., 2008a. Predicting protein structural class by SVM with class-wise optimized features and decision probabilities. J. Theoret. Biol. 253 (2), 375–380.
Anand, A. et al., 2008b. Prediction of transcription factor families using DNA sequence features. In: Proceedings of the Third IAPR International Conference on Pattern Recognition in Bioinformatics, Melbourne, Australia, LNBI 5265, pp. 154–164.
Atchley, W.R., Fitch, W.M., 1997. A natural classification of the basic helix–loop–helix class of transcription factors. Proc. Natl. Acad. Sci. USA 94, 5172–5176.
Attwood, T.K., Beck, M.E., 1994. PRINTS-a protein motif fingerprint database. Protein Eng. Des. Selection 7 (7), 841–848.
Bell, M.P. et al., 2007. Forkhead box P3 regulates TLR10 expression in human T regulatory cells. J. Immunol. 179 (3), 1893–1900.
Bottou, L. et al., 1994. Comparison of classifier methods: a case study in handwritten digit recognition. In: Proceedings of the 12th IAPR International Conference on Pattern Recognition, 1994. Vol. 2 – Conference B: Computer Vision & Image Processing.
Chai, H., Domeniconi, C., 2004. An evaluation of gene selection methods for multi-class microarray data classification. In: Proceedings of the Second European Workshop on Data Mining and Text Mining in Bioinformatics.
Chang, C.C., Lin, C.J., 2001. LIBSVM: A Library for Support Vector Machines.
Chen, Y.Q. et al., 1998. A novel DNA recognition mode by the NF-kappa B p65 homodimer. Nat. Struct. Biol. 5, 67–73.
Clark, K.L. et al., 1993. Co-crystal structure of the HNF-3/fork head DNA-recognition motif resembles histone H5. Nature 364 (6436), 412–420.
Crammer, K., Singer, Y., 2001. On the algorithmic implementation of multiclass kernel-based vector machines. J. Mach. Learn. Res. 2, 265–292.
Fujii, Y. et al., 2000. Structural basis for the diversity of DNA recognition by bZIP transcription factors. Nat. Struct. Biol. 7, 889–893.
Guyon, I. et al., 2002.
Karim, F.D., 1990. The ETS-domain: a new DNA-binding motif that recognizes a purine-rich core DNA sequence. Genes Dev. 4, 1451–1453.
Kawashima, S. et al., 2008. AAindex: amino acid index database, progress report 2008. Nucleic Acids Res. 36 (Database issue), D202.
Kreßel, U., 1999. Pairwise classification and support vector machines. In: Advances in Kernel Methods: Support Vector Learning. MIT Press, Cambridge, MA, USA, pp. 255–268.
Lee, Y. et al., 2004. Multicategory support vector machines: theory and application to the classification of microarray data and satellite radiance data. J. Amer. Statist. Assoc. 99 (465), 67–82.
McGuffin, L. et al., 2000. The PSIPRED protein structure prediction server. Bioinformatics 16, 404–405.
Mulder, N.J. et al., 2003. The InterPro Database, 2003 brings increased coverage and new features. Nucleic Acids Res. 31, 315–318.
Narlikar, L., Hartemink, A.J., 2006. Sequence features of DNA binding sites reveal structural class of associated transcription factor. Bioinformatics 22 (2), 157–163.
Pabo, C.O., Sauer, R.T., 1992. Transcription factors: structural families and principles of DNA recognition. Ann. Rev. Biochem. 61, 1053–1095.
Platt, J., 1999. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Smola, A.J., Bartlett, P.L., Schölkopf, B., Schuurmans, D. (Eds.), Advances in Large Margin Classifiers. MIT Press, Cambridge, pp. 61–74.
Ponomarenko, J. et al., 1999. Conformational and physicochemical DNA features specific for transcription factor binding sites. Bioinformatics 15, 654–668.
Ponomarenko, J.V. et al., 2001. Activity: a database on DNA/RNA sites activity adapted to apply sequence–activity relationships from one system to another. Nucleic Acids Res. 29 (1), 284–287.
Pugalenthi, G. et al., 2007. A machine learning approach for the identification of odorant binding proteins from sequence-derived properties. BMC Bioinformatics 8, 351.
Qian, Z. et al., 2007. An approach to predict transcription factor DNA binding site specificity based upon gene and transcription factor functional categorization. Bioinformatics 23 (18), 2449–2454.
Qian, Z.L. et al., 2006. Automatic transcription factor classifier based on functional domain composition. Biochem. Biophys. Res. Commun. 347 (1), 141–144.
Ramaswamy, S. et al., 2001. Multiclass cancer diagnosis using tumor gene expression signatures. Proc. Natl. Acad. Sci. USA 98, 15149–15154.
Rifkin, R. et al., 2003. An analytical method for multiclass molecular cancer classification. SIAM Rev. 45 (4), 706–723.
Vlieghe, D. et al., 2006. A new generation of JASPAR, the open-access repository for transcription factor binding site profiles. Nucleic Acids Res. 34, D95–D97.
Weigel, D., Jackle, H., 1990. The fork head domain: a novel DNA binding motif of eukaryotic transcription factors? Cell 63 (3), 455–456.
Werner, M.H. et al., 1995. The solution structure of the human ETS1–DNA complex reveals a novel mode of binding and true side chain intercalation. Cell 83 (5), 761–771.
Weston, J., Watkins, C., 1999. Support vector machines for multi-class pattern recognition. In: Proceedings of the Seventh European Symposium on Artificial
Gene selection for cancer classification using support vector Neural Networks. machines. An evaluation of gene selection methods for multi-class microarray Wolfe, S.A. et al., 2000. DNA recognition by Cys2His2 zinc finger proteins. Ann. Rev. data classification. Mach. Learn. 46, 389–422. Biophys. Biomol. Struct. 29, 183–212. Kaplan, T. et al., 2005. Ab initio prediction of transcription factor targets using Zilliacus, J. et al., 1995. Structural determinants of DNA-binding specificity by structural knowledge. PLoS Comput. Biol. 1 (1), 5–13. steroid receptors. Mol. Endocrinol. 9, 389–400.