Pattern Recognition Letters 31 (2010) 2097–2102
Journal homepage: www.elsevier.com/locate/patrec

Identification and analysis of transcription factor family-specific features derived from DNA and protein information

Ashish Anand a, Ganesan Pugalenthi a, Gary B. Fogel b, P.N. Suganthan a,*

a School of Electrical and Electronic Engineering, Nanyang Technological University, 50 Nanyang Avenue, 639798 Singapore, Singapore
b Natural Selection Inc., 9330 Scranton Road, Suite 150, San Diego, CA 92121, United States

Article history: Available online 17 October 2009

Keywords: Transcription factor; TF family-specific features; TF–TFBS interaction; Multi-class classification; TFBS; Feature selection

Abstract

A common approach for understanding the relationship between transcription factors (TFs) and transcription factor binding sites (TFBSs) is to use features at either the TF level or the DNA level. For a given TF family, features can be derived from the DNA-binding domains at the protein level as well as TF binding sites at the DNA sequence level. Here we investigate the relative importance of features from these different levels for main TF families to better understand: (1) family-specific features and (2) the proportion of features from either the DNA or protein level. We perform class-wise feature selection on TF families to identify important features for each family. Importance of the selected features is assessed in terms of predictive accuracy of assigning TFs and associated TFBSs to correct TF families. Evaluation of the best model on an independent test set resulted in a predictive accuracy of 90%. Analysis of the selected features used in the best model on a family-by-family basis shows congruence with the fact that interaction between TF proteins and TFBS in the DNA is quite family specific.
Our analysis further suggests that: (1) this approach can be used to determine and better understand which features (at both the DNA and protein levels) are important to consider for each TF family, and (2) a similar approach to combine DNA and protein level features may be useful for other datasets where protein–DNA interaction is a key component of biological function.

© 2009 Elsevier B.V. All rights reserved. doi:10.1016/j.patrec.2009.10.008

* Corresponding author. Address: S2-B2a-21, School of EEE, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639 798, Singapore. Tel.: +65 6790 5404; fax: +65 6793 3318.
E-mail addresses: [email protected] (A. Anand), ganesan@ntu.edu.sg (G. Pugalenthi), [email protected] (G.B. Fogel), epnsugan@ntu.edu.sg (P.N. Suganthan).

1. Introduction

Protein–nucleic acid interactions play a central role in many cellular processes including transcription and translation. A key aspect of transcriptional regulation requires the binding of a class of proteins (e.g., transcription factors (TFs)) to cis-acting DNA regulatory sequences (e.g., transcription factor binding sites (TFBS)). Understanding the mechanisms of these interactions and identifying fundamental associations between each TF and its associated TFBSs remains a major challenge for both experimental and computational biology.

TF binding to DNA sequences depends largely on two factors: a DNA-binding domain present in the tertiary structure of the TF protein and a TFBS nucleotide sequence at the DNA level that is recognized by the TF. Despite obvious and important shared interactions, very few previous studies (Kaplan et al., 2005; Qian et al., 2006, 2007) have attempted to integrate features at both the protein and DNA levels for modeling, leaving open the question of which features (either DNA, protein, or some combination) might be more informative when making models of TF binding. Qian et al. (2007) integrated functional information of TF proteins and their DNA TFBSs to predict DNA-binding preferences. However, they did not specifically review which features at the DNA or protein levels were more or less useful for this purpose over all TF families or for individual TF families.

The motivation for this paper is that the binding modes of TFs from different TF families to TFBSs differ at both the structure and sequence levels. Hence we focus on the identification of TF family-specific features which are most informative for each TF family. In particular, we consider TF proteins and their TFBS at the DNA level using TF families as defined in JASPAR (Vlieghe et al., 2006), and identify family-specific features using a method of class-wise feature selection. We perform an extensive study of the possible DNA-based features, protein-based features, and their combination to determine their importance in terms of highest accuracy TF family assignment, thus allowing us to identify TF family-specific features. It is important to mention that this study is not proposing a method to predict the structure family of a given TF and/or its associated binding sites, but demonstrates that the family-specific features at both TF and TFBS levels are important for their interaction and evaluates the selected important features both quantitatively and qualitatively. Quantitative evaluation is done in terms of predictive accuracy of models using these features for classification of a TF and its associated binding sites into TF structure families. Qualitative analysis is done in terms of biological relevance of the selected features to the corresponding TF family.

2. Methods

2.1. Dataset

We use the same dataset as in our earlier study (Anand et al., 2008b) and describe it here again for the sake of completeness. JASPAR is the largest, curated, and open-access collection of eukaryotic TFBS profile matrices (Vlieghe et al., 2006). Within JASPAR, TFBSs are classified into 11 structural families. As part of our experimental design, we make use of only those families with four or more samples. Given this requirement, two TFBS-families (bZIP-cEBP and TRP (MYB)) are removed. The remaining 9 TFBS-families, composed of a total of 55 TFs, are used for modelling (Table 1).

2.1.1. Feature formulation

Both DNA-based and protein-based sequence features are used to represent each TF. Four main feature types and their combination are defined. These include: DNA features, DNA-physico features, protein features, and peptide features (each defined in further detail below). DNA-based features are the same as defined in our earlier study (Anand et al., 2008b) and we describe them below for completeness. DNA features are calculated using only known binding sites or DNA motifs obtained from the JASPAR database. Flanking sequences are avoided as no general consensus exists on the notion of an "ideal" flanking sequence length. Protein features are calculated using only experimentally determined or PFAM annotated DNA-binding domains of TFs. For each TF, a list of binding DNA motifs or sites is obtained from the JASPAR database. Features corresponding to each DNA motif are calculated and the average is taken to obtain a single feature vector representing each TF. We describe the four basic feature types below.

2.1.1.1.1. DNA features.

(1) Frequency of subsequence features: Integers representing counts of all subsequences of length 1 (i.e., each of the 4 nucleotides) to length 5 (i.e., each of the 4^5 possible nucleotide strings). These integers account for a total of 1364 entries in the vector (4 + 16 + 64 + 256 + 1024), comprising the vast majority of possibly relevant features. The full 15-letter code is not considered as no consensus sequences in any form are considered as binding sites in this study.

(2) Ungapped palindrome features: Binary indicator variables denoting whether the binding site contains palindromic subsequences of half-length 3, 4, 5 or 6 that span the entire site (i.e., end-to-end), as well as those that do not span the entire site (i.e., are somewhere in the middle of the site).

(3) Gapped palindrome features: Binary indicator variables denoting whether the binding site contains gapped palindromic subsequences of half-length 3, 4, 5 or 6 that span the entire site (i.e., end-to-end), as well as those that do not span the entire site (i.e., are somewhere in the middle of the site). A gapped palindromic subsequence is one in which some non-palindromic nucleotides are inserted exactly in the middle of two otherwise palindromic halves.

(4) Special features: Binary indicator variables that denote the presence or absence of features that have been identified in the literature to be overrepresented in the binding sites of certain classes of TFs. These seven features were G..G (Wolfe et al., 2000), [GC]..[GC]..[GC] (Wolfe et al., 2000), AGGTCA | TGACCT (Zilliacus et al., 1995), CA..TG (Atchley and Fitch, 1997), TGA.*TCA (Mulder et al., 2003), and TAAT | ATTA (Pabo and Sauer, 1992). Here '.' means presence of any single nucleotide, '.*' means presence of at least one nucleotide, [XY] means presence of one of the letters X or Y, and "XYZ | ABC" means presence of one of the strings "XYZ" or "ABC". The presence or absence of each of these features was considered as additional features.

2.1.1.1.2. DNA-Physico features. Conformational and physicochemical properties have been shown to affect the activity of cis-regulatory DNA elements (Ponomarenko et al., 1999; Anand et al., 2006). The mean values of 38 conformational and physicochemical properties of dinucleotides are downloaded from the Property subdirectory of the Activity database (Ponomarenko et al., 2001).
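As an illustration of the subsequence-count and special features described earlier for the DNA feature type, the following is a minimal Python sketch and not the implementation used in the study. The function names (`kmer_counts`, `special_flags`, `site_features`, `tf_features`) are hypothetical, overlapping occurrences are assumed to be counted, and the ordering of entries in the vector is an arbitrary choice. Only the patterns actually listed in the text are encoded, and because '.*' is defined there as "at least one nucleotide", it is written as `.+` in Python regular-expression syntax.

```python
import re
from itertools import product

ALPHABET = "ACGT"
MAX_K = 5

# Every string of length 1..5 over ACGT, in a fixed order:
# 4 + 16 + 64 + 256 + 1024 = 1364 entries.
KMERS = ["".join(p) for k in range(1, MAX_K + 1)
         for p in product(ALPHABET, repeat=k)]

# Patterns listed in the text, as Python regular expressions.
SPECIAL_PATTERNS = [
    r"G..G",
    r"[GC]..[GC]..[GC]",
    r"AGGTCA|TGACCT",
    r"CA..TG",
    r"TGA.+TCA",  # '.*' in the text means "at least one nucleotide"
    r"TAAT|ATTA",
]


def count_overlapping(site, kmer):
    """Count occurrences of kmer in site, allowing overlaps (an assumption)."""
    k = len(kmer)
    return sum(1 for i in range(len(site) - k + 1) if site[i:i + k] == kmer)


def kmer_counts(site):
    """Integer counts for all 1364 subsequences of length 1 to 5."""
    return [count_overlapping(site, kmer) for kmer in KMERS]


def special_flags(site):
    """Binary indicators: 1 if the pattern occurs anywhere in the site."""
    return [1 if re.search(p, site) else 0 for p in SPECIAL_PATTERNS]


def site_features(site):
    """Feature vector for one binding site (counts followed by flags)."""
    site = site.upper()
    return kmer_counts(site) + special_flags(site)


def tf_features(sites):
    """Average the per-site vectors to obtain one vector per TF."""
    vectors = [site_features(s) for s in sites]
    return [sum(column) / len(vectors) for column in zip(*vectors)]
```

Calling `tf_features` with the list of known binding sites of one TF returns a single averaged vector, mirroring the averaging over DNA motifs described in Section 2.1.1. The palindrome features are not included in this sketch.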
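The ungapped and gapped palindrome indicators described earlier can be sketched in a similar way. The text does not spell out the palindrome definition, so this sketch rests on several assumptions: a palindrome is taken in the usual DNA sense (the second half is the reverse complement of the first), a gap is any run of one or more nucleotides that is not itself a reverse-complement palindrome, and each half-length yields two flags (one for an occurrence spanning the whole site, one for any other occurrence). The names `palindrome_windows` and `palindrome_flags` are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
HALF_LENGTHS = (3, 4, 5, 6)


def revcomp(seq):
    """Reverse complement of a DNA string over ACGT."""
    return seq.translate(COMPLEMENT)[::-1]


def palindrome_windows(site, half, gapped):
    """Yield (start, end) of windows built from two reverse-complementary halves.

    Ungapped: the halves are adjacent. Gapped: at least one nucleotide lies
    between them, and the gap is not itself a reverse-complement palindrome.
    """
    n = len(site)
    gaps = range(1, n - 2 * half + 1) if gapped else (0,)
    for gap in gaps:
        width = 2 * half + gap
        for start in range(n - width + 1):
            left = site[start:start + half]
            middle = site[start + half:start + half + gap]
            right = site[start + half + gap:start + width]
            if right != revcomp(left):
                continue
            if gapped and middle == revcomp(middle):
                continue
            yield start, start + width


def palindrome_flags(site, gapped=False):
    """Two binary flags per half-length: [spans whole site, occurs elsewhere]."""
    site = site.upper()
    flags = []
    for half in HALF_LENGTHS:
        windows = list(palindrome_windows(site, half, gapped))
        spans = any(s == 0 and e == len(site) for s, e in windows)
        inner = any(not (s == 0 and e == len(site)) for s, e in windows)
        flags += [int(spans), int(inner)]
    return flags
```

Under these assumptions, `palindrome_flags("GAATTC")` sets only the first flag (half-length 3, spanning the site), and `palindrome_flags("GAACTTC", gapped=True)` does the same for the gapped variant with a one-nucleotide gap. A different reading of "palindromic" (for example, plain string reversal) would only require changing `revcomp`.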