Lin et al. BMC Genomics 2014, 15:818
http://www.biomedcentral.com/1471-2164/15/818

RESEARCH ARTICLE Open Access

Maize and millet transcription factors annotated using comparative genomic and transcriptomic data

Jinn-Jy Lin1,2,3, Chun-Ping Yu4, Yao-Ming Chang3, Sean Chun-Chang Chen3 and Wen-Hsiung Li2,3,4,5*

Abstract

Background: Transcription factors (TFs) contain DNA-binding domains (DBDs) and regulate gene expression by binding to specific DNA sequences. In addition, there are proteins, called transcription coregulators (TCs), which lack DBDs but can alter gene expression through interaction with TFs or RNA Polymerase II. Therefore, it is interesting to identify and classify the TFs and TCs in a genome. In this study, maize (Zea mays) and foxtail millet (Setaria italica), two important species for the study of C4 photosynthesis and kranz anatomy, were selected.

Result: We conducted a comprehensive genome-wide annotation of TFs and TCs in maize B73 and in two strains of foxtail millet, Zhang gu and Yugu1, and classified them into families. To gain additional support for our predictions, we searched for their homologous genes in Arabidopsis or rice and studied their gene expression level using RNA-seq and microarray data. We identified many new TF and TC families in these two species, and described some evolutionary and functional aspects of the 9 new maize TF families. Moreover, we detected many pseudogenes and transposable elements in current databases. In addition, we examined tissue expression preferences of TF and TC families and identified tissue/condition-specific TFs and TCs in maize and millet. Finally, we identified potential C4-related TF and TC genes in maize and millet.

Conclusions: Our results significantly expand current TF and TC annotations in maize and millet. We provided supporting evidence for our annotation from genomic and gene expression data and identified TF and TC genes with tissue preference in expression.
Our study may facilitate the study of regulation of gene expression, tissue morphogenesis, and C4 photosynthesis in maize and millet. The data we generated in this study are available at http://sites.google.com/site/jjlmmtf.

Keywords: Transcription factor annotation, Coregulators, Comparative genomics, Functional annotation

* Correspondence: [email protected]
2Institute of Molecular and Cellular Biology, National Tsing Hua University, Hsinchu 300, Taiwan
3Biodiversity Research Center, Academia Sinica, Taipei 115, Taiwan
Full list of author information is available at the end of the article

© 2014 Lin et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Background

Gene regulation by transcription factors (TFs) is crucial for development, maintenance of normal physiology and response to external or internal stimuli. Hence, identification and classification of TFs will increase our understanding of TF functions and regulation of biological processes. A TF contains one or more DNA-binding domains (DBDs), which bind specific DNA sequences to mediate the binding of RNA polymerase at the onset of transcription initiation. For example, Arabidopsis ethylene response factor 1 (ERF1) contains an AP2 DBD, which binds the GCC box in the promoter sequences of ethylene responsive genes, while Arabidopsis AINTEGUMENTA (ANT) contains two AP2 domains [1,2]. A TF may also contain an auxiliary domain that facilitates DNA binding. For example, an auxin response factor (ARF) contains a B3 DBD and an auxiliary domain, the “auxin/indole-3-acetic acid (Aux/IAA) domain”. ARF proteins can form homodimers, where the two combined B3 domains can bind a TGTCTC-containing auxin responsive element, through interaction with the two Aux/IAA domains [3,4]. In addition, there are proteins that have no DBD but can bind TFs or RNA polymerase II to alter gene regulation; such proteins are called transcription coregulators (TCs), which include coactivators and corepressors [5]. For example, an Aux/IAA protein, which contains an Aux/IAA domain, can form a heterodimer with an ARF protein and prevent the ARF from activating its target genes [4,6,7].

In recent years, many plant TF databases have been developed. According to Mitsuda et al., in Arabidopsis approximately 70 families of TFs have been classified in public TF databases, including RARTF, AGRIS, DATF and PlnTFDB 3.0 [8-12]. Furthermore, there are other databases, such as PlantTFDB 3.0, ProFITS, GrassTFDB in Grassius, PlantTFcat and TreeTFDB, which provide classifications of TFs in Arabidopsis and other plants [13-17]. However, these databases classify TFs into families according to their own criteria, leading to differences in the number of annotated TF genes and in the number of families among databases. Moreover, only a few databases provide annotation of TCs for plants, such as ProFITS, PlnTFDB 3.0, PlantTFcat and GrassCoregDB in Grassius [11,13-15].

We are interested in annotating TF genes and TC genes in maize and millet. Maize gives a high crop yield and is efficient in water usage. Its genome was sequenced and annotated in 2009 [18,19]. Maize gene annotation contains ~110,000 genes in the maize Working Gene Set (WGS, release 5b) and 39,656 genes in the maize Filtered Gene Set (FGS, release 5b), in which transposons, pseudogenes, contaminants, and other low-confidence genes have been excluded. PlantTFDB 3.0, Grassius and iTAK (http://bioinfo.bti.cornell.edu/cgi-bin/itak/index.cgi) provide maize TF annotation only for the FGS genes, while most other databases have not been updated recently [13,17]. According to the maize transcriptomes of Liu et al., there were 6355 expressed genes in WGS that were not included in FGS, suggesting that some TFs and TCs have not been included in the above databases [20].

Millet is an emerging C4 model plant. It has a shorter generation time (~12 weeks vs. ~16 weeks) and a much smaller genome size (~490 Mb vs. 2500 Mb), suggesting much fewer duplicate genes and less functional redundancy compared to maize. Two different millet cultivars, Yugu1 and Zhang gu, were sequenced in 2012 [21,22]. For millet, only PlantTFDB 3.0 provides TF annotation [17].

In this study, we conducted a comprehensive genome-wide annotation of TFs and TCs for maize WGS and also for the two millet strains, Zhang gu and Yugu1. We used protein domains to identify TFs and TCs and classified them into families in maize and millet, separately. We also identified tissue- or condition-specific TF and TC genes in maize and millet. Our study provides a database of annotated TF and TC genes in maize and millet with various kinds of supporting evidence, especially genomic and transcriptomic data.

Results

Genome-wide prediction and classification of TFs and TCs

To identify TF and TC genes in the maize and millet genomes, we collected all protein sequences annotated in the maize and millet genomes to form an initial set of protein sequences. Then, for each sequence we checked the presence of DBDs or TC domains.

There is abundant information of protein domains related to TFs and TCs in TF databases. To select domains that may be related to a TF or TC family, we compiled a set of related signature domains from PlantTFDB 3.0, PlnTFDB 3.0, Grassius, ProFITS and AnimalTFDB [11,13,14,17,23]. Besides TF databases, we also used other resources to find out more possible DBDs or characteristic domains of TCs. For example, Gene Ontology (GO) annotation can be used to select protein domains that may have TC function but have not been included in the TF databases we used [24]. On the other hand, experimental data such as ChIP-seq and protein binding microarray (PBM) are also useful for finding more DBDs [25,26]. From these sources, we defined 67 TF families and 29 TC families (Additional file 1: Table S1 and Additional file 2: Table S2). As described in Methods, the domains we selected were represented by Hidden Markov Model (HMM). We used HMMER 3.0 to predict protein domains on a protein sequence. After we obtained the domain compositions on the protein sequences under study, a set of classification rules was applied to identify TFs and TCs (Methods). Figure 1 depicts the workflow of our pipeline.

By applying our pipeline, 2538 genes (3637 proteins) in maize were predicted as TF genes and classified into 64 families, including 153 genes that are not in FGS (Table 1 and Additional file 3: Table S3 (A)). In addition, 149 genes (236 proteins) were predicted as TC genes and classified into 21 families, including 8 genes that were not in FGS (Table 2 and Additional file 3: Table S3 (B)).

In millet Yugu1, 1880 genes (2116 proteins) were predicted as TF genes and classified into 64 families (Table 1 and Additional file 3: Table S3 (C)). In addition, 99 genes (118 proteins) were predicted as TC genes and classified into 22 families (Table 2 and Additional file 3: Table S3 (D)).

In millet Zhang gu, 1846 genes were predicted as TF genes and classified into 65 families (Table 1 and Additional file 3: Table S3 (E)), while 104 genes were predicted as TC genes and classified into 22 families (Table 2 and Additional file 3: Table S3 (F)).

Orthologs of maize and millet TF and TC genes in other genomes

To gain additional support of identified TF and TC genes, we checked the existence of orthologous genes in