Highly Conserved Upstream Sequences for Transcription Factor Genes and Implications for the Regulatory Network
Hisakazu Iwama*† and Takashi Gojobori*‡§

*Center for Information Biology and DNA Data Bank of Japan, National Institute of Genetics, Research Organization of Information and Systems, Yata 1111, Mishima, 411-8540 Japan; and ‡Integrated Database Group, Biological Information Research Center, National Institute of Advanced Industrial Science and Technology, Time 24 Building, 10th Floor, 2-45 Aomi, Koto-ku, Tokyo 135-0064, Japan

Communicated by Wen-Hsiung Li, University of Chicago, Chicago, IL, October 15, 2004 (received for review May 27, 2004)

Identifying evolutionarily conserved blocks in orthologous genomic sequences is an effective way to detect regulatory elements. In this study, with the aim of elucidating the architecture of the regulatory network, we systematically estimated the degree of conservation of the upstream sequences of 3,750 human–mouse orthologue pairs along 8-kb stretches. We found that the genes with high upstream conservation are predominantly transcription factor (TF) genes. In particular, developmental process-related TF genes showed significantly higher conservation of the upstream sequences than other TF genes. Such extreme upstream conservation of the developmental process-related TF genes suggests that the regulatory networks involved with developmental processes have been evolutionarily well conserved in both human and mouse lineages.

cis-element | development | noncoding | ZFHX1B | Hirschsprung disease

Freely available online through the PNAS open access option.
Abbreviations: TF, transcription factor; GO, gene ontology; EPD, Eukaryotic Promoter Database.
†Present address: Information Technology Center, Kagawa University, 1750-1 Ikenobe, Miki-cho, Kita-gun, Kagawa Prefect 761-0793, Japan.
§To whom correspondence should be addressed at the * address. E-mail: tgojobor@genes.nig.ac.jp.
© 2004 by The National Academy of Sciences of the USA
17156–17161 | PNAS | December 7, 2004 | vol. 101 | no. 49
www.pnas.org/cgi/doi/10.1073/pnas.0407670101

Cross-species genome-wide comparison of noncoding orthologous sequences has been demonstrated to be effective for identifying regulatory sequences for Saccharomyces species (1, 2). For higher eukaryotes, orthologous noncoding sequence comparison has been successfully applied to human and mouse sequences (3–5). These results can contribute to the elucidation of the architecture of regulatory networks.

However, because comprehensive knowledge regarding regulatory networks remains to be elucidated, particularly for higher eukaryotes, direct comparison of their regulatory networks is still difficult. Thus, in the present study, with the aim of elucidating the features of regulatory networks that are characteristic of higher eukaryotes, we systematically estimated the degree of the sequence conservation upstream of human–mouse orthologous genes and categorized the gene function according to the Gene Ontology (GO) Consortium (6).

In higher eukaryotes, the regulatory sequences are located in a wider range outside the coding sequences than in yeast. However, to date, 85% of mouse regulatory sequences have been estimated to be located within 2 kb from the promoter, and most promoters reside immediately upstream of the transcription start site (7), both of which play major roles in gene expression control. Thus, between humans and mice, we can expect that the degree of orthologous upstream sequence conservation in the kilobase range could reflect the evolutionary conservation of features related to gene expression control.

In the present study, we examined the upstream sequences of 3,750 human–mouse orthologous gene pairs and constructed a global alignment of the 8-kb upstream sequences for each of the orthologous gene pairs based on their local alignments. To identify human–mouse orthologous genes, we focused on genes that have been assigned an identical official gene symbol (www.gene.ucl.ac.uk/nomenclature/) between humans and mice, because these kinds of genes are annotated not only on the basis of sequence homology but also on evidence from functional and physiological experiments.

We report here that the genes with high upstream conservation are predominantly transcription factor (TF) genes. Furthermore, we show that the developmental process-related TF genes have significantly higher conservation of the upstream sequences than other TF genes.

Materials and Methods

Orthologue Identification and Upstream Sequence Collection. We searched the human and mouse Reference Sequence (RefSeq) (8) annotations from the National Center for Biotechnology Information (ftp://ftp.ncbi.nih.gov/refseq/LocusLink/LLtmpl) for genes whose human and mouse official gene symbols were identical (9,207 gene pairs, as of February 2, 2004). Next, we selected only the nuclear protein-coding genes (7,408 genes). For these genes, we collected the corresponding genomic sequences, i.e., the RefSeq contig entries, according to the contig feature descriptions in the RefSeq annotations. We surveyed the entire annotation of every contig to check whether there were any genic sequences within the 9-kb stretch upstream of the first coding site for each of the genes collected. Then, we excised the 8-kb genomic sequence immediately upstream of the coding start site for every gene that did not contain any descriptions of genic regions within its 9-kb upstream stretch. We set a 1-kb margin to decrease the frequency of cases in which the excised 8-kb sequences overlapped with promoter regions or 3′ regulatory sequences of adjacent genes. For genes having alternative coding start sites, we always used the most 5′ coding start site according to the annotation.

Genomic Global Alignment. Initially, we made local nucleotide alignments of every human–mouse orthologue pair of genomic sequences by using BLAST 2 sequences (9). To appropriately align the short conserved regulatory sequences in the noncoding regions, we reduced the mismatch penalty to −2 and shortened the word size to 7. We processed the resultant set of alignments by using the program REALIGNER, which we developed to obtain genomic global alignments based on the results from BLAST 2 sequences. First, we selected the local alignments by using the following set of criteria: hit length >7 bps, identity of 70% or higher, and hit strand in the same direction. For these local alignments, REALIGNER performed the following two steps: (i) when two local alignments overlapped, the program removed the alignment with the lower bit score and retained the other and (ii) when two local alignments were not syntenic, the alignment with the lower bit score was removed and the other was retained. REALIGNER performed steps i and ii in decreasing order of the bit score for each local alignment of each sequence pair. In these steps, if the bit scores to be compared were equal, then the longer hit-stretch and then the more downstream alignments had the higher priority. Finally, the numbers of identical sites for every local alignment were summed for each orthologue pair.

Simulation Analysis. We generated 10,000 pairs of 8-kb random nucleotide sequences. Each pair of 8-kb sequences was generated so that its frequencies of A, T, G, C, and N became proportional to the observed average counts of all of the examined human and mouse 8-kb sequences, respectively. These 10,000 random sequence pairs were then processed in the same way as described above.

Validation of Genomic Alignment Procedures by the Eukaryotic Promoter Database (EPD) Data Set. We downloaded the EPD (10) data set (Release 771, February 2004) from ftp://ftp.epd.unil.ch/pub/databases/epd/771.

We finally confirmed 347 developmental process-related genes (Ndev).

Statistical Analysis. We counted the total number of genes that were assigned any of the GO terms in the categories of molecular function or biological process for either the mouse or human annotation (Ntotal = 2,883). Assuming a binomial distribution of ptf = Ntf/Ntotal, we calculated the cumulative probability, p, of observing T or more TF genes in the top n genes as follows, unless specified otherwise:

$$p = \sum_{i=T}^{n} \binom{n}{i}\, p_{tf}^{\,i}\,(1 - p_{tf})^{\,n-i}.$$

Retrieval of SNP Information for the ZFHX1B Gene. We searched the RefSeq contig annotations for every description of variation linked to the SNP Database (11) (www.ncbi.nlm.nih.gov/SNP) within the range of the 8-kb upstream stretch of the ZFHX1B gene. We also confirmed the variation information according to the H-Invitational Database (12) (www.h-invitational.jp).

Results

Alignment of the 8-kb Upstream Sequences of the Human–Mouse Orthologue Pairs. We identified 9,207 genes whose human and mouse official gene symbols were identical. We then selected only the nuclear protein-coding genes, which amounted to 7,408 genes. We regarded these gene pairs as orthologues. Among these, we were able to collect 9-kb genomic upstream sequences without any described genic regions for 3,750 orthologous gene pairs. Then, we excised 8-kb stretches upstream of the translation start sites. We set a 1-kb margin to decrease the frequency of cases in which the excised 8-kb sequences overlapped with promoter regions or 3′ regulatory sequences of adjacent genes. For all of the 3,750 pairs of human and mouse genes, the accessions of the contig entries used are shown in Table 3, which is published as supporting information on the PNAS web site, together with the positions of the excised sequences. Finally, we were able to make a global pairwise

Fig. 1. Bar graph showing the frequencies of the 3,750 human–mouse orthologue pairs relative to the number of identical sites along the 8-kb upstream sequences. The area of each bar corresponds to each relative frequency. The line graph shows the relative frequency of the result of the simulation study in which 10,000 randomly generated 8-kb sequence pairs were processed in the same way as the human–mouse orthologue alignments.
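The selection procedure described under Genomic Global Alignment (filter the local alignments, then resolve overlapping and non-syntenic hits in decreasing order of bit score, then sum the identical sites) can be illustrated with a minimal Python sketch. This is not the REALIGNER program itself, whose implementation is not given in the text. The `LocalHit` record, its field names, and the helper functions are hypothetical. The sketch also assumes that "not syntenic" means the two hits appear in opposite orders on the two sequences, that coordinates run 5′ to 3′ so that a larger start position is more downstream, and that identity can be approximated as identical sites divided by the hit length on the human sequence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalHit:
    """One local alignment (hypothetical record, half-open coordinates)."""
    h_start: int          # start on the human sequence (inclusive)
    h_end: int            # end on the human sequence (exclusive)
    m_start: int          # start on the mouse sequence (inclusive)
    m_end: int            # end on the mouse sequence (exclusive)
    identical: int        # number of identical sites in the hit
    bit_score: float
    same_strand: bool = True

    @property
    def length(self) -> int:
        return self.h_end - self.h_start

    @property
    def identity(self) -> float:
        # Assumption: identity approximated on the human span.
        return self.identical / self.length


def overlaps(a: LocalHit, b: LocalHit) -> bool:
    """True if the hits overlap on either the human or the mouse sequence."""
    on_human = a.h_start < b.h_end and b.h_start < a.h_end
    on_mouse = a.m_start < b.m_end and b.m_start < a.m_end
    return on_human or on_mouse


def collinear(a: LocalHit, b: LocalHit) -> bool:
    """Assumed synteny test: same relative order on both sequences."""
    return (a.h_start < b.h_start) == (a.m_start < b.m_start)


def select_hits(hits: list[LocalHit]) -> list[LocalHit]:
    # Selection criteria from the text: length > 7, identity >= 70%, same strand.
    candidates = [h for h in hits
                  if h.same_strand and h.length > 7 and h.identity >= 0.70]
    # Priority: higher bit score, then longer hit, then more downstream.
    candidates.sort(key=lambda h: (-h.bit_score, -h.length, -h.h_start))
    kept: list[LocalHit] = []
    for h in candidates:
        # A hit is retained only if it neither overlaps nor breaks the
        # order of any higher-priority hit that was already retained.
        if all(not overlaps(h, k) and collinear(h, k) for k in kept):
            kept.append(h)
    return kept


def identical_sites(hits: list[LocalHit]) -> int:
    """Sum of identical sites over the retained local alignments."""
    return sum(h.identical for h in select_hits(hits))
```

Processing hits greedily in priority order means that each conflict is resolved in favor of the higher-scoring alignment, which matches the pairwise rule stated in steps i and ii under the assumptions above.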
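The random sequence pairs described under Simulation Analysis could be generated along the following lines. This is a minimal sketch only: the function names are hypothetical, the generator and seed are arbitrary choices, and the base frequencies must be supplied by the caller because the observed average counts are not reported in this section. It assumes that the human member of each pair follows the human frequencies and the mouse member follows the mouse frequencies.

```python
import random


def random_sequence(length: int, base_freqs: dict[str, float],
                    rng: random.Random) -> str:
    """Draw `length` symbols independently, weighted by `base_freqs`."""
    bases = list(base_freqs)
    weights = [base_freqs[b] for b in bases]
    return "".join(rng.choices(bases, weights=weights, k=length))


def random_pairs(n_pairs: int,
                 human_freqs: dict[str, float],
                 mouse_freqs: dict[str, float],
                 length: int = 8000,
                 seed: int = 0) -> list[tuple[str, str]]:
    """Generate (human-like, mouse-like) random sequence pairs.

    The weights only need to be proportional to the observed counts;
    they do not have to sum to 1.
    """
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    return [(random_sequence(length, human_freqs, rng),
             random_sequence(length, mouse_freqs, rng))
            for _ in range(n_pairs)]
```

Calling `random_pairs(10_000, human_freqs, mouse_freqs)` with frequency tables measured from the real 8-kb sequences would give a set of pairs that can be passed through the same alignment steps as the orthologue pairs.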
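The cumulative probability defined under Statistical Analysis is the upper tail of a binomial distribution, and it can be evaluated directly from the sum. The sketch below is illustrative: the function name is hypothetical, and any value of Ntf used with it would have to come from the GO annotation counts, which are not given in this section.

```python
from math import comb


def binomial_upper_tail(n: int, T: int, p_tf: float) -> float:
    """Probability of observing T or more TF genes among the top n genes,
    assuming each gene is a TF gene independently with probability p_tf."""
    if not 0.0 <= p_tf <= 1.0:
        raise ValueError("p_tf must lie in [0, 1]")
    if T <= 0:
        return 1.0   # observing zero or more is certain
    if T > n:
        return 0.0   # cannot observe more TF genes than genes examined
    return sum(comb(n, i) * p_tf ** i * (1.0 - p_tf) ** (n - i)
               for i in range(T, n + 1))
```

With Ntotal = 2,883 from the text, p_tf would be computed as Ntf / 2883 and passed in together with the chosen n and the observed T.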