
EVOPRINTER, a multigenomic comparative tool for rapid identification of functionally important DNA

Ward F. Odenwald*†, Wayne Rasband‡, Alexander Kuzin*, and Thomas Brody*†

*Neural Cell-Fate Determinants Section, National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD 20892; and ‡Office of the Scientific Director, Intramural Research Program, National Institute of Mental Health, National Institutes of Health, Bethesda, MD 20892

Communicated by Marshall Nirenberg, National Institutes of Health, Bethesda, MD, August 10, 2005 (received for review July 5, 2005)

PNAS | October 11, 2005 | vol. 102 | no. 41 | 14700–14705 | www.pnas.org/cgi/doi/10.1073/pnas.0506915102

Here, we describe a multigenomic DNA sequence-analysis tool, EVOPRINTER, that facilitates the rapid identification of evolutionarily conserved sequences within the context of a single species. EVOPRINTER output identifies multispecies-conserved DNA sequences as they exist in a reference DNA. This identification is accomplished by superimposing multiple reference DNA vs. test-genome pairwise BLAT (BLAST-like alignment tool) readouts of the reference DNA to identify conserved nucleotides that are shared by all orthologous DNAs. EVOPRINTER analysis of well characterized genes reveals that most, if not all, of the conserved sequences are essential for gene function. For example, analysis of orthologous genes that are shared by many vertebrates identifies conserved DNA in both protein-encoding sequences and noncoding cis-regulatory regions, including enhancers and mRNA microRNA binding sites. In Drosophila, the combined mutational histories of five or more species afford near-base pair resolution of conserved transcription factor DNA-binding sites, and essential amino acids are revealed by the nucleotide flexibility of their codon-wobble position(s). Conserved small peptide-encoding genes, which had been undetected by conventional gene-prediction algorithms, are identified by the codon-wobble signatures of invariant amino acids. Also, EVOPRINTER allows one to assess the degree of evolutionary divergence between orthologous DNAs by highlighting differences between a selected species and the other test species.

comparative genomics | evolution | gene structure and function

Freely available online through the PNAS open access option.

Abbreviations: MCS, multispecies-conserved sequence; EVODIF, EVODIFFERENCE; bHLH, basic helix–loop–helix; Kr, Krüppel; Hb, Hunchback.

†To whom correspondence may be addressed. E-mail: [email protected] or [email protected].

Deciphering the regulatory mechanisms that control coordinate gene expression is a long-standing goal of biology. The comparison of orthologous DNA sequences from multiple vertebrate or invertebrate species promises to identify the cis-regulatory elements that are central to the dynamic interplay between a gene and its transcriptional regulators (1–3). This cross-species comparison, termed phylogenetic footprinting, is based on the hypothesis that functionally important sequences evolve at a significantly slower rate than nonfunctional DNA (1). Phylogenetic footprinting has been used successfully to discover multispecies-conserved sequences (MCSs) that are critical for gene function (reviewed in refs. 2, 4, and 5). An essential first step in this process is the alignment of multiple orthologous DNAs. Multisequence-alignment programs include THREADED BLOCKSET ALIGNER (6), FOOTPRINTER (7), CONREAL (5), and PHYME (8). The multiDNA alignments are accomplished either by simultaneous or sequential pairwise alignments of input DNAs, with alignment gaps introduced to optimize the overall homology comparisons.

Individual genome searches have also been commonly used to initiate MCS searches, and two popular whole-genome search algorithms are BLAST (9) and BLAT (BLAST-like alignment tool) (10). One significant difference between the BLAST and BLAT algorithms is that BLAT keeps an index of a species genome in memory and uses this index to scan linearly through the query sequence, whereas BLAST indexes the query sequence first and then scans linearly along the database. This fundamental difference is the primary reason a BLAT alignment is significantly faster than other whole-genome alignment algorithms (10). The speed of BLAT alignment and the current availability of 13 vertebrate and seven Drosophila species BLAT-formatted genomes (see the Human BLAT Search database, available at http://genome.ucsc.edu/cgi-bin/hgBlat) enable rapid reference-DNA vs. test-genome pairwise homology searches of related or evolutionarily distant species.

Taking advantage of the speed of the BLAT alignment and the availability of multiple BLAT-formatted genomes, we developed a simple multigenomic comparative tool that allows one to rapidly identify MCSs as they appear in a species of interest. The EVOPRINTER algorithm superimposes multiple BLAT readouts of individual reference-DNA vs. test-genome alignments to generate an evolutionary gene print (EvoP) of invariant DNA sequences as they appear in the reference DNA. Unlike most multispecies-alignment programs that display MCSs as consecutive columns of invariant nucleotides interspersed by alignment gaps, the EvoP readout displays only the reference DNA, with no alignment gaps, highlighting a species-centric representation of the conserved sequences. To facilitate the comparative analysis of evolutionary changes between test species, a second algorithm, EVODIFFERENCE (EVODIF), enables one to identify MCSs that are common to all but one of the test genomes.

To demonstrate the efficacy of EVOPRINTER as a phylogenetic-footprinting tool, we show how EvoPs of well characterized genes (one vertebrate and one Drosophila gene) accurately identify DNA sequences that have been shown to be essential for gene function. Also, we describe how EVOPRINTER can be used to identify genes that had not been noticed by conventional gene-prediction methods.

Materials and Methods

EVOPRINTER is a tool for discovering MCSs that are shared among three or more orthologous DNAs. The program uses the reference DNA outputs of BLAT alignments and then identifies the sequences within this DNA that are shared by all species. EVOPRINTER is a JAVASCRIPT program that runs on the user's computer. Its algorithm creates an array of strings from the selected BLAT outputs and then looks for conservation of sequence by looping through the strings one letter at a time (outputting a black capital letter only for the reference DNA nucleotides that are aligned in all test species). Nucleotides within the reference DNA that are not shared are represented by lowercase gray letters. The program requires an up-to-date web browser, and JAVASCRIPT has to be enabled. There is no arbitrary limit on sequence capacity. For example, a 50-kb EvoP can be generated by splicing together two 25-kb BLAT outputs. The second algorithm, EVODIF, reveals what is different in any one species from the EvoP of all other test species (described below).

The first step in generating an EvoP is the curation of the reference DNA (up to 25 kb per alignment) from the University of California, Santa Cruz Genome Browser database (http://genome.ucsc.edu/cgi-bin/hgGateway), the Ensembl database (available at: www.ensembl.org), or the FlyBase database (http://flybase.net). When copied and pasted into the BLAT engine input window (http://genome.ucsc.edu/cgi-bin/hgBlat), the pairwise alignment is performed between the reference DNA and a selected test species, and the highest-scoring readout alignment is then selected. The readout labeled as "YourSequence" (showing the reference DNA) is then copied and pasted into one of the EVOPRINTER input windows (http://evoprinter.ninds.nih.gov) without removing numbering or spaces. This procedure is repeated with the same reference DNA vs. as many test species as required. EVOPRINTER can also be used to generate a protein EvoP from BLAT alignments of protein sequences.

One important feature of the EVOPRINTER program is its ability to generate EvoPs from subsets of the selected BLAT readouts by unchecking the species or groups of species to be excluded. This flexibility is particularly useful when assessing whether the loss of an MCS or group of MCSs in one or more BLAT alignments is caused by (i) small mutational differences; (ii) chromosome rearrangements, including large insertions and/or deletions, resulting in loss of sequence colinearity; (iii) overall sequence divergence being so great that alignment is not achieved for short homologies; or (iv) sequencing gaps in the test genome.

To identify MCSs that are shared by all but one of the test species, deselect all of the test-species readouts that were entered into the EVOPRINTER except for the species in question, and then select the "Highlight Species Differences" button to generate the EVODIF readout. The lowercase red letters are nucleotides that are lost from the final EvoP if that species is included in the comparison. In addition to assessing the degree of evolutionary divergence, the EVODIF is particularly useful for identifying chromosome rearrangements (identified by uninterrupted blocks of lost MCSs). Color formatting of the EvoP and EVODIF readouts can be maintained by dragging the saved HTML output into WORD (Microsoft).

Identification of potential transcription-factor DNA-binding sites was carried out by using MATINSPECTOR (11). MicroRNA binding sites in Drosophila were identified as described (12), and human microRNA binding sites were identified by using the Human miRNA Viewer database (www.cbio.mskcc.org/mirnaviewer) as described (13).

Results and Discussion

EVOPRINTER Analysis of Vertebrate Achaete–Scute Homologue 1 (Ascl1) Genes Identifies DNA Sequences That Are Essential for Its Expression and Function. The basic helix–loop–helix (bHLH) Ascl1 transcription factor has a critical role in establishing neural cell identities in the developing vertebrate embryo (see refs. 14–16 and references therein). Studies on the mammalian Ascl1 gene (Mash1) demonstrate that it is dynamically expressed in many proliferating CNS and peripheral nervous system neural progenitor cells (NPCs) during murine development (17, 18) and it is also tightly regulated in NPCs that give rise to pulmonary neuroendocrine cells (19). Cis-regulatory elements important for Mash1 expression in mice have been identified in the 5′ flanking intragenic DNA and within the 3′ noncoding region of its transcribed sequence (20, 21). Transgenic studies have localized the Mash1 CNS enhancer to a 1,158-bp region located 7 kb 5′ to its transcription start site (boxed sequence in Fig. 1B) (20). Further dissection of this enhancer has revealed that most of the region- and tissue-specific regulatory elements map to an internal 472-bp region (dashed box in Fig. 1B), and elements that modulate expression levels (enhance or reduce) in flanking sequences map both 5′ and 3′ to the 472-bp region (21).

Remarkably, a 15-kb EvoP of this region generated from human, chimpanzee, rhesus monkey, dog, rat, mouse (reference DNA), opossum, chicken, and Xenopus tropicalis DNA identifies a dense cluster of MCSs that are distributed throughout the critical tissue-specific regulatory region (Fig. 1). When the more evolutionarily distant X. tropicalis and chicken species are excluded from the analysis, additional MCSs are identified in enhancer-activator sequences flanking the core tissue-specific regulatory region (Fig. 1B.2). EVODIF prints of the individual test species revealed also that the opossum has lost MCSs in the 3′ negative-regulatory element (21) that are present in higher vertebrates (data not shown). Outside of the clustered conserved sequences that were detected in the initial EvoP, no MCSs were identified in the flanking 5′ upstream 3.2-kb and 3′ downstream 5.3-kb regions (Fig. 1B and data not shown). The ability of an EvoP to identify biologically significant DNA within the context of reference DNA in excess of 10 kb demonstrates its usefulness as a phylogenetic-footprinting tool.

Transcription-factor DNA-binding site searches have revealed that many of the MCSs have core DNA-binding motifs for different transcription factors, such as homeodomain, bHLH, or Zn-finger proteins, and some have multiple interlocking binding sites for different factors (ref. 20 and data not shown). EvoPs of other characterized vertebrate enhancers have identified clustered MCSs within all cis-regulatory elements examined. For example, within the 90-bp tissue-specific region of the murine anterior neuroectoderm OTX2 enhancer (22), the EvoP reveals that 86% of the nucleotides (77 bp) are part of MCSs (data not shown).

Constitutive expression of the human Ascl1 (hash1) gene in lung neoplasms is a feature of one of the most virulent forms of lung cancer, small-cell lung cancer (SCLC) (23, 24). SCLC cell culture studies have demonstrated that hash1 expression in neuroendocrine tumors is controlled in part by a proximal enhancer positioned −234 to −46 from the transcribed sequence and a proximal repressor region located at −308 to −234 (24) (both regions are shown in the hatched box in Fig. 1C). These studies have also indicated that the mammalian homologue of the Drosophila Hairy transcription factor, HES-1 (Hairy Enhancer of Split 1), functions as a direct repressor of hash1 expression by binding to a HES-1 binding site in the proximal promoter region (24). Our EvoP identified a cluster of MCSs within the proximal enhancer/repressor region of the vertebrate ash-1, and two of these MCSs contain HES-1 binding sites (underlined in the dashed box in Fig. 1C). MCSs containing DNA-binding sites for other known transcription factors were also identified. For example, IA-1 (25) and the FAST-1 Smad-interacting protein (26) transcription-factor DNA-binding sites are present in proximal promoter MCSs (Fig. 1C).

In addition to identifying MCSs within the upstream enhancer and proximal promoter, EvoP analysis of the transcribed region revealed multiple MCSs in the 5′ untranslated leader, one of which contains a canonical HES-1 binding site, whereas another harbors a docking site for IA-1 (25) (Fig. 1C, red underline). IA-1 binding sites were also found in proximal and CNS enhancer MCSs (see above). Also suggesting that IA-1 may be a direct regulator of Ascl-1 expression, recent studies (27) have revealed that IA-1 is dynamically expressed in murine CNS neural progenitor cells.

In the Ascl-1 protein-encoding sequence, multiple MCSs are present and most delineate essential amino acid codons as deduced from invariant nucleotides in critical codon positions. A protein EvoP of the different vertebrate Ascl-1 amino acid sequences confirms that many of the conserved nucleotides identified in the genomic EvoP are positioned in invariant codon positions (data not shown).

Within the 3′ UTR of the Ascl1 transcript, the EvoP identifies a dense cluster of MCSs that spans 600 bp of the 1.3-kb trailer (Fig. 1C), and five of the conserved regions (yellow underline in Fig. 1C) harbor potential mRNA binding sites for 13 different

Fig. 1. EVOPRINTER analysis of the vertebrate achaete-scute homolog 1 locus. (A) A linear cartoon of the 15-kb Ascl1 locus used in the EvoP analysis, indicating the approximate locations of sequences shown in B and C (box represents transcribed region, with the red inner box indicating the ORF). (B and C) EvoPs were generated with 15 kb of mouse (B) or human (C) reference DNA that included the Ascl1-transcribed sequence plus 9 kb of upstream and 3 kb of downstream flanking intragenic sequence. We searched the following test genomes: human, chimpanzee, rhesus monkey, dog, rat, mouse, opossum, chicken, and X. tropicalis. Invariant MCSs, shared by all test species, are identified with uppercase black letters. (B.1) An EvoP, using all test species, identifies clustered MCSs within the tissue- and region-specific regulatory region of the murine Mash1 CNS enhancer. Shown is the upper DNA strand of 1.9 kb corresponding to nucleotides −8692 to −6784 5′ to the murine Mash1-transcribed region. The solid-lined box denotes the 1,158-bp CNS enhancer region, and the dashed-lined inner box identifies the 472-bp domain that contains multiple tissue/region-specific regulatory elements (21). (B.2) The MCSs that are gained when X. tropicalis is excluded from the analysis are shown as uppercase red letters, and when both X. tropicalis and chicken genomes are excluded from the EvoP, the additional MCSs are shown as blue lowercase letters. Nonconserved nucleotides are indicated as lowercase gray letters. (C) EvoP analysis of the ash1 proximal promoter region, transcribed sequence, and flanking 3′ intragenic sequence reveals conserved MCSs that contain cis-regulatory and protein-encoding sequences. Shown is 3.9 kb of the human hash1 gene (nucleotides −687 to +3235). The hatched line box denotes a 259-bp region that contains the proximal enhancer and tissue-specific repressor regulatory elements (22). Underlined sequences are HES-1 DNA-binding sites, red-boxed sequences are potential binding sites for IA-1, and a potential FAST-1 binding site is highlighted with red letters (see Results and Discussion). The 5′ UTR of the hash1 transcript is highlighted in light blue, the transcript ORF is shown with red background (the HLH coding sequence is marked with yellow background), and the 3′ untranslated sequence is indicated with a dark blue background. Yellow nucleotides in the 3′ trailer represent potential binding sites for 13 different microRNAs (see Results and Discussion). Note that the 3′ UTR is interrupted by a 359-bp intron (annotation according to the Ensembl sequence database).
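The uppercase black versus lowercase gray convention in this figure follows from the superimposition step described in Materials and Methods. As a rough illustration only, the Python sketch below mimics that step; it is not the authors' JAVASCRIPT code. It assumes that each pasted BLAT "YourSequence" readout shows the same reference DNA with aligned bases in uppercase and unaligned bases in lowercase, and that numbering and spaces can simply be stripped. The function names clean_readout and evoprint are hypothetical.

```python
from typing import List


def clean_readout(raw: str) -> str:
    """Drop numbering and whitespace from a pasted readout (assumed format)."""
    return "".join(ch for ch in raw if ch.isalpha())


def evoprint(readouts: List[str]) -> str:
    """Superimpose readouts of one reference DNA, one readout per test species.

    A base stays uppercase only if it is uppercase (assumed to mean aligned)
    in every readout; otherwise it is written in lowercase.
    """
    seqs = [clean_readout(r) for r in readouts]
    if not seqs:
        raise ValueError("at least one readout is required")
    reference = seqs[0].upper()
    if any(s.upper() != reference for s in seqs):
        raise ValueError("all readouts must show the same reference DNA")
    # Walk the readouts one letter (column) at a time.
    return "".join(
        column[0].upper() if all(c.isupper() for c in column) else column[0].lower()
        for column in zip(*seqs)
    )


if __name__ == "__main__":
    # Three toy readouts of the same 8-nt reference; the third keeps numbering.
    print(evoprint(["ACGTacgt", "ACgtACgt", "1 ACGT ACgt"]))  # ACgtacgt
```

Excluding a species from the analysis, as in panel B.2, corresponds to leaving its readout out of the list, which can only add uppercase positions.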


Fig. 2. EVOPRINTER analysis of the Drosophila Kr gene. The 7.7-kb (upper strand) of a 12-kb genomic EvoP that corresponds to the D. melanogaster reference DNA (nucleotides −4,207 to +3,531) is shown. The EvoP was generated from BLAT readouts of the reference DNA aligned with D. simulans, D. yakuba, D. ananassae, D. pseudoobscura, and D. virilis DNAs. MCSs that are shared by all species are shown as uppercase black nucleotides. Boxed sequences represent the cis-regulatory regions described in Results and Discussion. Underlined MCSs within the CD1/Kr730 box contain known transcription-factor binding sites (35, 36). Underlined sequences in the AD2/NS2 box contain potential HB (TTTTAGT) and PDM1 (ATTTGCAT) DNA-binding sites, respectively. The D. melanogaster Kr transcribed sequence is annotated according to FlyBase as follows: 5′ untranslated leader (light blue), protein-encoding sequence (red; Zn finger domain, yellow), and the 3′ untranslated sequence (dark blue). Note that the protein-encoding sequence is interrupted by a 373-bp intron. The underlined nucleotides in the 3′ untranslated transcribed sequence correspond to E-box bHLH binding sites. EVODIF analysis of the individual test species revealed that the first two nucleotides of the first E-box (red letters) are shared by all tested species except for D. yakuba and D. ananassae.
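Within the protein-encoding sequence annotated in this figure, the text recognizes essential amino acids by their wobble signatures, described as two or more 2-bp MCSs separated by single nonconserved nucleotides. The Python sketch below is one possible way to scan an EvoP string for that pattern. It assumes a plain-text EvoP in which conserved bases are uppercase and nonconserved bases are lowercase; the regular expression and the function name wobble_signatures are illustrative choices, not part of EVOPRINTER.

```python
import re
from typing import List, Tuple

# Two or more uppercase blocks of exactly 2 bp, each pair of blocks separated
# by a single lowercase (nonconserved) base. The lookarounds stop the match
# from starting or ending inside a longer uppercase run.
WOBBLE = re.compile(r"(?<![ACGT])(?:[ACGT]{2}[acgt])+[ACGT]{2}(?![ACGT])")


def wobble_signatures(evop: str) -> List[Tuple[int, str]]:
    """Return (0-based start, matched text) for each candidate signature."""
    return [(m.start(), m.group()) for m in WOBBLE.finditer(evop)]


if __name__ == "__main__":
    # Toy EvoP: a three-codon wobble run followed by a fully conserved block.
    print(wobble_signatures("ttGCaTGcAAgACGTACGTtt"))  # [(2, 'GCaTGcAA')]
```

A hit is only a candidate: codons that are invariant at all three positions form longer uppercase runs and are deliberately not matched by this pattern, and reading frame is not checked.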

human microRNAs (13). In light of studies indicating that the murine Mash1 is under posttranscriptional control, mediated by sequence(s) located 3′ of the protein-coding region (21), it is likely that many or all of these microRNA binding sites are physiologically relevant.

Intragenus EVOPRINTER Analysis of the Drosophila Krüppel (Kr) Gene Loci Identifies Functionally Important DNA. As a second example of the usefulness of EVOPRINTER, we generated an EvoP of the well characterized Drosophila Kr transcription-factor gene. Kr plays multiple roles essential to different phases of Drosophila development. Initially identified as a regulator of thoracic and abdominal segmental identity in the early embryo (ref. 28 and see ref. 29 for review), Kr gene function has been shown to be required for the development of the Malpighian tubule (kidney) (30), muscle (31), and the nervous system (32, 33).

Detailed studies of the cis-regulatory elements that control Kr embryonic expression have identified multiple enhancer regions

Fig. 3. EVOPRINTER identifies a small peptide gene not annotated in the Berkeley Drosophila Genome Project (BDGP) database. EvoP analysis of the intragenic 12.9-kb region between the Drosophila Appl and vnd genes uncovered a small peptide gene conserved in D. melanogaster, D. simulans, D. yakuba, D. ananassae, D. pseudoobscura, D. virilis, and D. mojavensis species. Shown is 1.5 kb of the D. melanogaster reference species (nucleotides −9,154 to −7,670 5′ to the vnd transcribed region). MCSs shared by all species are identified by uppercase black nucleotides. EVODIF analysis of individual species revealed that one 5′ upstream MCS was not conserved in D. mojavensis but is present in all other species (lowercase red nucleotides). Underlined sequence in this MCS represents a consensus Hb DNA-binding motif. A protein EvoP of the encoded 40-aa peptide is also shown. Aligned with the codons, invariant amino acid residues are shown as uppercase black letters, and residues that are different in at least one of the six species tested above are shown as lowercase gray letters.

located upstream of its transcribed sequence (34). The genomic regions that control early blastoderm, muscle precursor, amnioserosa, or CNS expression are shown in Fig. 2. In the early pregastrula embryo, the enhancer regions CD1 and CD2 are required for Kr expression in the central domain of the blastoderm, and the AD1 and AD2 enhancer regions regulate expression in the anterior portion of the late blastoderm (34, 35). During late embryonic development, cis elements that regulate Kr expression in muscle precursor cells and amnioserosa cells have been mapped to CD1 (34). The AD2 region also harbors nervous system-specific (NS2) control elements (34).

Further dissection of the 1,159-bp CD1 region revealed that only its first 730 bp (Kr730, dashed line in Fig. 2) contain most, if not all, of the CD1 cis-regulatory elements (36). Cis elements that regulate Kr expression in muscle precursor cells and CNS ventral midline cells also map to the last 3′ 295 bp of Kr730 (34). These studies also demonstrated that Kr730 contains Bicoid (Bcd), Hunchback (Hb), Knirps (Kni), and Tailless (Tll) transcription factor in vivo responsive elements, and in vitro DNA-binding studies have also demonstrated that each of these transcription factors binds directly to different regions of Kr730 (35, 36).

An EvoP of 12 kb spanning the Kr genomic locus using Drosophila melanogaster DNA as the reference DNA and Drosophila simulans, Drosophila yakuba, Drosophila ananassae, Drosophila pseudoobscura, and Drosophila virilis as test genomes has identified multiple MCSs within Kr730 but not in the remaining 420 bp that were found not to be essential for CD1 enhancer activity (36) (Fig. 2). Remarkably, all but three of the MCSs in the Kr730 region are contained in or overlap DNase I-protected footprinted sequences of the transcription factors mentioned above (underlined MCSs in Kr730, Fig. 2). For example, the 5′- and 3′-most Kr730 MCSs identified by the EvoP are overlapping Bicoid (Bcd)/Hb and Knirps (Kni)/Bcd DNA-binding sites, respectively (35, 36). Analysis of the MCSs positioned within the CD2/AD1 and AD2/NS2 regions also identifies multiple potential DNA-binding sites for homeodomain, Hb, Bcd, and POU domain transcription factors. For example, the 5′-most MCS in the AD2/NS2 region contains a consensus Hb docking site (TTTTATG), and the third 5′-most MCS in this region contains a canonical POU domain DNA-binding Octamer motif (ATTTGCAT) (Hb and POU binding sites, underlined in Fig. 2). Interestingly, CNS cis-regulatory elements map to this region (34), and studies have shown that Kr neuroblast expression is preceded by expression of Hb, which is a known repressor of Kr, and temporally followed by the Octamer-binding POU domain transcription factors Pdm-1 and Pdm-2 (ref. 33 and see ref. 37 for review). EVOPRINTER analysis of other characterized Drosophila genes has identified MCSs within all examined enhancer regions (10 genes and 20 enhancers; data not shown).

Within the Kr transcribed sequence, EvoP identifies multiple clusters of MCSs, many of which encode essential amino acids as identified by their wobble signatures (identified by two or more 2-bp MCSs separated by single nonconserved nucleotides; Fig. 2). The region that encodes the five consecutive Zn-fingers spanning 146 aa (highlighted in yellow in Fig. 2) is especially prominent. Excluding four nonwobble methionine (ATG) codons in the Zn-finger domain, the genomic EvoP reveals that only 39 of the remaining 142 codons are invariant in all species for all three codon positions. Of these 39 codons, 28 of the encoded amino acids have restricted wobble, allowing for only two different nucleotide substitutions in the third codon position. Although the genomic EvoP found conserved amino acids within the Zn-finger region and in the immediate flanking domains, two additional conserved protein domains were missed either partially or completely [the N-terminal transrepressor1–transactivator1 region (38, 39) and the C-terminal C64 repressor domain (40, 41), respectively (Fig. 2)]. However, when D. virilis and D. pseudoobscura are excluded from the EvoP, both the N- and C-terminal encoding domains are revealed by invariant amino acid wobble signatures (data not shown).

In the 3′ untranslated sequence of the Kr transcribed region, the EvoP identified a single MCS (Fig. 2). A genomewide search for 3′ UTR microRNA binding sites (12) identified a potential miR-34 mRNA binding site that overlaps the first seven nucleotides of this MCS (data not shown). Interestingly, this MCS also contains a bHLH E-box consensus DNA-binding site (CAATTG), and when D. yakuba and D. ananassae are excluded from the EvoP, the MCS includes two additional 5′ nucleotides and now contains a second E-box (CAGCTG) (both E-boxes are underlined, Fig. 2). Additional MCSs were detected 3′ to the transcribed region (Fig. 2). Although a recent study (42) did not detect any posttranslational regulation of Kr mRNA in the embryo, and analysis of the 3′ downstream intragenic region did not identify any additional embryonic cis-regulatory elements (34), the possibility remains that some or all of these MCSs may have a role in controlling larval or adult Kr expression.

EVOPRINTER Uncovers a Small Peptide Gene Not Identified in the Current FlyBase Annotation of the Drosophila Genome. The EvoP has the potential to discover small protein-encoding genes that had been previously unannotated by conventional gene-prediction methods. For example, EvoP exploration of the 12.9-kb intragenic region between the Drosophila beta amyloid protein precursor-like gene (43) and the ventral nervous system defective (vnd) gene (44) has identified a cluster of MCSs that were invariant in the D.

melanogaster, D. simulans, D. yakuba, D. ananassae, D. pseudoobscura, D. virilis, and Drosophila mojavensis species. Positioned 8.5 kb upstream of the vnd transcribed sequence, portions of the MCS cluster possessed all of the hallmarks of an ORF that encodes short runs of conserved amino acids (multiple 2-bp MCSs separated by single nonconserved nucleotides) (Fig. 3). Further analysis of this region revealed that most, but not all, of the MCSs are part of an ORF that codes for a 40-aa peptide in all species. As indicated by the genomic EvoP, a protein EvoP of the predicted amino acid sequence (shown in Fig. 3) reveals that all but four of the residues are invariant in the seven species. The genomic EvoP also revealed that the translation stop codon in one or more of the species had diverged. Subsequent analysis of the different test species' BLAT readouts revealed that both D. virilis and D. mojavensis use TGA as their termination codon, whereas the others use TAA as the stop codon (data not shown).

Although the conserved ORF was not identified in the recent FlyBase genome annotation release 3.1, a GenBank BLAST homology search using the predicted protein sequence revealed that the Heidelberg Prediction, Heidelberg Collection (HDC) had identified the ORF as the HDC16822 gene (45). The presence of HDC16822 was initially predicted by using a lower-stringency, ab initio gene-prediction algorithm (FGENESH; ref. 46) and then confirmed by whole-transcriptome microarray analysis (41). The genomic EvoP analysis also revealed the presence of additional upstream MCSs that may harbor HDC16822 cis-regulatory elements (Fig. 3). For example, core homeodomain DNA-binding motifs (ATTA) exist in four of these MCSs. Interestingly, EVODIF analysis of the D. mojavensis species revealed that an additional upstream 5′ MCS is conserved in all species except for D. mojavensis (red lowercase nucleotides in Fig. 3). The fact that this sequence contains a consensus DNA-binding motif for Hb (TTTTATG) suggests that Hb may have a role in the regulation of HDC16822 expression in at least six of the seven species.

Summary. We have developed a simple, yet effective, comparative genomics tool for identifying MCSs shared among related DNAs. Generated from multiple pairwise BLAT alignments of a reference DNA to different test genomes, the EvoP presents an ordered, uninterrupted representation of the evolutionarily resilient sequences within the reference DNA. By superimposing the different species' evolutionary histories, the combined mutagenic force reveals DNA sequences that are essential for gene expression and function. Also, the EVODIF algorithm reveals the degree of molecular divergence between species by identifying individual species differences to the EvoP. When compared with other multispecies-alignment tools, the two principal advantages of EVOPRINTER are its speed (derived from the speed of a BLAT alignment) and the fact that only a single curated genomic sequence is required to initiate the analysis of orthologous DNAs from multiple species. Based on the success of the EVOPRINTER identification of MCSs within known vertebrate and Drosophila cis-regulatory elements, we believe that this tool could be of great use to understanding gene regulation in all animals.

We thank L. Elnitski, J. Kassis, S. Landis, M. Muenke, H. Nash, and A. Raldow for helpful discussions; L. Elnitski and H. Nash for critically reading the manuscript; and J. Brody for help with the EVOPRINTER web site construction and editorial assistance. This work was supported by the National Institutes of Health National Institute of Neurological Disorders and Stroke and National Institute of Mental Health Intramural Research Program.


Odenwald et al. PNAS | October 11, 2005 | vol. 102 | no. 41 | 14705