EVOPRINTER, a Multigenomic Comparative Tool for Rapid Identification of Functionally Important DNA

Ward F. Odenwald*†, Wayne Rasband‡, Alexander Kuzin*, and Thomas Brody*†

*Neural Cell-Fate Determinants Section, National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD 20892; and ‡Office of the Scientific Director, Intramural Research Program, National Institute of Mental Health, National Institutes of Health, Bethesda, MD 20892

Communicated by Marshall Nirenberg, National Institutes of Health, Bethesda, MD, August 10, 2005 (received for review July 5, 2005)

Here, we describe a multigenomic DNA sequence-analysis tool, EVOPRINTER, that facilitates the rapid identification of evolutionarily conserved sequences within the context of a single species. The EVOPRINTER output identifies multispecies-conserved DNA sequences as they exist in a reference DNA. This identification is accomplished by superimposing multiple reference DNA vs. test-genome pairwise BLAT (BLAST-like alignment tool) readouts of the reference DNA to identify conserved nucleotides that are shared by all orthologous DNAs. EVOPRINTER analysis of well characterized genes reveals that most, if not all, of the conserved sequences are essential for gene function. For example, analysis of orthologous genes that are shared by many vertebrates identifies conserved DNA in both protein-encoding sequences and noncoding cis-regulatory regions, including enhancers and mRNA microRNA binding sites. In Drosophila, the combined mutational histories of five or more species afford near-base pair resolution of conserved transcription factor DNA-binding sites, and essential amino acids are revealed by the nucleotide flexibility of their codon-wobble position(s). Conserved small peptide-encoding genes, which had been undetected by conventional gene-prediction algorithms, are identified by the codon-wobble signatures of invariant amino acids. Also, EVOPRINTER allows one to assess the degree of evolutionary divergence between orthologous DNAs by highlighting differences between a selected species and the other test species.

comparative genomics | evolution | gene structure and function

Deciphering the regulatory mechanisms that control coordinate gene expression is a long-standing goal of biology. The comparison of orthologous DNA sequences from multiple vertebrate or invertebrate species promises to identify the cis-regulatory elements that are central to the dynamic interplay between a gene and its transcriptional regulators (1–3). This cross-species comparison, termed phylogenetic footprinting, is based on the hypothesis that functionally important sequences evolve at a significantly slower rate than nonfunctional DNA (1). Phylogenetic footprinting has been used successfully to discover multispecies-conserved sequences (MCSs) that are critical for gene function (reviewed in refs. 2, 4, and 5). An essential first step in this process is the alignment of multiple orthologous DNAs. Multisequence-alignment programs include THREADED BLOCKSET ALIGNER (6), FOOTPRINTER (7), CONREAL (5), and PHYME (8). The multiDNA alignments are accomplished either by simultaneous or sequential pairwise alignments of input DNAs, with alignment gaps introduced to optimize the overall homology comparisons.

Individual genome searches have also been commonly used to initiate MCS searches, and two popular whole-genome search algorithms are BLAST (9) and BLAT (BLAST-like alignment tool) (10). One significant difference between the BLAST and BLAT algorithms is that BLAT keeps an index of a species genome in memory and uses this index to scan linearly through the query sequence, whereas BLAST indexes the query sequence first and then scans linearly along the database. This fundamental difference is the primary reason a BLAT alignment is significantly faster than other whole-genome alignment algorithms (10). The speed of BLAT alignment and the current availability of 13 vertebrate and seven Drosophila species BLAT-formatted genomes (see the Human BLAT Search database, available at http://genome.ucsc.edu/cgi-bin/hgBlat) enables rapid reference-DNA vs. test-genome pairwise homology searches of related or evolutionarily distant species.

Taking advantage of the speed of the BLAT alignment and the availability of multiple BLAT-formatted genomes, we developed a simple multigenomic comparative tool that allows one to rapidly identify MCSs as they appear in a species of interest. The EVOPRINTER algorithm superimposes multiple BLAT readouts of individual reference-DNA vs. test-genome alignments to generate an evolutionary gene print (EvoP) of invariant DNA sequences as they appear in the reference DNA. Unlike most multispecies-alignment programs that display MCSs as consecutive columns of invariant nucleotides interspersed by alignment gaps, the EvoP readout displays only the reference DNA, with no alignment gaps, highlighting a species-centric representation of the conserved sequences. To facilitate the comparative analysis of evolutionary changes between test species, a second algorithm, EVODIFFERENCE (EVODIF), enables one to identify MCSs that are common to all but one of the test genomes.

To demonstrate the efficacy of EVOPRINTER as a phylogenetic-footprinting tool, we show how EvoPs of well characterized genes (one vertebrate and one Drosophila gene) accurately identify DNA sequences that have been shown to be essential for gene function. Also, we describe how EVOPRINTER can be used to identify genes that had not been noticed by conventional gene-prediction methods.

Materials and Methods

EVOPRINTER is a tool for discovering MCSs that are shared among three or more orthologous DNAs. The program uses the reference DNA outputs of BLAT alignments and then identifies the sequences within this DNA that are shared by all species. EVOPRINTER is a JavaScript program that runs on the user's computer. Its algorithm creates an array of strings from the selected BLAT outputs and then looks for conservation of sequence by looping through the strings one letter at a time (outputting a black capital letter only for the reference DNA nucleotides that are aligned in all test species). Nucleotides within the reference DNA that are not shared are represented by lowercase gray letters. The program requires an up-to-date web browser, and JavaScript has to be enabled. There is no arbitrary limit on sequence capacity. For example, a 50-kb EvoP can be generated by splicing together two 25-kb BLAT outputs. The second algorithm, EVODIF, reveals what is different in any one species from the EvoP of all other test species (described below).

The first step in generating an EvoP is the curation of the reference DNA (up to 25 kb per alignment) from the University of California, Santa Cruz Genome Browser database (http://genome.ucsc.edu/cgi-bin/hgGateway), the Ensembl database (available at: www.ensembl.org), or the FlyBase database (http://flybase.net). When copied and pasted into the BLAT engine input window (http://genome.ucsc.edu/cgi-bin/hgBlat), the pairwise alignment is performed between the reference DNA and a selected test species, and the highest-scoring readout alignment is then selected. The readout labeled as "YourSequence" (showing the reference DNA) is then copied and pasted into one of the EVOPRINTER input windows (http://evoprinter.ninds.nih.gov) without removing numbering or spaces. This procedure is repeated with the same reference DNA vs. as many test species as required. EVOPRINTER can also be used to generate a protein EvoP from BLAT alignments of amino acid sequences.

One important feature of the EVOPRINTER program is its ability to generate EvoPs from subsets of the selected BLAT readouts by unchecking the species or groups of species to be excluded. This flexibility is particularly useful when assessing whether the loss of an MCS or group of MCSs in one or more BLAT alignments is caused [...]

[...] DNA), opossum, chicken, and Xenopus tropicalis DNA identifies a dense cluster of MCSs that are distributed throughout the critical tissue-specific regulatory region (Fig. 1). When the more evolutionarily distant X. tropicalis and chicken species are excluded from the analysis, additional MCSs are identified in enhancer-activator sequences flanking the core tissue-specific regulatory region (Fig. 1B.2). EVODIF prints of the individual test species also revealed that the opossum has lost MCSs in the 3′ negative-regulatory element (21) that are present in higher vertebrates (data not shown). Outside of the clustered conserved sequences that were detected in the initial EvoP, no MCSs were identified in the flanking 5′ upstream 3.2-kb and 3′ downstream 5.3-kb regions (Fig. 1B and data not shown). The ability of an EvoP to identify biologically significant DNA within the context of reference DNA in excess of 10 kb demonstrates its usefulness as a phylogenetic-footprinting tool.

Transcription-factor DNA-binding site searches have revealed that many of the MCSs have core DNA-binding motifs for different transcription factors, such as homeodomain, bHLH, or [...]

Freely available online through the PNAS open access option.

Abbreviations: MCS, multispecies-conserved sequence; EVODIF, EVODIFFERENCE; bHLH, basic helix–loop–helix; Kr, Krüppel; Hb, Hunchback.

†To whom correspondence may be addressed.

14700–14705 | PNAS | October 11, 2005 | vol. 102 | no. 41 | www.pnas.org/cgi/doi/10.1073/pnas.0506915102
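
The superimposition step described in Materials and Methods (an array of readout strings scanned one letter at a time) can be illustrated with a short sketch. It is written in Python for brevity and is not the authors' JavaScript code. It assumes that each pasted readout shows the reference DNA with aligned bases in uppercase and unaligned bases in lowercase, and that numbering and spaces are the only other characters present. The function names `clean_readout`, `evoprint`, and `evodiff` are hypothetical, and the `evodiff` rule (uppercase only where every other species is aligned and the selected species is not) is one reading of "common to all but one of the test genomes".

```python
import re


def clean_readout(raw: str) -> str:
    """Strip the numbering and whitespace carried by a pasted readout."""
    return re.sub(r"[\d\s]+", "", raw)


def evoprint(readouts: list[str]) -> str:
    """Superimpose readouts of one reference DNA (assumed case convention).

    A base is printed in uppercase only if it is uppercase (aligned) in
    every readout; all other bases are printed in lowercase.
    """
    seqs = [clean_readout(r) for r in readouts]
    if not seqs:
        raise ValueError("need at least one readout")
    ref = seqs[0].upper()
    if any(s.upper() != ref for s in seqs):
        raise ValueError("readouts must show the same reference DNA")
    return "".join(
        base if all(s[i].isupper() for s in seqs) else base.lower()
        for i, base in enumerate(ref)
    )


def evodiff(readouts: list[str], excluded: int) -> str:
    """Highlight bases aligned in all species except the one at `excluded`."""
    seqs = [clean_readout(r) for r in readouts]
    others = [s for j, s in enumerate(seqs) if j != excluded]
    shared = evoprint(others)  # conserved among the remaining species
    target = seqs[excluded]
    if target.upper() != shared.upper():
        raise ValueError("readouts must show the same reference DNA")
    return "".join(
        b if b.isupper() and target[i].islower() else b.lower()
        for i, b in enumerate(shared)
    )


# Three toy readouts of the same 10-base reference DNA.
readouts = ["ACGTAcgtaC", "ACgTACgtaC", "ACGTAcGtac"]
print(evoprint(readouts))    # ACgTAcgtac
print(evodiff(readouts, 2))  # acgtacgtaC
```

Because every readout is a copy of the same reference DNA, the comparison is position by position with no alignment gaps to reconcile, which is what lets the output stay species-centric. Longer regions could be handled under the same assumptions by concatenating the cleaned readouts of adjacent segments before calling `evoprint`.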
