The Phylogenetic Handbook: a Practical Approach to Phylogenetic Analysis and Hypothesis Testing, Philippe Lemey, Marco Salemi, and Anne-Mieke Vandamme (Eds.)

Sequence databases and database searching

THEORY Guy Bottu

2.1 Introduction Phylogenetic analyses are often based on sequence data accumulated by many investigators. Faced with a rapid increase in the number of available sequences, it is not possible to rely on the printed literature; thus, scientists had to turn to digitalized databases. Databases are essential in current bioinformatic research: they serve as information storage and retrieval locations; modern databases come loaded with powerful query tools and are cross-referenced to other databases. In addition to sequences and search tools, databases also contain a considerable amount of accompanying information, the so-called annotation,e.g.fromwhich organism and cell type a sequence was obtained, how it was sequenced, what properties are already known, etc. In this chapter, we will provide an overview of the most important publicly available sequence databases and explain how to search them. A list of the database URLs discussed in this section is provided in Box 2.1. To search sequence databases, there are basically three different strategies. – To easily retrieve a known sequence, you can rely on unique sequence identiﬁers. – Tocollect a comprehensive set of sequences that share a taxonomic origin or a known property, the annotation can be searched by keyword. – To ﬁnd the most complete set of homologous sequences a search by similarity of a selected query sequence against a sequence database can be performed using tools like BLAST or FASTA.

The Phylogenetic Handbook: a Practical Approach to Phylogenetic Analysis and Hypothesis Testing, Philippe Lemey, Marco Salemi, and Anne-Mieke Vandamme (eds.). Published by Cambridge University Press. C Cambridge University Press 2009. 33 34 Guy Bottu

Box 2.1 URLs of the major sequence databases and database search tools

ACNUC: http://pbil.univ-lyon1.fr/databases/acnuc/acnuc.html BioXL/H: http://www.biocceleration.com/BioXLH-technical.html BLAST: http://www.ncbi.nlm.nih.gov/blast/ DDBJ and DAD: http://www.ddbj.nig.ac.jp/ EMBL: http://www.ebi.ac.uk/embl/ EMBL Sequence Version Archive: http://www.ebi.ac.uk/cgi-bin/sva/sva.pl Ensembl: http://www.ensembl.org/ Entrez: http://www.ncbi.nlm.nih.gov/Entrez/ fastA: http://fasta.bioch.virginia.edu/fasta/ GenBank: http://www.ncbi.nlm.nih.gov/Genbank/ Gene Ontology: http://www.geneontology.org/ HAMAP: http://www.expasy.org/sprot/hamap/ HCV database: http://hcv.lanl.gov/ HIV database: http://hiv-web.lanl.gov/ HOGENOM: http://pbil.univ-lyon1.fr/databases/hogenom.html HOVERGEN: http://pbil.univ-lyon1.fr/databases/hovergen.html IMGT/HLA: http://www.ebi.ac.uk/imgt/hla/ IMGT/LIGM: http://imgt.cines.fr/ MPsrch, Scan-PS, WU-BLAST and fastA at EBI: http://www.ebi.ac.uk/Tools/ similarity.html MRS: http://mrs.cmbi.ru.nl/mrs-3/ NCBI Map Viewer: http://www.ncbi.nlm.nih.gov/mapview/ ORALGEN: http://www.oralgen.lanl.gov/ PDB: http://www.rcsb.org/ PRF/SEQDB: http://www.prf.or.jp/ RefSeq: http://www.ncbi.nlm.nih.gov/RefSeq/ Sequin: http://www.ncbi.nlm.nih.gov/Sequin/ SRS: http://www.biowisdom.com/navigation/srs/srs SRS server of EBI: http://srs.ebi.ac.uk/ SRS list of public servers: http://downloads.biowisdomsrs.com/publicsrs.html Taxonomy: http://www.ncbi.nlm.nih.gov/Taxonomy/ TIGR: http://www.tigr.org/ UCSC Genome Browser: http://genome.ucsc.edu/ UniGene: http://www.ncbi.nlm.nih.gov/UniGene/ UniProt at EMBL: http://www.ebi.ac.uk/uniprot/ UniProt at SIB: http://www.expasy.uniprot.org/ VA S T : http://www.ncbi.nlm.nih.gov/Structure/VAST/vastsearch.html WU-BLAST: http://blast.wustl.edu/ 35 Sequence databases and database searching: theory

2.2 Sequence databases

2.2.1 General nucleic acid sequence databases There are parallel efforts in Europe, USA, and Japan to maintain public databases with all published nucleic acid sequences:

EMBL (European Molecular Biology Laboratory) database, maintained at the EMBL- EBI (European Bioinformatics Institute, Hinxton, England, UK). GenBank, maintained at the NCBI (National Center for Biotechnology Information, Bethesda, Maryland, USA). DDBJ (DNA Data Bank of Japan), maintained at the NIG/CIB (National Institute of Genetics, Center for Information Biology, Mishima, Japan).

In the early 1980s the database curators scanned the printed literature for novel sequences, but today the sequences are exclusively and directly submitted by the authors through World Wide Websubmission tools (or by email after preparation of the data using the Sequin software). There is an agreement between the curators of the three major databases to cross-submit the sequences to each other. The databases contain RNA as well as DNA sequences but, by convention, a sequence is always written as DNA, that is with Ts and not with Us. Note that often, but not always, sequence analysis software treats U and T without making a distinction. Modified bases are replaced by their “parent” base ACGT but the modification is mentioned in the annotation. The databases have the notion of release. At regular intervals (2 months for GenBank, 3 months for the two others) the database is “frozen” in its current state to make a “release.” The sequences that are submitted since the last release are available as “daily updates.” On June 15, 2007, GenBank release 160.0 contained more than 77 248 690 945 nucleotides in 73 078 143 sequences. Over the years, the sequence databases have grown at an enormous rate (Fig. 2.1). Each sequence has a series of unique identifiers (see Fig. 2.2), which allow easy retrieval:

Entry name, locus name or identifier (ID) The ID was originally designed to be mnemonic (e.g. ECARGS for the ArgS gene of E. coli) but, facing the rapid accumulation of anonymous fragments, many sequences now have an ID that is simply identical to their AC. The ID can change from release to release (in the face of new information of what the sequence is) and can be different in the three databases. Note that the EMBL has decided to drop the use of IDs from June 2006 on, but GenBank and DDBJ still use them. 36 Guy Bottu

1x1011

1x1010

1x109 nucleotides 1x108

1x107

6 1x10 entries 1x105

1x104

1x103

1985 1990 1995 2000 2005

Fig. 2.1 Growth (on a logarithmic scale) of the number of nucleotides and entries in the public EMBL database (non-redundant) from June 1982 to September 2007.

LOCUS HUMBMYH7 28452 bp DNA linear PRI 30-OCT-2002 DEFINITION Homo sapiens beta-myosin heavy chain (MYH7) gene, complete cds. ACCESSION M57965 M30603 M30604 M30605 M57747 M57748 M57749 VERSION M57965.2 GI:24429600

Fig. 2.2 First lines of a GenBank entry. Note the different identifiers: the entry name HUMBMYH7, the primary accession number M57965 (followed by 6 secondary accession numbers), the version number M57965.2 and the GI number.

Accession number (AC) Contrary to the ID, the AC remains constant between releases and there exists an agreement between the managers of the three databases to give the same AC to the same sequence. Besides its primary accession number, a sequence can have secondary accession numbers; the reason is that, when several sequences are merged into one, the new sequence gets a new AC but inherits all the old ACs. Because of its stable and universal nature the AC is thus the most useful for retrieving a sequence.

Version number The version number is made from the AC and a number that is incremented each time the sequence is modiﬁed. The version number is thus useful if it is necessary to ﬁnd back exactly the sequence that was used for a particular study. To allow the retrieval of “old sequences” the EBI has set up a “Sequence Version Archive,” which 37 Sequence databases and database searching: theory

you can search by SV or by AC (to retrieve the last entry that had that primary accession number). Note, however, that the version number was introduced in the beginning of 1999 and does not cover the period before.

GenInfo number (GenBank only) Each sequence treated by the NCBI is given a GI number, which is unique over all NCBI databases.

In addition to these identifiers, each coding sequence (CDS) identified in a database entry gets a protein identifier (protein id), which is shared by the three databases and followed by a version number; in GenBank each CDS has also its own GI number. The databases can be downloaded by anonymous ftp and can also be searched on-line. GenBank can be searched as part of the Entrez Nucleotide table. EMBL andDDBJareavailableintheSRSserversofEBIandNIGrespectively(see below). In some interfaces, the Expressed Sequence Tags (ESTs) and Genome Survey Sequences (GSSs) sections of the databases are presented in separate sections because they make up huge collections of sequences with a high error rate (∼1%) and often have a poor annotation. ESTs are basically mRNA fragments, obtained by extracting RNA followed by reverse transcription and single-pass automated DNA sequencing. GSSs are single-pass fragments of genomic sequence.

The three databases have two “annexes”:

Whole Genome Shotgun (WGS) sequences Sequences from genome sequencing projects that are being conducted by the whole genome shotgun method. They do not have a stable accession number. At regular time intervals, all the WGS sequences for one organism are removed from the database and replaced by a new set.

Third Party Annotations (TPA) Sequences with improved documentation and/or sequences constructed by assem- bling overlapping entries, submitted by a person who is not the original author of the sequence(s).

2.2.2 General protein sequence databases Proteins can be sequenced using the classic Edman degradation method, using modern methods based on peptide mass spectrometry, or their sequence can be 38 Guy Bottu

deduced from the 3-D structure, as determined by X-ray crystallography or NMR. The bulk of the data in the protein databases is, however, obtained by sequencing and translating coding DNA/RNA. In this respect, it is important to be aware that the databases can, and do, contain errors (translations of open reading frames that do not correspond to real proteins, frame-shifting sequencing errors, erroneously assigned intron–exon boundaries, etc.). Contrary to the situation for nucleic acids, which are stored in three nearly identical databases, proteins are stored in databases that differ considerably in content and quality. The main source of information is the UniProt knowledgebase, which is meant to contain all protein sequences that have ever been published, with exclusion, however, of minor variants of the same sequence, short fragments, and “dubious” sequences. It contains a collection of translations of open reading frames extracted from the annotation of the EMBL sequences and also submissions from authors who have performed “real protein sequencing”. The sequences in UniProt are accompanied by an excellent annotation, including cross-references to many other databases and a tentative assignment of motifs, based on automated database searches, which contributes significantly to the value of this database. UniProt is composed of two parts, UniProt/SwissProt (maintained by the team of Prof. Amos Bairoch at the University of Geneva, Switzerland, in collaboration with the EMBL-EBI) and UniProt/TrEMBL (maintained at the EMBL-EBI). TrEMBL is a computer-generated database with little human interference; SwissProt is curated by human annotators, who manually add information extracted from the literature. Currently, EMBL translations always enter TrEMBL first and get an accession number. After their annotation has been verified and enriched by the curators, they are removed from TrEMBL and enter SwissProt while keeping their AC; author submissions directly go to SwissProt. SwissProt sequences have an ID of type XXXXX YYYYY, where XXXXX stands for the nature of the protein and YYYYY is a unique organism identifier (e.g. TAP DROME for the target of Poxn protein of D. melanogaster), TrEMBL sequences have a provisional ID composed of the AC and the organism identifier. For the purpose of Blast similarity searches, EBI also provides the following collections:

UniProt splice variants, useful in case the query sequence matches only a “variant” and not the “representative” sequence present in SwissProt. UniRef100, a subset from UniProt, chosen so that no sequence is identical to or is a subfragment of another sequence; useful for faster searching. UniRef90 and UniRef50, still smaller, are made from UniRef100 by excluding sequences shorter than 11 amino acids and by selecting sequences so that no sequence has more than 90% and 50% identity respectively with another (see also Chapter 9). 39 Sequence databases and database searching: theory

In addition to UniProt the following databases are also available:

GenPept: collection of open reading frames extracted from GenBank, maintained at the NCI-ABCC (National Cancer Institute, Advanced Biomedical Computing Center, Frederick, Maryland, USA). DAD: collection of open reading frames extracted from DDBJ, maintained at the NIG/CIB. PRF/SEQDB (Protein Resource Foundation), maintained at the PRF (Osaka, Japan), contains both translations and proteins submitted by authors.

2.2.3 Specialized sequence databases, reference databases, and genome databases In addition to the general databases, there exist a large number (more than 100) of specialized databases. They can be constructed based on the general databases; they can accept author submissions and/or are generated by automated genome annotation pipelines. They offer one or more of the following advantages:

– The data form a well-deﬁned searchable set of sequences. Searching a specialized database instead of a general one might yield the same result, but in less time and less “polluted” by background noise. – The database is often “cleaned,” containing less errors and less redundant entries. – Sometimes the annotation has been standardized, so that you can easily ﬁnd all the sequences you want without having to repeat the search using alternative keywords. – The annotation is usually vaster and of better quality than the one found in the general databases. Moreover, some databases have the sequences already clustered into families and/or contain tables with sequence-related data, like genomic maps.

In Box 2.2 we provide a (necessarily incomplete and subjectively selected) list with the most interesting specialized databases.

2.3 Composite databases, database mirroring and search tools

2.3.1 Entrez None of the general databases available is really complete. To make up for this deﬁciency, efforts have been undertaken to make composite databases. The most popular one is the Entrez database maintained by NCBI. Originally, it contained only sequences and abstracts from the medical literature, but in the course of the years new tables have been added (there are currently more than 30), making it an integrated database of data related to molecular biology. The individual tables are interlinked (Fig. 2.3), so that you can search, for example, for a protein sequence and then, by simply making a link, obtain the nucleic acid sequence of the gene that codes for it, its 3-D structure, the abstract of the reference that describes its sequencing, etc. Box 2.3 provides more detail on a selection of ENTREZ databases. 40 Guy Bottu

Box 2.2 A selection of secondary databases

RefSeq: collection of “reference” sequences, maintained at the NCBI. One can find in RefSeq an entry for each completely sequenced genomic element (chromosome, plasmid or other), as well as genomic sequences from organisms for which a genome sequencing project is underway, and also genomic segments corresponding to complete genes (as referenced in the Gene table of Entrez, see Box 2.3), together with their transcript (mRNA or non-coding RNA) and protein product. Some RefSeq entries are generated fully automatically, others are generated with some amount of human intervention (curated). The nature of a RefSeq Entry can be distinguished by its accession number, e.g. NC * for (curated) complete genomic element or NT * for (automated) intermediate assembly from BAC. Some databases with automatically assembled and annotated eukaryotic genomes allow the user to navigate through genomic maps and extract well-defined portions of chromosome: Ensembl: maintained at the EMBL-EBI, in collaboration with the Wellcome Trust Sanger Institute (Hinxton, England, UK) NCBI Map Viewer: maintained at the NCBI UCSCGenomeBrowser: maintained at the University of California, Santa Cruz (USA) The Institute for Genomic Research, Rockville, Maryland, USA (TIGR): various genomes, mainly of microbes. The TIGR website also contains a collection of hyperlinks to other sites with information about genomes. Databases with clustered sequences: UniGene: collection of sequences grouped by gene, extracted from GenBank (for the moment only for a restricted number of organisms), maintained at the NCBI and accessible as a table of Entrez, see Box 2.3 Homologous Vertebrate Genes (HOVERGEN) database: vertebrate sequences (coding sequences extracted from EMBL and proteins extracted from UniProt), classified by protein family, maintained at the Universite´ Claude Bernard (Lyon, France) Homologous sequences from complete Genomes (HOGENOM) database: sequences from completely sequenced genomes (coding sequences extracted from EMBL and proteins extracted from UniProt), classified by protein family, maintained at the Universite´ Claude Bernard (Lyon, France) High-quality Automated and Manual Annotation of microbial Proteomes (HAMAP) database: protein sequences from completely sequenced prokaryotic genomes (eubacteria, archaebacteria, and also plastids), extracted from UniProt, classified by protein family and accompanied by expert annotation, maintained at the University of Geneva (Switzerland) Databases with sequences related to sexually transmitted diseases, maintained at the LANL (Los Alamos National Laboratory, New Mexico, USA): HIV database: HIV and SIV sequences HCV database: Hepatitis C and related flavivirus sequences ORALGEN: sequences from Herpes simplex virus and other oral pathogens 41 Sequence databases and database searching: theory

Immunogenetics IMGT: collection of sequence databases of immunological interest:

Laboratoire d’ImmunoGen´ etique´ Moleculaire´ (IMGT/LIGM): subset of EMBL with the immunoglobulin and T-cell receptor genes, maintained by the group of Professor Marie Paul Lefranc at the Laboratoire d’ImmunoGen´ etique,´ University of Montpellier (France). The documentation is enriched in three levels (the keywords are standardized and then the feature list is completed, ﬁrst automatically and later by experts). Human histocompatibility locus A (IMGT/HLA): contains genes for the human HLA class I and II antigens; it is maintained at the Anthony Nolan Research Institute (London, UK) and the sequences are named and documented according to the WHO HLA Nomenclature. It accepts direct submissions by authors.

Protein Data Bank (PDB): 3-D structures, as determined by X-ray crystallography or NMR, of proteins, nucleic acids, and their complexes, maintained by the Research Col- laboratory for Structural Bioinformatics. Some sites offer a search against the protein or nucleic acid sequences corresponding to the PDB entries.

Fig. 2.3 Relationships among the different Entrez databases. 42 Guy Bottu

Box 2.3 A selection of ENTREZ databases

Nucleotide: A complete and non-redundant (“nr”) nucleic acid sequence database. It contains, in that order, RefSeq, GenBank, EMBL, DDBJ and sequences from PDB. Signiﬁcant effort has been made to remove duplicates. The NCBI also offers a BLAST search of your sequence against this “nr” database. Furthermore, the NCBI performs at regular intervals a BLAST search of each sequence against the complete database, so that Entrez can provide a “Related Sequences” link that allows retrieving highly similar sequences. Protein: Idem for protein sequences, composed of RefSeq, translations from open reading frames found in GenBank, EMBL and DDBJ, sequences from the PDB, UniProt/SwissProt, PIR and PRF. UniGene: Collection of GenBank sequences grouped by gene (for the moment only for a restricted number of organisms), as already mentioned above Box 2.2 about specialized databases. Structure: Molecular Modeling Database (MMDB): 3-D structures of proteins, nucleic acids, and their complexes, derived from the PDB database, but in improved format. The NCBI has a tool for comparing 3-D structures, Vector Alignment Search Tool (VAST); they offer a search of a query protein structure against MMDB, as well as inside MMDB information about very similar structures. NCBI also provides a specialized tool for viewing the structures, Cn3D, which can be installed as a helper application (“plug-in”) in a Web browser. Gene: Information about genetic loci (including the alternative names of the gene, the phenotype, the position on the genetic map and cross-references to other databases), corresponding to the organisms and sequences in RefSeq. PopSet: Database of multiple sequence alignments of DNA sequences of different populations or species, submitted as such to GenBank and used for phylogenetic and population studies. PubMed: MEDLINE is a database with abstracts from the Medical Literature (starting from 1966), maintained at the National Library of Medicine (Bethesda, Mary- land, USA). PubMed comprises MEDLINE plus not yet annotated references and references to non-life-science articles published in MEDLINE journals. The WWW interface from PubMed provides hyperlinks to on-line versions of some journals. Online Mendelian Inheritance in Man (OMIM): A collection of really neat literature reviews. There is one entry for each identiﬁed human gene and for each trait shown to be hereditary in man. OMIM is maintained at the Johns Hopkins University School of Medicine (Baltimore, Maryland, USA) and began as the personal work of Dr. Victor McKusick, but in the meantime a whole team of experts has taken over the work. Taxonomy: Contains an entry for each species, subspecies or higher taxon for which there is at least one sequence in GenBank/EMBL/DDBJ. The entry has a standard 43 Sequence databases and database searching: theory

name, which is used to make the “Organism” ﬁelds of the databases, as well as a series of alternative names. At present, the Taxonomy nomenclature is the standard for the three main nucleic acid databases and for UniProt. Medical Subject Headings (Mesh): A collection of standardized keywords, used by the NLM to index MEDLINE.

2.3.2 Sequence Retrieval System (SRS) An alternative to searching databases at their maintenance site or turning to “the big site that has everything” is to have a local site with local copies of a series of often used databases. It is not trivial to install “mirrors” of a variety of different databases in one place under a common interface, since these databases not only have a different design, but they are also maintained using different incompatible and often expensive commercial software tools. A solution is to export/import the databases as eXtensible Markup Language (XML), text interwoven with “tags” that ease parsing, or even as a simple “flat file” in American Standard Code for Information Interchange (ASCII) and use software that can generate searchable indexes with keywords. At the time of this writing, the most popular and most elaborate tool to do this is SRS. It was originally developed by Thure Etzold and coworkers at the EMBL/EBI. Its copyright now belongs to the private enterprise, BioWisdom Ltd. (Cambridge, UK), but academic users can still obtain a free license. SRS can handle XML as well as simple ASCII; the commercial version, which has extended capabilities, can also integrate databases under native RDBS, and has a tool to automate the update of the databases (Prisma). The different databases can be interlinked in a way similar to the tables of Entrez. To help the user perform their query, version 8 of SRS has databases and links between databases organized according to a concept-oriented database ontology (so that you can, for example, search for “Macromolecular Structure” without having to know explicitly that you needtoselectthePDB).Thereisalsoasynonym finder that proposes alternatives for keywords. SRS can submit sequences to analysis tools (Blast, Clustal, Emboss,etc.) and can present the results on-the-fly as a searchable database and/or in a graphical survey. There are now more than 40 publicly accessible SRS servers worldwide; Bio- Wisdom maintains a list of these sites and their collection of available databases (see Box 2.1). Note that, at the moment of this writing, the future of SRS is somewhat uncertain because of the recent sale of SRS by Lion Ltd. to BioWisdom Ltd. Alternatives do exist, however; among the many available we can cite the great classic ACNUC of 44 Guy Bottu

the Universite´ Claude Bernard (Lyon, France) and the very promising MRS of the University of Nijmegen (the Netherlands).

2.3.3 Some general considerations about database searching by keyword While searching databases, it is important to understand that the search results depend in the first place on what has been written in the database by the submitting authors and by the curators. Notwithstanding ongoing efforts, databases are only partly standardized. So, you can find “HIV-1” as well as “HIV1.” If you are looking for proteases, sometimes you find the word “protease” in the text, but sometimes only the name of the particular protease, like “trypsin” or “papain.” Even worse, there can be typographical errors like “psuedogene” instead of “pseudogene.” Furthermore, search results also depend on what has been put in the alphabetic indexes used to search the database. So, it serves no purpose to search for “reverse transcriptase” if “reverse” and “transcriptase” have been placed in the index as separate keywords. It is a good idea to repeat your query several times with different search terms. After you have performed some searches, you will get a better feeling of what to expect in the databases. In this respect, the utility of some specialized databases with standardized keywords has already been mentioned (see above). The Taxonomy database in particular, which is part of Entrez, is useful to search by keyword. In Entrez, or in an SRS server that holds a copy of the Taxonomy database, it is possible to search for an organism name, eventually to browse up and down in the hierarchy of the taxonomy, and then make the link to a sequence database. While the “Organism” field has already been standardized a long time ago, a similar effort is under way to standardize the “Keywords” fields of some databases, using the Gene Ontology (GO). A few more notes about composing queries:

With Entrez and SRS (and in nearly all database search systems), a search is case- insensitive (there is no difference in using capitals or small letters). Search systems usually allow the use of wild cards: ? and * stand, respectively, for one character and for any string of characters, including nothing. SRS and Entrez automatically append a wildcard * at the end of each search word longer than one letter; you can turn this feature off by putting the search term between quotes. Entrez, however, only allows wild cards at the end of a search term. You can compose a query with logical combinations of search terms. Entrez will recognize AND, OR, and NOT (means but not); SRS uses the symbols &|! instead. You can nest logical expression with parentheses “( ).” We already mentioned the problem of search terms containing spaces. SRS has a special feature: when you type, for example, protein c it will in fact search protein* & c | protein c* unless you type explicitly with quotes "protein c". 45 Sequence databases and database searching: theory

2.4 Database searching by sequence similarity

2.4.1 Optimal alignment Relationships between homologous sequences are commonly inferred from an alignment. An alignment has bases/amino acids that occupy corresponding positions written in the same column. Deletions/insertions are indicated by means of a gap symbol, usually a ‘-’ (or sometimes a ‘.’). From any alignment a similarity score can be computed, which tells something about the quality of the alignment. This score is computed by summing up a comparison score for each aligned pair of bases/amino acids and by subtracting a penalty for each introduced gap. For nucleic acids one generally uses a simple match/mismatch score, although more complex scoring schemes like transition/transversion scoring have been proposed. For proteins log odds matrices are usually employed, like the classic PAM (Dayhoff et al., 1978)andBLOSUM substitution matrices (Henikoff & Henikoff, 1992;seeFig. 3.4 in the next chapter). They contain similarity scores proportional to: q s(i, j) = s( j, i) = log ij (2.1) pi ∗ p j

with:

qij the probability of finding amino acid i and amino acid j aligned in a reliable alignment of related protein sequences. pi the proportion of amino acid i in proteins and thus the probability of finding i at a randomly chosen position of a protein. pi ∗ pj (by consequence) the probability of finding amino acid i and amino acid j aligned in a random alignment.

The simplest method for handling gaps is to penalize every gap symbol. This is, however, unsatisfactory because deletions/insertions of more than one base/amino acid at a time are common. Therefore a gap of n positions is often penalized using a + b ∗ n,wherea is the gap penalty and b the gap length penalty (the approach followed by GCG, Staden, Ncbi Blast, fastA and Clustal), or by the functionally equivalent a + b ∗ (n – 1), where a is the gap opening penalty and b the gap extension penalty (the approach followed by Emboss and WU-Blast). Note that the choice of an appropriate gap penalty is a difﬁcult bioinformatic problem that has not yet been solved appropriately and that it depends, among other things, on the scoring matrix used (if the values in the matrix are higher, so must be the gap penalty). An example of how a pairwise alignment is scored using the BLOSUM50 matrix and a gap penalty in conjunction with a gap length penalty is shown in Fig. 2.4. 46 Guy Bottu

AFEIWG---M

AY-LWHCTRM

Fig. 2.4 An example of how an alignment is scored. Using the BLOSUM50 table and a gap penalty of 10 + 2n one obtains a similarity score of 3, as 5 (pair A-A) + 4 (pair F-Y) − 12 (one position gap) + 2 (pair I-L) + 15 (pair W-W) − 2 (pair G-H) − 16 (three positions gap) + 7 (pair M-M).

Note that the similarity score is always defined symmetrically; it does not matter which of the two sequences is on top. Note also that the similarity score can eventually be negative, but this would mean that the alignment is of poor quality, e.g. an alignment between proteins where hydrophobic amino acids have been matched with hydrophilic amino acids. Sometimes you can make a rapid and reliable alignment manually because the sequences are so conserved or because a lot of experimental data are available that show which bases/amino acids must correspond. If this is not the case, computa- tional approaches must be employed to search for the optimal alignment,thatis the alignment yielding the highest possible similarity score for a particular scoring scheme. Note, however, that nothing guarantees that the optimal alignment is also the “true” alignment (in the biological sense of the term)! Indeed, when altering the scoring scheme, different optimal alignments can be found (see Chapter 3). To find the optimal alignment it is not necessary to write down explicitly all the alternative alignments. A method exists that guarantees to find the best possible alignment while minimizing the amount of calculation needed. This is the so- called dynamic programming and backtracking algorithm. An explanation of how itworksisprovidedbyBox 3.1 of Chapter 3. This algorithm yields the best global alignment and was first proposed by Needleman and Wunsch (1970). Gotoh (1982) proposed a trick to minimize the amount of computation needed for handling gap penalties of type a + b ∗ n. It can happen that two sequences are only recognizably homologous in a limited region. To handle this situation, Smith and Waterman (1981) proposed a modified version of the algorithm that searches the best local alignment, allowing for stretches of unaligned sequences. An obvious procedure for searching a database is to make, one by one, the optimal alignments between your sequence (the query sequence) and all the sequences in the database. The Smith–Waterman (SW) will usually be preferred over the Needleman–Wunsch algorithm, because sequences in the databases are often of varying lengths. One can then recover the sequences that yield a similarity score above a certain threshold or, better, the sequences that satisfy some statistical criterion that shows that the alignment is better than what you could obtain by chance (see the explanation of the Blast,andfastA algorithms below for examples 47 Sequence databases and database searching: theory

of such criteria). This type of analysis can, however, take a lot of computer time and the time needed increases linearly with the size of the database. A search of a sequence of common size (1000 bases or 300 amino acids) against a general database can easily take several hours or several days on a standard computer! In order to obtain the result in a reasonable amount of time, such a search is usually performed on a cluster of computers, whereby the individual computers each search a part of the database. An example of this is the public MPsrch server (formerly known as Blitz) of EMBL–EBI, which in its present version uses a Compaq cluster and allows only protein searches. To save computation time, the algorithm can be hard-coded in a dedicated processor; a commercially available example of such a device is the BioXL/H manufactured by Biocceleration Ltd. (Beith-Yitschak, Israel).

2.4.2 Basic Local Alignment Search Tool (Blast) Since one cannot always afford a fancy machine or a long search time on a standard computer, bioinformaticians have developed so-called “heuristic” algorithms, which allow searching a database in considerably less time, at the price, however, of not having the absolute guarantee to ﬁnd all the sequences that yield the highest optimal alignment scores. The most popular one is Blast. The earlier version 1 of Blast developed at NCBI produced alignments without gaps (Altschul et al., 1990). After Warren Gish of the Blast team moved to Washington University (Saint Louis, Missouri, USA), a new version at NCBI called interchangeably Blast, Blast 2,“gapped Blast”orNCBI Blast (Altschul et al., 1997)hasbeen developed in parallel with WU-Blast at Washington University. Compared to NCBI Blast, WU-Blast has more optional parameters that allow trading off speed against sensitivity. Here, we will restrict our explanation to the algorithm used by NCBI Blast:

Blast starts by searching for “words” (oligonucleotides/oligopeptides) shared by the query sequence and a database sequence. For nucleic acids the default word length is 11, for proteins 3. For proteins, Blast performs “neighboring” (Fig. 2.5a): it searches not only identical oligopeptides but also similar oligopeptides. When a common “word” has been found, Blast tries to extend the “hit” by adding base/amino acid pairs (Fig. 2.5b). The growing alignment is scored, using for nucleic acids a simple match/mismatch scoring scheme (by default +1/–3) and for proteins a scoring matrix (by default the BLOSUM62 table). If the score drops more than a certain amount below the already found highest score (“X”inFig. 2.5b), Blast gives up extending further and considers that one must take the alignment up to the position corresponding to that high score. The procedure is repeated in the other direction. So one obtains a local alignment without gaps, called “High Scoring segment Pair” (HSP). Blast retains an HSP if its E() is below a chosen threshold. Because an HSP will often include several “hits,” protein Blast saves time by only 48 Guy Bottu

(a) Neighborhood word list: Query sequence: RLEGDCVFDGMIGSDQGSLRFDGFDVECDSPG GSD GAD GTD GDD GED search in GGD database sequences GSE GCD GHD GMD GSN Threshold = 11 GID GLD GVD

(b) Extending a word hit into an HSP: Query sequence: EGDCVFDGMIGSDQGSL E C+ +G G+D GS+ Database sequence: EAGCLQNGQRGTDVGSV

G S D Q G S L R F D G F D V E C D G T D V G S V M D E I P N D F E C 6 1 6-2 6 4 2-1-3 2-4-4 1-3-3-4-5

Fig. 2.5 (a) Neighborhood list for the word “GSD” in the query sequence. Only words yielding a similarity greater than the threshold (11) according to the BLOSUM62 matrix will be considered as a hit. (b) Extending a word hit into a High Scoring segment Pair (HSP). The procedure of scoring a growing alignment based on the BLOSUM62 matrix is shown for the sequence to the right of the word. If the score drops more than a certain amount below the highest score found so far (“X” in Fig. 2.5b), Blast aborts further extending and settles with the alignment up to the position corresponding to the highest score.

starting the extension procedure if there is a second “hit” in the proximity of the ﬁrst one. This procedure is repeated for every sequence in the database. If the score of an HSP is above a certain threshold, Blast makesanSWalignment, but saves time by searching only the squares of the dynamic programming matrix in the neighborhood of the HSP (it does not allow the score to drop more than a certain amount). The HSP is retained if its E() is below a chosen threshold. 49 Sequence databases and database searching: theory

It is possible to ﬁnd several HSPs for one database sequence. If Blast does not make an alignment with gaps or if all the HSPs are not included in the gapped alignment, these HSPs are reported separately.

Blast computes E() (or E-value), which is the number of sequences yielding a similarity score of at least S that one would expect to ﬁnd by chance or, in other words, an estimation of the number of false positives:

−λ∗ E () = m ∗ n ∗ K ∗ e S (2.2)

where m is the length of the query sequence, n the total length of the database, and K and λ the so-called “Karlin and Altschul” parameters, which depend on the scoring scheme and the base/amino acid composition of the database. Stephen Altschul from the NCBI derived a rigorous statistical formula that allows computation of K and λ in case of an alignment without gaps (HSP), assuming that the scores follow an extreme value distribution (Karlin & Altschul, 1990). He also derived a formula in case one finds several HSPs for one database sequence (“Sum statistics”, see Karlin & Altschul, 1993). There is no rigorous analytical formula in case of alignments with gaps. The experts, however, believe that the given formula is still approximately valid and Blast uses a K and λ obtained by simulation (by searching a random sequence against a random database). The probability P of finding an alignment with a score of at least S by chance alone is then 1 – e–E. Blast computes also a “bit score,” which does not depend directly on the scoring scheme used and thus reflects better the significance of the alignment: λ ∗ s − ln K s = (2.3) ln 2 The current Blast package contains an everything-in-one program Blastall,which offers five types of database searches:

BLASTN compares a nucleic acid sequence with all the sequences of a nucleic acid database and their complements. BLASTP compares a protein sequence with all the sequences in a protein database. BLASTX compares a nucleic acid sequence, translated in the six translation frames, with all the sequences of a protein database. TBLASTN compares a protein sequence with all the sequences of a nucleic acid database, translated on-the-ﬂy in the six translation frames. TBLASTX compares a nucleic acid sequence, translated in the six translation frames, with all the sequences of a nucleic acid database, translated on-the-ﬂy in the six translation frames. TBlastX does not make gapped alignments.

There is also a program Blastpgp (for protein searches only), which computes E() using a more accurate formula taking into account the composition of the query 50 Guy Bottu

and database sequence (Schaffer et al., 2001) and also offers the following types of searches:

Position Specific Iterated Blast (PSI-Blast) simply searches a protein against a protein database first, then picks out the best query sequence/database sequence alignments, merges them into a multiple sequence alignment, transforms the multiple sequence alignment into an amino acid frequency profile, and searches the profile against the database. The procedure can be repeated until the profile converges; that means until the sequences found by the profile are the same as the sequences used to make it. Pattern-Hit Initiated Blast (PHI-Blast) searches both a pattern (defined in PROSITE format) and a protein sequence against a protein database and finds sequences that match the pattern and show, in the same region, a significant local similarity. It is possible to perform a PHI-Blast as a first round and go on with PSI-Blast.

For those who want to learn how to use Blast in an expert manner, we can recommend the book by I. Korf et al.(2003).

2.4.3 FastA FastA (Pearson & Lipman, 1988) is older than Blast and results from the work of William “Bill” Pearson at the University of Virginia (Charlottesville, USA). On the fastA distribution site, you can still ﬁnd version 2 in parallel with version 3, because the non-database searching programs were not ported when the statistics were improved (Pearson, 1998). Brieﬂy, the algorithm goes as follows:

FastA searches all the “words” or k-tuples that are common between the query sequence and a database sequence, similar to Blast but without “neighboring.” By default, k is 6 for nucleic acids and 2 for proteins, but you can decrease k to obtain a greater sensitivity at the price of a signiﬁcant increase in computing time. FastA tries to assemble the “words” into long contiguous segments. These segments are scored using a simple match/mismatch scoring scheme for nucleic acids (by default +5/–4) and a scoring matrix for proteins (by default the BLOSUM50 table); the 10 best ones are retained. The highest score of all is called init1. FastA tries to assemble segments that are compatible with the same alignment (that is, which do not overlap). A score is computed by adding the scores of the segments and by subtracting joining penalties. The combination that yields the highest score (initn) is retained. This procedure is repeated for every sequence in the database. For those database sequences that yield an initn above a certain chosen threshold, fastA makes the SW alignment (score opt), but saves time by searching only the squares of the dynamic programming matrix in a band around the already-found segments. 51 Sequence databases and database searching: theory

FastA computes a Z-score and an E() value and lists the database sequences that result in an E() below a chosen threshold.

The Z-score is a similarity score corrected for the effect of the different lengths of the database sequences: s − ρ∗ ln(n) − µ Z = (2.4) σs where the mean score for unrelated sequences of length n, µ + ρ ∗ ln(n), is estimated by linear regression using as dataset a collection of sequences found during the database search that yielded very low opt scores. You can also force fastA to use randomly shuffled copies of the database sequences instead. This is particularly useful if you want to assess the false positive rate in a specialized database or a set of sequences that you defined yourself. In such cases, even sequences with the lowest similarity to the query (low opt scores) can still represent true homologous, and as such, they would not be appropriate to compute a mean score for “unrelated” sequences. The probability P that an unrelated sequence yields a score at least equal to the observed score is estimated, assuming that the Z-scores follow an extreme value distribution. Contrary to Blast, which treats the complete database as one long sequence, fastA considers the comparisons with the sequences of the database as statistically independent events and computes E() by multiplying P with the number of sequences in the database. FastA also computes a bit score, similar to that of Blast, but using a K and λ derived from the database sequences. Note that it is possible to find twice the same alignment with different Z-score, E() value and bit score if the same subsequence is found as part of a shorter and a longer database sequence. The fastA package contains programs that perform the following searches:

Fasta compares a nucleic acid sequence and its complement with all the sequences of a nucleic acid database, or compares a protein sequence with all the sequences of a protein database. Fastx compares a nucleic acid sequence and its complement with all the sequences of a protein database, aligning translated codons with amino acids and allowing for gaps between codons (Pearson et al., 1997) Fasty the same as fastx, but allowing also gaps within a codon. Tfasta compares a protein sequence with all the sequences of a nucleic acid database, translated on-the-ﬂy in the six translation frames. Tfastx compares a protein sequence with all the sequences of a nucleic acid database and their complements, aligning translated codons with amino acids and allowing for gaps between codons. Tfasty the same as tfastx, but allows also gaps within a codon. Ssearch the same as fastA, but uses the slow and precise SW algorithm. 52 Guy Bottu

2.4.4 Other tools and some general considerations Among other database search tools, Ssaha (available as part of the Ensembl interface, see Ning et al., 2001)andBlat (available as part of the UCSC Genome Browser interface, see Kent, 2002) are worthwhile mentioning here. They are very fast because they do not search the sequences themselves but use an index with oligonucleotides, which is made each time the database is updated. They are suited for searching a genome to see if it contains a sequence stretch very similar to a query sequence. Most database search tools compute an E() value, an estimation of the number of false positives expected by chance. For the inexperienced user, who can be bewildered by the multiplicity of scores and statistics in the program outputs, this E()-value is the most useful. An often-cited rule of thumb is: if E() < 0.1, you can accept the sequence hit as a homolog with reasonable confidence; if 0.1 < E() < 10, you are in the so-called “twilight zone,” which means, that the hit could be a homolog, but you should not take it for granted without additional evidence; if E() > 10, you are in the so-called “midnight zone,” which means, do not accept the result, because even if the hit is a true homolog, it has diverged beyond recognition and the alignment presented is likely to be incorrect. Beware of over-interpreting the exact value of E(), which is calculated based on assumptions that do not necessarily hold in reality (see, for example, an article by Brenner et al., 1998, arguing that the Blast E() might be underestimated). In any case, very low E() values indicate that the hit almost certainly represents a true homolog. Protein similarity searches are inherently more sensitive than nucleic acid searches. The reason is a simple effect of the sequence “alphabet.” Indeed, since there are only four bases in nucleic acids, one expects to find 25% of identities by chance alone. So, when sequences start diverging by substitutions, nucleic acids reach a point much earlier where you cannot infer homology from sequence similarity. If you are interested in a (potentially) coding DNA, it is therefore preferable to perform a protein similarity search. In this case, one can take advantage of the fact that Blast and fastA packages allow “translated” searches (see above). It has been shown that, for most protein families, Blast (even the old “ungapped” Blast) is outperforming fastA, and it performs nearly as well as the SW algorithm. The reason is that Blast can rather easily detect substitutions in proteins thanks to the “neighboring.” The scoring matrices precisely reflect the fact that one often finds similar amino acids rather than identical amino acids at corresponding positions. Furthermore, the HSPs found by Blast often correspond to the segments forming the conserved “core” of the protein. In coding DNA, deletions/insertions that do not encompass a multiple of three bases are rare because they cause a shift in reading frame, and furthermore deletions/insertions in the “core” of a protein are rare because they change the relative position of the segments. Non-coding 53 Sequence databases and database searching: theory

DNA has, however, the tendency to rapidly accumulate deletions/insertions. In this case, fastA is known to perform better than the BlastN option of Blast, because it can assemble smaller segments. Furthermore, the default settings in Blast can be adapted to search rapidly for closely related sequences, while Blast sensitivity can be increased by decreasing the word size (and in the case of WU-Blast by enabling “neighboring” for nucleic acids). Keep in mind, however, that nucleic acid comparisons do not probe evolution very far. Finding all the sequences of a family is something other than finding a few related sequences. To achieve a maximum sensitivity, it is not enough to perform an SW search. Note that, for protein searches, the usage of a high number BLOSUM/low number PAM substitution matrix gives the best probability for finding sequence fragments that are very closely related but very short, because one attaches a greater importance to conserved positions. Using a low number BLOSUM/high number PAM matrix yields the best probability of finding sequence fragments that are very distantly related, because one attaches a greater importance to amino acids aligned with a “similar” amino acid. This results from the way these tables were derived: BLOSUM62 is made from local alignments with at least 62% of the columns fully conserved, PAM250 should reflect evolution at a distance of 250 substitutions per 100 amino acids (see also Chapter 9). So, if you want to find just a few similar sequences, it is usually sufficient to run fastA or Blast once with the default parameter settings. But, if you want to be certain to have found as many as possible potentially homologous sequences, it will be necessary to perform several database searches (with Blast, fastA and/or eventually SW) using different scoring schemes. What one search does not find, another might. An alternative and complementary strategy to find all the sequences of a family is to perform new searches with, as a query, some of the sequences you found already. PSI-Blast (see above) and Scan-PS are also potentially useful; the latter works in a way similar to PSI-Blast but uses a rigorous SW algorithm. These programs find a self-consistent set of proteins, which can be well suited for phylogenetic analysis, as well as some remote relatives that a simple run might not find. The number of false positives, reflected by the E() value, will increase with database size. By searching a small set instead of a complete database (e.g. only the bacterial sequences or the E. coli sequences), you can reduce the “noise” and save computation time. In this context, remember the utility of the specialized databases (see Section 2.2.3). Note also that the NCBI Blast server allows the user to perform a Blast search against a set of sequences resulting from an Entrez search (McGinnis &Madden,2004). Proteins typically are composed of several domains. If your query protein shares a domain with hundreds of proteins in the database and shares another domain with only a few other sequences, the second domain might be missed. Similarly, 54 Guy Bottu

eukaryotic DNA sequences often contain repetitive elements such as Alu repeats that can produce a long but uninteresting list of hits at the top of the output. Therefore, if your Blast or fastA query repeatedly results in alignments covering the same range of your query sequence, it is a good idea to cut out that range (or replace it by XXX) and then run the search again. In addition, regions with biased composition like poly-A tails or histidine-rich regions yield a lot of uninteresting “hits.” At the NCBI, some “ﬁlters” have been developed that automatically remove such regions of low compositional complexity: Dust for DNA (Hancock & Armstrong, 1994)and Seg for protein (Wootton & Federhen, 1993). They can be obtained as standalone programs and they are also implemented in Blast. Note that, in the simple NCBI Blast (Blastall), the ﬁlters are enabled by default, while in PSI/PHI-Blast (Blastpgp) and WU-Blast they are disabled by default, but this can always be changed to suit your needs. PRACTICE Marc Van Ranst and Philippe Lemey

2.5 Database searching using ENTREZ ENTREZ is an easy-to-use integrated text-based cross-database search and retrieval system developed by NCBI. It links a large number of databases including nucleotide and protein sequences, PubMed, 3-D protein structures, complete genomes, taxonomy, and others. ENTREZ has a comprehensive Help Manual,whichyoucan consult at http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helpentrez. On the ENTREZ homepage (http://www.ncbi.nlm.nih. gov/gquery/gquery.fcgi), you can specify a search term in the input box. The example in Fig. 2.6 shows the result for the search term “human papillomavirus type 13.” You can perform a concurrent search across all ENTREZ databases by typing your search text into the box at the top of the page and then clicking “Go,” or you can search a single database by entering your search term and clicking the database’s icon instead of “Go.” At the time of writing, you could find eight articles in the scientific literature on human papillomavirus type 13, and the full text of all eight are freely available on PubMed Central. Figure 2.7 shows the first four most recent articles on human papillomavirus type 13. When you click for example on the fourth item on the list, it will link you to the abstract of this article. When you would use the same search term “human papillomavirus type 13” to search the nucleotide database, ENTREZ finds five nucleotide sequences in the CoreNucleotide records (the main database), 0 in the Expressed Sequence Tags (EST) database, and 0 in the Genome Survey Sequence (GSS) database (Fig. 2.8). When you click on the blue hyperlink of the CoreNucleotide records, ENTREZ will present the records where it found the text string “human papillomavirus type 13” (Fig. 2.9). You will notice that the first record is the complete sequence of Caenorhabditis elegans. Somewhere in the annotation of the C. elegans genome, the search engine encountered the text string “human papillomavirus type 13.” The second and fourth records contain the complete genome of human papillomavirus type 13, and the third record contains the complete sequence of a related virus, the pygmy chimpanzee papillomavirus. Thus, although this ENTREZ nucleotide will, in most instances, allow you to find the information that you wanted, it will not necessarily be in the first record in the list. Clicking on a blue accession number hyperlink in the list will bring you to the GenBank file of the sequence (Fig. 2.10). A detailed explanation of the Genbank flat file format is provided at http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html. The Genbank format is well suited to incorporate a single sequence and 55 56 Marc Van Ranst and Philippe Lemey

Fig. 2.6 An Entrez search across all databases using the search term “human papillomavirus type 13.”

its annotation. For phylogenetic analysis, however, multiple sequences need to be formatted in a single file. Unfortunately, different programs have been using different input file formats resulting in about 18 formats that have been commonly used in molecular biology (an overview of most sequence formats can be found at http://emboss.sourceforge.net/docs/themes/SequenceFormats.html). There- fore, much of the effort involved in carrying out a comprehensive sequence analysis is devoted to preparing and/or converting files. The most popular sequence formats are illustrated in Box 2.4. Although file formats can be converted manually 57 Sequence databases and database searching: practice

Fig. 2.7 PubMed results for the search term “human papillomavirus type 13.”

Fig. 2.8 ENTREZ nucleotide database results for “human papillomavirus type 13.”

in any text editor, there exist sequence conversion programs, like Readseq written byDonGilbert,toperformthistaskinanautomatedfashion.Anonlineversion of this tool is available at http://iubio.bio.indiana.edu/cgi-bin/readseq.cgi.Fortu- nately, many alignment and phylogenetic programs have implemented sequence conversion code to import and/or export different ﬁle formats, and newly developed programs tend to use PHYLIP, NEXUS or FASTA as standard input format. Fig. 2.9 “human papillomavirus type 13” records in the CoreNucleotide database.

Fig. 2.10 Genbank flat file for accession number X62843. 59 Sequence databases and database searching: practice

Box 2.4 Common formats of sequence data input files

FASTA FASTA is probably the simplest text-based format for representing either nucleic acid sequences or protein sequences. The format begins with single-line sequence description preceded by “>.” The rest of the line can be used for description, but it is recommended to restrict the total number of characters to 80. On the next line, the actual sequence is represented in the standard IUB/IUPAC amino acid and nucleic acid codes. For aligned sequences, this may contain gap “-” symbols. FASTA format is sometimes referred to ∗ as Pearson/Fasta format. Many programs like ClustalW (see Chapter 3), Paup (use the “tonexus” command) (see Chapter 8), HyPhy (see Chapter 14), Rdp (see Chapter 16), and Dambe (see Chapter 20) can read or import FASTA format. To indicate this type of format, the extensions “.fas” or “.fasta” are generally used.

>Taxon1 AATTCCCCAGCTTTCCACCAAGCTC >Taxon2 AATTCCACAGCTTTCCACCAAGCTC >Taxon3 AACTCCAGCACATTCCACCAAGCTC >Taxon4 AACTCCACAACATTCCACCAAGCTC

NEXUS is a modular format for systematic data and can contain both sequence data and phylogenetic trees in “block” units (Maddison et al., 1997). An example of a NEXUS file containing a data block is shown below. For a description of the tree format, see the practice section in Chapter 5. Data and trees are public blocks containing information used by several programs. Private blocks containing relevant information ∗ for specific programs are being used by Paup (see Chapter 8), MrBayes (see Chapter 7), FigTree (see Chapter 5), SplitsTree (Chapter 21), etc. Box 8.4 in Chapter 8 further outlines the NEXUS format and explains how to convert data sets from PHYLIP to NEXUS format. Note that comments can be added in the NEXUS file using “[]”; the example below uses two lines of comments to indicate sequence positions in the alignment. NEXUS formatted files usually have the “.nex,” “.nexus,” or “.nxs” extension. #NEXUS

Begin DATA; Dimensions ntax = 4 nchar = 25; Format datatype = NUCLEOTIDE gap = -; Matrix [ 1 11 21 ] [| | |] Taxon1 AATTCCCCAGCTTTCCACCAAGCTC Taxon2 AATTCCACAGCTTTCCACCAAGCTC Taxon3 AACTCCAGCACATTCCACCAAGCTC Taxon4 AACTCCACAACATTCCACCAAGCTC ; End; 60 Marc Van Ranst and Philippe Lemey

PHYLIP PHYLIP is the standard input file format for programs in the Phylip package (see Chapters 4 and 5) and has subsequently been implemented as input file format for many other programs, like Tree-Puzzle, PhyML and Iqpnni (see Chapter 6).Thefirstlineoftheinput file contains the number of taxa and the number of characters (in this case alignment sites), separated by blanks. In many cases, the taxon name needs to be restricted to ten characters and filled out to the full ten characters by blanks if shorter. Special characters like parentheses (“(“ and ”)”), square brackets (“[” and “]”), colon (“:”), semicolon (“;”) and comma (“,”) should be avoided; this is true for most sequence formats. The name should be on the same line as the first character of the data for that taxon. PHYLIP files frequently have the “.phy” extension.

425 Taxon1 AATTCCCCAGCTTTCCACCAAGCTC Taxon2 AATTCCACAGCTTTCCACCAAGCTC Taxon3 AACTCCAGCACATTCCACCAAGCTC Taxon4 AACTCCACAACATTCCACCAAGCTC

The PHYLIP format can be “sequential” or “interleaved.” In the sequential format, the character data can run on to a new line at any time:

465 Taxon1 AATTCCCCAGCTTTCCACCAAGCTCTGCAAGATCCCAGAGTCAAGGGCCTG- TATTTTCCTGCTGG Taxon2 AATTCCACAGCTTTCCACCAAGCTCTGCAAGATCCCAGAGTCAGGGGCCTG- TATTTTCCTGCTGG Taxon3 AACTCCAGCACATTCCACCAAGCTCTGCTAGATCC--- AGTGAGGGGCCTATACGTTCCTGCTGG Taxon4 AACTCCACAACATTCCACCAAGCTCTGCTAGATCCCAGAGTGAGGGGCCTTTAT- TATCCTGCTGG

The PHYLIP interleaved format has the ﬁrst part of each of the sequences (50 sites in the example below), then some lines giving the next part of each sequence, and so on.

465 Taxon1 AATTCCCCAG CTTTCCACCA AGCTCTGCAA GATCCCAGAG TCAAGGGCCT Taxon2 AATTCCACAG CTTTCCACCA AGCTCTGCAA GATCCCAGAG TCAGGGGCCT Taxon3 AACTCCAGCA CATTCCACCA AGCTCTGCTA GATCC---AG TGAGGGGCCT Taxon4 AACTCCACAA CATTCCACCA AGCTCTGCTA GATCCCAGAG TGAGGGGCCT

GTATTTTCCT GCTGG GTATTTTCCT GCTGG ATACGTTCCT GCTGG TTATTATCCT GCTGG

To make the alignment easier to read, there is a blank every ten characters in the example above (any such blanks are allowed). In some cases, the interleaved format can or needs to be speciﬁed using “I” on the ﬁrst line (e.g. “465I”). Note that NEXUS 61 Sequence databases and database searching: practice

formatted alignments can also be interleaved or sequential. Programs in the Paml package (see Chapter 11) use the PHYLIP sequential format, but allow for more characters in the taxon name (up to 30). Paml considerstwoconsecutivespacesastheendofaspecies name, so that the species name does not have to have exactly 30 (or 10) characters.

CLUSTAL CLUSTAL is not widely supported as input format in phylogenetic programs but, since it is the standard output format of a popular alignment software (see Chapter 3), we include it in our overview. The format is recognized by the word CLUSTAL at the beginning of the ﬁle. CLUSTAL is an interleaved format with blocks that repeat the taxa names, which should not contain spaces or exceed 30 characters and which are followed by white space. The sequence alignment output from Clustal software is usually given the default ∗ extension “.aln.” Clustal also indicates conserved residues in the alignment using a “ ” for each block.

CLUSTAL W (1.83.1) multiple sequence alignment

Taxon1 AATTCCCCAGCTTTCCACCAAGCTC Taxon2 AATTCCACAGCTTTCCACCAAGCTC Taxon3 AACTCCAGCACATTCCACCAAGCTC Taxon4 AACTCCACAACATTCCACCAAGCTC ∗∗ ∗∗∗ ∗ ∗∗∗∗∗∗∗∗∗∗∗∗∗

MEGA In this format, the “#Mega” keyword indicates that the data file is prepared for analysis using Mega (see Chapters 4 and 5). It must be present on the very first line in the data file. On the second line, the word “Title” must be written, which can be followed by some description of data on the same line. Comments may be written on one or more lines right after the title line and before the data. Each taxon label must be written on a new line staring with “#” (not including blanks or tabs). The sequences can be formatted in both sequential and interleaved format.

#Mega Title: example data set

! commenting #Taxon1 AATTCCCCAGCTTTCCACCAAGCTC #Taxon2 AATTCCACAGCTTTCCACCAAGCTC #Taxon3 AACTCCAGCACATTCCACCAAGCTC #Taxon4 AACTCCACAACATTCCACCAAGCTC 62 Marc Van Ranst and Philippe Lemey

Some advanced program for population genetic inference, like Beast (see Chapter 18) and Lamarc (see Chapter 19)useanXML ﬁle format. However, those programs usually provide ﬁle conversion tools that accept FASTA or NEXUS format.

The human papilloma virus type 13 genome can be displayed and downloaded in FASTA format using the Display (FASTA)andShow, Send to (File) options. One can focus on a particular gene region using the Range (from.. to..)andRefresh Option. Multiple sequences can be obtained by typing accession numbers separated by blanks, for example, search the CoreNucleotide database for “AY843504 AY843505 AY843506.” Three entries for primate TRIM5α sequences will be retrieved, which can be saved in FASTA format (Display FASTA and Show, Send to File). Alternatively, the accession numbers can be typed in plain text file with a single accession number on each line, which can be uploaded for sequence retrieval (from CoreNucleotide) by the Batch Entrez system (http://www.ncbi.nlm.nih.gov/entrez/batchentrez.cgi?db=Nucleotide). Finally, if a set of DNA sequences has been collected to analyze the evolutionary relationships within a particular population (single or multiple species population), they can be submitted and stored as a “PopSet.” Such collections are submitted to GenBank via Sequin, often as a sequence alignment. To illustrate this, go back to the ENTREZ homepage and specify the search term “Sawyer.” At the time of writing, this resulted in 16 entries in the PopSet database. When you click on hyperlink of the PopSet section, the 16 journal article titles should appear with the first author represented as a hyperlink. Follow the link of “Sawyer” with title “Positive selection of primate TRIM5α identifies a critical species-specific retroviral restriction domain.” This will bring you to the population set from this study (Fig. 2.11). By scrolling down, you will notice that this collection holds 17 TRIM5α sequences from different species. A multiple sequence alignment can be generated and the sequences can be saved to a single file in FASTA format (Display FASTA and Show, Send to File). This is a subset of the sequences that will be further analyzed in Chapters 3, 4, 5,and11.

2.6 Blast

On the Blast homepage at NCBI (http://www.ncbi.nlm.nih.gov/BLAST/), a choice is offered between the different Blast programs (Fig. 2.12) through different hyperlinks. In this exercise we will compare a nucleotide and protein Blast search for one of the TRIM5α sequences. 63 Sequence databases and database searching: practice

Fig. 2.11 PopSet page for the TRIM5α collection.

As query for a nucleotide Blast (BlastN), you can use a sequence in FASTA format, an accession number or a GenBank identiﬁer (gi) (Fig. 2.13). We will use the TRIM5α nucleotide sequences in FASTA format (with accession number AY843504.1). You must also specify the database you want to search. If the complete GenBank database needs to be searched, the choice should be “Nucleotide collection (nr/nt)”database.WewillalsoselectBLASTN under the Pro- gram Selection header, which optimizes the search for somewhat similar nucleotide sequences. The parameters of the algorithm can be adjusted by clicking on Algorithm parameters at the bottom of the page. Increase the Max target sequences to 20000 to make sure we will see all the BlastN hits in the results page. The default threshold for the E()valueis10andthedefaultword size is 11. Click on the Blast button to initiate the search. At the time of writing, this search resulted in 2372 hits. A graphical overview with color-coded alignment scores is provided for the most similar sequences. For all hits, there are discontinuous stretches with high alignment scores. This is 64 Marc Van Ranst and Philippe Lemey

Fig. 2.12 Different basic and specialized Blast programs.

because the input sequence contained several stretches of “Ns” for regions that were not sequenced, which will not be taken into account in the word hits and Blast extension with database sequences. The output of the BlastN search also contains a table with sequences producing significant alignments (part of this table is shown in Fig. 2.14). The first three hits have the same Max score (1168) and maximum identity, one of which is, in fact, the query sequence. The first 11 hits are all TRIM5α sequences from African green monkeys. Scrolling down the list you will find many more TRIM5α sequences and eventually other isoforms or members of the tripartite motif containing protein family. At the bottom of the list are hits that resulted in an E() value of 6, which places these hits in the so-called “twilight zone” (see Section 2.4.4). We also perform a protein Blast search using the amino acid translation of the TRIM5α coding sequence (accession number AAV91975.1). For this search, 65 Sequence databases and database searching: practice

Fig. 2.13 BLASTN submission page.

Fig. 2.14 Part of the BlastN output table for the TRIM5α search.

we specify blastp under Program Selection and set the Max target sequences to 20000. The default word size is now 3. During the search, a page will be displayed that shows four putative conserved domains detected in the query. The Blast search now provides 7799 hits with the query sequence (AAV91975.1) heading the list. 66 Marc Van Ranst and Philippe Lemey

Fig. 2.15 Pairwise alignment of the TRIM5α query sequence with the subject sequence in the GenBank database for which a protein structure has been determined.

Using “TRIM5” as query would not result in any hits in the NCBI Structure database. However, we might ﬁnd homologous proteins with known structures in the database using our sequence similarity search. Scroll down the BlastP hit list and look for an “S”inaredsquarenexttotheE() column. The ﬁrst hit in the list is “pdb|2IWG|B.” When you click on the hyperlink in the Score column (117), the pairwise alignment of this subject (sbjct) sequence with your query sequence will appear (Fig. 2.15). The alignment of the Pryspry Domain Of Trim21 and the query has 76 identities in 221 aligned residues (34%). Clicking on the “S”ina red square will bring you to the structure database. There is another hit in the list (pdb|2FBE|A)withSprydomainstructures(E() value 1e-09). These structures can be useful for homology modeling of this TRIM5α domain. Note that there are also structures in the list for other conserved domains in this protein.

2.7 FastA

As a comparison, we now perform a fastA search with the same TRIM5α protein query at http://fasta.bioch.virginia.edu/fasta www2/fasta list2.shtml. Click on the Protein-protein FastA link and paste in the TRIM5α protein sequence in FASTA format (AAV91975.1) (Fig. 2.16). To have similar settings as the previous Blast search, select the NCBI NR non-redundant database under 67 Sequence databases and database searching: practice

Fig. 2.16 Protein–protein fastA submission page.

(C) Database and BlastP62 as Scoring matrix under Other search options.ThefastA search retrieves about 5000 records, with again the query sequence (AAV91975.1) heading the list of entries with best scores. The first part of the output file contains a histogram showing the distribution of the observed (“=”)andexpected(“∗”) Z-scores.(SeeSection 2.4.3 for an explanation of Z-score.) The theoretic curve (“∗”), following the extreme value distribution, is very similar to observed results (“=”) and provides a qualitative assessment of how well the statistical theory fits the similarity scores calculated by the program. Below the histogram, fastA displays a listing of the best scores (with 10 as E() value cut-off), followed by the alignments of the regions of best overlap between the query and search sequences. Clicking on the align link in the score list will bring you to the relevant alignment, which, in turn, provides a link to the Entrez database entry for the hit.