
UNIT 19.11

Phylogenetic Analysis of Protein Sequence Data Using the Randomized Axelerated Maximum Likelihood (RAXML) Program

Antonis Rokas¹

¹Department of Biological Sciences, Vanderbilt University, Nashville, Tennessee

ABSTRACT

Phylogenetic analysis is the study of evolutionary relationships among molecules, phenotypes, and organisms. In the context of protein sequence data, phylogenetic analysis is one of the cornerstones of comparative sequence analysis and has many applications in the study of protein evolution and function. This unit provides a brief review of the principles of phylogenetic analysis and describes several different standard phylogenetic analyses of protein sequence data using the RAXML (Randomized Axelerated Maximum Likelihood) Program. Curr. Protoc. Mol. Biol. 96:19.11.1-19.11.14. © 2011 by John Wiley & Sons, Inc.

Keywords: bootstrap; multiple sequence alignment; evolutionary relationship

INTRODUCTION

Phylogenetic analysis is a standard and essential tool in any molecular biologist’s bioinformatics toolkit that, in the context of protein sequence analysis, enables us to study the evolutionary history and change of proteins and their function. Such analysis is essential to understanding major evolutionary questions, such as the origins and history of macromolecules, developmental mechanisms, phenotypes, and life itself. On a more practical level, phylogenetic analysis of protein sequence data is integral to annotation, prediction of gene function, the identification and construction of gene families, and gene discovery.

Phylogenetic trees are mathematical structures that depict the evolutionary history of a group of organisms or genes. The aim of phylogenetic trees is to depict historical (i.e., evolutionary) relationships, and not degree of similarity. For example, although lizards and crocodiles look more similar to each other than to birds, crocodiles are evolutionarily closer to birds because the last common crocodile-bird ancestor was more recent than the last common crocodile-lizard ancestor. Similarly, the lysozyme protein of the colobus monkey exhibits 14 amino acid differences when compared to either the human or the baboon lysozyme protein (Stewart et al., 1987), even though humans diverged from the baboon-colobus monkey lineage almost 25 million years ago, whereas baboons and colobus monkeys diverged less than 15 million years ago (Sterner et al., 2006). Clearly, degree of sequence similarity does not equate with degree of evolutionary relationship.

A typical phylogenetic analysis of protein sequence data involves five distinct steps: (a) data collection, (b) inference of homology, (c) sequence alignment, (d) alignment trimming, and (e) phylogenetic analysis. Although this unit concentrates only on the last step, the first four steps are critical to accurate inference and are thus also worthy of brief discussion.

Data collection

The sources of protein sequence data used for phylogenetic analysis are very diverse. For example, data may be generated via PCR, cloning, and DNA sequencing of a particular locus (or loci) across several taxa (Murphy et al., 2001; Rokas et al., 2002; James et al., 2006) or a particular gene family across a genome (Garcia-Fernandez and Holland, 1994) and then translated into amino acid sequences. Alternatively, protein sequence data may be collected from high-throughput DNA sequencing experiments, such as EST- and RNA-sequencing (Dunn et al., 2008; Hittinger et al., 2010), or from whole-genome data (Rokas et al., 2003; Ciccarelli et al., 2006; Fitzpatrick et al., 2006).

Informatics for Molecular Biologists

Current Protocols in Molecular Biology 19.11.1-19.11.14, October 2011. Published online October 2011 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/0471142727.mb1911s96. Supplement 96. Copyright © 2011 John Wiley & Sons, Inc.

Homology inference

Virtually every phylogenetic analysis of molecular data assumes that the proteins under study are homologous; that is, every analysis assumes that all proteins studied are related by descent to the same ancestral protein. It is only after homologs have been inferred (or assumed) that phylogenetic analysis can be performed. Depending on the question asked, homolog inference can take many forms. For example, in studies aimed at understanding the evolution of a certain gene family, homolog inference is typically performed by conducting similarity search analyses with local alignment search algorithms, such as BLAST (UNIT 19.3). Alternatively, in studies focused on reconstructing species histories from gene histories, homology inference requires inference of orthologs (homologs that have originated via speciation) using more complex search strategies (Remm et al., 2001; Li et al., 2003; Wall et al., 2003; Alexeyenko et al., 2006; Kuzniar et al., 2008; Salichos and Rokas, 2011).

Sequence alignment

Once the set of homologous proteins has been identified, sequences are typically aligned globally, that is, across their entire length, to construct a multiple sequence alignment (MSA). In contrast to local alignment search algorithms like BLAST (UNIT 19.3), where the objective is the accurate identification of homologs, MSA algorithms focus on accurately aligning all the individual amino acids across all the sequences. The industry classic for MSA is the CLUSTAL family of programs (Larkin et al., 2007). However, in recent years, a new generation of much faster and much more accurate programs, such as MAFFT (Katoh et al., 2002; Katoh and Toh, 2008), T-COFFEE (Notredame et al., 2000), and PRANK (Loytynoja and Goldman, 2008, 2010) have been developed.

Alignment trimming

MSAs constructed for many proteins contain regions that are aligned poorly. In several cases, removal of such poorly aligned regions has been shown to improve phylogenetic inference (Talavera and Castresana, 2007), which has resulted in the common practice of “trimming” such poorly aligned regions from protein MSAs prior to phylogenetic inference. Popular programs for MSA trimming include G-BLOCKS (Castresana, 2000) and TRIMAL (Capella-Gutierrez et al., 2009).

BRIEF INTRODUCTION TO PHYLOGENETIC ANALYSIS

Once a set of protein sequences has been aligned, the resulting MSA can be entered directly into a phylogenetic analysis. There are several different methods and protocols for molecular phylogenetic analysis (Swofford et al., 1996; Li, 1997; Kitching et al., 1998; Page and Holmes, 1998; Nei and Kumar, 2000; Huelsenbeck et al., 2001; Felsenstein, 2003). This abundance of methods means that a novice user will have to make numerous decisions and choices at several different steps and levels during analysis, which may vary from one data set to another.

The aim of any phylogenetic analysis is to identify which tree, out of all possible trees, best estimates the true evolutionary history of the protein sequence data analyzed. At the most fundamental level, this estimation of phylogenetic relationships involves two decisions. The first decision is which optimality criterion should be used. Given a set of alternative phylogenetic trees, the optimality criterion allows the user to decide which tree explains or fits the data better. There are several different optimality criteria including, but not limited to, maximum likelihood, Bayesian inference, and parsimony (for detailed descriptions of these and other optimality criteria see Swofford et al., 1996; Huelsenbeck et al., 2001). For example, under the parsimony optimality criterion, the best tree is the one that requires the smallest number of evolutionary changes.

The second decision is the choice of search strategy for exploration of tree space (for a detailed, but remarkably lucid, description of the different search strategies see Swofford et al., 1996). It so happens that one cannot typically estimate the best tree among all possible trees for a set of protein sequences, for two reasons. First, because the number of possible trees grows exponentially with the number of sequences, the number of alternative trees for even small numbers of sequences is extremely large. For example, the number of different phylogenetic trees that can depict the evolutionary relationships of 50 sequences is nearly as large as the number of atoms in the known universe (Stamatakis et al., 2007). Second, it has been proven that efficient solutions to the computational problem of finding the best phylogenetic tree do not exist (Day et al., 1986; Chor and Tuller, 2005), and search of the near-entire tree space is required for accurate identification of the best tree. Because exhaustive evaluation of such large numbers of trees is unfeasible for data sets that contain

a dozen sequences or more, phylogeneticists have devised a number of different heuristic search strategies for identifying the best tree. Although these heuristic search strategies are very accurate and much faster than an exhaustive search, they are not guaranteed to find the best tree.

The standard practice in molecular phylogenetics is to analyze each data set using several different optimality criteria (maximum likelihood, Bayesian inference, and parsimony are the three most popular). Selection of a particular search strategy is typically determined by computational feasibility considerations. Exhaustive searches on data sets with more than a dozen sequences are still prohibitively time-consuming irrespective of which optimality criterion is used; however, it is now customary to use the most rigorous heuristic search strategies available on data sets containing hundreds or thousands of sequences. Once the user has chosen which optimality criterion and search strategy to employ on a given data set, a series of trees is generated and evaluated, always keeping track of the ‘best’ tree(s) examined in the course of the search of tree space. Once the search reaches the point where a better tree cannot be found, the search ends, and the ‘best’ tree becomes the best estimate of the evolutionary history of the data set analyzed.

As with any other type of statistical analysis, phylogenetic analysis allows for many different options and many different ways to analyze protein sequence data. This unit describes how to perform a set of standard phylogenetic analyses on protein sequence data using the RAXML program (Stamatakis, 2006; Stamatakis et al., 2005, 2008), and how to interpret the results. All analyses described here use the maximum likelihood optimality criterion and a specific search strategy (see below), both of which are state-of-the-art and highly accurate. Nevertheless, depending on the data analyzed and the question(s) asked, the reader should be aware that publication of phylogenetic trees in most journals typically requires several different analyses using several different programs. It is important to demonstrate agreement in results obtained from application of different optimality criteria and from several measures of robustness of inference.

Phylogenetic analysis using the maximum likelihood optimality criterion

The concept of likelihood has a long tradition in the field of statistical inference and has many applications in biological research (Edwards, 1992). Briefly, in the context of phylogenetic analysis, the maximum likelihood optimality criterion states that the phylogenetic tree that makes a given sequence data set most likely constitutes the maximum likelihood estimate of the phylogeny and is the preferred explanation (Page and Holmes, 1998). Formally, the likelihood score L_D of a sequence data set D for phylogenetic hypothesis H can be estimated by calculating the probability of D given H, or L_D = Pr(D|H). It should be noted that H does not only correspond to the phylogenetic tree but also to the probabilistic model of sequence evolution used in phylogenetic reconstruction. Importantly, the likelihood criterion not only enables us to directly estimate the parameters in the model of sequence evolution, but also to identify their optimal values for the data set analyzed (the optimal values are the ones that maximize the likelihood).

The model of sequence evolution involves several parameters that describe how the sequences in a given data set evolve, such as the rates of substitution between amino acids, the frequencies of amino acids, and the heterogeneity in rate of evolution across sites of the MSA. The overwhelming majority of protein sequence phylogenetic analyses use empirically derived amino acid substitution matrices, whose rates are fixed to specific values estimated from large numbers of real protein MSAs (Whelan et al., 2001). For example, the RTREV substitution matrix is derived from virally-encoded amino acid data (Dimmic et al., 2002), whereas CPREV is derived from chloroplast-encoded amino acid data (Adachi et al., 2000). In addition to the chosen substitution matrix, the user can typically decide whether to use in phylogenetic estimation the empirical frequencies of the amino acids in the data set (or the ones calculated during matrix construction), as well as whether to allow for any variation in the rate of evolution across sites of the protein MSA. Most proteins show substantial heterogeneity in the rate of evolution across their sequence (sites critical to function tend to be highly conserved, whereas others tend to be much more variable), so explicitly accounting for this rate heterogeneity among sites in the specification of the model of sequence evolution is generally a very good idea. One standard and very popular approach for incorporating rate heterogeneity among sites into the phylogenetic analysis uses the gamma distribution to approximate the distribution of rates in a protein MSA (Yang, 1996). This distribution is very suitable for this task

because, depending on the value of its shape parameter α, it can be either L-shaped (appropriate for MSAs that exhibit extreme rate heterogeneity) or bell-shaped (appropriate for MSAs that exhibit minor rate heterogeneity).

The RAXML program

The RAXML (Randomized Axelerated Maximum Likelihood) program has been developed to perform both sequential (on a single processor) and parallel (on multiple processors) phylogenetic analysis using the maximum likelihood optimality criterion. Historically, RAXML stems from the FASTDNAML program (Olsen et al., 1994), which in turn stems from the DNAML program (Felsenstein, 1993). Although RAXML’s design emphasis is on computationally efficient and biologically accurate analysis of very large data sets, it is also appropriate for and amenable to the analysis of data sets of any size. RAXML can use a variety of different character sets, including nucleotide, amino acid, binary, and multi-state character state data.

Versions of the RAXML program are available for the Unix/Linux, Mac, and Windows operating systems (from http://wwwkramer.in.tum.de/exelixis/software.html; Stamatakis, 2006), as well as from two Web servers (from the Swiss Institute of Bioinformatics at http://phylobench.vital-it.ch/raxml-bb/, and from the CIPRES Science Gateway at http://www.phylo.org/portal2/; Stamatakis et al., 2008). The stand-alone version is command-line based, but graphical user interface front ends are also available (from http://sourceforge.net/projects/raxmlgui/ and http://sourceforge.net/projects/wxraxml/).

The RAXML search strategy. The first step of the search strategy employed by RAXML is the generation of a starting tree. This starting tree is constructed by adding the sequences one by one in random order, and identifying their optimal location on the tree under the parsimony optimality criterion (Stamatakis et al., 2005). The random order in which sequences are added is likely to generate several different starting trees every time a new analysis is

[Figure 19.11.1, panels A to D. Panel labels: break a branch, separate the subtrees; regraft the pruned branch of the one subtree to all possible locations of all possible subtrees that are < N branches away; repeat procedure for the second subtree; optimize only the four newly created branches.]

Figure 19.11.1 The lazy subtree rearrangement (LSR) tree search strategy.
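Heuristic strategies such as LSR are needed because tree space is far too large to enumerate. The following minimal Python sketch makes that growth concrete by counting unrooted, strictly bifurcating topologies with the standard double-factorial formula (2n - 5)!!. The formula is a textbook result that this unit does not state, and the function name is purely illustrative.

```python
def count_unrooted_trees(n_sequences: int) -> int:
    """Count distinct unrooted, strictly bifurcating topologies for n labeled tips.

    Assumes the standard closed form (2n - 5)!! = 3 * 5 * ... * (2n - 5),
    which holds for n >= 3.
    """
    if n_sequences < 3:
        raise ValueError("need at least 3 sequences")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (empty product when n == 3).
    for odd in range(3, 2 * n_sequences - 4, 2):
        count *= odd
    return count


if __name__ == "__main__":
    for n in (4, 5, 10, 20, 50):
        print(f"{n:>3} sequences: {count_unrooted_trees(n):.3e} unrooted trees")
```

Even at a dozen sequences the count is already in the hundreds of millions, which is consistent with the threshold for exhaustive search mentioned earlier in the unit.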


run (especially for data sets with more than a few sequences), which allows better exploration of the tree space. If multiple analyses using different starting trees all converge on the same best tree, then confidence that this is the true best tree increases. The second step of the search strategy involves a method known as lazy subtree rearrangement or LSR (Stamatakis et al., 2005; Schmidt and von Haeseler 2009), which is summarized in Figure 19.11.1. Briefly, under LSR, all possible subtrees of a tree are clipped and reinserted at all possible locations as long as the number of branches separating the clipped and insertion points is smaller than N branches. RAXML estimates the appropriate N value for a given data set automatically, but one can also run the program with any fixed value. The LSR method is first applied on the starting tree, and subsequently multiple times on the currently best tree as the search continues, until no better tree is found.

PHYLOGENETIC ANALYSIS USING THE RAXML PROGRAM

The RAXML command line interface

Irrespective of the operating system used, the typical way to perform phylogenetic analysis with RAXML is using the command line. Invoking the RAXML command by typing

    raxmlHPC -h

and hitting Enter at the terminal cursor (typically indicated by the >, $, or % signs) prints the program version, command line options, and author and contact information. Like most command line programs, RAXML can be executed by typing the command invoking the program (i.e., raxmlHPC) followed by a series of options. In the example above, the -h option displays a long help message that describes the multitude of options available via the program. The user can use these options to specify the data to be analyzed as well as to set and control the different parameters of the analysis.

Input format for protein alignments

The RAXML program accepts protein sequence alignments in the PHYLIP format. An example PHYLIP-format alignment of part of the mitochondrial cytochrome oxidase subunit II is shown in Figure 19.11.2 (it can be downloaded from http://wwwkramer.in.tum.de/exelixis/hands-on/protein.phy and is also available as a supplementary file of this unit). Generation of PHYLIP formatted alignments is a standard feature of many different alignment programs and sequence editors (e.g., http://www-bimas.cit.nih.gov/molbio/readseq/).

Constructing a maximum likelihood tree with RAXML

We can set a simple maximum likelihood analysis in RAXML by typing:

    raxmlHPC -s protein.phy -n A1 -m PROTGAMMAWAG

The option -s protein.phy specifies the sequence data file to be analyzed. The option -n A1 specifies the file name appendix that will be added to all the output files produced by RAXML in this run, which will be in the format RAxML_filename.A1. Although RAXML will not overwrite previous results (the second time you use the same file name appendix the program will not run any analysis but will ask you to provide a different appendix), it is best practice to use a different file name appendix for every run. The option -m PROTGAMMAWAG specifies to RAXML three parameters associated with the model of sequence evolution employed: first, that we are using protein data (the PROT part); second, that we are accounting for rate heterogeneity among sites in our alignment by using the gamma distribution (the GAMMA part); and third, that we are employing the Whelan and Goldman (Whelan and Goldman, 2001) amino acid substitution matrix (the WAG part).

If we wanted to choose a different amino acid substitution matrix (e.g., the RTREV matrix), it would be necessary to simply replace the WAG part of the -m option with RTREV. To use empirical amino acid frequencies drawn from the alignment (rather than use the pre-defined frequencies that come with the matrix), all that is needed is to add the letter F to the -m option so that the RAXML command now looks like:

    raxmlHPC -s protein.phy -n A2 -m PROTGAMMARTREVF

There are a few different ways of deciding what amino acid substitution matrix should be used. As discussed above, these empirical amino acid substitution matrices are derived from several different sets of protein MSAs. One reasonable choice is to use a model that derives from data that are most similar to the data at hand. For example, for our mitochondrial sequence

10 50
Cow       MAYPMQLGFQ DATSPIMEEL LHFHDHTLMI VFLISSLVLY IISLMLTTKL
Carp      MAHPTQLGFK DAAMPVMEEL LHFHDHALMI VLLISTLVLY IITAMVSTKL
Chicken   MANHSQLGFQ DASSPIMEEL VEFHDHALMV ALAICSLVLY LLTLMLMEKL
Human     MAHAAQVGLQ DATSPIMEEL ITFHDHALMI IFLICFLVLY ALFLTLTTKL
Loach     MAHPTQLGFQ DAASPVMEEL LHFHDHALMI VFLISALVLY VIITTVSTKL
Mouse     MAYPFQLGLQ DATSPIMEEL MNFHDHTLMI VFLISSLVLY IISLMLTTKL
Rat       MAYPFQLGLQ DATSPIMEEL TNFHDHTLMI VFLISSLVLY IISLMLTTKL
Seal      MAYPLQMGLQ DATSPIMEEL LHFHDHTLMI VFLISSLVLY IISLMLTTKL
Whale     MAYPFQLGFQ DAASPIMEEL LHFHDHTLMI VFLISSLVLY IITLMLTTKL
Frog      MAHPSQLGFQ DAASPIMEEL LHFHDHTLMA VFLISTLVLY IITIMMTTKL

Figure 19.11.2 An example alignment of part of the mitochondrial cytochrome oxidase subunit II alignment in the PHYLIP format.
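To show how the layout in Figure 19.11.2 is organized, here is a minimal Python sketch of a reader for this kind of file. It assumes a simplified "relaxed, sequential" variant of PHYLIP (a header with the number of sequences and the number of columns, then one line per sequence with the name separated from the residues by whitespace); strict PHYLIP with fixed-width names or interleaved blocks is not handled, and the function name is illustrative.

```python
def parse_relaxed_phylip(text: str) -> dict[str, str]:
    """Parse a one-line-per-sequence PHYLIP alignment into {name: sequence}.

    Assumption: names contain no whitespace and every sequence fits on one
    line, possibly split into space-separated blocks as in Figure 19.11.2.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    n_taxa, n_sites = (int(token) for token in lines[0].split())
    alignment = {}
    for line in lines[1:]:
        name, *blocks = line.split()
        alignment[name] = "".join(blocks)
    if len(alignment) != n_taxa:
        raise ValueError(f"header says {n_taxa} sequences, found {len(alignment)}")
    for name, sequence in alignment.items():
        if len(sequence) != n_sites:
            raise ValueError(f"{name}: expected {n_sites} columns, got {len(sequence)}")
    return alignment


# First two rows of the example alignment, with the header adjusted to 2 sequences.
EXAMPLE = """2 50
Cow       MAYPMQLGFQ DATSPIMEEL LHFHDHTLMI VFLISSLVLY IISLMLTTKL
Carp      MAHPTQLGFK DAAMPVMEEL LHFHDHALMI VLLISTLVLY IITAMVSTKL
"""

if __name__ == "__main__":
    for name, sequence in parse_relaxed_phylip(EXAMPLE).items():
        print(name, len(sequence), sequence[:10])
```

Checking the header counts against the parsed data is a cheap way to catch truncated or mis-edited alignments before starting a long analysis.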

alignment, we could choose MTREV (Adachi and Hasegawa, 1996), an empirical substitution matrix estimated from the complete mitochondrial sequence data of 20 vertebrate species. A much more thorough, but computationally much more demanding approach is to use the program PROTTEST (Abascal et al., 2005), which calculates several different statistics to identify which model best fits the data. PROTTEST can be run online on a protein MSA from http://darwin.uvigo.es/software/prottest_server.html, although a stand-alone version of the program is also available (for a detailed theoretical and practical guide on the program see Posada, 2009). Finally, instead of using one of the standard models available, the user can actually estimate the amino acid model based on the amino acid data at hand, by replacing the WAG part with GTR, and typing:

    raxmlHPC -s protein.phy -n A3 -m PROTGAMMAGTR

Note, however, that one should employ this option only on protein alignments that contain thousands of amino acid columns, because only those contain sufficient data to estimate all possible amino acid substitution parameters.

As mentioned above, RAXML generates a starting tree by adding the sequences one by one in random order and inferring the best starting tree using the parsimony optimality criterion. Thus, each time RAXML is run, a different starting tree is generated. Because, like all heuristic search strategies, the LSR search strategy employed by RAXML is not guaranteed to find the best tree, it is customary to conduct multiple searches for the best tree. If all searches that begin from different starting trees converge on the same best tree, then the researcher’s confidence that the inferred tree is the best increases. To conduct multiple searches for the best tree, it is necessary to add the option -# n, where n is the desired number of multiple searches to perform. Thus, if the goal is to perform 10 searches, the RAXML command should look like:

    raxmlHPC -s protein.phy -n A4 -m PROTGAMMAWAGF -# 10

Visualizing the maximum likelihood tree

Examination of the contents of the directory where the different analyses were run shows that RAXML generated several different files (the number and type of files generated will vary depending on the option settings specified) from the several different analyses. These files provide detailed information and results about the analysis (e.g., the RAxML_info.A1 file), the maximum likelihood tree (e.g., the RAxML_bestTree.A1 file), the starting parsimony tree (e.g., the RAxML_parsimonyTree.A1 file), etc. A full description of the contents of each file can be found in the program’s manual (available from http://wwwkramer.in.tum.de/exelixis/oldPage/RAxML-Manual.7.0.4.pdf). Typically, the most useful output files are the tree files, which are written in the NEWICK format (a full description of the format can be found at http://evolution.genetics.washington.edu//newicktree.html), and can be opened for viewing in any of several different tree visualization programs. For example, Figure 19.11.3 shows a screenshot of the RAxML_bestTree.A1 file when opened with the tree visualization program FIGTREE (http://tree.bio.ed.ac.uk/software/figtree/).

Figure 19.11.3 Screenshot of the (unrooted) maximum likelihood tree in the FIGTREE program. Branch lengths are in substitutions per site.
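The command lines used so far (analyses A1 to A4) differ only in the run name, the substitution matrix, the optional F suffix, and the optional number of searches. As a minimal sketch of how the -m string is composed, the hypothetical Python helper below assembles the argument list. It uses only options described in this unit (-s, -n, -m, -#) and assumes the executable is installed under the name raxmlHPC; the helper itself is not part of RAXML.

```python
def raxml_command(alignment, run_name, matrix="WAG", empirical_freqs=False,
                  n_searches=None, executable="raxmlHPC"):
    """Build the argument list for a protein maximum likelihood run.

    The model string is PROT (protein data) + GAMMA (rate heterogeneity)
    + matrix name + optional F (empirical amino acid frequencies).
    """
    model = "PROTGAMMA" + matrix + ("F" if empirical_freqs else "")
    args = [executable, "-s", alignment, "-n", run_name, "-m", model]
    if n_searches is not None:
        # -# n requests n independent searches from different starting trees.
        args += ["-#", str(n_searches)]
    return args


if __name__ == "__main__":
    # Reproduces the A2 and A4 command lines shown in the text.
    print(" ".join(raxml_command("protein.phy", "A2", matrix="RTREV",
                                 empirical_freqs=True)))
    print(" ".join(raxml_command("protein.phy", "A4", empirical_freqs=True,
                                 n_searches=10)))
```

The list form can be passed directly to `subprocess.run` if the analysis is to be launched from a script, which avoids shell quoting issues with the `-#` option.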

Rooting the phylogenetic tree

Phylogenetic trees can be either rooted or unrooted. Rooted phylogenetic trees have direction, since all lineages depicted on the tree originate from the same common ancestor, which is also known as the root of the tree. In contrast, unrooted trees lack a root and, consequently, do not inform us about the direction of evolution. Examples of rooted and unrooted phylogenetic trees are shown in Figure 19.11.4.

Although the majority of biologists want to obtain and work with rooted phylogenetic trees, the overwhelming majority of molecular phylogenetic programs produce unrooted phylogenetic trees. To produce a rooted phylogenetic tree, the user must include in the set of sequences to be analyzed a sequence from a species that is known, based on independent evidence (e.g., from paleontological data), to have diverged prior to the origin of our set of sequences. Such a sequence is known as an outgroup. In RAXML, one can specify an outgroup by adding the -o sequence option, where sequence is the name of one (or more) of the sequences in the multiple alignment that we want to use as the outgroup. Thus, if we want to use the Carp sequence as the outgroup, the RAXML command should look like:

    raxmlHPC -s protein.phy -n A5 -m PROTGAMMAWAGF -o Carp

The outgroup can also consist of more than one sequence. For example, in our data set, one may want to root the phylogenetic tree using the two fish sequences (Carp and Loach) as the outgroup (as shown in panel B of Fig. 19.11.4), in which case the RAXML command should look like:

    raxmlHPC -s protein.phy -n A6 -m PROTGAMMAWAGF -o Carp,Loach

Note that if the sequences specified as the outgroup do not form a monophyletic group (i.e., a group of sequences descended from the same common ancestral sequence not shared with any other group of sequences), RAXML


Figure 19.11.4 Examples of an unrooted (A) and a rooted (B) phylogenetic tree for the example data set. The unrooted tree on the top panel, if rooted on the branch leading to the Carp and Loach sequences, corresponds to the rooted tree on the bottom. Note that in the unrooted tree neighboring sequences are not necessarily closely related.
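Whether a multi-sequence outgroup can root a tree depends on whether its members sit together on one side of a single branch of the unrooted tree. The Python sketch below illustrates that check under stated assumptions: the unrooted tree is represented simply as a list of bipartitions (one side of each internal branch), and the five-sequence topology is a hypothetical toy, not a result from this unit. RAXML performs its own check internally; this is only an illustration of the idea.

```python
def is_valid_outgroup(outgroup, splits, all_taxa):
    """Return True if the outgroup forms one side of some branch of the tree.

    splits: iterable of taxon sets, each giving one side of an internal branch.
    Single sequences (and complements of single sequences) always qualify,
    because every tip is separated from the rest by its own terminal branch.
    """
    group = frozenset(outgroup)
    taxa = frozenset(all_taxa)
    if not group or not group < taxa:
        raise ValueError("outgroup must be a non-empty proper subset of the taxa")
    rest = taxa - group
    if len(group) == 1 or len(rest) == 1:
        return True
    return any(frozenset(side) in (group, rest) for side in splits)


# Hypothetical unrooted topology: ((Carp,Loach),(Cow,Whale),Human)
TAXA = {"Carp", "Loach", "Cow", "Whale", "Human"}
SPLITS = [{"Carp", "Loach"}, {"Cow", "Whale"}]

if __name__ == "__main__":
    print(is_valid_outgroup(["Carp", "Loach"], SPLITS, TAXA))  # fish group together
    print(is_valid_outgroup(["Carp", "Cow"], SPLITS, TAXA))    # not on one side of any branch
```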

will be unable to place all of them as the outgroup. In this event, RAXML will print a warning in the RAXML information file (in this example, this is the RAxML_info.A6 file) and proceed to root the phylogeny using the first of the sequences specified in the -o option as the outgroup.

Assessing robustness of inference

The standard statistical approach for assessing robustness in the inference of phylogenetic relationships uses a technique known as bootstrapping (Felsenstein, 1985). In bootstrapping, one generates multiple data sets that have the same number of alignment columns as the original data set by randomly sampling the original alignment columns with replacement. Each bootstrap replicate data set is then analyzed in exactly the same way as the original data set, which results in the production of a maximum likelihood tree from each bootstrap replicate data set. The frequency of

occurrence of any given grouping on the set of bootstrap trees (which is known as the bootstrap support value) signifies the measure of support for that particular grouping in our data (Soltis and Soltis, 2003).

To conduct a bootstrap analysis in RAXML, it is necessary to specify two additional options. The first is the -b n option, where n can be any positive integer, and which specifies the random number seed required for the bootstrap analysis. Using the same random seed number in different runs of the same data set will result in the generation of identical bootstrap replicate data sets, so it may be desirable to pick a new random seed number every time an analysis is run. The second option is the -# n option, where n can be any positive integer, and which specifies the number of bootstrap replicates to be performed. After specifying these two options, the RAXML command should look like:

    raxmlHPC -s protein.phy -n A7 -m PROTGAMMAWAGF -b 0123 -# 100

The typical number of bootstrap replicates performed varies greatly between studies, and can range from a hundred to thousands of replicates. Because the number of bootstrap replicates required to obtain bootstrap support values of high quality varies with the type and size of data set analyzed (Pattengale et al., 2010), RAXML allows the user to automatically estimate when an appropriate number of bootstrap replicates has been performed through the use of several different stopping criteria. The logic underlying all these criteria is the same: after every 50 bootstrap replicates, the program performs 100 random splits of the bootstrap replicate set into two halves and computes statistics, which vary depending on the criterion implemented. For example, the frequency-based criterion, which is specified by setting the -# option to autoFC, determines whether enough replicates have been performed by calculating the Pearson and Sierk correlation coefficient (2005) in the two halves from the 100 splits. Bootstrapping stops if there are at least 99 splits whose halves show a correlation coefficient greater than 0.99. Thus, the RAXML command implementing the frequency-based stopping criterion should look like:

    raxmlHPC -s protein.phy -n A8 -m PROTGAMMAWAGF -b 0123 -# autoFC

Running this command, RAXML calculates that 800 bootstrap replicates are sufficient for high-quality bootstrap values and saves the results in the RAxML_bootstrap.A8 file (the final number of bootstrap replicates will vary between runs, even if the same random seed is used, because the statistics are calculated on random splits).

It is customary to visualize bootstrap support values on the maximum likelihood tree. Therefore, we can use the maximum likelihood tree generated from the A4 analysis (see above), which was saved in the RAxML_bestTree.A4 file, as the tree on which to display the bootstrap support values. We can instruct RAXML to draw bootstrap values on the maximum likelihood tree using the following command:

    raxmlHPC -n A9 -m PROTGAMMAWAGF -f b -t RAxML_bestTree.A4 -z RAxML_bootstrap.A8

Here, the -f b option specifies the analysis to be performed (draw bootstrap support values on a given tree), the -t RAxML_bestTree.A4 option specifies the tree that we want the values depicted on, whereas the -z RAxML_bootstrap.A8 option specifies the file containing the trees generated via bootstrapping.

Once the command is executed, the tree containing the bootstrap support values drawn on the maximum likelihood tree will be saved in the RAxML_bipartitions.A9 file (Fig. 19.11.5). One can also use the set of trees produced from the bootstrap replicates to construct various kinds of consensus trees that summarize their agreements. For example, strict consensus trees contain only those groupings present in all bootstrap replicate trees, whereas majority rule consensus trees contain only those groupings that are present in more than half of the bootstrap replicate trees. Consensus tree construction in RAXML uses the -J option. For example, by setting -J STRICT, one can construct a strict consensus tree (Fig. 19.11.5):

    raxmlHPC -n A10 -m PROTGAMMAWAGF -J STRICT -z RAxML_bootstrap.A8

whereas by setting -J MR one can construct a majority rule consensus tree (Fig. 19.11.5):

    raxmlHPC -n A11 -m PROTGAMMAWAGF -J MR -z RAxML_bootstrap.A8

Standard bootstrapping can be computationally very demanding, especially for larger


Figure 19.11.5 Different ways of visualizing bootstrap support values on phylogenetic trees. (A) Bootstrap support values depicted on the maximum likelihood tree. (B) The strict consensus tree, which is completely unresolved because none of the groupings was present in all bootstrap replicate trees. (C) Bootstrap support values depicted on the majority rule tree.
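The strict and majority rule definitions illustrated in Figure 19.11.5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not RAXML's own implementation: it assumes that each bootstrap replicate tree has already been reduced to the set of groupings it contains (it does not parse NEWICK files), and the function name `consensus_groupings` and the toy replicate data are hypothetical.

```python
from collections import Counter


def consensus_groupings(replicate_trees, rule):
    """Return {grouping: frequency} for groupings that pass the given rule.

    replicate_trees: list of sets; each set holds the groupings (frozensets
        of taxon names) found in one bootstrap replicate tree.
    rule: "strict" keeps groupings present in every replicate tree;
          "majority" keeps groupings present in more than half of them.
    """
    if rule not in ("strict", "majority"):
        raise ValueError("rule must be 'strict' or 'majority'")
    n = len(replicate_trees)
    # Count in how many replicate trees each grouping occurs.
    counts = Counter(g for tree in replicate_trees for g in tree)
    kept = {}
    for grouping, count in counts.items():
        passes = count == n if rule == "strict" else 2 * count > n
        if passes:
            kept[grouping] = count / n
    return kept


# Hypothetical toy input: four replicate trees described by their groupings.
MR = frozenset({"Mouse", "Rat"})
CW = frozenset({"Cow", "Whale"})
HC = frozenset({"Human", "Chicken"})
HM = frozenset({"Human", "Mouse", "Rat"})
replicates = [{MR, CW, HC}, {MR, CW, HM}, {MR, HC}, {CW, HC}]

strict = consensus_groupings(replicates, "strict")
majority = consensus_groupings(replicates, "majority")
```

In this toy input no grouping occurs in all four replicates, so the strict consensus retains nothing (a completely unresolved tree, as in panel B of the figure), whereas three groupings each occur in three of the four replicates and are therefore retained by the majority rule.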

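The frequency-based stopping check described earlier (100 random splits of the replicate set into halves, with bootstrapping stopping once at least 99 splits show a correlation above 0.99) can be sketched as follows. This is a hedged illustration rather than RAXML's code: it assumes each replicate tree is given as a set of grouping labels (plain strings here), that the quantity being correlated is the frequency of each grouping in the two halves, and that identical frequency vectors count as perfectly correlated. The function names are hypothetical.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length lists of numbers.

    Assumption for this sketch: identical vectors count as r = 1.0, and a
    constant vector paired with a different vector counts as r = 0.0.
    """
    if x == y:
        return 1.0
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / (sxx * syy) ** 0.5


def support_vector(trees, groupings):
    """Fraction of `trees` that contain each grouping, in a fixed order."""
    return [sum(g in t for t in trees) / len(trees) for g in groupings]


def fc_converged(replicates, n_splits=100, min_good=99, cutoff=0.99, seed=0):
    """Apply the split-and-correlate check to the replicates gathered so far."""
    if len(replicates) < 2:
        raise ValueError("need at least two replicate trees")
    rng = random.Random(seed)
    groupings = sorted({g for tree in replicates for g in tree})
    good = 0
    for _ in range(n_splits):
        shuffled = replicates[:]
        rng.shuffle(shuffled)          # random split into two halves
        half = len(shuffled) // 2
        first, second = shuffled[:half], shuffled[half:]
        r = pearson(support_vector(first, groupings),
                    support_vector(second, groupings))
        if r > cutoff:
            good += 1
    return good >= min_good
```

In a real run this check would be repeated after every 50 additional replicates; the sketch shows a single evaluation only.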
data sets. To facilitate faster analysis, the RAXML program also contains a rapid bootstrapping algorithm that is at least an order of magnitude faster than the standard one, while similarly accurate (Stamatakis et al., 2008). To run an analysis using this algorithm, the user need simply change the -b n option to the -x n option, where n can be any positive integer. After replacing the -b n with the -x n option, the RAXML command should look like:

    RAXMLHPC -s protein.phy -n A12 -m PROTGAMMAWAGF -x 0123 -# 100

or, if the frequency-based stopping criterion is to be implemented, it should look like:

    RAXMLHPC -s protein.phy -n A13 -m PROTGAMMAWAGF -x 0123 -# autoFC

For comparison, on a standard 2.3-GHz processor, the 100 bootstrap replicates performed for the A7 analysis took ∼266 sec to run, whereas the 100 rapid bootstrap replicates performed for the A12 analysis took ∼74 sec.

One useful feature of the program is that it allows the user to simultaneously perform maximum likelihood and rapid bootstrapping analysis by adding the -f a option, so that the RAXML command looks like:

    RAXMLHPC -f a -s protein.phy -n A14 -m PROTGAMMAWAGF -x 0123 -# 100

Execution of the command will generate the maximum likelihood tree file (RAXML_bestTree.A14) and the bootstrap replicate tree file (RAXML_bootstrap.A14), as well as the tree file containing the bootstrap support values drawn on the maximum likelihood tree (RAXML_bipartitions.A14).

Comparing different phylogenetic trees

Researchers often want to test different phylogenetic hypotheses directly. For example, one frequent question is whether a phylogenetic tree obtained from a given protein alignment is significantly different from the traditional phylogeny. Such questions can be addressed by performing tests that evaluate whether the likelihood scores of different phylogenetic trees are significantly different. One such very useful and frequently used test is the Shimodaira-Hasegawa test (or SH test), which examines whether the maximum likelihood tree is significantly better than user-supplied phylogenetic trees (Shimodaira and Hasegawa, 1999).

For example, examination of the maximum likelihood tree estimated from our data set (Fig. 19.11.3) shows that the human sequence groups with the chicken sequence and not with the other mammalian sequences, as one would expect based on the vertebrate phylogeny. In this case, we can use the SH test to evaluate whether the maximum likelihood tree is significantly better than the traditional vertebrate tree using the following RAXML command:

    RAXMLHPC -f h -s protein.phy -n A15 -m PROTGAMMAWAGF -t RAXML_bestTree.A4 -z vertebrate.tree

In this command, the -f h option specifies that we want to perform an SH test. Similar to several previous commands, the maximum likelihood tree is specified by the -t RAXML_bestTree.A4 option, and the vertebrate tree by the -z vertebrate.tree option. This latter file can be created by writing the vertebrate phylogeny for the 10 sequences used in this data set in the NEWICK format:

    (((((Human,(Mouse,Rat)),((Cow,Whale),Seal)),Chicken),Frog),Carp,Loach);

The analysis produces a single output file (RAXML_info.A15) that reports the results of the SH test in its last few lines:

    Model optimization, best Tree: -411.163389
    Found 1 trees in File vertebrate.tree
    Tree: 0 Likelihood: -423.777863 D(LH): -12.614474 SD: 6.714754
    Significantly Worse: No (5%), No (2%), No (1%)

The "best Tree: -411.163389" text reports the likelihood score (after logarithmic transformation) of the best tree, whereas the "Tree: 0 Likelihood: -423.777863" text reports the likelihood score of the vertebrate tree. The "D(LH): -12.614474 SD: 6.714754" text reports the difference D(LH) in the likelihood scores between the two trees and its standard deviation SD. Finally, the last line reports whether this difference in likelihood between the best tree and the vertebrate tree is significant at the 5%, 2%, and 1% levels. Given that the reported result is No for all three levels of significance, it can be concluded that the maximum likelihood phylogenetic tree estimated from this data set does not significantly differ from the standard vertebrate phylogeny.

Analyzing multiple data sets

Phylogenies based on single proteins are often unreliable or lack the phylogenetic signal necessary for successfully inferring phylogenies (Rokas et al., 2003). Consequently, in recent years, researchers have been increasingly analyzing multiple data sets. If the user constructs a single data matrix that contains the alignments of all the proteins and provides RAXML with information about the boundaries of the different data sets, such analyses of multiple data sets can be performed in RAXML. Importantly, RAXML allows the user to specify different models of sequence evolution for each data set and optimizes parameters separately for each data set.

For example, let us hypothesize that our example data set is actually a composite of two different protein alignments, with amino acid columns 1-30 corresponding to protein A and amino acid columns 31-50 corresponding to protein B. We can inform RAXML that our sequence file is a composite of two different data sets by creating a plain text file (let us name it partition.txt) that contains the following text:

    WAGF, proteinA = 1-30
    RTREVF, proteinB = 31-50

In this file, each line describes one protein in the data set. The first part of each line (WAGF in the first and RTREVF in the second) describes the amino acid substitution matrix we have chosen to use for each of the proteins. The second part (proteinA = 1-30 in the first and proteinB = 31-50 in the second) describes the names of the two proteins, which are arbitrary, as well as the multiple sequence alignment columns they occupy (the first 30 amino acid columns are from the alignment of protein A, whereas the last 20 are from the alignment of protein B). Somewhat confusingly, executing a multiple data set analysis in RAXML requires that we also specify the model of sequence evolution via the -m option. Although the models of amino acid evolution specified in the partition.txt file take precedence over the model specified via the -m option, this option is still necessary and useful in that it allows us to specify whether we want to account for rate heterogeneity among sites in the two proteins. Thus, by setting -m PROTGAMMAGTR, the RAXML command should look like:

    RAXMLHPC -s protein.phy -n A16 -m PROTGAMMAGTR -q partition.txt

Using this command, RAXML will use the WAG model for protein A and the RTREV model for protein B, as specified in the partition.txt file (and ignore the GTR model specified via the -m option). It will also use the empirical amino acid frequencies of each protein (because we specified so in the partition.txt file), and estimate the degree of rate heterogeneity independently for each protein (because we specified so in the -m option). Examination of the RAXML output indicates that the program analyzes each protein separately, but produces a single maximum likelihood tree (RAXML_bestTree.A16) that summarizes the results from the analysis of the two proteins.

CONCLUDING REMARKS

The theory and practice of phylogenetic analysis of sequence data have blossomed in the last three decades. As such, any protocol aimed at describing how to perform a set of analyses is bound to serve as an introduction to this rather complex field, rather than as a full description of the state-of-the-art methods of analysis. Readers interested in delving deeper into the theory and practice of molecular phylogenetics are advised to consult any of the several excellent and more in-depth descriptions of the theory and practice of phylogenetic inference (Swofford et al., 1996; Page and Holmes, 1998; Nei and Kumar, 2000; Felsenstein, 2003; Salemi et al., 2009), as well as explore several different optimality criteria and programs for phylogenetic analysis (Swofford, 2002; Zwickl, 2006; Drummond and Rambaut, 2007; Guindon et al., 2010). A remarkably up-to-date list of phylogeny programs can be found at http://evolution.genetics.washington.edu/phylip/software.html.

LITERATURE CITED

Abascal, F., Zardoya, R., and Posada, D. 2005. Prottest: Selection of best-fit models of protein evolution. Bioinformatics 21:2104-2105.
Adachi, J. and Hasegawa, M. 1996. Model of amino acid substitution in proteins encoded by mitochondrial DNA. J. Mol. Evol. 42:459-468.
Adachi, J., Waddell, P.J., Martin, W., and Hasegawa, M. 2000. Plastid genome phylogeny and a model of amino acid substitution for proteins encoded by chloroplast DNA. J. Mol. Evol. 50:348-358.
Alexeyenko, A., Tamas, I., Liu, G., and Sonnhammer, E.L. 2006. Automatic clustering of orthologs and inparalogs shared by multiple proteomes. Bioinformatics 22:E9-E15.
Capella-Gutierrez, S., Silla-Martinez, J.M., and Gabaldon, T. 2009. trimAl: A tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25:1972-1973.
Castresana, J. 2000. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol. Biol. Evol. 17:540-552.
Chor, B. and Tuller, T. 2005. Maximum likelihood of evolutionary trees: Hardness and approximation. Bioinformatics 21:97-106.
Ciccarelli, F.D., Doerks, T., von Mering, C., Creevey, C.J., Snel, B., and Bork, P. 2006. Toward automatic reconstruction of a highly resolved tree of life. Science 311:1283-1287.
Day, W.H.E., Johnson, D.S., and Sankoff, D. 1986. The computational complexity of inferring rooted phylogenies by parsimony. Math. Biosci. 81:33-42.
Dimmic, M.W., Rest, J.S., Mindell, D.P., and Goldstein, R.A. 2002. rtREV: An amino acid substitution matrix for inference of retrovirus and reverse transcriptase phylogeny. J. Mol. Evol. 55:65-73.
Drummond, A.J. and Rambaut, A. 2007. BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evol. Biol. 7:214.
Dunn, C.W., Hejnol, A., Matus, D.Q., Pang, K., Browne, W.E., Smith, S.A., Seaver, E., Rouse, G.W., Obst, M., Edgecombe, G.D., Sorensen, M.V., Haddock, S.H., Schmidt-Rhaesa, A., Okusu, A., Kristensen, R.M., Wheeler, W.C., Martindale, M.Q., and Giribet, G. 2008. Broad phylogenomic sampling improves resolution of the animal tree of life. Nature 452:745-749.
Edwards, A.W.F. 1992. Likelihood (Expanded Edition). The Johns Hopkins University Press, Baltimore, Maryland.
Felsenstein, J. 1985. Confidence limits on phylogenies: An approach using the bootstrap. Evolution 39:783-791.
Felsenstein, J. 1993. PHYLIP (Phylogeny Inference Package). Distributed by the Author, Department of Genetics, University of Washington, Seattle.
Felsenstein, J. 2003. Inferring Phylogenies. Sinauer, Sunderland, Massachusetts.
Fitzpatrick, D.A., Logue, M.E., Stajich, J.E., and Butler, G. 2006. A fungal phylogeny based on 42 complete genomes derived from supertree and combined gene analysis. BMC Evol. Biol. 6:99.
Garcia-Fernandez, J. and Holland, P.W.H. 1994. Archetypal organization of the amphioxus Hox gene cluster. Nature 370:563-566.
Guindon, S., Dufayard, J.F., Lefort, V., Anisimova, M., Hordijk, W., and Gascuel, O. 2010. New algorithms and methods to estimate maximum likelihood phylogenies: Assessing the performance of PhyML 3.0. Syst. Biol. 59:307-321.
Hittinger, C.T., Johnston, M., Tossberg, J.T., and Rokas, A. 2010. Leveraging skewed transcript abundance by RNA-Seq to increase the genomic depth of the tree of life. Proc. Natl. Acad. Sci. U.S.A. 107:1476-1481.
Huelsenbeck, J.P., Ronquist, F., Nielsen, R., and Bollback, J.P. 2001. Bayesian inference of phylogeny and its impact on evolutionary biology. Science 294:2310-2314.
James, T.Y., Kauff, F., Schoch, C.L., Matheny, P.B., Hofstetter, V., Cox, C.J., Celio, G., Gueidan, C., Fraker, E., Miadlikowska, J., Lumbsch, H.T., Rauhut, A., Reeb, V., Arnold, A.E., Amtoft, A., Stajich, J.E., Hosaka, K., Sung, G.H., Johnson, D., O'Rourke, B., Crockett, M., Binder, M., Curtis, J.M., Slot, J.C., Wang, Z., Wilson, A.W., Schussler, A., Longcore, J.E., O'Donnell, K., Mozley-Standridge, S., Porter, D., Letcher, P.M., Powell, M.J., Taylor, J.W., White, M.M., Griffith, G.W., Davies, D.R., Humber, R.A., Morton, J.B., Sugiyama, J., Rossman, A.Y., Rogers, J.D., Pfister, D.H., Hewitt, D., Hansen, K., Hambleton, S., Shoemaker, R.A., Kohlmeyer, J., Volkmann-Kohlmeyer, B., Spotts, R.A., Serdani, M., Crous, P.W., Hughes, K.W., Matsuura, K., Langer, E., Langer, G., Untereiner, W.A., Lucking, R., Budel, B., Geiser, D.M., Aptroot, A., Diederich, P., Schmitt, I., Schultz, M., Yahr, R., Hibbett, D.S., Lutzoni, F., McLaughlin, D.J., Spatafora, J.W., and Vilgalys, R. 2006. Reconstructing the early evolution of Fungi using a six-gene phylogeny. Nature 443:818-822.
Katoh, K. and Toh, H. 2008. Recent developments in the MAFFT multiple sequence alignment program. Brief. Bioinformatics 9:286-298.
Katoh, K., Misawa, K., Kuma, K., and Miyata, T. 2002. MAFFT: A novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30:3059-3066.
Kitching, I.J., Forey, P.L., Humphries, C.J., and Williams, D.M. 1998. Cladistics: The Theory and Practice of Parsimony Analysis, 2nd Ed. Oxford University Press, New York.
Kuzniar, A., van Ham, R.C.H.J., Pongor, S., and Leunissen, J.A.M. 2008. The quest for orthologs: Finding the corresponding gene across genomes. Trends Genet. 24:539-551.
Larkin, M.A., Blackshields, G., Brown, N.P., Chenna, R., McGettigan, P.A., McWilliam, H., Valentin, F., Wallace, I.M., Wilm, A., Lopez, R., Thompson, J.D., Gibson, T.J., and Higgins, D.G. 2007. Clustal W and Clustal X version 2.0. Bioinformatics 23:2947-2948.
Li, L., Stoeckert, C.J. Jr., and Roos, D.S. 2003. OrthoMCL: Identification of ortholog groups for eukaryotic genomes. Genome Res. 13:2178-2189.
Li, W-H. 1997. Molecular Evolution. Sinauer, Sunderland, Massachusetts.
Loytynoja, A. and Goldman, N. 2008. Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis. Science 320:1632-1635.
Loytynoja, A. and Goldman, N. 2010. webPRANK: A phylogeny-aware multiple sequence aligner with interactive alignment browser. BMC Bioinformatics 11:579.
Murphy, W.J., Eizirik, E., Johnson, W.E., Zhang, Y.P., Ryder, O.A., and O'Brien, S.J. 2001. Molecular phylogenetics and the origins of placental mammals. Nature 409:614-618.
Nei, M. and Kumar, S. 2000. Molecular Evolution and Phylogenetics. Oxford University Press, New York.
Notredame, C., Higgins, D.G., and Heringa, J. 2000. T-Coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302:205-217.
Olsen, G.J., Matsuda, H., Hagstrom, R., and Overbeek, R. 1994. Fastdnaml: A tool for construction of phylogenetic trees of DNA-sequences using maximum-likelihood. Comput. Appl. Biosci. 10:41-48.
Page, R.D.M. and Holmes, E.C. 1998. Molecular Evolution: A Phylogenetic Approach. Blackwell Science, Malden, Massachusetts.
Pattengale, N.D., Alipour, M., Bininda-Emonds, O.R.P., Moret, B.M.E., and Stamatakis, A. 2010. How many bootstrap replicates are necessary? J. Comput. Biol. 17:337-354.
Pearson, W.R. and Sierk, M.L. 2005. The limits of protein sequence comparison? Curr. Opin. Struct. Biol. 15:254-260.
Posada, D. 2009. Selecting models of evolution. In The Phylogenetic Handbook: A Practical Approach to Phylogenetic Analysis and Hypothesis Testing (P. Lemey, M. Salemi, and A.M. Vandamme, eds.) pp. 345-361. Cambridge University Press, Cambridge.
Remm, M., Storm, C.E., and Sonnhammer, E.L. 2001. Automatic clustering of orthologs and inparalogs from pairwise species comparisons. J. Mol. Biol. 314:1041-1052.
Rokas, A., Nylander, J.A.A., Ronquist, F., and Stone, G.N. 2002. A maximum likelihood analysis of eight phylogenetic markers in gallwasps (Hymenoptera: Cynipidae): Implications for insect phylogenetic studies. Mol. Phylogenet. Evol. 22:206-219.
Rokas, A., Williams, B.L., King, N., and Carroll, S.B. 2003. Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature 425:798-804.
Salemi, M., Vandamme, A-M., and Lemey, P. 2009. The Phylogenetic Handbook: A Practical Approach to Phylogenetic Analysis and Hypothesis Testing, 2nd Ed. Cambridge University Press, Cambridge.
Salichos, L. and Rokas, A. 2011. Evaluating ortholog prediction algorithms in a yeast model clade. PLoS One 6:e18755.
Schmidt, H.A. and von Haeseler, A. 2009. Phylogenetic inference using maximum likelihood methods. In The Phylogenetic Handbook: A Practical Approach to Phylogenetic Analysis and Hypothesis Testing (P. Lemey, M. Salemi, and A.M. Vandamme, eds.) pp. 181-209. Cambridge University Press, Cambridge.

Shimodaira, H. and Hasegawa, M. 1999. Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Mol. Biol. Evol. 16:1114-1116.
Soltis, P.S. and Soltis, D.E. 2003. Applying the bootstrap in phylogeny reconstruction. Stat. Sci. 18:256-267.
Stamatakis, A. 2006. RAXML-VI-HPC: Maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22:2688-2690.
Stamatakis, A., Ludwig, T., and Meier, H. 2005. RAXML-III: A fast program for maximum likelihood-based inference of large phylogenetic trees. Bioinformatics 21:456-463.
Stamatakis, A., Blagojevic, F., Nikolopoulos, D.S., and Antonopoulos, C.D. 2007. Exploring new search algorithms and hardware for phylogenetics: RAXML meets the IBM cell. J. VLSI Signal Process. Syst. Signal Image Video Technol. 48:271-286.
Stamatakis, A., Hoover, P., and Rougemont, J. 2008. A rapid bootstrap algorithm for the RAXML Web servers. Syst. Biol. 57:758-771.
Sterner, K.N., Raaum, R.L., Zhang, Y.P., Stewart, C.B., and Disotell, T.R. 2006. Mitochondrial data support an odd-nosed colobine clade. Mol. Phylogenet. Evol. 40:1-7.
Stewart, C.B., Schilling, J.W., and Wilson, A.C. 1987. Adaptive evolution in the stomach lysozymes of foregut fermenters. Nature 330:401-404.
Swofford, D.L. 2002. PAUP*: Phylogenetic Analysis Using Parsimony (*and Other Methods). Sinauer, Sunderland, Massachusetts.
Swofford, D.L., Olsen, G.J., Waddell, P.J., and Hillis, D.M. 1996. Phylogenetic inference. In Molecular Systematics (D.M. Hillis, C. Moritz, and B.K. Mable, eds.) pp. 407-514. Sinauer, Sunderland, Massachusetts.
Talavera, G. and Castresana, J. 2007. Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments. Syst. Biol. 56:564-577.
Wall, D.P., Fraser, H.B., and Hirsh, A.E. 2003. Detecting putative orthologs. Bioinformatics 19:1710-1711.
Whelan, S. and Goldman, N. 2001. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol. Biol. Evol. 18:691-699.
Whelan, S., Lio, P., and Goldman, N. 2001. Molecular phylogenetics: State-of-the-art methods for looking into the past. Trends Genet. 17:262-272.
Yang, Z. 1996. Among-site rate variation and its impact on phylogenetic analyses. Trends Ecol. Evol. 11:367-372.
Zwickl, D.J. 2006. Genetic algorithm approaches for the phylogenetic analysis of large biological sequence datasets under the maximum likelihood criterion. Doctoral thesis. The University of Texas at Austin.
