Phylogenomics

Casey Dunn
Assistant Professor, Ecology and Evolutionary Biology
(G Giribet)

(S Haddock)

- Casey Dunn
- Sophia Tintori (tech)
- Stefan Siebert (postdoc)
- Stephen Smith (postdoc)
- Freya Goetz (tech)
- Rebecca Helm (grad student)

Collaborators

Steve Haddock, Gonzalo Giribet, A. Stamatakis, AToL team, Cnidarian AToL team, Mark Robinson

What does “phylogenomics” mean?

1. The study of genome evolution in a phylogenetic context

2. The inference of species phylogenies with genome data

3. The inference of species phylogenies with data from lots of genes

"...stronger sampling effort aimed at genomic depth, in addition to taxonomic breadth, will be required to build high-resolution phylogenetic trees at [a broad] scale."

Sanderson, 2008 (doi:10.1126/science.1154449)

[Figure: 2.6 million sequences, 1127 taxa; labeled groups: alveolates, angiosperms, fungi, liverworts, green algae, cnidarians, molluscs, birds, mammals, amphibians, ray finned fishes]

Why collect data from lots of genes?
- Many hard problems will require lots of data
- Lots of data makes some aspects of inference easier
- These data are useful for things besides building trees
- It can be much cheaper to collect a lot of data than a little bit of data

DNA sequencing

Helicos Roche Illumina

Current Illumina costs: $2,095 for one lane (HiSeq)

Paired-end 100bp

~150 million clusters

30 gigabases of data

Current Illumina costs: ~$120 per sample to prepare a library

Current Illumina costs:

Samples per lane   Cost per sample   Clusters per sample (millions)   Gigabases per sample
1                  $2,215            150                              30
4                  $644              37.5                             7.5
8                  $382              18.75                            3.75
12                 $295              12.5                             2.5
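The per-sample figures in the table appear to follow from splitting one lane across the multiplexed samples and adding the library preparation cost for each sample. A minimal Python sketch of that arithmetic, assuming the lane price, library cost, and lane yield quoted on the slides above (the function and constant names are illustrative, not from the talk):

```python
LANE_COST = 2095.0        # dollars for one HiSeq lane (from the slides)
LIBRARY_COST = 120.0      # approximate dollars per library (from the slides)
LANE_CLUSTERS_M = 150.0   # millions of clusters per lane (approximate)
LANE_GIGABASES = 30.0     # gigabases per lane


def per_sample(samples_per_lane):
    """Return (cost, clusters in millions, gigabases) for each sample in a lane."""
    cost = LANE_COST / samples_per_lane + LIBRARY_COST
    clusters = LANE_CLUSTERS_M / samples_per_lane
    gigabases = LANE_GIGABASES / samples_per_lane
    return cost, clusters, gigabases


for n in (1, 4, 8, 12):
    cost, clusters, gigabases = per_sample(n)
    print(f"{n:>2} samples per lane: ${cost:,.0f} per sample, "
          f"{clusters:g} M clusters, {gigabases:g} Gb")
```

The lane cost is shared, but the library cost is paid for every sample, so the cost per sample falls more slowly than the data per sample as more samples are multiplexed.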

Current Illumina costs:

~$100 a gigabase

$0.0000001 per base

Will cheap sequence data allow us to answer all our questions?

Of course not.

Should we approach problems with more data or improved analysis methods?

This is a false dichotomy.

We need both!

Are other types of data now obsolete?

No! These data open entirely new opportunities for integrating genomic, morphological, and functional perspectives

Marrus claudanielis

[Image: scale bar 1cm; (MBARI)]
Dunn, Pugh, and Haddock (2005) Bull. Mar. Sci. 76:699-714

“Physonects”: Nanomia bijuga

[Figure: labeled image with scale bars; recoverable labels: Zooids, Bract, Female, Male, Excretory, Feeding]
Dunn and Wagner (2006) Development, Genes, and Evolution 216:743-754

“More Isn't Just More— More Is Different”
Wired, June 23, 2008

Gene selection as part of project design (Directed PCR)

1. Select genes
2. Amplify and sequence selected genes
3. Assemble matrix
4. Phylogenetic inference

Gene selection as part of data analysis (ESTs, shotgun genomes)

1. Sequence at random
2. Identify homologous sequences and evaluate paralogy
3. Select genes from all sequenced genes
4. Assemble matrix
5. Phylogenetic inference

Getting data...

DNA sequencing

Helicos Roche Illumina

[Diagram: a fragment of DNA with a read (sequence data) at each end]

DNA fragments can be:
- Amplified/enriched gene regions
- Genomic DNA
- cDNA (Transcriptomes)

Genome
- Start with DNA
- Get all genes, regulatory regions, etc
- Genomes can be really big
- Can be hard to identify genes

Transcriptome
- Start with mRNA
- Get a snapshot of active genes
- Almost all data is from coding regions
- Handling RNA is tricky

Overview of sequencing:

1. Get DNA or RNA
2. Make a library (chunks of DNA with adapters)
3. Prepare library for sequencing
4. Sequence
5. Process data into raw reads

[Figure: page 18 of Kristensen and Funch (2000), Figs. 16, 17: SEM of a jaw apparatus treated with sodium hypochlorite, in dorsal (Fig. 16) and lateral (Fig. 17) views]

Some options for preservation

- Freeze tissue (-80C or colder)
- RNALater (Ambion), kept cold
- Extract RNA in the field
- Homogenize in Trizol, keep cold

mRNA isolation - Lots of tissue

Isolate total RNA with Trizol

Digest DNA

Isolate mRNA (e.g. NEB S1550S, Dynabeads mRNA Kit)

mRNA isolation - Small amount of tissue

mRNA straight from tissue (e.g. NEB S1550S, Dynabeads mRNA DIRECT Kit)

mRNA isolation - Minute amount of tissue

Linear cDNA amplification (e.g. NuGen Ovation)

RNA quality is (almost) Everything!

- Avoid contamination. Reduced sample size requirements have improved this
- Quantity matters - be cautious working at the bottom range of sample requirements
- Amount of ribosomal RNA matters. There are tradeoffs between rRNA fraction and yield. If material is limiting, purify less and sequence more
- If you have enough RNA, look at its size distribution with a BioAnalyzer or similar tool

What do you do once you have high quality DNA or RNA?

Break it into little pieces!

[Diagram: DNA with a read at each end]

[Diagram: Starting material -> Fragment -> Prepare library, sequence]

Library preparation options
- Get a library preparation kit from the sequencer vendor
- Get a third party library preparation kit
- Make the library from scratch

These days, in my lab we use:

- TruSeq RNA kit (Illumina)

- NEBNext (NEB)

Prices at: www.brown.edu/Research/CGP/core/illumina/price

Data are usually delivered in fastq or qseq format.

fastq example:

@HWI-ST625:51:C02UNACXX:7:1101:1179:1962 1:N:0:TTAGGC
CTAGNTGTTGAAGAGAAGGTTCAAGAACCAAAAGAAAGCTCACAACAACATATGGT
+
=AAA#DFDDDHHFDGHEHIAFHHIIIIGICDGAGDHGGIHG@A@BFIHIIIGC@@8

@HWI-ST625:51:C02UNACXX:7:1101:1242:1983 1:N:0:TTAGGC
ATAATTTCAATGACTGGAGTAGTGAAAATGAACATAGATATGAGAATAACCGTAGA
+
ACCCFFFFFGHHHHJJJIJEHIFHIJJJJIJJJJIIJIJJIIJJJJJJJJIIJJJJ
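Each fastq record in the example is four lines: a header starting with `@`, the sequence, a `+` separator, and a quality string with one character per base. A minimal Python sketch of reading such records, assuming the quality characters use the Phred+33 encoding (an assumption; the slides do not state the encoding) and that records are never wrapped across extra lines:

```python
from io import StringIO

PHRED_OFFSET = 33  # assumption: Phred+33 quality encoding


def read_fastq(handle):
    """Yield (header, sequence, qualities) from a four-lines-per-record fastq stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            return  # end of file (or a blank line) ends the stream
        sequence = handle.readline().rstrip()
        handle.readline()  # the '+' separator line
        quality_line = handle.readline().rstrip()
        qualities = [ord(char) - PHRED_OFFSET for char in quality_line]
        yield header[1:], sequence, qualities


# The first record from the example above.
example = StringIO(
    "@HWI-ST625:51:C02UNACXX:7:1101:1179:1962 1:N:0:TTAGGC\n"
    "CTAGNTGTTGAAGAGAAGGTTCAAGAACCAAAAGAAAGCTCACAACAACATATGGT\n"
    "+\n"
    "=AAA#DFDDDHHFDGHEHIAFHHIIIIGICDGAGDHGGIHG@A@BFIHIIIGC@@8\n"
)
records = list(read_fastq(example))
header, sequence, qualities = records[0]
print(header, len(sequence), min(qualities))
```

Under this encoding the `#` at the fifth position decodes to a quality of 2, which lines up with the uncalled base `N` at the same position in the sequence.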

These instruments generate a lot of data.

The data files from one lane of an Illumina HiSeq are more than 70 GB.

A HiSeq has 16 lanes.

(Sam Fulcomer)

The first thing you’ll want to do is to examine the quality profiles of your data.

Many tools do this; we use Python and R.

Plot the quality of each base across reads:

[Plot: quality (y-axis) against base pair position (x-axis)]

Make a histogram of the mean quality of each read:

Use these plots to decide whether you need to trim off the ends of reads or discard low-quality reads.
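The two summaries described above (quality by base position, and mean quality per read) can be computed in a few lines before plotting. This is a rough Python sketch, not the lab's actual scripts; the toy quality values and the cutoff of 20 are hypothetical, and the trimming rule (cut at the first position whose mean drops below the cutoff) is only one simple choice:

```python
def mean_quality_by_position(quality_lists):
    """Mean quality at each base position across reads (reads may differ in length)."""
    totals, counts = [], []
    for qualities in quality_lists:
        for i, q in enumerate(qualities):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += q
            counts[i] += 1
    return [total / count for total, count in zip(totals, counts)]


def mean_quality_by_read(quality_lists):
    """Mean quality of each non-empty read; these values feed the histogram."""
    return [sum(q) / len(q) for q in quality_lists if q]


def trim_length(position_means, min_quality):
    """Number of leading positions to keep before the mean first drops below min_quality."""
    for i, mean in enumerate(position_means):
        if mean < min_quality:
            return i
    return len(position_means)


# Hypothetical decoded quality scores for three short reads.
reads = [
    [38, 38, 37, 30, 12],
    [40, 39, 35, 28, 8],
    [36, 36, 34, 31, 10],
]
by_position = mean_quality_by_position(reads)
by_read = mean_quality_by_read(reads)
keep = trim_length(by_position, min_quality=20)  # hypothetical cutoff
trimmed = [q[:keep] for q in reads]
print(by_position, by_read, keep)
```

In this toy data the last position has a much lower mean quality than the rest, so the sketch trims every read to its first four bases.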

Assembly

[Diagram: Starting material -> Fragment -> Prepare library, sequence -> Assembly -> Final product]

Overlap assemblers that work fine on large Sanger datasets don’t scale to these very large data sets

The number of pairwise comparisons needed to detect overlaps becomes intractable
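To get a feel for the scale, the number of all-against-all comparisons among n reads is n(n-1)/2. A small Python sketch, assuming a naive overlap search that compares every pair and using the ~150 million clusters per lane quoted earlier as the read count (real overlap assemblers use indexing to avoid many of these comparisons):

```python
def pairwise_comparisons(n_reads):
    """Number of distinct read pairs in an all-against-all comparison."""
    return n_reads * (n_reads - 1) // 2


for n_reads in (1_000, 1_000_000, 150_000_000):
    print(f"{n_reads:>11,} reads -> {pairwise_comparisons(n_reads):.3e} pairs")
```

The count grows with the square of the number of reads, so going from a thousand reads to a lane of short reads multiplies the work by many orders of magnitude.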

A new generation of de Bruijn graph assemblers has been developed to meet these challenges

Better defined memory footprint

Simpler comparisons between sequences

What is a graph?

Nodes

Edges

The first step in de Bruijn graph assembly is breaking each read down into all subsequences of length k (k-mers)

Example, read actgtcat with k = 4:

actg
ctgt
tgtc
gtca
tcat

There are 4^k possible k-mers

In practice, k is often in the 25-70 range

The k-mers are loaded into a hash table:

actg 1
ctgt 1
tgtc 1
gtca 1
tcat 1
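The k-mer decomposition and hash table above can be sketched in Python with a `Counter` acting as the hash table. This is a minimal illustration using the read from the slide; it ignores reverse complements and quality filtering, which real assemblers have to handle, and the function names are made up for the example:

```python
from collections import Counter


def kmers(read, k):
    """All subsequences of length k in a read, in order of position."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def count_kmers(reads, k):
    """Hash table mapping each k-mer to the number of times it was seen."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


table = count_kmers(["actgtcat"], k=4)
for kmer, count in table.items():
    print(kmer, count)

print("possible 4-mers:", 4 ** 4)
```

A read of length L contributes L - k + 1 k-mers, so the 8-base read yields five 4-mers, each seen once.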

A de Bruijn graph is constructed from the hash table

Each node corresponds to a k-mer sequence from the hash table

An edge connects two nodes when one k-mer extends the other by one base pair (the two overlap by k-1 bases)
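The construction described above can be sketched as follows: every k-mer in the hash table becomes a node, and a directed edge is added whenever dropping the first base of one k-mer and appending one base gives another k-mer in the table. This Python sketch uses the k-mers from the earlier slide, considers only the forward strand, and uses an illustrative function name:

```python
def build_de_bruijn(kmer_counts):
    """Map each k-mer (node) to the list of k-mers that extend it by one base."""
    nodes = set(kmer_counts)
    graph = {kmer: [] for kmer in nodes}
    for kmer in nodes:
        for base in "acgt":
            successor = kmer[1:] + base  # overlaps kmer by k-1 bases
            if successor in nodes:
                graph[kmer].append(successor)
    return graph


# Hash table from the slide: k-mers of the read actgtcat with k = 4.
kmer_counts = {"actg": 1, "ctgt": 1, "tgtc": 1, "gtca": 1, "tcat": 1}
graph = build_de_bruijn(kmer_counts)
for node in ("actg", "ctgt", "tgtc", "gtca", "tcat"):
    print(node, "->", graph[node])
```

Each node needs at most four hash lookups to find its successors, so no read-against-read comparison is required.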

[Figure: page 1168 of Schatz et al. 2010, Genome Research (dx.doi.org/10.1101/gr.101360.109), including its Figure 2: differences between an overlap graph and a de Bruijn graph for assembly]

Paths through the de Bruijn graph are assembled sequences

These paths can be very complicated due to sequencing errors, SNPs, splice variants, repeats, etc

The graphs require considerable post-processing to simplify them (pop bubbles, trim dead ends, etc)
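In the simplest case, with no branches, reading a path back into a sequence is straightforward. The Python sketch below is a simplification, not a real assembler step: it follows edges while a node has exactly one successor and stops at a dead end, a branch, or a revisited node. The complications listed above show up as nodes with more than one outgoing edge, which is where this walk stops and post-processing would have to take over. The graphs are hand-written toy examples:

```python
def walk(graph, start):
    """Spell out the sequence along the unambiguous path that begins at start."""
    sequence = start
    node = start
    seen = {start}
    while len(graph[node]) == 1:
        node = graph[node][0]
        if node in seen:
            break  # a cycle (e.g. a repeat); stop rather than loop forever
        seen.add(node)
        sequence += node[-1]  # each step adds one new base
    return sequence


# Unbranched graph for the read actgtcat (k = 4).
linear = {
    "actg": ["ctgt"], "ctgt": ["tgtc"], "tgtc": ["gtca"],
    "gtca": ["tcat"], "tcat": [],
}
# Same graph with a hypothetical extra k-mer "tcaa", e.g. from an error or SNP.
branched = dict(linear, gtca=["tcaa", "tcat"], tcaa=[])

print(walk(linear, "actg"))
print(walk(branched, "actg"))
```

The unbranched graph spells out the whole read, while a single alternative k-mer is enough to end the walk one base early.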

[Figure: pages 318 and 321 of Miller et al. 2010, Genomics 95:315-327 (dx.doi.org/10.1016/j.ygeno.2010.03.001), including Fig. 2 (a pair-wise overlap represented by a K-mer graph), Fig. 3 (spurs, bubbles, and the frayed rope pattern in K-mer graphs), and Fig. 4 (read threading, mate threading, and path following to resolve graph complexity)]

Assembly takes a lot of RAM!

One lane of Illumina HiSeq data can require hundreds of gigabytes of RAM to assemble

This is one of the largest challenges for using next-generation sequencing data to build trees

Eliminating low-quality data can greatly reduce RAM requirements

Genome
- Start with DNA
- Get all genes, regulatory regions, etc
- Genomes can be really big
- Can be hard to identify genes

Transcriptome
- Start with mRNA
- Get a snapshot of active genes
- Almost all data is from coding regions
- Handling RNA is tricky

Transcript splicing

mRNAs are spliced before leaving the nucleus

en.wikipedia.org/wiki/File:Pre-mRNA_to_mRNA.svg

Transcript splicing

With deep sequencing, many splice variants are sequenced for each gene

en.wikipedia.org/wiki/File:Alt_splicing_bestiary2.jpg

Assembly results...

Genome

...aagtcagtggagatgcaccatgagaccttggaagaagctgtccctggagacaatgtgggt...

Transcript

ggagatgcaccatgag gtccct ...aagtcagta ctgtccctgg agacaatgtgggt... ccttggaagaag

Splice variants
- Different splice variants for a given gene can vary widely in abundance
- Deep sequencing captures some “intermediate splice variants”, molecules in the process of being spliced
- Sequencing and assembly errors can be misinterpreted as splice variants
- Data may be insufficient to predict splice variants

It gets worse...

Genome sequencing depth is roughly uniform

Poisson distribution

en.wikipedia.org/wiki/File:Poisson_pmf.svg
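If reads land uniformly at random along a genome, the number of reads covering a given base is approximately Poisson distributed around the mean depth. A small sketch of that calculation follows; the read count, read length, and genome size are hypothetical values chosen only for illustration.

```python
import math

def poisson_pmf(k, lam):
    """Probability of observing exactly k reads when the mean depth is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def expected_depth(n_reads, read_length, genome_size):
    """Mean per-base depth if reads are placed uniformly at random."""
    return n_reads * read_length / genome_size

# Hypothetical numbers, not taken from the slides.
depth = expected_depth(n_reads=1_000_000, read_length=100, genome_size=10_000_000)
p_uncovered = poisson_pmf(0, depth)                    # base covered by no read
p_low = sum(poisson_pmf(k, depth) for k in range(3))   # depth of 0, 1, or 2
```

Under this model a sequence seen far less often than the mean depth is probably an error, and one seen far more often is probably a repeat. Those are the inferences that break down for transcriptomes.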

Assemblers can make assumptions about uniform distribution of sequencing effort

Expression differences mean:
- Can’t assume that the expected frequency of sequences is uniform across or even within genes
- Low copy number doesn’t necessarily indicate an error
- High copy number doesn’t necessarily indicate a repeat
- Sequencing error is hard to accommodate in transcriptomes

When assembling transcriptomes, it is essential to use an assembler that can explicitly accommodate splice variants and expression differences!

Transcriptome assemblers include:

Newbler (Roche)

Oases (www.ebi.ac.uk/~zerbino/oases)

Trinity (trinityrnaseq.sourceforge.net)

TransAbyss (www.bcgsc.ca/platform/bioinfo/software/trans-abyss)

Biological concept    Newbler term    Oases term

Gene                  isogroup        locus

Splice variant        isotig          transcript

Exon                  contig          contig

Post-assembly annotation:
- Selection of exemplar transcripts for each gene
- Blastx to a taxon restricted subset of the NCBI nr database
- Translation with prot4est

Identifying homologs

Phylogenetic tools build trees from homologous characters
Most phylogenetic tools assume character homology; they can’t evaluate it
We need to make a first pass with phenetic tools

Throw all sequences for all taxa in a study into a hat

Make all pairwise sequence comparisons (e.g. blast)

Construct a graph where nodes are sequences and edges indicate similarity

Nodes are sequences, thickness of edges indicates similarity

[Figure: similarity graph in which the sequences fall into groups labeled Gene1 through Gene9]

We use:
Blastp on a CUDA array, assign -log10(e-value) for edge weight
Throw away edges < 20
mcl for clustering (inflation ~2.1)
Apply taxon sampling criteria
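The graph-building steps can be sketched in a few lines. This is an illustration under stated assumptions, not the pipeline behind these slides: the hit tuples are hypothetical, e-values reported as 0 are floored at 1e-300 so the logarithm is defined, and plain connected components stand in for mcl, which would split large components further according to its inflation parameter.

```python
import math
from collections import defaultdict

def edge_weight(evalue, floor=1e-300):
    """-log10(e-value); e-values reported as 0 are floored to avoid log(0)."""
    return -math.log10(max(evalue, floor))

def build_graph(hits, min_weight=20.0):
    """hits: iterable of (query, subject, evalue). Returns an adjacency dict."""
    graph = defaultdict(set)
    for query, subject, evalue in hits:
        graph[query]    # register the node even if all of its edges are dropped
        graph[subject]
        if query != subject and edge_weight(evalue) >= min_weight:
            graph[query].add(subject)
            graph[subject].add(query)
    return graph

def connected_components(graph):
    """Stand-in for mcl: groups nodes linked by any retained edge."""
    seen, clusters = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, cluster = [start], set()
        while stack:
            node = stack.pop()
            if node in cluster:
                continue
            cluster.add(node)
            stack.extend(graph[node] - cluster)
        seen |= cluster
        clusters.append(cluster)
    return clusters

# Hypothetical hits with Species@sequence style names.
hits = [
    ("Species_1@seq1", "Species_2@seq7", 1e-50),
    ("Species_2@seq7", "Species_3@seq4", 1e-35),
    ("Species_3@seq4", "Species_1@seq9", 1e-5),   # weight 5, discarded
]
clusters = connected_components(build_graph(hits))
```

Taxon sampling criteria would then be applied per cluster, for example keeping only clusters that contain some minimum number of species.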

Vol. 27 no. 3 2011, pages 326–333
BIOINFORMATICS ORIGINAL PAPER
doi:10.1093/bioinformatics/btq655

Sequence analysis
Advance Access publication November 29, 2010

Improving the quality of protein similarity network clustering algorithms using the network edge weight distribution

Leonard Apeltsin 1, John H. Morris 1, Patricia C. Babbitt 1,2 and Thomas E. Ferrin 1,2,∗
1 Department of Pharmaceutical Chemistry and 2 Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, CA, USA
Associate Editor: Burkhard Rost

ABSTRACT
Motivation: Clustering protein sequence data into functionally specific families is a difficult but important problem in biological research. One useful approach for tackling this problem involves representing the sequence dataset as a protein similarity network, and afterwards clustering the network using advanced graph analysis techniques. Although a multitude of such network clustering algorithms have been developed over the past few years, comparing algorithms is often difficult because performance is affected by the specifics of network construction. We investigate an important aspect of network construction used in analyzing protein superfamilies and present a heuristic approach for improving the performance of several algorithms.
Results: We analyzed how the performance of network clustering algorithms relates to thresholding the network prior to clustering. Our results, over four different datasets, show how for each input dataset there exists an optimal threshold range over which an algorithm generates its most accurate clustering output. Our results further show how the optimal threshold range correlates with the shape of the edge weight distribution for the input similarity network. We used this correlation to develop an automated threshold selection heuristic in order to most optimally filter a similarity network prior to clustering. This heuristic allows researchers to process their protein datasets with runtime efficient network clustering algorithms without sacrificing the clustering accuracy of the final results.
Availability: Python code for implementing the automated threshold selection heuristic, together with the datasets used in our analysis, are available at http://www.rbvi.ucsf.edu/Research/cytoscape/threshold_scripts.zip.
Contact: [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.

Received on June 14, 2010; revised on November 19, 2010; accepted on November 22, 2010

1 INTRODUCTION
In the last decade, there has been an explosion in the available protein sequence data. Currently, the Uniprot database contains approximately 11 million protein sequences and is growing exponentially (Apweiler et al., 2004); a very large proportion of these proteins have not been experimentally characterized. Computational clustering approaches can provide an important means to deciphering the functions of these uncharacterized proteins in an efficient way. Recent efforts in this area, discussed below, have focused on developing and testing algorithms for clustering proteins by functional similarity based only on sequence data. These algorithms go beyond traditional clustering approaches, such as hierarchical and k-means, which require advance knowledge approximating the number of functional groups present in order to either cluster effectively or to interpret clustering output correctly. Rather, these algorithms rely on the network properties of a protein sequence dataset to cluster the data into functional groups without any prior knowledge of the group identities (Schaeffer, 2007).

Network clustering algorithms take as input a protein similarity graph (Noble et al., 2005). Vertices in the graph represent individual proteins, while edges represent the pairwise sequence similarities between the proteins. Often, BLAST (Altschul et al., 1997) scores are used as edge weights. Subsequent to input, the similarity graph is processed by the network clustering algorithm to identify distinct groups of nodes in the graph that in many cases correspond to groups of proteins that share the same function.

How the similarity graphs are processed varies with each clustering algorithm. In general, most network clustering approaches may be assigned to one of two categories; geometry-based and flow-based (Frivolt and Pok, 2006). Geometry-based approaches, such as Force (Wittkop et al., 2007), Regularized Kernel Estimation (Lu et al., 2005), spectral clustering (Paccanaro et al., 2006) and TransClust (Wittkop et al., 2010) embed the protein graph into high-dimensional space and then group the nodes into clusters based on spatial proximity. Flow-based approaches, such as the Markov Clustering Algorithm (MCL; Enright et al., 2002) and Affinity Propagation (Frey and Dueck, 2007) model the possible flow of information between nodes based on edge weight. How the information congregates across groups of nodes then determines the final output of clusters.

The differences between these two categories of algorithms reflect a difference in performance. Geometry-based approaches such as Force rely on non-linear calculations between pairwise elements in the similarity graph, leading to potentially long execution times. Flow-based approaches such as MCL rely on simple matrix and vector multiplication, which leads to relatively short execution times. However, it has been shown that Force outperforms MCL for certain similarity graphs (Wittkop et al., 2007), making the hours to seconds difference in run times a worthwhile performance trade-off.

∗To whom correspondence should be addressed.

Identifying orthologs

“The paralogy problem”

But paralogs aren’t inherently a problem

The problem is misascribing paralogs as orthologs

[Figure: gene tree across Species A, Species B, and Species C, showing gene divergence due to duplication]

[Figure, shown in three builds: gene tree of homologous sequences from Species_1 through Species_18, with tips labeled Species@sequence (for example Species_3@seq5667, Species_1@seq0524, Species_4@seq7967). Several species are represented by more than one sequence, e.g. Species_3@seq5667, Species_3@seq4966, and Species_3@seq8284.]

Isolation of Homologs    Isolation of Orthologs    Evaluation of Orthology

Phenetic Phylogenetic

Phenetic Phylogenetic
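The slides do not spell out how orthologs are isolated from a gene tree, so the following is only one simple possibility, offered as a sketch. It assumes a rooted gene tree written as nested tuples, tip labels in the Species@sequence form, and a hypothetical rule: keep the largest subtrees in which no species appears more than once.

```python
def species_of(tip):
    """'Species_3@seq5667' -> 'Species_3'."""
    return tip.split("@", 1)[0]

def tips(tree):
    """All tip labels under a node; a node is a tip string or a tuple of children."""
    if isinstance(tree, str):
        return [tree]
    return [tip for child in tree for tip in tips(child)]

def one_per_species_subtrees(tree, min_taxa=2):
    """Maximal subtrees with no species repeated, searched from the root down."""
    labels = tips(tree)
    species = [species_of(t) for t in labels]
    if len(set(species)) == len(species):
        return [sorted(labels)] if len(labels) >= min_taxa else []
    found = []
    for child in tree:  # a repeated species implies an internal node
        found.extend(one_per_species_subtrees(child, min_taxa))
    return found

# Hypothetical gene tree with one duplication at the root.
gene_tree = (
    (("Species_1@seq0524", "Species_2@seq0525"), "Species_3@seq5667"),
    (("Species_1@seq9950", "Species_2@seq5610"), "Species_3@seq4966"),
)
ortholog_sets = one_per_species_subtrees(gene_tree)
```

A subtree that passes this test is a candidate set of orthologs; it still needs to be evaluated, since differential gene loss can leave paralogs in a subtree with one sequence per species.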

Once we have subtrees of orthologs...

Align each ortholog

Concatenate!
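Concatenation joins the per-gene alignments end to end, padding with missing data wherever a taxon lacks a gene. A minimal sketch follows, assuming each alignment is a dict of taxon name to aligned sequence; the function names, the "?" missing-data symbol, and the toy sequences are illustrative choices. The second function reports the fraction of taxon-by-gene cells that are filled, which is the kind of gene sampling figure quoted for a supermatrix.

```python
def concatenate(alignments, missing="?"):
    """alignments: list of {taxon: aligned_sequence} dicts, one per gene."""
    taxa = sorted({taxon for aln in alignments for taxon in aln})
    supermatrix = {taxon: "" for taxon in taxa}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one alignment must share a length")
        width = lengths.pop()
        for taxon in taxa:
            # Taxa without this gene receive a block of missing data.
            supermatrix[taxon] += aln.get(taxon, missing * width)
    return supermatrix

def gene_sampling(alignments):
    """Fraction of taxon-by-gene cells that contain a sequence."""
    taxa = {taxon for aln in alignments for taxon in aln}
    filled = sum(len(aln) for aln in alignments)
    return filled / (len(taxa) * len(alignments))

# Toy alignments, not real data.
genes = [
    {"Species_1": "MKV-L", "Species_2": "MKVAL", "Species_3": "MRVAL"},
    {"Species_1": "GGT", "Species_3": "GAT"},
]
matrix = concatenate(genes)
occupancy = gene_sampling(genes)
```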

There are many exciting alternatives to concatenation

As these become more computationally efficient, robust to missing data, etc., they will be exciting to apply to these datasets

77 taxa, 150 genes, >20k aa

[Figure: gene sampling matrix, genes by taxa. White cells indicate a sampled gene.]

50.9% gene sampling

Dunn et al., 2008 doi:10.1038/nature06614

[Figure: phylogeny from Dunn et al., 2008 (doi:10.1038/nature06614). 150 genes, >20k aa; RAxML, WAG+Γ, 1,000 BS replicates; scale bar 0.2. Legend: bootstrap support >98%, >90%, >80%, >70%. Labeled groups include Metazoa, Bilateria, Protostomia, Deuterostomia, Lophotrochozoa, Ecdysozoa, Arthropoda, Annelida, Mollusca, Cnidaria, Porifera, Ctenophora, Clades A, B, and C, and the outgroups.]

Extracting more information from analyses

Mess with your dataset and analysis settings

Rerun your analyses

But this doesn’t work well for large analyses

Information attrition

Matrix -> (Inference) -> Tree set -> (Summary) -> “Final Product”

We throw away:
- Explicit information on distribution of support and conflict
- Non-consensus topological variation, information on variation in taxon placement

Unstable taxa can obscure support for relationships between stable taxa.

Leaf stability indices (Thorley & Wilkinson, 1999) quantify the stability of each taxon.

[Figure: the Dunn et al., 2008 phylogeny with a legend marking taxa with Leaf Stability < 90%]

Split frequencies following removal of unstable taxa

Unstable taxa have little or no influence on relationships between stable taxa

[Figure: scatter plot of split frequencies (0 to 100), 1,000 BS replicates. x-axis: unstable taxa removed before inference (trimmed from matrix). y-axis: unstable taxa removed after inference (pruned from trees). Reference line x=y.]

Treesets as onions

Least stable taxa

Most stable taxa

Investigation of stable taxa

[Figure: summary tree of relationships between stable taxa from Dunn et al., 2008 (doi:10.1038/nature06614), 150 genes, >20k aa. Legend: Posterior prob. (CAT and WAG): Both >95%; WAG >95%. Terminal labels include Ctenophores (comb jellies), Anthozoans (corals, anemones), Medusozoans (jellyfish, hydroids), Echinoderms (sea urchins, starfish), Xenoturbella, Nematodes, Nematomorphs, Kinorhynchs, Priapulids, Onychophorans (velvet worms), Branchiopod (water fleas), Hexapods (insects), Malacostracan crustaceans (crabs, shrimp), Myriapods (millipedes, centipedes), Chelicerates (spiders, horseshoe crabs), Platyhelminthes, Molluscs, Nemerteans, Leeches, Oligochaetes (earthworms), Echiurans, and Sipunculans. Higher-level labels include Cnidaria, Deuterostomia, Bilateria, Cycloneuralia, Ecdysozoa, Protostomia, Arthropoda, Lophotrochozoa, Annelida, and Clades A, B, and C.]

Tools for identifying and visualizing relationships between stable taxa

phyutility
code.google.com/p/phyutility/
Smith & Dunn (2008), Bioinformatics doi:10.1093/bioinformatics/btm619

•Calculate leaf stabilities
•Prune taxa from treesets
•Explore the positions of unstable taxa
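To make the first two operations concrete, here is a rough sketch in the spirit of triplet-based leaf stability; it is not phyutility's code and its exact formula may differ. Assumptions: trees are rooted and written as nested tuples, and a leaf's stability is the mean, over all triplets containing it, of the frequency of the most common resolution of that triplet across the tree set.

```python
from itertools import combinations
from collections import Counter

def clades(tree):
    """Return (tip set of this node, list of tip sets for every internal node)."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    tip_set, found = frozenset(), []
    for child in tree:
        child_tips, child_clades = clades(child)
        tip_set |= child_tips
        found.extend(child_clades)
    found.append(tip_set)
    return tip_set, found

def triplet_resolution(tree_clades, a, b, c):
    """Which pair of {a, b, c} is grouped to the exclusion of the third, if any."""
    for pair, other in (((a, b), c), ((a, c), b), ((b, c), a)):
        if any(pair[0] in cl and pair[1] in cl and other not in cl for cl in tree_clades):
            return frozenset(pair)
    return None  # unresolved in this tree

def leaf_stability(trees):
    """Mean, over triplets containing each leaf, of the top resolution frequency."""
    all_clades = [clades(t)[1] for t in trees]
    leaves = sorted(clades(trees[0])[0])
    scores = {leaf: [] for leaf in leaves}
    for a, b, c in combinations(leaves, 3):
        counts = Counter(triplet_resolution(cl, a, b, c) for cl in all_clades)
        counts.pop(None, None)
        top = max(counts.values(), default=0) / len(trees)
        for leaf in (a, b, c):
            scores[leaf].append(top)
    return {leaf: sum(v) / len(v) for leaf, v in scores.items()}

def prune(tree, unwanted):
    """Remove tips in `unwanted`, collapsing nodes left with a single child."""
    if isinstance(tree, str):
        return None if tree in unwanted else tree
    kept = [p for p in (prune(child, unwanted) for child in tree) if p is not None]
    if not kept:
        return None
    return kept[0] if len(kept) == 1 else tuple(kept)

# Hypothetical tree set: A, B, C, D keep the same backbone while X moves around.
trees = [
    ((("A", "X"), "B"), ("C", "D")),
    (("A", "B"), (("C", "X"), "D")),
    (("A", "B"), ("C", ("D", "X"))),
    (("A", ("B", "X")), ("C", "D")),
]
stability = leaf_stability(trees)
pruned = [prune(t, {"X"}) for t in trees]
```

In this toy set X scores lowest, and once it is pruned every tree has the same topology, which is the sense in which an unstable taxon can hide agreement among the stable ones.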

Beyond species trees

Phylogenies now generate:

- Species trees

- Extensive gene sequence data

- Well sampled gene trees

Ribosomes

rna.ucsc.edu/rnacenter/ribosome_images.html

Sequences relevant to focal phenotypes

signal transduction, morphogenesis, photosynthesis, carbon sequestration

www.rcsb.org/

But we don’t know which genes are relevant to which phenotypes

A small glimpse of a much greater schism

Genomes    Evolutionary functional genomics    Morphology, function, ecology, development

>FZTBY7Y04I0Z6U rank=0418088 x=3584.5 y=3492.0 length=457
CAAGGTCTTGAACCAACAGTTGGATACAATTCAGAATGACGGAATGAAAGAAATCCATTT
CACTGGTCATGTTGACTTTTTCCACGTGTTCTGGATCGATCTTCTCCTTTAACTTCTCCT
GAAGCTGTTGTGTTGTCTCCTGGCAGTATTCGCATGTCGTCAGTACACAACATATCAACC
TCCGTTCGTCTTCGTTGAACTTCACGTCGTTGTCCTTCAAGATGCTCGAAATACCGCCCG
TCGAAGAAACCTTGGGTAAGGCTTCCCAGCAGAAGGCGATTAGTGTATTCCACCAGGTAT
CTACTTGAACGTATCTGTAGAAAGGGAAGAAGCAGAGGTTGCCACTGCTCAACTGAACGC
ATTGCACCATACTCTTCTTGAAGAAGACAAACAACTCGCCAGCACTCGGGAGAACTGTCC
CTTCGGAATCCTCGCCGAGTTTTGGAACACCTTGTTC
>FZTBY7Y04IQ5F0 rank=0418094 x=3472.5 y=2494.5 length=288
AATGAAATATGCTGAGCAGTTCAAGTTTCTATACTCACGAAGAAACAACATTGTAGATGG
TTCATACGAACCCAACAATGAAGAGGCGGTTTGGGTTGATCCTTTAGAAGAATTGGTTGA
ACAGTTGAATAAGGGTGGTGAAGAAAAGCTGAATCTGAGAAAACTGAAGAAGAGAAATTG
GCTGGATGGTGTGAAAACTTTATCATTTGGTGAAGAACACAAAAGGTATTCCTGAATTTT
GGCTCACTGCAATGAAGAACGTTGAAATACTTGAAGATATGATTCAGG
>FZTBY7Y04IQ7J9 rank=0418096 x=3473.0 y=1143.0 length=421
AGGCCGGGGCCTTTCGATTAAGATATCTAAAAGAGTTTGGTTCTCCACGGAGCTAAGGCT
AACAAATCTACGTAAATCTTGCATTTGTTGCAACCTTCTCTATTAAAAATGTCTGACACA
TCTGTATCCGAACTTGCCTGTGTATACAGTGCCCTTATTTTATACGATGATGATATCGAT
ATCACAGGAGAAAAAAATGGCTAAAATCATCGCTGCTGCCAACGTCAACGTAGAAACCTT
CTGGCCTGGACTCTTCGCCAAGGCTCTCCAAGGACGTAACATCGGTGACCTTATCTGCAA
TGTAGGATCCTCCGCAGCCGCTGCTCCAGCCGCCGCTGCTGCTGCTGGTGATGCTCCAGC
TGCTGCTGAAGAGAAGAAGAAGAAAAGAAGGTCAGTTCAGATGAGGATCAGATGATGATA

Measuring expression

(MBARI)

(S Haddock)

Which genes are differentially expressed between bodies in a siphonophore colony?

Reference

Map Reads

Gene Count

Gene001 4

Gene002 6

Gene003 22

Gene004 1

Gene005 2
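A count table like the one above is just a tally of how many reads map to each reference gene. The sketch below assumes the mapper's output has already been reduced to (read, gene) pairs; the function names and the example mappings are hypothetical. It also shows one simple normalization, counts per million, which makes libraries of different sizes comparable.

```python
from collections import Counter

def count_reads(mappings):
    """mappings: iterable of (read_id, gene_id); gene_id is None if unmapped."""
    return Counter(gene for _read, gene in mappings if gene is not None)

def counts_per_million(counts):
    """Scale raw counts by library size (total mapped reads)."""
    total = sum(counts.values())
    return {gene: n * 1_000_000 / total for gene, n in counts.items()}

# Hypothetical mapping results for five reads.
mappings = [
    ("read1", "Gene001"),
    ("read2", "Gene003"),
    ("read3", "Gene003"),
    ("read4", None),        # unmapped read, ignored
    ("read5", "Gene002"),
]
counts = count_reads(mappings)

# The counts from the table above, normalized by library size.
table = {"Gene001": 4, "Gene002": 6, "Gene003": 22, "Gene004": 1, "Gene005": 2}
cpm = counts_per_million(table)
```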

Casey Dunn Differential Gene Expression in Nanomia bijuga

data to quantify expression. Some of these studies lack biological only a partial reference is available, we collected short-read data replication, which makes it difficult to assess the significance of the from the same samples with three different off-the-shelf expression results. The wide variation in methods across these studies provide workflows: SOLiD SAGE ( Technologies), Illumina mRNA- interesting glimpses into the benefits and drawbacks of different Seq, and Helicos Digital Gene Expression (DGE). These work- approaches for measuring expression in non-model organisms, but flows differ in sample preparation protocols (Figure S1), sequenc- such comparisons are difficult to interpret across studies since ing platform, and read mapping. All these differences have the entirely different organisms are under investigation. There is a potential to impact each workflow’s ability to measure differential pressing need for well-replicated expression studies on non-model gene expression. organisms that use multiple methods to measure expression on the Both the Helicos and SOLiD sample preparation protocols are same samples. tag based – a single read is generated from a particular region of In non-model species, reference gene sequences can be derived each sequenced mRNA molecule. In the case of Helicos Digital from the same transcript reads that are used to quantify gene Gene Expression (DGE), the protocol is designed to generate a abundance, providing a one-step approach to expression analyses single read at the 59 end of each sequenced transcript [21]. In the in non-model species. For example, the number of reads in de novo case of the SOLiD SAGE protocol, the tag site is adjacent to the assemblies can be used to measure expression [17]. However, one- 39-most NIaIII endonuclease cleavage site [22,23]. 
In the case of step reference sequencing and expression quantification is not cost Illumina mRNA-Seq, the RNA is fragmented and multiple reads effective for many studies. Assembling raw sequence reads into a are sequenced at random locations along the length of each reference of gene sequences is best served by long reads [18], but transcript. The number of mRNA-Seq reads is therefore related to quantifying gene abundance is best served by having many reads gene length as well as expression [24]. [19]. It is less expensive to collect short reads than long reads, Expression analyses of field-collected specimens, such as the so collecting long reads across all the samples to be analyzed present study, capture expression differences due to variation in (including multiple treatments and biological replicates) would genotypes, environmental history, and other factors that can therefore greatly increase the cost of the project or greatly reduce obscure or mislead the analyses of interest (tissue-specific expres- the number of reads that could be sequenced for quantification. sion in this case) [25]. It is therefore critical to design a sampling Here we use a hybrid strategy that leverages the advantages of strategy that can capture and identify these multiple effects. We long reads for assembling gene predictions and short reads for collected three replicated pairs of data, where both gastrozooids Differentialquantifying transcript abundance. Gene We apply Expression this hybrid long- and in nectophores the were Siphonophore collected from three different colonies. In read/short-read sequencing strategy to investigate differential gene contrast to collecting each tissue sample from a different colony, Nanomiaexpression between specialized bijuga zooids(Cnidaria) in the siphonophore Assessedthis paired sampling with strategy maximized Multiple our ability to Next-examine Nanomia bijuga (Figure 1 and Video S1). 
In this preliminary survey, both between-colony effects (e.g., environment, ontogeny, and we focus on two zooid types — developing gastrozooids (feeding genotype) and within-colony effects (zooid type) since there are Generationpolyps) and developing nectophores Sequencing (swimming medusae). Workflowsreplicate samples of each colony as well as of each tissue type. We used Roche1 454 sequencing, with long reads2,3 on the order of This1 study has implications1 for the analysis of gene1 expression in Stefan400 bp [20], Siebert to assemble*, Mark a partial D. Robinson gene reference, dataset. Sophia Given C. Tintorimany other, Freya taxa. Goetz The vast, majority Rebecca of species R. Helm on earth, Stephen will never A. Smiththe depth1,4 of, Nathan 454 sequencing, Shaner some5, gene Steven sequences H. D. are Haddock expected to 5, Caseybe cultured W. Dunn in the1 lab,* so addressing these important technical be full length, some to be missing one or both ends, and others to issues regarding reference completeness, workflow selection, and 1 Department of Ecology and Evolutionary Biology, Brown University, Providence, Rhode Island, United States of America, 2 Epigenetics Laboratory, Cancer Research be fragmentary (i.e., different reference sequences may come from variation in field-collected specimens is essential for the use of Program, Garvan Institute of Medical Research, Sydney, New South Wales, Australia, 3 Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, Parkville, different parts of the same gene). To get multiple independent these methods for most of the diversity of life. Robust analyses of Victoria, Australia, 4 Heidelberg Institute for Theoretical Studies, Heidelberg, Germany, 5 Monterey Bay Aquarium Research Institute, Landing, California, United gene expression in field-collected non-model organisms will enable Statesperspectives of America on the ability to assess differential expression when

Abstract We investigated differential gene expression between functionally specialized feeding polyps and swimming medusae in the siphonophore Nanomia bijuga (Cnidaria) with a hybrid long-read/short-read sequencing strategy. We assembled a set of partial gene reference sequences from long-read data (Roche 454), and generated short-read sequences from replicated tissue samples that were mapped to the references to quantify expression. We collected and compared expression data with three short-read expression workflows that differ in sample preparation, sequencing technology, and mapping tools. These workflows were Illumina mRNA-Seq, which generates sequence reads from random locations along each transcript, and two tag-based approaches, SOLiD SAGE and Helicos DGE, which generate reads from particular tag sites. Differences in expression results across workflows were mostly due to the differential impact of missing data in the partial reference sequences. When all 454-derived gene reference sequences were considered, Illumina mRNA-Seq detected more than twice as many differentially expressed (DE) reference sequences as the tag-based workflows. This discrepancy was largely due to missing tag sites in the partial reference that led to false negatives in the tag-based workflows. When only the subset of reference sequences that unambiguously have tag sites was considered, we found broad congruence across workflows, and they all identified a similar set of DE sequences. Our results are promising in several regards for gene expression studies in non-model organisms. First, we demonstrate that a hybrid long-read/short-read sequencing strategy is an effective way to collect gene expression data when an annotated genome sequence is not available. Second, our replicated sampling indicates that expression profiles are highly consistent across field-collected in this case. 
Third, the impacts of partial reference sequences on the ability to detect DE can be mitigated through workflow choice and deeper reference sequencing. Figure 1. Tissues(dx.doi.org/10.1371/journal.pone.0022953) sampled from the siphonophore Nanomia bijuga. (A) Paired samples of young nectophores (B) and young gastrozooids (C) were removed from each of three remotely operated vehicle-collected specimens (see video S1). n: nectophore, g: gastrozooid, s: stem of the colony.Casey Dunn FramesCitation: in (SiebertA) indicate S, Robinson regions MD, shown Tintori in SC, (B Goetz) and F, (C Helm). Numbers RR, et al. indicate (2011) Differential the sampled Gene zooids. Expression in the Siphonophore Nanomia bijuga (Cnidaria) Assessed doi:10.1371/journal.pone.0022953.g001with Multiple Next-Generation Sequencing Workflows. PLoS ONE 6(7): e22953. doi:10.1371/journal.pone.0022953 Editor: Johannes Jaeger, Centre for Genomic Regulation (CRG), Universitat Pompeu Fabra, Spain ReceivedPLoSApril ONE 6, | 2011; www.plosone.orgAccepted July 1, 2011; Published July 29, 2011 2 July 2011 | Volume 6 | Issue 7 | e22953 Copyright: ß 2011 Siebert et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Funding: The authors have no support or funding to report. Competing Interests: The authors have declared that no competing interests exist. * E-mail: [email protected] (SS); [email protected] (CWD)

Introduction

Siphonophores belong to Cnidaria, a diverse group of animals that also includes corals, Hydra, and jellyfish. Like a coral, each siphonophore is a colonial organism made up of many genetically identical multicellular zooids (bodies) that arise by asexual reproduction but remain attached and physiologically integrated to each other [1,2,3,4]. Unlike most other colonial animals, where all the zooids are structurally and functionally identical, siphonophore zooids are functionally specialized for particular tasks such as feeding, swimming, defense, or sexual reproduction. To date, there have been no studies of differential gene expression between functionally specialized zooids in siphonophores. Such analyses would help identify genes that specify zooid types, and play a role in the development and functions of different zooid phenotypes.

Next generation sequencing (NGS) has rapidly transformed high-throughput analyses of gene expression [5,6,7,8,9]. In sequencing-based expression studies, fragments of transcripts are sequenced and the resulting reads are mapped to known gene reference sequences. The number of reads that map to each gene sequence in the reference provides a measure of its expression level [10,11]. To date, NGS expression studies have been largely limited to model species because their well-annotated genomes provide high quality references for mapping [11,12]. There is, however, growing interest in using these tools to quantify expression in non-model species. Several studies taking a variety of approaches along these lines have recently been published. Bellin et al. used Roche 454 sequencing to assemble gene reference sequences for the grape vine, Vitis vinifera, and microarrays based on these sequences to quantify expression [13]. Fraser et al. constructed a gene reference for the guppy, Poecilia reticulata, also with Roche 454, but quantified expression with Illumina mRNA-Seq [14]. Other studies have used Illumina mRNA-Seq data rather than Roche 454 to assemble gene references, and tag-based [15] or mRNA-Seq [16] Illumina

Nanomia bijuga

(C Carré) (MBARI)

Nanomia 454 sequencing

589k reads sequenced (454 Titanium)

19,925 “genes” in reference (Newbler, CAP3)

Paired samples, 3 specimens

Swimming

(C Carré) Feeding

Replicated design

              Tissue A    Tissue B
Specimen 1    X Reads     X Reads
Specimen 2    X Reads     X Reads
Specimen 3    X Reads     X Reads
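The replicated design above can be pictured as a small count table with one library per tissue per specimen. The sketch below is only an illustration: `gene_counts`, `library_sizes`, and every number in them are made-up assumptions for a single hypothetical gene, not data from the study. It scales each count to counts per million mapped reads and reports the swimming-versus-feeding log2 ratio within each specimen, which is the comparison that pairing by specimen makes possible.

```python
import math

# Hypothetical mapped-read counts for ONE gene: (specimen, tissue) -> reads.
gene_counts = {
    (1, "swimming"): 120, (1, "feeding"): 30,
    (2, "swimming"): 200, (2, "feeding"): 45,
    (3, "swimming"): 90, (3, "feeding"): 20,
}

# Hypothetical total number of mapped reads in each library.
library_sizes = {
    (1, "swimming"): 2_000_000, (1, "feeding"): 1_500_000,
    (2, "swimming"): 3_000_000, (2, "feeding"): 2_500_000,
    (3, "swimming"): 1_800_000, (3, "feeding"): 1_600_000,
}


def counts_per_million(count, library_size):
    """Scale a raw read count by the size of its library."""
    return count * 1_000_000 / library_size


def paired_log2_ratios(counts, sizes, tissue_a="swimming", tissue_b="feeding"):
    """Return {specimen: log2(CPM in tissue_a / CPM in tissue_b)}.

    Assumes every count is nonzero; a real analysis needs a model that
    handles zeros and dispersion rather than a plain ratio.
    """
    ratios = {}
    for specimen in sorted({specimen for specimen, _ in counts}):
        cpm_a = counts_per_million(counts[(specimen, tissue_a)],
                                   sizes[(specimen, tissue_a)])
        cpm_b = counts_per_million(counts[(specimen, tissue_b)],
                                   sizes[(specimen, tissue_b)])
        ratios[specimen] = math.log2(cpm_a / cpm_b)
    return ratios


for specimen, ratio in paired_log2_ratios(gene_counts, library_sizes).items():
    print(f"Specimen {specimen}: log2(swimming/feeding) = {ratio:.2f}")
```

A consistent sign of the ratio across all three specimens is the kind of signal a replicated, paired design is meant to expose; a single unreplicated pair could not distinguish it from specimen-to-specimen variation.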

Helicos SOLiD Illumina

(dx.doi.org/10.1371/journal.pone.0022953)

Illumina

Swimming

Feeding

Swimming v. Feeding

Genes with significant DE: All genes

edgeR, Bonferroni corrected p < 0.05 (dx.doi.org/10.1371/journal.pone.0022953)

Are these differences due to differences in read numbers across workflows?

No.
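One common way to check a question like the one above is to subsample every library down to the same number of mapped reads and repeat the analysis at each depth. The sketch below is a minimal illustration of that idea, not the procedure used in the study: `subsample_counts`, the `library` table, and its numbers are hypothetical, and the function simply draws reads without replacement from a per-gene count table.

```python
import random
from collections import Counter


def subsample_counts(counts, depth, seed=0):
    """Draw `depth` mapped reads without replacement from a per-gene count table.

    `counts` maps gene name -> number of mapped reads. The table is expanded
    into one entry per read, which is simple but memory hungry; fine for a
    sketch, not for tens of millions of reads.
    """
    reads = [gene for gene, n in counts.items() for _ in range(n)]
    if depth > len(reads):
        raise ValueError("depth exceeds the number of mapped reads")
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    return Counter(rng.sample(reads, depth))


# Hypothetical library with 1,000 mapped reads spread over three genes.
library = {"geneA": 500, "geneB": 300, "geneC": 200}
subsampled = subsample_counts(library, depth=100)
print(sum(subsampled.values()), "reads kept")
```

Running the differential expression test on libraries subsampled to several depths shows whether the number of DE genes tracks sequencing depth or the workflow itself.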

[Figure: number of DE genes (y axis, 500 to 3500) versus subsampled geometric mean of 6 libraries, in millions of mapped reads (x axis, 0 to 20), for SOLiD SAGE, Helicos DGE, and Illumina mRNAseq]

Helicos (Tag-based) 5’ AAAA-3’

SOLiD (Tag-based) NlaIII NlaIII NlaIII 5’ AAAA-3’

Illumina (RNAseq) 5’ AAAA-3’

Original 454-derived reference: 19,925 genes

Reference sequences that unambiguously have tag sites: 4,255 genes
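The slide does not spell out how tag sites were identified, so the following is only a sketch under stated assumptions. NlaIII recognizes the sequence CATG, and SAGE-style tags are anchored at the NlaIII site nearest the 3' end of a transcript. The helper names `last_nlaiii_site` and `ends_with_poly_a`, and the `min_length` threshold, are hypothetical; treating a trailing run of A as evidence of a complete 3' end is a crude assumption made for illustration.

```python
NLAIII_SITE = "CATG"  # NlaIII recognition sequence


def last_nlaiii_site(sequence):
    """Return the 0-based index of the 3'-most CATG, or None if there is none."""
    position = sequence.upper().rfind(NLAIII_SITE)
    return position if position >= 0 else None


def ends_with_poly_a(sequence, min_length=8):
    """Crude proxy for a complete 3' end: a trailing run of at least min_length A's."""
    upper = sequence.upper()
    trailing_a = len(upper) - len(upper.rstrip("A"))
    return trailing_a >= min_length


# Toy reference sequences (made up for illustration).
references = {
    "complete_with_site": "GGCATGTTCATGCC" + "A" * 12,
    "partial_with_site": "GGCATGTTCATGCC",
    "complete_no_site": "GGTTCCGG" + "A" * 12,
}

for name, seq in references.items():
    usable = last_nlaiii_site(seq) is not None and ends_with_poly_a(seq)
    print(name, "usable for tag mapping:", usable)
```

Under these assumptions only a reference that both reaches the poly-A tail and contains a CATG can be assigned a tag site with confidence, which is one way a 454-derived reference of partial transcripts could shrink to a much smaller usable set.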

Genes with significant DE: Genes with complete 3’ end

edgeR, Bonferroni corrected p < 0.05 (dx.doi.org/10.1371/journal.pone.0022953)

Distribution of sequencing effort

[Figure: fraction of total number of mapped reads (y axis, 0.0 to 1.0) versus fraction of total number of genes (x axis, 0.0 to 0.5), in two panels (all genes; complete at 3’ end), for SOLiD SAGE, Helicos DGE, and Illumina mRNAseq] (dx.doi.org/10.1371/journal.pone.0022953)

Where to next? Characterization of genes with significant differential expression

Mini-collagen

[Figure: overall expression, Swimming versus Feeding. Red genes have significant differential expression.]

swimming bodies, feeding bodies (C Carré)

Cells expressing mini-collagen are blue

[Figure: overall expression, Swimming versus Feeding. Red genes have significant differential expression.]

Uh oh. “Data deluge” “Firehose of data” “I’m drowning in data.” “Data overload”

The problem isn’t too much data.

We need more data that tell us about our data

What other data do we need? Comparative data - we need to be looking at a lot more than one species at a time.

Current approach: Which genes have expression correlated with my phenotype of interest?

New approach: Which genes have evolutionary changes in expression that are coincident with changes in my phenotype of interest?

Analyze expression data on phylogenies

Expression data, Gene trees

[Figure: expression data mapped onto gene trees. Tips are labeled Species@sequence, for example Species_3@seq5667, Species_1@seq0524, Species_4@seq7967, Species_2@seq0525, and Species_12@seq9013; the tips span Species_1 through Species_18.]
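The tip labels in the gene trees above follow a Species@sequence pattern, which is what lets expression values for a sequence be tied back to the species it came from. The sketch below shows one way to split such labels and group sequences by species. The function names are hypothetical, and the only assumption about the format is that a single "@" separates the species name from the sequence identifier.

```python
from collections import defaultdict


def parse_tip_label(label):
    """Split 'Species_3@seq5667' into ('Species_3', 'seq5667')."""
    species, sequence_id = label.split("@", 1)
    return species, sequence_id


def sequences_by_species(labels):
    """Group sequence identifiers by the species named in each tip label."""
    grouped = defaultdict(list)
    for label in labels:
        species, sequence_id = parse_tip_label(label)
        grouped[species].append(sequence_id)
    return dict(grouped)


# A few tip labels of the kind shown on the slide.
tips = [
    "Species_3@seq5667",
    "Species_1@seq0524",
    "Species_4@seq7967",
    "Species_1@seq7156",
]

for species, sequence_ids in sorted(sequences_by_species(tips).items()):
    print(species, sequence_ids)
```

A species that appears more than once in the same gene tree, as Species_1 does here, signals multiple gene copies, which matters when expression is compared across species.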

The state of comparative biology

New tools are going to transform comparative biology

But the biggest impact will be how they enable comparative biology to transform the rest of biology

Mechanisms and diversity

20th Century                              21st Century
Experimental work in model organism       Comparative work in nonmodel organisms
Comparative work in nonmodel organisms    Experimental work in model organism

[Figure: number of species (diversity) versus mechanistic detail. Labeled regions: 20th century model species; 19th, 20th century comparative biology; 21st century biology]

Computation

Field Lab

Computational

practical computing for biologists

Steven H. D. Haddock
The Monterey Bay Aquarium Research Institute, and University of California, Santa Cruz

Casey W. Dunn
Department of Ecology and Evolutionary Biology, Brown University

Sinauer Associates, Inc. • Publishers
Sunderland, Massachusetts U.S.A.

goals

To show you how to use general tools to address the day-to-day computational challenges faced by biologists.

goals

We focus on the entire computer as a general analysis environment, rather than focus on one particular type of analysis or analysis tool.

The material we present can be thought of as glue to hold together and automate your existing analysis tools, and as a general purpose workbench for creating new analysis tools.

goals

Will we show you how to convert file formats so that you can use the output of one program as input to another? Yes.
Will we show you how to use command-line tools to automate a series of analyses? Yes.
Will we show you how to use a remote cluster to run programs? Yes.

goals

Will we show you how to optimize an algorithm to speed it up ten fold? No.
Will we explain maximum likelihood? No.
Will we walk you through microarray data analysis? No.

contents

PART I: Text Files

PART II: The Shell

PART III: Programming

PART IV: Combining Methods

PART V: Graphics

PART VI: Advanced Topics

Appendices

sample content

Command-line Operations: The Shell

Notice that the asterisk doesn’t include text beyond the path divider (/). Therefore, ls [illegible]*.txt is not equivalent to the command above, because it would only show files in the current directory that start with [illegible] and end with .txt. The construction of */*.txt is a convenient way to indicate all the text files in all immediate subfolders of your current directory. It is even possible to search deeper folders, for instance with */*/*.txt. These commands only show files at the exact specified number of levels, so they would not show a .txt file in the current directory.

When using wildcards, if ls finds a file that matches, it lists the filename; if it finds a directory that matches, it lists the directory along with its contents.

[Sidebar] Wildcards carry the risk of causing problems through unanticipated matches. You especially want to be careful before using wildcards to remove, copy, or move files; this will help keep you from accidentally modifying files you didn’t realize were there. It is often wise to test wildcards with a harmless ls. This will give you a list of files recognized by the command so you can check to see if it includes any that you didn’t anticipate.

Copying and moving multiple files

Using wildcards, you can quickly copy or move multiple files with names that match a certain pattern. In the following example, all of the files that end with .txt in the [illegible] directory are copied to the [illegible] folder:

[Terminal listing illegible in the extraction]

This should begin to give you a sense of some of the tasks that can be easier working at the command line rather than in a graphical user interface. Both the copy and move commands, used in conjunction with wildcards, are significant time-savers when dealing with large numbers of files. You can imagine, for instance, how you could gather all your physiology data from a particular species by using the taxonomic name at the beginning and the data format at the end of a path (Nanomia*.dat) while ignoring any similarly named images or documents that might be in that folder.

Ending your terminal session

To end your session, type exit. You can just close the window, but this is a bit like unplugging your computer without turning it off first: if there are any programs still running in the terminal window, they will come unceremoniously to a halt. Closing the window is also not an option when you log in to a different computer from within your terminal. If exit does not work, some shells use logout or quit. (Footnote 7: If none of those work, then pull the plug.)

Beginning Python Programming

In addition to converting strings to floats, float() can also convert integers to floats. Modify the SeqLength line of your code to include the float function as follows:

    SeqLength = float(len(DNASeq))

Functions can be nested, that is, placed inside one another. In this case, the result from the function len() is evaluated first and used as the input to the function float(). The output of that function is what is assigned to the variable SeqLength. This is a bit like the pipe at the command line: data can be processed through several steps without writing them to variables or files at each stage. With this modification, the variable SeqLength is now a float rather than an integer. If you run the program at this point, notice that the length is now reported as 6.0 instead of 6, indicating that it can have a fractional part (even though that fractional part is currently 0).

Since SeqLength is used in each calculation, changing it to a float is sufficient to ensure that all of the subsequent division operations will also result in floats. To avoid having to use str() to convert the resulting fraction to a string, we’ll combine the print operation with the calculation into one line. At the end of the existing script, add in a print statement like this for each of the four nucleotides:

    print "A:", NumberA/SeqLength

Here is the whole program up to this point:

    #!/usr/bin/env python
    DNASeq = "ATGAAC"
    print "Sequence:", DNASeq
    SeqLength = float(len(DNASeq))
    print "Sequence Length:", SeqLength
    NumberA = DNASeq.count("A")
    NumberC = DNASeq.count("C")
    NumberG = DNASeq.count("G")
    NumberT = DNASeq.count("T")
    print "A:", NumberA/SeqLength
    print "C:", NumberC/SeqLength
    print "G:", NumberG/SeqLength
    print "T:", NumberT/SeqLength
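The Python sample page above was written for Python 2, where print is a statement and dividing two integers discards the fractional part, which is why the length is wrapped in float(). As a hedged sketch of the same calculation in Python 3, the block below keeps the nested float(len(...)) call to mirror the book's point about nesting, although Python 3's / operator already returns a float. The function name `nucleotide_fractions` and the example sequence are illustrative choices, not taken from the book.

```python
def nucleotide_fractions(dna_seq):
    """Return the fraction of A, C, G, and T in a non-empty DNA sequence."""
    # Nested calls: len() is evaluated first, then float() converts its result.
    seq_length = float(len(dna_seq))
    return {base: dna_seq.count(base) / seq_length for base in "ACGT"}


dna_seq = "ATGAAC"  # illustrative sequence
print("Sequence:", dna_seq)
print("Sequence Length:", float(len(dna_seq)))
for base, fraction in nucleotide_fractions(dna_seq).items():
    print(f"{base}: {fraction}")
```

Collecting the four counts in a dictionary comprehension replaces the four nearly identical count and print lines, at the cost of being a little less explicit for a first program.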

sample content

Advanced Shell and Pipelines

grep ATOM
    REMARK 3 NUMBER OF NON-HYDROGEN ATOMS USED IN REFINEMENT.
    REMARK 3 PROTEIN ATOMS : 3650
    REMARK 470 M RES CSSEQI ATOMS
    REMARK 500 RMS DISTANCE OF ALL ATOMS FROM THE BEST-FIT PLANE
    REMARK 500 RMSD 0.02 ANGSTROMS, OR AT LEAST ONE ATOM HAS

grep -v REMARK
    ATOM 1 N ALA A 1 -14.093 60.494 -9.249 1.00 42.10
    ATOM 2 CA ALA A 1 -14.989 61.651 -8.981 1.00 41.80
    ATOM 3 C ALA A 1 -14.809 62.769 -10.006 1.00 41.60
    ATOM 9 O SER A 2 -11.264 62.734 -12.155 1.00 39.50
    ATOM 10 CB SER A 2 -13.236 65.292 -11.216 1.00 39.90
    ATOM 11 OG SER A 2 -12.004 65.880 -11.497 1.00 39.90
    ATOM 12 N LYS A 3 -12.516 63.462 -13.894 1.00 38.90
    ATOM 13 CA LYS A 3 -11.712 62.828 -14.936 1.00 38.10
    ...
    ATOM 3644 CD1 ILE B 229 37.302 62.306 9.573 1.00 42.30
    ATOM 3645 N THR B 230 39.340 65.048 4.879 1.00 48.70
    ATOM 3646 CA THR B 230 39.969 64.839 3.566 1.00 50.40
    ATOM 3647 C THR B 230 41.207 63.924 3.637 1.00 51.30

cut -c 18-21,24-26   sort | uniq   cut -f 1 -d " "   uniq -c   sort -nr
ALA 1                ALA 1         ALA               9 ALA     21 GLY
ALA 1                ALA 37        ALA               7 ARG     19 LYS
ALA 1                ALA 87        ALA               13 ASN    18 LEU
ALA 1                ALA 110       ALA               17 ASP    17 VAL
ALA 1                ALA 154       ALA               2 CYS     17 ASP
SER 2                ALA 179       ALA               7 GLN     15 THR
SER 2                ALA 206       ALA               15 GLU    15 GLU
SER 2                ALA 226       ALA               21 GLY    13 PHE
SER 2                ALA 227       ALA               9 HIS     13 ASN
SER 2                ARG 73        ARG               12 ILE    12 ILE
...                  ...           ...               18 LEU    11 SER
ILE 229              VAL 68        VAL               19 LYS    10 TYR
ILE 229              VAL 93        VAL               4 MET     10 PRO
ILE 229              VAL 112       VAL               13 PHE    9 HIS
THR 230              VAL 120       VAL               10 PRO    9 ALA
THR 230              VAL 150       VAL               11 SER    7 GLN
THR 230              VAL 163       VAL               15 THR    7 ARG
THR 230              VAL 176       VAL               1 TRP     4 MET
THR 230              VAL 193       VAL               10 TYR    2 CYS
THR 230              VAL 219       VAL               17 VAL    1 TRP
THR 230              VAL 224       VAL

FIGURE 16.1 The successive extractions and modifications made by each command in the example pipeline. Orange boxes show commands and other boxes show the output once those commands have been added to the pipeline.

The first step in doing this will be to use the cut command to extract just the amino acid three-letter code and the numerical position, characters 18 to 21 and 24 to 26. (The intervening A or B indicates which repeated subunit includes that

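The same sequence of extractions shown in Figure 16.1 can be sketched in Python, which may help when the pipeline needs to become part of a larger script. This is an illustration rather than the book's code: `count_residues` and `make_atom_line` are hypothetical names, and the toy records follow a simplified fixed-width layout in which characters 18 to 21 hold the residue name and characters 24 to 26 hold the residue number. Each step is annotated with the shell command it stands in for.

```python
from collections import Counter


def count_residues(pdb_lines):
    """Count how many distinct residue positions carry each amino acid."""
    unique_residues = set()
    for line in pdb_lines:
        # grep ATOM | grep -v REMARK
        if "ATOM" in line and "REMARK" not in line:
            # cut -c 18-21,24-26 (0-based slices), then sort | uniq via the set
            unique_residues.add(line[17:21] + line[23:26])
    # cut -f 1 -d " "
    names = (key.split(" ")[0] for key in unique_residues)
    # uniq -c | sort -nr
    return Counter(names).most_common()


def make_atom_line(serial, atom, residue, chain, number):
    """Build a simplified fixed-width ATOM record for the demonstration."""
    return f"ATOM  {serial:>5} {atom:<4} {residue:>3} {chain}{number:>4}"


sample = [
    "REMARK   3   PROTEIN ATOMS            : 3650",
    make_atom_line(1, "N", "ALA", "A", 1),
    make_atom_line(2, "CA", "ALA", "A", 1),
    make_atom_line(9, "O", "SER", "A", 2),
    make_atom_line(20, "N", "ALA", "A", 37),
    make_atom_line(3645, "N", "THR", "B", 230),
]

for name, count in count_residues(sample):
    print(count, name)
```

As in the shell version, the chain letter is skipped, so a residue with the same name and number in both subunits is counted once.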

To start the process, scan or photograph the image, import it into your illustration program, and move it to its own locked layer. Then create drawing layers above the locked image and trace the image using the pen tool, creating Bézier curves as described in the next section. You can hide or delete the layer containing the original image when you are done. It can also be helpful to make the background image invisible or semi-transparent, to quickly check the overall look of your tracing. A drawing tablet is often more convenient to use for tracing and drawing than a mouse or trackpad, especially for bigger projects. You would not want to use a pixel art program such as Photoshop for tracing, as this would result in a simplified pixel image of the original pixel image, rather than a clean vector art representation.

Some programs have an auto-trace feature that can help automate vector tracing, though the results almost always require subsequent adjustment. The Live Trace feature of Adobe Illustrator lets you dynamically control the threshold for identifying which image boundaries should become lines. To create single outlines instead of filled objects, choose Technical Drawing from the Live Trace pop-up menu. Once you have an object you like, you can select Object > Live Trace > Expand to turn the trace into an editable outline.

Anatomy of vector art

Bézier curves

At its core, vector art is composed of anchor points and lines that connect anchor points. These lines are known as Bézier curves (pronounced bezz-ee-ay). Vector drawing programs have the standard suite of tools for creating boxes and straight lines, but to get full control over your illustrations, you will need to learn how to manipulate these curves. While they seem confusing at first (and they are hard to describe in words), once you understand them you will find it much easier to draw what you see in your mind.

A Bézier curve is a line that intersects a series of anchor points. Anchor points are sort of like pins stuck through a very flexible rod, which represents the line. The line must pass through each anchor point, and its shape is controlled by the

[Figure 18.2 labels: handle, control line, corner anchor, smooth anchor, end anchor]

FIGURE 18.2 Bézier curve showing anchor points, handles, and control lines
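To make the roles of anchors and handles in the figure concrete, the sketch below evaluates one curve segment numerically. It assumes the segment is a cubic Bézier in which the two anchors are the end points and the two handles are the inner control points, which is the usual convention in vector illustration programs; the function name and the coordinates are hypothetical. The point is found with de Casteljau's algorithm, that is, repeated linear interpolation between neighboring points.

```python
def cubic_bezier_point(anchor_start, handle_start, handle_end, anchor_end, t):
    """Return the point at parameter t (0 to 1) on one cubic Bezier segment."""

    def lerp(p, q):
        # Linear interpolation between points p and q at parameter t.
        return tuple(a + (b - a) * t for a, b in zip(p, q))

    # First round: interpolate along the three control-polygon edges.
    a = lerp(anchor_start, handle_start)
    b = lerp(handle_start, handle_end)
    c = lerp(handle_end, anchor_end)
    # Second and third rounds collapse the three points down to one.
    d = lerp(a, b)
    e = lerp(b, c)
    return lerp(d, e)


# Hypothetical segment: anchors at (0, 0) and (1, 0), handles pulled upward.
segment = ((0, 0), (0, 1), (1, 1), (1, 0))
for step in range(5):
    t = step / 4
    print(t, cubic_bezier_point(*segment, t))
```

The curve starts and ends exactly on the anchors but only bends toward the handles without touching them, which matches the description of anchor points as pins and handles as the controls that shape the line between them.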
