Multiple alignments and ClustalW

1. BLAST idea:
   a. Filter out low-complexity regions (tandem repeats, that sort of thing) [optional]
   b. Compile a list of high-scoring strings ("words", in BLAST jargon) of fixed length in the query (threshold T)
   c. Extend the word hits into alignments (high-scoring pairs)
   d. Report high-scoring pairs: score at least S (or an E-value lower than some threshold)
2. Multiple sequence alignments:
   a. Attempts to extend dynamic programming techniques to multiple sequences run into problems after only a few sequences (8 average proteins were a problem in the early 2000s)
   b. approach
   c. Idea (progressive approach):
      i. Homologous sequences are evolutionarily related
      ii. Build the multiple alignment by a series of pairwise alignments based on some tree (the initial tree, or guide tree)
      iii. Add in more distantly related sequences
   d. Progressive example:

      1. NYLS & NKYLS:
            N-YLS
            NKYLS
         Profile: N(K|-)YLS

      2. NFS & NFLS:
            NF-S
            NFLS
         Profile: NF(L|-)S

      3. N(K|-)YLS & NF(L|-)S:
            N-YLS
            NKYLS
            N-F-S
            N-FLS
         Profile: N(K|-)(Y|F)(L|-)S
   e. Assessment:
      i. Works great for fairly similar sequences
      ii. Not so well for highly divergent ones
   f. Two problems:
      i. Local minimum problem: sequences are added greedily based on the tree, so the global solution might be missed
      ii. Alignment parameters: mistakes (misaligned regions) made early in the procedure can’t be corrected later
   g. ClustalW does multiple alignments and attempts to solve the alignment parameter problem:
      i. Gap costs are varied dynamically based on position and residue type
      ii. Weight matrices are changed as the level of divergence between sequences increases (say going from PAM30 -> PAM60)
      iii. Sequences are weighted according to similarity
         1. Very similar sequences are down-weighted
         2. Relatively unique sequences are up-weighted
      iv. Neighbor Joining trees are used to form the initial tree
   h. Some details of the algorithm:
      i. Form a distance matrix & pairwise alignments
         1. Score of the best alignment between two sequences, minus gap penalties, using a fast approximation
         2. Or use a slower, deterministic method (say Smith-Waterman)
      ii. Generate a guide tree using Neighbor Joining
         1. Place the root at the midpoint of the longest chain of consecutive edges
         2. The weight of a sequence is related to its distance from the root
         3. Sequences that share a branch with other sequences share the weight of that branch
      iii. Now align the sequences progressively using the branching order in the guide tree. Gaps are added to the profile of an existing multiple sequence alignment and scoring is adjusted as needed (I’m skipping those details)
   i. ClustalW example from people.sc.fsu.edu/~swofford:
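Separately from the ClustalW figure, here is a minimal Python sketch of the progressive example in item d. The pairwise step is plain Needleman-Wunsch with made-up scores (match +1, mismatch -1, gap -1); that scoring is an assumption for illustration, not what ClustalW uses. The helper `profile` (a hypothetical name) only prints an alignment in the N(K|-)YLS notation. Step 3, aligning two profiles, is not implemented here; the four-row alignment is simply copied from the example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global pairwise alignment; returns the two gapped strings."""
    def sub(x, y):
        return match if x == y else mismatch

    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),  # align two residues
                score[i - 1][j] + gap,                          # gap in b
                score[i][j - 1] + gap,                          # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def profile(aligned):
    """Write an alignment column by column, e.g. N(K|-)YLS."""
    parts = []
    for column in zip(*aligned):
        symbols = [c for c in dict.fromkeys(column) if c != "-"]
        if "-" in column:
            symbols.append("-")  # list the gap last, as in the notes
        parts.append(symbols[0] if len(symbols) == 1 else "(" + "|".join(symbols) + ")")
    return "".join(parts)


pair1 = needleman_wunsch("NYLS", "NKYLS")
pair2 = needleman_wunsch("NFS", "NFLS")
print(profile(pair1))  # N(K|-)YLS
print(profile(pair2))  # NF(L|-)S
# Final alignment taken from the example (profile-profile step not implemented):
print(profile(["N-YLS", "NKYLS", "N-F-S", "N-FLS"]))  # N(K|-)(Y|F)(L|-)S
```

Note how the gap placed in step 1 is never revisited in step 3; that is the "mistakes early can't be corrected later" problem in miniature.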

   j. Weaknesses of ClustalW:
      i. If the sequences are similar only in small sub-regions, they may be misaligned, because ClustalW uses global alignment, not local
      ii. If one sequence contains a large insertion relative to the rest, the alignment is prone to error
      iii. If one sequence contains a repeated element while another contains only one copy of it, ClustalW may split the single domain into two half-domains
   k. Other examples:
      i. Clustal (no W means no weights)
      ii. ClustalX (really just a graphical front end for ClustalW)
      iii. T-COFFEE (slower than Clustal but a bit more accurate)
      iv. Many others
3. Genome sequencing:
   a. The first genome of a free-living organism was sequenced in 1995 (Haemophilus influenzae)
   b. 3 years later: Caenorhabditis elegans
   c. Human genome:
      i. Draft released in 2000 (Human Genome Project)
      ii. Completed in 2003 (HGP)
      iii. Some regions are harder to sequence than others
   d. Eukaryotes and prokaryotes are quite different
      i. 1.4% of the human and mouse genomes encodes genes
      ii. Only 5% of both genomes is highly conserved
      iii. 80% of the genes (however) are orthologs
   e. Some sequencing technologies & techniques
      i. PCR (polymerase chain reaction)
         1. Chemical reaction (most sequencing technologies use PCR at some point in their process)
         2. Amplifies (copies) DNA (doesn’t tell you what it is, though)
         3. Developed by Kary Mullis in 1984 (won the Nobel Prize in 1993)
         4. Ingredients:
            a. DNA region to be amplified
            b. Two primers
               i. short bits of single-stranded DNA
               ii. complementary to the DNA sequences at the 5’ and 3’ ends of the region to be amplified
            c. DNA polymerase (usually Taq)
            d. Lots of deoxynucleoside triphosphates (dNTPs), the building blocks out of which the new DNA will be assembled
            e. Buffer
            f. Some other stuff
         5. http://www.sumanasinc.com/webcontent/animations/content/pcr.html
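A rough in-silico picture of what the PCR ingredients do, as a hedged Python sketch. The function names, the toy template and the primer sequences are all made up for illustration. The reverse primer is written 5'->3', so its binding site on the top strand is its reverse complement. The doubling per cycle is the idealized case; real reactions fall short of it and eventually plateau.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T only)."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs[base] for base in reversed(seq))


def amplicon(template, fwd_primer, rev_primer):
    """Region of the template bracketed by the two primers, or None."""
    start = template.find(fwd_primer)
    if start == -1:
        return None
    site = reverse_complement(rev_primer)  # where the reverse primer anneals
    end = template.find(site, start + len(fwd_primer))
    if end == -1:
        return None
    return template[start:end + len(site)]


def copies_after(cycles, start_copies=1):
    """Idealized amplification: every cycle doubles the number of copies."""
    return start_copies * 2 ** cycles


template = "TTGACCATGGCTAGCTAGGATCCAAGT"  # toy sequence, not a real gene
product = amplicon(template, fwd_primer="ACCATG", rev_primer="GATCCT")
print(product)           # ACCATGGCTAGCTAGGATC
print(copies_after(30))  # 1073741824
```

The output is just more copies of the same region: as the notes say, PCR amplifies DNA but does not tell you what the sequence is.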

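      ii. Sanger sequencing
         1. Frederick Sanger (first method in 1975; the dideoxy chain-termination method described here followed in 1977)
         2. Components:
            a. Single-stranded DNA to be sequenced (heat is often used to denature)
            b. A DNA primer (complementary to the location at which sequencing is to start!)
            c. DNA polymerase
            d. A bunch of dNTPs
            e. Some specially labeled dideoxynucleoside triphosphates (ddNTPs), which stop chain extension when they are incorporated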
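A toy Python simulation of why these components yield a sequence: each growing strand stops at a random position when a labeled ddNTP is incorporated, and sorting the fragments by length spells out the new strand one base at a time. This is an idealized sketch; the function names and the termination probability are assumptions. For simplicity the input is the strand being synthesized rather than its template.

```python
import random


def sanger_fragments(new_strand, n_molecules=2000, dd_fraction=0.2, seed=0):
    """Simulate chain termination; returns a set of (length, last_base)."""
    rng = random.Random(seed)
    fragments = set()
    for _ in range(n_molecules):
        for position, base in enumerate(new_strand, start=1):
            # With probability dd_fraction a ddNTP is incorporated instead
            # of a dNTP, which ends this molecule at the current base.
            if rng.random() < dd_fraction:
                fragments.add((position, base))
                break
    return fragments


def read_gel(fragments):
    """Order fragments from shortest to longest and read the labeled ends."""
    return "".join(base for _, base in sorted(fragments))


strand = "GATTACAGATTC"  # toy sequence
print(read_gel(sanger_fragments(strand)))
```

With enough molecules every length is represented, so the read matches the strand; with too few molecules, or a termination rate that is too high, the longer fragments go missing and the read is cut short.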

         3. http://smcg.cifn.unam.mx/enp-unam/03-EstructuraDelGenoma/animaciones/secuencia.swf
   f. Shotgun sequencing:
      i. Sample DNA is amplified
      ii. It is cut up into smaller pieces
      iii. Each piece is sequenced
      iv. The pieces are assembled programmatically
      v. http://smcg.cifn.unam.mx/enp-unam/03-EstructuraDelGenoma/animaciones/humanShot.swf
   g. High-throughput sequencing:
      i. Illumina and Helicos
         1. Sequencing by synthesis (dye-terminated)
         2. Massively parallel
         3. Fragments are bound to a surface and sequenced in parallel
         4. Reversible terminators are used to add 1 base at a time
         5. The terminators have fluorescing dyes
         6. http://cat.ucsf.edu/pdfs/SS_DNAsequencing.pdf
         7. http://seq.molbiol.ru/sch_clon_ampl.html
         8. These days a flow cell generates about 50 million 60-basepair samples (reads).
         9. Data can be paired-end.
      ii. 454
         1. Also sequencing by synthesis (one base at a time)
         2. When a base is added to the end of the strand, light is released
         3. Higher error rate than the Illumina process but longer reads (300-500)
         4. Problematic for highly repetitive DNA regions (slippage)
      iii. ABI
         1. SOLiD sequencing
            a. sequencing by ligation
            b. uses 8-mer probes
            c. uses ligase rather than polymerase

            d. http://www3.appliedbiosystems.com/AB_Home/applicationstechnologies/SOLiDSystemSequencing/OverviewofSOLiDSequencingChemistry/index.htm
            e. and http://appliedbiosystems.cnpg.com/Video/flatFiles/699/index.aspx
4. Assembling samples
   a. Generate samples (reads) of the genome; overlapping samples are merged into contiguous sequences called contigs
   b. Assemble the contigs into longer sequences
      i. Consensus sequences
      ii. Supercontigs
      iii. Entire genomes
      iv. Landmarks?
         1. STS: Sequence Tagged Sites
            a. Short sequences (200-500 nucleotides)
            b. Present only once in the genome
   c. You emulated this sort of process in your first test
   d. Template-guided vs de novo
5. Rest of the class
   a. Genomic regions (tracks)
      i. Ka vs Ks
      ii. Intron/exon, feature files
   b. Template-guided (ELAND, MAQ)
   c. De novo (Velvet, odena)
   d. MySQL
   e. SNP detection!
   f. BioPerl, Bioconductor, etc.
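Going back to shotgun sequencing and "Assembling samples" (item 4): a minimal Python sketch of de novo assembly by greedy overlap merging. The function names, the `min_len` cutoff and the toy reads are assumptions for illustration. It ignores sequencing errors, reverse-complement reads and repeats, all of which real assemblers must handle (Velvet, for instance, is built on de Bruijn graphs rather than this pairwise approach).

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return reads  # nothing left to merge: these are the contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)


# Three overlapping toy reads, given out of order.
reads = ["CAATCCGTA", "ATGGCGTG", "CGTGCAAT"]
print(greedy_assemble(reads))  # ['ATGGCGTGCAATCCGTA']
```

Like the progressive alignment earlier, this is greedy: a wrong merge made early (say, inside a repeat) is never undone, which is one reason landmarks such as STSs help when ordering contigs.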