
Prof. Bystroff talks about


• Sequence database searching • Phylogenetic Trees • Structure

[Figure: alignment grid comparing the DNA sequences AAAGAGATTCTGCTAGCGGTCGG and AGAGATGCTGCAGCGAGTCGGCC.]

Protein uses a "substitution matrix".

[Figure: grid of substitution scores, with Sequence 1 on one axis and Sequence 2 on the other.]

Find the best pathway through the substitution scores, and you have an alignment.
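Finding the best pathway through a grid of scores can be sketched with a dynamic programming table that is filled in cell by cell. This is a minimal global-alignment sketch, not the course's code. The scoring values (`match`, `mismatch`, `gap`) are assumptions chosen for illustration; a protein aligner would look up each pair in a substitution matrix instead.

```python
def align(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming (illustrative scoring)."""
    n, m = len(seq1), len(seq2)
    # score[i][j] = best score for aligning seq1[:i] with seq2[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align two letters
                              score[i - 1][j] + gap,      # gap in seq2
                              score[i][j - 1] + gap)      # gap in seq1
    # Trace the best pathway back from the bottom-right corner.
    out1, out2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out1.append(seq1[i - 1])
                out2.append(seq2[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out1.append(seq1[i - 1])
            out2.append("-")
            i -= 1
        else:
            out1.append("-")
            out2.append(seq2[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out1)), "".join(reversed(out2))


best, top, bottom = align("ACGT", "AGT")
print(best)
print(top)
print(bottom)
```

The table has one cell per pair of positions, so the cost grows with the product of the two sequence lengths.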

The best pathway is found with a "dynamic programming" algorithm.

BLAST searches millions of sequences

GenBank contains over 162 million sequences!! The score for each should be the optimal alignment score. Even if we can do 1 per millisecond, it would take 45 hours to do one search. BLAST usually finishes in under a minute. How does BLAST do it so fast?

BLAST precalculates all triplet hits in the database.

My sequence has this triplet: PGQ

BLAST saves a lookup table (called an INDEX) for all of the near identity triplet locations in the whole database.

Near identity triplets: ... PGQ PGR PGS PGT PGV PGW PGY PAQ PCQ PDQ PEQ PFQ ...

BLAST uses an expansion table to allow for near perfect matches
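The lookup table and the expansion step can be sketched as follows. This is an illustration only, not BLAST's actual data structure: here "near identity" is assumed to mean "differs at no more than one position", which matches the PGQ example triplets on the slide. The function names and the toy database are hypothetical.

```python
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_index(database, k=3):
    """Map every k-letter word to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def expand(word):
    """The word itself plus every word that differs from it at exactly one position."""
    words = {word}
    for i in range(len(word)):
        for aa in AMINO_ACIDS:
            words.add(word[:i] + aa + word[i + 1:])
    return words


def lookup(index, word):
    """All database locations of the word or of its near-identity neighbours."""
    hits = []
    for neighbour in sorted(expand(word)):
        hits.extend(index.get(neighbour, []))
    return hits


# Toy database (made-up sequences).
database = {"p1": "MPGQK", "p2": "APGRA", "p3": "AAAAA"}
index = build_index(database)      # done once, before any search
print(lookup(index, "PGQ"))        # each search is then only dictionary lookups
```

Building the index costs one pass over the database; each query triplet afterwards costs a fixed number of lookups, regardless of database size.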

This is all done when BLAST is set up, before any searches are carried out.

BLAST finds diagonal arrangements of triplet hits

[Figure: triplet hits in one database protein, with a score cutoff.]

Hits are joined by extension.

BLAST scores only the best hits (saves time).

Re-scoring.

Convert score to an e-value*.

Rank by e-value.

This protein is given a score, and we save it for later only if the score passes a cutoff.

BLAST connects the diagonals (FASTA algorithm).
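Triplet hits that lie on the same diagonal share the same offset between their position in the query and their position in the database protein. A minimal sketch of grouping hits by diagonal, assuming exact triplet matches only (no expansion table) and hypothetical function names:

```python
from collections import defaultdict


def triplet_hits(query, subject, k=3):
    """All (query position, subject position) pairs where the same k-letter word occurs."""
    positions = defaultdict(list)
    for j in range(len(subject) - k + 1):
        positions[subject[j:j + k]].append(j)
    return [(i, j)
            for i in range(len(query) - k + 1)
            for j in positions.get(query[i:i + k], [])]


def hits_by_diagonal(hits):
    """Group hits by diagonal number (subject position minus query position)."""
    diagonals = defaultdict(list)
    for i, j in hits:
        diagonals[j - i].append((i, j))
    return dict(diagonals)


def best_diagonal(hits):
    """The diagonal holding the most hits, as (diagonal number, list of hits)."""
    diagonals = hits_by_diagonal(hits)
    return max(diagonals.items(), key=lambda item: len(item[1]))


hits = triplet_hits("ACDEFGH", "XXACDEFGH")
print(best_diagonal(hits))
```

A real search would then extend and score only the diagonals with enough hits, which is where the time saving comes from.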

*later...

Protein Databases available for BLAST search

Go to the BLAST search page (e.g. blastp), select a database to search, and then select ? to learn a little about that database.


forms of BLAST

BLAST       query              database
blastn      nucleotide         nucleotide
blastp      protein            protein
tblastn     protein            translated DNA
blastx      translated DNA     protein
tblastx     translated DNA     translated DNA
psi-blast   protein, profile   protein
phi-blast   pattern            protein

How significant is that?

Please give me a number for... how likely the data would not have been the result of chance, as opposed to a specific inference.

e-value

A better metric of significance. E-value = p-value x (number of attempts)

Scores from random alignments are used to calculate the p-value of an alignment score

[Figure: histogram of random alignment scores (freq versus score), with the score x marked.]

p-value of x = $\int_x^{\infty} f(s)\,ds$, where $f$ is the normalized normal distribution fit to the random scores.

p-value is the significance of one (1) alignment score. e-value is the significance of one score of many tries.

Searching a database of 162 million sequences for one hit is like trying 162 million times to get one good alignment. The number of times you will see that score by chance is the p-value times 162 million!
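The p-value and e-value calculation can be sketched with the standard library. This follows the slide's recipe of fitting a normal distribution to random scores; the random scores below are made-up numbers, and the function names are hypothetical.

```python
from statistics import NormalDist

GENBANK_SIZE = 162_000_000  # number of sequences quoted on the slide


def p_value(score, random_scores):
    """Area under the fitted normal curve at or above `score`."""
    fit = NormalDist.from_samples(random_scores)
    return 1.0 - fit.cdf(score)


def e_value(score, random_scores, attempts=GENBANK_SIZE):
    """Expected number of chance hits with this score in `attempts` tries."""
    return p_value(score, random_scores) * attempts


# Made-up scores from aligning random sequences.
random_scores = [28.0, 29.0, 30.0, 31.0, 32.0]

for score in (30.0, 35.0, 40.0):
    print(score, p_value(score, random_scores), e_value(score, random_scores))
```

Computing `1.0 - cdf` loses precision once the p-value becomes extremely small, so this sketch is only meant to show the relationship between the two numbers.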

e-value = p-value * 162,000,000 (GenBank search)

Pop-quiz

BLAST HIT......    e-value
1. annotation      3.0
2. annotation      3.0
3. annotation      3.0
4. annotation      3.0
5. annotation      3.0
6. annotation      3.0
7. annotation      3.0
8. annotation      3.0
9. annotation      3.0
10. annotation     3.0

How many of the above 10 hits are expected to be by chance?

Pop-quiz

BLAST HIT......    e-value
1. annotation      1.0
2. annotation      2.0
3. annotation      3.0
4. annotation      4.0
5. annotation      5.0
6. annotation      6.0
7. annotation      7.0
8. annotation      8.0
9. annotation      9.0
10. annotation     10.0

How many of the above 10 hits are expected to be by chance?

Pop-quiz

BLAST HIT......    e-value
1. annotation      0.0
2. annotation      0.01
3. annotation      0.01
4. annotation      0.01
5. annotation      0.02
6. annotation      0.02
7. annotation      0.02
8. annotation      0.02
9. annotation      0.02
10. annotation     10.0

How many of the above 10 hits are expected to be by chance?

Bioinformatics

• Sequence database searching • Phylogenetic Trees •

Evolutionary time

[Figure: the same tree of A, B, C and D drawn three ways. Cladogram: branch lengths have no meaning. Phylogram: branch lengths show genetic change. Ultrametric tree: branch lengths show time.]

(D:5,(A:1,(C:1,B:6):1):3)
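A string like the one above can be read with a small recursive parser. This is a minimal sketch under stated assumptions: it handles only unquoted labels and `:length` suffixes, which is enough for this example but not for every Newick file (quoted names such as 'Blind mole rat' would need extra handling). The function names are hypothetical.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts with name, length, children."""
    text = text.strip().rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1
            while True:
                children.append(node())
                if text[pos] == ",":
                    pos += 1
                elif text[pos] == ")":
                    pos += 1
                    break
                else:
                    raise ValueError(f"unexpected character at {pos}")
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        name = text[start:pos]
        length = 0.0
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    return node()


def leaf_distances(node, so_far=0.0):
    """Sum of branch lengths from the root down to each labelled leaf."""
    here = so_far + node["length"]
    if not node["children"]:
        return {node["name"]: here}
    distances = {}
    for child in node["children"]:
        distances.update(leaf_distances(child, here))
    return distances


tree = parse_newick("(D:5,(A:1,(C:1,B:6):1):3)")
print(leaf_distances(tree))
```

The printed root-to-leaf distances are what a phylogram draws as horizontal branch length.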

Parenthesis (Newick) notation has both labels and distances.

Multiple Sequence Alignment

A multiple sequence alignment is made using many pairwise sequence alignments.

Construct a distance-based tree
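One way to build a tree from pairwise values is to join the closest pair repeatedly and average their distances to everything else (UPGMA-style clustering). The slide does not name a method, so this choice is an assumption. The numbers are the pairwise values from this slide's A-F matrix. The sketch further assumes they are similarity scores (higher means closer) and converts them to distances as 100 minus the value; if they are meant as distances already, pass them in unchanged.

```python
from itertools import combinations

# Pairwise values from the slide's A-F matrix.
SLIDE_VALUES = {
    ("A", "B"): 97, ("A", "C"): 81, ("A", "D"): 82, ("A", "E"): 59, ("A", "F"): 32,
    ("B", "C"): 77, ("B", "D"): 80, ("B", "E"): 55, ("B", "F"): 31,
    ("C", "D"): 90, ("C", "E"): 65, ("C", "F"): 40,
    ("D", "E"): 61, ("D", "F"): 42,
    ("E", "F"): 33,
}


def upgma(distances):
    """Cluster by repeatedly joining the closest pair; returns a Newick-like topology."""
    dist = {frozenset(pair): float(d) for pair, d in distances.items()}
    size = {name: 1 for pair in distances for name in pair}
    while len(size) > 1:
        a, b = min(combinations(sorted(size), 2),
                   key=lambda pair: dist[frozenset(pair)])
        merged = f"({a},{b})"
        for other in size:
            if other in (a, b):
                continue
            # Size-weighted average distance from the new cluster to `other`.
            dist[frozenset((merged, other))] = (
                size[a] * dist[frozenset((a, other))]
                + size[b] * dist[frozenset((b, other))]
            ) / (size[a] + size[b])
        size[merged] = size[a] + size[b]
        del size[a], size[b]
    return next(iter(size))


# Assumption: treat the slide values as percent similarity.
as_distances = {pair: 100 - value for pair, value in SLIDE_VALUES.items()}
print(upgma(as_distances))
```

The result gives only the branching order; branch lengths would need the merge distances to be recorded as well.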

distances

     A    B    C    D    E    F
A        97   81   82   59   32
B             77   80   55   31
C                  90   65   40
D                       61   42
E                            33
F

Draw tree here

Life is not strictly a tree -- horizontal gene transfer

Discrete Steps Needed for Stability of Gene Transfer

Stably incorporating horizontally transferred genes into a recipient genome involves five distinct steps (Fig. 1).

1. First, a particular segment of DNA or RNA is prepared for transfer from the donor strain through one of several processes, including excision and circularization of conjugative transposons, initiation of conjugal transfer by synthesis of a mating pair-formation protein complex, or packaging of nucleic acids into phage virions.
2. Next, the segment is transferred either by conjugation, which requires contact between the donor and recipient cells, or by transformation and transduction without direct contact.
3. During the third step, genetic material enters the recipient cell, where cell exclusion may abort the transfer.
4. Otherwise, during the fourth step, the incoming gene is integrated into the recipient genome by legitimate or site-specific recombination or by plasmid circularization and complementary strand synthesis. Barriers to transfer during this step come from restriction modification systems, failure to integrate and replicate within the new host genome, and incompatibility with resident plasmids.
5. In the final step, transferred genes are replicated as part of the recipient genome and transmitted to daughter cells in stable fashion over successive generations.

Researchers from different disciplines tend to focus on specific stages within this five-step sequence. Thus, evolutionary biologists who examine microbial genomes for evidence of past transfers tend to look at HGTs from the perspective of step five. Molecular biologists are more likely to examine the details of the transfer events, while microbial ecologists look more broadly when they describe the magnitude and diversity of the mobile gene pool, sometimes called the mobilome.

BF Smets, T Barkay (2005) “Horizontal gene transfer: perspectives at a crossroads of scientific disciplines” Nature Reviews Microbiology.

Sequence homology trees are complicated by paralogy

Orthologs: homologs originating from a speciation event.
Paralogs: homologs originating from a gene duplication event.

[Figure: a gene duplication followed by speciation events and gene loss, giving A and B gene copies in fish, crab, clam and duck. The true species tree and the sequence tree differ (!!); reconciled trees combine the two.]

Use orthologs

• To make the right inferences about evolution, make sure your sequence set is composed of orthologs.

How do you know it's an ortholog?
1. It has the same function in both species.
2. It has about the same number of differences across species as other orthologs.
3. You don't.

Functional inference from multiple sequence alignments

[Figure: alignment positions on a scale from "Not conserved" to "Conserved". Labels: function, folding, stability, kinetics, enzyme activity, binding, post-translational modification, species differences.]

Next time:

• Visit rcsb.org
• Try visualizing a protein.
• Locate a residue that is conserved across all species in a BLAST search.
• Locate one that is conserved except in one species. What might be its function?

Phylogenetic analysis

[Figure: per-position diagrams for Mm1:C362, Mm1:S393, Mm1:C499, Mm2:C210, Mm3:C96, Mm3:C197, Mm4:C114, Mm4:C170 and Mm4:C233, each labeled either "variable position conserved" or "single position conserved".]

Transmembrane Cys can still form self-reacting SS when it mutates to a new position. Therefore, variable-position conserved Cys are self-reacting. Single-position conserved Cys are cross-reacting. http://etetoolkit.org

[Figure: trees for positions Mm1:C362 and Mm1:S393.]

In this case, mammals found to be missing conserved cysteines in the sperm-specific calcium channel CatSper were species that lacked sperm competition.

Format of sequence alignment for ETE tree

>Squirrel
FLVVCLNT---CIFLCIYV---LTLMFTCLF---LLRICRVLR---VSICTSEFA---LGFCLFGI---LTILVCEV---LVHVCMAV---ICITQDGW
>Beaver
FVTVCLNT---CIFLCIYV---LILMFTCMF---LLRICRVLR---VSICTSEFF---LGFCLFGI---LTILICEV---LVHVCMAV---ICITQDGW
>Blind mole rat
FLVVCLNT---SIFLCIYI---LTLMFTCLF---LLRICRVLK---VSTYACEFF---LGFCLFGV---LTILTCEV---LVHVCMAV---ICITQDGW
>Mouse
FIVVCLNT---SIFLSIYV---LTLMFTCLF---LLRVCRVLR---VSVYVCEFL---LGFCLFGV---LTILICEV---LVHVCMAV---ICITQDGW
>Pika
FLVICLNT---CIFLSIYV---LTLMFTCLF---LLRICRVLR---VSIYASEFS---LGFCLFGT---LTILICEV---LLHVCMSV---ICITQDGW
>Rabbit
FLVVCLNT---CIFLCIYM---FVLMFTCLF---LLRICRVLR---VSIYASEFS---LGFCLFGA---LTILFCEV---LLHVCMAV---ICITQDGW
>Gibbon
FFVVCLNT---SIFFCIYV---LILMFTCLF---LLRICRVLR---VSICTSELF---LGFCLFGS---LTILICEV---LVHVCMAV---ICITQDGW
>Monkey
FFIVCLNT---SIFFCIYV---LILMFTCLF---FLRICRVLR---VSICTSELA---LGFCLFGS---LTILICEV---LVHVCMAV---ICITQDGW
>Bushbaby
FFIICLNT---CIFFCIYV---LILMFTCLF---FLRICRVLR---VGIYSAEFY---LGFCLFGV---LSILVCEV---LIHVCMAV---ICITQDGW
>Lemur
FFIICLNT---AIFFSIYL---LILMFTCLF---FLRICRVLR---VSIYSSEFV---LGFCLFGV---LTILICEV---LVHVCMAV---ICITQDGW
>Sifaka
FFVICLNT---SIFFCIYV---LILMFTCLF---FLRICRVLR---VSIYSSEFS---LGFCLFGV---LTILICEV---LVHVCMAV---ICITQDGW
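An alignment in this format can be scanned column by column to find residues that are conserved in every species, or in every species but one. A minimal sketch with hypothetical function names. To keep it short, it uses only the first two blocks of four of the records above.

```python
from collections import Counter

# First two blocks of four records from the alignment above.
FASTA = """\
>Squirrel
FLVVCLNT---CIFLCIYV
>Beaver
FVTVCLNT---CIFLCIYV
>Mouse
FIVVCLNT---SIFLSIYV
>Pika
FLVICLNT---CIFLSIYV
"""


def parse_fasta(text):
    """Return {name: sequence}; sequences may span several lines."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def classify_columns(records):
    """Return (fully conserved columns, {column: the one species that differs})."""
    names = list(records)
    seqs = [records[n] for n in names]
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    conserved, one_off = [], {}
    for col, residues in enumerate(zip(*seqs)):
        if "-" in residues:
            continue  # skip gap columns
        residue, count = Counter(residues).most_common(1)[0]
        if count == len(seqs):
            conserved.append(col)
        elif count == len(seqs) - 1:
            one_off[col] = next(n for n, r in zip(names, residues) if r != residue)
    return conserved, one_off


records = parse_fasta(FASTA)
conserved, one_off = classify_columns(records)
print("conserved columns:", conserved)
print("conserved except in one species:", one_off)
```

Column numbers here count from 0 along the alignment, not along any one protein.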

FASTA format. Output by most alignment programs and packages.

Format of tree for ETE tree

(Beaver:0.106861,(('Blind mole rat':0.0870003,Mouse:0.128141):0.0287991,('Naked mole rat':0.316691,((Pika:0.0584227,Rabbit:0.0514835):0.0419969,(((Gibbon:0.062089,Monkey:0.0723263):0.0501071,(Bushbaby:0.104971,(Lemur:0.0853643,Sifaka:0.0510973):0.0395091):0.00449631):0.0111712,((Marsupials:0.390453,((Manatee:0.0517099,Mole:0.11989):0.00669083,'Elephant shrew':0.143216):0.0258566):0.00193119,('Star-nosed mole':0.111938,((Alpaca:0.102393,Pig:0.0618056):0.010159,(Leopard:0.0585696,('Brown bat':0.108369,('Fruit bat':0.0725375,('Horseshoe bat':0.0640651,'Leaf-nosed bat':0.0497925):0.0219586):0.00813098):0.00277678):0.00371913):0.00419253):0.00732498):0.0142435):0.00485045):0.00731201):0.00875856):0.00305836,Squirrel:0.0786295);

Newick format. Output by UGENE, NCBI, and most tree tools.

• rcsb.org • 4ms2 a voltage-gated calcium channel.

1) visualize overall structure in NGL
2) view ligands
3) view electron density
3) find an . Zoom in.
4) Homology modeling.

superposed homologs

Homology modeling in a nutshell.

ALIGNMENT

target   ACDEFG....HIKLMNPQRSTVWY
          ||:||    || :|  | ||||:
template .CDDFGACDGHIYIM..QQSTVWF
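Each column of the alignment above implies one kind of change to the template structure. A minimal sketch that walks the two aligned strings and labels each column. The action names are hypothetical, and `SIMILAR` lists only the three pairs the slide marks with ":"; a real modeling program would use a full substitution matrix and then rebuild coordinates.

```python
from collections import Counter

TARGET = "ACDEFG....HIKLMNPQRSTVWY"
TEMPLATE = ".CDDFGACDGHIYIM..QQSTVWF"

# Pairs marked ":" (similar) in the slide's alignment; illustrative only.
SIMILAR = {frozenset("DE"), frozenset("IL"), frozenset("FY")}


def modeling_actions(target, template, gap="."):
    """One (action, target residue, template residue) tuple per alignment column."""
    if len(target) != len(template):
        raise ValueError("aligned strings must have the same length")
    actions = []
    for t, s in zip(target, template):
        if t == gap and s == gap:
            continue
        if s == gap:
            kind = "build"             # target residue with no template coordinates
        elif t == gap:
            kind = "remove"            # template residue absent from the target
        elif t == s:
            kind = "keep"              # identical: keep sidechain and backbone
        elif frozenset((t, s)) in SIMILAR:
            kind = "swap similar"      # change sidechain, keep backbone fixed
        else:
            kind = "swap non-similar"  # change sidechain, backbone may move
        actions.append((kind, t, s))
    return actions


actions = modeling_actions(TARGET, TEMPLATE)
print(Counter(kind for kind, _, _ in actions))
for action in actions:
    if action[0] != "keep":
        print(action)
```

The non-"keep" columns correspond to the modeling steps listed on the slide: residues to add, the four-residue insertion to cut out, and the sidechains to switch.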

Modeling action...
• Add Ala to the N-terminal Cys using energy minimization.
• Keep the conserved Phe sidechain and backbone.
• Cut out the four-residue insertion and connect G to H.
• Switch non-similar sidechains Y->K. Possibly move backbone.
• Cut at M-Q, insert two residues, Asn-Pro.
• Switch similar sidechains F->Y. Keep backbone fixed.

Automatic homology modeling by SWISS-MODEL

https://swissmodel.expasy.org/interactive

Next time:

• Read CatSper paper: Bystroff, C. (2018). Intramembranal disulfide cross-linking elucidates the super-quaternary structure of mammalian CatSpers. Reproductive Biology, 18(1), 76-82.

Read at least one part of the paper in detail. Bring a comment, suggestion, or question to class 2/19.