Introduction to 3D-Structure Visualization and Homology Modeling using the Swiss-Model Workspace

Lorenza Bordoli Biozentrum of the University of Basel and Swiss Institute of Bioinformatics Lecture 2

May 2009 Lecture 2: Outline

Families evolution
• Recapitulation: sequence alignments
– Pairwise and multiple sequence alignments
– Database searches (BLAST, PSI-BLAST)
– Patterns, PSSMs, Profiles, HMMs
– InterPro: Database of protein families and domains
• Introduction to Homology modeling:
– Target sequence feature annotation
– Homology Modeling: Template identification

Evolution of Protein Families

Hemoglobin

Hemoglobin (Hb) is the heme-iron-containing oxygen-transport protein in the red blood cells of vertebrates. Hb transports oxygen from the lungs or gills to the rest of the body where it releases the oxygen for cell use. In the tetrameric form of hemoglobin, the binding of oxygen is a cooperative process.

Image source: Wikimedia Commons.

Human hemoglobin; PDB:1GZX

Myoglobin

Myoglobin (Mb) is a single-chain globular protein containing a heme prosthetic group in the center. Mb is the primary oxygen-carrying pigment of muscle tissues.

Image source: Wikimedia Commons.

Sperm Whale myoglobin; PDB:1MBO

Evolution of the globin family:

• Multiple alignment of amino acid sequences:

sp|P01922|HBA_HUMAN   1 -VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLSH-----GSA
sp|P01958|HBA_HORSE   1 -VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLSH-----GSA
sp|P02023|HBB_HUMAN   1 VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNP
sp|P02062|HBB_HORSE   1 VQLSGEEKAAVLALWDKV--NEEEVGGEALGRLLVVYPWTQRFFDSFGDLSNPGAVMGNP
sp|P02185|MYG_PHYCA   1 -VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASE

sp|P01922|HBA_HUMAN  54 QVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHL
sp|P01958|HBA_HORSE  54 QVKAHGKKVGDALTLAVGHLDDLPGALSNLSDLHAHKLRVDPVNFKLLSHCLLSTLAVHL
sp|P02023|HBB_HUMAN  59 KVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHF
sp|P02062|HBB_HORSE  59 KVKAHGKKVLHSFGEGVHHLDNLKGTFAALSELHCDKLHVDPENFRLLGNVLVVVLARHF
sp|P02185|MYG_PHYCA  60 DLKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRH

sp|P01922|HBA_HUMAN 114 PAEFTPAVHASLDKFLASVSTVLTSKYR------
sp|P01958|HBA_HORSE 114 PNDFTPAVHASLDKFLSSVSTVLTSKYR------
sp|P02023|HBB_HUMAN 119 GKEFTPPVQAAYQKVVAGVANALAHKYH------
sp|P02062|HBB_HORSE 119 GKDFTPELQASYQKVVAGVANALAHKYH------
sp|P02185|MYG_PHYCA 120 PGDFGADAQGAMNKALELFRKDIAAKYKELGYQG

HBA|B_HUMAN: Hemoglobin alpha|beta chain from Homo sapiens (Human)
HBA|B_HORSE: Hemoglobin alpha|beta chain from Equus caballus (Horse)
MYG_PHYCA: Myoglobin from Physeter catodon (Sperm whale)

Evolution of the globin family:

• Variability at different positions reflects functional and/or structural importance:

Human hemoglobin β chain

His 92

His 63

[Figure: the multiple alignment of the five globin sequences shown on the previous slide, repeated]

Evolution of the globin family:

Leghemoglobin is a hemoprotein found in the nitrogen-fixing root nodules of leguminous plants. Leghemoglobin has a high affinity for oxygen - high enough to allow nitrogenase to function but low enough so that it can provide the bacteria with oxygen for respiration.

Image source: A. Lesk

Sperm Whale Myoglobin; PDB:1MBD
Lupin leghemoglobin; PDB:2LH7

Evolution of the globin family:

Structurally variable loop regions
Structurally conserved core

Evolution of the globin family:

Image source: A. Lesk.

Evolution of Protein Structures

Relationship between sequence and structural divergence of proteins:

• Within the same fold family, sequence conservation can be as low as 20 %

• The selection is acting on the function of the protein (determined by its structure): the ability of proteins to accommodate changes in nonfunctional residues permits a large amount of apparently non-adaptive changes to occur

Homology (Comparative) Modeling

Evolution of Protein Structures

Protein structure is better conserved than sequence

Similar Sequence ⇒ Similar Structure

Homology modeling = Comparative protein modeling

Idea: Using experimental 3D-structures of related family members (templates) to calculate a model for a new sequence (target). Comparative Protein Structure Modeling General Workflow

Known Structures (Templates)
Target Sequence
Template Selection
Alignment Template - Target
Structure modeling
Structure Evaluation & Assessment
Homology Model(s)

Sequence alignment techniques: pairwise alignments, multiple sequence alignments, Patterns, PSSMs, Profiles, HMMs

Evolution of Protein Structures

Protein structure is better conserved than sequence • Biological sequences evolve through mutation and selection => Selective pressure is different for each residue position in a protein (i.e. conservation of active site, structure, charge, etc.) • Alignments try to tell the evolutionary story of the proteins:

Similar Sequence

Same Origin

Similar Function

Similar 3D Fold

Comparing two sequences

• Sequence comparison through pairwise alignments:
– Goal of pairwise comparison is to find conserved regions (if any) between two sequences
– Extrapolate information about our sequence using the known characteristics of the other sequence

THIO_EMENI  GFVVVDCFATWCGPCKAIAPTVEKFAQTY  Thioredoxin
            G ++VD +A WCGPCK IAP +++ A  Y
??????      GAILVDFWAEWCGPCKMIAPILDEIADEY

Extrapolate from THIO_EMENI (SwissProt) to ???

Pairwise sequence alignment

• Concept of pairwise alignment:
– Explicit mapping between the residues of 2 sequences

                     deletion
Seq A  GARFIELDTHELASTFA-TCAT
       |||||||||||    || ||||
Seq B  GARFIELDTHEVERYFASTCAT
               mismatches  insertion

– Tolerant to mismatches, insertions / deletions (indels), gaps

Evaluation of the alignment quality

• What is a good alignment ? – We need a way to evaluate the quality of a given alignment – Intuitively we "know" that the following alignment:

Seq A  OLDGARFIELDTHELASTFATCAT
          |||||||||||    ||||||
Seq B     GARFIELDTHEVERYFATCAT

is better than:

Seq A  OLDGARFIELDTHELASTFATCAT
                        | ||
Seq B  GARFIELDTHEVERYFATCAT

– We can express this notion more rigorously, by using a scoring system

Scoring System

• Simple alignment score: – A simple way (but not the best) to score an alignment is to count 1 for each match and 0 for each mismatch.
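A minimal sketch of this match-counting score, applied to the two GARFIELD sequences used on this slide. The helper name `match_score` and the `offset` parameter (how far Seq B is shifted to the right under Seq A) are assumptions made for illustration; gaps are not handled here.

```python
def match_score(seq_a: str, seq_b: str, offset: int = 0) -> int:
    """Count identical residue pairs when seq_b is shifted right by `offset`."""
    score = 0
    for j, residue_b in enumerate(seq_b):
        i = j + offset  # position in seq_a that faces residue j of seq_b
        if 0 <= i < len(seq_a) and seq_a[i] == residue_b:
            score += 1  # 1 for each match, 0 for each mismatch
    return score


seq_a = "OLDGARFIELDTHELASTFATCAT"
seq_b = "GARFIELDTHEVERYFATCAT"

print(match_score(seq_a, seq_b, offset=3))  # good alignment: 17
print(match_score(seq_a, seq_b, offset=0))  # poor alignment: 3
```

The same two sequences give very different scores depending only on how they are placed against each other, which is why the alignment itself has to be optimized.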

Seq A  OLDGARFIELDTHELASTFATCAT
          |||||||||||    ||||||
Seq B     GARFIELDTHEVERYFATCAT

⇒ Score: 17

Seq A  OLDGARFIELDTHELASTFATCAT
                        | ||
Seq B  GARFIELDTHEVERYFATCAT

⇒ Score: 3

Pairwise sequence alignment

• Number of possible alignments between two sequences: – There are many ways to align two sequences • Consider the sequence fragments below: a simple alignment shows some conserved portions:

Seq A  GARFIELDTHEFATCAT
       |||||||||||
Seq B  GARFIELDTHEVERYFATCAT

but also:

Seq A      GARFIELDTHEFATCAT
                      ||||||
Seq B  GARFIELDTHEVERYFATCAT

and also:

Seq A  GARFIELDTHE----FATCAT
       |||||||||||    ||||||
Seq B  GARFIELDTHEVERYFATCAT

Dynamic Programming

• Dynamic Programming: a computational method that provides, in a mathematical sense, the best alignment(s) between two sequences, given a scoring system.

• Details about the implementation of this algorithm: “David W. Mount, Bioinformatics, Cold Spring Harbor Laboratory Press”

Dynamic Programming

• Dynamic Programming: – Best alignment(s) between two sequences = highest score
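As an illustration, the highest score for the GARFIELD example can be found with a small global-alignment recurrence (Needleman-Wunsch style). This is only a sketch under stated assumptions: it returns the score without the traceback, the function name `best_alignment_score` is hypothetical, and the gap score is set to 0 because the example on this slide does not penalize gaps.

```python
def best_alignment_score(seq_a, seq_b, match=1, mismatch=0, gap=0):
    """Best global alignment score by dynamic programming (score only).

    `gap` is the score added for every gap position (0 or negative).
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]

    # First row and column: a prefix aligned entirely against gaps.
    for i in range(1, rows):
        score[i][0] = score[i - 1][0] + gap
    for j in range(1, cols):
        score[0][j] = score[0][j - 1] + gap

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two residues
                score[i - 1][j] + gap,       # gap in seq_b
                score[i][j - 1] + gap,       # gap in seq_a
            )
    return score[-1][-1]


print(best_alignment_score("GARFIELDTHEFATCAT", "GARFIELDTHEVERYFATCAT"))  # 17
```

The table has (len A + 1) x (len B + 1) cells and each cell is filled once, so the running time grows with the product of the two sequence lengths.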

Seq A  GARFIELDTHEFATCAT
       |||||||||||
Seq B  GARFIELDTHEVERYFATCAT
⇒ Score: 11

Seq A      GARFIELDTHEFATCAT
                      ||||||
Seq B  GARFIELDTHEVERYFATCAT
⇒ Score: 6

Seq A  GARFIELDTHE----FATCAT
       |||||||||||    ||||||
Seq B  GARFIELDTHEVERYFATCAT
⇒ Score: 17

Scoring system: Introducing biological information

• Importance of the scoring system: => discrimination of significant biological alignments

• Substitution matrices:
– In proteins some substitutions (mismatches) are more acceptable than others
– Substitution matrices give a score for each substitution of one amino acid by another
• Types of matrices:
– Based on physico-chemical properties of amino acids
  • Hydrophobicity, acid / base, steric properties, ...
  • Scoring system scales are arbitrary
– Based on biological sequence information
  • Substitutions observed in structural or evolutionary alignments of well studied protein families
  • Scoring systems have a probabilistic foundation

Substitution matrices (log-odds matrices)

• For a set of well known proteins:
• Align the sequences
• Count the mutations at each position
• For each substitution set the score to the log-odds ratio:

$$\log\left(\frac{\text{observed}}{\text{expected by chance}}\right)$$

Example matrix: (Leu, Ile): 2, (Leu, Cys): -6, ...

• Positive score: the amino acids are similar, mutations from one into the other occur more often than expected by chance during evolution
• Negative score: the amino acids are dissimilar, the mutation from one into the other occurs less often than expected by chance during evolution
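A small sketch of how such a log-odds score could be computed for one pair of amino acids. All frequencies below are made-up toy numbers, not values from any real matrix; the scaling (10 x log10, rounded to an integer) is an assumption chosen to resemble the way PAM-style matrices are usually tabulated.

```python
import math


def log_odds_score(observed_freq, freq_a, freq_b, scale=10.0):
    """Scaled, rounded log-odds score for residue a aligned with residue b.

    observed_freq: how often a is seen aligned with b in trusted alignments
    freq_a, freq_b: background frequencies of the two residues, so that
                    freq_a * freq_b is the frequency expected by chance
    """
    expected = freq_a * freq_b
    return round(scale * math.log10(observed_freq / expected))


# Toy numbers (hypothetical): both residues have a background frequency of 0.1,
# so the pair is expected by chance with frequency 0.01.
print(log_odds_score(0.02, 0.1, 0.1))    # seen twice as often as expected: 3
print(log_odds_score(0.0025, 0.1, 0.1))  # seen four times less often: -6
```

A pair observed exactly as often as expected by chance scores 0, which is the boundary between "similar" and "dissimilar" in this scheme.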

PAM250
From: A. D. Baxevanis, "Bioinformatics"

Substitution Matrices

• Different kinds of matrices
– PAM series (Dayhoff M., 1968, 1972, 1978)

Percent Accepted Mutation. A unit introduced by Dayhoff et al. to quantify the amount of evolutionary change in a protein sequence. 1.0 PAM unit is the amount of evolution which will change, on average, 1% of amino acids in a protein sequence. A PAM(x) substitution matrix is a look-up table in which scores for each amino acid substitution have been calculated based on the frequency of that substitution in closely related proteins that have experienced a certain amount (x) of evolutionary divergence.

• Based on 1572 observed mutations in 71 families of closely related proteins
• Old standard matrix: PAM250

Substitution Matrices

• BLOSUM series (Henikoff S. & Henikoff JG., PNAS, 1992) – Blocks Substitution Matrix.

A substitution matrix in which scores for each position are derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins. Each matrix is tailored to a particular evolutionary distance. In the BLOSUM62 matrix, for example, the alignment from which scores were derived was created using sequences sharing no more than 62% identity. Sequences more identical than 62% are represented by a single sequence in the alignment so as to avoid over-weighting closely related family members.

• Based on alignments in the BLOCKS database • Standard matrix: BLOSUM62 Alignment score

• Amino acid substitution matrices: – Example: PAM250 – Most used: BLOSUM62

Raw score of the alignment
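A minimal sketch of how a raw score is summed from a substitution matrix, using the TPEA / APGA example of this slide. Only the four PAM250 entries needed for this example are included (the values are the ones used in the sum on the slide); the names `PAM250_SUBSET` and `raw_score` are assumptions for illustration.

```python
# Only the four PAM250 entries needed for this example, as used on the slide.
PAM250_SUBSET = {
    ("T", "A"): 1,
    ("P", "P"): 6,
    ("E", "G"): 0,
    ("A", "A"): 2,
}


def raw_score(seq_a, seq_b, matrix):
    """Sum the substitution scores of an ungapped alignment of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(seq_a, seq_b):
        # Substitution matrices are symmetric, so look the pair up both ways.
        total += matrix[(a, b)] if (a, b) in matrix else matrix[(b, a)]
    return total


print(raw_score("TPEA", "APGA", PAM250_SUBSET))  # 1 + 6 + 0 + 2 = 9
```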

TPEA
¦| |
APGA

Score = 1 + 6 + 0 + 2 = 9

Alignment gaps

• Insertions or deletions – Proteins often contain regions where residues have been inserted or deleted during evolution – There are constraints on where these insertions and deletions can happen (between structural or functional elements like: alpha helices, active site, etc.) • Alignments

Seq A  GARFIELDTHEFATCAT
       |||||||||||
Seq B  GARFIELDTHEVERYFATCAT

can be improved by inserting a gap

Seq A  GARFIELDTHE----FATCAT
       |||||||||||    ||||||
Seq B  GARFIELDTHEVERYFATCAT

Gap opening and extension penalties

• Costs of gaps in alignments – We want to simulate as closely as possible the evolutionary mechanisms involved in gap occurrence. •Example – Two DNA alignments with identical number of gaps but very different gap distribution. We may prefer one large gap to several small ones (e.g. poorly conserved loops between well-conserved helices)

CGATGCAGCAGCAGCATCG        CGATGCAGCAGCAGCATCG
||||||      |||||||        || || |||| ||  || |
CGATGC------AGCATCG        CG-TG-AGCA-CA--AT-G

gap opening, gap extension

Gap opening penalty: Counted each time a gap is introduced in an alignment

Gap extension penalty: Counted for each extension of a gap in an alignment

Gap opening and extension penalties

• Example: – With a match score of 1 and a mismatch score of 0 – With an opening penalty of 10 and extension penalty of 1, we have the following score:

CGATGCAGCAGCAGCATCG        CGATGCAGCAGCAGCATCG
||||||      |||||||        || || |||| ||  || |
CGATGC------AGCATCG        CG-TG-AGCA-CA--AT-G

gap opening, gap extension

13 x 1 - 10 - 6 x 1 = -3        13 x 1 - 5 x 10 - 6 x 1 = -43

Alignment: raw score
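The arithmetic of the two DNA alignments from the gap penalty example can be reproduced with a short sketch. It follows the convention of that example: the opening penalty is charged once per gap and the extension penalty once for every gap position, including the first. The function name `gapped_score` is an assumption for illustration; a gap in one sequence directly followed by a gap in the other is treated as a single gap here.

```python
def gapped_score(aligned_a, aligned_b, match=1, mismatch=0,
                 gap_open=10, gap_extend=1):
    """Raw score of an alignment given as two gapped strings of equal length."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    in_gap = False
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            if not in_gap:
                score -= gap_open   # charged once when a gap starts
            score -= gap_extend     # charged for every gap position
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score


reference = "CGATGCAGCAGCAGCATCG"
print(gapped_score(reference, "CGATGC------AGCATCG"))  # one long gap: -3
print(gapped_score(reference, "CG-TG-AGCA-CA--AT-G"))  # five short gaps: -43
```

Both alignments have 13 matches and 6 gap positions; only the number of gap openings differs, which is what makes the single long gap preferable under this scoring.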

Gap penalties: gap opening, gap extension

Seq A  GARFIELDTHE----CAT
       |||||||||||    |||
Seq B  GARFIELDTHELASTCAT

• The raw score of a gapped alignment is the sum of all amino acid substitutions from which we subtract the gap opening and extension penalties.

Alignments: some definitions

• Identity
– Proportion of pairs of identical residues between two aligned sequences. Generally expressed as a percentage.
– This value strongly depends on how the two sequences are aligned.
• Similarity
– Proportion of pairs of similar residues between two aligned sequences. Whether two residues are similar is determined by a substitution matrix.
– This value also depends strongly on how the two sequences are aligned, as well as on the substitution matrix used.
• Homology
– Two sequences are homologous if and only if they have a common ancestor.
– There is no such thing as a level of homology! (It's either yes or no)

Example: alignment – “text view”

• Two similar regions of the D. melanogaster Slit and Notch proteins:

                  970       980       990      1000      1010      1020
SLIT_DROME  FSCQCAPGYTGARCETNIDDCLGEIKCQNNATCIDGVESYKCECQPGFSGEFCDTKIQFC
            ..:.:  :.  :.: ...:.: ..  : :.. : ::.. . :.: ::..:. :.  :. :
NOTC_DROME  YKCECPRGFYDAHCLSDVDECASN-PCVNEGRCEDGINEFICHCPPGYTGKRCELDIDEC
                  740       750       760       770       780       790

Importance of Similarity

Rule-of-thumb: If your sequences are more than 100 amino acids long (or 100 nucleotides long) you can consider them homologous if 25% of the amino acids are identical (70% of the nucleotides for DNA). Below this value you enter the twilight zone.

Twilight zone = protein sequence similarity between ~0-20% identity: is not statistically significant, i.e. could have arisen by chance.
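The rule-of-thumb can be written down as a small check. This is a sketch under stated assumptions: percent identity is computed over aligned columns without gaps (other definitions of the denominator exist), and the function names are hypothetical. The example uses the two thioredoxin fragments from the pairwise comparison slide.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of identical residue pairs among gap-free aligned columns."""
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    identical = sum(1 for a, b in pairs if a == b)
    return 100.0 * identical / len(pairs)


def passes_rule_of_thumb(aligned_a, aligned_b,
                         min_length=100, min_identity=25.0):
    """True if the alignment is longer than min_length columns and
    reaches the identity threshold."""
    long_enough = len(aligned_a) > min_length
    return long_enough and percent_identity(aligned_a, aligned_b) >= min_identity


known = "GFVVVDCFATWCGPCKAIAPTVEKFAQTY"
query = "GAILVDFWAEWCGPCKMIAPILDEIADEY"

print(round(percent_identity(known, query), 1))  # 51.7
print(passes_rule_of_thumb(known, query))        # False: only 29 residues long
```

The fragment is well above 25% identity but far too short for the rule to apply, which is why the length of the similar segment is listed among the points to watch.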

Beware:
• E-value (Expectation value): the number of matches with at least this score that are expected to occur by chance in a database of this size (the lower the better)
• Length of the segments similar between the two sequences
• The number of insertions/deletions

Importance of Similarity: databases search

[Figure: a protein of unknown function is used for a similarity search against a sequence DB; a similar protein with known function is found and its function is extrapolated]

Importance of Similarity: databases search for Homology Modeling

PDB: database of protein structures

[Figure: a protein of unknown structure is used for a similarity search against the PDB; a similar protein with known structure is found and used for comparative modelling]

Heuristic Sequence Alignment

• With the Dynamic Programming algorithm, one obtains an alignment in a time that is proportional to the product of the lengths of the two sequences being compared. Therefore, when searching a whole database, the computation time grows linearly with the size of the database. With current databases, calculating a full Dynamic Programming alignment for each sequence of the database is too slow (unless implemented on specialized parallel hardware).

• The number of searches that are presently performed on whole genomes creates a need for faster procedures.

⇒ Two methods that are at least 50-100 times faster than dynamic programming were developed: FASTA and BLAST

BLAST

Basic Local Alignment Search Tool

A Blast for each query

Different programs are available according to the type of query

Program   Query                                Database

blastp    protein                              protein
blastn    nucleotide                           nucleotide
blastx    nucleotide (translated to protein)   protein
tblastn   protein                              nucleotide (translated to protein)
tblastx   nucleotide (translated to protein)   nucleotide (translated to protein)

BLASTing protein sequences

blastp = Compares a protein sequence with a protein database

If you want to find something about the function of your protein, use blastp to compare your protein with other proteins contained in the databases; identify common regions between proteins, or collect related proteins (phylogenetic analysis);

blastp = Compares a protein sequence with a protein database of known structures (PDB) use blastp to find homologous templates for homology (comparative) modeling. BLASTing protein sequences

1. Blast : using the Unix command line

2. Three of the most popular blastp online services:

• NCBI (National Center for Biotechnology Information) server: http://www.ncbi.nlm.nih.gov/BLAST

• ExPASy server: http://www.expasy.org/tools/blast/

• Swiss EMBnet server (European Molecular Biology network): http://www.ch.embnet.org/software/bBLAST.html (basic) http://www.ch.embnet.org/software/aBLAST.html (advanced) BLASTing protein sequences: Swiss EMBnet blastp server

Select the type of query
Select the nucleotide database to search with either blastn, tblastn, tblastx
Select the protein database to search with either blastp, blastx
Select the substitution matrix to use
Select your input type: either a raw sequence or an accession or id number, as well as the database from which blast should retrieve your query

BLASTing protein sequences: Swiss EMBnet blastp server

• Greater choice of databases to search • Advanced Blast parameter modification Understanding your BLAST output

1. Graphic display: shows you where your query is similar to other sequences

2. Hit list: the name of sequences similar to your query, ranked by similarity

3. The alignment: every alignment between your query and the reported hits

4. The parameters: a list of the various parameters used for the search Understanding your BLAST output: 1. Graphic display

query sequence

Portion of another sequence similar to your query sequence:

red, green, ochre matches: good
grey matches: intermediate
blue matches: bad (twilight zone)

The display can help you see that some matches do not extend over the entire length of your sequence => useful tool to discover domains. Understanding your BLAST output: 2. Hit list

Sequence ac number and name Description Bit score E-value

• Sequence ac number and name: Hyperlink to the database entry: useful annotations • Description: better to check the full annotation

• Bit score (normalized score) : A measure of the similarity between the two sequences: the higher the better (matches below 50 bits are very unreliable)

• E-value: The lower the E-value, the better. Sequences identical to the query have an E-value of 0. Matches above 0.001 are often close to the twilight zone. As a rule-of-thumb an E-value above 10^-4 (0.0001) is not necessarily interesting. If you want to be certain of the homology, your E-value must be lower than 10^-4

Understanding your BLAST output: 3. Alignment

Length of the alignment
Percent identity: 25% is good news
Positives: fraction of residues that are either identical or similar

XXX: low complexity regions masked

mismatch similar aa identical aa

A good alignment should not contain too many gaps and should have a few patches of high similarity, rather than isolated identical residues spread here and there Understanding your BLAST output: 4. Parameters

Search details (at the bottom of the results)

• Size of the database searched • Scoring system parameters • Details about the number of hits found Building a Multiple sequence alignment (MSA) Evolution of the globin family: Image source: A.Lesk. MSA programs

Program    Address
T-coffee   http://tcoffee.vital-it.ch/cgi-bin/Tcoffee/tcoffee_cgi/index.cgi
MUSCLE     http://www.drive5.com/muscle/
Kalign     http://msa.sbc.su.se/
Dialign    http://bibiserv.techfak.uni-bielefeld.de/dialign/
Probcons   http://probcons.stanford.edu/

Main Applications of MSA

Application: Procedure

Extrapolation: A good multiple alignment can help convince you that an uncharacterized sequence is really a member of a protein family

Phylogenetic analysis: If you carefully choose the sequences to include in your alignment, you can reconstruct the history of these proteins.

Pattern identification: By discovering very conserved positions, you can identify a region that is characteristic of a function (in proteins, e.g. catalytic triad of proteases)

Domain identification: It is possible to turn a multiple sequence alignment into a profile that describes a protein family or a protein (structural/functional) domain. You can use this profile to scan databases for new members of the family.

Domain combination/recombination

• Proteins can also achieve variety by combining different structural and functional domains; • Gene fusion or duplications result in mosaic proteins;

Urokinase Prothrombin

EGF domain Serine protease domain Ca2+ bdg. domain Kringle domain Functional annotation of multi-domain proteins

Protein: unknown function BLAST

DNA bdg. Activation domain: domain Function 1

Transcription Factor: known function

Query sequence Subject

? => DNA Bdg. Protein Functional annotation of multi-domain proteins

MSA MSA MSA

Model (HMM, PSSM,…) for Model for Model for DNA bdg. Function Activation Function 1 Activation Function 2 Functional annotation of multi-domain proteins

Protein: unknown function

Search database of models

⇒DNA bdg. Protein with ⇒ Activation Function 2 Domain recognition

• How to identify functional domains in protein sequences? – Sequence Homology (Blast, etc.) – Patterns – Position specific scoring matrices (PSSMs) – Sequence Profiles – Hidden Markov Models (HMMs) PROSITE Patterns

• Features following an exact pattern – e.g. restriction enzyme recognition sites ...GGATCC...... CCTAGG... • Features with approximate patterns – Promoters – Ribosome binding sites

– Protein sequence features -N-{P}-[ST]-{P}- PROSITE Patterns

Consensus sequences and patterns are regular expressions that can be used like “fingerprints”. E.g. PROSITE patterns:

-N-{P}-[ST]-{P}- PS00001: N-Glycosylation
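Because a PROSITE pattern is a regular expression in a different notation, it can be translated mechanically. The sketch below is an illustrative converter, not the ScanProsite implementation: the function name `prosite_to_regex` is hypothetical, and less common syntax (for example ‘>’ inside square brackets) is not handled. The two test fragments are taken from the example sequence shown on this slide.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern into a Python regular expression."""
    elements = pattern.strip().rstrip(".").strip("-").split("-")
    regex = ""
    for element in elements:
        prefix = suffix = repeat = ""
        if element.startswith("<"):        # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):          # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        if element.endswith(")"):          # repetition: A(3) or x(2,4)
            element, count = element[:-1].split("(")
            repeat = "{" + count + "}"
        if element == "x":                 # any residue
            core = "."
        elif element.startswith("{"):      # residues NOT accepted here
            core = "[^" + element[1:-1] + "]"
        else:                              # single residue or [..] ambiguity
            core = element
        regex += prefix + core + repeat + suffix
    return regex


glyco = prosite_to_regex("-N-{P}-[ST]-{P}-")
print(glyco)                                          # N[^P][ST][^P]
print(re.search(glyco, "RIVGGNMSLLSQWPWQ").group())   # NMSL

trypsin_his = prosite_to_regex("[LIVM]-[ST]-A-[STAG]-H-C.")
print(re.search(trypsin_his, "WIITAAHCVYD").group())  # ITAAHC
```

Note that `re.search` and `re.finditer` report non-overlapping matches only, so a scanner that must report overlapping sites needs an extra step.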

MGENDPPAVEAPFSFRSLFGLDDLKISPVAPDADAVAAQILSLLPLKFFPIIVIGIIALIL ALAIGLGIHFDCSGKYRCRSSFKCIELIARCDGVSDCKDGEDEYRCVRVGGQNAVLQVFTA ASWKTMCSDDWKGHYANVACAQLGFPSYVSSDNLRVSSLEGQFREEFVSIDHLLPDDKVTA LHHSVYVREGCASGHVVTLQCTACGHRRGYSSRIVGGNMSLLSQWPWQASLQFQGYHLCGG SVITPLWIITAAHCVYDLYLPKSWTIQVGLVSLLDNPAPSHLVEKIVYHSKYKPKRLGNDI ALMKLAGPLTFNEMIQPVCLPNSEENFPDGKVCWTSGWGATEDGAGDASPVLNHAAVPLIS NKICNHRDVYGGIISPSMLCAGYLTGGVDSCQGDSGGPLVCQERRLWKLVGATSFGIGCAE VNKPGVYTRVTSFLDWIHEQMERDLKT PROSITE Patterns

Prosite Patterns Conventions ...

-N-P-Q-        Elements are separated by “-”
-N-x-Q-        The symbol “x” is used in positions where any residue is allowed
-N-[AST]-Q-    ‘[]’ are used to mark ambiguities, i.e. in this example Ala, Ser, Thr
-N-{MP}-Q-     ‘{}’ curly brackets indicate residues that are not accepted in this position, i.e. not Met or Pro
-N-A(3)-Q-     ‘()’ indicate repetitions of residues, i.e. -N-A-A-A-Q-
-N-x(2,4)-Q-   means x-x, x-x-x, or x-x-x-x
‘<’ and ‘>’    indicate patterns restricted to the N- or C-terminal ends of a sequence

PROSITE Patterns entry

PROSITE: PS00134

ID   TRYPSIN_HIS; PATTERN.
AC   PS00134;
DT   APR-1990 (CREATED); APR-1990 (DATA UPDATE); JUL-1998 (INFO UPDATE).
DE   Serine proteases, trypsin family, histidine active site.
PA   [LIVM]-[ST]-A-[STAG]-H-C.
NR   /RELEASE=40.7,103373;
NR   /TOTAL=374(374); /POSITIVE=356(356); /UNKNOWN=0(0); /FALSE_POS=18(18);
NR   /FALSE_NEG=31; /PARTIAL=38;
CC   /TAXO-RANGE=??EP?; /MAX-REPEAT=1;
CC   /SITE=5,active_site;
DR   P23604, ACH1_LONAC, T; P23605, ACH2_LONAC, T; P10323, ACRO_HUMAN, T;
[...]
DR   Q9PT40, VSP2_VIPLE, N; Q9I8W9, VSP4_AGKAC, N; P33589, VSPA_LACMU, N;
DR   O13060, VSPA_TRIGA, N;
DR   P50363, ATP6_ALLAR, F; P50364, ATP6_ALLMA, F; Q16850, CP51_HUMAN, F;
[...]
3D   1FON; 1AU8; 1CGH; 1DFP; 1DST; 1DSU; 1HYL; 2HLC; 1AZZ; 1AB9; 1ACB; 1AFQ;
3D   1CA0; 1CBW; 1CGI; 1CGJ; 1CHG; 1CHO; 1GCD; 1GCT; 1GHA; 1GHB; 1GMC; 1GMD;
[...]

PROSITE Patterns
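The NR lines of the PS00134 entry give the counts needed to judge the quality of the pattern: 356 true positives (TP), 18 false positives (FP) and 31 false negatives (FN). As a rough illustration, the two usual summary measures can be computed as follows; the terms precision and recall and the function names are not part of the entry itself.

```python
def precision(tp, fp):
    """Fraction of reported hits that are real family members."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of real family members that the pattern detects."""
    return tp / (tp + fn)


# Counts taken from the NR lines of PROSITE entry PS00134 (release 40.7).
tp, fp, fn = 356, 18, 31

print(f"precision = {precision(tp, fp):.3f}")  # 0.952
print(f"recall    = {recall(tp, fn):.3f}")     # 0.920
```

True negatives do not enter either measure, which is convenient here because almost every sequence in the database is a true negative.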

Confusion matrix for a two-class classification problem:

                TRUE (+)               FALSE (-)
Predicted (+)   True positives (TP)    False Positives (FP)
Predicted (-)   False Negatives (FN)   True Negatives (TN)

PROSITE Scan on ExPASy

http://www.expasy.org/tools/scanprosite/ PROSITE Patterns

Conclusion • Advantages: – Patterns are fast and easy to implement – Easy to understand – Easy to design • Limitations – Poor Insertions / Deletion model – Small patterns find a lot of false positives. Long patterns are very difficult to design. – No scoring system, only binary response (YES/NO). PSSMs Position Specific Scoring Matrices

• PSSMs attempt to cover a over its whole length

• PSSMs are supposed to be more sensitive and more robust than patterns PSSMs Position Specific Scoring Matrices

How to construct weight matrices?

1. Start from a multiple sequence alignment:

1 2 3 4 5 6 7
-------------
A S T A M P V
A T S L M V T
S S S L M L T
A T P A M S S
A T A L L S A

PSSMs Position Specific Scoring Matrices

How to construct weight matrices?

2. Count the number of occurrence of different amino acids at each position of the alignment:

1    2    3    4    5    6    7
--------------------------------
4A   3T   2S   3L   4M   2S   1V
1S   2S   1T   2A   1L   1L   2T
          1A             1V   1S
          1P             1P   1A

PSSMs Position Specific Scoring Matrices

How to construct weight matrices?

3. Derive a preliminary matrix with amino acid frequencies The matrix size is N x M where N is the number of positions in the alignment and M is 20 (for proteins; 4 for DNA/RNA)

    1     2     3     4     5     6     7
------------------------------------------
A   0.8   0     0.2   0.4   0     0     0.2
L   0     0     0     0.6   0.2   0.2   0
M   0     0     0     0     0.8   0     0
V   0     0     0     0     0     0.2   0.2
P   0     0     0.2   0     0     0.2   0
S   0.2   0.4   0.4   0     0     0.4   0.2
T   0     0.6   0.2   0     0     0     0.4

PSSMs Position Specific Scoring Matrices
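Steps 1 to 3 can be reproduced in a few lines. The sketch below counts the residues in each column of the five-sequence example alignment and turns the counts into frequencies; the function name `column_frequencies` is an assumption, and residues that never occur in a column are simply absent from that column's dictionary (frequency 0).

```python
from collections import Counter

# The example alignment: 5 sequences, 7 positions.
alignment = [
    "ASTAMPV",
    "ATSLMVT",
    "SSSLMLT",
    "ATPAMSS",
    "ATALLSA",
]


def column_frequencies(msa):
    """Return one {residue: frequency} dictionary per alignment column."""
    n_seq = len(msa)
    frequencies = []
    for column in zip(*msa):            # iterate over columns, not rows
        counts = Counter(column)        # step 2: count occurrences
        frequencies.append(             # step 3: convert to frequencies
            {residue: count / n_seq for residue, count in counts.items()}
        )
    return frequencies


freqs = column_frequencies(alignment)
print(freqs[0])  # {'A': 0.8, 'S': 0.2}
print(freqs[3])  # {'A': 0.4, 'L': 0.6}
```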

How to construct weight matrices?

4. Step: • Normalization (take into account the amino-acid frequency)

• Weights for residues not found in the alignment are extrapolated from observed amino acid compositions or amino acid substitution tables (Pseudo-counts)

• Definition of a cut-off value PSSMs Position Specific Scoring Matrices
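A possible sketch of step 4, turning the example alignment into a log-odds matrix and scanning a sequence with it. Several simplifications are assumptions made for illustration only: a uniform background frequency (1/20) instead of observed amino acid compositions, a flat pseudo-count of 1 for every residue instead of substitution-table based pseudo-counts, no sequence weighting, and an arbitrary cut-off. The function names and the test sequence are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

alignment = ["ASTAMPV", "ATSLMVT", "SSSLMLT", "ATPAMSS", "ATALLSA"]


def build_pssm(msa, pseudocount=1.0):
    """Log-odds (base 2) PSSM with flat pseudo-counts and uniform background."""
    background = 1.0 / len(AMINO_ACIDS)
    denominator = len(msa) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for column in zip(*msa):
        counts = Counter(column)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / denominator) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_window(pssm, window):
    """Sum of the position-specific scores of a window of the same width."""
    return sum(column[residue] for column, residue in zip(pssm, window))


def scan(pssm, sequence, cutoff):
    """Return (start, score) for every window scoring at least `cutoff`."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = score_window(pssm, sequence[start:start + width])
        if score >= cutoff:
            hits.append((start, score))
    return hits


pssm = build_pssm(alignment)
for start, score in scan(pssm, "GGATSLMSTGG", cutoff=5.0):
    print(start, round(score, 2))  # only the window starting at index 2 passes
```

Because the window has a fixed width and no gaps are allowed, scanning is a simple sliding sum over the sequence.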

Using a weighted multiple alignment. If this is not done, the matrix will be biased toward the most represented subgroups of sequences. PSSMs Position Specific Scoring Matrices

Advantages: • Good for short, conserved regions. • Relatively fast and simple to implement. • Produce match scores that can be interpreted based on statistical theory. Limitations: • Insertions and deletions are strictly forbidden. • Relatively long sequence regions can therefore not be described with this method. When should I use PSSMs? • To model small regions with high variability but constant length. PSI-Blast

Position-Specific Iterated BLAST
• Start from a query sequence running gapped BLAST
• Construct multiple alignment M from the output hits
– Collect sequence segments output with E-value below a threshold
– Identical sequences are dropped
– Pairwise alignment columns with gaps in query are ignored
– Multiple alignment M has same length L (column length) as query
• Generate position-specific matrix: L * 20
• Use this matrix in place of the query for the next run

PSI-Blast

Advantage over BLAST:
1) more sensitive method: detects more distantly related family members
2) more accurate alignments for evolutionarily more distant sequences.

Position-Specific Iterated BLAST

[Figure: query sequence (Q) and database proteins (P)]

http://blast.ncbi.nlm.nih.gov/Blast.cgi

PSI-Blast Submission

PSI-Blast - first run

PSI-Blast - second run

PSI-Blast

• Advantage:
– Automatic construction of PSSMs
– Significantly improved sensitivity to detect homologous sequences.

• Caveat: – If non-homologous sequences are included in the PSSMs, the iterative procedure may diverge by including more non-homologous sequences. – ==> Use a stringent threshold for including hits into the PSSM PSI-Blast

Caveat: Generalized Profiles

The idea behind generalized profiles • One would like to generalize PSSMs to allow for insertions and deletions. However this raises the difficult problems of defining and computing an optimal alignment with gaps.

• Let us recycle the principle of dynamic programming, as it was introduced to define and compute the optimal alignment between a pair of sequences (e.g. by the Smith-Waterman algorithm), and generalize it by introducing:
– position-dependent match scores,
– position-dependent gap penalties.

Generalized Profiles are an extension of PSSMs

Excerpt of a Generalized Profile

Generalized Profiles: Conclusions
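To make the dynamic-programming idea behind generalized profiles concrete, here is a minimal Python sketch of a Smith-Waterman-style local alignment of a sequence against a profile in which both the match scores and the gap penalties depend on the profile position. It is a simplification under stated assumptions: gap penalties are linear rather than affine, and the profile layout (a list of dicts with the keys `match`, `mismatch`, `insert`, `delete`) and the function name are hypothetical, not the format of any real profile library.

```python
def align_to_profile(profile, sequence):
    """Best local alignment score of `sequence` against `profile`.

    profile: list of positions; each position is a dict with
      "match":    {residue: score} for residues favoured at this position
      "mismatch": score for any residue not listed in "match"
      "insert":   penalty for an extra sequence residue after this position
      "delete":   penalty for skipping this profile position
    """
    rows, cols = len(profile), len(sequence)
    # H[i][j]: best score of an alignment ending at profile position i
    # and sequence position j (0 means "start a new local alignment").
    H = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    best = 0.0
    for i, pos in enumerate(profile, start=1):
        for j, residue in enumerate(sequence, start=1):
            match = H[i - 1][j - 1] + pos["match"].get(residue, pos["mismatch"])
            deletion = H[i - 1][j] - pos["delete"]
            insertion = H[i][j - 1] - pos["insert"]
            H[i][j] = max(0.0, match, deletion, insertion)
            best = max(best, H[i][j])
    return best

def position(residue, insert=4.0, delete=4.0):
    """Helper for the example: one profile position favouring one residue."""
    return {"match": {residue: 2.0}, "mismatch": -1.0,
            "insert": insert, "delete": delete}

# Insertions are cheap after the second position and expensive elsewhere.
loop_tolerant = [position("A"), position("C", insert=0.5), position("D")]
rigid = [position("A"), position("C"), position("D")]

score_tolerant = align_to_profile(loop_tolerant, "ACWD")
score_rigid = align_to_profile(rigid, "ACWD")
```

With the low insertion penalty after the second position, the extra W is absorbed and all three profile positions are matched. With uniform high penalties, the best local alignment stops after the first two positions. This is the sense in which a generalized profile can specify where insertions and deletions are allowed to occur.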

Advantages:
• Possibility to specify where deletions and insertions occur.
• Very sensitive: can detect homology below the twilight zone.
• Good scoring system.
• Automatic building of the profiles.

Limitations:
• Require more sophisticated software.
• CPU expensive.
• Require some expertise to use proficiently.

Hidden Markov Models (HMMs)

==> For the purpose of sequence alignment, the concept is the same as for generalized profiles; the implementation is different. For details see: Durbin, Richard; Sean R. Eddy, Anders Krogh, Graeme Mitchison (1998). Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press.

• For our purpose here, the advantages and limitations are the same as for generalized profiles.

HMMs/Profile based sequence alignments

Program    Address
HHpred     http://toolkit.tuebingen.mpg.de/hhpred
COACH      http://www.drive5.com/lobster/ (Lobster)
HMAP       http://msa.sbc.su.se/
FFAS03     http://ffas.ljcrf.edu/ffas-cgi/cgi/ffas.pl
HMMER      http://hmmer.org/

InterPro
http://www.ebi.ac.uk/interpro/

InterPro is a database of protein families, domains, regions, repeats and sites in which identifiable features found in known proteins can be applied to new protein sequences.

Member Databases:
– Pfam (HMMs)
– PRINTS (PSSMs)
– PROSITE (Patterns & Profiles)
– ProDom (Motifs from PsiBlast)
– SMART (HMMs)
– TIGRFAMs (HMMs)
– PANTHER (HMMs)
– Superfamily (HMMs from SCOP)
– UniProt (Sequences)

InterPro

1. Family: An InterPro family is a group of evolutionarily related proteins that share similar domain (or repeat) architecture.
2. Domain: An InterPro domain is an independent structural unit, which can be found alone or in conjunction with other domains or repeats. Domains are evolutionarily related.
3. Repeat: An InterPro repeat is a region that is not expected to fold into a globular domain on its own. For example, 6-8 copies of the WD40 repeat are needed to form a single globular domain.
4. PTM site: A post-translational modification modifies the primary protein structure. This modification may be necessary for activation or de-activation of function. Examples include glycosylation, phosphorylation, sulphation and splicing.
5. Binding site: An InterPro binding site binds chemical compounds that are not themselves substrates for a reaction. The bound compound may be a required co-factor for a chemical reaction, be involved in electron transport or be involved in protein structure modification.
6. Active site: Active sites are best known as the catalytic pockets of enzymes where a substrate is bound and converted to a product, which is then released. Distant parts of a protein's primary structure may be involved in the formation of the catalytic pocket. Therefore, to describe an active site, different signatures will be needed to cover the active site residues.

InterPro [screenshots]

InterPro - Pfam [screenshots]

Comparative Protein Structure Modeling: General Workflow
[Workflow diagram; recoverable box labels:]
• Target Sequence
• Protein domain annotation (InterPro)
• Known Structures (Templates)
• Template selection: pairwise sequence alignments (BLAST, PSI-BLAST) or HMM/profile-based tools (HHpred)
• Template - target alignment: pairwise sequence alignments (BLAST, PSI-BLAST), MSA (T-Coffee) or HMM/profile-based tools (HHpred)
• Structure modeling
• Structure evaluation & assessment
• Homology model(s)

References and further reading:

1. David W. Mount, “Bioinformatics”, Cold Spring Harbor Laboratory Press
2. Durbin, Richard; Sean R. Eddy, Anders Krogh, Graeme Mitchison (1998). “Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids”, Cambridge University Press