
BIOINFORMATICS FOR HEALTH SCIENCES

Sequence similarity

Nuria Lopez-Bigas

More than 5 million unique sequences in the public databases

Experimental verification of the function is not easily feasible

Fraction of experimentally characterized in different genomes

H. sapiens: 20 %
M. musculus: 7 %
D. melanogaster: 10 %
C. elegans: 1 %

Knowledge of the structure of the protein can give insight into the function, but only around 1 % of the structures of the sequences deposited in databases have been experimentally solved.

Protein function can be considered from 3 points of view:

Molecular function
•Describes activity at the molecular level, e.g. catalysis
•Commonly predicted by methods that identify homologues or orthologues

Biological process
•Describes broader processes carried out by assemblies of molecules, e.g. MAPK pathway
•Methods that predict interactions between molecules are used for predictions

Cellular component
•Describes the cell compartment in which the protein performs its function
•Methods used include those for prediction of signal peptides, residue composition, post-translational modification, membrane association

Sequence based methods for function prediction

Basic idea: to transfer the annotations from a sequence of known function to a sequence of unknown function

Why? Because proteins with similar sequences usually carry out similar functions

Recall that the accuracy of the prediction will depend on the quality of the alignment and also on the quality of the annotations

Homology

Two sequences are said to be homologous if they are both derived from a common ancestral sequence.

Orthology
Homologous genes that have appeared by speciation (A and B). Orthologs are likely to play similar functions.

Paralogy
Homologous genes that arise by duplication in one species (A and A’). Usually they diverge in terms of function.

Xenology
Similar sequences that were not inherited by vertical descent within a lineage, but rather have arisen by horizontal transfer events through symbiosis, viruses, etc.

[Figure: tree with a speciation event and a duplication event relating genes A, A’ and B.]

Sequence alignment and similarity

Sequence alignment provides a hint on the relationship between two sequences

The quality of the alignment has to be scored in order to select the optimal alignment

Sequence identity, similarity, homology???

Sequence identity
The occurrence of exactly the same nucleotide or amino acid in the same position in two aligned sequences.

Sequence similarity
Is meaningful only when possible substitutions are scored according to the probability with which they occur. In protein sequences, amino acids of similar chemical properties are found to substitute each other more often than dissimilar amino acids. These propensities are represented in “scoring matrices”, which are used to score sequence alignments.

Sequence homology
Is a more general term that indicates evolutionary relatedness among sequences. Two sequences are said to be homologous if they are both derived from a common ancestral sequence.

Sequence alignment
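The difference between identity and similarity can be made concrete on a short alignment. The sketch below is only illustrative: the grouping of amino acids by chemical properties in `SIMILAR_GROUPS` is an assumption made for this example and stands in for a real scoring matrix, and percentages are computed over the full alignment length, which is one of several conventions in use.

```python
# Hypothetical grouping of amino acids by chemical properties (an assumption
# for this sketch, not a published scoring matrix).
SIMILAR_GROUPS = ["ILVM", "FWY", "KRH", "DE", "STNQ", "AG", "C", "P"]
GROUP_OF = {aa: idx for idx, group in enumerate(SIMILAR_GROUPS) for aa in group}


def identity_and_similarity(aligned_a, aligned_b):
    """Return (percent identity, percent similarity) for two aligned sequences.

    Gaps are written as '-' and count toward the alignment length only.
    """
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    identical = similar = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # a gap column is neither identical nor similar
        if a == b:
            identical += 1
            similar += 1  # identical residues are also counted as similar
        elif a in GROUP_OF and GROUP_OF[a] == GROUP_OF.get(b):
            similar += 1
    length = len(aligned_a)
    return 100 * identical / length, 100 * similar / length


identity, similarity = identity_and_similarity("RYLP", "RYIP")
print(identity, similarity)  # 75.0 100.0
```

Leu and Ile differ, so identity is 75 %, but both are in the same group, so similarity is 100 %. Neither number by itself says anything about homology.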

[Figure: Sequence a and Sequence b, with a highly conserved and a non-conserved region; two local alignments and one global alignment are marked.]

Local alignment
•is an optimal alignment that includes only the most similar local region or regions (BLAST generates local alignments).

•the aligned regions are not necessarily in the same order in both sequences and can occur multiple times.

•useful for finding motifs common to two unrelated sequences, or for aligning two related sequences which have undergone some non-local changes (e.g. domain shuffling).

Global alignment •is an optimal alignment that includes all characters from each sequence (Clustal generates global alignments)

•useful for relating sequences with colinear properties (e.g. no domain shuffling) or when we know that they are related over their entire length (e.g. exons).
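The contrast between local and global alignment can be sketched with one dynamic-programming routine. This is a minimal illustration, not what BLAST or Clustal actually run: it uses textbook Smith-Waterman (local) and Needleman-Wunsch (global) recurrences, and the match, mismatch and linear gap scores are assumed values chosen for the example.

```python
def alignment_score(seq_a, seq_b, local, match=1, mismatch=-1, gap=-2):
    """Best alignment score of seq_a against seq_b.

    local=True  -> Smith-Waterman style: scores never drop below 0 and the
                   best cell anywhere in the table is returned.
    local=False -> Needleman-Wunsch style: every character of both sequences
                   must be aligned, so end gaps are penalised.
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        for i in range(1, rows):
            table[i][0] = i * gap  # leading gaps in seq_b
        for j in range(1, cols):
            table[0][j] = j * gap  # leading gaps in seq_a
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            cell = max(
                table[i - 1][j - 1] + pair,  # align the two characters
                table[i - 1][j] + gap,       # gap in seq_b
                table[i][j - 1] + gap,       # gap in seq_a
            )
            if local:
                cell = max(cell, 0)  # a local alignment may start anywhere
            table[i][j] = cell
            best = max(best, cell)
    return best if local else table[-1][-1]


short, long_ = "ACGT", "TTTTACGTTTTT"
print(alignment_score(short, long_, local=True))   # 4
print(alignment_score(short, long_, local=False))  # -12
```

The shared motif scores 4 locally, while the global score is dominated by the eight gap positions needed to span the longer sequence.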

[Figure: a query sequence is compared by BLAST against a sequence database to produce the BLAST results.]

Program    Query                     Target or database
BLASTP     protein                   protein
BLASTN     nucleotide                nucleotide
BLASTX     nucleotide (translated)   protein
TBLASTN    protein                   nucleotide (translated)
TBLASTX    nucleotide (translated)   nucleotide (translated)

How to choose a BLAST database?

• Consider your research question:
– Are you looking for a particular gene in a particular species?
  • BLAST against the genome of that species.
– Are you looking for additional members of a protein family across all species?
  • BLAST against the non-redundant database (nr)
– Are you annotating genes in your species of interest?
  • BLAST against known genes (RefSeq) and/or ESTs from a closely related species.

When choosing a database for BLAST…

• It is important to know your reagents.
– Changing your choice of database is changing your search space
– Database size affects the BLAST statistics
  • Record BLAST parameters, database choice and database size in your lab book, just as you would for your wet-bench experiments.
– Databases change rapidly and are updated frequently
  • It may be necessary to repeat your analyses

Sequence query against biological databases

•A common application of sequence alignment is to search a database for sequences similar to a query sequence.

•The query sequence is aligned against a database of at least thousands of sequences

•Special heuristic algorithms were developed to be able to perform these database searches, which exploit knowledge about sequences and alignment statistics

•BLAST and FASTA are the most common search algorithms

BLAST: basic local alignment search tool

Performs pair-wise comparisons of sequences looking for regions of local similarity, rather than optimal global alignments between whole sequences.

Algorithm

1. Identify sequences that contain similar words to those found in the query sequence
2. Extend the aligned regions into high-scoring segment pairs (HSPs)
3. With these HSP segments perform a local pair-wise sequence alignment

Identify sequences that contain similar words to those found in the query sequence

Parameters:

W: word (window) size (3-5 for aa, 11 for nt)
T: similarity cut-off

Example:

Query sequence: ....RAKIDTV....

With W=3, search the database for matches to the word KID and to similar words that score at least T against it, such as KIE, RID and QID.
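The word-neighbourhood idea in the KID example can be sketched as follows. This is an illustration under stated assumptions, not BLAST's implementation: only the three BLOSUM62 rows needed for K, I and D are hard-coded, the cut-off T=11 is chosen for the example (the slides do not give a value), and `neighborhood` is a hypothetical helper name.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# BLOSUM62 rows for K, I and D only, in the column order of AMINO_ACIDS.
_ROWS = {
    "K": [-1, 2, 0, -1, -3, 1, 1, -2, -1, -3, -2, 5, -1, -3, -1, 0, -1, -3, -2, -2],
    "I": [-1, -3, -3, -3, -1, -3, -3, -4, -3, 4, 2, -3, 1, 0, -3, -2, -1, -3, -1, 3],
    "D": [-2, -2, 1, 6, -3, 0, 2, -1, -1, -3, -4, -1, -3, -3, -1, 0, -1, -4, -3, -3],
}
BLOSUM62_ROWS = {aa: dict(zip(AMINO_ACIDS, row)) for aa, row in _ROWS.items()}


def neighborhood(word, threshold):
    """All words of the same length scoring at least `threshold` against `word`."""
    hits = {}
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(BLOSUM62_ROWS[q][c] for q, c in zip(word, candidate))
        if score >= threshold:
            hits["".join(candidate)] = score
    return hits


# T = 11 is an assumed cut-off for this example.
neighbors = neighborhood("KID", threshold=11)
for word, score in sorted(neighbors.items(), key=lambda item: -item[1]):
    print(word, score)
```

With these assumptions the list contains KID itself (score 15) together with KIE, RID and QID from the example. Raising T shortens the list and speeds up the search at the cost of sensitivity.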

Finding local alignments

Alignment extension: word hits are extended in either direction to generate an alignment with a score exceeding the threshold "S".

The alignment extension stops when regions of low sequence similarity are found

[Figure: extending the high-scoring segment pair (HSP) between Seq 1 and Seq 2 into the final local alignment; plot labels: significance decay, minimum score, neighborhood score threshold.]
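The extension step can be sketched as an ungapped extension that gives up once the running score has fallen too far below the best score seen so far. The drop-off rule (`x_drop`), the +1/-2 nucleotide scoring and the function names are assumptions made for this illustration; real BLAST adds further refinements such as gapped extension.

```python
def dna_score(a, b):
    """Assumed scoring for this sketch: +1 for a match, -2 for a mismatch."""
    return 1 if a == b else -2


def extend_hit(seq1, seq2, i, j, word_len, x_drop=3, score_fn=dna_score):
    """Extend the word hit seq1[i:i+word_len] / seq2[j:j+word_len] without gaps.

    Returns (best score, (start, end) in seq1, (start, end) in seq2),
    with end positions exclusive.
    """
    score = sum(score_fn(a, b)
                for a, b in zip(seq1[i:i + word_len], seq2[j:j + word_len]))

    # Extend to the right, remembering where the best score was reached.
    best, running, right, steps = score, score, 0, 0
    while i + word_len + steps < len(seq1) and j + word_len + steps < len(seq2):
        running += score_fn(seq1[i + word_len + steps], seq2[j + word_len + steps])
        steps += 1
        if running > best:
            best, right = running, steps
        elif best - running > x_drop:
            break  # similarity has decayed too far; stop extending

    # Extend to the left in the same way, starting from the best score so far.
    running, left, steps = best, 0, 0
    while i - steps - 1 >= 0 and j - steps - 1 >= 0:
        running += score_fn(seq1[i - steps - 1], seq2[j - steps - 1])
        steps += 1
        if running > best:
            best, left = running, steps
        elif best - running > x_drop:
            break

    return best, (i - left, i + word_len + right), (j - left, j + word_len + right)


seq1 = "GGGACGTACTTT"
seq2 = "CCCACGTACAAA"
# Seed: the word CGT found at position 4 in both sequences.
hsp = extend_hit(seq1, seq2, i=4, j=4, word_len=3)
print(hsp)  # (6, (3, 9), (3, 9))
```

The seed CGT grows into the segment ACGTAC with score 6. In a search, this segment would only be kept if its score exceeded the threshold S.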

Where does the score (S) come from?

• The quality of each pair-wise alignment is represented as a score and the scores are ranked.
• Scoring matrices are used to calculate the score of the alignment base by base (DNA) or amino acid by amino acid (protein).
• The alignment score will be the sum of the scores for each position.

What’s a scoring matrix?

• Substitution matrices are used for amino acid alignments.
– each possible residue substitution is given a score
• A simpler unitary matrix is used for DNA pairs
– each position can be given a score of +1 if it matches and a score of -2 if it does not.
• To consider insertions and deletions: gap penalties

BLOSUM62

•BLOSUM62 is the default matrix in BLAST
•tailored for comparisons of moderately distant proteins, but performs well in detecting closer relationships.
•a search for distant relatives may be more sensitive with a different matrix.

BLAST sequence alignment scores

Example:

LAASTYV
NAAS--V

Under BLOSUM62:
S(L,N) = -3
S(A,A) = 4
S(S,S) = 4
S(V,V) = 4
Gap = -1

score = -3+4+4+4-1-1+4 = 11
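The arithmetic of this example can be reproduced in a few lines. The sketch hard-codes only the four BLOSUM62 entries quoted above and uses the example's flat gap score of -1 per position; `score_alignment` is a hypothetical helper name.

```python
# Only the BLOSUM62 entries used in the example are included here.
BLOSUM62_SUBSET = {("L", "N"): -3, ("A", "A"): 4, ("S", "S"): 4, ("V", "V"): 4}
GAP = -1  # flat per-position gap score, as in the example


def score_alignment(aligned_a, aligned_b, scores=BLOSUM62_SUBSET, gap=GAP):
    """Sum substitution and gap scores over the columns of an alignment."""
    total = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            total += gap
        else:
            # The matrix is symmetric, so look the pair up in either order.
            total += scores[(a, b)] if (a, b) in scores else scores[(b, a)]
    return total


print(score_alignment("LAASTYV", "NAAS--V"))  # 11
```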

Evaluation of BLAST results

We want to distinguish meaningful alignments from random alignments.

For a given alignment with a score, what is the probability that the score is due to chance?

Raw score
Represents the sum of the scores of the maximal-scoring segment pairs (MSPs) that make up the alignment.

P value
Probability of obtaining, by chance alone, an alignment with a score at least as good as the one observed.

Score and E-value

• The quality of the alignment is represented by the Score.
– Score (S)
  • The score of an alignment is calculated as the sum of substitution and gap scores. Substitution scores are given by a look-up table (PAM, BLOSUM) whereas gap scores are assigned empirically.
• The significance of each alignment is computed as an E value.
– E value (E)
  • Expectation value. The number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score.

Is the E-value the same as a P-value?

• The E-value is not a probability; it’s an expect value
• Obtained by multiplying P by the size of the database
– The BLAST programs report E-values rather than P-values because it is easier to understand the difference between, for example, E-values of 5 and 10 than P-values of 0.993 and 0.99995.
– However, when E < 0.01, P-values and E-values are nearly identical.
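The numbers quoted above follow from the standard relation between the two quantities in BLAST statistics, P = 1 - exp(-E), which is the probability of seeing at least one chance hit when E such hits are expected. A minimal check (the function name is made up for this sketch):

```python
import math


def p_value_from_e_value(e_value):
    """P = 1 - exp(-E), computed with expm1 to stay accurate for tiny E."""
    return -math.expm1(-e_value)


for e in (10, 5, 0.01, 1e-6):
    print(e, p_value_from_e_value(e))
```

For E = 5 and E = 10 this gives about 0.993 and 0.99995, and for E = 0.01 it gives about 0.00995, which is why the two values are interchangeable for small E.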

E-value is the number of times you expect to see a hit with a score as good as or better than yours occur in the database due to random chance alone.

Notes on E-values

• Low E-values suggest that sequences are homologous
– Can’t show non-homology
• Statistical significance depends on both the size of the alignments and the size of the sequence database
– Important consideration for comparing results across different searches
– E-value increases as database gets bigger
– E-value decreases as alignments get longer

Homology: Some Rules to Consider

• Similarity can be indicative of homology

• Generally, if two sequences are significantly similar over entire length they are likely homologous

• 50% similarity over a short sequence often occurs by chance

• Low complexity regions can be highly similar without being homologous

• Homologous sequences are not always highly similar

• Suggested BLAST Cutoffs
– For nucleotide based searches, one should look for hits with E-values of 10^-6 or less and sequence identity of 70% or more
– For protein based searches, one should look for hits with E-values of 10^-3 or less and sequence identity of 25% or more

Rough guide to interpret P-values:

P < 10^-100: exact match

P in range 10^-100 to 10^-50: sequences nearly identical (e.g. alleles or SNPs)

P in range 10^-50 to 10^-10: closely related sequences

P in range 10^-5 to 10^-1: usually distant relatives

P > 10^-1: match probably insignificant
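The rough guide can be written as a small lookup, which also makes its limits visible. This sketch transcribes the ranges as listed and deliberately invents nothing for the interval the guide leaves open (between 10^-10 and 10^-5); the function name and the handling of boundary values are assumptions.

```python
def interpret_p_value(p):
    """Map a P-value to the rough categories of the guide above."""
    if p < 1e-100:
        return "exact match"
    if p <= 1e-50:
        return "sequences nearly identical"
    if p <= 1e-10:
        return "closely related sequences"
    if 1e-5 <= p <= 1e-1:
        return "usually distant relatives"
    if p > 1e-1:
        return "match probably insignificant"
    # The guide gives no category between 10^-10 and 10^-5.
    return "not covered by the guide"


for p in (1e-120, 1e-60, 1e-20, 1e-7, 1e-3, 0.5):
    print(p, interpret_p_value(p))
```

These categories are a rule of thumb only; they do not replace looking at the alignment itself.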

Similarities are better found at the protein level. Consider the following two sequences:

A G G T A C T T A C C G
  |   | |     |   | |
C G A T A T A T C C C T

Only 6 of the 12 nucleotides match. However, using TBLASTX:

A G G T A C T T A C C G
 Arg   Tyr   Leu   Pro
  |     |     :     |
 Arg   Tyr   Ile   Pro
C G A T A T A T C C C T
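The translation in this example can be checked with a short script. It is a sketch of the idea only: it builds the standard genetic code from the usual compact TCAG-ordered string and translates a single reading frame, whereas TBLASTX compares all six reading frames of both sequences. The helper names are made up for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate the first reading frame of a DNA string (one-letter codes)."""
    usable = len(dna) - len(dna) % 3  # ignore an incomplete trailing codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def count_matches(a, b):
    """Number of positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b))


seq1, seq2 = "AGGTACTTACCG", "CGATATATCCCT"
protein1, protein2 = translate(seq1), translate(seq2)
print(count_matches(seq1, seq2), "of", len(seq1), "nucleotides match")
print(protein1, protein2)  # RYLP RYIP
print(count_matches(protein1, protein2), "of", len(protein1), "residues match")
```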

Three residues are the same and the fourth pair has similar biochemical properties.

Multiple sequence alignment and phylogenetic reconstruction

Search for orthologs/paralogs in Ensembl

Pages 81 - 100: Ensembl Compara

http://www.ebi.ac.uk/training/sites/ebi.ac.uk.training/files/materials/2012/120508_Rotterdam/denise_carvalho-silva_ensembl_rotterdam_090512-1.pdf