Sequence Similarity
BIOINFORMATICS FOR HEALTH SCIENCES
Sequence similarity
Nuria Lopez-Bigas

More than 5 million unique protein sequences are in the public databases.
Experimental verification of their function is not easily feasible.

Fraction of experimentally characterized proteins in different genomes
H. sapiens    M. musculus    D. melanogaster    C. elegans
20 %          7 %            10 %               1 %

Knowledge of the structure of a protein can give insight into its function, but only around 1 % of the sequences deposited in databases have an experimentally solved structure.

Protein function can be considered from 3 points of view:

Molecular function
• Describes activity at the molecular level, e.g. catalysis
• Commonly predicted by methods that identify homologues or orthologues

Biological process
• Describes broader processes carried out by assemblies of molecules, e.g. the MAPK pathway
• Predicted with methods that predict interactions between molecules

Cellular component
• Describes the cell compartment in which the protein performs its function
• Methods used include those for prediction of signal peptides, residue composition, post-translational modification, membrane association

Sequence-based methods for function prediction
Basic idea: transfer the annotations from a sequence of known function to a sequence of unknown function.
Why? Because proteins with similar sequences usually carry out similar functions.
Recall that the accuracy of the prediction will depend on the quality of the sequence alignment and also on the quality of the annotations.

Homology
Two sequences are said to be homologous if they are both derived from a common ancestral sequence.

Orthology
Homologous genes that have appeared by speciation (A and B).

Paralogy
Homologous genes that arise by gene duplication in one species (A and A’). Usually they diverge in terms of function.

Xenology
Similar sequences that were not inherited by vertical descent, but rather have arisen by horizontal transfer events through symbiosis, viruses, etc.

[Figure: tree with a speciation event and a duplication event relating genes A, A’ and B]

Orthologs are likely to play a similar function.

Sequence alignment and similarity
Sequence alignment provides a hint about the relationship between two sequences.
The quality of the alignment has to be scored in order to select the optimal alignment.

Sequence identity, similarity, homology?

Sequence identity
The occurrence of exactly the same nucleotide or amino acid at the same position in two aligned sequences.

Sequence similarity
Is meaningful only when possible substitutions are scored according to the probability with which they occur. In protein sequences, amino acids with similar chemical properties substitute for each other more often than dissimilar amino acids. These propensities are represented in "scoring matrices", which are used to score sequence alignments.

Sequence homology
Is a more general term that indicates evolutionary relatedness among sequences. Two sequences are said to be homologous if they are both derived from a common ancestral sequence.

Sequence alignment
[Figure: sequence a and sequence b with highly conserved and non-conserved regions, contrasting local alignments with a global alignment]

Local alignment
• is an optimal alignment that includes only the most similar local region or regions (BLAST generates local alignments).
• the aligned regions are not necessarily in the same order in both sequences and can occur multiple times.
• useful for finding motifs common to two unrelated sequences, or for aligning two related sequences which have undergone some non-local changes (e.g. domain shuffling).

Global alignment
• is an optimal alignment that includes all characters from each sequence (Clustal generates global alignments).
• useful for relating sequences with colinear properties (e.g. no domain shuffling) or when we know that they are related over their entire length (e.g. exons).

BLAST
[Figure: a query sequence and a sequence database go into BLAST, which returns the BLAST results]

Program    Query                     Target or database
BLASTP     protein                   protein
BLASTN     nucleotide                nucleotide
BLASTX     nucleotide (translated)   protein
TBLASTN    protein                   nucleotide (translated)
TBLASTX    nucleotide (translated)   nucleotide (translated)

How to choose a BLAST database?
• Consider your research question:
  – Are you looking for a particular gene in a particular species?
    • BLAST against the genome of that species.
  – Are you looking for additional members of a protein family across all species?
    • BLAST against the non-redundant database (nr).
  – Are you annotating genes in your species of interest?
    • BLAST against known genes (RefSeq) and/or ESTs from a closely related species.

When choosing a database for BLAST…
• It is important to know your reagents.
  – Changing your choice of database is changing your search space.
  – Database size affects the BLAST statistics.
    • Record BLAST parameters, database choice and database size in your bioinformatics lab book, just as you would for your wet-bench experiments.
  – Databases change rapidly and are updated frequently.
    • It may be necessary to repeat your analyses.

Sequence query against biological databases
• A common application of sequence alignment is to search a database for sequences similar to a query sequence.
• The query sequence is aligned against a database of at least thousands of sequences.
• Special heuristic algorithms, which exploit knowledge about sequences and alignment statistics, were developed to make these database searches feasible.
• BLAST and FASTA are the most common search algorithms.

BLAST: basic local alignment search tool
Performs pair-wise comparisons of sequences looking for regions of local similarity, rather than optimal global alignments between whole sequences.

Algorithm
1. Identify sequences that contain words similar to those found in the query sequence.
2. Extend the aligned regions into high-scoring segment pairs (HSPs).
3. With these HSPs, perform a local pair-wise sequence alignment.

Identify sequences that contain words similar to those found in the query sequence
Parameters:
  W: word (window) size (3-5 for amino acids, 11 for nucleotides)
  T: similarity cut-off
Example:
  Query sequence: ....RAKIDTV....
  With W=3, search the database for matches to KID and to similar words (above T): KIE, RID, QID

Finding local alignments
Word hits are extended in either direction to generate an alignment with a score exceeding the threshold S.
The alignment extension stops when regions of low sequence similarity are found.
[Figure: Seq 1 and Seq 2 at three stages: word hit, alignment extension, final local alignment]
[Figure: extending the high-scoring segment pair (HSP); labels: significance decay, minimum score, neighborhood score threshold]

Where does the score (S) come from?
• The quality of each pair-wise alignment is represented as a score, and the scores are ranked.
• Scoring matrices are used to calculate the score of the alignment base by base (DNA) or amino acid by amino acid (protein).
• The alignment score is the sum of the scores for each position.

What’s a scoring matrix?
• Substitution matrices are used for amino acid alignments.
  – Each possible residue substitution is given a score.
• A simpler unitary matrix is used for DNA pairs.
  – Each position can be given a score of +1 if it matches and a score of -2 if it does not.
• To account for insertions and deletions: gap penalties.

BLOSUM62
• BLOSUM62 is the default matrix in BLAST.
• It is tailored for comparisons of moderately distant proteins, but performs well in detecting closer relationships.
• A search for distant relatives may be more sensitive with a different matrix.

BLAST sequence alignment scores
Example, final local alignment scored under BLOSUM62:
  Seq 1  LAASTYV
  Seq 2  NAAS--V
  S(L,N) = -3   S(A,A) = 4   S(S,S) = 4   S(V,V) = 4   Gap = -1
  score = -3 + 4 + 4 + 4 - 1 - 1 + 4 = 11

Evaluation of BLAST results
We want to distinguish meaningful alignments from random alignments.
For a given alignment with a score, what is the probability that the score is due to chance?

Raw score
Represents the sum of the scores of the maximal-scoring segment pairs (MSPs) that make up the alignment.

P value
Probability that an alignment with a score this high or higher arises by chance.

Score and E-value
• The quality of the alignment is represented by the score.
  – Score (S)
    • The score of an alignment is calculated as the sum of substitution and gap scores. Substitution scores are given by a look-up table (PAM, BLOSUM), whereas gap scores are assigned empirically.
• The significance of each alignment is computed as an E value.
  – E value (E)
    • Expectation value. The number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance.
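
The score and E value described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the substitution scores are the handful of BLOSUM62 entries quoted in the slide example, the gap score is the flat -1 per position used there (real BLAST searches typically use separate gap opening and extension penalties), and `expect_value` uses the standard Karlin-Altschul form E = K·m·n·e^(-λS) with placeholder values for λ and K that do not come from these slides. The function names and the example search sizes are hypothetical.

```python
import math

# Substitution scores copied from the slide's BLOSUM62 example
# (only the residue pairs that example uses, not the full matrix).
SLIDE_SCORES = {("L", "N"): -3, ("A", "A"): 4, ("S", "S"): 4, ("V", "V"): 4}
GAP_SCORE = -1  # flat per-position gap score, as in the slide example


def alignment_score(seq1, seq2, scores=SLIDE_SCORES, gap_score=GAP_SCORE):
    """Sum substitution and gap scores over the columns of a pairwise alignment."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(seq1, seq2):
        if a == "-" or b == "-":
            total += gap_score
        else:
            # Substitution tables are symmetric, so try both orders of the pair.
            total += scores[(a, b)] if (a, b) in scores else scores[(b, a)]
    return total


def expect_value(score, query_len, db_len, lam=0.267, k=0.041):
    """Karlin-Altschul estimate E = K * m * n * exp(-lambda * S).

    lam and k are illustrative assumptions; BLAST derives them from the
    scoring system in use and also corrects the sequence lengths.
    """
    return k * query_len * db_len * math.exp(-lam * score)


slide_score = alignment_score("LAASTYV", "NAAS--V")
print(slide_score)  # 11, matching the worked example

# Hypothetical search: a 250-residue query against 1e9 residues of database.
for s in (40, 80, 120):
    print(s, expect_value(s, query_len=250, db_len=1e9))
```

With these assumptions, E falls exponentially as the score rises and grows in proportion to the database size, which is one reason the database choice and size are worth recording alongside the results.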
The lower the E value, the more significant the score. Is the