Bioinformatics 2 − 2nd lecture
Prof. László Poppe BME Dept. Organic Chemistry and Technology
Bioinformatics – proteomics: lecture and computer room practice
10/1/2019 Bioinformatics 2 N.M. Luscombe, D. Greenbaum, M. Gerstein: International Medical Informatics Association Yearbook , 2001, 83-100.
The bioinformatics space
A. Kremer, R. Schneider, G.C. Terstappen, Biosci. Rep., 2005, 25, 95-106
Relationships in proteomics
Bioinformatics databases
Databases:
- DNA sequences: gene identification and gene structure
- Genome databases, genome maps
- Gene expression databases
- Proteins: protein sequences, protein sequence patterns, structure databases, proteome analysis
- Enzyme databases, metabolic databases
- Molecular interactions: protein-protein, ligand databases, pharmaceutical databases

1. Nucleotide sequence databases
   - Primary DNA sequence databases
   - Specialized databases
2. Protein sequence databases
   - Primary protein sequence databases
   - Secondary and tertiary (sequence motif) databases
   - Integrated protein sequence databases
3. Structure databases
4. Protein family databases
   - Clustering
   - Sequence family databases
   - Structure family databases
Sequence analysis
The most important task in bioinformatics: searching for sequences similar to a novel sequence (with unknown structure / function) among the known sequences with known structure / function.
Sequence identity: the percentage of identical amino acids in the aligned sequences
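The definition of sequence identity can be sketched in a few lines. This is a minimal illustration, assuming the two sequences are already aligned to equal length, that "-" marks a gap, and that the alignment length is used as the denominator (other conventions exist); the function name is hypothetical.

```python
def percent_identity(top, bottom, gap="-"):
    """Percentage of alignment columns holding two identical residues."""
    if len(top) != len(bottom) or not top:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    # A column counts as identical only if both residues match and are not gaps.
    identical = sum(1 for x, y in zip(top, bottom) if x == y and x != gap)
    return 100.0 * identical / len(top)


print(percent_identity("ACDE", "ACDQ"))  # three of four columns are identical
```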
Transferability of function / structure decreases with decreasing sequence identity
Sequence analysis
Pairwise sequence alignment
Sequence analysis problems
Questions of orthology and paralogy: when a novel sequence is analysed, a serious question is how far the functional information of known proteins applies to the new protein. A similar sequence in a different organism may be a paralogue rather than the orthologue of the characterized protein, i.e. it shares a common origin but developed a novel function during evolution. This may result in errors in automated function prediction (be careful with data from automated function annotation)!
In the case of modular proteins, sequence similarity often holds only for a part of the sequences.
Modular proteins
Modules: protein domains can serve as exchangeable building blocks (e.g. integrating a module of membrane protein A into protein B forms a novel structure)
During evolution, the function of a module may change as it becomes part of different proteins => similarity in sequence but different function
Sequence analysis problems
Even in the case of high sequence or structural similarity, the function can be totally different. E.g. the sequence identity between lysozyme and α-lactalbumin is 50% and their spatial structures are also very similar, but the two proteins have completely different functions (α-lactalbumin: regulatory protein of lactose synthase; lysozyme: hydrolase of bacterial cell walls in the gastrointestinal tract).
(Figure: structures of α-lactalbumin and lysozyme)
=> For approx. one-third of the uncharacterized sequences, the function cannot be inferred from proteins of known function
Sequence analysis problems
Sequence comparison means locating significantly similar regions between two or more sequences.
The main problem is deciding what counts as significant for biological sequences. There are several different approaches, serving different purposes.
Nucleotide sequence databases
Primary DNA sequence databases (International Nucleotide Sequence Database Collaboration http://www.insdc.org/)
DDBJ (Japan, DNA Data Base of Japan - National Institute of Genetics)
EMBL (Europe, European Bioinformatics Institute)
GenBank (USA, National Center for Biotechnology Information)
Sequence data collection:
• directly from researchers
• from literature data
• from patent data
• from genome sequencing projects
Sequence analysis problems
Start with a simple sequence pair (the vertical lines mark matching positions):
A conserved region is apparent. Is there a better match? Slide the two sequences relative to each other! The conserved region becomes larger.
Is there an even better match? Insert gaps! The conserved region is even larger.
Sequence analysis problems
Even full identity can be achieved in the alignment if there is no limit on gaps (deletions / insertions)! => A limit should be set.
Low identity is shown between two sequences:
The situation changes dramatically when the bottom chain is mirrored horizontally (reversal of the 5' and 3' ends)
=> Relationships need to be analyzed by computers
Identity matrices
Unlimited insertion of gaps makes no biological sense.
Creation of gaps should be limited; this can be done with penalty points:
- for inserting a new gap (gap opening penalty)
- for enlarging an existing gap (gap extension penalty)
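The two penalties are commonly combined into a single cost per gap. The sketch below assumes one frequent convention (the first gap position pays the opening penalty, every further position pays the extension penalty); the function name and the default values 10 and 0.5 are illustrative, not taken from the lecture.

```python
def gap_penalty(length, gap_open=10.0, gap_extend=0.5):
    """Cost of one contiguous gap of `length` positions (assumed convention)."""
    if length <= 0:
        return 0.0
    # Opening is charged once, extension for each additional gap position.
    return gap_open + (length - 1) * gap_extend


for n in (1, 2, 5):
    print(n, gap_penalty(n))
```

With such a scheme one long gap is cheaper than several short ones of the same total length, which discourages scattering gaps across the alignment.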
If only identities are taken into account when scoring sequence alignments, the identity matrix is used:
Nucleotide identity matrix
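As a small illustration, a nucleotide identity matrix can be written as a lookup table that scores 1 for identical bases and 0 otherwise. The names `identity` and `identity_score` are hypothetical helpers for this sketch.

```python
NUCLEOTIDES = "ACGT"

# Identity matrix: 1 on the diagonal, 0 everywhere else.
identity = {(x, y): int(x == y) for x in NUCLEOTIDES for y in NUCLEOTIDES}


def identity_score(top, bottom):
    """Sum of identity-matrix scores over aligned, ungapped columns."""
    return sum(identity.get((x, y), 0) for x, y in zip(top, bottom))


# Print the matrix row by row.
print("  " + " ".join(NUCLEOTIDES))
for x in NUCLEOTIDES:
    print(x + " " + " ".join(str(identity[(x, y)]) for y in NUCLEOTIDES))
```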
Identity matrices
Protein identity matrix
The identity matrix is a sparse matrix.
As all exact matches are given equal weight and every mismatch is treated alike, these matrices are not well suited to similarity searches.
Amino acid similarity matrices
In alignments with real biological meaning, different amino acids can be placed under each other, so it matters which amino acid is exchanged for which:
"Looser" amino acid similarity matrices are required, in which amino acid similarity is scored. Disadvantage: increased "noise" (more false hits from unrelated proteins)
As the signal-to-noise ratio depends on the similarity matrix, creating good amino acid similarity matrices is a research area in its own right. A similarity matrix can be built on a statistical basis (e.g. mutation frequency) or on the physicochemical properties of the amino acids.
The two most common matrices (PAM / BLOSUM) were created with the aid of mutation statistics.
By using similarity matrices for the alignment, a similarity value may be calculated in addition to the identity (e.g. the similarity % is the percentage of amino acid pairs with positive similarity scores).
Dayhoff’s PAM matrices
Dr. Margaret Oakley Dayhoff et al.: in the 1970s, the probabilities of amino acid exchanges were calculated from comparisons of hand-aligned sequences of >85% identity.
Legend (amino acid groups): hydrophilic, sulfhydryl, aliphatic, basic, aromatic, special
Normalized probabilities multiplied by 10000:

   Ala  Arg  Asn  Asp  Cys  Gln  Glu  Gly  His  Ile  Leu  Lys  Met  Phe  Pro  Ser  Thr  Trp  Tyr  Val
     A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
A 9867    2    9   10    3    8   17   21    2    6    4    2    6    2   22   35   32    0    2   18
R    1 9913    1    0    1   10    0    0   10    3    1   19    4    1    4    6    1    8    0    1
N    4    1 9822   36    0    4    6    6   21    3    1   13    0    1    2   20    9    1    4    1
D    6    0   42 9859    0    6   53    6    4    1    0    3    0    0    1    5    3    0    0    1
C    1    1    0    0 9973    0    0    0    1    1    0    0    0    0    1    5    1    0    3    2
Q    3    9    4    5    0 9876   27    1   23    1    3    6    4    0    6    2    2    0    0    1
E   10    0    7   56    0   35 9865    4    2    3    1    4    1    0    3    4    2    0    1    2
G   21    1   12   11    1    3    7 9935    1    0    1    2    1    1    3   21    3    0    0    5
H    1    8   18    3    1   20    1    0 9912    0    1    1    0    2    3    1    1    1    4    1
I    2    2    3    1    2    1    2    0    0 9872    9    2   12    7    0    1    7    0    1   33
L    3    1    3    0    0    6    1    1    4   22 9947    2   45   13    3    1    3    4    2   15
K    2   37   25    6    0   12    7    2    2    4    1 9926   20    0    3    8   11    0    1    1
M    1    1    0    0    0    2    0    0    0    5    8    4 9874    1    0    1    2    0    0    4
F    1    1    1    0    0    0    0    1    2    8    6    0    4 9946    0    2    1    3   28    0
P   13    5    2    1    1    8    3    2    5    1    2    2    1    1 9926   12    4    0    0    2
S   28   11   34    7   11    4    6   16    2    2    1    7    4    3   17 9840   38    5    2    2
T   22    2   13    4    1    3    2    2    1   11    2    8    6    1    5   32 9871    0    2    9
W    0    2    0    0    0    0    0    0    0    0    0    0    0    1    0    1    0 9976    1    0
Y    1    0    3    0    3    0    1    0    4    1    1    0    0   21    0    1    1    2 9945    1
V   13    2    1    1    3    2    2    3    3   57   11    1   17    1    3    2   10    0    2 9901

Dayhoff’s PAM matrices
PAM: Point Accepted Mutation, i.e. an accepted point mutation (“accepted” mutations do not negatively affect a protein’s fitness)
PAM is a measure of evolutionary distance: 1 PAM is the evolutionary distance (~ time) that results in a 1% difference between two originally identical sequences.
Conversion of PAM matrices to a log-odds matrix:
- Calculate the odds ratio for each substitution: take the score from the previous matrix and divide it by the frequency of the amino acid
- Convert the ratio to log10 and multiply by 10
- Take the average of the log-odds values for converting A to B and for converting B to A
Result: a symmetric matrix
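The conversion steps listed above can be sketched for a single amino acid pair. This is a minimal illustration, assuming that each mutation probability is divided by the background frequency of the replacing amino acid; the function name and the numbers in the usage line are hypothetical, not values from the lecture.

```python
import math


def log_odds(prob_a_to_b, prob_b_to_a, freq_a, freq_b):
    """Symmetric log-odds score (10 * log10 units) for the pair A/B."""
    # Odds ratio: observed substitution probability vs. chance occurrence.
    score_ab = 10 * math.log10(prob_a_to_b / freq_b)
    score_ba = 10 * math.log10(prob_b_to_a / freq_a)
    # Averaging both directions makes the matrix symmetric.
    return (score_ab + score_ba) / 2


# Illustrative input only: substitution probabilities and background frequencies.
print(round(log_odds(0.05, 0.04, 0.08, 0.07), 2))
```

A ratio above 1 (substitution seen more often than expected by chance) gives a positive score, a ratio below 1 a negative one.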
"log odds" matrix −−> adding the logarithms instead of multiplication is simpler
"Log odds" matrix (250 PAM)
Positive values: conservative substitutions; negative values: unlikely exchanges. The amino acids listed are grouped according to their properties, and therefore scores are greater close to the diagonal.
"Log odds" matrix (250 PAM)
PAM 1: 1 accepted mutation event per 100 amino acids
PAM 250: 250 mutation events per 100 …
PAM 250: 20% identity
PAM 120: 40% identity
PAM 80: 50% identity
PAM 60: 60% identity
The PAM 250 matrix is often used because it corresponds to the critical sequence identity of approx. 20% (1 PAM: point mutations causing 1% deviation; over a distance of 250 PAM about 80% of the sequence positions are changed by point mutations).
Disadvantages of PAM matrices:
- data are derived from relatively small, hand-aligned sequence sets
- data are derived from sequences of >85% identity; for lower degrees of identity they can only be extrapolated
Henikoff and Henikoff (PNAS 1992, 89, 10915-10919)
BLOSUM matrices
Blocks Amino Acid Substitution Matrices (BLOSUM): based on the BLOCKS database of protein families, which contains blocks of multiply aligned sequences.
Sequences are grouped into clusters on the basis of their sequence similarity (e.g. sequences with >62% sequence identity are grouped together). Different clusterings can be formed depending on the identity threshold (80%, 60%, 40%, etc.).
Amino acid substitution matrices are calculated from the sequences found in the clusters (=> BLOSUM 80, BLOSUM 60, BLOSUM 40, etc. matrices);
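The clustering step can be sketched as follows. This is a simplified illustration, assuming ungapped sequences of equal length from one block and a rule in which a sequence joins a cluster if it reaches the identity threshold with any member; the function names and the example block are hypothetical, and the counting of substitutions between clusters is not shown.

```python
def block_identity(a, b):
    """Percent identity of two equal-length, ungapped block sequences."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_block(sequences, threshold=62.0):
    """Group sequences linked by at least `threshold` percent identity."""
    clusters = []
    for seq in sequences:
        # Clusters containing at least one sequence similar enough to seq.
        linked = [c for c in clusters
                  if any(block_identity(seq, member) >= threshold for member in c)]
        merged = [seq]
        for c in linked:
            merged.extend(c)
            clusters.remove(c)
        clusters.append(merged)
    return clusters


block = ["ACDEFGHIKL", "ACDEFGHIKV", "WWWWWWWWWW"]  # made-up example block
print(cluster_block(block))
```

Raising the threshold splits the data into more, tighter clusters, which is why a higher BLOSUM number corresponds to a matrix tuned for more closely related sequences.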
BLOSUM 62 matrix is often used (comparable to PAM 250).
The BLOSUM matrices usually give better alignment in a biological sense than PAM matrices.
BLOSUM 62 matrix
The PAM and BLOSUM matrices
The BLOSUM and PAM matrices prefer different amino acid substitutions: the + signs for the preferred replacement are at different positions in PAM 250 (a) and BLOSUM 62 (b) based alignments.
E.g.
Statistical significance
Because virtually any two sequences can be aligned by inserting enough gaps => the "goodness" of an alignment should be quantified
Statistical parameters characterizing reliability:
- P value: the probability that the alignment in a global sequence comparison is the result of chance. Low values are good.
- E value (expected frequency): the number of hits in a database search that can be attributed to chance. Smaller values are better.
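The two parameters are related. Under the assumption, common in database search statistics, that the number of chance hits follows a Poisson distribution, the probability of seeing at least one chance hit is P = 1 - exp(-E). The sketch below only illustrates this relationship; the function name is hypothetical.

```python
import math


def p_from_e(e_value):
    """Probability of at least one chance hit, assuming Poisson-distributed hits."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


for e in (1e-5, 0.1, 1.0, 10.0):
    print(e, p_from_e(e))
```

For small values the two numbers are nearly equal, so an E value well below 1 can be read almost directly as a probability.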
Dotplot analysis
A dotplot analysis gives an overview of the possible alignments. Wherever the amino acids / nucleic acid bases of the two sequences are identical, a point (or x) is placed.
Ideal case: two identical sequences
Sequence 1 (horizontal): T A T C G A A G T A
Sequence 2 (vertical): T A T C G A A G T A
All the letters of the first sequence are compared with each letter of the second sequence.
The dotplot shows a diagonal.
There are other hits. Are they only noise, or do they have meaning?
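The example above can be reproduced with a short sketch. It assumes the simplest dotplot rule (a dot for every identical pair of letters, no window or threshold filtering); the function names are hypothetical.

```python
def dotplot(seq1, seq2):
    """Boolean matrix: rows follow seq2 (vertical), columns follow seq1 (horizontal)."""
    return [[a == b for a in seq1] for b in seq2]


def render(seq1, seq2):
    """Text rendering of the dotplot with 'x' marking identical letters."""
    lines = ["  " + " ".join(seq1)]
    for letter, row in zip(seq2, dotplot(seq1, seq2)):
        lines.append(letter + " " + " ".join("x" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("TATCGAAGTA", "TATCGAAGTA"))
```

The printed grid shows the full main diagonal together with the scattered off-diagonal hits that the slide asks about.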
Dotplot analysis
Similar sequences (lysozymes of closely related species)
Distant but still related sequences (lysozyme and α-lactalbumin)
Local and global similarity
Local similarity: similarity can be observed only in particular sequence regions
Global similarity: similarity can be observed along the whole sequences
Global similarity: becomes "loose" (unreliable) if the sequences are too dissimilar
The goodness of an alignment cannot be interpreted in absolute terms; the various mathematical models were developed by taking different biological aspects into account.
It is advisable to search for local similarities:
- Proteins often have a modular structure => sites important for a certain function are often short sections
- Local search is faster
Needleman SB, Wunsch CD. J Mol Biol 1970, 48(3), 443-453.
Global alignment: Needleman−Wunsch algorithm
Needleman, Wunsch 1970 (+ further developments):
The first application of dynamic programming to the comparison of biological sequences. It searches for a maximal match between two sequences, allowing deletions. A gap penalty applies.
Starting from a dotplot, a matrix is created. Then, starting from the bottom right corner and moving from right to left and from bottom to top, the largest of three values is added to the content of each cell. The three values:
1. the content of the cell diagonally down and to the right of the cell
2. the maximum of the values in the row one below, at least two columns to the right, reduced by the gap penalty
3. the maximum of the values in the column one to the right, at least two rows downwards, reduced by the gap penalty
By using the recursive algorithm with suitable similarity matrices (e.g. PAM, BLOSUM), good-quality optimal alignments can be obtained (maximum score, taking into account the similarities and gap penalties)
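A compact sketch of global alignment by dynamic programming follows. It uses the textbook formulation that fills the matrix from the top left with a linear gap penalty, not the original bottom-right filling described above, and a simple match/mismatch score instead of a PAM or BLOSUM matrix; all scoring values are illustrative assumptions.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) of one optimal global alignment."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap          # leading gaps in b
    for j in range(1, m + 1):
        F[0][j] = j * gap          # leading gaps in a
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          F[i - 1][j] + gap,     # gap in b
                          F[i][j - 1] + gap)     # gap in a
    # Traceback from the bottom-right corner to the top-left corner.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = needleman_wunsch("ACGT", "AGT")
print(score)
print(top)
print(bottom)
```

Replacing the match/mismatch rule with a lookup in a similarity matrix gives the protein variant mentioned in the text.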
Smith TF, Waterman MS. J Mol Biol 1981, 147, 195–197.
Local alignment: Smith−Waterman algorithm
Smith, Waterman 1981 (+ further developments):
A dynamic programming application (similar to the Needleman, Wunsch 1970 algorithm) for the comparison of biological sequences, which is able to identify short, locally similar regions.
Two basic differences compared to the Needleman, Wunsch 1970 algorithm:
1. Pairings of different (dissimilar) amino acids receive negative scores (not zero)
2. No negative value is allowed when filling the matrix; 0 is entered instead of a negative value.
Each cell in the matrix is the endpoint of a possible local alignment (its rightmost element); the maximum similarity score of this alignment is entered into the cell.
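The two differences can be seen in a short sketch of local alignment. It assumes a simple match/mismatch score with a linear gap penalty (illustrative values, not from the lecture); a real application would use a similarity matrix such as PAM or BLOSUM.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, segment_a, segment_b) of one best local alignment."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Negative values are not allowed: 0 restarts the local alignment.
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Traceback from the highest-scoring cell until a zero cell is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("CCCGATTACACCC", "TTTGATTACATTT"))
```

In this example only the shared core of the two sequences is reported; the unrelated flanks do not lower the score because every cell is floored at zero.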
Fast algorithms
The Needleman-Wunsch and Smith-Waterman algorithms are accurate, but not fast enough for handling large numbers of sequences
=> Fast algorithms are required for large datasets
=> FASTA, BLAST: start with a search for short, identical / similar sequences
· E or P values are given
· Parameters can be varied (e.g. gap penalty), influencing selectivity and sensitivity.
Selectivity (finding only real homologues) and sensitivity (ability to identify distant homologues) are usually gained at each other's expense.
Lipman DJ, Pearson WR. Science 1985, 227(4693), 1435-1441.
FASTA algorithm
Start: search for identical "words" of length k (k-tuples) between two sequences. (Proteins: k=1−2, DNA: k=4−6)
If the number of matches is sufficient, dynamic programming (Smith-Waterman) is applied to calculate the alignment
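The word-matching start can be sketched as below. This covers only the first stage (finding identical k-tuples and counting them per diagonal); the later rescoring and dynamic programming stages of the actual FASTA program are not shown, and the function name is hypothetical.

```python
from collections import Counter, defaultdict


def ktuple_diagonals(query, subject, k=2):
    """Count identical k-tuple hits per diagonal (query index minus subject index)."""
    # Index every word of length k in the query by its start position.
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    # Look up each subject word; hits on the same diagonal suggest an ungapped match.
    diagonals = Counter()
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], ()):
            diagonals[i - j] += 1
    return diagonals


print(ktuple_diagonals("GATTACA", "TTGATTACA", k=4))
```

A diagonal that collects many hits marks a region worth passing on to the slower dynamic programming step.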
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. J Mol Biol 1990, 215(3), 403–410.
BLAST algorithm
Altschul et al. (1990 + further refinements and extensions):
Basic Local Alignment Search Tool (BLAST): can be implemented efficiently, is parallelizable and very fast
The main idea of BLAST is that a statistically significant alignment often contains high-scoring segment pairs (HSPs). BLAST searches for high-scoring sequence alignments between the query sequence and the sequences in the database using a heuristic approach that approximates the Smith-Waterman algorithm.
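The idea of growing a short word hit into a high-scoring segment pair can be sketched as an ungapped extension. This is a simplified illustration, not the BLAST implementation: it assumes an exact word hit of length k at positions i and j, a match/mismatch score, and a drop-off rule that stops the extension once the score falls more than `x_drop` below the best value seen; all names and values are hypothetical.

```python
def extend_hit(query, subject, i, j, k, match=1, mismatch=-1, x_drop=2):
    """Extend a word hit without gaps; return (score, query_span, subject_span)."""
    best = score = k * match
    right = left = 0
    # Extend to the right of the word hit.
    step = 0
    while i + k + step < len(query) and j + k + step < len(subject):
        score += match if query[i + k + step] == subject[j + k + step] else mismatch
        step += 1
        if score > best:
            best, right = score, step
        elif best - score > x_drop:
            break
    # Extend to the left, starting again from the best score so far.
    score, step = best, 0
    while i - 1 - step >= 0 and j - 1 - step >= 0:
        score += match if query[i - 1 - step] == subject[j - 1 - step] else mismatch
        step += 1
        if score > best:
            best, left = score, step
        elif best - score > x_drop:
            break
    return best, (i - left, i + k + right), (j - left, j + k + right)


query, subject = "CCGATTACACC", "TTGATTACATT"
score, q_span, s_span = extend_hit(query, subject, i=4, j=4, k=3)
print(score, query[q_span[0]:q_span[1]], subject[s_span[0]:s_span[1]])
```

Because no matrix is filled, the cost of each extension grows only with the length of the segment, which is the source of the speed advantage over full dynamic programming.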
Gapped BLAST (Altschul et al. 1997): looks for only one segment pair, which is then extended in both directions by dynamic programming. Three times faster than the ungapped BLAST.
PSI−BLAST: even more sensitive, uses multiple alignments
Magic-BLAST (NCBI, 2016): new generation RNA and DNA BLAST (for WGS data)