
2 − 2nd lecture

Prof. László Poppe BME Dept. Organic Chemistry and Technology

Bioinformatics – proteomics Lecture and computer room practice

10/1/2019 Bioinformatics 2 N.M. Luscombe, D. Greenbaum, M. Gerstein: International Medical Informatics Association Yearbook , 2001, 83-100.

The bioinformatics space

A. Kremer, R. Schneider, G.C. Terstappen, Biosci. Rep., 2005, 25, 95-106

Relationships in proteomics

Bioinformatics databases

Databases:
- DNA sequences: gene identification and gene structure
- Genome databases, genome maps
- Gene expression databases
- Proteins: sequences, protein sequence patterns, structure databases, proteome analysis
- Enzyme databases, metabolic databases
- Molecular interactions: protein-protein, ligand databases, pharmaceuticals databases

1. Nucleotide sequence databases
   - Primary DNA sequence databases
   - Specialized databases
2. Protein sequence databases
   - Primary protein sequence databases
   - Secondary and ternary (sequence motifs) databases
   - Integrated protein sequence databases
3. Structure databases
4. Protein family databases
   - Clustering
   - Sequence family databases
   - Structure family databases

Sequence analysis

• The most important process of bioinformatics: searching among the known sequences (with known structure / function) for sequences similar to a novel sequence (with unknown structure / function).

:

• Sequence identity: the percentage of identical amino acids in the aligned sequences

• The transferability of function / structure decreases with decreasing sequence identity
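The definition of sequence identity above can be sketched in a few lines. This is only an illustration: the function name, the gap symbol `-` and the choice to count only columns where both sequences carry a residue are assumptions (several conventions for the denominator are in use).

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of aligned (gap-free) columns that hold identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Assumption: only columns where both sequences carry a residue are compared.
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != gap and y != gap]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


# Two already aligned toy sequences (hypothetical data).
print(round(percent_identity("ACDEFG-HIK", "ACDQFGWHLK"), 1))
```

In this toy pair, 7 of the 9 gap-free columns are identical.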

Sequence analysis

Pairwise sequence alignment

Sequence analysis problems

Questions of orthology and paralogy: When novel sequences are analyzed, a serious question is how far the functional information applies to the new protein (a similar sequence in a different organism may be a paralogue of the orthologous protein of the other organism, i.e. common origin but development of a novel function during evolution). This may result in errors in automated function prediction (be careful with data from automated function annotation)!

In the case of modular proteins, sequence similarity is often valid only for a part of the sequences.

Modular proteins

Modules: protein domains can serve as exchangeable building blocks (e.g. integration of a module of membrane protein A into protein B forms a novel structure)

During evolution, the function of modules may change as a part of different proteins => similarity in sequence but different function

Sequence analysis problems

Even with high sequence or structural similarity, the function can be totally different. E.g. the sequence identity between lysozyme and α-lactalbumin is 50% and the spatial structures are also very similar, but the two proteins have completely different functions (α-lactalbumin: regulatory protein of lactose synthase; lysozyme: gastrointestinal hydrolase of bacterial cell walls).

[Figure: α-lactalbumin and lysozyme]

=> For approx. one-third of the uncharacterized sequences, the function cannot be inferred from proteins of known function

Sequence analysis problems

Sequence comparison means locating significantly similar zones between two or more sequences.

The main problem is to decide what is significant when you are talking about biological sequences. There are several different approaches for many purposes.

Nucleotide sequence databases

Primary DNA sequence databases (International Nucleotide Sequence Database Collaboration http://www.insdc.org/)

DDBJ (Japan, DNA Data Base of Japan - National Institute of Genetics)

EMBL (Europe, European Bioinformatics Institute)

GenBank (USA, National Center for Biotechnology Information)

Sequence data collection:

• directly from researchers
• from literature data
• from patent data
• from genome sequencing projects

Sequence analysis problems

Start with a simple sequence-pair (the vertical lines represent the agreement):

A conserved region is apparent. Is there a better match? Slide the two sequences! The conserved regions are enlarged.

Is there an even better match? Insert gaps! The conserved region is even larger.

Sequence analysis problems

Even full identity can be achieved in the alignment if there is no limit for gaps (deletions) / inserts! => Limit should be set.

Low identity is shown between two sequences:

The situation changes dramatically when the bottom chain is mirrored horizontally (5' and 3' ends reversed)

=> Relationships need to be analyzed by computers

Identity matrices

Unlimited insertion of gaps makes no biological sense.

The creation of gaps should be limited. This can be done with penalty points:
- for inserting a new gap (gap opening penalty)
- for enlarging an existing gap (gap extension penalty)
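The two penalties are often combined into a single cost per gap. A minimal sketch, assuming the common affine form in which the first gap position pays the opening penalty and every further position pays the extension penalty; the function name and the default values 10 and 0.5 are illustrative assumptions, not values from the lecture.

```python
def gap_penalty(length: int, gap_open: float = 10.0, gap_extend: float = 0.5) -> float:
    """Penalty for one gap of `length` positions under an affine scheme."""
    if length <= 0:
        return 0.0
    # The first gap position pays the opening penalty,
    # every further position pays the (smaller) extension penalty.
    return gap_open + (length - 1) * gap_extend


# One long gap is cheaper than several short ones of the same total length.
print(gap_penalty(3), 3 * gap_penalty(1))
```

With an opening penalty much larger than the extension penalty, the alignment prefers a few longer gaps over many scattered single-position gaps.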

If only identities are taken into account when judging sequence alignments, the identity matrix is used:

Nucleotide identity matrix
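As a hedged illustration (the original figure is not reproduced here), a nucleotide identity matrix can be written as a lookup table with 1 for identical bases and 0 otherwise. The 1/0 scoring and the names `IDENTITY` and `identity_score` are assumptions made for this sketch.

```python
NUCLEOTIDES = "ACGT"

# Identity matrix: 1 on the diagonal (identical bases), 0 everywhere else.
IDENTITY = {a: {b: int(a == b) for b in NUCLEOTIDES} for a in NUCLEOTIDES}


def identity_score(seq_a: str, seq_b: str) -> int:
    """Sum identity-matrix scores over an ungapped, equal-length alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return sum(IDENTITY[a][b] for a, b in zip(seq_a, seq_b))


# Print the matrix row by row.
print("  " + " ".join(NUCLEOTIDES))
for base in NUCLEOTIDES:
    print(base, " ".join(str(IDENTITY[base][b]) for b in NUCLEOTIDES))
```

Scoring an alignment with this matrix simply counts the identical positions.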

Identity matrices

Protein identity matrix

The identity matrix is a sparse matrix.

As all exact matches are given equal weight and every mismatch scores zero, these matrices are not well suited for similarity searches.

Similarity matrices

In biologically meaningful alignments, different amino acids can be placed under each other, so it matters which residue is exchanged for which:

• "Looser" amino acid similarity matrices are required, in which amino acid similarity is scored.
• Disadvantage: increased "noise" (more false hits of unrelated proteins)

As the signal-to-noise ratio depends on the similarity matrix, creating good amino acid similarity matrices is a research area of its own. A similarity matrix can be built on a statistical basis (e.g. frequency) or on the physicochemical properties of amino acids.

The two most common matrices (PAM / BLOSUM) were created with the aid of mutation statistics.

When similarity matrices are used for the alignment, a similarity value can be calculated in addition to the identity (e.g. the similarity % is the percentage of amino acid pairs with positive similarity scores).

Dayhoff's PAM matrices

Dr. Dayhoff et al.: In the 70's, the probabilities of amino acid exchanges were calculated from comparisons of hand-aligned sequences of >85% identity.

Amino acid groups (legend): hydrophilic, sulfhydryl, aliphatic, basic, aromatic, special

Normalized probabilities multiplied by 10000:

   Ala  Arg  Asn  Asp  Cys  Gln  Glu  Gly  His  Ile  Leu  Lys  Met  Phe  Pro  Ser  Thr  Trp  Tyr  Val
     A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
A 9867    2    9   10    3    8   17   21    2    6    4    2    6    2   22   35   32    0    2   18
R    1 9913    1    0    1   10    0    0   10    3    1   19    4    1    4    6    1    8    0    1
N    4    1 9822   36    0    4    6    6   21    3    1   13    0    1    2   20    9    1    4    1
D    6    0   42 9859    0    6   53    6    4    1    0    3    0    0    1    5    3    0    0    1
C    1    1    0    0 9973    0    0    0    1    1    0    0    0    0    1    5    1    0    3    2
Q    3    9    4    5    0 9876   27    1   23    1    3    6    4    0    6    2    2    0    0    1
E   10    0    7   56    0   35 9865    4    2    3    1    4    1    0    3    4    2    0    1    2
G   21    1   12   11    1    3    7 9935    1    0    1    2    1    1    3   21    3    0    0    5
H    1    8   18    3    1   20    1    0 9912    0    1    1    0    2    3    1    1    1    4    1
I    2    2    3    1    2    1    2    0    0 9872    9    2   12    7    0    1    7    0    1   33
L    3    1    3    0    0    6    1    1    4   22 9947    2   45   13    3    1    3    4    2   15
K    2   37   25    6    0   12    7    2    2    4    1 9926   20    0    3    8   11    0    1    1
M    1    1    0    0    0    2    0    0    0    5    8    4 9874    1    0    1    2    0    0    4
F    1    1    1    0    0    0    0    1    2    8    6    0    4 9946    0    2    1    3   28    0
P   13    5    2    1    1    8    3    2    5    1    2    2    1    1 9926   12    4    0    0    2
S   28   11   34    7   11    4    6   16    2    2    1    7    4    3   17 9840   38    5    2    2
T   22    2   13    4    1    3    2    2    1   11    2    8    6    1    5   32 9871    0    2    9
W    0    2    0    0    0    0    0    0    0    0    0    0    0    1    0    1    0 9976    1    0
Y    1    0    3    0    3    0    1    0    4    1    1    0    0   21    0    1    1    2 9945    1
V   13    2    1    1    3    2    2    3    3   57   11    1   17    1    3    2   10    0    2 9901

Dayhoff's PAM matrices

PAM = Accepted Point Mutation ("accepted" mutations do not negatively affect a protein's fitness)

PAM is a measure of evolutionary distance: 1 PAM is the evolutionary distance (~ time) that results in a 1% difference between two originally identical sequences.

PAM matrices are converted to a log-odds matrix:
- Calculate the odds ratio for each substitution: take the score from the previous matrix and divide it by the frequency of the amino acid
- Convert the ratio to log10 and multiply by 10
- Take the average of the log-odds ratios for converting A to B and for converting B to A

Result: symmetric matrix

"Log odds" matrix => adding logarithms is simpler than multiplying probabilities
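The conversion steps listed above can be sketched as follows. The input numbers are hypothetical and chosen only to show the sign behaviour; dividing by the frequency of the target amino acid and rounding to an integer are assumptions of this sketch, not details given in the lecture.

```python
import math


def log_odds_score(p_a_to_b: float, p_b_to_a: float, freq_a: float, freq_b: float) -> int:
    """Symmetric log-odds score (in 10 * log10 units) for the pair (a, b)."""
    # Odds ratio: substitution probability relative to the chance of
    # meeting the target amino acid at random (assumed convention).
    odds_ab = p_a_to_b / freq_b
    odds_ba = p_b_to_a / freq_a
    score_ab = 10 * math.log10(odds_ab)
    score_ba = 10 * math.log10(odds_ba)
    # Averaging both directions makes the final matrix symmetric.
    return round((score_ab + score_ba) / 2)


# Hypothetical probabilities and frequencies (not taken from the PAM table).
likely = log_odds_score(0.12, 0.10, freq_a=0.05, freq_b=0.06)
unlikely = log_odds_score(0.01, 0.02, freq_a=0.05, freq_b=0.06)
print(likely, unlikely)
```

A substitution seen more often than expected by chance gets a positive score, a rarer one a negative score; the score of a whole alignment is then the sum of the per-position scores.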

"Log odds" matrix (250 PAM)

Positive values: conservative substitutions; negative values: unlikely exchanges. The amino acids listed are grouped according to their properties, and therefore scores are greater close to the diagonal.

"Log odds" matrix (250 PAM)

PAM 1: 1 accepted mutation event per 100 amino acids
PAM 250: 250 mutation events per 100 …

PAM 250: 20% identity
PAM 120: 40% identity
PAM 80: 50% identity
PAM 60: 60% identity

The PAM 250 matrix is often used because it corresponds to the critical sequence identity of approx. 20% (1 PAM: the amount of point mutation that causes a 1% difference; over a distance of 250 PAM about 80% of the sequence positions are changed by point mutations).
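Why 250 mutation events per 100 residues still leave some identity can be seen in a toy model. The sketch below is not the Dayhoff calculation: it assumes a uniform model in which every residue changes with 1% probability per PAM step and is equally likely to become any of the other 19 amino acids, so it will not reproduce the identity values listed above exactly. The function name is hypothetical.

```python
def fraction_unchanged(pam_distance: int, alphabet_size: int = 20,
                       step_change: float = 0.01) -> float:
    """Expected fraction of positions that still show the original residue."""
    stay = 1.0 - step_change                  # residue survives one PAM step
    back = step_change / (alphabet_size - 1)  # a changed residue mutates back
    same = 1.0
    for _ in range(pam_distance):
        # Either the position was unchanged and stays so,
        # or it was changed and mutates back to the original residue.
        same = same * stay + (1.0 - same) * back
    return same


for distance in (1, 60, 120, 250):
    print(distance, round(100 * fraction_unchanged(distance), 1))
```

Because several mutations can hit the same position (and some revert it), identity decreases more slowly than the raw number of mutation events suggests.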

Disadvantages of PAM matrices:
- data are derived from relatively small, hand-aligned sequence sets
- data are derived from sequences of >85% identity; for smaller degrees of identity they can only be extrapolated

Henikoff and Henikoff (PNAS 1992, 89, 10915-10919)

BLOSUM matrices

Blocks Amino Acid Substitution Matrices (BLOSUM) - based on the BLOCKS database of protein families containing multiple aligned sequence blocks.

Sequences are grouped into clusters on the basis of their sequence similarity (e.g. sequences with >62% sequence identity are grouped together). Depending on the identity threshold, different clusters can be formed (80%, 60%, 40%, etc.).

Amino acid substitution matrices are calculated from the sequences found in the clusters (=> BLOSUM 80, BLOSUM 60, BLOSUM 40, etc. matrices).

BLOSUM 62 matrix is often used (comparable to PAM 250).

The BLOSUM matrices usually give better alignment in a biological sense than PAM matrices.

BLOSUM 62 matrix

The PAM and BLOSUM matrices

The BLOSUM and PAM matrices prefer different amino acid substitutions: the + signs for the preferred replacement are at different positions in PAM 250 (a) and BLOSUM 62 (b) based alignments.

E.g.

Statistical significance

Because virtually any two sequences can be aligned by inserting a sufficient number of gaps => the "goodness" of an alignment should be quantified

Statistical parameters characterizing reliability:
- P value: the probability that the alignment in the global sequence comparison is a result of chance. Low values are good.
- E value (expected frequency): the number of hits in a database search that can be attributed to chance. Smaller values are better.
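The two parameters are related. A minimal sketch, assuming that chance hits follow a Poisson distribution (the model used in BLAST statistics; this assumption is not stated in the lecture), so that the probability of at least one chance hit is 1 - exp(-E). The function name is hypothetical.

```python
import math


def p_from_e(e_value: float) -> float:
    """Probability of at least one chance hit, given the expected number E."""
    # Poisson model: P(no chance hit) = exp(-E), so P(at least one) = 1 - exp(-E).
    # expm1 keeps precision for very small E values.
    return -math.expm1(-e_value)


for e in (1e-5, 0.1, 1.0, 10.0):
    print(e, p_from_e(e))
```

For small values, E and P are practically equal; they differ only when E approaches or exceeds 1.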

Dotplot analysis

A dotplot analysis gives an overview of the possible alignments. Where the amino acids / nucleic acid bases are identical, a point (or x) is placed.

Ideal case: two identical sequences

Sequence 1 (horizontal): T A T C G A A G T A
Sequence 2 (vertical): T A T C G A A G T A

All the letters of the first sequence are compared with each letter of the second sequence. The dotplot shows a diagonal. There are other hits. Are they only noise, or do they have a meaning?
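The dotplot of the example sequence against itself can be reproduced with a short sketch. The function name and the text output format (`x` for identity, `.` otherwise) are choices made for this illustration.

```python
def dotplot(seq_1: str, seq_2: str) -> list[str]:
    """One text row per letter of seq_2; 'x' marks identical letters."""
    return ["".join("x" if a == b else "." for a in seq_1) for b in seq_2]


seq = "TATCGAAGTA"
rows = dotplot(seq, seq)

# Sequence 1 runs horizontally, sequence 2 vertically.
print("  " + seq)
for letter, row in zip(seq, rows):
    print(letter + " " + row)
```

The main diagonal is fully marked, and the additional off-diagonal marks are the "other hits" mentioned above.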

Dotplot analysis

Similar sequences (lysozymes of closely related species)

Distant but still related sequences (lysozyme and α-lactalbumin):

Local and global similarity

Local similarity: similarity can be observed only in particular sequence regions

Global similarity: similarity can be observed along the whole sequences

Global similarity: ”loose” if the sequences are too dissimilar

The goodness of an alignment cannot be interpreted in absolute terms; the various mathematical models were developed by taking different biological aspects into account.

It is advisable to search for local similarities:
- Proteins often have a modular structure => sites important for a certain function are often shorter sections
- Local search is faster

Needleman SB, Wunsch CD. J Mol Biol 1970, 48(3), 443-453.

Global alignment

Needleman-Wunsch algorithm

Needleman, Wunsch 1970 (+ further developments):

• The first application of dynamic programming to the comparison of biological sequences. It searches for a maximal match between two sequences, allowing deletions. A gap penalty applies.

• Starting from a dotplot, a matrix is created. Then, starting from the bottom right corner and moving from right to left and from bottom to top, the largest of three values is added to the content of each cell. The three values:
1. the content of the cell diagonally below and to the right
2. the maximum of the values in the row below, at least two columns to the right, reduced by the gap penalty
3. the maximum of the values in the column to the right, at least two rows down, reduced by the gap penalty

• By using the recursive algorithm with suitable similarity matrices (e.g. PAM, BLOSUM), good-quality optimal alignments can be obtained (maximum score, taking into account the similarities and gap penalties)
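A compact sketch of global alignment by dynamic programming follows. Note the differences from the description above: it uses the textbook formulation that fills the matrix from the top left corner with a constant penalty per gap position, and simple match / mismatch scores instead of a PAM or BLOSUM matrix. All names and score values are assumptions of this sketch.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""

    def pair_score(x: str, y: str) -> int:
        return match if x == y else mismatch

    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: a prefix aligned only against gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Each cell takes the best of: diagonal (pair), up (gap in b), left (gap in a).
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1]),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
    # Traceback from the bottom right corner to recover one optimal alignment.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


total, top, bottom = needleman_wunsch("ACGT", "AGT")
print(total)
print(top)
print(bottom)
```

The runtime and memory grow with the product of the two sequence lengths, which is why this exact method becomes slow for database-scale searches.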

Smith TF, Waterman MS J Mol Biol 1981, 147, 195–197.

Local alignment

Smith-Waterman algorithm

Smith, Waterman 1981 (+ further developments):

An application of dynamic programming (similar to the Needleman, Wunsch 1970 algorithm) for the comparison of biological sequences, which is able to identify short, locally similar regions.

Two basic differences compared to the Needleman, Wunsch 1970 algorithm:
1. Pairs of different (or dissimilar) amino acids receive negative scores (and not zero)
2. When the matrix is filled, negative values are not allowed; 0 is entered instead of a negative value.

Each cell in the matrix is the endpoint of a possible local alignment (its rightmost element); the maximum similarity score of that alignment is stored in the cell.
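The two differences can be seen in a short sketch of local alignment. It assumes simple match / mismatch scores and a constant penalty per gap position rather than a similarity matrix; names and values are illustrative only.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2):
    """Return (score, aligned_a, aligned_b) for one best local alignment."""

    def pair_score(x: str, y: str) -> int:
        return match if x == y else mismatch  # mismatches are negative, not zero

    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            # The extra 0 keeps cells from going negative:
            # a local alignment may start anywhere.
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1]),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Traceback from the highest-scoring cell until a 0 cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTTACGTACGTTT", "GGGACGTACGGGG"))
```

Only the shared core of the two toy sequences is reported; the dissimilar flanks do not lower the score as they would in a global alignment.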

Fast algorithms

The Needleman-Wunsch and Smith-Waterman algorithms are accurate, but not fast enough for handling large amounts of sequences

=> Fast algorithms are required for large datasets

=> FASTA, BLAST: start by searching for short, identical / similar sequences
· E or P values are given
· Parameters can be varied (e.g. gap penalty, influencing selectivity and sensitivity). Selectivity (finding real homologues) and sensitivity (ability to identify distant homologues) can usually be improved only at the expense of each other.

Lipman DJ, Pearson WR. Science 1985, 227(4693), 1435-1441.

FASTA algorithm

Start: search for identical "words" of length k (k-tuples) between two sequences. (Proteins: k=1-2, DNA: k=4-6)

In case of sufficient number of matches, dynamic programming (Smith-Waterman) is applied for alignment calculation
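The word search at the start can be sketched as a lookup table of k-tuples. Tallying the hits per diagonal (offset between the two positions) follows the usual description of FASTA; the function name, the toy sequences and the use of k=4 are assumptions of this sketch, and the later rescoring and dynamic programming steps are not shown.

```python
from collections import defaultdict


def ktuple_diagonals(query: str, target: str, k: int = 2) -> dict[int, int]:
    """Count identical k-letter words per diagonal (offset = i - j)."""
    # Index every word position in the target sequence.
    positions = defaultdict(list)
    for j in range(len(target) - k + 1):
        positions[target[j:j + k]].append(j)
    # Look up each query word and tally the hits per diagonal.
    hits = defaultdict(int)
    for i in range(len(query) - k + 1):
        for j in positions.get(query[i:i + k], ()):
            hits[i - j] += 1
    return dict(hits)


# Toy DNA sequences (hypothetical data).
print(ktuple_diagonals("ACGTACGT", "TTACGTAA", k=4))
```

Diagonals that collect many word hits are the candidate regions that would then be passed on to the slower, exact alignment.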

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ J Mol Biol 1990, 215(3), 403–410.

BLAST algorithm

Altschul et al. (1990 + further refinements and extensions):

Basic Local Alignment Search Tool, or BLAST: can be implemented efficiently, parallelizable and very fast

The main idea of BLAST is that there are often high-scoring segment pairs (HSP) contained in a statistically significant alignment. BLAST searches for high scoring sequence alignments between the query sequence and sequences in the database using a heuristic approach that approximates the Smith-Waterman algorithm.

Gapped BLAST (Altschul et al. 1997): looks for only one segment pair, which is then extended in both directions by dynamic programming. Three times faster than BLAST without gaps.

PSI−BLAST: even more sensitive, uses multiple alignments

Magic-BLAST (NCBI, 2016): new generation RNA and DNA BLAST (for WGS data)
