Bioinformatics 2 − 2nd lecture
Prof. László Poppe
BME Dept. Organic Chemistry and Technology
Bioinformatics - proteomics
Lecture and computer room practice
10/1/2019

The bioinformatics space
N.M. Luscombe, D. Greenbaum, M. Gerstein: International Medical Informatics Association Yearbook, 2001, 83-100.

Relationships in proteomics
A. Kremer, R. Schneider, G.C. Terstappen, Biosci. Rep., 2005, 25, 95-106.

Bioinformatics databases
Databases:
- DNA sequences: gene identification and gene structure; genome databases, genome maps; gene expression databases
- Proteins: protein sequences, protein sequence patterns, structure databases, proteome analysis; enzyme databases, metabolic databases
- Molecular interactions: protein-protein, ligand databases, pharmaceutical databases

1. Nucleotide sequence databases
- Primary DNA sequence databases
- Specialized databases
2. Protein sequence databases
- Primary protein sequence databases
- Secondary and tertiary (sequence motif) databases
- Integrated protein sequence databases
3. Structure databases
4. Protein family databases
- Clustering
- Sequence family databases
- Structure family databases

Sequence analysis
The most important process in bioinformatics: searching for sequences similar to a novel sequence (with unknown structure / function) among the known sequences with known structure / function.
Sequence alignment:
Sequence identity: the percentage of identical amino acids in the aligned sequences.
The transferability of function / structure decreases with decreasing sequence identity.

Sequence analysis
Pairwise sequence alignment

Sequence analysis problems
Questions of orthology and paralogy:
When a novel sequence is analyzed, a serious question is how far the functional information is applicable to the new protein (a similar sequence in a different organism may be a paralogue of the orthologous protein of the other organism, i.e. common origin but development of a novel function during evolution). This may result in errors in automated function prediction (be careful with data from automated function annotations)!
In the case of modular proteins, sequence similarity is often valid only for a part of the sequences.

Modular proteins
Modules: protein domains can serve as exchangeable building blocks (e.g. integration of a module of membrane protein A into protein B forms a novel structure).
During evolution, the function of a module may change as it becomes part of different proteins => similarity in sequence but different function.

Sequence analysis problems
Even in the case of high sequence or structural similarity, the function can be totally different.
E.g. the sequence identity between lysozyme and alpha-lactalbumin is 50% and the spatial structures are also very similar, but these two proteins have completely different functions (α-lactalbumin: lactose synthase regulatory protein; lysozyme: gastrointestinal bacterial cell wall hydrolase).
(Structures shown on the slide: α-lactalbumin, lysozyme)
=> For approx. one-third of the uncharacterized sequences, the function cannot be inferred on the basis of proteins of known function.

Sequence analysis problems
Sequence comparison means locating significantly similar zones between two or more sequences. The main problem is to decide what is significant when dealing with biological sequences. There are several different approaches for different purposes.

Nucleotide sequence databases
Primary DNA sequence databases (International Nucleotide Sequence Database Collaboration, http://www.insdc.org/):
- DDBJ (Japan, DNA Data Base of Japan - National Institute of Genetics)
- EMBL (Europe, European Bioinformatics Institute)
- GenBank (USA, National Center for Biotechnology Information)
Sequence data collection:
• directly from researchers
• from literature data
• from patent data
• from genome sequencing projects

Sequence analysis problems
Start with a simple sequence pair (the vertical lines represent the agreement): a conserved region is apparent.
Is there a better match? Slide the two sequences! The conserved regions are enlarged.
Is there an even better match? Insert gaps! The conserved region is even larger.

Sequence analysis problems
Even full identity can be achieved in the alignment if there is no limit on gaps (deletions) / insertions! => A limit should be set.
Low identity is shown between two sequences: the situation changes dramatically when the bottom chain is mirrored horizontally (reversal of the 5' and 3' ends).
=> Relationships need to be analyzed by computers.

Identity matrices
Unlimited insertion of gaps makes no biological sense.
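The "slide the two sequences" step and the identity definition from the slides above can be sketched in Python as follows. This is only a minimal illustration, not code from the lecture: the function names and the example strings are hypothetical, and percent identity is assumed here to be counted over gap-free columns only.

```python
from typing import Tuple


def percent_identity(a: str, b: str) -> float:
    """Percent of gap-free aligned columns that hold identical residues.

    Assumption: both strings are already aligned (equal length, '-' = gap).
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


def best_ungapped_offset(top: str, bottom: str) -> Tuple[int, int]:
    """Slide `bottom` along `top` without gaps.

    Returns (offset, matches) for the shift with the most identical
    positions; offset is the index in `top` where bottom[0] is placed.
    """
    best = (0, -1)
    for offset in range(-(len(bottom) - 1), len(top)):
        matches = 0
        for j, residue in enumerate(bottom):
            i = j + offset
            if 0 <= i < len(top) and top[i] == residue:
                matches += 1
        if matches > best[1]:
            best = (offset, matches)
    return best


if __name__ == "__main__":
    # Hypothetical example sequences.
    print(best_ungapped_offset("ACGTACGT", "GTACG"))
    print(percent_identity("ACGT-A", "ACCTTA"))
```

Because gap columns are left out of the denominator in this sketch, two sequences that share at least one residue can always be pushed to 100% identity by adding enough gaps, which is why gaps have to be limited.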
Creation of gaps should be limited; this can be achieved with penalty points:
- for inserting a new gap (gap opening penalty)
- for enlarging an existing gap (gap extension penalty)
If only identities are taken into account when judging sequence alignments, the identity matrix is used:
Nucleotide identity matrix

Identity matrices
Protein identity matrix
The identity matrix is a sparse matrix. Because only full matches count, and all with equal weight, these matrices are not favorable for similarity searches.

Amino acid similarity matrices
In biologically meaningful alignments, different amino acids can be placed under each other, so it matters what is exchanged for what: "looser" amino acid similarity matrices are required, in which amino acid similarity is scored.
Disadvantage: increased "noise" (more false hits to unrelated proteins).
As the signal-to-noise ratio depends on the similarity matrix, creating good amino acid similarity matrices is a research area in its own right.
A similarity matrix can be built on a statistical basis (e.g. mutation frequency) or on the physicochemical properties of the amino acids. The two most common matrices (PAM / BLOSUM) were created with the aid of mutation statistics.
When similarity matrices are used for the alignment, a similarity value may be calculated in addition to the identity (e.g. the similarity % is the percentage of amino acid pairs with positive similarity scores).

Dayhoff's PAM matrices
Dr. Margaret Oakley Dayhoff et al.: in the 1970s, the probabilities of amino acid exchanges were calculated from comparisons of hand-aligned sequences of >85% identity.
Amino acid groups marked on the slide: hydrophilic, sulfhydryl, aliphatic, basic, aromatic, special.
Normalized probabilities multiplied by 10000:

          A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
Ala A  9867    2    9   10    3    8   17   21    2    6    4    2    6    2   22   35   32    0    2   18
Arg R     1 9913    1    0    1   10    0    0   10    3    1   19    4    1    4    6    1    8    0    1
Asn N     4    1 9822   36    0    4    6    6   21    3    1   13    0    1    2   20    9    1    4    1
Asp D     6    0   42 9859    0    6   53    6    4    1    0    3    0    0    1    5    3    0    0    1
Cys C     1    1    0    0 9973    0    0    0    1    1    0    0    0    0    1    5    1    0    3    2
Gln Q     3    9    4    5    0 9876   27    1   23    1    3    6    4    0    6    2    2    0    0    1
Glu E    10    0    7   56    0   35 9865    4    2    3    1    4    1    0    3    4    2    0    1    2
Gly G    21    1   12   11    1    3    7 9935    1    0    1    2    1    1    3   21    3    0    0    5
His H     1    8   18    3    1   20    1    0 9912    0    1    1    0    2    3    1    1    1    4    1
Ile I     2    2    3    1    2    1    2    0    0 9872    9    2   12    7    0    1    7    0    1   33
Leu L     3    1    3    0    0    6    1    1    4   22 9947    2   45   13    3    1    3    4    2   15
Lys K     2   37   25    6    0   12    7    2    2    4    1 9926   20    0    3    8   11    0    1    1
Met M     1    1    0    0    0    2    0    0    0    5    8    4 9874    1    0    1    2    0    0    4
Phe F     1    1    1    0    0    0    0    1    2    8    6    0    4 9946    0    2    1    3   28    0
Pro P    13    5    2    1    1    8    3    2    5    1    2    2    1    1 9926   12    4    0    0    2
Ser S    28   11   34    7   11    4    6   16    2    2    1    7    4    3   17 9840   38    5    2    2
Thr T    22    2   13    4    1    3    2    2    1   11    2    8    6    1    5   32 9871    0    2    9
Trp W     0    2    0    0    0    0    0    0    0    0    0    0    0    1    0    1    0 9976    1    0
Tyr Y     1    0    3    0    3    0    1    0    4    1    1    0    0   21    0    1    1    2 9945    1
Val V    13    2    1    1    3    2    2    3    3   57   11    1   17    1    3    2   10    0    2 9901

Dayhoff's PAM matrices
PAM: Accepted Point Mutation ("accepted" mutations do not negatively affect a protein's fitness).
PAM is a measure of evolutionary distance: 1 PAM is the evolutionary distance (~ time) that results in a 1% difference between two originally identical sequences.
PAM matrices converted to a log-odds matrix:
- Calculate the odds ratio for each substitution: take the score from the previous matrix and divide it by the frequency of the amino acid
- Convert the ratio to log10 and multiply by 10
- Take the average of the log-odds ratios for converting A to B and converting B to A
Result: a symmetric "log odds" matrix --> adding logarithms is simpler than multiplying probabilities.

"Log odds" matrix (250 PAM)
Positive values: conservative substitutions; negative values: unlikely exchanges.
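The conversion steps listed above can be sketched in Python on a toy example. This is a hedged illustration under several assumptions, not the lecture's own procedure: the three-letter alphabet, `TOY_FREQS` and `TOY_PAM1` are made-up values, the matrix is assumed to be oriented so that each column is the original residue and each row its replacement, the "frequency of the amino acid" is assumed to be the background frequency of the replacement, and higher PAM distances are obtained by repeated multiplication of the PAM 1 matrix (the usual Dayhoff construction, which the slides do not spell out).

```python
import math
from typing import List

Matrix = List[List[float]]

# Hypothetical background frequencies of a 3-letter toy alphabet.
TOY_FREQS = [0.5, 0.3, 0.2]

# Hypothetical PAM-1-like matrix: TOY_PAM1[i][j] is the probability that
# original residue j is replaced by residue i (each column sums to 1).
TOY_PAM1 = [
    [0.990, 0.010, 0.005],
    [0.006, 0.985, 0.010],
    [0.004, 0.005, 0.985],
]


def mat_mul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m: Matrix, n: int) -> Matrix:
    """m multiplied by itself n times (n = 0 gives the identity matrix)."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


def log_odds(m: Matrix, freqs: List[float]) -> Matrix:
    """Odds ratio -> 10 * log10 -> average of the two directions.

    All entries of m must be positive, otherwise log10 is undefined.
    """
    size = len(m)
    raw = [[10.0 * math.log10(m[i][j] / freqs[i]) for j in range(size)]
           for i in range(size)]
    return [[(raw[i][j] + raw[j][i]) / 2.0 for j in range(size)]
            for i in range(size)]


if __name__ == "__main__":
    toy_pam250 = mat_pow(TOY_PAM1, 250)
    for row in log_odds(toy_pam250, TOY_FREQS):
        print([round(score) for score in row])
```

Averaging the two directions is what makes the result symmetric even when the toy input matrix is not; working with logarithms lets the score of a whole alignment be obtained by adding the per-position scores.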
The amino acids listed are grouped according to their properties, and therefore scores are greater close to the diagonal.

"Log odds" matrix (250 PAM)
PAM 1: 1 accepted mutation event per 100 amino acids
PAM 250: 250 mutation events per 100 …

PAM 250: 20% identity
PAM 120: 40% identity
PAM 80: 50% identity
PAM 60: 60% identity

The PAM 250 matrix is often used as this corresponds to the critical approx.