Substitution Matrices

Introduction to bioinformatics 2007
Lecture 8: Substitution Matrices
Centre for Integrative Bioinformatics VU
Substitution matrices – Sequence analysis 2006

Sequence Analysis

Finding relationships between genes and gene products of different species, including those at large evolutionary distances.

Archaea

Domain Archaea is mostly composed of cells that live in extreme environments. While they are able to live elsewhere, they are usually not found there because outside of extreme environments they are competitively excluded by other organisms. Species of the domain Archaea
• are not inhibited by antibiotics,
• lack peptidoglycan in their cell wall (unlike bacteria, which have this sugar/polypeptide compound),
• and can have branched carbon chains in their membrane lipids of the phospholipid bilayer.

Archaea (cont.)

• It is believed that Archaea are very similar to prokaryotes (e.g. bacteria) that inhabited the earth billions of years ago. It is also believed that eukaryotes evolved from Archaea, because they share many mRNA sequences, have similar RNA polymerases, and have introns.
• Therefore, it is generally assumed that the domains Archaea and Bacteria branched from each other very early in history, after which membrane infolding* produced eukaryotic cells in the archaean branch approximately 1.7 billion years ago.

There are three main groups of Archaea:
1. extreme halophiles (salt),
2. methanogens (methane-producing anaerobes),
3. and hyperthermophiles (e.g. living at temperatures >100 °C!).

*Membrane infolding is believed to have led to the nucleus of eukaryotic cells, which is a membrane-enveloped cell organelle that holds the cellular DNA. Prokaryotic cells are more primitive and do not have a nucleus.

Example of nucleotide sequence database entry for GenBank

LOCUS       DRODPPC 4001 bp INV 15-MAR-1990
DEFINITION  D.melanogaster decapentaplegic gene complex (DPP-C), complete cds.
ACCESSION   M30116
KEYWORDS    .
SOURCE      D.melanogaster, cDNA to mRNA.
  ORGANISM  Drosophila melanogaster
            Eukaryota; mitochondrial eukaryotes; Metazoa; Arthropoda; Tracheata; Insecta; Pterygota; Diptera; Brachycera; Muscomorpha; Ephydroidea; Drosophilidae; Drosophila.
REFERENCE   1 (bases 1 to 4001)
  AUTHORS   Padgett, R.W., St Johnston, R.D. and Gelbart, W.M.
  TITLE     A transcript from a Drosophila pattern gene predicts a protein homologous to the transforming growth factor-beta family
  JOURNAL   Nature 325, 81-84 (1987)
  MEDLINE   87090408
COMMENT     The initiation codon could be at either 1188-1190 or 1587-1589
FEATURES    Location/Qualifiers
  source    1..4001
            /organism="Drosophila melanogaster"
            /db_xref="taxon:7227"
  mRNA      <1..3918
            /gene="dpp"
            /note="decapentaplegic protein mRNA"
            /db_xref="FlyBase:FBgn0000490"
  gene      1..4001
            /note="decapentaplegic"
            /gene="dpp"
            /allele=""
            /db_xref="FlyBase:FBgn0000490"
  CDS       1188..2954
            /gene="dpp"
            /note="decapentaplegic protein (1188 could be 1587)"
            /codon_start=1
            /db_xref="FlyBase:FBgn0000490"
            /db_xref="PID:g157292"
            /translation="MRAWLLLLAVLATFQTIVRVASTEDISQRFIAAIAPVAAHIPLA
            SASGSGSGRSGSRSVGASTSTALAKAFNPFSEPASFSDSDKSHRSKTNKKPSKSDANR
            ……………………
            LGYDAYYCHGKCPFPLADHFNSTNAVVQTLVNNMNPGKVPKACCVPTQLDSVAMLYL
            NDQSTBVVLKNYQEMTBBGCGCR"
BASE COUNT  1170 a 1078 c 956 g 797 t
ORIGIN
     1 gtcgttcaac agcgctgatc gagtttaaat ctataccgaa atgagcggcg gaaagtgagc
    61 cacttggcgt gaacccaaag ctttcgagga aaattctcgg acccccatat acaaatatcg
   121 gaaaaagtat cgaacagttt cgcgacgcga agcgttaaga tcgcccaaag atctccgtgc
   181 ggaaacaaag aaattgaggc actattaaga gattgttgtt gtgcgcgagt gtgtgtcttc
   241 agctgggtgt gtggaatgtc aactgacggg ttgtaaaggg aaaccctgaa atccgaacgg
   301 ccagccaaag caaataaagc tgtgaatacg aattaagtac aacaaacagt tactgaaaca
   361 gatacagatt cggattcgaa tagagaaaca gatactggag atgcccccag aaacaattca
   421 attgcaaata tagtgcgttg cgcgagtgcc agtggaaaaa tatgtggatt acctgcgaac
   481 cgtccgccca aggagccgcc gggtgacagg tgtatccccc aggataccaa cccgagccca
   541 gaccgagatc cacatccaga tcccgaccgc agggtgccag tgtgtcatgt gccgcggcat
   601 accgaccgca gccacatcta ccgaccaggt gcgcctcgaa tgcggcaaca caattttcaa
   ………………………….
  3841 aactgtataa acaaaacgta tgccctataa atatatgaat aactatctac atcgttatgc
  3901 gttctaagct aagctcgaat aaatccgtac acgttaatta atctagaatc gtaagaccta
  3961 acgcgtaagc tcagcatgtt ggataaatta atagaaacga g
//

Example of protein sequence database entry for SWISS-PROT (now UNIPROT)

ID   DECA_DROME STANDARD; PRT; 588AA.
AC   P07713;
DT   01-APR-1988 (REL. 07, CREATED)
DT   01-APR-1988 (REL. 07, LAST SEQUENCE UPDATE)
DT   01-FEB-1995 (REL. 31, LAST ANNOTATION UPDATE)
DE   DECAPENTAPLEGIC PROTEIN PRECURSOR (DPP-C PROTEIN).
GN   DPP.
OS   DROSOPHILA MELANOGASTER (FRUIT FLY).
OC   EUKARYOTA; METAZOA; ARTHROPODA; INSECTA; DIPTERA.
RN   [1]
RP   SEQUENCE FROM N.A.
RM   87090408
RA   PADGETT R.W., ST JOHNSTON R.D., GELBART W.M.;
RL   NATURE 325:81-84 (1987)
RN   [2]
RP   CHARACTERIZATION, AND SEQUENCE OF 457-476.
RM   90258853
RA   PANGANIBAN G.E.F., RASHKA K.E., NEITZEL M.D., HOFFMANN F.M.;
RL   MOL. CELL. BIOL. 10:2669-2677(1990).
CC   -!- FUNCTION: DPP IS REQUIRED FOR THE PROPER DEVELOPMENT OF THE
CC       EMBRYONIC DORSAL HYPODERM, FOR VIABILITY OF LARVAE AND FOR CELL
CC       VIABILITY OF THE EPITHELIAL CELLS IN THE IMAGINAL DISKS.
CC   -!- SUBUNIT: HOMODIMER, DISULFIDE-LINKED.
CC   -!- SIMILARITY: TO OTHER GROWTH FACTORS OF THE TGF-BETA FAMILY.
DR   EMBL; M30116; DMDPPC.
DR   PIR; A26158; A26158.
DR   HSSP; P08112; 1TFG.
DR   FLYBASE; FBGN0000490; DPP.
DR   PROSITE; PS00250; TGF_BETA.
KW   GROWTH FACTOR; DIFFERENTIATION; SIGNAL.
FT   SIGNAL     1    ?    POTENTIAL.
FT   PROPEP     ?    456
FT   CHAIN      457  588  DECAPENTAPLEGIC PROTEIN.
FT   DISULFID   487  553  BY SIMILARITY.
FT   DISULFID   516  585  BY SIMILARITY.
FT   DISULFID   520  587  BY SIMILARITY.
FT   DISULFID   552  552  INTERCHAIN (BY SIMILARITY).
FT   CARBOHYD   120  120  POTENTIAL.
FT   CARBOHYD   342  342  POTENTIAL.
FT   CARBOHYD   377  377  POTENTIAL.
FT   CARBOHYD   529  529  POTENTIAL.
SQ   SEQUENCE 588 AA; 65850MW; 1768420 CN;
     MRAWLLLLAV LATFQTIVRV ASTEDISQRF IAAIAPVAAH IPLASASGSG SGRSGSRSVG
     ASTSTAGAKA FNRFSEPASF SDSDKSHRSK TNKKPSKSDA NRQFNEVHKP RTDQLENSKN
     KSKQLVNKPN HNKMAVKEQR SHHKKSHHHR SHQPKQASAS TESHQSSSIE SIFVEEPTLV
     LDREVASINV PANAKAIIAE QGPSTYSKEA LIKDKLKPDP STYLVEIKSL LSLFNMKRPP
     KIDRSKIIIP EPMKKLYAEI MGHELDSVNI PKPGLLTKSA NTVRSFTHKD SKIDDRFPHH
     HRFRLHFDVK SIPADEKLKA AELQLTRDAL SQQVVASRSS ANRTRYQBLV YDITRVGVRG
     QREPSYLLLD TKTBRLNSTD TVSLDVQPAV DRWLASPQRN YGLLVEVRTV RSLKPAPHHH
     VRLRRSADEA HERWQHKQPL LFTYTDDGRH DARSIRDVSG GEGGGKGGRN KRHARRPTRR
     KNHDDTCRRH SLYVDFSDVG WDDWIVAPLG YDAYYCHGKC PFPLADHRNS TNHAVVQTLV
     NNMNPGKBPK ACCBPTQLDS VAMLYLNDQS TVVLKNYQEM TVVGCGCR

Definition of substitution matrix

• Two-dimensional matrix with score values describing the probability of one amino acid or nucleotide being replaced by another during sequence evolution.

Scoring matrices for nucleotide sequences

• Can be simple:
  • e.g. positive value for match and zero for mismatch.
  • frequencies of mutation are equal for all bases.
• Can be more complicated:
  • taking into account transitions and transversions (e.g. Kimura model)
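
The two kinds of nucleotide scoring described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the score values (1, 0, -1) are hypothetical and do not come from the lecture, and the second matrix merely borrows the transition/transversion distinction that the Kimura model makes; it is not the Kimura rate model itself.

```python
# Illustrative sketch only: names and score values are assumptions,
# not taken from the lecture slides.
BASES = "ACGT"
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def simple_matrix(match=1, mismatch=0):
    """Simple model: one score for a match, one for any mismatch."""
    return {(a, b): (match if a == b else mismatch)
            for a in BASES for b in BASES}


def is_transition(a, b):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return a != b and ({a, b} <= PURINES or {a, b} <= PYRIMIDINES)


def ts_tv_matrix(match=1, transition=0, transversion=-1):
    """Score transitions differently from transversions (assumed values)."""
    matrix = {}
    for a in BASES:
        for b in BASES:
            if a == b:
                matrix[(a, b)] = match
            elif is_transition(a, b):
                matrix[(a, b)] = transition
            else:
                matrix[(a, b)] = transversion
    return matrix


def score_ungapped(x, y, matrix):
    """Sum the per-position scores of two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return sum(matrix[(a, b)] for a, b in zip(x, y))


if __name__ == "__main__":
    # A/G is a transition, G/T a transversion, C/C and T/T are matches.
    print(score_ungapped("ACGT", "GCTT", simple_matrix()))  # 2
    print(score_ungapped("ACGT", "GCTT", ts_tv_matrix()))   # 1
```

With the simple matrix both mismatches cost the same, whereas the second matrix penalises the G/T transversion more than the A/G transition, which is why the two scores differ.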

Scoring matrices for nucleotide sequences

• Simple model

      A  C  T  G
  A   1  0  0  0
  C   0  1  0  0
  T   0  0  1  0
  G   0  0  0  1

• Kimura
  [Figure: Kimura model diagram; labels: purines, pyrimidines]

What is better to align? DNA or protein sequences?

1. Many mutations within DNA are synonymous ⇒ divergence overestimation
2. Evolutionary relationships can be more accurately expressed using a 20×20 amino acid exchange table
3. DNA sequences contain non-coding regions, which should be avoided in homology searches.
4. Still an issue when translating into (six) protein sequences through a codon table.
5. Searching at protein level: frameshifts can occur, leading to stretches of incorrect amino acids and possibly elongation. However, frameshifts normally result in stretches of highly unlikely amino acids.

So?

Rule of thumb: ⇒ if ORF exists, then align at protein level

Scoring matrices for amino acid sequences

• Are complicated, scoring has to reflect:
  • Physico-chemical properties of aa's
  • Likelihood of residues being substituted among truly homologous sequences
• Certain aa with similar properties can be more easily substituted: preserve structure/function
• "Disruptive" substitution is less likely to be selected in evolution (e.g.