Amino Acid Substitution Matrices from an Information Theoretic Perspective

Total Page:16

File Type:pdf, Size:1020Kb

Amino Acid Substitution Matrices from an Information Theoretic Perspective p J. Mol. Bd-(1991) 219, 555-565 Amino Acid Substitution Matrices from an , Information Theoretic Perspective Stephen F. Altschul National Center for Biotechnology Information National Library of Medicine National Institutes of Health Bethesda, MD 20894, U.S.S. (Received 1 October 1990; accepted 12 February 1991) Protein sequence alignments have become an important tool for molecular biologists. Local alignments are frequently constructed with the aid of a “substitution score matrix” that specifies a scorefor aligning each pair of amino acid residues. Over the years, manydifferent substitution matrices have been proposed, based on a wide variety of rationales. Statistical results, however, demonstrate that any such matrix is i.mplicitly a “log-odds” matrix, with a specific targetdistribution for aligned pairs of amino acid residues. Inthe light of information theory, itis possible to express the scores of a substitution matrix in bits and to see that different matrices are better adapted to different purposes. The most widely used matrix for protein sequence comparison has been the PAM-250 matrix. It is argued that for database searches the PAM-,I20 matrix generally is more appropriate, while for comparing two specific proteins with.suspecte4 homology the PAM-200 matrix is indicated. Examples discussed include the lipocalins, human a,B-glycoprotein, the cysticfibrosis transmembrane conductance regulator and the globins. Keywords: homology; sequence comparison; statistical significance; alignment algorithms; pattern recognition 2. Introduction . similarity measure (Smith & Waterman, 1981; Goad & Kanehisa, 1982; Sellers, 1984). This has the General methods for protein sequence comparison advantage of placing no a priori restrictions on the were introduced to molecular biology 20 years ago length of the local alignments sought. Most data- and have since gained widespreaduse. Most early base search methods have been based on such local attemptsto measure protein sequencesimilarity alignments(Lipman & Pearson, 1985; Pearson & ‘‘$’vused on global sequencealignments, in which Lipman, 1988; Altschul et aE.,1990). 1.vclry residue of the two sequences compared had to To evaluate local alignments, scores generally are participate(Needleman & Wunsch, 19TO; Sellers, assigned to each aligned pair of residues (the set of 1954; Sankoff C Kruskal, 1983). However, hecause such scores is called a substitution matrix), aswell as distantly related proteins may share only isolated to residues aligned with nulls: the score of the regions of similarity, e.g. in the vicinity of an active overall alignment is then taken to be the sum of site, attention has shifted to local as opposed to these scores. Specifying an appropriate amino acid global sequence similarity measures. The basic idea substitution matrix is central to protein comparison is to consider only relatively conserved sub- methodsand much effort has been devoted to xquences; dissimilar regions do not contribute toor defining, analyzingand refining such matrices ,thtract from the measure of similarity. Local sirni- (SIcl,achlan, 1971; Dayhoff et al., 1978; Schwartz & iurity mar be studied in a variety of ways. These lhyhoff, 1975; Feng et al., 1985; Rao, 1987: Risler et include measuresbased on the longest matching al.. 1988). One hope has been to find a matrix best segments of two sequences with a specified number adaptedtodistinguishing distant evolutionary or proportion of mismatches (Arratia’et al., 1986; relationships from chancesimilarities. Recent Xrratia & Waterman, 1989), as well as methods that mathematicalresults (Karlin & Altschul, 1990; compare all segments of a fixed, predefined Karlin et al., 1990) allow all substitution matrices to “window”length (McLachlan, 1971). The most be\%wed in a common light,and provide a common practice, however, is to consider segments rationale for selecting particular sets of “optimal” 1;f’ all lengths,and choose those that optimize scores for local protein sequence comparison. 2. The Statistical Significance of Local Sequence Alignments C;lohal alignments are of essentially no IIS~unless they can aIlow gaps. but this is not true for local alignments. The ability to choose segments with arbiora.r?- starting positions in each sequence means that biologically significant regions frequently may be aligned without the need to introduce gaps. Ij'hile: in general. it. is desirable to allow gaps in low1 dignments.doins so greatlydecreases their mathematical tractabilit?. Theresults described here applv rigorously only t.0 1oc:al alignmen1.s that. lack gap. i.e. to segments 01' ecpa.1 I~ngthfrom each of the two sequcncrs cmmpsretl. Somc rcwnt di~,ti~- base search tools have focusccl on tintling st~c:halign- ments (-4ltsc:hul & Lipmm, 1990; Alt.sc:hnl r!f 0.1.. Not.ic:c* thiLt. multiplying all the wares of a sul)st.itu- , 1990). Howrc-er. the statisLics Of optimal s(wrt's fi)r tion matris 1)y somr ~~ositivc t:ollstiLl1t (Ioos not lot:al ;Aignments that include gaps (Smith e! a/., nff1Ac:t. tllc re1;~t.i~~s(:ores of' any sl~i,;lli~,rn~nc~l~ts.Two 19S.5; li'aterman et d., 1957) are 1)roully ;~nalogous mat riws n?lat.c'tl 1)y s11(*h f'ncAtor (:Hn, lhrrrf(~re,be I to those for thc no-gap case (Karlin Rr Altsc:huI: c:onsidercd , c:ssc+ntialIy t~(Iuiv;~hi.Tnslwc*tion of ; 1990; Karlin el al., IYYO), where more precise resull,s cqna.tion (1 ) re\T(!;LIs that multiplying dlworm by (I ' are availaMe. Therdore, one may hope that many of ;~lsohas the c?ffr:ct of dividing 2. I)? 11. Tht. t);lrameter 1 the hicideas prcsenfsd t)elow will ge~~eraliw1.0 1. mil?., t.lrc?rc\fore,I)r viewed :I t~t11r:dsc-;~l(. for 4 local alignments that include gaps. any sroring system; its c1ery)er meaning wi\: be ,d Formally, we assume that the aligned amino acids discassc~lhelow. c ai and ai are assigned thesubstitution score sij. !$ C;ivw~two random protein stlquw1ws as tlcscribed .i Given two protein sequences, thepair of (~11151 above, how many distinct, 01' -'loc*ally optimal" length segments that, when aligned,have the (Sellers.,* 1984) MSPs with score at least S are greatest. aggregate score we call the Maximal expect@ to occur simply by chance? This number is Segment Pair (MSPt). An MSP may be' of any well apbroximated by the formula: length; its score is the MSP score. Since any two protein sequences, related or un- K.1' c - As related, a-ill have some MSP score, it is important to where N is the product of the sequenws' len$hs, know how great a score onecan expect to find and h' is an explicit]? calculable parameter (liarlin simpIy by chance. To address this questionone & Altschul, 1990; Karlin et nl.. 1990). When needs some model of chance. Thesimplest is to comparing a single randomsequence withall the. assume that in the twoproteins compared, the sequences in a database, settingiV to the product of' amino acid ai appearsrandomly with the prob- the query sequence length and the database length ability pi. These probabilities are chosen to reflect (in residues) yields an upper bound on the number the observed frequencies of the aminoacids in of distinct MSPs with score at least S .that the actual proteins. For simplicity of discussion we will search is especkd to yield. assume both proteins share the same amino acid probabilitydistribution; more generally, one can allow them to have different distributions. A random protein sequence is. simply one constructed 3. Optimal Substitution Matrices for Local according to this model. Sequence Alignment For the sake of the statistical theory, we need to Formula (2) allows us to tell when a segm makc two crucial but reasonable assumptions about has a significantly high score. However, it doe the substitution scores. The first is that there be at assist in choosing an appropriatesuhsti least one positive score and the second is that the matrix in the first place. A second class of r expected score xi,jpipjs,j be negative. Because we however, has direct bearing on this question. Th permit the length of a segment pair to be adjusted state that among MSPs from the comparison to optimize its score, both these assumptions are random sequences, the amino acids ai an a necessary also from practical perspective. If there aligned with frequency approaching qij = p wereno positive scores, the MSP would always (Arratia et or., 1988; Kartin 8: Altschul, 1990; consist of a single pair of' residues (or none at all, if et al., 1990; Dembo & Karlin, 1991). this were permitted), and such an alignment is not Given any snbstitution matrix and rando of interest. If the expected score for two random tein model,one may easily calculatethe residues were positive, extending a segment pair as target frequencies, qij, just described. Sotice by the definition of J. in equation (I), these t t Abbreviations used: MSP. Maximal Segmrnt hir: frequencies sum to 1. Sow among ztlig~ment Ig. immunoglobulin. sentingdistant homologies, the amino acids Amino Acid Substitution Matrices -537 with certain characteristic frequencies, Only doubtfulthat any “target distribution” theorem e correspond to a matrix’s target frequencies, can be proved. It may be possible to make a convincing case for a particular substitution matrix in the global alignment context, but the argument will most likely have to be different from that for Any substitutionmatrix has an implicit set of local alignments(Karlin & Altschul, 1990). The arget frequencies for aligned amino acids. Writing sameapplies to substitution matrices used with thescores of thematrix in terms of itstarget fixed-length windows for studying local similarities frequencies, one has: (McI,achlan, 1971; Argos, 1987; Stormo & Hartzell, 1989): a fixed quantity can be added to all entries of such a matrix with no essential effect. It is notable ,qij = (In %)/A. (3) that while the PAM matrices were developed origin- Pi Pj ally for global sequence comparison (Dayhoff et al., 19SS), their statistical ,theory has blossomed in the In other words, the score for an amino acid pair can local alignment context.
Recommended publications
  • Optimal Matching Distances Between Categorical Sequences: Distortion and Inferences by Permutation Juan P
    St. Cloud State University theRepository at St. Cloud State Culminating Projects in Applied Statistics Department of Mathematics and Statistics 12-2013 Optimal Matching Distances between Categorical Sequences: Distortion and Inferences by Permutation Juan P. Zuluaga Follow this and additional works at: https://repository.stcloudstate.edu/stat_etds Part of the Applied Statistics Commons Recommended Citation Zuluaga, Juan P., "Optimal Matching Distances between Categorical Sequences: Distortion and Inferences by Permutation" (2013). Culminating Projects in Applied Statistics. 8. https://repository.stcloudstate.edu/stat_etds/8 This Thesis is brought to you for free and open access by the Department of Mathematics and Statistics at theRepository at St. Cloud State. It has been accepted for inclusion in Culminating Projects in Applied Statistics by an authorized administrator of theRepository at St. Cloud State. For more information, please contact [email protected]. OPTIMAL MATCHING DISTANCES BETWEEN CATEGORICAL SEQUENCES: DISTORTION AND INFERENCES BY PERMUTATION by Juan P. Zuluaga B.A. Universidad de los Andes, Colombia, 1995 A Thesis Submitted to the Graduate Faculty of St. Cloud State University in Partial Fulfillment of the Requirements for the Degree Master of Science St. Cloud, Minnesota December, 2013 This thesis submitted by Juan P. Zuluaga in partial fulfillment of the requirements for the Degree of Master of Science at St. Cloud State University is hereby approved by the final evaluation committee. Chairperson Dean School of Graduate Studies OPTIMAL MATCHING DISTANCES BETWEEN CATEGORICAL SEQUENCES: DISTORTION AND INFERENCES BY PERMUTATION Juan P. Zuluaga Sequence data (an ordered set of categorical states) is a very common type of data in Social Sciences, Genetics and Computational Linguistics.
    [Show full text]
  • Traveling Salesman Problem
    TRAVELING SALESMAN PROBLEM, THEORY AND APPLICATIONS Edited by Donald Davendra Traveling Salesman Problem, Theory and Applications Edited by Donald Davendra Published by InTech Janeza Trdine 9, 51000 Rijeka, Croatia Copyright © 2010 InTech All chapters are Open Access articles distributed under the Creative Commons Non Commercial Share Alike Attribution 3.0 license, which permits to copy, distribute, transmit, and adapt the work in any medium, so long as the original work is properly cited. After this work has been published by InTech, authors have the right to republish it, in whole or part, in any publication of which they are the author, and to make other personal use of the work. Any republication, referencing or personal use of the work must explicitly identify the original source. Statements and opinions expressed in the chapters are these of the individual contributors and not necessarily those of the editors or publisher. No responsibility is accepted for the accuracy of information contained in the published articles. The publisher assumes no responsibility for any damage or injury to persons or property arising out of the use of any materials, instructions, methods or ideas contained in the book. Publishing Process Manager Ana Nikolic Technical Editor Teodora Smiljanic Cover Designer Martina Sirotic Image Copyright Alex Staroseltsev, 2010. Used under license from Shutterstock.com First published December, 2010 Printed in India A free online edition of this book is available at www.intechopen.com Additional hard copies can be obtained
    [Show full text]
  • Early Fixation of an Optimal Genetic Code
    Early Fixation of an Optimal Genetic Code Stephen J. Freeland,* Robin D. Knight,* Laura F. Landweber,* and Laurence D. Hurst² *Department of Ecology and Evolution, Princeton University; and ²Department of Biology and Biochemistry, University of Bath, Bath, England The evolutionary forces that produced the canonical genetic code before the last universal ancestor remain obscure. One hypothesis is that the arrangement of amino acid/codon assignments results from selection to minimize the effects of errors (e.g., mistranslation and mutation) on resulting proteins. If amino acid similarity is measured as polarity, the canonical code does indeed outperform most theoretical alternatives. However, this ®nding does not hold for other amino acid properties, ignores plausible restrictions on possible code structure, and does not address the naturally occurring nonstandard genetic codes. Finally, other analyses have shown that signi®cantly better code structures are possible. Here, we show that if theoretically possible code structures are limited to re¯ect plausible biological constraints, and amino acid similarity is quanti®ed using empirical data of substitution frequencies, the canonical code is at or very close to a global optimum for error minimization across plausible parameter space. This result is robust to variation in the methods and assumptions of the analysis. Although signi®cantly better codes do exist under some assumptions, they are extremely rare and thus consistent with reports of an adaptive code: previous analyses which suggest otherwise derive from a misleading metric. However, all extant, naturally occurring, secondarily derived, nonstandard genetic codes do appear less adaptive. The arrangement of amino acid assignments to the codons of the standard genetic code appears to be a direct product of natural selection for a system that minimizes the phenotypic impact of genetic error.
    [Show full text]
  • Sequence Motifs, Correlations and Structural Mapping of Evolutionary
    Talk overview • Sequence profiles – position specific scoring matrix • Psi-blast. Automated way to create and use sequence Sequence motifs, correlations profiles in similarity searches and structural mapping of • Sequence patterns and sequence logos evolutionary data • Bioinformatic tools which employ sequence profiles: PFAM BLOCKS PROSITE PRINTS InterPro • Correlated Mutations and structural insight • Mapping sequence data on structures: March 2011 Eran Eyal Conservations Correlations PSSM – position specific scoring matrix • A position-specific scoring matrix (PSSM) is a commonly used representation of motifs (patterns) in biological sequences • PSSM enables us to represent multiple sequence alignments as mathematical entities which we can work with. • PSSMs enables the scoring of multiple alignments with sequences, or other PSSMs. PSSM – position specific scoring matrix Assuming a string S of length n S = s1s2s3...sn If we want to score this string against our PSSM of length n (with n lines): n alignment _ score = m ∑ s j , j j=1 where m is the PSSM matrix and sj are the string elements. PSSM can also be incorporated to both dynamic programming algorithms and heuristic algorithms (like Psi-Blast). Sequence space PSI-BLAST • For a query sequence use Blast to find matching sequences. • Construct a multiple sequence alignment from the hits to find the common regions (consensus). • Use the “consensus” to search again the database, and get a new set of matching sequences • Repeat the process ! Sequence space Position-Specific-Iterated-BLAST • Intuition – substitution matrices should be specific to sites and not global. – Example: penalize alanine→glycine more in a helix •Idea – Use BLAST with high stringency to get a set of closely related sequences.
    [Show full text]
  • Computational Biology Lecture 8: Substitution Matrices Saad Mneimneh
    Computational Biology Lecture 8: Substitution matrices Saad Mneimneh As we have introduced last time, simple scoring schemes like +1 for a match, -1 for a mismatch and -2 for a gap are not justifiable biologically, especially for amino acid sequences (proteins). Instead, more elaborated scoring functions are used. These scores are usually obtained as a result of analyzing chemical properties and statistical data for amino acids and DNA sequences. For example, it is known that same size amino acids are more likely to be substituted by one another. Similarly, amino acids with same affinity to water are likely to serve the same purpose in some cases. On the other hand, some mutations are not acceptable (may lead to demise of the organism). PAM and BLOSUM matrices are amongst results of such analysis. We will see the techniques through which PAM and BLOSUM matrices are obtained. Substritution matrices Chemical properties of amino acids govern how the amino acids substitue one another. In principle, a substritution matrix s, where sij is used to score aligning character i with character j, should reflect the probability of two characters substituing one another. The question is how to build such a probability matrix that closely maps reality? Different strategies result in different matrices but the central idea is the same. If we go back to the concept of a high scoring segment pair, theory tells us that the alignment (ungapped) given by such a segment is governed by a limiting distribution such that ¸sij qij = pipje where: ² s is the subsitution matrix used ² qij is the probability of observing character i aligned with character j ² pi is the probability of occurrence of character i Therefore, 1 qij sij = ln ¸ pipj This formula for sij suggests a way to constrcut the matrix s.
    [Show full text]
  • Development of Novel Classical and Quantum Information Theory Based Methods for the Detection of Compensatory Mutations in Msas
    Development of novel Classical and Quantum Information Theory Based Methods for the Detection of Compensatory Mutations in MSAs Dissertation zur Erlangung des mathematisch-naturwissenschaftlichen Doktorgrades ”Doctor rerum naturalium” der Georg-August-Universität Göttingen im Promotionsprogramm PCS der Georg-August University School of Science (GAUSS) vorgelegt von Mehmet Gültas aus Kirikkale-Türkei Göttingen, 2013 Betreuungsausschuss Professor Dr. Stephan Waack, Institut für Informatik, Georg-August-Universität Göttingen. Professor Dr. Carsten Damm, Institut für Informatik, Georg-August-Universität Göttingen. Professor Dr. Edgar Wingender, Institut für Bioinformatik, Universitätsmedizin, Georg-August-Universität Göttingen. Mitglieder der Prüfungskommission Referent: Prof. Dr. Stephan Waack, Institut für Informatik, Georg-August-Universität Göttingen. Korreferent: Prof. Dr. Carsten Damm, Institut für Informatik, Georg-August-Universität Göttingen. Korreferent: Prof. Dr. Mario Stanke, Institut für Mathematik und Informatik, Ernst Moritz Arndt Universität Greifswald Weitere Mitglieder der Prüfungskommission Prof. Dr. Edgar Wingender, Institut für Bioinformatik, Universitätsmedizin, Georg-August-Universität Göttingen. Prof. Dr. Burkhard Morgenstern, Institut für Mikrobiologie und Genetik, Abteilung für Bioinformatik, Georg-August- Universität Göttingen. Prof. Dr. Dieter Hogrefe, Institut für Informatik, Georg-August-Universität Göttingen. Prof. Dr. Wolfgang May, Institut für Informatik, Georg-August-Universität Göttingen. Tag der mündlichen
    [Show full text]
  • Using Local Alignments for Relation Recognition
    Journal of Artificial Intelligence Research 38 (2010) 1-48 Submitted 11/09; published 05/10 Using Local Alignments for Relation Recognition Sophia Katrenko [email protected] Pieter Adriaans [email protected] Maarten van Someren [email protected] Informatics Institute, University of Amsterdam Science Park 107, 1098XG Amsterdam, the Netherlands Abstract This paper discusses the problem of marrying structural similarity with semantic relat- edness for Information Extraction from text. Aiming at accurate recognition of relations, we introduce local alignment kernels and explore various possibilities of using them for this task. We give a definition of a local alignment (LA) kernel based on the Smith-Waterman score as a sequence similarity measure and proceed with a range of possibilities for com- puting similarity between elements of sequences. We show how distributional similarity measures obtained from unlabeled data can be incorporated into the learning task as se- mantic knowledge. Our experiments suggest that the LA kernel yields promising results on various biomedical corpora outperforming two baselines by a large margin. Additional series of experiments have been conducted on the data sets of seven general relation types, where the performance of the LA kernel is comparable to the current state-of-the-art results. 1. Introduction Despite the fact that much work has been done on automatic relation extraction (or recog- nition) in the past few decades, it remains a popular research topic. The main reason for the keen interest in relation recognition lies in its utility. Once concepts and semantic relations are identified, they can be used for a variety of applications such as question answering (QA), ontology construction, hypothesis generation and others.
    [Show full text]
  • 1 Introduction
    Biological sequence analysis by vector-valued functions: revisiting alignment-free methodologies for DNA and protein classification Susana Vinga Instituto de Engenharia de Sistemas e Computadores: Investigação e Desenvolvimento (INESC-ID) R. Alves Redol 9, 1000-029 Lisboa, Portugal Tel. (+351) 213 100 300 Fax: (+351) 213 145 843 E-mail: [email protected] Departamento de Bioestatística e Informática, Faculdade de Ciências Médicas – Universidade Nova de Lisboa (FCM/UNL) Campo dos Mártires da Pátria 130, 1169-056 Lisboa, Portugal Tel. (+351) 218 803 052 Fax: (+351) 218 851 920 Abstract Biological sequence analysis is at the core of bioinformatics, bringing together several fields, from computer science to probability and statistics. Its purpose is to computationally process and decode the information stored in biological macromolecules involved in all cell mechanisms of living organisms – such as DNA, RNA and proteins – and provide prediction tools to reveal their structure, function and complex relationship networks. Within this context several methods have arisen that analyze sequences based on alignment algorithms, ubiquitously used in most bioinformatics applications. Alternatively, although less explored in the literature, the use of vector maps for the analysis of biological sequences, both DNA and proteins, represents a very elegant proposal to extract information from those types of sequences using an alignment-free approach. This work presents an overview of alignment-free methods used for sequence analysis and comparison and the new trends of these techniques, applied to DNA and proteins. The recent endeavors found in the literature along with new proposals and widening of applications fully justifies a revisit to these methodologies, partially reviewed before (Vinga and Almeida, 2003).
    [Show full text]
  • 3D Representations of Amino Acids—Applications to Protein Sequence Comparison and Classification
    Computational and Structural Biotechnology Journal 11 (2014) 47–58 Contents lists available at ScienceDirect journal homepage: www.elsevier.com/locate/csbj 3D representations of amino acids—applications to protein sequence comparison and classification Jie Li a, Patrice Koehl b,⁎ a Genome Center, University of California, Davis, 451 Health Sciences Drive, Davis, CA 95616, United States b Department of Computer Science and Genome Center, University of California, Davis, One Shields Ave, Davis, CA 95616, United States article info abstract Available online 6 September 2014 The amino acid sequence of a protein is the key to understanding its structure and ultimately its function in the cell. This paper addresses the fundamental issue of encoding amino acids in ways that the representation of such Keywords: a protein sequence facilitates the decoding of its information content. We show that a feature-based representa- Protein sequences tion in a three-dimensional (3D) space derived from amino acid substitution matrices provides an adequate Substitution matrices representation that can be used for direct comparison of protein sequences based on geometry. We measure Protein sequence classification the performance of such a representation in the context of the protein structural fold prediction problem. Fold recognition We compare the results of classifying different sets of proteins belonging to distinct structural folds against classifications of the same proteins obtained from sequence alone or directly from structural information. We find that sequence alone performs poorly as a structure classifier.Weshowincontrastthattheuseofthe three dimensional representation of the sequences significantly improves the classification accuracy. We conclude with a discussion of the current limitations of such a representation and with a description of potential improvements.
    [Show full text]
  • Efficient Alignment-Free Sequence Similarity Measurement Based on Kendall Statistics
    Clinical and Translational Science Institute Centers 5-15-2018 K2 and K2*: efficient alignment-free sequence similarity measurement based on Kendall statistics Jie Lin Fujian Normal University Donald A. Adjeroh West Virginia University Bing-Hua Jiang University of Iowa Yue Jiang Fujian Normal University Follow this and additional works at: https://researchrepository.wvu.edu/ctsi Part of the Medicine and Health Sciences Commons Digital Commons Citation Lin, Jie; Adjeroh, Donald A.; Jiang, Bing-Hua; and Jiang, Yue, "K2 and K2*: efficient alignment-free sequence similarity measurement based on Kendall statistics" (2018). Clinical and Translational Science Institute. 957. https://researchrepository.wvu.edu/ctsi/957 This Article is brought to you for free and open access by the Centers at The Research Repository @ WVU. It has been accepted for inclusion in Clinical and Translational Science Institute by an authorized administrator of The Research Repository @ WVU. For more information, please contact [email protected]. Bioinformatics, 34(10), 2018, 1682–1689 doi: 10.1093/bioinformatics/btx809 Advance Access Publication Date: 15 December 2017 Original Paper Sequence analysis * K2 and K2 : efficient alignment-free sequence similarity measurement based on Kendall statistics Jie Lin1, Donald A. Adjeroh2, Bing-Hua Jiang3 and Yue Jiang1,* 1Department of Software engineering, College of Mathematics and Informatics, Fujian Normal University, Fuzhou 350108, China, 2Department of Computer Science & Electrical Engineering, West Virginia University, Morgantown, WV 26506, USA and 3Department of Pathology, Carver College of Medicine, The University of Iowa, Iowa City, IA 52242, USA *To whom correspondence should be addressed. Associate Editor: John Hancock Received on September 8, 2017; revised on December 11, 2017; editorial decision on December 12, 2017; accepted on December 14, 2017 Abstract Motivation: Alignment-free sequence comparison methods can compute the pairwise similarity between a huge number of sequences much faster than sequence-alignment based methods.
    [Show full text]
  • Assume an F84 Substitution Model with Nucleotide Frequ
    Exercise Sheet 5 Computational Phylogenetics Prof. D. Metzler Exercise 1: Assume an F84 substitution model with nucleotide frequencies (πA; πC ; πG; πT ) = (0:2; 0:3; 0:3; 0:2), a rate λ = 0:1 of “crosses” and a rate µ = 0:2 of “bullets” (see lecture). (a) Assume that the nucleotide at some site is A at time t. Calculate the probabilities that the nucleotide is A, C or G at time t + 0:2. (b) Assume the nucleotide distribution in a genomic region is (0:1; 0:2; 0:3; 0:4) at time t, but from this time on the genomic region evolves according to the model above. Calculate the expectation values for the nucleotide distributions at time points t + 0:2 and t + 2. Exercise 2: Calculate rate matrix for the nucleotide substution process for which the substitution matrix for time t is 0 1−e−t=10 21+9e−t=10−30e−t=5 1−e−t=10 1 PA!A(t) 10 70 5 B 2−2e−t=10 3−3et=10 3+7e−t=10−10e−t=5 C B PC!C (t) C S(t) = B 5 10 15 C ; B 14+6e−t=10−20e−t=5 1−e−t=10 1−e−t=10 C B PG!G(t) C @ 35 10 5 A 2−2e−t=10 3+7e−t=10−10e−t=5 3−3e−t=10 5 30 10 PT !T (t) where the diagonal entries PA!A(t), PC!C (t), PG!G(t) and PT !T (t) are the values that fulfill that each row sum is 1 in the matrix S(t).
    [Show full text]
  • Similarity in Design: a Framework for Evaluation
    MASTER’S THESIS Similarity in Design: A Framework for Evaluation Master’s Educational Program __Space and Engineering Systems__ Student: Alsalehi, Suhail Hasan Hael Signature, Name Research Advisor: Crawley, Edward F. Signature, Name, Title Moscow 2019 Copyright 2019 Author. All rights reserved. The author hereby grants to Skoltech permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole and in part in any medium now known or hereafter created. МАГИСТЕРСКАЯ ДИССЕРТАЦИЯ Сходство в дизайне: Структура для оценки Магистерская образовательная программа__ Космические и Инженерные Системы_ Студент: Аль Салехи Сухаиль Хасан Хаил подпись, ФИО Научный руководитель: Кроули, Эдвард Ф. подпись, ФИО, должность Москва, 2019 Авторское право 2019. Все права защищены. Автор настоящим дает Сколковскому институту науки и технологий разрешение на воспроизводство и свободное распространение бумажных и электронных копий настоящей диссертации в целом или частично на любом ныне существующем или созданном в будущем носителе. "Aim for the moon. If you miss, you may hit a star" W. Clement Stone i ABSTRACT Engineered systems are rapidly increasing in complexity. Complexity, if not managed properly, can lead to inefficient design, higher risk of failures, and/or unexpected costs and schedule delays. This is specifically true during early design. The early phase of design, conceptual design, is characterized by high levels of uncertainty and fuzziness. In this phase, designers rely mostly on expertise to evaluate solution variants. Although researchers have developed many tools for solution variants evaluation during conceptual design, these tools have some drawbacks. A major drawback is that existing tools do not account for the system architecture during evaluation.
    [Show full text]