A-Matrix, 262 Acceptance Probability, 53 Accepted Point Mutations, See

Total Page:16

File Type:pdf, Size:1020Kb

A-Matrix, 262 Acceptance Probability, 53 Accepted Point Mutations, See Index A-matrix, 262 biased nucleotide composition, 356 Acceptance probability, 53 empirical patterns, 355 Accepted point mutations, see PAM GC content, 356 Affine gap penalties, 378 spectrum, 356 AIC, see Akaike information criterion Base frequencies, 107, 188 Akaike information criterion, 20, 204, Bayes factor, 152, 153, 204–206 293, 465 Bayes’ theorem, 197 Alignment Bayesian approach, 46, 439 multiple, see Multiple alignment Bayesian dating, 244 pairwise, see Pairwise alignment Bayesian estimation, 184 Alignment algorithms, 375 applications, 186 BLAST, 376 Bayesian hypothesis testing, 451 Clustal, 376 predictive distributions, 441 hidden Markov models, 376 Bayesian inference, 35, 46, 47, 184, 186, Smith-Waterman, 376 188, 242, 467, 469, 470, 488, 489 Alignment methods, 376 assessing uncertainty in phylogenet- Alphabet, 328 ics, 467 Amino acid rate matrices, 11 divergence times, 242, 244 AU test, 468, 480 empirical approach, 118 Autocorrelated rate-variation model, phylogenetic inference, 50 332 hybrid samplers, 50 Autocorrelation parameter λ, 331 posterior distribution, 46 posterior probability, 35 Background frequencies, 327 prior distribution, 35, 46 Background selection, 370 unnormalized posterior distribution, BAMBE program, 50, 449 47 Base composition, 355 Bayesian information criterion, 337, 469 Base composition evolution, 366 Bayesian methods, 35, 46, 184, 197 a two-state model, 366 BEAST program, 399 selection coefficients, 366 Biased base composition, 362 selection parameter, 368 Biases of probability values, 477 Base composition variation, 355 BIC, see Bayesian information criterion biased mutation DNA repair, 357 Binomial distribution, 27 496 Index Birth-death process, 4, 5, 378, 380 Codon frequency, 360 Bivariate distribution, 150 Codon models, 12, 17, 20, 120, 144, 281, Block substitution matrix, see 338 Substitution matrix, BLOSUM local, 147 Bootstrap methods, 19, 142, 143, 149, 
reversible, 106 171, 199, 242, 249, 442, 472–476, Codon usage, 90, 105, 107, 108, 360, 478, 485, 487, 488 361, 364, 366–370 approximately unbiased tests, 474 Coevolution, 278–280 ML estimate, 32 Coevolutionary Markov model, 279, 280 multiscale, 474 Computer underflow, 196 nonparametric, 199, 200, 242, 249, Confidence intervals, 19, 31, 132–134, 451, 472, 478, 485–487 154, 171 parametric, 19, 142, 149, 199, 294, CONSEL software, 468, 475 442, 451–453, 457, 478 Context-dependent substitution, 333, sampling error, 472 335 speed improvements, 474 Continuous-time evolutionary model, Bootstrap probability, 468, 473 377 Bootstrap replicate, 242, 473–476, 488 Continuous-time Markov chain, 187, BP, see Bootstrap probability 296 Branch lengths, 235 Correlated character evolution, 458 Breakpoint graph, 308, 313, 314, 317 Correlated rate change, 251 Brownian motion, 3, 4, 315 Covarion models, 17, 275 Burn-in, 55 Cox test, 204, 479, 484, 485, 487 CpG islands, 357 Calibration point, 239, 240, 249 cpREV model, 211, 265 Calibration times, 216, 218, 219 Character history, 195, 442, 447 Dayhoff model, 148, 262, 263 sampling, 447 Degrees of freedom, 271 Character mapping, 440 Dependent sampling, 45 Bayesian approaches, 440 Detailed balance condition, 48, 378 maximum likelihood, 440 Dimension matching, 53 parsimony, 440 Dirichlet prior distribution, 214 Chromosomal fission, 307, 315 DIST-PC model, 273, 274 Chromosomal fusion, 307, 315 Divergence times, 233 Chromosomal inversions, 307 Bayesian inference, 242 Bayesian approach, 311 branch lengths, 235 breakpoint graph, 308 estimation, 215–217, 233–235, 240, comparative map, 312, 318 248, 252 fortress of hurdles, 309 local clock, 239 hurdles, 308 molecular clock Markov chain Monte Carlo method, overdispersed, 239 311 multigene analyses, 250 nonuniformity, 320 penalized likelihood, 240 signed permutation, 307 rate change, 251, 252 unsigned permutation, 309 uncertainties, 248 Chromosomal segment, 307 uncertainty Chronological rate, 
233–236, 252 fossil, 249 Clades, 200, 215, 483 topological, 250 Codon bias, 360 DNA motif bias, 358 Index 497 DNA repair amino acid fitnesses, 271 biased, 357 Metropolis-Hastings function, 272 very short patch, 359 Fitness model, 271, 272, 274, 279 vsp, see very short patch amino acid, 271 DNA substitution matrices, see coevolutionary, 279 Substitution matrix, DNA Fluorescent in situ hybridization, see FISH Effective divergence time, 426, 427, 430, Forward algorithm, 328, 341 431 Forward-backward algorithm, 328, 331, Effective number of codons, 360 341, 387, 395 Effective population size, 368 EM, see EM algorithm Gamma distribution, 16, 212, 266, 267, EM algorithm, 342, 414–416, 418, 419 273 continuous time, 418 Gamma shape parameters, 217 discrete time, 416 GC content, 356–358, 368 E step, 417 General time-reversible model, 203, 263, M step, 417 363 tree EM, 420 Generalized Dirichlet prior, 247 Empirical Bayes, 118, 154 Generation length, 235 Empirical Bayesian mapping, 274 Genetic drift, 64, 67, 69–71, 79, 80, 89, ENC, see Effective number of codons 90 Equilibrium frequencies, 272 Genetic markers, 289 Equilibrium length distribution, 378 Genome rearrangement, 307 Erd¨os-Renyi graph, 314 breakpoint distance, 310, 312 edge occupancy probability, 314 breakpoint graph, 308, 313, 314 Evolutionary constraints, 260, 270 chromosomal fission, 307, 315 pattern, 267 chromosomal fusion, 307, 315 Evolutionary distance, 363, 364, 410 chromosome segment Evolutionary divergence estimation, 363 syntenic, 320 Evolutionary rate, 15, 236 chromosome shuffling, 310 modelling rate variations, 15 coagulation-fragmentation process, Expectation maximization, see EM 313 algorithm conserved segments, 316 Expected amount of evolution, 146, cycle structure, 313 236, 237 distance, 312 Expected information, 31, 32 inversion tract lengths, 321 Exponential random variables, 189 inversions, 307, see Chromosomal inversions F81, see Felsenstein model maximum parsimony, 310 FASTA format, 127 
n-inversion chain, 309 Felsenstein model, 159, 160, 203, 204 chromosome markers, 309 Felsenstein pruning algorithm, 328, 447 Nadeau and Taylor method, 318 likelihood calculation, 281 number of inversions, 307, 310–312, FISH, 319 317 Fisher information matrix, 132, 133 parsimony, 312, 320 FIT-GEN model, 274 parsimony distance, 312 FIT-PC model, 271, 273, 274, 279 parsimony methods, 307 Fitness functions, 271 permutation cycles, 308 amino acid, 279 random transpositions, 312 498 Index reciprocal translocations, 307, 315 HKY model, 11, 328, 336, 337, 363 θ-inversion model, 321 HKY85 model, 52–54, 127, 187, 188, Genomic distance, 315 190, 192, 196, 199, 203–205, 228 Genomic signature, 359 Holding times, 296 Gibbs sampler, 49, 50 Homotachy, 275 random-scan, 49, 50 HP algorithm, 307 systematic-scan, 49, 50 Hypermutability, 358 Graph, 338 HyPhy, 125 directed, 338 Alignment data, 159 edges, 338 data filter, 127, 128 nodes, 338 data set, 127 undirected, 338 defining a likelihood function, 161 vertices, 338 HKY85, 127, 129 Graphical models, 325, 338 hypothesis testing, 141 belief-propagation algorithm, 340–342 instantaneous rate matrix, 127 elimination algorithm, 340, 341 likelihood function, 128, 130 junction-tree algorithm, 342 local branch parameters, 135 Markov chain Monte Carlo, 344 maximizing the likelihood, 162 moralization, 342 MLE, 130, 132 parents, 338 model description, 160 probabilistic inference, 339 multiple partitions, 139 GTR model, see General time-reversible object inspector, 136 model phylogenetic tree input, 161 substitution models, 127 Hardy-Weinberg equilibrium, 36 tree, 128 maximum likelihood estimator, 36 tree viewer, 131 HBL, see HyPhy Batch Language HyPhy batch files, 159, 162 Heterogeneity models over time, 275 HyPhy batch language, 157, 158, 162 Hidden Markov model, 268, 325, 326, analyzing codon data, 178 385, 386, 389 model definition, 162 across sites, 268 molecular clocks, 168 emission-equivalent, 389 simulation tools, 170 hidden classes, 268 
site-to-site rate heterogeneity, 175 hidden path, 326 Hypothesis testing, 19, 33, 113, 139, HMMER, 391 141, 148 matrix of state-transition probabili- acceptance region, 33 ties, 327 alternative hypothesis, 33 multiple alignment, 391 null hypothesis, 33 path, 326 rejection region, 33 path-equivalent, 389 significance level, 34 phylogenetic models, 325 type I error, 34 posterior probability, 326 type II error, 34 recombination events, 325 SAM, 391 Indel rate per fragment, 382 secondary structure prediction, 325 Indels, 377 silent states, 392 Independent sites–structurally con- Hidden site classes, 273 strained protein evolution, see Higher-order Markov models, 333 IS-SCPE method Hill-Robertson effect, 370 Individual, 26 Index 499 Instantaneous transition matrix, 263 MAP, see Maximum a posteriori Instantaneous transition rate matrix, Markov chain, 3, 187, 325, 408 262 continuous-time, 5, 296 IS-SCPE method, 270 EM algorithm, see EM algorithm Ising model, 343 equilibrium distribution, 409 Isochore, 357 ergodic, 6 higher-order, 333 JC69 model, see Jukes and Cantor homogeneous, 409 model inhomogeneity, 427 JTT model, 12, 149, 264, 265 posterior probability, 17 JTT+Γ model, 270 rate matrix, 409 Jukes and Cantor model, 10, 36, 201, resolvent, 422 363, 445 reversible, 409 maximum likelihood estimator, 37 stationary, 409 stationary distribution, 6 KH test, see Kishino-Hasegawa test substitution matrix, 408 Kimura two-parameter model, 363 time reversibility, 7 Kimura’s formula, 366 time-reversible, 48 Kishino-Hasegawa test, 482, 484, 485, transition probabilities, 409 487 calculations, 7 KL divergence, 465, 466 transition rates, 5 Kolmogorov’s forward equations, 379 Markov chain Monte Carlo, 45,
Recommended publications
  • Lecture 5: Sequence Alignment – Global Alignment
    COSC 348: Computing for Bioinformatics. Lecture 5: Sequence Alignment – Global Alignment. Lubica Benuskova, Ph.D. http://www.cs.otago.ac.nz/cosc348/
    Sequence Alignment
    • Sequence alignment is a way of arranging two or more sequences of characters to identify regions of similarity
      – because similarities may be a consequence of functional or evolutionary relationships between these sequences.
    • Another definition: Procedure for comparing two or more sequences by searching for a series of individual characters that are in the same order in those sequences
      – Pair-wise alignment: compare two sequences
      – Multiple sequence alignment: compare > 2 sequences
    Similarity versus identity
    • In the process of evolution, from one generation to the next, and from one species to the next, the amino acid sequences of an organism's proteins are gradually altered through the action of DNA mutations. For example, the sequence ALEIRYLRD could mutate into the sequence ALEINYLRD in one generation and possibly into AQEINYQRD over a longer period of evolutionary time.
      – Note: a hydrophobic amino acid is more likely to stay hydrophobic than not, since replacing it with a hydrophilic residue could affect the folding and/or activity of the protein.
    Sequence alignment: example
    • Task: align abcdef with the somewhat similar abdgf
    • Write the second sequence below the first one:
      abcdef
      abdgf
    • Move the sequences to give the maximum match between them.
    • Show characters that match using a vertical bar:
      abcdef
      ||
      abdgf
    • In order to maximise the alignment, we insert a gap between b and d in the lower sequence to allow d and f to align:
      abcdef
      ab-dgf
    Quantitative global alignments
    • We are looking for an alignment which
      – maximizes the number of base-to-base matches;
      – if necessary to achieve this goal, inserts gaps in either sequence (a gap means a base-to-nothing match);
      – the order of bases in each sequence must remain preserved; and
      – gap-to-gap matches are not allowed.
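The criteria in the excerpt above (maximize matches, allow gaps in either sequence, preserve order, no gap-to-gap columns) can be illustrated with a small dynamic program. The sketch below uses the standard Needleman-Wunsch recurrence, which the excerpt itself does not name; the function name `global_align` and the default scores (match 1, mismatch 0, gap 0, so that the score simply counts matches) are assumptions made for illustration only.

```python
def global_align(a, b, match=1, mismatch=0, gap=0):
    """Global alignment of strings a and b (Needleman-Wunsch sketch).

    The default scores are illustrative assumptions: with match=1 and
    mismatch=gap=0 the score equals the number of matching columns.
    Returns (score, aligned_a, aligned_b), using '-' for gaps.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # Leading gaps are charged in a global alignment.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # pair a[i-1] with b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a

    # Trace back from the bottom-right cell to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append('-'); i -= 1
        else:
            top.append('-'); bottom.append(b[j - 1]); j -= 1
    return score[n][m], ''.join(reversed(top)), ''.join(reversed(bottom))


if __name__ == "__main__":
    s, top, bottom = global_align("abcdef", "abdgf")
    print(s)
    print(top)
    print(bottom)
```

Because every column produced by the traceback pairs a character with a character or a character with a gap, gap-to-gap columns cannot occur, and removing the gaps from each row gives back the original sequences in their original order. Several alignments may tie for the best score; the traceback returns one of them.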
  • Novel Bioinformatics Applications for Protein Allergology
    "Through science, as through art, one arrives at the same place: the truth." Gregorio Marañón (Madrid, 19-05-1887 - Madrid, 27-03-1960)
    List of Papers
    This thesis is based on the following papers, which are referred to in the text by their Roman numerals.
    I Martínez Barrio, Á., Soeria-Atmadja, D., Nister, A., Gustafsson, M.G., Hammerling, U., Bongcam-Rudloff, E. (2007) EVALLER: a web server for in silico assessment of potential protein allergenicity. Nucleic Acids Research, 35(Web Server issue):W694-700.
    II Martínez Barrio, Á.∗, Lagercrantz, E.∗, Sperber, G.O., Blomberg, J., Bongcam-Rudloff, E. (2009) Annotation and visualization of endogenous retroviral sequences using the Distributed Annotation System (DAS) and eBioX. BMC Bioinformatics, 10(Suppl 6):S18.
    III Martínez Barrio, Á., Xu, F., Lagercrantz, E., Bongcam-Rudloff, E. (2009) GeneFinder: In silico positional cloning of trait genes. Manuscript.
    IV Martínez Barrio, Á., Ekerljung, M., Jern, P., Benachenhou, F., Sperber,
  • Information-Theoretic Bounds of Evolutionary Processes Modeled As a Protein Communication System
    INFORMATION-THEORETIC BOUNDS OF EVOLUTIONARY PROCESSES MODELED AS A PROTEIN COMMUNICATION SYSTEM. Liuling Gong, Nidhal Bouaynaya∗ and Dan Schonfeld. University of Illinois at Chicago, Dept. of Electrical and Computer Engineering.
    ABSTRACT: In this paper, we investigate the information theoretic bounds of the channel of evolution introduced in [1]. The channel of evolution is modeled as the iteration of protein communication channels over time, where the transmitted messages are protein sequences and the encoded message is the DNA. We compute the capacity and the rate-distortion functions of the protein communication system for the three domains of life: Archaea, Prokaryotes and Eukaryotes. We analyze the trade-off between the transmission rate and the distortion in noisy protein communication channels. As expected, comparison of the optimal transmission rate with the channel capacity indicates that the biological fidelity does not reach the Shannon optimal distortion.
    ... can be investigated in the context of engineering communication codes. In particular, it is legitimate to ask at what rate can the genomic information be transmitted. And what is the average distortion between the transmitted message and the received message at this rate? Shannon's channel capacity theorem states that, by properly encoding the source, a communication system can transmit information at a rate that is as close to the channel capacity as one desires with an arbitrarily small transmission error. Conversely, it is not possible to reliably transmit at a rate greater than the channel capacity. The theorem, however, is not constructive and does not provide any help in designing such codes. In the case of biological communication systems, however, evolution has
  • Testing the Independence Hypothesis of Accepted Mutations for Pairs Of
    University of Nebraska - Lincoln, DigitalCommons@University of Nebraska - Lincoln. Computer Science and Engineering: Theses, Dissertations, and Student Research. Computer Science and Engineering, Department of. 12-2016.
    Ramanan, Jyotsna, "TESTING THE INDEPENDENCE HYPOTHESIS OF ACCEPTED MUTATIONS FOR PAIRS OF ADJACENT AMINO ACIDS IN PROTEIN SEQUENCES" (2016). Computer Science and Engineering: Theses, Dissertations, and Student Research. 118. http://digitalcommons.unl.edu/computerscidiss/118
    A thesis presented to the Faculty of The Graduate College at the University of Nebraska in partial fulfilment of requirements for the degree of Master of Science. Major: Computer Science. Under the supervision of Peter Z. Revesz. Lincoln, Nebraska, December 2016.
    Evolutionary studies usually assume that the genetic mutations are independent of each other. However, that does not imply that the observed mutations are independent of each other because it is possible that when a nucleotide is mutated, then it may be biologically beneficial if an adjacent nucleotide mutates too.
  • A Thesis Entitled Homology-Based Structural Prediction of the Binding
    A Thesis entitled Homology-based Structural Prediction of the Binding Interface Between the Tick-Borne Encephalitis Virus Restriction Factor TRIM79 and the Flavivirus Non-structural 5 Protein, by Heather Piehl Brown. Submitted to the Graduate Faculty as partial fulfillment of the requirements for the Master of Science Degree in Biomedical Science. R. Travis Taylor, PhD, Committee Chair; Xiche Hu, PhD, Committee Member; Robert M. Blumenthal, PhD, Committee Member; Amanda Bryant-Friedrich, PhD, Dean, College of Graduate Studies. The University of Toledo, December 2016.
    Copyright 2016, Heather Piehl Brown. This document is copyrighted material. Under copyright law, no parts of this document may be reproduced without the expressed permission of the author.
    Abstract: The innate immune system of the host is vital for determining the outcome of virulent virus infections. Successful immune responses depend on detecting the specific virus, through interactions of the proteins or genomic material of the virus and host factors. We previously identified a host antiviral protein of the tripartite motif (TRIM) family, TRIM79, which plays a critical role in the antiviral response to flaviviruses. The Flavivirus genus includes many arboviruses that are significant human pathogens, such as tick-borne encephalitis virus (TBEV) and West Nile virus (WNV).
  • Bioinformatics Scoring Matrices
    Bioinformatics: Scoring Matrices. David Gilbert, Bioinformatics Research Centre (www.brc.dcs.gla.ac.uk), Department of Computing Science, University of Glasgow. (c) David Gilbert 2008
    Learning Objectives
      – To explain the requirement for a scoring system reflecting possible biological relationships
      – To describe the development of PAM scoring matrices
      – To describe the development of BLOSUM scoring matrices
    Scoring Matrices
    • Database search to identify homologous sequences based on similarity scores
    • Ignore position of symbols when scoring
    • Similarity scores are additive over positions on each sequence to enable DP
    • Scores for each possible pairing, e.g. proteins composed of 20 amino acids, 20 x 20 scoring matrix
    • Scoring matrix should reflect
      – Degree of biological relationship between the amino acids or nucleotides
      – The probability that two AAs occur in homologous positions in sequences that share a common ancestor, or that one sequence is the ancestor of the other
    • Scoring schemes based on physico-chemical properties also proposed
    • Use of Identity
      – Unequal AAs score zero, equal AAs score 1. Overall score can then be normalised by length of sequences to provide percentage identity
    • Use of Genetic Code
      – How many mutations required in NAs to transform one AA to another, e.g. Phe (codes UUU & UUC) to Asn (AAU, AAC)
    • Use of AA Classification
      – Similarity based on properties such
  • Oxidising Bacteria (SAOB)
    Computational and Comparative Investigations of Syntrophic Acetate-oxidising Bacteria (SAOB): Genome-guided analysis of metabolic capacities and energy conserving systems. Shahid Manzoor. Faculty of Veterinary Medicine and Animal Science, Department of Animal Breeding and Genetics, Uppsala. Doctoral Thesis, Swedish University of Agricultural Sciences, Uppsala 2014. Acta Universitatis agriculturae Sueciae 2014:56. Cover: Bioinformatics helping the constructed biogas reactors to run efficiently (photo: Shahid Manzoor). ISSN 1652-6880. ISBN (print version) 978-91-576-8060-0. ISBN (electronic version) 978-91-576-8061-7. © 2014 Shahid Manzoor, Uppsala. Print: SLU Service/Repro, Uppsala 2014.
    Abstract: Today's main energy sources are the fossil fuels petroleum, coal and natural gas, which are depleting rapidly and are major contributors to global warming. Methane is produced during anaerobic biodegradation of wastes and residues and can serve as an alternative energy source with reduced greenhouse gas emissions. In the anaerobic biodegradation process acetate is a major precursor and degradation can occur through two different pathways: aceticlastic methanogenesis and syntrophic acetate oxidation combined with hydrogenotrophic methanogenesis. Bioinformatics is critical for modern biological research, because different bioinformatics approaches, such as genome sequencing, de novo assembly
  • Lecture 10: Local Alignment and Substitution Matrices 10.1
    CPS260/BGT204.1 Algorithms in Computational Biology, September 30, 2003. Lecture 10: Local Alignment and Substitution Matrices. Lecturer: Pankaj K. Agarwal. Scribe: Madhuwanti Vaidya.
    So far we have seen global alignment, where entire sequences are matched. There are two other variations of global alignment.
    10.1 Semiglobal alignment
    In semiglobal alignment we do not pay a penalty for end gaps. These are gaps that appear before the first letter of the sequence or after the last letter of the sequence. They are also called leading and trailing gaps, respectively. If one of the sequences is significantly shorter than the other, then semiglobal alignment is preferable.
    Example: Consider two sequences, C A G C A C T T G G A T T C T C G G and C A G C G T G G. They can be aligned in many ways:
    Alignment 1:
      C A G C A − C T T G G A T T C T C G G
      − − − C A G C G T G G − − − − − − −
    Alignment 2:
      C A G C A C T T G G A T T C T C G G
      C A G C − − − − − G − T − − − − G G
    Figure 10.1: Semi-global alignment
    We want to choose Alignment 1 over Alignment 2, as Alignment 2 fragments the second sequence, which is not what we are looking for. Giving Alignment 1 a better score than Alignment 2 is done by not paying a penalty for the trailing and leading gaps.
    Trailing Gaps: Suppose we have two sequences, one of which is the shorter of the two.
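As a hedged illustration of the idea in the excerpt above (end gaps cost nothing), the sketch below computes an alignment score with and without free end gaps. It uses the usual textbook formulation, which is an assumption here rather than the lecture's own algorithm: the first row and column of the table start at zero and the result is the best value in the last row or last column. The function name `alignment_score` and the scores (match 1, mismatch -1, gap -1) are hypothetical.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1, free_end_gaps=True):
    """Score of the best alignment of a and b.

    free_end_gaps=True  : semiglobal (leading and trailing gaps are free)
    free_end_gaps=False : global (every gap is charged)
    Score values are illustrative assumptions, not taken from the lecture.
    """
    n, m = len(a), len(b)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    if not free_end_gaps:
        # Global alignment pays for leading gaps.
        for i in range(1, n + 1):
            s[i][0] = i * gap
        for j in range(1, m + 1):
            s[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            s[i][j] = max(s[i - 1][j - 1] + sub,
                          s[i - 1][j] + gap,
                          s[i][j - 1] + gap)
    if free_end_gaps:
        # Trailing gaps are free: the alignment may end anywhere on the
        # last row or the last column.
        return max(max(s[n]), max(row[m] for row in s))
    return s[n][m]


if __name__ == "__main__":
    long_seq, short_seq = "AAACGTAAA", "CGT"
    print(alignment_score(long_seq, short_seq, free_end_gaps=True))
    print(alignment_score(long_seq, short_seq, free_end_gaps=False))
```

With these assumed scores, a short sequence that occurs verbatim inside a longer one is not penalized for the unmatched flanks under the semiglobal rule, whereas the global rule charges one gap per unmatched flanking letter.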
  • 2-PAM Matrices
    Bioinformatics II: PAM matrices. Dr Manaf A Guma, University of Anbar, College of Applied Science-Heet, Department of Chemistry. 3/28/20
    Before we start: what is the difference between a point mutation and a frameshift mutation?
    • A point mutation is an alteration of a single nucleotide in a gene, whereas a frameshift mutation involves one or more nucleotide changes of a particular gene.
    • Point mutations are mainly nucleotide substitutions, which lead to silent, missense or nonsense mutations. Frameshift mutations occur by insertion or deletion of nucleotides.
    Definitions
    • Nonsense mutations: the alteration of a nucleotide in a particular codon may introduce a stop codon into the gene. This stops translation partway through the protein.
    • Silent mutations: a single base pair has changed in a particular codon, but the altered codon still codes for the same amino acid.
    • Missense mutations: a nucleotide substitution alters a particular codon so that it codes for a different amino acid.
    Point accepted mutation
    PAM matrices: Background and concepts
    • How does PAM work?
      1. Only mutations are allowed.
      2. Sites evolve independently.
      3. Evolution at each site occurs according to a Markov equation.
    • It follows a Markov process. How?
    What is the Markov concept?
    • Markov process: the substitution is independent of its past history.
    • Meaning: the next mutation depends only on the current state and is independent of previous mutations.
    • It is derived from global alignment.
    What are PAM matrices?
    • Point accepted mutation matrix, known as a PAM.
  • Pairwise Alignment
    Chap. 2 Pairwise alignment
    • The most basic sequence analysis question: are two sequences related?
    • Key issues:
      1. What alignment should be considered?
      2. What score system to rank alignments?
      3. What algorithm to find optimal (or good) scoring alignments?
      4. What statistical method to evaluate the significance?
    2.2 The scoring model
    • Evolutionary forces that can shape molecular (protein, DNA) sequences: mutation (substitution, insertion/deletion or indel), selection (positive, negative, neutral).
    • If the total log-likelihood score (measuring relatedness) of an alignment is a sum of terms for each aligned pair of residues (plus terms for each gap), intuitively, we expect identities and conservative substitutions to be more likely in real alignments than we expect by chance (positive score); and vice versa.
    Substitution matrices (for un-gapped global alignment)
    • Odds ratio of the "match model" M and the unrelated or random model R:
      $\frac{p(x, y \mid M)}{p(x, y \mid R)} = \frac{\prod_i p_{x_i y_i}}{\prod_i q_{x_i} \prod_j q_{y_j}} = \prod_i \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}}$
    • Log-odds ratio score:
      $S(x, y) = \sum_i s(x_i, y_i)$, where $s(a, b) = \log \frac{p_{ab}}{q_a q_b}$
    Chemical Properties of Amino Acids
    Match = +3 and mismatch = -1 may be good enough for DNA, but not for proteins: e.g. leucine is much more likely to be replaced by an isoleucine than by a glutamate. [Figure: Taylor W.R. (1986)]
    Gap penalties
    • Linear penalty score for a gap of length g: $\gamma(g) = -gd$
    • Or affine score: $\gamma(g) = -d - (g - 1)e$, where d is the gap-open penalty and e is the gap extension penalty.
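The log-odds score and the two gap penalties in the excerpt above are simple enough to evaluate directly. The following is a minimal sketch; the function names and every numeric value in the usage lines are hypothetical and chosen only to show the arithmetic, not taken from any published matrix.

```python
import math


def log_odds(p_ab, q_a, q_b):
    """s(a, b) = log(p_ab / (q_a * q_b)).

    p_ab is the probability of seeing residues a and b aligned in related
    sequences; q_a and q_b are their background frequencies.
    """
    return math.log(p_ab / (q_a * q_b))


def linear_gap(g, d):
    """Linear penalty gamma(g) = -g * d for a gap of length g."""
    return -g * d


def affine_gap(g, d, e):
    """Affine penalty gamma(g) = -d - (g - 1) * e.

    d is the gap-open penalty and e the gap-extension penalty.
    """
    return -d - (g - 1) * e


if __name__ == "__main__":
    # Hypothetical numbers for illustration only.
    print(log_odds(0.02, 0.1, 0.1))   # pair seen more often than chance: positive
    print(log_odds(0.005, 0.1, 0.1))  # pair seen less often than chance: negative
    print(linear_gap(3, 8))
    print(affine_gap(3, 12, 2))
```

A pair that occurs exactly as often as chance predicts scores zero, and the affine penalty reduces to the open penalty alone for a gap of length one, which is why long gaps are cheaper under the affine scheme whenever e is smaller than d.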
  • PHAT: a Transmembrane-Specific Substitution Matrix
    BIOINFORMATICS, Vol. 16 no. 9 2000, Pages 760–766. PHAT: a transmembrane-specific substitution matrix. Pauline C. Ng (1), Jorja G. Henikoff (2) and Steven Henikoff (2,∗). (1) Department of Bioengineering, University of Washington, Seattle, WA 98195, USA and (2) Fred Hutchinson Cancer Research Center, 1100 Fairview Ave. N, Seattle, WA 98109-1024, USA. Received on February 23, 2000; revised on April 14, 2000; accepted on April 18, 2000.
    Abstract
    Motivation: Database searching algorithms for proteins use scoring matrices based on average protein properties, and thus are dominated by globular proteins. However, since transmembrane regions of a protein are in a distinctly different environment than globular proteins, one would expect generalized substitution matrices to be inappropriate for transmembrane regions.
    Results: We present the PHAT (predicted hydrophobic and transmembrane) matrix, which significantly outperforms generalized matrices and a previously published transmembrane matrix in searches with transmembrane queries. We conclude that a better matrix can be constructed by using background frequencies characteristic
    ... $s_{ij} = \lambda^{-1} \ln(q_{ij} / (p_i p_j))$, where $\lambda$ is a scaling factor, the $q_{ij}$ are target or observed frequencies of amino acid pairs taken from alignments and the $p_i$ are the background frequencies (Altschul, 1991). The widespread use of database searching and other protein alignment tools in modern biology underscores the importance of using substitution matrices that most accurately resemble biological reality. The point accepted mutation (PAM) and blocks substitution matrices (BLOSUM) are the two most popular matrix series (Dayhoff, 1978; Henikoff and Henikoff, 1992). The PAM matrix is computed by counting mutations between closely related sequences and an inferred common ancestral sequence to obtain PAM 1 target frequencies.
  • Substitution Matrices E S V U
    Introduction to bioinformatics 2007. Lecture 8: Substitution Matrices. Centre for Integrative Bioinformatics VU. (Substitution matrices – Sequence analysis 2006)
    Sequence Analysis
    Finding relationships between genes and gene products of different species, including those at large evolutionary distances.
    Archaea
    Domain Archaea is mostly composed of cells that live in extreme environments. While they are able to live elsewhere, they are usually not found there because outside of extreme environments they are competitively excluded by other organisms. Species of the domain Archaea are
    • not inhibited by antibiotics,
    • lack peptidoglycan in their cell wall (unlike bacteria, which have this sugar/polypeptide compound),
    • and can have branched carbon chains in their membrane lipids of the phospholipid bilayer.
    Archaea (cont.)
    • It is believed that Archaea are very similar to prokaryotes (e.g.