POINT ACCEPTED MUTATION (PAM) AND BLOCK

SUBSTITUTION MATRIX (BLOSUM) INTERACTIVE

PROGRAMS ON KELLEYBIOINFO.ORG

______

A Project

Presented to the

Faculty of

San Diego State University

______

In Partial Fulfillment

of the Requirements for the Degree

Master of Science in Bioinformatics and Medical Informatics

with a Concentration in

Professional Science

______

by

Ari Widjaja

Spring 2017

Copyright © 2017 by Ari Widjaja All Rights Reserved

DEDICATION

I dedicate this project to my father, Wahono Widjaja, and mother, Lanny Harrijanto, who have raised me to be the person I am today. Secondly, I dedicate this project to my lovely wife, Dewayani A. Windy, and our daughter, WidyaSuryadewi, for their unconditional love, support, and encouragement for me to finish this project and to become the best that I can be. I thank Lord Jesus Christ for all my achievements and the journey to be what I want to be. I am grateful for the bad and good days, the tears and laughter, and for everything that I went through.

"Strength doesn't come from what you can do. It comes from overcoming the things you once thought you couldn't" - Rikki Rogers

ABSTRACT OF THE PROJECT

Point Accepted Mutation (PAM) and Block Substitution Matrix (BLOSUM) Interactive Programs on Kelleybioinfo.org
by
Ari Widjaja
Master of Science in Bioinformatics and Medical Informatics with a Concentration in Professional Science
San Diego State University, 2017

Kelleybioinfo.org is interactive bioinformatics software built by Dr. Scott Kelley's team in 2013 to teach basic theories of bioinformatics in the BIOMI 563 class at San Diego State University (SDSU). Equipped with five modules with two algorithms in each module, it offers 10 interactive bioinformatics learning platforms. The Kelleybioinfo software is also accessible through web browsers, handheld devices, or tablets. The National Science Foundation (NSF) has recently awarded a grant for further development of this software, and therefore we introduce and develop the scoring matrix module. The scoring matrix module presents the basic concepts of the point accepted mutation (PAM) and block substitution matrix (BLOSUM). Scoring matrices such as PAM, BLOSUM, or the Gonnet matrix have been widely used to score matches, mismatches, substitutions, and deletions in sequence alignments. This module simulates global (for PAM) or local (for BLOSUM) alignments and determines a protein's mutability based on the log-odd matrix. There has not been a single interactive tutorial on how the PAM or BLOSUM log-odd matrices are generated. Therefore, having this module in Kelleybioinfo.org will help students who learn bioinformatics to understand the concepts behind the PAM and BLOSUM matrices.

TABLE OF CONTENTS

ABSTRACT
LIST OF FIGURES
ACKNOWLEDGEMENTS
CHAPTER
1 INTRODUCTION
2 MATERIALS AND METHODS
    Observed Frequency Calculation
    Expected Frequency Calculation
    Log-Odd Matrix
    Methods
3 RESULTS
    Module Main Page
    Concept Mode
    Interactive Mode
4 DISCUSSION
REFERENCES

LIST OF FIGURES

Figure 1. BLOSUM_General.py script.
Figure 2. BLOSUM_Interactive.html. The JSON data generated by the PAM and BLOSUM Python scripts are retrieved with the jQuery method $.getJSON, which fetches JSON data from the server.
Figure 3. Kelley Bioinformatics main page.
Figure 4. Probability module's icon.
Figure 5. BLOSUM concept mode page. In concept mode, users can follow the steps to calculate sequence probability and generate subsequent matrices.
Figure 6. PAM interactive (quiz) mode with blank tables.
Figure 7. BLOSUM interactive (quiz) mode with blank tables.

ACKNOWLEDGEMENTS

I would like to express special gratitude to my graduate advisor, Dr. Scott T. Kelley, for giving me the opportunity to work on this project, and I would like to thank my project panel members, Dr. Robert Edwards and Dr. Barbara Bailey, for their support and advice. Additionally, I would like to thank my colleague and friend Dennis Didulo for his guidance and technical consultation throughout this project.

CHAPTER 1

INTRODUCTION

In bioinformatics, one of the most fundamental tasks is to decipher the hidden messages in our DNA and to estimate how one species is related to another. Sequence alignment is one of the approaches used to answer questions such as how an ape is related to a human. Sequence alignment is the process of arranging sequences of DNA, RNA, or protein in order to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences [1]. While alignment algorithms such as Smith-Waterman or Needleman-Wunsch can be used to match sequences and to infer a hypothetical ancestor sequence from two aligned sequences, a scoring matrix is constructed to quantify the alignment’s similarity. The score given to every possible substitution and identity is based on the frequencies of occurrence in alignments of related proteins, which also reflect the frequency with which a particular substitution occurs in nature [2]. Conventionally, high and low scores in sequence alignments indicate the probability that two amino acids were aligned by chance. A higher score indicates that two amino acids are evolutionarily related and that the probability of their being aligned by chance is low; in contrast, a lower score indicates a high probability that the two amino acids were aligned by chance and are evolutionarily unrelated [2].

While there are numerous scoring matrices nowadays, such as Gonnet [3], Risler [4], PET [5], and Overington [6], two fundamental scoring matrices have been developed: Point Accepted Mutation (PAM) and Block Substitution Matrix (BLOSUM). The PAM matrix was introduced and developed by Margaret Dayhoff in the work titled “A Model of Evolutionary Change in Proteins” [7] as a scoring matrix of the observed amino acid substitutions of 34 closely related protein superfamilies grouped into 71 evolutionary trees [7].
PAM matrices are based on a Markov model of protein evolution [8], with substitutions observed throughout an ungapped, global alignment. The PAM1 matrix, which is used as the basis for the other PAM matrices, estimates what rate of substitution would be expected if 1% of the amino acids had changed. BLOSUM, on the other hand, was developed by Steven Henikoff and Jorja G. Henikoff in their paper “Amino Acid Substitution Matrices from Protein Blocks” [9], published in 1992. BLOSUM scoring matrices, in contrast to PAMs, were derived from 2000 blocks of aligned sequences from more than 500 groups of related proteins [9], and they are based on an implicit model of evolution or local alignments of closely related proteins [2]. Additionally, BLOSUM matrices focus on highly conserved regions of proteins. In the BLOSUM naming scheme, larger numbers, such as BLOSUM 80, denote higher sequence similarity and, consequently, smaller evolutionary distance. Both BLOSUM and PAM matrices are logarithm-of-odds matrices: each score is the logarithm of the ratio of the occurrence of an amino acid combination in the observed data (observed frequency) to the expected value of occurrence of the pair (expected frequency) [9].

Scoring matrices are widely presented and taught in bioinformatics courses; however, there is currently no interactive tutorial that guides students through the construction of scoring matrices. Kelleybioinfo.org (also known as "Kelley Bioinformatics") is a suitable venue to introduce and teach scoring matrices interactively. Kelleybioinfo.org is an open-access web platform developed by Dr. Scott T. Kelley [10] that teaches basic concepts and algorithms in bioinformatics. Currently, it hosts 6 modules and has been used intensively as part of the curriculum in the BIOMI 568 class at San Diego State University. The goal of this project is, therefore, to implement a new, interactive module within Kelleybioinfo.org, named the probability module, that aims to teach first-year bioinformatics students the fundamental theory of PAM and BLOSUM matrix construction. Within this module, students will receive a step-by-step tutorial (in concept mode), graded practices, and links to real-world applications on each subject.

CHAPTER 2

MATERIALS AND METHODS

OBSERVED FREQUENCY CALCULATION

A logarithm of the odds, or log-odds, is the logarithm of the ratio of the observed frequency to the expected frequency. The initial calculation involves computing the observed probability of one amino acid changing into another, which is the numerator of the general score matrix (S) log-odd equation below.

$$S_{i,j} = \log\left(\frac{\text{Observed Frequency}}{\text{Expected Frequency}}\right) \tag{1}$$

General score matrix log-odd equation.

For PAM, the observed frequency calculation starts by aligning two sequences with 85% similarity and calculating the relative mutability from each alignment. Relative mutability (mi) is the probability that an amino acid will change over a small evolutionary period of time [7]. It is the ratio of the total number of changes for a specific amino acid to its frequency of occurrence. Then, the relative mutability of each amino acid is used in tandem with the matrix of accepted mutations, a matrix that accounts for the total number of amino acid substitutions from a phylogenetic tree, to calculate the mutation probability of each amino acid (Mij). The equations for both relative mutability (mi) and mutation probability (Mij) are as follows:

$$\text{a) } m_i = \frac{\text{number of changes of } i}{\text{number of occurrences of } i} \qquad \text{b) } M_{ij} = \frac{\lambda m_i A_{ij}}{\sum_k A_{kj}}; \quad M_{ii} = 1 - m_i \tag{2}$$

a) Relative mutability equation. b) Mutation probability equation for non-homologous and homologous pairs. Mij refers to the mutation probability for a non-homologous pair between amino acids i and j, whereas Mii refers to a homologous pair.
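As an illustration of Equation 2, the following minimal Python sketch computes relative mutability and mutation probability from a small matrix of accepted mutations. The function names, the three-residue alphabet, the counts, and the default value of lambda are assumptions made for this example only; they are not taken from PAM_General.py.

```python
def relative_mutability(changes, occurrences):
    """Eq. 2a: m_i = (number of changes of i) / (number of occurrences of i)."""
    return {aa: changes.get(aa, 0) / occurrences[aa] for aa in occurrences}


def mutation_probabilities(A, m, lam=1.0):
    """Eq. 2b: M_ij = lam * m_i * A_ij / sum_k A_kj for i != j, M_ii = 1 - m_i."""
    residues = sorted(A)
    # Column totals of the accepted-mutation matrix (the denominator in Eq. 2b).
    column_sum = {j: sum(A[k][j] for k in residues) for j in residues}
    M = {i: {} for i in residues}
    for i in residues:
        for j in residues:
            if i == j:
                M[i][j] = 1.0 - m[i]
            else:
                M[i][j] = lam * m[i] * A[i][j] / column_sum[j]
    return M


# Toy accepted-mutation counts for three residues (illustrative values only).
A = {
    "A": {"A": 0, "G": 2, "S": 4},
    "G": {"A": 2, "G": 0, "S": 1},
    "S": {"A": 4, "G": 1, "S": 0},
}
changes = {aa: sum(A[aa].values()) for aa in A}  # A: 6, G: 3, S: 5
occurrences = {"A": 60, "G": 30, "S": 50}        # hypothetical residue counts

m = relative_mutability(changes, occurrences)
M = mutation_probabilities(A, m)
```

With these toy numbers every residue has a relative mutability of 0.1, so each diagonal entry Mii equals 0.9.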

In this module, relative mutability and lambda are neglected for the purpose of clarity and simplicity. Consequently, PAM's mutation probability (Mij) is calculated as the ratio of the number of either homologous (Aii) or non-homologous (Aij) pairs to the sum of the pairs. Below is the revised equation for mutation probability (Mij).

$$M_{ij \text{ or } ii} = \frac{A_{ij}}{\sum_k A_{kj}} \tag{3}$$

Revised mutation probability equation for non-homologous and homologous pairs. Mij refers to the mutation probability for a non-homologous pair between amino acids i and j, whereas Mii refers to a homologous pair.

The observed frequency calculation in BLOSUM involves aligning gapless sequences (k) as one block and counting the pair frequency (cij) for each pair of amino acids i and j across each column in the block. Then the observed frequency (Qii for a homologous or Qij for a non-homologous pair) is the frequency of each amino acid pair (tuple) divided by the sum of all pair frequencies. The equations for pair frequency and observed frequency are given as follows:

$$\text{a) } c_{ij} = \sum_k c_{ij}^{(k)} \qquad \text{b) } Q_{ij} = \frac{c_{ij}}{T} = \frac{\sum_k c_{ij}^{(k)}}{\sum_{i \ge j} c_{ij}} \tag{4}$$

a) Sum of pair frequencies equation. b) Observed probability equation. When counting homologous tuples in each column, one can use the combination formula nC2.
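The pair counting in Equation 4 can be sketched in a few lines of Python. This is an illustrative sketch rather than the code in BLOSUM_General.py: the function names and the four-sequence toy block are hypothetical, and it assumes that the superscript (k) indexes the columns of the block and that pairs are unordered, so (A, G) and (G, A) are counted together.

```python
from collections import Counter
from itertools import combinations
from math import comb


def pair_counts(block):
    """Eq. 4a: count unordered residue pairs down every column of a gapless block."""
    counts = Counter()
    for column in zip(*block):                # one tuple of residues per column
        for a, b in combinations(column, 2):  # every pair of sequences in the column
            counts[tuple(sorted((a, b)))] += 1
    return counts


def observed_frequencies(counts):
    """Eq. 4b: Q_ij = c_ij / T, where T is the total number of pairs."""
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


block = ["AGS", "AGA", "ASA", "GGA"]  # toy block: 4 sequences, 3 columns
c = pair_counts(block)
Q = observed_frequencies(c)

# The first column (A, A, A, G) contains three A residues, so the nC2 shortcut
# gives comb(3, 2) homologous A-A pairs for that column.
aa_pairs_in_first_column = comb(3, 2)
```

For this block the total number of pairs is 3 columns times 4C2 = 6 pairs per column, that is T = 18, which is a convenient check when filling in the table by hand.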

EXPECTED FREQUENCY CALCULATION

In PAM, the expected frequency is the frequency of occurrence (fj), which is the ratio of the number of occurrences of amino acid j in the block to the total number of amino acids. In BLOSUM, by contrast, the expected frequency is calculated sequentially, starting with the expected probability (pi) of each amino acid residue; from the expected probability we can then calculate the expected frequency for non-homologous tuples (eij) and the expected frequency for homologous tuples (eii). Below are the equations for the BLOSUM expected probability and expected frequency.

$$\text{a) } p_i = Q_{ii} + \sum_{j \ne i} \frac{Q_{ij}}{2} \qquad \text{b) } e_{ii} = p_i^2 \ \text{ or } \ e_{ij} = 2(p_i \times p_j) \tag{5}$$

a) BLOSUM expected probability equation. b) BLOSUM expected frequency equation.
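Equation 5 can be checked numerically with a short Python sketch. The function names and the observed frequencies below are hypothetical values chosen for illustration (they correspond to a toy block of four three-residue sequences), and the sketch assumes that Q is keyed by unordered residue pairs.

```python
def expected_probabilities(Q):
    """Eq. 5a: p_i = Q_ii + sum over j != i of Q_ij / 2."""
    p = {}
    for (a, b), q in Q.items():
        if a == b:
            p[a] = p.get(a, 0.0) + q
        else:
            # A mixed pair contributes half of its frequency to each residue.
            p[a] = p.get(a, 0.0) + q / 2
            p[b] = p.get(b, 0.0) + q / 2
    return p


def expected_frequencies(p):
    """Eq. 5b: e_ii = p_i ** 2 and e_ij = 2 * p_i * p_j for i != j."""
    residues = sorted(p)
    e = {}
    for index, a in enumerate(residues):
        for b in residues[index:]:
            e[(a, b)] = p[a] ** 2 if a == b else 2 * p[a] * p[b]
    return e


# Hypothetical observed frequencies for a three-residue toy alphabet.
Q = {
    ("A", "A"): 6 / 18,
    ("A", "G"): 3 / 18,
    ("A", "S"): 3 / 18,
    ("G", "G"): 3 / 18,
    ("G", "S"): 3 / 18,
}
p = expected_probabilities(Q)
e = expected_frequencies(p)
```

Both the expected probabilities and the expected frequencies should sum to 1, which gives students a quick sanity check on their intermediate tables.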

LOG-ODD MATRIX

The final step for both PAM and BLOSUM is to take the logarithm of the observed probability over the expected probability. In this step, PAM and BLOSUM matrices differ in the logarithm base and scaling factor they use: PAM matrices generally use 10 times the base-10 logarithm, whereas BLOSUM matrices use 2 times the base-2 logarithm. The following are the score matrix log-odd formulas for both algorithms.

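$$\text{a) } S_{i,j} = 10 \times \log\left(\frac{P_{ij,n} f_i}{f_i f_j}\right) = 10 \times \log\left(\frac{M_{ij}^{n}}{f_j}\right) \qquad \text{b) } S_{i,j} = 2 \times \log_2\left(\frac{Q_{ij}}{e_{ij}}\right) \tag{6}$$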

a) PAM log-odd score matrix equation. b) BLOSUM log-odd score matrix equation. Pij denotes the probability of seeing a pair (i, j) due to chance [11]. Additionally, with the revised PAM mutation probability (Mij), the probability of seeing an amino acid (fi or fj) is assumed to be 1/20, or 0.05.
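The two log-odd formulas in Equation 6, together with the simplified mutation probability of Equation 3, can be sketched as follows. This is a hedged illustration, not the module's code: the function names and the two-residue count matrix are hypothetical, "log" in Equation 6a is assumed to mean the base-10 logarithm, and a zero observed frequency is mapped to negative infinity as one possible convention.

```python
import math


def pam_mutation_probability(A, i, j):
    """Eq. 3: M_ij = A_ij / sum_k A_kj, with A a dict of dicts of pair counts."""
    column_total = sum(A[k][j] for k in A)
    return A[i][j] / column_total


def pam_score(m_ij, f_j=0.05):
    """Eq. 6a: S_ij = 10 * log10(M_ij / f_j); f_j defaults to 1/20 as in the text."""
    if m_ij == 0:
        return float("-inf")  # assumed convention for pairs that were never observed
    return 10 * math.log10(m_ij / f_j)


def blosum_score(q_ij, e_ij):
    """Eq. 6b: S_ij = 2 * log2(Q_ij / e_ij)."""
    if q_ij == 0:
        return float("-inf")  # assumed convention for pairs that were never observed
    return 2 * math.log2(q_ij / e_ij)


# Hypothetical pair counts for a two-residue toy alphabet.
A = {"A": {"A": 8, "G": 2}, "G": {"A": 2, "G": 6}}
m_aa = pam_mutation_probability(A, "A", "A")  # 8 / (8 + 2)
s_pam = pam_score(m_aa)
s_blosum = blosum_score(0.5, 0.25)            # observed twice as often as expected
```

A positive score means the pair is observed more often than expected by chance, and a negative score means it is observed less often. Published matrices usually round these values to integers, a step that is left out of this sketch.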

METHODS

The programming and markup languages used to develop the scoring matrix module in Kelleybioinfo.org include Python and JavaScript for the backend development, and HTML for the module's web page. Each interactive module within the Kelleybioinfo.org website is structured around an HTML iframe, which is used to dynamically display each module's web page; it is simply a web page within a web page. Each module is then stored on a separate server maintained by Dr. Scott Kelley and his group. The first step in the development of this new module involved writing two main Python scripts that generate simulated data, which are serialized as JavaScript Object Notation (JSON) and then parsed by the PAM and BLOSUM module HTML pages. These two Python scripts are BLOSUM_General.py and PAM_General.py, and both scripts contain functions for the following two processes:

1. Generate multiple random sequences of the 20 amino acid shorthand symbols with a pre-specified 85% similarity.
2. While processing the random sequences, generate data frames including relative mutability (mi), mutation probability (Mij), observed frequency (Qii and Qij), expected probability (pi), expected frequency (eij and eii), log-odd score matrices (Si,j), and any other data frames needed for the calculation.
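The first of the two processes above could be approached along the following lines. This is only a sketch under stated assumptions: the function names, the sequence length of 20, the fixed random seed, and the strategy of mutating a common ancestor at a fixed number of positions are all choices made for this example, and the actual scripts may generate their sequences differently.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 one-letter amino acid symbols


def random_sequence(length, rng):
    """Draw a random protein sequence of the given length."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def mutate_to_similarity(seq, similarity, rng):
    """Copy seq and substitute residues so that the copy has the requested identity."""
    n_changes = round(len(seq) * (1 - similarity))
    residues = list(seq)
    for pos in rng.sample(range(len(seq)), n_changes):
        # Pick a replacement that differs from the original residue.
        residues[pos] = rng.choice([aa for aa in AMINO_ACIDS if aa != seq[pos]])
    return "".join(residues)


def percent_identity(a, b):
    """Fraction of aligned positions that carry the same residue."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


rng = random.Random(2017)  # fixed seed so the example is reproducible
ancestor = random_sequence(20, rng)
descendants = [mutate_to_similarity(ancestor, 0.85, rng) for _ in range(3)]
```

With a length of 20 and 85% similarity, each descendant differs from the ancestor at exactly 3 positions, which keeps the resulting tables small enough for a beginner-level exercise.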

Once these data frames are generated, they are compiled and serialized into JSON using Python's built-in Common Gateway Interface (CGI) and JSON packages. The second step involved developing the web page structure for the PAM and BLOSUM algorithms. In this step we process the parsed JSON and create dynamic tables, buttons, links, and interactive pages. Throughout the development of the web framework and designs, all testing was conducted using a local Windows Internet Information Services (IIS) server. Examples of both the Python and JavaScript scripts can be found in Figures 1 and 2 below.

Figure 1. BLOSUM_General.py script

Figure 2. BLOSUM_Interactive.html. The JSON data generated by the PAM and BLOSUM Python scripts are retrieved with the jQuery method $.getJSON, which fetches JSON data from the server.
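Because the scripts in Figures 1 and 2 are not reproduced here, the server side of this exchange can only be sketched. The example below shows one way a CGI-style Python script might serialize a sequence block and its matrices as JSON for a client call such as $.getJSON. The function names and the payload layout are hypothetical, and only the standard json module is used.

```python
import json


def to_json_payload(block, matrices):
    """Bundle a sequence block and its named matrices into a JSON string."""
    return json.dumps({"block": block, "matrices": matrices})


def cgi_response(json_text):
    """Prefix the JSON body with the header a CGI script writes before its output."""
    return "Content-Type: application/json\n\n" + json_text


payload = to_json_payload(
    ["AGS", "AGA", "ASA", "GGA"],
    {"pair_counts": {"AA": 6, "AG": 3, "AS": 3, "GG": 3, "GS": 3}},
)
response = cgi_response(payload)
# In a real CGI script the response would be written to standard output,
# for example with print(response).
```

JSON object keys must be strings, so residue pairs are joined into two-letter keys such as "AG" before serialization.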

CHAPTER 3

RESULTS

MODULE MAIN PAGE

The probability module can be selected on the Kelley Bioinformatics (Kelleybioinfo.org) [10] main web page by clicking the icon shown in Figure 4 below. Figure 3 shows the main page for Kelley Bioinformatics.

Figure 3. Kelley Bioinformatics main page.

Figure 4. Probability module's icon.

CONCEPT MODE

Each module within the Kelley Bioinformatics web page is equipped with a concept mode and an interactive (graded quiz) mode. The concept mode is designed to help students understand the subject matter by providing a stepwise calculation procedure. In the case of the probability module, students are given a tutorial on clustering an ungapped block of sequences or analyzing a phylogenetic tree and, using basic probability calculations, they generate the subsequent matrices. Upon clicking the module's icon, users are directed to the concept mode, as shown in Figure 5 below.

Figure 5. BLOSUM concept mode page. In concept mode, users can follow the steps to calculate sequence probability and generate subsequent matrices.

INTERACTIVE MODE

Interactive mode in the Kelley Bioinformatics web page consists of graded interactive questionnaires and calculations tailored to each module's subject. Within the interactive (quiz) mode of the PAM or BLOSUM section, the program generates a random sequence block (in BLOSUM) or a random phylogenetic tree (in PAM) for students to analyze. The program then precompiles the matrices and randomly blanks cells in each matrix or table that require user input. Finally, the program analyzes the user input, tracks it, and reports back the number of correct and incorrect entries. Figures 6 and 7 below show the PAM and BLOSUM interactive (quiz) modes, respectively.

Figure 6. PAM interactive (quiz) mode with blank tables.

Figure 7. BLOSUM interactive (quiz) mode with blank tables.
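The quiz logic described above (blank some cells at random, then count correct and incorrect entries) can be sketched in Python as follows. The function names, the answer tolerance, and the toy matrix are assumptions for illustration; the module itself performs this step in the browser, and its grading rules may differ.

```python
import random


def blank_cells(matrix, n_blanks, rng):
    """Return (quiz, answers): a copy of matrix with n_blanks cells set to None."""
    cells = [(r, c) for r in range(len(matrix)) for c in range(len(matrix[0]))]
    quiz = [row[:] for row in matrix]  # copy so the precompiled matrix stays intact
    answers = {}
    for r, c in rng.sample(cells, n_blanks):
        answers[(r, c)] = matrix[r][c]
        quiz[r][c] = None
    return quiz, answers


def grade(answers, responses, tol=0.01):
    """Count (correct, incorrect) entries; unanswered cells count as incorrect."""
    correct = sum(
        1
        for cell, expected in answers.items()
        if cell in responses and abs(responses[cell] - expected) <= tol
    )
    return correct, len(answers) - correct


matrix = [[0.25, 0.33], [0.33, 0.11]]  # hypothetical precompiled table
quiz, answers = blank_cells(matrix, 2, random.Random(0))
perfect_score = grade(answers, dict(answers))  # every blank filled in correctly
empty_score = grade(answers, {})               # nothing filled in
```

A small tolerance is used when comparing entries so that a student's rounded value is still accepted.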

CHAPTER 4

DISCUSSION

Prior to the creation of this new probability module, there was no interactive guide to creating PAM and BLOSUM scoring matrices, and most of the available tutorials were in the form of static PowerPoint presentations or lectures. Our new probability module in Kelleybioinfo.org is designed to help students understand the basics of constructing PAM and BLOSUM scoring matrices, and we achieve that by providing intuitive, stepwise tutorials on creating the intermediate matrices and counts. Like the existing modules within Kelleybioinfo.org, each program in this module includes written tutorials and a manual to help students navigate through the program. The mechanism to track students’ scores and number of trials is built into the system; however, the protocol to evaluate students’ performance based on their scores has not been implemented, and the database to track these activities will be developed in the future. Additionally, with the newly implemented exam mode, this module allows users with higher privilege settings (i.e., course lecturers) to print out quizzes along with the answer sheet. The PAM and BLOSUM frameworks are adjustable for longer sequence lengths; however, because the tutorial is aimed at beginner-level students, the sequence lengths displayed and calculated in both programs were adjusted accordingly. Overall, this new probability module is a solid introductory addition to Kelleybioinfo.org that delivers the concept of statistical probability in scoring matrix development clearly and concisely.

REFERENCES

[1] D. W. Mount, Bioinformatics: Sequence and Genome Analysis, 2nd ed. Cold Spring Harbor, NY, USA: Cold Spring Harbor Laboratory Press, 2004.
[2] S. Cates, “Scoring Matrices,” [Online]. Available: http://cnx.org/contents/oHCewToL@8/Scoring-Matrices, Accessed on: Mar. 28, 2017.
[3] G. H. Gonnet, M. A. Cohen, and S. A. Benner, “Exhaustive matching of the entire protein sequence database,” Science, vol. 256, pp. 1443-1445, June 1992.
[4] J. L. Risler, H. Delome, and A. Henaut, “Amino acid substitutions in structurally related proteins a pattern recognition approach: Determination of a new and efficient scoring matrix,” J. Mol. Biol., vol. 4, pp. 1019-1029, July 1988.
[5] D. T. Jones, W. R. Taylor, and J. M. Thornton, “A new approach to protein fold recognition,” Nature, vol. 358, pp. 86-89, July 1992.
[7] M. O. Dayhoff, R. M. Schwartz, and B. C. Orcutt, “A model of evolutionary change in proteins,” in Atlas of Protein Sequence and Structure, vol. 5, Silver Spring, MD, USA: National Biomedical Research Foundation, 1978, pp. 345-352.
[8] D. W. Mount, “Using PAM Matrices in sequence alignments,” Cold Spring Harbor Protoc., vol. 3, pp. 1-8, June 2008.
[9] S. Henikoff and J. G. Henikoff, “Amino acid substitution matrices from protein blocks,” Proc. Natl. Acad. Sci. USA, vol. 89, pp. 10915-10919, Nov. 1992.
[10] Kelleybioinfo.org, “Kelley Bioinformatics,” 2015. [Online]. Available: http://www.kelleybioinfo.org/index.php
[11] M. Borodovsky and S. Ekisheva, Problems and Solutions in Biological Sequence Analysis, Cambridge, UK: Cambridge Univ. Press, 2006. [Online]. Available: http://www.cambridge.org