A Comparative Study of Sequence Analysis Tools in Computational Biology
Total Page:16
File Type:pdf, Size:1020Kb
Copyright Warning & Restrictions The copyright law of the United States (Title 17, United States Code) governs the making of photocopies or other reproductions of copyrighted material. Under certain conditions specified in the law, libraries and archives are authorized to furnish a photocopy or other reproduction. One of these specified conditions is that the photocopy or reproduction is not to be “used for any purpose other than private study, scholarship, or research.” If a, user makes a request for, or later uses, a photocopy or reproduction for purposes in excess of “fair use” that user may be liable for copyright infringement, This institution reserves the right to refuse to accept a copying order if, in its judgment, fulfillment of the order would involve violation of copyright law. Please Note: The author retains the copyright while the New Jersey Institute of Technology reserves the right to distribute this thesis or dissertation Printing note: If you do not wish to print this page, then select “Pages from: first page # to: last page #” on the print dialog screen The Van Houten library has removed some of the personal information and all signatures from the approval page and biographical sketches of theses and dissertations in order to protect the identity of NJIT graduates and faculty. ABSTRACT A COMPARATIVE STUDY OF SEQUENCE ANALYSIS TOOLS IN COMPUTATIONAL BIOLOGY by Wei-Jen Chuang A biomolecular object, such as a deoxyribonucleic acid (DNA), a ribonucleic acid (RNA) or a protein molecule, is made up of a long chain of subunits. A protein is represented as a sequence made from 20 different amino acids, each represented as a letter. There are a vast number of ways in which similar structural domains can be generated in proteins by different amino acid sequences. By contrast, the structure of DNA, made up of only four different nucleotide building blocks that occur in two pairs, is relatively simple, regular, and predictable. Biomolecular sequence alignment/string search is the most important issue and challenging task in many areas of science and information processing. It involves identifying one-to-one correspondences between subunits of different sequences. An efficient algorithm or tool is involved with many important factors, these include the following: Scoring systems, Alignment statistics, Database redundancy and sequence repetitiveness. Sequence "motifs" are derived from multiple alignments and can be used to examine individual sequences or an entire database for subtle patterns. With motifs, it is sometimes possible to detect distant relationships that may not be demonstrable based on comparisons of primary sequences alone. A more comprehensive solution to the efficient string search is approached by building a small, representative set of motifs and using this as a screening database with automatic masking of matching query subsequences. This technology is still under development but recent studies indicate that a representative set of only 1,000 — 3,000 sequences may suffice and such a database can be searched in seconds. A COMPARATIVE STUDY OF SEQUENCE ANALYSIS TOOLS IN COMPUTATIONAL BIOLOGY by Wei-Jen Chuang A Thesis Submitted to the Faculty of New Jersey Institute of Technology in Partial Fulfillment of the Requirements for the Degree of Master of Science and Computer Science Department of Computer and Information Science January 1999 APPROVAL PAGE A COMPARATIVE STUDY OF SEQUENCE ANALYSIS TOOLS IN COMPUTATIONAL BIOLOGY Wei-Jen Chuang Dr. Jason T.L. Wang, Thesis Advisor Date Associate Professor of Computer and Information Science, NJIT Dr. James M. Calvin, Committee Member Date Assistant Professor of Computer and Information Science, NJIT Dr. Franz J. Kurfess, Committee Member Date Assistant Professor of Computer and Information Science, NJIT BIOGRAPHICAL SKETCH Author: Wei-Jen Chuang Degree: Master of Science Date: January 1999 Undergraduate and Graduate Education: • Master of Science in Computer Information and Science, New Jersey Institute of Technology, Newark, NJ 1999 • Master of Science in Phytopathology, National Chung-Hsing University, Taichung, Taiwan, R.O.C., 1990 • Bachelor of Science in Plant Pathology, National Chung-Hsing University, Taichung, Taiwan, R.O.C., 1988 Major: Computer Information and Science Publication: W.-J. Chuang, "Etiological Studies On Stem Rot and Basal Rot of Taiwan Anoetcochilus and Their Control," NCHU, Taichung, Taiwan, R.O.C., July, 1990. iv This work is dedicated to my mother, who is no longer with us. ACKNOWLEDGMENT First, I would like to express my sincere appreciation to my advisor, Professor Jason T. L. Wang, for his patience and constant guidance during the course of this research. Also, I would like to thank Professor James M. Calvin and Professor Franz J. Kurfess for their activity participating in my committee. Most of all, I want to express my appreciation to my family, for their supporting and devotion, and to my wife, Mei-Yi, for her love and understanding. Without their support and encouragement, my accomplishment would not be possible. vi TABLE OF CONTENTS Chapter Page 1.INTRODUCTION 1 2. MODIFYING THE SDISCOVER PROGRAM 7 3. FINDING MOTIFS AND DATABASE EVALUATION 3.1. Searching for Motifs 11 3.2. Converting the Output Into FASTA Format and Forming a Local Database 13 3.3. Evaluating the Database 15 3.4. Results 17 4. COMPARING MOTIFS RETRIEVED FROM PROSITE WITH MOTIFS FOUND BY SDISCOVER 26 5. CONCLUSION AND FUTURE WORKS 48 APPENDIX A.1 QUERY RESULTS OF HS.12716 IN NCBI HOMEPAGE 52 APPENDIX A.2 QUERY RESULTS OF HS.112341 IN NCBI HOMEPAGE 63 APPENDIX A.3 MOTIFS RETRIEVED FROM PROSITE IN PRATT WEBSITE 81 APPENDIX B. A COMPLETE LIST OF HUMAN DNA SEQUENCES 89 APPENDIX C. FASTA FORMAT DESCRIPTION 93 APPENDIX D. THE BLAST FAMILY 95 REFERENCE 96 vii LIST OF TABLES Table Page 1.1. List of 20 Amino Acids 2 1.2. Selected World Wide Web Sites 5 1.3. Selected Molecular Biology FTP Servers 6 3.1. Lists of All Results of Finding the Motifs 12 4.1. Motifs Found by SDISCOVER that Match Prosite Signatures 42 iii LIST OF FIGURES Figure Page 2.1. The Input Interface of SDISCOVER 8 2.2. Illustration of Executing the Sorting Program 9 2.3. Illustration of Executing the Modified Program 10 3.2.1. A Screen Shot of Converting Output to FASTA Format 13 3.2.2. Using NCBI Tool Formatdb to Form a Database for BLAST 2 14 3.3.1. Sequences Data of Hs.12716 15 3.3.2. Sequences Data of Hs.112341 16 3.4.1. Screen Shot of BLAST 2 Query Screen 17 3.4.2. Query Results of Hs.12716 in Our Local Machine 19 3.4.3. Query Results of Hs.112341 in Our Local Machine 21 3.4.4. Hs.12716, the NCBI Homepage Query Screen 24 3.4.5. Hs.112341, the NCBI HomePage Query Screen 25 4.1. Sequences of COAGULATION FACTOR X PRECURSOR 26 4.2. Motifs of Protein Sequences Found by SDISCOVER 28 4.3. Sequences of Three Proteins from Different Families 45 4.4. Motifs Retrieved from Prosite in Pratt Website 46 5.1. Waiting Jobs to Finish 50 5.2. Error Message: Unable to Accept More Jobs 51 i CHAPTER 1 INTRODUCTION Biomolecular sequence alignment is among the most important and challenging tasks in computational biology; it involves identifying one-to-one correspondences between subunits of different sequences [33]. This procedure is essential to many tasks of biological data analysis [21,25,47], for example, retrieving a database to determine the species of unknown sequences, and discovering highly conserved subregious or patterns related to molecular structures and functions. Even by using genetic tools, forensic scientists can now examine the DNA in this biological evidence and tell almost certainly whether it came from a given individual [31]. A biomolecular object, such as a deoxyribonucleic acid (DNA), a ribonucleic acid (RNA) or a protein molecule, is made up of a long chain of subunits. In molecular sequence studies, sequence subunits are represented by characters from a domain which is denoted by . For example, the characters used to represent the nucleotides in DNA sequences are A (Adenosine), G (Guanine), C (Cytidine) and T (Thymidine), and is the set {A,G,C,T} [52]. A protein is represented as a sequence made from 20 different amino acids (see Table 1.1), each represented as a letter. By contrast, the structure of DNA, made up of only four different nucleotides that occur in two pairs, is relatively simple, regular, and predictable. In aligning biomolecular sequences, a function of scores (or distances) is needed to measure the goodness of alignments [20,43,48]. For a set of sequences, the optimal alignment is the one which maximizes the score (or minimizes the distance) [13,42]. To achieve a high score value, some subunits in different sequences are matched, and some sequences are considered to have insertions, deletions, and substitutions [1,40]. The following illustrates an alignment of three sequences. Sequence 1 ATGCAGGC Sequence 2 ATGAXGXC Sequence 3 ATGAXGGC Column 1 2 3 4 5 6 7 8 1 2 In the alignment columns 1,2,3,6, and 8 indicate character matches, column 4 indicates that a substitution has occurred in the first sequence (A-C, T-G), column 5 indicates that a subunit has been inserted into the first sequence, column 7 indicates that a subunit has been deleted from the second sequence. To represent insertion and deletion, the character X is introduced. Table 1.1 List of 20 Amino Acids 3 String search is an important operation in many areas of science and information processing. It occurs naturally as part of data processing, text editing, lexical analysis and information retrieval [51]. In molecular biology this computational problem is associated with the analysis, search and comparison of biosequences, which can be considered as texts made up of only four characters in the case of nucleic acid, and twenty, in the case of proteins [34].