Copyright Warning & Restrictions
The copyright law of the United States (Title 17, United States Code) governs the making of photocopies or other reproductions of copyrighted material.
Under certain conditions specified in the law, libraries and archives are authorized to furnish a photocopy or other reproduction. One of these specified conditions is that the photocopy or reproduction is not to be “used for any purpose other than private study, scholarship, or research.” If a, user makes a request for, or later uses, a photocopy or reproduction for purposes in excess of “fair use” that user may be liable for copyright infringement,
This institution reserves the right to refuse to accept a copying order if, in its judgment, fulfillment of the order would involve violation of copyright law.
Please Note: The author retains the copyright while the New Jersey Institute of Technology reserves the right to distribute this thesis or dissertation
Printing note: If you do not wish to print this page, then select “Pages from: first page # to: last page #” on the print dialog screen
The Van Houten library has removed some of the personal information and all signatures from the approval page and biographical sketches of theses and dissertations in order to protect the identity of NJIT graduates and faculty.
ABSTRACT
A COMPARATIVE STUDY OF SEQUENCE ANALYSIS TOOLS IN COMPUTATIONAL BIOLOGY
by Wei-Jen Chuang
A biomolecular object, such as a deoxyribonucleic acid (DNA), a ribonucleic acid (RNA) or a protein molecule, is made up of a long chain of subunits. A protein is represented as a sequence made from 20 different amino acids, each represented as a letter. There are a vast number of ways in which similar structural domains can be generated in proteins by different amino acid sequences. By contrast, the structure of DNA, made up of only four different nucleotide building blocks that occur in two pairs, is relatively simple, regular, and predictable. Biomolecular sequence alignment/string search is the most important issue and challenging task in many areas of science and information processing. It involves identifying one-to-one correspondences between subunits of different sequences. An efficient algorithm or tool is involved with many important factors, these include the following: Scoring systems, Alignment statistics, Database redundancy and sequence repetitiveness. Sequence "motifs" are derived from multiple alignments and can be used to examine individual sequences or an entire database for subtle patterns. With motifs, it is sometimes possible to detect distant relationships that may not be demonstrable based on comparisons of primary sequences alone. A more comprehensive solution to the efficient string search is approached by building a small, representative set of motifs and using this as a screening database with automatic masking of matching query subsequences. This technology is still under development but recent studies indicate that a representative set of only 1,000 — 3,000 sequences may suffice and such a database can be searched in seconds. A COMPARATIVE STUDY OF SEQUENCE ANALYSIS TOOLS IN COMPUTATIONAL BIOLOGY
by Wei-Jen Chuang
A Thesis Submitted to the Faculty of New Jersey Institute of Technology in Partial Fulfillment of the Requirements for the Degree of Master of Science and Computer Science
Department of Computer and Information Science
January 1999 APPROVAL PAGE
A COMPARATIVE STUDY OF SEQUENCE ANALYSIS TOOLS IN COMPUTATIONAL BIOLOGY
Wei-Jen Chuang
Dr. Jason T.L. Wang, Thesis Advisor Date Associate Professor of Computer and Information Science, NJIT
Dr. James M. Calvin, Committee Member Date Assistant Professor of Computer and Information Science, NJIT
Dr. Franz J. Kurfess, Committee Member Date Assistant Professor of Computer and Information Science, NJIT BIOGRAPHICAL SKETCH
Author: Wei-Jen Chuang
Degree: Master of Science
Date: January 1999
Undergraduate and Graduate Education:
• Master of Science in Computer Information and Science, New Jersey Institute of Technology, Newark, NJ 1999 • Master of Science in Phytopathology, National Chung-Hsing University, Taichung, Taiwan, R.O.C., 1990 • Bachelor of Science in Plant Pathology, National Chung-Hsing University, Taichung, Taiwan, R.O.C., 1988
Major: Computer Information and Science
Publication: W.-J. Chuang, "Etiological Studies On Stem Rot and Basal Rot of Taiwan Anoetcochilus and Their Control," NCHU, Taichung, Taiwan, R.O.C., July, 1990.
iv This work is dedicated to my mother, who is no longer with us. ACKNOWLEDGMENT
First, I would like to express my sincere appreciation to my advisor, Professor Jason
T. L. Wang, for his patience and constant guidance during the course of this research.
Also, I would like to thank Professor James M. Calvin and Professor Franz J.
Kurfess for their activity participating in my committee.
Most of all, I want to express my appreciation to my family, for their supporting and devotion, and to my wife, Mei-Yi, for her love and understanding. Without their support and encouragement, my accomplishment would not be possible.
vi TABLE OF CONTENTS
Chapter Page
1.INTRODUCTION 1
2. MODIFYING THE SDISCOVER PROGRAM 7
3. FINDING MOTIFS AND DATABASE EVALUATION
3.1. Searching for Motifs 11 3.2. Converting the Output Into FASTA Format and Forming a Local Database 13 3.3. Evaluating the Database 15
3.4. Results 17
4. COMPARING MOTIFS RETRIEVED FROM PROSITE WITH MOTIFS FOUND BY SDISCOVER 26
5. CONCLUSION AND FUTURE WORKS 48
APPENDIX A.1 QUERY RESULTS OF HS.12716 IN NCBI HOMEPAGE 52 APPENDIX A.2 QUERY RESULTS OF HS.112341 IN NCBI HOMEPAGE 63 APPENDIX A.3 MOTIFS RETRIEVED FROM PROSITE IN PRATT WEBSITE 81 APPENDIX B. A COMPLETE LIST OF HUMAN DNA SEQUENCES 89
APPENDIX C. FASTA FORMAT DESCRIPTION 93
APPENDIX D. THE BLAST FAMILY 95
REFERENCE 96
vii LIST OF TABLES
Table Page
1.1. List of 20 Amino Acids 2
1.2. Selected World Wide Web Sites 5
1.3. Selected Molecular Biology FTP Servers 6
3.1. Lists of All Results of Finding the Motifs 12
4.1. Motifs Found by SDISCOVER that Match Prosite Signatures 42