U·M·I University Microfilms International
Total Page:16
File Type:pdf, Size:1020Kb
Approximate pattern matching and its applications. Item Type text; Dissertation-Reproduction (electronic) Authors Wu, Sun. Publisher The University of Arizona. Rights Copyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author. Download date 25/09/2021 15:18:22 Link to Item http://hdl.handle.net/10150/185914 INFORMATION TO USERS This manuscript has been reproduced from the microfilm master. UMI films the text directly from the original or copy submitted. Thus, some thesis and dissertation copies are in typewriter face, while others may be from any type of computer printer. The quality of this reproduction is dependent upon the quality of the copy submitted. Broken or indistinct print, colored or poor quality illustrations and photographs, print bleed through, substandard margins, and improper alignment can adversely affect reproduction. In the unlikely event that the author did not send UMI a complete manuscript and there are missing pages, these will be noted. Also, if unauthorized copyright material had to be removed, a note will indicate the deletion. Oversize materials (e.g., maps, drawings, charts) are reproduced by sectioning the original, beginning at the upper left-hand corner and continuing from left to right in equal sections with small overlaps. Each original is also photographed in one exposure and is included in reduced form at the back of the book. Photographs included in the original manuscript have been reproduced xerographically in this copy. Higher quality 6" x 9" black and white photographic prints are available for any photographs or illustrations appearing in this copy for an additional charge. Contact UMI directly to order. U·M·I University Microfilms International . A 8eil & Howell Information Company 300 North Zeeb Road. Ann Arbor. M148106-1346 USA 313/761-4700 800/521-0600 Order Number 9234909 Approximate pattern matching and its applications Wu, Sun, Ph.D. The University of Arizona, 1992 Copyright ©1992 by Wu, Sun. All rights reserved. V·M·I 300 N. Zeeb Rd. Ann Arbor, MI48106 APPROXIMATE PATTERN MATCHING AND ITS APPLICATIONS by SUNWU Copyright © Sun Wu 1992 A Dissertation Submitted to the Faculty of the DEPARTMENT OF COMPUTER SCIENCE In partial Fulfillment of the Requirements For the Degree of DOCTOR OF PHILOSOPHY In the Graduate College THE UNIVERSITY OF ARIZONA 1992 2 THE UNIVERSITY OF ARIZONA GRADUATE COLLEGE As members of the Final Examination Committee, we certify that we have read the dissertation prepared by ____~S~u~n~~N~u~ _________________________ entitled Approximate Pattern Matching and its Applications and recommend that it be accepted as fulfilling the dissertation requirement for the Degree of ________~D~o~c~t~o~r~o~f_P~I~11~·l~o~s~o~p~h~y~ ___________ 6/8/92 Date 6/8/92 Date 6/8/92 Sampath Date 6/8/92 ;j)~7!'~~ Date Date Final approval and ac'ceptance of this dissertation is contingent upon the candidate's submission of the final copy of the dissertation to the Graduate College. I hereby certify that I have read this dissertation prepared under my direction and recommend that it be accepted as fulfilling the dissertation requirement. 6/fJ/92 Dissertation Director lidi Hanber Date 3 STATEMENT BY AUTHOR This dissertation has been submitted in partial fulfillment of requirements for an advanced degree at The University of Arizona and is deposited in the University Library to be made avail able to borrowers under the rules of the Library. Brief quotations from this dissertation are allowable without special permission, provided that accurate acknowledgment of source is made. Requests for permission for extended quotation from or reproduction of the manuscript in whole or in part may be granted by the copyright holder. 4 Acknow ledgements I would like to thank Udi Manber for being my advisor. He provided all the support, gui dance, encouragement, and freedom needed for me to complete this study. From him I learned not only the way of doing research, but also the way of being a researcher who seeks not only the intelligence in doing research but also the real contributions the work will provide. I am indebted to his patiently teaching me the writing, the oral presentation, and the way of being organized. I would like to thank Gene Myers, Sampath Kannan, Donald Myers, and William Velez for serving as members of my committee. I thank Gene Myers for illuminating me in the area of pattern matching. I learned a lot from his class as well as his pioneer research results in this area. Many of my researches have been following his foot step. I also thank Sampath Kannan for many fruitful discussions that have benefited me a lot. I have received a lot help from friends, fellow graduate students, and the excellent lab sup port. I thank Herman Chung-Hwa Rao, Naiwei Lin, Suchen Hu, Cheng-Ning Hsi, Su-Ing Tsuei, Vicraj Thomas, Cliff Hathaway. and Ric Anderson for their kindful help during my research and the development of 'agrep'. I would like to thank my wife, Fu-Mei, who has endured more than 1,000 days of allergy in Tucson, for providing me persistent support and care. Finally, I would like to thank my parents, Ta-chen and Su-Mei. Without their life-time love and encouragement, I would not have started this study. 5 Table of Contents Abstract ................................................................................................................................ 8 Chapter 1: Introduction ....... ....... ......... ........ ... ........ ....... ...... ..... .... ..... ........ ............. ........... ... 9 1.1 Motivation ........................................................................................................... 9 1.2 Previous Results ............ ....................................... ................................. .............. 10 1.3 Main Contributions .............................................................................................. 11 Chapter 2: Preliminary ......................................................................................................... 13 2.1 Basic Definitions ................................................................................................. 13 2.2 Organization of the Remaining Chapters ...................... ........................ .............. 16 Chapter 3: A Sub-quadratic Algorithm for Approximate String Matching ......................... 18 3.1 Introduction ............. ................. ....... ......... ........................ ........ ........................... 18 3.2 The Algorithm ..................................................................................................... 20 3.3 An Improvement to the Main Algorithm ............................................................ 24 Chapter 4: A Sub-quadratic Algorithm for Approximate Regular Expression Matching... 25 4.1 Introduction ...... ..... .......................... ................... ........... ......... ....... ...................... 25 4.2 Regular Expressions and NFA ............................................................................ 27 4.3 The Algorithm ..................................................................................................... 29 Chapter 5: A Fast Algorithm for Approximate String Matching ........................................ 43 5.1 Introduction ......................................................................................................... 43 5.2 A Simple Variation of the Boyer-Moore Algorithm ........................................... 44 5.3 A Simple Approximate Boyer-Moore Algorithm ............................................... 45 5.4 The Main Algorithm ............................................................................................ 47 Chapter 6: A bit-parallel Algorithm for Approximate Pattern Matching ............................ 51 6.1 Introduction ........... ............. ...... ........... ..... ............. ........... ...... ............................. 51 6.2 Exact Matching ............ ............... ..... ......... .......... ........... ......... .................... ......... 53 6 6.3 The algorithm for Limited Expressions .............................................................. 55 6.3.1 The Basic Algorithm .............................................................................. 55 6.3.2 An Improvement to the Main Algorithm ............................................... 58 6.3.3 Extensions ............................................................................................. 60 6.3.3.1 Sets of Characters ..................................................................... 60 6.3.3.2 Wild Cards ................................................................................ 60 6.3.3.3 Unknown Number of Errors ..................................................... 61 6.3.3.4 Mixed Exact/Approximate Matching ....................................... 61 6.3.3.5 Non-Uniform Cost .................................................................... 62 6.3.3.6 A Set of Patterns ....................................................................... 62 6.3.3.7 Long Patterns ............................................................................ 63 6.3.3.8 Very Large Alphabets ............................................................... 63 6.3.4 Analysis .................................................................................................