Pattern Matching Algorithms This Page Intentionally Left Blank Pattern Matching Algorithms

Pattern Matching Algorithms This page intentionally left blank Pattern Matching Algorithms Edited by Alberto Apostolico Zvi Galil New York Oxford Oxford University Press 1997 Oxford University Press Oxford New York Athens Auckland Bangkok Bogota Bombay Buenos Aires Calcutta Cape Town Dar es Salaam Delhi Florence Hong Kong Istanbul Karachi Kuala Lumpur Madras Madrid Melbourne Mexico City Nairobi Paris Singapore Taipei Tokyo Toronto and associated companies in Berlin Ibadan Copyright © 1997 by Alberto Apostolico & Zvi Galil Published by Oxford University Press, Inc. 198 Madison Avenue, New York, New York 10016 Oxford is a registered trademark of Oxford University Press All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without prior permission of Oxford University Press. Library of Congress Cataloging-in-Publication Data Apostolico, Alberto, 1948- Pattern matching algorithms / edited by A. Apostolico, Zvi Galil. p. cm. Includes bibliographical references and index. ISBN 0-19-511367-5 1. Computer algorithms. 2. Combinatorial analysis. I. Galil, Zvi. II. Title. QA76.9.A43A66 1997 006.4—dc21 96-49602 35798642 Printed in the United States of America on acid-free paper Preface Issues of matching and searching on elementary discrete structures arise pervasively in Computer Science as well as in many of its applications. Their relevance may be expected to grow even further in the near future as information is amassed, disseminated and shared at an increasing pace. A number of algorithms have been discovered in recent years in response to these needs. Along the way, a number of combinatorial structures and tools have been exposed that often carry intrinsic value and interest. The main ideas that developed in this process concurred to distill the scope and flavor of Pattern Matching, now an established specialty of Algorithmics. The lineage of Pattern Matching may be traced back for more than two decades, to a handful of papers such as "Rapid identification of repeated patterns in strings, trees, and arrays", by R.M. Karp, R.E. Miller and A.L. Rosenberg, "Automata theory can be useful", by D.E. Knuth and V.R. Pratt, and, in a less direct way, S.A. Cook's "Linear time simulation of deterministic two-way pushdown automata". The outgrowth of these powerful seeds is impressive: I. Simon lists over 350 titles in the current version of his bibliography on string algorithms; A. V. Aho references over 140 papers in his recent survey of string-searching algorithms alone; and advanced workshops and schools, books and special issues of major journals have been already dedicated to the subject and more are planned for the future. This book attempts a snapshot of the current state of the art in Pattern Matching as seen by specialists who have devoted several years of their study to the field. Without pretending to be exhaustive, the book does cover most basic principles and presents material advanced enough to portray faithfully the current frontier of research in the field. Our intention was to combine a textbook suitable for a graduate or advanced level course with an in-depth source for the specialist as well as the neophyte. The appearance of this book marks, somewhat intentionally, the Tenth anniversary since "Combinatorial Algorithms on Words". We wish it an equally favorable reception, and possibly even more opportunities to serve, through the forthcoming years, the growth of its useful and fascinating subject. a.a. and z.g. This page intentionally left blank Contributors Amihood Amir, Computer Science Department, Georgia Institute of Technology, Atlanta, GA 30332, USA Alberto Apostolico, Department of Computer Sciences, Purdue Uni- versity, CS Building, West Lafayette, IN 47907, USA and Dipartimento di Elettronica e Informatica, Universita di Padova, Via Gradenigo 6/A, 35131 Padova, Italy Mikhail J. Atallah, Department of Computer Sciences, Purdue Univer- sity, CS Building, West Lafayette, IN 47907, USA Maxima Crochemore, Institut Gaspard Monge, Universite de Marne-la- Vallee, 2 rue de la Butte Verte, F-93160 Noisy-le-Grand, France. Martin Farach, DIM ACS, Rutgers University, CoRE Building, Busch Campus, P.O. 1179, Piscataway, NJ 08855-1179, USA Zvi Galil, Department of Computer Science, Columbia University, New York, NY 10027, USA and Department of Computer Science, School of Mathematical Sciences, Tel Aviv University, Tel Aviv, 69978 Israel Raffaele Giancarlo, Dipartimento di Matematica ed Applicazioni, Uni- versita di Palermo, Via Archirafi 34, 90123 Palermo, Italy Roberto Grossi, Dipartimento di Sistemi e Informatica, Universita di Firenze, Via Lombroso 6/17, 50134 Firenze, Italy Daniel S. Hirschberg, Department of Information and Computer Sci- ence, University of California at Irvine, Irvine, CA 92717, USA Tao Jiang, Department of Computer Science, McMaster University, Hamil- ton, Ontario L8S 4K1, Canada Gad M. Landau, Department of Computer Science, Polytechnic Univer- sity, 6 MetroTech, Brooklyn, NY 11201, USA Ming Li, Department of Computer Science, University of Waterloo, Wa- terloo, Ontario N2L 3G1, Canada Dennis Shasha, Courant Institute of Mathematical Science, New York University, 251 Mercer Street, New York, NY 10012, USA Uzi Vishkin, Institute for Advanced Computer Studies and Electrical En- gineering Department, University of Maryland, College Park, MD 20742, USA. and Department of Computer Science, School of Mathematical Sci- ences, Tel Aviv University, Tel Aviv, 69978 Israel Ilan Yudkiewicz, Department of Computer Science, School of Mathemat- ical Sciences, Tel Aviv University, Tel Aviv, 69978 Israel Kaizhong Zhang, Department of Computer Science, University of West- ern Ontario, London, Ontario N6A 5B7, Canada Contents 1 Off-line Serial Exact String Searching 1 M. Crochemore 1.1 Searching for strings in texts 1 1.2 Sequential searches 5 1.2.1 String-Matching Automaton 5 1.2.2 Forward Prefix scan 7 1.2.3 Forward Subword scan 19 1.3 Practically fast searches 24 1.3.1 Reverse-Suffix scan 24 1.3.2 Reverse-Prefix scan 32 1.4 Space-economical methods 35 1.4.1 Constant-space searching algorithm 36 1.4.2 Computing a critical factorization 40 1.4.3 Computing the period of the pattern 42 1.5 Heuristics for practical implementations 44 1.6 Exercises 46 1.7 Bibliographic notes 47 2 Off-line Parallel Exact String Searching 55 Z. Galil and I. Yudkiewicz 2.1 Preliminaries 55 2.2 Application of sequential techniques 57 2.3 Periodicities and witnesses 59 2.3.1 Witnesses 61 2.3.2 Computing and using witnesses in O(log log m) time 64 2.3.3 A Lower Bound for the CRCW-PRAM 64 2.4 Deterministic Samples 67 2.4.1 Faster text search using DS 70 2.5 The fastest known CRCW-PRAM algorithm 71 2.5.1 Computing deterministic samples in constant time 72 2.5.2 Applying DS to obtain the fastest algorithms 74 2.5.3 A new method to compute witnesses : Pseudo-periods 74 2.5.4 A constant expected time algorithm 78 2.6 Fast optimal algorithms on weaker models 78 2.6.1 Optimal algorithm on the EREW- PRAM 80 2.7 Two dimensional pattern matching 82 2.8 Conclusion and open questions 84 2.9 Exercises 85 2.10 Bibliographic notes 85 3 On-line String Searching 89 A. Apostolico 3.1 Subword trees 89 3.2 McCreight's algorithm 93 3.3 Storing suffix trees 96 3.4 Building suffix trees in parallel 97 3.4.1 Preprocessing 98 3.4.2 Structuring Dx 99 3.4.3 Refining Dx 100 3.4.4 Reversing the edges 105 3.5 Parallel on-line search 106 3.6 Exhaustive on-line searches 108 3.6.1 Lexicographic lists 111 3.6.2 Building lexicographic lists 113 3.6.3 Standard representations for constant- time queries 117 3.7 Exercises 118 3.8 Bibliographic notes 119 4 Serial Computations of Levenshtein Distances 123 D.S. Hirschberg 4.1 Levenshtein distance and the LCS problem 123 4.2 Classical algorithm 124 4.3 Non-parameterized algorithms 125 4.3.1 Linear space algorithm 125 4.3.2 Sub quadratic algorithm 126 4.3.3 Lower bounds 129 4.4 Parameterized algorithms 130 4.4.1 pn algorithm 130 4.4.2 Hunt-Szymanski algorithm 132 4.4.3 nD algorithm 134 4.4.4 Myers algorithm 134 4.5 Exercises 137 4.6 Bibliographic notes 138 5 Parallel Computations of Levenshtein Distances 143 A. Apostolico and M.J. Atallah 5.1 Edit distances and shortest paths 144 5.2 A monotonicity property and its use 146 5.3 A sequential algorithm for tube minima 151 5.4 Optimal EREW-PRAM algorithm for tube minima 151 5.4.1 EREW-PRAM computation of row minima 152 5.4.2 The tube minima bound 168 5.4.3 Implications for string editing 170 5.5 Optimal CRCW-PRAM algorithm for tube minima 170 5.5.1 A preliminary algorithm 171 5.5.2 Decreasing the work done 178 5.5.3 Further remarks 179 5.6 Exercises 180 5.7 Bibliographic notes 181 6 Approximate String Searching 185 G.M. Landau and U. Vishkin 6.1 The serial algorithm 187 6.1.1 The dynamic programming algorithm. 187 6.1.2 An alternative dynamic programming computation 188 6.1.3 The efficient algorithm 190 6.2 The parallel algorithm 191 6.3 An algorithm for the LCA problem 193 6.3.1 Reducing the LCA problem to a restricted- domain range-minima problem 194 6.3.2 A simple sequential LCA algorithm 194 6.4 Exercises 196 6.5 Bibliographic notes 196 7 Dynamic Programming: Special Cases 201 R. Giancarlo 7.1 Preliminaries 201 7.2 Convexity and concavity 203 7.2.1 Monge conditions 203 7.2.2 Matrix searching 204 7.3 The one-dimensional case (ID/ID) 206 7.3.1 A stack or a queue 207 7.3.2 The effects of SMAWK 208 7.4 The two-dimensional case 211 7.4.1 A coarse divide-and-conquer 212 7.4.2 A fine divide-and-conquer 213 7.5 Sparsity 215 7.5.1 The sparse 2D/0D problem 216 7.5.2 The sparse 2D/2D problem 220 7.6 Applications 222 7.6.1 Sequence alignment 222 7.6.2 RNA secondary structure 226 7.7 Exercises 231 7.8 Bibliographic notes 232 8 Shortest Common Superstrings 237 M.

Load more