Efficient Top-k Algorithms for Approximate Substring Matching Younghoon Kim Kyuseok Shim Seoul National University Seoul National University Seoul, Korea Seoul, Korea
[email protected] [email protected] ABSTRACT to deal with vast amounts of data, retrieving similar strings There is a wide range of applications that require to query becomes a more challenging problem today. a large database of texts to search for similar strings or To handle approximate substring matching, various string substrings. Traditional approximate substring matching re- (dis)similarity measures, such as edit distance, hamming dis- quests a user to specify a similarity threshold. Without top- tance, Jaccard coefficient, cosine similarity and Jaro-Winkler k approximate substring matching, users have to try repeat- distance have been considered [2, 10, 11, 19]. Among them, edly different maximum distance threshold values when the the edit distance is one of the most widely accepted distance proper threshold is unknown in advance. measures for database applications where domain specific In our paper, we first propose the efficient algorithms knowledge is not really available [10, 12, 20, 27]. for finding the top-k approximate substring matches with Based on the edit distance, the approximate string match- a given query string in a set of data strings. To reduce the ing problem is to find every string from a string database number of expensive distance computations, the proposed whose edit distance to the query string is not larger than a algorithms utilize our novel filtering techniques which take given maximum distance threshold. There are diverse appli- advantages of q-grams and inverted q-gram indexes avail- cations requiring approximate string matching.