Efficient Multiple Pattern Matching Algorithm Based on BMH: MP-BMH

Total Pages: 16

File Type: pdf, Size: 1020 KB

Efficient Multiple Pattern Matching Algorithm Based on BMH: MP-BMH
Akhtar Rasool1, Gulfishan Firdose Ahmed2, Raju Barskar3, and Nilay Khare1
1Department of Computer Science and Engineering, Maulana Azad National Institute of Technology, India
2Department of Computer Science, College of Agriculture, India
3Department of Computer Science and Engineering, University Institute of Technology RGPV, India
The International Arab Journal of Information Technology, Vol. 16, No. 6, November 2019

Abstract: String matching plays a key role in a wide range of applications in information computing. Owing to its importance, a large number of different algorithms have been developed over the last 50 years. There are several standard single and multiple pattern string matching algorithms, such as Boyer-Moore (BM), Boyer-Moore-Horspool (BMH) and Aho-Corasick. Multiple pattern string matching is very useful in real-world applications. The standard benchmark multiple pattern algorithm Aho-Corasick, like many other multiple pattern string matching algorithms, is neither memory nor time efficient for long patterns and large pattern sets on large text data sets, because such algorithms rely on concepts like the DFA and require a full scan of the text. Many string matching tools that are currently popular are based on these algorithms. Using the bad character principle of BMH, a method for multiple pattern string matching is developed. Using this method, a time and memory efficient exact multiple pattern string matching algorithm, Multiple Pattern BMH (MP-BMH), is proposed. Unlike Aho-Corasick, the proposed MP-BMH algorithm skips unnecessary character comparisons in the text while searching, as the BMH algorithm does. It also skips characters in the patterns by exploiting the similarity between patterns. Along with the shifts made while searching, the algorithm performs shrewd switches among the patterns for effective matching. In this paper, the aforesaid method and the MP-BMH algorithm are described, together with its time, memory and experimental analysis.

Keywords: String matching, multiple pattern string matching, Boyer-Moore (BM), BMH, MP-BMH.

Received January 8, 2017; accepted January 16, 2018

1. Introduction

String matching is one of the important concepts for solving problems in computer science. A multi-pattern string matching algorithm [9, 35, 39] finds all occurrences of a set of desired patterns in a text string. The taxonomy and relative performance measures of multi-pattern matching algorithms are given in [28, 51, 53]. String matching has many real-world applications, such as intrusion detection systems [28, 51, 53], plagiarism detection [3], text mining techniques [35], digital forensics [31] and many more. An effective multiple string matching algorithm can boost the performance of these widely used applications, and an effective selection of multiple string matching algorithms [29] can improve their efficiency. The most important benefit of multiple string matching algorithms is that no additional search structure is needed: a single pass over the text is sufficient to search for multiple patterns or keywords. Knowledge of single pattern string matching helps in understanding the multiple pattern string matching algorithms.
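To make the single-pass, multiple-pattern task described above concrete, here is a deliberately naive sketch in Python. It is not the paper's MP-BMH algorithm; the function name, pattern set and text are invented for illustration, and efficient algorithms such as Aho-Corasick or MP-BMH exist precisely to avoid most of the comparisons this baseline performs.

```python
def naive_multi_match(text, patterns):
    """Report (position, pattern) for every occurrence of every pattern.

    A simple baseline: one left-to-right pass over the text, checking each
    pattern at each position.  No skipping of characters takes place.
    """
    hits = []
    for i in range(len(text)):
        for p in patterns:
            if text.startswith(p, i):
                hits.append((i, p))
    return hits

# Invented example pattern set and text.
print(naive_multi_match("ababcabcab", ["abc", "cab", "ab"]))
# [(0, 'ab'), (2, 'abc'), (2, 'ab'), (4, 'cab'), (5, 'abc'), (5, 'ab'), (7, 'cab'), (8, 'ab')]
```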
2. Related Work

There are various popular single pattern string matching algorithms, such as Boyer-Moore (BM) [45, 47], Knuth-Morris-Pratt (KMP) [49, 50], Rabin-Karp [51], Boyer-Moore-Horspool (BMH) [27, 42, 44, 50], BMHS [30, 34], BMHS2 [47], BMI [53], Improved BMHS [44], Backward Non-Deterministic Dawg Matching (BNDM) [39] and Two-way Non-Deterministic Dawg Matching (TNDM) [49]. These algorithms are simple but time consuming when applied pattern by pattern, which motivates the move to multi-pattern string matching [16, 49]. The multiple pattern string matching algorithms include multi-pattern string matching with q-grams [29, 50], a fast algorithm for multi-pattern searching [46], Wu-Manber [31, 36, 48], Commentz-Walter [22] and the standard benchmark algorithm Aho-Corasick [11, 12, 13, 15, 17, 18, 19, 20, 21, 22, 23, 24, 30, 32, 33, 34, 38, 40, 41, 43].

The Boyer-Moore-Horspool (BMH) [50] algorithm uses the Bad Character concept [14] of the BM algorithm [47] to shift through the text during matching. In the BMH algorithm, no matter where the mismatch occurs, the distance of the shift to the right is determined by the character of the text that is aligned with the last character of the pattern. In other words, the Bad Character value [14] is always computed for the text character corresponding to the last pattern position, and the window is then shifted right by that value. BMH always shifts by at least one position and by at most the pattern length [50].
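As a reference point for the shift rule just described, the following is a minimal single-pattern BMH sketch in Python. It is an assumed illustration, not code from the paper; the function name and the example strings are invented.

```python
def bmh_search(text, pattern):
    """Boyer-Moore-Horspool: return the start positions of all exact matches.

    The shift is always taken from the text character aligned with the last
    pattern position, regardless of where the mismatch occurred.
    """
    m, n = len(pattern), len(text)
    if m == 0 or n < m:
        return []
    # Bad-character table built from every pattern position except the last;
    # characters not in the table shift the window by the full pattern length.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    hits, pos = [], 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        pos += shift.get(text[pos + m - 1], m)
    return hits

print(bmh_search("here is a simple example", "example"))   # [17]
```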
In [50] the BMH algorithm is applied to multiple patterns in a different way, using q-grams. The proposed filtering algorithm is called the Horspool with q-Grams (HG) algorithm. The HG algorithm performs faster for fewer than 2000 patterns and works well for fewer than 10000 patterns, but the full text is scanned when the pattern set contains more than 50000 patterns. Being a filtering method, it is not designed for searching texts with a large number of matches.

In [46] the BM algorithm is extended to handle multiple patterns. Two new algorithms are proposed whose operation involves remembering previous matches. With N the text length and D the maximum pattern length, the first algorithm remembers at most 1 + log4 D previous matches and consults O(N log D) text characters. The second algorithm, designed for a different time-space trade-off, consults O(kN log D) text characters and remembers at most t/k non-periodic matches for any given k, where t is the number of patterns.

Aho-Corasick [42] is the standard multi-pattern string matching algorithm [9] based on finite automata. The Aho-Corasick (AC) [2] algorithm runs in linear time: it scans the text character by character using a finite state machine built from the patterns at the beginning. Building on that work, Commentz-Walter [48] provided an almost linear solution by combining Boyer-Moore [47] matching with the AC machine. In addition to automata-based approaches, researchers have also considered bit parallelism (e.g., the shift-or algorithm [4]), which delivers a considerable speed-up for single pattern string matching, although applying bit-parallelism to multiple pattern string matching is a tedious task [27]. By utilizing arithmetic and logical operations and assigning numbers to different states, Baeza-Yates and Gonnet [4] proposed a search method that includes all keywords of the pattern set when the computer word size is not large enough. Kim and Kim [35] proposed an algorithm that performs bitwise hashing operations while scanning, by encoding the patterns and the text at the beginning of the process. Timo Raita [29] improves the BMH algorithm by introducing the dependencies between the characters.

… version is better than the original Wu-Manber algorithm [46]. The Wu-Manber and Quadrupled Wu-Manber (QWM) algorithms are relatively no better than the Aho-Corasick and Commentz-Walter algorithms [8] when binary text is employed rather than increased pattern length and alphabet size. Among all four, the QWM algorithm has the better performance, which is why QWM is applied in various fields such as network content analysis, intrusion detection and text retrieval systems.

The grep tools contributed a lot to the field of multi-pattern string matching. The first implementation of grep was given by Thompson, using a Nondeterministic Finite Automaton (NFA) for regular expressions [37]. Then Alfred Aho made egrep (enhanced grep, later extended grep) using a Deterministic Finite Automaton (DFA) [3]. The fgrep (fixed-string grep) tool is based on the Aho-Corasick algorithm [43]. The UNIX grep utility searches the input and outputs the lines that contain any of the input patterns. UNIX egrep supports additional operators for regular expressions [1]. UNIX fgrep performs like grep but does not recognize any regular expression meta-characters as special; fgrep is therefore basically a fixed pattern matching tool and hence performs faster [1]. The original egrep and fgrep could not perform well for more than a few hundred patterns [43]. Agrep [44] advances beyond egrep and fgrep in that it searches for the patterns in the given text approximately, based on the bitap algorithm. In 1964 Balint Domolki proposed the bitap algorithm for exact string searching [9, 26]; it was extended by Shyamsundar in 1977 [32] and reinvented by Manber and Wu [44] for approximate string matching, also known as fuzzy string matching, based on work done by Baeza-Yates and Gonnet [4]. A further improvement was made by Sunday [34], and for long patterns Gene Myers proposed an improved algorithm based on the bitap algorithm in 1998 [25].

In this paper, a multiple pattern string matching method is designed on the basis of the BMH algorithm. Using this method, a multiple pattern string matching algorithm is proposed. The algorithm provides a shifting mechanism that avoids unnecessary character matching. The MP-BMH algorithm, its time and space complexity analysis and a comparative analysis with other standard tools and algorithms are described here.

3. Multiple Pattern String Matching Using …
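The bit-parallel shift-or technique attributed to Baeza-Yates and Gonnet [4] in the related work above is compact enough to sketch. This is a minimal, assumed single-pattern illustration in Python; it is neither the paper's MP-BMH nor a multi-pattern variant, and it assumes the pattern fits in one machine word.

```python
def shift_or_search(text, pattern):
    """Bit-parallel shift-or: return start positions of exact matches."""
    m = len(pattern)
    if m == 0 or m > 64:
        raise ValueError("this sketch assumes the pattern fits one machine word")
    full = (1 << m) - 1
    # mask[c] has bit i cleared exactly where pattern[i] == c.
    mask = {c: full for c in set(pattern)}
    for i, c in enumerate(pattern):
        mask[c] &= ~(1 << i)
    state = full                                   # all prefix states inactive
    hits = []
    for j, c in enumerate(text):
        state = ((state << 1) | mask.get(c, full)) & full
        if not (state >> (m - 1)) & 1:             # bit m-1 clear: match ends at j
            hits.append(j - m + 1)
    return hits

print(shift_or_search("abracadabra", "abra"))      # [0, 7]
```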
Recommended publications
  • The Problem of Fuzzy Duplicate Detection of Large Texts
    The problem of fuzzy duplicate detection of large texts. E V Sharapova1 and R V Sharapov1, 1Vladimir State University, Orlovskaya street 23, Murom, Russia, 602264. Abstract. The paper considers the problem of fuzzy duplicate detection. The basic approaches to the detection of text duplicates are given: distance between strings, fuzzy search algorithms without indexing data, and fuzzy search algorithms with indexing data. A review of existing methods for fuzzy duplicate detection is given. An algorithm for fuzzy duplicate detection is presented. The algorithm for fuzzy duplicate detection of texts was implemented in the AVTOR.NET system. The paper presents the results of testing the system. The use of text filtering, stemming and character replacement allows the algorithm to find duplicates even in slightly modified texts. 1. Introduction. Today computer technology and the World Wide Web have become part of our lives. Billions of users use the Internet every day. The Internet contains a huge number of documents. Many books have been transformed into digital form. Users can read books, articles and newspapers directly from computers and electronic gadgets. Today the Internet holds many study materials: lectures, textbooks, methodical literature, etc. There are big collections of essays, course and diploma projects and scientific dissertations. This leads to the problem of uncontrolled copying of information. The text of a document may be part of another document, may be formed from several documents, or may be a modified part of another document (with words changed to synonyms, endings, tenses, etc.). In this way, the document is a fuzzy duplicate of other documents. This raises the problem of checking texts for the presence of fuzzy duplicates.
  • Algorithms for Approximate String Matching Petro Protsyk (28 October 2016) Agenda
    Algorithms for approximate string matching. Petro Protsyk (28 October 2016). Agenda • About ZyLAB • Overview • Algorithms • Brute-force recursive Algorithm • Wagner and Fischer Algorithm • Algorithms based on Automaton • Bitap • Q&A. About ZyLAB: ZyLAB is a software company headquartered in Amsterdam. In 1983 ZyLAB was the first company providing a full-text search program for MS-DOS called ZyINDEX. In 1991 ZyLAB released the ZyIMAGE software bundle that included a fuzzy string search algorithm to overcome scanning and OCR errors. Currently ZyLAB develops eDiscovery and Information Governance solutions for On Premises, SaaS and Cloud installations. ZyLAB continues to develop new versions of its proprietary full-text search engine. About ZyLAB • We develop for Windows environments only, including Azure • Software developed in C++, .Net, SQL • Typical clients are from law enforcement, intelligence, fraud investigators, law firms, regulators etc. • FTC (Federal Trade Commission), est. >50 TB (peak 2TB per day) • US White House (all email of presidential administration), est. >20 TB • Dutch National Police, est. >20 TB. ZyLAB Search Engine: Full-text search engine • Boolean and Proximity search operators • Support for Wildcards, Regular Expressions and Fuzzy search • Support for numeric and date searches • Search in document fields • Near duplicate search. ZyLAB Search Engine • Highly parallel indexing and searching • Can index Terabytes of data and millions of documents • Can be distributed • Supports 100+ languages, Unicode characters. Overview: Approximate string matching or fuzzy search is the technique of finding strings in a text or dictionary that match a given pattern approximately with respect to a chosen edit distance. The edit distance is the number of primitive operations necessary to convert the string into an exact match.
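The edit distance defined in this preview (and the Wagner and Fischer item on its agenda) corresponds to the classic dynamic program; the following is a minimal sketch, not ZyLAB's implementation, with an invented example pair of strings.

```python
def edit_distance(a, b):
    """Levenshtein distance via the Wagner-Fischer dynamic program."""
    prev = list(range(len(b) + 1))            # distances from a[:0] to every prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # match or substitution
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))   # 3
```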
  • A Parallel String Matching Tool
    Clgrep: A Parallel String Matching Tool. Peng Wu. A thesis submitted for the degree of Master of Science at the University of Otago, Dunedin, New Zealand. March 18, 2013. Abstract. In this study, we widely investigate the problem of string matching in the context of Heterogeneous Parallel Computing. An overview of string matching is made, in which the different forms of the string matching problem are distinguished and the classifications of string matching algorithms are discussed. As an alternative to grep for computationally intensive string matching, and in addition to support the research of the study, a parallel exact string matching utility `Clgrep' is developed. By experimental studies, we investigate the use of heuristics-based algorithms, specifically the QS and Horspool algorithms, in the context of Heterogeneous Parallel Computing. The results suggest that the performance of Heterogeneous Parallel Computing matching, either on a multi-core CPU or a GPU, is highly related to the computational intensity of certain cases. When computational power is intensively required, the SIMD Parallel Computing model of Clgrep can be several times more efficient than the corresponding sequential matching program. Acknowledgements. This study has been a difficult task as the problem is novel and involves many programming challenges. Every day, I am fighting with the errors in the program and the issues raised in the experiment. I do not remember how many hours were spent in the complex experiment. However, I cannot forget the people who helped me during the study. I would like to thank my parents, without whom my study would merely be a dream.
  • Improvising of Web Recommendation System by Using K-Means and Bitap Algorithm
    November 2017, Volume 4, Issue 11, JETIR (ISSN-2349-5162). IMPROVISING OF WEB RECOMMENDATION SYSTEM BY USING K-MEANS AND BITAP ALGORITHM. 1Dipali Wankhede, 2S.G. Tuppad; 1ME Student, 2Professor; Computer Science, MSSCET, Jalna, India. Abstract— The increase in the amount of information over the Internet in recent years has led to the risk of data flooding, which has resulted in problems accessing user-related information. In addition, the increase in the number of websites and web pages has made it difficult for webmasters to organize their content. An online user's information needs can be obtained by evaluating the user's web usage. Web Usage Mining (WUM) is used to extract knowledge from web users; access logs are analysed by using data mining techniques. One of the WUM applications is recommendation, a filtering technique used to determine whether a user approves of a specified item or to identify a list of items that may be important to the user. This article presents an architecture that combines product data with user access log data and generates a set of recommendations for that user. The approach records interesting results in terms of precision and recall metrics. Index Terms— Web Usage mining, Online Web Recommendation System, Clustering, Pattern Matching, Boyer Moore, K-means, Recommendation. I. INTRODUCTION. In recent years, eCommerce, web services and web-based information systems have come into rapid use. The explosion of the web and the emergence of ecommerce have encouraged designers to develop such systems. The ecommerce product recommendation system has transformed the global business landscape.
  • NEWS for R Version 4.1.1 (2021-08-10)
    NEWS for R version 4.1.1 (2021-08-10) NEWS R News LATER NEWS • News for R 3.0.0 and later can be found in file `NEWS.Rd' in the R sources and files `NEWS' and `doc/html/NEWS.html' in an R build. CHANGES IN R VERSION 2.15.3 NEW FEATURES: • lgamma(x) for very small x (in the denormalized range) is no longer Inf with a warning. • image() now sorts an unsorted breaks vector, with a warning. • The internal methods for tar() and untar() do a slightly more general job for `ustar'-style handling of paths of more than 100 bytes. • Packages compiler and parallel have been added to the reference index (`refman.pdf'). • untar(tar = "internal") has some support for pax headers as produced by e.g. gnutar --posix (which seems prevalent on OpenSUSE 12.2) or bsdtar --format pax, including long path and link names. • sQuote() and dQuote() now handle 0-length inputs. (Suggestion of Ben Bolker.) • summaryRprof() returns zero-row data frames rather than throw an error if no events are recorded, for consistency. • The included version of PCRE has been updated to 8.32. • The tcltk namespace can now be re-loaded after unloading. The Tcl/Tk event loop is inhibited in a forked child from package parallel (as in e.g. mclapply()). • parallel::makeCluster() recognizes the value `random' for the environment variable R_PARALLEL_PORT: this chooses a random value for the port and reduces the chance of conflicts when multiple users start a cluster at the same time. UTILITIES: • The default for TAR on Windows for R CMD build has been changed to be `internal' if no tar command is on the path.
  • Java Exception Matching in Real Time Using Fuzzy Logic Master of Science Thesis
    Java exception matching in real time using fuzzy logic. Master of Science Thesis. Karl Tillström. Chalmers University of Technology, University of Gothenburg, Department of Computer Science and Engineering, Göteborg, Sweden, May 2010. The Author grants to Chalmers University of Technology and University of Gothenburg the non-exclusive right to publish the Work electronically and in a non-commercial purpose make it accessible on the Internet. The Author warrants that he/she is the author to the Work, and warrants that the Work does not contain text, pictures or other material that violates copyright law. The Author shall, when transferring the rights of the Work to a third party (for example a publisher or a company), acknowledge the third party about this agreement. If the Author has signed a copyright agreement with a third party regarding the Work, the Author warrants hereby that he/she has obtained any necessary permission from this third party to let Chalmers University of Technology and University of Gothenburg store the Work electronically and make it accessible on the Internet. Java exception matching in real time using fuzzy logic. KARL TILLSTRÖM. © KARL TILLSTRÖM, MAY 2010. Examiner: BROR BJERNER. Chalmers University of Technology, University of Gothenburg, Department of Computer Science and Engineering, SE-412 96 Göteborg, Sweden. Telephone +46 (0)31-772 1000. Department of Computer Science and Engineering, Göteborg, Sweden, May 2010. Abstract. This thesis deals with the problem of matching Java exceptions that originate from the same problem but look a bit dissimilar. The matching must be done within reasonable time to be able to be carried out automatically.
  • Approximate Pattern Matching with Index Structures
    Technische Universität München, Fakultät für Informatik, Lehrstuhl für Effiziente Algorithmen. Approximate Pattern Matching with Index Structures. Johannes A. Krugel. Complete reprint of the dissertation approved by the Fakultät für Informatik of the Technische Universität München for the award of the academic degree of Doktor der Naturwissenschaften (Dr. rer. nat.). Chair: Univ.-Prof. Dr. Helmut Seidl. Examiners of the dissertation: 1. Univ.-Prof. Dr. Ernst W. Mayr, 2. Univ.-Prof. Dr. Stefan Kramer, Johannes Gutenberg-Universität Mainz. The dissertation was submitted to the Technische Universität München on 06.05.2015 and accepted by the Fakultät für Informatik on 19.01.2016. Abstract (Zusammenfassung): The goal of this work is to give an overview of the practical behaviour of index structures and algorithms for approximate pattern matching (APM), depending on the properties of the input. APM is the search for strings in texts or biological sequences while allowing for errors (such as spelling mistakes or genetic mutations). In the offline variant of this problem, the text can be preprocessed to build an index structure before the search queries are answered. This work describes and discusses practically relevant index structures, similarity measures and search algorithms for APM. We propose a new efficient search algorithm for suffix trees in external memory. As part of this work, several index structures and algorithms for APM were implemented and made available as a software library; the implementations are efficient, stable, generic, tested and available online.
  • NEWS for R Version 3.3.1 (2016-06-21)
    NEWS for R version 3.3.1 (2016-06-21) NEWS R News CHANGES IN R 3.3.1 BUG FIXES: • R CMD INSTALL and hence install.packages() gave an internal error installing a package called description from a tarball on a case-insensitive file system. • match(x, t) (and hence x %in% t) failed when x was of length one, and either character and x and t only differed in their Encoding or when x and t were complex with NAs or NaNs. (PR#16885.) • unloadNamespace(ns) also works again when ns is a `namespace', as from getNamespace(). • rgamma(1,Inf) or rgamma(1, 0,0) no longer give NaN but the correct limit. • length(baseenv()) is correct now. • pretty(d, ..) for date-time d rarely failed when "halfmonth" time steps were tried (PR#16923) and on `inaccurate' platforms such as 32-bit windows or a configuration with --disable-long-double; see comment #15 of PR#16761. • In text.default(x, y, labels), the rarely(?) used default for labels is now correct also for the case of a 2-column matrix x and missing y. • as.factor(c(a = 1L)) preserves names() again as in R < 3.1.0. • strtrim(""[0], 0[0]) now works. • Use of Ctrl-C to terminate a reverse incremental search started by Ctrl-R in the readline-based Unix terminal interface is now supported for readline >= 6.3 (Ctrl-G always worked). (PR#16603) • diff(<difftime>) now keeps the "units" attribute, as subtraction already did, PR#16940. CHANGES IN R 3.3.0 SIGNIFICANT USER-VISIBLE CHANGES: • nchar(x, *)'s argument keepNA governing how the result for NAs in x is determined, gets a new default keepNA = NA which returns NA where x is NA, except for type = "width" which still returns 2, the formatting / printing width of NA.
  • Needleman–Wunsch Algorithm
    Needleman–Wunsch algorithm. From Wikipedia, the free encyclopedia. The Needleman–Wunsch algorithm is an algorithm used in bioinformatics to align protein or nucleotide sequences. It was one of the first applications of dynamic programming to compare biological sequences. The algorithm was developed by Saul B. Needleman and Christian D. Wunsch and published in 1970.[1] The algorithm essentially divides a large problem (e.g. the full sequence) into a series of smaller problems and uses the solutions to the smaller problems to reconstruct a solution to the larger problem.[2] It is also sometimes referred to as the optimal matching algorithm and the global alignment technique. The Needleman–Wunsch algorithm is still widely used for optimal global alignment, particularly when
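To complement the description above, here is a minimal Needleman–Wunsch sketch in Python, assuming a simple scoring scheme (match +1, mismatch -1, gap -1). The scoring values and example strings are illustrative, not taken from the article, and only the optimal score is computed (no traceback of the alignment itself).

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score by dynamic programming (score matrix only)."""
    n, m = len(a), len(b)
    # score[i][j] = best score aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag,
                              score[i - 1][j] + gap,   # gap in b
                              score[i][j - 1] + gap)   # gap in a
    return score[n][m]

print(needleman_wunsch("GATTACA", "GCATGCU"))   # 0 with this scoring
```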
  • Department of Computer Science Fast Text Searching with Errors
    Fast Text Searching With Errors. Sun Wu and Udi Manber, TR 91-11, Department of Computer Science, University of Arizona, Tucson, AZ 85721, June 1991. ABSTRACT. Searching for a pattern in a text file is a very common operation in many applications ranging from text editors and databases to applications in molecular biology. In many instances the pattern does not appear in the text exactly. Errors in the text or in the query can result from misspelling or from experimental errors (e.g., when the text is a DNA sequence). The use of such approximate pattern matching has been limited until now to specific applications. Most text editors and searching programs do not support searching with errors because of the complexity involved in implementing it. In this paper we present a new algorithm for approximate text searching which is very fast and very flexible. We believe that the new algorithm will find its way to many searching applications and will enable searching with errors to be just as common as searching exactly. 1. Introduction. The string-matching problem is a very common problem. We are searching for a pattern P = p_1 p_2 ... p_m inside a large text file T = t_1 t_2 ... t_n. The pattern and the text are sequences of characters from a finite character set Σ. The characters may be English characters in a text file, DNA base pairs, lines of source code, angles between edges in polygons, machines or machine parts in a production schedule, music notes and tempo in a musical score, etc.
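As a concrete reference for the problem statement above, here is a minimal dynamic-programming sketch of searching with errors in Python. It is not the bit-parallel algorithm this report introduces; the function name and example strings are invented, and it simply reports where the pattern matches within a given error bound.

```python
def approx_find(text, pattern, k):
    """Return 0-based end positions in text where pattern matches with
    edit distance (insertions, deletions, substitutions) at most k."""
    m = len(pattern)
    col = list(range(m + 1))        # column for the empty text prefix
    ends = []
    for j, c in enumerate(text):
        prev_diag = col[0]
        col[0] = 0                  # a match may start at any text position
        for i in range(1, m + 1):
            cur = min(col[i] + 1,                          # text char unmatched
                      col[i - 1] + 1,                      # pattern char unmatched
                      prev_diag + (pattern[i - 1] != c))   # match or substitution
            prev_diag, col[i] = col[i], cur
        if col[m] <= k:
            ends.append(j)
    return ends

print(approx_find("surgery", "survey", 2))   # [4, 5, 6]
```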
  • The Stringdist Package for Approximate String Matching by Mark P.J
    The stringdist Package for Approximate String Matching, by Mark P.J. van der Loo. Abstract: Comparing text strings in terms of distance functions is a common and fundamental task in many statistical text-processing applications. Thus far, string distance functionality has been somewhat scattered around R and its extension packages, leaving users with inconsistent interfaces and encoding handling. The stringdist package was designed to offer a low-level interface to several popular string distance algorithms which have been re-implemented in C for this purpose. The package offers distances based on counting q-grams, edit-based distances, and some lesser known heuristic distance functions. Based on this functionality, the package also offers inexact matching equivalents of R's native exact matching functions match and %in%. Introduction. Approximate string matching is an important subtask of many data processing applications including statistical matching, text search, text classification, spell checking, and genomics. At the heart of approximate string matching lies the ability to quantify the similarity between two strings in terms of string metrics. String matching applications are commonly subdivided into online and offline applications, where the latter refers to the situation where the string that is being searched for a match can be preprocessed to build a search index. In online string matching no such preprocessing takes place. A second subdivision can be made into approximate text search, where one is interested in the location of the match of one string in another string, and dictionary lookup, where one looks for the location of the closest match of a string in a lookup table.
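As a plain illustration of the q-gram idea mentioned above (not the stringdist package itself, which is implemented in C with an R interface), a q-gram distance can be sketched by comparing q-gram counts; the function name and example strings here are invented.

```python
from collections import Counter

def qgram_distance(a, b, q=2):
    """Sum of absolute differences between the q-gram profiles of a and b."""
    qa = Counter(a[i:i + q] for i in range(len(a) - q + 1))
    qb = Counter(b[i:i + q] for i in range(len(b) - q + 1))
    return sum(abs(qa[g] - qb[g]) for g in set(qa) | set(qb))

print(qgram_distance("leela", "leia", q=2))   # 5
```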
  • The Evaluation of Text String Matching Algorithms As an Aid to Image Search
    JOURNAL OF APPLIED COMPUTER SCIENCE, Vol. 26 No. 1 (2018), pp. 33-62. The Evaluation of Text String Matching Algorithms as an Aid to Image Search. Joanna Ochelska-Mierzejewska1, 1Lodz University of Technology, Institute of Information Technology, ul. Wólczańska 215, 90-924 Łódź, Poland, [email protected]. Abstract. The main goal of this paper is to analyse intelligent text string matching methods (like fuzzy sets and relations) and evaluate their usefulness for image search. The present study examines the ability of different algorithms to handle multi-word and multi-sentence queries. Eight different similarity measures (N-gram, Levenshtein distance, Jaro coefficient, Dice coefficient, Overlap coefficient, Euclidean distance, Cosine similarity and Jaccard similarity) are employed to analyse the algorithms in terms of time complexity and accuracy of results. The outcomes are used to develop a hierarchy of methods, illustrating their usefulness to image search. The search response time increases significantly in the case of data sets containing several thousand images. The findings indicate that the analysed algorithms do not fulfil the response-time requirements of professional applications. Due to its limitations, the proposed system should be considered only as an illustration of a novel solution with further development perspectives. The use of Polish as the language of experiments affects the accuracy of measures. This limitation seems to be easy to overcome in the case of languages with simpler grammar rules (e.g. English). Keywords: text comparison, N-gram, Levenshtein distance, Jaro coefficient, Dice's coefficient, Overlap coefficient, Euclidean distance, Cosine similarity, Jaccard similarity.