Comparison of Levenshtein Distance Algorithm and Needleman-Wunsch Distance Algorithm for String Matching

National Journal of Parallel and Soft Computing, Volume 01, Issue 01, March-2019

Khin Moe Myint Aung, Ah Nge Htwe
University of Computer Studies, Yangon
[email protected], [email protected]

Abstract

String similarity measures play an increasingly important role in text-related research and applications; they operate on string sequences and character composition. A string-based metric measures the similarity or dissimilarity (distance) between two strings for approximate string matching or comparison. Determining the similarity between texts is crucial to many applications such as clustering, duplicate removal, merging similar topics or themes, and text retrieval. Among the many string similarity methods, the Levenshtein Distance Algorithm and the Needleman-Wunsch Distance Algorithm are used in the proposed system. The proposed system is intended to compare the Levenshtein Distance Algorithm and the Needleman-Wunsch Distance Algorithm based on their f-score. The user can search effectively for the required song by typing the song title or artist name in English. The proposed system then retrieves the required song information together with a similarity score. The matching efficiencies of the two algorithms are compared by f-score and execution time. The proposed system uses the song title and artist features of the Billboard song dataset from 1965 to 2015 and is implemented in the Java programming language.

Keywords: Levenshtein Distance Algorithm, Needleman-Wunsch Distance Algorithm, Dataset, Similarity Score, f-score.

1. Introduction

String searching is a very important component of many problems, including text editing, text searching and symbol manipulation. String searching algorithms, sometimes called string matching algorithms, are an important class of string algorithms that try to find a place where one or several strings (also called patterns) occur within a larger string or text. In order to search for a pattern within a string, an algorithm is needed to find the pattern as well as the locations where it occurs in a given sequence of characters.

The proposed system analyzes similarity measurements on song information by using the Levenshtein Distance Algorithm and the Needleman-Wunsch Distance Algorithm. The objective of this research is to compare the Levenshtein Distance Algorithm and the Needleman-Wunsch Distance Algorithm based on their f-score value and execution time.

When characters are entered there may be typographical errors (typos). The Levenshtein Distance Algorithm and the Needleman-Wunsch Distance Algorithm find similar strings and display results for the predicted strings. If the user wants to search for an artist containing the keyword “oliver” but types “olover” by mistake, the system can still display the songs containing “oliver”. Similarly, if the user wants to search for song titles containing the keyword “downtown” but types “downtoun” by mistake, the system can still display the songs containing “downtown”. This works because the words “oliver” and “olover” are similar, as are “downtown” and “downtoun”.

Both algorithms are used to find similar strings in the Billboard song dataset. Levenshtein distance here refers to the number of single-character operations (insertion, replacement or deletion) needed to transform one string into another. For example, the edit distance between “bein” and “pin” is two: replacing the character ‘b’ with ‘p’ and deleting the character ‘e’ converts “bein” to “pin”.

The Needleman-Wunsch Distance Algorithm performs a global alignment to find the best match or alignment of two strings by computing the minimal alignment distance. For example, the minimal alignment distance between “bein” and “pin” is three: aligning ‘b’ with ‘p’ (mismatch), aligning ‘e’ with ‘-’ (gap cost), aligning ‘i’ with ‘i’, and aligning ‘n’ with ‘n’ converts “bein” to “pin”. Here, gap penalty = 2, match = 0 and mismatch = 1.

2. Related Works

Singla N. et al. [3] explored different kinds of string matching algorithms and searched for the best algorithm for particular applications. They described each algorithm's preprocessing and the order in which it evaluates the match.

Pandiselvam P., Marimuthu T. and Lawrance R. [2] evaluated different kinds of string matching algorithms for biological sequences such as DNA and proteins and observed their time and space complexities.

New Zin Oo [10] proposed a process for checking the spelling of a Myanmar input word and producing a suggestion list if the word is misspelt. The work is intended to develop a Myanmar language spell checker by using the Levenshtein Distance Algorithm, the Dynamic Threshold Algorithm and the Transformation Algorithm.

Khaing Su Yee [11] analyzed the DNA and protein structure of the HIV genome by using the Levenshtein Distance Algorithm and determined what kind of behaviour the sequence has.

3. Background Theory

String similarity measures play an increasingly important role in text-related research and applications in tasks such as information retrieval, text classification, document clustering, topic detection, topic tracking, question generation, question answering, essay scoring, short answer scoring, machine translation, text summarization and others. String matching algorithms are used to find the matches between the source string and the target string.

3.1. Levenshtein Distance Algorithm

Levenshtein distance (LD) is a measure of the similarity between two strings, the source string (s) and the target string (t). The distance is the number of deletions, insertions, or substitutions required to transform s into t [8]. The greater the Levenshtein distance, the more different the strings are [8].

It is fast and well suited to string similarity. It does not require the strings to have the same length. It does not consider the order of the sequence of characters when comparing.

Levenshtein Distance Algorithm

Step 1: Initialization
a) Set n to be the length of s, set m to be the length of t.
b) Construct a matrix containing 0..n rows and 0..m columns.
c) Initialize the first row to 0..m.
d) Initialize the first column to 0..n.

Step 2: Processing
a) Examine s (i from 1 to n).
b) Examine t (j from 1 to m).
c) If s[i] equals t[j], the cost is 0.
d) If s[i] doesn't equal t[j], the cost is 1.
e) Set cell d[i, j] of the matrix equal to the minimum of:
   i) The cell immediately above plus 1: d[i-1, j] + 1.
   ii) The cell immediately to the left plus 1: d[i, j-1] + 1.
   iii) The cell diagonally above and to the left plus the cost: d[i-1, j-1] + cost.

Step 3: Result
Step 2 is repeated until the d[n, m] value is found.

Table 1 shows an example of finding the Levenshtein distance between “helo” and “hello”.

Table 1. Example of Levenshtein Distance Algorithm

          h   e   l   l   o
      0   1   2   3   4   5
  h   1   0   1   2   3   4
  e   2   1   0   1   2   3
  l   3   2   1   0   1   2
  o   4   3   2   1   1   1

The distance is in the lower right-hand corner of the matrix, i.e., 1.

3.2. Needleman-Wunsch Distance Algorithm

The Needleman-Wunsch algorithm finds the optimal alignment of two strings, the source string (s) and the target string (t). It is also referred to as the optimal matching algorithm and the global alignment technique [12]. The Needleman-Wunsch distance algorithm is allowed to insert gaps (a blank character) in either or both of the strings in such a way as to make the resulting strings an optimal match. The match is optimal when the score is minimal.

It is well suited to string comparison because it considers the ordering of the sequence of characters. It finds the optimal alignment solution between the sequences. It is appropriate for finding the best alignment of two strings that are similar in length and similar across their entire length. It takes more time to compute the alignment, which decreases performance.

Needleman-Wunsch Distance Algorithm

Set n to be the length of s, set m to be the length of t.
If n = 0, return m and exit.
If m = 0, return n and exit.
Construct a matrix containing 0...n rows and 0...m columns.

Initialization
F(0, 0) = 0
F(0, i) = i * d (d is the gap penalty)
F(j, 0) = j * d

Main Iteration
For each i = 1 ... M (M is the length of the target string)
For each j = 1 ... N (N is the length of the source string)

$$F(i, j) = \min \begin{cases} F(i-1, j-1) + s(x_i, y_j) & \text{(case 1)} \\ F(i-1, j) + d & \text{(case 2)} \\ F(i, j-1) + d & \text{(case 3)} \end{cases}$$

3.3. Similarity Score

The similarity score is a measure of how similar two sets of data are to each other [13]. In this case, the two sets of data are the user input and the data in the dataset. It applies not only to the two texts but also to the different algorithm options given to the user in the application. The similarity is calculated by:

$$\mathrm{Similarity}(\mathit{source}, \mathit{target}) = \frac{\mathrm{Distance}(\mathit{source}, \mathit{target})}{\mathrm{MaximumLength}(\mathit{source}, \mathit{target})} \times 100\% \tag{1}$$

For the similarity measures, a threshold of 50 is used, which influences the classification performance (name pairs with a similarity value equal to or above the threshold are shown to the user, and pairs with a similarity value below it are not shown).

3.4. F-Score

To compare the performance of the two methods, the f-score is used as the evaluation method. The proposed system is measured using the f-measure (also called f-score), which is based on precision and recall.

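The Needleman-Wunsch distance of Section 3.2 can be sketched in the same style. The sketch below is an assumption-laden illustration, not the authors' implementation: it takes the gap penalty and mismatch cost as parameters (the Introduction example uses gap penalty 2, match 0 and mismatch 1), uses the standard three-way minimum over a diagonal step and two gap steps, and lets the empty-string cases fall out of the initialization F(0, j) = j * d instead of handling them separately. The names NeedlemanWunschSketch and distance are hypothetical.

```java
/** Illustrative sketch of a Needleman-Wunsch alignment distance (names are hypothetical). */
final class NeedlemanWunschSketch {

    /**
     * Returns the minimal global alignment cost of s and t.
     * A match costs 0, a mismatch costs {@code mismatch}, and each gap costs {@code gap}.
     */
    static int distance(String s, String t, int gap, int mismatch) {
        int n = s.length();
        int m = t.length();

        // Initialization: aligning a prefix against the empty string costs one gap per character.
        int[][] f = new int[n + 1][m + 1];
        for (int i = 1; i <= n; i++) {
            f[i][0] = i * gap;
        }
        for (int j = 1; j <= m; j++) {
            f[0][j] = j * gap;
        }

        // Main iteration: each cell is the cheapest of a (mis)match step or a gap in either string.
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int pairCost = (s.charAt(i - 1) == t.charAt(j - 1)) ? 0 : mismatch;
                int alignPair = f[i - 1][j - 1] + pairCost; // case 1
                int gapInT = f[i - 1][j] + gap;             // case 2
                int gapInS = f[i][j - 1] + gap;             // case 3
                f[i][j] = Math.min(alignPair, Math.min(gapInT, gapInS));
            }
        }
        return f[n][m];
    }

    public static void main(String[] args) {
        // Introduction example with gap penalty 2 and mismatch 1: prints 3
        System.out.println(distance("bein", "pin", 2, 1));
    }
}
```

With these costs a single gap is more expensive than a single mismatch, so the same typo can score differently under the two algorithms: “helo” against “hello” needs one gap and costs 2 here, while its Levenshtein distance is 1.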