String Similarity Search Using Edit Distance and Soundex Algorithm

Total Page:16

File Type:pdf, Size:1020Kb

String Similarity Search Using Edit Distance and Soundex Algorithm International Journal of Engineering and Advanced Technology (IJEAT) ISSN: 2249 – 8958, Volume-8 Issue-4, April, 2019 String Similarity Search Using Edit Distance and Soundex Algorithm P. Pranathi,C. Karthikeyan, D. Charishma positions, as a standout amongst the most critical principal Abstract: String similarity is a major inquiry that has been assignments, string likeness seek has been widely generally used for DNA sequencing, error tolerant, auto contemplated which checks whether two strings are completion, and data cleaning which is required in database, comparative enough for information cleaning purposes in data warehousing, and data mining. String similarity search is databases, information warehousing, and information possible in various methodologies of procedures like Edit mining frameworks. The applications that need string distance, Cosine distance, Soundex algorithm, Hamming similitude look incorporate fuzzy search, inquiry distance and Levensntein distance, etc.., These strategies can be applied for long strings, which are not possible by the current auto-fulfillment, and DNA sequencing. In Computer methodologies on the grounds that the extent of the record Science, a string metric is a metric which measures between constructed and an opportunity to manufacture such list. We two content strings for inexact string coordinating or apply distinctive string similitude strategies and check which examination and in fuzzy string seeking. A vital necessity for procedure gives progressively suitable qualities. Our similarity a string metric is satisfaction of the triangle imbalance. For measures incorporate Edit distance, Cosine similarity, Soundex instance, the strings "Raj" and "Rajeev" can be considered as algorithm, Hamming separation and Levensntein distance close. A string metric dependably gives a number Index Terms: Cosine Similarity,Edit distance, Hamming distance, Levensntein distance,Soundex Algorithm etc..,. demonstrating a calculation explicit sign of separation. The most generally known string metric is a simple one called the I. INTRODUCTION Levenshtein remove and is otherwise called alter separate. It ascertains between two information strings, restoring a String similarity search is a major question that has been number identical to the quantity of substitutions and erasures utilized for DNA sequencing, error tolerant inquiry auto required so as to change one string into another. fulfillment, and information cleaning which is required in Oversimplified string measurements, for example, database, information distribution center, and information Levenshtein remove have extended so as to incorporate mining[1]. In separation edit(s,t) ,between two string s and t, phonetic, token, syntactic and character-based techniques for the string likeness seek is to discover each string in a string factual examinations. String measurements are utilized in database D which is like a question string s with the data joining and are right now utilized in regions including end goal that edit(s,t)<=ĩ ,between two strings, s and t, the extortion discovery, unique finger impression examination, string closeness look is to discover each string t in a string literary theft location, metaphysics combining, DNA database D which is comparable in an inquiry string s, to investigation, RNA examination, picture examination, proof such an extent that edit(s,t) for a given edge ĩ .We have taken based AI, information deduplication, information mining, around 50 records and structure them into bunches as per gradual hunt, information incorporation, and semantic their similitudes. Furthermore, we likewise apply distinctive learning combination. string comparability procedures and check which strategy gives progressively suitable qualities. Our similitude II. EXISTING METHODS: measures incorporate Edit distance[2], Cosine similarity, Soundex algorithm[3], Hamming distance[4] and Strings are commonly used to address a collection of Levensntein distance[5]. Strings are generally used to speak artistic data including DNA groupings, messages, thing to an assortment of literary information including DNA overviews, and documents. Also, there are a broad number of groupings, messages, messages, item audits, and reports. string datasets accumulated from various data sources in real Also, there are a substantial number of string datasets applications[7]. In view of the way that string data from gathered from different information sources in genuine different sources may struggle brought about by the applications. Because of the way that string information from composition bungles or the refinements in data plans, as a various sources might be conflicting brought about by the champion among the most basic errands, string resemblance composing botches or the distinctions in information look for has been generally viewed as which checks whether two strings are adequately similar for information cleaning Revised Manuscript Received on April 25, 2019. purposes in databases[8], P.Pranathi, Computer Science and Engineering, Koneru Lakshmaiah Education Foundation, Vaddeswaram,AP, India. Dr. C.Karthikeyan, Computer Science and Engineering, Koneru Lakshmaiah Education Foundation, Vaddeswaram,AP, India. D.Charishma Computer Science and Engineering, Koneru Lakshmaiah Education Foundation, Vaddeswaram,AP, India. Published By: Blue Eyes Intelligence Engineering Retrieval Number: D6181048419/19©BEIESP 401 & Sciences Publication String Similarity Search Using Edit Distance and Soundex Algorithm information warehousing, and information mining structures separation between two words is the base number of There are distinctive string closeness measurements for this single-character alters required to transform one character issue yet we consider couple of measurements among them into the other which gives ideal arrangements. Ex: FOOD OFDO 2.1 COSINE SIMILARITY: # F O O D Cosine similarity is a proportion of similitude between two # 0 1 2 3 4 non-zero vectors of an inward item space which estimates the O 1 1 1 1 2 cosine of the edge between them. The cosine of edge 0° is 1, F 2 1 2 2 2 and is under 1 for some other point in the interim [0,2π). It is D 3 2 2 3 2 in this manner a judgment of introduction and not size: two vectors with a similar introduction have a cosine similitude O 4 3 2 2 3 of 1, two vectors at 90° have a comparability of 0, and two Fig2:Levensntein distance vectors oppositely restricted have likeness of – 1[9], autonomous of their extent. Cosine likeness is explicitly 2.4 JACCARD DISTANCE utilized in positive space, where the result is perfectly limited The Jaccard coefficient estimates closeness between in [0,1]. The name gets from the expression "heading limited example sets, and is characterized as the extent of the cosine": for this case, note that unit vectors are maximally crossing point isolated by the span of association of the "comparable" on the off chance that they're parallel and example sets. maximally "unique" on the off chance that they're In the event that An and B are both vacant, we characterize symmetrical (opposite). This is undifferentiated from the J(A<B)=1 cosine, which is solidarity for the most part spoke to as Hamming distance: greatest esteem when the fragments subtend a zero edge and It talks about the number of positions with same symbol zero (uncorrelated) when the sections are opposite. in both strings. Hamming distance is only defined for strings Note that these sort of limits apply for any number of of equal length. measurements, and cosine comparability is most generally distance(‘aefghi‘,’aeklui‘) = 3 utilized in high-dimensional positive spaces. For instance, in Levenshtein distance: content mining and data recovery, each term is notionally Minimal number of insertions, deletions and replacements allocated an alternate measurement and a record is portrayed needed for transforming String a into string b. by a vector where the estimation of each measurement relates Cosine Similarity: to the occasions that a term shows up in the archive[10]. Some Cosine Similarities doesn’t work with Cosine comparability dependably endeavors to give a floating values. It is major drawback while dealing with valuable proportion of how comparative two records are documents probably going to be as far as their topic. In the below graph Similarity = cos(Ө) = A.B/||A|| ||B|| Maximum similarity = 1, Minimum similarity = -1 V = cosine(A,B) = /|A|.|B| Document Pitch Bat Win Lose Magnitude of vectors which is dealing with term frequency 1 5 3 3 2 does not play any role in this similarity measurement 2 4 6 3 0 A = B = C = D, There is a huge difference between the vector’s magnitudes 3 2 5 0 4 As A and B have the higher proportion of the terms in their 4 3 4 2 3 texts, it seems A and B are more dedicated to the terms from Fig 1: Cosine Similarity for documents searched query than C and D 2.2 HAMMING DISTANCE: Therefore B -> C -> D similarity order Hamming distance is the distance between two strings of equal length is the number of positions at which the corresponding symbols are different. i.e.., the minimum number of errors that could have transformed one string into the other. Ex: The Hamming distance between: "People" and "Psikle" is 3. "Leather" and "Leutren" is 3. 1010001 and 0110001 is 2. Fig 3: Cosine similarity between four documents 6785790 and 6872790 is 3. 2.3 LEVENSNTEIN DISTANCE: Levensntein separate is one of the string metric for estimating the distinction between two arrangements. The Published By: Blue Eyes Intelligence Engineering Retrieval Number: D6181048419/19©BEIESP 402 & Sciences Publication International Journal of Engineering and Advanced Technology (IJEAT) ISSN: 2249 – 8958, Volume-8 Issue-4, April, 2019 Hamming separation gives an absolete distinction Hence the edit distance indicates that FIVE operations are For instance, if B is the light up duplicate of A, Hamming required to transform string s1 to string s2 separation of A,B is vast It is utilized to discover the level of contrasts between two pictures. ALGORITHM: The base Hamming separation is utilized to characterize if(str[i]==str[j]) some of basic ideas in coding hypothesis, for example, mistake identifying and blunder revising codes.
Recommended publications
  • The One-Way Communication Complexity of Hamming Distance
    THEORY OF COMPUTING, Volume 4 (2008), pp. 129–135 http://theoryofcomputing.org The One-Way Communication Complexity of Hamming Distance T. S. Jayram Ravi Kumar∗ D. Sivakumar∗ Received: January 23, 2008; revised: September 12, 2008; published: October 11, 2008. Abstract: Consider the following version of the Hamming distance problem for ±1 vec- n √ n √ tors of length n: the promise is that the distance is either at least 2 + n or at most 2 − n, and the goal is to find out which of these two cases occurs. Woodruff (Proc. ACM-SIAM Symposium on Discrete Algorithms, 2004) gave a linear lower bound for the randomized one-way communication complexity of this problem. In this note we give a simple proof of this result. Our proof uses a simple reduction from the indexing problem and avoids the VC-dimension arguments used in the previous paper. As shown by Woodruff (loc. cit.), this implies an Ω(1/ε2)-space lower bound for approximating frequency moments within a factor 1 + ε in the data stream model. ACM Classification: F.2.2 AMS Classification: 68Q25 Key words and phrases: One-way communication complexity, indexing, Hamming distance 1 Introduction The Hamming distance H(x,y) between two vectors x and y is defined to be the number of positions i such that xi 6= yi. Let GapHD denote the following promise problem: the input consists of ±1 vectors n √ n √ x and y of length n, together with the promise that either H(x,y) ≤ 2 − n or H(x,y) ≥ 2 + n, and the goal is to find out which of these two cases occurs.
    [Show full text]
  • Headprint: Detecting Anomalous Communications Through Header-Based Application Fingerprinting
    HeadPrint: Detecting Anomalous Communications through Header-based Application Fingerprinting Riccardo Bortolameotti Thijs van Ede Andrea Continella University of Twente University of Twente UC Santa Barbara [email protected] [email protected] [email protected] Thomas Hupperich Maarten H. Everts Reza Rafati University of Muenster University of Twente Bitdefender [email protected] [email protected] [email protected] Willem Jonker Pieter Hartel Andreas Peter University of Twente Delft University of Technology University of Twente [email protected] [email protected] [email protected] ABSTRACT 1 INTRODUCTION Passive application fingerprinting is a technique to detect anoma- Data breaches are a major concern for enterprises across all indus- lous outgoing connections. By monitoring the network traffic, a tries due to the severe financial damage they cause [25]. Although security monitor passively learns the network characteristics of the many enterprises have full visibility of their network traffic (e.g., applications installed on each machine, and uses them to detect the using TLS-proxies) and they stop many cyber attacks by inspect- presence of new applications (e.g., malware infection). ing traffic with different security solutions (e.g., IDS, IPS, Next- In this work, we propose HeadPrint, a novel passive fingerprint- Generation Firewalls), attackers are still capable of successfully ing approach that relies only on two orthogonal network header exfiltrating data. Data exfiltration often occurs over network covert characteristics to distinguish applications, namely the order of the channels, which are communication channels established by attack- headers and their associated values. Our approach automatically ers to hide the transfer of data from a security monitor.
    [Show full text]
  • Author Guidelines for 8
    I.J. Information Technology and Computer Science, 2014, 01, 68-75 Published Online December 2013 in MECS (http://www.mecs-press.org/) DOI: 10.5815/ijitcs.2014.01.08 FSM Circuits Design for Approximate String Matching in Hardware Based Network Intrusion Detection Systems Dejan Georgiev Faculty of Electrical Engineering and Information Technologies , Skopje, Macedonia E-mail: [email protected] Aristotel Tentov Faculty of Electrical Engineering and Information Technologies, Skopje, Macedonia E-mail: [email protected] Abstract— In this paper we present a logical circuits to perform a check to a possible variations of the design for approximate content matching implemented content because of so named "evasion" techniques. as finite state machines (FSM). As network speed Software systems have processing limitations in that increases the software based network intrusion regards. On the other hand hardware platforms, like detection and prevention systems (NIDPS) are lagging FPGA offer efficient implementation for high speed behind requirements in throughput of so called deep network traffic. package inspection - the most exhaustive process of Excluding the regular expressions as special case ,the finding a pattern in package payloads. Therefore, there problem of approximate content matching basically is a is a demand for hardware implementation. k-differentiate problem of finding a content Approximate content matching is a special case of content finding and variations detection used by C c1c2.....cm in a string T t1t2.......tn such as the "evasion" techniques. In this research we will enhance distance D(C,X) of content C and all the occurrences of the k-differentiate problem with "ability" to detect a a substring X is less or equal to k , where k m n .
    [Show full text]
  • Comparison of Levenshtein Distance Algorithm and Needleman-Wunsch Distance Algorithm for String Matching
    COMPARISON OF LEVENSHTEIN DISTANCE ALGORITHM AND NEEDLEMAN-WUNSCH DISTANCE ALGORITHM FOR STRING MATCHING KHIN MOE MYINT AUNG M.C.Sc. JANUARY2019 COMPARISON OF LEVENSHTEIN DISTANCE ALGORITHM AND NEEDLEMAN-WUNSCH DISTANCE ALGORITHM FOR STRING MATCHING BY KHIN MOE MYINT AUNG B.C.Sc. (Hons:) A Dissertation Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Computer Science (M.C.Sc.) University of Computer Studies, Yangon JANUARY2019 ii ACKNOWLEDGEMENTS I would like to take this opportunity to express my sincere thanks to those who helped me with various aspects of conducting research and writing this thesis. Firstly, I would like to express my gratitude and my sincere thanks to Prof. Dr. Mie Mie Thet Thwin, Rector of the University of Computer Studies, Yangon, for allowing me to develop this thesis. Secondly, my heartfelt thanks and respect go to Dr. Thi Thi Soe Nyunt, Professor and Head of Faculty of Computer Science, University of Computer Studies, Yangon, for her invaluable guidance and administrative support, throughout the development of the thesis. I am deeply thankful to my supervisor, Dr. Ah Nge Htwe, Associate Professor, Faculty of Computer Science, University of Computer Studies, Yangon, for her invaluable guidance, encouragement, superior suggestion and supervision on the accomplishment of this thesis. I would like to thank Daw Ni Ni San, Lecturer, Language Department of the University of Computer Studies, Yangon, for editing my thesis from the language point of view. I thank all my teachers who taught and helped me during the period of studies in the University of Computer Studies, Yangon. Finally, I am grateful for all my teachers who have kindly taught me and my friends from the University of Computer Studies, Yangon, for their advice, their co- operation and help for the completion of thesis.
    [Show full text]
  • Fault Tolerant Computing CS 530 Information Redundancy: Coding Theory
    Fault Tolerant Computing CS 530 Information redundancy: Coding theory Yashwant K. Malaiya Colorado State University April 3, 2018 1 Information redundancy: Outline • Using a parity bit • Codes & code words • Hamming distance . Error detection capability . Error correction capability • Parity check codes and ECC systems • Cyclic codes . Polynomial division and LFSRs 4/3/2018 Fault Tolerant Computing ©YKM 2 Redundancy at the Bit level • Errors can bits to be flipped during transmission or storage. • An extra parity bit can detect if a bit in the word has flipped. • Some errors an be corrected if there is enough redundancy such that the correct word can be guessed. • Tukey: “bit” 1948 • Hamming codes: 1950s • Teletype, ASCII: 1960: 7+1 Parity bit • Codes are widely used. 304,805 letters in the Torah Redundancy at the Bit level Even/odd parity (1) • Errors can bits to be flipped during transmission/storage. • Even/odd parity: . is basic method for detecting if one bit (or an odd number of bits) has been switched by accident. • Odd parity: . The number of 1-bit must add up to an odd number • Even parity: . The number of 1-bit must add up to an even number Even/odd parity (2) • The it is known which parity it is being used. • If it uses an even parity: . If the number of of 1-bit add up to an odd number then it knows there was an error: • If it uses an odd: . If the number of of 1-bit add up to an even number then it knows there was an error: • However, If an even number of 1-bit is flipped the parity will still be the same.
    [Show full text]
  • Sequence Distance Embeddings
    Sequence Distance Embeddings by Graham Cormode Thesis Submitted to the University of Warwick for the degree of Doctor of Philosophy Computer Science January 2003 Contents List of Figures vi Acknowledgments viii Declarations ix Abstract xi Abbreviations xii Chapter 1 Starting 1 1.1 Sequence Distances . ....................................... 2 1.1.1 Metrics ............................................ 3 1.1.2 Editing Distances ...................................... 3 1.1.3 Embeddings . ....................................... 3 1.2 Sets and Vectors ........................................... 4 1.2.1 Set Difference and Set Union . .............................. 4 1.2.2 Symmetric Difference . .................................. 5 1.2.3 Intersection Size ...................................... 5 1.2.4 Vector Norms . ....................................... 5 1.3 Permutations ............................................ 6 1.3.1 Reversal Distance ...................................... 7 1.3.2 Transposition Distance . .................................. 7 1.3.3 Swap Distance ....................................... 8 1.3.4 Permutation Edit Distance . .............................. 8 1.3.5 Reversals, Indels, Transpositions, Edits (RITE) . .................... 9 1.4 Strings ................................................ 9 1.4.1 Hamming Distance . .................................. 10 1.4.2 Edit Distance . ....................................... 10 1.4.3 Block Edit Distances . .................................. 11 1.5 Sequence Distance Problems . .................................
    [Show full text]
  • Dictionary Look-Up Within Small Edit Distance
    Dictionary Lo okUp Within Small Edit Distance Ab dullah N Arslan and Omer Egeciog lu Department of Computer Science University of California Santa Barbara Santa Barbara CA USA farslanomergcsucsbedu Abstract Let W b e a dictionary consisting of n binary strings of length m each represented as a trie The usual dquery asks if there exists a string in W within Hamming distance d of a given binary query string q We present an algorithm to determine if there is a memb er in W within edit distance d of a given query string q of length m The metho d d+1 takes time O dm in the RAM mo del indep endent of n and requires O dm additional space Intro duction Let W b e a dictionary consisting of n binary strings of length m each A dquery asks if there exists a string in W within Hamming distance d of a given binary query string q Algorithms for answering dqueries eciently has b een a topic of interest for some time and have also b een studied as the approximate query and the approximate query retrieval problems in the literature The problem was originally p osed by Minsky and Pap ert in in which they asked if there is a data structure that supp orts fast dqueries The cases of small d and large d for this problem seem to require dierent techniques for their solutions The case when d is small was studied by Yao and Yao Dolev et al and Greene et al have made some progress when d is relatively large There are ecient algorithms only when d prop osed by Bro dal and Venkadesh Yao and Yao and Bro dal and Gasieniec The small d case has applications
    [Show full text]
  • Automating Error Frequency Analysis Via the Phonemic Edit Distance Ratio
    JSLHR Research Note Automating Error Frequency Analysis via the Phonemic Edit Distance Ratio Michael Smith,a Kevin T. Cunningham,a and Katarina L. Haleya Purpose: Many communication disorders result in speech corresponds to percent segments with these phonemic sound errors that listeners perceive as phonemic errors. errors. Unfortunately, manual methods for calculating phonemic Results: Convergent validity between the manually calculated error frequency are prohibitively time consuming to use in error frequency and the automated edit distance ratio was large-scale research and busy clinical settings. The purpose excellent, as demonstrated by nearly perfect correlations of this study was to validate an automated analysis based and negligible mean differences. The results were replicated on a string metric—the unweighted Levenshtein edit across 2 speech samples and 2 computation applications. distance—to express phonemic error frequency after left Conclusions: The phonemic edit distance ratio is well hemisphere stroke. suited to estimate phonemic error rate and proves itself Method: Audio-recorded speech samples from 65 speakers for immediate application to research and clinical practice. who repeated single words after a clinician were transcribed It can be calculated from any paired strings of transcription phonetically. By comparing these transcriptions to the symbols and requires no additional steps, algorithms, or target, we calculated the percent segments with a combination judgment concerning alignment between target and production. of phonemic substitutions, additions, and omissions and We recommend it as a valid, simple, and efficient substitute derived the phonemic edit distance ratio, which theoretically for manual calculation of error frequency. n the context of communication disorders that affect difference between a series of targets and a series of pro- speech production, sound error frequency is often ductions.
    [Show full text]
  • Searching All Seeds of Strings with Hamming Distance Using Finite Automata
    Proceedings of the International MultiConference of Engineers and Computer Scientists 2009 Vol I IMECS 2009, March 18 - 20, 2009, Hong Kong Searching All Seeds of Strings with Hamming Distance using Finite Automata Ondřej Guth, Bořivoj Melichar ∗ Abstract—Seed is a type of a regularity of strings. Finite automata provide common formalism for many al- A restricted approximate seed w of string T is a fac- gorithms in the area of text processing (stringology), in- tor of T such that w covers a superstring of T under volving forward exact and approximate pattern match- some distance rule. In this paper, the problem of all ing and searching for borders, periods, and repetitions restricted seeds with the smallest Hamming distance [7], backward pattern matching [8], pattern matching in is studied and a polynomial time and space algorithm a compressed text [9], the longest common subsequence for solving the problem is presented. It searches for [10], exact and approximate 2D pattern matching [11], all restricted approximate seeds of a string with given limited approximation using Hamming distance and and already mentioned computing approximate covers it computes the smallest distance for each found seed. [5, 6] and exact covers [12] and seeds [2] in generalized The solution is based on a finite (suffix) automata ap- strings. Therefore, we would like to further extend the set proach that provides a straightforward way to design of problems solved using finite automata. Such a problem algorithms to many problems in stringology. There- is studied in this paper. fore, it is shown that the set of problems solvable using finite automata includes the one studied in this Finite automaton as a data structure may be easily imple- paper.
    [Show full text]
  • APPROXIMATE STRING MATCHING METHODS for DUPLICATE DETECTION and CLUSTERING TASKS by Oleksandr Rudniy
    Copyright Warning & Restrictions The copyright law of the United States (Title 17, United States Code) governs the making of photocopies or other reproductions of copyrighted material. Under certain conditions specified in the law, libraries and archives are authorized to furnish a photocopy or other reproduction. One of these specified conditions is that the photocopy or reproduction is not to be “used for any purpose other than private study, scholarship, or research.” If a, user makes a request for, or later uses, a photocopy or reproduction for purposes in excess of “fair use” that user may be liable for copyright infringement, This institution reserves the right to refuse to accept a copying order if, in its judgment, fulfillment of the order would involve violation of copyright law. Please Note: The author retains the copyright while the New Jersey Institute of Technology reserves the right to distribute this thesis or dissertation Printing note: If you do not wish to print this page, then select “Pages from: first page # to: last page #” on the print dialog screen The Van Houten library has removed some of the personal information and all signatures from the approval page and biographical sketches of theses and dissertations in order to protect the identity of NJIT graduates and faculty. ABSTRACT APPROXIMATE STRING MATCHING METHODS FOR DUPLICATE DETECTION AND CLUSTERING TASKS by Oleksandr Rudniy Approximate string matching methods are utilized by a vast number of duplicate detection and clustering applications in various knowledge domains. The application area is expected to grow due to the recent significant increase in the amount of digital data and knowledge sources.
    [Show full text]
  • B-Bit Sketch Trie: Scalable Similarity Search on Integer Sketches
    b-Bit Sketch Trie: Scalable Similarity Search on Integer Sketches Shunsuke Kanda Yasuo Tabei RIKEN Center for Advanced Intelligence Project RIKEN Center for Advanced Intelligence Project Tokyo, Japan Tokyo, Japan [email protected] [email protected] Abstract—Recently, randomly mapping vectorial data to algorithms intending to build sketches of non-negative inte- strings of discrete symbols (i.e., sketches) for fast and space- gers (i.e., b-bit sketches) have been proposed for efficiently efficient similarity searches has become popular. Such random approximating various similarity measures. Examples are b-bit mapping is called similarity-preserving hashing and approximates a similarity metric by using the Hamming distance. Although minwise hashing (minhash) [12]–[14] for Jaccard similarity, many efficient similarity searches have been proposed, most of 0-bit consistent weighted sampling (CWS) for min-max ker- them are designed for binary sketches. Similarity searches on nel [15], and 0-bit CWS for generalized min-max kernel [16]. integer sketches are in their infancy. In this paper, we present Thus, developing scalable similarity search methods for b-bit a novel space-efficient trie named b-bit sketch trie on integer sketches is a key issue in large-scale applications of similarity sketches for scalable similarity searches by leveraging the idea behind succinct data structures (i.e., space-efficient data structures search. while supporting various data operations in the compressed Similarity searches on binary sketches are classified
    [Show full text]
  • Finite State Recognizer and String Similarity Based Spelling
    View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by BRAC University Institutional Repository FINITE STATE RECOGNIZER AND STRING SIMILARITY BASED SPELLING CHECKER FOR BANGLA Submitted to A Thesis The Department of Computer Science and Engineering of BRAC University by Munshi Asadullah In Partial Fulfillment of the Requirements for the Degree of Bachelor of Science in Computer Science and Engineering Fall 2007 BRAC University, Dhaka, Bangladesh 1 DECLARATION I hereby declare that this thesis is based on the results found by me. Materials of work found by other researcher are mentioned by reference. This thesis, neither in whole nor in part, has been previously submitted for any degree. Signature of Supervisor Signature of Author 2 ACKNOWLEDGEMENTS Special thanks to my supervisor Mumit Khan without whom this work would have been very difficult. Thanks to Zahurul Islam for providing all the support that was required for this work. Also special thanks to the members of CRBLP at BRAC University, who has managed to take out some time from their busy schedule to support, help and give feedback on the implementation of this work. 3 Abstract A crucial figure of merit for a spelling checker is not just whether it can detect misspelled words, but also in how it ranks the suggestions for the word. Spelling checker algorithms using edit distance methods tend to produce a large number of possibilities for misspelled words. We propose an alternative approach to checking the spelling of Bangla text that uses a finite state automaton (FSA) to probabilistically create the suggestion list for a misspelled word.
    [Show full text]