A BRIEF SURVEY OF METRICS IN CODING THEORY
Ernst Gabidulin

Abstract: The main objects of coding theory are metric vector or matrix spaces. Subsets of such spaces are known as codes. The main problem is to construct codes of a given pairwise distance having maximal cardinality. The best known and most investigated spaces are Hamming spaces. Thousands of papers and books are devoted to codes in the Hamming metric; we mention here only two books [1; 2] and do not consider this metric in detail. Other metrics have been investigated much less. In this paper we give many examples of useful metrics. The review is still not exhaustive.

Keywords: metrics and norms, the uniform and non-uniform Hamming metrics, the Lee and Sharma-Kaushik metrics, the city block (Manhattan) metric, the Varshamov metric, the burst metric, the 2-dimensional burst metric, the term rank metric, the rank metric, combinatorial metrics, projective metrics, graph metrics, the subspace metric.

ACM Classification Keywords: A.0 General Literature - Conference proceedings

MSC: 94B05, 94B20, 94B25

Introduction

Coding theory studies techniques for correcting errors that arise during communication through noisy channels. Its distinguishing features are the use of discrete signals and the introduction of artificial redundancy. Discreteness allows one to describe signals in terms of abstract symbols not tied to any physical realization. The artificial redundancy makes it possible to correct errors by means of rather intricate combinatorial constructions of signals. One can say that coding theory uses a wide spectrum of mathematical tools, from simple binary arithmetic to modern algebraic geometry.

The main objects of coding theory are metric vector spaces. Subsets of such spaces are known as codes. The main problem is to construct codes of a given cardinality whose minimal pairwise distance is as large as possible. The best known and most investigated spaces are Hamming spaces, where the distance between two vectors is defined as the number of coordinates in which they differ. Thousands of papers and books are devoted to the Hamming metric; we mention here only two books [1; 2]. Other metrics have been investigated much less. In this paper we describe connections between channels and metrics and give many examples of useful metrics. The review is still not exhaustive. It is also important to mention that not every metric admits a good mathematical theory.

General properties of codes

Let $X$ be an alphabet of $q$ elements and let $X^n$ be the space of all vectors over $X$ of dimension $n$. A code $C \subseteq X^n$ of size $|C| = M$ and length $n$ is any set of $M$ $n$-vectors over $X$: $C = \{x_1, x_2, \ldots, x_M\}$. We assume also that the space of vectors $X^n$ is considered as a metric space. A metric can be defined either by a distance function or by a norm function. We shall consider only integer-valued distance functions and norms.

A distance $d(x, y)$ between $n$-vectors $x, y$ is a function satisfying the conditions:

$d(x, y) \ge 0$ for all $x, y$ (non-negativity);
$d(x, y) = 0 \iff x = y$ (zero value);
$d(x, y) = d(y, x)$ (symmetry);
$d(x, y) \le d(x, z) + d(z, y)$ for all $x, y, z$ (triangle inequality).

A norm function $N(x)$ should satisfy the following axioms:

$N(x) \ge 0$ for all $x$ (non-negativity);
$N(x) = 0 \iff x = 0$ (zero value);
$N(x + y) \le N(x) + N(y)$ for all $x, y$ (triangle inequality).

A norm function allows one to construct a distance function as follows: $d(x, y) := N(x - y)$.
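As a minimal illustration of the last construction (a sketch, not part of the paper; the alphabet $Z_5$, the toy letter norm, and the function names are assumptions chosen for the example), the following Python fragment verifies by brute force that $d(x, y) := N(x - y)$ satisfies all four distance axioms on $Z_5^2$, using a per-letter norm with $N(a) = N(-a)$, as the norms considered below have.

from itertools import product

Q, N_LEN = 5, 2  # toy alphabet Z_5 and vector length 2 (assumed for the example)

# A toy integer-valued norm on Z_5 with N(a) = N(q - a), extended to vectors
# coordinate-wise; any norm satisfying the axioms above could be used instead.
LETTER_NORM = {0: 0, 1: 1, 2: 2, 3: 2, 4: 1}

def norm(x):
    return sum(LETTER_NORM[a] for a in x)

def dist(x, y):
    # d(x, y) := N(x - y), subtraction taken coordinate-wise modulo q
    return norm(tuple((a - b) % Q for a, b in zip(x, y)))

vectors = list(product(range(Q), repeat=N_LEN))
for x in vectors:
    for y in vectors:
        assert dist(x, y) >= 0
        assert (dist(x, y) == 0) == (x == y)
        assert dist(x, y) == dist(y, x)          # holds because N(a) = N(-a)
        for z in vectors:
            assert dist(x, y) <= dist(x, z) + dist(z, y)
print("all distance axioms hold for d(x, y) = N(x - y) on Z_5^2")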
Often distance and norm functions are defined coordinate-wise. A distance $d(x, y)$ between letters of the alphabet $X$ (respectively, a norm $N(x)$, $x \in X$) is defined first. Assume that the distance takes all values $0, 1, \ldots, D$, where $D$ is the maximal possible value. Then the distance between $n$-vectors $x, y \in X^n$ is defined as

$d(x, y) = \sum_{i=1}^{n} d(x_i, y_i).$

This distance takes values $0, 1, \ldots, nD$. There also exist distance functions which are not coordinate-wise; for instance, the Varshamov distance and the rank distance (see below) cannot be represented in this manner. Similarly, for the coordinate-wise norm we have

$N_n(x) = \sum_{i=1}^{n} N(x_i).$

It is useful for applications to introduce the generator norm function

$W(z) = \sum_{i=0}^{D} N(i) z^i,$

where $N(i) = |\{x \in X : N(x) = i\}|$, $i = 0, 1, \ldots, D$, is the number of elements of $X$ with norm $i$ and $D$ is the maximal norm. The generator norm function of the extended norm $N_n(\cdot)$ is clearly

$W_n(z) = \sum_{i=0}^{nD} N_n(i) z^i = W(z)^n = \left( \sum_{i=0}^{D} N(i) z^i \right)^{n}.$

We shall consider metrics defined by a coordinate-wise norm $N_n(\cdot)$. The main problem of coding theory is to construct codes $C_n \subseteq X^n$ of maximal cardinality $M$ when the minimal distance $d$ is given. For a metric on $X$ we define the average norm $\overline{N}$ and the average pairwise distance $\overline{D}$ by

$\overline{N} = \frac{\sum_{x \in X} N(x)}{q} = \frac{\sum_{i=0}^{D} i N(i)}{q}, \qquad \overline{D} = \frac{\sum_{x \in X} \sum_{y \in X} N(x - y)}{q(q-1)} = \frac{\sum_{x \in X} \overline{N}(x)}{q},$

where $\overline{N}(x)$ denotes the average distance from $X \setminus \{x\}$ to $x$. If $X$ is an additive group, then $\overline{N}(x)$ does not depend on $x$; moreover, $\overline{N}(x) = \frac{q}{q-1}\,\overline{N}$. Hence in this case $\overline{D} = \frac{q}{q-1}\,\overline{N}$.

For the extended metric $N_n(\cdot)$ we have $\overline{N}_n = n\overline{N}$ and $\overline{D}_n = \frac{q^n}{q^n - 1}\,\overline{N}_n$.

If $C_n$ is a code of cardinality $M$ and distance $d > \overline{N}_n$, then the Plotkin-style bound is valid:

$M \le \frac{d}{d - \overline{N}_n}.$

Let $S_{d-1}(0) = \{y : N_n(y) \le d - 1\}$ be the ball of radius $d - 1$ centered at the all-zero vector $0$. Then asymptotically, when $n \to \infty$ and the relative distance $x = \frac{d-1}{Dn}$ is fixed with $xD < \overline{N}$, the volume of the ball is

$V_{d-1} = |S_{d-1}(0)| \asymp c\, q^{n(1-R)},$

where

$1 - R = H(\alpha_0, \alpha_1, \ldots, \alpha_D) + \sum_{i=1}^{D} \alpha_i \log_q N(i), \qquad \alpha_i = \frac{N(i)\,\gamma^i}{W(\gamma)}, \; i = 0, 1, \ldots, D,$

$\gamma$ is the positive root of $\frac{\gamma W'(\gamma)}{W(\gamma)} = xD$, and $H(\alpha_0, \alpha_1, \ldots, \alpha_D) = -\sum_{i=0}^{D} \alpha_i \log_q \alpha_i$.

This yields the Gilbert-style bound for a code with rate $R = \frac{\log_q M}{n}$ and distance $d - 1 = xDn$:

$R = 1 - H(\alpha_0, \alpha_1, \ldots, \alpha_D) - \sum_{i=1}^{D} \alpha_i \log_q N(i).$

One can show that this equation reduces to the simple form

$R = 1 - \log_q W(\gamma) + xD \log_q \gamma, \qquad xD = \frac{\gamma W'(\gamma)}{W(\gamma)}.$

Examples of metrics

Hamming metric

The best known metric is the Hamming metric. The Hamming norm $w_H(x)$ of a vector $x$ is defined as the number of its non-zero coordinates. The Hamming distance between $x$ and $y$ is the norm of their difference: $d(x, y) = w_H(x - y)$. The Hamming metric is strongly matched with all fully symmetric memoryless channels, that is, channels such that all non-diagonal elements of the transition probability matrix are identical. A huge number of papers are devoted to this metric.

Norms and metrics for $Z_q$

Let the alphabet $X$ be the integer ring $Z_q = \{0, 1, \ldots, q - 1\}$. Integer-valued norms and metrics can be defined in many ways. It is clear that for any norm $N(i) = N(q - i)$, $i \in Z_q$. All elements of $Z_q$ can be divided into subsets of equal-weight elements $B_j = \{a \in Z_q : N(a) = j\}$, $j = 0, 1, \ldots, D$, where $D \le \lfloor q/2 \rfloor$ is the maximal norm. If $a \in B_j$, then also $q - a \in B_j$. The maximal norm $D$ can take values between $1$ and $\lfloor q/2 \rfloor$.

Open problem: find all possible values of $D$ for $Z_q$.
Open problem: describe all non-equivalent norms for $Z_q$.
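As a small illustration of these definitions (a sketch, not taken from the paper; the function names and the sample $Z_6$ table are assumptions), the following Python fragment checks the norm axioms for a candidate norm table on $Z_q$, lists the equal-weight classes $B_j$, and counts by brute force how many valid norm tables exist for small $q$; it makes no attempt to reduce them to non-equivalent representatives.

from itertools import product

def is_norm(q, N):
    # N is a tuple (N(0), ..., N(q-1)) of candidate integer norm values on Z_q.
    if N[0] != 0 or any(N[i] == 0 for i in range(1, q)):
        return False                          # zero value only at 0
    if any(N[i] != N[q - i] for i in range(1, q)):
        return False                          # N(i) = N(q - i)
    if any(N[(a + b) % q] > N[a] + N[b] for a in range(q) for b in range(q)):
        return False                          # triangle inequality in the additive group Z_q
    return set(N) == set(range(max(N) + 1))   # norm takes all values 0, 1, ..., D

def weight_classes(q, N):
    # Equal-weight subsets B_j = {a in Z_q : N(a) = j}, j = 0, ..., D.
    return {j: [a for a in range(q) if N[a] == j] for j in range(max(N) + 1)}

example = (0, 1, 2, 3, 2, 1)                  # a sample norm table on Z_6 (D = 3)
assert is_norm(6, example)
print("B_j for the sample norm on Z_6:", weight_classes(6, example))

# Brute-force count of valid norm tables for small q (no reduction to
# non-equivalent representatives is attempted here).
for q in range(2, 7):
    D_max = q // 2
    candidates = ((0,) + t for t in product(range(1, D_max + 1), repeat=q - 1))
    print("q =", q, "valid norm tables:", sum(1 for N in candidates if is_norm(q, N)))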
Two extreme cases are the Hamming metric ($D = 1$) and the Lee metric ($D = \lfloor q/2 \rfloor$).

The Hamming metric is defined by

$N(i) = \begin{cases} 0, & \text{if } i = 0; \\ 1, & \text{if } i = 1, \ldots, q - 1, \end{cases}$

so $D = 1$. The subsets of equal-weight elements are $B_0 = \{0\}$, $B_1 = \{1, 2, \ldots, q - 1\}$.

The Lee metric is defined by

$N(i) = \begin{cases} 0, & \text{if } i = 0; \\ i, & \text{if } 1 \le i \le q/2; \\ q - i, & \text{if } q/2 < i \le q - 1, \end{cases}$

so $D = \lfloor q/2 \rfloor$. The subsets of equal-weight elements are

$B_0 = \{0\}, \; B_1 = \{1, q - 1\}, \; B_2 = \{2, q - 2\}, \; \ldots, \; B_{\lfloor q/2 \rfloor} = \{\lfloor q/2 \rfloor, \, q - \lfloor q/2 \rfloor\}.$

The main results for codes with the Lee metric were obtained by Berlekamp [7]; they are illustrated numerically in the sketch at the end of this section.

• The weight generator function

$W(z) = \begin{cases} 1 + 2z + 2z^2 + \cdots + 2z^{(q-1)/2}, & \text{if } q \text{ is odd}; \\ 1 + 2z + 2z^2 + \cdots + 2z^{q/2 - 1} + z^{q/2}, & \text{if } q \text{ is even}. \end{cases}$

• The average vector weight

$\overline{N}_n = \begin{cases} n\,\frac{q^2 - 1}{4q}, & \text{if } q \text{ is odd}; \\ n\,\frac{q}{4}, & \text{if } q \text{ is even}. \end{cases}$

• The Plotkin-style bound for the cardinality $M$ of a code with Lee distance $d$:

$M \le \frac{d}{d - \overline{N}_n}, \quad \text{if } d > \overline{N}_n.$

• The asymptotic Gilbert-style bound: a code with cardinality $M$ and Lee distance $d$ exists if

$R = \frac{\log_q M}{n} = 1 - \log_q W(\gamma) + xD \log_q \gamma, \qquad xD = \frac{\gamma W'(\gamma)}{W(\gamma)},$

where $D = \lfloor q/2 \rfloor$ and $xD = \frac{d-1}{n}$.

All non-equivalent norms for $Z_4$

1. The Hamming metric, with the subsets of equal-weight elements $B_0 = \{0\}$, $B_1 = \{1, 2, 3\}$, and with the distance matrix

$D = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{pmatrix}$ (rows and columns indexed by $x, y = 0, 1, 2, 3$).

2.
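To make the Lee-metric formulas above concrete (a hedged sketch, not from the paper; the range of $q$ and the helper names are chosen for illustration), the fragment below builds the Lee norm table on $Z_q$, reads off the coefficients $N(i)$ of the weight generator function $W(z)$, checks the per-coordinate average weight against Berlekamp's closed forms $(q^2 - 1)/(4q)$ for odd $q$ and $q/4$ for even $q$, and prints the Lee distance matrix on $Z_4$ for comparison with the Hamming matrix above.

from fractions import Fraction

def lee_norm_table(q):
    # Lee norm on Z_q: N(i) = min(i, q - i)
    return [min(i, q - i) for i in range(q)]

def weight_distribution(q):
    # N(i) in the paper's sense: number of elements of Z_q with Lee norm i
    table = lee_norm_table(q)
    return [table.count(j) for j in range(q // 2 + 1)]

def average_weight(q):
    # average Lee norm per coordinate: (sum_i i * N(i)) / q
    dist = weight_distribution(q)
    return Fraction(sum(i * n_i for i, n_i in enumerate(dist)), q)

for q in range(2, 10):
    dist = weight_distribution(q)   # coefficients of W(z), lowest degree first
    expected = Fraction(q * q - 1, 4 * q) if q % 2 else Fraction(q, 4)
    assert average_weight(q) == expected
    print(f"q={q}: coefficients of W(z) = {dist}, average weight = {average_weight(q)}")

# Lee distance matrix on Z_4, to compare with the Hamming distance matrix above
q = 4
table = lee_norm_table(q)
for x in range(q):
    print([table[(x - y) % q] for y in range(q)])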