Repairing Data through Regular Expressions

Zeyu Li, Hongzhi Wang, Wei Shao, Jianzhong Li, Hong Gao
Harbin Institute of Technology
lizeyu [email protected], [email protected], shaowei [email protected], {lijzh, honggao}@hit.edu.cn

ABSTRACT

Since regular expressions are often used to detect errors in sequences such as strings or dates, it is natural to use them for data repair. Motivated by this, we propose a data repair method based on regular expressions to make the input sequence data obey the given regular expression with minimal revision cost. The proposed method contains two steps, sequence repair and token value repair.

For sequence repair, we propose the Regular-expression-based Structural Repair (RSR in short) algorithm. The RSR algorithm is a dynamic programming algorithm that utilizes Nondeterministic Finite Automata (NFA) to calculate the edit distance between a prefix of the input string and a partial pattern of the regular expression, with time complexity O(nm^2) and space complexity O(mn), where m is the edge number of the NFA and n is the input string length. We also develop an optimization strategy to achieve higher performance for long strings. For token value repair, we combine the edit-distance-based method and associate rules by a unified argument for the selection of the proper method. Experimental results on both real and synthetic data show that the proposed method repairs data effectively and efficiently.

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Proceedings of the VLDB Endowment, Vol. 9, No. 5. Copyright 2016 VLDB Endowment 2150-8097/16/01.

1. INTRODUCTION

Due to its importance, data quality draws great attention in both industry and academia. Data quality is the measurement of the validity and correctness of data. Low-quality data may cause disasters. It is reported that data errors account for 6% of annual industrial economic loss in America [11]. According to statistics of the Institute of Medicine, data errors are responsible for 98,000 deaths per year [17]. To improve data quality, data cleaning is a natural approach, which is to repair the data with errors.

When the data size is small, it is feasible to repair errors manually. However, for big data, automatic data repair approaches are in demand. With such motivation, many automatic data repairing methods have been proposed. They fall into three categories. (1) Rule-based repairing methods amend data to make them satisfy a given set of rules [4, 14, 12]. (2) Repair methods based on truth discovery attempt to discover the truth from conflicting values [13, 9, 10]. (3) Learning-based repair employs machine learning models such as decision trees, Bayesian networks or neural networks to predict values for imputation or value revision [15, 16, 19].

Constraints or semantics of data are usually represented as rules. Rules for data repairing can be classified into two kinds, semantic rules and syntactic rules. The former are often described by dependencies among data such as CFDs and CINDs [6, 7]. Much research has been conducted on this topic. Syntactic rules are often represented by grammars and used to describe the structure of data. A popular type of grammar is the regular expression ("regex").

Regular expressions are more suitable for syntactic amendments than dependencies and other semantic rules. Dependencies are defined on semantic domains and are incapable of solving structural problems such as erroneous insertions and deletions. For example, given A → B as a dependency rule, a miswriting of attribute B can be repaired according to attribute A. However, if B is lost, the rule cannot offer instructions on the position to insert B back.

On the contrary, regular expressions can remember essential and optional tokens and their absolute or relative positions. For the above example, the regex r = A(xy)*Bz can suggest that the lost B should be added between the repetition of xy and the single z. If A is lost, r can instruct that A's absolute position is at the beginning. Therefore, regexes are appropriate choices for regular-grammar-structured data.

Error detection based on regexes has been well studied. Such techniques can tell errors in data by checking whether some components violate the regex. [18] illustrates a spreadsheet-like program called "Potter's Wheel" which can detect errors and infer structures by user-specified regexes as well as data domains. [21] explains error detection by regular expressions via an example of a 'Company Name' column from registration entries.

Regexes can be used not only to detect errors but also to repair the data automatically. Consider the scenario of an Intrusion Detection System (IDS). An IDS generally scans input data streams and matches the malicious code blocks that are harmful to user devices. The properties of such attacks are described in the following structure:

(;...;) which matches the regex Act.P.Sip.Sport · Opdir · Dip · Dport · ([Opti:vi])*. In this regex, each label represents a token with possible values. For instance, "P" can be one of {tcp, udp, ip, icmp}. If a stream with token sequence (;;) is received, it is known that […] is lost by matching with the regex. We can repair it by inserting […] into the token sequence behind […] to make it match the regex. Such an insertion is a repair.

In addition to the Snort Rule, there exist numerous other instances that are constrained by regular expressions. For example, the MovieLens rating data set contains ratings of approximately 3,900 movies made by MovieLens users [1] and obeys the format ::::://.../, that is, each movie's information is recorded under the constraint of the regular expression Seq# :: Name.Year :: (Type/)*.

Apart from those two examples, regular expressions have a large number of practical applications in constraining data structures, from smaller structures like postcodes and email addresses to larger structures like file name matching patterns in the Windows system. These examples demonstrate that regexes are widely used. Thus, a regex-based data repairing method is significant and in demand.

Data repairing approaches [12, 6] often make data obey the given rules while minimizing the differences between the revised and the original data. Even though we could still follow this idea for regex-based repairing, it is not straightforward to find a proper revision, since regex-based error detection approaches fail to tell how to make the strings obey the rule.

Edit distance is defined as the minimal number of steps required to convert one string s1 to another string s2, denoted by ed(s1, s2). It is often used to measure the differences between strings. Since regex-based data repair converts a string to one satisfying some regex, it is natural to adopt edit distance to measure the difference between the repaired data and the original. Thus, we model the problem of regex-based data repairing via edit distance as follows.

Given a string s and a regex r, find a string s′ ∈ L(r) so that for any s″ ∈ L(r), ed(s, s′) ≤ ed(s, s″), where L(r) is the language described by r, i.e. the set of all strings matching r.

To solve this problem, we develop a dynamic programming algorithm. The algorithm is inspired by the Levenshtein distance algorithm but makes improvements for better adaptation to the problem of the distance between a regex and a string rather than between two strings. To describe the subproblems for dynamic programming, we use an NFA to represent the regex. In the core recursion equation of dynamic programming, the previous and next relations of edges in the NFA are used as the counterparts of prefixes, which contribute to the incremental computation of the distances between prefixes of s and r. To accelerate the algorithm, we propose an optimization strategy that prunes redundant computational steps.

Another significant problem in regex-based repair is to select the most suitable candidate of a token from multiple legal choices for insertion or substitution. To address this problem, we propose a decision method so that repaired results can be closer to the real values. This method is based on association rules in the data mining field. The final repairing solution is determined by a comparison between the estimated confidence of the distance-based strategy and that of the association-rule-based one.

The contributions of this paper are summarized as follows.

• We use regular expressions for not only error detection but also data repair. We model the problem of data repair as an optimization problem. As far as we know, this is the first work studying regular-expression-based data repair.

• We develop a whole solution for regex-based data repair with a dynamic programming algorithm as the basic algorithm, a pruning strategy for acceleration, and a proper strategy for token value repair.

• We conduct extensive experiments to verify the efficiency and effectiveness of the proposed algorithms. Experimental results demonstrate that the proposed algorithms can efficiently repair sequences according to a regex and that the quality of the repaired data is high.

This paper is organized as follows. We give an overview of the problem in Section 2 and separate it into two subproblems, structural repair and value repair, whose solutions are discussed in Sections 3 and 4 respectively. We experimentally verify the effectiveness and efficiency of the proposed algorithms in Section 5. At last, we draw conclusions in Section 6.

2. OVERVIEW

Structural constraints are used to describe the structure that a string should obey. Each structural constraint has two components, a token sequence and token value options. The sequence describes the order of tokens in a string and the value options describe the available options for each token. A structural constraint is expressed by a grammar with variables. The grammar describes the token sequence and the variables describe token values. We focus on structural constraints whose grammars are described by regexes.

Structural constraints can be used to detect and repair errors in strings. A string s may violate a structural constraint at the structural level or at the token value level. The repair for a structural-level violation (structural repair in brief) is a holistic repair, while that for a token-value-level violation (value repair in brief) is a local repair. If value repair were conducted before structural repair, a value revised in value repair might be revised again or even deleted during structural repair. Therefore, structural repair precedes value repair, as in the following framework.

1. Repair structural errors by the regular expression pattern and construct the repair.
2. Repair value errors inside individual tokens.

As for step 1, the algorithm for structural repair will be discussed in Section 3.

Step 2 aims to choose a proper value for a token from the available options. We perform the selection with two approaches based on edit distance and associate rules respectively. The former finds the value among the available options with the minimal edit distance to the current value. Such an option may be legal since it is the most similar to the original value. However, it may miss the best choice, since it neglects the context.
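The best-repair formulation can be made concrete with a tiny brute-force baseline (our own illustration, not the paper's algorithm): enumerate every string over the alphabet up to a length bound, keep those in L(r) via Python's `re`, and return one with minimal edit distance to s. It is exponential in the length bound, which is exactly why a dedicated dynamic programming algorithm is needed; all function names here are ours.

```python
import itertools
import re

def edit_distance(a, b):
    # classical Levenshtein DP, single-row version
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # delete ca
                                     dp[j - 1] + 1,      # insert cb
                                     prev + (ca != cb))  # substitute/keep
    return dp[-1]

def best_repair_bruteforce(s, regex, alphabet, max_len):
    # enumerate the strings of L(regex) up to max_len and pick one
    # with minimal edit distance to s
    pattern = re.compile(regex)
    candidates = (''.join(t)
                  for n in range(max_len + 1)
                  for t in itertools.product(alphabet, repeat=n))
    legal = (c for c in candidates if pattern.fullmatch(c))
    return min(legal, key=lambda c: edit_distance(s, c))

print(best_repair_bruteforce("bba", r"b(ab)*", "ab", 5))  # "b" (distance 2)
```

For s = "bba" and r = b(ab)*, both "b" and "bab" are at distance 2; the enumeration order makes "b" the returned repair.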

[Figure: (a) An NFA A. (b) Prefixes of NFA A.]

To generate a more accurate repair, we apply the associate-rule-based method. It discovers association rules between a token candidate and its context and determines the repair value by such rules. Those methods will be elucidated and exemplified in Section 4.
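The edit-distance-based selection for value repair can be sketched as follows (a minimal illustration with hypothetical helper names; the option set mirrors the "P" token of the IDS example in Section 1). As the text notes, this strategy ignores context, which the association-rule-based method of Section 4 addresses.

```python
def edit_distance(a, b):
    # classical Levenshtein DP, single-row version
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def repair_token_value(token, options):
    # pick the legal option closest to the observed token; the context
    # of the token is ignored, so the best choice may be missed
    return min(options, key=lambda opt: edit_distance(token, opt))

# legal values of the "P" token from the IDS example
print(repair_token_value("udpp", ["tcp", "udp", "ip", "icmp"]))  # udp
```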

3. REGEX-BASED STRUCTURAL REPAIR (RSR) ALGORITHM

Figure 1: An Example of Prefixes of NFA.

For structural repair, we propose the Regex-based Structural Repair algorithm.

3.1 A Sketch of RSR Algorithm

In this subsection, we give a formal definition of the structural repair problem and an overview of the RSR algorithm. The target of structural repair is to convert the input string s to a string s′ which matches the given regex r and has the minimum edit distance to s. We define s′ formally as follows.

Definition 1. Given a regex r and a string s, for an s′ ∈ L(r), if ∀s″ ∈ L(r), ed(s, s″) ≥ ed(s, s′), then s′ is defined as the best repair to r of s, where L(r) is the language of r and ed(s1, s2) computes the edit distance between s1 and s2.

The definition of best repair is suitable for describing the repair of s according to regex r. On one hand, s′ matches r, so when s is transformed to s′, it also matches r; such a transformation "repairs" s. On the other hand, since s′ has the minimum distance to s, it represents a repairing method with the least cost. Therefore, the algorithm is to find the best repair of s to r.

We study the classical dynamic programming algorithm for computing the edit distance between strings, which proves to be optimal [5], since the required algorithm involves it. In this algorithm, the prefixes of the two input strings form the subproblems [20]. Inspired by this, we attempt to track the distances of prefixes. However, '|' and closure in r make it impossible to directly use a prefix of r to form the subproblem.

Fortunately, since an NFA is used for matching strings to a regex, and a connected part of its states can match a substring, we can design a dynamic programming algorithm with a subproblem corresponding to a part of r's NFA and a prefix of s. We define the prefix of an NFA.

Definition 2. L(A) denotes the set of all strings that can be accepted by an NFA A. sp denotes the prefix of a string s which is one character shorter than s. A prefix of A, denoted by Ap, is an NFA that satisfies the following two conditions:
1. ∀s ∈ L(A), ∃Ap ∈ PS(A), s.t. sp ∈ L(Ap), where PS(A) is the set of all prefixes of A.
2. The state transition diagram of Ap is a subgraph of that of A.

We use Figure 1 to illustrate an NFA and its prefixes. From Definition 2, ∀sp ∈ L(Ap), ∃s ∈ L(A), length(s) = length(sp) + 1.

When a string s is accepted by an NFA A, the last character of s exactly matches some edge whose end state is one of A's accepting states. We call these edges final edges. The set of final edges of A is denoted by TAIL(A).

Final edges can describe the relation between A and its prefixes. Given a string s ∈ L(A), its prefix sp ∈ L(Ap), e and ef denoting edges, and the function tr(e) returning the transition symbol of e, ∃ef ∈ TAIL(A) that satisfies sp + tr(ef) = s.

Structural repair with a string s and a regex in the form of an NFA N as input is defined as follows. Its subproblems are to find the edit distances between the prefixes of s and N.

Definition 3. Given a string s and an NFA N, the edit distance from s to N (denoted by ed*(s, N)) is defined as the minimal edit distance between s and a string s′ in L(N). That is, ed*(s, N) = min{ed(s, s′)}, s′ ∈ L(N).

Thus, given si as the prefix of s with length i and an NFA A, the relationship between ed*(si, A) and its subproblems has four cases: (1) si has been revised to match Ap and si is revised to match A by appending a token; (2) if si−1 is revised to match Ap and the last token of si (denoted by s[i]) equals tr(ef) for some ef ∈ TAIL(A), then si matches A; (3) if si−1 is revised to match Ap and s[i] equals none of the tr(ef), then si is revised to match A by substituting some tr(ef) for s[i]; (4) si−1 has been revised to match A and si is revised to match A by deleting s[i]. In summary, the following recursion function describes this relationship:

  ed*(si, A) = min { ed*(si, Ap) + 1            (a)
                     ed*(si−1, Ap) + f(s[i], ef)  (b)
                     ed*(si−1, A) + 1           (c) }        (1)

f(s[i], ef) is defined as follows:

  f(s[i], ef) = 1 if s[i] ≠ tr(ef);  0 if s[i] = tr(ef).

The options marked (a), (b) and (c) correspond to different edit distance operations. If c(A) denotes any character in {tr(e) | e ∈ TAIL(A)}, then (a) inserts a c(A) at the end of si; (b) substitutes s[i] by a c(A) when f = 1, or does nothing when f = 0, which means that s[i] ∈ {tr(e) | e ∈ TAIL(A)}; (c) deletes s[i]. These options are proposed to search the optimal solutions of the subproblems for the computation of ed*(si, A).

3.2 RSR Algorithm

We discuss the details of the RSR algorithm.

3.2.1

To store the intermediate results of the subproblems in Equation (1), we use a matrix C, with the minimum C(i, e) over e ∈ TAIL(A) representing ed*(si, A). That is, given Se as the set of strings that pass e as the last edge, C(i, e) is the minimum of ed(si, s′) for all s′ ∈ Se. Besides, ∪ Se = L(A) for e ∈ TAIL(A).
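Over a plain edge-list encoding of an NFA (a representation we assume here; the helper names are ours), the final edges TAIL(A) and the edge predecessors PRE(e) can be computed directly. The automaton below is the NFA of b(ab)* used later in Example 1 and Table 4:

```python
# NFA of b(ab)*: states q0..q3; edges numbered 0..3 as in Table 4
edges = [(0, 1, 'b'), (1, 2, 'a'), (2, 3, 'b'), (3, 2, 'a')]
accepting = {1, 3}

def TAIL(edges, accepting):
    # final edges: edges whose end state is an accepting state
    return {e for e, (_, dst, _) in enumerate(edges) if dst in accepting}

def PRE(edges, e):
    # predecessor edges: edges whose end state is where edge e starts
    # (an empty set plays the role of the virtual start mark φ)
    src = edges[e][0]
    return {ep for ep, (_, dst, _) in enumerate(edges) if dst == src}

print(TAIL(edges, accepting))  # {0, 2}
print(PRE(edges, 2))           # {1, 3}
```

The outputs match Table 4: the final edges are 0 and 2, and edge 2 has predecessors 1 and 3.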

Then, according to Equation (1), we have the following recursion equation for C, where PRE(e) = {e′ | e′.endstate = e.startstate} and ep ∈ PRE(e):

  C(i, e) = min { C(i, ep) + 1                 (a)
                  C(i−1, ep) + f(s[i], tr(e))  (b)
                  C(i−1, e) + 1                (c) }        (2)

Since strings matching A must pass one of the e ∈ TAIL(A), and C(i, e) records the minimum edit distance of strings through e, ed*(si, A) is computed from the collection of all possible C(i, e). Likewise, with TAIL(Ap) = ∪ PRE(e) for e ∈ TAIL(A), ed*(si, Ap) is computed from the collection of C(i, ep) with ep ∈ PRE(e). As a result, Equation (2) is equivalent to Equation (1).

Besides C, we use another matrix H to track the edit operators, with items as triples (r, c, op), where r and c restore the repair sequence and op records the operation. For each entry t in H, t.r and t.c are the row and column of the entry in C prior to t in the repair sequence, and t.op is the repair operator with four choices, N, D, I and S. They are listed in Table 1; column Arg shows their arguments. Note that only operators I and S have arguments.

Table 1: Arguments and Function of Operators
  op  Arg      Function
  I   token c  Insert c after s[r]
  S   token c  Substitute s[r] by c
  D   -        Delete the token at s[r]
  N   -        Non-operation

3.2.2 Operator Selection Order

According to Equation (2), an operator is selected from four choices. Note that these operators are considered with different priorities in the RSR algorithm: for each row, D and S are considered in the first round and I is considered in the second round.

Such priority is caused by circles in the NFA. Unlike the prefix relation of strings, the "previous" relation of edges in an NFA characterized by PRE(e) is not a strict partial order. For example, given edges a and b, if a.startstate = b.endstate and a.endstate = b.startstate, then a ∈ PRE(b) and b ∈ PRE(a), which violates the asymmetry of a partial order. This unordered property results in the uncertainty of C(i, a) and C(i, b) illustrated in Example 1.

To avoid this uncertainty, deletions and substitutions are considered first, since these two operations involve only values in the (i−1)-th row of C. Insertions, however, require the correctness of all C(i, ep) (ep ∈ PRE(e)) when calculating C(i, e). After the first round, all C(i, e)'s in the i-th row are temporarily filled by deletions or substitutions. Then insertions are considered. If the cost of an insertion is smaller for some C(i, e), its op is changed to I, and its r and c are changed correspondingly.

The priority also solves the problem of selecting among multiple "best repairs". Since insertions are conducted second, given equal repair cost of a substitution or a deletion versus an insertion, the insertion is not considered. Hence, the string produced with the least edit distance is unique.

As the cost, this priority adjustment declines the accuracy of the structural repair, especially for regexes with a lot of small closures. The harm has two aspects, wrong deletion and hidden error.

Wrong deletion. Since losses (repaired by insertions) are handled after redundancies (repaired by deletions) and displacements (repaired by substitutions), if a mistaken deletion or substitution costs the same as the correct insertion, the accurate repair is missed. Moreover, short token sequences such as bc in a(bc)*d lead to insufficient context information, making the cost of the incorrect deletion no larger than that of the correct one.

Hidden error. It is caused by small closures in the regexes. Since the closure operator cannot count the number of tokens, the loss of a substring matching a single closure will not be discovered.

In conclusion, such inaccuracy comes from the multi-repair-strategy problem. Note that similar cases may hardly happen in real occasions, since few practical regexes have short token closures or disregard the number of tokens. More concrete and specific regexes can be used to solve this problem. For example, if the strings that match ab*c have more than 3 tokens b, then the regex abbbb*c would be better than ab*c: it guarantees that the number of b's in the repaired string is no less than 3. In addition, assigning a priority to tokens in the regex for multi-repair-strategy selection would contribute to the effectiveness of the RSR algorithm. The design of this priority is left for future work.

3.2.3 Algorithm Description

The pseudo code of the RSR algorithm is Algorithm 1. The inputs are a string s and an NFA A representing a regex r(A), with E as the edge set and F as the set of accepting states. The outputs are the smallest edit distance between r(A) and s and the edit operator sequence that converts s to match r(A). We denote the number of edges in the shortest path from the start state to the end vertex of e as dis(e).

Algorithm 1: RSR algorithm
Input: string s, NFA A = (P, Σ, E, q0, F).
Output: Minimal edit distance from s to A; generating matrix H.
 1: C(0, e) ← dis(e), e ∈ E
 2: C(i, φ) ← i, i ∈ (0, len(s))
 3: for i ← 1 to len(s) do
 4:   for all e ∈ E do
 5:     if s[i] = tr(e) then
 6:       C(i, e) ← min{C(i − 1, ep)}, ep ∈ PRE(e)
 7:       H(i, e) ← (i − 1, ep^m, [N])
 8:     else
 9:       C(i − 1, e^m) ← min{C(i − 1, ep), C(i − 1, e)}
10:       C(i, e) ← C(i − 1, e^m) + 1
11:       if e^m ∈ PRE(e) then
12:         H(i, e) ← (i − 1, e^m, [S, tr(e)])
13:       else
14:         H(i, e) ← (i − 1, e^m, [D])
15:   Q ← E  {Q is a priority queue}
16:   while Q ≠ ∅ do
17:     e ← ExtractMin(Q)
18:     for all en ∈ NEXT(e) do
19:       if C(i, e) + 1 < C(i, en) then
20:         C(i, en) ← C(i, e) + 1
21:         H(i, en) ← (i, e, [I, tr(en)])
22: C(len(s), et^m) ← min{C(len(s), et)}, et ∈ TAIL(A)
23: return C(len(s), et^m) and H

Line 1 initializes each C(0, e) by dis(e), which corresponds to generating a string that reaches e with the least length, and Line 2 initializes each C(i, φ) by i, since the edit distance between a string of length i and an empty NFA is i (i deletions).
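A runnable sketch of the two-round computation of Algorithm 1, restricted to the distance (the H matrix for backtracking is omitted, and the edge-list encoding and function names are our assumptions): round one fills a row with the deletion/substitution cases of Equation (2), and round two relaxes the insertion case in priority order, as in Lines 15-21.

```python
import heapq
from collections import defaultdict

PHI = -1  # virtual edge standing for the empty-prefix column φ of C

def rsr_distance(s, edges, start, accepting):
    """Minimal edit distance from s to L(A); A is an edge list [(src, dst, symbol)]."""
    E = range(len(edges))
    PRE, NEXT = defaultdict(list), defaultdict(list)
    for e in E:
        src = edges[e][0]
        if src == start:
            PRE[e].append(PHI)
            NEXT[PHI].append(e)
        for ep in E:
            if edges[ep][1] == src:
                PRE[e].append(ep)
                NEXT[ep].append(e)

    # dis(e): length of the shortest string whose last transition is e
    dis = {e: float('inf') for e in E}
    pq = [(1, e) for e in E if edges[e][0] == start]
    heapq.heapify(pq)
    while pq:
        d, e = heapq.heappop(pq)
        if d >= dis[e]:
            continue
        dis[e] = d
        for en in NEXT[e]:
            heapq.heappush(pq, (d + 1, en))

    prev = {PHI: 0, **dis}                      # row 0: C(0, e) = dis(e)
    for i, ch in enumerate(s, 1):
        cur = {PHI: i}
        # round 1: substitution/match (b) and deletion (c) of Equation (2)
        for e in E:
            best_pre = min((prev[p] for p in PRE[e]), default=float('inf'))
            cur[e] = min(best_pre + (ch != edges[e][2]), prev[e] + 1)
        # round 2: insertions (a), relaxed in priority order
        pq = [(c, e) for e, c in cur.items()]
        heapq.heapify(pq)
        while pq:
            c, e = heapq.heappop(pq)
            if c > cur[e]:
                continue
            for en in NEXT[e]:
                if c + 1 < cur[en]:
                    cur[en] = c + 1
                    heapq.heappush(pq, (c + 1, en))
        prev = cur

    best = min(prev[e] for e in E if edges[e][1] in accepting)  # TAIL(A)
    if start in accepting:                      # the empty string is in L(A)
        best = min(best, prev[PHI])
    return best

# Example 1's inputs: r = b(ab)*, s = "bba"; edges numbered as in Table 4
edges = [(0, 1, 'b'), (1, 2, 'a'), (2, 3, 'b'), (3, 2, 'a')]
print(rsr_distance("bba", edges, 0, {1, 3}))  # 2
```

On Example 1's inputs this reproduces the rows of Table 3 and the final distance 2.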

[Figure 2: NFA of b(ab)*.]  [Figure 3: Position in Classical Algorithm.]

Table 2: Start                     Table 3: Final
      φ  0/b  1/a  2/b  3/a             φ  0/b  1/a  2/b  3/a
  ε   0   1    2    3    4          ε   0   1    2    3    4
  b1  1   0   2/1   2   4/3         b1  1   0    1    2    3
  b2  2                             b2  2   1    1    1    2
  a   3                             a   3   2    1    2    1

Table 4: Edges
  Seq.#  s→e      tr  PRE  NEXT
  0      q0 → q1  b   φ    1
  1      q1 → q2  a   0    2
  2      q2 → q3  b   1,3  3
  3      q3 → q2  a   2    2

Lines 3-21 compute C(i, e) and H(i, e). Due to the different priorities, the D and S operators are first considered in Lines 4-14: Line 6 corresponds to Equation (2)-(b) with f = 0, while Line 9 corresponds to (c) or (b) with f = 1. Insertions are considered in Lines 16-21, in which Line 19 judges whether an insertion is better according to Equation (2)-(a).

Then the matrices C and H are derived. To obtain the repaired string sr and the log slog, Algorithm 2 shows a top-down traversal with H as input and sr and slog as output. Line 1 initializes the variables for the repair sequence generation in Lines 3-6.

Algorithm 2: String Restoring Method
Input: String generating matrix H, len(s) and edge et^m derived in Algorithm 1.
Output: Repaired string sr and logging slog.
 1: sr, slog ← empty string
 2: α ← len(s), β ← et^m
 3: while H(α, β).r ≠ 0 do
 4:   sr ← tr(H(α, β).c) + sr
 5:   slog ← H(α, β).op + slog
 6:   α ← H(α, β).r, β ← H(α, β).c
 7: return sr and slog

We clarify the distinction between the RSR algorithm and edit distance computation as follows. Although experimental results state that edit distance calculation dominates the time cost (> 99.5%), merely computing the edit distance cannot solve the entire problem: a procedure to generate the repaired string is needed. Furthermore, the generated token string at the structure level is not enough; hence we propose "value repair" in Section 4. In summary, the distinction is that computing the edit distance is not the entire methodology for repair from the structure level to the value level.

We use a detailed example (Example 1) to elucidate the calculating process and a holistic one (composed of Example 2 and Example 4) to illustrate the overall repair.

Example 1. The inputs are r = b(ab)* and s = bba. The NFA of r is in Figure 2. The PRE and NEXT sets (if ep ∈ PRE(e), then e ∈ NEXT(ep)) and tr() for the edges are in Table 4, where a uniform number is assigned to each edge. The initialized entries are highlighted in bold in Table 2. We use subscripts to distinguish the two b's in "bba".

For the row of b1, C(b1, ei) is filled with only deletion or substitution in the first round as follows. With b1 = tr(e0) and C(ε, φ) = 0, C(b1, e0) = 0; with b1 ≠ tr(e1) and C(ε, e0) + 1 < C(ε, e1) + 1, C(b1, e1) = C(ε, e0) + 1 = 2; with b1 = tr(e2), C(b1, e2) = C(ε, e1) = 2; with b1 ≠ tr(e3) and C(ε, e2) + 1 < C(ε, e3) + 1, C(b1, e3) = C(ε, e2) + 1 = 4.

Note that C(b1, e1) and C(b1, e3) are incorrect without insertion operations, which accounts for the "uncertainty" mentioned in Subsection 3.2. Thus we consider insertions in the following steps. With a higher priority of ei implied by a smaller C(b1, ei), the priority queue Q(b1) is initialized as {e0, e1, e2, e3}. The computation steps are as follows. For e0, with e1 ∈ NEXT(e0) and C(b1, e1) > C(b1, e0) + 1, C(b1, e1) is updated to 1; for e1, since e2 ∈ NEXT(e1) and C(b1, e2) = C(b1, e1) + 1, C(b1, e2) is not modified; for e2, C(b1, e3) is changed to 3 for the same reason as C(b1, e1); for e3, no modification is possible. Since Q is then empty, the modification terminates. The modified values in the row corresponding to b1 are shown after the slashes in Table 2. The executions for b2 and a are similar to the above procedure. Table 3 shows the final C. Since TAIL(A) = {e0, e2}, the minimal edit distance is min{C(a, e0), C(a, e2)} = 2.

We give an example of holistic repair on IDS rules.

Example 2. Suppose a rule can be transformed through lexical analyses to …[Opti:E2;Optj:vj;], where E1 and E2 are undiscernible tokens. The RSR algorithm runs as follows. For the interest of space, the details of the calculation are omitted. The edge list and transition symbols are in Table 6 and the derived matrix C is given in Table 5, in which the path of the traversal is marked by underlines in the original. Algorithm 2 generates sr and slog; E1 and E2 are substituted respectively by P and Vj. Hence,

  sr = Op.P.SIP.SP.Opdir.DIP.DP.[Opti:Vi;Optj:Vj;]
  slog = [N][S, ][N][N] ... [N][S, ][N][N], where the elided middle run consists of 12 [N] operators.

Table 5: Matrix C of Example 2
       0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
   0   0  1  2  3  4  5  6  7  8  9  9 10 11 12 13 13
   1   1  0  1  2  3  4  5  6  7  8  8  9 10 11 12 12
   2   2  1  1  2  3  4  5  6  7  8  8  9 10 11 12 12
   3   3  2  2  1  2  3  4  5  6  7  7  8  9 10 11 11
   4   4  3  3  2  1  2  3  4  5  6  6  7  8  9 10 10
   5   5  4  4  3  2  1  2  3  4  5  5  6  7  8  9  9
   6   6  5  5  4  3  2  1  2  3  4  4  5  6  7  8  8
   7   7  6  6  5  4  3  2  1  2  3  3  4  5  6  7  7
   8   8  7  7  6  5  4  3  2  1  2  2  3  4  5  6  6
   9   9  8  8  7  6  5  4  3  2  1  2  2  3  4  5  5
  10  10  9  9  8  7  6  5  4  3  2  3  1  2  3  4  4
  11  11 10 10  9  8  7  6  5  4  3  4  2  1  2  3  3
  12  12 11 11 10  9  8  7  6  5  4  5  3  2  1  2  2
  13  13 12 12 11 10  9  8  7  6  5  6  2  3  2  1  2
  14  14 13 13 12 11 10  9  8  7  6  7  1  2  3  2  3
  15  15 14 14 13 12 11 10  9  8  7  8  2  2  3  3  4
  16  16 15 15 14 13 12 11 10  9  8  9  3  3  2  3  3
  17  17 16 16 15 14 13 12 11 10  9  9  4  4  3  3  2

Table 6: Edges
   #   e      tr(e)
   1   0-1    Op
   2   1-2    P
   3   2-3    SIP
   4   3-4    SP
   5   4-5    Opdir
   6   5-6    DIP
   7   6-7    DP
   8   7-8    [
   9   8-9    Opt
  10   8-10   ]
  11   9-11   :
  12   11-12  V
  13   12-13  ;
  14   13-9   Opt
  15   13-10  ]

We can now see the difference between the RSR algorithm and the classical edit distance algorithm. In the classical approach, the elements of the matrix are ordered in the horizontal direction from left to right. When filling a cell, its left, above and left-above neighbors are involved, which constitutes the physical relation shown in Figure 3: the calculation of the grey cell needs the three cells pointed to by arrows. As a comparison, in the RSR algorithm, edges listed in neighboring columns may not be directly connected; the interconnections are maintained by PRE and NEXT. For example, in Row 13

and 14 of Table 5, the path is not connected physically, but edge 14 is in PRE(edge 11).

3.3 Properties of RSR Algorithm

This subsection shows the properties of the RSR algorithm.

3.3.1 Correctness

Theorem 1. In Algorithm 1, min over et ∈ TAIL(A) of C(len(s), et) is the minimum edit distance between the string s and the regex r with NFA A.

In order to prove Theorem 1, we first substantiate four lemmas.

Lemma 1. If e* is the first edge extracted from the priority queue Q in Line 17 of Algorithm 1 and s′ ∈ Se*, then C(i, e*) = ed(si, s′).

Proof: Suppose that there exists ep* ∈ PRE(e*) with C(i, e*) > C(i, ep*) + 1, i.e., satisfying the condition in Line 19 to modify C(i, e*). Then C(i, ep*) < min{C(i, e)} = C(i, e*), which contradicts the assumption that e* is the minimum in Q. Thus, this lemma holds. □

Lemma 2. After substitutions and deletions, for any C(i, e) + 1 < C(i, en) with en ∈ NEXT(e), C(i, en) = C(i, e) + 2.

Proof: Given edges a, b and e′, where b ∈ NEXT(a), e′ ∈ PRE(b) and

  C(i, b) > C(i, a) + 1.   (3)

According to Line 6 and Line 9, we have

  C(i, b) = min{C(i − 1, e′), C(i − 1, b)} + 1   (4)

or

  C(i, b) = min{C(i − 1, e′)}.   (5)

If (5) held, then

  C(i, b) ≤ C(i − 1, a).   (6)

Consider the comparison between C(i − 1, a) and C(i, a). It is possible that C(i − 1, a) = C(i, a) − 1, C(i − 1, a) = C(i, a), or C(i − 1, a) = C(i, a) + 1. If C(i − 1, a) = C(i, a) + 2 held, then only one step, cutting s[i], would be required to make C(i − 1, a) = C(i, a) + 1. Hence,

  C(i − 1, a) ≤ C(i, a) + 1.   (7)

Combining (6) with (7), we would get C(i, b) ≤ C(i, a) + 1, which contradicts (3). Hence, C(i, b) = min{C(i − 1, e′), C(i − 1, b)} + 1. On the other hand, min{C(i − 1, e′)} ≥ min{C(i − 1, e′), C(i − 1, b)}. Therefore, C(i, b) ≤ min{C(i − 1, e′)} + 1 ≤ C(i − 1, a) + 1. Combining with (7), it holds that

  C(i, b) ≤ C(i − 1, a) + 1 ≤ (C(i, a) + 1) + 1 = C(i, a) + 2.   (8)

Due to (3) and (8), C(i, b) = C(i, a) + 2. □

Lemma 3. For each e extracted from Q, C(i, e) is stable and optimal, that is, for s′ ∈ Se, C(i, e) = ed(si, s′).

Proof: The only case that could invalidate this lemma is that after an edge e is popped from Q, C(i, e) could still be decreased. In such a case, consider edges e and ek, where ek is inferior to e in Q, with e ∈ NEXT(ek). When another extracted edge modifies C(i, ek), C(i, e) could possibly be lessened as well, since e ∈ NEXT(ek). But because e is extracted before ek, C(i, ek) would not "correct" C(i, e), which would result in the incorrectness of C(i, e).

We now prove that this case can never happen. Supposing that C(i, ek) is modified from x to x − 1, with C(i, e) ≤ x, the new C(i, ek) has only two possibilities: C(i, ek) = C(i, e) − 1 or C(i, ek) ≥ C(i, e). The above case happens only under the former condition, where C(i, ek) = C(i, e) − 1. However, according to Lemma 2, if C(i, ek) were about to modify C(i, e), then C(i, e) = C(i, ek) + 2. Therefore, we have proved that for any e in E, C(i, e) will not be modified by edges extracted later from Q. That is, for every e extracted from Q, C(i, e) is stable and optimal. □

Lemma 4. Given an NFA A, A's prefix Ap and an edge e ∈ TAIL(Ap), if for any Ap the C(i − 1, e)'s cover ed*(si−1, Ap), then the C(i, e)'s cover any ed*(si, Ap) after Algorithm 1.

Proof: In Subsections 3.1 and 3.2, we proposed the recursion function (1) and its equivalent form (2) expressed by C. Given this equation, if we substantiate that C(i, e) in Algorithm 1 is filled strictly according to (2), then Lemma 4 is proven. On the other hand, we reform the execution order of the recursion function by the priority of operations mentioned in Subsection 3.2. Its correctness is proven by the combination of Lemma 1 and Lemma 3, since for the first and every following edge e extracted from Q, C(i, e) is optimal. Therefore, Lemma 4 is validated by the correctness of (2) and the reformation above. □

Then, we give the proof of Theorem 1.

Proof (Theorem 1): Since this algorithm is based on dynamic programming, its correctness is proven by induction on i, the length of the prefixes of s. When i = 0, all C(ε, e) with e ∈ E are initialized as dis(e). Because the edit distance from the empty string ε to a shortest string in L(A) whose last edge is e ∈ TAIL(A) is exactly dis(e), the C(ε, e)'s contain ed(ε, A).

The inductive hypothesis is that all C(i − 1, e) are correct for each e and contain the optimal distance values of any ed*(si−1, A) given e ∈ TAIL(A). We attempt to prove that the C(i, e)'s can then provide valid ed*(si, A) under the above hypothesis; Lemma 4 proves exactly this inductive step. Therefore, C(len(s), e) records ed*(s, Ap) with e ∈ TAIL(Ap). Particularly, when et ∈ TAIL(A), C(len(s), et) gives the edit distances from s to strings in L(A), and ed*(s, A) = min{C(len(s), et)}, since it has the minimum repairing cost. Theorem 1 is proven. □

3.3.2 Complexity

The complexity of Algorithm 1, shown in Theorem 2, is determined by the length of the string s (denoted by n) and the number of edges in the NFA A (denoted by m).

Theorem 2. The time complexity of the RSR algorithm is O(nm²) and the space complexity is O(nm).

Proof: The RSR algorithm's time complexity is comprised of three parts: the cost of initialization (denoted by T1), the

437 cost of computation of C and H (T2) and the cost of repair sequence generation (T3). Clearly, T = T1 + T2 + T3. Algorithm 3: Optimized Algorithm

The time complexity of the initialization (Line 1 in Algo- Input: String s, NFA A = (P, Σ, E, q0,F ) rithm 1) is T1 = O(m) + O(n). Output: Optimized edit distance s to A 1: for all e ∈ E do During the computation of C and H, ‘for’ loop begin- 2: C(0, e) ← dis(e) ning at Line 3 will be executed for n times. Therefore, 3: for i ← 0 to len(s) do T2 = n × (T2.1 + T2.2 + T2.3), where T2.1, T2.2 and T2.3 cor- 4: C(i, φ) ← i respond to the time complexities of ‘for’ loop (Line 4-14), 5: for i ← 1 to len(s) do 6: for all e ∈ E with dis(e) ≤ i + 1 do creating priority queue (Line 15) and ‘while’ loop (Line 16- 7: if s[i] = tr(e) then 21), respectively. In the worst case, as for Line 6 and Line 9, 8: C(i, e) ← min{C(i − 1, ep)}ep∈P RE(e) |PRE(e)| = m. Thus the cost of traversing PRE(e) to get 9: else the minimum C(i, e ) is m and T = O(m × m) = O(m2). 10: C(i, e) ← min{C(i − 1, ep),C(i − 1, e)} + 1 p 2.1 11: Q ← E0 {E0 ← {e|dis(e) ≤ i + 1, e ∈ E}} Time complexity of initializing the priority queue Q is 12: while Q 6= ∅ do T2.2 = O(m lg m). In the ‘while’ loop at Line 16, the number 13: e ←ExtractMin(Q) of edge extraction is m and in the worst case, |NEXT (e)| = 14: for all en ∈ NEXT (e) do 2 15: if C(i, e) + 1 < C(i, en) then m. Thus, T2.3 = O(m × m) = O(m ). Therefore, T2 = 16: C(i, en) ← C(i, e) + 1 2 2 2 n × (O(m ) + O(m lg m) + O(m )) = O(nm ). 17: return min{C(len(s), et)}, (et ∈ T AIL(A)) The cost of finding min{C(len(s), et]} is |T AIL(A)|, whi- ch equals to m in the worst case. It takes at most max{len(s) +min(dis (T AIL(A)))} steps to restore the string in H. Table 7: Optimized Matrix C Table 8: Cost Thus, the maximum min(dis(T AIL(A))) = m. It implies Seq.# φ 0 1 2 3 4 5 6 7 8 9 10 i s[i] dismax Saved that T3 = O(m) + O(max{n + m}) = O(max{m + n}). 
φ 012334444555 0 b 1 9 In summary, the entire time complexity: T = T1 + T2 + b 1 1 1 1 x 2 7 2 2 x 2 2 2 2 2 2 e 3 3 T3 = O(m) + O(n) + O(nm ) + O(max{m + n}) = O(nm ) e 3 3 3 3 3 2 3 2 3 3 l 4 0 For space complexity, both Matrix C and H have n + 1 l 444443434334 4 g 5 0 rows and m + 1 columns, thus the space complexity is S = g 555554545443 O(2 × (m + 1) × (n + 1)) = O(mn).  Note that regexes are not very long in practice. Thus, m is usually far less than n and the time and space complexity Time complexity of the optimization is O(nm2) and space are approximately linear with n. complexity is O(nm) with m denoting the number of edges in NFA A and n denoting the length of string s. Since this 3.4 Optimization for RSR algorithm strategy prunes some calculation steps, it speeds up RSR Since time complexity of computing C dominates the time algorithm although their time complexities are identical. cost, we attempt to reduce the time for this step. As a trade-off, the optimization sacrifices some accuracy— In Algorithm 1, all e ∈ E are traversed to get C(i, e) in the results of it are not always optimal because the “re- the ith iteration. However, when dis(e) is far larger than i, dundant information” eliminated may contain clues of op- the computation of C(i, e) may be redundant because many timal solutions. Fortunately, such cases are rare because insertions are involved to generate for repair. Since such the possibility of eliminating essential information would insertions are meaningless, we attempt to reduce the com- decrease as length of the prefix of s grows. When i > putation cost by avoiding them. Thus, we compute C(i, e) max{dis(e)e∈T AIL(A)} − 1, no essential information will be just for i large enough. That is, C(i, e) is computed only lost since the number of pruned calculation steps gets 0. when dis(e) ≤ i + 1. It means in the ith iteration, only edges e with dis(e) ∈ {1, 2, . . . , i, i + 1} are calculated. 4. 
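The structural-repair step above can be sketched in a few lines. The following is a minimal re-implementation of the idea, not the authors' code: the NFA is built by hand for r = ab(c|d)e*fg from Example 3, rows of the matrix correspond to prefixes of s, and the within-row insertion relaxation uses plain Bellman-Ford sweeps instead of the priority queue of Algorithm 1 (same result, worse constants). The `prune` flag mimics the dis(e) ≤ i + 1 optimization of Algorithm 3.

```python
from math import inf

def regex_edit_distance(edges, n_states, accepting, s, prune=False):
    """Minimum edit distance from s to the language of an NFA.

    edges: list of (src_state, dst_state, character) transitions.
    """
    # dis[q]: fewest characters on any path from state 0 to q (pure insertions)
    dis = [inf] * n_states
    dis[0] = 0
    for _ in range(n_states):                       # Bellman-Ford relaxation
        for p, q, _c in edges:
            if dis[p] + 1 < dis[q]:
                dis[q] = dis[p] + 1

    prev = dis[:]                                   # row 0: repairing ""
    for i, ch in enumerate(s, start=1):
        # Algorithm-3-style pruning: states that need more than i + 1
        # inserted characters stay "blank" (inf) in this row.
        active = {q for q in range(n_states) if not prune or dis[q] <= i + 1}
        cur = [inf] * n_states
        for q in active:
            cur[q] = prev[q] + 1                    # delete s[i]
        for p, q, c in edges:
            if q in active:                         # match (cost 0) / substitute
                cur[q] = min(cur[q], prev[p] + (0 if c == ch else 1))
        for _ in range(n_states):                   # insertions within the row
            for p, q, _c in edges:
                if q in active and cur[p] + 1 < cur[q]:
                    cur[q] = cur[p] + 1
        prev = cur
    return min(prev[q] for q in accepting)

# Hand-built NFA for r = ab(c|d)e*fg (Example 3); state 5 is accepting.
EDGES = [(0, 1, 'a'), (1, 2, 'b'), (2, 3, 'c'), (2, 3, 'd'),
         (3, 3, 'e'), (3, 4, 'f'), (4, 5, 'g')]

print(regex_edit_distance(EDGES, 6, {5}, "bxelg"))              # 3
print(regex_edit_distance(EDGES, 6, {5}, "bxelg", prune=True))  # 3
```

With `prune=True`, the first rows touch only the left portion of the matrix, which corresponds to the blank region of Table 7; on this instance the pruned run still returns the optimal distance 3 (one insertion plus two substitutions, e.g. bxelg to abcefg).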
To ensure that this optimization does not cause illegal computation, it should be applied only when len(s) ≥ max{dis(e) | e ∈ TAIL(A)}. This is because an et ∈ TAIL(A) with len(s) < dis(et) implies C(i, et) = ∞ when the optimization is applied, which would clearly lead to an illegal computation in the following steps.

According to the above discussion, the optimized algorithm is shown in Algorithm 3. Since the optimization focuses on C, Algorithm 3 shows only the computation of C; the computation of H and the repair generation are unchanged. Compared with Algorithm 1, Algorithm 3 conducts the optimization in two positions. One is Line 6, which ensures that C(i, e) is computed only when dis(e) ≤ i + 1. The other is Line 11, which prevents C(i, e) = ∞ from being inserted into Q. We use an example to illustrate it.

Example 3. The input is r = ab(c|d)e*fg and s = bxelg. Here len(s) = 5 and max{dis(e)} = 5, which are very close. The resulting Matrix C is in Table 7. The computation cost of the blank cells is saved. In this problem, 19/55 of the time cost of Lines 5-16 is saved, according to Table 8.

4. VALUE REPAIR

The goal of value repair is to find a proper value among the available options for the variables of I and S operators in the repair sequence. Given V(t) as the value set of token t, ⟨P⟩ as the protocol token of Snort Rules, and V(P) = {tcp, udp, ip, icmp}, value repair answers the following questions.

• What should be inserted with operator [I, ⟨P⟩]? Which option in V(P) is the most suitable one?
• If an erroneous token written as txp should be substituted by a ⟨P⟩ with operator [S, ⟨P⟩], how to make the choice? What if the wrong token is written as xxx?

We propose two approaches, based respectively on edit distance and association rules. These approaches are only concerned with repairing tokens whose value sets are finite and closed, which is the common case in the instances we have seen. The protocol token has four options {tcp, udp, ip, icmp} and the direction token has two {<>, ->}. In the MovieLens sets, the genres token has 20 options such as "Mystery", "Sci-Fi", etc., which is also finite. As for consecutive and continuous value cases, we discretize them into finite sets before applying these two approaches.

4.1 Edit-distance-based Method

Since edit distance is effective for measuring the similarity of strings, it is natural to select the value with the smallest edit distance to the original string. For example, it is reasonable to repair txp to tcp, since tcp has the minimal edit distance to txp in V(P). To implement this method efficiently, we use an efficient top-1 string similarity search algorithm [8] on the option set. The advantage is that this method is easy to comprehend and implement without requiring extra knowledge of the data, and it can be quite efficient. The disadvantage is that when the operator is an insertion, or the token is completely wrongly written, this method can hardly make convincing choices without sufficient clues. To make rational choices in these cases, we give an association-rule-based method as a supplement.

4.2 AR-based Method

Association rule mining aims to find item sets that co-occur frequently. For our problem, it is effective to find the co-occurrence relationship between one value and its context; then the true value can be implied. We use an example to illustrate this method. For the Snort Rules, in each entry with both ⟨SIP⟩ = 80 and ⟨DIP⟩ = 80, the ⟨P⟩ must be tcp, because 80 is the port number of tcp services. Hence, we can claim that if a string s mistakenly writes ⟨P⟩ as xxx with 80 as the values of the source and destination port numbers, then a correct repair is to modify xxx to tcp: {⟨SIP⟩ = 80, ⟨DIP⟩ = 80} implies {⟨P⟩ = tcp}.

This is regarded as an association rule (AR). It describes the extent of association or dependency between two sets. Formally, an AR is defined as an implication of the form X ⇒ Y, where X and Y are item sets and X ∩ Y = ∅ [2]. The example of port and protocol can be explained by an AR: in the Port-Protocol rule Rpp: X ⇒ Y, X is {⟨SIP⟩ = 80, ⟨DIP⟩ = 80} and Y is {⟨P⟩ = tcp}.

For an AR, confidence and support are often used to evaluate its usability. Given count(X) as the number of transactions containing X, count(X, Y) as the number of transactions containing both X and Y, and count(T) as the size of the entire transaction set, support(X) = count(X)/count(T) and support(X, Y) = count(X, Y)/count(T). Confidence measures the confidence level of a rule: for a rule R: X ⇒ Y with item sets X and Y, confidence(R) = support(X, Y)/support(X).

In our problem, the association rules have two extra constraints compared with general ones. One is that tokens with infinite or continuous values should be discretized into finite and closed sets before using these methods. The other is that only rules whose consequent set Y has size 1 are included in the association rule set, for the repair is applied on a single token; thus only rules with a single token on the right side are required. With these constraints, the association rules used in our approach can be mined directly with the Apriori algorithm [3].

After rule mining, an AR set Γ is derived. Each rule in Γ is of the form R: X ⇒ y. Given the value set of token t as V(t) and Γt = {R | R ∈ Γ, y ∈ V(t)}, Γt contains all association rules related to t. Therefore, for strings containing tokens with values in the X of some R ∈ Γt, we can use R for the repair, and its y is selected to repair t. This method takes advantage of the context, i.e., the association relations, to determine the values. As a trade-off, it requires extra effort for association-rule discovery.

4.3 Value Repair Method Selection

As discussed before, these two methods are complementary. In this subsection, we discuss how to choose a proper value repair method. We first define comparable measures for the approaches and select by comparing these measures. As we know, for an association rule R: X ⇒ y, confidence(R) describes the possibility of correctness and the reliability of amending token t using X. For method comparison, we propose a "confidence" for the edit-distance-based method, denoted by γ: γ = 1 − min{ed(s, V(t))}/len(s), where s is the wrongly-written string. Since γ is the largest proportion of correct characters in s, the closer s is to some string in V(t), the more likely this repair is correct. Therefore, γ properly describes the confidence of the edit-distance-based repair.

With the definitions of the confidences, we propose the selection strategy. For a substitution operator, the method with the maximum confidence is selected. Since an insertion operator has no original value as a reference, its value repair depends on the association rules. The strategy is as follows. Given a token t, its original value s, confidence(Rt) and γt, for a substitution operator [S, t], judge whether a rule Rt ∈ Γt exists such that confidence(Rt) > γt. If so, t is repaired according to the Rt with the maximal confidence(Rt). Otherwise, t is set to the s′ ∈ V(t) with minimal ed(s, s′). For an insertion operator [I, t], find the rule R: X ⇒ y in Γt with the maximal confidence. The following example continues Example 2 to explain the value repair.

Example 4. Suppose E1 is written as "itcmp" while the value set of P is {tcp, ip, icmp, udp}, the source and destination ports are both 80, and R: {SIP = 80, DIP = 80} ⇒ {P = tcp} is an AR mined from the data set with confidence(R) = 0.9. The comparison of the two methods is as follows. According to the edit-distance-based method, the confidences of the values of V(P) are γip = 0.4, γtcp = 0.6, γicmp = 0.8 and γudp = 0.2, so transforming E1 to "icmp" is the wise choice. But according to the AR-based method, rule R suggests that E1 be modified to "tcp". Since confidence(R) > γicmp, we select "tcp" for the repair.

5. EXPERIMENTS

In this section, we show the results and analyses of the effectiveness and efficiency.

5.1 Experimental Settings

Environment. We implement the algorithms in C++, compiled by GCC, on a PC with an Intel Core i7 2.1GHz CPU, 8GB memory, and 64-bit Windows 8.1.

Data Sets. To test our algorithm comprehensively, we use both real and synthetic data. For real data, we choose the intrusion detection rules of Snort¹ and MovieLens user

¹Available at https://www.snort.org/downloads/community/community-rules.tar.gz. It contains rules that were created by Sourcefire, Inc.
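The two scorers of Section 4 can be combined into a small selection routine. The sketch below is ours (the paper gives no code): `levenshtein` is the classic Wagner-Fischer edit distance [20], `gamma` computes γ = 1 − min{ed(s, V(t))}/len(s), and `repair_value` prefers an applicable association rule whenever its confidence exceeds γ. The rule R and its confidence 0.9 are taken from Example 4; the rule representation as (antecedent, consequent, confidence) triples is an assumption of this sketch.

```python
def levenshtein(a, b):
    # classic Wagner-Fischer dynamic program, O(|a||b|)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete
                           cur[j - 1] + 1,              # insert
                           prev[j - 1] + (ca != cb)))   # match / substitute
        prev = cur
    return prev[-1]

def gamma(s, values):
    # "confidence" of the edit-distance-based repair (Section 4.3)
    return 1 - min(levenshtein(s, v) for v in values) / len(s)

def repair_value(s, values, rules, context):
    """rules: list of (antecedent_dict, consequent_value, confidence)."""
    g = gamma(s, values)
    applicable = [(conf, y) for x, y, conf in rules
                  if x.items() <= context.items() and y in values and conf > g]
    if applicable:
        return max(applicable)[1]        # most confident applicable rule wins
    return min(values, key=lambda v: levenshtein(s, v))

V_P = ["tcp", "ip", "icmp", "udp"]
R = ({"SIP": 80, "DIP": 80}, "tcp", 0.9)     # the rule of Example 4
print(gamma("itcmp", V_P))                   # 0.8 (closest option is "icmp")
print(repair_value("itcmp", V_P, [R], {"SIP": 80, "DIP": 80}))   # tcp
```

Run on Example 4, the γ values come out exactly as in the text (0.4, 0.6, 0.8, 0.2 for ip, tcp, icmp, udp), and the rule's confidence 0.9 > 0.8 overrides the edit-distance choice.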

rating data². As mentioned, an IDS uses "rules" to record the properties of malicious Internet packets; these rules are written in a language that can be matched by a regex. The MovieLens rating data is collected by GroupLens Research from the MovieLens web site [1]. Its regex pattern is given in Section 1.

²Available at http://grouplens.org/datasets/MovieLens

For synthetic data, the shortcomings regarding shorter and more numerous closures and options, mentioned in Subsection 3.2, should be verified. To achieve this, we artificially create 4 regexes over the characters a-z with different traits. In particular, Regex 2 and Regex 4 have more and shorter closures than Regex 1 and Regex 3:

Regex 1: a*b((cd)|(de)|(ec))*f((gh)|(ij)|(klm))*n((o(pqz)*)|((r|s)t*))((uv)|(wxy))*
Regex 2: (ab)*cde*(fgh)*((ij)*kl)*m(n*op)*((qrs)*(tu)*)*(vwxy)*z
Regex 3: ((ab)|c)*((def)|(gh))*(ijk)*((lmn)*|(opq)*)*((rst)|(uvw)|(xyz))*
Regex 4: ((ab)|(cd))*ef((ghi)*|(jk)*)((lm)*|(nop)*)*((qr)*s|t(uv)*)*w(x|y)*z

To control the error ratio, we inject errors into the strings. For each string in the data set, we choose a µ proportion of positions and, at each such position, randomly revise or delete the token, or insert a token. Meanwhile, we keep a copy of the uncontaminated data and exploit it as the ground truth.

We consider 4 parameters that may affect the performance: the number of strings in the data set, denoted by M; the average proportion of error characters in the strings, denoted by µ; the average length of the data strings, denoted by L̄s; and the error type, denoted by T. The default settings are M = 2,000, µ = 5%, L̄s = 45, and T = {L, R, D} (representing Loss, Redundance and Displacement).

Measures. We considered using Precision and Recall with the following definitions to measure the performance:

    Precision = |incorrect entries ∩ actual repairs| / |actual repairs|
    Recall = |incorrect entries ∩ actual repairs| / |incorrect entries|

Incorrect entries is the set of erroneous items artificially injected into the data sets, and actual repairs is the set of amendments actually conducted. In the experimental results, we find Precision and Recall are always equal to 1. The reason is that all error tokens can be discovered due to disobedience to the regex and then repaired, which means that incorrect entries = actual repairs. To test the effectiveness of the algorithms more deeply, we use Accuracy and Improve Rate as the measures.

Before explaining these measures, we define the sets AC and W. AC is the set of RSR-algorithm-repaired strings (each one uniquely produced) that are equal to the ground truth. W is the set of strings with injected errors. The Accuracy, denoted by λ, is defined as λ = |AC|/|W|. It represents the overall repair performance, that is, the RSR algorithm's capability to restore the entire data set.

We also use the Improve Rate ω to describe the performance regarding individual data items, i.e., the extent to which a single item is repaired. Given the original string so, the error string se, and the repaired string sr, ω = 1 − ed(so, sr)/ed(so, se). We use ω̄ = (Σ ωi)/|W|, summing over i = 1, ..., |W|, to show the overall effectiveness. Notice that ω̄ < 0 means a repair with negative impact. To measure efficiency, we use the time duration of the execution of the algorithm, denoted by t. The unit of run time is the second (s).

Figure 4: The Impact of Data Size M for Synthetic Sets ((a) M vs. λ; (b) M vs. ω̄)

Figure 5: The Impact of Data Size M for Real Sets ((a) Snort Rules; (b) MovieLens Rating)

5.2 Results and Analysis

In this subsection, we show experimental results and analyses of λ and ω when varying M, µ, L̄s and T.

5.2.1 Experiments on Effectiveness

The Impact of Data Size (M). To test the RSR algorithm's scalability, we execute it on 40 synthetic sets of the 4 regexes with M varying from 10,000 to 100,000, and on 22 real sets of the Snort Rules and the MovieLens ratings with M varying from 300 to 3,300. Experimental results are shown in Figure 4 and Figure 5. These figures manifest that M does not affect the accuracy or the improve rate. The reason is that sequence repair only concerns individual items rather than the overall data set.

Besides, the performances vary significantly among regexes. In Figure 4, the accuracy and the average improve rate of Regex 3 are larger than those of Regex 2 and Regex 4, due to the disadvantage of the priority of the RSR algorithm. As a comparison, the regexes of the Snort Rules and MovieLens ratings, with longer closure sequences, achieve better λ and ω̄ than those on the synthetic sets, as shown in Figure 5.

The Impact of Error Rate (µ). To evaluate the impact of µ, we assign µ in {1%-10%} and {10%-50%}. Since the data for Regex 1 with average length 50 is not enough due to the shortage of closures, we abandon those data sets. The experimental results on synthetic and real data are shown in Figure 6 and Figure 7, respectively. From the results, we observe that the accuracy decreases along with the growth of µ. This phenomenon is explained as follows. Suppose the probability that an error is correctly repaired is pa and the string s has q = µ·len(s) errors; then λ of this set is λ = pa^q = pa^(µ·len(s)), which means that λ is exponential in µ and decreases while µ increases. The accuracies in Figure 7(c) and Figure 7(d) decrease slowly, with fluctuations, since the structural context information of the regex of the MovieLens data set is sufficient, which contributes to accuracy and enlarges pa to make it close to 1.
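The contamination step and the two effectiveness measures above can be sketched as follows. This is our own illustration of the setup, not the authors' harness: the alphabet, the seed handling, and the uniform choice among the three error types are assumptions (the paper only fixes the proportion µ), and `improve_rate` takes the two edit distances as precomputed numbers.

```python
import random

def inject(s, mu, rng, ops=("revise", "delete", "insert"), alphabet="abcdefghij"):
    """Contaminate a mu proportion of positions of s (assumed details:
    alphabet and uniform choice among the three error types are ours)."""
    k = max(1, int(mu * len(s)))
    out = list(s)
    for p in sorted(rng.sample(range(len(s)), k), reverse=True):
        op = rng.choice(ops)
        if op == "revise":                            # Displacement
            out[p] = rng.choice([c for c in alphabet if c != out[p]])
        elif op == "delete":                          # Loss
            del out[p]
        else:                                         # Redundance
            out.insert(p, rng.choice(alphabet))
    return "".join(out)

def accuracy(repaired, truth):
    """lambda = |AC| / |W|: fraction of strings restored exactly."""
    return sum(r == t for r, t in zip(repaired, truth)) / len(truth)

def improve_rate(ed_truth_repaired, ed_truth_error):
    """omega = 1 - ed(so, sr) / ed(so, se); negative omega marks a harmful repair."""
    return 1 - ed_truth_repaired / ed_truth_error

print(accuracy(["abc", "xyz"], ["abc", "xyw"]))   # 0.5
print(improve_rate(0, 2))                          # 1.0 (perfect repair)
```

A perfect repair (sr = so) gives ω = 1, an untouched error string gives ω = 0, and a repair that drifts further from the ground truth than the error itself gives ω < 0, matching the remark about negative impact.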

Figure 6: The Impact of Error Rate µ for Synthetic Sets ((a) µ vs. λ; (b) µ vs. ω̄; (c) µ vs. λ; (d) µ vs. ω̄)

Figure 8: Proportion of Improve Rate ω for Regex 3 ((a) µ = 5%; (b) µ = 20%)

Figure 7: The Impact of Error Rate µ for Real Data Sets ((a) µ ∈ {1%-10%} (S); (b) µ ∈ {10%-50%} (S); (c) µ ∈ {1%-10%} (M); (d) µ ∈ {10%-50%} (M); (S) denotes the Snort Rules and (M) the MovieLens ratings)

Figure 9: The Impact of Average String Length L̄s for Synthetic Data Sets ((a) L̄s vs. λ; (b) L̄s vs. ω̄)

According to Figure 6(b), Figure 6(d) and Figure 7, ω̄ remains roughly unchanged on both sets, showing the stable repairing effectiveness of the RSR algorithm. We show two pie charts of the proportion of ω in the data set of Regex 3, in Figure 8(a) with µ = 5% and Figure 8(b) with µ = 20%, to observe its distribution. From these figures, most of the individual improve rates are in [0.8, 1], and the items with ω > 0.6 are over 75%. It shows that our approach is effective.

The Impact of Average Length (L̄s). Since the length of real data is uncontrollable, we only conduct experiments on the synthetic data sets, divided by string length: Group 1 has strings with length in [1, 5), Group 2 in [5, 10), Group 3 in [10, 15), and so on. Only groups with over 2,000 items are considered in these experiments. The results are in Figure 9. The accuracy decreases when strings get longer, which is also explained by λ = pa^(µ·len(s)). Additionally, ω̄ is still stable. To illustrate the distribution of ω of Regex 3, we also draw two pie charts for L̄s = 15 and L̄s = 65 in Figure 10. Similar to Figure 8, high-improvement repairs dominate the cases, which also shows the effectiveness.

The Impact of Error Type (T). To test the impact of error types, we inject only one error type T for each group with the same µ. The results on synthetic and real data are in Table 9. We observe that when T = L (Loss), the λ and ω̄ of Regex 2 and Regex 4 are much lower than those of Regex 3. This is another evidence of the disadvantages of the priority mechanism of the RSR algorithm. Besides, λ and ω̄ for Redundance and Displacement, and the overall performance on the real data, are all higher than 0.7.

Table 9: The Impact of Error Type T on Performance
  Error Type (T)     |    R     |    L     |    D
  Performance        |  λ    ω̄ |  λ    ω̄ |  λ    ω̄
  Regex 2            | 0.82 0.91| 0.23 0.13| 0.88 0.94
  Regex 3            | 0.98 0.98| 0.57 0.58| 0.94 0.97
  Regex 4            | 0.74 0.86| 0.22 0.05| 0.82 0.90
  Snort Rules        | 0.95 0.97| 0.88 0.92| 0.93 0.93
  MovieLens Rating   | 1    1   | 0.85 0.84| 0.91 0.91

5.2.2 Experiments on Efficiency

Since the efficiency is affected by the data size and the string length, we test the impact of these two parameters. We vary the data size M from 10,000 to 100,000 in the synthetic sets and from 300 to 3,300 in the real sets. The results are in Figure 11. From these results, it is observed that the run time t is roughly linear in M. The differences in slopes are caused by the differences in the average length of the four sets and in the edge numbers of the four regexes' NFAs, which dominate the size of Matrix C. We also vary the string length from 5 to 70; the relation of L̄s and run time t is shown in Figure 12.
t grows roughly linearly with L̄s, which coincides with the time complexity O(nm²). µ does not affect the execution time; Table 10 shows the relationship between µ and t. From the results, our algorithm repairs 2,000 strings with average length 45 within 5s, which shows its efficiency. The relation of T and t is given in Table 11. Redundance errors cost slightly less than Losses and Displacements, since deletions are considered in the first round, which lessens the calculation pressure of the second one.

Table 10: The Impact of Error Rate µ on Run Time (s)
  µ       |  1%   2%   3%   4%   5%   6%   7%   8%   9%  10%  20%  30%  40%  50%
  Regex 2 | 3.57 3.60 3.59 3.60 3.63 3.54 3.60 3.62 3.56 3.57 3.62 3.54 3.54 3.50
  Regex 3 | 4.63 4.66 4.62 4.79 4.69 4.79 4.70 4.74 4.71 4.72 4.66 4.74 4.70 4.51
  Regex 4 | 4.58 4.53 4.62 4.60 4.70 4.66 4.64 4.67 4.53 4.58 4.44 4.46 4.47 4.38

Figure 10: Proportion of Improve Rate ω for Regex 3 ((a) L̄s = 15; (b) L̄s = 65)

Table 11: The Impact of Error Type T on Run Time (s)
  T                 |   L    |   R    |   D
  Regex 2           |  3.80  |  3.46  |  3.62
  Regex 3           |  4.89  |  4.42  |  5.10
  Regex 4           |  4.65  |  4.37  |  4.53
  Snort Rule        | 134    | 133.50 | 133.50
  MovieLens Rating  |  31.19 |  31.07 |  31.08

5.2.3 Experiments on Optimization

We evaluate the acceleration impact of the optimization strategy on both the synthetic and real data sets; all parameters are under the default settings. We compare the run time, Accuracy λ and Improve Rate ω̄ of the unoptimized algorithm (Algorithm 1) with those of the optimized one (Algorithm 3) to observe the trade-off between correctness and efficiency.

Table 12 shows the comparison between the run times of the original and the optimized algorithm for synthetic data; λ and ω̄ of the two methods are identical there. Table 13 gives the results for real data. The evaluation shows that the optimization strategy has considerable capability in cutting down the time cost (an average of 7.1% for synthetic data and 30.2% for real data), especially for long data strings and pattern regexes. On the other hand, it still maintains the effectiveness at a high level, since the average loss of accuracy on the real data is less than 3%.

Table 12: Optimization for Synthetic Data Sets (run time, s)
  Regex     |   1   |   2   |   3   |   4
  Original  | 11.04 | 10.02 | 13.42 | 13.25
  Optimized | 10.18 |  9.12 | 13.05 | 12.08

Table 13: Optimization for Real Data Sets
  Data Set           | Algorithm |   t    |   λ   |   ω̄
  Snort Rules        | Original  | 495.80 | 0.872 | 0.945
  Snort Rules        | Optimized | 411.03 | 0.854 | 0.920
  MovieLens Ratings  | Original  |  18.205| 0.876 | 0.891
  MovieLens Ratings  | Optimized |  10.32 | 0.846 | 0.821

Figure 11: The Impact of Data Size M on Run Time ((a) Synthetic Data Sets; (b) Real Data Sets)

Figure 12: The Impact of Average String Length L̄s on Run Time

5.2.4 Comparison with Dependency-Rule-Only Repair

In order to compare the effectiveness of structural repair between the RSR algorithm and a Dependency-Rule-Only repair strategy, we derive the dependencies from the structure requirements and measure λ and ω̄ on various amounts of errors by controlling µ. For example, if the regex is xA|yB, then we derive that x → A and y → B. Figure 13 shows the results. For Accuracy (λ) in Figure 13(a), the RSR algorithm is higher than the Dependency-Rule-Only method. For Improve Rate (ω̄) in Figure 13(b), the Dependency-Rule-Only method shows a negative impact, indicating that it does not suit structural repair.

Figure 13: Comparison of Different Repair Methods ((a) Comparison on λ; (b) Comparison on ω̄)

5.2.5 Summary

In summary, we can draw the following conclusions from the experiments.

• Compared with dependencies, the regex-based method is more applicable to structural repairs, especially for regex-constrained data. Dependencies, due to their lack of structural insight, perform poorly and even harmfully on this repairing topic.
• Data size M has no effect on accuracy or improve rate. The accuracy λ decreases along with the growth of µ and L̄s, which can be explained by λ = pa^(µ·L̄s). However, the average improve rate ω̄ remains stable regardless of µ or L̄s. As for error type T, Losses are more difficult to discover and repair due to the priority in the RSR algorithm, especially for errors in small closures. In all, the proposed algorithm can repair the errors according to the regex.
• λ and ω̄ are also dominated by an unquantifiable property of the structure of the regex. Simply put, regexes with longer but fewer closures are more likely to avoid the disadvantages, and achieve higher accuracy and improve rate in repair, than those that do not.
• The efficiency is affected by the data size, the average length of strings and the error types. The run time is linear in M and L̄s, while T has little impact on it. Additionally, the RSR algorithm can repair 2,000 items within 5 seconds; that is, given a string and a regular expression, our approach only takes at most tens of ms, which shows that it is efficient and scalable. Moreover, the optimization strategy shows acceptable performance in decreasing the run time, particularly for long data strings and regexes, while maintaining effectiveness.

6. CONCLUSION

To repair the strings violating a regex, we propose the RSR algorithm in this paper. We also combine an edit-distance-based approach and association rules to determine the inserted or revised values. Experimental results demonstrate the efficiency and effectiveness of the proposed algorithm.

Our algorithm is suitable for large-scale regex-based data cleaning, which is a comparatively new topic with few existing efforts. Future work aims at better accuracy and efficiency. For accuracy, optimizations to overcome the disadvantages of the RSR algorithm will be developed; more flexible definitions of regexes can also help avoid these disadvantages. For efficiency, parallel computing can be introduced, since the repairing processes for individual strings or tokens are independent given the NFA of the pattern, and string entries can be repaired via multiple processors.

7. ACKNOWLEDGEMENTS

This paper was supported by NGFR 973 grant 2012CB316200, NSFC grants 61472099 and 61133002, and National Sci-Tech Support Plan 2015BAH10F01. Hongzhi Wang and Zeyu Li contributed to the work equally and should be regarded as co-first authors. Hongzhi Wang is the corresponding author of this work.

8. REFERENCES

[1] Summary of MovieLens datasets. http://files.grouplens.org/datasets/movielens/ml-1m-README.txt.
[2] R. Agrawal, T. Imielinski, and A. N. Swami. Mining association rules between sets of items in large databases. In SIGMOD, pages 207-216, 1993.
[3] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB, pages 487-499, 1994.
[4] M. Arenas, L. E. Bertossi, J. Chomicki, X. He, V. Raghavan, and J. P. Spinrad. Scalar aggregation in inconsistent databases. Theor. Comput. Sci., 296(3):405-434, 2003.
[5] A. Backurs and P. Indyk. Edit distance cannot be computed in strongly subquadratic time (unless SETH is false). In STOC, pages 51-58, 2015.
[6] P. Bohannon, W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional dependencies for data cleaning. In ICDE, pages 746-755, 2007.
[7] F. Chiang and R. J. Miller. Discovering data quality rules. PVLDB, 1(1):1166-1177, 2008.
[8] D. Deng, G. Li, J. Feng, and W. Li. Top-k string similarity search with edit-distance constraints. In ICDE, pages 925-936, 2013.
[9] X. L. Dong, L. Berti-Equille, and D. Srivastava. Integrating conflicting data: The role of source dependence. PVLDB, 2(1):550-561, 2009.
[10] X. L. Dong, L. Berti-Equille, and D. Srivastava. Truth discovery and copying detection in a dynamic world. PVLDB, 2(1):562-573, 2009.
[11] W. Eckerson. Data quality and the bottom line. Journal of Radioanalytical & Nuclear Chemistry, 160(4):355-362, 1992.
[12] W. Fan, F. Geerts, N. Tang, and W. Yu. Inferring data currency and consistency for conflict resolution. In ICDE, pages 470-481, 2013.
[13] A. Galland, S. Abiteboul, A. Marian, and P. Senellart. Corroborating information from disagreeing views. In WSDM, pages 131-140, 2010.
[14] F. Geerts, G. Mecca, P. Papotti, and D. Santoro. The LLUNATIC data-cleaning framework. PVLDB, 6(9):625-636, 2013.
[15] K. Lakshminarayan, S. A. Harp, R. P. Goldman, and T. Samad. Imputation of missing data using machine learning techniques. In KDD, pages 140-145, 1996.
[16] C. Mayfield, J. Neville, and S. Prabhakar. ERACER: a database approach for statistical inference and data cleaning. In SIGMOD, pages 75-86, 2010.
[17] I. O. Medicine, J. M. Corrigan, and M. S. Donaldson. To err is human: Building a safer health system. Institute of Medicine, the National Academies, 7(4):245-246, 2000.
[18] V. Raman and J. M. Hellerstein. Potter's Wheel: An interactive data cleaning system. In VLDB, pages 381-390, 2001.
[19] N. A. Setiawan, P. A. Venkatachalam, and A. F. M. Hani. Missing attribute value prediction based on artificial neural network and rough set theory. In BMEI, pages 306-310, 2008.
[20] R. A. Wagner and M. J. Fischer. The string-to-string correction problem. J. ACM, 21(1):168-173, 1974.
[21] T. Warren. Using regular expressions for data cleansing and standardization. http://www.kimballgroup.com/2009/01/using-regular-expressions-for-data-cleansing-and-standardization.
