Quick viewing(Text Mode)

Methods of Analysis of Textual Data (MATD) Exact Pattern Matching Searching for Finite Set of Patterns

Methods of Analysis of Textual Data (MATD) Exact Pattern Matching Searching for Finite Set of Patterns

Lectures Outline

1. Methods of Analysis of Textual Data (MATD) Exact pattern matching Searching for finite set of patterns

Searching for (Regular) Infinite Set of Patterns in Text Jiří Dvorský October 11, 2019 Approximate pattern matching

Department of VŠB – TU Ostrava

1/31 2/31

Pattern Matching

Pattern Matching

Jiří Dvorský Exact pattern matching

Department of Computer Science VŠB – TU Ostrava

3/31 Searching for (Regular) Infinite Set of Patterns in Text Regular Expressions and Languages

Regular expression 푅 Value of expression ℎ(푅) Atomic expressions ∅ ∅ 휀 {휀} 1. How to describe infinte set of pattern i.e. string? 푎, 푎 ∈ Σ {푎} Regular Expressions Operations 2. What shall we use to perform matching? 푈 ⋅ 푉 {푢푣|푢 ∈ ℎ(푈) ∧ 푣 ∈ ℎ(푉)} Finite Automata 푈 + 푉 ℎ(푈) ∪ ℎ(푉) 푉 푘 = 푉⏟ ⋅ 푉 ⋅ … ⋅ 푉 푘 푡푖푚푒푠 푉 + = 푉 1 + 푉 2 + 푉 3 + … 푉 ∗ = 푉 0 + 푉 1 + 푉 2 + …

4/31 5/31

Regular Expression Features Deterministic Finite Automaton

푈 + (푉 + 푊) = (푈 + 푉) + 푊 Definition 푈 ⋅ (푉 ⋅ 푊) = (푈 ⋅ 푉) ⋅ 푊 Deterministic Finite Automaton (DFA) is a quintuple 퐴 = (푄, Σ, 푞 , 훿, 퐹) 푈 + 푉 = 푉 + 푈 0 , where (푈 + 푉) ⋅ 푊 = (푈 ⋅ 푊) + (푉 ⋅ 푊) • 푄 is a finite set of states 푈 ⋅ (푉 + 푊) = (푈 ⋅ 푉) + (푈 ⋅ 푊) • Σ is an alphabet

푈 + 푈 = 푈 • 푞0 ∈ 푄 is an initial state 휀 ⋅ 푈 = 푈 • 훿 ∶ 푄 × Σ → 푄 is a transition function ∅ ⋅ 푈 = ∅ • 퐹 ⊆ 푄 is a set of final states 푈 + ∅ = 푈 푈∗ = 휀 + 푈+

6/31 7/31 Deterministic Finite Automaton (cont.) Nondeterministic Finite Automaton

Configuration of Finite Automaton Definition Nondeterministic Finite Automaton (NFA) is a quintuple (푞, 푤) ∈ 푄 × Σ∗ 퐴 = (푄, Σ, 푞0, 훿, 퐹), where Transition of Finite Automaton is a relation • 푄 is a finite set of states • Σ is an alphabet ↦∶ (푄 × Σ∗) × (푄 × Σ∗) • 푞0 ∈ 푄 is an initial state such as • 훿 ∶ 푄 × Σ → 푃(푄) is a transition function (푞, 푎푤) ↦ (푞’, 푤) ⟺ 훿(푞, 푎) = 푞’ • 퐹 ⊆ 푄 is a set of final states Automaton accepts word 푤 if • Alternatively NFA can be defined as 퐴 = (푄, Σ, 푆, 훿, 퐹), where ∗ (푞0, 푤) ↦ (푞, 휀), 푞 ∈ 퐹 푆 ⊆ 푄 is a set of initial states. • For each NFA, there is a DFA such that it recognizes the same formal language. 8/31 9/31

Nondeterministic Finite Automaton – example NFA ⟶ DFA Conversion

Set of patterns 푃 = {he, her, she} The DFA can be constructed using the powerset construction. ′ ′ ′ ′ ′ ′ NFA 퐴 = (푄, Σ, 푆, 훿, 퐹) ⟶ DFA 퐴 = (푄 ,Σ , 푞0, 훿 , 퐹 ) Σ ′ h e Σ • 푄 ⊆ 푃(푄) start 푞1 푞2 푞3 ′ Σ h e r • Σ = Σ start 푞1 푞2 푞3 푞4 ′ 푞 h 푞 e 푞 r 푞 s start 4 5 6 7 • 푞0 = 푆 Σ 푞 h 푞 e 푞 ′ ′ ′ 5 6 7 • 훿 (푞 , 푥) = ∪훿(푞, 푥) for all 푞 ∈ 푞 푞 s 푞 h 푞 e 푞 start 8 9 10 11 • 퐹′ = {푞′ ∈ 푄′|푞′ ∩ 퐹 ≠ ∅}

10/31 11/31 NFA ⟶ DFA Conversion I NFA ⟶ DFA Conversion II

State Label 푒 ℎ 푟 푠 other State Label 푒 ℎ 푟 푠 other {1, 4, 8} 푞′ {1, 4, 8} {1, 2, 4, 5, 8} {1, 4, 8} {1, 4, 8, 9} {1, 4, 8} 1 ′ ′ {1} 푞 {1} {1, 2} {1} {1, 5} {1} {1, 2, 4, 5, 8} 푞2 {1, 3, 4, 6, 8} {1, 2, 4, 5, 8} {1, 4, 8} {1, 4, 8, 9} {1, 4, 8} 1 ′ ′ {1, 4, 8, 9} 푞3 {1, 4, 8} {1, 2, 4, 5, 8, 10} {1, 4, 8} {1, 4, 8, 9} {1, 4, 8} {1, 2} 푞2 {1, 3} {1, 2} {1} {1, 5} {1} {1, 3, 4, 6, 8} 푞′ {1, 4, 8} {1, 2, 4, 5, 8} {1, 4, 7, 8} {1, 4, 8, 9} {1, 4, 8} ′ 4 {1, 5} 푞3 {1} {1, 2, 6} {1} {1, 5} {1} {1, 2, 4, 5, 8, 10} 푞′ {1, 3, 4, 6, 8, 11} {1, 2, 4, 5, 8} {1, 4, 8} {1, 4, 8, 9} {1, 4, 8} ′ 5 {1, 3} 푞 {1} {1, 2} {1, 4} {1, 5} {1} ′ 4 {1, 4, 7, 8} 푞6 {1, 4, 8} {1, 2, 4, 5, 8} {1, 4, 8} {1, 4, 8, 9} {1, 4, 8} ′ ′ {1, 2, 6} 푞5 {1, 3, 7} {1, 2} {1} {1, 5} {1} {1, 3, 4, 6, 8, 11} 푞7 {1, 4, 8} {1, 2, 4, 5, 8} {1, 4, 7, 8} {1, 4, 8, 9} {1, 4, 8} ′ {1, 4} 푞6 {1} {1, 2} {1} {1, 5} {1} ′ h h {1, 3, 7} 푞7 {1} {1, 2} {1, 4} {1, 5} {1} Σ h ′ h ′ ′ r ′ 푞 h 푞 e 푞 h h start 1 2 3 start 푞 푞 푞 푞 1 2 e 4 6 Σ s s h s r Σ ′ h ′ ′ r ′ 푞 h 푞 e 푞 r 푞 s h start 푞 푞 푞 푞 start 4 5 6 7 h 1 2 e 4 6 Σ h e r s ′ h ′ e ′ start 푞1 푞2 푞3 푞4 푞3 푞5 푞7 s s s r 푞 s 푞 h 푞 e 푞 s h start 8 9 10 11 s h s 푞 h 푞 e 푞 ′ h ′ e ′ s 5 6 7 푞3 푞5 푞7 s Only reachable states, transitions to state 푞 are not shown. s 1 s 12/31 13/31

Derivation of Regular Expression Derivation of Regular Expression – properties

For given regular expression 푅, derivation is defined as d푅 ℎ ( ) = {푦|푥푦 ∈ ℎ(푅)} d∅ d(푈 + 푉) d푈 d푉 푥 = ∅, ∀푎 ∈ Σ = + d d푎 d푎 d푎 d푎 d휀 d(푈 ⋅ 푉) d푈 Example = ∅, ∀푎 ∈ Σ = ⋅ 푉, 휀 ∉ 푈 d푎 d푎 d푎 For 푅 = 푎 + 푠ℎ푒푙푙 + 푠푡표푝 + 푝푙표푡 and its value d푎 d(푈 ⋅ 푉) d푈 d푉 ℎ(푅) = {푎, 푠ℎ푒푙푙, 푠푡표푝, 푝푙표푡} = 휀, ∀푎 ∈ Σ = ⋅ 푉 + , 휀 ∈ 푈 derivations are d푎 d푎 d푎 d푎 d푏 d푉 ∗ d푉 d푅 = ∅, ∀푏 ≠ 푎 = ⋅ 푉 ∗ ℎ ( ) = {휀} d푎 푎 푎 d푎 d d d푅 ℎ ( ) = {ℎ푒푙푙, 푡표푝} d푉 d d d d푉 d푠 = ( (⋯ ( ))) , for 푥 = 푎 푎 … 푎 d푅 d푥 d푎 d푎 d푎 d푎 1 2 푛 ℎ ( ) = ∅ 푛 푛−1 2 1 d푡

14/31 15/31 Construction of DFA Derivations of RE Construction of DFA Derivations of RE – example

• Derivation of regular expressions allows directly and Lest’s have 푉 = (0 + 1)∗ ⋅ 01 over alphabet Σ{0, 1}. algorithmically build DFA for any regular expression. ∗ Then 푞0 = (0 + 1) ⋅ 01 • Let 푉 is given regular expression in alphabet Σ. Example of derivations: • Each state of DFA defines a set of words, that move the ∗ ∗ d((0 + 1) ⋅ 01) d((0 + 1) ) d01 DFA from this state to any of final states. = ⋅ 01 + So, every state can be associated with regular expression, d0 d0 d0 d(0 + 1) defining this set of words = ⋅ (0 + 1)∗ ⋅ 01 + 1 d0 d0 d1 ∗ 푞0 = 푉 = ( + ) ⋅ (0 + 1) ⋅ 01 + 1 d0 d0 d푞 훿(푞, 푥) = = (휀 + ∅) ⋅ (0 + 1)∗ ⋅ 01 + 1 푥 d ∗ 퐹 = {푞 ∈ 푄|휀 ∈ ℎ(푞)} = (0 + 1) ⋅ 01 + 1

16/31 17/31

Construction of DFA Derivations of RE – example (cont.) Construction of DFA Derivations of RE – example (cont.)

Regular Expression State 0 1 ∗ ∗ d((0 + 1) ⋅ 01) d((0 + 1) ) d01 ∗ ∗ ∗ = ⋅ 01 + (0 + 1) ⋅ 01 푞0 (0 + 1) ⋅ 01 + 1 (0 + 1) ⋅ 01 d1 d1 d1 ∗ ∗ ∗ (0 + 1) ⋅ 01 + 1 푞1 (0 + 1) ⋅ 01 + 1 (0 + 1) ⋅ 01 + 휀 d(0 + 1) ∗ = ⋅ (0 + 1) ⋅ 01 + ∅ (0 + 1)∗ ⋅ 01 + 휀 푞 (0 + 1)∗ ⋅ 01 + 1 (0 + 1)∗ ⋅ 01 d1 2 d0 d1 = ( + ) ⋅ (0 + 1)∗ ⋅ 01 1 0 d1 d1 = (∅ + 휀) ⋅ (0 + 1)∗ ⋅ 01 ∗ 0 1 = (0 + 1) ⋅ 01 start 푞0 푞1 푞2 0 1

18/31 19/31 Approximate pattern matching

• String (string function) is a metric that measures distance between two text strings for approximate string matching. • String metric can be considered as “inverse similarity” – Pattern Matching how two strings are dissimilar. • There are two classic metrics Approximate pattern matching 1. 2. • Yes, string dissimilarity, distance can be measured. Both are metrics from mathematical point of view – non-negativity, identity, symmetry, and triangle inequality.

20/31

Hamming distance Levenshtein distance

Definition Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different. Definition In other words, it measures the minimum number of Levenshtein distance (1965) between two strings is the substitutions required to change one string into the other. minimum number of single-character edits (insertions, deletions or substitutions) required to change one string into Example the other. Hamming distance of “karolin” and “kathrin” is 3.

k a r o l i n k a t h r i n 0 0 1 1 1 0 0

21/31 22/31 Levenshtein distance (cont.) Levenshtein distance (cont.)

Upper and lower bounds The Levenshtein distance has several simple upper and lower Example bounds: Levenshtein distance between “kitten” and “sitting” is 3: • It is at least the difference of the sizes of the two strings. 1. kitten → sitten (substitution of “s” for “k”) • It is at most the length of the longer string. 2. sitten → sittin (substitution of “i” for “e”) • It is zero if and only if the strings are equal. 3. sittin → sitting (insertion of “g” at the end). • If the strings are the same size, the Hamming distance is There is no way to do it with fewer than three edits. an upper bound on the Levenshtein distance. • The Levenshtein distance between two strings is no greater than the sum of their Levenshtein distances from a third string (triangle inequality).

23/31 24/31

Levenshtein distance (cont.) Levenshtein distance (cont.)

푖, 푗 = 0 ⎧ if 1 int LevenshteinDistance(const char *s, int len_s, const ⎪ ⎪ char *t, int len_t) ⎪푗, if 푖 = 0 2 { 푑(푖, 푗) = 푑(푖 − 1, 푗) + 1, ⎨ 3 int cost; ⎪ ⎪min ( 푑(푖, 푗 − 1) + 1, ) 4 ⎪ 5 /* base case: empty strings */ 푑(푖 − 1, 푗 − 1) + 푐(푖, 푗) ⎩ 6 if (len_s ==0) return len_t; where 7 if (len_t ==0) return len_s; 8 0 if 푎 = 푏 푐(푖, 푗) = { 푖 푗 9 /* test if last characters of the strings match */ 1 otherwise 10 if (s[len_s-1]== t[len_t-1]) 11 cost = 0; First element in the minimum corresponds to deletion (from 푎 12 else to 푏), the second to insertion and the third to match or 13 cost = 1; mismatch. 14

25/31 26/31 Levenshtein distance (cont.) Levenshtein distance (cont.)

k i t t e n 15 /* return minimum of delete char from s, delete char 0 1 2 3 4 5 6 from t, and delete char from both */ a a subst. of k for s 16 return minimum s 1 1 2 3 4 5 6 b b i is equal i 17 ( i 2 2 1 2 3 4 5 c t is equal t 18 LevenshteinDistance(s, len_s-1, t, len_t) + 1, c t 3 3 2 1 2 3 4 d 19 LevenshteinDistance(s, len_s, t, len_t-1) + 1, d t is equal t t 4 4 3 2 1 2 3 e 20 LevenshteinDistance(s, len_s-1, t, len_t-1) + cost subst. of e for i i 5 5 4 3 2 2e 3 21 ); f f n is equal n 22 } n 6 6 5 4 3 3 2 g delete g g 7 7 6 5 4 4 3g

27/31 28/31

Approximate pattern matching using finite automata Approximate pattern matching using finite automata (cont.)

Σ

푝1 푝2 푝3 푝4 start 푞0 푞1 푞2 푞3 푞4

푝1 푝2 푝3 푝4

Σ 푝2 푝3 푝4 푞5 푞6 푞7 푞8

푝2 푝3 푝4 left to right – match 푝1 푝2 푝3 푝4 푞 푞 푞 푞 푞 푝 푝 diagonal – replace start 0 1 2 3 4 3 4 푞9 푞10 푞11

푝3 푝4 NFA for the exact string matching (푚 = 4) 푝4 푞12 푞13

NFA for the approximate string matching using the Hamming distance (푚 = 4, 푘 = 3)

29/31 30/31 Approximate pattern matching using finite automata (cont.)

Σ

푝1 푝2 푝3 푝4 start 푞0 푞1 푞2 푞3 푞4

휀 휀 휀 휀

푝2 푝3 푝4 푝1 푝2 푝3 푝4 푝2 푝3 푝4 left to right – match 푞5 푞6 푞7 푞8 휀 휀 휀 diagonal – replace 푝3 푝4 푝2 푝3 푝4 푝3 푝4 down – insert 푞9 푞10 푞11

휀 휀 diagonal 휀– delete

푝4 푝3 푝4

푝4 푞12 푞13

NFA for the approximate string matching using the Levenshtein distance (푚 = 4, 푘 = 3)

31/31