Methods of Analysis of Textual Data (MATD)
Jiří Dvorský October 11, 2019
Department of Computer Science VŠB – TU Ostrava
1/31 Lectures Outline
1. Pattern Matching Exact pattern matching Searching for finite set of patterns
Searching for (Regular) Infinite Set of Patterns in Text Approximate pattern matching
2/31 Pattern Matching
Jiří Dvorský
Department of Computer Science VŠB – TU Ostrava
3/31 Pattern Matching Exact pattern matching Searching for (Regular) Infinite Set of Patterns in Text
1. How to describe infinte set of pattern i.e. string? Regular Expressions 2. What shall we use to perform matching? Finite Automata
4/31 Regular Expressions and Languages
Regular expression 푅 Value of expression ℎ(푅) Atomic expressions ∅ ∅ 휀 {휀} 푎, 푎 ∈ Σ {푎} Operations 푈 ⋅ 푉 {푢푣|푢 ∈ ℎ(푈) ∧ 푣 ∈ ℎ(푉)} 푈 + 푉 ℎ(푈) ∪ ℎ(푉) 푉 푘 = 푉⏟ ⋅ 푉 ⋅ … ⋅ 푉 푘 푡푖푚푒푠 푉 + = 푉 1 + 푉 2 + 푉 3 + … 푉 ∗ = 푉 0 + 푉 1 + 푉 2 + …
5/31 Regular Expression Features
푈 + (푉 + 푊) = (푈 + 푉) + 푊 푈 ⋅ (푉 ⋅ 푊) = (푈 ⋅ 푉) ⋅ 푊 푈 + 푉 = 푉 + 푈 (푈 + 푉) ⋅ 푊 = (푈 ⋅ 푊) + (푉 ⋅ 푊) 푈 ⋅ (푉 + 푊) = (푈 ⋅ 푉) + (푈 ⋅ 푊) 푈 + 푈 = 푈 휀 ⋅ 푈 = 푈 ∅ ⋅ 푈 = ∅ 푈 + ∅ = 푈 푈∗ = 휀 + 푈+
6/31 Deterministic Finite Automaton
Definition Deterministic Finite Automaton (DFA) is a quintuple
퐴 = (푄, Σ, 푞0, 훿, 퐹), where • 푄 is a finite set of states • Σ is an alphabet
• 푞0 ∈ 푄 is an initial state • 훿 ∶ 푄 × Σ → 푄 is a transition function • 퐹 ⊆ 푄 is a set of final states
7/31 Deterministic Finite Automaton (cont.)
Configuration of Finite Automaton
(푞, 푤) ∈ 푄 × Σ∗
Transition of Finite Automaton is a relation
↦∶ (푄 × Σ∗) × (푄 × Σ∗)
such as (푞, 푎푤) ↦ (푞’, 푤) ⟺ 훿(푞, 푎) = 푞’
Automaton accepts word 푤 if
∗ (푞0, 푤) ↦ (푞, 휀), 푞 ∈ 퐹
8/31 Nondeterministic Finite Automaton
Definition Nondeterministic Finite Automaton (NFA) is a quintuple
퐴 = (푄, Σ, 푞0, 훿, 퐹), where • 푄 is a finite set of states • Σ is an alphabet
• 푞0 ∈ 푄 is an initial state • 훿 ∶ 푄 × Σ → 푃(푄) is a transition function • 퐹 ⊆ 푄 is a set of final states
• Alternatively NFA can be defined as 퐴 = (푄, Σ, 푆, 훿, 퐹), where 푆 ⊆ 푄 is a set of initial states. • For each NFA, there is a DFA such that it recognizes the same formal language. 9/31 Nondeterministic Finite Automaton – example
Set of patterns 푃 = {he, her, she}
Σ
h e Σ start 푞1 푞2 푞3 Σ h e r start 푞1 푞2 푞3 푞4 h e r s start 푞4 푞5 푞6 푞7 Σ h e 푞5 푞6 푞7 s h e start 푞8 푞9 푞10 푞11
10/31 NFA ⟶ DFA Conversion
The DFA can be constructed using the powerset construction. ′ ′ ′ ′ ′ ′ NFA 퐴 = (푄, Σ, 푆, 훿, 퐹) ⟶ DFA 퐴 = (푄 ,Σ , 푞0, 훿 , 퐹 ) • 푄′ ⊆ 푃(푄) • Σ′ = Σ ′ • 푞0 = 푆 • 훿′(푞′, 푥) = ∪훿(푞, 푥) for all 푞 ∈ 푞′ • 퐹′ = {푞′ ∈ 푄′|푞′ ∩ 퐹 ≠ ∅}
11/31 NFA ⟶ DFA Conversion I
State Label 푒 ℎ 푟 푠 other ′ {1, 4, 8} 푞1 {1, 4, 8} {1, 2, 4, 5, 8} {1, 4, 8} {1, 4, 8, 9} {1, 4, 8} ′ {1, 2, 4, 5, 8} 푞2 {1, 3, 4, 6, 8} {1, 2, 4, 5, 8} {1, 4, 8} {1, 4, 8, 9} {1, 4, 8} ′ {1, 4, 8, 9} 푞3 {1, 4, 8} {1, 2, 4, 5, 8, 10} {1, 4, 8} {1, 4, 8, 9} {1, 4, 8} ′ {1, 3, 4, 6, 8} 푞4 {1, 4, 8} {1, 2, 4, 5, 8} {1, 4, 7, 8} {1, 4, 8, 9} {1, 4, 8} ′ {1, 2, 4, 5, 8, 10} 푞5 {1, 3, 4, 6, 8, 11} {1, 2, 4, 5, 8} {1, 4, 8} {1, 4, 8, 9} {1, 4, 8} ′ {1, 4, 7, 8} 푞6 {1, 4, 8} {1, 2, 4, 5, 8} {1, 4, 8} {1, 4, 8, 9} {1, 4, 8} ′ {1, 3, 4, 6, 8, 11} 푞7 {1, 4, 8} {1, 2, 4, 5, 8} {1, 4, 7, 8} {1, 4, 8, 9} {1, 4, 8}
h h Σ h ′ h ′ ′ r ′ 푞 h 푞 e 푞 start 1 2 3 start 푞 푞 푞 푞 1 2 e 4 6 Σ s s s r 푞 h 푞 e 푞 r 푞 s h start 4 5 6 7 h Σ ′ h ′ e ′ 푞3 푞5 푞7 s h e start 푞8 푞9 푞10 푞11 s s s
Only reachable states, transitions to state 푞1 are not shown. 12/31 NFA ⟶ DFA Conversion II
State Label 푒 ℎ 푟 푠 other ′ {1} 푞1 {1} {1, 2} {1} {1, 5} {1} ′ {1, 2} 푞2 {1, 3} {1, 2} {1} {1, 5} {1} ′ {1, 5} 푞3 {1} {1, 2, 6} {1} {1, 5} {1} ′ {1, 3} 푞4 {1} {1, 2} {1, 4} {1, 5} {1} ′ {1, 2, 6} 푞5 {1, 3, 7} {1, 2} {1} {1, 5} {1} ′ {1, 4} 푞6 {1} {1, 2} {1} {1, 5} {1} ′ {1, 3, 7} 푞7 {1} {1, 2} {1, 4} {1, 5} {1}
h h
h Σ ′ h ′ ′ r ′ start 푞 푞 푞 푞 1 2 e 4 6 푞 h 푞 e 푞 r 푞 s start 1 2 3 4 s s r s s h h 푞 h 푞 e 푞 ′ h ′ e ′ 5 6 7 푞3 푞5 푞7 s s s 13/31 Derivation of Regular Expression
For given regular expression 푅, derivation is defined as d푅 ℎ ( ) = {푦|푥푦 ∈ ℎ(푅)} d푥
Example For 푅 = 푎 + 푠ℎ푒푙푙 + 푠푡표푝 + 푝푙표푡 and its value ℎ(푅) = {푎, 푠ℎ푒푙푙, 푠푡표푝, 푝푙표푡} derivations are
d푅 ℎ ( ) = {휀} d푎 d푅 ℎ ( ) = {ℎ푒푙푙, 푡표푝} d푠 d푅 ℎ ( ) = ∅ d푡
14/31 Derivation of Regular Expression – properties
d∅ d(푈 + 푉) d푈 d푉 = ∅, ∀푎 ∈ Σ = + d푎 d푎 d푎 d푎 d휀 d(푈 ⋅ 푉) d푈 = ∅, ∀푎 ∈ Σ = ⋅ 푉, 휀 ∉ 푈 d푎 d푎 d푎 d푎 d(푈 ⋅ 푉) d푈 d푉 = 휀, ∀푎 ∈ Σ = ⋅ 푉 + , 휀 ∈ 푈 d푎 d푎 d푎 d푎 d푏 d푉 ∗ d푉 = ∅, ∀푏 ≠ 푎 = ⋅ 푉 ∗ d푎 d푎 d푎
d푉 d d d d푉 = ( (⋯ ( ))) , for 푥 = 푎1푎2 … 푎푛 d푥 d푎푛 d푎푛−1 d푎2 d푎1
15/31 Construction of DFA Derivations of RE
• Derivation of regular expressions allows directly and algorithmically build DFA for any regular expression. • Let 푉 is given regular expression in alphabet Σ. • Each state of DFA defines a set of words, that move the DFA from this state to any of final states. So, every state can be associated with regular expression, defining this set of words
푞0 = 푉 d푞 훿(푞, 푥) = d푥 퐹 = {푞 ∈ 푄|휀 ∈ ℎ(푞)}
16/31 Construction of DFA Derivations of RE – example
Lest’s have 푉 = (0 + 1)∗ ⋅ 01 over alphabet Σ{0, 1}. ∗ Then 푞0 = (0 + 1) ⋅ 01 Example of derivations:
∗ ∗ d((0 + 1) ⋅ 01) d((0 + 1) ) d01 = ⋅ 01 + d0 d0 d0 d(0 + 1) = ⋅ (0 + 1)∗ ⋅ 01 + 1 d0 d0 d1 = ( + ) ⋅ (0 + 1)∗ ⋅ 01 + 1 d0 d0 = (휀 + ∅) ⋅ (0 + 1)∗ ⋅ 01 + 1 = (0 + 1)∗ ⋅ 01 + 1
17/31 Construction of DFA Derivations of RE – example (cont.)
∗ ∗ d((0 + 1) ⋅ 01) d((0 + 1) ) d01 = ⋅ 01 + d1 d1 d1 d(0 + 1) = ⋅ (0 + 1)∗ ⋅ 01 + ∅ d1 d0 d1 = ( + ) ⋅ (0 + 1)∗ ⋅ 01 d1 d1 = (∅ + 휀) ⋅ (0 + 1)∗ ⋅ 01 = (0 + 1)∗ ⋅ 01
18/31 Construction of DFA Derivations of RE – example (cont.)
Regular Expression State 0 1 ∗ ∗ ∗ (0 + 1) ⋅ 01 푞0 (0 + 1) ⋅ 01 + 1 (0 + 1) ⋅ 01 ∗ ∗ ∗ (0 + 1) ⋅ 01 + 1 푞1 (0 + 1) ⋅ 01 + 1 (0 + 1) ⋅ 01 + 휀 ∗ ∗ ∗ (0 + 1) ⋅ 01 + 휀 푞2 (0 + 1) ⋅ 01 + 1 (0 + 1) ⋅ 01
1 0
0 1 start 푞0 푞1 푞2 0 1
19/31 Pattern Matching Approximate pattern matching Approximate pattern matching
• String metric (string distance function) is a metric that measures distance between two text strings for approximate string matching. • String metric can be considered as “inverse similarity” – how two strings are dissimilar. • There are two classic metrics 1. Hamming distance 2. Levenshtein distance • Yes, string dissimilarity, distance can be measured. Both distances are metrics from mathematical point of view – non-negativity, identity, symmetry, and triangle inequality.
20/31 Hamming distance
Definition Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different.
In other words, it measures the minimum number of substitutions required to change one string into the other. Example Hamming distance of “karolin” and “kathrin” is 3.
k a r o l i n k a t h r i n 0 0 1 1 1 0 0
21/31 Levenshtein distance
Definition Levenshtein distance (1965) between two strings is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one string into the other.
22/31 Levenshtein distance (cont.)
Example Levenshtein distance between “kitten” and “sitting” is 3:
1. kitten → sitten (substitution of “s” for “k”) 2. sitten → sittin (substitution of “i” for “e”) 3. sittin → sitting (insertion of “g” at the end).
There is no way to do it with fewer than three edits.
23/31 Levenshtein distance (cont.)
Upper and lower bounds The Levenshtein distance has several simple upper and lower bounds:
• It is at least the difference of the sizes of the two strings. • It is at most the length of the longer string. • It is zero if and only if the strings are equal. • If the strings are the same size, the Hamming distance is an upper bound on the Levenshtein distance. • The Levenshtein distance between two strings is no greater than the sum of their Levenshtein distances from a third string (triangle inequality).
24/31 Levenshtein distance (cont.)
⎧푖, if 푗 = 0 ⎪ ⎪ ⎪푗, if 푖 = 0 푑(푖, 푗) = 푑(푖 − 1, 푗) + 1, ⎨ ⎪ ⎪min ( 푑(푖, 푗 − 1) + 1, ) ⎪ ⎩ 푑(푖 − 1, 푗 − 1) + 푐(푖, 푗) where 0 if 푎 = 푏 푐(푖, 푗) = { 푖 푗 1 otherwise
First element in the minimum corresponds to deletion (from 푎 to 푏), the second to insertion and the third to match or mismatch.
25/31 Levenshtein distance (cont.)
1 int LevenshteinDistance(const char *s, int len_s, const char *t, int len_t) 2 { 3 int cost; 4 5 /* base case: empty strings */ 6 if (len_s ==0) return len_t; 7 if (len_t ==0) return len_s; 8 9 /* test if last characters of the strings match */ 10 if (s[len_s-1]== t[len_t-1]) 11 cost = 0; 12 else 13 cost = 1; 14
26/31 Levenshtein distance (cont.)
15 /* return minimum of delete char from s, delete char from t, and delete char from both */ 16 return minimum 17 ( 18 LevenshteinDistance(s, len_s-1, t, len_t) + 1, 19 LevenshteinDistance(s, len_s, t, len_t-1) + 1, 20 LevenshteinDistance(s, len_s-1, t, len_t-1) + cost 21 ); 22 }
27/31 Levenshtein distance (cont.)
k i t t e n 0 1 2 3 4 5 6 asubst. of k for s s 1 1a 2 3 4 5 6 b i is equal i i 2 2 1b 2 3 4 5 c t is equal t t 3 3 2 1c 2 3 4 d t is equal t t 4 4 3 2 1d 2 3 esubst. of e for i i 5 5 4 3 2 2e 3 f n is equal n n 6 6 5 4 3 3 2f g delete g g 7 7 6 5 4 4 3g
28/31 Approximate pattern matching using finite automata
Σ
푝1 푝2 푝3 푝4 start 푞0 푞1 푞2 푞3 푞4
NFA for the exact string matching (푚 = 4)
29/31 Approximate pattern matching using finite automata (cont.)
Σ
푝1 푝2 푝3 푝4 start 푞0 푞1 푞2 푞3 푞4
푝1 푝2 푝3 푝4
푝2 푝3 푝4 푞5 푞6 푞7 푞8
푝2 푝3 푝4 left to right – match
푝3 푝4 diagonal – replace 푞9 푞10 푞11
푝3 푝4
푝4 푞12 푞13
NFA for the approximate string matching using the Hamming distance (푚 = 4, 푘 = 3)
30/31 Approximate pattern matching using finite automata (cont.)
Σ
푝1 푝2 푝3 푝4 start 푞0 푞1 푞2 푞3 푞4
휀 휀 휀 휀
푝2 푝3 푝4 푝1 푝2 푝3 푝4 푝2 푝3 푝4 left to right – match 푞5 푞6 푞7 푞8 휀 휀 휀 diagonal – replace 푝3 푝4 푝2 푝3 푝4 푝3 푝4 down – insert 푞9 푞10 푞11
휀 휀 diagonal 휀– delete
푝4 푝3 푝4
푝4 푞12 푞13
NFA for the approximate string matching using the Levenshtein distance (푚 = 4, 푘 = 3)
31/31