<<

Methods of Analysis of Textual Data (MATD)

Jiří Dvorský October 11, 2019

Department of VŠB – TU Ostrava

1/31 Lectures Outline

1. Exact pattern matching Searching for finite set of patterns

Searching for (Regular) Infinite Set of Patterns in Text Approximate pattern matching

2/31 Pattern Matching

Jiří Dvorský

Department of Computer Science VŠB – TU Ostrava

3/31 Pattern Matching Exact pattern matching Searching for (Regular) Infinite Set of Patterns in Text

1. How to describe infinte set of pattern i.e. string? Regular Expressions 2. What shall we use to perform matching? Finite Automata

4/31 Regular Expressions and Languages

Regular expression 푅 Value of expression ℎ(푅) Atomic expressions ∅ ∅ 휀 {휀} 푎, 푎 ∈ Σ {푎} Operations 푈 ⋅ 푉 {푢푣|푢 ∈ ℎ(푈) ∧ 푣 ∈ ℎ(푉)} 푈 + 푉 ℎ(푈) ∪ ℎ(푉) 푉 푘 = 푉⏟ ⋅ 푉 ⋅ … ⋅ 푉 푘 푡푖푚푒푠 푉 + = 푉 1 + 푉 2 + 푉 3 + … 푉 ∗ = 푉 0 + 푉 1 + 푉 2 + …

5/31 Features

푈 + (푉 + 푊) = (푈 + 푉) + 푊 푈 ⋅ (푉 ⋅ 푊) = (푈 ⋅ 푉) ⋅ 푊 푈 + 푉 = 푉 + 푈 (푈 + 푉) ⋅ 푊 = (푈 ⋅ 푊) + (푉 ⋅ 푊) 푈 ⋅ (푉 + 푊) = (푈 ⋅ 푉) + (푈 ⋅ 푊) 푈 + 푈 = 푈 휀 ⋅ 푈 = 푈 ∅ ⋅ 푈 = ∅ 푈 + ∅ = 푈 푈∗ = 휀 + 푈+

6/31 Deterministic Finite Automaton

Definition Deterministic Finite Automaton (DFA) is a quintuple

퐴 = (푄, Σ, 푞0, 훿, 퐹), where • 푄 is a finite set of states • Σ is an alphabet

• 푞0 ∈ 푄 is an initial state • 훿 ∶ 푄 × Σ → 푄 is a transition function • 퐹 ⊆ 푄 is a set of final states

7/31 Deterministic Finite Automaton (cont.)

Configuration of Finite Automaton

(푞, 푤) ∈ 푄 × Σ∗

Transition of Finite Automaton is a relation

↦∶ (푄 × Σ∗) × (푄 × Σ∗)

such as (푞, 푎푤) ↦ (푞’, 푤) ⟺ 훿(푞, 푎) = 푞’

Automaton accepts word 푤 if

∗ (푞0, 푤) ↦ (푞, 휀), 푞 ∈ 퐹

8/31 Nondeterministic Finite Automaton

Definition Nondeterministic Finite Automaton (NFA) is a quintuple

퐴 = (푄, Σ, 푞0, 훿, 퐹), where • 푄 is a finite set of states • Σ is an alphabet

• 푞0 ∈ 푄 is an initial state • 훿 ∶ 푄 × Σ → 푃(푄) is a transition function • 퐹 ⊆ 푄 is a set of final states

• Alternatively NFA can be defined as 퐴 = (푄, Σ, 푆, 훿, 퐹), where 푆 ⊆ 푄 is a set of initial states. • For each NFA, there is a DFA such that it recognizes the same formal language. 9/31 Nondeterministic Finite Automaton – example

Set of patterns 푃 = {he, her, she}

Σ

h e Σ start 푞1 푞2 푞3 Σ h e r start 푞1 푞2 푞3 푞4 h e r s start 푞4 푞5 푞6 푞7 Σ h e 푞5 푞6 푞7 s h e start 푞8 푞9 푞10 푞11

10/31 NFA ⟶ DFA Conversion

The DFA can be constructed using the powerset construction. ′ ′ ′ ′ ′ ′ NFA 퐴 = (푄, Σ, 푆, 훿, 퐹) ⟶ DFA 퐴 = (푄 ,Σ , 푞0, 훿 , 퐹 ) • 푄′ ⊆ 푃(푄) • Σ′ = Σ ′ • 푞0 = 푆 • 훿′(푞′, 푥) = ∪훿(푞, 푥) for all 푞 ∈ 푞′ • 퐹′ = {푞′ ∈ 푄′|푞′ ∩ 퐹 ≠ ∅}

11/31 NFA ⟶ DFA Conversion I

State Label 푒 ℎ 푟 푠 other ′ {1, 4, 8} 푞1 {1, 4, 8} {1, 2, 4, 5, 8} {1, 4, 8} {1, 4, 8, 9} {1, 4, 8} ′ {1, 2, 4, 5, 8} 푞2 {1, 3, 4, 6, 8} {1, 2, 4, 5, 8} {1, 4, 8} {1, 4, 8, 9} {1, 4, 8} ′ {1, 4, 8, 9} 푞3 {1, 4, 8} {1, 2, 4, 5, 8, 10} {1, 4, 8} {1, 4, 8, 9} {1, 4, 8} ′ {1, 3, 4, 6, 8} 푞4 {1, 4, 8} {1, 2, 4, 5, 8} {1, 4, 7, 8} {1, 4, 8, 9} {1, 4, 8} ′ {1, 2, 4, 5, 8, 10} 푞5 {1, 3, 4, 6, 8, 11} {1, 2, 4, 5, 8} {1, 4, 8} {1, 4, 8, 9} {1, 4, 8} ′ {1, 4, 7, 8} 푞6 {1, 4, 8} {1, 2, 4, 5, 8} {1, 4, 8} {1, 4, 8, 9} {1, 4, 8} ′ {1, 3, 4, 6, 8, 11} 푞7 {1, 4, 8} {1, 2, 4, 5, 8} {1, 4, 7, 8} {1, 4, 8, 9} {1, 4, 8}

h h Σ h ′ h ′ ′ r ′ 푞 h 푞 e 푞 start 1 2 3 start 푞 푞 푞 푞 1 2 e 4 6 Σ s s s r 푞 h 푞 e 푞 r 푞 s h start 4 5 6 7 h Σ ′ h ′ e ′ 푞3 푞5 푞7 s h e start 푞8 푞9 푞10 푞11 s s s

Only reachable states, transitions to state 푞1 are not shown. 12/31 NFA ⟶ DFA Conversion II

State Label 푒 ℎ 푟 푠 other ′ {1} 푞1 {1} {1, 2} {1} {1, 5} {1} ′ {1, 2} 푞2 {1, 3} {1, 2} {1} {1, 5} {1} ′ {1, 5} 푞3 {1} {1, 2, 6} {1} {1, 5} {1} ′ {1, 3} 푞4 {1} {1, 2} {1, 4} {1, 5} {1} ′ {1, 2, 6} 푞5 {1, 3, 7} {1, 2} {1} {1, 5} {1} ′ {1, 4} 푞6 {1} {1, 2} {1} {1, 5} {1} ′ {1, 3, 7} 푞7 {1} {1, 2} {1, 4} {1, 5} {1}

h h

h Σ ′ h ′ ′ r ′ start 푞 푞 푞 푞 1 2 e 4 6 푞 h 푞 e 푞 r 푞 s start 1 2 3 4 s s r s s h h 푞 h 푞 e 푞 ′ h ′ e ′ 5 6 7 푞3 푞5 푞7 s s s 13/31 Derivation of Regular Expression

For given regular expression 푅, derivation is defined as d푅 ℎ ( ) = {푦|푥푦 ∈ ℎ(푅)} d푥

Example For 푅 = 푎 + 푠ℎ푒푙푙 + 푠푡표푝 + 푝푙표푡 and its value ℎ(푅) = {푎, 푠ℎ푒푙푙, 푠푡표푝, 푝푙표푡} derivations are

d푅 ℎ ( ) = {휀} d푎 d푅 ℎ ( ) = {ℎ푒푙푙, 푡표푝} d푠 d푅 ℎ ( ) = ∅ d푡

14/31 Derivation of Regular Expression – properties

d∅ d(푈 + 푉) d푈 d푉 = ∅, ∀푎 ∈ Σ = + d푎 d푎 d푎 d푎 d휀 d(푈 ⋅ 푉) d푈 = ∅, ∀푎 ∈ Σ = ⋅ 푉, 휀 ∉ 푈 d푎 d푎 d푎 d푎 d(푈 ⋅ 푉) d푈 d푉 = 휀, ∀푎 ∈ Σ = ⋅ 푉 + , 휀 ∈ 푈 d푎 d푎 d푎 d푎 d푏 d푉 ∗ d푉 = ∅, ∀푏 ≠ 푎 = ⋅ 푉 ∗ d푎 d푎 d푎

d푉 d d d d푉 = ( (⋯ ( ))) , for 푥 = 푎1푎2 … 푎푛 d푥 d푎푛 d푎푛−1 d푎2 d푎1

15/31 Construction of DFA Derivations of RE

• Derivation of regular expressions allows directly and algorithmically build DFA for any regular expression. • Let 푉 is given regular expression in alphabet Σ. • Each state of DFA defines a set of words, that move the DFA from this state to any of final states. So, every state can be associated with regular expression, defining this set of words

푞0 = 푉 d푞 훿(푞, 푥) = d푥 퐹 = {푞 ∈ 푄|휀 ∈ ℎ(푞)}

16/31 Construction of DFA Derivations of RE – example

Lest’s have 푉 = (0 + 1)∗ ⋅ 01 over alphabet Σ{0, 1}. ∗ Then 푞0 = (0 + 1) ⋅ 01 Example of derivations:

∗ ∗ d((0 + 1) ⋅ 01) d((0 + 1) ) d01 = ⋅ 01 + d0 d0 d0 d(0 + 1) = ⋅ (0 + 1)∗ ⋅ 01 + 1 d0 d0 d1 = ( + ) ⋅ (0 + 1)∗ ⋅ 01 + 1 d0 d0 = (휀 + ∅) ⋅ (0 + 1)∗ ⋅ 01 + 1 = (0 + 1)∗ ⋅ 01 + 1

17/31 Construction of DFA Derivations of RE – example (cont.)

∗ ∗ d((0 + 1) ⋅ 01) d((0 + 1) ) d01 = ⋅ 01 + d1 d1 d1 d(0 + 1) = ⋅ (0 + 1)∗ ⋅ 01 + ∅ d1 d0 d1 = ( + ) ⋅ (0 + 1)∗ ⋅ 01 d1 d1 = (∅ + 휀) ⋅ (0 + 1)∗ ⋅ 01 = (0 + 1)∗ ⋅ 01

18/31 Construction of DFA Derivations of RE – example (cont.)

Regular Expression State 0 1 ∗ ∗ ∗ (0 + 1) ⋅ 01 푞0 (0 + 1) ⋅ 01 + 1 (0 + 1) ⋅ 01 ∗ ∗ ∗ (0 + 1) ⋅ 01 + 1 푞1 (0 + 1) ⋅ 01 + 1 (0 + 1) ⋅ 01 + 휀 ∗ ∗ ∗ (0 + 1) ⋅ 01 + 휀 푞2 (0 + 1) ⋅ 01 + 1 (0 + 1) ⋅ 01

1 0

0 1 start 푞0 푞1 푞2 0 1

19/31 Pattern Matching Approximate pattern matching Approximate pattern matching

• String (string function) is a metric that measures distance between two text strings for approximate string matching. • String metric can be considered as “inverse similarity” – how two strings are dissimilar. • There are two classic metrics 1. 2. • Yes, string dissimilarity, distance can be measured. Both are metrics from mathematical point of view – non-negativity, identity, symmetry, and triangle inequality.

20/31 Hamming distance

Definition Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different.

In other words, it measures the minimum number of substitutions required to change one string into the other. Example Hamming distance of “karolin” and “kathrin” is 3.

k a r o l i n k a t h r i n 0 0 1 1 1 0 0

21/31 Levenshtein distance

Definition Levenshtein distance (1965) between two strings is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one string into the other.

22/31 Levenshtein distance (cont.)

Example Levenshtein distance between “kitten” and “sitting” is 3:

1. kitten → sitten (substitution of “s” for “k”) 2. sitten → sittin (substitution of “i” for “e”) 3. sittin → sitting (insertion of “g” at the end).

There is no way to do it with fewer than three edits.

23/31 Levenshtein distance (cont.)

Upper and lower bounds The Levenshtein distance has several simple upper and lower bounds:

• It is at least the difference of the sizes of the two strings. • It is at most the length of the longer string. • It is zero if and only if the strings are equal. • If the strings are the same size, the Hamming distance is an upper bound on the Levenshtein distance. • The Levenshtein distance between two strings is no greater than the sum of their Levenshtein distances from a third string (triangle inequality).

24/31 Levenshtein distance (cont.)

⎧푖, if 푗 = 0 ⎪ ⎪ ⎪푗, if 푖 = 0 푑(푖, 푗) = 푑(푖 − 1, 푗) + 1, ⎨ ⎪ ⎪min ( 푑(푖, 푗 − 1) + 1, ) ⎪ ⎩ 푑(푖 − 1, 푗 − 1) + 푐(푖, 푗) where 0 if 푎 = 푏 푐(푖, 푗) = { 푖 푗 1 otherwise

First element in the minimum corresponds to deletion (from 푎 to 푏), the second to insertion and the third to match or mismatch.

25/31 Levenshtein distance (cont.)

1 int LevenshteinDistance(const char *s, int len_s, const char *t, int len_t) 2 { 3 int cost; 4 5 /* base case: empty strings */ 6 if (len_s ==0) return len_t; 7 if (len_t ==0) return len_s; 8 9 /* test if last characters of the strings match */ 10 if (s[len_s-1]== t[len_t-1]) 11 cost = 0; 12 else 13 cost = 1; 14

26/31 Levenshtein distance (cont.)

15 /* return minimum of delete char from s, delete char from t, and delete char from both */ 16 return minimum 17 ( 18 LevenshteinDistance(s, len_s-1, t, len_t) + 1, 19 LevenshteinDistance(s, len_s, t, len_t-1) + 1, 20 LevenshteinDistance(s, len_s-1, t, len_t-1) + cost 21 ); 22 }

27/31 Levenshtein distance (cont.)

k i t t e n 0 1 2 3 4 5 6 asubst. of k for s s 1 1a 2 3 4 5 6 b i is equal i i 2 2 1b 2 3 4 5 c t is equal t t 3 3 2 1c 2 3 4 d t is equal t t 4 4 3 2 1d 2 3 esubst. of e for i i 5 5 4 3 2 2e 3 f n is equal n n 6 6 5 4 3 3 2f g delete g g 7 7 6 5 4 4 3g

28/31 Approximate pattern matching using finite automata

Σ

푝1 푝2 푝3 푝4 start 푞0 푞1 푞2 푞3 푞4

NFA for the exact string matching (푚 = 4)

29/31 Approximate pattern matching using finite automata (cont.)

Σ

푝1 푝2 푝3 푝4 start 푞0 푞1 푞2 푞3 푞4

푝1 푝2 푝3 푝4

푝2 푝3 푝4 푞5 푞6 푞7 푞8

푝2 푝3 푝4 left to right – match

푝3 푝4 diagonal – replace 푞9 푞10 푞11

푝3 푝4

푝4 푞12 푞13

NFA for the approximate string matching using the Hamming distance (푚 = 4, 푘 = 3)

30/31 Approximate pattern matching using finite automata (cont.)

Σ

푝1 푝2 푝3 푝4 start 푞0 푞1 푞2 푞3 푞4

휀 휀 휀 휀

푝2 푝3 푝4 푝1 푝2 푝3 푝4 푝2 푝3 푝4 left to right – match 푞5 푞6 푞7 푞8 휀 휀 휀 diagonal – replace 푝3 푝4 푝2 푝3 푝4 푝3 푝4 down – insert 푞9 푞10 푞11

휀 휀 diagonal 휀– delete

푝4 푝3 푝4

푝4 푞12 푞13

NFA for the approximate string matching using the Levenshtein distance (푚 = 4, 푘 = 3)

31/31