
Methods of Analysis of Textual Data (MATD) Jiří Dvorský October 11, 2019 Department of Computer Science VŠB – TU Ostrava 1/31 Lectures Outline 1. Pattern Matching Exact pattern matching Searching for finite set of patterns Searching for (Regular) Infinite Set of Patterns in Text Approximate pattern matching 2/31 Pattern Matching Jiří Dvorský Department of Computer Science VŠB – TU Ostrava 3/31 Pattern Matching Exact pattern matching Searching for (Regular) Infinite Set of Patterns in Text 1. How to describe infinte set of pattern i.e. string? Regular Expressions 2. What shall we use to perform matching? Finite Automata 4/31 Regular Expressions and Languages Regular expression 푅 Value of expression ℎ(푅) Atomic expressions ∅ ∅ 휀 {휀} 푎, 푎 ∈ Σ {푎} Operations 푈 ⋅ 푉 {푢푣|푢 ∈ ℎ(푈) ∧ 푣 ∈ ℎ(푉)} 푈 + 푉 ℎ(푈) ∪ ℎ(푉) 푉 푘 = 푉⏟ ⋅ 푉 ⋅ … ⋅ 푉 푘 푡푖푚푒푠 푉 + = 푉 1 + 푉 2 + 푉 3 + … 푉 ∗ = 푉 0 + 푉 1 + 푉 2 + … 5/31 Regular Expression Features 푈 + (푉 + 푊) = (푈 + 푉) + 푊 푈 ⋅ (푉 ⋅ 푊) = (푈 ⋅ 푉) ⋅ 푊 푈 + 푉 = 푉 + 푈 (푈 + 푉) ⋅ 푊 = (푈 ⋅ 푊) + (푉 ⋅ 푊) 푈 ⋅ (푉 + 푊) = (푈 ⋅ 푉) + (푈 ⋅ 푊) 푈 + 푈 = 푈 휀 ⋅ 푈 = 푈 ∅ ⋅ 푈 = ∅ 푈 + ∅ = 푈 푈∗ = 휀 + 푈+ 6/31 Deterministic Finite Automaton Definition Deterministic Finite Automaton (DFA) is a quintuple 퐴 = (푄, Σ, 푞0, 훿, 퐹), where • 푄 is a finite set of states • Σ is an alphabet • 푞0 ∈ 푄 is an initial state • 훿 ∶ 푄 × Σ → 푄 is a transition function • 퐹 ⊆ 푄 is a set of final states 7/31 Deterministic Finite Automaton (cont.) Configuration of Finite Automaton (푞, 푤) ∈ 푄 × Σ∗ Transition of Finite Automaton is a relation ↦∶ (푄 × Σ∗) × (푄 × Σ∗) such as (푞, 푎푤) ↦ (푞’, 푤) ⟺ 훿(푞, 푎) = 푞’ Automaton accepts word 푤 if ∗ (푞0, 푤) ↦ (푞, 휀), 푞 ∈ 퐹 8/31 Nondeterministic Finite Automaton Definition Nondeterministic Finite Automaton (NFA) is a quintuple 퐴 = (푄, Σ, 푞0, 훿, 퐹), where • 푄 is a finite set of states • Σ is an alphabet • 푞0 ∈ 푄 is an initial state • 훿 ∶ 푄 × Σ → 푃(푄) is a transition function • 퐹 ⊆ 푄 is a set of final states • Alternatively NFA can be defined as 퐴 = (푄, Σ, 푆, 훿, 퐹), where 푆 ⊆ 푄 is a set of initial states. • For each NFA, there is a DFA such that it recognizes the same formal language. 9/31 Nondeterministic Finite Automaton – example Set of patterns 푃 = {he, her, she} Σ h e Σ start 푞1 푞2 푞3 Σ h e r start 푞1 푞2 푞3 푞4 h e r s start 푞4 푞5 푞6 푞7 Σ h e 푞5 푞6 푞7 s h e start 푞8 푞9 푞10 푞11 10/31 NFA ⟶ DFA Conversion The DFA can be constructed using the powerset construction. ′ ′ ′ ′ ′ ′ NFA 퐴 = (푄, Σ, 푆, 훿, 퐹) ⟶ DFA 퐴 = (푄 ,Σ , 푞0, 훿 , 퐹 ) • 푄′ ⊆ 푃(푄) • Σ′ = Σ ′ • 푞0 = 푆 • 훿′(푞′, 푥) = ∪훿(푞, 푥) for all 푞 ∈ 푞′ • 퐹′ = {푞′ ∈ 푄′|푞′ ∩ 퐹 ≠ ∅} 11/31 NFA ⟶ DFA Conversion I State Label 푒 ℎ 푟 푠 other ′ {1, 4, 8} 푞1 {1, 4, 8} {1, 2, 4, 5, 8} {1, 4, 8} {1, 4, 8, 9} {1, 4, 8} ′ {1, 2, 4, 5, 8} 푞2 {1, 3, 4, 6, 8} {1, 2, 4, 5, 8} {1, 4, 8} {1, 4, 8, 9} {1, 4, 8} ′ {1, 4, 8, 9} 푞3 {1, 4, 8} {1, 2, 4, 5, 8, 10} {1, 4, 8} {1, 4, 8, 9} {1, 4, 8} ′ {1, 3, 4, 6, 8} 푞4 {1, 4, 8} {1, 2, 4, 5, 8} {1, 4, 7, 8} {1, 4, 8, 9} {1, 4, 8} ′ {1, 2, 4, 5, 8, 10} 푞5 {1, 3, 4, 6, 8, 11} {1, 2, 4, 5, 8} {1, 4, 8} {1, 4, 8, 9} {1, 4, 8} ′ {1, 4, 7, 8} 푞6 {1, 4, 8} {1, 2, 4, 5, 8} {1, 4, 8} {1, 4, 8, 9} {1, 4, 8} ′ {1, 3, 4, 6, 8, 11} 푞7 {1, 4, 8} {1, 2, 4, 5, 8} {1, 4, 7, 8} {1, 4, 8, 9} {1, 4, 8} h h Σ h ′ h ′ ′ r ′ 푞 h 푞 e 푞 start 1 2 3 start 푞 푞 푞 푞 1 2 e 4 6 Σ s s s r 푞 h 푞 e 푞 r 푞 s h start 4 5 6 7 h Σ ′ h ′ e ′ 푞3 푞5 푞7 s h e start 푞8 푞9 푞10 푞11 s s s Only reachable states, transitions to state 푞1 are not shown. 12/31 NFA ⟶ DFA Conversion II State Label 푒 ℎ 푟 푠 other ′ {1} 푞1 {1} {1, 2} {1} {1, 5} {1} ′ {1, 2} 푞2 {1, 3} {1, 2} {1} {1, 5} {1} ′ {1, 5} 푞3 {1} {1, 2, 6} {1} {1, 5} {1} ′ {1, 3} 푞4 {1} {1, 2} {1, 4} {1, 5} {1} ′ {1, 2, 6} 푞5 {1, 3, 7} {1, 2} {1} {1, 5} {1} ′ {1, 4} 푞6 {1} {1, 2} {1} {1, 5} {1} ′ {1, 3, 7} 푞7 {1} {1, 2} {1, 4} {1, 5} {1} h h h Σ ′ h ′ ′ r ′ start 푞 푞 푞 푞 1 2 e 4 6 푞 h 푞 e 푞 r 푞 s start 1 2 3 4 s s r s s h h 푞 h 푞 e 푞 ′ h ′ e ′ 5 6 7 푞3 푞5 푞7 s s s 13/31 Derivation of Regular Expression For given regular expression 푅, derivation is defined as d푅 ℎ ( ) = {푦|푥푦 ∈ ℎ(푅)} d푥 Example For 푅 = 푎 + 푠ℎ푒푙푙 + 푠푡표푝 + 푝푙표푡 and its value ℎ(푅) = {푎, 푠ℎ푒푙푙, 푠푡표푝, 푝푙표푡} derivations are d푅 ℎ ( ) = {휀} d푎 d푅 ℎ ( ) = {ℎ푒푙푙, 푡표푝} d푠 d푅 ℎ ( ) = ∅ d푡 14/31 Derivation of Regular Expression – properties d∅ d(푈 + 푉) d푈 d푉 = ∅, ∀푎 ∈ Σ = + d푎 d푎 d푎 d푎 d휀 d(푈 ⋅ 푉) d푈 = ∅, ∀푎 ∈ Σ = ⋅ 푉, 휀 ∉ 푈 d푎 d푎 d푎 d푎 d(푈 ⋅ 푉) d푈 d푉 = 휀, ∀푎 ∈ Σ = ⋅ 푉 + , 휀 ∈ 푈 d푎 d푎 d푎 d푎 d푏 d푉 ∗ d푉 = ∅, ∀푏 ≠ 푎 = ⋅ 푉 ∗ d푎 d푎 d푎 d푉 d d d d푉 = ( (⋯ ( ))) , for 푥 = 푎1푎2 … 푎푛 d푥 d푎푛 d푎푛−1 d푎2 d푎1 15/31 Construction of DFA Derivations of RE • Derivation of regular expressions allows directly and algorithmically build DFA for any regular expression. • Let 푉 is given regular expression in alphabet Σ. • Each state of DFA defines a set of words, that move the DFA from this state to any of final states. So, every state can be associated with regular expression, defining this set of words 푞0 = 푉 d푞 훿(푞, 푥) = d푥 퐹 = {푞 ∈ 푄|휀 ∈ ℎ(푞)} 16/31 Construction of DFA Derivations of RE – example Lest’s have 푉 = (0 + 1)∗ ⋅ 01 over alphabet Σ{0, 1}. ∗ Then 푞0 = (0 + 1) ⋅ 01 Example of derivations: ∗ ∗ d((0 + 1) ⋅ 01) d((0 + 1) ) d01 = ⋅ 01 + d0 d0 d0 d(0 + 1) = ⋅ (0 + 1)∗ ⋅ 01 + 1 d0 d0 d1 = ( + ) ⋅ (0 + 1)∗ ⋅ 01 + 1 d0 d0 = (휀 + ∅) ⋅ (0 + 1)∗ ⋅ 01 + 1 = (0 + 1)∗ ⋅ 01 + 1 17/31 Construction of DFA Derivations of RE – example (cont.) ∗ ∗ d((0 + 1) ⋅ 01) d((0 + 1) ) d01 = ⋅ 01 + d1 d1 d1 d(0 + 1) = ⋅ (0 + 1)∗ ⋅ 01 + ∅ d1 d0 d1 = ( + ) ⋅ (0 + 1)∗ ⋅ 01 d1 d1 = (∅ + 휀) ⋅ (0 + 1)∗ ⋅ 01 = (0 + 1)∗ ⋅ 01 18/31 Construction of DFA Derivations of RE – example (cont.) Regular Expression State 0 1 ∗ ∗ ∗ (0 + 1) ⋅ 01 푞0 (0 + 1) ⋅ 01 + 1 (0 + 1) ⋅ 01 ∗ ∗ ∗ (0 + 1) ⋅ 01 + 1 푞1 (0 + 1) ⋅ 01 + 1 (0 + 1) ⋅ 01 + 휀 ∗ ∗ ∗ (0 + 1) ⋅ 01 + 휀 푞2 (0 + 1) ⋅ 01 + 1 (0 + 1) ⋅ 01 1 0 0 1 start 푞0 푞1 푞2 0 1 19/31 Pattern Matching Approximate pattern matching Approximate pattern matching • String metric (string distance function) is a metric that measures distance between two text strings for approximate string matching. • String metric can be considered as “inverse similarity” – how two strings are dissimilar. • There are two classic metrics 1. Hamming distance 2. Levenshtein distance • Yes, string dissimilarity, distance can be measured. Both distances are metrics from mathematical point of view – non-negativity, identity, symmetry, and triangle inequality. 20/31 Hamming distance Definition Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different. In other words, it measures the minimum number of substitutions required to change one string into the other. Example Hamming distance of “karolin” and “kathrin” is 3. k a r o l i n k a t h r i n 0 0 1 1 1 0 0 21/31 Levenshtein distance Definition Levenshtein distance (1965) between two strings is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one string into the other. 22/31 Levenshtein distance (cont.) Example Levenshtein distance between “kitten” and “sitting” is 3: 1. kitten → sitten (substitution of “s” for “k”) 2. sitten → sittin (substitution of “i” for “e”) 3. sittin → sitting (insertion of “g” at the end). There is no way to do it with fewer than three edits. 23/31 Levenshtein distance (cont.) Upper and lower bounds The Levenshtein distance has several simple upper and lower bounds: • It is at least the difference of the sizes of the two strings. • It is at most the length of the longer string. • It is zero if and only if the strings are equal. • If the strings are the same size, the Hamming distance is an upper bound on the Levenshtein distance. • The Levenshtein distance between two strings is no greater than the sum of their Levenshtein distances from a third string (triangle inequality). 24/31 Levenshtein distance (cont.) ⎧푖, if 푗 = 0 ⎪ ⎪ ⎪푗, if 푖 = 0 푑(푖, 푗) = 푑(푖 − 1, 푗) + 1, ⎨ ⎪ ⎪min ( 푑(푖, 푗 − 1) + 1, ) ⎪ ⎩ 푑(푖 − 1, 푗 − 1) + 푐(푖, 푗) where 0 if 푎 = 푏 푐(푖, 푗) = { 푖 푗 1 otherwise First element in the minimum corresponds to deletion (from 푎 to 푏), the second to insertion and the third to match or mismatch.
Details
-
File Typepdf
-
Upload Time-
-
Content LanguagesEnglish
-
Upload UserAnonymous/Not logged-in
-
File Pages33 Page
-
File Size-