Fast Selection of Small and Precise Candidate Sets from Dictionaries

Computational Linguistics at CIS - from basic research to applications in Digital Humanities Stefan Gerdjikov, Klaus U. Schulz CIS - LMU Munich Including joint work with Stoyan Mihov and Petar Mitankin (Sofia) Survey of talk Excursion 1 Fast approximate search in large lexica Interactive postcorrection of Advanced text index structures OCRed historical texts Excursion 2 Advanced text index structures Modernization/normalization of Novel ways of querying and historical language exploring corpora and text collections Ontologies & finding topics in texts (TopicZoom) Loose ends... Wittgenstein search (Max Hadersbeck) Latin (Uwe Springmann, Helmut Schmidt) Excursion 1 Fast approximate search in large lexica Interactive postcorrection of Advanced text index structures OCRed historical texts Approximate search in lexica Intuition Given a large lexicon L and possibly erroneous input tokens, for each input token V efficiently find the „most similar“ entries W of L. Levenshtein distance d The Levenshtein distance between two words V, W is the minimal number of edit operations (deletion, insertion of a symbol or substitution of one symbol by another) needed to rewrite V into W. Formalized problem Given a lexicon L, an input word („pattern“) V and a bound b (=0,1,2,..), compute all entries W of L such such d(V,W) ≤ b. Lexicon representation as minimal deterministic finite-state automaton Example lexicon: axis,behave,child,church,cold,hate,hold x i h a a e v s b i e h l d o c u h r c o h t a Lexicon representation as minimal deterministic finite-state automaton Example lexicon: axis,behave,child,church,cold,hate,hold x i h a Fast access a e v s b i e h l d o c u h r c o h t a Lexicon representation as minimal deterministic finite-state automaton Example lexicon: axis,behave,child,church,cold,hate,hold x i h a Fast access a e v s Size: each prefix of a lexicon b i e word only represented once. h l d Similar sharing of suffixes! o c u h r c o h t a Lexicon representation as minimal deterministic finite-state automaton Example lexicon: axis,behave,child,church,cold,hate,hold x i h a Fast access a e v s Size: each prefix of a lexicon b i e word only represented once. h l d Similar sharing of suffixes! o c u h Using the lexicon: for many r c „processing tasks“ strings o h occurring in several words t only processed once. a Approximate search in lexicon using dynamic programming (Oflazer) Search for „chiid“, maximal Levenshtein distance 1 x i h a a e v s b i e h l d o c u h r c o h t a Approximate search in lexicon using dynamic programming (Oflazer) Search for „chiid“, maximal Levenshtein distance 1 „Complete“ traversal of lexicon automaton x i 0 h a a e v s c 1 b i e h 2 h l d o i 3 c u h i 4 r c o d 5 h t a Parallel control Oflazer approximate search Search for „chiid“, maximal Levenshtein distance 1 x i a 0 1 h a a e v s c 1 1 b i e h 2 2 h l d o i 3 3 c u h i 4 4 r c o d 5 5 h t a Oflazer approximate search Search for „chiid“, maximal Levenshtein distance 1 x i a x 0 1 2 h a a e v s c 1 1 2 b i e h 2 2 2 h l d o i 3 3 3 c u h i 4 4 4 r c o d 5 5 5 h t a Blind path, backtracking! Oflazer approximate search Search for „chiid“, maximal Levenshtein distance 1 x i c 0 1 h a a e v s c 1 0 b i e h 2 1 h l d o i 3 2 c u h i 4 3 r c o d 5 4 h t a Oflazer approximate search Search for „chiid“, maximal Levenshtein distance 1 x i c h 0 1 2 h a a e v s c 1 0 1 b i e h 2 1 0 h l d o i 3 2 1 c u h i 4 3 2 r c o d 5 4 3 h t a Oflazer approximate search Search for „chiid“, maximal Levenshtein distance 1 x i c h i 0 1 2 3 h a a e v s c 1 0 1 2 b i e h 2 1 0 1 h l d o i 3 2 1 0 c u h i 4 3 2 1 r c o d 5 4 3 2 h t a Oflazer approximate search Search for „chiid“, maximal Levenshtein distance 1 x i c h i l 0 1 2 3 4 h a a e v s c 1 0 1 2 3 b i e h 2 1 0 1 2 h l d o i 3 2 1 0 1 c u h i 4 3 2 1 1 r c o d 5 4 3 2 2 h t a Oflazer approximate search Search for „chiid“, maximal Levenshtein distance 1 x i c h i l d 0 1 2 3 4 5 h a a e v s c 1 0 1 2 3 4 b i e h 2 1 0 1 2 3 h l d o i 3 2 1 0 1 2 c u h i 4 3 2 1 1 2 r c o d 5 4 3 2 2 1 h t a match: child Oflazer approximate search For given pattern, o „blind“ paths in automaton recognized as early as possible, o only „small“ part of the automaton visited But o Computing matrix columns at each step expensive! Approximate search using non-deterministic Levenshtein automata Search for „chiid“, maximal Levenshtein distance 1 x i h a a e v s b i e h l d o c u h active states r c o h t a c h i i d e e e e e S S S S S S S S S S c h i i d Approximate search using non-deterministic Levenshtein automata Search for „chiid“, maximal Levenshtein distance 1 x i h a a e v s b i e h l d o c u h r c o h t a c h i i d e e e e e S S S S S S S S S S c h i i d Approximate search using non-deterministic Levenshtein automata Search for „chiid“, maximal Levenshtein distance 1 x i h a a e v s b i e h l d o c u h r c o Transition with x fails - backtracking h t a c h i i d e e e e e S S S S S S S S S S c h i i d Approximate search using non-deterministic Levenshtein automata Search for „chiid“, maximal Levenshtein distance 1 x i h a a e v s b i e h l d o c u h r c o h t a c h i i d e e e e e S S S S S S S S S S c h i i d Approximate search using non-deterministic Levenshtein automata Search for „chiid“, maximal Levenshtein distance 1 x i h a a e v s b i e h l d o c u h r c o h t a c h i i d e e e e e S S S S S S S S S S c h i i d Approximate search using non-deterministic Levenshtein automata Search for „chiid“, maximal Levenshtein distance 1 x i h a a e v s b i e h l d o c u h r c o h t a c h i i d e e e e e S S S S S S S S S S c h i i d Approximate search using non-deterministic Levenshtein automata Search for „chiid“, maximal Levenshtein distance 1 x i h a a e v s b i e h l d o c u h r c o h t a c h i i d e e e e e S S S S S S S S S S c h i i d Approximate search using non-deterministic Levenshtein automata Search for „chiid“, maximal Levenshtein distance 1 x i h a e v a s match: child b i e h l d o c u h r c o h t a c h i i d e e e e e S S S S S S S S S S c h i i d Approximate search using non-deterministic Levenshtein automata Same strengths But similar weaknesses: o Specific automaton for each pattern, o Propagation of active states computationally expensive. Approximate search using non-deterministic Levenshtein automata Observation 1 The set of active states after step n always restricted to a „triangular area“. c h i i d e e e e e S S S S S S S S S S c h i i d Approximate search using non-deterministic Levenshtein automata Observation 1 The set of active states after step n always restricted to a „triangular area“. c h i i d e e e e e S S S S S S S S S S c h i i d Approximate search using non-deterministic Levenshtein automata Observation 1 The set of active states after step n always restricted to a „triangular area“.

Fast Selection of Small and Precise Candidate Sets from Dictionaries

Weighted Automata Computation of Edit Distances with Consolidations and Fragmentations Mathieu Giraud, Florent Jacquemard

Algorithms for Approximate String Matching Petro Protsyk (28 October 2016) Agenda

Fast Approximate Search in Large Dictionaries

Chapter 1 OBTAINING VALUABLE PRECISION-RECALL

Unary Words Have the Smallest Levenshtein K-Neighbourhoods

Fast Approximate Search in Large Dictionaries

Automata Processor Architecture and Applications: a Survey

Accelerating Decision Tree Ensemble Inference with an Automata Representation

Algorithms and Frameworks for Accelerating Security Applications on HPC Platforms

REAPR: Reconfigurable Engine for Automata Processing

An Experimental Tagalog Finite State Automata Spellchecker with Levenshtein Edit-Distance Feature

Towards Differential Privacy for Symbolic Systems