Fast Selection of Small and Precise Candidate Sets from Dictionaries
Computational Linguistics at CIS - from basic research to applications in Digital Humanities
Stefan Gerdjikov, Klaus U. Schulz
CIS - LMU Munich
Including joint work with Stoyan Mihov and Petar Mitankin (Sofia)

Survey of talk

Excursion 1
o Fast approximate search in large lexica
o Advanced text index structures
o Interactive postcorrection of OCRed historical texts

Excursion 2
o Advanced text index structures
o Novel ways of querying and exploring corpora and text collections
o Modernization/normalization of historical language

Ontologies & finding topics in texts (TopicZoom)

Loose ends...
o Wittgenstein search (Max Hadersbeck)
o Latin (Uwe Springmann, Helmut Schmidt)

Excursion 1
Fast approximate search in large lexica

Approximate search in lexica

Intuition
Given a large lexicon L and possibly erroneous input tokens, for each input token V efficiently find the "most similar" entries W of L.

Levenshtein distance d
The Levenshtein distance between two words V, W is the minimal number of edit operations (deletion or insertion of a symbol, or substitution of one symbol by another) needed to rewrite V into W.

Formalized problem
Given a lexicon L, an input word ("pattern") V and a bound b (= 0, 1, 2, ...), compute all entries W of L such that d(V, W) ≤ b.

Lexicon representation as minimal deterministic finite-state automaton
Example lexicon: axis, behave, child, church, cold, hate, hold
[Figure: minimal deterministic finite-state automaton for the example lexicon]
o Fast access
o Size: each prefix of a lexicon word is represented only once. Similar sharing of suffixes!
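The sharing of prefixes and suffixes can be illustrated with a small sketch. The following Python code is not taken from the talk: it builds a plain trie for the example lexicon (prefix sharing only) and then merges trie nodes that accept the same set of suffixes, which for a finite lexicon gives the states of the minimal deterministic automaton. The names END, build_trie, count_trie_nodes and minimize are hypothetical helpers chosen for this illustration.

```python
END = "$"  # hypothetical end-of-word marker; assumes "$" never occurs in a word

LEXICON = ["axis", "behave", "child", "church", "cold", "hate", "hold"]


def build_trie(words):
    """Build a trie as nested dicts; each shared prefix is stored once."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def count_trie_nodes(node):
    """Count trie nodes, including the root."""
    return 1 + sum(count_trie_nodes(child)
                   for key, child in node.items() if key != END)


def minimize(node, registry):
    """Map a trie node to the id of its equivalence class.

    Two nodes get the same id exactly when they accept the same set of
    suffixes, so len(registry) ends up as the number of states of the
    minimal automaton.
    """
    signature = tuple(sorted(
        (key, True if key == END else minimize(child, registry))
        for key, child in node.items()
    ))
    return registry.setdefault(signature, len(registry))


trie = build_trie(LEXICON)
registry = {}
minimize(trie, registry)
print("trie nodes:", count_trie_nodes(trie))
print("minimal automaton states:", len(registry))
```

The trie already stores shared prefixes such as "ch" once; the merged version also shares common endings such as "ld", which is why it needs fewer states.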
Using the lexicon: for many "processing tasks", strings occurring in several words are processed only once.

Approximate search in lexicon using dynamic programming (Oflazer)
Search for "chiid", maximal Levenshtein distance 1
[Figure: lexicon automaton for the example lexicon]
"Complete" traversal of the lexicon automaton, with parallel control by the edit-distance matrix (rows: pattern symbols, columns: symbols read in the automaton).

Oflazer approximate search
Search for "chiid", maximal Levenshtein distance 1

         a   x
     0   1   2
c    1   1   2
h    2   2   2
i    3   3   3
i    4   4   4
d    5   5   5

Every entry of the column for x exceeds the bound 1: blind path, backtracking!
Oflazer approximate search
Search for "chiid", maximal Levenshtein distance 1

         c   h   i   l   d
     0   1   2   3   4   5
c    1   0   1   2   3   4
h    2   1   0   1   2   3
i    3   2   1   0   1   2
i    4   3   2   1   1   2
d    5   4   3   2   2   1

match: child

Oflazer approximate search
For given pattern,
o "blind" paths in automaton recognized as early as possible,
o only "small" part of the automaton visited.
But
o computing matrix columns at each step is expensive!
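The traversal with matrix columns can be sketched as follows. This is a minimal illustration in Python, not Oflazer's implementation: it walks a plain trie instead of the minimal automaton, and the names END, build_trie, next_column and approximate_search are hypothetical. Each step along a lexicon path adds one matrix column, and a path is abandoned as soon as every entry of its newest column exceeds the bound.

```python
END = "$"  # hypothetical end-of-word marker; assumes "$" never occurs in a word


def build_trie(words):
    """Build a trie as nested dicts (a stand-in for the lexicon automaton)."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def next_column(column, pattern, ch):
    """Extend the edit-distance matrix by the column for lexicon symbol ch."""
    new = [column[0] + 1]
    for i in range(1, len(pattern) + 1):
        cost = 0 if pattern[i - 1] == ch else 1
        new.append(min(new[i - 1] + 1,          # skip pattern symbol i
                       column[i] + 1,           # skip lexicon symbol ch
                       column[i - 1] + cost))   # match or substitution
    return new


def approximate_search(trie, pattern, bound):
    """Return sorted (word, distance) pairs with distance <= bound."""
    matches = []

    def visit(node, prefix, column):
        if END in node and column[-1] <= bound:
            matches.append((prefix, column[-1]))
        for ch, child in node.items():
            if ch == END:
                continue
            new = next_column(column, pattern, ch)
            if min(new) <= bound:  # otherwise: blind path, backtrack
                visit(child, prefix + ch, new)

    visit(trie, "", list(range(len(pattern) + 1)))
    return sorted(matches)


lexicon = ["axis", "behave", "child", "church", "cold", "hate", "hold"]
print(approximate_search(build_trie(lexicon), "chiid", 1))
```

The inner loop in next_column runs once per pattern symbol for every automaton transition that is tried, which is the per-step cost the slide points out.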
Approximate search using non-deterministic Levenshtein automata
Search for "chiid", maximal Levenshtein distance 1
[Figure: lexicon automaton traversed in parallel with the non-deterministic Levenshtein automaton for "chiid"; its transitions are labelled c, h, i, i, d, ε and Σ, and the active states are highlighted]
o Transition with x fails - backtracking
o match: child

Approximate search using non-deterministic Levenshtein automata
Same strengths.
But similar weaknesses:
o Specific automaton for each pattern,
o Propagation of active states computationally expensive.
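The propagation of active states can be sketched in a few lines. The following Python code is an illustration under stated assumptions, not the implementation from the talk: a state is written as a pair (i, e), where i is the number of pattern symbols consumed and e the number of errors used so far; the lexicon is again a plain trie; and the names closure, step and nfa_search are hypothetical.

```python
END = "$"  # hypothetical end-of-word marker; assumes "$" never occurs in a word


def build_trie(words):
    """Build a trie as nested dicts (a stand-in for the lexicon automaton)."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def closure(states, n, bound):
    """Add states reachable by epsilon moves (a pattern symbol is skipped)."""
    result = set(states)
    stack = list(states)
    while stack:
        i, e = stack.pop()
        nxt = (i + 1, e + 1)
        if i < n and e < bound and nxt not in result:
            result.add(nxt)
            stack.append(nxt)
    return result


def step(states, ch, pattern, bound):
    """Propagate the set of active states over the input symbol ch."""
    n = len(pattern)
    moved = set()
    for i, e in states:
        if i < n and pattern[i] == ch:
            moved.add((i + 1, e))          # ch matches the next pattern symbol
        if e < bound:
            moved.add((i, e + 1))          # ch is an extra symbol
            if i < n:
                moved.add((i + 1, e + 1))  # ch replaces the next pattern symbol
    return closure(moved, n, bound)


def nfa_search(trie, pattern, bound):
    """Traverse the trie and the Levenshtein automaton in parallel."""
    n = len(pattern)
    matches = []

    def visit(node, prefix, states):
        if END in node and any(i == n for i, _ in states):
            matches.append(prefix)
        for ch, child in node.items():
            if ch == END:
                continue
            nxt = step(states, ch, pattern, bound)
            if nxt:  # no active state left: transition fails, backtrack
                visit(child, prefix + ch, nxt)

    visit(trie, "", closure({(0, 0)}, n, bound))
    return sorted(matches)


lexicon = ["axis", "behave", "child", "church", "cold", "hate", "hold"]
print(nfa_search(build_trie(lexicon), "chiid", 1))
```

In this sketch the set of active states is rebuilt for every transition tried in the lexicon, and the automaton depends on the pattern, which mirrors the two weaknesses listed above.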
Approximate search using non-deterministic Levenshtein automata
Observation 1
The set of active states after step n is always restricted to a "triangular area".
[Figure: non-deterministic Levenshtein automaton for "chiid" with the triangular area of active states highlighted]
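Observation 1 can be checked on small examples. The sketch below rests on an assumption about notation that the slides do not spell out: states are written as pairs (i, e) of pattern position i and error count e, and the "triangular area" after n input symbols is read as the set of states with |i - n| ≤ e ≤ b. The function names closure, step, triangle and stays_in_triangle are hypothetical.

```python
def closure(states, n, bound):
    """Add states reachable by epsilon moves (a pattern symbol is skipped)."""
    result = set(states)
    stack = list(states)
    while stack:
        i, e = stack.pop()
        nxt = (i + 1, e + 1)
        if i < n and e < bound and nxt not in result:
            result.add(nxt)
            stack.append(nxt)
    return result


def step(states, ch, pattern, bound):
    """Propagate the set of active states over the input symbol ch."""
    n = len(pattern)
    moved = set()
    for i, e in states:
        if i < n and pattern[i] == ch:
            moved.add((i + 1, e))          # match
        if e < bound:
            moved.add((i, e + 1))          # extra input symbol
            if i < n:
                moved.add((i + 1, e + 1))  # substitution
    return closure(moved, n, bound)


def triangle(n, pattern_len, bound):
    """States (i, e) with |i - n| <= e <= bound, clipped to the pattern."""
    return {(i, e)
            for e in range(bound + 1)
            for i in range(max(0, n - e), min(pattern_len, n + e) + 1)}


def stays_in_triangle(word, pattern, bound):
    """Check that the active states lie in the triangle after every step."""
    states = closure({(0, 0)}, len(pattern), bound)
    if not states <= triangle(0, len(pattern), bound):
        return False
    for n, ch in enumerate(word, start=1):
        states = step(states, ch, pattern, bound)
        if not states <= triangle(n, len(pattern), bound):
            return False
    return True


print(stays_in_triangle("child", "chiid", 1))
```

Under this reading the triangle holds at most (b + 1)^2 states for a bound b, independent of the length of the pattern.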