Fast Selection of Small and Precise Candidate Sets from Dictionaries

Fast Selection of Small and Precise Candidate Sets from Dictionaries

Computational Linguistics at CIS - from basic research to applications in Digital Humanities Stefan Gerdjikov, Klaus U. Schulz CIS - LMU Munich Including joint work with Stoyan Mihov and Petar Mitankin (Sofia) Survey of talk Excursion 1 Fast approximate search in large lexica Interactive postcorrection of Advanced text index structures OCRed historical texts Excursion 2 Advanced text index structures Modernization/normalization of Novel ways of querying and historical language exploring corpora and text collections Ontologies & finding topics in texts (TopicZoom) Loose ends... Wittgenstein search (Max Hadersbeck) Latin (Uwe Springmann, Helmut Schmidt) Excursion 1 Fast approximate search in large lexica Interactive postcorrection of Advanced text index structures OCRed historical texts Approximate search in lexica Intuition Given a large lexicon L and possibly erroneous input tokens, for each input token V efficiently find the „most similar“ entries W of L. Levenshtein distance d The Levenshtein distance between two words V, W is the minimal number of edit operations (deletion, insertion of a symbol or substitution of one symbol by another) needed to rewrite V into W. Formalized problem Given a lexicon L, an input word („pattern“) V and a bound b (=0,1,2,..), compute all entries W of L such such d(V,W) ≤ b. Lexicon representation as minimal deterministic finite-state automaton Example lexicon: axis,behave,child,church,cold,hate,hold x i h a a e v s b i e h l d o c u h r c o h t a Lexicon representation as minimal deterministic finite-state automaton Example lexicon: axis,behave,child,church,cold,hate,hold x i h a Fast access a e v s b i e h l d o c u h r c o h t a Lexicon representation as minimal deterministic finite-state automaton Example lexicon: axis,behave,child,church,cold,hate,hold x i h a Fast access a e v s Size: each prefix of a lexicon b i e word only represented once. h l d Similar sharing of suffixes! o c u h r c o h t a Lexicon representation as minimal deterministic finite-state automaton Example lexicon: axis,behave,child,church,cold,hate,hold x i h a Fast access a e v s Size: each prefix of a lexicon b i e word only represented once. h l d Similar sharing of suffixes! o c u h Using the lexicon: for many r c „processing tasks“ strings o h occurring in several words t only processed once. a Approximate search in lexicon using dynamic programming (Oflazer) Search for „chiid“, maximal Levenshtein distance 1 x i h a a e v s b i e h l d o c u h r c o h t a Approximate search in lexicon using dynamic programming (Oflazer) Search for „chiid“, maximal Levenshtein distance 1 „Complete“ traversal of lexicon automaton x i 0 h a a e v s c 1 b i e h 2 h l d o i 3 c u h i 4 r c o d 5 h t a Parallel control Oflazer approximate search Search for „chiid“, maximal Levenshtein distance 1 x i a 0 1 h a a e v s c 1 1 b i e h 2 2 h l d o i 3 3 c u h i 4 4 r c o d 5 5 h t a Oflazer approximate search Search for „chiid“, maximal Levenshtein distance 1 x i a x 0 1 2 h a a e v s c 1 1 2 b i e h 2 2 2 h l d o i 3 3 3 c u h i 4 4 4 r c o d 5 5 5 h t a Blind path, backtracking! Oflazer approximate search Search for „chiid“, maximal Levenshtein distance 1 x i c 0 1 h a a e v s c 1 0 b i e h 2 1 h l d o i 3 2 c u h i 4 3 r c o d 5 4 h t a Oflazer approximate search Search for „chiid“, maximal Levenshtein distance 1 x i c h 0 1 2 h a a e v s c 1 0 1 b i e h 2 1 0 h l d o i 3 2 1 c u h i 4 3 2 r c o d 5 4 3 h t a Oflazer approximate search Search for „chiid“, maximal Levenshtein distance 1 x i c h i 0 1 2 3 h a a e v s c 1 0 1 2 b i e h 2 1 0 1 h l d o i 3 2 1 0 c u h i 4 3 2 1 r c o d 5 4 3 2 h t a Oflazer approximate search Search for „chiid“, maximal Levenshtein distance 1 x i c h i l 0 1 2 3 4 h a a e v s c 1 0 1 2 3 b i e h 2 1 0 1 2 h l d o i 3 2 1 0 1 c u h i 4 3 2 1 1 r c o d 5 4 3 2 2 h t a Oflazer approximate search Search for „chiid“, maximal Levenshtein distance 1 x i c h i l d 0 1 2 3 4 5 h a a e v s c 1 0 1 2 3 4 b i e h 2 1 0 1 2 3 h l d o i 3 2 1 0 1 2 c u h i 4 3 2 1 1 2 r c o d 5 4 3 2 2 1 h t a match: child Oflazer approximate search For given pattern, o „blind“ paths in automaton recognized as early as possible, o only „small“ part of the automaton visited But o Computing matrix columns at each step expensive! Approximate search using non-deterministic Levenshtein automata Search for „chiid“, maximal Levenshtein distance 1 x i h a a e v s b i e h l d o c u h active states r c o h t a c h i i d e e e e e S S S S S S S S S S c h i i d Approximate search using non-deterministic Levenshtein automata Search for „chiid“, maximal Levenshtein distance 1 x i h a a e v s b i e h l d o c u h r c o h t a c h i i d e e e e e S S S S S S S S S S c h i i d Approximate search using non-deterministic Levenshtein automata Search for „chiid“, maximal Levenshtein distance 1 x i h a a e v s b i e h l d o c u h r c o Transition with x fails - backtracking h t a c h i i d e e e e e S S S S S S S S S S c h i i d Approximate search using non-deterministic Levenshtein automata Search for „chiid“, maximal Levenshtein distance 1 x i h a a e v s b i e h l d o c u h r c o h t a c h i i d e e e e e S S S S S S S S S S c h i i d Approximate search using non-deterministic Levenshtein automata Search for „chiid“, maximal Levenshtein distance 1 x i h a a e v s b i e h l d o c u h r c o h t a c h i i d e e e e e S S S S S S S S S S c h i i d Approximate search using non-deterministic Levenshtein automata Search for „chiid“, maximal Levenshtein distance 1 x i h a a e v s b i e h l d o c u h r c o h t a c h i i d e e e e e S S S S S S S S S S c h i i d Approximate search using non-deterministic Levenshtein automata Search for „chiid“, maximal Levenshtein distance 1 x i h a a e v s b i e h l d o c u h r c o h t a c h i i d e e e e e S S S S S S S S S S c h i i d Approximate search using non-deterministic Levenshtein automata Search for „chiid“, maximal Levenshtein distance 1 x i h a e v a s match: child b i e h l d o c u h r c o h t a c h i i d e e e e e S S S S S S S S S S c h i i d Approximate search using non-deterministic Levenshtein automata Same strengths But similar weaknesses: o Specific automaton for each pattern, o Propagation of active states computationally expensive. Approximate search using non-deterministic Levenshtein automata Observation 1 The set of active states after step n always restricted to a „triangular area“. c h i i d e e e e e S S S S S S S S S S c h i i d Approximate search using non-deterministic Levenshtein automata Observation 1 The set of active states after step n always restricted to a „triangular area“. c h i i d e e e e e S S S S S S S S S S c h i i d Approximate search using non-deterministic Levenshtein automata Observation 1 The set of active states after step n always restricted to a „triangular area“.

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    81 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us