Optical Kurrent Recognition
Rossum Reading Group, 2018-08-16
About us
Auxiliary Historical Sciences (Archival science)
Outline
1. The evolution of scripts in modern history
2. Archivists and written source processing
3. Obstacles in reading and rewriting
4. OCR obstacles from the archivists' point of view
The evolution of scripts in modern history
Novogothic (“German”) script
Humanistic (“Latin”) script
Novogothic script
Frakturschrift
Kanzleischrift
Kurrentschrift
And what it actually looks like...
Humanistic script
Rotunda humanistica
Humanistic semi-cursive
Archivists and written source processing
Paleography (PVH)
Why learn reading/rewriting?
research (students, historians, genealogists…)
archivists
publishing source editions
How to rewrite?
transliteration
transcription
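A minimal sketch of the difference between the two ways of rewriting, assuming the usual editorial sense of the terms: transliteration keeps the letters exactly as the scribe wrote them, while transcription renders them in modern spelling. The letter mapping is an assumed reading of the equivalences listed on the Grammar slide (j = g/y, i = j, v = w), taken here as "modern letter = letter found in the source", with the slide's "i" read as long "í". The function names and sample words are hypothetical, and real transcription rules are far more context-dependent than a lookup table.

```python
# Toy illustration only. The mapping is an assumed reading of the slide's
# "j = g/y, i = j, v = w"; the ambiguous "y" variant is deliberately left out.
OLD_TO_MODERN = str.maketrans({"g": "j", "j": "í", "w": "v"})


def transliterate(text: str) -> str:
    """Keep the letters exactly as written in the source."""
    return text


def transcribe(text: str) -> str:
    """Map old letter usage to modern spelling in a single pass.

    str.translate replaces every character at once, so a "g" turned into
    "j" is not then turned into "í" by the second rule.
    """
    return text.translate(OLD_TO_MODERN)


for word in ["gest", "wjra"]:
    print(transliterate(word), "->", transcribe(word))
```

The single-pass replacement matters: applying the rules one after another with chained `replace` calls would corrupt the result, because the output of one rule is the input of another.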
Obstacles in reading and rewriting
Multiple languages
Numerals and abbreviations
Special shapes
Grammar
digraphs
different letters: j = g/y, i = j, v = w...
different grammatical habits
OCR obstacles from the archivists' POV
Letter variations
Manuscript variations
Current state of digitization
Optical Character Recognition (OCR)
Preprocess
I/OCR
Errors
Layout
Proofread
Google Keep baseline (GKB)
What about HTR?
Optical Kur... Recognition
Con
Pro
OKR state of the art
OKR state of the art under the hood
OKR state of the art “References”
● Dropbox
● Line segmentation with FCN
● Text alignment using HMM
● Query text
● DTW
● Mass transcription of modern English
● Comparison of CRNN to LSTM, HMM in historical HTR
● Older case study for old fonts
Actually ...
ECCO
IAM
EEBO
Bentham
GWP
ICFR2018
DAS(y mod 4)
Thank you
http://opticalkurrentrecognition.jdem.cz/