OCR & Search

Brent Younce, Dr. Akram Khater [email protected], [email protected]

Motivation

• The Moise A. Khayrallah Center for Lebanese Diaspora Studies at NC State maintains a dataset of over 250,000 historical Arabic documents in the form of images.
• Researchers cannot make full use of these documents: without digital text, they cannot be searched or queried in any meaningful way.
• Existing commercial OCR (Optical Character Recognition) systems, which extract digital text from images, perform extremely poorly on this dataset, correctly recognizing less than 5% of the visible characters.
• This extremely low accuracy also applies to similar collections of historical or other scanned Arabic documents.
• Goal: To enable researchers to fully utilize this and similar datasets, we intend to improve OCR accuracy on this dataset beyond that of commercial systems, and to build an easily accessible web-based search interface for querying and exploring the entire dataset.

Dataset

• The dataset consists of an ever-expanding collection of (currently) 250,000+ documents.
• These documents are primarily scanned Arab-American newspapers published between 1899 and 1950.
• The documents range in size from ~5 to ~20 MB, and in quality from extremely difficult to interpret to fairly clean scans.

OCR Engines

Calamari OCR
• A modern OCR engine built on TensorFlow that operates on single-line images (and therefore requires document segmentation).
• A model trained on 100,000 lines of synthetic Arabic documents produces the highest accuracy, at around 95%.
• A pre-processing and post-processing system was developed to clean documents, split them into individual line images, and re-combine the recognized lines into full documents.

Tesseract OCR
• A popular, proven open-source OCR engine.
• High overall accuracy (~70-80%).
• The most consistent performance of the OCR engines tested.

Others
• Vision: high accuracy (80%), but fails and returns no results on a non-negligible subset of the data.
• Kraken (based on OCRopus): Arabic models fine-tuned on 100 ground-truth documents in a very simple training process, yielding unimpressive accuracy at ~10-15%.

Architecture

• A highly scalable document search system was designed to enable both researchers and the general public to query and visualize the entire processed dataset.
• Open Semantic Search, a FOSS search platform built on Solr, was chosen as the interface for researchers. It includes:
  • An advanced query system (through Lucene)
  • Document tagging and organization
  • Visualization (networks and trends)
  • NLP (content analysis and named-entity analysis)
• In addition, a simple web search interface was developed to give the general public access to the dataset.
• The above search tools, as well as the OCR system, have been deployed on AWS EC2 for high scalability.
• OCR-generated PDFs, as well as the original images, can be stored locally or on S3, S3 IA, or Glacier (for archival storage).
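Open Semantic Search ultimately answers queries through Solr's HTTP API. As a hedged illustration only, a raw phrase query might be assembled as below; the core name (`opensemanticsearch`) and the `content` field are assumptions for the sketch, since the actual schema is configured by Open Semantic Search itself.

```python
# Sketch of building a raw Solr select URL of the kind the search layer
# issues. Core name and field name are illustrative assumptions.
from urllib.parse import urlencode

def solr_query_url(base_url, text, rows=10):
    """Build a phrase query against a hypothetical full-text 'content' field."""
    params = {"q": f'content:"{text}"', "rows": rows, "wt": "json"}
    return f"{base_url}/select?{urlencode(params)}"

url = solr_query_url("http://localhost:8983/solr/opensemanticsearch", "Beirut")
print(url)
```

Lucene-style field queries (`field:"phrase"`) are what the advanced query system exposes to researchers; the public interface hides this syntax behind a plain search box.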

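Because Calamari operates on single-line images, the post-processing stage must stitch per-line predictions back into reading order. A minimal sketch of that recombination step, assuming a hypothetical `page_<p>_line_<n>` filename scheme produced by the segmentation stage:

```python
import re

def recombine(line_predictions):
    """Join per-line OCR text back into a document, in reading order.

    line_predictions: dict mapping a line-image filename (hypothetical
    'page_<p>_line_<n>.png' scheme) to the OCR text for that line.
    """
    def order_key(name):
        page, line = (int(n) for n in re.findall(r"\d+", name))
        return (page, line)

    return "\n".join(line_predictions[n]
                     for n in sorted(line_predictions, key=order_key))

preds = {
    "page_1_line_2.png": "second line of page one",
    "page_1_line_1.png": "first line of page one",
}
print(recombine(preds))
```

The real pipeline also cleans the images before recognition; this sketch covers only the ordering logic.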
Sample Results

Figure: Calamari trained-model results on manually transcribed ground-truth data.
Figure: Simple (public) search interface.
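The per-character accuracy figures quoted here (around 95% for the trained Calamari model, under 5% for commercial engines) can be computed from the edit distance between OCR output and the transcribed ground truth. A self-contained sketch; production evaluations typically rely on dedicated OCR-evaluation tooling rather than hand-rolled code:

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def char_accuracy(ground_truth, ocr_output):
    """1 minus the character error rate, floored at zero."""
    return max(0.0, 1 - edit_distance(ground_truth, ocr_output) / len(ground_truth))

# One substitution in five characters -> 80% character accuracy.
print(char_accuracy("kitab", "kitap"))  # 0.8
```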