OCR, Classification & Machine Translation (OCCAM)

Joachim Van den Bogaert, Arne Defauw, Frederic Everaert, Koen Van Winckel, Alina Kramchaninova, Anna Bardadym, Tom Vanallemeersch
CrossLang
Kerkstraat 106
9050 Gentbrugge
Belgium
{first.lastname}@crosslang.com

Pavel Smrž, Michal Hradiš
Brno University of Technology
Božetěchova 2
612 00 Brno
Czech Republic
[email protected], [email protected]

Abstract

The OCCAM project (Optical Character recognition, ClassificAtion & Machine translation), which runs from 2019 to 2021 and is carried out by CrossLang and Brno University of Technology, aims at integrating the CEF (Connecting Europe Facility) Automated Translation service with image classification, translation memories, optical character recognition, and machine translation. It will support the automated translation of scanned business documents (a document format that, currently, cannot be processed by the CEF eTranslation service) and will also lead to a tool useful for the digital humanities domain.

1 Introduction

The European Commission’s Business Registers Interconnection System (BRIS) facilitates access to information on EU companies and ensures that all EU business registers can communicate with each other electronically, in relation to cross-border mergers and foreign branches.1 Its main task is to synchronize the information that is present within Member States’ business registers.

The CEF has planned an integration of BRIS with the CEF eTranslation2 Digital Service Infrastructure (DSI), to make draft translations of company information available via the European e-Justice Portal,3 but a large volume of scanned documents would remain untranslated, because it consists of raw images that are not machine-readable.

A similar problem occurs in the digital humanities domain: while there are plenty of optical character recognition (OCR) frameworks available (both open source and proprietary), the need for OCR and translation within the digital humanities domain remains pressing. The European Newspaper Survey Report,4 as conducted during the Europeana Newspapers project, revealed that access to twentieth-century content remains problematic, and only a few libraries use OCR when scanning documents. At the same time, there is a growing interest in gaining multilingual access to cultural heritage resources.

2 Proposed solution for BRIS

Existing content within the member state databases will be leveraged to recognise, classify and translate legacy and new content. The presence of database links to scanned documents and the template-like nature of administrative documents will be exploited to optimize OCR and translation. Since a pipelined (cascaded) implementation (i.e. an OCR step followed by a machine translation (MT) step) has the inherent risk of error accumulation, OCCAM proposes a more informed classification-based approach, as outlined in Figure 1, to:

1 https://ec.europa.eu/cefdigital/wiki/pages/viewpage.action?pageId=46992657
2 https://ec.europa.eu/cefdigital/wiki/display/CEFDIGITAL/eTranslation
3 https://e-justice.europa.eu/
4 http://www.europeana-newspapers.eu/wp-content/uploads/2012/04/D4.1-Europeana-newspapers-survey-report.pdf

Figure 1: OCCAM system architecture

• recognise document types and link them to a corresponding data model (consisting of template text, data fields, data entries and free text);
• identify entities within documents and link them to corresponding entries in member state databases (e.g. by using OCR to recognise VAT numbers and retrieve the corresponding data from a national business register);
• retrieve translations from translation memories associated with the data model;
• use class-adapted OCR and MT for the remaining free text.

Brno University of Technology will provide the image classification and OCR tools, using the open source packages OpenCV,5 Tesseract,6 and TensorFlow,7 and an in-house neural OCR system currently being developed for the analysis of challenging historical documents in a project called PERO,8 aimed at improving accessibility of cultural heritage. These tools are distributed under commercially-friendly (non-viral) licenses (3-clause BSD and Apache 2.0).

CrossLang will build MT engines using CEF eTranslation’s MT system and its own Moses-based SMT (distributed under LGPL license) and OpenNMT-based (distributed under MIT license) systems. The developed MT systems will incorporate named-entity recognition and terminology technology, and target the following language pairs: Dutch, French, German, and Czech into English.

The resulting OCCAM solution will be built as a reference implementation and made publicly available at the end of the project. The licenses of the components used will ensure that the implementation can be distributed freely, and adapted for use by business registers across Europe, after the project has ended.

For the digital humanities domain, a technology roadshow will be organised and tutorials will be published, to acquaint researchers with the technology, so they can easily develop and adapt their own models.

Acknowledgement

OCCAM is funded by the EC’s CEF Telecom programme (project 2018-EU-IA-0052).

5 https://opencv.org/
6 https://github.com/tesseract-ocr/tesseract
7 https://www.tensorflow.org/
8 http://www.fit.vutbr.cz/units/UPGM/grants/index.php.en?id=1165