Papermerge

Apr 10, 2020

Contents

1 Requirements
  1.1 Python
  1.2 Imagemagick
  1.3 Poppler
  1.4 Tesseract
  1.5 Database

2 Installation
  2.1 OS Specific Packages
    2.1.1 1. Web App + Workers Machine
      2.1.1.1 Ubuntu Bionic 18.04 (LTS)
    2.1.2 2. Web App Machine
      2.1.2.1 Ubuntu Bionic 18.04 (LTS)
    2.1.3 3. Worker Machine
  2.2 Manual Way
    2.2.1 Package Dependencies
    2.2.2 Web App
    2.2.3 Worker
    2.2.4 Recurring Commands
  2.3 Systemd
    2.3.1 Package Dependencies
    2.3.2 Web App
  2.4 Docker
  2.5 Ansible (Semiautomated)
  2.6 Jenkins + Ansible (Fully Automated Deployment)

3 Languages Support

4 REST API
  4.1 How It Works?
    4.1.1 Get a Token
    4.1.2 Use the Token
  4.2 REST API Reference

5 Page Management
  5.1 Delete Page(s)
  5.2 Reorder Pages
  5.3 Cut & Paste

6 Settings
  6.1 STORAGE_ROOT
  6.2 S3
  6.3 OCR
  6.4 DATABASES
  6.5 STATICFILES_DIRS

7 Developers Guide
  7.1 Contributing
    7.1.1 Fix a Typo
    7.1.2 Open an Issue
    7.1.3 Add Your Language Support
  7.2 Design
    7.2.1 1. Frontend
    7.2.2 2. Backend
    7.2.3 3. Workers
  7.3 Branching Model
    7.3.1 Worker, Papermerge-js Branching Model?
    7.3.2 Git Branching/Tagging Blitz Introduction
  7.4 Language Support
    7.4.1 What is Language Support?
    7.4.2 User Interface Language
    7.4.3 Document Content Language

8 Indices and tables


I have nothing against paper. Paper is a brilliant invention of humanity. But in the 21st century I find it more appropriate for paper-based documents to be digitized (scanned). Once scanned, appropriate software can be used to find any document in a fraction of a second, just by typing a few keywords. Papermerge is a document management system designed to work with scanned documents. As well as OCR with full text search, it provides the look and feel of major modern file browsers, with a hierarchical structure for files and folders, so that you can organize your documents in a similar way to Dropbox (via web) or Google Drive.

CHAPTER 1

Requirements

Papermerge depends on the following software:

• Python >= 3.8.0
• Tesseract - for OCR
• Imagemagick - for image operations
• Poppler - for PDF operations
• PostgreSQL >= 11.0 - for Full Text Search

1.1 Python

Papermerge is a Python 3 application.

1.2 Imagemagick

Papermerge uses Imagemagick to convert between image formats.

1.3 Poppler

More precisely, the poppler utils are used. For example, the pdfinfo command line utility is used to find out the number of pages in a PDF document.
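As a rough illustration of the pdfinfo usage mentioned above, the sketch below reads the page count from pdfinfo's output. The helper names parse_page_count and get_page_count are hypothetical and are not taken from the Papermerge code base; the only assumption about pdfinfo is that its output contains a line starting with "Pages:".

```python
import re
import subprocess


def parse_page_count(pdfinfo_output):
    """Return the number after 'Pages:' in pdfinfo output, or None if absent."""
    match = re.search(r"^Pages:\s+(\d+)", pdfinfo_output, re.MULTILINE)
    return int(match.group(1)) if match else None


def get_page_count(pdf_path):
    """Run pdfinfo (from poppler-utils) on pdf_path and parse its output."""
    result = subprocess.run(
        ["pdfinfo", pdf_path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pdfinfo fails
    )
    return parse_page_count(result.stdout)
```

Keeping the parsing separate from the subprocess call makes the parsing logic easy to check without a PDF file at hand.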


1.4 Tesseract

If you have never heard of Tesseract: it is Google's open source Optical Character Recognition (OCR) software. It extracts text from images and works fantastically well for a wide range of languages.

1.5 Database

One of Papermerge's core philosophies is "Find Any Document". The PostgreSQL database comes with Full Text Search (FTS) support out of the box. Papermerge uses the websearch_to_tsquery PostgreSQL function, which was introduced in PostgreSQL version 11.0. With FTS you can search documents in a way similar to how people search web pages in a search engine (Google, Bing, Yandex, DuckDuckGo): you just type some words, and the search result displays only documents containing those words, sorted by relevance.
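To make the role of websearch_to_tsquery more concrete, here is a minimal sketch of how a search query could be assembled in Python. It is not Papermerge's actual query: the table name core_page and the helper build_search_query are assumptions, and the column names text_eng and text_deu simply mirror the tsvector columns mentioned in the Language Support chapter of this documentation. websearch_to_tsquery, the @@ match operator and ts_rank are standard PostgreSQL features.

```python
# Mapping from (assumed) tsvector column to PostgreSQL text search configuration.
SEARCH_CONFIGS = {"text_eng": "english", "text_deu": "german"}


def build_search_query(words, column="text_eng"):
    """Return (sql, params) for a relevance-sorted full text search."""
    if column not in SEARCH_CONFIGS:
        # Column names cannot be passed as query parameters, so whitelist them.
        raise ValueError(f"unknown tsvector column: {column}")
    config = SEARCH_CONFIGS[column]
    sql = (
        "SELECT id FROM core_page "  # hypothetical table name
        f"WHERE {column} @@ websearch_to_tsquery('{config}', %s) "
        f"ORDER BY ts_rank({column}, websearch_to_tsquery('{config}', %s)) DESC"
    )
    # The user's words are passed as parameters, never interpolated into the SQL.
    return sql, (words, words)
```

The returned pair can be handed to any DB-API cursor, for example cursor.execute(sql, params).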

CHAPTER 2

Installation

There are several methods to install Papermerge. They differ in purpose and in the amount of effort required.

2.1 OS Specific Packages

This section describes how to install OS specific packages. There are three cases:

1. Both web app and workers are on the same machine
2. Web app machine
3. Worker machine

2.1.1 1. Web App + Workers Machine

2.1.1.1 Ubuntu Bionic 18.04 (LTS)

Install the required Ubuntu packages:

    sudo apt-get update
    sudo apt-get install python3 python3-pip python3-venv \
        poppler-utils \
        imagemagick \
        build-essential \
        tesseract-ocr \
        tesseract-ocr-deu \
        tesseract-ocr-eng

Notice that for Tesseract only the English and German (Deutsch) language packages are needed. Ubuntu Bionic 18.04 comes with the Postgres 10 package. Papermerge, on the other hand, requires at least version 11 of Postgres.


Install Postgres version 11:

    # add the repository
    sudo tee /etc/apt/sources.list.d/pgdg.list <

    # get the signing key and import it
    wget https://www.postgresql.org/media/keys/ACCC4CF8.asc
    sudo apt-key add ACCC4CF8.asc

    # fetch the metadata from the new repo
    sudo apt-get update

2.1.2 2. Web App Machine

Tesseract does not need to run on a machine that hosts only the Web App.

2.1.2.1 Ubuntu Bionic 18.04 (LTS)

Install the required Ubuntu packages:

    sudo apt-get update
    sudo apt-get install python3 python3-pip python3-venv \
        poppler-utils \
        imagemagick \
        build-essential

2.1.3 3. Worker Machine

The worker is the one performing the heavy task of extracting text from images, so it must have the Tesseract packages installed.

2.2 Manual Way

Papermerge has two parts:

• Web application
• Worker - which is used for the OCR operation

With this installation method both parts run on the same computer. It is suitable for developers. No configuration is automated in this method, so it is a perfect method if you want to understand the mechanics of the project. If you follow along in this document and still have trouble, please open an issue on GitHub so I can fill in the gaps.

2.2.1 Package Dependencies

In this setup, Web App and Workers run on the same machine. Install the OS specific packages for webapp + worker (see section 2.1.1).


Check that Postgres version 11 is up and running:

    sudo systemctl status [email protected]

Create a new role for the Postgres database:

    sudo -u postgres createuser --interactive

When asked "Shall the new role be allowed to create databases?", answer yes (when running tests, Django creates a temporary database).

Create a new database owned by the previously created user:

    sudo -u postgres createdb -O

Set a password for the user:

    sudo -u postgres psql
    ALTER USER WITH PASSWORD '';

2.2.2 Web App

Once the database, Tesseract and the other dependencies are prepared, let's start with Papermerge itself. Clone the main papermerge project:

    git clone https://github.com/ciur/papermerge papermerge-proj

Clone the papermerge-js project (this is the frontend part):

    git clone https://github.com/ciur/papermerge-js

Create Python's virtual environment .venv:

    cd papermerge-proj
    python3 -m venv .venv

Activate Python's virtual environment:

    source .venv/bin/activate

Install required python packages (now you are in papermerge-proj directory):

    # while in folder
    pip install -r requirements.txt

Rename the file config/settings/development.example.py to config/settings/development.py. This file is the default for DJANGO_SETTINGS_MODULE and it is included in .gitignore. Adjust the following settings in config/settings/development.py:

• DATABASES - name, username and password of the database you created in PostgreSQL
• STATICFILES_DIRS - include path to /static
• MEDIA_ROOT - absolute path to the media folder
• STORAGE_ROOT - absolute path to the same media root, but with a "local:/" prefix


Note:
1. Make sure that data_folder_in and data_folder_out point to the same location.
2. Make sure that the folder pointed to by data_folder_in and data_folder_out exists.

Then, as in any Django based project, run migrations, create a superuser and run the built-in web server:

    cd
    ./manage.py migrate
    ./manage.py createsuperuser
    ./manage.py runserver

At this point, you should see the (styled) login page and be able to log in with the administrative user you created with the ./manage.py createsuperuser command. The login screen should look like the screenshot below.

You can also upload some documents. But because there is no worker configured yet, documents are basically plain images. Let's configure a worker!


2.2.3 Worker

Let's add a worker on the same machine as the Web Application configured above. We will use the same Python virtual environment as for the Web Application.

Note: It is the workers that depend on (and use) Tesseract, not the Web App.

Clone the repo and install the required packages (in the same Python virtual environment as the Web App):

    git clone https://github.com/ciur/papermerge-worker
    cd papermerge-worker
    pip install -r requirements.txt

Create a file /config.py with the following configuration:

    worker_concurrency = 1
    broker_url = "filesystem://"
    # both folders intentionally point to the same location (see note below)
    broker_transport_options = {
        'data_folder_in': '/home/vagrant/papermerge-proj/run/broker/data_in',
        'data_folder_out': '/home/vagrant/papermerge-proj/run/broker/data_in',
    }
    worker_hijack_root_logger = True
    task_default_exchange = 'papermerge'
    task_ignore_result = False
    result_expires = 86400
    result_backend = 'rpc://'
    include = 'pmworker.tasks'
    accept_content = ['pickle', 'json']
    s3_storage = 's3:/'
    local_storage = "local:/home/vagrant/papermerge-proj/run/media/"


Important: The folder pointed to by data_folder_in and data_folder_out must exist and be the same one as in the configuration for the Web Application.
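The two conditions above (same folder, and the folder exists) are easy to get wrong, so a small sanity check can help. The following is only a sketch: check_broker_folders is a hypothetical helper, not part of papermerge-worker, and it expects a dictionary shaped like broker_transport_options from the configuration above.

```python
from pathlib import Path


def check_broker_folders(transport_options):
    """Return a list of problems found in the filesystem broker options."""
    problems = []
    folder_in = transport_options.get("data_folder_in")
    folder_out = transport_options.get("data_folder_out")
    if folder_in != folder_out:
        problems.append("data_folder_in and data_folder_out differ")
    for name, folder in (
        ("data_folder_in", folder_in),
        ("data_folder_out", folder_out),
    ):
        if folder is None or not Path(folder).is_dir():
            problems.append(f"{name} does not point to an existing folder")
    return problems
```

An empty list means both conditions hold; otherwise each entry names what to fix before starting the worker.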

Now, while in folder, run command:

    CELERY_CONFIG_MODULE=config celery worker -A pmworker.celery -Q papermerge -l info

At this stage, if you keep both the built-in web server (the ./manage.py runserver command above) and the worker running in the foreground, upload a couple of PDF documents and give the worker a few minutes to OCR them, each document becomes more than an image: you can now select text in it!

Fig. 1: Now you should be able to select text

2.2.4 Recurring Commands

At this point, if you try to search for a document, nothing will show up in the search results. This is because workers OCR a document and place the results into a .txt file, so the extracted text is not yet in the database. A special Papermerge command, txt2db, reads the .txt files and inserts them into the database entries of the associated documents (more precisely, the documents' pages). Afterwards another command, update_fts, prepares a special database column with the correct information about the document (more precisely, the page). Run the commands manually:


    cd
    ./manage.py txt2db
    ./manage.py update_fts

Note: In the manual setup (i.e. without any of Papermerge's background services running), if you want a document to be available for search, you need to run the ./manage.py txt2db and ./manage.py update_fts commands every time after a document is OCRed.
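If running both commands by hand becomes tedious during development, they can be looped from a small script. This is a sketch under assumptions, not something shipped with Papermerge: the function names and the 60 second interval are made up for illustration; only the two management commands and their order come from the text above.

```python
import subprocess
import time

# txt2db must run before update_fts, as described above.
COMMANDS = (["./manage.py", "txt2db"], ["./manage.py", "update_fts"])


def run_recurring_commands(project_dir, runner=subprocess.run):
    """Run txt2db and then update_fts once, from the project directory."""
    for command in COMMANDS:
        runner(command, cwd=project_dir, check=True)


def run_forever(project_dir, interval_seconds=60):
    """Repeat the recurring commands until interrupted (Ctrl+C)."""
    while True:
        run_recurring_commands(project_dir)
        time.sleep(interval_seconds)  # interval is an arbitrary choice
```

The runner parameter exists only so the command sequence can be checked without a Django project; in normal use the default subprocess.run is fine.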

2.3 Systemd

In this installation method you use a special Papermerge command, startetc, to generate a set of configuration files in the /run/etc folder. Then, with a single command:

    systemctl --user start papermerge

you start a full-fledged staging environment with nginx, gunicorn, one worker and the recurring commands running as services on a single machine. I really love this method and I use it in my local development environment. This method relies on systemd and its --user argument.

2.3.1 Package Dependencies

You will need to install the OS specific packages for webapp + worker first. Then make sure that PostgreSQL is up and running. Make sure that your machine has both nginx and systemd available:

    nginx -V
    systemd --version

2.3.2 Web App

Clone the main papermerge project:

    git clone https://github.com/ciur/papermerge papermerge-proj

Clone the papermerge-js project (this is the frontend part):

    git clone https://github.com/ciur/papermerge-js

Create Python's virtual environment .venv:

    cd papermerge-proj
    python3 -m venv .venv

Activate Python's virtual environment:

    source .venv/bin/activate

Install required python packages (now you are in papermerge-proj directory):


    # while in folder
    pip install -r requirements.txt

Rename the file config/settings/development.example.py to config/settings/development.py. This file is the default for DJANGO_SETTINGS_MODULE and it is included in .gitignore. Adjust the following settings in config/settings/development.py:

• DATABASES - name, username and password of the database you created in PostgreSQL
• MEDIA_ROOT - absolute path to the media folder
• STORAGE_ROOT - absolute path to the same media root, but with a "local:/" prefix

Note:
1. Make sure that data_folder_in and data_folder_out point to the same location.
2. Make sure that the folder pointed to by data_folder_in and data_folder_out exists.

Then, as in any Django based project, run migrations and create a superuser:

    cd
    ./manage.py migrate
    ./manage.py createsuperuser

Run startetc command:

./manage.py startetc

Just out of curiosity, have a look at the /run folder generated by the startetc command. The folder should have the following structure:

    run
        broker
            data_in
            data_out
            data_processed
        etc
            gunicorn.conf.py
            nginx.conf
            papermerge.env
            pmworker.env
            pmworker.py
            systemd
                papermerge.service
                papermerge.target
                pm_nginx.service
                pmworker.service
                txt2db.service
                txt2db.timer
                update_fts.service
                update_fts.timer
        log
        tmp

Systemd can be used to manage user services. The --user flag is used for that. User services must be referenced in the ~/.config/systemd/user folder. By the way, I made a video about the systemd --user feature.


Create ~/.config/systemd/user if you don't have it. Then reference (create symbolic links to) the /run/etc/systemd/ units in the ~/.config/systemd/user folder:

    cd ~/.config/systemd/user
    ln -s /run/etc/systemd/* .

Important: Path /run/etc/systemd/* must be absolute.
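The linking step can also be scripted, which makes the absolute path requirement explicit. The sketch below is an illustration only: link_units is a hypothetical helper, and the directory arguments are whatever your generated run/etc/systemd folder and systemd user folder happen to be.

```python
from pathlib import Path


def link_units(units_dir, user_dir):
    """Symlink every file in units_dir into user_dir using absolute paths."""
    units_dir = Path(units_dir).resolve()  # resolve() yields an absolute path
    user_dir = Path(user_dir).expanduser()
    user_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for unit in sorted(units_dir.iterdir()):
        link = user_dir / unit.name
        # Skip names that already exist, including dangling symlinks.
        if not link.exists() and not link.is_symlink():
            link.symlink_to(unit)
            created.append(link)
    return created
```

A call could look like link_units("run/etc/systemd", "~/.config/systemd/user"), where the first argument is relative to wherever the run folder was generated.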

Start papermerge:

    systemctl --user start papermerge.target

2.4 Docker

With this method you will need git, docker and docker-compose installed.

1. Install Docker
2. Install docker-compose
3. Clone Papermerge Repository:

    git clone https://github.com/ciur/papermerge papermerge-proj

4. Run docker compose command (which will pull images from DockerHub):

    cd papermerge-proj/docker
    docker-compose up -d

This will pull and start the necessary containers. If you wish to build local images instead, use:

    docker-compose -f docker-compose-dev.yml up --build -d

Check that the services are up and running:

    docker-compose ps

Papermerge Web Service is available at http://localhost:8000. For the initial sign in use:

    URL: http://localhost:8000
    username: admin
    password: admin

You can check the logs of each service with:

    docker-compose logs worker
    docker-compose logs app
    docker-compose logs db

2.5 Ansible (Semiautomated)

Coming soon...

2.6 Jenkins + Ansible (Fully Automated Deployment)

To be added...

CHAPTER 3

Languages Support

Theoretically, all languages supported by Tesseract (over 130) can be used. But for my own needs only two were required:

• German
• English

Thus, only support for these two languages is provided. Both localization (of the user interface) and OCRing of documents in German and English are basically hardcoded into the project.

CHAPTER 4

REST API

Screencast demo

The REST API is a way to interact with Papermerge far beyond the web browser. It gives you the power to extend Papermerge in many interesting ways. For example, it allows you to write a simple bash script to automate uploading files from a specific location on your local (or remote) computer. Another practical scenario is to automatically import attached documents from a given email account (you will need some sort of 3rd party script for that).

4.1 How It Works?

Instead of the usual sign in with username and password via a web browser, you sign in with a token (a fancy name for a sequence of numbers and letters) from practically any software which supports the HTTP protocol. Thus, working with the REST API is a two step process:

1. get a token
2. use the token from a 3rd party REST API client

4.1.1 Get a Token

1. Click User Menu (top right corner) -> API Tokens.
2. Click New Token.
3. Decide on the number of hours the token will be valid. The default is 4464 hours, which is roughly equivalent to 6 months. Click the Save button.
4. After you click the Save button, two information messages will be displayed. Write down your token from the "Remember the token: ..." info window.
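To see where "roughly 6 months" comes from, the default validity can be converted from hours to days. This is plain arithmetic and does not depend on any Papermerge code; the constant name is made up for the example.

```python
from datetime import timedelta

DEFAULT_TOKEN_HOURS = 4464  # default validity quoted above

validity = timedelta(hours=DEFAULT_TOKEN_HOURS)
print(validity.days)  # 4464 / 24 = 186 days, i.e. about six months
```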

Fig. 1: “API Tokens” in User Menu (step 1)

Fig. 2: “New token” button (step 2)


Important: Write down your token. For security reasons, it will be displayed only once. In the picture below, it is the one marked in red.

Important: Tokens are saved in the database encrypted. A token's encrypted version is called a digest. In the tokens table (by the way, you can have as many tokens as you like) the first column displays the first 16 characters of the digest. It is a way to identify the token. In the picture below, the token's digest is marked in green.

4.1.2 Use the Token

Once you have your REST API token, you can use Papermerge with any HTTP client. Just remember to include the REST API token as a header in the following format:

Authorization: Token

Let's see some examples with curl. The simplest REST API call is:

    curl -H "Authorization: Token 7502db85f8d40bc7f4f5ab0a4e4fee4a" \
        /api/documents

If you get a 2XX response, your Authorization header and token are correct.

Upload a local file to the remote host:

    curl -H "Authorization: Token 7502db85f8d40bc7f4f5ab0a4e4fee4a" \
        -T /home/eugen/documents/demo/2019/berlin1.pdf \
        /api/document/upload/berlin_x1.pdf

Notice that the local file name is berlin1.pdf, while it appears in the URL as berlin_x1.pdf. This way the file can be renamed on upload. You can also upload files without specifying their remote name; in that case the remote file will have the same name as the local file:

    curl -H "Authorization: Token 7502db85f8d40bc7f4f5ab0a4e4fee4a" \
        -T /home/eugen/documents/demo/2019/berlin1.pdf \
        /api/document/upload/

Note: Notice the trailing / character. When uploading a file with curl without specifying a file name, the URL must end with /. This is a way to tell curl that we don't want to rename the file.

Your (REST API) uploaded files will end up in Inbox.
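The curl calls above translate directly to other HTTP clients. Below is a minimal sketch using only Python's standard library. The helper names build_request and upload_document are hypothetical, and the base URL is whatever host your Papermerge instance runs on (http://localhost:8000 is used in the test only because it is the address from the Docker section). The header format and the PUT upload URL are the ones documented in this chapter.

```python
import urllib.request
from urllib.parse import quote


def build_request(base_url, token, path, method="GET", data=None):
    """Create a request carrying the 'Authorization: Token ...' header."""
    url = base_url.rstrip("/") + path
    request = urllib.request.Request(url, data=data, method=method)
    request.add_header("Authorization", f"Token {token}")
    return request


def upload_document(base_url, token, local_path, remote_name):
    """PUT a local file to /api/document/upload/<remote_name>."""
    with open(local_path, "rb") as source:
        payload = source.read()
    request = build_request(
        base_url,
        token,
        f"/api/document/upload/{quote(remote_name)}",
        method="PUT",
        data=payload,
    )
    with urllib.request.urlopen(request) as response:
        return response.status  # a 2XX status means the upload was accepted
```

Listing documents works the same way: pass build_request(base_url, token, "/api/documents") to urllib.request.urlopen and decode the JSON body.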

4.2 REST API Reference

REST API authorization header:

• name: Authorization
• value format: Token

Example:

Fig. 3: In red color is your (example) token (step 4)


Fig. 4: Files uploaded with REST API end up in Inbox.

    curl ... -H "Authorization: Token "

REST API URLs:

    URL                               HTTP Method   Description
    /api/documents                    GET           json list of all documents
    /api/document/                    GET           json info about document with id=
    /api/document/upload/             PUT           Uploads unnamed file (random name will be assigned)
    /api/document/upload/<filename>   PUT           Uploads named file

CHAPTER 5

Page Management

Screencast demo

Page management is a new set of Papermerge features for managing pages: you can delete, reorder, cut and paste pages. Scanning documents in bulk often results in documents with blank pages; some pages may be out of order or may be part of a totally different document. Even if the user notices these flaws immediately, it is time consuming and frustrating to redo the scanning process. Thus it is a welcome feature of Papermerge to allow the user to fix such pages within the application.

5.1 Delete Page(s)

Delete those blank pages. Although my scanner has an automatic "remove blank pages" feature, it misses some blank pages. So I find it very practical to allow users to remove blank pages themselves.

5.2 Reorder Pages

Out of order pages occur very often during the scanning process. Papermerge allows users to change the order of pages within a document.

5.3 Cut & Paste

You can move pages from one document to another. Once you cut one or several pages from a document, you can either paste them inside another document, where they become part of that document, or paste them in the file browser, which creates an entirely new document from the cut pages.

CHAPTER 6

Settings

These are the configuration settings for the Papermerge Web App. Configuration settings are used in the same manner as for any Django based project. Settings which are common to all environments (production, development, staging) are defined in the papermerge.config.settings.base module. If you want to reuse papermerge.config.settings.base, create a Python file, for example staging.py, and import all settings from the base module:

    from .base import *

    DEBUG = False
    STATIC_ROOT = '/www/static/'

The example above assumes that staging.py was created in the same folder as base.py. Don't forget to point the DJANGO_SETTINGS_MODULE environment variable to your settings module.
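One common Django way to point DJANGO_SETTINGS_MODULE at a settings module is to set it from Python before Django is initialised, for instance at the top of a small launcher script. The module path config.settings.staging below is an assumption based on the config/settings/ folder mentioned in the installation chapter; adjust it to wherever your staging.py actually lives.

```python
import os

# Assumed module path; it must match the location of your staging.py.
# setdefault leaves any value already exported in the shell untouched.
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "config.settings.staging")
```

Exporting the variable in the shell before running ./manage.py has the same effect.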

6.1 STORAGE_ROOT

• local:/
• s3:/

Defines either a local or a remote location where documents are stored. In the local case, its meaning is the same as for Django's MEDIA_ROOT. In the case of S3 storage, it indicates the path to the S3 bucket. Examples:

    STORAGE_ROOT = 'local:/home/vagrant/papermerge-proj/run/media'  # good for development env
    STORAGE_ROOT = 's3:/yourbucketname/alldocuments'  # suitable for production

Note: If you choose not to use S3 storage, STORAGE_ROOT must be set to a local:/... path and the S3 option must be set to False. The other way around, if you want to use S3 storage, both STORAGE_ROOT and S3 need to be set accordingly (S3=True, STORAGE_ROOT='s3:/bucketname').
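The rule in the note above can be expressed as a tiny consistency check. The function below is a hypothetical helper written for this documentation, not a Papermerge API; it only encodes the pairing of the S3 flag with the STORAGE_ROOT prefix.

```python
def storage_settings_consistent(storage_root, s3):
    """Return True if STORAGE_ROOT's prefix matches the S3 flag."""
    if s3:
        return storage_root.startswith("s3:/")
    return storage_root.startswith("local:/")
```

For example, calling it with the development value from the STORAGE_ROOT section and S3=False returns True, while combining an s3:/ path with S3=False returns False.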

6.2 S3

• True|False

Tells Papermerge whether to use S3 storage. S3=True is more suitable for production environments.

Note: If S3=True, you need to point STORAGE_ROOT to an S3 location.

6.3 OCR

• True|False

Enables or disables OCR features. With OCR=False no workers need to be configured.

6.4 DATABASES

This is a Django specific configuration setting. Papermerge uses PostgreSQL as its database, which means that the ENGINE option must be set to django.db.backends.postgresql. Example:

    DATABASES = {
        'default': {
            'NAME': 'db_name',
            'ENGINE': 'django.db.backends.postgresql',
            'USER': 'db_user',
            'PASSWORD': 'db_password'
        },
    }

6.5 STATICFILES_DIRS

Include the absolute path to the papermerge-js static files. Example:

    STATICFILES_DIRS = [
        '/home/vagrant/papermerge-js/static'
    ]

CHAPTER 7

Developers Guide

Documentation, notes and general info for developers (myself included).

7.1 Contributing

This document describes in detail how you can contribute to the Papermerge project.

7.1.1 Fix a Typo

Contribute to the project just by fixing text typos. Like tis one. Yes, English is not my native lnguage and I do lots of typoz. Fixing documentation typos is the easiest and fastest way to contribute to the project. Even if you correct one minor typing mistake, I will add you to the list of contributors.

7.1.2 Open an Issue

Another way to contribute is to open issues. Obviously this means you need to run the application at least once and test it.

7.1.3 Add Your Language Support

Adding language support is not as trivial as fixing a typo or opening an issue, but it is not that difficult either. In any case, there is a separate page in the developer guide for it (see section 7.4).

7.2 Design

A brief description of the architecture of Papermerge and why such design decisions were taken. The Papermerge project has 2 parts:

• Web Application
• Workers

The web application is further divided into Frontend and Backend. As a result there are 3 separate repositories that are part of one whole.

Fig. 1: High level design. Backend and frontend are separate.

7.2.1 1. Frontend

Papermerge-js Repository

Warning: The name papermerge-js is misleading, because it implies that only javascript is used, which is not true. This project manages all static assets: javascript, css, images, fonts.

Modern web applications tend to use a lot of javascript and css. Javascript code, as opposed to code written in Python, becomes increasingly difficult to manage. The same goes for css. To deal with codebase complexity, I decided to split the frontend into a completely separate project. This project is a Webpack project. In practice this makes it a little easier to deal with growing javascript code complexity. The outcome of this project is, among others, two important files:

    /static/js/papermerge.js
    /static/css/papermerge.css

There are other static files as well, like images and fonts. However, images and fonts are just placed in /static and nothing really interesting happens with them.

7.2.2 2. Backend

Papermerge-proj Repository

The backend is a standard Django application. It uses static files from the frontend part. Throughout the documentation it is referred to as the backend, because the term webapp is more general (webapp = backend + frontend).

7.2.3 3. Workers

Papermerge-worker Repository

Workers perform OCR on the documents. Documents are passed by reference (see note below) from the backend to the workers via a shared location. In the simplest setup, when everything runs on the same machine, the shared location is just a folder on the local machine accessible by both worker and backend. In production, the shared location is an S3 bucket.

Note: There are at least two distinct methods of passing documents from the backend to the workers. The first method, which is very simple but wrong, is for the backend to transfer the entire document byte by byte to the worker. Without diving deep into technical details, this method is not scalable because it depletes the backend's memory very quickly. Instead, the backend instructs workers which documents they need to OCR by telling them the document id (it passes the user id and language name as well).
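The difference between the two methods is easiest to see in the size of what travels from backend to worker. The sketch below is purely illustrative: the field names and the build_ocr_message helper are assumptions, and the real project dispatches work through Celery tasks rather than hand-built JSON. It only shows that a by-reference message carries three small values, however large the document is.

```python
import json


def build_ocr_message(user_id, document_id, lang):
    """Build a by-reference OCR request: identifiers only, no file content."""
    return json.dumps(
        {"user_id": user_id, "document_id": document_id, "lang": lang}
    )


# The message stays a few dozen bytes even for a multi-megabyte PDF,
# because the worker fetches the file itself from the shared location.
message = build_ocr_message(user_id=1, document_id=42, lang="deu")
```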

Fig. 2: Backend passes documents to workers by reference.

7.3 Branching Model

All current development goes into the master branch. Papermerge versions branch off from the master branch and are tagged for a specific version. This is easier to explain with a picture.

Fig. 3: Branching model used by Papermerge project.

• Stable version branches are named stable/1.0.x, stable/1.1.x etc.
• Git tagging is used to mark a specific software version, e.g. v1.0.0, v1.1.0, v1.2.0 and so on.

7.3.1 Worker, Papermerge-js Branching Model?

Well, both worker and papermerge-js will follow the same model.

Note: I started the branching model described above somewhere around 14th February 2020 and applied it only to the main project. Unfortunately, at that moment I forgot about the other two parts.

As a temporary workaround I tagged both worker and papermerge-js with v1.1.0 tags to mark their point of compatibility with the main project.

7.3.2 Git Branching/Tagging Blitz Introduction

To checkout a branch stable/1.1.x, use command:

$ git checkout stable/1.1.x

To checkout a tagged commit, say a commit tagged v1.1.0, you use same command as checking out a branch:

$ git checkout v1.1.0

7.4 Language Support

By default, Papermerge is hardcoded to work with documents in only two languages: German and English. Theoretically it can support more than 100 languages. However, I, as developer and user of this software, included in Papermerge only what was useful for me (German and English). You can contribute to this project by adding (and testing) support for your own language. It is an extremely rewarding experience, because:

• it is fun
• you will learn a lot
• you will create something useful for you and others

7.4.1 What is Language Support?

There are two parts to consider:

• User Interface language (text like username, Log out)
• Document Content (actual content of your documents)

7.4.2 User Interface Language

User Interface language is the text the user sees and interacts with. For example, the label for username in German is Benutzername, and the text for Log out in German is Abmelden. To localize the user interface (UI) in your own language you need to be familiar with the Django way of doing it, because the main web application is a Django project. Contributing to the project in this sense basically means creating or updating the file papermerge//LC_MESSAGES/django.po.

7.4.3 Document Content Language

Every document uploaded to Papermerge will be OCRed by the tesseract command line utility. The tesseract command requires the -l argument to indicate the language of the document. This is the heart of document language support. Have a look at the extract_hocr and extract_txt functions in the worker's shortcuts module. Both functions build the tesseract command with the language as the first argument. To check which languages you have installed for tesseract, use the command:

$ tesseract --list-langs

In my case, it lists deu and eng, which are the codes for the German and English languages. OCRing of the documents (tesseract -l deu path/to/doc) happens on the worker side. I explain this because it is important to know, but for adding language support you don't need to change anything in the worker, because the worker only takes orders and blindly executes them. The entry point for the worker part is the task module with its ocr_page function. Again, there is no need to change anything here; I mention it only because it is important to know.

The first thing you need to look into and change is dynamic_preferences_module, where the configuration for the language is defined. You will need to add a new choice to the OcrLanguage class. The code for the new language must match the language code listed by tesseract --list-langs. This change adds a new entry in the UI and allows the user to choose the new language for the document.

The tricky part is the change on the database level. Papermerge makes use of the PostgreSQL full text search feature, which means it needs to store an updated version of a tsvector type column. How to create and search tsvector type columns is described in the Postgres documentation. Every time the page.text column is changed, a database level trigger is fired to update the language specific tsvector column. Triggers for this job are defined in the papermerge/core/pgsql/01_triggers.sql file.

Another important language related SQL file is papermerge/core/pgsql/03_update_lang_cols.sql. This SQL code is executed periodically by the papermerge/core/management/commands/update_fts.py command. It is responsible for moving the document's page.text to page.text_deu or page.text_eng. Both page.text_eng and page.text_deu are tsvector type columns with preset weight 'C'.

CHAPTER 8
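The requirement that a new language code must match what tesseract --list-langs reports can be checked programmatically. The sketch below is not the worker's actual code (extract_hocr and extract_txt live in the worker's shortcuts module and are not reproduced here); the two helper names are hypothetical, and the parser assumes the first output line is a header such as "List of available languages (2):".

```python
def parse_installed_languages(list_langs_output):
    """Extract language codes from 'tesseract --list-langs' output."""
    lines = [line.strip() for line in list_langs_output.splitlines()]
    # Assumption: the first line is a header, the rest are language codes.
    return [line for line in lines[1:] if line]


def build_tesseract_command(image_path, output_base, lang, installed):
    """Build a tesseract command with the language as first argument."""
    if lang not in installed:
        raise ValueError(f"tesseract language not installed: {lang}")
    return ["tesseract", "-l", lang, image_path, output_base]
```

A list built this way can be passed to subprocess.run; an unknown code fails early with a clear message instead of failing inside tesseract.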

Indices and tables

• genindex
• modindex
• search