
Papermerge

Feb 29, 2020

Contents

1 Requirements
    1.1 Python
    1.2 Imagemagick
    1.3 Poppler
    1.4 Tesseract
    1.5 Database

2 Installation
    2.1 OS Specific Packages
        2.1.1 1. Web App + Workers Machine
            2.1.1.1 Ubuntu Bionic 18.04 (LTS)
        2.1.2 2. Web App Machine
            2.1.2.1 Ubuntu Bionic 18.04 (LTS)
        2.1.3 3. Worker Machine
    2.2 Manual Way
        2.2.1 Package Dependencies
        2.2.2 Web App
        2.2.3 Worker
        2.2.4 Recurring Commands
    2.3 Systemd
        2.3.1 Package Dependencies
        2.3.2 Web App
    2.4 Docker
    2.5 Ansible (Semiautomated)
    2.6 Jenkins + Ansible (Fully Automated Deployment)

3 Languages Support

4 REST API
    4.1 How It Works?
        4.1.1 Get a Token
        4.1.2 Use the Token
    4.2 REST API Reference

5 Settings
    5.1 STORAGE_ROOT
    5.2 S3
    5.3 OCR
    5.4 DATABASES
    5.5 STATICFILES_DIRS

6 Design
    6.1 1. Frontend
    6.2 2. Backend
    6.3 3. Workers

7 Changelog

8 1.0.0

9 Indices and tables

Papermerge

I have nothing against paper. Paper is a brilliant invention of humanity. But in the 21st century I find it more appropriate for paper-based documents to be digitized (scanned). Once scanned, appropriate software can be used to find any document in a fraction of a second, just by typing a few keywords. Papermerge is a document management system designed to work with scanned documents. As well as OCR with full text search, it provides the look and feel of major modern file browsers, with a hierarchical structure for files and folders, so that you can organize your documents in a similar way to Dropbox (via web) or Google Drive.

CHAPTER 1

Requirements

Papermerge depends on the following software:
• Python >= 3.8.0
• Tesseract - for OCR
• Imagemagick - for image operations
• Poppler - for PDF operations
• PostgreSQL >= 11.0 - for Full Text Search

1.1 Python

Papermerge is a Python 3 application.

1.2 Imagemagick

Papermerge uses Imagemagick to convert between image formats.

1.3 Poppler

More exactly, the poppler utils are used. For example, the pdfinfo command line utility is used to find out the number of pages in a PDF document.
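As an illustration of the pdfinfo use case, here is a minimal Python sketch that reads the page count from the utility's output. The function names are hypothetical and this is not Papermerge's own code; it only assumes that pdfinfo prints a line starting with "Pages:".

```python
import re
import subprocess


def parse_page_count(pdfinfo_output: str) -> int:
    """Extract the page count from the text printed by `pdfinfo`."""
    match = re.search(r"^Pages:\s+(\d+)\s*$", pdfinfo_output, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no 'Pages:' line found in pdfinfo output")
    return int(match.group(1))


def get_page_count(pdf_path: str) -> int:
    """Run `pdfinfo` (from poppler-utils) and return the number of pages."""
    result = subprocess.run(
        ["pdfinfo", pdf_path],
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError if pdfinfo fails
    )
    return parse_page_count(result.stdout)
```

Keeping the parsing separate from the subprocess call makes the parsing testable on a machine without poppler-utils installed.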

1.4 Tesseract

If you have never heard of Tesseract: it is Google's open source Optical Character Recognition (OCR) software. It extracts text from images and works well for a wide range of languages.

1.5 Database

One of Papermerge's core philosophies is "Find Any Document". The PostgreSQL database comes with Full Text Search (FTS) support out of the box. Papermerge uses the websearch_to_tsquery PostgreSQL function, which was introduced in PostgreSQL version 11.0.

With FTS you can search documents in a way similar to how people search web pages in a search engine (google, bing, yandex, duckduckgo): you just type some words, and the search result displays only documents containing those words, sorted by relevancy.
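To make the role of websearch_to_tsquery more concrete, the sketch below builds a parameterized SQL query that filters and ranks rows with it. The table name page and the column name text_fts are assumptions made up for this illustration, not Papermerge's actual schema; only websearch_to_tsquery, ts_rank and the @@ operator are standard PostgreSQL.

```python
def build_search_query(user_input: str, lang: str = "english"):
    """Return (sql, params) for a web-style full text search.

    The `page` table and its `text_fts` tsvector column are hypothetical.
    """
    sql = (
        "SELECT id, "
        "ts_rank(text_fts, websearch_to_tsquery(%s::regconfig, %s)) AS rank "
        "FROM page "
        "WHERE text_fts @@ websearch_to_tsquery(%s::regconfig, %s) "
        "ORDER BY rank DESC"
    )
    # The user's words are passed as parameters, never pasted into the SQL.
    params = (lang, user_input, lang, user_input)
    return sql, params
```

The returned pair can be handed to a database driver that uses %s placeholders, for example cursor.execute(sql, params) with psycopg2.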

CHAPTER 2

Installation

There are several ways to install Papermerge. They differ in the amount of effort required and in their purpose.

2.1 OS Specific Packages

Here are instructions on how to install the OS specific packages. There are three cases:
1. Both web app and workers are on the same machine
2. Web app machine
3. Worker machine

2.1.1 1. Web App + Workers Machine

2.1.1.1 Ubuntu Bionic 18.04 (LTS)

Install the required Ubuntu packages:

    sudo apt-get update
    sudo apt-get install python3 python3-pip python3-venv \
        poppler-utils \
        imagemagick \
        build-essential \
        tesseract-ocr \
        tesseract-ocr-deu \
        tesseract-ocr-eng

Notice that for Tesseract only the English and German (Deutsch) language packages are needed.

Ubuntu Bionic 18.04 comes with the Postgres 10 package. Papermerge, on the other hand, requires at least version 11 of Postgres.

Install Postgres version 11:

    # add the repository
    sudo tee /etc/apt/sources.list.d/pgdg.list <

    # get the signing key and import it
    wget https://www.postgresql.org/media/keys/ACCC4CF8.asc
    sudo apt-key add ACCC4CF8.asc

    # fetch the metadata from the new repo
    sudo apt-get update

2.1.2 2. Web App Machine

Tesseract is not needed on a machine that runs only the Web App.

2.1.2.1 Ubuntu Bionic 18.04 (LTS)

Install the required Ubuntu packages:

    sudo apt-get update
    sudo apt-get install python3 python3-pip python3-venv \
        poppler-utils \
        imagemagick \
        build-essential

2.1.3 3. Worker Machine

The worker performs the heavy task of extracting text from images, so it must have the Tesseract packages installed.

2.2 Manual Way

Papermerge has two parts:
• Web application
• Worker - which is used for OCR operations

With this installation method both parts run on the same computer. This method is suitable for developers: no configuration is automated, so it is a good choice if you want to understand the mechanics of the project.

If you follow along in this document and still have trouble, please open an issue on GitHub so I can fill in the gaps.

2.2.1 Package Dependencies

In this setup, Web App and Workers run on the same machine. Install the OS specific packages for the Web App + Workers machine (section 2.1.1).

Check that Postgres version 11 is up and running:

    sudo systemctl status [email protected]

Create a new role for the Postgres database:

    sudo -u postgres createuser --interactive

When asked "Shall the new role be allowed to create databases?", please answer yes (when running tests, Django creates a temporary database).

Create a new database owned by the previously created user:

    sudo -u postgres createdb -O

Set a password for the user:

    sudo -u postgres psql
    ALTER USER WITH PASSWORD '';

2.2.2 Web App

Once we have prepared the database, Tesseract and the other dependencies, let's start with Papermerge itself.

Clone the main papermerge project:

    git clone https://github.com/ciur/papermerge papermerge-proj

Clone the papermerge-js project (this is the frontend part):

    git clone https://github.com/ciur/papermerge-js

Create Python's virtual environment .venv:

    cd papermerge-proj
    python3 -m venv .venv

Activate Python's virtual environment:

    source .venv/bin/activate

Install the required Python packages (now you are in the papermerge-proj directory):

    # while in the papermerge-proj folder
    pip install -r requirements.txt

Rename the file config/settings/development.example.py to config/settings/development.py. This file is the default for DJANGO_SETTINGS_MODULE and it is included in .gitignore.

Adjust the following settings in config/settings/development.py:
• DATABASES - name, username and password of the database you created in PostgreSQL
• STATICFILES_DIRS - include path to /static
• MEDIA_ROOT - absolute path to the media folder
• STORAGE_ROOT - absolute path to the same media root, but with a "local:/" prefix
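As a rough illustration of the four settings listed above, the relevant part of config/settings/development.py could look like the sketch below. The paths and database credentials are placeholders borrowed from other examples in this document, not required values, and the rest of the file (everything that development.example.py already contains) is left out.

```python
# Sketch of the settings to adjust; all values are placeholders.

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "db_name",          # database created with createdb
        "USER": "db_user",          # role created with createuser
        "PASSWORD": "db_password",  # password set with ALTER USER
    },
}

# Where the papermerge-js static files live (assumed checkout location).
STATICFILES_DIRS = ["/home/vagrant/papermerge-js/static"]

# Absolute path to the media folder (assumed location).
MEDIA_ROOT = "/home/vagrant/papermerge-proj/run/media"

# Same folder as MEDIA_ROOT, written with the "local:/" prefix.
STORAGE_ROOT = "local:" + MEDIA_ROOT
```

Deriving STORAGE_ROOT from MEDIA_ROOT is only a convenience of this sketch; it keeps the two values from drifting apart.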


Note:
1. Make sure that data_folder_in and data_folder_out point to the same location.
2. Make sure that the folder pointed to by data_folder_in and data_folder_out exists.

Then, as in any Django based project, run migrations, create a superuser and run the built-in web server:

    # while in the papermerge-proj folder
    ./manage.py migrate
    ./manage.py createsuperuser
    ./manage.py runserver

At this point you should be able to see the (styled) login page and to log in with the administrative user you created with the ./manage.py createsuperuser command. The login screen should look like the one in the screenshot below.

You can also upload some documents and view them.

But because there is no worker configured yet, documents are basically plain images. Let's configure a worker!

2.2.3 Worker

Let’s add a worker on the same machine with Web Application we configured above. We will use the same python’s virtual environment as for Web Application.

Note: Workers are the ones who depend on (and use) tesseract not Web App.

Clone the repo and install the required packages (in the same Python virtual environment as the Web App):

    git clone https://github.com/ciur/papermerge-worker
    cd papermerge-worker
    pip install -r requirements.txt

Create a file /config.py with the following configuration:

    worker_concurrency = 1
    broker_url = "filesystem://"
    broker_transport_options = {
        'data_folder_in': '/home/vagrant/papermerge-proj/run/broker/data_in',
        'data_folder_out': '/home/vagrant/papermerge-proj/run/broker/data_in',
    }
    worker_hijack_root_logger = True
    task_default_exchange = 'papermerge'
    task_ignore_result = False
    result_expires = 86400
    result_backend = 'rpc://'
    include = 'pmworker.tasks'
    accept_content = ['pickle', 'json']
    s3_storage = 's3:/'
    local_storage = "local:/home/vagrant/papermerge-proj/run/media/"

Important: The folder pointed to by data_folder_in and data_folder_out must exist and be the same one as in the configuration for the Web Application.
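The requirement above can be checked before starting the worker. The helper below is a hypothetical sketch, not part of papermerge-worker: it takes a dictionary shaped like broker_transport_options, verifies that both keys point to the same folder, and creates the folder if it is missing.

```python
from pathlib import Path


def ensure_broker_folder(transport_options: dict) -> Path:
    """Check that in/out folders are identical and make sure the folder exists."""
    folder_in = Path(transport_options["data_folder_in"])
    folder_out = Path(transport_options["data_folder_out"])
    if folder_in != folder_out:
        raise ValueError(
            f"data_folder_in ({folder_in}) and data_folder_out ({folder_out}) differ"
        )
    # parents=True creates intermediate folders; exist_ok avoids an error on reruns.
    folder_in.mkdir(parents=True, exist_ok=True)
    return folder_in
```

For example, ensure_broker_folder(broker_transport_options) could be called from a small script that imports the values from config.py.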

Now, while in folder, run the command:

    CELERY_CONFIG_MODULE=config celery worker -A pmworker.celery -Q papermerge -l info

At this stage, keep both the built-in web server (the ./manage.py runserver command above) and the worker running in the foreground and upload a couple of PDF documents. Give the worker a few minutes to OCR them, and each document becomes more than an image: you can now select text in it!

Fig. 1: Now you should be able to select text

2.2.4 Recurring Commands

At this point, if you try to search for a document, nothing will show up in the search results. This is because workers OCR a document and place the results into a .txt file, so the extracted text is not yet in the database.

A special Papermerge command, txt2db, reads the .txt files and inserts their content into the database entries of the associated documents (more precisely, of the documents' pages). Afterwards another command, update_fts, prepares a special database column with the correct information about the document (more precisely, the page).

Run the commands manually:

    # while in the papermerge-proj folder
    ./manage.py txt2db
    ./manage.py update_fts

Note: In a manual setup (i.e. without any of Papermerge's background services running), if you want a document to be available for search, you need to run the ./manage.py txt2db and ./manage.py update_fts commands every time after a document is OCRed.
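If running the two commands by hand becomes tedious in a manual setup, a small loop can repeat them. This is only a sketch under assumptions: the 60 second interval is arbitrary, the function names are invented for this example, and the script must be started with the project's virtual environment activated. The Systemd installation method described later provides timers for the same purpose.

```python
import subprocess
import time

# The two recurring commands, in the order they must run.
RECURRING_COMMANDS = (
    ["./manage.py", "txt2db"],
    ["./manage.py", "update_fts"],
)


def run_once(project_dir, runner=subprocess.run):
    """Run txt2db and then update_fts inside the project directory."""
    for command in RECURRING_COMMANDS:
        # check=True stops the sequence if a command fails.
        runner(command, cwd=project_dir, check=True)


def run_forever(project_dir, interval_seconds=60):
    """Repeat the recurring commands until interrupted (Ctrl+C)."""
    while True:
        run_once(project_dir)
        time.sleep(interval_seconds)
```

The runner parameter exists only so that the sequence can be exercised without a real Papermerge checkout.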

2.3 Systemd

In this installation method you use a special Papermerge command, startetc, to generate a bunch of configuration files in the /run/etc folder. Then, with a single command:

    systemctl --user start papermerge

you start a full-fledged staging environment with nginx, gunicorn, one worker and the recurring commands running as services on a single machine. I really love this method and I use it in my local development environment. This method relies on systemd and its --user argument.

2.3.1 Package Dependencies

You will need to install the OS specific packages for the Web App + Workers machine first. Then make sure that PostgreSQL is up and running.

Make sure that your machine has both nginx and systemd available:

    nginx -V
    systemd --version

2.3.2 Web App

Clone the main papermerge project:

    git clone https://github.com/ciur/papermerge papermerge-proj

Clone the papermerge-js project (this is the frontend part):

    git clone https://github.com/ciur/papermerge-js

Create Python's virtual environment .venv:

    cd papermerge-proj
    python3 -m venv .venv

Activate Python's virtual environment:

    source .venv/bin/activate

Install the required Python packages (now you are in the papermerge-proj directory):

    # while in the papermerge-proj folder
    pip install -r requirements.txt

Rename the file config/settings/development.example.py to config/settings/development.py. This file is the default for DJANGO_SETTINGS_MODULE and it is included in .gitignore.

Adjust the following settings in config/settings/development.py:
• DATABASES - name, username and password of the database you created in PostgreSQL
• MEDIA_ROOT - absolute path to the media folder
• STORAGE_ROOT - absolute path to the same media root, but with a "local:/" prefix

Note:
1. Make sure that data_folder_in and data_folder_out point to the same location.
2. Make sure that the folder pointed to by data_folder_in and data_folder_out exists.

Then, as in any Django based project, run migrations and create a superuser:

    # while in the papermerge-proj folder
    ./manage.py migrate
    ./manage.py createsuperuser

Run startetc command:

./manage.py startetc

Just out of curiosity, have a look at the /run folder generated by the startetc command. The folder should have the following structure:

    run
        broker
            data_in
            data_out
            data_processed
        etc
            gunicorn.conf.py
            nginx.conf
            papermerge.env
            pmworker.env
            pmworker.py
            systemd
                papermerge.service
                papermerge.target
                pm_nginx.service
                pmworker.service
                txt2db.service
                txt2db.timer
                update_fts.service
                update_fts.timer
        log
        tmp

Systemd can be used to manage user services. For that, the --user flag is used. User services must be referenced in the ~/.config/systemd/user folder. By the way, I made a video about the systemd --user feature.

Create ~/.config/systemd/user if you don't have it. Then reference (create symbolic links to) the /run/etc/systemd/ units in the ~/.config/systemd/user folder:

    cd ~/.config/systemd/user
    ln -s /run/etc/systemd/* .

Important: The path /run/etc/systemd/* must be absolute.

Start papermerge:

    systemctl --user start papermerge.target

2.4 Docker

With this method you will need git, docker and docker-compose installed.

1. Install Docker
2. Install docker-compose
3. Clone the Papermerge repository:

    git clone https://github.com/ciur/papermerge papermerge-proj

4. Run the docker-compose command:

    cd papermerge-proj/docker
    docker-compose up --build -d

This will create and start the necessary containers. Check that the services are up and running:

    docker-compose ps

The Papermerge web service is available at http://localhost:8000. For the initial sign in use:

    URL: http://localhost:8000
    username: admin
    password: admin

You can check the logs of each service with:

    docker-compose logs worker
    docker-compose logs app
    docker-compose logs db

2.5 Ansible (Semiautomated)

Coming soon...

2.6 Jenkins + Ansible (Fully Automated Deployment)

To be added...

CHAPTER 3

Languages Support

Theoretically, all languages supported by Tesseract (over 130) can be used. But for my own needs only two were required:
• German
• English

Thus, only support for these two languages is provided. Both localization (of the user interface) and OCRing of documents in German and English are basically hardcoded into the project.

CHAPTER 4

REST API

The REST API is a way to interact with Papermerge far beyond the web browser realm. It gives you the power to extend Papermerge in many interesting ways.

For example, it allows you to write a simple bash script to automate uploading of files from a specific location on your local (or remote) computer. Another practical scenario where the REST API can be used is to automatically import attached documents from a given email account (you need some sort of 3rd party script for that).

4.1 How It Works?

Instead of the usual sign in with username and password via web browser, you sign in with a token (a fancy name for a sequence of numbers and letters) from practically any software which supports the HTTP protocol. Thus, working with the REST API is a two step process:
1. get a token
2. use the token from a 3rd party REST API client

4.1.1 Get a Token

1. Click User Menu (top right corner) -> API Tokens
2. Click New Token
3. You will need to decide on the number of hours the token will be valid. The default is 4464 hours (186 days), which is roughly equivalent to 6 months. Click the Save button.
4. After you click the Save button, two information messages will be displayed. Write down your token from the "Remember the token: ..." info window.

Important: Write down your token. For security reasons, it will be displayed only once. In the picture below, it is the one marked in red.

Fig. 1: “API Tokens” in User Menu (step 1)

Fig. 2: “New token” button (step 2)

Important: Tokens are saved in the database encrypted. A token's encrypted version is called a digest. In the tokens table (by the way, you can have as many tokens as you like) the first column displays the first 16 characters of the digest. It is a way to identify the token. In the picture below, the token's digest is marked in green.

4.1.2 Use the Token

Once you have your REST API token, you can use Papermerge with any HTTP client. Just remember to include the REST API token as a header, using the following format:

    Authorization: Token

Let's see some examples with curl. The simplest REST API call is:

    curl -H "Authorization: Token 7502db85f8d40bc7f4f5ab0a4e4fee4a" \
        /api/documents

If you get a 2XX response, your Authorization header and token are correct.

Upload a local file to the remote host:

    curl -H "Authorization: Token 7502db85f8d40bc7f4f5ab0a4e4fee4a" \
        -T /home/eugen/documents/demo/2019/berlin1.pdf \
        /api/document/upload/berlin_x1.pdf

Notice that the local file name is berlin1.pdf while it features in the URL as berlin_x1.pdf. This way I can rename the file on upload.

You can also upload files without specifying their remote name. In that case the remote file will have the same name as the local file:

    curl -H "Authorization: Token 7502db85f8d40bc7f4f5ab0a4e4fee4a" \
        -T /home/eugen/documents/demo/2019/berlin1.pdf \
        /api/document/upload/

Note: Notice the trailing / character. When uploading a file with curl without specifying a file name, the URL must end with /. This is a way to tell curl that we don't want to rename the file.

Your (REST API) uploaded files will end up in Inbox.
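The same upload can be done from Python's standard library. The sketch below mirrors the curl examples: a PUT request to /api/document/upload/<filename> with the Authorization header. The function name and the base_url parameter (for example http://localhost:8000) are assumptions made for this illustration.

```python
import urllib.parse
import urllib.request
from pathlib import Path


def build_upload_request(base_url, token, local_path, remote_name=None):
    """Build a PUT request that uploads local_path to the REST API.

    If remote_name is omitted, the local file name is reused, which is
    what curl does when the URL ends with a trailing slash.
    """
    local_file = Path(local_path)
    name = remote_name or local_file.name
    url = (
        f"{base_url.rstrip('/')}/api/document/upload/"
        f"{urllib.parse.quote(name)}"  # escape spaces and special characters
    )
    return urllib.request.Request(
        url,
        data=local_file.read_bytes(),
        method="PUT",
        headers={"Authorization": f"Token {token}"},
    )
```

To actually send the file, pass the request to urllib.request.urlopen(); an HTTPError is raised for non-2XX responses.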

4.2 REST API Reference

REST API authorization header:
• name: Authorization
• value format: Token

Example:

    curl ... -H "Authorization: Token "

Fig. 3: In red color is your (example) token (step 4)

Fig. 4: Files uploaded with REST API end up in Inbox.

REST API URLs:

    URL                              HTTP Method  Description
    /api/documents                   GET          JSON list of all documents
    /api/document/                   GET          JSON info about document with id=
    /api/document/upload/            PUT          Uploads unnamed file (random name will be assigned)
    /api/document/upload/<filename>  PUT          Uploads named file
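As a usage sketch for the first row of the table, the snippet below requests /api/documents and parses the JSON answer. The function names and the base_url parameter are assumptions for this example, and the structure of the returned JSON is not described in this document, so the result is returned as parsed without further interpretation.

```python
import json
import urllib.request


def build_list_request(base_url, token):
    """Build a GET request for the list of all documents."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/documents",
        headers={"Authorization": f"Token {token}"},
    )


def list_documents(base_url, token):
    """Send the request and return the decoded JSON body."""
    with urllib.request.urlopen(build_list_request(base_url, token)) as response:
        return json.load(response)
```

A call such as list_documents("http://localhost:8000", "<your token>") would then return whatever JSON the server sends back.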

CHAPTER 5

Settings

These are the configuration settings for the Papermerge Web App. Configuration settings are used in the same manner as in any Django based project.

Settings which are common to all environments (production, development, staging) are defined in the papermerge.config.settings.base module. If you want to reuse papermerge.config.settings.base, create a Python file, for example staging.py, and import all settings from the base module:

    from .base import *

    DEBUG = False
    STATIC_ROOT = '/www/static/'

The example above assumes that staging.py was created in the same folder as base.py. Don't forget to point the DJANGO_SETTINGS_MODULE environment variable to your settings module.

5.1 STORAGE_ROOT

• local:/
• s3:/

Defines either a local or a remote location where documents are stored. In case of local, its meaning is the same as for Django's MEDIA_ROOT. In case of s3 storage, it indicates the path to the S3 bucket. Examples:

    STORAGE_ROOT = 'local:/home/vagrant/papermerge-proj/run/media'  # good for development env
    STORAGE_ROOT = 's3:/yourbucketname/alldocuments'  # suitable for production

Note: If you choose not to use S3 storage, STORAGE_ROOT needs to be set to a local:/... path and the S3 option must be set to False. The other way around, if you want to use S3 storage, both STORAGE_ROOT and S3 need to be set accordingly (S3=True, STORAGE_ROOT='s3:/bucketname').

5.2 S3

• True|False

Tells Papermerge whether you want to use S3 storage. S3=True is more suitable for production environments.

Note: In case S3=True you need to point STORAGE_ROOT to an s3 location.
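The rule that S3 and STORAGE_ROOT must agree can be expressed as a small check. The function below is a hypothetical helper written for this illustration, not something Papermerge ships; it only encodes the two notes above.

```python
def check_storage_settings(storage_root: str, s3: bool) -> None:
    """Raise ValueError if STORAGE_ROOT and S3 contradict each other."""
    if s3 and not storage_root.startswith("s3:/"):
        raise ValueError("S3=True requires STORAGE_ROOT to start with 's3:/'")
    if not s3 and not storage_root.startswith("local:/"):
        raise ValueError("S3=False requires STORAGE_ROOT to start with 'local:/'")
```

Calling check_storage_settings(STORAGE_ROOT, S3) at the end of a settings module would make a mismatch fail at startup instead of at the first upload.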

5.3 OCR

• True|False

Enables or disables OCR features. With OCR=False no workers need to be configured.

5.4 DATABASES

This is a Django specific configuration setting. Papermerge uses PostgreSQL as its database, which means that the ENGINE option must be set to django.db.backends.postgresql. Example:

    DATABASES = {
        'default': {
            'NAME': 'db_name',
            'ENGINE': 'django.db.backends.postgresql',
            'USER': 'db_user',
            'PASSWORD': 'db_password'
        },
    }

5.5 STATICFILES_DIRS

Include the absolute path where the papermerge-js static files are. Example:

    STATICFILES_DIRS = [
        '/home/vagrant/papermerge-js/static'
    ]

CHAPTER 6

Design

A brief description of the architecture of Papermerge and why such design decisions were taken.

The Papermerge project has 2 parts:
• Web Application
• Workers

The Web Application is further divided into Frontend and Backend. As a result there are 3 separate repositories that are part of one whole.

6.1 1. Frontend

Papermerge-js Repository

Warning: The name papermerge-js is misleading, because it implies that only javascript is used, which is not true. This project manages all static assets: javascript, css, images, fonts.

Modern web applications tend to use a lot of javascript and css. Javascript code, as opposed to code written in Python, becomes increasingly difficult to manage as it grows. The same goes for css. To deal with codebase complexity, I decided to split the frontend off as a completely separate project. This project is a Webpack project. In practice this makes it a little bit easier to deal with growing javascript code complexity.

The outcome of this project, among others, are two important files:

    /static/js/papermerge.js
    /static/css/papermerge.css

There are other static files as well, like images and fonts. However, images and fonts are just placed in /static and nothing really interesting happens with them.

Fig. 1: High level design. Backend and frontend are separate.

6.2 2. Backend

Papermerge-proj Repository

The backend is a standard Django application. It uses the static files from the frontend part. Throughout the documentation it is referred to as backend, because the term webapp is more general (webapp = backend + frontend).

6.3 3. Workers

Papermerge-worker Repository

Workers perform OCR on the documents. Documents are passed by reference (see note below) from the backend to the workers via a shared location. In the simplest setup, when everything runs on the same machine, the shared location is just a folder on the local machine accessible by both worker and backend. In production, the shared location is an S3 bucket.

Note: There are at least two distinct methods of passing documents from the backend to the workers. The first method is very simple, but wrong: the backend just transfers the entire document byte by byte to the worker. Without diving deep into technical details, this method is not scalable because it depletes the backend's memory very quickly. Instead, the backend instructs workers which documents they need to OCR by telling them the document id (it passes the user id and the language name as well).
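To illustrate what passing by reference means in practice, the sketch below models the message as three small fields (user id, document id, language name) and serializes it to JSON. The class and field names are invented for this example and do not describe the actual task signature used by the workers; the point is only that the message size does not depend on the size of the document.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class OcrTask:
    """Reference to a document that a worker should OCR (hypothetical shape)."""
    user_id: int
    document_id: int
    lang: str


def encode_task(task: OcrTask) -> str:
    """Serialize the reference; the document bytes are never included."""
    return json.dumps(asdict(task))


def decode_task(payload: str) -> OcrTask:
    """Rebuild the reference on the worker side."""
    return OcrTask(**json.loads(payload))
```

A worker receiving such a message would then read the document itself from the shared location (local folder or S3 bucket) using the ids.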


Fig. 2: Backend passes documents to workers by reference.

CHAPTER 7

Changelog

CHAPTER 8

1.0.0

• Initial release

CHAPTER 9

Indices and tables

• genindex
• modindex
• search