Papermerge Documentation
Papermerge
Feb 29, 2020

Contents

1 Requirements
    1.1 Python
    1.2 Imagemagick
    1.3 Poppler
    1.4 Tesseract
    1.5 Database
2 Installation
    2.1 OS Specific Packages
        2.1.1 1. Web App + Workers Machine
            2.1.1.1 Ubuntu Bionic 18.04 (LTS)
        2.1.2 2. Web App Machine
            2.1.2.1 Ubuntu Bionic 18.04 (LTS)
        2.1.3 3. Worker Machine
    2.2 Manual Way
        2.2.1 Package Dependencies
        2.2.2 Web App
        2.2.3 Worker
        2.2.4 Recurring Commands
    2.3 Systemd
        2.3.1 Package Dependencies
        2.3.2 Web App
    2.4 Docker
    2.5 Ansible (Semiautomated)
    2.6 Jenkins + Ansible (Fully Automated Deployment)
3 Languages Support
4 REST API
    4.1 How It Works?
        4.1.1 Get a Token
        4.1.2 Use the Token
    4.2 REST API Reference
5 Settings
    5.1 STORAGE_ROOT
    5.2 S3
    5.3 OCR
    5.4 DATABASES
    5.5 STATICFILES_DIRS
6 Design
    6.1 1. Frontend
    6.2 2. Backend
    6.3 3. Workers
7 Changelog
8 1.0.0
9 Indices and tables

Papermerge

I have nothing against paper. Paper is a brilliant invention of humanity. But in the 21st century I find it more appropriate for paper-based documents to be digitized (scanned). Once scanned, appropriate software can be used to find any document in a fraction of a second, just by typing a few keywords.

Papermerge is a document management system designed to work with scanned documents. As well as OCR with full text search, it provides the look and feel of major modern file browsers, with a hierarchical structure for files and folders, so that you can organize your documents in a similar way to Dropbox (via web) or Google Drive.

CHAPTER 1 Requirements

Papermerge depends on the following software:

• Python >= 3.8.0
• Tesseract - because of OCR
• Imagemagick - image operations
• Poppler - PDF operations
• PostgreSQL >= 11.0 - because of Full Text Search

1.1 Python

Papermerge is a Python 3 application.

1.2 Imagemagick

Papermerge uses Imagemagick to convert between image formats.

1.3 Poppler

More precisely, the poppler utils are used. For example, the pdfinfo command line utility is used to find out the number of pages in a PDF document.

1.4 Tesseract

If you have never heard of Tesseract: it is Google's open source Optical Character Recognition software. It extracts text from images. It works fantastically well for a wide range of languages.

1.5 Database

One of Papermerge's core philosophies is “Find Any Document”.
PostgreSQL comes with Full Text Search (FTS) support out of the box. Papermerge uses the websearch_to_tsquery PostgreSQL function, which was introduced in PostgreSQL version 11.0. With FTS you can search documents the same way people search web pages in a search engine (Google, Bing, Yandex, DuckDuckGo): you just type some words, and the search results show only the documents containing those words, sorted by relevance.

CHAPTER 2 Installation

There are different methods to install Papermerge. They differ in the amount of effort required and in purpose.

2.1 OS Specific Packages

This section explains how to install operating system specific packages. There are three cases:

1. Both web app and workers are on the same machine
2. Web app machine
3. Worker machine

2.1.1 1. Web App + Workers Machine

2.1.1.1 Ubuntu Bionic 18.04 (LTS)

Install the required Ubuntu packages:

    sudo apt-get update
    sudo apt-get install python3 python3-pip python3-venv \
        poppler-utils \
        imagemagick \
        build-essential \
        tesseract-ocr \
        tesseract-ocr-deu \
        tesseract-ocr-eng

Note that only the English and German (Deutsch) language packages for tesseract are needed.

Ubuntu Bionic 18.04 comes with the postgres 10 package. Papermerge, on the other hand, requires at least version 11 of Postgres.

Install Postgres version 11:

    # add the repository
    sudo tee /etc/apt/sources.list.d/pgdg.list <<END
    deb http://apt.postgresql.org/pub/repos/apt/ bionic-pgdg main
    END

    # get the signing key and import it
    wget https://www.postgresql.org/media/keys/ACCC4CF8.asc
    sudo apt-key add ACCC4CF8.asc

    # fetch the metadata from the new repo
    sudo apt-get update

2.1.2 2. Web App Machine

Tesseract does not need to run on a Web App only machine.

2.1.2.1 Ubuntu Bionic 18.04 (LTS)

Install the required Ubuntu packages:

    sudo apt-get update
    sudo apt-get install python3 python3-pip python3-venv \
        poppler-utils \
        imagemagick \
        build-essential

2.1.3 3. Worker Machine

The worker performs the heavy task of extracting text from images, so it must have the tesseract packages installed.

2.2 Manual Way

Papermerge has two parts:

• Web application
• Worker - which is used for OCR operation

With this installation method both parts will run on the same computer. This installation method is suitable for developers. In this method no configuration is automated, so it is a perfect method if you want to understand the mechanics of the project.

If you follow along in this document and still have trouble, please open an issue on GitHub so I can fill in the gaps.

2.2.1 Package Dependencies

In this setup, Web App and Workers run on the same machine. Install the OS specific packages for Web App + Workers (see section 2.1.1).

Check that Postgres version 11 is up and running:

    sudo systemctl status postgresql@11-main.service

Create a new role for the postgres database:

    sudo -u postgres createuser --interactive

When asked "Shall the new role be allowed to create databases?" please answer yes (when running tests, Django creates a temporary database).

Create a new database owned by the previously created user:

    sudo -u postgres createdb -O <user-created-in-prev-step> <dbname>

Set a password for the user:

    sudo -u postgres psql
    ALTER USER <username> WITH PASSWORD '<password>';

2.2.2 Web App

Once we have prepared the database, tesseract and other dependencies, let's start with Papermerge itself.

Clone the main papermerge project:

    git clone https://github.com/ciur/papermerge papermerge-proj

Clone the papermerge-js project (this is the frontend part):

    git clone https://github.com/ciur/papermerge-js

Create the Python virtual environment .venv:

    cd papermerge-proj
    python3 -m venv .venv

Activate the Python virtual environment:

    source .venv/bin/activate

Install the required Python packages (now you are in the papermerge-proj directory):

    # while in <papermerge-proj> folder
    pip install -r requirements.txt

Rename file config/settings/development.example.py to config/settings/development.py.
This file is the default for DJANGO_SETTINGS_MODULE and it is included in .gitignore.

Adjust the following settings in config/settings/development.py:

• DATABASES - name, username and password of the database you created in PostgreSQL
• STATICFILES_DIRS - include the path to <absolute_path_to_papermerge_js_clone>/static
• MEDIA_ROOT - absolute path to the media folder
• STORAGE_ROOT - absolute path to the same media root, but with a “local:/” prefix

Note:
1. Make sure that data_folder_in and data_folder_out point to the same location.
2. Make sure that the folder pointed to by data_folder_in and data_folder_out exists.

Then, as in any Django based project, run migrations, create a super user and run the built-in web server:

    cd <papermerge-proj>
    ./manage.py migrate
    ./manage.py createsuperuser
    ./manage.py runserver

At this point, you should be able to access the (styled) login page and log in with the administrative user you created with the ./manage.py createsuperuser command. The login screen should look like the screenshot below.

[Screenshot: login screen]

You can also upload some documents and see their previews. But because there is no worker configured yet, documents are basically plain images. Let's configure the worker!

2.2.3 Worker

Let's add a worker on the same machine as the Web Application we configured above. We will use the same Python virtual environment as for the Web Application.

Note: Workers, not the Web App, are the ones that depend on (and use) tesseract.

Clone the repo and install the required packages (in the same Python virtual environment as the Web App):

    git clone https://github.com/ciur/papermerge-worker
    cd papermerge-worker
    pip install -r requirements.txt

Create a file <papermerge-worker>/config.py