OpenZIM/Kiwix ETL toolchain for Dumping… and a bit more

Wikimedia Technical Talk, August 24th 2020

Emmanuel Engelhart

OpenZIM - home of the ZIM file format

https://openzim.org

Open specification for file format ZIM

Standard implementation libzim & bindings

WP1.0 solutions

MWoffliner, the MediaWiki scraper

Many other dedicated scrapers

https://code.openzim.org

~30 repositories on GitHub

libzim & zim-tools (cmd line tools) in C++

Python & Node.js bindings

MWoffliner in Node.js

Most other scrapers in Python

Kiwix - Best ZIM offline reader

https://kiwix.org

ZIM file reader for Windows, GNU/Linux, Android, macOS, iOS, browsers

Fulltext search, bookmarks, history, etc…

Enjoy offline versions of Wikipedia, Project Gutenberg, YouTube channels, Open edX, PhET, TED, StackExchange, …

Available in most app stores

https://code.kiwix.org

~ 30 repositories on GitHub

Core Kiwix library libkiwix & Kiwix-tools (cmd line tools) in C++

Lots of different languages: C++, Kotlin, Java, Swift, JavaScript, …

Server side solutions in Python

Publishing

Thousands of ZIM files available

Different flavours for each content

Selections (only a few Wikipedia articles)

Most of them refreshed each month

Mirrored on many continents and available via BitTorrent

OPDS catalog/feed

Challenges

Lots of content

Lots of updates

Lots of (HW) resources needed

Custom apps with embedded content

Selections of content

Upstream sources evolving

Q & A

Image by Jahor / CC0 1.0 universal

Show case - WikiMed Android App

WikiMed App Publishing

Publishing on Play Store

Git tag triggered

Custom app repository https://github.com/kiwix/kiwix-android-custom/tree/master/wikimed

Custom configuration

Logo, ZIM, title, …

ZIM Catalogue

In library. at https://download.kiwix.org

In OPDS feed https://library.kiwix.org/catalog/root.xml

Demo case at https://library.kiwix.org

On https://wiki.kiwix.org (deprecated)

Surf directly on https://download.kiwix.org/zim
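As an illustration of how a client might read an OPDS feed such as the one listed above, here is a minimal sketch using only the Python standard library. OPDS feeds are Atom documents, so the Atom namespace and the `http://opds-spec.org/acquisition` link relation come from those specifications. The sample feed, its titles and its URLs are hypothetical, and the exact fields exposed by the Kiwix catalog are not confirmed by these slides.

```python
import xml.etree.ElementTree as ET

# Atom namespace used by OPDS feeds.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical sample shaped like an OPDS acquisition feed (not real catalog data).
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Sample catalog</title>
  <entry>
    <title>WikiMed (sample)</title>
    <link rel="http://opds-spec.org/acquisition/open-access"
          type="application/x-zim"
          href="https://example.org/sample_2020-08.zim"/>
  </entry>
  <entry>
    <title>Entry without download link (sample)</title>
  </entry>
</feed>
"""


def list_entries(feed_xml):
    """Return (title, href) pairs for entries that carry an acquisition link."""
    root = ET.fromstring(feed_xml)
    results = []
    for entry in root.findall("atom:entry", ATOM_NS):
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        for link in entry.findall("atom:link", ATOM_NS):
            # Keep the first link whose relation is an OPDS acquisition relation.
            if link.get("rel", "").startswith("http://opds-spec.org/acquisition"):
                results.append((title, link.get("href")))
                break
    return results


for title, href in list_entries(SAMPLE_FEED):
    print(f"{title}\t{href}")
```

To try it against the live feed, the XML could be fetched with `urllib.request.urlopen(...).read()` and passed to `list_entries`; whether the live entries use the same link relation is an assumption to verify.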

Docker everywhere

Hosting ZIM files

https://download.kiwix.org/zim

HTTP + RSYNC + FTP + BitTorrent

Master + n mirrors

Based on MirrorBrain software

Retention of two versions of same ZIM
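The retention rule above can be sketched as a small pruning function. This is only an illustration: it assumes file names of the form `<name>_<YYYY-MM>.zim`, and the function name `files_to_delete` is hypothetical. It is not the script actually used on the download server.

```python
import re
from collections import defaultdict

# Assumed naming scheme: <name>_<YYYY-MM>.zim
ZIM_NAME = re.compile(r"^(?P<name>.+)_(?P<period>\d{4}-\d{2})\.zim$")


def files_to_delete(filenames, keep=2):
    """Return the files that fall outside the `keep` newest versions per ZIM name."""
    by_name = defaultdict(list)
    for filename in filenames:
        match = ZIM_NAME.match(filename)
        if match:  # files that do not follow the assumed scheme are left alone
            by_name[match["name"]].append((match["period"], filename))

    doomed = []
    for versions in by_name.values():
        versions.sort(reverse=True)  # newest first; YYYY-MM sorts lexically
        doomed.extend(filename for _, filename in versions[keep:])
    return sorted(doomed)


FILES = [
    "wikipedia_en_all_2020-06.zim",
    "wikipedia_en_all_2020-07.zim",
    "wikipedia_en_all_2020-08.zim",
    "wikimed_2020-08.zim",
]
print(files_to_delete(FILES))
```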

Docker everywhere

Zimfarm at https://farm.openzim.org

Orchestrating ZIM file creation

Web platform Python+Django

Build all ZIMs automatically

For partners as well

Recipe based

Open API at https://api.farm.openzim.org

Q&A automatic process

Decentralised workers/scrapers

Docker everywhere

WikiMed Zimfarm recipe

MWoffliner

MediaWiki offline

Command line tool (really easy to use)

Creates a local ZIM file from any online MediaWiki

With Node.js on GNU/Linux & macOS

Relies on Redis

Published on npmjs & Docker Hub

Wikipedia 1.0
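To give an idea of what a MWoffliner invocation looks like, the sketch below assembles a command line from Python. The `--mwUrl`, `--adminEmail`, `--outputDirectory` and `--articleList` options are quoted from memory of the MWoffliner documentation and should be checked with `mwoffliner --help`. The helper name, the e-mail address and the paths are hypothetical.

```python
import shlex


def build_mwoffliner_command(mw_url, admin_email, output_dir=None, article_list=None):
    """Assemble an argument list for the mwoffliner CLI (nothing is executed here)."""
    command = ["mwoffliner", f"--mwUrl={mw_url}", f"--adminEmail={admin_email}"]
    if output_dir:
        command.append(f"--outputDirectory={output_dir}")
    if article_list:
        # Expected to be a file with one article title per line.
        command.append(f"--articleList={article_list}")
    return command


command = build_mwoffliner_command(
    "https://en.wikipedia.org/",
    "admin@example.org",       # hypothetical contact address
    output_dir="/tmp/zim",     # hypothetical output location
)
print(shlex.join(command))
```

With Node.js, MWoffliner and a reachable Redis instance installed, the list could be handed to `subprocess.run(command, check=True)` to start the actual scrape.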

15-year-old project

In Wikipedia in English, French, Arabic, Polish, …

Based on WikiProject manual assessments of quality + importance

Making selections of articles to provide a coherent snapshot of Wikipedia

Working together

Technical support

Wikipedia 1.0 engine https://wp1.openzim.org

Python

OpenAPI

3 parts:

Gathering assessments

Wikipedia bot

Web interface

Fully automatic

Database based

Docker everywhere

Wikipedia selection tools

Taking WP1.0 assessments for WPEN

Gather popularity via usage stats

Gather other stats on each article

Works for all

Perl + Bash

Mixes all KPIs to make selections (list of article titles)

Publishes all TSV files (monthly) on https://download.openzim.org/wp1

Wrap up
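The idea of mixing several KPIs into one ranking can be sketched as follows. The real selection tools are written in Perl and Bash and their formula is not given in these slides, so the weights, the log scaling of page views, the field names and the sample articles below are all hypothetical.

```python
import math

# Hypothetical weights; not the formula used by the actual selection tools.
WEIGHTS = {"quality": 1.0, "importance": 1.0, "popularity": 0.5}


def article_score(quality, importance, monthly_views):
    """Combine assessment scores and usage stats into a single number."""
    popularity = math.log10(1 + monthly_views)  # dampen very popular articles
    return (
        WEIGHTS["quality"] * quality
        + WEIGHTS["importance"] * importance
        + WEIGHTS["popularity"] * popularity
    )


def select_titles(articles, limit):
    """Return the titles of the `limit` best-scoring articles, best first."""
    ranked = sorted(
        articles,
        key=lambda a: article_score(a["quality"], a["importance"], a["monthly_views"]),
        reverse=True,
    )
    return [a["title"] for a in ranked[:limit]]


# Made-up sample data for illustration only.
ARTICLES = [
    {"title": "Aspirin", "quality": 5, "importance": 4, "monthly_views": 99999},
    {"title": "Obscure_stub", "quality": 1, "importance": 1, "monthly_views": 9},
    {"title": "Malaria", "quality": 4, "importance": 5, "monthly_views": 999},
]

# One title per line, as a selection list might be published.
print("\n".join(select_titles(ARTICLES, 2)))
```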

Wikipedia

Wikipedia WP1 engine (DB)

Wikipedia selection tools (TSV)

Zimfarm (ZIM)

GitHub Actions (APK)

Google Play ?

https://kiwix.org - https://openzim.org - @kiwixoffline