OpenZIM/Kiwix ETL toolchain for Wikipedia Dumping… and a bit more
Wikimedia Technical Talk, August 24th 2020
Emmanuel Engelhart

OpenZIM - home of the ZIM file format
https://openzim.org
Open specification of the ZIM file format
Standard implementation libzim & bindings
WP1.0 software solutions
MWoffliner, the MediaWiki scraper
Many other dedicated scrapers

https://code.openzim.org
~30 repositories on GitHub
libzim & zim-tools (cmd line tools) in C++
Python & Node.js bindings
MWoffliner in Node.js
Most other scrapers in Python

Kiwix - Best ZIM offline reader
https://kiwix.org
ZIM file reader for Windows, GNU/Linux, Android, macOS, iOS, browsers
Fulltext search, bookmarks, history, etc…
Enjoy offline versions of Wikipedia, Project Gutenberg, YouTube channels, Open edX, PhET, TED, Stack Exchange, …
Available in most app stores

https://code.kiwix.org
~30 repositories on GitHub
Core Kiwix library libkiwix & Kiwix-tools (cmd line tools) in C++
Lots of different languages: C++, Kotlin, Java, Swift, JavaScript, …
Server-side solutions in Python

Publishing
Thousands of ZIM files available
Different flavours for each content
Selections (only a few Wikipedia articles)
Most of them refreshed each month
Mirrored on many continents and available via BitTorrent
OPDS catalog/feed

Challenges
Lots of content
Lots of updates
Lots of (HW) resources needed
Custom apps with embedded content
Selections of content
Upstream sources evolving

Q & A
Image by Jahor / CC0 1.0 Universal

Showcase - WikiMed Android App

WikiMed App Publishing
Publishing on Play Store
Git tag triggered
Custom app repository https://github.com/kiwix/kiwix-android-custom/tree/master/wikimed
Custom configuration
Logo, ZIM, title, …

ZIM Catalogue
In library.xml at https://download.kiwix.org
In OPDS feed https://library.kiwix.org/catalog/root.xml
Demo case at https://library.kiwix.org
On https://wiki.kiwix.org (deprecated)
Surf directly on https://download.kiwix.org/zim
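
An OPDS feed is an Atom document, so a catalogue like the one above can be read with a standard XML parser. The following is a minimal sketch only: the sample feed is hand-written for illustration (it is not a copy of the real Kiwix catalogue), and the function name list_zim_entries is hypothetical. It assumes that download links use the OPDS acquisition link relations.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
ACQUISITION_REL = "http://opds-spec.org/acquisition"

# Hand-written sample feed (assumption: real catalogue entries carry more fields).
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Sample catalog</title>
  <entry>
    <title>Wikipedia (sample)</title>
    <link rel="http://opds-spec.org/acquisition/open-access"
          type="application/x-zim"
          href="https://example.org/zim/wikipedia_sample.zim"/>
  </entry>
  <entry>
    <title>Entry without a download link</title>
  </entry>
</feed>
"""


def list_zim_entries(feed_xml):
    """Return (title, download URL) pairs for entries that have an acquisition link."""
    root = ET.fromstring(feed_xml)
    results = []
    for entry in root.iter(ATOM + "entry"):
        title = entry.findtext(ATOM + "title", default="")
        for link in entry.findall(ATOM + "link"):
            # OPDS marks downloadable resources with an "acquisition" relation.
            if link.get("rel", "").startswith(ACQUISITION_REL):
                results.append((title, link.get("href")))
                break
    return results


for title, href in list_zim_entries(SAMPLE_FEED):
    print(title, href)
```

To try this against a live catalogue, the XML text would first have to be downloaded (for example with urllib.request) and passed to the same function.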
Docker everywhere

Hosting ZIM files
https://download.kiwix.org/zim
HTTP + RSYNC + FTP + BitTorrent
Master + n mirrors
Based on MirrorBrain software
Retention of two versions of the same ZIM
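
The retention rule can be sketched as a small clean-up function. This is an illustration under stated assumptions, not the actual script used on the download server: it assumes file names of the form <name>_<YYYY-MM>.zim, and the function name files_to_delete is hypothetical.

```python
import re
from collections import defaultdict

# Assumed naming scheme: <name>_<YYYY-MM>.zim
ZIM_NAME = re.compile(r"^(?P<name>.+)_(?P<period>\d{4}-\d{2})\.zim$")


def files_to_delete(filenames, keep=2):
    """Return the files to remove so that only the `keep` newest versions of each ZIM remain."""
    by_name = defaultdict(list)
    for filename in filenames:
        match = ZIM_NAME.match(filename)
        if match is None:
            continue  # leave unrecognised files alone
        by_name[match["name"]].append((match["period"], filename))

    doomed = []
    for versions in by_name.values():
        # YYYY-MM strings sort chronologically, so reverse order is newest first.
        versions.sort(reverse=True)
        doomed.extend(filename for _, filename in versions[keep:])
    return sorted(doomed)


files = [
    "wikipedia_en_all_2020-06.zim",
    "wikipedia_en_all_2020-07.zim",
    "wikipedia_en_all_2020-08.zim",
    "wiktionary_fr_all_2020-08.zim",
    "README",
]
print(files_to_delete(files))
```

With the sample list, only the oldest of the three wikipedia_en_all files is selected for removal; the single wiktionary file and the unrecognised README are left untouched.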
Docker everywhere

Zimfarm at https://farm.openzim.org
Orchestrating ZIM file creation
Web platform Python+Django
Build all ZIMs automatically
For partners as well
Recipe based
Open API at https://api.farm.openzim.org
Automatic QA process
Decentralised workers/scrapers
Docker everywhere

WikiMed Zimfarm recipe

MWoffliner
MediaWiki offline
Command line tool (really easy to use)
Creates a local ZIM file from any online MediaWiki
Runs with Node.js on GNU/Linux & macOS
Relies on Redis
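
As a rough sketch of driving the command line tool from a script, the helper below only assembles the argument list and does not run anything. It assumes the --mwUrl and --adminEmail options described in the MWoffliner README; the helper name build_mwoffliner_command and the example values are hypothetical, and the available options should be checked with mwoffliner --help.

```python
import shlex


def build_mwoffliner_command(mw_url, admin_email, extra_options=None):
    """Assemble an mwoffliner argument list (assumed options: --mwUrl, --adminEmail)."""
    command = ["mwoffliner", f"--mwUrl={mw_url}", f"--adminEmail={admin_email}"]
    # Any further options are passed through unchanged as --name=value.
    for name, value in (extra_options or {}).items():
        command.append(f"--{name}={value}")
    return command


command = build_mwoffliner_command("https://en.wikipedia.org/", "admin@example.org")
print(shlex.join(command))
```

The resulting list can be handed to subprocess.run(command, check=True) on a machine where MWoffliner and a Redis server are available.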
Published on npmjs & Docker Hub

Wikipedia 1.0
15-year-old project
On the English, French, Arabic, Polish, … Wikipedias
Based on manual WikiProject assessments of quality + importance
Making selections of articles to provide a coherent snapshot of Wikipedia
Working together
Technical support

Wikipedia 1.0 engine https://wp1.openzim.org
Python
OpenAPI
3 parts:
Gathering assessments
Wikipedia bot
Web interface
Fully automatic
Database based
Docker everywhere

Wikipedia selection tools
Takes WP1.0 assessments for WPEN (English Wikipedia)
Gathers popularity via usage stats
Gathers other stats on each article
Works for all Wikipedias
Perl + Bash
Mixes all KPIs to make selections (lists of article titles)
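
The idea of mixing KPIs into a selection can be illustrated as follows. The real tools are written in Perl and Bash; this Python sketch is not a port of them. The point values, the page-view weighting and all function names are assumptions chosen for the example; only the quality and importance class names come from the WP1.0 assessment scheme.

```python
import csv
import io
import math

# Hypothetical point values per assessment class (illustration only).
QUALITY_POINTS = {"FA": 500, "GA": 400, "B": 300, "C": 225, "Start": 150, "Stub": 50}
IMPORTANCE_POINTS = {"Top": 400, "High": 300, "Mid": 200, "Low": 100}


def score(article):
    """Combine quality, importance and popularity into a single number."""
    return (
        QUALITY_POINTS.get(article["quality"], 0)
        + IMPORTANCE_POINTS.get(article["importance"], 0)
        + 100 * math.log10(1 + article["pageviews"])  # dampen very popular pages
    )


def select_titles(articles, limit):
    """Return the titles of the `limit` best-scoring articles."""
    ranked = sorted(articles, key=score, reverse=True)
    return [article["title"] for article in ranked[:limit]]


def to_tsv(titles):
    """Serialise a selection as one title per line, TSV style."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for title in titles:
        writer.writerow([title])
    return buffer.getvalue()


articles = [
    {"title": "Obscure village", "quality": "Stub", "importance": "Low", "pageviews": 12},
    {"title": "Malaria", "quality": "FA", "importance": "Top", "pageviews": 90000},
    {"title": "Aspirin", "quality": "GA", "importance": "High", "pageviews": 50000},
]
print(to_tsv(select_titles(articles, limit=2)), end="")
```

A logarithm is used for page views here so that popularity influences the ranking without overwhelming the manual assessments; any other weighting could be substituted.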
Publishes (monthly) all TSV files on https://download.openzim.org/wp1

Wrap up
Wikipedia
Wikipedia WP1 engine (DB)
Wikipedia selection tools (TSV)
Zimfarm (ZIM)
GitHub Actions (APK)
Google Play?
https://kiwix.org - https://openzim.org - @kiwixoffline