
what happens when you type en.wikipedia.org

effie mouzeli • alexandros kosiaris

SREcon19 Dublin @kosiaris • @manjiki

About

Image credit: CC BY-SA 4.0, Niccolò Caranti

Did you know...

● … the Wikipedia infrastructure is run by the Wikimedia Foundation, an American nonprofit charitable organisation?
● … and we are ~370 people?
● … and we have no affiliation with Wikileaks?
● … all content is managed by volunteers?
● … we support 304 languages?
● … Wikipedia is 18 years old?
● … Wikipedia hosts some really really weird articles?
● … which can’t be read in Turkey (2017) nor China (2019)?

Wikimedia Projects

Wikimedia Infrastructure

software

✺ 2 Primary Data Centres

✺ 3 Caching Points of Presence

✺ ~17 billion page views per month*

✺ ~300k new editors per month

✺ ~1300 bare metal servers

* it’s complicated

Site Reliability Engineering

The SRE team is a globally distributed team of 26 people responsible for developing and maintaining Wikimedia's production systems. The Foundation has more SREs in other teams as well!

✺ Datacenter Operations
✺ Data Persistence
✺ Infrastructure Foundations
✺ Service Operations
✺ Traffic

Application Layer

Image credit: CC BY-SA 2.0, Arthur Dunn

MediaWiki

MediaWiki is a free server-based software, licensed under the GNU GPL. It is an extremely powerful, scalable software, and a feature-rich wiki implementation that uses PHP to process and display data stored in a database, such as MySQL.

✺ Our core application
✺ PHP, Apache, MySQL. Yes.*
  ✴ PHP 7.2 since Sept 2019
✺ web pages - app servers cluster
✺ API cluster
✺ Jobrunners/Videoscalers cluster

* it’s complicated

Application Layer Caches

From a Monolith (2014) to Microservices (2019)

✺ Elasticity
✺ Hardware fault mitigation
✺ Deployments
✺ Migration is not easy, and still ongoing

Microservices!

✺ Thumbor: used for image scaling
✺ Mathoid: renders LaTeX, and returns JSON with PNG, SVG or MathML renderings of the formula
✺ ORES: scores edits using machine learning (anti-vandalism effort)
✺ Mobile Content Service (MCS): modifies page content on the fly, tailoring it for mobile
✺ And many more

Kubernetes

We have been running it successfully for the last 2 years! Currently, 11 services on it. Got a pipeline in the works. Powers all mathematical formulas on Wikipedia!!!

✺ Bare metal
✺ Calico as a CNI plugin
✺ Helm for deployments
✺ 2 clusters + 1 staging one
✺ Docker as a CRE

Message Queueing

Image credit: CC BY 2.0, bootbearwdc

Apache Kafka: stream processing platform for real-time data feeds. One message queue to rule them all; started as a service for Analytics only. Now, it is our de facto solution.

✺ Yes, we use Apache Kafka
✺ We are sending events like:
  ✳ wikitext templates refresh
  ✳ edge caches purging
  ✳ cross wiki links
  ✳ create new thumbnails
  ✳ re-encoding videos to open source formats
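To make the kinds of events listed for the message queue a bit more concrete, here is a minimal Python sketch of topic-based publish/subscribe. It uses an in-memory stand-in rather than a Kafka client, and the topic names, event fields, and handler are hypothetical, not Wikimedia's actual schemas.

```python
import json
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[dict], None]


class InMemoryBroker:
    """Toy stand-in for a message broker: one append-only log per topic."""

    def __init__(self) -> None:
        self.logs: DefaultDict[str, List[str]] = defaultdict(list)
        self.handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self.handlers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        # Serialize before storing, as an event would be on the wire.
        payload = json.dumps(event, sort_keys=True)
        self.logs[topic].append(payload)
        for handler in self.handlers[topic]:
            handler(json.loads(payload))


purged: List[str] = []
broker = InMemoryBroker()
# Hypothetical topic names and event shapes, for illustration only.
broker.subscribe("cache.purge", lambda event: purged.append(event["url"]))
broker.publish("cache.purge", {"url": "https://en.wikipedia.org/wiki/Dublin"})
broker.publish("thumbnail.create", {"file": "Example.jpg", "width": 320})
```

The point of the pattern is that the producer of an edit does not need to know who purges caches or renders thumbnails; it only appends an event to a topic.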

Databases

Image credit: CC BY 2.0, RageZ

MariaDB*

MariaDB: fork of MySQL, migrated from MySQL in 2013*. Have a go at https://quarry.wmflabs.org

✺ Database clusters are divided into sections
✺ Sections have masters and replicas*
✺ MediaWiki reads from replicas and writes to master
✺ Clusters:
  ✳ Wikitext (compressed)
  ✳
  ✳ Parsercache

✺ Online schema migrations*
✺ Cross DC replication
✺ TLS across all DBs
✺ Snapshots and local dumps for Backups

✺ ~570 TB total data
✺ ~150 DB servers
✺ ~350k queries per second (qps)
✺ ~70 TB of RAM

* it’s complicated
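The "reads from replicas, writes to master" rule can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the host names are hypothetical, replica choice is uniformly random, and replication lag and load weights (part of what makes it "complicated") are ignored.

```python
import random
from typing import List


class Section:
    """One database section: a single master plus read replicas."""

    def __init__(self, master: str, replicas: List[str]) -> None:
        if not replicas:
            raise ValueError("a section needs at least one replica")
        self.master = master
        self.replicas = replicas

    def host_for(self, sql: str, rng: random.Random) -> str:
        # Treat only SELECT as a read; everything else goes to the master.
        is_read = sql.lstrip().upper().startswith("SELECT")
        return rng.choice(self.replicas) if is_read else self.master


# Hypothetical host names, for illustration only.
section = Section(master="db-master", replicas=["db-replica-1", "db-replica-2"])
rng = random.Random(0)
read_host = section.host_for("SELECT page_title FROM page LIMIT 1", rng)
write_host = section.host_for("UPDATE page SET page_touched = NOW()", rng)
```

Here `read_host` is always one of the replicas and `write_host` is always the master, so read traffic scales with the number of replicas while writes stay serialized on one host per section.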

Elasticsearch

You guessed it right, we use it for search. That box on your top right. Run by a team surprisingly called Search Platform!


Storage

Image credit: CC BY-NC 2.0, Gail Thomas

Swift

OpenStack Object Storage: a scalable storage system that stores and retrieves data via HTTP

✺ All our media are stored on Swift
✺ It has frontends … and backends
✺ 1 billion objects
✺ ~390 TB of media!

Traffic

Image credit: Public Domain

Network

✺ We have our own content delivery network
✺ We direct traffic to a location on demand (via GeoDNS)
  ✳ Pooling/Depooling DCs
  ✳ 10 min TTL
✺ LVS as a Layer 3/4 Linux loadbalancer*

gdnsd: GeoDNS is written and maintained by one of us
peering: interconnection with other networks
Linux Virtual Server: an advanced L3/L4 load balancing solution for Linux, supports consistent hashing
pybal: LVS manager, developed in-house

* it’s complicated

LVS-DR
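Since the Network slide mentions consistent hashing, here is a generic hash-ring sketch in Python showing why it matters for load balancing in front of caches: when a backend is removed, only the keys it owned move elsewhere. This is a textbook ring with virtual nodes, not the algorithm of any particular LVS scheduler, and the node and key names are hypothetical.

```python
import bisect
import hashlib
from typing import Dict, List


def _hash(key: str) -> int:
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes: List[str], vnodes: int = 100) -> None:
        self.vnodes = vnodes
        self._points: List[int] = []
        self._owner: Dict[int, str] = {}
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        # Each node gets several points so load spreads more evenly.
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


ring = HashRing(["cache-a", "cache-b", "cache-c"])
keys = [f"client-{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}
ring.remove("cache-c")  # e.g. a backend is depooled
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Every key in `moved` was previously served by `cache-c`; keys owned by the other two nodes keep their assignment, which keeps cache hit rates stable during pooling and depooling.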

CDN

✺ Nginx- for TLS termination
✺ Varnish frontend
  ✳ in memory
✺ Varnish backend
  ✳ local stores
✺ Varnish text
  ✳ HTML, CSS, JS etc
✺ Varnish upload
  ✳ media, media, media

Nginx-: Highly performant HTTP webserver/proxy with excellent TLS support
Varnish: Reverse HTTP caching proxy
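The frontend/backend split (a small in-memory cache in front of a larger local store) can be sketched as a two-tier lookup. This is a minimal Python illustration, not Varnish configuration: both tiers are modelled as LRU caches, and the capacities, URLs, and `origin` function are hypothetical.

```python
from collections import OrderedDict
from typing import Callable, List, Optional


class LRUCache:
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._items: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, key: str) -> Optional[bytes]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: str, value: bytes) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used


def fetch(url: str, frontend: LRUCache, backend: LRUCache,
          origin: Callable[[str], bytes], trace: List[str]) -> bytes:
    """Try the frontend, then the backend, then the application layer."""
    body = frontend.get(url)
    if body is not None:
        trace.append("frontend-hit")
        return body
    body = backend.get(url)
    if body is not None:
        trace.append("backend-hit")
    else:
        trace.append("origin")
        body = origin(url)
        backend.put(url, body)
    frontend.put(url, body)
    return body


trace: List[str] = []
frontend, backend = LRUCache(capacity=1), LRUCache(capacity=10)
origin = lambda url: f"<html>{url}</html>".encode("utf-8")
for url in ["/wiki/A", "/wiki/A", "/wiki/B", "/wiki/A"]:
    fetch(url, frontend, backend, origin, trace)
```

With a frontend of capacity 1, the last request for `/wiki/A` misses the frontend but is still answered by the backend tier, so the application servers are only asked once per page.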

CDN (coming soon)

✺ ATS TLS
  ✳ in memory
✺ ATS backend
  ✳ local store (SSDs)
✺ ATS text
  ✳ HTML, CSS, JS etc
✺ ATS upload
  ✳ media, media, media
✺ ACME-chief

Apache Traffic Server: Reverse and forward proxy with excellent caching support
ACME-chief: handles all the process of issuing and renewing Let’s Encrypt certificates (dns-01)
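For readers unfamiliar with the dns-01 challenge mentioned for ACME-chief: per RFC 8555, the client proves control of a domain by publishing a TXT record at `_acme-challenge.<domain>` whose value is the unpadded base64url SHA-256 digest of the key authorization (`token.thumbprint`). The Python sketch below computes that record; it says nothing about ACME-chief's internals, and the token and thumbprint values are placeholders.

```python
import base64
import hashlib
from typing import Tuple


def b64url(data: bytes) -> str:
    """Base64url without trailing '=' padding, as ACME requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def dns01_record(domain: str, token: str, account_thumbprint: str) -> Tuple[str, str]:
    """Return the (name, value) of the TXT record for a dns-01 challenge."""
    key_authorization = f"{token}.{account_thumbprint}"
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    return f"_acme-challenge.{domain}", b64url(digest)


# Placeholder token and thumbprint; real ones come from the ACME server
# and from the account key respectively.
name, value = dns01_record("example.org", "hypothetical-token", "hypothetical-thumbprint")
```

Because validation happens over DNS rather than HTTP, a central service can obtain certificates without the edge caches having to answer challenge requests.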

what happens when you type en.wikipedia.org

Image credit: CC BY 3.0, WikiReader

Read (cached)

Read (uncached)

Edit - Media Upload

Managing to Manage

Image credit: GETTY IMAGES

✺ Infrastructure as code
✺ Configuration management
✺ Kubernetes
✺ Testing/CI/CD
✺ Orchestration tooling

Puppet: configuration management system for servers/services
...~50k lines of puppet code
...~100k lines of Ruby/ERB
Cumin: in-house automation and orchestration tool
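As a rough idea of what orchestration tooling does, the Python sketch below runs an action across a fleet in batches and stops once too many hosts have failed. It is not Cumin's API; the function names, host names, and failure policy are assumptions made for illustration.

```python
from typing import Callable, Dict, Iterator, List


def batches(hosts: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` hosts."""
    for start in range(0, len(hosts), size):
        yield hosts[start:start + size]


def run_in_batches(hosts: List[str], action: Callable[[str], bool],
                   batch_size: int, max_failures: int) -> Dict[str, bool]:
    """Apply `action` batch by batch; abort when failures exceed the limit."""
    results: Dict[str, bool] = {}
    failures = 0
    for batch in batches(hosts, batch_size):
        for host in batch:
            ok = action(host)
            results[host] = ok
            failures += 0 if ok else 1
        if failures > max_failures:
            break  # stop before touching the next batch
    return results


# Hypothetical hosts; the action pretends that one host fails.
hosts = [f"mw{i}" for i in range(1, 7)]
results = run_in_batches(hosts, lambda h: h != "mw2", batch_size=2, max_failures=0)
```

In this run the first batch contains a failure, so the remaining four hosts are never touched, which is the property that keeps a bad change from reaching the whole fleet.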

In a Nutshell

Image credit: CC BY 2.0, Peter Trimming

Want to sell ?

https://jobs.wikimedia.org

https://grafana.wikimedia.org/
https://github.com/wikimedia/operations-puppet
https://phabricator.wikimedia.org/
https://wikitech.wikimedia.org/
