Best Practices Building Resilient Systems</U>
Total Page:16
File Type:pdf, Size:1020Kb
<u>Best Practices Building Resilient Systems</u> Pablo Jensen, CTO Who is Pablo Jensen? . Danish – but born in Argentina where they didn’t had Paul on their whitelist of names so my parents had to call me Pablo . Computer Science degree from Copenhagen University – and MBA from Henley . Several years in Thomson Reuters in Scandinavia, London and Switzerland . Joined Sportradar as CTO in 2013 when the business had 500 employees with 150 in IT – now 2.000 employees and 400 in IT . Industrial advisor for EQT . Running, wine, car’s, Brøndby IF Who is Sportradar? Operating at the intersection of sports, media and entertainment. Global leader in live sports data solutions for digital sport entertainment . 8,000+ staff and contractors globally . 30+ global offices . Deep coverage of more than 40 sports and 600,000 live events per year . 9.000 data points updated every second . 1 second delay from live stadium event to when data is out at our customers . Platform handling 200,000 requests a second, serving users with up to 4gbit/s in total traffic . 9.000 requests/second in average . 800+ Clients and Partners Serving More Than 800 Global Customers Betting Sports Media Integrity Rights Holders Sportradar in a Nutshell Data Collection Data Processing DATA ANALYTICS Data Monitoring Data Marketing Digital Sports Solutions Sports Media: Live Score Sports Media: AV & OTT Sports Media: Widgets & Cards Betting: Life Cycle of Odds Betting: Live Odds Betting: Virtual Games Betting: Integrity Data Feeds & Development Services MDP – Mobile eSports Service Live Odds Development Platform What can go possibly wrong?? What can go possibly wrong?? Top incident reasons 1. 3rd Party Provider issue Physical Footprint 2. Limit exceed (table, storage, traffic) 3. Coding error Technology 4. Not following agreed procedures Process Prepare for that there always will be something wrong Web: IT Organisation HTML5/CSS3, React, Javascript, API driven, Nginx, NodeJS, Varnish, Tomcat, Jetty Tech Stack Mobile: IOS, Android 400+ employees in 10+ IT locations: Backend: • 40+ Dedicated teams Java, PHP, Scala, JRuby, Go, C++, Memcache, Redis, MySql, Cassandra, MongoDB • 300+ Developers • 35 Tech Leads Sys Admin • 40+ System Administrators Ganeti, OpenStack, Zabbix, Puppet, Mcollective, Debian Linux, AWS, Ceph, • 40+ Project Managers Kubernetes • 30+ QA • 20+ Mobile Developers Source code system GIT (GitLab) Open Source scanning WhiteSource Build management: Jenkins, GitLab CI BI & Analytics: S3, ORC, NiFi, RedShift, Athena, Spark, Qlik Communication Tools Slack, Outlook, own build tools for Incident and Maintenance Management but looking at migrating to 3rd party services (StatusPage.io) Web: IT Organisation HTML5/CSS3, React, Javascript, API driven, Nginx, NodeJS, Varnish, Tomcat, Jetty Tech Stack Mobile: IOS, Android 400+ employees in 10+ IT locations: Best practices for buildingBackend resiliency: • 40+ Dedicated teams Java, PHP, Scala, JRuby, Go, C++, Memcache, Redis, MySql, Cassandra, MongoDB • 300+ Developers • 35 Tech Leads Strict defined tech stackSys – Adminnew technologies are • 40+ System Administrators Ganeti, OpenStack, Zabbix, Puppet, Mcollective, Debian Linux, Amazon Web Services, • 40+ Project Managers architecture driven, notKubernetes developer driven • 30+ QA • 20+ Mobile Developers Key technical IT gate pointsSource tocode be system followed GIT (GitLab) • Fitness for DevelopmentOpen Source scanning • Fitness for Launch WhiteSource • “30% Rule” Build management: • Secure DevelopmentJenkins, Guidelines GitLab CI • Maintenance Procedure • Incident Procedure BI & Analytics: • On Duty Procedure S3, ORC, NiFi, RedShift, Athena, Spark, Qlik Sportradar Hosting Locations Own regional based data center locations in Europe AWS/Amazon hosting locations used by Sportradar Sportradar Hosting Locations Physical Footprint Best practices for building resiliency Identical physical regional located core data centers running live-live treated as single redundant data center. Multiple options for client access: • Strategic located POP’s • Direct connect • Open Internet Conceptual Cluster Physical Cluster Data Center A Data Center B Own regional basedA dataB center locationsC in Europe A B C A B C AWS/Amazon hosting locations used by Sportradar Sportradar’s Global Data Production Operations setup is physical redundant so we can Sportradar Production with more than 900 employees globally shift operations between locations Key facts Germany US • Worldwide accepted data quality unmatched in combination of speed and accuracy Estonia • Redundant production setup • Key positions manned with branch expertise Philippines from all business segments • State of the art data entry tools, developed in- Austria house, enhanced based on needs of operations • Operations approved and well-rehearsed, Uruguay permanently reviewed and improved/adjusted • >900 operators across 7 locations • >6,000 scouts globally Sportradar’s Global Data Production Operations setup is physical redundant so we can Sportradar Production with more than 900 employees globally shift operations between locations PhysicalKey facts Footprint Germany US Best practices for building resiliency • Worldwide accepted data quality unmatched in combination of speed and accuracy Estonia Identical production locations • Redundant production setup • Key positions manned with branch expertise Tasks can move from one locationPhilippines to another from all business segments • State of the art data entry tools, developed in- Austria house, enhanced based on needs of operations • Operations approved and well-rehearsed, Uruguay permanently reviewed and improved/adjusted • >900 operators across 7 locations • >6,000 scouts globally Providers All service elements; eg. ISP, CDN, DDOS Protection, cloud hosting, physical hosting, DNS, physical production locations, POPs, fixed line connections are understood and categorized with full risk understanding and acceptance. Providers Physical Footprint All service elements;Best practices eg. for ISP, building CDN, resiliency DDOS Protection, cloud hosting, physicalUnderstand hosting, and accept: DNS, physical production locations, • Service elements that are ‘multi-vendor POPs, fixed line connections are understood and categorized • Service elements that are ‘multi-regional’ with full risk understanding and acceptance. • Service elements that are ‘single’ served Separate technology stacks Closed extranet environment for Business Area A Open internet environment for Business Area B US Asia EU EU Client Client Client Client Client Client Client Client DDOS Amazon AWS City A POP City B POP Amazon AWS Protection DC Closed Stack DC Open Stack Own hardware, firewall, routers Own hardware, firewall, routers Leased/fixed line Open Internet during normal operation Gateway for clients from Open Internet during DDOS mitigation open internet Separate technology stacks Closed extranet environment for Business Area A Open internet environment for Business Area B US Asia EU EU Client Client Client Client Client Client Client TechnologyClient Best practices for building resiliency Amazon AWS City A POP City B POP Amazon AWS Prolexic Business areas served via separate technology stacks; one stack can have issues without impacting other stacks DC Closed StackTechnology stacks are hosted on independent DC Open Stack Own hardware, firewall,redundant routers services Own hardware, firewall, routers Leased/fixed line Open Internet during normal operation Gateway for B2B clients Open Internet during DDOS attack from open internet Architecture Deployment Model One of our Backend Core Systems • Running on 3 dedicated physical servers in 3 different physical locations 3 availability zones • Composed of many sub-systems - each running as an independent cluster Separate cluster per sub system • Java services either stateless or stateful while keeping data in a distributed mem-grid • Clustered active-active setup of RabbitMQ, Zookeeper, HAProxy, Mongo replica sets, Cassandra Active/Active • Master-slave active-passive setup of MySQL, MySQL Fabric and Redis instances Active/Passive • Mongo point-in-time incremental backup, MySQL/Redis/ZK daily backups • Recovery mechanisms (e.g. a subsystem is able to recover its state based on reference data) Recovery • Async service design (message passing, streaming) Async Design • Circuit-breakers, request throttling, fail-fast approach (Hystrix) • Decoupling of operational and archive/warehouse databases Decoupling • Decoupling and different types of disk volumes, reduce I/O contention (e.g. Mongo, MySQL, Backup, VMs) • Lots of attention to low-latency implementation and design Architecture Deployment Model One of our backend Core Systems • Running on 3 dedicated physical servers in 3 different physical locations 3 availability zones Technology • Composed of many sub-systems - each running as an independent cluster Separate cluster per sub system • Java services either statelessBest or stateful practiceswhile for keeping building data inresiliency a distributed mem-grid • Clustered active-active setup of RabbitMQ, Zookeeper, HAProxy, Mongo replica sets, Cassandra Active/Active • Master-slave active-passive setupClear oftechnical MySQL, MySQL guidelines Fabric for and development Redis instances teams; Active/Passive • Mongo point-in-time incrementalno need backup, to invent MySQL/ theRedis wheel/ZK dailyall the backups time • Recovery mechanisms (e.g. a subsystem is able to recover its state based on reference data) Recovery Test, test, test • Async service design (message passing,