
Best Practices for Building Resilient Systems

Pablo Jensen, CTO

Who is Pablo Jensen?

• Danish – but born in Argentina, where "Paul" was not on the whitelist of approved names, so my parents had to call me Pablo

• Computer Science degree from Copenhagen University and an MBA from Henley

• Several years at Thomson Reuters in Scandinavia, London and Switzerland

• Joined Sportradar as CTO in 2013, when the business had 500 employees with 150 in IT – now 2,000 employees and 400 in IT

• Industrial advisor for EQT

• Running, wine, cars, Brøndby IF

Who is Sportradar?

Operating at the intersection of sports, media and entertainment.

• Global leader in live sports data solutions for digital sports entertainment

• 8,000+ staff and contractors globally

• 30+ global offices

• Deep coverage of more than 40 sports and 600,000 live events per year

• 9,000 data points updated every second

• 1-second delay from the live stadium event to when the data reaches our customers

• Platform handling 200,000 requests per second, serving users with up to 4 Gbit/s in total traffic

• 9,000 requests/second on average

• 800+ clients and partners

Serving More Than 800 Global Customers

• Betting
• Sports Media
• Integrity
• Rights Holders

Sportradar in a Nutshell

Data Collection, Data Processing, Data Analytics, Data Monitoring, Data Marketing

• Digital Sports Solutions
• Sports Media: Live Score
• Sports Media: AV & OTT
• Sports Media: Widgets & Cards
• Betting: Life Cycle of Odds
• Betting: Live Odds
• Betting: Virtual Games
• Betting: Integrity
• Data Feeds & Development Services
• MDP – Mobile
• eSports Service
• Live Odds Development Platform

What can possibly go wrong?

Top incident reasons

1. 3rd-party provider issue
2. Limit exceeded (table, storage, traffic)
3. Coding error
4. Not following agreed procedures

Physical Footprint – Technology – Process

Be prepared: there will always be something going wrong.

IT Organisation & Tech Stack

Web: HTML5/CSS3, React, JavaScript, API-driven, Nginx, NodeJS, Varnish, Tomcat, Jetty

Mobile: iOS, Android

Backend: Java, PHP, Scala, JRuby, Go, C++, Memcache, Redis, MySQL, Cassandra, MongoDB

Sys Admin: Ganeti, OpenStack, Kubernetes, Zabbix, Puppet, MCollective, Debian Linux, AWS

Source code system: Git (GitLab)

400+ employees in 10+ IT locations:
• 40+ dedicated teams
• 300+ developers
• 35 tech leads
• 40+ system administrators
• 40+ project managers
• 30+ QA
• 20+ mobile developers

Open source scanning: WhiteSource

Build management: Jenkins, GitLab CI

BI & Analytics: S3, ORC, NiFi, RedShift, Athena, Spark, Qlik

Communication tools: Slack, Outlook, own-built tools for incident and maintenance management, but looking at migrating to 3rd-party services (StatusPage.io)

Best practices for building resiliency:

• Strictly defined tech stack – new technologies are architecture-driven, not developer-driven
• Key technical IT gate points to be followed: Fitness for Development, Fitness for Launch, "30% Rule", Secure Development Guidelines, Maintenance Procedure, Incident Procedure, On-Duty Procedure

Sportradar Hosting Locations

• Own regionally based data center locations in Europe
• AWS/Amazon hosting locations used by Sportradar

Best practices for building resiliency – Physical Footprint:

Identical, regionally located physical core data centers running live-live, treated as a single redundant data center – a client-side sketch follows below.

Multiple options for client access:
• Strategically located POPs
• Direct connect
• Open internet

[Diagram: a conceptual cluster of nodes A, B and C maps onto a physical cluster with nodes A, B and C in both Data Center A and Data Center B.]
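As an illustration of what "treated as a single redundant data center" means for a consumer of the platform, here is a minimal Java sketch of a client that prefers one data-center endpoint and fails over to the second. The endpoint URLs, timeouts and simple try-next policy are assumptions for illustration, not Sportradar's actual access setup (which also includes POPs and direct connect).

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.List;

/** Minimal illustration of client-side failover across two live-live data centers. */
public class DataCenterFailoverClient {

    // Hypothetical endpoints, one per data center; both serve the same API.
    private static final List<URI> ENDPOINTS = List.of(
            URI.create("https://dc-a.example.com/api/livescores"),
            URI.create("https://dc-b.example.com/api/livescores"));

    private final HttpClient http = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(2))   // fail fast on an unreachable DC
            .build();

    /** Try each data center in turn and return the first successful response body. */
    public String fetchLiveScores() throws InterruptedException {
        for (URI endpoint : ENDPOINTS) {
            try {
                HttpRequest request = HttpRequest.newBuilder(endpoint)
                        .timeout(Duration.ofSeconds(2))
                        .GET()
                        .build();
                HttpResponse<String> response =
                        http.send(request, HttpResponse.BodyHandlers.ofString());
                if (response.statusCode() == 200) {
                    return response.body();          // a healthy data center answered
                }
            } catch (java.io.IOException e) {
                // This DC is unreachable or slow; fall through and try the next one.
            }
        }
        throw new IllegalStateException("No data center reachable");
    }
}
```

In practice the same effect is usually achieved one layer down, e.g. with DNS- or load-balancer-level failover, so the client code stays unaware of individual data centers.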

Sportradar's Global Data Production

Sportradar Production has more than 900 employees globally. The operations setup is physically redundant, so we can shift operations between locations.

Key facts:
• Worldwide accepted data quality, unmatched in the combination of speed and accuracy
• Redundant production setup
• Key positions manned with branch expertise from all business segments
• State-of-the-art data entry tools, developed in-house and enhanced based on the needs of operations
• Operations approved and well-rehearsed, permanently reviewed and improved/adjusted
• >900 operators across 7 locations
• >6,000 scouts globally

Production locations: Germany, US, Estonia, Philippines, Austria, Uruguay

Best practices for building resiliency – Physical Footprint:

• Identical production locations
• Tasks can move from one location to another

Providers

All service elements – e.g. ISP, CDN, DDoS protection, cloud hosting, physical hosting, DNS, physical production locations, POPs, fixed-line connections – are understood and categorized, with full risk understanding and acceptance.

Best practices for building resiliency – Physical Footprint:

Understand and accept:
• Service elements that are 'multi-vendor'
• Service elements that are 'multi-regional'
• Service elements that are 'single' served

Separate technology stacks

Closed extranet environment for Business Area A
Open internet environment for Business Area B

[Diagram: Closed stack – clients in the US, Asia and the EU connect through City A POP and City B POP over leased/fixed lines, with a gateway for clients coming from the open internet; the DC Closed Stack runs on own hardware, firewalls and routers. Open stack – EU clients connect via Amazon AWS, over the open internet during normal operation and through DDoS protection during mitigation; the DC Open Stack likewise runs on own hardware, firewalls and routers.]

Best practices for building resiliency – Technology:

• Business areas are served via separate technology stacks; one stack can have issues without impacting the other stacks
• Technology stacks are hosted on independent, redundant services

[Diagram repeats the stack layout above, naming Prolexic as the DDoS protection in front of the open stack and a gateway for B2B clients coming from the open internet.]

Architecture Deployment Model

One of our Backend Core Systems

• Running on 3 dedicated physical servers in 3 different physical locations (3 availability zones)
• Composed of many sub-systems, each running as an independent cluster (separate cluster per sub-system)
• Java services are either stateless or stateful while keeping data in a distributed mem-grid
• Clustered active-active setup of RabbitMQ, ZooKeeper, HAProxy, Mongo replica sets, Cassandra (Active/Active)
• Master-slave active-passive setup of MySQL, MySQL Fabric and Redis instances (Active/Passive)
• Mongo point-in-time incremental backup; MySQL/Redis/ZK daily backups
• Recovery mechanisms (e.g. a subsystem is able to recover its state based on reference data)
• Async service design (message passing, streaming) – a messaging sketch appears at the end of this slide
• Circuit breakers, request throttling, fail-fast approach (Hystrix) – see the sketch below
• Decoupling of operational and archive/warehouse databases
• Decoupling and different types of disk volumes to reduce I/O contention (e.g. Mongo, MySQL, backup, VMs)
• Lots of attention to low-latency implementation and design
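The circuit-breaker bullet names Hystrix; the sketch below shows the general shape of such a command – a guarded downstream call with a timeout, breaker thresholds and a fallback. The command name, the 200 ms budget and the threshold values are illustrative assumptions, not Sportradar's production settings.

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandProperties;

/** Sketch of a fail-fast call to a downstream odds service, protected by a circuit breaker. */
public class OddsLookupCommand extends HystrixCommand<String> {

    private final String matchId;

    public OddsLookupCommand(String matchId) {
        super(Setter.withGroupKey(HystrixCommandGroupKey.Factory.asKey("OddsService"))
                .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                        .withExecutionTimeoutInMilliseconds(200)        // fail fast: low-latency budget
                        .withCircuitBreakerRequestVolumeThreshold(20)   // min requests before the breaker can trip
                        .withCircuitBreakerErrorThresholdPercentage(50) // trip when half the calls fail
                        .withCircuitBreakerSleepWindowInMilliseconds(5000)));
        this.matchId = matchId;
    }

    @Override
    protected String run() {
        // A real implementation would call the downstream odds service here.
        return callOddsService(matchId);
    }

    @Override
    protected String getFallback() {
        // Served when the call times out, fails, or the circuit is open.
        return "odds-unavailable:" + matchId;
    }

    private String callOddsService(String matchId) {
        // Placeholder for an actual remote call (HTTP, RPC, ...).
        return "odds-for:" + matchId;
    }

    public static void main(String[] args) {
        // Each request is wrapped in a fresh command instance.
        System.out.println(new OddsLookupCommand("match-42").execute());
    }
}
```

Request throttling is typically handled on the same command via Hystrix's thread-pool or semaphore isolation limits; the exact sizing has to be tuned per downstream service.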

Best practices for building resiliency – Technology:

• Clear technical guidelines for development teams; no need to reinvent the wheel all the time
• Test, test, test
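To make the async-design bullet from the list above concrete, here is a small, self-contained RabbitMQ sketch in Java: a producer publishes a match event to a durable queue and a consumer callback processes it asynchronously. The queue name, payload and local broker host are invented for illustration; this is not Sportradar's actual messaging code.

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DeliverCallback;
import com.rabbitmq.client.MessageProperties;

import java.nio.charset.StandardCharsets;

/** Sketch of asynchronous message passing between two services via RabbitMQ. */
public class MatchEventMessaging {

    private static final String QUEUE = "match.events"; // illustrative queue name

    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // assumes a local broker for the sketch

        try (Connection connection = factory.newConnection();
             Channel channel = connection.createChannel()) {

            // Durable queue so events survive a broker restart.
            channel.queueDeclare(QUEUE, true, false, false, null);

            // Producer side: fire-and-forget publish; the caller does not block on processing.
            String event = "{\"matchId\":\"match-42\",\"type\":\"GOAL\",\"minute\":63}";
            channel.basicPublish("", QUEUE, MessageProperties.PERSISTENT_TEXT_PLAIN,
                    event.getBytes(StandardCharsets.UTF_8));

            // Consumer side: the callback is invoked asynchronously for each delivered message.
            DeliverCallback onMessage = (consumerTag, delivery) -> {
                String body = new String(delivery.getBody(), StandardCharsets.UTF_8);
                System.out.println("processing event: " + body);
            };
            channel.basicConsume(QUEUE, true, onMessage, consumerTag -> { });

            Thread.sleep(1000); // give the consumer a moment before the connection closes
        }
    }
}
```

The point of the pattern is that the publisher never waits for the consumer, so a slow or restarting subsystem delays processing instead of failing the caller.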

Architecture Deployment Model

Frontend Systems

• Applications assume little about the infrastructure they run on (frontend/backend separation) – see the configuration sketch at the end of this slide
• Can be deployed to the cloud or on premise (deployed everywhere)
• Servers are provisioned the same way regardless of whether they are on premise or in the cloud
• Conservative about using cloud or on-premise services that lock us to that infrastructure (avoid infrastructure lock-in, especially for cloud services)
• Redundant direct-connect links between on-premise and cloud infrastructure (direct connections)
• Route53 load balancing between DCs – very handy when ISPs fail: incoming traffic just flows through other DCs, which then use the direct-connect backbone to reach the correct destinations
• On premise usually takes the bulk of the traffic due to traffic costs (on premise to reduce AWS costs)

Best practices for building resiliency – Technology:

• Frontend/backend infrastructure separation to ensure no vendor lock-in
• Frontend technology can be deployed everywhere
• Use direct connections where possible
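The frontend bullets stress that applications should assume little about the infrastructure they run on, so the same artifact can be deployed to the cloud or on premise. Below is a minimal Java sketch of that idea, reading deployment-specific settings from the environment; the variable names and defaults are assumptions for illustration, not Sportradar's configuration.

```java
import java.util.Optional;

/** Sketch: resolve deployment-specific settings from the environment so the same
 *  artifact runs unchanged on premise or in the cloud. */
public class InfrastructureAgnosticConfig {

    private final String dataApiBaseUrl;
    private final int httpPort;

    private InfrastructureAgnosticConfig(String dataApiBaseUrl, int httpPort) {
        this.dataApiBaseUrl = dataApiBaseUrl;
        this.httpPort = httpPort;
    }

    /** Read settings from environment variables; fall back to illustrative local defaults. */
    public static InfrastructureAgnosticConfig fromEnvironment() {
        String baseUrl = Optional.ofNullable(System.getenv("DATA_API_BASE_URL"))
                .orElse("http://localhost:8080");          // hypothetical default for local runs
        int port = Integer.parseInt(
                Optional.ofNullable(System.getenv("HTTP_PORT")).orElse("8080"));
        return new InfrastructureAgnosticConfig(baseUrl, port);
    }

    public String dataApiBaseUrl() { return dataApiBaseUrl; }
    public int httpPort()          { return httpPort; }

    public static void main(String[] args) {
        // The deployment (container, cloud image or on-premise VM) injects the environment;
        // the application itself stays identical everywhere.
        InfrastructureAgnosticConfig config = fromEnvironment();
        System.out.println("backend = " + config.dataApiBaseUrl() + ", port = " + config.httpPort());
    }
}
```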

Client Support & Technical Support

Setup of Client Support & Technical Support:
• Global 24x7x365 support service via chat, helpdesk & phone
• ISO 9001 certified
• Escalation to the relevant technical On-Duty Team
• All teams with a service in production are required to have a 24x7x365 On-Duty Team
• Only the best engineers are part of such a team
• Own-built tools for on-call and incident management – looking at PagerDuty

Key figures:
• 54% of all customer tickets have been finally solved in less than 60 minutes
• <0.5% of all incoming requests escalate to technical support
• 98% is the average handling rate of all incoming chats
• 97% of incoming chats have been accepted by an operator in less than 18 seconds
• 110,000 chat/helpdesk/phone support requests in 2017

[Chart: monthly support request volume, Jan–Oct 2017]

"Your reps were very professional and speedy in replying to my emails, which is something I appreciate." – Client A
"Great support – very fast and accurate answers." – Client B
"Quick response and solution. Re-sending the feed was everything we needed and it fixed the problems on our side. Thanks guys." – Client C

Best practices for building resiliency – Process:

• Support is your friend – 110,000+ requests/year
• Less than 0.5% escalates to a technical team
• Incident reviews
• Be open and transparent – also during incidents
• Focus on monitoring – it can always be improved; remember to go down to a low level if you run on premise
• Tight control of communication channels and of on-call and incident management tools
• Developers eat their own dog food

Maintenance Procedure

Maintenance procedure with clear rules

• Affected clients to be notified at least 2 days in advance
• Always scheduled in cooperation with Operations
• No maintenance on Friday/Saturday/Sunday – if it happens, it is low risk and/or has business approval

[Diagram: maintenance outcome states – Rejected, Reverted, Aborted, Critical Consequences, Minor Consequences, All OK]

Good results:
• Less than 4% aborted
• Close to "rolling updates"

Best practices for building resiliency – Process:

• Clear rules – common for all teams
• Need for exceptions built into the process
• Respect your peak days
• Ensure strong tools for process management
• Communicate, communicate, communicate

Dedicated Information Security Team

Setup of Information Security Team

• Independent – can escalate to the CEO and the Board
• Policy framework
• Secure development guidelines
• System evaluation and guidance
• Open source scanning

Best practices for building resiliency – Process:

• Include security as soon as possible
• Train-the-trainer (e.g. Tech Lead) concept
• KPIs introduced by policy, used to measure compliance

MVP - Start small and extend

Communicate, communicate, communicate.

Project terminology:
• From idea to project
• From project to launch
• Launched product moves to maintenance
• Full product decline and the project will close

Decisions by Business Lead.

Project types: Innovation Pipeline → New Project → Existing Project → Maintenance Project
• Innovation Pipeline: Innovation Team / "Sandbox" (not live)
• New Project: Project Ramp Up (not live)
• Existing Project: Delivery Process for Existing Products / Development (live)
• Maintenance Project: Maintenance / Ramp Down (live)

[Table: typical activities per project type, split between Business Unit and Development – Innovation Pipeline: Innovation Team, press release, wireframes, prototyping, MVP focused; New Project: project plan creation, financials, build up team; Existing Project: project plan and roadmap delivery, assigned resources, address client feedback; Maintenance Project: less resources, maintain client(s), close down service, assigned resources.]

IT / Technology procedures to be followed:

Fitness for Development:
• Architecture
• Security
• Hosting (dev environment, production plan)
• Project setup

Fitness for Launch:
• Client setup readiness
• Hosting/production environment
• Sales and Marketing briefing
• Support / On-Duty 24x7
• Security, Architecture

"30% Rule"

Best practices for building resiliency – Process:

• Identify your key IT gate points
• Do active project portfolio management so you know what is going on
• Development teams in business units – focusing on product features – tend to forget procedures, maintenance and stability, so we have the "30% Rule"
• Iterate, remind and communicate procedures
• Train the trainer (e.g. Tech Lead) concept

Best practices building resiliency – Wrap Up

How to ensure all areas and systems are included

Technology:
➔ Definition of simple and clear technology architecture rules
➔ Ensure you use several hosting locations in a combined build-up; either fully AWS or a combination of on-premise hosting with AWS
➔ Separate your service deliveries into logical pieces: frontend/backend, sub-system clusters
➔ Understand your vendor dependencies

Governance:
➔ Enforce key IT processes, e.g. Fitness for Development, Fitness for Launch and the "30% Rule"
➔ Active project portfolio management with key IT processes as mandatory gate points
➔ Strictly defined maintenance, support and incident processes

Continuous improvement:
➔ What went wrong – improve!!
➔ Peer review
➔ Postmortems

Cross-IT challenge:
➔ Grey zone between development, system administration and hosting – a clear DevOps topic
➔ Building resiliency is not only a technical topic; it is also about people, processes and physical footprint

"Looks bureaucratic but it doesn't feel so."
(Comment from a Sportradar Tech Lead)

[email protected]
www.linkedin.com/in/pablojensen

Thank you.