
Managing Capacity and Performance in a Large Scale Production Environment

Presenter: Goranka Bjedov, Capacity Engineer, Facebook

Outline

1 Problem Description

2 Objectives/Goals

3 Start Simple/Small

4 Current System

5 What is Next?

Problem Description
There is more to the story than meets the eye

• Facebook (> 1.7B MAU)
• WhatsApp (> 1B MAU)
• Messenger (> 1B MAU)
• (> 0.5B MAU)
• Internet.org

Photos (and Videos)

• User growth not reaching saturation
• Implicit contract to store content forever

How do you keep photos for 100 years?

What we do in 30 minutes…

• 3.8 trillion cache operations
• 105TB of data scanned via Hive
• 108 billion queries run on MySQL
• 10TB of logs loaded into Hadoop
• 5 billion real-time messages sent
• 160 million newsfeed stories created
• 225TB network egress
• 10 billion profile pictures served
• 50 million photos served

• 200 billion objects checked
• 300 million objects blocked
• 10 million photos uploaded

* Numbers based on summer 2013

Facebook Private Cloud

Fully turned up: SNC: Santa Clara, California; ASH: Ashburn, Virginia; PRN: Prineville, Oregon; FRC: Forest City, North Carolina; LLA: Luleå, Sweden; ATN: Altoona, Iowa

Future: FTW: Fort Worth, Texas; CLN: Clonee, Ireland; VLL: Village of Los Lunas, New Mexico; ODN: Odense, Denmark

2017 Servers (opencompute.org)

Standard systems:

• Type I (Web): CPU High, Broadwell (16c) 1.8 GHz; Memory Low, 32GB; Disk Low, 256GB Flash; 120 servers/rack
• Type III (Database): CPU High, 2 x Broadwell (14c) 2.4 GHz; Memory High, 256GB; Disk High IOP, 2 x 3.2TB Flash + 128GB mSATA; 30 servers/rack
• Type IV (Hadoop): CPU High, 2 x Broadwell (14c) 2.4 GHz; Memory Medium, 128GB; Disk High, 15 x 8TB NL SAS; 18 servers/rack
• Type V (Haystack): CPU Low, 1 x Avoton (8c) 2.7 GHz; Memory Low, 32GB; Disk High, 30 x 8TB NL SAS; 9 servers/rack
• Type VI (Feed): CPU High, 2 x Broadwell (14c) 2.4 GHz; Memory High, 256GB; Disk Medium, 2TB SAS (1.6TB Flash optional); 30 servers/rack
• Type VII (Cold Storage): CPU High, 2 x Haswell (12c) 2.5 GHz; Memory Medium, 128GB; Disk Very High, 240 x 8TB (16 active disks); 6 servers/rack

2015 Servers (opencompute.org)

Standard systems:

• Type I (Web): CPU High, 2 x Haswell (12c) 2.5 GHz; Memory Low, 32GB; Disk Low, 500GB SATA; 30 servers/rack
• Type III (Database): CPU High, 2 x Haswell (12c) 2.5 GHz; Memory High, 256GB; Disk High IOP, 2 x 3.2TB Flash + 128GB mSATA; 30 servers/rack
• Type IV (Hadoop): CPU High, 2 x Haswell (12c) 2.5 GHz; Memory Medium, 128GB; Disk High, 15 x 4TB (or 6TB) NL SAS; 18 servers/rack
• Type V (Haystack): CPU Low, 1 x Avoton (8c) 2.7 GHz; Memory Low, 32GB; Disk High, 30 x 4TB (or 6TB) NL SAS; 9 servers/rack
• Type VI (Feed): CPU High, 2 x Haswell (12c) 2.5 GHz; Memory High, 256GB; Disk Medium, 2TB SAS (1.8TB Flash optional); 30 servers/rack
• Type VII (Cold Storage): CPU High, 2 x Haswell (12c) 2.5 GHz; Memory Medium, 128GB; Disk Very High, 240 x 4TB (16 active disks); 6 servers/rack

Server Generations

Type I Web servers by generation (rack composition, cores/CPUs, clock speed, RCUs):

• 2007: L5420 (VSS), fully-buffered DIMMs; 8 real cores; 2.50 GHz; 0.4 RCU
• 2008: L5420 (SC); 8 real cores; 2.50 GHz; 0.6 RCU
• 2009: L5520 (NHM); 16 logical CPUs; 2.27 GHz; 1 RCU
• 2010: L5639 (WSM); 24 logical CPUs; 2.13 GHz; 1.4 RCU
• 2011: X5650 (XWSM); 24 logical CPUs; 2.67 GHz; 1.75 RCU
• 2012: E5-2670 (SNB); 32 logical CPUs; 2.67 GHz; 2.42 RCU
• 2014: E5-2680 (IVB); 40 logical CPUs; 2.8 GHz
• 2015: Haswell; 48 logical CPUs; 2.5 GHz
• 2017: Broadwell; 32 logical CPUs; 1.8 GHz

Objectives/Goals
Not getting any simpler
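The RCU column in the generations table normalizes each server generation's throughput to the 2009 NHM machine (1 RCU), so heterogeneous fleets can be compared by capacity rather than box count. A minimal sketch of that idea; the fleet counts below are hypothetical, not from the talk:

```python
# RCU values per generation, taken from the table above (2009 NHM = 1.0).
RCUS = {"2007": 0.4, "2008": 0.6, "2009": 1.0,
        "2010": 1.4, "2011": 1.75, "2012": 2.42}

def fleet_rcus(fleet):
    """Total capacity of a mixed fleet in RCUs.
    `fleet` maps generation -> server count (hypothetical numbers)."""
    return sum(count * RCUS[gen] for gen, count in fleet.items())

# Hypothetical comparison: 1000 SNB boxes out-deliver 3000 boxes from 2007.
old = fleet_rcus({"2007": 3000})   # 3000 * 0.4  = 1200.0 RCUs
new = fleet_rcus({"2012": 1000})   # 1000 * 2.42 = 2420.0 RCUs
print(old, new)
```

This is why capacity planning talks in normalized units: "add 500 RCUs" is meaningful across generations, while "add 500 servers" is not.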

• Low latency for users
• Launch things quickly
  – Succeed or fail quickly
  – Don't worry about efficiency
• Conserve resources
  – Power
  – Money
  – Computers (RAM, CPU)
  – Network
  – Developer time

Power Efficiency

Typical power path: utility transformer (480/277 VAC, standby generator), 2% loss → UPS (480VAC, AC/DC and DC/AC conversion), 6%-12% loss → ASTS/PDU (208/120VAC), 3% loss → server PS, 10% loss (assuming a 90%-plus PS). Total loss up to the server: 21% to 27%. Availability: 99.999%.

Prineville power path: utility transformer (480/277 VAC, standby generator), 2% loss → 480/277VAC delivered directly to the FB server PS, backed by a stand-by 48VDC DC UPS; 5.5% loss. Total loss up to the server: 7.5%. Availability: 99.9999%.

My Goals
Finally something simple
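The loss totals on the power-efficiency slide are just the per-stage percentages added up. A quick check of that arithmetic (stage values from the slide):

```python
# Per-stage losses (%) from the slide; the slide's totals add the stages.
typical_low  = 2 + 6  + 3 + 10   # transformer + UPS (best case) + PDU + server PS
typical_high = 2 + 12 + 3 + 10   # transformer + UPS (worst case) + PDU + server PS
prineville   = 2 + 5.5           # transformer + high-efficiency server PS

print(typical_low, typical_high, prineville)  # 21 27 7.5
```

Strictly, losses compound multiplicatively (e.g. 1 - 0.98 * 0.88 * 0.97 * 0.90 ≈ 24.7% for the worst case), so the additive figures are a slightly pessimistic but close approximation.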

• Want (only) the right things running on (only) the right gear
• Want things to run efficiently
• Want to know if something is broken
• Want to know if something is about to break
• Want to know how (and why) things are growing (or not)
• Want to know who to blame… :)

Approach
Start small

• Monitoring daemon (dynolog)
  – Runs on all servers
  – Collects system/kernel data at one-second granularity
  – Slim high-throughput aggregator and streaming Thrift service
  – Exports data to Hadoop, ODS, etc.
• What data does it collect
  – Default: CPU, RAM, network, IO, disk stats…
  – Customized: requests + app performance

Dyno

Application level (Web)

Measuring web capacity

• Hit duration increases with hit rate
• Defined 100 milliseconds of contention-induced latency as 100% utilization: any more latency is easily perceived by the user
• The increase in duration comes from the process waiting in the kernel run queue
• The NHM server in the chart tops out at 48 hits/s

[Chart: 091201_snc2_ape_nhm, cpu_delay_per_req (ms) vs. hits/s]

Custom Monitoring
More info: https://www.facebook.com/note.php?note_id=203367363919
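The contention-induced latency measured above is the time a process spends waiting in the kernel run queue. On Linux that is exposed per process in /proc/&lt;pid&gt;/schedstat, whose second field is nanoseconds spent waiting to run. A minimal sketch of deriving a cpu_delay_per_req-style metric from it; the helper names are mine, not dynolog's:

```python
def runqueue_wait_ns(pid="self"):
    """Second field of /proc/<pid>/schedstat: ns spent waiting on the run queue."""
    with open(f"/proc/{pid}/schedstat") as f:
        on_cpu_ns, wait_ns, timeslices = f.read().split()
    return int(wait_ns)

def cpu_delay_per_req_ms(wait_before_ns, wait_after_ns, requests):
    """Average contention-induced delay per request over a window, in ms.
    The slide treats 100 ms/request as 100% utilization."""
    return (wait_after_ns - wait_before_ns) / requests / 1e6

# Hypothetical window: 50 requests accumulated 2.5 s of run-queue wait.
print(cpu_delay_per_req_ms(0, 2_500_000_000, 50))  # 50.0 ms -> ~50% utilized
```

In practice the daemon would snapshot the counter once per second, diff consecutive readings, and divide by the request count observed in the same window.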

• InstantDyno:
  – Listen to the dyno Thrift interface
  – Make correlations between pairs of metrics
  – Adjust load on load balancers

• Live loadtests
• 24/7
• Every day, every second

What Questions Can I Answer
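The metric-pair correlation InstantDyno performs can be sketched with a plain Pearson coefficient over two aligned time series. This is my illustration of the idea, not Facebook's code, and the sample data is hypothetical:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equally long metric time series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hit rate vs. per-request CPU delay: strongly correlated under contention.
hits  = [17, 23, 29, 35, 41, 47]   # hits/s, hypothetical samples
delay = [5, 9, 20, 45, 90, 180]    # ms, hypothetical samples
r = pearson(hits, delay)
assert r > 0.85  # a strong correlation is a signal to shed load
```

A live-loadtest controller can raise traffic on a test machine until such a correlation (plus the absolute delay threshold) indicates saturation, then instruct the load balancer to back off.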

• How is a group of servers performing?
  – Need to know which servers form a "group"
  – Need to process/organize/present that data
  – Our answer: Overwatch
• How is the code on those servers performing?
  – Is the code uniform?
  – CPU (instructions, cycles, etc.)
  – RAM
  – Disk, network, TOR, etc.
  – Our answer: Perf Doctor

Overwatch - Comparison across services

Overwatch - Alerting

Dyno and Perf Doctor

• Perf Doctor consumes nearly all of Dyno's system counters
• Raw time-series values are used to determine:
  – Does a tier have performance issues?
  – What is a possible remediation?
• Breakdown done by machine architecture
• Most counters are also plotted

Dyno system counters in ODS

CPU: CPU instructions; CPU cycles; instructions-per-cycle; CPU frequency; CPU core count; Turbo status; HHVM foreground instructions; CPU busy

Memory: memory bandwidth; local memory bw; remote memory bw; memory bw utilization

Storage: disk read/write iops; flash/disk rd/wr bytes; flash/disk rd/wr ms; flash/disk rd/wr ios

Kernel: interrupts per sec; TLB shootdowns; page faults; context switches

Network (eth0): Tx/Rx packets; Tx/Rx bytes; Tx/Rx errors; Tx/Rx drops

PerfDoctor - Is A Service Healthy?

PerfDoctor - Calculating CPI

PerfDoctor - Calculating DTLB miss rate

PerfDoctor - Calculating LLC miss rate

Deep Dive - Longma
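The derived health metrics named on the PerfDoctor slides (CPI, DTLB miss rate, LLC miss rate) are simple ratios of the raw counters dyno already collects. A sketch with hypothetical counter values for one sampling window:

```python
def cpi(cycles, instructions):
    """Cycles per instruction: a high CPI suggests a stalled,
    memory-bound pipeline rather than useful compute."""
    return cycles / instructions

def miss_rate(misses, accesses):
    """Generic miss rate, e.g. DTLB misses / DTLB lookups,
    or LLC misses / LLC references."""
    return misses / accesses

# Hypothetical counters for one window:
print(cpi(2_000_000_000, 1_000_000_000))    # 2.0 -> likely memory-bound
print(miss_rate(5_000_000, 1_000_000_000))  # 0.005 -> 0.5% miss rate
```

Tracking these ratios per machine architecture (as the slides note) matters because a "healthy" CPI on one CPU generation may be alarming on another.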

• A tool to simplify collecting performance data, storing it, and presenting it in an accessible way
• Can use Intel's EMON to get low-level hardware counters
• Interprets hardware counters and presents the information

EMON Counters

Longma – Process Count

Find the hottest functions and get a breakdown of memory and CPU usage by process and thread.

Other daemons

• I still cannot tell if (only) the right things are running on the right hardware
• Want to be able to tell automatically why a particular (set of) machine(s) behaves differently at a certain time
• Our solution: two more daemons
  – atop daemon: take atop every X seconds, store output on a local drive
  – Automatically analyze and correlate output
  – Strobelight: get all stacktraces every Y minutes and analyze those

Questions?
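The atop-daemon approach described above (capture a snapshot every X seconds, store it on a local drive for later correlation) can be sketched as a small capture loop. The structure and names here are my illustration, not Facebook's daemon; the command to run is left as a parameter since the exact atop invocation depends on the installed version:

```python
import subprocess
import time
from datetime import datetime
from pathlib import Path

def capture_loop(cmd, out_dir, interval_s, iterations):
    """Run `cmd` every `interval_s` seconds and keep timestamped
    snapshots of its output on a local drive for offline analysis."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for _ in range(iterations):
        stamp = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
        result = subprocess.run(cmd, capture_output=True, text=True)
        snap = out / f"snapshot_{stamp}.txt"
        snap.write_text(result.stdout)
        paths.append(snap)
        time.sleep(interval_s)
    return paths

# In production, `cmd` would invoke atop (or a stack-trace sampler for the
# Strobelight case) and `interval_s` would be the X seconds from the slide.
```

A real daemon would also rotate old snapshots and feed them to the automatic analysis/correlation step the slide mentions.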