ESXi Performance Principles DOAG Edition

Valentin Bondzio, Sr. Staff TSE / GSS Premier Services, 2020-01-23
DOAG 2020 │ ©2020 VMware, Inc.

Brief Intro


@VMware since 2009

Global Support Services / Premier Services

Focus on Resource Management, Performance and Windows Internals

Originally from Berlin, living in Ireland since 2007

And most importantly …


Not an Oracle expert!


Agenda
• CPU Scheduling and Usage Accounting: the "basics"
• "Power Management": the Good, the Better and the Ugly
• ESXi Memory Management: more "basics"
• Local resource distribution + I/O stuff
• What else is running on ESXi + vMotion
• CPU Topology Abstraction: CPU Socket != NUMA node
• Backup

CPU Scheduler Overview

Resource guarantees and weighting (shares) on a per VM or “Resource Pool” level

CPU Scheduler Overview: What does the scheduler do?

Dispatch VMs (their "worlds") to honor CPU settings (Local)
• For fairness: select the VM with the least (consumed CPU time / fair share)
• For priority: run latency-sensitive VMs (high) before anyone else

[diagram: vCPU and IO worlds queued on an HT / Core]

Place the worlds / threads on physical CPUs (Global)
• To balance load across physical execution contexts (PCPUs)
• To preserve cache state, minimize migration cost
• To avoid contention from hardware (HT, LLC, etc.) and sibling vCPUs (from the same VM)
• To keep VMs or threads that communicate frequently close to each other

[diagram: VMs placed across cores, hyper-threads (HT 0 / HT 1), and LLC domains]

CPU Scheduler Overview: How does that look?

10:10:29am up 2 days 48 min, 674 worlds, 1 VMs, 2 vCPUs; CPU load average: 0.02, 0.01, 0.01
PCPU USED(%): 0.3 0.1 0.0 0.3 0.2 0.1 0.0 0.0 0.0 0.2 50 50 4.1 0.1 0.1 0.0 0.0 0.0 0.1 0.0 0.0 0.1 0.0 0.0 AVG: 4.4
PCPU UTIL(%): 0.5 0.1 0.1 0.6 0.2 0.2 0.0 0.2 0.0 0.3 100 100 4.2 0.2 0.1 0.1 0.0 0.0 0.1 0.0 0.0 0.2 0.1 0.1 AVG: 8.6
CORE UTIL(%): 0.6 0.7 0.4 0.9 0.3 100 4.3 0.2 0.0 0.1 0.4 0.7 AVG: 9.1

ID    GID    NAME                        NWLD %USED %RUN  %SYS %WAIT %VMWAIT %RDY  %IDLE %OVRLP
96337 148153 vmx                         1    0.02  0.01  0.02 61.82 -       37.86 0.00  0.00
96339 148153 NetWorld-VM-96338           1    0.00  0.00  0.00 99.68 -       0.00  0.00  0.00
96340 148153 NUMASchedRemapEpochInitial  1    0.00  0.00  0.00 99.68 -       0.00  0.00  0.00
96341 148153 vmast.96338                 1    0.03  0.05  0.00 99.63 -       0.00  0.00  0.00
96343 148153 vmx-vthread-6               1    0.00  0.00  0.00 99.68 -       0.00  0.00  0.00
96344 148153 vmx-mks:Debian86            1    0.00  0.00  0.00 61.55 -       38.13 0.00  0.00
96345 148153 vmx-svga:Debian86           1    0.00  0.00  0.00 99.68 -       0.00  0.00  0.00
96346 148153 vmx-vcpu-0:Debian86         1    62.35 99.68 0.00 0.00  0.00    0.00  0.00  0.05
96348 148153 vmx-vcpu-1:Debian86         1    62.36 99.67 0.00 0.00  0.00    0.01  0.00  0.05
96347 148153 PVSCSI-96338:0              1    0.00  0.00  0.00 99.68 -       0.00  0.00  0.00
96350 148153 vmx-vthread-7:Debian86      1    0.00  0.00  0.00 99.68 -       0.00  0.00  0.00


CPU Usage Accounting: What states are there?

Running vs. Not Running; refined: Running, Idle, Ready (descheduled)

CPU Usage Accounting: In an ideal world

Usage = Running; the remainder is Idle or Ready (descheduled).

CPU Usage Accounting: What is charged against the VM

Usage = Running plus adjustments for Overlap, HT busy, Frequency, ... The gap between that and what the VM demanded is the "stolen time".

[diagram: Not-Running time further split into Idle, Ready (descheduled), and the wait states RDY, CSTP, MLMTD, VMWAIT, wait]

CPU Usage Accounting: stolen time aka "%LAT_C"

%LAT_C captures the gap between "ideal" execution (demand) and "current" execution.
• "Ideal": unlimited dedicated cores running at nominal processor frequency

Sources of Compute Latency:
• VM resource contention: check %RDY and %CSTP
• Power management (P-State): frequency throttling
• Hardware contention: HTs are in use
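The relationship can be sketched as the share of demand that was not delivered (a simplification for illustration, not the exact vmkernel formula; the function and parameter names are mine):

```python
# Sketch of %LAT_C-style accounting (simplified, not the vmkernel code).
# "demand" is what the vCPU would have consumed on an unlimited,
# dedicated core at nominal frequency.
def lat_c_pct(delivered_pct, stolen_pct):
    """delivered_pct: CPU actually granted to the vCPU (roughly %USED).
    stolen_pct: time lost to %RDY, %CSTP, frequency throttling, busy HTs."""
    demand = delivered_pct + stolen_pct
    return 100.0 * stolen_pct / demand if demand else 0.0

# A vCPU that ran 80% of an interval and was ready-to-run for the other 20%:
print(lat_c_pct(80.0, 20.0))  # 20.0
```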

Intel® Hyper-Threading Technology: Cores and Threads

Does enabling HT "spawn" a less capable "logical core"?
[diagram: "physical" core vs. "physical" core + "logical" core]

Maybe two slightly less capable "logical" cores?
[diagram: "physical" core vs. "logical" core0 + "logical" core1]

Intel® Hyper-Threading Technology: Individual throughput reduction, aggregate throughput increase at high load

[diagram: two HTs at 100 + 100 deliver ~125 of a single core's throughput]

Intel® Hyper-Threading Technology on ESXi: Throughput reduction is accounted for in USED

Two HTs at 100% UTIL each are assumed to deliver ~125% of one core's throughput.

HTEfficiencyShift (default: 2) sets how much more efficient HT is assumed to be than no-HT:
1: 50 %
2: 25 %
3: 12.5 %
4: 6.25 %
5: 3.125 %

With the default, each fully busy HT is charged 50 + 12.5 = 62.5 %USED (2 x 62.5 = 125).
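A sketch of that charging model for a core whose two HTs are both busy (my reading of the slide's numbers, not the vmkernel code; the helper name is mine):

```python
# Sketch: with HTEfficiencyShift = s, HT is assumed 100 / 2**s percent
# more efficient than a single thread; each busy HT is charged half a
# core plus its half of that efficiency bonus.
def used_per_ht(util_pct=100.0, ht_efficiency_shift=2):
    ht_bonus_pct = 100.0 / (2 ** ht_efficiency_shift)  # default 2 -> 25 %
    return util_pct / 2 + (util_pct / 100.0) * (ht_bonus_pct / 2)

print(used_per_ht())      # 62.5 -- matches the ~62.35 %USED per vCPU in the esxtop screen earlier
print(2 * used_per_ht())  # 125.0 -- the aggregate core throughput
```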

CPU Usage Accounting: Usage vs. Utilization

Power Management: Umbrella Term

P-States options, aka: Power Regulator, CPU Power Management, EIST
Deep C-States

Power Management refresher ...

P-State = voltage / frequency point
C-State = idle state; running (C0) or varying degrees of stuff turned off (C1-Cn)

[diagram: frequency ladder from P0 (Turbo Boost) down to P13, alongside the C-states C0 and C1-Cn]

C-State Transition

Waking a PCPU from C1 takes ~1µs.

Deep C-State Transition

Waking a PCPU from a deep C-State such as C6 takes ~30µs.
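Why the exit latency matters for latency-sensitive workloads: every halt/wakeup cycle pays it before the vCPU runs again. A back-of-the-envelope example using the latencies above (the wakeup rate is a made-up figure):

```python
C1_EXIT_US = 1    # ~1 µs exit latency (from the slides)
C6_EXIT_US = 30   # ~30 µs exit latency (from the slides)

wakeups_per_second = 10_000  # hypothetical interrupt-heavy workload

# fraction of each second spent just waking the PCPU back up
print(wakeups_per_second * C1_EXIT_US / 1e6)  # 0.01 -> 1 % from C1
print(wakeups_per_second * C6_EXIT_US / 1e6)  # 0.3  -> 30 % from C6
```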

Power Management Profiles (Dell)

ESXi Power Management Policy: Only affects what's presented from the BIOS

Power Management refresher ... Who controls what? (→ = allow control / use)

• BIOS → ESXi: hands control of P-States and deep C-States to ESXi
• ESXi → CPU: uses P-States and HLT / C1-Cn
• VM / guest → CPU: uses HLT / C1

ESXi Power Management Policy: Only affects what's presented from the BIOS (Dell terminology)

System Profile → "Performance Per Watt (DAPC)", "Performance Per Watt (OS)", "Performance", "Dense Configuration", "Custom"

CPU Power Management → "System DBPM (DAPC)", "OS DBPM", "Maximum Performance" (controls P-States)

C States → "Enabled", "Disabled" (controls C-States)

Which BIOS policy am I running on?

Most likely "Dynamic"; very likely "Performance".

Which BIOS policy am I running on? Most likely "Dynamic"

4:30:58pm up 2 min, 1276 worlds, 0 VMs, 0 vCPUs; CPU load average: 0.02, 0.00, 0.00
Power Usage: 94W, Power Cap: N/A
PSTATE MHZ:

CPU %USED %UTIL %C0 %C1 %C2 %A/MPERF
0   0.3   0.7   1   23  76  50.0
1   0.0   0.0   0   0   100 50.1
2   0.1   0.2   0   6   94  50.0
3   0.0   0.0   0   0   100 50.1
4   5.2   10.4  10  5   85  50.0
5   0.0   0.0   0   5   95  51.0
6   0.0   0.1   0   3   97  50.0
7   0.0   0.0   0   0   100 50.0
8   0.1   0.4   0   16  84  50.0
9   0.0   0.0   0   0   100 50.0
10  0.0   0.0   0   0   100 50.0
(…)

Which BIOS policy am I running on? Most likely "Performance"

4:38:51pm up 1 min, 1276 worlds, 0 VMs, 0 vCPUs; CPU load average: 0.02, 0.00, 0.00
Power Usage: 142W, Power Cap: N/A
PSTATE MHZ:

CPU %USED %UTIL %C0 %C1 %A/MPERF
0   0.0   0.1   0   100 108.3
1   0.1   0.1   0   100 108.4
2   0.1   0.1   0   100 108.3
3   0.0   0.1   0   100 108.4
4   0.0   0.0   0   100 108.3
5   18.0  16.7  17  83  108.3
6   0.0   0.1   0   100 108.4
7   0.2   0.2   0   100 108.3
8   0.0   0.0   0   100 108.3
9   0.1   0.2   0   100 108.3
10  0.0   0.1   0   100 108.3
(…)

Which BIOS policy am I running on? Most likely "Custom"

5:09:53pm up 6 min, 827 worlds, 0 VMs, 0 vCPUs; CPU load average: 0.01, 0.01, 0.00
Power Usage: 107W, Power Cap: N/A
PSTATE MHZ: 2401 2400 2300 2200 2100 2000 1900 1800 1700 1600 1500 1400 1300 1200

CPU %USED %UTIL %C0 %C1 %C2 %P0 %P1 %P2 %P3 %P4 %P5 %P6 %P7 %P8 %P9 %P10 %P11 %P12 %P13 %A/MPERF
0   0.2   0.4   0   16  83  62  0   0   0   0   0   0   0   0   0   0    0    0    38   75.2
1   0.0   0.0   0   0   100 0   0   0   0   0   0   0   0   0   0   0    0    0    100  59.3
2   0.0   0.1   0   5   95  15  0   0   0   0   0   0   0   0   0   0    0    0    85   57.9
3   0.0   0.0   0   1   98  38  0   0   0   0   0   0   0   0   0   0    0    0    62   61.5
4   0.0   0.0   0   4   96  5   0   0   0   0   0   0   0   0   0   0    0    0    95   52.0
5   0.0   0.0   0   0   100 0   0   0   0   0   0   0   0   0   0   0    0    0    100  50.3
6   0.1   0.1   0   1   99  7   0   0   0   0   0   0   0   0   0   0    0    0    93   67.7
7   0.1   0.1   0   0   100 99  0   0   0   0   0   0   0   0   0   0    0    0    1    77.7
8   0.0   0.0   0   0   100 10  0   0   0   0   0   0   0   0   0   0    0    0    90   50.8
9   0.0   0.1   0   0   100 0   0   0   0   0   0   0   0   0   0   0    0    0    100  51.6
10  0.0   0.0   0   3   97  8   0   0   0   0   0   0   0   0   0   0    0    0    92   54.0
(…)

The magic of Turbo Boost: Dynamic, supported overclocking

The deeper the C-States the sibling cores are in, the higher the Turbo Boost bin the remaining active cores can reach: with all cores in C0/C1 only TB1 is available; parking cores in C6 unlocks higher bins (TB2 up to e.g. TB7).

Power Policy "playfield"

[spectrum: Bad ... Good ... Optimal*]
• BIOS "Dynamic" pre Haswell
• BIOS "Dynamic" on Haswell+
• BIOS "Maximum / High Performance": same* as Custom BIOS + High Performance ESXi policy (with the exception of C1E)
• Custom BIOS + Custom or Balanced ESXi policy

* a few workloads fare better with more deterministic performance

Power Policy "playfield": Custom done right!

Custom BIOS "Dynamic" + ESXi Balanced
Custom BIOS "Performance" + ESXi Balanced

Power Management Trivia: Frequently Asked Questions

"Why doesn't the frequency I see in Task Manager change?"
• Possibility 1: You are looking at the brand string
• Possibility 2: You are looking in the right place (but the guest OS has no way of knowing)
• Base frequency should be: CPUID.(EAX=16h):EAX[15:00]
– But it seems Windows is getting that from SMBIOS

# grep cpuid ./WinTest.vmx
cpuid.16.eax = "------0100011100011000"
cpuid.coresPerSocket = "6"
cpuid.brandstring = "VMware (R) SuperSecretCPU (R) @ 18.2 GHz"
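The 16-bit value in the cpuid.16.eax override decodes to the base frequency in MHz:

```python
# Decode the processor base frequency from the vmx override above:
# CPUID.(EAX=16h):EAX[15:0] reports the base frequency in MHz.
base_mhz = int("0100011100011000", 2)
print(base_mhz)                # 18200
print(base_mhz / 1000, "GHz")  # 18.2 GHz -- matching the brand string
```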

"I turned off all C-States, why is it still showing C1 in esxtop?"
• You can't turn off C1; you can disable different levels of deep C-States (C2+)

4:38:51pm up 1 min, 1276 worlds, 0 VMs, 0 vCPUs; CPU load average: 0.02, 0.00, 0.00
Power Usage: 142W, Power Cap: N/A
PSTATE MHZ:

CPU %USED %UTIL %C0 %C1 %A/MPERF
0   0.0   0.1   0   100 108.3
1   0.1   0.1   0   100 108.4
2   0.1   0.1   0   100 108.3
3   0.0   0.1   0   100 108.4
4   0.0   0.0   0   100 108.3
5   18.0  16.7  17  83  108.3
6   0.0   0.1   0   100 108.4
7   0.2   0.2   0   100 108.3
8   0.0   0.0   0   100 108.3
(…)

"I won't have any issues if I have everything set to High Performance in the BIOS, right?"
• No, besides possibly:
– PSU redundancy issues
– Power capping
– Temperature
– Firmware bugs
• And definitely ...
– No ability to control P-/deep C-States
– No maximum Turbo Boost frequencies ...

http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-v3-spec-update.pdf

"I can clearly see C2 in perfmon on Windows, why are you lying to me?"
• This is either a perfmon bug or a choice to represent an "enlightened" idle feature
– "Intelligent Timer Tick Distribution (ITTD)"
– needs Windows 2012 R2 / vHW 11
– can be disabled via "monitor.disable_guest_idle_msr = true" (you really shouldn't ever have to ...)

What runs where and when: The high level picture

[diagram: OS / APPs on top of VMM and VMK, all on the CPU]

What runs where and when: Mostly Direct Exec

The vCPU's instruction stream (OS / APPs) executes directly on the PCPU, e.g.:

(…)
0xffffffff810a99d0 <+416>: test  %eax,%eax
0xffffffff810a99d2 <+418>: je    0xffffffff810a9932
0xffffffff810a99d8 <+424>: callq 0xffffffff810c6ed0
0xffffffff810a99dd <+429>: mov   0x82740c(%rip),%r13
0xffffffff810a99e4 <+436>: test  %r13,%r13
0xffffffff810a99e7 <+439>: je    0xffffffff810a9a07
0xffffffff810a99e9 <+441>: mov   0x0(%r13),%rax
0xffffffff810a99ed <+445>: nop
0xffffffff810a99f0 <+448>: mov   0x8(%r13),%rdi
0xffffffff810a99f4 <+452>: add   $0x10,%r13
0xffffffff810a99f8 <+456>: xor   %esi,%esi
0xffffffff810a99fa <+458>: mov   %ebp,%edx
0xffffffff810a99fc <+460>: callq *%rax
0xffffffff810a99fe <+462>: mov   0x0(%r13),%rax
0xffffffff810a9a02 <+466>: test  %rax,%rax
0xffffffff810a9a05 <+469>: jne   0xffffffff810a99f0
0xffffffff810a9a07 <+471>: callq 0xffffffff810c6e40
0xffffffff810a9a0c <+476>: jmpq  0xffffffff810a9932
0xffffffff810a9a11 <+481>: nopl  0x0(%rax)
0xffffffff810a9a18 <+488>: mov   %gs:0xa0e4,%eax
0xffffffff810a9a20 <+496>: mov   %eax,%eax
0xffffffff810a9a22 <+498>: bt    %rax,(%rbx)
(…)

What runs where and when: What about Idle?

(…)
0xffffffff81052c20 <+0>: sti
0xffffffff81052c21 <+1>: hlt   *loud screeching sound*

The VMM traps on the privileged instruction and (with the VMK) puts the vCPU to "sleep":

(…)
0xffffffff81052c20 <+0>: sti
0xffffffff81052c21 <+1>: hlt   *tells VMK to deschedule*

The scheduler then decides what to run next: e.g. another vCPU / world that is ready to run, or ESXi's own idle thread, which drops the PCPU into C1-Cn.

Memory Management Overview: Goals and Objectives

Manage host physical memory to abstract it away from the guest:
• Allow memory over-commitment, providing an illusion of virtual DRAM to the guest
• Hide transient host memory pressure from applications

[diagram: ESXi maps Guest Memory onto Host Physical Memory]

Virtual Memory

Each process (0 ... n) sees its own 256 TB virtual address space.

From the process' point of view, it provides:
• Contiguous address space
• Isolation / Security

Virtual Memory abstracts the "Magic":
• It provides the possibility to overcommit ...

The process is unaware what is backing the virtual address:
• Physical Memory
• Swap File

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 121 Virtual Physical Memory Abstraction …

VM 0

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 122 Virtual Physical Memory Abstraction …

VM 0

VM 1

VM 2

VM 3

VM n

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 123 Virtual Physical Memory Abstraction …

VM 0 6 TB From the VMs point of view, it provides: • Contiguous address space VM 1 6 TB • Isolation / Security

VM 2 6 TB

VM 3 6 TB

VM n 6 TB

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 124 Virtual Physical Memory Abstraction …

VM 0 6 TB From the VMs point of view, it provides: • Contiguous address space VM 1 6 TB • Isolation / Security

Virt. Physical Mem. abstracts VM 2 6 TB Magic

VM 3 6 TB

VM n 6 TB

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 125 Virtual Physical Memory Abstraction …

VM 0 6 TB From the VMs point of view, it provides: • Contiguous address space VM 1 6 TB • Isolation / Security

Virt. Physical Mem. abstracts VM 2 6 TB Magic • It provides the possibility to overcommit …

VM 3 6 TB

VM n 6 TB

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 126 Virtual Physical Memory Abstraction …

VM 0 6 TB From the VMs point of view, it provides: • Contiguous address space VM 1 6 TB 16 TB • Isolation / Security

Virt. Physical Mem. abstracts VM 2 6 TB Magic • It provides the possibility to overcommit …

The VM is unaware what is VM 3 6 TB *** TB backing the physical address • Physical Memory • Swap File VM n 6 TB

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 127 Virtual Physical Memory Abstraction …

VM 0 6 TB *** TB From the VMs point of view, it provides: • Contiguous address space VM 1 6 TB 16 TB • Isolation / Security

Virt. Physical Mem. abstracts VM 2 6 TB Magic *** TB • It provides the possibility to overcommit …

The VM is unaware what is VM 3 6 TB *** TB backing the physical address • Physical Memory • Swap File VM n 6 TB • Or COW, ZIP, BLN

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 128 Virtual Physical Memory Abstraction …

VM 0 6 TB *** TB From the VMs point of view, it provides: • Contiguous address space VM 1 6 TB 16 TB • Isolation / Security

Virt. Physical Mem. abstracts VM 2 6 TB Magic *** TB • It provides the possibility to overcommit …

The VM is unaware what is VM 3 6 TB *** TB backing the physical address • Physical Memory * • Swap File VM n 6 TB *** TB • Or COW, ZIP, BLN
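The overcommitment factor is simply configured guest memory divided by host physical memory. A quick sketch with purely illustrative numbers (not the exact figures on the slide):

```python
# Illustrative overcommit arithmetic: several 6 TB VMs configured on a
# host that only has 16 TB of physical memory.

def overcommit_ratio(vm_sizes_tb, host_tb):
    """Configured guest memory divided by host physical memory."""
    return sum(vm_sizes_tb) / host_tb

# Five 6 TB VMs on a 16 TB host: 30 TB configured vs 16 TB physical.
ratio = overcommit_ratio([6] * 5, 16)
print(f"{ratio:.3f}x overcommitted")  # 1.875x
```

Anything above 1.0x means some guest physical pages must be backed by something other than DRAM (shared, ballooned, compressed, or swapped).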

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 129 Memory Management Overview How to Hide Memory Pressure?

Understanding VM memory usage on ESXi

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 130 Memory Management Overview How to Hide Memory Pressure?

Understanding VM memory usage on ESXi

Total Memory Size

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 131 Memory Management Overview How to Hide Memory Pressure?

Understanding VM memory usage on ESXi

Allocated Memory

Total Memory Size

Free Memory

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 132 Memory Management Overview How to Hide Memory Pressure?

Understanding VM memory usage on ESXi

Active Memory

Allocated Memory

Total Memory Size Idle Memory

Free Memory

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 133 Memory Management Overview How to Hide Memory Pressure?

Understanding VM memory usage on ESXi

Active Memory

Allocated Memory

Total Memory Size Idle Memory

Free Memory

Reclaim memory from a VM if it is using more than it is entitled to. • Entitlement depends on configuration (reservation / shares / limit). • Techniques to reclaim memory from VMs include: – Page sharing > Ballooning > Compression > Host swapping – Sharing breaks host large pages
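As a rough mental model of that escalation order, the sketch below ties reclamation techniques to how far host free memory has fallen. The thresholds here are made up for illustration; ESXi derives its actual memory states (high / clear / soft / hard / low) from minFree, and the state machine differs by version:

```python
# Simplified sketch (NOT the actual ESXi mem-state machine) of how memory
# reclamation escalates as host free memory drops relative to minFree.

def reclamation_actions(free_pct_of_minfree):
    """Return the reclamation techniques active at a given free-memory level.

    Thresholds are illustrative only; page sharing runs opportunistically
    regardless of pressure.
    """
    actions = ["page sharing"]               # always on, opportunistic
    if free_pct_of_minfree < 100:
        actions.append("break host large pages for sharing")
    if free_pct_of_minfree < 64:
        actions.append("ballooning")
    if free_pct_of_minfree < 32:
        actions.extend(["compression", "host swapping"])
    return actions

print(reclamation_actions(50))   # sharing + large-page breaking + ballooning
```

The ordering matters: ballooning lets the guest decide what to give up, while compression and host swapping are guest-opaque and hurt the most.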

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 134 Active Memory Not the same as guest stats!


!=

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 138 Active Memory aka Touched

ESXi VM level heuristic • Weighted, moving average • OS / VMTools independent • “Memory Sampling”

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 139 Active Memory aka Touched

VM mapped memory ESXi VM level heuristic • Weighted, moving average • OS / VMTools independent • “Memory Sampling”

100 x 4 KB Un-maps 100 random pages over the entire VM's mapped address space

4 KB 4 KB 4 KB 4 KB 4 KB 4 KB …

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 140 Active Memory aka Touched

VM mapped memory ESXi VM level heuristic • Weighted, moving average • OS / VMTools independent • “Memory Sampling”

100 x 4 KB / min Un-maps 100 random pages over the entire VM's mapped address space

Monitors R/W for a minute (access traps to the VMM) 4 KB 4 KB 4 KB 4 KB 4 KB 4 KB …

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 141 Active Memory aka Touched

VM mapped memory ESXi VM level heuristic • Weighted, moving average • OS / VMTools independent • “Memory Sampling”

100 x 4 KB / min Un-maps 100 random pages over the entire VM's mapped address space

Read Read Write Monitors R/W for a minute (access traps to the VMM) 4 KB 4 KB 4 KB 4 KB 4 KB 4 KB …

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 142 Active Memory aka Touched

VM mapped memory ESXi VM level heuristic • Weighted, moving average • OS / VMTools independent • “Memory Sampling”

100 x 4 KB / min Un-maps 100 random pages over the entire VM's mapped address space

Monitors R/W for a minute (access traps to the VMM) 4 KB 4 KB 4 KB 4 KB 4 KB 4 KB … After one minute, re-maps all remaining pages, starts again
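The sampling procedure above can be simulated in a few lines. The 100 pages/minute figure is from the slide; everything else (page counts, the touched set, the simple scale-up estimator without VMware's exact weighting) is illustrative:

```python
# Toy model of ESXi "memory sampling": invalidate the mapping of a small
# random sample of the VM's pages, count how many get touched (and thus
# trap to the VMM) within the epoch, and scale the hit rate up.
import random

def sample_active(touched_pages, mapped_pages, sample_size=100):
    """Estimate active pages from one sampling epoch."""
    sample = random.sample(range(mapped_pages), sample_size)
    hits = sum(1 for p in sample if p in touched_pages)
    return hits / sample_size * mapped_pages   # estimated active pages

random.seed(1)
mapped = 1_000_000                    # 4 KB pages, ~3.8 GB mapped
touched = set(range(250_000))         # guest actually touches 25% this minute
est = sample_active(touched, mapped)
print(f"estimated active ≈ {est / mapped:.0%} of mapped memory")
```

With only 100 samples the estimate is noisy per epoch, which is one reason ESXi smooths it with a weighted moving average across epochs.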

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 143 Active Memory vs. Consumed

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 144 Active Memory What to trust?

active consumed

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 145 Active Memory What to trust?

active consumed

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 146 Active Memory What to trust?

active consumed

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 147 Active Memory What to trust?

active consumed

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 148 Active Memory What to trust?

active consumed

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 149 Active Memory What to trust?

active consumed

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 150 Active Memory What to trust?

active consumed

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 151 Active Memory What to trust?

active consumed

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 152 Guest Memory Metrics In a nutshell


DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 157 Active Memory Guest's working set tends to be between active and consumed

active guest WS consumed

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 158 Active Memory Guest WS might over-report (greedy app)

active guest WS

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 159 Active Memory But guest WS will not under-report

guest WS active consumed

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 160 Active Memory Not the end-all of guest workload estimation

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 161 Hierarchical Resource Groups From an ESXi perspective

host The host owns all resources

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 162 Hierarchical Resource Groups From an ESXi perspective

host The host owns all resources

Those are distributed by hierarchical resource groups

system vim iofilters user

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 163 Hierarchical Resource Groups From an ESXi perspective

host The host owns all resources

Those are distributed by hierarchical resource groups

system vim iofilters user

minfree kernel helper ft drivers vmotion …

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 164 Hierarchical Resource Groups From an ESXi perspective

host The host owns all resources

Those are distributed by hierarchical resource groups

system vim iofilters user

minfree kernel helper ft drivers vmotion …

vmkboot CpuSched Init …

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 165 Hierarchical Resource Groups From an ESXi perspective

host The host owns all resources

Those are distributed by hierarchical resource groups

system vim iofilters user Consumers can demand (request) resources

minfree kernel helper ft drivers vmotion …

vmkboot CpuSched Init …

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 166 Hierarchical Resource Groups From an ESXi perspective

host vCenter shows the sum of all user resources as:

Total Reservation Capacity

system vim iofilters user

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 167 Hierarchical Resource Groups From an ESXi perspective

host vCenter shows the sum of all user resources as:

Total Reservation Capacity

system vim iofilters user Global Resource Pools are then distributed back to hosts into Local RPs • Based on VMs demand pool1 pool2 pool3 pool4 …

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 168 Hierarchical Resource Groups From an ESXi perspective

host vCenter shows the sum of all user resources as:

Total Reservation Capacity

system vim iofilters user Global Resource Pools are then distributed back to hosts into Local RPs • Based on VMs demand pool1 pool2 pool3 pool4 …

vm.vmid vm.vmid vm.vmid …

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 169 Hierarchical Resource Groups From an ESXi perspective

user Local Resource Groups are created and incrementally numbered when clients are instantiated: pool1 pool15 pool231 pool430 … • VM starts / vMotions etc. • Based on VMs demand

vm.vmid vm.vmid …

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 170 Hierarchical Resource Groups From an ESXi perspective

user Local Resource Groups are created and incrementally numbered when clients are instantiated: pool1 pool15 pool231 pool430 … • VM starts / vMotions etc. • Based on VMs demand

The local hierarchy is equal

vm.vmid vm.vmid … vm.vmid pool321 to the global one • Check for VM / LRG siblings

vm.vmid vm.vmid …

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 171 Hierarchical Resource Groups From an ESXi perspective

user Local Resource Groups are created and incrementally numbered when clients are instantiated: pool1 pool15 pool231 pool430 … • VM starts / vMotions etc. • Based on VMs demand

The local hierarchy is equal

vm.vmid vm.vmid … vm.vmid pool321 to the global one • Check for VM / LRG siblings

VM groups have multiple leaf

vmm uw ... vm.vmid vm.vmid … consumers • vmid is local, not global

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 172 Hierarchical Resource Groups Both Memory and CPU resources

Memory

host mem.resv Reservation mem.limit Limit mem.shares Shares mem.resvLimit Expandable*

system vim iofilters user

CPU pool1 pool2 pool3 pool4 … cpu.resv Reservation cpu.limit Limit cpu.shares Shares cpu.resvLimit Expandable*

vm.vmid vm.vmid vm.vmid …
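The reservation / limit / shares controls above can be sketched as a one-round, share-proportional split of host capacity among sibling groups. This is a deliberate simplification: the real ESXi allocator computes entitlements iteratively and handles nested pools and expandable reservations:

```python
# Hedged sketch of share-based CPU distribution among sibling groups:
# each group first receives its reservation, the remaining capacity is
# split in proportion to shares, and the result is capped by the limit.

def distribute(capacity_mhz, groups):
    """groups: dict name -> (resv, limit, shares); limit of -1 = unlimited.

    Assumes the reservations fit within capacity (they must, to admit).
    """
    alloc = {n: resv for n, (resv, _, _) in groups.items()}
    spare = capacity_mhz - sum(alloc.values())
    total_shares = sum(s for _, _, s in groups.values())
    for n, (resv, limit, shares) in groups.items():
        extra = spare * shares / total_shares
        cap = float("inf") if limit == -1 else limit
        alloc[n] = min(resv + extra, cap)
    return alloc

# Shares roughly matching the system/user split seen in sched-stats:
out = distribute(10000, {"system": (500, -1, 500), "user": (0, -1, 9000)})
print(out)  # {'system': 1000.0, 'user': 9000.0}
```

In practice demand matters too: a group only gets what it actually requests, and unused entitlement flows to its siblings.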

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 173 Tools

ESXi CLI (via SSH)

sched-stats memstats esxtop

… for CPU … for Memory … for comparison

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 174 Tools sched-stats cmdline for local groups (no VMs)

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 175 Tools sched-stats cmdline for local groups (no VMs)

# sched-stats -t groups | awk 'NR == 1

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 176 Tools sched-stats cmdline for local groups (no VMs)

# sched-stats -t groups | awk 'NR == 1 || $2 ~ /^(vm\.|pool)[0-9]+/

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 177 Tools sched-stats cmdline for local groups (no VMs)

# sched-stats -t groups | awk 'NR == 1 || $2 ~ /^(vm\.|pool)[0-9]+/ || /^ +[0-4] /

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 178 Tools sched-stats cmdline for local groups (no VMs)

# sched-stats -t groups | awk 'NR == 1 || $2 ~ /^(vm\.|pool)[0-9]+/ || /^ +[0-4] / {printf ("%-10s%-12s%-9s%-6s%-6s%-6s%-9s%-6s%-9s%-9s%-10s\n" ,$1, $2, $3, $6, $8, $9, $10, $11, $12, $13, $14)}'

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 179 Tools sched-stats cmdline for local groups (no VMs)

# sched-stats -t groups | awk 'NR == 1 || $2 ~ /^(vm\.|pool)[0-9]+/ || /^ +[0-4] / {printf ("%-10s%-12s%-9s%-6s%-6s%-6s%-9s%-6s%-9s%-9s%-10s\n" ,$1, $2, $3, $6, $8, $9, $10, $11, $12, $13, $14)}'

vmgid  name       pgid  vsmps  amin  amax  minLimit  units  ashares  resvMHz  availMHz
0      host       0     933    1600  1600  1600      pct    4096000  5232     33168
1      system     0     659    10    -1    -1        pct    500      288      33168
2      vim        0     271    4944  -1    -1        mhz    500      4344     33768
3      iofilters  0     3      0     -1    -1        pct    1000     0        33168
4      user       0     0      0     -1    -1        pct    9000     0        33168


DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 183 Tools memstats cmdline for local groups (with VMs)

# memstats -r group-stats -g0 -l2 -s gid:name:min:max::conResv:availResv -u mb | sed -n '/^-\+/,/.*\n/p'

---------------------------------------------------
 gid  name        min    max    conResv  availResv
---------------------------------------------------
   0  host       97823  97823    28917      68907
   1  system     20024     -1    20008      68923
   2  vim            0     -1     3378      68907
   3  iofilters      0     -1       25      68907
   4  user           0     -1     5490      68907
---------------------------------------------------


DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 186 (N)UMA + terminology

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 187 (N)UMA + terminology

DIMMs

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 188 (N)UMA + terminology

DIMMs Socket / Package

0

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 189 (N)UMA + terminology

DIMMs Socket / Package

0 NUMA node

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 190 (N)UMA + terminology

DIMMs Socket / Package 1 0 NUMA node

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 191 (N)UMA + terminology

DIMMs Socket / Package 1 0 NUMA node 2

Socket != NUMA node

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 192 (N)UMA + terminology


DIMMs Socket / Package 1 0 NUMA node 2

Socket != NUMA node

LLC / DIE

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 194 (N)UMA + terminology

DIMMs Socket / Package 1 0 NUMA node 2

Socket != NUMA node

LLC / DIE (CoD, SNC / Zen1/2)

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 195 Importance of Memory Access Latency Jim Gray’s Storage Latency Analogy (slightly adapted)

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 196 Importance of Memory Access Latency Jim Gray’s Storage Latency Analogy (slightly adapted)

You want to calculate a + b and the operands are in:

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 197 Importance of Memory Access Latency Jim Gray’s Storage Latency Analogy (slightly adapted)

You want to calculate a + b and the operands are in:

your head

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 198 Importance of Memory Access Latency Jim Gray’s Storage Latency Analogy (slightly adapted)

You want to calculate a + b and the operands are in:

your head = register / 1 cycle

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 199 Importance of Memory Access Latency Jim Gray’s Storage Latency Analogy (slightly adapted)

You want to calculate a + b and the operands are in:

your head this room = register / 1 cycle

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 200 Importance of Memory Access Latency Jim Gray’s Storage Latency Analogy (slightly adapted)

You want to calculate a + b and the operands are in:

your head this room = = register / 1 cycle L1-L2 / 10 cycles

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 201 Importance of Memory Access Latency Jim Gray’s Storage Latency Analogy (slightly adapted)

You want to calculate a + b and the operands are in:

your head this room this building = = register / 1 cycle L1-L2 / 10 cycles

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 202 Importance of Memory Access Latency Jim Gray’s Storage Latency Analogy (slightly adapted)

You want to calculate a + b and the operands are in:

your head this room this building = = = register / 1 cycle L1-L2 / 10 cycles DRAM / 100 cycles

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 203 Importance of Memory Access Latency Jim Gray’s Storage Latency Analogy (slightly adapted)

You want to calculate a + b and the operands are in:

your head this room this building Finland + Algeria = = = register / 1 cycle L1-L2 / 10 cycles DRAM / 100 cycles

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 204 Importance of Memory Access Latency Jim Gray’s Storage Latency Analogy (slightly adapted)

You want to calculate a + b and the operands are in:

your head this room this building Finland + Algeria = = = = register / 1 cycle L1-L2 / 10 cycles DRAM / 100 cycles Disk / 10^6 cycles

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 205 Importance of Memory Access Latency Numbers based on Intel i7-3770 @ 3.4 GHz

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 206 Importance of Memory Access Latency Numbers based on Intel i7-3770 @ 3.4 GHz

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 207 Importance of Memory Access Latency Numbers based on Intel i7-3770 @ 3.4 GHz

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 208 Importance of Memory Access Latency Numbers based on Intel i7-3770 @ 3.4 GHz

DRAM

IMC QPI access size cycles ns

L3 / Last Level Cache

L2 L2 L2 L2 L1 L1 L1 L1

core core core core 0 1 2 3

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 209 Importance of Memory Access Latency Numbers based on Intel i7-3770 @ 3.4 GHz

DRAM

IMC QPI access size cycles ns

L3 / Last Level Cache

L2 L2 L2 L2 L1 L1 L1 L1

core core core core 0 1 2 3 L1 32 KB 4-5 1.5

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 210 Importance of Memory Access Latency Numbers based on Intel i7-3770 @ 3.4 GHz

DRAM

IMC QPI access size cycles ns

L3 / Last Level Cache

L2 L2 L2 L2 L1 L1 L1 L1 L2 256 KB 12 4 core core core core 0 1 2 3 L1 32 KB 4-5 1.5

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 211 Importance of Memory Access Latency Numbers based on Intel i7-3770 @ 3.4 GHz

DRAM

IMC QPI access size cycles ns

L3 / Last Level Cache

L3 8 MB 30 10 L2 L2 L2 L2 L1 L1 L1 L1 L2 256 KB 12 4 core core core core 0 1 2 3 L1 32 KB 4-5 1.5

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 212 Importance of Memory Access Latency Numbers based on Intel i7-3770 @ 3.4 GHz

DRAM

IMC QPI access size cycles ns

L3 / Last Level Cache

L3 8 MB 30 10 L2 L2 L2 L2 L1 L1 L1 L1 L2 256 KB 12 4 core core core core 0 1 2 3 L1 32 KB 4-5 1.5

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 213 Importance of Memory Access Latency Numbers based on Intel i7-3770 @ 3.4 GHz

DRAM

IMC QPI access size cycles ns

L3 / Last Level Cache DRAM GBs 30+ 66*

L3 8 MB 30 10 L2 L2 L2 L2 L1 L1 L1 L1 L2 256 KB 12 4 core core core core 0 1 2 3 L1 32 KB 4-5 1.5
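The ns column is just the cycle count divided by the clock rate; at 3.4 GHz one cycle is about 0.29 ns. A quick conversion (the slide's ns values are rounded):

```python
# Converting cache-latency cycle counts to nanoseconds at 3.4 GHz:
# t_ns = cycles / f_GHz. Cycle counts are the ones from the slide's table.

FREQ_GHZ = 3.4

def cycles_to_ns(cycles):
    return cycles / FREQ_GHZ

for level, cycles in [("L1", 5), ("L2", 12), ("L3", 30)]:
    print(f"{level}: {cycles} cycles ≈ {cycles_to_ns(cycles):.1f} ns")
```

DRAM's ~66 ns stands out: once a request misses the whole cache hierarchy, the memory access itself dominates, costing a couple of hundred core cycles.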

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 214 N(UMA) Pre-Opteron/Nehalem

All sockets share the FSB to the Northbridge and hence the bandwidth 0 1 2 3 • NB also known as “Memory Controller Hub” or MCH NB Uniform memory access latency between every CPU and every DIMM

Von Neumann Bottleneck getting worse with faster CPUs / more RAM


DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 217 NUMA Post-Opteron/Nehalem

Every NUMA node has its own Integrated Memory Controller (IMC) 0 1 • Some AMD CPUs (Bulldozer and newer) have two NUMA nodes per socket / package

Remote access has to go 3 2 over the interconnect and remote CPU’s IMC

• This adds additional latency making local and remote access Non-Uniform


DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 221 NUMA 2 QPI / IC

Local / adjacent / “routed” access latency in ns (2 QPI / IC links):

 CPU |    0     1     2     3
-----+------------------------
  0  |   72   291   323   294
  1  |  296    72   293   315
  2  |  319   296    71   296
  3  |  290   325   300    71

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 222 NUMA 3 QPI / IC

Local / adjacent access latency in ns (3 QPI / IC links):

 CPU |    0     1     2     3
-----+------------------------
  0  |  136   194   198   201
  1  |  194   135   194   196
  2  |  201   194   135   200
  3  |  202   197   198   135
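One way to read such a matrix is to average the diagonal (local) against the off-diagonal (remote) entries, giving a single remote-access penalty figure. Using the 3-link numbers from this slide:

```python
# Remote-access penalty from the 3-link NUMA latency matrix (values in ns,
# taken from the slide): diagonal = local access, off-diagonal = remote.

lat = [
    [136, 194, 198, 201],
    [194, 135, 194, 196],
    [201, 194, 135, 200],
    [202, 197, 198, 135],
]

local = sum(lat[i][i] for i in range(4)) / 4
remote = sum(lat[i][j] for i in range(4) for j in range(4) if i != j) / 12
print(f"local ≈ {local:.0f} ns, remote ≈ {remote:.0f} ns, "
      f"penalty ≈ {remote / local:.2f}x")
```

On the 2-link topology the same calculation gives a far worse ratio, which is exactly why fully connected sockets are preferable for NUMA-heavy workloads.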

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 223 NUMA Basic Migration Types

NUMA clients (vCPUs + memory) are kept local to a home node 0 1 Balance migrations re-assign the home node, memory follows vCPUs!

Locality migrations set home 3 2 node to where the most memory resides


DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 230 NUMA Scheduler Consideration Local Contention vs Remote Access

NUMA migration incurs significant cost. • All pages need to be remapped, i.e. %localMemory initially drops to 0% and slowly recovers. • Copying memory pages across NUMA boundaries costs memory bandwidth.

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 231 NUMA Scheduler Consideration Local Contention vs Remote Access

NUMA migration incurs significant cost. • All pages need to be remapped, i.e. %localMemory initially drops to 0% and slowly recovers. • Copying memory pages across NUMA boundaries costs memory bandwidth.
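The drop-to-0%-and-recover behaviour can be sketched with a toy pacing model. The per-interval migration rate below is invented for illustration; ESXi actually paces page migration adaptively:

```python
# Toy model of locality recovery after a NUMA balance migration: the home
# node moves, %localMemory drops to 0, and a fixed chunk of pages is
# migrated back to the new home node each scheduling interval.

def locality_after_migration(total_pages, pages_per_tick, ticks):
    """%localMemory after a number of 30-second intervals."""
    moved = min(total_pages, pages_per_tick * ticks)
    return moved / total_pages * 100

# 4M x 4 KB pages (16 GB), 50k pages migrated per 30 s interval:
for t in (0, 10, 40, 80):
    print(f"t={t:>3}: {locality_after_migration(4_000_000, 50_000, t):5.1f}% local")
```

The point of the model: the bigger the VM, the longer the locality penalty window after every migration, which is the cost the scheduler weighs against leaving the VM on a contended node.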

[Charts: “Memory Locality & NUMA-migrations (with NUMA Migration)” and “Memory Locality & NUMA-migrations (No NUMA Migration)”, plotting %local (left axis) against #migrations (right axis) over time in 30-second intervals.]

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 232 vNUMA auto-sizing history We had good(ish) reasons

(…) 2007 2008 2009 2010 2011 2012 2013 2014 (…)

ESX 3.5 ESX 4.0 ESX 4.1 ESXi 5.0 ESXi 5.1 ESXi 5.5 ESXi 6.0

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 233 vNUMA auto-sizing history We had good(ish) reasons

(…) 2007 2008 2009 2010 2011 2012 2013 2014 (…)

ESX 3.5 ESX 4.0 ESX 4.1 ESXi 5.0 ESXi 5.1 ESXi 5.5 ESXi 6.0

My starting date @ VMware

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 234 vNUMA auto-sizing history We had good(ish) reasons

cpuid.coresPerSocket

(…) 2007 2008 2009 2010 2011 2012 2013 2014 (…)

ESX 3.5 ESX 4.0 ESX 4.1 ESXi 5.0 ESXi 5.1 ESXi 5.5 ESXi 6.0

My starting date @ VMware

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 235 vNUMA auto-sizing history We had good(ish) reasons

CPS in GUI & supported

cpuid.coresPerSocket

(…) 2007 2008 2009 2010 2011 2012 2013 2014 (…)

ESX 3.5 ESX 4.0 ESX 4.1 ESXi 5.0 ESXi 5.1 ESXi 5.5 ESXi 6.0

My starting date @ VMware

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 236 vNUMA auto-sizing history We had good(ish) reasons

Max vSMP 8

CPS in GUI & supported

cpuid.coresPerSocket

(…) 2007 2008 2009 2010 2011 2012 2013 2014 (…)

ESX 3.5 ESX 4.0 ESX 4.1 ESXi 5.0 ESXi 5.1 ESXi 5.5 ESXi 6.0

My starting date @ VMware

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 237 vNUMA auto-sizing history We had good(ish) reasons

Max vSMP 8

CPS in GUI & supported

cpuid.coresPerSocket vNUMA

(…) 2007 2008 2009 2010 2011 2012 2013 2014 (…)

ESX 3.5 ESX 4.0 ESX 4.1 ESXi 5.0 ESXi 5.1 ESXi 5.5 ESXi 6.0

My starting date @ VMware

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 238 vNUMA auto-sizing history We had good(ish) reasons

Max vSMP 8 Max vSMP 32

CPS in GUI & supported

cpuid.coresPerSocket vNUMA

(…) 2007 2008 2009 2010 2011 2012 2013 2014 (…)

ESX 3.5 ESX 4.0 ESX 4.1 ESXi 5.0 ESXi 5.1 ESXi 5.5 ESXi 6.0

My starting date @ VMware

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 239 vNUMA auto-sizing history We had good(ish) reasons

numa.vcpu.min = 9

Max vSMP 8 Max vSMP 32

CPS in GUI & supported

cpuid.coresPerSocket vNUMA

(…) 2007 2008 2009 2010 2011 2012 2013 2014 (…)

ESX 3.5 ESX 4.0 ESX 4.1 ESXi 5.0 ESXi 5.1 ESXi 5.5 ESXi 6.0

My starting date @ VMware

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 240 vNUMA auto-sizing history We had good(ish) reasons

numa.vcpu.min = 9

Max vSMP 8 Max vSMP 32

CPS in GUI & supported

cpuid.coresPerSocket vNUMA

(…) 2007 2008 2009 2010 2011 2012 2013 2014 (…)

ESX 3.5 ESX 4.0 ESX 4.1 ESXi 5.0 ESXi 5.1 ESXi 5.5 ESXi 6.0

My starting date @ VMware

cpuid.coresPerSocket → numa.vcpu.maxPerVirtualNode
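The rename above corresponds to VM configuration options. A hypothetical .vmx fragment showing the settings discussed on this slide (option names are the real ones; the values are purely illustrative and workload-dependent):

```
numvcpus                    = "16"
cpuid.coresPerSocket        = "8"   # guest-visible cores per socket (topology / licensing)
numa.vcpu.min               = "9"   # default: vNUMA is only exposed at 9+ vCPUs
numa.vcpu.maxPerVirtualNode = "8"   # sizes vNUMA nodes independently of coresPerSocket
```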

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 241 Two levels of abstraction Virtual and Physical Proximity Domains

CPU Topology Doesn’t influence ESXi sched.

Might influence Guest / App sched.

vNUMA Topology VPD doesn’t affect ESXi sched. CPS PPD does define ESXi NUMA sched. VPD • AKA NUMA client

PPD

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 242 Two levels of abstraction Virtual and Physical Proximity Domains

CPU Topology Doesn’t influence ESXi sched.

Might influence Guest / App sched.

vNUMA Topology VPD doesn’t affect ESXi sched. C C C C C C PPD does define ESXi NUMA sched. VPD VPD • AKA NUMA client

PPD PPD

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 243 Two levels of abstraction Virtual and Physical Proximity Domains

CPU Topology Doesn’t influence ESXi sched.

Might influence Guest / App sched.

vNUMA Topology VPD doesn’t affect ESXi sched. CPS PPD does define ESXi NUMA sched. VPD • AKA NUMA client

PPD PPD

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 244 Two levels of abstraction Virtual and Physical Proximity Domains

CPU Topology Doesn’t influence ESXi sched.

Might influence Guest / App sched.

vNUMA Topology VPD doesn’t affect ESXi sched. CPS PPD does define ESXi NUMA sched. VPD VPD • AKA NUMA client

PPD PPD

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 246 Case Study: Project Pacific Running Compute Intensive Benchmark

https://blogs.vmware.com/performance/2019/10/how-does-project-pacific-deliver-8-better-performance-than-bare-metal.html

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 247 Case Study: Project Pacific Running Compute Intensive Benchmark

43.5% local memory access on native Linux

https://blogs.vmware.com/performance/2019/10/how-does-project-pacific-deliver-8-better-performance-than-bare-metal.html

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 248 Case Study: Project Pacific Running Compute Intensive Benchmark

43.5% local memory access on native Linux

99.2% local memory access on Pacific Cluster

https://blogs.vmware.com/performance/2019/10/how-does-project-pacific-deliver-8-better- performance-than-bare-metal.html

IO stuff

Herculean Network IO

vSphere 6.0 achieves Line Rate throughput on a 40GigE NIC

Throughput ↑ from 20.5 to 35.5 Gbps

CPU Used ↓ from 36 to 13 % (per Gbps)

http://blogs.vmware.com/performance/2015/04/network-improvements-vsphere-6-boost-performance-40g-nics.html

Virtual NIC coalescing – recap: Trading CPU Cycles for Lower Latency

By default, vSphere tunes for lower CPU usage by batching I/O operations
• By default, that is also the case for the RX and TX path on vNICs (here vmxnet3)
• When disabled:
  – Every packet received interrupts immediately
  – Every packet will be issued immediately
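The batching trade-off can be illustrated with a small simulation. This is a sketch, not VMware's actual coalescing algorithm: it assumes an interrupt fires a fixed window after the first packet of a batch, delivering everything that arrived by then.

```python
def deliver(arrival_times_us, window_us):
    """Group sorted packet arrival times (microseconds) into interrupts.

    An interrupt fires `window_us` after the first undelivered packet;
    all packets that have arrived by then are delivered in one batch.
    With window_us=0 (coalescing disabled) every packet interrupts
    immediately.
    Returns a list of (fire_time_us, packets_in_batch) tuples.
    """
    interrupts = []
    i = 0
    while i < len(arrival_times_us):
        fire = arrival_times_us[i] + window_us
        batch = [t for t in arrival_times_us[i:] if t <= fire]
        interrupts.append((fire, len(batch)))
        i += len(batch)
    return interrupts

packets = [0, 10, 20, 30, 200]
# Coalesced: fewer interrupts (less CPU), but early packets wait
burst = deliver(packets, 50)   # [(50, 4), (250, 1)]
# Disabled: one interrupt per packet, no added latency
direct = deliver(packets, 0)   # [(0, 1), (10, 1), (20, 1), (30, 1), (200, 1)]
```

Two interrupts instead of five in the coalesced case, at the price of up to 50 µs of added delivery latency for the first packet of each batch.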

Possible Latency Optimizations: Network latency optimization on the VM level

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 258 Possible Latency Optimizations Network Network latency optimization on the VM level


Disable LRO (Large Receive Offload)
• Host wide: Net.Vmxnet3SwLRO = false
• Small packets are no longer concatenated into larger ones

Disable (vNIC) coalescing
• VMX option: ethernetX.coalescingScheme = disabled
• Issue TX immediately and immediately interrupt on RX

Disable Dynamic queueing
• NetQueue feature: load balances and combines less used queues
• Disabling guarantees a single queue for the VM
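Put together, the first two knobs look like the sketch below. The host-wide LRO setting lives in the host's Advanced System Settings; the coalescing scheme is a per-VM .vmx option, with ethernet0 standing in for whichever vNIC is relevant:

```ini
; Host-wide advanced setting: disable software LRO for vmxnet3 vNICs
Net.Vmxnet3SwLRO = "false"

; Per-VM .vmx option: disable RX/TX coalescing on the first vNIC
ethernet0.coalescingScheme = "disabled"
```

Both trade CPU cycles for latency, so apply them only to VMs that actually need the lower latency.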

Network – Recommendations

Use vmxnet3 Guest Network Driver: Very efficient and required for maximum performance

Evaluate Disabling Interrupt Coalescing: The default mechanism may induce small amounts of latency in favor of throughput

It’s a 10Gb+ World: 1Gb saturation is real; more bandwidth is required today, especially in light of vSAN and Monster VM vMotion

Use Latency Sensitivity High ‘Cautiously’: While it can reduce latency and jitter in the 10us use case, it comes at a cost (core reservations, etc.). Requires FULL CPU and MEM reservation – or it won’t work and won’t tell you

Herculean Storage IO

• More than 1 Million IOPS from 1 VM

Hypervisor: vSphere 5.1
Server: HP DL380 Gen8
CPU: 2 x Intel Xeon E5-2690, HT disabled
Memory: 256GB
HBAs: 5 x QLE2562
Storage: 2 x Violin Memory 6616 Flash Arrays
VM: Windows Server 2008 R2, 8 vCPUs and 48GB
Config: 4K IO size w/ 16 workers

Reference: http://blogs.vmware.com/performance/2012/08/1millioniops-on-1vm.html

Bare-metal to virtual TPC-C* gap then and now(ish)

then: -30%
now(ish): -10%

* Non-compliant, fair-use implementation of the workload on Oracle 12c. Not comparable to official results.

Scaling out vs. up on the same host to amortize overhead

TPC-E on native HP Proliant DL 385 G8 vs. TPC-VMS on the same host, virtualized:
• Bare metal: 1416.37 tpsE
• Virtual, 3 VMs running TPC-VMS: 470.31 + 468.11 + 457.55 = 1395.97 tpsE combined

http://blogs.vmware.com/vsphere/2013/09/worlds-first-tpc-vms-benchmark-result.html http://www.tpc.org/4064 / http://www.tpc.org/5201

The Problem - with Database Logs


Storage I/O latencies are higher in virtual

Usually not a noticeable problem for Data IO
• Long (5+ ms) latency on HDDs
• Random I/O, many threads banging on the same spindle(s)
• Even some SSDs are ~1ms

Not OK for Redo Log access
• Short (<<1ms) latency
• Sequential I/O, single-threaded, write-only
• Typically a write-back cache in the HBA or the array
• Check the Top 5 wait events in Oracle AWR or equivalent database health reports

The Solution - Trade CPU Cycles for Lower Latency


By default, vSphere tunes for lower CPU usage by batching I/O operations

But when sensing low IOPS, vSphere stops batching and switches to low latency mode
• For lowest latency, put the log device on a vSCSI adapter by itself
• Batching and coalescing is on a per-vSCSI-bus, not per-device(!) basis
• Explicit tuning can prove more effective though


Explicit workaround on the issuing path:
• Default is asynchronous request passing from the vSCSI adapter to the VMkernel
  – But it dynamically adjusts for the low IOPS case
• To explicitly force immediate initiation of I/O operations (sync):
  – scsiNNN.reqCallThreshold = “1”

Explicit workaround on the completion path:
• Default is coalescing of virtual interrupts
  – vSphere automatically suspends interrupt coalescing for low IOPS workloads
• Or explicitly disable virtual interrupt coalescing:
  – For PVSCSI: scsiNNN.intrCoalescing = “False”
  – For other vHBAs: scsiNNN.ic = “False”
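Combined, a low-latency redo log setup might look like the fragment below. This is a sketch: it assumes the log vmdk sits alone on a PVSCSI adapter at scsi1, per the earlier recommendation to give the log device its own vSCSI adapter.

```ini
; Hypothetical .vmx fragment: log device alone on its own PVSCSI adapter
scsi1.virtualDev = "pvscsi"
; Issue I/Os from the vSCSI adapter to the VMkernel immediately (sync):
scsi1.reqCallThreshold = "1"
; Disable virtual interrupt coalescing on the PVSCSI adapter:
scsi1.intrCoalescing = "False"
; For non-PVSCSI vHBAs the equivalent option is scsiNNN.ic = "False"
```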

Myth Revisited: RDM versus VMFS

VMFS on par or faster than RDM (approx. 1%)

Reference: http://www.vmware.com/techpapers/2017/sql-server-vsphere65-perf.html

Storage – Recommendations

Use Multiple vSCSI Adapters: Allows for more queues and I/Os in flight

Use the PVSCSI vSCSI Adapter: More efficient, more I/Os per cycle

Don’t Use RDMs: Unless needed for shared disk clustering; no longer a performance advantage

VMware Snapshots Should Be ‘Temporary’: Despite constant performance improvements, snapshots should not live forever (Co-Stop, synchronous consolidation)

Leverage Your Storage OEM’s Integration Guide: They provide necessary guidance around items like multi-pathing

vMotion

vMotion Workflow

Source ESXi Host → vMotion Network → Destination ESXi Host, with a shared Datastore

1. Create VM on Destination
2. Copy Memory
3. Quiesce VM on Source
4. Transfer Device State
5. Resume VM on Destination (Execution Switchover Time of 1 sec)
6. Power Off VM on Source

Memory Copy

Iterative Memory Pre-Copy:
• Phase 0: Copy the VM’s 40GB of memory, trace pages. As we send that memory, the VM dirties 10GB
• Phase 1: Retransmit the dirtied 10GB. In the process, the VM dirties another 3GB
• Phase 2: Send the 3GB. While that transfer is happening, the VM dirties 1GB
• Phase 3: Send the remaining 1GB
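The pre-copy phases above follow a simple geometric convergence, sketched below. This is illustrative only: the 25% dirty rate is derived from the slide's 40GB → 10GB example, and the 1GB switchover threshold is an assumption.

```python
def precopy_phases(memory_gb, dirty_fraction, switchover_gb=1.0):
    """Simulate vMotion-style iterative memory pre-copy.

    Phase 0 sends all memory; each later phase resends what the guest
    dirtied during the previous phase, until the remainder is small
    enough to send during the brief switchover.
    Returns the per-phase transfer sizes in GB.
    """
    phases = []
    to_send = float(memory_gb)
    while to_send > switchover_gb:
        phases.append(to_send)
        # pages dirtied while this phase's data was in flight
        to_send *= dirty_fraction
    phases.append(to_send)  # final copy during switchover
    return phases

# Slide example: 40 GB VM dirtying ~25% of the in-flight amount per pass
phases = precopy_phases(40, 0.25)  # [40.0, 10.0, 2.5, 0.625]
```

If the guest dirties memory as fast as the vMotion network can send it (dirty_fraction ≥ 1), the loop never converges; that is the case SDPS (Stun During Page Send) exists to handle.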

vMotion of Oracle RAC: It’s been working for a while …

Common Issues for Monster VMs

• Trace cost
• LP remap
• Preallocated memory
• RDTSC cost (SDPS) pre 6.5*

Mitigations:
• use ESXi 6.5
• use multi NIC (10Gb+!)

Performance Troubleshooting

How to troubleshoot any issue: No matter how complicated


1. Identify a related system or component that your team is not responsible for

2. Hypothesize that the issue is with that component

3. Assign the issue to the responsible team

4. When proven wrong, go to 1.

Perfectly valid methods to “troubleshoot” or “tune” /s

• Tuning guide for a completely different system
• Some advanced option found on a blog
• Vaguely fitting KB
• etc.

The biggest enemy: the "XY Problem"

1. I have problem X
2. I think it is because of Y
3. I have problem Y
4. Help me solve problem Y
5. Hey! I still have a problem

tl;dr: don’t jump to conclusions

Where to use caution: Believing anybody

“Trust, but verify.”*

* From the Russian proverb: "Доверяй, но проверяй" {Doveryai, no proveryai}

Where to use caution: Comparing hosts, past and present, etc.


Don’t assume newer == better

Identify all differences

Where to use caution: Relying on Traffic Light Dashboards alone


All metrics green? → All good then! (false negative)

Some metrics red? → Something must be broken! (false positive)

Where to use caution: Working through a list of known issues

Very good to start with!
• Don’t spend more than half an hour

Can be from different perspectives
• Application
• Resources, e.g.:
  – CPU contention
  – Memory pressure
  – Disk latency
  – Etc.

Apply different methodologies as needed, e.g. directionally


Top → Down: drill down from the application / its metrics
• app specific / difficult to "profile" the whole path

Bottom → Up: investigate from the resource point of view
• easy to run into false positives / not all resources evenly covered

Recommendation: Bottom Up Checklist first

Ask questions: Good ones, preferably

• What makes you think there is a performance issue?
• Has it ever performed well?
• What has changed since?
• Can it be quantified?
• What else is affected?
• What is the timing?
• Is it reproducible?
• etc.

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 329 Take notes along the way seriously

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 330 Take notes along the way seriously

"Remember kids, the only difference between science and screwing around is writing it down."

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 331 Provide an exact timeline Part of notetaking but often forgotten 2017-11-29 > 23:59 UTC Tried 2017-11-28 everything under the sun 23:00 UTC and wrote Upgrade down nothing

2017-11-29 2017-11-30 07:00 UTC 08:00 Issue first Called noticed GSS

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 332 Be accurate and universal https://xkcd.com/1179/

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 333 SR examples “The case of the unexplained …”

Example 1 – Oracle DB performance: Tales from GSS

Initial SR description:
• Oracle DB on virtual 64bit W2K8 three times slower than physical
• on 32bit W2K8 and 32/64bit RHEL5, only 5% slower than physical
• benchmarked with production equivalent test script


Troubleshooting in support:
• checked logs for errors
• basics like power management, limits, etc.
• research if similar issues have been reported


Reproducing in-house:
• the customer provided two pre-configured VMs
• during the initial run, the 64bit VM performed worse by a factor of 3
• after automating benchmark start and result collection, the factor dropped to 1.6 on avg.


Murphy's law strikes:
• Minor configuration issues (DB not starting, tnsnames changes)
• Initial booking for the lab server ran out and it was re-imaged
• Redeploy to a local box was delayed due to a network issue
• Automation scripts had to be recreated
• Flashback store ran full

Our Oracle DBA configured both VMs with a default config ….


"The more updates or inserts in a workload, the more expensive it is to turn on block checking"

DOAG 2020 NOON2NOON │ ©2020 VMware, Inc. 353 Example 1 – Oracle DB performance Tales from GSS

"The more updates or inserts in a workload, the more expensive it is to turn on block checking“

The benchmark was an insert loop…

Example 1 – Oracle DB performance: In a Nutshell …

• Configuration issue
• No virtualization fault
• ~70 hours
