Tuning your cloud: Improving global network performance for applications

Richard Wade Principal Cloud Architect AWS Professional Services, Singapore

© 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Topics

Understanding application performance: Why TCP matters

Choosing the right cloud architecture

Tuning your cloud

Mice: Short connections

The majority of connections on the Internet are mice
- Small number of bytes transferred
- Short lifetime
Mice expect fast service
- Need a fast, efficient startup
- Loss has a high impact, as there is no time to recover

Elephants: Long connections

Most of the traffic on the Internet is carried by elephants
- Large number of bytes transferred
- Long-lived single flows
Elephants expect a stable, reliable service
- Need an efficient, fair steady state
- Time to recover from loss has a notable impact over the connection's lifetime

Transmission Control Protocol (TCP): Startup

Round-trip time (RTT) = two-way delay (this is what you measure with a ping)
In this example, the RTT between the user and the AWS Cloud is 100 ms
Roughly 2 * RTT (200 ms) elapses until the first application request is received
The lower the RTT, the faster your application responds and the higher the possible throughput
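As an illustration that is not part of the original deck, RTT and connection setup time can be observed from any client with ping and curl; example.com below is a placeholder for your own endpoint.

#!/bin/bash
# Sketch (assumption: example.com stands in for your application endpoint)
# ICMP round-trip time, approximating the RTT discussed above
ping -c 5 example.com

# curl timing breakdown: TCP connect time vs. time to first byte
curl -o /dev/null -s -w "DNS: %{time_namelookup}s  TCP connect: %{time_connect}s  TTFB: %{time_starttransfer}s  Total: %{time_total}s\n" https://example.com/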

[Diagram: connection setup takes roughly 1.5 * RTT; once the connection is established, data transfer begins]

Transmission Control Protocol (TCP): Growth

A high RTT negatively affects the potential throughput of your application
For new connections, TCP tries to double its transmission rate with every RTT
This algorithm works well for large-object (elephant) transfers (MB or GB), but not so well for small-object (mice) transfers
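A back-of-the-envelope sketch (an addition to the deck, not taken from it) shows how many round trips slow start needs to deliver a small object; the 1448-byte segment size is an assumption, and the initial window of one segment matches the chart below.

#!/bin/bash
# Sketch: estimate the round trips slow start needs to deliver an object
object_bytes=$((100 * 1024))   # 100 KB object
mss=1448                       # payload bytes per segment (assumption)
cwnd=1                         # initial window of 1 segment, as in the chart below
rtts=0
sent=0
while [ "$sent" -lt "$object_bytes" ]; do
  sent=$((sent + cwnd * mss))  # one congestion window of data per round trip
  cwnd=$((cwnd * 2))           # slow start doubles cwnd every RTT
  rtts=$((rtts + 1))
done
echo "A ${object_bytes}-byte object needs roughly ${rtts} round trips, plus connection setup"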

[Chart: congestion window (segments) vs. time during slow start - cwnd doubles every RTT: 1 packet, 2 packets, 4 packets, and so on]

Transmission Control Protocol (TCP): Loss

Most TCP algorithms use packet loss as a signal that the transmission rate has exceeded the bottleneck bandwidth on a given path
Most will reduce the transmission rate (by up to 50%) when loss or a timeout is experienced, which can have a dramatic effect on overall performance
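One way to see whether a host is actually experiencing loss is to inspect its TCP retransmission counters and per-connection state; this sketch is an addition to the deck and assumes a Linux host with the net-tools and iproute2 utilities installed.

#!/bin/bash
# Sketch: observe retransmissions on a Linux host
# System-wide TCP retransmission counters
netstat -s | grep -i retrans

# Per-connection detail (cwnd, rtt, retransmits) for established sockets
ss -ti state established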

[Chart: congestion window (segments) vs. time - on packet loss, a partial acknowledgement triggers retransmission of the lost packets and cwnd is decreased]

Transmission Control Protocol (TCP): Recovery

After a loss or timeout event, most TCP algorithms enter a congestion avoidance phase
The transmission rate is then increased linearly until further events are experienced
This means that recovery from loss is slow, again affecting overall flow performance
The rate of recovery is highly dependent on a connection's RTT
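For a rough sense of how loss and RTT together cap throughput, the Mathis et al. approximation (throughput ≈ MSS / (RTT * sqrt(loss))) can be evaluated; the formula and the example numbers below are an addition to the deck, not taken from it.

#!/bin/bash
# Sketch: Mathis et al. approximation of steady-state TCP throughput under random loss
awk 'BEGIN {
  mss  = 1448    # payload bytes per segment (assumption)
  rtt  = 0.100   # 100 ms round-trip time
  loss = 0.01    # 1% packet loss
  bps = (mss * 8) / (rtt * sqrt(loss))
  printf "Approximate throughput ceiling: %.1f Mbit/s\n", bps / 1e6
}'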

[Chart: congestion window (segments) vs. time - on packet loss, cwnd is halved and then incremented linearly during congestion avoidance]

TCP: Impact of loss on throughput

[Chart: TCP throughput vs. loss rate - throughput falls off sharply as the loss rate rises from 0% towards 10%]

Source: NET401 - Network Performance: Making Every Packet Count, re:Invent 2017

TCP summary

1) Latency from user to cloud is important
• Time to establish a connection
• Rate at which throughput accelerates
• Limits the maximum potential throughput for a TCP connection*
2) Understand your application
• Small objects and large objects have different requirements
• Define your architecture and tune your infrastructure accordingly
3) TCP has many tunable parameters and variants
More on this later
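The bandwidth-delay product footnoted above is straightforward to compute: it is roughly how much data must be in flight to keep a path full. This sketch is an addition to the deck; the 1 Gbit/s figure is an assumption, and the 220 ms RTT anticipates the cross-Region example later in the deck.

#!/bin/bash
# Sketch: bandwidth-delay product (BDP) = bandwidth * RTT
awk 'BEGIN {
  bandwidth_bps = 1e9    # 1 Gbit/s path (assumption)
  rtt_s = 0.220          # 220 ms round-trip time
  bdp_bytes = bandwidth_bps * rtt_s / 8
  printf "BDP: %.0f bytes (about %.1f MiB of buffer needed to fill the path)\n", bdp_bytes, bdp_bytes / (1024 * 1024)
}'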

* https://en.wikipedia.org/wiki/Bandwidth-delay_product

Solutions: Three things you can influence

1) Latency from your applications to your users: Service architecture
• Region selection, use of edge services
2) Throughput from your infrastructure: Infrastructure design
• Using optimized instance types, edge services
3) Configuration of your infrastructure: Tuning
• Tuning infrastructure parameters to suit your application and deployed architecture

Latency: Move closer to your users

Amazon CloudFront: Improving latency to users

Amazon CloudFront uses a global network of 216 points of presence (205 Edge Locations and 11 Regional Edge Caches) in 84 cities across 42 countries

[Diagram: without CloudFront - viewer requests and responses travel directly between Users and the Origin, with an RTT of 150 ms]

[Diagram: with CloudFront - Users reach a CloudFront edge cache with an RTT of 30 ms; cache misses are forwarded to the Origin as origin requests over a path with an RTT of 120 ms, and the origin response is returned via the cache]

[Diagram: subsequent requests - User #2 is served the cached response directly from the CloudFront edge, with no round trip to the Origin]

Amazon CloudFront: Improving throughput

Consider the impact of reduced RTT on transfer time for small objects
Smaller RTT, faster increase, more throughput, faster transfer
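A hedged way to observe this effect (an addition to the deck) is to compare curl timings against your origin and against a CloudFront distribution, and to check the X-Cache response header for cache hits; the hostnames below are placeholders.

#!/bin/bash
# Sketch: compare time-to-first-byte against an origin and a CloudFront distribution
for host in origin.example.com d1234abcd.cloudfront.net; do   # placeholder hostnames
  echo "== $host =="
  curl -o /dev/null -s -w "TCP connect: %{time_connect}s  TTFB: %{time_starttransfer}s\n" "https://$host/object"
done

# CloudFront reports cache hits and misses in the X-Cache response header
curl -sI https://d1234abcd.cloudfront.net/object | grep -i x-cache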

[Chart: congestion window (segments) vs. time - with a smaller RTT, cwnd increases more quickly and the transfer completes sooner]

Throughput: Get data to your users faster

Why packet throughput matters: Packets per second (PPS) and maximum transmission unit (MTU)

Each packet has processing overhead
Small packets suit latency-sensitive traffic such as real-time systems or transactions
Large packets reduce per-packet overhead and increase overall throughput
A jumbo MTU of 9001 bytes is available within a VPC and between VPC peers

A standard MTU carries a 1448 B TCP payload per packet; a jumbo MTU carries an 8949 B payload
Jumbo MTUs increase the usable data per packet
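To check whether jumbo frames are actually in use end to end, the interface MTU and the path MTU can be inspected (see the documentation linked below); this sketch is an addition to the deck, and eth0 and 10.0.1.10 are placeholders for your interface name and a peer's private IP.

#!/bin/bash
# Sketch: check the interface MTU and verify the path MTU inside a VPC
ip link show eth0 | grep -o 'mtu [0-9]*'

# Send a non-fragmentable ICMP payload of 8973 bytes (8973 + 28 bytes of headers = 9001)
ping -M do -s 8973 -c 3 10.0.1.10

# Discover the path MTU hop by hop
tracepath 10.0.1.10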

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/network_mtu.html

The AWS Nitro System

The Nitro System offloads virtualization from the server to dedicated hardware and software, making nearly 100% of the available compute resources available to customers' workloads

The Nitro System enables performance improvements:
• Improved throughput
• Improved latency
• Improved PPS
• Up to 4x improvement in instance network throughput
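One way to check the rated network performance of a given instance type (an addition to the deck, not part of it) is the EC2 DescribeInstanceTypes API; this assumes the AWS CLI is installed and configured with credentials.

#!/bin/bash
# Sketch: look up the advertised network performance for a few instance types
aws ec2 describe-instance-types \
  --instance-types t3.large c5.9xlarge c5n.18xlarge \
  --query "InstanceTypes[].[InstanceType, NetworkInfo.NetworkPerformance]" \
  --output table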

[Chart: benchmark comparison of throughput (Gbps), latency (microseconds), and packets per second (millions) for Nitro-based instances; Enterprise Strategy Group, 2019]

The AWS Nitro System

T3 instances - up to 5 Gbps network performance

Model        vCPU   Mem (GiB)   Network Performance (Gbps)
t3.nano      2      0.5         Up to 5
t3.micro     2      1           Up to 5
t3.small     2      2           Up to 5
t3.medium    2      4           Up to 5
t3.large     2      8           Up to 5
t3.xlarge    4      16          Up to 5
t3.2xlarge   8      32          Up to 5

Smaller sizes of C5, M5, R5 - up to 10 Gbps network performance; larger instance sizes have sustained 10 or 25 Gbps

Model        vCPU   Mem (GiB)   Network Performance (Gbps)
c5.large     2      4           Up to 10
c5.xlarge    4      8           Up to 10
c5.2xlarge   8      16          Up to 10
c5.4xlarge   16     32          Up to 10
c5.9xlarge   36     72          10
c5.18xlarge  72     144         25

Smaller sizes of C5n - up to 25 Gbps network performance; the largest C5n sizes have sustained 50 or 100 Gbps

Model         vCPU   Mem (GiB)   Network Performance (Gbps)
c5n.large     2      5.25        Up to 25
c5n.xlarge    4      10.5        Up to 25
c5n.2xlarge   8      21          Up to 25
c5n.4xlarge   16     42          Up to 25
c5n.9xlarge   36     96          50
c5n.18xlarge  72     192         100

AWS Outposts

• Industry standard 42U rack

• Fully assembled, ready to be rolled into final position

• Installed by AWS, simply plugged into power and network

• Centralized redundant power conversion unit and DC distribution system for higher reliability, energy efficiency, and easier serviceability

• Redundant active components, including top-of-rack switches and hot spare hosts

AWS Outposts

Nitro hardware and software in your data center

Access via standard AWS APIs and the console

Deploy apps to AWS Outposts using AWS services

AWS Outposts: Improving both latency and throughput
• Consider the impact of reduced RTT and a lower risk of packet loss
• Optimized compute and network performance
• Smaller RTT, faster increase, more throughput, faster transfer

[Charts: congestion window growth and throughput vs. loss rate, revisited - lower RTT and lower loss both improve transfer times]

Tune and optimize your cloud

Amazon Linux Kernel TCP tuning

[Diagram: test setup - EC2 instances in public subnets of VPCs in US-EAST-1 and AP-SOUTHEAST-1, with an RTT of 220 ms between Regions]

Amazon Linux Kernel TCP tuning

Kernel setting                     Default                     Tuned value                  Function
net.core.rmem_max                  212,992                     134,217,728                  Maximum receive socket buffer size in bytes
net.core.wmem_max                  212,992                     134,217,728                  Maximum send socket buffer size in bytes
net.ipv4.tcp_rmem                  4,096 87,380 6,291,456      4,096 87,380 67,108,864      Min, default, max TCP receive buffer size in bytes
net.ipv4.tcp_wmem                  4,096 20,480 4,194,304      4,096 65,536 67,108,864      Min, default, max TCP send buffer size in bytes
net.ipv4.tcp_congestion_control    cubic                       bbr                          TCP congestion control algorithm name (https://tools.ietf.org/html/rfc8312, https://research.google/pubs/pub45646/)
net.core.default_qdisc             pfifo_fast                  fq                           Queueing discipline algorithm name

https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt

Example: Amazon Linux Kernel TCP tuning

#!/bin/bash

# Raise the maximum socket buffer sizes to 128 MiB
sudo sysctl -w net.core.rmem_max=134217728
sudo sysctl -w net.core.wmem_max=134217728

# Raise the min, default, and max TCP receive and send buffer sizes
sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 67108864"
sudo sysctl -w net.ipv4.tcp_wmem="4096 65536 67108864"

# Enable packetization-layer path MTU discovery when an ICMP black hole is detected
sudo sysctl -w net.ipv4.tcp_mtu_probing=1

# Use the fair queueing (fq) queueing discipline
sudo sysctl -w net.core.default_qdisc=fq
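Changes made with sysctl -w do not survive a reboot; one common way to persist and verify them (an addition to the deck) is a drop-in file under /etc/sysctl.d/ - the file name below is arbitrary.

#!/bin/bash
# Sketch: persist the tuned values across reboots and verify them
sudo tee /etc/sysctl.d/99-tcp-tuning.conf > /dev/null <<'EOF'
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 65536 67108864
net.ipv4.tcp_mtu_probing = 1
net.core.default_qdisc = fq
EOF

# Reload all sysctl configuration files and confirm the active congestion control
sudo sysctl --system
sysctl net.ipv4.tcp_congestion_control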

One size does NOT fit all. Experiment safely and methodically in a test environment.

Amazon Linux Kernel TCP tuning: 1 GB transfer

Transfer of 1 GB over a 220 ms RTT path, TCP Cubic in both cases

Kernel settings   Time to transfer 1 GB                                Mbps   Retransmits
Default           ~150 seconds                                         ~55    0
Tuned             ~90 seconds (average of 10 tests was 108 seconds)    ~90    ~1,000

Summary

In this session, we have shown:

1) How to improve application performance by positioning services closer to your users
2) Some examples of AWS services you can use to reduce latency and increase throughput
3) How application performance can be improved using kernel TCP tuning methods

AWS Training and Certification

Explore tailored learning paths for customers and partners
Build cloud skills with 550+ free digital training courses, or dive deep with classroom training
Demonstrate expertise with an industry-recognized credential
Find entry-level cloud talent with AWS Academy and AWS re/Start

aws.amazon.com/training

Thank you!

Richard Wade [email protected]

© 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved.