Improving the DragonFlyBSD Network Stack
Yanmin Qiao [email protected]
DragonFlyBSD project

Nginx performance and latency evaluation: HTTP/1.1, 1KB web object, 1 request/connection
Server: 2x E5-2620v2, 32GB DDR3-1600, Hyperthreading enabled, Intel X550.
  nginx: installed from dports; access log disabled; 16K connections/worker (nginx.conf).
Clients: i7-3770, 16GB DDR3-1600, Hyperthreading enabled, Intel X540; 15K concurrent connections on each client.
MSL on clients and server is changed to 20ms by: route change -net net -msl 20
/boot/loader.conf: kern.ipc.nmbclusters=524288
/etc/sysctl.conf: machdep.mwait.CX.idle=AUTODEEP, kern.ipc.somaxconn=256, net.inet.ip.portrange.last=40000
IPv4 forwarding performance evaluation: 64B packets
pktgen+sink: i7-3770, 16GB DDR3-1600, Hyperthreading enabled, Intel X540.
  /boot/loader.conf: kern.ipc.nmbclusters=5242880
  /etc/sysctl.conf: machdep.mwait.CX.idle=AUTODEEP
forwarder: 2x E5-2620v2, 32GB DDR3-1600, Hyperthreading enabled, Intel X550.
Traffic generator: DragonFlyBSD's in-kernel packet generator, which has no issue generating 14.8Mpps.
Traffic sink: ifconfig ix0 monitor
Traffic: each pktgen targets 208 IPv4 addresses, which are mapped to one link-layer address on the forwarder.
Performance counting: only packets sent by the forwarder are counted.
TOC
1. Use all available CPUs for network processing.
2. Direct input with polling(4).
3. Kqueue(2) accept queue length report.
4. Sending aggregation from the network stack.
5. Per CPU IPFW states.
Use all available CPUs for network processing

Issues of using only a power-of-2 count of CPUs:
- On a system with 24 CPUs, only 16 CPUs will be used. Bad for kernel-only applications: forwarding and bridging.
- Performance and latency of network-heavy userland applications cannot be fully optimized.
- 6, 10, 12 and 14 cores per CPU package are quite common these days.
Use all available CPUs for network processing (24 CPUs, 16 netisrs)

CONFIG.B: 32 nginx workers, no affinity. CONFIG.A.16: 16 nginx workers, affinity. Neither is optimal.

[Bar chart: performance (requests/s, left axis) and latency (ms, right axis: avg, stdev, .99)]
  CONFIG.B:    215907.25 requests/s, 33.11 ms latency
  CONFIG.A.16: 191920.81 requests/s, 32.04 ms latency
Use all available CPUs for network processing

Host side is easy: (RSS_hash & 127) % ncpus

How should the NIC RSS redirect table be set up? Simply table[i] = i % nrings?

First 32 of the 128 entries, with 16 RX rings:
  0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Use all available CPUs for network processing (NIC RSS redirect table)

Issue with table[i] = i % nrings, e.g. 24 CPUs, 16 RX rings. First 48 entries:

Host, (RSS_hash & 127) % ncpus:
  0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
  0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

Current, table[i] = i % nrings, 32 mismatches:
  0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 2 3 4 5 6 7
  8 9 10 11 12 13 14 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Patched, grid-based (grid = ncpus/K, ncpus % K == 0), 16 mismatches:
  0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 2 3 4 5 6 7
  0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 8 9 10 11 12 13 14 15
Use all available CPUs for network processing (result: nginx, 24 CPUs)

CONFIG.A.24: 24 nginx workers, affinity.

[Bar chart: performance (requests/s) and latency (ms: avg, stdev, .99)]
  CONFIG.B:    215907.25 requests/s, 33.11 ms latency
  CONFIG.A.16: 191920.81 requests/s, 32.04 ms latency
  CONFIG.A.24: 210658.8  requests/s, 58.01 ms latency
Use all available CPUs for network processing (result: forwarding)

[Bar chart: forwarding performance (Mpps), 16 CPUs vs 24 CPUs, for fast forwarding and normal forwarding.]
Use all available CPUs for network processing (if_ringmap APIs for drivers)

if_ringmap_alloc(), if_ringmap_alloc2(), if_ringmap_free():
  Allocate/free a ringmap. if_ringmap_alloc2() is for NICs requiring a power-of-2 count of rings, e.g. jme(4).
if_ringmap_align():
  If N_txrings != N_rxrings, makes sure that a TX ring and the respective RX ring are assigned to the same CPU. Required by bce(4) and bnx(4).
if_ringmap_match():
  If N_txrings != N_rxrings, makes sure that the distribution of TX rings and RX rings is more even. Used by igb(4) and ix(4).
if_ringmap_cpumap(), if_ringmap_rdrtable():
  Get ring-to-CPU binding information and the RSS redirect table.
TOC
1. Use all available CPUs for network processing.
2. Direct input with polling(4).
3. Kqueue(2) accept queue length report.
4. Sending aggregation from the network stack.
5. Per CPU IPFW states.
Direct input with polling(4)

Before: requeuing through the netisrN msgport.

netisrN:
  NIC_rxpoll(ifp1)
    NIC_rxeof(ifp1)            [holds RX ring serializer]
      Setup mbuf1; queue mbuf1
      Setup mbuf2; queue mbuf2
      Setup mbuf3; queue mbuf3
    [releases RX ring serializer]
  Dequeue mbuf1: ether_input_handler(mbuf1), ip_input(mbuf1), ip_forward(mbuf1), ip_output(mbuf1), ifp2->if_output(mbuf1)
  Dequeue mbuf2: ether_input_handler(mbuf2), ip_input(mbuf2), ip_forward(mbuf2), ip_output(mbuf2), ifp3->if_output(mbuf2)
  Dequeue mbuf3: ether_input_handler(mbuf3), ip_input(mbuf3), ip_forward(mbuf3), ip_output(mbuf3), ifp2->if_output(mbuf3)

- The RX ring serializer is held mainly for stability, i.e. so the NIC can be brought down safely when there is heavy RX traffic.
- The mbufs have to be requeued to avoid a dangerous deadlock against the RX ring serializer.
- Requeuing causes excessive L1 cache eviction and refetching.
Direct input with polling(4)

After: direct input.

- The driver synchronizes netisrs at stop time.
- Holding the RX ring serializer is then no longer needed.
- Mbuf requeuing is no longer necessary.

netisrN:
  NIC_rxpoll(ifp1)
    NIC_rxeof(ifp1)
      Setup mbuf1: ether_input_handler(mbuf1), ip_input(mbuf1), ip_forward(mbuf1), ip_output(mbuf1), ifp2->if_output(mbuf1)
      Setup mbuf2: ether_input_handler(mbuf2), ip_input(mbuf2), ip_forward(mbuf2), ip_output(mbuf2), ifp3->if_output(mbuf2)
      Setup mbuf3: ether_input_handler(mbuf3), ip_input(mbuf3), ip_forward(mbuf3), ip_output(mbuf3), ifp2->if_output(mbuf3)
Direct input with polling(4) (result: forwarding)

[Bar chart: forwarding performance (Mpps), requeue vs direct, for fast forwarding and normal forwarding.]
TOC
1. Use all available CPUs for network processing.
2. Direct input with polling(4).
3. Kqueue(2) accept queue length report.
4. Sending aggregation from the network stack.
5. Per CPU IPFW states.
Kqueue(2) accept queue length report

Applications, e.g. nginx, utilize the kqueue(2)-reported accept queue length like this:

    do {
        s = accept(ls, &addr, &addrlen);
        /* Setup accepted socket. */
    } while (--accept_queue_length);

The "setup accepted socket" part is time consuming, and it destabilizes and increases request handling latency.
Kqueue(2) accept queue length report

The backlog of the listen(2) socket is the major factor affecting the kqueue(2) accept queue length report.
- Low listen(2) socket backlog, e.g. 32: excessive connection drops.
- High listen(2) socket backlog, e.g. 256: destabilized and increased request handling latency.
Kqueue(2) accept queue length report

The solution:
- Leave the listen(2) backlog high, e.g. 256.
- Limit the accept queue length reported by kqueue(2), e.g. to 32. A simple one-liner change.
Kqueue(2) accept queue length report (result: nginx, 24 CPUs)

FINAL: 24 nginx workers, affinity, accept queue length reported by kqueue(2) limited to 32.

[Bar chart: performance (requests/s) and latency (ms: avg, stdev, .99)]
  CONFIG.B:    215907.25 requests/s, 33.11 ms latency
  CONFIG.A.24: 210658.8  requests/s, 58.01 ms latency
  FINAL:       217599.01 requests/s, 32 ms latency
TOC
1. Use all available CPUs for network processing.
2. Direct input with polling(4).
3. Kqueue(2) accept queue length report.
4. Sending aggregation from the network stack.
5. Per CPU IPFW states.
Sending aggregation from the network stack

- High-performance drivers have already deployed mechanisms to reduce TX ring doorbell writes.
- The effectiveness of the drivers' optimization depends on how the network stack feeds them packets.
Sending aggregation from the network stack

Preconditions:
- All packet sending runs in netisrs.
- Ordered 'rollup': protocols register 'rollup' functions. The TCP ACK aggregation 'rollup' has higher priority; the sending aggregation 'rollup' has lower priority.
- Rollups are called when there are no more netmsgs, or when 32 netmsgs have been handled.

netisrN:
  Dequeue ifp1 RX ring[N] polling msg
  NIC_rxpoll(ifp1, RX ring[N])
    NIC_rxeof(ifp1, RX ring[N])
      Setup mbuf1: ether_input_handler(mbuf1), ip_input(mbuf1), ip_forward(mbuf1), ip_output(mbuf1), ifp2->if_output(mbuf1): queue mbuf1 to ifp2.IFQ[N]
      Setup mbuf2: ether_input_handler(mbuf2), ip_input(mbuf2), ip_forward(mbuf2), ip_output(mbuf2), ifp3->if_output(mbuf2): queue mbuf2 to ifp3.IFQ[N]
  netisrN.msgport has no more msgs:
    rollup1(): flush pending aggregated TCP ACKs
      tcp_output(tcpcb1): ip_output(mbuf3), ifp2->if_output(mbuf3): queue mbuf3 to ifp2.IFQ[N]
      tcp_output(tcpcb2): ip_output(mbuf4), ifp3->if_output(mbuf4): queue mbuf4 to ifp3.IFQ[N]
    rollup2(): flush pending mbufs on IFQs
      ifp2->if_start(ifp2.IFQ[N]): put mbuf1 and mbuf3 on ifp2 TX ring[N], write ifp2 TX ring[N] doorbell register
      ifp3->if_start(ifp3.IFQ[N]): put mbuf2 and mbuf4 on ifp3 TX ring[N], write ifp3 TX ring[N] doorbell register
  Wait for more msgs on netisrN.msgport
Sending aggregation from the network stack (result: forwarding)

[Bar chart: forwarding performance (Mpps), aggregate 4 vs aggregate 16, for fast forwarding and normal forwarding.]
TOC
1. Use all available CPUs for network processing.
2. Direct input with polling(4).
3. Kqueue(2) accept queue length report.
4. Sending aggregation from the network stack.
5. Per CPU IPFW states.
Per CPU IPFW states

IPFW states were protected by an rwlock, even though IPFW rules were per CPU.

[Diagram: cpu0-cpu3 each hold their own rule chain (rule0, rule1, default rule); the state table is globally shared and protected by the rwlock.]
Per CPU IPFW states

- UDP became RSS hash aware in 2014, a prerequisite for per-CPU IPFW states.
- Rwlocked IPFW states work pretty well if the states are relatively stable, e.g. long-lived HTTP/1.1 connections.
- Rwlocked IPFW states suffer a lot with short-lived states because of r/w contention, e.g. short-lived HTTP/1.1 connections.
Per CPU IPFW states

[Diagram: cpu0-cpu3 each hold their own rule chain (rule0, rule1, default rule), their own state table (states 0-5 spread across the CPUs), and a per-CPU state counter; the per-CPU counters feed a loosely updated global state counter.]
Per CPU IPFW states

Loose state counter updates:
- A scheme implemented by Matthew Dillon for DragonFlyBSD's slab allocator; it avoids expensive atomic operations.
- Per-CPU state counters are updated when states are installed on, or removed from, the respective CPU.
- A per-CPU state counter is added to the global state counter once it rises above a threshold.
- The per-CPU state counters are re-merged into the global state counter if the global state counter goes beyond the limit.
- 5% overflow is allowed.
Per CPU IPFW states (limit rule)

[Diagram: a globally shared, locked track counter tier (track count table, e.g. trackcnt0 with count=6); below it, per-CPU track tables (track0-track2), each track holding a state list and a pointer to its shared track counter; below those, the per-CPU state tables and rule chains as before.]
Per CPU IPFW states (result: nginx)

On the server running nginx:
    ipfw add 1 check-state
    ipfw add allow tcp from any to me 80 setup keep-state

[Bar chart: performance (requests/s) and latency (ms: avg, stdev, .99)]
  no IPFW:      210658.8  requests/s,  58.01 ms latency
  percpu state: 191626.58 requests/s,  64.74 ms latency
  rwlock state: 43481.19  requests/s, 153.76 ms latency
Per CPU IPFW states (redirect)

[Diagram: the master state lives in cpuM's state table and the slave state in cpuN's; each links to that CPU's copy of the redirect rule, within the per-CPU rule chains (rule0, redirect rule, rule1, default rule).]

netisrM:
  ip_input(mbuf)
  ipfw_chk(mbuf):
    Match redirect rule
    Create master state; link it to redirect rule[M]; install master state
    Xlate mbuf; rehash mbuf
    Create slave state; link it to redirect rule[N]
    Tuck slave state into mbuf.netmsg; send mbuf.netmsg to netisrN
netisrN:
  Dequeue mbuf.netmsg
  ipfw_xlate_redispatch(mbuf.netmsg)
  ip_input(mbuf)
  ipfw_chk(mbuf): install slave state; continue from the redirect rule's action
  ip_forward(mbuf)
  ip_output(mbuf)
  ipfw_chk(mbuf): match slave state; xlate direction mismatch, no xlate; continue from the redirect rule's action
  ifp->if_output(mbuf)
Performance Impact of Meltdown and Spectre

[Bar chart: performance (requests/s) and latency (ms: avg, stdev, .99) for base, meltdown, spectre and melt+spectre mitigations]
  Requests/s: base 224082.8, meltdown 201859.11, spectre 154723.94, melt+spectre 145031.1.
  Latency-stdev (ms): 4.16, 4.22, 4.78, 5.05.
  Latency-avg and latency-.99 labels (ms): 29.68, 33.46, 35, 39.9, 44.43, 47.45, 51.12, 54.43.