Disruptor: Using High Performance, Low Latency Technology in the CERN Control System (ICALEPCS 2015, WEB3O03, 21/10/2015)

The problem at hand

[Diagram: devices in the experimental areas (motors, detectors, power converters) feed their data to the CESAR server, which derives the states of devices A, B and C]

• CESAR is used to control the devices in the CERN experimental areas
• These devices produce 2500 event streams
• The business logic on the CESAR server combines the data coming from these streams to calculate the device states
• This concurrent processing must be properly synchronized
What happens when all flows converge?

1 - Some background about the Disruptor

• Created by LMAX, a trading company, to build a high-performance Forex exchange
• The result of several rounds of trial and error
• Challenges the idea that “CPUs are not getting any faster”
• Designed to take advantage of the architecture of modern CPUs, following the concept of “mechanical sympathy”

Feed the cores – avoid cache misses

[Diagram: two cores on a socket, each with its own L1 and L2 cache and a shared L3 cache; sockets are linked by an interconnect and backed by RAM]

Typical sizes and access latencies:
• L1 cache: 64 KB, ~1 ns
• L2 cache: 256 KB, ~3 ns
• L3 cache: 1 to 20 MB, ~12 ns
• RAM: ~65 ns

(A short illustration in code follows.)
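Not part of the original slides, but a minimal Java sketch of why this hierarchy matters: summing the same numbers stored contiguously in an array and scattered across a LinkedList. The array scan walks memory sequentially and is helped by the hardware prefetcher, while the list chases pointers across the heap and misses the caches far more often. A naive timing like this is only indicative (JIT warm-up, allocation and garbage collection also play a role), but the gap is usually large.

```java
import java.util.LinkedList;
import java.util.List;

// Minimal, illustrative comparison of cache-friendly versus cache-hostile traversal.
// Names and sizes are arbitrary; this is not taken from the CESAR or Disruptor code.
public class CacheFriendliness {
    public static void main(String[] args) {
        int n = 5_000_000;
        long[] array = new long[n];
        List<Long> list = new LinkedList<>();
        for (int i = 0; i < n; i++) {
            array[i] = i;
            list.add((long) i);
        }

        long t0 = System.nanoTime();
        long sumArray = 0;
        for (int i = 0; i < n; i++) {
            sumArray += array[i];            // sequential access, prefetch-friendly
        }
        long t1 = System.nanoTime();

        long sumList = 0;
        for (long v : list) {
            sumList += v;                    // pointer chasing, frequent cache misses
        }
        long t2 = System.nanoTime();

        System.out.printf("array: %d ms, linked list: %d ms (sums %d / %d)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, sumArray, sumList);
    }
}
```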

Avoiding false sharing

1 cache line = 64 bytes (on modern x86)

[Diagram: two cores caching the same 64-byte line, which holds two independent variables X and Y; when core 1 writes X, the whole line is invalidated in core 2's cache and must be fetched again through L3 (~12 ns) or across the socket interconnect (~40 ns), even though core 2 only uses Y]

The solution? Padding: keep independently updated fields on separate cache lines (sketched below).
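An illustrative sketch of manual cache-line padding (the field names are invented; this is not the Disruptor's actual source). The unused long fields around the hot counter make the object span more than one 64-byte line, so the frequently written field no longer shares a line with other hot data.

```java
// Illustrative sketch of cache-line padding, in the spirit of the Disruptor's
// padded Sequence. The seven longs on each side of `value` keep it from
// sharing its 64-byte cache line with other frequently written fields.
class PaddedCounter {
    @SuppressWarnings("unused")
    private long p1, p2, p3, p4, p5, p6, p7;   // padding before the hot field
    private volatile long value;               // the frequently updated field
    @SuppressWarnings("unused")
    private long q1, q2, q3, q4, q5, q6, q7;   // padding after the hot field

    long get()              { return value; }
    void set(long newValue) { value = newValue; }
}
```

Note that the JVM is free to reorder fields within a single class, which is why the real Disruptor spreads its padding over an inheritance chain; recent JVMs also offer the @Contended annotation for the same purpose.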

2 - Disruptor architecture

• What is it?
   Can be viewed as a very efficient bounded FIFO queue
   A tool to pass data between threads, designed to avoid contention

The mighty ring buffer

[Diagram: a ring of numbered, preallocated slots; the producer claims slots ahead of the consumer, which follows behind at its own pace]

• Represented internally as an array  cache lines get prefetched
• The sequence number is a padded long  no false sharing
• Memory visibility relies on the volatile sequence number  no locks
• Slots are preallocated  no garbage collection

(A minimal usage sketch follows.)
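A minimal usage sketch against the public Disruptor 3.x API (the LongEvent type and the printing handler are invented for illustration), showing the preallocated slots and the claim / fill / publish cycle on the ring buffer:

```java
import com.lmax.disruptor.EventHandler;
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.util.DaemonThreadFactory;

// Minimal sketch of the Disruptor 3.x API; LongEvent is a made-up event type.
public class RingBufferExample {
    static class LongEvent { long value; }                 // content of one preallocated slot

    public static void main(String[] args) throws Exception {
        int bufferSize = 1024;                             // must be a power of two
        Disruptor<LongEvent> disruptor = new Disruptor<>(
                LongEvent::new, bufferSize, DaemonThreadFactory.INSTANCE);

        // Consumer: runs on its own thread and receives events in sequence order.
        EventHandler<LongEvent> printer = (event, sequence, endOfBatch) ->
                System.out.println("consumed " + event.value);
        disruptor.handleEventsWith(printer);

        RingBuffer<LongEvent> ringBuffer = disruptor.start();

        // Producer: claim a slot, fill the preallocated event, publish the sequence.
        for (long i = 0; i < 10; i++) {
            long seq = ringBuffer.next();
            try {
                ringBuffer.get(seq).value = i;             // no allocation, the slot is reused
            } finally {
                ringBuffer.publish(seq);                   // makes the slot visible to the consumer
            }
        }
        disruptor.shutdown();                              // waits for the consumer to catch up
    }
}
```

Because the buffer size is a power of two, mapping a sequence to a slot is a single bit mask, index = sequence & (bufferSize - 1), rather than a modulo operation.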

Main differences compared to a queue

 Latency and jitter reduced to a minimum
 No garbage collection
 Can have multiple consumers organized in a dependency graph (see the sketch after this list)
 Consumers can use batching to catch up with producers
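A sketch of how such a dependency graph is declared with the Disruptor DSL; the handler names (journal, replicate, businessLogic) are hypothetical. The first two handlers see every event in parallel, and the third only sees an event once both of them have processed it.

```java
import com.lmax.disruptor.EventHandler;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.util.DaemonThreadFactory;

// Sketch of a consumer dependency graph (handler names are invented).
public class ConsumerGraphExample {
    static class LongEvent { long value; }

    public static void main(String[] args) {
        Disruptor<LongEvent> disruptor = new Disruptor<>(
                LongEvent::new, 1024, DaemonThreadFactory.INSTANCE);

        EventHandler<LongEvent> journal       = (e, seq, endOfBatch) -> { /* write to a local log */ };
        EventHandler<LongEvent> replicate     = (e, seq, endOfBatch) -> { /* send to a replica   */ };
        EventHandler<LongEvent> businessLogic = (e, seq, endOfBatch) -> { /* apply the event     */ };

        disruptor.handleEventsWith(journal, replicate)   // first stage, runs in parallel
                 .then(businessLogic);                   // second stage, gated on both
        disruptor.start();
    }
}
```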

Benefits for the architecture

 Performance: no locks, no garbage collection, CPU friendly
 Determinism: the order in which events were processed is known, and messages can be replayed to rebuild the server state
 Simplification of the code base: since the business logic runs on a single thread, there is no need to worry about concurrency

3 - The Disruptor in CESAR

[Diagram: events from the hardware (motors, detectors, power converters) are published to the ring buffer; a single consumer thread derives the device states (State A, State B, ...) and publishes them]

 Each event received from the hardware is stored on the ring buffer
 For each stream of data, only the last value is kept
 We make use of batching
 At the end of a batch, the business logic is triggered and executed on a single thread (see the sketch after this list)
 The new states are published over the network, making sure that we do not block the Disruptor thread if the message broker is down
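A hedged sketch of this pattern against the public EventHandler interface; the event type, the state rule and the hand-off queue are invented stand-ins for the actual CESAR code. Only the latest value per stream is retained, the state calculation runs once per batch on the single Disruptor thread, and publication is handed off so that a slow or unavailable broker cannot block that thread.

```java
import com.lmax.disruptor.EventHandler;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch of the CESAR handler pattern (all names invented for illustration).
public class DeviceStateHandler implements EventHandler<DeviceStateHandler.DeviceEvent> {

    public static class DeviceEvent {                      // content of one ring-buffer slot
        String deviceId;
        double value;
    }

    private final Map<String, Double> lastValues = new HashMap<>();
    // Bounded hand-off queue to a separate publishing thread; offer() never blocks.
    private final BlockingQueue<Map<String, String>> outbox = new ArrayBlockingQueue<>(1000);

    @Override
    public void onEvent(DeviceEvent event, long sequence, boolean endOfBatch) {
        lastValues.put(event.deviceId, event.value);       // keep only the last value per stream
        if (endOfBatch) {
            Map<String, String> states = computeStates(lastValues);  // single-threaded business logic
            if (!outbox.offer(states)) {
                // Broker slow or down: drop or count the update rather than block the Disruptor thread.
            }
        }
    }

    private Map<String, String> computeStates(Map<String, Double> values) {
        Map<String, String> states = new HashMap<>();
        values.forEach((device, v) -> states.put(device, v > 0 ? "ON" : "OFF"));  // placeholder rule
        return states;
    }
}
```

In this sketch a separate thread would drain the outbox and talk to the message broker; the real publication mechanism in CESAR may well differ.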

Conclusions

• The Disruptor, a tool from the world of finance, fits really well in an accelerator control system
• It simplified the CERN CESAR code base while handling the flow of data more efficiently
• It is easily integrated into an existing design to replace a queue or a full pipeline of queues
• The main challenge faced was to switch the developers’ mindset to thinking in asynchronous terms

Useful Links

• The Disruptor main page, with an introduction and code samples: http://lmax-exchange.github.io/disruptor
• Presentation of the Disruptor at QCon: http://www.infoq.com/presentations/LMAX
• An article by Martin Fowler: http://martinfowler.com/articles/lmax.html
• A useful presentation on latency by Gil Tene, who shows that most of what we measure during performance tests is wrong: http://www.infoq.com/presentations/latency-pitfalls
• The new asynchronous logger in Log4j 2: http://logging.apache.org/log4j/2.x/manual/async.html