
Breaking the limits: building large parallel systems with open-source software

Daniel J Blueman, Principal Software Engineer, Numascale AS, [email protected]

quora.org/scaling-x86.pdf

Dream machine

- Lots and lots of:
  - cores, 1000s?
  - memory, terabytes?
  - disks, exabytes?
  - PCIe slots, GPUs, just lots

- What do we get?
  - efficient resource sharing
  - less management/admin
  - no software change

How is this done?

- NUMA address space routing (table and sketch below)

[Figure: one server - four processors with memory controllers, 64 cores, memory stacked from 0GB to 512GB, addresses routed via a DRAM routing table]

DRAM routing table (address bits [47:24]):

  base[47:24]  limit[47:24]  node
  0000h        1FFFh         0
  2000h        3FFFh         1
  4000h        7FFFh         2
  8000h        9FFFh         3
  A000h        ...           4
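The lookup this table implies can be sketched in a few lines of C: compare physical-address bits [47:24] against each base/limit pair and forward to the owning node. The struct layout and names below are illustrative only, not NumaConnect's actual register interface.

#include <stdint.h>
#include <stddef.h>

/* One DRAM routing entry; fields mirror the base[47:24]/limit[47:24]/node
 * columns above. Names are hypothetical. */
struct dram_route {
	uint32_t base;   /* addr[47:24] lower bound, inclusive */
	uint32_t limit;  /* addr[47:24] upper bound, inclusive */
	uint16_t node;   /* owning node */
};

static const struct dram_route routes[] = {
	{ 0x0000, 0x1FFF, 0 },
	{ 0x2000, 0x3FFF, 1 },
	{ 0x4000, 0x7FFF, 2 },
	{ 0x8000, 0x9FFF, 3 },
};

/* Return the node owning a physical address, or -1 if unmapped */
static int route_lookup(uint64_t paddr)
{
	uint32_t key = (paddr >> 24) & 0xFFFFFF;
	size_t i;

	for (i = 0; i < sizeof(routes) / sizeof(routes[0]); i++)
		if (key >= routes[i].base && key <= routes[i].limit)
			return routes[i].node;
	return -1;
}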

[Figure: 2x2 torus, 256 cores - four servers of four processors each, every server with a NumaConnect cache/router, memory stacked from 0GB to 2048GB]

Scalable routing

- 6 ports
- 1-3D torus
- or point-to-point
- routing model (sketch below)
- switchless!
- 1.6us latency
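The talk doesn't spell out the routing model, so the following is only a plausible sketch of dimension-ordered routing on a torus ring, where each hop corrects one coordinate by the shorter wrap-around direction; treat every name here as hypothetical.

/* For one torus dimension of 'ring' nodes: decide which way to forward.
 * Returns 0 when this coordinate already matches, otherwise +1/-1 for
 * the direction with fewer hops (wrap-around considered). */
static int torus_step(int pos, int dst, int ring)
{
	int fwd = (dst - pos + ring) % ring;	/* hops going "up" the ring */

	if (fwd == 0)
		return 0;			/* dimension resolved */
	return (fwd <= ring - fwd) ? 1 : -1;	/* shorter way around */
}

Applied one dimension at a time, each router decides locally from its own tables, which is consistent with the switchless claim above.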

#Oops

- I can only boot 240 cores!
- 8-bit APIC ID limit from the 1990s
- Needed for inter-core communication
- Solution:
  - routable interrupt controller
  - Linux kernel support
  - structured APIC ID space (sketch below)
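A minimal sketch of what structuring the APIC ID space means, assuming the upper bits carry the node number and the lower NUMACHIP_LAPIC_BITS carry the on-server APIC ID; the kernel code later in the talk tests exactly this split, though the helper names and the bit count here are assumptions.

/* Structured APIC ID: <node | local APIC ID>. The value 8 is
 * illustrative; NUMACHIP_LAPIC_BITS in the real driver sets it. */
#define NUMACHIP_LAPIC_BITS	8

static inline unsigned int make_apicid(unsigned int node, unsigned int local)
{
	return (node << NUMACHIP_LAPIC_BITS) | local;
}

/* Destination is on this server iff the node parts match --
 * the same test numachip_send_IPI_one() performs later */
static inline int apicid_is_local(unsigned int apicid, unsigned int self)
{
	return !((apicid ^ self) >> NUMACHIP_LAPIC_BITS);
}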

Interrupt delivery teardown

- 2 APIC writes @ FEE00000h (sketch below)
- converted to broadcast
- vs MMIO space write @ F0000000h
- forwarded over fabric
- rebroadcast at the destination

[Figure: an interrupt MMIO write on proc 1 is picked up by the local cache/router, broadcast over the fabric, and rebroadcast to the processors on the remote server]
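For contrast, a sketch of the conventional path being replaced: an xAPIC IPI takes two MMIO writes into the local APIC page at FEE00000h (destination into ICR2, then command into ICR, which triggers the send), whereas numachip2_apic_icr_write() later in the talk folds destination and vector into a single write. The accessor below is a bare illustration; a real kernel would use writel() on a mapped region.

#include <stdint.h>

#define APIC_BASE	0xFEE00000UL	/* local APIC MMIO page */
#define APIC_ICR	0x300		/* command, low half */
#define APIC_ICR2	0x310		/* destination, high half */

static inline void mmio_write32(uintptr_t addr, uint32_t val)
{
	*(volatile uint32_t *)addr = val;
}

/* Conventional xAPIC IPI: two writes, the second starts delivery */
static void xapic_send_ipi(uint8_t dest, uint8_t vector)
{
	mmio_write32(APIC_BASE + APIC_ICR2, (uint32_t)dest << 24);
	mmio_write32(APIC_BASE + APIC_ICR, vector);
}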

x86-64 address map

64TB   (3FFFFFFFFFFFh)  PCI config
63TB   (3F0000000000h)  64-bit PCI BARs
512GB                   server 4 DRAM
                        server 3 DRAM
                        server 2 DRAM
128GB                   server 1 DRAM
4GB    (FFFFFFFFh)      local APIC @ FEE00000h
                        global timer, local timers, router interrupt controller @ F0000000h
                        legacy PCI config @ E0000000h
                        32-bit PCI BARs @ C0000000h
                        server 1 low DRAM (coherent)
64KB                    legacy IO
0KB
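As an illustration only, here is how a decoder following this map would classify a physical address; the boundaries are read straight off the map above, while the function and enum names are made up.

#include <stdint.h>

enum region { R_LEGACY_IO, R_LOW_DRAM, R_MMIO32, R_SERVER_DRAM,
	      R_PCI_BAR64, R_PCI_CONFIG };

/* Classify a physical address against the global map above */
static enum region classify(uint64_t pa)
{
	if (pa < 0x10000ULL)		return R_LEGACY_IO;	/* below 64KB */
	if (pa < 0xC0000000ULL)		return R_LOW_DRAM;	/* server 1 low DRAM */
	if (pa < 0x100000000ULL)	return R_MMIO32;	/* BARs, PCI config, APIC */
	if (pa < 0x8000000000ULL)	return R_SERVER_DRAM;	/* 4GB-512GB: servers 1-4 */
	if (pa < 0x3F0000000000ULL)	return R_PCI_BAR64;	/* up to 63TB */
	return R_PCI_CONFIG;					/* 63TB-64TB */
}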

linux/arch/x86/kernel/apic/apic_numachip.c:

struct apic apic_numachip = {
	.name		= "NumaConnect2 system",
	.send_IPI	= numachip_send_IPI_one,
};

apic_driver(apic_numachip);

void numachip_wakeup_secondary(int phys_apicid, unsigned long start_rip)
{
	/* INIT then STARTUP (SIPI), routed to the remote core */
	numachip_apic_icr_write(phys_apicid, APIC_DM_INIT);
	numachip_apic_icr_write(phys_apicid, APIC_DM_STARTUP | (start_rip >> 12));
}

void numachip_send_IPI_one(int cpu, int vector)
{
	int local_apicid, apicid = per_cpu(x86_cpu_to_apicid, cpu);
	unsigned int dmode;

	preempt_disable();
	local_apicid = __this_cpu_read(x86_cpu_to_apicid);

	/* Send via local APIC where non-local part matches */
	if (!((apicid ^ local_apicid) >> NUMACHIP_LAPIC_BITS)) {
		unsigned long flags;

		local_irq_save(flags);
		__default_send_IPI_dest_field(apicid, vector, APIC_DEST_PHYSICAL);
		local_irq_restore(flags);
		preempt_enable();
		return;
	}
	preempt_enable();

	dmode = (vector == NMI_VECTOR) ? APIC_DM_NMI : APIC_DM_FIXED;
	numachip_apic_icr_write(apicid, dmode | vector);
}

void numachip2_apic_icr_write(int apicid, unsigned int val)
{
	/* one LCSR write carries both destination APIC ID and command */
	numachip2_write32_lcsr(NUMACHIP2_APIC_ICR, (apicid << 12) | val);
}

Back to the future

- nanosecond clock local to each server
- latency, lack of determinism
- hierarchical clock
- track drift, bound it (sketch below)
- scheduler misbehaviour otherwise
- see clocktree.c (complex!)
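clocktree.c isn't reproduced here, so the following is only a toy sketch of the "track drift, bound it" idea: slew a local offset toward a master clock and never let readings step backwards. All names are hypothetical and the real logic is considerably more involved.

#include <stdint.h>

struct drift_clock {
	int64_t offset;		/* estimated master - local, ns */
	uint64_t last;		/* last value returned, for monotonicity */
};

/* Read: apply the current offset, clamp so time never goes backwards */
static uint64_t drift_clock_read(struct drift_clock *c, uint64_t local_ns)
{
	uint64_t now = local_ns + c->offset;

	if (now < c->last)
		now = c->last;
	c->last = now;
	return now;
}

/* Resync: slew a fraction of the measured error instead of stepping,
 * keeping the estimate within a bounded band around the master */
static void drift_clock_sync(struct drift_clock *c, uint64_t master_ns,
			     uint64_t local_ns)
{
	int64_t err = (int64_t)(master_ns - local_ns) - c->offset;

	c->offset += err / 8;
}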

- Hardware support: linux/drivers/clocksource/numachip.c

struct clocksource numachip_clocksource = {
	.name	= "numachip",
	.rating	= 295,
	.read	= numachip2_timer_read,
	.mask	= CLOCKSOURCE_MASK(64),
	.flags	= CLOCK_SOURCE_IS_CONTINUOUS,
	.mult	= 1,
	.shift	= 0,
};

void numachip_timer_init(void)
{
	/* the counter ticks in nanoseconds, so register at 1GHz */
	clocksource_register_hz(&numachip_clocksource, NSEC_PER_SEC);
}

arch_initcall(numachip_timer_init);

uint64_t numachip2_timer_read(struct clocksource *cs)
{
	return numachip2_read64_lcsr(NUMACHIP2_TIMER_NOW);
}

Live demo: 7 servers, 336C, 1.3TB, P2P, 34TF

$ lscpu
On-line CPU(s) list:  0-335
Core(s) per socket:   8
Socket(s):            21
NUMA node(s):         42
Model name:           AMD Opteron(tm) 6380
CPU MHz:              2499.806
L1d cache:            16K
L1i cache:            64K
L2 cache:             2048K
L3 cache:             6144K
NUMA node0 CPU(s):    0-7
...
NUMA node41 CPU(s):   328-335

$ lsblk
NAME   MAJ:MIN  RM   SIZE   RO  TYPE   MOUNTPOINT
sdb    8:16      0  238.5G   0  disk
...
sdj    8:144     0  238.5G   0  disk
md127  9:127     0  1.9T     0  raid0  /scratch

$ ./stream_c.exe.gnu.1028g
...
Function  Best Rate MB/s
Copy:     656792.7
Scale:    648476.4
Add:      606363.2
Triad:    607470.9

$ ./mg.D.x
NAS Parallel Benchmarks (NPB3.3-OMP) - MG Benchmark
iter 1
...
Size            = 2048x2048x2048
Iterations      = 50
Time in seconds = 736.61
Total threads   = 168
Avail threads   = 168
Mop/s total     = 33818.22
Mop/s/thread    = 201.30
Operation type  = floating point
Verification    = SUCCESSFUL

Configuration

- PXELINUX

serial 2 115200 0x003

label numaconnect2
  kernel nc2-bootloader-r37.c32
  append config=fabric-numademo-P2P.txt

label local
  localboot 0

- Firmware

prefix=numademo-

# nodes
suffix=01 mac=00:25:90:5a:77:fc partition=1 ports=02A,03A,04A,05A,06A,07A
suffix=02 mac=00:25:90:59:b1:e4 partition=1 ports=,03B,04B,05B,06B,07B
suffix=03 mac=00:25:90:58:d7:70 partition=1 ports=,,04C,05C,06C,07C
suffix=04 mac=00:25:90:59:b1:d2 partition=1 ports=,,,05D,06D,07D
suffix=05 mac=00:25:90:58:22:ce partition=1 ports=,,,,06E,07E
suffix=06 mac=00:30:48:ff:d7:68 partition=1 ports=,,,,,07F
suffix=07 mac=00:30:48:ff:cb:ac partition=1 ports=,,,,,

# partitions
label=local unified=true

NMA Quick Start Guide (version 1.0)

1. Leaving the server power cables disconnected, connect the NMA and the management computer to the local ethernet and power on; open a browser window to http://192.168.30.250.
2. Plug in the power cable of the first (master) server; it will be detected by the NMA and listed in Servers after 30s.
3. Double-click the power icon to power up that server.
4. The eth0 MAC address and MC MAC address for each server will be associated automatically if possible. If not, copy the eth0 MAC address from the new entry to the corresponding server's entry that has an MC MAC address. Repeat steps 2-4 for each additional server.
5. Use the table shown in Wiring to interconnect all NumaConnect cards.
6. From Deployment, select the OS to install or boot into NumaConnect.
7. Select the desired entry in Fabric geometry.
8. From Dashboard, power on or reset all servers to boot as desired.

Big iron

- 6x6x3 torus, 48 cores per server
- 10TB/s memory bandwidth! (previous world record)
- 5184 cores, 21TB DRAM
- boots in 15 minutes, down from 2h

Questions and links

- [email protected], coffee anytime!
- Presentation @ quora.org/scaling-x86.pdf
- Firmware @ github.com/numascale
- Kernel patches @ resources.numascale.com/kernels
- Interactive configurator @ numascale.com/configurator.html
- Stay tuned for my future presentation on ARM and FPGAs
- LaTeX for presentations rocks