Experiences in Developing Lightweight System Software for Massively Parallel Systems
Barney Maccabe Professor, Computer Science University of New Mexico
June 23, 2008, USENIX LASCO Workshop, Boston, MA
Simplicity
Butler Lampson, “Hints for Computer System Design,” IEEE Software, vol. 1, no. 1, January 1984:
Make it fast, rather than general or powerful
Don't hide power
Leave it to the client
“Perfection is reached, not when there is no longer anything to add, but when there is no longer anything to take away.” Antoine de Saint-Exupéry
MPP Operating Systems
MPP OS Research
[Timeline, 1991–2005: SUNMOS (Intel Paragon) – message passing, direct LWK comparison; Puma/Cougar (Intel Paragon, ASCI/Red) – levels of trust; Cplant – Portals on commodity hardware, unified Cplant features; Catamount (Cray Red Storm) – re-engineering of Puma; Config/JRTOS – application-driven, real-time]
Partitioning for Specialization
Red Storm
Functional Partitioning
Service nodes: authentication and authorization; job launch, job control, and accounting
Compute nodes: memory, processor, communication; the trusted compute kernel passes the user id to the file system; isolation through communication controls
I/O nodes: storage and external communication
Compute Node Structure
QK (Quintessential Kernel) – mechanism: provides communication and address spaces; fixed size, the rest goes to the PCT
PCT (Process Control Thread) – policy: trusted agent on the node; application load and task scheduling
Applications – work
[Diagram: a compute node runs the Q-Kernel, a PCT, and the tasks the PCT loads]
Trust Structure
Not trusted (runtime): applications and servers
Trusted (OS): PCTs and the Q-Kernel
Hardware underneath
Is it good?
It’s not bad... other things are bad:
OSF-1/AD was a failure on the compute nodes; OS noise when using full-featured kernels (Livermore and LANL experiences); 40 hours MTBI
Historical problem: OS researchers only got to study broken systems at scale
Intel Paragon, 1993: 1,842 processors; #1 6/1994–11/1994
Intel ASCI/Red, 1997: 9,000 processors; first Teraflop system; #1 6/1997–11/2000
Red Storm, 2005 (Cray XT3): 10,000 processors
Compare to Blue Gene/L
BG/L: CNK on the compute nodes, with system services trampolined to the I/O nodes (servers)
Catamount: QK = hypervisor, PCT = Dom 0
[Diagram: a BG/L compute node runs a task on CNK above the hardware, next to an I/O node; a Catamount compute node runs application tasks and the PCT on the Q-Kernel above the hardware]
OS Noise
OS Noise
OS interference: the OS uses resources that the application could have used, in order to do things not directly related to what the application is doing
Does not include things like handling TLB misses
May include message handling (if the application is not waiting)
OS Noise (Jitter): the variation in OS interference
Fixed work (selfish): measure variation in time to complete
Fixed time (FTQ): measure variation in amount of work completed
Noise is usually there to do good things (e.g., garbage collection)
FTQ (Fixed Time Quantum)
FTQ on Catamount
FTQ (Fixed Time Quantum): measure the application work completed in each fixed-time quantum
[Plot: per-quantum work on Catamount, showing the application, PCT, and QK quanta.]
FTQ by Matt Sottile, LANL (now at Google). Source: Larry Kaplan, Cray Inc.
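The measurement itself is easy to sketch. A minimal FTQ-style loop follows; this is an illustration, not Sottile's actual benchmark, and the 1 ms quantum and the unit of "work" are assumptions:

```c
/* Minimal FTQ-style sketch: count how many units of work finish in each
 * fixed-length quantum; variation in the counts exposes OS interference.
 * Illustrative only -- quantum length and "work" are arbitrary choices. */
#include <stdio.h>
#include <stdint.h>
#include <time.h>

static inline uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

int main(void) {
    const uint64_t quantum_ns = 1000000;   /* 1 ms quantum (assumed)           */
    const int num_quanta = 10000;
    volatile uint64_t sink = 0;            /* keeps the loop from being elided */

    for (int q = 0; q < num_quanta; q++) {
        uint64_t start = now_ns(), work = 0;
        while (now_ns() - start < quantum_ns) {
            sink += work;                  /* one unit of "work"               */
            work++;
        }
        printf("%d %llu\n", q, (unsigned long long)work);
    }
    return 0;
}
```

Plotting work per quantum gives traces like the ones on the surrounding slides.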
FTQ on Linux
[Plot: FTQ trace on Linux.]
Source: Larry Kaplan, Cray Inc.
FTQ on ASC Purple
[Plots: typical FTQ output from Purple; the day after Purple's firmware upgrade (the VM is using cycles).]
Source: Terry Jones, LLNL
What’s the big deal?
[Plot: probability of encountering noise vs. number of nodes (up to 1,000), following y = 1 - 0.99^x.]
[Plot: collective time vs. number of nodes (up to 16,000) for noise with probability 1% and 20 us service time, 5% and 10 us, and 10% and 10 us.]
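The first curve is just an independence argument. Assuming each node is interrupted with probability p during the interval of interest, the chance that at least one of n nodes (and hence a synchronizing collective) gets delayed is:

```latex
P_{\text{noise}}(n) = 1 - (1 - p)^{n}
% with p = 0.01 this is the y = 1 - 0.99^x curve above:
% P(100)  \approx 0.63
% P(1000) \approx 0.99996
```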
Noise does matter
[Plots: percent slowdown vs. number of nodes for POP, Sage, and CTH with injected high-frequency, low-duration noise (2500 Hz, 25 usec pulses; 2.5% total noise injected).]
Source: Kurt Ferreira, UNM
Noise does matter–really
[Plots: percent slowdown vs. number of nodes for POP, Sage, and CTH with injected low-frequency, high-duration noise (10 Hz, 2500 usec pulses; 2.5% total noise injected); the slowdowns are far larger than in the high-frequency case.]
Source: Kurt Ferreira, UNM
Dealing with noise
Minimize noise: lots of short noise is better than small amounts of long noise; make “noisy” services optional
Block-synchronize system services: synchronizing tens of thousands of nodes is hard
Hardware support for noisy operations (e.g., global clock) and for operations affected by noise (e.g., collective offload)
Develop noise-tolerant algorithmic approaches, the equivalent of latency-tolerant and fault-oblivious approaches (i.e., accept that noise will eventually dominate all other things)
Define how applications can be noise tolerant (e.g., avoid ALLREDUCE)
Linux
Linux
What was the question?
The 800lb Penguin
[Stack diagram: a full Linux environment (rlogin, emacs, ssh, telnet, email, dns, MPI, TCP/IP, gcc, glibc) on the Linux kernel, driving a full complement of I/O buses and devices: networks, video, disks.]
“Linux’s cleverness is not in the software, but in the development model” Rob Pike, “Systems Software Research is Irrelevant,” February 2000
The 800lb Penguin on a diet
No “broken” hardware; limited number of devices; minimal services (e.g., CNL, ZeptoOS)
[Stack diagram: MPI, TCP/IP, gcc, and glibc on the Linux kernel, with a single I/O bus and device; disk and network only.]
“Linux’s cleverness is not in the software, but in the development model” Rob Pike, “Systems Software Research is Irrelevant,” February 2000
Building Compute Node Linux
[Plots: FTQ on Linux and FTQ evolving on CNL; complex interactions in changing the quantum.]
• BusyBox (embedded)
• no remote login
• add “capacity” features as needed by the application
• libraries
• NFS, LDAP
Source: Larry Kaplan, Cray Inc.
CTH on Catamount and CNL
[Plot: CTH scaling study (shaped charge, 1.75M cells per process): time per timestep (seconds) vs. number of nodes (1 to 1x10^4), comparing CNL and the LWK. Caution: preliminary results.]
Source: Courtenay Vaughan, Sandia
Partisn on Catamount and CNL
[Plot: Partisn scaling study (SN problem, 24x24x24 cells per process): time (seconds) vs. number of nodes (1 to 1x10^4), comparing CNL and the LWK; 213 s vs. 143 s, a 50% increase for CNL on 8096 nodes. Caution: preliminary results.]
Source: Courtenay Vaughan, Sandia
Keeping up with Linux
[Plot: lines of code in the Linux kernel (millions) vs. days since the release of 1.0.0 (14 March 1994), broken out by kernel series (1.2, 2.0, 2.2, 2.4, 2.6).]
Data from Oded Koren
Catamount is Nimble
Source code is small enough that developers can keep it in their heads: Catamount is <100,000 lines of code
Early example: dual processors on ASCI/Red
Heater mode
Message co-processor mode: the designed/expected mode of use
Compute co-processor mode, aka “stunt mode”
Virtual node mode: a 6 man-month effort to implement FIFOs; became the standard mode
[Diagram: P0 and P1 sharing memory and the network]
Adding Multicore Support
SMARTMAP (Brightwell, Pedretti, and Hudson)
Map every core’s memory view into every other core’s memory map: almost threads, almost processes
Modified 20 lines of kernel code; an in-line function (3 lines of code) to access another core’s memory
Modified Open MPI: Byte Transfer Layer (BTL) requires two copies; Matching Transport Layer (MTL) does message matching in Portals
Less than a man-month to implement
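A rough idea of what that in-line function does, assuming x86-64 4-level paging where each top-level page-table slot spans 2^39 bytes and assuming the QK installs core r's address space under slot r+1 of every core's page table (an illustrative sketch, not the actual Catamount code):

```c
/* SMARTMAP-style address arithmetic (illustrative; the slot layout is assumed).
 * Rewrites a local pointer into the virtual address at which the same
 * memory appears when it belongs to another core on the node. */
#include <stdint.h>

#define SMARTMAP_SLOT_SHIFT 39   /* bytes covered by one top-level slot (assumption) */

static inline void *remote_address(unsigned int core, const void *local_ptr) {
    uintptr_t offset = (uintptr_t)local_ptr & (((uintptr_t)1 << SMARTMAP_SLOT_SHIFT) - 1);
    return (void *)(((uintptr_t)(core + 1) << SMARTMAP_SLOT_SHIFT) | offset);
}
```

With a mapping like this, a single-copy on-node transfer is just a memcpy through the address the receiver computes for the sender's buffer, which is why the MTL path on the next slide needs only one copy.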
SMARTMAP Performance
[Plot: ping-pong time (microseconds) vs. message size (bytes, 1 to 1x10^6) for BTL with generic Portals (interrupt, 2 copies), MTL with generic Portals (interrupt, 1 copy), BTL with SMARTMAP (2 copies), and MTL with SMARTMAP (1 copy).]
Source: Ron Brightwell, Sandia
Why Linux .... Why not ....
Why Linux:
Community: easier to hire Linux specialists; lots of eyes to find solutions, and others care
Environment: performance tools, development tools (compilers), libraries
Why not:
One is the loneliest number... diversity is a good thing
Linux is a moving target; hard to get changes into Linux
HPC is not the goal
Shrinking Linux eliminates parts of the environment
Highlander: there will be one; when does it stop being Linux?
Catamount as a virtualization layer
Lightweight Storage Systems
Basic Idea
Apply lightweight design philosophy to storage systems
Enforce access control: authentication, capabilities with revocation (sketched below)
Enable consistency: lightweight transactions
Expose full power of the storage resources to applications
Applications manage bandwidth to storage
“Off-line” metadata updates: “meta-bots”
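To make the access-control item concrete, here is a hedged sketch of what a revocable capability in this style of system might carry; the field names, sizes, and layout are assumptions for illustration, not the LWFS wire format:

```c
/* Sketch of a revocable capability (field names and sizes are assumptions,
 * not the LWFS format): the authorization service signs the object id,
 * the permitted operations, and a generation number; storage servers
 * verify the MAC and reject capabilities whose generation was revoked. */
#include <stdint.h>

typedef struct {
    uint64_t object_id;     /* storage object this capability names      */
    uint32_t operations;    /* bitmask: read, write, create, remove, ... */
    uint32_t generation;    /* bumped by the server to revoke old caps   */
    uint8_t  mac[20];       /* keyed MAC from the authorization service  */
} capability_t;
```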
File/Object Creates
[Plots: throughput (ops/sec) vs. number of clients (0 to 60) for Lustre file creates (y-axis to 800 ops/sec) and LWFS object creates (y-axis to 60,000 ops/sec), each with 1, 2, 4, 8, and 16 I/O servers. Note the different scales for the y axis.]
Source: Ron Oldfield, Sandia
Checkpoints
1. Initiate a lightweight transaction on node 0; broadcast the transaction id to all nodes
2. Each node creates a unique data object and dumps its local data: parallelism limited only by the disks; no metadata, no consistency, no coherency; the data objects are transient
3. All nodes send their data object id to node 0
4. Node 0 builds an “index object” and commits the transaction: a two-phase commit with the storage servers; the data objects and index object become permanent; could be done “off line” by a meta-bot
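The protocol is short enough to sketch in code. The MPI calls are standard; the lwfs_* calls are hypothetical placeholders for the storage operations, not the real LWFS API:

```c
/* Checkpoint via a lightweight transaction (sketch). The lwfs_* names and
 * signatures are placeholders for illustration only. */
#include <mpi.h>
#include <stdint.h>
#include <stdlib.h>

extern uint64_t lwfs_txn_begin(void);                                           /* hypothetical */
extern uint64_t lwfs_obj_create_and_write(uint64_t txn, const void *, size_t);  /* hypothetical */
extern void     lwfs_obj_write_index(uint64_t txn, const uint64_t *, int);      /* hypothetical */
extern void     lwfs_txn_commit(uint64_t txn);                                  /* hypothetical */

void checkpoint(const void *local_data, size_t len, MPI_Comm comm) {
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    /* 1. Node 0 starts a lightweight transaction and broadcasts its id. */
    uint64_t txn = 0;
    if (rank == 0)
        txn = lwfs_txn_begin();
    MPI_Bcast(&txn, 1, MPI_UINT64_T, 0, comm);

    /* 2. Every node dumps its data into a private, transient object:
     *    no metadata, no coherency; parallelism limited only by the disks. */
    uint64_t obj = lwfs_obj_create_and_write(txn, local_data, len);

    /* 3. Gather the data object ids at node 0. */
    uint64_t *ids = (rank == 0) ? malloc(nprocs * sizeof *ids) : NULL;
    MPI_Gather(&obj, 1, MPI_UINT64_T, ids, 1, MPI_UINT64_T, 0, comm);

    /* 4. Node 0 records the ids in an index object and commits the
     *    transaction; only now do the objects become permanent. */
    if (rank == 0) {
        lwfs_obj_write_index(txn, ids, nprocs);
        lwfs_txn_commit(txn);
        free(ids);
    }
}
```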
Write Throughput
[Plots: write throughput (MB/sec, 0 to 800) vs. number of clients (0 to 60) for Lustre (one file per process) and LWFS (one object per process), each with 1, 2, 4, 8, and 16 I/O servers.]
Source: Ron Oldfield, Sandia
A final story
Many-to-one operations are problematic at scale: you cannot reserve buffer space on the compute nodes for 10,000-to-1 communication
The Catamount perspective: it’s a protocol failure, fix the application!
Upper levels are responsible for flow control
Catamount happily drops messages–failing sooner rather than later is better
BG/L–the customer is right
Protect applications from themselves
Flow control is fundamental, even if it handicaps well-written applications
The Design Space
[Diagram: a 2x2 design space with rows Enable/Prevent and columns Good Things/Bad Things: enable good things, enable bad things, prevent good things, prevent bad things.]
Thanks
UNM Scalable Systems Lab
Patrick Bridges, Patrick Widener, Kurt Ferreira
Sandia National Labs
Ron Brightwell, Ron Oldfield, Rolf Riesen, Lee Ward, Sue Kelly
“Fools ignore complexity; pragmatists suffer it; experts avoid it; geniuses remove it.” Alan J. Perlis