Programming in the Multi-Core Era
Emil Sekerinski
Department of Computing and Software McMaster University
June 2018 40+ Years of Microprocessor Trends
We have reached instruction level parallelism wall and power wall Still increasing number of transistors leads to more processor cores Communication and Synchronization in Multi-Programs
Message Passing: Asynchronous: Unix pipes & filters, Erlang processes Synchronous: Go goroutines Shared Variables: Semaphores: Linux, C, Python Monitors: Java, C#, Python Transactional memory: C++, Haskell Remote procedure call: Java, Python Distributed shared memory Introduced for resource sharing, interactive programs, distribution
Suitable for multi-core processors? Too many threads: slowdown due to overhead of cores switching between threads, e.g. multiplying n × n matrices → use thread pools to reduce number of threads
Threads in Programs
Threads have to be explicitly created to make use of multiple cores:
Too few threads: not all cores will be used → turn independent tasks into threads
[Android Guides] Threads in Programs
Threads have to be explicitly created to make use of multiple cores:
Too few threads: not all cores will be used → turn independent tasks into threads Too many threads: slowdown due to overhead of cores switching between threads, e.g. multiplying n × n matrices → use thread pools to reduce number of threads
[Thread Pool in the .NET Framework] Threads in Programs
Threads have to be explicitly created to make use of multiple cores:
Too few threads: not all cores will be used → turn independent tasks into threads Too many threads: slowdown due to overhead of cores switching between threads, e.g. multiplying n × n matrices → use thread pools to reduce number of threads
In addition to the static structure of modules, programs have a dynamic structure of threads.
(State is distributed over global variables and thread-local variables) Smalltalk-80 Java, C#, Python, . . . sender caller receiver callee message method
Objects should be thought of as having “independent lives”, even if the implementation is sequential
How to make objects “truly concurrent”? Ada uses rendezvous, Erlang and Akka use actors
Inspiration: Simula-67 and Smalltalk-80
Simula-67 objects in programs mimic objects of the “real world” objects are cooperatively scheduled → coroutines
[Birtwhistle, Dahl, Myhrhaug, Nygaard.
Simula Begin, 1979.] Objects should be thought of as having “independent lives”, even if the implementation is sequential
How to make objects “truly concurrent”? Ada uses rendezvous, Erlang and Akka use actors
Inspiration: Simula-67 and Smalltalk-80
Simula-67 objects in programs mimic objects of the “real world” objects are cooperatively scheduled → coroutines “Part of the problem Smalltalk-80 Java, C#, Python, . . . description becomes part of the solution” sender caller receiver callee message method How to make objects “truly concurrent”? Ada uses rendezvous, Erlang and Akka use actors
Inspiration: Simula-67 and Smalltalk-80
Simula-67 objects in programs mimic objects of the “real world” objects are cooperatively scheduled → coroutines “Part of the problem Smalltalk-80 Java, C#, Python, . . . description becomes part of the solution” sender caller receiver callee message method
Objects should be thought of as having “independent lives”, even if the implementation is sequential Inspiration: Simula-67 and Smalltalk-80
Simula-67 objects in programs mimic objects of the “real world” objects are cooperatively scheduled → coroutines “Part of the problem Smalltalk-80 Java, C#, Python, . . . description becomes part of the solution” sender caller receiver callee message method
Objects should be thought of as having “independent lives”, even if the implementation is sequential
How to make objects “truly concurrent”? Ada uses rendezvous, Erlang and Akka use actors Used for verification tools, TLA (Amazon), Event B (railways), SPIN (NASA), Microsoft Device Driver Verifier, . . . , for hardware description languages, but not programming languages:
Widespread belief in the 80’s that guarded commands cannot be implemented efficiently
Guarded Commands: A Model for Concurrency
do if x1 > x2 then swap x1 and x2, x1 > x2 → swap x1 and x2 ... x2 > x3 → swap x2 and x4 if both x1 > x2 and x3 > x4, x3 > x4 → swap x3 and x4 swap concurrently Guarded Commands: A Model for Concurrency
do x1 > x2 → swap x1 and x2 x2 > x3 → swap x2 and x4 x3 > x4 → swap x3 and x4
Used for verification tools, TLA (Amazon), Event B (railways), SPIN (NASA), Microsoft Device Driver Verifier, . . . , for hardware description languages, but not programming languages:
Widespread belief in the 80’s that guarded commands cannot be implemented efficiently Lime: a Concurrent Object-Oriented Language
class DelayedDoubler Actions execute when enabled var y: int; d: boolean All objects are concurrent init() d := true Objects synchronize through method store(u: int) method calls y := u; d := false Method blocks when guard is method retrieve(): int false d → return y Execution is atomic up to action double method calls not d → y := 2 * y; d := true Guards only over local fields Lime: a Concurrent Object-Oriented Language
class DelayedDoubler A correct implementation of: var y: int; d: boolean init() class Doubler d := true var x: int method store(u: int) method store(u: int) y := u; d := false x := 2 * u method retrieve(): int method retrieve(): int d → return y return x action double not d → y := 2 * y; d := true
Representative for writing to a file, sending over network, . . . m:5m:4 m:6m:5m m:6m m head p:6p:4p p:5p p p a:truea a:truea a:truea a
Priority Queue
class PriorityQueue Add an integer to the queue var l : PriorityQueue Remove the smallest integer ... method add(e: int) Remove executes in O(1) time when not a and not r do if l = nil then Initialize; add 5, 6, 4 . . . m := e ; l := new PriorityQueue else p := e ; a := true action doAdd when a do if m < p then l.add(p) else l.add(m); m := p a := false
m head p a m:5m:4m m:6m:5m m:6m m head p:6p:4p p:5p p p a:truea a:truea a:truea a
Priority Queue
class PriorityQueue Add an integer to the queue var l : PriorityQueue Remove the smallest integer ... method add(e: int) Remove executes in O(1) time when not a and not r do if l = nil then Initialize; add 5, 6, 4 . . . m := e ; l := new PriorityQueue else p := e ; a := true action doAdd when a do if m < p then l.add(p) else l.add(m); m := p a := false
m:5 head p a:true m:5m:4m m:6m:5m m:6m m head p:6p:4p p:5p p p a:truea a:truea a:truea a
Priority Queue
class PriorityQueue Add an integer to the queue var l : PriorityQueue Remove the smallest integer ... method add(e: int) Remove executes in O(1) time when not a and not r do if l = nil then Initialize; add 5, 6, 4 . . . m := e ; l := new PriorityQueue else p := e ; a := true action doAdd when a do if m < p then l.add(p) else l.add(m); m := p a := false
m:5 m head p p a a m:5m:4m m:6m:5m m:6m m head p:4p p:5p p p a:truea a:truea a:truea a
Priority Queue
class PriorityQueue Add an integer to the queue var l : PriorityQueue Remove the smallest integer ... method add(e: int) Remove executes in O(1) time when not a and not r do if l = nil then Initialize; add 5, 6, 4 . . . m := e ; l := new PriorityQueue else p := e ; a := true action doAdd when a do if m < p then l.add(p) else l.add(m); m := p a := false
m:5 m head p:6 p a:true a m:5m:4m m:6m:5m m:6m m head p:6p:4p p:5p p p a:truea a:truea a:truea a
Priority Queue
class PriorityQueue Add an integer to the queue var l : PriorityQueue Remove the smallest integer ... method add(e: int) Remove executes in O(1) time when not a and not r do if l = nil then Initialize; add 5, 6, 4 . . . m := e ; l := new PriorityQueue else p := e ; a := true action doAdd when a do if m < p then l.add(p) else l.add(m); m := p a := false
m:5 m:6 head p p a a:true m:5m:4m m:6m:5m m:6m m head p:6p:4p p:5p p p a:truea a:truea a:truea a
Priority Queue
class PriorityQueue Add an integer to the queue var l : PriorityQueue Remove the smallest integer ... method add(e: int) Remove executes in O(1) time when not a and not r do if l = nil then Initialize; add 5, 6, 4 . . . m := e ; l := new PriorityQueue else p := e ; a := true action doAdd when a do if m < p then l.add(p) else l.add(m); m := p a := false
m:5 m:6 m head p p p a a a m:5m:4m m:6m:5m m:6m m head p:6p p:5p p p a:truea a:truea a:truea a
Priority Queue
class PriorityQueue Add an integer to the queue var l : PriorityQueue Remove the smallest integer ... method add(e: int) Remove executes in O(1) time when not a and not r do if l = nil then Initialize; add 5, 6, 4 . . . m := e ; l := new PriorityQueue else p := e ; a := true action doAdd when a do if m < p then l.add(p) else l.add(m); m := p a := false
m:5 m:6 m head p:4 p p a:true a a m:5m:4m m:6m:5m m:6m m head p:6p:4p p p p a:truea a:truea a:truea a
Priority Queue
class PriorityQueue Add an integer to the queue var l : PriorityQueue Remove the smallest integer ... method add(e: int) Remove executes in O(1) time when not a and not r do if l = nil then Initialize; add 5, 6, 4 . . . m := e ; l := new PriorityQueue else p := e ; a := true action doAdd when a do if m < p then l.add(p) else l.add(m); m := p a := false
m:4 m:6 m head p p:5 p a a:true a m:5m:4m m:6m:5m m:6m m head p:6p:4p p:5p p p a:truea a:truea a a
Priority Queue
class PriorityQueue Add an integer to the queue var l : PriorityQueue Remove the smallest integer ... method add(e: int) Remove executes in O(1) time when not a and not r do if l = nil then Initialize; add 5, 6, 4 . . . m := e ; l := new PriorityQueue else p := e ; a := true action doAdd when a do if m < p then l.add(p) else l.add(m); m := p a := false
m:4 m:5 m:6 head p p p a a a:true m:5m:4m m:6m:5m m:6m head p:6p:4p p:5p p a:truea a:truea a:truea
Priority Queue
class PriorityQueue Add an integer to the queue var l : PriorityQueue Remove the smallest integer ... method add(e: int) Remove executes in O(1) time when not a and not r do if l = nil then Initialize; add 5, 6, 4 . . . m := e ; l := new PriorityQueue else p := e ; a := true action doAdd when a do if m < p then l.add(p) else l.add(m); m := p a := false
m:4 m:5 m:6 m head p p p p a a a a Results of Priority Queue
3500 30000 Lime Go Erlang 3000 Java Pthread 25000 Haskell
2500 20000
2000
15000
Time (ms) 1500
10000 1000
5000 500
0 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10 20 30 40 50 60 70 80 Number of Objects Leaf-oriented Tree
Internal nodes contain only guides The elements are stored in the leaves
Leaf
Node
Leaf
Root input
Leaf
Node
Leaf Results of Leaf-oriented Tree
9000 1800 Lime Go 8000 Erlang 1600 Java Pthread Haskell 7000 1400
6000 1200
5000 1000
4000 800 Time (ms)
3000 600
2000 400
1000 200
0 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10 20 30 40 50 60 70 80 Number of Objects MapReduce
MapReduce computes the sum of squares from 1 to n The mapper: map(x) = x2 The reducer: reduce(x, y) = x + y.
input Mapper
Reducer
input Mapper
Reducer output
input Mapper
Reducer
input Mapper Results of MapReduce
8000 4500 Lime Go Erlang 4000 7000 Java Pthread Haskell 3500 6000
3000 5000
2500 4000 2000 Time (ms) 3000 1500
2000 1000
1000 500
0 0 128 256 512 1024 2048 2 4 8 16 32 64 Number of Objects MapReduce with Varying Number of Threads
MapReduce Example (8192) 7000 Go Lime 6000
5000
4000
Time (ms) 3000
2000
1000
0 1 2 3 4 5 6 7 8 Thread number Santa Claus Problem
Santa Claus sleeps in his shop up at the North Pole, and can only be wakened by either all nine reindeer being back from their year long vacation on a tropical island, or by some elves who are having some difficulties making the toys. One elf’s problem is never serious enough to wake up Santa (otherwise, he may never get any sleep), so, the elves visit Santa in a group of three. When three elves are having their problems solved, any other elves wishing to visit Santa must wait for those elves to return. If Santa wakes up to find three elves waiting at his shop’s door, along with the last reindeer having come back from the tropics, Santa has decided that the elves can wait until after Christmas, because it is more important to get his sleigh ready as soon as possible. (It is assumed that the reindeer don’t want to leave the tropics, and therefore they stay there until the last possible moment.) The penalty for the last reindeer to arrive is that it must get Santa while the others wait in a warming hut before being harnessed to the sleigh. [Trono, 1994]
Includes priority, multi-party synchronization, barriers, and batch processing.
A number of flawed solutions have been published. Results of Santa Claus
Repetitions Lime (guards) C (semaphores) Go (channels) Java (monitors) of Santa 10,000 0.03 / 0.03 / 0.00 0.87 / 0.26 / 1.18 0.08 / 0.12 / 0.01 6.38 / 2.48 / 5.30 100,000 0.21 / 0.21 / 0.00 8.82 / 2.50 / 12.0 0.77 / 1.18 / 0.06 60.3 / 21.6 / 52.0 1,000,000 2.03 / 2.03 / 0.01 93.0 / 24.8 / 123 7.51 / 11.6 / 0.55 ≈534 / 159 / 509
Execution time in sec on AMD 16 core (32 threads) processor: real / user / system GO and Lime
GO threads (goroutines) use synchronous communication (CSP) Object-oriented structure deemphasized, thread structure emphasized
GO and Lime make use of goroutines resp. actions so cheap that concurrency can be used whenever natural;
The runtime will make use of all available cores.
How is that achieved? Lime Runtime System
head tail
Thread 1 lock obj obj ... obj head tail Thread 2 lock dequeue ... lock obj obj ... obj obj obj obj obj obj
Thread n lock obj obj ... obj enqueue
Local Queues Global Queue Number of worker threads with local queue ≈ number of cores Local queue is full / empty: enqueue / dequeue to / from global queue Local and global queues are empty: steal from the other threads. Lock-free implementation of queues
Each action is a coroutine: compiler inserts transfers in code, i.e. schedules cooperatively → lightweight threads Conclusions
Efficient implementation of concurrency with guarded commands is possible, when local to objects
Based on the Ph.D. thesis of Shucai Yao (2018) and Joshua Moore-Oliva, M.Sc, (2010) Xiao-lei Cui, M.Sc. (2009) Upasana Pujari, M.Sc. (2009) Kevin Lou, M.Sc. (2004) Jie Liang, M.Sc. (2004)
Ongoing work on inheritance, ownership, exceptions http://www.software-pioneers.com (2001)