Programming in the Multi-Core Era

Emil Sekerinski

Department of Computing and Software McMaster University

June 2018 40+ Years of Microprocessor Trends

We have reached instruction level parallelism wall and power wall Still increasing number of transistors leads to more processor cores Communication and Synchronization in Multi-Programs

Message Passing: Asynchronous: pipes & filters, Erlang processes Synchronous: Go goroutines Shared Variables: Semaphores: Linux, , Python Monitors: Java, C#, Python Transactional memory: C++, Haskell Remote procedure call: Java, Python Distributed shared memory Introduced for resource sharing, interactive programs, distribution

Suitable for multi-core processors? Too many threads: slowdown due to overhead of cores switching between threads, e.g. multiplying n × n matrices → use pools to reduce number of threads

Threads in Programs

Threads have to be explicitly created to make use of multiple cores:

Too few threads: not all cores will be used → turn independent tasks into threads

[Android Guides] Threads in Programs

Threads have to be explicitly created to make use of multiple cores:

Too few threads: not all cores will be used → turn independent tasks into threads Too many threads: slowdown due to overhead of cores switching between threads, e.g. multiplying n × n matrices → use thread pools to reduce number of threads

[ in the .NET Framework] Threads in Programs

Threads have to be explicitly created to make use of multiple cores:

Too few threads: not all cores will be used → turn independent tasks into threads Too many threads: slowdown due to overhead of cores switching between threads, e.g. multiplying n × n matrices → use thread pools to reduce number of threads

In addition to the static structure of modules, programs have a dynamic structure of threads.

(State is distributed over global variables and thread-local variables) -80 Java, C#, Python, . . . sender caller receiver callee message method

Objects should be thought of as having “independent lives”, even if the implementation is sequential

How to make objects “truly concurrent”? Ada uses rendezvous, Erlang and Akka use actors

Inspiration: -67 and Smalltalk-80

Simula-67 objects in programs mimic objects of the “real world” objects are cooperatively scheduled → coroutines

[Birtwhistle, Dahl, Myhrhaug, Nygaard.

Simula Begin, 1979.] Objects should be thought of as having “independent lives”, even if the implementation is sequential

How to make objects “truly concurrent”? Ada uses rendezvous, Erlang and Akka use actors

Inspiration: Simula-67 and Smalltalk-80

Simula-67 objects in programs mimic objects of the “real world” objects are cooperatively scheduled → coroutines “Part of the problem Smalltalk-80 Java, C#, Python, . . . description becomes part of the solution” sender caller receiver callee message method How to make objects “truly concurrent”? Ada uses rendezvous, Erlang and Akka use actors

Inspiration: Simula-67 and Smalltalk-80

Simula-67 objects in programs mimic objects of the “real world” objects are cooperatively scheduled → coroutines “Part of the problem Smalltalk-80 Java, C#, Python, . . . description becomes part of the solution” sender caller receiver callee message method

Objects should be thought of as having “independent lives”, even if the implementation is sequential Inspiration: Simula-67 and Smalltalk-80

Simula-67 objects in programs mimic objects of the “real world” objects are cooperatively scheduled → coroutines “Part of the problem Smalltalk-80 Java, C#, Python, . . . description becomes part of the solution” sender caller receiver callee message method

Objects should be thought of as having “independent lives”, even if the implementation is sequential

How to make objects “truly concurrent”? Ada uses rendezvous, Erlang and Akka use actors Used for verification tools, TLA (Amazon), Event B (railways), SPIN (NASA), Microsoft Device Driver Verifier, . . . , for hardware description languages, but not programming languages:

Widespread belief in the 80’s that guarded commands cannot be implemented efficiently

Guarded Commands: A Model for Concurrency

do if x1 > x2 then swap x1 and x2, x1 > x2 → swap x1 and x2 ... x2 > x3 → swap x2 and x4 if both x1 > x2 and x3 > x4, x3 > x4 → swap x3 and x4 swap concurrently Guarded Commands: A Model for Concurrency

do x1 > x2 → swap x1 and x2 x2 > x3 → swap x2 and x4 x3 > x4 → swap x3 and x4

Used for verification tools, TLA (Amazon), Event B (railways), SPIN (NASA), Microsoft Device Driver Verifier, . . . , for hardware description languages, but not programming languages:

Widespread belief in the 80’s that guarded commands cannot be implemented efficiently Lime: a Concurrent Object-Oriented Language

class DelayedDoubler Actions execute when enabled var y: int; : boolean All objects are concurrent init() d := true Objects synchronize through method store(u: int) method calls y := u; d := false Method blocks when guard is method retrieve(): int false d → return y Execution is atomic up to action double method calls not d → y := 2 * y; d := true Guards only over local fields Lime: a Concurrent Object-Oriented Language

class DelayedDoubler A correct implementation of: var y: int; d: boolean init() class Doubler d := true var x: int method store(u: int) method store(u: int) y := u; d := false x := 2 * u method retrieve(): int method retrieve(): int d → return y return x action double not d → y := 2 * y; d := true

Representative for writing to a file, sending over network, . . . m:5m:4 m:6m:5m m:6m m head p:6p:4p p:5p p p a:truea a:truea a:truea a

Priority Queue

class PriorityQueue Add an integer to the queue var l : PriorityQueue Remove the smallest integer ... method add(e: int) Remove executes in O(1) time when not a and not r do if l = nil then Initialize; add 5, 6, 4 . . . m := e ; l := new PriorityQueue else p := e ; a := true action doAdd when a do if m < p then l.add(p) else l.add(m); m := p a := false

m head p a m:5m:4m m:6m:5m m:6m m head p:6p:4p p:5p p p a:truea a:truea a:truea a

Priority Queue

class PriorityQueue Add an integer to the queue var l : PriorityQueue Remove the smallest integer ... method add(e: int) Remove executes in O(1) time when not a and not r do if l = nil then Initialize; add 5, 6, 4 . . . m := e ; l := new PriorityQueue else p := e ; a := true action doAdd when a do if m < p then l.add(p) else l.add(m); m := p a := false

m:5 head p a:true m:5m:4m m:6m:5m m:6m m head p:6p:4p p:5p p p a:truea a:truea a:truea a

Priority Queue

class PriorityQueue Add an integer to the queue var l : PriorityQueue Remove the smallest integer ... method add(e: int) Remove executes in O(1) time when not a and not r do if l = nil then Initialize; add 5, 6, 4 . . . m := e ; l := new PriorityQueue else p := e ; a := true action doAdd when a do if m < p then l.add(p) else l.add(m); m := p a := false

m:5 m head p p a a m:5m:4m m:6m:5m m:6m m head p:4p p:5p p p a:truea a:truea a:truea a

Priority Queue

class PriorityQueue Add an integer to the queue var l : PriorityQueue Remove the smallest integer ... method add(e: int) Remove executes in O(1) time when not a and not r do if l = nil then Initialize; add 5, 6, 4 . . . m := e ; l := new PriorityQueue else p := e ; a := true action doAdd when a do if m < p then l.add(p) else l.add(m); m := p a := false

m:5 m head p:6 p a:true a m:5m:4m m:6m:5m m:6m m head p:6p:4p p:5p p p a:truea a:truea a:truea a

Priority Queue

class PriorityQueue Add an integer to the queue var l : PriorityQueue Remove the smallest integer ... method add(e: int) Remove executes in O(1) time when not a and not r do if l = nil then Initialize; add 5, 6, 4 . . . m := e ; l := new PriorityQueue else p := e ; a := true action doAdd when a do if m < p then l.add(p) else l.add(m); m := p a := false

m:5 m:6 head p p a a:true m:5m:4m m:6m:5m m:6m m head p:6p:4p p:5p p p a:truea a:truea a:truea a

Priority Queue

class PriorityQueue Add an integer to the queue var l : PriorityQueue Remove the smallest integer ... method add(e: int) Remove executes in O(1) time when not a and not r do if l = nil then Initialize; add 5, 6, 4 . . . m := e ; l := new PriorityQueue else p := e ; a := true action doAdd when a do if m < p then l.add(p) else l.add(m); m := p a := false

m:5 m:6 m head p p p a a a m:5m:4m m:6m:5m m:6m m head p:6p p:5p p p a:truea a:truea a:truea a

Priority Queue

class PriorityQueue Add an integer to the queue var l : PriorityQueue Remove the smallest integer ... method add(e: int) Remove executes in O(1) time when not a and not r do if l = nil then Initialize; add 5, 6, 4 . . . m := e ; l := new PriorityQueue else p := e ; a := true action doAdd when a do if m < p then l.add(p) else l.add(m); m := p a := false

m:5 m:6 m head p:4 p p a:true a a m:5m:4m m:6m:5m m:6m m head p:6p:4p p p p a:truea a:truea a:truea a

Priority Queue

class PriorityQueue Add an integer to the queue var l : PriorityQueue Remove the smallest integer ... method add(e: int) Remove executes in O(1) time when not a and not r do if l = nil then Initialize; add 5, 6, 4 . . . m := e ; l := new PriorityQueue else p := e ; a := true action doAdd when a do if m < p then l.add(p) else l.add(m); m := p a := false

m:4 m:6 m head p p:5 p a a:true a m:5m:4m m:6m:5m m:6m m head p:6p:4p p:5p p p a:truea a:truea a a

Priority Queue

class PriorityQueue Add an integer to the queue var l : PriorityQueue Remove the smallest integer ... method add(e: int) Remove executes in O(1) time when not a and not r do if l = nil then Initialize; add 5, 6, 4 . . . m := e ; l := new PriorityQueue else p := e ; a := true action doAdd when a do if m < p then l.add(p) else l.add(m); m := p a := false

m:4 m:5 m:6 head p p p a a a:true m:5m:4m m:6m:5m m:6m head p:6p:4p p:5p p a:truea a:truea a:truea

Priority Queue

class PriorityQueue Add an integer to the queue var l : PriorityQueue Remove the smallest integer ... method add(e: int) Remove executes in O(1) time when not a and not r do if l = nil then Initialize; add 5, 6, 4 . . . m := e ; l := new PriorityQueue else p := e ; a := true action doAdd when a do if m < p then l.add(p) else l.add(m); m := p a := false

m:4 m:5 m:6 m head p p p p a a a a Results of Priority Queue

3500 30000 Lime Go Erlang 3000 Java Pthread 25000 Haskell

2500 20000

2000

15000

Time (ms) 1500

10000 1000

5000 500

0 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10 20 30 40 50 60 70 80 Number of Objects Leaf-oriented Tree

Internal nodes contain only guides The elements are stored in the leaves

Leaf

Node

Leaf

Root input

Leaf

Node

Leaf Results of Leaf-oriented Tree

9000 1800 Lime Go 8000 Erlang 1600 Java Pthread Haskell 7000 1400

6000 1200

5000 1000

4000 800 Time (ms)

3000 600

2000 400

1000 200

0 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10 20 30 40 50 60 70 80 Number of Objects MapReduce

MapReduce computes the sum of squares from 1 to n The mapper: map(x) = x2 The reducer: reduce(x, y) = x + y.

input Mapper

Reducer

input Mapper

Reducer output

input Mapper

Reducer

input Mapper Results of MapReduce

8000 4500 Lime Go Erlang 4000 7000 Java Pthread Haskell 3500 6000

3000 5000

2500 4000 2000 Time (ms) 3000 1500

2000 1000

1000 500

0 0 128 256 512 1024 2048 2 4 8 16 32 64 Number of Objects MapReduce with Varying Number of Threads

MapReduce Example (8192) 7000 Go Lime 6000

5000

4000

Time (ms) 3000

2000

1000

0 1 2 3 4 5 6 7 8 Thread number Santa Claus Problem

Santa Claus sleeps in his shop up at the North Pole, and can only be wakened by either all nine reindeer being back from their year long vacation on a tropical island, or by some elves who are having some difficulties making the toys. One elf’s problem is never serious enough to wake up Santa (otherwise, he may never get any sleep), so, the elves visit Santa in a group of three. When three elves are having their problems solved, any other elves wishing to visit Santa must wait for those elves to return. If Santa wakes up to find three elves waiting at his shop’s door, along with the last reindeer having come back from the tropics, Santa has decided that the elves can wait until after Christmas, because it is more important to get his sleigh ready as soon as possible. (It is assumed that the reindeer don’t want to leave the tropics, and therefore they stay there until the last possible moment.) The penalty for the last reindeer to arrive is that it must get Santa while the others wait in a warming hut before being harnessed to the sleigh. [Trono, 1994]

Includes priority, multi-party synchronization, barriers, and batch processing.

A number of flawed solutions have been published. Results of Santa Claus

Repetitions Lime (guards) C (semaphores) Go (channels) Java (monitors) of Santa 10,000 0.03 / 0.03 / 0.00 0.87 / 0.26 / 1.18 0.08 / 0.12 / 0.01 6.38 / 2.48 / 5.30 100,000 0.21 / 0.21 / 0.00 8.82 / 2.50 / 12.0 0.77 / 1.18 / 0.06 60.3 / 21.6 / 52.0 1,000,000 2.03 / 2.03 / 0.01 93.0 / 24.8 / 123 7.51 / 11.6 / 0.55 ≈534 / 159 / 509

Execution time in sec on AMD 16 core (32 threads) processor: real / user / system GO and Lime

GO threads (goroutines) use synchronous communication (CSP) Object-oriented structure deemphasized, thread structure emphasized

GO and Lime make use of goroutines resp. actions so cheap that concurrency can be used whenever natural;

The runtime will make use of all available cores.

How is that achieved? Lime Runtime System

head tail

Thread 1 obj obj ... obj head tail Thread 2 lock dequeue ... lock obj obj ... obj obj obj obj obj obj

Thread n lock obj obj ... obj enqueue

Local Queues Global Queue Number of worker threads with local queue ≈ number of cores Local queue is full / empty: enqueue / dequeue to / from global queue Local and global queues are empty: steal from the other threads. Lock-free implementation of queues

Each action is a coroutine: compiler inserts transfers in code, i.e. schedules cooperatively → lightweight threads Conclusions

Efficient implementation of concurrency with guarded commands is possible, when local to objects

Based on the Ph.D. thesis of Shucai Yao (2018) and Joshua Moore-Oliva, M.Sc, (2010) Xiao-lei Cui, M.Sc. (2009) Upasana Pujari, M.Sc. (2009) Kevin Lou, M.Sc. (2004) Jie Liang, M.Sc. (2004)

Ongoing work on inheritance, ownership, exceptions http://www.software-pioneers.com (2001)