CPL 2016, week 7 Performance considerations

Oleg Batrashev

Institute of Computer Science, Tartu, Estonia

March 21, 2016

Overview

Studied so far:

1. Inter-thread visibility: JMM
2. Inter-thread synchronization: locks and monitors
3. Thread management: executors, tasks, cancellation
4. Inter-thread communication: confinement, queues, back pressure
5. Inter-thread collaboration: actors, inboxes, state diagrams
6. Asynchronous execution: callbacks, Pyramid of Doom, Java 8 promises

Today:

I Performance considerations: asynchronous IO, Java 8 streams

Performance considerations 140/160 - Outline

Performance considerations
  Context switch
  Green threads
  Asynchronous IO
  Java NIO
Declarative concurrency
  Java 8 streams

Performance considerations 141/160 Context switch - Variants of context switch

Context switch may refer to different things

I application changes the CPU privilege level (kernel/user) of the running code
  I system calls – the set of basic operations provided by the OS that applications use to open a file/socket, write/read it, ...
  I CPU registers, stack, ... are reloaded on the core
I OS changes the thread that runs on a core
I OS changes the process that runs on a core

Our main interest is in switching threads, e.g.:

I if lock is taken, the thread must be suspended until it is released

I if queue is empty, then consumer must be suspended

I if there is no more data in a socket, the reader must be suspended

Too many context switches may degrade performance!

Performance considerations 142/160 Context switch - Context switch test

A thread/process context switch takes 1-10 microseconds (system dependent).

1. Two actors with own threads: the producer writes 1 million integer values to the consumer actor, which sums them up.
2. Two actors with own threads: ping-pong of 0.5 million values.
3. Two actors with a shared thread: ping-pong of 0.5 million values.

Case                                        Total time
Two actors: producer-consumer               0.41 s
Two actors with own threads (ping-pong)     6 s
Two actors with shared thread (ping-pong)   0.18 s
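A minimal sketch of case 2 (an assumed implementation with SynchronousQueue, not the lecture's actual benchmark): every value is handed to the other thread and bounced back, so each round trip pays for roughly two context switches.

import java.util.concurrent.SynchronousQueue;

public class PingPong {
    public static void main(String[] args) throws InterruptedException {
        SynchronousQueue<Integer> ping = new SynchronousQueue<>();
        SynchronousQueue<Integer> pong = new SynchronousQueue<>();
        final int N = 500_000;

        Thread echo = new Thread(() -> {
            try {
                for (int i = 0; i < N; i++)
                    pong.put(ping.take());   // bounce every value back
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        echo.start();

        long start = System.nanoTime();
        for (int i = 0; i < N; i++) {
            ping.put(i);    // hand the value to the other thread
            pong.take();    // wait until it comes back
        }
        echo.join();
        System.out.printf("%.2f s%n", (System.nanoTime() - start) / 1e9);
    }
}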

I ping-pong between two threads causes the expected drop in efficiency: 0.5 million round trips mean about 1 million context switches, i.e. 6 s / 10^6 ≈ 6 µs per switch

Performance considerations 143/160 Context switch - Solutions to context switch

1. Let the same thread do most of the work
  I from the queue/actor model back to wandering threads
2. Make sure a single thread does enough work before switching
  I make message-processing work expensive (in terms of computation)
  I keep queues full enough for consumers/transducers/actors – handle several messages in a row before switching to another thread
  I not always possible
3. Do not switch threads when switching actors, consumers, and/or transducers
  I use green threads

This problem is only relevant with many actors and/or many messages to which batching is not applicable!

Performance considerations 144/160 Green threads - Outline

Performance considerations
  Context switch
  Green threads
  Asynchronous IO
  Java NIO
Declarative concurrency
  Java 8 streams

Performance considerations 145/160 Green threads - Idea

Green threads (user-level threads):

I user-level thread is maintained outside OS, on the user level

I implemented by library or VM

I 1 kernel-level (OS) thread per n user-level threads

  I OS resources are allocated for 1 thread
  I cheaper scheduling – no OS context switch needed

I m kernel threads per n user threads

Problems:

I need a way to suspend execution and save/restore thread stack

  I i.e. preempt the executing thread
  I non-preemptable threads need to yield periodically

I IO may block the OS thread, which is needed by other green threads

Performance considerations 146/160 Green threads - Implementations

Languages/VMs:

I Java 1.1 had green threads as the main implementation

I Erlang VM uses green threads with no shared state

I Go

Libraries/frameworks/engines:

I Akka (Java) uses m-n model (specify dispatcher for an actor)

I CPython greenlet, eventlet, gevent

I Quasar (Java) modifies your code to save the stack (location and local variables)

See also:

I fibers

Performance considerations 147/160 Asynchronous IO - Outline

Performance considerations
  Context switch
  Green threads
  Asynchronous IO
  Java NIO
Declarative concurrency
  Java 8 streams

Performance considerations 148/160 Asynchronous IO - Blocking IO problem

I IO may block an OS thread that is used for many green threads

Solutions:

1. Use a dedicated thread pool for blocking IO
2. Use asynchronous IO (Erlang)
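A minimal sketch of solution 1 (the pool name, pool size, and file name are illustrative assumptions): blocking calls are shifted to a pool reserved for IO, so the threads that drive green threads/actors never block; the result arrives as a CompletableFuture, as with the promises covered earlier.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// a pool reserved for blocking calls only (the size 16 is an arbitrary choice)
ExecutorService ioPool = Executors.newFixedThreadPool(16);

CompletableFuture<byte[]> data = CompletableFuture.supplyAsync(() -> {
    try {
        return Files.readAllBytes(Paths.get("data.bin"));   // blocking read, off the main threads
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}, ioPool);

data.thenAccept(bytes -> System.out.println("read " + bytes.length + " bytes"));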

Some frameworks:

I Netty is a non-blocking I/O (NIO) client-server framework for the development of Java network applications

I Asynchronous servlets in Servlet 3.0

Performance considerations 149/160 Asynchronous IO - Ideas

I Synchronous IO suspends if no data is yet available

I Asynchronous IO – use callbacks that are executed when IO is readable/writeable

  I does not block on IO operations
  I a single thread may read multiple sockets (selectors)

Advantages:

I avoids context switch when reading from multiple sockets

I solves green thread blocking IO problem Disadvantages:

I requires more code to handle IO

I code becomes more scattered

Performance considerations 150/160 Asynchronous IO - Java NIO Buffers and channels

http://tutorials.jenkov.com/java-nio/index.html

I buffers are much like arrays

  I provide the typical write-flip-read sequence
  I used for Java NIO channels
  I ByteBuffer.allocate(100)
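The write-flip-read cycle in isolation (a small sketch, independent of any channel):

ByteBuffer buf = ByteBuffer.allocate(100);
buf.put((byte) 42);     // write mode: position advances towards the limit
buf.putInt(2016);
buf.flip();             // switch to read mode: limit = old position, position = 0
byte b   = buf.get();   // read back in the same order
int year = buf.getInt();
buf.clear();            // ready for the next write cycle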

I channels are much like streams, but

  I both readable and writable
  I support asynchronous operation, e.g. read in AsynchronousByteChannel:

Future<Integer> read(ByteBuffer dst)
<A> void read(ByteBuffer dst, A attachment, CompletionHandler<Integer, ? super A> handler)

I write also supports these 2 forms: future and callback
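A minimal sketch of both forms on an AsynchronousSocketChannel (host, port, and buffer size are assumptions; exception handling is omitted):

AsynchronousSocketChannel channel = AsynchronousSocketChannel.open();
channel.connect(new InetSocketAddress("example.com", 80)).get();   // wait for the connection

ByteBuffer buf = ByteBuffer.allocate(1024);

// 1) future form: the caller decides when to block or poll
Future<Integer> pending = channel.read(buf);
int n = pending.get();                       // blocks only this caller

// 2) callback form: no thread waits, the handler runs when data arrives
buf.clear();
channel.read(buf, null, new CompletionHandler<Integer, Void>() {
    public void completed(Integer bytesRead, Void att) {
        System.out.println("read " + bytesRead + " bytes");
    }
    public void failed(Throwable exc, Void att) {
        exc.printStackTrace();
    }
});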

Performance considerations 151/160 Asynchronous IO - Java NIO Selectors

I may register a callback for each channel we are interested in

I easier way is to use selectors

I register as many channels as we want, select desired operation:

channel.configureBlocking(false);
SelectionKey key = channel.register(selector, SelectionKey.OP_READ);

I supported operations OP_CONNECT, OP_ACCEPT, OP_READ, OP_WRITE

I use selector.select() – blocks until at least one channel is ready for the events you registered for

I selector.selectedKeys() – returns the channels that are ready
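A sketch of the usual selector loop (assuming channel is an already connected, non-blocking SocketChannel):

Selector selector = Selector.open();
channel.configureBlocking(false);
channel.register(selector, SelectionKey.OP_READ);

while (true) {
    selector.select();                           // blocks until some channel is ready
    Iterator<SelectionKey> it = selector.selectedKeys().iterator();
    while (it.hasNext()) {
        SelectionKey key = it.next();
        it.remove();                             // selected keys must be removed manually
        if (key.isReadable()) {
            SocketChannel ch = (SocketChannel) key.channel();
            ByteBuffer buf = ByteBuffer.allocate(1024);
            int n = ch.read(buf);                // will not block: data is ready
            // ... process buf; deregister the key when n == -1
        }
    }
}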

Performance considerations 152/160 Summary

I context switch means changing the execution mode, thread, or process

I context switch is quite expensive on OS (kernel) level

I green threads (user-level threads) may mitigate the cost

I green threads have problems with preemption, saving the stack, and blocking IO

I blocking IO may be solved by:

  I using a dedicated thread pool
  I using asynchronous IO

Declarative concurrency 153/160 - Ideas

I Java <8 lacked functional style

I declarative = pure functional (see later: Erlang, Clojure)

  I single-assignment variables, lock-step execution
  I deterministic, no side effects, no race conditions
  I laziness, dataflow programming

I interest in performance (utilizing cores)

I structured declarative concurrency

I parallel map/filter/reduce

Declarative concurrency 154/160 Java 8 streams - Outline

Performance considerations
  Context switch
  Green threads
  Asynchronous IO
  Java NIO
Declarative concurrency
  Java 8 streams

Declarative concurrency 155/160 Java 8 streams - Java 8 streams

Like usual streams:

I a sequence of values

Unlike usual streams:

I do not have state, they are used only for data transformation

I support map/filter/reduce transformations

I lazy – do not execute until data is needed

Create a stream:

Stream<E> Collection.stream()
static <T> Stream<T> Arrays.stream(T[] array)
static <T> Stream<T> Stream.of(T... values)
static <T> Stream<T> Stream.generate(Supplier<T> s)
static <T> Stream<T> Stream.iterate(T seed, UnaryOperator<T> f)

I the last 2 produce infinite streams
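For illustration (made-up values, not from the lecture):

Stream<String> letters = Arrays.asList("a", "b", "c").stream();
Stream<Integer> values = Stream.of(1, 2, 3);

// the last two create infinite streams, so they only make sense together with limit()
Stream<Double> randoms = Stream.generate(Math::random).limit(5);
Stream<Integer> powers = Stream.iterate(1, x -> x * 2).limit(10);   // 1, 2, 4, 8, ...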

Declarative concurrency 156/160 Java 8 streams - Collecting stream

I streams are not executed until their results are needed

I terminal operation – one that produces the result

Some terminal operations:

long count()
Optional<T> max(Comparator<? super T> comparator)
Optional<T> reduce(BinaryOperator<T> accumulator)
void forEach(Consumer<? super T> action)
Object[] toArray()
<R, A> R collect(Collector<? super T, A, R> collector)

I Collector interface is very general

I Collectors class contains a lot of standard implementations

  I toList(), toSet(), ...
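Small examples of terminal operations (made-up data):

List<Integer> squares = Stream.of(1, 2, 3, 4)
                              .map(x -> x * x)
                              .collect(Collectors.toList());          // [1, 4, 9, 16]

long evens = Stream.of(1, 2, 3, 4).filter(x -> x % 2 == 0).count();   // 2

Optional<Integer> largest = Stream.of(3, 1, 2).max(Integer::compare); // Optional[3]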

Declarative concurrency 157/160 Java 8 streams - Transforming stream

I map – transform each element and return a new stream

<R> Stream<R> map(Function<? super T, ? extends R> mapper)

I filter – select only some elements from the stream

Stream<T> filter(Predicate<? super T> predicate)

I reduce – aggregate stream into the final result

Optional<T> reduce(BinaryOperator<T> accumulator)
T reduce(T identity, BinaryOperator<T> accumulator)

I flatMap – like map but combining resulting streams

<R> Stream<R> flatMap(Function<? super T, ? extends Stream<? extends R>> mapper)

I analogue of thenCompose in CompletableFuture
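Small illustrations of reduce and flatMap (made-up data):

// reduce: fold the whole stream into a single value
int sum = Stream.of(1, 2, 3, 4).reduce(0, Integer::sum);              // 10

// flatMap: each element maps to a stream, the results are concatenated
List<Integer> flat = Stream.of(Arrays.asList(1, 2), Arrays.asList(3, 4))
                           .flatMap(List::stream)
                           .collect(Collectors.toList());             // [1, 2, 3, 4]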

Declarative concurrency 158/160 Java 8 streams - Example: explicit

Collect travelers that have speed > 20, then take their min and max temperatures

I types of intermediate streams are given explicitly

I see next slide for more compact version

I limit(n) takes the first n elements

Stream<Traveler> travelers = Stream.generate(Traveler::generate);
Stream<Traveler> trav10000 = travelers.limit(10000);
Stream<Traveler> carTrav = trav10000.filter(t -> t.speed > 20.0);
List<Double> carTemps = carTrav.map(t -> t.temperature)
                               .collect(Collectors.toList());
Optional<Double> minT = carTemps.stream().min(Double::compare);
Optional<Double> maxT = carTemps.stream().max(Double::compare);
System.out.println(minT + " " + maxT);

Declarative concurrency 159/160 Java 8 streams - Example: inline

I more readable than explicit version

double[] temps2 = Stream.generate(Traveler::generate)
    .limit(10000)
    .filter(t -> t.speed > 20.0)         // fast moving
    .mapToDouble(t -> t.temperature)     // take temperature
    .toArray();
System.out.println(DoubleStream.of(temps2).min() + " "
                   + DoubleStream.of(temps2).max());

I do not overuse – many anonymous intermediate results may obscure what is actually happening

Declarative concurrency 160/160 Java 8 streams - Parallel execution

Advantages over a usual for loop:

I easily parallelizable, e.g. run the example in parallel

double[] temps3 = Stream.generate(Traveler::generate)
    .parallel()
    .limit(10000)
    .filter(t -> t.speed > 20.0)         // fast moving
    .mapToDouble(t -> t.temperature)     // take temperature
    .toArray();
System.out.println(DoubleStream.of(temps3).min() + " "
                   + DoubleStream.of(temps3).max());

I streams are composable, e.g. may write

Stream<Double> collectTempOfCarTravelers(Stream<Traveler> travelers)

I combine it in different contexts (see the sketch below)
I no operation is executed until a terminal operation for the whole stream is invoked
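A sketch of how such a composable piece could look (the method body, the generic types, and the 30.0 threshold are assumptions; only the method name comes from the slide). The fragment stays lazy, so each caller decides how to finish the pipeline:

// reusable, lazy pipeline fragment: nothing runs until a terminal operation
static Stream<Double> collectTempOfCarTravelers(Stream<Traveler> travelers) {
    return travelers
        .filter(t -> t.speed > 20.0)       // fast moving
        .map(t -> t.temperature);          // take temperature
}

// the same fragment used in two different contexts
double[] all = collectTempOfCarTravelers(Stream.generate(Traveler::generate).limit(10000))
                   .mapToDouble(Double::doubleValue)
                   .toArray();

long hot = collectTempOfCarTravelers(Stream.generate(Traveler::generate).limit(10000))
               .filter(temp -> temp > 30.0)
               .count();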