Parallel Processing/Programming with applications to image processing

Lectures:
1. Parallel Processing & Programming – from high performance mono-cores to multi- and many-cores
2. Programming Interfaces (API) for multi-cores, many-cores and heterogeneous programming

Exercises:
1. High performance mono-cores and multi-cores; programming with openMP
2. Architecture of modern GPUs; many-core processing and programming with CUDA

Labs:
1. High performance processing with pipelines and cache-memories; simple multi-core programming with openMP
2. openMP programming for parallel image processing
3. Simple many-core programming with CUDA – vector and matrix processing
4. Many-core parallel image processing with CUDA and openCV
5. Many-core parallel image generation and animation with CUDA and openGL

All labs are prepared on embedded boards: Odroid-U3 (openMP) and Tegra K1 (openMP & CUDA)

Performance ?

Performance = 1/Time.to.process.the.given.task = 1/TT

TT = Number.of.Instructions * Number.of.Clock.Cycles.per.Instruction * Time.of.Clock.Cycle

Number.of.Instructions – task complexity (CISC or RISC)
Number.of.Clock.Cycles.per.Instruction – architecture & micro-parallelism
Time.of.Clock.Cycle (or 1/Clock.Frequency) – technology

Performance – an example

Performance = 1/Time.to.process.the.given.task = 1/TT
TT = Number.of.Instructions * Number.of.Clock.Cycles.per.Instruction * Time.of.Clock.Cycle

Number.of.Instructions = 10^6
Number.of.Clock.Cycles.per.Instruction = 2
Time.of.Clock.Cycle = 1 ns (Clock.Frequency = 1 GHz)

What is the execution time ? What is the performance ?
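A worked answer (note that a 1 GHz clock gives Time.of.Clock.Cycle = 1 ns):

TT = 10^6 * 2 * 1 ns = 2 ms
Performance = 1/TT = 1/(2 ms) = 500 tasks per second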

Number of Instructions

CISC – Complex Instruction Set Computer, e.g. x86 (Intel, ...)
+ fewer instructions (less memory) per task
– multiple instruction formats, complex decoding

RISC – Reduced Instruction Set Computer, e.g. ARM, MIPS, ...
– more instructions (memory) per task
+ few instruction formats, rapid decoding

For example, a single x86 memory-to-memory add may correspond to a load/add/store sequence of three RISC instructions.

Number of Clock Cycles per Instruction

Pipelining: elaboration of several instructions at the same time
Multi-scalar processing: multiple pipelines with multiple execution units: ALUs – fixed and floating point
Out-of-order processing: instruction queues with micro-scheduling, physical and virtual registers
Vector processing units: multiple data units processed by the same instruction

Pipelining

Elaboration of several instructions at the same time

cycle:   1
I1:    fetch

(each pipeline stage takes one clock cycle)

Pipelining

Elaboration of several instructions at the same time

cycle:   1      2
I1:    fetch  decode
I2:           fetch

Pipelining

Elaboration of several instructions at the same time

cycle:   1      2       3
I1:    fetch  decode  execute
I2:           fetch   decode
I3:                   fetch

Pipelining

Elaboration of several instructions at the same time

cycle:   1      2       3        4
I1:    fetch  decode  execute  write
I2:           fetch   decode   execute
I3:                   fetch    decode
I4:                            fetch

Pipelining

Elaboration of several instructions at the same time

cycle:   1      2       3        4        5
I1:    fetch  decode  execute  write
I2:           fetch   decode   execute  write
I3:                   fetch    decode   execute
I4:                            fetch    decode

Pipelining

cycle:   1      2       3        4        5        6
I1:    fetch  decode  execute  write
I2:           fetch   decode   execute  write
I3:                   fetch    decode   execute  write
I4:                            fetch    decode   execute

Pipelining

cycle:   1      2       3        4        5        6        7
I1:    fetch  decode  execute  write
I2:           fetch   decode   execute  write
I3:                   fetch    decode   execute  write
I4:                            fetch    decode   execute  write
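The rule illustrated by these diagrams: a k-stage pipeline completes N instructions in k + N - 1 cycles instead of k * N. Here k = 4 and N = 4, so the four instructions finish in 4 + 4 - 1 = 7 cycles instead of 16.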

Pipelining and caching

[diagram] The fetch stage performs the instruction memory access through the Instruction Cache (L1-I); the execute/write stages perform the data memory access through the Data Cache (L1-D). Each stage still takes one clock cycle.

Super-scalar Pipelining

Execution of several instructions (here 2) at the same time

[diagram] Two pipelines working side by side: in every clock cycle two instructions are fetched, decoded, executed and written back in parallel.

Max number of instructions per clock cycle is 2 ?

Pipelining & vector processing

Execution of the same instruction on several data units

[diagram] A single pipeline whose execute stage operates on multiple data: 4 int/float values per instruction.
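At the programming level this corresponds to vectorized loops; a minimal sketch, assuming a compiler with OpenMP 4.0 support for the omp simd directive (the array names and size are illustrative):

#include <stdio.h>
#define N 1024
int main(void)
{
  float a[N], b[N], c[N];
  int i;
  for (i = 0; i < N; i++) { a[i] = i; b[i] = 2*i; }
  #pragma omp simd               // one instruction processes several data units
  for (i = 0; i < N; i++)
    c[i] = a[i] + b[i];
  printf("c[10] = %f\n", c[10]);
  return(0);
}

Compiled with cc -fopenmp (or -fopenmp-simd), the loop can be mapped onto the vector unit, e.g. NEON on ARM.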

Pipelining and caching

Execution of the memory store/load instructions

I1 (store): fetch  decode  write  write
I2 (load):  fetch  decode  read   read
I3 (store): fetch  decode  write  write
I4 (load):  fetch  decode  read   read

(the memory access stage takes two clock cycles)

L1 – cache hit/miss: if miss, a stall of ~10 cycles for the L2 access. For example, with a 10% miss rate the average access time grows from 2 to about 2 + 0.1*10 = 3 clock cycles.

Clock frequency

Performance = 1/Time.to.process.the.given.task = 1/TT
TT = Number.of.Instructions * Number.of.Clock.Cycles.per.Instruction * Time.of.Clock.Cycle
or
Performance = Clock.Frequency * 1/(Number.of.Instructions * Number.of.Clock.Cycles.per.Instruction)

Clock frequency

[diagram: clock generator driving the processor]

The problem is power consumption:

dynamic.power = A*N*C*V²*f
where: A – activity rate, N – number of driven transistors, C – input capacitance of a transistor node, V – voltage, f – frequency (?)

Clock frequency & voltage

Let us increase the frequency. What is the impact on the dynamic power consumption ? Where is the problem ?

dynamic.power = A*N*C*V²*f

The answer is: increasing the frequency requires increasing the voltage. For example, increasing the frequency by a factor of 2 requires increasing the voltage by a factor of 1.66; so the dynamic power consumption increases by ?
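A worked answer: the new power is A*N*C*(1.66*V)²*(2*f) = 1.66² * 2 * A*N*C*V²*f ≈ 5.5 times the original dynamic power, for only twice the performance.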

ARM Cortex-A15 architecture

Multi-core architecture

[diagram: L1 access ~2 clock cycles, L2 access ~10 clock cycles]

SMP – Symmetric Multi-Processor with shared memory

Independent L1 instruction and data caches (L1-D, L1-I); shared L2 cache

ARM Cortex-A15 MPCore

ARM Cortex-A15: the most powerful ARMv7 processor
32 KB L1 caches, 4 MB L2 cache, clock up to 2.5 GHz
1 TB RAM memory address space

Multi-core programming with openMP

Parallel processing speed-up:

speedup = 1/(S + (1-S)/N)

where: S – serial part of the task, N – number of processors

Example: For S=10% and N=4 the speedup is ?
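A worked answer (the formula is Amdahl's law): speedup = 1/(0.1 + 0.9/4) = 1/0.325 ≈ 3.08, well below the ideal factor of 4.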

Multi-core programming with openMP

openMP operates mostly via compiler directives

#pragma omp ...
{
  ...code...     // executed by automatically created threads
}

where the ...code... is called the parallel region

Multi-core programming with openMP

#include "stdio.h" #include

int main(int argc, char *argv[])
{
  #pragma omp parallel
  {
    printf("hello multicore user!\n");
  }
  return(0);
}

%cc -o Hello.omp Hello.omp.c -fopenmp

%./Hello.omp
hello multicore user!
hello multicore user!
hello multicore user!
hello multicore user!

openMP environment

#include "stdio.h" #include int main(int argc, char *argv[]) { #pragma omp parallel { int NCPU,tid,NPR,NTHR; NCPU = omp_get_num_procs(); // get the number of available cores tid = omp_get_thread_num(); // get current ID NPR = omp_get_num_threads(); // get total number of threads NTHR = omp_get_max_threads();// get number of threads requested if (tid == 0) { // execute it in master thread printf("%i : NCPU\t= %i\n",tid,NCPU); printf("%i : NTHR\t= %i\n",tid,NTHR); printf("%i : NPR\t= %i\n",tid,NPR); } printf("%i: I am thread %i out of %i\n",tid,tid,NPR); } return(0); }

openMP environment

%cc -o HelloMulticore HelloMultiCore.c -fopenmp
%export OMP_NUM_THREADS=4

%./HelloMulticore

1 : I am thread 1 out of 4
2 : I am thread 2 out of 4
0 : NCPU = 4
0 : NTHR = 1
0 : NPR  = 4
0 : I am thread 0 out of 4
3 : I am thread 3 out of 4

The number of threads may be set to any value: 8, 16, ... But the number of cores is fixed by the architecture.
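A minimal sketch showing that the thread count can also be requested from inside the program rather than via OMP_NUM_THREADS (the value 8 is illustrative):

#include <stdio.h>
#include <omp.h>
int main(void)
{
  omp_set_num_threads(8);   // request 8 threads, even on a 4-core CPU
  #pragma omp parallel
  printf("I am thread %i out of %i\n",
         omp_get_thread_num(), omp_get_num_threads());
  return(0);
}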

openMP local and global variables

int x;
#pragma omp parallel for
for(x=0; x<N; x++) { ... }

implicit local variable: each thread gets its own private copy of the loop index x

openMP local and global variables

int x;
#pragma omp parallel for
for(x=0; x<N; x++) { ... }

Variables declared outside the parallel region remain global, i.e. shared between the threads, unless declared private.
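A minimal sketch contrasting the two kinds of variables: the loop index x is implicitly private, while sum is shared and its update must be protected (names and values are illustrative):

#include <stdio.h>
#define N 8
int main(void)
{
  int x;        // loop index: implicitly private in the parallel for
  int sum = 0;  // declared outside: shared between the threads
  #pragma omp parallel for shared(sum)
  for (x = 0; x < N; x++) {
    #pragma omp atomic       // protect the update of the shared variable
    sum += x;
  }
  printf("sum = %i\n", sum); // always 28
  return(0);
}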

openMP loop scheduling

int i,j,n;
#pragma omp parallel for default(none) schedule(static) private(i,j) shared(n)
for (i=0; i<n; i++)
  for (j=0; j<n; j++) { ... }

schedule(static) divides the iterations of the outer loop into equal chunks assigned to the threads in advance.
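A minimal sketch of the alternative, schedule(dynamic), which pays off when iterations have uneven cost, as with the triangular inner loop below (the chunk size 8 is illustrative):

#include <stdio.h>
#include <omp.h>
#define N 1000
int main(void)
{
  long count = 0;
  int i, j;
  double t = omp_get_wtime();
  #pragma omp parallel for schedule(dynamic,8) private(j) reduction(+:count)
  for (i = 0; i < N; i++)
    for (j = 0; j < i; j++)    // the work grows with i
      count++;
  printf("count = %li, time = %f s\n", count, omp_get_wtime()-t);
  return(0);
}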

openMP reduction operator

#define N 1000
int main (int argc, char *argv[])
{
  double a[N], b[N];
  double sum = 0.0;
  int i, n, tid;
  #pragma omp parallel shared(a) shared(b) private(i)
  {
    tid = omp_get_thread_num();
    #pragma omp for
    for (i=0; i < N; i++) {
      a[i] = 1.0; b[i] = 1.0;
    }
    #pragma omp for reduction(+:sum)
    for (i=0; i < N; i++) {
      sum += a[i]*b[i];
    }
  } /* End of parallel region */
  printf("Sum = %2.1f\n",sum);
  exit(0);
}

The products are calculated in parallel; the sum is a unique (reduced) value.

openMP matrix multiplication

int DIM=512;
#pragma omp parallel for private(i,j,k,dot) shared(a,b,c)
for(i=0;i<DIM;i++)
  for(j=0;j<DIM;j++) {
    dot = 0.0;
    for(k=0;k<DIM;k++) dot += a[i][k]*b[k][j];
    c[i][j] = dot;
  }

openMP matrix multiplication

#pragma omp parallel for private(i,j,k,dot) shared(a,b,c) firstprivate(DIM)
for(i=0;i<DIM;i++)
  for(j=0;j<DIM;j++) {
    dot = 0.0;
    for(k=0;k<DIM;k++) dot += a[i][k]*b[k][j];
    c[i][j] = dot;
  }

firstprivate(DIM) gives each thread its own copy of DIM, initialized with the value set before the parallel region.
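The program is compiled and run like the previous examples; a minimal sketch (the file name is illustrative):

%cc -o MatMul MatMul.c -fopenmp
%export OMP_NUM_THREADS=4
%./MatMul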

Summary

Elements of high performance architectures:
Instruction set architecture (ISA)
Micro-parallelism
Clock frequency

Example of the ARM Cortex-A15

Why do we need multi-core architectures ?

Basic multi-core programming with openMP
