Parallel Processing/Programming with applications to image processing

Lectures:
1. Parallel Processing & Programming – from high performance mono-cores to multi- and many-cores
2. Programming Interfaces (API) for multi-cores, many-cores and heterogeneous programming

Exercises:
1. High performance mono-cores and multi-cores; programming with openMP
2. Architecture of modern GPUs; many-core processing and programming with CUDA

Labs:
1. High performance processing with pipelines and cache-memories; simple multi-core programming with openMP
2. openMP programming for parallel image processing
3. Simple many-core programming with CUDA – vector and matrix processing
4. Many-core parallel image processing with CUDA and openCV
5. Many-core parallel image generation and animation with CUDA and openGL

All labs are prepared on embedded boards: Odroid-U3 (openMP) and Tegra K1 (openMP & CUDA)

Performance ?

Performance = 1/Time.to.process.the.given.task = 1/TT

TT = Number.of.Instructions * Number.of.Clock.Cycles.per.Instruction * Time.of.Clock.Cycle

Number.of.Instructions – task complexity (CISC or RISC)
Number.of.Clock.Cycles.per.Instruction – architecture & micro-parallelism
Time.of.Clock.Cycle (or 1/Clock.Frequency) – technology

Performance – an example

Performance = 1/Time.to.process.the.given.task = 1/TT
TT = Number.of.Instructions * Number.of.Clock.Cycles.per.Instruction * Time.of.Clock.Cycle

Number.of.Instructions = 10^6
Number.of.Clock.Cycles.per.Instruction = 2
Time.of.Clock.Cycle = 1 ns (Clock.Frequency = 1 GHz)

What is the execution time ? What is the performance ?
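A worked answer (note that a 1 GHz clock gives Time.of.Clock.Cycle = 1 ns):

TT = 10^6 * 2 * 1 ns = 2 ms
Performance = 1/TT = 1/(2 ms) = 500 tasks per second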

Number of Instructions

CISC – Complex Instruction Set Computer, e.g. x86 (Intel, ...)
+ fewer instructions (less memory) per task
– multiple instruction formats, complex decoding

RISC – Reduced Instruction Set Computer, e.g. ARM, MIPS, ...
– more instructions (memory) per task
+ few instruction formats, rapid decoding

For example, a single x86 memory-to-memory add may correspond to a load/add/store sequence of three RISC instructions.

Number of Clock Cycles per Instruction

Pipelining: elaboration of several instructions at the same time
Multi-scalar processing: multiple pipelines with multiple execution units: ALUs – fixed and floating point
Out-of-order processing: instruction queues with micro-scheduling, physical and virtual registers
Vector processing units: multiple data units processed by the same instruction

Pipelining

Elaboration of several instructions at the same time

cycle:   1
I1:    fetch

(each pipeline stage takes one clock cycle)

Pipelining

Elaboration of several instructions at the same time

cycle:   1      2
I1:    fetch  decode
I2:           fetch

Pipelining

Elaboration of several instructions at the same time

cycle:   1      2       3
I1:    fetch  decode  execute
I2:           fetch   decode
I3:                   fetch

Pipelining

Elaboration of several instructions at the same time

cycle:   1      2       3        4
I1:    fetch  decode  execute  write
I2:           fetch   decode   execute
I3:                   fetch    decode
I4:                            fetch

Pipelining

Elaboration of several instructions at the same time

cycle:   1      2       3        4        5
I1:    fetch  decode  execute  write
I2:           fetch   decode   execute  write
I3:                   fetch    decode   execute
I4:                            fetch    decode

Pipelining

cycle:   1      2       3        4        5        6
I1:    fetch  decode  execute  write
I2:           fetch   decode   execute  write
I3:                   fetch    decode   execute  write
I4:                            fetch    decode   execute

Pipelining

cycle:   1      2       3        4        5        6        7
I1:    fetch  decode  execute  write
I2:           fetch   decode   execute  write
I3:                   fetch    decode   execute  write
I4:                            fetch    decode   execute  write
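The rule illustrated by these diagrams: a k-stage pipeline completes N instructions in k + N - 1 cycles instead of k * N. Here k = 4 and N = 4, so the four instructions finish in 4 + 4 - 1 = 7 cycles instead of 16.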

Pipelining and caching

[diagram] The fetch stage performs the instruction memory access through the Instruction Cache (L1-I); the execute/write stages perform the data memory access through the Data Cache (L1-D). Each stage still takes one clock cycle.

Super-scalar Pipelining

Execution of several instructions (here 2) at the same time

[diagram] Two pipelines working side by side: in every clock cycle two instructions are fetched, decoded, executed and written back in parallel.

Max number of instructions per clock cycle is 2 ?

Pipelining & vector processing

Execution of the same instruction on several data units

[diagram] A single pipeline whose execute stage operates on multiple data: 4 int/float values per instruction.
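At the programming level this corresponds to vectorized loops; a minimal sketch, assuming a compiler with OpenMP 4.0 support for the omp simd directive (the array names and size are illustrative):

#include <stdio.h>
#define N 1024
int main(void)
{
  float a[N], b[N], c[N];
  int i;
  for (i = 0; i < N; i++) { a[i] = i; b[i] = 2*i; }
  #pragma omp simd               // one instruction processes several data units
  for (i = 0; i < N; i++)
    c[i] = a[i] + b[i];
  printf("c[10] = %f\n", c[10]);
  return(0);
}

Compiled with cc -fopenmp (or -fopenmp-simd), the loop can be mapped onto the vector unit, e.g. NEON on ARM.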

Pipelining and caching

Execution of the memory store/load instructions

I1 (store): fetch  decode  write  write
I2 (load):  fetch  decode  read   read
I3 (store): fetch  decode  write  write
I4 (load):  fetch  decode  read   read

(the memory access stage takes two clock cycles)

L1 – cache hit/miss: if miss, a stall of ~10 cycles for the L2 access. For example, with a 10% miss rate the average access time grows from 2 to about 2 + 0.1*10 = 3 clock cycles.

Clock frequency

Performance = 1/Time.to.process.the.given.task = 1/TT
TT = Number.of.Instructions * Number.of.Clock.Cycles.per.Instruction * Time.of.Clock.Cycle
or
Performance = Clock.Frequency * 1/(Number.of.Instructions * Number.of.Clock.Cycles.per.Instruction)

Clock frequency

[diagram: clock generator driving the processor]

The problem is power consumption:

dynamic.power = A*N*C*V²*f
where: A – activity rate, N – number of driven transistors, C – input capacitance of a transistor node, V – voltage, f – frequency (?)

Clock frequency & voltage

Let us increase the frequency. What is the impact on the dynamic power consumption ? Where is the problem ?

dynamic.power = A*N*C*V²*f

The answer is: increasing the frequency requires increasing the voltage. For example, increasing the frequency by a factor of 2 requires increasing the voltage by a factor of 1.66; so the dynamic power consumption increases by ?
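A worked answer: the new power is A*N*C*(1.66*V)²*(2*f) = 1.66² * 2 * A*N*C*V²*f ≈ 5.5 times the original dynamic power, for only twice the performance.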

ARM Cortex-A15 architecture

Multi-core architecture

[diagram: L1 access ~2 clock cycles, L2 access ~10 clock cycles]

SMP – Symmetric Multi-Processor with shared memory

Independent L1 instruction and data caches (L1-D, L1-I); shared L2 cache

ARM Cortex-A15 MPCore

ARM Cortex-A15: the most powerful ARMv7 processor
32 KB L1 caches, 4 MB L2 cache, clock up to 2.5 GHz
1 TB RAM memory address space

Multi-core programming with openMP

Parallel processing speed-up:

speedup = 1/(S + (1-S)/N)

where: S – serial part of the task, N – number of processors

Example: For S=10% and N=4 the speedup is ?
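A worked answer (the formula is Amdahl's law): speedup = 1/(0.1 + 0.9/4) = 1/0.325 ≈ 3.08, well below the ideal factor of 4.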

Multi-core programming with openMP

openMP operates mostly via compiler directives

#pragma omp ...
{
  ...code...     // executed by automatically created threads
}

where the ...code... is called the parallel region

Multi-core programming with openMP

#include "stdio.h" #include

int main(int argc, char *argv[])
{
  #pragma omp parallel
  {
    printf("hello multicore user!\n");
  }
  return(0);
}

%cc -o Hello.omp Hello.omp.c -fopenmp

%./Hello.omp
hello multicore user!
hello multicore user!
hello multicore user!
hello multicore user!

openMP environment

#include "stdio.h" #include int main(int argc, char *argv[]) { #pragma omp parallel { int NCPU,tid,NPR,NTHR; NCPU = omp_get_num_procs(); // get the number of available cores tid = omp_get_thread_num(); // get current ID NPR = omp_get_num_threads(); // get total number of threads NTHR = omp_get_max_threads();// get number of threads requested if (tid == 0) { // execute it in master thread printf("%i : NCPU\t= %i\n",tid,NCPU); printf("%i : NTHR\t= %i\n",tid,NTHR); printf("%i : NPR\t= %i\n",tid,NPR); } printf("%i: I am thread %i out of %i\n",tid,tid,NPR); } return(0); }

openMP environment

%cc -o HelloMulticore HelloMultiCore.c -fopenmp
%export OMP_NUM_THREADS=4

%./HelloMulticore

1 : I am thread 1 out of 4
2 : I am thread 2 out of 4
0 : NCPU = 4
0 : NTHR = 1
0 : NPR  = 4
0 : I am thread 0 out of 4
3 : I am thread 3 out of 4

The number of threads may be set to any value: 8, 16, ... But the number of cores is fixed by the architecture.
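A minimal sketch showing that the thread count can also be requested from inside the program rather than via OMP_NUM_THREADS (the value 8 is illustrative):

#include <stdio.h>
#include <omp.h>
int main(void)
{
  omp_set_num_threads(8);   // request 8 threads, even on a 4-core CPU
  #pragma omp parallel
  printf("I am thread %i out of %i\n",
         omp_get_thread_num(), omp_get_num_threads());
  return(0);
}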

openMP local and global variables

int x;
#pragma omp parallel for
for(x=0; x<N; x++) { ... }

implicit local variable: each thread gets its own private copy of the loop index x

openMP local and global variables

int x;
#pragma omp parallel for
for(x=0; x<N; x++) { ... }

Variables declared outside the parallel region remain global, i.e. shared between the threads, unless declared private.
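A minimal sketch contrasting the two kinds of variables: the loop index x is implicitly private, while sum is shared and its update must be protected (names and values are illustrative):

#include <stdio.h>
#define N 8
int main(void)
{
  int x;        // loop index: implicitly private in the parallel for
  int sum = 0;  // declared outside: shared between the threads
  #pragma omp parallel for shared(sum)
  for (x = 0; x < N; x++) {
    #pragma omp atomic       // protect the update of the shared variable
    sum += x;
  }
  printf("sum = %i\n", sum); // always 28
  return(0);
}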

openMP loop scheduling

int i,j,n;
#pragma omp parallel for default(none) schedule(static) private(i,j) shared(n)
for (i=0; i<n; i++)
  for (j=0; j<n; j++) { ... }

schedule(static) divides the iterations of the outer loop into equal chunks assigned to the threads in advance.
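A minimal sketch of the alternative, schedule(dynamic), which pays off when iterations have uneven cost, as with the triangular inner loop below (the chunk size 8 is illustrative):

#include <stdio.h>
#include <omp.h>
#define N 1000
int main(void)
{
  long count = 0;
  int i, j;
  double t = omp_get_wtime();
  #pragma omp parallel for schedule(dynamic,8) private(j) reduction(+:count)
  for (i = 0; i < N; i++)
    for (j = 0; j < i; j++)    // the work grows with i
      count++;
  printf("count = %li, time = %f s\n", count, omp_get_wtime()-t);
  return(0);
}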

openMP reduction operator

#define N 1000
int main (int argc, char *argv[])
{
  double a[N], b[N];
  double sum = 0.0;
  int i, n, tid;
  #pragma omp parallel shared(a) shared(b) private(i)
  {
    tid = omp_get_thread_num();
    #pragma omp for
    for (i=0; i < N; i++) {
      a[i] = 1.0; b[i] = 1.0;
    }
    #pragma omp for reduction(+:sum)
    for (i=0; i < N; i++) {
      sum += a[i]*b[i];
    }
  } /* End of parallel region */
  printf("Sum = %2.1f\n",sum);
  exit(0);
}

The products are calculated in parallel; the sum is a unique (reduced) value.

openMP matrix multiplication

int DIM=512;
#pragma omp parallel for private(i,j,k,dot) shared(a,b,c)
for(i=0;i<DIM;i++)
  for(j=0;j<DIM;j++) {
    dot = 0.0;
    for(k=0;k<DIM;k++) dot += a[i][k]*b[k][j];
    c[i][j] = dot;
  }

openMP matrix multiplication

#pragma omp parallel for private(i,j,k,dot) shared(a,b,c) firstprivate(DIM)
for(i=0;i<DIM;i++)
  for(j=0;j<DIM;j++) {
    dot = 0.0;
    for(k=0;k<DIM;k++) dot += a[i][k]*b[k][j];
    c[i][j] = dot;
  }

firstprivate(DIM) gives each thread its own copy of DIM, initialized with the value set before the parallel region.
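The program is compiled and run like the previous examples; a minimal sketch (the file name is illustrative):

%cc -o MatMul MatMul.c -fopenmp
%export OMP_NUM_THREADS=4
%./MatMul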

Summary

Elements of high performance architectures:
Instruction set architecture (ISA)
Micro-parallelism
Clock frequency

Example of the ARM Cortex-A15

Why do we need multi-core architectures ?

Basic multi-core programming with openMP
