Universidade Federal do Rio Grande do Sul
Instituto de Informática
Programa de Pós-Graduação em Computação

Processor Architecture and Organization

Lecture 17

Multi-core architectures

Motivation

• Future applications will require still more performance
• Power is the current bottleneck
• Performance driven by higher frequency and ILP hits the power wall
• Performance may be obtained by multiple processors running at lower frequencies

Types of parallelism

• Single processor core
  – ILP – Instruction-level parallelism
  – VLIW – Very Long Instruction Word
  – SIMD – Single Instruction Multiple Data
  – SMT – Simultaneous Multi-Threading
• Multiprocessing
  – SMP – Symmetrical Multi-Processing
  – CMP – Chip Multi-Processing (usually homogeneous)
  – MPSoC – Multi-Processor SoCs (usually heterogeneous)

Parallelism

Massive parallelism required in the foreseeable future

                        2003    2009    2015
Frequency (MHz)          300     600    1500
Gigaops/s                0.3      14    2458
Operations per cycle       1      23    1638
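The last row follows from the first two: operations per cycle is throughput divided by clock frequency. A minimal Python sketch of that arithmetic, using only the figures in the table (the dictionary and function names are illustrative, not from the roadmap):

```python
# Figures copied from the ITRS 2003 table above.
roadmap = {
    2003: {"freq_mhz": 300, "gigaops": 0.3},
    2009: {"freq_mhz": 600, "gigaops": 14},
    2015: {"freq_mhz": 1500, "gigaops": 2458},
}

def ops_per_cycle(freq_mhz, gigaops):
    """Operations per clock cycle = (operations/s) / (cycles/s)."""
    ops_per_second = gigaops * 1e9
    cycles_per_second = freq_mhz * 1e6
    return ops_per_second / cycles_per_second

for year, row in roadmap.items():
    print(year, round(ops_per_cycle(**row), 1))
```

The computed values agree with the table up to rounding. Frequency grows 5x over the period while the required parallelism grows by more than three orders of magnitude, which is the point of the slide.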

Source: ITRS Roadmap, 2003

Complexity of media applications

[Figure: operations per sample (vertical axis, up to 100K) versus sampling rate per second (horizontal axis, 100K to 10G) for video encode and decode workloads: QCIF H.263, CIF MPEG1, MPEG2 MP@ML, and MPEG2 MP@HL, with diagonal lines of constant throughput labeled in MOPS, GOPS, and TOPS]

Source: Shen, SIPS 2003

Parallelism

[Figure: increasing HW threads per socket (1, 10, 100) versus year (2003 to 2013). Labels: Hyper-thread; multi-core era, scalar and parallel applications; many-core era, massively parallel applications]

Source: www..com

General-purpose processing

• Tera-level computing involves three distinct types of workloads, or computing capabilities:
  – Recognition: the ability to recognize patterns and models of interest to a specific user or application scenario
  – Mining: the ability to mine large amounts of real-world data for the patterns or models of interest
  – Synthesis: the ability to synthesize large datasets or a virtual world based on the patterns or models of interest
• Intel foresees a multi-core architecture that is scalable, adaptable, and programmable

Source: www.intel.com

General-purpose solutions - IBM

• IBM Power4
  – 2 cores, f = 1.4 GHz, 174 Mtransistors
  – Single clock over entire die
  – Power = 85 W
    • One core may be turned off
  – Trend: multiple processors on die, communication, shared cache
• 4 Power4 chips into single module
  – Chips connected via 4 128-bit buses
  – Up to 128 MB L3 cache
  – Bus speed = ½ processor speed
  – Total throughput = 35 GB/s
  – Trend: multiple processors on MCM, on-module communication, huge cache
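The module-level figures can be related with a short calculation. The sketch below assumes one transfer per bus per bus cycle, which the slide does not state; the function name and parameters are illustrative.

```python
def module_bus_throughput_gb_s(core_ghz, n_buses=4, bus_bits=128, bus_ratio=0.5):
    """Peak GB/s over all inter-chip buses, assuming one transfer per bus cycle."""
    bus_ghz = core_ghz * bus_ratio          # bus speed = 1/2 processor speed
    bytes_per_transfer = bus_bits / 8       # 128 bits = 16 bytes
    return n_buses * bytes_per_transfer * bus_ghz

print(module_bus_throughput_gb_s(1.4))      # clock quoted on the slide
print(module_bus_throughput_gb_s(1.1))      # hypothetical lower clock
```

Under this assumption the 1.4 GHz clock gives a peak of about 44.8 GB/s, while the quoted 35 GB/s matches a core clock of roughly 1.1 GHz. The two numbers on the slide may therefore refer to different clock speeds, or the throughput may be a sustained rather than a peak rate.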

Source: Franza, MPSoC’05

General-purpose solutions - Sun

• Sun Ultrasparc IV
  – 2 cores, f = 1.8 GHz
  – Shared 2 MB L2 cache
  – 300 Mtransistors
• Sun Niagara
  – 8 cores
  – 4 threads per core
  – Shared 3 MB L2 cache
  – To be released in 2006

[Figure: Niagara block diagram: eight 4-way MT SPARC pipes, I/O shared functions, crossbar, 4-way banked L2 cache, memory controllers & I/O]

Source: Franza, MPSoC’05

General-purpose solutions - AMD

• AMD dual-core Opteron
  – 2 cores, f = 1.8 GHz
  – 106 Mtransistors
  – Power = 70 W
  – 2 x 1 MB L2 caches
  – Unshared caches

Source: Franza, MPSoC’05

General-purpose solutions - Intel

• Intel D
  – 2 HT processors on MCM
  – 2 x 1 MB L2 caches, unshared
  – f = 3.2 GHz
  – 230 Mtransistors
• Intel Montecito
  – 2 VLIW cores, f = 1.5 GHz
  – Power = 100 W
  – 1.72 Btransistors
  – 2 x 12 MB L3 asynchronous caches
  – Multiple clock domains
  – Power management
    • Dynamic voltage and frequency adjustment
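Dynamic voltage and frequency adjustment pays off because switching power in CMOS grows linearly with frequency and quadratically with supply voltage. The sketch below uses the textbook estimate P = a · C · V² · f with made-up operating points; none of the numbers are Montecito data.

```python
def dynamic_power(c_eff, vdd, freq_hz, activity=1.0):
    """Textbook CMOS switching-power estimate: P = a * C * V^2 * f."""
    return activity * c_eff * vdd ** 2 * freq_hz

# Hypothetical operating points (assumed values, for illustration only).
C_EFF = 1e-9                                   # effective switched capacitance, farads
full = dynamic_power(C_EFF, vdd=1.2, freq_hz=1.5e9)
scaled = dynamic_power(C_EFF, vdd=0.9, freq_hz=0.75e9)

print(f"one core, full speed:  {full:.2f} W")
print(f"one core, scaled down: {scaled:.2f} W")
print(f"two scaled cores:      {2 * scaled:.2f} W")
```

With these assumed values, two cores at half the frequency and a lower voltage offer the same aggregate cycle rate as one full-speed core for noticeably less switching power. This is the argument behind obtaining performance from multiple processors running at lower frequencies.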

Source: Franza, MPSoC’05

CMP with shared cache

• Advantages
  – Low communication latency between the cores
  – The interface between the cache and I/O is used only for off-chip communication
  – The cache can be dynamically allocated among the cores
• Disadvantages
  – Higher complexity
  – The cache needs more bandwidth
• Examples
  – IBM Power 4/5
  – Sun UltraSPARC-IV+

CMP with shared I/O

• Advantages
  – Simpler than the shared-cache model
  – Cores can communicate without going off-chip
• Disadvantages
  – Resources are wasted because the cache is not shared
  – The bandwidth between the cache and the bus is shared by on-chip and off-chip traffic
• Examples
  – Intel Itanium 2 (Montecito)
  – AMD Opteron Dual-Core

CMP with shared package

• Advantages
  – Requires no changes to the CPU logic
  – Short design time relative to the other models
• Disadvantages
  – Communication latency between the CPUs
  – Limits the frequency of the interconnect bus
• Examples
  – Intel (Smithfield)
  – Intel Pentium D (Presler)
  – Intel (Dempsey)

MPSoC issues

• Heterogeneous vs. homogeneous multi-processing: trade-off between programmability and efficiency
  – Heterogeneous ISAs
  – DSP processors for media applications
  – Hardwired blocks
  – Configurable processors
  – Heterogeneous memory systems and address spaces
  – Heterogeneous interconnects
• MPSoCs are custom architectures, derived from configurable platforms, driven by standards
  – Standards usually define I/O relationships, not algorithms

MPSoC issues

• Programming model and software development tools
• Memory model
  – Heterogeneous memory systems are harder to program
  – Support for real-time constraints and performance
• Communication architecture
  – Support for real-time constraints and performance
• Design methodologies and tools
  – How to configure a platform to meet application constraints?
  – Time-to-market requires support from tools
  – Market for tools is too limited
  – More simulation-oriented (ASIC tools are more synthesis-oriented)

Examples of multi-cores for the embedded market

• ST Nomadik
• Cell - IBM / Sony / Toshiba
• ARM11 MPCore
• Toshiba media processor MeP
• NEC MP-211
• Panasonic UniPhier
• Infineon 3G-baseband MPSoC

ST Nomadik platform

[Figure: Nomadik platform block diagram: general-purpose ARM CPU with cache; system DMA; embedded memory; memory, storage & connectivity; peripheral interfaces; symmetrical multi-media DSPs, each with its own cache; loosely-coupled acceleration sub-systems (HW1, HW2, graphics) with DMA]

Source: Artieri, MPSoC’05

Nomadik - MPSoC benefits

• High-computing performance
  – Multiple non-interfering domains of intense activity, each having its own processor, DMA services, and hardware accelerators for data intensive functions
  – Hardware acceleration embedding standard functions
  – Highest and predictable performance through a careful bus and memory hierarchy design
• Low-power
  – Intrinsic low-power sub-systems
  – Fine grain power management at sub-system level
  – Leakage management by switching on & off sub-systems

Source: Artieri, MPSoC’05

Nomadik - MPSoC benefits

• Software flexibility
  – General-purpose CPU allows fast porting of new features
  – Performance through optimization on DSP with reasonable effort
  – Full performance at low power using HW functions
• Three levels from simplest to most advanced usage
  – Monolithic general-purpose CPU
  – Monolithic general-purpose CPU, multiple symmetrical DSPs
  – Monolithic general-purpose CPU, multiple symmetrical DSPs, hardware accelerators

Source: Artieri, MPSoC’05

Nomadik - Multi-media DSP processor profile

• Short pipeline, high VLIW parallelism efficiency
  – 1 convolution tap per cycle (2 loads + 2 pointer updates + 1 multiplication + 1 MAC)
• Incremental architectural evolution, no race for frequency
• Floating point unit
  – IEEE754 compliant
  – Division and square root operation
• SIMD support
• Low power
  – Level 0 cache for power saving
  – Low-power instructions
  – Massive gated clock physical implementation
• Programmed only in ANSI C
  – Reduced learning curve and development time
  – Allow seamless DSP architecture evolution

Source: Artieri, MPSoC’05

Nomadik - Memory hierarchy and bus

• Becomes the main design bottleneck
  – Memory cache hierarchy
  – Bus matrix
  – Usage of shared embedded memory to offload bandwidth from external memory
  – Smart caching in embedded memory is key
    • Managed by software
    • Hardware controlled
  – L1-cache at sub-system level is sized in accordance with average latency
• A very manageable bottleneck
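One plausible reading of "sized in accordance with average latency" is the standard average memory access time relation: average latency = hit time + miss rate × miss penalty. The sketch below picks the smallest L1 size that meets a latency target. The miss rates, hit time, and miss penalty are hypothetical; the slides give no such figures.

```python
def average_latency(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty (cycles)."""
    return hit_time + miss_rate * miss_penalty

def smallest_l1_meeting(target_cycles, candidates, hit_time, miss_penalty):
    """Return the smallest cache size (KB) whose average latency meets the target."""
    for size_kb, miss_rate in sorted(candidates.items()):
        if average_latency(hit_time, miss_rate, miss_penalty) <= target_cycles:
            return size_kb
    return None   # no candidate is large enough

# Hypothetical miss rate per L1 size in KB; real values would come from profiling.
miss_rates = {4: 0.10, 8: 0.06, 16: 0.03, 32: 0.02}
print(smallest_l1_meeting(2.0, miss_rates, hit_time=1, miss_penalty=20))
```

Here the miss penalty stands for the cost of going to the embedded memory; a slower next level raises the penalty and pushes the choice toward a larger L1.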

Source: Artieri, MPSoC’05

Nomadik - Memory hierarchy

[Figure: Nomadik memory hierarchy: external mass memory (SDRAM) is the bandwidth bottleneck, with high latency; embedded memory (L2 cache) offers very high bandwidth and low latency; each sub-system has its own DMA and L1 cache, alongside a system DMA]

Source: Artieri, MPSoC’05

Nomadik – Software platform

[Figure: Nomadik software platform, layers from top to bottom]
• User interface: MP3 player, messaging, browser, gaming, telephone, PIM
• High-level client API
• Frameworks: communication infrastructure, security, multi-media framework, Java framework, power management, telephony, networking
• Operating system core (kernel, device drivers, file system, …): Symbian, WinCE, Linux
• Low-level API (HCL)
• Hardware: multi-media accelerators & audio-video codecs (MP3, AAC, Midi, … MPEG4, H.264, …); communication interfaces (UARTs, USB, BT, …); peripheral interfaces (LCD, cameras, memory, …)

Source: Artieri, MPSoC’05

Nomadik - Software overview

[Figure: Nomadik software overview. On the ARM: applications over middleware and an open OS, with drivers on top of the HCL, the Nomadik kernel, and the Component Manager. On the DSPs: a sub-system OS running firmware (FW)]

Nomadik - Programming model

• Nomadik kernel
  – A set of system services and API on which
    • Open OS drivers are built
    • Sub-system firmware is built
  – Open OS agnostic
  – Provides execution resource abstraction for user applications and firmware

Source: Artieri, MPSoC’05

Nomadik - Programming model

• Component = process = service
  – A dynamically downloadable object
• Component Manager
  – A unique gateway to all sub-systems
  – Aware of all sub-system resources’ state and activity
  – Transparently execute a component on any of the sub-systems
  – Manage the life cycle of a component
    • Create, start, stop, kill component instances
    • Apply policy rules
  – Memory management
    • Image installation
    • Memory allocation
    • Garbage collection
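The life-cycle role of the Component Manager can be pictured as a small state machine behind a single gateway. The sketch below is purely illustrative: the class, the allowed transitions, and the least-loaded placement policy are assumptions, not the Nomadik API.

```python
from enum import Enum

class State(Enum):
    CREATED = "created"
    RUNNING = "running"
    STOPPED = "stopped"
    KILLED = "killed"

# Allowed transitions (assumed; the slide only names the four operations).
TRANSITIONS = {
    "start": {State.CREATED: State.RUNNING, State.STOPPED: State.RUNNING},
    "stop": {State.RUNNING: State.STOPPED},
    "kill": {State.CREATED: State.KILLED, State.RUNNING: State.KILLED,
             State.STOPPED: State.KILLED},
}

class ComponentManager:
    """Toy gateway that places component instances on sub-systems."""

    def __init__(self, subsystems):
        self.load = {name: 0 for name in subsystems}   # live instances per sub-system
        self.instances = {}                            # instance id -> record
        self.next_id = 0

    def create(self, component):
        # Placement policy (assumed): pick the least-loaded sub-system.
        target = min(self.load, key=self.load.get)
        self.load[target] += 1
        self.next_id += 1
        self.instances[self.next_id] = {
            "component": component, "subsystem": target, "state": State.CREATED}
        return self.next_id

    def apply(self, op, instance_id):
        record = self.instances[instance_id]
        if record["state"] not in TRANSITIONS[op]:
            raise ValueError(f"cannot {op} an instance that is {record['state'].value}")
        record["state"] = TRANSITIONS[op][record["state"]]
        if record["state"] is State.KILLED:
            self.load[record["subsystem"]] -= 1        # release the resource
        return record["state"]

cm = ComponentManager(["dsp_audio", "dsp_video"])      # hypothetical sub-system names
a = cm.create("mp3_decoder")
b = cm.create("h264_decoder")
for op in ("start", "stop", "kill"):
    cm.apply(op, a)
```

The caller never names a sub-system, which mirrors the "transparently execute a component on any of the sub-systems" bullet. Memory management (image installation, allocation, garbage collection) is left out of the sketch.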

Source: Artieri, MPSoC’05

Nomadik - Programming model

• Sub-system OS
  – Real-time micro task scheduler
  – Communication and synchronization services
• A sound execution framework
  – Clear separation between invocation (component manager side) and execution (component instances)
  – Highly scalable and flexible
  – Best use of platform resources

Source: Artieri, MPSoC’05

Nomadik - Tool support

• Multiple core approach
  – ARM
    • No a priori choice: whatever is available from the market for both compilation and debug
  – Multi-media based sub-systems
    • Dedicated and optimized tools for
      – Compilation
      – Simulation and analysis
      – Debug and trace
• Compilation
  – All C-based approach, no assembly code
  – Highly optimized and robust ANSI C compiler
  – DSP extensions matching the ITU/ETSI basic operation package
  – Multi-platform tools

Source: Artieri, MPSoC’05

ARM11 MPCore

• OS support: AMP vs. SMP
• Asymmetric multiprocessing (AMP)
  – Programmer statically allocates tasks
  – Uses a distributed view of memory
    • Synchronization and communication via explicit message passing mechanism
  – Same model as traditionally used in heterogeneous designs
    • Workloads are partitioned and manually offloaded to specific processors
• Symmetric multiprocessing (SMP)
  – OS dynamically allocates tasks to CPU
  – Programmer uses a shared view of memory
    • Synchronization and communication via common state in shared memory
  – Normally homogeneous CPU arrangement
    • Workloads are partitioned and dynamically shared between any processors
• OS related requirements
  – Cache coherency
  – Generic interrupt controller
  – Watchdog timer per processor

Source: Zivojnovic, MPSoC’05

Toshiba MeP (Media Processor)
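Looking back at the AMP and SMP models listed for the ARM11 MPCore, the difference in programming style can be shown with a small sketch. Python threads stand in for processors here, which is an assumption made for illustration; the function names are hypothetical and nothing below is ARM or OS-specific code.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

def work(x):
    return x * x

def amp_style(items):
    """Static allocation: one dedicated worker, fed by explicit messages."""
    inbox, outbox = queue.Queue(), queue.Queue()

    def dedicated_worker():
        while True:
            msg = inbox.get()
            if msg is None:              # sentinel: no more work
                break
            outbox.put(work(msg))

    t = threading.Thread(target=dedicated_worker)
    t.start()
    for item in items:
        inbox.put(item)                  # manual offload to a specific worker
    inbox.put(None)
    t.join()
    return [outbox.get() for _ in items]

def smp_style(items, n_cpus=2):
    """Dynamic allocation: any worker takes any task; results meet in shared state."""
    total = 0
    lock = threading.Lock()

    def task(x):
        nonlocal total
        y = work(x)
        with lock:                       # synchronization through shared memory
            total += y

    with ThreadPoolExecutor(max_workers=n_cpus) as pool:
        list(pool.map(task, items))      # the pool decides which worker runs what
    return total

print(amp_style([1, 2, 3]))
print(smp_style([1, 2, 3]))
```

In the AMP version the programmer decides where work runs and moves data explicitly; in the SMP version the scheduler decides, and correctness depends on coherent shared state, which is why cache coherency appears among the OS requirements.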

[Figure: MeP heterogeneous multiprocessor. Each MeP module contains an MeP CPU core with instruction and data RAM/cache, HW extensions (UCI unit, DSP unit, VLIW co-processor, HW engine), and a local bus with DMA and bus bridge; configurable processors 1 to N attach to a global bus]

Source: Matsui, MPSoC’05

Toshiba MeP (Media Processor)

• Configurable processor – MeP-C2 core
• Base processor:
  – 32-bit RISC
  – 5-stage pipeline
  – 350 MHz
  – 50 Kgates
• Configuration
  – memory size
  – optional instructions
  – bus width (32/64 bits)
  – interrupt (# channels, # levels)
  – debug support unit
• User extensions
  – User Custom Instruction (UCI) Unit – single-cycle ALU instructions
  – DSP unit – multi-cycle ALU instructions
  – VLIW co-processor – 2-way or 3-way
  – up to 10 hardware engines
  – control register extension up to 4 Kwords
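The configuration and extension limits above lend themselves to a simple validity check. The record below is hypothetical (it is not a Toshiba tool format), covers only the options with stated limits, and assumes that 4 Kwords means 4096 words.

```python
from dataclasses import dataclass

@dataclass
class MepConfig:
    """Hypothetical record of some MeP-C2 options listed on the slide."""
    bus_width_bits: int = 32
    vliw_ways: int = 0                  # 0 = no VLIW co-processor
    hardware_engines: int = 0
    control_register_words: int = 0
    has_uci_unit: bool = False
    has_dsp_unit: bool = False

    def errors(self):
        """Return the violated limits; an empty list means the config is acceptable."""
        problems = []
        if self.bus_width_bits not in (32, 64):
            problems.append("bus width must be 32 or 64 bits")
        if self.vliw_ways not in (0, 2, 3):
            problems.append("VLIW co-processor must be 2-way or 3-way")
        if not 0 <= self.hardware_engines <= 10:
            problems.append("at most 10 hardware engines")
        if not 0 <= self.control_register_words <= 4 * 1024:   # 4 Kwords, assumed 4096
            problems.append("control register extension limited to 4 Kwords")
        return problems

video = MepConfig(bus_width_bits=64, vliw_ways=3, hardware_engines=4)
print(video.errors())
```

Memory size, optional instructions, interrupt channels and levels, and the debug support unit are omitted because the slide gives no ranges for them.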

Source: Matsui, MPSoC’05

Toshiba MeP (Media Processor)

• Example of application: MPSoC
  – 4 MeP processors
    • Main control
    • Filter
    • Video processor, with MPEG4 / H.264 codec accelerators
    • Audio DSP, with DSP extension

Panasonic UniPhier

• Market: home electronics equipment
  – TV, DVD, cell phones
• DPP encourages future signal processing functions
• DPP is an optional part for cell phones
• Hardware engines are normally ASIC design parts for standardized functions
• Complex functions which are not yet standardized are realized by DPP

[Figure: UniPhier processing element array. Blocks: Control Unit, Excursion units, Instruction Parallel Processor (IPP), Data Parallel Processor (DPP), Hardware Engine; parts are labeled Fundamental, Extension, Extension]

Source: Nishitani, MPSoC’05

NEC MP211

• Market: cell phones
  – Current business acceleration
• Different OSs in component processors
• Poor future expandability due to single DSP

[Figure: MP211 block diagram: three ARM926 CPUs and an SPX-K602 DSP connected by a multi-layer AHB]

Source: Nishitani, MPSoC’05

Massive multi-core

• Cisco CRS-1 Carrier Router System
• Continuous operation, service flexibility, extended longevity
• 92 Terabits per second
• Software programmable network processor (SPP)
• Each SPP processes 40 Gbps
• Parallel array of 188 Xtensa-based SPP processors

Source: Fu, MPSoC’05