Power and Performance Optimization for Network-on-Chip based Many-Core Processors

YUAN YAO

Doctoral Thesis in Information and Communication Technology (INFKOMTE)
School of Electrical Engineering and Computer Science
KTH Royal Institute of Technology
Stockholm, Sweden 2019

KTH School of Electrical Engineering and Computer Science
Electrum 229, SE-164 40 Stockholm, SWEDEN
TRITA-EECS-AVL-2019:44
ISBN 978-91-7873-182-4

Akademisk avhandling som med tillstånd av Kungl Tekniska högskolan framlägges till offentlig granskning för avläggande av teknologie doktorsexamen i Informations- och Kommunikationsteknik fredagen den 23 augusti 2019 klockan 9.00 i Sal B, Electrum, Kungl Tekniska högskolan, Kistagången 16, Kista.

© Yuan Yao, May 14, 2019

Tryck: Universitetsservice US AB

Abstract

Network-on-Chip (NoC) is emerging as a critical shared architecture for CMPs (Chip Multi-/Many-Core Processors) running parallel and concurrent applications. As the core count scales up and the transistor size shrinks quickly, optimizing power and performance for the NoC opens new research challenges. Since the NoC can potentially consume 20–40% of the entire chip power [20, 40, 81], NoC power efficiency has emerged as one of the main design constraints in today’s and future high-performance CMPs. For NoC power management, we propose a novel on-chip DVFS technique that is able to adjust per-region NoC V/F levels according to the V/F levels voted by communicating threads. A thread periodically votes for a preferred NoC V/F level that best suits its individual performance interests. The final DVFS decision of each region is made democratically by a region DVFS controller based on the majority of the votes it receives.

Mutually exclusive locks are pervasive shared-memory synchronization primitives. In advanced locks such as the Linux queue spinlock, which comprises a low-overhead spinning phase and a high-overhead sleeping phase, we show that the lock primitive may create very high competition overhead (COH), which is the time threads compete with each other for the next critical section grant. For performance enhancement, we propose a software-hardware cooperative mechanism that can opportunistically maximize the chance of a thread winning the critical section in the low-overhead spinning phase and minimize the chance of winning it in the high-overhead sleeping phase, so that COH is significantly reduced. Besides, we further observe that the invalidation-acknowledgement round-trip delay between the home node storing the critical section lock and the cores running competing threads can heavily degrade application performance. To reduce such high lock coherence overhead (LCO), we propose in-network packet generation (iNPG) to turn passive “normal” NoC routers into active “big” ones that can not only transmit but also generate packets to perform early invalidation and collect inv-acks. iNPG effectively shortens the protocol round-trip delay and thus largely reduces LCO in various locking primitives.

To enhance performance fairness when running multiple multi-threaded programs on a single CMP, we develop the concept of aggregate flow, which refers to a sequence of associated data and coherence flows issued from the same thread. Based on the aggregate flow concept, we propose three coherent mechanisms to efficiently achieve performance isolation: rate profiling, rate inheritance and flow arbitration. Rate profiling dynamically characterizes thread performance and communication needs. Rate inheritance allows a data or coherence reply flow to inherit the characteristics of its associated data or coherence request flow, so that a consistent bandwidth allocation policy is applied to all flows of the same aggregate flow. Flow arbitration uses a proven scheduling policy, self-clocked fair queueing (SCFQ) [57], to achieve rate-proportional arbitration for different aggregate flows. Our approach successfully achieves balanced performance isolation with different mixtures of applications.

Keywords: Many-Core Processor, Network-on-Chip, Performance, Power Management, DVFS, Shared Memory Synchronization, Hardware/Software Co-Design, Cache Coherency, Performance Isolation

Sammanfattning

Network-on-Chip (NoC) är en avgörande gemensam arkitektur för CMPs (Chip Multi-/Many-Core Processors) som kör parallella och samverkande applikationer. Allteftersom antalet kärnor ökar och transistorerna minskar har konsten att optimera effekt och prestanda för NoC fått nya forskningsutmaningar. Eftersom det potentiellt kan förbruka 20–40% av datorchippets effekt, har effektiva NoC:ar blivit en av de viktigaste designbegränsningarna i dagens och framtidens högpresterande CMPs. Som energisparfunktion i NoC:ar föreslår vi en ny on-chip DVFS-teknik som kan justera varje NoC-regions V/F enligt framröstade V/F-nivåer från kommunicerande trådar. En tråd röstar periodiskt för en föredragen NoC V/F-nivå som bäst passar dess individuella prestationsintressen. Det slutgiltiga DVFS-beslutet för varje region justeras av en regional DVFS-controller som demokratiskt röstar enligt majoriteten av rösterna den mottar.

Ömsesidigt uteslutande lås (mutually exclusive locks) är en genomgående mekanism för minnessynkronisering. I avancerade lås som the Linux queue spinlock, som innefattar både en spinning-fas med låg overhead och en sömnfas med hög overhead, visar vi att denna mekanism kan orsaka mycket högt konkurrensoverhead (competition overhead, COH), vilket är tiden som trådarna konkurrerar med varandra inför nästa kritiska sektionsslot. För att förbättra prestandan föreslår vi en samarbetsmekanism för mjukvara och hårdvara som opportunistiskt kan maximera chansen att tråden får en kritisk sektion under spinning-fasen med låg overhead i stället för under sömnfasen med hög overhead, så att COH minskas betydligt. Vidare observerar vi att fördröjningen av cache-invaliditetsbekräftelsen tur och retur mellan hemnoden som lagrar det kritiska sektionslåset och kärnorna som kör de tävlande låsen kan kraftigt nedgradera applikationsprestandan. För att minska denna höga lås-koherenta overhead (lock coherence overhead, LCO) föreslår vi paketgenerering inom nätverket (In-network Packet Generation, iNPG) för att slå om passiva “normala” NoC-routrar till aktiva “stora” dito, då de inte bara kan sända utan även generera paket för att utföra en tidig invaliditetsbekräftelse och samla in invaliderings-acknowledgements. iNPG förkortar effektivt protokollets rundresefördröjning (eng. round-trip delay) och reducerar sålunda olika lås-primitivers LCO.

För att förbättra den rättvisa arbitreringsprestandan när flertrådningsprogram körs på en enda CMP använder vi begreppet Aggregate Flow för att hänvisa till en sekvens av associerade data- och koherensflöden som utfärdats av samma tråd. Baserat på begreppet föreslår vi tre sammanhängande mekanismer för att effektivt uppnå prestationsisolering: Rate profiling, Rate inheritance och Flow arbitration. Rate profiling karaktäriserar trådens dynamiska prestanda och kommunikationsbehov. Rate inheritance tillåter ett data- eller koherenssvar att ärva egenskaperna för dess associerade data- eller koherensflöde, så att en konsekvent bandbreddstilldelningspolitik kan tillämpas på alla flöden inom samma Aggregate Flow. Flow arbitration använder en beprövad schemaläggningspolicy kallad Self-Clocked Fair Queueing (SCFQ) för att uppnå en proportionell arbitrering för olika

Aggregate Flow. Tillvägagångssättet lyckas framgångsrikt uppnå balanserad isolering av prestanda vid olika blandningar av applikationer.

Nyckelord: Many-Core Processor, Network-on-Chip, Performance, Power Management, DVFS, Shared Memory Synchronization, Hardware/Software Co-Design, Cache Coherency, Performance Isolation

To my family.

List of Publications

Conference papers

1. Yuan Yao and Zhonghai Lu, “iNPG: Accelerating Critical Section Access with In-Network Packet Generation for NoC Based Many-Cores”, in Proceedings of the 24th IEEE International Symposium on High Performance Computer Architecture (HPCA 2018, *Best paper candidate*), Vienna, Feb. 2018.

2. Yuan Yao and Zhonghai Lu, “Opportunistic Competition Overhead Reduc- tion for Expediting Critical Section in NoC based CMPs”, in Proceedings of the 43rd International Symposium on Computer Architecture (ISCA), Seoul, Jul. 2016.

3. Yuan Yao and Zhonghai Lu, “Memory-Access Aware DVFS for Network- on-Chip in CMPs”, in Proceedings of Design, Automation & Test in Europe (DATE 2016), Dresden, Mar. 2016.

4. Yuan Yao and Zhonghai Lu, “DVFS for NoCs in CMPs: A Thread Vot- ing Approach”, in Proceedings of the 22nd IEEE International Symposium on High Performance Computer Architecture (HPCA 2016), Barcelona, Feb. 2016.

5. Zhonghai Lu, Yuan Yao, Yuming Jiang, “Towards Stochastic Delay Bound Analysis for Network-on-Chip”, in IEEE/ACM International Symposium on Networks-on-Chip (NoCS), Ferrara, Sep. 2014.

6. Yuan Yao and Zhonghai Lu, “Fuzzy Flow Regulation for Network-on-Chip based Chip Multiprocessors Systems”, in 19th Asia and South Pacific Design Automation Conference (ASP-DAC), Singapore, Feb. 2014.

Transactions papers

1. Zhonghai Lu and Yuan Yao, “Thread Voting DVFS for Manycore NoCs”, in IEEE Transactions on Computers (TC, Volume: 67, Issue: 10), Oct. 2018.


2. Zhonghai Lu and Yuan Yao, “Marginal Performance: Formalizing and Quantifying Power Over/Under Provisioning in NoC DVFS”, in IEEE Transactions on Computers (TC, Volume: 66, Issue: 11), Nov. 2017.

3. Zhonghai Lu and Yuan Yao, “Dynamic Traffic Regulation in NoC-Based Systems”, in IEEE Transactions on Very Large Scale Integration Systems (TVLSI, Volume: 25, Issue: 2), Feb. 2017.

4. Zhonghai Lu and Yuan Yao, “Aggregate Flow-Based Performance Fairness in CMPs”, in ACM Transactions on Architecture and Code Optimization (TACO, Volume 13, Issue 4), Dec. 2016.

Manuscripts

1. Yuan Yao and Zhonghai Lu, “Pursuing Extreme Power Efficiency with PPCC Guided NoC DVFS”, (Journal submission).

2. Yuan Yao and Zhonghai Lu, “Game-of-Life Temperature-Aware DVFS Strategy for Tile-based Chip Many-Core Processor”, (Journal submission).

3. Yuan Yao and Zhonghai Lu, “Eager Convolution: Eliminating Duplicated Data-Loading to Accelerate GPU-Based CNN Convolution in Software”, (Journal submission).

Contents

List of Publications

Contents

1 Introduction
  1.1 Background
    1.1.1 Many-core processor over the last 20 years
    1.1.2 Network-on-chip in recent many-cores
  1.2 Motivation
    1.2.1 Conventional NoC optimization approaches
    1.2.2 Deficiency of conventional NoC optimization
  1.3 Research contributions
    1.3.1 Performance characteristics based NoC power management
    1.3.2 NoC HW/SW co-design for performance enhancement
    1.3.3 In-network packet generation for performance enhancement
    1.3.4 NoC bandwidth allocation for performance fairness
  1.4 Thesis organization

2 Performance characteristics based NoC power management
  2.1 Introduction
  2.2 Related work
    2.2.1 Power and performance oriented DVFS techniques
    2.2.2 Temperature-aware DVFS techniques
  2.3 Motivation
    2.3.1 Target CMP architecture
    2.3.2 Power over/under provisioning scenarios
    2.3.3 DVFS tuning granularity
  2.4 Thread voting based DVFS mechanism
    2.4.1 Per-thread characterization using the (γ, s) model
    2.4.2 Vote generation and delivery
    2.4.3 Region DVFS decision making
  2.5 Design details
    2.5.1 Characterization and vote generation


    2.5.2 Region-based V/F adjustment
  2.6 Experiments
    2.6.1 Methodology
    2.6.2 Experimental results

3 NoC HW/SW co-design for performance enhancement
  3.1 Introduction
  3.2 Related work
    3.2.1 Identifying and analyzing critical threads
    3.2.2 Accelerating critical section execution
  3.3 Background and motivation
    3.3.1 Critical section and competition overhead
    3.3.2 Queue spinlock in operating systems
    3.3.3 Criticality of competition overhead
  3.4 Opportunistic COH reduction: principle and effect
    3.4.1 Target CMP architecture
    3.4.2 Competition overhead in queue spinlock
    3.4.3 The basic principle
    3.4.4 Effect on competition overhead reduction
  3.5 Design details
    3.5.1 Software support in OS
    3.5.2 Hardware support in NoC
  3.6 Experiments
    3.6.1 Methodology
    3.6.2 Experimental results

4 In-network packet generation for performance enhancement
  4.1 Introduction
  4.2 Related work
    4.2.1 Analysis and identification of critical sections and threads
    4.2.2 Combining synchronization networks
    4.2.3 In-network techniques for collective communication
  4.3 Background and motivation
    4.3.1 Different lock spinning mechanisms in many-core systems
    4.3.2 Motivation
  4.4 In-network packet generation
    4.4.1 Target CMP architecture
    4.4.2 Lock spinning under cache coherency
    4.4.3 Reduce lock coherence overhead with iNPG
  4.5 Design details
    4.5.1 Big router micro-architecture
    4.5.2 Synthesis and chip floorplan & layout
  4.6 Experiments
    4.6.1 Methodology

    4.6.2 Experimental results

5 NoC bandwidth allocation for performance fairness
  5.1 Introduction
  5.2 Related Work
    5.2.1 QoS proposals for multicore systems
    5.2.2 Slowdown estimation based QoS in multicore systems
    5.2.3 Other factors affecting QoS in multicore systems
    5.2.4 Fair queuing for QoS and fairness
  5.3 Motivation
    5.3.1 Target architecture
    5.3.2 Application dynamism
    5.3.3 Thread dynamism
  5.4 Aggregate flow concept in cache-coherent CMPs
    5.4.1 Message causality in CMPs
    5.4.2 Notions of flow and aggregate flow
  5.5 Flow-Oriented Mechanisms
    5.5.1 Overview
    5.5.2 Rate profiling
    5.5.3 Rate inheritance
    5.5.4 Flow arbitration with SLFQ
    5.5.5 Flow router architecture
    5.5.6 Router pipeline and pipeline stages
    5.5.7 Avoid per-flow buffering and deadlock
    5.5.8 RTL implementation of SLFQ scheduler
  5.6 Experiments and Results
    5.6.1 Evaluation methodology
    5.6.2 Single-application performance
    5.6.3 Multi-application performance
    5.6.4 Evaluation of rate profiling

6 Concluding remark
  6.1 Summary
  6.2 Future work

Bibliography

Chapter 1

Introduction

1.1 Background

1.1.1 Many-core processor over the last 20 years

By Dennard scaling [39], the number of transistors on a single chip doubled with every technology generation while keeping the same power density. However, although transistor scaling continues, the operating frequency of the transistors cannot be increased anymore because of the power limit constrained by the chip cooling capability [19, 47, 6]. Thus, we have witnessed that processor performance scaling has shifted from increasing single-core frequency (faster transistors, deeper pipelining, etc.) to adding more power-efficient cores on chip, and this trend continues.

Single-core era to multi-/many-core era (desktop and low-end workstation processors):
Deschutes (1997): 1 core, 450 MHz, 250 nm, 7.5 million transistors, 130 mm2
Coppermine (1999): 1 core, 1.1 GHz, 180 nm, 28 million transistors, 100 mm2
Willamette (2000): 1 core, 2.0 GHz, 180 nm, 42 million transistors, 112 mm2
Dothan (2004): 1 core, 2.3 GHz, 90 nm, 140 million transistors, 87 mm2
Wolfdale (2008): 2 cores, 3.5 GHz, 45 nm, 410 million transistors, 107 mm2
Sandy Bridge (2011): 4 cores, 3.6 GHz, 32 nm, 995 million transistors, 216 mm2
Ivy Bridge (2012): 4 cores, 3.4 GHz, 22 nm, 1.2 billion transistors, 160 mm2
Haswell (2013): 4 cores, 3.4 GHz, 22 nm, 1.4 billion transistors, 177 mm2
Skylake (2015): 4 cores, 4.0 GHz, 14 nm, 1.75 billion transistors, 122 mm2
Kaby Lake (2016): 4 cores, 4.3 GHz, 14 nm, 2.16 billion transistors, 126 mm2
Coffee Lake (2017): 8 cores, 3.6 GHz, 14 nm, 3.0 billion transistors, 177 mm2

High-end and many-core processors:
Yorkfield (2008): 4 cores, 3.5 GHz, 45 nm, 820 million transistors, 2x107 mm2
Knights Corner (2011): 57-61 cores, 1.2 GHz, 22 nm, 5.0 billion transistors, 720 mm2
Knights Landing (2013): 64-72 cores, 1.5 GHz, 14 nm, 7.1 billion transistors, 683 mm2
Knights Mill (2017): 64-72 cores, 1.5 GHz, 14 nm, transistor count and die size TBA; nearly identical to Knights Landing but optimized for deep learning

Figure 1.1: Intel processor micro-architecture over 20 years (1997-2017).

Figure 1.1 summarizes the processor micro-architecture development trends over the past 20 years (1997-2017) using Intel processors as examples1.

1Data obtained from the Intel product specification https://ark.intel.com.

During the single-core era (pre 2004), processor evolution focused on increasing core frequency (from 450 MHz in Deschutes to 2.3 GHz in Dothan). However, from 2008, the development trend shifted from increasing single-core frequency to adding more power-efficient cores on the same die. For example, aiming at desktop and low-end workstation platforms, the core count increases from 2 in Wolfdale to 8 in Coffee Lake. Although over certain generations the core counts remain constant (e.g., from Sandy Bridge to Kaby Lake), the micro-architectures nevertheless become more power efficient and integrate more components (such as integrated graphics) on the same chip2. From 2011, Intel introduced the Xeon Phi processor family intended for use in supercomputers, servers, and high-end workstations. The Xeon Phi processor family includes three micro-architectures: Knights Corner (2011), Knights Landing (2014), and Knights Mill (2017). All three micro-architectures contain more than 50 “small” cores that typically operate at 1.1-1.5 GHz. This massive core integration indicates that modern extremely high-performance processors favour a large number of power-efficient cores instead of a few “big” ones that run at high frequency. Further, recent many-cores are adding new features beyond traditional high-performance computing. For example, Knights Mill, announced at Hot Chips 2017, features new instructions for artificial intelligence workloads.

1.1.2 Network-on-chip in recent many-cores

To enable scalability in bandwidth and power, the communication architecture in many-cores has evolved from conventional buses and crossbars to Networks-on-Chip (NoC) since 2000 [73, 29, 11]. Today NoCs have become the de facto communication infrastructure for general CMPs (chip multi-/many-core processors) with homogeneous cores for high-performance computing and MPSoCs (multi-/many-processor systems-on-chip) with heterogeneous cores for application-specific computing. Before summarizing the challenges in enhancing NoC based many-cores, we survey several NoCs that are used in recent research/commodity many-core chips. In each case study we first briefly introduce the functionality of the many-core and then the NoC design specification. Table 1.1 summarizes the features of the NoCs in terms of technology (Tech), topology (Top), routing (Route), flow control (FC), and target applications.

The Celerity [34]

The Celerity [34] SoC implements and integrates three tiers of cores on the same chip: 1) The general-purpose tier is a set of OS-capable cores for executing complex codes like networking, control, and decision making. 2) The specialization tier is made of highly specialized algorithmic accelerators to target specific computations with extreme energy-efficiency and performance requirements. 3) The massively parallel tier is made of scalable programmable arrays of small, tightly coupled cores that attain energy efficiency and flexibility for evolving workloads.

2Ivy Bridge was a die shrink of Sandy Bridge, moving down from a 32 nm to a 22 nm process. Haswell continues to use a 22 nm process, but with a new micro-architecture that is more power efficient. Kaby Lake adds a hardware 4K video decoder compared to Skylake.

Table 1.1: NoC in seven research/commodity many-cores

Chip (institution) / Year / Tech / Topology / Routing / FC / Application:
Celerity [34] (UW): 2018, 16 nm, 31×16 mesh, XY, N/A, HPC + deep convolutional neural networks
Eyeriss [27] (MIT): 2016, 65 nm, 12×14, Source, N/A, deep convolutional neural networks
Piton [8] (Princeton): 2015, 32 nm, 5×5 mesh, X-Y, WH, general purpose computing
Knights Landing (Intel [153]): 2015, 14 nm, 6×6 mesh, XY, slotted ring, HPC
SCORPIO [35] (MIT): 2014, 45 nm, 6×6 mesh, XY, VC, general purpose computing
MPPA-256 [36] (Kalray): 2013, 28 nm, 4×4 torus, XY, WH, HPC + embedded real-time computing
TILE64 [10] (Tilera): 2008, 90 nm, 8×8 mesh, XY, WH, embedded signal processing

HPC: High performance computing, WH: Wormhole, VC: Virtual channel

Figure 1.2: Celerity [34] chip block diagram.

The three tiers use five Linux-capable RISC-V cores, a binarized neural network (BNN) accelerator generated with HLS, and a “GPU-killer” 496-core RISC-V many-core array, respectively. The prototype SoC is taped out at Taiwan Semiconductor Manufacturing Company (TSMC) in a 16-nm Fin field-effect transistor (FinFET) Compact (FFC) process, with 385 million transistors and a 5×5 mm2 chip area. Figure 1.2 shows a block diagram of the Celerity SoC highlighting the general-purpose tier in green, the specialization tier in blue, and the massively parallel tier in red. To bind these components together, Celerity uses a heterogeneous remote store programming model to allow cores and accelerators to directly write to each other’s memories through a partitioned global address space.

NoC in Celerity. The massively parallel tier uses a 31×16 mesh NoC for data transfer. The NoC uses a single physical network without virtual channels. The network packet is single-flit, and the routing algorithm is X-Y dimension-order routing. Neighbouring tiles are connected via an 80-bit link operating at 1 GHz. Each core can access its local memory without NoC communication, or the remote memory of another core via NoC packets.
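As a concrete illustration of the X-Y dimension-order routing used by the Celerity mesh (and by several of the other NoCs surveyed below), the following minimal C++ sketch computes the output port chosen at each hop. The port names, the coordinate convention and the walk in main() are illustrative assumptions, not Celerity's actual hardware; only the routing rule itself is the technique named in the text.

```cpp
#include <cstdio>

// Output ports of a 2D mesh router (naming is illustrative).
enum class Port { East, West, North, South, Local };

// X-Y dimension-order routing: move along X until the column matches,
// then along Y. (cx, cy) is the current router, (dx, dy) the destination.
Port xy_route(int cx, int cy, int dx, int dy) {
    if (dx > cx) return Port::East;
    if (dx < cx) return Port::West;
    if (dy > cy) return Port::North;
    if (dy < cy) return Port::South;
    return Port::Local;   // arrived: eject to the local tile
}

int main() {
    // Walk a single-flit packet across a 31x16 mesh (Celerity's dimensions)
    // from tile (0,0) to tile (30,15).
    int x = 0, y = 0, hops = 0;
    const int dx = 30, dy = 15;
    for (Port p = xy_route(x, y, dx, dy); p != Port::Local; p = xy_route(x, y, dx, dy)) {
        if (p == Port::East)       ++x;
        else if (p == Port::West)  --x;
        else if (p == Port::North) ++y;
        else if (p == Port::South) --y;
        ++hops;
    }
    std::printf("delivered at (%d,%d) after %d hops\n", x, y, hops);
    return 0;
}
```

Because the decision depends only on the current and destination coordinates, the router needs no per-packet state, which is one reason dimension-order routing suits such light-weight single-flit NoCs.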

MIT Eyeriss [27]

Figure 1.3: Die photo and spec of the MIT Eyeriss [27] chip.

The emergence of new classes of applications such as deep learning requires massive parallel processing. The MIT Eyeriss [27] chip is an accelerator for deep convolutional neural networks (CNNs), which provides orders-of-magnitude performance and energy benefits for CNN applications compared to general purpose many-cores. Eyeriss consists of 168 PEs that communicate with one another directly. As the communication pattern is known a priori in CNNs, a light-weight NoC is pre-configured to connect the PEs with a fixed communication pattern.

NoC in Eyeriss. Figure 1.3 shows the die photo of Eyeriss and the corresponding system specifications. The Eyeriss NoC comprises three separate networks. 1) Global Input Network (GIN). GIN is optimized for a single-cycle multicast from the SRAM to a group of PEs that receive the same filter, input feature maps (ifmap), or partial sums (psum). GIN is built using one Y bus and 12 X buses, one per row. One GIN is implemented for each of the filter, ifmap and psum data types in order to provide sufficient bandwidth from the SRAM to the PE array. The filter and psum GINs have a bus width of 64 bits. The ifmap GIN has a data bus width of 16 bits. 2) Global Output Network (GON). GON has the same architecture as GIN, except that it is used for reading psums from the PEs and sending them to the SRAM.

3) Local Network (LN). LN is a set of point-to-point Y-directional links between the PEs for transfer of psums.

Princeton OpenPiton [8]

(a) Chip connections via chipsets, (b) tile architecture, (c) chipset architecture

Figure 1.4: Princeton OpenPiton [8] architecture.

The Princeton OpenPiton [8] is a tile-based many-core chip that consists of 25 OpenSPARC T1 cores manufactured in 32 nm. As shown in Figure 1.4a, multiple OpenPiton chips can be connected together with chipsets to build large scalable many-core systems. Figure 1.4b and 1.4c show the tile and chipset architecture, respectively. Each tile contains 1 OpenSPARC T1 core, 1 distributed L2 cache, 1 floating point unit, 1 OpenSPARC CCX (CPU-Cache Crossbar), and 1 NoC router. A directory-based MESI protocol is implemented to maintain cache coherence. OpenPiton implements the total store order (TSO) [77] memory consistency model used by OpenSPARC T1.

NoC in OpenPiton. To avoid protocol deadlock, OpenPiton uses three NoCs (NoC1, NoC2, NoC3) to transport messages across the various message classes of the coherence protocol. NoC1 transfers data request messages, NoC2 response messages, and NoC3 cache write-back messages.

Intel Knights Landing Micro-Architecture [153]

The Intel Knights Landing [153] architecture was released in 2013. It is implemented in 14 nm with 36 tiles. Figure 1.5 shows the Knights Landing architecture. Each tile consists of 2 cores, 4 vector processing units (VPUs), and a shared 1-Mbyte L2 cache.

(a) Chip architecture (b) Tile architecture

Figure 1.5: Intel Xeon Phi Knights Landing [153] architecture.

NoC in Knights Landing. The NoC is a 6×6 mesh without VCs. To avoid protocol deadlock, the NoC contains 4 subnets, each delivering a different type of traffic. All packets use XY routing. However, because a tile is wider in the X-direction than in the Y-direction, it takes one cycle for every hop in the Y-direction and two cycles for every hop in the X-direction. Router arbitration in Knights Landing inherits the rotary arbitration mechanism from its predecessor, the Knights Corner architecture: traffic that is already travelling on the link takes higher priority over new traffic waiting to enter the network.
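The asymmetric hop cost just mentioned (one cycle per Y hop, two per X hop) yields a simple zero-load traversal estimate under XY routing. The sketch below encodes only that statement and deliberately ignores router pipeline depth, arbitration and contention, so it is a back-of-the-envelope bound rather than a Knights Landing measurement.

```cpp
#include <cstdio>
#include <cstdlib>

// Zero-load traversal estimate for a Knights Landing style mesh under XY
// routing: two cycles per X-direction hop, one cycle per Y-direction hop,
// as stated above. Router pipeline stages, arbitration and contention are
// deliberately ignored, so this is only a lower-bound sketch.
int zero_load_cycles(int sx, int sy, int dx, int dy) {
    int x_hops = std::abs(dx - sx);
    int y_hops = std::abs(dy - sy);
    return 2 * x_hops + 1 * y_hops;
}

int main() {
    // Corner-to-corner traversal of the 6x6 mesh: 5 X hops and 5 Y hops.
    std::printf("estimate: %d cycles\n", zero_load_cycles(0, 0, 5, 5));
    return 0;
}
```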

MIT SCORPIO [35]

The MIT SCORPIO [35] chip is a research prototype implementing snoopy coherence. It consists of 36 tiles connected as a 6×6 mesh. As shown in Figure 1.6, each tile contains 1 Freescale e200 Power Architecture core. Each core is connected to a private L2 cache controller with an AMBA AHB interface. Further, the L2 cache controller is connected to the NoC router via an AMBA ACE interface. Two Cadence DDR2 memory controllers are attached to 4 routers along the chip edge. The memory controllers are further connected to off-chip DIMM modules via the AMBA AXI interface.

NoC in SCORPIO. SCORPIO implements sequential consistency by providing global ordering in the NoC, i.e., it guarantees that snoop requests from cores are delivered to all destination cores in the exact same order. This is done by using two 6×6 NoC subnets: one subnet is a conventional network to deliver the data messages, and the other is a latency-aware bufferless broadcast network to transfer cache coherence messages.

Figure 1.6: MIT SCORPIO [35] architecture.

Kalray MPPA-256 Many-Core Processor [36]

(a) MPPA-256 chip featured with 16 compute clusters, (b) compute cluster architecture

Figure 1.7: MPPA-256 [36] chip and compute cluster architectures.

The Kalray MPPA-256 processor integrates 256 user cores and 32 system cores on a chip with 28 nm CMOS technology. Each core implements a 32-bit 5-issue VLIW architecture. The cores are evenly distributed across 16 compute clusters. Each compute cluster maintains a private memory address space.

NoC in MPPA-256. The 16 compute clusters are connected by two NoCs with bi-directional links, one for data (D-NoC) and the other for control (C-NoC). Both NoCs implement a 2D torus topology with wormhole flow control.

TILERA TILE64 [10]

(a) TILE64 chip architecture (b) Tile architecture

Figure 1.8: 64-core TILERA TILE64 [10] architecture.

The Tilera TILE64 [10] is a many-core SoC targeting embedded digital signal processing applications. It contains 64 tiles connected by an 8×8 mesh NoC called iMesh. Each tile consists of 1 processor core, 1 private L1 cache, a private L2 cache, and a shared L3 cache. The core is a 64-bit VLIW core, which supports SMP Linux.

NoC in TILE64. The iMesh NoC consists of 6 physical subnets: 1) Tile Dynamic Network (TDN), 2) Memory Dynamic Network (MDN), 3) inValidation Dynamic Network (VDN), 4) User Dynamic Network (UDN), 5) Static Network (STN), and 6) I/O Dynamic Network (IDN). To avoid protocol deadlock, the TDN transfers all data request messages and the MDN data response as well as cache coherence acknowledgement messages. To accelerate the coherence protocol, the VDN is dedicated to coherence invalidation messages. Besides shared memory communication, iMesh also supports thread communication using the message passing interface (MPI); all thread-to-thread messages are transferred via the UDN. Finally, the STN is dedicated to transferring large streaming data and the IDN to I/O messages. The UDN, IDN, MDN, TDN, and VDN are networks that implement XY routing. The STN is a circuit-switching network: a setup packet first reserves a specific route, and the subsequent message then follows this route to the destination.
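The message-class-to-subnet assignment of iMesh described above can be summarized as a small dispatch table. The sketch below merely restates that assignment (TDN for data requests, MDN for data responses and coherence acknowledgements, VDN for invalidations, UDN for user-level message passing, STN for streaming data, IDN for I/O); the enum and function names are hypothetical.

```cpp
#include <cstdio>

enum class MsgClass {
    DataRequest, DataResponse, CoherenceAck,
    Invalidation, UserMessage, StreamData, IoMessage
};

enum class Subnet { TDN, MDN, VDN, UDN, STN, IDN };

// Select the iMesh physical subnet for a message, following the
// class-to-network assignment described in the text.
Subnet select_subnet(MsgClass c) {
    switch (c) {
        case MsgClass::DataRequest:  return Subnet::TDN;
        case MsgClass::DataResponse:
        case MsgClass::CoherenceAck: return Subnet::MDN;
        case MsgClass::Invalidation: return Subnet::VDN;
        case MsgClass::UserMessage:  return Subnet::UDN;
        case MsgClass::StreamData:   return Subnet::STN;
        case MsgClass::IoMessage:    return Subnet::IDN;
    }
    return Subnet::MDN;   // unreachable for a well-formed message class
}

int main() {
    // A cache-line read miss travels as a data request, i.e. on the TDN.
    std::printf("request on TDN: %s\n",
                select_subnet(MsgClass::DataRequest) == Subnet::TDN ? "yes" : "no");
    return 0;
}
```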

1.2 Motivation

1.2.1 Conventional NoC optimization approaches

Conventional NoC optimizations can be roughly categorized as network-driven approaches or traffic flow-driven approaches. In the network-driven approach, the optimization focuses on the NoC itself with limited attention on the computing and memory subsystems. By inheriting good principles and practices from designing off-chip networks, such as high-performance interconnection networks in High-Performance Computing (HPC), the goal of network-driven optimization is to improve overall network efficiency (e.g. transmission latency, bandwidth, etc.) for generic traffic classes so as to implicitly improve system efficiency. In the traffic flow-driven approach, the optimization considers traffic characteristics from the application or system architecture, such as transaction types (read, write, coherence invalidation, etc.), packet criticality, and traffic patterns. These specific traffic attributes are either statically analysed from the task communication graph or dynamically profiled at runtime from the system architecture when the NoC interacts with cores, caches, and off-chip DRAM. The information is then used to prioritize resource allocation, guide power management, perform thermal control, etc. Instead of implicitly improving system performance, the traffic flow-driven approach tackles system efficiency explicitly.

1.2.2 Deficiency of conventional NoC optimization

While the network-driven optimization provides a general means to achieve communication efficiency, the traffic flow-driven optimization goes a step further to exploit many-core workload specifics and characteristics, opening new opportunities to enhance system performance and power. For example, in terms of performance, without information about packet criticality to the application, network-driven optimization cannot selectively accelerate critical packets but takes general means to improve network performance in average packet latency and saturation throughput, which may or may not improve application performance. In contrast, the traffic flow-driven approach can more precisely shorten application runtime, because it can explicitly expedite the stall-time critical memory transactions. In terms of power, without the notion of packet dependency, network-driven DVFS can only pay attention to network activity, e.g., network injection rate or router buffer occupancy, leading to possibly unnecessary power increase or performance slowdown.

While the traffic flow-driven optimization seems ideal, it is in fact not. The traffic flow-driven optimization misses information about architecture-independent program characteristics such as structure (parallel vs. serial), behaviour (computation, communication, synchronization), etc. As shown in [80], architecture-independent workload characteristics often disagree with architecture-dependent characteristics.

As a result, it is hard for the traffic flow-driven approach to accurately tackle the performance-critical packets, albeit the notion of criticality is valid. For instance, it is valid that accelerating read packets is generally preferable to accelerating write packets. However, if the write and read packets belong to the serialized and parallel parts of a parallel program, respectively, then the synchronization packets (e.g. lock acquire) can be more critical to communication than the read packets.

1.3 Research contributions

By lifting the optimization from a network perspective to a system architecture perspective, the traffic flow-driven approach enlarges the room for optimization compared to the network-driven approach. However, because neither approach directly explores program behaviours, in this thesis we show that further lifting the optimization from an architecture perspective to a program perspective is urgently needed to enhance many-core computing efficiency in system performance and power consumption. In this thesis we discuss how program behaviours are considered to make NoC optimization efficient on four topics:

1. Performance characteristics based NoC power management,

2. NoC HW/SW co-design for performance enhancement,

3. In-network packet generation for performance enhancement,

4. NoC bandwidth allocation for performance fairness.

1.3.1 Performance characteristics based NoC power management

As the core count grows rapidly, dynamic voltage/frequency scaling (DVFS) in networks-on-chip (NoCs) becomes critical in optimizing energy efficacy in chip multiprocessors (CMPs). Previously proposed techniques often exploit inherent network-level metrics to do so. However, such network metrics may contradictorily reflect the application's performance needs, leading to power over/under provisioning. We propose a novel on-chip DVFS technique for NoCs that is able to adjust the per-region V/F level according to the voted V/F levels of communicating threads. Each region is composed of a few adjacent routers sharing the same V/F level. With a voting-based approach, threads seek to influence the DVFS decisions independently by voting for a preferred V/F level that best suits their own performance interest according to their runtime-profiled message generation rate and data sharing characteristics. The vote, expressed in a few bits, is then carried in the packet header and spread to the routers on the packet route. The final DVFS decision is made democratically by a region DVFS controller based on the majority election result of the collected votes from all active threads. To achieve scalable V/F adjustment, each region works independently, and the voting-based V/F tuning forms a distributed decision making process.
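To make the "democratic" decision concrete, the fragment below sketches how a region DVFS controller might tally the few-bit votes carried in packet headers over one control window and adopt the plurality winner as the region's next V/F level. The number of V/F levels, the window handling and all identifiers are illustrative assumptions rather than the actual controller design evaluated in Chapter 2.

```cpp
#include <array>
#include <cstdio>

constexpr int kNumVfLevels = 4;   // assumed number of selectable V/F levels

// Per-region vote counters, cleared at the start of each control window.
struct RegionDvfsController {
    std::array<unsigned, kNumVfLevels> votes{};

    // Called when a packet traverses a router of this region; the vote is the
    // few-bit preferred V/F level carried in the packet header.
    void observe_vote(int vf_level) {
        if (vf_level >= 0 && vf_level < kNumVfLevels) ++votes[vf_level];
    }

    // At the end of the window, the majority (plurality) vote becomes the
    // region's next V/F level; the counters are then reset.
    int decide_and_reset() {
        int winner = 0;
        for (int l = 1; l < kNumVfLevels; ++l)
            if (votes[l] > votes[winner]) winner = l;
        votes.fill(0);
        return winner;
    }
};

int main() {
    RegionDvfsController region;
    for (int v : {2, 3, 2, 1, 2, 0}) region.observe_vote(v);   // votes seen in one window
    std::printf("next V/F level for this region: %d\n", region.decide_and_reset());
    return 0;
}
```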

We evaluate our technique with detailed simulations of a 64-core CMP running a variety of multi-threaded PARSEC benchmarks. Compared with a network without DVFS and a network metric (router buffer occupancy) based approach, experimental results show that our voting based DVFS mechanism improves the network energy efficacy measured in MPPJ (million packets per joule) by about 17.9% and 9.7% on average, respectively, and the system energy efficacy measured in MIPJ (million instructions per joule) by about 26.3% and 17.1% on average, respectively. The research contribution is published in the following paper:

• Yuan Yao and Zhonghai Lu, “DVFS for NoCs in CMPs: A Thread Vot- ing Approach”, in Proceedings of the 22nd IEEE International Symposium on High Performance Computer Architecture (HPCA 2016), Barcelona, Feb. 2016.

1.3.2 NoC HW/SW co-design for performance enhancement

With the degree of parallelism increasing, the performance of multi-threaded shared-variable applications is not only limited by serialized critical section execution, but also by the serialized competition overhead for threads to get access to the critical section. As the number of concurrent threads grows, such competition overhead may exceed the time spent in the critical section itself, and become the dominating factor limiting the performance of parallel applications. In modern operating systems, the queue spinlock, which comprises a low-overhead spinning phase and a high-overhead sleeping phase, is often used to lock critical sections. In this section, we show that this advanced locking solution may create very high competition overhead for multi-threaded applications executing in NoC-based CMPs. Then we propose a software-hardware cooperative mechanism that can opportunistically maximize the chance that a thread wins the critical section access in the low-overhead spinning phase, thereby reducing the competition overhead. At the OS primitives level, we monitor the remaining times of retry (RTR) in a thread's spinning phase, which reflects how soon the thread must enter the high-overhead sleep mode. At the hardware level, we integrate the RTR information into the packets of locking requests, and let the NoC prioritize locking request packets according to the RTR information. The principle is that the smaller the RTR a locking request packet carries, the higher priority it gets and thus the quicker its delivery.

We evaluate our opportunistic competition overhead reduction technique with cycle-accurate full-system simulations in GEM5 using PARSEC and SPEC OMP2012 benchmarks. Compared to the original queue spinlock implementation, experimental results show that our method can effectively increase the opportunity of threads entering the critical section in the low-overhead spinning phase, reducing the competition overhead by 39.9% on average (61.8% at maximum) and accelerating the execution of the Region-of-Interest by 14.4% on average (24.5% at maximum) across all 25 benchmark programs.
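The spin-then-sleep structure and the remaining-times-of-retry (RTR) value that the co-design exposes can be pictured with the user-level sketch below. The real mechanism modifies the Linux queue spinlock (an MCS-style kernel lock) and the NoC hardware; here the spinning budget, the RTR sampling point and the hardware interface are all simplified, hypothetical stand-ins meant only to show where the RTR value would be attached to the outgoing locking request.

```cpp
#include <atomic>
#include <thread>

constexpr int kMaxSpinRetries = 64;   // assumed spinning budget before sleeping

std::atomic<bool> lock_word{false};

// Stub for the hardware side of the co-design: the RTR value would be embedded
// in the header of the locking-request packet so that routers can prioritize
// requests that are about to fall into the expensive sleeping phase.
void annotate_lock_request_with_rtr(int rtr) { (void)rtr; }

void lock() {
    // Low-overhead spinning phase with a bounded number of retries.
    for (int retry = 0; retry < kMaxSpinRetries; ++retry) {
        int rtr = kMaxSpinRetries - retry;     // remaining times of retry
        annotate_lock_request_with_rtr(rtr);   // smaller RTR -> higher NoC priority
        bool expected = false;
        if (lock_word.compare_exchange_weak(expected, true)) return;
    }
    // High-overhead sleeping phase, modeled here by yielding; the Linux queue
    // spinlock would instead enqueue the thread and put it to sleep.
    for (;;) {
        bool expected = false;
        if (lock_word.compare_exchange_weak(expected, true)) return;
        std::this_thread::yield();
    }
}

void unlock() { lock_word.store(false); }

int main() {
    lock();
    // ... critical section ...
    unlock();
    return 0;
}
```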

The research contribution is published in the following paper:

• Yuan Yao and Zhonghai Lu, “Opportunistic Competition Overhead Reduc- tion for Expediting Critical Section in NoC based CMPs”, in Proceedings of the 43rd International Symposium on Computer Architecture (ISCA), Seoul, Jul. 2016.

1.3.3 In-network packet generation for performance enhancement

As recently studied, the serialized competition overhead for entering a critical section is more dominant than the critical section execution itself in limiting the performance of multi-threaded shared-variable applications on NoC-based many-cores. We illustrate that the invalidation-acknowledgement delay for cache coherency between the home node storing the critical section lock and the cores running competing threads is the leading factor in the high competition overhead of lock spinning, which is realized in various spin-lock primitives (such as the ticket lock, ABQL, MCS lock, etc.) and in the spinning phase of the queue spin-lock (QSL) in advanced operating systems. To reduce such high lock coherence overhead, we propose in-network packet generation (iNPG) to turn passive “normal” NoC routers, which only transmit packets, into active “big” ones that can also generate packets. Instead of performing all coherence maintenance at the home node, big routers, which are deployed nearer to competing threads, can generate packets to perform early invalidation-acknowledgement for failing threads before their requests reach the home node, shortening the protocol round-trip delay and thus significantly reducing competition overhead in various locking primitives.

We evaluate iNPG in GEM5 using PARSEC and SPEC OMP2012 programs with five different locking primitives. Compared to a state-of-the-art technique accelerating critical section access, experimental results show that iNPG can effectively reduce lock coherence overhead, expediting critical section access by 1.35× on average and 2.03× at maximum, and consequently improving the program Region-of-Interest (ROI) runtime by 7.8% on average and 14.7% at maximum.

The research contribution is published in the following paper:

• Yuan Yao and Zhonghai Lu, “iNPG: Accelerating Critical Section Access with In-Network Packet Generation for NoC Based Many-Cores”, in Proceedings of the 24th IEEE International Symposium on High Performance Computer Architecture (HPCA 2018, *Best paper candidate*), Vienna, Feb. 2018.
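The source of iNPG's saving can be illustrated with a toy latency account that contrasts the baseline home-node-driven invalidation round trip with the case where a big router near the competing threads performs the early invalidation and collects the acknowledgement while the request is still in flight. The hop counts, the per-hop cost and the overlap model below are made-up parameters, so the printed numbers only show the shape of the argument, not measured results.

```cpp
#include <cstdio>

constexpr int kCyclesPerHop = 3;   // assumed per-hop router + link latency

// Baseline: the home node drives the whole invalidation transaction.
int baseline_round_trip(int requester_to_home_hops, int home_to_sharer_hops) {
    int request    = requester_to_home_hops * kCyclesPerHop;
    int invalidate = home_to_sharer_hops    * kCyclesPerHop;
    int inv_ack    = home_to_sharer_hops    * kCyclesPerHop;
    int grant      = requester_to_home_hops * kCyclesPerHop;
    return request + invalidate + inv_ack + grant;
}

// With iNPG: a big router close to the competing (failing) threads generates
// the invalidation and collects the ack while the request is still travelling
// to the home node, so only the non-overlapped part stays on the critical path.
int inpg_round_trip(int requester_to_home_hops, int big_router_to_sharer_hops) {
    int request   = requester_to_home_hops * kCyclesPerHop;
    int early_inv = 2 * big_router_to_sharer_hops * kCyclesPerHop;  // inv + ack
    int exposed   = early_inv > request ? early_inv - request : 0;  // rough overlap model
    int grant     = requester_to_home_hops * kCyclesPerHop;
    return request + exposed + grant;
}

int main() {
    std::printf("baseline: %d cycles, with iNPG: %d cycles\n",
                baseline_round_trip(6, 5), inpg_round_trip(6, 1));
    return 0;
}
```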

1.3.4 NoC bandwidth allocation for performance fairness

Network-on-Chip (NoC) is emerging as a critical shared architecture for CMPs (Chip Multi-/Many-Core Processors) running parallel and concurrent applications. As the core count grows quickly, efficient and fair sharing of communication resources becomes even more urgent in order to provide high application performance and Quality-of-Service (QoS) in exploiting application-level parallelism. Broadly speaking, QoS refers to the overall quality or performance of a system offered to clients with possibly different requirements. It is a collective attribute offered to all applications. In general-purpose CMPs, QoS means fairness, i.e., no application should be privileged for performance when sharing resources because all applications are of equal importance. As the NoC in CMPs is typically a best-effort network without specialized, often costly interference management schemes based on exclusive time slots [60] or separate physical/virtual channels [113] etc., implementing QoS in CMPs calls for fair and efficient management of aggregately shared resources.

In order to continuously keep track of a thread's traffic characteristics and ensure consistent end-to-end resource allocation, we develop the concept of aggregate flow, which refers to a sequence of associated packets with their head flits carrying a thread's dynamic (ℓ, ρ) values, where ℓ expresses the packet size and ρ the packet injection rate. Based on the aggregate flow concept, we propose three coherent mechanisms to achieve efficient performance and fair performance isolation. Specifically, in rate profiling, we conduct runtime traffic rate profiling using a shifting-window based characterization and prediction; rate inheritance allows a data or coherence reply flow to inherit the rate characteristics of its associated data or coherence request flow; flow arbitration relies on a proven scheduling policy, self-clocked fair queueing (SCFQ) [57], for rate-proportional latency and throughput fairness.

We have realized and evaluated our mechanisms in GEM5 with the PARSEC benchmarks running on a 64-node CMP. Compared to two CMPs, one with round-robin packet scheduling (p-RR) and the other with stall-time criticality (STC) based application-aware packet prioritization [32], our proposed approach improves the system IPCs of all benchmark programs running stand-alone by 14.6% and 12.3% on average, respectively. When co-running two-, four- and eight-program mixtures, we observe that (1) the respective average weighted speedup is enhanced by 47%, 45% and 35% against p-RR and 19%, 16% and 11% against STC; (2) the respective average standard deviation of IPC slowdowns (i.e., IPC fairness) is improved by 17×, 15× and 12× against p-RR and 10×, 7× and 6.5× against STC; (3) the respective IPC fairness is minimized to 0.02, 0.02 and 0.007, approaching ideal fairness (zero standard deviation).

The research contribution is published in the following paper:

• Zhonghai Lu and Yuan Yao, “Aggregate Flow-Based Performance Fairness in CMPs”, in ACM Transactions on Architecture and Code Optimization (TACO, Volume 13 Issue 4), Dec. 2016.
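For reference, the flow arbitration mechanism above builds on self-clocked fair queueing, whose core is a per-packet finish tag F = max(v, F_prev) + L/r, where v is the finish tag of the packet currently in service, L the packet length and r the flow's allocated rate; packets are served in increasing tag order. The sketch below implements that textbook rule in isolation, with the router integration, per-aggregate-flow rates and all identifiers being simplified assumptions rather than the thesis design.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

// Self-clocked fair queueing (SCFQ) in its textbook form: an arriving packet
// of flow i gets finish tag F = max(v, F_prev_i) + L / r_i, where v is the
// finish tag of the packet currently in service (the self-clocked virtual
// time), L the packet length and r_i the flow's allocated rate. Serving
// packets in increasing tag order yields rate-proportional bandwidth shares.
struct Packet { int flow; double length; double finish_tag; };

struct Scfq {
    std::vector<double> last_finish;   // per-flow F of the previously tagged packet
    std::vector<double> rate;          // per-flow allocated rate (e.g. profiled rho)
    double virtual_time = 0.0;         // finish tag of the packet in service

    explicit Scfq(std::vector<double> rates)
        : last_finish(rates.size(), 0.0), rate(std::move(rates)) {}

    // Tag a packet of `flow` with `length` flits on arrival.
    Packet tag(int flow, double length) {
        double start = std::max(virtual_time, last_finish[flow]);
        double finish = start + length / rate[flow];
        last_finish[flow] = finish;
        return Packet{flow, length, finish};
    }

    // Transmit the queued packet with the smallest finish tag.
    Packet serve(std::vector<Packet>& queued) {
        std::size_t best = 0;
        for (std::size_t i = 1; i < queued.size(); ++i)
            if (queued[i].finish_tag < queued[best].finish_tag) best = i;
        Packet p = queued[best];
        queued.erase(queued.begin() + static_cast<std::ptrdiff_t>(best));
        virtual_time = p.finish_tag;   // self-clocking: advance virtual time
        return p;
    }
};

int main() {
    Scfq sched({2.0, 1.0});   // flow 0 is allocated twice the rate of flow 1
    std::vector<Packet> q = {sched.tag(0, 4), sched.tag(1, 4), sched.tag(0, 4)};
    while (!q.empty()) {
        Packet p = sched.serve(q);
        std::printf("serve flow %d (F = %.2f)\n", p.flow, p.finish_tag);
    }
    return 0;
}
```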

1.4 Thesis organization

The thesis is organized as follows.

• Chapter 1 introduces the background, research objectives, and research contributions.

• Chapter 2 presents our research contributions on NoC power management.

• Chapter 3 shows our ideas on NoC HW/SW co-design for performance enhancement.

• Chapter 4 summarizes our approaches on novel NoC architecture for performance enhancement.

• Chapter 5 exhibits our work on NoC bandwidth allocation for performance fairness.

• Chapter 6 discusses future research directions and concludes the thesis.

Chapter 2

Performance characteristics based NoC power management1

2.1 Introduction

Network-on-Chip is becoming the de-facto mainstream interconnect scheme for today's and future high performance CMPs and MPSoCs. As multi-core computers enjoy consistent performance enhancement with a growing number of cores, energy/power consumption has emerged as the main constraint for current chip design. According to the studies in [20, 40, 81], on-chip networks can consume a substantial fraction (which may potentially amount to 30-40%) of the entire chip power. An important direction towards increasing the energy efficacy of the NoC in multi- or many-core systems has focused on DVFS. A DVFS mechanism is desired to co-optimize system-level performance and power, i.e., to obtain as much performance as possible while spending as little power as possible. Conventionally, DVFS mechanisms in NoCs have been designed [26, 167, 5, 117] to utilize network-level performance metrics, such as network latency, router buffer load or network throughput. These metrics capture inherent performance characteristics of the network itself, but are often insufficient to reflect application-level or system-level performance. Specifically, network delay may not be indicative of network-related stall time at the processing core [32]. This is because much of the packets' latency can be hidden by various latency-hiding techniques such as read/write buffers. The router buffer load might not accurately reflect data criticality to system performance because it has no knowledge of data dependency among different threads. Even if the buffer load is low, it may not always be proper to lower the on-chip voltage/frequency (V/F) level, since the completion time of one request may affect the

1This section is based on the material from the paper: Yuan Yao and Zhonghai Lu, “DVFS for NoCs in CMPs: A Thread Voting Approach”, in Proceedings of the 22nd IEEE International Symposium on High Performance Computer Architecture (HPCA 2016), Barcelona, Feb. 2016.

progress of other data-sharer cores. Relying on network throughput might not be effective in V/F tuning since CMPs are self-throttling: a core cannot inject new requests into the network once it fills up all of its miss request buffers. A low throughput can either indicate an under-saturated network or an over-saturated network when the request buffers of the cores are fully filled. Because of the above reasons, NoC DVFS mechanisms relying merely on network-level metrics may improperly adjust on-chip V/F levels, leading to over- and under-provisioning of power and thus inferior energy efficacy.

In this section, we propose a novel on-chip DVFS strategy that is able to adjust network V/F levels according to a thread's direct performance indicatives in terms of message generation rate and data sharing metrics. Like the conventional network-metric based approaches, the injection rate allows capturing a thread's communication need. The data sharing characteristics in a cache-coherent multi-core enable taking data criticality and dependency into consideration. In this way, our strategy can effectively overcome the power over- and under-provisioning problems whenever present. In order to achieve a light-weight solution, our technique allows a thread to vote for a preferred V/F level according to its own performance indicatives, and the V/F votes (a few bits) are carried by packets and spread into the routers. A region-based V/F controller collects all votes of passing packets and makes a democratic decision on the final V/F level for its region composed of a few routers. As such, we call our technique a thread voting approach, which means a thread's active participation in influencing the V/F decisions, in sharp contrast to existing thread data-sharing oblivious V/F adjustment techniques.

Under a multi-threaded programming environment, we evaluate our DVFS mechanism with cycle-accurate full-system simulations in Gem5 [14] using PARSEC benchmarks [13]. We further exploit McPAT [108] for reporting power related results. In the benchmark experiments, we expose the power over- and under-provisioning problem of a purely network-metric based DVFS mechanism and show the problem's sensitivity to a program's network utilization and data-sharing characteristics. Furthermore, we observe our approach's great potential in improving network energy efficacy (up to 21%) and system energy efficacy (up to 35%) in the experiments.

The rest of the chapter is organized as follows. We discuss related work in Section 2.2. Section 2.3 illustrates power over/under provisioning in a network-metric based DVFS approach. Section 2.4 presents the overview and principles of our thread voting based DVFS technique. In Section 2.5 we detail the thread communication characterization with vote generation and propagation, as well as the vote organization and region DVFS controller for region V/F adjustment. In Section 2.6 we report experiments and results.

2.2 Related work

Dynamic Voltage and Frequency Scaling (DVFS) has been an important approach to improving system energy efficacy in many-core systems. In response to dynamic and diverse workloads, a DVFS mechanism co-optimizes system-level performance and power, i.e., it aims to obtain as much performance as possible while spending as little power as possible. As more cores are integrated, DVFS for many-cores becomes more critical. State-of-the-art DVFS techniques for many-cores can be roughly categorized into power/performance-oriented and temperature-aware techniques, each with further sub-categories focusing on various architectures.

2.2.1 Power and performance oriented DVFS techniques

Overview. As studied in [52], DVFS can be implemented efficiently through on-chip voltage regulators [96] in which the scaling delay can be reduced to tens of processor cycles, i.e., in the order of the memory access latency. Such results justify DVFS for on-chip implementations. In the quest for efficient power management, DVFS in multi-/many-core architectures has been exercised in various ways at different levels and granularities, such as the NoC, the uncore (including NoC and LLC), the core, and V/F islands. Besides on-line performance measurement based DVFS, power-model based and thermal-aware DVFS mechanisms have been developed.

NoC DVFS

Research on DVFS in NoCs exploits various NoC-based metrics such as link utilization [107, 109], buffer utilization [117, 16], average packet injection rate and delay [21], network throughput [26], router timing error [5], and memory access and cache coherence traffic transmission [78, 18].

A pioneering work on NoC DVFS for saving power can be found in [107], where Shang et al. suggest dynamically tuning the frequency and voltage levels of individual links according to historical link and input buffer utilizations. Treating the entire NoC including all routers and links as a single V/F domain, Guang and Jantsch propose a rule-based centralized controller to adjust the V/F level such that the network operates just below the saturation point according to historical network workload and frequency information [109]. Rather than focusing on links like [107], Mishra et al. [117] made a fine-grained DVFS case on tuning V/F levels for individual NoC routers. Based on input buffer occupancy, a per-router V/F level decision is made to improve the energy-delay product (EDP). Specifically, they monitor router input queue occupancy, based on which the upstream router changes its V/F level. When a gather traffic pattern (like all-to-one traffic to a shared memory) creates a hot-spot region in the network, the routers in the region can adaptively switch to a turbo mode, increasing frequency to accelerate transmitting packets for performance assurance. When a disperse traffic pattern exists and the routers enjoy temporarily low buffer loads, the V/F region adaptively decreases the V/F level of the routers to a low V/F mode for power saving. In [16], Bogdan et al. observe that computational workloads in multi-cores exhibit fractal characteristics. Based on this observation, a modeling based approach exploiting NoC buffer utilization is proposed to find optimal NoC frequencies for different applications. In [21], Casu and Giaccone examine two kinds of NoC DVFS policies: 1) packet injection rate-based policies that scale down router frequency and voltage to the minimum values such that the NoC sustains the maximum packet injection rate without reaching saturation; 2) packet delay-based policies in which a closed-loop control tunes router frequency and voltage such that the NoC average packet delay satisfies a pre-determined bound. In [26], Chen et al. propose a throughput-driven DVFS controller and a latency-based PI controller, both with a dynamic reference point. In particular, the latency considers the uncore latency critical to overall chip performance. In [5], timing errors in routers are used as DVFS tuning hints. The basic idea is that when there are no timing errors in NoC routers, the routers' frequencies are decreased for power saving. However, when a router's voltage becomes too low and causes timing violations in the router's pipeline, the router's voltage is turned up for timing error elimination. Hesse and Jerger [78] derive specific reliable traffic predictions to guide per-router V/F tuning by exploiting predictable cache-coherence communication properties. Application completion time and NoC power are used in comparative evaluations with [107] and [117]. In [18], Bokhari et al. propose to use off-line profiled L1/L2 cache miss rates as indicators to map applications to cores and routers in the NoC. Applications that have high L1/L2 cache miss rates are mapped to routers that are close to memory controllers. During application runtime, the selected routers are connected with each other, with all remaining routers switching to low voltage for power saving. A state-space feedback control strategy based on a router energy consumption model is proposed in [129], which finds the optimal partitioning of V/F islands for applications so that minimized NoC energy consumption is achieved.

While allowing the finest power management, per-link or per-router DVFS incurs the largest cross clock-domain overhead. In contrast, per-network DVFS has minimal overhead but minimizes opportunities to efficiently handle hot-spot or congested areas. Aiming for a scalable solution, our work considers region-based DVFS where a few routers and associated links are grouped into one region to share one V/F domain in order to achieve a good tradeoff between complexity, cost and efficiency. The proposed thread-voting DVFS adjusts V/F levels for NoC regions, considering not only network utilization but also the inter-thread data-sharing status critical to system performance. Furthermore, the role of cores or threads is leveraged in actively influencing the DVFS decision through sending preferred V/F levels as “votes” to express their wishes to the region DVFS controllers, which make “democratic” decisions based on the majority vote for their network regions.
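For contrast with the thread-voting scheme, a purely network-metric rule of the kind used by the buffer-occupancy based proposals above can be written in a few lines. The thresholds, the hysteresis-free form and the level range below are invented for illustration and do not correspond to any specific cited design.

```cpp
#include <cstdio>

// A generic buffer-occupancy DVFS rule of the kind used by the network-metric
// based proposals discussed above. Thresholds and the level range are
// hypothetical: raise the V/F level when input buffers run full, lower it
// when they run nearly empty, otherwise keep the current level.
int next_vf_level(double buffer_occupancy, int current_level, int max_level) {
    constexpr double kHighWatermark = 0.75;   // assumed congestion threshold
    constexpr double kLowWatermark  = 0.25;   // assumed idle threshold
    if (buffer_occupancy > kHighWatermark && current_level < max_level)
        return current_level + 1;
    if (buffer_occupancy < kLowWatermark && current_level > 0)
        return current_level - 1;
    return current_level;
}

int main() {
    // A congested router (80% buffer occupancy) steps up from level 1 to 2.
    std::printf("next level: %d\n", next_vf_level(0.8, 1, 3));
    return 0;
}
```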
Dur- ing application runtime, the selected routers are connected with each other, with all remaining routers switching to low voltage for power saving. A state-space feed- back control strategy based on router energy consumption model is proposed in [129], which finds the optimal partitioning of V/F islands for applications so that minimized NoC energy consumptions are achieved. While allowing finest power management, per-link or per-router DVFS incurs largest cross clock-domain overhead. In contrast, per-network DVFS has minimal overhead but minimizes opportunities to efficiently handle hot-spot or congested areas. Aiming for a scalable solution, our work considers region-based DVFS where a few routers and associated links are grouped into one region to share one V/F domain in order to achieve a good tradeoff between complexity, cost and efficiency. The proposed thread-voting DVFS adjusts V/F levels for NoC regions, considering not only network utilization but also inter-thread data-sharing status critical to sys- tem performance. Furthermore, the role of cores or threads is leveraged in actively influencing the DVFS decision through sending preferred V/F levels as “votes” to express their wishes to the region DVFS controllers, which make “democratic” decisions based on the majority vote for their network regions. 2.2. RELATED WORK 19

Uncore DVFS

In multi-/many-core architectures, cores and private L1 caches may be partitioned and allocated to threads exclusively, but the NoC and Last level Cache (LLC) (L2 caches in our target architecture) are shared among cores. Treating NoC and LLC as uncore, DVFS can be applied to the entire uncore as one shared resource. Uncore DVFS decision needs to rely on system-level performance metrics since network-level metrics alone are insufficient. Taking the NoC and LLC as one single V/F domain, Chen et al. introduce an AMAT (Average Memory Access Time) based centralized PI (Proportional and Integral) controller with a low-overhead AMAT monitoring technique for the shared uncore system [25]. AMAT captures the effects of private caches, LLC and off-chip memory besides NoC, reflecting a more global latency view on application performance. While most DVFS strategies are reactive, Won et al. [167] propose a proactive ANN (Artificial Neural Network) based DVFS mechanism for uncore power management in which an ANN detects program phase patterns exhibited in uncore utilization demands. They combine an PI controller with ANN such that the PI controller bootstraps the ANN during the ANN’s initial learning process. The controller shows improved energy-delay product by 27% without degrading the uncore performance when compared to [26].

Core DVFS

To achieve optimal power-performance efficiency in modern multicore systems with dynamic power management, Vega et al. demonstrate the importance of orches- trating multiple power control actuators (or “knobs") such as DVFS, core folding (consolidation of work into fewer cores to save power) and per-core power gat- ing, and propose a robust multi-knob power management protocol to coordinate DVFS with core folding and per-core power gating to maximize performance-power efficiency [165]. In [90], a DVFS scheme targeting latency-critical workloads is proposed, which adjusts on-chip V/F levels at sub-millisecond granularity to save power while still meeting the application’s tail latency requirement. To reduce cost and enable fine-grain voltage scaling, architecture-level modeling is used in [56] to explore a new DVFS controller which enables improving performance and energy efficiency at similar average power for multi-threaded applications with activity im- balance. They also use circuit-level modeling to explore various ways to organize on-chip voltage regulators, including reconfigurable power distribution networks. In [115], Miftakhutdinov et al. propose a DVFS performance predictor under vari- able memory access latency observed in realistic memory systems. The leading load model directed DVFS mechanisms are proposed in [91, 50, 143]. The leading load model computes the non-scaling in multi-threaded applications by predicting the latency of the leading load miss in a burst of load misses. Cores that are waiting for the finish of leading loads are then switched to low V/F level for power saving. CHAPTER 2. PERFORMANCE CHARACTERISTICS BASED NOC POWER 20 MANAGEMENT Island DVFS Applying per-region NoC DVFS to the entire chip, we obtain a general case where the entire chip is partitioned into multiple or many regular/irregular V/F domains called V/F Islands (VFIs). In [138], Rahimi et al. partitions a multi-core chip into multiple V/F islands. They then propose a rule-based DVFS control for each island according to both link utilization and router queue occupancy. Based on a state-space model using inter-domain queue utilization information, Ogras et al. [128] present an adaptive feedback control methodology for NoC architecture with multiple V/F domains. The methodology achieves a robust dynamic control on the speed of each V/F domain in the presence of parameter and workload variations. The evaluation is performed for an NoC-based MPEG-2 encoder hardware design, showing 40% of energy savings. Since the linear control models such as the state- space model in [128] cannot capture the fractal characteristics of computational workloads in multicore architectures, Bogdan et al. [17] illustrates that computa- tional workloads in multicores are highly complex and exhibit fractal characteristics. A fractal model is developed on the basis of the queue utilization dynamics and fractional differential equations, providing a solid basis for formulating an optimal frequency control problem for the multiple V/F domains of NoC architectures. The evaluation is conducted for power consumption, showing 70% of achievable power savings across a variety of benchmark applications.

Power-model based DVFS

While DVFS decision-making can be based on dynamic performance measurements, power models can also be utilized to quickly guide DVFS adjustments for power-efficient or thermal-safe many-core architectures. In [104], a configurable General-Purpose GPU (GPGPU) power model capable of cycle-level calculations is proposed and comprehensively validated against real commercial products (GTX 480 and Quadro FX5600). The power model, which can accurately track the power consumption trend over time, is integrated with the cycle-level simulator GPGPU-Sim and used to demonstrate the energy savings achievable by utilizing DVFS. To optimize the performance of a multi-core processor within a power budget, a model-based, highly scalable power control solution considering DVFS effects is developed in [112]. This technique features a layered power-budgeting strategy at the chip, group and core levels to maximize overall processor performance and shorten completion time in mixed single-threaded and multi-threaded applications.

2.2.2 Temperature-aware DVFS techniques

Overview. Hanson et al. [69] investigate and characterize the thermal response of an Intel Pentium M processor to different DVFS policies, showing that DVFS can be a viable solution for on-chip thermal management. As a recent industry practice, in Intel's Sandy Bridge architecture [142], the Package Control Unit (PCU) is responsible for the on-chip thermal-aware DVFS control. A power-management link connects the PCU to different cores and functional blocks on die via power management agents (PMAs). PMAs collect per-core power consumption and temperature, which are reported to the PCU. The PCU runs firmware that monitors the received power and thermal information, based on which power-management decisions are made. In AMD's “Trinity” APU [126], each functional block on-chip is a “thermal entity” (TE). Firmware running on an additional management processor calculates the per-TE temperature based on a chip thermal RC model. TEs are then throttled or boosted with DVFS according to the thermal RC model calculation results. Next, we review representative thermal-aware DVFS techniques for general-purpose CMPs, embedded processors, and warehouse-scale computers.

Temperature-aware DVFS for CMPs

Temperature-aware DVFS has also been studied for CMPs from different angles, though few works assume a tiled architecture. Recently, Kong et al. [97] give an excellent survey of pioneering thermal management techniques. Related works can be roughly categorized into on-chip thermal modelling, dynamic thermal management (DTM), and temperature-aware micro-architecture design.

Thermal modelling. Thermal modelling with thermal resistance and capacitance (RC) models for multi-core architectures is comprehensively discussed in [148, 150, 149, 151, 83]. Recently, Kadin and Reda [88] characterize the influence of variations in clock frequency and voltage on multi-core temperature. With the characterization model, thermally constrained V/F selection schemes are examined: (1) different cores with the same V/F setting, (2) per-core static V/F settings, (3) per-core varied V/F settings according to dynamic thread priority. Lee et al. [103] develop a linear regression model using performance counters to predict the on-chip temperature of a dual-core processor. Huang et al. [82] argue for the necessity of temperature modelling and temperature awareness at early design stages. Liu et al. [110] take heat radiation among neighbouring cores into consideration for thermal modelling and task migration guidance towards maximum system throughput under a peak temperature constraint. Pagani et al. [131] propose Thermal Safe Power (TSP), which provides safe power constraints as a function of the number of simultaneously operating cores on dark silicon chips. Different from the conventional Thermal Design Power (TDP), TSP operates cores with thermal safety in mind, allowing the CMP to avoid performance penalties from DTM.

Dynamic Thermal Management (DTM). Gomma et al. [58] propose a “heat-and-run” scheme to migrate threads from overheated cores to cooler ones while still maintaining system throughput. Donald and Martonosi [41] propose a software/hardware cooperative DTM mechanism, where a migration controller in the operating system migrates threads according to core temperatures and a hardware PI controller is responsible for fine-grained per-core V/F adjustment based on thread activity. Mukherjee and Memik [120] consider thermal radiation among adjacent cores in their DTM mechanism, aiming to minimize the performance impact of frequency scaling. Soteriou et al. [156] propose a compiler-directed power management algorithm for network-on-chip which inserts software DVFS directives at compile time. The mechanism consists of three phases. In phase 1, during compilation, network link utilization is estimated based on the software-level packet sending and receiving primitives. In phase 2, using the per-link utilization estimation from phase 1, DVFS directive instructions are inserted to tune per-link V/F according to its utilization. In phase 3, queuing theory is exploited to estimate per-router buffer load and set a pre-determined threshold. When the short-term router buffer load exceeds the threshold, link power management is temporarily stopped to accelerate packet transmission. Given a task graph, the works in [136, 71, 72] find the best V/F respecting on-chip thermal safety for each core of an MP-SoC. While the former solves the multi-objective trade-off among performance, power and temperature with a Nash equilibrium, the latter two address the problem using power and thermal models.
A DTM mechanism named Microvisor [92] is implemented as a runtime software guardian to cater to the computational demands of multi-threaded applications. Microvisor predicts the temperature of each thread and takes preemptive actions to keep all cores saturated around the temperature threshold, maximizing application performance under thermal constraints. Shen et al. [146] exploit a reinforcement Q-learning algorithm to dynamically learn the relationship among core energy, temperature and performance, enabling two of the three measures to be set as constraints while the third is optimized.

Temperature-aware micro-architecture design. Kadin et al. [89] propose a distributed controller that adjusts core V/F to reduce thermal and timing error violations for multi-cores. The controller is tuned towards maximum throughput under thermal constraints, exhibiting better performance compared to its centralized counterpart. An agent-based power distribution approach (TAPE) is proposed in [43], which aims to balance the power consumption of multi-/many-core architectures in a proactive manner. TAPE assigns a hardware agent to each core. Different agents negotiate with each other to “buy” or “sell” additional power budgets through local trades, i.e., trading only with their neighbors. Further, in [54], a DTM mechanism is proposed that uses software agents to achieve thermal-aware runtime application mapping. Recently, dark silicon, in which low-activity cores can be powered off for energy saving, has been exploited to achieve temperature-aware computation. Computational sprinting [137] enables dark silicon cores to work intensively during a short period of time. It permits cores to briefly surpass the system power limit to improve application responsiveness and then enter an idle mode, maintaining a sustainable long-term average power budget. Khdr et al. [94] propose DsRem to maximize throughput for Instruction Level Parallelism (ILP) and Thread Level Parallelism (TLP) programs. For a multi-threaded application, DsRem finds the optimal number of cores considering per-core power consumption. It then considers inter-core heat transfer when allocating active cores on the CMP chip so that thermal violations are reduced. Henkel et al. [76] summarize recent trends in dark silicon from a thermal perspective. They show that it is not the chip's total power budget, e.g., the Thermal Design Power (TDP), that leads to silent cores, but instead the power density and the related thermal effects. The paper motivates Thermal Safe Power (TSP) as a more rational power budget. The influences of Near-Threshold Computing (NTC) and boosting techniques have also been compared in terms of Giga Instructions Per Second (GIPS) and power consumption (Watts). Recently, Agrawal et al. [3] create die regions of higher vertical thermal conduction by inserting pillars of high thermal conductivity between the processor die and the heat sink. Khatamifard et al. [93] propose ThermoGater to reduce on-chip thermal emergencies caused by voltage regulators.

Temperature-aware DVFS in embedded processors

Bao et al. [9] develop a core DVFS mechanism with frequency/temperature dependency awareness. It consists of two parts: an off-line temperature-aware V/F selection respecting task deadlines and an on-line dynamic V/F adjustment based on temperature sensor readings. Zhang and Chatha [172] develop a heuristic-based approach to maximize the throughput of embedded multi-cores with DVFS capabilities by sequencing the execution of tasks with different power and thermal characteristics. Hanumaiah and Vrudhula [70] propose a DVFS scheme for hard real-time applications, meeting their deadlines with consideration of on-chip temperature constraints. Kim et al. [95] develop a DTM algorithm for mobile devices which can alleviate the frequency variation problem occurring in conventional embedded DTM schemes. The idea is to adjust the chip V/F based on the average V/F history of the past several DVFS windows. Experimental results indicate that for mobile processing cores, running at the average V/F consumes less power than running at varied frequencies while achieving similar performance. Egilmez et al. [7] present a model for predicting skin and screen temperature for mobile devices using on-device thermal sensors. Based on the run-time predictor, they further develop a User-specific Skin Temperature-Aware (USTA) DVFS mechanism to control the skin and screen temperature.

Temperature-aware DVFS in warehouse-scale computers (WSCs)

With advantages such as quieter operation, shatter resistance, and faster reaction to thermal emergencies, thermoelectric cooling (TEC) is utilized in [15] for hot-spot reduction in datacenter processors. With a TEC layer placed between the heat spreader and the processor die, applying current selectively to the coolers over the hot regions can more effectively reduce thermal hot-spots than conventional air-conditioner based cooling mechanisms. Recently, thermal time shifting [147] is proposed, which uses phase change materials (PCMs) to buffer thermal load in datacenter servers. The idea is to use PCMs to temporarily store heat generated by servers during peak loads; the stored heat is later dissipated when the server peak load diminishes and excess cooling capacity is available.

2.3 Motivation

We illustrate the power over/under provisioning problems observable in network-metric based DVFS mechanisms. To be specific, we exemplify the phenomena using a router buffer load based DVFS technique. In this section, we also motivate the selection of region-based DVFS.

2.3.1 Target CMP architecture

(a) Hotspot traffic. (b) Dispersed traffic.

Figure 2.1: The target CMP architecture (NI: Network Interface and R: Router)

We first introduce the target CMP architecture we exploit throughout all experiments. Figure 2.1 shows a typical architecture for a 64-core CMP, where the NoC interconnects processor nodes (a CPU core with its private L1 cache), secondary on-chip shared L2 cache banks, and on-chip memory controllers. Routers are organized in a popular mesh topology on which the X-Y dimensional routing algorithm is implemented to achieve simplicity and deadlock-free resource allocation. There are 8 memory controllers, which are connected to the central 4 nodes of the first and last rows for architectural symmetry. Due to space limitations, the figure only illustrates one memory controller in each of the first and last rows. For maintaining data consistency, a directory based MOESI cache coherence protocol is implemented. On the target CMP, once a core issues a memory request, the local L1/L2 is first checked for the presence of the requested data. If this fails, the request is forwarded to the corresponding remote L2/DRAM via the NoC.
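To make the request path explicit, a minimal C++ sketch of this lookup chain is given below; cache contents are modelled as toy sets, coherence traffic, MSHRs and the actual NoC traversal are omitted, and names such as homeBank are illustrative rather than taken from the simulator.

#include <cstdint>
#include <cstdio>
#include <unordered_set>

// Simplified view of where a memory request is served on the target CMP:
// private L1 first, then the home L2 bank reached over the NoC, then DRAM
// through a memory controller. 128 B blocks and 64 L2 banks as in Table 2.2.
struct Core {
    std::unordered_set<uint64_t> l1;                 // blocks present in the private L1
};
static std::unordered_set<uint64_t> l2[64];          // 64 shared L2 cache banks

// Home bank selection by block-address interleaving (an assumed mapping).
int homeBank(uint64_t addr) { return static_cast<int>((addr >> 7) % 64); }

const char* serve(const Core& c, uint64_t addr) {
    uint64_t block = addr >> 7;
    if (c.l1.count(block))
        return "private L1 (no NoC traffic)";
    int bank = homeBank(addr);                       // request is injected into the NoC
    if (l2[bank].count(block))
        return "home L2 bank (one NoC round trip)";
    return "DRAM via memory controller (NoC + off-chip access)";
}

int main() {
    Core core;
    core.l1.insert(0x10);                            // block 0x10 cached in L1
    l2[homeBank(0x2000)].insert(0x2000 >> 7);        // block of 0x2000 cached in its home bank
    std::printf("%s\n", serve(core, 0x10ULL << 7));
    std::printf("%s\n", serve(core, 0x2000));
    std::printf("%s\n", serve(core, 0x9000));
    return 0;
}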

2.3.2 Power over/under provisioning scenarios

Power over provisioning. We define power over provisioning as the circumstance in which the on-chip V/F level is highly increased but the application-level performance improvement is marginal. As illustrated in Figure 2.1a, assume that cores 1, 2, ..., 7, 8 periodically issue write requests to different data blocks within the same L2 cache bank A (such a gather traffic pattern frequently occurs, for example, in the multiplication of different partitions of a matrix). Assume further that the write requests are all exclusive and have no data-sharer threads. In this case, the issuing cores can hide the write miss delay with write buffers, and the return time of the requests can be relaxed. Under this circumstance, although the traffic pattern forms a hot-spot in the highlighted router column, it may not always be appropriate to increase the routers' V/F level once their buffer load goes high, since the cores can tolerate part of the network delay without loss of performance. However, in buffer load based DVFS, the frequency of routers in the highlighted column will be highly increased due to the hotspot traffic while the performance improvement can be very limited.

Power under provisioning. We define power under provisioning as the circumstance in which the on-chip V/F is slightly decreased but the application-level performance degradation is very notable. As illustrated in Figure 2.1b, assume that cores 1, 2, ..., 7, 8 periodically issue write requests to L2 cache banks A, B, ..., G, H, respectively. Assume further that each of the issuing threads has multiple data-sharer threads (such traffic frequently occurs, for example, when multiple threads compete for locks on shared variables). In order to maintain memory consistency, successive requests issued by the data-sharer threads to the same data address are prohibited from returning the newly written value until all duplicated cache copies have acknowledged the receipt of the invalidation or update messages. Under this circumstance, even though the traffic pattern is dispersed and each passing router may have a light buffer load, it may not be proper to lower the routers' V/F level, since the completion time of the ongoing write request affects both the issuing and the data-sharer threads' progress. In router buffer load based DVFS, the frequency of the passing routers in Figure 2.1b will be improperly decreased due to the low buffer load, which may sacrifice considerable application performance for limited power saving gains.

2.3.3 DVFS tuning granularity

The NoC DVFS control can be implemented at different granularities, such as over the entire NoC [26, 167], in coarse-grained V/F islands [5, 55], or in fine-grained per-router micro-architecture [117]. Although the chip-wide approach offers small inter-domain interfacing overhead and is often cheaper to implement, the recent trend towards CMPs executing multi-threaded workloads motivates the need for finer-grained DVFS mechanisms [52, 117]. However, having each router in a separate voltage domain requires bridging of such domains on every single link, which introduces 2-3 cycles of additional delay for each link [52]. Facing this dilemma, our approach seeks a hybrid solution to balance hardware complexity and DVFS tuning effectiveness: region-based DVFS tuning, where several adjacent routers on the chip form a DVFS region and share the same V/F level. Different from the V/F island approach, where V/F islands are formed dynamically over time, our approach adopts pre-defined static V/F regions for lower hardware complexity.
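As a concrete illustration of this static partitioning, the following C++ sketch maps router coordinates to region IDs, assuming the 8×8 mesh and 2×2 regions used in our later evaluation (coordinates and region IDs are 0-indexed here for simplicity).

#include <cstdio>

// Static mapping from router coordinates to V/F region IDs for an 8x8 mesh
// partitioned into pre-defined 2x2 regions (16 regions in total). Region IDs
// are numbered row-major over the 4x4 grid of regions.
constexpr int kMeshDim = 8;
constexpr int kRegionDim = 2;

int regionOf(int row, int col) {
    int regionRow = row / kRegionDim;
    int regionCol = col / kRegionDim;
    return regionRow * (kMeshDim / kRegionDim) + regionCol;
}

int main() {
    // All four routers of a 2x2 tile share one region DVFS controller.
    std::printf("router (0,0) -> region %d\n", regionOf(0, 0));   // region 0
    std::printf("router (1,1) -> region %d\n", regionOf(1, 1));   // region 0
    std::printf("router (2,5) -> region %d\n", regionOf(2, 5));   // region 6
    std::printf("router (7,7) -> region %d\n", regionOf(7, 7));   // region 15
    return 0;
}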

2.4 Thread voting based DVFS mechanism

Design overview

As the motivational examples show, adjusting DVFS merely according to network metrics may lead to power over/under provisioning. What is equally important is to take data-sharing situations into consideration. With this in mind, we develop a DVFS mechanism exploiting both network utilization and data-sharing characteristics according to a program's runtime behavior. Seeking a scalable solution, we develop a thread voting approach which, in contrast to thread-passive V/F tuning, allows a thread to actively vote for a V/F level that suits its own performance interest; a region-based DVFS controller then makes the V/F decision based on the collected votes of the involved threads.

Figure 2.2: The three-step DVFS tuning mechanism

Technically, our voting based on-chip DVFS approach requires solving three essential problems: 1) How to capture a thread’s communication characteristics reflecting both network utilization and data sharing? 2) How to generate and propagate the V/F vote into the NoC efficiently? and 3) How to make an effective DVFS decision according to the votes received? Figure 2.2 shows our technique which comprises three steps, with each step addressing one of the problems.

• Step 1 Communication characterization conducts a step-window based thread communication demand characterization. Since per-thread communication characteristics are reflected by a thread's own message generation rate as well as its data sharing relations with other threads, we suggest a (γ, s) model where γ represents the message generation rate and s the level of data sharing.

• Step 2 V/F vote generation and propagation generates votes according to the profiled (γ, s) values. Votes from different threads are treated with the same weight. In order to achieve a fair voting scheme, the total amount of votes a thread generates should be proportional to its network utilization level. Under this consideration, we use network packets to carry and propagate the votes into the network routers. The more packets a thread injects, the more votes the thread spreads and thus the more influence the thread has on the V/F election results.

• Step 3 Region DVFS decision making: each V/F region makes its own DVFS decision in favor of the majority based on the collection of received votes.

2.4.1 Per-thread characterization using the (γ, s) model

The (γ, s) model. In the target CMP described in Figure 2.1, one key challenge lies in that threads running on the cores do not explicitly specify their on-chip communication demand. To adequately capture per-thread communication demand on the NoC, we define a (γ, s) model which can indicate not only a thread's own network utilization but also its data sharing relations with other threads. Specifically, γ is the message generation rate, capturing the average number of outstanding L1/L2 misses in the miss status holding registers (MSHRs) per epoch, and s is the average number of data sharers per data request per epoch. Because a core's L1/L2 misses must be encapsulated into packets and forwarded to the corresponding L2 or DRAM via the network, γ is indicative of the average network access intensity of the thread. Additionally, s reflects the criticality of a thread's data requests with respect to other data-sharer threads. The more data sharers a thread has, the more critical the thread's data requests likely become, since the requests' finish time may influence the progress of the data-sharing threads.

Communication characterization. We profile per-thread (γ, s) values for each core at runtime. To obtain the γ value, we develop a communication characterizer within the network interface (NI) to monitor the network requests from a thread, as shown in Figure 2.3a. Note that, since γ is defined to profile a thread's utilization of the NoC, only those cache behaviors that incur network messages are counted. Second, we use s to reflect the degree of a thread's data sharing by characterizing the average number of data copies each thread has per epoch. Given a directory based cache coherence protocol, characterizing s is done by checking the number of data sharers in the directory on outgoing network requests. Note that in order to allow continuous characterization of (γ, s) with a proper time granularity, we adopt a step-window based data sampling approach, where the (γ, s) values are computed on a per time interval basis. To ensure characterization continuity, adjacent windows may overlap.

2.4.2 Vote generation and delivery

As shown in Figure 2.3a, in Step 2 each thread votes for a preferred V/F level according to its profiled communication needs and then delivers its votes into the network for region V/F adjustment.


Figure 2.3: Overview of the thread voting based DVFS mechanism. (a) Steps 1 and 2 conduct per-thread communication demand characterization and vote generation/delivery. (b) Step 3 sketches the region based DVFS decision making.

Vote generation. After characterizing the (γ, s) values, a thread votes for a preferred V/F level by mapping the obtained (γ, s) values to a proper DVFS configuration vote. For generality, we adopt a look-up table based approach for the mapping function, which is shown in Table 2.1. We divide the whole γ and s value span into, e.g., three intervals: high, middle, and low. We also pre-define, e.g., three V/F levels per region, which are denoted as Configurations A, B, and C, where Conf.A operates routers in a V/F region at the maximum V/F, Conf.B at 87.5% of the maximum V/F, and Conf.C at 75% of the maximum V/F. As shown in the table, the combination of γ and s uniquely indexes a V/F configuration and thus a vote. For a fair voting scheme, the total amount of votes a thread generates should be proportional to its network utilization level.

Table 2.1: Mapping table from (γ, s) to a V/F vote

                   γ high     γ mid      γ low
  read             Conf.A     Conf.B     Conf.C
  write, s high    Conf.A     Conf.A     Conf.B
  write, s mid     Conf.A     Conf.B     Conf.B
  write, s low     Conf.B     Conf.B     Conf.C

In general, one V/F vote can be generated on every nth read/write message from a thread. Note that, as shown in the table, since read accesses to shared data blocks do not affect data-sharing threads' progress, only γ is considered during the vote mapping of a read request. The intuition behind the mapping rules in Table 2.1 is to avoid both power over and under provisioning: a proper V/F configuration is decided considering not only the traffic metric γ but also the data-sharing status s.

Light-weight vote delivery. The per-thread generated V/F votes have to be spread into the network in order to guide the region V/F adjustment. Although a centralized vote delivery approach is conceptually simple to design, it has limited scalability, i.e., its design overhead cannot be amortized once the system size grows beyond a certain degree. To address this problem, we propose a distributed light-weight vote-delivery approach as follows. We treat the vote as payload, part of the head flit of a network packet. As shown in Figure 2.3a, specific bits in the header flit are added for the vote. The number of vote bits depends on the number of supported V/F configurations. More bits mean finer V/F adjustment granularity, but also higher design complexity. With three candidate V/F configuration levels, we use 2 bits. We further illustrate the vote delivery process in Figure 2.3a, where thread T1 spreads its votes to routers R1, R2, and R4 along the packet route.
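A minimal C++ sketch of the vote generation is shown below. The mapping itself follows Table 2.1 and the 2-bit vote matches the three configurations above, but the quantization thresholds and the exact position of the vote bits in the header flit are illustrative assumptions.

#include <cstdint>
#include <cstdio>

// Candidate V/F configurations as in Table 2.1 (Conf.A = maximum V/F,
// Conf.B = 87.5% and Conf.C = 75% of the maximum). Two bits suffice.
enum Conf : uint8_t { ConfA = 0, ConfB = 1, ConfC = 2 };
enum Level { Low = 0, Mid = 1, High = 2 };

// Quantization of the profiled gamma and s values into three levels.
// The thresholds are assumed design parameters, not values from the thesis.
Level quantize(double v, double lowTh, double highTh) {
    return v < lowTh ? Low : (v < highTh ? Mid : High);
}

// Vote Generation Table lookup following Table 2.1: reads consider only gamma,
// writes consider both gamma and s.
Conf vote(bool isWrite, Level g, Level s) {
    static const Conf readRow[3]     = {ConfC, ConfB, ConfA};       // indexed by gamma level
    static const Conf writeTab[3][3] = {{ConfC, ConfB, ConfB},      // s low
                                        {ConfB, ConfB, ConfA},      // s mid
                                        {ConfB, ConfA, ConfA}};     // s high
    return isWrite ? writeTab[s][g] : readRow[g];
}

// Place the 2-bit vote into an assumed position (bits [1:0]) of the header flit.
uint64_t encodeHeader(uint64_t headerFlit, Conf v) {
    return (headerFlit & ~0x3ULL) | static_cast<uint64_t>(v);
}

int main() {
    Level g = quantize(0.8, 0.3, 0.6);   // high message generation rate
    Level s = quantize(0.1, 0.3, 0.6);   // low data sharing
    Conf  v = vote(true, g, s);          // write request -> Conf.B per Table 2.1
    std::printf("vote = Conf.%c, header = %#llx\n",
                'A' + v, (unsigned long long)encodeHeader(0x1000, v));
    return 0;
}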

2.4.3 Region DVFS decision making

In our mechanism, each region realizes the DVFS adjustment process in two phases: a vote receiving phase and a DVFS decision making phase. During the vote receiving phase, as packets are continuously routed through routers in a V/F region, the passing routers record the carried votes upon receiving the packet headers, and re-direct the votes to the Region DVFS Controller (RDC) for vote counting. When the vote receiving phase ends, the RDC begins the DVFS decision making phase, where the final DVFS adjustment decision is made in favor of the majority based on the number of votes received. The final DVFS decision is then input to the V/F setting logic for V/F level tuning. In Step 3, the RDC consolidates the final DVFS decision from the collective statistics of the received votes. We maintain three counters, one for each V/F level, within each RDC to count the number of votes each candidate V/F level receives. In the DVFS decision making phase, the RDC compares the vote counters and sets the region V/F to the level denoted by the maximum counter, which represents the V/F level elected by the most passing packets. Figure 2.3b shows Step 3, where two adjacent regions (each containing a 2×2 router mesh) are configured to two different V/F levels. Communication across different V/F regions is bridged by an asynchronous communication interface (ACI), which incurs a 2-cycle overhead. In an 8 × 8 mesh NoC with X-Y routing, the worst-case delay caused by the ACI is limited to at most 12 cycles since the longest path (network diameter) of the NoC crosses 7 different V/F regions. In Sections 2.5.1 and 2.5.2, we introduce the design details of the thread communication demand characterization and the region-based DVFS adjustment, respectively.

2.5 Design details

2.5.1 Characterization and vote generation

Our proposed DVFS mechanism relies on thread network access demand characterization for V/F vote generation and delivery. Figure 2.4 shows the micro-architecture of a modified network interface where a characterizer, a vote generation table (VGT) and a message sender work cooperatively for this purpose. The characterizer periodically monitors the network access demand of a thread and profiles the (γ, s) values. The vote generation table generates V/F votes based on the profiled (γ, s) values, and the message sender encapsulates the votes into outgoing packets.

Step-window based characterization. To allow continuous characterization of the (γ, s) parameters with a proper time granularity, we adopt a step-window approach for the characterization. We design two windows: 1) the sampling window with length Ts, in which the parameters are sampled and characterized, and 2) the step window with length Tp, which controls the advancing step of the sampling window. To allow sampling continuity, Tp ≤ Ts. The step window enables a flexible characterization scheme because one can adjust Ts and Tp to achieve a quality-complexity trade-off. Specifically, the ratio λ = Ts/Tp defines the degree of overlap among consecutive sampling windows. A larger λ brings higher continuity of the consecutively characterized results but with higher computational complexity. To reduce computation complexity, we also set Ts = 2^n, where n is a natural number.

Characterization of γ. To characterize the γ value, we have to examine how data messages are generated from a thread over time. In cache-based CMPs, network data message production is always indicated by cache read/write misses. In state-of-the-art non-blocking/lockup-free caches, when an L1/L2 cache miss occurs, an access request to the L2/DRAM is created by first allocating an MSHR entry. The higher the number of outstanding misses in the MSHRs of a thread, the more frequently the thread's data exchange occurs and the higher its data communication demand on the NoC. Thus, the number of MSHR entries allocated by a thread directly reflects the traffic intensity of the thread from the NoC perspective.


Figure 2.4: Modifications of the network interface, which conduct (1) communication demand characterization and (2) DVFS vote generation.

For implementation efficiency, we keep the calculation of γ simple. In each sampling window Ts, two counters, namely c1 and c2, which are reset to 0 at the beginning of each sampling window, are maintained to record the total number of MSHR entries allocated for L1 and L2 misses, respectively. When a new MSHR entry is allocated due to an L1 or L2 miss, c1 or c2 increases by 1, respectively. At the end of each sampling window, γL1 is computed for the L1 cache miss rate as c1/Ts, γL2 for the L2 cache miss rate as c2/Ts, and γ for the total cache miss rate as γ = γL1 + γL2. To continuously characterize γ, the sampling window slides forward with a step length Tp after each characterization. The above procedure then repeats throughout the system execution.

Characterization of s. Profiling s involves finding a thread's data sharing density, i.e., the average number of data sharers. This is done by performing a data-sharer number check on each critical message mc during a sampling window Ts, where the issuing time t(mc) of mc satisfies t(mc) ∈ [0, Ts]. A message is considered critical if it accesses a shared data block and its delay may affect the progress of other data-sharing threads. Such a message can be either a cache miss or a cache hit message, since in the write hit case the issuing thread must inform the other data-sharing threads about the modification of a shared data block, and those data-sharing threads must wait for the completion of the write operation for memory consistency maintenance. To profile s, we use a counter, namely cs, to incrementally count the number of data sharers of each critical message. At the start of a sampling window, cs is reset to 0. As time advances, the critical message check is performed on each message. If a critical message is detected, cs is increased by s', which is the number of data sharers associated with the critical message. Obtaining s' can be achieved by checking the coherence directory in a directory based cache coherence protocol, or by setting the s' value to the number of all destinations in a broadcast/multicast based protocol. In a directory-based cache coherence protocol, the original request issuer is responsible for obtaining the number of sharers for critical messages. This requires that the request issuer fetches the number of data sharers from the directory controller, which may reside in another node of the NoC. At the end of the sampling window, s is calculated as s = cs/Ts, which denotes the average number of data sharers the thread has had during the past sampling window.

V/F vote generation and delivery. As introduced in Section 2.4.2, after profiling the (γ, s) values, a thread generates votes for a preferred DVFS configuration by looking up the Vote Generation Table (VGT, as shown in Table 2.1 and Figure 2.4). New V/F votes are generated on every nth critical cache read/write or coherence request transaction. For reply transactions, the reply packet carries the same vote as the corresponding request packet. This ensures that both the request and reply packets have a consistent impact on the per-region DVFS decision making process. To control the V/F adjustment sensitivity to the (γ, s) values, we divide the γ and s value span into several levels, with each level denoting one candidate V/F configuration. Votes for a different V/F configuration are generated only when γ or s shifts into a different level.
The generated votes are then encapsulated into the packet header by the packet sender and spread into the network.
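A compact C++ sketch of the step-window characterization described above is given below; the counter updates follow the c1/c2/cs scheme, while the sub-window bookkeeping that realizes the overlap λ = Ts/Tp is our own simplification of the hardware.

#include <cstdio>
#include <vector>

// Step-window characterization of (gamma, s) in the network interface.
// The sampling window Ts is split into lambda = Ts/Tp sub-windows of Tp cycles
// so that overlapping sampling windows can be evaluated every Tp cycles.
// c1/c2 count newly allocated MSHR entries for L1/L2 misses and cs accumulates
// the data-sharer counts of critical messages (obtained from the directory).
struct Characterizer {
    unsigned long Ts, Tp;                      // Tp <= Ts, lambda = Ts / Tp
    std::vector<unsigned long> c1, c2, cs;     // one slot per sub-window
    unsigned slot = 0;
    double gamma = 0.0, s = 0.0;               // latest characterization results

    Characterizer(unsigned long ts, unsigned long tp)
        : Ts(ts), Tp(tp), c1(ts / tp, 0), c2(ts / tp, 0), cs(ts / tp, 0) {}

    void onL1MissMshrAlloc()                  { ++c1[slot]; }
    void onL2MissMshrAlloc()                  { ++c2[slot]; }
    void onCriticalMessage(unsigned nSharers) { cs[slot] += nSharers; }

    // Called every cycle; returns true whenever the sampling window advances by
    // one step of Tp cycles and a new (gamma, s) pair becomes available.
    bool tick(unsigned long cycle) {
        if (cycle == 0 || cycle % Tp != 0) return false;
        unsigned long sum1 = 0, sum2 = 0, sumS = 0;
        for (unsigned i = 0; i < c1.size(); ++i) {
            sum1 += c1[i]; sum2 += c2[i]; sumS += cs[i];
        }
        gamma = double(sum1) / Ts + double(sum2) / Ts;   // gamma = gamma_L1 + gamma_L2
        s     = double(sumS) / Ts;                       // s = cs / Ts over the window
        slot = (slot + 1) % c1.size();                   // recycle the oldest sub-window
        c1[slot] = c2[slot] = cs[slot] = 0;
        return true;
    }
};

int main() {
    Characterizer ch(65536, 16384);            // Ts = 2^16, Tp = 2^14 (lambda = 4)
    ch.onL1MissMshrAlloc();
    ch.onCriticalMessage(3);
    if (ch.tick(16384))
        std::printf("gamma = %g, s = %g\n", ch.gamma, ch.s);
    return 0;
}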

2.5.2 Region-based V/F adjustment

We now introduce the design details of the region-based V/F adjustment in two parts. We first discuss the vote counter organization and then present the architecture of the Region DVFS Controller (RDC).

Vote counter organization and configuration. The vote counters within a V/F region record the number of votes each candidate V/F level receives. Figure 2.5 shows two vote counter organizations: the distributed organization and the centralized organization. In the distributed organization shown in Figure 2.5a, each router maintains its own vote counters. As packets are routed through the region, each router counts the received votes. At the beginning of the DVFS decision making phase, a 2-stage comparison is performed to find the maximum counter value. The first-stage comparison is performed locally in each router; the local maximum values are then output to the RDC for the second-stage comparison, based on which the final DVFS decision is made. Finally, the region DVFS setting logic adjusts the region V/F level based on the final DVFS decision. Although this approach keeps the RDC simple when the V/F region scales up, maintaining vote counters within each router adds considerable hardware overhead.


(a) Distributed counter organization (b) Centralized counter organization

Figure 2.5: Distributed vs. centralized counter organization, where RDC refers to the region DVFS controller.

Figure 2.5b shows the region-centralized counter organization, where centralized vote counters in the RDC count all the votes received by routers in the region. In the DVFS decision making phase, only a one-stage comparison is needed to find the regional maximum counter value. In our current implementation (each V/F region contains a 2 × 2 router mesh), we choose the centralized counter organization as it offers the best hardware implementation efficiency with fast counter comparison.

Region DVFS Controller (RDC) architecture. Figure 2.6 sketches the RDC architecture and its connection to one of the routers (Router 3) in the region. As the figure shows, one RDC consists of four vote counting logic blocks (VCLs, one for each router), a set of centralized vote counters (one for each candidate V/F level), and a DVFS decision making logic, which work cooperatively to achieve the region based V/F adjustment. The router is a 2-stage speculative router [134] in which the first stage performs Route Computation (RC), Virtual Channel (VC) Allocation (VA), and Switch Allocation (SA) in parallel and the second stage is Switch Traversal (ST). In order to operate at different V/F levels, different V/F regions should be able to communicate asynchronously with each other. This is enabled by the asynchronous communication interface (ACI) on the boundary of the V/F region. We utilize the dual-clock I/O buffer from [117] for the ACI implementation, which maintains independent read and write clock signals. Figure 2.7 draws the schematic of the RDC. Due to space restrictions, we only show the VCL for Router 3 and its connection to the three centralized counters, with counter.H counting high V/F level votes, counter.M middle V/F level votes, and counter.L low V/F level votes. The DVFS tuning procedure can be divided into two consecutive phases: a vote receiving phase and a DVFS decision making phase. Each router in the V/F region connects to the RDC through a VCL, which acts as the vote receiving interface. During the vote receiving phase, a router passes the received votes for a V/F level through its VCL to the counter that counts votes for that V/F candidate.


Figure 2.6: Router and region DVFS controller architectures

At the DVFS decision making phase, the DVFS decision making logic compares the vote counters and finds the maximum one. Then the region DVFS setting logic updates the region V/F to the level denoted by the maximum vote counter. To increase the comparison efficiency and reduce delay in the DVFS decision phase, we design two thresholds: an absolute threshold θ1 and a relative threshold θ2, as illustrated in Figure 2.8. The absolute threshold θ1 determines the eligibility of a counter. The intuition is straightforward: if a V/F level is seldom voted for, it can be ignored in the final DVFS decision making process. As shown in the figure, since very few packets vote for the V/F level represented by counter B, we mark it as ineligible and only consider counters A and C during the DVFS decision making phase. The relative threshold θ2 treats two counters as different only if their difference exceeds θ2, which avoids comparing two close values and also enables flexible design trade-offs. For example, as shown in Figure 2.8, counter A (representing the high V/F level) and counter C (representing the low V/F level) satisfy |a − c| < θ2. The final DVFS policy can then be biased either towards the high V/F level for better performance or towards the low V/F level for better power saving, depending on the system running mode. The pseudo code for the DVFS decision making logic is shown in Algorithm 1. Given that the DVFS tuning frame is W cycles, the maximum number of votes a region can receive during a time frame is N × P × W, where N is the number of routers in the region and P is the number of input ports per router.
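The following small C++ sketch illustrates this thresholded comparison, including the running-mode bias applied to near-ties; the fallback when no counter is eligible is our own assumption.

#include <cstdio>

// Thresholded comparison of the three vote counters (index 0 = high V/F,
// 1 = mid V/F, 2 = low V/F). theta1 is the absolute eligibility threshold and
// theta2 the relative threshold: counters within theta2 of each other are
// treated as a tie and resolved according to the system running mode.
enum Mode { PerformanceBiased, PowerBiased };

int decide(const unsigned long cnt[3], unsigned long theta1, unsigned long theta2, Mode mode) {
    int best = -1;
    for (int i = 0; i < 3; ++i) {
        if (cnt[i] <= theta1) continue;                 // ineligible: seldom voted for
        if (best < 0) { best = i; continue; }
        unsigned long a = cnt[best], b = cnt[i];
        unsigned long diff = a > b ? a - b : b - a;
        if (diff <= theta2)                             // near-tie: apply the mode bias
            best = (mode == PerformanceBiased) ? (best < i ? best : i)    // prefer higher V/F
                                               : (best > i ? best : i);   // prefer lower V/F
        else if (b > a)
            best = i;                                   // clear majority wins
    }
    return best < 0 ? 2 : best;        // nothing eligible: fall back to low V/F (assumption)
}

int main() {
    unsigned long cnt[3] = {120, 5, 115};               // counters A, B, C as in Figure 2.8
    std::printf("performance-biased -> level %d\n", decide(cnt, 10, 10, PerformanceBiased));
    std::printf("power-biased       -> level %d\n", decide(cnt, 10, 10, PowerBiased));
    return 0;
}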


Figure 2.7: Schematic of the Region DVFS Controller (RDC)


Figure 2.8: Comparison of vote counters. θ1 denotes absolute threshold, and θ2 denotes relative threshold.

During the vote receiving phase, if a counter exceeds ⌈(N × P × W + θ2)/2⌉, then no other V/F level can receive more votes during the same DVFS tuning frame. The DVFS decision making logic records this V/F level as the majority-selected one and disables the vote receiving signal (as shown in Figure 2.5b) to inform each router in the region to stop the vote receiving process. Otherwise, if no counter exceeds ⌈(N × P × W + θ2)/2⌉, the DVFS decision logic searches for the maximum counter value before making the DVFS decision, during which the two thresholds θ1 (absolute threshold) and θ2 (relative threshold) guide the search as described above. As shown in Figure 2.7, both the VCL and the centralized counters can be implemented with limited hardware resources. After synthesis with the Synopsys Design Compiler using 45 nm technology, one RDC consumes 697.7 standard NAND gates, which is about 3% area overhead averaged over each router. The power consumption of the RDC under the 1.7 V/2.0 GHz condition is about 642.1 µW, with 634.2 µW dynamic power and 7.9 µW leakage power.

Algorithm 1 V/F decision making
1: Input: centralized counter array cnt_array
2: Initialization: max_cnt ← 0, max_cnt_recorded ← False
3: Output: updated region V/F level
4: if at vote receiving phase then
5:    if any cnt in cnt_array satisfies cnt ≥ ⌈(N × P × W + θ2)/2⌉ then
6:        max_cnt ← cnt
7:        max_cnt_recorded ← True
8:        Exit vote receiving phase
9: if at DVFS decision making phase then
10:   if max_cnt_recorded ≠ True then
11:       for each cnt in cnt_array do
12:           if cnt > θ1 and max_cnt < cnt − θ2 then
13:               max_cnt ← cnt
14:   win_counter ← the corresponding counter with value max_cnt
15:   Update region V/F to the V/F level represented by win_counter
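To complement Algorithm 1, the early-termination condition of the vote receiving phase can be sketched in C++ as follows; the streaming interface and the example values of N, P, W and θ2 are illustrative (a 5-port router is assumed), while the bound itself follows the formula above.

#include <cstdio>

// Early termination of the vote receiving phase. With N routers per region,
// P input ports per router and a tuning frame of W cycles, at most N*P*W votes
// can arrive in one frame. Once one counter reaches ceil((N*P*W + theta2)/2),
// no other V/F level can collect more votes in the same tuning frame, so the
// RDC records the winner and asserts the vote-receiving disable signal.
struct VoteReceiver {
    unsigned long cnt[3] = {0, 0, 0};   // high / mid / low V/F vote counters
    unsigned long bound;                // ceil((N*P*W + theta2) / 2)
    int earlyWinner = -1;               // -1: not decided during the receiving phase

    VoteReceiver(unsigned long N, unsigned long P, unsigned long W, unsigned long theta2)
        : bound((N * P * W + theta2 + 1) / 2) {}

    // Returns false once the disable signal is asserted; later votes are ignored.
    bool receive(int level) {
        if (earlyWinner >= 0) return false;
        if (++cnt[level] >= bound) earlyWinner = level;
        return earlyWinner < 0;
    }
};

int main() {
    // Example region: N = 4 routers, P = 5 input ports, W = 16384 cycles, theta2 = 10.
    VoteReceiver rx(4, 5, 16384, 10);
    std::printf("early-termination bound = %lu votes\n", rx.bound);
    rx.receive(0);                                     // one high-V/F vote arrives
    std::printf("high-V/F votes so far = %lu\n", rx.cnt[0]);
    return 0;
}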

2.6 Experiments

2.6.1 Methodology

Experimental setup. Our DVFS mechanism is evaluated with well-accepted simulators and benchmarks. We implement and integrate our design with the cycle-accurate full-system simulator GEM5 [14], in which the embedded network model GARNET [2] is enhanced with our thread voting based DVFS mechanism. Further, we exploit McPAT [108] as the power measuring tool, in which the default system architecture is tailored to the configuration in GEM5, and the default technology is set to 45 nm. The details of the simulation platform setup are shown in Table 2.2. In the experiments, we divide the 64 routers into 16 DVFS regions, with each region consisting of 4 adjacent routers organized into a 2 × 2 mesh. To account for the additional power/area cost brought by the thread characterizers and the RDCs, we synthesize our design with the Synopsys Design Compiler (using 45 nm technology) under the three V/F levels in Table 2.2. The synthesis report shows that 64 characterizers and 16 RDCs consume additional power of 13.1 mW at 2.0 GHz, 12.5 mW at 1.75 GHz and 10.8 mW at 1.5 GHz, with 3% (11162.7 standard NAND gates) additional network area.

DVFS tuning window configuration. The length of the DVFS tuning window determines for how long a DVFS configuration is valid. In general, the shorter the tuning window is, the faster the DVFS adjustment becomes.

Table 2.2: Simulation platform configuration

Processor (64 cores): Alpha based 2.0 GHz out-of-order processors. 32-entry instruction queue, 64-entry load queue, 64-entry store queue, 128-entry reorder buffer (ROB).
L1-Cache banks (64): Private, 32 KB per-core, 4-way set associative, 128 B block size, 2-cycle latency, split I/D caches, 32 MSHRs.
L2-Cache banks (64): Chip-wide shared, 1 MB per-bank, 16-way set associative, 128 B block size, 6-cycle latency, 32 MSHRs.
Memory (8 ranks): 4 GB DRAM, 512 MB per-rank, up to 16 outstanding requests for each processor, 8 memory controllers.
NoC (64 nodes): 8×8 mesh network. Each node consists of 1 router, 1 network interface (NI), 1 core, 1 private L1 cache, and 1 shared L2 cache. 4 routers form a DVFS region, in which a region DVFS controller implements the proposed DVFS policy and tunes the V/F level on a per time frame basis. X-Y dimensional routing. Router is 2-stage pipelined, 6 VCs per port, 4 flits per VC, 128-bit flit width. Directory based MOESI cache coherence protocol. One cache block consists of 1 packet, which consists of 8 flits. One coherence control message consists of 1 single-flit packet.
Thread char. (1/NI): 65,536 (2^16) cycles of sampling window, 16,384 (2^14) cycles of step window. Overlapping ratio λ = 4.
DVFS ctrl. (1/region): 16,384 (2^14) cycles of tuning window (the same size as a thread characterization step window). 3 supported V/F levels: 1.5 V/1.5 GHz, 1.6 V/1.75 GHz, 1.7 V/2.0 GHz.

Table 2.3: PARSEC benchmarks and characteristics

Application     Problem size (all simlarge)               Network Util.   Data Sharing
blackscholes    65,536 options                            low             low
bodytrack       4 frames, 4,000 particles                 low             high
canneal         400,000 netlist elements                  high            high
dedup           184 MB data                               low             high
facesim         80,598 particles, 372,126 tetrahedra      high            low
ferret          256 queries, 34,973 images                high            high
fluidanimate    5 frames, 300,000 particles               high            low
freqmine        990,000 click streams                     low             high
streamcluster   16,384 points, 128 point dimension        high            low
swaptions       64 swaps, 20,000 simulations              high            low
vips            2,662×5,500 pixels                        high            low
x264            128 frames, 640×360 pixels                high            high

However, the total system-level power saving may reduce due to frequent charging and discharging of the voltage regulators. In this work, we use a pre-determined static tuning window, whose length equals that of the thread characterization step window (as shown in Table 2.2), for lower complexity.

PARSEC configuration. To scale the PARSEC benchmarks well to 64 cores, we choose large input sets (simlarge) for all programs. Table 2.3 summarizes the problem sizes of the benchmarks. For data validity, we only report results obtained from the parallel execution phase (called the region-of-interest, ROI) in the experiments. Table 2.3 also shows the characteristics of PARSEC in terms of network utilization and data sharing degree among threads, based on the data in [13]. As can be observed, four programs (blackscholes, canneal, ferret and x264) show consistent characteristics (both low or both high) in network utilization and data sharing, while the other eight programs exhibit inconsistent characteristics (one low and the other high).

Comparison study. For comparison purposes, we implement two baseline cases. The first case, which we denote AHV (Always High V/F), does not use a DVFS mechanism and always operates the NoC at the highest V/F level. The second case implements a router Buffer-Load based DVFS mechanism, denoted BLD, in which the NoC allows per-region V/F adaptation by monitoring the average router buffer load in a region. Just like our Thread Voting based DVFS scheme (denoted TVD), BLD uses the same three (high, middle, low) V/F levels. When a gather traffic pattern (such as the case illustrated in Figure 2.1a) creates a hot spot in the network, the routers in the corresponding V/F region adaptively switch to the high V/F mode, increasing frequency to accelerate packet transmission for performance assurance. When a dispersed traffic pattern exists and the routers experience temporarily low buffer loads, the V/F region adaptively decreases the routers' V/F to the low mode for power saving. Referring to the configuration in Table 2.2, in our experiments the maximum buffer capacity of a virtual channel is 16 flits. If the region-wide average buffer load exceeds 12 flits (75% of the maximum buffer capacity), BLD decides that local congestion is high and operates the routers in the region at the high V/F level. If the region-wide average buffer load is below 4 flits (25% of the maximum buffer capacity), BLD decides that the local routers are lightly loaded and operates them at the low V/F level. Otherwise (with the region-wide average buffer load between 4 and 12 flits), routers in the region operate at the middle V/F level.
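The BLD baseline therefore reduces to a simple threshold rule, sketched below in C++ with the 4/12-flit thresholds stated above.

#include <cstdio>

// Router buffer-load based DVFS baseline (BLD): the region-wide average buffer
// load is compared against 75% and 25% of the 16-flit virtual-channel capacity.
enum VfLevel { LowVF, MidVF, HighVF };

VfLevel bldDecision(double avgBufferLoadFlits) {
    if (avgBufferLoadFlits > 12.0) return HighVF;   // > 75%: treat as local congestion
    if (avgBufferLoadFlits < 4.0)  return LowVF;    // < 25%: lightly loaded, save power
    return MidVF;                                   // otherwise keep the middle level
}

int main() {
    std::printf("load 14 flits -> level %d\n", bldDecision(14.0));  // HighVF
    std::printf("load  8 flits -> level %d\n", bldDecision(8.0));   // MidVF
    std::printf("load  2 flits -> level %d\n", bldDecision(2.0));   // LowVF
    return 0;
}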

2.6.2 Experimental results We report experimental results, first with detailed illustration of power over/under provisioning cases in PARSEC benchmarks and then the network and system energy efficacy improvements brought by our technique.

Power over/under provisioning in PARSEC

According to Table 2.3, we analyze the possible power over/under provisioning introduced by BLD for the benchmark programs. For swaptions, the BLD mechanism tends to create the power over provisioning problem because the benchmark's network utilization level is high but the data sharing degree among different threads is low. The same goes for four other programs: facesim, fluidanimate, streamcluster, and vips. For bodytrack, the BLD mechanism tends to create the power under provisioning problem because the benchmark's network utilization level is low but the data sharing degree is high. The same goes for two other programs: dedup and freqmine. In the following, we closely look into swaptions for power over-provisioning and bodytrack for power under-provisioning.


Figure 2.9: Power over provisioning example in swaptions. (a) Network-metric based DVFS tuning results. (b) The proposed DVFS tuning results. (c) Average router buffer load and ratio between packets accessing shared and non-shared data blocks.

Power over-provisioning. Figure 2.9 demonstrates the occurrence of power over provisioning in swaptions, where the results are averaged over four consecutive V/F tuning windows (the same length as a sampling step window for thread characterization) in order to show stabilized V/F tuning effects. We number the different V/F regions according to router coordinates and annotate the V/F regions where power over provisioning happens with red borders. As the figure shows, in the inspected DVFS windows, power over provisioning has occurred in 12 out of the 16 V/F regions. Figure 2.9a shows the effects of BLD, where the passing routers in the annotated V/F regions all boost to nearly the maximum V/F level to accelerate packet transmission. Figure 2.9b shows the effects of our TVD, where routers in the annotated regions operate on average at 85% of the maximum frequency.


Figure 2.10: Power under provisioning example in bodytrack. (a) Network-metric based DVFS tuning results. (b) The proposed DVFS tuning results. (c) Average router buffer load and ratio between packets accessing shared and non-shared data blocks.

Figure 2.9c reveals 1) the average router buffer load and the ratio between packets passing through routers in Region 1 (4 routers with coordinates (1, 1), (1, 2), (2, 1) and (2, 2)) and Region 2 (4 routers with coordinates (3, 1), (3, 2), (4, 1) and (4, 2)) that access shared and non-shared data blocks, and 2) the relative system-level ROI finish time under both the BLD and TVD mechanisms. As the figure shows, all routers in Regions 1 and 2 suffer temporarily high buffer load. However, most of the packets access non-shared data blocks. Thus they are non-critical and their delay can be tolerated by the issuing cores with memory latency tolerance techniques such as read/write buffers [77, 68]. Figure 2.9c also shows that in terms of system-level performance, although the V/F levels in the 12 annotated regions are decreased by 15%, our TVD mechanism still achieves good system performance (only a 4% longer ROI finish time than BLD).

(a) Relative network power consumption results
(b) Relative network packet injection rate results

Figure 2.11: Network-level power consumption and packet injection rate results.

Power under-provisioning. Figure 2.10 shows the occurrence of power under provisioning in benchmark bodytrack, where the results are also obtained over four consecutive DVFS tuning windows. As the figure shows, power under provisioning has happened in 14 out of the 16 V/F regions in the inspected DVFS windows. Similar to Figure 2.9, we annotate the V/F regions where power under-provisioning occurs with blue borders. Figure 2.10a shows the effects of BLD, where the routers in the annotated regions are reduced to an average of 85% of the maximum frequency for power saving. Figure 2.10b shows the effects of our TVD, where routers in the annotated regions are boosted to an average of 95% of the maximum frequency to accelerate packet transmission. Figure 2.10c reveals the average router buffer load and the ratio between packets accessing shared and non-shared data blocks through V/F Regions 11 and 15. As the figure shows, all routers in Regions 11 and 15 experience temporarily low buffer load. However, most of the packets access shared data blocks, and their delay can further affect the progress of other data-sharer threads. Figure 2.10c also shows that in terms of system-level performance, although the V/F levels in the 14 annotated regions are increased by 10%, our TVD

Network power and performance

Before presenting the energy efficacy results, we compare the three schemes, i.e., AHV, BLD and TVD, in terms of network power consumption and network access rate.

Network power consumption. Figure 2.11a shows the network power consumption for the three comparison cases. In each benchmark, the results are normalized with respect to AHV, which consumes the most power due to its constantly high router V/F. In terms of network power consumption, BLD and TVD consume less power than AHV across all the PARSEC benchmarks, showing the power saving gains brought by DVFS mechanisms. However, the power saving of TVD against BLD depends on the network utilization and data-sharing characteristics of the benchmarks.

• In blackscholes, canneal, ferret and x264, the network utilization and the data sharing degree have the same characteristics (both high or both low). In this case, power over/under provisioning seldom occurs, and BLD and TVD achieve similar power saving results.

• In facesim, fluidanimate, streamcluster, swaptions and vips, each benchmark's network utilization level is high but the data sharing degree among different threads is low. In this case, BLD tends to create the power over provisioning problem since it increases the on-chip V/F on high buffer load even if the on-going packets access non-shared data blocks. TVD alleviates this problem by providing a lower V/F level and thus consumes on average 20.0% less power.

• In bodytrack, dedup, and freqmine, each benchmark's network utilization level is low but the data sharing degree among different threads is high. In this case, BLD tends to create the power under provisioning problem since it lowers the on-chip V/F on low buffer load even if the on-going packets access shared data blocks. As the figure shows, at the expense of alleviating this problem, TVD consumes on average 10.9% more network power than BLD.

Network packet injection rate. Figure 2.11b shows the network access rate results for the benchmarks. Since the NoC is self-throttling (a core cannot inject new requests into the network once it fills up all of its miss request buffers), the network packet injection rate directly correlates with the service speed of a core's miss requests and thus with a core's performance. In each benchmark, the results are normalized with respect to AHV, which achieves the fastest network access rate across all benchmarks due to its constantly high V/F level. The figure also illustrates that in benchmarks where power over/under provisioning seldom occurs (blackscholes, canneal, ferret and x264), BLD and TVD achieve similar network access rates.

Figure 2.12: Network energy efficacy for all programs rate. In benchmarks where power over provisioning frequently occurs (facesim, fluidanimate, streamcluster, swaptions and vips), BLD achieves averagely 5.5% higher packet injection rate than TVD. Comparing with the additional (20.0% more) power BLD consumes, the improvement in packet injection rate is limited. In benchmarks where power under provisioning frequently occurs (bodytrack, dedup, and freqmine), TVD gains averagely 21.5% higher packet injection rate than BLD with 10.9% extra power.

Network and system energy efficacy

Network Energy Efficacy (NEE). Figure 2.12 depicts the relative network energy efficacy results measured in MPPJ (Million Packets Per Joule), which reflects the network's energy efficacy in serving packets from cores. As the figure shows, in blackscholes, canneal, ferret and x264, TVD behaves similarly to BLD since the power over/under-provisioning problem is insignificant. However, TVD consistently achieves better NEE results than AHV and BLD in the remaining 8 out of 12 benchmarks, where the average improvement is 19.6% over AHV and 14.6% over BLD. System Energy Efficacy (SEE). Figure 2.13a presents the relative system energy efficacy results in MIPJ (Million Instructions Per Joule), which measures the amount of energy expended by the CMP for executed instructions [64]. As can be seen, since DVFS increases the NoC power efficiency, both BLD and TVD increase the SEE against AHV across all benchmarks, where the average improvement of BLD against AHV is 8.4% and of TVD against AHV is 26.3%. Moreover, TVD avoids power over/under-provisioning and consistently gives better SEE than BLD in all benchmarks. This reveals that both AHV and BLD consume more power than TVD for the same improvements in system-level performance. Figure 2.13b and Figure 2.13c further illustrate the SEE results in two benchmarks, bodytrack and swaptions, over different V/F tuning windows. As the figures show, the SEE curve of TVD lies above those of AHV and BLD.
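Spelled out, the two metrics can be read as follows; this is a sketch of the definitions implied by the metric names and the surrounding text (with NoC energy for MPPJ and whole-chip energy for MIPJ as the assumed denominators), not a formula quoted from the measurement infrastructure:

$$\mathrm{MPPJ} = \frac{\text{packets served}/10^{6}}{E_{\mathrm{NoC}}\,[\mathrm{J}]}, \qquad \mathrm{MIPJ} = \frac{\text{instructions executed}/10^{6}}{E_{\mathrm{CMP}}\,[\mathrm{J}]}.$$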

(a) Relative MIPJ results

(b) MIPJ comparison in bodytrack (c) MIPJ comparison in swaptions

Figure 2.13: System energy efficacy results in million instructions per Joule (MIPJ)

Since BLD tends to cause power under-provisioning in bodytrack and over-provisioning in swaptions, the SEE curve of BLD swings between those of AHV and TVD across different V/F tuning windows. This is not the case for either AHV or TVD: in AHV, the router V/F level remains constant, and in TVD, power over- and under-provisioning are both effectively alleviated.

Summary of energy-efficacy results

Table 2.4 illustrates the likelihood of power over/under-provisioning in all benchmarks according to their network utilization and data-sharing characteristics. In 4 out of the 12 benchmarks where the network utilization and the data-sharing degree among different threads have the same characteristics (both are low or both are high), there is little opportunity for TVD to outperform BLD because neither over- nor under-provisioning of power often occurs. Thus, we refer to these benchmarks as non-opportunistic benchmarks. However, the remaining 8 benchmarks are opportunistic, where network utilization and data-sharing degree have different characteristics (one low and the other high). For these programs, TVD consistently delivers better energy efficacy in terms of MPPJ and MIPJ. As summarized in the table, the average improvement of TVD over BLD across all benchmarks is 9.7% in terms of MPPJ and 17.1% in terms of MIPJ. As expected, for those opportunistic benchmarks where the power over- or under-provisioning problem frequently occurs, the average improvement of TVD against BLD increases to 14.6% in terms of MPPJ and 25.0% in terms of MIPJ, and the maximum improvement of TVD against BLD is 21% in terms of MPPJ and 35% in terms of MIPJ.

Table 2.4: Summary of energy-efficacy results (per-benchmark network utilization and data-sharing characteristics, occurrence of power over-/under-provisioning, and MPPJ/MIPJ improvements of TVD over BLD for the PARSEC and SPEC OMP2012 programs)

Chapter 3

NoC HW/SW co-design for performance enhancement1

3.1 Introduction

For multi-threaded shared-variable applications, entering and executing a critical section that contains shared data needs to be synchronized and must be mutually exclusive, meaning that only one thread can get access to a critical section at a time. As shown in previous studies [160, 86, 87, 119, 162, 45, 164, 100], the time spent in accessing/synchronizing shared data among different threads is usually the most significant source of serialization in parallel applications. As such, reducing the time a thread spends in accessing a critical section is of critical importance for increasing the performance of multi-threaded programs. Modern state-of-the-art operating systems such as Linux 4.2 and Unix BSD 4.4 provide an advanced solution to lock critical sections, the queue spinlock. This solution combines the advantages of both spinlock and queuing lock. It works as follows: if a thread cannot lock the critical section on the first try, instead of going into a sleep mode immediately, the queue spinlock will spin for a few more times. Only if the critical section has still not been obtained after a certain number of retries is the thread's critical section request sent to a request queue, and the thread is put into the sleep mode until being woken up and securing the lock when the critical section is unlocked by the holding thread. The queue spinlock is an architecture-oblivious solution for critical section access. In general it improves locking performance over spinlock or queuing lock alone. However, when running multi-threaded PARSEC and SPEC OMP2012 programs on NoC-based CMPs, we observe that it incurs considerable competition overhead

1This section is based on the material from the paper: Yuan Yao and Zhonghai Lu, “Oppor- tunistic Competition Overhead Reduction for Expediting Critical Section in NoC based CMPs”, in Proceedings of the 43rd International Symposium on Computer Architecture (ISCA), Seoul, Jul. 2016.


(COH) for threads to gain access to critical sections, contributing significantly to the sequential execution time of applications. Compared with the critical section execution itself, which takes little time, it is the competition overhead that dominates critical section performance in CMPs. This motivates us to develop an effective technique to reduce the competition overhead for multi-threaded applications running on NoC-based multi-/many-cores. Different from previous works, our technique focuses on reducing the blocking time spent by threads competing with each other to enter the critical section, instead of accelerating the critical section execution itself. To achieve this goal, we propose a software-hardware cooperative mechanism that can opportunistically maximize the chance that a thread enters the critical section in the low-overhead spinning phase. At the OS software level, we modify the queue spinlock primitives to monitor the Remaining Times of Retry (RTR) in a thread's low-overhead spinning phase. At the hardware level, we integrate the RTR information into the critical section locking requests of threads, and the routers prioritize locking requests from different threads according to their RTR values. The principle is that the smaller RTR a thread has, the higher priority its locking requests get, because the thread is about to enter the sleep mode sooner. However, when a thread has already entered the sleep mode, it should get the critical section at a later time, since waking the thread up will introduce considerable overhead and it is thus preferable to allow threads in the low-overhead spinning phase to win the critical section access. The rest of the chapter is organized as follows. Related work is discussed in Section 3.2. Section 3.3 precisely defines the competition overhead and quantitatively illustrates its criticality in comparison with the critical section execution itself. Section 3.4 analyzes the competition overhead in queue spinlock and describes our design principle. Section 3.5 gives design details from both OS software and hardware perspectives. In Section 3.6 we report experiments and results.

3.2 Related work

Previously proposed techniques for expediting parallel application execution emphasize accelerating serialized critical section execution. This involves finding the serialized part of parallel applications and then accelerating its execution by exploiting various mechanisms, such as running serialized code on the fat cores of an asymmetric CMP ([160, 86, 87, 119, 162]), or by predicting and prioritizing threads that are executing serialized sections [45, 164, 100].

3.2.1 Identifying and analyzing critical threads

Amdahl's law is exploited with the notion of critical sections in [51]. An analytic model is proposed to show that a critical section, which requires mutually exclusive execution, leads to serialization and puts a fundamental limit on parallel programs' performance. In [24], contention probability and hot critical sections are used to evaluate the impact of critical sections on multi-threaded applications. Work in [48] proposes the speedup stack, which quantifies the impact of various scaling delimiters on multi-threaded applications, such as the LLC, memory subsystem interference, workload imbalance and cache coherency. In [42], a metric for assessing thread criticality is proposed, which considers both how much time a thread is performing useful work and how many co-running threads are waiting. Further, the proposed metric is exploited to create criticality stacks that break the total execution time of a thread into criticality-related components, allowing critical threads to be accelerated using voltage/frequency scaling to improve performance.

3.2.2 Accelerating critical section execution

In [160], the authors propose a mechanism to accelerate critical section execution, in which selected critical sections are executed on the large core of an asymmetric CMP. Further, a mechanism called "Selective Acceleration of Critical Sections" is proposed to reduce false serialization in workloads: if different critical sections frequently appear at the same time, they are not sent to the large core, to prevent false serialization. [86] proposes a cooperative software-hardware mechanism to identify and accelerate the most critical bottlenecks, which can be critical sections, barriers, or software pipelines, in multi-threaded programs executing on CMPs. The rationale is to measure the number of cycles spent by threads waiting for each bottleneck; the one that keeps CPUs waiting for the longest time is the critical bottleneck. In [87], utility-based acceleration of multi-threaded applications is proposed, which relies on a new utility-of-acceleration metric that can both identify and accelerate the most critical code segments of multi-threaded applications. The mechanism is based on the observation that two types of code segments can become performance limiters: 1) threads that take longer to execute than other threads because of load imbalance, and 2) code segments, like contended critical sections, that make other threads wait. A memory-interference induced application slowdown estimation (MISE) model is introduced in [158], which can be utilized to achieve both QoS guarantees and system fairness enhancement in CMPs. It is shown in that work that the performance of a memory-intensive (memory-bound) application is roughly proportional to its memory request service rate: the faster its memory requests are served, the faster the application's progress is. An application experiences little memory-level interference when its memory requests get the highest priority. This observation is used to calculate an important factor called the application's alone-request-service-rate in the MISE model. In [116], a dual voltage supply technique is exploited to speed up threads that hold locks. In contrast to the above studies [160, 86, 87, 119, 162, 45, 164, 100], which focus on accelerating critical section execution, our technique concentrates on reducing the competition overhead in multi-/many-cores when the queue spinlock in modern operating systems is used for locking critical sections in multi-threaded programs.

3.3 Background and motivation

3.3.1 Critical section and competition overhead

Figure 3.1: Execution of a parallel program with mutex synchronization illustrating thread block time (BT), critical section execution time (CS) and competition overhead (COH)

Figure 3.1 shows the typical execution of a multi-threaded program. After an initial serial part, N threads start parallel execution and encounter a synchronization point where each thread competes to get access to a critical section. Without loss of generality, we assume that the threads enter the critical section in the order θ1, θ2, ..., θN. As the figure shows, the blocking time BT_i of thread i (i ∈ [1, N]) comprises the critical section completion times of the earlier-entering threads plus the COH, which is caused by either the thread spin interval or the thread wake-up time (see detailed illustrations in Section 3.4.2). For example, the blocking time BT_2 of thread θ2 can be modeled as BT_2 = BT_1 + CS_1 + COH_2, where BT_1 = 0 by our assumption. Recursively applying the model, one can get the blocking time BT_i of thread i as:

$$BT_i = BT_{i-1} + CS_{i-1} + COH_i = \sum_{j=1}^{i-1} CS_j + \sum_{j=1}^{i} COH_j \qquad (3.1)$$

Equation 3.1 shows that the performance of a parallel program is not only limited by the critical section execution time ($\sum CS$), but also fundamentally by the competition overhead ($\sum COH$) when threads compete to get access to the critical section.
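A small worked instance of Equation 3.1 may help; the cycle counts below are invented purely for illustration (taking COH_1 = 0 for the first thread, consistent with the assumption BT_1 = 0) and are not measured values. With CS_1 = CS_2 = 100 cycles, COH_2 = 300 and COH_3 = 500 cycles, the blocking time of the third thread is

$$BT_3 = \sum_{j=1}^{2} CS_j + \sum_{j=1}^{3} COH_j = (100 + 100) + (0 + 300 + 500) = 900 \ \text{cycles},$$

so the accumulated COH terms dominate the blocking time even though each critical section itself is short, which is exactly the effect quantified in Section 3.3.3.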

3.3.2 Queue spinlock in operating systems

The OS provides primitive functions to support critical section synchronization for concurrent programs. Traditionally, queuing lock and spinlock are the two methods.

Figure 3.2: Criticality of competition overhead

In a queuing lock, when a thread tries to lock a variable protecting a critical section and does not succeed, it will store its locking request into a lock queue and go to the sleep mode, allowing another thread to run on the same core. The thread will continue to sleep until being woken up when the critical section is unlocked by the holding thread. Although the queuing lock can increase core utilization, putting a thread to sleep and waking it up are both expensive operations, which incur considerable context-switching overhead. In a spinlock, when a thread tries to lock a variable protecting a critical section and does not succeed, it will continuously retry locking it, usually after a short spin interval, until it succeeds. During the OS runtime quantum (usually on the order of hundreds of milliseconds) of the spinning thread, it will not allow another thread to use the core, leading to a waste of core processing time. However, with persistent polling on the lock variable, the spinlock can succeed quickly once the lock is released, and this is achieved with a small polling-interval overhead and without large context-switching overhead. Due to their pros and cons, neither a programmer nor an OS kernel can know in advance whether the queuing lock or the spinlock will perform better on a particular hardware architecture. Most modern OSes such as Linux 4.2 and Unix BSD 4.4 adopt an advanced queue spinlock scheme, which combines the advantages of both methods. The queue spinlock behaves as a spinlock at first and then as a queuing lock. If a thread cannot lock the critical section, it does not go to the sleep mode immediately, hoping that the critical section might get unlocked soon. Instead, the queue spinlock first behaves exactly like a spinlock. Only if the critical section has still not been obtained after a certain number of retries is the thread's request sent to a lock queue and the thread put into the sleep mode.
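To make the spin-then-sleep behavior concrete, the following is a minimal user-space sketch of the idea, assuming a simple one-word lock and a bounded spin count; it is an illustration only, not the Linux queue spinlock implementation (which additionally maintains a queue of waiters). The names demo_lock_t, demo_lock, demo_unlock and MAX_SPIN_COUNT are made up for the example; only the futex(2) system call and FUTEX_WAIT/FUTEX_WAKE are real kernel interfaces.

```c
/* Minimal spin-then-sleep lock sketch (illustration only, not the kernel's
 * queue spinlock): spin a bounded number of times, then block via futex. */
#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

#define MAX_SPIN_COUNT 128            /* bounded spinning phase */

typedef struct { atomic_int state; } demo_lock_t;   /* 0 = free, 1 = held */

static long futex(atomic_int *addr, int op, int val)
{
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

static void demo_lock(demo_lock_t *l)
{
    /* Spinning phase: retry the atomic lock attempt a bounded number of
     * times (a real implementation would insert a cpu_relax()-style pause). */
    for (int i = 0; i < MAX_SPIN_COUNT; i++) {
        int expected = 0;
        if (atomic_compare_exchange_strong(&l->state, &expected, 1))
            return;                        /* lock acquired while spinning */
    }
    /* Sleeping phase: block in the kernel until the holder wakes us up. */
    for (;;) {
        int expected = 0;
        if (atomic_compare_exchange_strong(&l->state, &expected, 1))
            return;
        futex(&l->state, FUTEX_WAIT, 1);   /* sleep while the lock is held */
    }
}

static void demo_unlock(demo_lock_t *l)
{
    atomic_store(&l->state, 0);            /* release the lock */
    futex(&l->state, FUTEX_WAKE, 1);       /* wake one sleeping waiter */
}
```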

3.3.3 Criticality of competition overhead

To examine the performance of the queue spinlock used in parallel programs running on a NoC-based multicore, we experiment with a 64-core full-system simulation architecture in GEM5 (Section 3.6 describes our experimental setup), with each application running alone in 64 concurrent threads. Figure 3.2 shows the percentage of program ROI (Region of Interest) finish time spent in critical sections (CS) and competition overhead (COH) in two PARSEC benchmarks, bodytrack (memory non-intensive) and canneal (memory intensive), and two SPEC OMP2012 benchmarks, imag (memory non-intensive) and ilbdc (memory intensive). As the figure shows, in the two memory non-intensive benchmarks bodytrack and imag, the critical section consumes about 3.8% and 1.1% of total application ROI finish time, respectively. However, the time spent in competition overhead is nearly 35.1% and 27.4% of total application ROI finish time, respectively. The same phenomenon is consistently observed in the memory-intensive benchmarks canneal and ilbdc: while the execution of the critical section itself takes about 9.3% and 13.3%, the competition overhead consumes 38.6% and 39.7% of application ROI finish time, respectively. The results highlight that the competition overhead can be much larger than the critical section execution time and thus become the bottleneck of serialized execution.

Figure 3.3: A 64-core 8-by-8 CMP architecture

3.4 Opportunistic COH reduction: principle and effect

3.4.1 Target CMP architecture

We first introduce the target CMP architecture used throughout all experiments. Figure 3.3 shows a typical architecture for a 64-core CMP, where the NoC interconnects processor nodes (a CPU core with its private L1 cache), secondary on-chip shared L2 cache banks, and on-chip memory controllers. Routers are organized in a popular mesh topology on which the XY routing algorithm is implemented for simplicity and deadlock freedom. There are eight memory controllers, which are connected to the middle four nodes on the top and bottom rows for architectural symmetry. To support cache coherency, a directory-based MOESI cache coherence protocol is implemented. On the target CMP, once a core issues a memory request, the private L1 is first checked for the presence of the requested data. If this fails, depending on the location of the data block, the request is then forwarded to the local shared L2 or, via the NoC, to a remote shared L2/DRAM.

Figure 3.4: Operation of the queue spinlock under cache coherence, where in Figure 3.4a thread θ2 is in the low-overhead spinning phase, and in Figure 3.4b in the high-overhead sleeping phase.

3.4.2 Competition overhead in queue spinlock

The queue spinlock incurs different competition overhead in its spinning and sleeping phases. To illustrate the difference, Figure 3.4 shows its operation under cache coherence, in which we assume that thread θ2's core is a sharer of the lock variable at time T1. Figure 3.4a shows the case where thread θ2 is in the spinning phase. At time T1, assume that two threads θ1 and θ2 compete to lock the same critical section and the lock

variable is stored in the home node. At time T2, the request from θ1 reaches the home node before that of thread θ2, and the critical section is thus granted to thread θ1. An invalidation is then sent from the home node to thread θ2 to notify it that its locking request has failed. Because thread θ2 is in the spinning phase, it retries to lock the critical section at times T5 and T7. At time T5, the locking retry from θ2 fails because θ1 is still inside the critical section. However, at time T7, the locking retry from θ2 succeeds since the critical section has been released by thread θ1. Figure 3.4b shows the case where thread θ2 is in the sleeping phase. At time T3, when thread θ2 receives the invalidation from the home node, instead of continuously polling on the lock variable, θ2 directly goes to the sleep mode. When the critical section is released at time T7, the home node wakes up thread θ2, which then enters the critical section after a successful locking request. Comparing Figure 3.4a with Figure 3.4b, we can observe that, due to the intensive retries with short intervals and the large wakeup overhead, the competition overhead in the spinning phase can be much lower than that in the sleeping phase.

3.4.3 The basic principle

When locking requests are sent to the location where the lock variable is stored, the arrival order of the atomic locking requests directly affects the order in which threads enter the critical section. Since the locking requests experience contention for physical/virtual channels in the network, it is important to expedite those locking requests that are critical to the competition overhead. Thus it is desirable that the NoC delivers more critical requests faster. The basic principle of our technique contains two aspects. First, when a thread is in the low-overhead spinning phase, we maximize the opportunity that the thread gets the critical section. Second, when a thread has already entered the sleeping phase, we minimize the opportunity that the thread enters the critical section. The consideration is that an already slept thread should get the critical section at a later time, since waking it up will introduce considerable overhead, and thus granting the access right to another competing thread in its spinning phase is preferred. To realize this scheme, our technique involves both software-level and hardware-level mechanisms. At the software level, we check the remaining times of retry (RTR) in a thread's spinning phase. This is done by modifying the implementation of the queue spinlock function of the OS. At the hardware level, we integrate the RTR information into the network packets of threads' locking requests, and let the NoC prioritize these requests from different threads based on the RTR they carry. Because RTR reflects how soon a thread must enter the sleep mode, the smaller the RTR a request packet carries, the higher the priority it gets. Finally, the wakeup requests sent from the home node to sleeping threads are assigned the lowest priority.


Figure 3.5: Locking success scenarios with different competition overheads. (a) Granting a thread with lower RTR earlier leads to a fast scenario; (b) Granting a sleeping thread later leads to a fast scenario.

3.4.4 Effect on competition overhead reduction

Figure 3.5 shows the effects of our opportunistic competition overhead reduction technique. Assume that each critical section takes two time units to finish, each thread spins three times in the spinning phase (with a retry interval of 1 time unit), and both the sleep-preparation and wake-up procedures of a thread consume 4 time units each.

As Figure 3.5a shows, three threads θ1, θ2 and θ3 compete to lock a critical section. After one time unit in the spinning phase, θ1 gets the lock and enters the critical section, which lasts for 2 time units. At time 4, thread θ1 releases the critical section, and θ2 and θ3 compete to get the lock. As shown in the figure, because θ2 has retried once in the spinning phase, its RTR value is 2 (3-1=2) at time 4. Similarly, because thread θ3 has spent two time units in the spinning phase, its RTR value is 1 (3-2=1). At time 4, two scenarios can happen. In the slow scenario, as shown in Figure 3.5a(1), thread θ2 gets the critical section before thread θ3, which forces θ3 to enter the sleeping phase. However, θ3 does not actually sleep for any time, since the critical section is released at time 6 while θ3 is still performing the preparation procedure for putting the thread into the sleep mode. Thus, θ3 wakes up right away at time 9 because the critical section has already been released. Hence, most of the application run time is wasted on the sleep-preparation and wake-up procedures. In the fast scenario, as shown in Figure 3.5a(2), at time 4 the critical section is granted to thread θ3 instead of thread θ2, based on the fact that θ3 has a smaller RTR value than θ2. Thus, no thread enters the sleep mode, and the competition overhead is reduced. Figure 3.5b shows that our technique can also be beneficial by having an already slept thread get the critical section at a later time. In the figure, four threads θ1, θ2, θ3 and θ4 compete to get access to a critical section, and θ1 first gets the access right. In both cases, we suppose that θ3 has been in the sleep mode before time 1, and that θ2 will get the critical section when θ1 releases the lock at time 4. In our mechanism, since θ3 has already entered the sleep mode when θ2 releases the critical section, we give higher priority to the locking request from θ4 at time 6, which achieves the fast scenario in Figure 3.5b(2). Otherwise, prioritizing θ3 over θ4 when thread θ2 releases the critical section will cause thread θ4 to go to the sleeping phase, incurring additional, larger overhead, as shown in the slow scenario in Figure 3.5b(1). In essence, our technique aims to maximize the likelihood that the fast scenarios in Figure 3.5a and Figure 3.5b happen, so as to effectively shrink the competition overhead.

3.5 Design details

3.5.1 Software support in OS

To support our scheme, we need to pass the RTR value from the application to the hardware. This is done by modifying the OS's queue spinlock implementation. In our case, we realize it in a state-of-the-art Linux OS, version 4.2. Enhanced queue spinlock implementation. Algorithms 2 and 3 illustrate the principles of the queue spinlock's lock (denoted q_spinlock_lock) and unlock (denoted q_spinlock_unlock) primitives in Linux 4.2, together with our enhancements highlighted in bold text. Algorithm 2 shows that the q_spinlock_lock function consists of a spinning phase and a sleeping phase.

Algorithm 2 Enhanced q_spinlock_lock primitive
1: function q_spinlock_lock(shared_lock *lock) {
2:   int c = 0;
3:   /* Spinning phase begins */
4:   for (i = 0; i < MAX_SPIN_COUNT; i++) do { /* Spinning phase */
5:     int RTR = MAX_SPIN_COUNT - i;
6:     write_local_reg(RTR, get_thread_PCB()->PROG);
7:     c = atomic_try_lock(lock); /* Atomic locking */
8:     if (!c) return 0; /* Atomic locking succeeds, return */
9:     cpu_relax(); /* Otherwise, delay and retry the locking */
10:  }
11:  /* Sleeping phase begins: call Linux kernel function sys_futex to send the locking request to the lock queue, and put the thread into the sleep mode */
12:  sys_futex(lock, FUTEX_WAIT);
13: }

Algorithm 3 Enhanced q_spinlock_unlock primitive
1: function q_spinlock_unlock(shared_lock *lock) {
2:   atomic_release(lock);
3:   get_thread_PCB()->PROG++;
4:   write_local_reg(get_thread_PCB()->PROG);
5:   /* Call Linux kernel function sys_futex to send a wakeup request to the home node and wake up a thread in the sleep mode */
6:   sys_futex(lock, FUTEX_WAKE);
7: }

In the spinning phase, the queue spinlock spins for a period of time that is determined by the value MAX_SPIN_COUNT. During the spinning phase, the queue spinlock tries to get exclusive access to the shared lock variable through an atomic_try_lock function, which can be implemented by a set of test_and_set-like primitives in the hardware. If a thread fails to get the critical section within MAX_SPIN_COUNT retries, the thread enters the sleeping phase after registering its locking request to the lock queue via the Linux system call sys_futex. As shown in Line 5 of Algorithm 2, we utilize the RTR value to monitor the remaining times of retry in a queue spinlock's spinning phase, which can be dynamically computed by subtracting the already passed times of retry from the total retry times. The RTR value, together with the thread's progress information (PROG, which is used to avoid starvation for low-priority requests and their associated threads, as will be explained later), is then written into special local registers (Line 6 of Algorithm 2) in the CPU that the thread runs on. Further, the CPU informs the network interface (NI) to assign the RTR value and the progress information to the network packets of atomic locking requests before they are sent to the NoC.
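As a concrete reading of the atomic_try_lock step above, the following is a minimal sketch of how such a try-lock could be built on a test-and-set style primitive. The shared_lock layout and the function bodies are illustrative assumptions; the text only states that atomic_try_lock can be implemented with test_and_set-like hardware primitives, and the real kernel lock word is more elaborate.

```c
/* Illustrative try-lock built on a C11 test-and-set primitive; returns 0 on
 * success (lock acquired) and 1 on failure, matching the
 * "c = atomic_try_lock(lock); if (!c) return 0;" convention in Algorithm 2. */
#include <stdatomic.h>

typedef struct { atomic_flag held; } shared_lock;

static inline int atomic_try_lock(shared_lock *lock)
{
    /* test_and_set returns the previous value: true means already held. */
    return atomic_flag_test_and_set_explicit(&lock->held,
                                             memory_order_acquire) ? 1 : 0;
}

static inline void atomic_release(shared_lock *lock)
{
    atomic_flag_clear_explicit(&lock->held, memory_order_release);
}
```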

1: function pthread_mutex_lock(shared_lock *lock) {
2:   ···
3:   q_spinlock_lock(lock);
4:   ···
5: }

1: function pthread_mutex_unlock(shared_lock *lock) {
2:   ···
3:   q_spinlock_unlock(lock);
4:   ···
5: }

Figure 3.6: Modified pthread_mutex_lock/unlock functions.

In the sleeping phase, the sys_futex(lock, FUTEX_WAIT) system call provides a method for a thread to register its locking request into the lock queue in the home node where the lock variable is located. Then the thread goes into the sleep mode and waits for wakeup until the lock variable is released. Once a thread finishes the critical section, the q_spinlock_unlock function is called to release the lock variable and wake up one of the sleeping threads in two steps. First, an atomic_release function is called to release the shared lock variable. Then the sys_futex(lock, FUTEX_WAKE) function is called to send a WAKE_UP request to the request queue in the home node. In our mechanism, we give the lowest priority to the WAKE_UP request to implement the "Wakeup Request Last" rule in Table 3.1, Section 3.5.2. Starvation avoidance. Although prioritizing packets with small RTR can increase the opportunity of a thread entering the critical section in the low-overhead spinning phase, such prioritization may cause starvation for threads with higher RTR values. To avoid starvation, our mechanism tracks the progress of threads, and packets belonging to slow-progress threads are prioritized over packets from fast-progress ones. This ensures that locking requests from low-progress threads always get served first in the NoC routers. As shown in Algorithm 2, our queue spinlock tracks the progress of a thread and records its value in a local register in the core, which is further integrated into the locking request packet header for routers to do priority-based scheduling. To this end, we add an additional field, denoted PROG, to the Linux thread handler (the Process Control Block, PCB) to dynamically track the progress of a thread. There are plenty of ways to determine per-thread progress information. For our purpose, we use the progress value to record the number of times a thread has successfully entered critical sections. During thread initialization, the progress value PROG of all threads is set to 0. Whenever a thread has successfully finished the work inside a critical section, PROG is increased by 1 when the critical section is released (as shown in Line 3 of Algorithm 3). Compatibility with existing mutex synchronization. It is worth noting that the new implementation of the queue spinlock in the OS remains the underlying technique for implementing user-space synchronization primitives such as the pthread mutex [118] or OpenMP critical directives [23]. Figure 3.6 shows that the proposed queue spinlock can be used to implement the pthread_mutex_lock and pthread_mutex_unlock functions in the pthread library. More importantly, there is no need to modify the source code of programs that are already written with the pthreads or OpenMP libraries to use our technique. Therefore our enhanced queue spinlock is transparent to application programs.
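To illustrate the transparency claim, here is an ordinary pthread-based snippet of the kind that needs no source change to benefit from the enhanced queue spinlock; the worker function and counter are hypothetical, and the comments merely restate the call path shown in Figure 3.6.

```c
/* Unmodified application code: the RTR/PROG plumbing lives inside the OS
 * queue spinlock, so a plain pthread mutex user is unaffected at source level. */
#include <pthread.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static long shared_counter;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000; i++) {
        pthread_mutex_lock(&m);    /* internally reaches q_spinlock_lock */
        shared_counter++;          /* critical section */
        pthread_mutex_unlock(&m);  /* internally reaches q_spinlock_unlock */
    }
    return NULL;
}
```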

Table 3.1: Priority rules of the NoC routers

Rules in priority-based NoC routers:
1: Slow Progress First. Packets from a slower-progress thread are prioritized over packets from a faster-progress thread.
2: Locking Request Packet First. Among threads with the same progress, both locking request packets from the spinning phase and wakeup requests from the sleeping phase are prioritized over normal data and cache coherence packets.
3: Least RTR First. Among threads with the same progress, locking request packets with smaller RTR values are prioritized over locking request packets with larger RTR values.
4: Wakeup Request Last. Locking requests from the spinning phase are prioritized over wakeup requests from the sleeping phase.
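A compact way to read the four rules is as an ordering test between two packet headers. The sketch below is a software rendering of that ordering under assumed header fields (is_lock_or_wakeup, is_wakeup, prog, rtr); the actual router applies the same ordering with one-hot priority bits and no comparators, as described in Section 3.5.2.

```c
/* Software model of the Table 3.1 ordering between two packets a and b;
 * returns true if a should be scheduled before b.  Field layout is an
 * illustrative assumption, not the hardware packet format. */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool    is_lock_or_wakeup;  /* priority check bit set? */
    bool    is_wakeup;          /* wakeup request from the sleeping phase? */
    uint8_t prog;               /* thread progress (smaller = slower) */
    uint8_t rtr;                /* remaining times of retry */
} pkt_hdr_t;

static bool schedule_before(const pkt_hdr_t *a, const pkt_hdr_t *b)
{
    /* Rule 1: slow progress first. */
    if (a->prog != b->prog)
        return a->prog < b->prog;
    /* Rule 2: lock/wakeup requests before normal data/coherence packets. */
    if (a->is_lock_or_wakeup != b->is_lock_or_wakeup)
        return a->is_lock_or_wakeup;
    /* Rule 4: wakeup requests last among lock/wakeup traffic. */
    if (a->is_lock_or_wakeup && (a->is_wakeup != b->is_wakeup))
        return !a->is_wakeup;
    /* Rule 3: among spinning-phase locking requests, least RTR first. */
    if (a->is_lock_or_wakeup && !a->is_wakeup)
        return a->rtr < b->rtr;
    return false;  /* otherwise keep the existing (e.g. FIFO) order */
}
```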

3.5.2 Hardware support in NoC

Router micro-architecture. In hardware, after obtaining the RTR and progress information from the OS, routers in the NoC prioritize locking request packets with smaller RTR over locking request packets with larger RTR, to maximize the opportunity that a thread wins the critical section access right during the spinning phase. To achieve this, we develop a router with priority-based virtual channel (VC) and switch allocation, as shown in Figure 3.7. Our baseline router is a 2-stage pipelined speculative router [134], in which the first stage performs Route Computation (RC), VC Allocation (VA), and Switch Allocation (SA) in parallel and the second stage is Switch Traversal (ST). Further, Table 3.1 summarizes the prioritization rules that are applied to both VC and switch allocation for packet scheduling. Additional head fields in locking & wakeup request packets. To integrate the RTR value as well as the thread progress information into the request packet, we add three additional fields, the priority check bit, the priority bits, and the progress bits, to the request packet header. As shown in Figure 3.8, before a locking request packet is sent into the network, the network interface (NI) integrates these additional fields into the request packet header. The priority check bit differentiates a lock/wakeup request packet from normal data and coherence packets. In our scheme, we first prioritize packets from slow-progress threads over fast-progress ones to avoid starvation. Then, among threads with the same progress, we prioritize both lock and wakeup request packets over normal data and cache coherence packets. Further, we prioritize locking packets with small RTR values over packets with larger RTR values, and we set the priority of wakeup requests in the sleeping phase to the lowest level to delay critical section grants to sleeping threads. The packet prioritization rules are summarized in Table 3.1.

Figure 3.7: Priority-based router architecture


Figure 3.8: Additional packet fields and an illustrative comparison of our priority-based scheduling and round-robin scheduling.

As shown in Figure 3.8, the priority bits reflect the priority level of a packet based on the RTR value. To minimize the priority comparison overhead in routers, we use one-hot coding to represent the priority of a packet, as in [30], with one priority level mapped to one bit in the priority bits and the lowest priority level of the wakeup request packet mapped to one additional bit. However, such one-priority-level-per-bit mapping may introduce considerable hardware overhead. In a state-of-the-art Linux 4.2 environment, the maximum number of spins of a queue spinlock is 128, thus 129 bits (128 RTR priority bits + 1 additional bit) would be required for the priority bits in one-hot coding.

2Based on the Linux 4.2 kernel from www.linux.com

To achieve a low-overhead design, instead of mapping one RTR value to one priority bit (thus K bits for K RTR values), we evenly divide the maximum spin time-span into 8 segments, with each segment containing 16 spin times. Further, we map each such spin-time segment to one priority bit, thus only 9 bits (8 RTR priority bits + 1 additional bit) are needed to cover all the priority levels. During packetization, only when the RTR value shifts to another segment is the priority of the outgoing locking request packets updated. The same principle is applied to assign the progress bits. As shown in [30], by adopting the one-hot priority mapping mechanism, there is no need to use comparators in routers to achieve priority-based arbitration, ensuring limited hardware overhead. Wakeup request packets always have the lowest priority, indicated by the additional bit. Priority-based packet scheduling. Figure 3.8 shows the effects of the prioritization rules in Table 3.1 on the packet scheduling order in routers. For narration convenience, we use R_j^i to denote a locking request and W^i a wake-up request from a thread, where i denotes the progress value (PROG) of the issuing thread and j the RTR value carried by the packet header. Assume that threads θ1, θ2 and θ3 with progress value a send three locking requests with RTR values of 1, 2, and 3 to the home node of the lock variable, respectively. Threads η1, η2 and η3 with progress value b also send three locking requests with RTR values of 1, 2, and 3 to the lock variable's home node. The locations of the issuing threads and the home node, and the routing path of the request packets, are shown in the example NoC in Figure 3.8. Further assume that a < b, which implies that threads η1, η2 and η3 have progressed faster than threads θ1, θ2 and θ3. Suppose that the previous critical section has been held by thread θp, which has finished the critical section access and advances to progress b. After releasing the critical section, thread θp sends a wake-up request (denoted W^b) to the lock queue in the home node to wake up one of the sleeping threads. Figure 3.8 illustrates the arrival and departure order of the above requests at the highlighted router R. In comparison with two possible round-robin scheduling results (round-robin being the default scheduling policy in the baseline router) starting from the east VC and the west VC, Figure 3.8 gives an example of our priority-based packet scheduling policy, where request packets with smaller RTR are prioritized over request packets with larger RTR, and the wake-up request of a thread gets the lowest priority. Fairness of the priority-based scheduling scheme. As shown in Figure 3.8, although our priority-based packet scheduling scheme favors packets from slow-progress threads, this priority does not dominate the packet service in the router, because within the same VC packets are served in FIFO order. As the figure shows, the three packets (in green, with superscript "a") from the slow-progress threads (θ1, θ2, θ3) indeed do not block the three packets (in blue, with superscript "b") from the fast-progress threads (η1, η2, η3). Since the priority-based packet scheduling scheme is built on top of FIFO fairness, our mechanism achieves a balanced treatment of priority and fairness. Design and overhead of low-cost LPA. To achieve low overhead in the priority-based VC and switch allocation, we adopt a 2-stage arbitration approach, as sketched in Figure 3.7 and detailed in Figure 3.9.
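The segment-based encoding above can be sketched in a few lines of code. The function below is an illustrative software model under assumed constants (MAX_SPIN_COUNT = 128, 16-retry segments, the wakeup bit at position 0, RTR segments at bits 1-8); the actual NI performs the equivalent mapping in hardware, and the exact bit layout is an assumption rather than the thesis's packet format.

```c
/* Software model of the segment-based one-hot priority encoding: a smaller
 * RTR maps to a higher segment and thus a higher-priority one-hot bit, while
 * wakeup requests always receive the dedicated lowest-priority bit. */
#include <stdint.h>
#include <stdbool.h>

#define MAX_SPIN_COUNT 128
#define SEGMENT_SIZE   16            /* 128 retries / 8 segments */
#define WAKEUP_BIT     (1u << 0)     /* lowest priority level */

static uint16_t encode_priority(bool is_wakeup, unsigned rtr /* 1..128 */)
{
    if (is_wakeup)
        return WAKEUP_BIT;
    unsigned segment = (MAX_SPIN_COUNT - rtr) / SEGMENT_SIZE;   /* 0..7 */
    return (uint16_t)(1u << (segment + 1));                     /* bits 1..8 */
}
```

With this mapping, the encoded priority changes only when RTR crosses a 16-retry segment boundary, which matches the statement that the outgoing packet priority is updated only when the RTR value shifts to another segment.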


Figure 3.9: Hardware implementation of the local priority arbitration (LPA) logic, where C denotes the priority check bit

In the first stage, a Local Priority Arbiter (LPA) logic is added to each input channel to select the highest-priority packets in the local VCs. As shown in Figure 3.9, the per-packet priority check bit acts as the enable signal for checking the priority bits in the arbitration process. Only when a packet has a priority check bit of 1 (which indicates a locking or wake-up request packet) are its priority bits compared. As the figure shows, we exploit one-hot coding to implement the different priorities. Each LPA generates two outputs to the second stage, where the local highest-priority packets are further compared with each other to win the global VC/switch allocation. The first output is the highest priority level in the local VCs, and the second output is an index of the packet(s) with that highest priority. We exemplify the LPA operation with Figure 3.9. For narration convenience, we assume three priority levels for locking & wake-up requests, and thus three priority bits in each request packet. Assume that each input channel contains m VCs (here m = 3). Three packets with priority bits "100" (highest priority), "100" (highest priority) and "010" (middle priority) arrive at the VCs of input channel 1. The first output of the LPA is thus "100", indicating that the highest priority from input channel 1 is "100". Further, the index of the three packets in input channel 1 is output as the bit combination "110", indicating the exact locations of the two packets with the highest priority. The outputs of the LPA are then input to the second stage for global VA and SA. After the allocation, the packet index is used to select the winning packet. If there is more than one packet with the highest priority, a random selection is made.

Table 3.2: Simulation platform configuration

Processor (64 cores): Alpha-based 2.0 GHz out-of-order cores; 32-entry instruction queue, 64-entry load queue, 64-entry store queue, 128-entry reorder buffer (ROB).
L1 cache (64 banks): private, 32 KB per core, 4-way set associative, 128 B block size, 2-cycle latency, split I/D caches, 32 MSHRs.
L2 cache (64 banks): chip-wide shared, 1 MB per bank, 16-way set associative, 128 B block size, 6-cycle latency, 32 MSHRs.
Memory (8 ranks): 4 GB DRAM, 512 MB per rank, up to 16 outstanding requests per processor, 8 memory controllers.
NoC (64 nodes): 8×8 mesh network; each node consists of 1 router, 1 network interface (NI), 1 core, 1 private L1 cache, and 1 shared L2 cache; X-Y dimensional routing; 2-stage pipelined router, 6 VCs per port, 4 flits per VC, 128-bit data path; directory-based MOESI cache coherence protocol; one cache block consists of 1 packet of 8 flits; one coherence control message consists of 1 single-flit packet.
Priority: 128 retry times in the spinning phase; 9 priority levels, 8 higher levels for requests in the spinning phase with each priority level mapped to 16 retry times, and 1 lowest priority level for wakeup requests.
Benchmarks: PARSEC and SPEC OMP2012. To scale the benchmarks well to 64 cores, we choose large input sets for all programs in PARSEC. For data validity, we only report results obtained from the multi-threaded execution phase, called the Region-of-Interest (ROI).

3.6 Experiments

3.6.1 Methodology

We evaluate our Opportunistic Competition Overhead Reduction (OCOR) technique with timing-detailed full-system simulations using both PARSEC (11 programs) [13] and SPEC OMP2012 (14 programs) [121] benchmarks. We implement and integrate our design in GEM5 [14], on which the OS kernel (Linux 4.2) and the embedded network GARNET [2] are enhanced according to our technique. The details of the simulation platform setup are shown in Table 3.2. As the table shows, our CMP uses Alpha-based out-of-order cores, which support the Alpha-ev67 ISA. At the software level, we enhance the default queue spinlock (mutex) implementation of Linux 4.2, as illustrated in Algorithms 2 and 3 in Section 3.5.1. For this modification to be visible to user space, we further revise the corresponding critical section locking/unlocking functions in the pthread and OpenMP libraries, which are implemented on top of Linux's mutex. This involves integrating the enhanced queue spinlock of Section 3.5.1 into the pthread_mutex_lock and pthread_mutex_unlock functions in the pthread library and the #pragma omp critical directive in the OpenMP library. At the hardware level, we modify the default VA and SA (class VCallocator and SWallocator) of the GARNET VC router in GEM5 to support our priority-based arbitration. Unless otherwise specified, each PARSEC and OMP2012 program runs on 64 cores with one core running one thread. Moreover, eight priority levels are used for locking requests, and one level for wakeup requests. We compare results obtained without OCOR (Original) and with OCOR. The platform without OCOR is the baseline, which uses the original Linux queue spinlock without any modification in all programs.

Figure 3.10: Comparison of the execution profile of benchmark bodytrack without and with our technique: (a) bodytrack execution without OCOR; (b) bodytrack execution with OCOR. In the figure, High/Low-COH denotes thread blocking time featuring high/low competition overhead.

3.6.2 Experimental results

Competition overhead reduction

Figure 3.10 illustrates the execution profile of the PARSEC benchmark bodytrack. Due to space limitations and for clarity, we report the first 3000 cycles of execution time of the first 16 threads in Figure 3.10. The figure divides the thread execution time into three parts: 1) the parallel execution part, where threads perform concurrent computation tasks, 2) the thread blocking time, which includes the competition overhead in the locking process (see Equation 3.1), and 3) the critical section execution, where threads execute the critical section code. Based on the figure, we can make the following observations. First, each thread spends very little time in the critical section. This is due to the fact that parallel programs keep the critical section code small in order to shorten the serial execution of the program. In the above threads, the time spent in critical sections is about 5.2% of total application execution time. Second, in terms of the competition overhead each thread spends waiting to enter the critical section, the time is much longer than the execution time of the critical section itself. As Figure 3.10a shows, without OCOR, thread blocking time featuring high competition overhead consumes 33.2% of the total application execution time. However, with OCOR, the competition overhead among threads is significantly reduced, which decreases the thread blocking time to 18.7%, accelerating the execution of the bodytrack segment code by about 14.5% (from 3000 to 2570 cycles).

Figure 3.11: Improvements in COH and the fraction of threads entering the critical section in the low-overhead spinning phase: (a) competition overhead improvement with OCOR; (b) fraction of critical section accesses in the spinning phase.

Figure 3.11a shows the COH reduction across all 25 benchmarks, in which we sort the percentage of improvement from largest to smallest. After applying the proposed technique, competition overhead is consistently reduced across all benchmarks. As shown in the figure, benchmark botss achieves the maximum COH improvement (61.8%) and imag the minimum (12.5%). The average COH improvement is 40.4% for PARSEC and 39.3% for OMP2012 programs. Figure 3.11b further illustrates that the proposed technique increases the chance of a thread getting the critical section in the low-overhead spinning phase. For consistency, the benchmark order in Figure 3.11b remains the same as in Figure 3.11a. The maximum improvement is again achieved by benchmark botss, where without OCOR only 25.7% of critical sections are obtained in the low-overhead spinning phase. With OCOR, the percentage increases to 79.2%, an improvement of 53.5%.

Benchmark imag achieves the minimum improvement, which is 9.5%. Overall, the average improvement across all benchmarks is 33.1%.

Figure 3.12: Characteristics of all benchmarks, in which we normalize the maximum value in each case to 100%: (a) relative critical section access rate; (b) relative network utilization.

Application-dependent improvement

The improvements in competition overhead and in the chance that a thread gets the critical section in the low-overhead spinning phase depend heavily on a benchmark's characteristics. Figure 3.12 shows the characteristics of the PARSEC and SPEC OMP2012 benchmarks in terms of critical section access rate and network utilization. We keep the order of the benchmarks in both Figure 3.12a and Figure 3.12b the same as in Figure 3.11a. Figure 3.12a shows the normalized critical section access rate, which is measured as the average critical section locking packet injection rate and reflects how often a thread needs to access a critical section. As the figure shows, benchmark ilbdc has the maximum critical section access rate and imag the minimum. For comparison convenience, we normalize the maximum value, obtained from ilbdc, to 100%. Figure 3.12b shows the normalized network utilization of all benchmarks, which is measured as the average network packet injection rate and reflects an application's network communication demand. As the figure shows, benchmark botss achieves the maximum network utilization and kdtree the minimum. For comparison convenience, we normalize the value obtained from botss to 100%. From Figure 3.12, we can observe that the higher the critical section access rate and the network utilization of a benchmark, the higher the competition overhead reduction achieved by our technique. This is in accordance with the intuition that applications with a higher network packet injection rate and a higher critical section access rate tend to load the NoC more heavily, creating more severe competition among threads and increasing the occurrence likelihood of the slow scenarios in Figure 3.5. Therefore they offer greater opportunity for our technique to achieve larger improvement. For example, in benchmarks botss, ilbdc and botsa, both the network injection rate and the critical section access rate are high, and our technique achieves 61.8%, 61.5%, and 58.2% competition overhead reduction, respectively. In benchmark imag, although both the network utilization and the critical section access rate are low, our technique still achieves a 12.6% reduction in competition overhead. To analyze the effect of critical section access rate and network utilization on the COH improvement, we divide the benchmarks in Figure 3.11a and Figure 3.12 into 3 groups. In Group 1, the critical section access rates across different benchmarks remain quite stable, but the network utilization decreases; the COH improvement thus decreases accordingly, as shown in Figure 3.11a. In Group 2, both the critical section access rate and the network utilization stay relatively stable, and the improvements in COH show small variations. The most interesting cases occur in Group 3, where both the critical section access rate and the network utilization fluctuate considerably. As Figure 3.12 shows, benchmarks (for example bodytrack and kdtree) that have a higher critical section access rate and lower network utilization achieve more COH improvement than benchmarks (freqmine and applu) that have a lower critical section access rate and higher network utilization. This shows that the critical section access rate exerts a relatively heavier impact on the COH reduction. This makes sense because only when a thread issues more critical section locking requests into the network does the network utilization start to show its effect on COH reduction. Otherwise, even if the network is overloaded, there is little competition among threads to access the critical section, and thus little room for COH improvement.

ROI finish time reduction

We first illustrate the influence of OCOR on critical section execution through Figure 3.13, in which we report the percentage of ROI execution time spent in critical sections across different benchmarks. As can be observed in the figure, instead of reducing the execution of the critical section itself, our OCOR aims to reduce the competition overhead during which threads compete to enter critical sections. Thus, compared with the original design, applying OCOR negligibly affects the critical section execution time across all benchmark programs. Figure 3.14a depicts the improvements in the percentage of COH in ROI finish time, and Figure 3.14b reports the improvements in ROI finish time across all benchmarks. As shown in Figure 3.14a, the fraction of COH in the ROI finish time differs across benchmarks.

Figure 3.13: Relative critical section execution time

(a) Percentage of COH in ROI finish time. (b) Comparison in ROI finish time.

Figure 3.14: Improvement in COH leading to reduced ROI finish time.

Benchmark kdtree in SPEC OMP2012 has the highest competition overhead fraction, 39.8% of the program's ROI finish time, while benchmark ferret in PARSEC has the lowest, 26.1%. The fraction of competition overhead in the ROI execution directly determines the effect of our technique on the overall ROI finish time: the higher the fraction, the larger the improvement. As Figure 3.14b shows, benchmark ilbdc achieves the maximum improvement in ROI finish time, about 24.5% (a 61.5% COH reduction, where COH accounts for 39.7% of the ROI execution time), and benchmark imag achieves the least improvement, about 3.4% (a 12.5% COH reduction, where COH accounts for 27.4% of the ROI execution time). Overall, the average ROI finish time improvement is 13.7% for the PARSEC and 15.1% for the SPEC OMP2012 benchmarks.
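As a quick sanity check (our own restatement of the numbers above, not an additional result), the ROI finish time improvement is approximately the product of the COH fraction and the relative COH reduction:

\[
\Delta_{\mathrm{ROI}} \approx f_{\mathrm{COH}} \times r_{\mathrm{COH}},
\qquad 0.397 \times 0.615 \approx 0.244 \ \text{(ilbdc)},
\qquad 0.274 \times 0.125 \approx 0.034 \ \text{(imag)},
\]

where \(f_{\mathrm{COH}}\) is the fraction of ROI finish time spent in COH and \(r_{\mathrm{COH}}\) is the relative COH reduction.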


Figure 3.15: Comparison of the percentage of time spent in COH for all benchmarks running with 4, 16, 32 and 64 threads.

Scalability evaluation

To evaluate scalability, Figure 3.15 shows the results of our technique applied to 4, 16, 32 and 64 threads running on 4, 16, 32 and 64 cores, respectively. In all cases, we normalize the competition overhead without OCOR to 100%. As shown in the figure, our mechanism consistently reduces the competition overhead across all benchmarks. Further, the more threads spawned, the larger the COH reduction. In the 4-thread case, the improvements of the proposed mechanism are limited across all benchmarks. This is because with few threads, competition among threads is low, and thus little time is spent in competition overhead. As the number of threads grows, competition among threads increases, and our technique achieves larger improvement. As shown in Figure 3.15, the scalability of the improvement is also affected by the characteristics of the benchmarks. For example, in a benchmark with low network utilization and low critical section access rate, such as imag, the improvements with 4, 16, 32 and 64 threads are 1.7%, 4.9%, 9.2% and 12.5%, respectively. In contrast, in a benchmark with high network utilization and high critical section access rate, such as ilbdc, the corresponding improvements are 5.0%, 18.3%, 43.5% and 61.5%.

Sensitivity to the number of priority levels


Figure 3.16: COH improvements with different priority levels for two extreme cases

The number of priority levels for locking request packets is a design parameter that determines the prioritization granularity of our technique. In general, more priority levels enable finer scheduling accuracy and increase performance; however, they also add hardware overhead. A proper number of priority levels should therefore be chosen. To investigate the impact of different priority levels on the COH improvement, we look into the two extremes: program botss, which shows the best improvement, and program imag, which shows the minimum improvement. Figure 3.16 shows their COH improvements as the number of priority levels grows. As can be observed, increasing the number of priority levels increases the improvement, but the gain per added level shrinks until the benefit saturates; the program that benefits more from our technique also has the larger improvement potential. This figure also justifies the selection of eight priority levels as our default architecture configuration.
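To make the prioritization granularity concrete, the sketch below shows one way the remaining times of retry could be quantized into priority levels (a minimal C illustration written by us, not the thesis hardware; the function name and the 128-retry spinning budget are assumptions based on the configuration used elsewhere in this thesis):

#include <stdint.h>

#define MAX_SPIN_RETRIES 128   /* assumed spinning budget before the sleep phase */
#define PRIORITY_LEVELS    8   /* the default number of priority levels          */

/* Map the remaining times of retry (RTR) of a spinning thread to a packet
 * priority: the fewer retries left, the higher (numerically larger) the
 * priority, so the request is delivered more quickly. Illustrative only. */
static inline uint8_t rtr_to_priority(uint32_t rtr)
{
    const uint32_t retries_per_level = MAX_SPIN_RETRIES / PRIORITY_LEVELS; /* 16 */
    if (rtr >= MAX_SPIN_RETRIES)
        rtr = MAX_SPIN_RETRIES - 1;
    /* rtr in [0,15] -> level 7 (highest); rtr in [112,127] -> level 0. */
    return (uint8_t)(PRIORITY_LEVELS - 1 - rtr / retries_per_level);
}

With more levels the quantization step shrinks, which is consistent with the diminishing returns seen in Figure 3.16.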

Summary of main results

We summarize the improvements in COH and ROI finish time for all benchmarks in the 64-thread case in Table 3.3, where we also list their critical section access rate (CS Rate) and network utilization characteristics. The results are ordered by ROI finish time improvement from top to bottom, with the lowest improvement in the top row. As shown in the table, for the PARSEC benchmarks the average COH reduction reaches 40.4% and the average improvement in ROI finish time is 13.7%; the maximum COH reduction reaches 50.5% and the maximum improvement in ROI finish time is 19.5%, both for canneal. For SPEC OMP2012, the average COH reduction reaches 39.3% and the average improvement in ROI finish time is 15.1%; the maximum COH reduction reaches 61.8% for botss and the maximum improvement in ROI finish time is 24.5% for ilbdc. Across all 25 benchmarks, the average COH reduction reaches 39.9% and the average ROI finish time improvement is 14.4%.

Table 3.3: Result summary for the 64-thread case

PARSEC      CS Rate   Net. Util.   COH Impro.   ROI Impro.
ferret      low       low          35.5%        9.3%
vips        high      low          36.7%        9.7%
fluid       low       low          32.0%        10.5%
body        high      low          33.0%        11.7%
freq        low       high         31.6%        13.1%
stream      high      high         43.1%        13.9%
x264        high      high         44.8%        14.9%
swap        high      low          40.3%        15.7%
face        high      high         46.9%        16.0%
dedup       high      high         50.3%        16.9%
can         high      high         50.5%        19.5%
PARSEC average                     40.4%        13.7%

OMP2012     CS Rate   Net. Util.   COH Impro.   ROI Impro.
imag        low       low          12.5%        3.4%
bt331       low       low          20.8%        8.0%
applu       low       high         26.8%        9.9%
smith       low       low          29.4%        11.5%
fma3d       high      low          36.5%        12.5%
bwaves      high      low          33.9%        13.2%
kdtree      high      low          33.4%        13.3%
md          high      low          35.6%        14.1%
nab         high      low          39.7%        14.9%
swim        high      low          43.8%        17.1%
mgrid       high      high         55.7%        21.8%
botsa       high      high         58.1%        23.2%
botss       high      high         61.8%        24.4%
ilbdc       high      high         61.5%        24.5%
OMP2012 average                    39.3%        15.1%

Overall average improvement: 39.9% in COH, 14.4% in ROI finish time.

Chapter 4

In-network packet generation for performance enhancement

4.1 Introduction

As the core count grows quickly, there is greater potential to speed up the execution of parallel and concurrent programs. While this trend helps to linearly expedite the concurrent execution part, the ultimate application speedup is, however, limited by the sequential execution part of programs, as reflected in the well-known Amdahl's law [79].


Figure 4.1: Parallel program execution.

Figure 4.1 shows the typical execution of a multi-threaded program. After an initialization phase, N threads start parallel execution and encounter a synchronization point where each thread competes to access a critical section (CS). This leads to serialized CS execution, and the serialized CS access incurs competition overhead (COH) due to locking retries in spin-lock primitives [139, 4, 63, 59] and context switching & sleep overhead in the queue spin-lock primitive (see details in Section 4.3).

1 This chapter is based on material from the paper: Yuan Yao and Zhonghai Lu, "iNPG: Accelerating Critical Section Access with In-Network Packet Generation for NoC Based Many-Cores", in Proceedings of the 24th IEEE International Symposium on High Performance Computer Architecture (HPCA 2018), Vienna, Feb. 2018.

To expedite parallel application execution, most previous works emphasize accelerating the CS parts shown in Figure 4.1 by exploiting various mechanisms such as running the CS code on fat cores in asymmetric CMPs [159, 160, 87, 119, 162, 161], or predicting and prioritizing threads that are executing critical sections [45, 164, 100]. Recent research in [169] demonstrates that COH may indeed exceed the time for CS execution in NoC-based many-cores and become the dominant factor limiting the performance of parallel applications. By opportunistically avoiding the OS-level context switching & sleep overhead, the OCOR (Opportunistic Competition Overhead Reduction) technique proposed in [169] shows large COH reductions (over 30% on average for the PARSEC and SPEC OMP2012 benchmarks) and application acceleration of up to 24.5%. This work is, however, only applicable to the queue spin-lock and has no effect on the various popular spin-lock primitives. Furthermore, it does not consider cache coherence for lock spinning and thus neglects the influence of cache coherence on COH in cache-coherent many-cores.

In this chapter, we show that the lock coherence overhead (LCO) of lock spinning, both in spin-lock primitives and in the spinning phase of the queue spin-lock primitive, is a major source of COH. This is because, in lock spinning, each competing thread generates an exclusive access request to the home node where the lock variable is stored. However, only one of these exclusive accesses will succeed, and the home node will then invalidate the CS lock copies in all the other threads' L1 caches. Only when the winning thread, which is the new owner of the lock variable, has received all acknowledgements from the losing threads can it proceed to execute the following critical section. To reduce this high coherence round-trip latency, we propose an in-network packet generation (iNPG) technique that performs early cache invalidation of locking-failure threads, thereby reducing LCO and accelerating the execution of multi-threaded applications. iNPG is enabled by active "big" routers, which not only transmit packets like normal routers but also generate packets in the network as early invalidations to the L1 caches of failing threads. The big routers then wait for the acknowledgements from the losing threads and forward them toward the winning thread, which then sends a valid lock copy to each of the losing threads. In this way, the home node is largely freed from the lock coherence maintenance burden, and the winning thread can move forward to the CS more quickly.

Focusing on lock spinning, iNPG is a scheme orthogonal to the OCOR technique proposed for the queue spin-lock in [169]. As such, it can be combined seamlessly with OCOR to achieve the highest COH reduction for both the various spin-lock primitives and the queue spin-lock primitive, as shown in the experimental results.

The rest of the chapter is organized as follows. We discuss related work in Section 4.2. Section 4.3 gives background and motivation. Section 4.4 illustrates lock spinning in a cache-coherent architecture and describes our design principle. Section 4.5 gives design and synthesis details of the big router. In Section 4.6 we report experiments and results.

4.2 Related work

4.2.1 Analysis and identification of critical sections and threads

Eyerman and Eeckhout analyzed Amdahl's law with the notion of critical sections in [51]. They developed an analytical model to show that mutually exclusive critical section execution imposes a fundamental limit on parallel programs' performance. In [24], Chen and Stenström proposed a critical lock analysis method for diagnosing critical section bottlenecks in multi-threaded applications. This method can reliably identify the critical sections that are critical for performance and quantify their impact on overall performance. In [48], based on an analysis of why multi-threaded workloads typically show sublinear speedup on multi-core chips, the concept of the speedup stack was developed to quantify the impact of various scaling delimiters such as LLC and memory subsystem interference, lock spinning & yielding, workload imbalance, cache coherency, etc. Due to synchronization behavior, threads can be critical for program performance. In [42], the concept of thread criticality was developed and a thread criticality metric was defined that takes into account both a thread's execution time and the number of concurrently waiting threads. Further, criticality stacks were designed to visually break down the total execution time into criticality-related components, facilitating detailed analysis of parallel imbalance.

4.2.2 Combining synchronization networks

There have been proposals for multi-stage interconnection networks that combine concurrent accesses to the same memory location [61, 53, 140, 135], multi-stage networks that have special synchronization variables embedded in each stage of the network [28], and special-purpose caches that maintain a queue of processors waiting for the same lock [59, 101]. In [46], Eisley et al. propose to implement cache coherence protocols within the network via an MSI directory-based protocol, which allows in-transit optimization of read and write delay. To reduce hot-spot contention for synchronization variables, Yew, Tzeng, and Lawrie [170] proposed a data structure known as a software combining tree. Like hardware combining in a multi-stage interconnection network [61], a software combining tree serves to collect multiple references to the same shared variable into a single reference whose effect is the same as the combined effect of the individual references. Recently, [74] presented flat combining, a new synchronization paradigm based on the idea of having a single thread holding a lock perform the combined access requests of all others. [75] applies the flat-combining mechanism to a synchronous queue algorithm, in which a single "combiner" thread acquires a global lock and services other threads' requests with very low synchronization overhead.

To accelerate multi-threaded applications, both iNPG and combining synchronization networks exploit the opportunity that synchronization packets often pass through the same router. However, iNPG is fundamentally different from the concept of a combining network such as [61, 53, 140, 135, 28]. In a combining network, the combining switches are able to recognize that two messages are directed to the same memory location, and in such cases they can combine the two messages into a single one. In contrast, iNPG seeks to invalidate shared cache copies of lock variables in an early stage of the cache coherence protocol by generating new packets rather than combining the underlying synchronization operations. More importantly, a combining network is invoked when two or more competing requests collide in-flight in the router: they both try to arbitrate in the same cycle, with one winning and the others failing. As investigated in [53], even in a large-scale system such as the 512-node NYU Ultracomputer, the occurrence of such a scenario is still very low. In our iNPG, however, we do not require that two locking requests arbitrate in the same cycle at the same router. Instead, when the first locking request travels through a big router, a temporary "barrier" is set up to stop subsequent locking requests.

4.2.3 In-network techniques for collective communication

In the context of providing efficient 1-to-M and M-to-1 communication in NoCs, in-network techniques for packet forking and aggregation were developed. In [99], Krishna et al. proposed Flow Across Network Over Uncongested Trees (FANOUT) and Flow AggregatioN In-Network (FANIN) to realize efficient 1-to-M forking and M-to-1 aggregation, respectively. On-chip routers were customized to support FANOUT and FANIN. At most routers along the flow path, packets incur only single-cycle delays, thus approaching an ideal network with only wire delay/energy. By leveraging clockless repeated wires on the datapath, FANOUT and FANIN were extended to SMART-FANOUT and SMART-FANIN [98] to enable forking and reduction, respectively, across multiple routers in a network dimension in a single cycle, thus providing a scalable collective communication scheme for NoCs. The above in-network techniques aim to optimize collective communication via optimized routers, which perform efficient packet forking (like copy & paste) and merging but never create new packets. In contrast, our big routers generate new packets in the network, aiming to reduce COH.

4.3 Background and motivation

4.3.1 Different lock spinning mechanisms in many-core systems

In lock spinning, when a thread fails to lock a variable protecting a CS, it keeps retrying (after a short spin interval) until it succeeds. During the OS runtime quantum (usually on the order of hundreds of milliseconds) of the spinning thread, no other threads are allowed to use the core, leading to a waste of core processing time and energy. However, by persistently polling on a lock variable, lock spinning can succeed quickly once the lock is released. Modern OSes provide different lock spinning mechanisms, including spin-locks and the queue spin-lock, to support critical section synchronization for concurrent programs.

1) Test-and-set lock (TAS). The test-and-set lock is a spin-lock that employs a polling loop to access a global boolean flag indicating whether the critical section is held. Each processor repeatedly executes an atomic test_and_set instruction in an attempt to change the flag from false to true, thereby acquiring the lock. The shortcoming of the test-and-set lock is the contention for the lock, as each waiting processor accesses the single shared lock as frequently as possible.

2) The ticket lock (TTL). As proposed in [139], the ticket lock is a spin-lock that consists of two counters: a request counter containing the number of processors that have requested the lock, and a release counter recording the number of processors that have released the lock. A processor competes for the lock by fetching a ticket from the request counter and waiting until its ticket is equal to the release counter. A processor releases the lock by incrementing the release counter (a minimal C sketch of this lock is given after this list).

3) Array-based queuing lock (ABQL). By having all competing cores spin on different lock variables, the array-based queuing lock is a spin-lock that ensures that on a lock release only one core attempts to acquire the lock. In this way, ABQL significantly decreases lock contention when a lock is released. ABQL has two versions, proposed respectively by Anderson [4] and by Graunke and Thakkar [63]; Anderson's version further protects the ABQL with an atomic test_and_set operation.

4) Mellor-Crummey and Scott (MCS) lock. Inspired by [59], Mellor-Crummey and Scott [114] propose a per-core structured spin-lock that eliminates much of the cache-line bouncing experienced by other spin-locks. In the MCS lock, when a core attempts to secure a global lock variable, it creates an mcs_spin_lock structure of its own. Using an atomic exchange operation (compare_and_swap), it stores the address of its own mcs_spin_lock structure into the global lock's "next pointer" field. The atomic exchange then returns the previous value of the next pointer. If that pointer was null, the core has acquired the lock; otherwise, the core is put into a queue waiting for its predecessor to release the lock.

5) Queue spin-lock (QSL). QSL is a variant of spin-lock adopted in most modern OSes such as Linux 4.2 (https://kernelnewbies.org/Linux_4.2) and Unix BSD 4.4 (http://www.unixdownload.net/). It starts with a spin-lock phase to secure a CS; if not successful after a certain number of retries (by default 128 in Linux 4.2), it yields and enters a sleep phase after a context switch (releasing the core for other tasks or for a power-saving mode), and the thread's locking request is placed in a request queue. The thread sleeps until it is woken up to secure the lock when the CS is unlocked by the holding thread. The spin-lock phase of a queue spin-lock can be any of the four spin-lock alternatives (TAS, TTL, ABQL, or MCS) discussed above. Because of its advantages in reducing both synchronization traffic and OS overhead, we select the queue spin-lock with the spin-lock phase implemented as the MCS lock.
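As an illustration of how the ticket lock in 2) can look in code, here is a minimal C11 sketch using <stdatomic.h> (our own simplified example, not the implementation used in the experiments; the type and function names are assumptions):

#include <stdatomic.h>

/* Ticket lock: the request counter hands out tickets, the release counter
 * says whose turn it is, so waiters are served in FIFO order. */
typedef struct {
    atomic_uint next_ticket;   /* request counter */
    atomic_uint now_serving;   /* release counter */
} ticket_lock_t;

static void ticket_lock(ticket_lock_t *l)
{
    /* Fetch a ticket, then spin until it is our turn. */
    unsigned my = atomic_fetch_add_explicit(&l->next_ticket, 1,
                                            memory_order_relaxed);
    while (atomic_load_explicit(&l->now_serving, memory_order_acquire) != my)
        ;  /* spin (a pause/backoff could be added here) */
}

static void ticket_unlock(ticket_lock_t *l)
{
    /* Increment the release counter to pass the lock to the next ticket. */
    atomic_fetch_add_explicit(&l->now_serving, 1, memory_order_release);
}

Every waiter polls the same now_serving counter, which is exactly the shared-variable traffic that the ABQL and MCS variants above are designed to reduce.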

4.3.2 Motivation

To show the criticality of lock coherence overhead (LCO) in different locking primitives, we experimented on a 64-core architecture in Gem5 (see the experimental setup in Section 4.6), with each program running alone in 64 concurrent threads. Figure 4.2 reports the percentage of LCO in application running time under TAS, TTL, ABQL, MCS and QSL for three applications: facesim and fluidanimate from PARSEC and kdtree from SPEC OMP2012.


Figure 4.2: Percentage of LCO in ROI finish time for kdtree, fluidanimate and facesim under TAS, TTL, ABQL, MCS and QSL.

From Figure 4.2 we can observe that 1) the percentage of LCO in application running time differs across benchmarks, and 2) the TAS lock has the highest percentage of LCO, followed by the TTL and ABQL locks, while the LCO percentages of the MCS and QSL locks are the lowest among all locking primitives. As shown in the figure, in benchmark kdtree, TAS spends nearly 50% of the application running time synchronizing the critical section lock among competing threads. With TTL and ABQL, the percentage of LCO is reduced to 31% and 27%, respectively. In MCS and QSL, because locking contention among threads is further minimized, the percentage of LCO drops to 14% and 17% of application running time, respectively. The same trend is consistently observed in fluidanimate and facesim, where LCO occupies 65% and 90% of application running time under TAS, 47% and 57% under TTL, 50% and 56% under ABQL, 20% and 30% under MCS, and 25% and 32% under QSL. From this experiment we can conclude that LCO lies directly on the application's critical execution path, causing heavy overhead for lock spinning.

4.4 In-network packet generation

4.4.1 Target CMP architecture

Figure 4.3 shows a typical architecture for a 64-core 8×8 CMP, where the NoC (routers and network interfaces) communicates messages among processor nodes (a core with its private L1 cache), secondary on-chip shared L2 cache banks, and on-chip memory controllers. Routers are organized in a popular mesh topology on which XY dimension-order routing is implemented for simplicity and deadlock freedom. Eight memory controllers are symmetrically connected to the middle nodes of the top and bottom rows.


Figure 4.3: A 64-core 8-by-8 CMP architecture.

Due to its layout and packaging convenience, this pattern has been used in academic and industrial CMP designs such as the MIT SCORPIO 36-core processor [35] and the latest Intel Knights Landing 36-tile processor [154]. To support cache coherency, a directory-based MOESI cache coherence protocol is implemented. On the target CMP, once a core issues a memory request, the private L1 is first checked for the presence of the requested data. If this fails, depending on the location of the data block, the request is forwarded to the local shared L2 or, via the NoC, to a remote shared L2 bank or DRAM. Figure 4.3 also depicts a big router deployment with 32 big routers interleaving with 32 normal routers.
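For completeness, the XY dimension-order routing used on this mesh can be sketched as follows (a minimal C illustration under assumed port names; it is not the router RTL):

/* Output-port selection for XY dimension-order routing on a 2D mesh:
 * route along X (columns) first, then along Y (rows). */
typedef enum { PORT_LOCAL, PORT_EAST, PORT_WEST, PORT_NORTH, PORT_SOUTH } port_t;

static port_t xy_route(int cur_x, int cur_y, int dst_x, int dst_y)
{
    if (dst_x > cur_x) return PORT_EAST;
    if (dst_x < cur_x) return PORT_WEST;
    if (dst_y > cur_y) return PORT_NORTH;
    if (dst_y < cur_y) return PORT_SOUTH;
    return PORT_LOCAL;   /* the packet has reached its destination node */
}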

4.4.2 Lock spinning under cache coherency

We now look into how lock spinning works and how its associated LCO arises on NoC-based many-cores with a cache coherence protocol.

Algorithm 4 shows example assembly code [77] for a simple test-and-set spin-lock implementation on many-cores. In the code, value 0 denotes a lock that is "available" and value 1 one that is "occupied". In Line 1, a thread first spins to check a local copy of the lock variable (whose memory address is stored in local register R1, while the lock itself is stored remotely at a home node) until it sees that the lock is available. Once a thread finds that the lock is free (by loading 0 into register R2), it passes Line 2. In Line 3, the thread modifies the local lock copy from 0 to 1 to assert its ownership of the lock (but the lock variable at the home node still remains 0).


Figure 4.4: Spin-lock execution on an example 3×3 CMP under MOESI. In Step 1, the home node responds to read misses from θ1, θ2 and θ3 and sends a valid lock copy to the local cache of each thread. In Step 2, each thread individually modifies the lock to the "occupied" status (value 1) and the threads compete with each other for exclusive access to the lock. In Step 3, the winning thread gets exclusive access to the lock, with all losing threads' lock copies invalidated. GetX requests from θ2 and θ3 are further forwarded by the directory controller of the home node to θ1. In Step 4, θ1 sends valid lock copies to θ2 and θ3 and then enters the critical section after collecting all acknowledgements from θ2 and θ3.

Algorithm 4 Assembly code for lock spinning
1: Lock: LD   R2, 0(R1)   ; Load lock value into R2
2:       BNEZ R2, Lock    ; Lock unavailable, spin
3:       ADD  R2, R0, #1  ; Change lock value to 1
4:       SWAP R2, 0(R1)   ; Swap R2 with the lock at the home node
5:       BNEZ R2, Lock    ; Loop to Line 1 if the old lock value was 1

Then, in Line 4, threads compete with each other to secure the lock by sending an atomic swap request (SWAP) to the home node, which exchanges the locally modified lock value (1) with the one at the home node (0). The winning SWAP request sets the lock variable at the home node to 1 to assert its exclusive access right to the lock. Consequently, all losing SWAP requests will only see the value 1 that was set by the winning request. After the swap operation, the winning thread passes Line 5 and executes the critical section, while all losing threads spin for the lock again from Line 1.

Figure 4.4 shows how Algorithm 4 is executed by three threads θ1, θ2 and θ3 on an example 3×3 CMP with the MOESI cache coherence protocol. The procedure operates in the following four steps, based on the behavior of the MOESI coherence protocol introduced in [155], which is also the default MOESI protocol used in Gem5. The lock variable in the home node is initialized to the "available" state (value 0).

Step 1: When θ1, θ2 and θ3 execute Line 1 of Algorithm 4, none of them has a copy of the lock in its local cache, so they all encounter a cold-cache read miss. Each thread then sends a data read request to the home node, which in turn sends a valid copy of the lock (value 0) back to the corresponding L1 cache of each thread. At the same time, the home node marks θ1, θ2 and θ3 as data sharers of the lock variable in its coherence directory.

Step 2: When the competing threads load 0 into their locally copied CS locks, they pass Line 2. In Line 3, each thread locally modifies the lock to the occupied state by changing its value to 1. Then, the three threads compete with each other by executing the swap operation (SWAP in Line 4). At the hardware level, each thread sends an atomic "read-for-modification" operation (a GetX request in Gem5), which tries to obtain the exclusive access right to the lock variable in the home node. The winning thread then executes the swap and writes 1 into the lock variable.

Step 3: In Figure 4.4, assume that the GetX from θ1 succeeds and those from θ2 and θ3 fail. In this case, θ1 wins the exclusive access right to the lock. The directory controller at the home node then performs three tasks. First, it notifies θ1 that it now has the exclusive access right to the lock. Second, it searches its coherence directory and sends invalidations to the lock copies in θ2's and θ3's caches; upon receiving the invalidation, θ2 and θ3 each send an Invalidation-Acknowledgement (InvAck in Gem5) to θ1, which is now the owner of the lock variable. Third, it forwards the GetX requests from θ2 and θ3 to θ1. Moreover, together with the forwarded GetX requests, the directory controller at the home node also forwards the total number of data sharers (AckCount in Gem5) of the lock variable to θ1, so that θ1 can monitor the receipt of the InvAck messages.

Step 4: On receiving the forwarded GetX requests from θ2 and θ3, θ1 sends a valid copy of the lock to the L1 caches of θ2 and θ3. Further, when θ1 has received the acknowledgement from the home node and all of the InvAck messages from θ2 and θ3, it knows that there are no longer any readers of the lock variable and it may write to the lock without violating coherence. As a result, θ1 successfully executes the SWAP operation and then executes the following critical section, while θ2 and θ3 loop back to Line 1.
After θ1 finishes its critical section, it releases the lock by changing the lock value from occupied (value 1) to available (value 0) in the home node.
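For readers more comfortable with C than assembly, the same spin-then-swap loop of Algorithm 4 can be sketched with C11 atomics as follows (a minimal equivalent written by us for illustration; it is neither the benchmark nor the kernel code):

#include <stdatomic.h>

/* lock == 0: available, lock == 1: occupied (as in Algorithm 4). */
static void spin_lock(atomic_int *lock)
{
    for (;;) {
        /* Lines 1-2: spin on the locally cached copy until the lock reads 0. */
        while (atomic_load_explicit(lock, memory_order_relaxed) != 0)
            ;
        /* Lines 3-4: atomically swap in 1; the old value tells us who won. */
        if (atomic_exchange_explicit(lock, 1, memory_order_acquire) == 0)
            return;   /* Line 5 falls through: we own the lock */
        /* Line 5 branches back: another thread won, so spin again. */
    }
}

static void spin_unlock(atomic_int *lock)
{
    /* Release: write 0 (available) back to the lock at the home node. */
    atomic_store_explicit(lock, 0, memory_order_release);
}

The atomic exchange is what appears at the hardware level as the GetX (read-for-modification) request discussed in Step 2 above.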

4.4.3 Reduce lock coherence overhead with iNPG

Figure 4.5a depicts the transactional procedure for Steps 2, 3 and 4 in Figure 4.4. When the three competing threads θ1, θ2 and θ3 execute the spin-lock, three atomic swaps (corresponding to three GetX requests in Gem5) are issued and routed through NoC routers (denoted R in Figure 4.5) to the home node. The winning GetX exclusively secures the lock. The directory controller in the home node then invalidates the lock copies in θ2's and θ3's L1 caches and forwards the GetX requests from θ2 and θ3 to θ1, which is now the owner of the lock. When θ1 receives the forwarded GetX requests, it sends valid lock copies to θ2's and θ3's L1 caches. Further, after θ1 has received all InvAcks from θ2 and θ3, it can move forward to execute the code in the critical section.

From Figure 4.5a, we can observe that the coherence traffic between the home node and θ1, θ2, θ3 lies on the critical path of program execution. As shown in the motivational Figure 4.2, this lock coherence overhead (LCO) is a significant part of application running time because the home node is solely responsible for sending invalidations to all failing threads, and θ1 has to collect all the invalidation acknowledgements before it can move forward.

We aim to shorten the lock coherence latency, i.e., LCO, so as to reduce COH in lock spinning. To this end, we propose in-network packet generation (iNPG) enabled by "big" routers (denoted BR in Figure 4.5b): normal routers enhanced with active packet-generation functionality for maintaining cache coherency. This is possible because intermediate routers can provide a temporary "barrier" for a lock variable. Once an intermediate router has transferred a GetX request for lock variable ℓ (denoted GetX[ℓ]) out, a temporary "barrier" is set up for lock ℓ, which stops the delivery of failing GetX[ℓ] requests that lose arbitration to the delivered GetX[ℓ], as well as subsequent GetX[ℓ] requests that arrive later at the router. Since the "barriers" are set up in a distributed fashion at intermediate routers instead of aggregated at the home node, multiple early invalidations can be generated and sent by intermediate big routers to shorten LCO. More details are presented in Section 4.5.

(a) Conventional spin-lock transactional procedure. (b) Big router enabled spin-lock transactional procedure.

Figure 4.5: Big router enabled iNPG aiming to shorten lock coherence overhead (LCO) in lock spinning.

The principle of iNPG is illustrated in Figure 4.5b. When the losing or later-arriving GetX operations reach a big router, their further transmission is stopped. Instead, cache invalidations are immediately generated by the big router and sent to the issuing threads (θ2 and θ3 in Figure 4.5). The big router then forwards the GetX requests and the acknowledgements (InvAcks) from the failing threads to the home node, which forwards them to the winning thread (θ1 in Figure 4.5). In this way, the invalidation and acknowledgement of the lock copies are performed on the way to the home node instead of after the GetX requests of the losing threads have reached the home node. Note that if the winning GetX request is determined at the home node (e.g., two GetX requests arrive at the home node in the same cycle but from different inports), the home node is responsible for sending the invalidation to, and waiting for the acknowledgement from, the thread that lost the local arbitration at the home node, just as in the original operation.

4.5 Design details

4.5.1 Big router micro-architecture

The idea of iNPG for early cache coherence completion is realized by the "big" router, which enhances a normal NoC router with packet-generation functionality. Figure 4.6 shows the micro-architecture of a big router, in which a packet generator is added to maintain a locking barrier table and to generate cache coherence packets in the network. Our baseline router is a 2-stage pipelined speculative router [134] in which the first stage performs Route Computation (RC), Virtual Channel (VC) Allocation (VA), and Switch Allocation (SA) in parallel, and the second stage is Switch Traversal (ST).

In a big router, the VC allocator interacts with the packet generator as follows. (1) On detecting that the first GetX request targeting a lock has been transferred out of the big router, the VC allocator informs the packet generator to create an entry in the locking barrier table for every stopped GetX request. (2) Once an entry is created, it acts as a "barrier" for GetX requests aiming for the same lock, i.e., subsequent or arbitration-failed GetX requests for that lock will be stopped. Whenever a GetX is stopped, an Inv packet is generated, stored in a separate VC, and sent to the corresponding thread. Meanwhile, the VC allocator changes the failed GetX request into a FwdGetX request, which is forwarded to the home node. (3) On receiving the InvAck packet for an early Inv packet that has been sent out by the big router, the big router forwards the received InvAck to the home node, which further sends it to the lock owner. In our implementation, packet generation is realized in the same pipeline stage as ST.

Figure 4.6 also sketches the packet generator design, where each locking barrier table entry records (1) the ID of the core that issued the message, (2) the memory address of the CS lock in the home node, and (3) the phase of the entry: GetX forwarded (GetXFwd), Inv generated (Inv), InvAck for an early Inv received (InvAck), and InvAck forwarded (AckFwd). Figure 4.6 also shows the format of the generated packets. We omit the information fields used for flow control in all network packets (such as credit count, VC identifier, etc.) and focus on the additional information fields that are generated. As shown in the figure, the packet generator generates packets in the Inv phase (an invalidation packet) and the Ack phase (changing the destination of the received InvAck message to the ID of the home node). A locking barrier table entry is freed when it has finished all four phases. When the locking barrier table is full, subsequent GetX requests pass through as in a normal router.
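As a software-level illustration of the locking barrier table just described (our own sketch, not the RTL; the type and function names are assumptions, while the field widths follow the 6-bit Core ID, 32-bit MemAddr and 2-bit phase encoding reported in Section 4.5.2):

#include <stdbool.h>
#include <stdint.h>

/* Phases an entry goes through, as described above:
 * GetX forwarded -> Inv generated -> InvAck received -> InvAck forwarded. */
typedef enum { PHASE_GETX_FWD, PHASE_INV, PHASE_INVACK, PHASE_ACK_FWD } lb_phase_t;

typedef struct {
    bool       valid;
    uint8_t    core_id;    /* issuing core of a stopped GetX (6 bits in RTL)  */
    uint32_t   mem_addr;   /* memory address of the CS lock at the home node  */
    lb_phase_t phase;      /* current phase of this entry (2 bits in RTL)     */
} lb_entry_t;

#define LB_ENTRIES 16      /* default locking barrier table size              */

typedef struct {
    lb_entry_t e[LB_ENTRIES];
} locking_barrier_table_t;

/* Returns true if a barrier is already active for this lock address, i.e. a
 * later or arbitration-losing GetX[addr] should be stopped, an early Inv
 * generated toward its issuing core, and the GetX turned into a FwdGetX. */
static bool lb_barrier_active(const locking_barrier_table_t *t, uint32_t addr)
{
    for (int i = 0; i < LB_ENTRIES; i++)
        if (t->e[i].valid && t->e[i].mem_addr == addr)
            return true;
    return false;   /* no active barrier: forward the GetX as a normal router would */
}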

4.5.2 Synthesis and chip floorplan & layout

Methodology. We have made RTL implementations of the normal and big routers. We perform RTL synthesis in Synopsys Design Compiler and physical floorplanning & layout in Cadence SoC Encounter at the module and chip levels, using TSMC's 40 nm low-power library (typical case).


Figure 4.6: Big router architecture.

During RTL synthesis, the core/un-core voltage is set to 1.1 V (the default voltage in TSMC's 40 nm technology) and the frequency to 2.0 GHz. During floorplanning, we use 1.7 V as the chip input voltage, which drives all I/O pins of the chip and is then scaled down to 1.1 V for the core/un-core voltage. We select the OpenRISC 1200 (OR1200) open-source CPU (https://openrisc.io/implementations) as the core model, whose configuration is adjusted according to our architectural setup in Table 4.1. Integrating 32 normal and 32 big routers with 64 OpenRISC cores, we build a 64-core chip following Figure 4.3.

Synthesis results. As shown in Figure 4.7a, a normal router consumes 19.9K (equivalent NAND) gates while a big router consumes 22.4K gates. A packet generator consumes 2.5K gates, with the majority coming from the locking barrier table, which has 16 entries by default and 40 bits (6 bits for Core ID, 32 bits for MemAddr and 2 bits for State) per entry. The table is synthesized into registers. Moreover, the added packet generator does not significantly increase the critical path inside a big router. Under the 2.0 GHz timing constraint, we observed 2.9% of all end-to-end paths violating the endpoint slack, all of which are eliminated via placement optimization during floorplanning. A packet generator consumes 8.4 mW dynamic power, adding 9.9% overhead to a normal router.


(a) Module synthesis and layout. Technology: TSMC 40 nm, low power, typical case (lpbwptc); total layers: 28; metal layers: 10; metal stack: 8 normal-thick metal layers plus 2 ultra-thick metal layers; power-mesh strategy: top 2 metal layers (M10 and M9) dedicated to the power mesh. Per-module results (Core / Big router / Normal router): gate count 152.5 K / 22.4 K / 19.9 K; SC count 23.2 K / 4.0 K / 3.6 K; net count 60.9 K / 11.1 K / 10.0 K; total SC area 0.97 mm2 / 0.14 mm2 / 0.13 mm2; cell density 48.26% / 66.67% / 61.90%; total wire length 8.81 m / 1.42 m / 1.28 m; chip area 2.03 mm2 / 0.21 mm2.
(b) Chip floorplan with clock tree (chip dimensions 11395 um × 11395 um).
(c) Big/normal router dimension: 460 um × 460 um.
(d) Big router components: input buffers 1-5, iNPG packet generator, VA/SA and switch, 5×5 crossbar.

Figure 4.7: Synthesis and physical floorplan & layout results for one big/normal router and the whole 64-core chip.

On the target architecture, one big tile (1 core plus 1 big router) consumes 716.1 mW dynamic power, with the core consuming 623.5 mW and the big router 92.6 mW. One normal tile (1 core plus 1 normal router) consumes 707.7 mW dynamic power, with the core power the same as in a big tile but the router power decreasing to 84.2 mW.

Floorplan & layout results. As summarized in Figure 4.7a, the number of floorplan layers is 28, including 10 metal routing layers, 11 via layers, 5 implant layers, 1 master-slice layer and 1 AP layer. Of the 10 metal layers, the top 2 (M10 and M9) are dedicated to the power mesh and feature ultra-thick wires. As shown in Figure 4.7b, each tile (both big and normal) occupies the same chip area, with the spacing between neighbouring tiles being 1.8 um to accommodate the 256-bit bi-directional wires between routers (128 bits per direction, with a 1-bit wire width of 0.007 um). Further, to implement both big and normal routers with a practical physical layout, we adjust the standard cell (SC) density so that both routers fit the same dimensions of 460 um by 460 um (0.21 mm2), as shown in Figure 4.7c. The cell density (before filler insertion) is 66.67% in a big router and 61.90% in a normal router, as reported in Figure 4.7a.

Table 4.1: Simulation platform configuration

Core (64 cores): Alpha 2.0 GHz out-of-order cores; 32-entry instruction queue, 64-entry load queue, 64-entry store queue, 128-entry reorder buffer.
L1 cache (64 banks): private, 32 KB per core, 4-way set associative, 128 B block size, 2-cycle latency, split I/D caches, 32 MSHRs.
L2 cache (64 banks): chip-wide shared, 1 MB per bank, 16-way set associative, 128 B block size, 6-cycle latency, 32 MSHRs.
Memory (8 ranks): 4 GB DRAM, 512 MB per rank, up to 16 outstanding requests per processor, 8 memory controllers.
NoC (64 nodes): 8×8 mesh network; each node consists of 1 router, 1 network interface (NI), 1 core, 1 private L1 cache, and 1 shared L2 cache; X-Y dimension-order routing; routers are 2-stage pipelined with 6 VCs per port, 4 flits per VC, 4 virtual networks, and a 128-bit data path; directory-based MOESI; one cache block is one 8-flit packet, one coherence control message is one single-flit packet.
OCOR [169]: 128 retry times in the spinning phase; 9 priority levels, of which 8 higher levels are for requests in the spinning phase (each priority level mapped to 16 retry times) and 1 lowest priority level is for wakeup requests.
iNPG: unless otherwise specified, 32 normal routers and 32 big routers, with one big router deployed between every two normal routers; 16-entry locking barrier table in each big router.

4.6 Experiments

4.6.1 Methodology

We evaluate iNPG with timing-detailed full-system simulations using both the PARSEC [13] and SPEC OMP2012 [121] benchmarks. We use 10 PARSEC programs with all large input sets (we exclude blackscholes and swaptions, as the former contains only barrier synchronization and the latter only one CS) and all 14 OMP2012 programs. For legibility, we use short names for applications with long names: body is short for bodytrack, can for canneal, face for facesim, fluid for fluidanimate, freq for freqmine, and stream for streamcluster. For data validity, results are obtained from the multi-threaded execution phase called the Region-of-Interest (ROI). We implement and integrate our design in Gem5 [14], in which the embedded network model GARNET [2] is enhanced according to our technique. The key configurations of the simulation platform are listed in Table 4.1. Each application runs on 64 cores with one thread per core. The operating system is Linux 4.2. Unless otherwise specified, the synchronization primitive is the default queue spin-lock in Linux 4.2 with the spin-lock phase implemented as the MCS lock.

We set up four comparative cases as follows.

Case 1 is Original, the baseline architecture that Gem5 simulates according to the configurations in Table 4.1.

Case 2 is OCOR (Opportunistic Competition Overhead Reduction), the technique proposed for the queue spin-lock in [169], which maximizes the chance that a thread gets the CS in the low-overhead lock-spinning phase and minimizes the chance that a thread gets the CS during the high-overhead sleep phase. OCOR is realized by software and hardware co-design. In software, OCOR monitors the remaining times of retry (RTR) in a thread's spinning phase, which reflects how soon the thread will enter the high-overhead sleep phase. In hardware, OCOR integrates the RTR information into the packets of SWAP requests and lets the NoC routers prioritize SWAP request packets according to this information. The principle is that the smaller the RTR a SWAP request packet carries, the higher its priority and thus the quicker its delivery. When a thread is already in sleep mode, however, it should access the CS at a later time, since waking the thread up introduces considerable overhead; it is thus preferable to let threads in the low-overhead spinning phase win. In addition, program progress information is embedded in request packets to avoid starvation of low-priority requests. Following [169], the OCOR configuration is shown in Table 4.1.

Case 3 is our iNPG. In iNPG, we follow the NoC architecture introduced in Figure 4.3, in which big routers interleave with normal routers.

Case 4 is iNPG+OCOR, which combines iNPG with OCOR. In this case, all routers support OCOR.

4.6.2 Experimental results

Benchmark CS characteristics

Figure 4.8 reports the CS characteristics of the 10 PARSEC and the 14 OMP2012 programs on the target 64-core CMP. Figure 4.8a gives the total number of CS accesses and the average CPU cycles per CS for each program. Since the tasks in each program vary, both metrics differ from program to program. For example, in application fluid of PARSEC, critical sections (secured by pthread_mutex_lock) are used to synchronize indexes of liquid particles. Although each CS takes few CPU cycles to finish (81.47 CPU cycles on average), the total number of critical sections is high (10,240 in total). In contrast, in application imag of OMP2012, critical sections (secured by #pragma omp critical) are used to atomically modify an image. Although the total number of critical sections is smaller (4,000 in total), each CS performs relatively heavy work and thus takes more CPU cycles (179 CPU cycles on average) to finish.
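Purely as an illustration of the two source-level flavors of critical section mentioned above, the fragment below contrasts a short pthread_mutex-protected section with a heavier OpenMP critical section (our own toy example, not code from PARSEC or SPEC OMP2012):

#include <pthread.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static long shared_index;

void short_cs(void)                  /* fluid-style: short but frequent CS    */
{
    pthread_mutex_lock(&m);          /* lock guarding a shared particle index */
    shared_index++;                  /* only a few cycles of work             */
    pthread_mutex_unlock(&m);
}

void heavy_cs(double *image, int n)  /* imag-style: fewer but heavier CS      */
{
    #pragma omp critical             /* atomically modify a whole image region */
    {
        for (int i = 0; i < n; i++)
            image[i] *= 0.5;
    }
}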

(a) Total CS times and the average CPU cycles of each CS. (b) Total CS CPU cycles breakdown.

Figure 4.8: Total CS times, average CPU cycles per CS, and total CS execution time breakdown.

Figure 4.8b breaks the total CS execution time (CS times × average cycles per CS) of each application into competition overhead (COH) and the CS execution time itself (CSE). It is evident that, compared to CSE, COH contributes significantly more to application runtime. In the figure, we sort applications according to their total CS execution time (COH+CSE) in ascending order and divide them into three groups, with Group 1 (6 programs) featuring the lowest, Group 2 (12 programs) medium, and Group 3 (6 programs) the highest total CS execution time.

Application execution timing profile

To look into the details and impact of CS entry and access, we profile the execution timing of program freqmine in Figure 4.9 for the four comparative cases. For clarity, we show 30,000 CPU cycles of execution of the first 8 threads in freqmine, and we divide the program execution timing diagram into three phases: (1) the parallel phase, where threads perform concurrent computation tasks; (2) the COH phase, where threads compete with each other to enter the next critical section; and (3) the CSE phase, where threads execute code in critical sections.

The Original application execution timing profile is shown in Figure 4.9a. As illustrated, 62.1% of the CPU cycles are spent in the parallel phase, with 28.2% in COH and 9.6% in CSE, and 78 critical sections are completed during the reported 30,000 CPU cycles. With OCOR, COH across different threads is significantly reduced. As shown in Figure 4.9b, 69.9% of the CPU cycles are spent in the parallel phase, with 19.8% in COH and 10.4% in CSE, and 92 critical sections are completed during the reported 30,000 CPU cycles, a 17.9% application progress improvement over Original.

(a) Original. (b) OCOR. (c) iNPG. (d) iNPG+OCOR.

Figure 4.9: Comparison of the execution timing profiles of program freqmine with the four comparison mechanisms. Each thread execution is divided into three phases: parallel phase, COH phase, and CSE phase.

This is because more threads secure the critical section in the low-overhead spinning phase instead of the high-overhead sleep phase. Further, with iNPG, COH is also remarkably reduced. As shown in Figure 4.9c, 73.0% of the CPU cycles are spent in the parallel phase, with 17.1% in COH and 10.0% in CSE, and 96 critical sections are completed during the reported 30,000 CPU cycles, a 23.0% application progress improvement over Original. This is because invalidations of failing or later-arriving SWAP requests are sent from big routers instead of from the home node; the losing threads receive early invalidations, so the application avoids heavy coherence traffic latency. Finally, with iNPG+OCOR, in which both the CS grant order and the coherence traffic latency are optimized, COH is reduced further than with iNPG or OCOR alone. As shown in Figure 4.9d, 80.1% of the CPU cycles are spent in the parallel phase, with 9.0% in COH and 10.9% in CSE, and 104 critical sections are completed during the reported 30,000 CPU cycles, a 33.3% progress improvement over Original.

Analysis on LCO reduction

To reveal the effect of iNPG on reducing lock coherence overhead (LCO), Figure 4.10 compares the average coherence packet round-trip delay and its histogram for Original and iNPG in benchmark freqmine. The results are obtained for a scenario where all 64 threads compete for a lock variable hosted at the shared L2 cache of core (5,6). The measurement starts when all threads begin to compete for the critical section and ends when the last thread gets its critical section.

Figure 4.10a shows the average coherence Invalidation-Acknowledgement (Inv-Ack) round-trip delay between the home node and all competing threads in Original. Without iNPG, the home node holding the critical section lock is solely responsible for all the coherence invalidations, with the winning thread responsible for collecting the acknowledgements from the failing threads. Thus, depending on the distance between a core and the home node, cores near the home node enjoy low coherence overhead (since they can be invalidated early), while cores farther away from the home node suffer higher coherence overhead. Figure 4.10b further shows the corresponding coherence Inv-Ack round-trip delay histogram. The maximum coherence packet round-trip delay reaches 97 CPU cycles, exhibiting a "long tail" in the delay histogram that dominates the coherence completion time between the home node and the competing threads.

Figure 4.10c shows the effect of iNPG in reducing the average coherence Inv-Ack round-trip delay. With iNPG, the coherence Inv-Ack round-trip delay depends much less on the distance between the home node and the competing threads. This is because in iNPG, invalidation and the corresponding acknowledgement can be handled at distributed big routers instead of centrally at the home node. In this way, when a GetX request loses arbitration, the L1 cache of its issuing thread is invalidated by the nearest big router instead of by the home node.

(a) Original: average coherence packet round-trip delay per node (max. 97, min. 4, avg. 39.3 cycles). (b) Original: coherence packet round-trip delay histogram. (c) iNPG: average coherence packet round-trip delay per node (max. 15, min. 4, avg. 9.5 cycles). (d) iNPG: coherence packet round-trip delay histogram.

Figure 4.10: Average coherence packet round-trip delay and coherence packet round-trip delay histogram comparisons for Original (Figure (a), (b)) and iNPG (Figure (c), (d)) with benchmark freqmine.

Figure 4.10d further shows the corresponding coherence Inv-Ack round-trip delay histogram. Compared to Figure 4.10b, the maximum coherence packet round-trip delay is reduced to 15 CPU cycles, and the "long tail" of the delay distribution is eliminated. The average coherence packet round-trip delay decreases from 39.3 to 9.5 cycles.

CS and application finish time reduction

Figure 4.11 shows the overall critical section (including both COH and CSE) expedition results of the four mechanisms. We normalize the results obtained from Original to 1 for each benchmark. We can observe that the critical section expedition results are proportional to the total CPU cycles that an application spends in critical sections, as illustrated in Figure 4.8b. This is because the more CPU cycles a program spends in critical sections, the more opportunity there is for iNPG and OCOR to achieve a higher competition overhead reduction.


Figure 4.11: Critical section execution results achieved by the four comparison mechanisms.


Figure 4.12: Application ROI finish time achieved by the four comparison mechanisms.

As shown in Figure 4.11, across Group 1 applications, critical sections are expedited on average by 1.2× in OCOR, 1.4× in iNPG, and 1.8× in iNPG+OCOR. Across Group 2 applications, the average critical section expedition increases to 1.5× in OCOR, 1.9× in iNPG, and 2.5× in iNPG+OCOR. Across Group 3 applications, critical sections are expedited by 1.6× in OCOR, 2.7× in iNPG, and 4.0× in iNPG+OCOR, respectively. Across all 24 programs, OCOR expedites critical sections on average by 1.45× and at most by 1.90× (program dedup). iNPG expedites critical sections on average by 1.98× and at most by 3.48× (program nab). With iNPG+OCOR, the average and maximum critical section expedition reach 2.71× and 5.45× (program nab), respectively.

Figure 4.12 shows the relative application ROI finish time of the four comparison mechanisms, in which we normalize the results obtained from Original to 100%. We can observe that all four techniques have negligible effect on the application's parallel execution phase, where each thread executes parallel computation tasks and encounters no critical section. The ROI finish time reduction achieved by OCOR, iNPG and iNPG+OCOR correlates with the CS characteristics of each application. In Group 1 applications, compared with Original, the average ROI finish time is reduced by 6.5% with OCOR, 10.6% with iNPG, and 16.3% with iNPG+OCOR. Across Group 2 applications, the average ROI finish time is reduced by 12.6% with OCOR, 19.5% with iNPG, and 25.2% with iNPG+OCOR. In Group 3, the average ROI finish time is reduced by 17.3% with OCOR, 27.1% with iNPG, and 32.4% with iNPG+OCOR. Across all 24 programs, OCOR reduces the average ROI finish time to 87.7%, i.e., by 12.3% compared to Original. iNPG decreases the average ROI finish time to 80.1%, i.e., by 19.9%. Finally, iNPG+OCOR reduces the average ROI finish time to 75.3%, i.e., by 24.7%, exhibiting the highest ROI finish time reduction.

We can observe from Figure 4.11 and Figure 4.12 that the overall benefit of iNPG+OCOR is not the sum of the benefits of stand-alone iNPG and OCOR. This is because in iNPG, LCO in the spinning phase is significantly reduced, so more threads can gain CS access without entering the sleep phase. Since OCOR also aims to have more threads enter the CS in their spinning phase, the opportunity for OCOR shrinks when it is combined with iNPG.

iNPG's effectiveness with other locking primitives

In the previous experiments we examined the LCO and application expedition brought by iNPG with the queue spin-lock (QSL) primitive of Linux 4.2. We now explain and show the effectiveness of iNPG in reducing application ROI finish time with other locking primitives, including the test-and-set (TAS) lock, the ticket lock (TTL) [139], the array-based queuing lock (ABQL) [4, 63] and the MCS lock [59].

In TAS, each competing thread keeps spinning on its local copy of the CS lock variable until it successfully enters the critical section. Thus, TAS generates extensive read-modify-write operations because every time the lock is freed, each thread generates an exclusive access request, of which only one will succeed, with all failed ones invalidated by the coherence protocol. The number of read-modify-write operations is significantly reduced in TTL and ABQL. In both primitives, competing threads are served in FIFO order: whenever a critical section is released, only one thread can issue a read-modify-write operation to enter the next critical section.
The thread then takes the CS lock and invalidates all CS lock copies in other threads' caches. In the MCS lock, the lock contention traffic is further reduced. Instead of polling on one unique lock variable, in the MCS lock each thread polls on its predecessor to access the next critical section. That is, each thread (except for the first one) does not poll on a CS lock, but checks whether its predecessor has finished its critical section execution. In this way, when a thread finishes the critical section, it directly notifies its successor to enter the next critical section.

Our experimental results are in accordance with the above analysis. Figure 4.13 compares the ROI finish time reduction achieved by iNPG with TAS, TTL, ABQL, QSL and the MCS lock. From the figure we can observe that 1) with different locking primitives, iNPG consistently reduces application ROI finish time across all benchmark programs, and 2) the benefit of iNPG differs across locking primitives. On average, after applying iNPG, ROI finish time is reduced by 68.6% in TAS, 43.6% in TTL, 42.5% in ABQL, 24.8% in QSL and 21.5% in the MCS lock. This shows that with TAS, TTL and ABQL, iNPG achieves more effective results than with QSL and the MCS lock, which generate less lock competition traffic in the NoC.
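To make the contention behaviour of these primitives concrete, the following minimal C11 sketch contrasts a test-and-set lock with an MCS-style queue lock. It is only an illustrative sketch of the descriptions above, not the Linux queue spinlock or the benchmark code evaluated here; all type and function names are assumptions.

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* TAS: every waiter spins on the same lock word, so each release triggers a
 * burst of read-modify-write attempts of which only one succeeds. */
typedef struct { atomic_bool held; } tas_lock_t;

static void tas_acquire(tas_lock_t *l) {
    while (atomic_exchange_explicit(&l->held, true, memory_order_acquire))
        ;                              /* all contenders hammer one cache line */
}
static void tas_release(tas_lock_t *l) {
    atomic_store_explicit(&l->held, false, memory_order_release);
}

/* MCS: each waiter spins on its own node; the releaser notifies only its
 * successor, so lock hand-over traffic stays point-to-point. */
typedef struct mcs_node {
    struct mcs_node *_Atomic next;
    atomic_bool locked;
} mcs_node_t;
typedef struct { mcs_node_t *_Atomic tail; } mcs_lock_t;

static void mcs_acquire(mcs_lock_t *l, mcs_node_t *me) {
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);
    mcs_node_t *pred = atomic_exchange(&l->tail, me);
    if (pred) {                        /* queue non-empty: wait behind predecessor */
        atomic_store(&pred->next, me);
        while (atomic_load_explicit(&me->locked, memory_order_acquire))
            ;                          /* spin on our own node only */
    }
}
static void mcs_release(mcs_lock_t *l, mcs_node_t *me) {
    mcs_node_t *succ = atomic_load(&me->next);
    if (!succ) {
        mcs_node_t *expected = me;
        if (atomic_compare_exchange_strong(&l->tail, &expected, NULL))
            return;                    /* no successor: lock becomes free */
        while (!(succ = atomic_load(&me->next)))
            ;                          /* successor is still linking itself in */
    }
    atomic_store_explicit(&succ->locked, false, memory_order_release);
}

The structural difference visible here is what each waiter spins on: the single shared lock word in TAS versus a per-thread node in MCS, which is why TAS, TTL and ABQL leave much more lock competition traffic in the NoC for iNPG to remove.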


Figure 4.13: Application ROI finish time improvements with different locking primitives.


Figure 4.14: CS improvements with different big router deployments.

Sensitivity to big router deployment

Since iNPG relies on big routers to function, we investigate its sensitivity to big router deployment by varying the number of big routers over 0, 4, 16, 32 and 64. We distribute all big routers evenly on the chip, where 0 big routers corresponds to the Original setup and 32 big routers represent the case illustrated in Figure 4.3. Figure 4.14 reports CS execution time results (including both CSE and COH) normalized to the CS execution time of Original. We can make two observations. First, as expected, iNPG significantly expedites COH across the different benchmarks, with the CSE results remaining the same as in Original. Second, as the number of big routers increases, COH decreases accordingly. However, the further COH expedition gained from 32 to 64 routers is marginal. Therefore, for the 64-core CMP, 32 big routers strike a “sweet spot” between big router count and COH expedition, and this configuration is used as the default big router deployment in our experiments.


Figure 4.15: iNPG's effectiveness with different NoC dimensions and numbers of entries in the locking barrier table.

Sensitivity to NoC dimension and number of entries in the locking barrier table

Figure 4.15 shows iNPG's effectiveness in reducing application ROI finish time across all benchmark programs with different NoC dimensions, ranging over 2×2, 4×4, 8×8 (the default setup), 10×10 and 16×16 cores. From the figure we can observe that as the NoC dimension increases, iNPG becomes more effective at reducing application ROI finish time. This is because as the number of cores scales, more threads are involved in competing for the same critical section, and thus LCO becomes more critical. For example, in the 2×2 NoC, iNPG reduces application ROI finish time by 4.7% on average, which increases to 24.8% in the 8×8 NoC and 57.5% in the 16×16 NoC.

Figure 4.15 also shows iNPG's effectiveness with 4, 16 (the default) and 64 locking barrier table entries within a big router. From the figure we can observe that in the 2×2 and 4×4 NoCs, different numbers of locking barrier table entries bring only marginal performance differences. However, in the 8×8, 10×10 and 16×16 NoCs, a 4-entry locking barrier table severely degrades iNPG's performance. This is because as the NoC dimension scales up, reducing the entries in the locking barrier table directly limits the capability of big routers to send early invalidations, and thus reduces the benefit of iNPG. However, increasing the number of entries beyond 16 does not proportionally increase iNPG's performance, because each router in the NoC only carries traffic from a limited number of lock-competing threads. Based on these observations, we choose 16 as the default number of locking barrier table entries in our big routers.

Chapter 5

NoC bandwidth allocation for performance fairness1

5.1 Introduction

Network-on-Chip is emerging as a critical shared architecture for CMPs (Chip Multi-/Many-core Processors) running parallel and concurrent applications. As the core count grows quickly, efficient and fair sharing of communication resources becomes even more urgent in order to provide high application performance and Quality-of-Service (QoS) in exploiting application-level parallelism. Broadly speaking, QoS refers to the overall quality or performance of a system offered to clients with possibly different requirements. It is a collective attribute offered to all applications. In general-purpose CMPs, QoS means fairness, i.e., no application should be privileged for performance when sharing resources, because all applications are of equal importance. As the NoC in CMPs is typically a best-effort network without specialized, often costly, interference management schemes based on exclusive time slots [60] or separate physical/virtual channels [113], implementing QoS in CMPs calls for fair and efficient management of aggregately shared resources.

Supplementing previous QoS studies (see details in Section 5.2, Related Work) with our own observations, we draw the following two insights towards a satisfactory performance fairness scheme in CMP architectures. (1) Since traffic initiated from the same application is equitable, it is desirable that packets or packet batches belonging to the same class, thread or application be handled similarly by the routers end-to-end from source to destination. Such consistent packet handling can lift network resource allocation from being locally fair to being globally fair and avoid starvation [32]. (2) Since CMP applications produce heterogeneous traffic with variable characteristics over time, resource allocation methods have the opportunity to exploit application dynamism to be efficient and to facilitate fairness.

1This section is based on the material from the paper: Zhonghai Lu and Yuan Yao, “Aggregate Flow-Based Performance Fairness in CMPs”, in ACM Transactions on Architecture and Code Optimization (TACO, Volume 13 Issue 4), Dec. 2016.

For example, application-level measures such as stall-time criticality and program slowdown are used in [32] and [44], respectively; network-level metrics such as consumed bandwidth per interval and packet latency slack are used in [66] and [33], respectively.

Combining the two insights, our performance fairness technique aims to perform consistent handling of packets using their runtime characteristics in order to be both fair and efficient. This triggers a few questions: (1) what traffic characteristics should be profiled at runtime, and how can they be profiled effectively? (2) how should the runtime traffic characteristics be propagated for consistent resource allocation, particularly when traffic request-reply causality under cache coherency should be respected? (3) how can routers exploit the runtime information to achieve efficient and fair channel bandwidth allocation? While addressing the three logically related questions, we seek a systematic approach.

We first introduce the concept of aggregate flow as sequences of request-reply messages directly or indirectly generated from the same thread. This concept augments the notion of flow with runtime-profiled collective traffic characteristics so that packets from the same source are handled by routers in a similar fashion. More significantly, it allows us to conveniently take traffic causality into consideration. In on-chip architectures, communication is bi-directional, with one way being requests and the other way being replies. Also, requests may further trigger next-level requests through the memory hierarchy. Since replies and next-level requests are consequences of initial requests, it is desirable that fair resource allocation respects the causality relationship, offering consistent service to both request and reply packets at different levels.

Based on the aggregate flow concept, we develop three coherent mechanisms which answer the three respective questions. Specifically, (1) rate profiling estimates a thread's message generation rate (ρ) from an L1/L2 cache, which, together with the packet size (ℓ), captures the program's runtime network access demand using a shifting-window based characterization and prediction; (2) rate inheritance allows a data or coherence reply flow to inherit the rate characteristics of its associated request flow, achieving consistent network resource allocation in the presence of cache coherency through the memory hierarchy; (3) flow arbitration employs our hardware-efficient rate-proportional scheduling policy, Self-clocked Limited Fair Queueing (SLFQ), adapted from SCFQ [57] by replacing per-flow buffering with shared buffering, to fairly schedule flows in the best-effort NoC in proportion to their runtime rates.

In the rest of the chapter, related work is discussed in Section 5.2. Section 5.3 exemplifies application and thread runtime dynamism. In Section 5.4, we present the aggregate flow concept in multicores. Section 5.5 first gives an overview and then describes the three mechanisms of our technique. In Section 5.5.5, we detail the SLFQ-supported flow router. Section 5.6 reports experiments and results.

5.2 Related Work

5.2.1 QoS proposals for multicore systems

According to the targeted degree of strength, multi-core QoS techniques may be roughly classified as firm, strong, and fair.

Firm performance isolation techniques aim to provide guaranteed performance isolation by reserving or partitioning resources in the spatial and/or time domains, either statically or dynamically, for exclusive use. In [125], the multicore system is partitioned into virtual private machines (VPMs), each of which includes partitioned cores, a slice of cache space and a portion of main memory. Each application is assigned a minimum and maximum number of VPMs and its own time slots for execution. In [102], the guaranteed-service NoC hosts globally synchronized time frames (GSF) to accept a limited quota of traffic in each frame. In [168], a dynamic bank partitioning mechanism is proposed, which dynamically partitions memory banks according to applications' requirements. To reduce inter-application interference in the memory system, [122] maps the data of applications that are likely to severely interfere with each other to different memory channels. In [85], an online bandwidth shifting mechanism is developed to dynamically assign bandwidth to applications according to their prefetch efficiency. To jointly manage cache, NoC and off-chip memory resources simultaneously, a class-of-service based unified framework advocates the necessity of coordinated management, and a QoS policy in favor of high-priority applications is used to statically allocate resources at the beginning of application runtime to guarantee their performance [106].

Strong QoS schemes intend to provide a certain level of guaranteed performance isolation by aggregately sharing resources, with interference often controlled by admission or prioritization. To enhance resource utilization, resource re-allocation is often needed to exploit an application's resource demand fluctuation. In [67], excess resources can be re-allocated to other jobs to optimize system throughput, while admission control is incorporated to accept jobs only when their QoS targets can be met. Based on the architecture of [106], Li et al. [105] developed a dynamic QoS co-management technique that monitors the resource usage of applications at runtime and then conditionally steals resources from high-priority applications for lower-priority ones. The goal is to maintain the targeted level of performance for high-priority applications while improving the performance of lower-priority applications. In METE [145], multi-level resources (core, cache, off-chip memory) are holistically managed in a coordinated fashion through a multi-level feedback control system according to a performance target set by each application. In [44], a hardware-based source throttling solution enables system software-specified fairness goals to be achieved in the entire memory system. If the runtime unfairness, measured as the ratio of the largest slowdown to the smallest slowdown of co-running programs, is above a threshold set by the system software, the source cores causing unfairness are throttled, limiting the number and frequency of requests injected into the system. In [66], a preemptive virtual clock (PVC) mechanism is proposed to offer fairness guarantees requiring neither per-flow buffering (like WFQ, Virtual Clock [171]) in routers nor large queues at source nodes (like GSF).
It tracks each application's bandwidth consumption over a time interval and prioritizes packets based on the consumed bandwidth. Moreover, preemption of lower-priority messages is introduced to avoid priority inversion. However, PVC requires a dedicated ACK/NACK network and a window of outstanding transactions at each node to support the preemption mechanism. To address the scalability of performance isolation provisioning for thousand-core chips, the Kilo-NoC [65] proposes a heterogeneous low-diameter topology network with QoS support, using per-flow virtual channels only in portions of the die so as to reduce QoS overheads.

Fair QoS approaches seek to provide impartial performance slowdowns for multiple applications of equal importance by dynamically sharing resources. To accelerate network-sensitive applications and improve fairness, Das et al. [32] proposed to rank applications using their stall-time criticality at regular intervals and prioritize their packets accordingly. Additionally, packet batching is used to avoid starvation of low-priority packets. In [33], the same authors proposed a complementary packet slack-aware scheme which evaluates an application's sensitivity to network delay, and prioritizes delay-sensitive applications over insensitive ones by exploiting contending packets' available slack to improve overall system performance and fairness. In [166], the CMP architecture functions as a “market” where each core bids for “priced” shared resources with its “budget”. By allocating resources proportionally to the bids, the system avoids unfairness and treats each core in an unbiased manner.

Targeting performance fairness provisioning for multiple applications in a best-effort NoC based CMP, STC [32] has the closest relationship to our proposal. Both techniques exploit application dynamism to achieve performance efficiency and fairness. While STC uses traffic priority levels as indications of performance criticality to realize priority-based resource allocation, we use traffic rate to enable proportionality-based resource allocation. STC does not take traffic causality into consideration when ranking applications. This means that traffic initiated from the same application but generated from other nodes (e.g. reply traffic from a remote L2 or memory node) may be ranked and thus serviced differently. Our proposal suggests the importance of rate inheritance through the memory hierarchy and cache coherency for consistent servicing. A third difference is that STC relies on a global centralized controller to perform on-line application ranking, which may potentially become a single point of failure and is thus not scalable. In contrast, our technique aims for a distributed solution, in which each core profiles its own rate characteristics, distributes them to the network via request/reply packets, and the network utilizes this information to make equitable resource allocation.

5.2.2 Slowdown estimation based QoS in multicore systems

While most QoS provisioning works in multicore systems focus on mitigating performance degradation due to inter-application interference at shared resources, a few have paid attention to developing slowdown estimation models that enable effective mechanisms providing predictable performance and fairness to individual applications. This is notably reflected in recent endeavours dealing with the main memory system. In [158], Subramanian et al. presented a simple Memory-Interference induced Slowdown Estimation (MISE) model to estimate slowdowns of applications caused by memory interference in a multi-programmed environment. The slowdown of an application is estimated as the ratio of its uninterfered and interfered request service rates. Two new main memory request scheduling mechanisms exploiting MISE, MISE-QoS and MISE-Fair, are then developed to provide soft QoS guarantees to one or more applications without compromising system performance, and to minimize the maximum slowdown to improve overall system fairness, respectively. More recently, an accurate Application Slowdown Model (ASM) was proposed in [157], based on estimating the shared cache access rate of an application as if it were running alone on the system. A few use cases of ASM, namely slowdown-aware cache partitioning, slowdown-aware memory bandwidth partitioning and a soft slowdown guarantee scheme, are then presented to improve fairness and performance and to provide slowdown guarantees over state-of-the-art cache partitioning and memory scheduling schemes. Both works use rate-based, slowdown-proportional memory bandwidth allocation, which shares the same insight as our rate-proportional channel bandwidth allocation.

5.2.3 Other factors affecting QoS in multicore systems

In multicore systems, QoS provisioning to concurrent applications depends on a number of factors. Above, we have focused on resource allocation. In the following we review a few representative works on the influences of application-to-core mapping, memory controller placement, and network congestion control on multicore QoS.

Das et al. [31] studied application-to-core mapping policies to reduce memory system interference in multi-core processors. To reduce inter-application interference, application characteristics should be the first consideration. Based on application differentiation, the lessons learnt from this study include (1) separation: map network-latency-sensitive applications to node clusters in the network separate from network-bandwidth-intensive applications, such that the former make fast progress without heavy interference from the latter; and (2) locality: map applications that benefit more from being close to the memory controllers onto nodes near these resources. All their experiments use multi-programmed workloads, where each core runs a separate benchmark. Facing the limited pin bandwidth for integrating memory controllers on many-core chips, Abts et al. investigated the impact of memory controller placement and explored the design space to demonstrate potential improvements in performance and predictability [1].

To achieve sustainable performance and support QoS in multicores, network congestion must be controlled and network saturation avoided. To improve load balance in adaptive routing, a light-weight Regional Congestion Awareness (RCA) technique is proposed for NoCs to enhance performance through improved global network balance [62]. Instead of network-centric congestion control, source throttling is a common congestion-control approach in which nodes injecting excessive traffic are temporarily throttled to help the network recover from congestion. In this regard, a self-tuned congestion control technique using a hill-climbing algorithm is developed to avoid network saturation. It uses global knowledge of buffer occupancy to estimate network congestion and controls packet injection when the number of full buffers exceeds a dynamic threshold, which is automatically tuned according to the delivered network throughput [163]. While network-load information is used to guide source throttling in [163], application knowledge is also considered necessary to guide effective source throttling. Nychis et al. proposed an application-aware throttling mechanism to mitigate the effect of congestion in bufferless NoCs [127]. Combining both network load and application knowledge of network intensity, Chang et al. proposed for the first time a heterogeneous adaptive throttling (HAT) technique to achieve the best system performance and fairness in NoC-based multicore systems [22].

5.2.4 Fair queuing for QoS and fairness

SLFQ is adapted from SCFQ [57] with limited buffering, in order to achieve a cost-effective hardware implementation. SCFQ itself is an approximation of WFQ. WFQ [38], also known as PGPS (Packet-wise Generalized Processor Sharing) [132], is an approximation of GPS. In a GPS node, link bandwidth is allocated to multiple backlogged flows in proportion to their allocated weights. GPS is, however, defined in an idealized fluid system in which traffic is infinitely divisible and multiple traffic flows can be served simultaneously. Despite being an idealized theoretical model, GPS has two desirable properties: one is providing performance guarantees, and the other is fair bandwidth allocation for best-effort services. To make GPS implementable in packet systems, where one packet from one queue is served at any given time, WFQ (PGPS) gives a packet approximation of GPS by serving packets in the increasing order of their finish times in the fluid GPS system. WFQ is, however, complex to implement since it depends on virtual time to emulate fluid GPS and has to keep track of the number of active flows at any time. SCFQ introduces an approximation algorithm that estimates the system's virtual time at any moment from the virtual finish time of the packet currently being serviced. As such, it greatly reduces the implementation complexity of WFQ while maintaining accuracy. WFQ and SCFQ can bound the maximum packet delay by ensuring a minimum bandwidth allocation per flow. To support multiple service classes such as guaranteed real-time, rate-adaptive best-effort, and controlled link-sharing services, GPS is extended to hierarchical GPS (H-GPS) to address resource sharing at different levels [12].

H-GPS can be viewed as a hierarchical integration of one-level GPS servers. Similarly, to make H-GPS implementable in packet systems, Hierarchical Packet Fair Queueing (H-PFQ) algorithms are designed as approximations. Instead of explicitly guaranteeing a specific QoS measure such as maximum delay, the notion of service curve guarantee was developed and used to design a Service Curve based Earliest Deadline first (SCED) scheduling policy, which provides performance guarantees to connections in packet-switched networks with larger flexibility and schedulability [144]. SCED serves packets in the increasing order of their deadlines, calculated from the flow burstiness constraint and the per-flow guaranteed service curve. To design a memory scheduler with fairness and performance isolation for applications, Nesbit et al. designed fair queueing memory systems based on network fair-queuing algorithms [124].

Figure 5.1: An 8 × 8 mesh CMP with flow-supported FNI and FR.

5.3 Motivation

5.3.1 Target architecture

Figure 5.1 shows a typical 64-node CMP. Each node has a private L1 cache and a portion of the L2 cache, which is distributed and shared across all nodes. Main memory ranks are connected to the four central nodes in the first and last rows for structural symmetry. Each node consists of a Flow Network Interface (FNI) and a Flow Router (FR)2 to support our aggregate flow rate profiling, rate inheritance, and flow arbitration. The routers are interconnected in a popular mesh topology, on which the XY routing algorithm is used for simplicity and deadlock avoidance.

2If no flow concept is supported, each node comprises an ordinary NI and router.

We consider multi-threaded applications with one core executing one thread. The CMP may run applications either individually or concurrently. When running an application stand-alone, the cores are exclusively used by that application, which generates as many threads as the number of cores. When running multiple applications together, the cores are equally shared.
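For reference, the following is a minimal C sketch of the dimension-ordered XY route computation mentioned above: a packet is routed fully in the X dimension first and then in the Y dimension, which avoids cyclic channel dependencies. The port encoding and the mapping of increasing y to north are illustrative assumptions.

enum out_port { PORT_LOCAL, PORT_EAST, PORT_WEST, PORT_NORTH, PORT_SOUTH };

static enum out_port xy_route(int cur_x, int cur_y, int dst_x, int dst_y)
{
    if (dst_x > cur_x) return PORT_EAST;   /* resolve the X offset first */
    if (dst_x < cur_x) return PORT_WEST;
    if (dst_y > cur_y) return PORT_NORTH;  /* then the Y offset */
    if (dst_y < cur_y) return PORT_SOUTH;
    return PORT_LOCAL;                     /* arrived: eject to the local node */
}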

5.3.2 Application dynamism

To expose the importance of exploiting dynamic application characterization, we investigate the communication dynamism of multi-threaded workloads from PARSEC [13] under GEM5 default full-system simulation. We select two representative programs, bodytrack as a memory non-intensive application and freqmine as a memory-intensive application, to run stand-alone on the 64-node 8×8 mesh CMP without the flow concept. See configuration details in Table 5.1 of Section 5.6.


Figure 5.2: Injection rate variations in (a) different applications and (b) different threads of the same application.

Figure 5.2a shows the average request packet injection rates over all cores for the bodytrack and freqmine programs in the first 80 simulation windows, with each window consisting of one million simulation cycles. In terms of long-term per-core average injection rate, freqmine (0.61 packets/cycle) is about 50% higher than bodytrack (0.41 packets/cycle). This is because, being more memory-intensive, freqmine encounters more frequent cache misses, and thus more requests are issued to the caches and main memories via the on-chip network. However, when we track the injection rate over the timeline, we do observe that during some intervals, for example from cycle 55×10^6 to 75×10^6, the injection rate of bodytrack exceeds that of freqmine. This shows that an application's communication demand, with respect to cache/memory usage and packet injection rate, varies significantly during program runtime.

5.3.3 Thread dynamism

The program runtime dynamism varies not only across different applications, but also across different threads of the same application. To illustrate this phenomenon, we look into the request packet injection rates of two threads, thread0 on node 0 (coordinate (0, 0) on the mesh) and thread2 on node 2 (coordinate (2, 0) on the mesh), from freqmine in Figure 5.2b. Since each thread in freqmine has largely the same long-term behavior, the overall injection rates of the two threads do not differ much. However, when examined at finer-grained time intervals, the injection rates of the two threads vary. For example, from cycle 25×10^6 to 40×10^6, the injection rate of thread0 falls behind that of thread2, whereas in the succeeding cycles 40×10^6 to 50×10^6, the injection rate of thread0 exceeds that of thread2. This shows that different threads in the same parallel program also differ in network access behavior.

Both examples above exhibit that, as an application program executes, its network packet injection rate typically varies from phase to phase and from time to time. A memory-intensive application has a higher packet rate than a compute-intensive application in the long run, but often not in each time window. The injection rate variation occurs not only across applications but also across threads of the same application. Such variations will be profiled at runtime and used to guide fair and efficient network resource allocation. Before presenting the injection rate profiling in Section 5.5.2, we explain the aggregate flow notion in cache-coherent CMPs.

5.4 Aggregate flow concept in cache-coherent CMPs

5.4.1 Message causality in CMPs

To motivate aggregate flow based fair resource management in a cache-coherent context, we first look into message generation causality in response to thread behavior and cache actions. As an example, we assume write-allocate write-back caches and cache coherency maintained for L1 caches using the directory-based MOESI protocol.

In cache-based CMPs, message production of a thread is triggered by either a cache write hit or a read/write miss. Suppose an L1 write miss to a block shared with m copies, as shown in Figure 5.3. The actions due to the L1 write miss have two phases. (1) Write allocate phase: upon the L1 write miss, an L2RdEx is sent to the corresponding L2 cache bank (home node). Supposing that the home node also encounters a read miss, a memory data request MemRd is then sent to the memory rank, bringing a valid data copy back via MemData and causing a write-back (MemWB) to the memory. (2) Cache coherency phase: after receiving the valid data copy, the home node sends Invalidations (Invs) to the other data-sharer L1 caches (remote L1s) in parallel. Upon receiving the Inv requests, L1(R1), L1(R2) to L1(Rm) send acknowledgements (Acks) back to the home L2 cache. Notation L1(Rm) stands for the L1 of a remote node Rm. After receiving all Acks, the home node forwards the data block L2Data to the requesting L1 for write, which further triggers an L2WB write-back to the home L2 and concludes the transaction.

Figure 5.3: Cache actions and causal messages generated from a write miss.

The timeline of the above transactions is also shown in Figure 5.3. We can observe that the requesting thread generates one initial data request message (L2RdEx), which in turn generates another request message (MemRd) and other subsequent messages (MemData, MemWB, Invs and Acks, L2Data, L2WB) for data updates and coherency maintenance. During a program's execution, these 2m+6 messages are continuously generated and transferred as sequences of packets from their sources to destinations via the NoC. Although these messages are generated at different times, to different destinations, and through different paths, they have one thing in common: all of them are either directly originated from or indirectly rooted in the same thread, with causality between a message and its successor (as illustrated by the dashed lines in Figure 5.3).

While these messages are delivered in the network, conventional NoC routers allocate resources (buffers and channel bandwidth) to these messages independently of each other, oblivious to the communication demand of their issuing thread. To avoid unfairness in this demand-oblivious treatment, it is necessary that messages from the same thread or serving the same cache action receive consistent on-demand service both temporally (no matter when the messages are generated) and spatially (no matter which router serves them). Further, because of the message causality, it is also necessary that succeeding messages (e.g. reply messages) enjoy the same consistent service as their preceding messages (e.g. request messages). This motivates us to extend previous flow concepts [171][124] in pursuit of end-to-end packet scheduling that is both temporally and spatially consistent.
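As a concrete rendering of this causal chain, the short C sketch below enumerates the messages generated by an L1 write miss to a block shared by m remote L1s, assuming the home L2 also misses as in the walk-through above. The function and the emit helper are illustrative only, not simulator code.

#include <stdio.h>

static int emit(const char *msg) { printf("%s\n", msg); return 1; }

/* Returns the total number of NoC messages rooted in the one write: 2m + 6. */
static int l1_write_miss_to_shared_block(int m)
{
    int n = 0;
    n += emit("L2RdEx  : requesting L1 -> home L2 (write allocate)");
    n += emit("MemRd   : home L2 -> memory (home L2 also misses)");
    n += emit("MemData : memory -> home L2 (valid copy)");
    n += emit("MemWB   : home L2 -> memory (write-back)");
    for (int i = 0; i < m; i++)
        n += emit("Inv     : home L2 -> remote sharer L1(Ri)");
    for (int i = 0; i < m; i++)
        n += emit("Ack     : remote sharer L1(Ri) -> home L2");
    n += emit("L2Data  : home L2 -> requesting L1 (data for write)");
    n += emit("L2WB    : requesting L1 -> home L2 (concluding write-back)");
    return n;   /* all 2m + 6 messages originate, directly or indirectly,
                   from the same thread's single write */
}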

5.4.2 Notions of flow and aggregate flow

Figure 5.4 shows the concept of a flow as a sequence of messages directly or indirectly generated from the same thread or cache action. Figure 5.4a shows three example flows.


Figure 5.4: The flow concept, with three example flows and an example aggregate flow.

The three example flows correspond to an L1 read miss followed by an L2 read hit, an L1/L2 read miss to a non-shared data block, and an L1/L2 write miss to a shared data block, respectively. In each case, we omit the write-back messages for clarity. As shown in Figure 5.4, a flow associates the causal messages serving the same thread or cache action, shares the issuing thread's runtime communication characteristics among them, and facilitates consistent scheduling for all messages in the same flow.

To support the flow concept and flow communication-demand based scheduling, we need only small modifications to the traditional packet header. For simplicity, we show only the case where one message is encapsulated into one packet, which may contain one or multiple flits. Besides the common information fields such as packet source, destination, and type (indicating whether the packet carries data or coherence control), we attach three new fields to each packet header, as shown in Figure 5.4a, which carry the thread's runtime communication characteristics:

• FID: the unique flow identifier, which records the ID of a flow's issuing thread, denoting to which thread the messages in a flow belong. FID keeps the same value for all messages in a flow.

• ℓ: the packet size in flits, denoting how much service each packet in a flow needs to receive per router. Typically, short messages (without data, for coherence control) can be contained in single-flit packets and long messages (with data) in multi-flit packets; see Table 5.1 in Section 5.6 for an example.

• ρ: the network request message injection rate of the issuing thread in messages per cycle, denoting how frequently a thread injects packets into the network. Because not every L1 request message causes an L2 request message (e.g., an L2 read hit does not), the ρ values of L1/L2 request messages in a flow are profiled separately at L1/L2 during program runtime (Section 5.5.2). As shown in Figure 5.4a, all messages associated with L1 cache requests (L2Rd, L2RdEx, L2Data) carry the L1 message injection rate ρ1, and all messages associated with L2 cache requests (MemRd, MemData, etc.) carry the L2 message injection rate ρ2. Further, for consistent service, the ρ values of L1/L2 reply messages in a flow are inherited from the corresponding L1/L2 request messages (Section 5.5.3).
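A possible C encoding of the extended header is sketched below; the field widths and the fixed-point encoding of ρ are illustrative assumptions, not the thesis's exact bit layout.

#include <stdint.h>

typedef struct flow_pkt_header {
    /* common fields of a conventional NoC packet header */
    uint8_t  src;      /* source node id */
    uint8_t  dst;      /* destination node id */
    uint8_t  type;     /* data packet or coherence-control packet */
    /* new flow-related fields carried by every packet of a flow */
    uint8_t  fid;      /* FID: id of the issuing thread, identical for all
                          messages of the same (aggregate) flow */
    uint8_t  len;      /* l: packet size in flits (1 for control messages,
                          several for data messages) */
    uint16_t rho_fx;   /* rho: profiled injection rate of the issuing L1/L2
                          (rho1 or rho2), fixed-point encoded */
} flow_pkt_header_t;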

However, if each data request produces an individual flow, the total number of flows in the CMP can be large. This is because, by allocating an entry in the Miss Status Holding Register (MSHR), each core can execute multiple outstanding data transactions, issuing multiple flows in parallel instead of waiting for one flow to finish. For example, suppose that each core has an MSHR consisting of 32 entries; then there are at most 32 × 64 = 2048 possible flows in a 64-core CMP. To reduce the number of flows, we propose the concept of aggregate flow, which further associates the flows issued from the same thread. In this way, one aggregate flow consists of multiple flows, and the maximum number of aggregate flows in a CMP is restricted to the number of cores. Figure 5.4b shows an example aggregate flow, in which we simplify the presentation of flows for clarity. As shown in Figure 5.4b, suppose that a core encounters two L1/L2 write misses to two shared data blocks, one L1/L2 read miss to a non-shared data block, and three L1 read misses followed by L2 read hits. In total, the core issues 6 flows, which all belong to the same aggregate flow.

To clarify our flow notions, Figure 5.5 shows the relationships of four packet flow concepts, highlighting their evolution. Figure 5.5a is the basic packet flow concept used in open macro-networks [132][144][171][12]. In offering guaranteed network services to timing-sensitive applications such as audio/video and multimedia, the concept of a packet is leveraged to that of a flow as a stream of continuous packets, such that consistent service can be provided to all packets in the flow and the maximum packet delay/jitter can be reasoned about. Since the macro-network is an open system, the traffic characteristics of open flows are independent of the service characteristics. For example, the bit rate of a multimedia stream is not influenced by service latency. In Figure 5.5a, the flow rate ρ depends only on its source.


Figure 5.5: Previous flow concepts [(a) and (b)] and our extended flow concepts [(c) and (d)].

In multicore systems, Nesbit et al. studied fair-queuing scheduling for thread-level flows to provide performance isolation to multiple applications in a shared SDRAM system, in which the flow injection from a thread is influenced by the memory service [124]. We call it a round-trip flow, as drawn in Figure 5.5b, where the request flow rate ρ is subject to the reply rate ρ′. Our flow concept shown in Figure 5.5c enriches the round-trip flow with propagation semantics by considering request propagation, i.e., a request may trigger next-level request(s) via intermediate node(s) down to the destination through the memory hierarchy and cache coherence. As illustrated in Figure 5.5d, our aggregate flow concept further takes all propagation flows from the same source as an aggregate, sharing the flow ID and corresponding rate information.

5.5 Flow-Oriented Mechanisms

5.5.1 Overview

After introducing the flow and aggregate flow concepts, we give an overview of our flow-based framework comprising three coherent mechanisms, namely rate profiling, rate inheritance and rate-proportional arbitration. Figure 5.1 sketches its architectural support in the NI and router. Since the NI is enhanced to support rate profiling and inheritance, we rename it Flow NI (FNI). As the router realizes rate-proportional channel allocation, we rename it Flow Router (FR).

1. Rate profiling: profiling a traffic source's network access rate ρ at run time, per epoch, by the FNI. For the target CMP, the rate is profiled per L1/L2 cache, and all packets from the L1/L2 cache carry the profiled ρ1/ρ2, respectively.

2. Rate inheritance: at the FNI, a reply flow inherits its corresponding request flow's rate to achieve consistent network scheduling throughout its transmission.

3. Rate-proportional arbitration: the FR uses the flow information (both ℓ and ρ) to realize our Self-clocked Limited Fair Queuing (SLFQ) scheduling policy, which achieves fair and consistent rate-proportional scheduling in a best-effort network without per-flow buffering.

Next, we detail rate profiling and rate inheritance, and then explain the principle of SLFQ.

5.5.2 Rate profiling

Rate characterization: In typical CMPs, ρ indicates how frequently a thread exchanges data with the shared on-chip L2 caches and memory ranks via the NoC. Since each core has its own L1 cache, both an L1 read hit and an L1 read miss followed by a local L2 read hit incur only local actions and thus do not produce packets to the network. Network request packets are generated by three other types of cache actions, i.e. a write hit to a shared data block (the cause of coherence Inv/Ack messages), a read miss, and a write miss, from either the L1 or the L2 cache.

To obtain the ρ value, we record the total number of MSHR entries allocated by the L1/L2 cache in each sampling time window (epoch) of length Lsw. This accurately reflects a source's communication workload because, in state-of-the-art non-blocking/lock-up-free caches, when an L1/L2 cache miss occurs, an access request to the L2/DRAM is created by first allocating an MSHR entry [77]. Similarly, in the case of a write hit to a shared data block with a coherency requirement, an MSHR entry is allocated to track the status of Inv/Ack for transaction completion. The number of MSHR entries allocated to a thread directly represents the total number of outstanding requests issued by that thread to the L2 cache or the DRAM. In each sampling window Lsw, two counters, C1 and C2, initialized to 0 at the beginning of the window, record the total number of MSHR entries allocated for the L1 and L2 cache, respectively. When a new L1/L2 MSHR entry is allocated, C1 or C2 is increased by 1; an existing entry is de-allocated when the corresponding transaction completes. At the end of each sampling window, ρ1 is computed for L1 cache request flows as C1/Lsw and ρ2 for L2 cache request flows as C2/Lsw. To continuously sample ρ, the window is shifted forward by an epoch T after each characterization. We set T to Lsw/N, where N is a natural number called the overlapping factor. If N = 1, two consecutive sampling windows have no overlap; if N = 2, two consecutive sampling windows overlap by Lsw/2. The overlapping of consecutive sampling windows (when N ≥ 2) ensures that the window-by-window characterization results enjoy high continuity. With the shifting-window mechanism, the valid period of each characterized ρ value is one epoch T.

Rate prediction: While the rate characterization can profile a thread's network access rate continuously, it can only measure past events. To guide future packet delivery, we however need a flow's future characteristics. To address this dilemma, inspired by [111], we use a simple yet effective state-aware prediction mechanism. Taking as input the rate output from the rate characterization, the task of the rate prediction is to speculate ρ(Tn+1) based on previously calculated

results ρ′(Tn), ρ′(Tn−1), · · ·, where n is the sequence number of the epoch. Specifically, ρ′(Tn−1) and ρ′(Tn) represent the resource access rate in the previous (n−1)th and the current nth epoch, and the speculated ρ(Tn+1) predicts the network access rate during the next (n+1)th epoch. The prediction is done as follows:

\[
\rho(T_{n+1}) =
\begin{cases}
\rho'(T_n) + \Delta\rho'(T_n), & \text{if } \rho'(T_n) + \Delta\rho'(T_n) > 0 \\
\rho'(T_n), & \text{if } \rho'(T_n) + \Delta\rho'(T_n) \le 0
\end{cases}
\tag{5.1}
\]

The predicted ρ(Tn+1) is composed of a base value ρ′(Tn) and an offset value ∆ρ′(Tn) = ρ′(Tn) − ρ′(Tn−1), which captures possible rate variation and is used to avoid abrupt change in ρ(Tn+1). This approach has low implementation overhead and is thus suitable for hardware implementation. In Figure 5.1, the rate profiling is implemented by the characterizer for rate characterization and the predictor for rate prediction at the source FNI.
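The C sketch below illustrates the shifting-window characterization and the prediction of Equation 5.1. Only Lsw, N, the counter C1 and the ρ′ values come from the text; the window length, the helper names and the simplified handling of window overlap (the sketch shifts by a full window, i.e. the N = 1 behaviour) are assumptions.

#include <stdint.h>

#define LSW       2048u    /* sampling window length in cycles (assumed value) */
#define N_OVERLAP 2u       /* overlapping factor N => epoch T = Lsw / N */

static uint32_t c1;        /* C1: L1 MSHR allocations counted in this window */
static double rho_prev;    /* rho'(T_{n-1}) */
static double rho_curr;    /* rho'(T_n)     */

/* Called whenever a new L1 MSHR entry is allocated within the window. */
static void on_l1_mshr_alloc(void) { c1++; }

/* Called once per epoch T: close the window, characterize rho'(T_n), and
 * predict rho(T_{n+1}) per Equation 5.1. */
static double end_of_epoch(void)
{
    rho_prev = rho_curr;
    rho_curr = (double)c1 / (double)LSW;    /* characterized rho'(T_n) = C1/Lsw */
    c1 = 0;                                 /* restart counting for next window */

    double delta = rho_curr - rho_prev;     /* delta rho'(T_n)                  */
    double pred  = rho_curr + delta;        /* base value + offset              */
    return (pred > 0.0) ? pred : rho_curr;  /* fall back if prediction <= 0     */
}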

5.5.3 Rate inheritance

Rate inheritance is the process by which a reply message inherits the rate characteristics of its corresponding request message. After the network access rate ρ1/ρ2 of the L1/L2 cache3 is obtained, data and coherency request messages carry this information in their packet headers. For example, upon a write hit to a coherent L1/L2 cache, the generated Inv packet(s) carry the corresponding ρ1/ρ2. When the request packets are received, their ρ1/ρ2 are extracted and inherited by the associated Ack packet(s).

Corresponding to data and coherency messages, there are two kinds of inheritance (“→”) relationships, namely data request → data reply and coherence request → coherence reply. In a two-level cache architecture, depending on whether a request message is produced as a result of an L1 or an L2 action, the data request → data reply inheritance can be further divided into L1-to-L2 data request → L2-to-L1 data reply and L2-to-memory data request → memory-to-L2 data reply inheritance. Similarly, the coherence request-reply inheritance (Inv → Ack) can be extended to different levels of caches with a coherency requirement. In the case of a write-back, the write-back data packet inherits the rate of its respective triggering packet, i.e., L2Data/MemData → L2WB/MemWB.

Using the write miss example in Figure 5.3, Figure 5.6 shows the inheritance relationships between different messages. Since the L1-to-L2 data request is due to the L1 action, the request message carries the network access rate of L1, ρ1. Similarly, the L2-to-memory request message uses the network access rate of L2, ρ2. The figure also illustrates how the rate information of the request messages, including both data requests and coherency requests, is propagated and inherited by their associated reply messages. Such inheritance maintains a consistent relation between a request and its reply, ensuring that the end-to-end resource allocation (not only source-destination wise, but also round-trip protocol wise at multiple levels) is consistent.

3Strictly speaking, it is the L1/L2 associated MSHR.
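A minimal sketch of the inheritance step at the FNI, reusing the hypothetical flow_pkt_header_t from the Section 5.4.2 sketch: a reply simply copies the FID and ρ carried by the request that triggered it, so that both directions of the transaction are scheduled consistently.

static void build_reply_header(const flow_pkt_header_t *req,
                               flow_pkt_header_t *rep,
                               uint8_t reply_type, uint8_t reply_len)
{
    rep->src    = req->dst;     /* the reply travels back toward the requester */
    rep->dst    = req->src;
    rep->type   = reply_type;   /* e.g. L2Data for L2RdEx, Ack for Inv */
    rep->len    = reply_len;
    rep->fid    = req->fid;     /* same aggregate flow as the request */
    rep->rho_fx = req->rho_fx;  /* inherit rho1/rho2 from the request */
}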

L1 L2 (Home) Mem L1(R1) L1(R2) L1(Rm)

Inv(ρ2) Inv(ρ2) Inv(ρ2) Time . . .

Ack(ρ2) Ack(ρ2) Ack(ρ2)

Figure 5.6: Request-Reply rate inheritance at multiple levels

5.5.4 Flow arbitration with SLFQ

Our flow-based rate-proportional resource allocation aims to fairly allocate channel bandwidth to contending flows according to their dynamically profiled message injection rates. Among the many service disciplines in the literature [171][66][102][130], Weighted Fair Queueing (WFQ) [38] excels in providing fairness, limiting interference and being adaptive to dynamic traffic scenarios. When multiplexing flows onto a shared channel, WFQ is work-conserving and allocates a weighted portion of the channel bandwidth according to, e.g., each flow's rate. Specifically, the principle is to divide the channel bandwidth B among a set of backlogged flows4 G(t), with each flow fi ∈ G(t) getting a share in proportion to its rate ρi. Hence, a backlogged flow fi receives service rate ri:

\[
r_i = \frac{\rho_i}{\sum_{j \in G(t)} \rho_j}\, B. \tag{5.2}
\]

Since G(t) varies, ri also varies over time and can be larger or smaller than ρi, showing WFQ's adaptivity to different traffic contention scenarios.

While offering adaptive rate-proportional fairness, WFQ incurs high implementation complexity since it depends on virtual time to keep track of the service progress [171]. To reduce the implementation complexity, Self-Clocked Fair Queueing (SCFQ) approximates WFQ by recording only the finish time of the packet currently being serviced, without maintaining the service progress [57]. Nevertheless, implementing SCFQ in hardware is still too expensive as it requires per-flow buffering. To make SCFQ amenable to cost-effective on-chip implementation, we adapt SCFQ from exclusive per-flow buffering (N flows, N buffers) to shared limited buffering (N flows, M buffers, where M < N), which we call Self-clocked Limited Fair Queuing (SLFQ).

4A flow is backlogged at time t if it has flits queued at time t.

(a) Flow scheduling results under FCFS, RR and SLFQ. (b) Average packet delay. (c) Maximum packet delay. (d) Flow finish time.

Figure 5.7: Comparison of FCFS, RR and SLFQ scheduling policies.

This adaptation can greatly reduce router complexity while bringing about two consequences: (1) buffer allocation: a flow needs to be allocated a downstream buffer before advancing; (2) buffer sharing: packets from multiple flows may interleave in a buffer. Although giving up SCFQ's guaranteed per-flow service, SLFQ targets fairness in a best-effort NoC. As both consequences are typical features of best-effort on-chip routers, SLFQ is suitable for CMP NoCs. Limited buffering also means limited fair queuing, because the maximum number of active flows ready for bandwidth allocation is limited to the number of buffers.

Figure 5.7 gives an example to illustrate how SLFQ achieves fair bandwidth allocation and interference containment. For comparison purposes, we also examine two other widely used work-conserving scheduling policies: first-come-first-serve (FCFS) and round robin (RR). Figure 5.7a qualitatively illustrates the difference between the three scheduling policies, and Figure 5.7b quantitatively compares the corresponding average/maximum packet delay and flow transmission finish time. As shown in Figure 5.7a, four flows share an output link. For simplicity, we assume single-packet messages, single-flit packets, and that transmitting one packet takes

one cycle. Packets in the three higher injection rate flows f1^H, f2^H, and f3^H have the same (ℓ, ρ) values of (1, 0.2), indicating a packet size of 1 flit and an average injection rate of 0.2 packets/cycle. Packets in flow f^L have (ℓ, ρ) values of (1, 0.1), representing a lower injection rate flow. Flows f1^H, f2^H, and f3^H inject 8 packets each via the east, south and west input links, respectively, and flow f^L injects 4 packets via the north input link. The packet arrivals and departures of the four flows are shown in Figure 5.7a. Realizing FIFO fairness, FCFS allocates bandwidth only by packet arrival order, leading the lower rate flow f^L to complete at cycle 23 due to the intensive arrivals of the higher injection flows. Moreover, although RR achieves certain interference isolation among flows, it ignores per-flow packet rate and service history, attaining an equal but unfair share of the total bandwidth for each competing flow. For example, in Figure 5.7a, albeit the three higher injection rate flows f1^H, f2^H, and f3^H receive service earlier than f^L, when the lower injection rate flow f^L arrives at cycle 12, both the higher and lower rate flows are equally served with a rate r of 0.25 packets/cycle, delaying f^L's finish time to cycle 26.

Different from FCFS and RR, SLFQ proportionally allocates bandwidth to each flow according to 1) the per-flow characterized packet injection rate, and 2) the service history of each flow. Counting from its arrival by a self-clock (as illustrated in Figure 5.7a), every competing flow is served with a bandwidth share that is proportional to its characterized injection rate. In SLFQ, an early arriving higher injection rate flow does not affect the service of a later but lower injection rate flow. This is because, by intensively injecting packets, the early arriving higher rate flow receives a service rate that exceeds its rate-proportional bandwidth share. In this case, when the lower rate flow arrives, the higher rate flow postpones its next service to ensure that the lower rate flow also enjoys its rate-proportional bandwidth share. For example, at cycle 12 in Figure 5.7a, because the router has transmitted 3 of f1^H's, 3 of f2^H's, and 5 of f3^H's packets, the first two packets (denoted α and β) of flow f^L are scheduled next. As such, SLFQ enables the lower rate flow f^L to finish at cycle 18, greatly accelerating its transmission without significantly delaying the other higher rate flows.

The effectiveness of SLFQ is further confirmed by the quantitative results in terms of average/maximum packet delay and flow transmission finish time shown in Figure 5.7b. For clarity, we use f1,2,3^H to denote the averaged results over the three higher injection rate flows f1^H, f2^H, and f3^H. Across the figures, we consistently observe that SLFQ effectively expedites packets from the lower injection rate flow f^L, largely reducing its average/maximum packet delay and flow transmission time compared to FCFS and RR, without significantly impacting the higher rate flows.

5.5.5 Flow router architecture

We develop a new flow router to serve flows with the proposed SLFQ scheduling policy. Our flow router is based on the classic input-buffered virtual channel (VC) router. Figure 5.8a sketches its architecture, partitioned into a data path and a control path.

Figure 5.8: The flow router architecture and SLFQ-based switch allocator.

On the data path, there are mainly multiple VCs per input channel and a crossbar. On the control path, there are Route Computation (RC), VC Allocation (VA), virtual finish time (fv) Computation (FC, detailed in Section 5.5.6) and SLFQ-based Switch Allocation (SA). As highlighted in Figure 5.8a, a flow router uses the flow state table, the fv computation logic, and the SLFQ-based switch allocator cooperatively to realize SLFQ. When a packet header reaches a VC, the flow router records the flow's characteristics (FID, ℓ, ρ) in a flow state table entry. Both parameters (ℓ, ρ) are used to compute fv for each packet. After VA completes, SA compares the fv of contending packets and determines their delivery order by switching the packet with the lower fv first.

5.5.6 Router pipeline and pipeline stages

Like the speculative router [134], our flow router has a two-stage pipeline in which the first stage performs RC, VA and SA in parallel and the second stage performs ST (Switch Traversal), as shown in Figure 5.9. To realize SLFQ without lengthening the pipeline, we include FC in the first pipeline stage in parallel with RC, VA and SA.

Figure 5.9: The two-stage router pipeline (head flit: RC, FC, VA and SA in the first cycle, ST in the second; body flits: SA then ST).

Virtual finish time computation (FC): Like WFQ/SCFQ, SLFQ can be realized by a sorted-priority-queue mechanism [171][141]. In such a mechanism, each flow computes a virtual finish time, fv, representing a dynamic priority that determines its service order. Upon the arrival of each packet, a flow's fv is updated according to 1) the flow's own traffic characteristics and 2) the system virtual time, Fv, estimated from the fv value of the packet currently in service. Competing flows are then served in the order of increasing fv values. Specifically, fv is calculated by:

$f_v^{i,k} = \max\{f_v^{i,k-1},\ F_v(t_a^{i,k})\} + \ell^{i,k}/r_i$,    (5.3)

where fv^{i,k} is the fv value of flow i when its kth packet arrives, t_a^{i,k} is the arrival time of the kth packet of flow i, r_i is the bandwidth (flits/cycle) allocated to flow i in the rate-proportional fashion (Equation 5.2), and ℓ^{i,k} is the packet length in flits. As shown in Equation 5.3, the fv of a flow is the service start time (max{fv^{i,k-1}, Fv(t_a^{i,k})}) plus the service time (ℓ^{i,k}/r_i). The service start time is usually the fv of the last served packet of the same flow (fv^{i,k-1}); if no packet of the flow was served in the last busy period, the service start time is instead the system virtual time Fv(t_a^{i,k}) at the kth packet's arrival moment t_a^{i,k}, estimated from the virtual finish time of the packet currently being serviced. Initially, Fv equals 0.
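As a small illustration of Equation 5.3 (function and parameter names are ours, not the thesis RTL), the update below computes the rate-proportional share r_i from the characterized rates of the competing flows, using the relation 1/r_i = (Σ_j ρ_j)/(ρ_i B) given in Section 5.5.8 with B = 1 flit/cycle:

```python
# Direct transcription of Equation 5.3 with a rate-proportional share r_i.
def next_virtual_finish_time(fv_prev, Fv_at_arrival, ell, rho_i, rho_competing):
    r_i = rho_i / sum(rho_competing)          # bandwidth share of flow i (flit/cycle)
    start = max(fv_prev, Fv_at_arrival)       # service start in virtual time
    return start + ell / r_i                  # plus the virtual service time ell / r_i

# Example: a 1-flit packet of a rho = 0.1 flow arriving while three rho = 0.2
# flows are also backlogged (system virtual time 12, no earlier packet pending).
fv = next_virtual_finish_time(fv_prev=0.0, Fv_at_arrival=12.0, ell=1,
                              rho_i=0.1, rho_competing=[0.2, 0.2, 0.2, 0.1])
```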

Lookahead RC: FC could add a new pipeline stage if no proper measure were taken, since the fv computation depends on the result of RC to know the competing flows and obtain their characteristics. To remove the fv computation from the critical path, we perform look-ahead RC as detailed in [134]: while a flow header is traversing a router, RC is performed not for this router but for the next one, and the result is passed along with the header flit. Thus, when a flow reaches the next router, its output port has already been selected, so RC and FC can begin in parallel.

Flow-aware VA: Like the baseline packet router, the flow router conducts VC flow control at the network level, with VCs allocated in round-robin order. At the link level, it uses credit-based flow control to avoid buffer overflow in the downstream router and to propagate back-pressure to the upstream router. Different from packet routers, the flow router allocates VCs with awareness of flows. It maintains the states of buffered flows in additional flow state tables. There is one such table per input port, and each entry of a table corresponds to one flow, recording the flow's (ℓ, ρ, fv) state. The number of entries equals the number of VCs.

SLFQ-based SA: In the SA stage, any VC that contains buffered flits and has an allocated downstream VC available bids for switch traversal. As shown in Figure 5.8b, the SLFQ-based switch allocator maintains a sorted priority queue, implemented as a circular buffer, for each output port to record the service order of flows. Each priority queue consists of fv comparison logic, a circular buffer, and a selection-signal decoder. When a packet of a flow enters the SA stage, the allocator first consults the flow state table to find the fv value of the corresponding flow. It then determines the service order of the packets inside the priority queue by comparing this fv value with those of all other contending flows. As VA and SA are conducted at the packet level, our flow router performs virtual cut-through switching. To reduce complexity, we adopt the low-latency priority comparator of [66], which uses a binary comparison tree based on fast circuits. The comparison results are recorded in the circular buffer. To reduce storage expense, the circular buffer stores only pointers to the flits of packets, not the flits themselves. When a flit reaches the head of the priority queue, selection signals generated by the selection-signal decoder drive the switch to transmit the flit.
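The per-output-port priority queue can be pictured with the sketch below (class and method names are ours; a binary heap stands in for the fast binary comparison tree of [66]). It only keeps (fv, VC) pointers, mirroring the circular buffer that stores pointers rather than flits.

```python
# Illustrative sketch of the per-output-port structure in the SLFQ-based switch allocator.
import heapq

class OutputPortAllocator:
    def __init__(self, depth=5):
        self.depth = depth          # constant number of circular-buffer slots
        self.queue = []             # (f_v, input_vc) pointers, not the flits themselves

    def bid(self, f_v, input_vc):
        """A VC with buffered flits and downstream credit bids with its flow's f_v."""
        if len(self.queue) < self.depth:
            heapq.heappush(self.queue, (f_v, input_vc))
            return True
        return False                # buffer full: the VC bids again next cycle

    def grant(self):
        """Return the winning VC for this cycle (smallest f_v), or None if no bids."""
        return heapq.heappop(self.queue)[1] if self.queue else None
```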

5.5.7 Avoiding per-flow buffering and deadlock

Per-flow VC avoidance: As buffers consume a significant portion of the area and power of on-chip routers [133], it is important to minimize the buffering cost. In the best-effort CMP environment, our flow router uses a small number of VCs instead of per-flow VCs, i.e., there are no exclusive VCs for all active flows passing through a router. The question is what happens when no VC is available for a flow. Similar to packet routers, which do not need per-packet VCs, when all VCs in a router are busy, back-pressure propagates to the upstream routers, where flows wait until the downstream router has free buffers. If the back-pressure reaches the source of a flow, packet injection temporarily stops to help the network recover from the resource shortage.

At this point we comment on the flow rate ρ, which is used as a reference indicating the estimated communication intensity of a flow. In the best-effort network, it may reflect neither the real flow arrival rate nor the flow service rate, because flow injection is affected by back-pressure and flow service depends on contending flows. It does, however, represent a weight for dynamic fair allocation of channel bandwidth. As such, the rate value carried in packet headers may even be amplified according to the criticality of the memory access, thread or application in order to provide differentiated services that expedite particular flows. This aspect may be investigated in future work.

Deadlock avoidance: Network deadlock occurs when two or more flows wait on one another to release their occupied resources. In the 2D mesh NoC, XY dimension-order routing keeps the network free from deadlock. To avoid protocol deadlock in the multicore architecture, read/write request messages, read/write reply messages, and coherence messages (invalidates, acknowledgements) must not be mixed in one VC, since they could otherwise block one another. For this purpose, our network is organized as three logical networks, each owning an exclusive subset of VCs while sharing the physical channels. One logical network is used only by read/write requests, another only by read/write replies, and the third only by coherence messages.
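The VC partitioning can be summarized by the short sketch below; the mapping of message classes to VC identifiers is an illustrative assumption (6 VCs per port, as in Table 5.1), not the exact assignment used in the thesis.

```python
# Sketch of the three logical networks: disjoint VC subsets over shared physical links.
VC_CLASSES = {
    "request":   (0, 1),    # read/write requests
    "reply":     (2, 3),    # read/write replies (data)
    "coherence": (4, 5),    # invalidates and acknowledgements
}

def allowed_vcs(msg_class):
    """Return the VC identifiers a message of this class may be allocated to."""
    return VC_CLASSES[msg_class]
```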

5.5.8 RTL implementation of the SLFQ scheduler

To evaluate the hardware complexity of an SLFQ router, we focus on the SLFQ-related functions; since the router otherwise follows the standard input-buffered router structure, the remaining functions are conventional. To give an accurate estimate, we made an RTL implementation of an SLFQ scheduler including the FC logic, the circular buffers and the SLFQ arbiter. We use Synopsys Design Compiler for synthesis and, to estimate the area of the circular buffers more accurately, we use CACTI 6.0 [123] to evaluate the circular buffers as on-chip SRAM and replace the corresponding results obtained via Design Compiler. In both Design Compiler and CACTI, the technology is set to 45 nm and the operating temperature to 350 K.

We first look into FC, since it is critical for the router speed as it runs in parallel with the RC, VA and SA stages. FC computes the virtual finish time fv of packets using Equation 5.3, which is relatively computation intensive. Although the equation contains a division, it can be implemented efficiently using shift operations. Substituting Equation 5.2 into Equation 5.3 gives $1/r_i = \sum_{j\in G(t)} \rho_j / (\rho_i B)$. Letting B = 1, we restrict the valid ρ values to the set {2^{-n}}, where n is a natural number, to replace the division with a shift. To prevent overflow when storing fv in width-limited registers while still following the SLFQ principle, the FC logic keeps track of relative rather than absolute virtual finish times of the packets in service.

For an SLFQ scheduler with 5 circular buffers and fv stored in 32 bits, synthesized in 45 nm technology, each scheduler operates at up to 2.2 GHz and occupies 0.0225 mm² of area, with 0.0125 mm² dedicated to the circular buffers and 0.01 mm² to the combinational logic, including the FC and the SLFQ arbiter. The SCFQ and SLFQ schedulers have the same algorithmic complexity; they differ only in the number of circular buffers and the associated control logic, which determines the exact area difference. Suppose that 16 flows pass through a router: an SCFQ scheduler then requires 16 circular buffers to implement per-flow buffering, whereas an SLFQ scheduler keeps a constant number of circular buffers, e.g., 5. Our evaluation shows that the SCFQ scheduler consumes 0.08 mm² of area, with the area for circular buffers increasing to 0.04 mm² and for combinational logic to 0.04 mm². This means that an SLFQ scheduler achieves a 3.6× area reduction compared with an SCFQ scheduler.
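The two FC simplifications above can be sketched as follows (names are illustrative): restricting ρ to powers of two turns the division in Equation 5.3 into a shift, and fv is tracked relative to the system virtual time Fv so a fixed-width register never overflows while the relative order of competing packets is preserved. The scaling by the sum of competing rates (Equation 5.2) is omitted here for brevity.

```python
# Sketch of shift-based, overflow-free f_v computation under the stated assumptions.
def service_increment(ell: int, n: int) -> int:
    """ell / rho with rho = 2**-n is simply ell shifted left by n bits."""
    return ell << n

def update_relative_fv(fv_rel_prev: int, ell: int, n: int) -> int:
    """f_v kept relative to F_v; in this frame F_v itself is always 0."""
    start = max(fv_rel_prev, 0)
    return start + service_increment(ell, n)
```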

Table 5.1: Simulation platform configuration

Item | Amount | Description
Processor | 64 cores | Alpha-based 2.0 GHz processors with out-of-order execution; 32-entry instruction queue, 64-entry load queue, 64-entry store queue, 128-entry reorder buffer.
L1-Cache | 64 banks | Private, 32 KB per core, 4-way set associative, 128 B block size, 2-cycle latency, split I/D caches, 32 MSHRs.
L2-Cache | 64 banks | Chip-wide shared, 1 MB per bank, 16-way set associative, 128 B block size, 6-cycle latency, 32 MSHRs.
Memory | 8 ranks | 4 GB DRAM, 512 MB per rank, up to 16 outstanding requests per processor, 8 memory controllers.
NoC | 64 nodes | 8×8 mesh network, X-Y dimension-order routing, 128-bit datapath; each node consists of 1 core, 1 private L1 cache, 1 shared L2 cache bank, 1 flow router and 1 flow network interface; 2-stage pipelined flow router, 6 VCs per port, 16 flits per VC.
Coherency | – | Directory-based MOESI cache coherence protocol; one cache block is 1 packet of 8 flits; one coherence control message is 1 single-flit packet.
Characterizer | 1/node | 10,000-cycle sampling window, overlapping factor N = 4.
STC | – | 8 ranking levels; all applications' ranks refreshed every 10,000 cycles (ranking interval: 10,000 cycles); packets issued in each successive 1250 cycles form a batch (batching interval: 1250 cycles).

5.6 Experiments and Results

5.6.1 Evaluation methodology

Experimental platform and benchmark setup. We implemented and integrated our framework with the full-system simulator GEM5 [14], in which the built-in network model GARNET [2] was revised with the flow network interface and the flow router. Further simulation platform configurations are listed in Table 5.1. For benchmarking, we use the PARSEC suite [13], which includes emerging applications in recognition and data mining as well as system applications mimicking large-scale multi-threaded commercial programs. To scale the PARSEC benchmarks well to 64 cores, we choose large input sets (simlarge) for all programs. For data validity, we only report results obtained from the parallel execution phase, called the Region-Of-Interest (ROI). Based on the data in [13], we divide the 12 PARSEC programs into two halves, with the first half consisting of six applications exhibiting low network packet injection rates (blackscholes, bodytrack, facesim, ferret, swaptions, vips) and the other half exhibiting high network packet injection rates (canneal, dedup, fluidanimate, freqmine, streamcluster, x264).

Comparison studies. During evaluation, we made incremental comparisons from a baseline to a representative ideal case in order to better understand the benefits of the valid combinations of our three mechanisms: rate profiling, rate inheritance and SLFQ scheduling. We also compared against a state-of-the-art technique, STC, which utilizes application-sensitive dynamic Stall Time Criticality for network resource allocation to achieve high performance and performance fairness [32]. In total, we realized and experimented with the following five comparative cases in GEM5:

(A) p-RR (packet-based round-robin scheduling): As the baseline, the CMP uses the classic packet-based NoC with round-robin (RR) channel scheduling. RR is a popular fair policy widely used in industrial practice, for example in the NoC of the latest MPPA many-core processor from Kalray [37].

(B) STC (STC-based application ranking with packet batching): STC combines two mechanisms, application ranking and packet batching, to realize priority-based resource allocation in routers. First, packets are prioritized by application ranking according to the runtime-profiled private L1 misses per instruction (MPI). Packets from an application with lower MPI, implying a lower network access rate, are considered more stall-time critical and thus enjoy higher priority. To avoid starvation of low-priority packets, packet batching, by which packets are grouped into periodic batches, is employed to prioritize older batches over younger ones. Packets in the same batch and with the same rank are scheduled with a round-robin policy. The parameters used for STC in our experiments are listed in Table 5.1.

(C) f-SLFQ (rate profiling and SLFQ scheduling): The CMP realizes the rate profiling and SLFQ channel scheduling features but without rate inheritance. Without rate inheritance, request and reply packet injection rates are profiled separately.

(D) f-SLFQ+IHT (rate profiling, SLFQ scheduling, rate inheritance): The CMP is fully featured with rate profiling, SLFQ scheduling and rate inheritance. This is the design proposal of our approach.

(E) f-WFQ+IHT (rate profiling, WFQ scheduling, rate inheritance): The CMP mimics WFQ channel scheduling with rate profiling and inheritance.
Although WFQ may not suit on-chip networks due to its high complexity, it approximates an ideal fair resource allocation scheme and thus serves as a yardstick for our evaluations.

Note that we did not evaluate p-SLFQ and f-RR cases. As SLFQ requires the rate information in flows to operate, p-SLFQ is invalid; since RR ignores the rate information, f-RR is meaningless.

Figure 5.10: System IPCs of the PARSEC programs under p-RR, STC, f-SLFQ, f-SLFQ+IHT and f-WFQ+IHT.

5.6.2 Single-application performance

Before presenting multi-application performance, we report single-application performance, i.e., one application is executed stand-alone with 64 threads running on the 64 cores, one thread per core. The purpose is to show that our technique does not slow down single-application performance but improves it.

Application IPC. Figure 5.10 reports the system IPCs (the sum of the IPCs of all threads) of all benchmark programs. The figure shows that our scheme increases the system IPCs of all programs. Specifically, f-SLFQ raises the system IPCs over p-RR and STC by 9.8% and 7.5% on average and by 12.1% and 10.7% at maximum, respectively. As a pure gain from rate inheritance, f-SLFQ+IHT further raises the system IPCs over f-SLFQ by 4.8% on average and 7.0% at maximum. The ideal case f-WFQ+IHT shows that the system IPCs of f-SLFQ+IHT could be further increased only by 3.9% on average and 5.1% at maximum. In the single-application case for STC, since all packets share the same application rank, packets of the same batch are scheduled with the round-robin policy. As shown in Figure 5.10, STC therefore enjoys only the limited performance gains brought by packet batching, achieving smaller IPC improvements than f-SLFQ.

Packet delay distribution. To give insight into the IPC improvements, we look into the network packet delay histograms of both low and high packet injection rate programs. Figure 5.11a shows the delay distribution of the benchmark bodytrack, which has a low packet injection rate and thus low NoC utilization. The average delay of p-RR, STC, f-SLFQ and f-SLFQ+IHT is 24, 24, 21 and 19 cycles, respectively. Also, p-RR has the longest tail while f-SLFQ+IHT has the shortest. Figure 5.11b shows the delay distribution of the benchmark canneal, which has a high packet injection rate and thus high NoC utilization. Since the network is more loaded, the average delay of p-RR, STC, f-SLFQ and f-SLFQ+IHT becomes 51, 51, 47 and 41 cycles, respectively. Likewise, p-RR results in the longest tail and f-SLFQ+IHT the shortest. Both cases show that the improved latency fairness of SLFQ over RR and the consistent service allocation together sharpen the latency distribution envelopes.

Figure 5.11: Packet delay histograms of (a) bodytrack and (b) canneal.

The delay histograms indeed explain the improved IPCs, since lower packet delay reduces CPU stall time. For both programs, the delay histogram of STC has a slightly different shape from, but resembles, that of p-RR; thus STC and p-RR have the same average packet delay values.

5.6.3 Multi-application performance

Figure of merit. To quantify multi-application performance, we consider both throughput and fairness. We adopt a commonly used metric, weighted speedup [49][152], denoted Θ, to measure system throughput. Θ computes the sum of the IPC slowdowns of all co-running applications. Let IPC_i^alone be the IPC of application i when it runs alone, and IPC_i^shared its IPC when it runs together with others. The relative IPC, X_i = IPC_i^shared / IPC_i^alone, is defined as the IPC slowdown of application i, i ∈ [1, M], where M is the number of co-running applications. Θ is then calculated as

$\Theta = \sum_{i=1}^{M} X_i.$    (5.4)

While Θ can reflect the influence of resource sharing on IPCs, it cannot quantify the variation of the IPC slowdowns of individual applications, which indicates the evenness of the slowdowns. This is important because a system could achieve high throughput (i.e., weighted speedup) by unfairly starving one or two applications while benefiting all the others. To complement Θ, we introduce the standard deviation of the IPC slowdowns, denoted Ω, as a metric of fairness. Let $\bar{X}$ be the average of the X_i, $\bar{X} = \sum_i X_i / M$. Ω is defined as

$\Omega = \sqrt{\frac{\sum_{i=1}^{M} (X_i - \bar{X})^2}{M}}.$    (5.5)
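The two metrics transcribe directly into code; the short sketch below (function names are ours) computes Θ and Ω from per-application stand-alone and shared IPCs.

```python
# Weighted speedup (Eq. 5.4) and fairness as slowdown standard deviation (Eq. 5.5).
from math import sqrt

def slowdowns(ipc_shared, ipc_alone):
    return [s / a for s, a in zip(ipc_shared, ipc_alone)]     # X_i

def weighted_speedup(ipc_shared, ipc_alone):
    return sum(slowdowns(ipc_shared, ipc_alone))              # Theta

def fairness(ipc_shared, ipc_alone):
    x = slowdowns(ipc_shared, ipc_alone)
    mean = sum(x) / len(x)
    return sqrt(sum((xi - mean) ** 2 for xi in x) / len(x))   # Omega
```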

Figure 5.12: Average packet delay comparison for the 6 two-program mixtures under A: p-RR, B: STC, C: f-SLFQ, D: f-SLFQ+IHT and E: f-WFQ+IHT. Each mixture pairs one low-injection and one high-injection application: Mix-1 (bodytrack, canneal), Mix-2 (facesim, dedup), Mix-3 (vips, dedup), Mix-4 (vips, freqmine), Mix-5 (swaptions, freqmine), Mix-6 (bodytrack, freqmine).

Figure 5.13: Per-node IPC for jointly executing a mixture of two programs, bodytrack and canneal (Mixture-1), under (a) p-RR, (b) STC, (c) f-SLFQ, (d) f-SLFQ+IHT and (e) f-WFQ+IHT.

By this metric, the smaller the Ω, the better the fairness; Ω = 0 means ideal performance fairness.

Two co-running applications. In the two-application experiments, we set up six comparative mixtures, each containing one high-injection and one low-injection application. For each application mixture, we compare p-RR (denoted A), STC (B), f-SLFQ (C), f-SLFQ+IHT (D) and f-WFQ+IHT (E). Figure 5.12 shows the average packet delay achieved by the five schemes. Across the different application mixtures, f-SLFQ slightly sacrifices the average delay of the high-injection applications but attains a remarkable average packet delay reduction for the low-injection applications, confirming the benefits of low-injection-rate flow acceleration exhibited in Figure 5.7b. The rate inheritance mechanism further strengthens the effectiveness of f-SLFQ, making f-SLFQ+IHT closely approach the results of the ideal case f-WFQ+IHT.

Figure 5.14: Performance and fairness of the 6 two-program mixtures: (a) weighted speedup (Θ) and (b) fairness, i.e., standard deviation of IPC slowdowns (Ω).

In STC, because packets from low-injection threads are prioritized over packets from high-injection threads, Figure 5.12 shows that STC achieves effects similar to f-SLFQ in the average delay increase/decrease among the benchmark applications. However, STC is less effective than f-SLFQ+IHT, because STC, targeting multi-programmed workloads, does not take the causality of cache-coherence-related messages into account. In this regard, consider a scenario where a low-injection thread modifies a shared data block in the last-level L2 cache and sends invalidations to multiple high-injection threads. Since STC determines the priorities of the invalidations and the corresponding replies separately, the acknowledgements from the high-injection threads retain lower priorities, which in turn delays the service of the original low-injection thread. The benefits for low-injection applications are thus hampered, making STC less efficient than f-SLFQ+IHT in handling multi-threaded applications.

Figure 5.13 shows the per-node IPC results for an example benchmark mixture, bodytrack (low injection rate) and canneal (high injection rate). Each program consists of 32 threads that are evenly distributed on half of the 64-core CMP. In Figure 5.13a, by providing the same service rate to both the high- and low-injection applications, p-RR unfairly benefits canneal with higher IPCs, leaving bodytrack with lower results. With STC in Figure 5.13b, canneal's average IPC is reduced from 1.03 to 0.70 (33% degradation), and bodytrack's average IPC rises from 0.14 to 0.35 (2.5× improvement).

Figure 5.15: Per-node IPC for jointly executing a mixture of four programs, bodytrack, canneal, vips and freqmine (Mixture-4), under (a) p-RR, (b) STC, (c) f-SLFQ, (d) f-SLFQ+IHT and (e) f-WFQ+IHT.

With f-SLFQ in Figure 5.13c, canneal's average IPC is reduced from 1.03 to 0.81 (18.4% degradation), while bodytrack's average IPC rises from 0.14 to 0.46 (3.3× improvement). In Figure 5.13d, f-SLFQ+IHT consistently improves the average IPCs of both canneal and bodytrack, to 0.87 and 0.49, respectively. Finally, Figure 5.13e shows average IPCs of 0.92 and 0.51 for canneal and bodytrack under f-WFQ+IHT.

Figure 5.14a summarizes the weighted speedup (Θ, see Equation 5.4) results. For the 6 two-program mixtures, the weighted speedup is improved on average from 1× to 1.6×. More significantly, we observe that the low injection rate programs are much improved by STC and f-SLFQ over p-RR, and further by f-SLFQ+IHT over STC and f-SLFQ. We also quantify the fairness (Ω, see Equation 5.5) in Figure 5.14b. STC and f-SLFQ greatly reduce the average standard deviation over all mixtures from p-RR's 0.34 to 0.19 and 0.17, respectively, and f-SLFQ+IHT further reduces it to 0.02, which is very close to the 0.01 of f-WFQ+IHT.

Four co-running applications. We executed nine mixtures of four programs, with each program occupying 16 cores. We form three kinds of mixtures, 3-low/1-high, 2-low/2-high and 1-low/3-high injection rate programs, with three mixtures of each kind:

3-low, 1-high: Mix-1 (swaptions, blackscholes, vips, dedup), Mix-2 (blackscholes, bodytrack, swaptions, freqmine), Mix-3 (blackscholes, bodytrack, swaptions, dedup).
2-low, 2-high: Mix-4 (bodytrack, vips, canneal, freqmine), Mix-5 (facesim, vips, freqmine, dedup), Mix-6 (vips, swaptions, streamcluster, freqmine).
1-low, 3-high: Mix-7 (swaptions, canneal, freqmine, dedup), Mix-8 (blackscholes, canneal, freqmine, dedup), Mix-9 (bodytrack, canneal, freqmine, dedup).

Figure 5.15 shows the per-node IPCs of one mixture running two high injection rate programs, canneal and freqmine, and two low injection rate programs, bodytrack and vips.

Figure 5.16: Performance and fairness of the 9 four-program mixtures: (a) weighted speedup (Θ) and (b) fairness, i.e., standard deviation of IPC slowdowns (Ω).

Compared to p-RR in Figure 5.15a, STC and f-SLFQ in Figures 5.15b and 5.15c increase the IPCs of both low injection rate programs at the expense of both high injection rate programs. With p-RR, the average IPCs of canneal, freqmine and bodytrack, vips are 0.94, 0.87 and 0.15, 0.14, respectively. With STC and f-SLFQ, the corresponding values become 0.64, 0.62, 0.32, 0.33 and 0.77, 0.76, 0.39, 0.38, respectively. f-SLFQ+IHT in Figure 5.15d further improves the respective values to 0.82, 0.78 and 0.44, 0.43, approaching the best values of 0.87, 0.83, 0.47 and 0.47 with f-WFQ+IHT in Figure 5.15e.

Figure 5.16a gives the weighted speedup of the nine four-application mixtures. The average speedup of p-RR, STC, f-SLFQ, f-SLFQ+IHT and f-WFQ+IHT is 1.84, 2.31, 2.46, 2.67 and 2.88, respectively. The average improvement of STC and f-SLFQ over p-RR is 26% and 34%, respectively, and that of f-SLFQ+IHT over p-RR is 45%. The speedup difference between f-SLFQ+IHT and f-WFQ+IHT is only 0.21. Consistently, we see that the low injection rate programs benefit most from our approach: their relative IPCs become comparable to those of the high injection rate programs. Figure 5.16b quantifies performance fairness. STC and f-SLFQ greatly reduce the average standard deviation over all mixtures from p-RR's 0.30 to 0.14 and 0.13, respectively, and f-SLFQ+IHT in turn reduces it to 0.02, which is very close to the 0.01 of f-WFQ+IHT.

The eight eight-program mixtures each combine four low-injection applications (App-1 to App-4) with four high-injection applications (App-5 to App-8): Mix-1 (ferret, facesim, vips, swaptions; fluidanimate, x264, streamcluster, freqmine), Mix-2 (ferret, facesim, vips, swaptions; x264, streamcluster, freqmine, dedup), Mix-3 (ferret, facesim, vips, swaptions; streamcluster, freqmine, dedup, canneal), Mix-4 (facesim, vips, swaptions, blackscholes; fluidanimate, x264, streamcluster, freqmine), Mix-5 (facesim, vips, swaptions, blackscholes; x264, streamcluster, freqmine, dedup), Mix-6 (facesim, vips, swaptions, blackscholes; streamcluster, freqmine, dedup, canneal), Mix-7 (vips, swaptions, blackscholes, bodytrack; fluidanimate, x264, streamcluster, freqmine), Mix-8 (vips, swaptions, blackscholes, bodytrack; x264, streamcluster, freqmine, dedup).

Figure 5.17: Performance and fairness of the 8 eight-program mixtures: (a) weighted speedup (Θ) and (b) fairness, i.e., standard deviation of IPC slowdowns (Ω).

Eight co-running applications. We further executed eight mixtures of eight programs, with each program running symmetrically on eight cores. Each mixture contains four high- and four low-injection-rate programs, as listed above. The weighted speedup results for the eight mixtures are shown in Figure 5.17a. The average speedup of p-RR, STC, f-SLFQ, f-SLFQ+IHT and f-WFQ+IHT is 1.92, 2.35 (a 22% increase), 2.48 (29%), 2.60 (35%) and 2.74 (43%), respectively. As expected, we again see that (1) all mixtures gain improvement in weighted speedup and (2) the low injection rate programs contribute most to the improvements. As shown in Figure 5.17b, the performance fairness (Ω) of p-RR, STC, f-SLFQ, f-SLFQ+IHT and f-WFQ+IHT equals 0.12, 0.065, 0.05, 0.01 and 0.007, respectively.

Figure 5.18: Accuracy of the rate characterization and prediction: (a) prediction accuracy (actual vs. predicted L1 cache miss rate) for the canneal and bodytrack programs, (b) prediction error and standard deviation for canneal (standard deviation = 0.89, i.e., 2.5% of the mean), (c) standard deviations for all 12 PARSEC programs.

5.6.4 Evaluation of rate profiling

We evaluated the shifting-window-based rate characterization and the state-based rate prediction mechanism. Figure 5.18a illustrates the difference between the actual resource access rate and the predicted value (both in terms of L1 cache miss rate) for the canneal and bodytrack programs. The figure shows that, for both programs, the rate characterization mechanism achieves high prediction fidelity within each program phase. Figure 5.18b further shows the prediction error and the standard deviation (using canneal as an example) to illustrate the dispersion of the predicted values. With a standard deviation equal to only 2.5% of the mean, the rate profiling mechanism precisely characterizes application dynamism. Finally, Figure 5.18c summarizes the standard deviations of all 12 PARSEC programs. As these deviations are insensitive to the programs' injection rates, we confirm that our rate profiling scheme is valid and effective.
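A simplified sketch of such a shifting-window characterizer is given below. It uses the window length and overlapping factor of Table 5.1 (10,000 cycles, N = 4), but the class, its methods and the prediction rule (average over the retained sub-windows) are illustrative assumptions, not the exact design evaluated above.

```python
# Illustrative shifting-window rate characterizer with overlapping sub-windows.
from collections import deque

class RateCharacterizer:
    def __init__(self, window_cycles=10_000, overlap_factor=4):
        self.sub_cycles = window_cycles // overlap_factor
        self.counts = deque(maxlen=overlap_factor)   # event counts per sub-window
        self.current = 0
        self.elapsed = 0

    def record_event(self):
        self.current += 1                            # e.g., one L1 miss / injected packet

    def tick(self, cycles=1):
        self.elapsed += cycles
        if self.elapsed >= self.sub_cycles:          # sub-window boundary: shift the window
            self.counts.append(self.current)
            self.current, self.elapsed = 0, 0

    def predicted_rate(self):
        """Predicted injection rate (events/cycle) for the next interval."""
        if not self.counts:
            return 0.0
        return sum(self.counts) / (len(self.counts) * self.sub_cycles)
```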

Chapter 6

Concluding remarks

This chapter summarizes the thesis and discusses future work.

6.1 Summary

Performance-characteristics-based NoC power management

In Chapter 2, we have introduced a novel on-chip DVFS strategy that improves energy efficiency in CMP systems. To address the power over-/under-provisioning problems of network-metric-based approaches, our DVFS technique adjusts NoC V/F levels according to each thread's direct performance indicators, namely its message generation rate and data sharing metrics. The proposed DVFS adopts a lightweight voting-based approach in which each thread votes for a preferred V/F level according to its own performance indicators. The V/F votes are encapsulated into packet headers and propagated into the network. With a few routers composing a V/F sharing region, a region-based V/F controller collects the votes of all passing packets and makes the final V/F decision by majority vote in a "democratic" way.

By realizing and evaluating our technique on a 64-core CMP architecture simulated in GEM5 running all PARSEC benchmarks, we find that the workload-sensitive improvements in energy efficiency correlate consistently with the network utilization and data-sharing characteristics of the benchmarks. There are no or only marginal improvements in four programs (blackscholes, canneal, ferret and x264), because these programs have the same tendency in network utilization and data sharing, being either both low or both high, and thus provide little opportunity to improve upon a network-metric-based DVFS mechanism. For the other eight programs, however, our technique gives a significant boost in energy efficiency, because these programs have inconsistent tendencies in network utilization and data sharing, one being low and the other high or vice versa, and thus offer salient opportunities for our technique to win over a pure network-metric-based DVFS approach. Indeed, compared to a router-buffer-occupancy-based DVFS mechanism, our technique

enhances the network energy efficiency in MPPJ by 9.7% on average and 21% at maximum, as well as the system energy efficiency in MIPJ by 17.1% on average and 35% at maximum. We therefore believe that our technique gives a new insight (the necessity of jointly considering message generation rate and data sharing properties) and opens a viable approach (lightweight per-thread V/F voting with region-based DVFS) to designing highly efficient DVFS schemes for many-core networks.
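The region-level decision logic summarized above can be pictured with the following sketch; the class, the discrete V/F operating points and the control-period handling are assumptions for illustration, not the thesis hardware.

```python
# Illustrative sketch of vote collection and majority decision in a V/F region.
from collections import Counter

VF_LEVELS = [0.6, 0.8, 1.0, 1.2]      # example voltage/frequency operating points

class RegionDvfsController:
    def __init__(self):
        self.votes = Counter()

    def on_packet(self, vf_vote):
        self.votes[vf_vote] += 1       # vote extracted from the packet header

    def decide(self):
        """Majority decision for the next period; default to the lowest level if idle."""
        level = self.votes.most_common(1)[0][0] if self.votes else VF_LEVELS[0]
        self.votes.clear()
        return level
```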

NoC HW/SW co-design for performance enhancement

In Chapter 3, we have shown that competition overhead, as a major source of a thread's blocking time, can largely exceed the execution time of the critical section itself and become the dominating factor limiting the performance of parallel programs. After analyzing the behavior of the queue spinlock, we have proposed a software-hardware cooperative technique, simple to implement, that effectively reduces the competition overhead of threads accessing critical sections. The basic idea is to opportunistically maximize the chance that a thread wins the critical section access right in the low-overhead spinning phase. This is achieved by embedding a thread's remaining number of retries (RTR) before it is switched into the high-overhead sleeping phase as a priority parameter in its locking requests; routers use this priority level to resolve physical/virtual channel contention so as to expedite the locking request packets with smaller RTR. As such, threads with a smaller RTR have a higher chance of reaching the lock variable earlier and hence can secure the lock earlier, reducing the thread blocking time.

With extensive full-system experiments in GEM5 running both the PARSEC and SPEC OMP2012 benchmarks, we have demonstrated the effectiveness of our technique in reducing the competition overhead without influencing the critical section execution itself, thus improving the ROI finish time of all benchmark programs. We have also shown the scalable gains of our technique and how a proper priority level is chosen for high performance improvement with low hardware overhead.
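The arbitration rule can be expressed compactly as below; the data layout, the tie-breaking by packet age and the function name are illustrative assumptions rather than the exact router logic.

```python
# Sketch of RTR-prioritized channel arbitration for lock-request packets.
def arbitrate(candidates):
    """
    candidates: list of (rtr, arrival_cycle, vc_id) for lock-request packets
    competing for the same output channel. Smaller RTR wins; ties fall back
    to age (older packet first) to avoid starvation.
    """
    return min(candidates, key=lambda c: (c[0], c[1]))[2]

# Example: VC 2 wins because its thread is one retry away from sleeping.
winner = arbitrate([(5, 100, 0), (1, 103, 2), (3, 99, 1)])
```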

In-network packet generation for performance enhancement

In Chapter 4, we have presented the iNPG technique to reduce the serialized critical section access time so as to expedite multi-threaded applications running on NoC-based cache-coherent many-cores. Based on the observation that the lock coherence latency lies directly on the critical path and dominates the overhead of lock spinning, iNPG shortens the lock coherence latency by turning long-range centralized coherence traffic into short-range distributed coherence traffic. iNPG is enabled by the "big" router, a conventional router enhanced with active cache-coherence packet generation capability. Extensive full-system simulation results in GEM5 running PARSEC and SPEC OMP2012 show that iNPG achieves significant reductions in lock coherence overhead and program runtime for both spin-lock and queue spin-lock primitives, while the latest counterpart technique, OCOR [169], is only effective for the queue spin-lock. Since the two techniques are orthogonal, iNPG can be combined with OCOR, achieving the best improvements for both spin-lock and queue spin-lock primitives.

NoC bandwidth allocation for performance fairness

In Chapter 5, we have presented a new flow-centric approach to address the performance fairness problem in NoC-based CMPs. Our rationale is that efficient and fair resource sharing in a best-effort network must continuously, proportionally and consistently respect a program's runtime resource access demand at a fine time granularity. To this end, we have introduced the notion of the aggregate flow, on which we build three coherent mechanisms. Rate profiling characterizes the packet generation rates from the L1/L2 caches, which are inserted into the packets of request messages. Rate inheritance allows causal reply packets to inherit the rates of the corresponding initial request or next-level request messages. Finally, the routers implement SLFQ scheduling, adapted from SCFQ [57], using each flow's runtime rate information for proportional channel bandwidth allocation without per-flow queuing.

We have realized and evaluated our mechanisms in GEM5 with the PARSEC benchmarks running on a 64-node CMP. Compared to two CMPs, one with round-robin packet scheduling (p-RR) and the other with stall-time-criticality (STC) based application-aware packet prioritization [32], our proposed approach (f-SLFQ+IHT) improves the system IPCs of all benchmark programs running stand-alone by 14.6% and 12.3% on average, respectively. When co-running two-, four- and eight-program mixtures, we observe that (1) the respective average weighted speedup is enhanced by 47%, 45% and 35% against p-RR and by 19%, 16% and 11% against STC; (2) the respective average standard deviation of IPC slowdowns (i.e., IPC fairness) is improved by 17×, 15× and 12× against p-RR and by 10×, 7× and 6.5× against STC; and (3) the respective IPC fairness is reduced to 0.02, 0.02 and 0.007, approaching ideal fairness (zero standard deviation).

6.2 Future work

The ideas in this thesis can be extended in the following directions of future work.

Temperature-aware DVFS mechanism for tile-based many-cores

The benefits of the tile-based architecture in building highly scalable many-cores have been widely demonstrated [10, 65, 35, 84, 154, 8]. Two features stand out: (1) a tile comprises a core (processing core and private cache) and an uncore (shared cache and network-on-chip router), whose performance characteristics may or may not be aligned; (2) a many-core organizes dozens or more of such tiles on the same chip. These two features make designing a satisfactory temperature-aware DVFS mechanism for tile-based many-cores much more complicated, because multiple design objectives need to be fulfilled. First, when adjacent tiles are grouped into a V/F region, the temperature-aware V/F decision should reflect the performance interests of the resident tiles; otherwise, the DVFS policy can be costly, sacrificing considerable application performance for limited power and thermal gains. Second, as technology scales and the physical distance between different V/F regions shrinks, the local temperature of a region significantly affects the temperatures of its neighbors. A temperature-aware DVFS policy should therefore consider the thermal interaction between neighboring V/F regions. As such, temperature-aware DVFS control for tile-based many-cores is by nature a multi-objective decision-making problem, in which the V/F decision depends on the multilateral relations between core/uncore performance, which often have different characteristics, and the temperatures of both the local and the neighboring V/F regions. Furthermore, the temperature-aware DVFS controller should have a hardware-efficient implementation to avoid large power and area overheads on chip.

In the future, we can develop a temperature-aware DVFS mechanism for tile-based many-cores. The idea is to dynamically adjust the V/F level of a V/F region according to 1) the characteristics of the programs running on each core, 2) the communication characteristics of the uncore, and 3) the temperatures of both the local and the neighboring regions. Compared to previous temperature-aware DVFS techniques, such a mechanism should demonstrate the necessity of jointly considering core/uncore performance interests and local/neighboring regions' temperatures, and advocate a viable approach to designing a highly efficient yet thermally safe DVFS strategy. A possible decision rule is sketched below.
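The following hypothetical sketch combines a performance-driven V/F vote with local and neighbour temperatures; the thresholds, the linear thermal coupling and the per-level temperature increment are assumptions for illustration only, not a proposed design.

```python
# Hypothetical multi-objective V/F decision for a tile-based V/F region.
def choose_vf(perf_vote, local_temp, neighbour_temps,
              vf_levels=(0.6, 0.8, 1.0, 1.2), temp_limit=85.0, coupling=0.2):
    """perf_vote: V/F level preferred by the resident tiles' performance interests."""
    # assumed linear coupling to the hottest neighbouring region
    projected = local_temp + coupling * (max(neighbour_temps) - local_temp)
    for level in sorted(vf_levels, reverse=True):
        # assumed 10 degC of headroom consumed per V/F step above the lowest level
        if level <= perf_vote and projected + 10.0 * (level - min(vf_levels)) < temp_limit:
            return level                      # highest thermally safe level <= the vote
    return min(vf_levels)                     # fall back to the lowest level
```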

Collective critical path based performance enhancement for many-cores

In this thesis, we have examined and reduced the competition overhead (COH) in critical-section-based multi-threaded applications. One direct way to extend the work is to investigate the overheads of other shared-memory synchronization mechanisms such as barriers, condition variables and pipelined parallelism. We can define the collective critical path as the sequence of consecutive synchronization events across peer threads; it reflects the longest execution path of a multi-threaded program in the parallel execution phase. To shorten the collective critical path, we can propose a mechanism implementing collective-critical-path-aware performance tuning. For example, the mechanism accelerates synchronization events on the collective critical path to shorten the application execution time, and slows down non-synchronization phases that are not on the collective critical path to save power.

NoC-Aided Acceleration of Software Transactional Memory Programs

Transactional memory enables optimistic concurrent execution of atomic blocks. A transaction can execute optimistically but needs to abort and retry if it conflicts with another transaction on the same shared variable. As concurrency grows, the abort penalty, i.e., the time that aborted transactions spend repeating previous work, dominates software transactional memory (STM) programs on many-core systems. The idea of upgrading NoC router functionality from passive packet transmission to an active role that helps reduce lock coherence overhead can be directly extended to reducing the abort penalty in STM. To do so, we can exploit a fetch-then-repair policy to resolve data inconsistency between conflicting transactions' local data sets and the globally shared data in the last-level cache (LLC) that has been modified by the committed transaction.

Since Network-on-Chip (NoC) routers deliver the version control and data messages, we can collect such transaction traffic intelligence in the many-core NoC and later apply it to reduce the transaction abort ratio. For this purpose, we can develop transaction-aware routers (TARs) to record the up-to-date STM version control number and the corresponding shared data memory address to which the winning transaction has committed. When the version control acknowledgements from conflicting transactions later arrive at TARs, the TARs generate auxiliary data requests using the collected intelligence to fetch the up-to-date data from the shared-data LLC into the conflicting transactions' local read sets. After the data fetching is done, the conflicting transactions execute callbacks to repair their local write sets using the fetched data. Because data consistency is maintained, the conflicting transactions do not abort and thus avoid the abort penalty.

Bibliography

[1] Dennis Abts, Natalie D. Enright Jerger, John Kim, Dan Gibson, and Mikko H. Lipasti. Achieving Predictable Performance through Better Memory Controller Placement in Many-Core CMPs. In International Symposium on Computer Architecture (ISCA), pages 451–461, 2009.

[2] Niket Agarwal, Tushar Krishna, Li Shiuan Peh, and Niraj K. Jha. GARNET: A Detailed On-Chip Network Model Inside A Full-System Simulator. In International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 33–42, 2009.

[3] Aditya Agrawal, Josep Torrellas, and Sachin Idgunji. Xylem: Enhancing Vertical Thermal Conduction in 3D Processor-Memory Stacks. In International Symposium on Microarchitecture (MICRO), pages 546–559, 2017.

[4] Thomas E. Anderson. The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors. IEEE Transactions on Parallel and Distributed Systems (TPDS), 1(1):6–16, 1990.

[5] A. Ansari, A. Mishra, Jianping Xu, and J. Torrellas. Tangle: Route-Oriented Dynamic Voltage Minimization for Variation-Afflicted, Energy-Efficient On-Chip Networks. In International Symposium on High-Performance Computer Architecture (HPCA), pages 440–451, 2014.

[6] Semiconductor Industry Association. International Technology Roadmap for Semiconductors (ITRS) 2.0. 2015.

[7] B. Egilmez, G. Memik, S. Ogrenci-Memik, and O. Ergin. User-Specific Skin Temperature-Aware DVFS for Smartphones. In Design, Automation & Test in Europe (DATE), pages 1217–1220, 2015.

[8] Jonathan Balkind, Michael McKeown, Yaosheng Fu, Tri Nguyen, Yanqi Zhou, Alexey Lavrov, Mohammad Shahrad, Adi Fuchs, Samuel Payne, Xiaohua Liang, Matthew Matl, and David Wentzlaff. OpenPiton: An Open Source Manycore Research Framework. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 217–232, 2016.


[9] Min Bao, Alexandru Andrei, Petru Eles, and Zebo Peng. On-line Thermal Aware Dynamic Voltage Scaling for Energy Optimization with Frequency/Temperature Dependency Consideration. In Design Automation Conference (DAC), pages 490–495, 2009.

[10] Shane Bell, Bruce Edwards, John Amann, Rich Conlin, Kevin Joyce, Vince Leung, John MacKay, Mike Reif, Liewei Bao, John Brown, Matthew Mattina, Chyi Chang Miao, Carl Ramey, David Wentzlaff, Walker Anderson, Ethan Berger, Nat Fairbanks, Durlov Khan, Froilan Montenegro, Jay Stickney, and John Zook. TILE64 Processor: A 64-Core SoC with Mesh Interconnect. In IEEE International Solid-State Circuits Conference (ISSCC), pages 88–89, 598, 2008.

[11] L. Benini and G. De Micheli. Networks on Chips: A New SoC Paradigm. Computer, 35(1):70–78, 2002.

[12] Jon C. R. Bennett and Hui Zhang. Hierarchical Packet Fair Queueing Algorithms. IEEE/ACM Transactions on Networking, 5(5):675–689, 1997.

[13] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. The PARSEC Benchmark Suite: Characterization and Architectural Implications. In International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 72–81, 2008.

[14] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay Vaish, Mark D. Hill, and David A. Wood. The Gem5 Simulator. SIGARCH Computer Architecture News, 39(2):1–7, 2011.

[15] Susmit Biswas, Mohit Tiwari, Timothy Sherwood, Luke Theogarajan, and Frederic T Chong. Fighting Fire with Fire: Modeling the Datacenter-scale Effects of Targeted Superlattice Thermal Management. In International Sym- posium on Computer Architecture (ISCA), pages 331–340, 2011.

[16] Paul Bogdan, Radu Marculescu, and Siddharth Jain. Dynamic Power Management for Multi-Domain System-on-Chip Platforms: An Optimal Control Approach. ACM Transactions on Design Automation of Electronic Systems (TODAES), 18(4):1–20, 2013.

[17] Paul Bogdan, Radu Marculescu, Siddharth Jain, and Rafael Tornero Gavila. An Optimal Control Approach to Power Management for Multi-Voltage and Frequency Islands Multiprocessor Platforms Under Highly Variable Workloads. In International Symposium on Networks on Chip (NoCS), pages 35–42, 2012.

[18] Haseeb Bokhari, Haris Javaid, Muhammad Shafique, Jörg Henkel, and Sri Parameswaran. Malleable NoC: Dark Silicon Inspired Adaptable Network-on-Chip. In Design, Automation & Test in Europe (DATE), pages 1245–1248, 2015.

[19] Shekhar Borkar and Andrew A. Chien. The Future of Microprocessors. Communications of the ACM, 54(5):67–77, 2011.

[20] Nicholas P. Carter, Aditya Agrawal, Shekhar Borkar, Romain Cledat, Howard David, Dave Dunning, Joshua Fryman, Ivan Ganev, Roger A. Golliver, Rob Knauerhase, Richard Lethin, Benoit Meister, Asit K. Mishra, Wilfred R. Pinfold, Justin Teller, Josep Torrellas, Nicolas Vasilache, Ganesh Venkatesh, and Jianping Xu. Runnemede: An Architecture for Ubiquitous High-Performance Computing. In International Symposium on High-Performance Computer Architecture (HPCA), pages 198–209, 2013.

[21] Mario R. Casu and Paolo Giaccone. Rate-Based vs Delay-Based Control for DVFS in NoC. In Design, Automation & Test in Europe (DATE), pages 1096–1101, 2015.

[22] Kevin Kai-Wei Chang, Rachata Ausavarungnirun, Chris Fallin, and Onur Mutlu. HAT: Heterogeneous Adaptive Throttling for On-Chip Networks. In International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), pages 9–18, 2012.

[23] Barbara Chapman, Gabriele Jost, and Ruud van der Pas. Using OpenMP: Portable Shared Memory Parallel Programming. The MIT Press, 2007.

[24] Guancheng Chen and Per Stenström. Critical Lock Analysis: Diagnosing Critical Section Bottlenecks in Multithreaded Applications. In International Conference on High Performance Computing, Networking, Storage and Analysis (SC), pages 71:1–71:11, 2012.

[25] Xi Chen, Zheng Xu, Hyungjun Kim, Paul Gratz, Jiang Hu, Michael Kishinevsky, and Umit Ogras. In-Network Monitoring and Control Policy for DVFS of CMP Networks-on-Chip and Last Level Caches. ACM Transactions on Design Automation of Electronic Systems (TODAES), 18(4):47:1–47:21, 2013.

[26] Xi Chen, Zheng Xu, Hyungjun Kim, Paul V. Gratz, Jiang Hu, Michael Kishinevsky, Umit Ogras, and Raid Ayoub. Dynamic Voltage and Frequency Scaling for Shared Resources in Multicore Processor Designs. In Design Automation Conference (DAC), pages 114:1–114:7, 2013.

[27] Yu-Hsin Chen, Joel Emer, and Vivienne Sze. Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks. In International Symposium on Computer Architecture (ISCA), pages 367–379, 2016.

[28] D. N. Jayasimha. Distributed Synchronizers. In International Conference on Parallel Processing (ICPP), pages 23—-27, 1988.

[29] W. J. Dally and B. Towles. Route Packets, Not Wires: On-Chip Intercon- nection Networks. In Design Automation Conference (DAC), pages 684–689, 2001.

[30] William Dally and Brian Towles. Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., 2003.

[31] Reetuparna Das, Rachata Ausavarungnirun, Onur Mutlu, Akhilesh Kumar, and Mani Azimi. Application-to-Core Mapping Policies to Reduce Memory System Interference in Multi-Core Systems. In International Symposium on High-Performance Computer Architecture (HPCA), pages 107–118, 2013.

[32] Reetuparna Das, Onur Mutlu, Thomas Moscibroda, and Chita R. Das. Application-Aware Prioritization Mechanisms for On-chip Networks. In In- ternational Symposium on Microarchitecture (MICRO), pages 280–291, 2009.

[33] Reetuparna Das, Onur Mutlu, Thomas Moscibroda, and Chita R. Das. Aér- gia: Exploiting Packet Latency Slack in On-Chip Networks. In International Symposium on Computer Architecture (ISCA), pages 106–116, 2010.

[34] Scott Davidson, Shaolin Xie, Christopher Torng, Khalid Al-Hawai, Austin Rovinski, Tutu Ajayi, Luis Vega, Chun Zhao, Ritchie Zhao, Steve Dai, Aporva Amarnath, Bandhav Veluri, Paul Gao, Anuj Rao, Gai Liu, Rajesh K. Gupta, Zhiru Zhang, Ronald Dreslinski, Christopher Batten, and Michael Bedford Taylor. The Celerity Open-Source 511-Core RISC-V Tiered Accelerator Fab- ric: Fast Architectures and Design Methodologies for Fast Chips. IEEE Mi- cro, 38(2):30–41, 2018.

[35] Bhavya K. Daya, Chia-Hsin Owen Chen, Suvinay Subramanian, Woo-Cheol Kwon, Sunghyun Park, Tushar Krishna, Jim Holt, Anantha P. Chandrakasan, and Li-Shiuan Peh. SCORPIO: A 36-Core Research Chip Demonstrating Snoopy Coherence on a Scalable Mesh NoC With In-Network Ordering. In International Symposium on Computer Architecture (ISCA), pages 25–36, 2014.

[36] B D de Dinechin, R Ayrignac, P Beaucamps, P Couvert, B Ganne, P G de Massas, F Jacquet, S Jones, N M Chaisemartin, F Riss, and T Strudel. A Clustered Architecture for Embedded and Accelerated Applications. In IEEE High Performance Extreme Computing Conference (HPEC), pages 1–6, 2013.

[37] Benoît Dupont de Dinechin, Yves Durand, Duco van Amstel, and

Alexandre Ghiti. Guaranteed Services of the NoC of a Manycore Processor. In Proceedings of the 2014 International Workshop on Network on Chip Architectures (NoCArc), pages 0–5, 2014.

[38] Alan Demers, Srinivasan Keshav, and Scott Shenker. Analysis and Simulation of a Fair Queueing Algorithm. ACM SIGCOMM Computer Communication Review, 19(4):1–12, 1989.

[39] R. H. Dennard, F. H. Gaensslen, V. L. Rideout, E. Bassous, and A. R. LeBlanc. Design of Ion-Implanted MOSFET’s with Very Small Physical Di- mensions. Journal of Solid-State Circuits (JSSCC), 9(5):256–268, 1974.

[40] S Dighe, S R Vangal, P Aseron, S Kumar, T Jacob, K A Bowman, J Howard, J Tschanz, V Erraguntla, N Borkar, V K De, and S Borkar. Within-die Variation-Aware Dynamic Voltage Frequency Scaling With Optimal Core Al- location and Thread Hopping for the 80-Core TeraFLOPS Processor. Journal of Solid-State Circuits (JSSCC), 46(1):184–193, 2011.

[41] James Donald and Margaret Martonosi. Techniques for Multicore Thermal Management: Classification and New Exploration. In International Sympo- sium on Computer Architecture (ISCA), pages 78–88, 2006.

[42] Kristof Du Bois, Stijn Eyerman, Jennifer B Sartor, and Lieven Eeckhout. Criticality Stacks: Identifying Critical Threads in Parallel Programs Using Synchronization Behavior. In International Symposium on Computer Archi- tecture (ISCA), pages 511–522, 2013.

[43] Thomas Ebi, Mohammad Abdullah Al Faruque, and Jörg Henkel. TAPE: Thermal-Aware Agent-Based Power Economy for Multi/Many-Core Architectures. In IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pages 302–309, 2009.

[44] Eiman Ebrahimi, Chang Joo Lee, Onur Mutlu, and Yale N. Patt. Fairness via Source Throttling: A Configurable and High-performance Fairness Substrate for Multi-core Memory Systems. In Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 335–346, 2010.

[45] Eiman Ebrahimi, Rustam Miftakhutdinov, Chris Fallin, Chang Joo Lee, José A Joao, Onur Mutlu, and Yale N Patt. Parallel Application Mem- ory Scheduling. In International Symposium on Microarchitecture (MICRO), pages 362–373, 2011.

[46] Noel Eisley, Li Shiuan Peh, and Li Shang. In-Network Cache Coherence. In International Symposium on Microarchitecture (MICRO), pages 321–332, 2006.

[47] Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankar- alingam, and Doug Burger. Dark Silicon and the End of Multicore Scaling. In International Symposium on Computer Architecture (ISCA), 2011.

[48] S Eyerman, K Du Bois, and L Eeckhout. Speedup Stacks: Identifying Scaling Bottlenecks in Multi-Threaded Applications. In International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 145–155, 2012.

[49] Stijn Eyerman and Lieven Eeckhout. System-Level Performance Metrics for Multiprogram Workloads. IEEE Micro, 28(3):42–53, 2008.

[50] Stijn Eyerman and Lieven Eeckhout. A Counter Architecture for Online DVFS Profitability Estimation. IEEE Transactions on Computers (TC), 59 (11):1576–1583, 2010. ISSN 0018-9340.

[51] Stijn Eyerman and Lieven Eeckhout. Modeling Critical Sections in Amdahl’s Law and Its Implications for Multicore Design. In International Symposium on Computer Architecture (ISCA), pages 362–370, 2010.

[52] Stijn Eyerman and Lieven Eeckhout. Fine-Grained DVFS Using On- Chip Regulators. ACM Transation on Architecture and Code Optimization (TACO), 8(1):1:1–1:24, 2011.

[53] Gregory F. Pfister, William Brantley, David A. George, Steve L. Harvey, Wally J. Kleinfelder, Kevin P. McAuliffe, Evelin S. Melton, Alan Norton, and Jodi Weiss. The IBM Research Parallel Processor Prototype (RP3): Intro- duction and Architecture. In International Conference on Parallel Processing (ICPP), pages 764–771, 1985.

[54] Mohammad Al Faruque, Janmartin Jahn, Thomas Ebi, and Joerg Henkel. Runtime Thermal Management Using Software Agents for Multi- and Many-Core Architectures. IEEE Design & Test of Computers, 27(6):58–68, 2010.

[55] Siddharth Garg, Diana Marculescu, Radu Marculescu, and Umit Ogras. Technology-Driven Limits on DVFS Controllability of Multiple Voltage-Frequency Island Designs: A System-Level Perspective. In Design Automation Conference (DAC), pages 818–821, 2009.

[56] Waclaw Godycki, Christopher Torng, Ivan Bukreyev, Alyssa Apsel, and Christopher Batten. Enabling Realistic Fine-Grain Voltage Scaling with Reconfigurable Power Distribution Networks. In International Symposium on Microarchitecture (MICRO), pages 381–393, 2014.

[57] S. Jamaloddin Golestani. A Self-Clocked Fair Queueing Scheme for Broadband Applications. In International Conference on Computer Communications (INFOCOM), pages 636–646, 1994.

[58] Mohamed Gomaa, Michael D Powell, and T N Vijaykumar. Heat-and-Run: Leveraging SMT and CMP to Manage Power Density Through the Operating System. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 260–270, 2004.

[59] James R. Goodman, Mary K. Vernon, and Philip J. Woest. Efficient Synchronization Primitives for Large-Scale Cache-Coherent Multiprocessors. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 64–75, 1989. ISBN 0-89791-300-0.

[60] Kees Goossens, John Dielissen, and Andrei Radulescu. Æthereal Network on Chip: Concepts, Architectures, and Implementations. IEEE Design & Test of Computers, 22(5):414–421, 2005.

[61] Allan Gottlieb, Ralph Grishman, Clyde P. Kruskal, Kevin P. McAuliffe, Larry Rudolph, and Marc Snir. The NYU Ultracomputer - Designing an MIMD Shared Memory Parallel Computer. In International Symposium on Computer Architecture (ISCA), pages 239–254, 1993.

[62] Paul Gratz, Boris Grot, and Stephen W. Keckler. Regional Congestion Awareness for Load Balance in Networks-on-Chip. In International Symposium on High-Performance Computer Architecture (HPCA), pages 203–214, 2008.

[63] Gary Graunke and Shreekant Thakkar. Synchronization Algorithms for Shared-Memory Multiprocessors. Computer, 23(6):60–69, 1990. ISSN 00189162.

[64] Ed Grochowski and Murali Annavaram. Energy per Instruction Trends in Intel Microprocessors. Technology@Intel Magazine, pages 1–4, 2006.

[65] Boris Grot, Joel Hestness, Stephen W. Keckler, and Onur Mutlu. Kilo-NOC: A Heterogeneous Network-on-chip Architecture for Scalability and Service Guarantees. In International Symposium on Computer Architecture (ISCA), pages 401–412, 2011.

[66] Boris Grot, Stephen W. Keckler, and Onur Mutlu. Preemptive Virtual Clock: A Flexible, Efficient, and Cost-effective QoS Scheme for Networks-on-chip. In International Symposium on Microarchitecture (MICRO), pages 268–279, 2009.

[67] Fei Guo, Yan Solihin, Li Zhao, and Ravishankar Iyer. A Framework for Providing Quality of Service in Chip Multi-Processors. In International Symposium on Microarchitecture (MICRO), pages 343–355, 2007.

[68] Jim Handy. The Cache Memory Book. Academic Press Professional, 1993.

[69] Heather Hanson, Stephen W. Keckler, Soraya Ghiasi, K. Rajamani, Freeman Rawson, and Juan Rubio. Thermal Response to DVFS: Analysis with an Intel Pentium M. In International Symposium on Low Power Electronics and Design (ISLPED), pages 219–224, 2007.

[70] Vinay Hanumaiah and Sarma Vrudhula. Temperature-Aware DVFS for Hard Real-Time Applications on Multicore Processors. IEEE Transactions on Computers (TC), 61(10):1484–1494, 2012.

[71] Vinay Hanumaiah, Sarma Vrudhula, and Karam S. Chatha. Maximizing Performance of Thermally Constrained Multi-Core Processors by Dynamic Voltage and Frequency Control. In IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pages 310–313, 2009.

[72] Vinay Hanumaiah, Sarma Vrudhula, and Karam S. Chatha. Performance Optimal Online DVFS and Task Migration Techniques for Thermally Constrained Multi-Core Processors. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 30(11):1677–1690, 2011.

[73] A. Hemani, A. Jantsch, S. Kumar, A. Postula, J. Oberg, M. Millberg, and D. Lindqvist. Network on a Chip: An Architecture for Billion Transistor Era. In IEEE NorChip Conference (NORCHIP), pages 166–173, 2000.

[74] Danny Hendler, Itai Incze, Nir Shavit, and Moran Tzafrir. Flat Combining and the Synchronization-Parallelism Tradeoff. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 355–364, 2010. ISBN 9781450300797.

[75] Danny Hendler, Itai Incze, Nir Shavit, and Moran Tzafrir. Scalable Flat-Combining based Synchronous Queues. In Lecture Notes in Computer Science, pages 79–93, 2010. ISBN 3642157629.

[76] Jörg Henkel, Heba Khdr, Santiago Pagani, and Muhammad Shafique. New Trends in Dark Silicon. In Design Automation Conference (DAC), pages 119:1–119:6, 2015.

[77] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach, Fifth Edition. Morgan Kaufmann Publishers Inc., 2011.

[78] Robert Hesse and Natalie Enright Jerger. Improving DVFS in NoCs with Coherence Prediction. In International Symposium on Networks on Chip (NoCS), pages 24:1–24:8, 2015.

[79] Mark D Hill and Michael R Marty. Amdahl’s Law in the Multicore Era. Computer, 41(7):33–38, 2008.

[80] Kenneth Hoste and Lieven Eeckhout. Microarchitecture-Independent Workload Characterization. IEEE Micro, 27(3):63–72, 2007.

[81] J Howard, S Dighe, Y Hoskote, S Vangal, D Finan, G Ruhl, D Jenkins, H Wilson, N Borkar, G Schrom, F Pailet, S Jain, T Jacob, S Yada, S Marella, P Salihundam, V Erraguntla, M Konow, M Riepen, G Droege, J Lindemann, M Gries, T Apel, K Henriss, T Lund-Larsen, S Steibl, S Borkar, V De, and Van Der Wijngaart. A 48-Core IA-32 Message-Passing Processor with DVFS in 45nm CMOS. Journal of Solid-State Circuits (JSSCC), 46(1):173–183, 2011.

[82] Wei Huang, M Allen-Ware, J B Carter, E Cheng, K Skadron, and M R Stan. Temperature-Aware Architecture: Lessons and Opportunities. IEEE Micro, 31(3):82–86, 2011.

[83] Wei Huang, Mircea R. Stan, and Kevin Skadron. Parameterized Physical Compact Thermal Modeling. IEEE Transactions on Components and Packaging Technologies, 28(4):615–622, 2005.

[84] Intel. Intel Xeon Phi Datasheet. 2015.

[85] Victor Jimenez, Alper Buyuktosunoglu, Pradip Bose, Francis P. O’Connell, Francisco Cazorla, and Mateo Valero. Increasing Multicore System Efficiency through Intelligent Bandwidth Shifting. In International Symposium on High-Performance Computer Architecture (HPCA), pages 39–50, 2015.

[86] José A Joao, M Aater Suleman, Onur Mutlu, and Yale N Patt. Bottleneck Identification and Scheduling in Multithreaded Applications. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 223–234, 2012.

[87] José A Joao, M Aater Suleman, Onur Mutlu, and Yale N Patt. Utility-Based Acceleration of Multithreaded Applications on Asymmetric CMPs. In International Symposium on Computer Architecture (ISCA), pages 154–165, 2013.

[88] Michael Kadin and Sherief Reda. Frequency and Voltage Planning for Multi-Core Processors Under Thermal Constraints. In IEEE International Conference on Computer Design (ICCD), pages 463–470, 2008.

[89] Michael Kadin, Sherief Reda, and Augustus Uht. Central vs. Distributed Dynamic Thermal Management for Multi-Core Processors: Which One is Better? In ACM Great Lakes Symposium on VLSI (GLSVLSI), pages 137–140, 2009.

[90] Harshad Kasture, Davide B Bartolini, Nathan Beckmann, and Daniel Sanchez. Rubik: Fast Analytical Power Management for Latency-Critical Systems. In International Symposium on Microarchitecture (MICRO), pages 598–610, 2015.

[91] Georgios Keramidas, Vasileios Spiliopoulos, and Stefanos Kaxiras. Interval-based Models for Run-time DVFS Orchestration in Superscalar Processors. In International Conference on Computing Frontiers (CF), pages 287–296, 2010. ISBN 978-1-4503-0044-5.

[92] Omer Khan and Sandip Kundu. Microvisor: A Runtime Architecture for Thermal Management in Chip Multiprocessors. In Transactions on High-Performance Embedded Architectures and Compilers, pages 84–110, 2011.

[93] S. Karen Khatamifard, Longfei Wang, Weize Yu, Selçuk Köse, and Ulya R. Karpuzcu. ThermoGater: Thermally-Aware On-Chip Voltage Regulation. In International Symposium on Computer Architecture (ISCA), pages 120–132, 2017.

[94] H Khdr, S Pagani, M Shafique, and J Henkel. Thermal Constrained Resource Management for Mixed ILP-TLP Workloads in Dark Silicon Chips. In Design Automation Conference (DAC), pages 179:1–179:6, 2015.

[95] Jae Min Kim, Young Geun Kim, and Sung Woo Chung. Stabilizing CPU Frequency and Voltage for Temperature-Aware DVFS in Mobile Devices. IEEE Transactions on Computers (TC), 64(1):286–292, 2015.

[96] Wonyoung Kim, Meeta S. Gupta, Gu-Yeon Wei, and David Brooks. System Level Analysis of Fast, Per-Core DVFS using On-Chip Switching Regulators. In International Symposium on High-Performance Computer Architecture (HPCA), pages 123–134, 2008.

[97] Joonho Kong, Sung Woo Chung, and Kevin Skadron. Recent Thermal Management Techniques for Microprocessors. In ACM Computing Surveys (CSUR), pages 13:1–13:42, 2012.

[98] Tushar Krishna and Li-Shiuan Peh. Single-Cycle Collective Communication Over A Shared Network Fabric. In International Symposium on Networks on Chip (NoCS), pages 1–8, 2014.

[99] Tushar Krishna, Li-Shiuan Peh, Bradford M Beckmann, and Steven K Reinhardt. Towards the Ideal On-chip Fabric for 1-to-many and Many-to-1 Communication. In IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 71–82, 2011.

[100] Nagesh B Lakshminarayana, Jaekyu Lee, and Hyesoon Kim. Age-Based Scheduling for Asymmetric Multiprocessors. In International Conference on High Performance Computing, Networking, Storage and Analysis (SC), pages 25:1–25:12, 2009.

[101] J. Lee and U. Ramachandran. Synchronization with Multiprocessor Caches. In International Symposium on Computer Architecture (ISCA), pages 27–37, 1990.

[102] Jae W. Lee, Man Cheuk Ng, and Krste Asanović. Globally-Synchronized Frames for Guaranteed Quality-of-Service in On-Chip Networks. In International Symposium on Computer Architecture (ISCA), pages 89–100, 2008.

[103] Jong Sung Lee, K Skadron, and Sung Woo Chung. Predictive Temperature-Aware DVFS. IEEE Transactions on Computers (TC), 59(1):127–133, 2010.

[104] Jingwen Leng, Tayler Hetherington, Ahmed ElTantawy, Syed Gilani, Nam Sung Kim, Tor M. Aamodt, and Vijay Janapa Reddi. GPUWattch: Enabling Energy Optimizations in GPGPUs. In International Symposium on Computer Architecture (ISCA), pages 487–498, 2013.

[105] Bin Li, Li-Shiuan Peh, Li Zhao, and Ravi Iyer. Dynamic QoS Management for Chip Multiprocessors. ACM Transactions on Architecture and Code Optimization (TACO), 9(3):17:1–17:29, 2012.

[106] Bin Li, Li Zhao, Ravi Iyer, Li Shiuan Peh, Michael Leddige, Michael Espig, Seung Eun Lee, and Donald Newell. CoQoS: Coordinating QoS-Aware Shared Resources in NoC-Based SoCs. Journal of Parallel and Distributed Computing, 71(5):700–713, 2011.

[107] Li Shang, Li-Shiuan Peh, and Niraj K. Jha. Dynamic Voltage Scaling with Links for Power Optimization of Interconnection Networks. In International Symposium on High-Performance Computer Architecture (HPCA), pages 91–102, 2003.

[108] Sheng Li, Jung Ho Ahn, R D Strong, J B Brockman, D M Tullsen, and N P Jouppi. McPAT: An Integrated Power, Area, and Timing Modeling Framework for Multicore and Manycore Architectures. In International Symposium on Microarchitecture (MICRO), pages 469–480, 2009.

[109] Guang Liang and Axel Jantsch. Adaptive Power Management for the On-Chip Communication Network. In EUROMICRO Conference on Digital System Design (DSD), pages 649–656, 2006.

[110] Guanglei Liu, Ming Fan, and Gang Quan. Neighbor-Aware Dynamic Thermal Management for Multi-Core Platform. In Design, Automation & Test in Europe (DATE), pages 187–192, 2012.

[111] Zhonghai Lu and Yi Wang. Dynamic Flow Regulation for IP Integration on Network-on-Chip. In International Symposium on Networks on Chip (NoCS), pages 115–123, 2012.

[112] Kai Ma, Xue Li, Ming Chen, and Xiaorui Wang. Scalable Power Control for Many-Core Architectures Running Multi-Threaded Applications. In International Symposium on Computer Architecture (ISCA), pages 449–460, 2011.

[113] Paul Marchal, Diederik Verkest, Adelina Shickova, Francky Catthoor, Frédéric Robert, and Anthony Leroy. Spatial Division Multiplexing: A Novel Approach for Guaranteed Throughput on NoCs. In International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), pages 81–86, 2005.

[114] John M. Mellor-Crummey and Michael L. Scott. Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors. ACM Transactions on Computer Systems (ToCS), 9(1):21–65, 1991.

[115] Rustam Miftakhutdinov, Eiman Ebrahimi, and Yale N Patt. Predicting Performance Impact of DVFS for Realistic Memory Systems. In International Symposium on Microarchitecture (MICRO), pages 155–165, 2012.

[116] Timothy N Miller, Xiang Pan, Renji Thomas, Naser Sedaghati, and Radu Teodorescu. Booster: Reactive Core Acceleration for Mitigating the Effects of Process Variation and Application Imbalance in Low-Voltage Chips. In International Symposium on High-Performance Computer Architecture (HPCA), pages 27–38, 2012.

[117] Asit K Mishra, Reetuparna Das, Soumya Eachempati, Ravi Iyer, N Vijaykrishnan, and Chita R Das. A Case for Dynamic Frequency Tuning in On-Chip Networks. In International Symposium on Microarchitecture (MICRO), pages 292–303, 2009.

[118] Mark Mitchell, Jeffrey Oldham, and Alex Samuel. Advanced Linux Programming. New Riders Publishing, 2001.

[119] T Y Morad, A Kolodny, and U C Weiser. Scheduling Multiple Multithreaded Applications on Asymmetric and Symmetric Chip Multiprocessors. In International Symposium on Parallel Architectures, Algorithms and Programming (PAAP), pages 65–72, 2010.

[120] Rajarshi Mukherjee and Seda Ogrenci Memik. Physical-Aware Frequency Selection for Dynamic Thermal Management in Multi-core Systems. In International Conference on Computer-Aided Design, Digest of Technical Papers (ICCAD), pages 547–552, 2006.

[121] Matthias S Müller, John Baron, William C Brantley, Huiyu Feng, Daniel Hackenberg, Robert Henschel, Gabriele Jost, Daniel Molka, Chris Parrott, Joe Robichaux, Pavel Shelepugin, Matthijs van Waveren, Brian Whitney, and Kalyan Kumaran. SPEC OMP2012 - An Application Benchmark Suite for Parallel Systems Using OpenMP. In International Conference on OpenMP in a Heterogeneous World (IWOMP), pages 223–236, 2012.

[122] Sai Prashanth Muralidhara, Lavanya Subramanian, Onur Mutlu, Mahmut Kandemir, and Thomas Moscibroda. Reducing Memory Interference in Multicore Systems via Application-aware Memory Channel Partitioning. In International Symposium on Microarchitecture (MICRO), pages 374–385, 2011.

[123] Naveen Muralimanohar, Rajeev Balasubramonian, and Norm Jouppi. Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0. In International Symposium on Microarchitecture (MICRO), pages 3–14, 2007.

[124] Kyle J. Nesbit, Nidhi Aggarwal, James Laudon, and James E. Smith. Fair Queuing Memory Systems. In International Symposium on Microarchitecture (MICRO), pages 208–219, 2006.

[125] Kyle J. Nesbit, Miquel Moreto, Francisco J. Cazorla, Alex Ramirez, Mateo Valero, and James E. Smith. Multicore Resource Management. IEEE Micro, 28(3):6–16, 2008.

[126] Sebastien Nussbaum. AMD ‘TRINITY’ APU. In HOT CHIPS 2012, 2012.

[127] George P. Nychis, Chris Fallin, Thomas Moscibroda, Onur Mutlu, and Srinivasan Seshan. On-Chip Networks from A Networking Perspective: Congestion and Scalability in Many-Core Interconnects. In ACM SIGCOMM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM), pages 407–418, 2012.

[128] Umit Ogras, Radu Marculescu, and Diana Marculescu. Variation-Adaptive Feedback Control for Networks-on-Chip with Multiple Clock Domains. In Design Automation Conference (DAC), pages 614–619, 2008.

[129] Umit Y Ogras, Radu Marculescu, Diana Marculescu, and Eun Gu Jung. Design and Management of Voltage-Frequency Island Partitioned Networks-on-Chip. IEEE Transactions on Very Large Scale Integration Systems (TVLSI), 17(3):330–341, 2009. ISSN 1063-8210.

[130] Jin Ouyang and Yuan Xie. LOFT: A High Performance Network-on-Chip Providing Quality-of-Service Support. In International Symposium on Microarchitecture (MICRO), pages 409–420, 2010.

[131] Santiago Pagani, Heba Khdr, Waqaas Munawar, Jian-Jia Chen, Muhammad Shafique, Minming Li, and Jörg Henkel. TSP: Thermal Safe Power: Efficient Power Budgeting for Many-core Systems in Dark Silicon. In International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), pages 1–10, 2014.

[132] Abhay K. Parekh and Robert G. Gallager. A Generalized Processor Sharing Approach to Flow Control in Integrated Services Networks: The Single-Node Case. Transactions on Networking, 1(3):344–357, 1993.

[133] Sunghyun Park, Tushar Krishna, Chia-Hsin Chen, Bhavya Daya, Anantha Chandrakasan, and Li-Shiuan Peh. Approaching the Theoretical Limits of A Mesh NoC with A 16-Node Chip Prototype in 45nm SOI. In Design Automation Conference (DAC), pages 398–405, 2012.

[134] Li-Shiuan Peh and William J. Dally. A Delay Model and Speculative Architecture for Pipelined Routers. In International Symposium on High-Performance Computer Architecture (HPCA), pages 255–266, 2001.

[135] Gregory F. Pfister and V. Alan Norton. Hot-Spot Contention and Combining in Multistage Interconnection Networks. IEEE Transactions on Computers (TC), C-34(10):943–948, 1985. ISSN 0018-9340.

[136] Diego Puschini, Fabien Clermidy, Pascal Benoit, Gilles Sassatelli, and Lionel Torres. Temperature-Aware Distributed Run-Time Optimization on MP-SoC Using Game Theory. In IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pages 375–380, 2008.

[137] A Raghavan, L Emurian, Lei Shao, M Papaefthymiou, K P Pipe, T F Wenisch, and M M K Martin. Utilizing Dark Silicon to Save Energy with Computational Sprinting. IEEE Micro, 33(5):20–28, 2013.

[138] Abbas Rahimi, Mostafa E Salehi, Siamak Mohammadi, and Sied Mehdi Fakhraie. Low-Energy GALS NoC with FIFO-Monitoring Dynamic Voltage Scaling. Microelectronics Journal, 42(6):889–896, 2011.

[139] D Reed and R Kanodia. Synchronization with Eventcounts and Sequencers. Communications of the ACM, 22(2):115–123, 1979. ISSN 00010782.

[140] Randall D Rettberg, William R Crowther, Philip P Carvey, and Raymond S Tomlinson. The Monarch Parallel Processor Hardware Design. Computer, 23(4):18–28, 1990.

[141] Jennifer L. Rexford, Albert G. Greenberg, and Flavio G. Bonomi. Hardware-Efficient Fair Queueing Architectures for High-Speed Networks. In International Conference on Computer Communications (INFOCOM), pages 638–646, 1996.

[142] Efraim Rotem, Alon Naveh, Avinash Ananthakrishnan, Eliezer Weissmann, and Doron Rajwan. Power-Management Architecture of the Intel Microarchitecture Code-Named Sandy Bridge. IEEE Micro, 32(2):20–27, 2012.

[143] B Rountree, D K Lowenthal, M Schulz, and B R de Supinski. Practical Performance Prediction Under Dynamic Voltage Frequency Scaling. In International Green Computing Conference and Workshops (IGCC), pages 1–8, 2011.

[144] Hanrijanto Sariowan, Rene L. Cruz, and George C. Polyzos. Scheduling for Quality of Service Guarantees via Service Curves. In International Conference on Computer Communications and Networks (ICCCN), pages 512–520, 1995.

[145] Akbar Sharifi, Shekhar Srikantaiah, Asit K. Mishra, Mahmut Kandemir, and Chita R. Das. METE: Meeting End-to-End QoS in Multicores through System-Wide Resource Management. ACM SIGMETRICS Performance Evaluation Review, 39(1):13, 2011.

[146] Hao Shen, Jun Lu, and Qinru Qiu. Learning based DVFS for Simultaneous Temperature, Performance and Energy Management. In International Symposium on Quality Electronic Design (ISQED), pages 747–754, 2012.

[147] Matt Skach, Manish Arora, Chang-Hong Hsu, Qi Li, Dean Tullsen, Lingjia Tang, and Jason Mars. Thermal Time Shifting: Leveraging Phase Change Materials to Reduce Cooling Costs in Warehouse-scale Computers. In International Symposium on Computer Architecture (ISCA), pages 439–449, 2015.

[148] K. Skadron, T. Abdelzaher, and M. R. Stan. Control-Theoretic Techniques and Thermal-RC Modeling for Accurate and Localized Dynamic Thermal Management. In International Symposium on High-Performance Computer Architecture (HPCA), pages 17–28, 2002.

[149] K. Skadron, M.R. Stan, W. Huang, and D. Tarjan. Temperature-Aware Microarchitecture. In International Symposium on Computer Architecture (ISCA), pages 2–13, 2003.

[150] Kevin Skadron, Mircea R. Stan, Wei Huang, Sivakumar Velusamy, Karthik Sankaranarayanan, and David Tarjan. Temperature-Aware Computer Systems: Opportunities and Challenges. IEEE Micro, 23(6):52–61, 2003.

[151] Kevin Skadron, Mircea R. Stan, Karthik Sankaranarayanan, Wei Huang, Sivakumar Velusamy, and David Tarjan. Temperature-Aware Microarchitecture: Modeling and Implementation. ACM Transactions on Architecture and Code Optimization (TACO), 1(1):94–125, 2004. ISSN 15443566.

[152] Allan Snavely and Dean M. Tullsen. Symbiotic Jobscheduling for a Simultaneous Multithreaded Processor. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 234–244, 2000.

[153] Avinash Sodani. Intel Xeon Phi Processor ‘Knights Landing’ Architectural Overview. In International Conference on High Performance Computing, Networking, Storage and Analysis (SC), 2015.

[154] Avinash Sodani, Roger Gramunt, and Jesus Corbal. Knights Landing: Second-Generation Intel Xeon Phi Product. IEEE Micro, 36(2):34–46, 2016.

[155] Daniel J. Sorin, Mark D. Hill, and David A. Wood. A Primer on Memory Consistency and Cache Coherence. Synthesis Lectures on Computer Architecture. Morgan Kaufmann Publishers Inc., 2011.

[156] Vassos Soteriou, Noel Eisley, and Li-Shiuan Peh. Software-Directed Power-Aware Interconnection Networks. ACM Transactions on Architecture and Code Optimization (TACO), 4(1):1:1–1:40, 2007.

[157] Lavanya Subramanian, Vivek Seshadri, Arnab Ghosh, Samira Khan, and Onur Mutlu. The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory. In International Symposium on Microarchitecture (MICRO), pages 62–75, 2015.

[158] Lavanya Subramanian, Vivek Seshadri, Yoongu Kim, Ben Jaiyen, and Onur Mutlu. MISE: Providing Performance Predictability and Improving Fairness in Shared Main Memory Systems. In International Symposium on High-Performance Computer Architecture (HPCA), pages 639–650, 2013.

[159] M Aater Suleman, Onur Mutlu, Moinuddin K. Qureshi, and Yale N. Patt. An Asymmetric Multi-Core Architecture for Accelerating Critical Sections. HPS Technical Report, TR-HPS-2008-003, 2008.

[160] M Aater Suleman, Onur Mutlu, Moinuddin K Qureshi, and Yale N Patt. Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 60–70, 2009.

[161] M Aater Suleman, Onur Mutlu, Moinuddin K. Qureshi, and Yale N. Patt. Accelerating Critical Section Execution with Multicore Architectures. IEEE Micro, 30(1):60–70, 2010.

[162] M Aater Suleman, Moinuddin K Qureshi, Khubaib, and Yale N Patt. Feedback-Directed Pipeline Parallelism. In International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 147–156, 2010.

[163] Mithuna Thottethodi, Alvin R. Lebeck, and Shubhendu S. Mukherjee. Self-Tuned Congestion Control for Multiprocessor Networks. In International Symposium on High-Performance Computer Architecture (HPCA), pages 107–118, 2001.

[164] Kenzo Van Craeynest, Aamer Jaleel, Lieven Eeckhout, Paolo Narvaez, and Joel Emer. Scheduling Heterogeneous Multi-Cores Through Performance Impact Estimation (PIE). In International Symposium on Computer Architecture (ISCA), pages 213–224, 2012.

[165] Augusto Vega, Alper Buyuktosunoglu, Heather Hanson, Pradip Bose, and Srinivasan Ramani. Crank It Up or Dial It Down: Coordinated Multiprocessor Frequency and Folding Control. In International Symposium on Microarchitecture (MICRO), pages 210–221, 2013.

[166] Xiaodong Wang and José F. Martínez. XChange: A Market-Based Approach to Scalable Dynamic Multi-Resource Allocation in Multicore Architectures. In International Symposium on High-Performance Computer Architecture (HPCA), pages 113–125, 2015.

[167] Jae-Yeon Won, Xi Chen, P Gratz, Jiang Hu, and V Soteriou. Up By Their Bootstraps: Online Learning in Artificial Neural Networks for CMP Uncore Power Management. In International Symposium on High-Performance Computer Architecture (HPCA), pages 308–319, 2014.

[168] Mingli Xie, Dong Tong, Kan Huang, and Xu Cheng. Improving System Throughput and Fairness Simultaneously in Shared Memory CMP Systems via Dynamic Bank Partitioning. In International Symposium on High-Performance Computer Architecture (HPCA), pages 344–355, 2014.

[169] Yuan Yao and Zhonghai Lu. Opportunistic Competition Overhead Reduction for Expediting Critical Section in NoC Based CMPs. In International Symposium on Computer Architecture (ISCA), pages 279–290, 2016.

[170] Pen Chung Yew, Nian Feng Tzeng, and Duncan H. Lawrie. Distributing Hot-Spot Addressing in Large-Scale Multiprocessors. IEEE Transactions on Computers (TC), C-36(4):388–395, 1987. ISSN 00189340.

[171] Hui Zhang. Service Disciplines for Guaranteed Performance Service in Packet-Switching Networks. Proceedings of the IEEE, 83(10):1374–1396, 1995.

[172] Sushu Zhang and KS Chatha. Thermal-Aware Task Sequencing on Embedded Processors. In Design Automation Conference (DAC), pages 585–590, 2010.