Power and Performance Optimization for Network-On-Chip Based Many-Core Processors
YUAN YAO

Doctoral Thesis in Information and Communication Technology (INFKOMTE)
School of Electrical Engineering and Computer Science
KTH Royal Institute of Technology
Stockholm, Sweden 2019

KTH School of Electrical Engineering and Computer Science
TRITA-EECS-AVL-2019:44
Electrum 229, SE-164 40 Stockholm, SWEDEN
ISBN 978-91-7873-182-4

Academic dissertation which, with the permission of KTH Royal Institute of Technology (Kungl Tekniska högskolan), is submitted for public examination for the degree of Doctor of Technology in Information and Communication Technology on Friday, 23 August 2019, at 9.00 in Sal B, Electrum, Kungl Tekniska högskolan, Kistagången 16, Kista.

© Yuan Yao, May 27, 2019
Printed by: Universitetsservice US AB

Abstract

Network-on-Chip (NoC) is emerging as a critical shared architecture for CMPs (Chip Multi-/Many-Core Processors) running parallel and concurrent applications. As the core count scales up and the transistor size shrinks, optimizing power and performance for the NoC opens new research challenges. Since the NoC can potentially consume 20–40% of the entire chip power [20, 40, 81], its power efficiency has emerged as one of the main design constraints in today's and future high-performance CMPs.

For NoC power management, we propose a novel on-chip DVFS technique that adjusts the V/F level of each NoC region according to the V/F levels voted for by communicating threads. A thread periodically votes for a preferred NoC V/F level that best suits its individual performance interests. The final DVFS decision of each region is then made democratically by a region DVFS controller, based on the majority of the votes it receives.
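As a rough illustration of this voting scheme (a sketch only, not the thesis implementation; the three-level V/F scale and the tie-breaking rule are assumptions made here), a per-region controller could tally the per-thread votes collected in one voting interval as follows:

// Minimal sketch of majority-vote DVFS for one NoC region.
// Assumptions (not from the thesis): three V/F levels, ties broken upward.
#include <array>
#include <cstddef>
#include <vector>

enum class VfLevel { Low = 0, Medium = 1, High = 2 };

VfLevel decide_region_vf(const std::vector<VfLevel>& thread_votes) {
    std::array<std::size_t, 3> tally{};            // vote count per V/F level
    for (VfLevel v : thread_votes)
        ++tally[static_cast<std::size_t>(v)];
    std::size_t best = 0;                          // pick the most-voted level;
    for (std::size_t lvl = 1; lvl < tally.size(); ++lvl)
        if (tally[lvl] >= tally[best]) best = lvl; // >= biases ties toward performance
    return static_cast<VfLevel>(best);
}

A region controller would invoke decide_region_vf once per voting interval with the votes gathered from the threads communicating through that region; the sketch shows only the majority step.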
Mutually exclusive locks are pervasive shared-memory synchronization primitives. For advanced locks such as the Linux queue spinlock, which comprises a low-overhead spinning phase and a high-overhead sleeping phase, we show that the lock primitive may create very high competition overhead (COH), the time threads spend competing with each other for the next critical-section grant. For performance enhancement, we propose a software-hardware cooperative mechanism that opportunistically maximizes the chance of a thread winning the critical section in the low-overhead spinning phase and minimizes the chance of winning it in the high-overhead sleeping phase, so that COH is significantly reduced. In addition, we observe that the cache invalidation-acknowledgement round-trip delay between the home node storing the critical-section lock and the cores competing for the lock can heavily degrade application performance. To reduce this high lock coherence overhead (LCO), we propose in-network packet generation (iNPG), which turns passive "normal" NoC routers into active "big" ones that can not only transmit but also generate packets to perform early invalidation and collect invalidation acknowledgements (inv-acks). iNPG effectively shortens the protocol round-trip delay and thus largely reduces LCO for various locking primitives.

To enhance performance fairness when running multiple multi-threaded programs on a single CMP, we develop the concept of an aggregate flow, which refers to a sequence of associated data and cache coherence flows issued by the same thread. Based on the aggregate flow concept, we propose three coherent mechanisms to efficiently achieve performance isolation: rate profiling, rate inheritance and flow arbitration. Rate profiling dynamically characterizes thread performance and communication needs. Rate inheritance allows a data or coherence reply flow to inherit the characteristics of its associated data or coherence request flow, so that a consistent bandwidth allocation policy is applied to all sub-flows of the same aggregate flow. Flow arbitration uses a proven scheduling policy, self-clocked fair queueing (SCFQ) [57], to achieve rate-proportional arbitration among different aggregate flows. Our approach successfully achieves balanced performance isolation with different mixtures of applications.
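As a purely illustrative reading of SCFQ-based flow arbitration (the flow weights, packet lengths, and software data structures below are assumptions of this sketch, not the hardware arbiter of the thesis), each arriving packet is tagged with a virtual finish time and the packet with the smallest tag is served first:

// Sketch of self-clocked fair queueing (SCFQ) arbitration among aggregate flows.
#include <algorithm>
#include <map>
#include <stdexcept>
#include <utility>
#include <vector>

struct Packet { int flow; double length; double finish_tag; };

class ScfqArbiter {
public:
    explicit ScfqArbiter(std::map<int, double> weights) : weight_(std::move(weights)) {}

    // On packet arrival: F = max(virtual_time, F_prev(flow)) + length / weight(flow).
    void enqueue(int flow, double length) {
        double start = std::max(vtime_, last_finish_[flow]);
        double tag = start + length / weight_.at(flow);
        last_finish_[flow] = tag;
        queue_.push_back({flow, length, tag});
    }

    // When the output port is free: serve the packet with the smallest finish tag
    // and advance the self-clocked virtual time to that tag.
    Packet dequeue() {
        if (queue_.empty()) throw std::runtime_error("no packet queued");
        auto it = std::min_element(queue_.begin(), queue_.end(),
            [](const Packet& a, const Packet& b) { return a.finish_tag < b.finish_tag; });
        Packet served = *it;
        queue_.erase(it);
        vtime_ = served.finish_tag;
        return served;
    }

private:
    std::map<int, double> weight_;       // aggregate-flow id -> assigned rate share
    std::map<int, double> last_finish_;  // aggregate-flow id -> last finish tag
    std::vector<Packet> queue_;          // packets awaiting arbitration
    double vtime_ = 0.0;                 // self-clocked virtual time
};

With weights proportional to the rates assigned by rate profiling, each aggregate flow receives bandwidth in proportion to its rate, which is the rate-proportional property the flow arbitration relies on.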
Keywords: Many-Core Processor, Network-on-Chip, Performance, Power Management, DVFS, Shared Memory Synchronization, Hardware/Software Co-Design, Cache Coherency, Performance Isolation

Sammanfattning

Network-on-Chip (NoC) is a crucial shared architecture for CMPs (Chip Multi-/Many-Core Processors) running parallel and concurrent applications. As the number of cores grows and transistors shrink, optimizing power and performance for the NoC presents new research challenges. Since it can potentially consume 20–40% of the chip's power, NoC power efficiency has become one of the most important design constraints in today's and future high-performance CMPs.

For power management in NoCs, we propose a new on-chip DVFS technique that can adjust each NoC region's V/F according to V/F levels voted for by communicating threads. A thread periodically votes for a preferred NoC V/F level that best suits its individual performance interests. The final DVFS decision for each region is made democratically by a regional DVFS controller, according to the majority of the votes it receives.

Mutually exclusive locks are a pervasive mechanism for shared-memory synchronization. For advanced locks such as the Linux queue spinlock, which comprises a low-overhead spinning phase and a high-overhead sleeping phase, we show that this mechanism can cause very high competition overhead (COH), the time threads spend competing with each other for the next critical-section grant. To improve performance, we propose a software-hardware cooperative mechanism that can opportunistically maximize the chance that a thread wins the critical section during the low-overhead spinning phase rather than during the high-overhead sleeping phase, so that COH is reduced significantly. We further observe that the cache invalidation-acknowledgement round-trip delay between the home node storing the critical-section lock and the cores competing for the lock can heavily degrade application performance. To reduce this high lock coherence overhead (LCO), we propose in-network packet generation (iNPG), which turns passive "normal" NoC routers into active "big" ones that can not only transmit but also generate packets to perform early invalidation and collect invalidation acknowledgements. iNPG effectively shortens the protocol's round-trip delay and thus reduces the LCO of various lock primitives.

To improve performance fairness when several multi-threaded programs run on a single CMP, we use the concept of an aggregate flow to refer to a sequence of associated data and cache coherence flows issued by the same thread. Based on this concept, we propose three coherent mechanisms to efficiently achieve performance isolation: rate profiling, rate inheritance and flow arbitration. Rate profiling dynamically characterizes a thread's performance and communication needs. Rate inheritance allows a data or coherence reply to inherit the characteristics of its associated data or coherence request flow, so that a consistent bandwidth allocation policy can be applied to all sub-flows within the same aggregate flow. Flow arbitration uses a proven scheduling policy called self-clocked fair queueing (SCFQ) to achieve rate-proportional arbitration among different aggregate flows. The approach successfully achieves balanced performance isolation for different mixtures of applications.

Keywords: Many-Core Processor, Network-on-Chip, Performance, Power Management, DVFS, Shared Memory Synchronization, Hardware/Software Co-Design, Cache Coherency, Performance Isolation

To my family.

Acknowledgements

Research trends come and go quickly in the fast-tempo computer architecture community. However, the bond of friendship I share with many people has remained abiding and has allowed me to finish this dissertation. No eloquence can fully express my gratitude and indebtedness for the help I have received from my colleagues and friends, as well as from my beloved family.

My first thanks go to my main supervisor, Assoc. Prof. Zhonghai Lu at KTH. You are consistently compelling, trustworthy, and supportive. The tenacity you displayed helped me stay resolute and keep finding a way through the swamp of questions and unknowns. I could not be more satisfied with my research, thanks to the thorough, exhaustive, and enjoyable collaboration with you.

My second thanks go to my co-supervisor, Prof. Elena Dubrova at KTH. All my efforts would have led to nothing without you keeping everything on the right track. Thank you also for your instructions and tips on thesis writing.

Special thanks to Assoc. Prof. Johnny Öberg for being the advance reviewer and revising the Swedish abstract of the thesis. I have endeavoured with great exertion to improve the work so that your strict requirements for quality are all met. Thank you also for introducing me to the universe of whisky; smoky single malt is now my best weapon against anxiety and bad luck.

I would like to thank the opponent, Asst. Prof. Trevor E. Carlson (NUS Computing, Singapore), and all the grading board members, Prof. Stefanos Kaxiras (Uppsala University, Sweden), Prof. Diana Göhringer (Technische Universität Dresden, Germany), and Assoc. Prof. Jie Han (University of Alberta, Canada), for their critical comments and professional suggestions on the PhD thesis.

I would like to thank all my colleagues