Efficient Memory Access and Synchronization in Noc-Based

Efficient Memory Access and Synchronization in NoC-based Many-core Processors XIAOWEN CHEN Doctoral Thesis in Information and Communication Technology (INFKOMTE) School of Electrical Engineering and Computer Science KTH Royal Institute of Technology Stockholm, Sweden 2019 KTH School of Electrical Engineering and Computer Science TRITA-EECS-AVL-2019:2 SE-164 40 Kista ISBN 978-91-7873-051-3 SWEDEN Akademisk avhandling som med tillstånd av Kungl Tekniska högskolan framlägges till offentlig granskning för avläggande av doktorsexamen i Informations-och Kom- munikationsteknik Fredag den 01 Februari 2019 klockan 09:00 i Sal A, Electrum, Kungl Tekniska högskolan, Kistagången 16, Kista. © Xiaowen Chen, January 2019 Tryck: Universitetsservice US AB iii Abstract In NoC-based many-core processors, memory subsystem and synchronization mechanism are always the two important design aspects, since mining parallelism and pursuing higher performance require not only optimized memory management but also efficient synchronization mechanism. Therefore, we are motivated to research on efficient memory access and synchronization in three topics, namely, efficient on-chip memory organization, fair shared memory access, and efficient many-core synchronization. One major way of optimizing the memory performance is constructing a suitable and efficient memory organization. A distributed memory organization is more suitable to NoC-based many-core processors, since it features good scalability. We envision that it is essential to support Distributed Shared Memory (DSM) because of the huge amount of legacy code and easy program- ming. Therefore, we first adopt the microcoded approach to address DSM is- sues, aiming for hardware performance but maintaining the flexibility of pro- grams. Second, we further optimize the DSM performance by reducing the virtual-to-physical address translation overhead. In addition to the general- purpose memory organization such as DSM, there exists special-purpose memory organization to optimize the performance of application-specific memory access. We choose Fast Fourier Transform (FFT) as the target application, and propose a multi-bank data memory specialized for FFT computation. In 3D NoC-based many-core processors, because processor cores and mem- ories reside in different locations (center, corner, edge, etc.) of different lay- ers, memory accesses behave differently due to their different communication distances. As the network size increases, the communication distance difference of memory accesses becomes larger, resulting in unfair memory access performance among different processor cores. This unfair memory access phenomenon may lead to high latencies of some memory accesses, thus negatively affecting the overall system performance. Therefore, we are motivated to study on-chip memory and DRAM access fairness in 3D NoC-based many-core processors through narrowing the round-trip latency difference of memory accesses as well as reducing the maximum memory access latency. Barrier synchronization is used to synchronize the execution of parallel processor cores. Conventional barrier synchronization approaches such as master-slave, all-to-all, tree-based, and butterfly are algorithm oriented. As many processor cores are networked on a single chip, contended synchronization requests may cause large performance penalty. Motivated by this, different from the algorithm-based approaches, we choose another direction (i.e., exploiting efficient communication) to address the barrier synchronization problem. We propose cooperative communication as a means and com- bine it with the master-slave algorithm and the all-to-all algorithm to achieve efficient many-core barrier synchronization. Besides, a multi-FPGA imple- mentation case study of fast many-core barrier synchronization is conducted. Keywords: Many-core, Network-on-Chip, Distributed Shared Memory, Microcode, Virtual-to-physical Address Translation, Memory Access Fairness, Barrier Synchronization, Cooperative Communication iv Sammanfattning I 2D/3D NoC-baserade processorer med många kärnor är minnesundersy- stem och synkroniseringsmekanismen alltid de två viktiga designaspekterna, eftersom datautvninningsparallellism och högre prestanda inte bara kräver optimerad minneshantering utan också en effektiv synkroniseringsmekanism. Vi är därför motiverade att undersöka effektiv minnesåtkomst och synkronisering inom tre områden, nämligen effektiv minnesorganisation på chip, rättvist fördelad minnesåtkomst och effektiv synkronisering med många kärnor. Ett viktigt sätt att optimera minnesprestandan är att bygga en lämplig och effektiv minnesorganisation. En distribuerad minnesorganisation är mer lämplig för NoC-baserade processorer med många kärnor, eftersom den har god skalbarhet. Vi ser att det är viktigt att stödja Distributed Shared Me- mory (DSM), dels på grund av den enorma mängden legacy-kod, och dels pga enkel programmering. Därför använder vi först det mikrokodade tillvä- gagångssättet för att ta itu med DSM-problemen, syftandes till hårdvarans effektivitet men bibehållandes programmets flexibilitet. För det andra optime- rar vi DSM-prestandan ytterligare genom att minska den virtuella till fysiska adressöversättningen. Förutom den allmänna minnesorganisationen, finns det särskilda minnesorganisationer för att optimera prestandan för applikations- specifik minnesåtkomst. Vi väljer Fast Fourier Transform (FFT) som tillämp- ning och föreslår ett multibankat dataminne specialiserat på FFT-beräkning. I 3D-NoC-baserade processorer med många kärnor beter sig minnesåt- komsten olika för varje kärna på grund av att de olika kärnornas kommuni- kationsavstånd är olika, eftersom processorkärnor och minnen ligger på olika platser (centrum, hörn, kant, etc.) av olika lager. När nätverksstorleken ökar blir skillnaden i kommunikationsdistans för minnesåtkomst större, vilket resulterar i olika (orättvisa) fördröjningar vid minnesåtkomst mellan olika processorkärnor. Detta orättvisa minnesåtkomstfenomen kan leda till höga latenser för vissa minnesåtkomst, vilket på så sätt negativt påverkar syste- mets totala prestanda. Därför är vi motiverade att studera on-chip-minnen och rättvist fördelade åtkomst av DRAM i 3D-NoC-baserade processorer med många kärnor genom att minska det sk round-trip tidsintervallet för minne- såtkomst så väl som att reducera den maximala minnesåtkomstfördröjningen. Barriärsynkronisering används för att synkronisera exekveringen av pa- rallella processorkärnor. Konventionella barriärsynkroniseringsmetoder som master-slave, alla-till-alla, trädbaserade och butterfly är alla algoritmorien- terade. Eftersom många processorkärnor är ihopkopplade på ett enda chip, kan synkroniserade förfrågningar leda till stor prestationsförlust. Motiverad av detta, till skillnad från de algoritmbaserade tillvägagångssätten, väljer vi en annan riktning (dvs utnyttjar effektiv kommunikation) för att hantera barriärsynkroniseringsproblemet. Vi föreslår kooperativ kommunikation och kombinerar det med master-slave och alla-till-alla-algoritmen för att uppnå effektiv fler-kärning barriärsynkronisering. Dessutom genomförs en fallstudie av en multi-FPGA-implementering av snabb flerkärnig barriärsynkronisering. Nyckelord: Många kärnor, Network-on-Chip, Distribuerat Delat Minne, Mikrokod, Virtual-to-Physical Adress Translation, Minnesåtkomst Rättvisa, Barriär Synkronisering, Samverkande Kommunikation v Acknowledgements How time flies. I have been studying for a doctoral degree at KTH Royal Insti- tute of Technology for ten years. It is a long process filled with mixed feelings: pleasure, fatigue, anxiety, disappointment and surprise. The pleasure came from my persistent academic research in the forefront of the interesting area. I enjoyed myself during the process of literature learning, idea thinking and paper writing. The fatigue originated from the limited research time due to the pressure from work and life. I wanted to give up many times. The spur of the supervisor, the encouragement of the family, and my unwillingness prompted me to adjust my state and persist in my PhD study. The anxiety underwent in the face of the deadlines of an- nual study plan update and each paper submission. The disappointment occurred because of unsatisfactory research results and paper rejections. As my academic ability improves year by year, quite a few papers have been accepted and published. The surprise jumped out when the top journals accepted my papers. My ten-year PhD study is a beautiful, important and unforgettable part of my life. When the thesis is completed, I would like to express my heartfelt thanks to my teachers, colleagues and family who have given me guidance, care, and help. The first thank is given to my supervisor, Professor Lu Zhonghai. He is the one I thank most. In the ten years since the autumn of 2008, he has given me strict requirements, careful guidance and selfless care in my study, work and life. Based on his extensive professional knowledge, he supervised my academic research and encouraged me to constantly discover problems, solve problems, and publish high- quality papers. He gave me great confidence and encouragement when I wanted to give up. Numerous conversations and discussions between us have enhanced our friendship. His rigorous academic attitude, excellent academic style, and approach- able teacher style make me admire and will always be worthy of my learning. Thank Professor Axel Jantsch very much. In the fall of 2008, when I first came to KTH Royal Institute of Technology, Professor Axel Jantsch

Efficient Memory Access and Synchronization in Noc-Based

Network on Chip for FPGA Development of a Test System for Network on Chip

Embedded Networks on Chip for Field-Programmable Gate Arrays by Mohamed Saied Abdelfattah a Thesis Submitted in Conformity With

An Early Performance Evaluation of Many Integrated Core Architecture Based SGI Rackable Computing System

A Comparison of Four Series of Cisco Network Processors

MYTHIC MULTIPLIES in a FLASH Analog In-Memory Computing Eliminates DRAM Read/Write Cycles

Using Inspiration from Synaptic Plasticity Rules to Optimize Traffic Flow in Distributed Engineered Networks Arxiv:1611.06937V1

State-Of-The-Art in Heterogeneous Computing

ESA DSP Day 2016 Workshop Proceedings

Hardware Design of Message Passing Architecture on Heterogeneous System

Scheduling Tasks on Heterogeneous Chip Multiprocessors with Reconfigurable Hardware

Smartcell: an Energy Efficient Reconfigurable Architecture for Stream Processing

Towards a Scalable Software Defined Network-On-Chip for Next Generation Cloud