論文 / 著書情報 Article / Book Information
Total Page:16
File Type:pdf, Size:1020Kb
論文 / 著書情報 Article / Book Information 題目(和文) Title(English) Hardware-Accelerated Modeling of Large-Scale Networks-on-Chip 著者(和文) Chu Van Thiem Author(English) Thiem Van Chu 出典(和文) 学位:博士(工学), 学位授与機関:東京工業大学, 報告番号:甲第10994号, 授与年月日:2018年9月20日, 学位の種別:課程博士, 審査員:吉瀬 謙二,横田 治夫,宮﨑 純,渡部 卓雄,金子 晴彦 Citation(English) Degree:Doctor (Engineering), Conferring organization: Tokyo Institute of Technology, Report number:甲第10994号, Conferred date:2018/9/20, Degree Type:Course doctor, Examiner:,,,, 学位種別(和文) 博士論文 Type(English) Doctoral Thesis Powered by T2R2 (Tokyo Institute Research Repository) Hardware-Accelerated Modeling of Large-Scale Networks-on-Chip by Thiem Van Chu A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Engineering Department of Computer Science Graduate School of Information Science and Engineering Tokyo Institute of Technology © Thiem Van Chu 2018. All rights reserved. Abstract Networks-on-Chip (NoCs) are becoming increasingly important elements in different types of computing hardware platforms, from general-purpose many-core processors for supercomputers and datacenters to application-specific MultiProcessor Systems-on-Chip (MPSoCs) for embed- ded applications. They are also integral parts of many emerging accelerators for critically essen- tial applications such as deep neural networks, databases, and graph processing. In such a hard- ware platform, the NoC is responsible for connecting the other components together and thus has a significant impact on the overall performance. To achieve higher performance and better power efficiency, many-core processors with more and more cores have been developed. For the similar reason and to meet the increasingly stringent requirements of target applications, the number of processing elements, memory and input/output modules integrated on an MPSoC/accelerator is increasing. As the number of components that need to be interconnected increases, the overall performance becomes highly sensitive to the NoC performance. Therefore, research and devel- opment of NoCs play a key role in designing future large-scale architectures with hundreds to thousands of components. A major obstacle to research and development of large-scale NoCs is the lack of fast modeling methodologies that can provide a high degree of accuracy. Analytical models are extremely fast but may incur significant inaccuracy in many cases. Thus, NoC designers often rely on simulation to test their ideas and make design decisions. Unfortunately, while being much more accurate than analytical modeling, conventional software simulators are too slow to simulate large-scale NoCs with hundreds to thousands of nodes in a reasonable time. Because of this, most studies are limited to NoCs with around 100 nodes. To address the simulation speed problem, there have been some attempts to build NoC emulators using Field-Programmable Gate Arrays (FPGAs). However, these NoC emulators suffer from the scalability problem. They cannot scale to large NoCs due to the FPGA logic and memory constraints. A recent study has shown that even an extremely large FPGA does not have enough logic blocks to fit a moderately complex NoC design of around 150 nodes. What is even worse is that emulating a large-scale NoC also requires a large amount of memory for modeling of traffic workloads. However, the on-chip memory capacity of an FPGA is very small, at most from several to around only ten megabytes. Off-chip memory ii (usually DRAM) has a larger capacity but is much slower than on-chip memory. The use of off-chip memory may substantially degrade the emulation speed. This dissertation proposes methods for fast and accurate modeling of NoCs with up to thou- sands of nodes by FPGA emulation with cycle accuracy, an extremely high degree of emulation accuracy in which target NoCs are emulated on a cycle-by-cycle basis. While the goal of these methods is to enable fast and accurate modeling of large-scale NoCs, they are also beneficial to the modeling of current NoCs with tens to around 100 nodes. To overcome the FPGA logic constraints, the dissertation proposes a novel use of time- division multiplexing (TDM) where the emulation cycle is decoupled from the FPGA cycle and a network is emulated by time-multiplexing a small number of nodes. This approach makes it possible to emulate NoCs with up to thousands of nodes using a single FPGA. The disserta- tion focuses on applying the TDM technique to two commonly used network topologies, two- dimensional (2D) mesh and fat-tree (k-ary n-tree), which are the bases of almost all actually constructed network topologies. It thus can be expected that the proposed methods can be ex- tended for a wide range of networks. While the time-division multiplexing methods enable the emulation of large-scale NoCs, they alone are not sufficient. To achieve a high emulation speed, it is essential to address the memory constraints caused by modeling traffic workloads. There are two types of workloads used in NoC emulation: synthetic workloads and trace- driven workloads. Synthetic workloads are those based on mathematical modeling of common traffic patterns in real applications. They have a high degree of flexibility and are easy to create. A set of carefully designed synthetic workloads can provide a relatively thorough coverage of the characteristics of the target NoCs. It has also been shown that evaluation on synthetic workloads is indispensable in many cases. For instance, when designing a routing algorithm, the use of synthetic workloads is mandatory for assessing the algorithm on possible corner cases like those under extremely high loads. On the other hand, trace-driven workloads are those based on trace data captured from either a working system or an execution-driven simulation/emulation. They are effective for evaluating target NoCs under the intended applications. Currently, due to the lack of trace data of large-scale NoC-based systems, using synthetic workloads is practically the only feasible approach for emulating large-scale NoCs with thou- sands of nodes. To overcome the memory constraints caused by modeling synthetic workloads, the dissertation proposes a method to reduce the amount of required memory so that it is not nec- essary to use off-chip memory even when emulating NoCs with thousands of nodes. This method not only makes the overall design much simpler but also significantly contributes to the improve- ment of emulation speed. It and the proposed time-multiplexed emulation methods enable a NoC emulator which can be used to model a mesh-based NoC with 16,384 nodes (128×128 NoC) and a fat-tree-based NoC with 6,144 switch nodes and 4,096 terminal nodes (4-ary 6-tree NoC) and iii is up to three orders of magnitude faster than a widely used cycle-accurate software simulator while providing the same results. The dissertation shows the usability of the developed emulator by designing and modeling an effective routing algorithm for 2D mesh NoCs and evaluating it for various network sizes, from 8×8 to 128×64. The proposed routing algorithm has an oblivious routing scheme and thus a low design complexity. It, however, can achieve high performance by properly distributing the load over two network dimensions and using an efficient deadlock avoidance method. Because of the lack of fast modeling methodologies that can provide a high degree of accuracy, most existing routing algorithms have been evaluated in NoCs of limited size. The developed FPGA-based NoC emulator enables the evaluation in large-scale NoCs with thousands of nodes in a practical time. The evaluation results show that, in the currently common NoC sizes of around 100 nodes, the proposed algorithm significantly outperforms other popular oblivious routing algorithms and can provide comparable performance to a complicated adaptive routing algorithm. However, as the NoC size increases, the performance of the algorithms is strongly affected by the resource allocation policy in the network and the effects are different for each algorithm. This result would not be obtained if modeling of large-scale NoCs could not be performed. While synthetic workloads can provide a relatively thorough coverage of the characteristics of the emulated NoCs, evaluation on trace-driven workloads is still required in some cases such as assessing some application-specific optimizations. The dissertation takes this into account and extends the proposed NoC emulator to support trace-driven emulation which will be useful for research and development of large-scale NoCs in the future when trace data of large-scale NoC- based systems are available. Since trace data are large, they must be stored in off-chip memory. The dissertation proposes an effective trace data loading architecture and some methods to hide the off-chip memory access time and improve the scalability of the emulation architecture in terms of operating frequency and logic requirements. These proposals are tightly coupled to the time-multiplexed emulation methods. The evaluation results show that the extended NoC emulator is two orders of magnitude faster than the above-mentioned software simulator when emulating an 8×8 NoC with the widely used PARSEC traces while also providing the same results; and the speedup is increased to three orders of magnitude when emulating a 64×64 NoC with trace data created based on a synthetic workload. The work in this dissertation contributes directly to the formation of infrastructures for re- search and development of large-scale NoCs, which is crucial for developing more powerful and efficient many-core processors, MPSoCs, and hardware accelerators in the future. iv Acknowledgment The work presented in this thesis would not have been possible without the help and support of so many people to whom I owe a lot of gratitude. First and foremost, I would like to thank my advisor, Associate Professor Kenji Kise, who has given me this great opportunity to come and work with him at Tokyo Tech.