Pymtl: a Unified Framework for Vertically Integrated Computer
Total Page:16
File Type:pdf, Size:1020Kb
Appears in the Proceedings of the 47th Int’l Symp. on Microarchitecture (MICRO-47), December 2014 PyMTL: A Unified Framework for Vertically Integrated Computer Architecture Research Derek Lockhart, Gary Zibrat, and Christopher Batten School of Electrical and Computer Engineering, Cornell University, Ithaca, NY {dml257,gdz4,cbatten}@cornell.edu Abstract—Technology trends prompting architects to con- tures, it has general value as a methodology for more tradi- sider greater heterogeneity and hardware specialization have tional architecture research as well. exposed an increasing need for vertically integrated research Unfortunately, attempts to implement an MTL methodol- methodologies that can effectively assess performance, area, and energy metrics of future architectures. However, constructing ogy using existing publicly-available research tools reveals such a methodology with existing tools is a significant challenge numerous practical challenges we call the computer architec- due to the unique languages, design patterns, and tools used ture research methodology gap. This gap is manifested as in functional-level (FL), cycle-level (CL), and register-transfer- the distinct languages, design patterns, and tools commonly level (RTL) modeling. We introduce a new framework called used by functional level (FL), cycle level (CL), and register- PyMTL that aims to close this computer architecture research methodology gap by providing a unified design environment transfer level (RTL) modeling. We believe the computer ar- for FL, CL, and RTL modeling. PyMTL leverages the Python chitecture research methodology gap exposes a critical need programming language to create a highly productive domain- for a new vertically integrated framework to facilitate rapid specific embedded language for concurrent-structural modeling design-space exploration and hardware implementation. Ide- and hardware design. While the use of Python as a modeling ally such a framework would use a single specification lan- and framework implementation language provides considerable benefits in terms of productivity, it comes at the cost of signif- guage for FL, CL, and RTL modeling, enable multi-level sim- icantly longer simulation times. We address this performance- ulations that mix models at different abstraction levels, and productivity gap with a hybrid JIT compilation and JIT special- provide a path to design automation toolflows for extraction ization approach. We introduce SimJIT, a custom JIT special- of credible area, energy, and timing results. ization engine that automatically generates optimized C++ for 1 CL and RTL models. To reduce the performance impact of the This paper introduces PyMTL , our attempt to construct remaining unspecialized code, we combine SimJIT with an off- such a unified, highly productive framework for FL, CL, the-shelf Python interpreter with a meta-tracing JIT compiler and RTL modeling. PyMTL leverages a common high- (PyPy). SimJIT+PyPy provides speedups of up to 72× for CL productivity language (Python2.7) for behavioral specifica- models and 200× for RTL models, bringing us within 4–6× of tion, structural elaboration, and verification, enabling a rapid optimized C++ code while providing significant benefits in terms of productivity and usability. code-test-debug cycle for hardware modeling. Concurrent- structural modeling combined with latency-insensitive design I. INTRODUCTION allows reuse of test benches and components across abstrac- tion levels while also enabling mixed simulation of FL, CL, Limitations in technology scaling have led to a growing and RTL models. A model/tool split provides separation of interest in non-traditional system architectures that incorpo- concerns between model specification and simulator genera- rate heterogeneity and specialization as a means to improve tion letting architects focus on implementing hardware, not performance under strict power and energy constraints. Un- simulators. PyMTL’s modular construction encourages ex- fortunately, computer architects exploring these more exotic tensibility: using elaborated model instances as input, users architectures generally lack existing physical designs to val- can write custom tools (also in Python) such as simulators, idate their power and performance models. The lack of vali- translators, analyzers, and visualizers. Python’s glue lan- dated models makes it challenging to accurately evaluate the guage facilities provide flexibility by allowing PyMTL mod- computational efficiency of these designs [7,8,15–17,19]. As els and tools to be extended with C/C++ components or specialized accelerators become more integral to achieving embedded within existing C/C++ simulators [42]. Finally, the performance and energy goals of future hardware, there is PyMTL serves as a productive hardware generation language a crucial need for researchers to supplement cycle-level simu- for building synthesizable hardware templates thanks to an lation with algorithmic exploration and RTL implementation. HDL translation tool that converts PyMTL RTL models into Future computer architecture research will place an in- Verilog-2001 source. creased emphasis on a methodology we call modeling to- Leveraging Python as a modeling language improves wards layout (MTL). While computer architects have long model conciseness, clarity, and implementation time [11,33], leveraged multiple modeling abstractions (functional level, but comes at a significant cost to simulation time. For cycle level, register-transfer level) to trade off simulation time example, a pure Python cycle-level mesh network simu- and accuracy, an MTL methodology aims to vertically inte- lation in PyMTL exhibits a 300x slowdown when com- grate these abstractions for iterative refinement of a design pared to an identical simulation written in C++. To ad- from algorithm, to exploration, to implementation. Although an MTL methodology is especially valuable for prototyping 1PyMTL loosely stands for [Py]thon framework for [M]odeling specialized accelerators and exploring more exotic architec- [T]owards [L]ayout and is pronounced the same as “py-metal”. dress this performance-productivity gap we take inspiration TABLE I. MODELING METHODOLOGIES from the scientific computing community which has increas- FL CL RTL productivity-level languages ingly adopted (e.g., MATLAB, Modeling Productivity Efficiency Hardware Python) for computationally intensive tasks by replacing Languages Level Level Description hand-written efficiency-level language code (e.g., C, C++) (PLL) (ELL) (HDL) with dynamic techniques such as just-in-time (JIT) compi- MATLAB/Python C/C++ Verilog/VHDL lation [12, 27, 38] and selective-embedded JIT specializa- Modeling Functional: Object Oriented: Concurrent-Structural: tion [5, 10]. Patterns Data Structures, Classes, Methods, Combinational Logic, We introduce SimJIT, a custom just-in-time specializer Algorithms Ticks and/or Clocked Logic, which takes CL and RTL models written in PyMTL and Events Port Interfaces automatically generates, compiles, links, and executes fast, Modeling Third-party Computer Simulator Python-wrapped C++ code seamlessly within the PyMTL Tools Algorithm Architecture Generators, framework. SimJIT is both selective and embedded pro- Packages and Simulation Synthesis Tools, Toolboxes Frameworks Verification Tools viding much of the benefits described in previous work on domain-specific embedded specialization [10]. SimJIT de- livers significant speedups over CPython (up to 34× for CL is of primary concern. Performance-oriented FL models may models and 63× for RTL models), but sees even greater ben- use efficiency-level languages such as C or C++ when simu- efits when combined with PyPy, an interpreter for Python lation time is the priority (e.g., instruction-set simulators). with a meta-tracing JIT compiler [6]. PyPy is able to opti- mize unspecialized Python code as well as hot paths between Cycle-Level – CL models capture the cycle-approximate Python and C++, boosting the performance of SimJIT simu- behavior of a hardware target. CL models attempt to strike a lations by over 2× and providing a net speedup of 72× for balance between accuracy, performance, and flexibility while CL models and 200× for RTL models. These optimizations exploring the timing behavior of hypothetical hardware or- mitigate much of the performance loss incurred by using a ganizations. The CL methodology places an emphasis on productivity-level language, closing the performance gap be- simulation speed and flexibility, leveraging high-performance tween PyMTL and C++ simulations to within 4–6×. efficiency-level languages like C++. Encapsulation and reuse The contributions of this paper are the following: is typically achieved through classic object-oriented software engineering paradigms, while timing is most often mod- 1. We introduce PyMTL, a novel, vertically-integrated eled using the notion of ticks or events. Established com- framework for concurrent-structural modeling and hard- puter architecture simulation frameworks (e.g., ESESC [1], ware design (Section III). PyMTL provides a unified gem5 [4]) are frequently used to increase productivity as they environment for FL, CL, and RTL modeling and enables typically provide libraries, simulation kernels, and parame- a path to EDA toolflows via generation of Verilog HDL. terizable baseline models that allow for rapid design-space exploration. 2. We introduce and evaluate SimJIT, a selective em- Register-Transfer-Level – RTL models are cycle- bedded just-in-time specializer for PyMTL CL and accurate,