Wrangling Rogues: a Case Study on Managing Experimental Post-Moore Architectures Will Powell Jeffrey S
Total Page:16
File Type:pdf, Size:1020Kb
Wrangling Rogues: A Case Study on Managing Experimental Post-Moore Architectures Will Powell Jeffrey S. Young Jason Riedy Thomas M. Conte [email protected] [email protected] [email protected] [email protected] School of Computational Science and Engineering School of Computer Science Georgia Institute of Technology Georgia Institute of Technology Atlanta, Georgia Atlanta, Georgia Abstract are not always foreseen in traditional data center environments. The Rogues Gallery is a new experimental testbed that is focused Our Rogues Gallery of novel technologies has produced research on tackling rogue architectures for the post-Moore era of computing. results and is being used in classes both at Georgia Tech as well as While some of these devices have roots in the embedded and high- at other institutions. performance computing spaces, managing current and emerging The key lessons from this testbed (so far) are the following: technologies provides a challenge for system administration that • Invest in rogues, but realize some technology may be are not always foreseen in traditional data center environments. short-lived. We cannot invest too much time in configuring We present an overview of the motivations and design of the and integrating a novel architecture when tools and software initial Rogues Gallery testbed and cover some of the unique chal- stacks may be in an unstable state of development. Rogues lenges that we have seen and foresee with upcoming hardware may be short-lived; they may not achieve their goals. Find- prototypes for future post-Moore research. Specifically, we cover ing their limits is an important aspect of the Rogues Gallery. networking, identity management, scheduling of resources, and Early users of these platforms are technically sophisticated tools and sensor access aspects of the Rogues Gallery along with and relatively friendly, so not all components of the support- techniques we have developed to manage these new platforms. We ing infrastructure need be user friendly. argue that current tools like the Slurm scheduler can support new • Physical hardware resources not dedicated to rogues rogues without major infrastructure changes. should be kept to a minimum. As we explain in Section 4, most functionality should be satisfied by VMs and containers 1 Challenges for Post-Moore Testbeds rather than dedicating a physical server to manage each As we look to the end of easy and cost-effective transistor scal- rogue piece of hardware. ing, we enter the post-Moore era[33] and reach a turning point in • Collaboration and commiseration is key. Rogues need a computer system design and usage. Accelerators like GPUs have community to succeed. If they cannot establish a community created a pronounced shift in the high-performance computing and (users, vendors interested in providing updates, manageabil- machine learning application spaces, but there is a wide variety ity from an IT perspective), they will disappear. We promote of possible architectural choices for the post-Moore era, including community with our testbed through a documentation wiki, memory-centric, neuromorphic, quantum, and reversible comput- mailing lists, and other communication channels for users ing. These revolutionary research fields combined with alternative as well as close contact with vendors, where appropriate. materials-based approaches to silicon-based hardware have given • Licensing and appropriate identity management are us a bewildering array of options and rogue devices for the post- tough but necessary challenges. Managing access to pro- Moore era. However, there currently is limited guidance on how to totypes and licensed software are tricky for academic institu- evaluate this novel hardware for tomorrow’s application needs. tions, but we must work with collaborators like start-ups to arXiv:1808.06334v4 [cs.AR] 2 Aug 2019 Creating one-off testbeds for each new post-Moore technology provide a secure environment with appropriate separation of increases the cost of evaluating these new technologies. We present privileges to protect IP as necessary. Many software-defined an in-progress case study of a cohesive user environment that en- network (SDN) environments assume endpoint implementa- ables researchers to perform experiments across different novel tions that may not be possible on first-generation hardware. architectures that may be in different stages of acceptance by the Documentation restrictions present a particular challenge wider computing community. The Rogues Gallery is a new exper- while trying to build a community. imental testbed focusing on opening access to rogue architectures Here we describe the initial design of the testbed as well as the that may play a larger role in the Post-Moore era of computing. high-level system management strategy that we use to maintain While some of these devices have roots in the embedded and high- and tie together novel architectures. We use existing tools as much performance computing spaces, managing current and emerging as possible; we need not re-invent the systems integration wheel technologies provides a challenge for system administration that to support early novel architectures, which may be few in number and specialized. ,, 2019. ACM ISBN 978-x-xxxx-xxxx-x/YY/MM...$15.00 https://doi.org/10.1145/nnnnnnn.nnnnnnn Figure 1: Initial rogues: FPGAs, the Emu Chick, and the field programmable analog array (FPAA) 2 The Rogues Gallery Benchmarks and Data Sets for The Rogues Gallery was initiated by Georgia Tech’s Center for Irregular and ML applications Tools and Resources Research into Novel Computing Hierarchies (CRNCH) in late 2017. Portability APIs – Kokkos, GraphBLAS, Neuromorphic APIs The Gallery’s focus is to acquire new and unique hardware (the Rogues Gallery Training materials and Tutorials rogues) from vendors, research labs, and start-ups and make this Hosted Hardware hardware widely available to students, faculty, and industry collab- Near-memory Computation Neuromorphic orators within a managed data center environment. By exposing and Data Rearrangement Accelerators students and researchers to this set of unique hardware, we hope to Emu foster cross-cutting discussions about hardware designs that will Chick FPAA (GT) Others.. Non-volatile drive future performance improvements in computing long after Memory Programmable Interconnection the Moore’s Law era of cheap transistors ends. Networks High-bandwidth Memory The primary goal of the Rogues Gallery is to make first-generation Future Devices Traditional Computation start-up or research lab prototypes available to a wide audience and Prototype Accelerators RQL Devices of researchers and developers as soon as it is moderately stable Quantum Metastrider for testing, application development, and benchmarking. Examples RISC-V FPGA of such cutting-edge hardware include Emu Technology’s Chick, FPGA-based memory-centric platforms, field programmable analog Figure 2: High-level overview of the CRNCH Rogues Gallery devices (FPAAs), and others. This mirrors how companies like Intel and IBM are investigating novel hardware, Loihi and TrueNorth respectively, but the Rogues Gallery adds student and collaborator access. to play with ideas. Georgia Tech has a history of providing such The Rogues Gallery provides a coherent, managed environment early access through efforts such as the Sony-Toshiba-IBM Center for exposing new hardware and supporting software and tools to of Competence for the Cell Broadband Engine Processor[2], the students, researchers, and collaborators. This environment allows NSF Keeneland GPU system[34], and work with early high core users to perform critical architecture, systems, and HPC research, count Intel-based platforms[24]. and enables them to train with new technologies so they can be more competitive in today’s rapidly changing job market. As part 3 Initial Rogues of this goal, CRNCH and Georgia Tech are committed to partner- Our initial Rogues Gallery shown in Figure 1 consists of vari- ing with vendors to define a flexible access management model ous field programmable gate arrays (FPGAs), an Emu Chick, anda that allows for a functional and distinctive research experience for field-programmable analog array (FPAA). The gallery also currently both Georgia Tech and external users, while protecting sensitive includes power monitoring infrastructure for embedded-style sys- intellectual property and technologies. tems (NVIDIA Tegra), but we will not go into depth on that more Not all rogues become long-term products. Some fade away typical infrastructure. within a few years (or are acquired by companies that fail to pro- FPGAs for Memory System Exploration Reconfigurable devices ductize the technology). The overall infrastructure of a testbed like FPGAs are not novel in themselves, but the gallery includes focused on rogues must minimize up-front investment to limit the FPGAs for prototyping near-memory and in-network computing. cost of “just trying out” new technology. As these early-access and Currently the Rogues Gallery includes prototype platforms change, the infrastructure must also adapt. • two Nallatech 385 PCIe cards with mid-grade Intel Arria 10 And even if the technology is fantastic, rogues that do not de- FPGAs and 8 GiB of memory; velop communities do not last. In our opinion, initial communities • a Nallatech 520N PCIe card with a Intel Stratix 10, 32 GiB of grow by easing use and access. Some systems, like the Emu Chick, memory, and multiple 100 GBps network ports; and require