Flick: Fast and Lightweight ISA-Crossing Call for Heterogeneous-ISA Environments
Total Page:16
File Type:pdf, Size:1020Kb
Flick: Fast and Lightweight ISA-Crossing Call for Heterogeneous-ISA Environments Shenghsun Cho, Han Chen, Sergey Madaminov, Michael Ferdman, Peter Milder Stony Brook University fshencho, smadaminov, [email protected], fhan.chen.2, [email protected] Abstract—Heterogeneous-ISA multi-core systems have perfor- computing systems are already heterogeneous-ISA systems. mance and power consumption benefits. Today, numerous system This fact presents a new opportunity to spread computation components, such as NVRAMs and Smart NICs, already have and communication across cores with different ISAs, achieving built-in processor cores with ISAs different from that of the host CPUs, making many modern systems heterogeneous-ISA better performance and power efficiency. multi-core systems. Unfortunately, programming and using such However, although the benefits of heterogeneous-ISA multi- systems efficiently is difficult and requires extensive support from core systems are attractive, these benefits come with significant the host operating systems. Existing programming solutions are challenges for the software developers of such systems. These complex, require dramatic changes to the systems, and often challenges are particularly pronounced for NxPs, which have incur significant performance overheads. To address this challenge, we propose Flick: Fast and their own private memory, calling for software developers Lightweight ISA-Crossing Call, for migrating threads in to use the NxPs in an “offload engine” programming style, heterogeneous-ISA multi-core systems. By leveraging hardware similar to CUDA or OpenCL for GPUs. The offload-engine virtual memory support and standard operating system mech- software organization differs greatly from the conventional anisms, a software thread can transparently migrate between general-purpose multi-core programming environment familiar cores with different ISAs. We prototype a heterogeneous-ISA multi-core system using FPGAs with off-the-shelf hardware and to many developers. In such NxPs, developers typically handle software to evaluate Flick. Experiments with microbenchmarks not only the communication between the host processors and a BFS application show that Flick requires only minor and the NxPs but also the data addressing and movement, changes to the existing OS and software, and incurs only 18µs further increasing the complexity (and potentially reducing the round trip overhead for migrating a thread through PCIe, which performance) of the resulting systems. is at least 23x faster than prior work. To address the programmability challenges, researchers Index Terms—Heterogeneous-ISA, Multi-Core, Thread Migra- tion, Virtual Memory, FPGA have proposed new operating systems [9], [10] and modi- fications to existing operating systems [11], aiming to hide the complexity of heterogeneous-ISA multi-core systems and I. INTRODUCTION provide a traditional multi-core environment familiar to de- Multi-core processors, including CMPs and MPSoCs, have velopers. These approaches typically assume that software dominated CPU architectures for over a decade. To keep threads must be permitted to freely migrate between cores improving processor performance within constrained energy at any execution point, enabling thread migration based on, budgets, researchers and industry have proposed and de- for example, dynamically profiled system load. To support ployed heterogeneous multi-core processors, including single- such thread migration capabilities, prior work employs many ISA heterogeneous multi-cores [1], [2] and heterogeneous-ISA complex techniques, including run-time binary translation and multi-cores [3], [4]. Such processors are deployed in systems thread state transformation [11], integrating these mechanisms ranging from massive warehouse-scale datacenter servers to into heavily modified or even brand new operating systems. smartphones and small embedded systems. Among the hetero- Unfortunately, although they generally achieve their goal of geneous multi-core processors, heterogeneous-ISA multi-core simple programmability and convenience, these approaches CMPs and MPSoCs require the highest effort to effectively impose major performance overheads on the execution of the utilize the diversity of both the architecture (ISAs) and the heterogeneous-ISA software. microarchitecture for performance and power efficiency. In this work, we observe that conveniently-programmable In addition to the intentionally heterogeneous-ISA CMPs heterogeneous-ISA systems are possible, and high perfor- and MPSoCs, many modern system components, such as mance is achievable, by sacrificing a small amount of purity Smart NICs [5], NVMes [6], and FPGAs [7], [8], include associated with the migrate-any-time systems. Particularly for built-in general-purpose processors. These NxPs (Near-x- NxP systems, where the software developers have a clear idea Processors), such as near-to-communication (Smart NICs), about the partitioning of the software and the data, many near-to-data (NVMes), or near-to-accelerator (FPGAs), typi- of the techniques from the prior work introduce unnecessary cally use ISAs different from the host processors, with their complexity and overhead. The system should provide the microarchitectures tailored to run the target workloads effi- illusion of a simple and efficient shared memory multi-core ciently. As a result of incorporating NxPs, numerous modern environment to users, enabling them to easily develop software that uses cores with different ISAs. We, therefore, take a heterogeneous multi-core systems, heterogeneous-ISA multi- principled approach to identify the minimum requirements core systems represent an extreme case for combining cores for a heterogeneous-ISA multi-core programming environment with different architectures (ISAs) to achieve efficiency with that is easy to use, requires minimal deviation from traditional both specialization (heterogeneity) and parallelism (multi- single-ISA multi-core programming, and can be achieved core). Studies have shown that heterogeneous-ISA multi-core without introducing major modifications to the existing op- systems can achieve 21% speedup and up to 23% energy erating systems and hardware. savings compared to homogeneous multi-core systems [3]. We develop Flick: Fast and Lightweight ISA-Crossing Call for migrating threads in heterogeneous-ISA environments. A. Heterogeneous-ISA Multi-Core Integration To maintain expectations of conventional multi-core systems, There are several ways to integrate cores with different Flick guarantees a unified physical memory address space, and ISAs, primarily depending on the physical locations of the the same software thread observes the same virtual memory cores. On one end of the spectrum, different cores can be address space on all cores, regardless of ISA. Given hardware tightly coupled into one chip to form a heterogeneous-ISA that can support such a memory organization, we create a con- CMP or MPSoC. The cores can share the same memory venient and lightweight generic thread migration mechanism hierarchy and have low-latency communication with each capable of transferring threads between cores with different- other. However, although heterogeneous-ISA CMPs have been ISAs at function-call boundaries. Flick uses mechanisms that proposed in academia [3], they are still rarely seen in the field. are readily available in modern systems, such as access-control On the other end of the spectrum, a data center that connects bits in page table entries, and makes non-intrusive use of processors with different ISAs using a high-speed network existing software techniques such as page fault handling and is also a heterogeneous-ISA system [12]. However, because dynamic library loaders. Ultimately, Flick allows software the cores are spread across servers, developers are limited to developers to take conventional software that targets multi- distributed programming techniques, and the communication core systems and indicate which functions should run on which overheads are high. ISA, thereby directing the system to transparently migrate the An interesting integration option, which lies in the middle, running thread to the desired core when the program makes spreads heterogeneous-ISA cores across multiple chips, but function calls across the designated ISA boundaries. still within one machine. Many components of modern systems We prototype and evaluate Flick on a heterogeneous-ISA already have their own cores, using different ISAs from multi-core platform using a modern off-the-shelf server with the host cores, executing jobs specifically targeted for those a PCIe connected FPGA. On this real system, we show that components. Common examples include servers with Smart Flick achieves 18µs migration times with fewer than 2K LoC NICs, NVMe controllers, and FPGAs with integrated cores. modifications in the off-the-shelf Linux operating system, as The cores are typically used as NxPs, as they reside close to opposed to hundreds to thousands of microseconds overhead the subjects they are working on, such as network, storage, or and tens of thousands LoC modifications in the prior work. accelerators, providing low data access latencies for their tasks. Our experiments with a BFS application show that Flick Such tailored architectures make the NxPs highly efficient for achieves up to 2.5x speedup when compared to a system that their target workloads. More importantly, because these NxPs does not perform thread migration. are located within the same machine as