Hierarchical Multithreading: Programming Model and System Software

Guang R. Gao1, Thomas Sterling2,3, Rick Stevens4, Mark Hereld4, Weirong Zhu1

1 Department of Electrical and Computer Engineering, University of Delaware
{ggao,weirong}@capsl.udel.edu

2 Center for Advanced Computing Research, California Institute of Technology
[email protected]

3 Department of Computer Science, Louisiana State University
[email protected]

4 Mathematics and Computer Science Division, Argonne National Laboratory
{stevens,hereld}@mcs.anl.gov

Abstract

This paper addresses the underlying sources of performance degradation (e.g., latency, overhead, and starvation) and the difficulties of programmer productivity (e.g., explicit locality management and scheduling, performance tuning, fragmented memory, and synchronous global barriers), with the goal of dramatically enhancing the broad effectiveness of parallel processing for high-end computing. We are developing a hierarchical threaded virtual machine (HTVM) that defines a dynamic, multithreaded execution model and programming model, providing an architecture abstraction for HEC system software and tools development. We are working on a prototype language, LITL-X (pronounced "little-X"), for Latency Intrinsic-Tolerant Language, which provides application programmers with a powerful set of semantic constructs to organize parallel computations in a way that hides/manages latency and limits the effects of overhead. This is quite different from locality management, although the intent of both strategies is to minimize the effect of latency on the efficiency of computation. We will work on a dynamic compilation and runtime model to achieve efficient LITL-X program execution. Several adaptive optimizations will be studied, as will a methodology for incorporating domain-specific knowledge in program optimization. Finally, we plan to implement our method in an experimental testbed for an HEC architecture and perform a qualitative and quantitative evaluation on selected applications.

1 Introduction

With the rapid increase in both the scale and complexity of scientific and engineering problems, computational demands grow accordingly. Breakthrough-quality scientific discoveries and optimal engineering designs often rely on large-scale simulations on High-End Computing (HEC) systems, with performance requirements reaching peta-flops and beyond. However, current HEC systems lack system software and tools optimized for the advanced scientific and engineering work of interest, and are extremely difficult to program and to port applications to. Consequently, applications rarely achieve an acceptable fraction of the peak capability of the system.

To radically improve this situation, the following key features are expected to be supported in future HEC systems: (1) architecture support for coarse- and/or fine-grain multithreading at enormous scale (up to millions of threads); (2) architecture support for runtime migration; and (3) architecture support for a large shared address space across nodes. These features can be observed in the IBM BlueGene/L [6] and Cyclops architectures [5], Processor-In-Memory-based architectures [17], and fine-grain multithreaded architectures such as HTMT [10] and CARE [14].

In this paper, we propose a hierarchical threaded virtual machine (HTVM) that defines a dynamic, multithreaded execution model, which provides an architecture abstraction for HEC system software and tools development. A corresponding programming model will let users efficiently exploit the capabilities of the execution model. We will perform research on programming model and language issues, continuous compilation, and runtime software that are critical to enable the dynamic adaptation of the HEC system. We propose a method to enable domain-expert knowledge input and exploitation, and a runtime performance monitoring mechanism to support the above continuous compilation. Finally, we report the current status of the implementation, performance analysis, and evaluation of the proposed methods under an experimental HEC system software testbed.

1-4244-0054-6/06/$20.00 ©2006 IEEE

2 An Overview of the Hierarchical Threaded Virtual Machine Model

This section gives our overall vision of the HEC system software and tools. A major challenge is to accommodate dynamic adaptivity in the design, due to the complex and dynamic nature of ultra-large-scale HEC applications and machines. In a real HEC program execution scenario, millions of threads at various levels of the thread hierarchy may be generated and executed at different times and places in the machine. Each thread should be mapped to a suitable physical thread unit when resources become available and dependences are resolved. We identify four classes of adaptivity critical to the performance of the system:

• Loop parallelism adaptation. Scientific applications tend to have computation-intensive kernels consisting of loop nests. The exploitable parallelism in a loop nest, and the grain size of that parallelism, depend at runtime on machine resource availability and data locality, which change more drastically in a highly threaded environment with deep memory hierarchies.

• Dynamic load adaptation. The computation load may become unbalanced, and a large number of threads may need to migrate to balance the load of the machine.

• Locality adaptation. Data objects may need to migrate, and copies may be generated and moved in the memory hierarchy to achieve high locality, while copy consistency needs to be preserved.

• Latency adaptation. The deep memory hierarchy usually found in an HEC machine makes memory access latencies vary drastically during execution, depending on the locality of references, the number of concurrent accesses, and the available memory bandwidth. The system needs to adapt dynamically to such variations.

A main task of our research is to study the key system software technologies that support the above dynamic adaptiveness of the HEC system. Fig. 1 shows our overall system software architecture. At its core is a Hierarchical Threaded Virtual Machine (HTVM) execution model that features dynamic multi-level multithreaded execution. HTVM includes three components: a thread model, a memory model, and a synchronization model. This design focuses on adaptivity features, as discussed in detail in Section 3. The functionality of HTVM will be supported and explored through the HTVM parallel programming language (called LITL-X) and runtime software.

The compiler has two parts: a static part and a dynamic part. As shown in Fig. 1, the dynamic compiler is responsible for the adaptation of loop parallelism, dynamic load, locality, and latency. Since the dynamic compiler closely interacts with the runtime system and will be called during the execution of HEC applications, its functionality extends smoothly into the runtime system as well, as indicated by the boxes that span the dynamic compiler and the runtime system.

To take advantage of the adaptivity features of HTVM more effectively, a domain-expert knowledge base is provided. Domain-specific knowledge is expressed as scripts, which attach specific annotations to the source of the HEC applications to guide the compilation performed by the static compiler. To assist runtime adaptivity, a system of structured hints guides the dynamic compiler in selecting and completing the partial schedules generated by the static compiler, and in selecting runtime strategies, based on dynamic facts, such as memory access patterns, found by a runtime performance monitor during the execution of the HEC applications. The flow of the mapping process of an HEC application under our proposed software and tools is indicated by the big shaded arrows in Fig. 1. The components of the software infrastructure are annotated with the corresponding section numbers. The HTVM compilation and execution process is iterative, with the assistance of a feedback process as shown in the figure.

Figure 1. An Overview of the Proposed HEC Software/Tools

3 Hierarchical Multithreading: Programming Model and System Software

3.1 A Hierarchical Threaded Virtual Machine Model

One of our primary objectives is to define the hierarchical threaded virtual machine (HTVM). We first outline our research on the HTVM execution model, which consists of a thread model, a memory model, and a synchronization model. We then outline research issues and tasks for the HTVM programming model.

3.1.1 HTVM Execution Model

A novel aspect of our HTVM model is to provide a smooth and integrated abstraction that directly represents these thread levels and provides an integrated thread hierarchy. We will target future HEC architectures, which provide rich hardware support for a hierarchy of threads at different grain levels, as discussed earlier. Intuitively, the following levels of threads are to be defined under HTVM.

• Large-Grain Threads (LGTs) under HTVM. Large-grain threads are a universally supported feature of many HEC architectures. These threads normally perform a substantial computation task, building up state of considerable "weight" during the course of their execution. There is usually considerable cost associated with the invocation and management of such a coarse thread, even with architectural support. Examples of LGTs are the high-weight threads under the Cascade architecture [4], the coarse-grain threads under the PERCS architecture [1], and the threads under Cyclops-64 TiNy Threads [7].

• Small-Grain Threads (SGTs) under HTVM. Small-grain threads are another feature of certain HEC architectures of interest in this proposal. These threads normally perform a much smaller computation task, building some state but with substantially less "weight". The cost of their invocation and management is therefore much lower than for large-grain threads. Examples of SGTs are the threaded function calls under Cilk [9] and EARTH [19], parcels under HTMT [10] and Cascade [4], and the asynchronous calls being considered under PERCS [1].

• Tiny-Grain Threads (TGTs) under HTVM. Threads with much lighter weight than SGTs will be supported in some future HEC architectures. The partitioning of TGTs and their resource usage (e.g., registers) are handled by automatic thread partitioning [18]. Examples of TGTs include fibers under EARTH [19] and strands under CARE [14].

An important research task is to provide a solid definition and specification of the three levels of threads under a unified thread hierarchy. The specification needs to be general enough to capture the features of a family of future HEC architectures to ensure portability, yet simple enough for compilers and/or programmers to generate efficient code, and to facilitate runtime optimization.

Figure 2. A Case Study of Hierarchical Thread Execution Model: Large Scale Simulation of Brain Neuron Networks
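To make the three-level hierarchy concrete, the following sketch models it with ordinary Python threads. The names (large_grain, small_grain, tiny_grain) and the dictionary-based "memories" are purely illustrative assumptions, not HTVM interfaces: an LGT owns private memory visible to the SGTs it spawns, each SGT keeps a private frame, and TGTs run inside the enclosing SGT's frame.

```python
import threading

GLOBAL_SPACE = {}             # shared by all LGTs (the global address space)

def tiny_grain(frame, i):     # TGT: very lightweight work inside the SGT frame
    frame["acc"] += i

def small_grain(lgt_mem, n):  # SGT: has its own frame, sees the LGT's memory
    frame = {"acc": 0}        # private frame storage of this SGT invocation
    for i in range(n):        # TGTs share the enclosing SGT's frame
        tiny_grain(frame, i)
    lgt_mem["partial"].append(frame["acc"])

def large_grain(name, work):  # LGT: substantial task with private memory
    lgt_mem = {"partial": []} # private to this LGT, visible to its SGTs
    sgts = [threading.Thread(target=small_grain, args=(lgt_mem, n))
            for n in work]
    for t in sgts:
        t.start()
    for t in sgts:
        t.join()
    GLOBAL_SPACE[name] = sum(lgt_mem["partial"])

lgts = [threading.Thread(target=large_grain, args=(f"lgt{k}", [10, 20]))
        for k in range(2)]
for t in lgts:
    t.start()
for t in lgts:
    t.join()
# each LGT computes sum(range(10)) + sum(range(20)) = 45 + 190 = 235
```

The sketch only mimics the visibility rules (global space, LGT-private memory, SGT frames); it says nothing about the cost differences between the levels, which are the point of the actual architecture support.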

Our current plan is as follows. An LGT has its own private memory space, and all LGTs share a global address space. A group of SGTs invoked from an LGT will see the private memory of that LGT. An SGT invocation will have its own private frame storage, where its local state is stored. The TGTs within an SGT will share the frame storage of the enclosing SGT invocation, but may communicate efficiently through registers under compiler control. To illustrate these ideas, Figure 2 shows a mapping of the computation of a multi-level neural system simulation onto our HTVM hierarchy of threads.

3.2 Parallel Programming Model and LITL-X

In developing the parallel programming model for HTVM, we leverage our own experience in recent ongoing research on new parallel programming models and languages, such as the language proposal X10 under the IBM PERCS project [2] and Chapel under the Cray Cascade project [4]. All are seeking a potential alternative programming model that is more aggressive in addressing the combined challenges of latency and overhead.

To be more concrete, we are working on a prototype language, LITL-X (pronounced "little-X"), for Latency Intrinsic-Tolerant Language, which provides application programmers with a powerful set of semantic constructs to organize parallel computations in a way that hides/manages latency and limits the effects of overhead. This is quite different from locality management, although the intent of both strategies is to minimize the effect of latency on the efficiency of computation. Locality management attempts to avoid latency events by aggregating data for local computation and reducing large message communications. Latency management attempts to hide latency by overlapping communication with computation.

LITL-X will incorporate the following classes of parallel constructs for latency tolerance and overhead reduction:

• Coarse-grain multithreading, with thread context switching built into the application's instruction stream (rather than in the operating system) to keep the processors busy in the presence of remote requests. This is connected to the LGT level under HTVM.

• Parcel (intelligent message)-driven split-transaction computation [17], to reduce communication and to enable moving the work to the data (when it makes sense). This is connected to the SGT level under HTVM.

• Futures [11] for eager producer-consumer computing, with efficient localized buffering of requests at the site of the needed values. This is connected to the TGT level under HTVM.

• Percolation [12] of program instruction blocks and data to the site of the intended computation, to eliminate waiting for remote accesses, determined at run time prior to actual block execution.

• Synchronization constructs for data-flow style operations, as well as atomic blocks of memory operations.
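The futures construct above can be sketched with Python's standard concurrent.futures library (a sketch of the idea only; LITL-X's own syntax is not defined here). The producer computes eagerly as soon as the future is created, and the consumer blocks only if the value is not yet ready when it is touched:

```python
from concurrent.futures import ThreadPoolExecutor

def produce(x):
    # eager producer: starts computing as soon as the future is submitted
    return x * x

with ThreadPoolExecutor(max_workers=2) as pool:
    futs = [pool.submit(produce, x) for x in range(4)]
    # the consumer touches each value only when it is needed;
    # result() blocks only if the producer has not finished yet
    total = sum(f.result() for f in futs)

print(total)  # 0 + 1 + 4 + 9 = 14
```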
The basic approach can be runtime software to address the challenges of efficient described as follows: First choose the most profitable execution under the HTVM model. Our solution is loop level [16], which may have its own inner loops and moving from static analysis and optimization toward a therefore a loop nest itself. This loop level is software hybrid scheme, combining both static compilation and pipelined first. After that, the software pipelined code runtime adaptation. The compiler and the runtime is partitioned into threads, each thread composed of system software are intimately connected under our several iterations of the selected loop level. The ap- adaptive/continuous compilation strategy, where some proach is unique in that it exploits instruction-level key functions of runtime system software can also be and thread-level parallelism simultaneously. viewed as an extension of the compiler. As mentioned There are several issues we need to study: (1) What in Section 2, the HTVM system software addresses four is the performance and cost model for such partition types of runtime adaptation: loop parallelism adap- of the software pipelined code into threads? (2) How tation, dynamic load adaptation, locality adaptation, to integrate this approach with runtime optimization? and latency adaption. In this paper, we take the loop Software pipelining uses a machine resource model, in- multithreading and parallelism adaptation as an ex- cluding the memory access latencies, to scheduling the ample to illustrate how the system software should be loop. The available resources and actual memory ac- designed. cess latencies, however, are runtime dependent in an Scientific applications heavily rely on loop nests to HEC machine, as explained before. (3)What seman- compute their results. Often, more than 90% of the tics constructs can be provided in LITL-X specifically execution time is spent on some computation-intensive for SSP and multi-threading? 
For example, a pragma kernels composed of loop nests. It is of extreme impor- may be presented to indicate the most beneficial loop tance to schedule these loops effectively to improve the level, or indicate the scheduling strategies. The static overall performance of the application. Loop schedul- compiler acts according to the pragma and generates ing on a parallel distributed system can be broadly di- some (partial) schedules, and stores this pragma as a vided into two classes: static and dynamic scheduling. structured hint in appropriate format if it is dependent Static scheduling tends to cause load imbalance, since on runtime . the exploitable parallelism, and the grain size of the parallelism, vary with the machine resource availabil- ity, data distribution and the latency of memory ac- 4 Efficient Interaction between Appli- cesses, especially in the context of the highly dynamic cations and System Software and threaded HEC machines. Consequently, dynamic scheduling has been developed and shown promising In this section, we outline the solution strategies for performance improvement. the efficient interaction between applications and sys- The dynamic loop scheduling methods, however, tem software. There are two important aspects: the target only Thread-Level Parallelism (TLP).Incon- first is developing the methods, models, and tools to trast, there is another important technology, namely, facilitate mapping complex domain application codes software pipelining, aims to exploit Instruction-Level to the HTVM model; the second is developing a mon- itoring methodology and interface that will help the ming model and runtime system software are under the adaptive compiler and runtime system to optimize ex- guidance of the domain experts’ knowledge. It shows ecution and resource utilization on the fly. the progression from a domain-specific script-based de- scription of a simulation to HTVM code. 
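Only the second step of the extended SSP approach, partitioning iterations of the selected loop level among threads, can be sketched at this level; the instruction-level pipelining itself is a compiler transformation not reproduced here, and all names below are illustrative assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

# toy loop nest: for i in range(8): for j in range(4): body(i, j)
def body(i, j):
    return i * 10 + j

def run_slice(i_iters):
    # one thread executes several iterations of the selected (outer) loop
    # level; within a slice, a real compiler would software-pipeline them
    return [body(i, j) for i in i_iters for j in range(4)]

def partition_and_run(n_i=8, n_j=4, n_threads=2):
    # partition the chosen loop level's iterations into per-thread groups
    groups = [range(k, n_i, n_threads) for k in range(n_threads)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = pool.map(run_slice, groups)
    return sorted(x for r in results for x in r)

sequential = sorted(body(i, j) for i in range(8) for j in range(4))
assert partition_and_run() == sequential  # same work, split across threads
```

The sort is only there to compare results independent of execution order; the point is that each thread owns several iterations of the selected loop level, which is where thread-level and instruction-level parallelism would be combined.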
4 Efficient Interaction between Applications and System Software

In this section, we outline solution strategies for efficient interaction between applications and system software. There are two important aspects: the first is developing the methods, models, and tools to facilitate mapping complex domain application codes to the HTVM model; the second is developing a monitoring methodology and interface that will help the adaptive compiler and runtime system optimize execution and resource utilization on the fly.

4.1 Domain-Specific Knowledge Input to System Software

Today, it is widely accepted that some ultra-large-scale scientific applications targeted by HEC machines are so complex that efficiently mapping them to the architecture will require application scientists/programmers who are domain experts. The gap between expressions of domain-specific computations and expressions tailored to efficient execution on a given system architecture is widening. Domain experts are, and will continue to be, challenged to write high-performance codes.

Guidance from the application programmer, and more generally from the domain-specific idioms and algorithms used explicitly or implicitly by the application programmer, must be passed to the adaptive compiler, runtime system, and monitoring system to enable them to efficiently optimize the execution of the code. We plan to define and implement a system of structured hints to capture and apply the combined expertise of the domain specialist and the compiler. Our notion of structured hints embodies the idea that the compiler and the domain expert can collaborate to reduce the number of possible optimization strategies to a modest and manageable set of options that are most likely to produce high-performance code in the context of a complex and adaptive system architecture.

• The compiler will identify points in the code that present potential for optimization, but for which it has insufficient information to proceed on its own.

• The domain expert, led by the structured list of opportunities generated by the compiler, will add priorities and rules to this list that will aid the compiler and runtime in streamlining code execution.

Figure 3. Mapping pNeocortex to system software
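A structured hint might be represented along the following lines; this schema is a hypothetical illustration of the collaboration just described, since the concrete hint format is left open here:

```python
from dataclasses import dataclass, field

@dataclass
class StructuredHint:
    target: str                  # "compiler", "runtime", or "monitor"
    location: str                # code region the hint refers to
    priority: int                # weight assigned by the domain expert
    rules: list = field(default_factory=list)

# the compiler lists opportunities; the expert attaches priorities and rules
hints = [
    StructuredHint("compiler", "kernel_loop_nest", priority=9,
                   rules=["pipeline level 2", "prefer data locality"]),
    StructuredHint("monitor", "halo_exchange", priority=5,
                   rules=["sample memory access pattern"]),
]

# the runtime consults only the options the expert ranked highly
best = max(hints, key=lambda h: h.priority)
assert best.location == "kernel_loop_nest"
```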

HTVM embodies a model that we believe effectively expresses the low-level idioms and interfaces required by future ultra-large-scale computing platforms. But programming at the HTVM level will require expertise well outside the domain of typical application specialists. To bridge this gap with as little compromise as possible, we present a layered architecture that we will implement manually first, with an eye toward later automation as the relevant technologies mature. Fig. 3 illustrates the basic idea as it has evolved in the context of the EARTH project [19] and a particular application: a simulation of electrical activity in the neocortex. With the assistance of the domain-specific knowledge embedded in the neocortex simulation, a test model of the PGENESIS neocortex is designed, and the code mapping and optimization on the EARTH base programming model and runtime system software are carried out under the guidance of the domain experts' knowledge. The figure shows the progression from a domain-specific, script-based description of a simulation to HTVM code. The domain expert's knowledge is built into the script language and/or idiomatic modules that augment the script or other code. Pseudo-code distills the simulation down to its key structural and computational components, and includes hints to be used to guide optimization. This code is then translated to run on the HTVM. The resulting code is ready for compilation and execution.

The resulting organized and expertly culled guide to optimization, the structured hints, includes data structures, dependencies, weights, and rules. In addition to focusing the attempts at optimization, the resulting structured hints will be an integrated part of our Program/Execution Knowledge base, providing the runtime system with an informed and tailored set of options around which to make its choices. Each hint can be expressly targeted at some part of the execution model: the adaptive compiler, the runtime system, or the monitoring system. For example, informed choices about which pieces of the code to instrument, and how, will become part of the metric suite used by the adaptive compilation and runtime system to adjust resource allocation and compilation strategy during execution. As another example, the domain expert can identify critical parameters to be adjusted by the compiler for its adaptive optimizations, thereby narrowing the parameter space to be searched. Without reference to the underlying hardware architecture, or even to the HTVM software architecture, the hints must address, in a general way, issues of: 1) data locality, 2) monitoring priorities, 3) data access patterns, and 4) computation patterns. These will be mapped directly to specific actions, weighting schemes, and optimization strategies in the HTVM system software.

4.2 Monitoring of Application Execution

The adaptive compiler and runtime system will require feedback derived from monitoring execution and resource allocation. The hints discussed in the previous subsection will drive both static and dynamic optimizations of the program execution. In the dynamic context, they will provide the system with guidance on the degrees of freedom most likely to affect performance, likely bottlenecks in the code, and unpredictable aspects of data locality and computational work patterns, to steer monitoring resources and to develop heuristic models.

Our plan is to implement a hint schema that is fed by the application programming and domain-expert interactions. It will be used by the various stages of the code translation process, the HTVM system, and the runtime monitoring system.
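The kind of monitor-to-runtime feedback loop intended here can be caricatured as follows; the imbalance metric and the halving/doubling policy are invented purely for illustration and are not the actual HTVM mechanism:

```python
# toy feedback loop: the monitor reports per-chunk load imbalance and the
# runtime adjusts the loop-scheduling chunk size (illustrative policy only)
def adjust_chunk(chunk, observed_imbalance, threshold=0.2):
    if observed_imbalance > threshold:
        return max(1, chunk // 2)   # finer chunks rebalance the load
    return min(64, chunk * 2)       # coarser chunks cut scheduling overhead

chunk = 16
for imbalance in [0.5, 0.4, 0.1, 0.05]:  # monitor readings over time
    chunk = adjust_chunk(chunk, imbalance)

print(chunk)  # 16 -> 8 -> 4 -> 8 -> 16
```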

5 Infrastructure and Experimentation Plan and Status

In this section, we report the current status of the implementation and experimentation of the system software and tools.

5.1 The Infrastructure of an Experimental Testbed

For the implementation and experimental study of the HTVM system software, we continue to develop and refine our software infrastructures. As the starting point, we choose the system software infrastructure for the IBM Cyclops-64 cellular architecture, a peta-flop supercomputing chip-multithreaded architecture under development at IBM Research, featuring 160 thread units and a number of memory modules interconnected by a high-speed on-chip interconnection network [8]. We will leverage a system software infrastructure and tool chain for this architecture being developed jointly by a collaborative effort between ETI and CAPSL at the University of Delaware. The system software infrastructure includes a runtime system for a threaded virtual machine at the LGT level, a thread communication and synchronization library, an OpenMP compiler and runtime system, a function-accurate simulator, a cycle-accurate simulator, a GCC-based compiler, the binary utilities (assembler, linker, etc.), and the libraries (e.g., libc/libm). We also continue to develop the EARTH and CARE software infrastructures at the SGT and TGT levels. We have re-targeted the Open64 compilation infrastructure from the 64-bit Intel IA-64 architecture to the 32-bit Intel XScale embedded architecture [3], and recently we have successfully implemented SSP scheduling [16], register allocation, and code generation [15] in this compiler.

Based on the above infrastructures, we are constructing an experimental testbed in the following way. First, we are modifying the current virtual machine for large-grain threads under the Cyclops-64 software infrastructure and implementing the HTVM small-grain and tiny-grain threads. Second, we are implementing runtime system software support for both, leveraging our experience with the EARTH and CARE software infrastructures. Third, we are extending and modifying the function-accurate simulator to include support for the relevant architecture features. We are also extending the above runtime system, compiler, and simulator to implement the algorithms developed during this research for continuous compilation and runtime optimization, as introduced in Section 3.3.

5.2 Experimentation Plan and Status

The primary goal of the experimentation is to validate the proposed HEC system software and tools in addressing the needs of selected HEC applications. Have we created a practical methodology, ultimately amenable to automation, that enables efficient application programming by domain experts while producing codes that perform well? We have selected two codes for our study: a computational neuroscience code, which simulates large networks of biological neurons, and a fine-grain molecular dynamics code, which simulates relatively modest-sized molecules, a single protein or protein complex in water with multiple ion species. Both codes are representative HEC applications and are therefore important for the research and development of the HTVM execution model, programming model, and system software.

Our proposed experimental methodology comprises the five major tasks enumerated below. We will use the neuroscience code to blaze the trail, and follow the task list for the molecular dynamics code with a delay. In this way we will begin the development and testing of the process and tool set with one code, and use the second code to validate the process and provide opportunity for refinement.

• Instrument and characterize the application codes on existing machines to establish base performance properties.

• Develop performance models for each code in terms of the proposed HTVM model.

• Develop a new implementation of each code for the proposed HTVM model using the application mapping methodology.

• Validate on the simulation testbed.

• Project the performance and impact of the proposed new HEC software and tools.

6 Acknowledgments

We acknowledge the support of the National Science Foundation (CNS-0509332), Laboratory Directed Research and Development funding, IBM, ETI, and other government sponsors. We acknowledge Hongbo Rong, who has been instrumental in the development and documentation of some key ideas in this paper. We would also like to acknowledge the other members of the CAPSL group, who provide a stimulating environment for scientific discussions and collaborations, in particular Ziang Hu, Juan del Cuvillo, and Ge Gan.

References

[1] DARPA: High Productivity Computing Systems (HPCS).

[2] IBM: PERCS (Productive, Easy-to-use, Reliable Computing System).

[3] Kylin C Compiler.

[4] D. Callahan, B. L. Chamberlain, and H. P. Zima. The Cascade high productivity language. In Ninth International Workshop on High-Level Parallel Programming Models and Supportive Environments, Santa Fe, New Mexico, April 2004.

[5] C. Cascaval, J. Castanos, L. Ceze, M. Denneau, M. Gupta, D. Lieber, J. Moreira, K. Strauss, and H. S. Warren, Jr. Evaluation of a multithreaded architecture for cellular computing. In Proceedings of the 8th International Symposium on High Performance Computer Architecture (HPCA), Boston, Massachusetts, 2002.

[6] K. Davis, A. Hoisie, G. Johnson, D. J. Kerbyson, M. Lang, S. Pakin, and F. Petrini. A performance and scalability analysis of the BlueGene/L architecture. In Proceedings of SC2004: High Performance Networking and Computing, Pittsburgh, PA, Nov. 2004. ACM SIGARCH and IEEE Computer Society.

[7] J. del Cuvillo, W. Zhu, Z. Hu, and G. R. Gao. TiNy Threads: A thread virtual machine for the Cyclops64 cellular architecture. In Fifth Workshop on Massively Parallel Processing, in conjunction with the 19th International Parallel and Distributed Processing Symposium (IPDPS 2005), page 265, Denver, Colorado, USA, April 2005.

[8] J. B. del Cuvillo, Z. Hu, W. Zhu, F. Chen, and G. R. Gao. Toward a software infrastructure for the Cyclops64 cellular architecture. CAPSL Technical Memo 55, April 2004.

[9] M. Frigo, C. E. Leiserson, and K. H. Randall. The implementation of the Cilk-5 multithreaded language. In Proceedings of the ACM SIGPLAN '98 Conference on Programming Language Design and Implementation, pages 212–223, 1998.

[10] G. Gao, K. Theobald, A. Marquez, and T. Sterling. The HTMT program execution model. CAPSL Technical Memo 09, July 1997.

[11] R. H. Halstead, Jr. Multilisp: A language for concurrent symbolic computation. ACM Transactions on Programming Languages and Systems, 7(4):501–538, Oct. 1985.

[12] A. Jacquet, V. Janot, R. Govindarajan, C. Leung, G. Gao, and T. Sterling. Performance model and evaluation of high performance architectures with percolation. Technical Report 43, Newark, DE, Nov. 2002.

[13] M. Lam. Software pipelining: An effective scheduling technique for VLIW machines. In Proceedings of the SIGPLAN '88 Conference on Programming Language Design and Implementation, pages 318–328, Atlanta, Georgia, June 1988. SIGPLAN Notices, 23(7), July 1988.

[14] A. Marquez and G. R. Gao. CARE: Overview of an adaptive multithreaded architecture. In Fifth International Symposium on High Performance Computing (ISHPC-V), Tokyo, Japan, October 2003.

[15] H. Rong, A. Douillet, R. Govindarajan, and G. R. Gao. Code generation for single-dimension software pipelining of multi-dimensional loops. In Proceedings of the 2004 International Symposium on Code Generation and Optimization (CGO), pages 175–186, Palo Alto, California, March 2004.

[16] H. Rong, Z. Tang, R. Govindarajan, A. Douillet, and G. R. Gao. Single-dimension software pipelining for multi-dimensional loops. In Proceedings of the 2004 International Symposium on Code Generation and Optimization (CGO), pages 163–174, Palo Alto, California, March 2004. Best Paper Award.

[17] T. Sterling. An introduction to the Gilgamesh PIM architecture. Lecture Notes in Computer Science, 2150, 2001.

[18] X. Tang and G. R. Gao. Automatically partitioning threads based on remote paths. Technical Report 23, Newark, DE, July 1998.

[19] K. B. Theobald. EARTH: An Efficient Architecture for Running Threads. PhD thesis, McGill University, May 1999.