
Non-uniform Memory Affinity Strategy in Multi-Threaded Sparse Matrix Computations

Technical Report #11-07, Computer Science Department, Iowa State University, Ames, IA 50011, Sep. 2011.

Avinash Srinivasa and Masha Sosonkina
Ames Laboratory/DOE, Iowa State University, Ames, IA 50011, USA
[email protected], [email protected]

Abstract

As the core counts on modern multi-processor systems increase, so does the memory contention, with all the processes/threads trying to access the main memory simultaneously. This is typical of UMA (Uniform Memory Access) architectures, in which a single physical memory bank leads to poor scalability in multi-threaded applications. To palliate this problem, modern systems are moving increasingly towards Non-Uniform Memory Access (NUMA) architectures, in which the physical memory is split into several (typically two or four) banks. Each memory bank is associated with a set of cores, enabling threads to operate from their own physical memory banks while retaining the concept of a shared virtual address space. However, accessing shared data structures from the remote memory banks may become increasingly slow. This paper proposes a way to determine and pin certain parts of the shared data to specific memory banks, thus minimizing remote accesses. To achieve this, the existing application code has to be supplied with the proposed interface to set up and distribute the shared data appropriately among memory banks. Experiments with a NAS benchmark as well as with a realistic large-scale application calculating ab-initio nuclear structure have been performed. Speedups of up to 3.5 times were observed with the proposed approach compared with the default memory placement policy.

Keywords: Memory affinity, Non-Uniform Memory Access (NUMA) node, Multi-threaded execution, Shared array, Sparse matrix-vector multiply.

1. Introduction

Transistor densities have been growing in accordance with Moore's law, resulting in more and more cores being put on a single processor chip. With the increasing core counts on modern multi-processor systems, main memory bandwidth becomes an important consideration for high-performance applications. The main memory sub-system can be of two types nowadays: Uniform Memory Access (UMA) or Non-Uniform Memory Access (NUMA). UMA machines consist of a single physical memory bank for the main memory, which may lead to memory bandwidth contention when there are many application threads trying to access the main memory simultaneously. This scalability problem may be alleviated by NUMA architectures, wherein the main memory is physically split into several memory banks, with each bank associated with a set of cores; the combination of a bank and its cores is called a NUMA node. Hence, the memory contention among the threads may be reduced.

However, accesses to remote memory banks, as in the case of large shared arrays, for example, may become painstakingly slow and may negatively affect the application scalability at higher thread counts [10]. Thus, it is imperative to carefully consider which parts of the shared data should be attributed to which physical memory bank, based on the data access pattern or on other considerations. Such an attribution of data to physical main memory is often called memory affinity [2, 9]. This notion goes hand in hand with CPU affinity, as noted in [6], such that the threads are bound to specific cores from the application start and their context switches are disabled. Once the threads are bound, the memory may be pinned too. On multi-core NUMA platforms, the ability to pin the memory in the application code becomes important, since it is generally most beneficial for a data portion local to a thread to be placed on the memory bank local to the core it is executing on (1), so as to ensure the fastest access [1].

(1) Here and throughout the paper, it is assumed that only one thread is executing per core and there is no oversubscription of cores, as has been studied, e.g., in [16].

Conversely, the default memory affinity policy, as used in most Linux-type operating systems, is enforced system-wide for the whole application. This policy, called first-touch, ensures that there is fast access to at least one memory bank regardless of the shared data access pattern within application threads [7]. Specifically, the data is placed in the memory bank local to the thread writing to it first, which is typically the master thread. Thus, the downside of the first-touch policy is that all the threads accessing this shared data converge to this NUMA node, as shown in Fig. 1, causing bandwidth contention in the memory bank servicing the master thread. The problem may be exacerbated since the master thread typically initializes multiple shared data structures. Since the threads have to go outside their local NUMA node to access the data, remote access latencies are also incurred, which increases the application performance overhead. Thus, the default first-touch memory placement policy calls for improvement to achieve better scalability, which may be obtained using already existing software libraries for working with NUMA nodes [9].

The motivation for the present work was the need for improvements in sparse matrix-vector multiplications (SpMV), which constitute the bulk of the computational load in large-scale applications modeling physical phenomena using structured or unstructured matrices [15]. In particular, a nuclear physics application, Many Fermion Dynamics for nuclear structure (MFDn) [17], handles very large sparse unstructured matrices arising in the solution of the underlying Schrödinger equation.

The paper is organized as follows. Section 2 describes the proposed memory placement strategy, followed by the outline of its implementation and usage within sparse matrix computations (Section 3). Section 4 presents the applications tested, while Section 5 provides the experimental results. In Section 6, the concluding remarks are given.

Figure 1. Shared data access pattern with the default first-touch policy on a NUMA architecture. A dashed curved-corner rectangle represents a NUMA node.

2. Proposed memory placement strategy

The goal of the proposed memory placement strategy is to minimize the data transfer overhead between main memory and the application code when accessing shared data. Hence, the default (first-touch) placement has to be changed according to certain application and system considerations [5]. In a nutshell, the following general steps need to be taken to study the application at hand and to determine the memory placement for its shared data structures:

Step 1: Identify all the shared data structures in the application.
Step 2: Classify them as having a deterministic or non-deterministic access pattern by threads.
  - For deterministic: find a chunk-to-thread correspondence; pin each chunk to the memory bank local to the corresponding thread.
  - For non-deterministic: spread the data across all the memory banks.

The classification step (Step 2) may be performed based on a definition of the deterministic and non-deterministic accesses to a data structure. In the former, each portion of the structure is accessed by a single thread exclusively, while several threads may access a portion in the latter case. This definition is rather general and is featured, for example, in the case of multi-threaded loop parallelization, such that a block of loop iterations is dedicated to a thread. If the loop index corresponds to a data portion (called a chunk), such as that of a shared array, then each thread accesses its own array chunk exclusively. Such an array may be classified as having deterministic access and then distributed among specific memory banks. Fig. 2 presents the obtained distribution to the local NUMA nodes, such that vertical arrows emphasize the local access patterns, which minimizes the access latency. Since, for the non-deterministically accessed data structures, their thread access pattern and timing may not be known in advance, they are spread out in a fine-grain fashion across all the memory banks, as sketched in Fig. 3, in an attempt to alleviate the memory bandwidth contention.

Figure 2. Proposed placement of the shared data accessed deterministically. A dashed curved-corner rectangle represents a NUMA node.

Figure 3. Interleaved placement of the shared data accessed non-deterministically. A dashed curved-corner rectangle represents a NUMA node.

Algorithm 1 specifies the array chunk sizes attributed to each thread and, consequently, to each NUMA node by accepting the following inputs:
- Total array dimension dim_total;
- Total number of threads nthreads;
- Number of NUMA nodes mnodes (system parameter);
- Number of cores per NUMA node lcores (system parameter);

and producing two outputs:
- Chunk size dim_per_thread(i) attributed to thread i, (i = 1, ..., nthreads);
- Chunk size dim_per_node(j) attributed to NUMA node j, (j = 1, ..., mnodes).

Note that each NUMA node is typically associated with several cores and, thus, with a group of threads (one thread per core).

Algorithm 1. Determine chunk size per NUMA node.

  for j = 1 to mnodes do
    dim_per_node(j) ← 0
  end for
  per_thread_dim ← ceiling(dim_total / nthreads)
  virtual_dim ← per_thread_dim × nthreads
  offset ← virtual_dim − dim_total
  for i = 1 to (nthreads − offset) do
    dim_per_thread(i) ← per_thread_dim
  end for
  for i = (nthreads − offset + 1) to nthreads do
    dim_per_thread(i) ← per_thread_dim − 1
  end for
  for j = 1 to mnodes do
    for i = lcores × (j − 1) + 1 to lcores × j do
      dim_per_node(j) ← dim_per_node(j) + dim_per_thread(i)
    end for
  end for

Algorithm 1 splits the data structure into chunks in accordance with the exact assignment thread access pattern, in which each thread is assigned an (almost) equal contiguous portion of the data structure. This pattern is common among multi-threaded programming models, such as OpenMP [3], with the default assignment size ensuring contiguous data in each chunk. Additionally, the thread scheduling (also called work-sharing) is assumed to be static, so that it is known before the loop execution. Thus, once the contiguous chunk sizes are determined by Algorithm 1, the actual chunk attribution is accomplished by providing a mapping of chunk number to NUMA node number, where array chunks and NUMA nodes are numbered consecutively, as in Fig. 2, for example.
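As an illustration, a minimal C sketch of Algorithm 1 is given below (0-based indexing; the function name and types are illustrative and are not part of the MASA-SpMV code):

  /* Compute dim_per_thread[] and dim_per_node[] for a 1-D shared array
     of length dim_total, following Algorithm 1 (0-based indexing). */
  void compute_chunk_sizes(long dim_total, int nthreads, int mnodes, int lcores,
                           long *dim_per_thread, long *dim_per_node)
  {
      long per_thread_dim = (dim_total + nthreads - 1) / nthreads;  /* ceiling */
      long offset = per_thread_dim * nthreads - dim_total;

      for (int j = 0; j < mnodes; j++)
          dim_per_node[j] = 0;
      /* the first (nthreads - offset) threads receive the larger chunk */
      for (int i = 0; i < nthreads; i++)
          dim_per_thread[i] = (i < nthreads - offset) ? per_thread_dim
                                                      : per_thread_dim - 1;
      /* accumulate the chunks of the lcores threads hosted by each NUMA node */
      for (int j = 0; j < mnodes; j++)
          for (int i = lcores * j; i < lcores * (j + 1) && i < nthreads; i++)
              dim_per_node[j] += dim_per_thread[i];
  }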

3. Implementation details

The NUMA application programming interface (API) [9] available for Linux is used in this work to control the data placement for shared arrays, overriding the default first-touch memory affinity policy employed by the operating system. This API offers two principal memory placement policies, called bind and interleave. The former places (binds) memory of an application on a selected memory bank or set of banks, whereas the latter spreads (interleaves) data on a page-by-page basis over the memory banks of a NUMA machine. If applied throughout the entire application, each policy may be too restrictive, since it is often necessary to tailor the memory attribution to a particular access pattern of a data structure [14]. For fine-tuning purposes, the NUMA API provides a system call mbind(), which may be used to apply these affinity policies selectively to certain regions of memory. The mbind() interface has been used in this work to implement the proposed shared data placement, in which certain portions of shared arrays are assigned to memory in accordance with their access pattern within the multi-threaded application at hand. To benefit from the selective and intelligent data placement on the memory banks, thread migration and thread context switches have to be disabled. In other words, the CPU affinity must be observed, which may be accomplished with the sched_setaffinity() system call, also available on Linux systems.
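For example, CPU affinity may be enforced from within an OpenMP code roughly as follows (a minimal sketch, not the code used in this work; it assumes that thread i is to run on core i):

  #define _GNU_SOURCE
  #include <sched.h>     /* sched_setaffinity(), cpu_set_t */
  #include <omp.h>

  /* Pin each OpenMP thread to one core so that memory later pinned with
     mbind() stays local to the thread that accesses it. */
  void pin_threads_to_cores(void)
  {
      #pragma omp parallel
      {
          cpu_set_t mask;
          CPU_ZERO(&mask);
          CPU_SET(omp_get_thread_num(), &mask);      /* one thread per core assumed */
          sched_setaffinity(0, sizeof(mask), &mask); /* 0 = the calling thread */
      }
  }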

An important aspect to consider when using mbind() is that it is designed to work on large chunks of data which are aligned on a page boundary, i.e., the starting address of the chunk should be an integral multiple of the system page size. So, once the shared array chunks have been determined, it becomes necessary to check whether each such chunk is page-aligned before consigning it to a NUMA node. Otherwise, the nearest page-aligned address to the chunk is to be determined, starting from which the chunk can be pinned to the appropriate NUMA node. Such a pinning may cause an address mismatch between the page-aligned and the actual chunk boundaries as determined in Algorithm 1. To minimize the occurrence of the mismatches, all the shared arrays are allocated starting on a page boundary using the C function valloc(). With this set-up, the maximum difference between the mismatched addresses per NUMA node is estimated to be half of the system page size. Note that a typical system page is of the order of 10^3 bytes. Since high-performance applications routinely work with data in the order of gigabytes, this mismatch has no serious impact on the effectiveness of the strategies described in this work.
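A minimal sketch of this pinning step is shown below, assuming the Linux libnuma headers (<numaif.h>, linked with -lnuma); the helper name and the rounding choice are illustrative rather than the exact MASA-SpMV code:

  #include <stdint.h>
  #include <stdlib.h>     /* valloc() */
  #include <unistd.h>     /* sysconf() */
  #include <numaif.h>     /* mbind(), MPOL_BIND, MPOL_MF_MOVE */

  /* Bind one chunk of a shared array to a given NUMA node.  The chunk start
     is rounded down to the nearest page boundary, as discussed above. */
  static void bind_chunk_to_node(void *chunk_start, size_t chunk_bytes, int node)
  {
      uintptr_t page = (uintptr_t)sysconf(_SC_PAGESIZE);
      uintptr_t aligned = (uintptr_t)chunk_start & ~(page - 1);
      size_t len = chunk_bytes + ((uintptr_t)chunk_start - aligned);

      unsigned long nodemask = 1UL << node;       /* one bit per NUMA node */
      mbind((void *)aligned, len, MPOL_BIND, &nodemask,
            8 * sizeof(nodemask), MPOL_MF_MOVE);  /* migrate pages touched already */
  }

  /* Usage sketch: the array itself is allocated page-aligned with valloc(),
     e.g.  double *A = valloc(nnz * sizeof(double));
     and each chunk of dim_per_node(j) elements is bound to NUMA node j. */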

3.1 Shared arrays in sparse matrix-vector multiply

The sparse matrix-vector multiply (SpMV) forms an important computational core in many scientific applications. Hence, it is highly beneficial to employ its efficient implementation. Its naive implementations, however, may suffer from poor performance on multi-core NUMA architectures, mainly due to the memory contention and latency problems, as will be evident from the experiments described in Section 5. Therefore, the strategies described in Section 2 are applied to sparse matrix data structures, such that the most common matrix storage formats are considered. Specifically, sparse matrices are characterized by a very large percentage (often as much as 95%) of zero entries, which are not stored for performance and space reasons. As a rule of thumb, an n × m sparse matrix is represented by three one-dimensional arrays:
- A, for all the non-zero values;
- jA, for their positions in each row or column;
- ptrA, for the pointers to the beginning of each column or row.

Such a storage format is called Compressed Sparse Row (CSR) or Compressed Sparse Column (CSC), depending on whether column or row indices are stored in jA, respectively. Then, a multiplication of the sparse matrix stored in CSR or CSC by a vector x of size m may be performed to obtain a vector y of size n, as shown in the top and bottom code segments, respectively, in Fig. 4.

Figure 4. Pseudo-code for sparse matrix-vector multiplication in the CSR (top) and CSC (bottom) formats after the output vector initializations are performed.

  CSR:
  1: for i = 1 to n do
  2:   for k = ptrA(i) to ptrA(i + 1) − 1 do
  3:     y(i) ← y(i) + x(jA(k)) × A(k)
  4:   end for
  5: end for

  CSC:
  1: for i = 1 to m do
  2:   for k = ptrA(i) to ptrA(i + 1) − 1 do
  3:     y(jA(k)) ← y(jA(k)) + x(i) × A(k)
  4:   end for
  5: end for

Sparse matrices are shared among the threads involved in the SpMV computation and need to be bound to local memory banks to ensure minimal data transfer overhead. As is evident from the code segments in Fig. 4, the outer loop is over the rows (for CSR) or columns (for CSC) of the matrix. For a typical multi-threaded SpMV, this loop is parallelized such that each thread gets a certain number of rows/columns to work with. However, several threads may access the same vector components to read or write their values. Since the precise timing of the read/write operations may vary dynamically, for coherent results, the sequencing of the write accesses in the CSC format (line 3 of the bottom code segment in Fig. 4) must be enforced in some way. On the contrary, in the CSR format, the SpMV has no shared arrays with simultaneous write accesses by threads, so no sequencing is needed.
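For reference, a minimal OpenMP rendering of the CSR loop of Fig. 4 (top) could look as follows (0-based C indexing; an illustrative sketch, not the benchmark code); static scheduling gives each thread a contiguous block of rows, which matches the chunking of Algorithm 1:

  /* y is assumed to be initialized before the call, as in Fig. 4. */
  void spmv_csr(int n, const int *ptrA, const int *jA, const double *A,
                const double *x, double *y)
  {
      #pragma omp parallel for schedule(static)
      for (int i = 0; i < n; i++) {
          double sum = 0.0;
          for (int k = ptrA[i]; k < ptrA[i + 1]; k++)
              sum += A[k] * x[jA[k]];
          y[i] += sum;     /* row i is written by exactly one thread */
      }
  }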
The presence of shared arrays and parallelizable loops makes SpMV an ideal candidate for testing the proposed memory affinity policies. Once these shared arrays have been identified, they are assigned to the deterministic or non-deterministic category based on the thread access patterns. The bind and interleave strategies are then applied in accordance with the two-step strategy from Section 2. Tables 1 and 2 provide information regarding the shared array access patterns which are part of the CSR and CSC multiplications, respectively, along with the NUMA policy used for each. It may be observed that sparse matrices are shared among the threads having exclusive access to their portions in the SpMV computation and, thus, need to be bound to local memory banks to ensure minimal data transfer overhead. On the other hand, the vectors x and y may be shared with either deterministic or non-deterministic access, depending on the type of storage format considered. Thus, to effectively distribute the shared arrays with the deterministic access pattern, it becomes necessary to select the specific portions (chunks) of these arrays which are accessed by each thread. To accomplish this task, the output of Algorithm 1, i.e., the chunk size of each NUMA node, is used to determine the array starting and ending indices that delineate each chunk's boundaries.

Table 1. Shared array access and pinning for CSR.

  Array   Access                  Policy
  A       Deterministic Read      Bind
  jA      Deterministic Read      Bind
  ptrA    Deterministic Read      Bind
  x       Non-deterministic Read  Interleave
  y       Deterministic Write     Bind

Table 2. Shared array access and pinning for CSC.

  Array   Access                   Policy
  A       Deterministic Read       Bind
  jA      Deterministic Read       Bind
  ptrA    Deterministic Read       Bind
  x       Deterministic Read       Bind
  y       Non-deterministic Write  Interleave

Application interface. To facilitate the usage of the proposed memory placement strategies, a high-level interface set, termed MASA-SpMV (Memory Affinity for Shared Arrays - Sparse Matrix Vector multiply), has been developed for sparse matrix-vector multiply in the CSC or CSR formats. This interface, encapsulating the implementation of Algorithm 1, the determination of the contiguous chunk start and end positions within the arrays, and the memory pinning function calls from [9], is composed of the following C function signatures:

- void masa_preprocess()
  - Determines the system constants mnodes and lcores required by Algorithm 1, maintained as global variables.
  - Catches environment variables which deal with multi-threading, such as OMP_SCHEDULE in OpenMP.
  - Disables the proposed strategies if the thread scheduling differs from static.
- void masa_allocate(void **sharedarray)
  - Allocates the shared array as aligned to a page boundary.
- void masa_compute_chunks(int dim_total, int nthreads, void *ptrA)
  - Computes the chunk size per NUMA node (Algorithm 1) for each shared array used in the SpMV computation.
  - Finds the chunk-to-node mapping for these shared arrays and stores it using global data structures.
- void masa_distribute(void *A, void *jA, void *ptrA, void *x, void *y)
  - Determines the nearest page-aligned address for each shared array chunk.
  - Pins the shared arrays according to the mapping calculated in masa_compute_chunks by calling mbind().
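A possible calling sequence is sketched below; the call order and the way the arguments are used here are assumptions based on the descriptions above, not code taken from MASA-SpMV itself:

  #include <omp.h>

  /* Prototypes as listed above (underscores restored). */
  void masa_preprocess(void);
  void masa_allocate(void **sharedarray);
  void masa_compute_chunks(int dim_total, int nthreads, void *ptrA);
  void masa_distribute(void *A, void *jA, void *ptrA, void *x, void *y);

  double *A, *x, *y;
  int *jA, *ptrA;

  void setup_spmv_affinity(int n)
  {
      masa_preprocess();                     /* query mnodes, lcores, OMP_SCHEDULE */
      masa_allocate((void **)&A);            /* page-aligned allocations; the array  */
      masa_allocate((void **)&jA);           /* sizes are presumably handled         */
      masa_allocate((void **)&ptrA);         /* elsewhere, since the listed          */
      masa_allocate((void **)&x);            /* signature carries none               */
      masa_allocate((void **)&y);
      /* ... fill the sparse matrix arrays ... */
      masa_compute_chunks(n, omp_get_max_threads(), ptrA);
      masa_distribute(A, jA, ptrA, x, y);    /* pin chunks to NUMA nodes via mbind() */
  }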
4. Test applications

Armed with the proposed strategies and their incarnation in the MASA API, the proposed strategy may be employed in realistic settings of scientific application codes. Two applications have been selected for their reliance on parallel SpMV: the CG (Conjugate Gradient) NAS benchmark code and an ab-initio nuclear structure calculation code, MFDn (Many Fermion Dynamics nuclear). Both codes exploit multi-threaded parallelism using OpenMP, in which the SpMV loop is parallelized as shown in Fig. 4 (top).

4.1 CG: NAS parallel benchmark

The NAS Parallel Benchmarks (NPB) is a suite derived mainly from computational fluid dynamics (CFD) codes and is composed of both entire applications and computational kernels [8]. In particular, the CG kernel has been selected for this work. It consists of an iterative solution of a linear system of equations with a sparse symmetric matrix and is performed as part of an "outer" eigenvalue computation. The most computationally intensive stage of the CG iterative method is the sparse matrix-vector multiplication. As implemented in the CG of the NAS suite, this multiplication stores the full matrix (without regard for its symmetry) in the CSR format. Its main loop features multi-threaded parallelism with OpenMP (2).

(2) This implementation has been provided by the OMNI compiler group [12].

4.2 MFDn: Many Fermion Dynamics nuclear

MFDn is a large-scale parallel code developed at Iowa State University and is used for ab-initio nuclear physics calculations [17]. In MFDn, the large sparse symmetric Hamiltonian matrix is evaluated in a large harmonic oscillator basis. It is then diagonalized by an iterative Lanczos procedure to obtain the low-lying eigenvalues and eigenvectors. The eigenvectors are then used to obtain a suite of experimental quantities to test the accuracy and convergence of the nuclear structure. Each Lanczos iteration spends most of its time in SpMV with the Hamiltonian matrix, only the lower half of which is stored (in the CSC format) to save memory.

MFDn has been shown to have good scaling properties using the Message Passing Interface (MPI) [4] on existing supercomputing architectures, due to recent algorithmic improvements that significantly improved its overall performance. In [11], the use of a hybrid MPI/OpenMP approach has been presented to take advantage of the current multi-core supercomputing platforms [13]. Here, the sparse matrix data is partitioned among the available compute nodes and is exchanged by using the MPI distributed communication library. Then, the local portions of the data are accessed, this time, by using multi-threaded programming tools such as OpenMP. Fig. 5 shows the MFDn sparse matrix distribution across the available MPI processes, which are organized in a two-dimensional grid. The off-diagonal processors (numbered 6–15 in Fig. 5) have more work to do during the SpMV phase, since they have to work with the upper half of the matrix as well (for computing the transpose output vector), which is not stored in memory.

Figure 5. Two-dimensional distribution of the lower triangle of the Hamiltonian matrix (diagonal processors are numbered 1–5 and marked in red).

The code segment in Fig. 6 describes the SpMV for MFDn on an off-diagonal MPI processor, with xt and yt referring to the components of the input and output vectors, respectively, used with the transposed matrix (i.e., the upper matrix half).

Figure 6. Pseudo-code for sparse matrix-vector multiplication in MFDn for the off-diagonal processors.

  1: for i = 1 to m do
  2:   for k = ptrA(i) to ptrA(i + 1) − 1 do
  3:     y(jA(k)) ← y(jA(k)) + x(i) × A(k)
  4:     yt(i) ← yt(i) + xt(jA(k)) × A(k)
  5:   end for
  6: end for

In essence, this SpMV morphs the two loops shown in Fig. 4 into one, such that its multiplication operation in line 3 is the same as the one in Fig. 4 (bottom). Therefore, during its multi-threaded execution, the write operation on y has to be performed in sequence by the threads involved. Using OpenMP, the serialization of the write operation has been implemented by way of a "critical section", which is entered one thread at a time and, thus, exhibits no parallelism, hurting the code scaling at higher thread counts.
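For illustration, the serialized variant can be sketched in OpenMP as follows (a sketch under the above description, not the actual MFDn code):

  /* Merged off-diagonal SpMV of Fig. 6: yt(i) is owned by the thread that owns
     column i, but several threads may update the same y(jA(k)), so that update
     sits in a critical section and is effectively serialized. */
  void spmv_offdiag_serialized(int m, const int *ptrA, const int *jA,
                               const double *A, const double *x, const double *xt,
                               double *y, double *yt)
  {
      #pragma omp parallel for schedule(static)
      for (int i = 0; i < m; i++) {
          for (int k = ptrA[i]; k < ptrA[i + 1]; k++) {
              #pragma omp critical
              y[jA[k]] += x[i] * A[k];        /* shared row index: must be serialized */
              yt[i] += xt[jA[k]] * A[k];      /* private to the owner of column i */
          }
      }
  }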
Since the experiments presented here aim to test the proposed memory placement strategy for a large number of threads working in parallel, a workaround to circumvent the need for serialization has been developed for the testing purposes. Specifically, in MFDn, the SpMV has been augmented with the code to perform the multiplication in the CSR as well as the CSC matrix format, such that CSC is used for the computation of yt and CSR for y, respectively. Fig. 7 presents the modified SpMV for an off-diagonal processor. Note that this code requires additional storage for the CSR representation of the matrix, which is represented by the arrays Ar, jAr, and ptrAr. These stand for the value, position, and pointer arrays in CSR.

Figure 7. Pseudo-code for the modified sparse matrix-vector multiplication in MFDn for the off-diagonal processors.

  1: for i = 1 to m do
  2:   for k = ptrA(i) to ptrA(i + 1) − 1 do
  3:     yt(i) ← yt(i) + xt(jA(k)) × A(k)
  4:   end for
  5: end for
  6: for j = 1 to n do
  7:   for k = ptrAr(j) to ptrAr(j + 1) − 1 do
  8:     y(j) ← y(j) + x(jAr(k)) × Ar(k)
  9:   end for
  10: end for

5. Experimental results

Experiments were conducted for the two test applications with the following problem and execution parameters:
- CG: Class C with a matrix of size 150,000; a single MPI process.
- MFDn: Carbon-12 (12C) nucleus with the quantum oscillation number Nmax = 4, resulting in a matrix of size 1,118,926; six MPI processes (one per compute node).

The tests were performed on the Hopper supercomputer at NERSC. Hopper is a Cray XE6 with 6,384 compute nodes. Each compute node has a cache-coherent Non-Uniform Memory Access (ccNUMA) architecture with two twelve-core AMD 'MagnyCours' 2.1 GHz processors and 32 GB of RAM. The RAM is split into 4 memory banks of 8 GB each, with each group of 6 cores having a direct link to one memory bank. Thus, one NUMA node is associated with six cores. All the results are reported by timing SpMV (wall-clock time) on a single compute node on Hopper. For MFDn, the maximum time is taken over all the compute nodes running off-diagonal MPI processes (Fig. 5), since they appear to be more compute-intensive with regard to SpMV.

First, a scaling study was conducted to observe the impact of the default first-touch policy on the performance of the two applications when the number of threads is increased. Fig. 8 and 9 show the speed-ups obtained with varying thread numbers per node, normalized to the smallest considered number of threads, for CG and MFDn, respectively. Note that the case of one thread is skipped in the exposition due to its triviality when investigating remote memory contention and latency in multi-threaded environments. A linear speed-up has been observed in SpMV when increasing the number of threads from one to three. From Fig. 8 and Fig. 9, it can be clearly observed that there is good scaling in moving from three to six threads. Beyond that, however, the scaling is erratic and poor, which may be explained by remote access latencies and bandwidth contention.

Figure 8. CG performance with the first-touch policy and increasing thread count.

Figure 9. MFDn performance with the first-touch policy and increasing thread count.

Next, a performance comparison was conducted by applying the proposed memory placement strategy during the SpMV in the two applications. Fig. 10 and 11 illustrate the gains obtained with the proposed strategy compared with the default policy for CG and MFDn, respectively, while Fig. 12 and 13 present the "raw" speed-ups as calculated with respect to the lowest thread count used in the experiments.

Figure 10. Performance gains of CG when the proposed strategy (red bars) is used. For each thread count, the result is normalized by the performance with the first-touch policy (blue bars).

Figure 11. Performance gains of MFDn when the proposed strategy (red bars) is used. For each thread count, the result is normalized by the performance with the first-touch policy (blue bars).

Figure 12. CG performance with the proposed policy and increasing thread count.

Figure 13. MFDn performance with the proposed policy and increasing thread count.

From the comparison, it is clear that CG benefits from the proposed strategy to a much higher extent than MFDn. Additionally, the scaling for CG is almost ideal, whereas it suffers for MFDn moving all the way up to 24 threads. The absence of parallelism in the critical section of SpMV hinders the performance of MFDn at higher thread counts. Hence, the SpMV as in Fig. 7 has been considered in the experiments. Fig. 14 and 15 present the obtained results for the performance gains with the proposed strategy and for the scaling, respectively. Near-perfect scaling has been encountered for the modified SpMV, which proves the effectiveness of the proposed strategies within multi-threaded applications with high thread counts.

Figure 14. Performance gains of the modified MFDn (SpMV) when the proposed strategy (red bars) is used. For each thread count, the result is normalized by the performance with the first-touch policy (blue bars).

Figure 15. Modified MFDn (SpMV) performance with the proposed policy and increasing thread count.

6. Conclusions

This work investigates the impact of memory affinity on multi-threaded applications when executing on multi-core NUMA architectures with multiple physical memory banks. A strategy is proposed to place the shared data into specific memory banks based on the access pattern within the application. Specifically, the shared data is first categorized as being deterministically or non-deterministically accessed. Then, for the former, the chunk sizes are computed for the distribution to the memory banks local to the threads accessing each chunk. The data accessed non-deterministically, on the other hand, is interleaved across the memory banks on a uniform fine-grain (page) basis. A way of tailoring this general strategy to a computation has also been provided by considering sparse matrix-vector multiplication as a case study. For this purpose, two widely-used sparse matrix representations have been selected and an API proposed to employ the strategies within the multiplication code. By using this API, other multi-threaded computations that access shared data may be enhanced with the proposed strategy.

The new strategy overcomes the shortcomings of the default operating system placement policy, which may cause remote access latencies and bandwidth contention in NUMA architectures. The experiments demonstrate a significant advantage of the careful shared-data memory placement in the case of two different applications that rely on multi-threaded multiplication of a large sparse matrix by a vector. Both the CG computational kernel from the NAS parallel benchmark suite and the ab-initio nuclear structure calculation MFDn benefited greatly from the proposed strategies. Improvements of up to 3.5 times were observed compared with the default memory placement. For any increase in the number of computational threads on a multi-core node, an almost perfect performance scaling has been achieved.

In the future, the proposed strategy may be expanded to hierarchical NUMA architectures as they come on board with the advent of exascale computing platforms.

Acknowledgments

This work was supported in part by Iowa State University under the contract DE-AC02-07CH11358 with the U.S. Department of Energy, by the U.S. Department of Energy under the grants DE-FC02-09ER41582 (UNEDF SciDAC-2) and DE-FG02-87ER40371 (Division of Nuclear Physics), by the Director, Office of Science, Division of Mathematical, Information, and Computational Sciences of the U.S. Department of Energy under contract number DE-AC02-05CH11231, and in part by the National Science Foundation grants NSF/OCI 0749156, 0941434, 0904782, and 1047772.

References

[1] J. Antony, P. Janes, and A. Rendell. Exploring thread and memory placement on NUMA architectures: Solaris and Linux, UltraSPARC/FirePlane and Opteron/HyperTransport. Pages 338–352, 2006. doi: 10.1007/11945918_35. URL http://dx.doi.org/10.1007/11945918_35.

[2] F. Bellosa and M. Steckermeier. The performance implications of locality information usage in shared-memory multiprocessors. Journal of Parallel and Distributed Computing, 37:113–121, August 1996. ISSN 0743-7315. doi: 10.1006/jpdc.1996.0112. URL http://dl.acm.org/citation.cfm?id=241170.241180.

[3] L. Dagum and R. Menon. OpenMP: an industry standard API for shared-memory programming. IEEE Computational Science and Engineering, 5(1):46–55, 1998. doi: 10.1109/99.660313.

[4] MPI Forum. MPI: A message-passing interface standard, 1994.

[5] B. Goglin and N. Furmento. Enabling high-performance memory migration for multithreaded applications on Linux. In Parallel and Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on, pages 1–9, May 2009. doi: 10.1109/IPDPS.2009.5161101.

[6] R. E. Grant and A. Afsahi. A comprehensive analysis of OpenMP applications on dual-core Intel Xeon SMPs. In Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International, pages 1–8, March 2007. doi: 10.1109/IPDPS.2007.370682.

[7] R. Iyer, H. Wang, and L. N. Bhuyan. Design and analysis of static memory management policies for cc-NUMA multiprocessors. Journal of Systems Architecture, 48:59–80, September 2002. ISSN 1383-7621. doi: 10.1016/S1383-7621(02)00066-8. URL http://dx.doi.org/10.1016/S1383-7621(02)00066-8.

[8] H. Jin, M. Frumkin, and J. Yan. The OpenMP implementation of NAS Parallel Benchmarks and its performance. Technical report, 1999.

[9] A. Kleen. A NUMA API for Linux. Technical report, April 2008. URL http://whitepapers.zdnet.co.uk/0,1000000651,260150330p,00.htm.

[10] C. Lameter. Local and remote memory: Memory in a Linux/NUMA system. Technical report, 2009. URL http://www.kernel.org/pub/linux/kernel/people/christoph/pmig/numamemory.pdf.

[11] P. Maris, M. Sosonkina, J. P. Vary, E. G. Ng, and C. Yang. Scaling of ab-initio nuclear physics calculations on multicore computer architectures. Procedia Computer Science, 1(1):97–106, 2010. ICCS 2010.

[12] Omni compiler software. URL http://www.hpcs.cs.tsukuba.ac.jp/omni-openmp/.

[13] R. Rabenseifner, G. Hager, and G. Jost. Hybrid MPI/OpenMP parallel programming on clusters of multi-core SMP nodes. In 17th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, pages 427–436, Los Alamitos, CA, USA, 2009. IEEE Computer Society. doi: 10.1109/PDP.2009.43.

[14] C. Ribeiro, J.-F. Mehaut, A. Carissimi, M. Castro, and L. Fernandes. Memory affinity for hierarchical multiprocessors. In Computer Architecture and High Performance Computing, 2009. SBAC-PAD '09. 21st International Symposium on, pages 59–66, October 2009. doi: 10.1109/SBAC-PAD.2009.16.

[15] Y. Saad. Iterative Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2nd edition, 2003. ISBN 0898715342.

[16] A. Srinivasa, M. Sosonkina, P. Maris, and J. P. Vary. Dynamic adaptations in ab-initio nuclear physics calculations on multi-core computer architectures. In Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW), 2011 IEEE International Symposium on, pages 1332–1339, May 2011. doi: 10.1109/IPDPS.2011.288.

[17] P. Sternberg, E. G. Ng, C. Yang, P. Maris, J. P. Vary, M. Sosonkina, and H. V. Le. Accelerating full configuration interaction calculations for nuclear structure. In Proceedings of the ACM/IEEE Conference on High Performance Computing, SC 2008, November 15-21, 2008, Austin, Texas, USA, pages 1–12. IEEE/ACM, 2008. ISBN 978-1-4244-2835-9.
Rabenseifner, G. Hager, and G. Jost. Hybrid MPI/OpenMP parallel programming on clusters of multi-core SMP nodes. In 17th Euromicro International Conference on Parallel, Dis- tributed, and Network-Based Processing, pages 427–436, Los Alamitos, CA, USA, 2009. IEEE Computer Society. doi: http://doi.ieeecomputersociety.org/10.1109/PDP.2009.43. [14] C. Ribeiro, J.-F. Mehaut, A. Carissimi, M. Castro, and L. Fer- nandes. Memory affinity for hierarchical multiprocessors. In Computer Architecture and High Perfor- mance Computing, 2009. SBAC-PAD ’09. 21st International Symposium on, pages 59–66, oct. 2009. doi: 10.1109/SBAC- PAD.2009.16. [15] Y. Saad. Iterative Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2nd edition, 2003. ISBN 0898715342.