
Liang Y, Wang S. Performance-centric optimization for racetrack memory based register file on GPUs. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 31(1): 36–49 Jan. 2016. DOI 10.1007/s11390-016-1610-1

Performance-Centric Optimization for Racetrack Memory Based Register File on GPUs

Yun Liang ∗, Member, CCF, ACM, IEEE, and Shuo Wang

Center for Energy-Efficient Computing and Applications (CECA), School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China

E-mail: {ericlyun, shuowang}@pku.edu.cn

Received September 9, 2015; revised December 3, 2015.

Abstract    The key to high performance for GPU architecture lies in its massive threading capability to drive a large number of cores and enable execution overlapping among threads. However, in reality, the number of threads that can simultaneously execute is often limited by the size of the register file on GPUs. The traditional SRAM-based register file takes up so large an amount of chip area that it cannot scale to meet the increasing demand of GPU applications. Racetrack memory (RM) is a promising technology for designing a large capacity register file on GPUs due to its high storage density. However, without careful deployment of the RM-based register file, the lengthy shift operations of RM may hurt the performance. In this paper, we explore RM for designing a high-performance register file for the GPU architecture. High storage density RM helps to improve the thread level parallelism (TLP), but if the bits of the registers are not aligned to the ports, shift operations are required to move the bits to the access ports before they are accessed, and thus the read/write operations are delayed. We develop an optimization framework for the RM-based register file on GPUs, which employs three different optimization techniques at the application, compilation, and architecture level, respectively. More clearly, we optimize the TLP at the application level, design a register mapping algorithm at the compilation level, and design a preshifting mechanism at the architecture level. Collectively, these optimizations help to determine the TLP without causing cache and register file resource contention and reduce the shift operation overhead. Experimental results using a variety of representative workloads demonstrate that our optimization framework achieves up to 29% (21% on average) performance improvement.

Keywords    register file, racetrack memory, GPU

1 Introduction

Modern GPUs employ a large number of simple, in-order cores, delivering several TeraFLOPs of peak performance. Traditionally, GPUs were mainly used for supercomputers. More recently, GPUs have penetrated the mobile embedded system market. Systems-on-chip (SoCs) that integrate GPUs with CPUs, memory controllers, and other application-specific accelerators are available for mobile and embedded devices. The major SoCs with integrated GPUs available in the market include the NVIDIA Tegra series with low power GPU, Qualcomm's Snapdragon series with Adreno GPU, and Samsung's Exynos series with ARM Mali GPU.

The key to high performance of the GPU architecture lies in its massive threading, which enables fast context switch between threads and hides the latency of function units and memory accesses. This massive threading design requires large on-chip storage support[1-7]. The majority of the on-chip storage area in modern GPUs is allocated to the register file. For example, on NVIDIA Fermi (e.g., GTX480), each streaming multiprocessor (SM) contains a 128 KB register file, which is much larger than the 16 KB L1 cache and 48 KB shared memory.

The hardware resources on GPUs include 1) registers, 2) shared memory, and 3) threads and thread blocks.

A GPU kernel launches as many threads concurrently as possible until one or more dimensions of resource are exhausted. In this paper, we define thread level parallelism (TLP) as the number of thread blocks that can execute simultaneously on the GPU. Though GPUs are featured with a large register file, in reality, we find that the TLP is often limited by the capacity of the register file. Table 1 gives the characteristics of some representative applications from the Parboil and Rodinia benchmark suites on a Fermi-like architecture. The setting of the GPU architecture is shown in Table 2. The maximum number of threads that can simultaneously execute on this architecture is 1 536 threads per SM. However, for these applications, there exists a big gap between the achieved TLP and the maximum TLP. For example, one thread block of application hotspot requires 15 360 registers. Given the budget of 32 768 registers (Table 2), only two thread blocks can run simultaneously. This leads to a very low occupancy of (2 × 256)/1 536 = 33%, where the occupancy is defined as the ratio of simultaneously active threads to the maximum number of threads supported on one SM.

Regular Paper; Special Section on Computer Architecture and Systems. This work was supported by the National Natural Science Foundation of China under Grant No. 61300005. A preliminary version of the paper was accepted by the 21st Asia and South Pacific Design Automation Conference (ASP-DAC 2016). ∗Corresponding Author. ©2016 Springer Science + Business Media, LLC & Science Press, China

Table 1. Kernel Description

Application    Source    Block Size    Registers per Thread    TLP    Occupancy (%)    Register Utilization (%)
hotspot        Rodinia   256           60                      2      33               93.75
b+tree         Rodinia   256           30                      4      67               93.75
histo_final    Parboil   512           41                      1      33               64.06
histo_main     Parboil   768           22                      1      50               51.56
mri-gridding   Parboil   256           40                      3      50               93.75
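To make the occupancy arithmetic above concrete, the following minimal sketch (our own illustration, not from the paper; the helper name is hypothetical and the default limits are taken from the Fermi-like configuration in Table 2) reproduces the hotspot row of Table 1.

```python
def max_tlp_and_occupancy(block_size, regs_per_thread,
                          reg_budget=32768, max_threads=1536, max_blocks=8):
    """Thread blocks that fit per SM, and the resulting occupancy."""
    regs_per_block = block_size * regs_per_thread      # hotspot: 256 * 60 = 15360
    tlp = min(reg_budget // regs_per_block,            # register file limit
              max_threads // block_size,               # thread count limit
              max_blocks)                              # thread block limit
    occupancy = tlp * block_size / max_threads
    return tlp, occupancy

# hotspot from Table 1: TLP = 2, occupancy = (2 * 256) / 1536 = 33%
print(max_tlp_and_occupancy(256, 60))   # (2, 0.3333...)
```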

Table 2. GPGPU-Sim Configuration

Number of compute units (SM)    15
SM configuration                32 cores, 700 MHz
Resources per SM                Max 1 536 threads, Max 8 thread blocks, 48 KB shared memory, 128 KB 16-bank register file (32 768 registers)
Scheduler                       2 warp schedulers per SM, round-robin policy
L1 data cache                   16/32 KB, 4-way associativity, 128 B block, LRU replacement policy, 32 MSHR entries
L2 unified cache                768 KB

To deal with the increasingly complex GPU applications, a new register file design with high capacity is urgently required. Designing the register file using high storage density emerging memory for GPUs is a promising solution[8-9]. Recently, racetrack memory (RM) has attracted great attention from researchers because of its ultra-high storage density. By integrating multiple bits (domains) in a tape-like nanowire[10], racetrack memory has shown about 28x higher density compared with SRAM (static random access memory)[9]. A recent study has also enabled racetrack memory for GPU register file design[8]. However, that work primarily focuses on optimizing energy efficiency, leading to very small performance improvement[8].

In this paper, we explore RM for designing a high-performance register file for the GPU architecture. High storage density RM helps to enlarge the register file capacity. This allows the GPU applications to run more threads in parallel. However, the RM-based design presents a new challenge in the form of shifting overhead. More clearly, for RM, one or several access ports are uniformly distributed along the nanowire and shared by all the domains. When the domains aligned to the ports are accessed, the bits in them can be read immediately. However, to access the other bits on the nanowire, shift operations are required to move those bits to the nearest access ports. Obviously, a shift operation induces extra timing and energy overhead.

To mitigate the shift operation overhead problem, we develop an optimization framework, which employs three different optimization techniques at the application, compilation, and architecture level, respectively. High storage density RM allows high TLP. However, high TLP may cause resource contention in the cache and the register file. Thus, at the application level, we propose to adjust the TLP for better performance through profiling. At the compilation level, we design a compile-time register mapping algorithm, which attempts to put the registers that are frequently used together close to each other in the register file to reduce the shift operation overhead. Finally, at the architecture level, we design a preshifting mechanism that preshifts the ports of the idle banks to further reduce the shift operation overhead.

The key contributions of this paper are as follows.

• Framework. We develop a performance-centric optimization framework for the RM-based register file on GPUs.

• Techniques. Three optimization techniques at the application, compilation, and architecture level are developed.

• Evaluation. Experiments on a variety of applications show that compared with the SRAM design, our RM-based register file design improves performance by 21% on average.

2 Background

In this section, we first review the GPU architecture using the NVIDIA Fermi architecture as an example. We then introduce the basics of racetrack memory. Finally, the GPU register file designed based on racetrack memory is presented.

2.1 GPU Architecture

Modern GPUs achieve high throughput thanks to the massive threading feature. When an application is launched on a GPU, hundreds or thousands of threads can execute simultaneously. To support such high thread level parallelism, a GPU has multiple streaming multiprocessors (SMs) working in parallel. Each SM contains an instruction cache, warp scheduler, L1 data cache, register file, shared memory, and 32 small cores called streaming processors (SPs), as shown in Fig.1. Within an SM, threads are grouped into warps, each of which usually contains 32 threads. At any given clock cycle, a warp that is ready for execution is selected and issued to the SPs by the warp scheduler. The SPs in an SM work in the single instruction multiple data (SIMD) fashion and share the on-chip memory resources of the SM, including the register file, shared memory, and L1 data cache.

Fig.1. Architecture of GPU with an RM-based register file.

The warp is the thread scheduling unit. If a warp is stalled by a long latency operation (e.g., a memory operation), the warp scheduler switches execution to other ready warps to hide the overhead. In order to fully hide the stall overhead caused by long latency operations, GPU applications tend to launch as many threads as possible. However, the capacity of the register file is limited, which constrains the TLP on the GPU. Furthermore, due to chip area, power, and energy constraints, it is very difficult to enlarge the register file using the current SRAM-based technology.

2.2 Basics of Racetrack Memory

Racetrack memory is an emerging non-volatile memory, also known as spintronic domain wall memory. Recently, it has become increasingly popular due to its high storage density[10-12] and low energy consumption.

Compared with SRAM, racetrack memory (RM) can achieve about 28x storage density while keeping comparable access speed[9]. Moreover, its read and write operations can achieve about 2x and 5x energy reduction, respectively[11]. In the following, we present the details of the basic operations of racetrack memory.

Read and Write Operations. Analogous to the access head of a hard disk, one access head in RM can access multiple bits. However, RM needs to shift a bit to the head if the bit is not aligned to the port. RM stores multiple bits in tape-like magnetic nanowires called cells. A cell is made of a nanowire (holding successive domains) to save the bits and several access ports to access them, as shown in Fig.2. The white bricks represent the domains, and the separations are domain walls used to isolate successive domains. Each access head has two transistors T1 and T2, as outlined by the red dashed boxes. T1 is used to read. The magnetization direction of each domain represents the value of the stored bit: if the direction is anti-parallel to the reference domain (grey bricks) in the access port, the value is "1"; otherwise it is "0". T2 is used to write bits into a domain. When T2 is on, a crosswise current is applied on the domain, and the required bit is shifted in just as in a shift operation.

Fig.2. Physical structure of a racetrack cell.

Shift Operations. The shift operation is an operation unique to racetrack memory. It is based on a phenomenon called spin-momentum transfer caused by spin current. The shift current provided by the shift control transistor (SL) drives all the domains in a racetrack cell left and right. Note that several overhead domains are physically placed at either end of the cell, in order to preserve valid domains with stored bits when they are shifted out.

2.3 GPU Register File with Racetrack Memory

Multi-Bank RM-Based Register File. In order to support multi-read and multi-write operations, the multi-bank SRAM register file is widely adopted in high-throughput processors. Compared with a single-bank multi-ported SRAM register file, it achieves higher performance, lower power, and less area[9,12-13]. For instance, the NVIDIA Fermi architecture employs a 16-bank SRAM register file, where up to four read or write operations can be accomplished at the same time. In this paper, we design the RM-based register file following the multi-bank architecture.

Fig.1 shows the architecture of the RM-based register file. It is a 16-bank RM-based register file. Each bank has a read request queue and a write buffer. The read request queue is used to buffer read requests. The write buffer is used to resolve the conflict between the delayed write operation in RM and the immediate write-back operation in the last stage of the GPU pipeline. All the read and write requests are arbitrated by the request arbitrator to avoid bank conflicts and determine the sequence of the requests. At any given clock cycle, at most four read or write requests can be served at the same time, which means up to four banks can be accessed simultaneously.

Performance of RM-Based Register File. For the SRAM-based register file, each register can be randomly accessed in any given clock cycle, while for the RM-based register file, the lengthy shift operations of RM increase the register access latency. This latency can be up to tens of cycles depending on the distance between the target domains and the corresponding access ports, and this can degrade the performance of a GPU with an RM-based register file. Therefore, we propose three optimization techniques in Section 4 to mitigate this problem.
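To illustrate the access-versus-shift trade-off described above, here is a toy model (our own sketch, not the paper's circuit-level simulator; the class and method names are assumptions) of a single racetrack bank with evenly spaced access ports, where every access first shifts the track until the target domain is aligned to its port.

```python
class RacetrackBank:
    """Toy racetrack bank: n_entries register slots served by n_ports ports.

    Shifting moves the whole track, so one alignment offset is shared by
    all ports; an access costs |target offset - current offset| shifts.
    """
    def __init__(self, n_entries=64, n_ports=8):
        self.region = n_entries // n_ports   # slots per port region
        self.offset = 0                      # current track alignment
        self.shifts = 0                      # accumulated shift operations

    def access(self, slot):
        target = slot % self.region          # offset of the slot in its region
        self.shifts += abs(target - self.offset)
        self.offset = target

bank = RacetrackBank()
for slot in [5, 5, 13, 60]:   # 5 and 13 share offset 5, so accesses 2 and 3 are free
    bank.access(slot)
print(bank.shifts)            # 5 + 0 + 0 + 1 = 6
```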
3 Motivation

RM has high storage density, which means that under the same chip area budget, a larger register file capacity can be achieved. Moreover, a larger register file capacity allows more concurrent threads on the GPU, unleashing performance improvement through larger TLP. Table 1 shows the maximum TLP of five benchmarks under different sizes of register file. In order to quantify the potential performance improvement of using an RM-based register file, we compare the performance of the GPU on five applications under the same chip area budget○1 for SRAM and an ideal RM-based register file○2.

○1 The chip area budget is set according to the area of a 16-bank 128 KB SRAM register file as shown in Table 2.
○2 The ideal RM-based register file is the RM-based register file when the shift operation overhead is zero.

As shown in Fig.3, by using the RM-based register file, we could potentially achieve up to 29% (on average 22%) performance improvement. The improvement is attributed to the larger TLP enabled by the larger register file capacity. For example, the TLP of application mri-gridding is increased from 3 to 6 by doubling the register file capacity. Note that this improvement is the ideal performance improvement obtained by assuming the same access delay for the SRAM and RM-based register files and zero shift operation overhead for RM. In this paper, an optimization framework for the RM-based register file on GPUs is proposed, targeting increased performance. We will demonstrate that by employing our proposed framework to optimize the RM-based register file, the achieved performance improvement is close to the ideal case.

Fig.3. GPU performance in different applications.

4 Optimization Framework

A preliminary version of this paper was reported in [14]. In this paper, we propose an optimization framework consisting of three performance-centric optimization techniques at the application, compilation, and architecture level.

4.1 Framework Overview

We develop an RM-based register file optimization framework on GPU as shown in Fig.4. It employs three performance-centric optimization techniques at the application, compilation, and architecture level. At the application level, we determine the TLP that does not cause contention in either the register file or the cache through profiling. At the compilation level, we design a register mapping algorithm to reduce the number of shift operations by analyzing the register access trace. Finally, at the architecture level, we design a preshifting mechanism for the register file request arbitrator to preshift the ports of the idle banks. In the following, we describe the details of each technique.

Fig.4. Optimization framework. (A CUDA kernel passes through TLP optimization at the application level, register mapping at the compilation level, and bank preshifting at the architecture level.)

4.2 TLP Optimization

By using the RM-based register file, we can support high TLP on the GPU. However, a previous study[15] shows that due to contention in caches, the maximum TLP is not always the optimal choice for performance. Furthermore, with the RM-based register file design, high TLP can lead to long shift distances, which results in worse performance. Therefore, in this subsection, we adjust the TLP by modelling the resource contention and the shift operation overhead.

Cache Contention. The RM-based register file allows more thread blocks to be executed on each SM, which increases the TLP of the GPU. However, higher TLP may cause cache contention and further degrade GPU performance. To illustrate the performance degradation problem caused by cache contention, we show the performance and L1 data cache miss rate of two typical kernels under different TLP values in Fig.5. Note that to exclude the impact of the lengthy shift operations of RM, we set the shift operation overhead to zero. From Fig.5(a), we can see that the performance of kernel calculate_temp monotonically increases as the number of thread blocks increases, while the cache miss rate remains almost the same. This means that for kernel calculate_temp, the increased TLP does not aggravate cache contention, thus the best performance is achieved by setting the TLP to the maximum number of thread blocks per SM. As for kernel split_sort, however, the maximum TLP does not lead to the best performance, as shown in Fig.5(b). The best performance is achieved when the number of thread blocks per SM is set to 5 rather than 6. This is due to the aggravated cache contention caused by higher TLP, which is reflected by the increased cache miss rate. Therefore, based on the analysis of these two kernels, we conclude that because of the cache contention overhead caused by increased TLP, the maximum TLP cannot always lead to the best GPU performance.

Fig.5. Performance and L1 data cache miss rate of two kernels. Both IPC and cache miss rate values are normalized to the results when the number of thread blocks per SM is 1. (a) calculate_temp. (b) split_sort.

Shift Operation Overhead. Higher TLP leads to more thread blocks present on the GPU cores, which requires more registers to be allocated in the register file. Due to the tape-like structure of RM, the maximum shift distance becomes longer. As Fig.6 shows, when the TLP is increased from 2 to 4, the number of warp registers per bank is doubled, which increases the maximum shift distance from 4 to 8. The increased maximum shift distance enlarges the RM working space, so the read/write operations may be further delayed by more shift operations. These delayed read/write operations cause performance degradation for the GPU. Therefore, in order to determine the TLP that achieves optimal performance, we should also take the shift operation overhead into account.

Fig.6. Shift operation overhead caused by increased TLP.

TLP Optimization. To optimize the TLP for the best performance, we need to model both the cache contention and the shift operation overhead. Because the shift operation overhead is present in the register file while the cache contention problem exists in the L1 data cache, we assume the shift operation overhead model is independent of the cache contention model. Therefore, we model them separately and then combine them to optimize the TLP.

First, the cache contention is modeled by the variable P_cache(N_TLP), which represents the performance (IPC) under different TLPs but without shift operation overhead, where N_TLP is the TLP value. We obtain the P_cache(N_TLP) value by profiling. More clearly, we profile the performance of each benchmark under different TLP values. To make the shift operation overhead zero, we set the number of access ports of RM to be the same as the number of data domains of the racetrack, so that no shift operation is needed for read/write operations.

Second, the shift operation overhead is modeled by the variable ΔP_shift(N_TLP), which represents the amount of performance (IPC) reduction when the shift operation overhead is considered. It is calculated according to the following equation:

    ΔP_shift(N_TLP) = P_ideal(N_TLP) − P_RM(N_TLP).    (1)

In this equation, P_ideal(N_TLP) represents the performance (IPC) when both the cache contention and shift operation overheads are zero. We get the value of ΔP_shift(N_TLP) by profiling: we profile the performance results of P_ideal(N_TLP) and P_RM(N_TLP) under different TLP values and then calculate ΔP_shift(N_TLP) according to (1). Note that, to exclude the cache contention overhead during profiling, we set the L1 data cache to hit at all times.

Last, the performance of a benchmark considering both cache contention and shift operation overhead is calculated as

    P(N_TLP) = P_cache(N_TLP) − ΔP_shift(N_TLP).    (2)

To optimize the TLP for the best performance, we calculate each P(N_TLP) according to (2), and then find the TLP value which leads to the best performance. Overall, only a few profiling runs are needed for each benchmark○3.

○3 The number of profiling runs is bounded by the maximum TLP supported on the GPU. The maximum TLP is a hardware limit for any generation of GPUs. For our platform, the maximum TLP is 8, which means at most 8 profiling runs are needed for each benchmark.
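A minimal sketch of the TLP selection implied by (1) and (2), assuming the three profiled IPC curves are available as lists indexed by N_TLP − 1 (the function name and data layout are ours, not the paper's toolchain):

```python
def select_best_tlp(p_cache, p_ideal, p_rm):
    """Return the N_TLP maximizing P(N) = P_cache(N) - dP_shift(N).

    p_cache[n-1]: IPC with cache contention, zero shift overhead.
    p_ideal[n-1]: IPC with zero cache contention, zero shift overhead.
    p_rm[n-1]:    IPC with zero cache contention, real shift overhead.
    At most 8 entries, one per profiled TLP value.
    """
    best_n, best_p = 1, float("-inf")
    for n in range(1, len(p_cache) + 1):
        dp_shift = p_ideal[n - 1] - p_rm[n - 1]   # equation (1)
        p = p_cache[n - 1] - dp_shift             # equation (2)
        if p > best_p:
            best_n, best_p = n, p
    return best_n
```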

4.3 Register Mapping Optimization

In this subsection, we design a compile-time register mapping algorithm, which first analyzes the register access trace and then attempts to place the registers that are frequently used together close to each other in the register file to reduce the shift operation overhead. Fig.7(a) shows a snippet of the register file access trace of the bank shown in Fig.7(b). In total, five registers (R0∼R4) are accessed in the bank and the trace contains six accesses to the registers. If we map the registers to the bank simply in ascending order of register index as shown in Fig.7(b), the racetracks need 12 shift operations to serve this trace. The optimal register mapping is shown in Fig.7(c): by exchanging the positions of registers R0 and R4, the number of shift operations is reduced to 6. The warps in GPUs access the register file almost every clock cycle. Hence, we need to carefully map the registers to the physical addresses in the racetrack register file to reduce the shift operations and ensure high throughput.

Fig.7. Comparison of different register mappings. (a) Register access trace. (b) Direct register mapping. (c) Optimal register mapping. For (b) and (c), each arrow is associated with m(n), where m represents the m-th register access in the trace, and n represents the number of shift operations.
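The 12-versus-6 comparison from Fig.7 can be replayed in a few lines. In this sketch (ours; for simplicity it assumes a single access port, so the cost between consecutive accesses is just the distance between the two physical slots, matching (3) and (4) below), only the register names from the trace matter:

```python
def count_shifts(trace, mapping):
    """Total shift operations to serve `trace` with register r at slot mapping[r]."""
    slots = [mapping[r] for r in trace]
    return sum(abs(a - b) for a, b in zip(slots, slots[1:]))

trace = ["R2", "R1", "R4", "R1", "R3", "R0"]              # Fig.7(a)
direct = {"R0": 0, "R1": 1, "R2": 2, "R3": 3, "R4": 4}    # Fig.7(b)
optimal = {"R4": 0, "R1": 1, "R2": 2, "R3": 3, "R0": 4}   # Fig.7(c): R0, R4 swapped

print(count_shifts(trace, direct))    # 12
print(count_shifts(trace, optimal))   # 6
```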

4.3.1 Problem Formulation

In the following, we formulate the register mapping problem. We observe that the register access traces of different banks are disjoint, and thus different banks can be modelled independently. Hence, we discuss the register mapping for one bank; the same technique is repeated for the other banks.

Let T be the register access trace (the sequence of register accesses) generated by executing the GPU application on the target architecture. We define a move from register Ri to Rj if Rj is accessed immediately after the access of Ri. We use C_ij to represent the number of moves from register Ri to Rj in the trace, where i, j = 0, 1, 2, ..., Nr − 1, and Nr is the number of registers in a bank. C_ij can be easily derived by traversing the trace.

We define a mapping function m(Ri) that maps register Ri to a physical address in the register file as follows:

    m(Ri) = p(Ri) × Nd/Np + o(Ri),

where 0 ≤ p(Ri) < Np and 0 ≤ o(Ri) < Nd/Np, Np is the number of access ports of a bank, and Nd is the number of register entries in the bank. In other words, p(Ri) determines which port's region Ri is mapped to, and o(Ri) determines the offset of Ri in the region. Each physical entry can only be allocated to one register. Thus, ∀ 0 ≤ i, j ≤ Nr − 1 with i ≠ j, m(Ri) ≠ m(Rj).

Then, we define the number of shift operations required for trace T as follows:

    S = Σ_{0≤i,j≤Nr−1} C_ij × d(m(Ri), m(Rj)),    (3)

where d(m(Ri), m(Rj)) represents the number of shift operations needed to move from the physical address of Ri to that of Rj within a bank. The function d is defined as follows:

    d(m(Ri), m(Rj)) = |o(Ri) − o(Rj)|.    (4)

Problem 1 (Shift Operation Minimization). Given the register access trace T, find a mapping of the registers to the physical addresses in the bank (i.e., m(Ri)) such that S is minimized.

We solve Problem 1 in two phases. In the first phase, we develop a register grouping algorithm that partitions the registers into groups; the size of each group is Np (the number of ports). In the second phase, we develop a register arrangement algorithm that optimizes the arrangement of registers within a bank. The first phase determines the p(Ri) part and the second phase determines the o(Ri) part of m(Ri) for each register.
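Deriving the move counts C_ij from a trace is a single linear pass; a tiny sketch of our own, reusing the register-name convention above:

```python
from collections import defaultdict

def move_counts(trace):
    """C[(i, j)] = number of times register j is accessed immediately after i."""
    C = defaultdict(int)
    for a, b in zip(trace, trace[1:]):
        C[(a, b)] += 1
    return C

C = move_counts(["R2", "R1", "R4", "R1", "R3", "R0"])   # Fig.7(a)
print(C[("R1", "R4")], C[("R4", "R1")])                 # 1 1
```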


4.3.2 Register Grouping Algorithm

We split the objective function (3) into two parts based on (4) as follows:

    S = Σ_{0≤i,j≤Nr−1, o(Ri)≠o(Rj)} C_ij × d(m(Ri), m(Rj)) + Σ_{0≤i,j≤Nr−1, o(Ri)=o(Rj)} C_ij × d(m(Ri), m(Rj))
      = Σ_{0≤i,j≤Nr−1, o(Ri)≠o(Rj)} C_ij × |o(Ri) − o(Rj)| + Σ_{0≤i,j≤Nr−1, o(Ri)=o(Rj)} C_ij × |o(Ri) − o(Rj)|
      = Σ_{0≤i,j≤Nr−1, o(Ri)≠o(Rj)} C_ij × |o(Ri) − o(Rj)|.    (5)

That is, moves between two registers that share the same offset to ports do not require shift operations, because such registers can be accessed simultaneously via multiple ports in the racetrack-based register file without extra shift overhead.

Furthermore, given the register access trace T, the total number of moves is a constant. We define it as follows:

    Q = Σ_{0≤i,j≤Nr−1} C_ij = Σ_{0≤i,j≤Nr−1, o(Ri)≠o(Rj)} C_ij + Σ_{0≤i,j≤Nr−1, o(Ri)=o(Rj)} C_ij.    (6)

From (5) and (6), we find that Σ_{o(Ri)=o(Rj)} C_ij is negatively correlated with S. Thus, we can minimize S by maximizing Σ_{o(Ri)=o(Rj)} C_ij. The intuition is that if there are frequent moves between Ri and Rj (i.e., C_ij is high), then we should align Ri and Rj to the same offset along two ports so that they can be accessed simultaneously without shifting overhead. In our racetrack-based register file design, each bank is associated with Np ports. Thus, we need to partition the registers into groups of size Np. The registers in each group g should have a high number of moves (i.e., Σ_{Ri,Rj∈g} C_ij) between them.

We build an undirected graph G = (V, E), where vi ∈ V represents register Ri and |V| = Nr. The edges are weighted using function W, defined as:

    W(e(vi, vj)) = C_ij + C_ji.

Problem 2 (Subgraph Weight Maximization). Given the graph G = (V, E), find a subgraph G′ = (V′, E′), where V′ ⊆ V, E′ ⊆ E, and |V′| = Np, such that Σ_{∀e∈G′} W(e) is maximized.

The registers in subgraph G′ form a group, and we distribute the Np registers in G′ across the Np ports with the same offset.

Algorithm 1 describes the details of our register grouping algorithm. It partitions the registers into ⌈Nr/Np⌉ groups, and each group has Np registers. Algorithm 1 repeatedly calls function FindMaxSubgraph to form a group from G. FindMaxSubgraph is a heuristic for Problem 2: it finds the nodes of a group iteratively. It first finds the edge with the maximal weight. Then, in each iteration, it complements the existing subgraph G′ with either one node or two nodes, depending on which results in the larger weight. Each time FindMaxSubgraph is called, it returns a subgraph G′ of size Np. The registers in G′ are distributed across the Np ports with the same offset, and we assign the port region p(Ri) for each register in G′. As a byproduct of this process, we assign a group identifier to each group based on the order in which the groups are formed. Fig.8 illustrates Algorithm 1 using an example with Np = 4, where the registers are partitioned into three groups with the group size of 4.
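The following sketch captures the spirit of FindMaxSubgraph (given in Algorithm 1 below) in a simplified form of our own: it seeds each group with the heaviest remaining edge and then grows the group one node at a time, omitting the two-node extension step of the full heuristic.

```python
from itertools import combinations

def group_registers(C, n_regs, n_ports):
    """Greedy partition of registers 0..n_regs-1 into groups of size n_ports.

    C[i][j] is the move count from register i to j; the edge weight
    between i and j is C[i][j] + C[j][i], as in function W.
    """
    w = lambda i, j: C[i][j] + C[j][i]
    free = set(range(n_regs))
    groups = []
    while free:
        if len(free) <= n_ports:          # last (possibly partial) group
            groups.append(sorted(free))
            break
        group = list(max(combinations(free, 2), key=lambda e: w(*e)))
        while len(group) < n_ports:       # grow by the best single node
            best = max(free - set(group),
                       key=lambda v: sum(w(v, u) for u in group))
            group.append(best)
        groups.append(group)
        free -= set(group)
    return groups   # members of group k later share one offset, ports 0..n_ports-1
```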

Fig.8. Illustration of register mapping algorithm.

Algorithm 1. Register Grouping Algorithm

Input: G = (V, E)
Output: the port region p(Ri) and group identifier g(Ri) for each register

group_id ← 0
while V ≠ ∅ do
    size ← min(Np, |V|)
    G′(V′, E′) ← FindMaxSubgraph(G, size)
    port_num ← 0
    foreach vi ∈ V′ do
        g(Ri) ← group_id; p(Ri) ← port_num; port_num ← port_num + 1
    end
    delete G′ from G
    group_id ← group_id + 1
end

Function FindMaxSubgraph(G = (V, E), size)
    V′ ← the two endpoints of the maximum-weight edge in E
    while |V′| < size do
        vd ← the single node maximizing wd = Σ_{vi,vj ∈ V′∪{vd}, i≠j} C_ij
        (vm, vn) ← the edge maximizing we = Σ_{vi,vj ∈ V′∪{vm,vn}, i≠j} C_ij
        if wd ≥ we or |V′| = size − 1 then
            V′ ← V′ ∪ {vd}
        else
            V′ ← V′ ∪ {vm, vn}
    end
    return the subgraph G′ = (V′, E′) induced by V′

4.3.3 Register Arrangement Algorithm

Algorithm 1 determines the port region p(Ri) for each register. In this subsection, we develop a register arrangement algorithm that determines the offset o(Ri) for each register.

Given Nr registers in a bank, there exist Nr! possible mappings of registers to the physical addresses in the register file. For example, using racetrack memory, we can increase the capacity of the register file from 128 KB to 256 KB. The racetrack-based register file is designed using 16 banks and 8 ports (Np) per bank, as this design gives the best trade-off among access latency, power, and area, as shown by prior studies[8-9,11-12]. Under this setting, we have Nr = 64. Obviously, it is intractable to enumerate all the possible mappings (64!).

The registers are partitioned into groups by Algorithm 1. More importantly, all the registers in the same group share the same offset. Therefore, we can determine the offset of each group and assign that offset to all the registers in it. By partitioning the registers into groups of size Np, we reduce the size of the mapping problem from Nr to Nr/Np. In our setting, the problem size is reduced from 64! to 8!, which makes enumerating the best case feasible.

Let Ω be the set of all the permutations of the eight groups. For σ ∈ Ω, σ[i] is the offset of the i-th group. Algorithm 2 presents the details of the register arrangement algorithm. It finds the best mapping that minimizes S in (3) and returns m(Ri) for each register. Fig.8 also illustrates the register arrangement algorithm using an example with Nr = 12 and Np = 4. By partitioning the registers into three groups, we can easily determine the mapping for the groups through enumeration (3!) and then derive the mapping for all the registers.

Algorithm 2. Register Arrangement Algorithm

Input: g(Ri), p(Ri)
Output: m(Ri)

S ← +∞
foreach permutation of groups σ ∈ Ω do
    S′ ← Σ_{0≤i,j≤Nr−1} C_ij × |σ[g(Ri)] − σ[g(Rj)]|
    if S′ ≤ S then
        S ← S′
        foreach register Ri do o(Ri) ← σ[g(Ri)]
    end
end
// compute the final mapping function
foreach register Ri do m(Ri) ← p(Ri) × Nd/Np + o(Ri)
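Algorithm 2's enumeration translates almost directly into runnable form. This sketch (our rendering; C is a dense move-count matrix and `groups` comes from the grouping phase) scans all permutations of group offsets and keeps the assignment minimizing S from (3):

```python
from itertools import permutations

def arrange_groups(C, groups):
    """Assign one offset per group so that S (equation (3)) is minimized.

    groups: list of register lists; registers in a group share an offset.
    With 8 groups this scans 8! = 40320 permutations, which is tractable.
    """
    group_of = {r: k for k, g in enumerate(groups) for r in g}
    regs = list(group_of)
    best_s, best = float("inf"), None
    for sigma in permutations(range(len(groups))):
        s = sum(C[i][j] * abs(sigma[group_of[i]] - sigma[group_of[j]])
                for i in regs for j in regs)
        if s < best_s:
            best_s, best = s, {r: sigma[group_of[r]] for r in regs}
    return best   # register -> offset o(Ri)
```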

4.4 Preshifting Optimization

The GPU register file adopts a multi-bank structure. At any given clock cycle, only a small portion of the register file banks can be accessed for read or write operations.

For instance, in the NVIDIA Fermi architecture, up to 4 out of 16 register file banks can be accessed simultaneously. In this paper, we define the banks that are being accessed as busy banks, and the banks that are not being accessed as idle banks. Fig.1 shows that each bank of the register file owns a read request queue and a write buffer; the read and write requests to be served by each bank are stored in the corresponding queue or buffer. Because the requests in each queue or buffer are served in a first-in first-out (FIFO) fashion, for a busy bank, the top request in the read request queue or write buffer is popped and then served. Note that write requests have higher priority than read requests, and thus write requests are always served before read requests.

Because only 4 out of 16 register file banks can be served in any given clock cycle, and the next accessed register can be accurately determined from the top requests of each queue or buffer, we can take advantage of the idle banks to reduce the shift operation overhead by preshifting. A simple motivating example is shown in Fig.9: bank 0∼bank 4 are busy banks serving read/write requests, while the remaining idle banks are preshifting towards the registers that are going to be accessed next. This preshifting scheme can potentially reduce the number of shift operations when the idle banks are carefully deployed.

Preshifting Strategy. With the knowledge of the top requests in each queue and buffer, we can determine the next warp register to be served for each bank, which gives us a good opportunity to preshift the idle banks in advance, before they become busy banks. Therefore, we propose the idle bank preshifting mechanism shown in Fig.10.

In any given clock cycle, after the four busy banks are determined by the register file arbitrator, each bank is checked for whether it needs a shift operation. A busy bank is read/written or shifted depending on whether it needs a shift. An idle bank is also checked: if it needs a shift operation to align the next register to the corresponding access port, the bank shifts one step towards the access port; otherwise it does nothing. Overall, this opportunistic preshifting policy makes the idle banks shift as much as possible so that ready requests can be served as early as possible, which potentially reduces the shift operation overhead.

Hardware Overhead of Preshifting. The preshifting strategy is implemented as control logic in the register file arbitrator. This control logic only needs to know the idle/busy state and the current access port position of each bank. The hardware implementation is simple because only a few comparators and control state flip-flops are required. Since the preshifting logic is dedicated to one bank, the routing resource overhead is also limited. Thus, the hardware overhead incurred by incorporating our preshifting strategy into the register arbitrator is negligible.
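A cycle-level sketch of the policy in Fig.10 (our simplification, not the simulator's code: each bank object is assumed to expose its pending top request and current port alignment, and the method names are hypothetical):

```python
def arbitrate_cycle(banks, busy_ids):
    """One clock cycle: busy banks access or shift; idle banks preshift."""
    for bank_id, bank in enumerate(banks):
        req = bank.next_request()        # top of write buffer, else read queue
        if req is None:
            continue                     # nothing pending for this bank
        if bank.port_offset != req.offset:
            bank.shift_one_step(toward=req.offset)   # busy or idle: one shift step
        elif bank_id in busy_ids:
            bank.serve(req)              # aligned busy bank: perform read/write
        # an aligned idle bank is already in place and does nothing
```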

Fig.9. Example of opportunistically preshifting idle banks.

Fig.10. Flow of opportunistic preshifting idle banks.

5 Experimental Evaluation

We implement our racetrack-based register file design based on GPGPU-sim. We evaluate our technique using applications from the benchmark suites Rodinia and Parboil, as shown in Table 1. We extend GPGPU-sim with a racetrack memory model using a circuit-level racetrack memory simulator[9]. The design parameters of the racetrack-based register file are shown in Table 3. Racetrack memory incurs long write latency. To mitigate this problem, similar to prior study[8], we employ a write buffer to improve the writing efficiency. The write buffer is 2 KB and each bank has two entries, as shown in Fig.1.

In the following, we perform three sets of experiments. First, we demonstrate the performance improvement of our racetrack-based register file. Second, we show the energy results. Third, we compare the area scalability of SRAM and racetrack memory.

5.1 Performance Results

Fig.11(a) shows the performance improvement of the racetrack-based register file over the default SRAM design. Our proposed optimization framework achieves up to 29% (21% on average) performance improvement. The reasons for the improvement are mainly two-fold. First, the high-density racetrack-based register file doubles the register file capacity (256 KB) compared with the conventional SRAM design (128 KB). The increased register file enables the applications to execute more threads in parallel. Fig.12 depicts the GPU occupancy improvement, showing that the occupancy is increased from 46.6% to 80.0% on average. Second, the optimization techniques used in our optimization framework greatly alleviate the shift operation overhead, leading to a promising performance improvement.

In Section 4, we propose a framework which embraces three optimization techniques: TLP optimization, register mapping, and preshifting optimization. In order to quantify the respective contributions of these techniques, we break down the contributions and show the results in Fig.11(d). It shows that the RM-based register file without the proposed method contributes 18% of the performance improvement, while our optimization framework contributes 82%. Breaking this down further, the TLP optimization, register mapping, and preshifting optimization techniques contribute 8%, 54%, and 20% of the performance improvement, respectively.

Fig.11(b) shows that the average number of shift operations has been reduced to 41% of the original number by employing our optimization framework. To differentiate the contributions of the three optimization techniques, we show the contribution breakdown in Fig.11(e): preshifting idle banks, TLP optimization, and register mapping contribute 17%, 30%, and 53% of the shift operation reduction, respectively, which matches the performance improvement contribution result concluded from Fig.11(d).

5.2 Energy Results

Fig.11(c) shows the energy consumption of the RM-based register file with and without our optimization framework; the energy values are normalized to the SRAM-based register file energy consumption. As shown in Fig.11(c), the energy consumption has decreased by 33%∼53% (on average 43%) by employing our proposed optimization techniques.

Table 3. SRAM, RM, and Write Buffer Operating Parameters

Register File   Capacity (KB)   Number of Banks   Read Latency (ns)   Write Latency (ns)   Shift Latency (ns)   Read Energy (pJ)   Write Energy (pJ)   Shift Energy (pJ)   Leakage (mW)
SRAM            128             16                0.31                0.31                 -                    218.88             57.28              -                   12.31
RM              256             16                0.28                1.24                 0.61                 117.12             173.22             56.16               7.95
Write buffer    2               -                 0.15                0.16                 -                    15.60              14.60              -                   1.12


Fig.11. Analysis of performance, number of shifts, and energy in different applications. (a) Performance. (b) Shift number. (c) Energy. (d) Performance breakdown. (e) Shift number breakdown. (f) Energy breakdown.

In order to quantify how much our proposed method contributes to the reduced energy consumption, we break down the energy contribution and show the results in Fig.11(f). As we can see, the emerging RM technology is the main factor in the energy reduction, contributing 66%. The remaining 34% of the energy reduction comes from employing our proposed optimization framework, where preshifting idle banks, TLP optimization, and register mapping contribute 11%, 8%, and 15% of the energy reduction, respectively.

5.3 Area Results

RM has high storage density compared with SRAM. Thus, after doubling the register file size from 128 KB to 256 KB by employing the RM-based register file, the chip area of the RM-based register file is only 55% of the area of the SRAM-based register file. This indicates that within the same chip area budget, the RM-based register file can provide more register file capacity for the GPU, and thus higher TLP. Therefore, the RM-based register file is a promising area-efficient solution for future GPU register files.

Fig.12. GPU occupancy in different applications.

6 Related Work

Recently, emerging memory technologies have attracted a lot of attention in both industry and academia. Jog et al. presented a technique to improve STT-RAM based cache performance for CMPs[16]. Using STT-RAM to architect the last level cache has been studied in [17-18]. Optimization techniques at the architecture and compilation level were proposed for racetrack memory[11]. Chen et al. proposed a data placement strategy to minimize the shift operations for racetrack-based memory in general CPUs[19]. However, GPU and CPU have distinct architectures. Thus, these techniques cannot be directly applied to GPU.

Emerging Memory Technology for GPU Cache. There are a few proposals that explore emerging memory for GPU caches by utilizing its high storage density and power-efficient features. TapeCache[20], a cache designed with racetrack memory, demonstrated the benefits of racetrack-based cache design for GPUs. Venkatesan et al. presented a detailed racetrack memory based cache architecture for GPGPU cache hierarchies[12], and their experimental results show substantial performance and energy improvement.

Emerging Memory Technology for GPU Register File. GPUs demand a large register file for thread context switch. Racetrack memory is a good candidate for designing the GPU register file[8,12]. Mao et al. presented a racetrack memory based GPGPU register file[8]. They proposed to use a hardware scheduler and write buffer to reduce energy consumption. Besides racetrack memory, Jing et al. proposed an energy-efficient register file design for GPGPU based on eDRAM and also developed a compiler-assisted register allocation optimization technique[13,21]. However, all the previous work using emerging technology for the GPU register file mainly focuses on optimizing area, power, and energy, resulting in no or little performance improvement. In contrast, we focus on improving performance for GPU applications by fully taking advantage of the high storage density of racetrack memory. We also propose a novel compile-time register mapping algorithm to minimize the shift operations.

7 Conclusions

The massive threading feature has enabled GPU to boost performance for a variety of applications. However, the number of simultaneously executing threads in GPUs is often constrained by the size of the register file, which could potentially become a serious bottleneck for performance. Moreover, the widely used SRAM-based register file does not scale well in power and area when the register file size is increased. In this work, we explored racetrack memory for designing a high performance register file for GPUs. The high storage density of racetrack memory increases the register file capacity and subsequently enables more threads to execute in parallel. We also developed an optimization framework to minimize the shift operations and improve GPU performance. Our experiments showed that our racetrack-based register file can achieve up to 29% (21% on average) performance improvement for a variety of GPU applications.

Acknowledgments    We thank the anonymous reviewers for their feedback.

References

[1] Gebhart M, Keckler S W, Khailany B, Krashinsky R, Dally W J. Unifying primary cache, scratch, and register file memories in a throughput processor. In Proc. the 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Dec. 2012, pp.96-106.
[2] Li X, Liang Y. Energy-efficient kernel management on GPUs. In Proc. the Design Automation and Test in Europe (DATE), Mar. 2016.
[3] Liang Y, Huynh H, Rupnow K, Goh R, Chen D. Efficient GPU spatial-temporal multitasking. IEEE Transactions on Parallel and Distributed Systems, 2015, 26(3): 748-760.
[4] Liang Y, Xie X, Sun G, Chen D. An efficient compiler framework for cache bypassing on GPUs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2015, 34(10): 1677-1690.
[5] Xie X, Liang Y, Li X, Wu Y, Sun G, Wang T, Fan D. Enabling coordinated register allocation and thread-level parallelism optimization for GPUs. In Proc. the 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Dec. 2015.
[6] Xie X, Liang Y, Sun G, Chen D. An efficient compiler framework for cache bypassing on GPUs. In Proc. the International Conference on Computer Aided Design (ICCAD), Nov. 2013, pp.516-523.
[7] Xie X, Liang Y, Wang Y, Sun G, Wang T. Coordinated static and dynamic cache bypassing on GPUs. In Proc. the 21st IEEE International Symposium on High Performance Computer Architecture (HPCA), Feb. 2015, pp.76-88.
[8] Mao M, Wen W, Zhang Y, Chen Y, Li H H. Exploration of GPGPU register file architecture using domain-wall-shift-write based racetrack memory. In Proc. the 51st Annual Design Automation Conference (DAC), June 2014, pp.196:1-196:6.
[9] Zhang C, Sun G, Zhang W, Mi F, Li H, Zhao W. Quantitative modeling of racetrack memory, a tradeoff among area, performance, and power. In Proc. the 20th Asia and South Pacific Design Automation Conference (ASP-DAC), Jan. 2015, pp.100-105.

[10] Parkin S S P, Hayashi M, Thomas L. Magnetic domain-wall racetrack memory. Science, 2008, 320(5873): 190-194.
[11] Sun Z, Wu W, Li H. Cross-layer racetrack memory design for ultra high density and low power consumption. In Proc. the 50th Annual Design Automation Conference (DAC), May 2013, Article No. 53.
[12] Venkatesan R, Ramasubramanian S G, Venkataramani S, Roy K, Raghunathan A. STAG: Spintronic-tape architecture for GPGPU cache hierarchies. In Proc. the 41st Annual International Symposium on Computer Architecture (ISCA), June 2014, pp.253-264.
[13] Jing N, Shen Y, Lu Y, Ganapathy S, Mao Z, Guo M, Canal R, Liang X. An energy-efficient and scalable eDRAM-based register file architecture for GPGPU. In Proc. the 40th Annual International Symposium on Computer Architecture (ISCA), June 2013, pp.344-355.
[14] Wang S, Liang Y, Zhang C, Xie X, Sun G, Liu Y, Wang Y, Li X. Performance-centric register file design for GPUs using racetrack memory. In Proc. the 21st Asia and South Pacific Design Automation Conference (ASP-DAC), Jan. 2016.
[15] Kayiran O, Jog A, Kandemir M T, Das C R. Neither more nor less: Optimizing thread-level parallelism for GPGPUs. In Proc. the 22nd International Conference on Parallel Architectures and Compilation Techniques (PACT), Oct. 2013, pp.157-166.
[16] Jog A, Mishra A K, Xu C, Xie Y, Narayanan V, Iyer R, Das C R. Cache revive: Architecting volatile STT-RAM caches for enhanced performance in CMPs. In Proc. the 49th Annual Design Automation Conference (DAC), June 2012, pp.243-252.
[17] Samavatian M H, Abbasitabar H, Arjomand M, Sarbazi-Azad H. An efficient STT-RAM last level cache architecture for GPUs. In Proc. the 51st Annual Design Automation Conference (DAC), May 2014, pp.197:1-197:6.
[18] Sun Z, Bi X, Li H H, Wong W F, Ong Z L, Zhu X, Wu W. Multi retention level STT-RAM cache designs with a dynamic refresh scheme. In Proc. the 44th Annual International Symposium on Microarchitecture (MICRO), Dec. 2011, pp.329-338.
[19] Chen X, Sha E H M, Zhuge Q, Dai P, Jiang W. Optimizing data placement for reducing shift operations on domain wall memories. In Proc. the 52nd Annual Design Automation Conference (DAC), June 2015, pp.139:1-139:6.
[20] Venkatesan R, Kozhikkottu V, Augustine C, Raychowdhury A, Roy K, Raghunathan A. TapeCache: A high density, energy efficient cache based on domain wall memory. In Proc. the International Symposium on Low Power Electronics and Design (ISLPED), July 30-August 1, 2012, pp.185-190.
[21] Jing N, Liu H, Lu Y, Liang X. Compiler assisted dynamic register file in GPGPU. In Proc. the International Symposium on Low Power Electronics and Design (ISLPED), Sept. 2013, pp.3-8.

Yun Liang obtained his B.S. degree in software engineering from Tongji University, Shanghai, and his Ph.D. degree in computer science from National University of Singapore, in 2004 and 2010, respectively. He was a research scientist with the Advanced Digital Science Center, University of Illinois Urbana-Champaign, Urbana, IL, USA, from 2010 to 2012. He has been an assistant professor with the School of Electronics Engineering and Computer Science, Peking University, Beijing, since 2012. His current research interests include graphics processing unit (GPU) architecture and optimization, heterogeneous computing, embedded systems, and high level synthesis. Dr. Liang was a recipient of the Best Paper Award at the International Symposium on Field-Programmable Custom Computing Machines (FCCM) 2011 and of Best Paper Award nominations at the International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS) 2008 and the Design Automation Conference (DAC) 2012. He serves as a technical committee member for the Asia and South Pacific Design Automation Conference (ASP-DAC), Design Automation and Test in Europe (DATE), the International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES), and the International Conference on Parallel Architectures and Compilation Techniques (PACT). He was the TPC subcommittee chair for ASP-DAC 2013.

Shuo Wang received his B.S. degree in electrical engineering from Nanjing Agricultural University, Nanjing, in 2013, and his M.S. degree in electrical engineering from the University of Southern California, Los Angeles, in 2014. Since 2015, he has been a Ph.D. candidate in the Center for Energy-Efficient Computing and Applications (CECA), Peking University, Beijing. His current research interests include compilation techniques for heterogeneous computing platforms.