Case Studies: Memory Behavior of Multithreaded Multimedia and AI Applications
Lu Peng*, Justin Song, Steven Ge, Yen-Kuang Chen, Victor Lee, Jih-Kwon Peir*, and Bob Liang

* Computer Information Science & Engineering, University of Florida — {lpeng, peir}@cise.ufl.edu
Architecture Research Lab, Intel Corporation — {justin.j.song, steven.ge, yen-kuang.chen, victor.w.lee, bob.liang}@intel.com

Abstract

Memory performance has become a dominant factor for today's microprocessor applications. In this paper, we study the memory reference behavior of emerging multimedia and AI applications. We compare memory performance for sequential and multithreaded versions of the applications on multithreaded processors. The methodology we used includes workload selection and parallelization, benchmarking and measurement, memory trace collection and verification, and trace-driven memory performance simulations. The results from the case studies show that opposite reference behaviors, either constructive or disruptive, can result for different programs. Care must be taken to make sure that disruptive memory references do not outweigh the benefit of parallelization.

1. Introduction

Multimedia and Artificial Intelligence (AI) applications have become increasingly popular in microprocessor applications. Today, almost all commercial processors, from ARM to Pentium®, include some type of media enhancement in their ISA and hardware [1][2]. While much work has gone into studying memory performance for general integer and large-scale scientific applications, due to the widening processor-memory performance gap, this paper focuses on memory behavior studies for multimedia and AI applications.

The memory reference behavior of the emerging applications is complicated by the fact that general-purpose processors have begun to support multithreading to improve throughput and hardware utilization. The new Hyper-Threading Technology brings the multithreading idea into the Intel architecture to make a single physical processor operate as two logical processors [3]. Active threads on each processor have their own local state, such as the Program Counter (PC), register file, and completion logic, while sharing other expensive components, such as functional units and caches. On a multithreaded processor, the multiple active threads can be homogeneous (from the same application) or heterogeneous (from different, independent applications). In this paper, we investigate the memory-sharing behavior of homogeneous threads from emerging multimedia and AI workloads.

Two applications, AEC (Acoustic Echo Cancellation) and SL (Structure Learning), were examined. AEC is widely used in telecommunication and acoustic processing systems to effectively eliminate echo signals [4]. SL is widely used in bioinformatics for modeling gene-expression data [5]. Both applications were parallelized using OpenMP [6] and were run on Intel® Pentium® 4 processors with/without Hyper-Threading Technology [7].

When multiple threads run in parallel on a multithreaded processor with shared caches, their memory behavior is influenced by two essential factors. First, the memory reference sequence among multiple threads can be either constructive or disruptive. The memory behavior is constructive when multiple threads share data that was brought into the shared cache by one another [8]. In other words, the memory reference locality exhibited by the original single thread may be transformed and/or further enhanced among multiple threads. The memory behavior is disruptive when the threads have rather distinctive working sets and compete for the limited memory hierarchy resources. Second, the memory usage or footprint may increase among multiple threads because certain data structures or local variables may be replicated to improve parallelism. The increase of the memory footprint alters the reference behavior and demands larger caches to hold the working set.

There have been many studies targeting memory performance for commercial workloads [9,10] as well as media [11] and Java applications [12,13]. In [11], hardware trace collection and trace-driven simulation are used to study the memory behavior of a group of multimedia applications running on different operating systems, including Linux, Windows NT, and Tru-Unix. Techniques using compiler-based, dynamic profile-based, or user annotation-based methods to improve cache performance have been discussed [14,15,16,17,18]. Instead of inventing new hardware or software techniques to improve memory reference behavior, this paper demonstrates a methodology for memory behavior analysis and shows initial simulation results for the latest multimedia and AI applications. The detailed analysis methodology is described in the next section. Section 3 compares the memory reference behavior based on the data reuse distances of the multithreaded AEC and SL. In comparison with the original sequential programs, we found that the reuse distance of AEC increases with the number of parallel threads; on the contrary, the reuse distance of SL decreases with the degree of parallelization. The reasoning for these opposite memory behaviors will be discussed. Finally, Section 4 concludes this study.

2. Methodology for Memory Behavior Studies

The execution-driven, whole-system simulation method has become increasingly popular for studying the memory reference behavior of multithreaded applications. In this paper, we use Simics [19], a whole-system simulation tool that is capable of running unmodified commercial operating systems and applications. We first select interesting, emerging media and AI applications as the workload for our studies. The selected applications are parallelized using OpenMP and run on Pentium® 4 systems for performance measurement. After demonstrating performance advantages, the parallelized applications are ported to and run on the Simics simulation environment. A precise, cycle-based memory model can be built and integrated with the Simics virtual machine to drive the simulation. However, in order to obtain fast performance approximations over a large design space, we decided to apply trace-driven simulations based on traces collected from the Simics simulation environment. The generic cache model in the Simics tool set was modified for trace collection. The characteristics of the collected traces are compared and verified against the measurement results. Detailed trace-driven simulations then follow, based on various cache sizes, topologies, and degrees of parallelization. The above steps are described below.

Workload Selection and Parallelization: We selected two emerging applications, AEC (Acoustic Echo Cancellation) and SL (Structure Learning), for this study. First, besides video communication, audio is also an important component of interactive communication; for example, an audio glitch is often more annoying than a video glitch. While many studies have targeted video-related workloads, limited studies have been done on audio communication. To make sure that future processors deliver good acoustic quality, we selected AEC [4] as the first workload. Second, besides multimedia communication, modern processors have started to play an important role in bioinformatics studies, e.g., estimating gene expression by transcription (the amount of mRNA made from the gene). Often the best clue to a disease, or the best measurement of successful treatment, is the degree to which genes are expressed. Such data also provides insights into regulatory networks among genes, i.e., one gene may code for a protein that affects another's expression rate. We selected SL [5], which models gene-expression profile data, as the second workload.

The selected applications are parallelized with OpenMP and compiled with the Intel compiler 7.1 release [20]. In this environment, the OpenMP master thread automatically creates and maintains a team of worker threads, each executing specific tasks assigned by the master thread. Worker threads do not exit until the end of the process, thus forming a pool of active threads. Basically, the main parallelization for both workloads is done on for-loops, which can be specified using the OpenMP specification. Task-queue based parallelization is also applied, which allows do-while style parallelization. The first thread to reach a parallel region enqueues the parameters needed by the worker threads, and the other threads fetch the parameters from the stack and begin execution. With the task-queue mode, the dynamic execution of the threads does not have to be identical.

Benchmarking and Measurement: The parallelized applications first run on Pentium® 4 systems. To avoid the excessive overheads associated with thread scheduling, we only run the parallelized programs with the number of workers less than or equal to the capacity of the physical system. For example, in our studies on a quad 2.8 GHz Pentium® 4 Xeon™ MP machine with 2 GB DDR SDRAM, a maximum of 4 worker threads is generated. We use the VTune™ performance analyzer [21] to collect statistics including execution clock ticks, number of instructions, number of memory references, cache hits/misses, etc. From these measurement results, information such as speedups and the total memory references of the sequential and parallel executions of the applications can be calculated. Furthermore, we use the Windows-based Perfmon [22] to collect page-fault information.

much faster to simulate traces off-line compared with on-the-fly, execution-driven simulation on Simics. This fast simulation speed allows us to exploit a bigger design