Execution Characteristics of Multimedia Applications on a Pentium II Processor

Deependra Talla and Lizy Kurian John
Laboratory of Computer Architecture
Department of Electrical and Computer Engineering
The University of Texas at Austin
{deepu, ljohn}@ece.utexas.edu

Abstract

With the widespread use of 3D graphics, animation, speech recognition, and other media applications, general-purpose processors are increasingly spending their cycles on video and audio processing. However, the characteristics of media applications when executed on general-purpose processors are not well understood. Such knowledge is extremely important in guiding the design of future microprocessors and the development of media applications. In this paper we characterize the performance of multimedia applications on an Intel Pentium II processor based system. Six different commercial multimedia applications belonging to the 3D graphics, streaming video, or streaming audio categories are executed on an Intel Pentium II processor and performance is measured. Architectural data pertaining to utilization of various hardware resources on the chip are collected using on-chip performance monitoring counters. Multimedia applications are seen to have fewer branch instructions than SPECint benchmarks, but more than SPECfp benchmarks. Despite a regular control flow and more available parallelism, the average number of cycles taken to execute an instruction is seen to be higher than that of SPECint. In many aspects, media applications exhibit a behavior between that of SPECint and SPECfp.

Keywords: MMX, workload characterization, multimedia, speculative execution, streaming video/audio.

This work is supported in part by the State of Texas Advanced Technology program grant #403, the National Science Foundation under Grants CCR-9796098 and EIA-9807112, and by Dell, Intel, Microsoft, and IBM.

1. Introduction

For the last two decades, general-purpose processor design has been driven largely by non-realtime, stand-alone applications. Multimedia applications are now starting to become exceedingly important for computer systems as a dominant computing workload [4][5]. Dynamic multimedia component technologies such as videoconferencing, video authoring, visualization, 3D graphics, animation, realistic simulation, speech recognition, and broadband communications hold great promise. The importance of multimedia technology, services, and applications is being widely recognized by microprocessor designers. A number of manufacturers are offering multimedia processors that are claimed to be able to decode coded video streams in real-time in software. Most such processors, like the Trimedia processor from Philips and the Multimedia signal processor from Samsung, usually have hardware assists for one or more of the multimedia decoding functions. A number of general-purpose CPU manufacturers are offering multimedia enhanced versions of their CPUs for accelerating audio and video processing. The UltraSPARC processor enhanced with the VIS (Visual Instruction Set) from Sun, the multimedia-enhanced MMX [8] and streaming SIMD Pentium processors from Intel [10], AMD’s 3DNow! [16], and Motorola’s AltiVec technology are examples. Such CPUs will likely take over multimedia functions like audio-video decoding/encoding, modem, telephony functions, and network access functions on a PC/workstation platform, along with the general-purpose computing they currently perform.

Multimedia applications possess several characteristics that distinguish them from the normal workloads on desktop computing systems. In contrast to traditional applications, multimedia-rich applications will involve significant computational demands on the processor. Diefendorff and Dubey [4] specified the following characteristics of the media-centric applications: real-time response, processing of continuous-media data types, significant fine and coarse grained parallelism, high instruction-reference locality, and high network and memory bandwidth. However, media applications are still not well understood. For instance,

• Do multimedia applications have fewer branches per instruction and better branch-mispredict ratios (because of regularity in instruction sequence) than integer and desktop workloads?
• Is the cycles per instruction (CPI) of multimedia applications lower than other workloads because of available parallelism, regular code structure, and potentially better speculation?
• Are L1 instruction cache miss rates lower for multimedia applications than other workloads due to the presence of loops and processing of multiple pixel points/data?
• Are special addressing modes used to access the regular data structures and do they result in an increased use of complex instructions?
• Recent additions to instruction sets have been floating-point multimedia instructions; what amount of floating-point computations do multimedia applications use?
• Many of these applications can heavily use vectors of packed 8-, 16-, and 32-bit integers and floating-point numbers that allow potential benefits of Single Instruction Multiple Data (SIMD) architectures like the MMX for the Pentium family of processors. Do they take advantage of architectures that support SIMD processing such as MMX?

This paper is an effort to characterize multimedia workloads on the X86 architecture with multimedia-enhanced extensions (MMX). We compare execution characteristics of multimedia workloads with the SPEC benchmarks and other desktop applications (non-multimedia applications). By trying to answer the above questions, we provide insight and analysis based on measurements using built-in performance counters of the processor.

There have been some characterizations of desktop applications running under the Windows NT operating system [1][2][13][14][15]. Bhandarkar and Ding [1] characterized the performance of a Pentium Pro processor for both the SPEC benchmarks and the SYSmark/NT benchmark suite (contains Word, Excel, Powerpoint, Texim, and MaxEDA). Lee et al. [2] have examined the performance of common desktop applications (acroread, Netscape, photoshoppe, Powerpoint, and word) on the X86 platform in addition to benchmarking five SPEC programs. However, no clear picture of the execution characteristics of multimedia applications running on PC/desktop computing systems on the X86 platform has been published.

Benchmarks for evaluation of multimedia applications are in their infancy and evolving. A recent effort in this direction is MediaBench [11], which is a collection of complete applications/algorithms for multimedia and communications systems. Intel has its own Media benchmark suite [3], which is a collection of audio, video, graphics, and image processing applications. Multimedia extensions for signal processing and image/video processing were evaluated in [6][12]. However, no commercial applications were studied. Also, those studies were on a simulator rather than an actual platform. Our studies characterize several popular commercial media applications on a popular commercial platform, the X86 architecture. In addition, this study is performed with hardware assisted measurements on a real system rather than simulations. Hardware monitoring techniques are potentially more accurate and non-invasive.

We examine the characteristics of multimedia applications in the 3D graphics, streaming video, and streaming audio domains that account for 95% of the general multimedia workloads [3]. Existing, popular, and complete commercial multimedia applications have been chosen for this evaluation. Measurements were performed using the built-in performance counters of the processor. Results are presented for several statistics including instruction characteristics, branch-related information, memory performance, and MMX related characteristics. The rest of the paper is organized as follows: Section 2 describes our methodology and our benchmark application suite. Section 3 presents the various execution characteristics of the workloads and we compare them with the SPEC and SYSmark/NT characteristics presented in [1].

2. Methodology

2.1 Workloads

The multimedia workload domain can be broken up into 3D graphics, streaming video, and streaming audio. These three sub-domains represent about 95% of the multimedia spectrum [3] and we chose to benchmark two of each sub-domain for a total of six complete applications. Table 1 summarizes the benchmarks used for this work. These workloads represent the most recent and extremely popular applications being run on common desktops/PCs. All of our benchmarks were run for 90 seconds in real-time for the measurement of each statistic such as number of clock cycles, total instructions executed, number of branches, etc. Applications executed from 350 million to 24 billion dynamic instructions (see appendix) during the ninety seconds. For the streaming video and audio applications, the data files were saved on disk rather than loaded from the network to eliminate network delays in this evaluation.

Each of the streaming applications uses different encoding and decoding algorithms for different file formats and bit transfer rates. For example, RealAudio provides 10 different encoding algorithms to provide 14.4 kbps modem users mono feeds with a frequency response of

Table 1. Summary of the Benchmarks

Application   Description
              One of the most popular 3D games ever with excellent graphics, sounds, and smart combat enemies. Processor
