Cleveland State University EngagedScholarship@CSU

Electrical Engineering & Science Electrical Engineering & Computer Science Faculty Publications Department

10-2005

Dynamic Voltage Scaling Techniques for Power Efficient Video Decoding

Ben Lee Oregon State University, [email protected]

FErikoollow Nur thisvitadhi and additional works at: https://engagedscholarship.csuohio.edu/enece_facpub Carnegie Mellon University, [email protected] Part of the Computer and Systems Architecture Commons, and the Electrical and Computer EngineeringReshma Dixit Commons Oregon State University, [email protected] How does access to this work benefit ou?y Let us know! PublisherChansu Yu 's Statement NOClevTICE:eland thisState is Univ theersity author, [email protected]’s version of a work that was accepted for publication in Journal of MyungchulSystems Ar chitecturKim e. Changes resulting from the publishing , such as peer review, Informationediting, corr andections, Communications structural formatting, University, [email protected] and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Journal of Systems Architecture, 51, 10-11, (10-01-2005); 10.1016/j.sysarc.2005.01.002

Original Citation Lee, B., Nurvitadhi, E., Dixit, R., Yu, C., , & Kim, M. (2005). Dynamic voltage scaling techniques for power efficient video decoding. Journal of Systems Architecture, 51(10-11), 633-652. doi:10.1016/ j.sysarc.2005.01.002

Repository Citation Lee, Ben; Nurvitadhi, Eriko; Dixit, Reshma; Yu, Chansu; and Kim, Myungchul, "Dynamic Voltage Scaling Techniques for Power Efficient Video Decoding" (2005). Electrical Engineering & Computer Science Faculty Publications. 66. https://engagedscholarship.csuohio.edu/enece_facpub/66

This Article is brought to you for free and open access by the Electrical Engineering & Computer Science Department at EngagedScholarship@CSU. It has been accepted for inclusion in Electrical Engineering & Computer Science Faculty Publications by an authorized administrator of EngagedScholarship@CSU. For more information, please contact [email protected]. Dynamic voltage scaling techniques for power efficient video decoding

Ben Lee "", Erik o Nurvitadhi b, Reshma Dixit ", Chansu Yu " Myungchul Kim d

• Scllool of EII'clriml Ellgilll'aillg tlllIl Compllier Sciellce. Oregoll Siall' Ullin:".,"/),. Corw!lis, OR 9733 /. Vllill'lf S[o/<'s b Dl'/lflrIllWIlI of Electriclil (//11/ CompUler £lIgillcerillg. ClifIlegie Mellol/ Ullit'l'rsif)'. Pilliiburgl!. P A 15213. Vlliteli SWII'S <; Deparlmcnl (J/ Elect'ira/lmd COIIJfI"ler £lIg;",:er;II8. Clerdmul SllIle U" hw.lil),. Cfen'/mul. OH 44115·2425, U"iled SII1lI','- d Schild 0/ EngineerillK. Ill/ormatioll alld CommrllliculiollS Ul1il ~ l'rsil)'. 103-6 ,I[""ii-Dong. YllSl'l"'g-GII, Dmj"oll 305-714, SOIllI! Korell

• Corresponding author. Tel.: +1 541 737 3148: fax: +1 541 737 1300. E·mail (ltli/res,,'''.\': [email protected] (8 . Lee). cnurvi\[email protected] (E. Nurvi\adhi). [email protected] (R. Dixi!). c.yu91 @csuohio.cdu (C. Yu). mckim @;icu.ac.kr(M. Kim). 1. Introduction time span allowed. Thus, there is no power wasted by an idle waiting for a frame to be Power efficient design is one of the most impor­ played. In practice, decoding time estimation leads tant goals for mobile devices, such as , to errors that result in frames being decoded either PDAs, handhelds, and mobile phones. As the pop­ before or after their playout time. When the ularity of multimedia applications for these porta­ decoding finishes early, the processor will be idle ble devices increases, reducing their power while it waits for the frame to be played, and some consumption will become increasingly important. power will be wasted. When decoding finishes late, Among multimedia applications, delivering video the frame will miss its playout time, and the per­ will become the most challenging and important ceptual quality of the video could be reduced. applications of future mobile devices. Video con­ Even if decoding time prediction is very accu­ ferencing and multimedia broadcasting are already rate, the maximum DVS performance can be becoming more common, especially in conjunction achieved only if the processor can scale to very with the third generation (3G) wireless network ini­ precise processor settings. Unfortunately, such a tiative [11]. However, video decoding is a computa­ is impractical since there is cost tionally intensive, power ravenous process. In associated with having different processor supply addition, due to different frame types and varia­ voltages. Moreover, the granularity of voltage/fre­ tion between scenes, there is a great degree of var­ quency settings induces a tradeoff between power iance in processing requirements during execution. savings and deadline misses. For example, fine- For example, the variance in per-frame MPEG grain processor settings may even increase the decoding time for the movie Terminator 2 can be number of deadline misses when it is used with as much as a factor of three [1], and the number an inaccurate decoding time predictor. Coarse- of inverse discrete cosine transforms (IDCTs) per­ grain processor settings, on the other hand, lead formed for each frame varies between 0 and 2000 to overestimation by having voltage and frequency [7]. This high variability in video streams can be set a bit higher than required. This reduces dead­ exploited to reduce power consumption of the pro­ line misses in spite of prediction errors, but at cessor during video decoding. the cost of reduced power savings. Therefore, the Dynamic voltage scaling (DVS) has been shown impact of processor settings on video decoding to take advantage of the high variability in pro­ with DVS needs to be further investigated. cessing requirements by varying the processor's Based on the aforementioned discussion, this operating voltage and frequency during run-time paper provides a comparative study of the existing [4,10]. In particular, DVS is suitable for eliminat­ DVS techniques developed for low-power video ing idle times during low workload periods. Re­ decoding, such as Dynamic [33], GOP [21],and cently, researchers have attempted to apply DVS Direct [18,24], with respect to prediction accuracy to video decoding to reduce power [18,17,21, and the corresponding impact on performance. 19,24,33]. These studies present approaches that These approaches are designed to perform well predict the decoding times of incoming frames or even with a high-motion video by either using static group of pictures (GOPs), and reduce or increase prediction model or dynamically adapting its pre­ the processor setting based on this prediction. As diction model based on the decoding experience a result, idle processing time, which occurs when of the particular video clip being played. However, a specific frame decoding completes earlier than they also require video streams to be preprocessed its playout time, is minimized. In an ideal case, to obtain the necessary parameters for the the decoding times are estimated perfectly, and DVS algorithm, such as frame sizes, frame-size/ all the frames (or GOPs) are decoded at the exact decoding-time relationship, or both. To overcome this limitation, this paper also proposes an alterna­ ing low workload periods [4,9,10,12–24,33]. The tive method called frame-data computation aware advantage of DVS can be observed from the (FDCA) method. FDCA dynamically extracts useful power consumption characteristics of digital static frame characteristics while a frame is being decoded CMOS circuits [21] and the clock frequency equa­ and uses this information to estimate the decoding tion [24]: time. Extensive simulation study based on Simpl­ P / C V 2 f ð1Þ eScalar processor model [5], Wattch power tool eff DD CLK [3] and Berkeley MPEG Player [2] has been con­ 1 V DD ducted to compare these DVS approaches. / s / 2 ð2Þ fCLK ðV - V Þ Our focus is to investigate two important trade­ DD T offs: The impact of decoding time predictions and where Ceff is the effective switching capacitance, granularity of processor settings on DVS perfor­ VDD is the supply voltage, fCLK is the clock fre­ mance in terms of power savings, playout accu­ quency, s is the circuit delay that bounds the upper racy, and characteristics of deadline misses. To limit of the clock frequency, and VT is the threshold the best of our knowledge, a comprehensive study voltage. Decreasing the power supply voltage would that provides such a comparison has not been per­ reduce power consumption significantly (Eq. (1)). formed, yet such information is critical to better However, it would lead to higher propagation de­ understand the notion of applying DVS for low- lay, and thus force a reduction in clock frequency power video decoding. For example, existing (Eq. (2)). While it is generally desirable to have methods only use a specific number of processor the frequency as high as possible for faster instruc­ settings and thus do not provide any guidelines tion execution, for some tasks where maximum on an appropriate granularity of processor settings execution speed is not required, the clock fre­ when designing DVS techniques for video decod­ quency and supply voltage can be reduced to save ing. Moreover, these studies quantified the DVS power. performance by only looking at power savings DVS takes advantage of this tradeoff between and the number of deadline misses. In this paper, energy and speed. Since processor activity is vari­ we further expose the impact of deadline misses able, there are idle periods when no useful work by measuring the extent to which the deadline is being performed, yet power is still consumed. misses exceed the desired playout time. DVS can be used to eliminate these power-wasting The rest of the paper is organized as follows. idle times by lowering the processor's voltage and Section 2 presents a background on DVS. Section frequency during low workload periods so that 3 introduces the existing and proposed DVS tech­ the processor will have meaningful work at all niques on low-power video decoding and discusses times, which leads to reduction in the overall their decoding time predictors. Section 4 discusses power consumption. the simulation environment and characteristics of However, the difficulty in applying DVS lies in video streams used in this study. It also presents the estimation of future workload. For example, the simulation results on how the accuracy of Pering et al. took into account the global state of decoding time predictor and the granularity of the system to determine the most appropriate scal­ processor settings affect DVS performance. Final­ ing for the future [23], but the performance benefit ly, Section 5 provides a conclusion and elaborates is limited because the estimation at the system level on future work. is not generally accurate. Another work by Pering and Brodersen considers the characteristics of indi­ vidual threads without differentiating their behav­ 2. Background on dynamic voltage scaling (DVS) ior [22]. Flautner et al. classifies each by their communication characteristics to well known DVS has been proposed as a mean for a proces­ system tasks, such as X server or the sound dae­ sor to deliver high performance when required, mon [9]. Lorch and Smith assign pre-deadline and while significantly reduce power consumption dur­ post-deadline periods for each task and gradually accelerate the speed in between these periods CPU decodes frame [15]. While these approaches apply DVS at the CPU speed Frame 1 is Frame 2 is Frame 3 is Frame 4 is Operating System level, for some applications that (Mhz) displayed displayed displayed displayed inhibit high variability in their execution, such as vi­ 120 deo decoders, greater power savings can be F F 100 r r achieved if DVS is incorporated into the applica­ a a 80 Frame 1 Frame 2 m m tion itself. The next section overviews existing 60 e e 3 4 DVS approaches for low-power video decoding. 40 Time (sec) 1 2 3 4 30 30 30 30 (a) 3. Prediction-based DVS approaches for low-power video decoding CPU speed Frame 1 is Frame 2 is Frame 3 is Frame 4 is (Mhz) displayed displayed displayed displayed

This section introduces various DVS ap­ 120 proaches for low-power video decoding. They uti­ 100 80 lize some form of a decoding time prediction Frame 1 60 Frame2 Frame 3 Frame 4 algorithm to dynamically perform voltage scaling 40 Time (sec) [18,17,21,24,32,33]. In Section 3.1, we show how 1 2 3 4 30 30 30 30 the low-power video decoding benefits from DVS (b) and how the accuracy of decoding time predictor CPU speed Frame 1 is Frame 2 is Frame 3 is Frame 4 is affects the performance of DVS. Section 3.2 de­ (Mhz) displayed displayed displayed displayed scribes three existing DVS approaches and the proposed FDCA approach with a focus on predic­ 120 100 tion algorithms. In Section 3.3, the impact of the 80 Frame 1 granularity of the processor settings on DVS per­ 60 Frame2 Frame 3 Frame 4 40 Time formance is discussed. (sec) 1 2 3 4 30 30 30 30 3.1. Prediction accuracy on DVS performance (c) Fig. 1. Illustration of DVS. (a) Without DVS. (b) With DVS Fig. 1 illustrates the advantage of DVS in video (Ideal). (c) Prediction inaccuracies. decoding as well as the design tradeoff between prediction accuracy, power savings, and deadline misses. The processor speed on the y-axis directly second or fps) when the frame must be played relates to voltage as discussed earlier, and reducing out. During this idle period, the processor is still the speed allows the reduction in supply voltage, running and consuming power. These idle periods which in turn results in power savings. The shaded are the target for exploitation by DVS. Fig. 1b area corresponds to the work required to decode shows the ideal case where the processor scales ex­ the four frames and it is the same in all three cases. actly to the voltage/frequency setting required for However, the corresponding power consumption the desired time span. Therefore, no idle time ex­ is the largest in Fig. 1a because it uses the highest ists and power saving is maximized. Achieving this voltage/frequency setting and there is a quadratic goal involves two important steps. First, the relationship between the supply voltage and power decoding time must be predicted. Second, the pre­ consumption (see Eq. (1) in Section 2). dicted decoding time must be mapped to an appro­ Fig. 1a shows the processor activity when DVS priate voltage/frequency setting. is not used, which means that the processor runs at Inaccurate predictions in decoding time and/or a constant speed (in this case at 120 MHz). Once a use of insufficient number of voltage/frequency set­ frame is decoded, the processor waits until (e.g., tings will introduce errors that lead to reduction in every 33.3 ms for the frame rate of 30 frames per power saving and/or increase in missed deadlines as shown in Fig. 1c. In this figure, the decoding on-line. A DVS scheme is classified as on-line if times for frames 1 and 4 are overestimated, result­ no preprocessing is required to obtain information ing in more power consumption than required. On to be used in the DVS algorithm and therefore is the other hand, the decoding time for frame 2 is equally adaptable for stored video and real-time underestimated, which leads to a deadline miss video applications. It is classified as off-line if pre­ that may degrade the video quality. In summary, processing is required to obtain information DVS has great potential in applications with high needed by the DVS algorithm. varying workload intensities such as video decod­ Four DVS techniques for video decoding and ing, but accurate workload prediction is prerequi­ the corresponding prediction algorithms are dis­ site in realizing the benefit of DVS. cussed and compared: GOP is a per-GOP, dynamic off-line approach, Direct is a per-frame, fixed off­ 3.2. Overview of DVS approaches and their line approach, while Dynamic and FDCA are per- prediction schemes frame, dynamic approaches with Dynamic being an off-line scheme and FDCA being an on-line As clearly shown in Fig. 1, an accurate predic­ scheme. Intuitively, GOP consumes more energy tion algorithm is essential to improve DVS perfor­ and incurs more deadline misses than Direct and mance and to maintain video quality. Prediction Dynamic but would result in the least overhead be­ algorithms employed in several DVS approaches cause the prediction interval is longer. Direct differ based on the following two criteria: predic­ would perform the best because it is based on a tion interval and prediction mechanism. Prediction priori information on decoding times and their interval refers to how often predictions are made in relationship with the corresponding frame sizes. order to apply DVS. The existing approaches use It should be noted that the offline methods have either per-frame or per-GOP scaling. In per-GOP one striking drawback in that they all require a pri­ approaches, since the same voltage/frequency is ori knowledge of encoded frame sizes and there­ used while decoding a particular GOP, they do fore need some sort of preprocessing. There is not take full advantage of the high variability of also a method that completely bypasses the decod­ decoding times among frames within a GOP. ing time prediction at the client to eliminate the Prediction mechanism refers to the way the possibilities of errors due to inaccurate scaling pre­ decoding time of an incoming frame or GOP is esti­ dictions [17]. This is done by preprocessing video mated. Currently, all the approaches utilize some streams off-line on media servers to add accurate form of frame size vs. decoding time relationship video complexity information during the encoding [1]. Some methods are based on a fixed relation­ process. However, this approach requires knowl­ ship, while others use a dynamically changing rela­ edge of client hardware and is therefore impracti­ tionship. In the fixed approach, a linear equation cal. Moreover, it is not useful in case of existing describing the relationship between frame sizes streams that do not include the video complexity and frame decoding times is provided ahead of information. Choi et al. [32] have proposed a time. In the dynamic approach, the frame-size/ method in which the frame is divided into a decoding-time relationship is dynamically adjusted frame-dependent and frame-independent part based on the actual frame-related data and decod­ and scale voltage accordingly. However, as men­ ing times of a video stream being played. The dy­ tioned in their work, it is possible for errors to namic approach is better for high-motion videos propagate across frames due to a single inaccurate where the workload variability is extremely high. prediction, thereby degrading video quality. These In other cases, the fixed approach performs better two methods are not included in the study. than the dynamic approach but its practical value is limited because the relationship is not usually 3.2.1. Per-GOP approach with dynamic equation available before actually decoding the stream. (GOP) Aside from the two criteria explained above, the GOP is a per-GOP scaling approach that dynam­ DVS schemes are classified as either off-line or ically recalculates the slope of the frame-size/ decode-time relationship based on the decoding 3.2.2. Per-frame approach with fixed equation times and sizes of past frames [21]. At the begin­ (Direct) ning of a GOP, the sizes and types of the frames Direct was used by Pouwelse et al. in their of an incoming GOP are observed. This informa­ implementation of StrongARM based system for tion is then applied to the frame-size/decode-time power-aware video decoding [18,24]. In this tech­ model, and the time needed to decode the GOP nique, the scaling decision is made on a per-frame is estimated. Based on this estimate, the lowest fre­ basis. Based on a given linear model between quency and voltage setting that would satisfy the frame sizes and decoding times, decoding time of frame rate requirement is selected. The dynamic a new frame is estimated and then it is associated slope adjustment was originally presented in [1], to a particular processor setting using a direct where the slope adjustment is implemented by uti­ mapping. lizing the concept of decoding time per byte In order to obtain the size of the new frame, the (DTPB). DTPB essentially represents the slope of decoder examines the first half of the frame as it is the frame-size/decode-time equation and this value being decoded. Then, the size of the second half of is updated as the video is decoded using the actual the frame is predicted by multiplying the size of the decoding times of the just-decoded frames. The first half with the complexity ratio between the first summary of the algorithm for GOP is presented and second halves of the previous frame. Based on in Fig. 2. this, if the decoding time of the first half of the Although the GOP method requires the least frame is higher than the estimated decoding time, overhead among the three approaches, the per- it means that the decoding is too slow and the pro­ GOP scaling will introduce more prediction errors. cessor setting is then increased. The reason is that by having the same processor In addition, they also present a case in which setting for a GOP, prediction inaccuracy may the frame sizes are known a priori [24]. This is propagate across all the frames within the GOP. achieved by feeding the algorithm with the size Moreover, the fact that each frame type has its of each frame gathered offline. Thus, voltage/fre­ own decoding time characteristic [1,19,21] is ig­ quency scaling is done at the beginning of each nored while it would be more reasonable to assign frame by looking at the frame size, estimating a processor setting depending on the type of the the decoding time, and scaling the processor set­ frame. ting accordingly. Our simulation study of Direct

Fig. 2. Algorithm for the GOP approach. Fig. 3. Algorithm for the Direct approach. is based on this case. Fig. 3 summarizes the Direct then predicted by looking at the weighted differ­ approach implemented in our simulator. ence of frame sizes and decoding times of previous frames. This predicted fluctuation time is then 3.2.3. Per-frame approach with dynamic equation added to the average decoding time to obtain the (Dynamic) predicted decoding time of the incoming frame. Dynamic [33] is a per-frame scaling method that dynamically updates the frame-size/decoding-time 3.2.4. Per-frame, frame-data computation aware model and the weighted average decoding time. dynamic (FDCA) approach Fig. 4 provides a description of the Dynamic The principal idea behind the FDCA scheme is approach. to use information available within the video The mechanism used to dynamically adjust the stream while decoding the stream. In this way, frame-size/decode-time relationship is similar to there is no need to rely on external or offline pre­ one presented in [1].In Dynamic, the adjustment processing and data generating mechanisms to is made focusing on the differences of the decoding provide input parameters to the DVS algorithm. times and frame sizes. The average decoding time The main steps involved during the video of previous frames of the same type is used as decoding process [30] are variable length decoding the initial value for predicting the next frame. (VLD), reconstruction of motion vectors (MC), The possible deviation from this average value is and pixel reconstruction, which comprises of

Fig. 4. Algorithm of the Dynamic approach. inverse quantization (IQ), inverse discrete cosine tion is carried out. Therefore, a moving average transform (IDCT), and incorporating the error can be maintained, at frame level, of the cycles re­ terms in the blocks (Recon). Ordinarily, an MPEG quired for all the unit operations after the VLD decoder carries out decoding on a per-macroblock step. These parameters consist of the number of basis and the above mentioned steps are repeated cycles required for (1) reconstructing one motion for each macroblock until all macroblocks in a vector (AvgTimeMC), (2) carrying out IQ on one frame are exhausted. coefficient in a block (AvgTimeIQ), performing In order to gather and store valuable frame-re­ IDCT on one block of pixels (AvgTimeIDCT), lated information during the decoding process, the and incorporating error terms on one block of pix­ decoder was modified to carry out VLD for all els (AvgTimeRecon). macroblocks in a frame ahead of the rest of the Using the information explained above, it is steps. The information collected during the VLD now possible to estimate the number of cycles that stage constitutes such parameters as (1) total num­ will be required for frame decoding after the VLD ber of motion vectors in a frame (nbrMV), (2) total step by simply multiplying the corresponding number of block coefficients in a frame (nbrCoeff), parameters. A moving average of the prediction (3) total number of blocks on which to carry out error (PredError) is also maintained and used as IDCT (nbrIDCT), and (3) the number of blocks an adjustment to the final estimated decoding time to perform error term correction on (nbrRecon). for a frame. The cycles for the unit operations are The rest of the decoding steps are then carried not grouped according to the frame type because, out for the entire frame. The FDCA approach is as previously stated, the same block of code will be similar to the one proposed in [31] in that, VLD executed regardless of the type of the frame. The is carried out for the entire frame ahead of the error terms however, are grouped depending on other decoding steps. However, there are some the frame type. The estimated number of cycles key differences between the two methods: First, for a frame is then used to apply DVS by selecting the method in [31] takes into consideration the the lowest frequency/voltage setting that would worst case execution time of frames and tries to meet the frame deadline. The VLD step is per­ lower the overestimation as much as possible by formed at the highest voltage/frequency available using various frame parameters. Therefore, this to leave as much time as possible to perform not only causes an overhead due to decoder DVS during the more computationally intensive restructuring, but also results in overestimation tasks after VLD. Fig. 5 gives an algorithmic of decoding time. On the other hand, FDCA takes description of the FDCA scheme. a ‘‘best effort’’ estimation approach by using mov­ ing averages in the estimation. Second, the ulti­ 3.3. Granularity of processor settings on DVS mate goal of FDCA is to use the decoding time performance estimation for applying DVS. Therefore, unlike the method in [31], FDCA does not buffer the entire In this subsection, the impact of granularity of frame (which may possibly lead to some delay and processor settings on DVS performance is dis­ therefore more power consumption) to find out the cussed. Fig. 6a is the same ideal DVS approach frame size in order to estimate the decoding time as in Fig. 1b. It is ideal not only because the decod­ for the VLD step. Instead, VLD is initiated right ing time prediction is prefect but also because the away, thus bypassing the preprocessing step that processor can be set precisely to any voltage/fre­ is required in their method. quency value. However, since there is some cost in­ In order to estimate the number of cycles that volved in having different processor supply will be required for frame decoding, each of MC, voltages [25,27], a processor design with a large IQ, IDCT, and Recon steps is considered as a unit number of voltage/frequency scales is unfeasible operation. That is, for each unit operation, same [6,13]. For this reason, DVS capable commercial blocks of code will be executed and will require processors typically employ a fixed number of similar number of cycles every time a unit opera­ voltage and frequency settings. For example, Fig. 5. Algorithm of the FDCA approach.

Transmeta TM5400 or ‘‘Crusoe’’ processor has 6 rithm is fine enough to give significant power sav­ voltage scales ranging from 1.1 V to 1.65 V with ings and minimize deadline misses, while still frequency settings of 200–700 MHz [24], while In­ reasonably coarse to be implemented. tel StrongARM SA-100 has up to 13 voltage scales from 0.79 V to 1.65 V with the frequency settings of 59–251 MHz [10]. 4. Performance evaluation and discussions Consider a processor that has a fixed scale of frequencies, e.g., 5 settings ranging from 40 MHz This section presents the simulation results to 120 MHz with steps of 20 MHz as in Fig. 6b. comparing different DVS schemes introduced in The closest available frequency that can still satisfy Section 3.2, and also show the quantitative results the deadline requirement is selected by DVS algo­ on the effect of the granularity of processor set­ rithm. In Fig. 6c, the scales used in the previous tings on DVS performance discussed in Section case are halved, which results in 9 frequency scales 3.3. Performance measures are average power con­ with steps of 10 MHz. This result in less idle times sumption per frame, error rate, and deadline than the previous case and more power saving is misses. Before proceeding, the simulation environ­ achieved. If finer granularity scales than Fig. 6c ment and the workload video streams are first de­ are used, power savings would improve until at scribed in Sections 4.1 and 4.2, respectively. some point when it reaches the maximum as in the ideal case. Nevertheless, more frame deadline 4.1. Simulation environment misses will also start to occur as finer granularity scales are used with inaccuracies in predicting Fig. 7 shows our simulation environment, which decoding times. That is, prediction errors together consists of modified SimpleScalar [5], Wattch [3], with use of fine-grain settings would introduce and Berkeley mpeg_play MPEG-1 Decoder [2]. more deadline misses. On the other hand, use of SimpleScalar [5] is used as the basis of our simula­ coarse-grain settings induces overestimation that tion framework. The simulator is configured to could avoid deadline misses in spite of prediction resemble the characteristics of a five-stage pipeline errors. Thus, it is important to understand which architecture, which is typical of processors used in level of voltage scaling granularity in a DVS algo­ current portable multimedia devices. The proxy CPU speed Frame 1 is Frame 2 is Frame 3 is Frame 4 is decoder makes a DVS system call to the simulator (Mhz) displayed displayed displayed displayed to adjust the processor setting.

120 Wattch [3] is an extension to the SimpleScalar 100 framework for analyzing power consumption of 80 the processor. It takes into account the simulator's 60 Frame 1 Frame 2 Frame 3 Frame 4 states and computes the power consumption of 40 Time (sec) each of the processor structures as the simulation 1 2 3 4 30 30 30 30 progresses. The power parameters in Wattch con­ (a) tain the values of the power consumption for each

CPU speed Frame 1 is Frame 2 is Frame 3 is Frame 4 is hardware structure at a given cycle time. Thus, by (Mhz) displayed displayed displayed displayed constantly observing these parameters, our simula­ tor is capable of obtaining the power used by the 120

100 processor during decoding of each frame. 80 The Berkeley mpeg_play MPEG-1 decoder [2] Frame 1 60 Frame 2 was used as the video decoder in our simulation Frame 3 Frame 4 40 Time (sec) environment. For the FDCA scheme, the original 1 2 3 4 30 30 30 30 decoder was restructured to carry out VLD ahead (b) of the other steps in video decoding. All the meth­ ods required modifications to the decoder to make CPU speed Frame 1 is Frame 2 is Frame 3 is Frame 4 is (Mhz) displayed displayed displayed displayed DVS system calls to the simulator. A DVS system call modifies the voltage and frequency values 120 110 presently used by SimpleScalar. These system calls 100 90 80 are also used to determine the number of cycles re­ 70 60 Frame 1 quired for decoding a frame and updating data 50 Frame 2 Frame 3 Frame 4 40 Time used in an algorithm during the decoding process. (sec) 1 2 3 4 30 30 30 30 In the GOP, Direct, and Dynamic methods, there (c) are two system calls made: One at the start of a frame and one at the end of a frame. In the FDCA Fig. 6. Granularity of scales. (a) With DVS (Ideal). (b) Coarse- grain scales. (c) Fine-grain scales. method, there are also other system calls made to update data related to cycles for unit operations. For the simulation study, the overhead of pro­ system call handler in SimpleScalar was modified cessor scaling was assumed to be negligible. In and a system call for handling voltage and practice, there is a little overhead related to scal­ frequency scaling was added. Thus, the MPEG ing. Previously implemented DVS systems have

CC BenchmarkBenchmark Source HardwareHardware MPEG decoder Configuration

CompilerCompiler SimpleScalaSimpleScalar r TotalTotal Power OptionOption GCC SimpleScalaSimpleScalar r consumption WattchWattch Power Estimator SimpleScalaSimpleScalar ExecutableExecutable File simsim-outorder-outorder Performance MPEG stream Performance StatisticsStatistics SimulatorSimulator

Fig. 7. Simulation environment. shown that processor scaling takes about 70– 4.3. Effect of prediction accuracy on DVS 140 ls [4,18,24]. Since this overhead is significantly performance smaller than the granularity at which the DVS sys­ tem calls are made, they would have negligible af­ Figs. 9 and 10 summarize power savings and er­ fect on the overall results. ror results for the four DVS approaches simulated (GOP, Dynamic, Direct,and FDCA). These simula­ 4.2. Workload video streams tions were carried out using 13-voltage/frequency settings as used in the Intel StrongARM processor Three MPEG clips were used in our simula­ [18]. The ideal case (Ideal) was also included as a tions. These clips were chosen as representatives reference. The ideal case represents perfect predic­ of three types of videos—low-motion, animation, tion with voltage/frequency set to any accuracy re­ and high-motion. A clip showing a public message quired, and thus represents optimum DVS on childcare is selected for a low-motion video performance. This was done by using previously (Children) and a clip named Red's Nightmare is se­ gathered actual frame decoding times to make lected as an animation video. Lastly, a clip from scaling decisions instead of the estimated decoding the action movie Under Siege is selected to repre­ times. sent a high-motion video. Table 1 shows the char­ Fig. 9 shows the power savings in terms of aver­ acteristics of the clips. The table also includes age power consumption per frame relative to using frame-size/decode-time equations, which were gen­ no DVS for all frames as well as for each frame erated after preprocessing each clip. The R2 coeffi­ type. All four approaches achieve comparable cient represents the accuracy of the linear power savings to the ideal case, except GOP with equations, i.e., the closer R2 is to unity, the more Children (i.e., only 35% improvement). It can be likely the data points will lie on the predicted line. observed that the FDCA method consumes more Fig. 8 shows the decoding time characteristics power than the other three methods. This is be­ for each of the clips. As expected, frame decoding cause the restructured MPEG decoder used in times for Under Siege fluctuate greatly, while the FDCA takes longer than the original MPEG deco­ fluctuations in decoding times for Children are der. Our simulations on the sample streams show very subtle and the separation of the decoding that FDCA has on average 12% more overhead times for the three types of frames can be clearly than the original unmodified decoder in terms of seen. For Red's Nightmare, decoding times for I- the number of cycles required. Therefore, this frames are relatively unvarying, but P-frames show overhead represents loss opportunity to save large variations and B-frames are distinguished by power using DVS. In addition, FDCA stores peaks. frame-related data in the VLD step and loads it

Table 1 The characteristics of the clips used in the simulation Characteristics Children Red's Nightmare Under Siege Type Slow (low-motion) Animation Action (high-motion) Frame rate (fps) 29.97 25 30 Number of I frames 62 41 123 Number of P frames 238 81 122 Number of B frames 599 1089 486 Total number of frames 899 1211 731 Screen size (W · H) 320 · 240 pixels 320 · 240 pixels 352 · 240 pixels Linear equation for Decoding time = Decoding time = Decoding time = prediction 88.8 · frame size + 106 53.9 · frame size + 2 · 106 69.6 · frame size + 2 · 106 R2 coefficient 0.94 0.89 0.94 Children 25

20 I Frames 15 P Frames B Frames 10

5

Decoding Time (ms) 0 0 100 200 300 400 500 600 700 800 Frame Number

Red's Nightmare 50 45 40 35 I Frames 30 P Frames 25 20 B Frames 15 10 5

Decoding Time (ms) 0 0 200 400 600 800 1000 1200 Frame Number

Under Siege 30

25

20 I frames P frames 15 B Frames 10

5 Decode Time (ms)

0 0 100 200 300 400 500 600 700 Frame Number

Fig. 8. Frame decoding times for each clip.

Fig. 9. Relative average power consumption per frame for various DVS approaches. Neglecting pared sults back of FDCA defines [28] of namic the of Direct with frame approaches only most frames. typically clips. est played ings This sor forms mare Direct errors 68% Among Fig. standard error setting overall variability to that , consequently slightly again in average , The power , both the deadlines, (14% method and and with to playout how of 10 at 9–14% On for can worst have power FDCA over the shows during the the 29% -and P- FDCA 10.5%, in average and deviation the well each better saving the be error terms of is error multiple off-line higher original interval. because shorter given and made 11% saving as a approach average, . still frame the the decoding wastes B-frames DVS For than also of (80% and 13% error of Error (%) results for of rest 0.00 0.10 0.20 0.30 0.40 0.50 0.60 0.70 0.80 method. able accuracy for data 9.4%, DVS example, error, which it types method Direct decoding inter-frame how This type FDCA decoder. for the applies of Dynamic -and P- improvement) Dynamic (38.9%) to was I times resulted of the Dynamic potential methods, smooth closely of depends parameter defined provide is GOP , (77%). P GOP the of steps 17% frames quite was for the B-frames, Children Fig. for miss times However, the B most for , playout has with provides in Red same able and ), followed and a the as 10. substantial. Direct an power heavily four and significant All in GOP rate video the the the basically ' but accurate Errors s 10% than amount average to thus a proces­ 10.8%. Night­ which this GOP. , high­ times three com­ DVS ratio meet it sav­ per­ GOP clip Dy­ the the for re­ on for by I- is is I various Dynamic P Red’s Nightmare for to were as of voltage/frequency 4, imented quency the ertheless, performance ably granularity accurate 13 presented 4.4. frames shows ing having namic voltage/frequency per shown reflected DVS Direct B their these To The The 7, well frequency/voltage in realistic frame Impact impact affect chosen 13, All approaches. show terms method. FDAC fine-grain results the in in as results scaling promising schemes with decoding 25, a by in Under compared the Fig. the of clearer the of relative implementation. of the that than as various and voltage/frequency the of FDCA processor performance I As power are granularities aforementioned 8. representatives subsection previous power voltage was Siege Under Siege scaling can having 49 performance various This variability processor P time understanding shown average to approach. scaling simulated scales. consumption settings. be . B consumption using was predictions settings scales seen, coarse-grain schemes in All processor of have schemes Table also settings power Fig. no power of would DVS. Thus, among tradeoff, settings and using These granularity DVS on simulated. decoding is the 11–14 and 2 were are high needed accuracy and consumption consumption It video lead consisting the even presents for scales. case voltage/fre­ approaches for others seems will accuracy. made, we . potential based Dynamic to the Fig. various if decod­ invari­ for exper­ about better times Each Nev­ very that due Dy­ the the on 11 P- of Table 2 Processor settings simulated Number of settings Voltages (V) Frequencies (MHz) Range Steps Range Steps 4 scales 0.79–1.65 0.286668 59–251 64 7 scales 0.79–1.65 0.143334 59–251 32 13 scales 0.79–1.65 0.071667 59–251 16 25 scales 0.79–1.65 0.035834 59–251 8 49 scales 0.79–1.65 0.017917 59–251 4 Ideal Scale to any requested value by the ideal prediction algorithm decreases as the number of processor settings in­ Siege, respectively). The results for the FDCA creases. However, power saving only increases method are shown in Fig. 13 and show a similar slightly beyond 13 scales. Thus, using 13 available trend of only a marginal increase in power savings settings are sufficient to achieve relative average beyond 13 scales. power per frame comparable to the ideal case Figs. 12 and 14 show the accuracy of the DVS (e.g., 18% vs. 16% for Children, 19% vs. 17% for approaches for the various settings for Dynamic Red's Nightmare, and 23% vs. 20% for Under and FDCA techniques respectively. In general,

0.45 0.40 0.35 0.30 0.25 0.20 0.15 0.10 0.05

Relative Average Power per Frame 0.00 I P B All I P B All I P B All

Children Red's Nightmare Under Siege

4 scales 7 scales 13 scales 25 scales 49 scales Ideal

Fig. 11. Relative average power per frame for various processor settings for Dynamic method.

0.35

0.30

0.25

0.20

0.15 Error (%) 0.10

0.05

0.00 I PB All I PB All I PB All

Children Red’s Nightmare Under Siege

4 scales 7 scales 13 scales 25 scales 49 scales

Fig. 12. Errors for various processor settings for Dynamic method. proach Therefore, slightly DVS 4.5. (e.g., inaccuracies of the Red duces of Under processor the Fig. processor settings processor ' error Characteristics s propagated). approaches. the Nightmare Siege resulted 15 due decreases error. settings. shows with , more to settings where are settings large in Fig. the However, , the the getting As where than changing of This 13. finer with

from Error (%) Relative Average Power per Frame errors deadline of smallest 0.00 0.02 0.04 0.06 0.08 0.10 0.12 0.14 0.16 0.18 0.20 0.00 0.10 0.20 0.30 0.40 0.50 0.60 0.70 can deadline Relative 4–13, is 13, granularity, the the be 4 Fig. scaled this I I true for to the availability seen, the average error percentage 14. but misses misses 49 is P P -and P- for available Errors not Children Children more significantly ratio for the B B decreases power Children 4 scales for more the the 4 scales for Direct All All B-frames. increases precisely the of of case per number number various 7 scales of dead­ more frame four 7 scales and ap­ the for for re­ processor I I 13 scales for 13 scales Red’s Nightmare P P various (7.8%). Siege ear video, the frame-size/decoding-time specific size/decoding-time line handle (8.4% decoding cause Children 16.2% where most Red settings B B 25 scales However, particular model. misses. 25 scales processor All All clip number which and of the deadline The the for characteristic 49 scales for times. their resulted FDCA FDCA 9.7% 49 scales This reason Under settings deviates the clip FDCA I I of misses. adaptive method. The Ideal deadline is Dynamic Under Siege being Under Siege method misses P P for is equation in Siege because that Children most cause FDCA the B B Even of run. model capability clip the most (23.92%) misses and from gave each method. All All we about For that though clip comparatively clip are FDCA number good the is Direct clip. respectively) is is well resulted using a 35% calculated for based in P-frames high-motion results approaches , Thus, predicting suited the of Dynamic a deadline on frame- misses Under in with well lin­ the the the be­ for for , 120

100

80

60

40

Deadline Misses (%) 20

0 I P B All I P B All I P B All

Children Red's Nightmare Under Siege

GOP Dynamic Direct FDAC

Fig. 15. Percentage of deadline misses for various DVS approaches. misses, it is found that the deadline is missed only increases linearly as the granularity of the processor by an average of about 5% and is therefore negligi­ settings becomes finer, except for the Children clip ble. The highest number of deadline misses in Dy­ in case of Dynamic. This is because the scaling deci­ namic occurred for the clip with the least amount sions rely more on the estimation of the decoding of scene variations. This is because the dynamic times as more settings are used. Thus, an estimation decoding time estimation used performs too error would easily propagate to cause a deadline aggressively for the clip that has smooth move­ miss. Essentially, the main factor that affects the ment. GOP also uses an adaptive mechanism simi­ relationship between the granularity of the proces­ lar to the Dynamic approach. However, the sor scale and DVS performance is the distribution deadline misses are minimized by having longer of the frame decoding times. The power savings scaling intervals (i.e., per-GOP instead of per- and deadline misses would depend on whether the frame). Moreover, its scaling decision includes all processor settings available and used in the algo­ types of frames. Thus, P- and B-frames, which typ­ rithm could satisfy the expansion of the decoding ically have shorter decoding times than I-frames, times to the frame playout intervals. would likely be overestimated since the setting used Fig. 18 show the characteristics of deadline has to also satisfy the playout times for I-frames. misses in terms of how much the desired playout Figs. 16 and 17 show the deadline misses for var­ times were exceeded for various DVS approaches. ious voltage/frequency scales using Dynamic and The x-axis shows the extent of the deadlines misses FDCA respectively. The number of deadline misses relative to the playout interval, categorized as

40

35

30

25

20

15

10 Deadline Misses (%)

5

0 IP B All IP B All IP B All

Children Red's Nightmare Under Siege

4 scales 7 scales 13 scales 25 scales 49 scales

Fig. 16. Percentage of the deadlines misses for different processor settings for Dynamic. 80

70

60

50

40 30

20 Deadline Misses (%) 10

0 IPBAll IPBAll IPB All

Children Red's Nightmare Under Siege

4 scales 7 scales 13 scales 25 scales 49 scales

Fig. 17. Percentage of the deadlines misses for different processor settings for FDCA.

25

20

15

10

5

0

10% 20% 10% 20% 10% 20% 30% 10% 20% 30% < < <30% <40% >40% < < <30% <40% >40% < < < <40% >40% < < < <40% >40% GOP Dynamic Direct FDAC

Children Red’s Nightmare Under Siege

Fig. 18. The degree of deadline misses for various DVS approaches.

10%, 20%, 30%, 40%, and greater than 40%. The Based on these results, we can clearly see that y-axis represents the percentage of deadline misses the number of deadline misses by itself is not an over an entire clip. For example, a 5% value on the accurate measure of video quality. Instead, how y-axis with the 10% category on x-axis means that much the desired playout times were exceeded 5% of the frames in the clip that miss the deadline should also be measured and analyzed in order missed it by 10% of the desired playout interval to provide a better understanding on how the (e.g., for 25 fps, or 40 ms playout interval, these misses may affect the video quality. The simulation frames are played out between 40 and 44 ms after results indicate that deadline misses imposed by the preceding frames). For the Direct, Dynamic, DVS for most part have negligible effect on per­ and FDCA approaches, most of the misses are ceptual quality since they are mostly within 10% within 10% of the playout interval. In addition, of the desired playout time [8]. virtually all of the misses for the above three ap­ proaches lie within the 20% range. In contrast, the deadline misses for GOP are more erratic, 5. Conclusion and thus, have a higher potential of disrupting the quality of video playback. Conversely, dead­ This paper compared DVS techniques for low- line misses in Direct, Dynamic, and FDCA are less power video decoding. Out of the four approaches likely to affect the video quality. studied, Dynamic and Direct provided the most power savings, but are limited in usefulness with re­ [3] D. Brooks, V. Tiwari, M. Martonosi, Wattch: A frame­ spect to real-time video applications. The FDCA work for architectural-level power analysis and optimiza­ method can be effectively applied to both stored tions, in: Proceedings of the 27th International Symposium on , June 2000, pp. 83–94. video and real-time video applications. Due to [4] T.D. Burd, T.A. Perimg, A.J. Stratakos, R.W. Brodersen, the extra overhead required for restructuring the A dynamic voltage scaled system, IEEE decoding process, FDCA does not provide as much Journal of Solid-State Circuits (November) (2000). power savings as compared to the Dynamic and Di­ [5] D. Burger, T.M. Austin, The SimpleScalar Tool Set, rect methods. Nevertheless, the power savings ob­ Version 2.0, CSD Technical Report # 1342, University of Wisconsin, Madison, June 1997. tained is quite substantial providing up to an [6] A.P. Chandrakasan, R.W. Brodersen, Low Power Digital average of 68% savings and an average of less than CMOS Design, Kluwer Academic Publishers, 1995. 14% (13.4%) frames missing the deadline. Thus, [7] A. Chandrakasan, V. Gutnik, T. Xanthopoulos, Data this approach is very suitable for portable multime­ driven signal processing: An approach for energy efficient dia devices that require low-power consumption. computing, in: Proceedings of IEEE International Sympo­ sium on Low Power Electronics and Design, 1996, pp. 347– Our study also further quantified the deadline 352. misses by analyzing the degree to which the play- [8] M. Claypool, J. Tanner, The effects of jitter on the out times are exceeded. The results indicate that, perceptual quality of video, in: Proceedings of the ACM for the Dynamic, Direct, and FDCA approaches, Multimedia Conference, vol. 2, November 1999. most of the deadline misses are within 20% of the [9] K. Flautner, S. Reinhardt, T. Mudge, Automatic perfor­ mance setting for dynamic voltage scaling, in: Proceedings playback interval. Therefore, use of these power of the 17th Conference on Mobile Computing and saving methods is less likely to degrade the quality Networking, July 2001. of the video. In addition, in designing a DVS capa­ [10] K. Govil, E. Chan, H. Wasserman, Comparing algorithms ble processor for video decoding, higher number of for dynamic speed-setting of a low power CPU, in: processor settings is preferable since more power Proceedings of 1st International Conference on Mobile Computing and Networking, November 1995. saving can be achieved without any additional risk [11] L. Harte, R. Levine, R. Kikta, 3G Wireless Demystified, of sacrificing quality of the video. The number of McGraw-Hill, 2002. deadline misses may increase, but they are still [12] C. Im, H. Kim, S. Ha, Dynamic voltage scheduling within a tolerable range [8]. techniques for low-power multimedia applications using As future work, it would be interesting to inves­ buffers, in: International Symposium on Low Power Electronics and Design (ISLPED'01), California, August tigate the usage of DVS system on streaming video 2001. where packet jitters from the network need to be [13] T. Ishihara, H. Yasuura, Voltage scheduling problem for considered [26,29]. In addition, finding more accu­ dynamically variable voltage processors, in: International rate prediction mechanisms for unit operations in Symposium on Low Power Electronics and Design, August video decoding, in particular for IDCT, and new 1998. [14] S. Lee, T. Sakurai, Run-time power control scheme using ways to exploit DVS for low power video decoding software feedback loop for low-power real-time applica­ are critical and would assist in reaching near-max­ tions, in: Asia and South Pacific Design Automation imum performance. Finally, it would be beneficial Conference, January 2000. to find ways to use DVS on other parts of a sys­ [15] J.R. Lorch, A.J. Smith, Improving dynamic voltage scaling tem, such as applying DVS to memory or network algorithms with PACE, in: Proceedings of the Interna­ tional Conference on Measurement and Modeling of interface. Computer Systems, June 2001. [16] D. Marculescu, On the use of -driven dynamic voltage scaling, in: Workshop on Complexity- References Effective Design, held in conjunction with 27th Interna­ tional Symposium on Computer Architecture, June 2000. [1] A. Bavier, B. Montz, L. Peterson, Predicting MPEG [17] M. Mesarina, Y. Turner, Reduced energy decoding of execution times, in: International Conference on Measure­ MPEG streams, ACM/SPIE Multimedia Computing and ment and Modeling of Computer Systems, June 1998, pp. Networking 2002 (MMCN'02), January 2002. 131–140. [18] J. Pouwelse, K. Langendoen, R. Lagendijk, H. Sips, [2] Berkeley MPEG Tools. Available from: . sium (PCS'01), April 2001. [19] J. Shin, Real time content based dynamic voltage scaling, Master thesis, Information and Communication Univer­ sity, Korea, January 2002. [20] T. Simunic, L. Benini, A. Acquaviva, P. Glynn, G.D. Michelli, Dynamic voltage scaling and for portable systems, in: 38th Design Automation Confer­ ence (DAC 2001), June 2001. [21] D. Son, C. Yu, H. Kim, Dynamic voltage scaling on MPEG decoding, in: International Conference of Parallel and Distributed System (ICPADS), June 2001. [22] T. Pering, R. Brodersen, Energy efficient voltage schedul­ ing for real-time operating systems, in: Proceedings of the 4th IEEE Real-Time Technology and Applications Symposium RTAS'98, Work in Progress Session, June 1998. [23] T. Pering, T. Burd, R. Brodersen, The simulation and evaluation of dynamic voltage scaling algorithms, in: Proceedings of the International Symposium on Low Power Electronics and Design, 10–12 August 1998. [24] J. Pouwelse, K. Langendoen, H. Sips, Dynamic voltage scaling on a low-power microprocessor, in: 7th ACM International Conference on Mobile Computing and Net­ working (Mobicom), July 2001. [25] J.M. Rabaey, M. Pedram, Low Power Design Methodol­ ogies, Kluwer Academic Publishers, 1996. [26] R. Steinmetz, Human perception of jitter and media synchronization, IEEE Journal on Selected Areas in Communications 14 (1) (1996). [27] A.J. Stratakos, C.R. Sullivan, S.R. Sanders, R.W. Broder­ sen, High-efficiency low-voltage DC–DC conversion for portable applications, in: E. Sanchez-Sinencio, A. Andreou (Eds.), Low-voltage/low-power Integrated Circuits and Systems: Low Voltage Mixed-signal Circuits, IEEE Press, 1999. [28] Y. Wang, M. Claypool, Z. Zuo, An empirical study of realvideo performance across the Internet, in: Proceedings of the First Internet Measurement Workshop, November 2001, pp. 295–309. [29] H. Zhang, S. Keshav, Comparison of rate-based service disciplines, in: Proceedings of ACM SIGCOMM'91, Sep­ tember 1991. [30] K. Patel, B. Smith, L. Rowe, Performance of a software MPEG video decoder, in: Proceedings of the First ACM International Conference on Multimedia, August 1993, pp. 75–82. [31] L.O. Burchard, P. Altenbernd, Estimating decoding times of MPEG-2 video streams, in: Proceedings of International Conference on Image Processing (ICIP 00), 2000. [32] K. Choi, K. Dantu, W.-C. Chen, M. Pedram, Frame-based dynamic voltage and frequency scaling for a MPEG decoder, in: Proceedings of International Conference on Computer Aided Design (ICCAD), 2002. [33] E. Nurvitadhi, B. Lee, C. Yu, M. Kim, A comparative study of dynamic voltage scaling techniques for low-power video decoding, in: International Conference on Embedded Systems and Applications, 23–26 June 2003.