Parallelism Analysis of Prominent Desktop Applications: An 18-Year Perspective

Siying Feng, Subhankar Pal, Yichen Yang, Ronald G. Dreslinski University of Michigan, Ann Arbor, MI, USA {fengsy, subh, yangych, rdreslin}@umich.edu

Abstract—Improvements in clock speed and exploitation of User requirements are significantly diverse. Gamers, for Instruction-Level Parallelism (ILP) hit a roadblock during mid- example, require a high-end machine with an advanced GPU 2000s. This, coupled with the demise of Dennard scaling, led and efficient cooling. On the other hand, for a user who uses to the rise of multi-core machines. Today, multi-core processors are ubiquitous and architects have moved to specialization to their system primarily to browse the web and watch videos, it work around the walls hit by single-core performance and chip is cost-inefficient to own a system with a high-end GPU. Thus, Thermal Design Power (TDP). The pressure of innovation in the it is worthwhile to analyze the characteristics of commonly aftermath of Dennard scaling is shifting to software developers, used applications tailored toward different users. who are required to write programs that make the most effective We repeat some of the experiments of an eight-year old use of underlying hardware. This work presents quantitative and qualitative analyses of how software has evolved to reap the work by Blake et al. [3], characterizing the parallelism ex- benefits of multi-core and heterogeneous computers, compared ploited by software in desktop workstations. To analyze how to state-of-the-art systems in 2000 and 2010. We study a wide a wide spectrum of commonly used desktop applications have spectrum of commonly-used applications on a state-of-the-art evolved to utilize available hardware, we use the metrics of desktop machine and analyze two important metrics, - Thread Level Parallelism (TLP) and GPU utilization. Our Level Parallelism (TLP) and GPU utilization. application suite consists of a diverse choice of traditional We compare the results to prior work over the last two decades, which state that 2-3 CPU cores are sufficient for most desktop applications, such as web browsers, video authoring applications and that the GPU is usually under-utilized. Our utilities, media players, as well as emerging applications, analyses show that the harnessed parallelism has improved and such as personal assistants, cryptocurrency miners and virtual emerging workloads show good utilization of hardware resources. reality (VR) games. In addition to the metrics mentioned The average TLP across the applications we study is 3.1, with earlier, we analyze the effect of core scaling and simultaneous most applications attaining the maximum instantaneous TLP of 12 during execution. The GPU is over-provisioned for most multi-threading (SMT) on these applications. applications, but workloads such as cryptocurrency mining utilize This work attempts to answer the following questions: it to the fullest. Overall, we conclude that the effectiveness of • Have modern versions of legacy software, and newer software in utilizing the underlying hardware has improved, but software that have replaced them, been keeping up with still has scope for optimizations. Index Terms—Benchmarking, Multi-Core, Desktop Applica- advances in hardware technology? tions, Thread-Level Parallelism, GPU utilization, Virtual Reality, • How well do contemporary and emerging applications Cryptocurrency Mining, Characterization utilize the parallelism in the underlying hardware? • What is the impact of core scaling, SMT and the GPU I.INTRODUCTION on the performance of applications? The rest of the paper is organized as follows. Section II Innovation in the domain of improving single-threaded introduces the parallelism analyses done 18 and 10 years st performance hit a plateau in the early 21 century. Antici- ago [3, 13, 14], and emerging applications that have evolved pating the end of Dennard scaling, which states that power from advancements in hardware since then. Section III de- density of a chip remains almost constant across technology scribes the system used for benchmarking, the metrics we nodes [10], the hardware industry swiftly pivoted towards study, trace collection methodology and the automation tech- multi-core processors. The post-Dennard scaling era is plagued nique used to obtain consistent measurements. Section IV by the problem of dark silicon, caused because improvements details each testbench and Section V contains evaluation of in cooling technology failed to keep up with the Thermal trends in parallelism. Section VI presents work related to char- Design Power (TDP) requirements of newer technology node acterization of different workloads. We discuss key takeaways chips [11]. Today, the hardware world is experiencing a move and suggestions for software developers to better harness the to specialization, trading away silicon area for gains in energy hardware in Section VII and conclude in Section VIII. efficiency [32, 37]. Modern systems are almost ubiquitously heterogeneous, involving a combination of the CPU with II.BACKGROUNDAND MOTIVATION a GPU and/or fixed-function accelerators. Common desktop This study provides an 18-year perspective on the evolution systems at homes and offices have 4-8 logical CPUs with a of parallelism in desktop workloads. In early 2000, when discrete GPU connected via PCI-Express. uniprocessors were prevalent, Flautner et al. [13, 14] evaluated the TLP of existing desktop applications on a symmetric mul- CPU Intel Core i7-8700K, 3.70-4.70 GHz, 6 cores / 12 threads tiprocessor (SMP) with 2-4 cores. The average TLP observed Graphics NVIDIA GTX 1080 Ti, 1481 MHz, 3584 CUDA cores RAM 64 GB (16 GB × 4) DDR4 @ 3200 MHz across all benchmarks was lower than 2 and only specific Storage 2 TB (1 TB × 2) PCIe NVMe SSD workloads, such as video encoding, benefited from more OS Windows 10 Education Version 1803 processing cores. However, a second processor improved the TABLE I: Specifications of the benchmarking desktop system. responsiveness of interactive applications. 10 years later, Blake et al. [3] presented a study of TLP for commercial desktop with an integrated graphics processor and specialized hardware applications on an 8-core processor with SMT. They concluded blocks, such as Quick Sync Video (QSV) [21]. that 2-3 processor cores were still more than sufficient for most Despite the processor having an integrated GPU, we evalu- applications and that the GPU was mostly underutilized. ate a discrete GPU, which is a more practical setup in desktop Another 8 years have passed since then, and desktop ma- systems. The GTX 285, used by Blake et al., consists of 240 chines have evolved into a combination of CPU, GPU, and CUDA cores operating at 648 MHz [30]. This work uses the fixed-function hardware. The prevalence of multiprocessors GTX 1080 Ti, which has 3584 CUDA cores (∼15× more), begs the question of how software developers have been catch- running at 1481 MHz (∼2× more) [29]. Some benchmarks are ing up with the advancements in hardware. We analyze the also tested with the GTX 680, which operates at 1006 MHz TLP and GPU utilization of a wide variety of commonly used with 1536 cores, to evaluate the differences in performance applications on a state-of-the-art desktop with 6 SMT cores. and utilization between a high-end and a mid-end GPU [31]. GPU utilization measures the average amount of GPU usage Windows 10 is used in this work, as it supports a wide over time, and TLP characterizes the amount of concurrency range of commercial applications that are commonly used by with idle time factored out. desktop consumers. More than 80% of the global operating We also evaluate emerging workloads that have gained systems market share for desktops comprises of Windows [36]. popularity in recent years, including virtual reality (VR) Besides, some applications in the benchmark suite, such as VR games, cryptocurrency miners, and personal assistants. The games, are only supported by Windows. first commercial VR headset was not released until 2016, B. Metrics and currently there are more than 150 million active VR The formula for TLP is shown in Equation 1, where ci users worldwide [23]. Cryptocurrency mining has experienced denotes the percentage of execution time when i threads are tremendous growth over the past decade, reaching a total running simultaneously, in a system with n logical CPUs. c0 market capitalization of over 200 billion USD [1]. However, represents the idle time in the application. both the immersive gaming experience provided by VR and Pn c i the computational complexity of cryptocurrency mining have TLP = i=1 i (1) non-trivial hardware requirements. For personal assistant ap- 1 − c0 plications, the user demands have been scaling since Apple The measurements are reflective of the application-level TLP, introduced Siri in 2011 [19]. Although personal assistant which measures the TLP of processes that pertain only to the applications rely heavily on datacenters to offload the complex application under consideration, in contrast to the system- part of the workload, it is worthwhile to explore how much wide TLP measured by Blake et al. [3] and Flautner et parallelism is exploited by the work performed locally. al. [13, 14]. This is because unlike system TLP, application TLP exposes the application behavior directly. III.METHODOLOGY For GPU utilization, we consider the amount of time spent A. System Setup by work packets actually running over a period of time, where a packet is defined as a large collection of Application Moore’s law has continued, albeit at a slower pace, led Programming Interface (API) calls packaged into a command by incremental advances in manufacturing technology and stream. GPU utilization is measured by aggregating for all hardware architecture over the past decade. The system used packets the ratio of packet running time to total time. by Blake et al. [3] employed a dual-socket CPU with four 2.26 GHz 4-way out-of-order cores per socket, along with C. Trace Collection an 8 MB last-level cache and 6 GB of RAM. We built a Event Tracing for Windows (ETW) is a kernel-level tracing desktop machine with state-of-the-art hardware components, feature that allows logging of application-defined events. We representative of a high-end gaming rig by current standards. use UIforETW [33], a wrapper around ETW, to collect Event Table I shows the specifications of our system. A dual-socket Trace Log (ETL) traces, after ending unrelated background setup is not used in this work since it is more prevalent processes. The traces are then analyzed using the built-in in server-class machines and generally over-provisioned for Windows Performance Analyzer (WPA). Within WPA, we desktop workloads. The processor, operating at 3.70 GHz with extract the relevant data columns from the CPU Usage Turbo to 4.70 GHz, consists of six superscalar cores (Precise) Timeline by CPU analysis for TLP and with a last-level cache of size 12 MB and 64 GB of RAM. from the GPU Utilization (FM) analysis for GPU uti- Each core supports 2-way hyper-threading, providing us with a lization. wpaexporter is used to automate the extraction maximum of 12 logical cores. The processor is also equipped of relevant data from WPA. Lastly, we use custom scripts to User inputs Application UIforETW amount of time. For games without a campaign mode, we start Mouse Start Start trace Testbench from a fixed checkpoint and perform similar actions each time. Keyboard Save trace IV. BENCHMARKS Stop Voice Testbench We construct a suite of common desktop applications based VR .etl file on their popularity among users and create as much overlap as possible with the benchmarks used in [3, 13, 14] to understand Extract columns (WPA): Extract columns (WPA): - Pr ocess - Pr ocess how software has adapted to advancements in hardware. The - CPU - St ar t Execut i on - Ready Ti me - Fi ni shed version number of each application is specified in Table II. - Swi t ch- I n Ti me A. Image Authoring .csv file .csv file Image authoring exhibits a high during Gather number of active Gather amount of active GPU cores at each point in time time at each interval of time rendering. We experiment on a 2D image editor, a 2D/3D CAD application, and a 3D animation modeling application. TLP data GPU util % data Adobe Photoshop: Photoshop is an advanced graphics TLP collection flow GPU util collection flow editing and design tool. 5 custom filters are applied serially Fig. 1: Trace collection workflow to extract TLP and GPU utilization. on a 100 mega-pixel photograph. the outputs of wpaexporter. Figure 1 summarizes Autodesk AutoCAD: AutoCAD is a CAD tool used in the measurement workflow for collecting TLP and GPU data. architecture, engineering, etc. We import a floorplan, pan, We cross-validate the GPU data with those reported by WPA. zoom, draw, fillet the edges, mirror and enter text. Autodesk Maya 3D: Maya is a 3D graphics software used D. Testing Automation for making animated movies, 3D modeling, etc. We open up While conducting multiple iterations of experiments with a complex model, smooth, perform a software render with the same applications, it is crucial to maintain consistency raytracing followed by a hardware render with fog, motion across events such as mouse clicks and keystrokes. AutoIt blur and anti-aliasing, rotate, pan and zoom the camera. is an automation language designed for Windows that can B. Office simulate keystrokes and mouse activities at a user-specified Office productivity applications are ubiquitous. A suite of time. Testing automation mitigates the variations created by applications that aim to help office tasks are included. user interactions among different test iterations. For each Adobe Acrobat Pro: Acrobat Pro is designed for PDF application that can be automated by AutoIt, we construct an editing. We scan documents, combine different files into automation script that initiates the application and performs a one PDF, manipulate the pages, insert links, watermarks and carefully designed sequence of mouse and keyboard activities. signatures, and export the PDF into slides. We inspect the effect of AutoIt on TLP and GPU uti- Microsoft Excel: Excel is a popular spreadsheet editor. lization by comparing experiments on an application with a We open a spreadsheet containing 1 million rows of text and high amount of user interactions (PowerDirector) and one numbers, copy multiple columns, zoom, pan, change layout, with non-trivial GPU utilization (VLC Media Player). The compute means, sort and filter rows, and plot a histogram. TLP for manual testing was 3.3% smaller than with automatic Microsoft Outlook: Outlook is a desktop email client by testing. The GPU utilization is 2.4% lower with AutoIt than Microsoft. We compose a new email, save and delete the draft, when performed manually. This demonstrates that the use of search and reply to a specific email, delete and recover an AutoIt does not significantly distort the results in this work. email from the inbox, move an email in and out of the junk E. Manual Testing folder, and finally categorize emails and do a filter operation. Microsoft PowerPoint: PowerPoint is a tool that lets users Some applications accept user inputs that cannot be pre- create slides and deliver presentations. We open a complex cisely reproduced by automation tools. Personal assistants, for template, add bullet points, format them, add shapes and instance, require audio inputs. So, we apply a fixed sequence animate them, add a picture, scale and rotate it and finally of requests and questions with strict timing constraints and use create a table and populate it with text and numbers. the same person’s voice for all test iterations. Microsoft Word: Word is Microsoft’s flagship document- VR games have a diverse set of inputs to track the posi- processing utility, used for tasks such as writing letters and tion and action of the player, such as signals from motion preparing reports. We create a new document, add and delete sensors and controllers. Besides, VR game scenes vary in a text, change formatting, insert, delete, scale and move images. non-deterministic manner, requiring players to take different actions to survive in the game. Repeating the same action can C. Multimedia cause larger variation in each testing iteration. Therefore, it Despite the booming popularity of online video stream- is both infeasible and undesirable to provide the exact same ing/sharing, multimedia players remain widely in use. This inputs to VR games. To maximize the similarity between work analyzes QuickTime Player, Windows Media Player different test iterations, we choose the campaign mode for each and VLC Media Player. For each application, a 480p and a game to create similar scenarios and survive for a predefined 1080p version of the same video are played in succession. D. Video Authoring and Transcoding G. Cryptocurrency Mining Video authoring and transcoding utilities enable users to Cryptocurrency miners validate transactions by computing edit videos and convert high definition videos into various on blockchains. The benchmark suite encompasses 4 miners – formats. GPUs are usually used in such workloads to assist Bitcoin Miner and EasyMiner for Bitcoin and PhoenixMiner in performing highly parallel, compute-intensive tasks. and Windows Ethereum Miner for Ethereum. Each one is CyberLink PowerDirector: PowerDirector is a video edit- run for a predefined amount of time. ing software that allows composing and editing video clips. We import three clips in PowerDirector, add transitions, titles, H. Personal Assistant color correction and render it with and without CUDA support. The prevalence of machine learning and natural language Adobe Premiere Pro: Premiere Pro is a video editor geared processing has given rise to personal assistant applications. for professionals. The same operations for PowerDirector are Apart from Cortana, which is built-in for Windows, we also repeated with slight differences in filters and transitions. analyze Braina, a multi-functional interactive AI software. HandBrake: HandBrake is an open-source video The tested queries cover requests for daily news, weather fore- transcoder. We use it to transcode part of a 3840×2160 cast, alarm/reminder management and questions about general resolution high-quality video at 50 frames per second (FPS) knowledge, word definitions and simple math problems. to a 1920×1080 resolution MP4 video at 30 FPS. V. EVALUATION WinX HD Video Converter: WinX is a video transcoding utility supporting GPU acceleration. We repeat the same test Detailed analyses of our benchmarks are presented in this sequences that were used for HandBrake. section. We summarize the TLP and GPU utilization of all the applications in our suite and compare them against prior E. Web Browsing work [3, 13, 14]. We then evaluate the impact of core scaling Popular web browsers, Chrome, Firefox, and Edge are and SMT and the role of the GPU in accelerating modern chosen for the benchmark suite. Chrome and Firefox take up applications. We also do an in-depth analysis of specific work- almost 80% of the desktop web browser market [35] and Edge loads, including web browsing and VR gaming. Details of our is a built-in browser for Windows 10 that succeeded Internet experiments and results are available in a public repository1. Explorer. We perform similar tests as in the prior work by A. Overall Results Blake et al. [3] to study the impact of improved software. For the first two tests, we watch a random video on Table II summarizes the TLP of the 6-core processor with YouTube, then browse ESPN, CNN and BestBuy, and finally SMT enabled and the GPU utilization of the GTX 1080 play a flash game. We use a different tab for each website Ti for all the applications. The “execution time” column in the first test and a single tab to browse all websites in the illustrates the amount of time when 0, 1,..., 12 logical cores second test. In the third test, we browse ESPN, which has are active simultaneously. The color of the heat map region plenty of active content (e.g. ads, videos, etc.) and in the final corresponding to ci illustrates the percentage of execution test, we browse Wikipedia, which has little active content. time when i threads are executed simultaneously. The “TLP” column shows the average and standard deviation of the F. Virtual Reality (VR) Gaming TLP derived from 3 test iterations (with same duration) for VR headsets provide players with immersive gaming experi- each application. Similarly, the “GPU utilization” column ence that traditional 3D games cannot offer. The diverse set of contains the average and standard deviation of the measured sensors and advanced rendering techniques necessary for VR GPU utilization. Based on the low standard deviations, we gaming impose non-trivial pressure on hardware. We select a conclude that our experimental results are consistent. The collection of VR games that have a large player community, last two columns in Table II present the average TLP and high user ratings and intensive graphics, and use the highest GPU utilization for each category, respectively. In summary, settings for all games to stress the GPU to the maximum. every application exploits parallelism to some extent, with a Arizona Sunshine: We play the game in single-player few applications showing more concurrency than others. For Horde mode, surviving multiple waves of zombies. categories like office, multimedia playback, personal assistant Fallout 4: We continue from a saved checkpoint where the and web browsing, the degree of parallelism exploited is quite character has escaped from the nuclear fallout shelter. low, as concluded from the average TLP of around 2. VR Serious Sam 3: We play the game in survival mode, and gaming displays moderate concurrency, with an average TLP due to the difficulty in surviving continuously for 3 minutes, ranging from 2 to 4. The TLP is expected to be similar within a we play through after getting killed and respawned. category, but some categories are exceptions, including image Space Pirate Trainer: We play the game in “old school” authoring, video authoring, and cryptocurrency mining. There mode that involves surviving multiple waves of pirate bots. also exist applications that effectively utilize most of the Project CARS 2: We start a quick race with the default car available cores, e.g. applications for video transcoding exhibit and track and race 1-2 laps with multiple other drivers. an average TLP over 9. Overall, the average TLP across all RAW Data: We play the game in campaign mode, surviving benchmarks is 3.1, where 6 out of 30 applications have an waves of attacking humanoid robots and protecting an object. average TLP higher than 4. 1https://github.com/SiyingFeng1995/Desktop Parallelism Analysis 2018 Execution Time (%) TLP GPU Util. (%) Avg. Avg. GPU Category Application C0 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 Avg. σ Avg. σ TLP Util (%) Adobe Photoshop CC 8.6 0.10 1.6 0.2 Image Autodesk Maya 3D 2019 2.7 0.08 9.9 0.2 4.2 6.8 Authoring Autodesk AutoCAD LT 1.2 0.02 9.0 0.9 Adobe Acrobat Pro DC 1.3 0.00 0.0 0.0 Microsoft Excel 2016 2.1 0.03 2.1 0.0 Microsoft PowerPoint 2016 1.2 0.01 4.0 0.1 1.4 1.7 Office Microsoft Word 2016 1.3 0.01 1.7 0.0 Microsoft Outlook 2016 1.3 0.05 2.5 0.2 QuickTime Player 7.7.9 1.1 0.02 16.4 0.1 Multimedia Windows Media Player 12.0 1.3 0.19 16.1 0.0 1.4 16.0 Playback VLC Media Player 3.0.3 1.8 0.18 15.7 0.9 Video CyberLink PowerDirector v16 4.3 0.03 6.3 0.1 3.1 3.4 Authoring Adobe Premier Pro CC 1.8 0.02 0.6 0.0 Video HandBrake 1.1.0 9.4 0.04 0.4 0.0 9.3 7.0 Transcoding WinX HD Video Converter 5.12.1 9.2 0.02 13.6 0.1 Firefox v60 2.2 0.13 8.6 0.5 Web Chrome v60 2.2 0.13 5.1 0.6 2.1 5.9 Browsing Edge 42.17134.1.0 2.0 0.02 4.0 0.2 Arizona Sunshine 1.5.11046 3.4 0.23 68.2 0.8 Fallout 4 VR 1.2 4.0 0.15 84.9 1.7 RAW Data 1.1.0 2.6 0.13 90.9 1.4 VR Gaming 3.1 76.3 Serious Sam VR BFE 341433 2.4 0.10 72.2 1.7 Space Pirate Trainer 1.01 2.7 0.11 61.6 0.5 Project CARS 2 1.7.1.0 3.8 0.16 80.2 2.1 Bitcoin Miner 1.54.0 5.4 0.15 98.9 1.1 Cryptocurrency EasyMiner v.0.87 11.9 0.02 96.1 0.4 4.8 98.7 Mining PhoenixMiner 3.0c 1.0 0.01 *100.0 0.1 Windows Ethereum Miner 1.5.27 1.0 0.01 99.7 0.1 Personal Cortana 1.4 0.04 2.7 0.0 1.3 1.4 Assistant Braina 1.43 1.1 0.02 0.0 0.0 *for PhoenixMiner, two packets were simultaneously executing on the GPU throughout the experiment Execution Time % TABLE II: Summary of application TLP and GPU utilization of all applications in the benchmarking suite. The color of the heat map region corresponding to ci indicates the percentage of time when i threads are running concurrently. The values of GPU utilization are lower than 10% for most GPU utilization of VR games is commensurate with that of applications. Video authoring and transcoding applications traditional 3D games. Since the current GPU has 15× more exhibit moderate GPU usage. VR games and cryptocurrency cores, a viable explanation is that the amount of offloaded miners, however, show significant utilization of the GPU, work has also increased by an order of magnitude. achieving an average GPU utilization over 90%. In general, the GPU is underutilized under most circumstances, except for C. Architectural Decisions graphic-intensive and cryptocurrency mining applications. 1) Core Scaling: Experiments are performed on the proces- sor with 4, 8 and 12 active logical cores to analyze the TLP B. Evolution of Concurrency behavior when more resources are available. Figure 4 shows The experimental results are compared to those collected the TLP characteristics of the application with the highest from similar applications in prior work in 2000 [13, 14], and average TLP in each category. For applications exhibiting a 2010 [3]. Figures 2 and 3 show the comparison of TLP and low degree of parallelism, including Chrome, VLC, Excel GPU utilization respectively. Although the TLP of benchmarks and Cortana, the TLP is tied to 2, since there is not much in media playback and video authoring have decreased (by parallelism to exploit. On the contrary, EasyMiner assigns 0.5-1.0), possibly due to the enhancements in single-core independent threads to each of the logical cores, leading to the performance, most applications present either comparable or TLP scaling linearly with the number of active cores. The TLP higher TLP. The significantly larger number of inputs (from of other applications scale sub-linearly, based on the amount sensors) and the escalation in computational complexity of of parallel work. The measured TLP, clearly showing these VR games result in a noticeable rise in TLP compared to trends, is shown in Figures 5-7. 3D games. Applications that have shown a large amount of HandBrake presents a highly parallel workload. The TLP is concurrency in previous work, e.g. HandBrake, see a further mostly at its maximum, but drops periodically due to serial- increase in TLP. Even applications with little growth in average ization. Increasing the core count results in more fluctuations TLP exhibit progress. For example, Excel only has an average in instantaneous TLP. This is consistent with the HandBrake TLP of 2, yet its instantaneous TLP reaches the maximum of documentation [18], which states that it can scale up to 6 cores 12 during execution, which was not the case 8 years ago. and presents diminishing returns beyond that. The frame rate On the other hand, all benchmarks, except for those in VR also scales up with the number of cores and the time spent on gaming, show lower GPU utilization. This can be attributed to transcoding the same length of video decreases in proportion. advancements in the GPU hardware, since a higher utilization Photoshop involves a significant amount of user interaction, of an older GPU, with fewer resources, is comparable to a which leads to a non-trivial amount of idle time waiting for lower utilization of a newer GPU, with more resources. The user inputs. However, as mentioned in Section II, idle time 10 3D Gaming VR Gaming Image Office Media Video Authoring & Web 8 Authoring Playback Transcoding Browsing 6

TLP 4 2 0

IE 5 Crysis Edge Quake 2 Bioshock Fallout 4 Word 97Excel 97 RAW Data Excel 2007 Excel 2016 Firefox 3.5 Serious Sam Word 2007 Word 2016 Premier 4.2 Firefox v60 Call of Duty 4 Maya3D 2010Maya3D 2018 Quicktime 7.6 Project CARS 2 Photoshop CS4Photoshop PowerPointCC 97 Premier Pro CCHandBrake 0.9 Photoshop 4.0.1 Quicktime 4.0.3Quicktime 7.7.9 Arizona Sunshine AdobeReader 4.0 AdobeReaderPowerPoint 9.0 2007 AdobeReaderPowerPoint DC 2016 PowerDirector v7 HandBrake 1.1.0 Win Media Player Win Media Player PowerDirector v16 Space Pirate Trainer 2000 2010 2018 Fig. 2: Comparison between TLP of desktop applications for 2000 [13, 14], 2010 [3] and 2018 [this work].

% 100 3D Gaming VR Gaming Image Office Media Video Authoring & Web 80 Authoring Playback Transcoding Browsing 60 40 Utilization 20 0 GPU

Crysis WinX Edge Bioshock Fallout 4 Safari 4.0 RAW Data Excel 2007 Excel 2016 FirefoxFirefox 3.5 v60 Serious Sam AutoCAD LT Word 2007 Word 2016 Chrome v66 Call of Duty 4 Maya3D 2010Maya3D 2019 Project CARS Photoshop2 CS4Photoshop CC Quicktime 7.6 HandBrake 0.9 Quicktime 7.7.9 Premiere Pro CC Arizona Sunshine AdobeReaderPowerPoint 9.0 2007 AdobeReaderPowerPoint DC 2016 VLC MediaPowerDirector Player v7 HandBrake 1.1.0 Win Media Player PowerDirector v16 2010 2018 Space Pirate Trainer Street & Trips 2010 Fig. 3: Comparsion between GPU utilization of desktop applications for 2010 [3] and 2018 [this work].

12 Ideal EasyMiner 8 HandBrake Photoshop

TLP 4 Project CARS 2 Chrome VLC Media Player 0 4 8 12 Excel Number of Logical Cores Cortana Fig. 4: TLP of applications with the highest TLP in each category for 4-12 logical cores with SMT, showing the impact of core scaling.

Fig. 6: Instantaneous TLP and GPU utilization over time for Pho- toshop for different number of cores with SMT. Note that filter rendering scales linearly with core count, yielding a shorter runtime, whereas user interaction processing does not exhibit much .

Fig. 5: Instantaneous TLP and GPU utilization over time for Hand- Brake for different number of cores with SMT. Note that video transcoding shows proportional scaling with core count, and thus reduced runtime for transcoding the same video clip. is not considered while calculating average TLP. User input processing exhibits a low TLP, whereas the TLP of filter Fig. 7: Instantaneous TLP and GPU utilization over time for Project rendering scales linearly with the number of active cores and CARS 2 (Rift) for different number of cores with SMT, illustrating can reach a maximum of 12 when all cores are enabled. moderate scalability as the active core count increases. The runtime is bottlenecked by user response time, so it gets smaller with increasing number of cores, but is still far from of parallel tasks. Performance of parallel workloads scales up linear scaling, in compliance with Amdahl’s law [2]. with growing number of cores, leading to shorter execution Project CARS 2, similar to Photoshop, involves significant times for the same tasks. Interactive benchmarks, with a user interaction and is also continuously processing sensor data significant amount of parallelism, can also benefit from more and rendering graphics. The instantaneous TLP reaches the processor cores, though the processing of user interactions maximum in bursts, but mostly remains between 2 and 6. The is usually serialized. However, non-bursty workloads limited average TLP saturates around 5, due to serialized work. by low TLP do not benefit much from extra processor cores. Overall, the performance gains provided by increasing the Therefore, further exploitation of parallelism is necessary to amount of hardware resources heavily depends on the volume take advantage of the available hardware resources. GTX 1080 GTX 680 HB-1080-SMT HB-1080 HB-680-SMT HB-680 Logical Transcode Rate TLP GPU Utilization (%) WinX-1080-SMT WinX-1080 WinX-680-SMT WinX-680 Cores No GPU GPU No GPU GPU No GPU GPU 40 60 4 9 14 4.0 3.8 0.0 5.2 1 (%) 30 45 8 19 27 7.9 7.0 0.0 10.0 0.9 0.8 (FPS) Rate 20 30 12 28 37 11.5 9.1 0.0 13.9 Utilization 0.7 10 15 TABLE III: Transcode rate, TLP, and GPU utilization of WinX with

0.6 GPU

Transcode 0 0 and without NVIDIA CUDA/NVENC. Enabling the GPU improves 0.5 2 4 6 2 4 6 the transcode rate and lowers the TLP. Number of Logical Cores Number of Logical Cores 0.4 0.3 (a) Transcode Rate (b) GPU Utilization 1080 Ti, harnesses a much higher utilization. If we use an even 0.2 Fig. 8: Transcode rate (GTX 1080 Ti only) and GPU utilization of lower-end GPU, we expect the GPU utilization to increase, and 0.1 HandBrake and WinX for 2 to 6 logical cores, illustrating the effect the performance will start to degrade after the GPU utilization 0 of SMT and GPU offloading. The transcode rates for GTX 680 are saturates at the maximum. not2 shown as they overlap exactly with4 those for GTX 1080 Ti. 6 The transcode rate, TLP and GPU utilization of WinX, 2) Simultaneous Multi-Threading (SMT): SMT is aimed with and without GPU acceleration, are shown in Table III. at exploiting more parallelism and improving functional unit The CPU offloads compute-intensive transcoding tasks to the (FU) utilization in CPUs by allowing a physical core to run GPU through specific Application Programming Interfaces multiple threads simultaneously [38]. Prior work states that (), and the amount of offloading, indicated by the GPU SMT boosts performance as threads bring useful data on- utilization, grows almost linearly with increase in TLP. With chip for each other [3]. However, Figure 8 shows that the CUDA/NVENC enabled, the transcode rate of WinX improves transcode rates of both HandBrake and WinX decrease when by 143% on an average and TLP decreases by up to 22%. SMT is enabled. This is because SMT helps when data is GPU acceleration not only increases performance, but also reused among threads within a physical core, but it also limits relieves stress on the CPU, making it available for other tasks the hardware resources (e.g. functional units) available for and protecting it from thermal throttling. Similar offloading each thread. Statistics from Intel VTune Amplifier show that behavior is observed for Premiere Pro while exporting video for HandBrake, enabling SMT causes a decrease in Last-Level with CUDA support, as shown in Figure 9. The assistance Cache (LLC) misses and the time spent waiting on main of GPU does not cause a significant change in runtime, but memory, as threads fetch data for one another [22]. However, slightly lowers the instantaneous TLP. the percentage of time spent by a core stalled on the L1 cache, Non-CUDA CUDA without missing in it, increases from 5.3% to 10.7%. This is explained by thread contention for computation resources

within a physical core. For example, an old store may be waiting for available functional units to resolve its address and blocking a newer load. The performance degradation due to resource contention overcomes the benefits from lower pres- sure on the LLC/off-chip memory. It also hurts the utilization of GPU, leading to a non-negligible reduction in transcode Fig. 9: GPU utilization of GTX 680 and 1080 Ti for Premiere Pro. Video export with CUDA support shows higher utilization and lower rate. This implies that as software exploits parallelism by TLP than without CUDA, and the utilization is higher for GTX 680. distributing computation among threads, SMT may have no or even detrimental impact on performance. 2) GPU Utilization: As shown in Table II, the GPU is under-utilized for most of the applications, which is pos- D. GPU Analysis sibly because the computational power of the GPU greatly The dramatic breakthrough in GPU hardware over the past exceeds what is demanded from it. The GPU indeed executes couple of decades has made it crucial to understand how substantial tasks in various applications, such as hardware GPUs are used to effectively assist compute-intensive tasks rendering in Maya and video export in PowerDirector, yet and whether GPUs are exploited to their full potential. both exhibit GPU utilizations lower than 10%. Even for WinX 1) GPU Offloading: The performance and GPU utilization Video Converter, which uses CUDA/NVENC in the GPU for of HandBrake and WinX with the high-end GTX 1080 Ti and transcoding, the average GPU utilization is 13.6%. On the the mid-end GTX 680 are shown in Figure 8. HandBrake does other hand, there are applications that utilize the GPU much not offload tasks to the GPU, so the utilization stays below 1%, more efficiently, such as VR games and cryptocurrency miners. regardless of the number of active cores and GPU settings. We measure the GPU utilization of the mid-end GPU for WinX, on the other hand, supports with video-related applications and cryptocurrency miners, as these CUDA/NVENC. The transcode rates for different GPUs are use the GPU more than the others, and compare them to almost the same (the plots for GTX 680 are omitted as they the utilization of the high-end GPU. Most applications see a overlap with those for GTX 1080 Ti). In order to achieve notable improvement in utilization, except for cryptocurrency similar performance, the GTX 680, which is inferior to the mining. Both GPUs show utilizations of up to 100% for Chart Title 4 2 0 Chrome Firefox Edge Chrome Firefox Edge 100 Multi-tab Single-tab ESPN Wiki GTX 680 GTX 1080 Ti 75 3 9

% 50 (%) GPU 25 2 6 TLP Utilization 0 1 3 WMP VLC WinX BitcoinM EasyMiner WinEth 0 GPU Fig. 10: GPU utilization of GTX 680 and 1080 Ti for applications 0 Utilization that show substantial use of GPU. VR is excluded as it requires a Edge Edge Edge Edge GPU better than GTX 970. PhoenixMiner does not support GTX 680. ChromeFirefox ChromeFirefox ChromeFirefox ChromeFirefox (a) TLP (b) GPU Utilization Bitcoin Miner and EasyMiner, but as expected, the hash rate Fig. 11: (a) TLP and (b) GPU utilization for web browsing tests using of GTX 680 is at least 2× lower despite the assistance of the multiple vs. a single tab and browsing tests on ESPN vs. Wikipedia. CPU. Windows Ethereum Miner, however, has a higher GPU utilization with the superior GPU, since NVIDIA’s Kepler of processes. All web browsers use more GPU while rendering architecture in GTX 680, released before the prevalence of ESPN, suggesting that graphic-intensive work is offloaded to cryptocurrency, is not optimized to run mining workloads. the GPU when possible, as expected. In summary, a mid-end GPU is sufficient for most applica- Overall, web browsers have shown improvements in exploit- tions, including video-editors/transcoders. However, for appli- ing parallelism over the last two decades. Although TLP is cations, such as VR gaming and mining, that perform intensive bottlenecked by waiting for and processing user interaction, computations on the GPU, a high-end GPU is indispensable, the improved parallelism enhances user experience in terms as the mid-end GPU causes a significant performance loss. of both responsiveness and stability of web browsers.

E. Web Browsing Workload Analysis F. Virtual Reality Workload Analysis Among the web browsers, Chrome and Firefox share com- Oculus Rift and HTC Vive were released in 2016, and mon features. While the number of processes created by HTC Vive Pro launched in 2018. As shown in Figure 12, Chrome is 10× larger than that by Firefox, Firefox uses much Rift achieves the highest TLP, especially for graphic-intensive more resources in GPU to match the performance. Edge claims games like Project CARS and Fallout 4. Vive and Vive Pro to have the best power efficiency, with Chrome and Firefox have almost the same TLP. Furthermore, Rift achieves the consuming 36% and 53% more power respectively [40], which most stable frame rate of the three headsets (Figure 13). The is consistent with its low TLP and GPU utilization. specified frame rate for all the headsets is 90 FPS, but if The TLP and GPU utilization of the web browsing test- only 4 logical cores are available, the actual frame rate of benches are illustrated in Figure 11. The tests using multiple Rift is clamped to 45 FPS due to asynchronous spacewarp tabs have similar or higher TLP compared to those using a (ASW). ASW compromises the frame rate when the system single tab, which is in contradiction to the results from Blake cannot handle rendering at the full rate, to lower the minimum et al. [3]. In the past, web browsers used to run the entire system requirements for the headset. This is consistent with application in a single process, and the overhead of garbage the reduction in both TLP and GPU utilization (Figure 7). collection when a user navigates to another website resulted Vive and Vive Pro, instead, apply asynchronous reprojection in higher TLP for single-tab tests. Current web browsers use to improve user experience. This technique pushes the GPU multi-process models to separate websites from each other to render at 90 FPS, and inserts an adjusted frame when the and the browser itself, so that web contents are loaded in GPU fails to render the next frame in time. So, with 4 logical parallel to improve responsiveness. This is also to prevent cores, the frame rate oscillates between 90 and 45 FPS, and failures in one webpage’s content from crashing the entire only slight variations appear in the TLP and GPU utilization. browser [7, 27]. Inactive tabs run as background processes in The GPU utilization correlates with the resolution of the the system. Web browsing using multiple tabs spawns more headset. For all games except Fallout 4, Vive Pro, which has processes and threads than when using a single tab, resulting in the best resolution, achieves the highest GPU utilization. Rift higher TLP. However, the increase is not significant, because and Vive have the same resolution and show comparable GPU browsers constantly throttle inactive tabs after a certain amount utilizations. Fallout 4 exhibits a different trend in hardware of time [8]. Chrome generates the most number of processes utilization than the other games. The GPU utilization for Vive and shows the least difference in the number of processes as Pro is the lowest, and a lower frame rate for Vive Pro is well as TLP between the two tests. The overhead of garbage observed in the game. collection is also reduced, as it is scheduled to take place during idle time to avoid degradation of user experience [9]. VI.RELATED WORK In terms of the ESPN tests, Chrome attains the highest TLP, while Firefox and Edge do not exhibit much difference in Prior work has explored characterizing simulated as well TLP. Chrome generally creates a rendering process for each as real systems. Eyerman and Eeckhout [12] evaluated a instance of the website, and the large amount of active content variety of multi-core systems to find the optimal design with in ESPN makes Chrome spawn more processes for webpage limited hardware resources. Lorenzon et al. [24] investigated rendering, leading to a higher TLP. Firefox and Edge, on the the TLP and energy consumption of Application Programming other hand, do not show any apparent increase in the number Interfaces (APIs) on embedded systems and general purpose Chart Title 5

0

AZ… Fall… RA… Seri… Spa… Proj… Rift Vive Vive Pro • Applications exhibiting complementary TLP characteris- 5 100 tics can be scheduled to execute concurrently to achieve 4 (%) 75 3 50 best utilization of the processor. For example, HandBrake

TLP 2 1 25 exhibits high TLP with short periods of TLP drop. The 0 Utilization 0 OS could schedule another task during troughs in TLP,

GPU thus trading off fairness for better overall utilization. Fallout 4 Fallout 4 RAW Data Proj. CARS RAW Data Proj. CARS • The GPU can be further exploited to assist compute- AZ Sunshine Serious SpaceSam Pirate AZ Sunshine Serious SpaceSam Pirate intensive tasks. Although the GPU suffers from poor (a) TLP (b) GPU Utilization single-threaded/latency-sensitive performance, tasks that Fig. 12: (a) TLP and (b) GPU Utilization for VR games across Oculus have lower priority or latency requirements could be Rift, HTC Vive, and HTC Vive Pro. offloaded to the GPU. For instance, if the user is editing an image in Photoshop and transcoding videos in back- ground, the transcoding task can be offloaded to the GPU when Photoshop is using the CPU for rendering. • Idle time and time periods of low activity can be utilized to predict future user tasks and perform them specula- tively. This can lead to a performance boost, with a risk Fig. 13: Instantaneous frame rate for Project CARS 2 for Oculus Rift, of wasted work and energy. However, power is not a HTC Vive, and HTC Vive Pro with 6 SMT cores. The frame rate of major constraint for desktops. Interactive applications are Rift is more stable than that of Vive and Vive Pro. good candidates for such kind of speculation, since low systems. Our work analyzes commercial applications on a real TLP is attained while user inputs are being processed. desktop machine, allowing us to obtain realistic data. For example, when a Photoshop user selects a blur filter, the system can speculate the next task to be blur filter Plenty of hardware reviews on VR headsets, cryptocurrency rendering and the core can start fetching off-chip data mining and desktops are available through tech channels locally, while the user is specifying filter configurations. [4, 5, 20, 26, 34, 39]. They measure gaming performance • As the trend of innovation shifts further away from through frame rate and mining performance through hash rate. parallelism into heterogeneity, future software could of- The study done by Magaki et al. [25] illustrated that ASICs fload kernels within an application to dedicated hard- have better energy and cost efficiency over GPU for mining. ware/accelerators that execute the kernel most efficiently. This work does an analysis from a parallelism perspective and evaluates TLP and GPU utilization. Extensive characterization work has also been done on VIII.CONCLUSION mobile devices [17, 28, 41]. Gao et al. [15, 16] studied how Major advancements have taken place in desktop hardware multi-core mobile devices are utilized by common mobile over the past decade. In this work, we analyzed the TLP, apps. Chen et al. [6] characterized mobile augmented reality GPU utilization, and effects of core scaling and SMT, on applications from the system and architecture perspectives. traditional and emerging applications commonly used in con- temporary desktop systems. Our results showed that software VII.DISCUSSION has improved to take advantage of the parallelism available TLP and GPU utilization can act as useful guidelines for in the hardware compared to the work in 2010 [3]. No- end-users on the amount of hardware resources to invest. ticeable increases were seen in many applications, includ- A key takeaway from this work is that employing many ing those reputed for effective utilization of processor cores processing cores and a high-end GPU does not always bring like HandBrake and Photoshop. For applications with slight benefits. For users who primarily spend time online or on changes in TLP over the past 18 years, efforts for exploiting office applications, 2-3 cores are sufficient to achieve maximal available parallelism were exhibited by them achieving high performance. For professional users who use their desktops instantaneous TLP during execution. For example, Excel spent for video transcoding or image editing, performance scales 3.7% of time using the maximum number of available logi- roughly linearly with increasing number of cores. For gaming cal cores concurrently, and web browsers have shifted from and cryptocurrency mining, a better GPU leads to much better single-process models to multi-process models, resulting in performance than a CPU with a large core count. better responsiveness and reliability. Emerging applications Multi-core scaling has hit a plateau, and software developers also demonstrated good utilization of hardware resources. The are now, more than ever, expected to write hardware-aware average TLP of VR gaming is twice that of traditional 3D programs. This is because TLP and GPU utilization are not gaming, and cryptocurrency miners involving CPU mining fundamental to an application, but highly dependent on how have a TLP higher than that of over 80% of the benchmarks. In they are implemented in software. Based on our 18-year addition, SMT is beneficial when threads running on the same perspective analysis, we discuss potential areas where software physical core work on the same data with sufficient computa- can be improved to further exploit parallelism. tion resources, else it becomes detrimental for performance. On the other hand, overall GPU utilization was lower than [19] J. Hauswald, M. A. Laurenzano, Y. Zhang, C. Li, A. Rovinski, that observed in 2010. This showed that the improvements A. Khurana, R. G. Dreslinski, T. Mudge, V. Petrucci, L. Tang, and J. Mars, “Sirius: An open end-to-end voice and vision personal in the amount of available resources in the GPU has been assistant and its implications for future warehouse scale computers,” growing at a faster pace than improvements in the parallelism SIGPLAN Not., vol. 50, no. 4, pp. 223–238, Mar. 2015. harnessed by software. However, emerging workloads, e.g. VR [20] T. Hochstenbach, “Project cars 2 review: benchmarks with 23 graphics cards.” [Online]. Available: https://us.hardware.info/reviews/7614/ games and cryptocurrency miners, exhibited great potential, as project-cars-2-review-benchmarks-with-23-graphics-cards they fully exploited the computation power of the GPU. [21] Intel, “Intel core i7-8700k processor.” [Online]. Available: In conclusion, appreciable progress has been made by soft- https://ark.intel.com/products/126684/ Intel-Core-i7-8700K-Processor-12M-Cache-up-to-4-70-GHz- ware in exploiting parallelism. However, there is still sufficient [22] Intel, “Intel vtune amplifier.” [Online]. Available: scope for software to further improve hardware utilization. https://software.intel.com/en-us/vtune-amplifier-help [23] KZero, “Number of Active Virtual Reality Users Worldwide from REFERENCES 2014 to 2018 (in Millions).” [Online]. Available: https://www.statista. com/statistics/426469/active-virtual-reality-users-worldwide/ [1] “Cryptocurrency market capitalizations.” [Online]. Available: [24] A. F. Lorenzon, M. C. Cera, and A. C. Schneider Beck, “Performance https://coinmarketcap.com/ and energy evaluation of different multi-threading interfaces in [2] G. M. Amdahl, “Validity of the single processor approach to achieving embedded and general purpose systems,” Journal of Signal Processing large scale computing capabilities,” in Proceedings of the April 18-20, Systems, vol. 80, no. 3, pp. 295–307, Sep 2015. 1967, Spring Joint Computer Conference, ser. AFIPS ’67 (Spring). [25] I. Magaki, M. Khazraee, L. V. Gutierrez, and M. B. Taylor, “Asic New York, NY, USA: ACM, 1967, pp. 483–485. clouds: Specializing the datacenter,” in 2016 ACM/IEEE 43rd Annual [3] G. Blake, R. G. Dreslinski, T. Mudge, and K. Flautner, “Evolution of International Symposium on Computer Architecture (ISCA), June 2016, thread-level parallelism in desktop applications,” in Proceedings of the pp. 178–190. 37th Annual International Symposium on Computer Architecture, ser. [26] Matt Hanson, “Best mining GPU 2018 : the best graphics cards for ISCA ’10. New York, NY, USA: ACM, 2010, pp. 302–313. mining Bitcoin, Ethereum and more.” [Online]. Available: [4] BuriedONE Blockchain, “GPU Mining Hashrates.” [Online]. https://www.techradar.com/news/best-mining-gpu Available: https://www.buriedone.com/hashrates.html [27] MDN web docs, “Multiprocess firefox.” [Online]. Available: [5] K. Carbotte, “Htc vive pro headset review: A high bar for premium https://developer.mozilla.org/en-US/docs/Mozilla/Firefox/ vr.” [Online]. Available: https: [28] G. Narancic, P. Judd, D. Wu, I. Atta, M. Elnacouzi, J. Zebchuk, //www.tomshardware.com/reviews/htc-vive-pro-headset-vr,5549.html J. Albericio, N. E. Jerger, A. Moshovos, K. Kutulakos, and [6] H. Chen, Y. Dai, H. Meng, Y. Chen, and T. Li, “Understanding the S. Gadelrab, “Evaluating the memory system behavior of smartphone characteristics of mobile augmented reality applications,” in 2018 workloads,” in 2014 International Conference on Embedded Computer IEEE International Symposium on Performance Analysis of Systems Systems: Architectures, Modeling, and Simulation (SAMOS XIV), July and Software (ISPASS), April 2018, pp. 128–138. 2014, pp. 83–92. [7] Chromium Blog, “Multi-process architecture.” [Online]. Available: [29] NVIDIA, GeForce GTX 1080 Ti. [Online]. Available: https: https://blog.chromium.org/2008/09/multi-process-architecture.html //www.nvidia.com/en-us/geforce/products/10series/geforce-gtx-1080-ti/ [8] C. Davenport, “Google introduces background tab throttling in chrome [30] NVIDIA, GeForce GTX 285. [Online]. Available: 57.” [Online]. Available: hhttps://www.androidpolice.com/2017/03/14/ https://www.geforce.com/hardware/desktop-gpus/geforce-gtx-285 google-introduces-background-tab-throttling-chrome-57-desktop/ [31] NVIDIA, GeForce GTX 680. [Online]. Available: [9] U. Degenbaev, J. Eisinger, M. Ernst, R. McIlroy, and H. Payer, “Idle https://www.geforce.com/hardware/desktop-gpus/geforce-gtx-680 time garbage collection scheduling,” SIGPLAN Not., vol. 51, no. 6, pp. [32] S. Pal, J. Beaumont, D. Park, A. Amarnath, S. Feng, C. Chakrabarti, 570–583, Jun. 2016. H. Kim, D. Blaauw, T. Mudge, and R. Dreslinski, “Outerspace: An [10] R. H. Dennard, F. H. Gaensslen, V. L. Rideout, E. Bassous, and A. R. outer product based sparse matrix multiplication accelerator,” in 2018 LeBlanc, “Design of ion-implanted mosfet’s with very small physical IEEE International Symposium on High Performance Computer dimensions,” IEEE Journal of Solid-State Circuits, vol. 9, no. 5, pp. Architecture (HPCA), Feb 2018, pp. 724–736. 256–268, 1974. [33] I. Park and R. Buch, “Event tracing: Improve debugging and [11] H. Esmaeilzadeh, E. Blem, R. S. Amant, K. Sankaralingam, and performance tuning with etw,” MSDN Magazine, April 2007. D. Burger, “Dark silicon and the end of multicore scaling,” in [34] R. Smith, “The nvidia geforce gtx 1080 ti founder’s edition review: Computer Architecture (ISCA), 2011 38th Annual International Bigger pascal for better performance.” [Online]. Available: Symposium on. IEEE, 2011, pp. 365–376. https://www.anandtech.com/show/11180/ [12] S. Eyerman and L. Eeckhout, “The benefit of smt in the multi-core the-nvidia-geforce-gtx-1080-ti-review era: Flexibility towards degrees of thread-level parallelism,” SIGARCH [35] Statista, “Global market share held by leading desktop internet Comput. Archit. News, vol. 42, no. 1, pp. 591–606, Feb. 2014. browsers.” [Online]. Available: https://www.statista.com/statistics/ [13] K. Flautner, R. Uhlig, S. Reinhardt, and T. Mudge, “Thread-level 544400/market-share-of-internet-browsers-desktop/ parallelism and interactive performance of desktop applications,” [36] Statista, “Global operating systems market share for desktop PCs.” SIGOPS Oper. Syst. Rev., vol. 34, no. 5, pp. 129–138, Nov. 2000. [Online]. Available: https://www.statista.com/statistics/218089/ [14] K. Flautner, R. Uhlig, S. Reinhardt, and T. Mudge, “Thread-level global-market-share-of-windows-7/ parallelism aof desktop applications,” Workshop on Multi-threaded [37] M. B. Taylor, “Is dark silicon useful? harnessing the four horsemen of Execution, Architecture and Compilation, 2000. the coming dark silicon apocalypse,” in DAC Design Automation [15] C. Gao, A. Gutierrez, R. G. Dreslinski, T. Mudge, K. Flautner, and Conference 2012, June 2012, pp. 1131–1136. G. Blake, “A study of thread level parallelism on mobile devices,” in [38] D. M. Tullsen, S. J. Eggers, and H. M. Levy, “Simultaneous 2014 IEEE International Symposium on Performance Analysis of multithreading: Maximizing on-chip parallelism,” in ACM SIGARCH Systems and Software (ISPASS), March 2014, pp. 126–127. Computer Architecture News, vol. 23, no. 2, 1995, pp. 392–403. [16] C. Gao, A. Gutierrez, M. Rajan, R. G. Dreslinski, T. Mudge, and C. J. [39] S. Walton, “Intel core i7-8700k review: The new gaming king.” Wu, “A study of mobile device utilization,” in 2015 IEEE [Online]. Available: International Symposium on Performance Analysis of Systems and https://www.techspot.com/review/1497-intel-core-i7-8700k/page1.html Software (ISPASS), March 2015, pp. 225–234. [40] J. Weber, “Get more out of your battery with microsoft edge.” [17] A. Gutierrez, R. G. Dreslinski, T. F. Wenisch, T. Mudge, A. Saidi, [Online]. Available: https://blogs.windows.com/windowsexperience/ C. Emmons, and N. Paver, “Full-system analysis and characterization 2016/06/20/more-battery-with-edge/ of interactive smartphone applications,” in 2011 IEEE International [41] Y. Zhang, X. Wang, X. Liu, Y. Liu, L. Zhuang, and F. Zhao, “Towards Symposium on Workload Characterization, Nov 2011, pp. 81–90. better cpu power management on multicore smartphones,” in [18] Handbrake, “Video encoding speed.” [Online]. Available: https:// Proceedings of the Workshop on Power-Aware Computing and Systems, handbrake.fr/docs/en/latest/technical/video-encoding-performance.html ser. HotPower ’13. New York, NY, USA: ACM, 2013, pp. 11:1–11:5.