A Case for Increased Operating System Support in Chip Multi-Processors

David Nellans, Rajeev Balasubramonian, Erik Brunvand
School of Computing, University of Utah
Salt Lake City, Utah 84112
{dnellans, rajeev, elb}@cs.utah.edu

Abstract

We identify the operating system as one area where a novel architecture could significantly improve on current chip multi-processor designs, allowing increased performance and improved power efficiency. We first show that the operating system contributes a non-trivial overhead to even the most computationally intense workloads and that this OS contribution grows to a significant fraction of total instructions when executing interactive applications. We then show that architectural improvements have had little to no effect on the performance of the operating system over the last 15 years. Based on these observations we propose the need for increased operating system support in chip multi-processors. Specifically, we consider the potential of a separate Operating System Processor (OSP) operating concurrently with General Purpose Processors (GPP) in a Chip Multi-Processor (CMP) organization.

1 Introduction

The performance of systems has scaled well due to a synergistic combination of technological advancement and architectural improvement. In the last 15 years, high performance workstations have progressed from the venerable single core 486/33, released in 1989, to the current IBM Power 5 and Intel dual core Pentium 4 EE. Process fabrication size has shrunk from 1 micron (486/33) down to 90 nanometers (Pentium Prescott) and is expected to continue down to 65 nanometers and below. Dramatically decreased transistor sizes have helped enable a 100-fold increase in clock frequency, from 33MHz to 3.8GHz, during this same period. Simultaneously, the number of pipeline stages has increased from the classic 5 stage pipeline all the way up to 31 stages [1]. Transistor count and resultant die size have exploded from 1.2 million transistors on the 486/33 to more than 275 million in the Power 5. At the same time, as technology has improved, architectural improvements such as deep pipelines, caching, and out of order execution have allowed us to take advantage of increased transistor counts to improve system performance at a breakneck pace.

What is striking about these performance improvements, however, is that while general application performance has improved approximately 200x in this time period, we find that the performance of the operating system in the same period has seen a significantly smaller improvement, on the order of 50x. For domains where the operating system contributes a significant percentage of cycles, this can have a serious impact on overall system performance.

One contributor to this effect is that the operating system is not typically considered when architectural enhancements are proposed for general purpose processors. Modern processors, especially in an academic setting, are typically simulated using a cycle accurate architectural model, such as Simplescalar [2], to evaluate performance. Typically these cycle accurate simulators do not execute the operating system because doing so requires accurate I/O models and slows down simulation speed substantially. Full system simulators are becoming more prevalent [3, 4, 5, 6] but often require modification to the operating system or do not provide cycle accurate numbers. While several studies suggest that the operating system has a significant role in commercial workload performance [7, 8, 9, 10], the effect of the operating system on varying workloads is still largely unexplored. Our approach in this paper is twofold: we simulate a variety of applications with a full system simulator to measure the portion of system cycles spent in OS tasks, and we take actual measurements on real machines to try to understand how performance has scaled across different implementations of the same instruction set architecture.

While Redstone et al. [7] have shown the OS can account for over 70% of the cycles in commercial workloads, we contribute to this body of work by examining a variety of workloads. We examine computationally intensive workloads to determine the minimum overhead the operating system contributes to workloads. This provides a baseline estimate of the potential inaccuracy of architectural models such as Simplescalar compared to Simics coupled with the University of Wisconsin's Multifacet GEMS project [11]. We show that, minimally, the operating system will contribute 5% overhead to almost all workloads. Workloads which utilize system calls or trigger device interrupts have increased OS overhead, with the suite of interactive applications in our study requiring at least 38% OS overhead. I/O intensive applications often have operating system overheads which even overshadow their userspace components.

With operating system overhead being a significant proportion of instructions executed in many workloads, the performance of the operating system is a critical factor in determining total system throughput. It is currently hard to measure operating system performance due to the lack of cycle accurate simulator support. Instead we measure the effects of architectural improvements by measuring performance improvement on real machines for key components such as context switch time and system call latency. We find that compared to user codes (which are typically the codes targeted by processor optimizations), operating system performance improvement is consistently lower by a factor of 4 or 5.

We also briefly examine the issue of operating system interference when powering down processors on modern machines. We show that offloading interrupt handling to specialized hardware has the potential to allow more aggressive power saving techniques for longer durations than currently possible.

Having shown that operating systems significantly under-utilize modern processors and can contribute a large portion of executed instructions to workloads, we explore the possibility of executing the operating system on a semi-custom processor in a heterogeneous chip multi-processor organization. First we identify three key criteria that make the operating system a good candidate for hardware offload. We then discuss the performance implications of such an organization, followed by situations in which significant power savings could also be achieved.

2 OS Contribution in Workloads

Previous work [7, 9, 10] has shown that the operating system can contribute a significant amount of overhead to commercial workloads. To our knowledge no work has been done to survey the amount of operating system overhead present in a larger variety of workloads. Determining the minimal amount of operating system overhead present in all workloads is useful in estimating the amount of possible error in simulations that do not model these effects. In workloads that have significant operating system overhead, such as database and web server workloads, performance can depend as much on the operating system as on user code. For these applications, knowing operating system overhead is critical for performance tuning. Finally, interactive applications are very rarely examined when architecting performance improving features because they typically under-utilize modern processors. Because of their prevalence, it helps to explore this application space to determine if architectural improvements can reduce power consumption while maintaining current performance levels.

    Workload            % Instructions contributed by OS
    Kernel Compile       3.63%
    SPECint Average      5.15%
    SPECfp Average       5.24%
    ByteMark            23.73%
    Bonnie++            86.41%
    UnixBench           97.16%
    NetPerf             99.48%
    X Usage             38.04%
    Average             44.85%

Table 1: Operating System Contribution in Various Workloads

2.1 Methodology

Most architectural simulators do not execute operating system code because instrumenting external interrupts and device I/O models is a non-trivial task, in addition to significantly slowing down the overall simulation speed. Simics, a full system simulator from Virtutech [4], provides full system simulation by implementing such device models, allowing it to execute an unmodified Linux kernel. To maintain simulation speed Simics implements functional models of these devices as well as the processor. The lack of an accurate timing model limits the performance evaluation possible using this simulator. Functional modeling is advantageous, however, because it provides execution fast enough to model not only the operating system but also interactive applications using X-Windows. Simics' functional model of a Pentium processor allows us to examine the processor state bit to determine whether the processor model was in privileged mode (supervisor mode) or user mode when executing each instruction. This privilege separation allows us to track which instructions were executed within the operating system. The same tracking method also captures system calls, which execute within the user process' address space but are considered operating system functionality.

For our simulated system we used the default Redhat 7.3 disk image provided by Virtutech and left all the default system daemons running. Running with background daemons disabled would not portray a typical system in most cases, which often require remote filesystem services, remote procedure call invocation, as well as remote login. We recognize that computing installations often heavily optimize the set of system daemons running on their machines, but we do not attempt to model the vastly different subsets of services utilized in enterprise deployments. While system daemons typically do not contribute heavily to overall instruction counts, they do cause sporadic interrupts which can cause scheduling changes and I/O activity.

Typically architectural level simulations are only run for several million cycles due to the slowdown of simulation. We found significant variation when looking at sample sizes of only 10 million instructions, due to the OS performing periodic maintenance operations such as clearing write buffers, even when the workload had reached a steady state. To remedy this variation we allowed all benchmarks to run for 10 billion instructions prior to taking measurements. All numbers cited are the average of 10 sequential runs of 100 million instructions, which reduced statistical variation to negligible levels.

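Conceptually, the supervisor-mode tracking described above amounts to incrementing one of two counters on every simulated instruction and reporting the supervisor share at the end of a sample window. The sketch below illustrates only that bookkeeping; the per-instruction hook and the privilege query are hypothetical stand-ins, not the actual Simics API.

    /*
     * Illustrative sketch only: counts instructions by privilege level the
     * way our Simics-based tracking works conceptually.  The callback and
     * the privilege flag are hypothetical stand-ins for simulator hooks.
     */
    #include <stdint.h>
    #include <stdio.h>

    struct priv_counts {
        uint64_t user;        /* instructions retired in user mode       */
        uint64_t supervisor;  /* instructions retired in privileged mode */
    };

    /* Invoked once per simulated instruction by the (hypothetical) simulator. */
    static void count_instruction(struct priv_counts *c, int in_supervisor_mode)
    {
        if (in_supervisor_mode)
            c->supervisor++;   /* OS code, including system calls run on
                                  behalf of a user process               */
        else
            c->user++;
    }

    /* OS contribution as reported in Table 1: supervisor / total. */
    static double os_fraction(const struct priv_counts *c)
    {
        uint64_t total = c->user + c->supervisor;
        return total ? (double)c->supervisor / (double)total : 0.0;
    }

    int main(void)
    {
        /* Toy example with made-up counts, not real measurements. */
        struct priv_counts c = { 0, 0 };
        for (int i = 0; i < 95; i++) count_instruction(&c, 0);
        for (int i = 0; i < 5;  i++) count_instruction(&c, 1);
        printf("OS contribution: %.2f%%\n", 100.0 * os_fraction(&c));
        return 0;
    }
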
2.2 CPU Intensive Workloads

Computationally intense workloads are used to evaluate micro-architectural improvements because they maximize computation and minimize the effects of memory and I/O performance as much as possible. These benchmarks often represent scientific workloads which require minimal user intervention. They are ideal for determining the minimal amount of operating system overhead that will be present on a machine because they make very little use of operating system functionality. Thus any overhead is likely to be contributed by timer interrupts resulting in context switching, interrupt handling, and system daemon overhead.

We chose to use SPECint, SPECfp, and ByteMark [12] as our computationally intense benchmarks. We also chose to time a Linux kernel compile to provide a computationally intense workload that also generates some file system I/O. Table 1 shows the percentage of instructions executed in supervisor mode by these benchmarks, and Table 2 shows the individual breakdown and variation between the SPEC benchmarks. SPEC benchmarks for which simulation was unable to complete, due to unidentifiable internal Simics errors, are not shown. The average operating system overhead when executing these four benchmarks is 9.43%. The ByteMark benchmark skews these results strongly, however, and we believe the average of 5.19% for the SPEC benchmarks is a more realistic minimal overhead.

    SPECfp     Instructions    SPECint    Instructions
    Wupwise     5.96%          Gzip        4.65%
    Swim       19.42%          Vpr         4.43%
    Mgrid       0.59%          Gcc         5.04%
    Applu       4.29%          Mcf         4.86%
    Mesa        0.71%          Crafty      4.54%
    Galgel      0.51%          Parser      4.82%
                               Eon         4.58%
                               Gap         5.91%
                               Vortex      6.62%
                               Bzip2       6.32%
                               Twolf       4.97%
    Average     5.24%          Average     5.15%

Table 2: SPECcpu2000 Instructions Executed in Supervisor Mode
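The two quoted averages appear to be simple means over the corresponding Table 1 entries; the small discrepancies presumably come from rounding of the underlying per-benchmark values:

    \[
    \frac{3.63 + 5.15 + 5.24 + 23.73}{4}\% \approx 9.44\%
    \quad (\text{quoted as } 9.43\%),
    \qquad
    \frac{5.15 + 5.24}{2}\% \approx 5.20\%
    \quad (\text{quoted as } 5.19\%).
    \]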

2.3 I/O Intensive Workloads

Computationally intense benchmarks are a good way to test architectural improvements, but they rarely capture total system performance because many applications involve a significant amount of file or network traffic. We chose to use Bonnie++, an open source file-system performance benchmark [13], and netperf, a TCP performance benchmark, to measure OS contribution in what we expected to be OS dominated workloads. Table 1 shows that our expectations were indeed correct and that the OS contribution far outweighs the user code contribution for file-system and network operations. We also used UnixBench as an I/O intensive workload. UnixBench consists of several benchmarks including Dhrystone, system call performance monitoring, and typical UNIX command line usage. UnixBench has a surprisingly high operating system contribution of 97.16%. These benchmarks confirm that I/O processing is almost entirely OS dominated and support the work of Redstone et al. [7], who have shown that for workloads such as Apache the operating system can contribute more than 70% of the total cycles.

2.4 Interactive Workload

While CPU-intensive benchmarks provide useful data when measuring processor innovations, these workloads rarely represent day to day workstation usage throughout the world. One key difference between these workloads and typical workstation use is the lack of interaction between the application and the user. X Windows, keyboard, and mouse use generate a workload that is very interrupt driven even when the user applications require very little computation, a common case within the consumer desktop domain. When a high performance workstation is being utilized fully, either locally or as part of a distributed cluster, the processor must still handle these frequent interrupts from user interaction, thus slowing down the application computation.

We modeled an interactive workload by simultaneously decoding an MP3, browsing the web, and composing email in a text editor. While this type of benchmark is not reproducible across multiple runs due to spurious interrupt timing, we believe smoothing the operating system effects across billions of instructions accurately portrays the operating system contribution. This interactive workload typically utilized only about 32.91% of the processor's cycles and spent 38.04% of its instructions within the OS.

3 Decomposing Performance Improvements

The performance increase in microprocessors over the past 15 years has come from both architectural improvements and technological improvements. Faster transistors have helped drive architectural improvements such as deep pipelining, which in turn caused an enormous increase in clock frequencies. Significant changes have occurred in architecture as well, moving from the classic five stage pipeline to a multiple issue, deeply pipelined architecture where significant resources are utilized in branch prediction and caching. Because we wish to distill the architectural improvement of the last 15 years, and disregard technological improvement as much as possible, we chose to take measurements from real machines, a 486 @ 33MHz and a Pentium 4 @ 3.0GHz, instead of relying on architectural simulations which can introduce error due to inaccurate modeling of the operating system. For this purpose we define total performance improvement (P) as technology improvement (T) times architectural improvement (A), or P = T x A.

A metric commonly used to compare raw circuit performance in different technologies is the fan-out of four (FO4) delay [14, 15]. This is defined to be the speed at which a single inverter can drive four copies of itself. This metric scales roughly linearly with process feature size. Thus a processor scaled from a 1 micron process, our 486, to 90nm, our Pentium 4, would have approximately an 11-fold decrease in FO4 delay. We set T, our technological improvement, to 11 for the remainder of our calculations.

To determine the architectural improvement, A, when moving from the 486 to the Pentium 4 we must accurately determine the total performance improvement P and then factor out the technological improvement T. To achieve accurate values for P we installed identical versions of the Linux kernel and system libraries, kernel 2.6.9 and gcc 3.4.4, on both machines. Each machine was individually tuned with compiler optimizations for best performance. Using the same kernel and system libraries helps minimize any possible performance variation due to operating system changes, even from minor revisions.

Attempting to discern the architectural improvements of only the microprocessor required that we use benchmarks that minimize the use of external hardware in a performance critical manner. We also chose benchmarks with working sets that fit in the limited main memory of our 486 test machine. This ensured that the machines would not regularly access the swap partition, which could provide misleading performance numbers. We chose the Linux kernel compile because there is enough parallelism present, by creating multiple jobs, that compilation can fully utilize the processor while other jobs are waiting on data from the I/O system. Care has to be taken to create only the minimum number of jobs necessary to fully utilize the processor or we can introduce significantly more context switching than necessary. N + 1 jobs is typically the ideal number of jobs, where N is the number of processors available when compiling source code.

    Benchmark              486             Pentium 4     Speedup
    crafty (SPECint)       Real: 27,705s   Real: 115s    240.91
                           User: 27,612s   User: 110s    251.01
                           Sys:      14s   Sys:    1s     14.00
    twolf (SPECint)        Real: 35,792s   Real: 249s    143.74
                           User: 35,191s   User: 234s    150.38
                           Sys:      49s   Sys:    1s     49.00
    mesa (SPECfp)          Real: 51,447s   Real: 272s    189.14
                           User: 50,801s   User: 249s    204.02
                           Sys:     112s   Sys:    2s     56
    art (SPECfp)           Real: 64,401s   Real: 332s    193.97
                           User: 64,160s   User: 306s    209.67
                           Sys:      34s   Sys:    1s     34
    Linux Kernel Compile   Real: 57,427s   Real: 292s    196.66
                           User: 54,545s   User: 250s    218.18
                           Sys:   1,930s   Sys:   25s     77.2

Table 3: Speedup from 486 33MHz to Pentium 4 3.0GHz

Table 3 shows the performance difference when executing the same workload on both machines using the UNIX time command. The total performance improvement P is taken by examining the Real time. We found the average speedup, including both application and operating system effects, when moving from the 486 to the Pentium 4 was 192.2. Using 192.2 and 11 for P and T respectively, we can then calculate that our architectural improvement, A, is 17.47, or roughly 60% more improvement than we have obtained from technological improvements. This architectural improvement comes from increased ILP utilization due to branch prediction, deeper pipelines, out of order issue, and other features not present in a classic five stage pipeline.

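Stated as a worked calculation using the P = T x A decomposition defined above:

    \[
    A \;=\; \frac{P}{T} \;=\; \frac{192.2}{11} \;\approx\; 17.47,
    \qquad
    \frac{A}{T} \;=\; \frac{17.47}{11} \;\approx\; 1.6,
    \]

that is, the architectural contribution is roughly 60% larger than the technological one.
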
3.1 Operating System Performance

In Section 2 we showed that the operating system contributes non-trivial overhead to most workloads and that some workloads, such as those which are interactive or I/O intensive, can have significant portions of their total instruction counts occur within the operating system. For these applications operating system performance is critical to achieving good workload performance. Table 3 shows that while total application performance has increased 192.2 times in the last 15 years, operating system performance has only increased 46.04 times. Stated another way, application codes that can take full advantage of modern architectural improvements achieve four times more performance than operating system codes that cannot.

Table 3 shows only a limited set of benchmarks because our 486 was restricted to this set by its limited memory. Additionally, the UNIX time command does not provide the timing fidelity necessary to have good confidence in the low number of seconds reported for the system time on the Pentium 4. To validate the OS speedup results in Table 3 we perform two other experiments in subsequent sections which each independently support the conclusion that architectural improvements have helped application codes significantly, but are from 4 to 10 times less effective on operating system codes.
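The 46.04x figure is consistent with the mean of the Sys-time speedups in Table 3, which appears to be how it was derived; the gap between application and OS improvement is then roughly a factor of four:

    \[
    \frac{14.00 + 49.00 + 56 + 34 + 77.2}{5} \;=\; 46.04,
    \qquad
    \frac{192.2}{46.04} \;\approx\; 4.2 .
    \]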


3.2 Context Switch Performance

As single threaded applications are redesigned with multiple threads to take advantage of SMT and CMP based processors, the context switch, an already costly operation, will become even more critical. Operating system designers have focused on reducing the cost of a context switch for years, while much less attention has been paid in the architecture community [16, 17]. To determine the improvement in context switch time over the last 15 years we measure context switch cost using a Posix locking scheme similar to the one described in [18]. In this method two threads contend for a single Posix lock, similar in style to traditional methods of passing a token between threads. The Posix lock is used instead of a traditional UNIX pipe because Posix locks are one of the lowest overhead synchronization primitives currently available in Linux.

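A minimal sketch of such a token-passing microbenchmark is shown below. This is our own illustration of the general technique, not the authors' benchmark code; it uses a pthread mutex and condition variable as the hand-off primitive, and the measured per-hand-off cost includes the locking overhead that the text notes is deliberately disregarded.

    /*
     * Token-passing context switch microbenchmark (illustrative sketch).
     * Two threads alternate ownership of a "token"; on a single processor
     * every hand-off forces a context switch.
     */
    #include <pthread.h>
    #include <stdio.h>
    #include <sys/time.h>

    #define HANDOFFS 1000000

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
    static int turn = 0;                  /* whose turn it is: 0 or 1 */

    static void *player(void *arg)
    {
        int me = *(int *)arg;
        for (int i = 0; i < HANDOFFS; i++) {
            pthread_mutex_lock(&lock);
            while (turn != me)            /* wait until the token is ours */
                pthread_cond_wait(&cond, &lock);
            turn = 1 - me;                /* pass the token to the peer   */
            pthread_cond_signal(&cond);
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t0, t1;
        int id0 = 0, id1 = 1;
        struct timeval start, end;

        gettimeofday(&start, NULL);
        pthread_create(&t0, NULL, player, &id0);
        pthread_create(&t1, NULL, player, &id1);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        gettimeofday(&end, NULL);

        double us = (end.tv_sec - start.tv_sec) * 1e6 +
                    (end.tv_usec - start.tv_usec);
        /* Each iteration contributes one hand-off per thread. */
        printf("approx. cost per hand-off: %.2f us\n", us / (2.0 * HANDOFFS));
        return 0;
    }

Built with gcc -pthread; on a multiprocessor, pinning both threads to one CPU (for example with taskset) keeps every hand-off a true context switch.
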
[Figure 1: Context Switching Speed Normalized for Clock Frequency]

Figure 1 provides the average context switch time, normalized to 3.0GHz, over five runs of our benchmark on various machine configurations. All timings showed a standard deviation of less than 3% within the five runs. The absolute number for context switch time is much less important than the relative time between cases. Disregarding the overhead of the token passing method lets us focus on the relative change in context switch cost between differing machine architectures. Context switching routines in the Linux kernel are hand tuned sections of assembly code that do not make use of the hardware context switch provided by the instruction set. It has been shown that this software context switch has performance comparable to the hardware switch but provides more flexibility.

Our first experiment sought to determine how context switch time scales with clock frequency for a given architecture. Our Pentium 4 running at 3.0GHz supports frequency scaling, allowing us to scale down the frequency of the processor in hardware and run our context switching benchmark at multiple frequencies on the same processor. The absolute context switch time for these frequency scaled runs was then normalized back to 3.0GHz. This normalization allows us to clearly see that context switch time scales proportionally with clock frequency. Scaling proportionally with clock frequency indicates that context switching is a fairly memory independent operation. Thus we can draw the conclusion that the number of cycles required to context switch in a given microarchitecture is independent of clock frequency.

To determine if architectural improvements have reduced context switch time we ran this benchmark on a 486 @ 33MHz with the identical kernel and tool-chain versions. Scaled for frequency, Figure 1 shows that context switch performance has actually decreased in the last 15 years by requiring an increased percentage of cycles. This is likely due to the increased cost of flushing a modern Pentium's 31 pipeline stages versus the classic five stages in a 486. This decrease in context switch performance undoubtedly contributes to the lackluster performance of the operating system that we saw in Table 3.

3.3 System Call Performance

To further validate our results that the OS performs anywhere from 4-10 times worse than user codes on modern architectures, we used the Linux Trace Toolkit (LTT) [19] to log system call timings over a duration of 10 seconds on both the 486 and the Pentium 4. LTT is commonly used to help profile both applications and the operating system to determine critical sections that need optimization. LTT can provide the cumulative time spent in each system call as well as the number of invocations during a tracked period of time. This allows us to average the execution of system calls over thousands of calls to minimize variation due to caching, memory, and disk performance. By averaging these 10 second runs across the multiple workloads found in Table 3 we also eliminate variation due to particular workload patterns. The Linux Trace Toolkit provides microsecond timing fidelity, thus system calls which take less than 1 microsecond can be reported as taking either 0 microseconds or 1 microsecond depending on where the call falls in a clock period. Averaging thousands of calls to such routines should result in a random distribution across a clock period, but we currently have no way to measure if this is true, and thus cannot guarantee whether this is, or is not, occurring. All system call timings have been rounded to the nearest microsecond.
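The averaging rationale can be illustrated without LTT: a call that completes in well under a microsecond can only be observed reliably by timing a large batch and dividing. The sketch below is our own illustration using only standard POSIX interfaces, not the tracing infrastructure used in the study.

    /*
     * Sketch: average a very cheap system call over many invocations to
     * recover sub-microsecond resolution from a microsecond-granularity
     * clock.  Not LTT; illustrative only.
     */
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <sys/time.h>
    #include <unistd.h>

    #define CALLS 1000000

    int main(void)
    {
        struct timeval start, end;

        gettimeofday(&start, NULL);
        for (int i = 0; i < CALLS; i++)
            syscall(SYS_getpid);       /* force the real system call,
                                          bypassing any libc caching   */
        gettimeofday(&end, NULL);

        double total_us = (end.tv_sec - start.tv_sec) * 1e6 +
                          (end.tv_usec - start.tv_usec);
        printf("getpid: %.3f us per call (averaged over %d calls)\n",
               total_us / CALLS, CALLS);
        return 0;
    }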

    System Call      486 (us)    Pentium 4 (us)   Speedup   Arch. Speedup
    brk                     87            2            43        3.90
    close                  439           29            15        1.36
    execve              14,406        1,954             7         .63
    fcntl64                 62            1            62        5.63
    fork                15,985        8,187             2         .18
    fstat64                183            1           183       16.63
    getdents64             501           10            50        4.54
    getpid                  49            1            49        4.45
    getrlimit               59            1            59        5.36
    ioctl                  728           25            29        2.63
    mprotect               324            4            81        7.36
    munmap                 365           11            33        3
    newuname                88            2            44        4
    open                   860          899             0        0
    pipe                   559            7            79        7.18
    poll             9,727,367        9,280         1,048       95.27
    read                44,304          185           239       21.72
    rt-sigaction           206            1           206       18.72
    rt-sigprocmask         178            1           178       16.18
    select             403,042       11,849            34        3.09
    sigreturn               75            1            76        6.90
    stat64                 264           53             5         .45
    time                    72        5,585             0        0
    waitpid             99,917            3        33,305    3,027.72
    write                2,164        3,418             0        0

Table 4: System Call Speedup from 486 33MHz to Pentium 4 3.0GHz (times in microseconds)

Table 4 shows the absolute time for common system calls, the absolute speedup, and the speedup due to architectural improvement only, using our technology scaling factor T set at 11. When examining these figures we must be careful to consider the function of each system call before interpreting the results. System calls which are waiting on I/O, such as poll, or are waiting indefinitely for a signal, such as waitpid, should be disregarded because they depend on factors outside of OS performance. System calls such as execve, munmap, and newuname provide more accurate reflections of operating system performance independent of device speed. Because system calls vary so greatly in execution we do not attempt to discern an average number for system call speedup at this time, instead providing a large number of calls to examine. It is clear, however, that most system calls in the operating system are not gaining the same benefit from architectural improvement, 17.47, that user codes have received in the past 15 years.
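The architectural-speedup column appears to be the measured speedup divided by the technology factor T = 11; for example, for fstat64:

    \[
    \text{speedup} \;=\; \frac{183\,\mu s}{1\,\mu s} \;=\; 183,
    \qquad
    \text{arch.\ speedup} \;\approx\; \frac{183}{11} \;\approx\; 16.6 .
    \]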

3.4 Interrupt Handling

Interrupt handling is a regular, repetitive task that all operating systems perform on behalf of the user threads for which incoming or outgoing data is destined. Using the Linux Trace Toolkit we found that the number of interrupts handled by the operating system during a 10 second period on the Pentium 4 was within 3% of the number handled on the 486 during the same 10 second period. These timings were performed with both interactive and computationally intense workloads. The average of these two workloads is cited in Figure 2, which shows that these external interrupts are regular in their arrival. In many cases these interrupts are not maskable, meaning they must be dealt with immediately, and they often cause a context switch on their arrival.

Under the Linux OS the external timer interval is by default 10ms; this provides the maximum possible idle time for a processor running the operating system. Including external device interrupts, the interval between interrupts

[16] T. Ungerer, B. Robič, and J. Šilc, "A survey of processors with explicit multithreading," ACM Computing Surveys, vol. 35, no. 1, pp. 29-63, 2003.

[17] M. Evers, P.-Y. Chang, and Y. N. Patt, "Using hybrid branch predictors to improve branch prediction accuracy in the presence of context switches," in Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pp. 3-11.

[18] D. E. G. Bradford, "Runtime: Context switching, part 1," IBM developerWorks, http://www-106.ibm.com/developerworks/linux/library/l-rt9/?t=gr,lnxw03=rtcsh1, 2004.