A Thread Performance Comparison: Windows NT and Solaris on a Symmetric Multiprocessor

The following paper was originally published in the Proceedings of the 2nd USENIX Windows NT Symposium Seattle, Washington, August 3–4, 1998 A Thread Performance Comparison: Windows NT and Solaris on A Symmetric Multiprocessor Fabian Zabatta and Kevin Ying Brooklyn College and CUNY Graduate School For more information about USENIX Association contact: 1. Phone: 510 528-8649 2. FAX: 510 548-5738 3. Email: [email protected] 4. WWW URL:http://www.usenix.org/ A Thread Performance Comparison: Windows NT and Solaris on A Symmetric Multiprocessor Fabian Zabatta and Kevin Ying Logic Based Systems Lab Brooklyn College and CUNY Graduate School Computer Science Department 2900 Bedford Avenue Brooklyn, New York 11210 {fabian, kevin}@sci.brooklyn.cuny.edu Abstract Manufacturers now have the capability to build high performance multiprocessor machines with common CPU CPU CPU CPU cache cache cache cache PC components. This has created a new market of low cost multiprocessor machines. However, these machines are handicapped unless they have an oper- Bus ating system that can take advantage of their under- lying architectures. Presented is a comparison of Shared Memory I/O two such operating systems, Windows NT and So- laris. By focusing on their implementation of Figure 1: A basic symmetric multiprocessor architecture. threads, we show each system's ability to exploit multiprocessors. We report our results and inter- pretations of several experiments that were used to compare the performance of each system. What 1.1 Symmetric Multiprocessing emerges is a discussion on the performance impact of each implementation and its significance on vari- Symmetric multiprocessing (SMP) is the primary ous types of applications. parallel architecture employed in these low cost machines. An SMP architecture is a tightly coupled multiprocessor system, where processors share a 1. Introduction single copy of the operating system (OS) and resources that often include a common bus, memory A few years ago, high performance multiprocessor and an I/O system, see Figure 1. However, the ma- machines had a price tag of $100,000 and up, see chine is handicapped unless it has an OS that is able [16]. The multiprocessor market consisted of pro- to take advantage of its multiprocessors. In the past, prietary architectures that demanded a higher cost the manufacturer of a multiprocessor machine would due to the scale of economics. Intel has helped to be responsible for the design and implementation of change that by bringing high performance comput- the machine's OS. It was often the case, the ma- ing to the mainstream with its Pentium Pro (PPro) chines could only operate under the OS provided by processor. The PPro is a high performance processor the manufacturer. The machines described here are with built in support for multiprocessing, see [4]. built from common PC components and have open This coupled with the low cost of components has architectures. This facilitates an environment where enabled computer manufacturers to build high per- any software developer can design and implement an formance multiprocessor machines at a relatively OS for these machines. Two mainstream examples low cost. Today a four processor machine costs un- of such operating systems are Sun's Solaris and Mi- der $12,000. crosoft's Windows NT. They exploit the power of multiprocessors by incorporating multitasking and multithreading architectures. Their implementations design arises from sharing the same address space. are nevertheless very different. This allows sub-objects of the same process to easily communicate by using shared global variables. However, the sharing of data requires synchroniza- 1.2 Objects Of Execution tion to prevent simultaneous access. This is usually accomplished by using one of the synchronization Classic operating systems, such as UNIX, define a variables provided by the OS, such as a mutex. For process as an object of execution that consists of an general background information on synchronization address space, program counter, stack, data, files, variables see [14], for information on Solaris's syn- and other system resources. Processes are individu- chronization variables see [1, 5, 12, 13], and [7, 10, ally dispatched and scheduled to execute on a proc- 15] for Windows NT's synchronization variables. essor by the operating system kernel, the essential part of the operating system responsible for manag- ing hardware and basic services. In the classic case, 2. Solaris’s and Windows NT’s Design a multitasking operating system divides the available CPU time among the processes of the system. While Windows NT and Solaris further develop the basic executing, a process has only one unit of control. design by sub-dividing the kernel-level objects of Thus, the process can only perform one task at a execution into smaller user-level objects of execu- time. tion. These user-level objects are unknown to the operating system kernel and thus are not executable In order to exploit concurrency and parallelism, op- on their own. They are usually scheduled by the erating systems like NT and Solaris further develop application programmer or a system library to exe- the notion of a process. These operating systems cute in the context of a kernel-level object of execu- break the classical process into smaller sub-objects. tion. These sub-objects are the basic entity to which the OS allocates processor time. Here we will refer them Windows NT and Solaris kernel-level objects of exe- as kernel-level objects of execution. Current oper- cution are similar in several ways. Both operating ating systems allow processes to contain one or more systems use a priority-based, time-sliced, preemptive of these sub-objects. Each sub-object has its own multitasking algorithm to schedule their kernel-level context1 yet it shares the same address space and objects. Each kernel-level object may be either in- resources, such as open files, timers and signals, terleaved on a single processor or execute in parallel with other sub-objects of the same process. The de- on multiprocessors. However, the two operating sign lets the sub-objects function independently systems differ on whether user-level or kernel-level while keeping cohesion among the sub-objects of the objects should be used for parallel and concurrent same process. This creates the following benefits. programming. The differences have implications on the overall systems' performances, as we will see in Since each sub-object has its own context each can later sections. be separately dispatched and scheduled by the operating system kernel to execute on a processor. Therefore, a process in one of these operating sys- 2.1 NT’s Threads and Fibers tems can have one or more units of control. This enables a process with multiple sub-objects to over- A thread is Windows NT’s smallest kernel-level lap processing. For example, one sub-object could object of execution. Processes in NT can consist of continue execution while another is blocked by an one or more threads. When a process is created, one I/O request or synchronization lock. This will im- thread is generated along with it, called the primary prove throughput. Furthermore with a multiproces- thread. This object is then scheduled on a system sor machine, a process can have sub-objects execute wide basis by the kernel to execute on a processor. concurrently on different processors. Thus, a com- After the primary thread has started, it can create putation can be made parallel to achieve speed-up other threads that share its address space and system over its serial counterpart. Another benefit of the resources but have independent contexts, which include execution stacks and thread specific data. A thread can execute any part of a process' code, in- 1 This refers to its state, defined by the values of the program counter, cluding a part currently being executed by another machine registers, stacks, and other data. thread. It is through threads, provided in the Win32 automatically. One or more threads can be mapped application programmer interface (API), that Win- to a light weight process. The mapping is deter- dows NT allows programmers to exploit the benefits mined by the library or the application programmer. of concurrency and parallelism. Since the threads execute in the context of a light A fiber is NT’s smallest user-level object of execu- weight process, the operating system kernel is un- tion. It executes in the context of a thread and is aware of their existence. The kernel is only aware of unknown to the operating system kernel. A thread the LWPs that threads execute on. An illustrative can consist of one or more fibers as determined by example of this design is shown in Figure 3. the application programmer. ˝Some literature ˝[1,11] assume that there is a one-to-one mapping of user- level objects to kernel-level objects, this is inaccu- Thread Library rate. Windows NT does ˝provide the means for Threads many-to-many ˝scheduling. However, NT's design is Application code, global data, ect. poorly documented and the application programmer Scheduler is responsible for the control of fibers such as allo- LWPs cating memory, scheduling them on threads and pre- emption. This is different from Solaris's implemen- Process Structure tation, as we shall see in the next section. See [7, 10] for more details on fibers. An illustrative exam- The Solaris Kernel ple of NT's design is shown in Figure 2. Figure 3: The relationships of a process and its LWPs and threads Fibers in Solaris. Application Code Global Data Solaris's thread library defines two types of threads Threads according to scheduling. A bound thread is one that permanently executes in the context of a light weight process in which no other threads can execute.

A Thread Performance Comparison: Windows NT and Solaris on a Symmetric Multiprocessor

Designing an Ultra Low-Overhead Multithreading Runtime for Nim

Fiber Context As a ﬁrst-Class Object

Fiber-Based Architecture for NFV Cloud Databases

Concurrent Cilk: Lazy Promotion from Tasks to Threads in C/C++

The Impact of Process Placement and Oversubscription on Application Performance: a Case Study for Exascale Computing

And Computer Systems I: Introduction Systems

Enabling Technologies for Distributed and Cloud Computing Dr

Eindhoven University of Technology MASTER Acceleration of A

HPX-Agseries

Fibers Are Not (P)Threads the Case for Loose Coupling of Asynchronous Programming Models and MPI Through Continuations

Use of High Perfo Nce Networks and Supercomputers Ime Flight Simulation

Parallel Simulation of a Multi-Span DWDM System Limited by FWM