
Exploiting MISD Performance Opportunities in Multi-core Systems

Patrick G. Bridges, Donour Sizemore, and Scott Levy∗
Department of Computer Science, University of New Mexico, Albuquerque, NM 87131

Abstract

A number of important system services, particularly network system services, need strong scaling on multi-core systems, where a fixed workload is executed more quickly with increased core counts. Unfortunately, modern multiple-instruction/multiple-data (MIMD) approaches to multi-core OS design cannot exploit the fine-grained parallelism needed to provide such scaling. In this paper, we propose a replicated work approach to parallelizing network system services for multi-core systems based on a multiple-instruction/single-data (MISD) execution model. In particular, we discuss the advantages of this approach for scalable network system services and we compare past methods for addressing the challenges this approach presents. We also present preliminary results that examine the viability of the basic approach and the software abstractions needed to support it.

∗This work was supported in part by gifts from Sun and Intel Corporations, a faculty sabbatical appointment at Sandia National Laboratories, and a grant from the DOE Office of Science, Advanced Scientific Computing Research, under award number DE-SC0005050, program manager Sonia Sachs.

1 Introduction

A number of important system services need strong scaling, where a fixed workload is executed more quickly with increased core counts. This is particularly true for data-intensive, network-oriented system services because network interface card (NIC) line rates continue to increase but the individual cores that service them are not getting any faster. In cases like this, multiple cores must closely coordinate activities, and multi-core system designs that rely on only intermittent synchronization will bottleneck on inter-core synchronization. As a consequence, the resulting system performance can be disappointing.

Recent research projects on multi-core system software designs propose a distributed systems-oriented approach, where data structures are replicated or segregated between processors [1, 2]. These designs use a multiple-instruction/multiple-data (MIMD) parallel execution model, and generally focus on services for which coarse-grained parallelism is sufficient. Because of this, such systems generally provide only weak scaling, where more cores can be used to handle a larger workload (e.g. more TCP/IP connections) but cannot make an existing workload (e.g. a single TCP/IP connection) run faster.

In this paper, we propose a replicated work approach based on a multiple-instruction/single-data (MISD) execution model to take advantage of unexploited parallelization opportunities in multi-core systems. This approach is particularly important in the network stack, as it is needed to enable strongly scalable network services. We begin by discussing example network applications and related system services that require strong scaling. We then discuss a replicated work execution model for building strongly scalable network system services based on MISD parallelism, along with specific challenges that this execution model presents, including consistency, scheduling, and programmability. Finally, we discuss initial experiments we have conducted on this approach and the infrastructure to support it.

2 Motivating Applications and Services

The combination of increasing network card line rates and plateauing single-core CPU speeds is driving the demand for scalable network services. The most obvious motivating example of this is high-speed network connectivity over TCP-based network connections.

Traditional operating systems have difficulty saturating a 10Gbps Ethernet NIC using a single TCP connection [14] without complex hardware assistance such as TCP offload engines. However, such hardware solutions present well-known challenges to system software (e.g. packet filtering, security, portability).

Past work has shown that providing scalable network connections in software is quite challenging [11, 14]. Perhaps as a result, recent work on scaling TCP/IP network stacks has focused primarily on scaling the number of connections or flows that can be supported rather than on the data rate of individual connections. Corey, for example, uses a private TCP network stack on each core [3], and RouteBricks segregates each IP flow to its own core [6]. These approaches allow systems to scale to handle a large number of connections or flows, but limit the data rate achievable on each flow to well below the throughput of modern network cards.

One frequently attempted workaround to these problems is to use multiple parallel connections for these applications, turning a strong scaling problem into a weak scaling one. Unfortunately, this approach has numerous drawbacks. Most importantly, such approaches can break existing network resource allocation systems [13]. It also places a significant burden on application and library programmers for managing data ordering and reliability.

Without a scalable network service, a wide range of applications, such as high-bandwidth file systems and content distribution systems, become difficult to deploy. For example, the performance of middleware systems that buffer, aggregate, and analyze data as it is streamed between data warehouse systems (e.g. Netezza, LexisNexis) and high-end supercomputing systems (e.g. ADIOS [9]) depends upon the availability of scalable network services.

3 Building Strongly Scalable System Services

The primary challenge in building strongly scalable networked system services is handling shared state. Current MIMD-based approaches to parallelizing system services for multi-core systems rely on relatively expensive synchronization mechanisms (e.g. messages, locks) that prevent the exploitation of fine-grained parallelism. TCP/IP, for example, includes important shared state (e.g. sliding window state) that must be updated on each packet sent or received. TCP also includes elements that can potentially be parallelized, for example data delivery, acknowledgment generation, and timer management. Unfortunately, explicit or implicit inter-core communication costs are too large compared to packet processing times to be successfully amortized away.

3.1 MISD-Based Parallelism

We propose using a multiple-instruction/single-data (MISD) execution model (i.e. replicated work) to provide fine-grained data consistency between cores and to eliminate expensive inter-core interactions. In this approach, shown in Figure 1(a), state is replicated into domains on separate cores, and requests that modify shared state are broadcast to every domain using a ringbuffer-based channel abstraction. The first replica that dequeues a request becomes the primary replica for that request and is responsible for fully processing it, including any updates with globally visible side effects (e.g. data delivery, packet reconstruction, acknowledgment generation). Other replicas that dequeue the request will, on the other hand, only partially process each request to maintain state consistency.

This approach is particularly appropriate for many network services because it parallelizes expensive per-byte processing (e.g. data copies, encryption/decryption, and error correction) and replicates per-header processing costs that are known to be small [7]. It also retains the reduced locking and caching costs of systems like Barrelfish [1] while adding the ability to perform fine-grained updates to logically shared state. Finally, we note that this approach is similar to the execution model used in many group RPC systems [5].
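In outline, each domain's main loop dequeues broadcast requests and branches on whether it is the primary for that request. The following C sketch illustrates this split; the channel type, channel_dequeue(), and the two handler names are hypothetical stand-ins, not Dominoes APIs.

    #include <stdbool.h>
    #include <stddef.h>

    struct channel;                           /* opaque broadcast channel */

    /* Hypothetical API: returns the next request broadcast on the channel,
     * or NULL if none is ready; *primary is set if this replica was the
     * first to dequeue the request. */
    void *channel_dequeue(struct channel *ch, int replica, bool *primary);

    void process_fully(void *req);       /* all globally visible side effects */
    void update_local_state(void *req);  /* consistency-maintaining updates only */

    /* Per-domain event loop for MISD-style replicated work. */
    void domain_loop(struct channel *ch, int replica)
    {
        for (;;) {
            bool primary = false;
            void *req = channel_dequeue(ch, replica, &primary);
            if (req == NULL)
                continue;                /* or block until a request arrives */

            if (primary)
                process_fully(req);      /* e.g. deliver data, generate ACKs */
            else
                update_local_state(req); /* keep this replica's state consistent */
        }
    }

Because non-primary processing touches only the domain's local copy of shared state, no locks are required on either path.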
To examine the potential for this approach compared to lock-based and atomic instruction-based MIMD approaches, we constructed a simple synthetic test. In this test, processing each request requires some amount of work that can be done by one core without synchronization or replication (i.e. parallelizable work), and some work that must be synchronized or replicated. Specifically, the parallelizable work is a set of memory updates done on per-core memory, while the synchronized work is the set of memory updates that must be performed: (a) on every replica; (b) while holding a shared lock; or (c) by using atomic memory update instructions.
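A condensed version of such a test might look like the following sketch; the 10:1 work ratio matches the experiment described below, but the array sizes, counters, and mode names are illustrative assumptions.

    #include <pthread.h>
    #include <stdatomic.h>

    #define PAR_WORK  10          /* parallelizable updates per request */
    #define SYNC_WORK 1           /* synchronized updates per request (10:1) */

    enum mode { MODE_REPLICATED, MODE_LOCKED, MODE_ATOMIC };

    static _Thread_local long percore[PAR_WORK + SYNC_WORK]; /* per-core memory */
    static long shared_mem[SYNC_WORK];           /* lock-protected shared state */
    static atomic_long shared_atomic[SYNC_WORK]; /* atomically updated state */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void process_request(enum mode m)
    {
        /* Parallelizable work: per-core memory, no synchronization needed. */
        for (int i = 0; i < PAR_WORK; i++)
            percore[i]++;

        /* Synchronized work, performed in one of three ways. */
        for (int i = 0; i < SYNC_WORK; i++) {
            switch (m) {
            case MODE_REPLICATED:
                /* (a) replicated: every core applies the update to its own
                 * copy, so no inter-core synchronization occurs at all */
                percore[PAR_WORK + i]++;
                break;
            case MODE_LOCKED:
                /* (b) update shared memory while holding a shared lock */
                pthread_mutex_lock(&lock);
                shared_mem[i]++;
                pthread_mutex_unlock(&lock);
                break;
            case MODE_ATOMIC:
                /* (c) atomic read-modify-write on shared memory */
                atomic_fetch_add(&shared_atomic[i], 1);
                break;
            }
        }
    }

A driver (not shown) would pin one thread per core, call process_request() in a loop for each mode, and report requests per second, which is what Figure 1(b) plots.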

[Figure 1: Architecture and basic performance of a replicated-work parallelization approach. (a) Replicated-work parallelization architecture: incoming and outgoing request channels feed per-CPU domains, each with its own channel subscription. (b) Comparison of lock, atomic instruction, and replicated-work parallelization approaches (requests per second vs. number of threads).]

Figure 1(b) shows the potential performance advantages of this approach using a 10:1 ratio of parallelizable to replicated work; we chose this ratio based on our studies of production TCP/IP stacks. As can be seen, the lock-based model scales remarkably poorly and the atomic instruction approach is only slightly better. The replicated work approach, in contrast, scales well to 8 cores, the limit of the 2x4-core Intel Xeon machine on which we tested.

3.2 Consistency Management

While this approach avoids the inter-core synchronization and communication costs of other approaches, it does introduce consistency management problems. Specifically, domains that subscribe to multiple broadcast channels, for example separate channels for incoming and outgoing packets, are by default PRAM consistent—they see requests on the same channel in the same order, but may see different interleavings of the requests from the different channels.

If these consistency problems are not handled properly, inconsistency between replicas could result in both poor performance and incorrect behavior. In the networking context, for example, a replica could dequeue an incoming acknowledgement for a packet sent by another replica before it dequeues the request that generated the packet being acknowledged from the outgoing request channel.

There are a variety of approaches to handling these consistency management problems. The simplest solution is for each domain to route all requests through a single broadcast channel that supports multiple producers, resulting in sequential consistency between replicas—all replicas see the same requests in the same order—at the cost of extra synchronization when enqueuing requests on channels. A wide range of other well-known distributed consistency and message ordering techniques are also potentially applicable.

In addition, addressing consistency problems is in many ways simpler for network services than for other services. In particular, inconsistency between replicas in network services results in anomalous protocol behavior, and many network services must be able to tolerate unexpected protocol behaviors due to unusual network conditions. As a result, consistency management between replicas for these services need only focus on inconsistency that results in non-recoverable behavior (e.g. spurious connection resets) or that significantly impacts performance and scalability.

We are taking an approach where each service manages its own consistency demands in three distinct ways. Most coarsely, each service implementation will determine how to use channels to route requests to domains, allowing it to determine which requests are sequentially consistent and which are PRAM consistent. At a finer granularity, each domain implements a channel scheduler which chooses the channel from which to dequeue the next request, based on examination of the head of each channel; this allows it to implement service-specific scheduling rules, such as: "process outgoing requests required to fill the congestion control window prior to processing incoming packets".
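Concretely, a two-channel scheduler implementing a simplified version of that rule might look like the following sketch; channel_peek(), channel_next(), and the request tagging are assumed for illustration and are not part of any existing API.

    #include <stddef.h>

    struct channel;                      /* opaque broadcast channel */
    struct request { int type; };        /* e.g. REQ_INCOMING or REQ_OUTGOING */

    /* Hypothetical API: examine (without removing) the request at this
     * replica's head of a channel; NULL if the channel is empty. */
    struct request *channel_peek(struct channel *ch, int replica);
    struct request *channel_next(struct channel *ch, int replica);

    /* Choose which channel to service next. This simplified rule always
     * prefers pending outgoing requests (e.g. to fill the congestion
     * control window) over incoming packets. */
    struct request *schedule_next(struct channel *incoming,
                                  struct channel *outgoing, int replica)
    {
        if (channel_peek(outgoing, replica) != NULL)
            return channel_next(outgoing, replica);
        if (channel_peek(incoming, replica) != NULL)
            return channel_next(incoming, replica);
        return NULL;                     /* nothing pending on either channel */
    }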

3.3 Scheduling

Using a bounded broadcast channel to provide MISD parallelism, as we propose, also raises a variety of inter-domain scheduling issues. In particular, domains that lag behind in processing requests can cause the channel to fill up, preventing new requests from being enqueued and slowing service. However, well-known gang scheduling and load balancing techniques can be used to address such issues. In addition, because every replica processes the same set of requests, it should be possible for a lagging replica to roll forward to the state of another replica that periodically saves a snapshot of its state in shared memory.

3.4 Programmability

The new replicated work execution model described above also presents programmability challenges, an important factor to consider in new system software designs. This execution model, like a purely distributed one, avoids the need for complex locking protocols and read-copy-update mechanisms. It does, however, require replicas to segregate program execution into work that must be executed on every request (to maintain consistency) and work that must be executed only when the replica is the primary for a request.

We believe that an event-based programming model [12] is ideal for programming such services. Event-based programming provides a simple mechanism for directing execution based on whether or not the replica is the primary for a request (e.g. by raising separate events), and for reusing shared code between primary and non-primary processing paths. In addition, MISD-style execution preserves the event atomicity expected in event-based systems, and event-based systems have been successfully used in a number of related distributed systems projects [8].

4 Implementation

We have begun implementing our proposed approach in a framework called Dominoes. As a first step, we have implemented a replicated ring buffer channel to broadcast requests to multiple domains. We have also begun porting network stacks to this framework.

The basic replicated ring buffer channel is a traditional single-producer/single-consumer ring buffer augmented with per-replica head pointers and a global per-entry reference count updated using an atomic decrement instruction. Upon dequeuing an entry, a replica advances its local head pointer and atomically decrements the entry's reference count. If the updated reference count is one less than the total number of replicas, the replica knows that it is the first to dequeue the request and is therefore responsible for fully processing it. On the other hand, if the updated reference count is zero, the replica knows that it is the last to dequeue and it can advance the global head pointer, thereby freeing space in the ring buffer for new entries. We have also implemented a version that allows for multiple producers as opposed to a single producer.
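The dequeue path of this data structure is compact enough to sketch in C11 atomics. This is a minimal single-producer illustration of the scheme just described, with assumed field names and a fixed replica count; the Dominoes implementation may differ in detail.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define RING_SLOTS 1024
    #define NREPLICAS  4

    struct entry {
        void      *request;    /* request being broadcast to all replicas */
        atomic_int refcount;   /* set to NREPLICAS when the entry is enqueued */
    };

    struct channel {
        struct entry ring[RING_SLOTS];
        atomic_uint  tail;                  /* next slot to fill (one producer) */
        atomic_uint  global_head;           /* oldest entry not yet fully freed */
        unsigned     local_head[NREPLICAS]; /* per-replica dequeue positions
                                               (cache-line padding omitted) */
    };

    /* Dequeue the next entry for `replica`. Returns the request, or NULL if
     * this replica has consumed everything; *primary reports whether this
     * replica was the first to dequeue the entry. */
    void *channel_dequeue(struct channel *ch, int replica, bool *primary)
    {
        unsigned pos = ch->local_head[replica];
        if (pos == atomic_load(&ch->tail))
            return NULL;                       /* nothing new for this replica */

        struct entry *e = &ch->ring[pos % RING_SLOTS];
        void *req = e->request;                /* read before releasing entry */
        ch->local_head[replica] = pos + 1;     /* advance local head pointer */

        /* The atomic decrement both counts dequeues and identifies the
         * primary: the first dequeuer sees NREPLICAS - 1 remaining. */
        int remaining = atomic_fetch_sub(&e->refcount, 1) - 1;
        *primary = (remaining == NREPLICAS - 1);

        if (remaining == 0)                    /* last dequeuer frees the slot */
            atomic_store(&ch->global_head, pos + 1);

        return req;
    }

The enqueue side (not shown) waits while the tail would overrun global_head, stores the request, sets the reference count to NREPLICAS, and then advances tail.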
[Figure 2: Replicated ring buffer channel performance on a 2x4-core 2.4GHz AMD Shanghai system: per-replica and aggregate request throughput (requests/sec) vs. number of subscribing cores.]

Figure 2 shows the per-core and aggregate dequeue rates from a channel based on this data structure on a 2x4-core 2.4GHz AMD Shanghai system, where one core is dedicated to enqueueing elements and some number of the remaining cores dequeue elements. Even with 1500-byte packets, this service rate is sufficient at 4 replicas to handle data rates well in excess of 10 Gbps. In addition, more sophisticated reference counting techniques than the one used by this implementation (e.g. sloppy counters [3]) could also potentially improve the scalability of this implementation.

In addition to this basic framework construction, we have also begun porting the Scout [10] and the CTP [4] networking frameworks to Dominoes.

We have completed initial ports of the (essentially stateless) UDP and IP protocols, where preliminary scaling results are encouraging, and are currently working on a TCP protocol implementation. We have also begun adapting the Cactus framework [8] used to implement CTP to provide an event-based programming model in Dominoes, and porting basic components of CTP to the resulting system.

References

[1] A. Baumann, P. Barham, P. E. Dagand, T. Harris, R. Isaacs, S. Peter, T. Roscoe, A. Schüpbach, and A. Singhania. The multikernel: A new OS architecture for scalable multicore systems. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, pages 29–44. ACM, 2009.

[2] S. Boyd-Wickizer, H. Chen, R. Chen, Y. Mao, F. Kaashoek, R. Morris, A. Pesterev, L. Stein, M. Wu, Y. Dai, Y. Zhang, and Z. Zhang. Corey: An operating system for many cores. In Proceedings of the 8th USENIX Symposium on Operating Systems Design and Implementation, 2008.

[3] S. Boyd-Wickizer, A. T. Clements, Y. Mao, A. Pesterev, M. F. Kaashoek, R. Morris, and N. Zeldovich. An analysis of Linux scalability to many cores. In Proceedings of the 2010 USENIX Symposium on Operating System Design and Implementation, 2010.

[4] P. G. Bridges, M. A. Hiltunen, R. D. Schlichting, G. T. Wong, and M. Barrick. A configurable and extensible transport protocol. ACM/IEEE Transactions on Networking, 15(6):1254–1265, 2007.

[5] S. T. Chanson, D. W. Neufeld, and L. Liang. A bibliography on multicast and group communications. ACM Operating Systems Review, 23(4):20–25, 1989.

[6] M. Dobrescu, N. Egi, K. Argyraki, B.-G. Chun, K. Fall, G. Iannaccone, A. Knies, M. Manesh, and S. Ratnasamy. RouteBricks: Exploiting parallelism to scale software routers. In Proceedings of the 22nd ACM Symposium on Operating Systems Principles, 2009.

[7] A. Foong, T. Huff, H. Hum, J. Patwardhan, and G. Regnier. TCP performance re-visited. In Proceedings of the 2003 IEEE International Symposium on Performance Analysis of Systems and Software, pages 70–79, 2003.

[8] M. A. Hiltunen, R. D. Schlichting, X. Han, M. Cardozo, and R. Das. Real-time dependable channels: Customizing QoS attributes for distributed systems. IEEE Transactions on Parallel and Distributed Systems, 10(6):600–612, 1999.

[9] J. Lofstead, F. Zhang, S. Klasky, and K. Schwan. Adaptable, metadata rich IO methods for portable high performance IO. In Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing (IPDPS'09), Washington, DC, USA, 2009. IEEE Computer Society.

[10] D. Mosberger and L. L. Peterson. Making paths explicit in the Scout operating system. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation, pages 153–168, 1996.

[11] E. M. Nahum, D. J. Yates, J. F. Kurose, and D. Towsley. Performance issues in parallelized network protocols. In Proceedings of the First Symposium on Operating Systems Design and Implementation, November 1994.

[12] J. K. Ousterhout. Why threads are a bad idea (for most purposes). In 1996 USENIX Technical Conference, 1996. Invited talk.

[13] D. Qiu and R. Srikant. Modeling and performance analysis of BitTorrent-like peer-to-peer networks. In Proceedings of the 2004 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, pages 367–378. ACM Press, New York, 2004.

[14] P. Willmann, S. Rixner, and A. L. Cox. An evaluation of network stack parallelization strategies in modern operating systems. In Proceedings of the USENIX Annual Technical Conference, pages 91–96, June 2006.
