
Exploiting MISD Performance Opportunities in Multi-core Systems

Patrick G. Bridges, Donour Sizemore, and Scott Levy∗
Department of Computer Science, University of New Mexico, Albuquerque, NM 87131

Abstract

A number of important system services, particularly network system services, need strong scaling on multi-core systems, where a fixed workload is executed more quickly with increased core counts. Unfortunately, modern multiple-instruction/multiple-data (MIMD) approaches to multi-core OS design cannot exploit the fine-grained parallelism needed to provide such scaling. In this paper, we propose a replicated work approach to parallelizing network system services for multi-core systems based on a multiple-instruction/single-data (MISD) execution model. In particular, we discuss the advantages of this approach for scalable network system services and we compare past methods for addressing the challenges this approach presents. We also present preliminary results that examine the viability of the basic approach and the software abstractions needed to support it.

∗This work was supported in part by gifts from Sun and Intel Corporations, a faculty sabbatical appointment at Sandia National Laboratories, and a grant from the DOE Office of Science, Advanced Scientific Computing Research, under award number DE-SC0005050, program manager Sonia Sachs.

1 Introduction

A number of important system services need strong scaling, where a fixed workload is executed more quickly with increased core counts. This is particularly true for data-intensive, network-oriented system services because network interface card (NIC) line rates continue to increase but the individual cores that service them are not getting any faster. In cases like this, multiple cores must closely coordinate activities, and multi-core system designs that rely on only intermittent synchronization will bottleneck on inter-core synchronization. As a consequence, the resulting system performance can be disappointing.

Recent research projects on multi-core system software designs propose a distributed systems-oriented approach, where data structures are replicated or segregated between processors [1, 2]. These designs use a multiple-instruction/multiple-data (MIMD) parallel execution model, and generally focus on services for which coarse-grained parallelism is sufficient. Because of this, such systems generally provide only weak scaling, where more cores can be used to handle a larger workload (e.g. more TCP/IP connections) but cannot make an existing workload (e.g. a single TCP/IP connection) run faster.

In this paper, we propose a replicated work approach based on a multiple-instruction/single-data (MISD) execution model to take advantage of unexploited parallelization opportunities in multi-core systems. This approach is particularly important in the network stack, as it is needed to enable strongly scalable network services. We begin by discussing example network applications and related system services that require strong scaling. We then discuss a replicated work execution model for building strongly scalable network system services based on MISD parallelism, along with specific challenges that this execution model presents, including consistency, scheduling, and programmability. Finally, we discuss initial experiments we have conducted on this approach and the infrastructure to support it.

2 Motivating Applications and Services

The combination of increasing network card line rates and plateauing single-core CPU speeds is driving the demand for scalable network services. The most obvious motivating example of this is high-speed network connectivity over TCP-based network connections.

Traditional operating systems have difficulty saturating a 10Gbps Ethernet NIC using a single TCP connection [14] without complex hardware assistance such as TCP offload engines. However, such hardware solutions present well-known challenges to system software (e.g. packet filtering, security, portability).

Past work has shown that providing scalable network connections in software is quite challenging [11, 14]. Perhaps as a result, recent work on scaling TCP/IP network stacks has focused primarily on scaling the number of connections or flows that can be supported rather than on the data rate of individual connections. Corey, for example, uses a private TCP network stack on each core [3], and RouteBricks segregates each IP flow to its own core [6]. These approaches allow systems to scale to handle a large number of connections or flows, but limit the data rate achievable on each flow to well below the throughput of modern network cards.

One frequently attempted workaround to these problems is to use multiple parallel connections for these applications, turning a strong scaling problem into a weak scaling one. Unfortunately, this approach has numerous drawbacks. Most importantly, such approaches can break existing network resource allocation systems [13]. It also places a significant burden on application and library programmers for managing data ordering and reliability.

Without a scalable network service, a wide range of applications, such as high-bandwidth file systems and content distribution systems, become difficult to deploy. For example, the performance of middleware systems that buffer, aggregate, and analyze data as it is streamed between data warehouse systems (e.g. Netezza, LexisNexis) and high-end supercomputing systems (e.g. ADIOS [9]) depends upon the availability of scalable network services.

3 Building Strongly Scalable System Services

The primary challenge in building strongly scalable networked system services is handling shared state. Current MIMD-based approaches to parallelizing system services for multi-core systems rely on relatively expensive synchronization mechanisms (e.g. messages, locks) that prevent the exploitation of fine-grained parallelism. TCP/IP, for example, includes important shared state (e.g. sliding window state) that must be updated on each packet sent or received. TCP also includes elements that can potentially be parallelized, for example data delivery, acknowledgment generation, and timer management. Unfortunately, explicit or implicit inter-core communication costs are too large compared to packet processing times to be successfully amortized away.

3.1 MISD-Based Parallelism

We propose using a multiple-instruction/single-data (MISD) execution model (i.e. replicated work) to provide fine-grained data consistency between cores and to eliminate expensive inter-core interactions. In this approach, shown in Figure 1(a), state is replicated into domains on separate cores, and requests that modify shared state are broadcast to every domain using a ringbuffer-based channel abstraction. The first replica that dequeues a request becomes the primary replica for that request and is responsible for fully processing it, including any updates with globally visible side effects (e.g. data delivery, packet reconstruction, acknowledgment generation). Other replicas that dequeue the request will, on the other hand, only partially process each request to maintain state consistency.

This approach is particularly appropriate for many network services because it parallelizes expensive per-byte processing (e.g. data copies, encryption/decryption, and error correction) and replicates per-header processing costs that are known to be small [7]. It also retains the reduced locking and caching costs of systems like Barrelfish [1] while adding the ability to perform fine-grained updates to logically shared state. Finally, we note that this approach is similar to the execution model used in many group RPC systems [5].
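In outline, each domain's main loop dequeues broadcast requests and branches on whether it is the primary for that request. The following C sketch illustrates this split; the channel type, channel_dequeue(), and the two handler names are hypothetical stand-ins, not Dominoes APIs.

    #include <stdbool.h>
    #include <stddef.h>

    struct channel;                           /* opaque broadcast channel */

    /* Hypothetical API: returns the next request broadcast on the channel,
     * or NULL if none is ready; *primary is set if this replica was the
     * first to dequeue the request. */
    void *channel_dequeue(struct channel *ch, int replica, bool *primary);

    void process_fully(void *req);       /* all globally visible side effects */
    void update_local_state(void *req);  /* consistency-maintaining updates only */

    /* Per-domain event loop for MISD-style replicated work. */
    void domain_loop(struct channel *ch, int replica)
    {
        for (;;) {
            bool primary = false;
            void *req = channel_dequeue(ch, replica, &primary);
            if (req == NULL)
                continue;                /* or block until a request arrives */

            if (primary)
                process_fully(req);      /* e.g. deliver data, generate ACKs */
            else
                update_local_state(req); /* keep this replica's state consistent */
        }
    }

Because non-primary processing touches only the domain's local copy of shared state, no locks are required on either path.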
To examine the potential for this approach compared to lock-based and atomic instruction-based MIMD approaches, we constructed a simple synthetic test. In this test, processing each request requires some amount of work that can be done by one core without synchronization or replication (i.e. parallelizable work), and some work that must be synchronized or replicated. Specifically, the parallelizable work is a set of memory updates done on per-core memory, while the synchronized work is the set of memory updates that must be performed: (a) on every replica; (b) while holding a shared lock; or (c) by using atomic memory update instructions.
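A condensed version of such a test might look like the following sketch; the 10:1 work ratio matches the experiment described below, but the array sizes, counters, and mode names are illustrative assumptions.

    #include <pthread.h>
    #include <stdatomic.h>

    #define PAR_WORK  10          /* parallelizable updates per request */
    #define SYNC_WORK 1           /* synchronized updates per request (10:1) */

    enum mode { MODE_REPLICATED, MODE_LOCKED, MODE_ATOMIC };

    static _Thread_local long percore[PAR_WORK + SYNC_WORK]; /* per-core memory */
    static long shared_mem[SYNC_WORK];           /* lock-protected shared state */
    static atomic_long shared_atomic[SYNC_WORK]; /* atomically updated state */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void process_request(enum mode m)
    {
        /* Parallelizable work: per-core memory, no synchronization needed. */
        for (int i = 0; i < PAR_WORK; i++)
            percore[i]++;

        /* Synchronized work, performed in one of three ways. */
        for (int i = 0; i < SYNC_WORK; i++) {
            switch (m) {
            case MODE_REPLICATED:
                /* (a) replicated: every core applies the update to its own
                 * copy, so no inter-core synchronization occurs at all */
                percore[PAR_WORK + i]++;
                break;
            case MODE_LOCKED:
                /* (b) update shared memory while holding a shared lock */
                pthread_mutex_lock(&lock);
                shared_mem[i]++;
                pthread_mutex_unlock(&lock);
                break;
            case MODE_ATOMIC:
                /* (c) atomic read-modify-write on shared memory */
                atomic_fetch_add(&shared_atomic[i], 1);
                break;
            }
        }
    }

A driver (not shown) would pin one thread per core, call process_request() in a loop for each mode, and report requests per second, which is what Figure 1(b) plots.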

[Figure 1: Architecture and basic performance of a replicated-work parallelization approach. (a) Replicated-work parallelization architecture: incoming and outgoing request channels feed per-CPU domains, each with its own channel subscription. (b) Comparison of lock, atomic instruction, and replicated-work parallelization approaches (requests per second vs. number of threads).]

Figure 1(b) shows the potential performance advantages of this approach using a 10:1 ratio of parallelizable to replicated work; we chose this ratio based on our studies of production TCP/IP stacks. As can be seen, the lock-based model scales remarkably poorly and the atomic instruction approach is only slightly better. The replicated work approach, in contrast, scales well to 8 cores, the limit of the 2x4-core Intel Xeon machine on which we tested.

3.2 Consistency Management

While this approach avoids the inter-core synchronization and communication costs of other approaches, it does introduce consistency management problems. Specifically, domains that subscribe to multiple broadcast channels, for example separate channels for incoming and outgoing packets, are by default PRAM consistent—they see requests on the same channel in the same order, but may see different interleavings of the requests from the different channels.

If these consistency problems are not handled properly, inconsistency between replicas could result in both poor performance and incorrect behavior. In the networking context, for example, a replica could dequeue an incoming acknowledgement for a packet sent by another replica before it dequeues the request that generated the packet being acknowledged from the outgoing request channel.

There are a variety of approaches to handling these consistency management problems. The simplest solution is for each domain to route all requests through a single broadcast channel that supports multiple producers, resulting in sequential consistency between replicas—all replicas see the same requests in the same order—at the cost of extra synchronization when enqueuing requests on channels. A wide range of other well-known distributed consistency and message ordering techniques are also potentially applicable.

In addition, addressing consistency problems is in many ways simpler for network services than for other services. In particular, inconsistency between replicas in network services results in anomalous protocol behavior, and many network services must be able to tolerate unexpected protocol behaviors due to unusual network conditions. As a result, consistency management between replicas for these services need only focus on inconsistency that results in non-recoverable behavior (e.g. spurious connection resets) or that significantly impacts performance and scalability.

We are taking an approach where each service manages its own consistency demands in three distinct ways. Most coarsely, each service implementation will determine how to use channels to route requests to domains, allowing it to determine which requests are sequentially consistent and which are PRAM consistent. At a finer granularity, each domain implements a channel scheduler which chooses the channel from which to dequeue the next request, based on examination of the head of each channel; this allows it to implement service-specific scheduling rules, such as: "process outgoing requests required to fill the congestion control window prior to processing incoming packets".
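Concretely, a two-channel scheduler implementing a simplified version of that rule might look like the following sketch; channel_peek(), channel_next(), and the request tagging are assumed for illustration and are not part of any existing API.

    #include <stddef.h>

    struct channel;                      /* opaque broadcast channel */
    struct request { int type; };        /* e.g. REQ_INCOMING or REQ_OUTGOING */

    /* Hypothetical API: examine (without removing) the request at this
     * replica's head of a channel; NULL if the channel is empty. */
    struct request *channel_peek(struct channel *ch, int replica);
    struct request *channel_next(struct channel *ch, int replica);

    /* Choose which channel to service next. This simplified rule always
     * prefers pending outgoing requests (e.g. to fill the congestion
     * control window) over incoming packets. */
    struct request *schedule_next(struct channel *incoming,
                                  struct channel *outgoing, int replica)
    {
        if (channel_peek(outgoing, replica) != NULL)
            return channel_next(outgoing, replica);
        if (channel_peek(incoming, replica) != NULL)
            return channel_next(incoming, replica);
        return NULL;                     /* nothing pending on either channel */
    }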

3.3 Scheduling

Using a bounded broadcast channel to provide MISD parallelism, as we propose, also raises a variety of inter-domain scheduling issues. In particular, domains that lag behind in processing requests can cause the channel to fill up, preventing new requests from being enqueued and slowing service. However, well-known gang scheduling and load balancing techniques can be used to address such issues. In addition, because every replica processes the same set of requests, it should be possible for a lagging replica to roll forward to the state of another replica that periodically saves a snapshot of its state in shared memory.

3.4 Programmability

The new replicated work execution model described above also presents programmability challenges, an important factor to consider in new system software designs. This execution model, like a purely distributed one, avoids the need for complex locking protocols and read-copy-update mechanisms. It does, however, require replicas to segregate program execution into work that must be executed on every request (to maintain consistency) and work that must be executed only when the replica is the primary for a request.

We believe that an event-based programming model [12] is ideal for programming such services. Event-based programming provides a simple mechanism for directing execution based on whether or not the replica is the primary for a request (e.g. by raising separate events), and for reusing shared code between primary and non-primary processing paths. In addition, MISD-style execution preserves the event atomicity expected in event-based systems, and event-based systems have been successfully used in a number of related distributed systems projects [8].

4 Implementation

We have begun implementing our proposed approach in a framework called Dominoes. As a first step, we have implemented a replicated ring buffer channel to broadcast requests to multiple domains. We have also begun porting network stacks to this framework.

The basic replicated ring buffer channel is a traditional single-producer/single-consumer ring buffer augmented with per-replica head pointers and a global per-entry reference count updated using an atomic decrement instruction. Upon dequeuing an entry, a replica advances its local head pointer and atomically decrements the entry's reference count. If the updated reference count is one less than the total number of replicas, the replica knows that it is the first to dequeue the request and is therefore responsible for fully processing it. On the other hand, if the updated reference count is zero, the replica knows that it is the last to dequeue and it can advance the global head pointer, thereby freeing space in the ring buffer for new entries. We have also implemented a version that allows for multiple producers as opposed to a single producer.
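The dequeue path of this data structure is compact enough to sketch in C11 atomics. This is a minimal single-producer illustration of the scheme just described, with assumed field names and a fixed replica count; the Dominoes implementation may differ in detail.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define RING_SLOTS 1024
    #define NREPLICAS  4

    struct entry {
        void      *request;    /* request being broadcast to all replicas */
        atomic_int refcount;   /* set to NREPLICAS when the entry is enqueued */
    };

    struct channel {
        struct entry ring[RING_SLOTS];
        atomic_uint  tail;                  /* next slot to fill (one producer) */
        atomic_uint  global_head;           /* oldest entry not yet fully freed */
        unsigned     local_head[NREPLICAS]; /* per-replica dequeue positions
                                               (cache-line padding omitted) */
    };

    /* Dequeue the next entry for `replica`. Returns the request, or NULL if
     * this replica has consumed everything; *primary reports whether this
     * replica was the first to dequeue the entry. */
    void *channel_dequeue(struct channel *ch, int replica, bool *primary)
    {
        unsigned pos = ch->local_head[replica];
        if (pos == atomic_load(&ch->tail))
            return NULL;                       /* nothing new for this replica */

        struct entry *e = &ch->ring[pos % RING_SLOTS];
        void *req = e->request;                /* read before releasing entry */
        ch->local_head[replica] = pos + 1;     /* advance local head pointer */

        /* The atomic decrement both counts dequeues and identifies the
         * primary: the first dequeuer sees NREPLICAS - 1 remaining. */
        int remaining = atomic_fetch_sub(&e->refcount, 1) - 1;
        *primary = (remaining == NREPLICAS - 1);

        if (remaining == 0)                    /* last dequeuer frees the slot */
            atomic_store(&ch->global_head, pos + 1);

        return req;
    }

The enqueue side (not shown) waits while the tail would overrun global_head, stores the request, sets the reference count to NREPLICAS, and then advances tail.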
[Figure 2: Replicated ring buffer channel performance on a 2x4-core 2.4GHz AMD Shanghai system: per-replica and aggregate request throughput (requests/sec) vs. number of subscribing cores.]

Figure 2 shows the per-core and aggregate dequeue rates from a channel based on this data structure on a 2x4-core 2.4GHz AMD Shanghai system, where one core is dedicated to enqueueing elements and some number of the remaining cores dequeue elements. Even with 1500-byte packets, this service rate is sufficient at 4 replicas to handle data rates well in excess of 10 Gbps. In addition, more sophisticated reference counting techniques than the one used by this implementation (e.g. sloppy counters [3]) could also potentially improve the scalability of this implementation.

In addition to this basic framework construction, we have also begun porting the Scout [10] and the CTP [4] networking frameworks to Dominoes.

We have completed initial ports of the (essentially stateless) UDP and IP protocols, where preliminary scaling results are encouraging, and are currently working on a TCP protocol implementation. We have also begun adapting the Cactus framework [8] used to implement CTP to provide an event-based programming model in Dominoes, and porting basic components of CTP to the resulting system.

References

[1] A. Baumann, P. Barham, P. E. Dagand, T. Harris, R. Isaacs, S. Peter, T. Roscoe, A. Schüpbach, and A. Singhania. The multikernel: A new OS architecture for scalable multicore systems. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, pages 29–44. ACM, 2009.

[2] S. Boyd-Wickizer, H. Chen, R. Chen, Y. Mao, F. Kaashoek, R. Morris, A. Pesterev, L. Stein, M. Wu, Y. Dai, Y. Zhang, and Z. Zhang. Corey: An operating system for many cores. In Proceedings of the 8th USENIX Symposium on Operating Systems Design and Implementation, 2008.

[3] S. Boyd-Wickizer, A. T. Clements, Y. Mao, A. Pesterev, M. F. Kaashoek, R. Morris, and N. Zeldovich. An analysis of Linux scalability to many cores. In Proceedings of the 2010 USENIX Symposium on Operating System Design and Implementation, 2010.

[4] P. G. Bridges, M. A. Hiltunen, R. D. Schlichting, G. T. Wong, and M. Barrick. A configurable and extensible transport protocol. ACM/IEEE Transactions on Networking, 15(6):1254–1265, 2007.

[5] S. T. Chanson, D. W. Neufeld, and L. Liang. A bibliography on multicast and group communications. ACM Operating Systems Review, 23(4):20–25, 1989.

[6] M. Dobrescu, N. Egi, K. Argyraki, B.-G. Chun, K. Fall, G. Iannaccone, A. Knies, M. Manesh, and S. Ratnasamy. RouteBricks: Exploiting parallelism to scale software routers. In Proceedings of the 22nd ACM Symposium on Operating Systems Principles, 2009.

[7] A. Foong, T. Huff, H. Hum, J. Patwardhan, and G. Regnier. TCP performance re-visited. In Proceedings of the 2003 IEEE International Symposium on Performance Analysis of Systems and Software, pages 70–79, 2003.

[8] M. A. Hiltunen, R. D. Schlichting, X. Han, M. Cardozo, and R. Das. Real-time dependable channels: Customizing QoS attributes for distributed systems. IEEE Transactions on Parallel and Distributed Systems, 10(6):600–612, 1999.

[9] J. Lofstead, F. Zhang, S. Klasky, and K. Schwan. Adaptable, metadata rich IO methods for portable high performance IO. In Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing (IPDPS'09), Washington, DC, USA, 2009. IEEE Computer Society.

[10] D. Mosberger and L. L. Peterson. Making paths explicit in the Scout operating system. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation, pages 153–168, 1996.

[11] E. M. Nahum, D. J. Yates, J. F. Kurose, and D. Towsley. Performance issues in parallelized network protocols. In Proceedings of the First Symposium on Operating Systems Design and Implementation, November 1994.

[12] J. K. Ousterhout. Why threads are a bad idea (for most purposes). In 1996 USENIX Technical Conference, 1996. Invited talk.

[13] D. Qiu and R. Srikant. Modeling and performance analysis of BitTorrent-like peer-to-peer networks. In Proceedings of the 2004 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, pages 367–378. ACM Press, New York, 2004.

[14] P. Willmann, S. Rixner, and A. L. Cox. An evaluation of network stack parallelization strategies in modern operating systems. In Proceedings of the USENIX Annual Technical Conference, pages 91–96, June 2006.
