The Power of Indirection: Achieving Multicast Scalability by Mapping Groups to Regional Underlays
Krzysztof Ostrowski, Ken Birman, Amar Phanishayee
Cornell University

Abstract

Reliable multicast is a powerful primitive, useful for data replication, event notification (publish-subscribe), fault tolerance, and other purposes. Yet many of the most interesting applications give rise to huge numbers of heavily overlapping groups, some of which may be large. Existing multicast systems scale poorly in one or both respects. We propose the QuickSilver Scalable Multicast protocol (QSM), a novel solution that delivers performance almost independent of the number of groups and introduces new mechanisms that scale well in the number of nodes, with minimal performance and delay penalties when loss occurs. Key to the solution is a level of indirection: a mapping of groups to regions of group overlap, in which communication associated with different protocols can be merged. The core of QSM is a new regional multicast protocol that offers scalability and performance benefits over a wide range of region sizes.

[Figure 1: Scalability of multicast rate (in packets/s) with the number of groups, comparing QSM and JGroups for 1 to 512 groups. All groups completely overlap on the same 25 members. A single source sends messages in all the different groups in a round-robin fashion.]

1 Introduction

1.1 Motivation

In this paper we report on a new multicast substrate, QSM, targeting very large deployments of client computers that communicate using publish-subscribe or event notification architectures. We assume that there may be thousands of users, many running Windows, distributed over a WAN. For the present paper, our goal is simply to support the highest possible throughput and "best effort" reliability. In future work, we'll extend QSM with a stackable protocol extension architecture supporting protocols that customize reliability or other properties on a per-group (e.g., per-topic) basis. (In other work, on the Tempest system and its Ricochet multicast protocol, our group is looking at clustered applications with time-critical behavior; that architecture is completely different.)

Existing reliable multicast support for multiple groups falls into two categories: "lightweight group" solutions, which map application groups into broadcasts in an underlying group spanning all receivers and then filter at the receivers to drop unwanted messages, and solutions that run a separate protocol for each group independently.

Lightweight groups work best in smaller systems: with scale, data rates in the underlying group become excessive, receivers are presented with huge numbers of undesired packets that waste resources and fill up receive buffers, and the overload triggers increased packet loss. The Spread system [1] works around this by using lightweight groups in combination with an agent architecture: clients relay multicasts to a small group of agents; the agents broadcast each message, filter on reception, and relay matching messages back to the clients. The approach, however, introduces two extra message hops, and the agents experience load linear in the system size and multicast rate.
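To make the receiver-side cost concrete, the sketch below models filtering in a lightweight-group design. It is a minimal illustration under assumed names (Message, Receiver, on_broadcast), not code from Spread or QSM: every node receives every message sent in the underlying group, so the work of inspecting and discarding unwanted packets grows with total system traffic rather than with a node's own subscriptions.

```python
# Minimal sketch of receiver-side filtering in a "lightweight group"
# design; all names are illustrative, not taken from Spread or QSM.
from dataclasses import dataclass

@dataclass
class Message:
    group: str       # application-level group the sender addressed
    payload: bytes

class Receiver:
    def __init__(self, subscriptions: set[str]):
        self.subscriptions = subscriptions
        self.dropped = 0  # unwanted packets that still cost transport and CPU

    def on_broadcast(self, msg: Message) -> None:
        # The underlying group delivers every message to every node;
        # filtering happens only after the packet was received and buffered.
        if msg.group in self.subscriptions:
            self.deliver(msg)
        else:
            self.dropped += 1  # wasted work scales with total traffic

    def deliver(self, msg: Message) -> None:
        print(f"{msg.group}: {msg.payload!r}")

# A node subscribed to 2 groups still inspects traffic from all groups.
r = Receiver({"quotes.tech", "quotes.energy"})
r.on_broadcast(Message("quotes.tech", b"bid 191.20"))
r.on_broadcast(Message("quotes.retail", b"bid 58.90"))  # filtered on arrival
print("dropped:", r.dropped)
```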
Running a separate protocol for every group brings different issues. Such an approach incurs overhead linear in the number of groups for ACKs, NAKs and other control traffic. Moreover, contention between groups for communication resources, both within individual nodes and on the wire, emerges as an issue. To quantify such effects, we measured the performance of JGroups [2], a popular communication package that supports multiple groups. We configured JGroups to provide only weak reliability guarantees. Nonetheless, as shown in Figure 1, the overhead associated with running multiple protocols in JGroups decreases the achievable aggregate throughput by 6% with 2 groups, 20% with 32 groups, and almost 60% with 256 groups.

Here, we explore a third option. QSM runs per-group protocols over a lower layer that implements protocols on a per-region basis. Regions are designed to have a fairly regular structure, and this lets us innovate in the regional protocols. The performance so obtained is strong across the board, even with very large numbers of groups. As seen in Figure 1 and explored further below, we also achieve low latency and excellent scalability in group size.

[Figure 2: Left: a pattern of overlapping groups. Right: regions of overlap.]

[Figure 3: Left: in a typical system, nodes that belong to multiple groups participate in multiple protocols. Right: our system splits protocols into two levels: a single "bottom half" per region, and a per-group "upper half".]

1.2 Group Overlap and its Implications

Our goals emerge from communication patterns seen in demanding real-world multicast systems. One of us developed the multicast technology used in the current New York and Swiss Stock Exchange systems, the French Air Traffic Control System, and the US Navy AEGIS warship [4], and we are in dialog with developers of very large data centers, such as Amazon, Google, Yahoo! and Lockheed Martin. If developers of such systems are forced to work over an unreliable event notification or publish-subscribe architecture, they typically implement stronger properties by hand, a difficult, error-prone and inefficient approach. In contrast, by treating these applications in terms of multicast to large numbers of groups, we take a major step towards offering customizable communication properties and protocol stacks on a per-group basis. This opens the door to a whole class of applications that previously could not be supported.

We assume a system with a large number of nodes, each belonging to a number of potentially overlapping groups (left side of Figure 2). Nodes multicast asynchronously (without waiting for replies) and may do so at a high rate; a single node may multicast to multiple groups concurrently. We want to optimize for the highest possible throughput while keeping latency reasonably low, and to achieve a simple reliability property similar to that of systems such as SRM or RMTP. Specifically, the system should attempt to deliver every pending message still buffered at the sender or another node to all nodes that have not crashed or left the group to which the message was addressed.

If we rule out lightweight group approaches as fundamentally non-scalable, the scenario depicted in Figure 2 poses real problems for existing technologies. Since each group is operated independently, the sending node must multicast data separately in a large number of groups, perhaps using a separate IP multicast group for each. Each group will implement its own flow control and reliability protocol, hence a node that participates in many groups independently exchanges ACKs, NAKs and other control messages with its counterparts in each. Not only does traffic rise linearly in the number of groups, but the communication pattern sabotages efforts to prevent ACK and NAK implosion, which typically rely on aggregation over trees. For example, in Figure 3 (left) we've drawn multiple superimposed acknowledgement trees of the sort used in RMTP [10] and SRM [6]. Unless tree creation is coordinated across groups, each node will communicate with far more neighbors than in a single-group configuration.

1.3 Our Approach

Key to our approach is the recognition that a system with large numbers of overlapping groups typically has a much smaller number of regions of overlap. In the example of Figure 2 (right), 18 nodes have subscribed to 300 groups, but these groups overlap in just 7 regions. Regions of overlap have two important properties that we refer to as interest sharing and fate sharing, and exploit in our system.
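The indirection itself is easy to state precisely. The following minimal sketch derives regions under the assumption that a region is simply the set of nodes sharing one distinct subscription pattern; the name compute_regions is ours for illustration, not QSM's API.

```python
# Hypothetical sketch of the group-to-region mapping; not QSM source code.
from collections import defaultdict

def compute_regions(subscriptions: dict[str, set[str]]) -> dict:
    """Group nodes by identical subscription sets: each distinct
    membership pattern becomes one region of overlap."""
    regions = defaultdict(set)
    for node, groups in subscriptions.items():
        regions[frozenset(groups)].add(node)  # same pattern => same region
    return regions

# 6 nodes subscribed to 4 overlapping groups yield only 3 regions.
subs = {
    "n1": {"A", "B"}, "n2": {"A", "B"},            # region 1
    "n3": {"A", "B", "C"}, "n4": {"A", "B", "C"},  # region 2
    "n5": {"C", "D"}, "n6": {"C", "D"},            # region 3
}
regions = compute_regions(subs)
for pattern, nodes in regions.items():
    print(sorted(pattern), "->", sorted(nodes))

# A multicast to group "A" becomes one send per region covering "A"
# (2 regional sends here), not one per group protocol or per node.
print("regional sends for group A:",
      sum(1 for pattern in regions if "A" in pattern))
```

In this view, the per-group "upper half" of Figure 3 (right) runs over these regional "bottom halves": a group's traffic is the union of the traffic of the few regions that cover it.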
Interest sharing arises because nodes in a region receive the same multicast messages, which makes it natural to consider message batching and similar tactics. If regions are large, scalability becomes an issue, and it makes sense to create IP multicast groups and attempt loss recovery at a regional granularity. In the example of Figure 2, the transmitting node, instead of multicasting in 200 groups, might do so in 6, and it can pack small messages targeted at a given region into larger packets (a sketch of such packing appears below), increasing network utilization and reducing the frequency with which per-packet receiver overheads are incurred. Latency suffers, but as noted earlier, QSM isn't optimized to minimize latency.

Fate sharing captures a related observation: nodes in a region experience the same workload, suffer the same bursts of traffic, and miss the same packets dropped at the sender, which makes it desirable to implement flow control and certain parts of the loss recovery mechanism at this level. If we multicast at the granularity of regions, nodes will acknowledge the same messages and can recover missing data from one another on a peer-to-peer basis.

Designer-assisted region discovery seems entirely practical. Consider, for example, publish-subscribe systems in which subjects are mapped to groups. Each instance of a given application, operated by a similar kind of user, is likely to produce similar patterns of subscriptions and hence extensive overlap. Equities traders who trade high-tech stocks share interest in similar sets of equities; traders focused on the services sector also share interests, yet the two categories of traders have little overlap. One can easily visualize the corresponding regions. Operators of e-commerce datacenters implement farms consisting of very large numbers of scalable services by cloning each service and using multicast (or publish-subscribe) to replicate updates [5]. Since a single server may be involved in replicating many objects, sets of servers end up subscribing to large numbers of heavily overlapping groups. In effect, the regularity of the datacenter architecture makes it easy to identify regions that can be exploited by our protocols.

1.4 Technical Challenges
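As promised above, here is a minimal sketch of the per-region packing tactic from Section 1.3. The record layout, the helper names pack_batch and unpack_batch, and the 1400-byte payload budget are illustrative assumptions, not QSM's actual wire format.

```python
# Hypothetical sketch of per-region message packing; the header layout
# and MTU budget are assumptions, not QSM's wire format.
import struct

MTU = 1400  # assumed payload budget per regional packet

def pack_batch(messages):
    """messages: list of (group_id: int, payload: bytes) bound for one
    region. Returns packed regional packets, each at most MTU bytes."""
    packets, current = [], b""
    for group_id, payload in messages:
        # 6-byte record header: group id (4 bytes) + payload length (2).
        record = struct.pack("!IH", group_id, len(payload)) + payload
        if current and len(current) + len(record) > MTU:
            packets.append(current)
            current = b""
        current += record
    if current:
        packets.append(current)
    return packets

def unpack_batch(packet):
    """Inverse of pack_batch: yield (group_id, payload) records so the
    receiver can route each message to its per-group 'upper half'."""
    offset = 0
    while offset < len(packet):
        group_id, size = struct.unpack_from("!IH", packet, offset)
        offset += 6
        yield group_id, packet[offset:offset + size]
        offset += size

# 200 small messages across many groups collapse into a few regional sends.
msgs = [(g, b"x" * 50) for g in range(200)]
pkts = pack_batch(msgs)
print(f"{len(msgs)} messages -> {len(pkts)} regional packets")
assert next(unpack_batch(pkts[0]))[0] == 0
```

Each receiver in the region unpacks a batch once and hands every record to the per-group upper half it subscribes to, so a single regional packet can replace dozens of per-group sends.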