A Study of Memory Management for Web-Based Applications on Multicore Processors
Hiroshi Inoue, Hideaki Komatsu, and Toshio Nakatani
IBM Tokyo Research Laboratory
1623-14, Shimo-tsuruma, Yamato-shi, Kanagawa-ken, 242-8502, Japan
{inouehrs, komatsu, nakatani}@jp.ibm.com

Abstract

More and more server workloads are becoming Web-based. In these Web-based workloads, most of the memory objects are used only during one transaction. We study the effect of memory management approaches on the performance of such Web-based applications on two modern multicore processors. In particular, using six PHP applications, we compare a general-purpose allocator (the default allocator of the PHP runtime) and a region-based allocator, which can reduce the cost of memory management by not supporting per-object free. The region-based allocator achieves better performance for all workloads on one processor core due to its smaller memory management cost. However, when using eight cores, the region-based allocator suffers from the hidden cost of increased bus traffic, and performance is reduced for many workloads by as much as 27.2% compared to the default allocator. This is because the memory bandwidth tends to become a bottleneck in systems with multicore processors.

We propose a new memory management approach, defrag-dodging, to maximize the performance of Web-based workloads on multicore processors. In our approach, we reduce the memory management cost by avoiding defragmentation overhead in the malloc and free functions during a transaction. We found that the transactions in Web-based applications are short enough to ignore heap fragmentation, and hence the costs of the defragmentation activities in existing general-purpose allocators outweigh their benefits. By comparing our approach against the region-based approach, we show that a per-object free capability can reduce bus traffic and achieve higher performance on multicore processors. We demonstrate that our defrag-dodging approach improves the performance of all the evaluated applications on both processors by up to 11.4% and 51.5% over the default allocator and the region-based allocator, respectively.

Categories and Subject Descriptors D.3.3 [Programming Languages]: Language Constructs and Features – Dynamic storage management.

General Terms Performance, Languages.

Keywords Dynamic memory management, Region-based memory management, Scripting Language, Web-based applications.

© ACM, 2009. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in PLDI'09, June 15–20, 2009, Dublin, Ireland. http://doi.acm.org/10.1145/1543135.1542520

1. Introduction

Emerging Web-based workloads are demanding much higher throughputs to support ever-increasing client requests. When a next-generation server system is designed, it is important to be able to run such Web-based applications efficiently on multicore processors. One important characteristic of those Web-based applications is that most memory objects allocated during a transaction are transaction scoped, living only during that transaction. Therefore the runtime systems can discard these objects at once when a transaction ends.

Region-based memory management [1-4] is a well-known technique to reduce the cost of memory management by discarding many objects at once. Typically, a region-based allocator does not provide a per-object free and hence does not reuse the memory area allocated for dead objects. Instead, it discards the whole region at once to reclaim all the objects allocated in that region. For example, the Apache HTTP server uses a region-based custom allocator [5] that deallocates all of the objects allocated to serve an HTTP connection when that connection terminates.

Although region-based memory management looks ideal for managing transaction-scoped objects in Web-based applications, it does not improve, or in some cases even degrades, the performance of Web-based applications on systems with multicore processors compared to a general-purpose allocator. For example, our results showed that the region-based allocator significantly degraded the performance of Web-based applications written in PHP on a system with eight cores of Intel® Xeon® (Clovertown) processors, while it improved the performance of the same applications on one or a few processor cores. Figure 1 shows an example of the degradation in throughput. Although the region-based allocator significantly speeds up the memory management functions, it degraded the performance of the rest of the program. This degradation is due to longer access latency for system memory caused by the increased bus traffic. The region-based allocator does not support per-object free and hence cannot reuse the memory locations of dead objects. Therefore many dead objects are left in cache memory and cache pressure is increased, resulting in more cache misses and more bus traffic. The system memory bandwidth tends to become a bottleneck in systems with multicore processors [6, 7], because the rate of improvement in processing power with such multicore processors exceeds the rate of improvement in system memory bandwidth. Hence the increase in bus traffic may limit the performance of the region-based allocator on multicore processors.

[Figure 1: stacked bars of normalized CPU time per transaction (shorter is faster), split into "memory management" and "others", for the default allocator of the PHP runtime and the region-based allocator. Annotations: 11x speedup for memory management; 18% slowdown for other parts.]
Figure 1. Performance of a region-based allocator on multi-core processors (for MediaWiki on 8 Xeon cores).

Based on these findings, we present a new memory management approach for transaction-scoped objects, called defrag-dodging. With this approach, we aim to find a sweet spot between general-purpose memory management and region-based memory management. As discussed above, region-based memory management is not an optimal way to reduce the memory management cost on multicore processors.

Table 1. Summary of three allocation approaches for transaction-scoped objects including our proposed defrag-dodging approach.

type of allocators                                  | bulk free | per-object free | defragmentation | cost of malloc / free | bandwidth requirement | examples of the approach
general-purpose allocators supporting bulk freeing  | Yes       | Yes             | Yes             | high                  | low                   | default allocator of the PHP runtime [8], amalloc (arena malloc) in libc, Reaps [9]
region-based allocators                             | Yes       | No              | No              | lowest                | high                  | Apache pool allocator [5], GNU obstack [10]
our defrag-dodging                                  | Yes       | Yes             | No              | low                   | low                   | our DDmalloc

[...] compared the performance of three allocators, our DDmalloc, a region-based allocator, and the default allocator of the PHP runtime, using six Web-based applications. All three allocators support bulk freeing (with the function freeAll) for transaction-scoped objects at the end of each transaction. In addition to freeAll, the default allocator and our DDmalloc support per-object free during a transaction. Our DDmalloc avoids defragmenting the heap in the malloc and per-object free functions to reduce costs, but still maintains a free list to keep track of freed objects and reuse them in future allocations. This configuration gives the highest performance on multicore processors. We conducted experiments on two systems, one with eight Intel Xeon cores and the other with eight Sun Niagara cores. By reducing the CPU time used for memory management, our DDmalloc improved throughput by up to 11.1% on Xeon and 11.4% on Niagara compared to the default allocator of the PHP runtime using eight cores. Meanwhile, due to the increased bus traffic, the region-based allocator did not improve the throughput when using eight cores.

To compare our DDmalloc against well-known general-purpose allocators that do not support bulk freeing, we also examined the performance of the Ruby runtime for a Web-based application. We used Hoard [11] and TCmalloc [12] for the comparisons. The results showed that the cost of the defragmentation activities exceeds the benefit even in those sophisticated memory allocators. Our DDmalloc achieved higher performance than these high-performance allocators.

In this paper, we examined our defrag-dodging approach for transaction-scoped objects in Web-based applications, since multicore processors have already become so common for servers that efficient execution of Web-based server applications on multicore processors is important. However, our defrag-dodging approach is applicable to other applications that deallocate many objects in groups and where the lifetimes of those objects are short enough to ignore gradual fragmentation. Although it has received little study to date, our results show that the limited bandwidth in multicore environments is another important problem for memory allocators, in addition to such problems as lock contention and [...]

Unlike general-purpose memory allocators such as the default allocator of the PHP runtime, our approach reduces the memory management cost by skipping the defragmentation activities in malloc and free, which take a non-negligible amount of CPU time. We found that the transactions of Web-based applications are short enough to ignore the harmful effects of gradual heap fragmentation even in large Web-based applications, and hence the cost of the defragmentation activities outweighs the benefit. For example, the default allocator of the PHP runtime supports both per-object and bulk freeing, and it cleans up the heap at the end of each transaction by bulk freeing. In spite of cleaning up the heap every transaction, the default allocator pays a cost for defragmentation activities in the malloc and per-object free functions to avoid gradual performance degradation, and this cost matters for the overall performance of the workloads.

Unlike region-based memory management, our approach retains its per-object free capability even though all of the memory objects are freed at the end of a transaction.
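
The combination described in this Introduction (a free list that lets freed objects be reused, no defragmentation work in malloc or free, and bulk freeing at the end of each transaction) can be illustrated with a minimal C sketch. This is not the authors' DDmalloc implementation: the names dd_malloc, dd_free, and dd_free_all, the fixed heap size, the power-of-two size classes, and the choice to pass the object size to dd_free are all assumptions made for illustration. Only the bulk-free operation corresponds to something named in the text (freeAll).

```c
#include <assert.h>
#include <stddef.h>

#define HEAP_BYTES  (1u << 20)  /* hypothetical heap size for this sketch */
#define MIN_SHIFT   4           /* smallest size class is 16 bytes */
#define NUM_CLASSES 8           /* size classes: 16, 32, ..., 2048 bytes */

/* A freed object is reused in place as a link in its class's free list. */
typedef struct FreeNode { struct FreeNode *next; } FreeNode;

static _Alignas(16) unsigned char heap[HEAP_BYTES];
static size_t    bump;                     /* offset of the first unused byte */
static FreeNode *free_lists[NUM_CLASSES];  /* one LIFO free list per class */

/* Map a request size to a size class index, or -1 if it is too large. */
static int size_class(size_t n) {
    size_t sz = (size_t)1 << MIN_SHIFT;
    for (int c = 0; c < NUM_CLASSES; c++, sz <<= 1)
        if (n <= sz) return c;
    return -1;
}

/* Allocate: reuse a freed object of the same class if one exists,
 * otherwise carve a new one from the unused part of the heap.
 * Objects are never split or coalesced (no defragmentation work). */
void *dd_malloc(size_t n) {
    int c = size_class(n);
    if (c < 0) return NULL;  /* large objects are out of scope here */
    if (free_lists[c] != NULL) {
        FreeNode *node = free_lists[c];
        free_lists[c] = node->next;
        return node;
    }
    size_t sz = (size_t)1 << (MIN_SHIFT + c);
    if (bump + sz > HEAP_BYTES) return NULL;
    void *p = heap + bump;
    bump += sz;
    return p;
}

/* Per-object free: push the object onto its class's free list.
 * Assumption: the caller passes the size originally requested. */
void dd_free(void *p, size_t n) {
    int c = size_class(n);
    assert(p != NULL && c >= 0);
    FreeNode *node = p;
    node->next = free_lists[c];
    free_lists[c] = node;
}

/* Bulk free at the end of a transaction: forget every object at once. */
void dd_free_all(void) {
    bump = 0;
    for (int c = 0; c < NUM_CLASSES; c++)
        free_lists[c] = NULL;
}
```

In this sketch, an object freed with dd_free is handed back by the next dd_malloc of the same size class, so the same memory location is touched again instead of a fresh one. A region-based allocator would instead keep advancing through new memory until the region is discarded, which is the behavior the text associates with higher cache pressure and bus traffic.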