Performance Comparison of Uniprocessor and Multiprocessor Web Server Architectures
by

Ashif S. Harji

A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Doctor of Philosophy in Computer Science

Waterloo, Ontario, Canada, 2010

© Ashif S. Harji 2010

I hereby declare that I am the sole author of this thesis. This is a true copy of the thesis, including any required final revisions, as accepted by my examiners. I understand that my thesis may be made electronically available to the public.

Abstract

This thesis examines web-server architectures for static workloads on both uniprocessor and multiprocessor systems to determine the key factors affecting their performance. The architectures examined are event-driven (µserver) and pipeline (WatPipe). As well, a thread-per-connection (Knot) architecture is examined for the uniprocessor system. Various workloads are tested to determine their effect on the performance of the servers. Significant effort is made to ensure a fair comparison among the servers. For example, all the servers are implemented in C or C++, and support sendfile and edge-triggered epoll. The existing servers, Knot and µserver, are extended as necessary, and the new pipeline server, WatPipe, is implemented using µserver as its initial code base. Each web server is also tuned to determine its best configuration for a specific workload, which is shown to be critical for achieving the best server performance. Finally, the server experiments are verified to ensure each is performing within reasonable standards.

The performance of the various architectures is examined on a uniprocessor system. Three workloads are examined: no disk-I/O, moderate disk-I/O and heavy disk-I/O. These three workloads highlight the differences among the architectures. As expected, the experiments show the amount of disk I/O is the most significant factor in determining throughput, and once there is memory pressure, the memory footprint of the server is the crucial performance factor. The peak throughput differs by only 9–13% among the best servers of each architecture across the various workloads. Furthermore, the appropriate configuration parameters for best performance vary based on workload, and no single server performs best for all workloads. The results show the event-driven and pipeline servers have equivalent throughput when there is moderate or no disk-I/O. The only difference is during the heavy disk-I/O experiments, where WatPipe's smaller memory footprint for its blocking server gives it a performance advantage. The Knot server has 9% lower throughput for no disk-I/O and moderate disk-I/O and 13% lower for heavy disk-I/O, reflecting the extra overheads incurred by a thread-per-connection server, but its performance remains close to the other architectures. An unexpected result is that blocking sockets with sendfile outperform non-blocking sockets with sendfile when there is heavy disk-I/O, because of more efficient disk access.
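For readers unfamiliar with these mechanisms, the following minimal C sketch illustrates the two facilities all the compared servers support: registering a non-blocking connection for edge-triggered epoll events, and transmitting a file with zero-copy sendfile. It is illustrative only, not code from Knot, µserver or WatPipe, and all identifiers are hypothetical.

#include <sys/types.h>
#include <sys/epoll.h>
#include <sys/sendfile.h>
#include <fcntl.h>
#include <errno.h>

/* Register a connection for edge-triggered write-readiness events.
 * The fd names (epfd, connfd) are hypothetical. */
int watch_connection(int epfd, int connfd) {
    struct epoll_event ev;
    /* non-blocking mode: sendfile returns EAGAIN instead of blocking */
    fcntl(connfd, F_SETFL, fcntl(connfd, F_GETFL, 0) | O_NONBLOCK);
    ev.events = EPOLLOUT | EPOLLET;   /* EPOLLET selects edge-triggered delivery */
    ev.data.fd = connfd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, connfd, &ev);
}

/* Transmit file bytes with zero-copy sendfile until done or the socket
 * would block; returns the bytes still unsent, or -1 on error. */
ssize_t send_file_part(int connfd, int filefd, off_t *off, size_t remaining) {
    while (remaining > 0) {
        ssize_t n = sendfile(connfd, filefd, off, remaining);
        if (n > 0) {
            remaining -= (size_t)n;   /* the kernel advanced *off by n bytes */
        } else if (n == 0) {
            break;                    /* end of file */
        } else if (errno == EAGAIN) {
            break;                    /* socket buffer full: resume on next EPOLLOUT */
        } else {
            return -1;                /* real error */
        }
    }
    return (ssize_t)remaining;
}

With a blocking socket the EAGAIN branch never triggers; the call simply waits inside sendfile, which is the behaviour the heavy disk-I/O experiments found more efficient.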
Next, the performance of the various architectures is examined on a multiprocessor system. Knot is excluded from these experiments as its underlying thread library, Capriccio, only supports uniprocessor execution. For these experiments, it is shown that partitioning the system, so that server processes, subnets and requests are handled by the same CPU, is necessary to achieve high throughput. Both N-copy and new hybrid versions of the uniprocessor servers, extended to support partitioning, are tested. While the N-copy servers performed the best, the new hybrid versions of the servers also performed well. The hybrid servers have throughput within 2% of the N-copy servers but offer benefits over N-copy, such as a smaller memory footprint and a shared address-space. For multiprocessor systems, it is shown that once the system becomes disk bound, the throughput of the servers is drastically reduced. To maximize performance on a multiprocessor, high disk throughput and a large amount of memory are essential.
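The N-copy partitioning itself can be pictured with the following short C sketch. It is an assumed outline rather than the thesis implementation: it forks one server copy per CPU and pins each with the Linux sched_setaffinity call, while the association of subnets and requests with each partition is elided.

#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void) {
    long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
    for (long cpu = 0; cpu < ncpus; cpu++) {
        pid_t pid = fork();
        if (pid == 0) {                    /* child: one server copy per CPU */
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET((int)cpu, &set);       /* pin this process to its CPU */
            if (sched_setaffinity(0, sizeof(set), &set) == -1) {
                perror("sched_setaffinity");
                exit(1);
            }
            /* ... serve the requests arriving on this CPU's subnet ... */
            exit(0);
        }
    }
    while (wait(NULL) > 0)                 /* reap the server processes */
        ;
    return 0;
}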
Acknowledgements

I would like to thank Peter Buhr, my supervisor, mentor and friend. You have always given me the trust and respect of a colleague, but also tremendous support and guidance. As a result, I have had a most unusual graduate experience. I have enjoyed the wide variety of projects we have worked on together, and I will miss our close collaboration.

I also thank the other members of my committee: Tim Brecht, Sebastian Fischmeister, Martin Karsten and Matt Welsh. Your feedback has improved this thesis and will help in my future research.

During my time in the Programming Languages lab, it has always been more than a place to work. The lab has undergone many changes over the years, but one constant is the interesting people who choose to work there. I would like to acknowledge all the members of PLG; you are a big part of the reason I stayed as long as I did. In particular, I would like to thank my good friend Roy Krischer for his help and advice, and for his love of good food and movies. I would also like to thank Jiongxiong Chen, Maheedhar Kolla, Brad Lushman, Richard Bilson, Ayelet Wasik, Justyna Gidzinski, Stefan Büttcher, Mona Mojdeh, Josh Lessard, Rodolfo Esteves, Azin Ashkan, Ghulam Lashari, Jun Chen, Davinci Logan Yonge-Mallo, Kelly Itakura, Ben Korvemaker, John Akinyemi, Tom Lynam, Egidio Terra, Nomair Naeem, Jason Selby, Dorota Zak and Krzysztof Borowski. It has been a privilege sharing a lab with all of you. As well, thank you to Lauri Brown, Stuart Pollock, Robert Warren, Elad Lahav, David Pariag and Mark Groves.

I would also like to thank the people I have had the pleasure of working with in my capacity as lab manager, especially Mike Patterson, Wendy Rush, Lawrence Folland and Dave Gawley.

To my parents and my brother, thank you for your tremendous support and patience.

To my parents

Contents

List of Tables
List of Figures
List of Abbreviations

1 Introduction
  1.1 Contributions
  1.2 Thesis Outline

2 Background and Related Work
  2.1 Handling an HTTP Request
  2.2 Server Architectures
    2.2.1 Event-Driven Architecture
    2.2.2 Thread-Per-Connection Architecture
    2.2.3 Pipeline Architecture
  2.3 Uniprocessor Performance Comparisons
  2.4 Multiprocessor Performance Comparisons
  2.5 File-System Cache
  2.6 API Improvements
    2.6.1 Scalable Event-Polling
    2.6.2 Zero-Copy Transfer
    2.6.3 Asynchronous I/O
  2.7 Summary

3 Uniprocessor Web-Server Architectures
  3.1 File Set
  3.2 Response Time
  3.3 Verification
  3.4 Tuning
  3.5 Environment
  3.6 Cache Warming
  3.7 Table Calculation
  3.8 Servers
    3.8.1 Knot and Capriccio
    3.8.2 µserver
    3.8.3 SYMPED Architecture
    3.8.4 Shared-SYMPED Architecture
    3.8.5 WatPipe
  3.9 Static Uniprocessor Workloads
  3.10 1.4GB
    3.10.1 Tuning Knot
    3.10.2 Tuning µserver
    3.10.3 Tuning WatPipe
    3.10.4 Server Comparison
  3.11 4GB
    3.11.1 Tuning Knot
    3.11.2 Tuning µserver
    3.11.3 Tuning WatPipe
    3.11.4 Server Comparison
  3.12 .75GB
    3.12.1 Tuning Knot
    3.12.2 Tuning µserver
    3.12.3 Tuning WatPipe
  3.13 Server Comparison
  3.14 Comparison Across Workloads
  3.15 Summary

4 Multiprocessor Web-Server Architectures
  4.1 Overview
  4.2 File Set
  4.3 Environment
  4.4 Affinities
  4.5 Scalability
  4.6 4GB
    4.6.1 Tuning N-copy
    4.6.2 Tuning µserver
    4.6.3 Tuning WatPipe
    4.6.4 Server Comparison
  4.7 2GB
    4.7.1 Tuning N-copy µserver
    4.7.2 Tuning N-copy WatPipe
    4.7.3 Tuning µserver
    4.7.4 Tuning WatPipe
    4.7.5 Server Comparison
  4.8 Comparison Across Workloads
  4.9 Summary