Leveraging Linux Kernel Tracing to Classify and Detail Application Bottlenecks
MEng Individual Project
Imperial College London, Department of Computing

Author: Ashley J Davies-Lyons
Supervisor: Dr. Anthony Field
Second Marker: Dr. Giuliano Casale

June 17, 2019

Abstract

GAPP is a bottleneck identification tool that uses Linux kernel probes to identify periods of reduced parallelism in multithreaded programs. Although GAPP is effective at identifying the lines of source code that lead to a bottleneck, it is unable to classify the type of bottleneck, for example whether it is due to lock contention or I/O. This project solves this problem by augmenting the stack traces generated by GAPP with classifications, together with details of any files or IP addresses that were interacted with. Additionally, by tracking kernel-level synchronisation ('futex') calls, we develop a lock analysis feature that assists with identifying particularly critical locks (and unlockers) in user applications. Further, we provide a summary of the most critical individual file and synchronisation actions. In the spirit of GAPP, we implement this without requiring instrumentation and without introducing any language or library dependencies.

We find that our extended tool is able to reliably classify the different categories of bottleneck, and that it enriches the GAPP output with information that helps diagnose the root causes of a bottleneck. We verify this with two large open-source projects: an image tracking benchmark and a production game server. Finally, we find that the overhead we add is competitive with similar tools, and that our tool works correctly with alternative threading libraries, having evaluated it with TBB and pthreads.

In addition to our main contributions, we add a number of quality-of-life improvements to the tool, including a user interface to present the data, improved stack trace reporting, and easier methods of attaching to user applications.

Acknowledgements

I am hugely thankful to my supervisor, Tony, for being consistently helpful, supportive, and available for meetings over the last year, in addition to being a great personal tutor for the past four years, and also to my co-supervisor, Reena, for being ever-willing to help with strange bugs on short notice, and whose extensive body of relevant knowledge and experience saved me an incalculable amount of debugging and stress.

I'd also like to express my tremendous gratitude to my sixth form lecturers Gareth and Jonathan, whose eagle-eyed spotting of a mistake in my A-Level results rescued my place to study here in the first place.

And, last but not least, I am grateful to my mother, the rest of my family, and my friends, who have been endlessly supportive through the last few months.

Contents

1 Introduction
  1.1 Objectives
  1.2 Contributions
2 Background
  2.1 Note on threading
  2.2 Software Performance
    2.2.1 Tracing
    2.2.2 Sampling
    2.2.3 Categorising performance analysis
  2.3 Task-based parallelism - TBB
  2.4 Overview of recent profiling tools and approaches
    2.4.1 wPerf
    2.4.2 Coz, and causal profiling
    2.4.3 TaskProf
    2.4.4 GAPP
    2.4.5 Comparison of GAPP and wPerf
  2.5 Synchronisation
    2.5.1 Background
    2.5.2 Processor-level synchronisation
      2.5.2.1 Summary for x86
    2.5.3 Synchronisation primitives
      2.5.3.1 Types of primitives
      2.5.3.2 Spinning
    2.5.4 Futexes
  2.6 eBPF: tracing in the Linux kernel
    2.6.1 BCC
    2.6.2 Performance of eBPF
3 Extending GAPP
  3.1 Overview
    3.1.1 Goal
    3.1.2 Implementation
    3.1.3 Feature summary
    3.1.4 Comparison
    3.1.5 Backend summary
  3.2 Causation flag system
  3.3 Identifying Synchronisation Bottlenecks
    3.3.1 Limitations of return probes
      3.3.1.1 Avoiding this issue
    3.3.2 Summary
    3.3.3 Tracking futex wait operations
    3.3.4 Tracking futex wake operations
  3.4 Identifying IO Bottlenecks
    3.4.1 Identifying IO Bottlenecks - Files
    3.4.2 Identifying IO Bottlenecks - Networking
    3.4.3 Identifying IO Bottlenecks - Reads and Writes
  3.5 Modifications to the core GAPP algorithm
  3.6 Python front-end processing
    3.6.1 Additional synchronisation stack traces
    3.6.2 Lock page
    3.6.3 Most Critical Futex Activity
    3.6.4 Most Critical File Activity
  3.7 User Experience
    3.7.1 Tracing relevant threads
    3.7.2 Enhanced Stack Trace Reporting
    3.7.3 User Interface
  3.8 Engineering larger systems with BCC and eBPF
    3.8.1 Separation of code
    3.8.2 Debugging issues
4 Evaluation
  4.1 Setup
  4.2 Note on glibc stack traces and parent threads
  4.3 Individual feature evaluation
    4.3.1 Elementary synchronisation
    4.3.2 Shared locks
    4.3.3 Multiple locks
    4.3.4 IO - File reading
    4.3.5 IO - File writing
  4.4 Alternative threading library
    4.4.1 Ad-hoc threading
    4.4.2 TBB
  4.5 Real-world Benchmarks & Programs
    4.5.1 Parsec - Bodytrack
    4.5.2 Cuberite - A Minecraft Server
  4.6 Quantifying errors
    4.6.1 Evaluating error count in Cuberite
    4.6.2 Missing futex traces
5 Conclusion
  5.1 Summary
  5.2 Future Work
    5.2.1 DNS Probing
    5.2.2 Conflation of identically-defined locks
    5.2.3 Ad-hoc synchronisation
    5.2.4 Unit testing
A Additional Screenshots
B Evaluation Code Listings

1 | Introduction

Improvements in the processing power of a single core have slowed considerably over the past decade compared with the decades before it. One of the most powerful consumer desktop CPUs today, the Intel i9-9900KF [1], ships with a 3.6GHz clock speed as standard, little more than the 3.46GHz of the Intel i7-990X [2] from 2011. Instead, core counts are increasing, and manufacturers are implementing technologies such as SMT (Simultaneous Multithreading, a.k.a. hyperthreading), which enable an individual core to run multiple threads concurrently. Almost every modern CPU [3] offers four to eight cores, and even a single chip made for a high-performance server can contain dozens of cores. Accordingly, software applications are transitioning towards models that take advantage of multiple processor cores.

From a developer's perspective, the runtime behaviour of a multi-threaded application is generally harder to reason about than that of an equivalent single-threaded application. Additionally, traditional performance metrics and measurement tools designed for single-threaded applications are not as effective when applied to multi-threaded programs, because of the extra inefficiencies and forms of bottleneck introduced by the parallelism. Most obviously, there is the direct overhead introduced by the threading itself; for example, context switches caused by moving threads on and off CPU cores can add significant overhead to an otherwise efficient application.
This overhead is important, but by far the most significant performance issues arise from suboptimal thread synchronisation. When threads wait on each other (or even on I/O devices such as a disk) while they could be doing other work, a program can suffer significant performance losses. On an eight-core machine, an application with eight threads can perform worse than a single-threaded version if there is significant contention for a single resource, or if locking is too coarse.

Fixing these sorts of issues can be tricky, but it is often just as tricky to find the cause in the first place: it is far from trivial to know in advance which parts of an application will cause performance issues, and this difficulty only worsens when we try to reason about the runtime behaviour of multithreaded software with complex inter-thread interactions.

Considering all of this, there is a clear need for effective, straightforward, and accurate tools with which developers can profile and diagnose inefficiencies in their applications. They must be effective so that they can help achieve large performance improvements, straightforward in order to enable mass adoption by developers, and accurate so as to avoid developers losing faith in the tool.

1.1 Objectives

There exists a wide body of research focused on identifying bottlenecks in multithreaded applications [4, 5, 6, 7], much of which assumes the use of a specific threading library [6] or focuses on a specific concurrency model such as task-parallel programming (Intel's TBB, Cilk, etc.) [5]. Research which takes a more general approach is fairly recent [4, 7, 8].

This project extends the bottleneck detection tool GAPP [8], which detects synchronisation bottlenecks using a generic approach that avoids introducing a dependency on a specific threading library or language, and does not require the program under analysis to be instrumented in any way. In its present form, GAPP utilises tracing features in the kernel (its inner workings are described further in subsection 2.4.4) to identify when a significant number of threads in an application are in a non-runnable state (e.g. waiting on a lock); while this condition holds, a stack trace sample is taken and reported for any application thread that is descheduled. Additionally, the threads are sampled periodically to accumulate the set of critical functions and lines that each thread was executing prior to being descheduled.
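To make the kernel-side mechanism concrete, the listing below is a minimal BCC sketch, in the spirit of GAPP's tracing but not its actual implementation, that watches the scheduler's sched_switch tracepoint and counts how often each thread of a target process is descheduled while non-runnable (i.e. blocked, rather than preempted while still runnable). The map name, PID filtering, and output format are assumptions made for this example.

    #!/usr/bin/env python3
    # Minimal sketch (not GAPP itself): count how often each thread of a
    # target process is descheduled in a non-runnable (blocked) state.
    # Assumes BCC is installed and the script is run as root.
    import sys
    import time
    from bcc import BPF

    target_pid = int(sys.argv[1])  # PID (tgid) of the process to watch

    prog = r"""
    BPF_HASH(blocked_counts, u32, u64);

    TRACEPOINT_PROBE(sched, sched_switch) {
        // The tracepoint fires in the context of the outgoing task, so
        // the current pid/tgid identify the thread leaving the CPU.
        u64 pid_tgid = bpf_get_current_pid_tgid();
        if ((pid_tgid >> 32) != TARGET_PID)
            return 0;
        // prev_state == 0 (TASK_RUNNING) is a plain preemption; any other
        // value means the thread blocked, e.g. on a futex or on I/O.
        if (args->prev_state != 0) {
            u32 tid = (u32)pid_tgid;
            blocked_counts.increment(tid);
        }
        return 0;
    }
    """

    b = BPF(text=prog.replace("TARGET_PID", str(target_pid)))
    print("Tracing... Ctrl-C to stop.")
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        pass

    for tid, count in sorted(b["blocked_counts"].items(),
                             key=lambda kv: -kv[1].value):
        print("tid %d: blocked off-CPU %d times" % (tid.value, count.value))

GAPP's back end builds on this same descheduling signal, but rather than merely counting events it tracks how many threads are blocked at once and captures stack traces while that number is high; the extensions described in the following chapters classify and annotate those traces.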