CADET: Debugging and Fixing Misconfigurations Using Counterfactual Reasoning
Total Page:16
File Type:pdf, Size:1020Kb
CADET: Debugging and Fixing Misconfigurations using Counterfactual Reasoning Md Shahriar Iqbal∗ Rahul Krishna∗ Mohammad Ali Javidian University of South Carolina Columbia University Purdue University [email protected] [email protected] [email protected] Baishakhi Ray Pooyan Jamshidi Columbia University University of South Carolina [email protected] [email protected] Abstract Swap Mem. Swap Memory Modern computing platforms are highly-configurable with thou- sands of interacting configuration options. However, configuring GPU Resource these systems is challenging and misconfigurations can cause un- 512 Mb Latency 1Gb Memory Pressure expected non-functional faults. This paper proposes CADET (short 2Gb 3Gb for Causal Debugging Toolkit) that enables users to identify, ex- 4Gb Latency plain, and fix the root cause of non-functional faults early and in a GPU Growth GPU Growth principled fashion. CADET builds a causal model by observing the (a) (b) (c) performance of the system under different configurations. Then, it uses casual path extraction followed by counterfactual reason- Figure 1: Observational data (in Fig.1a) (incorrectly) shows ing over the causal model to (a) identify the root causes of non- that high GPU memory growth leads to high latency. The trend functional faults, (b) estimate the effects of various configuration is reversed when the data is segregated by swap memory. options on the performance objective(s), and (c) prescribe candi- date repairs to the relevant configuration options to fix the non- functional fault. We evaluated CADET on 5 highly-configurable systems by comparing with state-of-the-art configuration optimiza- heat dissipation [15, 71, 74, 84]. The sheer number of modalities tion and ML-based debugging approaches. The experimental results of software deployment is so large that exhaustively testing every indicate that CADET can find effective repairs for faults in multiple conceivable software and hardware configuration is impossible. non-functional properties with (at most) 13% more accuracy, 32% Consequently, identifying the root cause of non-functional faults is higher gain, and 13× speed-up than other ML-based performance notoriously difficult [35] with as much as 99% of them going unno- debugging methods. Compared to multi-objective optimization ap- ticed or unreported for extended durations [4]. This has tremendous proaches, CADET can find fixes (at most) 8× faster with comparable monetary repercussions costing companies worldwide an estimated or better performance gain. Our study of non-functional faults re- $5 trillion in 2018 and 2019 [34]. Further, developers on online fo- ported in NVIDIA’s forum shows that CADET can find 14% better rums are quite vocal in expressing their dissatisfaction. For example, repairs than the experts’ advice in less than 30 minutes. one developer on NVIDIA’s developer forum bemoans: “I am quite upset with CPU usage on TX2 [8],” while another complained, “I 1 Introduction don’t think it [the performance] is normal and it gets more and more Modern computing systems are highly configurable and can seam- frustrating [7].” Crucially, these exchanges provoke other unan- lessly be deployed on various hardware platforms and under differ- swered questions, such as, “what would be the effect of changing arXiv:2010.06061v2 [cs.SE] 8 Mar 2021 ent environmental settings. The configuration space is combinatori- another configuration ‘X’? [2].” Therefore, we seek methods that ally large with 100s if not 1000s of software and hardware configura- can identify, explain, and fix the root cause of non-functional faults tion options that interact non-trivially with one another [38, 49, 99]. early in a principled fashion. Unfortunately, configuring these systems to achieve specific goals Existing work. Much recent work has focused on configuration is challenging and error-prone. optimization, which are approaches aimed at finding a configura- Incorrect configuration (misconfiguration) elicits unexpected in- tion that optimizes a performance objective [29, 72, 86, 100, 105]. teractions between software and hardware resulting non-functional Finding the optimum configuration using push-button optimization faults, i.e., faults in non-functional system properties such as latency approaches is not applicable here because they do not give us any and energy consumption. These non-functional faults—unlike reg- information about the underlying interactions between the faulty ular software bugs—do not cause the system to crash or exhibit configuration options that caused the non-functional fault. This an obvious misbehavior [75, 82, 94]. Instead, misconfigured sys- information is sought after by developers seeking to address these tems remain operational while being compromised, resulting severe non-functional faults [82, 93]. performance degradation in latency, energy consumption, and/or Some previous work has used machine learning-based perfor- mance modeling approaches [36, 86, 87, 96]. These approaches are ∗Joint First Author adept at inferring the correlations between certain configuration Preprint, October, 2020, ARXIV Md Shahriar Iqbal, Rahul Krishna, Mohammad Ali Javidian, Baishakhi Ray, and Pooyan Jamshidi options and performance, however, they lack the mathematical lan- as (or better than) the best configuration discovered by the opti- guage to express how or why the configuration options affect perfor- mization techniques. Further, CADET can find effective repairs for mance. Without this, we risk drawing misleading conclusions. They faults in multiple performance objectives (i.e., latency and energy also require significant time to gather the training samples, which consumption) with accuracy and gain as high as 16% and 36%, re- grows exponentially with the number of configurations [56, 103]. spectively, than other ML-based performance debugging methods. Limitations of existing work. In Fig.1, we present an example Compared to multi-objective optimization approaches, CADET can to help illustrate the limitations with current techniques. Here, find repairs 5× faster while having similar performance gain. Fi- the observational data gathered so far indicates that GPU memory nally, with a case study, we demonstrate that CADET finds 14% growth is positively correlated with increased latency (as in Fig.1a). better repairs than the experts’ advice in 24 minutes. An ML-model built on this data will, with high confidence, predict Contributions. In summary, our contributions are as follows. that larger GPU memory growth leads to larger latency. However, • We propose CADET, a novel causal diagnostics, and fault mitiga- this is counter-intuitive because higher GPU memory growth should, tion tool that identifies the root causes of non-functional faults and in theory, reduce latency not increase it. When we segregate the recommends repairs to resolve non-functional faults. same data on swap memory (as in Fig.1b), we see that there is in- • Our empirical results, conducted on on 5 highly configurable deed a general trend downward for latency, i.e., within each group systems deployed on 3 architecturally different hardware, CADET of swap memory, as GPU memory growth increases the latency de- outperforms current state-of-the-art ML-based diagnostics and op- creases. We expect this because GPU memory growth controls how timization approaches in terms of efficacy (+16% accuracy and +36% much memory the GPU can “borrow” from the swap memory. De- gain) and speed (up to 13× faster). CADET was also adept at han- pending on resource pressure imposed by other host processes, a dling both single- and multi-objective performance faults. resource manager may arbitrarily re-allocate some swap memory; • We offer a manually curated performance fault dataset (called this means the GPU borrows proportionately more/less swap mem- Jetson Faults) and accompanying code required to reproduce our ory thereby affecting the latency correspondingly. This is reflected findings at https://git.io/JtFNG. by the data in Fig.1b. If the ML-based model were to consult the available data (from Fig.1a) unaware of such underlying factors, these models would be incorrect. With thousands of configurations, inferring such nuanced information from optimization or ML-based 2 Motivation approaches would require a considerable amount of measurements and extensive domain expertise which can be impractical, if not 2.1 A real-world example impossible, to possess in practice. This section illustrates the challenges both users and component Our approach. In this paper, we design, implement, and evalu- developers face when configuring and composing complex systems ate CADET (short for Causal Debugging Toolkit). CADET uses with a concrete issue report from the NVIDIA forum [6]. Here, a causal structural discovery algorithms [91] to construct a causal user notices some strange behavior when trying to transplant their model using observational data [78]. Then, it uses counterfactual code for real-time computation of depth information from stereo- reasoning over the causal model to (a) identify the root causes of cameras from NVIDIA Jetson TX1 to TX2. Since TX2 has twice the non-functional faults, (b) estimate the effects of various config- computational power, they expected at least 30% lower latency, but urable parameters on the non-functional properties, and (c) pre- observed 4 times higher latency. The user indicates working with a scribe candidate repairs to the relevant configuration options to fix reliable reference implementation