CADET: Debugging and Fixing Misconfigurations using Counterfactual Reasoning Md Shahriar Iqbal∗ Rahul Krishna∗ Mohammad Ali Javidian University of South Carolina Columbia University Purdue University
[email protected] [email protected] [email protected] Baishakhi Ray Pooyan Jamshidi Columbia University University of South Carolina
[email protected] [email protected] Abstract Swap Mem. Swap Memory Modern computing platforms are highly-configurable with thou- sands of interacting configuration options. However, configuring GPU Resource these systems is challenging and misconfigurations can cause un- 512 Mb Latency 1Gb Memory Pressure expected non-functional faults. This paper proposes CADET (short 2Gb 3Gb for Causal Debugging Toolkit) that enables users to identify, ex- 4Gb Latency plain, and fix the root cause of non-functional faults early and in a GPU Growth GPU Growth principled fashion. CADET builds a causal model by observing the (a) (b) (c) performance of the system under different configurations. Then, it uses casual path extraction followed by counterfactual reason- Figure 1: Observational data (in Fig.1a) (incorrectly) shows ing over the causal model to (a) identify the root causes of non- that high GPU memory growth leads to high latency. The trend functional faults, (b) estimate the effects of various configuration is reversed when the data is segregated by swap memory. options on the performance objective(s), and (c) prescribe candi- date repairs to the relevant configuration options to fix the non- functional fault. We evaluated CADET on 5 highly-configurable systems by comparing with state-of-the-art configuration optimiza- heat dissipation [15, 71, 74, 84]. The sheer number of modalities tion and ML-based debugging approaches.