Machine Learning for Predictive Analytics of Compute Cluster Jobs

Machine Learning for Predictive Analytics of Compute Cluster Jobs Andresen. Dan1, Hsu. William2, Yang. Huichen2, and Okanlawon. Adedolapo3 Department of Computer Science, Kansas State University Manhattan, Kansas, USA Abstract— We address the problem of predicting whether attributes of jobs that can help users or administrators to sufficient memory and CPU resources have been requested classify jobs into equivalence classes by likelihood of failure, for jobs at submission time. For this purpose, we examine based on aggregate demographic and historical profile infor- the task of training a supervised machine learning system mation regarding the user who submitted each job. Among to predict the outcome - whether the job will fail specifi- these are estimating (1) the failure probability of a job at cally due to insufficient resources - as a classification task. submission time, (2) the estimated resource utilization level Sufficiently high accuracy, precision, and recall at this task given submission history. We address the predictive task facilitates more anticipatory decision support applications with the prospective goal of selecting helpful, personalized in the domain of HPC resource allocation. Our preliminary runtime or postmortem feedback to help the user make better results using a new test bed show that the probability of cost-benefit tradeoffs. Our preliminary results show that the failed jobs is associated with information freely available at probability of failed jobs is associated with information job submission time and may thus be usable by a learning freely available at job submission time and may this be system for user modeling that gives personalized feedback usable by a learning system for user modeling that gives to users. personalized feedback to users. Beocat is the primary HPC system of Kansas State Uni- Keywords: HPC, machine learning, predictive analytics, decision versity (KSU). When submitting jobs to the managed queues support, user modeling of Beocat, users need to specify their estimated running time and memory. The Beocat system then schedules jobs 1. Introduction based on these job requirements, the availability of system This work presents a new machine learning-based ap- resources, and job-dependent factors such as static properties proach as applied to high-performance computing (HPC). In of the executable. However, users cannot in general estimate particular, we seek to train a supervised inductive learning the usage of their jobs with high accuracy, as it is challenging system to predict when jobs submitted to a compute cluster even for trained and experienced users to estimate how much may be subject to failure due to insufficient requested re- time and memory a job requires. A central hypothesis of sources. There exist various open-source software packages this work, based upon observation of users over the 20- which can help administrators to manage HPC systems, such year operating life of Beocat to date, is that estimation as the Sun Grid Engine (SGE) [1] and Slurm [2]. These accuracy is correlated with user experience in particular provide real-time monitoring platforms allowing both users use cases, such as the type of HPC codes and data they and administrators to check job statuses. However, there are working with. The risks of underestimation of resource does not yet exist software that can help to fully automate requirements include wasting some of these resources: if a the allocation of HPC resources or to anticipate resource user submits a job which will fail during execution because needs reliably by generalizing over historical data, such as this user underestimates resource needs at submission, the determining the number of processor cores and the amount job will occupy some resources after submission until it has of memory needed as a function of requests and outcomes been identified as having failed by the queue management on previous job submissions. Machine learning (ML) applied software of the compute cluster. This situation will not towards decision support in estimating resource needs for only affect the available resources but also affect other an HPC task, or to predict resource usage, is the subject jobs in the cluster’s queues. This is a pervasive issue in of several previous studies [3]–[7]. Our continuing work is HPC, not specific to Beocat, and cannot by solved solely based on a new predictive test bed developed using Beocat, by proactive HPC management. We therefore address it by the primary HPC platform at Kansas State University, and a using supervised learning to build a model from historical data set compiled by the Department of Computer Science logs. This can help save some resources in HPC systems from the job submission and monitoring logs of Beocat. and also yield recommendations to users such as estimated Our purpose in this paper is to use supervised inductive CPU/RAM usage of job, allowing them to submit more learning over historical log files on large-scale compute robust job specifications rather than waiting until jobs fail. clusters that contain a mixture of CPUs and GPUs. We In the remainder of this paper, we lay out the machine seek initially to develop a data model containing ground learning task and approach for Beocat, surveying algorithms and describing the development of a experimental test bed. most cost-effective based on these estimates and historical data?” This in turn requires data preparation steps including 2. Machine Learning and Related Work integration and transformation. As related above, the core focus and novel contribution Prediction targets and ground truth. Data transforma- of this work are the application of supervised machine tions on these logs allow us to define a basic regression learning to resource allocation in HPC systems, particularly task of predicting the CPU and memory usage of jobs, and a predictive task defined on compute clusters. This remains the ancillary task of determining whether a job submission an open problem as there is as yet no single machine learning will fail due to this usage exceeding the allocation based representation and algorithm that can reliably help users to on the resources requested by a user. The ground truth for predict job memory requirements in an HPC system, as been these tasks comes from historical data collected from Beocat noted by researchers at IBM [8]. However, it is still worth over several years, which presents an open question of how while to try to find some machine learning technologies that to validate this ground truth across multiple HPC systems. could help administrators better anticipate resource needs This is a challenge beyond the scope of the present work and help users make more cost-effective allocation choices but an important reason for having open access data: so that in an HPC system. Different HPC system have different the potential for cross-system transfer can be assessed as a environments; our goal is to improve resource allocation in criterion, and cross-domain transfer learning methods can be our HPC system. Our objectives go beyond simple CPU and evaluated. memory usage prediction towards data mining and decision support. [9] There are thus two machine learning tasks on 2.2 Defining learning tasks: regression vs. clas- which we are focused: (1) regression to predict usage of sification CPU and memory, and (2) classification over job submission As explained above, this work concerns learning to predict instances to predict job failure after the submission. We also for two questions: the numerical question of estimating try to train different models with different machine learning the quantity of resources used (CPU cycles and RAM), algorithms. The following part is the machine learning and the yes-no question of whether a job will be killed. algorithms involved in our experiment. The first question, CPU/RAM estimation, is by definition a discrete estimation task. However, in its most general form, 2.1 Test bed for machine learning and decision the integer precision needed to obtain multiple decision support support estimates such as those discussed in the previous A motivating goal of this work is to develop an open test section, and to then generate actionable recommendations bed for machine learning and decision support using Beocat for the available options, makes this in essence a continuous data. Beocat is at present the largest in the state of Kansas. It estimation task (i.e., regression). The section question is is also the central component of a regional compute cluster binary classification (i.e., concept learning, with the concept that provides a platform for academic HPC research, includ- being "job killed due to resource underestimate"). ing many interdisciplinary and transdisciplinary users. This Beyond the single job classification task, we are interested is significant because user experience can vary greatly by in the formulation of classification tasks for users - that is, both familiarity with computational methods in their domain assessing the level of expertise and experience. These may of application and HPC platform. Examples of application be independent factors as documented in the description of domain-specific computational methods include data integra- the test bed. tion and modeling, data transformations needed by the users on their own raw data, and algorithms used to solve their 2.3 Ground Features and Relevance specific problems. Meanwhile, the HPC platform depends A key question is that of relevance determination - how to design choices such as programming languages, scientific deal with increasing numbers of irrelevant features. [9]–[10] computing libraries, the parallel computing architecture (e.g., In this data set, ground features are primitive attributes of a parallel programming library versus MapReduce or ad hoc the relational schema (and simple observable or computable task parallelism), and load balancing and process migration variables). For our HPC predictive analytics task, this is methods, if any. initially a less salient situation, because the naive version Precursors: feature sets and ancillary estimation tar- of the task uses only per-job features or predominantly gets.

Load more