A Distributed Workflow Platform for High-Performance Simulation

Toàn Nguyên
Project OPALE
INRIA Grenoble Rhône-Alpes
Grenoble, France
[email protected], [email protected]

Jean-Antoine Désidéri
Project OPALE
INRIA Sophia-Antipolis Méditerranée
Sophia-Antipolis, France
[email protected]

Abstract—This paper presents an approach to design, implement and deploy a simulation platform based on distributed workflows. It supports the smooth integration of existing software, e.g., Matlab, Scilab, Python, OpenFOAM, ParaView and user-defined programs. Additional features include the support for application-level fault-tolerance and exception-handling, i.e., resilience, and the orchestrated execution of distributed codes on remote high-performance clusters.

Keywords—workflows; fault-tolerance; resilience; simulation; distributed systems; high-performance computing

I. INTRODUCTION

Large-scale simulation applications are becoming standard in research laboratories and in industry [1][2]. Because they involve a large variety of existing software and terabytes of data, moving calculations and data files around is not straightforward. Further, software and data are often stored in proprietary locations and cannot be moved. Distributed computing infrastructures are therefore necessary [6][8].

This article explores the design, implementation and use of a distributed simulation platform. It is based on a workflow system and a wide-area distributed network. This infrastructure includes heterogeneous hardware and software components. Further, the application codes must interact in a timely, secure and effective manner. Additionally, because the coupling of remote hardware and software components is prone to run-time errors, sophisticated mechanisms are necessary to handle unexpected failures at the infrastructure and system levels [19]. This is also true for the coupled software that contributes to large simulation applications [35]. Consequently, specific management software is required to handle unexpected application and software behavior [9][11][12][15].

This paper addresses these issues. Section II is an overview of related work. Section III is a general description of a sample application, infrastructure, systems and application software. Section IV addresses fault-tolerance and resiliency issues. Section V gives an overview of the implementation using the YAWL workflow management system [4]. Section VI is a conclusion.

II. RELATED WORK

Simulation is nowadays a prerequisite for product design and for scientific breakthroughs in many application areas, ranging from pharmacy and biology to climate modeling, all of which require extensive simulation testing. This often requires large-scale experiments, including the management of petabyte volumes of data and large multi-core supercomputers [10].

In such application environments, various teams usually collaborate on several projects or parts of projects. Computerized tools are often shared and tightly or loosely coupled [23]. Some codes may be remotely located and non-movable. This is supported by distributed code and data management facilities [29]. Unfortunately, this is prone to a large variety of unexpected errors and breakdowns [30].

Most notably, data replication and redundant computations have been proposed to guard against random hardware and communication failures [42], as well as failure prediction [43], sometimes applied to deadline-dependent scheduling [12].

System-level fault-tolerance mechanisms for specific programming environments have been proposed, e.g., CIFTS [20], FTI [48]. Also, middleware usually supports mechanisms to handle fault-tolerance in distributed job execution, typically calling upon data replication and redundant code execution [9][15][22][24].

Erratic application behavior also needs to be supported [45]. This implies evolving the simulation process when such behavior occurs. Little has been done in this area [33][46]. The primary concerns of designers, engineers and users have so far focused on efficiency and performance [47][49][50]. Therefore, unexpected application behavior is usually handled by re-designing and re-programming pieces of code and adjusting parameter values and bounds, which usually requires the simulations to be stopped and restarted.

A dynamic approach is presented in the following sections. It supports the evolution of the application behavior through the introduction of new exception-handling rules at run-time by the users, based on occurring (and possibly unexpected) events and data values. The running workflows do not need to be suspended in this approach: new rules can be added at run-time without stopping the executing workflows. This allows on-the-fly management of unexpected events. This approach also allows a permanent evolution of the applications, supporting their continuous adaptation to unforeseen situations [46]. As new situations arise and new data values appear, new rules can be added to the workflows that will permanently take them into account in the future. These evolutions are dynamically hooked onto the workflows without the need to stop the running applications. The overall application logic is therefore maintained unchanged, which guarantees a constant adaptation to new situations without redesigning the existing workflows. Further, because the exception-handling codes are themselves defined by new ad-hoc workflows, the user interface remains unchanged [14].
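To make the rule-based mechanism concrete, the following sketch illustrates the idea of consulting a set of exception-handling rules on every monitored event, where users may register new rules while the application keeps running. It is a minimal Python illustration, not the platform's actual interface (which is built on the YAWL workflow system described in Section V); the names RuleEngine, Rule and on_event, as well as the sample task names and thresholds, are hypothetical.

    import threading
    from dataclasses import dataclass
    from typing import Any, Callable, Dict, List

    @dataclass
    class Rule:
        name: str
        condition: Callable[[Dict[str, Any]], bool]  # tested against each incoming event
        action: Callable[[Dict[str, Any]], None]     # e.g., relaunch a task, adjust a bound

    class RuleEngine:
        # Consulted by the running workflow on every event; rules can be added at run-time.
        def __init__(self) -> None:
            self._rules: List[Rule] = []
            self._lock = threading.Lock()

        def add_rule(self, rule: Rule) -> None:
            # Called by users while the application keeps executing.
            with self._lock:
                self._rules.append(rule)

        def on_event(self, event: Dict[str, Any]) -> None:
            # Evaluate every currently registered rule against the incoming event.
            with self._lock:
                rules = list(self._rules)
            for rule in rules:
                if rule.condition(event):
                    rule.action(event)

    # Example: while a simulation runs, a user registers a rule that relaunches
    # the flow solver whenever its residual diverges (values are illustrative).
    engine = RuleEngine()
    engine.add_rule(Rule(
        name="diverging-residual",
        condition=lambda e: e.get("task") == "flow_solver" and e.get("residual", 0.0) > 1e3,
        action=lambda e: print("relaunching", e["task"], "with adjusted parameters"),
    ))
    engine.on_event({"task": "flow_solver", "residual": 5e3})

Because the rule set is only read when an event arrives, newly added rules apply to all subsequent events without suspending the running workflow, which mirrors the behavior described above.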
III. TESTCASE APPLICATION

A. Example

This work is performed for the OMD2 project (Optimisation Multi-Discipline Distribuée, i.e., Distributed Multi-Discipline Optimization) supported by the French National Research Agency ANR.

An overview of two running testcases is presented here. They deal with the optimization of an automotive air-conditioning system [36]. The goal of this particular testcase is to optimize the geometry of an air-conditioner pipe in order to avoid air flow deviations in both pressure and speed at the pipe output (Figure 1). Several optimization methods are used, based on current research by public and industry laboratories.

Figure 1. Flow pressure (left) and speed (right) in an air-conditioner pipe (ParaView screenshots).

This example is provided by a car manufacturer and involves several industry partners, e.g., software vendors, and academic labs, e.g., optimization research teams (Figure 1).

The testcases are a dual-faceted 2D and 3D example. Each facet involves different software for CAD modeling, e.g., CATIA and STAR-CCM+, numeric computations, e.g., Matlab and Scilab, flow computation, e.g., OpenFOAM, and visualization, e.g., ParaView (Figure 12, at the end of this paper).

The testcases are deployed on the YAWL workflow management system [4]. The goal is to distribute the testcases on the various partners' locations where the software are running (Figure 2). In order to support this distributed computing approach, an open-source middleware is used [17].

A first step is implemented by making extensive use of virtualization technologies (Figure 3), i.e., Oracle VM VirtualBox, formerly Sun's VirtualBox [7]. This allows hands-on experiments connecting several virtual guest computers running heterogeneous software (Figure 10, at the end of this paper). These include Linux Fedora Core 12, Windows 7 and Windows XP running on a range of local workstations and laptops (Figure 11, at the end of this paper).

B. Application Workflow

In order to provide a simple and easy-to-use interface to the computing software, a workflow management system is used (Figure 2). It supports high-level graphic specification for application design, deployment, execution and monitoring. It also supports interactions among heterogeneous software components. Indeed, the 2D example testcase described in Section III.A involves several codes written in Matlab and OpenFOAM, with results displayed using ParaView (Figure 7). The 3D testcase involves CAD files generated using CATIA and STAR-CCM+, flow calculations using OpenFOAM, Python scripts, and visualization with ParaView. Extensions also allow the use of the Scilab toolbox.
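As an illustration of how such heterogeneous codes can be chained from workflow tasks, the sketch below launches each computing step of the 2D testcase as an external process and captures its output. It is only a sketch: the directory layout, script names and command-line options are placeholders, not the project's actual configuration, which wraps these invocations in the workflow services described in Section V.

    import subprocess
    from pathlib import Path

    def run_step(command, workdir: Path, log_name: str) -> int:
        # Run one computing step in its own directory and capture its output.
        workdir.mkdir(parents=True, exist_ok=True)
        with open(workdir / log_name, "w") as log:
            completed = subprocess.run(command, cwd=workdir,
                                       stdout=log, stderr=subprocess.STDOUT)
        return completed.returncode  # a non-zero code could trigger an exception-handling rule

    case = Path("testcase_2d")  # hypothetical working directory for the 2D testcase

    # 1. Parameter/geometry pre-processing, e.g., a Matlab script run in batch mode.
    run_step(["matlab", "-batch", "preprocess_pipe"], case / "pre", "matlab.log")

    # 2. Flow computation with an OpenFOAM solver.
    run_step(["simpleFoam", "-case", "."], case / "foam", "solver.log")

    # 3. Off-screen post-processing with a ParaView Python script.
    run_step(["pvbatch", "render_pressure.py"], case / "post", "paraview.log")

In the platform itself, each such step would correspond to a workflow task, so that failures surface as workflow exceptions rather than being handled inside ad-hoc scripts.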
Because proprietary software are used, as well as open-source and in-house research codes, a secured network of connected computers is made available to the users, based on existing middleware (Figure 8).

This network is deployed on the various partners' locations throughout France. Web servers accessed through the SSH protocol are used for the proprietary software

YAWL thus provides an abstraction layer that helps users design complex applications that may involve a large number of distributed