arXiv:1502.01509v1 [cs.OS] 5 Feb 2015

OS-level Failure Injection with SystemTap

Camille Coti†, Nicolas Greneche†‡
† LIPN, CNRS UMR 7030, Université Paris 13, Sorbonne Paris Cité, F-93430, Villetaneuse, France
‡ DSI
{first.last}@univ-paris13.fr

Abstract—Failure injection in distributed systems has been an important issue for people who experiment with robust, resilient distributed systems. In order to reproduce real-life conditions, processes must be killed without letting the operating system close the existing network communications. In this paper, we present a kernel-level failure injection system based on SystemTap and control groups (cgroups). We present how it can be used to implement deterministic and probabilistic failure scenarios, and we explain how this approach follows fail-stop semantics: processes are killed at kernel-level, without letting the OS do anything else. The approach can be extended to other crash semantics, such as omission and replication of network packets.

I. INTRODUCTION

Failure injection in distributed systems has been an important issue for people who develop robust, resilient fault-tolerant distributed applications and must be able to test their software in faulty conditions, with realistic failures. For instance, the fault-tolerant implementation of the MPI standard MPICH-V [2] aims at providing an execution environment that can survive failures. It was tested using a failure-injection system exhibiting rare and specific failures [10]. This tool helped finding a bug in a precise fault scenario involving two consecutive failures and, more important, helped the developers of MPICH-V fix this bug and test it in real-life situations. In a similar way, failure detectors need to be strained and tested [1].

However, traditional tools kill processes. When a process in a system is killed, the operating system closes all the sockets of the process in a "clean" way. For instance, the TCP connections are all closed according to the TCP protocol. This is not a realistic situation. In real life, a failure can be anything [14], [15]. The failed component can be a crashed process or a communication channel. The cause of the failure can be an electric failure, a cable cut, a failed component... In this case, the component halts and simply stops doing anything. As a matter of fact, it behaves normally and no warning is issued before the bug happens, and then the system crashes. As a consequence, the operating system cannot close the sockets nor do any such thing.

In this paper, we present a method to inject failures in applications at kernel-level, following fail-stop semantics (i.e., when a component fails, it simply stops; therefore, the operating system cannot interfere with the application, and the sockets are not closed in a "clean" way). Our approach is based on two Linux tools: control groups (cgroups) and SystemTap, an infrastructure that probes the OS kernel.

The remainder of this paper is organized as follows. In section II, we present and analyze existing failure injection systems, and we compare them with our approach. In section III, we present our approach, Failure Injection with SystemTap (FIST), and how it can be used for distributed applications. In section IV, we explain how failure scenarios are implemented and how to inject failures in practice with FIST. Last, we conclude and mention perspectives for other kinds of failures in section V.

II. RELATED WORKS

In [11], a daemon process is used on each node to apply a fault injection policy in common distributed systems. This approach introduces additional processes on the instrumented host. Those processes interfere with the kernel scheduler in two ways: 1) they consume resources, such as CPU cycles and scheduling quanta; 2) more processes need to be handled by the scheduling policy. Moreover, if a fault injection daemon fails, the policy becomes inapplicable.

Fault injection in parallel computing is trickier than in common distributed systems. Developers can use a virtual infrastructure running on top of parallelisation libraries (Open MPI, OpenMP, etc.) to check the fault resilience of parallel libraries [9]. The libraries used for parallel computing are closely related to the hardware. For example, when Open MPI's run-time daemon (orted) is started, it selects a network medium (Ethernet, InfiniBand, etc.) to perform communication between MPI processes. Network media like InfiniBand cannot be virtualized. The scheduler behavior should also not be altered by the fault injection infrastructure: for example, parallel computing libraries like OpenMP [7] interact with the scheduler to optimize process affinity with CPU caches [3].

III. FAILURE INJECTION

In common Linux-based systems, a running process is a kernel structure task_struct. This structure defines an integer variable pid and a pointer to another task_struct. The pid is used by the kernel to identify the process in a non-ambiguous way. The pointer refers to the parent task_struct.

To put it another way, it refers to the father process that performed the fork/exec sequence to spawn the child process. As a consequence, the Linux process organization can be seen as a tree with the initd process as its root.

Control groups (Cgroups) are a major evolution of this model. Without cgroups, a resource usage policy can only be defined for a targeted process and its potential children. Cgroups allow a process to be added dynamically to one or more groups at each creation of a task_struct. Such a group is called a cgroup. A resource usage policy can target a cgroup instead of a single process. Cgroups make it possible to apply the same resource usage policy to a group of processes without wondering about the process tree structure.

Cgroups are always used with a new improvement of the Linux kernel: name spaces. Name spaces make it possible to have different instances of some kernel objects. This provides powerful isolation capabilities to Linux kernels. For instance, processes that belong to a given cgroup can use their own instance of the IP stack. These two components allow running a lightweight sandbox under a strict resource limitation policy [13].

A. SystemTap

SystemTap is a production-ready kernel auditing tool. An audit policy is written in a high-level language. The main part of the policy consists in definitions of probes. A probe is an instrumentation point of the kernel (for example, a kernel function). SystemTap relies on two kinds of probes: Kprobes and Jprobes. Kprobes can trap at almost any kernel code address and define a handler to execute code when the address is reached. A Kprobe can be attached to any instruction in the kernel. Jprobes are inserted at a kernel function's entry, providing convenient access to its arguments.

SystemTap provides a compiler that transforms an auditing policy into a kernel module. When loaded, this kernel module registers all the defined probes. Every time a probed function is reached, the defined handler code is executed. Handler code can use SystemTap's native collection of functions or a guru mode. SystemTap's native functions are high-level primitives implemented in tapsets (quite similar to libraries). The guru mode makes it possible to pass raw C kernel code to handler code.

B. Process supervision and control

Process supervision consists in collecting states and metrics about a targeted process. Control consists in performing actions on a supervised process. Prior to the introduction of cgroups in the Linux kernel, common "master" processes were used to supervise and control child processes. The two main drawbacks of this model are the following:
1) if the master process fails, every supervised child process also fails; to mitigate this issue, the master's source code should be very short;
2) a process can belong to only one supervision domain, since it has only one father.

C. Running distributed applications on FIST

The idea behind FIST can be summarized as follows:
• a specific control group is defined for processes that belong to the experiment; these are the processes that can be affected by intentional failures.
• a SystemTap script inserts a hook in the kernel's scheduler. When the scheduler is invoked to run a process, it checks whether this process belongs to the experimental cgroup. If it does, the failure injection scenario is followed to decide whether the process must be run normally or killed.

In practice, a process can be executed in a given cgroup by using the command cgexec. For instance, the command sleep can be executed in the cgroup xp on all the mounted controllers using the following command:

cgexec -g *:xp sleep 1

Distributed applications are often made of two parts: the run-time environment, and the application processes themselves. In the case of MPI programs, the application is supported by a distributed run-time environment which is made of a set of processes (one on each machine used by the computation) that are spawning, supporting and monitoring the application processes on their machine [6], [4]. The failure detector is usually implemented in this run-time environment. As a consequence, the processes of the run-time environment are the ones that need to be executed in the experimental cgroup. For instance, for Open MPI [8], the command used to start its run-time environment's daemons (called orted) can be specified on the command line. In order to execute these daemons in the cgroup xp on all the mounted controllers, the mpiexec command can receive the following option:

--mca orte_launch_agent 'cgexec -g *:xp orted'

Otherwise, if we want to run the application processes in the aforementioned control group, the parallel program can be executed by the cgexec command:

mpiexec -n 5 cgexec -g *:xp mpiprogram

IV. IMPLEMENTING FAILURE SCENARIOS WITH SYSTEMTAP

In this section, we present how failure scenarios can be implemented using SystemTap. We give two examples: a deterministic scenario (section IV-A) and a probabilistic one (section IV-B).

A process can be killed at kernel-level by canceling it (and freeing it) from the scheduler. When the kernel is about to schedule a process, it calls the context_switch function. As explained in section III-A, a Jprobe can be inserted when this function is entered. Then, we can see which control group the process belongs to (see section III) and, based on the failure scenario, decide whether or not to kill the process.

In figure 1 we give an example of a C function that can be compiled by SystemTap to find out which cgroup a process belongs to. Based on the unique pid of the process, the task_cgroup function gets the cgroup of this given task.

function find_cgroup:string(task:long) %{
    struct cgroup *cgrp;
    struct task_struct *tsk = (struct task_struct *)((long)THIS->task);

    /* Initialize with an empty value */
    strcpy(THIS->__retvalue, "NULL");

    cgrp = task_cgroup(tsk, cpu_cgroup_subsys_id);
    if (cgrp)
        cgroup_path(cgrp, THIS->__retvalue, MAXSTRINGLEN);
%}

Fig. 1. Finding out which cgroup a process belongs to.

SystemTap modules can call functions defined in the kernel. Hence, the free_task function can be called to cancel and free a process at the moment when it is about to be scheduled. We give an example of a piece of code that kills a process by calling this function from a SystemTap module in figure 2. Therefore, the process is canceled by the scheduler, but the operating system cannot do anything else. The I/O structures (e.g., network sockets) remain open, like with any real-life failure.

A. Deterministic failure scenarios

We can imagine the following deterministic failure scenario: after a certain time TIMEOUT, the process is killed. Hence, whenever the process is about to be scheduled, the SystemTap module needs to determine for how long it has been running. This information can be obtained from a field of the task_struct data structure used by the kernel (see section III for more information). On recent versions of the Linux kernel, a tapset function provides this information.

Figure 3 gives an excerpt of what can be used by SystemTap. The two functions task_stime_ and task_utime_ return respectively the system time and the user time spent by the process, obtained from the internal task_struct data structure. When the context_switch function is called, the module finds out which cgroup the process to be scheduled belongs to. If it belongs to the xp cgroup, the process is concerned by failure injection. The module then finds out for how long the process has been running. If it has been running for longer than TIMEOUT, the process is killed.

B. Probabilistic failure scenarios

We can imagine the following probabilistic failure scenario: every time a process is scheduled, it has one chance out of two (likelihood of 50%) to be killed. Hence, whenever the process is about to be scheduled, the SystemTap module generates a random number in [0, 2[, and if this number is larger than or equal to 1, then the process is killed.

Figure 4 gives an excerpt of what can be used by SystemTap. When the context_switch function is called, the module finds out which cgroup the process to be scheduled belongs to. If it belongs to the xp cgroup, SystemTap generates a random integer in [0, 2[ using the function randint. If this integer is larger than or equal to 1, the process is killed.

probe kernel.function("context_switch") {
    cgroup = find_cgroup($next)
    if ( cgroup == "/xp" ) {
        r = randint( 2 )
        if ( r >= 1 ) {
            skip_task( $next )
        }
    }
}

Fig. 4. Probabilistic failure injection scenario: every time it is about to be scheduled, the process has a certain probability of being killed.

V. CONCLUSION AND PERSPECTIVES

In this paper, we have presented FIST, a failure-injection technique that takes advantage of recent Linux kernel features, namely cgroups and SystemTap. This technique injects highly realistic failures, taking advantage of the fact that it kills processes directly in the kernel's task scheduler, making them die without notice. We have presented a method to use it to implement deterministic and probabilistic failure scenarios.

A. Limitations and how they can be overcome

The major limitation of this approach is that the SystemTap module must, of course, be executed with super-user privileges. This can be overcome by two approaches. The most generic one is to use sudo: the administrator of the machine can allow it on the modprobe command only. As a consequence, the only thing that users would be able to do is insert their modules into the kernel. The other way is to work on an experimental testbed that provides deployment software such as Kadeploy [12], for instance the Grid'5000 platform [5]. Kadeploy allows users to deploy their own system image on the nodes they have reserved and therefore have (temporarily) their own root access on these nodes.

B. Using this method to implement other failure models

In this work we have focused on fail-stop failures. However, the method can easily be extended to other semantics. For instance, probes can be added in the network stack. Messages can be dropped to inject omission failures, or sent twice to inject duplication.

function skip_task(task:long) %{
    struct task_struct *tsk = (struct task_struct *)((long)THIS->task);
    free_task( tsk );
%}

Fig. 2. Killing a process by freeing the corresponding task in the kernel’s scheduler.

function task_stime_:long(task:long) {
    if (task != 0)
        return @cast(task, "task_struct", "kernel")->stime;
    else
        return 0;
}

function task_utime_:long(task:long) {
    if (task != 0)
        return @cast(task, "task_struct", "kernel")->utime;
    else
        return 0;
}

probe kernel.function("context_switch") {
    cgroup = find_cgroup($next)
    if ( cgroup == "/xp" ) {
        t = task_stime_( $next ) + task_utime_( $next )
        if ( t > TIMEOUT ) {
            skip_task( $next )
        }
    }
}

Fig. 3. Deterministic failure injection scenario: after a certain time TIMEOUT, the process is killed.

REFERENCES

[1] Marcos Kawazoe Aguilera, Wei Chen, and Sam Toueg. Heartbeat: A timeout-free failure detector for quiescent reliable communication. In Marios Mavronicolas and Philippas Tsigas, editors, Proceedings of the 11th Workshop on Distributed Algorithms (WDAG'97), volume 1320 of Lecture Notes in Computer Science, pages 126–140. Springer, 1997.
[2] George Bosilca, Aurélien Bouteiller, Franck Cappello, Samir Djilali, Gilles Fédak, Cécile Germain, Thomas Hérault, Pierre Lemarinier, Oleg Lodygensky, Frédéric Magniette, Vincent Néri, and Anton Selikhov. MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes. In High Performance Networking and Computing (SC2002), Baltimore, USA, November 2002. IEEE/ACM.
[3] François Broquedis, Jérôme Clet-Ortega, Stéphanie Moreaud, Nathalie Furmento, Brice Goglin, Guillaume Mercier, Samuel Thibault, and Raymond Namyst. hwloc: a Generic Framework for Managing Hardware Affinities in HPC Applications. In Proceedings of the 18th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP2010), pages 180–186, Pisa, Italia, February 2010. IEEE Computer Society Press.
[4] Ralph M. Butler, William D. Gropp, and Ewing L. Lusk. A scalable process-management environment for parallel programs. In Jack Dongarra, Péter Kacsuk, and Norbert Podhorszki, editors, Recent Advances in Parallel Virtual Machine and Message Passing Interface, 7th European PVM/MPI Users' Group Meeting (EuroPVM/MPI'02), volume 1908, pages 168–175. Springer, 2000.
[5] Franck Cappello, Eddy Caron, Michel Dayde, Frederic Desprez, Yvon Jegou, Pascale Vicat-Blanc Primet, Emmanuel Jeannot, Stephane Lanteri, Julien Leduc, Nouredine Melab, Guillaume Mornet, Benjamin Quetier, and Olivier Richard. Grid'5000: A large scale and highly reconfigurable grid experimental testbed. In Proceedings of the 6th IEEE/ACM International Workshop on Grid Computing (SC'05), pages 99–106, Seattle, Washington, USA, November 2005. IEEE/ACM.
[6] Ralph H. Castain, Timothy S. Woodall, David J. Daniel, Jeffrey M. Squyres, Brian Barrett, and Graham E. Fagg. The open run-time environment (openRTE): A transparent multi-cluster environment for high-performance computing. In Beniamino Di Martino, Dieter Kranzlmüller, and Jack Dongarra, editors, Recent Advances in Parallel Virtual Machine and Message Passing Interface, 12th European PVM/MPI Users' Group Meeting, volume 3666 of Lecture Notes in Computer Science, pages 225–232, Sorrento, Italy, September 2005. Springer.
[7] Barbara Chapman, Gabriele Jost, and Ruud Van Der Pas. Using OpenMP: portable shared memory parallel programming, volume 10. MIT Press, 2008.
[8] Edgar Gabriel, Graham E. Fagg, George Bosilca, Thara Angskun, Jack J. Dongarra, Jeffrey M. Squyres, Vishal Sahay, Prabhanjan Kambadur, Brian Barrett, Andrew Lumsdaine, Ralph H. Castain, David J. Daniel, Richard L. Graham, and Timothy S. Woodall. Open MPI: Goals, concept, and design of a next generation MPI implementation. In Recent Advances in Parallel Virtual Machine and Message Passing Interface, 11th European PVM/MPI Users' Group Meeting (EuroPVM/MPI'04), pages 97–104, Budapest, Hungary, September 2004.
[9] Thomas Hérault, Thomas Largillier, Sylvain Peyronnet, Benjamin Quétier, Franck Cappello, and Mathieu Jan. High accuracy failure injection in parallel and distributed systems using virtualization. In Conf. Computing Frontiers, pages 193–196, 2009.
[10] William Hoarau, Pierre Lemarinier, Thomas Hérault, Eric Rodriguez, Sébastien Tixeuil, and Franck Cappello. FAIL-MPI: How Fault-Tolerant Is Fault-Tolerant MPI? In CLUSTER, 2006.
[11] William Hoarau and Sébastien Tixeuil. A language-driven tool for fault injection in distributed systems. In GRID, pages 194–201, 2005.
[12] Emmanuel Jeanvoine, Luc Sarzyniec, and Lucas Nussbaum. Kadeploy3: Efficient and Scalable Operating System Provisioning for Clusters. USENIX ;login:, 38(1):38–44, February 2013.
[13] Dirk Merkel. Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239):2, 2014.
[14] David Powell. Failure mode assumptions and assumption coverage. In FTCS, pages 386–395, 1992.
[15] Richard D. Schlichting and Fred B. Schneider. Fail stop processors: An approach to designing fault-tolerant computing systems. ACM Transactions on Computer Systems, 1:222–238, 1983.