
2020 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom)

Continuous Debugging of Microservices

Hong Zhu, Ian Bayley
School of Engineering, Computing and Mathematics
Oxford Brookes University
Oxford OX33 1HX, UK
[email protected], [email protected]

Hongbo Wang
School of Computer and Communication Engineering
University of Science and Technology Beijing
Beijing 100083, China
[email protected]

Abstract—Debugging is one of the most difficult tasks during the development of cloud-native applications for the microservices architecture. This paper proposes a continuous debugging facility to support the DevOps continuous development methodology. It has been implemented and integrated into the Integrated DevOps Environment CIDE for microservices written in the agent-oriented programming language CAOPLE. The paper also reports controlled experiments with the debug facility. Experiment data show that the overhead is less than 3% of the execution time on average.

Keywords—Software-as-a-Service; Microservices; Cloud native applications; Integrated DevOps Environment; Debug facility; Continuous debugging

I. INTRODUCTION

Microservices is a software architectural style in which a cloud-native application consists of a large number of services that are distributed over a cluster of computers, running in parallel, and interacting with each other through service requests and responses [1]. These services are small scale, of fine granularity, and each realises one function only. The instances of a service, called agents in the literature and hereafter in this paper, may be created or terminated dynamically in response to changes in demand or the failure of other agents. Systems in the microservices architectural style can therefore achieve elastic scalability, optimal performance and fault tolerance [2]. Furthermore, the system can evolve without interruption to its operation, because instances of obsolete services can be gradually removed and replaced with agents of new services. In this way, the microservices style enables continuous testing (CT), continuous integration (CI) and continuous deployment (CD), all of which are crucial to cloud-native applications [3]. For this reason, they have been widely adopted by industry for cloud-native applications [4].

However, microservices also pose grave challenges to software development and maintenance. In particular, when a vast number of services are running in parallel in a large cluster of computers, it is notoriously difficult to diagnose the causes of failures; see, for example, [5]. One approach to fault localisation is static analysis, which relies on data saved to files during execution, such as log entries, and analyses them after execution. In this paper, however, we focus instead on dynamic debugging. This is the process of investigating a piece of software through controlled executions of the program and focused observations of its dynamic behaviour, in order to find out the cause of an unexpected behaviour, or failure, by locating defects in the program code. It is widely used throughout traditional software development because of its effectiveness. However, according to Levine, it is "totally missing from the toolbox of microservice developers" [5]. This paper proposes such a facility.

The main contributions of this paper are as follows.

First, we introduce the notion of continuous debugging to complement the existing continuous development methodology and analyse the requirements that debugging microservices places on debug tools.

Second, we propose a novel debugging facility that satisfies all the requirements of continuous debugging for microservices. It consists of tools to trace the execution of selected agent(s), to take a state snapshot of the agent(s), and to control the execution of the agent(s). The debug facility can be integrated into a software development and operation environment to support the principles of DevOps. In particular, it enables developers to directly debug microservices running in a cluster environment to realise the philosophy of "you develop it, you operate it".

Third, we demonstrate that the proposed debug facility is feasible by implementing it via modifying the CAVM virtual machine that runs microservice programs and by integrating it into the integrated DevOps environment CIDE.

Finally, we conducted controlled experiments to demonstrate the efficiency of the implementation of the debug facility. Our experiments show that the overhead can be less than a 3% increase of execution time compared with an equivalent virtual machine without such a debug facility. We have also demonstrated that the use of the debug facility imposes minimal interference in the execution of the microservices. For the trace operation, the overhead is less than 5% on average and less than 14% in the worst case. The time needed to take a state snapshot is only a few milliseconds, and it is linear in the number of variables in the microservice, the total volume of data held in the variables, and the number of messages in the incoming queue. For most microservices, such debug operations will only take a few milliseconds.

The rest of this paper is organised as follows. Section II analyses the requirements for debugging microservices and introduces the notion of continuous debugging. Section III presents a continuous debug facility, which has been designed and implemented for the agent-oriented programming language CAOPLE and integrated into the Integrated DevOps Environment CIDE. Section IV reports the experiments carried out to evaluate the overhead of the debug facility. Section V concludes the paper with a comparison of related work.

II. DEBUGGING MICROSERVICES

Interactive debugging is a process consisting of a linear sequence of interactions between a software developer and a piece of software. The developer issues commands to control the execution of the program and then observes its state. It is an integral part of the whole process of software development and operation. This section introduces the notion of continuous debugging in order to fit debugging into the methodology for developing cloud-native applications in the microservices architecture. We then analyse the specific requirements for debugging microservices.

A. Continuous Debugging

The DevOps methodology is the current best industrial practice for the development and operation of cloud-native applications with microservices, so it is important for a debug facility to fit into the DevOps methodology.

A key characteristic of the DevOps methodology is continuous development, which has implications for both individual components and the system as a whole. For components, continuity means that development activities proceed with minimal delay. For example, as soon as coding is finished, unit testing must begin, followed immediately by integration testing, then by deployment of the component to a stage environment, then user testing and so on. From this perspective, each component moves through the DevOps pipeline smoothly and continuously until it is delivered. For the system as a whole, continuity means that it is constantly changing and evolving as new components are added in, and old ones are modified and removed, simultaneously. System level releases and version updates are replaced by simultaneous evolutions of its components. While this is happening, the system must operate continuously without interruption.

The current theory and practice of continuous development consist of four elements: continuous testing, continuous integration, continuous deployment, and continuous delivery. Debugging is missing from the DevOps process and pipeline models. Debugging tools are not included in the suite of pipeline automation toolkits. Here, we define a fifth element: the notion of continuous debugging. For individual components, debugging activities should take place as soon as a failure occurs, whether it is detected by the system automatically or manually. This could be during coding, but it could also be just after a failure of unit testing or of integration testing in a stage environment, and of course, it should occur immediately after the component fails in a production environment. For the system as a whole, there should be no interruption to operation, as debugging should be applied to components in parallel with the development and operation activities on other components. For example, the completion of debugging for a component should trigger the subsequent activities in the pipeline.

The most fundamental principle of DevOps is to integrate development and operation. To support this principle means integrating debugging tools into the development environment as well as the operation environment, and including them in the automated pipeline of the DevOps process. The integration should be seamless, in keeping with the fundamental change that DevOps brings to the ownership of programs and projects [6], as summarised by the slogan "you develop it, you operate it".

B. Requirements for Debugging Microservices

In order to perform continuous debugging, the requirements are that debugging should be:

1) remote: the developer issues commands from their workstation to operate a microservice running on a separate remote machine, the one where the failure originally occurred. It can often be impossible to set up an environment that replicates the failure on the developer's own workstation.

2) parallel: it is possible for the developer to interact, i.e. issue commands and observe states, simultaneously with multiple services. This is required because a failure behaviour normally exhibits itself in the interactions between services, and the cause is normally not just a single bug in the service under investigation, but a combination of fault(s) in other services. The multiple services may even be on different machines, as the environment is distributed.

3) online: it is possible to enter and exit debug mode freely, resuming the service's normal operation once debugging has finished. Often it is necessary to examine many services to discover which is faulty, and we must be able to do this without affecting the normal operation of the service.

4) non-intrusive: it does not require instrumentation code to be inserted into the program and remain in the code during normal operation of the program. Such code would cause a significant overhead on system performance.

5) isolated: debugging one service does not affect the functional operation or performance of other microservices, so the impact on the system as a whole is minimised, which is particularly important in a production environment.

Clearly, the requirements given above should be sufficient to support debugging as soon as a failure occurs without affecting the operation of the system as a whole.

C. Weakness of Existing Debug Facilities

All existing modern IDEs integrate a software development environment with debugging facilities that typically include the following functions:

• Setting breakpoints in the code where the execution will stop, so that the program state can be observed. The execution can then be resumed when observations have been made.

• Executing the program step-by-step so these observations can be made after each step. The steps can be of different granularities: a machine instruction, a high-level language statement, a method call, etc.

• Inspecting the values of variables after the execution stops at a breakpoint or after a step.

However, the conventional debugging experience that these facilities provide does not meet the requirements set out in subsection II-B. It is not online because execution must start from the beginning. It is not remote because the debug tool runs on the same machine as the microservice in order to control execution. It is not isolated because when a breakpoint is set, all threads and processes that hit the breakpoint will be paused.

As far as we know, there is no existing debugging facility that meets the requirements of continuous debugging of microservices running on a cluster of machines [5]. More discussion of related work is given in Section V-A.

III. THE PROPOSED DEBUG FACILITY

In this section, we propose a new debug facility for continuous debugging of microservices. It has been implemented and integrated into the integrated DevOps environment CIDE [8] for developing microservices written in the service agent-oriented programming language CAOPLE [7]. We have also designed and implemented a command line interface, which also accepts scripts, so that the debug facility can be integrated with other tools in DevOps pipelines. The examples given in this section are from the current implementation.

The debug facility consists of three tools, described in the following three subsections; their implementation is briefly reported in subsection III-D. Each provides commands that the user/developer can issue to one or more selected agents to control their execution or display information about their state or behaviour. The agents need not be instances of the same microservice nor on the same machine.

A. Execution Trace

A trace is a sequence of the instructions that an agent executes in a particular period of time. The user can start tracing a set of selected agents by using the Start Trace command, and finish by using the Stop Trace command. During tracing, for each selected agent, the sequence of executed instructions is saved to a separate file on the machine where the agent is running. That recorded trace can be transmitted to the user's workstation by using the Get Trace command. Figure 1 shows an example of a trace.

Figure 1. An Example of Execution Trace

As shown in Figure 1, a trace begins with a header, containing the agent's universal unique ID, its caste name (i.e. the name of the microservice), the IP address where the agent is running, and the time when the trace started. Alongside each instruction executed, the trace records in readable format the time in milliseconds from the start of tracing, the line number of the source code statement from which the instruction was generated, and the program counter (PC), i.e. the address of the instruction.

B. State Snapshot

The execution state of an agent is characterised by the program counter, the values of its variables, the contents of its stack, and the messages in its queue still to be processed. The Get State command snapshots the first three of these and sends them as a single package to the user's workstation for viewing and analysis. Figure 2 shows an example of such a state snapshot. The header is similar to the trace information. Variables are listed with their names, data types and values; JSON format is used for structured data types such as arrays or records.

The message queue can be obtained with the Get Message Queue command. Each agent has its own message queue because interactions between agents occur with asynchronous messaging in the form of service request events and service response events. Figure 3 shows an example message queue for an agent. Note that each message contains the caste name, agent ID, agent IP address, event type and parameters.
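The paper does not specify the wire format of these commands. The following Python sketch only illustrates the kind of client a developer's workstation could use to issue Get State and Get Message Queue requests to the virtual machine hosting a remote agent; the JSON request/response layout, the port number and the field names are assumptions made for illustration, not the actual CAVM protocol.

```python
import json
import socket

def send_debug_command(vm_ip, agent_id, command, parameters=None, port=9500):
    """Send one debug command to the VM hosting an agent and return its reply.

    The JSON encoding, the newline-terminated framing and the port number are
    illustrative assumptions; the paper does not describe the real protocol.
    """
    request = {"command": command, "agent": agent_id,
               "parameters": parameters or []}
    with socket.create_connection((vm_ip, port), timeout=5) as conn:
        conn.sendall(json.dumps(request).encode("utf-8") + b"\n")
        reply = b""
        while not reply.endswith(b"\n"):
            chunk = conn.recv(4096)
            if not chunk:
                break
            reply += chunk
    return json.loads(reply.decode("utf-8"))

if __name__ == "__main__":
    # Hypothetical agent running on a remote node of the cluster.
    snapshot = send_debug_command("10.0.0.7", "agent-0001", "Get State")
    print("PC:", snapshot.get("pc"))
    for var in snapshot.get("variables", []):
        print(var["name"], var["type"], var["value"])

    queue = send_debug_command("10.0.0.7", "agent-0001", "Get Message Queue")
    print("pending messages:", len(queue.get("messages", [])))
```

Because each request names the agent and the machine it runs on, the same client can address several agents on different nodes in one debugging session, which is what the remote and parallel requirements of Section II-B ask for.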

Figure 2. An Example of State Snapshot

Figure 3. An Example of Message Queue Snapshot

Another way to obtain the state of an agent is to set a checkpoint, so that it saves the state information into a file when the program hits that specific point. Of course, multiple checkpoints can be set for one agent, and a checkpoint can be set on multiple agents. If an agent hits a checkpoint multiple times, a sequence of state snapshots will be recorded in one file. Each file holds the state of one agent.

The command Add Checkpoint adds one or many checkpoints to a set of selected agents; its parameter is a sequence of locations, which are either addresses of instructions in the object code of the agent or line numbers in its source code. The command Clear Checkpoint removes checkpoints from selected agents. Once checkpoints have been set, the Start Checking and Stop Checking commands, respectively, start and stop the checkpointing. The recorded state snapshots can be transmitted to the user's workstation by issuing the Get Checkpoints command.

Figure 4 shows an example of such a checkpoint record. We can see that two state snapshots were taken at time moments 2991 and 3001 for a single checkpoint set at line 9 of the source code and instruction 11 of the object code. Each record shows the values of the variables, the contents of the stack and the messages in the queue. Since there is a message in the queue from an agent of caste Peer at time moment 2991 but the queue is empty at time moment 3001, the message was processed in between these two time moments.

Figure 4. An Example of Checkpoint Records

C. Execution Control

The execution of a selected agent can be controlled by using the Pause, Step Forward and Resume commands. The Pause command temporarily halts execution of the selected agents and does so after the current instruction has completed to ensure that instructions are atomic. Once execution has been halted, the Step Forward command makes each selected agent execute one more instruction and halt again. The Resume command continues the normal executions of the selected agents.

Breakpoints can also be set for a set of selected agents. When the execution of an agent hits a breakpoint, it will halt. The Run to Next Breakpoint command will let each selected agent execute until it hits a breakpoint again. Like checkpoints, breakpoints can be specified with either source code line numbers or object code instruction addresses. Setting and removing breakpoints can be performed by using the Add Breakpoint and Clear Breakpoint commands on a number of selected agents, and each agent can have many breakpoints. Agents can have different breakpoints even if they are instances of the same caste.

Note that the Get State and Get Message Queue commands can be used both when the agent is executing and when it is paused. For example, using them before and after a Step Forward shows the effect of one individual instruction on the state of an agent.

D. Implementation of The Debug Facility

The debug facility has been fully implemented by modifying the CAVM virtual machine [7] and adding functions to receive the commands from users as service requests, execute the commands and respond with messages returned to the service requester.
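The modified CAVM itself is not listed in the paper. The following Python sketch only illustrates the general technique of checking debug flags between instructions, which is what allows the Pause command to take effect once the current instruction has completed and lets breakpoints and checkpoints be tested at instruction boundaries; all class and method names are illustrative, not the CAVM implementation.

```python
import time

class DebuggableAgentVM:
    """Illustrative interpreter loop with debug hooks between instructions."""

    def __init__(self, program):
        self.program = program          # list of callables, one per instruction
        self.pc = 0                     # program counter
        self.paused = False             # set by the Pause command
        self.step_requested = False     # set by the Step Forward command
        self.breakpoints = set()        # instruction addresses
        self.checkpoints = set()        # instruction addresses
        self.checkpoint_log = []        # recorded state snapshots

    def snapshot(self):
        # In the real facility this would include variables, stack and queue.
        return {"pc": self.pc}

    def run(self):
        while self.pc < len(self.program):
            if self.pc in self.breakpoints:
                self.paused = True
            if self.pc in self.checkpoints:
                self.checkpoint_log.append(self.snapshot())
            # Halt *between* instructions, so each instruction stays atomic.
            while self.paused and not self.step_requested:
                time.sleep(0.01)        # wait for Resume or Step Forward
            self.step_requested = False
            self.program[self.pc]()     # execute exactly one instruction
            self.pc += 1

    # Handlers for commands arriving as service requests from the workstation.
    def pause(self):
        self.paused = True

    def step_forward(self):
        self.step_requested = True

    def resume(self):
        self.paused = False
```

In a real virtual machine the command handlers would run on a separate thread that services the incoming debug requests, with proper synchronisation around the shared flags; the sketch keeps only the control-flow idea.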

The graphical user interface of the integrated DevOps environment CIDE [8] has been modified to include a set of buttons etc. for the user to issue commands to selected agents as service requests to the virtual machines where the agents are executing; it then receives the messages from the modified CAVM and displays the received data. Therefore, the debugging facility is integrated into the DevOps environment CIDE. Details of the implementation will be reported in a separate paper for reasons of space.

Figure 5 gives the GUI for the debug facility as a part of CIDE's runtime management of agents.

Figure 5. Graphical User Interface of the Debug Facility

The debugging facility can also be invoked using command line instructions, making it possible to write shell scripts that integrate with the other monitoring and analysis tools in a DevOps pipeline. The format of instructions is as follows:

me-dbg command {agentID@IP,}+ [parameters]

The debug commands are listed in Table I.

Table I. COMMAND LINE DEBUG INSTRUCTIONS

Command            | Parameter  | Meaning
pause              |            | Pause the execution of the agent
resume             |            | Resume the execution of the agent
step               |            | Execute one more instruction
start trace        |            | Start tracing the agent
stop trace         |            | Stop tracing the agent
get trace          |            | Get recorded trace of the agent
get state          |            | Get the state of the agent
get msg            |            | Get the agent's messages in queue
set checkpoints    | Points     | Set checkpoints to an agent
clear checkpoints  |            | Remove the agent's checkpoints
start check        |            | Start to record the agent's state at checkpoints
stop check         |            | Stop recording the agent's state at checkpoints
get checkpoints    |            | Get recorded states of the agent at checkpoints
set breakpoints    | Points     | Set breakpoints to the agent
clear breakpoints  |            | Remove all breakpoints of the agent
run to breakpoints |            | Execute the agent to the next breakpoint
clear              | IP Address | Remove debug data on the machine

The debug facility as implemented above enables debugging activities to be conducted on microservices remotely, online, and in isolation; the microservices are agents executing in a distributed and parallel fashion. It does not require instrumentation of the code, so it is non-intrusive. All the requirements of Section II-B are satisfied. The debug facility is integrated into the DevOps environment and the debugging commands can be scripted. In this way, it supports continuous debugging and the integration principle of DevOps.
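As an illustration of how these command line instructions might be scripted in a pipeline step, the following Python sketch drives the me-dbg tool through subprocess calls. The agent identifiers and IP addresses are placeholders, and the assumptions that me-dbg is on the PATH, accepts a multi-word command as a single argument and prints its results to standard output are made for illustration only.

```python
import subprocess

# Agents under investigation: two instances of the same caste on two nodes.
# The IDs and IP addresses below are hypothetical.
AGENTS = ["agent-0001@10.0.0.7", "agent-0002@10.0.0.8"]

def me_dbg(command, agents, parameters=None):
    """Invoke the me-dbg command line tool once for a group of agents."""
    argv = ["me-dbg", command, ",".join(agents)]
    if parameters:
        argv.extend(parameters)
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout

if __name__ == "__main__":
    # A possible pipeline step: record the agents' states at source line 9,
    # let the system run while a test suite reproduces the failure, then
    # collect the recorded snapshots for offline analysis.
    me_dbg("set checkpoints", AGENTS, ["9"])
    me_dbg("start check", AGENTS)
    # ... run the workload that reproduces the failure ...
    me_dbg("stop check", AGENTS)
    print(me_dbg("get checkpoints", AGENTS))
```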
IV. EVALUATION

The implementation of the debug facility modifies the virtual machine CAVM, which forms the runtime environment of CAOPLE programs. Thus, it brings additional runtime overhead to the performance of the programs. Controlled experiments have been conducted to evaluate this overhead. This section reports the findings.

A. Experiment 1

Experiment 1 is designed to answer the following research question.

• RQ1: In terms of performance, how does the system with the debug facility compare with the system without?

To answer this question, a benchmark called BM1 of six programs was designed and coded, and run on the original CAVM (without the debug facility) and the modified CAVM (with the debug facility). As shown in Table II, the benchmark combines numerical and text processing with invocations of library functions, system functions, and file operations. Each was executed 100 times consecutively and the average taken.

Table II. PROGRAMS IN BENCHMARK BM1

Program | Function                                                           | Main features
P1      | Summation of a list of 400 integers from 0 to 399                 | Numerical calculation
P2      | Generate 500 random numbers                                        | Library function call
P3      | Probes and displays the workload on the computer 100 times        | System function calls
P4      | Read 200 lines of text from a file and display them on the screen | File operations, text processing
P5      | Take 500 readings of the CPU usage and calculate the average      | System function calls, numerical calculation
P6      | Calculate the average of 100 random numbers                       | Library function call, numerical calculation

The experiment is repeated on two computer systems: a desktop computer and a server in a cluster; see Table III.

Table III. THE EXPERIMENT PLATFORM

Specification   | Desktop PC            | Server
OS              | Windows Vista         | Windows Server 2012 R2
No. of Nodes    | 1                     | 16
No. of Cores    | 4                     | 4
Memory Size     | 8GB                   | 32GB
Hard Drive Size | 300GB                 | 300GB
CPU             | Intel Core 2 2.67GHz  | Intel Xeon E3-1230v5 3.41GHz

Table IV shows the average execution time (column Time) for each benchmark program without the debug facility, the additional execution time (column Diff) when running with the debug facility, and the increase in running time as a percentage (column Rate).

Table IV. RESULTS OF EXPERIMENT 1

        |          Server           |        Desktop PC
Program | Time (ms) | Diff  | Rate % | Time (ms) | Diff  | Rate %
P1      | 106.02    | 1.16  | 1.09   | 172.67    | 3.03  | 1.75
P2      | 131.01    | 3.44  | 2.63   | 171.12    | 3.99  | 2.33
P3      | 117.14    | 1.19  | 1.02   | 217.79    | 2.34  | 1.07
P4      | 14.6      | 0.52  | 3.56   | 79.44     | 1.34  | 1.69
P5      | 367.48    | 0.05  | 0.01   | 591.99    | 11.62 | 1.96
P6      | 6.18      | 0.42  | 6.80   | 11.28     | 0.27  | 2.42
Avg     | 123.74    | 1.13  | 2.52   | 207.38    | 3.77  | 1.87

We can see that the overhead of the debug facility is negligible. On average, it is 2.52% for the server cluster and 1.87% for the desktop PC. The nondeterminism due to OS scheduling, JIT compilation, etc. has a greater effect.
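The driver scripts used to run these benchmarks are not included in the paper; a comparison of this kind could be produced by a timing harness along the following lines, in which run_benchmark is a stand-in for launching one benchmark program on either the original or the modified CAVM.

```python
import statistics
import time

def run_benchmark(program, runs=100):
    """Execute one benchmark program `runs` times; return the mean time in ms."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        program()
        times.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(times)

def overhead(baseline_ms, with_debug_ms):
    """Absolute (Diff) and relative (Rate %) overhead as reported in Table IV."""
    diff = with_debug_ms - baseline_ms
    return diff, 100.0 * diff / baseline_ms

def p1():
    # Toy stand-in for benchmark P1 (summing a list of 400 integers).
    return sum(range(400))

if __name__ == "__main__":
    base = run_benchmark(p1)   # would run P1 on the original CAVM
    dbg = run_benchmark(p1)    # would run P1 on the modified CAVM
    print(overhead(base, dbg))
```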



B. Experiment 2

This experiment aims to investigate the impact of debug operations on the execution speed of the agent. Note that tracing an agent's execution will record every instruction executed by the agent, and thereby increase its execution time. Since the debug facility only affects the agent being debugged, the impact will be limited to the agent itself. For state snapshot operations, the agent's execution must be suspended while a state snapshot is carried out to ensure data integrity. The impact of that state snapshot is determined by the length of the time for which the agent must be suspended for it to happen. Therefore, we have the following research questions.

• RQ2.1: When an agent is running with tracing activated, how much will its performance degrade?

• RQ2.2: How long does it take to get an agent's state snapshot?

• RQ2.3: How long does it take to get the list of messages in an agent's message queue?

The following experiments were designed and conducted to answer these questions.

1) Experiment 2.1: To answer research question RQ2.1, the same benchmark BM1 is used, first with tracing enabled and then without. As before, each program in the benchmark is repeated 100 times and the average is calculated. The experiment is repeated on the same computer systems as in Experiment 1. Table V shows the average execution times without tracing in column Time, the increase in execution time in column Diff when tracing is switched on, and the percentage increase in column Ratio. The results show that the overhead of tracing is 4.21% for the server and 3.64% for the desktop PC.

Table V. RESULTS OF EXPERIMENT 2.1

        |          Server            |        Desktop PC
Program | Time (ms) | Diff | Ratio % | Time (ms) | Diff  | Ratio %
P1      | 107.18    | 0.61 | 0.57    | 175.7     | 10.29 | 5.86
P2      | 134.45    | 3.04 | 2.26    | 175.11    | 14.21 | 8.11
P3      | 118.33    | 0.93 | 0.79    | 220.13    | 5.22  | 2.37
P4      | 15.12     | 0.99 | 6.55    | 80.78     | 0.05  | 0.06
P5      | 367.53    | 7.07 | 1.92    | 603.61    | 0.23  | 0.04
P6      | 6.6       | 0.87 | 13.18   | 11.55     | 0.62  | 5.39
Average |           |      | 4.21    |           |       | 3.64

2) Experiment 2.2: To answer research question RQ2.2, a new benchmark BM2 was designed, in which each program is characterised by two parameters: the number of variables and the data size of each variable. Using BM2, the relationship between these two factors and the time taken to snapshot can be studied. When each code sample in BM2 was run, 10 state snapshots were taken, and the average execution time was calculated.

Table VI shows the execution times in column T, and the volumes of data in column V, for various numbers of variables (Var).

Table VI. RESULTS OF EXPERIMENT 2.2

Statistical analysis reveals that the execution time for taking a state snapshot is linear in both the number of variables and the volume of data. For example, Figure 6(a) shows the execution times as the number of variables varies from 1 to 50, with each variable being a list of 1000 integers. Figure 6(b) shows the execution times for taking a state snapshot as the total volume of data varies from 20 KB to 200 KB when there are 50 variables.

Figure 6. Analysis of The Result of Experiment 2.2. (a) Time (ms) to take a state snapshot for x variables (x=1..50), where each variable is a list of 1000 integers. (b) Time (ms) to take a state snapshot of 50 variables, where the volume of data in each snapshot varies from 20 KB to 200 KB.

3) Experiment 2.3: To answer research question RQ2.3, a third benchmark BM3 was designed, in which each program is characterised by the length of the message queue that the agent will have. When a program in BM3 was run, the Get Message Queue operation was performed 10 times and the average execution time calculated.

The experiment was conducted in two scenarios: an ideal scenario in which there were no other agents running on the machine, and a non-ideal workload scenario in which there were a number of other agents running on the same machine at a nearly saturated workload.

Figures 7(a) and (b) show the distributions of the times taken to take a snapshot of the message queues in the two scenarios. In the ideal scenario, snapshotting was not interrupted by other agents (i.e. threads and processes) running on the same machine. The execution time was linear in the number of messages in the queue. In the workload scenario, the time to snapshot was affected in this way, but statistically the same linear increase pattern can be observed.

Figure 7. Times to snapshot a message queue. (a) In the ideal scenario. (b) In the workload scenario.

C. Conclusions of the Experiments

From the experimental data, we can draw the following conclusions.

• The overhead of the debug facility is less than 3% on average, so it should not be noticeable to the user.

• The overhead of the trace function is less than 5% on average. In the worst case, it was 13.18%, but this was as a replacement for traditional step-by-step interactive debugging, so the increase in execution time is acceptable.

• The overhead of taking a snapshot of an agent's state is negligible. Even if there are 50 variables each holding 1000 integers (200 KB), the time taken is only a few milliseconds (6 ms). The execution time is linear in both the number of variables and the total size of the data.

• The execution time to take a snapshot of the message queue is linear in the number of messages in the queue in the ideal scenario, when no other concurrent thread or process interrupts the snapshot operation. In a heavy workload scenario, when a snapshot can be interrupted, it is still linear in the number of messages in the queue, statistically speaking.

Therefore, the overhead of the debug facility as a whole is acceptable.

V. CONCLUSION

A. Related Work

Debug facilities are an integral part of modern integrated software development environments (IDEs) like Eclipse, NetBeans, IntelliJ, XCode, Visual Studio, etc. They provide the functions for setting breakpoints, stepping through the program and inspecting the memory state.


A typical example is the GNU Project Debugger GDB, designed for offline and local debugging. Debug facilities and tools have also been developed for various programming languages, such as dlv for Go, ptvsd for Python, etc. As discussed in Section II-C, they do not meet all the requirements of debugging microservices, nor do they support continuous debugging as a part of the DevOps pipeline. As Levine pointed out, the debugger "is totally missing from the toolbox of microservice developers" [5].

The closest matches are works on debugging programs running on supercomputers [9], the BigDebug facility for debugging MapReduce applications for Big Data analysis [11], and two more recent developments: Cloud Debugger for Google's Cloud Platform for microservices [14] and Squash [17]. They are discussed below.

To enable the debugging of parallel programs running on supercomputer architectures, Jin et al. [9] developed a library for launching a debug facility simultaneously on the nodes in a supercomputer from the front-end, and sending the data collected back to the developer's workstation. The main function of the library code is the communication between the front-end and back-end in supercomputer systems. A similar work is reported in [10], addressing the same problem.

In the context of Big Data analysis applications, Gulzar et al. [11] developed a debug facility called BigDebug to extend the Spark system. It provides a set of debug primitives to enable debugging of the MapReduce type of computation on Hadoop clusters. Their debug facility includes the following functions.

• Simulated breakpoint, which enables the user to inspect intermediate results at a given "breakpoint" and then resume the execution. This creates the illusion of a breakpoint, even though the program is still running on the cloud in the background.

• Guarded watchpoint, which enables the user to query a subset of data matching a given guard condition.

• Data trace, which enables the user to trace forward and backward through the processing of an individual data record to identify the origin of the final or intermediate output.

• Crash culprit and remediation, which sends all required information to the driver when a crash occurs so that the user can determine the culprit and take actions to fix the code and then carry on the computation.

Compared to our debug facility, the overhead of BigDebug is much higher: up to 24% for recording tracing, 19% for crash monitoring, and 9% for watchpoints [11]. Moreover, BigDebug is only applicable to the MapReduce type of dataflow computations. It is not applicable to microservices.

Tracing has been widely used in practice for debugging concurrent systems and software running on parallel hardware architectures such as multi-core systems [12]. Google's Dapper provides context-based tracing in particular. It relies on the homogeneous infrastructure of common RPC libraries to minimise the instrumentation burden. Its call graph model and architecture have become the de facto standard for trace collection. Zipkin, created at Twitter, is an open source clone of Dapper. Zipkin and its derivatives, including Amazon's X-Ray, are in widespread use. However, as Alvaro [13] pointed out, despite the fact that distributed systems are a mature research area in academia and are ubiquitous in industry, the art of debugging distributed systems is still in its infancy.

A recent development in industry is Google's Cloud Debugger [14]. It has the features of remote online debugging, which is called real-time debugging in [14]. It provides two debugging tools: snapshots and logpoints. The former gets the values of selected variables when the execution of an instance hits a snapshot location in the code, and sends the values to the user's workstation. The latter generates a log entry in the target log system when the execution of an instance hits a logpoint location in the code. Both of them are similar to our checkpointing functions, but there are at least three important differences. Cloud Debugger does not collect information about message queues, which is vital for debugging microservices. Its snapshots and logpoints are applied to "all instances of the app", rather than just the selected instances, as in our approach.

Its snapshots require the user to specify which variables are required, whereas we obtain the values of all valid variables, including dynamically-created variables.

Looking more closely at the last of these differences, although selecting just some of the variables can reduce the amount of data transferred to the user's workstation, this is offset by the longer computation time required to set the snapshot, which takes about 40 seconds [14]. In our approach, the equivalent task of setting a checkpoint can be done instantly. Moreover, the user may not be able to know the names of variables if they do not have the source code, and even if they did, they would not be able to specify dynamically generated variables and automatically generated internal variables. Finally, user-specified variables may not be in scope for some locations of the program.

It is worth noting, moreover, that Cloud Debugger does not have any tool for controlling the executions of the application under debug. Finally, in order to reduce the overhead, Cloud Debugger restricts the time period for which a snapshot or logpoint is effective and enables the user to set conditions on the functions. In spite of this, Cloud Debugger's overhead is still higher than ours. It causes an additional latency of about 10ms on average for each service request [14]. In less than 10ms, our debugger can get the values of more than 50 variables and 200KB of data (equivalent to 50,000 integers); see the experiment data in Figure 6 of Section IV-B2.

Squash is an open source project of solo.io that provides an interface between an existing IDE and the software running on a remote machine for testing microservices, using existing debug facilities like dlv for Go, ptvsd for Python and the GNU Project Debugger gdb [17], [18]. Currently, it supports the VS Code, IntelliJ and Eclipse IDEs for microservices running on Kubernetes and OpenShift platforms and written in the programming languages (microservices frameworks) Go, Java, JavaScript (Node.js) and Python [18]. It completely relies on existing debuggers to perform debugging activities, but provides no new debugging facilities. Therefore, it only meets some of the requirements of debugging microservices that we recognised in Section II-B; for example, it does not enable online debugging of microservices.

In the wider context of site reliability engineering, the current trend in industry is away from monitoring towards observability engineering (or observability for short), which consists of four pillars: monitoring, alert and visualisation, distributed system tracing, and log aggregation and analysis [20]. It gives better support for debugging than monitoring does by providing more information, especially the "highly granular insights into the behaviour of systems". However, as Sridharan pointed out [15], observability is not debugging, which is characterised as "an iterative process which involves introspection of the various observations and facts reported by the system, making the right deductions and testing whether the theory holds water". Thus, observability is not sufficient for debugging microservices.

Both observability techniques and debug facilities provide a means of observation for microservices. However, there are subtle differences. For example, considering observability as a system quality attribute like usability, efficiency, maintainability, testability, etc., Baron Schwartz defines observability as "a measure of how well internal states of a system can be inferred from knowledge of its external outputs" [16]. Indeed, existing observability techniques treat microservices as a black box and only observe external features. In contrast, when debugging, one does not only observe a system's external outputs, but perhaps more importantly, its internal states too. Moreover, controllability, as the mathematical dual of observability in control theory, is equally important for debugging but completely missing in the observability toolbox. Many debugging facilities, including ours, not only contain tools for observation of a system's behaviour and state, but also tools to control the execution of the system.

There are many tools available for each of the four pillars of observability. These tools are increasingly integrated together, as with the Google Cloud operations suite and its predecessor Stackdriver [19], and they help to diagnose failures and their causes. For example, through telemetry visualisation and alert tools, one can recognise whether a failure occurred in the system, and predict whether a failure could occur or is in progress. Through message tracing and log aggregation and analysis tools, one can identify which machine and which microservice caused a problem. Observability tools are widely used in industry to locate bugs in a microservice system [21], and are becoming an active research subject. For example, Shang et al. [22] proposed a data mining approach to the analysis of log files to identify failures of deployment of big data analysis applications to Hadoop clusters. Jia et al. [23] also employed data mining techniques to analyse log files of cloud platforms to pinpoint bug-induced software failures. They can identify which processes cause the system failure, but not where exactly this happened in the code. It is highly desirable to integrate debugging tools into the microservice development toolbox. For example, the Google Cloud operations suite has recently included the debugging tool Cloud Debugger [19].

B. Future Work

It is interesting to study the effectiveness of the proposed debug facility in practice, for example, through empirical studies with professional software developers and using a benchmark like DbgBench [24]. It is also worth noting that in recent years there has been a rapid growth in research on automated debugging; see [25] for a recent survey of research on the topic. Execution traces (also called execution profiles in the literature) are the main input to such automated debugging algorithms. It may be interesting to investigate how to link our trace tool to such automated debugging tools.

The implementation of our debug facility is through modification of the language's virtual machine (i.e. the runtime environment). The approach should be applicable to other languages that are implemented by using language virtual machines, such as Java, Python, etc. It will be interesting to explore how to implement a similar debug facility for Java and Python.

REFERENCES

[1] J. Lewis and M. Fowler, Microservices, Online at http://martinfowler.com/articles/microservices.html, 25 Mar. 2014. Last access: 27 Jun. 2020.

[2] S. Newman, Building Microservices: Designing Fine-Grained Systems, O'Reilly, Feb. 2015.

[3] L. Krause, Microservices: Patterns and Applications, Microservicesbook.io, April 2015.

[4] H. Zhu and I. Bayley, If Is The Answer, What Is The Question? A Case for Paradigm Shift Towards Service Agent Orientation, in Proc. of SOSE 2018, pp152-163, Mar. 2018.

[5] I. Levine, Squash: Microservices Debugger. Online at https://medium.com/solo-io/squash-microservices-debugger-5023e27533de, 5 Mar. 2018. Last access: 25 Jun. 2020.

[6] S. Sharma and B. Coyne, DevOps for Dummies, 2nd IBM Limited Edition, John Wiley & Sons, 2015.

[7] C. Xu, H. Zhu, I. Bayley, D. Lightfoot, M. Green, and P. Marshall, CAOPLE: A Programming Language for Microservices SaaS, in Proc. of SOSE 2016, pp42-52, Apr. 2016.

[8] D. Liu, H. Zhu, C. Xu, I. Bayley, D. Lightfoot, M. Green, and P. Marshall, CIDE: An Integrated Development Environment for Microservices, in Proc. of SCC 2016, pp808-812, Jun. 2016.

[9] C. Jin, D. Abramson, M. N. Dinh, A. Gontarek, R. Moench and L. DeRose, A Scalable Parallel Debugging Library with Pluggable Communication Protocols, in Proc. of CCGrid 2012, pp252-259, May 2012.

[10] J. Zhang, Z. Luan, W. Li, H. Yang, J. Ni, Y. Huang and D. Qian, CDebugger: A Scalable Parallel Debugger with Dynamic Communication Topology Configuration, in Proc. of CSC 2011, pp228-234, Dec. 2011.

[11] M. A. Gulzar, M. Interlandi, S. Yoo, S. D. Tetali, T. Condie, T. Millstein, and M. Kim, BigDebug: Debugging Primitives for Interactive Big Data Processing in Spark, in Proc. of ICSE 2016, pp784-795, May 2016.

[12] A. Spear, M. Levy, and M. Desnoyers, Using Tracing to Solve the Multicore System Debug Problem, IEEE Computer 45(12), pp60-64, Dec. 2012.

[13] P. Alvaro, OK, But Why? Tracing and Debugging Distributed Systems, Comm. of ACM, 60(7), pp46-47, Jul. 2017.

[14] Google, Cloud Debugger, Online at https://cloud.google.com/debugger. Last access: 6 Jun. 2020.

[15] C. Sridharan, Monitoring and Observability, Online at https://medium.com/@copyconstruct/monitoring-and-observability-8417d1952e1c, 4 Sept. 2017. Last access: 6 Jun. 2020.

[16] B. Schwartz, Monitoring Isn't Observability, Online at https://orangematter.solarwinds.com/2017/09/14/monitoring-isnt-observability/, 14 Sept. 2017. Last access: 6 Jun. 2020.

[17] D. Rettori, Squash: The Definitive Cloud-Native Debugging Tool, Online at https://itnext.io/squash-the-definitive-cloud-native-debugging-tool-89614650cc94, 8 Mar. 2019. Last access: 10 Jun. 2020.

[18] Solo.io, Debug Your Microservice Applications from Your Terminal or IDE While They Run in Kubernetes, Squash Official Website. Online at https://squash.solo.io. Last access: 25 Jun. 2020.

[19] Google Cloud Platform, Google Cloud Operations Suite Documentation, Online at https://cloud.google.com/stackdriver/docs. Last access: 25 Jun. 2020.

[20] A. Asta, Observability at Twitter: Technical Overview, Part 1, Online at https://blog.twitter.com/engineering/en_us/a/2016/observability-at-twitter-technical-overview-part-i.html, 18 Mar. 2016; Part 2, Online at https://blog.twitter.com/engineering/en_us/a/2016/observability-at-twitter-technical-overview-part-ii.html, 22 Mar. 2016. Last access: 8 Jun. 2020.

[21] V. Farcic, The DevOps 2.0 Toolkit: Automating the Continuous Deployment Pipeline with Containerized Microservices, LeanPub, 2016.

[22] W. Shang, Z. M. Jiang, H. Hemmati, B. Adams, A. E. Hassan and P. Martin, Assisting Developers of Big Data Analytics Applications when Deploying on Hadoop Clouds, in Proc. of ICSE 2013, pp402-411, May 2013.

[23] T. Jia, Y. Li, H. Tang and Z. Wu, An Approach to Pinpointing Bug-Induced Failure in Logs of Open Cloud Platforms, in Proc. of CLOUD 2016, pp294-302, Jun. 2016.

[24] M. Böhme, E. O. Soremekun, S. Chattopadhyay, E. Ugherughe, and A. Zeller, How Developers Debug Software - the DbgBench Dataset, in Proc. of ICSE-C 2017, pp244-246, May 2017.

[25] W. E. Wong, R. Gao, Y. Li, R. Abreu and F. Wotawa, A Survey on Software Fault Localization, IEEE Transactions on Software Engineering 42(8), pp707-740, Aug. 2016.
