
On Recent Advances on Stateful Orchestrated Container Reliability

Kęstutis Pakrijauskas
Faculty of Fundamental Sciences
Vilnius Gediminas Technical University
Vilnius, Lithuania
[email protected]

Dalius Mažeika
Faculty of Fundamental Sciences
Vilnius Gediminas Technical University
Vilnius, Lithuania
[email protected]

Abstract—Thanks to their flexibility and light weight, containers are becoming the primary platform to run microservices. Container orchestration frameworks – Kubernetes or Docker Swarm – enable companies to stay on the competitive edge by keeping the velocity of code deploys high. While containers are ideal for stateless workloads, using orchestrated containers for stateful services is an option too. Being a commodity and crucial to any business, state or, in other words, data has to be protected and be available. This research raises questions on what the reliability challenges of running stateful microservices are, and what the recent approaches to increasing the reliability of stateful services in orchestrated container systems are. A literature review was performed to answer the questions.

Keywords—microservices, containers, Kubernetes, stateful, failure, availability, review

I. INTRODUCTION

Software and technology are key to transforming organizations and delivering value to stakeholders and customers in modern times [1]. The success of a business closely depends on whether or not its systems are running at the desired state. Microservices, an implementation of Service-Oriented Architecture, allow companies to keep up with the demand to scale and roll new services out [2]. The challenges of running a microservice-based application are different from those of a monolithic application: monitoring, recovery, load balancing, etc. Data, or state, is an asset of any business. Modern systems are becoming highly complex and large in scale, according to the DevOps reports of 2018 [3] and 2019 [4], so high system reliability is a concern. Thus, enterprises build their systems with resilience and reliability in mind [5]. Components of microservices should be engineered and prepared for failure instead of attempting to ensure that no components fail. Both individual microservices and the system as a whole should be tolerant to failures and able to recover quickly [2], [11]. The downtime of a microservice-based system is decreased if a failed microservice returns online in a timely manner.

A stateless service is limited to its function – its output depends on its input. Recovery of stateless microservice components is straightforward: the components are ephemeral, there is no risk of data loss, and recreating a failed component is a small deal. However, it is a different matter with stateful microservices that deal with data. State is "a sequence of values in time that contain the intermediate results of a desired computation" [6]. State makes deployment, management, scaling, and replication a complex engagement. State has to be synchronized across multiple replicas in a microservice. Recovering a stateful microservice is not a trivial matter. A solid backup and recovery strategy, replication, sharding, etc. may not be enough to ensure high reliability of stateful microservices. Inter-service consistency of data may be violated if a single microservice was recovered to an earlier state. Data of a seemingly healthy stateful microservice can be corrupted, thus triggering a restore or rollback operation. Migration to a healthier node in the platform is a challenge with stateful microservices as well because of its resource-intensive and potentially disruptive nature.

Fig. 1. Comparison of monolith and microservice architectures

Container orchestration frameworks, such as Kubernetes, Docker Swarm, or Mesos, were developed to manage containers at scale [7]. Such tools automate and abstract various microservice management tasks such as service discovery, storage orchestration, rollout and rollback, and resilience [8]. Given their role in running microservices, it is important to evaluate the reliability of applications deployed in container orchestration systems.

Reliability is commonly defined as "the probability that an item will perform a required function without failure under stated conditions for a stated period of time" [9]. Mathematical and statistical methods are used to quantify reliability; however, in practice the uncertainty may be too great for reliability to be calculated. Reliability has become an important effectiveness parameter as the cost and complexity of systems increase. According to the ISO/IEC 25010 standard, reliability is a characteristic of the systems and software product quality model. This paper aims to discover approaches applicable to stateful microservices to satisfy the sub-characteristics of reliability as described in the standard [10]:

• maturity: degree to which a system, product or component meets needs for reliability under normal operation

• availability: degree to which a system, product or component is operational and accessible when required for use

• fault tolerance: degree to which a system, product or component operates as intended despite the presence of hardware or software faults

• recoverability: degree to which, in the event of an interruption or a failure, a system, product or component can recover the data directly affected and re-establish its desired state

A prerequisite to reliability evaluation is setting the standard of performance. The standard of performance is defined by Service Level Agreements (SLA), Service Level Objectives (SLO) and Service Level Indicators (SLI) [12]. An SLI is a defined quantitative measure of an aspect of the level of a service. An SLO is the target range or value of SLIs. Setting SLOs is important to evaluate reliability, as it adds transparency and understanding of whether the level of a service meets the expectations or not. An SLA is an agreement with users on what to do with the service if it is not performing within the SLO.
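To make the SLI/SLO relationship concrete, the following minimal sketch (in Python, with hypothetical request fields and thresholds, not taken from any of the reviewed studies) computes two common SLIs over a window of request records and checks them against SLO targets:

# Minimal sketch of computing SLIs from request logs and checking them
# against SLO targets; all names and thresholds are hypothetical.
from dataclasses import dataclass

@dataclass
class Request:
    latency_ms: float
    status: int          # HTTP status code

def availability_sli(requests):
    """SLI: fraction of requests answered without a server-side error."""
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests) if requests else 1.0

def latency_sli(requests, threshold_ms=300.0):
    """SLI: fraction of requests served faster than the threshold."""
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests) if requests else 1.0

SLO = {"availability": 0.999, "latency": 0.95}   # target values for the SLIs

def meets_slo(requests):
    measured = {"availability": availability_sli(requests),
                "latency": latency_sli(requests)}
    return {name: measured[name] >= target for name, target in SLO.items()}

# Example window: one slow request and one failed request out of four.
window = [Request(120, 200), Request(80, 200), Request(450, 200), Request(90, 503)]
print(meets_slo(window))   # {'availability': False, 'latency': False}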
Having an established service health baseline and a solid prediction of what is to happen with a microservice is not enough. Orchestrated container systems have a rich selection of settings and techniques that can be used to increase microservice reliability by taking a data-driven decision or detecting faults. This paper aims to summarize the available information on studies related to the reliability of stateful microservices in orchestrated container systems. The research questions are:

• RQ1: How and what kind of data can be used to make data-driven decisions on microservice reliability in orchestrated container systems?

• RQ2: What are the recent data-driven methods used to increase or predict stateful microservice reliability?

• RQ3: What are the recent approaches or techniques used to increase stateful microservice reliability in orchestrated container systems?

To the best of our knowledge, a summarization of existing evidence concerning the topic is lacking. The literature review was performed using the guidelines defined by Kitchenham and Charters [13]. The search was limited to digital libraries and search engines on the Internet. The search used the following terms: "stateful microservice" OR "stateful container orchestration" OR "microservice database" OR "container database" OR "microservice fault" OR "container fault". Further studies were identified by examining references in included articles.

II. CHALLENGES SUBJECT TO STATEFUL MICROSERVICES

A. Federated Multidatabase

Fine-grained microservices are developed and operated by independent teams. Each microservice can be independently deployed and scaled. Stateful services rely on their own persistent storage mechanism. To reduce coupling, integration at the storage level, i.e., using one shared data storage mechanism such as a database, tends to be avoided. Interaction between microservices should be limited to APIs. However, as there is no guarantee that a link to retrieve records from another microservice is valid, consistency becomes a challenge [14].

Microservices, being autonomous and independently deployed, may store data on a variety of platforms. Each microservice stores persistent data on its own private database. This data is accessible to other services only via an API. Relationships to other entities in the REST architecture are expressed as URI links, where a URI is a Uniform Resource Identifier that globally addresses the referenced entity. The lifecycle of microservices is independent, thus databases are backed up periodically and independently. In case of recovery, links between microservices may be broken due to the inconsistent state of microservices after data was restored from a backup on one of them [15].

Microservice architecture is designed to survive its individual components failing. Stateless and stateful services can be recovered independently. As data in stateful services can be recovered from a backup, there is a question whether the restored data is consistent with the data on other microservices. The challenge is ensuring data consistency among multiple microservices, and how and when to perform backup operations [14].

Databases of microservices can be seen as a federated multidatabase – a hybrid between a centralized and a distributed database system: a database that is distributed for global users and centralized for local users. Each microservice treats its database as a centralized one, ensuring its durability and consistency [15]. However, managing overall consistency is a challenge because of distributed persistence. Foreign key relationships between databases of different microservices are represented as loosely coupled references such as URIs. There is no guarantee that a retrieved URI points to a valid record in another microservice.

Although backups of individual microservices can be successfully used for independent recovery, it is likely that the restored state will not be consistent with the state of the application. For example, if order information is stored across multiple stateful microservices, some of it may be lost in the event of recovery of an individual microservice. Thus, the state will not be consistent.
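A hedged sketch of the referential-integrity problem described above: after one service is restored from an older backup, URI references held by another service may no longer resolve. The service names, record layout, and the HTTP probe are illustrative assumptions, not part of the reviewed work:

# Sketch of a referential-integrity check between two independently backed-up
# microservices: order records reference customer records by URI.
import requests

def check_references(order_records, timeout_s=2.0):
    """Return the stored URIs that no longer resolve to a valid record."""
    broken = []
    for record in order_records:
        uri = record["customer_uri"]           # e.g. "http://customers/api/v1/customers/42"
        try:
            resp = requests.get(uri, timeout=timeout_s)
            if resp.status_code == 404:        # referenced entity is gone: broken link
                broken.append(uri)
        except requests.RequestException:      # referenced service unreachable
            broken.append(uri)
    return broken

# After the customer service is restored from an older backup, links held by
# the order service may point to records that no longer exist.
orders = [{"id": 1, "customer_uri": "http://customers/api/v1/customers/42"}]
print(check_references(orders))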
B. Backup Availability Consistency

Fine-grained and independent microservices may consist of many components, each possibly having its own mechanism of persistent data storage. Given the variety of data storage techniques and the large number of components, backing up an entire microservice application in a consistent manner is a challenge.

The BAC (Backup Availability Consistency) theorem states that "when backing up an entire microservices architecture, it is not possible to have both availability and consistency" [14]. The tradeoff is between independent microservices, which may lead to eventual inconsistency, and a consistent backup of microservices, which leads to locking the state of the application for a period of time, thus limiting application availability.

Inconsistency of services can manifest in the following ways:

• Broken link: a reference cannot be followed. For example, when a microservice is referencing data in an obsolete microservice that was restored from a backup.

• Orphan state: there is no reference to follow. For example, when data in a microservice is not referenced at all because the referencing microservice has no references to the orphan data.

• Missing state: state is obsolete. For example, two microservices have different states because one of them was restored from a backup.

There are three ways of dealing with broken links:

• Reconstruct the missing references manually.

• Accept the inconsistency.

• Use cached data.

Dealing with orphan state starts with identifying and flagging the orphan records. Once the records are flagged, they can be deleted or overwritten. Missing state can be reproduced from the source or replicated from other sources.

The challenges posed by BAC can either be acknowledged or avoided. Acknowledging implies that eventual inconsistency is accepted and that there are measures to deal and live with it. Avoiding the challenge of BAC can result in tighter coupling or other design solutions that make microservices more dependent on one another.
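The availability side of this tradeoff can be illustrated with a small sketch: to obtain a mutually consistent backup, every service is quiesced for the duration of the snapshot, which is precisely the loss of availability the theorem predicts. The Service interface below is hypothetical:

# Sketch of the BAC trade-off: all services are switched to read-only so that
# their backups describe one global state; writes are rejected meanwhile.
import contextlib

class Service:
    def __init__(self, name):
        self.name = name
    def set_read_only(self, flag):      # placeholder for a real admin API call
        print(f"{self.name}: read_only={flag}")
    def snapshot(self):                 # placeholder for a real backup call
        print(f"{self.name}: snapshot taken")

@contextlib.contextmanager
def application_freeze(services):
    """Quiesce all services so that their backups belong to the same state."""
    for s in services:
        s.set_read_only(True)           # writes rejected: availability is reduced
    try:
        yield
    finally:
        for s in services:
            s.set_read_only(False)

def consistent_backup(services):
    with application_freeze(services):
        for s in services:
            s.snapshot()                # all snapshots share the same frozen state

consistent_backup([Service("orders"), Service("customers")])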
C. Performance Overhead of Stateful Workloads

The performance overhead of Docker and of container orchestration frameworks – Docker Swarm and Kubernetes – was evaluated in a study by E. Truyen, D. Van Landuyt, B. Lagaisse and W. Joosen [16]. The study focuses on evaluating a CPU-intensive Cassandra workload.

The study found that deployments to Docker containers result in negligible performance overhead compared to deployment to hosts. However, Docker Swarm and Kubernetes deployments resulted in additional performance overhead related to network and volume plugins.

Network bridges are used in the two evaluated container orchestration frameworks to ensure network isolation between containers. These network plugins increase CPU utilization. Similar results were found in a study by E. Kim, K. Lee and C. Yoo [17].

In addition, persistent volume plugins make a large impact on the general database workload resource model. Even though the experimental Cassandra workload was CPU intensive, volume plugins introduced a performance bottleneck at I/O operations.

On one hand, the authors argue that container orchestration frameworks bring benefits to SLO-aware container scheduling. On the other hand, container orchestrators introduce additional performance overhead which has to be optimized.

D. Stateful Pod Rescheduling

A Pod is the smallest deployable unit in Kubernetes. It is a single container or a group of containers sharing underlying resources such as storage, network, and namespaces. Deployments and Jobs are used to start Pods running stateless applications, while a StatefulSet is used to launch stateful Pods [18].

Pod availability is guaranteed by a ReplicaSet. Even though it is the ReplicaSet that ensures the number of running Pods, updates to Pods and their replica count are made to the Deployment, which is a higher-level concept [18].

A StatefulSet "manages the deployment and scaling of a set of Pods, and provides guarantees about the ordering and uniqueness of these Pods" [18]. Similarly to a Deployment, a StatefulSet manages Pods based on an identical container specification. However, Pods in a StatefulSet are not interchangeable: the identifier of each Pod persists through rescheduling. A StatefulSet uses persistent volumes which, combined with stable Pod identifiers, make mapping a new Pod to persistent storage easier. A Pod retains its identifier in a StatefulSet if it is restarted.

Each StatefulSet has a defined VolumeClaimTemplate that provides persistent storage using PersistentVolumes. While Kubernetes Volumes are ephemeral and dependent on the Pod lifecycle, a PersistentVolume is independent from any Pod that is using it. It is the PersistentVolumeClaim that is a storage request by a Pod.

In a StatefulSet, the PersistentVolumeClaim used by a failed Pod is not removed, as it would be in the case of a Volume in a Deployment. The failed Pod is relaunched using the bound PersistentVolumeClaim. Thus, each Pod in a StatefulSet keeps using the same PersistentVolumeClaim across rescheduling [18].
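A minimal sketch of such a StatefulSet with a volumeClaimTemplate, written with the official Kubernetes Python client; the image, names and storage size are illustrative and not taken from the reviewed studies:

# Sketch: StatefulSet whose Pods each claim their own persistent volume via a
# volumeClaimTemplate, so a rescheduled Pod reattaches to the same storage.
from kubernetes import client, config

def mysql_statefulset(name="mysql", replicas=2):
    container = client.V1Container(
        name=name,
        image="mysql:8.0",
        env=[client.V1EnvVar(name="MYSQL_ROOT_PASSWORD", value="example")],
        ports=[client.V1ContainerPort(container_port=3306)],
        # Each Pod mounts the PersistentVolumeClaim created from the template.
        volume_mounts=[client.V1VolumeMount(name="data", mount_path="/var/lib/mysql")],
    )
    claim_template = client.V1PersistentVolumeClaim(
        metadata=client.V1ObjectMeta(name="data"),
        spec=client.V1PersistentVolumeClaimSpec(
            access_modes=["ReadWriteOnce"],
            resources=client.V1ResourceRequirements(requests={"storage": "10Gi"}),
        ),
    )
    return client.V1StatefulSet(
        metadata=client.V1ObjectMeta(name=name),
        spec=client.V1StatefulSetSpec(
            service_name=name,                       # headless Service giving stable DNS names
            replicas=replicas,
            selector=client.V1LabelSelector(match_labels={"app": name}),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels={"app": name}),
                spec=client.V1PodSpec(containers=[container]),
            ),
            # One claim per Pod; the claim outlives the Pod, so mysql-0 is
            # always relaunched with the same volume.
            volume_claim_templates=[claim_template],
        ),
    )

if __name__ == "__main__":
    config.load_kube_config()                        # assumes a reachable cluster
    client.AppsV1Api().create_namespaced_stateful_set(namespace="default",
                                                      body=mysql_statefulset())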
Kubernetes uses the Horizontal Pod Autoscaler (HPA), which scales the number of Pods up and down based on resource utilization, for example CPU or a custom-defined metric. However, additional application-specific actions, such as setting up replication, have to be completed if a stateful service is scaled out [18].

In case a stateful service is scaled out, data has to be replicated onto the new Pod. Depending on the storage mechanism, for example a MySQL RDBMS, the overall performance of the stateful microservice is degraded during the process of copying the data. In addition, while a stateful Pod is rescheduled, it may not be available to use because of Readiness and Liveness probes that wait until the Pod is fully operational.

III. IDENTIFYING SLIS AND SLOS TO DRIVE DECISION MAKING

Historic SLIs and SLOs can be the basis for data-driven decisions taken to proactively evaluate, improve, and maintain microservices. However, the question is how the historic data should be treated before it can be used to fuel different algorithms: machine learning, deep learning, neural networks, rule-based selection, etc. The abundance of methods and algorithms that may be used to digest SLI and SLO data makes it a challenge of its own.

A collection of any microservice monitoring, logging, or tracing records can be set as the SLI. Microservice monitoring consists of three components [2]:

• Metrics. This component consists of service telemetry data such as CPU, memory, and storage I/O usage.

• Traces. This component helps to understand how microservices interact with each other.

• Logs. Each service generates events that are logged. Event logs are essential to understand the activity of the system.

Each monitoring component can have indicators or features that are further divided into three groups [19]:

• Manageable features, such as resource allocation and other settings.

• Partially manageable features, which are SLIs of other microservices located on shared infrastructure such as Kubernetes nodes.

• Unmanageable features, which fully depend on the users of the microservice. These represent user demand and usage patterns, for example requests per second or the type of data.

As for the SLIs to be used as features to make data-driven decisions, Table I summarizes studies and the SLIs selected to drive their proposed methods.

TABLE I. SLIS USED AS FEATURES TO DRIVE PROPOSED MODELS IN DIFFERENT PAPERS

SLI                    Studies
CPU                    [19]–[25]
Disk usage             [20], [25]
Network utilization    [22]
Requests per second    [20], [22]

CPU utilization is the prime indicator of how significant the workload is. This is the case with stateful services as well: the workload of certain stateful applications, for example the Cassandra NoSQL database management system, can be CPU-bound [16]. Measuring disk usage is a forthright approach for stateful services, as stateful services operate with data that is persistently stored on disk. The impact of network utilization, notably network latency in the analyzed research [22], is important in distributed architectures. Measurement of requests per second is the universal indicator of how well the service is set up. In addition, this indicator may depend on unmanageable features of the service such as the type of data managed. SLOs, or the thresholds for SLIs, originate from the user that requires a certain level of performance.

Pre-processing and preparation of data is a crucial step for data-driven methods [19], [25]. Anomalies and improperly trained algorithms may result in poorer prediction results.
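As an illustration of how such features could be grouped and ordered before being fed to a prediction model (the metric names and sample values below are hypothetical):

# Sketch of assembling an SLI feature vector, grouped into manageable,
# partially manageable and unmanageable features as in [19].
import numpy as np

MANAGEABLE = ["cpu_limit_cores", "memory_limit_mib"]           # resource allocation settings
PARTIALLY_MANAGEABLE = ["node_cpu_util", "neighbour_disk_io"]  # SLIs of co-located services
UNMANAGEABLE = ["requests_per_second", "payload_kib"]          # user demand and usage patterns
FEATURES = MANAGEABLE + PARTIALLY_MANAGEABLE + UNMANAGEABLE

def feature_vector(sample: dict) -> np.ndarray:
    """Order a raw monitoring sample into a fixed-length vector."""
    return np.array([float(sample[name]) for name in FEATURES])

sample = {"cpu_limit_cores": 2.0, "memory_limit_mib": 4096,
          "node_cpu_util": 0.63, "neighbour_disk_io": 140.0,
          "requests_per_second": 820.0, "payload_kib": 12.5}
print(feature_vector(sample))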
IV. RECENT DATA-DRIVEN METHODS USED TO INCREASE OR PREDICT STATEFUL MICROSERVICE RELIABILITY

A. Resource Optimization

Autonomic vertical scaling comes in handy in dense containerized clouds where adding more nodes or instances is not an option, and thus horizontal scaling is no longer available. Podolskiy, Mayo, Koey et al. propose a method of deriving SLO-compliant resource allocation for containerized applications based on performance models and both single-objective and multi-objective optimization [19].

Out of three regression algorithms – linear regression, lasso regression, and random forest – the lasso model was deemed the most suitable because of the achieved R² coefficient of determination and because it is simpler than linear regression, although the latter had similar R² performance. Further on, multiple lasso regression models were evaluated: independent, to predict individual SLIs; application-wise, to predict SLO compliance for an entire application; SLI-wise, to predict an SLI for all applications; and all-targets, to predict SLOs for all applications. The results had shown that a polynomial model of degree 2 is sufficient for SLI prediction.

Analyzing the distributions of the 99th-percentile throughput and response time for all three applications had shown that anomalies make up to 8.2% of all observations. The impact of anomalies is evaluated by removing fractions of anomalies and re-evaluating the R² score. For example, removal of 11% of anomalies optimized the R² score.

Validation tests were performed: a preliminary test to acquire SLIs to be used as features in prediction modules that continuously allocate resources, and an evaluation test to acquire SLIs that are used in the lasso regression model to predict SLIs to be used in continuous constrained optimization and a limited brute-force search to model the desired SLOs. Validation results had shown that SLOs were violated only twice in 16 trials. Thus, the proposed performance modeling technique was deemed usable.
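The following is an illustration, not the authors' code, of the modelling choice described in [19]: lasso regression over degree-2 polynomial features predicting an SLI (here, response time) from a resource allocation and a load level. The data is synthetic and the SLO threshold is hypothetical:

# Lasso regression with degree-2 polynomial features for SLI prediction,
# trained on synthetic data purely for illustration.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.uniform([0.5, 100], [4.0, 2000], size=(500, 2))       # [cpu_cores, requests_per_second]
y = 50 + 0.04 * X[:, 1] / X[:, 0] + rng.normal(0, 5, 500)     # synthetic response time (ms)

model = make_pipeline(PolynomialFeatures(degree=2),
                      StandardScaler(),
                      Lasso(alpha=0.1))
model.fit(X[:400], y[:400])
print("R2 on held-out data:", r2_score(y[400:], model.predict(X[400:])))

slo_ms = 120.0
predicted = model.predict([[1.0, 1500.0]])[0]                 # candidate allocation: 1 core, 1500 req/s
print("SLO satisfied:", predicted <= slo_ms)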
The placement of Pods across distributed nodes becomes a problem which has to be solved at the time of Pod scheduling. To overcome this challenge, F. Rossi, V. Cardellini, F. Lo Presti and M. Nardelli propose to identify the relationship between application and system metrics [22].

The authors in their research applied a reinforcement learning solution which, based on experience, learns the most suitable scaling policy. They present ge-kube, an orchestration tool for Kubernetes.

One of the reinforcement learning challenges is to find the optimal balance between exploitation (using the effective actions) and exploration (searching for effective actions). At the Analysis stage, the reinforcement learning agent assesses the state of the application and updates the expected long-term cost (Q-function). At the Plan stage, the Replication Manager uses the reinforcement learning agent to identify which scaling action to take. In ε-greedy selection, the reinforcement learning agent chooses an exploration action with probability ε to improve its knowledge; with probability 1-ε the agent chooses the best-known action.
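A compact sketch of the ε-greedy selection and Q-function update just described; the state space, actions and cost signal are heavily simplified assumptions and are not taken from ge-kube itself:

# Epsilon-greedy action selection with a tabular Q-function update for a
# scaling agent; states, actions and cost are illustrative.
import random
from collections import defaultdict

ACTIONS = ["scale_in", "hold", "scale_out"]
EPSILON, ALPHA, GAMMA = 0.1, 0.5, 0.9

Q = defaultdict(float)                      # Q[(state, action)] = expected long-term cost

def choose_action(state):
    if random.random() < EPSILON:           # exploration: try a random action
        return random.choice(ACTIONS)
    return min(ACTIONS, key=lambda a: Q[(state, a)])    # exploitation: lowest expected cost

def update(state, action, cost, next_state):
    best_next = min(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (cost + GAMMA * best_next - Q[(state, action)])

def cost_of(cpu_util, replicas):
    """Hypothetical cost: penalise SLO-threatening utilisation and over-provisioning."""
    return (10.0 if cpu_util > 0.8 else 0.0) + 0.5 * replicas

# One interaction step: observe the state, act, observe the outcome, learn.
state = ("cpu_high", 2)                     # discretised CPU level, current replica count
action = choose_action(state)
next_state = ("cpu_ok", 3) if action == "scale_out" else state
update(state, action, cost_of(0.85, state[1]), next_state)
print(action, dict(Q))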
B. Machine Learning for Fault Prediction

With the goal to predict faults in distributed Kubernetes-based edge cloud environments, M. Soualhia, C. Fu and F. Khomh propose to use machine learning for fault detection and neural networks for fault prediction [25].

Support Vector Machine, Random Forest and Neural Network models are able to identify non-fatal disk and CPU faults with an F1-score of over 95%. Long Short-Term Memory and Convolutional Neural Networks can successfully predict faults in over 85% of cases.

V. TECHNIQUES TO INCREASE RELIABILITY OF STATEFUL MICROSERVICES IN ORCHESTRATED CONTAINER SYSTEMS

A. Evaluating Efficiency of Reliability Mechanisms

Container orchestrators have their own mechanisms to ensure high availability of the services they run. However, thanks to their flexibility, container orchestrators can be improved or extended to take custom actions. SLIs are the basis for evaluating how successful the improved mechanism is. At this stage, various indicators are used for evaluation: served requests per second [22], CPU and memory utilization [26], and duration of outage [27].

B. Kubernetes Operator

Kubernetes Operators allow to abstract some of the operations with stateless and stateful services: backups, deployment, scaling, failovers, etc. The idea behind them is to automate daily administration duties. However, Kubernetes Operators lack configuration safety, transparency and isolation [28].

Łaskawiec, Choraś, Kozik and Varadarajan [29] propose an Intelligent Operator. Their proposed operator gathers performance metrics and configuration samples. The collected data is used to train a machine learning model to predict the best configuration for a Kubernetes Operator. In addition, the Intelligent Operator blindly changes configuration parameters – performs an experiment – to evaluate how the performance parameters react to it.

C. Custom Scheduler for Kubernetes

The ge-kube proposed by F. Rossi, V. Cardellini, F. Lo Presti and M. Nardelli [22] uses a custom Kubernetes scheduler. In their network-aware Pod placement solution the custom scheduler takes a snapshot of the Kubernetes cluster and sends it to the Deployment Service, which makes a decision on Pod placement based on available resources and network delays. The decision is passed back to the custom scheduler, which then orders where to place the Pod.

In the experiment to validate the results, the authors of ge-kube employed network-aware, first-fit, and round-robin heuristic algorithms, compared them to the default Kubernetes scheduler, and also solved the Pod placement problem as an integer linear programming problem. A scaled-out Redis cluster was used to evaluate the algorithms. The number of requests per second was measured to evaluate optimal Pod placement on geo-distributed nodes. The first-fit, round-robin and default Kubernetes schedulers achieved similar performance, 15×10³ to 18×10³ requests per second, while Pods placed by the network-aware algorithm can achieve 44×10³ requests per second on average.

A custom scheduler was also created by Y. Yang and L. Chen [26]. In their proposed three-module architecture, the third, dynamic resource scheduling module applies prediction data to improve resource utilization. As the problem of their research was optimization of the Pod scheduling mechanism, the proposed model is evaluated by measuring how evenly CPU and memory utilization is distributed across Kubernetes nodes.

Custom schedulers enable fault-tolerant, redundant placement that ensures that a stateful microservice in an orchestrated container system performs as designed within SLO boundaries.

D. State Controller for Stateful Kubernetes Services

With the goal of increasing the reliability of stateful services, Abdollahi Vayghan, Saied, Toeroe, and Khendek propose a State Controller for Kubernetes [27]. The proposed State Controller integrates the concept of active and standby states with Kubernetes to improve the availability of stateful microservice applications. The idea is to assign labels to pods describing whether they are active or standby. The State Controller watches the state of each pod; if a pod with an active status fails, another pod is assigned the active status. The pod with the newly assigned active status becomes an entry point.

The State Controller can be integrated with both the StatefulSet and the Deployment Controller. In the case of a StatefulSet, the State Controller creates two pods with separate PersistentVolumes. It creates two services: one that exposes the active pod to clients, and another service that replicates data to the standby pod(s). With the Deployment Controller, the State Controller deploys pod replicas which share a PersistentVolume but have separate storage areas for each pod. In case of a failure, the State Controller will switch the pods and the service will resume the process from the last stored state.

The State Controller was evaluated with three service outage scenarios: due to container process failure, due to pod process failure, and due to node failure. In addition, OpenSAF, a middleware that implements the Availability Management Framework, was evaluated. The proposed State Controller can increase service reliability by 55% to 92% compared to the built-in Kubernetes capabilities.

TABLE II. COMPARISON OF DIFFERENT CONTROLLERS IN TERMS OF OUTAGE TIME

Outage type              Outage duration in seconds
                         Kubernetes    State Controller    OpenSAF
App Container Failure    2.2           1.2                 0.2
Pod Process Failure      2.1           0.7                 3.3
Node Reboot              164.5         2.9                 3.3
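A sketch inspired by the State Controller idea in [27], not the authors' implementation: watch the Pods of a stateful application and, when the pod labelled as active fails, promote a standby pod by patching its label so that the Service selector routes traffic to it. The namespace, application name and label keys are assumptions:

# Failover by relabelling: promote a standby Pod to active when the active
# Pod fails, using the official Kubernetes Python client.
from kubernetes import client, config, watch

NAMESPACE, APP = "default", "stateful-app"

def promote_standby(api: client.CoreV1Api):
    standbys = api.list_namespaced_pod(NAMESPACE,
                                       label_selector=f"app={APP},role=standby").items
    for pod in standbys:
        if pod.status.phase == "Running":
            patch = {"metadata": {"labels": {"role": "active"}}}
            api.patch_namespaced_pod(pod.metadata.name, NAMESPACE, patch)
            print("promoted", pod.metadata.name)
            return

def run():
    config.load_kube_config()
    api = client.CoreV1Api()
    w = watch.Watch()
    for event in w.stream(api.list_namespaced_pod, NAMESPACE,
                          label_selector=f"app={APP},role=active"):
        pod = event["object"]
        if event["type"] == "DELETED" or pod.status.phase == "Failed":
            promote_standby(api)     # failover: another pod becomes the entry point

if __name__ == "__main__":
    run()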
Such on LSTM and Grey Model,” in Proceedings of IEEE 14th International operator is automating performance tuning and configuration Conference on Intelligent Systems and Knowledge Engineering, ISKE management. 2019, 2019, pp. 701–707. [27] L. Abdollahi Vayghan, M. A. Saied, M. Toeroe, and F. Khendek, REFERENCES “Microservice Based Architecture: Towards High-Availability for Stateful Applications with Kubernetes,” Proc. - 19th IEEE Int. Conf. [1] A. McAfee and E. Brynjolfsson, “Investing in the IT that Makes a Softw. Qual. Reliab. Secur. QRS 2019, pp. 176–185, 2019. Competitive Difference,” Harv. Bus. Rev., vol. 86, pp. 98–107, 2008. [28] A. Mahajan and T. A. Benson, “Suture: Stitching safety onto kubernetes [2] S. Newman, Building Microservices: Designing Fine-Grained Systems. operators,” Conex. Student Work. 2020 - Proc. 2020 Student Work. Part O’Reilly Media, 2015. Conex. 2020, pp. 19–20, 2020. [3] A. Mann, M. Stahnke, A. Brown, and N. Kersten, “State of DevOps [29] S. Łaskawiec, M. Choraś, R. Kozik, and V. Varadarajan, “Intelligent Report 2018,” 2018. operator: Machine learning based decision support and explainer for [4] A. Mann, M. Stahnke, A. Brown, and N. Kersten, “State of DevOps human operators and service providers in the fog, cloud and edge Report 2019,” 2019. networks,” J. Inf. Secur. Appl., vol. 56, no. December 2020, p. 102685, [5] H. Adkins, B. Beyer, P. Blankinship, P. Lewandowski, O. Stubblefield, 2021. and A. Stubblefield, Building Secure and Reliable Systems. O’Reilly Media, Inc., 2020.