A Guide to Your Cloud-Native Stack

Table of Contents

A Guide to Your Cloud-Native Stack

Orchestration Systems

Enhancing the Stack

Monitoring

Logging

Distributed Tracing

Observability

Service Mesh

Conclusion

A Guide to Your Cloud-Native Stack

Cloud computing has been with us for well over a decade now, and although its adoption has been steadily rising over that time, there are many reasons why the rate of adoption hasn't been swifter. Chief among them might be that organizations don't necessarily believe that a 'lift and shift' approach to workload migration provides much benefit.

Organizations want to transform the way they operate; to be more agile and responsive. Simply replicating a data center in the cloud is not going to bring about that transformation.

Forward-thinking organizations are beginning to make this transformation by embracing the cloud-native approach. This is not just about moving software applications to run on cloud infrastructure; it is also about building applications using the microservices architecture. This makes those applications more resilient and scalable, and more capable of benefiting from the on-demand, elastic nature of the cloud.

The cloud-native paradigm is not all sweetness and light, however; in exchange for the considerable benefits it affords, there is a price to pay in the form of increased complexity.

Application workloads tend to be highly distributed and often transient in nature, which makes them all the harder to manage in a production setting. This hasn't deterred the proponents of the cloud-native approach, though, and a lot of effort has been put into the creation of tools, methods, and standards to facilitate running these workloads, and to lower the bar to cloud-native adoption. We call this the 'cloud-native stack'.

Defining the cloud-native stack is a bit like trying to nail jelly to the wall! New companies, projects, tools, and approaches are appearing almost on a monthly basis. Trying to keep pace with the change is both challenging and time consuming. But some things are settled, and have become a fact of life. This guide discusses the nature of the cloud-native stack, and the choices that make up its composition.


Orchestration Systems

As the cloud-native approach has transitioned from hype to a level of maturity that has encouraged organizations to trust it with their business, so too has the technology that characterizes it. Not least, the orchestration element of the stack that hosts cloud-native applications.

In the beginning, with the advent of the Docker container format, there wasn't an easy way to run containerized microservices at scale. The Docker Engine was perfect for local development, but wrestling with the intricacies of managing distributed workloads was beyond its capabilities. As a basis for hosting cloud-native applications as distributed workloads, a system was required that allowed applications to scale easily and quickly. It also needed to provide resilience in the event of inevitable failures, support advanced deployment patterns, and facilitate automation for the delivery of frequent application updates.

Though there weren't any open source, purpose-built orchestration systems to provide these requirements early on, it wasn't long before a number of new projects materialized to fill the void. This list is not exhaustive, but does represent the contenders that have received the most interest and the largest contributions from the open source community. They are also the most widely adopted.

Swarm

The Docker project introduced 'Swarm Classic' toward the end of 2015, but quickly replaced it with a more functional 'in-Engine' version called 'Swarm Mode' in the middle of 2016.

Apache Mesos

A platform for pooling the physical or virtual resources of a datacenter, Apache Mesos can be used in conjunction with 'frameworks' (e.g. Aurora, Marathon) that provide the mechanisms for running distributed workloads.

Kubernetes

Borrowing and building on the concepts inherent in Google's proprietary Borg cluster manager, Kubernetes is a very popular open source project supported by many influential organizations, such as Google, Microsoft, and IBM.


In a relatively short time, Kubernetes has emerged as the de-facto choice for running cloud-native applications. There may be many reasons for this, but prime among them is the number of different organizations that have coalesced around Kubernetes, turning it into a well-respected, open, community-driven project.

That's not to say that Swarm doesn't continue to play a part in the ecosystem; Swarm has a very low entry bar into the orchestration of cloud-native applications, and is still a foundational component of Docker Enterprise Edition. You'll also find that DC/OS, a distributed operating system based on Mesos, can itself host Kubernetes clusters - Kubernetes on Mesos, if you will.

If Kubernetes has won the battle for the hearts and minds of engineering teams as the de-facto standard choice for hosting cloud-native applications, then we might be excused for thinking that we're limited for choice. Whilst it's true that adopting and investing in the Kubernetes platform funnels you into a particular technology, it doesn't lock you into a particular vendor, unless you rely heavily on the value-add extras that particular vendors provide.

In some ways, it's like choosing Linux as your preferred host operating system (OS) kernel, and then electing to adopt Ubuntu as your preferred OS distribution. In the same way, because Kubernetes is open source, there is a wide choice available. Currently, there are over 50 different Kubernetes distributions from a myriad of vendors, each of which has been independently certified by the Cloud Native Computing Foundation (CNCF) as part of its Certified Kubernetes Conformance Program.

One other factor to consider when determining orchestration system choice is complexity. There is no doubt that Kubernetes (and the other competing platforms) is technically complex, and requires significant knowledge and skill to operate. Many organizations prefer to focus on developing the value that characterizes their business or cause, and to leave the task of running infrastructure and operations to third-parties (at least in part).

If DIY is your preferred approach, be prepared to make a considerable investment in terms of engineering skills and know-how.

Enhancing the Stack

Systems like Kubernetes provide a significant level of automation that makes orchestrating cloud-native applications possible. These systems can't be everything to everybody, though, and at some point there has to be some delineation in terms of purpose. Otherwise, we end up creating a functional monster and start to become more opinionated, with choice becoming the inevitable casualty.

To successfully operate a cloud-native stack, it's necessary to augment the orchestration system with capabilities that enable us to gain insight into the health of the stack. We need to know when individual services are ailing, whether there is unacceptable latency between services, whether a predetermined threshold of failed requests has been breached, and so on.

And we need to know about this health at all levels in the stack; right from the host machines, up to the application services themselves. Tools that enable us to observe what's going on at every level of the stack help us to maintain service levels, as well as gain valuable insight into application behavior. The insight we gain subsequently informs future development decisions.

Taking a peek under the hood of our applications running in production is no easy task, but there are some techniques and tools that can enhance our cloud-native stack, and allow us to begin to crack this difficult problem.

Monitoring

Monitoring is an operational practice that has been an essential ingredient for managing the availability of software applications for a number of years. It's a tried and tested technique for observing the state of health of a complete system, and has a mature set of tools that pre-date the cloud-native era. A lot of these tools are proprietary, but some have been developed using the open source software model.

Generally, these tools are agent-based, and collect metrics from the host operating system and the monolithic applications that run on top of them, using 'probes'. Their architecture is based on a different era in computing, and is not entirely suited to the elastic, on-demand nature of cloud computing environments.


In fact, their inadequacies are exposed further when we toss into the mix cloud-native applications, which are composed of multiple loosely-coupled, replicated services that have an ephemeral existence. Pinning down what to monitor, and where, is a significant problem. Consequently, as the cloud-native movement has gathered steam, different approaches have been sought to reinvent monitoring for the cloud-native era.

A lot of the legacy monitoring tools rely on a 'black box' approach to monitoring, probing an application to determine whether it responds as expected. This provides limited insight regarding the state of an application, however, and in cloud-native stacks it's more common to see monitoring tools that provide introspection of applications.

This is often termed 'white box' monitoring, and is characterized by the collection of metrics (including those obtained from instrumented applications), which are stored in a time-series database for subsequent analysis and alerting purposes.

Fig 1. Black Box Monitoring vs. White Box Monitoring
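To make the distinction concrete, a black-box check amounts to calling an application from the outside and judging only the externally visible response. Below is a minimal sketch in Go of what such a probe does; the health endpoint URL is purely illustrative:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// probe performs a simple black-box check: issue an HTTP request and
// judge health purely on the externally visible response.
func probe(url string) error {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status: %s", resp.Status)
	}
	return nil
}

func main() {
	// The URL is a hypothetical health endpoint.
	if err := probe("http://app.example.com/healthz"); err != nil {
		fmt.Println("probe failed:", err)
	} else {
		fmt.Println("probe succeeded")
	}
}
```

A white box approach, by contrast, has the application itself expose internal metrics, as the Prometheus instrumentation sketch later in this section shows.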


Rather than the point-in-time assessment provided by a probe, white box monitoring provides a rich, multi-faceted, continuous insight into the state of the application's health. Tools that provide white box monitoring usually offer a visualization feature, or the means for exporting data to be visualized in other tools.

So, what's on offer to us? There are some very capable commercial products that provide comprehensive monitoring for infrastructure as well as applications, such as New Relic, AppDynamics, Instana, and Datadog. These commercial solutions tend to be SaaS-based, and come at a price. The price, however, may well be worth the investment for the value that can be gained from a comprehensive, hosted monitoring solution.

We should also mention that the major public cloud providers offer monitoring solutions for applications that run on their infrastructure: Google Stackdriver, AWS CloudWatch, and Azure Monitor. Using a tool native to the platform in question may make a lot of sense if your stack is based entirely on a specific cloud platform, but bear in mind that these tools are not as feature rich as they could be at the present time.

By far the most popular monitoring tool used for cloud-native applications is the open source Prometheus. Prometheus is a graduated project under the stewardship of the CNCF, and has followed closely behind Kubernetes in its journey to maturity. It's incredibly well-respected in the cloud-native community, and is frequently used to monitor production cloud-native stacks.

With a multi-dimensional data model, Prometheus applies key-value pairs to metrics in order to generate time series data that can be queried in a fine-grained manner. It also provides a set of client libraries for instrumenting application services, with most popular programming languages supported. And if there's a need to translate metrics from other systems that don't support the Prometheus format, there are a large number of 'exporters' available that enable you to gather metrics from these systems. The Prometheus metrics format forms the basis of an open standard for metrics exposition, called OpenMetrics.

Perhaps one of its biggest selling points for the cloud-native stack is its ability to interact with service discovery mechanisms, which enables it to automatically discover what to collect metrics from.


Additionally, when cloud-native applications are deployed to a Kubernetes cluster, because Kubernetes itself exposes metrics in the Prometheus format, it's possible to construct a complete picture of the health of the entire stack.

Whilst Prometheus has an in-built visualization capability, it's fairly limited in nature, and most organizations that use Prometheus seriously make use of Grafana instead. Grafana can use Prometheus as a data source, and provides a comprehensive, general-purpose graphing and dashboard capability. Prometheus also has an alerting mechanism for issuing notifications to a number of different supported backends (e.g. Slack, OpsGenie, PagerDuty, etc.) or via a webhook receiver.
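To give a flavor of the client libraries mentioned above, here is a minimal, hypothetical Go service instrumented with the official client_golang library; the metric name, labels, and port are illustrative, not prescriptive:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// requestsTotal is an illustrative counter; Prometheus's multi-dimensional
// data model attaches the key-value labels "path" and "status" to it.
var requestsTotal = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "myapp_http_requests_total",
		Help: "Total HTTP requests handled, by path and status code.",
	},
	[]string{"path", "status"},
)

func handler(w http.ResponseWriter, r *http.Request) {
	requestsTotal.WithLabelValues(r.URL.Path, "200").Inc()
	w.Write([]byte("ok"))
}

func main() {
	http.HandleFunc("/", handler)
	// The /metrics endpoint is what a Prometheus server scrapes.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

A scrape of /metrics then yields time series such as myapp_http_requests_total{path="/",status="200"}, which can be queried in that fine-grained manner.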

Fig 2. Screenshot of the Grafana Dashboard


Logging

Monitoring is an essential ingredient in the determination of the state of health of cloud-native applications, but it relies on us knowing which metrics to collect. This isn't always known for every application from the outset, and there is a degree of learning about the behavior of the applications we develop over time. Once we're entirely intimate with how our application behaves, we can fine tune our metrics to monitor for known problem circumstances in production.

As a complement to monitoring, we can and should log events during the execution of our applications. Logs are often described as a 'set of time-ordered events' that are emitted by an application or system, and which are stored for subsequent analysis. Event logging has long been a staple component that supports deep introspection of application behavior, and of the wider system within which the application resides. When things go wrong with applications or systems, logging onto a host server and poring over copious amounts of logged events is an obligatory problem solving technique.

But, just as cloud-native applications cause problems for traditional monitoring practices, they also pose a challenge for the collection and analysis of event logs. Once again, it's the ephemeral, distributed nature of these applications that is the crux of the problem. How can we examine the logs of an application service instance, if it's been and gone before we've even found out where it's running? What happens to the logs when a host node fails and is replaced by a healthy alternative? How do we make sense of the myriad of logs generated by multiple sources across the application environment?

To answer these questions, it's worth first looking at what the '12-factor app' methodology has to say about application logs. In short, it states that logs should be treated as unbuffered event streams, and that their management shouldn't be the concern of the developer, but that of the environment in which the application runs. In other words, developers insert code to output events at key points in their application, but this output is simply written to the stdout and stderr streams. For cloud-native environments, this means extending the stack to ensure this valuable information is collected, stored, and made available for subsequent analysis.
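In code, that division of responsibility is as simple as it sounds. Here is a minimal, hypothetical Go sketch that emits structured events to stdout and leaves collection and storage entirely to the environment; the field names and messages are illustrative:

```go
package main

import (
	"encoding/json"
	"os"
	"time"
)

// event is a hypothetical structured log record.
type event struct {
	Time    time.Time `json:"time"`
	Level   string    `json:"level"`
	Message string    `json:"msg"`
}

func main() {
	// Write each event straight to stdout as an unbuffered stream;
	// the platform (not the developer) routes, stores, and indexes it.
	enc := json.NewEncoder(os.Stdout)
	enc.Encode(event{Time: time.Now().UTC(), Level: "info", Message: "order accepted"})
	enc.Encode(event{Time: time.Now().UTC(), Level: "error", Message: "payment service unreachable"})
}
```

In a Kubernetes environment, the container runtime captures these streams, and a node-level collector ships them into whichever logging stack you choose below.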


The tools that you choose for managing event logs will depend on many different factors.

The gorilla of the logging world is Splunk, which has served large enterprises for well over a decade. As an enterprise-grade solution, its scope extends beyond traditional logging, and it can be used for eliciting business insights, too. If you're a large organization with deep pockets, looking for a comprehensive solution that goes beyond standard logging for operational purposes, it might be the right choice for you. But if your requirements are less expansive, then there are plenty of other choices to be had.

Just like the monitoring domain, SaaS-based solutions exist for logging too. Sumo Logic's capabilities and market proposition are very similar to those of Splunk, whilst Loggly was a pioneer in the SaaS market for logging, and was acquired by SolarWinds in 2018. Bear in mind that if you elect to go the SaaS route for log management, you may end up shifting gigabytes of data over the wire on a daily basis.

The ELK (Elasticsearch, Logstash, Kibana) stack has dominated the open source approach to log management for a number of years. Elasticsearch provides the indexing and query capabilities, Logstash performs the log collection and parsing function, and Kibana enables visualization of the information collected. In more recent times, an increasingly popular alternative to the Logstash component of the logging stack is a CNCF-hosted project called Fluentd, which is a little more efficient on memory consumption. Just as the different Beats can be configured to ship logs to Logstash, Fluentd can be the forwarding target for Fluent Bit, a lightweight collector of logs with input plugins for a number of different data sources. Fluent Bit can also be deployed to ship logs without Fluentd running alongside it. In either case, the logging stack is then referred to as EFK.

Standing up and managing the infrastructure to host an ELK or EFK stack, however, is not a trivial exercise, and it quickly becomes an end in itself. If you can't or don't want to make the investment in skills and infrastructure, then there are managed service providers who will take care of running the stack on your behalf.


Before we leave the topic of logging, it's worth pointing out that variations of the ELK stack don't have a monopoly when it comes to open source solutions, and there are other reputable tools available, such as Graylog and Grafana Loki. Loki, whilst very new, is particularly interesting, as it is inspired by Prometheus, and can use the very same Kubernetes labels that Prometheus uses to analyze the collected logs in Grafana. Switching between metrics and logs using the exact same labels in one visualization tool is incredibly powerful.

Distributed Tracing

The recent innovations in monitoring and logging get us a long way in observing what's happening in our cloud-native applications; but only so far. Cloud-native applications are composed of a set of distributed microservices. As a client request hits one of those services, it might spark a complex transaction pattern across numerous other services. If we suffer performance or latency issues during the course of the request transaction, it's going to be very hard to stitch the metrics and logs together to build a picture of what's going on. To help build that picture, and to give us a fighting chance of pinpointing the cause of what has manifested as a performance issue, there needs to be a way of tracing the request or transaction as it traverses the mesh of services that make up the application. The techniques which provide this kind of insight belong in the domain of distributed tracing.

Distributed tracing is not a new concept, and has been used by the likes of Google and Twitter for a number of years. But it's been the lack of a standard that has prevented its widespread adoption in cloud-native environments.

Fig 3. Screenshot of Tracing with Jaeger
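As a flavor of what tracing instrumentation looks like in practice, here is a minimal, hypothetical Go sketch using the opentracing-go client library. The operation names are illustrative, and registration of a concrete tracer (e.g. Jaeger's) is elided, so the default no-op tracer is used:

```go
package main

import (
	"context"

	opentracing "github.com/opentracing/opentracing-go"
)

// handleOrder starts a span for the whole operation; the span's context
// travels in ctx so that downstream work joins the same trace.
func handleOrder(ctx context.Context) {
	span, ctx := opentracing.StartSpanFromContext(ctx, "handle-order")
	defer span.Finish()

	chargeCard(ctx)
}

// chargeCard creates a child span from the propagated context, so its
// timing appears as one step (span) within the overall trace.
func chargeCard(ctx context.Context) {
	span, _ := opentracing.StartSpanFromContext(ctx, "charge-card")
	defer span.Finish()
	// ... call the payment service, injecting the span context into
	// outbound request headers so the trace continues across processes ...
}

func main() {
	handleOrder(context.Background())
}
```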


If your services rely on libraries, frameworks, or services which don't all provide the means to instrument in the same manner, then there will be breaks in the trace, which will degrade the insights that can be learned.

OpenTracing is an incubating project of the CNCF that aims to derive a vendor-neutral API specification for instrumenting cloud-native applications. A trace refers to the picture that we want to build of the transaction as it winds its way through the distributed services, with each step in the journey referred to as a span. A context is passed along from span to span, and contains the ID of the trace and span, as well as any other data that could be useful for carrying across each span boundary. Developers use the same instrumentation in their code, irrespective of the tracing system they adopt, provided that system supports the OpenTracing API specification. In theory, this means it's possible to instrument once, but swap the tracing system that provides the visualization and insights, without having to re-instrument your code.

In terms of tracing systems, there is a growing list available to choose from. The most mature is Zipkin, which has a number of libraries for different programming languages (e.g. C#, Java, Go, JavaScript) that are supported by its open source project. Additionally, it has a large number of community-supported libraries. Jaeger is a more recent and very popular alternative, and is an open source project hosted by the CNCF. It makes use of the language support provided by the OpenTracing project, which includes Go, JavaScript, Java, Python, and so on. When researching which tracing system you intend to adopt, it will pay to observe (no pun intended) the languages and frameworks that are supported.

There are, of course, commercial solutions available, including SaaS offerings like LightStep, Instana, and Datadog.

Observability

Before we leave this topic, it's worth pointing out that monitoring, logging, and tracing enable us to detect anomalies in our applications and systems. In isolation, though, they don't really help us understand the behavior of the application itself. Combined, however, they provide us with a rich context, which enables us to gain insights that we would otherwise not be able to glean from our applications. There is a trend to refer to this imbued competence as 'observability'.

Service Mesh

Earlier on, we suggested that systems like Kubernetes do a particular job and do it well. We also mentioned that they need to be augmented with additional tooling to be the rounded solution required for cloud-native applications. In that sense, Kubernetes has been likened to a kernel that manages and schedules its guest workloads. To make the job of running cloud-native workloads more seamless in a Kubernetes environment, the community produces helpful in-cluster operators, such as the TLS certificate management operator, cert-manager.

They also build abstractions on top of Kubernetes, and one of the more important of these is the 'service mesh'.

The term service mesh comes from the notion that cloud-native applications implemented as microservices don't necessarily conform to the tiered nature of previous generations of application architectures; instead, they present a loosely-connected mesh of interacting services. Suddenly, client traffic traverses east-west, as well as north-south. Managing and controlling the traffic between these ephemeral services is non-trivial, and the need to do so has spawned a new management layer on top of the orchestration system, called service mesh technology.

The goal of service mesh technology is to provide the means for cluster administrators to apply traffic management policy to service meshes, whilst securing the communication and enabling introspection of traffic behavior. Policy and configuration are defined using a control plane component, and enforced in the service mesh courtesy of sidecar proxies that constitute a data plane. Using service mesh technology, we can define client request routing, configure ingress and egress, and deal with timeouts and retries, all while relying on the data plane to handle service discovery and load balance traffic to service instances.

The capabilities you find in service mesh technology are not new concepts, and it's possible to trace early implementations of service meshes back to libraries like Google's Stubby, Netflix's Hystrix, and Twitter's Finagle. But libraries have limited appeal, particularly as updates require a complete update and rollout of the entire service mesh. The sidecar proxy approach removes the need for this disruptive course of action.
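To see why the library approach is a burden, consider the kind of resilience logic each service would otherwise have to embed itself. Here is a minimal, hypothetical Go sketch of an in-process timeout and retry; the upstream URL and policy values are illustrative:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// getWithRetry embeds timeout and retry policy directly in application
// code; changing the policy means rebuilding and redeploying every service.
func getWithRetry(url string, attempts int) (*http.Response, error) {
	client := &http.Client{Timeout: 2 * time.Second}
	var lastErr error
	for i := 0; i < attempts; i++ {
		resp, err := client.Get(url)
		if err == nil {
			return resp, nil
		}
		lastErr = err
		time.Sleep(time.Duration(i+1) * 100 * time.Millisecond) // simple backoff
	}
	return nil, fmt.Errorf("all %d attempts failed: %w", attempts, lastErr)
}

func main() {
	// The URL is a hypothetical upstream service.
	resp, err := getWithRetry("http://orders.example.com/api", 3)
	if err != nil {
		fmt.Println(err)
		return
	}
	resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```

With a service mesh, the same timeout and retry policy is declared once in the control plane and enforced uniformly by the sidecar proxies, with no application code changes.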


The first proxy-based service mesh technology was Linkerd, a CNCF-hosted open source project initiated by Buoyant. It found its way into production at a number of early adopters, including Monzo Bank. Its first incarnation (version 1) was based on Finagle. It was daemon-based, and was followed by a lighter, sidecar proxy version originally called Conduit.

Conduit, with a control plane written in Go and a data plane provided by a proxy technology written in Rust, was targeted specifically at Kubernetes. It has subsequently been subsumed into Linkerd as version 2, and the heavier JVM-based version 1 is likely to be retired in favor of version 2 at some point in the future.

Hot on the heels of Linkerd, a series of companies announced an open source collaboration around another service mesh technology called Istio. The project started with Google, IBM, and Lyft, but it has subsequently gained rapid support from a variety of other companies, and is generally accepted as the leading service mesh technology. It uses Lyft's popular open source Envoy project as its sidecar proxy, and has a number of supported adapters for interfacing with third-party infrastructure backends. For example, there are adapters for Fluentd, Prometheus, and Jaeger, amongst a host of others.

With Linkerd 2 playing catch up with Istio in terms of features and popularity, Istio certainly gets the most attention in the community. This may be as a result of the corporate weight behind the project, but it's too early to say which technology will win as the de-facto standard for service mesh technology. Istio is certainly complex, whilst Linkerd and the recent Connect feature of HashiCorp's Consul provide a simpler approach to the management of service meshes. There's plenty of road left to run in the service mesh race, so it would be wise to take a carefully considered decision when it comes to defining your approach.

“Kubernetes is the new kernel. We can refer to it as a ‘cluster kernel’ versus the typical operating system kernel.” -- Jessie Frazelle

Conclusion

Embarking on the journey to cloud-native adoption may seem daunting, especially when you consider that the cloud-native paradigm is still relatively new, and that it seems like there is so much to learn and to assimilate. And, for sure, we haven't even touched on other aspects of the cloud-native stack, such as continuous integration/deployment, application registries and distribution, networking, and so on.

But if we are realistic and pragmatic in our approach, we can look forward to realizing significant benefits for the organizations we serve, by taking advantage of the excellent open source platforms, tools, and methods that make cloud-native stacks viable, and the body of knowledge and experience that is willingly and collaboratively shared within the community.

About Giant Swarm

Giant Swarm provides an API Driven Open Source Platform that enables businesses to easily provision and scale Kubernetes clusters together with a wide set of managed services. Giant Swarm continuously updates the platform with all its tenant clusters and takes full responsibility that the cloud-native infrastructure is operational at all times. Giant Swarm provides its customers hands-on expert support via a private Slack channel to further accelerate their cloud-native success.

Get in touch with us now

Let’s have a first call where we can discuss your goals, solutions, setup and timeline to come up with the perfect plan for scaling and delivering your solutions with ease.

[email protected] www.giantswarm.io twitter.com/giantswarm
