RC25669 (WAT1710-006) October 4, 2017 Computer Science
IBM Research Report
A Cloud-Native Monitoring and Analytics Framework
Fabio A. Oliveira, Sahil Suneja, Shripad Nadgowda, Priya Nagpurkar, Canturk Isci
IBM Research Division, Thomas J. Watson Research Center
P.O. Box 218, Yorktown Heights, NY 10598 USA
Research Division Almaden – Austin – Beijing – Brazil – Cambridge – Dublin – Haifa – India – Kenya – Melbourne – T.J. Watson – Tokyo – Zurich
LIMITED DISTRIBUTION NOTICE: This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties). Many reports are available at http://domino.watson.ibm.com/library/CyberDig.nsf/home.

A Cloud-native Monitoring and Analytics Framework
Fabio A. Oliveira, Sahil Suneja, Shripad Nadgowda, Priya Nagpurkar, Canturk Isci
IBM T.J. Watson Research
{fabolive,suneja,nadgowda,pnagpurkar,canturk}@us.ibm.com
Abstract

Operational visibility is an important administrative capability and one of the critical factors deciding the success or failure of a cloud service. Today, it is becoming increasingly complex along many dimensions, which include tracking both persistent and volatile system state, as well as providing higher-level services such as log analytics, software discovery, behavioral anomaly detection, and drift analysis, to name a few. In addition, the target endpoints to monitor are becoming increasingly varied in terms of their heterogeneity, cardinality, and lifecycles, while being hosted across different software stacks. In this paper, we present our unified monitoring and analytics pipeline for operational visibility, which overcomes the limitations of traditional monitoring solutions and provides a uniform platform, as opposed to configuring, installing, and maintaining multiple siloed solutions. Our OpVis framework has been running in our production cloud for over two years, providing a multitude of such operational visibility and analytics functions uniformly across heterogeneous endpoints. To show how it adapts to the ever-changing cloud landscape, we highlight its extensibility model, which enables custom data collection and analytics based on the cloud user's requirements. We describe its monitoring and analytics capabilities, present performance measurements, and discuss our experiences supporting operational visibility for our cloud deployment.

1 Introduction

In cloud environments, operational visibility refers to the capability of collecting data about the underlying system behavior and making this data available to support important administrative tasks. Without visibility into operational data, cloud operators and users have no way to reason about the health and general behavior of the cloud infrastructure and applications.

Traditionally, operational visibility practices have been limited to resource monitoring, collection of metrics and logs, and security compliance checks on the underlying environment. In today's world, better equipped to manipulate massive amounts of data and to extract insights from it using sophisticated analytics algorithms or machine-learning techniques, it becomes natural to broaden the scope of operational visibility to enable, for instance, deep log analytics, software discovery, network/behavioral anomaly detection, and configuration drift analysis, to name a few use cases.

To enable these analytics, however, we need to collect data from a broader range of data sources. Logs and metrics no longer suffice. For example, malware analysis relies on memory and filesystem metadata, vulnerability scanning needs filesystem data, network analysis requires data on network connections, and so on. At the same time, these data sources are potentially very different in nature: log events are typically continuously streamed, whereas filesystem data changes are less frequent, and configuration changes normally occur when an application is deployed.

Yet another source of data for modern operational visibility stems from the diverse and prolific image economy (DockerHub, Amazon Marketplace, IBM Bluemix) that we witness as a result of pervasive virtualization. The more the world relies on cloud images, the more important it becomes to proactively and automatically certify them by performing security and compliance validation, which requires visibility into dormant artifacts in addition to running cloud instances.

Adding to the complexity of dealing with a multitude of data types for modern operational visibility, cloud environments are becoming larger and increasingly heterogeneous. For instance, it is nowadays common for a cloud provider to support deployments on physical hosts, virtual machines (VMs), containers, and unikernels, all at the same time. As a result, for effective visibility, operational data from this diverse set of runtimes needs to be properly collected, interpreted, and contextualized. Tenancy information, resource limits, scheduling policies, and the like are exposed by different cloud runtime platforms (e.g., OpenStack, Kubernetes, and Mesos) in different ways.

As if heterogeneity were not enough, the lighter the virtualization unit (e.g., containers and unikernels), the higher the deployment density, which leads to a sharp increase in the number of endpoints to be monitored. Figure 1 summarizes the complexity of modern cloud environments along multiple dimensions, including deployment types and cloud runtimes, as well as some challenges for which operational visibility is needed.

In this paper, we propose a novel approach to operational visibility to tackle the above challenges. To enable increasingly sophisticated analytics that require an ever-growing set of data sources, we implemented OpVis, an extensible framework for operational visibility and analytics. Importantly, OpVis provides a unified view of all collected data from multiple data sources and different cloud runtimes/platforms, and is extensible with respect to both data collection and analytics.

We contend that an effective operational visibility platform must decouple data collection from analytics. Old solutions that attempt to mix data collection and analysis at the collection end do not scale, and they are limited to localized rather than holistic analytics. We enable algorithms that can uncover data relationships across otherwise separated data silos. Furthermore, to scale to the increasing proliferation of ephemeral, short-lived instances in today's high-density clouds, we propose an agentless, non-intrusive data collection approach. Traditional agent-based methods are no longer suitable, with their maintenance and lifecycle management becoming a major concern in enterprises.

Our implementation of OpVis supports multiple data sources and cloud runtimes. We have been using it in a public production cloud environment for over two years to provide operational visibility capabilities, along with a number of analytics applications we implemented to, among other things, provide security-related services to our cloud users. We evaluate OpVis with a combination of controlled experiments and real production data, in cases where we were allowed to publicize it.
[Figure 1: cloud operations management at a glance. Deployments: physical host, VM, container, unikernel, both active instances and inactive images (e.g., VM/container images). Cloud software: OpenStack, Kubernetes, Mesos. Operating state: disk, memory, and network state. Operational-visibility analytics: security and compliance checks (e.g., HIPAA, ITCS), vulnerability analysis, malware analysis, configuration analysis, IP-access blacklisting, behavioral and network anomaly detection (e.g., abnormal syscall activity, abnormal network traffic), and accounting and billing metrics.]

Figure 1. Cloud operation management.

[Figure 2: OpVis architecture. Crawlers in the VM cloud and container cloud feed a data bus; the data service (indexers and annotators) persists frames and log events in a data store exposed through a search service, on top of which analytics applications (App_1, App_2, ..., App_n) run.]

Figure 2. OpVis overview.

2 Existing Techniques

To gain visibility inside VMs, most existing solutions repurpose traditional monitoring tools, typically requiring installation inside the monitored endpoint's runtime and thus causing guest intrusion and interference [4]. Others avoid application-specific agents by installing generic hooks or drivers inside the guest [34, 58–60], requiring VM specialization and leading to vendor lock-in. The solutions that do not require guest modification can usually provide only a few black-box metrics, for example by querying VM management consoles such as VMware vCenter and Red Hat Enterprise Management. Yet another approach is to gather metrics by remotely accessing the target endpoints (e.g., over SSH, or via HTTP queries). A combination of one or more of these techniques is also typically seen in some solutions [3, 45].

The landscape is a bit different with containers, since the semantic gap between the guest environment and the management layer (the host) is greatly reduced: containers are, at their core, just a special packaging of host processes and directories.
In addition to container image scanning [5, 17, 48, 49, 57], some solutions are able to provide some level of agentless, out-of-band container inspection by talking to the Docker daemon [20, 50], querying the kernel's cgroups stats [13], monitoring the container's rootfs [27], or via syscall tracing [54]. While these can provide basic metrics, for deep inspection most resort to installing in-guest components (agents, plugins, scripts, instrumentation libraries, or custom exporters [19, 44, 50]) and thus require guest modification.

Furthermore, most existing solutions address only certain subcomponents of operational visibility, and only a few (seem to) cover all of image scanning, out-of-band basic metrics, and deep inspection [27, 48, 57]. Amongst that handful, none are both open-sourced and extensible. These properties translate into the installation, configuration, and maintenance of multiple siloed solutions to cover all aspects of the operational visibility spectrum (Figure 1). Given the above arguments, to the best of our knowledge, no existing solution provides all of OpVis' capabilities: a unified, agentless, decoupled, extensible, and open-sourced operational visibility framework that does not require guest cooperation or cause guest intrusion, interference, or modification.

3 Design and implementation

In this section we describe the design and implementation of OpVis, our unified operational visibility and analytics framework. The overall architecture of OpVis is depicted in Figure 2, which clearly separates three layers: data collection, data service, and analytics.

We refer to the OpVis data collectors as crawlers (see the top of Figure 2). They monitor cloud instances and images to take periodic memory-state and persistent-state snapshots, which are then encoded into an extensible data format we refer to as the frame. In addition to discrete state snapshots, the crawlers track log files of interest from cloud instances. Snapshots, in the form of frames, and streaming log events enter the data service through a scalable data bus, from which they are fetched and then indexed on a data store for persistence. A search service makes all collected data available and queriable, enabling a variety of analytics applications, for instance, to diagnose problems experienced by a cloud application, to discover relationships among application components, and to detect security vulnerabilities. The rest of this section describes the OpVis data collectors (crawlers) (§3.1), the data format (frame) for discretized state snapshots (§3.2), and the backend data service (§3.3). We also present a few analytics applications that take advantage of OpVis (§4).

3.1 Data collectors: agentless crawlers

We take an agentless approach to data collection; that is, OpVis crawlers collect data in an out-of-band, non-intrusive manner. Critically, an important role of the crawlers is to enable operational visibility with a unified view across different cloud-runtime types and for different forms of application and system state. We implement out-of-band visibility into container runtimes, e.g., a plain Docker host, Kubernetes, and Mesos, and into VM runtimes, e.g., OpenStack.

We observe that monitoring live cloud instances (containers and VMs) is important for reactive analytics; however, to enable proactive analytics applications, it is equally important to also scan cloud images (Docker images and VM disks).

[Figure 3: the host directory /var/log/container_logs/tenant_5/ holds container_1/var/log/apache.log, container_1/var/log/messages, and container_2/var/log/my_app.log, mirroring container_1's /var/log/apache.log and /var/log/messages and container_2's /var/log/my_app.log.]

Figure 3. Mapping log files from containers to the host.
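The frame format itself is specified in §3.2 (not reproduced in this excerpt). Purely as an illustration of the concept, a minimal frame envelope might be assembled as follows; the field names here are hypothetical, not the actual OpVis schema (only the namespace value mimics the report's report samples):

```python
import json
import time

def make_frame(namespace, features):
    """Bundle one crawl iteration's discrete state snapshot into a frame.

    `namespace` identifies the monitored endpoint (the report's samples use
    values like "container:10.0.2.15/ubuntu"); `features` maps a feature
    type (process, package, config, ...) to its extracted entries.
    """
    return {
        "namespace": namespace,
        "timestamp": time.time(),
        "features": features,
    }

def serialize_frame(frame):
    """Frames travel over the data bus; JSON is one plausible encoding."""
    return json.dumps(frame, sort_keys=True)
```

Whatever the concrete encoding, keeping the envelope extensible is what lets new crawler plugins add feature types without changes to the data service.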
For this reason, OpVis crawlers provide visibility into these dormant artifacts as well.

The types of analytics applications that can be written are limited by the collected data. To enable semantically-rich, end-to-end visibility and analytics, the crawlers collect in-memory live system state, e.g., resource usage and running-process information, as well as persistent system state, e.g., filesystem data and logs. This broad range of data types requires proper handling both of continuously-streaming data, such as logs, and of state that is captured as discrete snapshots, such as process, network, and filesystem information.

Exposing and interpreting the persistent and volatile state of VMs and containers requires techniques tailored to each runtime and state type. Broadly, we apply introspection techniques to VMs and namespace-mapping techniques to containers. Despite slight differences in the specifics of the techniques, the key tenet of our approach remains unchanged: to provide deep operational visibility in near real time and out of band, with no intrusion or side effects on the running cloud instances. Next, we delve into our implemented techniques, organizing the presentation by runtime (container and VM) and broad state category (memory and persistent).

3.1.1 Containers' memory state

With the advent of Docker [22], Linux containers have become an increasingly popular choice for virtualizing cloud infrastructures. Containers represent a type of OS-level virtualization where related processes are grouped into a logical unit (the container). Each container is given a unique, isolated view of the system resources. Two Linux kernel features, namespaces and cgroups, are used to guarantee virtualization, isolation, and controlled resource usage for each container. Different types of namespaces provide isolation for different kinds of resources. For instance, the pid namespace controls process virtualization: it makes the processes inside a container see only each other, as they belong to the same pid namespace.

Because container processes are simply host processes with a different view of the system, they are visible from the host. We use two techniques to collect a container's memory-state information. The first route is via cgroup accounting stats. Most OS-level virtualization technologies provide some way of accounting for container resource utilization, like cgroups in the Linux kernel. This accounting is especially needed for performance isolation, where it is necessary to limit the memory and CPU utilization of containers so they do not take over the host. These resource-utilization stats include disk I/Os, CPU cycles, user memory, and (some) kernel memory.

Our second container-monitoring technique relies on the Linux namespace APIs: the crawler process running on the host "attaches" to the pid namespace of each container to collect data from it. Although this is not a pure introspection technique, as the crawler logically "moves" into the container context, it nonetheless gives us the non-intrusiveness we seek. It works even if the container is unresponsive or compromised, since the crawler still gets the overall system view from outside the container. Importantly, it does not require any special agent or library inside the containers.

In summary, with the above techniques the crawler collects containers' live memory state, including information on hardware-usage metrics, processes, and network connections.
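As a rough sketch of the first, cgroup-accounting route: the counters the kernel already maintains can be read directly from the host. The paths below assume a cgroup-v1 hierarchy with the Docker cgroup driver; actual layouts vary by distribution and cgroup version:

```python
from pathlib import Path

# Assumption: cgroup v1 mounted here, containers grouped under "docker".
CGROUP_ROOT = Path("/sys/fs/cgroup")

def parse_keyed_stats(text):
    """Parse the 'key value' lines used by accounting files like memory.stat."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value.strip().isdigit():
            stats[key] = int(value.strip())
    return stats

def crawl_container_memory(container_id):
    """Read a container's memory accounting from the host, out of band."""
    stat_file = CGROUP_ROOT / "memory" / "docker" / container_id / "memory.stat"
    return parse_keyed_stats(stat_file.read_text())
```

Because the read happens entirely on the host side, nothing runs inside the container, matching the non-intrusiveness goal stated above.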
3.1.2 Containers' persistent state

Besides live memory state, we also want to collect the persistent state of running containers. In contrast to the namespace-attachment technique described previously, we employ a purely host-side monitoring scheme to collect persistent-state data. The Linux kernel maintains a mapping from resources as seen inside containers to the actual host resources. One such mapping pertains to the root filesystem of each container. Our crawlers identify the location of each container's root filesystem in the host filesystem, and then extract the containers' filesystem data and metadata, information on configuration files, and installed packages. (Each memory and persistent-state discrete data type is described in detail in §3.2.)

Just as importantly, the same persistent-state data can be collected from dormant container (Docker) images. To do so, the crawler creates a container for each image to be scanned, applies exactly the aforementioned technique for extracting persistent state, and then destroys the container. Since the time to create and destroy a container is extremely short, this technique is quite reasonable. Another technique is to mount the container image on the crawler host and perform the inspection offline. Image scanning is performed on demand, i.e., when a new image is pushed to the cloud or an existing one is modified.

Thus far we have presented the collection of discretized state, taken as periodic snapshots. One important form of persistent state does not fall into that category: log events continuously streamed from containers, which provide important evidence for debugging and analytics applications alike. Capturing log streams from containers is therefore of paramount importance. In the next paragraphs, we describe the implementation details of log collection for containers.

Log-file mapping. Two components of the crawler work together to deal with logs. The crawler's container watcher constantly polls the host where it runs to discover new containers; when a newly-created container is discovered, the container watcher maps log files of interest from the container filesystem to the host filesystem so that the log handler can monitor them. Figure 3 illustrates this logical file mapping. The top directory for all monitored log files on the host is /var/log/container_logs, under which the container watcher creates one sub-directory per cloud tenant (user). Finally, under a tenant sub-directory, one directory per container is created, and that becomes the new root directory for all log files of that container. In the figure, container_1 has two log files to be monitored and container_2 has one; both containers are owned by the tenant whose id is tenant_5. Following this approach, the log handler independently discovers new log files to be monitored by recursively watching the contents of /var/log/container_logs/**/*.

Finding log files of interest. To identify log files to be monitored, the container watcher inspects the environment of a container, looking for a variable named LOG_LOCATIONS. When creating a container, the user is expected to set this variable to a comma-separated list of paths to log files of interest. Other variables in a container's environment, automatically set by the cloud, uniquely identify the user owning the container and the container itself.

Figure 4. Examples of features extracted by different crawler plugins. An 'N' in a cell represents currently unimplemented or non-applicable functionality for a particular crawling mode. Host crawling uses OS-exported functionality; container crawling uses the kernel's cgroups and namespace APIs, plus container rootfs traversal; VM mode uses memory and disk introspection.

To implement the file mapping described above and depicted by [...]

[...] pseudo-file, indexed by the virtual address space backing the VM's memory from /proc/
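The LOG_LOCATIONS parsing and the per-tenant, per-container path mapping described above can be sketched as follows (a simplified illustration of the mapping logic only; the real crawler also performs container discovery and file watching):

```python
import os.path

HOST_LOG_ROOT = "/var/log/container_logs"  # top directory from Figure 3

def parse_log_locations(env):
    """LOG_LOCATIONS holds a comma-separated list of in-container log paths."""
    raw = env.get("LOG_LOCATIONS", "")
    return [p.strip() for p in raw.split(",") if p.strip()]

def host_log_path(tenant_id, container_id, container_log_path):
    """Mirror a container log file under the per-tenant, per-container tree,
    preserving its in-container path (cf. Figure 3)."""
    rel = container_log_path.lstrip("/")
    return os.path.join(HOST_LOG_ROOT, tenant_id, container_id, rel)
```

With this layout, the log handler never needs to know about tenants or containers explicitly; recursively watching HOST_LOG_ROOT is sufficient.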
The common feature across all our security services built over the OpVis pipeline is that they do not require any setup prerequisites to be built into the user's runtime, and they start working as soon as a user brings in an image or runs a container or VM. These services feed off of system frames from the Kafka pipe, perform analytics-specific 'annotations' on the frames, and generate user-visible 'reports' highlighting the security posture of the user's images and systems.

4.1 Vulnerability Analyzer (VA)

VA discovers package vulnerabilities in a user's Docker images and containers hosted on our public cloud, and guides them to known fixes in terms of the relevant distribution-specific upgrades. Once a system frame enters the OpVis analytics pipeline, the VA annotator extracts the package list from it and compares it against publicly available vulnerability databases (e.g., NVD). We currently target Ubuntu security notices, while support for other distributions is in progress. Listing 1 shows a sample VA report, containing CVEs and a URL to the security notice corresponding to the vulnerable package found during the scan. VA has been integrated into the DevOps deploy pipeline where, based upon user-specified deployment policies, images tagged as vulnerable by VA can be blocked from deployment as containers.

Recently, we have also added support for scanning vulnerabilities in application runtime-specific packages brought in by application-level package installers such as Ruby gems and Python pip. An interesting security dimension this feature addresses is defense against typosquatting attacks on application libraries [55]. In this case, malicious packages with names similar to legitimate/intended packages make their way into a user's system when the user inadvertently makes a typo while executing an installation command, for example, pip install reqeusts instead of requests. By comparing installed packages against permutations of whitelisted ones, such malicious packages can be detected and protected against.

We are also currently converting VA to an as-a-service model, and shall soon release a version enabling easy extensibility in terms of supported environments beyond just our own cloud, as well as custom client analytics.

{
  "vulnerability-check-time": "2017-05-18T08:45:02.696Z",
  "os-distribution": "ubuntu",
  "os-version": "xenial",
  "vulnerable": true,
  "crawled-time": "2017-05-18T08:45:00",
  "execution-status": "Success",
  "namespace": "container:10.0.2.15/ubuntu",
  "vulnerabilities": [ {
    "cveid": [ "CVE-2016-6252", "CVE-2017-2616" ],
    "summary": "su could be made to crash or stop programs as an administrator.",
    "url": "http://www.ubuntu.com/usn/usn-3276-1",
    "usnid": "usn-3276-1"
  } ],
  "type": "vulnerability"
}

Listing 1: Sample Vulnerability Analyzer report

4.2 SecConfig

Software misconfiguration has been a major source of availability, performance, and security problems [63]. For an application developer using third-party components shipped as standard Docker images, it is non-trivial to ensure optimal values for the various configuration settings spread across the different components. To aid in this process, SecConfig scans application and system configuration settings to test for compliance and best-practices adherence from a security perspective.

SecConfig gives container developers a view into their runtimes' security properties, and gives guidance on how they should be improved to meet best practices in accordance with industry guidelines like HIPAA, OWASP, PCI, and CIS [40], mixed with our own internal deployment standards. We currently have a small pool of 'policies' (135 rules spanning 11 different applications and system services) against which SecConfig tests 'compliance'. Example configuration rules include passwords being longer than 8 characters and set to expire within 90 days, secure ciphers being used in Apache's SSL/TLS settings, etc. A sample report snippet can be seen in Listing 2.

Current efforts include making SecConfig SCAP-validated [39], which would enable clients to confidently gauge whether their systems are in full compliance with authorized agency-specific security checklists like USGCB, DISA STIG, etc.

{
  "compliance-check-time": "2017-05-17T02:23:45.804772Z",
  "compliance-id": "Linux.9-0-a",
  "compliant": true,
  "crawled-time": "2017-05-17T02:23:41.145098Z",
  "description": "checking if ssh server is installed",
  "execution-status": "Success",
  "namespace": "container:10.0.2.15/mysql:5.7",
  "reason": "SSH server not found",
  "type": "compliance"
}

Listing 2: Sample SecConfig report
4.3 Drift Analysis

One of the key tenets of DevOps automation is enforcing container immutability. Once a container goes into production, the expectation is that it will never be accessed manually (login or ssh); thus its contents and behavior at day zero, tightly controlled with deploy scripts to adhere to an overall architecture, should remain the same thereafter. However, it has been observed that systems inevitably 'drift' [1]: deployed containers change over time and show unexpected behavior, with the change being either persistent, as in disk-resident, or existing solely in memory. Such drift can introduce unexpected exposures and side effects on deployed applications, and it can go unnoticed for a long time with image-centric validation processes.

With our OpVis pipeline, we are able to detect such drift by tracking evolution across time for all monitored systems. Specifically, by diffing system frames across time, we can discover 'which' systems violated the immutability principle, and then narrow down 'what' caused the drift by uncovering potential causes, ranging from package-level changes down to the granularity of the suspect application configuration settings.

In one particular instance, an analysis of our production cloud over a 2-week period revealed that almost 5% of the hosted containers exhibited drift with respect to their corresponding images, in terms of differences in their vulnerability and compliance counts (as described in the previous two subsections). Although one reason could be changes in the vulnerability database itself, our analysis uncovered that 'in-place' updates to the containers, both benign¹ and undesirable, were indeed taking place in our cloud via SSH, docker exec, automated software updates, and software reconfiguration via web frontends.

¹An example of a benign update is an honest fix of SSH-related violations, albeit one breaking container-immutability principles.

4.4 Malware Analysis

Malware typically manifests itself in various forms, including rootkits, viruses, botnets, worms, and trojans, which infect and compromise the operating environment of an application. A rootkit in particular is malicious software intended to enable unauthorized access to a compute platform, and it often tries to hide its existence. Various open-source tools are available for detecting rootkits [12][38][41]. On a standard Linux box, rootkit checks involve various kinds of system forensics, including but not limited to: filesystem scans for known offending files or directories, kernel symbol-table scanning, and process/network scanning.

The vulnerability exposure from rootkits is smaller for containers than for a VM or physical server. For example, certain kernel-mode rootkits are developed as loadable kernel modules in Linux; application containers do not generally have the privileges (Linux capabilities) required to load kernel modules, so containers are less susceptible to kernel-mode rootkits. Bootkits are a variant of kernel-mode rootkits that infect startup code, such as the Master Boot Record (MBR), Volume Boot Record (VBR), or boot sector. Since container start-up does not involve the traditional OS boot sequence, these rootkits are also less relevant for containers.

Therefore, the current malware-detection capability focuses on file-based malware in container images and instances. A repository of known malware and the corresponding offending file paths is maintained by pulling definitions from available open-source feeds [41][12]. Crawled file paths from images and instances are then validated against this repository to identify potential malware.
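The file-path validation in §4.4 amounts to matching crawled paths against a repository of known-offending paths. A simplified sketch follows; the blacklist entries here are hypothetical placeholders (real definitions are pulled from the open-source feeds cited above):

```python
import fnmatch

# Hypothetical entries for illustration only; real definitions come from
# open-source malware/rootkit feeds.
KNOWN_MALWARE_PATHS = [
    "/usr/bin/.sniffer",           # an exact known offending file
    "/lib/modules/*/extra/rk*.ko", # a glob pattern for a rootkit family
]

def match_malware_paths(crawled_paths, definitions=KNOWN_MALWARE_PATHS):
    """Return crawled file paths that match any known-malware definition."""
    hits = []
    for path in crawled_paths:
        if any(fnmatch.fnmatch(path, pattern) for pattern in definitions):
            hits.append(path)
    return hits
```

Since the crawler already extracts full filesystem metadata (§3.1.2), this check runs entirely on collected frames, with no access to the running container.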
When multiple threads are created, their cumulative are less susceptible to kernel-mode rootkits. Bootkits are a vari- activity may result in spikes for CPU utilization. By staggering the ant of kernel-mode rootkits that infect startup code, like Master activity of the threads, the overall consumption can be amortized Boot Record (MBR), Volume Boot Record (VBR), or boot sector. leading to an average lowering of CPU activity. Since container start up does not involve the traditional OS start up Watching out for starvation. To monitor the crawler’s behavior paradigm, these rootkits also become less relevant for containers. with respect to log collection, one of the things we did was to Therefore, in the current capability of Malware detection, we deploy on each cloud host a test container emitting log events at a focused on detecting file-based malware in container images and low frequency: 2 log events per minute. We created a dashboard to instances. A repository of known malwares and their corresponding verify whether and when the logs from our test containers were offending file-paths is maintained by pulling their definitions from being indexed on the data store. We noticed that, occasionally, available open-source targets[41][12]. Then crawled file-paths from some containers generating logs at an extremely high frequency images and instances are validated against this repository to identify were deployed to a few hosts and, when that happened, our low- any potential malwares. frequency test logs from those hosts were significantly delayed. Our investigation revealed that the root cause was not in the 5 Experiences data service; the indexers were working properly and there was In this section we discuss some experiences and lessons we learned no backlog of outstanding data on the data bus. To corroborate while running OpVis in our public production cloud environment our suspicion of starvation caused by the crawler behavior, we for over two years. 
experimented with throttling high-frequency logs, which indeed mitigated the problem. Visibility vs side-effects tradeoff. Different systems provide dif- Operational visibility systems need global, distributed admission- ferent levels of visibility; interestingly, more visibility requires more control policies in place to allow fair and timely visibility into all privileges and has higher chances of side effects (i.e., a data collector systems across all monitored cloud runtimes. affecting the state during state collection). This tradeoff becomes apparent with the two types of systems 6 Contribution that we monitor: containers and VMs. Containers are easy to moni- OpVis has made both internal (production cloud department) and tor as they share the kernel with the host. Everything that happens external (github) contributions. OpVis crawler, with python-based in the container (at the POSIX level) is visible from the host. This host and container inspection and VM introspection capabilities, 1An example of benign update is an honest fix of SSH-related violations, albeit breaking has been opensourced [18] for over an year now, and has seen container immutability principles contributions from 19 developers internationally. The latest wave A Cloud-native Monitoring and Analytics Framework IBM Research Technical Report, December 2017, USA
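As a concrete illustration of the layer-choice tradeoff discussed above, a container's CPU usage can be read from the host-side cgroup filesystem instead of the Docker API. The sketch below is illustrative only: it assumes a cgroup-v1 `cpuacct` hierarchy with a `docker/<container-id>` layout under `/sys/fs/cgroup`, which varies by host configuration and is not taken from the paper.

```python
import os

# Sketch: read a container's cumulative CPU usage from the host-side
# cgroup filesystem rather than the Docker API. The path layout below
# ("/sys/fs/cgroup/cpuacct/docker/<id>") is an assumption; the actual
# mount point and hierarchy depend on the host configuration.

CGROUP_ROOT = "/sys/fs/cgroup/cpuacct/docker"

def parse_usage_ns(text):
    """Parse the contents of cpuacct.usage: a single integer, nanoseconds."""
    return int(text.strip())

def container_cpu_usage_ns(container_id, root=CGROUP_ROOT):
    """Cumulative CPU time consumed by a container, in nanoseconds."""
    path = os.path.join(root, container_id, "cpuacct.usage")
    with open(path) as f:
        return parse_usage_ns(f.read())
```

Because the kernel's cgroup file formats change far more slowly than the Docker remote API, a collector written against them needs fewer updates as the container runtime evolves.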
Figure 6. Time to crawl different features for 200 containers.
Figure 7. Effective crawl frequency per container.
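The log-throttling mitigation mentioned in the starvation experience above can be approximated with a per-container token bucket. This is an illustrative reconstruction, not the production mechanism; the rate and burst values are made up.

```python
# Illustrative sketch of per-container log throttling: a token bucket
# that admits at most `rate` log events per second (with some burst
# allowance) and drops the excess, so one noisy container cannot starve
# log indexing for the others. Not the actual OpVis implementation.

class LogThrottle:
    def __init__(self, rate, burst):
        self.rate = float(rate)       # tokens replenished per second
        self.capacity = float(burst)  # maximum bucket size
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now):
        """Return True if a log event at time `now` (seconds) is admitted."""
        # Refill proportionally to elapsed time, then spend one token.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A global admission-control policy, as argued above, would coordinate such per-container budgets across all hosts rather than deciding locally.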
To maintain stability with active contributions from multiple developers, we have integrated our GitHub repository with Travis CI. On each code Pull Request (PR), over 250 unit and functional tests are fired and run across 5 Docker versions. 90% code coverage is enforced, together with pylint-, pep8-, and flake8-based style enforcement (Python code-style tools). A PR merge triggers an automated Docker image build, available publicly on DockerHub (further enabling continuous delivery via webhooks).

The backend is closed-source and offers analytics services to customers of our production container clouds. The full OpVis pipeline (crawler + backend) has been active in our production cloud for over two years now 2. The VM monitoring service was previously part of our OpenStack cloud deployment, which supported over 1000 OS versions. Multiple instances of the crawler alone have been deployed as data collectors for other internal departments.

2 We cannot divulge the actual scale of the endpoints scanned/processed by the OpVis pipeline.

7 Evaluation

In this section we evaluate the efficiency and scalability of the OpVis framework. We first showcase the monitoring frequency that can be realized with our out-of-band crawler. Next, we compare it with agent-based in-band monitoring in terms of performance impact on the guest workload. Given OpVis' disaggregated processing model, we also measure the space and network overhead of data curation between crawler and annotators. Finally, we demonstrate the feasibility of OpVis' log streaming in terms of being able to successfully process production-scale log events. We focus our experiments on container clouds; the corresponding benefits for VM-based cloud deployments were demonstrated in our previous work [53]. The latter also features other evaluation metrics, including high state-extraction accuracy and frame-storage scalability, as well as superiority over in-guest monitoring by virtue of holistic knowledge (in-guest + host-level resource use measures).

Setup: The host machines have 16 Intel Xeon E5520 (2.27GHz) cores and 64GB of memory. The host runs CentOS 7, Linux kernel 3.10.0-514.6.1.el7.x86_64, and Docker version 1.13.1. The guest containers are created from the Apache httpd-2.4.25 DockerHub image.

7.1 Monitoring latency and frequency

In this experiment, we measure the maximum monitoring frequency with which the crawler can extract state from the guest containers. Figure 6 shows the time it takes to extract different runtime state elements (features) from 200 webserver containers. Shown are two sets of bars for each feature, representing two modes of crawler operation. The left bar represents the case where the crawler synchronizes with the Docker daemon on every monitoring iteration to get metadata for each container, whereas the right bar represents the optimization where the crawler caches the container metadata after the first iteration and subscribes to Docker events to update its cached metadata asynchronously upon container creation or deletion. The optimization yields an average improvement of 4.4s to crawl 200 containers.

Another point to note is the improvement in crawl times when namespace jumps are avoided, as can be seen in Figure 6: for example, 11.6s vs. 4.8s to crawl packages with and without namespace jumping, respectively. Finally, while the crawl latencies shown in Figure 6 are for extracting individual features, as we also verified by experiments, crawl times for feature combinations can be calculated by adding the individual components together. For example, for the performance-impact experiments in the next subsection, we simultaneously enabled the CPU, memory, and package crawler plugins, yielding a combined base crawl latency of 6s (~0.18 + 1 + 4.8s for the respective plugins).

Scaling these crawl-latency numbers across 200 containers then yields the effective monitoring frequency per container. As can be seen in Figure 7, the crawler is easily able to support over 10Hz of monitoring frequency per container. This is with a single crawler process consuming a single CPU core (for many feature plugins, actually only 70% of a core, with time spent waiting either for (i) crawling the containers' rootfs from disk, or (ii) the kernel, while reading cgroups stats and/or during namespace jumping).

7.2 Performance impact

In this experiment, we measure the impact of monitoring the guests with our OpVis agentless crawler, and compare it with the impact of agent-based in-guest monitoring. We ran 200 webserver containers on the host and configured the crawler to extract CPU, memory, and package information from each container. The idea is that the first two plugins provide resource-use metrics, while the third plugin allows periodic vulnerability analysis for the guests, as described in Section 4.1.
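The metadata-caching optimization described in Section 7.1 — one full listing from the daemon, then asynchronous updates driven by create/destroy events — can be sketched as follows. The event dict shape mirrors Docker's event stream, but `initial_list` and the event source here are stand-ins, not actual Docker API calls.

```python
# Sketch of the crawler's metadata-cache optimization: instead of asking
# the Docker daemon for every container's metadata on each monitoring
# iteration, seed a cache from one full listing and keep it current by
# applying create/destroy events as they arrive. The inputs are stand-ins
# for the real Docker API responses.

class MetadataCache:
    def __init__(self, initial_list):
        # container id -> metadata dict, seeded from one full listing
        self.containers = {c["Id"]: c for c in initial_list}

    def apply_event(self, event):
        # Docker-style events carry at least {"status": ..., "id": ...}
        if event["status"] == "create":
            self.containers[event["id"]] = {"Id": event["id"]}
        elif event["status"] == "destroy":
            self.containers.pop(event["id"], None)

    def ids(self):
        """Current set of known container ids, used to drive each crawl."""
        return set(self.containers)
```

Each monitoring iteration then reads `ids()` from the cache instead of issuing a fresh listing, which is where the reported 4.4s average improvement for 200 containers comes from.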
Figure 8. CDF of webserver response times.
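The additive crawl-latency model from Section 7.1 is easy to reproduce; the per-plugin numbers below are the ones reported there for 200 containers.

```python
# Additive crawl-latency model from Section 7.1: the latency for a
# combination of feature plugins is the sum of the individual plugin
# latencies. Values are the reported per-plugin times (seconds) to
# crawl 200 containers.

plugin_latency_s = {"cpu": 0.18, "memory": 1.0, "package": 4.8}

combined_s = sum(plugin_latency_s.values())  # ~6s combined base crawl latency
```

This additivity is what lets the evaluation enable plugin combinations without re-measuring each combination from scratch.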
Setup: Each container was given equal CPU shares and 256MB of memory. The httpd containers were pinned to 10 cores on the host, while the Docker daemon and its helper processes (docker-container shims, xfs/devicemapper, etc.) were pinned to the other 6 cores to minimize interference. Further, various httpd and kernel parameters were tuned appropriately to enable high-throughput operation, such as the maximum number of open files, maximum number of processes, and maximum available port range for connections. The workload consisted of random 64-byte static files, requested from each webserver container by httperf [37] workload generators on another host (the file size was selected so as to avoid a network bottleneck, verified by experiments). Throughputs and response times were recorded after a warm-up phase to ensure all data was brought in and served from memory in subsequent test runs, so as to put maximum stress on the system.

The base webserver capacity (the maximum number of serviced requests per second without any connection drops), aggregated across all 200 webserver containers, was observed to be 28000 requests/second (140 req/s per container), with an average response time of 30ms per request. With continuous out-of-band crawling, no impact is recorded on the sustainable request rate across the webserver containers; however, the average response time degrades by 50% to 45ms. Not all webserver containers see a hit in their response times, as can be seen in Figure 8, which plots the CDF of response times for the 200 containers with and without out-of-band monitoring. The crawl latency itself increases from 6s to 8s to extract CPU, memory, and package information across all 200 containers, even though the crawler process runs on a separate core from the webserver containers.

To mimic an agent-based monitoring methodology, we next ran the crawler process inside each webserver container, configured to run every 8s, matching the out-of-band crawling experiment above. With such in-guest monitoring, the aggregate sustainable request rate sees a 14% hit, with response times degrading by 65%.

In the in-guest monitoring mode, the monitor process competes for resources with the user workload, whereas in the agentless mode the monitor process (the crawler) had its own dedicated core (taskset to run on one of the 6 cores not running the webserver containers; see 'Setup' above). For completeness, we ran another experiment where the out-of-band crawler was restricted to use only the cores that were running the webserver containers. This still led to a lower impact than in-guest monitoring, in that the webservers' base request rate could still be achieved, but with a 75% response-time degradation (up from 50% when the crawler had a dedicated core). Alternatively, base response times could be achieved, but with a 7% lower sustainable request rate (still better than the 14% hit with in-guest monitoring, which rises as high as 21% to realize the base response times). Being able to use a separate core is in fact desirable, and indeed a benefit of OpVis' decoupled execution-monitoring framework, which enables offloading monitoring tasks outside the critical workflow path, thereby minimizing interference.

7.3 Space and network overhead

Disaggregated delivery of analytics functions is one of the core features of OpVis, enabled through the separation of data collection by the crawler and data curation by the backend annotators. This also implies the need to transfer data between these two processing endpoints, which are typically separated by low-latency, high-bandwidth local-area-network (LAN) connections. In this set of experiments, our objective was to measure the overhead of transferring and storing crawled data.

Figure 9. Size accounting for each crawled feature.

Figure 9 shows the size of the crawled data for each individual feature type from a single httpd container. During this experiment the common JSON emitter format was used; it is important to note that the size overhead would differ for other emitter formats. This size overhead can be further reduced by using data compression, which has its own performance implication of adding data-curation latency during decompression.

7.4 Performance study of annotators

Annotator                 Avg. "vis" latency (sec)
Remote Login Check        1 - 15
Vulnerability Analyzer    1
Compliance Check          7
Table 1. Data curation latency measured at the annotator.

From the cloud user's perspective, what matters is how quickly any non-conformity to security policies can be discovered for their hosted containers. In certain cases this depends on external factors, for instance, how quickly any newly discovered package vulnerability is formally added to the standard CVEs.
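The Vulnerability Analyzer's core check — cross-referencing crawled package inventories against known vulnerabilities — amounts to a lookup join, sketched below. The data shapes are illustrative, not the actual OpVis frame or CVE-feed format.

```python
# Illustrative sketch of the Vulnerability Analyzer annotator: crawled
# package inventories are matched against a feed of known-vulnerable
# package versions. The dict shapes (name -> version, (name, version)
# -> CVE ids) are assumptions for this example.

def vulnerable_packages(installed, vuln_feed):
    """Return (name, version, cve) triples for every installed package
    whose exact version appears in the vulnerability feed."""
    findings = []
    for name, version in installed.items():
        for cve in vuln_feed.get((name, version), []):
            findings.append((name, version, cve))
    return findings
```

Because this is a pure lookup over already-crawled state, the verdict can be produced within the ~1 second latency reported in Table 1, and re-run whenever the vulnerability feed is refreshed.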
In the OpVis context, we measure "vis" latency: the time taken by individual annotators to produce their respective verdicts. Given that all annotators operate in parallel, their vis latencies overlap. In Table 1, we report average vis latencies for 3 OpVis applications. In the Remote Login Check application, all account passwords inside containers are checked for weaknesses. For containers with weak passwords, the security report is produced quickly (within a second); for containers with strong passwords it can take up to 15 seconds. Intuitively, this aligns with the common security expectation that any potential violation is reported with priority. During a Compliance Check, close to 21 standard security rules are validated within 7 seconds. Finally, the Vulnerability Analyzer cross-checks the respective packages from containers against any known vulnerabilities, all within 1 second.

7.5 Log streaming in production

We now present data on the crawler's log-streaming behavior observed during a period of 1 month, while it was exercised by external users and internal core services of our production public cloud. Over this period, the crawler was tracking several hundred containers per host and thousands of containers overall, collecting approximately 250,000 log events per minute per host on average. Figure 10 shows the actual data for one of our cloud hosts. The top plot shows the rate of new log files created per minute, exhibiting the level of dynamism and live activity in the cloud, with many new log files discovered every minute as instances come and go. It also shows a trend of increasing adoption and scale at the macro level as the month progresses. The bottom plot shows the number of log events processed per second.

Figure 10. New log files per minute (top) and logs processed per second (bottom).

8 Related Work

While Section 2 differentiates OpVis from existing operational visibility solutions for VMs and containers, here we discuss other work related to OpVis' use of VMI, as well as its systems-analytics focus.

Memory Introspection Applications. Most previous VM introspection work focuses on the security and forensics domains. It is used by digital forensics investigators to obtain a VM memory snapshot to examine and inspect [14, 15, 24, 30, 47]. On the security side, VMI has been employed for kernel integrity monitoring [9, 33, 43], intrusion detection [29], anti-malware [11, 23, 28, 35, 42], firewall solutions [52], and information-flow tracking for bolstering sensitive systems [31]. Outside the security domain, IBMon [46] comes closest to OpVis' approach, using memory introspection to estimate bandwidth resource use for VMM-bypass network devices. Other recent work employs VMI for information-flow policy enforcement [8], application whitelisting [32], VM checkpointing [2], and memory deduplication [16].

Systems Analytics. While Section 4 discusses a few analytics applications developed on the OpVis framework, others have explored several related systems-analytics approaches. Examples include (i) Litty and Lie's out-of-VM patch auditing [36], which can detect execution of unpatched applications; (ii) the CloudPD [51] cloud problem-management framework, which uses metric correlations to identify problems; (iii) EnCore [62], which detects misconfigurations by inferring rules from configuration files; (iv) DeltaSherlock [56], which discovers system changes via fingerprinting and machine-learning methodologies; and (v) PeerPressure [61], which identifies anomalous misconfigurations by statistically comparing registry entries with other systems running the same application. Quite a few other works exist in the systems-analytics literature. With the entire system state — from metrics, to packages, to configuration files — exposed to the backend, OpVis provides a uniform and extensible framework enabling such analytics to be performed across time and across systems.

9 Conclusion

In this paper, we presented our unified monitoring and analytics framework, OpVis, to achieve operational visibility across the cloud. We described the various techniques employed to enable agentless extraction of volatile and persistent state across VM and container guests, without requiring guest cooperation or causing guest intrusion, interference, or modification. We emphasized the extensible nature of our framework, enabling custom data collection as well as analysis. We described 4 of the analytics applications we have built atop the OpVis pipeline, which have been active in our public cloud for over 2 years. We highlighted OpVis' high monitoring efficiency and low impact on target guests, and presented production data to demonstrate its usability. We described our open-source contributions and shared our experiences supporting operational visibility for our cloud deployment.

10 Acknowledgments

Over the years, several researchers have contributed to the various OpVis components and hence deserve credit for this work. These include Vasanth Bala, Todd Mummert, Darrell Reimer, Sastry Duri, Nilton Bila, Byungchul Tak, Salman Baset, Prabhakar Kudva, Ricardo Koller, and James Doran.
References

[1] Understanding security implications of using containers in the cloud. In 2017 USENIX Annual Technical Conference (USENIX ATC 17), Santa Clara, CA, 2017. USENIX Association.
[2] Ferrol Aderholdt, Fang Han, Stephen L. Scott, and Thomas Naughton. Efficient checkpointing of virtual machines using virtual machine introspection. In Cluster, Cloud and Grid Computing (CCGrid), 2014 14th IEEE/ACM International Symposium on, pages 414–423, May 2014.
[3] Amazon. CloudWatch. http://aws.amazon.com/cloudwatch/.
[4] Amazon. Summary of the October 22, 2012 AWS service event in the US-East region. https://aws.amazon.com/message/680342/.
[5] Anchore. Open source tools for container security and compliance. https://anchore.com/.
[6] Apache Kafka. https://kafka.apache.org, 2016.
[7] Apache Lucene. https://lucene.apache.org, 2016.
[8] Mirza Basim Baig, Connor Fitzsimons, Suryanarayanan Balasubramanian, Radu Sion, and Donald E. Porter. CloudFlow: Cloud-wide policy enforcement using fast VM introspection. In Proceedings of the 2014 IEEE International Conference on Cloud Engineering, IC2E '14, pages 159–164, 2014.
[9] Arati Baliga, Vinod Ganapathy, and Liviu Iftode. Detecting kernel-level rootkits using data structure invariants. IEEE Trans. Dependable Secur. Comput., 8(5):670–684, September 2011.
[10] Banyan. Over 30% of official images in Docker Hub contain high priority security vulnerabilities. https://banyanops.com/blog/analyzing-docker-hub/.
[11] Antonio Bianchi, Yan Shoshitaishvili, Christopher Kruegel, and Giovanni Vigna. Blacksheep: Detecting compromised hosts in homogeneous crowds. In Proceedings of the 2012 ACM Conference on Computer and Communications Security, CCS '12, pages 341–352, New York, NY, USA, 2012. ACM.
[12] Michael Boelen and John Horne. The Rootkit Hunter project. http://rkhunter.sourceforge.net/.
[13] Google cAdvisor. Analyzes resource usage and performance characteristics of running containers. https://github.com/google/cadvisor.
[14] Andrew Case, Andrew Cristina, Lodovico Marziale, Golden G. Richard, and Vassil Roussev. FACE: Automated digital evidence discovery and correlation. Digit. Investig., 5:S65–S75, September 2008.
[15] Andrew Case, Lodovico Marziale, and Golden G. Richard III. Dynamic recreation of kernel data structures for live forensics. Digital Investigation, 7, Supplement:S32–S40, 2010.
[16] Jui-Hao Chiang, Han-Lin Li, and Tzi-cker Chiueh. Introspection-based memory de-duplication and migration. In Proceedings of the 9th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, VEE '13, pages 51–62, New York, NY, USA, 2013. ACM.
[17] Clair. Automatic container vulnerability and security scanning for appc and Docker. https://coreos.com/clair/docs/latest/.
[18] cloudviz. agentless-system-crawler: A tool to crawl systems like crawlers for the web. https://github.com/cloudviz/agentless-system-crawler.
[19] collectd. The system statistics collection daemon. https://collectd.org/.
[20] Datadog. Modern monitoring and analytics. https://www.datadoghq.com/.
[21] Docker. Docker Hub: Dev-test pipeline automation, 100,000+ free apps, public and private registries. https://hub.docker.com/.
[22] Docker. http://www.docker.com, 2016.
[23] B. Dolan-Gavitt, B. Payne, and W. Lee. Leveraging forensic tools for virtual machine introspection. Technical Report GT-CS-11-05, Georgia Institute of Technology, 2011.
[24] Josiah Dykstra and Alan T. Sherman. Acquiring forensic evidence from infrastructure-as-a-service cloud computing: Exploring and evaluating tools, trust, and techniques. Digital Investigation, 9:S90–S98, 2012.
[25] Elastic. Elasticsearch. https://www.elastic.co/products/elasticsearch, 2016.
[26] Elastic. Logstash. https://www.elastic.co/products/logstash, 2016.
[27] Tenable FlawCheck. Expanding vulnerability management to container security. https://www.flawcheck.com/.
[28] Yangchun Fu and Zhiqiang Lin. Space traveling across VM: Automatically bridging the semantic gap in virtual machine introspection via online kernel data redirection. In IEEE Security & Privacy '12.
[29] Tal Garfinkel and Mendel Rosenblum. A virtual machine introspection based architecture for intrusion detection. In Proc. Network and Distributed Systems Security Symposium, pages 191–206, 2003.
[30] Brian Hay and Kara Nance. Forensics examination of volatile system data using virtual introspection. SIGOPS Oper. Syst. Rev., 42(3):74–82, 2008.
[31] J. Hizver and Tzi-cker Chiueh. Automated discovery of credit card data flow for PCI DSS compliance. In Reliable Distributed Systems (SRDS), 2011 30th IEEE Symposium on, pages 51–58, October 2011.
[32] Jennia Hizver and Tzi-cker Chiueh. Real-time deep virtual machine introspection and its applications. In Proceedings of the 10th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, VEE '14, pages 3–14, New York, NY, USA, 2014. ACM.
[33] Owen S. Hofmann, Alan M. Dunn, Sangman Kim, Indrajit Roy, and Emmett Witchel. Ensuring operating system kernel integrity with OSck. In ASPLOS, pages 279–290, 2011.
[34] VMware Inc. VMware VMsafe security technology. http://www.vmware.com/company/news/releases/vmsafe_vmworld.html.
[35] Xuxian Jiang, Xinyuan Wang, and Dongyan Xu. Stealthy malware detection through VMM-based out-of-the-box semantic view reconstruction. In CCS '07, pages 128–138.
[36] Lionel Litty and David Lie. Patch auditing in infrastructure as a service clouds. In Proceedings of the 7th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, VEE '11, pages 145–156, New York, NY, USA, 2011. ACM.
[37] David Mosberger and Tai Jin. httperf — a tool for measuring web server performance. SIGMETRICS Perform. Eval. Rev., 26(3):31–37, 1998.
[38] Nelson Murilo and Klaus Steding-Jessen. chkrootkit: locally checks for signs of a rootkit. http://www.chkrootkit.org/.
[39] NIST. Security Content Automation Protocol (SCAP) validation program. https://scap.nist.gov/validation/.
[40] OpenSCAP. Security compliance. https://scap.nist.gov/validation/.
[41] OSSEC. Open source HIDS security. https://ossec.github.io/.
[42] Bryan D. Payne, Martim Carbone, Monirul Sharif, and Wenke Lee. Lares: An architecture for secure active monitoring using virtualization. In Proceedings of the 2008 IEEE Symposium on Security and Privacy, SP '08, pages 233–247, 2008.
[43] Nick L. Petroni, Jr., Timothy Fraser, Jesus Molina, and William A. Arbaugh. Copilot — a coprocessor-based kernel runtime integrity monitor. In Proceedings of the 13th Conference on USENIX Security Symposium, SSYM '04, pages 13–13, Berkeley, CA, USA, 2004. USENIX Association.
[44] Prometheus. Monitoring system & time series database. https://prometheus.io/.
[45] Dell Quest/VKernel. Foglight for virtualization. http://www.quest.com/foglight-for-virtualization-enterprise-edition/.
[46] Adit Ranadive, Ada Gavrilovska, and Karsten Schwan. IBMon: Monitoring VMM-bypass capable InfiniBand devices using memory introspection. In Proceedings of the 3rd ACM Workshop on System-level Virtualization for High Performance Computing, pages 25–32, 2009.
[47] Andreas Schuster. Searching for processes and threads in Microsoft Windows memory dumps. Digit. Investig., 3:10–16, September 2006.
[48] Aqua Security. Container security — Docker, Kubernetes, OpenShift, Mesos. https://www.aquasec.com/.
[49] Tenable Network Security. Nessus vulnerability scanner. http://www.tenable.com/products/nessus-vulnerability-scanner.
[50] Sensu. Full-stack monitoring for today's business. https://sensuapp.org/.
[51] Bikash Sharma, Praveen Jayachandran, Akshat Verma, and Chita R. Das. CloudPD: Problem determination and diagnosis in shared dynamic clouds. In Dependable Systems and Networks (DSN), 2013 43rd Annual IEEE/IFIP International Conference on, pages 1–12. IEEE, 2013.
[52] Abhinav Srivastava and Jonathon Giffin. Tamper-resistant, application-aware blocking of malicious network connections. In Proceedings of the 11th International Symposium on Recent Advances in Intrusion Detection, pages 39–58, 2008.
[53] Sahil Suneja, Canturk Isci, Vasanth Bala, Eyal de Lara, and Todd Mummert. Non-intrusive, out-of-band and out-of-the-box systems monitoring in the cloud. In The 2014 ACM International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '14, pages 249–261, New York, NY, USA, 2014. ACM.
[54] Sysdig. Docker monitoring, Kubernetes monitoring & more. https://sysdig.com/.
[55] Nikolai Tschacher. Typosquatting programming language package managers. http://incolumitas.com/2016/06/08/typosquatting-package-managers/.
[56] Ata Turk, Hao Chen, Anthony Byrne, John Knollmeyer, Sastry S. Duri, Canturk Isci, and Ayse K. Coskun. DeltaSherlock: Identifying changes in the cloud. In Big Data (Big Data), 2016 IEEE International Conference on, pages 763–772. IEEE, 2016.
[57] Twistlock. Docker security & container security platform. https://www.twistlock.com/.
[58] VMware. VMCI overview. http://pubs.vmware.com/vmci-sdk/.
[59] VMware. VMware Tools. http://kb.vmware.com/kb/340.
[60] VMware. vShield Endpoint. http://www.vmware.com/products/vsphere/features-endpoint.
[61] Helen J. Wang, John C. Platt, Yu Chen, Ruyun Zhang, and Yi-Min Wang. Automatic misconfiguration troubleshooting with PeerPressure. In OSDI, volume 4, pages 245–257, 2004.
[62] Jiaqi Zhang, Lakshminarayanan Renganarayana, Xiaolan Zhang, Niyu Ge, Vasanth Bala, Tianyin Xu, and Yuanyuan Zhou. EnCore: Exploiting system environment and correlation information for misconfiguration detection. ACM SIGPLAN Notices, 49(4):687–700, 2014.
[63] Wei Zheng, Ricardo Bianchini, and Thu D. Nguyen. Automatic configuration of internet services. SIGOPS Oper. Syst. Rev., 41(3):219–229, March 2007.