SECOND EDITION Understanding Log Analytics at Scale Log Data, Analytics, and Management

Matt Gillespie and Charles Givre

Beijing Boston Farnham Sebastopol Tokyo Understanding Log Analytics at Scale by Matt Gillespie and Charles Givre Copyright © 2021 O’Reilly Media, Inc. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected].

Acquisitions Editor: Jessica Haberman
Development Editor: Michele Cronin
Production Editor: Beth Kelly
Copyeditor: nSight, Inc.
Proofreader: Abby Wheeler
Interior Designer: David Futato
Cover Designer: Randy Comer
Illustrator: Kate Dullea

May 2021: Second Edition
February 2020: First Edition

Revision History for the Second Edition 2021-05-05: First Release

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Understanding Log Analytics at Scale, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc. The views expressed in this work are those of the authors, and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights. This work is part of a collaboration between O’Reilly and Pure Storage. See our statement of editorial independence.

978-1-098-10423-8 [LSI]

Table of Contents

1. Log Analytics
   Capturing the Potential of Log Data

2. Log Analytics Use Cases
   Cybersecurity
   IT Operations
   Industrial Automation

3. Tools for Log Analytics
   Splunk
   Elastic (Formerly ELK) Stack
   Sumo Logic
   Apache Kafka
   Apache Spark
   Combining Log Analytics with Modern Developer Tools
   Deploying Log Analytic Tools

4. Topologies for Enterprise Storage Architecture
   Direct-Attached Storage (DAS)
   Virtualized Storage
   Physically Disaggregated Storage and Compute

5. The Role of Object Stores for Log Data
   The Trade-Offs of Indexing Log Data

6. Performance Implications of Storage Architecture
   Additional Considerations for Security Analytics

7. Enabling Log Data’s Strategic Value with a Unified Fast File and Object Platform

8. Nine Guideposts for Log Analytics Planning
   Guidepost 1: What Are the Trends for Ingest Rates?
   Guidepost 2: How Long Does Log Data Need to Be Retained?
   Guidepost 3: How Will Regulatory Issues Affect Log Analytics?
   Guidepost 4: What Data Sources and Formats Are Involved?
   Guidepost 5: What Role Will Changing Business Realities Have?
   Guidepost 6: What Are the Ongoing Query Requirements?
   Guidepost 7: How Are Data-Management Challenges Addressed?
   Guidepost 8: How Are Data Transformations Handled?
   Guidepost 9: What About Data Protection and High Availability?

9. Conclusion

CHAPTER 1 Log Analytics

The humble machine log has been with us for many technology generations. The data that makes up these logs is a collection of records generated by hardware and software—including mobile devices, laptop and desktop PCs, servers, operating systems, applications, and more—that document nearly everything that happens in a computing environment. With the constantly accelerating pace of business, these logs are gaining importance as contributors to practices that help keep applications running 24/7/365 and analyze issues faster to bring them back online when outages do occur.

If logging is enabled on a piece of hardware or software, almost every system process, event, or message can be captured as a time-series element of log data. Log analytics is the process of gathering, correlating, and analyzing that information in a central location to develop a sophisticated understanding of what is occurring in a datacenter and, by extension, providing insights about the business as a whole.

The comprehensive view of operations provided by log analytics can help administrators investigate the root cause of problems and identify opportunities for improvement. With the greater volume of that data and novel technology to derive value from it, logs have taken on new value in the enterprise. Beyond long-standing uses for log data, such as troubleshooting system functions, sophisticated log analytics have become an engine for business insight and compliance with regulatory requirements and internal policies, such as the following:

• A retail operations manager looks at customer interactions with the ecommerce platform to discover potential optimizations that can influence buying behavior. Complex relationships among visit duration, time of day, product recommendations, and promotions reveal insights that help reduce cart abandonment rates, improving revenue.
• A ride-sharing company collects position data on both drivers and riders, directing them together efficiently in real time, as well as performing long-term analysis to optimize where to position drivers at particular times. Analytics insights enable pricing changes and marketing promotions that increase ridership and market share.
• A smart factory monitors production lines with sensors and instrumentation that provide a wealth of information to help maximize the value generated by expensive capital equipment. Applying analytics to log data generated by the machinery increases production by tuning operations, identifying potential issues, and preventing outages.

Using log analytics to generate insight and value is challenging. The volume of log data generated all over an enterprise is staggeringly large, and the relationships among individual pieces of log data are complex. Organizations are challenged with managing log data at scale and making it available where and when it is needed for log analytics, which requires high compute and storage performance.

Log analytics is maturing in tandem with the global explosion of data more generally. The International Data Corporation (IDC) predicts that the global datasphere will grow more than fivefold in seven years, from 33 zettabytes in 2018 to 175 zettabytes in 2025.1 (A zettabyte is 10^21 bytes, or a million petabytes.)

What’s more, the overwhelming majority of log data offers little value and simply records mundane details of routine day-to-day operations such as machine processes, data movement, and user transactions. There is no simple way of determining what is

1 David Reinsel, John Gantz, and John Rydning. IDC, November 2018. “The Digitization of the World from Edge to Core”.

important or unimportant when the logs are first collected, and conventional data analytics are ill-suited to handle the variety, velocity, and volume of log data.

This report examines emerging opportunities for deriving value from log data, as well as the associated challenges and some approaches for meeting those challenges. It investigates the mechanics of log analytics and places them in the context of specific use cases, before turning to the tools that enable organizations to fulfill those use cases. The report next outlines key architectural considerations for data storage to support the demands of log analytics. It concludes with guidance for architects to consider when planning and designing their own solutions to drive the full value out of log data, culminating in best practices associated with nine key questions:

• What are the trends for ingest rates?
• How long does log data need to be retained?
• How will regulatory issues affect log analytics?
• What data sources and formats are involved?
• What role will changing business realities have?
• What are the ongoing query requirements?
• How are data-management challenges addressed?
• How are data transformations handled?
• What about data protection and high availability?

Capturing the Potential of Log Data

At its core, log analytics is the process of taking the logs generated from all over the enterprise—servers, operating systems, applications, and many others—and deducing insights from them that power business decision making. That requires a broad and coherent system of telemetry, which is the process of PCs, servers, and other endpoints capturing relevant data points and either transmitting them to a central location or feeding them into a robust system of edge analytics.

Log analytics begins with collecting, unifying, and preparing log data from throughout the enterprise. Indexing, scrubbing, and normalizing datasets all play a role, and all of those tasks must be

completed at high speed and with efficiency, often to support real-time analysis. This entire life cycle and the systems that perform it must be designed to be scalable, flexible, and secure in the face of requirements that will continue to evolve in the future.

Generating insights consists of searching for specific pieces of data and analyzing them together against historical data as well as expected values. The log analytics apparatus must be capable of detecting various types of high-level insights such as anomalies, relationships, and trends among the log data generated by information technology (IT) systems and technology infrastructure, as shown in Figure 1-1.

Figure 1-1. High-level types of insights discoverable from log data

The following are some examples of these types of high-level insights:

Anomaly
    Historically, 90% of the traffic to a given server has come from HR. There is now an influx of traffic from a member of the sales department. The security team might need to investigate the possibility of an insider threat.

Relationship
    The type of spike currently observed in traffic to a self-serve support portal from a specific customer often precedes losing that customer to the competition. The post-sales support team might need to ensure that the customer isn’t at risk.

Trend
    Shopping cart abandonment rates are increasing on the ecommerce site for a specific product type. The sales operations team

might need to investigate technical or marketing shortcomings that could be suppressing that product’s sales.

In addition to detecting these high-level insights, the log analytics apparatus must be capable of effective reporting on and visualization of those findings to make them actionable by human administrators.

Your Environment Has Too Many Log Sources to Count

Log data is generated from many sources all over the enterprise, and deciding which ones to use for analytics is an ongoing process that can never be completed. The following list is representative, as opposed to exhaustive:

Servers
    Operating systems, authentication platforms, applications, databases

Network infrastructure
    Routers, switches, wireless access points

Security components
    Firewalls, identity systems such as Active Directory (AD), intrusion prevention systems, management tools

Virtualization environments
    Hypervisors, orchestration engines, management utilities

Data storage
    Local, virtualized, storage area network (SAN), and/or network-attached storage (NAS) resources; object storage

Client machines
    Usage patterns, data movement, resource accesses

Although they derive from a shared central concept, implementations of log analytics are highly variable in scope, intent, and requirements. They can run the gamut from modest to massive in scale, with individual log entries that might be sparse or verbose, carrying all manner of information in an open-ended variety of formats that might not be readily compatible, as shown in Figure 1-2. All share the challenge of tightening the feedback loop between sifting through and interpreting enormous numbers of events, often in real time, to generate insights that can optimize processes.

Storing log data enables analysts to go back through the repository of time-series data and re-create a series of events, correlating causes and effects after the fact. In addition to casting light on the past, identifying historical patterns also helps illuminate present and future dangers and opportunities. The sheer volume of that data and the need to be able to effectively query against it places significant demands on storage systems.

Figure 1-2. Challenge of bringing together nonstandardized log file formats

Treating Logs as Data Sources

The contents of logs are less a series of metrics than they are text strings akin to natural language, in the sense that they are formatted imprecisely, with tremendous variation depending on who created the log-writing mechanism. In addition, because log entries are only semistructured, they must be interpreted and then parsed into discrete data points before being written to a database.

Telemetry from thousands of different sources might be involved, from simple sensors to enterprise databases. In keeping with that enormous diversity, the structure, contents, and syntax of entries vary dramatically. Beyond differences in format and syntax, various logs contain discrete datasets, with mismatched types of data. Transforming and normalizing this data is key to making it valuable.

Analytics can be performed on log data that is either streaming or at rest. Real-time or near-real-time analysis of logs as they are generated can monitor operations and reveal emerging or existing problems. Analysis of historical data can identify trends in quantities such as hardware utilization and network throughput, providing technology insights that complement business insights more broadly and help guide infrastructure development. Using older data

to create baselines for the future also helps to identify cases for which those ranges have been exceeded.

Logs Versus Metrics

Both logs and metrics are essentially status messages, which can come from the same source. They are complementary but distinct, as represented in Figure 1-3.

Figure 1-3. Comparison between logs and metrics

Logs are semistructured, defined according to the preferences of the individual developers who created them. They are verbose by nature, most often based on free text, often resembling the natural language from which they derive. They are intended to give detail about a specific event, which can be useful in drill-down, root-cause analysis of scenarios such as system failures or security incidents.

Metrics are quantitative assessments of specific variables, typically gathered at specific time intervals, unlike logs, which are triggered by external events. Metrics have a more structured format than logs, making them suitable for direct numerical analysis and visualization. Because their collection is governed by time rather than events, volumes of metrics data tend to scale more gradually than logs with increased IT complexity and transaction volume.

Of the two, logs are far messier. Although they are semistructured, recovering that structure requires parsing with specialized tools. Metrics, by contrast, are inherently highly structured.

Logs and metrics can work together, with different functions that reflect their respective structures. For example, metrics reveal trends through repeated measurement of the same quantities over time. This sequence of data points, the "time series" mentioned earlier, can be plotted as a line graph, where increases in query response time might indicate deteriorating performance of a database. The greater level

of forensic detail in the corresponding logs can be the key to determining why that deterioration occurred.

One key distinction between logs and metrics is that metrics are usually generated to answer specific questions and often cannot be repurposed for other use cases. For example, if you wanted to know how many unique users per day were accessing a website, you could gather this information by aggregating all the logs from each day during a query. However, doing this at query time is computationally expensive unless the data is indexed. A metric-based approach would be to generate a daily summary of unique users. The metric-based approach is quicker to query; however, you cannot drill down into the data any further. Therefore, when planning metrics, it is important to think of as many use cases as possible when you are designing them to extract maximum value.
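As a rough illustration of that trade-off, the following Python sketch computes daily unique users both from raw log records and from a precomputed summary. The field names and values are invented for the example, and pandas is simply one convenient tool for the aggregation.

    import pandas as pd

    # Hypothetical web access log entries; field names are illustrative only.
    events = pd.DataFrame({
        "timestamp": pd.to_datetime([
            "2021-05-01 09:14:03", "2021-05-01 11:47:55",
            "2021-05-02 08:02:11", "2021-05-02 08:02:45",
        ]),
        "user_id": ["alice", "bob", "alice", "alice"],
        "url": ["/home", "/cart", "/home", "/checkout"],
    })

    # Log-style approach: scan every raw event at query time (flexible, but costly at scale).
    daily_unique = (
        events.set_index("timestamp")
              .resample("D")["user_id"]
              .nunique()
    )

    # Metric-style approach: persist only the daily summary (cheap to query,
    # but the per-user detail needed for drill-down is gone).
    daily_metric = daily_unique.rename("unique_users").reset_index()
    print(daily_metric)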

Log data is messy, both in the organizational sense of mismatched formats of logs from various sources, as well as in the hygienic sense of misspellings, missing data, and so on. Beyond the need to interpret the structures of entries and then parse them, the transformations applied to log data must also account for quality issues within the data itself. For example, log analytics systems typically provide the ability to interpret data so that they can successfully query against data points that might include synonyms, misspellings, and other irregularities.

Aside from quality issues, logs can contain mismatches simply because of the way they characterize data, such as one security system tagging an event as "warning" while another tags the same event as "critical." Such discrepancies among log entries must be resolved as part of the process of preparing data for analysis.

The need to collect log data from legacy systems can also be challenging. Whereas legacy applications, operating systems, and hardware are frequent culprits in operational issues, they can provide less robust (or otherwise different) logging than their more modern counterparts. Additional layers of data transformation might be required in such cases in order to normalize their log data to that of the rest of the environment and provide a holistic basis for log analytics.
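A hedged sketch of one such transformation follows: normalizing mismatched severity labels onto a shared vocabulary. The mapping table and labels here are illustrative assumptions, not a standard.

    # A minimal sketch of normalizing severity labels across sources;
    # the mapping and labels are assumptions for illustration.
    SEVERITY_MAP = {
        "warn": "warning", "warning": "warning",
        "crit": "critical", "critical": "critical", "sev1": "critical",
        "err": "error", "error": "error",
    }

    def normalize_severity(raw_label: str) -> str:
        """Map a source-specific severity string onto a common vocabulary."""
        return SEVERITY_MAP.get(raw_label.strip().lower(), "unknown")

    print(normalize_severity(" CRIT "))   # -> critical
    print(normalize_severity("Warning"))  # -> warning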

Standardizing Log Formatting

One of the major issues in log analysis is that there is no standard way of formatting log messages. The closest thing to a standard is the Syslog protocol (RFC 5424); unfortunately, there are many other formats for log messages. In order to analyze log data, the messages must ultimately be mapped to a schema of columns and rows. Using custom log formats means that any downstream system that seeks to consume these logs will require custom parsers for each format.

In the security environment, some devices can be configured to emit logs as XML or JSON.2 These formats offer two advantages over plaintext logs: the schema of the data is contained in the file structure, which simplifies downstream processing, and most analytic scripting languages such as Python or R have libraries to easily ingest these formats, which eliminates the need for custom parsers. Of these two formats, JSON has two significant advantages over XML: the schema in JSON data is not ambiguous, and the file size is smaller for JSON files than XML given the same data.
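As a brief illustration of why self-describing formats simplify downstream processing, the following sketch loads hypothetical JSON log lines without a custom parser. The keys and values are invented for the example.

    import json
    import pandas as pd

    # Hypothetical JSON-formatted firewall log lines; keys are illustrative.
    raw_lines = [
        '{"ts": "2021-05-01T10:00:01Z", "src_ip": "10.0.0.5", "action": "allow"}',
        '{"ts": "2021-05-01T10:00:02Z", "src_ip": "10.0.0.9", "action": "deny"}',
    ]

    # Because each record carries its own schema, no custom parser is needed.
    records = [json.loads(line) for line in raw_lines]
    df = pd.DataFrame(records)
    df["ts"] = pd.to_datetime(df["ts"])
    print(df.dtypes)
    print(df[df["action"] == "deny"])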

Columnar versus row-based formats

An additional option for log formats is a columnar format such as Apache Parquet. Parquet is an open source columnar file format designed to support fast data processing using complex data. Parquet is a columnar format,3 which means that instead of storing rows together, it groups data together by columns. Additionally, Parquet stores both the actual data and metadata about the columns. The columnar format has several advantages over row-based formats, including columnar compression, metadata storage, and complex data support. In practice, what this means is that storing log data in Parquet format can result in files up to six times smaller than conventional logs. Storing data in Parquet format may accelerate queries as well, because Parquet stores metadata that is useful in minimizing the amount of data scanned during queries. Many modern analytic platforms support the Parquet format, including Python/pandas, R, Spark, Drill, Presto, Impala, and others.
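The following is a minimal sketch of writing parsed log records to Parquet with pandas and reading back only selected columns. It assumes a Parquet engine such as pyarrow or fastparquet is installed, and the data is invented for illustration.

    import pandas as pd

    # Assumed to hold parsed log records, as in the earlier JSON example.
    df = pd.DataFrame({
        "ts": pd.to_datetime(["2021-05-01T10:00:01Z", "2021-05-01T10:00:02Z"]),
        "src_ip": ["10.0.0.5", "10.0.0.9"],
        "action": ["allow", "deny"],
    })

    # Columnar storage with compression (requires the pyarrow or fastparquet engine).
    df.to_parquet("firewall_logs.parquet", compression="snappy")

    # Column pruning: read back only the columns a query actually needs.
    actions = pd.read_parquet("firewall_logs.parquet", columns=["ts", "action"])
    print(actions.head())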

2 Unfortunately, the option to emit logs in XML or JSON is not available on all devices. 3 For more information about the difference between row-based and columnar stores see: “The Data Processing Holy Grail? Row vs. Columnar Databases”.

The Log Analytics Pipeline

As log data continues to grow in volume, variety, and velocity, the associated challenges require structured approaches to infrastructure design and operations. Log analytics and storage mechanisms for machine data based on tools such as Splunk and the Elastic Stack must be optimized across a life cycle of requirements, as illustrated in Figure 1-4. To perform these functions effectively, it must be possible to draw data from anywhere in the environment, without being impinged by data silos.

Figure 1-4. Pipeline for assembling and driving value from log data

This pipeline represents the process for transforming log data into actionable insights, although in practice, the order of the steps might be rearranged, or only a subset of the steps listed here might be used. Common operations performed on log data include the following:

Collect
    From its dispersed sources, log data must be aggregated, parsed, and scrubbed, inserting defaults for missing values and discarding irrelevant entries, etc.

ETL (Extract, Transform, Load)
    Data preparation can include being cleaned of bad entries, reformatted, normalized, and enriched with elements of other datasets.

Index
    To accelerate queries, the value of indexing all or a portion of the log data must be balanced against the compute overhead required to do so (as discussed below).

Store
    Potentially massive sets of log data must be stored efficiently, using infrastructure built to deliver performance that scales out smoothly and cost effectively.

Search
    The large scale of the log data in a typical implementation places extreme demands on the ability to perform flexible, fast, sophisticated queries against it.

Correlate
    The relationships among various data sources must be identified and correlated before the significance of the underlying data points can be uncovered.

Visualize
    Treating log entries as data means that they can be represented visually using graphs, dashboards, and other means to assist humans in understanding them.

Analyze
    Slicing and dicing log data and applying algorithms to it in a structured, automated way enables you to identify trends, patterns, and actionable insights.

Report
    Both predefined and ad hoc reporting must be powerful and flexible so that users can rapidly produce outputs that illuminate their business needs.

The framework of steps given here is a guideline that could easily be expanded to call out actions such as data parsing, transformation, and interpretation more specifically, among many others. The point of this life cycle description is to provide a workable overall view rather than the most exhaustive one possible.
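To make the pipeline concrete, here is a minimal, illustrative sketch of the collect, ETL, store, and analyze stages over a few syslog-style lines. The regular expression and the resulting "insight" are deliberately simplistic and are not a substitute for a real log analytics tool.

    import re
    from collections import Counter

    # Collect: raw, semistructured lines as they might arrive from an agent.
    raw = [
        "May  5 10:01:12 web01 sshd[311]: Failed password for admin from 203.0.113.7",
        "May  5 10:01:15 web01 sshd[311]: Failed password for admin from 203.0.113.7",
        "May  5 10:02:02 web01 sshd[340]: Accepted password for carol from 10.0.0.8",
    ]

    # ETL: parse each line into discrete fields (pattern is illustrative, not a full syslog parser).
    PATTERN = re.compile(r"^(?P<ts>\w+\s+\d+ \d\d:\d\d:\d\d) (?P<host>\S+) (?P<proc>[^:]+): (?P<msg>.*)$")
    events = [m.groupdict() for line in raw if (m := PATTERN.match(line))]

    # Store: here just an in-memory list; a real pipeline would write to a database or object store.
    # Analyze/Report: count failed logins per host as a trivial "insight."
    failures = Counter(e["host"] for e in events if "Failed password" in e["msg"])
    print(failures)   # Counter({'web01': 2})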

Getting your arms fully around the challenges associated with implementing log analytics is daunting. The potential sources and types of log data available are of an open-ended variety, as are the possible uses of that data. Although the specific implementations at every organization will be different, they share general technical requirements as well as the potential to be applied to common business needs and use cases.


CHAPTER 2 Log Analytics Use Cases

Technologists have been analyzing machine logs for decades, from the earliest days of tuning or troubleshooting their environments. Over time, the industry has found ways to increasingly automate that analysis, leading to the emergence of log analytics as we know it today. Now more than ever, log analytics can help businesses run more efficiently, reduce risk, and ensure continuity of operations. The use cases described in this section illustrate some examples of how log analytics has taken on new importance in the past several years, demonstrating how it can deliver unprecedented value to organizations of all types and sizes. Factors that have contributed to that growing importance include the following:

• Machine data provides greater opportunities for log analytics as well as challenges. The scale of data analysis will grow further as we continue to drive intelligence into the world around us. A single self-driving car is estimated to generate multiple terabytes of data each day, while a smart factory might generate a petabyte per day.1
• Greater variety among types of endpoints has already reached unprecedented levels as the IT environment has become more complex. As the pace of change accelerates and the Internet of Things (IoT) adds billions of new devices online, the insights to

1 Richard Friedman. Inside HPC, May 31, 2019. “Converging Workflows Pushing Con‐ verged Software onto HPC Platforms”.

be gained by bringing together multiple data sources will continue to increase.
• Technology evolution has made log analytics feasible at greater scale than before. In particular, the mainstream emergence of flash storage offers faster read/write speed than conventional spinning hard disk drives (HDDs), and low-cost compute capacity offers high performance with commodity servers.

With the increased scope and prevalence of log analytics as a whole, a growing set of common use cases has emerged. The remainder of this section discusses several prevalent ones, grouped here under the categories of cybersecurity, IT operations, and industrial automation. While many other use cases are possible and indeed prevalent, these provide a representative sample.

Cybersecurity

Securing IT and other systems is a classic application of log analytics based on the massive numbers of events that are logged and transmitted throughout a large organization. Cyber protection in this area draws from log data and alerts from security components such as firewalls and intrusion detection systems, general elements of the environment such as servers and applications, and activities such as user login attempts and data movement. Log analytics can play a role in multiple stages of the security life cycle:

Proactively identifying and characterizing threats
    Log analytics can iteratively search through log data to detect unknown threats that conventional security systems are not designed to identify, creating testable hypotheses.

Detecting and responding to attacks and other security events
    When abnormal indicators arise, log analytics can help to identify the nature and scope of a potential breach, minimize exposure, and then neutralize and recover from the attack.

Performing forensic analysis after a breach has occurred
    A robust log analytics platform helps identify the point-in-time log information that should be brought into a postmortem investigation as well as making that data available and acting on it.

Speed Matters

In recent years, there has been a growing recognition that effective cybersecurity measures are not just focused on preventing intrusions, but rather on detecting intrusions as rapidly as possible and (ideally) automatically acting to remediate the breach. The metrics mean time to detect (MTTD) and mean time to remediate (MTTR) are now standard KPIs for sophisticated cybersecurity organizations.

Reducing MTTD is vital because the less time a malicious actor has to operate in a corporate environment, the less damage they can do. In many cases, attackers lurk in the environment for days or months before picking a target. It is essential that security teams are able to retain searchable logs and correlate events quickly across several days or months to piece together an attack. Inability to look back quickly or slow historical queries can lead to blind spots.

Calculating this metric is actually quite complex in that an organization must record when a breach first occurred and then the time delta for when it was actually reported. Doing so usually will involve multiple logs as well as some records from the security information and event management (SIEM) platform that actually records when the breach was detected.

MTTR is the obvious follow-on to MTTD, and it has many moving pieces. In addition to detecting the incident, it must be remediated. After detecting a potential incident, the best systems will automatically act to remediate it. Whereas a SIEM platform can detect and monitor security incidents, a security orchestration, automation, and response (SOAR) platform is necessary to remediate threats while minimizing human intervention. A SIEM will detect the threat and fire an alert, but the SOAR platform can take actions such as isolating an infected device, automatically and without human intervention. SOAR platforms can significantly reduce the MTTR by removing or greatly reducing the human element from the equation of security incident remediation.

With all that said, speed is of the essence in both detection and remediation. Platforms that can rapidly access and analyze the data will ultimately be the most effective in reducing risk to organizations.
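As a simple illustration of the calculation, the following sketch derives MTTD and MTTR from per-incident timestamps. In practice these timestamps would be reconstructed from multiple logs and SIEM/SOAR records; the values here are invented, and the exact definition of MTTR (measured here from detection to remediation) varies by organization.

    from datetime import datetime

    # Hypothetical incident records; real ones would be assembled from
    # multiple logs plus SIEM/SOAR platform data.
    incidents = [
        {"occurred": datetime(2021, 4, 1, 2, 15), "detected": datetime(2021, 4, 1, 9, 40),
         "remediated": datetime(2021, 4, 1, 11, 5)},
        {"occurred": datetime(2021, 4, 9, 22, 0), "detected": datetime(2021, 4, 10, 1, 30),
         "remediated": datetime(2021, 4, 10, 2, 10)},
    ]

    def mean_hours(deltas):
        return sum(d.total_seconds() for d in deltas) / len(deltas) / 3600

    mttd = mean_hours([i["detected"] - i["occurred"] for i in incidents])
    mttr = mean_hours([i["remediated"] - i["detected"] for i in incidents])
    print(f"MTTD: {mttd:.1f} h, MTTR: {mttr:.1f} h")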

Detecting anomalies

Cyber processes often use analytics to define a typical working state for an organization, expressed as ranges of values or other indicators in log data, and then monitor activity to detect anomalies. For example, an unusual series of unsuccessful authentication attempts might suggest attempted illicit access to resources. Unusual movement of data could indicate an exfiltration attempt. The sheer volume of log data makes it untenable for humans to interpret it unaided.

With thousands of events per minute being documented by hardware and software systems all over the computing environment, it can be difficult or impossible to determine what is worthy of attention. Machine learning models can help analytics engines cull through these huge amounts of log data, detecting patterns that would not be apparent to human operators. Those processes can occur automatically, or they can be initiated by ad hoc queries by analysts or others. Their outputs can be used to identify items of interest for human analysts to investigate further, allowing them to focus their attention where it is the most valuable. A common application is threat hunting: threat hunters often use log analytics to help identify potential threats, look more deeply into them, and determine what response, if any, is required.
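A minimal sketch of the idea follows, using scikit-learn's IsolationForest over invented per-user, per-hour features (failed logins and data moved). Real deployments involve far richer features, baselining against historical data, and careful tuning; this is only an illustration.

    import numpy as np
    from sklearn.ensemble import IsolationForest

    # Hypothetical feature matrix: one row per user per hour, with counts of
    # failed logins and megabytes of data moved (values are illustrative).
    rng = np.random.default_rng(7)
    normal = np.column_stack([rng.poisson(2, 500), rng.normal(20, 5, 500)])
    suspicious = np.array([[40, 300.0]])            # burst of failures plus a large transfer
    X = np.vstack([normal, suspicious])

    model = IsolationForest(contamination=0.01, random_state=0).fit(X)
    labels = model.predict(X)                        # -1 flags likely anomalies
    print(np.where(labels == -1)[0])                 # row indices for an analyst to triage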

Machine Learning Is Invaluable to Anomaly Detection

The twin limiting factors in detecting anomalies in log data for security usages are massive data volumes and the necessity of looking for undefined patterns. The data itself is messy, consisting of many different formats and potentially containing misspellings, inconsistencies, and gaps. The anomalous patterns being looked for can be subtle and easy to overlook.

All of this makes humans poorly suited to anomaly detection at scale. Sustained massive levels of event volumes quickly become overwhelming, and a significant proportion of those events are irrelevant. At the same time, software tools might also not be successful, given that the effectiveness of their detection is limited by the accuracy of their assumptions, which are likely to be predetermined and static. Over the past 5 to 10 years, the industry has developed sophisticated dashboards to provide real-time views that help identify potential security incidents. These techniques can also be

applied to insider threat detection and user behavior analytics (UBA). Vendors such as Exabeam, Securonix, and Splunk UBA have become leaders in the space.

Machine learning (ML) and artificial intelligence (AI) are increasingly viable for improving those user monitoring approaches, overcoming some key limitations and turning massive data stores from a liability into an asset for helping to train ML models. Algorithms can use both historical and real-time data to continually update their vision of what "business as usual" looks like and use that moving baseline as the standard against which they interpret emerging log events.

In recent years, predictive analytics have become more common in a variety of usages. Based on all data received up to the current moment, a machine learning model can predict expected parameters of future events and then flag data that falls outside those ranges. Working from that set of detected anomalies, the algorithm can correlate them with other incidents to limit the universe of events to be considered and to illuminate patterns in real time.

By alerting security analysts to those anomalies and patterns, the system can pare the scope of alerts that must be investigated by human operators down to a manageable level. As a result, IT staff can focus on innovation and adding value to the organization rather than just maintaining the status quo.

In addition to alerting, ML models can be included as part of an automated playbook to not only identify malicious activity in a corporate environment but also remediate it. For example, an ML model can be built to detect anomalous behavior of a particular class of networking device. If it detects unusual behavior, the automation can then check the change management logs to see if that device has open change tickets. If it does not, the device can be isolated from the network, thereby preventing damage.

Identifying and Defeating Advanced Threats

One of the issues confronted by modern security teams is the subtlety and long time horizons associated with today's stealthy attacks. Advanced persistent threats operate by moving laterally as quietly as possible through an organization to gain access to additional resources and data, in a process that often elapses over a matter of months.

The key to detection often lies less in identifying any specific event than in overall patterns. In practice, a security analyst might begin with a suspicious activity such as a service running from an unexpected file location and then use various log data to uncover additional information surrounding that event to help discover whether it is malicious or benign. For example, other activities during the same login session, connections from unexpected remote IP addresses, or unusual patterns of data movement captured in logs can be relevant.

Treating log data as a coherent whole rather than natively as a disparate collection of data points enables analysts to examine activities anywhere across the entire technology stack. This approach also enables analysts to traverse events backward and forward through time to retrace and analyze the behaviors of a given application, device, or user. This capability can be vital in cases such as understanding the behaviors of persistent cyber threats that operate over the course of weeks or months.

Data context consists of information about each data point's connections to others, which must be encoded along with the data itself, typically in the form of metadata created to describe the main data. This context enables analysis to identify the significance of a given data point in relation to the greater whole.

Statistical analysis against bodies of machine log data can reveal relationships that would otherwise remain hidden. Those insights help security teams more confidently categorize events in terms of the levels of threat they represent, enabling faster, more precise responses that help limit negative impacts on operations, assets, and reputations. In the context of a smart factory, for example, that analysis can help avoid unplanned outages that would otherwise lead to lost productivity and profitability.
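A small, hedged illustration of pivoting on a suspicious event and pulling related activity from another log source within a time window; the tables, hostnames, and field names are fabricated for the example.

    import pandas as pd

    # Hypothetical, pre-parsed event tables from two different log sources.
    process_events = pd.DataFrame({
        "time": pd.to_datetime(["2021-04-12 03:14:07"]),
        "host": ["hr-ws-17"],
        "detail": ["service started from C:\\Users\\Public\\tmp\\svc.exe"],
    })
    network_events = pd.DataFrame({
        "time": pd.to_datetime(["2021-04-12 03:02:44", "2021-04-12 03:15:21", "2021-04-12 07:40:00"]),
        "host": ["hr-ws-17", "hr-ws-17", "hr-ws-17"],
        "remote_ip": ["10.0.0.4", "198.51.100.23", "10.0.0.4"],
    })

    # Pivot on the suspicious process event and pull all network activity from
    # the same host within a +/- 30 minute window, looking backward and forward in time.
    pivot = process_events.iloc[0]
    window = pd.Timedelta(minutes=30)
    related = network_events[
        (network_events["host"] == pivot["host"])
        & (network_events["time"].between(pivot["time"] - window, pivot["time"] + window))
    ]
    print(related)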

First, Prepare the Organization

Because sources of log data often cross organizational boundaries, even within the same company, setting the foundations for a log analytics practice involves more than technology. For example, a single transaction might involve technology components that are managed and controlled by many different individuals and departments, as illustrated in Figure 2-1.

Figure 2-1. Resources involved in a single transaction, managed by separate teams

The expertise of people who know the systems best can be invaluable in determining what logs are available and how to get data from them for use in analytics. Cooperation with all of these entities is essential at all points along the transaction chain. Administrators of everything from enterprise databases to network hardware will need to enable logging and provide access to the log files generated.

This reality makes the buy-in and support of senior management essential. Involving them early in the decision-making process is sound advice, and it illustrates the value of presenting them with use cases that support their business interests. In particular, the cybersecurity potential of log analytics is a strong inducement to cooperation that crosses organizational boundaries.

IT Operations

Even though IT operations have always been challenging, today's business and technology climate makes them even more so. Very high standards for the quality of end-user experience have become the norm; meanwhile, the pace of change has accelerated and the ability to turn on a dime is taken for granted. At the same time, many of these organizations are being forced to do more with less in the face of budgets that might be flat or even decreasing.

The technology environment itself has also become more complex and varied. In place of traditional client-server computing stacks that were relatively uniform and static, dynamic, heterogeneous infrastructures change in real-time cadence with varying workloads. A growing proportion of the network is defined in software, creating new management challenges, and software paradigms such as microservices and containers challenge basic assumptions about

enterprise applications. And all of this unfolds against the backdrop of unprecedented data volumes.

The combination of log analytics with metrics and application performance monitoring (APM), also known as observability, can help address the need for IT operations to be highly responsive to each diverse part of the enterprise, maintaining smooth operations and avoiding downtime. Speed is of the utmost importance, with operations teams automatically finding and remediating as many issues as possible without human intervention, and addressing others quickly. This capability is a prerequisite for meeting service-level agreements (SLAs), delivering a good end-user experience, and maintaining uninterrupted, highly responsive access to resources.

Enabling logs in infrastructure and applications provides visibility into factors that reveal insights about performance and availability. Both machine-to-machine and machine-to-human modalities can make use of analysis based on that log data to identify potential issues before they arise and to tune the environment on an ongoing basis, to deliver the best results possible in the face of changing business requirements. For troubleshooting, the log analytics stack helps accelerate root-cause analysis by bringing together time-series data from all over the environment.

Infrastructure Monitoring and Troubleshooting

As IT infrastructure becomes more complex, the challenges with providing uninterrupted, smooth operation and an excellent end-user experience become more acute. In particular, a single operation can involve a large number of different systems, which might involve a combination of resource types, as shown in Figure 2-2, such as far-flung collections of sensors, on-premises systems, public cloud, and software as a service (SaaS). Platforms can vary dramatically, services and applications might be controlled by different parts of the organization, and the information available from each can be inconsistent.

Traditionally, systems were monitored manually and independently, calling on the familiar image of an administrator scanning multiple displays arrayed at a monitoring station. Unfortunately, this approach of eyeing alerts as the basis for keeping systems in good working order becomes less viable as the environment becomes larger and more complex. Both the large numbers of systems to keep

watch over and the multidimensional interactions among them defy the abilities of human operators. What's more, the dispersed nature of the log data being generated under this model makes it difficult to discern relationships among them.

Figure 2-2. Diverse resource types contributing to a single end-user experience

For example, when doing root-cause analysis of a performance issue with a database application, there are many places to look at once. The application itself can be a contributor, as can the database platform, the server hardware that both run on, and the availability of network resources. There might be contributing factors associated with all of these, or even another external entity that might not be immediately apparent, such as a single sign-on (SSO) mechanism, firewall, or DNS server.

Aggregating all of that data can provide the basis for more efficient and sophisticated analysis. Having access to a composite picture of all the factors potentially contributing to the issue lets administrators troubleshoot with a holistic approach, rather than having to look at various resources in a piecemeal fashion. For example, staff are able to look at all of the events that occurred on the entire body of related systems at a specific point in time, correlating them to uncover the issue—or combination of issues—behind the problem. Alternatively, edge analytics alleviate the need to send the complete dataset to the data lake.

Software development, optimization, and debugging

Applying log analytics within software development organizations arms developers with information about how their code behaves

and interacts with the rest of the environment. This insight can help them optimize the quality of their software while also letting them act more deliberately, breaking the cycle of running from fire to fire. In addition to enabling the development organization to be proactive rather than reactive, the organization as a whole is saved from the impacts of avoidable issues in software.

The success of a development organization is directly tied to how well its software functions in a variety of situations, including edge cases and unforeseen circumstances. Identifying potential issues before they affect the production environment is a critical capability. For example, a minor inefficiency in an application's operation could become an unacceptable bottleneck as usage, data volumes, and integration with other systems grow. An intermittent delay in one process can affect other processes that are dependent on it over time, and the cascade effect can eventually become untenable.

Log files can provide early indications of emerging issues long before they would normally become apparent to users or administrators. For example, a gradual trend toward a database server taking a progressively longer time opening a set of database records can indicate growing response-time issues. Log analytics can help detect this trend and then determine its root cause, whether it is a simple capacity issue or a need for performance tuning in application code. Anticipating the usability issue before it affects users allows for the necessary steps to be taken in a more timely fashion and business impacts to be avoided.

Apart from avoiding exceptions, log analytics can also aid in capacity planning by helping predict how a system will continue to perform as it scales. By tracking the time needed to perform a given operation as the number of database records increases, it's possible to estimate how those trends will continue in the future. That information can give you a better idea of the viability of a specific piece of code going forward, which can feed strategy about when a new approach to achieving the desired result might be needed.

In the context of a DevOps practice, log analytics can help teams ensure compatibility of their code with complex distributed systems. Applied in the early stages of your continuous integration/continuous delivery (CI/CD) pipeline, log analytics can show how the code interacts with the rest of the environment before it is released to production. By anticipating issues at this stage, we can often fix

them with less effort and expense than if those problems didn't emerge until post-deployment.2
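The capacity-planning idea described above, projecting how an operation's duration grows with data volume, can be sketched as a simple trend fit. The measurements below are invented, and a linear model is only a first approximation of behavior that is often nonlinear.

    import numpy as np

    # Hypothetical measurements pulled from application logs: table size
    # (millions of records) versus time to open a record set (seconds).
    records_millions = np.array([1, 2, 4, 8, 16], dtype=float)
    open_seconds = np.array([0.8, 1.1, 1.9, 3.6, 7.4])

    # Fit a simple linear trend and project it forward as a first-order estimate.
    slope, intercept = np.polyfit(records_millions, open_seconds, 1)
    projected = slope * 64 + intercept
    print(f"Projected open time at 64M records: {projected:.1f} s")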

Application performance monitoring

The purpose of APM is to maintain a high standard of user or customer experience by measuring and evaluating the performance of applications across a range of domains. Common capabilities include analysis of transaction throughput and response time, establishing baseline levels of those quantities, and monitoring to generate alerts when performance strays outside set limits. Real-time data visualizations are typically employed to help conceptualize the significance of emerging events, as an aid to identifying potential problems and their root causes (hopefully before they occur).

Log analytics can also play a key role in the related field of A/B testing, for which log data related to usage is collected separately for two versions of an application. The two sets of log data are then compared side by side to identify how the changes made between the two versions of the application affect factors such as the UX.

Exerting control and visibility across applications has become more complex in recent years as modern applications have transmuted from monolithic stature to distributed collections of microservices, containers, and other components loosely coupled together using a variety of APIs. In addition to forming a complex structure for the application, these components can be hosted using a combination of on-premises resources and multiple cloud environments.

Together, these factors make the task of integrating and consolidating log data to track application performance more challenging than in the past. Accordingly, APM approaches have shifted to meet the needs of modern applications and development practices. Because APIs are central to the architectures of distributed applications, monitoring and managing API performance is critical to the success of APM as a whole. Likewise, the growing prominence of containers, especially for cloud-distributed applications, makes monitoring container performance an important consideration, as well.

APM practices should allow for robust prioritization of performance problems for resolution, identifying the most urgent ones for

2 For more about log analytics in support of CI/CD, see “DevOps Drives the Business”.

triage. That capability is often assisted by machine learning algorithms that are trained to recognize the signs of common performance issues. At the same time that triage is necessary, the log analytics toolset must also provide precise insights that enable deeper root-cause analysis. This capability is essential to avoid wasting time and effort to solve a secondary or ancillary issue without addressing the underlying root cause.

Industrial Automation

In many cases, log analytics for industrial automation begins with adding connectivity to rich data collection mechanisms that already exist on industrial systems. The computing and control apparatus often captures detailed log data, stores it for a specified period of time, and then discards it. Technical staff can manually access that data in response to outages or performance problems, although in many organizations, there is no day-to-day usage of that log data beyond watching for exceptions or other issues.

Enabling Industry 4.0

The essence of the fourth industrial revolution (Industry 4.0)3 is taking full advantage of the connected computer systems that underlie industrial processes. Data exchange among different parts of the environment, including cyber-physical systems, is a key aspect of that development. The potential for using log data to enable Industry 4.0 depends on integrating data from both IT and operational technology (OT) systems, generating insight from it, and applying that insight to optimize efficiency, profitability, and competitiveness.

For example, analyzing data gathered from automated manufacturing equipment can enable sophisticated systems for preventive maintenance. Such measures monitor operating logs and metrics from industrial mechanisms in production and optimize maintenance to avoid unplanned outages and maximize the working life span of capital equipment.

3 Tom Merritt. Tech Republic, September 3, 2019. “Top Five Things to Know about Industry 4.0”.

The mechanisms to facilitate those processes are sometimes known as the physical-to-digital-to-physical (PDP) loop,4 which is illustrated in Figure 2-3. In the physical-to-digital stage of the PDP loop, log data is captured from cyber-physical systems to create a record of operational details. In the digital-to-digital stage, that data is gathered centrally so that analytics and visualizations can be applied to it, generating insights about how operation of the physical systems can be optimized or enhanced. The digital-to-physical stage provides the novel aspect of this process that distinguishes Industry 4.0, namely, a feedback loop back to the cyber-physical systems, which can then act on those insights.

Figure 2-3. The Industry 4.0 physical-to-digital-to-physical loop

Operating in real time, the information flowing through the PDP loop enables industrial automation equipment to be continually self-tuning. The resulting efficiency improvements help to ensure that companies can extract the maximum value out of their capital investments.

In place of the physical equipment in the preceding description of the PDP loop, we can modify the model to use a digital twin, which is a sensor-enabled digital replica of an industrial cyber-physical system. The twin is a dynamic digital doppelganger that is continuously

4 Mark Cotteleer and Brenna Sniderman. Deloitte Insights, December 18, 2017. “Forces of Change: Industry 4.0”.

updated by means of log data that is collected from the physical device so that it accurately represents the system in real time. We can use the simulation to observe and investigate the operation of equipment under different conditions and analyze them together with historical data to predict maintenance requirements in advance.

Industrial Internet of Things

Instrumentation and telemetry are nothing new in industrial applications, for which data has been collected for decades from sensors on equipment that ranges from oil rigs to assembly lines to jet engines. It is common for millions or even billions of dollars' worth of equipment to be in use within a large industrial operation. That high value of capital equipment and its importance to profitability has led to increasingly richer and more plentiful information being provided by these telemetry systems.

The motivation for collecting all of this information is both to make the equipment operate as efficiently as possible—generating more profit more quickly—and to extend the working life of the equipment itself, maximizing return on investment. Improvement in the tools and other technologies that support log analytics has driven the ability to get more granular telemetry, to aggregate and analyze it more effectively, and to more directly implement the outcome. The goal, then, is to tighten the feedback loop between microevents that occur in the environment, looking at millions of such events to deduce significance, and to use that insight to optimize some process.

Part of the challenge of log analytics in an Industrial Internet of Things (IIoT) context is the enormous variety of data. The rapidly evolving nature of this field means that many new players are emerging, and it might be some time before standard data formats emerge. It is not uncommon to have thousands of different types of sensors, all with different data formats and different ways of communicating information. All of that information needs to be brought together and normalized in such a way that either a human or an algorithm can make sense of it to reveal the information buried inside.

It is common in IIoT deployments for large numbers of sensors and other data sources to be located far from the network core. In such

cases, it is often not practical or desirable to transfer the full volume of data generated over a wide-area connection.

In response, "edge analytics"—the practice of performing analytics close to the data source—is becoming increasingly prevalent and can have multiple advantages. In autonomous vehicles and other real-time usages, for example, the results of an algorithm being applied to log data are required as near to instantaneously as possible. Long-range transfer of data is also incompatible with latency-sensitive usages, such as real-time control of manufacturing-line equipment.

Performing edge analytics on log data helps support the ability to analyze that data in a variety of ways for different business needs. For example, monitoring for safety issues or controlling machine settings in real time might be performed at the edge, whereas analysis of operational data across the enterprise might be better suited to a centralized analytics apparatus. As edge analytics expands in use, more organizations will use tools such as Kafka to split the data between cloud and on-premises storage—making the data accessible in both locations.

Similarly, this combination of approaches allows for both real-time inquiry to discover and address problems as they happen as well as longitudinal studies, including the addition of historical information, to discover long-term trends and issues. That work depends on having a centrally available repository of historic data. On their way to that central store, log data streams might pass through complex, multistage pipelines.

The data and various pieces of metadata typically come from sensors to a staging point where the data is collected and transformed in various ways so that it can be more readily stored in a database and compared against other data that initially might have had an incompatible structure and format. For example, this staging point could be a central information hub on an oil field where data from dozens of wellheads is collected.

From there, the data can be aggregated at a regional hub that incorporates all of the data from several oil fields, performing further transformations on the data and combining it into an aggregate stream that is sent to a larger collection point, and so on, all the way to the network core (which increasingly includes private, hybrid, or public cloud resources). In the ideal case, enterprises should make

architectural decisions that will allow portability of their data across multiple clouds or on-premises environments. The architecture should facilitate getting data to the right places at the right time.

Sending the data to a remote location to perform calculations on it may introduce unacceptable transport latency, even on high-speed connections. Moreover, the need to send a massive stream of raw data as free text can be prohibitive in terms of bandwidth requirements, particularly as the volumes of log data being collected continue to increase.
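As a hedged sketch of the bandwidth argument, the following example aggregates raw sensor readings at the edge and forwards only a compact summary. The sensor names, values, and summary schema are assumptions for illustration, not a prescribed format.

    import json
    import statistics
    from collections import defaultdict

    # Hypothetical raw sensor readings produced at a wellhead; only a compact
    # per-sensor summary is forwarded upstream instead of every raw event.
    raw_readings = [
        {"sensor": "pump-7-vibration", "value": 0.41},
        {"sensor": "pump-7-vibration", "value": 0.44},
        {"sensor": "pump-7-temp_c", "value": 81.2},
        {"sensor": "pump-7-temp_c", "value": 83.9},
    ]

    grouped = defaultdict(list)
    for reading in raw_readings:
        grouped[reading["sensor"]].append(reading["value"])

    summary = [
        {"sensor": name, "count": len(vals),
         "mean": round(statistics.fmean(vals), 3), "max": max(vals)}
        for name, vals in grouped.items()
    ]

    # The summary payload is what would be shipped to the regional hub or core.
    print(json.dumps(summary))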

Predictive maintenance

Fine-tuning maintenance schedules for large-scale industrial equipment is critical to getting the full value out of the capital it represents. Taking a machine out of service for maintenance too soon cuts into efficiency by creating unneeded planned downtime, whereas stretching out the time between maintenance intervals carries the risk of unplanned downtime and interrupted productivity.

By increasing the amount of instrumentation built into industrial equipment, a richer set of data is generated, supporting sophisticated log analytics that provide insights about optimal maintenance timing. Rather than a simple break-fix approach that addresses problems on an exception basis, or even a regular schedule designed to prevent unplanned outages, predictive maintenance can respond to actual conditions and indicators, for greater accuracy.

Machine learning models can help transform telemetry into insight, helping predict the most cost-effective approaches to physical maintenance, responding to the needs of an individual piece of equipment rather than averages. In fact, this approach to predictive maintenance has implications far beyond the industrial context, a few examples of which include the following:

• Vehicle fleets and their replaceable components such as filters and lubricants • IT physical infrastructure, including components of servers, storage, and network devices • Field equipment needs that range from leaking pipelines to vending machine replenishment

In any of these spheres of operation, among many others, applying log analytics to predictive maintenance can optimize operational expenses by increasing efficiency and productivity while offering capital expense advantages, in the form of longer equipment life cycles.
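As a rough illustration of how telemetry can be turned into a maintenance signal, the sketch below trains a small classifier on a handful of hypothetical feature rows and scores the latest reading. It assumes scikit-learn; the feature names, values, and decision threshold are invented for illustration, and a production model would be trained on far richer, equipment-specific history.

# A minimal sketch of predictive maintenance from telemetry logs, assuming
# scikit-learn; feature set, values, and threshold are hypothetical.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Each row: [vibration_rms, bearing_temp_c, run_hours_since_service]
# Label: 1 if the unit failed within the following 30 days, else 0.
X_train = np.array([
    [0.21, 61.0,  120.0],
    [0.24, 63.5,  410.0],
    [0.58, 78.2,  960.0],
    [0.61, 81.0, 1100.0],
])
y_train = np.array([0, 0, 1, 1])

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Score the latest telemetry for one machine; schedule service if the
# predicted failure probability crosses a chosen threshold.
latest = np.array([[0.47, 74.9, 870.0]])
failure_probability = model.predict_proba(latest)[0, 1]
if failure_probability > 0.5:
    print(f"Schedule maintenance (predicted failure risk: {failure_probability:.0%})")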

When building storage infrastructure, you should make scalability a primary design requirement. In addition to storage capacity, you need to recognize the importance of growing performance requirements as automated and ad hoc query volumes increase over time with new business needs.


CHAPTER 3 Tools for Log Analytics

Tasks along the log analytics pipeline ingest disparate log data, transform it to a more usable state, draw insights from it, and output those insights either as machine-to-machine communications or in human-consumable forms such as visualizations and reports. A large number of toolsets are available to perform these functions, both proprietary and open source. The largest market shares among these for log analytics are held by Splunk, the Elastic Stack, and Sumo Logic, some characteristics of which are summarized in Table 3-1.

Table 3-1. Vital statistics for a few popular log analytics tools

                           Splunk        Elastic Stack   Sumo Logic
Open source/proprietary    Proprietary   Open source     Proprietary
SaaS option                Yes           Yes             Yes
On-premises option         Yes           Yes             No

All of these toolsets utilize compute and storage resources to per‐ form search, analysis, and visualization that are suited to the needs of log analytics, as illustrated in Figure 3-1. These solutions place a premium on the ability to ingest data directly from virtually any source, provide high-throughput flexible analytics on it, and scale as data volumes grow, in terms of both capacity and performance to drive increased query complexity and volume.

Figure 3-1. Placement of log analytics tools within the broader solution stack

Even though each individual implementation will have its own unique requirements for compute resources, they might have char‐ acteristics in common. For example, because log analytics workloads tend to depend on high throughput of small packets, many imple‐ mentations use processors with large numbers of relatively light‐ weight cores. In addition, applications that require fast response time will benefit from large allotments of system memory, possibly holding data close to the processor with an in-memory data store. However, as data sizes become larger—including with regularly used historic data—flash storage can play an increasing role. Splunk Splunk is a proprietary offering that is available for either on-premises or SaaS implementations, with the primary difference being where the data is stored: on-premises or in the cloud, respec‐ tively. It offers the largest variety of plug-ins (around 600) for integration with external tools and platforms among the prod‐ ucts discussed here, and it has the most established ecosystem,

documentation, and user community. Splunk also provides an extensive collection of developer tools. Splunk the company has been public since 2012. It focuses on the enterprise segment from the standpoints of feature set, scalability, and pricing, and it targets IT operations, security, IoT, and business analytics usages. The platform is feature-rich, although the complexity that accompanies that richness can create a relatively steep learning curve for new users. Splunk implements machine learning for advanced analytics capabilities such as anomaly detection and forecasting. Splunk has a proprietary extension called Enterprise Security, which adds true SIEM functionality to a Splunk installation. Enterprise Security allows an organization to perform many tasks associated with incident management directly in Splunk. In the last few years, Splunk has added user behavior analytics (UBA) as well as security orchestration, automation, and response (SOAR) and a full observability suite to its offerings.

Elastic (Formerly ELK) Stack

The Elastic Stack is a combination of open source tools that can be implemented either on-premises or in the cloud, with options for the latter that include the Elastic Cloud platform or Amazon Elasticsearch Service, a hosted solution offered by Amazon Web Services (AWS). Also available is Elastic Cloud on Kubernetes, which enables the use of container infrastructure to deploy, orchestrate, and operate Elastic products with Kubernetes. The company Elastic has been public since 2018, and the primary components of the Elastic Stack1 include the following:

• Elasticsearch, the search and analytics engine at the core of the Elastic Stack • Logstash, a data-ingest and transformation pipeline • Kibana, a visualization tool to create charts and graphs

1 The Elastic Stack was known as the ELK Stack (named for the initials of Elasticsearch, Logstash, and Kibana) until the data-shipping solution Beats was added, causing the acronym to implode.
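As a concrete illustration of how these components fit together, the short sketch below queries Elasticsearch for recent error events through its Python client. The cluster address, index pattern, and field names are assumptions (a local cluster with Logstash-style defaults), and the 7.x-style request body is used; treat it as a sketch rather than a canonical configuration.

# A minimal sketch of querying log data in Elasticsearch, assuming the official
# Python client (7.x-style API) and a hypothetical local cluster and index pattern.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Count ERROR-level events from the last 15 minutes in a Logstash-style index.
response = es.search(
    index="logstash-*",
    body={
        "query": {
            "bool": {
                "must": [{"match": {"level": "ERROR"}}],
                "filter": [{"range": {"@timestamp": {"gte": "now-15m"}}}],
            }
        }
    },
)
print("matching events:", response["hits"]["total"]["value"])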

Elastic (Formerly ELK) Stack | 33 Customers can download and implement the Elastic Stack for free or choose from a variety of subscription options that offer varying levels of security hardening and technical support (either within business hours or 24/7/365). Higher subscription levels also offer more sophisticated search and analytics options, including machine learning capabilities in areas such as anomaly detection and alerting, forecasting, and root-cause indication. As an open source platform, the Elastic Stack also has community- developed extensions for cybersecurity use cases such as Hunting ELK (HELK)—a purpose-built Elastic distribution for cyber threat hunting. HELK integrates the Elastic Stack with more advanced data analytics tools such as Apache Spark, Kafka and Jupyter Notebooks. Sumo Logic Another proprietary solution for log analytics, Sumo Logic is offered as a cloud-based service, without an on-premises option. Thus, although customers don’t need to maintain their own infra‐ structure to support their Sumo Logic implementation, they are compelled to transfer their data offsite to Sumo Logic’s AWS-based cloud network. The security implications of that requirement can be a blocking factor for some organizations. In some geographies, the ability to depend on always-available high-bandwidth, dependable, and secure internet connectivity might also be limited. Sumo Logic targets small and medium businesses in addition to large enterprises, placing a premium on simplicity of setup and ease of use. Multiple subscription tiers are offered, including free, profes‐ sional, and enterprise, with successive tiers adding progressively higher levels of alerting, integration, support, and security-focused functionality. The toolset implements machine learning algorithms to continually investigate log data for anomalies and patterns that can produce insights and provide 24/7 alerting in response to events or problems. Apache Kafka While not a log analytics platform itself, Apache Kafka is an open source stream processing platform that is being included in many organizations’ logging infrastructure. Originally developed by

34 | Chapter 3: Tools for Log Analytics LinkedIn and open-sourced in 2011, Kafka has arguably become the standard open source stream processing platform. As a part of any log analytic platform, it is necessary to send the logs from their source to an application. Elastic uses Beats and Logstash for log forwarding. Splunk accomplishes this task with their univer‐ sal or heavy forwarders which send logs from their sources to the Splunk indexers. Using Kafka for log forwarding offers several sig‐ nificant advantages over the proprietary forwarding mechanisms, including buffering, reduced vendor lock-in, ability to send logs to multiple platforms, the ability to apply transformations to data prior to indexing and finally, easy integration into common open source data analytics platforms. Kafka can also be used to split data and send it to multiple locations, both on-premises and external clouds. Apache Spark Spark has emerged as one of the most prominent open source, dis‐ tributed scalable analytics platforms in the market. Spark was devel‐ oped by the AMPLab at UC Berkeley and donated to the Apache Foundation in 2014, which currently maintains it. What makes Spark particularly powerful is that Spark has well documented APIs to allow users to develop analytics in Python, R, Scala, Java and .NET. Spark also has an SQL interface which allows a developer to interact with Spark’s data structures. In addition to supporting large scale analytics, Spark can pro‐ cess streaming data via Spark Streaming. Finally, Spark has a robust machine learning library and can be used to build scalable machine learning models which can be deployed in production environments. Combining Log Analytics with Modern Developer Tools The tools listed above represent the current market leaders in log analytics; however, as analysts’ needs change, other tools are gaining popularity for log analytics. Many of these tools were not strictly developed for log analytics but are being used quite effectively in the broader data analytics community and are being adapted for log analytic use cases. These tools include:

Jupyter Notebook
Jupyter Notebooks are arguably the most popular platform used by data scientists today for advanced data analysis. Jupyter Notebooks are modeled on an interactive laboratory notebook, in which a data scientist can type code into cell blocks on a web page. Jupyter is usually coupled with Python but in theory can be used with most scripting languages. Jupyter is popular with data scientists because it allows rapid iteration on machine learning experiments.

Apache Drill
Drill is an open source distributed query engine for self-describing data. Drill enables analysts to query and join virtually any data source, including Splunk, Elastic, Sumo Logic, and even raw log files, using standard SQL and without having to index the data or define a schema. In addition to log files, Drill can natively read syslog-formatted data and packet capture (PCAP) files. Drill was inspired by the Dremel paper, which also inspired several similar open source tools, including Apache Impala and Presto, as well as Google's own BigQuery. However, at the time of writing, none of these tools except Drill can natively read log files.

Other Languages
While Python is arguably the most popular language used in security data analytics, other languages from the data science world are making inroads into security analytics, such as R and Golang. R is a language that was developed primarily for statistical analysis; however, a considerable number of modules have been developed to facilitate security data analysis in R.

Deploying Log Analytic Tools

With the exception of Sumo Logic, the tools listed above can all be deployed either in the cloud or on premises. If an organization is seeking to keep most of its data in the cloud, it is vital to have a coherent log forwarding strategy and infrastructure. There are challenges for both approaches, and the correct approach will be driven by many business factors, including company size, infrastructure size, regulations and compliance, and more. Regardless of the current architecture, the one certainty is that things will change. Therefore, implementing a non-proprietary

messaging platform such as Kafka will allow an organization to direct or redirect data flows to a new system without having to rearchitect the entire log analytics platform. In contrast, if an organization is using Splunk universal forwarders as its log forwarding mechanism, switching to a new platform will require refitting every endpoint with new forwarding software. When designing architecture, it is vital that organizations consider the portability of their data. Using an architecture pattern that locks the organization into a particular vendor or platform will result in massive costs in both money and time should the organization decide to switch or upgrade its platform.
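The portability argument can be made concrete with a consumer that sits on the messaging bus and fans records out to more than one destination, so that swapping an analytics backend means changing one consumer rather than re-instrumenting every endpoint. The sketch below assumes the kafka-python client; the topic name, broker address, and sink functions are placeholders for whatever indexer or data lake an organization actually runs.

# A minimal sketch of decoupling log producers from analytics backends via
# Kafka, assuming kafka-python; sink functions and names are placeholders.
import json
from kafka import KafkaConsumer

def send_to_indexer(record: dict) -> None:
    pass  # e.g., POST to Splunk HEC or bulk-index into Elasticsearch

def send_to_data_lake(record: dict) -> None:
    pass  # e.g., append to object storage in Parquet or JSON form

consumer = KafkaConsumer(
    "logs.raw",
    bootstrap_servers=["broker:9092"],
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    record = message.value
    # Every record lands in the data lake; only selected sources are indexed.
    send_to_data_lake(record)
    if record.get("source") in {"firewall", "auth"}:
        send_to_indexer(record)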


CHAPTER 4 Topologies for Enterprise Storage Architecture

At its most fundamental level, the value of log analytics depends upon drawing on the largest universe of data possible and being able to manipulate it effectively to derive valuable insights from it. Two primary requirements of that truism from a systems perspective are capacity and performance,1 meaning that the systems that underlie log analytics—including storage, compute, and networking—must be able to handle massive and essentially open-ended volumes of data with the speed to make its analysis useful to satisfy business requirements. As datacenter technologies have evolved, some advances in the abil‐ ity to handle larger data volumes at higher speed have required little effort from an enterprise architecture perspective. For example, storage drives continue to become larger and less expensive, and swapping out spinning HDDs for solid-state drives (SSDs) is an increasingly simple transition. Datacenter operators continue to deploy larger amounts of memory and more powerful processors as the generations of hardware progress. These changes enable

1 Of the canonical three legs of the data tripod—volume, velocity, and variety—this con‐ text is concerned primarily with the first two because of its focus on system resources, as opposed to how those resources are applied. More broadly speaking, all three are intertwined and essential not only to storage architecture, but to the entire solution stack.

39 new horizons for what’s possible, with little effort on the part of the practitioner. Accompanying changes to get the full value out of hardware advan‐ ces require more ingenuity. As a simple example, advances in per‐ formance, security, and stability enabled by a new generation of processors might require the use of a new instruction set architec‐ ture such that software needs to be modified to take advantage of it. More broadly and perhaps less frequently, changes to enterprise architecture are needed to take full advantage of technology advan‐ ces as well as to satisfy new business needs. The rise of compute clusters, virtualization and containers, and SANs are examples of this type of technological opportunities and requirements. Like other usage categories, log analytics draws from and requires all these types of changes. Much of what is possible today is the direct result of greater processing power, faster memory, more capable storage media, and more advanced networking technologies, com‐ pared to what came before. At the same time, seizing the full oppor‐ tunity from these advances is tied to rearchitecting the datacenter. Incorporation of cloud technologies, edge computing, and the IoT are high-profile examples that most conference keynotes at least make mention of. One critical set of challenges inherent to enabling massive data manipulations such as those involved in log analytics is the balloon‐ ing scale and complexity of the datacenter architecture required to collect, manipulate, store, and deliver value from the data. Platoons of system administrators are needed just to replace failed drives and other components, while the underlying architecture itself might not be ideal at such massive scale.

Log analytics places performance demands on the underlying systems that are an order of magnitude greater than those of a general-purpose datacenter infrastructure. That reality means that the same architectures, including those for storage, that have met conventional needs for years need more than an incremental degree of change; evolution at the fundamental architecture level may be needed.

One example of the need to fundamentally revise architectures as new usages develop is the emergence of storage area networks

40 | Chapter 4: Topologies for Enterprise Storage Architecture (SANs), which took storage off the general network and created its own, high-performance network designed and tuned specifically for the needs of block storage, avoiding local area network (LAN) bot‐ tlenecks associated with passing large data volumes. Now, log ana‐ lytics and enterprise analytics more generally are examples of emerging, data-intensive usages that place challenges beyond the outer limits of SAN performance. Newer network-scale storage architectures must be able to handle sustained input/output (I/O) rates in the range of tens of gigabytes per second. Hyper-distributed applications are demanding hereto‐ fore unheard-of levels of concurrency. The architecture as a whole must be adaptable to the ongoing—and accelerating—growth of data stores. That requirement not only demands that raw capacity scale out as needed, but to support that larger capacity, performance must scale out in lockstep so that value can be driven from all that data. Complexity must be kept in check as well, lest management of the entire undertaking become untenable. Ultimately, it is important to remember that analysts and end users really aren’t interested in the storage, but rather how long it takes them to get to a meaningful result. To consider the high-level arrangement of resources as it pertains to these challenges, three storage-deployment models play primary roles (see also Figure 4-1): DAS Simple case in which each server has its own local storage on board that is managed by some combination of OS, hypervisor, application, or other software. Virtualized storage Pools storage resources from across the datacenter into an aggregated virtual store from which an orchestration function dynamically allocates it as needed. Physically disaggregated storage and compute Draws from the conventional NAS design pattern, where stor‐ age devices are distinct entities on the physical network.

Topologies for Enterprise Storage Architecture | 41 Figure 4-1. Common models for storage architecture in the enterprise Direct-Attached Storage (DAS) In many organizations, log analytics has grown organically from the long-standing use of log data on an as-needed basis for tasks such as troubleshooting and root-cause analysis of performance issues. Often, no specific initiative is behind expanded usages for logfiles; instead, individuals or working teams simply find new ways to apply this information to what they are already doing. Similarly, no specific architecture is adopted to support log analytics’ potential, and the systems that perform log analytics are general- purpose servers with DAS that consist of one or more HDDs or SSDs in each server unit. Scaling out in this model consists of buy‐ ing more copies of the same server. As the organization adopts more advanced usages to get the full value from log analytics, a sprawling infrastructure develops that is difficult to manage and maintain. Requirements for datacenter space, power, and cooling can also become cost-prohibitive. In addition, this approach can constrain flexibility, because the orga‐ nization must continue to buy the same specific type of server, with the same specific drive configuration. Moreover, in order to add storage space for larger collections of historic data, it is necessary to buy not just the storage capacity itself, but the entire server as a unit, with added cost for compute that might not be needed, as well as increased datacenter resources to support it. The inefficiency associ‐ ated with that requirement multiplies as the log analytics undertak‐ ing grows.

42 | Chapter 4: Topologies for Enterprise Storage Architecture Virtualized Storage Storage virtualization arose in part as a means of improving resource efficiency. This model continues to use DAS at the physical level, with a virtualization layer that abstracts storage from the underlying hardware. Even though the physical storage is dispersed, applications see it in aggregate, as a single, coherent entity of which they can be assigned a share as needed. This approach helps elimi‐ nate unused headroom in any given server by making it available to the rest of the network, as well. The software-defined nature of the virtualized storage resource builds further upon the inherent rise in efficiency from this method of sharing local storage. Orchestration software dynamically creates a discrete logical storage resource for a given workload when it is needed and eliminates it when it is no longer needed, returning the storage capacity to the generalized resource pool. Virtualization of compute and networking extends this software- defined approach further, by creating on-demand instances of those resources, as well. Together, these elements enable a software- defined infrastructure that can help optimize efficient use of capital equipment and take advantage of public, private, and hybrid cloud. Because the storage topology for the network is continually rede‐ fined in response to the needs of the moment, the infrastructure the‐ oretically reflects a best-state topology at all times. Software-defined storage was developed for the performance and density requirements of traditional IT applications. Because log ana‐ lytics has I/O requirements that are dramatically higher than those workloads, it can outpace the capabilities of the storage architecture. In particular, the distributed and intermixed nature of the infra‐ structure in this model can create performance bottlenecks that strain the ability of the network to keep up. In addition, because data movement across the network is typically more expensive and slower than computation, it is not unusual to create multiple copies of data at different locations, which increases physical storage requirements.

Virtualized Storage | 43 Physically Disaggregated Storage and Compute Moving to a more cloud-like model, the storage architecture can physically disaggregate storage and compute. In this model, large numbers of servers are provisioned that don’t keep state themselves, but just represent and transform that state. Separate storage devices are deployed as a separate tier that maintain all the states for those servers. The storage devices themselves are purpose-built to be highly efficient at scale, handling data even in the multipetabyte range. Because the compute and storage elements of the environment are physically disaggregated, they can be scaled out independently of each other. More compute or storage elements can be added in a flexible way, providing the optimal level of resources needed as busi‐ ness requirements grow. In addition, as historical data stores become larger, they benefit from the fact that this physically disag‐ gregated model allows storage elements to be packed very densely together, to improve network efficiency, and to use efficient error encoding to get maximum value out of the storage media. The cen‐ tralized nature of the storage also helps improve efficiency by avoid‐ ing the need to create extraneous copies of the data. At large scale, physical disaggregation is often superior to compute and storage virtualization at meeting the high IO requirements of log analytics implementations. And while conceptually, this design approach is not entirely new compared to SAN or NAS, it can be built to meet design criteria that are beyond the performance and throughput capabilities of those traditional architectures. Physically disaggregated architectures for log analytics are, at some level, just new and enhanced manifestations of established concepts. For example, one critical design goal is to build in extremely high concurrency, parallelizing work at a very fine-grained level that ena‐ bles it to be spread evenly across the environment. In addition, the representations of file structure and metadata should be thoroughly distributed. These consistency measures help avoid hot spots within the infrastructure. This allows for the optimal use of resources to support real-time operation in the face of the massive ingest rates associated with log analytics, which can commonly exert 10 times

the performance and throughput requirements on the infrastructure compared to common enterprise workloads.

Physically decoupling compute and storage enables each to scale out independent of the other, decreasing total cost of ownership (TCO) by removing the need to purchase extraneous equipment. Following an archi‐ tectural pattern similar to SAN and NAS but with modern data management and flash-based storage enables the levels of performance and throughput needed for real-time analytics at scale.


CHAPTER 5 The Role of Object Stores for Log Data

Object storage and file storage take different approaches to organiz‐ ing data, as illustrated in Figure 5-1.

Figure 5-1. Data structures used by file and object storage

Object storage uses a flat address space and comprehensive metadata to store chunks of data referred to as "objects" in place of the hierarchical folder structure used in file storage, making it far more scalable. Thus, data retrieval remains fast, even as data stores swell from larger historical datasets to feed increasingly complex analytics. In addition, object storage is well suited to managing unstructured data by means of custom metadata to describe an object's contents. This characteristic makes data self-describing, which provides flexibility for implementations of advanced analytics. Scaling

analytics pipelines in the cloud can quickly become unpredictable in both cost and performance, so having an infrastructure that facilitates portability of data is vital for successful analytic deployments. Object stores are designed to scale to hundreds of petabytes in size without degraded performance. They also protect data integrity while maximizing usable storage space, typically through sharding: data is divided into shards, each of which is a subset of the full dataset, and the shards are distributed across multiple storage nodes to spread load and increase parallelism for large objects. Multiple copies of each shard can be stored to provide data resiliency, although this approach can become prohibitive as data volumes increase. Erasure encoding can also be employed to efficiently protect data by computing and storing error-correcting shards along with the data shards. However, when applications are not designed to take advantage of disaggregated compute and storage, they must maintain multiple copies of data instead of offloading that to the storage provider. The result is increased inefficiency and cost. Because it is designed to scale out, including to distributed architectures, object storage is more cost-effective than file storage, especially as stores of historical log data grow larger. Moreover, because object storage is cloud native, log analytics systems can make use of either private cloud or public cloud object stores, including Amazon Simple Storage Service (Amazon S3), Microsoft Azure Blob Storage, and Google Cloud Storage. Those environments also provide high availability and useful developer tools and design patterns that streamline the production of microservices to provide custom manipulations on log data. Although object storage offers tremendous scalability and flexibility, accessing off-site public cloud storage can create unacceptable lags in response time for latency-sensitive log analytics applications. Accordingly, many IT organizations choose to build on-premises private cloud object storage infrastructure. This approach also enables them to realize the value of object storage without entrusting their data to a third party. Conversely, using public cloud can reduce requirements for local storage by placing a portion of the data on remote object stores. Increasing the proportion of data on public cloud infrastructure can reduce the in-house administration burden, freeing personnel from the care and feeding of a growing storage infrastructure.
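The flat namespace and custom metadata described above are easiest to see in code. The sketch below writes and lists compressed log objects through the S3 API using boto3; the bucket name, key layout, and metadata fields are hypothetical, and the same calls work against many private-cloud object stores that expose an S3-compatible endpoint.

# A minimal sketch of archiving logs in an S3-compatible object store, assuming
# boto3; bucket name, key layout, and metadata fields are hypothetical.
import gzip
import boto3

s3 = boto3.client("s3")  # endpoint_url=... can point at an on-premises S3-compatible store
BUCKET = "example-log-archive"

def archive_log(host: str, day: str, raw_log: bytes) -> None:
    """Store one day of logs for one host as a compressed, self-describing object."""
    key = f"logs/{host}/{day}.log.gz"
    s3.put_object(
        Bucket=BUCKET,
        Key=key,
        Body=gzip.compress(raw_log),
        Metadata={"host": host, "log-day": day, "format": "syslog"},
    )

def list_host_days(host: str):
    """List the archived days available for a host using the flat key namespace."""
    response = s3.list_objects_v2(Bucket=BUCKET, Prefix=f"logs/{host}/")
    return [item["Key"] for item in response.get("Contents", [])]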

48 | Chapter 5: The Role of Object Stores for Log Data For example, Splunk SmartStore combines a remote object storage tier with a local cache manager that allows data to reside locally. To increase performance, the hot data lives in local caches. Pure Storage provides fast object storage for SmartStore to avoid that latency, making it possible to have large volumes of historical data available for fast queries while reducing the dependence on large, local stor‐ age for the hot tier. The cache manager allows system administrators to control the flow of data between the cache and remote storage as well as monitoring and adjusting the size of those caches according to recorded performance and hit rates. Organizations that tend to do frequent long-term searches, for example, might find that they need larger cache capacities to prevent frequent calls to remote storage. In the present reality where not all applications are designed or rear‐ chitected to use object stores, unified file and object access plays a pivotal role, as illustrated in Figure 5-2. This architecture allows data to be either ingested or accessed through either a file or an object interface, increasing compatibility with both legacy and cloud-based applications. Applications designed to access file-based data are therefore able to use data stored in the cloud as objects.

Figure 5-2. Unified file and object access to storage

The Trade-Offs of Indexing Log Data

Indexing data amounts to creating a map used to quickly locate the specific database record or records, allowing for faster query results compared to scanning the entire table. Determining when to index data is a classic computer science trade-off. In essence, the conundrum is that indexing data can accelerate queries against it later, but the act of indexing itself is computationally intensive. Particularly when

The Trade-Ofs of Indexing Log Data | 49 indexing is done at the point of ingress, it can affect throughput and therefore scalability. One way of preventing impacts on data ingress is to index data at rest as resources allow. It can also often be valuable to index only the highest-value portion of data, storing the remainder in unindexed form. In these usages with partial indexing of log data, algorithms must be capable of determining the priority of indexing individual, specific pieces of data, based on the likelihood of performance bene‐ fit in future queries. This approach targets queries that are known and planned for in advance, so it must be accompanied by the ability to return fast results from unplanned queries on unindexed data, as well. To put this another way, if an organization is using Kafka, they can create pipelines whereby some data is sent to a data lake and other data is sent to Splunk or Elastic for indexing. Indexing is therefore well suited to point queries, such as finding the incidence of a specific phrase or event, a so-called “needle-in-the- haystack” problem within a predictable sphere of data. Each query can consult the index and return fast results, potentially justifying the indexing overhead. On the other hand, for ad hoc queries that consider unbound amounts of data, the computational cost of scan‐ ning the world of unindexed data can be justified by the fact that indexing all of that data would be more expensive still. Both indexed and raw, unindexed data typically cohabitate in enter‐ prise environments, potentially accessed by different applications. A healthy combination of indexing at the point of ingest and reindex‐ ing is called for, with reindexing (despite its resource-intensive nature) playing a key role as the projected uses for data change.
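The point-query versus full-scan trade-off can be illustrated with a toy inverted index in plain Python; the log lines and tokenization rule are purely illustrative. Building the index costs an up-front pass over every record, after which a "needle-in-the-haystack" lookup touches only the matching lines, whereas an ad hoc query over unindexed data falls back to scanning everything.

# A toy illustration of the indexing trade-off: index construction costs an
# up-front pass over all records, but point queries then avoid full scans.
from collections import defaultdict

log_lines = [
    "2021-05-01T12:00:01 host-a sshd failed password for admin",
    "2021-05-01T12:00:02 host-b nginx GET /index.html 200",
    "2021-05-01T12:00:03 host-a sshd accepted password for alice",
]

# Index construction: one pass over every line (the computationally intensive part).
inverted_index = defaultdict(set)
for line_number, line in enumerate(log_lines):
    for token in line.lower().split():
        inverted_index[token].add(line_number)

# Point query ("needle in the haystack"): consult the index, touch only matches.
hits = [log_lines[i] for i in sorted(inverted_index.get("failed", set()))]

# Ad hoc query on unindexed data: a full scan of every stored line.
scan_hits = [line for line in log_lines if "failed" in line.lower()]

assert hits == scan_hits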

CHAPTER 6 Performance Implications of Storage Architecture

Growing IT complexity is a fact of life as more business processes become digitized, higher levels of automation are enabled, and new technologies enter the datacenter. That growing complexity drives increasing volumes of log data that can potentially be used for log analytics; a company of a given size would generate far more logs today than a company of similar size a decade ago. The availability of more log data generates the potential for more sophisticated log analytics, placing more extensive demands on the underlying systems. Specifically, even as the amounts of data that need to be ingested, handled, and stored rise exponentially, so do the numbers of queries being made against that data, both by automated systems running reports and dashboards and by human users placing ad hoc queries to generate business insights. Log analytics operations depend on rapid, dependable access to stored data, which places growing performance and scalability demands on the storage hardware. Fast response rates are critical to business use cases, both to optimize efficiency and to provide a good user experience. These requirements are driving flash adoption in the enterprise; in fact, flash storage has become the standard implementation for many use cases.

51 The emergence of flash storage for the enterprise in the past 10 years or so represents a sea change in storage architecture because of its dramatically higher speed and longevity. Notwithstanding those advantages, cost constraints mean that spinning disks still dominate in the datacenter. As the cost of flash storage decreases, spinning disk usage will decrease. Conventional HDDs are appropriate for dumping large amounts of data that won’t be searched against often, but the random reads and writes are much slower than flash. Searching a huge dataset to support real-time or near-real-time log analytics requirements can be prohibitively time-consuming with spinning disks.

Even as the volumes of log data being generated are growing rapidly, many companies are extending their standard retention periods, requiring longer-term preservation of data. A key driver behind this trend is the fact that analytics models can often be made more pow‐ erful by making larger sets of historical data available to them. Querying against several years of data allows tracking of long-term trends, and as AI models are increasingly adopted for analytics of all kinds, those large datasets can also be useful for training deep learn‐ ing models.

Log analytics presents substantial performance chal‐ lenges. Throughput must be optimized from the point of ingest, through all stages of processing, to outputs such as alerts, dashboard visualizations, or reports. Data pipelines and storage must be consolidated, elim‐ inating data silos and the practice of storing multiple copies of data for different usages.

Tiered storage is a common approach to handling large amounts of historical data, including log data. By providing multiple areas of storage, each with a different balance between cost and perfor‐ mance, tiered storage allows the medium to be tailored to different needs, as illustrated in Figure 6-1. In this conception, the perfor‐ mance tier houses the data most likely to be queried frequently and needed for real-time analytics scenarios (i.e., the “hottest” data). The archive tier is for long-term storage of infrequently accessed “cold”

52 | Chapter 6: Performance Implications of Storage Architecture data, and the capacity tier in between strikes a balance between the two for “warm” data.

Figure 6-1. Tiered storage balances cost and performance for different data types

In practical terms, this range might correspond to the age of log data. For example, the most recent 30 days of log data might be stored in the performance tier, data 31 days to a year old in the capacity tier, and data that is older still in the archive tier. As organizations continue to stretch the limits of what they can monitor, accomplish, and predict with log analytics, query volumes will continue to increase, including searches against older data. Increased requirements to perform analysis and extract insights from data stored in the capacity and even archive tiers compels architects to increase the performance of those tiers, which is largely accomplished by the addition of flash storage.
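A tiering policy like the one just described often reduces to a simple age-based rule. The sketch below mirrors the 30-day/one-year example; the tier names and thresholds are illustrative rather than prescriptive, and real deployments would also weigh query patterns, compliance rules, and cache hit rates.

# A minimal sketch of an age-based tiering rule matching the example above;
# thresholds and tier names are illustrative, not prescriptive.
from datetime import datetime, timedelta
from typing import Optional

def storage_tier(log_timestamp: datetime, now: Optional[datetime] = None) -> str:
    """Map a log record's age onto a storage tier."""
    now = now or datetime.utcnow()
    age = now - log_timestamp
    if age <= timedelta(days=30):
        return "performance"   # hot data: flash, queried frequently and in real time
    if age <= timedelta(days=365):
        return "capacity"      # warm data: balanced cost and performance
    return "archive"           # cold data: long-term, infrequently accessed

print(storage_tier(datetime.utcnow() - timedelta(days=12)))   # performance
print(storage_tier(datetime.utcnow() - timedelta(days=400)))  # archive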

Drive More Value from Storage with Data-Reduction Technologies Because flash is an order of magnitude faster than spinning disks, it has become all but essential to data-intensive workloads such as log analytics. Particularly for real-time results that draw on historical data, the storage tier requires the speed that only flash can offer. Storage vendors have devised a range of data reduction features that reduce the amount of capacity needed to store a given body of data. While not all are available from all providers, the following capabil‐ ities have been developed by the industry:

• Pattern removal detects and consolidates simple binary pat‐ terns within datasets (e.g., summarizing a string as “1,000 zeroes” instead of actually encoding the 1,000 identical values). These measures can reduce storage requirements as well as

Performance Implications of Storage Architecture | 53 processing for other data-reduction measures such as dedupli‐ cation and compression. • Deduplication ensures that only unique blocks of data are com‐ mitted to flash storage. Ideally, deduplication works globally (rather than within a volume or a pool) and on variable block sizes for maximum efficiency. • Inline compression reduces the number of bits needed to repre‐ sent a given piece of data. The use of multiple algorithms is desirable, to optimize compression ratios among different types of data that have different requirements. • Post-process compression applies additional compression algo‐ rithms to data after the process is complete, increasing the data reduction result achieved with inline compression. • Copy reduction handles data-copy processes within the flash storage medium using only metadata, producing snapshots and clones that offer greater efficiency than actual copies of the data.

To get the full value of these capabilities in the datacenter, they should be available right out of the box, without requiring extensive configuration or tuning. High-performance data-reduction meas‐ ures should be always-on and suitable for use across workloads, regardless of size, type, or criticality.
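Two of the data-reduction ideas above, deduplication and compression, can be demonstrated with nothing more than the standard library. The sketch below is a toy: real arrays apply these techniques inline, on variable block sizes, and far more efficiently, but it shows why repetitive log data reduces so well.

# A toy illustration of deduplication and compression on repetitive log data;
# real storage systems implement these inline and far more efficiently.
import hashlib
import zlib

line = b"10.0.0.7 GET /health 200 2ms\n"
block = (line * 150)[:4096]   # one 4 KiB block of repetitive log text
data = block * 256            # 1 MiB of logs made of identical blocks

# Deduplication: split into fixed-size blocks and keep only unique block contents.
BLOCK = 4096
blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
unique = {hashlib.sha256(b).hexdigest(): b for b in blocks}
print(f"{len(blocks)} blocks -> {len(unique)} unique")

# Compression: reduce the bits needed to represent the (highly repetitive) data.
compressed = zlib.compress(data)
print(f"{len(data)} bytes -> {len(compressed)} bytes "
      f"({len(data) / len(compressed):.0f}:1 ratio)")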

Additional Considerations for Security Analytics

Security analytics can place additional, unique requirements on IT organizations. A review of special considerations offers perspective on common security analytics challenges and the role of scalable data storage that can adapt to the need for real-time and historical data analysis. Effective real-time data processing is key to achieving organizational goals for mean time to detect (MTTD) and mean time to respond (MTTR). Easy access to historical data enables needed insights into threat landscapes for root-cause analysis and persistent threat detection. Scalable, resilient architectures enable the agility to keep pace not only with the growth of data but also with new and changing sources of log and event data.

54 | Chapter 6: Performance Implications of Storage Architecture Responding to and Detecting Threats in Real Time In order to minimize the MTTD and MTTR, an organization must have the ability to query and perform statistical analysis on log data in real time or near real time. Simply detecting and blocking mali‐ cious activity isn’t enough to adequately protect an organization. In order to respond to more sophisticated attacks, it is often necessary to conduct historical data analysis as well as real-time detection. What is clear is that security use cases require a considerable amount of data, regardless of whether the system is real time or not. For a real-time system to be effective, it must have rules which the security team develops by analyzing a large corpus of data. Addi‐ tionally, real-time systems frequently employ machine learning models which must be trained (and retrained) on large collections of data. In any case, an organization will need an efficient and effective file system to enable their security professionals to access and ana‐ lyze the various forms of security data they collect. Let us examine several use cases for real-time threat detection and remediation.

Ransomware
Ransomware is a type of malware that blocks a user or organization from accessing their data, usually by encrypting it and simultaneously demanding payment for releasing the data. In the last year, ransomware has crippled several large organizations, including hospitals, school systems, and government agencies. If an organization has effective backups, it is possible to reverse the damaging effects of ransomware; however, some ransomware also encrypts backups, making restoration impossible. The best approach, of course, is to detect and defeat ransomware as quickly as possible. Unfortunately, it is very difficult to detect ransomware fast enough to prevent damage. Ransomware can be detected in log files by looking for anomalous file system activity, abnormally high CPU or disk activity, and suspicious network communication. If these patterns are identified, it is vital to isolate the infected systems from the rest of the corporate network to prevent further loss of data.

Additional Considerations for Security Analytics | 55 The most common ways organizations get infected by ransomware are from spam, malicious websites, infected removable drives, and malicious applications. Proactively monitoring logs for these attack vectors can reduce the risk of a ransomware infection.

Anomaly detection Anomaly detection is a broad grouping of machine learning techni‐ ques used to detect things that are outside the ordinary. In a security context, usually this means something bad happened, but not always. For example, a router may be exhibiting abnormal behavior. The cause could be that the router has been infected with malware, or it could be that a technician is reconfiguring the device. Treating anomalies as automatic alerts can result in flooding operators with useless alerts, which will annoy your security operations center (SOC) and possibly worse, cause them to miss a real alert. Anomaly detection can be broken down into novelty and outlier detection. There is a subtle distinction between the two which is that novelty detection is when you start with a dataset that contains only good observations. This data is then used to train a machine learn‐ ing model which can then identify inliers and outliers. Outlier detection starts with a dataset with both inliers and outliers. Those familiar with machine learning will likely map these to supervised and unsupervised learning. Anomaly detection is not a single tech‐ nique but rather a collection of techniques which could include sim‐ ple statistical analysis, all the way to sophisticated machine learning techniques. At the high level, most anomaly detectors work in a similar manner where they calculate a score which is used to represent “normal.” This score is then calculated for an unknown observation, and if the difference exceeds a threshold, it is treated as an anomaly.1 In order to perform anomaly detection, it is necessary to have a log‐ ging infrastructure that supports historical querying. While it may be possible to use summary data for this purpose, this will only work if the summaries contain the correct features for the model. Therefore, it will be necessary to have a decent volume of historical

1 A simple example might be to take the last 100 years of temperature data for a given location on a given day. Then, if that temperature varied by more than a number of degrees, there is an anomaly.

56 | Chapter 6: Performance Implications of Storage Architecture data from which an analyst can build their model. A data scientist working on such a problem will need to perform aggregate queries over long time periods and thus the logging infrastructure should have enough data to support this as well as a reasonable latency for queries which span long time periods.

IDS/IPS considerations Intrusion detection systems (IDS) and intrusion prevention systems (IPS) are related systems which detect and prevent intrusions into networks. An IDS sits behind a firewall and scans packets, sending alerts in the event of a suspected breach. An IPS is similar to an IDS in that it also sits behind a firewall and scans packets, but an IPS can take action to prevent damage to corporate networks. In addition to simply sending alerts, an IPS can drop potentially malicious packets, block malicious traffic, or reset network connections. While IDS/IPS can reduce the risk to organizations by reducing the MTTD/MTTR, an organization still must analyze the data con‐ tained in their logs to understand the root causes for the entries. For example, if a user has consistent failed login attempts, or if there are unknown devices on the network in particular areas or times, these could all be indications of an orchestrated attack on the organization. Analyzing the Threat Landscape Another facet of security data analysis is longer term analysis. Sophisticated actors will gain unauthorized access to an organiza‐ tion with the goal of maintaining undetected but unauthorized access for a lengthy period of time. These actors’ goals are usually not immediate profit but rather industrial or political espionage, financial crimes, or serious disruption of their target. These kinds of attacks are labeled as advanced persistent threats (APT).

Detecting advanced persistent threats In order to understand how to detect APTs and the role log analytics plays in this, you must understand the high-level pattern which APTs will follow. Most APTs follow a pattern of launching an initial compromise whereby they gain access to some portion of a network. This first step can be accomplished by gaining access to an individual’s user

account via a phishing email, web hijacking, or some other technique. Once that has been accomplished, the goal is to make this access persistent, which is known as establishing a foothold. After the foothold has been established, the next goal is to escalate privileges. The attackers will usually not gain access to root-level or other administrative accounts at first; instead, they will break in via some weakly secured system and gain user-level access. However, since their goal is to gain persistent access to valuable information, the actor will attempt to gain access to an administrator or root-level account on the network. This will allow the malicious actor to perform activities such as reconfiguring the network, disabling security provisions, and creating new user accounts. The final step in the process is lateral movement, which is moving from one internal device or network segment to another and repeating this process. Most corporate networks are well protected from unauthorized activity originating outside the firewall; however, APTs are particularly dangerous because the attackers will use an organization's internal network to attack other internal systems. APT actors will use extremely sophisticated software for command and control, which carefully masks its external communications to blend in with the normal activity of an enterprise network. The challenge with detecting APTs is largely separating the noise from the actual attack traffic. Automated tools are usually ineffective in detecting APTs, and as a result, organizations use cyber threat hunting teams to detect and defeat APTs. In order to do their work, hunting teams must have access to large amounts of log data from a variety of sources. Additionally, many cyber threat hunters will build custom machine learning models to detect threats. All of this depends on having meaningful access to large amounts of diverse security data, often spanning long time periods. FireEye reports that the median time an APT attack goes undetected in the United States is 71 days; in Europe, the Middle East, and Africa (EMEA), 177 days; and the Asia-Pacific area (APAC), 204 days. With these numbers in mind, it is clear that an organization should retain enough historical data to detect these breaches and also to be able to analyze the actor's activity and remediate the damage.

58 | Chapter 6: Performance Implications of Storage Architecture Legal discovery (e-discovery) In the event of a legal proceeding or dispute, both parties to the dis‐ pute are subject to discovery—a process in which information is gathered with the intent of using it as evidence. As an ever- increasing amount of human activity and interaction takes place on networked computers, when legal or criminal issues arise, it is increasingly common to gather evidence from computer systems, and that evidence can include emails, computer activity from logs, malware, documents, chat histories, and more. In an enterprise, most data is stored on central servers rather than on an individual’s machine. Therefore, the process and requirements for e-discovery are similar to threat hunting in that the investigator must have access to a large corpus of log (and other) datasets, and must be able to effectively search that data for items which are rele‐ vant to the investigation. As disputes can cover years’ worth of activ‐ ity, effective legal discovery depends on organizations retaining considerable amounts of data and storing this data in a way which enables effective analysis.

Root-cause analysis When a cyber or IT operations incident occurs, logs can provide the first line of defense in identifying the problem and possibly, via a SOAR platform, remediating it. However, understanding the root cause of the issue is helpful in order to prevent the issue from recur‐ ring in the future. Root-cause analysis (RCA) is a process through which an analyst will identify the problem, establish a timeline from the normal activ‐ ity which preceded the incident and the problem, and finally iden‐ tify the root cause as well as any other causal factors. The usual result of RCA is identifying actions to prevent the problem from recurring. Similar to threat hunting, RCA depends on having a large corpus of historical data, stored in a platform which will allow the analyst to reconstruct the timeline of events. This will necessarily draw from multiple log sources and formats. An analyst must be able to execute queries over long time spans in order to reconstruct the timelines and sequence of events.
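The timeline reconstruction step at the heart of RCA amounts to normalizing timestamps from several log sources and merging the events into one ordered sequence. The sketch below uses only the standard library; the source names, fields, and messages are invented for illustration.

# A minimal sketch of timeline reconstruction for root-cause analysis:
# events from several sources merged into one time-ordered sequence.
from datetime import datetime

firewall_events = [
    {"ts": "2021-05-01T02:14:07", "source": "firewall", "msg": "blocked outbound 10.0.3.9 -> 203.0.113.50:445"},
]
auth_events = [
    {"ts": "2021-05-01T02:13:55", "source": "auth", "msg": "failed login for svc-backup on db-02"},
    {"ts": "2021-05-01T02:14:02", "source": "auth", "msg": "successful login for svc-backup on db-02"},
]
app_events = [
    {"ts": "2021-05-01T02:14:20", "source": "app", "msg": "db-02 connection pool exhausted"},
]

timeline = sorted(
    firewall_events + auth_events + app_events,
    key=lambda e: datetime.fromisoformat(e["ts"]),
)
for event in timeline:
    print(f'{event["ts"]}  [{event["source"]:8}]  {event["msg"]}')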

Additional Considerations for Security Analytics | 59 Seamless Scalability One of the main challenges in log analytics is determining how much data to retain. The drivers for this can be cost, compliance, risk appetite, use cases, and other factors. Inevitably, a log analytic platform will need to grow as new systems are brought online or new use cases are developed that require additional data. Log analytics are frequently mission critical in that they can identify (and prevent) breaches, incidents, and other issues, so a system must be able to scale without issues. Some things to consider are dynamic data, system disruption, and the process for adding data sources.

Adding data sources As log infrastructure scales, it will be necessary to add new data sources. For analytic systems such as Splunk, it is necessary to create new data flows from the source system into the log analytic system. Once that flow is established, it is necessary to configure that system to parse the data into indexes. Frequently however, there are many organizational hurdles to make this happen. For instance, there may be legal, privacy, policy, or compliance approvals which need to be obtained prior to starting data flows. Having a standardized infrastructure, such as a Kafka messaging bus, can greatly reduce the technical complications of sending data into the log indexer. Kafka also allows an organization to split feeds into multiple destinations. Lastly, a messaging bus such as Kafka allows an organization to track data sources and how they are being used to some degree.

Dynamic data formats A very common issue as log analytics platforms scale is dynamic data, or data in which the schema changes. Often schema changes will cause corrupt records in SIEMs which can lead to incidents being missed entirely. As many log analytic platforms use regular expressions to parse logs into fields, if the log format changes, it will no longer match the regex. To some degree, unexpected changes can be mitigated with effective change management procedures. Ensuring that the SIEM engineers are aware of changes made to log-generating systems will allow them the time to prepare the SIEM to accept and process new log

60 | Chapter 6: Performance Implications of Storage Architecture formats. However, a better approach is to log data in a format which contains the schema itself, such as JSON or Parquet, and that way the query engine will automatically detect and parse any new fields. This approach will prevent corrupt data from entering the SIEM, but it is still only a partial solution. If downstream visualizations or other analytics depend on particular fields, and if those fields are renamed, those analytics will break. In recent years, an alternative approach of using machine learning to parse logs has emerged and there have been some models which can parse logs into fields with greater accuracy than regex. The machine learning approach is likely the most effective for organiza‐ tions that have highly dynamic data.


CHAPTER 7 Enabling Log Data's Strategic Value with a Unified Fast File and Object Platform

Even though data is widely regarded today as every business's most important asset, it tends to be isolated in silos across the enterprise that are each developed to meet the needs of a discrete set of applications and workloads. This fragmentation of data limits the degree to which it can be accessed in real time, hampering the ability to perform flexible analytics on it. This limitation runs counter to the modern perspective of data as a primary strategic differentiator, unleashed by the power of analytics. Maintaining architectures that are customized for specific workloads or applications expands the number of data silos. Data warehouses have been built as a single repository for integrating and storing structured data pulled from multiple data sources across the organization. However, they have historically been used for batch processing of structured data and often use an architecture that is customized for a specific data warehouse application. Alternatively, data lakes are essentially uncurated masses of data, dwelling in a storage architecture designed to store their contents as efficiently as possible, rather than for speed of access, sharing, and delivery. This property makes the data-lake approach to storage limited for log analytics. In addition, data lakes lack the ability to

63 tailor data delivery to the specific latency, throughput, and I/O requirements of individual applications and usages. Silos can also develop as data analytics increasingly uses complex machine learning. AI clusters, powered by tens of thousands of GPU cores, require storage to also be massively parallel, servicing thou‐ sands of clients and billions of objects. The increasing use of stream‐ ing data also changes the requirement for throughput. Unified fast file and object (UFFO) helps address common chal‐ lenges with analytics deployments through simplicity, performance, and consolidation. You need simplicity first and foremost—no bolt- ons or overlays that diminish the results or limit the ability to address a broad set of data requirements. The solution must stay simple to manage at a big data scale. Another key requirement for a UFFO implementation for analytics includes the ability to deliver very high throughput for both file and object storage, scale-out per‐ formance for support of growing workloads and cost-effective stor‐ age. UFFO-based storage architectures are conceived specifically for data sharing as efficiently as possible, overcoming limitations of data silos and data lakes for enterprise-wide log analytics, as illustrated in Figure 7-1. Because of the flexibility of unified fast file and object, it is possible to deploy this solution across file and object use cases spanning log analytics, data warehouse, data lakes, streaming data, and more.

Figure 7-1. Unification of data with analytics workloads

Applications for modern data have needs that aren’t met by tradi‐ tional architectures, because modern data is different. Traditional applications used scale-up resources, but modern applications are often deployed as scale-out microservices, changing the way we

procure and deploy infrastructure. These workloads generate data that is growing rapidly, with an increasing variety of file and object sizes, counts, and profiles, which requires native multiprotocol support. The growth of microservice-based architectures and the rise of containers and object storage have created the need for a new category of storage that serves the new ways in which applications are created, deployed, and managed. Demand for this storage is further fueled by the use of cloud-native operations on-premises. At the same time, the data requirements for these applications are driven by the need for a multiprotocol, high-performance storage foundation that can serve a wide variety of applications and data services. Organizations are also looking for the agility to maintain existing file-based applications without creating additional data silos or compromising on performance. Ideally, a unified fast file and object storage platform consolidates existing and modern applications to help IT organizations respond to this challenge. This approach enables data storage and delivery to multiple applications in a dynamic fashion that is managed in software, without requiring any workload-specific tuning. That capability in turn requires massive hardware and software parallelism to provide real-time results for demanding workloads. Storage based on a UFFO architectural approach represents software-defined, on-demand infrastructure built to accommodate requirements that change constantly and must be met in real time. It embraces virtualized constructs, including virtual machines and containers, as well as bare-metal physical hardware. It also depends on a modern storage medium such as flash, which is superior to older spinning disks at handling multiple simultaneous demands. Flash dramatically improves performance, particularly for the small data accesses and random reads and writes that are prevalent in analytics workloads and are a major challenge for data lake architectures.
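
To make the multiprotocol idea concrete, the following sketch reads the same log data once through a POSIX file path, as an NFS-mounted share might expose it, and once through an S3-compatible object API using boto3. The mount point, endpoint, bucket, and key are hypothetical placeholders rather than the interface of any particular product, and credentials are assumed to be configured in the environment:

    import boto3

    # File protocol: the share is mounted locally, so the log is just a path.
    # (Hypothetical mount point and file name.)
    with open("/mnt/logs/app/2021-05-01.log", "rb") as fh:
        file_bytes = fh.read()

    # Object protocol: the same data addressed as a bucket/key pair through
    # an S3-compatible endpoint. (Hypothetical endpoint, bucket, and key.)
    s3 = boto3.client("s3", endpoint_url="https://objects.example.internal")
    response = s3.get_object(Bucket="logs", Key="app/2021-05-01.log")
    object_bytes = response["Body"].read()

    # With a unified platform, both reads return the same content, so file-
    # based tools and cloud-native applications can share one copy of the data.
    assert file_bytes == object_bytes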

Storage Platforms to Optimize Log Analytics with UFFO

The specific storage systems that enterprises use when building a UFFO-based architecture are the foundations for next-generation data analytics in general and log analytics in particular. Because these systems must operate at petabyte scale, architects should

choose storage platforms that provide high performance out of the box and that can adapt to requirements dynamically. Efficient operation across the spectrum of requirements means that different workloads must not require ongoing manual tuning or configuration. Hot-pluggable blades are often the form factor of choice, allowing capacity and performance to be scaled out easily and instantly as needed. That simplicity can be critical to agility when accommodating tens of thousands of simultaneous clients and tens of billions of data objects. The platform should support both file and object storage so that the architecture can flexibly adapt to varying requirements. Key design criteria to consider for UFFO to provide on-demand access to enterprise data for log analytics include the following:

Born digital
    Machine-generated data doesn't all fit neatly in one database. It's not even all generated in-house. It is multidimensional and can be very unpredictable.

Multimodal
    Files offer flexibility and compatibility for existing workflows. Objects offer compatibility and interoperability with cloud-native applications. Real-world applications increasingly blend the two in consolidated pipelines. By some estimates, enterprises will triple their unstructured file and object data requirements within the next three years.

In motion
    Billions of files and objects are generated, processed, and analyzed in real time at scale.

Always available
    Modern data needs to be highly available without suffering from disruptive upgrades, supported with rich data protection capabilities, and ready for distribution, sometimes from the cloud and sometimes to the cloud.

As nonvolatile storage has come down in price, all-flash architectures have become viable for mainstream implementations. These devices, particularly as part of modern intelligent architectures built for flash, provide order-of-magnitude improvements in performance and latency over mechanical HDDs and legacy architectures retrofitted with SSDs, especially with the small packets that are common in log analytics workloads. Accordingly, flash has

become the gold standard for storage systems used in modern analytics. The provider of the storage systems is every bit as important as the storage itself. There is no substitute for a provider's experience working with customer architects, system administrators, database administrators, and others to identify and resolve the diverse challenges of implementing data storage topologies for log analytics. That expertise can be critical to working through issues related to I/O and storage bottlenecks, scaling and concurrency, and high availability, among others. Top-tier providers can also assist with integration with enterprise software platforms such as analytics engines, databases, ERP systems, and virtualization platforms across operating environments, as well as with monitoring and observability options to reduce risk and achieve the best possible results.


CHAPTER 8 Nine Guideposts for Log Analytics Planning

The benefits that log analytics can provide vary dramatically among different organizations, as do the infrastructure and techniques best suited to enabling those benefits. Nevertheless, the common set of best practices and considerations described here can help guide architects during the planning process.

Guidepost 1: What are the trends for ingest rates?
    Accommodate future needs for performance and capacity.
Guidepost 2: How long does log data need to be retained?
    Fine-tune data retention policies to optimize costs and minimize liability.
Guidepost 3: How will regulatory issues affect log analytics?
    Provide verifiable measures to govern data usage, transport, and storage.
Guidepost 4: What data sources and formats are involved?
    Forecast and prepare for upcoming changes in data-transformation pipelines.
Guidepost 5: What role will changing business realities have?
    Align infrastructure planning for log analytics with broader corporate strategy.

Guidepost 6: What are the ongoing query requirements?
    Identify future query volumes among different types, such as ad hoc versus point queries.
Guidepost 7: How are data-management challenges addressed?
    Plan for impacts on log data formatting and delivery from changes in the environment.
Guidepost 8: How are data transformations handled?
    Ensure that tools and applications to transform data are sufficient for the future.
Guidepost 9: What about data protection and high availability?
    Designate log data's criticality and sensitivity, reflected in security and backup/restore policies.

In particular, it is important to keep in mind that the key considerations and concerns that bear on planning infrastructure for log analytics will intensify as log data continues to grow in volume, velocity, and variety. Architects must therefore plan for flexible scalability of capacity and performance in their storage systems to support the log analytics function as it continues to become more demanding as well as more valuable to the enterprise as a whole.

Guidepost 1: What Are the Trends for Ingest Rates?

Log data originates all over the environment, and its volume grows continually as the environment becomes more complex over time. Even if multiple terabytes per day inundate the log analytics platform initially, the sheer scale of the data is unbounded in the future. Determining how those data volumes are likely to grow over time is critical to understanding the future state of the environment. In particular, the storage infrastructure must be designed to accommodate future needs from both the capacity and performance perspectives. The capacity aspect of this requirement speaks to the value of decoupling compute and storage so that the latter can scale independently of the former. The performance aspect means the storage infrastructure must be able to support larger numbers of more complex queries against ever-growing log data volumes.
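
As a back-of-the-envelope illustration of why ingest trends matter, the following sketch (all of the figures are invented for the example) projects hot-tier capacity from a daily ingest rate, an assumed annual growth rate, and a retention window:

    # Hypothetical inputs: 5 TB/day ingested today, ingest growing 40% per
    # year, and a 90-day window of data kept on the hot tier.
    daily_ingest_tb = 5.0
    annual_growth = 0.40
    hot_retention_days = 90

    for year in range(4):
        ingest = daily_ingest_tb * (1 + annual_growth) ** year
        hot_tier_tb = ingest * hot_retention_days
        print(f"Year {year}: ~{ingest:.1f} TB/day ingest, "
              f"~{hot_tier_tb:,.0f} TB on the hot tier")

    # Sample output:
    # Year 0: ~5.0 TB/day ingest, ~450 TB on the hot tier
    # Year 3: ~13.7 TB/day ingest, ~1,235 TB on the hot tier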

Guidepost 2: How Long Does Log Data Need to Be Retained?

Corporate standards, audit provisions, and regulatory requirements can all affect the required retention period for log data. As volumes continue to grow exponentially, the cost and complexity associated with storing that data can become burdensome or even untenable. At the same time, growing data stores make a structured approach to tracking the data life cycle more vital, so that data is not retained longer than necessary. Best practices in this area include assessing and tuning data-retention policies so that they meet requirements while keeping retention periods as short as appropriate. If possible, retention requirements should be projected forward to discern whether they are likely to increase or decrease in the future. Nearer-term measures include data reduction techniques such as deduplication and compression to reduce the burden on storage system capacities as much as possible. Architects should also consider options for cost-effectively storing archival data. For example, if the performance requirements associated with older data stored as Amazon S3 objects are lower than for current operating data, it might be desirable to push those objects out to Amazon Glacier or a similar cost-optimized service.
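
Building on the Amazon S3 example above, that kind of tiering can be codified as a lifecycle rule. The sketch below uses boto3; the bucket name, prefix, and day counts are illustrative placeholders and should be driven by the retention policies discussed above:

    import boto3

    s3 = boto3.client("s3")

    # Illustrative rule: after 90 days, move objects under logs/ to Glacier;
    # after 400 days, delete them. Both thresholds are placeholders that
    # should reflect the organization's actual retention requirements.
    s3.put_bucket_lifecycle_configuration(
        Bucket="example-log-archive",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "tier-then-expire-logs",
                    "Filter": {"Prefix": "logs/"},
                    "Status": "Enabled",
                    "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                    "Expiration": {"Days": 400},
                }
            ]
        },
    )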

Guidepost 3: How Will Regulatory Issues Affect Log Analytics?

Setting data-retention standards is a clear issue associated with meeting regulatory requirements, but the full scope of considerations in this area is far broader. Personally identifiable information (PII) and other sensitive data must be controlled and protected with verifiable measures that govern how it is used, transported, and stored. This set of concerns can affect, for example, how specific data can be used in public cloud infrastructures or shared with partners. Particularly for organizations that operate in multiple geographic areas, data sovereignty can be a complex issue. Because data is governed by the laws of the jurisdiction where it is located, organizations must be concerned with the physical locations of their data, particularly in cases where public cloud resources are used. For example, data that was collected through perfectly legitimate means in one country can be in violation of the privacy laws of another. As a related matter, confidential data might be subject to subpoena or other unwanted inspection by government or legal entities in the jurisdiction where it is stored. The response times required by subpoena actions can be a challenge in the common case where restoring older data from backup takes weeks and querying the associated large data volumes takes days more. For organizations that must regularly respond to law enforcement requests for archival data, architects might need to provide streamlined access.

In a world in which regulations over data privacy and usage are still evolving, forecasting legal requirements can be problematic. One approach is to consider the measures taken to date by government entities that have provided early leadership. Frameworks such as the European Union's General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) suggest the types of regulatory patterns that might later be adopted elsewhere.

Guidepost 4: What Data Sources and Formats Are Involved?

The semistructured nature of log data means that the sources of those logs play an important role in determining how to handle the data. As IT environments and the scope of sources grow and change over time, the associated challenges multiply. Particularly as IoT topologies are built out over the next several years, many organizations will find themselves needing to accommodate a vast assortment of new sensors and other endpoints, and that variety will also usher in an expanded diversity of log types and formats. As the universe of data sources and log types becomes broader and more complex, novel transformation pipelines will be needed to create a coherent whole from that data. Forecasting those changes in advance and accommodating them in how the infrastructure scales is an important set of requirements for architects. The ability to cost-effectively scale out compute and storage, independently of each other, is central to successfully meeting this set of evolving requirements.
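
As a minimal sketch of such a transformation pipeline (the formats and field mappings are hypothetical), the fragment below normalizes a syslog-style line and a JSON sensor event into one common record shape so that downstream analytics are insulated from source-level variety:

    import json
    import re

    # Hypothetical syslog-style line from a network device.
    SYSLOG_RE = re.compile(
        r"^(?P<ts>\S+ \S+) (?P<host>\S+) (?P<proc>\S+): (?P<msg>.*)$")

    def normalize_syslog(line):
        m = SYSLOG_RE.match(line)
        return {
            "timestamp": m.group("ts"),
            "source": m.group("host"),
            "message": m.group("msg"),
            "format": "syslog",
        }

    def normalize_json_event(line):
        # Hypothetical JSON event from an IoT sensor gateway.
        event = json.loads(line)
        return {
            "timestamp": event["time"],
            "source": event["device_id"],
            "message": event["status"],
            "format": "json",
        }

    # Two very different inputs reduced to one common record shape.
    records = [
        normalize_syslog("2021-05-01 12:00:01 fw01 kernel: link up on eth0"),
        normalize_json_event(
            '{"time": "2021-05-01T12:00:02Z", "device_id": "sensor-17", '
            '"status": "temperature high"}'),
    ]
    for record in records:
        print(record)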

Guidepost 5: What Role Will Changing Business Realities Have?

The changing needs of a business and its use of log analytics to guide business change are deeply intertwined. Planning should include considering what business questions log analytics can answer, such as how to allocate resources for maximum efficiency or whether to launch a marketing initiative geared toward increasing revenue. Architects must consider how intelligence generated through log analytics can inform business strategy. Just about any major business event can have an impact on infrastructure planning for log analytics. That reality makes it valuable for infrastructure planning to draw on the business's larger corporate strategy. For example, architects should consider the potential impacts if the company were to enter new market segments or geographies; either type of change would be likely to increase the volume and variety of log data. Likewise, mergers and acquisitions could introduce new and unpredictable infrastructures alongside what the company already operates. Because every organization is subject to unforeseen circumstances, architects must design flexible, scalable frameworks that can accommodate uncertain future needs.

Guidepost 6: What Are the Ongoing Query Requirements?

The central issues around query requirements are what types of searches are being run against the data and how many of them there are. In addition to searches by human users and machine-to-machine systems, planning must take into consideration factors such as executive dashboards, data visualization systems, reporting, and real-time alerting. Architects must account for the potential addition of such new loads on the log analytics infrastructure as well as the expected growth in workloads generated by existing systems.

The nature of the queries being made against the log data store is a key design factor for log analytics. In practical terms, architects should identify which data is most likely to have ad hoc queries made against it, as opposed to point queries. They also need to consider how often each type is likely to occur, as well as how much historical data is likely to be searched and how frequently. These capabilities should be tuned so that search does not become slow or cumbersome, leading to lost opportunities or missed deadlines. Likely usages, such as ad hoc queries run in conjunction with threat-hunting activity, should be considered to help guide this process. The answers to those questions can help guide design decisions about factors such as how to implement tiered storage or which data to index at the point of ingress.

Guidepost 7: How Are Data-Management Challenges Addressed?

Effectively managing data is a cornerstone of driving value from the massive volumes of log data constantly being generated by hardware, software, and processes of every description. In particular, the inherent variety within log data adds complexity to matters of schema evolution and backward compatibility as log data sources change and multiply. Solution architects should identify strategies for maintaining coherence and functionality in the face of such changes. Data management concerns associated with log analytics extend to changes in the operating environments in which log data is generated. For example, the firmware or software running on the sensors, systems, and other entities that transmit logs might be upgraded or reconfigured. Those changes can alter the formatting of log files, requiring adaptation by the log analytics platform. Proper planning requires establishing processes to anticipate such changes in advance and setting standards for addressing them when they arise.
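
One lightweight way to anticipate such changes, sketched below with hypothetical sources and field names, is to compare the fields arriving from each source against a registered baseline and flag drift before it silently breaks downstream dashboards or parsers:

    import json

    # Hypothetical baseline of fields each source is expected to emit.
    EXPECTED_FIELDS = {
        "firewall": {"timestamp", "src_ip", "dst_ip", "action"},
        "sensor": {"time", "device_id", "status"},
    }

    def check_schema_drift(source, raw_event):
        seen = set(json.loads(raw_event))
        expected = EXPECTED_FIELDS[source]
        missing, added = expected - seen, seen - expected
        if missing or added:
            # In practice this would raise an alert or route the event to a
            # quarantine stream for review rather than just printing.
            print(f"{source}: drift detected, missing={missing}, added={added}")

    # After a firmware upgrade, the sensor renames 'status' to 'state'.
    check_schema_drift("sensor",
                       '{"time": "2021-05-01T12:00:02Z", '
                       '"device_id": "sensor-17", "state": "ok"}')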

Guidepost 8: How Are Data Transformations Handled?

Transforming log data as it is collected from a wide variety of sources places significant demands on the underlying systems, demands that must typically be satisfied in real time on steadily growing data streams. The alternative of simply transmitting and storing logs as flat files puts untenable processing burdens on analytics processes. The transformation workload itself can be highly compute intensive, it takes place over a complex and varied compute layer, and it places significant demands on the storage layer. The tools and applications in place to perform those transformations must be capable of handling any foreseeable data requirements as well as integrating and interoperating effectively with the other components of the log analytics pipeline. Likewise, the underlying infrastructure must provide storage and compute architectures that can accommodate the associated performance and scalability requirements.

Guidepost 9: What About Data Protection and High Availability?

As log analytics evolves within an organization, it enables increasingly sophisticated usages that deliver increasingly significant value. Over time, those usages can become business critical or even mission critical, elevating the value of the underlying log data as well as the requirements for assuring its accessibility. Foreseeing and planning for that transition involves providing high availability for a subset of log data without interfering with the smooth operation of analytics based on mixed data streams. Designating specific bodies of log data as critical also ties into other aspects of IT planning. From a security perspective, that status must be considered when identifying requirements for how the data should be protected, how sensitive information within it should be masked, and how it can be recovered after tampering or other interference resulting from a breach. Backup and restore processes for protecting that data should also reflect its potentially changing value to the organization.


CHAPTER 9 Conclusion

To deliver on the potential for log analytics to improve operations across the business, architects are challenged to adapt their existing storage infrastructures to real-time needs for access to diverse log data at scale. Done right, legacy architectures based on spinning disks can be replaced with all-flash solutions at similar or lower cost. UFFO-based architectures, for example, can dramatically increase throughput and scale while tailoring data access to the needs of specific applications and workloads, so that log analytics can better meet the needs of the business. Disaggregating storage and compute facilitates scalability by enabling each to be built out and added to independently of the other, using a cost-effective "pay as you grow" scale-out approach. At the same time, a well-designed log analytics architecture can seamlessly scale capacity and performance together, helping to ensure that query and response performance continues to meet SLAs and provides an excellent end-user experience as the volume of log data and the purposes it serves continue to grow exponentially.

About the Authors

Matt Gillespie has been writing about enterprise data center technology for over 20 years, specializing in creating content for decision makers that's also precise and informed enough to pass muster with engineers. In earlier lives, he was an equities analyst at Morningstar and a staff writer at the University of California, Davis Center for Neuroscience. His current clients range from multinational corporations to startups, and his recent content focus has included multicloud computing, machine learning, IoT, security, IT Ops, analytics, HPC, and software-defined infrastructure. He can be found at https://www.linkedin.com/in/mgillespie1.

Charles Givre is cofounder and CEO of DataDistillr, an early-stage, VC-backed startup dedicated to making data easy to query and use. Prior to starting DataDistillr, Charles worked as a data scientist at the intersection of cybersecurity and data science at JP Morgan Chase, Deutsche Bank, and Booz Allen Hamilton, among others. Charles is passionate about teaching data science and analytic skills and has taught classes all over the world at conferences and universities, as well as for clients. He coauthored Learning Apache Drill (O'Reilly), and he is an instructor for multiple O'Reilly online training courses. Charles blogs at thedataist.com and tweets @cgivre. He can also be found at https://www.linkedin.com/in/cgivre.