ANT346
Know your data with machine learning in Open Distro for Elasticsearch

Jon Handler, Principal Solutions Architect, Search Services
Alolita Sharma, Principal Technologist, Amazon Web Services

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.

AWS Event Engine
https://dashboard.eventengine.run

Created to help AWS field teams run:
• Workshops
• GameDays
• Bootcamps
• Immersion Days
• And other events that require hands-on access to AWS accounts

https://dashboard.eventengine.run/dashboard

What is Elasticsearch?

Distributed search and analytics engine, built on Apache Lucene
Easy ingestion and visualization
Elasticsearch, Logstash, and Kibana: sometimes referred to as the “ELK Stack”

Source: DB-Engines.com, April 2019

Source: DB-Engines.com, October 2019

An Apache 2.0-licensed distribution of Elasticsearch enhanced with enterprise-grade security, alerting, SQL, and more

Open Distro for Elasticsearch
BENEFITS

100% open source: Providing you the freedoms, so you can freely view, use, change, and distribute the code
Enterprise-grade: Delivering security and advanced capabilities such as alerting, SQL, and cluster diagnostics
Community-driven: Providing individuals and organizations the freedom to easily contribute changes to the distro

Open Distro for Elasticsearch - Features

Security: Achieve encryption in-flight, fine-grained access control, audit logging, and compliance
Alerting: Monitor your data and send automatic alerts on any changes in your data
SQL: Easily interact with your Elasticsearch cluster and extract insights using the familiar SQL query syntax
Performance Analyzer: Get deep visibility into system bottlenecks even when your Elasticsearch cluster is under duress

Elasticsearch works like a database

1. Send data as JSON via REST APIs
2. Data is indexed: all fields are searchable, including nested JSON
3. Queries, via REST APIs, allow fielded matching, Boolean expressions, sorting, and analysis

[Diagram: (1) server, application, network, AWS, and other logs (application data) are sent to (2) the Elasticsearch cluster, which (3) serves application users, analysts, DevOps, and security]

You use the indexing APIs to send data to Elasticsearch*

POST endpoint/index/_doc
{
  "field": "value",
  "field": "value",
  "field": "value",
  "field": "value"
}

POST endpoint/index/_bulk
{ Action }
{ Field: Value, … }
{ Action }
{ Field: Value, … }
…
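The _bulk format above alternates an action line with a document line. As a minimal sketch of how a client might assemble that body, the helper below serializes documents into newline-delimited JSON. The index name "logs", the sample fields, and the function name are hypothetical; the sketch only builds the request body and does not send it.

```python
import json


def build_bulk_body(index, docs):
    """Serialize documents into the newline-delimited body used by _bulk."""
    lines = []
    for doc in docs:
        # Action line: says what to do with the document on the next line.
        lines.append(json.dumps({"index": {"_index": index}}))
        # Source line: the document itself, as a single-line JSON object.
        lines.append(json.dumps(doc))
    # A bulk body must end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical sample documents.
docs = [
    {"host": "web-1", "status": 200},
    {"host": "web-2", "status": 503},
]
body = build_bulk_body("logs", docs)
print(body, end="")
```

The resulting string would be sent as the body of POST endpoint/_bulk with the Content-Type header set to application/x-ndjson.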

* Your ingestion tools will probably automate this

You use the query APIs to retrieve data from Elasticsearch

[Diagram: inside the Elasticsearch cluster, a query goes to the query engine; matches pass through scoring/sorting to produce ranked results]

You use aggregations to analyze log data
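As an illustration of a query sent through the query APIs, the sketch below builds a search body that combines fielded matching, a Boolean expression, and sorting. The field names ("message", "status", "@timestamp") and the helper name are hypothetical and depend on your own index.

```python
import json


def build_search_body(text, status, size=10):
    """Build a query DSL body: full-text match plus an exact-value filter."""
    return {
        "size": size,
        "query": {
            "bool": {
                # Fielded, analyzed match on a text field.
                "must": [{"match": {"message": text}}],
                # Exact-value filter; filters do not affect scoring.
                "filter": [{"term": {"status": status}}],
            }
        },
        # Newest documents first, instead of ordering by relevance score.
        "sort": [{"@timestamp": {"order": "desc"}}],
    }


body = build_search_body("timeout", 503)
print(json.dumps(body, indent=2))
```

This body would be sent with POST endpoint/index/_search. Dropping the "sort" entry returns results ranked by relevance score instead.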

[Diagram: inside the Elasticsearch cluster, a query goes to the query engine; matches pass to the analysis engine (aggregations)]

• Histogram
• Numeric sum, min, max
• Terms bucketing
• Nesting

Kibana: search, analyze, and visualize log data

Simple to get started

1. Visit the website
2. Download & install the Elasticsearch and Kibana packages
3. Load and query data

Flexible deployment options

• Docker
• RPM
• Debian
• Tarball

Deploy on AWS with AWS CloudFormation

CloudFormation stack set for an Open Distro cluster in VPC

Community repo link https://tinyurl.com/y28ohx4u

Lab Guide: http://tinyurl.com/ukjvwjv

Where things go wrong

Resource starvation: CPU, heap, network, disk I/O
Skewed usage: Hot nodes, hot shards, uneven multi-tenancy
Elasticsearch operations: Garbage collection, segment merges, FS caching

Performance Analyzer
Get deep diagnostic insights into your cluster

Identify bottlenecks across the stack: Provides a powerful REST API for querying Elasticsearch metrics to diagnose issues across the stack
Runs independent of your cluster: Perform diagnostics even if the cluster is under duress
Analyze hundreds of data points: Supports over 60 metrics across 10 dimensions for instrumentation of your cluster health

Performance Analyzer provides metrics about

• Elasticsearch
• JVM
• OS/Hardware

[Diagram labels: ES, Events]

Components

Plugin: Instrumentation of the Elasticsearch process at a code level
Reader: Gathers information and stores it in a local DB
PerfTop: Lightweight, ASCII visualizations of Performance Analyzer data

PerfTop CLI

• Provides pre-configured dashboards for analyzing cluster, node, and shard performance

• Custom JSON templates to create the dashboards to diagnose your cluster performance

Performance Analyzer to Elasticsearch
https://tinyurl.com/y2u2mfwe

[Diagram: retrieve metrics from the Performance Analyzer REST API, parse/transform them into Elasticsearch documents, send them with _bulk, and visualize them in Kibana]

Security
KEEP YOUR DATA SECURE

Encryption: Keep your data secure when in transit
Authentication: Leverage your existing authentication infrastructure
Authorization: Granular access control to control the user actions on your cluster
Audit logging: Track and record all user actions and meet HIPAA, PCI compliance

Security core concepts

Users: Users make requests to Elasticsearch clusters. A user has credentials, zero+ backend roles, and zero+ custom attributes
Roles: Security roles define the scope of a permission or action group: cluster, index, document, or field
Permissions: An individual action, such as creating an index (e.g. indices:admin/create)
Action groups: Permission sets. E.g., the predefined SEARCH action group authorizes roles to use the _search and _msearch APIs
Backend roles: Additional, external roles that come from an authorization backend (e.g. LDAP/Active Directory)
Role mappings: Users assume roles after they successfully authenticate. Role mappings map roles to users (or backend roles)

Encryption

• Encrypt traffic to your cluster for Kibana/APIs
• Node-to-node encryption; all intra-cluster traffic is encrypted; trust relationships established with certs
• TLS 1.2, OpenSSL

Authentication flow

Authc: Via basic HTTP Auth, LDAP, AD, SAML, web tokens, SSL
AuthZ: Backend identities mapped to Open Distro roles
Permissions: Allow a role to perform an action against a cluster/index/document/field
Action groups: Groups of permissions

[Diagram: a request with credentials goes to Authc, backed by the authc provider (internal user database or federated IDPs: LDAP, SAML, OpenID Connect, JSON web token, certificate); the request with user/backend roles goes to AuthZ, backed by roles and permissions; then the response is returned]

Kibana multi-tenancy

• Create tenants for Kibana (public, per-group, private)
• When using Kibana, select a tenant
• Assign tenants to users to enable access
• Tenancy is not the same as access control

[Diagram: Group A and Group B, each with its own permissions; Dashboard A and Dashboard B; Index 1 and Index 2]

Auditing – What has happened?

• Track requests to Elasticsearch: failed logins, SSL exceptions, bad headers, …
• Multiple storage options: Stdout, same or other Elasticsearch cluster, Log4J, custom webhooks
• Combine with Alerting plugin for notifications

Alerting - What's it good for?

• Infrastructure: Uptime, network traffic, instance availability, CPU

• Application: KPI drops, feature use

• DevOps: APM, observability

• SIEM: Fraud detection, DOS

Alerts

[Diagram: Monitor, Trigger, Action, Destination]

Monitor – A job that runs on a defined schedule and queries Elasticsearch
Trigger – Conditions that, if met, generate alerts and can perform some action
Alert – A notification that a monitor’s trigger condition has been met
Action – Information you want the monitor to send out after being triggered
Destination – A reusable location for an action, such as Amazon Chime, Slack, or a webhook URL

Life cycle of an Alert

When a trigger is active, the alert is triggered

The trigger sends notifications to your destinations

Acknowledge an alert to stop notifications for the trigger

The alert goes inactive when the monitor's value falls back within thresholds

SQL support
Query data with SQL

Comprehensive SQL support: Supports over 40 functions, data types, and commands including join support
Translate SQL to JSON: Create JSON using SQL to configure sophisticated access control policies
Use existing tools: Provides a JDBC driver so you can use a variety of business intelligence, analytics, and ETL tools

New Features
IN DEVELOPMENT

• Performance Root Cause Analysis
• k-NN search
• Anomaly Detection: Machine Learning algorithms (RCF)

k-NN search
NEW! MACHINE LEARNING PLUGIN

Unsupervised Machine Learning algorithm
k-NN search is used to find the nearest K points in a vector space. Datasets are represented in the form of vectors.
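To make "nearest K points in a vector space" concrete, here is a minimal brute-force sketch that ranks every point by Euclidean distance to a query vector. It is an exact search written for illustration only, not the plugin's implementation; the sample points and the function name are hypothetical.

```python
import heapq
import math


def knn(query, points, k):
    """Exact k-NN: return the k points closest to the query vector."""
    # math.dist computes the Euclidean distance between two points.
    return heapq.nsmallest(k, points, key=lambda p: math.dist(query, p))


# Hypothetical two-dimensional dataset.
points = [(1.0, 1.0), (3.0, 2.0), (8.0, 9.0), (0.5, 1.5), (9.0, 8.0)]
nearest = knn((1.0, 2.0), points, k=2)
print(nearest)
```

Brute force compares the query against every vector, so its cost grows linearly with the dataset. That cost is the usual motivation for approximate nearest-neighbor indexes.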

Similarity Search
Used in similarity search and image recognition applications

Proximity
The k-NN algorithm assumes that similar things exist in close proximity. Searches return the items most similar to the input item, having first converted the item to high-dimensional spatial coordinates called embeddings, where proximity in the hyperspace translates to item similarity.

Key components

• k-NN codec
• k-NN plugin which extends Mapper, Search and Action plugins
• New Apache Lucene file formats _hnsw and _hnswc
• NMS library integration through JNI

NMS library

• Non-Metric Space Library (NMSLIB) is an efficient cross-platform similarity search library in C++
• Is a toolkit for evaluation of similarity search methods
• Implements approximate k-NN search using “Hierarchical Navigable Small World” (HNSW) graphs
• The core library does not have any third-party dependencies and is lightweight
• NMSLIB makes it easy to add new search methods and distance functions

Apache Lucene File Formats
Two new file formats with extensions _hnsw and _hnswc have been added to store serialized graphs

These file formats

• Co-exist with the other Apache Lucene file formats
• Are immutable like other Apache Lucene files, which makes them file system cache friendly and thread safe
• Are created with each segment and deleted when segments merge
• Get deserialized and loaded into memory only during the query phase, when the query vector is run against the index to fetch neighbors
• Have footers similar to the other Apache Lucene files, which Elasticsearch uses for checksum validation during recovery

k-NN codec

• A new codec, KNNCodec, adds a new Apache Lucene index file format, _hnsw, for storing and retrieving the vectors in the NMS library using a JNI layer
• Floating point vectors are converted to a byte array and stored as binary doc values
• For other functions, the request is delegated to the underlying Apache Lucene version supported by the related Elasticsearch version
• KNNDocValuesConsumer adds floating point vectors to the _hnsw index
• KNNDocValuesProducer reads floating point vectors from the _hnsw index

k-NN plugin

• The k-NN plugin customizes the mapper to index the floating point vectors using the Mapper plugin
• The k-NN plugin customizes the query clause to query the K nearest neighbors using the Search plugin

Here is the template for the k-NN plugin:
[Figure: plugin template, not recoverable from the extracted text]

Mapper plugin

• Mapper plugin helps add new field data types
• New data type named ‘knn_vector’ to represent the vector fields in a document

Search plugin

• Plugin for extending search time behavior
• Helps define custom query clauses
• New query clause named “knn”

[Figure: k-NN Index]

[Figure: Indexing Flow]

Indexing k-NN fields
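As a rough sketch of what indexing a k-NN field can look like from a client, the snippet below builds an index-creation body that maps a field to the knn_vector type, plus a document carrying a vector. The "index.knn" setting and the "dimension" parameter are taken from the plugin's published documentation rather than from these slides, so treat them as assumptions and check them against your plugin version. The field name "my_vector" and the helper name are hypothetical.

```python
import json

DIMENSION = 2  # Hypothetical vector size for this example.

# Body for creating the index (sent with PUT endpoint/index).
# Assumption: "index.knn" enables k-NN for the index and "dimension"
# declares the vector length, as in the plugin documentation.
create_index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "my_vector": {"type": "knn_vector", "dimension": DIMENSION}
        }
    },
}


def make_vector_doc(vector):
    """Build a document whose vector matches the mapped dimension."""
    if len(vector) != DIMENSION:
        raise ValueError(f"expected {DIMENSION} values, got {len(vector)}")
    return {"my_vector": [float(x) for x in vector]}


doc = make_vector_doc([1.5, 2.5])
print(json.dumps(create_index_body))
print(json.dumps(doc))
```

The document would then be indexed with the same indexing APIs used for any other document, for example POST endpoint/index/_doc.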

DEMO

[Figure: Search Request Flow]

Search Request Flow
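For illustration, a search request using the “knn” query clause might be built as follows. The clause layout (a field name mapping to "vector" and "k") follows the plugin's published documentation rather than these slides, so treat it as an assumption. The field name "my_vector" and the helper name are hypothetical.

```python
import json


def build_knn_query(field, vector, k):
    """Build a search body asking for the k nearest neighbors of a vector."""
    return {
        # Return at most k hits.
        "size": k,
        "query": {
            "knn": {
                # Assumed clause shape: the target field, the query vector,
                # and the number of neighbors to retrieve.
                field: {"vector": [float(x) for x in vector], "k": k}
            }
        },
    }


query = build_knn_query("my_vector", [2, 3], k=2)
print(json.dumps(query, indent=2))
```

This body would be sent with POST endpoint/index/_search, like any other query.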

DEMO

k-NN
LAB 6

See Lab Guide for step-by-step instructions.

Anomaly Detection
NEW! MACHINE LEARNING PLUGIN

Real time anomaly detection on streaming data

The anomaly detection plugin automatically detects anomalies in your Elasticsearch data in near real time using the Random Cut Forest (RCF) algorithm. RCF is an unsupervised machine learning algorithm that models a sketch of your incoming data stream to compute an anomaly grade and confidence score value for each incoming data point. The anomaly detection plugin uses the alerting plugin to notify you as soon as an anomaly is detected.

Useful Resources

Source Code
https://github.com/opendistro-for-elasticsearch/anomaly-detection

Anomaly Detection Kibana plugin
https://github.com/opendistro-for-elasticsearch/anomaly-detection-kibana-plugin

Random Cut Forest (RCF) algorithm library used by Anomaly Detection plugin
https://github.com/aws/random-cut-forest-by-aws

Technical Documentation
https://opendistro.github.io/for-elasticsearch-docs/docs/ad/

Blog Posts
Real Time Anomaly Detection in Open Distro for Elasticsearch: https://bit.ly/2sGlAS5
Random Cut Forests: https://bit.ly/35YsQXO

Community and Contributions

Open Distro for Elasticsearch’s success is driven by the community’s participation, contributions, and innovation to the project

Use the resources below to join in for project discussions, share your knowledge with fellow community members, write a blog post, participate in Open Distro events and meetups, contribute documentation, file PRs, bugs or request a feature!

Website: opendistro.github.io
Project discussion forums (Q&A): discuss.opendistrocommunity.dev
Source code: github.com/opendistro-for-elasticsearch
Developer community: github.com/opendistro-for-elasticsearch/community/issues
Technical articles and blog posts: opendistro.github.io/blog

Join our online community meetings to learn about what’s new in Open Distro, the project roadmap, and new features, and to take part in Q&A!

Regular community meetings online
https://www.meetup.com/Open-Distro-for-Elasticsearch-Meetup-Group/events

Face-to-face meetings
https://www.meetup.com/Open-Distro-for-Elasticsearch-Meetup-Group/events

Open Source Conferences and Events
https://opendistro.github.io/for-elasticsearch/events.html

Related breakouts

OPN 212 Analyze your log data with Open Distro for Elasticsearch
OPN 204 Secure your Open Distro for Elasticsearch cluster
OPN 302-R, 302-R1 Get started with Open Distro for Elasticsearch
OPN 310-R, 310-R1 Alerting with Open Distro for Elasticsearch
OPN 311-R, 311-R1 Analyze Performance of your workload with Open Distro for Elasticsearch
ANT 346 Know your data with machine learning in Open Distro for Elasticsearch

Learn big data with AWS Training and Certification
Resources created by the experts at AWS to help you build and validate data analytics skills

New free digital course, Data Analytics Fundamentals, introduces Amazon S3, Amazon Kinesis, Amazon EMR, AWS Glue, and Amazon Redshift

Classroom offerings, including Big Data on AWS, feature AWS expert instructors and hands-on labs

Validate expertise with the AWS Certified Big Data - Specialty exam or the new AWS Certified Data Analytics - Specialty beta exam

Visit aws.amazon.com/training/paths-specialty/

Thank you!