ANT346
Know your data with machine learning in Open Distro for Elasticsearch
Jon Handler, Principal Solutions Architect, Search Services, Amazon Web Services
Alolita Sharma, Principal Technologist, Amazon Web Services
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
AWS Event Engine
https://dashboard.eventengine.run
Created to help AWS field teams run:
• Workshops
• GameDays
• Bootcamps
• Immersion Days
• And other events that require hands-on access to AWS accounts
https://dashboard.eventengine.run/dashboard
What is Elasticsearch?
Elasticsearch, Logstash, and Kibana: Sometimes referred to as the “ELK Stack”
Distributed search and analytics engine: Built on Apache Lucene
Easy ingestion and visualization
Source: DB-Engines.com, April 2019
Source: DB-Engines.com, October 2019
An Apache 2.0-licensed distribution of Elasticsearch enhanced with enterprise-grade security, alerting, SQL, and more
Open Distro for Elasticsearch
BENEFITS
100% open source: Providing you the freedoms, so you can freely view, use, change, and distribute the code
Enterprise-grade: Delivering security and advanced capabilities such as alerting, SQL, and cluster diagnostics
Community-driven: Providing individuals and organizations the freedom to easily contribute changes to the distro
Open Distro for Elasticsearch - Features
Security: Achieve encryption in-flight, fine-grained access control, audit logging, and compliance
Alerting: Monitor your data and send automatic alerts on any changes in your data
SQL: Easily interact with your Elasticsearch cluster and extract insights using the familiar SQL query syntax
Performance Analyzer: Get deep visibility into system bottlenecks even when your Elasticsearch cluster is under duress
Elasticsearch works like a database
1. Send data as JSON via REST APIs
2. Data is indexed: all fields searchable, including nested JSON
3. Queries, via REST APIs, allow fielded matching and Boolean expressions, and include sorting and analysis
Diagram: (1) server, application, network, AWS, and other logs, plus application data; (2) Elasticsearch cluster; (3) application users, analysts, DevOps, security
You use the indexing APIs to send data to Elasticsearch*
POST endpoint/index/_doc
{
  "field": "value",
  "field": "value",
  "field": "value",
  "field": "value"
}

POST endpoint/index/_bulk
{ Action }
{ Field: Value, … }
{ Action }
{ Field: Value, … }
…
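The _bulk body shown above alternates an action line with a document line. As a minimal sketch of that layout (not the workshop's own tooling), the Python below builds such a body as newline-delimited JSON. The index name `weblogs`, the sample documents, and the helper name `build_bulk_body` are hypothetical and only illustrate the format.

```python
import json


def build_bulk_body(index_name, docs):
    """Build a _bulk request body: one action line, then one source line, per document."""
    lines = []
    for doc in docs:
        # Action line: tells Elasticsearch to index the next line into index_name.
        lines.append(json.dumps({"index": {"_index": index_name}}))
        # Source line: the document itself, serialized on a single line.
        lines.append(json.dumps(doc))
    # The bulk format is newline-delimited and must end with a trailing newline.
    return "\n".join(lines) + "\n"


# Hypothetical sample documents, for illustration only.
logs = [
    {"host": "web-1", "status": 200},
    {"host": "web-2", "status": 503},
]
body = build_bulk_body("weblogs", logs)
print(body)
```

The resulting string would be sent as the body of `POST endpoint/index/_bulk` with the `Content-Type: application/x-ndjson` header; in practice an ingestion tool or client library usually does this for you.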
* Your ingestion tools will probably automate this
You use the query APIs to retrieve data from Elasticsearch
Diagram: Elasticsearch cluster: Query Engine → Matches → Scoring/Sorting → Ranked results
You use aggregations to analyze log data
Diagram: Elasticsearch cluster: Query Engine → Matches → Analysis Engine (Aggregations)
• Histogram
• Numeric sum, min, max
• Terms bucketing
• Nesting
Kibana: search, analyze, and visualize log data
Simple to get started
1. Visit the website
2. Download & install the Elasticsearch and Kibana packages
3. Load and query data
Flexible deployment options
• Docker
• RPM
• Debian
• Tarball
Deploy on AWS with AWS CloudFormation
CloudFormation stack set for an Open Distro cluster in VPC
Community repo link https://tinyurl.com/y28ohx4u
Lab Guide: http://tinyurl.com/ukjvwjv
Where things go wrong
Resource starvation: CPU, heap, network, disk I/O
Skewed usage: Hot nodes, hot shards, uneven multi-tenancy
Elasticsearch operations: Garbage collection, segment merges, FS caching
Performance Analyzer
Get deep diagnostic insights into your cluster
Identify bottlenecks across the stack: Provides a powerful REST API for querying Elasticsearch metrics to diagnose issues across the stack
Runs independent of your cluster: Perform diagnostics even if the cluster is under duress
Analyze hundreds of data points: Supports over 60 metrics across 10 dimensions for instrumentation of your cluster health
Performance Analyzer provides metrics about Elasticsearch
Diagram labels: JVM, OS/Hardware, ES, Events
Components
Plugin: Instrumentation of the Elasticsearch process at a code level
Reader: Gathers information and stores it in a local DB
PerfTop: Lightweight, ASCII visualizations of Performance Analyzer data
PerfTop CLI
• Provides pre-configured dashboards for analyzing cluster, node, and shard performance
• Custom JSON templates to create the dashboards to diagnose your cluster performance
Performance Analyzer to Elasticsearch
https://tinyurl.com/y2u2mfwe
Diagram: Performance Analyzer REST API → Retrieve → Parse/Transform → Elasticsearch documents → _bulk → Kibana
Security
KEEP YOUR DATA SECURE
Encryption: Keep your data secure when in transit
Authentication: Leverage your existing authentication infrastructure
Authorization: Granular access control to control the user actions on your cluster
Audit logging: Track and record all user actions and meet HIPAA, PCI compliance
Security core concepts
Users: Users make requests to Elasticsearch clusters. A user has credentials, zero+ backend roles, and zero+ custom attributes
Roles: Security roles define the scope of a permission or action group: cluster, index, document, or field
Permissions: An individual action, such as creating an index (e.g. indices:admin/create)
Action groups: Permission sets. E.g., the predefined SEARCH action group authorizes roles to use the _search and _msearch APIs
Backend roles: Additional, external roles that come from an authorization backend (e.g. LDAP/Active Directory)
Role mappings: Users assume roles after they successfully authenticate. Role mappings map roles to users (or backend roles)
Encryption
Encrypt traffic to your cluster for Kibana/APIs
Node-to-node encryption; all intra-cluster traffic is encrypted; trust relationships established with certs
TLS 1.2, OpenSSL
Authentication flow
Authc: Via basic HTTP Auth, LDAP, AD, SAML, web tokens, SSL
AuthZ: Backend identities mapped to Open Distro roles
Permissions: Allow a role to perform an action against a cluster/index/document/field
Action groups: Groups of permissions
Diagram: Request with credentials → Authc → Request with user/backend roles → AuthZ → Response
Authc provider: Internal user database, federated IDPs (LDAP, SAML, OpenID Connect, JSON web token, certificate)
AuthZ: Roles and permissions
Kibana multi-tenancy
• Create tenants for Kibana (public, per-group, private)
• When using Kibana select a tenant
• Assign tenants to users to enable access
• Tenancy is not the same as access control
Diagram labels: Group A, Group A permissions, Dashboard A, Index 1; Group B, Group B permissions, Dashboard B, Index 2
Auditing – What has happened?
Track requests to Elasticsearch: Failed logins, SSL exceptions, bad headers, …
Multiple storage options: Stdout, same or other Elasticsearch cluster, Log4J, custom webhooks
Combine with Alerting plugin for notifications
Alerting - What's it good for?
• Infrastructure: Uptime, network traffic, instance availability, CPU
• Application: KPI drops, feature use
• DevOps: APM, observability
• SIEM: Fraud detection, DOS
Alerts
Diagram labels: Monitor, Trigger, Action, Destination
Monitor – A job that runs on a defined schedule and queries Elasticsearch
Trigger – Conditions that, if met, generate alerts and can perform some action
Alert – A notification that a monitor’s trigger condition has been met
Action – Information you want the monitor to send out after being triggered
Destination – A reusable location for an action, such as Amazon Chime, Slack, or a webhook URL
Life cycle of an Alert
When a trigger is active, the alert is triggered
The trigger sends notifications to your destinations
Acknowledge an alert to stop notifications for the trigger
The alert goes inactive when the monitor's value falls back within thresholds
SQL support
Query data with SQL
Comprehensive SQL support: Supports over 40 functions, data types, and commands including join support
Translate SQL to JSON: Create JSON using SQL to configure sophisticated access control policies
Use existing tools: Provides a JDBC driver so you can use a variety of business intelligence, analytics, and ETL tools
New Features
IN DEVELOPMENT
• Performance Root Cause Analysis
• k-NN search
• Anomaly Detection: Machine Learning algorithms (RCF)
k-NN search
NEW! MACHINE LEARNING PLUGIN
Unsupervised Machine Learning algorithm: k-NN search is used to find the nearest K points in a vector space. Datasets are represented in the form of vectors.
Similarity Search: Used in similarity search and image recognition applications
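To make "find the nearest K points in a vector space" concrete, here is a small sketch of exact (brute-force) k-NN using Euclidean distance. This is an illustration only: the plugin described in these slides uses approximate search over HNSW graphs rather than scanning every vector, and the document ids, vectors, and the function name `k_nearest` are made up for the example.

```python
import heapq
import math


def k_nearest(query, vectors, k):
    """Return the ids of the k vectors closest to `query` by Euclidean distance."""
    # Score every stored vector against the query (exact search, O(n) per query).
    scored = ((math.dist(query, vec), doc_id) for doc_id, vec in vectors.items())
    # Keep only the k smallest distances, closest first.
    return [doc_id for _, doc_id in heapq.nsmallest(k, scored)]


# Hypothetical two-dimensional "embeddings" keyed by document id.
vectors = {
    "doc-a": [1.0, 1.0],
    "doc-b": [2.0, 2.5],
    "doc-c": [9.0, 9.0],
    "doc-d": [3.0, 3.0],
}
nearest = k_nearest([2.0, 3.0], vectors, k=2)
print(nearest)  # ['doc-b', 'doc-d']
```

Exact search like this is fine for a handful of vectors; approximate methods trade a little recall for much lower query cost on large datasets.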
Proximity: The k-NN algorithm assumes that similar things exist in close proximity. Searches return the items most similar to the input item, having first converted the item to high-dimensional spatial coordinates called embeddings, where proximity in the hyperspace translates to item similarity.
Key components
• k-NN codec
• k-NN plugin, which extends the Mapper, Search, and Action plugins
• New Apache Lucene file formats _hnsw and _hnswc
• NMS library integration through JNI
NMS library
• Non-Metric Space Library (NMSLIB) is an efficient cross-platform similarity search library in C++
• Is a toolkit for evaluation of similarity search methods
• Implements approximate k-NN search using “Hierarchical Navigable Small World” (HNSW) graphs
• The core library does not have any third-party dependencies and is lightweight
• NMSLIB makes it easy to add new search methods and distance functions
Apache Lucene File Formats
Two new file formats with extensions _hnsw and _hnswc have been added to store serialized graphs
These file formats
• Co-exist with the other Apache Lucene file formats
• Are immutable like other Apache Lucene files, which makes them file system cache friendly and thread safe
• Are created with each segment and deleted when segments merge
• Get deserialized and loaded into memory only during the query phase, when the query vector is run against the index to fetch neighbors
• Have footers similar to the other Apache Lucene files, which Elasticsearch uses for checksum validation during recovery
k-NN codec
• A new codec, KNNCodec, adds a new Apache Lucene index file format, _hnsw, for storing and retrieving the vectors in the NMS library using a JNI layer
• Floating point vectors are converted to byte arrays and stored as binary doc values
• For other functions, we delegate the request to the underlying Apache Lucene version supported by the related Elasticsearch version
• KNNDocValuesConsumer adds floating point vectors to the _hnsw index
• KNNDocValuesProducer reads floating point vectors from the _hnsw index
k-NN plugin
• The k-NN plugin customizes the mapper to index the floating point vectors using the Mapper plugin
• The k-NN plugin customizes the query clause to query the K nearest neighbors using the Search plugin
Here is the template for KNN plugin:
Mapper plugin
• Mapper plugin helps add new field data types
• New data type named ‘knn_vector’ to represent the vector fields in a document
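As a sketch of how the knn_vector field type might be used, the Python below builds three JSON request bodies: one to create an index with a knn_vector field, one document, and one search using the plugin's knn query clause. The `index.knn` setting, the `dimension` parameter, and the `vector`/`k` query keys reflect the plugin documentation as I understand it; since the plugin was still in development at the time of this talk, treat them as assumptions and check the documentation for your version. The field name `item_vector`, the dimension, and the helper names are hypothetical.

```python
import json

DIMENSION = 4  # illustrative vector length; every indexed vector must match it

# Body for creating the index (assumed settings and mapping parameter names).
index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "item_vector": {"type": "knn_vector", "dimension": DIMENSION}
        }
    },
}


def make_doc(vector):
    """Build a document body, rejecting vectors of the wrong length."""
    if len(vector) != DIMENSION:
        raise ValueError(f"expected {DIMENSION} values, got {len(vector)}")
    return {"item_vector": vector}


def make_knn_query(vector, k):
    """Build a search body asking for the k nearest neighbors of `vector`."""
    return {
        "size": k,
        "query": {"knn": {"item_vector": {"vector": vector, "k": k}}},
    }


doc = make_doc([1.5, 2.5, 3.0, 0.5])
query = make_knn_query([1.0, 2.0, 3.0, 1.0], k=3)
print(json.dumps(index_body, indent=2))
print(json.dumps(doc))
print(json.dumps(query))
```

These bodies would be sent with the usual index-creation, indexing, and _search REST calls; the Lab Guide has the step-by-step version.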
Search plugin
• Plugin for extending search time behavior
• Helps define custom query clauses
• New query clause named “knn”
k-NN Index
Indexing Flow
Indexing k-NN fields
DEMO
Search Request Flow
DEMO
k-NN
LAB 6
See Lab Guide for step-by-step instructions.
Anomaly Detection
NEW! MACHINE LEARNING PLUGIN
Real time anomaly detection on streaming data
The anomaly detection plugin automatically detects anomalies in your Elasticsearch data in near real time using the Random Cut Forest (RCF) algorithm.
RCF is an unsupervised machine learning algorithm that models a sketch of your incoming data stream to compute an anomaly grade and confidence score value for each incoming data point.
The anomaly detection plugin uses the alerting plugin to notify you as soon as an anomaly is detected.
Useful Resources
Source Code: https://github.com/opendistro-for-elasticsearch/anomaly-detection
Anomaly Detection Kibana plugin: https://github.com/opendistro-for-elasticsearch/anomaly-detection-kibana-plugin
Random Cut Forest (RCF) algorithm library used by Anomaly Detection plugin: https://github.com/aws/random-cut-forest-by-aws
Technical Documentation: https://opendistro.github.io/for-elasticsearch-docs/docs/ad/
Blog Posts
Real Time Anomaly Detection in Open Distro for Elasticsearch: https://bit.ly/2sGlAS5
Random Cut Forests: https://bit.ly/35YsQXO
Community and Contributions
Open Distro for Elasticsearch’s success is driven by the community’s participation, contributions, and innovation to the project.
Use the resources below to join project discussions, share your knowledge with fellow community members, write a blog post, participate in Open Distro events and meetups, contribute documentation, file PRs and bugs, or request a feature!
Website: opendistro.github.io
Project discussion forums (Q&A): discuss.opendistrocommunity.dev
Source code: github.com/opendistro-for-elasticsearch
Developer community: github.com/opendistro-for-elasticsearch/community/issues
Technical articles and blog posts: opendistro.github.io/blog
Community and Contributions
Join our online community meeting to learn what’s new in Open Distro, including the project roadmap, features, and Q&A!
Regular community meetings online https://www.meetup.com/Open-Distro-for-Elasticsearch-Meetup-Group/events
Face-to-face meetings https://www.meetup.com/Open-Distro-for-Elasticsearch-Meetup-Group/events
Open Source Conferences and Events https://opendistro.github.io/for-elasticsearch/events.html
Related breakouts
OPN 212 Analyze your log data with Open Distro for Elasticsearch
OPN 204 Secure your Open Distro for Elasticsearch cluster
OPN 302-R, 302-R1 Get started with Open Distro for Elasticsearch
OPN 310-R, 310-R1 Alerting with Open Distro for Elasticsearch
OPN 311-R, 311-R1 Analyze Performance of your workload with Open Distro for Elasticsearch
ANT 346 Know your data with machine learning in Open Distro for Elasticsearch
Learn big data with AWS Training and Certification
Resources created by the experts at AWS to help you build and validate data analytics skills
New free digital course, Data Analytics Fundamentals, introduces Amazon S3, Amazon Kinesis, Amazon EMR, AWS Glue, and Amazon Redshift
Classroom offerings, including Big Data on AWS, feature AWS expert instructors and hands-on labs
Validate expertise with the AWS Certified Big Data - Specialty exam or the new AWS Certified Data Analytics - Specialty beta exam
Visit aws.amazon.com/training/paths-specialty/
Thank you!