Operating Elasticsearch at Scale! Agenda • Introductions • Search Platform @ PayPal
• Elasticsearch Primer
• Making Elasticsearch truly Elastic
• Managing Elasticsearch @ Scale
• Elasticsearch in action!
• Q & A Introductions
Vikram Aishwarya Ramakrishnan Sankaravadivel
Senior Manager, Senior Software Engineer Technology Platforms & Experience Technology Platforms & Experience
15 years of industry experience with Passionate technologist with deep expertise in specific focus in the areas of Large scale managing Large scale Elasticsearch deployments distributed systems, Enterprise Search & Big Data & Reactive systems
[email protected] [email protected] https://www.linkedin.com/in/vikramakrishnan https://www.linkedin.com/in/aishwaryasankar/ Search Platform @ PayPal
Fully Managed Search offering for PayPal Inc.
1500+ nodes
4+ PB of Data
Other Offerings
Custom Security Plugin 20+ Billion Ingests / day
Custom Monitoring & Alerting 50+ Million Searches / day Elasticsearch Primer
Open Source Distributed
Released in 2010 and built on Supports replication, sharding & top of Apache Lucene routing
Schema Free Restful APIs
Supports JSON structured API driven and all actions can be documents, detects types and performed via simple APIs using makes them searchable. JSON over HTTP Elasticsearch Primer
(un)Structured Searches Fuzzy Searches
Full text search, Faceted Autocomplete, ”Did you mean”, Navigation, Pagination Suggestions.
Powerful Aggregations Visualizations
Bucketing, Metric, Pipelining Tools like Kibana, Grafana to provide deep actionable insights Elasticsearch Primer What’s in it for SREs?
Improved Observability Realtime, Actionable Insights Elasticsearch Primer Things to know before we dig deep…
• Cluster • Field
• Node • Mapping
• Document • Shards
• Index • Routing Elasticsearch Primer Anatomy of an Elasticsearch cluster
*master
P1 P2 P3
R3 R1 R2
Primary Shard
Replica Shard Elasticsearch Primer Anatomy of an Indexing request
3 forward to replica shard(s)
*master
P1 P2 P3
R3 R1 R2
1 2
route request based on Index request id Elasticsearch Primer Dissecting Indexing doc 1 doc 2 Inverted Index Term Document source text The Quick brown fox jumped The two lazy dogs were slower over the lazy dog than the less lazy dog, Rover quick 1 brown 1
html_strip The Quick brown fox jumped over the lazy The two lazy dogs were slower than the fox 1 dog less lazy dog, Rover filter jumped 1 over 1 The brown The two lazy dogs were slower standard Quick fox jumped lazy 1, 2 than the less lazy dog Rover tokenizer over the lazy dog dog 1, 2 two 2 the quick brown were lower case fox jumped the two lazy dogs slower than the less lazy dog rover dogs 2 filter over the lazy dog slower 2 less 2 stop words quick brown fox jumped two lazy dogs slower filter over lazy dog less lazy dog rover rover 2 Elasticsearch Primer Anatomy of a search request. Gather globally 3 sorted result set of doc ids
*master
P1 P2 P3
R3 R1 R2
forward to every 2 primary / replica 1 Search request shard of the index Elasticsearch Primer Dissecting a search request. Inverted Index Search for “The quick Brown Dog” Term Document quick 1 html_strip brown 1 The Quick Brown Dog filter fox 1 jumped 1 quick standard over 1 The quick Brown Dog tokenizer brown lazy 1, 2 dog 1, 2 lower case the quick brown dog dog two 2 filter dogs 2 slower 2 stop words quick brown dog filter less 2 rover 2 Elasticsearch Primer Installation
$ curl -L -O https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch- {VERSION}.tar.gz
$ tar –xvf elasticsearch-{VERSION}.tar.gz
$ cd elasticsearch-{VERSION}/bin
$ ./elasticsearch Elasticsearch Primer Installation
Now, how do we
scale this for the enterprise? Scaling Elasticsearch! Scaling Elasticsearch!
SCALABILITY? Scaling Elasticsearch!
ØScale Up Scaling Elasticsearch!
ØScale Up
ØScale Out Scaling Elasticsearch!
Data Volume
Scale! Ingestion
Search Scaling Elasticsearch!
ØMemory / CPU / IO / Network Issues
ØSub-optimal Queries
ØMapping Explosion
ØUndersized / Oversized shards Is Elasticsearch truly elastic? Scaling Elasticsearch!
Problem#1 : Low Ingestion Throughput
Need to process ‘X’ million / minute additionally.
Learnings
Ø _routing for Bulk requests
Ø index.refresh_interval
Ø number_of_shards, total_shards_per_node
Ø index.translog.durability:async Scaling Elasticsearch!
Problem #2 : Node instabilities due to Mapping Explosion and Sparse Fields Scaling Elasticsearch!
Problem #2 : Node instabilities due to Mapping Explosion and Sparse Fields
Ø Mapping Explosion Scaling Elasticsearch!
Problem #2 : Node instabilities due to Mapping Explosion and Sparse Fields
Ø Mapping Explosion
Ø Sparse fields Scaling Elasticsearch!
Problem #2 : Node instabilities due to Mapping Explosion and Sparse Fields
Learnings:
Ø Keep a watch on mapping. 20% disk space
Ø Split indices. 13% search latency Ø Date based indices works best for logs monitoring! Scaling Elasticsearch!
Problem #3 : Expensive/sub optimal queries – A threat to availability! Scaling Elasticsearch!
Problem #3 : Expensive/sub optimal queries – A threat to availability!
Example Queries: #1 Group By operation on a Id field
{ [{ "aggs": { "key": ”[email protected]", "ids": { "doc_count": 1 "terms": { }, { "key": ”[email protected]", "field": ”emailId" "doc_count": 1 } }, { } "key": ”[email protected]", } "doc_count": 1 } }………] Scaling Elasticsearch!
Problem #3 : Expensive/sub optimal queries – A threat to availability!
Example Queries: #2 Retrieve all documents from Elasticsearch
_search?scroll=1m { { "scroll": "1m", "query": { "scroll_id" : "match_all": {} "cXVlcnlUaGVuRmV0Y2T=" } } } Scaling Elasticsearch!
Problem #3 : Expensive/sub optimal queries – A threat to availability!
Example Queries: #3 Multiple levels of aggregations
{ "aggs": { "date": { "aggs": { "customer": { "aggs": { "device": { "aggs": { "type": { "aggs": { "action":{ ... } } Scaling Elasticsearch!
Problem #3 : Expensive/sub optimal queries – A threat to availability!
Example Queries: #4 Wildcard / Regex Queries { "query": { "wildcard": { "field": "*foo" } } } Scaling Elasticsearch!
Problem #3 : Expensive/sub optimal queries – A threat to availability!
Learnings Query Logger
Gate Keeper
Query Optimizer Scaling Elasticsearch!
Problem #3 : Expensive/sub optimal queries – A threat to availability!
Learnings : Query Logger
Ø Logging in Elasticsearch
Is index.search.slowlog.* suffice?
Ø Distinct Query Logger Scaling Elasticsearch!
Problem #3 : Expensive/sub optimal queries – A threat to availability!
Learnings : Gate Keeper
Ø Intercept and Act.
Ø Proactively prevent incidents in LIVE.
Ø Example: Time range > 48 hours | Nested Aggregations level is >10 Scaling Elasticsearch!
Problem #3 : Expensive/sub optimal queries – A threat to availability!
Learnings : Query Optimizer Scaling Elasticsearch!
Problem #3 : Expensive/sub optimal queries – A threat to availability!
Learnings : Query Optimizer
{"index": ["INDEXNAME-*"]} {"query": {"range": {"timestamp": { "gte":
Problem #3 : Expensive/sub optimal queries – A threat to availability!
Learnings : Query Optimizer
INDEXNAME-2019-06
INDEXNAME-2019-05 {"index": ["INDEXNAME-*"]} {"query": {"range": {"timestamp": { INDEXNAME-2019-04 "gte":
INDEXNAME-YYYY-MM Scaling Elasticsearch!
Problem #3 : Expensive/sub optimal queries – A threat to availability!
Learnings : Query Optimizer
INDEXNAME-2019-06
INDEXNAME-2019-05 {"index": ["INDEXNAME-*"]} {"query": {"range": {"timestamp": { Query INDEXNAME-2019-04 "gte":
INDEXNAME-YYYY-MM Scaling Elasticsearch!
Problem #3 : Expensive/sub optimal queries – A threat to availability!
Learnings : Query Optimizer
Ø Reduce the number of indices(shards)that are being queried.
Ø Reorder the queries for reducing the scan margin *
Search latency by 25% and we observed better availability J Scaling Elasticsearch!
Problem #4: Oversized Shards
When? Learnings
Ø Shards assigned per index is incorrect. Ø Anticipate the future and assign the shards.
Ø Skewness in the values of the routing key. Ø Split Ø Re-index Managing Elasticsearch! Managing Elasticsearch!
Ø Monitoring your clusters
Ø Securing Elasticsearch
Ø Upgrading Clusters How do we police the Police?
When Availability, is the top priority! Monitoring your clusters! Managing Elasticsearch!
COLLECT SHIP ANALYZE ALERT ACT
Collect Node stats from every node
Ø Index field data, docs, merge, refresh
Ø Shard type, state, docs type, queue, rejected, max Ø Thread Pool heap_used, gc collection count and Ø JVM time, buffer pools
Ø Os & Process CPU Load, Memory, File descriptors Managing Elasticsearch!
COLLECT SHIP ANALYZE ALERT ACT
LOGS CLUSTER MONITORING CLUSTER Node 1
Node 1 Node n
Node N Managing Elasticsearch!
COLLECT SHIP ANALYZE ALERT ACT
LOGS CLUSTER MONITORING CLUSTER Node 1
Node 1 Node n
Node N Managing Elasticsearch!
COLLECT SHIP ANALYZE ALERT ACT Securing Elasticsearch!
Why do we need security?
In-House Guard Plugin Security features, free in Elasticsearch?
Ø End to End TLS/SSL Ø Audit Logging
Ø Authentication & Authorization Ø Third party integrations with LDAP/SAML Upgrading Elasticsearch!
It’s tough only when you are in <5.X versions!
Must do checks:
Ø Are queries backward compatible? Approach 1: Parallel cluster with dual ingestion
Ø Mapping changes between versions? Approach 2: Re-index from remote Demo
Q & A