Operating Elasticsearch at Scale! Agenda • Introductions • Search Platform @ Paypal

Operating Elasticsearch at Scale! Agenda • Introductions • Search Platform @ PayPal • Elasticsearch Primer • Making Elasticsearch truly Elastic • Managing Elasticsearch @ Scale • Elasticsearch in action! • Q & A Introductions Vikram Aishwarya Ramakrishnan Sankaravadivel Senior Manager, Senior Software Engineer Technology Platforms & Experience Technology Platforms & Experience 15 years of industry experience with Passionate technologist with deep expertise in specific focus in the areas of Large scale managing Large scale Elasticsearch deployments distributed systems, Enterprise Search & Big Data & Reactive systems [email protected] [email protected] https://www.linkedin.com/in/vikramakrishnan https://www.linkedin.com/in/aishwaryasankar/ Search Platform @ PayPal Fully Managed Search offering for PayPal Inc. 1500+ nodes 4+ PB of Data Other Offerings Custom Security Plugin 20+ Billion Ingests / day Custom Monitoring & Alerting 50+ Million Searches / day Elasticsearch Primer Open Source Distributed Released in 2010 and built on Supports replication, sharding & top of Apache Lucene routing Schema Free Restful APIs Supports JSON structured API driven and all actions can be documents, detects types and performed via simple APIs using makes them searchable. JSON over HTTP Elasticsearch Primer (un)Structured Searches Fuzzy Searches Full text search, Faceted Autocomplete, ”Did you mean”, Navigation, Pagination Suggestions. Powerful Aggregations Visualizations Bucketing, Metric, Pipelining Tools like Kibana, Grafana to provide deep actionable insights Elasticsearch Primer What’s in it for SREs? Improved Observability Realtime, Actionable Insights Elasticsearch Primer Things to know before we dig deep… • Cluster • Field • Node • Mapping • Document • Shards • Index • Routing Elasticsearch Primer Anatomy of an Elasticsearch cluster *master P1 P2 P3 R3 R1 R2 Primary Shard Replica Shard Elasticsearch Primer Anatomy of an Indexing request 3 forward to replica shard(s) *master P1 P2 P3 R3 R1 R2 1 2 route request based on Index request id Elasticsearch Primer Dissecting Indexing doc 1 doc 2 Inverted Index Term Document source text The Quick brown fox jumped The two lazy dogs were slower over the lazy dog than the less lazy dog, Rover quick 1 brown 1 html_strip The Quick brown fox jumped over the lazy The two lazy dogs were slower than the fox 1 dog less lazy dog, Rover filter jumped 1 over 1 The brown The two lazy dogs were slower standard Quick fox jumped lazy 1, 2 than the less lazy dog Rover tokenizer over the lazy dog dog 1, 2 two 2 the quick brown were lower case fox jumped the two lazy dogs slower than the less lazy dog rover dogs 2 filter over the lazy dog slower 2 less 2 stop words quick brown fox jumped two lazy dogs slower filter over lazy dog less lazy dog rover rover 2 Elasticsearch Primer Anatomy of a search request. Gather globally 3 sorted result set of doc ids *master P1 P2 P3 R3 R1 R2 forward to every 2 primary / replica 1 Search request shard of the index Elasticsearch Primer Dissecting a search request. Inverted Index Search for “The quick Brown Dog” Term Document quick 1 html_strip brown 1 The Quick Brown Dog filter fox 1 jumped 1 quick standard over 1 The quick Brown Dog tokenizer brown lazy 1, 2 dog 1, 2 lower case the quick brown dog dog two 2 filter dogs 2 slower 2 stop words quick brown dog filter less 2 rover 2 Elasticsearch Primer Installation $ curl -L -O https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch- {VERSION}.tar.gz $ tar –xvf elasticsearch-{VERSION}.tar.gz $ cd elasticsearch-{VERSION}/bin $ ./elasticsearch Elasticsearch Primer Installation Now, how do we scale this for the enterprise? Scaling Elasticsearch! Scaling Elasticsearch! SCALABILITY? Scaling Elasticsearch! ØScale Up Scaling Elasticsearch! ØScale Up ØScale Out Scaling Elasticsearch! Data Volume Scale! Ingestion Search Scaling Elasticsearch! ØMemory / CPU / IO / Network Issues ØSub-optimal Queries ØMapping Explosion ØUndersized / Oversized shards Is Elasticsearch truly elastic? Scaling Elasticsearch! Problem#1 : Low Ingestion Throughput Need to process ‘X’ million / minute additionally. Learnings Ø _routing for Bulk requests Ø index.refresh_interval Ø number_of_shards, total_shards_per_node Ø index.translog.durability:async Scaling Elasticsearch! Problem #2 : Node instabilities due to Mapping Explosion and Sparse Fields Scaling Elasticsearch! Problem #2 : Node instabilities due to Mapping Explosion and Sparse Fields Ø Mapping Explosion Scaling Elasticsearch! Problem #2 : Node instabilities due to Mapping Explosion and Sparse Fields Ø Mapping Explosion Ø Sparse fields Scaling Elasticsearch! Problem #2 : Node instabilities due to Mapping Explosion and Sparse Fields Learnings: Ø Keep a watch on mapping. 20% disk space Ø Split indices. 13% search latency Ø Date based indices works best for logs monitoring! Scaling Elasticsearch! Problem #3 : Expensive/sub optimal queries – A threat to availability! Scaling Elasticsearch! Problem #3 : Expensive/sub optimal queries – A threat to availability! Example Queries: #1 Group By operation on a Id field { [{ "aggs": { "key": ”[email protected]", "ids": { "doc_count": 1 "terms": { }, { "key": ”[email protected]", "field": ”emailId" "doc_count": 1 } }, { } "key": ”[email protected]", } "doc_count": 1 } }………] Scaling Elasticsearch! Problem #3 : Expensive/sub optimal queries – A threat to availability! Example Queries: #2 Retrieve all documents from Elasticsearch _search?scroll=1m { { "scroll": "1m", "query": { "scroll_id" : "match_all": {} "cXVlcnlUaGVuRmV0Y2T=" } } } Scaling Elasticsearch! Problem #3 : Expensive/sub optimal queries – A threat to availability! Example Queries: #3 Multiple levels of aggregations { "aggs": { "date": { "aggs": { "customer": { "aggs": { "device": { "aggs": { "type": { "aggs": { "action":{ ... } } Scaling Elasticsearch! Problem #3 : Expensive/sub optimal queries – A threat to availability! Example Queries: #4 Wildcard / Regex Queries { "query": { "wildcard": { "field": "*foo" } } } Scaling Elasticsearch! Problem #3 : Expensive/sub optimal queries – A threat to availability! Learnings Query Logger Gate Keeper Query Optimizer Scaling Elasticsearch! Problem #3 : Expensive/sub optimal queries – A threat to availability! Learnings : Query Logger Ø Logging in Elasticsearch Is index.search.slowlog.* suffice? Ø Distinct Query Logger Scaling Elasticsearch! Problem #3 : Expensive/sub optimal queries – A threat to availability! Learnings : Gate Keeper Ø Intercept and Act. Ø Proactively prevent incidents in LIVE. Ø Example: Time range > 48 hours | Nested Aggregations level is >10 Scaling Elasticsearch! Problem #3 : Expensive/sub optimal queries – A threat to availability! Learnings : Query Optimizer Scaling Elasticsearch! Problem #3 : Expensive/sub optimal queries – A threat to availability! Learnings : Query Optimizer {"index": ["INDEXNAME-*"]} {"query": {"range": {"timestamp": { "gte": <timestamp>, "lte": <timestamp>} }.. .}... } Scaling Elasticsearch! Problem #3 : Expensive/sub optimal queries – A threat to availability! Learnings : Query Optimizer INDEXNAME-2019-06 INDEXNAME-2019-05 {"index": ["INDEXNAME-*"]} {"query": {"range": {"timestamp": { INDEXNAME-2019-04 "gte": <timestamp>, "lte": <timestamp>} }.. .}... INDEXNAME-YYYY-MM Scaling Elasticsearch! Problem #3 : Expensive/sub optimal queries – A threat to availability! Learnings : Query Optimizer INDEXNAME-2019-06 INDEXNAME-2019-05 {"index": ["INDEXNAME-*"]} {"query": {"range": {"timestamp": { Query INDEXNAME-2019-04 "gte": <timestamp>, Optimizer "lte": <timestamp>} }.. .}... INDEXNAME-YYYY-MM Scaling Elasticsearch! Problem #3 : Expensive/sub optimal queries – A threat to availability! Learnings : Query Optimizer Ø Reduce the number of indices(shards)that are being queried. Ø Reorder the queries for reducing the scan margin * Search latency by 25% and we observed better availability J Scaling Elasticsearch! Problem #4: Oversized Shards When? Learnings Ø Shards assigned per index is incorrect. Ø Anticipate the future and assign the shards. Ø Skewness in the values of the routing key. Ø Split Ø Re-index Managing Elasticsearch! Managing Elasticsearch! Ø Monitoring your clusters Ø Securing Elasticsearch Ø Upgrading Clusters How do we police the Police? When Availability, is the top priority! Monitoring your clusters! Managing Elasticsearch! COLLECT SHIP ANALYZE ALERT ACT Collect Node stats from every node Ø Index field data, docs, merge, refresh Ø Shard type, state, docs type, queue, rejected, max Ø Thread Pool heap_used, gc collection count and Ø JVM time, buffer pools Ø Os & Process CPU Load, Memory, File descriptors Managing Elasticsearch! COLLECT SHIP ANALYZE ALERT ACT LOGS CLUSTER MONITORING CLUSTER Node 1 Node 1 Node n Node N Managing Elasticsearch! COLLECT SHIP ANALYZE ALERT ACT LOGS CLUSTER MONITORING CLUSTER Node 1 Node 1 Node n Node N Managing Elasticsearch! COLLECT SHIP ANALYZE ALERT ACT Securing Elasticsearch! Why do we need security? In-House Guard Plugin Security features, free in Elasticsearch? Ø End to End TLS/SSL Ø Audit Logging Ø Authentication & Authorization Ø Third party integrations with LDAP/SAML Upgrading Elasticsearch! It’s tough only when you are in <5.X versions! Must do checks: Ø Are queries backward compatible? Approach 1: Parallel cluster with dual ingestion Ø Mapping changes between versions? Approach 2: Re-index from remote Demo Q & A.

Load more