Operating Elasticsearch at Scale! Agenda • Introductions • Search Platform @ PayPal

• Elasticsearch Primer

• Making Elasticsearch truly Elastic

• Managing Elasticsearch @ Scale

• Elasticsearch in action!

• Q & A Introductions

Vikram Aishwarya Ramakrishnan Sankaravadivel

Senior Manager, Senior Software Engineer Technology Platforms & Experience Technology Platforms & Experience

15 years of industry experience with Passionate technologist with deep expertise in specific focus in the areas of Large scale managing Large scale Elasticsearch deployments distributed systems, Enterprise Search & Big Data & Reactive systems

[email protected] [email protected] https://www.linkedin.com/in/vikramakrishnan https://www.linkedin.com/in/aishwaryasankar/ Search Platform @ PayPal

Fully Managed Search offering for PayPal Inc.

1500+ nodes

4+ PB of Data

Other Offerings

Custom Security Plugin 20+ Billion Ingests / day

Custom Monitoring & Alerting 50+ Million Searches / day Elasticsearch Primer

Open Source Distributed

Released in 2010 and built on Supports replication, sharding & top of routing

Schema Free Restful APIs

Supports JSON structured API driven and all actions can be documents, detects types and performed via simple APIs using makes them searchable. JSON over HTTP Elasticsearch Primer

(un)Structured Searches Fuzzy Searches

Full text search, Faceted Autocomplete, ”Did you mean”, Navigation, Pagination Suggestions.

Powerful Aggregations Visualizations

Bucketing, Metric, Pipelining Tools like , to provide deep actionable insights Elasticsearch Primer What’s in it for SREs?

Improved Observability Realtime, Actionable Insights Elasticsearch Primer Things to know before we dig deep…

• Cluster • Field

• Node • Mapping

• Document • Shards

• Index • Routing Elasticsearch Primer Anatomy of an Elasticsearch cluster

*master

P1 P2 P3

R3 R1 R2

Primary Shard

Replica Shard Elasticsearch Primer Anatomy of an Indexing request

3 forward to replica shard(s)

*master

P1 P2 P3

R3 R1 R2

1 2

route request based on Index request id Elasticsearch Primer Dissecting Indexing doc 1 doc 2 Inverted Index Term Document source text The Quick brown fox jumped The two lazy dogs were slower over the lazy dog than the less lazy dog, Rover quick 1 brown 1

html_strip The Quick brown fox jumped over the lazy The two lazy dogs were slower than the fox 1 dog less lazy dog, Rover filter jumped 1 over 1 The brown The two lazy dogs were slower standard Quick fox jumped lazy 1, 2 than the less lazy dog Rover tokenizer over the lazy dog dog 1, 2 two 2 the quick brown were lower case fox jumped the two lazy dogs slower than the less lazy dog rover dogs 2 filter over the lazy dog slower 2 less 2 stop words quick brown fox jumped two lazy dogs slower filter over lazy dog less lazy dog rover rover 2 Elasticsearch Primer Anatomy of a search request. Gather globally 3 sorted result set of doc ids

*master

P1 P2 P3

R3 R1 R2

forward to every 2 primary / replica 1 Search request shard of the index Elasticsearch Primer Dissecting a search request. Inverted Index Search for “The quick Brown Dog” Term Document quick 1 html_strip brown 1 The Quick Brown Dog filter fox 1 jumped 1 quick standard over 1 The quick Brown Dog tokenizer brown lazy 1, 2 dog 1, 2 lower case the quick brown dog dog two 2 filter dogs 2 slower 2 stop words quick brown dog filter less 2 rover 2 Elasticsearch Primer Installation

$ curl -L -O https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch- {VERSION}.tar.gz

$ tar –xvf elasticsearch-{VERSION}.tar.gz

$ cd elasticsearch-{VERSION}/bin

$ ./elasticsearch Elasticsearch Primer Installation

Now, how do we

scale this for the enterprise? Scaling Elasticsearch! Scaling Elasticsearch!

SCALABILITY? Scaling Elasticsearch!

ØScale Up Scaling Elasticsearch!

ØScale Up

ØScale Out Scaling Elasticsearch!

Data Volume

Scale! Ingestion

Search Scaling Elasticsearch!

ØMemory / CPU / IO / Network Issues

ØSub-optimal Queries

ØMapping Explosion

ØUndersized / Oversized shards Is Elasticsearch truly elastic? Scaling Elasticsearch!

Problem#1 : Low Ingestion Throughput

Need to process ‘X’ million / minute additionally.

Learnings

Ø _routing for Bulk requests

Ø index.refresh_interval

Ø number_of_shards, total_shards_per_node

Ø index.translog.durability:async Scaling Elasticsearch!

Problem #2 : Node instabilities due to Mapping Explosion and Sparse Fields Scaling Elasticsearch!

Problem #2 : Node instabilities due to Mapping Explosion and Sparse Fields

Ø Mapping Explosion Scaling Elasticsearch!

Problem #2 : Node instabilities due to Mapping Explosion and Sparse Fields

Ø Mapping Explosion

Ø Sparse fields Scaling Elasticsearch!

Problem #2 : Node instabilities due to Mapping Explosion and Sparse Fields

Learnings:

Ø Keep a watch on mapping. 20% disk space

Ø Split indices. 13% search latency Ø Date based indices works best for logs monitoring! Scaling Elasticsearch!

Problem #3 : Expensive/sub optimal queries – A threat to availability! Scaling Elasticsearch!

Problem #3 : Expensive/sub optimal queries – A threat to availability!

Example Queries: #1 Group By operation on a Id field

{ [{ "aggs": { "key": ”[email protected]", "ids": { "doc_count": 1 "terms": { }, { "key": ”[email protected]", "field": ”emailId" "doc_count": 1 } }, { } "key": ”[email protected]", } "doc_count": 1 } }………] Scaling Elasticsearch!

Problem #3 : Expensive/sub optimal queries – A threat to availability!

Example Queries: #2 Retrieve all documents from Elasticsearch

_search?scroll=1m { { "scroll": "1m", "query": { "scroll_id" : "match_all": {} "cXVlcnlUaGVuRmV0Y2T=" } } } Scaling Elasticsearch!

Problem #3 : Expensive/sub optimal queries – A threat to availability!

Example Queries: #3 Multiple levels of aggregations

{ "aggs": { "date": { "aggs": { "customer": { "aggs": { "device": { "aggs": { "type": { "aggs": { "action":{ ... } } Scaling Elasticsearch!

Problem #3 : Expensive/sub optimal queries – A threat to availability!

Example Queries: #4 Wildcard / Regex Queries { "query": { "wildcard": { "field": "*foo" } } } Scaling Elasticsearch!

Problem #3 : Expensive/sub optimal queries – A threat to availability!

Learnings Query Logger

Gate Keeper

Query Optimizer Scaling Elasticsearch!

Problem #3 : Expensive/sub optimal queries – A threat to availability!

Learnings : Query Logger

Ø Logging in Elasticsearch

Is index.search.slowlog.* suffice?

Ø Distinct Query Logger Scaling Elasticsearch!

Problem #3 : Expensive/sub optimal queries – A threat to availability!

Learnings : Gate Keeper

Ø Intercept and Act.

Ø Proactively prevent incidents in LIVE.

Ø Example: Time range > 48 hours | Nested Aggregations level is >10 Scaling Elasticsearch!

Problem #3 : Expensive/sub optimal queries – A threat to availability!

Learnings : Query Optimizer Scaling Elasticsearch!

Problem #3 : Expensive/sub optimal queries – A threat to availability!

Learnings : Query Optimizer

{"index": ["INDEXNAME-*"]} {"query": {"range": {"timestamp": { "gte": , "lte": } }.. .}... } Scaling Elasticsearch!

Problem #3 : Expensive/sub optimal queries – A threat to availability!

Learnings : Query Optimizer

INDEXNAME-2019-06

INDEXNAME-2019-05 {"index": ["INDEXNAME-*"]} {"query": {"range": {"timestamp": { INDEXNAME-2019-04 "gte": , "lte": } }.. .}...

INDEXNAME-YYYY-MM Scaling Elasticsearch!

Problem #3 : Expensive/sub optimal queries – A threat to availability!

Learnings : Query Optimizer

INDEXNAME-2019-06

INDEXNAME-2019-05 {"index": ["INDEXNAME-*"]} {"query": {"range": {"timestamp": { Query INDEXNAME-2019-04 "gte": , Optimizer "lte": } }.. .}...

INDEXNAME-YYYY-MM Scaling Elasticsearch!

Problem #3 : Expensive/sub optimal queries – A threat to availability!

Learnings : Query Optimizer

Ø Reduce the number of indices(shards)that are being queried.

Ø Reorder the queries for reducing the scan margin *

Search latency by 25% and we observed better availability J Scaling Elasticsearch!

Problem #4: Oversized Shards

When? Learnings

Ø Shards assigned per index is incorrect. Ø Anticipate the future and assign the shards.

Ø Skewness in the values of the routing key. Ø Split Ø Re-index Managing Elasticsearch! Managing Elasticsearch!

Ø Monitoring your clusters

Ø Securing Elasticsearch

Ø Upgrading Clusters How do we police the Police?

When Availability, is the top priority! Monitoring your clusters! Managing Elasticsearch!

COLLECT SHIP ANALYZE ALERT ACT

Collect Node stats from every node

Ø Index field data, docs, merge, refresh

Ø Shard type, state, docs type, queue, rejected, max Ø Thread Pool heap_used, gc collection count and Ø JVM time, buffer pools

Ø Os & Process CPU Load, Memory, File descriptors Managing Elasticsearch!

COLLECT SHIP ANALYZE ALERT ACT

LOGS CLUSTER MONITORING CLUSTER Node 1

Node 1 Node n

Node N Managing Elasticsearch!

COLLECT SHIP ANALYZE ALERT ACT

LOGS CLUSTER MONITORING CLUSTER Node 1

Node 1 Node n

Node N Managing Elasticsearch!

COLLECT SHIP ANALYZE ALERT ACT Securing Elasticsearch!

Why do we need security?

In-House Guard Plugin Security features, free in Elasticsearch?

Ø End to End TLS/SSL Ø Audit Logging

Ø Authentication & Authorization Ø Third party integrations with LDAP/SAML Upgrading Elasticsearch!

It’s tough only when you are in <5.X versions!

Must do checks:

Ø Are queries backward compatible? Approach 1: Parallel cluster with dual ingestion

Ø Mapping changes between versions? Approach 2: Re-index from remote Demo

Q & A