A Comparison of Data Stores for the Online Feature Store Component
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2021

A Comparison of Data Stores for the Online Feature Store Component
A comparison between NDB and Aerospike

ALEXANDER VOLMINGER

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Civilingenjör Datateknik
Date: April 25, 2021
Supervisor: Jim Dowling
Examiner: Stefano Markidis
School of Electrical Engineering and Computer Science
Host company: Spotify AB
Swedish title: En jämförelse av datalagringssystem för användning som Online Feature Store
Swedish subtitle: En jämförelse mellan NDB och Aerospike

© 2021 Alexander Volminger

Abstract

This thesis investigated which Data Stores are suitable for implementation as an Online Feature Store. This is a component in the Machine Learning infrastructure that must handle low-latency Reads at high throughput with high availability. The thesis evaluated the Data Stores with real feature workloads from Spotify's Search system. First, an investigation was made to find suitable storage systems. NDB and Aerospike were selected because of their state-of-the-art performance together with their suitable functionality. These were then implemented as the Online Feature Store by batch Reading the feature data through a Java program and by using Google Dataflow to load data into the Data Stores. For 1 client, NDB achieved about 35% higher batch Read throughput with around 30% lower P99 latency than Aerospike. For 8 clients, NDB achieved 20% higher batch Read throughput, with a P99 latency that varied more than Aerospike's; in the 8-node setup, however, NDB achieved on average 35% lower latency.
Aerospike achieved 50% faster Write speeds when writing feature data to the Data Stores. Both Data Stores' Read performance was found to suffer when Writes ran concurrently with Reads, with the P99 Read latency increasing around 30% for both Data Stores. It was concluded that both Data Stores would work as an Online Feature Store, but NDB achieved better Read performance, which is one of the most important factors for this type of Feature Store.

Keywords
Feature Stores, Data Stores, NDB, Aerospike, NoSQL, Online Feature Stores

Sammanfattning (Swedish abstract, translated)

This thesis investigated which data stores are suitable for implementation as an Online Feature Store. This is a component in the machine learning infrastructure that must handle fast reads at high throughput with high availability. The thesis studied this by evaluating data stores with real feature data from Spotify's search system. An investigation was first made to find promising data stores for this task. NDB and Aerospike were chosen for their top performance and suitable functionality. These were then implemented as an Online Feature Store by batch-reading the feature data with a Java program and by using Google Dataflow to load the feature data into the data stores. For 1 client, NDB achieved around 35% higher feature-data throughput than Aerospike for batch reads, with roughly 30% lower P99 latency. For 8 clients, NDB achieved around 20% higher throughput with a more variable P99 latency; in the 8-node clusters, however, NDB had on average 35% lower latency. Aerospike was 50% faster at writing the feature data to the data store. Both systems suffered worse read performance when writes occurred at the same time, with the P99 read latency then rising around 30% for both data stores.
In summary, both of the examined data stores worked as an Online Feature Store, but NDB had better read performance, which is one of the most important factors for this type of Feature Store.

Nyckelord (Swedish keywords)
Feature Stores, Datalagringsystem, NDB, Aerospike, NoSQL, Online Feature Stores

Acknowledgments

I would like to thank my supervisor at KTH, Prof. Jim Dowling, for overseeing the thesis work, helping me structure the benchmark, and sharing his extensive knowledge about Feature Stores. I would also like to thank all the amazing people at Spotify whom I have been speaking with throughout the thesis. Special thanks to my supervisor at Spotify, Daniel Lazarovski, for the technical guidance and general support, and also to Anders Nyman for his support throughout the thesis, as well as the rest of the Search team at Spotify. Lastly, I want to thank Mikael Ronström at Logical Clocks for all the help with NDB and his thoughts on the thesis in general.

Stockholm, March 2021
Alexander Volminger

Contents

1 Introduction
  1.1 Problem Description
  1.2 Purpose
  1.3 Research Question
  1.4 Research Methodology
  1.5 Delimitations
  1.6 Structure of the thesis
2 Background
  2.1 Feature Data Stores
  2.2 Data Models
    2.2.1 RDBMS
    2.2.2 NoSQL
  2.3 Distributed Systems
    2.3.1 Skewed Data and Hot Spots
    2.3.2 Partitioning/Sharding
  2.4 Client-Server Data Stores
    2.4.1 NDB Cluster
    2.4.2 Aerospike
    2.4.3 Redis
    2.4.4 Dynamo
    2.4.5 Riak
    2.4.6 BigTable
    2.4.7 HBase
    2.4.8 Cassandra
    2.4.9 Netflix's Hollow
  2.5 Previous Benchmarks
    2.5.1 Redis, HBase & Cassandra
    2.5.2 PostgreSQL, Redis & Aerospike
    2.5.3 YCSB
    2.5.4 Jepsen
  2.6 Choice of Data Stores
3 Experimental Procedure
  3.1 Data
    3.1.1 Feature Requests
    3.1.2 Feature Data
  3.2 Experimental design
    3.2.1 Workloads
    3.2.2 Test Environment
    3.2.3 Data Store Cluster Setups
    3.2.4 Measurements
4 Implementation
  4.1 Read Benchmark
    4.1.1 NDB
    4.1.2 Aerospike
  4.2 Write Program
    4.2.1 NDB
    4.2.2 Aerospike
  4.3 Cluster Configurations
    4.3.1 NDB
    4.3.2 Aerospike
5 Results and Discussion
  5.1 Read Benchmark
    5.1.1 One Client
    5.1.2 Several Clients
  5.2 Write Program
    5.2.1 Memory Usage
  5.3 Write & Read Benchmark
6 Conclusions and Future Work
  6.1 Conclusions
  6.2 Sustainability and Ethics
  6.3 Future work
References
A Benchmark Tables
  A.1 Read Benchmark
    A.1.1 NDB
    A.1.2 Aerospike
  A.2 Read & Write Benchmark
    A.2.1 NDB
    A.2.2 Aerospike
B Cluster Configurations
  B.1 NDB Configuration
  B.2 Aerospike Configuration
C Availability Zones
  C.1 NDB
  C.2 Aerospike
D Hardware Utilization
  D.1 Read Benchmark
    D.1.1 1 Client, 6 Nodes & 1 Thread
    D.1.2 1 Client, 6 Nodes & 2 Threads
    D.1.3 1 Client, 6 Nodes & 4 Threads
    D.1.4 1 Client, 6 Nodes & 8 Threads
    D.1.5 1 Client, 6 Nodes & 16 Threads
    D.1.6 1 Client, 6 Nodes & 32 Threads
    D.1.7 2 Clients, 6 Nodes & 16 Threads
    D.1.8 2 Clients, 6 Nodes & 32 Threads
    D.1.9 4 Clients, 6 Nodes & 16 Threads
    D.1.10 8 Clients, 6 Nodes & 16 Threads
    D.1.11 1 Client, 8 Nodes & 1 Thread
    D.1.12 1 Client, 8 Nodes & 2 Threads
    D.1.13 1 Client, 8 Nodes & 4 Threads
    D.1.14 1 Client, 8 Nodes & 8 Threads
    D.1.15 1 Client, 8 Nodes & 16 Threads
    D.1.16 1 Client, 8 Nodes & 32 Threads
    D.1.17 2 Clients, 8 Nodes & 16 Threads
    D.1.18 4 Clients, 8 Nodes & 16 Threads
    D.1.19 8 Clients, 8 Nodes & 16 Threads
  D.2 Write Program
    D.2.1 6 Nodes & 128 Workers
    D.2.2 6 Nodes & 256 Workers
    D.2.3 6 Nodes & 512 Workers
    D.2.4 8 Nodes & 256 Workers
    D.2.5 8 Nodes & 512 Workers
  D.3 Write & Read Benchmark
    D.3.1 6 Nodes & 256 Workers
    D.3.2 6 Nodes