Scaling Uber with Node.js
Amos Barreto (@amos_barreto)

Uber is everyone’s private driver.

REQUEST! Tap to select location.
RIDE! Sit back and relax, tell your driver your destination.
RATE! Help us maintain a quality service by rating your experience.

Your Drivers

UBER QUALIFIED: Uber only partners with drivers who have a keen eye for customer service and a passion for the trade.
RIDER RATED: Tell us what you think. Your feedback helps us work with drivers to constantly improve the Uber experience.
LICENSED & INSURED: From insurance to background checks, every driver meets or beats local regulations.

LOGISTICS

#OMGUBERICECREAM

UberChopper

#OMGUBERCHOPPER

#UBERVALENTINES

#ICANHASUBERKITTENS

Trip State Machine (Simplified)

Request → Dispatch → Accept → Arrive → Begin → End

Trip State Machine (Extended)

Request → Dispatch (1) → Accept → Arrive → Begin → End
(Reject or Expire on Dispatch (1) leads to Dispatch (2))
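A minimal sketch of how a trip state machine like this could be expressed in Node.js (the state and event names here are illustrative assumptions, not the actual dispatch code):

// Illustrative trip state machine; names are assumptions, not Uber's code.
var transitions = {
  requested:  { dispatch: 'dispatched' },
  dispatched: { accept: 'accepted', reject: 'dispatched', expire: 'dispatched' }, // re-dispatch to the next driver
  accepted:   { arrive: 'arrived' },
  arrived:    { begin: 'on_trip' },
  on_trip:    { end: 'completed' }
};

function advance(trip, event) {
  var next = (transitions[trip.state] || {})[event];
  if (!next) {
    throw new Error('invalid transition: ' + trip.state + ' -> ' + event);
  }
  trip.state = next;
  return trip;
}

// Usage:
// var trip = { id: 42, state: 'requested' };
// advance(trip, 'dispatch');  // state becomes 'dispatched'
// advance(trip, 'accept');    // state becomes 'accepted'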

OUR STORY

Version 1

• PHP dispatch (PHP + cron)
• Outsourced to remote contractors in the Midwest
• Half the code in Spanish
• Flat-file storage
• Lifetime: 6-9 months

“I read an article on Hacker News about a new framework called Node.js” (Jason Roberts)

Tradeoffs

• Learning curve
• Drivers
• Scalability
• Documentation
• Performance
• Monitoring
• Library ecosystem
• Production operations

Version 2

• Developed in-house
• Node.js application
• Prototyped on 0.2
• Launched in production with 0.4
• MongoDB datastore
• Lifetime: 9 months

“I really don’t see dispatch changing much in the next three years”

Expect the unexpected

Version 3

• Mongo did not scale with the volume of GPS logs (global write lock)
• Swapped Mongo for Redis and flat files

Decoupling storage of different types of data

Version 3 (continued)

• Node.js Mongo client failed to recognize replica set topology changes

Be wary of immature client libraries

[Chart: commits to client modules over time]

Version 3 (continued)

[Diagram: city deployments: SF, NYC, SEA, CHI, BOS, PAR]

Focus on driving business value

Capacity planning, forecasting, and load testing are your friends

Measure everything

Version 4

• Nickname: The Grid
• Multi-process dispatch
• Peer assignment
• Redis is now considered the source of truth
• Use Lua interpreter for atomic operations
• Fan out to all city peers to find nearby cars

-- Atomic client/driver assignment script, executed inside Redis.
-- clientHash, driverHash, clientAssignmentHash, countKey, clientToken and
-- driverPeerId are assumed to be bound earlier in the script (e.g. from KEYS/ARGV).
local clientStatus = redis.call('hget', clientHash, 'status')
local driverStatus = redis.call('hget', driverHash, 'status')
if clientStatus == 'WaitingForPickup' and driverStatus == 'Open' then
    local clientPeerId = redis.call('hget', clientAssignmentHash, clientToken)
    redis.call('hset', driverHash, 'status', 'DispatchPending')
    redis.call('hset', clientAssignmentHash, clientToken, driverPeerId)
    if clientPeerId then
        redis.call('zincrby', countKey, -1, clientPeerId)
    end
    redis.call('zincrby', countKey, 1, driverPeerId)
    return redis.status_reply('SUCCESS')
else
    return redis.error_reply('ERROR - clientStatus: ' .. tostring(clientStatus) ..
        ', driverStatus: ' .. tostring(driverStatus))
end

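A minimal sketch of how a script like this might be invoked from a Node.js dispatch process with EVAL (the key layout, example values, and node_redis call below are assumptions for illustration only):

// Hypothetical invocation of the assignment script from Node.js.
// Assumes the Lua script reads its keys from KEYS[] and its values from ARGV[].
var fs = require('fs');
var redis = require('redis');

var client = redis.createClient();
var script = fs.readFileSync('assign.lua', 'utf8');

var clientToken = 'rider-abc123';   // example values only
var driverId = 'driver-42';
var driverPeerId = 'sf1';

// EVAL script numkeys key [key ...] arg [arg ...]
client.eval(script, 4,
  'client:' + clientToken,   // KEYS[1]: client hash
  'driver:' + driverId,      // KEYS[2]: driver hash
  'assignments',             // KEYS[3]: client assignment hash
  'peer:counts',             // KEYS[4]: per-peer counter sorted set
  clientToken, driverPeerId, // ARGV[1], ARGV[2]
  function (err, reply) {
    if (err) return console.error('assignment failed:', err);
    console.log('assignment reply:', reply); // 'SUCCESS'
  });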

Version 4 (continued)

[Diagram: per-city dispatch processes: SF1-SF3, NY1-NY4, SEA1-SEA4, CHI1-CHI3, BOS1-BOS2, PAR1]

Version 5

[Diagram: SF dispatch spread across many processes]

Version 5

[Chart: max # of location queries vs. # of nodes]

Version 5

[Diagram: ncar services alongside per-city dispatch nodes]

Break out services as needed

Understand V8 to optimize Node.js applications
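One concrete example of V8-aware coding (an illustrative sketch, not taken from the talk): keep object shapes consistent so hidden classes and inline caches stay monomorphic.

// Illustrative only. V8 assigns a hidden class per object shape; creating every
// driver record with the same fields, in the same order, keeps property access
// monomorphic and inline-cache friendly.
function Driver(id, lat, lng) {
  this.id = id;
  this.lat = lat;
  this.lng = lng;
  this.status = 'Open';
}

// Adding fields ad hoc (driver.vehicle = ... on only some objects) forces V8 to
// juggle multiple hidden classes for the same code paths and can trigger
// deoptimization.
var d = new Driver('driver-42', 37.77, -122.42);

// V8's decisions can be observed with flags such as:
//   node --trace-opt --trace-deopt app.js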


[Diagram: per-city dispatch processes]

Don’t take vacation ;)
Don’t live in Chicago!

Stateless applications…
No single points of failure…
Replicated data stores…
Dynamic application topology…

15 Version 6

[Diagram: city processes distributed across hosts, no longer grouped by city]

[Diagram: Grid Manager processes]

Version 7: haproxy

Do the obvious
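A minimal haproxy sketch for balancing traffic across dispatch nodes (hostnames, ports, and balancing policy here are illustrative assumptions):

# haproxy.cfg (illustrative)
defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend dispatch_in
    bind *:8080
    default_backend dispatch_nodes

backend dispatch_nodes
    balance roundrobin
    server node1 10.0.0.11:8080 check
    server node2 10.0.0.12:8080 check
    server node3 10.0.0.13:8080 check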


Pros

• every application is horizontally scalable

• flexible, partially dynamic topology

• failure recovery manual in the worst case
• supports primary business case very well
• conservative estimates: 1-2 years of runway

Never be satisfied

Cons

• what happens when a city outgrows the capacity of a single Redis instance?
• who wants to wake up in the middle of the night for server crashes?
• what about future business use cases?

#WORLDCLASS

World Class

• city-agnostic dispatch application
• “stateless” applications
• scale to 100x current load
• flexible data model

Every now and then it’s okay to bend the rules

Realtime Analytics

So why did we stick with Node.js?

• JavaScript is easy to learn
• Simple interface with thorough documentation
• Lends itself to fast prototyping
• Asynchronous, nimble
• Avoids concurrency challenges
• Increasingly mature module ecosystem

How to win with Node.js?

• measure everything, particularly response times and event loop lag (see the sketch below)

• learn to take heap dumps to debug memory issues

• strace, perf, and flame graphs are necessary tools for improving performance
• small, reusable components to reduce duplication
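A minimal sketch of one way to measure event loop lag in Node.js (the interval, threshold, and reporting shown here are illustrative assumptions):

// Illustrative event-loop lag probe: schedule a timer at a known interval and
// record how late it actually fires. Sustained lag means the loop is blocked
// by CPU-heavy work or is overloaded.
var CHECK_INTERVAL_MS = 500;

function monitorEventLoopLag(report) {
  var last = Date.now();
  setInterval(function () {
    var now = Date.now();
    var lag = now - last - CHECK_INTERVAL_MS; // delay beyond the expected interval
    last = now;
    report(Math.max(0, lag));
  }, CHECK_INTERVAL_MS);
}

// Usage: log lag locally; in production this would feed the metrics system.
monitorEventLoopLag(function (lagMs) {
  if (lagMs > 50) {
    console.warn('event loop lag: ' + lagMs + 'ms');
  }
});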

The Human Factor

Thank you. Questions?