Scaling Uber with Node.js
Amos Barreto (@amos_barreto)
Uber is everyone’s private driver.
REQUEST!
Tap to select location

RIDE!
Sit back and relax, tell your driver your destination

RATE!
Help us maintain a quality service by rating your experience

YOUR DRIVERS

UBER QUALIFIED
Uber only partners with drivers who have a keen eye for customer service and a passion for the trade.

RIDER RATED
Tell us what you think. Your feedback helps us work with drivers to constantly improve the Uber experience.

LICENSED & INSURED
From insurance to background checks, every driver meets or beats local regulations.

LOGISTICS
#OMGUBERICECREAM
UberChopper
#OMGUBERCHOPPER
#UBERVALENTINES
#ICANHASUBERKITTENS

Trip State Machine (Simplified)
Request → Dispatch → Accept → Arrive → Begin → End

Trip State Machine (Extended)
Request → Dispatch (1) → Expire / Reject → Dispatch (2) → Accept → Arrive → Begin → End
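
The extended trip state machine can be sketched as a transition table. This is a minimal illustration, not Uber's dispatch code: the state names come from the slide, while the exact edges (for example, that both Reject and Expire lead back to a second Dispatch) and the `Trip` class are assumptions made for the example.

```javascript
'use strict';

// Allowed transitions for the extended trip state machine.
// State names are from the slide; the edges are an assumption.
const TRANSITIONS = {
  Request: ['Dispatch'],
  Dispatch: ['Accept', 'Reject', 'Expire'],
  Reject: ['Dispatch'], // re-dispatch to another driver
  Expire: ['Dispatch'], // driver did not answer in time
  Accept: ['Arrive'],
  Arrive: ['Begin'],
  Begin: ['End'],
  End: [],
};

// Hypothetical helper that tracks one trip and rejects invalid moves.
class Trip {
  constructor() {
    this.state = 'Request';
    this.history = ['Request'];
  }

  // Move to `next` if the edge exists in TRANSITIONS; otherwise throw.
  transition(next) {
    const allowed = TRANSITIONS[this.state] || [];
    if (!allowed.includes(next)) {
      throw new Error(`Invalid transition: ${this.state} -> ${next}`);
    }
    this.state = next;
    this.history.push(next);
    return this;
  }
}
```

Keeping the edges in one table makes an illegal jump (say, Request straight to Begin) fail loudly instead of silently corrupting trip state.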

OUR STORY

Version 1
• PHP dispatch
• Outsourced to remote contractors in the Midwest
• Half the code in Spanish
• Flat file
• Lifetime: 6-9 months
[Diagram: PHP, Cron]

“I read an article on HackerNews about a new framework called Node.js”
Jason Roberts

Tradeoffs
• Learning curve
• Database drivers
• Scalability
• Documentation
• Performance
• Monitoring
• Library ecosystem
• Production operations

Version 2
• Lifetime: 9 months
• Developed in house
• Node.js application
• Prototyped on 0.2
• Launched in production with 0.4
• MongoDB datastore
[Diagram: Node.js]

“I really don’t see dispatch changing much in the next three years”

Expect the unexpected

Version 3
• Mongo did not scale with volume of GPS logs (global write lock)
• Swapped Mongo for Redis and flat files
[Diagram: CN, CN, CN; cities SF, NYC, SEA, CHI]

Decoupling storage of different types of data

Version 3 (continued)
• Node.js Mongo client failed to recognize replica set topology changes
[Diagram: CN, CN, CN; cities SF, NYC, SEA, CHI]

Be wary of immature client libraries

[Chart: Commits to client modules over time]

Version 3 (continued)
[Diagram: cities SF, NYC, SEA, CHI, BOS, PAR]

Focus on driving business value

Capacity planning, forecasting, and load testing are your friends

Measure everything
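
One metric the talk singles out later is event loop lag. A rough way to sample it, assuming nothing beyond Node.js timers, is to schedule a repeating timer and record how late each tick fires. The names `measureEventLoopLag` and `onSample` are hypothetical, chosen for this sketch.

```javascript
'use strict';

// Samples event loop lag: how much later than scheduled a timer fires.
// A busy or blocked event loop delays the callback, and that delay is the lag.
function measureEventLoopLag(intervalMs, onSample) {
  let expected = Date.now() + intervalMs;
  const timer = setInterval(() => {
    const now = Date.now();
    const lagMs = Math.max(0, now - expected);
    expected = now + intervalMs;
    onSample(lagMs);
  }, intervalMs);
  timer.unref(); // monitoring alone should not keep the process alive
  return () => clearInterval(timer); // call the returned function to stop sampling
}
```

In a service, `onSample` would typically forward the value to whatever metrics system is in use. Newer Node.js releases also ship `perf_hooks.monitorEventLoopDelay()`, which gives a histogram of the same quantity.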

Version 4
• Nickname: The Grid
• Multi-process dispatch
• Peer assignment
• Redis is now considered the source of truth
• Use Lua interpreter for atomic operations
• Fan out to all city peers to find nearby cars
[Diagram: CN, CN, CN; city groups SF NYC SEA CHI, SF NYC CHI, SF CHI]

    -- Excerpt: clientHash, driverHash, clientAssignmentHash, clientToken,
    -- driverPeerId and countKey are assumed to be bound earlier in the
    -- script (for example from KEYS/ARGV); the slide does not show that part.
    local clientStatus = redis.call('hget', clientHash, 'status')
    local driverStatus = redis.call('hget', driverHash, 'status')
    if clientStatus == 'WaitingForPickup' and driverStatus == 'Open' then
      -- Remember which peer (if any) previously held this client.
      local clientPeerId = redis.call('hget', clientAssignmentHash, clientToken)
      redis.call('hset', driverHash, 'status', 'DispatchPending')
      redis.call('hset', clientAssignmentHash, clientToken, driverPeerId)
      -- Move the assignment count from the old peer to the driver's peer.
      if clientPeerId then
        redis.call('zincrby', countKey, -1, clientPeerId)
      end
      redis.call('zincrby', countKey, 1, driverPeerId)
      return redis.status_reply('SUCCESS')
    else
      return redis.error_reply('ERROR - clientStatus: '..tostring(clientStatus)..', driverStatus: '..tostring(driverStatus))
    end
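
The "fan out to all city peers" step can be illustrated with a small sketch. The slides do not show the peer API, so everything here is an assumption: `peers` is a hypothetical array of objects with an async `findNearby(location, limit)` method, and cars are plain objects with `lat` and `lng` fields.

```javascript
'use strict';

// Squared planar distance; adequate for ranking cars within a single city.
function distanceSq(a, b) {
  const dLat = a.lat - b.lat;
  const dLng = a.lng - b.lng;
  return dLat * dLat + dLng * dLng;
}

// Ask every peer for its nearby cars in parallel, then merge the answers.
// A peer that fails is skipped so one bad process does not fail the request.
async function findNearestCars(peers, location, limit) {
  const results = await Promise.allSettled(
    peers.map((peer) => peer.findNearby(location, limit))
  );
  const cars = [];
  for (const result of results) {
    if (result.status === 'fulfilled') cars.push(...result.value);
  }
  cars.sort((a, b) => distanceSq(a, location) - distanceSq(b, location));
  return cars.slice(0, limit);
}
```

Each peer only needs to return its own best `limit` candidates; the caller merges them and keeps the overall nearest. The cost is one query per peer, so the number of location queries grows with the number of peers in a city.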

Version 4 (continued)
[Diagram: peers SF1 SF2 SF3, NY1 NY2 NY3 NY4, SEA1 SEA2 SEA3 SEA4, CHI1 CHI2 CHI3, BOS1 BOS2, PAR1]

Version 5
[Diagram: SF peers only]

Version 5 (continued)
[Plot: max # of loc queries vs. # of nodes]

Version 5 (continued)
[Diagram: CN, CN, CN and ncar services; city groups SF NYC SEA CHI, SF NYC CHI, SF CHI]

Break out services as needed

Understand V8 to optimize Node.js applications
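
A low-cost starting point for looking at V8's behaviour is the `v8` module that ships with Node.js. The sketch below reports heap usage from `v8.getHeapStatistics()`; the `heapSummary` helper and its field names are hypothetical, picked for this example.

```javascript
'use strict';
const v8 = require('v8');

// Returns a small summary of V8 heap usage, in megabytes (one decimal place).
function heapSummary() {
  const stats = v8.getHeapStatistics();
  const toMb = (bytes) => Math.round((bytes / 1024 / 1024) * 10) / 10;
  return {
    usedMb: toMb(stats.used_heap_size),
    totalMb: toMb(stats.total_heap_size),
    limitMb: toMb(stats.heap_size_limit),
  };
}
```

Logging this periodically shows whether `usedMb` keeps climbing toward `limitMb`, which is a common sign of a leak. For a full heap dump, recent Node.js versions provide `v8.writeHeapSnapshot()`.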

[Diagram: peers SF1 SF2 SF3, NY1 NY2 NY3 NY4, SEA1 SEA2 SEA3 SEA4, CHI1 CHI2 CHI3, BOS1 BOS2, PAR1]
Don’t take vacation ;)
Don’t live in Chicago!

Stateless applications…
No single points of failure…
Replicated data stores…
Dynamic application topology…

Version 6
[Diagram: peers SF1 SEA3 NY2 PAR1 CHI1 NY3 BOS1 BOS2 NY1 SF3 SEA1 NY4 SEA4 CHI2 CHI3 SEA2 SF2]
[Diagram: three Grid Managers]

Version 7
[Diagram: haproxy]

Do the obvious

Pros
• every application is horizontally scalable
• flexible, partially dynamic topology
• failure recovery manual in the worst case
• supports primary business case very well
• conservative estimate: 1-2 years of runway

Never be satisfied

Cons
• what happens when a city outscales the capacity of a single Redis instance?
• who wants to wake up in the middle of the night for server crashes?
• what about future business use cases?

#WORLDCLASS

World Class
• city agnostic dispatch application
• “stateless” applications
• scale to 100x current load
• flexible data model

Every now and then it’s okay to bend the rules

Realtime Analytics

So why did we stick with Node.js?
• JavaScript is easy to learn
• Simple interface with thorough documentation
• Lends itself to fast prototyping
• Asynchronous, nimble
• Avoid concurrency challenges
• Increasingly mature module ecosystem

How to win with Node.js?
• measure everything - particularly response times and event loop lag
• learn to take heap dumps to debug memory issues
• strace, perf, flame graphs are necessary tools for improving performance
• small, reusable components to reduce duplication

The Human Factor

Thank you. Questions?