Building Data Science Teams the Skills, Tools, and Perspectives Behind Great Data Science Groups
Total Page:16
File Type:pdf, Size:1020Kb
Building Data Science Teams The Skills, Tools, and Perspectives Behind Great Data Science Groups DJ Patil THE SIMPLEST WAY TO BRING MapReduce and SQL Optimized in One Database Appliance www.asterdata.com Everyone knows data is the new black. The Aster MapReduce Analytics Portfolio enables customers to quickly make use of their data for actionable insights, analysis and product innovation. - Jonathan Goldman, Director of Analytics, Teradata Aster (and former Principal Data Scientist at LinkedIn) Learn More www.Asterdata.com/MapReduce Building Data Science Teams DJ Patil Beijing • Cambridge • Farnham • Köln • Sebastopol • Tokyo Building Data Science Teams by DJ Patil Copyright © 2011 O’Reilly Media. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://my.safaribookson line.com). For more information, contact our corporate/institutional sales depart- ment: (800) 998-9938 or [email protected]. Editor: Mike Loukides Printing History: Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps. While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages re- sulting from the use of the information contained herein. ISBN: 978-1-449-31623-5 1316117207 Table of Contents Building Data Science Teams ............................................ 1 Being Data Driven 2 The Roles of a Data Scientist 5 Decision sciences and business intelligence 5 Product and marketing analytics 7 Fraud, abuse, risk and security 8 Data services and operations 9 Data engineering and infrastructure 9 Organizational and reporting alignment 10 What Makes a Data Scientist? 11 Hiring and talent 14 Building the LinkedIn Data Science Team 16 Reinvention 18 About the Author 18 iii Building Data Science Teams Starting in 2008, Jeff Hammerbacher (@hackingdata) and I sat down to share our experiences building the data and analytics groups at Facebook and Link- edIn. In many ways, that meeting was the start of data science as a distinct professional specialization (see “What Makes a Data Scien- tist?” on page 11 for the story on how we came up with the title “Data Sci- entist”). Since then, data science has taken on a life of its own. The hugely positive response to “What Is Data Science?,” a great introduction to the meaning of data science in today’s world, showed that we were at the start of a movement. There are now regular meetups, well-established startups, and even college curricula focusing on data science. As McKinsey’s big data re- search report and LinkedIn’s data indicates indicates (see Figure 1), data sci- ence talent is in high demand. This increase in the demand for data scientists has been driven by the success of the major Internet companies. Google, Facebook, LinkedIn, and Amazon have all made their marks by using data creatively: not just warehousing data, but turning it into something of value. Whether that value is a search result, a targeted advertisement, or a list of possible acquaintances, data science is pro- ducing products that people want and value. And it’s not just Internet com- panies: Walmart doesn’t produce “data products” as such, but they’re well known for using data to optimize every aspect of their retail operations. Given how important data science has grown, it’s important to think about what data scientists add to an organization, how they fit in, and how to hire and build effective data science teams. 1 Figure 1. The rise in demand for data science talent Being Data Driven Everyone wants to build a data-driven organization. It’s a popular phrase and there are plenty of books, journals, and technical blogs on the topic. But what does it really mean to be “data driven”? My definition is: A data-driven organization acquires, processes, and leverages data in a timely fashion to create efficiencies, iterate on and develop new products, and navigate the competitive landscape. There are many ways to assess whether an organization is data driven. Some like to talk about how much data they generate. Others like to talk about the sophistication of data they use, or the process of internalizing data. I prefer to start by highlighting organizations that use data effectively. 2 | Building Data Science Teams Ecommerce companies have a long history of using data to benefit their or- ganizations. Any good salesman instinctively knows how to suggest further purchases to a customer. With “People who viewed this item also viewed ...,” Amazon moved this technique online. This simple implementation of collab- orative filtering is one of their most used features; it is a powerful mechanism for serendipity outside of traditional search. This feature has become so pop- ular that there are now variants such as “People who viewed this item bought ... .” If a customer isn’t quite satisfied with the product he’s looking at, suggest something similar that might be more to his taste. The value to a master retailer is obvious: close the deal if at all possible, and instead of a single purchase, get customers to make two or more purchases by suggesting things they’re likely to want. Amazon revolutionized electronic commerce by bring- ing these techniques online. Data products are at the heart of social networks. After all, what is a social network if not a huge dataset of users with connections to each other, forming a graph? Perhaps the most important product for a social network is something to help users connect with others. Any new user needs to find friends, ac- quaintances, or contacts. It’s not a good user experience to force users to search for their friends, which is often a surprisingly difficult task. At LinkedIn, we invented People You May Know (PYMK) to solve this problem. It’s easy for software to predict that if James knows Mary, and Mary knows John Smith, then James may know John Smith. (Well, conceptually easy. Finding connec- tions in graphs gets tough quickly as the endpoints get farther apart. But solv- ing that problem is what data scientists are for.) But imagine searching for John Smith by name on a network with hundreds of millions of users! Although PYMK was novel at the time, it has become a critical part of every social network’s offering. Facebook not only supports its own version of PYMK, they monitor the time it takes for users to acquire friends. Using so- phisticated tracking and analysis technologies, they have identified the time and number of connections it takes to get a user to long-term engagement. If you connect with a few friends, or add friends slowly, you won’t stick around for long. By studying the activity levels that lead to commitment, they have designed the site to decrease the time it takes for new users to connect with the critical number of friends. Netflix does something similar in their online movie business. When you sign up, they strongly encourage you to add to the queue of movies you intend to watch. Their data team has discovered that once you add more than than a certain number of movies, the probability you will be a long-term customer is significantly higher. With this data, Netflix can construct, test, and monitor product flows to maximize the number of new users who exceed the magic number and become long-term customers. They’ve built a highly optimized Being Data Driven | 3 registration/trial service that leverages this information to engage the user quickly and efficiently. Netflix, LinkedIn, and Facebook aren’t alone in using customer data to en- courage long-term engagement — Zynga isn’t just about games. Zynga con- stantly monitors who their users are and what they are doing, generating an incredible amount of data in the process. By analyzing how people interact with a game over time, they have identified tipping points that lead to a suc- cessful game. They know how the probability that users will become long-term changes based on the number of interactions they have with others, the num- ber of buildings they build in the first n days, the number of mobsters they kill in the first m hours, etc. They have figured out the keys to the engagement challenge and have built their product to encourage users to reach those goals. Through continued testing and monitoring, they refined their understanding of these key metrics. Google and Amazon pioneered the use of A/B testing to optimize the layout of a web page. For much of the web’s history, web designers worked by intu- ition and instinct. There’s nothing wrong with that, but if you make a change to a page, you owe it to yourself to ensure that the change is effective. Do you sell more product? How long does it take for users to find the result they’re looking for? How many users give up and go to another site? These questions can only be answered by experimenting, collecting the data, and doing the analysis, all of which are second nature to a data-driven company. Yahoo has made many important contributions to data science. After observ- ing Google’s use of MapReduce to analyze huge datasets, they realized that they needed similar tools for their own business.