
Scaling Big Data Mining Infrastructure: The Twitter Experience

Jimmy Lin and Dmitriy Ryaboy
Twitter, Inc.
@lintool @squarecog

ABSTRACT

The analytics platform at Twitter has experienced tremendous growth over the past few years in terms of size, complexity, number of users, and variety of use cases. In this paper, we discuss the evolution of our infrastructure and the development of capabilities for data mining on “big data”. One important lesson is that successful big data mining in practice is about much more than what most academics would consider data mining: life “in the trenches” is occupied by much preparatory work that precedes the application of data mining algorithms and followed by substantial effort to turn preliminary models into robust solutions. In this context, we discuss two topics: First, schemas play an important role in helping data scientists understand petabyte-scale data stores, but they’re insufficient to provide an overall “big picture” of the data available to generate insights. Second, we observe that a major challenge in building data analytics platforms stems from the heterogeneity of the various components that must be integrated together into production workflows—we refer to this as “plumbing”. This paper has two goals: For practitioners, we hope to share our experiences to flatten bumps in the road for those who come after us. For academic researchers, we hope to provide a broader context for data mining in production environments, pointing out opportunities for future work.

1. INTRODUCTION

The analytics platform at Twitter has experienced tremendous growth over the past few years in terms of size, complexity, number of users, and variety of use cases. In 2010, there were approximately 100 employees in the entire company and the analytics team consisted of four people—the only people to use our 30-node Hadoop cluster on a daily basis. Today, the company has over one thousand employees. There are thousands of Hadoop nodes across multiple datacenters. Each day, around one hundred terabytes of raw data are ingested into our main Hadoop data warehouse; engineers and data scientists from dozens of teams run tens of thousands of Hadoop jobs collectively. These jobs accomplish everything from data cleaning to simple aggregations and report generation to building data-powered products to training machine-learned models for promoted products, spam detection, follower recommendation, and much, much more. We’ve come a long way, and in this paper, we share experiences in scaling Twitter’s analytics infrastructure over the past few years. Our hope is to contribute to a set of emerging “best practices” for building big data analytics platforms for data mining from a case study perspective.

A little about our backgrounds: The first author is an Associate Professor at the University of Maryland who spent an extended sabbatical from 2010 to 2012 at Twitter, primarily working on relevance algorithms and analytics infrastructure. The second author joined Twitter in early 2010 and was first a tech lead, then the engineering manager of the analytics infrastructure team. Together, we hope to provide a blend of the academic and industrial perspectives—a bit of ivory tower musings mixed with “in the trenches” practical advice. Although this paper describes the path we have taken at Twitter and is only one case study, we believe our recommendations align with industry consensus on how to approach a particular set of big data challenges.

The biggest lesson we wish to share with the community is that successful big data mining is about much more than what most academics would consider data mining. A significant amount of tooling and infrastructure is required to operationalize vague strategic directives into concrete, solvable problems with clearly-defined metrics of success. A data scientist spends a significant amount of effort performing exploratory data analysis to even figure out “what’s there”; this includes data cleaning and data munging not directly related to the problem at hand. The data infrastructure engineers work to make sure that productionized workflows operate smoothly, efficiently, and robustly, reporting errors and alerting responsible parties as necessary. The “core” of what academic researchers think of as data mining—translating domain insight into features and training models for various tasks—is a comparatively small, albeit critical, part of the overall insight-generation lifecycle.

In this context, we discuss two topics: First, with a certain amount of bemused ennui, we explain that schemas play an important role in helping data scientists understand petabyte-scale data stores, but that schemas alone are insufficient to provide an overall “big picture” of the data available and how they can be mined for insights. We’ve frequently observed that data scientists get stuck before they even begin—it’s surprisingly difficult in a large production environment to understand what data exist, how they are structured, and how they relate to each other. Our discussion is couched in the context of user behavior logs, which comprise the bulk of our data. We share a number of examples, based on our experience, of what doesn’t work and how to fix it.

Second, we observe that a major challenge in building data analytics platforms comes from the heterogeneity of the various components that must be integrated together into production workflows. Much complexity arises from impedance mismatches at the interface between different components. A production system must run like clockwork, splitting out aggregated reports every hour, updating data products every three hours, generating new classifier models daily, etc. Getting a bunch of heterogeneous components to operate as a synchronized and coordinated workflow is challenging. This is what we fondly call “plumbing”—the not-so-sexy pieces of software (and duct tape and chewing gum) that ensure everything runs together smoothly is part of the “black magic” of converting data to insights.

This paper has two goals: For practitioners, we hope to share our experiences to flatten bumps in the road for those who come after us. Scaling big data infrastructure is a complex endeavor, and we point out potential pitfalls along the way, with possible solutions. For academic researchers, we hope to provide a broader context for data mining in production environments—to help academics understand how their research is adapted and applied to solve real-world problems at scale. In addition, we identify opportunities for future work that could contribute to streamlining big data mining infrastructure.

2. HOW WE GOT HERE

We begin by situating the Twitter analytics platform in the broader context of “big data” for commercial enterprises. The “fourth paradigm” of how big data is reshaping the physical and natural sciences [23] (e.g., high-energy physics and bioinformatics) is beyond the scope of this paper.

The simple idea that an organization should retain data that result from carrying out its mission and exploit those data to generate insights that benefit the organization is of course not new. Commonly known as business intelligence, among other monikers, its origins date back several decades. In this sense, the “big data” hype is simply a rebranding of what many organizations have been doing all along.

Examined more closely, however, there are three major trends that distinguish insight-generation activities today from, say, the 1990s. First, we have seen a tremendous explosion in the sheer amount of data—orders of magnitude increase. In the past, enterprises have typically focused on gathering data that are obviously valuable, such as business objects representing customers, items in catalogs, purchases, contracts, etc. Today, in addition to such data, organizations also gather behavioral data from users. In the online setting, these include web pages that users visit, links that they click on, etc. The advent of social media and user-generated content, and the resulting interest in encouraging such interactions, further contributes to the amount of data that is being accumulated.

Second, and more recently, we have seen increasing sophistication in the types of analyses that organizations perform on their vast data stores. Traditionally, most of the information needs fall under what is known as online analytical processing (OLAP). Common tasks include ETL (extract, transform, load) from multiple data sources, creating joined views, followed by filtering, aggregation, or cube materialization. Statisticians might use the phrase descriptive statistics to describe this type of analysis. These outputs might feed report generators, front-end dashboards, and other visualization tools to support common “roll up” and “drill down” operations on multi-dimensional data. Today, however, a new breed of “data scientists” want to do far more: they are interested in predictive analytics. These include, for example, using machine learning techniques to train predictive models of user behavior—whether a piece of content is spam, whether two users should become “friends”, the likelihood that a user will complete a purchase or be interested in a related product, etc. Other desired capabilities include mining large (often unstructured) data for statistical regularities, using a wide range of techniques from simple (e.g., k-means clustering) to complex (e.g., latent Dirichlet allocation or other Bayesian approaches). These techniques might surface “latent” facts about the users—such as their interest and expertise—that they do not explicitly express.

To be fair, some types of predictive analytics have a long history—for example, credit card fraud detection and market basket analysis. However, we believe there are several qualitative differences. The application of data mining on behavioral data changes the scale at which algorithms need to operate, and the generally weaker signals present in such data require more sophisticated algorithms to produce insights. Furthermore, expectations have grown—what were once cutting-edge techniques practiced only by a few innovative organizations are now routine, and perhaps even necessary for survival in today’s competitive environment. Thus, capabilities that may have previously been considered luxuries are now essential.

Finally, open-source software is playing an increasingly important role in today’s ecosystem. A decade ago, no credible open-source, enterprise-grade, distributed data analytics platform capable of handling large data volumes existed. Today, the Hadoop open-source implementation of MapReduce [11] lies at the center of a de facto platform for large-scale data analytics, surrounded by complementary systems such as HBase, ZooKeeper, Pig, Hive, and many others. The importance of Hadoop is validated not only by adoption in countless startups, but also the endorsement of industry heavyweights such as IBM, Microsoft, Oracle, and EMC. Of course, Hadoop is not a panacea and is not an adequate solution for many problems, but a strong case can be made for Hadoop supplementing (and in some cases, replacing) existing data management systems.

Analytics at Twitter lies at the intersection of these three developments. The social media aspect is of course obvious. Like Facebook [20], LinkedIn, and many other companies, Twitter has eschewed, to the extent practical, costly proprietary systems in favor of building around the Hadoop open-source platform. Finally, like other organizations, analytics at Twitter range widely in sophistication, from simple aggregations to training machine-learned models.

3. THE BIG DATA MINING CYCLE

In production environments, effective big data mining at scale doesn’t begin or end with what academics would consider data mining. Most of the research literature (e.g., KDD papers) focuses on better algorithms, statistical models, or machine learning techniques—usually starting with a (relatively) well-defined problem, clear metrics for success, and existing data. The criteria for publication typically involve improvements in some figure of merit (hopefully statistically significant): the new proposed method is more accurate, runs faster, requires less memory, is more robust to noise, etc.

In contrast, the problems we grapple with on a daily basis are far more “messy”. Let us illustrate with a realistic but hypothetical scenario. We typically begin with a poorly formulated problem, often driven from outside engineering and aligned with strategic objectives of the organization, e.g., “we need to accelerate user growth”. Data scientists are tasked with executing against the goal—and to operationalize the vague directive into a concrete, solvable problem requires exploratory data analysis. Consider the following sample questions:

• When do users typically log in and out?
• How frequently?
• What features of the product do they use?
• Do different groups of users behave differently?
• Do these activities correlate with engagement?
• What network features correlate with activity?
• How do activity profiles of users change over time?

Before beginning exploratory data analysis, the data scientist needs to know what data are available and how they are organized. This fact may seem obvious, but is surprisingly difficult in practice. To understand why, we must take a slight detour to discuss service architectures.

3.1 Service Architectures and Logging

Web- and internet-based products today are typically designed as a composition of services: rendering a web page might require the coordination of dozens or even hundreds of component services that provide different content. For example, Twitter is powered by many loosely-coordinated services, for adding and removing followers, for storing and fetching tweets, for providing user recommendations, for accessing search functionalities, etc. Most services are not aware of the implementation details of other services and all services communicate through well-defined and stable interfaces. Typically, in this design, each service performs its own logging—these are records of requests, along with related information such as who (i.e., the user or another service) made the request, when the request occurred, what the result was, how long it took, any warnings or exceptions that were thrown, etc. The log data are independent of each other by design, the side effect of which is the creation of a large number of isolated data stores which need to be composed to reconstruct a complete picture of what happened during processing of even a single user request.

Since a single user action may involve many services, a data scientist wishing to analyze user behavior must first identify all the disparate data sources involved, understand their contents, and then join potentially incompatible data to reconstruct what users were doing. This is often easier said than done! In our Hadoop data warehouse, logs are all deposited in the /logs/ directory, with a sub-directory for each log “category”. There are dozens of log categories, many of which are named after projects whose function is well-known but whose internal project names are not intuitive. Services are normally developed and operated by different teams, which may adopt different conventions for storing and organizing log data. Frequently, engineers who build the services have little interaction with the data scientists who analyze the data—so there is no guarantee that fields needed for data analysis are actually present (see Section 4.5). Furthermore, services change over time in a number of ways: functionalities evolve; two services merge into one; a single service is replaced by two; a service becomes obsolete and is replaced by another; a service is used in ways for which it was not originally intended. These are all reflected in idiosyncrasies present in individual service logs—these comprise the “institutional” knowledge that data scientists acquire over time, but create steep learning curves for new data scientists.

The net effect is that data scientists expend a large amount of effort to understand the data available to them, before they even begin any meaningful analysis. In the next section, we discuss some solutions, but these challenges are far from solved. There is an increasing trend to structure teams that are better integrated, e.g., involving data scientists in the design of a service to make sure the right data are captured, or even having data scientists embedded within individual product teams, but such organizations are the exception, not the rule. A fundamental tradeoff in organizational solutions is development speed vs. ease of analysis: incorporating analytics considerations when designing and building services slows down development, but failure to do so results in unsustainable downstream complexity. A good balance between these competing concerns is difficult to strike.

3.2 Exploratory Data Analysis

Exploratory data analysis always reveals data quality issues. In our recollection, we have never encountered a large, real-world dataset that was directly usable without data cleaning. Sometimes there are outright bugs, e.g., inconsistently formatted messages or values that shouldn’t exist. There are inevitably corrupt records—e.g., a partially written message caused by a premature closing of a file handle. Cleaning data often involves sanity checking: a common technique for a service that logs aggregate counts as well as component counts is to make sure the sum of component counts matches the aggregate counts. Another is to compute various frequencies from the raw data to make sure the numbers seem “reasonable”. This is surprisingly difficult—identifying values that seem suspiciously high or suspiciously low requires experience, since the aggregate behavior of millions of users is frequently counter-intuitive. We have encountered many instances in which we thought that there must have been data collection errors, and only after careful verification of the data generation and import pipeline were we confident that, indeed, users did really behave in some unexpected manner.

Sanity checking frequently reveals abrupt shifts in the characteristics of the data. For example, in a single day, the prevalence of a particular type of message might decrease or increase by orders of magnitude. This is frequently an artifact of some system change—for example, a new feature just having been rolled out to the public or a service endpoint that has been deprecated in favor of a new system. Twitter has gotten sufficiently complex that it is no longer possible for any single individual to keep track of every product rollout, API change, bug fix release, etc., and thus abrupt changes in datasets appear mysterious until the underlying causes are understood, which is often a time-consuming activity requiring cross-team cooperation.
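To make the first kind of sanity check described in this section concrete, the following is a minimal sketch (not Twitter's actual tooling) of verifying that per-component counts sum to the logged aggregate count, within a small relative tolerance. The category and client names are hypothetical.

import java.util.Map;

public class CountSanityCheck {

    /**
     * Returns true if the sum of component counts matches the logged
     * aggregate count within a small relative tolerance.
     */
    public static boolean componentsMatchAggregate(
            long aggregateCount, Map<String, Long> componentCounts, double tolerance) {
        long sum = 0;
        for (long count : componentCounts.values()) {
            sum += count;
        }
        if (aggregateCount == 0) {
            return sum == 0;
        }
        double relativeError = Math.abs(sum - aggregateCount) / (double) aggregateCount;
        return relativeError <= tolerance;
    }

    public static void main(String[] args) {
        // Hypothetical hourly counts for a single log category, broken down by client.
        Map<String, Long> byClient = Map.of("web", 120_000L, "iphone", 80_000L, "android", 75_000L);
        long aggregate = 275_500L;
        if (!componentsMatchAggregate(aggregate, byClient, 0.01)) {
            System.err.println("Sanity check failed: component counts do not sum to aggregate");
        }
    }
}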

Even when the logs are “correct”, there are usually a host of outliers caused by “non-typical” use cases, most often attributable to non-human actors in a human domain. For example, in a query log, there will be robots responsible for tens of thousands of queries a day, robots who issue queries thousands of terms long, etc. Without discarding these outliers, any subsequent analysis will produce skewed results. Although over time, a data scientist gains experience in data cleaning and the process becomes more routine, there are frequently surprises and new situations. We have not yet reached the point where data cleaning can be performed automatically.

3.3 Data Mining

Typically, after exploratory data analysis, the data scientist is able to more precisely formulate the problem, cast it within the context of a data mining task, and define metrics for success. For example, one way to increase active user growth is to increase retention of existing users (in addition to adding new users): it might be useful to build a model that predicts future user activity based on present activity. This could be more precisely formulated as a classification problem: assuming we have a definition of an “active user”, given features of the user right now, let us try to predict if the user will be active n weeks from now. The metrics of success are now fairly straightforward to define: accuracy, precision–recall curves, etc.

With a precisely-formulated problem in hand, the data scientist can now gather training and test data. In this case, it is fairly obvious what to do: we could use data from n weeks ago to predict if the user is active today. Now comes the part that would be familiar to all data mining researchers and practitioners: feature extraction and machine learning. Applying domain knowledge, the data scientist would distill potentially tens of terabytes of log data into much more compact sparse feature vectors, and from those, train a classification model. At Twitter, this would typically be accomplished via Pig scripts [42, 15] that are compiled into physical plans executed as Hadoop jobs. For a more detailed discussion of our large-scale machine learning infrastructure, we refer the reader to a recent paper [34].

The data scientist would now iteratively refine the classifier using standard practices: cross-validation, feature selection, tuning of model parameters, etc. After an appropriate level of effectiveness has been achieved, the classifier might be evaluated in a prospective manner—using data from today and verifying prediction accuracy n weeks from now. This ensures, for example, that we have not inadvertently given the classifier future information.

At this point, let us suppose that we have achieved a high level of classifier effectiveness by some appropriate metric, on both cross-validated retrospective data and on prospective data in a simulated deployment setting. For the academic researcher, the problem can be considered “solved”: time to write up the experiments in a KDD paper!

3.4 Production and Other Considerations

However, from the Twitter perspective, there is much left to do: the classifier has not yet been productionized. It is not sufficient to solve the problem once—we must set up recurring workflows that feed new data to the classifier and record its output, serving as input to other downstream processes. This involves mechanisms for scheduling (e.g., running classification jobs every hour) and data dependency management (e.g., making sure that upstream processes have generated necessary data before invoking the classifiers). Of course, the workflow must be robust and continuously monitored, e.g., automatic restarts for handling simple faults, but alerting on-call engineers after a number of failed retries. Twitter has developed tools and processes for these myriad issues, and handling most scenarios is quite routine today, but building the production support infrastructure required substantial engineering effort.

Moving a classifier into production also requires retraining the underlying model on a periodic basis and some mechanism for validation. Over time, we need to manage two challenges: classifier drift and adversarial interactions. User behaviors change, sometimes as a result of the very system we’re deploying (e.g., user recommendations alter users’ linking behavior). Features that were previously discriminative may decay in their effectiveness. The underlying class distribution (in the case of classification tasks) also changes, thus negating parameters tuned for a specific prior. In addition to classifier drift that stems from “natural” behavioral shifts, we must also contend with adversarial interactions, where third parties actively try to “game” the system—spam is the most prominent example, but we see adversarial behavior elsewhere as well [48]. A data scientist is responsible for making sure that a solution “keeps working”.

After a product has launched, data scientists incrementally improve the underlying algorithms based on feedback from user behavior. Improvements range from simple parameter tuning to experimenting with different algorithms. Most production algorithms are actually ensembles that combine diverse techniques. At Twitter, as in many organizations today, refinements are assessed via A/B testing. How to properly run such experiments is as much an art as it is a science, and for the interested reader we recommend a few papers by Kohavi et al. [29, 28]. From the big data infrastructure perspective, this places additional demands on tools to support A/B testing—e.g., identifying user buckets and keeping track of user-to-treatment mappings, as well as “threading” the user token through all analytics processes so that we can break down results by each condition.

Finally, the successful deployment of a machine-learned solution or any data product leads to the start of a new problem. In our running example of retention classification, being able to predict user activity itself doesn’t actually affect user growth (the original goal)—we must act on the classifier output to implement interventions, and then measure the effectiveness of those. Thus, one big data mining problem feeds into the next, beginning the cycle anew.
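Returning to the retention classifier formulated in Section 3.3, the sketch below shows one way labeled training examples might be assembled: a feature snapshot taken n weeks ago is paired with a label derived from whether the user is active today. This is an illustration only (the actual pipelines described in the text are Pig scripts compiled into Hadoop jobs); the record types and the definition of "active" here are hypothetical.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class RetentionTrainingData {

    // Hypothetical sparse feature vector captured n weeks ago for one user.
    record Snapshot(long userId, Map<String, Double> features) {}

    // A labeled example: features from the past, label observed today.
    record LabeledExample(Map<String, Double> features, boolean activeNow) {}

    /**
     * Pairs each historical snapshot with a present-day activity label.
     * "Active today" is approximated here as membership in a set of user ids
     * seen in today's logs; a real definition would be more nuanced.
     */
    public static List<LabeledExample> buildExamples(
            List<Snapshot> snapshotsFromNWeeksAgo, Set<Long> activeUserIdsToday) {
        List<LabeledExample> examples = new ArrayList<>();
        for (Snapshot s : snapshotsFromNWeeksAgo) {
            boolean label = activeUserIdsToday.contains(s.userId());
            examples.add(new LabeledExample(s.features(), label));
        }
        return examples;
    }
}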

In this production context, we identify two distinct but complementary roles: on the one hand, there are infrastructure engineers who build the tools and focus on operations; then, there are the data scientists who use the tools to mine insights. Although in a smaller organization the same person may perform both types of activities, we recognize them as distinct roles. As an organization grows, it makes sense to separate out these two activities. At Twitter, analytics infrastructure and data science are two distinct, but tightly-integrated groups.

To better illustrate this division, consider the retention classifier example. The data scientist would be responsible for the development of the model, including exploratory data analysis, feature generation, model construction, and validation. However, she would be using tools built by the infrastructure engineers (e.g., load and store functions for various data types), operating on data maintained by infrastructure engineers (responsible for the log import pipeline). The data scientist is responsible for the initial production deployment of the model (e.g., hooking into a common set of APIs to specify data dependencies and setting up recurring workflows) but thereafter the infrastructure engineers handle routine operations—they are responsible for “keeping the trains on time”. However, data scientists are responsible for maintaining the quality of their products after launch, such as handling classifier drift and adversarial interactions. Overall, we have found this division between infrastructure engineering and data science to be a useful organizing principle in scaling our data analytics platform.

3.5 Why Should Academics Care?

The takeaway message from this discussion is that activities an academic researcher would consider data mining (e.g., feature extraction and machine learning), while important, are only a small part of the complete data mining cycle. There is much that precedes in formulating the problem, data cleaning, and exploratory data analysis; there is much that follows, in terms of productionizing the solution and ongoing maintenance. Any big data mining infrastructure must support all activities in this broader lifecycle, not just the execution of data mining algorithms.

Why should academics care? We argue that understanding the big data mining cycle helps situate research in the context of solving real-world problems and informs future directions. For example, work on data cleaning and exploratory data analysis is under-represented in the academic literature relative to its real-world importance. We suggest that the most impactful research is not incrementally better data mining algorithms, but techniques to help data scientists “grok” the data. Furthermore, production considerations might help researchers select between competing methods. A lesson from the above discussion might be that complex models with many parameters are difficult to maintain, due to drift and adversarial interactions. Another might be that the need to frequently update models favors those that can be incrementally refined, as opposed to those that must be trained from scratch each time. Different models and algorithms manifest tradeoffs in accuracy, robustness, complexity, speed, etc., and understanding the deployment context is important to choosing the right balance point in the design space.

We conclude this section by relating our experience to those of others. While the literature on these practical, “in-the-trenches” aspects of data science is scant, we are heartened to see a growing awareness of these important issues. One of the earliest is Jeff Hammerbacher’s essay on the construction of Facebook’s data science platform [20]. DJ Patil has written extensively on his experiences in building data-driven products at LinkedIn [46, 45]. Recently, Kandel et al. [27] reported on an interview-based study of data scientists across a variety of sectors, including healthcare, retail, marketing, and finance. Their findings are consonant with our own experiences, although specific details differ. Together, we hope that these voices provide a more complete account of what big data mining is like in real-world production environments.

4. SCHEMAS AND BEYOND

Typically, what we’ve seen in the past is that an enterprise’s most obviously valuable data, such as business objects representing customers, items in catalogs, purchases, contracts, etc. are represented in carefully designed schemas. The quality, usefulness, and provenance of these records are carefully monitored. However, higher-volume data such as operational logs of the multitude of services that power a complex operation are often ad hoc. A well-structured log should capture information about each request, such as who made the request, when the request occurred, what the result was, how long it took, any warnings or exceptions that were thrown. Frequently, however, they are merely reflections of what individual developers decided would be helpful to throw into their log.info(...) calls. Since these logs contain records of user behavior, they provide valuable raw ingredients for data mining. From the perspective of infrastructure engineering, we are faced with the challenge of supporting rapid development of individual services, with little to no oversight from a central authority, while enabling analysis of diverse log sources by data scientists across the organization.

Some organizations attempt to solve this problem by enforcing simple, generic schemas for all the data that the developers “explicitly want to track”1 which, from what we can tell, amounts to little more than key–value or subject–verb–object abstractions (see Section 4.3). We believe that insisting on such an approach reduces the expressivity of queries and limits the organization’s ability to answer unanticipated questions.

1 http://code.zynga.com/2011/06/deciding-how-to-store-billions-of-rows-per-day/

Our advice distills experiences in balancing concerns about introducing friction into the development process and the desire to enable easy access to a great variety of data sources across the enterprise. We seek to maximize the opportunities for insightful data analysis, data mining, and the construction of data products, while minimizing the number of people in the critical communication paths.

Below, we critique some common techniques for capturing data for subsequent analysis and data mining. We do so mostly from a scaling perspective, with full understanding that for smaller organizations, some arguments may not be applicable. A small team can be quite successful using JSON; a small Rails shop that doesn’t see much traffic might be fine directly logging into the database. Note that by no means will our suggestions solve all the challenges discussed in the previous section, but we believe that our recommendations will address some of the common pain points in building scalable big data infrastructure.

4.1 Don’t Use MySQL as a Log

Logging directly into a database is a very common pattern, arguably a natural one when the main application already has a dependency on an RDBMS. It is a good design if the database can keep up: no new systems to worry about, a plethora of tools to query the data, and the logs are always up to date. The design is tantalizingly simple: just create a flexible schema such as, for example, that shown in Table 1.

create table `my_audit_log` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `created_at` datetime,
  `user_id` int(11),
  `action` varchar(256),
  ...
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Table 1: Using MySQL for logging is a bad idea.
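For concreteness, this is a minimal sketch of the application-layer pattern that goes with Table 1 and that this section argues against: one synchronous INSERT into the audit table per user-facing request, over plain JDBC. The connection URL, credentials, and field values are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;

public class AuditLogAntiPattern {

    // The anti-pattern: every user-facing request also generates a synchronous
    // database write, competing with latency-sensitive queries on the same system.
    public static void logAction(Connection db, long userId, String action) throws SQLException {
        String sql = "INSERT INTO my_audit_log (created_at, user_id, action) VALUES (?, ?, ?)";
        try (PreparedStatement stmt = db.prepareStatement(sql)) {
            stmt.setTimestamp(1, new Timestamp(System.currentTimeMillis()));
            stmt.setLong(2, userId);
            stmt.setString(3, action);
            stmt.executeUpdate();
        }
    }

    public static void main(String[] args) throws SQLException {
        // Hypothetical connection to the same database that serves user traffic.
        try (Connection db = DriverManager.getConnection("jdbc:mysql://localhost/service_db", "app", "secret")) {
            logAction(db, 229922L, "login");
        }
    }
}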

Using a database for logging at scale, however, quickly breaks down for a whole host of reasons.

First, there is a mismatch between logging and typical database workloads for interactive or user-facing services: the latter are almost always latency sensitive and usually involve relatively small individual transactions. In contrast, logging creates heavy write traffic. Co-locating the two different workloads on the same system usually results in poor performance for both, even if the databases are fronted by in-memory caches (as is the standard architecture). For the same reason, the convenience of storing logs in the database for log analysis is not meaningful in practice, since complex long-running analytical queries (that may scan through large amounts of data) conflict with latency-sensitive user queries, unless they are isolated to dedicated resources, for example, via setting up “analytics-only” replicas.

Second, scaling databases is a non-trivial challenge. In the open-source world, one can easily find on the web a plethora of blog posts and war stories about the pains of running sharded MySQL at scale. Having to scale both a user-facing service and the logging infrastructure makes the problem unnecessarily complex. Organizations working with proprietary databases face a different set of challenges: proprietary offerings can be more scalable, but are costly and often require hiring highly-specialized administrators.

Third, databases are simply overkill for logging. They are built for transactions and mutability of data, both of which come with performance overhead. Neither is necessary for logging, so there is nothing to be gained from paying this performance penalty—in fact, the ability to mutate log records after they are written invites an entirely new class of bugs. Logs should be treated as immutable, append-only data sinks.

Fourth, evolving schemas in relational databases is often a painful and disruptive process. For example, most ALTER operations in MySQL cause the table to be locked and rebuilt, which can be a long process—this essentially means down time for the service. Schemas for logs naturally evolve, since frequently one realizes the need to gather additional data that was not previously anticipated. As mentioned earlier, one common approach around this problem is to define the table schema around arbitrary key–value pairs, which to a large extent defeats the point of having a schema and using a database in the first place (see discussion below on JSON).

The database community has long realized most of these points, which is why database architectures have evolved separate systems for OLTP (online transaction processing) and OLAP (online analytical processing). A discussion of traditional OLAP systems is beyond the scope of this paper. Note that when we advocate against using the database as a log, we certainly do not advocate against traditional log ETL processes feeding into OLAP databases—just against writing log messages directly to a database from the application layer.

4.2 The Log Transport Problem

If databases shouldn’t be used for logging, what then? At a basic level, we need a mechanism to move logs from where they are generated—machines running service endpoints—to where they will ultimately be analyzed, which for Twitter as well as many organizations is the Hadoop Distributed File System (HDFS) of a Hadoop cluster.

We consider the problem of log transport mostly solved. Twitter uses Scribe, a system for aggregating high volumes of streaming log data in a robust, fault-tolerant, distributed manner, originally developed by Facebook and now an open-source project. A description of how Scribe fits into Facebook’s overall data collection pipeline can be found in [51]. Twitter’s Scribe infrastructure is illustrated in Figure 1; see [31] for more details. A Scribe daemon runs on every production host and is responsible for sending local log data across the network to a cluster of dedicated aggregators in the same datacenter. Each log entry consists of two strings, a category and a message. The category is associated with configuration metadata that determine, among other things, where the data is written.

Figure 1: Illustration of Twitter’s Scribe infrastructure. Scribe daemons on production hosts send log messages to Scribe aggregators, which deposit aggregated log data onto per-datacenter staging Hadoop clusters. Periodic processes then copy data from these staging clusters into our main Hadoop data warehouse.

The aggregators in each datacenter are co-located with a staging Hadoop cluster. Their task is to merge per-category streams from all the server daemons and write the merged results to HDFS (of the staging Hadoop cluster), compressing data on the fly. Another process is responsible for moving these logs from the per-datacenter staging clusters into the main Hadoop data warehouse. It applies certain sanity checks and transformations, such as merging many small files into a few big ones and building any necessary indexes.

A few additional issues to consider: a robust log transfer system may create large amounts of network traffic, thus the use of CPU-intensive log compression techniques should be balanced against the availability of network bandwidth. As systems scale up, a robust log transport architecture needs to build in tolerances for rack switch failures and other unexpected network glitches—at some point, the architecture needs of a massive log collection system become non-trivial.

Within Twitter, at the end of the log mover pipeline, data arrive in the main Hadoop data warehouse and are deposited in per-category, per-hour directories on HDFS (e.g., /logs/category/YYYY/MM/DD/HH/). In each directory, log messages are bundled in a small number of large files.
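A small sketch of how a downstream job might address these per-category, per-hour directories: given a category and a date, enumerate the 24 hourly paths under /logs/ following the layout just described. The base path layout matches the example in the text; the category name and everything else here are illustrative.

import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

public class LogPaths {

    /** Builds /logs/category/YYYY/MM/DD/HH/ paths for every hour of one day. */
    public static List<String> hourlyPaths(String category, LocalDate day) {
        List<String> paths = new ArrayList<>();
        for (int hour = 0; hour < 24; hour++) {
            paths.add(String.format("/logs/%s/%04d/%02d/%02d/%02d/",
                    category, day.getYear(), day.getMonthValue(), day.getDayOfMonth(), hour));
        }
        return paths;
    }

    public static void main(String[] args) {
        // e.g., /logs/client_events/2012/10/31/00/ through /logs/client_events/2012/10/31/23/
        hourlyPaths("client_events", LocalDate.of(2012, 10, 31)).forEach(System.out::println);
    }
}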

SIGKDD Explorations Volume 14, Issue 2 Page 11 ^(\\w+\\s+\\d+\\s+\\d+:\\d+:\\d+)\\s+ { ([^@]+?)@(\\S+)\\s+(\\S+):\\s+(\\S+)\\s+(\\S+) "token": 945842, \\s+((?:\\S+?,\\s+)*(?:\\S+?))\\s+(\\S+)\\s+(\\S+) "feature_enabled": "super_special", \\s+\\[([^\\]]+)\\]\\s+\"(\\w+)\\s+([^\"\\\\]* "userid": 229922, (?:\\\\.[^\"\\\\]*)*)\\s+(\\S+)\"\\s+(\\S+)\\s+ "page": "null", (\\S+)\\s+\"([^\"\\\\]*(?:\\\\.[^\"\\\\]*)*) "info": { "email": "[email protected]" } \"\\s+\"([^\"\\\\]*(?:\\\\.[^\"\\\\]*)*)\"\\s* } (\\d*-[\\d-]*)?\\s*(\\d+)?\\s*(\\d*\\.[\\d\\.]*)? Table 3: A hypothetical JSON log message. (\\s+[-\\w]+)?.*$

Table 2: An actual Java regular expression used to ways in which this example might be broken. parse log messages at Twitter circa 2010. Trouble begins with an issue as simple as naming conven- tions. Building a successful web service today requires bring- Kafka [30] offer similar functionality—we consider using one ing together developers with very diverse skill sets: front- 3 of these systems to be best practice today. end, back-end, systems, databases, etc. Developers accus- tomed to different language environments have conflicting 4.3 From Plain Text to JSON conventions for naming fields and variables. This results Having addressed log transport, let us move onto the next in code for generating and analyzing logs being littered with question: How should log messages be structured? CamelCase, smallCamelCase, as well as snake_case, and oc- The instinctive starting point for many developers is to casionally, even the dreaded camel_Snake.4 To make mat- capture the standard logs their application generates, be it ters worse, we’ve even encountered at least one instance of as simple as the stdout and stderr streams, or informa- the mythical dunder__snake5 in the Twitter code base.6 As tion collected via dedicated logging libraries such as Java’s a specific example: Is the user id field uid, userId, userid, Log4J. Either way, this means plain-text log messages, the or user_id (of course, not ruling out user_Id)? In the case most common case being delimited fields with a record per of JSON, it usually suffices to examine a few records to fig- line, although multi-line messages such as stack traces are ure out the field names, but this is one more annoyance that frequently seen as well. A parsing and maintenance night- gets in the way of productive analysis. Sometimes, however, mare develops quickly. Pop quiz: what’s the delimiter, and bugs with field names crop up in subtle ways: for example, would everyone in the organization give the same answer? different code paths might accidentally adopt different con- Delimited, line-oriented plain-text records require care- ventions, resulting in some of the log messages having one ful escaping of special characters, which is riddled with un- field, and some of the log messages the other (or, even worse, expected corner cases. There is a memorable example of log messages containing both fields). us having discovered that some Twitter user names con- Consider the property feature_enabled. It should re- tain embedded newlines and tabs, which needless to say was ally have been a list, but we didn’t anticipate users be- not a condition we checked for, and therefore led to corrupt ing opted into more than one feature. Now that someone records in a processing script. pointed this out (perhaps a developer who is adding a new The next pain point: parsing plain-text log messages is feature), the bug will be fixed (tomorrow). In the meantime both slow and error-prone. Regular expression matching, we’ll store multiple values as a comma-separated list. Of splitting strings based on delimiters, and converting strings course, the comma delimiter isn’t documented anywhere, so back to integers or floating point values are all slow opera- the data scientist inadvertently uses semi-colon as the de- tions, which means that working with plain-text logs is ineffi- limiter, and naturally, none of the values match expected cient. Furthermore, regular expressions for parsing complex ones. 
Meanwhile, tomorrow never comes, and so this hack log messages are difficult to understand, maintain, and ex- persists in the code base for years. Or, worse yet, tomorrow tend. An extreme, but perhaps not atypical, example is the comes—and now the data scientist has to discover, thorough Java regular expression shown in Table 2, which was used some bug database or wiki page “data mining” perhaps, that for some time in the Twitter code base (circa 2010)—made feature_enabled can be a comma-separated list, a semi- extra fun due to double-escaping slashes required for Java colon separated list, or an actual JSON array—and now she strings. Mercifully, this code has been retired. has to write code to handle all three cases. Having understood the downsides of working with plain- The property userid (note, not user_id or userId) is ac- text log messages, the next logical evolution step is to use tually supposed to be a string, but this user’s id happens to a loosely-structured, easy-to-evolve markup scheme. JSON be a sequence of digits, so some upstream process casted it is a popular format we’ve seen adopted by multiple compa- into a JSON number. Now the type is inconsistent across nies, and tried ourselves. It has the advantage of being easily different instances of the same message type. Any down- parseable, arbitrarily extensible, and amenable to compres- stream process that attempts to perform a join will throw sion. JSON log messages generally start out as an adequate a type mismatch exception. Next issue: consider the value solution, but then gradually spiral into a persistent night- of the page property—is it really the string “null”? Or is mare. We illustrate with a hypothetical but realistic JSON it actually supposed to be null? Or nil? This may seem log message in Table 3. Let us run through a thought exer- pedantic but is nevertheless important. At a recent confer- cise in which we discuss the depressing but all-too-realistic ence, when presenting this point, one of the authors asked the developers in the audience to raise their hand if their 3Since Kafka is a distributed queuing system, adopting it for logging requires a different architecture, see [17]. The general 4 design is for production hosts to write log messages to Kafka, https://twitter.com/billgraham/status/245687780640960514 5 and then implement a periodic process (e.g., every ten minutes) http://wiki.python.org/moin/DunderAlias to “roll up” log messages and persist to HDFS. 6https://twitter.com/billgraham/status/246077392408412160
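The code the data scientist ends up writing looks something like the following sketch, which uses the Jackson library to tolerate all three encodings of feature_enabled, along with the literal string "null" issue discussed below. This is illustrative only; it is exactly the kind of defensive parsing that a fixed schema would make unnecessary.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.ArrayList;
import java.util.List;

public class DefensiveJsonParsing {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    /** feature_enabled may be a JSON array, a comma-separated string, or a semi-colon-separated string. */
    public static List<String> enabledFeatures(JsonNode message) {
        List<String> features = new ArrayList<>();
        JsonNode node = message.path("feature_enabled");
        if (node.isArray()) {
            for (JsonNode element : node) {
                features.add(element.asText());
            }
        } else if (!node.isMissingNode() && !node.isNull()) {
            for (String piece : node.asText().split("[,;]")) {
                String trimmed = piece.trim();
                if (!trimmed.isEmpty()) {
                    features.add(trimmed);
                }
            }
        }
        return features;
    }

    /** Treats missing values, JSON null, and the literal string "null" the same way. */
    public static String pageOrNull(JsonNode message) {
        JsonNode node = message.path("page");
        if (node.isMissingNode() || node.isNull() || "null".equals(node.asText())) {
            return null;
        }
        return node.asText();
    }

    public static void main(String[] args) throws Exception {
        JsonNode message = MAPPER.readTree(
                "{\"feature_enabled\": \"super_special;less_special\", \"page\": \"null\"}");
        System.out.println(enabledFeatures(message)); // [super_special, less_special]
        System.out.println(pageOrNull(message));      // null
    }
}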

The property userid (note, not user_id or userId) is actually supposed to be a string, but this user’s id happens to be a sequence of digits, so some upstream process casted it into a JSON number. Now the type is inconsistent across different instances of the same message type. Any downstream process that attempts to perform a join will throw a type mismatch exception. Next issue: consider the value of the page property—is it really the string “null”? Or is it actually supposed to be null? Or nil? This may seem pedantic but is nevertheless important. At a recent conference, when presenting this point, one of the authors asked the developers in the audience to raise their hand if their time has been wasted in the past chasing down bugs that turned out to be nulls represented as strings when parsing JSON. Almost everyone raised their hand. We don’t have to live like this. There is a better way.

Finally, the value of the info property is a nested object. Is it a flat key–value map, or can there be deeper nesting? What fields are obligatory, what fields are optional? For each field, what is the type and range of values? Can it be null, and if so, how is it indicated (see issue above)? In many cases, even obtaining a complete catalog of all possible fields is difficult. Internal documentation is almost always out of date, and the knowledge lives primarily in the team of developers who created the applications (who themselves are often fuzzy on the details of code they had written months ago). To get around this issue, data scientists often have to read code (sometimes, old versions of the code dug up from source control!) to figure out the peculiarities of a log message format. Another common “solution” is to induce the message format manually by writing Pig jobs that scrape large numbers of messages to produce key–value histograms. Needless to say, both of these alternatives are slow and error-prone. Note that the same arguments can be applied to the top-level JSON structure, but nested objects make all these issues even more maddening.

A large part of this particular issue stems from JSON’s lack of a fixed schema. This affords the developer great flexibility in describing complex data with arbitrarily nested structures, but causes headaches for data scientists who must make sense of log semantics downstream. Certainly, one can posit a method for sharing JSON schemas—and we have gone down this path in the past, with XML and DTDs—but at that point, why bother with JSON?

4.4 Structured Representations

As described in the previous section, data representation formats such as JSON standardize parsing but still allow too much (unconstrained!) flexibility, making long-term data analysis difficult. We would like to make data scientists’ lives easier without burdening application developers upstream, for example, by forcing them into overly-rigid structures. Like many design problems, striking a good balance between expressivity and simplicity is difficult.

Our approach to address these problems has been to institutionalize the use of Apache Thrift for serialization of all logged messages.7 Thrift provides a simple grammar for declaring typed structs and code generation tools to allow the creation and consumption of such structs in a number of programming languages. Binary serialization protocols avoid the performance penalty of repeatedly parsing values from string-based representations. Another key feature of Thrift is the ability to specify optional and required fields, which provides a sane path for schema evolution (more below). We have developed a set of tools collectively called Elephant Bird8 that hooks into the Thrift serialization compiler to automatically generate record readers, record writers, and other “glue” code for Hadoop, Pig, and other tools. Protocol Buffers [12] (open-sourced by Google9) and Apache Avro10 achieve similar goals, with minor implementation differences.

7 http://thrift.apache.org/
8 http://github.com/kevinweil/elephant-bird
9 http://code.google.com/p/protobuf/
10 http://avro.apache.org/

struct MessageInfo {
  1: optional string name
  2: optional string email // NOTE: unverified.
}

enum Feature {
  super_special,
  less_special
}

struct LogMessage {
  1: required i64 token
  2: required string user_id
  3: optional list<Feature> enabled_features
  4: optional i64 page = 0
  5: optional MessageInfo info
}

Table 4: Message definition in Thrift.

A Thrift type definition of the JSON message shown earlier can be found in Table 4. Here, we clearly and concisely describe constraints on possible messages such as field types (thus reducing type errors), specify which fields are optional and which are required, and explicitly define enums that constrain the range of possible values (when possible).

Combining universal use of Thrift for logging with a small investment in tooling reduces the friction involved in data analysis. With a bit of metadata, all our tools now know how to access any log source. A data scientist can consult the Thrift definition and immediately knows what to expect, with some degree of assurance that the messages will pass basic sanity checks. Schemas can be evolved by adding additional optional fields and deprecating old fields; history of such changes is available via the source control system, and constrained to just a couple of schema definition files. Note that although Thrift does not completely resolve the ongoing struggles between the camel and the snake (i.e., different naming conventions), at least now there is never any doubt as to what the fields are actually named.

Another benefit of using something like Thrift or Avro is that it provides a separation between the logical and physical data representations, a property long appreciated in the database community. This separation allows the storage and infrastructure teams to change physical representations of the data on disk, experiment with different compression algorithms to tune speed/capacity tradeoffs, put data into columnar stores [40, 21, 36, 24], create indexes [13, 35], and apply a number of other optimizations. As long as the messages can be reconstructed in accordance with the declared schema, consumers and producers do not need to be concerned with what happens “under the hood”. While the simplicity of dumping JSON to flat files can allow a small team to iterate quickly, the infrastructure we have described provides many benefits when working with large numbers of data sources, diverse data consumers of varying technical sophistication, and sufficiently large volumes that advanced data storage and compression techniques become important.
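To illustrate what working against a declared schema looks like, the following sketch serializes and deserializes a message using the classes the Thrift compiler would generate from Table 4, via Thrift's binary protocol. It assumes a generated LogMessage class on the classpath; the exact accessor names depend on the Thrift version and are illustrative here.

import org.apache.thrift.TDeserializer;
import org.apache.thrift.TException;
import org.apache.thrift.TSerializer;
import org.apache.thrift.protocol.TBinaryProtocol;

public class ThriftRoundTrip {

    public static void main(String[] args) throws TException {
        // Populate a message using the typed, generated API (no string parsing anywhere).
        LogMessage message = new LogMessage();
        message.setToken(945842L);
        message.setUser_id("229922");

        // Binary serialization: compact on the wire and cheap to decode.
        TSerializer serializer = new TSerializer(new TBinaryProtocol.Factory());
        byte[] bytes = serializer.serialize(message);

        TDeserializer deserializer = new TDeserializer(new TBinaryProtocol.Factory());
        LogMessage decoded = new LogMessage();
        deserializer.deserialize(decoded, bytes);

        // Required fields and types are enforced by the schema, not by convention.
        System.out.println(decoded.getToken() + " " + decoded.getUser_id());
    }
}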

SIGKDD Explorations Volume 14, Issue 2 Page 13 allow data scientists, product managers, and other interested composite key comprising the user id and a session token parties to explore what is available in the data warehouse (based on a browser cookie) would suffice, provided that the without blindly poking around Git repositories filled with application developer logged the user id and session token! Thrift IDL files. The basis of such a service can be found Unfortunately, that information isn’t always available. In in the HCatalog project in the Apache Incubator. HCatalog many cases, it’s not even an issue of logging: the service provides a basic “table” abstraction for files stored in HDFS, API simply wasn’t designed to require the session token in and, true to its name, maintains a catalog of all registered its request invocation, so there’s no way to record the in- tables and their schemas. Tables may contain nested ob- formation without changing the API.11 We can usually get jects, and there is good support for Thrift and Avro data around missing data by performing timestamp arithmetic to sources. As long as all processes that create or alter data reconstruct the user’s sequences of actions, but such hacks on the Hadoop cluster do so through the HCatalog APIs, it are fraught with complex, unmaintainable temporal logic is possible to create web-based tools for exploring the com- and full of corner cases that aren’t accounted for. On the is- plete set of available data resources. Having such a central- sue of timestamps, we must frequently deal with conversion ized catalog is invaluable, but integration has proven to be between different formats: one source logs epoch millisec- surprisingly difficult when our team tried to bolt it onto ex- onds, the second a timestamp in MySQL format, the third a isting workflows post-factum, since hundreds of scripts had timestamp in Apache log format. This is a relatively minor to be converted to use HCatalog, many of them manifest- annoyance as we have a library for time format conversions, ing corner cases that required special treatment. Here lies but the accumulation of annoyances increases development a dilemma: for a small team working with only a few data friction and slows iteration speed. It would be desirable to sources, the overhead of using HCatalog may be more trou- have shared semantics across different log types, but this is ble than it’s worth. However, as the complexity of the an- very hard to accomplish in a large organization with dozens alytics platform grows, so does the need for the capabilities of teams working on a multitude of services, largely indepen- that HCatalog offers—but adapting existing workflows can dent of each other. The upshot is that data scientists spend be painful. There is no simple resolution, as the “right” ap- an inordinate amount of time reconstructing the user’s ac- proach is heavily context-dependent. tivity before any useful analysis even begins. At Twitter, we found that provenance data, not currently “Client events” is an attempt to address this issue for covered by HCatalog, are extremely useful for answering logs generated by Twitter clients (e.g., the twitter.com site, both data mining questions (“What source data was used to clients on iPhones and Android devices, etc.) by unifying produce these user clusters?”) as well as questions related to log formats. 
This project was described in a recent pa- troubleshooting and recovery (“What consumed this incor- per [31], so here we only provide a brief summary. The rect data? Should we regenerate downstream products?”). idea behind client event logs is to establish a flexible, shared In order to add provenance tracking to HCatalog, we built Thrift schema for all Twitter-owned clients from which data a thin wrapper around HCatalog loaders and storers that scientists can easily reconstruct user sessions to perform be- records information about all reads and writes to different havior analysis. This is accomplished by requiring that all data sets. These access traces can then be used to con- messages share certain fields (e.g., event name, user id, ses- struct a complete—and correct—graph of job dependencies sion id, timestamp) with precisely-defined semantics, while based on the actual data access patterns. Other approaches allowing sufficient flexibility to capture event-specific details. to solving the dependency graph visibility problem require Event names are modeled after the client’s view hierarchy users to explicitly construct such graphs while structuring (e.g., DOM in the case of the web client), making it easy their workflows. We found this to be onerous for the user to focus on specific Twitter products, e.g., search, discovery, and error-prone, as some reads and writes invariably hap- user recommendations, etc. Since the most recent iteration pen “out-of-band” and are not explicitly declared. When the of Twitter clients establishes a common design language, it graph is constructed directly from the executing workflow, becomes easier to adapt analysis scripts to, say, compare mo- the possibility of missing edges is reduced. In the future, bile vs. web usage of the same product. So far, client events we plan on further integrating this metadata layer with our have worked well for us and have greatly simplified session- source control and deployment systems so we can automat- based analyses for data scientists. The takeaway message ically identify likely owners of different applications and to here is that enforcing some degree of semantic homogeneity trace problems to individual code commits or deploy ver- can have big payoffs, although we are also quick to point out sions when troubleshooting production issues. that this is not possible or practical in many cases. Having posited the availability of such a metadata ser- vice, let us now consider a hypothetical scenario in which a data scientist wishes to understand how a particular prod- 5. PLUMBING IS IMPORTANT uct is used. Because of the service architecture discussed in One of the biggest challenges in building and operating Section 3.1, the necessary data is almost always distributed a production analytics platform is handling impedance mis- across multiple log types. With HCatalog and all of the matches that arise from crossing boundaries between differ- best practices discussed above, it should be easy for her to ent systems and frameworks. Different systems or frame- identify the right data sources and gain an overview of their works excel at different problems, and it makes sense to contents quickly. However, the challenge occurs when she choose the right tool for the job. However, this must be attempts to reconstruct user sessions, which form the basic balanced against the cost of knitting together a patchwork unit of most analyses. 
“Client events” is an attempt to address this issue for logs generated by Twitter clients (e.g., the twitter.com site, clients on iPhones and Android devices, etc.) by unifying log formats. This project was described in a recent paper [31], so here we only provide a brief summary. The idea behind client event logs is to establish a flexible, shared Thrift schema for all Twitter-owned clients from which data scientists can easily reconstruct user sessions to perform behavior analysis. This is accomplished by requiring that all messages share certain fields (e.g., event name, user id, session id, timestamp) with precisely-defined semantics, while allowing sufficient flexibility to capture event-specific details. Event names are modeled after the client’s view hierarchy (e.g., DOM in the case of the web client), making it easy to focus on specific Twitter products, e.g., search, discovery, user recommendations, etc. Since the most recent iteration of Twitter clients establishes a common design language, it becomes easier to adapt analysis scripts to, say, compare mobile vs. web usage of the same product. So far, client events have worked well for us and have greatly simplified session-based analyses for data scientists. The takeaway message here is that enforcing some degree of semantic homogeneity can have big payoffs, although we are also quick to point out that this is not possible or practical in many cases.
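The actual schema is defined in Thrift and described in [31]; the Python sketch below is only meant to convey the shape of the shared fields, with an invented event name for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class ClientEvent:
    # Fields shared by every message, with precisely-defined semantics.
    event_name: str            # modeled after the client's view hierarchy
    user_id: int
    session_id: str
    timestamp_ms: int
    # Event-specific details ride along without changing the shared schema.
    details: Dict[str, str] = field(default_factory=dict)

# A hypothetical event: a click on a result in web search.
event = ClientEvent(
    event_name="web:search:results:click",   # invented name, for illustration only
    user_id=12345,
    session_id="a1b2c3d4",
    timestamp_ms=1351700038123,
    details={"query": "big data", "position": "3"},
)
```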

5. PLUMBING IS IMPORTANT

One of the biggest challenges in building and operating a production analytics platform is handling impedance mismatches that arise from crossing boundaries between different systems and frameworks. Different systems or frameworks excel at different problems, and it makes sense to choose the right tool for the job. However, this must be balanced against the cost of knitting together a patchwork of different components into integrated workflows. Different systems and frameworks provide alternative models and constructs for thinking about computation: MapReduce forces the developer to decompose everything into maps and reduces; Pregel [37] makes the developer think in terms of vertex computations and message passing; relational databases are constrained by schemas and SQL. This is not necessarily a bad thing: for example, the simplicity of MapReduce makes it easy to scale out and handle fault tolerance through retries. Different models of computation necessarily make some problems easier (otherwise, there would be no point), but as a side effect they make other problems unnatural or difficult, and in some cases, impossible. Furthermore, systems and frameworks have differing performance and scalability characteristics. Working solely within the confines of a single one, these restrictions are often not very apparent. However, impedance mismatches come into stark relief when trying to cross system or framework boundaries.

Our favorite example of an impedance mismatch is the import pipeline that loads data from user-facing databases into our Hadoop data warehouse. Like many organizations, Twitter runs sharded MySQL databases for most user-facing services. This suggests a very simple import mechanism: connect to the right database partition, issue a SELECT * FROM... query, gather the results, and write into HDFS. Of course, we can do this in parallel inside mappers, which speeds up the import process. The problem is that it is very easy to start thousands of mappers in Hadoop—if not careful, we launch what amounts to a denial of service attack against our own system as thousands of processes attempt to simultaneously grab database connections, bringing the database cluster down. The underlying cause is the differing scalability characteristics of MySQL and Hadoop, which are not exposed at the interfaces provided by these services; additional layers for handling QoS, pushing back on over-eager clients (and handling such pushback), etc. become necessary.
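The general shape of the fix is to make the mismatch explicit by bounding concurrent access to the database tier. Below is a minimal Python sketch: a per-shard semaphore caps connections regardless of how many import tasks run in parallel. The shard gates, the cap, and the injected fetch_rows/write_to_hdfs helpers are illustrative assumptions, not our actual import pipeline.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_CONNECTIONS_PER_SHARD = 4      # illustrative per-shard cap
_shard_gates = {}                  # shard name -> semaphore
_gate_lock = threading.Lock()

def _gate(shard):
    with _gate_lock:
        if shard not in _shard_gates:
            _shard_gates[shard] = threading.Semaphore(MAX_CONNECTIONS_PER_SHARD)
        return _shard_gates[shard]

def import_partition(shard, query, fetch_rows, write_to_hdfs):
    """One import task: run query against a shard, then write the rows to HDFS.
    fetch_rows and write_to_hdfs are injected so the sketch stays self-contained."""
    with _gate(shard):             # blocks while the shard is already saturated
        rows = fetch_rows(shard, query)
    write_to_hdfs(shard, rows)     # the HDFS write happens outside the database gate

def run_import(tasks, parallelism=200):
    # Hadoop happily runs hundreds of tasks at once; the per-shard gates ensure
    # the database tier only ever sees a bounded number of connections.
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        for shard, query, fetch_rows, writer in tasks:
            pool.submit(import_partition, shard, query, fetch_rows, writer)
```

In practice the same back-pressure has to exist on the database side as well, since a client-side cap only helps when every client cooperates.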
The other significant challenge in integrating heterogeneous systems and frameworks for large-scale data processing is threading dataflows across multiple interfaces. The simple fact is that no single production job executes in isolation: everything is part of some larger workflow. The job depends on some data that is generated upstream: the dependencies may be dozens of steps deep and ultimately terminate at data imports from external sources. Similarly, an analytics job usually feeds downstream processes, again, potentially dozens of steps deep, ultimately culminating in data that is presented in a dashboard, deployed back out to user-facing services, etc. In a production analytics platform, everything must run like clockwork: data imports must happen at fixed intervals; internal users are expecting dashboards and other reporting mechanisms to be up to date; user-facing data products must be refreshed frequently or the system risks presenting stale data. Orchestrating this entire process, which includes scheduling, dependency management, error handling, monitoring, etc., is no easy task—and is made exponentially more complex if synchronization needs to occur across different systems and frameworks.

To further complicate matters, a production analytics platform usually has code contributions from multiple teams. At Twitter, the analytics infrastructure team builds common tools, but most of the analytics workload comes from the dozens of teams who use the tools. The threading of workflows across different systems and frameworks frequently crosses team boundaries—for example, Scalding12 jobs from the promoted products team depend on the result of graph analytics jobs in Pig built by another team. The analytics infrastructure engineers are responsible for orchestrating this complex operation, but their job is made challenging by the fact that they do not have detailed knowledge of most of the jobs running on Hadoop.

12 https://dev.twitter.com/blog/scalding

We use the term “plumbing” to refer to the set of challenges presented above. These issues particularly affect data mining at scale, as we explain below.

5.1 Ad Hoc Data Mining

There are three main components of a data mining solution: the raw data, the feature representation extracted from the data, and the model or algorithm used to solve the problem. Accumulated experience over the last decade has shown that in real-world settings, the size of the dataset is the most important factor of the three [18, 33]. Studies have repeatedly shown that simple models trained over enormous quantities of data outperform more sophisticated models trained on less data [2, 7, 14]. For many applications, simple features are sufficient to yield good results—examples in the text domain include the use of simple byte-level n-gram features, eschewing even relatively simple text processing tasks such as HTML tag cleanup and tokenization. This, for example, works well for spam detection [26, 10, 50]. Simple features are fast to extract, thus more scalable: imagine, for example, the computational resources necessary to analyze every single outgoing and incoming message in a web email service. Today, at least in industry, solutions to a wide range of problems are dominated by simple approaches fed with large amounts of data.
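To illustrate how little machinery such features require, here is a minimal sketch of hashed byte-level n-gram extraction with no HTML cleanup and no tokenization. The hashing trick and the bucket count are conventional but illustrative choices, not a description of any particular production spam filter.

```python
import zlib
from collections import Counter

NUM_BUCKETS = 2 ** 20   # illustrative size of the hashed feature space

def byte_ngram_features(message: bytes, n: int = 4) -> Counter:
    """Map a raw message (bytes, no preprocessing at all) to hashed n-gram counts."""
    counts = Counter()
    for i in range(len(message) - n + 1):
        gram = message[i:i + n]
        counts[zlib.crc32(gram) % NUM_BUCKETS] += 1
    return counts

# The feature vector is just a sparse mapping of bucket -> count.
features = byte_ngram_features(b"<a href='http://spam.example'>click me</a>")
```

Because extraction touches each byte a constant number of times, it remains cheap even at the message volumes of a large web service.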
For unsupervised data mining, the “throw more data at the problem” approach seems obvious since it allows an organization to make sense of large accumulated data stores. The connection to supervised approaches, however, seems less obvious: isn’t the amount of data that can be brought to bear limited by the scale at which ground truth can be acquired? This is where user behavior logs are most useful—the insight is to let users provide the ground truth implicitly as part of their normal activities. Perhaps the most successful example of this is “learning to rank” [32], which describes a large class of supervised machine learning approaches to web search ranking. From users’ relevance judgments, it is possible to automatically induce ranking functions that optimize a particular loss function. Where do these relevance judgments come from? Commercial search engine companies hire editorial staff to provide these judgments manually with respect to user queries, but they are supplemented with training data mined from search query and interaction logs. It has been shown that certain interaction patterns provide useful (weak) relevance judgments that can be exploited by learning-to-rank techniques [25]. Intuitively, if the user issues a query and clicks on a search result, this observation provides weak evidence that the result is relevant to the query. Of course, user interaction patterns are far more complex and interpretations need to be more nuanced, but the underlying intuition of mining behavior logs to provide (noisy) ground truth for supervised algorithms is foundational and generalizes beyond web search. This creates a virtuous cycle: a high quality product attracts users and generates a wealth of behavioral data, which can then be mined to further improve the product.
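The sketch below shows one simple way such weak labels might be distilled from query and click logs: aggregate impressions by (query, result) and treat results clicked at a sufficiently high rate as weakly relevant. The field names and thresholds are illustrative; real click models are considerably more nuanced and correct for effects such as position bias [25].

```python
from collections import defaultdict

def weak_relevance_labels(impressions, min_shows=50, click_rate_threshold=0.3):
    """impressions: dicts with query, result_id, and clicked (bool).
    Returns {(query, result_id): 1 or 0}, a weak label per query-result pair."""
    shows = defaultdict(int)
    clicks = defaultdict(int)
    for imp in impressions:
        key = (imp["query"], imp["result_id"])
        shows[key] += 1
        clicks[key] += int(imp["clicked"])
    labels = {}
    for key, n in shows.items():
        if n >= min_shows:   # ignore sparsely-shown pairs
            labels[key] = 1 if clicks[key] / n >= click_rate_threshold else 0
    return labels
```

Pairs labeled this way can then feed a learning-to-rank trainer in place of, or alongside, editorial judgments.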

Until relatively recently, the academic data mining and machine learning communities have generally assumed sequential algorithms on data that fit into memory. This assumption is reflected in popular open-source tools such as Weka,13 Mallet,14 and others. These tools lower the barrier to entry for data mining, and are widely used by many organizations and other researchers as well. From a pragmatic point of view, it makes little sense to reinvent the wheel, especially for common algorithms such as k-means clustering and logistic regression. However, using tools designed to run on a single server for big data mining is problematic.

13 http://www.cs.waikato.ac.nz/ml/weka/
14 http://mallet.cs.umass.edu/

To elaborate, we share our experiences working with existing open-source machine learning tools. The broader context is the lifecycle described in Section 3, but here we focus on the actual machine learning. After initial explorations and problem formulation, generation of training and test data for a specific application is performed on Hadoop (typically using Pig). Scripts often process tens of terabytes of data (if not more) to distill compact feature representations; although these are usually orders of magnitude smaller, they can still easily be in the tens to hundreds of gigabytes. Here we encounter an impedance mismatch between a petabyte-scale analytics platform and machine learning tools designed to run on a single machine. The obvious solution is to sample. One simple sampling approach is to generate data from a limited set of logs (e.g., a single day), even though it would be just as easy to generate data from many more. The smaller dataset would then be copied out of HDFS onto another machine (e.g., another server in the datacenter or the developer’s laptop). Once the data have been brought over, the developer would perform typical machine learning tasks: training, testing, parameter tuning, etc.

There are many issues with this process. First, and perhaps the most important, is that sampling largely defeats the point of working with big data in the first place. Training a model on only a small fraction of logs does not give an accurate indication of the model’s effectiveness at scale. Generally, we observe improvements with growing amounts of data, but there is no way to quantify this short of actually running experiments at scale. Furthermore, sampling often yields biased models—for example, using the most recent day of logs over-samples active users, compared to, for example, analyses across a larger time window. Often, the biases introduced in the sampling process are subtle and difficult to detect. We can remember several instances where a model performed very well on a small sample of users, but its performance steadily deteriorated as we applied it to more and more users.

Second, having to first run Pig scripts and then copy data out of HDFS and import into another environment creates a slow development cycle. Machine learning is an iterative process: for example, after initial experiments we think of a new feature or realize that we should have preprocessed the data in a certain way. This forces a context switch back to Pig to modify the scripts, rerun on the cluster, followed by more data copying. After a few iterations (waiting for jobs on the cluster to complete each time), we are left with several variations of the same dataset in multiple places, and have lost track of which script generated which version. To be fair, we would encounter these data management challenges in any analytics environment, but working across the cluster and a local tool exacerbates the problem. We frequently end up with at least three copies of similar data: the complete dataset on HDFS, one or more smaller datasets on HDFS (sampled using different strategies), and the identical data on a local machine for experimentation.

Finally, productionizing machine-learned models was awkward and brittle at best. Test data were prepared in Pig and divided into small batches, copied out of HDFS, and fed to the learned model (for example, running on another machine in the datacenter). Results were then copied from local disk back to HDFS, where they fed downstream processes. It was most common to have these complex flows orchestrated by shell scripts on cron, which was not a scalable solution. Retraining the model was even more ad hoc—in a few cases, we relied on “human cron”, or having engineers remember to retrain models (by hand) “once in a while”.

5.2 Scalable Machine Learning

Over the past few years, scaling up machine learning algorithms with multi-core [41] and cluster-based solutions [1] has received much interest. Examples include learning decision trees and their ensembles [3, 49, 43], MaxEnt models [38], structured perceptrons [39], support vector machines [8], and simple phrase-based approaches [4]. Recent work in online learning (e.g., work by Bottou [6] and Vowpal Wabbit15) relaxes the assumption that data must fit into memory, and is amenable to learning at a massive scale on disk-resident data. In the open-source world, Mahout16 is emerging as a popular toolkit for large-scale machine learning and data mining tasks.

15 http://hunch.net/~vw/
16 http://mahout.apache.org/
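To show what learning over disk-resident data looks like in this style, here is a minimal Python sketch of an online logistic regression trained by streaming over a features file one line at a time, in the spirit of SGD as popularized by Bottou [6]. The line format, feature hashing scheme, and learning rate are illustrative assumptions, not Vowpal Wabbit's actual input format or defaults.

```python
import math

def train_streaming_logreg(path, num_features=2 ** 20, learning_rate=0.1, epochs=1):
    """One weight per hashed feature; a single pass per epoch over a features file.
    Each line: '<label 0/1> <index:count> <index:count> ...' (illustrative format)."""
    w = [0.0] * num_features
    for _ in range(epochs):
        with open(path) as f:
            for line in f:                       # only one example in memory at a time
                parts = line.split()
                y = float(parts[0])
                x = [(int(i), float(c))
                     for i, c in (tok.split(":") for tok in parts[1:])]
                z = max(min(sum(w[i] * c for i, c in x), 35.0), -35.0)
                p = 1.0 / (1.0 + math.exp(-z))   # predicted probability
                grad = p - y                     # gradient of the log loss w.r.t. z
                for i, c in x:
                    w[i] -= learning_rate * grad * c
    return w
```

Memory is bounded by the size of the weight vector rather than the dataset, which is exactly the property that makes one-pass learning attractive for data that lives on HDFS.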
These are encouraging developments, but not complete end-to-end solutions. In particular, most work focuses on scalable machine learning itself; relatively little attention is paid to the integration issues we have discussed above. For example, Vowpal Wabbit is an extremely fast online learner optimized for streaming through millions of training instances on disk. However, we are still left with the problem of how to get data to the learner—for us, feature generation and data preparation must still be performed on Hadoop, so we still need to orchestrate the process of getting the data out of HDFS and into Vowpal Wabbit. Given the myriad issues discussed above, we hope that the reader realizes by now that this is not an easy task.

Our experience with Mahout raises a different set of issues: there are components of Mahout that are designed to run efficiently on a single machine, and other parts that scale to massive datasets using Hadoop. However, Mahout processing for many tasks consists of monolithic multi-stage pipelines: it demands that data be processed into a specific format and generates results in custom representations. Integrating Mahout into our analytics platform required developing adaptors on both ends—getting data into Mahout and results out. While this was “simply a matter of software engineering” (and indeed we have written and shared much of this “glue” code17), it exemplifies the impedance mismatch and integration issues we have been discussing.

17 https://github.com/kevinweil/elephant-bird
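The "glue" is rarely glamorous. A typical piece looks like the following sketch: rewrite feature records exported from a Pig job as tab-separated text into the learner's expected input, here loosely modeled on Vowpal Wabbit's plain-text "label | feature:value ..." format. The input layout and feature naming are invented for the example; the point is that every system boundary requires a converter like this that someone has to write and keep in sync.

```python
import sys

def tsv_to_vw(tsv_lines):
    """Each input line: <label> TAB <feat=value>,<feat=value>,...
    Yields lines in a VW-style text format: '<label> | feat:value feat:value ...'"""
    for line in tsv_lines:
        label, raw_feats = line.rstrip("\n").split("\t", 1)
        label = "1" if label == "1" else "-1"    # logistic-loss labels are +1/-1
        feats = []
        for pair in raw_feats.split(","):
            name, value = pair.split("=", 1)
            feats.append(name + ":" + value)
        yield label + " | " + " ".join(feats)

if __name__ == "__main__":
    # e.g.: hadoop fs -cat /features/part-* | python tsv_to_vw.py | vw ...
    for out_line in tsv_to_vw(sys.stdin):
        print(out_line)
```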

5.3 Integrated Solutions

How do we build big data mining infrastructure that addresses the “plumbing” issues we’ve discussed above? Here, we share our experiences and discuss alternative approaches.

Our present solution for machine learning was recently presented elsewhere [34], but here we provide a very brief overview. Our approach involves integrating machine learning components into Pig so that machine learning is just another Pig script, which allows seamless integration with existing infrastructure for data management, scheduling, and monitoring, as well as access to rich libraries of existing UDFs and the materialized output of other scripts. This is accomplished with two techniques: stochastic gradient descent for fast one-pass learning and scale out by data partitioning with ensembles. We have developed Pig extensions to fold learners into storage functions, which are abstractions for data sinks that materialize tuples onto durable storage (e.g., HDFS). In our case, instead of materializing the tuples themselves, the storage function treats them as training examples and materializes the learned model once all training instances have been processed.
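To convey the shape of this design without the Pig and Java specifics of [34], the sketch below models a storage function as a sink with put and close methods: put folds each tuple into a one-pass SGD learner (the same update sketched earlier), and close materializes the model instead of the data. Training several such sinks, one per data partition, yields an ensemble whose predictions are averaged. The class and method names are illustrative, not the actual API of our Pig extensions.

```python
import json
import math

class LearnerStoreFunc:
    """A 'storage function' that consumes training tuples and materializes a model."""

    def __init__(self, model_path, num_features, learning_rate=0.1):
        self.model_path = model_path
        self.w = [0.0] * num_features
        self.lr = learning_rate

    def put(self, example):        # called once per tuple, exactly like a data sink
        y, x = example             # x: list of (feature index, value) pairs
        z = max(min(sum(self.w[i] * v for i, v in x), 35.0), -35.0)
        grad = 1.0 / (1.0 + math.exp(-z)) - y
        for i, v in x:
            self.w[i] -= self.lr * grad * v

    def close(self):               # materialize the learned model, not the tuples
        with open(self.model_path, "w") as f:
            json.dump(self.w, f)

def ensemble_predict(models, x):
    """Average the predictions of the per-partition models."""
    def predict(w):
        z = max(min(sum(w[i] * v for i, v in x), 35.0), -35.0)
        return 1.0 / (1.0 + math.exp(-z))
    return sum(predict(w) for w in models) / len(models)
```

Because the "learner as sink" lives inside the dataflow, training inherits the scheduling, monitoring, and data management machinery that every other Pig job already uses.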
The MADLib project [9, 22] adopts an interesting alternative approach by pushing functionality into the database, so that various data mining algorithms can run at scale inside the database engine. This is attractive because it eliminates the need to import/export data to other tools, which immediately solves many of the “plumbing” problems we’ve been discussing. The reasoning goes as follows: since many database analysts and data scientists are already performing analytics in SQL databases (e.g., cubing, aggregations, etc.), why not provide data mining capabilities in the form of familiar SELECT... FROM... WHERE... commands? Bringing analytics inside the database has the added advantage of letting the database engine do what it’s good at: query optimization and scalable execution. Currently, MADLib (v0.3) provides supervised learning algorithms such as logistic regression and decision trees as well as unsupervised algorithms such as k-means clustering and SVD matrix factorization.

Our approach and MADLib’s are similar in that both attempt to streamline data mining by deep integration—the best way to solve impedance mismatch issues between different systems is to remove the interface completely! We do so by bringing machine learning inside Pig and Hadoop; MADLib does the same via SQL within an RDBMS. The primary difference we see is the profile of the target user. In the case of MADLib, the ideal user is a database analyst who spends a lot of time working with SQL already—the project aims to augment analysts’ toolbox with data mining and machine learning algorithms. In our case, Twitter data scientists are just as comfortable writing R scripts for complex data analysis as they are writing page-long SQL queries, with probably a slight preference for the former—thus, Pig’s step-by-step dataflow specification comes naturally. In any case, since Pig is already established as the most widely used large-scale data processing tool at Twitter, integration of machine learning was the next logical step.

This integration-based approach to large-scale machine learning and data mining is by no means limited to MADLib and our work. For example, SystemML [16] attempts to create an environment similar to the R language. RHIPE18 takes this one step further by attempting to parallelize R directly via Hadoop integration. This line of work is related to attempts at developing custom domain-specific languages (DSLs) for machine learning (see, for example, the NIPS 2011 Workshop on “Big Learning”). Collectively, these approaches remain immature, and we do not believe that any can be definitively recommended as best practice (including ours). The community is still in a phase of experimentation, but the substantial interest in large-scale data mining frameworks will undoubtedly lead to interesting innovations in the future.

18 http://www.datadr.org/
We conclude this section by identifying two underexplored questions deserving of attention. First, we believe that visualization is an important aspect of big data mining, both from the perspective of communicating results (i.e., “telling a story” to one’s peers, managers, etc.) and as a vehicle for generating insights (i.e., visualizations that help the data scientist learn about the problem). As an example, see Rios and Lin [47] for a few case studies that we recently shared. Browser-based toolkits such as d3.js [5] have emerged as the preferred method of creating and sharing visualizations, which means that development is, for the most part, limited to an engineer’s laptop—this brings us back to the awkwardness of generating and distilling data using Pig, then copying data out of HDFS. Furthermore, a browser’s ability to deal with large datasets is limited—in practice, a few megabytes at the most—which implies either a large degree of sampling or pre-aggregation along certain dimensions. Constraints on data preparation limit the extent to which visualizations can generate insights—it is impossible to identify the impact of features that have been discarded or rolled up in aggregate. Although we are aware of work to provide visualization support for machine learning [44], much more remains to be done. We have recently read about Facebook’s use of Tableau,19 which is clearly an attempt at solving this problem, but we were not able to find sufficient technical details to comment on the architecture.

19 https://twitter.com/tableau/status/266974538175242240

The second open problem concerns real-time (or at least rapid) interactions with large datasets. Since data mining is an iterative process, we desire tools that shorten the development cycle and enable faster experimentation. This is a capability that our Hadoop-based stack currently cannot provide. Here is the scenario that our data scientists deal with on a daily basis: Write a Pig script and submit a job. Wait five minutes for the job to finish. Discover that the output is empty because of the wrong join key. Fix simple bug. Resubmit. Wait another five minutes. Rinse, repeat. It’s fairly obvious that these long debug cycles hamper productivity; this is a fundamental weakness of MapReduce as a batch system. Examples of recent systems that are designed to enable low-latency analytics include Dremel [40], Spark [52], PowerDrill [19], and Impala,20 although the general problem is far from solved.

20 https://github.com/cloudera/impala

6. CONCLUSIONS

Rather than re-enumerating all the best practices, recommendations, and unresolved questions in this paper, we conclude with a discussion of where we feel the field is going from a different perspective.

At the end of the day, an efficient and successful big data analytics platform is about achieving the right balance between several competing factors: speed of development, ease of analytics, flexibility, scalability, robustness, etc. For example, a small team might be able to iterate on the front-end faster with JSON logging, eschewing the benefits of having schemas, but experience tells us that the team is accruing technical debt in terms of scalability challenges down the road. Ideally, we would like an analytics framework that provides all the benefits of schemas, data catalogs, integration hooks, robust data dependency management and workflow scheduling, etc., while requiring zero additional overhead. This is perhaps a pipe dream, as we can point to plenty of frameworks that have grown unusable under their own weight, with pages of boilerplate code necessary to accomplish even a simple task. How to provide useful tools while staying out of the way of developers is a difficult challenge.

As an organization grows, the analytics infrastructure will evolve to reflect different balances between various competing factors. For example, Twitter today generally favors scale and robustness over sheer development speed, as we know that if things aren’t done “right” to begin with, they’ll become maintenance nightmares down the road. Previously, we were more focused on just making things possible, implementing whatever expedient was necessary. Thinking about the evolution of analytics infrastructure in this manner highlights two challenges that merit future exploration:

First, are there prototypical evolutionary stages that data-centric organizations go through? This paper shares the Twitter experience as a case study, but can we move beyond anecdotes and war stories to a more formal classification of, for example, Stage-I, Stage-II, ... Stage-N organizations? In this hypothetical taxonomy, each stage description would be accompanied by the most pressing challenges and specific recommendations for addressing them. We would like to formalize the advice presented in this paper in terms of the specific contexts in which it is applicable.

Second, how do we smoothly provide technology support that will help organizations grow and transition from stage to stage? For example, can we provide a non-disruptive migration path from JSON logs to Thrift-based logs? How do we provide support for deep integration of predictive analytics down the road, even though the organization is still mostly focused on descriptive aggregation-style queries today? If a framework or set of best practices can provide a smooth evolutionary path, then it may be possible for an organization to optimize for development speed early on and shift towards scale and robustness as it matures, avoiding disruptive infrastructure changes in the process.

In this paper, we have attempted to describe what it’s like “in the trenches” of a production big data analytics environment. We hope that our experiences are useful for both practitioners and academic researchers. Practitioners should get a few chuckles out of our war stories and know how to avoid similar mistakes. For academic researchers, a better understanding of the broader context of big data mining can inform future work to streamline insight generation activities. We’d like to think that this paper has, however slightly, contributed to the community’s shared knowledge in building and running big data analytics infrastructure.

7. ACKNOWLEDGMENTS

None of the work described here would have been possible without the analytics infrastructure engineers and data scientists at Twitter: this paper distills their collective wisdom, successes, and frustrations. We’d like to thank Albert Bifet and Wei Fan for the invitation to write this article, and Joe Hellerstein and Anthony Tomasic for helpful comments on a previous draft. The first author is grateful to Esther and Kiri for their loving support and dedicates this work to Joshua and Jacob.

8. REFERENCES

[1] A. Agarwal, O. Chapelle, M. Dudík, and J. Langford. A reliable effective terascale linear learning system. arXiv:1110.4198v1, 2011.
[2] M. Banko and E. Brill. Scaling to very very large corpora for natural language disambiguation. In ACL, 2001.
[3] J. Basilico, M. Munson, T. Kolda, K. Dixon, and W. Kegelmeyer. COMET: A recipe for learning and using large ensembles on massive data. In ICDM, 2011.
[4] R. Bekkerman and M. Gavish. High-precision phrase-based document classification on a modern scale. In KDD, 2011.
[5] M. Bostock, V. Ogievetsky, and J. Heer. D3: Data-Driven Documents. In InfoVis, 2011.
[6] L. Bottou. Large-scale machine learning with stochastic gradient descent. In COMPSTAT, 2010.
[7] T. Brants, A. Popat, P. Xu, F. Och, and J. Dean. Large language models in machine translation. In EMNLP, 2007.
[8] E. Chang, H. Bai, K. Zhu, H. Wang, J. Li, and Z. Qiu. PSVM: Parallel Support Vector Machines with incomplete Cholesky factorization. In Scaling up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press, 2012.
[9] J. Cohen, B. Dolan, M. Dunlap, J. Hellerstein, and C. Welton. MAD skills: New analysis practices for big data. In VLDB, 2009.
[10] G. Cormack, M. Smucker, and C. Clarke. Efficient and effective spam filtering and re-ranking for large web datasets. arXiv:1004.5168v1, 2010.
[11] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004.
[12] J. Dean and S. Ghemawat. MapReduce: A flexible data processing tool. CACM, 53(1):72–77, 2010.
[13] J. Dittrich, J.-A. Quiané-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J. Schad. Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing). In VLDB, 2010.
[14] C. Dyer, A. Cordova, A. Mont, and J. Lin. Fast, easy, and cheap: Construction of statistical machine translation models with MapReduce. In StatMT Workshop, 2008.
[15] A. Gates, O. Natkovich, S. Chopra, P. Kamath, S. Narayanamurthy, C. Olston, B. Reed, S. Srinivasan, and U. Srivastava. Building a high-level dataflow system on top of MapReduce: The Pig experience. In VLDB, 2009.
[16] A. Ghoting, R. Krishnamurthy, E. Pednault, B. Reinwald, V. Sindhwani, S. Tatikonda, Y. Tian, and S. Vaithyanathan. SystemML: Declarative machine learning on MapReduce. In ICDE, 2011.
[17] K. Goodhope, J. Koshy, J. Kreps, N. Narkhede, R. Park, J. Rao, and V. Ye. Building LinkedIn's real-time activity data pipeline. Bulletin of the Technical Committee on Data Engineering, 35(2):33–45, 2012.

[18] A. Halevy, P. Norvig, and F. Pereira. The unreasonable effectiveness of data. IEEE Intelligent Systems, 24(2):8–12, 2009.
[19] A. Hall, O. Bachmann, R. Büssow, S. Gănceanu, and M. Nunkesser. Processing a trillion cells per mouse click. In VLDB, 2012.
[20] J. Hammerbacher. Information platforms and the rise of the data scientist. In Beautiful Data: The Stories Behind Elegant Data Solutions. O'Reilly, 2009.
[21] Y. He, R. Lee, Y. Huai, Z. Shao, N. Jain, X. Zhang, and Z. Xu. RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems. In ICDE, 2011.
[22] J. Hellerstein, C. Ré, F. Schoppmann, D. Wang, E. Fratkin, A. Gorajek, K. Ng, C. Welton, X. Feng, K. Li, and A. Kumar. The MADlib Analytics Library or MAD skills, the SQL. In VLDB, 2012.
[23] T. Hey, S. Tansley, and K. Tolle. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, Redmond, Washington, 2009.
[24] A. Jindal, J.-A. Quiané-Ruiz, and J. Dittrich. Trojan data layouts: Right shoes for a running elephant. In SoCC, 2011.
[25] T. Joachims, L. Granka, B. Pan, H. Hembrooke, F. Radlinski, and G. Gay. Evaluating the accuracy of implicit feedback from clicks and query reformulations in Web search. ACM TOIS, 25(2):1–27, 2007.
[26] I. Kanaris, K. Kanaris, I. Houvardas, and E. Stamatatos. Words versus character n-grams for anti-spam filtering. International Journal on Artificial Intelligence Tools, 16(6):1047–1067, 2007.
[27] S. Kandel, A. Paepcke, J. Hellerstein, and J. Heer. Enterprise data analysis and visualization: An interview study. In VAST, 2012.
[28] R. Kohavi, A. Deng, B. Frasca, R. Longbotham, T. Walker, and Y. Xu. Trustworthy online controlled experiments: Five puzzling outcomes explained. In KDD, 2012.
[29] R. Kohavi, R. Henne, and D. Sommerfield. Practical guide to controlled experiments on the web: Listen to your customers not to the HiPPO. In KDD, 2007.
[30] J. Kreps, N. Narkhede, and J. Rao. Kafka: A distributed messaging system for log processing. In NetDB, 2011.
[31] G. Lee, J. Lin, C. Liu, A. Lorek, and D. Ryaboy. The unified logging infrastructure for data analytics at Twitter. In VLDB, 2012.
[32] H. Li. Learning to Rank for Information Retrieval and Natural Language Processing. Morgan & Claypool, 2011.
[33] J. Lin and C. Dyer. Data-Intensive Text Processing with MapReduce. Morgan & Claypool Publishers, 2010.
[34] J. Lin and A. Kolcz. Large-scale machine learning at Twitter. In SIGMOD, 2012.
[35] J. Lin, D. Ryaboy, and K. Weil. Full-text indexing for optimizing selection operations in large-scale data analytics. In MAPREDUCE Workshop, 2011.
[36] Y. Lin, D. Agrawal, C. Chen, B. Ooi, and S. Wu. Llama: Leveraging columnar storage for scalable join processing in the MapReduce framework. In SIGMOD, 2011.
[37] G. Malewicz, M. Austern, A. Bik, J. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A system for large-scale graph processing. In SIGMOD, 2010.
[38] G. Mann, R. McDonald, M. Mohri, N. Silberman, and D. Walker. Efficient large-scale distributed training of conditional maximum entropy models. In NIPS, 2009.
[39] R. McDonald, K. Hall, and G. Mann. Distributed training strategies for the structured perceptron. In HLT, 2010.
[40] S. Melnik, A. Gubarev, J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis. Dremel: Interactive analysis of web-scale datasets. In VLDB, 2010.
[41] A. Ng, G. Bradski, C.-T. Chu, K. Olukotun, S. Kim, Y.-A. Lin, and Y. Yu. Map-Reduce for machine learning on multicore. In NIPS, 2006.
[42] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A not-so-foreign language for data processing. In SIGMOD, 2008.
[43] B. Panda, J. Herbach, S. Basu, and R. Bayardo. MapReduce and its application to massively parallel learning of decision tree ensembles. In Scaling up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press, 2012.
[44] K. Patel, N. Bancroft, S. Drucker, J. Fogarty, A. Ko, and J. Landay. Gestalt: Integrated support for implementation and analysis in machine learning. In UIST, 2010.
[45] D. Patil. Building Data Science Teams. O'Reilly, 2011.
[46] D. Patil. Data Jujitsu: The Art of Turning Data Into Product. O'Reilly, 2012.
[47] M. Rios and J. Lin. Distilling massive amounts of data into simple visualizations: Twitter case studies. In Workshop on Social Media Visualization at ICWSM, 2012.
[48] D. Sculley, M. Otey, M. Pohl, B. Spitznagel, J. Hainsworth, and Y. Zhou. Detecting adversarial advertisements in the wild. In KDD, 2011.
[49] K. Svore and C. Burges. Large-scale learning to rank using boosted decision trees. In Scaling up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press, 2012.
[50] B. Taylor, D. Fingal, and D. Aberdeen. The war against spam: A report from the front line. In NIPS Workshop on Machine Learning in Adversarial Environments, 2007.
[51] A. Thusoo, Z. Shao, S. Anthony, D. Borthakur, N. Jain, J. Sarma, R. Murthy, and H. Liu. Data warehousing and analytics infrastructure at Facebook. In SIGMOD, 2010.
[52] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. Franklin, S. Shenker, and I. Stoica. Resilient Distributed Datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI, 2012.
