Gobblin: Unifying Data Ingestion for Hadoop

Lin Qiao, Yinan Li, Sahil Takiar, Ziyang Liu, Narasimha Veeramreddy, Min Tu, Ying Dai, Issac Buenrostro, Kapil Surlaker, Shirshanka Das, Chavdar Botev
LinkedIn Inc., Mountain View, CA, USA
{lqiao,ynli,stakiar,ziliu,nveeramreddy,mitu,ydai,ibuenros,ksurlaker,sdas,cbotev}@linkedin.com

ABSTRACT

Data ingestion is an essential part of companies and organizations that collect and analyze large volumes of data. This paper describes Gobblin, a generic data ingestion framework for Hadoop and one of LinkedIn's latest open source products. At LinkedIn we need to ingest data from various sources such as relational stores, NoSQL stores, streaming systems, REST endpoints, filesystems, etc. into our Hadoop clusters. Maintaining independent pipelines for each source can lead to various operational problems. Gobblin aims to solve this issue by providing a centralized data ingestion framework that makes it easy to support ingesting data from a variety of sources.

Gobblin distinguishes itself from similar frameworks by focusing on three core principles: generality, extensibility, and operability. Gobblin supports a mixture of data sources out-of-the-box and can be easily extended for more. This enables an organization to use a single framework to handle different data ingestion needs, making it easy and inexpensive to operate. Moreover, with an end-to-end metrics collection and reporting module, Gobblin makes it simple and efficient to identify issues in production.

1. INTRODUCTION

A big data system is often characterized by the sheer volume of datasets it handles as well as the large processing power and new processing paradigms associated with it. These big data challenges have fostered significant innovations in large-scale computation platforms [2, 4, 5, 8, 14, 15, 19, 22, 24]. However, another important aspect of the complexity of a big data system comes from the coexistence of heterogeneous data sources and sinks. In reality, data integration often becomes a major pain point before developers or data engineers are even able to tackle traditional ETL or data processing at scale. It is therefore increasingly critical for a big data system to address the challenges of large-scale data ingestion and integration with high velocity and quality.

With first-hand experience of big data ingestion and integration pain points, we built Gobblin, a unified data ingestion framework. The development of Gobblin was mainly driven by the fact that LinkedIn's data sources have become increasingly heterogeneous. Data is constantly obtained and written into online data storage or streaming systems, including Espresso [23], Kafka [6], Voldemort [25], Oracle, MySQL and RocksDB [17], as well as external sites such as S3, Salesforce and Google Analytics. Such data includes member profiles, connections, posts and many other external and internal activities. These data sources produce terabytes of data daily, most of which needs to be loaded into our Hadoop clusters to feed business- or consumer-oriented analysis. We used to develop a separate data ingestion pipeline for each data source, and at one point we were running over a dozen different pipelines.

Having this many different data ingestion pipelines is like re-implementing HashMap every time we need a HashMap with a different type argument. Moreover, these pipelines were developed by several different teams. It is not hard to imagine how poorly this approach scales, and the issues it brought in terms of maintenance, operability and data quality. Similar pains have been shared with us by engineers at other companies. Gobblin aims to eventually replace most or all of these ingestion pipelines with a generic data ingestion framework, which is easily configurable to ingest data from several different types of sources (covering a large number of real use cases) and easily extensible for new data sources and use cases.

The challenges Gobblin addresses are five-fold:

– Source integration: The framework provides out-of-the-box adaptors for all of our commonly accessed data sources such as MySQL, Kafka, Google Analytics, Salesforce and S3.

– Processing paradigm: Gobblin supports both standalone and scalable platforms, including Hadoop and Yarn. Integration with Yarn provides the ability to run continuous ingestion in addition to scheduled batches.

– Extensibility: Data pipeline developers can integrate their own adaptors with the framework and make them leverageable by other developers in the community.

– Self-service: Data pipeline developers can compose a data ingestion and transformation flow in a self-service manner, test it locally using standalone mode, and deploy the flow in production using scale-out mode without code changes.

– Data quality assurance: The framework exposes data metrics collectors and data quality checkers as first-class citizens, which can be used to power continuous data validation.

Since we started the project in 2014, Gobblin has been launched in production and has gradually been replacing a number of ad-hoc data ingestion pipelines at LinkedIn. Gobblin was open sourced on GitHub in February 2015. We aim to provide the community with a solid framework that meets the needs of many real-world use cases, with desirable features that make the ingestion process as pleasant as eating a tasty data buffet.

2. ARCHITECTURE AND OVERVIEW

2.1 System Architecture and Components

The architecture of Gobblin is presented in Figure 1. A Gobblin job ingests data from a data source (e.g., Espresso) into a sink (e.g., HDFS). A job may consist of multiple workunits, or tasks, each of which represents a unit of work to be done, e.g., extracting data from an Espresso table partition or a Kafka topic partition.

[Figure 1: Gobblin Architecture. The diagram shows the job constructs (source, extractor, converter, quality checker, writer, publisher), the runtime (task executor, task state tracker, job launcher, job scheduler, monitoring), the deployment modes (single node, Hadoop MR, Yarn), and the compaction module.]

2.1.1 Job Constructs

A job is formulated around a set of constructs (the ellipses in Figure 1). Gobblin provides an interface for each of the constructs, thus a job can be customized by implementing these interfaces or by extending Gobblin's out-of-the-box implementations.

– A Source is responsible for partitioning the data ingestion work into a set of workunits and also for constructing an Extractor per workunit. This resembles Hadoop's InputFormat, which partitions the input into Splits and specifies a RecordReader for each split. The algorithm for generating workunits should generally divide the work as evenly as possible across the workunits. In our implementations we mainly use hash-based or bin-packing-based approaches.

– An Extractor does the work specified in a workunit, i.e., extracting certain data records. The workunit should carry the information of where to pull data from (e.g., which Kafka partition, which DB table, etc.) as well as what portion of the data should be pulled. Gobblin uses a watermark object to tell an extractor what the start record (low watermark) and end record (high watermark) are. For example, for Kafka jobs the watermarks can be offsets of a partition, and for DB jobs the watermarks can be values of a column. A watermark can be of any type, not necessarily numeric, as long as there is a way for the Extractor to know where to start and where to finish. (A minimal sketch of a watermark-bounded workunit and extractor appears after this list.)

– A Converter does data transformation on the fly while data is being pulled. There are often use cases that require such conversions, e.g., some fields in the schema need to be projected out for privacy reasons, or data pulled from the source are byte arrays or JSON objects and need to be converted to the desired output formats. A converter can convert one input record into zero or more records. Converters are pluggable and chainable, and it is straightforward to implement one: all it takes is to implement two methods for converting schema and data records, respectively. (A sketch of a simple converter also appears after this list.)

– A Quality Checker determines whether the extracted records are in good shape and can be published. There are two types of quality checking policies in terms of scope: record-level policies and task-level policies, which check the integrity of a single record and of the entire output of a task, respectively. There are also two types of quality checking policies in terms of necessity: mandatory policies and optional policies. Violation of a mandatory policy will result in the record or task output being discarded or written to an outlier folder, while violation of an optional policy will only result in warnings.

– A Writer writes records extracted by a task to the appropriate sinks. It first writes records that pass mandatory record-level policies to a staging directory. After the task successfully completes and all records of the task have been written to the staging directory, the writer moves them to a writer output directory, where they await audit against task-level quality checking policies and publication by the publisher. Gobblin's data writer can be extended to publish data to different sinks such as HDFS, Kafka or S3, to publish data in different formats such as Avro, Parquet or CSV, or even to publish data in different folder structures with different partitioning semantics. For example, at LinkedIn we publish many datasets in "hourly" and "daily" folders, which contain records with timestamps in that hour or day.

– A Publisher publishes the data written by the Writers atomically to the final job output directory. Gobblin provides two commit policies: commit-on-full-success only commits data if all tasks succeeded, while commit-on-partial-success commits data for all succeeded tasks; for each failed task, partial data will be committed if the task already moved it from the staging directory to the output directory.
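To make the watermark mechanism concrete, the following is a minimal, self-contained sketch of how a workunit bounded by a low and a high watermark might drive an extractor. The class and method names are illustrative stand-ins, not Gobblin's actual interfaces.

```java
// Illustrative sketch only; names and signatures are simplified stand-ins,
// not Gobblin's real interfaces.
import java.util.Iterator;

/** A unit of work bounded by a low and a high watermark of arbitrary type W. */
final class WorkUnit<W> {
    final String partitionId;   // e.g. a Kafka topic-partition or a DB table
    final W lowWatermark;       // where the previous run left off
    final W highWatermark;      // where this run should stop

    WorkUnit(String partitionId, W lowWatermark, W highWatermark) {
        this.partitionId = partitionId;
        this.lowWatermark = lowWatermark;
        this.highWatermark = highWatermark;
    }
}

/** An extractor pulls the records described by one workunit. */
interface Extractor<D> extends Iterator<D> {
    Object getExpectedHighWatermark();
}

/** Toy extractor for a numeric (offset-style) watermark, e.g. Kafka offsets. */
final class OffsetRangeExtractor implements Extractor<String> {
    private final WorkUnit<Long> workUnit;
    private long nextOffset;

    OffsetRangeExtractor(WorkUnit<Long> workUnit) {
        this.workUnit = workUnit;
        this.nextOffset = workUnit.lowWatermark;
    }

    @Override public boolean hasNext() { return nextOffset < workUnit.highWatermark; }

    @Override public String next() {
        // A real extractor would fetch the record at nextOffset from the source here.
        return "record@" + workUnit.partitionId + ":" + nextOffset++;
    }

    @Override public Object getExpectedHighWatermark() { return workUnit.highWatermark; }
}

final class WatermarkDemo {
    public static void main(String[] args) {
        Extractor<String> extractor = new OffsetRangeExtractor(new WorkUnit<>("topicA-0", 100L, 103L));
        while (extractor.hasNext()) System.out.println(extractor.next());
    }
}
```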

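As noted in the Converter bullet above, implementing a converter boils down to two methods, one for the schema and one for each record. Below is a hedged sketch, with simplified made-up types rather than Gobblin's real API, of a converter that drops a sensitive field from map-shaped records, in the spirit of the field-removal converter mentioned later in Section 3.1.

```java
// Illustrative sketch only; the interface below is a simplified stand-in for
// Gobblin's converter abstraction, not its actual API.
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Converts an input schema/record into an output schema and zero or more records. */
interface Converter<SI, SO, DI, DO> {
    SO convertSchema(SI inputSchema);
    List<DO> convertRecord(SO outputSchema, DI inputRecord);
}

/** Drops a configured field from map-shaped records (e.g. for privacy filtering). */
final class DropFieldConverter
        implements Converter<List<String>, List<String>, Map<String, Object>, Map<String, Object>> {

    private final String fieldToDrop;

    DropFieldConverter(String fieldToDrop) { this.fieldToDrop = fieldToDrop; }

    @Override
    public List<String> convertSchema(List<String> inputSchema) {
        // The output schema is the input schema minus the dropped field.
        return inputSchema.stream().filter(f -> !f.equals(fieldToDrop)).collect(Collectors.toList());
    }

    @Override
    public List<Map<String, Object>> convertRecord(List<String> outputSchema, Map<String, Object> inputRecord) {
        Map<String, Object> out = new HashMap<>(inputRecord);
        out.remove(fieldToDrop);
        // Returning a list allows a converter to emit zero or more records per input record.
        return Collections.singletonList(out);
    }
}

final class ConverterDemo {
    public static void main(String[] args) {
        DropFieldConverter converter = new DropFieldConverter("email");
        Map<String, Object> record = new HashMap<>();
        record.put("memberId", 42);
        record.put("email", "hidden@example");
        System.out.println(converter.convertRecord(List.of("memberId"), record));   // {memberId=42}
    }
}
```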
In addition to these constructs, Gobblin supports branching in a task flow through fork operators that allow an extracted record to be processed by different branches. Each fork may have its own converters, quality checkers, writers and publishers, and thus different forks may publish a record to different sinks, or to the same sink but in different formats. These constructs are annotated with a fork sign in Figure 1. The flow of job constructs in a typical job is shown in Figure 2.

[Figure 2: Flow of Gobblin Job Constructs in a Typical Job. The source creates workunits; for each task, the extractor extracts the schema and records, converters and quality checkers convert each record and check its quality, an optional fork splits the flow into branches, writers write the records, task-level quality checkers check the task output, and publishers publish the task data.]

2.1.2 State Store

Gobblin employs a metastore, also known as the state store, for managing job and task states across job runs. Upon completion of each run of a job, the high watermark along with other runtime metadata (e.g., the latest data schema) of each task is persisted into the state store. The next run of the same job picks up the runtime metadata of the previous run from the state store and makes the metadata available to the Source/Extractor so it can start from where the previous run left off.

Gobblin by default uses an implementation of the state store that serializes job and task states into Hadoop SequenceFiles, one per job run. Each job has a separate directory where the state store SequenceFiles of its runs are stored. Upon start, each run of a job reads the SequenceFile of the previous run in the corresponding directory to get its runtime metadata. To prevent the state store from growing indefinitely in size, Gobblin provides a utility for cleaning up the state store based on a configurable retention setting. Gobblin also allows users to plug in their own state store implementations that conform to the StateStore interface through a configuration property.
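The state store contract described above can be pictured as a small key-value interface keyed by job name. The sketch below is a deliberately simplified, in-memory stand-in (Gobblin's default implementation persists states as Hadoop SequenceFiles) showing how one run's high watermark becomes the next run's starting point; all names are illustrative.

```java
// Simplified, in-memory stand-in for the state store described above; not Gobblin's code.
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

interface StateStore {
    void put(String jobName, String key, String value);   // persist runtime metadata at end of a run
    Optional<String> get(String jobName, String key);      // read it back at the start of the next run
}

final class InMemoryStateStore implements StateStore {
    private final Map<String, Map<String, String>> states = new HashMap<>();

    @Override public void put(String jobName, String key, String value) {
        states.computeIfAbsent(jobName, j -> new HashMap<>()).put(key, value);
    }

    @Override public Optional<String> get(String jobName, String key) {
        return Optional.ofNullable(states.getOrDefault(jobName, Map.of()).get(key));
    }
}

final class StateStoreDemo {
    public static void main(String[] args) {
        StateStore store = new InMemoryStateStore();
        // End of run N: persist the high watermark reached for a partition.
        store.put("kafka-ingest", "topicA-0.high.watermark", "420000");
        // Start of run N+1: resume from where the previous run left off.
        long lowWatermark = Long.parseLong(store.get("kafka-ingest", "topicA-0.high.watermark").orElse("0"));
        System.out.println("Run N+1 starts at offset " + lowWatermark);
    }
}
```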

2.1.3 Job Execution

Once a job is created, the Gobblin job execution runtime (the diamonds in Figure 1) is responsible for running the job on the platform of the given deployment mode. Common tasks such as job/task scheduling, job/task state management, error handling and retries, monitoring and reporting, etc., are handled by the runtime.

For error handling, Gobblin employs multiple levels of defense against job and task failures. For job failures, Gobblin keeps track of the number of times a job fails consecutively and optionally sends out an alert email if the number exceeds a defined threshold, so that the owner of the job can jump in and investigate the failures. For task failures, Gobblin retries failed tasks in a job run up to a configurable maximum number of times. In addition, Gobblin also provides an option to enable retries of workunits corresponding to failed tasks across job runs: if a task still fails after all retries, the workunit from which the task was created will automatically be included in the next run of the job when this type of retry is enabled. This is useful in handling intermittent failures such as those due to a temporary data source outage.

For job scheduling, Gobblin can be integrated with job schedulers such as Oozie [7], Azkaban [10], or Chronos [12]. Gobblin also ships with a built-in job scheduler backed by a Quartz [16] scheduler, which is the default job scheduler in the standalone deployment. Gobblin decouples the job scheduler and the jobs, so that different jobs may run in different deployment settings.
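As a rough illustration of the retry behavior described above, and not Gobblin's actual runtime code, the sketch below retries a task up to a configurable maximum within a run and collects the workunits of tasks that still fail so they could be folded into the next run.

```java
// Hedged sketch of the retry policy described above; all names are illustrative.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

final class TaskRetryDemo {
    /** Runs a task up to maxAttempts times; returns true if any attempt succeeds. */
    static boolean runWithRetries(Callable<Boolean> task, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                if (task.call()) return true;
            } catch (Exception e) {
                // swallow and retry; a real runtime would log the failure and report metrics here
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> workUnits = List.of("wu-1", "wu-2", "wu-3");
        List<String> failedWorkUnits = new ArrayList<>();
        for (String wu : workUnits) {
            boolean ok = runWithRetries(() -> !wu.equals("wu-2"), 3);   // pretend wu-2 keeps failing
            if (!ok) failedWorkUnits.add(wu);
        }
        // If cross-run retries are enabled, these workunits would be included in the next job run.
        System.out.println("Carry over to next run: " + failedWorkUnits);
    }
}
```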

2.1.4 Metrics and Monitoring

Gobblin features an end-to-end system for metric collection and reporting, built on a generic metric library, for monitoring purposes. The library supports various types of metrics including counters, gauges, histograms, meters, etc., and is able to report metrics to multiple backend systems, e.g., Kafka, Graphite, JMX, and logs. It also performs automatic metric aggregation through a unique hierarchical metric collection structure in which a metric at a higher level is automatically updated upon any update to a lower-level metric with the same name. This helps reduce the amount of processing needed in the metric backend without incurring much additional overhead.
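The hierarchical aggregation described above can be pictured as counters that forward every update to their parent. The snippet below is a toy illustration of that idea, not the actual metric library used by Gobblin.

```java
// Toy illustration of hierarchical metric aggregation; not the actual metrics library.
import java.util.concurrent.atomic.AtomicLong;

final class HierarchicalCounter {
    private final String name;
    private final HierarchicalCounter parent;   // e.g. task-level counter -> job-level counter
    private final AtomicLong value = new AtomicLong();

    HierarchicalCounter(String name, HierarchicalCounter parent) {
        this.name = name;
        this.parent = parent;
    }

    /** Updating a lower-level counter automatically updates every ancestor with the same name. */
    void inc(long delta) {
        value.addAndGet(delta);
        if (parent != null) parent.inc(delta);
    }

    long get() { return value.get(); }

    public static void main(String[] args) {
        HierarchicalCounter job = new HierarchicalCounter("records.pulled", null);
        HierarchicalCounter task1 = new HierarchicalCounter("records.pulled", job);
        HierarchicalCounter task2 = new HierarchicalCounter("records.pulled", job);
        task1.inc(100);
        task2.inc(250);
        System.out.println("job total = " + job.get());   // 350, aggregated without backend work
    }
}
```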

2.1.5 Compaction

Gobblin includes a compaction module. It provides two out-of-the-box compactors, one using Hive and the other using MapReduce, both performing deduplication. For example, at LinkedIn we pull event data into hourly folders and compact them once a day into daily folders with duplicates removed. Compactors using different logic (e.g., a compactor that simply merges hourly data without deduplication, or a compactor that always picks the record with the latest timestamp during deduplication) can be directly plugged in.

2.1.6 Deployment

Gobblin is capable of running in several deployment modes (the hexagons in Figure 1) on different platforms for different scalability and resource usage requirements. Currently Gobblin can run in standalone mode on a single machine using a dedicated thread pool to run tasks, as well as in MapReduce mode on Hadoop (both Hadoop 1 and Hadoop 2 are supported) using mappers as containers to run tasks. This enables one to test a flow in standalone mode and deploy it in production using MapReduce mode without code change. A third deployment option currently under development is to run Gobblin on Yarn [18] for more flexible and efficient resource management and the ability to run as a long-running service. This comes in handy for data ingestion from streaming data sources like Kafka. Gobblin on Yarn is designed and built upon Apache Helix [20] and Apache ZooKeeper [21]. More specifically, it uses Helix to manage workunits running in a cluster of Yarn containers, with help from ZooKeeper for coordination and messaging between the Yarn ApplicationMaster and the containers. ZooKeeper is also used as a central repository for metadata of running workunits. Helix, as a framework for managing distributed resources in a cluster, guarantees that workunits are evenly distributed among the available containers and is also responsible for rebalancing in case of container failures or shutdown. One important feature of Gobblin on Yarn that helps achieve more efficient resource management is the ability to adapt to changing workloads and resize the cluster of Yarn containers dynamically at runtime.

2.2 Gobblin in Production

Gobblin has been deployed in production at LinkedIn for over six months, and is currently responsible for ingesting over 300 datasets from 12 unique sources such as Salesforce and Google Analytics. These sources are of several different types including MySQL, SQL Server, SFTP servers, S3, and REST endpoints. More and more datasets are being added. Adding a new dataset to be ingested is an easy task: our engineers just need to provide a simple configuration file in order to get their data ingested into HDFS. Our ETL and Solutions engineers no longer need to worry about data ingestion and freshness problems, and can instead focus solely on data usage.

In addition, Gobblin is also utilized for purging PII data from HDFS in order to meet data compliance requirements (e.g., deleting a member's data within a certain number of days of account closure). This can be considered an HDFS-to-HDFS ingestion job with appropriate converters for filtering. As Gobblin continues to mature, LinkedIn plans to run many more data ingestion jobs using Gobblin.

3. CASE STUDY

In this section we share some details on how Gobblin is used to extract data from Kafka and JDBC sources at LinkedIn, through which we hope to explain by example how to extend Gobblin for different use cases.

3.1 Kafka to HDFS Ingestion

At LinkedIn we continuously ingest more than 3000 topics with tens of thousands of partitions from Kafka into Hadoop clusters. These topics include member activities (such as page views and sending InMails), metrics, as well as a variety of internal events that need to be tracked. Kafka ingestion is currently handled by Camus [11], an open source pipeline specifically designed for Kafka-HDFS ingestion. Gobblin's Kafka adapter is replacing Camus at LinkedIn for better performance, stability, operability and data integrity.

Gobblin currently ingests Kafka records in small, continuous batches. Once the integration with Yarn is complete, we will be able to run Kafka ingestion in long-running, streaming mode. Next we explain how each construct in the Kafka adapter is designed and implemented.

Source. Recall that a Source is responsible for generating workunits. To ingest from Kafka, one needs to implement a Source that generates Kafka workunits. A Kafka workunit should contain, for each partition, the start and end offsets (i.e., the low watermark and high watermark) to be pulled. For each topic partition that should be ingested, we first retrieve the last offset pulled by the previous run, which should be the start offset of the current run. We also retrieve the earliest and latest offsets currently available from the Kafka cluster and verify that the start offset is between the earliest and the latest offsets. The latest available offset is the end offset to be pulled by the current workunit. Since new records are constantly published to Kafka and old records are deleted based on retention settings, the earliest and latest offsets of a partition change constantly. This puts pressure on the data ingestion pipeline, since it must pull data faster than records are published to Kafka.

For each partition, after the start and end offsets are determined, an initial workunit is created. After obtaining all initial workunits, one per partition, two steps are performed using bin-packing-based algorithms: (1) merge some workunits corresponding to partitions of the same topic into bigger workunits, so that these partitions can write to the same output file, which reduces the number of small files published on HDFS and thus reduces pressure on the NameNode; (2) assign workunits to containers and balance the workload across the containers.

Extractor. As explained above, a Kafka workunit has one or more partitions of the same topic. An Extractor is created for each workunit, which pulls partitions one by one; if a partition cannot be pulled, it skips the partition (and reports the failure in the metrics) but does not fail the task, so that different partitions are isolated and do not affect each other. After all partitions are pulled, the Extractor reports the last offset pulled for each partition to a state object, which will be persisted to the state store by the Gobblin runtime.
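To make the offset bookkeeping concrete, here is a hedged sketch of how the start and end offsets for one partition could be determined from the previous run's state and the offsets currently available in Kafka. The clamping behavior when the saved offset has fallen out of the retention window is an assumption that mirrors the verification step described above; the names are illustrative, not actual Gobblin code.

```java
// Hedged sketch of per-partition offset-range selection; names are illustrative.
final class KafkaOffsetRange {
    final long startOffset;   // low watermark for this run
    final long endOffset;     // high watermark for this run

    private KafkaOffsetRange(long startOffset, long endOffset) {
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }

    /**
     * @param lastPulled  last offset pulled by the previous run (from the state store)
     * @param earliest    earliest offset currently available in Kafka (retention boundary)
     * @param latest      latest offset currently available in Kafka
     */
    static KafkaOffsetRange forPartition(long lastPulled, long earliest, long latest) {
        long start = lastPulled;
        if (start < earliest) {
            // The saved offset has expired from Kafka's retention window: data was lost,
            // so start from the earliest available offset (and report the gap in metrics).
            start = earliest;
        } else if (start > latest) {
            start = latest;   // nothing new to pull
        }
        return new KafkaOffsetRange(start, latest);
    }

    public static void main(String[] args) {
        KafkaOffsetRange range = KafkaOffsetRange.forPartition(1_000L, 5_000L, 90_000L);
        System.out.println(range.startOffset + " .. " + range.endOffset);   // 5000 .. 90000
    }
}
```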

Converter. LinkedIn's Kafka clusters mainly store records in Avro format. Gobblin currently has a few Avro converters that can be used for ingestion from Kafka and any other source with Avro data, including a converter for filtering Avro records based on field values, a converter for removing fields from Avro records, a converter for converting Avro records to JSON objects, etc. The one most often used at LinkedIn is the field-removal converter, since certain sensitive fields need to be removed from Avro records before they can be published.

Quality Checker. Since we ingest Kafka data into time-partitioned (i.e., hourly and daily) folders, we use a mandatory record-level quality policy which checks whether each record has a timestamp field. If so, the record can be written to the appropriate folder; otherwise it has to be discarded or written to an outlier folder. We also use another policy to verify that the primary key field of a record exists and is non-null.

Writer and Publisher. For Kafka ingestion we implement a time-partitioned writer and a time-partitioned publisher, which write and publish each record to the appropriate time-partitioned folder based on its timestamp.
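A time-partitioned writer of the kind described above essentially maps each record's timestamp to an hourly or daily folder. The following sketch shows one plausible way to compute such a path; the folder layout is an assumption for illustration, not LinkedIn's exact convention.

```java
// Illustrative only: maps a record timestamp to an hourly output folder.
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

final class TimePartitionedPath {
    private static final DateTimeFormatter HOURLY =
            DateTimeFormatter.ofPattern("yyyy/MM/dd/HH").withZone(ZoneOffset.UTC);

    /** e.g. a timestamp in the 14:00 UTC hour of 2015-08-31 -> "PageViewEvent/hourly/2015/08/31/14" */
    static String hourlyFolder(String dataset, long timestampMillis) {
        return dataset + "/hourly/" + HOURLY.format(Instant.ofEpochMilli(timestampMillis));
    }

    public static void main(String[] args) {
        System.out.println(hourlyFolder("PageViewEvent", System.currentTimeMillis()));
    }
}
```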
3.2 JDBC to HDFS Ingestion

The JDBC protocol offers a clean abstraction for using Java code to query and manipulate a variety of relational data stores. Today, LinkedIn pulls about 250 tables from both MySQL and SQL Server.

An important difference between JDBC ingestion and Kafka ingestion is that database tuples are mutable while Kafka records are not. Therefore, JDBC ingestion not only needs to pull new tuples but also needs a way to handle updated tuples. We explain how this is done after discussing the constructs for JDBC ingestion in this subsection.

Source. Gobblin provides two Sources for JDBC: a MySQL Source and a SQL Server Source. They use the same algorithm for generating workunits; the only difference is that the MySQL Source uses the MySQL Extractor and the SQL Server Source uses the SQL Server Extractor.

To generate workunits, similar to the Kafka Source, for each table that should be pulled we retrieve the latest timestamp of a tuple pulled by the previous run and use it as the low watermark; the current time is used as the high watermark. We then divide the interval from the low watermark to the high watermark into a configurable number of partitions of equal length. A workunit is created for each partition.
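The workunit generation just described amounts to slicing the interval between the low and high watermarks into equal pieces. Below is a hedged sketch with illustrative names, not the actual Source implementation.

```java
// Illustrative sketch of splitting a timestamp interval into equal-length workunits.
import java.util.ArrayList;
import java.util.List;

final class JdbcWorkUnitPlanner {
    record TimeRange(long startMillis, long endMillis) {}

    /** Divides [lowWatermark, highWatermark) into numPartitions ranges of (almost) equal length. */
    static List<TimeRange> partition(long lowWatermark, long highWatermark, int numPartitions) {
        List<TimeRange> ranges = new ArrayList<>();
        long span = highWatermark - lowWatermark;
        for (int i = 0; i < numPartitions; i++) {
            long start = lowWatermark + span * i / numPartitions;
            long end = lowWatermark + span * (i + 1) / numPartitions;
            ranges.add(new TimeRange(start, end));   // each range becomes one workunit
        }
        return ranges;
    }

    public static void main(String[] args) {
        long high = System.currentTimeMillis();                      // now = high watermark
        long low = high - 6 * 60 * 60 * 1000;                        // previous run's watermark
        partition(low, high, 4).forEach(System.out::println);
    }
}
```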

Extractor. The MySQL and SQL Server Extractors fetch the tuples with the specified timestamp values from the databases using SQL queries, and return them as JSON elements. Similar to the Kafka Extractor, they report the latest timestamp pulled, which is persisted to the state store and available for the next run.

Converter and Quality Checker. We store ingested data in HDFS in Avro format. Therefore, we use a converter to convert JSON elements into Avro records. We also use a task-level quality checking policy to ensure that the number of records written is the same as the number of rows returned by a count query.

Writer and Publisher. Unlike Kafka ingestion, we do not write database tuples into time-partitioned folders for JDBC ingestion, since users of the data often need to access an entire table rather than tuples in a specific hour or day. We use a simple writer and publisher which publish the records of each table in a single folder.

Handling Tuple Updates. As mentioned above, in JDBC ingestion we need a way to handle tuple updates. A trivial method is to pull each table in its entirety in each run, which is hardly feasible for large tables and inefficient for tables with only a small portion of updated tuples between two runs.

We use an improved approach for such tables which involves multiple job types for JDBC ingestion. A snapshot job, which pulls an entire table, is scheduled at a relatively low frequency. An append job only pulls the tuples whose modification timestamps are later than the latest timestamp pulled by the previous run (which may be either a snapshot job or an append job). Append jobs are scheduled at higher frequencies. Note that this requires the table to have a modification timestamp column. Also note that different versions of a tuple may exist in the extracted snapshot and append data files.

Users may directly consume the extracted snapshot and append files if their applications can live with the existence of multiple versions of a tuple, or if they have a way to pick the latest version of a tuple when multiple versions exist. Otherwise, Gobblin's compaction module, introduced in Section 2.1.5, can be used to compact snapshot and append files. Alternatively, users can run a snapshot-append job, which runs an append job followed by a compaction.
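When snapshot and append files are compacted, the essential operation is keeping only the latest version of each tuple by primary key. The following minimal sketch illustrates that deduplication step with made-up types; the real compactors run as Hive or MapReduce jobs.

```java
// Minimal sketch of "keep the latest version per key" deduplication used in compaction.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SnapshotAppendDedup {
    record Tuple(String primaryKey, long modificationTime, String payload) {}

    /** Merges snapshot and append tuples, keeping the version with the latest modification time. */
    static Map<String, Tuple> compact(List<Tuple> snapshot, List<Tuple> appends) {
        Map<String, Tuple> latest = new HashMap<>();
        for (List<Tuple> batch : List.of(snapshot, appends)) {
            for (Tuple t : batch) {
                latest.merge(t.primaryKey(), t,
                        (a, b) -> a.modificationTime() >= b.modificationTime() ? a : b);
            }
        }
        return latest;
    }

    public static void main(String[] args) {
        List<Tuple> snapshot = List.of(new Tuple("42", 100L, "old profile"));
        List<Tuple> appends = List.of(new Tuple("42", 200L, "updated profile"));
        System.out.println(compact(snapshot, appends).get("42").payload());   // updated profile
    }
}
```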
Apache Sqoop [9] is another data ingestion framework, whose focus is SQL-to-Hadoop data ingestion. Compared to Sqoop, in Gobblin we aim to provide the right granularity of operators, which can be easily composed into an ingestion flow while getting the maximum leverage out of the building blocks, and easily extended for different use cases. Converter, quality checker and fork operator are a few examples of such operators, which are not currently available in Sqoop. Besides, as explained, Gobblin supports multiple types of ingestion as well as compaction. While Sqoop supports incremental ingestion, it does not have anything comparable to snapshot-append jobs, nor does it support compaction.

4. RELATED WORK

The most comparable system to Gobblin is Apache Sqoop [9], which was originally designed to be a SQL-to-Hadoop ingestion tool and has since become a generic framework for multiple types of sources and destinations. Next we summarize the main differences between Gobblin and Sqoop in a few categories.

– In terms of goals, one of the goals of Gobblin from the beginning was the unification of continuous streaming ingestion with scheduled batch ingestion. We believe that the two modes of ingestion do not need completely different, specialized engines; instead they can be handled by a single framework with an appropriate level of abstraction on how data ingestion workunits are defined. This is one of the big differentiators with Sqoop, which is architected as a batch-oriented ingestion solution. The streaming capability of Gobblin based on Yarn is being developed and will be available in the coming months.

– In terms of modularity and componentization, Sqoop does not provide some of the modules that Gobblin offers as building blocks for data ingestion jobs, such as Converter, quality checker and fork operator, as explained in Section 3.2.

– In terms of data semantics, Gobblin supports multiple types of ingestion as well as compaction, as mentioned in Section 3.2. These are not supported in Sqoop.

– In terms of job/task state management and failure handling, a state store is maintained internally in Gobblin for managing job states, as explained in Section 2.1.2. Gobblin uses watermarks for managing incremental pulls, which can be of both simple and complex types. By contrast, Sqoop provides a less powerful feature for incremental import, which is based on a single numeric value of a single attribute. To the best of our knowledge, Sqoop also does not provide an easy way to maintain such checkpoints and other job states.

– In terms of operability, Gobblin has a complete end-to-end metrics collection and reporting system, explained in Section 2.1.4. This is, to our knowledge, not available in other data ingestion frameworks including Sqoop.

There are also a few specialized open-source tools for data ingestion, such as Apache Flume [3] for ingesting log data, Aegisthus [1] for extracting data from Cassandra, and Morphlines [13], a data transformation tool similar to Gobblin's converters, which can be plugged into data ingestion frameworks such as Sqoop and Flume. In comparison, Gobblin is designed to be far more general and extensible, which enables an organization to use a single framework for different types of data ingestion.

5. FUTURE WORK

Gobblin is under active development. We are extending Gobblin along the following directions:

– To achieve the goal of unifying the processing paradigm between batch and streaming ingestion flows, we are building a new execution framework based natively on Yarn. The new execution framework will deliver much better SLAs, load balancing and elasticity.

– Gobblin currently uses a record-level abstraction in its execution flow, i.e., the unit to be pulled is a record. We will extend Gobblin beyond record-level abstraction and add a file-level abstraction, which will enable Gobblin to be used as a high-performance file transfer service across various file systems, including NFS, HDFS, S3, etc., as well as across data centers or regions.

– We are planning to expand the integration points with workflow schedulers for Gobblin jobs beyond the native Quartz scheduler and Azkaban. For example, Oozie and Chronos are at the top of our list.

6. CONCLUSION

We presented Gobblin, a data ingestion framework for Hadoop. Gobblin is designed to be generic and extensible so that it can be used for a variety of data ingestion use cases, as well as easily operable and monitorable. It is currently used to ingest data from a number of sources at LinkedIn into Hadoop, and is slated to take over several special-purpose data ingestion pipelines currently in use as it further matures.

7. ACKNOWLEDGEMENT

We'd like to extend our appreciation to our partner teams for their strong support and valuable help during the development and deployment of Gobblin at LinkedIn: the Data Services Operation team (Neil Pinto, Sandhya Ramu, Tu Tran, Teja Thotapalli, Vamshi Hardageri and Dhawal Chohan), the Hadoop Operation team (Bob Liu, Anil Alluri, Adam Faris and Chen Qiang), and the BI and Solution team (Daisy Lu, Qun Li, Ranjith Prabu and Subbu Sankar). We also thank our alumni, Henry Cai and Ken Goodhope, for their significant contributions to Gobblin.

8. REFERENCES

[1] Aegisthus. https://github.com/Netflix/aegisthus.
[2] Apache Flink. https://flink.apache.org/.
[3] Apache Flume. https://flume.apache.org/.
[4] Apache Giraph. http://giraph.apache.org/.
[5] Apache Hadoop. https://hadoop.apache.org/.
[6] Apache Kafka. http://kafka.apache.org/.
[7] Apache Oozie. http://oozie.apache.org/.
[8] Apache Spark. https://spark.apache.org/.
[9] Apache Sqoop. http://sqoop.apache.org/.
[10] Azkaban: Open-source Workflow Manager. http://azkaban.github.io/.
[11] Camus. https://github.com/linkedin/camus.
[12] Chronos. http://nerds.airbnb.com/introducing-chronos/.
[13] Morphlines. http://cloudera.github.io/cdk/docs/current/cdk-morphlines/index.html.
[14] Pinot. https://github.com/linkedin/pinot.
[15] Presto. https://prestodb.io/.
[16] Quartz Scheduler. http://quartz-scheduler.org/.
[17] RocksDB. http://rocksdb.org/.
[18] YARN. http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html.
[19] S. Chen. Cheetah: A High Performance, Custom Data Warehouse on Top of MapReduce. PVLDB, 3(2):1459–1468, 2010.
[20] K. Gopalakrishna, S. Lu, Z. Zhang, A. Silberstein, K. Surlaker, R. Subramonian, and B. Schulman. Untangling Cluster Management with Helix. In SoCC, 2012.
[21] P. Hunt, M. Konar, F. P. Junqueira, and B. Reed. ZooKeeper: Wait-free Coordination for Internet-scale Systems. In USENIX, 2010.
[22] M. Kornacker, A. Behm, V. Bittorf, T. Bobrovytsky, C. Ching, A. Choi, J. Erickson, M. Grund, D. Hecht, M. Jacobs, I. Joshi, L. Kuff, D. Kumar, A. Leblang, N. Li, I. Pandis, H. Robinson, D. Rorke, S. Rus, J. Russell, D. Tsirogiannis, S. Wanderman-Milne, and M. Yoder. Impala: A Modern, Open-Source SQL Engine for Hadoop. In CIDR, 2015.
[23] L. Qiao, K. Surlaker, S. Das, T. Quiggle, B. Schulman, B. Ghosh, A. Curtis, O. Seeliger, Z. Zhang, A. Auradkar, C. Beaver, G. Brandt, M. Gandhi, K. Gopalakrishna, W. Ip, S. Jagadish, S. Lu, A. Pachev, A. Ramesh, A. Sebastian, R. Shanbhag, S. Subramaniam, Y. Sun, S. Topiwala, C. Tran, J. Westerman, and D. Zhang. On Brewing Fresh Espresso: LinkedIn's Distributed Data Serving Platform. In SIGMOD Conference, pages 1135–1146, 2013.
[24] V. Raman, G. Swart, L. Qiao, F. Reiss, V. Dialani, D. Kossmann, I. Narang, and R. Sidle. Constant-Time Query Processing. In ICDE, pages 60–69, 2008.
[25] R. Sumbaly, J. Kreps, L. Gao, A. Feinberg, C. Soman, and S. Shah. Serving Large-scale Batch Computed Data with Project Voldemort. In FAST, 2012.
