Apache Hive: from Mapreduce Toenterprise-Grade Big Data
Apache Hive: From MapReduce to Enterprise-grade Big Data Warehousing Jesús Camacho-Rodríguez, Ashutosh Chauhan, Alan Gates, Eugene Koifman, Owen O’Malley, Vineet Garg, Zoltan Haindrich, Sergey Shelukhin, Prasanth Jayachandran, Siddharth Seth, Deepak Jaiswal, Slim Bouguerra, Nishant Bangarwa, Sankar Hariappan, Anishek Agarwal, Jason Dere, Daniel Dai, Thejas Nair, Nita Dembla, Gopal Vijayaraghavan, Günther Hagleitner Hortonworks Inc. ABSTRACT consisted of (i) reading huge amounts of data, (ii) executing Apache Hive is an open-source relational database system for transformations over that data (e.g., data wrangling, consoli- analytic big-data workloads. In this paper we describe the key dation, aggregation) and finallyiii ( ) loading the output into innovations on the journey from batch tool to fully fledged other systems that were used for further analysis. enterprise data warehousing system. We present a hybrid As Hadoop became a ubiquitous platform for inexpensive architecture that combines traditional MPP techniques with data storage with HDFS, developers focused on increasing more recent big data and cloud concepts to achieve the scale the range of workloads that could be executed efficiently and performance required by today’s analytic applications. within the platform. YARN [56], a resource management We explore the system by detailing enhancements along four framework for Hadoop, was introduced, and shortly after- main axis: Transactions, optimizer, runtime, and federation. wards, data processing engines (other than MapReduce) such We then provide experimental results to demonstrate the per- as Spark [14, 59] or Flink [5, 26] were enabled to run on formance of the system for typical workloads and conclude Hadoop directly by supporting YARN. with a look at the community roadmap.
[Show full text]