Online Analytical Processing on Hadoop Using Apache Kylin
Total Page:16
File Type:pdf, Size:1020Kb
International Journal of Applied Information Systems (IJAIS) – ISSN : 2249-0868 Foundation of Computer Science FCS, New York, USA Volume 12 – No. 2, May 2017 – www.ijais.org Online Analytical Processing on Hadoop using Apache Kylin Supriya Vitthal Ranawade Shivani Navale Akshay Dhamal Bachelor of Engineering (I.T.) Bachelor of Engineering (I.T.) Bachelor of Engineering (I.T.) P.E.S Modern College of P.E.S Modern College of P.E.S Modern College of Engineering Engineering Engineering Department of I.T. Department of I.T. Department of I.T. Kuldeep Deshpande Chandrashekhar Ghuge Founder and CEO Assistant Professor Ellicium Solutions P.E.S Modern College of Engineering Department of I.T. ABSTRACT There has been a lot of research on how Hadoop is used as an In the Big Data age, it is necessary to remodel the traditional alternative for data warehousing (e.g. 1, 4, 7, 9, and 11). Only data warehousing and Online Analytical Processing (OLAP) storing the data in the datawarehouse is not profitable to the system. Many challenges are posed to the traditional business until, the underlying data is analysed to gain business platforms due to the ever increasing data. In this paper, we insights. Online Analytical Processing (OLAP) is an approach have proposed a system which overcomes the challenges and to answer multidimensional analytical queries swiftly, and is also beneficial to the business organizations. The proposed provides support for decision-making and intuitive result system mainly focuses on OLAP on Hadoop platform using views for queries [8].However, the traditional OLAP Apache Kylin. Kylin is an OLAP engine which builds OLAP implementation, namely the ROLAP system based on cubes from the data present in hive and stores the cubes in RDBMS, appears to be inadequate in face of big data HBase for further analysis. The cubes stored in HBase are environment [8]. New massively parallel data architectures analyzed by firing SQL-like analytics queries. Also, reports and analytic tools go beyond traditional parallel SQL and dashboards are further generated on the underlying cubes datawarehouses and OLAP engines [8]. Therefore, some that provides powerful insights of the company data. This databases such as SQL Server and MySQL are able to provide helps the business users to take decisions that are profitable to OLAP-like operations, but the performance cannot be the organization. The proposed system is a boon to small as satisfactory [14]. There is a need to design a system for well as large scale business organizations. The aim of the solving the problems of computing OLAP data cubes over Big paper is to present a system which builds OLAP cubes on Data. Hadoop and generate insightful reports for business users. This is the main motivation behind designing an efficient system which can be beneficial to many. In this paper, we Keywords have proposed a system which overcomes some critical Analytics, OLAP, Hadoop, Kylin, Hive, Datawarehouse problems faced by traditional data warehousing and OLAP. The proposed system is advantageous to businesses is many 1. INTRODUCTION scenarios, to help them take smart and profitable decision for 1.1 Background their organization. Complexity and volume of the data in world is exploding. Data is being collected and stored at unprecedented rates. [3] 1.2 Related Work Because of the huge data volumes, many companies do not OLAP was introduced in the work done by [14]. Considering keep their big data, and thus do not realize any value out of the past two decades have seen explosive growth, both the this [3]. Big Companies that want to truly benefit from big number of products and services offered and in the adoption data must also integrate these new types of information with of these technologies in industry, their other work [14] gave traditional corporate data, and fit the insight they glean into the introductions of OLAP and data warehouse technologies their existing business processes and operations [3].This is based on the new challenges of massively parallel data posing various challenges to the traditional platforms. Due to architecture [8]. the intrinsic nature of Big Data application scenarios ,it is Paper [15] presented OLAP approach based on MapReduce natural to adopt Data Warehousing and OLAP methodologies parallel framework. First, a file type was designed, which was with the goal of collecting, extracting, transforming, loading, based on SMS (Short Message Service) data structure. warehousing and OLAPing such kinds of datasets, by adding significant add-ons supporting analytics over Big Data [13]. In [6] the authors have illustrated a data cube model for XML Almost every day, we see another article on the role that big documents to meet the increasing demand for analyzing data plays in improving profitability, increasing productivity, massive XML data in OLAP. They have also proposed a basic solving difficult scientific questions, as well as many other algorithm to construct the XDCM on Hadoop. To improve areas where big data is solving problems and helping us make efficiency, they have offered some strategies and described an better decisions [5]. optimized algorithm. 1 International Journal of Applied Information Systems (IJAIS) – ISSN : 2249-0868 Foundation of Computer Science FCS, New York, USA Volume 12 – No. 2, May 2017 – www.ijais.org Paper [8] presents the design, implementation, and evaluation computing OLAP data cubes over Big Data, mainly due to of HaoLap, an OLAP system for big data. HaoLap is built on two intrinsic factors of Big Data repositories [13]: Hadoop and based on the proposed models and algorithm: (1) specific multidimensional model to map the dimensions and 1. Size, which becomes really explosive in such data sets; the measures; (2) the dimension coding and traverse algorithm [13] to achieve the roll up operation on hierarchy of dimension 2. Complexity (of multidimensional data models), which values; (3) the chunk model and partition strategy to shard the can be very high in such data sets (e.g., cardinality cube; (4) the linearization and reverse linearization algorithm mappings, irregular hierarchies, dimensional attributes to store chunks and cells; (5) the chunk selection algorithm to etc.).[13] optimize OLAP performance; and (6) the MapReduce based OLAP and data loading algorithm.[8] They compared HaoLap Some of the critical problems discussed in paper [13] that performance with Hive, HadoopDB, HBaseLattice, and arise when computing OLAP cube are as follows: Olap4Cloud on several big datasets and OLAP 3. Size: fact tables can easily become huge when computed applications.[8] over Big Data sets – this adds severe computational issues as the size can become a real bottleneck from 1.3 Overview practical applications [13]. This paper is organised as follows. In section 2, the limitations of the existing data warehouse and OLAP platforms are 4. Complexity: Building OLAP data cubes over Big Data mentioned. The next section i.e. section 3 consists of the also implies complexity problems which do not arise in features, system architecture and the implementation of the traditional OLAP settings (e.g., in relational proposed system. Advantages of the proposed system and environments) – for instance, the number of dimensions limitations of the DW and OLAP on Hadoop platform are can really become explosive, due to the strongly mentioned in section 4 and 5 respectively. unstructured nature of Big Data sets, as well as there could be multiple (and heterogeneous) measures for such 2. LIMITATIONS OF EXISTING DW data cubes [13]. AND OLAP PLATFORMS 5. Design: Designers must move the attention on the Some challenges related to the traditional datawarehouse following critical questions:[13] platforms mentioned in [1] are as follows: a. What is the overall building time of the data cube to Table 1: Challenges related to traditional DW platforms be designed (computing aggregations over Big Data Challenges Description may become prohibitive)? Poor Query Approaches like 64 bit computing, b. How the data cube should be updated? Performance/R increasing memory, MPP systems, c. Which maintenance plan should be selected? esponse [1] and columnar databases have been implemented to solve this challenge d. Which building strategy should be adopted? still it is remains a number one challenge [1]. 6. End-user performance: OLAP data cubes computed over Big Data tend to be huge, hence end-user No support for Traditional RDBMS based performance easily becomes poor on such cubes, advanced datawarehouse platforms generally do especially during the aggregation and query phases – analytics [1] not support these advanced analytic therefore, it follows that end-user performance must be functions using SQL [1]. Advanced included as a critical factor within the design process of analytics is performed outside OLAP data cubes over Big Data.[13] RDBMS using hand coded platforms 7. Analytics: There exist several problems to be [4]. investigated, running from how to design an analytical High hardware Reduced cost of hardware and support process over OLAP data cubes computed on top of Big cost [1] per additional volume, number of Data to how to optimize the execution of so-obtained users and complexity of analysis is an analytical processes, and from the seamless integration of important requirement that OLAP (Big) data cubes with other kinds of unstructured datawarehouse platforms have to information (within the scope of analytics).[13] satisfy [1]. The above mentioned problems that arise when computing OLAP cubes traditionally, could possibly be overcome by using Kylin as an OLAP solution on Hadoop platform. No support for Lack of ability to scale up on demand on demand with minimal cost and ramp up time is 3. PROPOSED SYSTEM workload [1] a major challenge to existing 3.1 Features datawarehouse platforms [1]. Handles huge amount of complex data with simplicity- Hadoop is known to handle a variety of data with simplicity. Quick responses to the business-related queries These challenges are overcome by using Hadoop as a DW compared to other options such as Hive etc.