
International Journal of Engineering Research & Technology (IJERT) ISSN: 2278-0181 Vol. 1 Issue 4, June - 2012

A Vision towards Column-Oriented Databases
Punam Bajaj, Simranjit Kaur Dhindsa
Department of Computer Science & Engineering, Chandigarh Engg. College, Landran, Mohali, India

Abstract:

This paper describes the basic difference between column-oriented databases and traditional row-oriented databases. As applications require more storage and easier availability of data, these demands are met by better and faster techniques [1]. Column-oriented database systems (column-stores) have attracted a lot of attention in the past few years. A column-oriented DBMS is a database management system (DBMS) that stores its content by column rather than by row, as row-oriented databases do. This has advantages for data warehouses and library catalogues, where aggregates are computed over large numbers of similar data items [4]. In this paper, we discuss how column-oriented databases can perform better than traditional row-oriented DBMSs. The paper focuses on conveying an understanding of columnar databases and their proper utilization within the enterprise.

Keywords: Column-Stores, Column-oriented DBMS, Data Warehouses, Columnar Database.

I Introduction:

Faced with massive data sets, a growing user population, and performance-driven service level agreements, organizations everywhere are under extreme pressure to deliver analyses faster and to more people than ever before. That means businesses need faster performance to support rapid business decisions, added applications, and better system utilization.

As data volumes continue to increase, driven by everything from longer detailed histories to the need to accommodate big data, companies require a solution that allows their data warehouse to run more applications and to be more responsive to changing business environments. They also need a simple, self-managing system that boosts performance while reducing administrative complexity and expense. Column-oriented DBMSs are designed to provide scalability, high availability and self-managing administration [5].

Fig 1. Base concept

In Fig 1, starting with a generic table, there are two obvious ways to map database tables onto a one dimensional interface: store the table row-by-row or store the table column-by-column. The row-by-row approach keeps all information about an entity together. In the example of Fig 1, it will store all information about the first employee, then all information about the second employee, etc. The column-by-column approach keeps all attribute information together: all of the employee ids will be stored consecutively, then all of the employee jobs, etc. Both approaches are reasonable designs and typically a choice is made based on performance
expectations. If the expected workload tends to access data on the granularity of an entity (e.g., find an employee, add an employee, delete an employee), then row-by-row storage is preferable since all of the needed information will be stored together. On the other hand, if the expected workload tends to read per query only a few attributes from many records (e.g., a query that finds the most common e-mail address domain), then column-by-column storage is preferable since irrelevant attributes for a particular query do not have to be accessed [2].

Traditionally, row-oriented databases are better suited for transactional environments, such as a call center where a customer's entire record is required when their profile is retrieved. Column-oriented databases are better suited for analytics, where only portions of each record are required. By grouping the values of each column together, the database only needs to retrieve the columns that are relevant to the query, greatly reducing the overall I/O needed [6].

II Core difference between Columnar and Row-oriented Databases:

The world of relational database systems is a two-dimensional world. Data is stored in tabular data structures where rows correspond to distinct real-world entities or relationships, and columns are attributes of those entities. There is, however, a distinction between the conceptual and physical properties of database tables. The aforementioned two-dimensional property exists only at the conceptual level. At a physical level, database tables need to be mapped onto a one dimensional structure before being stored. This is because common computer storage media (e.g. magnetic disks or RAM), despite ostensibly being multi-dimensional, provide only a one dimensional interface. For example, a database might have this table [2].

Fig 2. Two Dimensional Table

This simple table includes an employee identifier (EmpId), name fields (Lastname and Firstname) and a salary (Salary). The database must coax its two-dimensional table into a one-dimensional series of bytes, for the operating system to write it to either the RAM, or hard drive, or both. A row-oriented database serializes all of the values in a row together, then the values in the next row, and so on.

1, Wilson, Joe, 40000;
2, Yaina, Mary, 50000;
3, Johnson, Cathy, 44000;

A column-oriented database serializes all of the values of a column together, then the values of the next column, and so on.

1, 2, 3;
Wilson, Yaina, Johnson;
Joe, Mary, Cathy;
40000, 50000, 44000;

This is a simplification. Partitioning, indexing, caching, views, OLAP cubes, and transactional systems such as write ahead logging or multiversion concurrency control all dramatically affect the physical organization [4].

III Limitations of Row oriented DBMS’s:

Historically, database system implementations and research have focused on the row-by-row data layout, since it performs best on the most common application for database systems: business transactional data processing. However, there is a set of emerging applications for database systems for which the row-by-row layout performs poorly. These applications are more analytical in nature, and their goal is to read
through the data to gain new insight and use it to drive decision making and planning. The nature of the queries to data warehouses (analytical databases) is different from the queries to transactional databases. Queries tend to be:

1) Less Predictable: In the transactional world, since databases are used to automate business tasks, queries tend to be initiated by a specific set of predefined actions. As a result, the basic structure of the queries used to implement these predefined actions is coded in advance, with variables filled in at run-time. In contrast, queries in the data warehouse tend to be more exploratory in nature. They can be initiated by analysts who create queries in an ad-hoc, iterative fashion.
2) Longer Lasting: Transactional queries tend to be short, simple queries (“add a customer”, “find a balance”). In contrast, data warehouse queries, since they are more analytical in nature, tend to have to read more data to yield information about data in aggregate rather than individual records.
3) More Read-Oriented Than Write-Oriented: Analysis is naturally a read-oriented endeavor. Typically data is written to the data warehouse in batches, followed by many read-only queries. Occasionally data will be temporarily written for “what-if” analyses, but on the whole, most queries will be read-only.
4) Attribute-Focused Rather Than Entity-Focused: Data warehouse queries typically do not query individual entities; rather they tend to read multiple entities and summarize or aggregate them. Further, they tend to focus on only a few attributes at a time rather than all attributes.

As a consequence of these query characteristics, storing data row-by-row is no longer the obvious choice; in fact, especially as a result of the latter two characteristics, the column-by-column storage layout can be better [2].

IV Evolution of Column Oriented DBMSs:

The following are some cited advantages of column-stores:

Improved bandwidth utilization: In a column-store, only those attributes that are accessed by a query need to be read off disk (or from memory into cache). In a row-store, surrounding attributes also need to be read since an attribute is generally smaller than the smallest granularity in which data can be accessed.

Improved data compression: Storing data from the same attribute domain together increases locality and thus data compression ratio (especially if the attribute is sorted). Bandwidth requirements are further reduced when transferring compressed data.

Improved code pipelining: Attribute data can be iterated through directly without indirection through a tuple interface. This results in high IPC (instructions per cycle) efficiency, and code that can take advantage of the super-scalar properties of modern CPUs.

Improved cache locality: A cache line also tends to be larger than a tuple attribute, so cache lines may contain irrelevant surrounding attributes in a row-store. This wastes space in the cache and reduces hit rates [3,4].

Additional Considerations:

In addition to better performance, the column orientation of a column-based database supplies a number of useful benefits to those wishing to deploy fast databases.


First, there is no need for indexing as with traditional row-based databases. The elimination of indexing means: (1) less overall storage is consumed in columnar databases because indexes in legacy RDBMS’s often balloon the storage cost of a database to double or more the initial data size; (2) data load speed is increased because no indexes need to be maintained; (3) ad-hoc DML work speed is increased because no index updates are performed; (4) no indexing design or tuning work is imposed on the database IT staff.

Second, there is far less design work forced on database architects when a column-based database is used. The need for complicated partitioning schemes, materialized view or summary table designs, and other such work is removed because column databases need none of these components to achieve superior query performance.

Data Compression:

Compression is a technique used by many DBMSs to increase performance. Compression improves performance by reducing the size of data on disk, decreasing seek times, increasing the data transfer rate and increasing buffer pool hit rate [7]. One of the most-often cited advantages of column-stores is data compression. Intuitively, data stored in columns is more compressible than data stored in rows. Compression algorithms perform better on data with low information entropy (high data value locality) [1]. Imagine a database table containing information about customers (name, phone number, e-mail address, etc.). Storing data in columns allows all of the names to be stored together, all of the phone numbers together, etc. Certainly phone numbers will be more similar to each other than to surrounding text fields like e-mail addresses or names. Further, if the data is sorted by one of the columns, that column will be super-compressible. Column data is of uniform type; therefore, there are some opportunities for storage size optimizations available in column-oriented data that are not available in row-oriented data. Compression is useful because it helps reduce the consumption of expensive resources, such as hard disk space or transmission bandwidth.

Infobright is an example of an open source column-oriented database built for high-speed reporting and analytical queries, especially against large volumes of data. Data that required 450 GB of storage using SQL Server required only 10 GB with Infobright, due to Infobright’s massive compression and the elimination of all indexes. Using Infobright, the overall compression ratio seen in the field is 10:1. Some customers have seen results of 40:1 and higher. E.g., 1 TB of raw data compressed 10 to 1 would only require 100 GB of disk capacity [6].

V Conclusion:

Column-oriented DBMS is an enhanced approach to service the needs of Business Intelligence (BI), data warehouse, and analytical applications where scalability, performance and simplicity are paramount. It delivers a future-proof infrastructure with the ability to scale from a single database node to a multi-node configuration. When a mountain of data must be analysed, column database technology offers scalable performance and can deliver results faster than legacy row-oriented databases. The columnar database is evolving software that can overcome the limitations of row-oriented databases. It provides a range of benefits to an environment that needs to improve the performance of its overall analytic workload. It is usually not difficult to find important workloads that are column selective, and therefore benefit tremendously from a columnar orientation.


VI Future Work:

Columnar database benefits are enhanced with larger amounts of data, large scans and I/O bound queries. While providing performance benefits, columnar databases also have unique abilities to compress their data. Therefore, columnar databases can now be used in a data mart or a large integrated data warehouse [5]. Oracle describes the Exadata columnar compression scheme, in which Hybrid Columnar Compression on Exadata enables the highest levels of data compression and provides enterprises with tremendous cost savings and performance improvements due to reduced I/O. Average storage savings can range from 10x to 15x depending on which Exadata Hybrid Columnar Compression feature is implemented; customer benchmarks have resulted in storage savings of up to 204x. Exadata Hybrid Columnar Compression is an enabling technology for two new Oracle Exadata Storage Server features: Warehouse Compression and Archive Compression. Future work can explore Exadata Hybrid Columnar Compression as the next generation in compression technology [8].

VII References:

[1] Daniel J. Abadi, Peter A. Boncz, Stavros Harizopoulos, Column-oriented Database Systems, VLDB ’09, August 24-28, 2009, Lyon, France.
[2] Daniel J. Abadi, Query Execution in Column-Oriented Database Systems.
[3] D. J. Abadi, S. R. Madden, N. Hachem, Column-stores vs. row-stores: how different are they really?, in: SIGMOD ’08, 2008, pp. 967–980.
[4] Column-Oriented DBMS, Wikipedia.
[5] TeraData Columnar, www.TeraData.com
[6] Infobright, Analytic Applications With PHP and a Columnar Database (2010), 403-47 Colborne St, Toronto, Ontario M5E 1P8, Canada.
[7] Miguel C. Ferreira, Compression and Query Execution within Column Oriented Databases.
[8] Oracle Exadata, Hybrid Columnar Compression (HCC) on Exadata.
