A Vision Towards Column-Oriented Databases Punam Bajaj Simranjit Kaur Dhindsa Department of Computer Science & Engineering Chandigarh Engg

International Journal of Engineering Research & Technology (IJERT) ISSN: 2278-0181 Vol. 1 Issue 4, June - 2012 A Vision towards Column-Oriented Databases Punam Bajaj Simranjit Kaur Dhindsa Department of Computer Science & Engineering Chandigarh Engg. College, Landran, Mohali, India Abstract: added applications, and better system utilization. This paper describes a basic difference And as data volumes continue to increase driven by everything from longer detailed histories to between column-oriented databases and the need to accommodate big data companies traditional row-oriented databases. As require a solution that allows their data applications require higher storage and warehouse to run more applications and to be easier availability of data, the demands are more responsive to changing business satisfied by better and faster techniques [1]. environments. Plus, they need a simple, self- Column-oriented database systems managing system that boosts performance but (Column-stores) have attracted a lot of helps reduce administrative complexities and attention in the past few years. A column- expenses. Column Oriented DBMS provides oriented DBMS is a database management unlimited scalability, high availability and self- system (DBMS) that stores its content by managing administration [5]. column rather than by row as in row- oriented databases. This has advantages for data warehouses and library catalogues where aggregates are computed over large numbers of similar data items [4]. In this paper, we discuss how Column oriented Database better than traditional row- oriented DBMSs. This paper focuses on Fig 1. Base concept conveying an understanding of columnar databases and the proper utilization of columnar In fig 1, Starting with a generic table, There are databases within the enterprise. two obvious ways to map database tables onto a Keywords: Column–Stores, Column– one dimensional interface: store the table row- oriented DBMS, Data WareHouses, by-row or store the table column-by-column. Columnar Database. The row-by-row approach keeps all information about an entity together. In the example above, it I Introduction: will store all information about the first Faced with massive data sets, a growing user employee, and then all information about the population, and performance-driven service second employee, etc. The column-by-column level agreements, organizations everywhere are approach keeps all attribute information under extreme pressure to deliver analyses faster together: all of the employee id’s will be stored and to more people than ever before. That means consecutively, then all of the employee job, etc. businesses need faster data warehouse Both approaches are reasonable designs and performance to support rapid business decisions, typically a choice is made based on performance www.ijert.org 1 International Journal of Engineering Research & Technology (IJERT) ISSN: 2278-0181 Vol. 1 Issue 4, June - 2012 expectations. If the expected workload tends to access data on the granularity of an entity (e.g., find an employee, add an employee, delete an employee), then the row-by-row storage is preferable since all of the needed information will be stored together. On the other hand, if the expected workload tends to read per query only Fig 2. Two Dimensional Table a few attributes from many records (e.g., a query This simple table includes an employee that finds the most common e-mail address identifier (EmpId), name fields (Lastname and domain), then column-by-column storage is Firstname) and a Salary .The database must coax preferable since irrelevant attributes for a its two-dimensional table into a one-dimensional particular query do not have to be accessed [2]. series of bytes, for the operating system to write Traditionally, Row oriented databases it to either the RAM, or hard drive, or both. A are better suited for transactional environments, row-oriented database serializes all of the values in a row together, then the values in the next such as a call center where a customer's entire row, and so on. record is required when their profile is retrieved. Column-oriented databases are 1, Wilson, Joe, 40000; better suited for analytics, where only portions 2, Yaina, Mary, 50000; of each record are required. By grouping the 3, John, Cathy, 44000; data together like this, the database only needs to retrieve columns that are relevant to the query, A column-oriented database serializes all of the greatly reducing the overall I/O needed [6]. values of a column together, then the values of the next column, and so on. II Core difference of Columnar Database 1, 2, 3; than row-oriented Database: Wilson, Yaina, Johnson; The world of relational database systems is a Joe, Mary, Cathy; two-dimensional world. Data is stored in tabular 40000, 50000, 44000; data structures where rows correspond to distinct real-world entities or relationships, and columns This is a simplification. Partitioning, indexing, are attributes of those entities. There is, caching, views, OLAP cubes, and transactional however, a distinction between the conceptual systems such as write ahead logging or and physical properties of database tables. This multiversion concurrency control all aforementioned two-dimensional property exists dramatically affect the physical organization [4]. only at the conceptual level. At a physical level, database tables need to be mapped onto one III Limitations of Row oriented DBMS’s: dimensional structure before being stored. This Historically, database system implementations is because common computer storage media and research have focused on the row-by-row (e.g. magnetic disks or RAM), despite ostensibly data layout, since it performs best on the most being multi-dimensional, provide only a one common application for database systems: dimensional interface. For example, a database business transactional data processing. However, might have this table [2]. there are a set of emerging applications for database systems for which the row-by-row layout performs poorly. These applications are more analytical in nature, whose goal is to read www.ijert.org 2 International Journal of Engineering Research & Technology (IJERT) ISSN: 2278-0181 Vol. 1 Issue 4, June - 2012 through the data to gain new insight and use it to As a consequence of these drive decision making and planning. The nature query characteristics, storing data row-by-row is of the queries to data warehouses (analytical no longer the obvious choice; in fact, specially databases) is different from the queries to as a result of the latter two characteristics, the transactional databases. Queries tend to be: column-by-column storage layout can be better 1) Less Predictable: In the transactional [2]. world, since databases are used to automate business tasks, queries tend to IV Evolution of Column Oriented DBMSs: be initiated by a specific set of The following are some cited advantages of predefined actions. As a result, the basic column-stores: structure of the queries used to implement these predefined actions is Improved bandwidth utilization: In a column- coded in advance, with variables filled store, only those attributes that are accessed by a in at run-time. In contrast, queries in the query need to be read off disk (or from memory data warehouse tend to be more into cache). In a row-store, surrounding exploratory in nature. They can be attributes also need to be read since an attribute initiated by analysts who create queries is generally smaller than the smallest granularity in an ad-hoc, iterative fashion. in which data can be accessed. 2) Longer Lasting: Transactional queries Improved data compression: Storing data from tend to be short, simple queries (“add a the same attribute domain together increases customer”, “find a balance”). In locality and thus data compression ratio contrast, data warehouse queries, since (especially if the attribute is sorted). Bandwidth they are more analytical in nature, tend requirements are further reduced when to have to read more data to yield transferring compressed data. information about data in aggregate rather than individual records. Improved code pipelining: Attribute data can be 3) More Read-Oriented Than Write- iterated through directly without indirection Oriented: Analysis is naturally a read- through a tuple interface. This results in high oriented endeavor. Typically data is IPC (instructions per cycle) efficiency, and code written to the data warehouse in batches, that can take advantage of the super-scalar followed by many read only queries. properties of modern CPUs. Occasionally data will be temporarily written for “what-if” analyses, but on Improved cache locality: A cache line also tends the whole, most queries will be read- to be larger than a tuple attribute, so cache lines only. may contain irrelevant surrounding attributes in 4) Attribute-Focused Rather Than a row-store. This wastes space in the cache and Entity-Focused: Data warehouse reduces hit rates [3,4]. queries typically do not query individual Additional Considerations: entities; rather they tend to read multiple entities and summarize or aggregate In addition to better performance, the column- them. Further, they tend to focus on only orientation aspect of column-based database a few attributes at a time rather than all supplies a number of useful benefits to those attributes. wishing to deploy fast business intelligence databases. www.ijert.org 3 International Journal of Engineering Research & Technology (IJERT) ISSN: 2278-0181 Vol. 1 Issue 4, June - 2012 First, there is no need for indexing as with data is of uniform type; therefore, there are some traditional row-based databases. The elimination opportunities for storage size optimizations of indexing means: (1) less overall storage is available in column-oriented data that are not consumed in columnar databases because available in row-oriented data. Compression is indexes in legacy RDBMS’s often balloon the useful because it helps reduce the consumption storage cost of a database to double or more the of expensive resources, such as hard disk space initial data size; (2) data load speed is increased or transmission bandwidth.

Load more