Database Systems Journal vol. I, no. 2/2010 3

Column-Oriented Databases, an Alternative for Analytical Environment

Gheorghe MATEI
Romanian Commercial Bank, Bucharest, Romania
[email protected]

It is widely accepted that a data warehouse is the central place of a Business Intelligence system. It stores all data that is relevant for the company, data that is acquired both from internal and external sources. Such a repository stores data from more years than a transactional system can, and offers valuable information to its users, enabling the best decisions, based on accurate and reliable data. As the volume of data stored in an enterprise data warehouse becomes larger and larger, new approaches are needed to make the analytical system more efficient. This paper presents column-oriented databases, which are considered an element of the new generation of DBMS technology. The paper emphasizes the need for and the advantages of these databases in an analytical environment, and gives a short presentation of two of the DBMSs built on the columnar approach.
Keywords: column-oriented database, row-oriented database, data warehouse, Business Intelligence, symmetric multiprocessing, massively parallel processing.

1. Introduction
In the evolution of computing science, three generations of database technology have been identified from the 60's till nowadays. The first generation started in the 60's and its main purpose was to enable disparate but related applications to share data otherwise than by passing files between them.

The publishing of "A Relational Model of Data for Large Shared Data Banks" by E. F. Codd marked the beginning of the second generation of DBMS (database management systems) technology. Codd's premise was that data had to be managed in structures developed according to mathematical set theory. He stated that data had to be organized into tuples, as attributes and relations.

A third generation began to emerge in the late 90's and is now going to replace second-generation products. Multi-core processors have become common, 64-bit technology is largely used for database servers, memory is cheaper and disks are cheaper and faster than ever before.

A recent IDC study [1] examines emerging trends in DBMS technology as elements of the third generation of such technology. It considers that, at the current rate of development and adoption, the following innovations will be achieved in the next five years:
• most data warehouses will be stored in a columnar fashion;
• most OLTP (On-Line Transaction Processing) databases will either be augmented by an in-memory database or reside entirely in memory;
• most large-scale database servers will achieve horizontal scalability through clustering;
• many data collection and reporting problems will be solved with databases that will have no formal schema at all.

The study examines how some innovations in the database technology field are being implemented more and more. Most of these technologies have been developed for at least ten years, but they are only now becoming widely adopted. As Carl Olofson, research vice president for database management and data integration software research at IDC, said, "many of these new systems encourage you to forget disk-based partitioning schemes, indexing strategies and buffer management, and embrace a world of large-memory models, many processors with many cores, clustered servers, and highly compressed columnwise storage".

From the innovations that the study considers will be achieved in the next years, this paper presents columnar data storage.

2. The need for column-oriented databases
The volume of data acquired in an organization is growing rapidly. So does the number of users who need to access and analyse this data. IT systems are used more and more intensively, in order to answer ever more numerous and complex demands needed to make critical business decisions. Data analysis and business reporting need more and more resources. Therefore, better, faster and more effective alternatives have to be found. Business Intelligence (BI) systems are proper solutions for solving the problems above. Decision-makers need better access to information, in order to make accurate and fast decisions in a permanently changing environment. As part of a BI system, reporting has become critical for a company's business.

Years ago, reports prepared by analysts were addressed only to the company's executive management. Nowadays, reporting has become an instrument addressed to decision-makers at all organizational levels, aiming to improve the company's activity, ensure decision quality, control costs and prevent losses.

As already mentioned, the volume of data acquired by a company is growing permanently, because business operations expand and, on the other hand, the company has to interact with more sources of data and keep more data online. More than ever before, users need faster and more convenient access to historical data for analysis purposes. Enterprise data warehouses are a necessity for companies that want to stay competitive and successful. More and more reports and ad-hoc queries are requested to support the decision-making process. At the same time, companies have to run audit reports on their operational and historical data in order to ensure compliance [2].

These new demands add more pressure on IT departments. More and more hardware resources are needed in order to store and manage an increasing volume of data. The increasing number of queries needs larger amounts of CPU cycles, so more processors, with higher performance, must be added to the system.

The size of the data warehouses storing this data is increasing permanently. While five years ago the largest data warehouses were around 100 terabytes in size, now a data warehouse at the petabyte level is no longer unusual. The challenge is to maintain the performance of these repositories, which are built, mostly, as relational structures storing data in a row-oriented manner. The relational model is a flexible one and it has proven its capacity to support both transactional and analytical processing. But, as the size and complexity of data warehouses have increased, a new approach was proposed as an alternative to the row-oriented approach, namely storing data in a column-oriented manner. Unlike the row-oriented approach, where the data storage layer contains records (rows), in a column-oriented system it contains columns. This is a simple model, better suited to the data repositories used by analytical applications, with their wide range of users and query types.

Research indicates that the size of the largest data warehouse doubles every three years. Growth rates of system hardware performance are being overrun by the need for analytical performance [3]. The volume of data that needs to be stored is growing due to more and more varied requirements for reporting and analytics, coming from more and more business areas, increased time periods for data retention, a greater number of observations loaded in data warehouses and a greater number of attributes for each observation. This is true taking into consideration only structured data. But nowadays organizations collect a larger and larger volume of unstructured data, such as images, audio and video files, which need much greater storage space than structured data.

Row-oriented databases have been designed for transactional processing. For example, in the account management system of a bank, all attributes of an account are stored in a single row. Such an approach is not optimal in an analytical system, where a lot of read operations are executed in order to access a small number of attributes from a vast volume of data. In a row-oriented architecture, system performance, user access and data storage become major issues very quickly [4]. As they are designed to retrieve all elements from several rows, row-oriented databases are not well suited for large-scale processing, as needed in an analytical environment. As opposed to transactional queries, analytical queries typically scan all the database's records, but process only a few elements of them. In a column-oriented database all instances of a single data element, such as account number, are stored together so they can be accessed as a unit. Therefore, column-oriented databases are more efficient in an analytical environment, where queries need to read all instances of a small number of data elements.

System performance improves spectacularly in a column-oriented solution, because queries search only a few attributes, and they will not scan the attributes that are irrelevant for those queries. Requested data is found faster, because fewer sort operations have to be performed.

A typical feature of evolved BI systems is their capability to make strategic business analyses, to process complex events and to drill deeply into data. As the volume of data becomes impressive and the performance demanded by users keeps growing, it is obvious that row-oriented relational database management systems have stopped being the solution for implementing a BI system with powerful analytical and predictive capabilities. A new model tends to come into prominence as an alternative for developing analytical databases, namely one that manages data by columns.

A column-oriented DBMS stores data in a columnar manner and not by rows, as classic DBMSs do. In the columnar approach, each attribute is stored in a separate table, so successive values of that attribute are stored consecutively. This is an important advantage for data warehouses where, generally, information is obtained by aggregating a vast volume of data. Therefore, operations such as MIN, MAX, SUM, COUNT, AVG and so forth are performed very quickly [5].

When the tables of a database are designed, their columns are established. The number of rows will be determined when the tables are populated with data. In a row-oriented database, data is stored in a tabular manner. The data items of a row are stored one after another; rows are also stored one after another, so the last item of a row is followed by the first item of the next row. In a column-oriented database, the data items of a column are stored one after another, and so are the columns; thus the last item of a column is followed by the first item of the next column.

3. Differences between the row-oriented and column-oriented approaches
In a typical relational DBMS, data is stored and managed as rows, each row containing all the attributes of an element of that entity (table). Such systems are used by transactional applications which, at a certain moment, generate or modify one or a small number of records. Unlike transactional applications, which use all, or almost all, the attributes of a record, analytical and BI applications scan few attributes (columns) of a vast number of records. Most often, they have to aggregate the data stored in those columns in order to meet the users' demands. Because of the row-oriented structure of the database, the entire record has to be read in order to access the required attributes. This causes the reading of a vast amount of useless additional data in order to access the requested information.
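The contrast between the two layouts can be sketched in a few lines of Python; the account records and values below are invented for illustration. The aggregate touches every whole record in the row layout, but only a single column in the columnar layout.

```python
# Minimal sketch (illustrative, not any vendor's format): the same three
# accounts laid out row-wise and column-wise, and a SUM over one attribute.

rows = [  # row-oriented: each record stored as one unit
    {"account": "RO01", "type": "term", "balance": 1000.0},
    {"account": "RO02", "type": "sight", "balance": 250.0},
    {"account": "RO03", "type": "term", "balance": 4300.0},
]

columns = {  # column-oriented: successive values of one attribute together
    "account": ["RO01", "RO02", "RO03"],
    "type":    ["term", "sight", "term"],
    "balance": [1000.0, 250.0, 4300.0],
}

# Row layout: every whole record is touched even though only one
# attribute is needed.
total_row_store = sum(r["balance"] for r in rows)

# Column layout: only the 'balance' column is scanned; 'account' and
# 'type' are never read.
total_col_store = sum(columns["balance"])

assert total_row_store == total_col_store == 5550.0
```

On disk the difference matters far more than in memory: the row layout forces the unused attributes through the I/O path, while the columnar layout never reads them at all.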

Figure 1 shows that much more data than needed is read to satisfy a request for the total volume of term deposits opened at the branches in Bucharest.

[Figure 1 depicts a row-oriented table with attributes such as Row ID, Account number, Account type, Term account, Original currency, Balance, RON equivalent, Calculated interest, Interest rate, Balance date, Open date, Branch ID and City, with rows 1, 2, 3, ..., n; the query must scan entire rows to reach the few attributes it needs.]

Fig. 1. Analytical request in a row-oriented database

Row-oriented databases were designed for transactional applications, and they are optimized for retrieval and processing of small data sets. Often, to support analytical requests, it is necessary to build additional indexes, pre-aggregated data structures, or special materialized views and cubes. All these require additional processing time and data storage. However, because they are built in order to provide quick results for queries that were known at the design stage, they will not have the same performance when ad-hoc queries, not foreseen before, are performed.

Business demands require the storage of many data items, but any user wants to get information as soon as possible. Therefore, a proper solution for data organization has to be implemented in order to ensure good system performance. Several technical solutions can be used to improve system performance, such as partitioning, star indexes, query pre-processing, bitmap and index joins, or hashing. These solutions aim to offer support for more specific data retrieval, but they still have to examine the entire content of a row.

Taking into consideration the issues presented above, a new approach was proposed: to store data along columns. In such a data organization, each column is stored separately and the system selects only the columns requested by users. In every column data is stored in row order, so that the 50th entry for each column belongs to the same row, namely the 50th row.

Figure 2 shows that the same query as the one in figure 1 reads less data in a column-oriented system in order to provide the same result. No additional indexes have to be built to improve query performance, because every column forms an index. This reduces the number of I/O operations and enables quick access to data, without the need to read the entire database. Data from each column is stored contiguously on disk. Column values are joined into rows based on their relative position in each column. As a result of the column-oriented architecture, only those columns needed for a specific query are read from disk. Because in an analytical environment most queries need to retrieve only a few columns, this vertical partitioning approach produces important I/O savings. This contributes to the improvement of system performance, as regards query execution time.
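The positional joining described above (the 50th entry of each column belongs to the 50th row) can be sketched as follows; the column names and values are invented for illustration.

```python
# Sketch of positional row reconstruction in a column store: the i-th entry
# of every column belongs to row i, so no per-value row IDs are required.

city    = ["Bucharest", "Cluj", "Bucharest", "Iasi"]
acct_ty = ["term", "sight", "term", "term"]
balance = [1000.0, 250.0, 4300.0, 700.0]

# Analytical query: total balance of term accounts in Bucharest.
# Only the three columns involved are scanned; any other attribute of the
# table would stay untouched on disk.
total = sum(
    b
    for c, t, b in zip(city, acct_ty, balance)
    if c == "Bucharest" and t == "term"
)
assert total == 5300.0

# Reconstructing a full row is a positional lookup in every column:
row_2 = (city[2], acct_ty[2], balance[2])
assert row_2 == ("Bucharest", "term", 4300.0)
```

The same relative-position rule is what lets a columnar engine rebuild complete rows on the rare occasions a query needs them.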

[Figure 2 depicts the same attributes stored as separate columns (Row ID, Account number, Account type, Term account, Original currency, Balance, RON equivalent, Calculated interest, Interest rate, Balance date, Open date, Branch ID and City), each with entries 1, 2, 3, ..., n; the query reads only the columns it needs.]

Fig. 2. Analytical request in a column-oriented database

Note. Figure 2 presents a reunion of the tables in a column-oriented database. In fact, each table has two columns: one containing the row ID, and the other the values of the appropriate attribute. Because of the limited space on the page, the row ID column is not repeated for every table, and the attribute columns are shown close together.

Comparing the two figures above, it is easy to observe that the same request has to read more data in a row-oriented structure than in a column-oriented one. In order to read a certain attribute in a row-oriented structure, all the adjacent attributes have to be read, even if they are of no interest to the requester. In a column-oriented structure, since all values of an attribute are stored together, consecutively, this problem does not exist [6].

A column-oriented database is faster than a row-oriented one, because its processing is not affected by unnecessary content of rows. As long as many database tables can have dozens of columns and most business requests need only a few of them, the columnar approach is a proper solution for analytical systems.

Talking about the efficiency of a column-oriented system, some remarks are to be made concerning processing time. Such a system is more efficient when it is necessary to aggregate a large number of rows, but only a small number of columns are requested. If many columns of a small number of rows have to be processed, a row-oriented system is more efficient than a column-oriented one. The efficiency is even greater when the row size is relatively small, because the entire row can be retrieved with a single disk seek.

Updating an entire column at once is faster in a column-oriented database. All the data of that column is modified through a single update command, without the need to read all columns of each row. But writing or updating a single row is more efficient in a row-oriented database if all attributes are supplied at the same time, because the entire row can be written with a single disk access, whereas writing to multiple columns requires multiple writes.

SQL queries in a column-oriented database are identical to those in a row-oriented database, without any modification. What is different is the way the database administrator has to think about data. While in a row-oriented database he thinks in terms of individual transactions, in a column-oriented database he has to think in terms of similar items derived from sets of transactions. From the indexing point of view, he has to pay more attention to the cardinality of the data, because an index is related to a subject, such as the account balance, and not to an entire transaction with all its fields.

4. Advantages of the column-oriented approach
Column-oriented databases provide important advantages over row-oriented ones, some of which are presented below.

Column-oriented databases provide better performance for analytical requests. In the row-oriented approach, system performance decreases significantly as the number of simultaneous queries increases. Building additional indexes in order to accelerate queries becomes ineffective with a large number of diverse queries, because more storage and CPU time are required to load and maintain those indexes. In a column-oriented system indexes are built to store data, while in a row-oriented system they represent the way to point to the storage area that contains the row data. As a result, a column-oriented system will read only the columns required by a certain query. On the other hand, as they store data as blocks by columns rather than by rows, the actions performed on a column can be completed with fewer I/O operations. Only those attributes requested by users are read from disk. Although a row-oriented table can be partitioned vertically, or an index can be created for every column so it can be accessed independently, the performance is significantly lower than in a column-oriented structure [7]. And taking into consideration that I/O operations are the bottleneck of a database application, the column-oriented approach proves its superiority over the row-oriented one.

Unlike the row-oriented approach, the column-oriented approach allows rapid joins and aggregations. Tables are already sorted, so there is no need to sort them before merging or joining them. In addition, accessing data along columns allows incremental data aggregation, which is very important for BI applications. This approach also allows parallel data access, improving system performance. Thereby, complex aggregations can be fulfilled by processing columns in parallel and then joining the data in order to achieve the final result.

Column-oriented databases need less disk space to store data than row-oriented databases. To accommodate the sustained increase of the volume of data, additional structures – such as indexes, tables for pre-aggregation and materialized views – are built in row-oriented systems. Column-oriented databases are more efficient structures. They do not need additional storage for indexes, because data is stored within the indexes themselves. Bitmap indexes are used to optimize data storage and its fast retrieval. That is why queries in a column-oriented database are more efficient than in a row-oriented one.

Moreover, a higher data compression rate can be achieved in a column-oriented database than in a row-oriented one. It is well known that compression is more effective when repeated values are present, and values within a column are quite similar to each other. A column-oriented approach allows data to be highly compressed due to the high potential for similar values to occur in adjacent rows of a certain column. In a row-oriented database, values in a row of a table are not likely to be very similar; therefore, they cannot be compressed as efficiently as in a column-oriented database.
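A sketch of why columnar data compresses so well: a low-cardinality column such as CITY collapses under simple run-length encoding. The values are invented for illustration, and real systems use more elaborate schemes, but the principle is the same.

```python
# Illustrative run-length encoding of a low-cardinality column. Runs of
# repeated adjacent values occur naturally along a column; a row mixes
# unrelated attributes, so such runs rarely appear.
from itertools import groupby

city_column = ["Bucharest"] * 4 + ["Cluj"] * 2 + ["Iasi"] * 3

# Each run of repeated values collapses to a (value, count) pair.
encoded = [(value, len(list(run))) for value, run in groupby(city_column)]
assert encoded == [("Bucharest", 4), ("Cluj", 2), ("Iasi", 3)]

# Decoding restores the column exactly, and aggregates such as COUNT can
# even be answered from the compressed form without decoding at all.
decoded = [v for v, n in encoded for _ in range(n)]
assert decoded == city_column
assert sum(n for _, n in encoded) == len(city_column)
```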

Concerning the repository presented in figures 1 and 2, there is no doubt that many repeated values will be found within the CITY column, but no repetition will be found between CITY and another attribute in a row.

Data loading is a faster process if it is executed in a column-oriented database rather than in a row-oriented one. As known, loading data into a data warehouse involves several activities. Data is extracted from source systems and loaded into a staging area. This is the place where data transformations and joins are performed in order to denormalize the data and load it into the data warehouse as fact and dimension tables. Then the needed indexes and views are created. In a row-oriented structure, all data in a row (record) is stored together, and indexes are built taking into consideration all the rows. In a column-oriented structure, data of each column is stored together and the system allows the parallel loading of the columns, ensuring a shorter time for data loading.

Taking into consideration the features presented above, it can be stated that a column-oriented database is a scalable environment that keeps providing fast queries as the volume of data, the number of users and the number of simultaneous queries increase.

But this does not mean that all repositories have to be built in a columnar manner. A column-oriented architecture is more suitable for data warehousing, with selective access to a small number of columns, while a row-oriented one is a better solution for OLTP systems. For an OLTP system, which is heavily loaded with interactive transactions, a row-oriented architecture is well suited. All data for a certain row is stored as a block. In such an architecture, all the attributes of a record are written to disk with a single command, which ensures high performance for writing operations. Usually, an operation in such a system creates, queries or changes an entry in one or more tables. For an OLAP (On-Line Analytical Processing) system, designed for analytical purposes, which involve processing a large number of values of few columns, a column-oriented architecture is a better solution. A data warehouse, which is the central place of an analytical system, must be optimized for reading data. In such an architecture, data for a certain column is stored as a block, so analytical queries, which usually aggregate data along columns, are performed faster. A column-oriented system reads only the columns required for processing a certain query, without bringing irrelevant attributes into memory. Such an approach provides important advantages concerning system performance, because typical queries involve aggregation of large volumes of data [8].

5. Examples of column-oriented database systems
Besides the column-oriented approach, another important innovation applied in data warehousing consists in the way in which data is processed. Two major techniques are used to design a data warehouse architecture: symmetric multiprocessing (SMP) and massively parallel processing (MPP). The superiority of one approach over the other cannot be certified. Each of these solutions has its own supporters, because both of them are valid approaches and, when properly applied, lead to notable results.

Two database systems are presented in the next sections, each of them using one of the two types of architecture.

5.1. Sybase IQ
Sybase IQ is a high-performance decision support server designed specifically for data warehousing. It is a column-oriented relational database that was built, from the very beginning, for analytics and BI applications, in order to assist reporting and decision support systems. This offers it several advantages within a data warehousing environment, including performance, scalability and cost of ownership benefits.
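The parallel column loading mentioned above can be sketched as follows. This is a simplified illustration, not any product's loader; the batch contents and column names are invented.

```python
# Sketch (simplified): a row-oriented load appends whole records one by
# one, while a columnar load can split the batch by attribute and load
# each column independently -- the step a columnar engine parallelizes.
from concurrent.futures import ThreadPoolExecutor

batch = [
    ("RO01", "term", 1000.0),
    ("RO02", "sight", 250.0),
    ("RO03", "term", 4300.0),
]
names = ("account", "type", "balance")

def load_column(i):
    # Each worker extracts and stores one attribute of the whole batch.
    return names[i], [record[i] for record in batch]

# One worker per column; in a real engine each would write its own
# contiguous on-disk block.
with ThreadPoolExecutor(max_workers=len(names)) as pool:
    store = dict(pool.map(load_column, range(len(names))))

assert store["balance"] == [1000.0, 250.0, 4300.0]
```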

A Sybase IQ database is different from a conventional relational database, because its main purpose is to allow data analysis rather than writing or updating. While in a conventional database the most important thing is to allow many users to update the database instantly and accurately, without interfering with one another, in a Sybase IQ database the most important thing is to ensure fast query responses for many users.

Sybase IQ has a column-oriented structure and its own indexing technology that ensures high performance for reporting and analytical queries, which are performed, as its developers state, up to 100 times faster than in a traditional DBMS. Sybase IQ offers enhanced features for data loading. Its flexible architecture ensures that the system will rapidly provide the requested information, within seconds or minutes at most, no matter how many queries are issued.

Sybase IQ offers fast and flexible access to information. As already mentioned, it is designed for query processing and ad-hoc analysis. As opposed to traditional data warehouses, it does not require data to be pre-aggregated in order to analyse it. Therefore, users can efficiently and quickly analyse atomic-level data. With Sybase IQ users can analyse business performance and track the company's KPIs (key performance indicators). Compared with other products, it provides better solutions for measuring business results, managing the customer relationship and ensuring financial controls.

Several intelligent features are integrated in the Sybase IQ architecture. These features, such as the use of symmetric multiprocessing (SMP), enhance the database's performance and reduce its maintenance overhead. An SMP architecture offers increased performance because all processors can equally access the database's tables.

Sybase IQ is built upon open standards, so integration and interoperability with other reporting systems and dashboards are easy to achieve. In order to enhance the performance of ad-hoc queries, it delivers more specialized indexes, such as indexes for low-cardinality data, grouped data, range data, joined columns, textual analysis, real-time comparisons for Web applications, and date and time analysis [10]. Furthermore, every field of a row can be used as an index, without the need to define conventional indexes. Due to these indexes, analytical queries focus on specific columns, and only those columns are loaded into memory. This greatly reduces the need for expensive, high-performance disk storage and, at the same time, the number of I/O operations.

Offering enhanced scalability, Sybase IQ can be used by a large number of users – hundreds and even thousands – who can access vast volumes of data, from a few gigabytes to several hundred terabytes.

Sybase IQ has enhanced compression algorithms, which reduce the disk space necessary for data storage by 30 to 85 percent, depending on the data's structure. A significant cost reduction results from this. As already mentioned, storing data in a column-oriented manner increases the similarity of adjacent records on disk, and values within a column are often quite similar to each other. Tests confirmed that Sybase IQ requires 260 terabytes of physical storage for storing 1000 terabytes of row input data. A row-oriented database requires additional storage for indexes and, in the example above, this additional storage can reach up to 500 terabytes [2]. In addition, Sybase IQ allows operating directly on compressed data [9], with a positive impact on processing speed.

As opposed to traditional databases, Sybase IQ is easier to maintain, and the tuning needed to get higher performance requires less time and hardware resources. It does not need specialized schemas (dimensional modeling) to perform well.

The Sybase IQ architecture (figure 3) consists of multiple SMP nodes. Some of them are used only for reading data (read-only nodes), while others can be used both for reading and writing data (read/write nodes). However, each node can be flexibly designated as a read or write node, according to the requirements at a given moment. Thus, if an overnight batch has to be executed in order to load or update a large volume of data, it is a good idea to make all the read/write nodes operate as write nodes, even if they run as read nodes during the day. In addition, this architecture can be scaled up incrementally, by adding new nodes as needed.

As shown in figure 3, the Sybase columnar store allows storing data in several repositories, depending on its age, and it can be easily scaled out. Data can be loaded through real-time loads or batch ETL (Extract, Transform, Load) processes. Starting with the release of version 15 of Sybase IQ, a new "load from client" option has been added. This option allows loading data from external sources, via ODBC, JDBC, ADO.Net, OLE DB and DBLib. Data, which can be encrypted through different methods, is rapidly accessed for analytical or reporting purposes.

Fig. 3. Sybase IQ architecture (source: [10])
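The low-cardinality indexes mentioned above can be illustrated with a toy bitmap index; this is a generic sketch of the data structure, not Sybase's actual implementation, and the column values are invented.

```python
# Hypothetical sketch of a bitmap index on a low-cardinality column: one
# bitmap per distinct value, with bit i set when row i holds that value.

account_type = ["term", "sight", "term", "term", "sight"]

bitmaps = {}
for i, v in enumerate(account_type):
    bitmaps.setdefault(v, 0)
    bitmaps[v] |= 1 << i

# Predicates become bitwise operations: rows WHERE type = 'term'.
term_rows = [i for i in range(len(account_type)) if bitmaps["term"] >> i & 1]
assert term_rows == [0, 2, 3]

# COUNT(*) WHERE type = 'sight' is a population count, no table scan.
assert bin(bitmaps["sight"]).count("1") == 2
```

Because a bitmap per value is tiny for low-cardinality columns, such an index can sit in memory and answer predicates without touching the table at all.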

Sybase IQ enables data to be managed more efficiently than in a traditional database built in a row-oriented approach. Complex analyses run much faster, high data compression reduces storage costs, and vast volumes of data can be processed much more quickly.

5.2. Vertica
To gain competitive advantages and comply with new regulations, companies are obliged to develop enterprise data warehouses and powerful applications able to respond to more and more ad-hoc queries from an increasing number of users that need to analyse larger volumes of data, often in real time. Vertica Analytic Database is a DBMS that can help in meeting these needs. It is a column-oriented database built to combine columnar storage with columnar execution, as opposed to other solutions that are column-oriented only from the storage point of view.
Designed by Michael Stonebraker, it incorporates a combination of architectural elements – many of them used before in other contexts – to deliver a high-performance and low-cost data warehouse solution that is more than the sum of its elements.
Vertica is built on a massively parallel processing (MPP) architecture. In an MPP architecture, each processor works against its own set of data, data is distributed across the nodes of the network, and new processors can be added almost without limit. As data is partitioned and loaded into a server cluster, the data warehouse performs faster. Due to the MPP technology, system performance and storage capacity can be enhanced simply by adding a new server to the cluster. Vertica automatically takes advantage of the new server without the need for expensive and time-consuming upgrades.
While many of the new data warehouses use only MPP technology or only the columnar approach, Vertica is the only data warehouse that includes both innovations, as well as other features. It is designed to reduce disk I/O operations and is written natively to support grid computing.
Because of the columnar approach, which reduces the expensive I/O operations, queries run 50 to 200 times faster than in a row-oriented database. Its MPP architecture offers better scalability, achieved by adding new servers to the grid.
In order to minimize the disk space needed to store a database, Vertica uses several compression algorithms, depending on data type, cardinality and sort order. For each column, the proper algorithm is chosen automatically, based on a sample of the data [11]. As the data in a column is homogeneous, having the same type, Vertica achieves a compression ratio of 8 to 13 times relative to the size of the original input data.
Vertica decomposes the logical tables and physically stores them as groups of columns named "projections". According to this concept, data is stored in different ways, similar to materialized views. Each projection contains a subset of the columns of one or more tables and is sorted on a different attribute. Vertica automatically selects the proper projection in order to optimize query performance. Due to the effective compression algorithms used by Vertica, multiple projections can be maintained, which contributes to the performance improvement of a large range of queries, including the ad-hoc queries needed for exploratory analysis.
On the other hand, these projections serve as redundant copies of the data. Because data is compressed so efficiently, Vertica can use the disk space to store these copies to ensure fault tolerance and to improve concurrent and ad-hoc query performance. By partitioning data across the cluster, Vertica ensures that each data element is stored on two or more nodes. Thus, an intelligent data mirroring scheme is implemented, named K-Safety, where k is the number of node failures that a given set of projections can sustain without affecting system availability. In order to guarantee K-Safety, k+1 replicas of all projections are built. Each replica has the same columns, but they may have different sort orders. K-Safety allows requests for data stored on failed nodes to be satisfied by corresponding projections on other nodes. Once a failed node is restored, the projections on the other nodes are automatically used to repopulate its data.
The value of k has to be configured so that a proper trade-off between hardware costs and availability guarantees is met. If necessary, a new node can be added to the grid, and Vertica will automatically allocate a set of objects to that node so it can begin processing queries, increasing database performance [12]. Conversely, a node can be removed and the database will continue to work, but at a lower rate.
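The projection and compression ideas above can be illustrated with a small sketch (Python used purely for illustration; the table, column names and helper functions are invented, not Vertica's API). Each projection stores a subset of columns sorted on a different attribute, and a sorted, low-cardinality column compresses very well under a simple run-length encoding:

```python
from itertools import groupby

# A tiny table stored row-wise (as an OLTP system would keep it).
rows = [
    ("2010-01-04", "Bucharest", "EUR", 120),
    ("2010-01-04", "Cluj",      "RON",  75),
    ("2010-01-05", "Bucharest", "EUR", 140),
    ("2010-01-05", "Bucharest", "RON",  60),
]
SCHEMA = ("day", "city", "currency", "amount")

def projection(rows, columns, sort_on):
    """Store a subset of columns as separate column lists, sorted on one attribute."""
    key = SCHEMA.index(sort_on)
    ordered = sorted(rows, key=lambda r: r[key])
    return {c: [r[SCHEMA.index(c)] for r in ordered] for c in columns}

def rle(column):
    """Run-length encode one column; effective when it is sorted or low-cardinality."""
    return [(value, len(list(run))) for value, run in groupby(column)]

# Two projections of the same logical table, sorted on different attributes,
# so a planner could pick whichever matches a query's predicate or sort order.
by_city = projection(rows, ("city", "amount"), sort_on="city")
by_day  = projection(rows, ("day", "amount"),  sort_on="day")

print(rle(by_city["city"]))  # [('Bucharest', 3), ('Cluj', 1)] - 4 values, 2 runs
print(rle(by_day["day"]))    # [('2010-01-04', 2), ('2010-01-05', 2)]
```

Keeping several redundant projections is affordable precisely because each sorted column collapses into a handful of runs; the same redundancy then doubles as the K-Safety mirroring described above.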

As shown in figure 4, a single Vertica node is organized into a hybrid store consisting of two distinct storage structures: the Write-Optimized Store (WOS) and the Read-Optimized Store (ROS). Generally, the WOS fits into main memory and is designed to efficiently support insert and update operations. Data is stored as collections of uncompressed and unsorted columns, maintained in update order.


Fig. 4. Vertica storage model (source: [11])
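The hybrid model of figure 4 can be sketched in a few lines (illustrative Python; the class and method names are invented, not Vertica's implementation). Inserts append to an unsorted in-memory WOS, a mover flushes sorted, compressed batches into the ROS, and queries read both stores:

```python
from itertools import groupby

class HybridStore:
    """Toy model of figure 4: a write-optimized and a read-optimized store."""

    def __init__(self):
        self.wos = []   # uncompressed, unsorted, kept in update order
        self.ros = []   # sorted, run-length-encoded batches: [[(value, count), ...]]

    def insert(self, value):
        self.wos.append(value)          # cheap append, no sorting at write time

    def tuple_mover(self):
        """Move the whole WOS into the ROS as one sorted, compressed batch."""
        batch = sorted(self.wos)
        self.ros.append([(v, len(list(g))) for v, g in groupby(batch)])
        self.wos.clear()

    def count(self, value):
        """Queries see both stores, so reads and writes need not block each other."""
        hits = sum(1 for v in self.wos if v == value)
        hits += sum(n for run in self.ros for v, n in run if v == value)
        return hits

store = HybridStore()
for v in ["EUR", "RON", "EUR", "EUR"]:
    store.insert(v)
store.tuple_mover()            # the batch is sorted and RLE-compressed into the ROS
store.insert("EUR")            # the newest update still sits in the WOS
print(store.count("EUR"))      # -> 4 (3 from the ROS run + 1 from the WOS)
```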

An asynchronous background process called the Tuple Mover moves data from the WOS into the permanent disk storage in the ROS. The Tuple Mover operates on the entire WOS, sorting many records at a time and writing them to the ROS as a batch. Data in the ROS is sorted and compressed, so it can be efficiently read and queried.
Queries and updates do not interfere with one another. Updates are collected in time-based buckets called epochs. New updates are grouped in the current epoch until the transaction is committed. Data in older epochs is available for querying and for moving into the ROS.
Because of the grid computing architecture, a query can be initiated on any node of the network. The Vertica query planner decomposes the query according to the data stored on the involved nodes and sends them the appropriate subqueries. It then collects each node's partial results and composes them in order to deliver the final answer to the requester.
In [13] a comparison is made between a 1.5 terabyte row-oriented data warehouse and a column-oriented database containing the same data and managed by Vertica Analytic Database. The results are presented in table 1 below.

Table 1. Advantages of Vertica Analytic Database (source: [13])

                         Row-oriented           Vertica Analytic    Vertica
                         data warehouse         Database            advantages
Avg query response time  37 minutes             9 seconds           270x faster answers
Reports per day          30                     1,000               33x more reports
Data availability        Next day               1 minute            Real-time views
Hardware cost            $1.4M                  $50,000             1/28th of the hardware,
                         (2*6 servers + SAN)    (6 HP ProLiant      built-in redundancy
                                                servers)
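The distributed execution behind these numbers – any node decomposes a query, scatters subqueries to the nodes holding the data, then composes the partial results – can be sketched as follows (illustrative Python; the node layout and function names are invented):

```python
# Each node holds one partition of the "amount" column of a logical table.
nodes = {
    "node1": [120, 75, 40],
    "node2": [140, 60],
    "node3": [90, 10, 5],
}

def run_subquery(partition):
    """Executed locally on one node: aggregate only its own slice of the data."""
    return sum(partition)

def initiate_query(nodes):
    """Any node can act as initiator: scatter subqueries, gather partial results."""
    partials = [run_subquery(part) for part in nodes.values()]  # scatter
    return sum(partials)                                        # compose final answer

print(initiate_query(nodes))  # -> 540, e.g. SELECT SUM(amount) over all partitions
```

Adding a server to the grid simply adds one more partition to scatter over, which is why capacity scales by adding nodes rather than by upgrading them.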

All those presented above enable Vertica to manage larger volumes of historical data, analyse data at any level of detail, perform real-time analyses, conduct ad-hoc and short-lived business analytical projects, and build new analytic Software as a Service (SaaS) businesses.

6. Conclusions
For applications that write and update data intensively (OLTP systems), a row-oriented approach is a proper solution. In such an architecture, all the attributes of a record are placed contiguously in storage and are pushed out to disk through a single write operation. An OLTP system is thus write-optimized, having a high writing performance.
In contrast, an OLAP system, mainly based on ad-hoc queries performed against large volumes of data, has to be read-optimized. The repository of such a system is a data warehouse. Periodically (daily, weekly or monthly, depending upon how current the data must be), the data warehouse is loaded massively. Ad-hoc queries are then performed in order to analyse the data and discover the right information for the decision making process. For such analytical applications, which read much more than they write, a column-oriented approach is the better solution.
Nowadays, data warehouses have to answer more and more ad-hoc queries, from a greater number of users who need to analyse larger volumes of data quickly. Columnar database technology inverts the database's structure and stores each attribute separately, which eliminates wasteful retrieval as queries are performed. On the other hand, much more data can be loaded into memory, and processing data in memory is much faster.
Column-oriented databases provide faster answers because they read only the columns requested by users' queries, whereas row-oriented databases must read all rows and columns in a table. Data in a column-oriented database can also be compressed better than in a row-oriented database, because the values in a column are much more homogeneous than those in a row. The compression of a column-oriented database may reduce its size up to 20 times, providing higher performance and reduced storage costs. Because of the greater compression rate, a column-oriented implementation stores more data in a block and therefore retrieves more data in a single read operation. Since locating the right block to read and reading it are two of the most expensive computer operations, it is obvious that a column-oriented approach is the best solution for a data warehouse used by a Business Intelligence system developed for analytical purposes.
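The core argument of the conclusion – a columnar layout touches only the requested attribute, while a row layout fetches every attribute of every record – can be made concrete with a toy measurement (illustrative Python; the table and column names are invented):

```python
# Toy comparison of the two layouts for the query: SELECT SUM(amount) FROM sales.
n = 1000
row_store = [("cust%d" % i, "prod%d" % (i % 50), i % 100) for i in range(n)]
column_store = {
    "customer": ["cust%d" % i for i in range(n)],
    "product":  ["prod%d" % (i % 50) for i in range(n)],
    "amount":   [i % 100 for i in range(n)],
}

# Row-oriented scan: every attribute of every record is fetched from storage.
values_touched_row = sum(len(r) for r in row_store)   # 3,000 values touched

# Column-oriented scan: only the single requested column is read.
values_touched_col = len(column_store["amount"])      # 1,000 values touched

total = sum(column_store["amount"])
assert total == sum(r[2] for r in row_store)          # both layouts agree on the answer
print(values_touched_row, values_touched_col)         # 3000 1000 -> 3x less data read
```

With a wide fact table of dozens of attributes the ratio grows accordingly, and compression widens the gap further, since each block read brings in more values of the one column the query actually needs.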

References

[1] Carl Olofson, The Third Generation of Database Technology: Vendors and Products That Are Shaking Up the Market, 2010, www.idc.com
[2] Sybase, Sybase IQ: The Economics of Business Reporting, White paper, 2010, www.sybase.com/files/White_Papers/Sybase-IQ-Business-Reporting-051109-WP.pdf
[3] David Loshin, Gaining the Performance Edge Using a Column-Oriented Database Management System, Sybase white paper, 2009, www.sybase.com
[4] Sybase, A Definitive Guide to Choosing a Column-based Architecture, White paper, 2008, www.information-management.com/white_papers/10002398-1.html
[5] William McKnight, Evolution of Analytical Platforms, Information Management Magazine, May 2009, www.information-management.com/issues/2007_58/analytics_business_intelligence_bi-10015353-1.html
[6] Daniel Abadi, Column-Stores For Wide and Sparse Data, 3rd Biennial Conference on Innovative Data Systems Research, January 7–10, 2007, Asilomar, California, USA, http://db.csail.mit.edu/projects/cstore/abadicidr07.pdf
[7] Daniel Abadi, Samuel Madden, Nabil Hachem, Column-Stores vs. Row-Stores: How Different Are They Really?, Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, Vancouver, Canada, http://portal.acm.org
[8] Mike Stonebraker, Daniel Abadi et al., C-Store: A column-oriented DBMS, Proceedings of the 31st VLDB Conference, Trondheim, Norway, 2005, http://db.csail.mit.edu/projects/cstore/vldb.pdf
[9] Daniel Tkach, When the information is the business, Sybase white paper, 2010, www.sybase.com/files/white_papers
[10] Philip Howard, Sybase IQ 15.1, A Bloor InDetail Paper, 2009, www.it-director.com/business/innovation
[11] ***, Revolutionizing data warehousing in telecom with the Vertica Analytic Database, 2010, www.vertica.com/white-papers
[12] ***, The Vertica Analytic Database technical overview, 2010, www.vertica.com/white-papers
[13] ***, Increasing Enterprise Data Warehouse Performance, Longevity and ROI with the Vertica Analytic Database, 2010, www.vertica.com/white-papers


Gheorghe MATEI graduated from the Faculty of Planning and Economic Cybernetics in 1978. He received his PhD in Economic Cybernetics and Statistics in 2009, with a thesis on Business Intelligence systems in the banking industry. After a long career in the IT department, he now works in the accounting and reporting department of the Romanian Commercial Bank. His fields of interest include Business Intelligence systems, data warehousing, decision support systems and collaborative systems. He is a co-author of the book "Business Intelligence Technology" (2010), as well as author and co-author of several articles in journals, international databases and proceedings of national and international conferences in the mentioned domains.