Latest Trends in Information Technology

Testing relational strategies

Vlad Diaconita, Ion Lungu, Iuliana Botha Department of Economic Informatics and Cybernetics Academy of Economic Studies, Bucharest, Romania {vlad.diaconita, ion.lungu, iuliana.botha}@ie.ase.ro}

Abstract: Query optimization is one of the most important problems in . Query optimization can be talked about in different scenarios with different tools and different results. In this paper we’ll talk about query processing and optimization and test strategies in different context, starting from simple queries continuing with correlated queries, joins with or without different indexing strategies.

Keywords: optimization, testing, indexing, partitioning,

results. In the event of a runtime error, a message is 1. Introduction being generated by the runtime database processor. In practice, SQL is the that is used Like shown in [1], given a query, there are many in most commercial RDBMSs. Like shown in different plans that a DBMS can follow to process many works, like [1] or [2], in any RDBS, a query it and produce the results. All plans produce the written in SQL goes through the flow pictured in same result but can vary in the amount of time that figure 1. they need to execute. Query optimization is absolutely necessary in a DBMS, the cost difference between two alternatives can be

Figure 1 Query flow in RDBMS important. The optimizer doesn’t always choose the best Even if in this paper the tests are done using an strategy, only a reasonably efficient strategy for Oracle environment, regardless of the RDBMS executing the query. Finding the best strategy is vendor the process is similar and so are the usually too time-consuming. Also, determining the strategies depicted. First, the scanner finds the optimal query execution strategy can require query tokens from the text of the query. The parser detailed information on how the files are checks the query syntax if it complies with the implemented and even on the contents of the files, syntax rules of the query language then translates it information that may not be fully available in the into an internal form, usually a relational calculus DBMS catalogue. expression or something equivalent. The query in A language like SQL provides the developer with a valid if the attributes and names are tool for specifying what the needed results of the consistent with the database schema that is query are without the need to specify how these interrogated. An internal representation of the query results should be obtained. is made, usually as a tree data structure called a A RDBMS must be able to compare query query tree, or using a graph data structure called a execution strategies and choose a near-optimal query graph. The Database Management System strategy. Each DBMS typically has a number of (DBMS) must create an execution strategy plan for general database access algorithms that implement retrieving the results of the query from the database relational algebra operations such as Selection, files. A query can have many possible execution Projection or Join or combinations of these strategies, and the optimization is the process of operations. Only execution strategies that can be choosing a appropriate one. implemented by the DBMS access algorithms and The query optimizer module has to come up with a that apply to the particular query, as well as to the good execution plan, and the code generator particular physical , can be executes that plan. The runtime database processor considered by the query optimization module. has the task of executing the query code, in As noted in [6] there are two separate stages to compiled or in interpreted mode, and generate the running a query. First, the optimizer works out what it “thinks” is going to happen, and then the

ISBN: 978-1-61804-134-0 152 Latest Trends in Information Technology

run-time engine does what it wants to do. In This retrieves the commission for ANDR. The principle, the optimizer should know about, outer query block is: describe, and base its calculations on the activity select * from oca where comis>c that the run-time engine is supposed to perform. In where c represents the result returned from the practice, the optimizer and the run-time engine inner block. sometimes have different ideas about what they Like shown in [2], an SQL query is first translated should be doing. There are many features available into an equivalent extended relational algebra to the optimizer to manipulate the query before expression, represented as a query tree data optimizing it such as predicate pushing, subquery structure which then optimized. SQL queries are unnesting and star transformation. split into query blocks, units that can be translated The query optimizer chooses an execution plan for into the algebraic operators and then optimized. A each query block. In Oracle, all_rows goal means query block contains a single SELECT-FROM- that the optimizer will attempt to find an execution WHERE expression, as well as GROUP BY and plan that returns all the rows in the shortest possible HAVING clauses if these are part of the block. If time. A first_rows N goal will try to maximize the they exist, nested queries within a query are plan for returning a part (10,100, 10000) of the total identified as separate query blocks. Because SQL number of rows. The Rule Based Optimizer (RBO) includes aggregate functions, such as MAX, MIN, is obsolete starting Oracle 10g. It’s still present but SUM or COUNT, these functions must also be is no longer supported and is present to provide included in the extended algebra, like operators. backwards compatibility. The choose option, this In this example, the inner block needs to be gives the optimizer a run-time choice between rule evaluated once to produce the average commission based optimization that is currently obsolete and of the orders by the clients of a certain broker, all_rows. which is then used by the outer block to return the As shown in [7], Oracle9i Database has an I/O cost needed results. This is also called a non-correlated model that evaluated everything primarily in terms nested. The execution plan for query (1) is the one of single block reads, largely ignoring CPU costs in listing 1: (or using constants to estimate CPU costs). Starting with Oracle 10G the cost model for the optimizer is SELECT STATEMENT, GOAL = ALL_ROWS CPU+I/O, with the cost unit as time. Cost=2812 Cardinality=17927 Bytes=699153 ACCESS FULL Object owner=FORT 2. Processing a query Object name=OCA Cost=1406 Cardinality=17927 Consider the following statement in extended Bytes=699153 relational algebra: SORT AGGREGATE Cardinality=1 Bytes=18 TABLE ACCESS FULL Object owner=FORT πord,id_cc(σComis> Comis(σ broker =’ANDR’(OCA))) Object name=OCA Cost=1407 Cardinality=5192 where: ℑ • (script F) is the symbol used for an Bytes=93456 aggregate function; Listing 1 Non-correlated nested ℑ • σ is the symbol for the selection operation; So, the cost of this select is 2812, the estimated • π is the symbol for the projection operation. cardinality is 17923 and the real number of returned This statement can be translated into SQL like this: records being 93367. (1) select ord,id_cc from oca where comis> The cost represents the optimizer’s best estimate of (select avg(comis) from oca where broker='ANDR') the time it will take to execute the statement, but This query retrieves all the orders, regardless of the this cost can be wrong because of many reasons, broker, which have a commission that is greater such as some wrong assumptions exist in the cost than the average commission for the ANDR broker. model or relevant statistics is missing or present but The query includes a nested sub query so it should misleading. be decomposed into two blocks. With older databases it used to be much more The inner block is: costly to run correlated nested queries, where a select avg(comis) from oca tuple variable from the outer query block appears in where broker='ANDR' the WHERE-clause of the inner query block:

ISBN: 978-1-61804-134-0 153 Latest Trends in Information Technology

(2) select o.ord,o.id_cc from oca o where condition. If the search uses an index, it is called an comis> (select avg(t.comis) from oca t index scan. A more advanced index, a clustering where t.broker=o.broker); one can be used to retrieve multiple records if the In this query we search for the orders with the selection condition involves an equality comparison commission greater than the average for each on a non key attribute with a clustering index. A broker. In this case, the inner query must be run for cluster is a method for storing more than one table each tuple from the outer query. The execution plan on the same block. Usually, a block contains data for query (2) is the one in listing 2: for exactly one table. In a cluster, there can be SELECT STATEMENT, GOAL = ALL_ROWS many tables sharing the same block. Cost=4320 Cardinality=17927 Bytes=1111474 If a condition of a SELECT operation is a HASH JOIN Cost=4320 Cardinality=17927 conjunctive condition constructed of several simple Bytes=1111474 conditions connected with the AND logical VIEW Object owner=SYS Object operator such as the forth method, the DBMS can name=VW_SQ_1 Cost=1440 Cardinality=358530 use additional methods to implement the operation Bytes=6453540 such as conjunctive selection that uses an individual HASH GROUP BY Cost=1440 index, a composite one or an intersection of record Cardinality=358530 Bytes=8246190 pointers. TABLE ACCESS FULL Object owner=FORT When the selection has a single condition, the Object name=OCA Cost=1405 database management system will check if an Cardinality=358530 Bytes=8246190 access path exists on the attribute involved in that TABLE ACCESS FULL Object owner=FORT condition. If an access path (such as index or hash Object name=OCA Cost=1405 key or sorted file) is available, the method Cardinality=358530 Bytes=15775320 corresponding to that access path is used. If this is Listing 2 Correlated nested without indexes not the case, the brute force, linear search approach of method can be used. Query optimization for a These tests are done on copies of the original SELECT operation is required mostly for production environment tables and have no indexes conjunctive select conditions whenever more than or statistics at first. The cost of this select is 4320, one of the attributes involved in the conditions have which is 53% greater than in the first example an access path. The optimizer should choose the given that the estimated cardinality being the same. access path that retrieves the fewest records in the The real number of records returned by this query is most efficient way by estimating the different costs 108429. and choosing the method with the least estimated cost. 3. Optimizing strategies When the optimizer has to choose from multiple There are many algorithms for executing a simple conditions in a conjunctive select condition, SELECT operation, which is basically a search it typically considers the selectivity of each operation to locate the records in a disk file that condition. As shown in [1], selectivity is the ratio of satisfy a certain condition. Some of the search the number of records (tuples) that satisfy the algorithms depend on the file having specific access condition to the total number of records (tuples) in paths, and they may apply only to certain types of the file (table/relation). So, selectivity is a number selection conditions. Many rules for different query between zero and one. At the extremes, zero types (simple selection, different types of joins, selectivity means none of the records in the file multiple queries etc) and different environments satisfies the selection condition, and a selectivity of (databases, data warehouses, virtual data one means that all the records in the file satisfy the warehouses, data marts etc) have been identified, condition. Compared to a conjunctive selection discussed and tested in numerous works, such as [1] condition, a disjunctive condition, where simple , [2], [3] or [4]. conditions are connected by the OR logical A number of search algorithms are possible for connective, is much harder to process and optimize. selecting records from a file. These are also known For example, given this statement: σbroker=’ANDR’ OR as file scans, because they scan the records of a file Comis > 8 OR Sym=‘EBS’(OCA). In this expression not much to search for the records that comply with a optimization can be made, because the records

ISBN: 978-1-61804-134-0 154 Latest Trends in Information Technology

satisfying the disjunctive condition are made from table hashes each of its records using the same hash the union of the records satisfying the three function to explore the appropriate bucket, and that conditions. So, if any one of the conditions does not record is combined with all matching records from have an access path, only the brute force, linear the first table in that bucket. This simplified search approach can be used. If an access path description of -hash join assumes that the exists on every simple condition in the disjunction, smaller of the two tables fits entirely into memory the selection can be optimized by retrieving the buckets after the first phase. In practice, these records satisfying each condition and then applying techniques are implemented by accessing whole the union operation to eliminate duplicates. The disk blocks of a file, and not individual records. execution plan for the above statement, if indexes The execution plan of the query optimizer can be exist on each involved is the following uses overridden with hints inserted in the SQL the three indexes and has a cost of 1143 and an statements. estimated cardinality of 6913.: Indexes exist to help speed up queries. Having the A DBMS can use many of the methods discussed proper columns indexed can reduce the logical I/Os above, and some additional ones. The query for queries. However, creating an index to make optimizer must choose the appropriate one for one query run faster may not be a good solution. executing each SELECT operation in a query which There are costs associated with data changes when has the lowest estimated cost. the indexes are involved and the maintenance The JOIN operation is one of the most time- requirements should be considered. The consuming operations in query processing. performance gains of adding the index should be A natural join is a type of equijoin where the join more than the cost of maintaining the index. If we predicate arises implicitly by comparing all build too many indexes it can lead to performance columns in both tables that have the same column- issues, instead of solving problems. Indexes should names in the joined tables. The resulting joined be used selectively and their usage monitored. table contains only one column for each pair of Indexes are definitely a useful tool for improving equally named columns. access to the data in the database. Several types of There are many possible ways to implement a two- indexes are available on both database platforms. way join, which is a join on two files. Joins Understanding which type of index is being used involving more than two files are called multi way and how to improve that index will help in joins. The number of possible ways to execute performance tuning. Knowing how the various multi way joins grows very rapidly. index types affect data changes and improve The nested-loop join is the basic “brute force” SELECT statements will help you to decide if the algorithm, useful when small subsets of data are benefits of the index outweigh the costs for putting being joined, as it does not require any special it in place. access paths on either file in the join. For each At first, a new index can lead to inexplicable record in a table, retrieve every record another table results. For example, we add a primary key on the and test whether the two records satisfy the join OCA table which shouldn’t affect the (2) query: condition. A sort merge join can be used if an index alter table oca add constraint oca_pk primary exists on either one or both join attributes the join key(id_u,ord); can be greatly improved by retrieving directly the The cost and the cardinality for query (1) remains matching records. the same from listing 1 but the new execution plan A partition-hash join implies that the records of the for query (2) shows a cost of 39207 and a tables involved are partitioned in smaller parts. The cardinality of 149469489 (!?) compared to 4320 partitioning of each table is done using the same and 17927 even if the execution plan remains the hashing function on the join attributes. In the same. partitioning phase, a single pass through the table We force Oracle to gather statistics using the with fewer records hashes its records to the various following: partitions of the table. The collection of records BEGIN with the same value of the hash function is at the DBMS_STATS.CREATE_STAT_TABLE same partition, which is a hash bucket in a hash. In ('fort','saves'); the probing phase, a single pass through the other DBMS_STATS.GATHER_TABLE_STATS

ISBN: 978-1-61804-134-0 155 Latest Trends in Information Technology

('fort', 'oca', stattab => 'saves'); Instead of storing the ID, a bitmap for each key END; is used. Each column means a different value in the / bitmapped index. This two-dimensional array After this, the cost becomes 2884 and the represents each value within the index for each cardinality 165157, much closer to the truth. rows of the table. Because of this, these indexes are The oldest and most popular type of Oracle typically smaller in size and are useful for columns indexing is a standard b-tree index, which excels at that have a low cardinality (few distinct values) and servicing simple queries. The b-tree index was for read-only tables. They might be more expensive introduced in the earliest releases of Oracle and than other types of indexes for tables in which the remains widely used with Oracle. data changes. Building multiple bitmapped indexes While b-tree indexes can speed up simple queries, can provide fast answers for difficult SQL queries. they are not a very good choice for low-cardinality For the OCA table, column status can have only 9 columns that have fewer than 200 distinct values. different values. If we run this statement: σ stare They do not have the selectivity required in order to =’A’(OCA) without any index, the estimated cost is benefit from standard b-tree index structures. Also, 1409. If we build a bitmap index and then gather there is no support for SQL functions. The B-tree table stats, the CBO doesn’t use it for this query. indexes are not able to speed SQL queries using The cost with full table scan is 1411. If we force the Oracle's built-in functions. Oracle provides a usage of the index with a hint the cost goes up to variety of built-in functions that allow SQL 5648: statements to query on a piece of an indexed select /*+ index(oca OCA_STARE) */ column or on any one of a number of * from oca where stare='A' transformations against the indexed column. Bitmap join indexes store the join of two tables. An index-organized table (IOT) has a storage This type of index is useful in a data warehousing organization that is a variant of a primary B-tree. environment and with a star data model schema, Unlike an ordinary (heap-organized) table whose because it will index the smaller table information data is stored as an unordered collection (heap), on the larger fact table. The row IDs are stored for data for an index-organized table is stored in a B- the corresponding row ID of the joined table. tree index structure in a primary key sorted manner Reverse key indexes can spread out index blocks [10]. Each leaf block in the index structure stores for a sequenced column. By using a sequence, there both the key and non-key columns. Index-organized can be many records that all start with the same tables are ideal for OLTP applications, which number. Reversing the numbers will allow for the require fast primary key access and high index to have different beginning values and use availability. Organizing a table like this would different blocks in the index b-tree structure. When make it fast when using the primary key for the loading data, the reverse index will minimize the joins or using just the primary key as the part of the concurrency on the index blocks. WHERE clause. If the table is normally searched One of the most important features in Oracle by other columns than how the table is organized, indexing is the introduction of function-based creating an IOT might not be the solution. indexing. Function-based indexes allow creation of We create an index organized table OCA2, a copy indexes on expressions, build-in functions, and of OCA. After gathering statistics like shown user-written functions in PL/SQL or Java. Oracle’s above, we run the query (2) which doesn’t any part function-based index type can dramatically reduce of the primary key. The cost becomes 4840 instead query time. The function-based index can be a of 2884. We run the following query on both the composite index with other columns included in the original table and the IOT one: index. The function that is used needs to match (3) select t.* from oca2 t,jurnal_executii j what is being used in the WHERE clause. where t.id_u=j.id_u and t.ord=j.ord Partitioning is a useful way to tune a large database The cost on the original query is 4300 and on the environment. Oracle offers options for partitioning IOT is 5000 so in this case this type of index table, such as LIST, HASH, RANGE, and doesn’t perform well. COMPOSITE. The partition key is how the table is Bitmap indexes aren’t the same with the standard partitioned and partitioned indexes can be created b-tree indexes and they are stored differently. for these tables. The index can be a local

ISBN: 978-1-61804-134-0 156 Latest Trends in Information Technology

partitioned index based on the partition key and set override the default execution plans by using hints up for each partition. The main objective of the and also compare the performance of different partitioning technique is to radically decrease the indexing strategies. amount of disk activity and to limit the amount of Acknowledgment: This paper presents some data to be examined or operated on and to enable results of the research project PN II, TE Program, parallel execution required to perform queries. Code 332: “Informatics Solutions for decision Tables are partitioning using a partitioning key that making support in the uncertain and unpredictable is a set of columns which will determine by their environments in order to integrate them within a conditions in which partition a given row will be Grid network”, financed within the framework of store. People research program. We build o copy of the OCA table, partitioned by range of data_i. Any query that involves this References column has a much lower cost. For example: [1] Ramez Elmasri, Shamkant B. Navathe, σ data_i <’01/07/2007’(OCA_PARTITIONED) has a cost Fundamentals of database systems, 6th ed, of 507 on the partitioned table compared to 1401 on Addison-Wesley Publishing House, 2011, ISBN- the non-partitioned one. We run two joins, the first 13: 978-0-136-08620-8 join that doesn’t use the column on which the table [2] Yannis E. Ioannidis, Query Optimization, is partitioned, and the second join that makes an Computer Sciences Department, University of extra selection on this column. Wisconsin A. OCA ord=ord TRADES [3] Yousuke Watanabe, Hiroyuki Kitagawa. 2010. Query result caching for multiple event-driven B. σ data_i <’01/07/2007’(OCA ord=ord TRADES) We run these ⋈queries on the original table and on continuous queries, Inf. Syst. 35, 1 (January 2010), the partitioned one, the results⋈ being shown in table 94-110. 1. [3] Michelle Malcher, Table 1 Testing results Administration for Microsoft® SQL Server® DBAs, The McGraw-Hill Companies, ISBN: 978-0-07- Original Partitioned 174430-0, 2011 A Query 4339 4373 [4] David Taniar, Hui Yee Khaw, Haorianto B Query 2744 1851 Cokrowijoyo Tjioe, and Eric Pardede. 2007. The In this case, the results are as expected. The cost is use of Hints in SQL-Nested query optimization. Inf. significant lower when the column on which the Sci. 177, 12 (June 2007), 2493-2521 table is partition is used and a bit higher when it [5] Ion Lungu, Manole Velicanu, Adela Bara, Vlad isn’t used. Diaconiţa, Iuliana Botha - Practices for Designing Also, there can be use sub partitioning techniques and Improving Data Extraction in a Virtual Data in which the table in first partitioned by Warehouses Project, International Journal of range/list/hash and then each partition is divided in Computers, Communications & Control, Vol. III sub partitions such as: composite range-hash (2008), pag. 369-374, ISSN 1841-9836, E-ISSN partitioning and composite range-list partitioning. 1841-9844 IOT can be partitioned by range, list, or hash. [6] Jonathan Lewis, Cost-Based Oracle Fundamentals, Apress, ISBN (pbk): 1-59059-636- 4. Conclusions 6, 2006 [7] Kimberly Floss, Understanding Optimization, We can say that tuning and optimizing is a process Oracle Magazine, 2005 that is handled by the DBMS, the database [8] Donald K. Burleson, Oracle Tuning, ISBN 0- administrator and the developers. The DBMS can 9744486-2-1, 2006 choose an execution plan according to his rules but [9] Oracle® Multimedia User's Guide 11g Release also provides different tools that can gather 1 (11.1), B28415-02 statistics and suggest changes (different execution [10] Oracle® 's Guide 11g plans, rewriting queries to take advantage of the Release 1 (11.1) indexes, new indexes) that can be implemented by the administrator. Developers can also test and

ISBN: 978-1-61804-134-0 157