Efficient Computation of Multiple Group By Queries
Zhimin Chen, Vivek Narasayya
Microsoft Research
{zmchen, viveknar}@microsoft.com

Total Pages: 16

File Type: pdf, Size: 1020 KB

ABSTRACT

Data analysts need to understand the quality of data in the warehouse. This is often done by issuing many Group By queries on the sets of columns of interest. Since the volume of data in these warehouses can be large, and tables in a data warehouse often contain many columns, this analysis typically requires executing a large number of Group By queries, which can be expensive. We show that the performance of today's database systems for such data analysis is inadequate. We also show that the problem is computationally hard, and develop efficient techniques for solving it. We demonstrate significant speedup over existing approaches on today's commercial database systems.

1. INTRODUCTION

Decision support analysis on data warehouses influences important business decisions, and hence the accuracy of such analysis is crucial. Understanding the quality of data is an important requirement for a data analyst [18]. For instance, if the number of distinct values in the State column of a relation describing customers within the USA is more than 50, this could indicate a potential problem with data quality. Other examples include the percentage of missing (NULL) values in a column, the maximum and minimum values, etc. Such aggregates help a data analyst evaluate whether the data satisfies the expected norm.

This kind of data analysis can be viewed as obtaining frequency distributions over a set of columns of a relation, i.e., a Group By query of the form SELECT X, COUNT(*) FROM R GROUP BY X, where X is a set of columns on relation R. Note that X may sometimes contain derived columns, e.g., LEN(c) for computing the length distribution of a column c. For example, consider a relation Customer(LastName, FirstName, M.I., Gender, Address, City, State, Zip, Country). A typical scenario is understanding the distributions of values of each column, which requires issuing several Group By queries, one per column. In addition, it is sometimes necessary to also understand the joint frequency distributions of sets of columns; e.g., the analyst may expect that (LastName, FirstName, M.I., Zip) is a key (or almost a key) for that relation. We note that some data mining applications [15] also have similar requirements of computing frequency distributions over many sets of columns of a relation.

This kind of data analysis can be time consuming and resource intensive for two reasons. First, the number of rows in the relation can be large. Second, the number of sets of columns over which Group By queries are required can also be large. A naïve approach is to execute a different Group By query for each set of columns.

In some commercial database systems, an extension to the GROUP BY construct called GROUPING SETS is available, which allows the computation of multiple Group By queries using a single SQL statement. The multiple Group By queries are specified by providing a set of column sets. For example, specifying GROUP BY GROUPING SETS ((A), (B), (C), (A,C)) would cause the Union All of all four Group By queries to be returned in the result set. The query optimizer has the ability to optimize the execution of the set of queries by potentially taking advantage of the commonality among them. However, as the following experiment on a commercial database system shows, GROUPING SETS functionality is not optimized for the kinds of data analysis scenarios discussed above.
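To make the construct concrete, the following is a minimal sketch of the GROUPING SETS form just described, together with the Union All it is logically equivalent to; the table R(A, B, C) is illustrative, not taken from the paper.

    -- One statement computing four Group By queries over an illustrative table R(A, B, C):
    SELECT A, B, C, COUNT(*) AS cnt
    FROM R
    GROUP BY GROUPING SETS ((A), (B), (C), (A, C));

    -- Logically equivalent to the Union All of the individual queries;
    -- columns that do not participate in a grouping set are returned as NULL:
    SELECT A, NULL AS B, NULL AS C, COUNT(*) AS cnt FROM R GROUP BY A
    UNION ALL
    SELECT NULL, B, NULL, COUNT(*) FROM R GROUP BY B
    UNION ALL
    SELECT NULL, NULL, C, COUNT(*) FROM R GROUP BY C
    UNION ALL
    SELECT A, NULL, C, COUNT(*) FROM R GROUP BY A, C;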
Example 1. Consider a scenario where the data analyst wishes to obtain the single-column value distribution of each character or categorical column in a relation. For example, there are 12 such columns in the lineitem table of the TPC-H database (1GB) [24]. We used a single GROUPING SETS query on a commercial database system to compute the results of these 12 Group By queries. We compared the time taken to execute the GROUPING SETS query with the time taken using the approach presented in this paper. Using our approach was about 4.5 times faster.

The explanation for this is that GROUPING SETS is not optimized for scenarios where many column sets with little overlap among them are requested, which is a common data analysis scenario. In Example 1, where we request 12 single-column sets only, the plan picked by the query optimizer is to first compute the Group By of all 12 columns, materialize that result, and then compute each of the 12 Group By queries from that materialized result. It is easy to see that such a strategy is almost the same as the naïve approach of computing each of the 12 Group By queries from the lineitem table itself. In contrast, our solution consisted of materializing the results of the following Group By queries into temporary tables: (1) (l_receiptdate, l_commitdate), (2) (l_tax, l_discount, l_quantity, l_returnflag, l_linestatus), and computing the single-column Group By queries of these columns from the respective temporary tables. The Group By queries for each of the remaining columns were computed directly from the lineitem table.
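A sketch of this rewriting in SQL; the temporary-table names and the SELECT INTO (SQL Server-style) syntax are illustrative, not taken from the paper:

    -- Materialize a coarser Group By once:
    SELECT l_receiptdate, l_commitdate, COUNT(*) AS cnt
    INTO #dates
    FROM lineitem
    GROUP BY l_receiptdate, l_commitdate;

    -- Each single-column distribution is then answered from the much smaller
    -- temporary table by summing the partial counts:
    SELECT l_receiptdate, SUM(cnt) AS cnt FROM #dates GROUP BY l_receiptdate;
    SELECT l_commitdate, SUM(cnt) AS cnt FROM #dates GROUP BY l_commitdate;

    -- Columns not covered by a materialized result are grouped directly:
    SELECT l_shipmode, COUNT(*) AS cnt FROM lineitem GROUP BY l_shipmode;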
An important observation, illustrated by the above example, is that materializing the results of queries, including queries that are not required, can speed up the overall execution of the required queries. It is easy to see that the search space, i.e., the space of queries that are not required, but whose results, if materialized, could speed up execution of the required queries, is very large. For example, for a relation with 30 columns, if we want to compute all single-column Group By queries, the entire space of relevant Group By queries to consider for materialization is 2^30.

We note that previous solutions to the above problem, including GROUPING SETS, were designed for efficient processing of the (partial) datacube, e.g., [2,14,16,21], as a generalization of CUBE and ROLLUP queries. Unlike the kind of data analysis scenarios presented above, these solutions were geared towards OLAP-like scenarios with a small set of Group By queries typically having significant overlap in the sets of columns. Hence, most of these previous solutions assume that the search space of queries can be fully enumerated as a first step in the optimization. Thus, these solutions do not scale well to the kinds of data analysis scenarios described earlier: for a relation with many columns, the step of constructing the search lattice itself would not be feasible. In contrast, our approach constructs only a small part of the lattice in a bottom-up manner, interleaved with the search algorithm (Section 4). This allows us to scale to much larger inputs.

We note that once the set of queries to be materialized is determined, several physical operators, e.g., PipeHash and PipeSort [14,16], Partitioned-Cube and Memory-Cube [16], etc., for efficiently executing a given set of queries (without introducing any new queries) can be used. The basic idea is to take advantage of commonality across Group By queries using techniques such as shared scans, shared sorts, and shared partitions [2,8,15,16,21]. This problem is orthogonal to ours, and these techniques can therefore be leveraged by our solution as well. We note that work on efficient computation of a CUBE (respectively ROLLUP) query is a special case of the partial cube computation discussed above, where the entire lattice (resp. a hierarchy) of queries needs to be computed. In contrast, our work can be thought of as introducing logically equivalent rewritings of a GROUPING SETS query in a cost-based manner.

A second category of related work is the creation of materialized views for answering queries. The paper [17] studies the problem of how to speed up the update/creation of a materialized view V by creating additional materialized views. They achieve this by creating a DAG over logical operators by considering various equivalent expressions for V (e.g., those generated by the query optimizer), and then determining an optimal set of additional views (sub-trees of the DAG) to materialize.

Our key ideas and contributions can be summarized as follows. First, we show that even a simple case of this problem, where we need to compute n single-column Group By queries of a relation, is computationally hard. Second, unlike previous solutions to this problem, our algorithm explores the search space in a bottom-up manner using a hill-climbing approach. Consequently, in most cases, we do not require enumerating the entire search space for optimizing the required queries. This allows us to scale well to the kinds of data analysis scenarios mentioned above. Third, our algorithm is cost based, and we show how this technique can be incorporated into today's commercial database systems for optimizing a GROUPING SETS query. As an alternative, we show how today's client data analysis tools and applications can take advantage of our techniques to achieve better performance until such time that GROUPING SETS queries are better optimized for such scenarios. Finally, we have conducted an experimental evaluation of our solution on real-world datasets against existing approaches on today's commercial database systems.
Recommended publications
  • Aggregate Order by Clause
    Aggregate Order By Clause. The default sort order is ascending. Ranking functions of this kind divide the result set into the number of groups specified as an argument to the function. Result output occurs within the minimum time interval of timer resolution. If external sorting is enabled, then once the volume of data in the sort reaches the specified number of bytes, the collected data is sorted and dumped into a temporary file. However, this method may not scale well in situations where thousands of concurrent transactions are initiating updates to the same table. Execute the following query with GROUP BY to calculate these values.
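    A minimal sketch of grouping and then ordering by the aggregate; the table and column names are hypothetical:

        -- Group, aggregate, then sort by the aggregate; DESC overrides the
        -- default ascending order:
        SELECT department, COUNT(*) AS employee_count
        FROM employees
        GROUP BY department
        ORDER BY employee_count DESC;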
  • SQL (From Wikipedia, the Free Encyclopedia)
    SQL. From Wikipedia, the free encyclopedia. This article is about the database language. For the airport with IATA code SQL, see San Carlos Airport.

    Paradigm: Multi-paradigm. Appeared in: 1974. Designed by: Donald D. Chamberlin, Raymond F. Boyce. Developer: IBM. Stable release: SQL:2008 (2008). Typing discipline: Static, strong. Major implementations: Many. Dialects: SQL-86, SQL-89, SQL-92, SQL:1999, SQL:2003, SQL:2008. Influenced by: Datalog. Influenced: Agena, CQL, LINQ, Windows PowerShell. OS: Cross-platform.

    SQL (officially pronounced /ˌɛskjuːˈɛl/ like "S-Q-L" but often pronounced /ˈsiːkwəl/ like "Sequel"),[1] often referred to as Structured Query Language,[2][3] is a database computer language designed for managing data in relational database management systems (RDBMS), and originally based upon relational algebra. Its scope includes data insert, query, update and delete, schema creation and modification, and data access control. SQL was one of the first languages for Edgar F. Codd's relational model in his influential 1970 paper, "A Relational Model of Data for Large Shared Data Banks",[4] and became the most widely used language for relational databases.[2][5]

    Contents: 1 History; 2 Language elements (2.1 Queries, 2.1.1 Null and three-valued logic (3VL), 2.2 Data manipulation, 2.3 Transaction controls, 2.4 Data definition, 2.5 Data types (2.5.1 Character strings, 2.5.2 Bit strings, 2.5.3 Numbers, 2.5.4 Date and time), 2.6 Data control, 2.7 Procedural extensions); 3 Criticisms of SQL (3.1 Cross-vendor portability); 4 Standardization (4.1 Standard structure); 5 Alternatives to SQL; 6 See also; 7 References; 8 External links.

    History: SQL was developed at IBM by Donald D.
  • Overview of SQL:2003
    Overview of SQL:2003. Krishna Kulkarni, Silicon Valley Laboratory, IBM Corporation, San Jose. 2003-11-06.

    Outline of the talk: overview of SQL:2003; new features in SQL/Framework, SQL/Foundation, SQL/CLI, SQL/PSM, SQL/MED, SQL/OLB, SQL/Schemata, and SQL/JRT; brief overview of SQL/XML.

    SQL:2003 is the replacement for the current standard, SQL:1999. FCD editing was completed in January 2003, with the new International Standard expected by December 2003. It contains bug fixes and enhancements to all 8 parts of SQL:1999 and one new part (SQL/XML). There are no changes to conformance requirements: products conforming to Core SQL:1999 should conform automatically to Core SQL:2003.

    SQL:2003 is structured as 9 parts: Part 1: SQL/Framework; Part 2: SQL/Foundation; Part 3: SQL/CLI (Call-Level Interface); Part 4: SQL/PSM (Persistent Stored Modules); Part 9: SQL/MED (Management of External Data); Part 10: SQL/OLB (Object Language Binding); Part 11: SQL/Schemata; Part 13: SQL/JRT (Java Routines and Types); Part 14: SQL/XML. Parts 5, 6, 7, 8, and 12 do not exist.

    Part 1: SQL/Framework covers the structure of the standard and the relationship between the various parts, common definitions and concepts, and the conformance requirements statement. Updates in SQL:2003/Framework reflect updates in all other parts.

    Part 2: SQL/Foundation is the largest and the most important part; it specifies the "core" language. SQL:2003/Foundation includes all of SQL:1999/Foundation (with lots of corrections) plus a number of new features: predefined data types, type constructors, and DDL (data definition language) for creating, altering, and dropping various persistent objects including tables, views, user-defined types, and SQL-invoked routines.
  • Real Federation Database System Leveraging PostgreSQL FDW
    PGCon 2011, May 19, 2011. Real Federation Database System leveraging PostgreSQL FDW. Yotaro Nakayama, Technology Research & Innovation, Nihon Unisys, Ltd.

    Motivation for the Federation Database. From a data-integration solution viewpoint: data integration and information integration are hot topics these days, and the requirement for information integration has increased in recent years. From a technical viewpoint: the study of distributed databases has a long history, and the underlying technology of distributed databases has changed and improved. Nowadays data moves from local storage to the cloud, so it may be worth rethinking the technology.

    Topics: 1. Introduction: the Federation Database System as a virtual data integration platform. 2. Implementation of a Federation Database with PostgreSQL FDW: Foreign Data Wrapper enhancement and federated query optimization. 3. Use case of the Federation Database: an example use case and expansion of FDW. 4. Result and conclusion.
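    For context, this is roughly what FDW-based federation looks like in SQL. The sketch uses today's postgres_fdw (added to core PostgreSQL after this 2011 talk); the server, table, and credential names are illustrative, and customers is a hypothetical local table:

        -- Register a wrapper and a remote server, then map credentials:
        CREATE EXTENSION postgres_fdw;
        CREATE SERVER remote_sales FOREIGN DATA WRAPPER postgres_fdw
            OPTIONS (host 'sales-db.example.com', dbname 'sales');
        CREATE USER MAPPING FOR CURRENT_USER SERVER remote_sales
            OPTIONS (user 'reporter', password 'secret');

        -- Expose a remote table locally and join it with local data:
        CREATE FOREIGN TABLE orders (
            order_id    integer,
            customer_id integer,
            amount      numeric
        ) SERVER remote_sales OPTIONS (table_name 'orders');

        SELECT c.name, sum(o.amount)
        FROM customers c JOIN orders o ON o.customer_id = c.customer_id
        GROUP BY c.name;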
  • SQL Views for Medical Billing
    SQL Views for Cortex Medical Billing

    To execute a view in the SQL Query Analyzer window, run "SELECT * FROM view_name". To narrow the results of a view you may add a WHERE clause. The column named in the WHERE clause must match the column name in the view, resolved in the following order: 1. a column name that follows the view name; 2. a column name in the SELECT statement of the view. To find out the name of a column in a view, double-click on the view name in Enterprise Manager.

    EXAMPLE:

        CREATE VIEW ClientPayment_view (ClPaymentServiceNumber, ClPaymentAmount)
        AS
        SELECT ClientPayment.ClientServiceNumber, SUM(ClientPayment.Amount)
        FROM ClientPayment
        GROUP BY ClientPayment.ClientServiceNumber

    In this example, in order to execute a SQL query with a WHERE clause on ClientPayment_view you must use the column name "ClPaymentServiceNumber" or "ClPaymentAmount". If the column names (ClPaymentServiceNumber, ClPaymentAmount) did *NOT* follow the view name, you could use the column names in the SELECT statement (ClientServiceNumber, Amount). Example SELECT statement with a WHERE clause:

        SELECT * FROM ClientPayment_view WHERE ClPaymentServiceNumber = '0000045'

    When writing a view where you only want one record returned to sum a field like amount (i.e., SUM(Amount)), you cannot return any other columns from the table whose data will differ, like EnteredDate, because the query would then be forced to return each record separately; it could no longer combine them.
  • Efficient Processing of Window Functions in Analytical SQL Queries
    Efficient Processing of Window Functions in Analytical SQL Queries. Viktor Leis, Kan Kundhikanjana, Alfons Kemper, Thomas Neumann. Technische Universität München.

    ABSTRACT

    Window functions, also known as analytic OLAP functions, have been part of the SQL standard for more than a decade and are now a widely-used feature. Window functions allow one to elegantly express many useful query types including time series analysis, ranking, percentiles, moving averages, and cumulative sums. Formulating such queries in plain SQL-92 is usually both cumbersome and inefficient.

    Despite being supported by all major database systems, there have been few publications that describe how to implement an efficient relational window operator. This work aims at filling this gap by presenting an efficient and general algorithm for the window operator. Our algorithm is optimized for high-performance main-memory database systems and has excellent performance on modern multi-core CPUs. We show how to fully parallelize all phases of the operator in order to effectively scale for arbitrary input distributions.

        select location, time, value,
               abs(value - (avg(value) over w)) / (stddev(value) over w)
        from measurement
        window w as (
            partition by location
            order by time
            range between 5 preceding and 5 following)

    The query normalizes each measurement by subtracting the average and dividing by the standard deviation. Both aggregates are computed over a window of 5 time units around the time of the measurement and at the same location. Without window functions, it is possible to state the query as follows (the excerpt breaks off here):

        select location, time, value, abs(value- (select avg(value)
        from measurement m2
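    One plausible completion of that SQL-92 formulation, with the correlated subqueries spelled out, is sketched below; this is a reconstruction under stated assumptions, not the paper's exact text:

        select location, time, value,
               abs(value - (select avg(value)
                            from measurement m2
                            where m2.location = m.location
                              and m2.time between m.time - 5 and m.time + 5))
               / (select stddev(value)
                  from measurement m3
                  where m3.location = m.location
                    and m3.time between m.time - 5 and m.time + 5)
        from measurement m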
  • PROC SQL for PROC SUMMARY Stalwarts, Christianna Williams PhD, Chapel Hill, NC
    SESUG Paper 219-2019. PROC SQL for PROC SUMMARY Stalwarts. Christianna Williams PhD, Chapel Hill, NC.

    ABSTRACT

    One of the fascinating features of SAS® is that the software often provides multiple ways to accomplish the same task. A perfect example of this is the aggregation and summarization of data across multiple rows or BY groups of interest. These groupings can be study participants, time periods, geographical areas, or really just about any type of discrete classification that one desires. While many SAS programmers may be accustomed to accomplishing these aggregation tasks with PROC SUMMARY (or equivalently, PROC MEANS), PROC SQL can also do a bang-up job of aggregation, often with less code and fewer steps. This step-by-step paper explains how to use PROC SQL for a variety of summarization and aggregation tasks. It uses a series of concrete, task-oriented examples to do so. For each example, both the PROC SUMMARY method and the PROC SQL method will be presented, along with discussion of the pros and cons of each approach (a minimal sketch of this side-by-side pattern appears below). Thus, the reader familiar with either technique can learn a new strategy that may have benefits in certain circumstances. The presentation format will be similar to that used in the author's previous paper, "PROC SQL for DATA Step Die-Hards".

    INTRODUCTION

    Descriptive analytic and data manipulation tasks often call for aggregating or summarizing data in some way, whether it is calculating means, finding minimum or maximum values, or simply counting rows within groupings of interest. Anyone who has been working with SAS for more than about two weeks has probably learned that there are nearly always about a half dozen different methods for accomplishing any type of data manipulation in SAS (and often a score of user group papers on these methods), and summarizing data is no exception.
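    As a sketch of that side-by-side pattern, here is the same per-group mean computed both ways; sashelp.class ships with SAS, and the output dataset names are illustrative, not taken from the paper:

        /* PROC SUMMARY approach: one output row per SEX group */
        proc summary data=sashelp.class nway;
            class sex;
            var height;
            output out=height_summary mean=avg_height;
        run;

        /* PROC SQL approach: the same aggregation as a single query */
        proc sql;
            create table height_summary2 as
            select sex, avg(height) as avg_height
            from sashelp.class
            group by sex;
        quit;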
  • Query Language Reference
    Oracle® Endeca Server Query Language Reference. Version 7.4.0, September 2012.

    Copyright and disclaimer. Copyright © 2003, 2012, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. UNIX is a registered trademark of The Open Group.

    This software and related documentation are provided under a license agreement containing restrictions on use and disclosure and are protected by intellectual property laws. Except as expressly permitted in your license agreement or allowed by law, you may not use, copy, reproduce, translate, broadcast, modify, license, transmit, distribute, exhibit, perform, publish or display any part, in any form, or by any means. Reverse engineering, disassembly, or decompilation of this software, unless required by law for interoperability, is prohibited. The information contained herein is subject to change without notice and is not warranted to be error-free. If you find any errors, please report them to us in writing.

    If this is software or related documentation that is delivered to the U.S. Government or anyone licensing it on behalf of the U.S. Government, the following notice is applicable: U.S. GOVERNMENT END USERS: Oracle programs, including any operating system, integrated software, any programs installed on the hardware, and/or documentation, delivered to U.S. Government end users are "commercial computer software" pursuant to the applicable Federal Acquisition Regulation and agency-specific supplemental regulations. As such, use, duplication, disclosure, modification, and adaptation of the programs, including any operating system, integrated software, any programs installed on the hardware, and/or documentation, shall be subject to license terms and license restrictions applicable to the programs.
  • Handling One-To-Many Relations - Grouping and Aggregation
    Handling One-to-Many Relations - Grouping and Aggregation

    • No class/lab next Tuesday.
      o Today's lab is also Problem Set #1, due two weeks from today.
      o We won't cover all of today's lab exercises in this lab preparation talk; the rest will be covered this Thursday.
    • Handling one-to-many relations.
      o Today's lab prep will review SQL queries from last Thursday's lecture notes.
      o The key concept is the use of GROUP BY statements to handle one-to-many relations.
      o We also introduce another database (of URISA proceedings) to learn more about SQL queries and relational database design.
    • SQL queries and GROUP BY expressions.
      o See the Sample Parcel Database, the SQL Notes and Oracle help, and the Lab 1 examples.
      o The basic SELECT statement to query one or more tables:

        SELECT [DISTINCT] column_name1[, column_name2, ...]
        FROM table_name1[, table_name2, ...]
        WHERE search_condition1 [AND search_condition2 ...] [OR search_condition3 ...]
        [GROUP BY column_names]
        [ORDER BY column_names];

      o Note that the order of the clauses matters! The clauses, if included, must appear in the order shown. Oracle will report an error if you make a mistake, but the error message (e.g., "ORA-00933: SQL command not properly ended") may not be very informative. Kindly refer to the Lecture Notes section.

    Aggregation: GROUP BY, Group Functions. A simple GROUP BY example from the Parcels database: when a SQL statement contains a GROUP BY clause, the rows are selected using the criteria in the WHERE clause and are then aggregated into groups that share common values for the GROUP BY expressions. The HAVING clause may be used to eliminate groups after the aggregation has occurred; a worked sketch follows below.
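    A minimal sketch of that GROUP BY / HAVING pattern against a hypothetical parcels table (the column names are illustrative, not the lab's actual schema):

        -- Rows pass the WHERE filter first, are grouped by neighborhood,
        -- and HAVING then eliminates whole groups after aggregation:
        SELECT neighborhood, COUNT(*) AS parcel_count,
               AVG(assessed_value) AS avg_value
        FROM parcels
        WHERE land_use = 'RESIDENTIAL'
        GROUP BY neighborhood
        HAVING COUNT(*) >= 10
        ORDER BY avg_value DESC;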
  • Group By Clause in SQL Definition
    Group By Clause in SQL Definition. A column that appears in the select list but is not contained in an aggregate function must also appear in the GROUP BY clause; otherwise SQL Server rejects the query with an error such as: "Column 'EmpName' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause." A GROUP BY clause may also name two or more columns, in which case each distinct combination of values forms one group, and the aggregate functions return one result per group.
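    A short illustration of that error and its fix, using a hypothetical Employees table:

        -- Fails in SQL Server: EmpName is neither aggregated nor grouped.
        -- Error: Column 'EmpName' is invalid in the select list because it is
        -- not contained in either an aggregate function or the GROUP BY clause.
        SELECT EmpName, SUM(Salary) AS TotalSalary
        FROM Employees;

        -- Works: every non-aggregated select-list column appears in GROUP BY.
        SELECT EmpName, SUM(Salary) AS TotalSalary
        FROM Employees
        GROUP BY EmpName;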
  • SQL:1999, Formerly Known As SQL3
    SQL:1999, formerly known as SQL3. Andrew Eisenberg, Sybase, Concord, MA 01742, [email protected]. Jim Melton, Sandy, UT 84093, [email protected].

    Background

    For several years now, you've been hearing and reading about an emerging standard that everybody has been calling SQL3. Intended as a major enhancement of the current second generation SQL standard, commonly called SQL-92 because of the year it was published, SQL3 was originally planned to be issued in about 1996…but things didn't go as planned. As you may be aware, SQL3 has been characterized as "object-oriented SQL" and is the foundation for several object-relational database management systems (including Oracle's ORACLE8, Informix' Universal Server, IBM's DB2 Universal Database, and Cloudscape's Cloudscape, among others). This is widely viewed as a "good thing", but it has had a downside, too: it took nearly 7 years to develop, instead of the planned 3 or 4.

    As we shall show, SQL:1999 is much more than merely SQL-92 plus object technology. It involves additional features that we consider to fall into SQL's relational heritage, as well as a total restructuring of the standards documents themselves with an eye towards more effective standards progression in the future.

    Standards Development Process

    The two de jure organizations actively involved in SQL standardization, and therefore in the development of SQL:1999, are ANSI and ISO. More specifically, the international community works through ISO/IEC JTC1 (Joint Technical Committee 1), a committee formed by the International Organization for Standardization in conjunction with the International Electrotechnical Commission.
  • A Step Toward SQL/MED Datalink
    A step towards SQL/MED DATALINK. Author: Gilles Darold. PgConf Asia, September 2019.

    SQL/MED: SQL Management of External Data. It defines methods to access non-SQL data in SQL, and is standardized by ISO/IEC 9075-9:2001 (completed in SQL:2003 and SQL:2008). The SQL/MED specification is divided into two parts.

    Part 1: Foreign Data Wrapper. Access other data sources represented as SQL tables in PostgreSQL. Fully implemented in PostgreSQL. List of FDWs: https://wiki.postgresql.org/wiki/Foreign_data_wrappers

    Part 2: Datalinks. Reference files that are not part of the SQL environment; each file is assumed to be managed by some external file manager. Column values are references to local or remote files. There is no PostgreSQL implementation; a plpgsql/plperlu prototype is available at https://github.com/lacanoid/datalink

    What are Datalinks exactly? Files are referenced through a new DATALINK type, a special SQL type intended to store a URL in the database referencing an external file, along with access control and behavior over external files. File contents are not stored in the database but on the file system. (A syntax sketch follows below.)

    Datalink representation (diagram).

    Why are Datalinks useful? For applications that need the power of SQL with ACID properties while working directly with files. They support huge files, e.g. video or audio streaming, without having to store them in ByteA or Large Object columns. They avoid filling shared buffers, WALs, and the database itself with files that can be accessed directly by other programs, e.g. images or PDF files served through a Web server. And they benefit from external file caching and advanced file content search and indexing.
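    A sketch of DATALINK usage in SQL/MED, modeled on the standard and on DB2's implementation; PostgreSQL has no native support, per the talk, and the table, link-control options, and URL are illustrative:

        -- A DATALINK column stores a URL plus link-control behavior;
        -- the file itself stays on the file system.
        CREATE TABLE recordings (
            id    INTEGER PRIMARY KEY,
            title VARCHAR(200),
            media DATALINK FILE LINK CONTROL  -- option names vary by implementation
        );

        -- DLVALUE builds a DATALINK value from a URL:
        INSERT INTO recordings
            VALUES (1, 'Keynote', DLVALUE('file://media-host/videos/keynote.mp4'));

        -- DLURLPATH extracts the file-system path for use by external programs:
        SELECT DLURLPATH(media) FROM recordings WHERE id = 1;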