Eastern-European Journal of Enterprise Technologies ISSN 1729-3774 4/2 ( 84 ) 2018

UDC 004.75
DOI: 10.15587/1729-4061.2018.139630

ENHANCING THE PERFORMANCE OF DISTRIBUTED PROCESSING SYSTEMS USING HADOOP AND POLYBASE

S. Minukhin, Doctor of Technical Sciences, Professor*, E-mail: [email protected]
V. Fedko, PhD, Associate Professor*, E-mail: [email protected]
Y. Gnusov, PhD, Head of Department, Department of Cybersecurity, Kharkiv National University of Internal Affairs, L. Landau ave., 27, Kharkiv, Ukraine, 61080, E-mail: [email protected]
*Department of Information Systems, Simon Kuznets Kharkiv National University of Economics, Nauky ave., 9-А, Kharkiv, Ukraine, 61166

An approach to enhancing the performance of distributed information systems based on the joint use of the Hadoop cluster and the PolyBase SQL Server component is considered. It is shown that the relevance of the problem solved in this work stems from the need to process Big Data that are represented in different ways, according to the diverse tasks of business projects. Methods and technologies for creating hybrid data warehouses based on data of the SQL and NoSQL types are analyzed. It is shown that the most widespread approach to Big Data processing at present relies on the Hadoop distributed computing environment. Existing technologies for organizing and accessing data in a Hadoop cluster from SQL-like DBMS via connectors are analyzed. Comparative quantitative estimates of using the Hive and Sqoop connectors for exporting data to the Hadoop warehouse are given. The features of Big Data processing in the architecture of distributed cluster computing based on Hadoop are analyzed. The features of the PolyBase technology as a SQL Server component for organizing a bridge between SQL Server and Hadoop for data of the SQL and NoSQL types are presented and described. The composition of a model computing setup based on a virtual machine for the joint configuration of PolyBase and Hadoop for solving test tasks is given. Methodical support for installing and configuring the Hadoop and PolyBase SQL Server software with regard to constraints on computing capacity has been developed. Queries for using PolyBase and the Hadoop data warehouse in Big Data processing are considered. Absolute and relative metrics are proposed for estimating system performance. For large-volume test data, experimental results are presented and analyzed, illustrating the improvement of information system performance: query execution time and the size of the temporary tables created in the process. A comparative analysis of the studied technology with the existing connectors to the Hadoop cluster was carried out; it showed the advantage of PolyBase over the Sqoop and Hive connectors. The results of the research can be used in scientific and training experiments for improving organizations' business processes when implementing state-of-the-art IT technologies.
Keywords: Hadoop, MapReduce, HDFS, PolyBase SQL Server, T-SQL, distributed computing, scaling, PolyBase scale-out group, external objects, Data Platform

1. Introduction

Modern data processing technologies address the challenges associated with scaling, flexibility of using different tools, access time and data query processing rate. It is necessary to store and process Big Data. Big Data are characterized by specific properties, which include their capacity – terabytes or petabytes of data; a high real-time processing rate; and a natural diversity of their types, structured and unstructured by nature [1, 2].
Typically, different types of data are used to solve problems that are diverse in a contextual sense. For example, the existing database of activity of a company contains a large amount of structured data. In contrast to it, operational information relating to market behavior and market trends has an unstructured form. In addition, with increasing capacities of stored and operational data, there appears a problem of unloading and using "cold" data from warehouses, the capacities of which are terabytes.
One of the relevant directions of development of modern information systems is the transition to high-performance environments that use distributed computing – grid systems, cloud platforms (Azure, Amazon, Google) and so on – ensuring enhanced efficiency of computation processes at the expense of scalability and distributed data warehouse technologies. The most common in modern scientific research are the platforms Hortonworks Data Platform (HDP) and Cloudera Distributed Hadoop (CDH), as well as software tools for working with clusters – Apache Sqoop, Apache HBase – and cloud platform environments (Elastic MapReduce – AWS, HDInsight – MS Azure). They are based on the Apache Hadoop project – a distributed programming environment and a data warehouse. With a view to improving the productivity of this ecosystem, the MapReduce software framework, which provides such important operational characteristics of a complex system as ease of setting, fault tolerance, and scalability, was developed.



The problem of processing different types of data is the following. When data of one type are stored in databases, it is enough, for data exchange, to specify a data provider library in the code and to set a connection string for each database. For example, for a SQL Server database, data provider support is provided by specifying the namespace System.Data.SqlClient, and the connection strings in the simplest case have the form shown in [3, 4]. If data are stored in databases of different types (for example, NoSQL), the situation is complicated: the namespaces for the relevant data providers have to be determined in the code.
One of the key points of a solution to the problem of dealing with Big Data is the use of NoSQL technology. In the case where a NoSQL database type is used, the application of the SQL language is impossible, except through libraries for specific data providers and connection strings. A comparative analysis of performance of the considered data types is presented in reviews [5–7]; specific features of the transition from the SQL data type to NoSQL data are considered in [8, 9].
Thus, the problem is to find possibilities for sharing different types of databases in the implementation of industrial and research projects. Research in the areas of the human genome [10], education [11], machine learning [12], solutions of engineering problems [13], social studies [14], and business [15] can be considered as examples.
The main directions in overcoming the considered problems in sharing DBs of SQL and NoSQL types are the development of technologies for bridges between them [16, 17], building integrated platforms for sharing different data types [18], converters (adapters) of queries for SQL- and NoSQL-data interaction [20, 21], and designing bridges when organizing warehouses of different types [19, 22, 23].
The latter of the mentioned are related to the creation of hybrid warehouses, constructed based on sharing distributed data warehouses, for example, based on the Hadoop cluster, and industrial data warehouses (IDS; EDW, Enterprise Data Warehouses) [22, 23]. IDS typically use common parallel databases that support complex processing, updating and transactions of SQL queries. In this case, many companies store data in a distributed form with the possibility to process them in different ways – by intellectual data analysis, real-time SQL queries, etc. In this connection, technologies which make it possible to process simultaneously data of large capacities and of different types, in combination with technologies of distributed high-performance computing, are relevant.
The considered variety of tools and technologies, which should be used simultaneously in applications both in the process of their development and deployment, results in considerable difficulties in their application. That is why, at the end of 2017, there emerged the concept of "translytical", introduced in the Forrester Wave report [24]. It defines the incorporation of an operating database and a data warehouse on one platform, thus developing the concept of data warehouse construction.
It therefore seems relevant to develop approaches aimed at improving performance of an information system when working with Big Data of different types by creating a bridge between databases stored in the distributed system Hadoop and the DBMS MS SQL Server. Using the distributed file system (HDFS) in the Hadoop cluster, on the one hand, and the SQL-type database, on the other hand, allows the creation of hybrid warehouses. This approach gives an opportunity to control on one platform large amounts of diverse information, operative and analytical data, at the expense of their distributed storage and processing. That is why its application is also promising for solving the problems in modern HTAP (Hybrid Transactions and Analytics Processing) data processing systems [24]. The selection of MS SQL Server as the operating database is caused by the high rating of MS SQL Server among the most popular operating DBMS in the world [25]. According to Gartner research, it has taken leading positions [26] in the world market of the appropriate technologies.
It should be noted that modern data processing technologies, developed in the framework of the studied performance of sophisticated data processing systems, use methodologies for continuous deployment, development and integration (DevOps). In this connection, it should be noted that the methodology for deployment of separate components of software systems and frameworks for Big Data processing is an integral part of these methodologies, which is explained by the following factors: scalability; the need for continuous updating of both software and hardware parts of a system [27]. This, in turn, requires development of high-quality methodological support for implementation of all stages of installation and configuration of the software and hardware platform of the developed and studied system.

2. Literature review and problem statement

Modern tendencies in the use of databases imply the solution of a relevant problem – where to store application data if part of it is relational and another part is of NoSQL type [16]. Splitting data between SQL and NoSQL databases requires management of several data sources. One way to bridge this gap between SQL and NoSQL is to create an abstraction level for both types of databases that behaves as a single database. In paper [17], the NoSQL data are converted into the triple format and are included in the SQL database as a virtual relation. For each query, an SQL extension, including a NoSQL query template, is used. This template describes the conditions for NoSQL data and allows the rest of the SQL query to refer to the corresponding NoSQL data through variable links.
To enhance performance of data processing systems, the possibilities of high-performance computing (HPC) systems, in particular the cluster system Hadoop, built on the distributed file system HDFS, are widely used nowadays. This architecture allows combining the capabilities of SQL-oriented DBMS with the possibilities of high-performance computation; a short review of such solutions is given below.
Sqoop (SQL-to-Hadoop) [29] is intended for interaction of relational databases and Hadoop. Sqoop is used to transfer relational tables to/from HDFS and to generate Java classes that allow MR to interact with the data from the relational DBMS import utility.
Hadapt (HadoopDB) [30] is a system designed to support execution of SQL queries for unstructured and structured data. In Hadapt, one copy of the DBMS PostgreSQL is deployed at each Hadoop node. Rows of relational tables are hash-partitioned and placed on PostgreSQL instances on all cluster nodes; the HDFS system is used as a warehouse of unstructured data.


Hive [31] represents an abstraction above MR. It enables querying data without the development of MR tasks. To retrieve data from the warehouse, Hive applies SQL-like queries in the Hive Query Language (HiveQL).
Apache Pig [32] is a platform for representation of stream data through the high-level language Pig Latin; it executes a series of MR tasks managed by Hadoop.
Apache HBase [33, 34] is a scalable NoSQL database running above the HDFS system. For MySQL data, Phoenix was developed, creating clones of MySQL and SQL tables for access to HBase. Instead of creating tasks in MR, Phoenix interacts with HBase through a coprocessor and thus reduces the query execution time.
Apache Cassandra [35] is a distributed database designed to process big structured data, using the CQL language [36] for creating SQL-queries. Its idea is to develop a SQL-query language on top of Cassandra, bypassing the SQL interpreter as a whole due to the lack of compatibility with the SQL code.
MongoDB-Hadoop Connector (MDBHC) [37] is a connector for working with unstructured data on the Hadoop HDP 2.1 platform. The connector represents MongoDB as a Hadoop-compatible file system, allowing an MR task to read directly from MongoDB without prior copying from HDFS, thereby reducing communication delay. In addition to the existing MR, Pig, Hadoop Streaming and Flume streams, MDBHC launches SQL queries from Hive over MongoDB data. The latest MDBHC version allows Hive to access BSON data with full support for MongoDB collections. MDBHC provides integration with MR tasks – data aggregation from several input sources, or as part of a Hadoop-based data warehouse or operating ETL processes (SSIS).
The data adapter system [38] is an architecture which includes four components: a relational database, the NoSQL database, a DB Adapter and a DB Converter. The system is the coordinator between the applications and the two databases, controlling the query stream and transformation processes. The DB Converter is responsible for data conversion and reports the conversion to the DB Adapter for subsequent processing.
Oracle SQL Connector for Hadoop [39], Greenplum [40] and Asterdata [41] use the external table mechanism, which provides a "relational" form to the data stored in HDFS.
The use of a high-performance distributed environment implies the existence of an interface between the traditional organization of SQL-queries and unstructured databases. In what follows, the possibilities of PolyBase SQL Server to create a hybrid warehouse based on the Hadoop cluster are explored in this capacity.

3. The aim and objectives of the study

The aim of the present research is to increase performance of distributed information systems by sharing the Hadoop computational cluster and PolyBase SQL Server.
To accomplish the aim, the following tasks have been set:
– to perform quantitative analysis of performance of SQL-oriented DB connectors with the Hadoop cluster when working with Big Data;
– to analyze the characteristics of the Hadoop architecture and the organization of interaction with the PolyBase connector when working with Big Data;
– to consider and describe the main types of queries that are used to access the database in PolyBase SQL Server;
– to develop methodical support for deploying and configuring joint running of Hadoop and PolyBase;
– to carry out experiments and analyze their results based on quantitative estimates of the developed metrics of performance of the studied data processing technology.

4. Technology of Big Data processing in the distributed system of Hadoop

One of the most high-performance systems for processing various types of data is the Hadoop cluster, designed for processing large data volumes (10 GB or larger [42]) stored in a large number of files.
In this article, Hadoop is used as a warehouse of hybrid-type data. In it, reference books are stored in the relational database, and operative data are loaded from Web sites to the NoSQL warehouse in a transformed and cleansed form (Fig. 1). The fact that Hadoop does not support transactions results in unit-by-unit addition and reading of big arrays.

Fig. 1. Data processing using Hadoop

The warehouse in the Hadoop technology is provided by the distributed file system (HDFS), in which units (blocks) of files are distributed among thousands of nodes in the cluster.
The data, presented in the form of units of 64, 128 and 256 MB, reside in the DataNode nodes (called data nodes). If a file size exceeds the selected unit, the data that were not placed in it fall into the next one. In the NameNode management node, files' addresses are recorded in a tree-like warehouse system, as well as the metadata of files and directories (Fig. 2).


The NameNode node also performs a reading operation. First, the nearest unit with the necessary data is read, after which the data are selected from it. Several applications can run with these data simultaneously, which makes it possible to increase the scalability of the system.

Fig. 2. The structure of HDFS file system

Distributed computations in Hadoop are constructed in the MapReduce (MR) framework. Two stages of the computational process are implemented in it:

– at stage 1, the required data are selected (Map);
– at stage 2, they are used for calculations (Reduce).
Since data are placed in different units at different servers in the cluster, the operations conducted at stage 1 are executed in parallel. This organization reduces the total query execution time (Fig. 3) and, thus, improves the service quality in a distributed system.
The Map stage creates the "key-value" pairs, which are shuffled by matching key values and, if necessary, sorted [43]. This stage can be simultaneously implemented on a large number of computers (nodes), including some intermediate computations in order to reduce the amount of transmitted data.
The feature of the given technology is that it includes a chain of multiple MR processes. With an increase in the number of elements in the chain and the number of computers in a cluster, there occurs a problem associated with the scalability of the architecture. To solve it, the framework YARN [44], designed to manage cluster resources and parallel execution of tasks, is used. In this case, the program provides isolation of these tasks, which is an analogue of transactions in relational databases.

Fig. 3. Scheme of MR process

The distributed application running with YARN should have a dedicated management class; it is responsible for synchronization of tasks. In general form, the data processing scheme using YARN and MR can be seen in Fig. 4.

Fig. 4. Structural scheme of Hadoop MapReduce

The described scheme proved to be effective in addressing the problem of scalability: its feature is not vertical scalability, achieved by increasing the capacity of nodes included in a cluster, but horizontal scalability, in which elasticity is realized by connecting additional computers. A similar problem also arises when using NoSQL-type databases [45], where it is solved through sharding and replication procedures. In Hadoop, sharding is paired with data replication, as well as with splitting files into units. In the absence of transactions, reliability (fault tolerance) of computational processes is ensured.
Thus, Hadoop MR together with the YARN planner ensures scalability and fault tolerance of the distributed system due to the procedure of automatic saving of several copies of all data on different cluster nodes. In case of disconnection of a node during data processing, tasks are forwarded to other nodes in the cluster.

5. Analysis of performance of SQL and Hadoop connectors

Data transfer at Hadoop is determined by the following factors:
– data volumes;
– performance of input/output devices;
– performance of a computer network;
– magnitude of the migration parameter;
– Hadoop cluster performance (selected platform for installation, configuration, and operation).
When moving Big Data, it seems necessary to test and adjust separate parts of data streams to enhance the system performance, allowing reduction of the cost of their transmission. It also requires a well-grounded choice and installation of software components for the selected architecture (processing mode) of the Hadoop cluster to ensure the required performance of the system.
Below is an example of results of testing the performance of data migration from MySQL tables to Hadoop using Sqoop (Table 1) [46].


Table 1
Testing the performance of export from MySQL to Hadoop using Sqoop

General capacity of the table                                   7 GB
Memory capacity of the AWS RDS instance                         600 MB
Configuration of Hadoop                                         1 Name Node, 2 Data Nodes
Unit's capacity in Hadoop                                       256 MB
Capacity of random access memory of the AWS instance (m1.large) 7.5 GB
Test 1: Mappers: 2, Reducers: 1, mapred.child.java.opts: 550 MB                               17 min
Test 2: Mappers: 4, Reducers: 1, mapred.child.java.opts: 550 MB; sqoop -m 8                   12 min
Test 3: Mappers: 4, Reducers: 1, mapred.child.java.opts: 550 MB, sqoop -direct                5.54 min
Test 4: Mappers: 4, Reducers: 1, mapred.child.java.opts: 550 MB, sqoop -direct, -compression  4.37 min

Thus, for tables of the specified size, depending on the features and settings of the system and the used command types, it is possible to achieve an improvement of data migration performance in the range of 140–390 % (1.4–4 times) by means of the Sqoop connector. Other possible settings to improve performance of the Sqoop connector are presented in paper [47].
Below is an example of the results of testing the performance of data migration from MySQL tables to Hadoop using Hive (Tables 2, 3) [48].

Table 2
Testing performance of data export from MySQL to Hadoop using Hive

General capacity of the table           200 MB–10 GB
Memory capacity                         8 GB
Configuration of Hadoop                 1 Name Node (8 × 2.4 GHz Xeon), 32 Data Nodes (Intel Core 2 Duo 3 GHz)
Unit capacity in Hadoop                 128 MB
Capacity of random access memory        Name Node – 8 GB, Data Node – 2 GB
Computer network                        Gigabit Ethernet

Table 3
Comparative analysis of query execution time for different data volumes

Number of entries    Volume, MB    MySQL, s    MapReduce, s    Hive, s
500                  235           4.20        81.14           535.1
1,000                475           13.83       82.55           543.64
2,500                1,000         85.42       84.41           548.45
5,000                2,000         392.42      83.42           553.44
10,000               5,000         1,518.18    88.14           557.51
15,000               7,000         1,390.25    86.85           581.5
20,000               9,000         2,367.81    88.90           582.7

Thus, with an increasing volume of transferred files, starting with 5 GB, for the configuration in the form of 1 master node and 32 working nodes there is a steady tendency of a decrease in query execution time with the use of the Hive connector, in the range of 270–400 % (by 2.7–4 times) relative to MySQL. Without the use of additional settings, for a table capacity of 7 GB, similar to the Sqoop connector, the use of Hive allows a decrease in query execution time to 9.7 min, which corresponds to the results of test 2 for Sqoop.
It can be concluded from the shown results that an increase in performance of the information data processing system in conjunction with distributed computations depends on the configuration of the cluster and its selected operation mode, the capacity of the data to export, as well as the limitations (requirements) in relation to promptness of query execution and memory. They can be assigned in the form of metrics – absolute or relative indicators reflecting distinctive features (characteristics) of the considered technology.

6. Technology of the connector PolyBase SQL Server – Hadoop

The considered solutions of the interface with Hadoop are extended by the component of SQL Server 2016 – PolyBase [49], which uses T-SQL.
It is necessary to distinguish the following operations with PolyBase:
1) queries from SQL Server to Hadoop;
2) export/import between SQL Server and Hadoop.
The latter two (export and import) are ETL operations for warehouse organization: one is used to analyze the data stored in a hybrid warehouse, the other is used to send data to data marts, for analysis and reporting.
PolyBase is a functional subsystem of the parallel database system (Parallel Data Warehouse, PDW), including a parallel query optimizer and the mechanism for query execution. The system uses MapReduce as an "auxiliary" mechanism for processing queries over data downloaded from HDFS (Fig. 5, 6).

Fig. 5. PolyBase architecture [50]

The paradigm of "query separation" was implemented in PolyBase for scalable query processing [50, 51]: for structured data – in relational tables, and for unstructured data – in HDFS. The place of PolyBase in data processing together with Hadoop is shown in Fig. 7.
The main advantage of PolyBase is that it makes it possible to apply all analytical Microsoft tools to the data stored in Hadoop. One should distinguish the simpler ones for an end user – Excel and Power BI with PowerPivot and Power Query – and professional tools, such as Integration Services, Analysis Services, Reporting Services and Machine Learning Services, oriented to developers.


Fig. 6. HDFS: instance of a bridge in a computational node of PDW [51]

Fig. 7. Schematic of data processing using PolyBase-Hadoop

7. Software and hardware environment

The developed system was implemented on a PC with an HDD of 1 TB, RAM of 16 GB, and an Intel® Core™ i5-7600 processor (x86-64, 4 cores × 3.5 GHz, SmartCache L3 6 MB, Intel® Turbo Boost 2.0).
To deploy the software of the Hadoop system, we used: guest OS – virtual machine VirtualBox 5.2 (Oracle), RAM – 4 GB, 2 cores × 3.5 GHz; Hadoop 2.6 (Hortonworks HDP 2.2), OS – CentOS 7; autonomous mode (at a single machine (server), with I/O of files at a single machine (server)).
To deploy the PolyBase component, MS SQL Server 2017 Enterprise for OS Windows 10 was used.

8. Development of methodological support for deployment of Hadoop-PolyBase

Stage 1. Deployment and setting of Hadoop.
The most common are distributions from the companies Cloudera, Hortonworks, MapR, IBM and Pivotal [52]. For cloud platforms, Elastic MapReduce (Amazon Web Services) and HDInsight (Microsoft Azure) are used. It should be noted that currently PolyBase is installed on the HDP and CDH platforms [52].
Stage 2. Installation and configuration of SQL Server 2016 (and above – Enterprise or Developer edition).
Stage 3. Installation of PolyBase (performed when configuring SQL Server 2016, 2017).
Stage 4. Connection of PolyBase to Hadoop as a data source.
It includes two steps: configuring the server and creation of the external data source (EDS). A change in configuration is implemented by sp_configure and RECONFIGURE.
Stage 5. Connection of YARN and MR to Hadoop.
The first component is connected by assigning the key value of the yarn.application.classpath configuration to the SQL Server property of the same name, the second one – by setting the configuration parameter mapreduce.application.classpath (file yarn-site.xml).
Stage 6. Authentication in Hadoop.
The SSL protocol is used for authentication and transfer of encrypted data between the nodes. The configuration authenticate (authenticity check) is selected by default. To provide a high protection level, it is necessary to assign the value integrity to the property hadoop.rpc.protection, and for the highest protection level – the value privacy (confidentiality).
Stage 7. Scalability of PolyBase.
It is provided by the connection of PolyBase instances included in one domain – a scale-out group [53]. In the domain, one instance is a managing node (MN; software – SQL Server 2016, PolyBase Engine, PolyBase DMS), the rest are computational nodes (CN; software – SQL Server 2016, PolyBase DMS) (Fig. 8). Query processing is implemented by the following algorithm.
Step 1. The MN of SQL Server receives a query.
Step 2. The MN converts the query into an implementation plan (DSQL plan) – a sequence of SQL procedures (DSQL) performed on the DB instances on the CN, and DMS (Data Movement Service) operations to transmit data between the CN.
Step 3. The PolyBase kernel parallelizes the query implementation.

Fig. 8. PolyBase Scalable group

Stage 8. Creation of objects.
To use PolyBase, it is necessary to create the external table (ET), the external file format (EF) and the EDS.
The EDS object determines the Hadoop location and, possibly, the accounting information (credentials) for EDS authentication. For example, the data source named HadoopCluster1 on the cluster, connected to port 8050 of the computer with IP-address 10.10.10.10, is created by the script in Fig. 9.
In the EF, it is necessary to indicate the file type (Parquet, Hive ORC, Hive RCFile or Delimited Text) in the data source determined earlier and, for a text file, to set the field terminator, line separator, date format, file compression method, etc. For example, if Hadoop contains Parquet files compressed by the Gzip method, it is necessary to set the value FORMAT_TYPE = PARQUET in the format description.


It should be noted that an ET is a reference to the data stored in Hadoop for performing Insert, Select, Join operations, etc. Thus, in its functional features it is close to a view in a relational database, which increases the possibilities of a user working with the data.
For example, the output of data on sales of goods by months is implemented by the script (Fig. 11).

Fig. 11. Script of data output

This script uses a connection of the ET HDP_factSales with the local table DimDate, with data grouped by months, as well as data filtering on the Cost field of the ET. To speed up the latter two operations, the MR tasks are enabled.
The import operation into a local table will be demonstrated using the example of transferring the sales data for 2018 from the ET HDP_factSales to the local table factSales2018. It is implemented by the script (Fig. 12).

Fig. 12. Script of import operation into a local table
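The scripts of Fig. 11 and Fig. 12 are shown only as figures. A sketch of what they might look like in T-SQL follows; the column names (DateKey, MonthNumber, SaleDate) and the Cost threshold are hypothetical, since the article describes the scripts only in words.

-- Output of sales grouped by months (cf. Fig. 11): the external table HDP_factSales
-- is joined with the local dimension table DimDate, with filtering on the Cost field.
SELECT d.CalendarYear, d.MonthNumber, SUM(f.Cost) AS TotalCost
FROM dbo.HDP_factSales AS f
JOIN dbo.DimDate AS d ON d.DateKey = f.DateKey
WHERE f.Cost > 1000                    -- hypothetical filter on Cost
GROUP BY d.CalendarYear, d.MonthNumber;

-- Import of the 2018 sales from the external table into a local table (cf. Fig. 12).
SELECT f.*
INTO dbo.factSales2018
FROM dbo.HDP_factSales AS f
WHERE f.SaleDate >= '20180101' AND f.SaleDate < '20190101';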

Fig. 9. Creation of an external data source

The ET object displays an object of the data warehouse from Hadoop in PolyBase: tables of a relational or document-oriented database and their views.
For correct work with an ET, it is necessary to take into account in its description that the number of its columns and their types must exactly match the corresponding parameters of the table scheme within Hadoop. It is possible to postpone the issuing of error messages by setting in the reject_options parameter an allowable number of "failed" lines or their percentage. This approach becomes relevant in the case where the data in the Hadoop cluster are contained in a document-oriented database that uses a flexible scheme of collections.
Thus, for example, to retrieve sales data from the factSales table that is stored in Hadoop (data source – HadoopCluster1) and has the format ParquetGzip1, the ET HDP_factSales is created by the script (Fig. 10).
For archiving outdated data for 2016, the export from the factSales table into HDP_factSales2016 was performed by the script (Fig. 13).

Fig. 13. Script of data export
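As with the other figures, the export script (Fig. 13) is not reproduced in the text. A sketch under the assumption that outdated rows are selected by a hypothetical SaleDate column follows; note that an INSERT into an external table additionally requires the instance option 'allow polybase export'.

-- One-time instance setting required for export into external tables.
EXEC sp_configure @configname = 'allow polybase export', @configvalue = 1;
RECONFIGURE;

-- Archiving of the 2016 sales from the local table into the external table (cf. Fig. 13).
INSERT INTO dbo.HDP_factSales2016
SELECT f.*
FROM dbo.factSales AS f
WHERE f.SaleDate >= '20160101' AND f.SaleDate < '20170101';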

Fig. 10. Creation of external file format

The productivity of the developed system can be improved based on the T-SQL language that allows:
– reading data from Hadoop;
– combining them with the data stored in the database of SQL Server;


– importing and exporting data between the local database and the Hadoop warehouse.
The examples of the developed queries (Fig. 9–11) that implement the core operations of SQL Server and Hadoop interaction are presented above and are further used when solving the test tasks. The syntax of the basic statements is presented in the Microsoft documentation [54–56].
An activity diagram in the UML modeling language of the implementation of the developed methodical support is shown in Fig. 14.

Fig. 14. UML activity diagram of PolyBase-Hadoop deployment

The methodological support, considered in the framework of the existing methodologies for continuous deployment and integration of complex systems, enables the formation of a structure of the business process of analytical processing of Big Data of different types under operation of a high-performance cluster with developed interface tools. This makes it possible to reduce the corresponding cost of customizing the software and hardware part of the studied system and thus to enhance the performance of the data processing system in general.

9. Experimental studies into estimating the effectiveness of Big Data processing using PolyBase-Hadoop

It should be noted that the results of experimental studies obtained from the use of the studied technology have not been adequately covered in modern literature, due to the fact that this technology was presented by Microsoft in mid-2017. In this regard, the following results are actually some of the first research into this technology [50, 51].
This section describes and analyses experimentally obtained estimates of the effectiveness of application of the PolyBase component and the Hadoop cluster for applications with Big Data of different types. The studies were conducted according to the following methodical scheme.
1. Transferring data from Hadoop into a temporary table in SQL Server for their subsequent processing in PolyBase.
The connection to the Hadoop nodes was carried out in the predicate pushdown mode (an option in the SELECT statement [51]). When predicate pushdown is forced, the corresponding option is added to the script (Fig. 15); otherwise it is disabled (Fig. 16).

Fig. 15. Script of forcing predicate pushdown

Fig. 16. Script of disabling predicate pushdown

2. Test datasets.
Files with information about the customers of one of the companies were selected for testing. The data for the research were generated randomly through the site [57]. The number of entries in each of the used files is shown in Table 4.

Table 4
Number of entries and memory capacity of occupied test files

File name      Number of entries    Memory capacity, MB
Customers_1    50,000               30
Customers_2    200,000              80
Customers_3    500,000              200
Customers_4    1,000,000            400
Customers_5    1,300,000            520
Customers_6    1,500,000            600

3. Deployment of the platform for data placement.
These files were placed in Hadoop Hortonworks Data Platform 2.2 under the OS CentOS. The work with the database was implemented in SQL Server Management Studio 2017.
4. Connection.
To connect SQL Server to Hadoop, the EDS was created by the script, in which, for correct operation of the predicate pushdown option, the parameter RESOURCE_MANAGER_LOCATION, setting the address of the Hadoop resource manager, was forced [43].
5. Creation of the ET.
An ET was created for each file in Hadoop. For example, for the file Customers_1, its scheme is represented by the following script (Fig. 17).

Fig. 17. Script of creation of external table
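The script of Fig. 17 is available only as a figure. A minimal sketch of an external table over the CSV test file is given below; the column list, the HDFS path, the data source name and the DELIMITEDTEXT format are assumptions, since the article lists neither the fields of Customers_1 nor the object names used in the experiments (only the reject_options mechanism is mentioned earlier).

-- Assumed text file format for the generated CSV test data [57].
CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', USE_TYPE_DEFAULT = TRUE)
);

-- Sketch of the external table for the file Customers_1 (cf. Fig. 17).
CREATE EXTERNAL TABLE dbo.Customers_1
(
    CustomerId INT,
    FirstName  NVARCHAR(50),
    LastName   NVARCHAR(50),
    BirthDate  DATE,
    City       NVARCHAR(50)
)
WITH (
    LOCATION = '/data/customers/Customers_1.csv',  -- assumed HDFS path
    DATA_SOURCE = HadoopCluster1,                  -- assumed EDS name (see Section 8)
    FILE_FORMAT = CsvFormat,
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 10                              -- the reject_options discussed above
);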


6. Generation of queries.
A detailed analysis of the functions for improving the performance of data export operations in Hadoop showed that it is necessary to select predicate pushdown. The main purpose of this function is to reduce the number of rows to be moved, to decrease the number of columns to be moved and to use subsets of expressions and operators to create and optimize the external table. A query can have several predicates that can be transferred to a Hadoop cluster.
For the experiments, we used a query whose result is all customers born after 1980. For the ET, the query was implemented in two modes: with the forced predicate pushdown option (Fig. 18) and with the disabled predicate pushdown option (Fig. 19), by the following scripts, respectively:

Fig. 18. Script of forcing the predicate pushdown mode
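The scripts of Fig. 18 and Fig. 19 differ only in the query hint. A sketch of the query over the external table, using the hypothetical column names assumed above, might look as follows:

-- Forced predicate pushdown (cf. Fig. 18): the filter is evaluated on the Hadoop cluster.
SELECT CustomerId, FirstName, LastName, BirthDate
FROM dbo.Customers_1
WHERE BirthDate >= '19810101'          -- customers born after 1980
OPTION (FORCE EXTERNALPUSHDOWN);

-- Disabled predicate pushdown (cf. Fig. 19): all rows are first copied into a temporary
-- table in SQL Server and filtered there.
SELECT CustomerId, FirstName, LastName, BirthDate
FROM dbo.Customers_1
WHERE BirthDate >= '19810101'
OPTION (DISABLE EXTERNALPUSHDOWN);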

Fig. 19. Script of disabling predicate pushdown mode

7. Performance metrics.
The following were used as performance metrics:
– absolute: query execution time; the capacity of memory required to create temporary tables;
– relative: the ratio of query execution time to the capacity of the transferred tables; the ratio of the capacity of temporary tables to the capacity of the transferred tables.
To carry out the experimental study and to obtain quantitative estimates of performance improvement, the following queries were implemented.
To determine the memory required for the temporary tables, it is necessary to use a query to the system database tempdb. It should be executed together with the main queries, due to the fact that upon completion of sampling all temporary tables associated with it are removed from the database.
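The article does not show the tempdb query itself. One possible form, a sketch that sums the pages of the temporary tables via standard catalog views and DMVs, is:

-- Approximate size (MB) of the temporary tables existing in tempdb while the main
-- query batch is still running.
SELECT t.name AS temp_table_name,
       SUM(ps.used_page_count) * 8 / 1024.0 AS used_mb
FROM tempdb.sys.dm_db_partition_stats AS ps
JOIN tempdb.sys.tables AS t
     ON t.object_id = ps.object_id
GROUP BY t.name;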

Quantitative estimations of query execution time and the required memory of temporary tables for the used test files are shown in Table 5.

Table 5
Results of queries execution

File name      Execution time, s       Memory capacity, MB
               (DISABLE / FORCE)       (DISABLE / FORCE)
Customers_1    1 / 3                   2.6 / 1.5
Customers_2    3 / 3                   10.0 / 6.0
Customers_3    7 / 4                   25.6 / 11.0
Customers_4    15 / 6                  52.0 / 23.0
Customers_5    18 / 8                  59.0 / 25.0
Customers_6    21 / 10                 64.0 / 28.0

The results of the research show that at smaller data volumes the query execution time using the pushdown option is comparable to, or even somewhat greater than, that in the control sample. However, with an increase in the amount of processed data, queries with the forced predicate pushdown option are executed in considerably less time than without it (Fig. 20).

Fig. 20. Dependence of query execution time on the number of entries with the use of the predicate pushdown option and without it

The action of the forced predicate pushdown option has a significant impact on the volume of memory used to create temporary tables, because only the data that meet the selection criteria are returned to SQL Server. It should be noted that in this case less memory is required to store such data in a temporary table, which is demonstrated in Fig. 21.

Fig. 21. Dependence of the amount of memory required to execute queries on the number of records, with the use of the predicate pushdown option and without it

For a quantitative estimation of the performance of the studied technology with the forced predicate pushdown function, new metrics were additionally proposed: the ratio of query execution time to the memory capacity of the moved tables (Table 6, Fig. 22) and the ratio of the memory of the created temporary tables to the capacity of the moved tables (Table 7, Fig. 23).
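Written out explicitly (the symbols below are introduced here for convenience and are not the authors' notation), the two relative metrics are

$k_{t} = \dfrac{T_{\mathrm{query}}}{V_{\mathrm{table}}}, \qquad k_{m} = \dfrac{V_{\mathrm{temp}}}{V_{\mathrm{table}}},$

where $T_{\mathrm{query}}$ is the query execution time (s), $V_{\mathrm{table}}$ is the capacity of the moved (external) table (MB), and $V_{\mathrm{temp}}$ is the capacity of the created temporary tables (MB). For example, for Customers_6 in the DISABLE mode, $k_{t} = 21/600 = 0.035$ and $k_{m} = 64/600 \approx 0.106$, which matches the corresponding rows of Tables 6, 7.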


Table 6
Ratio of query execution time to table capacity

File name      DISABLE, s    FORCE, s
Customers_1    0.03          0.1
Customers_2    0.0375        0.0375
Customers_3    0.035         0.02
Customers_4    0.0375        0.015
Customers_5    0.034         0.0153
Customers_6    0.035         0.016

Fig. 22. Dependence of the ratio of query execution time to table capacity on table capacity, using the predicate pushdown option and without it

Table 7
Ratio of temporary tables capacity to exported tables capacity

File name      DISABLE    FORCE
Customers_1    0.086      0.05
Customers_2    0.125      0.075
Customers_3    0.128      0.055
Customers_4    0.13       0.057
Customers_5    0.113      0.048
Customers_6    0.106      0.046

Fig. 23. Dependence of the ratio of temporary tables capacity to the capacity of tables, using the predicate pushdown option and without it

The results of the conducted experiments with the presented test data allow us to draw the following conclusions.
1. Productivity of calculations in Hadoop can be improved using the predicate pushdown option.
2. It is not expedient to disable calculations on the Hadoop cluster when processing data of insignificant capacity (for the presented test examples, less than 80 MB).
3. Forcing calculations on the Hadoop cluster is effective if an insignificant part of the entire table gets into the query result. This is due to the fact that when calculations on Hadoop are disabled, SQL Server first copies all data into temporary tables (which requires a lot of memory) and then performs filtering. In this case, the use of predicate pushdown can significantly decrease the absolute and relative indicators of memory resource usage (a system-oriented indicator) and query execution time (providing service quality, a customer-oriented indicator) (Fig. 20–23).
4. Application of the predicate pushdown option allows decreasing the cost of storing files, as in the case of using a cluster, the data warehouse cost at its nodes is significantly less than the cost of storing larger files in SQL Server. Thus, in this case it is possible to use, for example, the ratio of the total cost of storing files in the nodes of the Hadoop cluster to the total cost of storing files in SQL Server as an additional metric – in this case, an economic indicator of operation of the information system.

10. Discussion of results of research into performance of the PolyBase connector with the Hadoop cluster when working with Big Data of different types

The conducted studies showed that the use of the PolyBase technology in conjunction with cluster technologies for managing Big Data makes it possible to use effectively operative and analytical data simultaneously on one platform, addressing the problem within the Translytical Data Platforms.
The use of PolyBase as a component of SQL Server allows increasing reliability of the entire data processing system. In this case, the data obtained from external sources in SQL Server are first stored in a relational database and then transmitted for storage to the file system HDFS with the use of the PolyBase external table. Despite the fact that this procedure somewhat decreases the data transfer rate, the effectiveness of the system as a whole increases significantly due to the use of the following procedures:
1) a log is maintained when CRUD operations are performed in a relational database, which provides additional tools of data recovery, both at soft and hard failures, due to the Write-Ahead Log (WAL) protocol;
2) the processing of events related to data storage is fully transactional. Therefore, implementation of the Consistency and Durability properties ensures an increase in reliability of filling a data warehouse.
The given examples of using PolyBase indicate quite high efficiency of the explored technology for organizing a hybrid warehouse. However, they are model examples and require subsequent experimental research involving practical problems in business analysis, communication, IoT and other technologies working with Big Data at the level of an enterprise and a corporation.
The proposed methodological support of integration of the Hadoop cluster and the PolyBase component allows increasing efficiency of business processes of organizations when using modern IT-technologies within the framework of the modern DevOps methodology of deployment and operation of complex systems for Big Data processing.


The work presents only basic queries. They are not enough to fully address the broad range of applied problems of data processing; thus, this can be considered a restriction of this study. However, it should be noted that the studied technology is new (2017) and that is why it is virtually unexplored in modern scientific literature. An exception could be, for example, papers [50, 51].
In this work, for installing and configuring the Hadoop cluster, the single-machine mode is used, which is quite specific in itself but allows getting good qualitative and quantitative results using limited computing resources. They can serve as a basis for predictive analysis of the application of this technology on cloud platforms and server solutions with greater performance. That is why the obtained results are a novelty, because under the predetermined conditions the performance of a new, almost unstudied technology is evaluated.
The paper presents experimental studies on the interaction of SQL Server and Hadoop as a unified system. The data stored in HDFS are processed by means of Hadoop and are transferred to SQL Server for subsequent analysis. To do this, SQL Server 2017 has a special component, In-Database Machine Learning Services. In this study, we obtained quantitative system- and customer-oriented estimates of enhancing the performance of the system based on the experimental study of enabling the predicate pushdown option in the MapReduce query to retrieve the rows from HDFS.
Enabling the PolyBase component for companies that have already installed the DBMS SQL Server is free of charge. Due to the high prevalence of the DBMS, SQL Server nowadays covers a big enough market segment of users. The main competitor of SQL Server is Oracle, having similar operating functionality [58]. If a company or an enterprise does not use these DBMS and considers the license acquisition and installation of one of them, the probability of selecting SQL Server is higher due to the pricing policy of Microsoft Corp.: a license for SQL Server costs 10 times less than that of Oracle [59–61]. Otherwise, the Oracle Big Data Connectors can be chosen as a connector [62]. The price of the license of this component for a single processor is US $2,000.00.

11. Conclusions

1. An analysis of the technology of Big Data processing in distributed information systems using different organization was performed. It was shown that the main direction for development of hybrid data warehouses is bridges between data of different types – SQL and NoSQL.
2. A review of the existing connectors for interaction with Hadoop cluster resources was carried out. Examples of quantitative estimates of the application of Sqoop and Hive for different configurations of the cluster were given. It was shown that it is possible to provide the required performance of the system considering the many factors influencing it. It was concluded that there is a need to develop qualitative methodological support within the methodology of continuous deployment and integration of components of complex systems, specifically, to install and configure the cluster in different modes of its operation when processing different Big Data types.
3. The features of the organization of data storage in the distributed system Hadoop were explored. The connection of the task scheduler and parallelization technologies in Hadoop makes it possible to increase the productivity of data processing systems. It was shown that the existence of an interface (a bridge) between SQL data and the clustered HDFS warehouse improves the efficiency of working with many different data types using the existing analytical information processing tools of Microsoft Corp.
4. An analysis of the technological platform of the PolyBase component was performed, and it showed the prospects of its application in conjunction with high-performance computations. This is determined by the means of horizontal scaling, the T-SQL language and the support of data entry in large units, ensuring efficiency of Big Data processing in the Hadoop cluster.
5. The methodical provision of integrated deployment and configuring of the Hadoop cluster and PolyBase on the virtual HDP platform, using the single-machine (pseudo-distributed) operation mode, was developed. Examples of creating external objects and examples of implementation of various types of queries for sharing these technologies were explored in detail.
6. Experimental studies were conducted for quantitative assessment and analysis of the developed absolute and relative metrics of performance of the examined system of Big Data processing. The obtained results showed that the use of the optional solutions of PolyBase and Hadoop on databases of large capacities (from 200,000 to 1,000,000 entries, with volumes from 30 to 600 MB) makes it possible to decrease the memory capacity used for temporary tables by 2 times, the query execution time by 2.5 times, and the memory usage ratio from 10 % to 4 % (by 2.5 times) for the maximum capacity of the table.
7. An analysis of the developed relative metrics showed that the relative indicator of query execution time for Sqoop totaled 0.0366 and for Hive – 0.0582. Thus, they are somewhat worse than the result obtained for the test examples for PolyBase – 0.015–0.016 (Table 6).
8. Further studies could address the optimization of planning and the improvement of efficiency of the MapReduce framework and database replication. In addition, assessment of the performance of homogeneous and heterogeneous Hadoop clusters is required when working on local resources with the tools of VDI (virtual desktop infrastructure), as well as on the cloud platforms Azure, AWS, and Google Compute Engine.

References

1. Big Data 2.0 Processing Systems: Taxonomy and Open Challenges / Bajaber F., Elshawi R., Batarfi O., Altalhi A., Barnawi A., Sakr S. // Journal of Grid Computing. 2016. Vol. 14, Issue 3. P. 379–405. doi: https://doi.org/10.1007/s10723-016-9371-1 2. Big Data Taxonomy. BIG DATA WORKING GROUP. 2014. 33 p. URL: https://downloads.cloudsecurityalliance.org/initiatives/ bdwg/Big_Data_Taxonomy.pdf 3. Fedko V. V., Tarasov O. V., Losiev M. Yu. Orhanizatsiya baz danykh ta znan. Kharkiv: Vyd. KhNEU, 2013. 200 p. 4. Fedko V. V., Tarasov O. V., Losiev M. Yu. Suchasni zasoby dostupu do danykh. Kharkiv: Vyd. KhNEU im. S. Kuznetsia, 2014. 328 p.


5. Priyanka, AmitPal A Review of NoSQL Databases, Types and Comparison with Relational Database // International Journal of Engineering Science and Computing. 2016. Vol. 6, Issue 5. P. 4963–4966. 6. Mohamed M. A., Altrafi O. G., Ismail M. O. Relational vs. NoSQL Databases: A Survey // International Journal of Computer and Information Technology. 2014. Vol. 03, Issue 03. P. 598–602. 7. A Survey and Comparison of Relational and Non-Relational Database / Jatana N., Puri S., Ahuja M., Kathuria I., Gosain D. // International Journal of Engineering Research & Technology. 2012. Vol. 1, Issue 6. P. 1–5. 8. Abdullah A., Zhuge Q. From Relational Databases to NoSQL Databases: Performance Evaluation // Research Journal of Applied Sciences, Engineering and Technology. 2015. Vol. 11, Issue 4. P. 434–439. doi: https://doi.org/10.19026/rjaset.11.1799 9. AH Al Hinai. A Performance Comparison of SQL and NoSQL Databases for Large Scale Analysis of Persistent Logs. Uppsala University, 2016. 111 p. 10. Evaluation of relational and NoSQL database architectures to manage genomic annotations / Schulz W. L., Nelson B. G., Felker D. K., Du- rant T. J. S., Torres R. // Journal of Biomedical Informatics. 2016. Vol. 64. P. 288–295. doi: https://doi.org/10.1016/j.jbi.2016.10.015 11. SQL and data analysis. Some implications for data analysits and higher education. Data Scientist Master’s Program. URL: https:// www.simplilearn.com/big-data-and-analytics/senior-data-scientist-masters-program-training 12. Elshawi R., Sakr S. Big Data Systems Meet Machine Learning Challenges: Towards Big Data Science as a Service. URL: https:// arxiv.org/pdf/1709.07493.pdf 13. An information modeling framework for bridge monitoring / Jeonga S., Houb R., Lynchc J. P., Sohnc H., Law K. H. // Advances in Engineering Software. 2017. Vol. 114. P. 11–31. 14. Scaling Social Science with Apache Hadoop. URL: http://blog.cloudera.com/blog/2010/04/scaling-social-science-with-hadoop/ 15. Liu X. An Analysis of Relational Database and NoSQL Database on an Ecommerce Platform Master of Science in Computer Sci- ence. University of Dublin, Trinity College, 2015. 81 p. 16. Ferreira L. Bridging the gap between SQL and NoSQL. Universidade do Monho, 2012. 97 p. URL: http://mei.di.uminho.pt/sites/ default/files/dissertacoes/eeum_di_dissertacao_pg15533.pdf 17. Roijackers J. Bridging SQL and NoSQL. Eindhoven University of Technology Department of Mathematics and Computer Science Master’s thesis, 2012. 100 p. 18. Oluwafemi E., Sahalu B., Abdullahi S. E. TripleFetchQL: A Platform for Integrating Relational and NoSQL Databases // Interna- tional Journal of Applied Information Systems. 2016. Vol. 10, Issue 5. P. 54. doi: https://doi.org/10.5120/ijais2016451513 19. Roijackers J., Fletcher G. H. L. On Bridging Relational and Document-Centric Data Stores // Lecture Notes in Computer Science. 2013. P. 135–148. doi: https://doi.org/10.1007/978-3-642-39467-6_14 20. Data adapter for querying and transformation between SQL and NoSQL database / Liao Y.-T., Zhou J., Lu C.-H., Chen S.-C., Hsu C.-H., Chen W. et. al. // Future Generation Computer Systems. 2016. Vol. 65. P. 111–121. doi: https://doi.org/10.1016/ j.future.2016.02.002 21. Kuderu N., Kumari V. Relational Database to NoSQL Conversion by Schema Migration and Mapping // International Journal of Computer Engineering in Research Trends. 2016. Vol. 3, Issue 9. P. 506. doi: https://doi.org/10.22362/ijcert/2016/v3/i9/48900 22. Maislos A. Hybrid Databases: Combining Relational and NoSQL. 
URL: https://www.stratoscale.com/blog/dbaas/hybrid-databas- es-combining-relational-nosql/ 23. Blessing E. J., Asagba P. O. Hybrid Database System for Big Data Storage and Management // International Journal of Computer Science, Engineering and Applications. 2017. Vol. 7, Issue 3/4. P. 15–27. doi: https://doi.org/10.5121/ijcsea.2017.7402 24. Yuhanna N., Gualtieri M. The Forrester Wave™: Translytical Data Platforms, Q4 2017. URL: https://www.forrester.com/report/ The+Forrester+Wave+Translytical+Data+Platforms+Q4+2017/-/E-RES134282 25. DB-Engines Ranking. URL: https://db-engines.com/en/ranking 26. Magic Quadrant for Operational Database Management Systems. URL: https://www.gartner.com/doc/3823563/magic-quad- rant-operational-database-management 27. DevOps for Hadoop. URL: http://agilealmdevops.com/2017/10/22/devops-for-hadoop/ 28. Özcan F., Tian Y., Tözün P. Hybrid Transactional/Analytical Processing // Proceedings of the 2017 ACM International Conference on Management of Data – SIGMOD ‘17. 2017. doi: https://doi.org/10.1145/3035918.3054784 29. Apache Sqoop. URL: http://sqoop.apache.org 30. Hadapt Inc. URL: http://www.hadapt.com 31. The . URL: https://hive.apache.org 32. Apache Pig: Overview. URL: https://www.tutorialspoint.com/apache_pig/apache_pig_overview.htm 33. The Apache Software Foundation. URL: https://hbase.apache.org 34. Apache Hbase. URL: https://hortonworks.com/apache/hbase 35. CASSANDRA. URL: http://cassandra.apache.org 36. The Cassandra Query Language (CQL). URL: http://cassandra.apache.org/doc/latest/cql 37. MongoDB Connector for Hadoop. URL: https://docs.mongodb.com/ecosystem/tools/hadoop 38. Data adapter for querying and transformation between SQL and NoSQL database / Liao Y.-T., Zhou J., Lu C.-H., Chen S.-C., Hsu C.-H., Chen W. et. al. // Future Generation Computer Systems. 2016. Vol. 65. P. 111–121. doi: https://doi.org/10.1016/ j.future.2016.02.002


39. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database. URL: http://www.oracle.com/tech- network/bdc/hadoop-loader/connectors-hdfs-wp-1674035.pdf 40. Greenplum. URL: https://greenplum.org 41. Asterdata. URL: http://www.asterdata.com 42. Shvachko K. Apache Hadoop. The Scalability Update // Login: The usenix Magazine. 2011. Vol. 36, Issue 3. P. 7–13. 43. Shuffling and Sorting in Hadoop MapReduce. URL: https://data-flair.training/blogs/shuffling-and-sorting-in-hadoop/ 44. Kawa A. Introduction to YARN. URL: https://www.ibm.com/developerworks/library/bd-yarn-intro/index.html 45. Banker, K. MongoDB in Action. Manning Publications, 2012. 287 p. 46. Performance Tuning Data Load into Hadoop with Sqoop. URL: http://www.xmsxmx.com/performance-tuning-data-load-into- hadoop-with-sqoop/ 47. Sqoop Performance Tuning Guidelines. URL: https://kb.informatica.com/h2l/HowTo%20Library/1/0930-SqoopPerformanceTun- ingGuidelines-H2L.pdf 48. Hadoop and Hive as scalable alternatives to RDBMS: a case study. URL: https://scholarworks.boisestate.edu/cgi/viewcontent. cgi?referer=https://www.google.com.ua/&httpsredir=1&article=1001&context=cs_gradproj 49. PolyBase Queries. URL: https://docs.microsoft.com/en-us/sql/relational-databases/polybase/polybase-queries 50. Indexing HDFS data in PDW / Gankidi V. R., Teletia N., Patel J. M., Halverson A., DeWitt D. J. // Proceedings of the VLDB Endowment. 2014. Vol. 7, Issue 13. P. 1520–1528. doi: https://doi.org/10.14778/2733004.2733023 51. Split query processing in polybase / DeWitt D. J., Halverson A., Nehme R., Shankar S., Aguilar-Saborit J., Avanes A. et. al. // Proceed- ings of the 2013 international conference on Management of data – SIGMOD ‘13. doi: https://doi.org/10.1145/2463676.2463709 52. Cloudera vs Hortonworks vs MapR: Comparing Hadoop Distributions. URL: https://www.experfy.com/blog/cloudera-vs-horton- works-comparing-hadoop-distributions 53. PolyBase scale-out groups. URL: https://docs.microsoft.com/en-us/sql/relational-databases/polybase/polybase-scale-out- groups?view=sql-server-2017 54. CREATE EXTERNAL DATA SOURCE (Transact-SQL). URL: https://docs.microsoft.com/en-us/sql/t-sql/statements/cre- ate-external-data-source-transact-sql?view=sql-server-2017 55. CREATE EXTERNAL FILE FORMAT (Transact-SQL). URL: https://docs.microsoft.com/en-us/sql/t-sql/statements/cre- ate-external-file-format-transact-sql?view=sql-server-2017 56. CREATE EXTERNAL TABLE (Transact-SQL). URL: https://docs.microsoft.com/en-us/sql/t-sql/statements/create-exter- nal-table-transact-sql?view=sql-server-2017 57. Generate Test CSV Data. URL: http://www.convertcsv.com/generate-test-data.htm 58. System Properties Comparison Microsoft SQL Server vs. Oracle. URL: https://db-engines.com/en/system/Microsoft+SQL+- Server%3BOracle 59. Please, Please Stop Complaining about SQL Server Licensing Costs and Complexity. URL: https://joeydantoni.com/2016/08/18/ please-please-stop-complaining-about-sql-server-licensing-costs-and-complexity/ 60. Ramel D. Microsoft Takes Aim at Oracle with SQL Server 2016. URL: https://rcpmag.com/articles/2016/03/10/microsoft-vs-or- acle-at-data-driven.aspx 61. «Za tu zhe funkcional’nost’, kotoruyu daet SQL Server, Oracle prosit v 10 raz bol’she», – Konstantin Taranov o SQL Server. URL: https://habr.com/company/pgdayrussia/blog/329842/ 62. Oracle Big Data Connectors. URL: https://shop.oracle.com/apex/product?p1=OracleBigDataConnec - tors&p2=&p3=&p4=&p5=&intcmp=ocom_big_data_connectors
