Trakia Journal of Sciences, Vol. 13, Suppl. 1, pp 414-419, 2015
Copyright © 2015 Trakia University
Available online at: http://www.uni-sz.bg
ISSN 1313-7069 (print), ISSN 1313-3551 (online)
doi:10.15547/tjs.2015.s.01.071

DISTRIBUTED DATABASES – SOME APPROACHES, MODELS AND CURRENT TRENDS

S. Klisarova-Belcheva

Faculty of Economic and Social Sciences, Plovdiv University “Paisii Hilendarski”, Plovdiv, Bulgaria

ABSTRACT
The progress of Internet technology in recent years has increased the demands on distributed platforms for storing huge amounts of data. The purpose of this article is to present an overview of the most common distributed databases currently in use (such as NoSQL and NewSQL). Their storage patterns, their advantages and disadvantages in different architectural solutions, and development trends are discussed. In conclusion, the paper emphasizes the increased importance of distributed platforms for business software.

Key words: distributed databases, big data, NoSQL, NewSQL

The revolution in Internet technology in recent years has increased the need for distributed platforms that store huge volumes of data. This paper aims to give an overview of the best-known distributed databases (such as NoSQL and NewSQL). It compares their storage patterns, their advantages and disadvantages in different architectural solutions for storage, and their development trends.

INTRODUCTION

Progress in the technologies for data storage and information processing has made the database (DB) the core of modern software applications. The database concept appeared in the 1960s in the context of automated information management. Distributed database management systems (DDBMS) were created in response to the need for scalable databases for developing web applications. Web-based, scalable databases have spread rapidly in recent years and have become popularly known as NoSQL DB.

The aim of this report is to juxtapose newer models for data storage with the traditional relational model, comparing their basic characteristics by applying a set of classifications. The paper emphasizes the possibilities and applications of distributed databases (NoSQL) and distributed relational databases (NewSQL).

DISTRIBUTED DATABASES – GENERAL CHARACTERISTIC AND CLASSIFICATIONS

In the 1970s, working on the theory of data storage, Edgar F. Codd created the relational data model. Today this model underlies the leading database management systems (DBMS). Relational database management systems (RDBMS) are an integral part of the implementation of modern business applications. The main characteristic, which gives the model its name, is the relation between objects in the database. A foundation of the relational model is its support for ACID transactions. ACID is an acronym describing the basic properties of transactions in a relational database (RDB): Atomicity, Consistency, Isolation and Durability.

Distributed DB are scalable. Their main feature is the possibility of physically separating data among multiple networked computers.

The second important feature of DDBMS is the absence of relations, i.e. NoSQL databases offer a free schema for the stored information. The free schema makes it easier to store large volumes of information in a fault-tolerant architecture. Rick Cattell identified the main disadvantage of DDBMS compared to the transactional
model, which is that they do not support ACID transactions (1).

In 2000, Prof. Eric Brewer formulated a conjecture (now known as Brewer's Theorem) stating that any distributed computing system can provide at most two of the following three properties at the same time: consistency of the data, availability and partition tolerance (distribution). Two years later Seth Gilbert and Nancy Lynch formally proved Brewer's conjecture (2). Brewer's Theorem is the basis for the first of the reviewed DB classifications: distributed and non-distributed databases. Non-distributed DB provide both consistency and availability; in most cases these databases are relational. In 2013 A. Nayak explored the types of DDBMS according to which of the properties, availability and/or consistency, they provide (3).

Another possible database classification is presented by Cattell, where DBMS are classified according to where the information is stored: databases stored in RAM (in memory) and databases stored on external memory (HDD) (1). Modern databases stored on external drives have led to the development of special file systems, for example HDFS (Hadoop Distributed File System), a special-purpose file system; GFS (Google File System), a distributed parallel file system with protection against failures; Amazon S3, a network file system; and others.

In a study of NoSQL databases, Jing Han offers a classification based on the physical model for information storage: row-oriented stores, column-oriented stores (including wide-column stores), document stores, key-value stores and graph stores (4).

Cattell also added a classification based on the mechanisms for accessing and modifying the data. Relational and some distributed DBMS provide ACID transactions, while others use MVCC (Multi Version Concurrency Control), locks (a mechanism that allows only one user at a time to read or modify a record) and other mechanisms.

Specific mechanisms have been developed to access the data in some distributed systems. One of the most popular models for distributed computing is MapReduce. Hung-Chih Yang analyzes the application of MapReduce to extract aggregate information as a prerequisite for developing scalable parallel applications that process large volumes of data distributed over multiple networked computers (5). Grigoriev and Cviashchenko explore the main types of mechanisms for replication and data consistency (6). Depending on the type of replication (synchronous or asynchronous), different systems use a variety of consistency mechanisms, such as two-phase commit.

DISTRIBUTED RELATIONAL DBMS – CURRENT TRENDS AND BASE COMPARISON

Figure 1. DB rating by type and years

According to recent research on the prevalence of databases, there is a steady trend of increasing interest in NoSQL DBMS, shown in Figure 1 (7).

Despite the increased interest in DDBMS, the leading places in the rankings are still held by traditional RDBMS: Oracle, MySQL and MS SQL Server. The results show that the relational model is still predominant, but there is a steady trend of increasing use of distributed systems. A new alternative to relational and distributed database management systems is emerging, which meets these needs by combining the capabilities of both traditional and NoSQL databases. These new database systems, known as NewSQL, are in effect NoSQL systems that support transaction control and SQL (operations such as JOIN and UNION).

Figure 2. Comparison between traditional, NoSQL and NewSQL database

Thus, the distributed and the relational databases have the opportunity to integrate the
advantages of both traditional relational and distributed databases. Figure 2 shows NewSQL database management systems as a union of the strong features of relational and distributed databases: ACID transactions and scalability.

Figure 3. Basic properties of traditional RDBMS, NoSQL and NewSQL

Figure 3 summarizes the properties of the three main types of database, such as ACID transactions, support for large data volumes, relations and the possibility of in-memory storage. The figure shows that while NoSQL and relational DBMS are complementary, NewSQL provides all of their properties.

The main representatives of NewSQL are:
• NuoDB – scale-out SQL database for cloud computing and modern datacenters
• VoltDB – in-memory performance, streaming analytics with millisecond latency, OLTP in a scale-out architecture, SQL and JSON with ACID guarantees, Hadoop ecosystem integration
• Clustrix – proves that SQL can scale out in production to massive deployment sizes
• SAP Hana – an in-memory computing platform that has completely transformed the RDB industry
• FoundationDB – stores data in a key-value store
• Google Spanner – Google's distributed NewSQL DB, the successor to BigTable
and others.

NewSQL is an established trend with a number of options, which fits both the relational and the distributed database model.

DESCRIPTION AND COMPARISON OF MODERN DBMS BY MODEL OF DATA STORAGE

To assess the advantages and disadvantages of each model, it is necessary to compare how well they implement the basic data processing operations: inserting, updating, deleting, selecting (plain selects or aggregates) and updating metadata (adding or modifying attributes).

a. Row-oriented DBMS
Systems using the row-oriented record model are suited to handling many transactions, mainly ones that add rows and serve queries. Data is stored sequentially in areas called rows (records). Each record in the database has a relatively constant, fixed length. When a new row needs to be added, it is appended at the end, without complex operations. According to John Kubiatowicz's comparative analysis, adding a new attribute, or changing or removing one, is a rather complex operation, because the values of each row are written sequentially according to the attributes of the table (9). The way the data is stored is shown schematically in Figure 4. The major representatives of row-oriented databases are the most widely used relational databases: Oracle Database, Microsoft SQL Server, MySQL, DB2 (IBM) and PostgreSQL. These databases support OLTP (online transaction processing) and are used in business software such as systems for managing sales in supermarkets, ATMs, etc.

Figure 4. Presentation of data in row-oriented database

Row-oriented databases are appropriate in the following situations: short transactions covering a small number of records, frequent updates, many users and a need for fast response times.

b. Column-oriented database
In the column-oriented model the information is stored in areas that correspond to the attributes of the table. Operations that add or extract information from rows (multiple attributes) are much more complex than adding, editing or extracting information from particular columns. The advantage when editing a column is that all of its data is of the same type, so the edit can be done without affecting the other data in the rows. Kubiatowicz, analyzing the speed of these databases compared
to row-oriented ones, shows experimentally that column-oriented databases are more effective in calculating aggregate functions (9). The scheme of the model is shown in Figure 5.

Figure 5. Presentation of data in column-oriented database

Key representatives of the column-oriented databases are: Cassandra (Apache); HyperTable, an open source system which works with Apache HDFS; BigTable (Google); HBase (Apache Hadoop); Accumulo and others.

c. Key-value stores
This is the simplest model of data storage. Its data fields can be regarded as "identical". Each field contains two main attributes:
• key – synthetic, or automatically generated
• value – most often a String / BLOB that stores various types of data in a common format such as JSON, XML, etc.
The presentation of data is based on the similarity of the areas in which they are stored. All operations are trivial and, because of the specificity of the model, there are no operations for changing attributes (metadata).

Figure 6. Presentation of data in the key-value stores

Key-value databases use associative arrays, also known as dictionaries or maps. All operations for creating, editing, deleting or extracting data are thereby simplified.

This approach is also applied in relational databases in order to reduce the size of the database, where it is known as the EAV (Entity-Attribute-Value) data model. A study by Carlo Beltrame on the specifics of the key-value model shows that it is the preferred solution for creating small non-commercial applications, due to the simplified structure and reduced size of the database (10). Among the most popular databases of this class are: Dynamo (Amazon); Voldemort (used at LinkedIn); Redis; Scalaris, which, unlike most other NoSQL DB, ensures strict consistency and ACID transactions; and others.

d. Document-oriented DB
Document-oriented models use the technology known from key-value databases, but the data (values) are presented in a more complex structure called a document. As a rule, formats such as XML, JSON, YAML and BSON are used. The main advantage is that the data include a description of their own structure (metadata). Because of the use of a descriptive language, these systems are also known as XML-oriented databases. Various implementations organize the data differently, for example in collections or with tags. Databases of this type resemble the relational logical representation of data. Operations that retrieve the data of a single document are fast. Extracting parts of all documents, however, is a time-consuming and complex task. Aggregates can be calculated by means of secondary indexes and an implementation of the MapReduce algorithm for processing the data.

Figure 7. Presentation of data in the document-oriented stores
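
The interplay described above, documents stored as values under keys and aggregates computed with MapReduce, can be illustrated with a minimal in-memory sketch. It is not the API of any particular product: the `store` dictionary, the sample order documents and the names `map_order`, `reduce_totals` and `map_reduce` are assumptions introduced only for illustration, and everything runs in a single process rather than on distributed nodes.

```python
from collections import defaultdict

# Hypothetical document store: key -> document (a nested, JSON-like dict).
# Documents need not share a schema; only the first order has a "coupon" field.
store = {
    "order:1": {"customer": "ana", "coupon": "X1",
                "items": [{"sku": "A", "price": 10.0, "qty": 2}]},
    "order:2": {"customer": "bob",
                "items": [{"sku": "B", "price": 5.0, "qty": 1}]},
    "order:3": {"customer": "ana",
                "items": [{"sku": "B", "price": 5.0, "qty": 3},
                          {"sku": "C", "price": 1.5, "qty": 4}]},
}


def map_order(key, doc):
    """Map step: emit one (customer, amount) pair per line item of a document."""
    for item in doc.get("items", []):
        yield doc["customer"], item["price"] * item["qty"]


def reduce_totals(customer, amounts):
    """Reduce step: combine all amounts emitted for one customer into a total."""
    return sum(amounts)


def map_reduce(documents, map_fn, reduce_fn):
    """Run map over every document, group emitted values by key, then reduce."""
    groups = defaultdict(list)
    for key, doc in documents.items():
        for out_key, value in map_fn(key, doc):
            groups[out_key].append(value)
    return {out_key: reduce_fn(out_key, values)
            for out_key, values in groups.items()}


# Fetching one whole document is a single key lookup.
single = store["order:2"]

# An aggregate has to visit every document in the store.
totals = map_reduce(store, map_order, reduce_totals)
print(totals)  # {'ana': 41.0, 'bob': 5.0}
```

Reading one document is a single dictionary lookup, whereas the aggregate touches every document, which mirrors why document stores rely on secondary indexes and MapReduce for such queries. In a real distributed system the map and reduce steps would run in parallel on the nodes that hold the data.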

The main representatives of document-oriented databases are: CouchDB, an Apache project aimed at web applications, which stores data in JSON format; and MongoDB, which is similar to CouchDB in many features, the difference lying in replication, automatic sharding and MapReduce for aggregate operations on documents.

e. Graph-oriented stores
Among the more rarely used models are NoSQL graph-oriented and hierarchical databases. Their main feature is the relations between the elements. The elements are simple structures called nodes. Relations are represented by a table of links between nodes, and properties can be defined for each relation. These databases are suitable for tasks that involve complex interrelationships. Representing data as a graph helps to speed up its processing.
Key representatives of the graph-oriented databases are:
• Neo4j – ACID transactions, scales to billions of nodes. This is the most popular graph DB.
• ArangoDB – document and graph database
• InfoGrid – Web Graph database
Insertion is quick, but data retrieval operations can be time-consuming.

f. Comparison of the different models
Row-oriented databases are essentially relational databases. The main advantage of this model is the simultaneous processing of multiple transactions and queries, where the system has to react immediately to customer requests (OLTP – online transaction processing). The advantage of column-oriented DBMS is that aggregates are easy to calculate. They are used mainly for data warehouses, CRM and other ad hoc query systems, where aggregates are computed over large data sets (OLAP – online analytical processing). An example is the use of column-oriented databases in banking software. Such solutions include both an RDBMS and a column-oriented store, with an interface for exporting and importing data between the two databases. This makes aggregates easier to use and supports successful analyses.

Examining trends in the business analysis of large data, Karthik Kambatla shows that when the database is larger than RAM, or when aggregates are more important than the flow of data, a column-oriented database significantly improves the retrieval of data for analysis (8). Column-oriented databases are suitable when new attributes need to be added; here the design of the database is essential. Document-oriented databases are suitable when multiple pieces of data need to be kept in one document that expresses the current state of the document (a snapshot). They are suitable for blogs or documentation systems that require multiple pieces of data for each document, such as its state at the time the information was entered. For such systems it is important that the information about a document is retrieved all at once, not just in parts. These systems can be used successfully for archiving documents created in relational databases. Most DBMS of the document-oriented model offer fast text search. Many of them also offer secondary indexes, regardless of the structure of the information. Another advantage is that the schema of the database does not have to be defined in advance. In a column-oriented database the structure is easy to change, but an initial structure must be set, which then changes as new attributes are added.

Disadvantages of document-oriented databases in comparison to relational ones are:
• No enforced document structure, which is both an advantage and a disadvantage
• Often slower than relational databases
• More storage space required
• No protection against duplicated information
• Possibility of a "broken document"
Because each node of a document-oriented database can have an arbitrary structure, a document may not be fully described. Document-oriented databases are an improved version of the key-value model. They make it possible to describe indexes over documents and their internal attributes; their drawback is that the schema then needs to be clarified.

PRACTICAL EXAMPLES OF THE DIFFERENT DB

Giants of the Internet industry, which generate a substantial portion of Web traffic, rely increasingly on NoSQL databases. Kai Orend classifies NoSQL DB and describes their use at Twitter, Facebook, LinkedIn, eBay and Amazon (11). For example, Cassandra is used at eBay, mainly because of the large catalogue: more than 200 million items, 100 million users and 2 billion pages displayed per day. Cassandra is used for the following reasons: multi-datacenter operation (active-active), availability (no SPOF), scalability, write performance, distributed counters and Hadoop support. eBay also uses MongoDB and HBase. Other large companies that use Cassandra because of the advantages mentioned above are: Soundcloud, Spotify, Instagram, Adobe Audience Manager, Adobe digital market
suite, AOL, British Gas, Call of Duty, Cisco, VMWare, Dell and many other big companies.

Like eBay, McAfee, the US Department of Energy and the US National Archives use MongoDB. As the authoritative source for McAfee threat information, MongoDB enables big data analytics and supports the real-time flow of cyberthreat data between Global Threat Intelligence's cloud-based system and end client products. It currently stores 4 billion documents, amounting to terabytes of data. The advantages they use are:
– Easy to increase storage capacity by orders of magnitude
– Lowered latency, easy to interact with JSON
– GridFS for high availability & CDN
– Flexibility to "decorate" base data
– Atomic updates and full consistency
The deployment uses: OS – CentOS; deployment platform – their own; 24 servers with 96 GB RAM per server and SSD; sharding and replication – 6 shards in the largest cluster; database – 4 billion documents; 20,000 writes per second on SSDs. Google uses its own product BigTable over GFS.

MariaDB is the database that powers billions of users on sites like Google and Wikipedia. MariaDB is used by reuter.de, Ingenico financial solutions, Jetair, Swisscom, Transticket and others.

Despite the increased data volumes, the general architect of the System Engineering Team at Jetair wanted a system that delivered faster transactions. He felt there were better options available because, "Our previous provider did not give any support for load balancing using Distributed Replicated Block Device (DRBD), and MySQL High Availability (MHA) support for unlimited amount of servers was overpriced."

CONCLUSION

The global digital market, automated production and the need to analyze business information in real time require reliable software that supports fast transactions over large volumes of data. Choosing a suitable database (relational or distributed, for OLAP or OLTP) will continue to be a common task in the development of software applications. More and more software architects make use of the possibilities offered by distributed databases in their systems, combining them with traditional relational ones.

From the review of modern DBMS according to their physical data model, it can be concluded that a major part of web applications and some business applications rely on a system model containing multiple data management systems, both traditional and distributed.

Acknowledgements
This paper was sponsored as a part of project SR15 FESS 019 / 24.04.2015 at the Scientific Research Fund of the University of Plovdiv “Paisii Hilendarski”.

REFERENCES
1. Rick Cattell, Scalable SQL and NoSQL Data Stores, SIGMOD Record, December 2010
2. Seth Gilbert, Nancy A. Lynch, Perspectives on the CAP Theorem, Computer 45(2), Feb. 2012
3. A Nayak, A Poriya, D Poojary, Type of NoSQL Databases and its Comparison with Relational Databases, International Journal of Applied Information Systems, March 2013
4. Jing Han, Survey on NoSQL database, 6th International Conference on Pervasive Computing and Applications (ICPCA), 2011
5. Hung-Chih Yang, Ruey-Lung Hsiao, D. Stott Parker, Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters, SIGMOD’07, Beijing, China, June 12–14, 2007
6. Grigoriev U. A., Cviashchenko E. V., Analysis processing versions of records in databases NoSQL, Science and Education, Publisher VPO "MSTU. N.E. Bauman.", FC 77 – 48211
7. Aslett, Matthew, "How Will The Database Incumbents Respond To NoSQL And NewSQL?", 451 Group, published 2011-04-04
8. Karthik Kambatla, Giorgos Kollias, Vipin Kumar, Ananth Grama, Trends in big data analytics, Journal of Parallel and Distributed Computing 74, Issue 7, July 2014
9. John Kubiatowicz, C-Store / DB Cracking, EECS 262a, Advanced Topics in Computer Systems, 29.10.2014
10. Carlo Beltrame, Key-value stores, Algorithms for Database Systems, ETH Zurich, 2013 April 26
11. Kai Orend, Analysis and Classification of NoSQL Databases and Evaluation of their Ability to Replace an Object-relational Persistence Layer, Technical University Munich, 2010 April 14