Modeling and Querying Data in Nosql Databases

2013 IEEE International Conference on Big Data Modeling and Querying Data in NoSQL Databases Karamjit Kaur Rinkle Rani Computer Sci. and Engg. Deptt. Computer Sci. and Engg. Deptt. Thapar University Thapar University Patiala, Punjab, India 147004 Patiala, Punjab, India 147004 Email: [email protected] Email: [email protected] Abstract—Relational databases are providing storage for sev- amounts of data. Their data model is based on single machine eral decades now. However for today’s interactive web and mobile architecture and was not designed to be distributed. Today all applications the importance of flexibility and scalability in data softwares are developed expecting a large user-base, which model can not be over-stated. The term NoSQL broadly covers was not the case in 1970s. Scalability is one of the most all non-relational databases that provide schema-less and scalable discussed issue today, since web applications have got enor- model. NoSQL databases which are also termed as Internet- mous popularity. Scalability can be achieved either vertically or age databases are currently being used by Google, Amazon, Facebook and many other major organizations operating in the horizontally. Vertical scalability, also known as (a.k.a.) scaling era of Web 2.0. Different classes of NoSQL databases namely up is easier to achieve as compared to horizontal scalability key-value pair, document, column-oriented and graph databases a.k.a. scaling out. As the name suggests, scaling-up means enable programmers to model the data closer to the format as adding up resources to a single node and scaling-out means used in their application. In this paper, data modeling and query adding more nodes to a system. Horizontal scaling provides syntax of relational and some classes of NoSQL databases have more flexibility as commodity servers or cloud instances can be been explained with the help of an case study of a news website utilized. Traditional databases relies on vertical scaling where like Slashdot. as recently evolved non-relational databases use horizontal scaling for achieving scalability [2]. I. INTRODUCTION Although relational databases have matured very well After 1990’s due to popularity of HTTP, the cost of posting because of their prolonged existence and are still good for and exchanging information became cheaper which led to various use cases. Unfortunately for most of the today’s the flooding of information on Internet. It was realized that software design, relational databases show their age and do traditional techniques of data storage will soon become stale not give good performance especially for large data sets and and inefficient to handle such vast amount of unstructured dynamic schemas. We live in a world, where domain model is and semi-structured data. Not all the information generated on constantly changing during development phase and even after Web is structured, rather interactive Web has produced more deployment. These changes in requirements along with various semi-structured or un-structured data. All the available rich other reasons described above, led to the development of non- information cannot be forcefully made to fit in the tabular relational databases known as NoSQL databases. Popularity of format of relational databases. This problem was also faced by non-relational databases can be very well imagined by the fact object-oriented databases under the name “Object-Relational that many universities have started teaching about these data Impedance Mismatch” problem [1]. This mismatch occurs stores as part of their curriculum [3]. when an object is molded to fit into relational structure. A large percentage of digital information floating around the world is NoSQL is a term most commonly used to cover all non- in PDF, HTML and other types of formats which cannot be relational databases, it stands for Not only SQL. There is easily modeled, processed and analyzed. Natural text can not much disagreement on this name as it do not depict the real be easily captured in form of entities and relationships. The meaning of non-relational, non-ACID, schema-less databases, major problem is that legacy tools require data schema to be since SQL is not the obstacle as implied by the term NoSQL. defined a priory to the data creation. In today’s world, where The term “NoSQL” was introduced by Carlo Strozzi in 1998 as Internet has become part of the common person’s life, deciding a name for his open source relational database that did not offer on rigid pre-defined schema is unrealistic. Schema evolves a SQL interface [4]. The term was re-introduced in October no:sql(east) as users start using the application and as information to be 2009 by Eric Evans for an event named organized dealt with changes according to user’s requirements. Structure for the discussion of open source distributed databases [5]. of data also changes as information is gathered, stored and The name was an attempt to describe the increased number processed. of distributed non-relational databases that emerged during the second half of the 2000’s. Increasing number of players dealing Above stated problems of storing semi-structured infor- with WWW started recognizing the in-efficiency of relational mation, object-relational impedance mismatch and schema databases to handle huge amount of diverse data generated by evolution originated because of the changes in usage pattern the introduction of Web 2.0 applications. Google was the first of databases since the the time the idea of relational databases organization to lead this movement by introducing BigTable were conceptualized in 1970s. Relational databases were not [6] in 2006 followed by Amazon’s Dynamo [7] in 2007. designed keeping in mind sparse information and scalability, Influenced by adoption of non-relational databases by these big where scalability is the ability of a system to handle growing firms most of the organizations started developing their own 978-1-4799-1293-3/13/$31.00 ©2013 IEEE 1 NoSQL data stores customized according to their requirements. have the same number of fields, while documents in a col- Most of today’s popular NoSQL data stores have adopted ideas lection can have completely different fields. Documents are either from Google’s BigTable or Amazon’s Dynamo. Those addressed in the database via a unique key that represents inspired by BigTable are categorized as column-oriented or that document. Document-oriented databases are one of the wide-table data stores and others which are descendants of categories of NoSQL databases that are appropriate for web- Dynamo are termed as key-value based data stores. There are applications which involves storage of semi-structured data two other categories too, document and graph-based databases. and execution of dynamic queries. Practical usability of these These four classes of NoSQL databases deal with different databases can be guessed by the fact that there are more types of data and hence are suitable for different use cases. than 15 document-oriented databases available of which widely used are MongoDB [18], CouchDB [19] and RavenDB [20]. Column-oriented or Wide-table data stores are designed MongoDB is the most popular document database. MongoDB to address following three areas- huge number of columns, is designed to be able to face new challenges such as horizontal sparse nature of data and frequent changes in schema. Unlike scalability, high-availability and flexibility to handle semi- relational databases where rows are stored contiguously, in structured data. MongoDB has typical applications in content column-oriented databases column values are stored contigu- management systems, mobiles, gaming and archiving. In this ously. This change in storage design results in better per- paper, MongoDB has been chosen to represent the case-study formance for some operations like aggregations, support for under consideration. ad-hoc and dynamic query etc. In row-oriented databases, all columns of those rows which satisfies the where clause of the query are retrieved, which causes unnecessary disk input- output if only few columns out of all returned columns were Graph databases model the database as a network struc- required. These databases are also best suited for analytical ture containing nodes and, edges relating nodes to represent purposes as they deal with only few specific columns and relationship amongst them. Nodes may also contain properties for compression also as data-type and range of values are that describes the real data contained within each object. Sim- fixed for each column. For each column, row-oriented storage ilarly, edges may also have their own properties. Relationship design deals with multiple data types and limitless range connects two nodes and may be directed where direction of values, thus making compression less efficient overall. adds meaning to the relationship. Relationships are identified Most of the columnar databases are also compatible with by their names and can be traversed in both the directions. MapReduce framework, which speeds up processing of large Comparing with Entity-Relational Model(ER Model), a node amount of data by distributing the problem on large number corresponds to an entity, property of a node to an attribute and of systems [8]. Popular open source column-oriented databases relationship between entities to relationship between nodes. are Hypertable [9], HBase [10] and Cassandra [11]. Hypertable In relational databases, queries requiring attributes from more and HBase are derivatives of BigTable where as Cassandra than one table result in a join operation. Join is an slow oper- takes its features from both BigTable and Dynamo. ation which degrades the performance as table size increases. Graph databases are suitable for finding relationships within Key-Value Databases have a very simple data model. Data huge amounts of data at a faster rate, since processing of is organized as an associative array of entries consisting of intensive joins are not performed instead it provides index- key-value pairs.

Load more