
Masaryk University Faculty of Informatics

Master's Thesis

Database management as a cloud-based service for small and medium organizations

Student: Dime Dimovski

Brno, 2013

Statement

I declare that I have worked on this thesis independently, using only the sources listed in the bibliography. All resources, sources, and literature that I used or drew upon in preparing this thesis are properly cited, with full references to their sources.

Dime Dimovski


Resume

The goal of this thesis is to explore the cloud, mainly focusing on database management systems offered as a cloud service. It reviews some of the currently available SQL- and NOSQL-based database management systems offered as a cloud service, the advantages and disadvantages of cloud computing in general, and the common considerations.

Keywords

Cloud computing, SaaS, PaaS, Database management, SQL, NOSQL, DBaaS, Database.com, SQL Azure, Web Services, SimpleDB, DynamoDB, SQL, MongoDB, CouchDB, Google Datastore.


Contents

1. Introduction
2. Introduction to Cloud Computing
2.1 Cloud computing – definition
2.2 Cloud Types
2.2.1 NIST model
2.3 Cloud computing architecture
2.3.1 Infrastructure
2.3.2 Platform
2.3.3 Application Platform as a Service (APaaS) or Virtual appliances
2.3.4 Application
3. Scalability
4. Elasticity
5. Database Management Systems in the cloud (Database as a service)
6. Database.com
6.1 Database.com Architecture
6.2 Multitenant data model
6.3 Multitenant indexes
6.4 Multitenant relationships
6.5 Multitenant field history
6.6 Partitioning of metadata, data, and index data
6.7 Application development
6.8 Data Access
6.9 Query languages
6.10 Multitenant search processing
6.11 Multitenant isolation and protection
6.12 Deletes, undeletes
6.13 Backup
6.14 Pricing
7. Microsoft's SQL Azure
7.1 Subscriptions
7.2 Databases
7.3 Security and Access to a SQL Azure Database
7.4 SQL Azure architecture
7.5 Logical Databases on a SQL Azure Server
7.6 Network Topology
7.7 High Availability with SQL Azure
7.8 Failure Detection
7.9 Reconfiguration
7.10 Availability Guarantees
7.11 Scalability with SQL Azure
7.12 Throttling
7.13 Load Balancer
7.14 SQL Azure Management
7.15 Pricing in SQL Azure
8. Amazon Web Services
8.1 Amazon Relational Database Service (Amazon RDS)
8.2 Amazon RDS Architecture/Features
8.3 Scalability with Amazon RDS
8.4 High Availability
8.5 Pricing
9. Google Cloud SQL
9.1 Pricing
10. Summary of RDBMSaaS and common considerations
11. NOSQL
12. Amazon SimpleDB and DynamoDB
12.1 History
12.2 Amazon DynamoDB Data Model
12.3 Amazon DynamoDB Features
12.4 Amazon SimpleDB
12.5 Pricing
13. Google Datastore
13.1 Datastore Data Model
13.2 Queries and indexes
13.3 Transactions
13.4 Scalability
13.5 High Availability
13.6 Data Access
13.7 Quotas and Limits
14. MongoLab/MongoDB and Cloudant/Apache CouchDB
14.1 Document-oriented database
14.2 MongoDB and CouchDB comparison
14.3 MVCC – Multi-Version Concurrency Control
14.4 Scalability
14.5 Querying
14.6 Atomicity and Durability
14.7 Map Reduce
14.8 JavaScript
14.9 REST
14.10 MongoLab and Cloudant
15. What benefits does cloud computing bring for small and medium organizations?
15.1 Advantages for Small Business
15.2 Disadvantages of Cloud Computing
15.3 Main things to be considered when moving to the cloud
16. Will cloud computing reduce the budget?
17. Conclusion
Appendix
Case studies from the industry – Amazon RDS
Case studies from the industry – Microsoft SQL Azure
Case studies from the industry – Amazon DynamoDB
Case studies from the industry – Amazon SimpleDB
References


1. Introduction

The boom of cloud computing over the past few years has given rise to many innovations and new technologies. It has become common for enterprises and individuals to use the services that are offered in the cloud and to recognize that cloud computing is a big deal, even if they are not clear why that is so. Even the phrase "in the cloud" has entered our colloquial language. A huge percentage of the world's developers are currently working on "cloud-related" products. The cloud has therefore become this amorphous entity that is supposed to represent the future of modern computing.

In an attempt to gain a competitive edge, businesses are looking for new, innovative ways to cut costs while maximizing value. They recognize the need to grow, but at the same time they are under pressure to save money. The cloud gives businesses this opportunity, allowing them to focus on their core business by offering hardware and software solutions without their having to develop these on their own.

In this thesis I will give an overview of what cloud computing is. I will describe its main concepts and architecture, take a look at the XaaS paradigm (something/everything as a service), and survey the currently available options in the cloud, mostly focusing on databases in the cloud, or Database as a Service. I will take a closer look at how cloud computing in general, and Database as a Service in particular, can be used by small and medium enterprises, what main benefits it offers, and whether it will really help businesses reduce their budgets and focus on their core business.


2. Introduction to Cloud Computing

In reality the cloud is something that we have been using for a long time: it is the Internet, with all the standards and protocols that provide Web services to us. The Internet is usually drawn as a cloud, and this represents one of the essential characteristics of cloud computing: abstraction. Cloud computing refers to applications and services that run on a distributed network using virtualized resources and are accessed by common Internet protocols and networking standards. It is distinguished by the notion that resources are virtual and limitless and that the details of the physical systems on which software runs are abstracted from the user.[1]

One of the main things driving cloud computing is the recent advancement in wireless speed and connectivity. Without these in place, cloud computing wouldn't be practical or even possible. In many ways, cloud computing was, and is, an eventuality. The influence of telecommunications organizations and their push towards simplifying and miniaturizing virtually every electronic device used by mobile users is pushing cloud computing forward even faster. This represents a major breakthrough not only in computing but also in communication.

Cloud computing represents a real paradigm shift in the way in which systems are deployed. The massive scale of cloud computing systems was enabled by the popularization of the Internet and the growth of some large service companies.[1]

Cloud computing has been compared to the standard utility companies, and it does bear a striking resemblance to these institutions. Just like water, electricity, or gas, cloud computing makes the long-held dream of utility computing possible with a pay-as-you-go, infinitely scalable, universally available system. In other words, the 'goods' come from one central location; we're just turning things off and on. This may ultimately give more people access to a larger pool of resources at an extremely reduced cost. One of the biggest benefits of cloud computing is its ability to offer users access to off-site hardware and software. With cloud computing, the resources of the cloud itself are at your disposal. This means all the hardware, software, processors, and networks will combine to give individuals much more computing power than has ever been possible. This will completely change nearly every facet of information exchange as well as influence everything from social networking to web development. By keeping things light and simple, individual access devices are going to last a lot longer and become much more durable. And of course, losing or breaking a device is no longer going to be of any particular concern, as it will be easily replaced and there is no danger of losing your files or information either.

With cloud computing, you can start very small and become big very fast. That's why cloud computing is revolutionary, even if the technology it is built on is evolutionary.

2.1 Cloud computing – definition

The use of the word "cloud" makes reference to two essential concepts:

 Abstraction

 Virtualization


Abstraction

Cloud computing is abstracting the details of the system implementation from the users and the developers. Applications run on physical systems that aren't specified, data is stored in locations that are unknown, administration of systems is outsourced to others, and access by users is ubiquitous.[1]

Virtualization

Cloud computing virtualizes systems by pooling and sharing resources. Systems and storage can be provisioned as needed from a centralized infrastructure, costs are assessed on a metered basis, multi-tenancy is enabled, and resources are scalable with agility.

Cloud computing is an abstraction based on the notion of pooling physical resources and presenting them as a virtual resource. It is a new model for provisioning resources, for staging applications, and for platform-independent user access to services. Clouds can come in many different types, and the services and applications that run on clouds may or may not be delivered by a cloud service provider.

2.2 Cloud Types

Usually cloud computing is separated into two distinct sets of models:

 Deployment models – refer to the location and management of the cloud's infrastructure.
 Service models – particular types of services that can be accessed on a cloud computing platform.

2.2.1 NIST model

The NIST model is a set of working definitions published by the U.S. National Institute of Standards and Technology. This cloud model is composed of five essential characteristics, three service models, and four deployment models.[2]

Essential Characteristics:

 On-demand self-service - A consumer can unilaterally provision computing capabilities, such as server time and network storage, as needed automatically without requiring human interaction with each service provider.
 Broad network access - Capabilities are available over the network and accessed through standard mechanisms that promote use by heterogeneous thin or thick client platforms (e.g., mobile phones, tablets, laptops, and workstations).
 Resource pooling - The provider's computing resources are pooled to serve multiple consumers using a multi-tenant model, with different physical and virtual resources dynamically assigned and reassigned according to consumer demand. There is a sense of location independence in that the customer generally has no control or knowledge over the exact location of the provided resources but may be able to specify location at a higher level of abstraction (e.g., country, state, or datacenter). Examples of resources include storage, processing, memory, and network bandwidth.


 Rapid elasticity - Capabilities can be elastically provisioned and released, in some cases automatically, to scale rapidly outward and inward commensurate with demand. To the consumer, the capabilities available for provisioning often appear to be unlimited and can be appropriated in any quantity at any time.
 Measured service - Cloud systems automatically control and optimize resource use by leveraging a metering capability at some level of abstraction appropriate to the type of service (e.g. storage, processing, bandwidth, and active user accounts). Resource usage can be monitored, controlled, and reported, providing transparency for both the provider and consumer of the utilized service.

Service Models:

 Software as a Service (SaaS) - The capability provided to the consumer is to use the provider's applications running on a cloud infrastructure. The applications are accessible from various client devices through either a thin client interface, such as a web browser (e.g., web-based email), or a program interface. The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, storage, or even individual application capabilities, with the possible exception of limited user-specific application configuration settings.
 Platform as a Service (PaaS) - The capability provided to the consumer is to deploy onto the cloud infrastructure consumer-created or acquired applications created using programming languages, libraries, services, and tools supported by the provider. The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, or storage, but has control over the deployed applications and possibly configuration settings for the application-hosting environment.
 Infrastructure as a Service (IaaS) - The capability provided to the consumer is to provision processing, storage, networks, and other fundamental computing resources where the consumer is able to deploy and run arbitrary software, which can include operating systems and applications. The consumer does not manage or control the underlying cloud infrastructure but has control over operating systems, storage, and deployed applications; and possibly limited control of select networking components (e.g., host firewalls).

Deployment Models:

 Private cloud - The cloud infrastructure is provisioned for exclusive use by a single organization comprising multiple consumers (e.g., business units). It may be owned, managed, and operated by the organization, a third party, or some combination of them, and it may exist on or off premises.
 Community cloud - The cloud infrastructure is provisioned for exclusive use by a specific community of consumers from organizations that have shared concerns (e.g., mission, security requirements, policy, and compliance considerations). It may be owned, managed, and operated by one or more of the organizations in the community, a third party, or some combination of them, and it may exist on or off premises.
 Public cloud - The cloud infrastructure is provisioned for open use by the general public. It is usually an open system available to the general public via the Web or the Internet. It may be owned, managed, and operated by a business, academic, or government organization, or some combination of them. It exists on the premises of the cloud provider. Examples of public clouds include Google App Engine and Amazon Elastic Compute Cloud.
 Hybrid cloud - The cloud infrastructure is a composition of two or more distinct cloud infrastructures (private, community, or public) that remain unique entities, but are bound together by standardized or proprietary technology that enables data and application portability (e.g., cloud bursting for load balancing between clouds). [2]

2.3 Cloud computing architecture

Cloud computing is essentially a series of levels that function together in various ways to create a system. This system is also referred to as cloud computing architecture. The cloud creates a system where resources can be pooled and partitioned as needed. Cloud architecture can couple software running on virtualized hardware in multiple locations to provide an on-demand service to user-facing hardware and software. A cloud can be created within an organization's own infrastructure or outsourced to another datacenter. Usually resources in a cloud are virtualized resources because virtualized resources are easier to modify and optimize. A compute cloud requires virtualized storage to support the staging and storage of data. From a user's perspective, it is important that the resources appear to be infinitely scalable, that the service be measurable, and that the pricing be metered.[1]

Figure 1 Cloud computing stack

Applications in the cloud are usually composable systems; this means that they use standard components to assemble services that are tailored for a specific purpose. A composable component must be:

• Modular: It is a self-contained and independent unit that is cooperative, reusable, and replaceable.


• Stateless: A transaction is executed without regard to other transactions or requests.

In general, cloud computing does not require hardware and software to be composable, but it is a highly desirable characteristic. It makes system design easier to implement, and solutions are more portable and interoperable.

Some of the benefits from composable system are:

 Easier to assemble systems

 Cheaper system development

 More reliable operation

 A larger pool of qualified developers
 A logical design methodology

The trend toward designing composable systems in cloud computing is visible in the widespread adoption of what has come to be called the Service Oriented Architecture (SOA). The essence of a service-oriented design is that services are constructed from a set of modules using standard communications and service interfaces. An example of a set of widely used standards describes the services themselves in terms of the Web Services Description Language (WSDL), data exchange between services using some form of XML, and the communications between the services using the SOAP protocol. There are, of course, alternative sets of standards.[1]

What isn't specified is the nature of the module itself; it can be written in any language the developer wants. From the standpoint of the system, the module is a black box, and only the interface is well specified. This independence of the internal workings of the module or component means it can be swapped out for a different module, relocated, or replaced at will, provided that the interface specification remains unchanged. That is a powerful benefit to any system or application provider as their products evolve.

Essentially there are 3 tiers in a basic cloud computing architecture:

 Infrastructure

 Platform

 Application

If we further break down the standard cloud computing architecture, there are really two areas to deal with: the front end and the back end.

Front End - The front end includes all client (user) devices and hardware, together with the applications they actually use to make a connection with the cloud.

Back End - The back end is populated with the various servers, data storage devices and hardware that facilitate the functionality of a cloud computing network.

2.3.1 Infrastructure

The infrastructure of cloud computing architecture is essentially all the hardware, data storage devices (including virtualized hardware), networking equipment, applications, and software that operates and drives the cloud.

Most Infrastructure as a Service (IaaS) providers use virtual machines to deliver servers that run applications. Virtual machine images, or instances, are containers that have been assigned specific resources (number of CPU cycles, memory access, network bandwidth, etc.).

Figure 2 shows the cloud computing stack that is defined as the server. The Virtual Machine Monitor, also called a hypervisor, is the low-level software that allows different operating systems to run in their own memory space and manages I/O for the virtual machines.[1]

Figure 2 "Server" stack

2.3.2 Platform

A cloud computing platform is the actual programming, code, and implemented systems of interfacing that help user-level devices (and applications) connect with the hardware and software resources of the cloud. It is a software layer that is used to create higher levels of service.

A cloud computing platform is generally divided up between the front end and back end of a network. Its job is to provide a communication and access portal for the client, so that they may effectively utilize the resources of the cloud network. The platform may only be a set of directions, but it is in all actuality the most integral part of a cloud computing network; without it cloud computing would not be possible.

There are many different Platform as a Service (PaaS) providers; we will mention some of them:

 Salesforce.com's Force.com and Database.com platforms

 Windows Azure Platform

 Google Apps and Google AppEngine

All platform services offer hosted hardware and software needed to build and deploy Web applications or services that are custom built by developers.

It makes sense for vendors to move their development environments into the cloud with the same technologies that have been successfully used to create Web applications. Thus, you might find a platform based on an Oracle xVM hypervisor virtual machine that includes a NetBeans Integrated Development Environment (IDE) and that supports the Oracle GlassFish Web stack, programmable using languages such as Ruby. For Windows, Microsoft would be similarly interested in providing a platform that allowed Windows developers to run on a Hyper-V VM, use the ASP.NET application framework, support one of its enterprise applications such as SQL Server, and be programmable within Visual Studio, which is essentially what the Azure Platform does. This approach allows someone to develop a program in the cloud that can be used by others. Platforms often come with tools and utilities to aid in application design and deployment. Depending on the vendor, these can include tools for team collaboration, testing tools, versioning tools, database and web service integration, and storage tools. Platform providers typically begin by creating a developer community to support the work done in the environment.

The platform is exposed to users through an API, and an application built in the cloud using a platform service would also encapsulate the service through its own API. An API can control data flow, communications, and other important aspects of the cloud application. So far there is no standard API, and each cloud vendor has its own.

2.3.3 Application Platform as a Service (APaaS) or Virtual appliances

A virtual appliance is software that installs as middleware onto a virtual machine. These are usually Web servers, database servers, BPM engines, ESBs, messaging portals, and other components running on a virtual machine image. This, referred to by some as Application Platform as a Service, is more or less a horizontal extension of the PaaS offerings.

APaaS is a type of service model that gives cloud software developers the power to actually do their jobs. This gives an opportunity to use APaaS/virtual appliances to build more complex services. Within the APaaS system, the actual software architectures of applications are built and established. It is also within this layer that overall portability (and the ability of an application to function alongside a bevy of other cloud applications as well as operating systems) is established. Since most of the actual developmental breakthroughs (both in terms of software and overall cloud usability) occur within the realms of the middleware (PaaS, APaaS), it makes sense that a great deal of attention is paid to it. [3]

For example, Amazon Web Services offers more than 700 different virtual machine images preconfigured with enterprise applications such as Oracle BPM and SQL Server, and even complete application stacks such as LAMP (Linux, Apache, MySQL, and PHP), which are used to create virtual machines within the Amazon Elastic Compute Cloud (EC2). Such an image serves as the basic unit of deployment for services delivered using EC2.
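To make this concrete, the short sketch below shows how such a preconfigured image could be launched programmatically. It is only an illustration under stated assumptions, not part of the thesis: it uses the boto3 Python library, and the region, AMI ID, and instance type are placeholder values, not recommendations.

# Minimal sketch: launching one virtual machine from a preconfigured
# appliance image (AMI) on Amazon EC2 using boto3.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")   # placeholder region

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # hypothetical LAMP-stack appliance image
    InstanceType="t2.micro",           # placeholder instance size
    MinCount=1,
    MaxCount=1,
)

instance_id = response["Instances"][0]["InstanceId"]
print("Launched appliance instance:", instance_id)

Once the instance is running, the appliance's middleware (web server, database, and so on) is available without any manual installation, which is exactly the convenience the virtual-appliance model is selling.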

APaaS gives software developers a solid platform to stand on, with its own impressive workbench of tools, while they are constructing and envisioning new possibilities.

The true benefit of APaaS, however, is its ability to provide accurate feedback regarding the functionality and compatibility of applications that are still under development. This is extremely important to software developers, who can take serious losses (in terms of both money and time spent) if they produce an application that simply won't function in an environment, behave as expected once deployed, or function in a compatible manner with other elements in a cloud infrastructure. Companies that want to run their IT and/or software development projects through an APaaS need only pay subscription fees and not licensing fees. Subscription is substantially cheaper than licensing and offers its benefits when paired with cloud APaaS. Most APaaS packages that are put together for designers are often much easier to use than most standardized design tools. These packages often allow software development teams to integrate and share their work more smoothly as well as run the project from start to finish much faster than with other systems.[3]

The global emergence of APaaS will no doubt lead to the creation of a number of companies that will utilize the tools of APaaS to create their own business model, especially one that seeks to provide yet another proprietary service aimed at delivering timely solutions to business software issues. One particular area that could use the help is enterprise software, for example. Enterprise software is often hard to manage, difficult to customize and frequently falls short in its functionalities. When you couple these shortcomings with the fact that it is often quite expensive, there is a serious problem. An obvious solution for dealing with enterprise software problems would be the deployment of an APaaS-style service. Not only would this greatly increase the overall functionality of expensive enterprise business software, but it would also allow for a great range of customization, as well as the option for integrating it with other cloud services and/or networking opportunities. APaaS was created to make the lives of software designers, developers and investors much easier. It is through the use of APaaS that many excellent next generation apps have been developed and many experts in the field of cloud computing agree that it is APaaS that will produce some of the upcoming “game changing” applications that will actually shape the future of cloud computing in general.

2.3.4 Application

This area is composed of the client hardware and the interface used to connect to the cloud. Big problems arise from the design of Internet protocols to treat each request to a server as an independent transaction (stateless service) [1]. The standard HTTP commands are all atomic in nature. While stateless servers are easier to architect and stateless transactions are more resilient and can survive outages, much of the useful work that computer systems need to accomplish is stateful. The use of transaction servers, message queuing servers, and other similar middleware is meant to bridge this problem. Standard methods from Service Oriented Architecture that help solve this issue and that are used in cloud computing are:

 Orchestration – process flow can be choreographed as a service

 Use of a service bus that controls cloud components

There are many ways in which clients can connect to a cloud service. The most common are:

 Web browser

 Proprietary application

These applications can run on a number of different devices: PCs, servers, smartphones, and tablets. They all need a secure way to communicate with the cloud. Some of the basic methods for securing the connection are:

 Secure protocols such as SSL (HTTPS), FTPS, IPSec, or SSH

 Virtual connection using a virtual private network (VPN)

 Remote data transfer protocols such as Microsoft RDP or Citrix ICA that use a tunneling mechanism
 Data encryption


3. Scalability

Scalability is the ability of a system to handle a growing amount of work in a capable manner, or its ability to improve when additional resources are added.

The scalability requirement arises due to the constant load fluctuations that are common in the context of Web-based services. In fact these load fluctuations occur at varying frequencies: daily, weekly, and over longer periods. The other source of load variation is due to unpredictable growth (or decline) in usage. The need for scalable design is to ensure that the system capacity can be augmented by adding additional hardware resources whenever warranted by load fluctuations. Thus, scalability has emerged both as a critical requirement as well as a fundamental challenge in the context of cloud computing.[1][4]

Typically there are two ways to increase scalability:

 Vertical scalability – adding hardware resources, usually CPU, memory, etc. Vertical scaling (scaling up) enables providers to use virtualization technologies more effectively by providing more resources for the hosted operating systems and applications to share.
 Horizontal scalability – adding more nodes to a system, such as adding a new node to a distributed software application or adding more access points within the current system. Hundreds of small computers may be configured in a cluster to obtain aggregate computing power. The horizontal scalability (scale-out) model also creates an increased demand for shared data storage with very high I/O performance, especially where processing of large amounts of data is required. In general, the scale-out paradigm has served as the fundamental design paradigm for the large-scale data centers of today.

Integrating multiple load balancers into your system is probably the best solution for dealing with scalability issues. There are many different forms of load balancers to choose from: server farms, software, and even hardware designed to handle and distribute increased traffic. Items that interfere with scalability[3]:

 Too much software clutter (no organization) within the hardware stack(s).

 Overuse of third-party scaling.

 Reliance on the use of synchronous calls.

 Not enough caching

 Database not being used properly.

Creating a cloud network that offers the maximum level of scalability potential is entirely possible if we apply a more “diagonal” solution. By incorporating the best solutions present in both vertical and horizontal scaling, it is possible to reap the benefits of both models[3]. Once the servers reach the limit of diminishing returns (no growth), we should simply start cloning them. This will allow us to keep a consistent architecture when adding new components, software, apps and users. For most individuals, problems arise from lack of resources not the inherent architecture of their cloud itself. A more diagonal approach should help the business to deal with the current and growing demands that it is facing.
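To illustrate the scale-out idea discussed above, the following Python sketch shows the simplest form of load balancing, round-robin rotation over a pool of identical nodes, where "cloning" a server is nothing more than adding another node to the pool. This is my own illustrative sketch; the node names and the balancer are hypothetical and not taken from any system cited in this thesis.

# Illustrative sketch of horizontal scaling behind a round-robin load balancer.
import itertools

class RoundRobinBalancer:
    def __init__(self, nodes):
        self._nodes = list(nodes)
        self._cycle = itertools.cycle(self._nodes)

    def add_node(self, node):
        """Scale out: clone another identical node into the pool."""
        self._nodes.append(node)
        self._cycle = itertools.cycle(self._nodes)

    def route(self, request):
        """Hand the request to the next node in the rotation."""
        return f"request {request!r} -> {next(self._cycle)}"

balancer = RoundRobinBalancer(["app-server-1", "app-server-2"])
balancer.add_node("app-server-3")   # the 'cloning' step described above
for i in range(4):
    print(balancer.route(i))

Real cloud load balancers add health checks, session affinity, and weighting, but the architectural point is the same: capacity grows by adding nodes, not by enlarging a single machine.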


4. Elasticity

Of all the attributes possessed by cloud computing in general, the most important is certainly its elasticity: its ability to amplify and instantly upgrade resources and/or capacities on a moment's notice. Storage, processing, and the scalability of applications are all elastic in the cloud. The really remarkable thing about cloud computing is the real-time infrastructure that actively responds to user requests for resources. Without the real-time monitoring and support behind this elasticity, the effectiveness, adaptability, and muscle of cloud computing would be greatly undermined. It is this elastic ability of the service providers that allows them to offer their users access to cloud computing services at such reduced costs. Since users only pay for what they use, they can save money. For example, with the traditional grid computing network every user has its own intensive hardware setup, of which most users rarely use more than 50% of the capacity. Their combined resource usage might be 20-30% of the total resources available on their central cloud computing hardware stack. What cloud computing is really offering is the ability for average users to retain their current standards and expectations, while leaving the door open for instant expansion opportunities if they desire it. This also gives a much more efficient way to use energy. Elasticity offers the same computing experience to which we are accustomed, with the added benefit of near limitless resources, at the same time offering a way to manage energy consumption. [1][3]

The elastic capabilities offered by cloud computing make it perfectly suited to handling certain activities or processes:

 Establishing an "in office" communication and online networking infrastructure (for employees). Setting up a system that allows those in the organization a cleaner and more efficient way of communicating and working often leads to greatly increased profits.
 Using cloud computing to handle overdrafting - high-volume data transfer periods and events. Some businesses only use cloud computing when they run out of their own resources, or perhaps anticipate that they might lack needed functionalities. This can be something that is scheduled on an annual or bi-annual basis, designed to meet a seasonal demand for a particular product, for example.
 Assigning all customer data and transaction information to a cloud computing element. This allows an organization to keep its customers' data safe even from its own employees. Utilizing a third party to handle all customer data can also pay off in the event of a catastrophic event. Cloud computing providers tend to keep your information more securely backed up than most are even aware of. [3]

In other words, elasticity allows both the user and the provider to "do more with less".


5. Database Management Systems in the cloud (Database as a service)

Data and database management are an integral part of a wide variety of applications. Relational DBMSs in particular have been used massively due to the many features they offer:

 Overall functionality, offering an intuitive and relatively simple model for modeling different types of applications
 Consistency, dealing with concurrent workloads without worrying about the data getting out of sync
 Performance, low latency and high throughput combined with many years of engineering and development
 Reliability, persistence of data in the presence of different types of failures and ensuring safety

The main concern is that DBMSs and RDBMSs are not cloud-friendly, because they are not as scalable as web servers and application servers, which can scale from a few machines to hundreds. Traditional DBMSs were not designed to run on top of a shared-nothing architecture (where a set of independent machines accomplish a task with minimal resource overlap) and they do not provide the tools needed to scale out from a few to a large number of machines. Technology leaders such as Google, Amazon, and Microsoft have demonstrated that data centers comprising thousands to hundreds of thousands of compute nodes provide unprecedented economies of scale, since multiple applications can share a common infrastructure. All three companies have provided frameworks such as Amazon's AWS, Google's AppEngine, and Microsoft Azure for hosting third-party applications in their clouds (data-center infrastructures). Because the RDBMSs, or "transactional data management" databases, that back banking, airline reservation, online e-commerce, and supply chain management applications typically rely on the ACID (Atomicity, Consistency, Isolation, Durability) guarantees that databases provide, and because it is hard to maintain ACID guarantees in the face of data replication over large geographic distances¹, these companies have even developed proprietary data management technologies referred to as key-value stores, informally called NO-SQL database management systems.[6] The need for web-based applications to support a virtually unlimited number of users and to respond to sudden load fluctuations raises the requirement to make them scalable in cloud computing platforms. Such scalability needs to be provisioned dynamically without causing any interruption in the service. Key-value stores and other NOSQL database solutions, such as Google Datastore offered with Google AppEngine, Amazon SimpleDB and DynamoDB, MongoDB, and others, have been designed so that they can be elastic or can be dynamically provisioned in the presence of load fluctuations. We will explain some of these systems in more detail later on.

¹ The CAP theorem, also known as Brewer's theorem, states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees: Consistency (all nodes see the same data at the same time), Availability (a guarantee that every request receives a response about whether it was successful or failed), and Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system). According to the theorem, a distributed system can satisfy any two of these guarantees at the same time, but not all three.


As we move to the cloud-computing arena which typically comprises data-centers with thousands of servers, the manual approach of database administration is no longer feasible. Instead, there is a growing need to make the underlying data management layer autonomic or self-managing especially when it comes to load redistribution, scalability, and elasticity. [7]

Figure 3 Traditional VS Cloud Data Services

This issue becomes especially acute in the context of pay-per-use cloud computing platforms hosting multi-tenant applications. In this model, the service provider is interested in minimizing its operational cost by consolidating multiple tenants on as few machines as possible during periods of low activity and distributing these tenants across a larger number of servers during peak usage [7]. Due to the above desirable properties of key-value stores in the context of cloud computing and large-scale data centers, they are being widely used as the data management tier for cloud-enabled Web applications. Although it is claimed that atomicity at a single key is adequate in the context of many Web-oriented applications, evidence is emerging that in many application scenarios this is not enough. In such cases, the responsibility to ensure atomicity and consistency of multiple data entities falls on the application developers. This results in the duplication of multi-entity synchronization mechanisms many times over in the application software. In addition, as it is widely recognized that concurrent programs are highly vulnerable to subtle bugs and errors, this approach adversely impacts application reliability. The need to provide atomicity beyond single entities is widely discussed in developer blogs. Recently, this problem has also been recognized by senior architects from Amazon and Google, leading to systems like MegaStore [10] that provide transactional guarantees on key-value stores.

Both RDBMS and NOSQL DBMS offerings in the cloud will be explained in more detail: how they work, who offers them, and how they are provisioned. I will first focus on the relational databases offered in the cloud, starting with one of the first enterprise databases built for the cloud, Salesforce's Database.com.


6. Database.com

Database.com is a database management system built for cloud computing, with multitenancy inherent in its design. Traditional RDBMSs were designed to support on-premises deployments for one organization. All core mechanisms, such as the system catalog, caching mechanisms, and query optimizer, are built to support single-tenant applications and to run directly on a specifically tuned host operating system and hardware. The only possible way to build a multi-tenant cloud database service with a standard RDBMS is to use virtualization. Unfortunately, the extra overhead of the hypervisor typically hurts the performance of the RDBMS. Database.com combines several different persistence technologies, including a custom-designed relational database schema, which are innately designed for clouds and multitenancy - no virtualization required.

6.1 Database.com Architecture

Database.com's core relational database technology uses a runtime engine that materializes all application data from metadata - data about the data itself. In Database.com's metadata-driven architecture, there is a clear separation of the compiled runtime database engine (kernel), tenant data, and the metadata that describes each application's schema. These distinct boundaries make it possible to independently update the system kernel and tenant-specific application schemas.

Figure 4 Database.com Architecture [9]

Every logical database object is internally managed using metadata. Objects, (“tables” in traditional relational database parlance), fields, stored procedures, and database triggers are all abstract


constructs that exist merely as metadata in Database.com's Universal Data Dictionary (UDD). The terminology used by Database.com is shown in Table 1.

Relational Database Term      Equivalent Term in Database.com
Database                      Organization
Table                         Object
Column                        Field
Row                           Record

Table 1 Database.com Terminology

When a new application object is defined or some procedural code is written, Database.com does not create an actual table in a database or compile any code; it simply stores metadata that the system's engine can use to generate the virtual application components at runtime. When modification or customization of something about the application schema is needed, such as modifying an existing field in an object, all that is required is a simple non-blocking update to the corresponding metadata [9].
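The toy sketch below is my own illustration (not Database.com code) of what such a metadata-driven definition looks like in spirit: defining an "object" merely records descriptive rows in a catalog, so no physical table is created and nothing is compiled. The catalog layout and names (OrgID, ObjID, field numbers) are assumptions loosely modeled on the terminology used in this chapter.

# Toy sketch: a virtual table exists only as rows of metadata.
metadata_catalog = {"objects": [], "fields": []}

def define_object(org_id, obj_name, fields):
    """Register a virtual table and its columns purely as metadata."""
    obj_id = f"obj-{len(metadata_catalog['objects'])}"
    metadata_catalog["objects"].append(
        {"OrgID": org_id, "ObjID": obj_id, "ObjName": obj_name})
    for position, (name, datatype) in enumerate(fields):
        metadata_catalog["fields"].append(
            {"OrgID": org_id, "ObjID": obj_id, "FieldName": name,
             "DataType": datatype, "FieldNum": position})
    return obj_id

define_object("org-42", "Invoice", [("Number", "text"), ("Amount", "number")])
print(metadata_catalog)   # the 'schema change' was just a few new rows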

In order to avoid performance-sapping disk I/O and code recompilations, and to improve application response times, Database.com uses massive and sophisticated metadata caches to maintain the most recently used metadata in memory. The system's runtime engine must be optimized for metadata access, because frequent metadata access would otherwise prevent the service from scaling.

At the heart of Database.com is its transaction database engine. Database.com uses a relational database engine with a specialized schema built for multitenancy. It also employs a search engine (separate from the transaction engine) that optimizes full-text indexing and searches. As applications update data, the search service's background processes asynchronously update tenant- and user-specific indexes in near real time. This separation of duties between the transaction engine and the search service lets applications process transactions without the overhead of text index updates [9].

6.2 Multitenant data model

Database.com storage model manages virtual database structures using a set of metadata, data, and pivot tables, as illustrated in Figure 5

Figure 5 Multitenant data model of Database.com [9]

When application schemas are created, the UDD keeps track of metadata concerning the objects, their fields, their relationships, and other object attributes. A few large database tables store the structured and unstructured data for all virtual tables. A set of related multitenant indexes, implemented as simple pivot tables with denormalized data, makes the combined data set extremely functional.

Because Database.com manages object and field definitions as metadata rather than actual database structures, the system can tolerate online multitenant application schema maintenance activities without blocking the concurrent activity of other tenants and users [9].

6.3 Multitenant indexes

Database.com automatically indexes various types of fields to deliver scalable performance. Traditional database systems rely on native database indexes to quickly locate specific rows in a database table that have fields matching a specific condition. The indexing of MT_Data is managed by synchronously copying field data marked for indexing into an appropriate column of a pivot table called MT_Indexes.

In some circumstances the external search engine can fail to respond to a search request. In such cases Database.com falls back to a secondary search mechanism. A fallback search is implemented as a direct database query with search conditions that reference the Name field of target records. To optimize global object searches (searches that span tables) without having to execute potentially expensive union queries, a pivot table called MT_Fallback_Indexes, which records the Name of all records, is maintained. Updates to MT_Fallback_Indexes happen synchronously, as transactions modify records, so that fallback searches always have access to the most current database information [9].
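A minimal sketch of the pivot-table idea follows. The table and column names echo the MT_Data/MT_Indexes terminology above, but the code is only my illustration of the synchronous copy, not Salesforce's implementation.

# Sketch: an indexed write lands in the data table and, synchronously,
# in a typed column of the index pivot table.
mt_data, mt_indexes = [], []

def write_field(org_id, obj_id, record_id, field_num, value, indexed=False):
    mt_data.append({"OrgID": org_id, "ObjID": obj_id, "RecordID": record_id,
                    "FieldNum": field_num, "Value": value})
    if indexed:
        # The synchronous copy keeps the pivot table consistent with MT_Data.
        mt_indexes.append({"OrgID": org_id, "ObjID": obj_id,
                           "RecordID": record_id, "StringValue": str(value)})

write_field("org-42", "obj-0", "rec-1", 0, "INV-0001", indexed=True)
print(mt_indexes)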

6.4 Multitenant relationships

Database.com provides "relationship" datatypes that an organization can use to declare relationships (referential integrity) among tables. When an organization declares an object's field with a relationship type, the field is mapped to a Value field in MT_Data, and this field is then used to store the ObjID of a related object [9].

6.5 Multitenant field history

Database.com provides history tracking for any field. When a tenant enables auditing for a specific field, the system asynchronously records information about the changes made to the field (old and new values, change date, etc.) using an internal pivot table as an audit trail [9].

6.6 Partitioning of metadata, data, and index data

All Database.com data, metadata, and pivot table structures, including underlying database indexes, are physically partitioned by tenant (OrgID) using native database partitioning mechanisms. Data partitioning is a proven technique that database systems provide to physically divide large logical data structures into smaller, more manageable pieces. Partitioning can also help to improve the performance, scalability, and availability of a large database system such as a multitenant environment. For example, by definition, every Database.com query targets a specific tenant’s information, so the


query optimizer need only consider accessing the data partitions that contain a tenant's data rather than an entire table or index. This common optimization is sometimes referred to as "partition pruning." [9]

6.7 Application development

Developers can declaratively build server-side application components using the Database.com Console. This point-and-click interface supports all facets of the application schema building process, including the creation of an application's data model (objects and their fields, relationships, etc.), security and sharing model (users, profiles, role hierarchies, etc.), declarative logic (workflows), and programmatic logic (stored procedures and triggers). The Console provides access to built-in system features, which make it easy to implement application functionality without the need to write code [9].

6.8 Data Access

Database.com provides the following tools to query and work with data.

Database.com REST API and Force.com Web Services API

The REST API and Web Services API can be used to interact with Database.com by creating, retrieving, updating, and deleting records, maintaining passwords, performing searches, and so on. They can be used with any language that supports Web services.

The SOAP-based API is optimized for real-time client applications that update small numbers of records at a time [8] [9].
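As a hedged illustration of how a client might use the REST API, the Python sketch below creates and then retrieves a record with the requests library. The instance URL, API version, object and field names, and the OAuth access token are placeholders; a real application would first obtain the token through one of the supported OAuth flows.

# Sketch: create and read a record through the REST API (placeholder values).
import requests

INSTANCE = "https://na1.salesforce.com"          # placeholder instance URL
HEADERS = {"Authorization": "Bearer <access_token>",
           "Content-Type": "application/json"}

# Create a record in a hypothetical Merchandise object.
create = requests.post(
    f"{INSTANCE}/services/data/v26.0/sobjects/Merchandise/",
    headers=HEADERS,
    json={"Name": "Sample item", "Price__c": 10.0},
)
record_id = create.json()["id"]

# Retrieve the same record by its ID.
fetch = requests.get(
    f"{INSTANCE}/services/data/v26.0/sobjects/Merchandise/{record_id}",
    headers=HEADERS,
)
print(fetch.json())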

Force.com Bulk API

The Bulk API is based on REST principles, and is optimized for loading or deleting large sets of data. It can be used to insert, update, delete, or restore a large number of records asynchronously by submitting a number of batches that are processed in the background by Database.com. The Bulk API is designed to simplify the processing of a few thousand to millions of records.

Apex Data Manipulation Language (DML)

DML statements are used to insert, delete, and update data from within your Apex code.

Apex Web Services

Apex methods can be exposed as Web service operations that can be called by external Web client applications. This is a powerful tool for building efficient communication between data service and application tier. By aggregating business logic onto Database.com, it can:

 Prevent unnecessary communication between data service and the client


 Simplify client development and maintenance by providing a coarse-grained application-level API

 Build more robust applications, since all of the logic implemented in Apex is executed within a transaction on Database.com [9]

6.9 Query languages

Database.com uses the Salesforce Object Query Language (SOQL) to construct database queries. Similar to the SELECT command in the Structured Query Language (SQL), SOQL allows you to specify the source object, a list of fields to retrieve, and conditions for selecting rows in the source object. Database.com also includes a full-text, multilingual search engine that automatically indexes all text-related fields. Apps can leverage this pre-integrated search engine using the Salesforce Object Search Language (SOSL) to perform text searches.

Unlike SOQL, which can only query one object at a time, SOSL can search text, email, and phone fields for multiple objects simultaneously [9].
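The following short Python sketch contrasts the two languages by issuing them through the REST API's query and search resources; the object and field names are hypothetical, and the connection details are the same placeholders as in the earlier REST example.

# SOQL targets a single object, much like a SQL SELECT.
import requests

INSTANCE = "https://na1.salesforce.com"                  # placeholder
HEADERS = {"Authorization": "Bearer <access_token>"}

soql = "SELECT Name, Price__c FROM Merchandise WHERE Price__c > 5 LIMIT 10"
rows = requests.get(f"{INSTANCE}/services/data/v26.0/query/",
                    headers=HEADERS, params={"q": soql}).json()

# SOSL is a full-text search across several objects at once.
sosl = "FIND {sample} IN ALL FIELDS RETURNING Merchandise(Name), Invoice(Name)"
hits = requests.get(f"{INSTANCE}/services/data/v26.0/search/",
                    headers=HEADERS, params={"q": sosl}).json()

print(rows, hits)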

6.10 Multitenant search processing

Web-based application users have come to expect an interactive search capability that can scan an application's entire database, or a selected scope of it, return ranked results that are up to date, and do it all with sub-second response times. To provide such robust search functionality for applications, Database.com uses a search engine that is separate from its transaction engine. The relationship between the two engines is depicted in Figure 6.

Figure 6 Transaction and Search engine [9]

The search engine receives data from the transactional engine, with which it creates search indexes. The transactional engine forwards search requests to the search engine, which returns results that the transaction engine uses to locate rows that satisfy the search request.

As applications update data in text fields (CLOBs, Name, etc.), a pool of background processes called indexing servers is responsible for asynchronously updating the corresponding indexes, which the search engine maintains outside the core transaction engine. To optimize the indexing process, Database.com synchronously copies modified chunks of text data to an internal "to-be-indexed" table as transactions commit, thus providing a relatively small data source that minimizes the amount of data that indexing servers must read from disk. The search engine automatically maintains separate indexes for each organization (tenant).

Depending on the current load and utilization of indexing servers, text index updates may noticeably lag behind actual transactions. To avoid unexpected search results originating from stale indexes, Database.com also maintains an MRU (most recently used) cache of recently updated rows that the system considers when materializing full-text search results. In order to efficiently support possible search scopes, MRU caches are maintained per-user and per-organization.

Database.com’s search engine optimizes the ranking of records within search results using several different methods. For example, the system considers the security domain of the user performing a search and weighs those rows to which the current user has access more heavily. The system can also consider the modification history of a particular row and rank more actively updated rows ahead of those that are relatively static. The user can choose to weight search results as desired, for example, placing more emphasis on recently modified rows.

6.11 Multitenant isolation and protection

To protect the overall scalability and performance of the shared database system for all concerned applications, Database.com uses an extensive set of governors and resource limits associated with code execution. The execution of a code script is monitored and limited in how much CPU time it can use, how much memory it can consume, how many queries and DML statements it can execute, how many math calculations it can perform, how many outbound Web services calls it can make, and much more. Individual queries that the optimizer regards as too expensive to execute throw an exception to the caller [9].

Before an organization can transition a new application from development to production status, salesforce.com requires unit tests that validate the functionality of the application’s Database.com code routines. Salesforce.com executes submitted unit tests in Database.com’s sandbox development environment to ascertain if the application code will adversely affect the performance and scalability of the multitenant population at large.

Once an application's code is certified for production by salesforce.com, the deployment process copies all the application's metadata into a production Database.com instance and reruns the corresponding unit tests.

After a production application is live, the performance profiler automatically analyzes and provides associated feedback to administrators. Performance analysis reports include information about slow queries, data manipulations, and sub-routines that you can review and use to tune application functionality.


6.12 Deletes, undeletes

When an app deletes a record from an object, Database.com simply marks the row for deletion. Selected rows can be restored from the Recycle Bin for up to 30 days before they are permanently removed. The total number of records that is maintained for an organization is limited based on the storage limits for that organization. The Recycle Bin also stores dropped fields and their data until an organization permanently deletes them or 45 days have elapsed, whichever happens first. Until that time, the entire field and all its data are available for restoration [9].

6.13 Backup

Database.com uses a variety of methods to ensure that organizations do not experience any data loss. Every transaction is stored to RAID disks in real time with archive mode enabled, allowing the database to recover all transactions prior to any system failure. Every night all data is backed up to a separate backup server and automated tape storage. The backup tapes are cloned as an additional precautionary measure, and the cloned tapes are transported to an off-site, fireproof vault twice a month [8].

6.14 Pricing

Database.com pricing is based on the number of users, records, and transactions per month. Registration of a new account is free and includes:

 3 Standard Users

 3 Administration Users

 100,000 records in the database

 50,000 Transactions per month

Additional storage and capacity can be purchased at any time with no downtime.


7. Microsoft’s SQL AZURE

Microsoft SQL Azure Database is a cloud-based relational database service that is built on SQL Server technologies and runs in Microsoft data centers on hardware that is owned, hosted, and maintained by Microsoft.

SQL Azure is probably the most fully-featured relational database available in the cloud. It is based on the SQL Server standalone database but the way data is managed and stored in SQL Azure is significantly different.

Similar to an instance of SQL Server, SQL Azure Database exposes a tabular data stream (TDS) interface for Transact-SQL-based database access. This allows your database applications to use SQL Azure Database in the same way that they use SQL Server. Because SQL Azure Database is a service, administration in SQL Azure Database is slightly different.

Unlike administration for an on-premise instance of SQL Server, SQL Azure Database abstracts the logical administration from the physical administration. Users continue to administer databases, logins, users, and roles, but Microsoft administers the physical hardware such as hard drives, servers, and storage. This approach helps SQL Azure Database provide a large-scale multitenant database service that offers enterprise-class availability, scalability, security, and self-healing [11].

7.1 Subscriptions

To use SQL Azure, a Windows Azure platform account is required. This account allows access to all the Windows Azure-related services, such as Windows Azure, Windows Azure AppFabric, and SQL Azure. The Windows Azure platform account is used to set up and manage subscriptions and to bill for consumption of any of the Windows Azure services, including SQL Azure; running SQL Azure does not require Windows Azure. With the Windows Azure platform account, the Windows Azure Platform Management portal can be used to create SQL Azure servers, databases, and the associated administrator accounts [11].

Each subscription allows one instance of SQL Server to be defined, which will initially include only a master database. For each server, firewall settings have to be configured to determine which connections will be allowed access.

7.2 Databases

Each SQL Azure server always includes a master database. Up to 149 additional databases can be created for each SQL Azure server. Microsoft offers two editions of SQL Azure databases, Web and Business; when a database is created using the Windows Azure Platform Management portal, the maximum size specified determines the edition. A Web Edition database can have a maximum size of 1 GB or 5 GB. A Business Edition database can have a maximum size of up to 150 GB of data, in 10 GB increments up to 50 GB and then in 50 GB increments [11][12]. If the size of the database reaches its limit, it is not possible to insert data, update data, or create new database objects. However, reading and deleting data, truncating tables, dropping tables and indexes, and rebuilding indexes are still possible.

The SQL Azure data access model does not support cross-database queries; in the current version a connection is made to a single database. If data from another database is needed, a new connection must be created [11].

7.3 Security and Access to a SQL Azure Database

Most security issues for SQL Azure databases are managed by Microsoft within the SQL Azure service, with very little setup required by the users. A user must have a valid login and password in order to connect to a SQL Azure database. Because SQL Azure supports only standard security, each login must be explicitly created. In addition, the firewall can be configured on each SQL Azure server to allow only traffic from specified IP addresses to access the SQL Azure server. This helps greatly reduce the chance of a denial-of-service (DoS) attack. All communications between clients and SQL Azure must be SSL encrypted, and clients should always connect with Encrypt = True to ensure that there is no risk of man-in-the-middle attacks. DoS attacks are further reduced by a service called DoSGuard that actively tracks failed logins from IP addresses; if it notices too many failed logins from the same IP address within a period of time, the IP address is blocked from accessing any resources in the service [11].
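For illustration, the following minimal sketch connects to a hypothetical SQL Azure database from Python over an encrypted connection using pyodbc; the server name, database, credentials, and ODBC driver name are placeholders and depend on the client environment.

# Minimal sketch (assumed client environment): connecting to a hypothetical
# SQL Azure database over an encrypted connection with pyodbc. Server name,
# database, credentials, and the ODBC driver are placeholders.
import pyodbc

connection = pyodbc.connect(
    "Driver={SQL Server Native Client 10.0};"
    "Server=tcp:myserver.database.windows.net,1433;"
    "Database=mydb;"
    "Uid=myuser@myserver;"
    "Pwd=myStrongPassword;"
    "Encrypt=yes;"  # always encrypt to avoid man-in-the-middle attacks
)
cursor = connection.cursor()
cursor.execute("SELECT name FROM sys.objects WHERE type = 'U'")  # list user tables
for row in cursor.fetchall():
    print(row.name)
connection.close()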

The security model within a database is identical to that in SQL Server. Users are created and mapped to login names. Users can be assigned to roles, and users can be granted permissions. Data in each database is protected from users in other databases because the connections from the client application are established directly to the connecting user’s database.

7.4 SQL Azure architecture

Each SQL Azure database is associated with its own subscription. From the subscriber’s perspective, SQL Azure provides logical databases for application data storage. In reality, each subscriber’s data is replicated across three SQL Server databases that are distributed across three physical servers in a single data center. Many subscribers may share the same physical database, but the data is presented to each subscriber through a logical database that abstracts the physical storage architecture and uses automatic load balancing and connection routing to access the data. The logical database that the subscriber creates and uses for database storage is referred to as a SQL Azure database [11].

7.5 Logical Databases on a SQL Azure Server

SQL Azure subscribers access the actual databases, which are stored on multiple machines in the data center, through the logical server. The SQL Azure Gateway service acts as a proxy, forwarding the Tabular Data Stream (TDS) requests to the logical server. It also acts as a security boundary, providing login validation, enforcing the firewall, and protecting the instances of SQL Server behind the gateway against denial-of-service attacks. The Gateway is composed of multiple computers, each of which accepts connections from clients, validates the connection information and then passes on the TDS to the appropriate physical server, based on the database name specified in the connection. Figure 7 shows the physical architecture represented by the single logical server.

Figure 7 A logical server and its databases distributed across machines in the data center [11]

The machines with the SQL Server instances are called data nodes. Each data node contains a single SQL Server instance, and each instance has a single user database, divided into partitions. Each partition contains one SQL Azure client database, either a primary or a secondary replica. Each database hosted in the SQL Azure data center has three replicas: one primary replica and two secondary replicas. All reads and writes go through the primary replica, and any changes are replicated to the secondary replicas asynchronously. The replicas are the central means of providing high availability for SQL Azure databases. Partitions belonging to other SQL Azure databases within the same SQL Server instances in the data center are completely invisible and inaccessible to other subscribers [11].

For SQL Azure databases every commit needs to be a quorum commit. That is, the primary replica and at least one of the secondary replicas must confirm that the log records have been written before the transaction is considered to be committed. Each data node machine hosts a set of processes referred to as the fabric. The fabric processes perform the following tasks:

 Failure detection: notes when a primary or secondary replica becomes unavailable so that the Reconfiguration Agent can be triggered

 Reconfiguration Agent: manages the re-establishment of primary or secondary replicas after a node failure


 PM (Partition Manager) Location Resolution: allows messages to be sent to the Partition Manager

 Engine Throttling: ensures that one logical server does not use a disproportionate amount of the node’s resources, or exceed its physical limits

 Topology: manages the machines in a cluster as a logical ring, so that each machine has two neighbors that can detect when the machine goes down

The machines in the data center are all commodity machines with components of low-to-medium quality and low-to-medium performance capacity. The low cost and the easily available configuration make it easy to quickly replace machines in case of a failure condition. In addition, Windows Azure machines use the same commodity hardware, so all machines in the data center, whether used for SQL Azure or for Windows Azure, are interchangeable.

In Figure 7, the logical server contains three databases: DB1, DB3, and DB4. The primary replica for DB1 is on Machine 6 and the secondary replicas are on Machine 4 and Machine 5. For DB3, the primary replica is on Machine 4, and the secondary replicas are on Machine 5 and on another machine not shown in this figure. For DB4, the primary replica is on Machine 5, and the secondary replicas are on Machine 6 and on another machine not shown in this figure. Note that this diagram is a simplification. Most production Microsoft SQL Azure data centers have hundreds of machines with hundreds of actual instances of SQL Server to host the SQL Azure replicas, so it is extremely unlikely that if multiple SQL Azure databases have their primary replicas on the same machine, their secondary replicas will also share a machine [11].

The physical distribution of databases that all are part of one logical instance of SQL Server means that each connection is tied to a single database, not a single instance of SQL Server.

7.6 Network Topology

Four distinct layers of abstraction work together to provide the logical database for the subscriber's application to use: the client layer, the services layer, the platform layer, and the infrastructure layer. Figure 8 illustrates the relationship between these four layers.

The client layer resides closest to the application, and it is used by the application to communicate directly with SQL Azure. The client layer can reside on-premises in a data center, or it can be hosted in Windows Azure. Every protocol that can generate TDS over the wire is supported. Because SQL Azure provides the same TDS interface as SQL Server, known and familiar tools and libraries can be used to build client applications for data that is in the cloud.

The infrastructure layer represents the IT administration of the physical hardware and operating systems that support the services layer.


Figure 8 Four layers of abstraction provide the SQL Azure logical database for a client application to use [11]


7.7 High Availability with SQL Azure

The goal for Microsoft SQL Azure is to maintain 99.9 percent availability for the subscribers' databases. As stated earlier, this goal is achieved through the use of commodity hardware that can be quickly and easily replaced in the case of machine or drive failure, and through the management of the replicas (one primary and two secondary) for each SQL Azure database [12].

7.8 Failure Detection

Management in the data centers needs to detect not only a complete failure of a machine, but also conditions where machines are slowly degrading and communication with them is affected. The concept of quorum commit, discussed earlier, addresses these conditions. First, a transaction is not considered committed unless the primary replica and at least one secondary replica confirm that the transaction log records were successfully written to disk. Second, because both a primary replica and a secondary replica must report success, small failures that might not prevent a transaction from committing but that might point to a growing problem can be detected [11].

7.9 Reconfiguration

The process of replacing failed replicas is called reconfiguration. Reconfiguration can be required due to failed hardware, an operating system crash, or a problem with the instance of SQL Server running on the node in the data center. Reconfiguration can also be necessary when an upgrade is performed, whether of the operating system, SQL Server, or SQL Azure.

All nodes are monitored by six peers, each on a different rack than the failed machine. The peers are referred to as neighbors. A failure is reported by one of the neighbors of the failed node, and the process of reconfiguration is carried out for each database that has a replica on the failed node. Because each machine holds replicas of hundreds of SQL Azure databases (some primary replicas and some secondary replicas), if a node fails, the reconfiguration operations are performed hundreds of times. There is no prioritization in handling the hundreds of failures when a node fails; the Partition Manager randomly selects a failed replica to handle, and when it is done with that one, it chooses another, until all of the replica failures have been dealt with.

If a node goes down because of a reboot, that is considered a clean failure, because the neighbors receive a clear exception message.

Another possibility is that a machine stops responding for an unknown reason, and an ambiguous failure is detected. In this case, an arbitrator process determines whether the node is really down.

Although this discussion centers on the failure of a single replica, it is really the failure of a node that is detected and dealt with. A node contains an entire SQL Server instance with multiple partitions containing replicas from up to 650 different databases. Some of the replicas will be primary and some will be secondary. When a node fails, the processes described earlier are performed for each affected database. That is, for some of the databases the primary replica fails, and the arbitrator chooses a new primary replica from the existing secondary replicas; for other databases a secondary replica fails, and a new secondary replica is created.

The majority of the replicas of any SQL Azure database must confirm the commit. At this time, user databases maintain three replicas, so a quorum commit would require two of the replicas to acknowledge the transaction. A metadata store, which is part of the Gateway components in the data centers, maintains five replicas and so needs three confirmations to satisfy a quorum commit. The master cluster, which maintains seven replicas, needs four of them to confirm a transaction. However, for the master cluster, even if all seven replicas fail, the information is recoverable, because mechanisms are in place to rebuild the master cluster automatically in case of such a massive failure [11].

7.10 Availability Guarantees

As mentioned earlier, the goal for Microsoft SQL Azure is to maintain 99.9 percent availability. Because of the way that database replicas are distributed across multiple servers and the efficient algorithms for promoting secondary replicas to primary, up to 15 percent of the machines in the data center can be down and the availability can still be guaranteed [11].

7.11 Scalability with SQL Azure

As mentioned earlier, one of the biggest benefits of hosting databases in the cloud is the built-in scalability. With SQL Azure, as with most cloud database platforms, you add more databases only when and if you need them, and if the need is only temporary, you can then drop the unneeded databases. There are two components within SQL Azure that allow this scalability by continuously monitoring the load on each node. One component is Engine Throttling, which ensures that the server does not get overloaded. The other component is the Load Balancer, which ensures that a server is not continuously in the throttled state. In this section, we look at these two components and discuss how engine throttling applies when predefined limits are reached and how load balancing works as the number of hosted databases increases. A third technique for achieving greater scalability and performance is Federations [31], used in SQL Azure. One or more tables within a database are split by row and partitioned across multiple databases (federation members). This type of horizontal partitioning is often referred to as 'sharding'. The primary scenarios in which this is useful are where you need to achieve scale or performance, or to manage capacity [11].
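To illustrate the idea behind federations ('sharding'), the following is a conceptual sketch only: it shows how rows can be routed to member databases by a federation key. It does not reflect SQL Azure's actual Federations T-SQL syntax; the member database names and the modulo-based routing rule are assumptions made purely for the example.

# Conceptual sketch of horizontal partitioning ('sharding'): rows of one logical
# table are routed to member databases by a federation key. This is NOT SQL
# Azure's Federations syntax; member names and the routing rule are invented
# for illustration only.
MEMBER_DATABASES = [
    "Database=orders_member_0",
    "Database=orders_member_1",
    "Database=orders_member_2",
]

def member_for(customer_id):
    """Return the member database holding rows for this federation key."""
    return MEMBER_DATABASES[customer_id % len(MEMBER_DATABASES)]

# All reads and writes for customer 42 are sent to the same member database.
print(member_for(42))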

7.12 Throttling

Because of the multitenant use of each SQL Server in the data center, it is possible that one subscriber's application could render the entire instance of SQL Server ineffective by imposing heavy loads. For example, under full recovery mode, inserting lots of large rows, especially ones containing large objects, can fill up the transaction log and eventually the drive that the transaction log resides on. In addition, each instance of SQL Server in the data center shares the machine with other critical system processes that cannot be starved, most relevantly the fabric process that monitors the health of the system.

To keep a data center server's resources from being overloaded and jeopardizing the health of the entire machine, the load on each machine is monitored by the Engine Throttling component. In addition, each database replica is monitored to make sure that statistics such as log size, log write duration, CPU usage, the actual physical database size, and the SQL Azure user database size are all below target limits. If the limits are exceeded, the result can be that a SQL Azure database rejects reads or writes for 10 seconds at a time. Occasionally, violation of resource limits may result in the SQL Azure database permanently rejecting reads and writes (depending on the resource type in question) [11].

7.13 Load Balancer

At this time, although there are availability guarantees with SQL Azure, there are no performance guarantees. Part of the reason for this is the multitenant problem: many subscribers with their own SQL Azure databases share the same instance of SQL Server and the same computer, and it is impossible to predict the workload that each subscriber’s connections will be requesting. SQL Azure provides load balancing services that evaluate the load on each machine in the data center. When a new SQL Azure database is added to the cluster, the Load Balancer determines the locations of the new primary and secondary replicas based on the current load on the machines.

If one machine gets loaded too heavily, the Load Balancer can move a primary replica to a machine that is less loaded [11].

7.14 SQL Azure Management

Because SQL Azure databases are hosted within larger SQL Server instances on machines in the data centers, the management work that needs to be done is very limited. However, some maintenance tasks are still necessary. All physical aspects of managing the databases are handled in the data center by Microsoft, and all upgrades are handled in the data center one replica at a time. The user is responsible for troubleshooting poorly performing queries and concurrency problems, such as blocking. Just as in SQL Server, some of the main tools available for troubleshooting are the dynamic management views (DMVs) [11].

7.15 Pricing in SQL Azure

Billing in SQL Azure is per database, based on usage and database edition. This allows an organization to start with a small investment and add space as the business grows. SQL Azure provides two different database editions, Business Edition and Web Edition. SQL Azure edition features apply to the individual database; different database editions can be mixed and matched within the same SQL Azure server.

Both editions offer scalability, automated high availability, and self-provisioning.

 The Web Edition Database is suited for small Web applications and workgroup or departmental applications. This edition supports a database with a maximum size of 1 or 5 GB of data.

 The Business Edition Database is suited for independent software vendors (ISVs), line- of- business (LOB) applications, and enterprise applications. This edition supports a database of up to 150 GB of data, in 10GB increments up to 50GB, and then 50 GB increments.

Both editions charge an additional bandwidth-based fee when the data transfer includes a client outside the Windows Azure platform or outside the region of the SQL Azure database.

You specify the edition and maximum size of the database when you create it; you can also change the edition and maximum size after creation. The billing will be based on the new edition type (and the peak size the database reaches, daily) [13].

Microsoft charges a monthly fee for each SQL Azure user database. The database fee is amortized over the month and charged daily. The daily fee depends on the peak size that each database reached that day, the edition of each database, and the number of databases you have. A 10 GB multiplier is used for pricing Business Edition databases and a 1 GB or 5 GB multiplier is used for pricing Web Edition databases. Users pay for the databases they have, for the days they have them [13].
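As an illustration of how such amortized billing works, the sketch below computes a daily charge from a hypothetical monthly rate; the rate and the 1 GB Web Edition multiplier are assumed values, not Microsoft's actual prices.

# Illustrative arithmetic only: amortizing a monthly per-database fee into a
# daily charge based on the peak size the database reached that day. The
# monthly rate and the 1 GB multiplier are assumed values.
import math

MONTHLY_FEE_PER_UNIT = 9.99   # hypothetical monthly fee per billing unit
DAYS_IN_MONTH = 31

def daily_charge(peak_size_gb, multiplier_gb=1.0):
    units = math.ceil(peak_size_gb / multiplier_gb)  # round up to the next unit
    return units * MONTHLY_FEE_PER_UNIT / DAYS_IN_MONTH

print(round(daily_charge(0.7), 2))  # a 0.7 GB peak is billed as one 1 GB unit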

Bandwidth used between SQL Azure and Windows Azure or Windows Azure AppFabric is free within the same sub-region or data center.


8. Amazon WebServices

Amazon is another company that offers a relational database service as part of their Amazon Web Services. In the next section I will first cover the Amazon relational database service and later give an overview of their NoSQL databases, Amazon SimpleDB and DynamoDB, and other NoSQL solutions currently available.

8.1 Amazon Relational Database Service (Amazon RDS)

Amazon Relational Database Service (Amazon RDS) is a web service that can operate, and to some level scale, a relational database in the cloud. It provides cost-efficient and resizable capacity while automating administration tasks. Amazon RDS gives users access to the capabilities of a MySQL or Oracle database running on their own Amazon RDS database instance. This gives the advantage that code and applications that use an on-premises MySQL or Oracle database can be easily migrated to Amazon RDS.

8.2 Amazon RDS Architecture/Features

Amazon RDS takes a different approach than Database.com and SQL Azure. It offers the full capabilities of a MySQL or Oracle database running on a separate database instance. The features provided by Amazon RDS depend on the selected DB engine. In general, it offers:

 Pre-configured Parameters – DB Instances are pre-configured with a sensible set of parameters and settings appropriate for the DB Instance class that has been selected. It gives the possibility to launch a MySQL or Oracle DB Instance and connect an application without additional configuration.

 Monitoring and Metrics – Amazon RDS provides Amazon CloudWatch metrics for the DB Instance deployments. AWS Management Console can be used to view key operational metrics for the DB Instance deployments, including compute/memory/storage capacity utilization, I/O activity, and DB Instance connections.

 Automatic Software Patching – Amazon RDS will make sure that the relational database software stays up-to-date with the latest patches.

 Automated Backups – Turned on by default, the automated backup feature of Amazon RDS enables point-in-time recovery for the DB Instance. Amazon RDS will back up the database and transaction logs and store both for a user-specified retention period. This allows restores of the DB Instance to any second during the retention period, up to the last five minutes. The automatic backup retention period can be configured up to thirty-five days.


 DB Snapshots – DB Snapshots are actually user-initiated backups of the DB Instance. These full database backups will be stored by Amazon RDS until they are explicitly deleted. Users can also create a new DB Instance from a DB Snapshot.

 Isolation and Security – Using Amazon VPC, it is possible to isolate DB Instances in their own virtual network and connect to an existing IT infrastructure using an industry-standard encrypted IPsec VPN. In addition, for both MySQL and Oracle, access to the DB Instances can be controlled using database security groups (DB Security Groups). A DB Security Group acts like a firewall controlling network access to the DB Instance. By default, network access to the DB Instances is turned off. For applications to access a DB Instance, the DB Security Group must be configured to allow access from EC2 instances with specific EC2 Security Group membership or IP ranges [14].
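The provisioning and snapshot features described above can also be driven programmatically. Below is a minimal sketch using the AWS SDK for Python (boto3); the instance identifier, credentials, instance class, and region are placeholder values chosen for illustration, not values prescribed by Amazon RDS.

# Minimal sketch using the AWS SDK for Python (boto3); identifiers, credentials,
# region, and the instance class are placeholders chosen for illustration.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Launch a small MySQL DB Instance with 20 GB of storage.
rds.create_db_instance(
    DBInstanceIdentifier="mydbinstance",
    DBInstanceClass="db.m1.small",
    Engine="mysql",
    AllocatedStorage=20,
    MasterUsername="admin",
    MasterUserPassword="change-me",
)

# Take a user-initiated DB Snapshot; it is kept until explicitly deleted.
rds.create_db_snapshot(
    DBSnapshotIdentifier="mydbinstance-snap-1",
    DBInstanceIdentifier="mydbinstance",
)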

8.3 Scalability with Amazon RDS

Amazon RDS gives the flexibility to scale the compute resources or storage capacity associated with the relational database instance by using the Amazon RDS APIs or through the AWS Management Console. The compute and memory resources can be scaled up or down using predefined DB Instance classes. Currently Amazon offers five supported DB Instance classes:

 Small DB Instance: 1.7 GB memory, 1 ECU (1 virtual core with 1 ECU), 64-bit platform, Moderate I/O Capacity

 Large DB Instance: 7.5 GB memory, 4 ECUs (2 virtual cores with 2 ECUs each), 64-bit platform, High I/O Capacity

 High-Memory Extra Large DB Instance: 17.1 GB memory, 6.5 ECUs (2 virtual cores with 3.25 ECUs each), 64-bit platform, High I/O Capacity

 High-Memory Double Extra Large DB Instance: 34 GB of memory, 13 ECUs (4 virtual cores with 3.25 ECUs each), 64-bit platform, High I/O Capacity

 High-Memory Quadruple Extra Large DB Instance: 68 GB of memory, 26 ECUs (8 virtual cores with 3.25 ECUs each), 64-bit platform, High I/O Capacity

For each DB Instance class, it is possible to select from 5GB to 1TB of associated storage capacity. Additional storage can be provisioned on the fly with no downtime. One ECU provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor [14].

2 Amazon Virtual Private Cloud (Amazon VPC) - an isolated section of the Amazon Web Services (AWS) Cloud where AWS resources can be launched in a virtual network that you define, offering complete control over the virtual networking environment, including selection of your own IP address range, creation of subnets, and configuration of route tables and network gateways.

3 Amazon Elastic Compute Cloud (EC2) - a web service that provides resizable compute capacity in the cloud.


8.4 High Availability

Amazon RDS runs on the same highly reliable infrastructure as the other Amazon Web Services. It has multiple features that enhance availability for critical production databases. Currently it offers automatic host replacement and replication.

With the automatic host replacement, Amazon RDS will automatically replace the compute instance powering the deployment in the event of a hardware failure.

Replication is at this time supported only for MySQL, although it is planned to be available for Oracle in the near future. For MySQL, Amazon RDS provides two replication features, Multi-AZ deployments and Read Replicas.

With Multi-AZ deployments Amazon RDS will automatically provision and manage a “standby” replica in a different Availability Zone (independent infrastructure in a physically separate location). Database updates are made concurrently on the primary and standby resources to prevent replication lag. In the event of planned database maintenance, DB Instance failure, or an Availability Zone failure, Amazon RDS will automatically failover to the up-to-date standby so that database operations can resume quickly without administrative intervention. Prior to failover you cannot directly access the standby, and it cannot be used to serve read traffic.

Read Replicas make it easy to elastically scale out beyond the capacity constraints of a single DB Instance for read-heavy database workloads. It is possible to create one or more replicas of a given source DB Instance and serve high-volume application read traffic from multiple copies of the data, thereby increasing aggregate read throughput. Amazon RDS uses MySQL’s native replication to propagate changes made to a source DB Instance to any associated Read Replicas. Since Read Replicas leverage standard MySQL replication, they may fall behind their sources, and they are therefore not intended to be used for enhancing fault tolerance in the event of source DB Instance failure or Availability Zone failure [14].
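Both replication features can be enabled programmatically. The following is a minimal sketch using boto3; the instance identifiers and region are hypothetical values used only for the example.

# Minimal sketch (boto3): enabling a Multi-AZ standby and adding a Read Replica
# for a MySQL source instance. Identifiers and the region are hypothetical.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Convert the source instance to a Multi-AZ deployment (synchronous standby).
rds.modify_db_instance(
    DBInstanceIdentifier="mydbinstance",
    MultiAZ=True,
    ApplyImmediately=True,
)

# Create a Read Replica to serve read-heavy traffic.
rds.create_db_instance_read_replica(
    DBInstanceIdentifier="mydbinstance-replica-1",
    SourceDBInstanceIdentifier="mydbinstance",
)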

8.5 Pricing

Same as with the other previously mentioned DBMS services, Amazon RDS pricing is based on usage and the DB Instance class. It is possible to choose between hourly On-Demand pricing with no up-front or long-term commitments, and a Reserved pricing option.

 On-Demand DB Instances let users pay for compute capacity by the hour with no long-term commitments. This frees users from the costs and complexities of planning, purchasing, and maintaining hardware, and transforms what are commonly large fixed costs into much smaller variable costs.

 Reserved DB Instances give users the option to make a low, one-time payment for each DB Instance they want to reserve and in turn receive a discount on the hourly usage charge for that DB Instance. Depending on usage, it is possible to choose between three Reserved DB Instance types (Light, Medium, and Heavy Utilization) and receive anywhere between 30% and 55% discount over On-Demand prices. Based on the application workload and the amount of time it will run, Amazon RDS Reserved Instances may provide substantial savings over running On-Demand DB Instances.

The prices differ depending on whether a standard or Multi-AZ deployment is used. For both standard and Multi-AZ deployments, pricing is per DB Instance-hour consumed, from the time a DB Instance is launched until it is terminated. There is no additional charge for backup storage of up to 100% of the provisioned database storage for an active DB Instance. After the DB Instance is terminated, backup storage is billed per GB-month, and additional backup storage beyond that is also billable.

Data transferred between Amazon RDS and Amazon EC2 Instances in the same Availability Zone and Data transferred between Availability Zones for replication of Multi-AZ deployments is free.

Amazon RDS DB Instances outside VPC: For data transferred between an Amazon EC2 instance and Amazon RDS DB Instance in different Availability Zones of the same Region, there is no Data Transfer charge for traffic in or out of the Amazon RDS DB Instance. Charges apply only for the Data Transfer in or out of the Amazon EC2 instance, and standard Amazon EC2 Regional Data Transfer charges apply.

Amazon RDS DB Instances inside VPC: For data transferred between an Amazon EC2 instance and Amazon RDS DB Instance in different Availability Zones of the same Region, Amazon EC2 Regional Data Transfer charges apply on both sides of transfer.

Data transferred between Amazon RDS and AWS services in different regions is charged as Internet Data Transfer on both sides of the transfer.

Additionally, for the Oracle database there are two licensing models, "License Included" and "Bring-Your-Own-License (BYOL)". In the "License Included" service model, you do not need separately purchased Oracle licenses; the Oracle Database software has been licensed by AWS.

Bring-Your-Own-License is suited for users that already own Oracle Database licenses. The “BYOL” model is designed for customers who prefer to use existing Oracle database licenses or purchase new licenses directly from Oracle [14].

9. Google Cloud SQL

Google Cloud SQL is a MySQL database in Google's cloud. It has all the capabilities and functionality of MySQL. Google Cloud SQL is currently available for applications that are written in Java or Python. It can also be accessed from a command-line tool.

Like all the other database-as-a-service offerings, Google Cloud SQL is fully managed; patch management, replication, and other database management chores are handled by Google.


High availability is offered by built-in automatic replication across multiple geographic regions, so the service is available and data is preserved even when a whole data center becomes unavailable. Users can create databases and choose synchronous or asynchronous replication in data centers in the EU or the US.

Google Cloud SQL is tightly integrated with Google App Engine and other Google services, which allows users to work across multiple products and get more value out of their data. The database instances are not restricted to use by only one application in App Engine, allowing multiple applications to use the same instance and database. Data can be imported into the database using mysqldump files. This allows users to easily move data, applications, and services in and out of the cloud.

As an initial trial, Google offers instances with a small amount of RAM and 0.5 GB of database storage. Additional RAM and storage can be purchased, up to 16 GB of RAM and 100 GB of storage [15].

9.1 Pricing

Google offers two billing plans for Google Cloud SQL: Packages or Per Use. The package offering is shown in the table below:

Tier | RAM   | Included Storage | Included I/O Per Day
D1   | 0.5GB | 1GB              | 850K
D2   | 1GB   | 2GB              | 1.7M
D4   | 2GB   | 5GB              | 4M
D8   | 4GB   | 10GB             | 8M
D16  | 8GB   | 10GB             | 16M
D32  | 16GB  | 10GB             | 32M

Table 2 Google Cloud SQL Packages

Each database instance is allocated the RAM shown above, along with an appropriate amount of CPU. Storage is measured as the file space used by the MySQL database. Bills are issued monthly, based on the number of days during which the database existed. Google does not charge for the storage of backups created using the scheduled backup service. The number of I/O requests to storage made by a database instance depends on the queries, workload, and data set. Cloud SQL caches data in memory to serve queries efficiently and to minimize the number of I/O requests. Use of storage or I/O over the included quota is charged at the Per Use rate. The maximum storage for any instance is currently 100GB.


With the Per Use plan, the same tiers as with the packages are offered, with the difference that a database instance is charged for periods of continuous use. Storage is charged per GB in hourly units (whether the database is active or not), measured as the largest number of bytes during that one-hour period, rounded up to the nearest GB, and I/O is charged by request count, rounded to the nearest million.

Network use is charged under both the Packages and Per Use billing plans. Only outbound external traffic is charged; network usage between Google App Engine applications and Cloud SQL is not charged [15].

10. Summary of RDBMS DBaaS and common considerations

As we can see from the previous sections, Relational Database as a Service (DBaaS) is currently found in the public marketplace in two broad forms: online general-purpose relational databases, and the ability to operate virtual machine images loaded with common databases such as MySQL, Oracle, or similar commercial databases.

Database.com offers a relational multitenant database specially built for the cloud using its metadata-driven architecture. Microsoft SQL Azure offers a SQL Server-like relational database management system and controls many of the database configuration details, allowing users to focus on the schema, data, and application layer. Amazon RDS provides an implementation of MySQL or Oracle on a virtual machine built and tuned for that purpose, and Google also has Cloud SQL, providing MySQL for its App Engine PaaS.

While all the presented RDBMS DBaaS offerings provide an opportunity to reduce cost, there are many considerations to take into account before moving data to a cloud-based solution. Table 3 presents the main considerations comparison.

Data Sizing - All of the RDBMS DBaaS offerings presented have limits on the size of the data set that can be stored on their systems.

Portability - Portability and adherence to standards is a critical issue for ensuring Continuity of Operations and to mitigate business risk (e.g., a provider going out of business or raising rates). The ability to instantiate a replicated version of the data “off-cloud” or in another cloud offering can provide the business owners with an extra level of assurance that they will not suffer a loss of data. This can be facilitated by standards, such as the use of a standard database query language (SQL).

Transaction Capabilities - Transaction capabilities are an essential feature for databases that need to provide guaranteed reads and writes (ACID).


The comparison covers Salesforce Database.com, Microsoft SQL Azure, Amazon RDS (MySQL or Oracle), and Google Cloud SQL.

Maximum amount of data that can be stored:
 Database.com: limited by the number of records per database; up to 22,300,000 records.
 SQL Azure: 5 GB with a Web Edition database and up to 150 GB with a Business Edition database.
 Amazon RDS: 1 terabyte per database instance.
 Google Cloud SQL: 100 GB per database instance.

Ease of software portability with a locally hosted capability:
 Database.com: Low. Requires the database to be specially built and tested by Salesforce before deployment.
 SQL Azure: High. Most SQL Server features are available in SQL Azure.
 Amazon RDS: High. The MySQL/Oracle instantiation in the cloud is very similar to the locally instantiated version.
 Google Cloud SQL: Medium. The MySQL instance in the cloud is very similar to the local instance, but is accessible only by Google App Engine.

Transaction capabilities:
 Database.com: Yes; SQL Azure: Yes; Amazon RDS: Yes; Google Cloud SQL: Yes.

Configurability and ability to tune databases:
 Database.com: Low. It creates indexes automatically and keeps a record of the most recently accessed records, but does not allow control over this, nor over memory allocation and similar resources.
 SQL Azure: Medium. Can create indexes and stored procedures, but no control over memory allocation or similar resources.
 Amazon RDS: High. MySQL/Oracle instantiation in the cloud on a virtual machine.
 Google Cloud SQL: Low. Automatically tuned.

Database accessible as a "stand-alone" offering:
 Database.com: Yes; SQL Azure: Yes; Amazon RDS: Yes; Google Cloud SQL: No, requires a Google App Engine application layer.

Possibility to designate where the data is stored (e.g. region or data center):
 Database.com: No; SQL Azure: Yes; Amazon RDS: Yes; Google Cloud SQL: Yes.

Replication:
 Database.com: No; SQL Azure: Yes; Amazon RDS: Yes; Google Cloud SQL: Yes.

Table 3 Main Considerations Comparison


Configurability - DBaaS offerings may provide capabilities that reduce the number of configuration options available to database administrators. For some applications, if more configuration options are managed by the platform owner rather than the customer's database administrator, this can be a benefit, and it can reduce the effort expended to maintain the database. For others, the inability to tune and control all aspects of the database, such as memory management, can be a limiting constraint in obtaining performance.

Database Accessibility - Most DBaaS offerings provide a predefined set of connectivity mechanisms that will directly impact adoption and use. There are three general approaches. First, most RDBMS offerings are typically accessible through industry-standard database drivers such as Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC). These drivers allow applications external to the service to access the database through a standard connection, facilitating interoperability. Second, services typically provide interfaces that use standards-based, Service-Oriented Architecture (SOA) protocols, such as SOAP or REST, with Hypertext Transfer Protocol (HTTP) and a vendor-specific API definition. These services may provide software development kits in common source-code languages to facilitate adoption. Third, some databases may be restricted to accessing data through software running in the vendor's ecosystem. This approach may increase security, but it also significantly limits portability and interoperability.

Availability and Replication - the ability to ensure that data is available and not lost will be a key consideration. Ensuring access to data can come through enforcement of service-level agreements (SLA) metrics such as up time, replication across a cloud provider’s regions, and replication or movement of the data across cloud providers or to the consuming organization’s data center.

 Replication across a cloud provider’s hardware within a region may ameliorate the effects of a localized hardware or software failure.

 Replication across a cloud provider’s geographic regions may ameliorate the effects of a network outage, natural disaster, or other regional event.

 Replication across multiple cloud providers or back to the consuming organization’s IT infrastructure may provide the most continuity of operation benefit through full geographic and IT stack independence.

Many providers such as Microsoft and Amazon offer replication of the data across hardware within a specific region as part of a packaged service. Within a given vendor, replication across geographies is usually more expensive and may result in significant data transfer fees.


11. NOSQL

While RDBMS databases are widely deployed and successful, they have shortcomings for some applications that have been filled by the growing use of NoSQL databases. Rather than conforming to SQL standards and providing relational data modeling, NoSQL databases typically offer fewer transactional guarantees than RDBMSs in exchange for greater flexibility and scalability. NoSQL databases tend to be less complex than RDBMSs and scale horizontally across lower-cost hardware. Unlike RDBMSs, which share a common relational data model, several different types of databases, such as column-oriented, key-value, and document-oriented, are considered as “NoSQL” databases. NoSQL databases tend to be used in applications that do not require the same level of data consistency guarantees that RDBMS systems provide but that require throughput levels that would be very expensive for RDBMSs to support.

12. Amazon SimpleDB and DynamoDB

Amazon DynamoDB is a fully managed NoSQL database service. As mentioned in the introduction to DBMSs in the cloud, NoSQL databases are more suitable for situations where applications experience explosive growth and traditional databases would require reworking to distribute their workload across multiple servers.

DynamoDB has been created by taking Amazon’s in-house NoSQL database, Dynamo (incremental scalability, predictable high performance), combining it with the best parts of SimpleDB (ease of administration of a cloud service, consistency, and a table-based data model that is richer than a pure key-value store) and putting it into a form suitable for external use as a service. In the next section I will give a short overview of the Dynamo and SimpleDB.

12.1 Dynamo History

The original Dynamo design was based on a core set of strong distributed systems principles resulting in an ultra-scalable and highly reliable database system. It was developed as a response to the scaling challenges that Amazon.com faced, when direct database access was one of the major bottlenecks in scaling and operating the business. There are many services that only need primary-key access to a data store. For many services, such as those that provide best seller lists, shopping carts, customer preferences, session management, sales rank, and product catalogs, the common pattern of using a relational database would lead to inefficiencies and limit scale and availability. Dynamo provided a simple primary-key-only interface to meet the requirements of these applications [17][18].

Dynamo was targeted mainly at applications that need an "always writeable" data store where no updates are rejected due to failures or concurrent writes. It was built for an infrastructure within a single administrative domain where all nodes are assumed to be trusted. Applications that use Dynamo do not require support for hierarchical namespaces (a norm in many file systems) or complex relational schemas (supported by traditional databases). Dynamo can be characterized as a zero-hop DHT, where each node maintains enough routing information locally to route a request to the appropriate node directly, in order to avoid routing requests through multiple nodes and to meet the needs of latency-sensitive applications that require at least 99.9% of read and write operations to be performed within a few hundred milliseconds [17].

Although Dynamo gave developers a system that met their reliability, performance, and scalability needs, it did nothing to reduce the operational complexity of running large database systems. Since developers were responsible for running their own Dynamo installations, they had to become experts on the various components running in multiple data centers. They also needed to make complex tradeoff decisions between consistency, performance, and reliability. This operational complexity was a barrier that kept them from adopting Dynamo [17].

12.2 Amazon DynamoDB DataModel

Amazon DynamoDB organizes data into tables containing items, and each item has one or more attributes.

Attributes

An attribute is a name-value pair. The name must be a string, but the value can be a string, number, string set, or number set. The following are all examples of attributes:

"ImageID" = 1 "Title" = "flower"

"Tags" = "flower", "jasmine", "white" "Ratings" = 3, 4, 2

Item

A collection of attributes forms an item, and the item is identified by its primary key. An item's attributes are a collection of name-value pairs, in any order. The item attributes can be sparse, unrelated to the attributes of another item in the same table, and are optional (except for the primary key attribute). The table has no schema other than its reliance on the primary key. Items are stored in a table. The primary key uniquely identifies an item for a DynamoDB table. In the following diagram, Figure 9, the ImageID is the attribute designated as the primary key:


Figure 9 Diagram of DynamoDB Data Model [18]

Notice that the table has a name, "my table", but the item does not have a name. The primary key defines the item; the item with primary key "ImageID"=1. [18]

Tables

Tables contain items and organize information into discrete areas. All items in a table have the same primary key scheme. The attribute name (or names) to be used for the primary key is designated when a table is created, and the table requires each item to have a unique primary key value. The first step in writing data to DynamoDB is to create a table and designate a table name with a primary key. The following is a larger table that also uses ImageID as the primary key to identify items. DynamoDB also allows specifying a composite primary key, which enables designating two attributes in a table that collectively form a unique primary index. All items in the table must have both attributes. One serves as a "hash partition attribute" and the other as a "range attribute." For example, there might be a "Status Updates" table with a composite primary key composed of "UserID" (hash attribute, used to partition the workload across multiple servers) and "Time" (range attribute). A query can then be executed to fetch either: 1) a particular item uniquely identified by the combination of UserID and Time values; 2) all of the items for a particular hash "bucket" – in this case UserID; or 3) all of the items for a particular UserID within a particular time range. Range queries against "Time" are only supported when the UserID hash bucket is specified. [18]


Table: My Images

Primary Key ImageID = 1; other attributes: ImageLocation = https://s3.amazonaws.com/bucket/img_1.jpg, Date = 1260653179, Title = flower, Tags = Flower, Jasmine, Width = 1024, Depth = 768

Primary Key ImageID = 2; other attributes: ImageLocation = https://s3.amazonaws.com/bucket/img_2.jpg, Date = 1252617979, Rated = 3, 4, 2, Tags = Work, Seattle, Office, Width = 1024, Depth = 768

Primary Key ImageID = 3; other attributes: ImageLocation = https://s3.amazonaws.com/bucket/img_3.jpg, Date = 1285277179, Price = 10.25, Tags = Seattle, Grocery, Store, Author = you, Camera = phone

Primary Key ImageID = 4; other attributes: ImageLocation = https://s3.amazonaws.com/bucket/img_4.jpg, Date = 1282598779, Title = Hawaii, Author = Joe, Colors = orange, blue, yellow, Tags = beach, blanket, ball

Figure 10 DynamoDB Table
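As an illustration of the composite-key access patterns described above, the following sketch uses boto3 to query a hypothetical "StatusUpdates" table whose hash attribute is UserID and whose range attribute is Time; the table name, key values, and region are assumptions made for the example.

# Minimal sketch (boto3): access patterns for a composite primary key, using
# the hypothetical "StatusUpdates" table from the text (hash attribute UserID,
# range attribute Time). Key values and the region are assumptions.
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("StatusUpdates")

# 1) a single item identified by the full composite key
item = table.get_item(Key={"UserID": "user-123", "Time": 1260653179})

# 3) all items for one UserID within a particular time range
response = table.query(
    KeyConditionExpression=Key("UserID").eq("user-123")
    & Key("Time").between(1260000000, 1261000000)
)
for status_update in response["Items"]:
    print(status_update)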

12.3 Amazon DynamoDB Features

As we said earlier, Amazon DynamoDB is based on the principles of Dynamo, a progenitor of NOSQL, and brings the power of the cloud to the NOSQL database world. It offers high availability, reliability, and incremental scalability, with no limits on dataset size or request throughput for a given table. Like all the previously explained services, DynamoDB is a managed, scalable system that handles all the complexities of scaling, partitioning, and re-partitioning the data over more machine resources to meet the I/O performance requirements. It can scale the resources dedicated to a table to multiple servers spread over multiple Availability Zones, and there are no pre-defined limits to the amount of data each table can store.

In order to achieve high performance all data items are stored on Solid State Drives (SSD). Moreover, by not indexing all attributes, the cost of read and write operations is low as write operations involve updating only the primary key index thereby reducing the latency of both read and write operations.

One of the most important functionalities of DynamoDB is performance predictability. There are many applications that benefit from predictable performance as their workloads scale: online gaming, social graph applications, online advertising, and real-time analytics, to name a few. DynamoDB provides this through "Provisioned Throughput": users can specify the request throughput capacity they require for a given table, and DynamoDB will allocate sufficient resources to the table to predictably achieve this throughput with low-latency performance. Throughput reservations are elastic and can be increased or decreased on demand using the AWS Management Console or the DynamoDB APIs. CloudWatch metrics provide the ability to make informed decisions about the right amount of throughput to dedicate to a particular table.
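As a small illustration of adjusting provisioned throughput on demand, the sketch below uses boto3; the table name and capacity figures are placeholders, not recommended values.

# Minimal sketch (boto3): adjusting a table's provisioned throughput on demand.
# The table name and capacity figures are placeholders, not recommendations.
import boto3

client = boto3.client("dynamodb", region_name="us-east-1")
client.update_table(
    TableName="StatusUpdates",
    ProvisionedThroughput={
        "ReadCapacityUnits": 200,   # strongly consistent 1 KB reads per second
        "WriteCapacityUnits": 100,  # 1 KB writes per second
    },
)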

Amazon DynamoDB also integrates with Amazon Elastic MapReduce (Amazon EMR) which allows businesses to perform complex analytics on their large datasets using a hosted Hadoop framework on AWS. [18]

Some of the ways in which EMR can be used with DynamoDB are as follows:

 Users can analyze data stored in DynamoDB using EMR and store the results of the analysis in S3 while leaving the original data in DynamoDB.

 Users can back up the data from DynamoDB to S3 using EMR.

 Customers can also use Amazon EMR to access data in multiple stores, do complex analysis over this combined dataset, and store the results of this work.

12.4 Amazon SimpleDB

SimpleDB is another NoSQL DBaaS offered by Amazon. The data model used by Amazon SimpleDB makes it easy to store, manage, and query structured data. Developers organize their data set into domains and can run queries across all of the data stored in a particular domain. Domains are collections of items that are described by attribute-value pairs. This can be thought of in terms analogous to concepts in a traditional spreadsheet table. For example, consider how the customer management database shown in the table below would be represented in Amazon SimpleDB. The whole table would be a domain named "customers." Individual customers would be rows in the table, or items in the domain. The contact information would be described by column headers (attributes). Values are in individual cells.


CustomerID | First name | Last name | Street address | City        | State | Zip   | Telephone
123        | Bob        | Smith     | 123 Main St    | Springfield | MO    | 65801 | 222-333-4444
456        | James      | Johnson   | 456 Front St   | Seattle     | WA    | 98104 | 333-444-5555

Figure 11 SimpleDB Table

Amazon SimpleDB differs from tables of traditional databases in important ways. It offers the flexibility to easily go back later and add new attributes that only apply to certain records. For example, to add customers' email addresses in order to enable real-time alerts on order status, it is possible to add new records and any additional attributes to the existing "customers" domain. The resulting domain might look something like this:

CustomerID | First name | Last name | Street address | City        | State | Zip   | Telephone    | Email
123        | Bob        | Smith     | 123 Main St    | Springfield | MO    | 65801 | 222-333-4444 |
456        | James      | Johnson   | 456 Front St   | Seattle     | WA    | 98104 | 333-444-5555 |
789        | Deborah    | Thomas    | 789 Garfield   | New York    | NY    | 10001 | 444-555-6666 | [email protected]

Figure 12 SimpleDB table after adding additional attributes

Domains have a finite capacity in terms of storage (10 GB) and request throughput, which is a considerable scaling limitation. Although it is possible to work around this limitation by partitioning workloads over many domains, this is not simple to implement. SimpleDB also fails to meet the requirement of incremental scalability, which is possible with DynamoDB. Another limitation of SimpleDB is predictability of performance. SimpleDB indexes all attributes for each item stored in a domain. While this simplifies schema design and provides query flexibility, it has a negative impact on the predictability of performance. For example, every database write needs to update not just the basic record, but also all attribute indices (regardless of whether all indices are used for querying). Similarly, since the domain maintains a large number of indices, its working set does not always fit in memory. This impacts the predictability of a domain's read latency, particularly as dataset sizes grow. SimpleDB's original implementation had taken the "eventually consistent" approach to the extreme and presented users with consistency windows that were up to a second in duration. This meant that developers used to a more traditional database solution had trouble adapting to it. The SimpleDB team eventually addressed this issue by enabling users to specify whether a given read operation should be strongly or eventually consistent. A strongly consistent read can potentially incur higher latency and lower read throughput, so it is best to use it only when an application scenario mandates that a read operation absolutely needs to see all writes that received a successful response prior to that read. For all other scenarios the default eventually consistent read yields the best performance. [18]

12.5 Pricing

As with the other services, DynamoDB and SimpleDB keep the pay-only-for-what-you-use model. Pricing is calculated based on provisioned throughput capacity, indexed data storage, and data transfer. When a DynamoDB table is created or updated, the read and write capacity to be reserved is specified, and it is charged hourly based on the reserved capacity. A unit of Write Capacity enables users to perform one write per second for items of up to 1 KB in size. Similarly, a unit of Read Capacity enables users to perform one strongly consistent read per second (or two eventually consistent reads per second) of items of up to 1 KB in size. Amazon DynamoDB is an indexed datastore, and the amount of disk space the data consumes will exceed the raw size of the data uploaded. Amazon DynamoDB measures the size of the billable data by adding up the raw byte size of the uploaded data, plus a per-item storage overhead of 100 bytes to account for indexing. The first 100 MB stored per month are offered free, and after that the price is calculated per GB depending on the region.

As with the other AWS services, there is no additional charge for data transferred between Amazon DynamoDB, SimpleDB, and other Amazon Web Services within the same region. Data transferred across regions (e.g. between Amazon DynamoDB in the US East (Northern Virginia) Region and Amazon EC2 in the EU (Ireland) Region) is charged at Internet Data Transfer rates on both sides of the transfer. Amazon SimpleDB is billed based on machine hour utilization and data transfer, depending on the region where the SimpleDB domains are established. Amazon SimpleDB measures the machine utilization of each request and charges based on the amount of machine capacity used to complete the particular request (SELECT, GET, PUT, etc.), normalized to the hourly capacity of a circa 2007 1.7 GHz Xeon processor. [18]
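The capacity-unit rules above can be turned into a small sizing calculation. The following sketch is illustrative arithmetic only, assuming 1 KB items and the per-unit definitions quoted above; the workload figures are invented for the example.

# Illustrative sizing arithmetic, following the capacity-unit rules quoted above
# and assuming 1 KB items; the workload figures are invented for the example.
import math

def write_units(writes_per_second, item_size_kb=1):
    return writes_per_second * int(math.ceil(item_size_kb))

def read_units(reads_per_second, item_size_kb=1, eventually_consistent=True):
    units = reads_per_second * int(math.ceil(item_size_kb))
    # two eventually consistent reads per second cost one read capacity unit
    return int(math.ceil(units / 2.0)) if eventually_consistent else units

# 500 eventually consistent reads/s and 100 writes/s of 1 KB items:
print(read_units(500), write_units(100))  # -> 250 100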

4 Eventually consistent - given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate through the system eventually, and all the replicas will be consistent.


13. Google Datastore

The Google App Engine Datastore is a schemaless object datastore providing robust, scalable storage mainly targeted at web applications. App Engine's Datastore is built on top of Bigtable. Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many Google projects, such as Google Earth, Google Finance, and web indexing, use Bigtable for storing data.

13.1 Datastore Datamodel

The Datastore is basically a key-value based database. The Datastore holds data objects known as entities. An entity has one or more properties, named values of one of several supported data types: for instance, a property can be a string, an integer, or a reference to another entity. Each entity is identified by its kind, which categorizes the entity for the purpose of queries, and a key that uniquely identifies it within its kind [19, 20]. Entities of the same kind can have different properties, and different entities can have properties with the same name but different value types. The key consists of the following components:

 The entity's kind

 An identifier, which can be either

o a key name string

o an integer numeric ID

 An optional ancestor path locating the entity within the Datastore hierarchy.

Entities in the Datastore form a hierarchically structured space similar to the directory structure of a file system. When an entity is created, it is possible to designate another entity as its parent; the new entity is a child of the parent entity. This creates the ancestor path. [20]

13.2 Queries and indexes

App Engine predefines a simple index on each property of an entity. An App Engine application can define further custom indexes in an index configuration file. Because all queries on App Engine are served by these pre-built indexes, the types of query that can be executed are more restrictive than those allowed on a relational database with SQL [20]. In particular, the following are not supported:

 Join operations

 Inequality filtering on multiple properties


 Filtering of data based on results of a subquery

All the queries in the Datastore are eventually consistent. A typical query includes the following:

 An entity kind to which the query applies

 Zero or more filters based on the entities' property values, keys, and ancestors

 Zero or more sort orders to sequence the results

In addition to retrieving entities from the Datastore directly by their keys, an application can perform a query to retrieve them by the values of their properties [20].

13.3 Transactions

The Datastore can execute multiple operations in a single transaction. By definition, a transaction cannot succeed unless every one of its operations succeeds. If any of the operations fails, the transaction is automatically rolled back. This is especially useful for distributed web applications, where multiple users may be accessing or manipulating the same data at the same time [20].
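A minimal sketch of such a transactional read-modify-write, using the classic App Engine Python db API; the Counter kind and its key name are hypothetical examples, not part of the Datastore itself.

# Minimal sketch of a transactional read-modify-write with the classic App
# Engine Python db API; the Counter kind and its key name are hypothetical.
from google.appengine.ext import db

class Counter(db.Model):
    count = db.IntegerProperty(default=0)

def increment(key):
    counter = db.get(key)    # read inside the transaction
    counter.count += 1
    counter.put()            # committed only if the whole transaction succeeds

key = Counter(key_name="page-views").put()
db.run_in_transaction(increment, key)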

13.4 Scalability

The App Engine Datastore is designed to scale, allowing applications to maintain high performance as they receive more traffic:

 Datastore writes scale by automatically distributing data as necessary.

 Datastore reads scale because the only queries supported are those whose performance scales with the size of the result set (as opposed to the data set). This means that a query whose result set contains 100 entities performs the same whether it searches over a hundred entities or a million. This property is the key reason some types of query are not supported [20].

13.5 High Availability

App Engine's primary data repository is the High Replication Datastore (HRD), in which data is replicated across multiple data centers using a system based on the Paxos algorithm5. This provides a high level of availability for reads and writes [20].

5 Paxos is a family of protocols for solving consensus in a network of unreliable processors. Consensus is the process of agreeing on one result among a group of participants. This problem becomes difficult when the participants or their communication medium may experience failures.


13.6 Data Access

Development on the Datastore is done through Application Programming Interfaces (APIs), which can be accessed from either Python or Java. The App Engine Java SDK provides a low-level Datastore API with simple operations on entities. The SDK also includes implementations of the Java Data Objects (JDO) and Java Persistence API (JPA) interfaces for modeling and persisting data. These standard interfaces include mechanisms for defining classes for data objects and for performing queries [20]. The Python Datastore interface includes a rich data modeling API and a SQL-like query language called GQL [21, 22].

13.7 Quotas and Limits

Google has defined quotas and limits for various aspects of an application's Datastore usage:

• Each call to the Datastore API counts toward the Datastore API Calls quota.

• Data sent to the Datastore by the application counts toward the Data Sent to Datastore API quota.

• Data received by the application from the Datastore counts toward the Data Received from Datastore API quota.

The total amount of data currently stored in the Datastore for the application cannot exceed the Stored Data (billable) quota. This includes all entity properties and keys, as well as the indexes needed to support querying those entities. The following table shows the limits that apply specifically to the use of the Datastore [20]:

Limit | Amount
Maximum entity size | 1 MB
Maximum transaction size | 10 MB
Maximum number of index entries for an entity | 2000
Maximum number of bytes in composite indexes for an entity | 2 MB

Figure 13 Google Datastore Limits [20]


14. MongoLab/MongoDB and Cloudant/Apache CouchDB

Both CouchDB and MongoDB are document-oriented databases with schemaless JSON-style and BSON (Binary JSON) style object data storage [26]. Because they offer similar functionality, I will describe them together and give a short overview of their differences. First, what is a document-oriented database?

14.1 Document oriented database

A document-oriented database or data store does not use tables for storing data. It stores each record as a document with certain characteristics. Documents inside a document-oriented database are similar, in some ways, to records or rows in relational databases, but they are less rigid: they are not required to adhere to a standard schema, nor will they all have the same sections, slots, parts, keys, or the like [24, 25]. For example, here is a document:

{FirstName: "Bob", Address: "5 Oak St.", Hobby: "sailing"}

Another document could be:

{FirstName: "Jonathan", Address: "15 Wanamassa Point Road", Children: [{Name: "Michael", Age: 10}, {Name: "Jennifer", Age: 8}, {Name: "Samantha", Age: 5}, {Name: "Elena", Age: 2}]}

Both documents have some similar information and some different. Unlike in a relational database, where each record would have the same set of fields and unused fields might be kept empty, there are no empty 'fields' in either document (record) in this case. This system allows new information to be added without requiring an explicit statement that other pieces of information are left out. The benefit is that, when a document-oriented database is used to store a large number of records, a change in the number or type of fields does not require altering a table; all that is needed is to insert new documents with the new structure, and they automatically become part of the current datastore.

Documents are addressed in the database via a unique key that represents the document. Often, this key is a simple string; in some cases, it is a URI or path. Regardless, this key can be used to retrieve the document from the database, and the database typically maintains an index on the key so that document retrieval is fast. Another defining characteristic of a document-oriented database is that, beyond the simple key-document (or key-value) lookup used to retrieve a document, the database offers an API or query language that allows retrieving documents based on their contents. For example, a query might return all the documents with a certain field set to a certain value. The set of query APIs or query language features available, as well as the expected performance of the queries, varies significantly from one implementation to the next.

Implementations offer a variety of ways of organizing documents, including notions of

 Collections

55

 Tags

 Non-visible metadata

 Directory hierarchies
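As a small illustration of the flexible schema and content-based lookup described above, the sketch below stores the two example documents in MongoDB through its Python driver; the connection string, database and collection names are placeholders.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")   # placeholder local instance
people = client["demo"]["people"]

# Two documents with different "shapes": no table to create or alter first.
people.insert_one({"FirstName": "Bob", "Address": "5 Oak St.", "Hobby": "sailing"})
people.insert_one({"FirstName": "Jonathan",
                   "Address": "15 Wanamassa Point Road",
                   "Children": [{"Name": "Michael", "Age": 10},
                                {"Name": "Jennifer", "Age": 8}]})

# Key-style lookup (every document gets a unique _id) and a content query.
bob = people.find_one({"FirstName": "Bob"})
sailors = list(people.find({"Hobby": "sailing"}))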

14.2 MongoDB and CouchDB comparison

As I said earlier, both MongoDB and CouchDB are document-oriented databases with schemaless JSON-style object data storage. Table 4 shows a comparison between the two databases.

Feature | MongoDB | CouchDB
Data Model | Document-Oriented (BSON) | Document-Oriented (JSON)
Interface | Native drivers; REST | HTTP/REST
Large Objects (Files) | Yes (GridFS) | Yes (attachments)
Horizontal Partitioning Scheme | Auto-sharding | BigCouch, CouchDB Lounge, Pillow
Object Storage | Database contains collections; collection contains documents | Database contains documents
Query Method | Map/Reduce (JavaScript + others); creating collections + object-based query language | Map/Reduce (JavaScript); creating views + range queries
Replication | Master-slave | Master-master with custom conflict resolution function
Concurrency | Update in-place | MVCC (Multi Version Concurrency Control)
Distributed Consistency | Strong consistency; eventually consistent reads from secondary replicas | Eventually consistent
Written in | C++ | Erlang

Table 4 Comparison of MongoDB and CouchDB

14.3 MVCC – Multi-Version Concurrency Control

One big difference is that CouchDB is MVCC based, while MongoDB is more of a traditional update-in-place store [24, 25, 27]. MVCC is very good for certain classes of problems:

 Problems which need intense versioning; problems with offline databases that re-sync later

 Problems where you want a large amount of master-master replication happening.

But with MVCC there are some considerations:

 The database must be compacted periodically if there are many updates;

 When conflicts occur on transactions, they must be handled manually (unless the database also does conventional locking, although then master-master replication is likely lost) [25].

MongoDB updates an object in-place when possible. Problems requiring high update rates of objects are a great fit, and compaction is not necessary. Mongo's replication, without the MVCC model, is oriented more towards master/slave and auto-failover configurations than towards master-master setups. MongoDB promises high write performance, especially for updates.
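A small sketch of the practical difference, assuming a local MongoDB and CouchDB each holding a "demo" database with a document "bob" already present; the URLs and names are placeholders.

import json
import requests
from pymongo import MongoClient

# MongoDB: update in-place, no document version has to be supplied.
people = MongoClient()["demo"]["people"]
people.update_one({"_id": "bob"}, {"$inc": {"logins": 1}})

# CouchDB (MVCC): an update must carry the current revision (_rev). If another
# writer committed first, the PUT is rejected with 409 Conflict and the
# application has to re-read and retry (or resolve the conflict itself).
url = "http://localhost:5984/demo/bob"
doc = requests.get(url).json()            # includes the current _rev
doc["logins"] = doc.get("logins", 0) + 1
response = requests.put(url, data=json.dumps(doc))
if response.status_code == 409:
    pass  # lost the race: reload the document and retry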

14.4 Scalability

One fundamental difference is that a number of Couch users use replication as a way to scale, while Mongo uses auto-sharding as its way of scaling. There are a couple of options for sharding CouchDB, available as open source or from third-party developers. The best known are CouchDB Lounge and BigCouch, used by cloudant.com [25, 26].

BigCouch can be seen as an Erlang/OTP application that allows creating a cluster of CouchDBs distributed across many nodes/servers [30]. Instead of one big CouchDB instance, the result is an elastic data store which is fully CouchDB API-compliant.

The clustering layer is most closely modeled after Amazon's Dynamo, with consistent hashing, replication, and quorum for read/write operations. CouchDB view indexing occurs in parallel on each partition, and can achieve impressive speedups as compared to standalone serial indexing. [25]

14.5 Querying

CouchDB uses a view model which acts as an ongoing incremental map-reduce function, providing a constantly updated view of the database. Different views can be accessed from the HTTP interface, and data can be retrieved by key/index as well. The view model is well suited to statically definable queries and job-style operations. There is an elegance to the approach, although these structures must be pre-declared for each query to be executed. They can be thought of as materialized views. [27]

Mongo uses traditional dynamic queries. As with, say, MySQL, it can run queries where an index does not exist, or where an index is helpful but only partially so. Mongo includes a query optimizer which makes these determinations. This is very convenient for inspecting the data administratively, and this method is also good when indexes are not used, such as with insert-intensive collections. When an index corresponds perfectly to the query, the Couch and Mongo approaches are conceptually similar. [24]
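The contrast can be sketched as follows: a CouchDB view is declared once in a design document and then read by key, while the MongoDB query is composed ad hoc at request time. The URLs, database and field names below are illustrative placeholders.

import json
import requests
from pymongo import MongoClient

COUCH = "http://localhost:5984/demo"   # placeholder CouchDB database

# CouchDB: a pre-declared view; the JavaScript map function is stored in a
# design document and kept up to date incrementally (a materialized view).
design = {"views": {"by_hobby": {
    "map": "function(doc) { if (doc.Hobby) emit(doc.Hobby, doc); }"}}}
requests.put(COUCH + "/_design/people", data=json.dumps(design))

# Reading from the view by key.
sailors = requests.get(COUCH + "/_design/people/_view/by_hobby",
                       params={"key": json.dumps("sailing")}).json()

# MongoDB: the same question as an ad-hoc dynamic query; the query optimizer
# decides whether an existing index can be used.
sailors_mongo = list(MongoClient()["demo"]["people"].find({"Hobby": "sailing"}))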


14.6 Atomicity and Durability

Both MongoDB and CouchDB support concurrent modifications of single documents. Both forego complex transactions involving large numbers of objects.

CouchDB is a "crash-only" design where the database can terminate at any time and remain consistent.[25,27] Previous versions of MongoDB used a storage engine that would require a repair database operation when starting up after a hard crash. Newer versions offer durability via journaling.[24]

14.7 Map Reduce

Both CouchDB and MongoDB support map/reduce operations. For CouchDB map/reduce is inherent to the building of all views [24]. With MongoDB, map/reduce is only for data processing jobs but not for traditional queries.[25]

14.8 Javascript

Both CouchDB and MongoDB make use of Javascript. CouchDB uses Javascript extensively including in the building of views.

MongoDB supports the use of JavaScript but more as an adjunct. In MongoDB, query expressions are typically expressed as JSON-style query objects, however one may also specify a JavaScript expression as part of the query. MongoDB also supports running arbitrary javascript functions server-side and uses JavaScript for map/reduce operations.

14.9 REST

Couch uses REST as its interface to the database. MongoDB relies on language-specific database drivers for access to the database over a custom binary protocol. Of course, a REST interface can be added on top of an existing MongoDB driver at any time.
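For example (a sketch with placeholder hosts and names), the same create-and-read pair looks like plain HTTP against CouchDB and like driver calls against MongoDB:

import json
import requests
from pymongo import MongoClient

# CouchDB: every operation is an HTTP request against the database URL.
requests.put("http://localhost:5984/demo/bob",
             data=json.dumps({"FirstName": "Bob", "Hobby": "sailing"}))
bob = requests.get("http://localhost:5984/demo/bob").json()

# MongoDB: the same operations go through a language driver that speaks the
# binary wire protocol.
people = MongoClient()["demo"]["people"]
people.insert_one({"_id": "bob", "FirstName": "Bob", "Hobby": "sailing"})
bob_doc = people.find_one({"_id": "bob"})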

14.10 MongoLab and Cloudant

The most popular platforms offering managed instances of MongoDB and CouchDB as a service are MongoLab and Cloudant, respectively.

MongoLab offers two tiers of plans, shared and dedicated, in order to accommodate a range of use cases and budgets. The database can be hosted on Amazon AWS or in the Rackspace Cloud. With the shared plan, MongoLab offers one MongoDB database on a shared mongod server process on a shared VM host, with replication for backups [28]. The architecture is shown in Figure 14 below.


Figure 14 MongoLab Shared Plan

The shared plan is offered for free up to 250 MB, and there are three more options available: Small, Medium and Large; additional storage is also available as an option.

The dedicated plan is offered in two variants: with one dedicated node, and with two or more dedicated nodes. The dedicated plan with one node is a single dedicated VM with automatic failover to a secondary on a shared VM. It offers high availability with the replicas, but it does not allow reading from the replicas as a means to increase read throughput. It also offers monitoring through the MongoDB Monitoring Service (MMS). MMS is a 10gen6 web service that monitors and graphs the performance of MongoDB clusters, servers and databases over time. It can monitor important statistics such as resident memory usage, rate of database operations, write-lock queue depth and CPU, alongside any other MongoDB instances the user might be running outside of MongoLab [28].

6 10gen is a software company that develops and provides commercial support for the open source MongoDB database.

The dedicated plan with two or more nodes can scale to as many dedicated nodes of equal size as needed. In addition to providing high availability, it scales read throughput horizontally by creating a Replica Set cluster of more than one member. The architectures of both dedicated plans are shown in Figures 15 and 16. With dedicated plans, hosting is available on Amazon EC2 or in the Rackspace Cloud.
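Connecting an application to such a hosted deployment is done with an ordinary MongoDB connection string; the sketch below uses made-up hosts, credentials and a made-up database name, and shows how reads can be directed to secondaries on a multi-node plan.

from pymongo import MongoClient, ReadPreference

# Placeholder connection string in the form a MongoLab-style service provides.
uri = ("mongodb://appuser:secret@ds012345-a0.example-host.com:31234,"
       "ds012345-a1.example-host.com:31234/mydb?replicaSet=rs-ds012345")

client = MongoClient(uri)
db = client["mydb"]

# On a plan with two or more dedicated nodes, spreading reads over the
# secondaries is what scales read throughput horizontally.
people = db.get_collection("people",
                           read_preference=ReadPreference.SECONDARY_PREFERRED)
recent = list(people.find().limit(10))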

Figure 15 Dedicated Plan Architecture: 1 Dedicated Node


Figure 16 Dedicated Plan Architecture: 2+ Dedicated Nodes

Cloudant.com offers multi-tenant and single-tenant (private) CouchDB database clusters that are hosted and scaled within or across multiple top-tier data centers around the globe. In all offered plans, Cloudant automatically replicates the data across this network as needed to push it closer to the global user base, reduce network latency overhead, ensure 24x7 availability, and provide disaster recovery capabilities. [27]

Cloudant provides a domain through which to access the data layer. Behind that domain, Cloudant stores the data in a horizontally scalable version of the CouchDB database. The horizontal scalability is achieved with BigCouch [30], as mentioned earlier. The data layer automatically handles load balancing, clustering, backup, growing/shrinking the clusters, and high availability. It also provides private, single-tenant clusters that exist entirely within a data center or that span data centers to provide real-time data distribution to multiple locations [29]. Regardless of whether it is a multi- or single-tenant data layer, data can be replicated and synchronized between Cloudant data centers and:

 other Cloudant data centers for high availability, backup or for scalability and performance

 non-Cloudant data centers

 disconnected devices/networks; great for mobile apps

 edge databases such as data marts or spreadsheets; great for independent analytic projects

The Cloudant Data Layer also includes a number of dashboards that allow viewing and controlling the data layer performance, usage, search indexing, billing and other metrics. [29]

Cloudant pricing is a little different from MongoLab's: it is based on data stored and millions of requests per month (MReq/mo). There is a free starting plan that includes 250 MB of storage and 0.5 MReq/mo. [25, 28]

Data storage is counted in a way that includes only the size of the latest revision of all documents, plus the size of the view indexes. Older revisions and deleted documents do not count towards size quotas; they are purged automatically after a certain time. Requests are approximately the number of document reads and writes against the database.

15. What benefits do cloud databases and cloud computing bring for small and medium organizations?

For small and medium business owners, saving money and time whenever possible is critical to success. Regardless of whether it is just a startup or a more mature business, cloud software and services in general can help cut costs and allow the organization to concentrate on the core of its business. The benefits of cloud computing for small business sound attractive, but that does not mean that it does not have certain disadvantages or that it is right for every business. As I have shown in the previous part of this paper, there are a lot of options to choose from among the cloud database-as-a-service offerings, and when all the other available cloud services are included, choosing the right provider and the right services for the business needs is not an easy task. Here I refer to cloud computing in general, as the benefits of the DBaaS solutions are part of the benefits of cloud computing. First I will describe the main benefits.

15.1 Advantages for Small Business

I will discuss the advantages and disadvantages in the more general terms of cloud computing, as the same apply to the cloud database. The main advantages include:

 Lower Initial Investment – the only things needed to start using the cloud are a computer and an Internet connection; it is possible to take advantage of most cloud offerings without investing in any new hardware, specialized software, or adding to staff. This is one cloud computing advantage that has universal appeal regardless of the industry or the type of business. It allows organizations, and especially startups, to invest in new projects and ideas without the risk of a big loss.

 Easier to manage - There are no power requirements or space considerations to think about and users do not have to understand the underlying technology in order to take advantage of it. There is no need for maintaining and updating any new hardware or software. Planning time is considerably less as well since there are fewer logistical issues.

 Pay as You Go - Large upfront fees are not the norm when it comes to cloud services. Most of the cloud services, as I wrote earlier in this paper, are available on a month-to-month basis with no long term contracts. This also gives the benefit of keeping multiple projects running without enormous expenses.

 Scalability - Cloud computing can be scaled to match the changing needs of the small business as it grows. Licenses, storage space, new instances and more can be added as needed.

 Deploy Faster – usually it is possible to get up and running significantly faster with cloud services than if there is a need to plan, buy, build, and implement in house. With many software as a service applications or other cloud offerings it is possible to start using the service within hours or days rather than weeks or months.

 Location Independent - Because services are offered over the Internet, there are no limits to using cloud software or services just at work or only on one computer. Access from anywhere is a big advantage for people who travel a lot, like to be able to work from home, or whose organization is spread out across multiple locations.

 Device independent - Most web-based software and cloud services are not designed specifically for any one browser or operating system. Many can be accessed via PC, Mac, on tablets and through mobile phones.

15.2 Disadvantages of Cloud Computing

While the advantages of cloud computing are clear and easy enough to understand, there are potentially a few disadvantages that need to be considered carefully.

 Downtime - While we would like to think our data or the cloud based services that we use are available on demand all day every day, the truth is they are not. System uptime is entirely out of our hands with cloud services. There are two types of downtime:

o Scheduled downtime might be required to upgrade software, install new hardware, or perform other routine maintenance. Typically, scheduled downtime is infrequent, announced well in advance, and takes place at non-peak hours where usage is likely to be low, so as to minimize interruption to the customer.

o Unscheduled downtime, otherwise known as an outage, is indicative of some sort of failure or problem. It is rare but outages do happen even for the larger, more established cloud providers. If it does, there is not much that can be done other than wait.

 Security Issues - This is maybe one of the most discussed issues when considering moving to the cloud. You are turning over data about your business and your customers to a third party and entrusting them to keep it safe. Without the proper level of security, your data could be exposed to users outside your company or accessed by a hacker.

 Less control over your data loss - With cloud services, you will have to give up some degree of control over the prevention of data loss. That is in the hands of the cloud service provider.

 Integration and Customization - Some web based software solutions and cloud services are offered as a one size fits all solution. If you need to customize the application or service to fit specific needs or integrate with your existing systems, doing so may be challenging, expensive, or not an option.

15.3 Main things to be considered when moving to the cloud

Migrating to a cloud solution is usually fairly easy; the service provider usually helps with setting everything up and transferring the information to the hosted environment. But there are some considerations that an organization should look at.

 Prioritize applications – Focus on the applications that provide the maximum benefit for the minimum cost/risk. Measure the business criticality, business risk, functionality of the services and impact on data sovereignty, regulation and compliance. Prioritize which applications to migrate to the cloud and in which order.

 Consumption models – As can be seen from the different pricing models used by the services and providers described earlier, each provider has a different consumption model for how the service is procured and used. These consumption models need to be considered carefully from two perspectives: frequency of change and volume.

 Data residency and legal jurisdiction – This issue is not recognized by many, but most organizations realize that business information held outside their country is subject to the commercial law of the country it is held in. Most organizations decide to keep their data in the country of origin to ensure that the local country law still applies to their business information.


 Performance and availability – When moving to a distributed IT landscape with some functionality in the cloud, where there is integration between these cloud applications and on-premise applications, the performance of this distributed functionality needs careful consideration and potentially increased processing to ensure service delivery. Similarly, availability will need careful assessment, because an application that is all in the cloud, or distributed across the cloud and on-premise, will have different availability characteristics from the legacy on-premise application. Organizations also need to ensure that their local and wide area networks are enabled for the cloud and will support the associated increase in bandwidth and network traffic.

 Service integration – When moving an application to the cloud, continuity of service and service management need to be considered. The service management role changes to more of a service integration role. An alternative to the in-house service management function providing this capability is the use of an outsourcing organization to provide this function.

 Architecting for the cloud and cloud application maturity – Cloud computing provides real benefits for organizations, but to realize these benefits the applications being utilized sometimes need to be architected to take advantage of the scalable nature of cloud computing. While new applications should be built with this in mind, legacy applications are often built to take advantage of legacy systems and hence may not be able to truly leverage the benefits the cloud can bring without significant re-architecting. There are even differences in how much re-architecting is needed to move to a cloud provider, and to move from one cloud provider to the next, so the cloud provider selection process should include questions about the provider's technological underpinning, so that if re-architecting is needed, it does not come as a surprise. Currently, application maturity is extremely variable from one application to the next.

 Exit strategy – Before adopting a cloud service provider or application, ensure an exit strategy is considered, e.g. data extraction, and put the costs for this strategy into the business case and service costs. Many people are rightly concerned about moving to cloud computing and being tied to one provider. This is indeed a concern and one which should not be brushed off lightly. That said, cloud computing tends to be much more transparent when it comes to lock-in, so organizations should be able to accurately gauge the risks. Organizations should look at a number of different factors:

o Does the vendor use industry standard APIs or proprietary ones?

o Does the vendor provide quick and easy data extraction in the event that the customer wishes to shift?

o Does the vendor use open standards or have they created their own ways of doing things?

o Can the Cloud Computing service be controlled by third party control panels?

 Data migration – Moving data into or out of a SaaS or DBaaS application may require considerable transformation and load effort.

 Service and transaction state – Maintaining continuity of the state of in-flight transactions at the point of transition into the cloud will need consideration. This will also be the case at the point of exit.

 Service Level Agreement (SLA) – Small business owners usually do not have experience with these types of agreements, and not reviewing them might open up Pandora's box without knowing it. The business impact covered in the SLA must be carefully considered and analyzed. Close attention should be paid to the availability guarantees and penalty clauses:

o Does the availability fit in with the organization's business model?

o What needs to be done to receive the credits when the hosting provider fails to achieve the guaranteed service levels?

o Are the credits processed automatically, or do they need to be requested in writing?

Usually the cloud providers have one SLA for all users and do not provide customization of the SLA. All these considerations must be evaluated carefully before moving to a cloud based solution, in order to mitigate the risk and be confident of choosing the right cloud services that will support and ensure the growth of the business.


16. Will cloud computing reduce the budget?

A small business which decides to own and manage its own IT equipment sometimes fails to recognize that, over time, this equipment and its components will begin to deteriorate, causing the system to crash or experience latency. This may pose a bigger problem if the company has remote users and satellite offices. Without much thought, an entrepreneur will surely put in more money by upgrading equipment and adding extra redundancy. Additional IT support personnel may be hired. The cycle will truly become vicious, as new equipment will depreciate and break down after a couple of years.

In general, IT eats up a huge part of the company's budget, not only because of the costly equipment but because of its maintenance and upgrade costs as well. Upgrades, security threats, and unexpected system crashes often cost a lot of money. With cloud computing, all these IT capital investments and expenses are borne by the third-party supplier. The business owner just has to budget for the system's monthly subscription fees per user. There is also no need to invest in IT in anticipation of future demand, because cloud computing can be deployed on demand when needed. An entrepreneur can settle on a cloud computing service for better forecasting of the IT budget.

Cloud computing simplifies budgeting. The business owner need not worry about merging projects or complex expansion, because he only pays for the resources his company uses. Also, when the number of users is reduced, the accompanying cloud computing costs are reduced as well. The traditional IT process of procurement, installation, management, protection, and support of an on-premise system can be a vicious cycle and contradicts the company's goal of reducing recurring expenses. Cloud computing services and resources are used only when needed, which greatly reduces recurring expenditures and helps the company adapt to the frequently evolving conditions of the market.

With cloud computing, a business owner can better manage uncertainties. He exposes his company to greater risks if he invests a lot of money in IT. Because of growing demand, a lot of businesses overinvest in information technology, which eventually increases the expenses and uncertainties of IT management and maintenance. Cloud computing vendors reduce the company's reliance on on-premise IT systems, thereby assuming the uncertainties and costs of IT support, security, backups, and hardware. The business owner, therefore, has no more liability in the procurement, management, and upgrade of IT equipment. Growth opportunities can then be pursued without having to bear the uncertainties of large capital outlays.

One benefit usually overlooked by small entrepreneurs is the fact that cloud computing also reduces energy costs, because the company has less IT equipment to maintain. IT servers require specific temperatures to run properly. When a business owner decides to use cloud computing services, energy bills are reduced because expensive IT equipment is moved to a safe, monitored, and disaster-proof IT center.

When on-site IT problems arise, it is to be expected that employee productivity is affected and stress levels are elevated. When using cloud services, employees can do their work anywhere and anytime they wish. They can work from home by accessing the software through an Internet connection, which also improves morale. Travel time and costs are significantly reduced. Each employee who is given access to the software can even ask the cloud computing supplier's team for support with problems which may arise while using the system. Management can even monitor each employee's activity remotely through the management consoles provided by the supplier.


17. Conclusion

Database management systems have long been an integral part of computing. As the whole IT world is moving to the cloud, whether you are assembling, managing or developing on a cloud computing platform, you need a cloud-compatible database. In this work I gave a short overview of cloud computing and presented a couple of the companies that currently offer database as a service in the cloud. Although they differ from the most widely used "traditional" relational database systems, and most of them might require revision and recoding of existing applications, it is obvious that they bring a lot of benefits, especially with the offer of fully managed and automated database administration, tuning and optimization. Cloud database systems are built to use the power of the cloud: they are extremely scalable and elastic, giving the opportunity to start small and expand as needed, mitigating the risk and uncertainties of investing in IT equipment and professional IT support. Cloud computing in general, with its flexible pricing models and different plans, presents one of the best solutions for startups and small companies that are developing new products and do not have the financial power to risk investing in uncertain projects. The cloud database also provides an ideal solution for web and mobile applications. The fact that most of the DBaaS offerings are tightly integrated with other PaaS offerings gives organizations the opportunity to focus fully on developing their products without wasting resources on administration of the platform.

Despite the benefits offered by cloud-based DBMS, many people still have apprehensions about them. This is most likely due to the various security issues that have yet to be dealt with. Storing critical business data in the cloud and entrusting its security to a third party, where the data will be spread over multiple hardware stacks and across multiple data centers, can be a big security issue. In my opinion, the cloud is perhaps still not ready for moving critical enterprise applications which store highly sensitive data, but it is definitely ready to be used for testing and development of new projects.

Many companies, including some huge multinational corporations, have already moved to cloud computing because it is less costly, more efficient, and more agile compared to on-site IT systems. Therefore, small and medium scale enterprises should follow suit. If cloud computing has proven to work for these big enterprises, it will surely work for small and medium enterprises.


Appendix

Case studies from the industry – Amazon RDS

Airbnb, a vacation rental firm, kept its main database in Amazon RDS. The consistency between locally hosted MySQL and Amazon RDS MySQL facilitated the migration to AWS. A significant architectural consideration for Airbnb was that Amazon provided the underlying replication infrastructure. "Amazon RDS supports asynchronous master-slave replication," wrote Tobi Knaup. Knaup added that the hot standby, which ran in a different AWS Availability Zone, was updated synchronously with no replication lag. Therefore, if the master database failed, the standby was promoted to the new master with no loss of data. [32]

Case studies from the industry – Microsoft SQL Azure

Xerox Corporation ported an on-premise enterprise print capability to a public cloud environment. This capability allowed mobile users to find printers with their smartphones and route printouts. As the on-premise version leveraged Microsoft SQL Server for the database component, Xerox selected Microsoft SQL Azure. This approach allowed them to reuse their prior investments in SQL Server-based technology and .NET, and to minimize the technical challenges of porting to a cloud-based environment. They were also able to minimize their skills-based challenges because the development team was trained on Microsoft products. Xerox used SQL Azure for "user account information, job information, device information, print job metadata, and other such data," but the actual print files were stored in Azure Blob Storage, not SQL Azure. Azure Blob Storage had different pricing and characteristics than SQL Azure. For example, unlike SQL Azure, Blob Storage was not limited to 10 GB (Web edition) or 50 GB (Business edition). [33]

Case studies from the industry – Amazon DynamoDB

"When IMDb launches features to our over 110MM monthly unique users worldwide, we want to be prepared for rapid growth (1000x scale), and for customers to use our software in exciting and different ways," said H.B. Siegel, CTO, IMDb. "To ensure we could scale quickly, we migrated IMDb’s popular 10 star rating system to DynamoDB. We evaluated several technologies and chose DynamoDB because it is a high-performance database system that scales seamlessly and is fully managed. This saves us a ton of development time and allows us to focus our resources on building better products for our customers, while still feeling confident in our ability to handle growth."[34]


Case studies from the industry – Amazon SimpleDB

Alexa Web Search crawled the Internet every night and generated a Web-scale datastore with terabytes of data. They wanted to allow users to run custom queries against this data and generate up to 10 million results. To provide this service, Alexa's architecture team leveraged a combination of AWS services that included EC2, S3, SQS, and SimpleDB. SimpleDB was used for status information because it was "schema-less." AWS' Jinesh Varia wrote, "There is no need to provide the structure of the record beforehand. Every controller can define its own structure and append data to a 'job' item." SimpleDB allowed components of the architecture to independently and asynchronously read and write state information (e.g., status of jobs in-process). While a good fit for state information, SimpleDB, which had a 10 GB limit per domain, was not used for the nightly multiterabyte Internet crawl. [35]


References

[1]. Cloud Computing Bible - Barrie Sosinsky, January 2012. ISBN: 978-0-470-90356-8

[2]. http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf

[3]. Introduction to cloud computing - Ivanka Menken, Emereo Publishing 2011

[4]. Understanding PaaS - Michael P. McGrath, O'Reilly Media January 2012

[5]. Data Management Challenges in Cloud Computing Infrastructures - Divyakant Agrawal, A. E., University of California, Santa Barbara.

[6]. Database Scalability, Elasticity, and Autonomy in the Cloud - Divyakant Agrawal, A. E., Department of Computer Science, University of California at Santa Barbara.

[7]. Cloud Computing: Principles, Systems and Applications - Gillam, N. A., Springer 2010

[8]. http://relationalcloud.com/index.php?title=Database_as_a_Service

[9]. The multitenant, metadata-driven architecture of Database.com - Database.com Getting Started Series White Paper

[10]. Megastore: Providing Scalable, Highly Available Storage for Interactive Services - Jason Baker, C. B.-M. http://pdos.csail.mit.edu/6.824-2012/papers/jbaker-megastore.pdf

[11]. Inside SQL Azure. Microsoft TechNet. http://social.technet.microsoft.com/wiki/contents/articles/1695.inside-windows-azure-sql-database.aspx

[12]. https://www.windowsazure.com/en-us/home/features/data-management/

[13]. https://www.windowsazure.com/en-us/pricing/details/#storage

[14]. http://aws.amazon.com/rds/

[15]. https://developers.google.com/appengine/docs

[16]. http://en.wikipedia.org/wiki/Paxos_algorithm

[17]. Werner Vogels' weblog on building scalable and robust distributed systems http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html

[18]. http://aws.amazon.com/dynamodb/

[19]. http://www.databasejournal.com/features/mssql/article.php/3823471/Cloud-Computing-with-Google-DataStore.htm

[20]. Google AppEngine Documents https://developers.google.com/appengine/docs/java/overview - Product page

[21]. Google AppEngine Documents https://developers.google.com/appengine/docs/python/overview - Product page

[22]. Google AppEngine Documents https://developers.google.com/appengine/docs/python/datastore/gqlreference

[23]. MongoDB - http://www.mongodb.org/ - Product Page

[24]. MongoDB blog: http://blog.mongodb.org – Product Blog

[25]. Cloudant Blog http://blog.cloudant.com/cloudant-bigcouch-is-open-source - Product Blog

[26]. http://bsonspec.org/

[27]. http://wiki.apache.org/couchdb/ Product wiki

[28]. http://www.mongolab.com – Product page

[29]. Technical Overview: Anatomy of the Cloudant Data Layer Service - 2012 Cloudant, Inc.

[30]. http://bigcouch.cloudant.com/

[31]. Building Scalable Database Solution with SQL Azure - Introducing Federation in SQL Azure. http://blogs.msdn.com

[32]. http://aws.amazon.com/solutions/case-studies/airbnb/

[33]. https://www.windowsazure.com/en-us/home/case-studies/

[34]. http://aws.amazon.com/dynamodb/testimonials/#imdb

[35]. http://aws.amazon.com/solutions/case-studies/alexa/

[36]. White Paper - Top Ten Data Management Trends - Scalability Experts - Raj Gill, Y. B.

[37]. http://nosql.mypopescu.com/post/1669537044/sql-and-nosql-in-the-cloud

[38]. White Paper - NOSQL for the Enterprise - Neo Technology (2011)

[39]. White Paper - Database as a Cloud Service - Scalability Experts - Wolter, R. (2011)
