
Ensuring Application Uptime in the Cloud

The Importance of Uptime

Nearly every business is now reliant, in some way, on the availability of one system or another. An hour of downtime can, for an ecommerce retailer, mean millions in lost revenue; for an independent software vendor it can lead to lost customers and a tarnished reputation. For the average CTO, uptime is a top, even if sometimes unstated, priority.

Whether your infrastructure is on-premise or hosted, dedicated or virtualised, public or private, uptime is a multifaceted issue.

While a move to the cloud, for most, brings with it a welcome increase in the uptime you can expect, there is still a wide range of issues that can bring your systems down. There is also a difference between the uptime guarantee in your average hosting SLA and the actual uptime of any system. Understanding this is fundamental to ensuring that your organisation benefits from the uptime assurances it needs, at a price it can afford.

In this ebook, we look at many ways you can work with your hosting provider to influence the uptime of your cloud hosted applications and what the trade-offs in each case are.

Uptime in the Cloud

Uptime is a measure of the availability of a component or system. It is typically reported as a percentage indicating the ratio of uptime to total time over any period. Depending on the circumstances, uptime is used to refer to the actual, measured uptime of a system, the predicted uptime or the uptime that is guaranteed.
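That ratio can be made concrete with a few lines of arithmetic. The sketch below is purely illustrative, assuming a 30-day month as the reporting period:

```python
# Uptime as a percentage: the ratio of time available to total time.
# A minimal sketch; the 30-day month is an assumption for illustration.

def uptime_percent(downtime_minutes: float, period_minutes: float) -> float:
    """Return availability over the period as a percentage."""
    return 100.0 * (period_minutes - downtime_minutes) / period_minutes

MONTH_MINUTES = 30 * 24 * 60  # 43,200 minutes in a 30-day month

# 43.2 minutes of downtime in a 30-day month works out to 99.9%:
print(round(uptime_percent(43.2, MONTH_MINUTES), 3))
```

The same function works for any period, which is useful when comparing a monthly SLA figure against an annual one.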

In the context of cloud hosting, the term ‘uptime’ is frequently used in reference to the predicted uptime of the data centre in question. This figure, or an improvement on it, is often passed on to you in the form of an uptime guarantee.

But a guarantee, with financial penalties for failure to comply, is not the same as actual system uptime. In addition, it’s entirely possible for the uptime of your specific applications to differ from that of the host data centre since it is affected by a range of other factors. Some of these factors can be the responsibility of your hosting provider while others may fall entirely at your feet. Either way, it’s essential that you take a holistic view of application uptime, and don’t rely entirely on that guarantee, if you are to meet your organisation's requirements for application availability within budget.

In the cloud hosting context, the term uptime is frequently used in reference to the expected availability of a data centre.

1 The True Cost of Uptime
2 Ensuring Application Uptime in the Cloud
3 Single Points of Failure
4 Scalability
5 Application Architecture
6 Application Performance
7 Business Continuity and Disaster Recovery
8 Service Level Agreement
9 Application Security
10 Measurement and Monitoring

www.iomart.com

1 The True Cost of Uptime

Assuming your objective is to secure a desired level of application availability, rather than simply being reimbursed for any downtime, there are a great many ways that this can be achieved. As with most things however, uptime is inextricably linked with cost and each incremental increase is likely to add to your bill.

Therefore, before making changes to your infrastructure or agreements, you need to know how much downtime is, in truth, acceptable to your organisation. Understanding your recovery time objective (RTO) – i.e. how quickly you need your systems and applications to be back up – and recovery point objective (RPO) – i.e. how much data loss between backups or replications you can tolerate following a disruption – will help to guide all your decisions going forward.

Do you need to recover in an hour, a day or a week? And does this differ between systems? Knowing this will help you make the most cost effective choices while achieving the required uptime.
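One way to make these objectives concrete is to check them against what your current arrangements can actually deliver. A minimal sketch, with purely illustrative figures:

```python
# Check recovery arrangements against RTO/RPO objectives.
# All figures below are illustrative assumptions, not recommendations.

def meets_objectives(rto_minutes, rpo_minutes,
                     restore_time_minutes, backup_interval_minutes):
    """Worst-case data loss equals the backup interval; worst-case
    recovery time is the time taken to restore from the latest backup."""
    return (restore_time_minutes <= rto_minutes and
            backup_interval_minutes <= rpo_minutes)

# A 4-hour restore with nightly backups against a 1-hour RTO / 15-min RPO:
print(meets_objectives(rto_minutes=60, rpo_minutes=15,
                       restore_time_minutes=240, backup_interval_minutes=1440))
# The same arrangements against a relaxed 8-hour RTO / 24-hour RPO:
print(meets_objectives(rto_minutes=480, rpo_minutes=1440,
                       restore_time_minutes=240, backup_interval_minutes=1440))
```

The first check fails and the second passes, which is exactly the point: the same infrastructure can be adequate or inadequate depending on the objectives you set per system.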

Before making changes to your infrastructure or agreements, you need to know how much downtime is, in truth, acceptable to your organisation.

2 Ensuring Application Uptime in the Cloud

Data centre tier

Since data centre uptime is the starting point for most predictions of application uptime, it makes sense that we start here also.

The Uptime Institute’s data centre tiers started life as a guide for those building data centre facilities. To the end- customer, they have become the standard measure of a data centre’s resilience and therefore the uptime that can be expected.

The tier of the data centre (or centres) in which you choose to host your applications will therefore have a direct effect on the level of uptime that your provider will be able to guarantee you.

Expected Availability by Data Centre Tier

Tier 1: 99.67%
Tier 2: 99.75%
Tier 3: 99.98%
Tier 4: 99.99%

If money is no problem, then you could of course invest heavily in the most robust hosting environment possible - the tier 4 data centre. But the cost of such an approach may not make financial sense for you. To move the uptime clock up from 99.982% (tier 3) to 99.995%, a tier 4 data centre has to implement an independently dual-powered cooling system along with electrical storage and distribution systems. These do not come cheap and will certainly put the cost of your hosting up significantly.

While these measures do, clearly, improve the expected uptime of the data centre, they protect against some of the less likely single incidents and do nothing to address the wide range of other factors that can still cripple your application. For many businesses, a tier 3 data centre offers a happy medium between high availability and affordable cost, leaving you with some budget to fix other issues.
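To put those percentages in perspective, each tier's expected availability can be converted into the worst-case downtime it permits per year. A minimal sketch, using the Uptime Institute's commonly quoted tier figures and assuming a 365-day year:

```python
# Convert a data centre tier's expected availability into the downtime
# it permits per year. Tier figures are the Uptime Institute's commonly
# quoted values; the 365-day year is an assumption for illustration.

HOURS_PER_YEAR = 365 * 24  # 8,760

tiers = {"Tier 1": 99.671, "Tier 2": 99.741,
         "Tier 3": 99.982, "Tier 4": 99.995}

for tier, availability in tiers.items():
    downtime_hours = HOURS_PER_YEAR * (100 - availability) / 100
    print(f"{tier}: up to {downtime_hours:.1f} hours of downtime per year")
```

The jump from tier 3 to tier 4 buys you roughly an hour a year, which is the context in which the extra cost of tier 4 should be judged.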

3 Single Points of Failure

As the saying goes, ‘a chain is only as strong as its weakest link’. The same goes for your application and the infrastructure it runs on. Analysing the entire chain to identify and remove single points of failure, i.e. single devices that, if they were to fail, would completely prevent the system from functioning, is a key step in achieving maximum uptime.

Single points of failure can take many forms and may or may not be under your direct control. Either way, don’t assume that redundancy or resilience is a given and, together with your hosting provider, take a look at all the links in your chain. As a minimum, we recommend that you look at the following:

• Power supplies - do all critical devices have dual power supplies?
• Network connections - do all critical devices have redundant connections to the network?
• Network switches - are the switches redundant?
• Load balancers - while load balancers share traffic between redundant servers, a single load balancer is a single point of failure that can take your application ‘off-air’
• Firewalls
• DNS servers
• Third-party dependencies
• Routes to the internet via multiple carriers

In most cases we recommend that you introduce device level redundancy to offset the risk of any application outage arising from the failure of one of these components. In some cases, however, your provider may be able to commit to a hardware replacement within your acceptable downtime window. For example, if you decide that your business can tolerate the failure of a specific component for an hour, but a replacement can be installed in 30 minutes, there is no need to implement a fully redundant device in defence against this risk.

[Figure: example of web hosting infrastructure with no single point of failure]

In addition to all of the above, your application itself can be a single point of failure. Refer to the section on application architecture for more information.
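The effect of removing a single point of failure can be estimated with simple probability: components chained in series multiply their availabilities, while a redundant pair fails only when both members fail. A minimal sketch, where the 99.9% per-device availability is an illustrative assumption:

```python
# Availability of chained vs redundant components.
# In series, availabilities multiply; a redundant pair fails only when
# both members fail, giving 1 - (1 - a)**2.
# The 99.9% per-device figure is an illustrative assumption.

def series(*availabilities):
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def redundant_pair(a):
    return 1 - (1 - a) ** 2

device = 0.999  # 99.9% availability per device

# Firewall, switch, load balancer and server chained as single points of failure:
chain = series(device, device, device, device)
# The same chain with every device deployed as a redundant pair:
redundant_chain = series(*[redundant_pair(device)] * 4)

print(f"all single points of failure: {chain:.4%}")
print(f"fully redundant chain:        {redundant_chain:.4%}")
```

Four chained single points of failure at 99.9% each drag the chain below 99.61%, while the fully redundant version stays above 99.999% - which is why redundancy at each link, not just at the servers, matters.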

4 Scalability

Even if your applications are ‘up’ in the broadest sense, any inability to cope with increases in demand can lead to unavailability for some users. Your approach to capacity planning and scalability therefore plays a determinant role in your application’s uptime.

Cloud systems are often thought of as intrinsically scalable, but scalability can mean several things:

• Scaling out - adding new servers, or machine instances, to a cluster
• Scaling up - adding new resources (i.e. CPU, RAM) to an existing server

Scaling up and out are both perfectly possible in a cloud hosted environment, providing you’re willing to pay for it. Scaling out can, theoretically, be performed instantly, whereas scaling up demands that the machine in question be restarted.

Elasticity is the automation of the scaling out process. In an elastic system, the application can monitor demand in real time and make calls to the hypervisor to create or destroy machine instances accordingly. But scaling up, scaling out and elasticity are all worth nothing if your application hasn’t been created to utilise them. [See the section on application architecture.]
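The elastic behaviour described above amounts to a simple control loop. In the sketch below, the `Hypervisor` class is a hypothetical stand-in for a real cloud provider's API, and the thresholds are illustrative assumptions:

```python
# A minimal sketch of an elastic scaling loop. The Hypervisor class is a
# hypothetical stand-in for a real provider API; thresholds are illustrative.

class Hypervisor:
    """Placeholder for a provider API that creates/destroys instances."""
    def __init__(self, instances=2):
        self.instances = instances

    def create_instance(self):
        self.instances += 1

    def destroy_instance(self):
        self.instances -= 1

def autoscale(hypervisor, load_per_instance, low=0.30, high=0.75,
              min_instances=2):
    """Scale out when instances run hot; scale in when they idle,
    never dropping below the redundant minimum."""
    if load_per_instance > high:
        hypervisor.create_instance()
    elif load_per_instance < low and hypervisor.instances > min_instances:
        hypervisor.destroy_instance()

hv = Hypervisor(instances=2)
autoscale(hv, load_per_instance=0.90)  # hot: scale out
print(hv.instances)  # -> 3
autoscale(hv, load_per_instance=0.10)  # idle: scale back in
print(hv.instances)  # -> 2
```

Note the `min_instances` floor: an elastic system that scales in below its redundant minimum reintroduces a single point of failure.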

A sound capacity planning approach combined with knowledge of the application architecture is fundamental in determining what scalability, if any, is required within your infrastructure. For mature workloads with predictable fluctuations in volume, it may be more cost effective to add the required headroom to your system permanently, rather than to pay for flexibility.

5 Application Architecture

Application architecture plays a crucial role in uptime. For instance, if your application hasn’t been developed in such a way that it can utilise redundant or scalable resources, these are not going to offer an effective means of improving uptime and availability.

Creating a robust, scalable application is a vast topic in itself. A potential starting point for your own consideration of this is the issue that prevents most applications from utilising redundant and scalable resources today - being stateful.

But making an application stateless is not trivial and, furthermore, there is no definitive way to do it - there are a number of approaches all with their pros and cons. As a result, a great many applications remain stateful even when hosted in the cloud.

If this is the case for you, you will need to look at meeting your uptime objectives through a combination of other measures such as SLA-backed device swap-outs and master-slave replication.
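One common route to statelessness is to move session data out of the application process and into a shared store, so that any server can handle any request. The sketch below is a minimal illustration of the idea, with a plain dictionary standing in for an external store such as a cache or database:

```python
# Stateful vs stateless request handling. SHARED_STORE is a plain dict
# standing in for an external session store (cache, database, etc.).

SHARED_STORE = {}  # in production this would live outside the web servers

class StatefulServer:
    """Sessions live inside the process: lost if this server dies, and
    invisible to its redundant siblings."""
    def __init__(self):
        self.sessions = {}

    def handle(self, session_id):
        count = self.sessions.get(session_id, 0) + 1
        self.sessions[session_id] = count
        return count

class StatelessServer:
    """Sessions live in the shared store: any server can serve any user."""
    def handle(self, session_id):
        count = SHARED_STORE.get(session_id, 0) + 1
        SHARED_STORE[session_id] = count
        return count

# Two redundant stateless servers can share the same session:
a, b = StatelessServer(), StatelessServer()
a.handle("user-1")
print(b.handle("user-1"))  # -> 2: server b continues where a left off
```

With the stateful version, a load balancer must pin each user to one server (and lose their session if it fails); with the stateless version, redundancy and scaling out actually deliver uptime.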

If your application hasn’t been developed in such a way that it can utilise redundant or scalable resources, these are not going to offer an effective means of improving uptime and availability.

6 Application Performance

Even when demand is within capacity and infrastructure is functioning normally, application performance still has the potential to derail you.

Monitoring resource utilisation, such as CPU, RAM, disk and bandwidth, while planning for spikes and organic growth is critical in maintaining application availability to all users.

Page load time is of special importance as this has a profound effect on the user experience and is even used by Google in their search ranking algorithm. Numerous free and paid tools exist to monitor page load speed.

If your application is used extensively by a multi-national audience, you should consider using a Content Delivery Network (CDN) to mitigate any potential geographic latency. CDNs offer a number of benefits. By caching and serving content at the point ‘closest to the eyeball’, CDNs take load away from your origin servers. Cached content can also continue to be served even in the case of an outage at the origin servers and the CDN can also play a role in DDoS mitigation.

Page load time is of special importance as this has a profound effect on the user experience and is even used by Google in their search ranking algorithm.

7 Business Continuity and Disaster Recovery

There are some threats to application availability that you can prevent, i.e. single points of failure, and there are others that you cannot, like flood, fire and theft. Understanding how you will maintain application availability during or after such a catastrophic event puts you in the realm of business continuity and disaster recovery (BCDR) planning.

In the application infrastructure sense, BCDR requires you to decide how you will return the application to availability, and in what timeframe, if key infrastructure becomes cut off from users.

A geographically diverse infrastructure, naturally, offers a great deal of protection from this type of issue. The question is, however, what to place at each additional site and how much it costs to operate – after all, the cost of duplicating your entire infrastructure at a location only used in dire emergencies doesn’t tend to appeal.

You therefore need to look at each element of your infrastructure, such as servers, storage, the network, etc. and assess the likelihood and potential impact of a failure in order to identify the appropriate BCDR strategy. You should also look at your organisation’s ability to access and manage any redundant infrastructure, and the time it takes to return to live.

Note: A BCDR strategy, infrastructure or failover process that hasn’t been tested is one big single point of failure in your organisation. Don’t let D-day be the day that you find out your process doesn’t quite work the way you expected it to. Bite the bullet, fake an event and test your process.

[Figure: Active-Passive and Active-Active configurations. In each, two data centres contain load balancers, web servers, databases and storage, with backup and replication between the sites. In the Active-Passive configuration, traffic fails over to the second data centre after an outage; in the Active-Active configuration, traffic is served by both sites.]
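A failover drill can be rehearsed in miniature before any real event. The sketch below simulates a health-check loop that redirects traffic when the primary site stops responding; the site objects and names are hypothetical stand-ins for real health checks:

```python
# A minimal failover drill: route traffic to the first healthy site.
# Site objects and names are hypothetical stand-ins for real health checks.

class Site:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

def route(sites):
    """Return the first healthy site in priority order, or None if all are down."""
    for site in sites:
        if site.healthy:
            return site.name
    return None

primary = Site("primary-dc")
standby = Site("standby-dc")

print(route([primary, standby]))  # -> primary-dc

# Fake an event, as the note above recommends, and watch traffic move:
primary.healthy = False
print(route([primary, standby]))  # -> standby-dc
```

The real value of a drill like this is the `None` case: if no site is healthy, you have found the gap in your BCDR plan before your users do.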

8 Service Level Agreement

The Service Level Agreement (SLA) between your provider and you plays a central role in your uptime strategy.

On the one hand you need to be aware of the reality behind the typical uptime guarantees that are made in the average SLA. On the other, you can potentially use the SLA to reduce the cost of achieving your objectives.

The crucial thing to be aware of is the difference between guaranteed uptime and actual uptime. It’s quite normal for hosting providers to improve upon the uptime of the host data centre and their own infrastructure when making an uptime guarantee. Such guarantees are financially underwritten, meaning that when an outage exceeds the agreed duration, you will receive a partial refund. This may be fine for some but, on the basis that you should always plan for outages to occur, you may want to shore up actual uptime in other ways.

But the SLA can also come to the rescue. If your provider can commit to replacing a failed component within your acceptable downtime window, by keeping the relevant parts on the shelf, this can in principle perform the same role as a redundant device. For example, if your business can tolerate an hour of downtime during which a failed component is replaced, this could save you money in application redevelopment and hardware redundancy.
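Whether an outage actually breaches a guarantee comes down to the same arithmetic as the uptime percentage itself. A minimal sketch, where the 99.95% monthly guarantee is an illustrative assumption, not a quote from any specific SLA:

```python
# Compare measured downtime against an SLA's uptime guarantee.
# The 99.95% guarantee and 30-day month are illustrative assumptions.

def sla_breached(guarantee_percent, downtime_minutes,
                 period_minutes=30 * 24 * 60):
    """True if downtime exceeds what the guarantee allows for the period."""
    allowed = period_minutes * (100 - guarantee_percent) / 100
    return downtime_minutes > allowed

# A 99.95% monthly guarantee allows about 21.6 minutes of downtime:
print(sla_breached(99.95, downtime_minutes=10))  # within the guarantee
print(sla_breached(99.95, downtime_minutes=60))  # breached: credit due
```

Run against your own figures, this makes the gap between guaranteed and actual uptime concrete: an hour-long outage breaches the guarantee and earns a credit, but the hour of downtime has still happened.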

The crucial thing to be aware of is the difference between guaranteed uptime and actual uptime.

9 Application Security

The threat from cyber-attacks, in particular Distributed Denial of Service (DDoS), is ever growing and evolving. No business is entirely safe and taking steps to isolate application availability from such menaces is vital.

Work with your provider to ensure that you have:

• A resilient firewall architecture that can sustain failures and is running the latest software versions

• A solution for applying regular patches and updates to operating systems and devices - perhaps as part of a managed service agreement with your provider

• A DDoS attack detection and mitigation policy - even for attacks aimed at another business

• Anti-virus software running where appropriate and, again, updated to the latest versions and virus definitions

• Intrusion detection and prevention systems so that admins know when infiltration is being attempted

• Finally, harden security by ensuring that only the machines that need to be exposed to the web are. Isolate all others

10 Measurement and Monitoring

There is little point investing in improved application availability if you are not monitoring and measuring it too.

Monitoring load, demand and availability all enable you to measure the success of your past uptime initiatives and plan for the future. The monitoring solution you choose should also support alerting staff when incidents occur. It’s then down to you to define the procedures for responding to these alerts.

We recommend using a third party, different to your hosting provider, to monitor availability so that you avoid the possible scenario of everything looking hunky dory from within while your application is down to everyone not on your network.
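Aggregating an external probe's results is how you arrive at the actual, measured uptime discussed throughout this guide. A minimal sketch, with the probe data fabricated purely for illustration:

```python
# Compute measured availability from an external probe's results - the
# 'actual uptime' that a guarantee is not. Sample data is illustrative only.

def measured_uptime(probe_results):
    """probe_results: list of booleans, one per check (True = reachable)."""
    if not probe_results:
        return 0.0
    return 100.0 * sum(probe_results) / len(probe_results)

# 1,440 one-minute checks over a day, including a 15-minute outage:
checks = [True] * 1425 + [False] * 15
print(f"{measured_uptime(checks):.2f}%")  # -> 98.96%
```

Because the checks run from outside your network, this figure captures exactly the scenario described above: it will fall even when everything looks healthy from within.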

Monitoring load, demand and availability all enable you to measure the success of your past uptime initiatives and plan for the future.

Conclusion

Application availability is a complex issue, the product of a great many factors and choices. By systematically reviewing all of the topics in this guide you will arrive at an understanding of the infrastructure threats to your applications’ uptime, potential solutions and the cost of implementing them.

But assessing this wide variety of interdependent factors is no mean feat. Don’t suffer in silence. Make your provider work for your business by getting them to assist you every step of the way.

About iomart

iomart helps organisations maximise the flexibility, cost effectiveness and security of the cloud. With a dynamic range of managed cloud services that integrate with the hyperscale clouds of AWS and Azure, our agnostic approach delivers solutions tailored to your specific requirements.

As the most accredited cloud company in the UK, iomart’s 300+ expert consultants and solutions architects give you all the insight, guidance and technology you need to make the cloud work for you.

Call us now on 0800 040 7228 or email [email protected] to find out more.

www.iomart.com

Images for illustrative purposes only. © iomart.

Lister Pavilion, Kelvin Campus, West of Scotland Science Park, Glasgow, G20 0SP.

iomart is a registered trademark. All trademarks and registered trademarks are the property of their respective owners.