How to Start Your Disaster Recovery in this "Cloudy" Landscape
EMC Proven Professional Knowledge Sharing 2011

Roy Mikes
Storage and Virtualization Architect
Mondriaan Zorggroep
[email protected]

Table of Contents

About This Document
Who Should Read This Document?
Introduction
1. What is a Disaster?
2. What is a Disaster Recovery Plan (DR plan)?
2.1. Other benefits of a Disaster Recovery Plan
3. Business Impact Analysis (BIA)
3.1. Maximum Tolerable Downtime (MTD)
3.2. Recovery Time Objective (RTO)
3.3. Recovery Point Objective (RPO)
4. Data Classification
5. Risk Assessment
5.1. Component Failure Impact Analysis (CFIA)
5.2. Identifying Critical Components
5.2.1. Personnel
5.2.2. Systems
5.3. Dependencies
5.4. Redundancy
6. Emergency Response Team (ERT)
7. Developing a Recovery Strategy
7.1. Types of
7.2. Virtualized Servers and Disaster Recovery
7.3. Other thoughts
8. Testing Recovery Plans
9. Role of virtualization
9.1. Role of VMware
9.2. Role of EMC
9.3. Role of VMware Site Recovery Manager (SRM)
10. VMware Site Recovery Manager
11. Standardization
12. Conclusion
References

EMC Proven Professional Knowledge Sharing 2

About This Article

Despite our best efforts and precautions, disasters of all kinds eventually strike an organization, usually unanticipated and unannounced. Natural disasters such as hurricanes, floods, or fires can threaten the very existence of an organization. Well-prepared organizations establish plans, procedures, and protocols to survive the effects that a disaster may have on continuing operations and to facilitate a speedy return to working order. Continuity planning and recovery planning are two separate preparation procedures for restoring and recovering critical business operations in the event of such disasters. My focus in this article is recovery planning.

This article should help you understand the need for Business Continuity Management and Disaster Recovery Planning in relation to a working failover plan. Because Disaster Recovery Planning is not purely technical, this article covers most of the non-technical discussions related to it. After reading this document, you should be able to make a good start.

As such, this material is probably most useful to those with little or no familiarity with this topic. Readers who fall into this category would be well served to read this document.

Who Should Read This Document?

This article is written for IT professionals who are responsible for defining the strategic direction of protecting data in their data center(s). These include:

- Storage Administrators
- Operational, middle-level managers
- Business Managers
- IT managers (CIO, Chief Information Officer)

Organizations and individuals who share these interests should read this article as well. Where to start with Disaster Recovery Planning? That often remains a difficult question. My goal is to give a general guideline that provides insight into Disaster Recovery Planning and is not too difficult to read.


Introduction

Let's start with a simple quote: "Information is the organization's most important asset."

Data is created by applications and is processed to become information. Information is undoubtedly the most important asset of an organization. Does this make sense? Absolutely! The digital footprint of each person on this planet is growing. In a sense, it does not matter whether we store data as a person or as a corporation; it has to be protected. For some people, photos are just as important as a company's ERP system. It is not for nothing that storage vendors invest a lot of energy in managing this information.

From a Disaster Recovery perspective, the world is divided into two types of businesses: those that have DR plans and those that don't. If a disaster strikes an organization in each category, which do you think will survive?

When disaster strikes, organizations without DR plans have an extremely difficult road ahead. If the business has any highly time-sensitive critical business processes, that business is almost certain to fail. If a disaster hits an organization without a DR plan, that organization has very little chance of recovery. And it's certainly too late to begin planning. Organizations that do have DR plans may still have a difficult time when a disaster strikes. You may have to put in considerable effort to recover time-sensitive critical business functions. But if you have a DR plan, you have a fighting chance at survival.

Does your organization have a disaster recovery plan today? If not, how many critical, time-sensitive business processes does your organization have? Many organizations think they have a DR plan. They think a few procedures are all it takes. True, you need procedures, but you also need to be sure that you can actually fail over. How do you manage that? Personally, I think a live failover test can do more damage than the certainty it brings is worth. I can guess, but cannot know for sure, how many changes an organization goes through; many organizational infrastructures change every hour. Try to keep your DR plan current when things change that fast. Where does that leave you? Good question. You probably test your failover once per year, maybe twice, or even every quarter. How much do you think has changed since the last time you performed your failover? Thus, this is a considerable challenge.

Lucky for you, there are many techniques and solutions, such as "clouds", where DR plans are probably already well organized, or VMware Site Recovery Manager (SRM), which can help you with your failover. VMware SRM is a business continuity and disaster recovery solution that helps you plan, test, and execute a scheduled migration or emergency failover of data center services from one site to another. But the most beautiful part of SRM is that you can test a plan without doing it live. Wow! I can actually fail over anytime without doing damage to the infrastructure environment? True! Virtualization these days can make Disaster Recovery implementations easy. Think not only public but also private clouds. Private clouds have a huge positive impact and synergy. How many of you are looking for partnerships, serving as each other's failover site? That makes 1+1=3. But take it easy. Don't rush. There is a lot to consider before taking this road.

Depending on the nature of your business, good disaster recovery is achieved by designing a process that enables your operations to continue working, perhaps from a different location, with different equipment, or from home, making full use of technology to achieve a near-seamless transition that is all but invisible to your customers and suppliers. Insurance can mitigate the cost of recovery, but without a disaster recovery plan that gets you back up and running, you could still go under. Indeed, more than 70% of businesses that don't have a DR plan fail within 2 years of suffering a disaster.

So what's next? Certainly a lot! But don't make life too difficult. There will always be one or more single points of failure. You should ask yourself whether the costs are worth five nines (99.999%) availability. The primary task and next step is to determine how you will achieve your Disaster Recovery goals for each of the systems and system components, to ensure that the critical, time-sensitive business processes continue working. This is the point at which it becomes important to consider exactly what types of disasters you need to prepare for and to classify them by the extent and type of impact they have.


1. What is a Disaster?

You may argue with me about the definition of a disaster, because there is more than one definition. To some, anything that doesn't go according to their schedule or plans is a disaster. On a personal level, a fire in our house could be considered a disaster. In most cases, one broken server isn't a disaster, but many broken servers are. However, it is important to understand the difference between these kinds of disasters and a 'true' disaster. This will allow you to keep things in perspective when making your own disaster plans.

Should your company experience a disaster, the first 48 hours following the disaster will be the most critical in your recovery efforts. How you respond during that period will determine if your business will survive. Furthermore, the most important hour is the one immediately following the event.

A disaster is defined as an event causing great loss, hardship, or suffering to many organizations. When we think of this kind of event we usually think of catastrophic events such as hurricanes, earthquakes, floods, fires, and even man-made disasters. In situations like this, help may be unavailable because rescuers may be in the same predicament as you, and it could take a considerable length of time for help to arrive.

Disaster preparedness is the sensible thing to do. It doesn't need to be expensive and it can save your business! In these situations we are not talking about losing server cooling or power for a few hours; we are talking about losing essential services, data, or information, under extreme circumstances, for a prolonged period of time.

Disaster recovery is becoming an increasingly important aspect of enterprise computing. As devices, systems, and networks become ever more complex, there are simply more things that can go wrong. As a consequence, recovery plans have also become more complex.

It is a common misconception that most of the threats to continuity are a result of natural disaster. To the contrary, statistically, these threats account for fewer than 1% of IT service unavailability.


2. What is a Disaster Recovery Plan (DR plan)?

A good Disaster Recovery Plan (DR plan) is like an information insurance policy for a business. A DR plan documents the ability to continue work after any number of catastrophic problems, ranging from natural disasters such as flood, fire, or earthquake to planned/unplanned scenarios such as database corruption, server failures, or simply human error.

A DR plan is often confused with a Business Continuity Plan (BCP). Both deal with events that make the continuation of normal functions impossible, but a DR plan is the IT-related part of its big brother, the Business Continuity Plan. I am not going to talk about the Business Continuity Plan; instead, we are sticking with the DR plan.

A DR plan consists of the precautions taken so that the effects of a disaster will be minimized and the organization will be able to either maintain or quickly resume mission-critical functions. Typically, DR planning involves an analysis of business processes and continuity needs; it may also include a significant focus on disaster prevention.

2.1. Other benefits of a Disaster Recovery Plan

Besides the obvious readiness to survive a disaster, organizations can realize several other benefits from DR planning [1]:

- Improved business processes: Business processes undergo continuous analysis and reviews; there are always areas for improvement.
- Improved technology: Often, you need to improve IT systems to support recovery objectives that you develop in the disaster recovery plan. The attention you pay to recoverability also often leads to making your IT systems more consistent with each other and, hence, more easily and predictably managed.
- Fewer disruptions: As a result of improved technology, IT systems tend to be more stable than in the past. Also, when you make changes to system architecture to meet recovery objectives, events that used to cause outages no longer do so.
- Higher quality services: Improved processes and technologies improve services, both internally and to customers and supply-chain partners.
- Competitive advantages: Having a good DR plan gives an organization bragging rights that may outshine competitors. Price isn't necessarily the only point on which companies compete for business. A DR plan allows a company to also claim higher availability and reliability of services.


3. Business Impact Analysis (BIA)

Although a full DR plan takes many months or even longer to complete, a good first step toward an individual DR plan is mapping out the most critical aspects of day-to-day business in your company. Data safety is perhaps one of the most crucial and overlooked aspects of disaster recovery. [2]

A Business Impact Analysis (BIA) is a detailed inventory of the critical processes, systems, and people associated with an organization's primary business activities. If you have never done a Business Impact Analysis, it can seem one of the most difficult tasks. There always seem to be a lot of questions about what should and should not be included in the BIA.

The purpose of a BIA is to identify which business units, operations, and processes are essential to the survival of the business. Of course, there is no standard BIA; it differs per organization. Basically, there are two areas to cover:

1. Determine the most critical business areas, often referred to as mission-critical applications. We will cover this later.
2. For each business area, determine the sub-business processes and identify the processes that are essential to the operation of the business, often referred to as business-critical. We will cover this one later also.

After you have a clear view of which processes are critical for your business (and don't take this lightly), management should estimate the maximum downtime that can be tolerated. Management should determine the longest period of time that a critical process can be disrupted. This figure is known as the Maximum Tolerable Downtime (MTD). You may measure an MTD in hours or days. These are often the most difficult answers to get.

After you complete the MTD and risk analysis for each critical business process, you need to condense the detailed information to a simple spreadsheet so you can see all the business processes on one page, along with their respective MTD and risk figures. Try to see the big picture here.
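The condensing step described above can be sketched in a few lines of code. This is a minimal sketch; the process names, MTD values, and risk figures are hypothetical placeholders, not taken from any real BIA:

```python
import csv

# Hypothetical BIA results: each critical process with its MTD (hours)
# and a simple risk figure (probability x impact, scale 1-25).
bia_results = [
    {"process": "Order entry", "mtd_hours": 48, "risk": 20},
    {"process": "Invoicing", "mtd_hours": 120, "risk": 12},
    {"process": "E-mail", "mtd_hours": 72, "risk": 9},
]

# Condense to a one-page summary, most urgent (lowest MTD) first.
summary = sorted(bia_results, key=lambda row: row["mtd_hours"])

with open("bia_summary.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["process", "mtd_hours", "risk"])
    writer.writeheader()
    writer.writerows(summary)

print(summary[0]["process"])  # the process with the tightest MTD
```

The point of the one-page view is exactly this sort: the tightest MTDs rise to the top, so you see at a glance where recovery effort must go first.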

Because there is a potential risk of downtime for these critical processes, we cannot ignore the major consequences. These consequences relate to the objectives of the business.

The business objectives considered in the BIA include:

- Financial/Cash Flow/revenue loss
- Legal/Regulatory
- Life-threatening issues (in hospitals, for example)
- Reputation
- And so on…

(There are many more, depending on your type of organization)


3.1. Maximum Tolerable Downtime (MTD)

For each process in the BIA, you need to determine its Maximum Tolerable Downtime (MTD): the time after which the unavailability of the process creates irreversible (and often fatal) consequences. Generally, exceeding the MTD leads to severe damage to the viability of the business, up to and including the actual failure of the business. Depending on the process, you can express the MTD in hours or days.

3.2. Recovery Time Objective (RTO)

After you determine the MTD for processes, you can begin setting targets for recovery. One important target is the Recovery Time Objective (RTO).

RTO is the period of time required to return an application or process to a working state after a downtime situation. For any given process, the RTO is less than the MTD. By definition, it has to be. If you set a 5-day RTO for a process with a 2-day MTD, your business has failed before you can get the critical process running again. And what's the point of that? A process's RTO forms the basis for any DR planning that you'll do for that process.

For example, if a process has a 30-day RTO, you can get it running again (purchase a new server, install software, and restore backup data) at a leisurely pace. However, a process with a one-hour RTO requires a hot site with a standby server and data replicated in near-real time. The costs for these two scenarios vary greatly.
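The rule that every RTO must stay below its MTD is easy to check mechanically. A minimal sketch, with hypothetical process names and figures:

```python
# Hypothetical recovery targets, in hours: process -> (MTD, RTO).
targets = {
    "Order entry": (48, 24),
    "Invoicing": (120, 150),  # mistake: RTO exceeds MTD
}

def invalid_targets(targets):
    """Return the processes whose RTO is not below their MTD."""
    return [name for name, (mtd, rto) in targets.items() if rto >= mtd]

print(invalid_targets(targets))  # ['Invoicing']
```

A check like this is worth running whenever the BIA spreadsheet changes, so a planning mistake like the Invoicing entry above is caught before it reaches the DR plan.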

3.3. Recovery Point Objective (RPO)

The Recovery Point Objective (RPO) represents the point in time to which data must be recoverable: do you fall back to the transactions of the last 5 minutes, the last hour, or the last day? It represents the risk of permanently losing some part of your data.

Assume that an organization wants to establish a 5-hour RPO for an order entry system. To meet this figure, the organization has to implement a mechanism to back up or replicate transaction data so that it loses no more than 5 hours of transactions in a disaster scenario. As with the RTO, setting the RPO determines what sort of measures you need to take to ensure that you don't lose information related to any particular business process. Speed costs.
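The relation between RPO and backup frequency can be sketched the same way: in the worst case, disaster strikes just before the next backup, so the interval between backups must not exceed the RPO. The figures are hypothetical:

```python
def meets_rpo(backup_interval_hours, rpo_hours):
    """Worst case, you lose everything since the last backup,
    so the backup interval must not exceed the RPO."""
    return backup_interval_hours <= rpo_hours

# A 5-hour RPO for the order entry system:
print(meets_rpo(backup_interval_hours=4, rpo_hours=5))   # True
print(meets_rpo(backup_interval_hours=24, rpo_hours=5))  # False
```

A nightly backup (24-hour interval) clearly cannot satisfy a 5-hour RPO; that gap is what drives you toward more frequent backups or replication.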


4. Data Classification

Every business relies on certain applications that it uses every day to run the business. These applications become assets and are incorporated into the business. Critical information resides on these assets, and the assets are made available to employees.

In this chapter, I will go through the five classes of applications. It is important that the data in each organization is analyzed and classified in order to develop a recovery strategy. You must classify your data; otherwise, everything gets protected the same way. Without classification everything is important, and you don't want that. Unless you own a gold mine! Let's face it; most employees are aware of which applications or processes are important for their organization. But not everything is, or should be, important, and that's why we are going to classify these types of data.

EMC's application matrix provides a requirement-driven, five-category methodology for mapping technology solutions to critical applications. Ranking the criticality of each application within the matrix dictates the need for disaster recovery and the method used to protect the data.

- Class 1 - Mission Critical
- Class 2 - Business Critical
- Class 3 - Business Important
- Class 4 - Productivity Important
- Class 5 - Non-critical

Let's go through them:

Mission Critical applications are necessary for the company to perform its mission. Downtime of these applications has a significant impact on revenue.

Business Critical applications increase productivity. These are the applications that usually support the mission-critical applications. In a major disaster, these should be the second set of applications restored. Downtime of these applications also has an impact on revenue.

Business Important applications are also applications that increase productivity, supporting applications that are not critical. These are 'third-rate' applications.

Productivity Important applications serve departments rather than the entire company. Their loss only affects the productivity of those departments.

Non-critical applications have a minor impact on productivity. They are too personal to be recognized in times of crisis.

Please note that these are guidelines. It may be that application importance differs per department within an organization. But beware of losing yourself in discussions of each application's importance. It eventually happened to me! All of those departments declaring the importance of their applications comes at a price. But hey... as long as somebody is willing to pay for it, no harm is done and you have created a new challenge.
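The five classes above can be captured as a simple lookup from application to class, which later drives restore priority and protection decisions. The application names and class assignments below are hypothetical examples, not a recommendation:

```python
# The five EMC application classes (1 = Mission Critical).
CLASSES = {
    1: "Mission Critical",
    2: "Business Critical",
    3: "Business Important",
    4: "Productivity Important",
    5: "Non-critical",
}

# Hypothetical classification of an organization's applications.
app_class = {
    "ERP": 1,
    "Pharmacy": 2,
    "Intranet": 4,
    "Screensaver": 5,
}

def restore_order(app_class):
    """Restore mission-critical applications first, non-critical last."""
    return sorted(app_class, key=lambda app: app_class[app])

print(restore_order(app_class))  # ['ERP', 'Pharmacy', 'Intranet', 'Screensaver']
```

Even a toy mapping like this makes the classification actionable: the restore sequence after a disaster falls straight out of the class numbers.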


It's not rocket science to know that mission-critical applications need more performance and protection. After all, we earn our money using these applications. So when we are clear about which applications go in which boxes, hypothetically speaking, we know what is needed in terms of performance, capacity, protection, and so on.

Requirements such as high availability, high scalability, and redundant connections are examples of what you may need. As I mentioned earlier, mission-critical applications have bigger needs than 'nice to have' applications. By now you probably know which applications are mission-critical or very important to the business. But is there a need for high-performance/no-downtime hardware? Every requirement has a price; is your business willing to pay the price for a Ferrari, or is a nice Mercedes enough? The questions that have to be answered are mostly about RPO and RTO. Ask questions such as:

- How much downtime can be afforded?
- How much data may be lost in an outage?
- How fast do we need to recover the data?
- How fast do we need to be up and running again?

If the answer is no downtime or less than 2 days, this certainly impacts the type of technology you need. Meeting such standards requires excellent skills combined with excellent hardware and software.

Once complete, this information can be integrated into a matrix like the one below:


The Criticality Matrix shown below depicts the typical requirements and the information infrastructure associated with each class. [2] Here again, you decide which requirements are necessary in each class.

Example of a Criticality Matrix for a mental health institution:

Mission critical
  Applications: PSYGIS Basis (clients administration); PSYGIS File Lining; PSYGIS Medication prescription; PSYGIS Authentication and billing; PSYGIS Calendar Management
  Requirements: High availability; High scalability; Redundant connections; Scalable performance; Non-disruptive backups
  Backup & disaster recovery: Offsite tape; Instant onsite recovery; Offsite disaster recovery location; Offsite data replica; Rapid restore; Instant test environments; Business continuance; Advanced recovery

Business critical
  Applications: Centrasys (pharmacy); Mirador (lab results); MUS (methadone); REAKT (registration of day activities)
  Requirements: High availability; High scalability; Redundant connections; Scalable performance; Non-disruptive backups
  Backup & disaster recovery: Offsite tape; Instant onsite recovery; Offsite disaster recovery location; Rapid restore; Instant test environments

Business important
  Applications: Vila (purchase & inventory); SDB HRM (personnel administration & salaries); Square (planning); FIS (financials); IRIS (financials); Priva (building management); Office (e-mail/documents); BI (business intelligence)
  Requirements: High availability; High scalability; Redundant connections; Scalable performance; Non-disruptive backups
  Backup & disaster recovery: Offsite tape

Productivity important
  Applications: Prodacapo (accounting); SharePoint (intranet); Marvin (process management system); Topdesk (enterprise service management)
  Requirements: High availability; High scalability; High performance
  Backup & disaster recovery: Onsite tape

Non-critical
  Applications: Other applications
  Requirements: Scalable; Low cost
  Backup & disaster recovery: None


5. Risk Assessment

A risk assessment is an important step in protecting your business. It identifies the various natural and man-made threats that can disrupt processes in the organization and its facilities.

A common misconception is that most threats to continuity are the result of natural disaster. Statistically, these threats account for less than 1% of IT service unavailability. That leaves 99% attributable to other threats. It's important to know the risks of disasters such as tornadoes, hurricanes, floods, or other natural disasters. It's even more important to protect yourself against man-made threats.

This one is difficult because there are innumerable scenarios in which things can go wrong through human action. Do we look into all these scenarios? I think not; there are simply too many.

This leaves us with two questions:

- Which possible risks are there for the organization?
- What are the results of these possible threats?

NOTE: Because there are so many scenarios, it is more important to consider the misery caused by these scenarios/risks. There are more than one hundred ways to destroy something. The fact is, one way or the other, it's broken. Try to shift your focus from cause to effect!

Besides the effects of disasters, you need to create a relatively complete list of the disasters that are reasonably likely to occur. The following list isn't meant to be complete; disasters not listed here might belong in your threat model. But it should give you a good starting point.

Global Threats

Part of the risk process is to review the types of disruptive events that can affect the normal running of the organization. There are many potential disruptive events, and their impact and probability must be assessed to give a sound basis for progress.

Environmental Disasters

- Tornado
- Hurricane
- Flood
- Snowstorm
- Earthquake
- Electrical storms
- Fire
- Subsidence and landslides
- Freezing conditions
- Contamination and environmental hazards
- Epidemic


Organized and / or Deliberate Disruption

- Act of terrorism
- Act of sabotage
- Act of war
- Theft
- Labor disputes / industrial action

Loss of Utilities and Services

- Electrical power failure
- Loss of gas supply
- Loss of water supply
- Communications services breakdown
- Loss of drainage / waste removal

Equipment or System Failure

- Internal power failure
- Air conditioning failure
- Production line failure
- Cooling plant failure
- Equipment failure (excluding IT hardware)

Serious Information Security Incidents

- Cyber crime
- Loss of records or data
- Disclosure of sensitive information
- IT system failure

Other Emergency Situations

- Workplace violence
- Public transportation disruption
- Neighborhood hazard
- Health and safety regulations
- Employee morale
- Mergers and acquisitions
- Negative publicity
- Legal problems

Although not a complete list, it does give a good idea of the wide variety of potential threats.
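Assessing impact and probability per event, as described above, can be sketched as a simple scoring exercise. The events and scores below are hypothetical illustrations only; real figures come from your own assessment:

```python
# Hypothetical threat register: threat -> (probability 1-5, impact 1-5).
threats = {
    "Flood": (1, 5),
    "Power failure": (4, 3),
    "Cyber crime": (3, 5),
    "Snowstorm": (2, 2),
}

def ranked_risks(threats):
    """Rank threats by risk score = probability x impact, highest first."""
    return sorted(threats,
                  key=lambda t: threats[t][0] * threats[t][1],
                  reverse=True)

print(ranked_risks(threats))  # ['Cyber crime', 'Power failure', 'Flood', 'Snowstorm']
```

Note how the ranking reflects the statistics quoted earlier: the rare natural disaster (flood) scores below the mundane man-made threats, even with maximum impact.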


Consequences of disasters according to the International Disaster Database


5.1. Component Failure Impact Analysis (CFIA)

Originally a process defined by IBM in the 1980s to improve availability, Component Failure Impact Analysis (CFIA) [2] [3] is now part of the ITIL "Best Practices". CFIA is a process of analyzing a particular hardware/software configuration to determine the true impact of any individual failed component.

Many know that CFIA is somehow related to ITIL Problem and Availability Management, yet it remains at best a fuzzy concept for most. While CFIA is impressive sounding, it is really just a way of evaluating (and predicting) the impact of failures, and locating Single Points of Failure (SPoF). CFIA can:

1. Identify Configuration Items (CIs) that can cause an outage
2. Locate CIs that have no backup
3. Evaluate the risk of failure for each CI
4. Justify future investments
5. Assist in Configuration Management Database (CMDB) creation and maintenance

All it takes to gain these benefits is an Excel spreadsheet or some graph paper. Following are the three steps to success with Component Failure Impact Analysis.

1. Select an IT Service, and get the list of CIs upon which the IT Service depends, hopefully from Configuration Management. If there is no formal CMDB, then ask around IT for documentation, paper diagrams, and general knowledge.

2. Using a spreadsheet or graph paper, list CIs in one column and the IT Service(s) across the top row. Then, for each CI, under each service:
   a) Mark "X" in the column if a CI failure causes an outage
   b) Mark "A" when the CI has an immediate backup ("hot-start")
   c) Mark "B" when the CI has an intermediate backup ("warm-start")

You now have a basic CFIA matrix. Every "X" and "B" is a potential liability.
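The matrix from step 2 can be represented directly in code. This is a minimal sketch with hypothetical CIs and services; as in the steps above, "X" means a failure causes an outage, "A" a hot-start backup, "B" a warm-start backup:

```python
# Rows are CIs, columns are IT services. Empty string = no impact.
cfia = {
    "Core switch": {"E-mail": "X", "Order entry": "X"},
    "Mail server": {"E-mail": "B", "Order entry": ""},
    "DB server":   {"E-mail": "",  "Order entry": "A"},
}

def liabilities(cfia):
    """Every 'X' and 'B' is a potential liability."""
    return [(ci, svc) for ci, row in cfia.items()
            for svc, mark in row.items() if mark in ("X", "B")]

def single_points_of_failure(cfia):
    """CIs whose failure takes out at least one service with no backup."""
    return [ci for ci, row in cfia.items() if "X" in row.values()]

print(single_points_of_failure(cfia))  # ['Core switch']
```

In this toy matrix the core switch is the obvious SPoF: it carries an "X" for both services, so it is the first candidate for the redundancy questions in step 3.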

3. Examine first the "X"s, then the "B"s, by asking the following questions:

- Is this CI a SPoF?
- What is the business/customer impact of this CI failing? How many users would be impacted? What would be the cost to the business?
- What is the probability of failure? Is there anything we can do differently to avoid this impact?
- Are there design changes that could prevent this impact? Should we propose redundancy or some form of resiliency? What would redundancy cost?

As you get good at CFIA, consider expanding your CFIA matrix to include the procedure used to recover from a CI failure as a row across the bottom of your CFIA matrix. (Of course, this requires that you have written procedures!) Adding documented response procedures to your CFIA matrix lets you examine the organization as well as infrastructure. Ask yourself:


- How do we respond when this CI fails?
- What procedures do we follow? Are these procedures documented? Could they be improved? Could they be automated?
- Can we improve the procedure through staff training? New tools or techniques?
- Could preventative maintenance have helped avoid this problem?

NOTE: We will cover these questions later in this article.

Sound CFIA at any level (infrastructure, organization, or both) delivers RFCs that can deliver real improvements to the business without requiring high process maturity or expensive supporting software. There are some IT-centric benefits to CFIA as well, including a head start on IT Service Continuity Management; aid to Configuration Management, which benefits from the addition of recovery procedures to the CMDB; and aid to Problem and Incident Management, which may follow these procedures.

How far are we going with this? You could include every little thing in your CMDB, but it's wise to focus on the CIs related to your mission-critical applications. Identify your critical components. Only they are important. When one of these critical components fails, your mission-critical application probably does, too.


5.2. Identifying Critical Components

This chapter is closely related to Component Failure Impact Analysis (CFIA) and builds on chapter 4, which covered data classification and gives you a good start on the applications we are going to protect.

5.2.1. Personnel

It's important to recognize that your personnel are critical components, too. You may begin to notice a few names that appear frequently in the most critical processes or are involved with several critical applications. You may want to take a closer look at those people and consider whether they're truly critical for so many business processes or applications. Items in your DR plans that relate to critical personnel may include cross-training or staff expansion of some sort, in order to reduce any exposure related to too many processes depending on too few individuals.

5.2.2. Systems

By this time you've collected all the necessary information from all the important business processes for your Business Impact Analysis. You've identified the information systems, personnel, assets, and suppliers that these processes depend on.

In chapter 4 (Data Classification), we discussed the critical applications and what is needed to run them and keep them available at all times. These critical applications depend on and run on systems such as power, cooling, switches, firewalls, and servers. Now, all systems that are relevant to these critical applications should be named, but only those systems that are absolutely necessary for the critical applications that need to fail over when disaster strikes.

So, again, make things easy and start with which application is running on which server. Go to work systematically.

For example: this application is running on this server and depends on the following services; this server is running on precisely this blade in this blade chassis; and so on.

Once you are done, you have a list of all servers. You can expand this list to your own needs. Do this also with power, switches, and other systems that are related to the critical applications. The more extensive, the better. The important thing is to start somewhere!


5.3. Dependencies Why identify dependencies? Your mission in this phase is to identify the systems that are critical. The systems that support business processes aren't just the systems running the applications, but also everything else you need to keep those systems running properly.

The following sections discuss dependencies in greater detail. As previously discussed, focus first on the things related to the mission-critical application. Once you reach that point you can always expand. A wise lesson: start easy!

When your list of all critical components for the mission-critical applications is complete, you need not only an inventory-level view of your systems and applications, but also a high-level view. If you don't have these views of your environment, it's worth the time required to develop them. Often, these diagrams are the only way to get a complete end-to-end view of a single application or an entire environment. Put all your information in a diagram, and then connect the components to each other.

In my example on the next page there are some layers to play with. I roughly used the following layers:

 Power  Network infrastructure  Hardware  Storage   Operation systems  Services

I found a good way to capture the above dependencies in a schema [4]. Below, we see how data flows as email is delivered: it is routed from the external mail server, through the spam filter, and into the internal mail server. IT staff envision their world like this. It's a conceptual model that facilitates troubleshooting of email problems, but it doesn't clearly show how email services might be impacted by various system failures. Adding dependency relations makes that visible.

You need at least these high-level diagrams of your systems environment. Below, I provide an example in which I identify the critical systems. Just add your dependencies and you are ready. (No dependency lines are drawn in the diagram because it would not be readable.)


[Figure: Example of a high-level diagram. The original drawing maps the mission-critical processes through the layers listed above. Power and the network infrastructure feed an HP blade chassis running VMware ESX; the virtual machines host services such as Active Directory (including the forest roles Schema Master and Domain Naming Master and the domain roles PDC, RID, and Infrastructure Master), DHCP, DNS, SQL Server 2005 with the SharePoint configuration and content databases, the SharePoint Shared Services Provider and web front-end servers running IIS, and Citrix servers. Each virtual machine maps to VMFS datastores on RAID groups and LUNs of an EMC CX4-240 array, with HP thin clients on the access side.]

Example of high-level diagram
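The dependency relations in such a diagram can also be expressed in code. The sketch below, with illustrative component names loosely following the email example, walks the graph to find everything impacted by a single failure:

```python
# Edge "component -> dependencies": the component fails when any of its
# dependencies fails. All names are illustrative assumptions.
depends_on = {
    "email-service": ["internal-mail-server"],
    "internal-mail-server": ["spam-filter", "vm-host-1"],
    "spam-filter": ["external-mail-server", "vm-host-1"],
    "vm-host-1": ["blade-chassis", "san-lun"],
    "blade-chassis": ["power-feed-a"],
}

def impacted_by(failed, graph):
    """Return every component that directly or transitively depends on `failed`."""
    hit = set()
    changed = True
    while changed:
        changed = False
        for component, deps in graph.items():
            if component not in hit and (failed in deps or hit & set(deps)):
                hit.add(component)
                changed = True
    return hit

# A single power feed failure cascades all the way up to the email service.
print(sorted(impacted_by("power-feed-a", depends_on)))
```

This is exactly the insight the plain email-flow picture does not show and the dependency diagram does: which services go down when one low-level component fails.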


5.4. Redundancy The term redundancy is often confused with availability. While the two are related, they are not the same. Redundancy refers to, for example, the use of multiple servers, more than one Host Bus Adapter (HBA), or RAID-protected disks. Redundancy is the duplication of critical components of a system with the intention of increasing system reliability. When it comes to redundancy, you would say more is better. But redundancy has its price: you have to buy at least double the hardware. On the other hand, what is the price of a server compared to an outage of your mission-critical application?

There are many scenarios for deploying a redundant solution for servers and storage. Let's look at some examples:

No redundancy

Look at the components in the picture: each one is a single point of failure. A single point of failure (SPoF) is a hardware or software element whose loss results in the loss of service.

HBA redundancy

Configuring multiple HBAs and using multi-pathing software provides path redundancy. Upon detection of one failed HBA, the software can re-drive the I/O through another available path.

HBA and Switch redundancy

This picture provides HBA and switch redundancy as well. It also protects against storage array port failures.


HBA, Switch, and Disk redundancy

Now we are using some level of RAID, such as RAID-5. RAID protection will ensure continuous operation in the event of disk failures.

HBA, Switch, Disk, and Storage array redundancy

The diagram above depicts a highly redundant infrastructure. Everything is redundant, and there is little chance that the failure of one component leaves your applications unavailable.

Remote replication is an essential part of any data protection plan. It provides protection in case of primary device, storage, or site failure. Remote replication involves moving data to a secondary storage array to protect against data loss in case of primary site failure.

There are two types of remote replication: synchronous, which allows an RPO of close to zero, and asynchronous, which allows updates to be made to a secondary image at intervals selected by the user.

Bottom line: there should be some sort of redundancy in your infrastructure to make sure all data and information is protected. Without redundancy, you will only realize when an outage happens that you are lost. You will have to decide how far you want to push redundancy. As mentioned earlier, this has a price: the more redundancy, the more money it costs.
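The cost/benefit trade-off can be made concrete with the standard formula for independent redundant components. The 99% figure below is an illustrative assumption, not a vendor number:

```python
# Availability of n redundant components, each with availability a,
# assuming failures are independent (a simplifying assumption).
def parallel_availability(a: float, n: int) -> float:
    return 1 - (1 - a) ** n

single = parallel_availability(0.99, 1)  # one HBA/path: 99% available
dual = parallel_availability(0.99, 2)    # two HBAs with multipathing
print(f"single: {single:.4f}, dual: {dual:.4f}")
```

Each added component buys less improvement than the one before, which is why you have to decide how far pushing redundancy is still worth the money.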


6. Emergency Response Team

An emergency response team (ERT) [5] is a group of people who prepare for and respond to any emergency incident, such as a natural disaster or an interruption of business operations. Incident response teams are common in corporations as well as in public service organizations. This team is generally composed of specific members designated before an incident occurs, although under certain circumstances the team may be an ad-hoc group of willing volunteers.

Incident response team members typically are trained and prepared to fulfill the roles required by the specific situation. Ideally, the team has already defined a protocol or set of actions to perform to mitigate the negative effects of the incident.

You need to figure out when to declare a disaster, so define a procedure for declaring one. At least two ERT members should declare a disaster. Most of the time this is relatively easy. Why? Because it doesn't matter whether the disaster was caused by a man-made or a natural event: base your decision on consequences rather than cause.

For example, assume your mission-critical environment is unavailable, or there is data loss in your mission-critical application. Either way, this is usually severe enough to declare a disaster. When a disaster is declared, the ERT launches the DR plan. Before declaring a disaster, determine the Maximum Acceptable Outage Time (MAOT). The MAOT may range from a few hours to several days or more; it is the longest time that can be tolerated between the onset of a disaster and the resumption of a critical business process. The ERT should assess the disaster and determine whether the outage of the business's critical processes is likely to exceed the MAOT. If so, the ERT should declare a disaster.

 Decide Moment (12 hours maximum): the final moment at which it is decided that the IT infrastructure of the primary location will not recover within one business day.
 Start Fallback Scenario (48 hours maximum): start the failover to the secondary location.
 Availability of the Mission-Critical App: a working IT environment for at least 200 employees.

Example of MAOT


The scenario above is an example. It should be clear that at the decision point you do not always need 12 hours to decide whether or not you will fail over. If irreparable damage is caused by fire in your data center, it should be clear that you need to fail over.

After the ERT decides that the MAOT has been exceeded for critical processes, it invokes the DR plan.
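The decision rule behind the MAOT example can be sketched as follows. The 48-hour value echoes the example above; everything else is an illustrative assumption:

```python
# Sketch of the ERT's declare-a-disaster rule: if the estimated recovery
# of the primary site will exceed the MAOT, invoke the DR plan.
def ert_decision(estimated_recovery_hours: float, maot_hours: float = 48) -> str:
    if estimated_recovery_hours > maot_hours:
        return "declare disaster: invoke DR plan and start fallback"
    return "recover in place"

print(ert_decision(72))  # irreparable damage: fail over
print(ert_decision(6))   # short outage: repair the primary site
```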

Here are some guidelines to get your DR plan up and running:
 Arrange an emergency meeting
 Appoint an ERT Leader
 Assign other roles, such as communications
 Designate someone on the ERT to keep a logbook
 Discuss the MAOT
 Initiate recovery plans

Many organizations put emergency contact lists on laminated wallet cards, and you are more likely to have your wallet with you when a disaster strikes. Consider putting items on your card such as names, phone numbers, the URL that contains the disaster recovery procedure, and even spouse information. You might need more or less information than what I have listed here.

7. Developing a Recovery Strategy

The primary task of this step is to determine how you will achieve your disaster recovery goals for each of the systems and system components that were identified. For most organizations, the design of a recovery strategy solution is a fairly custom process. While the design principles and considerations are mainly common, designers typically have to make a number of compromises.

Backup and recovery [7] are components of business continuity. Business continuity is the term that covers all efforts to keep critical data and applications running despite any type of interruption, planned or unplanned. Planned interruptions include regular maintenance or upgrades. Unplanned interruptions include hardware or software failures, data corruption, natural or man-made disasters, viruses, and human error. Backup and recovery are essential for operational recovery; that is, recovery from errors that occur on a regular basis but are not catastrophic, such as data corruption or accidentally deleted files. Disaster recovery is concerned with catastrophic failures. Believe me, nothing is as interesting as a big failure, because that is the moment you actually learn something. When planning for backup and recovery, you should decide how much data loss you are willing to incur. You can use this decision to calculate how often you need to perform backups. Backups should be performed at fixed intervals.

The length of time between backups determines your Recovery Point Objective (RPO); that is, the maximum amount of data that you are willing to lose. You should also decide how long you are willing to wait until the data is completely restored and business applications become available. The time it takes to completely restore data and for business applications to become available is called the Recovery Time Objective (RTO). Your RTO can be different from your RPO.


After determining your recovery time and recovery point objectives, you can determine how much time you actually have to perform your backups, typically called your backup window. The backup window determines the type and level of your backups. For example, if you have a system that requires 24-hours-a-day, 7-days-a-week, 365-days-a-year availability, then there is no backup window, so you would have to perform an online backup (also known as a hot backup) in which the system is not taken offline. Lastly, as the number of backups increases, the space required to store them also increases. Therefore, you should consider how long you are required to retain your backups (also referred to as the data retention period) and plan for the appropriate amount of storage space.
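The relationship between RPO, backup frequency, and retention storage can be sketched with simple arithmetic. All figures below are illustrative assumptions:

```python
# Back-of-the-envelope sizing: the RPO bounds the backup interval,
# and the retention period drives the storage requirement.
rpo_hours = 24        # willing to lose at most one day of data
retention_days = 30   # how long each backup must be kept
backup_size_gb = 200  # assumed size of a single backup

backups_per_day = 24 / rpo_hours          # back up at least this often
stored_backups = retention_days * backups_per_day
storage_needed_gb = stored_backups * backup_size_gb
print(storage_needed_gb)  # 6000.0 GB for a month of daily backups
```

Halving the RPO doubles both the backup frequency and, all else equal, the storage you must plan for.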

When your deployment fails, you recover it by restoring it to a previously consistent state (that is, a particular point in time) from your backups. Restoring a deployment to a particular point in time is also known as a point-in-time recovery.


7.1. Types of backup

You can choose from three different backup methods. Most backup strategies use a combination of two or three of these methods:

. Full is the starting point for all other backups and contains all the data in the folders and files selected for backup. Because a full backup stores all files and folders, frequent full backups result in faster and simpler restore operations; remember that when you choose other backup types, restore jobs may take longer. It would be ideal to make full backups all the time, because they are the most comprehensive and are self-contained. However, the amount of time it takes to run full backups often prevents us from using this backup type. Full backups are often restricted to a weekly or monthly schedule, although the increasing speed and capacity of backup media are making overnight full backups a more realistic proposition.

. Incremental provides a faster method of backing up data than repeatedly running full backups. During an incremental backup only the files changed since the most recent backup are included. The time it takes to execute the backup may be a fraction of the time it takes to perform a full backup.

. Differential contains all files that have changed since the last full backup. The advantage of a differential backup is that it shortens restore time compared to an incremental backup strategy. However, the longer you run differential backups after a full backup, the larger the differential grows, and it can eventually approach or exceed the size of the baseline full backup.
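The restore-time trade-off between the three methods can be sketched by counting how many backup sets a restore has to read, assuming one full backup on day 0 and one incremental or differential per day after it (a simplified model, not from the article):

```python
# How many backup sets must be read to restore `day` days after the
# last full backup, per method.
def sets_to_restore(day: int, method: str) -> int:
    if method == "full":
        return 1                     # the latest full backup alone
    if method == "differential":
        return 1 if day == 0 else 2  # last full + last differential
    if method == "incremental":
        return day + 1               # last full + every incremental since
    raise ValueError(f"unknown method: {method}")

print(sets_to_restore(6, "incremental"))   # 7: fastest backup, slowest restore
print(sets_to_restore(6, "differential"))  # 2: the middle ground
```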

I talked about three backup types. Maybe this one doesn't belong here, but it is a definitive copy of the original data, which makes it a backup:

. Mirrored ensures your information is protected from both system and site failures. In an array, mirroring is block-level protection, so you cannot open and navigate these files in Windows Explorer.

In EMC terms, we speak of MirrorView™. It leverages the power of EMC CLARiiON® networked storage systems to offer both synchronous and asynchronous remote mirroring. Whether you mirror data around the corner or across the globe, MirrorView provides disaster recovery that protects your most critical data in the event of an outage.

Another replication method is Symmetrix Remote Data Facility (SRDF®), which is used in EMC Symmetrix® systems. SRDF provides remote replication for disaster recovery and business continuity.


7.2. Virtualized Servers and Disaster Recovery

Traditional disaster recovery plans are often very complex and difficult. The reason for this is bare metal recovery. Virtualization makes life easier for us and simplifies this environment. A virtual machine typically is stored on the host computer in a set of files, usually in a directory created by the host for that specific virtual machine. When you protect these files using your backup or replication software, you've protected the entire system. These files can then be recovered to any hardware without requiring any changes because virtual machines are hardware-independent.

Reliable disaster recovery solutions traditionally require duplicating your entire production infrastructure and, with it, your costs. With virtualization software such as VMware vSphere, you can provide rapid and reliable recovery without requiring identical hardware. Virtual machines can share the physical resources of a single computer while remaining completely isolated from each other as if they were separate physical machines. If, for example, there are three virtual machines on one physical server and one of the virtual machines crashes, the other virtual machines remain available. Isolation is an important reason why the availability and security of applications in a virtual environment are superior to applications running in a traditional, non-virtualized system. Server consolidation also lets you slash the cost of server infrastructure needed both for production and disaster recovery.

Virtualization is a must-have these days in combination with disaster recovery. You can easily test your disaster recovery plan to ensure the highest levels of reliability and availability of your entire IT infrastructure.


7.3. Other thoughts

An amazing amount of work and planning is required before you push the button and begin drafting actual recovery plans. Disaster recovery has many aspects because you may need to recover different portions of your environment, depending on the scope and magnitude of the disaster that strikes. Your worst-case scenario (an earthquake, tornado, flood, or whatever sort of disaster happens in your part of the world) can render your work facility completely damaged or destroyed, requiring the business to continue elsewhere.

But besides that, there are more business justifications for developing a recovery strategy.

– Level of attention and expertise required
– Performance impacts
– Effect of link outages
– Change Control integration

Do you know who your experts are? The ones who can provide innovative, valuable solutions to your organization (whether internal or external)? The ones who know the jargon, products, and tools of your organization? Such experts exist at every level of your organization. They are your most valuable competitive asset and also your most scarce; that scarcity is probably the greatest single factor limiting your growth. Your expert also goes home every night, and expertise is what you lose when an expert retires or goes over to the competition. Not many people have that specific experience and knowledge. Everything is learnable, but it takes a while; it could be years before you are back at the level you were before. So, take good care of your experts.

Most organizations don't have an exact copy of their data center, such as a fully automatic failover site. In most cases, it is more important to recover your data and mission-critical applications at the failover site than to have the same number of people working at the same time as before the disaster. Doing a recovery with less hardware has an impact on performance. Also be aware that there will be fewer people who can get behind a keyboard.

Things will not always go according to plan. That's a fact, and it's the whole reason for this article. Be prepared for your recovery plan not going as planned, and try to anticipate as much as possible the things that can go wrong.

If your organization is working according to ITIL, you are probably working with Change Management. When a disaster strikes, a lot is going on. Try to fit in Change Management in an appropriate manner given the circumstances.


8. Testing Recovery Plans

Traditional recovery plans are often difficult to test, difficult to keep up to date, and depend on exact execution of complex, manual processes. In a virtualized environment, testing is simpler because you can execute non-disruptive tests using existing resources. Hardware independence eliminates the complexity of maintaining the recovery site by eliminating failures due to hardware differences.

But still, your organization is changing by the day and servers are added and deleted. Maybe there are needs that require adding mission-critical applications or simply merging with another organization. The fact remains, changes occur every day and these changes have an enormous impact on your DR plan. After you develop the DR plan, you need to put it through progressively intense cycles of testing. If an organization needs to trust its very survival to the quality and accuracy of a DR plan, you need to test that plan to be sure that it actually works. In disasters, you rarely get second chances.

DR plans contain lists of procedures to follow when a natural or man-made disaster occurs. The purpose of the plan is to recover the IT applications and infrastructure that support business-critical processes. When disaster hits you, it hits hard. You seldom can clearly tell whether those disaster plans will actually work. And given the nature of disasters, if your disaster plan fails, the organization may not survive the disaster.

When you test your disaster plan, note anything that does not go according to plan, and then pass the plan back to the people who designed it so they can update it. This process improves the quality and accuracy of the disaster plan. Realistic, periodic testing of the recovery plan is therefore necessary, and required to succeed in your mission.

Another thought is whether you can test and maintain protection simultaneously. What happens when you start a test on Friday and are not back up and running on Monday? It is important that this is included in your plan. Ask yourself every time "what if", and be prepared for the worst that can happen. A good start is probably to fragment your recovery plan into small pieces: start by destroying one server and see whether it can be restored.


9. Role of virtualization

The business world has undergone an enormous transformation over the past 20 years. Business process after business process has been captured in software and automated, moving from paper to electrons.

In today's world, virtually every strategic business decision has an IT implication. Market forces continue to accelerate in every region of the world, and across every industry, putting increasing pressure on IT departments to be more responsive and help organizations stay competitive and pursue new opportunities at lower cost.

Virtualization is rapidly transforming the IT landscape and fundamentally changing the way companies compute. Virtualization is the catalyst that makes IT-as-a-Service a reality. It is the enabling technology on which cloud computing architectures are and will be built. Whether you have virtualized all of your IT assets and applications or you are just starting out, you are on your way to transforming to a new model for IT.

Before virtualization, IT organizations ran one application per physical server, so cost-per-server was a quick way to compare costs; it was a one-to-one relationship. As a result, many data centers have machines running at only 10 or 15 percent of total processing capacity. In other words, 85 or 90 percent of the machine's power is unused. It isn't rocket science to recognize that this situation is a waste of resources. But once you virtualize, many applications (each in its own virtual machine) run on each physical server; it is now a many-to-one relationship.
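The consolidation math behind that many-to-one relationship is simple. The utilization figure echoes the 10-15 percent mentioned above; the server count and target are illustrative assumptions:

```python
# If the total amount of work stays constant, fewer hosts can run hotter
# after virtualization.
physical_servers = 20
avg_utilization = 0.12     # 12% busy per server before virtualization
target_utilization = 0.60  # an assumed, still-conservative target

hosts_needed = round(physical_servers * avg_utilization / target_utilization, 1)
print(hosts_needed)  # 4.0 hosts, i.e. a 5:1 consolidation ratio
```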

When a server hosts a number of virtual machines, it faces much higher levels of demand for system resources than would be presented by a single operating system running a single application. Obviously, with more virtual machines running on the server, there will be more demand for processing; even with two or more processors, virtualization can outstrip the processing capability of a traditional commodity server. There will also be far higher storage and network traffic, as each virtual machine transmits and receives as much data as a single operating system would in the old "one application, one server" model. Furthermore, because virtualization makes the robustness of hardware more important, most IT organizations seek to avoid Single Point of Failure (SPoF) situations by implementing redundant resources in their servers: multiple network cards, multiple storage cards, extra memory, and multiple processors, all doubled or even tripled in an effort to avoid a situation where a number of virtual machines stall due to the failure of a single hardware resource.


9.1. Role of VMware

As virtualization is now a critical component of an overall IT strategy, it is important to choose the right vendor. VMware is the leading business virtualization infrastructure provider, offering the most trusted and reliable platform for building a solid IT infrastructure and private and public clouds.

VMware [6] stands alone as a leader. While challengers like Microsoft and Citrix are emerging, VMware has a tremendous head start in this market. It is clearly ahead in understanding the market, and is ahead in product strategy, business model, and technology innovations.

Why VMware? VMware:
. Is built on a robust, reliable foundation, proven over many years
. Delivers a complete virtualization platform, from the desktop through the data center out to public clouds
. Provides the most comprehensive virtualization and cloud management
. Integrates with your overall IT infrastructure
. Is proven by more than 190,000 customers

VMware has invested in technologies to achieve very high virtual machine density on VMware vSphere. VMware supports more guest operating systems than any other bare-metal virtualization platform (as of 2010). The superior performance of VMware vSphere with unmodified (fully virtualized) guests, made possible by VMware's exclusive binary translation technology, means that VMware vSphere can run off-the-shelf operating systems with near-native performance. No other virtualization platform achieves the high virtual machine density of VMware vSphere while maintaining consistent, high application performance across all running virtual machines.

With VMware you can lower your operational costs. You can directly reduce your operational costs by using the dynamic IT services built into VMware vSphere that most other competitors do not offer.


Most common, for example, are [8]:

. High Availability (HA): VMware HA provides uniform, cost-effective failover protection against hardware and operating system failures within your virtualized IT environment.

. Distributed Resource Scheduler (DRS): VMware DRS continuously balances computing capacity in resource pools to deliver the performance, scalability, and availability not possible with physical infrastructure.

. vMotion: VMware vMotion uses VMware's cluster file system to control access to a virtual machine's storage. During a vMotion, the active memory and precise execution state of a virtual machine are rapidly transmitted over a high-speed network from one physical server to another, and access to the virtual machine's disk storage is instantly switched to the new physical host. Since the network is also virtualized by the VMware host, the virtual machine retains its network identity and connections, ensuring a seamless migration process.

. Storage vMotion: VMware Storage vMotion enables live migration of virtual machine disk files across heterogeneous storage arrays with complete transaction integrity and no interruption in service for critical applications.

. Site Recovery Manager (SRM): VMware vCenter Site Recovery Manager eliminates complex manual recovery steps and removes the risk and worry from disaster recovery.

. Fault Tolerance (FT): VMware Fault Tolerance provides continuous availability for applications in the event of server failures by creating a live shadow instance of a virtual machine that runs in virtual lockstep with the primary instance.

. Find more on http://www.vmware.com/products/

VMware is the proven choice for virtualization from the desktop to the data center. Small and midsize businesses run on VMware. More than 190,000 customers of all sizes, including all of the Fortune 100, trust VMware as their virtualization infrastructure platform. That must mean something!


9.2. Role of EMC

The digital universe is still growing, even during a global economic downturn. The creation and replication of digital information set a record in 2009 by growing to 800 billion gigabytes, more than 60% over the previous year. People continue to take pictures, send e-mail, blog, and post videos. Organizations are still adding information. Governments are still requiring more information to be kept. And that's only the beginning of what's to come.

That's nice for business, and undoubtedly so for storage vendors. But it's not just about storing data; it's more about innovation, protection, optimization, and leveraging information.

In 2003, EMC, the world leader in information storage and management acquired VMware. Joe Tucci, EMC President and CEO, said, "Customers want help simplifying the management of their IT infrastructures. This is more than a storage challenge. Until now, server and storage virtualization have existed as disparate entities. Today, EMC is accelerating the convergence of these two worlds." Was he wrong?

I have the privilege to work with nice things related to EMC and VMware every day. And it is just amazing how easy things are to integrate. Let me give you a great example of the products EMC builds in relation to VMware.

 EMC Unified Storage vCenter plug-in

This plug-in is a must-have in combination with vSphere. With EMC's second-generation vCenter plug-in family (Virtual Storage Integrator, the CLARiiON plug-in, and the NFS Plug-in), EMC gives VMware administrators the ability to simplify visibility, provisioning, and management of EMC storage through the VMware lens. From VMware vCenter, administrators can leverage array functions to increase the efficiency of their VMware environment and to hardware-accelerate VM deployment.

The plug-in documentation can be downloaded at: http://www.mikes.eu/download/EMC Plug-in for VMware vCenter.pdf

Integration is good. EMC offers direct integration and management capability of its systems from VMware's management suite by making use of APIs. EMC and VMware integration makes things simpler and more efficient.

Without discussing the products in depth, I don't want to keep information from you; see the overview below.


Product Families [9]

Hardware

. Celerra: Bring powerful, high-availability unified storage to your organization in convenient integrated models and flexible gateways. All are easy to deploy and manage. Plus, simplify management with powerful software.

. CLARiiON: Get the high availability, scalability, and flexibility you need to manage and consolidate more data. Combine easy-to-use midrange networked storage with innovative technology and robust software capabilities.

. Connectrix®: Move your organization's vital information where it needs to go quickly, easily, and reliably. Advanced directors and switches make it happen. Get best-in-class availability and easy management.

. Centera®: Store and manage your "fixed content" (unchanging digital assets) and keep it available online and accessible, all with EMC Centera content-addressed storage (CAS) systems. Be ready for growth with petabyte scalability.

. …: Store, protect, and share your valuable data with reliable and easy-to-use storage solutions for home and small business.

. Symmetrix: Make high-end networked storage part of your information infrastructure with systems that take performance, availability, and security to new heights. Manage and protect your information today and expand in the future.

. VPLEX™: Deploy next-generation architecture to enable simultaneous information access within, between, and across data centers.

Software

. Atmos™: Build your own cloud services or leverage a public cloud to deliver content and information services anywhere in the world with EMC Atmos.

. Ionix™: Simplify and automate key tasks, such as discovery, monitoring, reporting, planning, and provisioning, for even the largest, most complex storage environments.

. PowerPath®: Host-based solutions including multipathing, data migration, and host-based encryption.


9.3. Role of VMware Site Recovery Manager

The beautiful part of VMware Site Recovery Manager (SRM) is that you can test a plan without running it live. With SRM you can fail over at any time without damaging the infrastructure environment.

SRM [8] provides business continuity and disaster recovery protection for virtual environments. Protection can extend from individual replicated datastores to an entire virtual site. VMware's virtualization of the data center offers advantages that can be applied to business continuity and disaster recovery:

- The entire state of a virtual machine (memory, disk images, I/O, and device state) is encapsulated, so it can be saved to a file. This allows an entire virtual machine to be transferred to another host.

- Hardware independence eliminates the need for a complete replication of hardware at the recovery site. Hardware running VMware ESX at one site can provide business continuity and disaster recovery protection for hardware running VMware ESX at another site. This eliminates the cost of purchasing and maintaining a system that sits idle until disaster strikes.

- Hardware independence allows an image of the system at the protected site to boot from disk at the recovery site in minutes or hours instead of days.

SRM leverages array-based replication between a protected site and a recovery site. The workflow built into SRM automatically discovers which datastores are set up for replication between the protected and recovery sites. SRM can be configured to support bidirectional protection between two sites.

SRM provides protection for the operating systems and applications encapsulated by the virtual machines running on VMware ESX. An SRM server must be installed at the protected site and at the recovery site. The protected and recovery sites must each be managed by their own vCenter Server.

Implementing an SRM solution is almost "too easy". But as you have read so far, it is not only about the software you are using. The software, which will make your life a lot easier, is not the most important piece of the puzzle. Keep in mind the first eight chapters of this article; they matter more than the software.

VMWARE IS A TRUE ENABLER FOR DISASTER RECOVERY


10. VMware Site Recovery Manager

Downtime is expensive! Disaster preparedness and recovery planning is an iterative process, not a one-time event. You need to revisit disaster recovery plans continually to ensure they remain aligned with current business goals, and test those plans regularly to ensure that they perform as expected.

VMware Site Recovery Manager [8] provides business continuity and disaster recovery protection for virtual environments. A Site Recovery Manager environment involves two sites: a protected (primary) site and a recovery (secondary) site. Protection groups containing protected virtual machines are configured on the protected site, and these virtual machines can be recovered by executing recovery plans on the recovery site. The illustration below depicts how it operates at a very high level.

Site Recovery Manager uses a database at both the protected and recovery sites to store information. The protected site's database stores the protection group settings and protected virtual machines, while the recovery site's database stores the recovery plan settings.

VMware Site Recovery Manager changes the way disaster recovery plans are designed and executed by reducing them to two simple steps: protection and recovery.

Protection involves the following operations:
- Array manager configuration
- Inventory mapping
- Creating a protection group

Recovery involves the following operations:
- Creating a recovery plan
- Test recovery
- Real recovery
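As a mental model, the two phases above can be sketched in a few lines of Python. All class and method names here are hypothetical illustrations of the concepts, not SRM's actual object model or API:

```python
from dataclasses import dataclass, field

@dataclass
class ProtectionGroup:
    """A set of replicated datastores and the VMs that live on them (illustrative)."""
    name: str
    datastores: list
    vms: list

@dataclass
class RecoveryPlan:
    """Ordered recovery steps for one or more protection groups (illustrative)."""
    name: str
    groups: list = field(default_factory=list)

    def test(self):
        # Test mode: bring VMs up from a snapshot of the replica on an
        # isolated network; production and replication keep running.
        return [f"test-boot {vm}" for g in self.groups for vm in g.vms]

    def run(self):
        # Real failover: promote the replica LUNs, then power on the
        # protected VMs at the recovery site.
        return [f"failover {vm}" for g in self.groups for vm in g.vms]

crm = ProtectionGroup("crm", ["repl-lun-01"], ["crm-db", "crm-app"])
plan = RecoveryPlan("site-a-to-b", [crm])
print(plan.test())  # dry run; the infrastructure is untouched
```

The point of the sketch is the separation of concerns: protection groups describe *what* is replicated, while the recovery plan describes *how and in what order* it comes back.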


The vCenter Server must be installed at both the protected site and recovery site, as well as an SQL Server or Oracle Database server.

Note: See the Site Recovery Manager Compatibility Matrixes documentation for a list of supported servers and databases.

Each site has an inventory of virtual machines that reside on array-based replicated LUNs (logical unit numbers), disk volumes in a storage array that are identified numerically. Before installing SRM, install the Storage Replication Adapter (SRA) for your storage and storage replication environment. The SRA is software that integrates your storage device with SRM. Because SRM interacts with arrays from a variety of storage vendors, consult your storage vendor's documentation for array-specific information used during SRM installation and configuration. The SRAs that storage vendors have created for Site Recovery Manager can be downloaded from the .com website.

Note: See the Site Recovery Manager Storage Partner Compatibility Matrixes for a list of supported SRAs.

Optimally, SRM is installed bidirectionally, so that each site serves as a recovery site for the other. The two sites should be a significant geographic distance from each other. The protected and recovery sites must be in a networked configuration that allows TCP connectivity. Each site consists of a vCenter Server, a Windows machine that runs the vCenter service. The SRM Server is installed with each vCenter Server; it hosts Site Recovery Manager and array management technology, and it also serves the SRM plug-in to the VI Client. Management is done from the vCenter client on the protected site. SRM uses block-based replication with SRAs installed on the SRM Server. This integration of hardware and software supports the most demanding business continuance needs, in this case a failover following a disaster.
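Since TCP connectivity between the two sites is a hard prerequisite, a quick reachability check before installation can save troubleshooting time later. A minimal sketch using only the Python standard library; the hostname and port below are placeholders, and the actual ports SRM and vCenter use are listed in the SRM installation documentation:

```python
import socket

def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, host unreachable, and timeouts.
        return False

# Placeholder endpoint: substitute your real vCenter/SRM servers and the
# ports documented by VMware for your SRM version.
endpoints = [("vcenter-recovery.example.com", 443)]
for host, port in endpoints:
    print(host, port, "OK" if tcp_reachable(host, port) else "UNREACHABLE")
```

Running such a check from each site toward the other verifies that firewalls and routing between the protected and recovery networks actually permit the required connections.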

Replication, Replication, Replication - Technology

SRM works only in combination with a replication technology. Data replication, however, is a growing challenge. Working to achieve higher levels of data availability, storage administrators increasingly create multiple copies of business-critical data to recover quickly from disasters. As data centers attempt to maintain data availability in the event of local catastrophes while globally servicing customers, multiple copies of data must also be efficiently distributed and synchronized to other data centers.


There are several replication techniques that can be used with VMware SRM; a compatibility matrix lists the supported vendors. The strengths that SRM delivers are to:
- Remove manual recovery complexity through automation
- Provide central management of recovery plans and protection groups
- Simplify and automate disaster recovery workflows

Replication in combination with VMware and EMC comes in a few flavors, such as:

EMC SRDF [9] EMC Symmetrix Remote Data Facility (SRDF) provides remote replication for disaster recovery and business continuity.

See: http://www.emc.com/products/detail/software/srdf.htm

EMC MirrorView [9] EMC MirrorView ensures your information is protected from both system and site failures. It leverages the power of EMC CLARiiON networked storage systems to offer both synchronous and asynchronous remote mirroring.

See: http://www.emc.com/products/detail/software/mirrorview.htm


EMC Celerra Replicator [9] EMC Celerra Replicator provides efficient, asynchronous data replication over Internet Protocol (IP) networks.

See: http://www.emc.com/products/detail/software/celerra-replicator.htm

EMC RecoverPoint [9] EMC RecoverPoint brings you continuous data protection and continuous remote replication for on-demand protection and recovery to any point in time. RecoverPoint's advanced capabilities include policy-based management, application integration, and bandwidth reduction.

See: http://www.emc.com/products/detail/software/recoverpoint.htm


Plans

The next steps involve making plans to configure Site Recovery Manager. Creating and managing recovery plans directly from vCenter is powerful and easy. Site Recovery Manager provides an intuitive interface to help users create recovery plans for different failover scenarios and different parts of their infrastructure. Users can specify virtual machines to be suspended or shut down. They can also specify the order in which virtual machines are powered on or shut down, set user-defined scripts to execute automatically, and determine where to pause the recovery process if necessary. These steps are beyond the scope of this article; refer to VMware and storage vendor documentation for details. There is also a lot to find in the communities. Basically, it comes down to this:

- Deploy Site Recovery Manager (SRM) at both the protected and recovery sites.

- Install the Storage Replication Adapters (SRA) on the same server as SRM at both the protected and recovery sites. Install the SRM plug-in on the protected and recovery vCenter servers.

- Set up connections between the protected and recovery sites.

- Configure the Array Manager so that SRM knows about the storage arrays.

- Create one or more Protection Groups that contain the replicated LUNs and the associated virtual machines that host the mission-critical applications.

- Create a Recovery Plan associated with a Protection Group, so that in the event of a failover, the recovery site knows the relationship between virtual machines and the failed-over storage.

- Run a test failover to verify functionality.
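The recovery-plan behavior described above, a user-defined power-on order with optional pause points, can be sketched in a few lines. Everything here is illustrative (tier numbers, VM names, and the function itself are assumptions, not SRM's real interface):

```python
# Sketch of recovery-plan ordering: VMs are grouped into priority tiers
# and booted tier by tier; the operator can require confirmation before
# a given tier starts. Names and tiers are illustrative only.
def execute_plan(tiers, pause_before=()):
    log = []
    for tier, vms in sorted(tiers.items()):
        if tier in pause_before:
            log.append(f"pause: operator confirmation before tier {tier}")
        for vm in vms:
            log.append(f"power-on {vm}")
    return log

# Domain controller first, then the database, then app and web tiers;
# pause before tier 3 so an operator can verify the database is up.
tiers = {1: ["dc-01"], 2: ["sql-01"], 3: ["app-01", "web-01"]}
for step in execute_plan(tiers, pause_before={3}):
    print(step)
```

Ordering matters in practice: infrastructure services such as directory and database servers must be running before the applications that depend on them are powered on.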


11. Standardization

A good start is to ask: what are standards? A standard is a definition or format that has been approved by a recognized standards organization or is accepted as a de facto standard by the industry. Standards exist for programming languages, operating systems, data formats, communications protocols, and so forth.

Standards are extremely important in the computer industry because they allow the combination of products from different manufacturers to create a customized system. Without standards, only hardware and software from the same company could be used together. In addition, standard user interfaces can make it much easier to learn how to use new applications.

A lot of organizations are committed to an open, standards-based approach to interoperability so that customers can implement solutions that meet their individual needs. It is important to create a policy built on the basic concepts of standardization. Stability, future-proofing, controlled innovation, and security are essential.

VMware is committed to an open, standards-based approach to licensing and interoperability so that customers can implement virtualization-based solutions that meet their individual needs. Whether you have virtualized all of your IT assets and applications or you are just starting out, you are on your way to transforming to a new 'standard' model for IT.


12. Conclusion

We started with the sentence, and now we end with it: "Information is the organization's most important asset."

Given that, the information must be protected. We must look carefully at which information we protect, because there is no point in protecting your entire infrastructure environment. You must classify your data; otherwise everything gets protected the same way. Without classification everything is important, and you don't want that.

When disaster strikes, it hurts, one way or the other. If a disaster hits an organization without a disaster recovery plan, that organization has very little chance of recovery. Organizations that do have DR plans may still have a difficult time when a disaster strikes. You may have to put in considerable effort to recover time-sensitive critical business functions. But if you have a disaster recovery plan, you have a chance at survival.

It is a common misconception that most of the threats to continuity are a result of natural disaster. Statistically, these threats account for less than 1% of IT service unavailability. This finding indicates that you should mainly focus on other things than just natural disasters.

Doing nothing isn't an option, because it can damage your company in many ways. For example:
- Financial/cash flow/revenue loss
- Legal/regulatory exposure
- Life-threatening issues (in hospitals, for example)
- Reputation damage

A good disaster recovery plan is like an information insurance policy for a business. It is the ability to continue work after any number of catastrophic problems, ranging from natural disasters such as flood, fire, or earthquake to planned or unplanned scenarios like database corruption, server failures, or simply human error. Disaster recovery is becoming an increasingly important concern for organizations. Besides being a must-have for the survival of your organization, a disaster recovery plan has further benefits: improved business processes, improved technology, fewer disruptions, higher quality services, and competitive advantage.

The maximum length of time a business function can be discontinued without causing irreparable damage to the business is called the Maximum Tolerable Downtime (MTD). This value must be within the MAOT given by management. After setting targets for MTD, you must set targets for the Recovery Point Objective (RPO) and Recovery Time Objective (RTO) of each process. You need these when disaster strikes: then you can give some sort of guarantee of how much data will be lost and how long it will take before you are back online.
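The relationship between these targets is simple arithmetic and worth checking explicitly: worst-case data loss equals the replication or backup interval, and the RTO must fit inside the MTD. A minimal sanity check with purely illustrative numbers:

```python
# All values in minutes; the figures below are illustrative examples,
# not recommendations.
replication_interval_min = 15   # asynchronous replication cycle
rpo_target_min = 15             # max tolerable data loss
rto_target_min = 120            # max tolerable time to restore service
mtd_min = 240                   # maximum tolerable downtime (from the BIA)

# Worst-case data loss is one full replication interval, so the
# interval must not exceed the RPO target.
assert replication_interval_min <= rpo_target_min, "RPO unachievable"

# Time to restore must fit within what the business can tolerate.
assert rto_target_min <= mtd_min, "RTO must fit within the MTD"

print("targets are internally consistent")
```

If either assertion fails, the fix is a business decision: replicate more often, invest in faster recovery, or renegotiate the targets with management.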

Make sure you have an Emergency Response Team ready. This ERT is a group of people prepared for any emergency or big incident, such as a natural disaster or an interruption of business operations. Emergency Response Team members typically are trained and prepared to fulfill the roles required by the specific situation. Ideally the team has already defined a protocol or set of actions to perform to mitigate the negative effects of the incident.


Traditional disaster recovery plans are often very complex and difficult. As virtualization is now a critical component to an overall IT strategy, it is important to choose the right vendor. Avoid unnecessary risk and overhead when choosing a robust and production-proven hypervisor for your virtualized datacenter.

Not all hypervisors are equal. VMware has a true enabler for disaster recovery named VMware Site Recovery Manager (SRM). VMware Site Recovery Manager is a business continuity and disaster recovery solution that helps you plan, test, and execute a scheduled migration or emergency failover of datacenter services from one site to another. And as mentioned earlier, you can test a recovery plan without ruining anything: you can fail over at any time without damaging the infrastructure environment. Virtualization these days can make disaster recovery implementations easy.

As the leader goes, so goes the organization. A disaster recovery plan needs executive sponsorship; without it, the plan is not feasible. Executives are responsible for decisions relating to an organization's direction, strategy, and financial commitment, and they approve the purchase of hardware or software. Finally, the executive sponsorship role is needed to make decisions about a company's policies, procedures, and strategic direction. Ensure that the plan has the attention of executive management; when it does, it is more broad-based and probably more successful.

Disaster recovery and business continuity are extremely complex, which is often the reason companies hold back on a recovery strategy. What I try to convey with this article is that we should not make disaster recovery too complicated. We can, but it isn't necessary. The most important issue is that data is protected and that we can provide this data to the organization quickly. Surely we must consider risks and do everything to prevent them, but this should not be your main concern. Your concern is to return to daily business as quickly as possible.

Virtualization is a true enabler to recover after a disaster. Costs are relatively low and it is very easy to integrate this into your infrastructure.


References
[1] Peter Gregory, IT Disaster Recovery Planning for Dummies
[2] EMC Information Availability Design and Management course
[3] Hank Marquis (2006), http://www.hankmarquis.com/articles.html
[4] http://dependencymapping.com/
[5] http://en.wikipedia.org
[6] Thomas J. Bittman, Philip Dawson, George J. Weis, Magic Quadrant for x86 Server Virtualization Infrastructure, Gartner RAS Core Research Note G00200526, 26 May 2010
[7] EMC® Documentum® Content Server Backup and Recovery White Paper, version 6.5, January 2010
[8] VMware, http://www.vmware.com
[9] EMC, http://www.emc.com

Disclaimer: The views, processes or methodologies published in this article are those of the author. They do not necessarily reflect EMC Corporation's views, processes or methodologies.

EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.

THE INFORMATION IN THIS PUBLICATION IS PROVIDED "AS IS." EMC CORPORATION MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Use, copying, and distribution of any EMC software described in this publication requires an applicable software license.
