Above The Clouds - Best practices to Create a Sustainable Computing Infrastructure to Achieve Business Value and Growth

Paul Brant

EMC Proven Professional Knowledge Sharing 2010

Paul Brant EMC Corporation [email protected]

Table of Contents

Table of Contents ...... 2

Table of Figures ...... 8

Table of Equations ...... 9

Abstract ...... 11

Introduction ...... 13

Sustainability Metrics and what does it mean to be Sustainable ...... 18
The Challenges to achieve Sustainability ...... 20
The concept of Green ...... 23

IT Sustainability and how to measure it ...... 24

The Carbon Footprint ...... 26

Environment Pillar- Green Computing, Growing Sustainability...... 31

Standards and Regulations ...... 32
Best Practice – In the US, Consider Executive Order 13423 and energy-efficiency legislation regulations ...... 32
Best Practice – Use tools and resources to understand environmental impacts ...... 33
IT facilities and Operations ...... 34
Best Practice – Place Data Centers in locations of lower risk of natural disasters ...... 35
Best Practice – Evaluate Power GRID and Network sustainability for IT Data Centers ...... 37
Effectiveness Pillar ...... 39

Services and Partnerships ...... 39

Tools and Best Practices ...... 40

Efficiency Pillar ...... 41

Information Management ...... 42

Best Practice – Implement integrated Virtualized management into environment ...... 42
Best Practice - Having a robust Information Model ...... 44
Best Practices in Root Cause Analysis ...... 46

Best practice - Effective root-cause analysis technique must be capable of identifying all those problems automatically ...... 47
Best Practice – Rules-based correlation using CCT ...... 48
Best Practice - Reduction of Downstream Suppression ...... 49
Self Organizing Systems ...... 50
Best Practice - Dynamic Control in a Self Organized System ...... 53
Best Practice – Utilize STR’s when implementing adaptive controllers ...... 54
Best Practice – Require System Centric Properties ...... 55
Application ...... 57

Best Practice – Architect a designed for Run solution ...... 58
Storage ...... 59

Compression, Archiving and Data Deduplication ...... 60

Autonomic self healing systems ...... 66

Storage Media – Flash Disks ...... 69

Best Practice – Utilize low power flash technologies ...... 70
Server Virtualization ...... 70

Best Practice – Implement DRS ...... 71
Network ...... 72
Best Practice - Architect Your Network to Be the Orchestration Engine for Automated Service Delivery (The 5 S’s) ...... 72

Scalable: ...... 73

Simplified: ...... 73

Standardized: ...... 73

Shared: ...... 73

Secure: ...... 73
Best Practice - Select the Right Network Platform ...... 74

Cloud network infrastructure ...... 74

Cloud network operating system (OS) ...... 74

Cloud network management systems ...... 74

Cloud network security ...... 74
Best Practice – Consider implementing layer 2 Locator/ID Separation ...... 75
Best Practice – Build a Case to Maximize Cloud Investments ...... 76

Best Practice - Service Providers - Maximize and sustain Cloud Investments ...... 76
Best Practice - Enterprises - maximize and sustain Cloud Investments ...... 77
Best Practice – Understand Information Logistics and Energy transposition tradeoffs ...... 77
Infrastructure Architectures ...... 80

Data Center Tier Classifications ...... 81
Cloud Overview ...... 84

Cloud Layers of Abstraction ...... 87
Cloud Type Architecture(s) Computing Concerns ...... 88

Failure of Monocultures: ...... 89

Convenience vs. Control ...... 89

General distrust of external service providers ...... 89

Concern to virtualize the majority of servers and desktop workloads ...... 90

Fully virtualized environments are hard to manage ...... 90

Many environments can't be virtualized onto x86 and hypervisors ...... 90

Concerns on security ...... 91

Industry Standards ...... 91

Applications support for virtualized environments, or only the one the vendor sells ...... 91

Environmental Impact Concerns ...... 92

Threshold Policy Concerns ...... 93

Interoperability issues Concerns ...... 93

Hidden Cost Concerns ...... 93

Unexpected behavior concerns ...... 95
Private Cloud ...... 96

Best Practice – Implement a dynamic computing infrastructure ...... 98

Best Practice – Implement an IT Service-Centric Approach ...... 99
Best Practice – Implement a self-service based usage Model ...... 99

Best Practice – Implement a minimally or self-managed platform ...... 100

Best Practice – Implement a consumption-based billing methodology ...... 100
Public Cloud ...... 101
Community Cloud ...... 101

Best Practice in Community Cloud – Use VM’s ...... 106

Best Practice in Community Cloud – Use Peer to Peer Networking ...... 106

Best Practice in Community Cloud – Distributed Transactions ...... 106

Best Practice in Community Cloud – Distributed Persistence Storage ...... 107
Challenges in the federation of Public and Private Clouds ...... 107

Lack of visibility ...... 108

Multi-tenancy Issues ...... 108

Cloud computing needs to cover its assets ...... 109
Warehouse Scale Machines - Purposely Built Solution Options ...... 109

Best Practice – WSC’s must achieve high availability ...... 111

Best Practice - WSC’s must achieve cost efficiency ...... 111

WSC (Warehouse Scale Computer) Attributes ...... 112

One vs. Several Data Centers ...... 112

Best Practice – Use Warehouse Scale Computer Architecture designs in certain scenarios ...... 113

Architectural Overview of WSC’s ...... 113

Best Practice – Connect Storage Directly or via NAS in WSC environments ...... 114

Best Practice – WSC should consider using non-standard Replication Models ...... 114

Networking Fabric ...... 114

Best Practice – For WSC’s Create a Two level Hierarchy of networked switches ...... 115

Handling Failures ...... 115

Best Practice - Use Sharding and other requirements in WSC’s ...... 116

Best Practice – Implement application specific compression ...... 118
Utility Computing ...... 119
Grid computing ...... 123
Cloud Type Architecture Summary ...... 123

Infrastructure and more ...... 124

Amazon Web services ...... 124

Cloud computing ...... 125

Grid Computing ...... 125

Similarities and differences ...... 126
Business Practices Pillar ...... 127

Process Management and Improvement...... 128

Best Practice - Provide incentives that support your primary goals: ...... 128

Best Practice - Focus on effective resource utilization ...... 129

Best Practice - Use virtualization to improve server utilization and increase operational efficiency ...... 129

Best Practice - Drive quality up through compliance: ...... 130

Best Practice - Embrace change management ...... 131

Best Practice - Invest in understanding your application workload and behavior: ...... 133

Best Practice - Right-size your server platforms to meet your application requirements 133

Best Practice - Evaluate and test servers for performance, power, and total cost of ownership ...... 134

Best Practice - Converge on as small a number of stock-keeping units (SKUs) as you can ...... 134

Best Practice - Take advantage of competitive bids from multiple manufacturers to foster innovation and reduce costs ...... 135
Standards ...... 135

Best Practice - Use standard interfaces to Cloud Architectures ...... 137
Security ...... 138

Best Practice – Determine if cloud vendors can deliver on their security claims ...... 139

Best Practice - Adopt federated identity policies backed by strong authentication practices ...... 139

Best Practice – Preserve segregation of administrator duties ...... 140

Best Practice - Set clear security policies ...... 141

Best Practice - Employ data encryption and tokenization ...... 141

Best Practice - Manage policies for provisioning virtual machines...... 142

Best Practice – Require transparency into cloud operations to ensure multi-tenancy and data isolation ...... 142
Governance ...... 143

Best Practices – Do your due diligence of your SLA’s ...... 143
Compliance ...... 146

Best Practice - Know Your Legal Obligations ...... 146

Best Practice - Classify / Label your Data & Systems ...... 146

Best Practice - External Risk Assessment ...... 146

Best Practice - Do Your Diligence / External Reports ...... 147

Best Practice - Understand Where the Data Will Be! ...... 147

Best Practice - Track your applications to achieve compliance...... 148

Best Practice - With off-site hosting, keep your assets separate...... 148

Best Practice - Protect yourself against power disruptions ...... 149

Best Practice - Ensure vendor cooperation in legal matters ...... 149
Profitability ...... 150

Business and Profit objectives to achieve Sustainability ...... 151

Best Practice - Consumer Awareness and Transparency ...... 151

Best Practice – Implement Efficiency Improvement ...... 152

Best Practice - Product Innovation ...... 152

Best Practice - Carbon Mitigation ...... 152

Information Technology Sector Initiatives ...... 152

Best Practice – Virtualization ...... 153

Best Practice - Recycling e-Waste ...... 153

Cloud Profitability and Economics ...... 153

Cloud Computing Economics ...... 158

Best Practice – Consider Elasticity as part of the business deciding metrics ...... 158
Economics Pillar ...... 162
Best Practice – Consider Efficiency as only one part of the Economic Sustainable equation ...... 164
Conclusion ...... 164

Appendix A – Green IT, SaaS, Cloud Computing Solutions ...... 165

Appendix B – Abbreviations ...... 168

Appendix B – References ...... 173

Author’s Biography ...... 175

Index ...... 176

Table of Figures

Figure 1 - Sustainability and Technology Interest Trends ...... 13
Figure 2 – Achieving IT Sustainability ...... 19
Figure 3 - US Energy Flows (Quadrillon BTUs)[21] ...... 20
Figure 4 – Efficiency - Megawatts to Infowatts to Business Value Solutions ...... 22
Figure 5 – IT Data Center Sustainability Taxonomy ...... 27
Figure 6 – Top Level Sustainability Ontology (see notes below) ...... 30
Figure 7 - U.S. Federal Emergency Management Agency – Disaster MAP[22] ...... 35
Figure 8 - U.S. Geological Survey Seismological Zones (0=lowest, 4=Highest) ...... 36
Figure 9 - U.S. NOAA Hurricane Activity in the United States ...... 37
Figure 10 - Opportunities for efficiency improvements ...... 40
Figure 11 - EPA Future Energy Use Projections ...... 42
Figure 12 – Closed Loop System ...... 51
Figure 13 – Sustainability Ontology – Self Organizing Systems ...... 52
Figure 14 – A Sustainable Information transition lifecycle ...... 59

Figure 15 – Self Organized VM application controller ...... 71
Figure 16 - Energy in Electronic Integrated Circuits ...... 78
Figure 17 - Moore's Law - Switching Energy ...... 79
Figure 18 - Data by physical vs. transfer ...... 80
Figure 19 – Sustainability Ontology – Infrastructure Architectures ...... 84
Figure 20 - Cloud Topology ...... 86
Figure 21 - Cloud Computing Topology ...... 87
Figure 22 - Using a Private Cloud to Federate disparate architectures ...... 98
Figure 23 - Community Cloud ...... 102
Figure 24 - Community Cloud Architecture ...... 105
Figure 25 – Sustainability Ontology – Business Practices ...... 128
Figure 26 - A continuous process helps maintain the effectiveness of controls as your environment changes ...... 131
Figure 27 - Consistent and well-documented processes help ensure smooth changes in the production environment ...... 132
Figure 28 – Provisioning for peak load ...... 160
Figure 29 – Under Provisioning Option 1 ...... 160
Figure 30 – Under Provisioning Option 2 ...... 161

Table of Equations

Equation 1 – Computing Energy Efficiency ...... 25
Equation 2 – Computing Energy Efficiency-Detailed ...... 25
Equation 3 – Computing Energy Efficiency-Detailed as a function of PUE ...... 25
Equation 4 – IT Long Term Sustainability Goal ...... 28
Equation 5 – What is Efficient IT ...... 39
Equation 6 – Linear model of a Control System ...... 54
Equation 7 – Energy Consumed by a CMOS ASIC ...... 78
Equation 8 – Power Consumed by a CMOS ASIC ...... 78
Equation 9 – Cloud Computing - Cost Advantage ...... 156
Equation 10 – Cloud Computing - Cost tradeoff for demand that varies over time ...... 156

Disclaimer: The views, processes or methodologies published in this article are those of the authors. They do not necessarily reflect EMC Corporation’s views, processes or methodologies.

Abstract

The IT industry is embarking on a new paradigm of service delivery. From the start, each information technology wave (mainframe, minicomputer, PC/microprocessor, and networked distributed computing) offered new challenges and benefits. We are now embarking on a new wave, one that offers new methods and technologies to achieve sustainable growth and increased business value for companies ranging from small businesses to major multinational corporations.

There are various new technologies and approaches that businesses can now use to follow a more efficient and sustainable growth path. For example, Cloud computing, Cloud services, Private Clouds, and Warehouse-Scale data center design methodologies are just a few of the approaches to sustainable business growth. Other “services on demand” offerings are starting to make their mark on the corporate IT landscape as well.

Every business has its own requirements and challenges when creating a sustainable IT business model that addresses the need for continued growth and scalability. Can this new wave, fostered by burgeoning new technologies such as Cloud computing and the ever-accelerating information growth curve, turn the IT industry into a flat and level playing field? What metrics should we implement to allow a systematic determination about technology selection? What are the standards and best practices in the evaluation of each technology? Will one technology fit all business, environmental and sustainable possibilities?

For example, to sustain business growth, IT consumers require specific standards so that data and applications are not held captive by non-interoperable Cloud services providers. Otherwise, we end up with walled gardens as we had with CompuServe, AOL, and Prodigy in the period before the Internet and worldwide web emerged. Data and application portability standards have to be firmly in place, with solid Cloud service provider backing.

Data centers are changing at a rapid pace, faster than at any other point in history. However, with all the changes in data center facilities and the associated information management technologies, IT professionals face numerous challenges in unifying their peers to solve problems for their companies. Sometimes you may feel as if you are talking different languages or living on different planets. What do virtual computers and three-phase power have in common? Has your IT staff or department ever come to you asking for more power without considering that additional power/cooling is required? Do you have thermal hot spots in places you never expected or contemplated? Has virtualization changed your network architecture or your security protocols? What exactly does Cloud computing mean to your data center?

Is Cloud computing or SaaS (Storage or Software as a Service) being performed in your data center already? More importantly, how do you align the different data center disciplines to understand how new technologies will work together to solve data center sustainability problems? One possible Best Practice is a standardized data center stack framework that would address the above issues, allowing Best Practices to achieve a sustained trajectory of business value growth.

How do we tier Data Center efficiency and map it back to business value and growth? In 2008, American data centers consumed more power than American televisions. Collectively, data centers consumed more power than all the TVs in every home and every sports bar in America. That puts the scale of the problem in perspective. All of these questions will be addressed and possible solutions provided.

In summary, this article will go above the Cloud, offering Best Practices that will align with the most important goal, creating a sustainable computing infrastructure to achieve business value and growth.

Introduction

Deploying IT, SaaS, and Cloud Computing solutions to create a sustainable profitability model for businesses centers on identifying processes and technologies that create value propositions for all involved. This can be achieved by producing eco-centric business analytics, metrics, key performance indicators, and sustainability measures that support the development of green and sustainable business models (see the section titled “Sustainability Metrics and what does it mean to be Sustainable” on page 18 for the definition of sustainability).

This can be a daunting task. The good news is that there is growing interest in sustainability as well as in various technologies. Some, such as Cloud computing, are on a major upward trend, as shown in Figure 1 - Sustainability and Technology Interest Trends, below. This trending information shows the number of hits on Google’s search engine normalized to the topic of Sustainability, outlined in blue. It appears that Sustainability and Cloud computing are certainly trending upward. We will find out why.

Figure 1 - Sustainability and Technology Interest Trends1

1 Google Trends

According to Gartner [1], the top 10 strategic technologies for 2010 include:

Cloud Computing: Cloud computing is a style of computing that characterizes a model in which providers deliver a variety of IT-enabled capabilities to consumers. We can exploit Cloud-based services in a variety of ways to develop an application or a solution. Using Cloud resources does not eliminate the costs of IT solutions, but does re-arrange some and reduce others. In addition, enterprises consuming cloud services will increasingly act as cloud providers and deliver application, information or business process services to customers and business partners. Some have joked that Cloud computing is analogous to preferring to pay for the power we use, rather than buying a power plant!

In addition, Gartner predicts that by 2012, 20 percent of businesses will own no IT assets. Several interrelated trends are driving the movement toward decreased IT hardware assets, such as virtualization, cloud-enabled services, and employees running personal desktops and notebook systems on corporate networks. The need for computing hardware, either in a data center or at the desktop, will not go away. However, if the ownership of hardware transitions to third parties, there will be major shifts throughout the IT hardware industry. For example, enterprise IT budgets either will shrink or be reallocated to more-strategic projects. Enterprise IT staff will be either reduced or re-skilled to meet new requirements, and/or hardware distribution will have to change radically to meet the requirements of the new IT hardware sustainability model.

Advanced Analytics. Optimization and simulation use analytical tools and models to maximize business process and decision effectiveness by examining alternative outcomes and scenarios, before, during and after process implementation and execution. This can be viewed as a third step in supporting operational business decisions. Fixed rules and prepared policies gave way to more informed decisions powered by the right information delivered at the right time, whether through customer relationship management (CRM) enterprise resource planning (ERP) or other applications. The new step provides simulation, prediction, optimization and other analytics, not simply information, to empower even more decision flexibility at the time and place of every business process action. The new step looks into the future, predicting what can or will happen.

Client Computing. Virtualization is bringing new ways of packaging client computing applications and capabilities. As a result, the choice of a particular PC hardware platform, and eventually the OS platform, becomes less critical. Enterprises should proactively build a five to eight year strategic client computing roadmap that outlines an approach to device standards, ownership and support, operating system and application selection, deployment and update, and management and security plans to manage diversity.

IT for Green: IT can enable many green initiatives. The use of IT, particularly among the white-collar staff, can greatly enhance an enterprise’s green credentials. Common green initiatives include the use of e-documents, reducing travel via teleconferencing and remote worker support and tele-working. IT can also provide the analytic tools that others in the enterprise may use to reduce energy consumption in the transportation of goods or other carbon management activities.

According to Gartner, by 2014, most IT business cases will include carbon remediation costs. Today, server virtualization and desktop power management demonstrate substantial savings in energy costs, and those savings can help justify projects. Including carbon costs in business cases provides an additional measure of savings, and prepares the organization for increased scrutiny of its carbon impact.

Economic and political pressure to demonstrate responsibility for carbon dioxide emissions will force more businesses to quantify carbon costs in business cases. Vendors will have to provide carbon life cycle statistics for their products or face market share erosion. Incorporating carbon costs in business cases will only slightly accelerate replacement cycles. A reasonable estimate for the cost of carbon in typical IT operations is an incremental one or two percentage points of overall cost. Therefore, carbon accounting will more likely shift market share than market size.

In 2012, 60 percent of a new PC's total life greenhouse gas emissions will have occurred before the user first turns the machine on. Progress toward reducing the power needed to build a PC has been slow. Over the course of its entire lifetime, a typical PC consumes 10 times its own weight in fossil fuels, and around 80 percent of a PC's total energy usage still happens during production and transportation.

Greater awareness among buyers and those that influence buying, greater pressure from eco-labels, and increasing cost and social pressures have raised the IT industry’s awareness of the problem of greenhouse gas emissions. Requests for proposal (RFPs) now frequently look for both product and vendor environment-related criteria. Environmental awareness and legislative requirements will increase recognition of production as well as usage-related carbon dioxide emissions. Technology providers should expect to provide carbon dioxide emission data to a growing number of customers.

Reshaping the Data Center: In the past, design principles for data centers were simple: Figure out what you have, estimate growth for 15 to 20 years, then build to suit. Newly built data centers often opened with huge areas of white floor space, fully powered and backed by an uninterruptible power supply (UPS), water- and air-cooled and mostly empty. However, costs are actually lower if enterprises adopt a pod-based approach to data center construction and expansion. If you expect to need 9,000 square feet during the life of a data center, then design the site to support it, but only build what is needed for five to seven years. Cutting operating expenses, a large portion of overall IT spending for most clients, frees up money to reallocate to other projects or investments either in IT or in the business itself.

Social Computing: Workers do not want two distinct environments to support their work – one for their own work products (whether personal or group) and another for accessing “external” information. Enterprises must focus on use of social software and social media in the enterprise, and participation and integration with externally facing enterprise-sponsored and public communities. Do not ignore the role of the social profile to bring communities together.

Security – Activity Monitoring: Traditionally, security has focused on putting up a perimeter fence to keep others out, but it has evolved to monitoring activities and identifying patterns that would have been missed previously. Information security professionals face the challenge of detecting malicious activity in a constant stream of discrete events that are usually associated with an authorized user and are generated from multiple network, system and application sources. At the same time, security departments are facing increasing demands for ever-greater log analysis and reporting to support audit requirements. A variety of complementary (and sometimes overlapping) monitoring and analysis tools help enterprises better detect and investigate suspicious activity – often with real-time alerting or transaction intervention. By understanding the strengths and weaknesses of these tools, enterprises can better understand how to use them to defend the enterprise and meet audit requirements.

Flash Memory: Flash memory is not new, but it is moving up to a new tier in the storage echelon. Flash memory is a semiconductor memory device, familiar from its use in USB memory sticks and digital camera cards. It is much faster than rotating disk, but considerably more expensive, although the differential is shrinking. As the price declines, the technology will enjoy more than a 100 percent compound annual growth rate during the next few years and become strategic in many IT areas including consumer devices, entertainment equipment and other embedded IT systems. In addition, it offers a new layer of the storage hierarchy in servers and client computers that has key advantages including space, heat, performance and ruggedness.

Virtualization for Availability: Virtualization has been on the list of top strategic technologies in previous years. It is on the list this year because Gartner emphasizes new elements such as live migration for availability that have longer-term implications. Live migration is the movement of a running virtual machine (VM), while its operating system and other software continue to execute as if they remained on the original physical server. This takes place by replicating the state of physical memory between the source and destination VMs, then, at some instant in time, one instruction finishes execution on the source machine and the next instruction begins on the destination machine.

However, if replication of memory continues indefinitely while execution of instructions remains on the source VM, and the source VM then fails, the next instruction would take place on the destination machine. If the destination VM were to fail, just pick a new destination and restart the indefinite migration, thereby making very high availability possible.

The key value proposition is to displace a variety of separate mechanisms with a single “dial” that can be set to any level of availability from baseline to fault tolerance, all using a common mechanism and permitting the settings to be changed rapidly as needed. We could dispense with expensive high-reliability hardware, with fail-over cluster software and perhaps even fault-tolerant hardware, but still meet availability needs. This is key to cutting costs, lowering complexity, and increasing agility as needs shift.

Mobile Applications: By year-end 2010, 1.2 billion people will carry handsets capable of rich, mobile commerce providing an environment for the convergence of mobility and the web. There are already many thousands of applications for platforms such as the Apple iPhone, in spite of the limited market and need for unique coding. It may take a newer version designed to flexibly operate on both full PC and miniature systems, but if the operating system interface and processor architecture were identical, that enabling factor would create a huge turn upwards in mobile application availability.

Sustainability Metrics and what does it mean to be Sustainable

What are these metrics and how does one wade through technology, environmental, business and operational requirements looking for best practices in achieving IT Sustainability? All this will be touched on as we go through the details.

First, what does “Sustainability” mean? Generally, some define it as:

“Meeting the needs of the present without compromising the ability of future generations to meet their own needs.”2 Or: “Then I say the earth belongs to each generation during its course, fully and in its own right. The second generation receives it clear of the debts and encumbrances, the third of the second, and so on. For if the first could charge it with a debt, then the earth would belong to the dead and not to the living generation. Then, no generation can contract debts greater than may be paid during the course of its own existence.”3

This article will take an Information Technology (IT) centric approach to this idea of sustainability. I believe that for the IT industry, including the decision makers, implementers, vendors, and technologists in general, a definition of IT sustainability would be:

“A pro-active approach to ensure the long-term viability and integrity of the business by optimizing IT resource needs, reducing environmental, energy and/or social impacts, and managing resources while not compromising profitability to the business.”

One corollary to this definition would be that not only would developing a sustainable IT model not compromise profitability, but, by conforming to best practices, would actually increase it!

2 Washington State Department of Ecology
3 Thomas Jefferson, September 6, 1789

The four pillars or focal points to achieve IT sustainability are “The Environment,” “Efficiency,” “Effectiveness,” and “Business Practices” as shown in Figure 2 – Achieving IT Sustainability, on page 19. Achieving sustainability requires a focused effort on many fronts. All of these pillars will be discussed in great detail in the following sections.

The environment is not only an important social issue but a responsibility for all of us to manage. From the business perspective, we can address the environmental aspects by working with the manufacturing and supply chain as well as by following environmental Standards and Regulations. As an EMC employee, I can attest that EMC’s internal IT, Facilities, and Operations departments have made great strides in this particular area. It is important not to minimize what each of us can do as an individual, or what we can do as an industry as a whole.

Considering business practices and requirements is important as well. Understanding each business’s operational, market, and growth models as they relate to Sustainability is a given.

To be effective, it is important to have the appropriate tools and best practices to achieve sustainable growth. Lastly, and arguably most importantly for the IT industry, efficiency is paramount to what we can do and how the IT industry as a whole can make a difference.

Figure 2 – Achieving IT Sustainability

The Challenges to achieve Sustainability

It is interesting to note that every IT process has some interaction with energy, given that all IT technologies are electrical or mechanical in nature. It is always difficult to decide which metrics we should use to determine how to achieve sustainable growth. For example, one aspect is the concept of energy efficiency as it relates to business value.

Figure 4 – Efficiency - Megawatts to Infowatts to Business Value Solutions, on page 22, illustrates an example of the general inefficiencies of power delivery and IT technology today. The question is: what is the efficiency of information management per watt? The efficiency challenge is multi-dimensional. Following the energy flow from the power plant to the application, you can trace the energy loss through the full power delivery path. As shown, at the source, the power plant, upwards of 70% of the energy entering the plant is lost through generation and power delivery to the Data Center. Given that most of the power consumed in the United States comes from fossil fuels, as shown in Figure 3 - US Energy Flows (Quadrillon BTUs), below, there is a major opportunity to reduce emissions by becoming more efficient.

Figure 3 - US Energy Flows (Quadrillon BTUs)[21]

Of the power entering the data center, 50% is lost; fans and power supply conversions add to the loss within the data center. In terms of the data center facility, there are solutions such as “Fifth Light,” “CHP (Combined Heat and Power),” “Flywheel,” “Liquid Cooling,” and other technologies that can make a difference.


Within the data center facility, given typical under-utilization of server, storage, and network bandwidth, as well as the challenges of inefficient and zero-value applications, the megawatt-to-infowatt efficiency can be less than 1%. So, for example, for every 100 watts of power generated, only about 0.3 watts end up doing useful information work. I believe the IT industry can do better.
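To make this megawatt-to-infowatt chain concrete, the following Python sketch multiplies out the stage losses described above. The individual stage efficiencies are illustrative assumptions chosen to roughly match the figures in the text (about 70% lost before the data center, about 50% lost inside it, and low utilization of hardware and applications); they are not measured values.

```python
# Illustrative megawatt-to-infowatt chain. Stage efficiencies are
# assumptions chosen to roughly match the text, not measured values.
stage_efficiency = {
    "generation_and_transmission": 0.30,    # ~70% lost before the data center
    "data_center_power_and_cooling": 0.50,  # ~50% lost to UPS, fans, cooling
    "it_hardware_utilization": 0.10,        # under-utilized servers/storage/network
    "useful_application_work": 0.20,        # share of cycles doing business-value work
}

end_to_end = 1.0
for stage, eff in stage_efficiency.items():
    end_to_end *= eff
    print(f"after {stage:32s}: {end_to_end * 100:6.2f}% of source energy remains")

watts_generated = 100.0
print(f"\nOf every {watts_generated:.0f} W generated, roughly "
      f"{watts_generated * end_to_end:.1f} W ends up as useful information work.")
```

Run with these assumed values, the chain leaves roughly 0.3 W of every 100 W generated, matching the figure quoted above.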

The good news is that there are solutions. However, there is no silver bullet. There is no one technology, business or process that will address at any great length the efficiency and sustainability goal.

Figure 4 shows some examples of what can be done at all stages in the megawatt-to-infowatt efficiency cycle. These range from virtualization, consolidation, and network and data optimization to other environmental solutions, many of which are point technologies. However, with these point technologies rolled out in tight orchestration, I am sure IT stakeholders can and will make a difference.

For example, virtualization alone cannot relieve the burden of rising site infrastructure expenses, and it can be argued that this technology alone cannot achieve sustainability. See Equation 4 – IT Long Term Sustainability Goal on page 28. Virtualizing four or ten servers onto a single physical host will indeed cut power consumption and free up data center capacity. For data centers nearing their limits, virtualization can play a key role in delaying the time at which an expansion or new facility must be built, but it is not the total solution.


Figure 4 – Efficiency - Megawatts to Infowatts to Business Value Solutions4

The problem is that the cost of electricity and site infrastructure TCO (Total Cost of Ownership) is greatly outpacing the cost of the server itself. This is true regardless of whether a single server is running one application or is virtualized to handle multiple tasks. When electricity and infrastructure costs greatly exceed server cost, any IT deployment decision based on server cost alone will result in a wildly inaccurate perception of the true total cost. Even when virtualization frees up wasted site capacity for additional servers without spending new money on site infrastructure, the opportunity cost (i.e., ensuring that scarce resources are used efficiently) of deploying the capacity is the same. Data center managers can find themselves building expensive new capacity sooner than they need to [7].

4 EDS

Furthermore, virtualization is a one-time benefit. After consolidating servers so that they are all running at full capacity, and planning future deployments so that newly purchased servers will also be fully utilized, data center operators are still faced with the reality that each year’s generation of servers will most likely draw more power than the previous hardware release. After virtualization has taken some of the slack out of underutilized IT hardware, the trend in power growth will resume.

Conversely, it is also possible that virtualization may allow each new server to be so productive that it’s worthwhile to divert a greater fraction of the IT budget to pay the increased site infrastructure and electricity cost, but a business can’t make that decision without considering the true total cost.
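As a back-of-the-envelope illustration of that “true total cost” argument, here is a small Python sketch that compares a server’s purchase price against its lifetime electricity and amortized site-infrastructure costs. All figures (price, wattage, PUE, electricity rate, infrastructure cost per watt, service life) are assumptions for illustration only, not vendor or EMC data.

```python
# Rough lifetime cost of one deployed server. Every number here is an
# illustrative assumption; substitute your own measured values.
def server_tco(server_price, avg_watts, pue, price_per_kwh,
               infra_cost_per_watt, years=4):
    hours = years * 365 * 24
    wall_watts = avg_watts * pue                     # include cooling/power overhead
    energy_kwh = wall_watts * hours / 1000.0
    electricity = energy_kwh * price_per_kwh
    infrastructure = wall_watts * infra_cost_per_watt  # amortized build-out cost
    return {
        "server": server_price,
        "electricity": round(electricity),
        "infrastructure": round(infrastructure),
        "total": round(server_price + electricity + infrastructure),
    }

costs = server_tco(server_price=3000, avg_watts=300, pue=2.0,
                   price_per_kwh=0.10, infra_cost_per_watt=10.0)
print(costs)
# With these assumptions the server itself is well under half of the
# lifetime total, which is the "true total cost" that should drive decisions.
```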

The concept of Green

The concept of “Green” also comes to mind. Sustainability and being green go hand in hand. Going green means change, but not all green solutions are efficient or sustainable. For example, one might say, plant trees to become green. Well, it would take 6.6 billion trees to offset the CO2 generated by all of the data centers in the world.5 Planting trees is green, but not very efficient.

Green business models must reduce carbon dioxide. Developing green business models begins by determining how a company's products, services, or solutions can be produced in contexts that reduce carbon dioxide (CO2) emissions. Market standards measure reductions in units of 1 million metric tons. The Greenhouse Gas Equivalencies Calculator, which we will discuss in the “Standards and Regulations” section starting on page 32, translates difficult-to-understand statements into more commonplace terms, such as "is equivalent to avoiding the carbon dioxide emissions of X number of cars annually." This calculator also offers an excellent example of the analytics, metrics, and intelligence measures that IT, SaaS, and Cloud Computing solutions must deliver across the input and output chains in business models.
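In the spirit of the Greenhouse Gas Equivalencies Calculator described above, the short Python sketch below translates annual electricity use into an "equivalent number of passenger cars." The grid emission factor and per-car figure are placeholder assumptions, not published EPA values; the real calculator should be consulted for reportable numbers.

```python
# Placeholder emission factors for illustration only (not EPA-published values).
GRID_KG_CO2_PER_KWH = 0.5      # assumed average grid emission factor
CAR_KG_CO2_PER_YEAR = 4600.0   # assumed annual emissions of one passenger car

def cars_equivalent(annual_kwh: float) -> float:
    """Return the number of passenger cars with the same annual CO2 footprint."""
    annual_kg_co2 = annual_kwh * GRID_KG_CO2_PER_KWH
    return annual_kg_co2 / CAR_KG_CO2_PER_YEAR

# Example: a 1 MW IT load running continuously for a year.
annual_kwh = 1000.0 * 24 * 365
print(f"{annual_kwh:,.0f} kWh/year is equivalent to roughly "
      f"{cars_equivalent(annual_kwh):,.0f} cars on the road.")
```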

5 Robert McFarlane, Principal Data Center and Financial Trading Floor Consultant

IT Sustainability and how to measure it

Even though this relatively newly named paradigm, “The Cloud,” has a great deal of potential to contribute to a sustainable IT structure, it is just a part of the whole sustainability picture. In the following sections, Cloud computing and its variants will be discussed in detail.

As mentioned, emerging green IT, SAAS and Cloud Computing solutions offer great potential to extend, leverage and strengthen a company's business model by applying measures that can be reported and managed. Innovative IT solutions are emerging to address sustainability analytics and carbon reduction metrics. Third parties that work with companies as they develop green goods and services business models are well positioned to guide IT, SaaS and Cloud Computing companies toward developing solutions that produce green or eco-centric business metrics.

A list of Green IT, SaaS, and Cloud Computing solutions, shown in Appendix A – Green IT, SaaS, Cloud Computing Solutions starting on page 165, is being developed as solution providers report. This list shows the capability of their IT solutions to produce CO2 and sustainability measures, business analytics, market intelligence, metrics, and key performance indicators that companies can apply to develop green goods and services.

This list of metrics will become ever more important, because executing profitable green business plans will depend upon generating more and deeper levels of complex greenhouse gas and sustainability measurements and metrics.

How does one measure computing efficiency? After all, if you cannot measure it, you cannot improve it.6

For example, for a server, efficiency in its most basic form is shown in Equation 1 – Computing Energy Efficiency, below. Efficiency is the effective work done per unit of energy used, which is equivalent to the rate at which the work is done divided by the power used.

6 Lord Kelvin

Equation 1 – Computing Energy Efficiency

$$\text{Efficiency} = \frac{\text{WorkDone}}{\text{EnergyUsed}} = \frac{\text{ComputingSpeed}}{\text{Power}}$$

Breaking it down further, as shown in Equation 2 – Computing Energy Efficiency-Detailed, below, the efficiency is also a function of the underlying hardware, its properties, and the Data Center as a whole.

Equation 2 – Computing Energy Efficiency-Detailed

$$\text{Efficiency} = \frac{\text{WorkDone}}{\text{EnergyUsedInChips}} \times \frac{\text{EnergyUsedInChips}}{\text{EnergyProvidedToComputers}} \times \frac{\text{EnergyProvidedToComputers}}{\text{EnergyEnteringTheBuilding}}$$

This equation shows the dependency on all of the underlying hardware at all levels of the data center stack: the server, the network, and all other parts of the infrastructure.

The efficiency can also be summarized by including the commonly used power usage effectiveness (PUE) metric, as shown in Equation 3 – Computing Energy Efficiency-Detailed as a function of PUE, on page 25, below. The equation shows the dependencies of the business, the individual components, and the Data Center as a whole in establishing what efficiency means. For a more detailed discussion of PUE, please refer to the EMC Proven Professional Knowledge Sharing article titled “Crossing the Great Divide in Going Green: Challenges and Best Practices in Next Generation IT Equipment, EMC Knowledge Sharing, 2008”.

Equation 3 – Computing Energy Efficiency-Detailed as a function of PUE

$$\text{Efficiency} = \text{ComputingEfficiency} \times \text{ComputerEfficiency} \times \text{DataCenterEfficiency}\,(1/\text{PUE})$$
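A minimal Python sketch of how Equations 1 through 3 compose: overall efficiency is the product of application-level efficiency, hardware efficiency, and facility efficiency expressed as 1/PUE. The sample values are assumptions used only to show how improving any one factor scales the whole product.

```python
# Equation 3: Efficiency = ComputingEff * ComputerEff * (1 / PUE)
def overall_efficiency(computing_eff: float, computer_eff: float, pue: float) -> float:
    data_center_eff = 1.0 / pue   # fraction of facility power reaching IT gear
    return computing_eff * computer_eff * data_center_eff

# Assumed figures: 30% of chip energy does useful work, power supplies and
# fans deliver 70% of input power to the chips, and the facility PUE is 2.0.
print(f"Overall efficiency: {overall_efficiency(0.30, 0.70, 2.0):.1%}")   # ~10.5%

# Improving any single factor (say, PUE from 2.0 to 1.4) scales the whole
# product, which is why the equation spans business, IT, and facility levels.
print(f"With PUE 1.4:       {overall_efficiency(0.30, 0.70, 1.4):.1%}")   # ~15.0%
```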

When you think about a data center, what do you picture? Almost any aspect could be imagined: mechanical & electrical systems, network infrastructure, storage, compute environments, virtualization, applications, security, cloud, grid, fabric, unified computing, open source, etc. Then consider how these items incorporate into areas of efficiency, sustainability, or even a total carbon footprint [11].

The Carbon Footprint

The view of a data center quickly becomes significantly more complex, leading to challenges such as explaining to company executives how efficient a data center is. Where does someone start to measure for these types of complexities? Are the right technologies in place to do so? Which metrics should you use for a particular industry and data center design? Data Center professionals all over the world are asking the same questions and feeling the same pressures [13].

Data centers are changing at a rapid pace, more than at any other point in history. Yet with all the change, data center facilities and IT professionals face numerous challenges in unifying their peers to solve problems for their companies. Has virtualization changed your network architecture? What about your security protocols? What exactly does Cloud computing mean to my data center? Is cloud computing being performed in your data center already? More importantly, how do I align the different data center disciplines to understand how new technologies work together to solve data center problems?

With ever-increasing densities, sleep-deprived data center IT professionals still have to keep the data center operating, while facing additional challenges relating to power efficiencies and interdepartmental communication.

To compound the problem, ‘Green’ has become the new buzzword in almost every facet of our lives. Data centers are no exception to green marketing and are sometimes considered easy targets due to large, concentrated power and water consumption. New green solutions sometimes are not so green due to limited understanding of data center complexities. They may disrupt cost saving and efficient technologies already in use.

Corporations are trying to calculate their carbon footprint, put goals in place to reduce it, and may face pressure to apply a new solution without understanding the entire data center picture and what options are available. Various government bodies around the world have seen the increase in data center power consumption and realize it is only trending up. It is only a matter of time before regulations are put into place that will cause data center operators to comply with new rules, possibly beyond what a data center was originally designed for. Nevertheless, we all know that the most visible pressure is that costs are rising, potentially reducing profitability.


Figure 5 – IT Data Center Sustainability Taxonomy

The recent economic uncertainty has everyone looking for ways to cut and optimize data centers even further. Data centers have reached the CFO's radar and are under never ending scrutiny to cut capital investments and operating expenses. So what are data center owners and operators supposed to do? Invent their own standards? Metrics? Framework? Which industry standards and metrics apply to your data center and will they help you show results to your CFO? There has to be a better way.

With the advent of ‘Cloud computing’ and its multi-faceted variants, understanding the data center interdependencies from top to bottom is a new priority. By doing so, users can analyze potential outsourcing to, for example, a cloud technology solution. Figure 5 – IT Data Center Sustainability Taxonomy, shown above, outlines one approach to defining the metrics and moving parts needed to achieve a framework for understanding the challenges and methodologies required for an efficient approach to IT data center architectures.

At the bottom of the stack are the sustainability metrics. Understanding all the metrics that can be used, from useful work as outlined in Figure 4, to lifecycle management with a desire to “design for run,” to more eco-centric metrics such as fuel and site selection, is imperative. Design for run is the concept that you need to consider the full life cycle of IT technology: the energy used to create the product, the power consumed during the operational period, and the eventual disposition of the IT asset.

A carbon score results from these metrics. A carbon score can be localized to more specific data center relationships as outlined in the data center stack, such as the network, server, storage, and so on. As a reference, we can also map the data center into a cloud mapping for more of a variable metric score.

The output of the data center stack could be a consistent design and terminology definition, allowing you to map into a more eco-centric approach and define a more targeted focal point for optimization, such as the environment, real estate, or the physical data center as outlined. Out of that could come an approach to define a score and certification that businesses and governments can utilize to track and measure sustainability levels.

Therefore, what is the sustainability bottom line? What is the long-term metric for significantly improving sustainability by restoring the economic productivity of IT?

Equation 4 – IT Long Term Sustainability Goal

$$\Delta \text{EfficiencyIncrease} \;\ge\; \Delta \text{ComputationalPerformanceIncrease}$$

The long-term solution is defined in Equation 4 – IT Long Term Sustainability Goal, shown above. The goal is to have the rate of energy efficiency increase equal, or exceed, the rate of computational performance increase.

According to the Uptime Institute, which tracks efficiencies of IT equipment relative to computational performance, server compute performance has been increasing by a factor of three every two years, for a total factor of 27 over six years (3 x 3 x 3 = 27). However, energy efficiency is only doubling in the same period (2 x 2 x 2 = 8).7 This means computational performance increased by a factor of 27 between 2000 and 2006, while energy efficiency went up by only a factor of eight during the same period.

This means that while power consumption per computational unit has dropped dramatically in a six-year period (by 88 percent), the power consumption has still risen by a factor of more than 3.4 times.
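The arithmetic behind those Uptime Institute figures can be checked directly; the short Python snippet below reproduces the 88 percent drop in power per unit of work and the roughly 3.4x rise in total power draw over the six-year period.

```python
# Worked arithmetic for the figures quoted above: performance triples every
# two years while energy efficiency only doubles.
YEARS = 6
performance_gain = 3 ** (YEARS // 2)   # 3 * 3 * 3 = 27x more work per server
efficiency_gain = 2 ** (YEARS // 2)    # 2 * 2 * 2 = 8x more work per watt

power_per_unit_work = 1.0 / efficiency_gain               # drops to 1/8
total_power_growth = performance_gain / efficiency_gain   # 27 / 8

print(f"Power per unit of work: down {(1 - power_per_unit_work):.0%}")   # ~88%
print(f"Total power draw:       up {total_power_growth:.1f}x")           # ~3.4x
# Equation 4 says sustainability requires the efficiency gain to keep pace
# with (or exceed) the performance gain, so this ratio stays at or below 1.
```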

Moore’s Law is a major contributor. Moore’s Law describes the doubling of the number of transistors on a piece of silicon every 18 months,8 resulting in a power density increase within chips that causes temperatures inside and around those chips to rise dramatically. Virtually everyone involved in large-scale IT computing is now aware of the resulting temperature and cooling problems data centers are experiencing, but may not fully understand the risks as they relate to sustainability.

In addition to a common framework as outlined in Figure 5 on page 27, an ontology (a description of what the interdependencies are) for a sustainable IT framework, consistent with what defines achieving sustainability, is useful. It is outlined in Figure 6 – Top Level Sustainability Ontology, on page 30, below.

As defined in the figure, the four aspects or pillars of achieving Sustainability are “Business Practices,” “Environment,” “Effectiveness,” and “Efficiency.” These four pillars of sustainability will be the common theme of this article, and we will cover each in detail.

7 Uptime Institute - The Invisible Crisis in the Data Center: The Economic Meltdown of Moore’s Law
8 Gordon Moore, the cofounder, originally predicted in 1965 a doubling every 24 months; the real-world pace was faster

Figure 6 – Top Level Sustainability Ontology (see notes below)

Notes on the figure above:
• A right-pointing arrow indicates an Ontology sub-diagram, drilling down on that topic
• The sub-diagram “Figure 19 – Sustainability Ontology – Infrastructure Architectures” can be found on page 84
• The sub-diagram “Figure 13 – Sustainability Ontology – Self Organizing Systems” can be found on page 52
• The sub-diagram “Figure 25 – Sustainability Ontology – Business Practices” can be found on page 128

Environment Pillar- Green Computing, Growing Sustainability

Green Computing is the efficient use of computing resources; the primary objective is to account for the triple bottom line (People, Planet, and Profit), an expanded range of values and criteria for measuring organizational and societal success. Given that computing systems existed before concern over their environmental impact, green computing has generally been implemented retroactively, although some now consider it in the development phase. It is universal in nature, because increasingly sophisticated modern computer systems rely upon people, networks, and hardware. Therefore, the elements of a green solution may comprise items such as end user satisfaction, management restructuring, regulatory compliance, disposal of electronic waste, telecommuting, virtualization of server resources, energy use, thin client solutions, and return on investment.

Data centers are one of the greatest environmental concerns of the IT industry. They have increased in number over time as business demands have increased, with facilities housing an increasing amount of ever more powerful equipment. As data centers run into limits related to power, cooling, and space, their ever-increasing operation has created a noticeable impact on power grids, to the extent that data center efficiency has become an important global issue, leading to the creation of the Green Grid,9 an international non-profit organization mandating an increase in the energy efficiency of data centers. Their approach, virtualization, has improved efficiency, but is optimizing a flawed model that does not consider the whole system, where resource provision is disconnected from resource consumption [4].

For example, competing vendors must host significant redundancy in their data centers to manage usage spikes and maintain the illusion of infinite resources. So, one would argue that as an alternative, a more systemic approach is required, where resource consumption and provision are connected, to minimize the environmental impact and allow sustainable growth.

9 Crossing the Great Divide in Going Green: Challenges and Best Practices in Next Generation IT Equipment, EMC Knowledge Sharing, 2008

Standards and Regulations

Information technology has enabled significant improvements in the standards of living of much of the developed world, and through its contributions to greater transport and energy efficiency, improved design, reduced materials consumption and other shifts in current practices, may offer a key to long-term sustainability.

However, the production, purchase, use and disposal of electronic products have also had a significantly negative environmental impact. As with all products, these impacts occur at multiple stages of a product’s life: extraction and refining of raw materials, manufacturing to turn raw materials into finished product, product use, including energy consumption and emissions, and end-of-life collection, transportation, and recycling/disposal. Since computers and other electronic products have supply chains and customer bases that span the globe, these environmental impacts are widely distributed across time and distance.

Best Practice – In the US, Consider Executive Order 13423 and energy-efficiency legislation regulations

Executive Order 13423 (E.O.), "Strengthening Federal Environmental, Energy, and Transportation Management," signed in January 2007 by former President Bush, requires that all federal agencies set an energy efficiency and environmental performance example to achieve a number of sustainability goals with target deadlines. To comply with this E.O., IT solution providers will have to change their current product and offering set or create new products and offerings if they intend to supply government agencies. The suppliers' commercial customers will benefit as well, fulfilling the E.O.'s ultimate goal.

While the E.O. does not establish non-compliance penalties, other agencies, a number of states, and at least one city (New York City) have enacted legislation to fine companies that violate these new laws. The impact to IT and data centers is that, under the Clean Air Act, any source emitting more than 250 tons of a pollutant would be forced to follow certain regulations and could potentially be exposed to significant financial penalties.

The E.O. is not clear on who would be responsible for carbon dioxide generation: the corporate power consumer or the power generation facility. One thing is certain, though: the costs will be passed on to the business. If electric utilities are charged for carbon dioxide production, they are either going to pass those charges on to their customers or increase overall electric utility rates.

Best Practice – Use tools and resources to understand environmental impacts

A tool developed by the EPA (U.S. Environmental Protection Agency), the EPEAT (Electronic Product Environmental Assessment Tool) program, was launched in 2006 to help purchasers identify environmentally preferable electronic products. EPEAT developed its environmental performance criteria through an open, consensus-based, multi-stakeholder process, supported by the U.S. EPA, that included participants from the public and private purchasing sectors, manufacturers, environmental advocates, recyclers, technology researchers, and other interested parties. Bringing these varied constituencies' needs and perspectives to bear on standard development enabled the resulting system not only to address significant environmental issues, but also to fit within the existing structures and practices of the marketplace, making it easy to use and therefore widely adopted.

To summarize EPEAT’s goals:
• Provide a credible assessment of electronic products based on agreed-upon criteria
• Evaluate products based on environmental performance throughout the life cycle
• Maintain a robust verification system to maintain the credibility of product declarations
• Help to harmonize numerous international environmental requirements
• Promote continuous improvement in the design of electronic products
• Lead to reduced impact on human and environmental health

For example, EPEAT cumulative benefits in the United States reflect that 101 million EPEAT-registered products have been sold in the US since the system's debut in July 2006, and the benefits of US EPEAT purchasing have increased over time and will continue to be realized throughout the life of the products. The data in Table 1 - 2006 to 2008 EPEAT US Sales Environmental Benefits, on page 34, below, show the benefits of these sales, year to year and cumulatively.

It is important to understand the standards and regulations, so that from the business as well as from the purchasing perspective, we can make the appropriate decisions and continue on the path to a sustainable future.


Table 1 - 2006 to 2008 EPEAT US Sales Environmental Benefits

IT facilities and Operations

A company needs to determine the most effective method of selecting a data center site that supports its sustainability goals. There are a number of key factors to consider. The site selection process is key for most companies, not only because the selected site/provider will be hosting mission-critical business services, but also because the chosen site will likely house those critical systems and platforms for the foreseeable future. Since you only perform site selection activities once or twice, it is important that all relevant factors be evaluated. Geographical factors are often overlooked in site selection activities, or at best incompletely examined. Many data centers produce information about hardware reliability or facility security, but geography, as a measure of a facility's ability to serve and sustain its clients' needs, is often neglected [8].

Best Practice – Place Data Centers in locations of lower risk of natural disasters

The prevalence of natural disasters in U.S. regions is another factor by which companies can measure data center operations as shown in Figure 7 - U.S. Federal Emergency Management Agency – Disaster MAP, below. Enterprises that outsource or move data center operations to other potential sites or locations can mitigate certain risks by choosing locations in areas deemed low risk by historical and analytical data.

Figure 7 - U.S. Federal Emergency Management Agency – Disaster MAP[22]

We can predict where earthquakes may occur by using seismic zone data and fault line analysis. Seismic zones are determined by compiling statistics about past earthquakes, specifically magnitude and frequency. The map below titled Figure 8 - U.S. Geological Survey Seismological Zones (0=lowest, 4=Highest), on page 36, illustrates U.S. seismic zones as defined by the United States Geological Survey (USGS). For the purposes of illustrating seismic activity in the United States, the USGS divides the country into zones, numbered from 0 to 4, indicating occurrences of observed seismic activity and assumed probabilities for future activity.

Figure 8 - U.S. Geological Survey Seismological Zones (0=lowest, 4=Highest)

According to FEMA, flooding is a common event that can occur virtually anywhere in the United States, including arid and semi-arid regions. The agency has defined flood zones according to varying levels of risk. Based on current data, Texas maintains the distinction as the country’s highest-risk flood zone. Note that FEMA is still collecting information and has not yet released statistical data regarding Hurricane Katrina’s flood impact. Despite recent events, consider the following flood facts from FEMA collected between 1960 and 1995:

• Texas had the most flood-related deaths during the past 36 years
• Total flood-related deaths in Texas were double that of California
• California ranked second to Texas in number of flood-related deaths
• Texas had more flood-related deaths than any other state 21 out of 36 years

Like flooding, tornadoes can occur in every US state. However, some areas are more prone to tornadoes than others. FEMA also notes that most United States tornadoes develop over the country's vast eastern plains.


In terms of hurricane activity, most occurrences affect coastal states and those located closest to the eastern and gulf coasts of the United States. Based on weather patterns and historical data, according to Figure 9 – U.S. NOAA Hurricane Activity in the United States, below, much of the eastern United States, especially the Southeast and the Gulf Coast, is significantly susceptible to yearly hurricane activity.

Figure 9 - U.S. NOAA Hurricane Activity in the United States

Best Practice – Evaluate Power GRID and Network sustainability for IT Data Centers

Other factors should be considered when performing site-selection activities. Not only should you consider the various geographical factors, but also the availability of key resources, such as power and network connectivity.

Power availability should be a major factor in any site-selection process. Recent headlines (New Orleans) point to the potentially disastrous effects of deficient power infrastructures. When evaluating the power infrastructure in a given area, it is important to ascertain several key factors, including:


• Access to more than one grid – Is the provider in question connected to more than one feed from the energy company in question?
• Power grid maturity – Does the grid(s) in question also feed a large number of residential developments? Is there major construction occurring within the area served by that grid?
• On-site power infrastructure – Is the data center equipped to support major power requirements, and sustain itself should the main supply of power fail?

As with power, we must also consider the availability and quality of network and carrier backbones. Key factors include:

• Fiber backbone routes and their proximity to the data center – Are major carrier routes proximate to the data center?
• Type of fiber in proximity – For fiber that is proximate to the data center, is it a major fiber route or a smaller spur off the main backbone? How much fiber is already in place, how much of it is 'lit', or ready for service, and how much of it is 'dark'?
• Carrier presence – While the presence of a fiber backbone is important, it is also important to understand the presence of the carrier/telecommunication provider(s) in the area, from a business and support perspective. A carrier may have fiber in the area, but if it has little or no presence itself, or relies on third parties for maintenance, service to the data center may suffer accordingly.
• Carrier type – Simply having a carrier's backbone nearby does not necessarily indicate that the carrier itself is a Tier 1 provider. This is of particular interest when internet access to/from the data center is being supplied via those carriers. In any event, it is usually important to understand which internet carriers currently provide service into the data center, and whether or not they are also the telecommunications/fiber provider(s). It is not enough that one or more carriers are present in a data center. At least one of the carriers (optimally, more than one) should be a Tier 1 provider, meaning that it peers directly with other major backbones at private and public peering exchanges. Only a relatively small number of carriers can claim this level of efficiency, and smaller carriers must purchase access from the major carriers, which often introduces some level of latency.

Effectiveness Pillar

To sustain an efficient business model, we must have an effective set of tools, best practices, partnerships and services. The following sections discuss this in detail.

Services and Partnerships

In order to achieve sustainability, we must have actionable plans and understand what resources can be brought to bear to put new policies and procedures in place. As shown in Equation 5 – What is Efficient IT, below, an Efficient IT environment that supports sustainability combines leveraging external or internal resources to create an assessment strategy with a technical knowledge base of experience and best practices, some of which are outlined in this paper.

Equation 5 – What is Efficient IT

As shown in Figure 10 - Opportunities for efficiency improvements, below, by applying the efficiency model of Equation 5 one can achieve a sustainable growth path for a company's IT infrastructure. This can be done through consolidation and virtualization, implementing a tiered server and storage infrastructure model, and utilizing common tools, key performance indicators, resource management best practices and process automation.


Figure 10 - Opportunities for efficiency improvements

Tools and Best Practices

Executive management, CIOs and IT personnel should review their IT management tool sets to establish whether investing in automation and better processes can reduce the percentage of IT staff time spent keeping the lights on. Such investments should also increase the utilization of IT assets.

Dealing with server sprawl, improving network utilization and controlling the growth of enterprise storage will all help businesses extend their IT budgets for hardware, maintenance, licensing, software and staff.

With the latest generation of IT management tools and current best practices, both of which will be covered in the following sections, this effort should be entirely compatible with preserving and improving business agility and a sustainable growth path [3].

Efficiency Pillar

According to the EPA (U.S. Environmental Protection Agency), the energy used by the nation's servers and data centers was about 61 billion kilowatt-hours (kWh) in 2006 (1.5 percent of total U.S. electricity consumption), for a total electricity cost of about $4.5 billion. This estimated level of electricity consumption is more than the electricity consumed by the nation's color televisions and similar to the amount of electricity consumed by approximately 5.8 million average U.S. households. Federal servers and data centers alone account for approximately 6 billion kWh (10 percent) of this electricity use, for a total electricity cost of about $450 million annually.

The energy use of the nation’s servers and data centers in 2006 is estimated to have doubled since 2000. The power and cooling infrastructure that supports IT equipment in data centers also uses significant energy, accounting for 50 percent of the total consumption of data centers [5]. Among the different types of data centers, more than one-third (38 percent) of electricity use is attributable to the nation’s largest (i.e., enterprise-class) and most rapidly growing data centers.

Under current efficiency trends, national energy consumption by servers and data centers could nearly double again in another five years (i.e., by 2011) to more than 100 billion kWh, representing a $7.4 billion annual electricity cost. The peak load on the power grid from these servers and data centers is currently estimated to be approximately 7 gigawatts (GW), equivalent to the output of about 15 base load power plants. If current trends continue, this demand would rise to 12 GW by 2011, which would require an additional 10 power plants. These forecasts indicate that unless we improve energy efficiency beyond current trends, the federal government's electricity cost for servers and data centers could be nearly $740 million annually by 2011, with a peak load of approximately 1.2 GW. According to the EPA, given the historical projections shown in Figure 11 - EPA Future Energy Use Projections, below, annual energy use will double approximately every four years.
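As a quick sanity check on these figures, the short sketch below derives the electricity rate and the compound annual growth implied by the numbers quoted above; the 107 billion kWh value is an assumption standing in for "more than 100 billion kWh".

```python
# Implied rate and growth from the EPA figures quoted above. The 107 billion
# kWh value is an assumption standing in for "more than 100 billion kWh".
kwh_2006, cost_2006 = 61e9, 4.5e9
kwh_2011, cost_2011 = 107e9, 7.4e9

rate_2006 = cost_2006 / kwh_2006                 # ~0.074 $/kWh
rate_2011 = cost_2011 / kwh_2011                 # ~0.069 $/kWh
growth = (kwh_2011 / kwh_2006) ** (1 / 5) - 1    # compound annual growth, 2006-2011

print(f"implied rate 2006: {rate_2006:.3f} $/kWh")
print(f"implied rate 2011: {rate_2011:.3f} $/kWh")
print(f"implied annual growth in consumption: {growth:.1%}")   # roughly 12% per year
```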


Figure 11 - EPA Future Energy Use Projections (Source: Report to Congress on Server and Data Center Energy Efficiency, Public Law 109-431, Aug 2007)

However, there is good news. By implementing the "State of the Art" or "Best Practices" scenario models outlined in the following sections, it is possible to achieve energy consumption sustainability by reversing the curve. Best practices such as server and storage consolidation, power management, virtualization, and data center facility infrastructure enhancements (improved transformers, UPSs, chillers, fans, pumps, free-air and liquid cooling), all outlined in Figure 4 – Efficiency - Megawatts to Infowatts to Business Value Solutions, on page 22, will get us to where we want to go.

Efficiency in IT, and the sustainability it enables, can be summarized in three aspects: Consolidate, Optimize and Automate. In the following sections, we will discuss best practices that allow businesses to maintain profitability while achieving sustainable growth.

Information Management

Best Practice – Implement integrated Virtualized management into environment.

In order to manage an efficient and sustainable infrastructure that takes advantage of the most recent architectural changes happening in the industry today, outlined in the section titled "Infrastructure Architectures" starting on page 80, some form of management system needs to be in place. As discussed in the "Infrastructure Architectures" section, the "Private" and "Hybrid" cloud architectures, for example, allow applications, servers, networks and storage to be dynamically modified, which makes it more of a challenge to understand what is running where.

The Best Practice is to implement a management solution that allows a business to:

• Monitor virtualized, high-availability clustered and load-balanced configurations and isolate problems
• View and monitor the critical application processes running on your physical or virtual machines
• Identify when your critical hardware (server, PC, ESX, Hyper-V, etc.) is operating in a degraded state so you can proactively use VMotion (a live virtual machine migration solution) to move your critical apps and avoid service disruption
• Monitor the status of virtual machines and track their movement (VMotion or Quick Migrate) in real time
• Isolate problems when, for example, using Cluster Services and Symantec VERITAS Clustering

The management system needs to understand, end to end, all of the levels of virtualization and integrate common information models to suitably scale. EMC’s IONIX family of management software implements this type of functionality.

EMC's Ionix Server Manager (EISM) software understands the virtualization abstraction stack. EISM implements detailed discovery of ESX servers, virtual machines, physical hosts with VMs, and VirtualCenter instances, and supports dynamic, ongoing discovery of added/deleted/moved VMs. EISM also understands dependencies and relationships, such as dynamic (real-time) discovery of VM topology and associations of VMs with ESX servers and physical hosts.

It is also a best practice for the management application software to support more than one virtualization platform. EMC's EISM supports both VMware (ESX) and Microsoft (Hyper-V) virtualization platforms, as well as numerous clustering and load balancing solutions.

Best Practice - Having a robust Information Model

A Best Practice in the management of Data Center Networks is a methodology that allows a common management platform to support a common information model that provides key knowledge for automating management applications.

The ICIM Common Information Model is a potential solution. EMC's Smarts product suite of management applications, which supports ICIM, illustrates how this model can be used. We use this model to represent a networked infrastructure supporting complex business networks.

An information model, underlying a management platform, provides knowledge about managed entities that is important to management applications, such as fault, performance, configuration, security, and accounting. This information must be shared among applications for an integrated OSS solution.

As a Best Practice, an information model must maintain detailed data about the managed system at multiple layers, spanning infrastructure, applications, and the business services typical in a Data Center network. A robust information model enables solutions at every level, including element management, network management, service management, and business management.

Having a common information model has many benefits. These include faster application development and stored information that is maintained in one place, providing a single coherent view of the managed system. Applications can access the parts of the model pertinent to their operation, with consistent views presented to each application.

In a Data Center management system, agents collect operational data on managed elements (network, systems, applications, etc.) and provide this data to the management system.

Another Best Practice is that an information model should represent the whole range of managed logical and physical entities: network elements at any layer, attached servers and desktops, the applications that run on them, the middleware for application interaction, the services the applications implement, the business processes the applications support, and the end-users and customers of business processes.

The classes used by the DMTF (Distributed Management Task Force) CIM (Common Information Model) are an excellent starting point for representing the complete range of entities. An information model must also be able to describe the behaviors of managed entities. Since events and problem behaviors play a central role in management processing, such as real-time fault and performance management, network design and capacity planning, and other functions, formalizing them within the CIM is a key enabler for management automation.

In addition, Best Practices reflect that data structures or repositories should play an important role in supporting the semantic model. They must be flexible enough to represent the rich set of information for each class of managed entities. They must be flexible enough to represent the often-complex web of relationships between entities (logical and physical) within individual layers, across layers, and across technology domains that is so typical in a Data Center.

Another important Best Practice is to automate, as much as possible, the discovery of entities and their relationships within and across technology domains. The ability to automatically populate the information repository is itself a best practice.

For example, in a Data Center, auto-discovery is particularly effective in environments supporting the TCP/IP protocol suite, including SNMP and other standard protocols that enable automatic discovery of a large class of logical and physical entities and relationships in Network ISO Layers 1-7.

Another Best Practice is to have a modeling language that can describe as many entities of a managed environment as possible and their relationships within and across technology and business domains in a consistent fashion. A high-level modeling language can simplify development of managed entity models as well as reduce error.

The ICIM Common Information Model™ and its ICIM Repository provide excellent examples of a semantics-rich common information model and an efficient information repository that meet all requirements presented in earlier sections. ICIM is based on the industry-standard DMTF CIM, a rich model for management information across networks and distributed systems.

CIM reflects a hierarchical object-oriented paradigm with relationship capabilities, allowing the complex entities and relationships that exist in the real world to be depicted in the schema. ICIM

enhances the rich CIM semantics by adding behavioral modeling to the description of managed entity classes to automate event correlation and health determination. This behavioral modeling includes the description of the following information items:

• Events or exceptional conditions - These can be asynchronous alarms, expressions over MIB variables, or any other measurable or observable event.
• Authentic problems - These are the service-affecting problems that must be fixed to maximize availability and performance.
• Symptoms of authentic problems - These events can be used to recognize that the problem occurred.

By adding behavioral modeling, ICIM provides rich semantics that can support more powerful automation than any other management system.
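To make these three information items concrete, the following is a hypothetical, highly simplified sketch of how behavioral information could be attached to a managed-entity class. The class, entity name and symptom lists are illustrative assumptions, not the actual ICIM or DMTF CIM schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ManagedEntityClass:
    """A toy managed-entity class carrying behavioral information."""
    name: str
    # exceptional conditions this class of entity can emit (alarms, traps, ...)
    events: List[str] = field(default_factory=list)
    # authentic problems mapped to the symptoms (events) they are expected to cause
    problems: Dict[str, List[str]] = field(default_factory=dict)

switch_port = ManagedEntityClass(
    name="SwitchPort",
    events=["LinkDown", "HighErrorRate", "HighUtilization"],
    problems={
        "CableFault": ["LinkDown", "HighErrorRate"],
        "Congestion": ["HighUtilization"],
    },
)

# The problem-to-symptom mapping is exactly the kind of knowledge a codebook
# correlation engine (discussed in the following sections) can compile into signatures.
for problem, symptoms in switch_port.problems.items():
    print(problem, "->", symptoms)
```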

Best Practices in Root Cause Analysis

Cloud networks are becoming more difficult to manage. The number and heterogeneity of hardware and software elements in networked systems are increasing exponentially, and the complexity of managing these systems is growing at a similar rate. The introduction of each new technology adds to the list of potential problems that threaten the delivery of network-dependent services.

Fixing a problem is often easy once it has been diagnosed. The difficulty lies in locating the root cause of the myriad events that appear on the management console of a Data Center, cloud or any infrastructure. It has been shown that 80 to 90 percent of downtime is spent analyzing data and events in an attempt to identify the problem that needs to be corrected.

For Data Center managers charged with optimizing the availability and performance of large multi-domain networked systems, it is not sufficient to collect, filter, and present data to operators. Unscheduled downtime directly affects the bottom line. The need for applications that apply intelligent analysis to pinpoint root-cause failures and performance problems automatically is imperative, especially in the consumer driven video and audio vertical. Only when diagnosis is automated can self-healing networks become a reality.

Many problems threaten service delivery. They include hardware failures, software failures, congestion, loss of redundancy, and incorrect configurations.

Best practice - Effective root-cause analysis technique must be capable of identifying all those problems automatically.

This technique must work accurately for any environment and for any topology, including interrelated logical and physical topologies, with or without redundancy. The solution must be able to diagnose problems in any type of object - for example, a cable, a switch card, a server, or a database application - at any layer, no matter how large or complex the infrastructure. As important, accurate root-cause analysis is required to determine the appropriate corrective action as well. If management software cannot automate root-cause analysis, that task falls to operators. Because of the size, complexity, and heterogeneity of today's networks, and the volume of data and alarms, manual analysis is extremely slow and prone to error.

A Best Practice is to intelligently analyze, adapt, and automate using a Codebook Correlation Technology described in the following sections. This Best Practice will translate directly for the customer into major business benefits, enabling organizations to introduce new services faster, to exceed service-level goals, and to increase profitability.

Rules-based Correlation Limitations and Challenges

Typically, event managers focus on gathering and displaying ever more data to users, which is ineffective, as well as on de-duplicating and filtering the excess data. Because of the lack of intelligence in legacy event managers, users of these systems often resort to developing their own custom scripts to capture their specific rules for event processing.

Using customized rules is a development-intensive approach doomed to fail for all but the simplest scenarios in simple static networks.

In this approach, the developer begins by identifying all of the events. Events can include alarms, alerts, SNMP traps, threshold violations, and other sources that can occur in the managed system. The Management Platform and user then attempt to write network-specific rules to process each of these events as they occur.

An organization willing to invest the effort necessary to write rules faces enormous challenges. Typically, there are hundreds and even thousands of network devices in a Data Center network. The number of rules required for a typical network, without accounting for delay or loss of alarms, or for resilience, can easily reach millions. The development effort necessary to write these rules would require many person-years, even for a small network.

Changes in the network configuration can render some rules obsolete and require writing new ones. At the point in time when their proper functioning is needed most, i.e., when network problems are causing loss and delay, rules-based systems are the least reliable given their constant maintenance and update cycles.

Due to the overall complexity of development, attempts to add intelligent rules to an unintelligent event manager have not been successful in practice. In fact, rules-based systems have consistently failed to deliver a return on the investment (ROI) associated with the huge development effort.

Best Practice – Rules-based correlation using CCT

A Best Practice is utilizing Codebook Correlation Technology (CCT). CCT is a mathematically founded, next-generation approach to automating the correlation required for service assurance. CCT is able to automatically analyze any type of problem in any type of physical or logical object in any complex environment. It is also able to build intelligent analysis into off-the-shelf solutions as well as automatically adapt the intelligent analysis to the managed environment, even as it changes. As a result, CCT provides instant results even in the largest of Data Center Networks.

CCT solutions dynamically adapt to topology changes, since the analysis logic is automatically generated. This eliminates the high maintenance costs required by rules-based systems that demand continual reprogramming.

CCT provides an automated, accurate real-time analysis of root-cause problems and their effects in networked systems. Other advantages include minimal development. CCT supports off-the-shelf solutions that embed intelligent analysis and automatically adapt to the environment.

Any development that is required consists of developing behavior models. The amount of effort is dependent not on the size of the managed environment, but on the number of problems that need to be diagnosed.

Since CCT consists of a simple distance computation between events and problem signatures, CCT solutions execute quickly. In addition, CCT utilizes minimal computing resources and network bandwidth, since it monitors only the symptoms that are needed to diagnose problems.

Because CCT looks for the closest match between observed events and problem signatures, it can reach the correct cause even with incomplete information.
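To illustrate the closest-match idea, the sketch below compares an observed set of symptoms against a small, hypothetical codebook of problem signatures and picks the nearest one. It is a toy rendering of the concept, not EMC's CCT implementation; the problems and symptoms are assumptions.

```python
# Toy codebook: each problem maps to the set of symptoms it is expected to cause.
CODEBOOK = {
    "CableFault":     {"LinkDown", "HighErrorRate"},
    "SwitchCardFail": {"LinkDown", "CardAlarm", "NeighborUnreachable"},
    "Congestion":     {"HighUtilization", "PacketLoss"},
}

def diagnose(observed: set) -> str:
    """Return the problem whose signature is closest to the observed symptoms
    (smallest symmetric difference), even if some symptoms were lost."""
    def distance(signature: set) -> int:
        return len(signature ^ observed)   # symptoms missing plus symptoms unexpected
    return min(CODEBOOK, key=lambda problem: distance(CODEBOOK[problem]))

# Incomplete information: the CardAlarm trap was lost in transit, yet the
# closest match is still the correct root cause.
print(diagnose({"LinkDown", "NeighborUnreachable"}))   # -> SwitchCardFail
```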

Leveraging CCT as a Best Practice to automate service assurance provides substantial business benefits as well. These include the ability to roll out new services more quickly and to achieve greater availability and performance of business-critical systems.

Since CCT automatically generates its correlation logic for each specific topology, new Data Center Network services can be managed immediately and new customers can be added to new or existing services quickly. By eliminating the need for development, ongoing maintenance, and manual diagnostic techniques, CCT enables IT organizations to be proactive and to focus their attention on strategic initiatives that increase revenues and market share. CCT provides a future-proof foundation for managing any type of complex infrastructure. This gives CCT users the freedom to adopt new technology, with the assurance that it can be managed effectively, intelligently, and automatically.

Best Practice - Reduction of Downstream Suppression

Reducing Downstream Event Suppression is another Best Practice. Some management vendors implement a form of root-cause analysis that is actually a form of downstream event suppression. Downstream suppression is a path-based technique that is used to reduce the number of alarms to process when analyzing hierarchical Data Center networks. Downstream suppression works as follows.

A polling device periodically polls Data Center devices to verify that they are reachable. When a device fails to respond, downstream suppression does the following:

• Ignores failures from devices downstream (farther away from the "poller") from the first device
• Selects the device closest to the "poller" that fails to respond as the "root cause"
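For contrast, here is a minimal sketch of the path-based suppression just described, assuming a strictly hierarchical (tree) topology rooted at the poller; the device names and parent map are hypothetical.

```python
PARENT = {                    # child -> next hop toward the poller (a simple tree)
    "core-sw": None,          # directly attached to the poller
    "dist-sw": "core-sw",
    "access-sw": "dist-sw",
    "server-42": "access-sw",
}

def suppress(unreachable: set) -> str:
    """Pick as 'root cause' the unreachable device closest to the poller;
    everything downstream of it is ignored."""
    def hops_to_poller(device: str) -> int:
        hops, node = 0, device
        while PARENT[node] is not None:
            node, hops = PARENT[node], hops + 1
        return hops
    return min(unreachable, key=hops_to_poller)

# dist-sw failed, so everything behind it also stops responding:
print(suppress({"dist-sw", "access-sw", "server-42"}))   # -> dist-sw
```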

Downstream suppression requires that the network have a simple hierarchical architecture, with only one possible path connecting the polling device to each managed device. This is typically unrealistic. Today’s mission-critical Data Center networks leverage redundancy to increase resilience. Downstream suppression does not work in redundant architectures because the relationship of one node being downstream from another is undefined; there are multiple paths between the manager and managed devices.

The usefulness of downstream suppression in today's Data Center networks is limited. This technique applies only to simple hierarchical networks with no redundancy, and addresses only one problem, Data Center node failure. Because of these limitations, downstream suppression offers little in the way of automating problem analysis, and certainly cannot claim to offer root-cause analysis.

Self Organizing Systems

As the size and complexity of computer systems grow, system administration has become the predominant factor of ownership cost and a main cause for reduced system dependability. All of these factors impede an IT department from achieving an efficient and sustainable operational model.

The research community, in conjunction with the various hardware and software vendors, has recognized the problem and there have been several advances in this area.

All of these approaches propose some form of self-managed, self-tuned system(s) that minimize manual administrative tasks. As a result, computers, networks and storage systems are increasingly being designed as closed loop systems, as shown in Figure 12 – Closed Loop System. As shown, a controller can automatically adjust certain parameters of the system based on feedback from the system. This system can be either hardware, software or a combination.

Figure 12 – Closed Loop System

Servers, storage systems, networks, and backup hardware are examples of such closed-loop systems, aimed at managing the energy consumption and maximizing the utilization of data centers. In addition, self-organizing systems can also meet performance goals in file servers, Internet services, databases and storage tiering through virtualization, using VMware and other virtualization product offerings. As shown in Figure 13 – Sustainability Ontology – Self Organizing Systems, on page 52, this methodology can be applied in many additional scenarios, ranging from storage, application and server consolidation to the various levels of virtualization that can be achieved through self-organizing system theory and its applications.


Figure 13 – Sustainability Ontology – Self Organizing Systems

It is important that the resulting closed-loop system is stable (does not oscillate) and converges quickly to the desired end state when applying dynamic control.

In order to achieve a more sustainable solution, a more rigorous approach is needed for designing dynamically controlled systems. In particular, it is a best practice to use the time-tested approach of control theory, because it results in systems that can be shown to work beyond the narrow range of a particular experimental evaluation.

Computer and infrastructure system designers can take advantage of decades of experience in the field and can apply well-understood and often automated methodologies for controller design. Many computer management problems can be formulated so that standard controllers or systems are applied to solve them. Therefore, a best practice is that the systems community should stick with systems design; in this case, systems that are amenable to dynamic feedback-based control. This provides the necessary tunable system parameters and exports the appropriate feedback metrics, so that a controller (hardware or software) can be applied without destabilizing the system, while ensuring fast convergence to the desired goals.

Traditionally, control theory and/or feedback systems have been concerned with environments that are governed by laws of physics (i.e., mechanical devices), and as a result, allowed to make assertions about the existence or non-existence of certain properties. This is not necessarily the case with software systems. Checking whether a system is controllable or, even more, building controllable systems is a challenging task often involving non-intuitive analysis and system modifications.

As a first step, we propose a set of necessary and sufficient properties that any system must abide by to be controllable by a standard adaptive controller that needs little or no tuning for the specific system. This is the goal leading to the achievement of sustainability. These properties are derived from the theoretical foundations of a well-known family of adaptive controllers. From a control or feedback systems perspective, there are two very specific and diverse IT management problems:

1) Enforcing soft performance goals in a networked or storage service by dynamically adjusting the shares of competing workloads
2) Controlling the number of blades, storage subsystems, network nodes, etc., assigned to a workload to meet performance goals within power budgets (a simple sketch of this second problem follows below)
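As a rough illustration of the second problem, the sketch below shows a simple feedback loop that adds or removes blades to chase a latency goal while staying inside a power budget. The goal, thresholds and per-blade wattage are assumed values; a real controller would read its measurements from the monitoring system.

```python
LATENCY_GOAL_MS = 50.0
POWER_BUDGET_W = 4000.0
WATTS_PER_BLADE = 350.0
MAX_BLADES = int(POWER_BUDGET_W // WATTS_PER_BLADE)   # hard cap from the power budget

def control_step(blades: int, measured_latency_ms: float) -> int:
    """One closed-loop iteration: add a blade when the latency goal is missed,
    remove one when it is comfortably beaten, always staying inside the budget."""
    if measured_latency_ms > LATENCY_GOAL_MS and blades < MAX_BLADES:
        return blades + 1
    if measured_latency_ms < 0.7 * LATENCY_GOAL_MS and blades > 1:
        return blades - 1
    return blades

blades = 4
for latency in (90.0, 70.0, 55.0, 30.0):              # hypothetical measurements
    blades = control_step(blades, latency)
    print(f"latency {latency:5.1f} ms -> {blades} blades "
          f"({blades * WATTS_PER_BLADE:.0f} W of {POWER_BUDGET_W:.0f} W budget)")
```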

Best Practice - Dynamic Control in a Self Organized System

Many computer management problems are defined as online optimization problems. The objective is to drive a number of measurements obtained from the system toward desired goals by dynamically setting a number of system parameters (or actuators). The problem is formalized as an objective function that has to be minimized.

Existing research has shown that, in the general case, adaptive controllers are needed to trace the varying behavior of computer systems and their changing workloads [9].

Best Practice – Utilize STR’s when implementing adaptive controllers

Let us focus on one of the best-known families of adaptive controllers, Self-Tuning Regulators (STRs), which have been widely used in practice to solve on-line optimization control problems. Using this technology can help attain sustainability. The term "self-tuning" comes from the fact that the controller parameters are automatically tuned to obtain the desired properties of the closed-loop system. The design of closed-loop systems involves many tasks, such as modeling, design of the control law, implementation, and validation. STR controllers aim to automate these tasks. Therefore, STRs can be used out-of-the-box for many practical cases. Other types of adaptive controllers proposed in many feedback system design methodologies require more intervention by the designer.

An STR consists of two basic modules or functions. The model estimation module maintains on-line estimates of a model that describes the measurements from the system as a function of a finite history of past actuator values and measurements. That model is then used by the control law module, which sets the actuator values. A best practice is using a linear model of the following form for model estimation in the STR, as defined in Equation 6 – Linear model of a Control System, shown below:

Equation 6 – Linear model of a Control System

y(t) = \sum_{i=1}^{n} A_i \, y(t-i) + \sum_{i=0}^{n-1} B_i \, u(t-i-d_0)

where y(t) is a vector of the N measurements sampled at time t, and u(t) is a vector capturing the M actuator settings at time t.

A_i and B_i are the model parameters, with dimensions compatible with those of y(t) and u(t). n is the model order, which captures how much history the model takes into account. d_0 is the delay between an actuation and the time the first effects of that actuation are observed.

The unknown model parameters A_i and B_i are estimated using Recursive Least-Squares (RLS) estimation. This is a standard, computationally fast estimation technique that fits Equation 6 to a number of measurements so that the sum of squared errors between the measurements and the model curve is minimized. Discrete-time models are assumed; one time unit in this discrete-time model corresponds to an invocation of the controller, i.e., sampling of system measurements, estimation of a model, and setting the actuators. The relation between actuation and observed system behavior is not always linear. For example, while throughput is a linear function of the share of resources (e.g., CPU cycles) assigned to a workload or storage processor, the relation between latency and resource share is nonlinear, as Little's law indicates.

However, even in the case of nonlinear metrics, a linear model is often a good enough local approximation to be utilized by a controller, since the latter usually only makes small changes to actuator settings. The advantage of linear models is that they can be estimated in computationally efficient ways, resulting in tractable control laws.

The control law is essentially a function that, based on the estimated system model defined in Equation 6 at time t, decides what the actuator values u(t) should be to minimize the objective function. In other words, the STR derives u(t) from a closed-form expression as a function of previous actuations, previous measurements and the estimated system measurements y(t). From a systems perspective, the important point is that these computationally efficient calculations can be performed on-line. The STR requires little system-specific tuning, as it uses a dynamically estimated model of the system, and the control law automatically adapts to system and workload dynamics. For this process to apply, and for the resulting closed-loop system to be stable and to have predictable convergence time, control theory has come up with a list of necessary and sufficient properties that the target system must abide by.
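The toy self-tuning regulator below illustrates the two modules for the simplest scalar case (one measurement, one actuator, n = 1, d_0 = 1): RLS estimates the model parameters on-line and the control law then picks the actuator value that the estimated model predicts will reach the goal. It is an illustrative sketch under those assumptions, not a production controller; the "real" system parameters and the goal are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
a_true, b_true = 0.7, 2.0      # "real" system y(t) = a*y(t-1) + b*u(t-1), unknown to the STR
goal = 10.0                    # desired value of the measurement

theta = np.array([0.0, 1.0])   # current estimates [a_hat, b_hat]; b_hat starts non-zero
P = np.eye(2) * 100.0          # RLS covariance (large = low confidence in the estimates)

y_prev, u_prev = 0.0, 0.0
for t in range(30):
    # --- the managed system produces a new measurement (with a little noise)
    y = a_true * y_prev + b_true * u_prev + rng.normal(0, 0.05)

    # --- model estimation module: one recursive least-squares update of theta
    phi = np.array([y_prev, u_prev])          # regressor: past measurement and actuation
    err = y - phi @ theta                     # prediction error of the current model
    gain = (P @ phi) / (1.0 + phi @ P @ phi)
    theta = theta + gain * err
    P = P - np.outer(gain, phi @ P)

    # --- control law module: choose u(t) so the model predicts y(t+1) = goal
    a_hat, b_hat = theta
    u = (goal - a_hat * y) / b_hat if abs(b_hat) > 1e-6 else 0.0

    y_prev, u_prev = y, u

print(f"estimated a={theta[0]:.2f}, b={theta[1]:.2f}, final y={y_prev:.2f}")
```

Within a few iterations the estimates approach the true parameters and the measurement settles at the goal, which is exactly the stability and fast convergence that the properties in the next section are meant to guarantee.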

Best Practice – Require System Centric Properties

The following paragraphs present guidelines on how one can verify whether each property is satisfied and what the challenges are for enforcing it.

Monotonic. The elements of matrix B_0 in Equation 6 on page 54 must have known signs that remain the same over time. The concept behind this property is that the real (non-estimated) relation between any actuator and any measurement must be monotonic and of known sign. This property usually reflects some physical law. Therefore, it is generally easy to check for it and get the signs of B_0. For example, in the long term, a process with a high fraction of CPU cycles gets higher throughput and lower latency than one with a smaller fraction.

Accurate models. The estimated model in Equation 6 must be a good enough local approximation of the system's behavior. As discussed, the model estimation is performed periodically. A fundamental requirement is that the model around the current operating point of the system captures the dynamic relation between actuators and measurements sufficiently. In practice, this means that the estimated model must track only real system dynamics. We use the term 'noise' to describe deviations in the system behavior that are not captured by the model. It has been shown that to ensure stability in linear systems where there is a known upper bound on the noise amplitude, the model should be updated only when the model error exceeds twice the noise bound.

There are three main sources for the previously discussed noise:

1) un-modeled system dynamics, due, for example, to contention on the network
2) a fundamentally volatile relation between certain actuators and measurements
3) quantization errors when a linear model is used to approximate locally, in an operating range, the behavior of a nonlinear system.

Known system delay. We know the delay d_0 between an actuation and the time its first effects are observed, measured in controller sampling intervals.

Known system order. We know an upper bound on the order of the system. Known system delay ensures that the controller knows when to expect the first effects of its actuations, while known system order ensures that the model remembers sufficiently many prior measurements (y) to capture the dynamics of the system. These properties are needed for the controller to observe the effects of its actuations and then attempt to correct any error in subsequent actuations. If the model order were less than the system order, then the controller would not remember having ever actuated when the measurements finally are affected. The values of d_0 and n are derived experimentally. The designer is faced with a tradeoff: their values must be high enough to capture the causal relations between actuation and measurements, but not too high, so that the STR remains computationally efficient. Note that d_0 = 1 and n = 1 are ideal values.

Minimum phase. Recent actuations have a higher impact on the measurements than older actuations. A minimum phase system is one for which the effects of an actuation can be corrected or canceled by another, later actuation. It is possible to design STRs that deal with non-minimum phase systems, but they involve experimentation and non-standard design processes. In other words, without the minimum phase requirement, we cannot use off-the-shelf controllers. Typically, physical systems are minimum phase; the causal effects of events in the system fade as time passes. Sometimes, this is not the case with computer systems. To ensure this property, a designer must reset any internal state that reflects older actuations. Alternatively, the sample interval can always be increased until the system becomes minimum phase. However, longer sampling intervals result in slower control response.

Linear independence. The elements of each of the vectors y(t) and u(t) must be linearly independent. The quality of the estimated model is poor unless this property holds; the predicted value for y(t) may be far from the actual measurements. The reason is that matrix inversion in the RLS estimator may result in matrices with very large numbers, which, in combination with the limited resolution of a CPU's floating point arithmetic, results in models that are not necessarily correct. Sometimes, simple intuition about a system may be sufficient to ascertain whether there are linear dependencies among actuators.

Zero-mean measurements and actuator values. The elements of each of the vectors y(t) and u(t) should have a mean value close to 0. If the actuators or the measurements have a large constant component, RLS tries to accurately predict this constant component and may fail to capture the comparably small effect of the actuator values. If there is a large constant component in the measurements and it is known, you can simply subtract it from the reported measurements. If it is unknown, you can easily estimate it using a moving average.

Comparable magnitudes of measurements and actuator values. The values of the elements in y(t) and u(t) should not differ by more than one order of magnitude. If the measurement values or the actuator values differ considerably, then RLS predicts the effects of the higher values more accurately. You can easily solve this problem by scaling the measurements and actuators so that their values are comparable. This scaling factor can also be estimated using a moving average.
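A small sketch of this conditioning step follows: a moving average removes the unknown constant component, and a fixed scale factor brings measurements and actuator values to comparable magnitudes before they reach the estimator. The window size, scale factors and sample values are assumptions.

```python
from collections import deque

class Conditioner:
    """Zero-mean and rescale a raw signal before it is fed to the RLS estimator."""
    def __init__(self, window: int = 16, scale: float = 1.0):
        self.history = deque(maxlen=window)
        self.scale = scale

    def __call__(self, raw: float) -> float:
        self.history.append(raw)
        mean = sum(self.history) / len(self.history)   # moving-average estimate of the offset
        return (raw - mean) / self.scale               # zero-mean, comparable magnitude

latency_ms = Conditioner(scale=10.0)    # raw latencies around ~500 ms
share_pct = Conditioner(scale=1.0)      # raw shares around ~50 %

for raw_latency, raw_share in [(510, 52), (495, 48), (530, 55), (470, 45)]:
    print(round(latency_ms(raw_latency), 2), round(share_pct(raw_share), 2))
```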

Application

To design a sustainable IT infrastructure, it is very important to deal with and understand the application and the resources needed.

Best Practice – Architect a designed for Run solution

The Designed for Run (an HP trademark) strategy provides enterprises a clear path for transformation and modernization while controlling costs and protecting the user experience. It considers the function and expense of the client's IT system and applications as a whole. Designed for Run ensures that the key factors affecting agility and TCO (rigidity, complexity and resource utilization) are addressed from the start, in the planning phase.

The reason why we need to consider this methodology is that most organizations recognize the need for IT modernization but fear the time, expense and risk involved. However, failure to modernize will result in high costs, complex systems that are difficult to fix, outages that have a business impact, and low resource utilization. These factors weigh heavily in a TCO equation and compromise sustainability.

The Designed for Run approach offers these benefits:

• Reduces risk by bringing deep industry knowledge, expert planning and thorough testing to every part of the process
• Extracts value within your existing IT, extending the life of legacy elements by integrating them to support the business through modernization
• Lowers total cost of ownership by building maximum asset utilization into your IT system and enabling you to anticipate operational IT expenses

The approach to a Designed for Run plan follows; it is defined in four phases: Plan, Build, Run and Automate.

In the “Planning” phase, we focus on reducing risk. The objective is to reduce system complexity, poor resource utilization, and outages. After assessing your existing IT applications and infrastructure, we tie your IT strategy to your business strategy, determine the system's level of maturity, and use existing industry frameworks to develop a blueprint for your system of the future.

In the "Building" phase, we focus on extracting value by designing high quality into the system, building it for zero outages. We begin by implementing optimized processes for IT applications and then implement the elements of a modern architecture onto a modern infrastructure. This enables us to best use existing systems while gaining the speed, flexibility and innovation required in the 21st century.

In the "Run" phase, we focus on optimizing for total cost of ownership. After the transformation is complete, there are no costly surprises in running the finished system. Rigorous, closed-loop operational processes drive consistency and a relentless pursuit of incident prevention. You will enjoy complete visibility of your enterprise's IT systems, making it possible to avoid costly errors in budgeting and performance, thus increasing sustainability.

In the "Automate" phase, we minimize ongoing management effort by automating these processes.

Storage

Information management is at the core of achieving efficiency and sustainability. It is by now common knowledge that the digital footprint, the storage that both businesses and people in general utilize, is growing at an exponential rate. Something must be done to address this growth.

Storage has a life cycle, from creation to eventual archival or deletion. For any business to achieve a sustainable growth path, having an information lifecycle management (ILM) solution is a given. The technologies discussed in the following sections share a common approach to data life cycle management, as shown in Figure 14 – A Sustainable Information transition lifecycle. The approach is to start from a highly usable, high-performance entity, "Thick", where the data is on a disk drive with high performance characteristics and standard provisioning. This leads to eventual "Thin" or virtual provisioning. Thin or virtual provisioning can also include performance and capacity tiering down to a very granular level. The data can then become "Small" through data deduplication, compression and other redundancy-reduction methods.

Figure 14 – A Sustainable Information transition lifecycle


At some point in the data's life cycle, performance may no longer be much of an issue, and the physical media can spin down in a smart way, going into a "Green" state, to eventually be spun up again based on hints about when the application may need it. Eventually, data that is never touched may be deleted, arriving at a "Gone" state. We will discuss all of these life stages.

Compression, Archiving and Data Deduplication

In order to sustain data growth and thereby gain efficiency, it is imperative to implement some form of technology to reduce the amount of data through its life cycle, from creation to its eventual archiving and deletion. One way is to use compression to reduce the data footprint and to minimize the number of copies of data. The standard example is when a person sends an email to twenty people with a file attachment. Each person will have a copy of that file, and if each recipient saves that file in the mail archive, there will be twenty copies. This is obviously not a sustainable approach. Data deduplication is a technology that deals with this unsustainable data growth pattern, and together with compression it helps achieve efficiency and sustainability.

Best Practice – Implement Deduplication Technology

Data deduplication is an application-specific form of data compression in which redundant data is eliminated, typically to improve storage utilization. In this process, duplicate data is deleted, leaving only one copy. However, indexing of all the data, and the ability to retrieve it from its various sources, is retained should it ever be required. Deduplication reduces the required storage capacity since only the unique data is stored. Backup applications generally benefit the most from deduplication due to the nature of repeated full backups of an existing file system or multiple servers having similar images of an OS [10].

When implementing a data deduplication system, it is important to consider scalability to achieve true sustainability. Performance should remain acceptable as storage capacity grows and deduplication granularity increases. The deduplicated data should also be protected against data loss caused by errors in the deduplication algorithm.

Best Practice – Implement a data-deduplication solution addressing scaling and hash collisions

It is critical that data deduplication solutions detect duplicate data elements, making the determination that one file, block or byte is identical to another. Data deduplication products determine this by processing every data element through a mathematical "hashing" algorithm to create a unique identifier called a hash number. Each number is then compiled into a list, defined as the hash index.

When the system processes new data elements, the resulting hash numbers are compared against the hash numbers already in the index. If a new data element produces a hash number identical to an entry already in the index, the new data is considered a duplicate, and it is not saved to disk. A small reference "stub" that relates back to the identical data that has been stored is put in the original place. If the new hash number is not already in the index, the data element is considered new and stored to disk normally.
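The sketch below illustrates this hash-index scheme with naive fixed-size chunking, including a byte-for-byte comparison on a hash match to guard against the collisions discussed next. Real deduplication products use far more sophisticated chunking, indexing and storage layouts; the chunk size and example data are assumptions.

```python
import hashlib

CHUNK_SIZE = 4096
index = {}                    # hash number -> the one stored copy of that chunk

def store(data: bytes) -> list:
    """Split data into chunks, store only chunks not already in the index, and
    return the list of chunk references (hashes) that reassemble the data."""
    refs = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        h = hashlib.sha1(chunk).hexdigest()
        if h in index:
            # hash match: verify byte-for-byte before treating it as a duplicate
            assert index[h] == chunk, "hash collision detected"
        else:
            index[h] = chunk                  # new unique chunk, stored exactly once
        refs.append(h)                        # a small "stub" refers back to the chunk
    return refs

attachment = b"quarterly report " * 1000      # the 20-recipient e-mail example above
for _ in range(20):
    store(attachment)
print(f"{len(index)} unique chunks stored for 20 logical copies")
```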

It is possible that a data element can produce an identical hash result even though the data is not identical to the saved version. Such a false positive, also called a hash collision, can lead to data loss. There are two ways to reduce false positives:

• Use more than one hashing algorithm on each data element. Using in-band or out-of-band indexing with SHA-1 and MD5 algorithms is the best practice approach. This dramatically reduces the potential for false positives.
• Another best practice to reduce collisions is to use a single hashing algorithm but perform a bit-level comparison of data elements that register as identical.

The challenge with both approaches is that they require more processing power from the host system, be it the source host (source deduplication) or the target (target deduplication), which reduces index performance and slows the deduplication process. Source deduplication performs data reduction at the server, while target deduplication performs it at the target (VTL, disk, etc.). As the deduplication process becomes more granular and examines smaller chunks of data, the index becomes much larger, the probability of collisions increases, and any performance hit can be exacerbated.

Best Practice – Include and consider scaling and encryption in the deduplication process

Another issue is the relationship between deduplication, more traditional compression, and encryption in a company's storage infrastructure. Ordinary compression removes redundancy from files, and encryption "scrambles" data so that it is completely random and unreadable. Both compression and encryption play an important role in data storage, but eliminating redundancy in the data before deduplication can impair the deduplication process. Indexing and deduplication should be performed first if encryption or traditional compression is required along with deduplication.

Each "chunk" of data (i.e., a file, block or bits) is processed using a hash algorithm, such as MD5 or SHA-1, generating a unique reference for each piece. The resulting hash reference is then compared to an index of other existing hash numbers. If that hash number is already in the index, the data does not need to be stored again. If we have a new entry, the new hash number is added to the index and the new data is stored.

The more granular a deduplication platform is, the larger an index will become. For example, file-based deduplication may handle an index of millions, or even tens of millions, of unique hash numbers. Block-based deduplication will involve many more unique pieces of data, often numbering into the billions. Such granular deduplication demands more processing power to accommodate the larger index. This can impair performance as the index scales unless the hardware is designed to accommodate the index properly.

In rare cases, the hash algorithm may produce the same hash number for two different chunks of data. When such a hash collision occurs, the system fails to store the new data because it sees that hash number already. Such a "false positive" can result in data loss.

Best Practice – Utilize multiple hash algorithms, metadata hashing, compression and data reduction

It is a best practice to implement multiple hash algorithms and to examine metadata when identifying duplicate data, thereby preventing hash collisions and other abnormalities. Data deduplication is typically used in conjunction with other forms of data reduction, such as compression and delta differencing. In data compression technology, which has existed for about three decades, algorithms are applied to data in order to simplify large or repetitious parts of a file.

Delta differencing, primarily used in archiving or backup, reduces the total volume of stored data by saving only the changes to a file since its initial backup. For example, a file set may contain 400 GB of data, but if only 100 MB of data has changed since the previous backup, then only that 100 MB is saved. Delta differencing is frequently used in WAN-based backups to make the most of available bandwidth and to minimize the backup window. For additional information on WAN or network sustainability and efficiency options, please refer to the section titled "Network" starting on page 72.
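A minimal sketch of block-level delta differencing follows: a full baseline is kept once, and only the blocks that differ from it are saved. The block size and example data are assumptions.

```python
BLOCK = 4096

def delta(baseline: bytes, current: bytes) -> dict:
    """Return a map of block index -> new contents for blocks that changed."""
    changes = {}
    n_blocks = (max(len(baseline), len(current)) + BLOCK - 1) // BLOCK
    for i in range(n_blocks):
        old = baseline[i * BLOCK:(i + 1) * BLOCK]
        new = current[i * BLOCK:(i + 1) * BLOCK]
        if old != new:
            changes[i] = new
    return changes

baseline = bytes(400 * 1024)                        # 400 KB "initial full backup"
current = bytearray(baseline)
current[100 * 1024:100 * 1024 + 10] = b"x" * 10     # a small change to one block
changed = delta(baseline, bytes(current))
print(f"{len(changed)} changed block(s) saved instead of {len(current) // 1024} KB")
```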

Data deduplication also has ancillary benefits. The reduced volume of deduplicated or compressed data can be backed up faster, resulting in smaller backup windows, reduced recovery point objectives (RPOs) and faster recovery time objectives (RTOs). Disk archive platforms are able to store considerably more files. If tape is the ultimate backup target, smaller backups also use fewer tapes, resulting in lower media costs and fewer tape library slots being used.

For a virtual tape library (VTL), the reduction in disk space requirements translates into longer retention periods for backups within the VTL itself, making the VTL more sustainable. Data transfers complete sooner, freeing the network for other tasks, allowing additional data to be transferred, or reducing costs by using slower, less-expensive WANs. For additional information on WAN or network sustainability efficiency options, please refer to the section titled "Network", starting on page 72.

In addition to Archiving, Flash Storage, Data Compression, and De-duplication, we will discuss other technologies that advance the ability of a business to achieve sustainability.

Best Practice – Use Self-organizing systems theory for Storage

Storage virtualization products in the market today are a good first step, but enhancements are needed. Storage virtualization creates efficiencies by inserting a layer of abstraction between data and storage hardware, and that same concept can be taken further to present a layer of abstraction between data and the method in which data is stored. RAID is actually a well-known form of data virtualization: the linear sequence of bytes for data is transformed to stripe the data across the array, along with the necessary parity bits. RAID's data virtualization technique was designed over 20 years ago to improve data reliability and I/O performance. Even though RAID is a reliable and proven technology, we need to ask whether it will continue to serve us as we transition from structured data to large quantities of unstructured data.


EMC's most recent technologies, "Virtual LUN" and Fully Automated Storage Tiering, also known as FAST, are examples. FAST automatically and dynamically moves data across storage tiers so that it is in the right place at the right time, simply by pooling storage resources, defining the policy, and applying it to an application, similar to the modeling parameters Ai and Bi in Equation 6 – Linear model of a Control System on page 54.

FAST enables applications to remain optimized by eliminating trade-offs between capacity and performance. Automated storage tiering dynamically monitors and automatically relocates data to increase operational efficiency and lower costs.

By utilizing a FAST technology that implements transparent mobility (i.e., the application does not know that a transfer is going on) and dynamically moves application data across different storage types, great strides can be made in sustaining the ability to manage information. As of this writing, Flash, Fiber Channel, and SATA drive technologies are all supported in EMC's implementation of FAST.

"Dispersal" is another new technology being considered in storage. It is a natural successor to RAID for data virtualization because it can be configured with M-of-N fault tolerance, which can provide much higher levels of data reliability than RAID. Dispersal essentially packetizes the data into N packets and requires only a subset (any M packets) to recreate the data perfectly.
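To make the M-of-N idea concrete, the following deliberately tiny Python sketch implements a 2-of-3 scheme using XOR parity; production dispersal systems use proper erasure codes (for example Reed-Solomon) over many more packets, so this only illustrates the interface, not a real implementation.

```python
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def disperse(data: bytes):
    """Split data into 3 packets, any 2 of which recreate it (M=2, N=3)."""
    if len(data) % 2:
        data += b"\x00"                      # pad to an even length (not stripped on recovery)
    half = len(data) // 2
    a, b = data[:half], data[half:]
    return {"A": a, "B": b, "P": xor(a, b)}  # P is the XOR parity packet

def recover(packets: dict) -> bytes:
    if "A" in packets and "B" in packets:
        return packets["A"] + packets["B"]
    if "A" in packets:                       # B lost: B = A xor P
        return packets["A"] + xor(packets["A"], packets["P"])
    return xor(packets["B"], packets["P"]) + packets["B"]   # A lost

pkts = disperse(b"dispersal demo!!")
pkts.pop("B")                                # lose any one packet
print(recover(pkts))                         # the original bytes come back
```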

There will no longer be a tight coupling between hardware and the storage of the data packets. This is one major change for data virtualization that will occur as Dispersal replaces RAID, and it will eliminate the notion of a complete copy of the data residing on a specific piece of hardware.

Today's RAID systems stripe data and parity bits across disks within an array, within an appliance. When asked, "Where is my data?" the answer is typically "On this piece of hardware." This gives people peace of mind: something intangible (the data is actually virtualized) feels tangible because it is contained within a physical device.

The shift for IT storage administrators will be from asking "Where is my data?" (since it will be virtualized across multiple devices in multiple locations) to asking "Is my data protected?", because the second question is the root of the first. Once people get comfortable with giving up the control of knowing exactly where their data resides, they will realize the benefits of data virtualization.

Increased fault tolerance is the largest benefit of storing data packets across multiple hardware nodes. RAID is structured to provide disk drive fault tolerance: when a disk drive fails, the other disks can reconstruct the data.

Dispersal provides not only disk-drive-level fault tolerance, but also device-level and even location-level fault tolerance. When an entire device fails, the data can be reconstructed from virtualized data packets on other devices, whether centrally located or spread across multiple sites. Self-organizing systems can be used to manage the reconstruction of this virtualized data.

Current products focus on virtualizing storage pools and access from storage hardware. This is a good step towards avoiding a silo-style storage system. However, too much management burden is still placed on the storage administrator. The management systems do have self-discovery, but that simply means listing the hardware nodes that have been added to the system. The burden is still on the storage administrator to determine where and how to deploy those nodes.

Another step must occur to simplify the management burden; the system must evolve to self-organize. Self-organizing systems are made up of small units that collectively determine an inherent order. Instead of a storage administrator having to determine which pools and tiers to add storage nodes to, the nodes themselves will evolve to contain metadata and rules, and inherently place themselves within the storage tiers as the system requires.

An example of metadata and rules could relate to disk characteristics: SSDs (Solid State Disks) are suited for tier 1, performance-oriented scenarios. Another example relates to GPS location: storage nodes could know which data center they have been installed in and determine which storage pools they need to join. These types of self-organizing attributes are currently in development, for example in EMC's Atmos offerings.
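The sketch below illustrates the idea of nodes placing themselves based on their own metadata; the tier names, rules, and node attributes are hypothetical and do not describe EMC Atmos's actual mechanism.

```python
# Hypothetical node metadata and placement rules - a sketch of self-organization,
# not any vendor's implementation.
TIER_RULES = [
    ("tier-1-performance", lambda n: n["media"] == "SSD"),
    ("tier-2-capacity",    lambda n: n["media"] == "SATA" and n["capacity_tb"] >= 1),
    ("tier-3-archive",     lambda n: True),                 # default fallback
]

def self_place(node: dict) -> str:
    """Each node inspects its own metadata and joins the first matching tier/pool."""
    pool = node["data_center"]          # location metadata picks the data center pool
    for tier, rule in TIER_RULES:
        if rule(node):
            return f"{pool}/{tier}"

nodes = [
    {"id": "n1", "media": "SSD",  "capacity_tb": 0.4, "data_center": "boston-dc"},
    {"id": "n2", "media": "SATA", "capacity_tb": 2.0, "data_center": "austin-dc"},
]
for n in nodes:
    print(n["id"], "->", self_place(n))
```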

Beyond provisioning storage hardware, self-organizing systems will also organize the tiers themselves. Storage administrators will define the requirements for tiers (QoS, data reliability, performance), and the storage nodes will self-organize underneath them. That means that when capacity and performance nodes are added to the system, the system will also determine which tiers need those resources.

Emergent patterns will surface once the storage nodes have metadata and rules, and the toolset for managing the system will change. Rather than managing the physical hardware, and individual storage pools and access, management will occur at a system level – what storage is required in which locations based on how information is dynamically moving across storage nodes and tiers.

Autonomic self healing systems

The concept of “Self Healing” systems is a subset of self-organizing systems (see section “Self Organizing Systems”, starting on page 50, for more information) in that it utilizes a similar closed loop system theory construct. The fundamental approach or rules are:

• The System must know itself
• The System must be able to reconfigure itself within its operational environment
• The System must preemptively optimize itself
• The System must detect and respond to its own faults as they develop
• The System must detect and respond to intrusions and attacks
• The System must know its context of use
• The System must live in an open world, assuming the security requirements allow
• The System must actively shrink the gap between user/business goals and IT solutions to achieve sustainability

Autonomic computing is really about making systems self-managing. Think about biological systems like the human body: they are tremendously complex and very robust. The human body is constantly making adjustments; your heart rate is controlled, your breathing rate is controlled, and all of these things happen beneath the level of conscious control. Biological systems are a good model for thinking about computer systems. When we look at the attributes of biological systems, we find attributes we wish our computer systems had, like self-healing, self-configuring, and self-protecting. We can begin to build the attributes that we see in biological systems into complex computer systems. In the end, it translates into real customer benefits, because these more complex systems are easier to administer and control, and are more sustainable.
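A minimal sketch of such a closed-loop, self-managing cycle is shown below; the monitor stub, setpoint, and actions are hypothetical placeholders for real sensors and reconfiguration logic.

```python
import random
import time

TARGET_UTIL = 0.70                    # hypothetical setpoint, like a heart-rate target

def monitor() -> float:
    return random.uniform(0.3, 1.0)   # stand-in for a real utilization sensor

def plan(util: float) -> str:
    if util > TARGET_UTIL + 0.1:
        return "add-capacity"
    if util < TARGET_UTIL - 0.2:
        return "reclaim-capacity"
    return "no-op"

def execute(action: str):
    print("executing:", action)       # a real system would reconfigure itself here

for _ in range(3):                    # the loop runs continuously in a real controller
    execute(plan(monitor()))
    time.sleep(0.1)
```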

Best Practice – Utilize Autonomic self-healing systems for Storage

As it relates to Storage, in addition to FAST, Dispersal and RAID technology to store and protect data, another relatively new technology has been implemented, commonly referred to as "autonomic self-healing storage." This technology promises to substantially increase the reliability of disk systems. Autonomic self-healing storage is different from RAID, redundant array of independent nodes (RAIN), snapshots, continuous data protection (CDP) and mirroring. RAID, RAIN, and the like are designed to restore data from a failure situation.

These technologies, however, actually provide self-healing data, not self-healing storage: they restore data when there is a storage failure and mask that failure from the applications. RAID and RAIN do not restore the actual storage hardware.

Autonomic self-healing systems transparently restore both the data and the storage from a failure. It has been statistically demonstrated that as HDDs proliferate, so will the number of hard disk drive failures, which can lead to lost data. Analyzing what happens when an HDD fails illustrates the issue:

If a hard disk drive fails, the drive must be physically replaced, either manually or from an online pool of spare drives. Depending on the RAID level, the HDD's data is rebuilt on the spare. RAID 1/3/4/5/6/10/60 all rebuild the hard disk drive's data, based on parity or mirroring. RAID 0 cannot rebuild the HDD's data.

The time it takes to rebuild the HDD's data depends on the hard disk drive's capacity, its speed, and the RAID type. A 1 TB 7,200 rpm SATA HDD in a RAID 5 group will take approximately 24 to 30 hours to rebuild, assuming the process is given a high priority.

If the rebuild process is given a low priority and made a background task to be completed in off hours, the rebuild can take as long as eight days. The RAID group is subject to a higher risk of a second disk failure or a non-recoverable read error during the rebuild, which would lead to lost data, because the rebuild must read every byte on every remaining drive in the RAID group to reconstruct the data. (Exceptions are RAID 6 and RAID 60.)

SATA drives typically have a rated non-recoverable read error rate of 1 in 10^14 bits: roughly 1 out of 100,000,000,000,000 bits read will suffer a non-recoverable read error. This means that a seven-drive RAID 5 group with 1 TB SATA drives will have approximately a 50% chance of failing during a rebuild, resulting in the loss of the data in that RAID group.

Enterprise-class drives (Fiber Channel or SAS) are rated at 1 in 10^15 bits for non-recoverable read errors, which translates into less than a 5% chance of a RAID 5 group failure during a rebuild. RAID 6 eliminates the risk of data loss should a second HDD fail; you pay for that peace of mind with decreased write performance versus RAID 5 and an additional parity drive in the RAID group. Eventually, the failed hard disk drive is sent back to the factory. Using typical MTBFs, there will be approximately 40 HDD "service events" per year.
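The figures above can be approximated from the published unrecoverable-read-error rates. The short Python calculation below assumes decimal terabytes and that every bit on the surviving drives is read exactly once during the rebuild; these simplifications are ours, so the exact percentages differ somewhat from the rounded figures cited.

```python
import math

def rebuild_failure_probability(surviving_drives, drive_tb, ber):
    """Probability of hitting at least one unrecoverable read error (URE)
    while reading every bit on the surviving drives once."""
    bits_read = surviving_drives * drive_tb * 1e12 * 8
    return -math.expm1(bits_read * math.log1p(-ber))

# Seven-drive RAID 5 group of 1 TB SATA drives (URE rate ~1 in 10^14 bits)
print(round(rebuild_failure_probability(6, 1, 1e-14), 2))   # ~0.38, same order as the ~50% cited
# Same group built from enterprise drives (URE rate ~1 in 10^15 bits)
print(round(rebuild_failure_probability(6, 1, 1e-15), 3))   # ~0.047, roughly the <5% cited
```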

Best Practice – Consider Autonomic Storage solutions utilizing Standards

New storage systems, including EMC's VMAX, tackle end-to-end autonomic self-healing and error detection and correction, including silent data corruption (see the section titled "Best Practice – Implement Undetected data corruption technology into environment", starting on page 69). In addition, sophisticated algorithms that attempt to "heal in place" failed HDDs before requiring a RAID data rebuild are also implemented. A technology currently being developed is the relatively new concept of "fail in place": in the rare circumstance when an HDD truly fails (i.e., it is no longer usable), no service event is required to replace the hard disk drive for a RAID data rebuild. This would add to the sustainability equation.

T10 DIF is a relatively new standard and applies only to SCSI protocol HDDs (SAS and Fiber Channel). As of this writing, there is no standard specification for end-to-end error detection and correction for SATA hard disk drives. As a result, EMC and others have devised proprietary end-to-end error detection and correction methodologies for SATA.

The American National Standards Institute's (ANSI) T10 DIF (Data Integrity Field) specification calls for data to be written in blocks of 520 bytes instead of the current industry-standard 512 bytes. The eight additional bytes, the "DIF", provide a super-checksum that is stored on disk with the data. The DIF is checked on every read and/or write of every sector. This makes it possible to detect and identify data corruption or errors, including misdirected, lost or torn writes. ANSI T10 DIF provides three types of data protection (a hedged sketch of this 520-byte layout follows the list below):

• Logical block guard for comparing the actual data written to disk
• Logical block application tag to ensure writing to the correct logical unit (virtual LUN)
• Logical block reference tag to ensure writing to the correct virtual block
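As an illustration of the 520-byte sector format described above, the Python sketch below appends and verifies an 8-byte protection field (2-byte guard, 2-byte application tag, 4-byte reference tag). The guard here is a generic CRC stand-in rather than the exact T10 CRC-16 polynomial, and the tag values are invented for the example.

```python
import struct
import zlib

def make_dif(data_512: bytes, app_tag: int, ref_tag: int) -> bytes:
    """Append an 8-byte protection field to a 512-byte block: 2-byte guard,
    2-byte application tag, 4-byte reference tag (520 bytes total)."""
    assert len(data_512) == 512
    guard = zlib.crc32(data_512) & 0xFFFF          # stand-in checksum, not the T10 CRC-16
    return data_512 + struct.pack(">HHI", guard, app_tag, ref_tag)

def check_dif(sector_520: bytes, expected_app_tag: int, expected_ref_tag: int) -> bool:
    data, dif = sector_520[:512], sector_520[512:]
    guard, app_tag, ref_tag = struct.unpack(">HHI", dif)
    return (guard == (zlib.crc32(data) & 0xFFFF)   # the data actually written matches
            and app_tag == expected_app_tag        # written to the correct logical unit
            and ref_tag == expected_ref_tag)       # written to the correct virtual block

sector = make_dif(b"\x00" * 512, app_tag=0x0001, ref_tag=42)
print(check_dif(sector, 0x0001, 42))               # True
```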

When errors are detected, they can then be fixed by the storage system's standard correction mechanisms. Self-healing storage solves tangible operational problems in the data center and allows a more sustainable and efficient environment. This technology reduces service events, costs, management, the risk of data loss, and application disruptions.

Best Practice – Implement Undetected data corruption technology into environment

Another problem with HDDs that is rarely mentioned but is quite prevalent is "silent data corruption." Silent data corruptions are storage errors that go unreported and undetected by most storage systems, resulting in corrupt data being provided to an application with no warning, logging, error messages, or notification of any kind.

Most storage systems do not detect these errors, which occur on average in 0.6% of SATA HDDs and 0.06% of enterprise HDDs over 17 months12. Silent data corruption occurs when RAID does not detect data corruption errors, such as misdirected or lost writes. It can also occur with a "torn write": data that is partially written merges with older data, so the result is part original data and part new data. Because the hard disk drive does not recognize these errors, the storage system is not aware of them either, so there is no attempt at a fix. See the section titled "Autonomic self healing systems", starting on page 66, above, for additional information.

Storage Media – Flash Disks

A number of techniques can be applied to reduce power consumption in a storage system and therefore increase efficiency. Disk drive technology vendors are developing low-spin disks, which can be slowed or stopped to reduce power consumption when not in use. Another technology is Flash disks. Caching techniques that reduce disk accesses, and the use of 2.5-inch rather than 3.5-inch form factors, can reduce voltage requirements from 12 volts to 6 volts. An industry-wide move toward higher-capacity Serial ATA (SATA) drives and 2.5-inch disks is under way, which some claim will lead to better energy performance.

12 "An Analysis of Data Corruption in the Storage Stack," L.N. Bairavasundaram et al., presented at FAST '08 in San Jose, Calif.

Best Practice – Utilize low power flash technologies

A best practice is to utilize Flash storage as applicable to the data center's tiered storage requirements. Low-power technologies are starting to enter the data center. With the advent of Solid State Disks (SSDs), this enabling semiconductor technology can and will have a major impact on power efficiency.

For example, an SSD system can be based on double data rate (DDR) DRAM technology integrated with battery backup, with a Fiber Channel (FC) interface consistent with conventional hard drives. This SSD technology has been available for years and has established itself in a niche market serving large processing-intensive government projects and companies involved in high-volume, high-speed/low-latency transactions such as stock trading systems.

For additional information on Flash Disk Technologies, please refer to the 2008 EMC Proven Professional Knowledge Sharing article titled “Crossing the Great Divide in Going Green: Challenges and Best Practices in Next Generation IT Equipment”.

Server Virtualization

With the advent of virtualization at the host level, there are new possibilities to balance application workloads in line with the requirements of self-organized systems. VMware Infrastructure now provides two capabilities:

1. Resource pools, which simplify control over the resources of a host
2. Clusters, which aggregate and manage the combined resources of multiple hosts as a single collection

In addition, VMware now offers Distributed Resource Scheduling (DRS), which dynamically allocates and balances computing capacity across the logical resource pools defined for the virtual infrastructure.

DRS continuously monitors utilization across the resource pools and intelligently allocates available resources among virtual machines based on resource allocation rules that reflect business needs and priorities. Virtual machines operating within a resource pool are not tied to the particular physical server on which they are running at any given point in time. When a virtual machine experiences increased load, DRS first evaluates its priority against the established resource allocation rules and then, if justified, allocates additional resources by redistributing virtual machines among the physical servers. VMotion executes the live migration of the virtual machine to a different server with complete transparency to end users. This dynamic resource allocation ensures that capacity is preferentially dedicated to the highest priority applications, while at the same time maximizing overall resource utilization.
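The fragment below is a greatly simplified, hypothetical rebalancing pass in the spirit of what DRS automates; it is not VMware's actual algorithm, and the host names, utilization numbers, and threshold are invented for illustration.

```python
# Move one VM from the most loaded host to the least loaded host when the
# imbalance exceeds a threshold. Real DRS also weighs priorities, affinity
# rules, and migration cost before recommending a move.
def rebalance(hosts: dict, threshold: float = 0.15):
    load = {h: sum(vms.values()) for h, vms in hosts.items()}
    busiest = max(load, key=load.get)
    idlest = min(load, key=load.get)
    if load[busiest] - load[idlest] > threshold and hosts[busiest]:
        vm, _demand = min(hosts[busiest].items(), key=lambda kv: kv[1])
        hosts[idlest][vm] = hosts[busiest].pop(vm)     # "VMotion" the smallest VM
        return f"migrate {vm} from {busiest} to {idlest}"
    return "cluster balanced"

hosts = {"esx1": {"vm-a": 0.5, "vm-b": 0.3}, "esx2": {"vm-c": 0.2}}
print(rebalance(hosts))
```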

Best Practice – Implement DRS

Utilizing DRS and VirtualCenter provides a view and management of all resources in the cluster, emulating a self-organized solution. As shown in Figure 15 – Self Organized VM application controller, below, a global scheduler within VirtualCenter enables resource allocation and monitoring for all virtual machines running on ESX Servers that are part of the cluster.

Figure 15 – Self Organized VM application controller

DRS provides automatic initial virtual machine placement on any of the hosts in the cluster, and also makes automatic resource relocation and optimization decisions as hosts or virtual machines are added to or removed from the cluster. DRS can also be configured for manual control, in which case it only makes recommendations that you can review and carry out. DRS provides several additional benefits to IT operations:

• Day-to-day IT operations are simplified, as staff members are less affected by localized events and dynamic changes in their environment. Loads on individual virtual machines invariably change, but automatic resource optimization and relocation of virtual machines reduces the need for administrators to respond, allowing them to focus on the broader, higher-level tasks of managing their infrastructure.
• DRS simplifies the job of handling new applications and adding new virtual machines. Starting up new virtual machines to run new applications becomes more a task of high-level resource planning and determining overall resource requirements than of reconfiguring and adjusting virtual machine settings on individual ESX Server machines.
• DRS simplifies the task of removing hardware when it is no longer needed, or replacing older host machines with newer and larger-capacity hardware. To remove hosts from a cluster, you can simply place them in maintenance mode, so that all virtual machines currently running on those hosts are reallocated to other resources of the cluster. After monitoring the performance of the remaining systems to ensure that adequate resources remain for the currently running virtual machines, you can remove the hosts from the cluster to allocate them to a different cluster, or remove them from the network if the hardware resources are no longer needed. Adding new resources to the cluster is also straightforward, as you can simply drag and drop new ESX Server hosts into a cluster.

Network

Best Practice - Architect Your Network to Be the Orchestration Engine for Automated Service Delivery (The 5 S’s)

With respect to network technologies, we must rethink how we will build out data communication networks going forward. This is especially true with new data center architectures being developed to address the need for efficiency and sustainability. The challenge is to develop a set of best practices for a cloud-ready network that automates service delivery. Built correctly, the network can become the orchestration engine for your cloud and sustainability strategy. Cloud services need a network that embraces five architectural goals. Design a cloud network with these principles[12]:

Scalable: Your cloud network must scale without adding complexity or sacrificing performance. This means scaling in lockstep with the dynamic consumption of software, storage, and application resources without "throwing more infrastructure" at the problem.

Simplified: To achieve scale and reduce operational costs, you must simplify your network design. Fewer moving parts, collapsed network tiers, a single operating system if possible, and fewer interfaces ensure scalability and pave the way for automation.

Standardized: Cloud computing requires commoditized, standards-based technologies. Likewise, your cloud network cannot be based on proprietary components that increase the capital and operation costs associated with delivering cloud services.

Shared: Cloud networks must be built with multi-tenancy in mind. Different customers, departments, and lines of business will consume various cloud services with their own unique requirements. A shared network with differentiated service levels is required.

Secure: Cloud networks must embrace security on two levels: 1) controls built into the fabric that prevent wide-scale infrastructure and application breaches and disruptions 2) overlying identity and data controls to combat regulatory, privacy, and liability concerns. This security must be coordinated so that the network secures traffic along three key connections: among virtual machines within the data center, between data centers, and from clients to data centers.

It is important to note that these five principles are interrelated. Take simplicity, for example. It is critical to simplify your cloud network infrastructure to ensure scalability as well as reduce the number of moving parts needed to secure the end-to-end platform. However, to simplify, you must standardize the components and build on an open network platform as well as use shared components to get economies of scale.

Best Practice - Select the Right Cloud Network Platform

The cloud network is a platform. However, few companies and even fewer vendors ask the question “A platform for what?” The answer is automation. Successful cloud networking requires building a network with automation as part of its core focus. Automation makes troubleshooting, security, provisioning, and other service delivery components less expensive and more reliable. To bake automation into your cloud services, select a vendor that embraces automation in four areas:

Cloud network infrastructure
Routing, switching, and network appliances are the core components of delivering high-performance cloud services. A best practice is to find a vendor whose infrastructure provides scalability and standardization, the building blocks for automation. In addition, the network infrastructure must be application-aware to provide the granularity for quality-of-service and quality-of-experience delivery requirements.

Cloud network operating system (OS)
Hardware is the core platform, but the OS is the key to automation. Look for a vendor that provides an OS that is standardized, shared, and simplified across its entire routing, switching, and appliance infrastructure. A good platform OS has hooks to automate delivery across the infrastructure, as well as an open development ecosystem for third-party, cloud-specific applications to run natively on the network.

Cloud network management systems
The infrastructure and OS are responsible for enabling automation, but the management system is responsible for orchestrating it. A best practice is to select a vendor with a single management system for its entire portfolio. Cloud management requires rich policy interfaces and the ability to define differentiated services in the cloud.

Cloud network security
Security appliances are part of your platform infrastructure, but you will also need end-to-end security automated across the cloud. A best practice is to select a vendor with baked-in security that protects the network cloud at the macro level (broadly, across all infrastructure and the OS) and the micro level (granularly, to protect individual sessions).

Best Practice – Consider implementing layer 2 Locator/ID Separation

With advances in the ability to move applications via virtualization, such as VMotion, one of the challenges is the current IETF IP routing and addressing architecture. Having a single numbering space, the "IP address", serve both host transport-session identification and network routing creates scaling and interoperability issues. This is particularly true with the new infrastructure architectures outlined in the subsequent section titled "Infrastructure Architectures", starting on page 80. We can realize a number of scaling benefits by separating the current IP address into separate spaces for Endpoint Identifiers (EIDs) and Routing Locators (RLOCs); among them are the following (a conceptual sketch of the EID-to-RLOC mapping follows the list):

1. Reduction of the routing table size in the "default-free zone" (DFZ). RLOCs would be assigned by internet providers at client network attachment points, greatly improving aggregation and reducing the number of globally visible, routable prefixes.
2. Cost-effective multi-homing for sites that connect to different service providers, including Cloud Service Providers, so that providers can control their own policies for packet flow into the site without using extra routing table resources of core routers.
3. Easing of the renumbering burden when clients change providers. Because host EIDs are numbered from a separate, non-provider-assigned and non-topologically-bound space, they do not need to be renumbered when a client site changes its attachment points to the internal or external network.
4. Traffic engineering capabilities that can be performed by network elements and do not depend on injecting additional state into the routing system.
5. Mobility without address changing. Existing mobility mechanisms will be able to work in a locator/ID separation scenario. It will be possible for a host (or a cluster of physical or virtual hosts) to move to a different point in the network topology (internal, external or hybrid Clouds), either retaining its initial address or acquiring a new address based on the new network location. A new network location could be a physically different point in the network topology, or the same physical point of the topology with a different provider.
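The Python fragment below is a conceptual sketch of the locator/ID split referenced in the list above, not the actual LISP wire protocol or message formats; the EID, the RLOC strings, and the mapping dictionary are invented for illustration. The point is that a workload keeps its identifier while a move only updates the mapping entry.

```python
# Endpoint identifiers (EIDs) stay fixed; the mapping system tracks each
# EID's current routing locator (RLOC).
mapping_system = {"10.1.1.5": "provider-a.border-router.203.0.113.1"}

def send(dst_eid: str, payload: str) -> str:
    rloc = mapping_system[dst_eid]                     # resolve the EID to its current RLOC
    return f"encapsulated [{payload}] toward RLOC {rloc} for EID {dst_eid}"

print(send("10.1.1.5", "app traffic"))
# Move the workload to another site/provider: only the mapping changes,
# the EID (and any sessions keyed on it) stays the same.
mapping_system["10.1.1.5"] = "provider-b.border-router.198.51.100.7"
print(send("10.1.1.5", "app traffic"))
```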

Currently, the IETF (Internet Engineering Task Force) is working on standards13 that will implement this type of protocol. Cisco is a major driver in this endeavor. By decoupling endpoint locators (addresses) from the routing information, the ability to implement dynamic network changes will allow cloud providers and consumers to be more efficient and sustainable.

13 IETF Locator/ID Separation Protocol (LISP) - http://tools.ietf.org/html/draft-ietf-lisp-05

Best Practice – Build a Case to Maximize Cloud Investments

Regardless of whether the plan is to consume or provide cloud services, a best practice is to select a vendor that provides the right infrastructure, OS, management, and security. Articulate a compelling business case by considering these business and technical recommendations.

Best Practice - Service Providers - Maximize and sustain Cloud Investments

The recommendation is that CEOs and business executives at cloud providers should consider optimizing revenue with a healthy mix of small, medium-size, and large companies. Smaller firms provide short sales cycles and quick cash, but enterprises provide long-term profitability. Focus on monetizing your assets by starting with just one or two of the cloud flavors; do not overstretch and do IaaS, PaaS, and SaaS out of the gate. Allow customers to cut through the clutter and focus on cost by providing tiered pricing and a self-service portal where users can pay immediately by plunking down a credit card.

CTOs and technical executives at cloud providers should demonstrate that their cloud network starts with a low-cost, fixed-priced service and quickly scales capacity. Provide a road map for how you will scale across the IaaS, PaaS, and SaaS flavors with the proper network capacity to consume all services.

Offer cloud network service-level agreements (SLAs) that tackle accessibility, reliability, and performance. To offer a sustainable offering, consider that cloud services are standardized, but SLAs are customized. It is important to demonstrate that the offering can tailor SLAs and provide business-specific granularity.

It is also a best practice to design cloud networks with visibility and quality-of-service reports so that customers can run their own reports and audits, but also to dedicate ample resources to accommodate customers auditing your services to ensure they are SAS 70 Type II- and PCI-compliant.

Best Practice - Enterprises - maximize and sustain Cloud Investments

It is a best practice that CIOs, line-of-business managers, and other enterprise business executives focus on cloud services as providing cost savings in the short term, and automation and flexibility as driving competitive advantages in the long term. Identify business processes that drive revenue or customer interactions but do not require mission-critical infrastructure, and demonstrate business value by putting them in the cloud first. Restructure how you position SLAs with business peers: focus on being a service provider, and drive conversations from technical SLAs to business-outcome SLAs.

A best practice for Network Architects and senior infrastructure leaders is to build an environment with a hybrid internal/public cloud in mind. Use basic building blocks like virtualized machines and high-performance networks to ensure that you can scale quickly. Provide granular, real-time visibility across your cloud network. This allows service-level monitoring, cost tracking, integration with security operations, and detailed audit logs. Build identity management hooks into the cloud to automate user provisioning; enforce proper access management of partners, suppliers, and customers; and appease auditors.

Best Practice – Understand Information Logistics and Energy transposition tradeoffs

You must also consider network efficiencies in terms of creating a sustainable business or environmental model. As will be discussed in the business practices section titled "Economics," starting on page 162, the most efficient network may not be the most sustainable network at the macro level.

As shown in Figure 16 - Energy in Electronic Integrated Circuits, on page 78, below, each network switch has a line card, a card that transmits data packets over a digital network. Each line card has many CMOS ASICs, and each ASIC has millions of CMOS gates.

Figure 16 - Energy in Electronic Integrated Circuits

As shown in Equation 7 – Energy Consumed by a CMOS ASIC and Equation 8 – Power Consumed by a CMOS ASIC, both shown below, the energy consumed is the sum of the gate energies plus one half of the total wire capacitance multiplied by the square of the supply voltage.

Equation 7 – Energy Consumed by a CMOS ASIC14

Energy = Σ E_Gate + (1/2) (Σ C_Wire) V²

The power consumed by the ASIC is the product of the energy consumed and the data bit rate. As you can see, as the bit rate increases, the power increases at a linear rate.

Equation 8 – Power Consumed by a CMOS ASIC15

14 IEEE ASIC Design Journal, Nov 2007
15 IEEE ASIC Design Journal, Nov 2007

Power = [ Σ E_Gate + (1/2) (Σ C_Wire) V² ] × BitRate
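As a hedged illustration of Equation 8, the following Python snippet plugs in purely hypothetical values for the summed gate energy, wire capacitance, and supply voltage to show the linear dependence of power on bit rate; the numbers are not measurements of any real line card ASIC.

```python
# Hypothetical per-bit energy terms just to show the linear scaling with bit rate.
def asic_power_watts(gate_energy_j, wire_capacitance_f, voltage_v, bit_rate_bps):
    energy_per_bit = gate_energy_j + 0.5 * wire_capacitance_f * voltage_v ** 2
    return energy_per_bit * bit_rate_bps

for rate in (1e9, 10e9, 100e9):                      # 1, 10, 100 Gbit/s
    watts = asic_power_watts(5e-10, 1e-9, 1.0, rate) # assumed summed E_Gate, C_Wire, V
    print(f"{rate/1e9:>5.0f} Gbit/s -> {watts:.1f} W")
```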

The good news is “Moore’s Law” benefits us in that the switching energy is decreasing over time as shown in Figure 17 - Moore's Law - Switching Energy, shown on page 79. However, network use is increasing even faster.

Figure 17 - Moore's Law - Switching Energy16

It is also interesting to note that in some situations it is more energy efficient to utilize physical transport than the IP network. Based on the equations defining power and energy utilization, in the example of transporting large amounts of data for backup, replication or general data movement purposes, a physical move can be more efficient. Take the case shown in Figure 18 - Data by physical vs. Internet transfer, on page 80: transferring 9 PB of data by physically moving it would be more efficient than transferring the data over the internet in the equivalent time interval. In addition, the kilograms of CO2 emitted are substantially reduced.
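The following back-of-envelope Python calculation illustrates the same trade-off under assumed parameters of our own choosing (a 10 Gbit/s sustained WAN link and 2 TB drives); it is not derived from the figure's underlying data.

```python
# Moving a large data set over the network versus shipping it physically.
DATA_PB = 9                                      # the 9 PB example cited above
link_gbps = 10                                   # assumed sustained WAN throughput
transfer_days = DATA_PB * 1e15 * 8 / (link_gbps * 1e9) / 86400
drives_needed = DATA_PB * 1e15 / 2e12            # assumed 2 TB drives (2010-era)
print(f"network transfer at {link_gbps} Gbit/s: ~{transfer_days:,.0f} days")
print(f"physical move: ~{drives_needed:,.0f} drives shipped, days in transit")
```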

16 Intel, 2007

Figure 18 - Data by physical vs. Internet transfer17

So, what are the best practices for a vendor or business in terms of an efficient network?

• Choose a vendor with the lowest power and smallest footprint per unit (lambda, port, and bit).
• Leverage long-haul technologies and ROADMs (reconfigurable optical add-drop multiplexers) to reduce intermediate regeneration.
• Push fiber and (passive) WDM closer to the end user and eliminate local exchanges.
• Aggregate multiple service networks onto a single optical backhaul network.
• Concentrate higher-layer routing into fewer, more efficient data centers and COs (Central Offices).
• Use service demarcation techniques to allow lower-layer switching, aggregation, and backhaul all the way to the core.

Infrastructure Architectures

In order to achieve efficiency and sustainability in the data center, or wherever the IT entity is located, it is important to understand the various architectures that are available. Depending on the use case and business requirements, some architecture types may be a better fit; in some use cases, implementing all or a subset may make sense. Figure 19 – Sustainability Ontology – Infrastructure Architectures, shown below on page 84, outlines the possible architectures available today.

17 Rod Tucker, ARC Special Research Centre for Ultra-Broadband Information Networks (CUBIN)

The first is the legacy data center, which consists of a physical location with centralized IT equipment, power, cooling and support. The others are utility computing, warehouse scale machines and cloud computing. Cloud Computing has a few variants that will be discussed in subsequent sections. Warehouse Scale Machines are unique in the sense that this architecture supports specific business models. Businesses that support a few specific applications are one thing; being able to scale to thousands of servers spanning multiple data centers across the globe, such as Google, is another.

Datacenters are essentially very large devices that consume electrical power and produce heat. The datacenter's cooling system removes that heat, consuming additional energy in the process. It is not surprising, then, that the bulk of the construction cost of a datacenter is proportional to the amount of power delivered and the amount of heat to be removed; in other words, most of the money is spent either on power conditioning and distribution or on cooling systems.

Data Center Tier Classifications

The overall design of a datacenter is often classified as belonging to “Tier I–IV.” Tier I datacenters have a single path for power and cooling distribution, without redundant components. Tier II adds redundant components to this design (N + 1), improving availability. Tier III datacenters have multiple power and cooling distribution paths but only one active path. They also have redundant components and are concurrently maintainable, that is, they provide redundancy even during maintenance, usually with an N + 2 setup. Tier IV datacenters have two active power and cooling distribution paths, redundant components in each path, and are supposed to tolerate any single equipment failure without impacting the load. These tier classifications are not 100% precise. Most commercial datacenters fall somewhere between tiers III and IV, choosing a balance between construction costs and reliability. Real-world datacenter reliability is also strongly influenced by the quality of the organization running the datacenter, not just by the datacenter’s design. Typical availability estimates used in the

industry range from 99.7% availability for tier II datacenters to 99.98% and 99.995% for tiers III and IV, respectively.
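As a quick sanity check on the availability figures just cited, the short Python snippet below converts each percentage into expected downtime per year; the tier-to-percentage mapping simply restates the estimates above.

```python
# Convert the cited tier availability estimates into expected annual downtime.
for tier, availability in (("II", 0.997), ("III", 0.9998), ("IV", 0.99995)):
    downtime_hours = (1 - availability) * 365 * 24
    print(f"Tier {tier}: {availability:.3%} availability ≈ {downtime_hours:.1f} hours of downtime per year")
```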

Datacenter sizes vary widely. Two thirds of US servers are housed in datacenters smaller than 5,000 sq ft and with less than 1 MW of critical power. Most large datacenters are built to host servers from multiple companies (often called co-location datacenters, or “colos”) and can support a critical load of 10–20 MW. Very few datacenters today exceed 30 MW of critical capacity.

The data center as we know it is changing. Not only is the data center changing physically in terms of power, cooling and other metrics, but how a business uses the information infrastructure is also changing. As shown in Figure 19 – Sustainability Ontology – Infrastructure Architectures, on page 84, the architecture of data centers is changing, which means the options for the business or IT department are increasing rapidly. Not only is the standard data center architecture available, but options such as Utility Computing, Warehouse Scale Data Centers, and the various flavors of Cloud offerings are now available. So how does one determine the best options, or the best mix of technologies and architectures, to meet the sustainability and business requirements? Best practices will be discussed.

"Cloud Computing" is rising quickly, with its data centers growing at an unprecedented rate. However, this is accompanied by concerns about privacy, efficiency at the expense of resilience, and environmental sustainability, because of the dependence on Cloud vendors such as Google, Amazon, EMC and Microsoft. There is, however, an alternative model for the Cloud conceptualization that provides a paradigm for Clouds in the community, utilizing networked personal computers as a liberation from the centralized vendor model.

Community Cloud Computing offers an alternative architecture, created by combining the Cloud with paradigms from Grid Computing, principles from Digital Ecosystems, and sustainability from Green Computing, while remaining true to the original vision of the Internet. It is more technically challenging than Cloud Computing, dealing with distributed computing issues including heterogeneous nodes, varying quality of service, and additional security constraints. However, these challenges are surmountable, and with the need to retain control over our digital lives and the potential environmental consequences, it is a challenge worth pursuing.

The recent development of Cloud Computing provides a compelling value proposition for organizations to outsource their Information and Communications Technology (ICT) infrastructure. However, there are growing concerns over the control ceded to large Cloud vendors, especially the lack of information privacy. In addition, the data centers required for Cloud Computing are growing exponentially, creating an ever-increasing carbon footprint and therefore raising environmental concerns. The distributed resource provision of Grid Computing, the distributed control of Digital Ecosystems, and the sustainability of Green Computing can remedy these concerns. Cloud Computing combined with these approaches would provide a compelling socio-technical conceptualization for sustainable distributed computing that utilizes the spare resources of networked personal computers to collectively provide the facilities of a virtual data center and form a Community Cloud. This essentially reformulates the Internet to reflect its current uses and scale, while maintaining the original intentions for sustainability in the face of adversity. Extra capabilities would be embedded into the infrastructure, becoming as fundamental and invisible as moving packets is today.

Cloud Computing is likely to have the same impact on software that foundries have had on the hardware industry. At one time, leading hardware companies required a captive semiconductor fabrication facility, and companies had to be large enough to afford to build and operate it economically. However, processing equipment doubled in price every technology generation. A semiconductor fabrication line costs over $3B today, so only a handful of major “merchant” companies with very high chip volumes, such as Intel and Samsung, can still justify owning and operating their own fabrication lines. This motivated the rise of semiconductor foundries that build chips for others, such as Taiwan Semiconductor Manufacturing Company (TSMC). Foundries enable “fab-less” semiconductor chip companies whose value is in innovative chip design: A company such as nVidia can now be successful in the chip business without the capital, operational expenses, and risks associated with owning a state-of-the-art fabrication line. Conversely, companies with fabrication lines can time-multiplex their use among the products of many fab-less companies, to lower the risk of not having enough successful products to amortize operational costs. Similarly, the advantages of the economy of scale and statistical multiplexing may ultimately lead to a handful of Cloud Computing providers who can amortize the cost of their large datacenters over the products of many “datacenter-less” companies.


Figure 19 – Sustainability Ontology – Infrastructure Architectures

Cloud Overview

Cloud Computing is the use of Internet-based technologies for the provision of services, originating from the cloud as a metaphor for the Internet, based on depictions in computer network diagrams that abstract the complex infrastructure it conceals. It can also be seen as a commercial evolution of the academic-oriented Grid Computing, succeeding where Utility Computing struggled, while making greater use of the self-management advances of Autonomic Computing as discussed in the section titled "Autonomic self healing systems", on page 66.

Cloud Computing offers the illusion of infinite computing resources available on demand, with the elimination of upfront commitment from users, and payment for the use of computing resources on a short-term basis as needed.

Furthermore, it does not require the node providing a service to be present once its service is deployed. It is being promoted as the cutting edge of scalable web application development, in which dynamically scalable and often virtualized resources are provided as a service over the Internet, with users having no knowledge of, expertise in, or control over the technology infrastructure of the Cloud supporting them. It currently has significant momentum in two extremes of the web development industry: the consumer web technology incumbents who have resource surpluses in their vast data centers, and the consumers and start-ups that do not have access to such computational resources. Cloud Computing conceptually incorporates Software-as-a-Service (SaaS), Web 2.0 and other technologies with reliance on the Internet, providing common business applications online through web browsers to satisfy the computing needs of users, while the software and data are stored on the servers.

The cloud has three core attributes. First, clouds are built differently than traditional IT. Rather than dedicating specific infrastructure elements to specific applications, the cloud uses shared pools that applications can dynamically use as needed.

This pooling can have the multifaceted benefit of saving capital expenditures, since business units share the resources, and of providing better application experiences, since more resources are available to an application when it needs them, assuming the right QoS management infrastructure is in place.

Second, clouds are operated differently than traditional IT. Most IT management today is about managing specific point types, devices, applications, network links, etc. Managing a cloud is all about managing service delivery. One manages outcomes, rather than individual components. The cloud brings the concept of "automated" to a new level that is an entirely different operational model, biased to low-touch and zero-touch IT operational models. Please refer to the section titled “Self Organizing Systems”, starting on page 50, for additional details.

Finally, clouds are consumed differently than traditional IT. You pay for what you use, when you use it. It is convenient to consume. Compare that with the traditional model of having to pay for all the physical infrastructure associated with your application, whether you are using it or not, “pay for the power I use, rather than buying a power plant ...”


Figure 20 - Cloud Topology

Figure 20 - Cloud Topology, shown above, shows the typical run-time configuration of Cloud Computing, when consumers visit an application served by the central Cloud, which is housed in one or more data centers. The Cloud encompasses both resource consumption and resource provision, and the role of coordinator for resource provisioning is centrally controlled. Even if the central node is implemented as a distributed grid, which is typical of a standard data center, control is still centralized. Providers, who are the controllers, are usually companies with other web activities that require large computing resources; in their efforts to scale their primary businesses, they have gained considerable expertise and hardware. For them, Cloud Computing is a way to resell these as a new product while expanding into a new market. Consumers include everyday users, Small and Medium sized Enterprises (SMEs), and ambitious start-ups whose innovation potentially threatens the incumbent providers.

Figure 21, below, depicts the three cloud layers of abstraction: vendors provide IaaS (Infrastructure as a Service), developers consume PaaS (Platform as a Service), and clients/end users consume SaaS (Software as a Service), linked by provide, deliver, and consume relationships.

Figure 21 - Cloud Computing Topology

Cloud Layers of Abstraction

While there is a significant buzz around Cloud Computing, there is little clarity over which offerings qualify as typical use cases or their interrelation with the other solutions. The key to resolving this confusion is the realization that the various offerings fall into different levels of abstraction, as shown in Figure 21 - Cloud Computing Topology defined above. They are focused at different market segments.

Infrastructure-as-a-Service (IaaS): At the most basic level of Cloud Computing offerings, providers such as Amazon and Mosso provide machine instances to developers. These instances essentially behave like dedicated servers that are controlled by the developers, who therefore have full responsibility for their operation. Once a machine reaches its performance limits, the developers have to manually instantiate another machine and scale their application out to it. This service is intended for developers who can write arbitrary software on top of the infrastructure with only small compromises in their development methodology.

Platform-as-a-Service (PaaS): One level of abstraction above such infrastructure services, PaaS offerings provide a programming environment that abstracts machine instances and other technical details from developers. The programs are executed over the provider's data centers, without the developers being concerned with matters of allocation. In exchange, the developers have to accept some constraints that the environment imposes on their application design, for example, the use of key-value stores instead of relational databases.

Note that “key-value stores” is defined as a distributed storage system for structured data that focuses on scalability, at the expense of the other benefits of relational databases. Examples include Google’s "BigTable" and Amazon’s SimpleDB.

Software-as-a-Service (SaaS): At the consumer-facing level are the most popular examples of Cloud Computing: well-defined applications offering users online resources and storage. This differentiates SaaS from traditional websites or web applications, which do not interface with user information (e.g. documents) or do so only in a limited manner. Popular examples include Microsoft's (Windows Live) Hotmail, office suites such as Google Docs and Zoho, and online business software such as Salesforce.com. We can categorize the roles of the various entities to better understand Cloud Computing.

The vendor as resource provider has already been discussed. The application developers utilize the resources provided, building services for the end users. This separation of roles helps define the stakeholders and their differing interests. However, actors can take on multiple roles, with vendors also developing services for the end users, or developers utilizing the services of others to build their own services. Yet, within each Cloud, the role of provider, and therefore the controller, can only be occupied by the vendor providing the Cloud.

It is also important to consider Cloud interfaces. In order to allow a business to achieve sustainability utilizing an external or internal cloud resource, it is important to consider developing a standard interface or API that would allow users to interoperate between various cloud implementations as well as be able to federate each entity. This topic will be covered in the section titled “Standards”, starting on page 135.

Cloud Type Architecture(s) Computing Concerns

The Cloud Computing model is not without concerns. They include:

Failure of Monocultures:

The uptime of Cloud Computing based solutions (uptime being a measure of the time a computer system has been running) is an advantage when compared to businesses running their own infrastructure, but we often overlook the co-occurrence of downtime in vendor-driven monocultures. The use of globally decentralized data centers for vendor Clouds minimizes failure, aiding adoption. However, when a Cloud fails, there is a cascade effect, crippling all organizations dependent on that Cloud, and all those dependent upon them.

This was illustrated by the Amazon S3 Cloud outage, which disabled several other dependent businesses. Failures are now system-wide instead of partial or localized, so the efficiencies gained from centralizing infrastructure for Cloud Computing come increasingly at the expense of the Internet's resilience.

Convenience vs. Control

The growing popularity of Cloud Computing comes from its convenience, but it also brings vendor control, an issue of ever-increasing concern. For example, Google Apps for in-house e-mail typically provides higher uptime, but its failures highlight the issue of lock-in that comes from depending on vendor Clouds. The even greater concern is the loss of information privacy, with vendors having full access to the resources stored on their Clouds. Both the British and US governments are considering a 'G Cloud' for government business applications. In the particularly sensitive cases of SMEs and start-ups, the provider-consumer relationship that Cloud Computing fosters between the owners of resources and their users could be detrimental, as there is a potential conflict of interest for the providers: they profit by providing resources to up-and-coming players, but also wish to maintain dominant positions in their consumer-facing industries.

General distrust of external service providers

As soon as you say the word "cloud", people immediately think of an ugly world where big portions of critical IT are being put in the hands of vendors they do not know and do not trust, similar to outsourcing. The difference is that private clouds are about efficiency, control and choice, not to mention sustainability.


Choice means the business decides whether everything runs internally, externally, or any mix you choose. Choice also means that you will have multiple service providers competing for your business, and it will be easy to switch between them if you need to for some reason. Switching between providers is an issue and is addressed in the section titled “Standards,” starting on page 135.

Concern to virtualize the majority of servers and desktop workloads

We can understand that there may be some delay between the actual capabilities of a given technology and the general availability of those capabilities. Virtualization is no exception. Fortunately, a few vendors, EMC, VMware, and Cisco, have collaborated (the Vblock solutions) to develop a set of standard designs allowing businesses to adopt a well-known and proven solution. Seeing is believing, whether through the solution just mentioned or through one of the thousands of enterprise IT environments that are already pushing serious workloads through virtualization and getting great results.

Fully virtualized environments are hard to manage

This is true if you try to manage them with tools and processes designed for the physical IT world. Indeed, virtualization efforts often stall because IT leadership fails to recognize that the operational model is very different (and vastly improved!) in the virtual world. To get to a private cloud, or indeed any virtualization at scale, the management and operational model has to be completely re-engineered. Please refer to the section titled "Information Management," starting on page 42, for additional details regarding management.

The upside of virtualizing is enormous: the business gains the ability to move to a more sustainable operational and business model by responding far more quickly to changing requirements and by providing far higher service levels.

Many environments can't be virtualized onto x86 and hypervisors

That is often true. Many legacy applications are difficult, impractical or not worth the effort to bring over to an Intel instruction set. The question is: should having 20 years of legacy equipment on the data center floor stop a business from moving forward? Which part of your environment is growing faster? We would estimate that applications running on x86 instruction sets are growing faster. In three years, how much of your world will be legacy, and how much will be on newer platforms?

A best practice is to cap the investment in legacy, start building new applications on the new environment, and selectively migrate from old to new when the opportunity presents itself.

Concerns on security

As discussed in the section titled “Security,” starting on page 138, there are issues, but best practices to address them are also discussed. It can be argued that fully virtualized environments can be made far more secure than anything in the physical world, at a lower cost and with less effort. It is interesting to point out that trillions of dollars flow around the globe every day in the financial cloud, a dynamic and federated environment of shared computing resources. So far, I have not lost a dime.

Industry Standards

Unfortunately, usable industry standards usually develop at an abysmal pace. Even when we have them, it is often the case that everyone implements them differently, defeating the purpose. When it comes to private clouds, there are a few basic and usable standards in place (e.g. OVF, the Open Virtualization Format), with a few more coming, but it is going to take time before we as an industry have this sorted out.

A best practice is to keep open standards in mind but, in the short term, take advantage of specific technologies that do the job today, and keep your options open. Please refer to the section titled “Standards,” starting on page 135, for additional information.

Applications support for virtualized environments, or only the one the vendor sells

Certain software vendors, such as Oracle, have challenges in making their licensing schemes work in virtualized environments or may claim to have support concerns. Ironically, these same software vendors often use virtualization to develop their products. It is unfortunate, since this obstacle can be a long-term deterrent to staying with that particular application. A best practice is to vocalize your business infrastructure and sustainability message to your software vendors, rather than conforming to theirs.

Environmental Impact Concerns

The ever-increasing carbon footprint from the exponential growth of the data centers required for Cloud Computing within the IT industry is another concern. IT is expected to exceed the airline industry by 2020 in terms of carbon footprint, raising sustainability concerns.

The industry is being motivated to address the problem by legislation, the operational limit of power grids (being unable to power any more servers in their data centers), and the potential financial benefits of increased efficiency. The primary solution is the use of virtualization to maximize resource utilization, but the problem remains. While these issues are common to Cloud Computing, they are not flaws in the Cloud concept, but in the vendor provisioning methods and the implementation of Clouds. There are attempts to address some of these concerns, such as a portability layer between vendor Clouds to avoid lock-in. However, this will not alleviate issues such as inter-Cloud latency.

An open source implementation of the Amazon EC2 Cloud allows data centers to execute code compatible with Amazon’s Cloud. This allows creation of private internal Clouds, avoiding vendor lock-in and providing information privacy, but only for those with their own data center, and so it is not really Cloud Computing (which by definition avoids owning data centers). Therefore, vendor Clouds remain synonymous with Cloud Computing.

One solution is a possible alternative model for the Cloud, created by combining the Cloud with paradigms from Grid Computing, principles from Digital Ecosystems, and sustainability from Green Computing, while remaining true to the original vision of the Internet. This option is covered in the section titled “Community Cloud,” starting on page 101. This cloud type is a challenging solution for the enterprise, but it may be the ultimate cloud architecture in the long term.

One incentive for cloud computing is that it may be more environmentally friendly. First, reducing the number of hardware components needed to run applications in the company's internal data center and replacing them with cloud computing systems reduces the energy needed for running and cooling hardware. By consolidating these systems in remote centers, they can be handled more efficiently as a group.

Second, techniques for cloud computing promote telecommuting techniques, such as remote printing and file transfers, potentially reducing the need for office space, buying new furniture, disposing of old furniture, having your office cleaned with chemicals and trash disposed, and so on. They also reduce the need to drive to work and the resulting carbon dioxide emissions.

Threshold Policy Concerns

Let us suppose you have a program that does credit card validation in the cloud, and you hit the crunch of the December buying season. Higher demand would be detected and more instances would be created to fill that demand. As we move out of the buying crunch, the need diminishes, and the instances of that resource are de-allocated and put to other use.

A best practice is to first test that the program works. Then, develop, or improve and implement, a threshold policy in a pilot study before moving the program to the production environment. Check how the policy detects sudden increases in demand and creates additional instances to fill it. Also, check how unused resources are de-allocated and turned over to other work.
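To make the idea of a threshold policy concrete, the sketch below shows one minimal form such a policy could take. It is illustrative only: the utilization thresholds and instance limits are assumptions, and the actual provisioning and de-provisioning calls are provider-specific and therefore left as comments.

# Minimal threshold-policy sketch (illustrative only). The thresholds and
# limits are assumptions; real provisioning calls depend on the provider's API
# and are indicated by comments.

class ThresholdPolicy:
    def __init__(self, scale_up_at=0.80, scale_down_at=0.30,
                 min_instances=2, max_instances=20):
        self.scale_up_at = scale_up_at        # utilization that triggers growth
        self.scale_down_at = scale_down_at    # utilization that triggers shrink
        self.min_instances = min_instances
        self.max_instances = max_instances
        self.instances = min_instances

    def evaluate(self, utilization):
        """Return the new instance count for an observed utilization (0.0 to 1.0)."""
        if utilization > self.scale_up_at and self.instances < self.max_instances:
            self.instances += 1               # here a provisioning call would create another instance
        elif utilization < self.scale_down_at and self.instances > self.min_instances:
            self.instances -= 1               # here a de-provisioning call would release the resource
        return self.instances

if __name__ == "__main__":
    policy = ThresholdPolicy()
    # Simulated utilization samples through a December buying crunch and after it.
    for load in (0.55, 0.85, 0.92, 0.95, 0.70, 0.40, 0.20, 0.15):
        print(load, "->", policy.evaluate(load), "instances")

A pilot study would replace the simulated samples with real monitoring data and verify that instances are created and released at the expected points.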

Interoperability Concerns

If a company outsources or creates applications with one cloud-computing vendor, the company may find it difficult to change to another vendor that has proprietary APIs and different formats for importing and exporting data. This creates problems in achieving interoperability of applications between the two cloud-computing vendors. You may need to reformat data or change the logic in applications. Although industry cloud-computing standards do not yet exist for APIs or data import and export, vendors such as IBM have worked together to make interoperability happen.

Hidden Cost Concerns

Cloud computing providers do not always tell you what the hidden costs are. For instance, companies could incur higher network charges from their service providers for storage and database applications containing terabytes of data in the cloud. These charges can outweigh the costs they would save on new infrastructure, training new personnel, or licensing new software. In another instance of incurring network costs, companies that are far from the location of cloud providers could experience latency, particularly when there is heavy traffic.

Best Practice - Assess cloud storage migration costs upfront

Another hidden cost is migration. Today, many cloud storage providers quote a basic cost per gigabyte of capacity. For example, at the time of this publication, the basic cost for Amazon Web Services is approximately $0.15 per GB. Pricing for Zetta starts at ~$0.25 per GB and decreases as more data is stored in the cloud. For archive cloud storage providers that offer additional features like WORM (Write Once Read Many) and information lifecycle management, the basic cost is in the realm of ~$1.00 per GB.

Therefore, a potential customer should be able to calculate how much disk storage they need and then determine a monthly cost for storing data in the cloud. While this sounds simple, few providers actually mention that basic storage costs are only part of the picture. The issue is migration into the storage cloud.

All providers will charge for data transfers in and out of the cloud based on the volume of data transferred (a typical cost is $0.10 per GB). Some will also charge for metadata functions such as directory or file attribute listings, and copying or deleting files. While these metadata operation costs are generally miniscule on a per-operation basis (a maximum of $0.01 per 1,000 operations for Amazon), they can add up based on the number of users the customer has accessing cloud storage data.
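The figures quoted above can be folded into a simple monthly estimate. The sketch below uses those published rates purely as illustrative placeholders; actual prices vary by provider and change over time, so the numbers should be replaced with the provider's current price sheet.

# Rough monthly cloud-storage cost estimate using the illustrative rates
# quoted above; substitute the provider's current price sheet.

def monthly_storage_cost(stored_gb, transferred_gb, metadata_ops,
                         per_gb_stored=0.15,        # basic capacity price per GB
                         per_gb_transferred=0.10,   # typical transfer charge per GB
                         per_1000_metadata=0.01):   # listings, copies, deletes
    capacity = stored_gb * per_gb_stored
    transfer = transferred_gb * per_gb_transferred
    metadata = (metadata_ops / 1000.0) * per_1000_metadata
    return capacity + transfer + metadata

# Example: 5 TB stored, 1 TB moved in or out, 2 million metadata operations.
print(round(monthly_storage_cost(5000, 1000, 2_000_000), 2))   # 870.0

Even in this toy example, transfer and metadata charges add more than 10% on top of the raw capacity cost, which is why migration planning matters.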

Another piece of cloud storage pricing is how a customer actually gets to the data stored in the cloud. Some cloud storage providers, including Autonomy Zantaz and Iron Mountain Inc., support private data lines that connect the customer's infrastructure to the cloud storage infrastructure. Others, such as Zetta, estimate that the Telco circuit and cross-connect fees for customer data access will add up to as much as 20% of the total cost per month. Whether or not this will be an issue depends on the type of data storage and the customers’ access patterns.

Perhaps the least well understood cost of cloud storage is the mass transfer of data in or out of the cloud. Some providers, like Zetta, do not charge transfer fees for data migration into the cloud. Others, such as Amazon, include a stated pricing plan for large-scale data transfers using a portable medium, charging a time-based fee for the data load and a handling fee for the portable device.

Consumers should make sure that every cloud storage request counts. A data migration plan is therefore crucial, and things like virus scanners, indexing services, and backup software should be carefully configured so as not to treat the cloud storage medium as just another network drive.

As the cloud continues to evolve, cloud storage providers who can provide the most sophisticated cost analysis tools will be best suited to help potential customers accurately determine costs. Yet customers must still look at all potential costs, including transfer, bulk load, network, and on-site appliances, as discussed in the section titled “Best Practice – Understand Information Logistics and Energy transposition tradeoffs,” starting on page 77.

Unexpected behavior concerns

Let us suppose your credit card validation application works well at your company's internal data center. It is important to test the application in the cloud with a pilot study to check for unexpected behavior. Examples of tests include how the application validates credit cards, and how, in the scenario of the December buying crunch, it allocates resources and releases unused resources, turning them over to other work. If the tests show unexpected results of credit card validation or releasing unused resources, you will need to fix the problem before running the application in the cloud.

Security issue concerns

In February 2008, Amazon's S3 and EC2 suffered a three-hour outage. Even though an SLA provides data recovery and service credits for this type of outage, consumers missed sales opportunities and executives were cut off from critical business information they needed.

Instead of waiting for an outage to occur, consumers should do security testing on their own, checking how well a vendor can recover data. The test is very simple; no tools are needed. All you have to do is ask for old data you have stored and check how long it takes the vendor to recover it. If it takes too long, ask the vendor why and how much service credit you would get in different scenarios. Verify that the checksums match the original data.

Use a trusted algorithm to encrypt the data on your local computer, and then try to access the data on a remote server in the cloud using the decryption keys. If you cannot read the data once you have accessed it, either the decryption keys are corrupted or the vendor is using its own encryption algorithm. You may need to address the algorithm with the vendor.
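One way to rehearse the tests described above is sketched below: encrypt locally with a trusted algorithm, record a checksum before upload, and verify both after retrieval. This is a minimal sketch, not a complete test plan; it assumes the third-party Python 'cryptography' package is available, and the upload/download round trip is left as a comment because it depends entirely on the vendor's API.

# Sketch of the local-encryption and checksum test described above.
# Requires the third-party 'cryptography' package; the upload/download
# round trip through the vendor is indicated only by a comment.

import hashlib
from cryptography.fernet import Fernet

def prepare(data: bytes, key: bytes):
    """Encrypt locally and record a checksum of the original data."""
    token = Fernet(key).encrypt(data)
    digest = hashlib.sha256(data).hexdigest()
    return token, digest

def verify(token: bytes, key: bytes, expected_digest: str) -> bool:
    """Decrypt what came back from the cloud and compare checksums."""
    recovered = Fernet(key).decrypt(token)
    return hashlib.sha256(recovered).hexdigest() == expected_digest

if __name__ == "__main__":
    key = Fernet.generate_key()           # in practice, manage your own keys
    original = b"credit card validation test record"
    token, digest = prepare(original, key)
    # upload token to the vendor here, then download it again before verifying
    print("checksums match:", verify(token, key, digest))

If verification fails after the round trip, either the stored object was altered in transit or at rest, or the keys no longer match, which is exactly the conversation to have with the vendor.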

Another issue is the potential for problems with data stored in the cloud. You may want to manage your own private keys to protect the data. Check with the vendor on private key management. Amazon, for example, will give you the certificate if you sign up for it.

Software development in cloud concerns

To develop software using high-end databases, the most likely choice is to use cloud server pools at the internal corporate data center and extend resources temporarily with Amazon Web Services for testing purposes. This allows project managers to better control costs, manage security, and allocate resources to the cloud a project is assigned to. Project managers could also assign individual hardware resources to different cloud types: a Web development cloud, a testing cloud, and a production cloud. The cost associated with each cloud type may differ. The cost per hour or usage for the development cloud is most likely lower than for the production cloud, as additional features, such as SLAs and security, are allocated to the production cloud.

The managers can limit projects to certain clouds. For instance, services from portions of the production cloud can be used for the production configuration, while services from the development cloud can be used for development purposes only. To optimize assets at varying stages of the software development project, the managers can get cost-accounting data by tracking usage by project and user. If costs are high, managers can use Amazon EC2 to temporarily extend resources at a very low cost, provided security and data recovery issues have been resolved.

Private Cloud

First, what is a private cloud? Since this is a relatively new concept, there are many definitions. However, the first aspect is that, unlike a public cloud, there are no presumptions as to where applications physically run. Applications can run in a data center the business owns and/or at a service provider's location.

Second, there is no requirement to rewrite the applications to get to a private cloud. Many public clouds require that applications comply with a pre-defined software stack.

Private clouds are made possible by the technology of virtualization. Utilizing hypervisors, anything that runs on an Intel instruction set can be structured as a private cloud. As a result, there is no need to rewrite applications just to get to a private cloud model.

Finally, the private cloud model assumes that control of the private cloud firmly remains in IT's hands, and not some external service provider. IT controls service delivery, if they choose; and security and compliance, if they choose (see section titled “Security”, starting on page 138 for more details). IT controls the mix of internal and external resources, if they choose, or whether they want an IaaS, PaaS, or SaaS model.

To summarize the definition of a private cloud: it is a fully virtualized computing environment that uses a next-generation operational and security model, with a flexible consumption model spanning both internal and external resources, and with IT fully in control.

Private clouds are a stepping-stone to external clouds, particularly for financial services. Many believe that future datacenters will look like internal clouds.

Private clouds can also be designed by utilizing various existing technologies to federate multiple aspects of the virtualized data center and Cloud Computing, as shown in Figure 22 - Using a Private Cloud to Federate disparate architectures.

This creates what many would consider an internal or private cloud. With the cloud resources of the external cloud and the virtualization resources of the internal cloud, information can, if properly designed, move securely across the pool of resources, and possibly across legacy resources in the internal cloud and public cloud resources. This architectural advantage is key: they are never separate resources; they are all one pool. The resources are aggregated and federated together so that applications can act on the combined resources as a single pool, just like the single pool of resources available today when one uses virtualization to join servers from multiple racks in a data center. This forms the private cloud that enables us to get the best of both worlds. The word “Private” is used because the use and operation of the cloud resources are completely controlled by, and only available to, the enterprise. This cloud resource looks and behaves just like the resources purchased in the past.

This architecture offers the advantage of achieving sustainability and efficiency; you get the best of both worlds. You can achieve trusted, controlled reliability and security while getting the flexible, dynamic, on-demand, and sustainably efficient qualities of a cloud-type architecture.

Figure 22 - Using a Private Cloud to Federate disparate architectures

Let us take it a step further and examine the core principles, or best practices that uniquely define private cloud computing.

Best Practice – Implement a dynamic computing infrastructure

Private cloud computing requires a dynamic computing infrastructure. The foundation for the dynamic infrastructure is a standardized, scalable, and secure physical infrastructure. There should be levels of redundancy to ensure high levels of availability, but mostly it must be easy to extend as usage growth demands it, without requiring architecture rework.

Next, it must be virtualized. Today, virtualized environments leverage server virtualization (typically from Microsoft, or Xen) as the basis for running services. These services need to be easily provisioned and de-provisioned via software automation. These service workloads need to be moved from one physical server to another as capacity demands increase or decrease. Finally, this infrastructure should be highly utilized, whether provided by an external cloud provider or an internal IT department. The infrastructure must deliver business value over and above the investment.

A dynamic computing infrastructure is critical to effectively supporting the elastic nature of service provisioning and de-provisioning as requested by users in the private cloud, while maintaining high levels of reliability and security. The consolidation provided by virtualization, coupled with provisioning automation, creates a high level of utilization and reuse, ultimately yielding a very effective use of capital equipment.

Best Practice – Implement an IT Service-Centric Approach

Cloud computing is IT (or business) service-centric. This is in stark contrast to more traditional system- or server-centric models. In most cases, users of the private cloud generally want to run some business service or application for a specific, timely purpose. IT administrators do not want to be bogged down in the system and network administration of the environment. They would prefer to quickly and easily access a dedicated instance of an application or service. By abstracting away the server-centric view of the infrastructure, system users can easily access powerful pre-defined computing environments designed specifically around their service.

An IT service-centric approach enables user adoption and business agility. The easier and faster a user can perform an administrative task, the more expedient the business becomes, reducing costs, driving revenue, and approaching a sustainable IT model.

Best Practice – Implement a self-service based usage model

Interacting with the private cloud requires some level of user self-service. Best-of-breed self-service provides users the ability to upload, build, deploy, schedule, manage, and report on their business services on demand within the enterprise. A self-service private cloud offering must provide easy-to-use, intuitive user interfaces that equip users to productively manage the service delivery lifecycle.

The benefit of self-service from the users' perspective is a level of empowerment and independence that yields significant business agility. One benefit often overlooked from the internal service provider's or IT team's perspective is that the more self-service that can be delegated to users, the less administrative involvement is necessary. This saves time and money and allows administrative staff to focus on more strategic, high-valued responsibilities.

Best Practice – Implement a minimally or self-managed platform

An IT team or service provider must leverage a technology platform that is self-managed in order to efficiently provide a cloud for their constituents. Best-of-breed clouds enable self-management via software automation, leveraging the following capabilities as discussed in the section titled “Information Management,” starting on page 42:

• A provisioning engine for deploying services and tearing them down, recovering resources for high levels of reuse
• Mechanisms for scheduling and reserving resource capacity
• Capabilities for configuring, managing, and reporting to ensure resources can be allocated and reallocated to multiple groups of users
• Tools for controlling access to resources and policies for how resources can be used or operations can be performed

All of these capabilities enable business agility while simultaneously enacting critical and necessary administrative control. This balance of control and delegation maintains security and uptime, minimizes the level of IT administrative effort, and keeps operating expenses low, freeing up resources to focus on higher value projects.
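The sketch below illustrates, in highly simplified form, the kind of provisioning and policy logic the capabilities listed above automate. The resource pool, quota rule, and class names are assumptions for illustration; a real self-managed platform adds scheduling, reporting, and fine-grained access control on top of this.

# Simplified sketch of a self-managed provisioning engine: deploy services,
# tear them down to recover capacity, and enforce a simple per-group quota.
# Names and policies here are illustrative assumptions only.

class ProvisioningEngine:
    def __init__(self, total_capacity, group_quota):
        self.free = total_capacity
        self.group_quota = group_quota        # max units any one group may hold
        self.allocations = {}                 # service name -> (group, units)
        self.group_usage = {}

    def provision(self, service, group, units):
        used = self.group_usage.get(group, 0)
        if units > self.free or used + units > self.group_quota:
            return False                      # capacity or quota policy refused it
        self.free -= units
        self.allocations[service] = (group, units)
        self.group_usage[group] = used + units
        return True

    def teardown(self, service):
        group, units = self.allocations.pop(service)
        self.free += units                    # capacity recovered for reuse
        self.group_usage[group] -= units

engine = ProvisioningEngine(total_capacity=100, group_quota=40)
print(engine.provision("web-dev", "marketing", 30))   # True
print(engine.provision("analytics", "marketing", 20)) # False: quota exceeded
engine.teardown("web-dev")
print(engine.provision("analytics", "marketing", 20)) # True after recovery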

Best Practice – Implement a consumption-based billing methodology

Finally, private cloud computing is usage driven. Consumers pay for only the resources they use and therefore are charged or billed on a consumption-based model. Cloud computing platforms must provide mechanisms to capture usage information that enables chargeback reporting and/or integration with billing systems such as a chargeback system.

The value from a user's perspective is the ability of the business units to pay only for the resources they use, ultimately keeping their costs down. From a provider's perspective, it allows them to track usage for chargeback and billing purposes.
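A minimal sketch of consumption-based chargeback is shown below: meter usage per business unit and turn it into a bill. The resource names and rates are illustrative assumptions, not any particular provider's price list.

# Sketch of consumption-based chargeback: meter usage per business unit and
# turn it into a bill. Rates and resource names are illustrative assumptions.

from collections import defaultdict

RATES = {"cpu_hours": 0.05, "gb_month": 0.15, "gb_transferred": 0.10}

usage = defaultdict(lambda: defaultdict(float))

def record(business_unit, resource, amount):
    usage[business_unit][resource] += amount

def chargeback_report():
    return {unit: round(sum(RATES[r] * amt for r, amt in res.items()), 2)
            for unit, res in usage.items()}

record("finance", "cpu_hours", 1200)
record("finance", "gb_month", 500)
record("marketing", "gb_transferred", 2000)
print(chargeback_report())   # {'finance': 135.0, 'marketing': 200.0}

In practice this metering data would feed an existing billing or chargeback system rather than a standalone report.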

In summary, these five best practices are necessary to produce an enterprise private cloud capable of achieving compelling business value, including savings on capital equipment and operating costs, reduced support costs, and significantly increased business agility. This enables corporations to improve their profit margins and competitiveness in the markets they serve.

Public Cloud

Public cloud solutions are the most well known examples of cloud storage. In a public cloud implementation, an organization accesses third-party resources (like EMC Atmos, Iron Mountain®, Google™, etc.) on an as-needed basis, without the requirement to invest in additional internal infrastructure. In this pay-per-use model, public cloud vendors provide applications, computer platforms and storage to the public, delivering significant economies of scale. For storage, the difference between the purchase of a dedicated local appliance and the use of a public cloud is not the functional interface, but merely the fact that the storage is delivered on demand.

Either the customer or business unit pays for what they actually use or in other cases, what they have allocated for use. As an extension of the financial benefits, public clouds offer a scalability that is often beyond what a user would be able to otherwise afford. Publicly accessible clouds offer storage capacity using multi-tenancy solutions, meaning multiple customers are serviced at once from the same infrastructure. This results in some common concerns when evaluating public cloud solutions, including security and privacy, as well as the possibilities of latency and compliance issues. When considering the use of public cloud options for data storage, pay attention to the management, both now and in the future, of both the clouds and the data, as well as the integration of the Cloud service usage with internal IT.

Since there are numerous white papers discussing the public cloud, I will refer readers to those papers for additional details. I will mention, however, that the five (5) best practices outlined in the section titled “Private Cloud,” starting on page 96, are also applicable to the public cloud.

Community Cloud

Community Clouds are digital ecosystems: distributed, adaptive, open socio-technical systems with properties of self-organization, scalability, and sustainability, inspired by natural ecosystems. This is an interesting approach to the sustainability issue, especially from a regional perspective.

In a traditional market-based economy, made up of sellers and buyers, the parties exchange property. In the new network-based economy, made up of servers and clients, the parties share access to services and experiences. Digital Ecosystems support network-based economies that rely on next-generation IT to extend the Service-Oriented Architecture (SOA) concept with the automatic combination of available and applicable services in a scalable architecture, to meet business user requests for applications that facilitate business processes. Digital Ecosystems research is yet to consider scalable resource provision, and therefore risks being subsumed into vendor Clouds at the infrastructure level, while striving for decentralization at the service level. Therefore, the realization of their vision requires a form of Cloud Computing, but with their principle of community-based infrastructure where individual users share ownership.

One aspect of the Community Cloud as it relates to other Cloud architectures is that Community Clouds are less dependent on vendors and can, in the long run, achieve a higher level of environmental sustainability. The Community Cloud approach is to combine distributed resource provisioning from Grid Computing, distributed control from Digital Ecosystems, and sustainability from Green Computing with the use cases of Cloud Computing, while making greater use of self-management advances from Autonomic Computing. Replacing vendor Clouds by shaping the underutilized resources of user machines forms a Community Cloud, with nodes potentially fulfilling all roles: consumer, producer, and, most importantly, coordinator, as shown in Figure 23 - Community Cloud, below.

Figure 23 - Community Cloud

The figure shows an environment that includes nodes of varying functionality of user machines or servers/clients, allowing potentially all nodes to fulfill all roles: consumer, producer, and coordinator. This concept of the Community Cloud draws upon Cloud Computing, Grid Computing, Digital Ecosystems, Green Computing, and Autonomic Computing. This is a model of Cloud Computing that is part of the community, without dependence on Cloud vendors. There are a number of advantages:

1) Openness: Removing dependence on vendors makes the Community Cloud the open equivalent to vendor Clouds, and therefore identifies a new dimension in the open versus proprietary struggle that has emerged in code, standards and data, but has yet to be expressed in the realm of hosted services.

2) Community: The Community Cloud is as much a social structure as a technology paradigm. Community ownership of the infrastructure carries with it a degree of economic scalability, without which there would be diminished competition and potential stifling of innovation as risked in vendor Clouds.

3) Individual Autonomy: In the Community Cloud, nodes have their own utility functions, in contrast with data centers, in which dedicated machines execute software as instructed. Therefore, with nodes expected to act in their own self-interest, centralized control would be impractical, as with consumer electronics like game consoles. Attempts to control user machines counter to their self-interest result in cracked systems, from black market hardware modifications and arms races over hacking and securing the software (routinely lost by the vendors). In the Community Cloud, where no concrete vendors exist, it is even more important to avoid antagonizing the users, instead embracing their self-interest and harnessing it for the benefit of the community with measures such as a community currency.

4) Identity: In the Community Cloud, each user would inherently possess a unique identity, which combined with the structure of the Community Cloud should lead to an inversion of the currently predominant membership model. Therefore, instead of users registering for each website (or service) as a new user, they could simply add the website to their identity and grant access, allowing users to have multiple services connected to their identity, instead of creating new identities for each service. This relationship is reminiscent of recent application platforms, such as Facebook’s f8 and Apple’s App Store, but decentralized in nature and so free from vendor control. In addition, it allows for the reuse of the connections between users, akin to Google’s Friend Connect, instead of reestablishing them for each new application.

5) Graceful Failures: The Community Cloud is not owned or controlled by any one organization, and therefore not dependent on the lifespan or failure of any one organization. It therefore should be robust and resilient to failure, and immune to the system-wide cascade failures of vendor Clouds. Due to the diversity of its supporting nodes, their failure is graceful, non-destructive, and with minimal downtime, as the unaffected nodes mobilize to compensate for the failure.

6) Convenience and Control: The Community Cloud, unlike vendor Clouds, has no inherent conflict between convenience and control. This results from its community ownership that provides distributed control which is more democratic. However, whether the Community Cloud can provide a technical quality equivalent or one superior to its centralized counterparts requires further research.

7) Community Currency: The Community Cloud requires its own currency to support the sharing of resources, a community currency, which in economics is a medium (currency) not backed by a central authority (e.g. national government), for exchanging goods and services within a community. It does not need to be restricted geographically, despite sometimes being called a local currency. An example is the Fureai kippu system in Japan, which issues credits in exchange for assistance to senior citizens. Family members living far from their parents can earn credits by assisting the elderly in their local community, which can then be transferred to their parents and redeemed by them for local assistance.

8) Quality of Service: Ensuring acceptable quality of service (QoS) in a heterogeneous system will be a challenge, not least because achieving and maintaining the different aspects of QoS will require reaching critical mass in the participating nodes and available services. Thankfully, the community currency could support long-term promises by resource providers and allow the higher quality providers, through market forces, to command a higher price for their service provision. Interestingly, the Community Cloud could provide a better QoS than vendor Clouds, utilizing time-based and geographical variations advantageously in the dynamic scaling of resource provision.

9) Environmental Sustainability: It is anticipated that the Community Cloud will have a smaller carbon footprint than vendor Clouds, on the assumption that making use of underutilized user machines requires less energy than the dedicated data centers require for vendor Clouds. The server farms within data centers are an intensive form of computing resource provision, while the Community Cloud is more organic, growing and shrinking in a symbiotic relationship to support the demands of the community, which in turn supports it.

10) Service Composition: The great promise of service oriented computing is that the marginal cost of creating the nth application will be virtually zero, as all the software required already exists to satisfy the requirements of other applications. Only their composition and orchestration are required to produce a new application. Within vendor Clouds it is possible to make services that expose themselves for composition and compose these services, allowing the hosting of a complete service-oriented architecture. However, current service composition technologies have not gained widespread adoption. Digital Ecosystems advocate service cross pollination to avoid centralized control by large service providers, because easy service composition allows coalitions of SMEs to compete simply by composing simpler services into more complex services that only large enterprises would otherwise be able to deliver. So, one could extend decentralization beyond resource provisioning and up to the service layer, to enable service composition within the Community Cloud.

Figure 24 - Community Cloud Architecture

Figure 24 - Community Cloud Architecture, above, shows an architecture in which the most fundamental layer deals with distributed coordination. One layer above, resource provision and consumption are arranged on top of the coordination framework. Finally, the service layer is where resources are combined into end-user accessible services, which can then themselves be composed into higher-level services.


The concept is the distribution of server functionality between a plurality of nodes provided by user machines, shaping underutilized resources into a virtual data center. Even though this is a simple and straightforward idea, it poses challenges on many different levels. The approach can be divided into three layers: coordination, resource (provision and consumption), and service.

Distributed coordination is taken for granted in homogeneous data centers, where good connectivity, constant presence, and centralized infrastructure can be assumed. One layer above, resource provisioning and consumption are arranged on top of the coordination framework. This, too, is a challenge in a distributed heterogeneous environment. Finally, the service layer is where resources are combined into end-user accessible services. It is also possible to federate these services into higher-level services.

Best Practice in Community Cloud – Use VMs

To achieve coordination, the nodes need to be deployed as isolated virtual machines, forming a fully distributed network that can provide support for distributed identity, trust, and transactions. When using Virtual Machines (VMs), executing arbitrary code on the machine of a resource-providing user requires a sandbox for the guest code and a VM to protect the host. The role of the VM is to make system resources safely available to the Community Cloud; the Cloud processes can then run safely without danger to the host machine. In addition to VMs, possible platforms include the Java Virtual Machine and other lightweight runtimes. The age of multi-core processors has, in many cases, resulted in unused or underutilized cores in modern personal computers, which lend themselves well to the deployment and background execution of Community Cloud facing VMs.

Best Practice in Community Cloud – Use Peer-to-Peer Networking

The best practice is to implement a P2P network. Newer P2P solutions offer sufficient guarantees of distribution, immunity to super-peer failure, and resistance to enforced control. For example, in the Distributed Virtual Super-Peer (DVSP) model, a collection of peers logically combines to form a virtual super-peer that dynamically changes over time to facilitate fluctuating demands.

Best Practice in Community Cloud – Distributed Transactions

A key element of distributed coordination is the ability of nodes to jointly participate in transactions that influence their individual state. Appropriately defined business processes can be executed over a distributed network with a transactional model maintaining the properties on behalf of the initiator. Newer transaction models maintain these properties while increasing efficiency and concurrency. Focusing on distributing the coordination of transactions is fundamental to permitting multi-party service composition without centralized control.

Best Practice in Community Cloud – Distributed Persistent Storage

The best practice is to require storage on the participating nodes, taking advantage of the ever-increasing surplus on most personal computers. However, the method of information storage in the Community Cloud is an issue with multiple aspects. First, information can be file-based or structured. Second, while constant and instant availability can be crucial, there are scenarios in which recall times can be relaxed. Such varying requirements call for a combination of approaches, including distributed storage and distributed databases. Information privacy in the Community Cloud should be provided by the encryption of user information when on remote nodes, only being unencrypted when accessed by the user. This allows for the secure and distributed storage of information.

Challenges in the federation of Public and Private Clouds

Cloud computing tops Gartner's “Top 10 Strategic Technologies for 2010.” Gartner defines a strategic technology as “one with the potential for significant impact on the enterprise in the next three years.” The fundamental challenge is that the industry has shoehorned anything that can be loosely defined as cloud, virtual, IT consolidation, or anything on the network into the same term, cloud. There is a trend to use public, private, hybrid, cloud, and other variant services interchangeably.

Gartner predicts that through 2012, “IT organizations will spend more money on private cloud computing investments than on offerings from public cloud providers.” There are two primary reasons why the enterprise will not make major strides toward the public cloud in the near term: lack of visibility, and multi-tenancy issues that cloak the real concern about critical data security. Some consider that security is key and could be a show-stopper for public clouds, at least in the short term.

It is interesting to note that recently in the United States, the FBI raided at least two Texas data centers, serving search-and-seizure warrants for computing equipment, including servers, routers and storage. The FBI was seeking equipment that may have been involved in fraudulent business practices by a handful of small VoIP vendors18.

It appears that, in the United States, if the FBI finds out that there is a threat coming from a hosted (i.e. cloud) provider, and if the servers used for the scam or threat are virtualized, the FBI may confiscate everything, possibly including your data. The reason is that it is much harder to figure out where the server and data are located. Please refer to the 2010 Proven Professional article titled “How to Trust the Cloud – Be Careful up There” for additional information.

Lack of visibility

The public cloud is opaque and lacks a level of true accountability, which will keep any enterprise from releasing its prized data assets to a set of unknown entities. Look at the value proposition: no one consuming the service has visibility into the infrastructure. The providers themselves are not looking at the infrastructure. Are SLAs relevant? And if so, who can enforce or even monitor them?

The public cloud has received so much buzz in large part because it professes to offer significant cost savings over buying, deploying and maintaining an in-house IT infrastructure. While this is massively appealing, it does not answer any of the fundamentals of Quality of Service, network and data security, to name a few. Imagine the concern of opening up your internal systems with a direct pipe into the ‘cloud.’

Multi-tenancy Issues

Multi-tenancy is the second reason why businesses of any real size will not make the leap to the public cloud. Wikipedia defines multi-tenancy as “a principle in software architecture where a single instance of the software runs on a server, serving multiple client organizations (tenants).” In other words, many people using the same IT assets and infrastructure.

18 http://www.wired.com/threatlevel/2009/04/data-centers-ra/#ixzz0fv24gOsn

So here is the concern: EC2, Google, and others provide true multi-tenancy, but at what cost to compliance and security? What about hot topics such as PCI or forensics? How safe are the tenants on a system? Who is on the same system as you, a hacker or perhaps your nearest competition? How secure is the isolation between clients? What data have you trusted to this cloud? If you buy the argument, it will be your patient records, payroll, client list, etc. It will be essentially your most important data assets. Please refer to the EMC Proven Professional article titled “How to Trust the Cloud – Be Careful up There” for more information.

Cloud computing needs to cover its assets

Until the public cloud can provide visibility all the way down to the IT infrastructure's simplest asset, logs, enterprises simply will not risk it. To be deployed properly, a public cloud needs to understand logs and log management for security, business intelligence, IT optimization, PCI forensics, parsing out billing information, and more.

Until then, in the grand scheme of risk mitigation, enterprises may fear the cloud and segment public cloud from ITaaS in a private cloud. Most have taken all of the Cloud variants and placed them into a single bucket. In fact, there is a tremendous value in cloud computing. Nevertheless, public clouds and enterprise computing are a world apart and should be treated as such. In addition, there are many risks to consider along the way. Please refer to the EMC Proven Professional article titled “How to Trust the Cloud – Be Careful up There” for more information.

Warehouse Scale Machines - Purposely Built Solution Options

Cloud computing, utility computing and other cloud paradigms are most certainly on IT managers and architects lists. However, there are other architectures that should be considered to achieve sustainability, especially in specific use cases.

The trend toward server-side computing and the exploding popularity of Internet services has created a new class of computing systems defined as warehouse-scale computers, or WSCs. The name calls attention to the most distinguishing feature of these machines: the massive scale of their software infrastructure, data repositories, and hardware platform. This perspective is a departure from a view of the computing problem that implicitly assumes a model where one program runs on a single machine. In addition, this new class deals with a use case where a limited number of applications need to scale enormously, as with Internet services.

In warehouse-scale computing, the program is an Internet service that may consist of tens or more individual programs that interact to implement complex end-user services such as email, search, or maps. These programs might be implemented and maintained by different teams of engineers, perhaps even across organizational, geographic, and company boundaries, as is the case with mashups. The computing platform required to run such large-scale services bears little resemblance to a pizza-box server or even the refrigerator-sized high-end multiprocessors that reigned in the last decade. The hardware for such a platform consists of thousands of individual computing nodes with their corresponding networking and storage subsystems, power distribution and conditioning equipment, and extensive cooling systems. The enclosure for these systems is in fact a building structure, often indistinguishable from a large warehouse.

Had scale been the only distinguishing feature of these systems, we might simply refer to them as datacenters. Datacenters are buildings where multiple servers and communication gear are co-located because of their common environmental requirements and physical security needs, and for ease of maintenance. In that sense, a WSC could be considered a type of datacenter. Traditional datacenters, however, typically host a large number of relatively small- or medium-sized applications, each running on a dedicated hardware infrastructure that is de-coupled and protected from other systems in the same facility. Those datacenters host hardware and software for multiple organizational or business units or even different companies. Different computing systems within such a datacenter often have little in common in terms of hardware, software, or maintenance infrastructure, and tend not to communicate with each other at all.

WSCs currently power the services offered by companies such as Google, Amazon, Yahoo, and Microsoft’s online services division. WSCs differ significantly from traditional datacenters in that they belong to a single organization, use a relatively homogeneous hardware and system software platform, and share a common systems management layer. Often much of the application, middleware, and system software is built in-house, compared to the predominance of third-party software running in conventional datacenters.

Most importantly, WSCs run a smaller number of very large applications (or Internet services), and the common resource management infrastructure allows significant deployment flexibility. The requirements of homogeneity, single-organization control, and enhanced focus on cost efficiency motivate designers to take new approaches in constructing and operating these systems.

Best Practice – WSC’s must achieve high availability Internet services must achieve high availability, typically aiming for at least 99.99% uptime (about an hour of downtime per year). Achieving fault-free operation on a large collection of hardware and system software is difficult and is made more difficult by the large number of servers involved. Although it might be theoretically possible to prevent hardware failures in a collection of 10,000 servers, it would surely be extremely expensive. Consequently, WSC workloads must be designed to gracefully tolerate large numbers of component faults with little or no impact to service level performance and availability.

Best Practice - WSC’s must achieve cost efficiency Building and operating a large computing platform is expensive, and the quality of a service may depend on the aggregate processing and storage capacity available, further driving costs up and requiring a focus on cost efficiency. For example, in information retrieval systems such as Web search, the growth of computing needs is driven by three main factors.

1) Increased service popularity translates into higher request loads.

2) The size of the problem keeps growing. The Web is growing by millions of pages per day, which increases the cost of building and serving a Web index.

3) Even if the throughput and data repository could be held constant, the competitive nature of this market continuously drives innovations to improve the quality of results retrieved and the frequency with which the index is updated.

Although smarter algorithms can achieve some quality improvements, most substantial improvements demand additional computing resources for every request. For example, in a search system that also considers synonyms of the search terms in a query, retrieving results is substantially more expensive: either the search needs to retrieve documents that match a more complex query that includes the synonyms, or the synonyms of a term need to be replicated in the index metadata structure for each term. The relentless demand for more computing capabilities makes cost efficiency a primary metric of interest in the design of WSCs. Cost efficiency must be defined broadly to account for all the significant components of cost, including hosting-facility capital and operational expenses (which include power provisioning and energy costs), hardware, software, management personnel, and repairs.

WSC (Warehouse Scale Computer) Attributes

Today’s successful Internet services are no longer a miscellaneous collection of machines co-located in a facility and wired up together. The software running on these systems, such as Gmail or Web search services, executes at a scale far beyond a single machine or a single rack. They run on no smaller a unit than clusters of hundreds to thousands of individual servers.

Therefore, the machine, the computer, is this large cluster or aggregation of servers itself and needs to be considered a single computing unit. The technical challenges of designing WSCs are no less worthy of the expertise of computer systems architects than any other class of machines. First, they are a new class of large-scale machines driven by a new and rapidly evolving set of workloads. Their size alone makes them difficult to experiment with or simulate efficiently; therefore, system designers must develop new techniques to guide design decisions. Fault behavior, and power and energy considerations have a more significant impact in the design of WSCs, perhaps more so than in other smaller scale computing platforms. Finally, WSCs have an additional layer of complexity beyond systems consisting of individual servers or small groups of servers; WSCs introduce a significant new challenge to programmer productivity, a challenge perhaps greater than programming multi-core systems. This additional complexity arises indirectly from the larger scale of the application domain and manifests itself as a deeper and less homogeneous storage hierarchy, higher fault rates, and possibly higher performance variability.

One Data Center vs. Several Data Centers

Multiple datacenters are sometimes used as complete replicas of the same service, with replication being used primarily for reducing user latency and improving server throughput (a typical example is a Web search service). In these cases, a given user query tends to be fully processed within one datacenter, and our machine definition seems appropriate.

However, in cases where a user query may involve computation across multiple datacenters, our single-datacenter focus is a less obvious fit. Typical examples are services that deal with nonvolatile user data updates requiring multiple copies for disaster tolerance reasons. For such computations, a set of datacenters might be the more appropriate system. However, think of the multi-datacenter scenario as more analogous to a network of computers.

In many cases, there is a huge gap in connectivity quality between intra- and inter-datacenter communications causing developers and production environments to view such systems as separate computational resources. As the software development environment for this class of applications evolves, or if the connectivity gap narrows significantly in the future, a need may arise to adjust the choice of machine boundaries.

Best Practice – Use Warehouse Scale Computer Architecture designs in certain scenarios

It might be assumed that all but a few large Internet companies would find WSCs unaffordable because of their sheer size and cost. This may not be true. It can be argued that the problems today’s large Internet services face will soon be meaningful to a much larger constituency, because many organizations will soon be able to afford similarly sized computers at a much lower cost. Even today, the attractive economics of low-end server class computing platforms puts clusters of hundreds of nodes within the reach of a relatively broad range of corporations and research institutions. When combined with the trends toward large numbers of processor cores on a single die, a single rack of servers may soon have as many or more hardware threads than many of today’s datacenters. For example, a rack with 40 servers, each with four 8-core dual-threaded CPUs, would contain more than two thousand hardware threads. Such systems will arguably be affordable to a very large number of organizations within just a few years, while exhibiting some of the scale, architectural organization, and fault behavior of today’s WSCs.

Architectural Overview of WSC’s

The hardware implementation of a WSC will differ significantly from one installation to the next. Even within a single organization such as Google, systems deployed in different years use different basic elements, reflecting the hardware improvements provided by the industry. However, the architectural organization of these systems has been relatively stable over the years.


Best Practice – Connect Storage Directly or via NAS in WSC environments

With Google’s implementation, disk drives are connected directly to each individual server and managed by a globally distributed file system. Alternately, they can be part of Network Attached Storage (NAS) devices that are directly connected to the cluster-level switching fabric. NAS tends to be a simpler solution to deploy initially because it pushes the responsibility for data management and integrity to a NAS appliance vendor. In contrast, using the collection of disks directly attached to server nodes requires a fault-tolerant file system at the cluster level. This is difficult to implement, but can reduce hardware costs (the disks leverage the existing server enclosure), and networking fabric utilization (each server network port is effectively dynamically shared between the computing tasks and the file system).

Best Practice – WSCs should consider using non-standard Replication Models

The replication model between these approaches is also fundamentally different. A NAS solution provides extra reliability through replication or error correction capabilities within each appliance, whereas systems like GFS implement replication across different machines. However, GFS-like systems are able to keep data available even after the loss of an entire server enclosure or rack and may allow higher aggregate read bandwidth because the same data can be sourced from multiple replicas. Trading off higher write overheads for lower cost, higher availability, and increased read bandwidth was the right solution for many of Google’s workloads. An additional advantage of having disks co-located with compute servers is that it enables distributed system software to exploit data locality.
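The essence of machine-level replication is spreading copies across failure domains so that losing an enclosure or rack does not lose the data. The sketch below illustrates that placement idea only; it is not Google's actual GFS algorithm, and the cluster layout and round-robin choice are assumptions for illustration.

# Sketch of rack-aware replica placement: keep copies of each chunk on
# servers in different racks so a rack or enclosure failure is survivable.
# Illustrates the placement idea only, not the GFS implementation.

def place_replicas(chunk_id, servers_by_rack, replicas=3):
    """Pick one server from each of `replicas` different racks, round-robin."""
    racks = list(servers_by_rack)
    chosen = []
    for i in range(replicas):
        rack = racks[(chunk_id + i) % len(racks)]
        servers = servers_by_rack[rack]
        chosen.append((rack, servers[chunk_id % len(servers)]))
    return chosen

cluster = {"rack-A": ["a1", "a2"], "rack-B": ["b1", "b2"], "rack-C": ["c1", "c2"]}
for chunk in range(3):
    print("chunk", chunk, "->", place_replicas(chunk, cluster))

Because each chunk ends up on three different racks, a read can be served from whichever replica is least loaded, which is where the higher aggregate read bandwidth mentioned above comes from.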

Some WSCs, including Google’s, deploy desktop-class disk drives instead of enterprise-grade disks because of the substantial cost differential between the two. Because the data is nearly always replicated in some distributed fashion (as in GFS), this mitigates the possibly higher fault rates of desktop disks. Moreover, because field reliability of disk drives tends to deviate significantly from the manufacturer’s specifications, the reliability edge of enterprise drives is not clearly established.

Networking Fabric

Choosing a networking fabric for WSCs involves a trade-off between speed, scale, and cost. Typically, 1-Gbps Ethernet switches with up to 48 ports are essentially a commodity component, costing less than $30/Gbps per server to connect a single rack as of the writing of this article. As a result, bandwidth within a rack of servers tends to have a homogeneous profile.

However, network switches with high port counts, which are needed to tie together WSC clusters, have a much different price structure and are more than ten times more expensive (per 1-Gbps port) than commodity switches. In other words, a switch that has 10 times the bi-section bandwidth costs about 100 times as much.

Best Practice – For WSC’s Create a Two level Hierarchy of networked switches As a result of this cost variation, the networking fabric of WSCs is often organized at the two- level hierarchy. Commodity switches in each rack provide a fraction of their bi-section bandwidth for interact communication through a handful of uplinks to the more costly cluster-level switches. For example, a rack with 40 servers, each with a 1-Gbps port, might have between four and eight 1-Gbps uplinks to the cluster-level switch, corresponding to an oversubscription factor between 5 and 10 for communication across racks. In such a network, programmers must be aware of the relatively scarce cluster-level bandwidth resources and try to exploit rack-level networking locality, complicating software development and possibly affecting resource utilization.

Alternatively, one can remove some of the cluster-level networking bottlenecks by spending more money on the interconnect fabric. For example, Infiniband interconnects typically scale to a few thousand ports but can cost $500–$2,000 per port. Lower-cost fabrics can also be formed from commodity Ethernet switches by building "fat tree" networks. How much to spend on networking versus spending the equivalent amount on more servers or storage is an application-specific question that has no single correct answer. Whatever the choice, a reasonable assumption is that intra-rack connectivity will remain cheaper than inter-rack connectivity.

Handling Failures

The sheer scale of WSCs requires that Internet services software tolerate relatively high component fault rates. Disk drives, for example, can exhibit annualized failure rates higher than 4%, and between 1.2 and 16 server-level restarts per year per machine is typical. With such high component failure rates, an application running across thousands of machines may need to react to failure conditions on an hourly basis.
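To see why failure handling must be routine, consider a rough illustrative calculation using the rates quoted above; the cluster size and disks-per-server figures are assumptions for the example, not measurements from any particular facility:

```python
# Back-of-the-envelope view of failure frequency at warehouse scale.
servers = 2000
disks_per_server = 4
disk_afr = 0.04            # 4% annualized failure rate per drive, as cited above
restarts_per_server = 4    # somewhere in the 1.2-16 range cited above

disk_failures_per_year = servers * disks_per_server * disk_afr
server_restarts_per_year = servers * restarts_per_server
events_per_day = (disk_failures_per_year + server_restarts_per_year) / 365

print(f"~{disk_failures_per_year:.0f} disk failures/year, "
      f"~{server_restarts_per_year:.0f} server restarts/year, "
      f"~{events_per_day:.1f} failure events per day")
# Even this modest cluster sees multiple failure events every day,
# so applications must treat failure as a routine condition.
```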


The applications that run on warehouse-scale computers (WSCs) dominate many system design trade-off decisions. Equally distinctive, however, is the system software and tooling needed to turn thousands of servers into a complete computing platform for large Internet services. The following terminology defines the different software layers in a typical WSC deployment:

Platform-level software:

Platform-level software is the common firmware, kernel, operating system distribution, and libraries expected to be present in all individual servers to abstract the hardware of a single machine and provide basic server-level services.

Cluster-level infrastructure:

Cluster-level infrastructure is the collection of distributed systems software that manages resources and provides services at the cluster level; ultimately, we can consider these services an operating system for a datacenter. Examples are distributed file systems, schedulers, remote procedure call (RPC) layers, and programming models that simplify the usage of resources at the scale of datacenters, such as MapReduce, Dryad, and Hadoop.

Application-level software

Application-level software is the software that implements a specific service. It is often useful to further divide application-level software into online services and offline computations, because those tend to have different requirements. Google search, Gmail, and Google Maps are examples of online services. Offline computations are typically used in large-scale data analysis or as part of the pipeline that generates the data used in online services; for example, building an index of the Web or processing satellite images to create map files for the online service.

Best Practice - Use Sharding and other requirements in WSCs

Sharding is splitting a data set into smaller fragments (shards) and distributing them across a large number of machines. Operations on the data set are dispatched to some or all of the machines hosting shards, and results are coalesced by the client. The sharding policy can vary depending on space constraints and performance considerations. Sharding also helps availability, because recovery of small data fragments can be done more quickly than recovery of larger ones.
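The following is a minimal sketch of hash-based sharding, one common way to implement the policy described above; the class, host names, and keys are illustrative assumptions, and real systems add replication, rebalancing, and richer routing.

```python
# Minimal sketch of hash-based sharding.
import hashlib

class ShardedStore:
    def __init__(self, shard_hosts):
        # Each entry stands in for a machine (or group of machines) hosting one shard.
        self.hosts = shard_hosts
        self.shards = [dict() for _ in shard_hosts]

    def _shard_for(self, key):
        digest = hashlib.md5(key.encode()).hexdigest()
        return int(digest, 16) % len(self.shards)

    def put(self, key, value):
        self.shards[self._shard_for(key)][key] = value

    def get(self, key):
        return self.shards[self._shard_for(key)].get(key)

store = ShardedStore(["rack1-host1", "rack2-host7", "rack3-host3"])
store.put("user:1001", {"name": "example"})
print(store.get("user:1001"))
# Operations touch only the shard that owns the key; a scatter-gather query
# would dispatch to all shards and coalesce results at the client.
```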

In large-scale services, service-level performance often depends on the slowest responder out of hundreds or thousands of servers. Reducing response-time variance is critical. In a sharded service, load balancing can be achieved by biasing the sharding policy to equalize the amount of work per server.

That policy may need to be informed by the expected mix of requests or by the computing capabilities of different servers. Note that even homogeneous machines can offer different performance characteristics to a load-balancing client if multiple applications are sharing a subset of the load-balanced servers.

In a replicated service, a load-balancing agent can dynamically adjust the load by selecting to which servers to dispatch a new request. It may still be difficult to approach perfect load balancing because the amount of work required by different types of requests is not always constant or predictable. Health checking and watchdog timers are required.

In a large-scale system, failures are often manifested as slow or unresponsive behavior from a given server. In this environment, no operation can rely on a given server responding in order to make forward progress. In addition, it is critical to quickly determine that a server is too slow or unreachable and steer new requests away from it. Remote procedure calls (RPCs) must set well-informed time-out values to abort long-running requests, and infrastructure-level software may need to continually check connection-level responsiveness of communicating servers and take appropriate action when needed.
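A minimal sketch of this timeout-and-steer-away behavior follows; the transport itself is abstracted behind a call_server parameter, and the timeout value and health-tracking structure are assumptions made for illustration only.

```python
# Sketch of aborting slow RPCs and steering requests away from unhealthy servers.
import time

RPC_TIMEOUT_SECONDS = 0.5
unhealthy_servers = set()

def call_with_timeout(server, request, call_server):
    """Dispatch an RPC with a deadline and mark servers that miss it as unhealthy."""
    if server in unhealthy_servers:
        raise RuntimeError(f"{server} is marked unhealthy; pick another replica")
    start = time.monotonic()
    try:
        # call_server is assumed to raise TimeoutError when the deadline expires.
        return call_server(server, request, timeout=RPC_TIMEOUT_SECONDS)
    except TimeoutError:
        unhealthy_servers.add(server)  # stop routing new requests to this server
        raise
    finally:
        if time.monotonic() - start > RPC_TIMEOUT_SECONDS:
            unhealthy_servers.add(server)
```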

Integrity checks are required. In some cases, besides unresponsiveness, faults are manifested as data corruption. Although such faults may be rarer, they do occur, and often in ways that underlying hardware or software checks do not catch (e.g., there are known issues with the error coverage of some networking CRC checks). Extra software checks can mitigate these problems by changing the underlying encoding or adding more powerful redundant integrity checks. See the section titled "Best Practice – Implement Undetected data corruption technology into environment", starting on page 69, for additional details.
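A minimal sketch of one such redundant, application-level integrity check follows; it layers an end-to-end SHA-256 checksum on top of whatever the network and storage stacks already verify, and the payload shown is an arbitrary placeholder.

```python
# End-to-end checksum wrapper for detecting silent data corruption.
import hashlib

def wrap_with_checksum(payload: bytes) -> bytes:
    # Prepend a 32-byte SHA-256 digest of the payload.
    return hashlib.sha256(payload).digest() + payload

def unwrap_and_verify(blob: bytes) -> bytes:
    digest, payload = blob[:32], blob[32:]
    if hashlib.sha256(payload).digest() != digest:
        raise ValueError("data corruption detected by end-to-end checksum")
    return payload

blob = wrap_with_checksum(b"index-shard-00042")
assert unwrap_and_verify(blob) == b"index-shard-00042"
```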

Best Practice – Implement application specific compression

Often a large portion of the equipment costs in modern datacenters is in the various storage layers. For services with very high throughput requirements, it is critical to fit as much of the working set as possible in DRAM.

This makes compression techniques very important, because the extra CPU overhead of decompressing is still orders of magnitude lower than the penalties involved in going to disk. Although generic compression algorithms can do quite well on average, application-level compression schemes that are aware of the data encoding and the distribution of values can achieve significantly superior compression factors or better decompression speeds.
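The following sketch illustrates the gap between a generic codec and an application-aware encoding; zlib stands in for the generic algorithm, and the assumption that the values are small integers stored as fixed-width text is invented purely for this example.

```python
# Generic vs. application-aware compression of a column of small counter values.
import zlib

values = [i % 50 for i in range(10000)]

# Generic path: compress the values as they might be serialized, fixed-width text.
generic_input = "\n".join(f"{v:010d}" for v in values).encode()
generic = zlib.compress(generic_input)

# Application-aware path: knowing every value fits in one byte, re-encode first.
aware_input = bytes(values)
aware = zlib.compress(aware_input)

print(f"raw fixed-width text: {len(generic_input)} bytes")
print(f"generic zlib:         {len(generic)} bytes")
print(f"value-aware + zlib:   {len(aware)} bytes")
# Knowing the data encoding and distribution shrinks the input before the
# generic codec even runs, which is the effect described in the text.
```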

Eventual consistency is another important lever: keeping multiple replicas up to date using the traditional guarantees offered by a database management system significantly increases complexity and software infrastructure requirements, and reduces the availability of distributed applications. Fortunately, large classes of applications have more relaxed requirements and can tolerate inconsistent views for limited periods, provided that the system eventually returns to a stable consistent state.

The response time of large parallel applications can also be improved by the use of redundant computation techniques. Several situations may cause a given subtask of a large parallel job to be much slower than its siblings, either due to performance interference with other workloads or to software or hardware faults. Redundant computation is not as widely deployed as other techniques because of its obvious overhead.

However, in some situations the completion of a large job is held up by the execution of a very small percentage of its subtasks. One such example is the issue of stragglers, as described in the paper on MapReduce19. In this case, a single slow worker can determine the response time of a huge parallel task. MapReduce's strategy is to identify such situations toward the end of a job and speculatively start redundant workers only for those slower tasks. This strategy increases resource usage by a few percentage points while reducing a parallel computation's completion time by more than 30%.
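As a minimal sketch of the decision logic behind speculative execution (the thresholds, task names, and progress figures are illustrative assumptions, not MapReduce's actual heuristics):

```python
# Pick straggler tasks to re-execute speculatively near the end of a job.
def pick_speculation_candidates(task_progress, completed_fraction_threshold=0.9,
                                slowness_factor=2.0):
    """task_progress: dict of task_id -> fraction complete (0.0 to 1.0)."""
    finished = [t for t, p in task_progress.items() if p >= 1.0]
    if len(finished) / len(task_progress) < completed_fraction_threshold:
        return []  # too early in the job; speculation would waste resources
    running = {t: p for t, p in task_progress.items() if p < 1.0}
    avg_progress = sum(running.values()) / max(len(running), 1)
    # A task lagging well behind its still-running siblings is a straggler.
    return [t for t, p in running.items() if p * slowness_factor < avg_progress]

progress = {f"map-{i}": 1.0 for i in range(95)}
progress.update({"map-95": 0.9, "map-96": 0.85, "map-97": 0.2,
                 "map-98": 0.15, "map-99": 0.88})
print(pick_speculation_candidates(progress))  # ['map-97', 'map-98']
```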

19 http://labs.google.com/papers/mapreduce.html

Utility Computing

Utility computing can be thought of as a previous incarnation of Cloud Computing with a business slant.

While utility computing often requires a cloud-like infrastructure, its focus is on the business model. Simply put, a utility computing service is one in which customers receive computing resources from a service provider (hardware and/or software) and “pay by the glass,” much as you do for your water, electric service and other utilities at home.

Amazon Web Services (AWS), despite a recent outage, is the current incumbent for this model as it provides a variety of services, among them the Elastic Compute Cloud (EC2), in which customers pay for compute resources by the hour, and Simple Storage Service (S3), for which customers pay based on storage capacity. Other utility services include Sun’s Network.com, EMC’s recently launched storage cloud service, and those offered by startups, such as Joyent and Mosso.

The primary benefit of utility computing is better economics. As mentioned previously, corporate data centers are notoriously underutilized, with resources such as servers often idle 85 percent of the time. This is due to over-provisioning: buying more hardware than is needed on average in order to handle peaks.

Any application needs a model of computation, a model of storage and, assuming the application is even trivially distributed, a model of communication.

The statistical multiplexing necessary to achieve elasticity and the illusion of infinite capacity requires resources to be virtualized so that the implementation of how they are multiplexed and shared can be hidden from the programmer.

Different utility computing offerings are distinguished based on the level of abstraction presented to the programmer and the level of management of the resources. For example, Amazon EC2 is at one end of the spectrum. An EC2 instance looks much like physical hardware, and users can control nearly the entire software stack, from the kernel upwards. The API exposed is "thin": a few dozen API calls to request and configure the virtualized hardware. There is no a priori limit on the kinds of applications that can be hosted; the low-level virtualization of raw CPU cycles, block-device storage, and IP-level connectivity allows developers to code whatever they want. On the other hand, this makes it inherently difficult for Amazon to offer automatic scalability and failover, because the semantics associated with replication and other state management issues are highly application-dependent.

AWS offers a number of higher-level managed services, including several different managed storage services for use in conjunction with EC2, such as SimpleDB. However, these offerings have higher latency and non-standard APIs, and our understanding is that they are not as widely used as other parts of AWS. At the other extreme of the spectrum are application domain-specific platforms, such as Google AppEngine and Force.com, the SalesForce business software development platform. AppEngine is targeted exclusively at traditional web applications, enforcing an application structure of clean separation between a stateless computation tier and a stateful storage tier. Furthermore, AppEngine applications are expected to be request-reply based, and as such, they are severely rationed in how much CPU time they can use in servicing a particular request. AppEngine's impressive automatic scaling and high-availability mechanisms, and the proprietary MegaStore (based on "BigTable") data storage available to AppEngine applications, all rely on these constraints. Therefore, AppEngine is not suitable for general-purpose computing. Similarly, SalesForce.com is designed to support business applications that run against the salesforce.com database, and nothing else. Microsoft's Azure is an intermediate point on this spectrum of flexibility vs. programmer convenience. Azure applications are written using the .NET libraries and compiled to the Common Language Runtime, a language-independent managed environment. The system supports general-purpose computing, rather than a single category of application. Users get a choice of language, but cannot control the underlying operating system or runtime. The libraries provide a degree of automatic network configuration and failover/scalability, but require the developer to declaratively specify some application properties in order to do so. Therefore, Azure is intermediate between complete application frameworks like AppEngine on the one hand, and hardware virtual machines like EC2 on the other.

As shown in Table 2 - Examples of Cloud Computing vendors and how each provides virtualized resources (computation, storage), I summarize how these three classes virtualize computation, storage, and networking. The scattershot offerings of scalable storage suggest that scalable storage with an API comparable in richness to SQL remains an open challenge. Amazon has begun offering Oracle databases hosted on AWS, but the economics and licensing model of this product make it a less natural fit for Cloud Computing. Will one model beat out the others in the Cloud Computing space?

We can draw an analogy with programming languages and frameworks. Low-level languages such as C and assembly language allow fine control and close communication with the bare metal, but if the developer is writing a Web application, the mechanics of managing sockets, dispatching requests, and so on are cumbersome and tedious to code, even with good libraries. On the other hand, high-level frameworks such as Ruby on Rails make these mechanics invisible to the programmer, but are only useful if the application readily fits the request/reply structure and the abstractions provided by Rails; any deviation requires diving into the framework at best, and may be awkward to code. No reasonable Ruby developer would deny the superiority of C for certain tasks, and vice versa. Different tasks will result in demand for different classes of utility computing.

Continuing the language analogy, just as high-level languages can be implemented in lower-level ones, highly managed cloud platforms can be hosted on top of less-managed ones. For example, AppEngine could be hosted on top of Azure or EC2, and Azure could be hosted on top of EC2. Of course, AppEngine and Azure each offer proprietary features (AppEngine's scaling, failover, and MegaStore data storage) or large, complex APIs (Azure's .NET libraries) that have no free implementation, so any attempt to "clone" AppEngine or Azure would require re-implementing those features or APIs.

Table 2 - Examples of Cloud Computing vendors and how each provides virtualized resources (computation, storage)

Computation model (VM)
• Google AppEngine: Predefined application structure and framework; programmer-provided "handlers" written in Python, with all persistent state stored in MegaStore (outside the Python code). Automatic scaling up and down of computation and storage, plus network and server failover; all consistent with a 3-tier Web app structure.
• Microsoft Azure: Microsoft Common Language Runtime (CLR) VM; a common intermediate form executed in a managed environment. Machines are provisioned based on declarative descriptions (e.g., which "roles" can be replicated); automatic load balancing.
• Amazon Web Services: x86 Instruction Set Architecture (ISA) via Xen VM. Computation elasticity allows scalability, but the developer must build the machinery, or use a third-party VAR such as RightScale.

Storage model
• Google AppEngine: MegaStore/BigTable.
• Microsoft Azure: SQL Data Services (restricted view of SQL Server); Azure storage service.
• Amazon Web Services: Range of models from block store (EBS) to augmented key/blob store (SimpleDB). Automatic scaling varies from no scaling or sharing (EBS) to fully automatic (SimpleDB, S3), depending on which model is used; consistency guarantees also vary widely by model. APIs vary from standardized (EBS) to proprietary.

Networking model
• Google AppEngine: Fixed topology to accommodate a 3-tier Web app structure; scaling up and down is automatic and programmer-invisible.
• Microsoft Azure: Automatic, based on the programmer's declarative descriptions of app components (roles).
• Amazon Web Services: Fixed topology to accommodate a 3-tier Web app structure; scaling up and down is automatic and programmer-invisible.

Grid computing

Grid Computing is a form of distributed computing in which a virtual supercomputer is composed of networked, loosely coupled computers acting in concert to perform very large tasks. In a typical configuration, resource provisioning is managed and allocated by a group of distributed nodes, while the central virtual supercomputer is where the resource is consumed and provisioned; the role of coordinating resource provisioning is also centrally controlled.

It has been applied to computationally intensive scientific, mathematical, and academic problems through volunteer computing, and used in commercial enterprises for such diverse applications as drug discovery, economic forecasting, seismic analysis, and back-office processing to support e-commerce and web services. What distinguishes Grid Computing from cluster computing is that it is more loosely coupled, heterogeneous, and geographically dispersed. In addition, grids are often constructed with general-purpose grid software libraries and middleware, dividing and apportioning pieces of a program to potentially thousands of computers. What distinguishes Cloud Computing from Grid Computing, in turn, is that it is web-centric, even though some of the definitions are conceptually similar (such as computing resources being consumed as electricity is from the power grid).

Cloud Type Architecture Summary

In terms of cloud computing service types and the similarities and differences between cloud, grid and other cloud types, it may be advantageous to use Amazon Web Services as an example.

To get cloud computing to work, you need three things: thin clients (or clients with a thick-thin switch), grid computing, and utility computing. Grid computing links disparate computers to form one large infrastructure, harnessing unused resources. Utility computing means paying for what you use on shared servers, much as you pay for a public utility (such as electricity, gas, and so on).

With grid computing, you can provision computing resources as a utility that can be turned on or off. Cloud computing goes one step further with on-demand resource provisioning. This eliminates over-provisioning when used with utility pricing. It also removes the need to over-provision in order to meet the demands of millions of users.

Infrastructure as a Service and more

A consumer can get service from a full computer infrastructure through the Internet. This type of service is called Infrastructure as a Service (IaaS). Internet-based services such as storage and databases are part of IaaS. Other types of services on the Internet are Platform as a Service (PaaS) and Software as a Service (SaaS). PaaS offers full or partial application development environments that users can access, while SaaS provides a complete turnkey application, such as Enterprise Resource Management, through the Internet.

To get an idea of how Infrastructure as a Service (IaaS) is/was used in real life, consider The New York Times, which processed terabytes of archival data using hundreds of Amazon's EC2 instances within 36 hours. If The New York Times had not used EC2, it would have taken days or months to process the data.

The IaaS divides into two types of usage: public and private. Amazon EC2 uses public server pools in the infrastructure cloud. A more private cloud service uses groups of public or private server pools from an internal corporate data center. One can use both types to develop software within the environment of the corporate data center, and, with EC2, temporarily extend resources at low cost, for example, for testing purposes. The mix may provide a faster way of developing applications and services with shorter development and testing cycles.

Amazon Web Services

With EC2, customers create their own Amazon Machine Images (AMIs) containing an operating system, applications, and data, and they control how many instances of each AMI run at any given time. Customers pay for the instance-hours (and bandwidth) they use, adding computing resources at peak times and removing them when they are no longer needed. EC2, Simple Storage Service (S3), and other Amazon offerings scale up to deliver services over the Internet in massive capacities to millions of users.

Amazon provides five different types of servers, ranging from single-core x86 servers to eight-core x86_64 servers. You do not have to know which servers are in use to deliver service instances. You can place the instances in different geographical locations or availability zones. Amazon also allows elastic IP addresses that can be dynamically allocated to instances.
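The following is a hedged sketch of the AMI/instance lifecycle just described, using the later boto3 SDK purely as an illustration; the AMI ID, instance type, and region are placeholder assumptions, and real usage requires AWS credentials.

```python
# Sketch: launch an instance from a customer AMI, attach an elastic IP, then release capacity.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Launch an instance from a customer-built AMI; pay only for the instance-hours used.
resp = ec2.run_instances(
    ImageId="ami-00000000",    # placeholder AMI containing OS, applications, and data
    InstanceType="m1.small",   # placeholder instance type
    MinCount=1,
    MaxCount=1,
)
instance_id = resp["Instances"][0]["InstanceId"]

# Elastic IP addresses can be allocated and associated with whichever instance
# currently provides the service.
alloc = ec2.allocate_address(Domain="vpc")
ec2.associate_address(InstanceId=instance_id, AllocationId=alloc["AllocationId"])

# When peak demand passes, terminate the instance and stop paying for it.
ec2.terminate_instances(InstanceIds=[instance_id])
```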


Cloud computing

With cloud computing, companies can scale up to massive capacities in an instant without having to invest in new infrastructure, train new personnel, or license new software. Cloud computing is of particular benefit to small and medium-sized businesses that wish to completely outsource their data-center infrastructure, and to large companies that wish to get peak-load capacity without incurring the higher cost of building larger data centers internally. In both instances, service consumers use what they need on the Internet and pay only for what they use.

The service consumer no longer has to be at a PC, use an application from the PC, or purchase a specific version that is configured for smart phones, PDAs, and other devices. The consumer does not own the infrastructure, software, or platform in the cloud. The consumer has lower upfront costs, capital expenses, and operating expenses. The consumer does not care about how servers and networks are maintained in the cloud. The consumer can access multiple servers anywhere on the globe, without knowing which ones and where they are located.

Grid Computing

Cloud computing evolved from grid computing and provides on-demand resource provisioning. Grid computing may or may not be in the cloud, depending on what type of users are using it. If the users are systems administrators and integrators, they care how things are maintained in the cloud; the providers install and virtualize servers and applications. If the users are consumers, they do not care how things are run in the system.

Grid computing requires the use of software that can divide and farm out pieces of a program as one large system image to several thousand computers. One concern about grid is that if one piece of the software on a node fails, other pieces of the software on other nodes may fail. This is minimized if that component has a failover component on another node, but problems can still arise if components rely on other pieces of software to accomplish one or more grid computing tasks. Large system images and associated hardware to operate and maintain them can contribute to large capital and operating expenses.

Similarities and differences

Cloud computing and grid computing are scalable. Scalability is accomplished through load balancing of application instances running separately on a variety of operating systems and connected through Web services. CPU and network bandwidth are allocated and de-allocated on demand. The system's storage capacity goes up and down depending on the number of users, the number of instances, and the amount of data transferred at a given time.

Both computing types involve multi-tenancy and multitasking, meaning that many customers can perform different tasks, accessing single or multiple application instances. Sharing resources among a large pool of users assists in reducing infrastructure costs and peak-load capacities. Cloud and grid computing provide service-level agreements (SLAs) for guaranteed uptime availability of, say, 99 percent. If the service slides below the level of the guaranteed uptime, the consumer will get a service credit for receiving data late.

Amazon S3 provides a Web services interface for storing and retrieving data in the cloud. There is no set maximum on the number of objects you can store in S3, and you can store an object as small as 1 byte and as large as 5 GB, with buckets growing to several terabytes in aggregate. S3 uses the concept of buckets as containers for each storage location of your objects. The data is stored securely using the same data storage infrastructure that Amazon uses for its e-commerce Web sites. While storage computing in the grid is well suited for data-intensive storage, it is not economically suited for storing objects as small as 1 byte; in a data grid, the amounts of distributed data must be large for maximum benefit.
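A hedged illustration of the bucket/object model described above follows, using the later boto3 SDK purely as a sketch; the bucket and key names are placeholders, and real usage requires AWS credentials and a globally unique bucket name.

```python
# Sketch: create a bucket, then store and retrieve an object through the S3 API.
import boto3

s3 = boto3.client("s3")

# A bucket is the container for a storage location; objects live inside it.
s3.create_bucket(Bucket="example-archive-bucket")

# Store and retrieve an object by key over the web services interface.
s3.put_object(Bucket="example-archive-bucket",
              Key="reports/2010/q1.csv",
              Body=b"region,revenue\nus-east,42\n")
obj = s3.get_object(Bucket="example-archive-bucket", Key="reports/2010/q1.csv")
print(obj["Body"].read().decode())
```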

A computational grid focuses on computationally intensive operations. Amazon Web Services in cloud computing offers two types of instances: standard and high-CPU.

Business Practices Pillar

It is important to recognize that there are tough challenges that data center managers, industry operators, and IT businesses face as they all struggle to support their businesses in the face of budget cuts and uncertainty about the future. It is natural that environmental sustainability is taking a back seat in many companies at this time. However, the fact is, being "lean and green" is good for both the business and the environment, and organizations that focus their attention accordingly will see clear benefits. Reducing energy use and waste improves a company's bottom line, and increasing the use of recycled materials is a proven way to demonstrate good corporate citizenship to your customers, employees, and the communities in which you do business.

That said, it is not always easy to know where to begin in moving to greener and more efficient operations. As shown in Figure 25 – Sustainability Ontology – Business Practices, shown below, on page 128, many methods and best practices can be implemented. The diagram outlines the structure of how a company can achieve a high level of efficiency and sustainability through better process improvement and management, and how conforming to standards and addressing governance and compliance standards can help achieve this goal. With that in mind, this section enumerates the many best business practices for environmentally sustainable business. It is hoped that if companies follow these best practices, it will lead to optimal use of resources and help teams and management stay aligned with the core strategies and goals of achieving a sustainable IT.


Figure 25 – Sustainability Ontology – Business Practices

Process Management and Improvement

Best Practice - Provide incentives that support your primary goals

Incentives can help you achieve remarkable results in a relatively short period of time if you apply them properly. Take energy efficiency as an example. A broad range of technology improvements and best practices are already available that companies can use to improve efficiency in the data center. However, industry adoption of these advances has been relatively low. One possible reason is that the wrong incentives are in place. For instance, data center managers are typically compensated based on uptime and not efficiency. A best practice is to provide specific incentives to reward managers for improving the efficiency of their operations, using metrics such as Power Usage Effectiveness (PUE), which measures the energy efficiency of a data center by dividing the amount of power entering the data center by the power used to run the computer infrastructure within it. Uptime is still an important metric, but the goal is to balance uptime incentives appropriately against the need to improve energy efficiency [2].
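Written as a formula, and with illustrative figures that are assumptions rather than measurements from any particular facility, a data center drawing 1.8 MW at the utility feed to deliver 1.2 MW to IT equipment would have:

```latex
\mathrm{PUE} = \frac{\text{Total facility power}}{\text{IT equipment power}}
             = \frac{1.8\ \text{MW}}{1.2\ \text{MW}} = 1.5
```

Lower values are better; a PUE of 1.5 means that for every watt delivered to servers, storage, and network gear, another half watt goes to power distribution, cooling, and other overhead.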

Another outmoded incentive in the industry involves how data center hosting costs are allocated back to internal organizations. Most often, these costs are allocated based on the proportion of floor space used. These incentives drive space efficiency and ultra-robust data centers, but they come at a high cost and typically are not energy efficient. Space-based allocation does not reflect the true cost of building and maintaining a data center. A best practice is to achieve substantial efficiency gains by moving to a model that allocates costs to internal customers based on the proportion of energy their services consume. It is anticipated that business units would then begin evaluating their server utilization data to make sure they did not already have unused capacity before ordering more servers.

Best Practice - Focus on effective resource utilization

Energy efficiency is an important element in any company's business practices, but equally important is the effective use of deployed resources. For example, if only 50 percent of a data center's power capacity is used, then highly expensive capacity is stranded in the uninterruptible power supplies (UPSs), generators, chillers, and so on. In a typical 12-megawatt data center, this could equate to $4-8 million annually in unused capital expenditure [3]. In addition, there is embedded energy in the unused capacity, since it takes energy to manufacture the UPSs, generators, chillers, and so on. Stranding capacity will also force organizations to build additional data centers sooner than necessary.
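A rough back-of-the-envelope view of stranded capacity, consistent with the 12 MW example above, follows; the per-kW annualized capital cost is an illustrative assumption, not a figure from the cited source.

```python
# Estimate the annual cost of stranded power capacity in a half-utilized facility.
capacity_kw = 12000            # 12 MW facility
utilization = 0.50             # only half of the power capacity is actually used
capex_per_kw_per_year = 1000   # assumed annualized cost of UPSs, generators, chillers ($/kW)

stranded_kw = capacity_kw * (1 - utilization)
stranded_cost = stranded_kw * capex_per_kw_per_year
print(f"Stranded capacity: {stranded_kw:,.0f} kW "
      f"(~${stranded_cost/1e6:.1f}M per year in unused capital)")
# With these assumptions, the unused half of a 12 MW facility strands roughly $6M
# per year, inside the $4-8M range cited in the text.
```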

Best Practice - Use virtualization to improve server utilization and increase operational efficiency

As noted in the best practice above, underutilized servers are a major problem facing many data center operators; industry analysts have reported that utilization levels are often well below 20 percent. In today's budgetary climate, IT departments are being asked to improve efficiency, not only from a capital perspective but also with regard to operational overhead. By migrating applications from physical to virtual machines and consolidating these applications onto shared physical hardware, using technologies such as Hyper-V, organizations can increase utilization year after year, which in turn helps increase the productivity per watt of their operations. Utilizing infrastructure architectures such as Amazon's, Microsoft's Windows Azure cloud operating system, and EMC's VCE Cloud virtualization is a best practice.

One immediate benefit of virtual environments is improved operational efficiency. Operations teams can deploy and manage servers in a fraction of the time it would take to deploy the equivalent physical hardware or perform a physical configuration change. In a virtual environment, managing hardware failures without disrupting service is as simple as a click of a button or automated trigger, which rolls virtual machines from the affected physical host to a healthy host.

A server running virtualization will often need more memory to support multiple virtual machines, and there is a small software overhead for virtualization. However, the overall value proposition, measured in terms of work done per cost and per watt, is much better than in the dedicated, underutilized physical server case.

Key benefits of virtualization include:
• Reduction in capital expenditures
• Decrease in real estate, power, and cooling costs
• Faster time to market for new products and services
• Reduction in outage and maintenance windows

Best Practice - Drive quality up through compliance

Many data center processes are influenced by the need to meet regulatory and security requirements for availability, data integrity, and consistency. See the section titled "Compliance" on page 146 for additional information. Quality and consistency are tightly linked and can be managed through a common set of processes. Popular approaches to increasing quality are almost without exception tied to observing standards and reducing variability.

A continuous process helps maintain the effectiveness of controls as your environment changes. Compliance boils down to developing a policy and then operating consistently as measured against that policy. The extended value offered by standardized, consistent processes that address compliance will also help you achieve higher quality. A best practice is to achieve certification to the international information security standard, ISO/IEC 27001:2005. For instance, through monitoring their data center systems for policy compliance, many companies have exposed processes that were causing problems and found opportunities for improvements that benefitted multiple projects. This continuous approach, outlined in Figure 26 - A continuous process helps maintain the effectiveness of controls as your environment changes, is a best practice.

[Figure 26 depicts a continuous control cycle with stages including Design Control, Test Control, Perform Control, Analyze and Correct, Document Issues, and Feedback to Design.]

Figure 26 - A continuous process helps maintain the effectiveness of controls as your environment changes

Best Practice - Embrace change management

Poorly planned changes to the production environment can have unexpected and sometimes disastrous results, which can then spill over into the planet's environment when the impacts involve wasted energy and other inefficient use of resources. Changes may involve hardware, software, configuration, or process. Standardized procedures for the request, approval, coordination, and execution of changes can greatly reduce the number and severity of unplanned outages. Data center organizations should adopt and maintain repeatable, well-documented processes, where the communication of planned changes enables teams to identify risks to dependent systems and develop appropriate workarounds in advance.

Figure 27 - Consistent and well-documented processes help ensure smooth changes in the production environment

A best practice is to manage changes to a data center's hardware and software infrastructure through a review and planning process based on the Information Technology Infrastructure Library (ITIL) framework. An example of this type of process is shown in Figure 27 - Consistent and well-documented processes help ensure smooth changes in the production environment. Proposed changes are reviewed prior to approval to ensure that sufficient diligence has been applied. Additionally, planning for recovery in the case of unexpected results is crucial; rollback plans must be scrutinized to ensure that all known contingencies have been considered. When developing a change management program, it is important to consider the influences of people, processes, and technology. By employing the correct level of change management, businesses can increase customer satisfaction and improve service-level performance without placing undue burden on their operations staff.

Other features that your change management process should include:

• Documented policies around communication and timeline requirements
• Standard templates for requesting, communicating, and reviewing changes
• Post-implementation review, including cases where things went well

Best Practice - Invest in understanding your application workload and behavior

The applications in your environment and the particulars of the traffic on your network are unique. The better you understand them, the better positioned you will be to make improvements. Moving forward in this regard requires hardware engineering and performance analysis expertise within your organization, so you should consider staffing accordingly. Credible and competent in-house expertise is needed to properly evaluate new hardware, optimize your request for proposal (RFP) process for servers, experiment with new technologies, and provide meaningful feedback to your vendors. Once you start building this expertise, the first goal is to focus your team on understanding your environment, and then on working with the vendor community. Make your needs known to them as early as possible. This approach makes sense for any company in the data center industry that is working to increase efficiency: if you do not start with efficient servers, you are just going to pass inefficiencies down the line.

Best Practice - Right-size your server platforms to meet your application requirements

Another best practice in data centers involves "right-sizing the platform." This can take two forms. In the first, you work closely with server and other infrastructure manufacturers to optimize their designs and remove items you do not use, such as more memory slots and input/output (I/O) slots than you need, and to focus on high-efficiency power supplies and advanced power management features. Given the volume of servers that many large corporations purchase, most manufacturers are open to meeting these requests and to partnering with customers to drive innovation into the server space to reduce resource consumption even further. Of course, not all companies purchase servers on a scale where it makes sense for manufacturers to offer customized stock-keeping units (SKUs). That is where the second kind of right-sizing comes in. It involves being disciplined about developing the exact specifications that servers need to meet for your needs, and then not buying machines that exceed those specifications. It is often tempting to buy the latest and greatest technology, but you should only do so after you have evaluated and quantified whether the promised gains provide an acceptable return on investment (ROI).

Consider that you may not need the latest features server vendors are selling. Understand your workload and then pick the right platform. Conventional wisdom has been to buy something bigger than your current needs so you can protect your investment. However, with today’s rapid advances in technology, this can lead to rapid obsolescence. You may find that a better alternative is to buy for today's needs and then add more capacity as and when you need it. Also, look for opportunities to use a newer two-socket quad-core platform to replace an older four-socket dual-core, instead of overreaching with newer, more capable four-socket platforms with four or six cores per socket. Of course, there is no single answer. Again, analyze your needs and evaluate your alternatives.

Best Practice - Evaluate and test servers for performance, power, and total cost of ownership

A best practice, and what many large corporations do in the procurement process, is to build evaluation around testing. Hardware teams run power and performance tests on all "short list" candidate servers, and then calculate the total cost of ownership, including power usage effectiveness (PUE) for energy costs. The key is to bring the testing in-house so you can evaluate performance and other criteria in your specific environment and on your workload. It is important not to rely solely on benchmark data, which may not be applicable to your needs and environment.

For smaller organizations that do not have resources to do their own evaluation and testing, SPECpower_ssj2008 (the industry-standard SPEC benchmark that evaluates the power and performance characteristics of volume server class computers) can be used in the absence of anything else to estimate workload power. In addition to doing its own tests, Microsoft requests this data from vendors in all of its RFPs. For more information, visit the Standard Performance Evaluation Corp. web site at www.spec.org/specpower.

Best Practice - Converge on as small a number of stock-keeping units (SKUs) as you can

A best practice adopted by leading data center initiatives is to move to a server "standards" program where internal customers choose from a consolidated catalogue of servers. Narrowing the number of SKUs can allow IT departments to make larger volume buys, thereby cutting capital costs. Perhaps equally important, it helps reduce operational expenditures and the complexities of installing and supporting a variety of models. This increases operational consistency and results in better pricing, as long-term orders are more attractive to vendors. Finally, it provides exchangeable or replaceable assets. For example, if the demand for one online application decreases while another increases, it is easier to reallocate servers as needed with fewer SKUs.

Best Practice - Take advantage of competitive bids from multiple manufacturers to foster innovation and reduce costs

Competitive bidding encourages thorough, ongoing analysis of proposals from multiple companies and puts most of the weight on price, power, and performance. A best practice is to develop hardware requirements and then share them with multiple manufacturers, working actively with each to develop an optimized solution. Energy efficiency, power consumption, cost effectiveness, and application performance per watt each play key roles in hardware selection. The competition motivates manufacturers to be price competitive, drive innovation, and provide the most energy-efficient, lowest total cost of ownership (TCO) solutions. In many cases, online services do not fully use the available performance; hence, it makes sense to give more weight to price and power. It is important to remember that power affects not only energy consumption costs, but also data center capital allocation costs.

Standards

To achieve sustainability while utilizing the various architectures described in the section titled "Infrastructure Architectures", starting on page 80, a best practice is to adopt standards that allow the various technologies not only to interoperate, but also to keep the business from becoming "locked in" to a particular vendor or strategy.

Focusing on cloud computing architectures, this technology is an approach to delivering IT services that promises high agility and lower costs for consumers, especially up-front costs. The approach impacts not only the way computing is used, but also the technology and processes used to construct and manage IT within enterprises and service providers. Coupled with the opportunities and promise of cloud computing are elements of risk and management complexity. Adopters of cloud computing should consider asking questions such as:
• How do I integrate computer, network, and storage services from one or more cloud service providers into my business and IT processes?
• How do I manage security and business continuity risk across several cloud providers?
• How do I manage the lifecycle of a service in a distributed multiple-provider environment in order to satisfy service-level agreements (SLAs) with my customers?
• How do I maintain effective governance and audit processes across integrated datacenters and cloud providers?
• How do I adopt or switch to a new cloud provider?

The definitions of cloud computing, including private and public clouds, Infrastructure as a Service (IaaS), and Platform as a Service (PaaS) are taken from work by the National Institute of Standards and Technology (NIST). In part, NIST defines cloud computing as “a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (for example, networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.”

NIST defines four cloud deployment models:
• Public clouds (cloud infrastructure made available to the general public or a large industry group)
• Private clouds (cloud infrastructure operated solely for an organization)
• Community clouds (cloud infrastructure shared by several organizations)
• Hybrid clouds (cloud infrastructure that combines two or more clouds)

An Open Cloud Standards Incubator project is under way that encompasses all of the deployment models defined above. The focus of the project is the management aspects of IaaS, with some work involving PaaS. These aspects include service-level agreements (SLAs), quality of service (QoS), workload portability, automated provisioning, and accounting and billing.

The fundamental IaaS capability made available to cloud consumers is a cloud service. Examples of services are computing systems, storage capacity, and networks that meet specified security and performance constraints. Examples of consumers of cloud services are enterprise datacenters, small businesses, and other clouds.

Many existing and emerging standards will be important in cloud computing. Some of these, such as security-related standards, apply generally to distributed computing environments.

Others apply directly to virtualization technologies that are expected to be important building blocks in cloud implementations.

The dynamic infrastructure enabled by virtualization technologies aligns well with the dynamic on-demand nature of clouds. Examples of standards include SLA management and compliance, federated identities and authentication, and cloud interoperability and portability.

Best Practice - Use standard interfaces to Cloud Architectures

There are multiple competing proposals for interfaces to clouds and, given the embryonic stage of the industry, it is important for users to insist that cloud providers use standard interfaces to provide flexibility for future extensions and to avoid becoming locked into a vendor. With the backing of key players in the industry, this aspect of portability is a primary value that standards-based cloud infrastructure offers. Three scenarios, listed below, show how cloud consumers and providers may interact using interoperable cloud standards. These scenarios are examples only; many more possibilities exist.

1. Building flexibility to do business with a new provider without excessive effort or cost
2. Ways that multiple cloud providers may work together to meet the needs of a consumer of cloud services
3. How different consumers with different needs can enter into different contractual arrangements with a cloud provider for data storage services

As previously discussed, many standards bodies are rallying to generate a common standard allowing varying cloud offerings to interoperate and federate. The DMTF for example, is working with affiliated industry organizations such as the Open Grid Forum, Cloud Security Alliance, TeleManagement Forum (TMF), Storage Networking Industry Association (SNIA), and National Institute of Standards and Technology (NIST). The DMTF has also established formal synergistic relationships with other standards bodies. The intent of these alliance partnerships is to provide mutual benefit to the organizations and the related standards bodies.

Alliances play an important role in helping the cloud community to provide a unified view of management initiatives. For example, SNIA has produced an interface specification for cloud storage. The Open Cloud Standards Incubator will not only leverage that work but also collaborate with SNIA to ensure consistent standards. The Incubator expects to leverage existing DMTF standards including Open Virtualization Format (OVF), Common Information Model (CIM), CMDB Federation (CMDBf), CIM Simplified Policy Language (CIM-SPL), and the DMTF's virtualization profiles, as well as standards from affiliated industry groups.

The ultimate goal of the Open Cloud Standards is to enable portability and interoperability between private clouds within enterprises and hosted or public cloud service providers. A first step has been taken with the development of use cases, a service lifecycle, and a reference architecture. This is still a work in progress, but in order for any business to utilize cloud architectures and leverage them to achieve efficiency and sustainability, an interface standard is required.

Security

With respect to sustainability and new technologies such as Cloud Computing, moving to a new business model such as the cloud offers economies of scale and flexibility that are both good and bad from a security point of view. The massive concentrations of resources and data present a more attractive target to attackers, but cloud-based defenses can be more robust, scalable, and cost-effective. For a more detailed discussion of Cloud security, please refer to the Proven Professional article titled "How to Trust the Cloud – "Be Careful up There"".

The new cloud economic/sustainability model has also driven technical change in terms of:

Scale: Commoditization and the drive towards economic sustainability and efficiency have led to massive concentrations of the hardware resources required to provide services. This encourages economies of scale for all the kinds of resources required to provide computing services.

Architecture: Optimal resource use demands computing resources that are abstracted from underlying hardware. Unrelated customers who share hardware and software resources rely on logical isolation mechanisms to protect their data. Computing, content storage and processing are massively distributed. Global markets for commodities demand edge distribution networks where content is delivered and received as close to customers as possible. This tendency towards global distribution and redundancy means resources are usually managed in bulk, both physically and logically.

138 2010 EMC Proven Professional Knowledge Sharing

Given the reduced cost and flexibility it brings, a migration to cloud computing is compelling for many SMEs. However, there can be concerns for SMEs migrating to the cloud (also see "Best Practice - Assess cloud storage migration costs upfront" on page 94 for additional information), including the confidentiality of their information and liability for incidents involving the infrastructure.

Following are some best practices for managing trust in public and private clouds:

Best Practice – Determine if cloud vendors can deliver on their security claims

Because information security is only as strong as its weakest link, it is essential for organizations to evaluate the quality of their cloud vendors. Having a high-profile “brand name” vendor and an explicit SLA is not enough. Organizations must aggressively verify whether cloud vendors can deliver upon and validate their security claims. Enterprises must make a firm commitment that they will protect the information assets outside their corporate IT environment to at least the same high standard of security that would apply if those same information assets were preserved in-house. In fact, because these assets are stored outside the organization, it could be argued that the standard for protection should be even higher. Security practitioners must be particularly diligent in assessing the security profiles of those cloud vendors entrusted with highly sensitive data or mission-critical functions.

Best Practice - Adopt federated identity policies backed by strong authentication practices

A federated identity allows a user to access various web sites, enterprise applications and cloud services using a single sign-on. Federated identities are made possible when organizations agree to honor each other's trust relationships, not only in terms of access but also in terms of entitlements. Establishing "ties of federation" agreements between parties to share a set of policies governing user identities, authentication and authorization, provides users with a more convenient and secure way of accessing, using and moving between services, whether those services reside in the enterprise or in a cloud. Federated identity policies go hand-in-hand with strong authentication policies. Whereas federation policies bridge the trust gap between members of the federation, strong authentication policies bridge the security gap, creating the secure access infrastructure to bring all members of the community together.


The federation of identity and authentication policies will eventually become standard practice in the cloud, not just because users will demand it but as a matter of convenience. For organizations, federation also delivers cost benefits and improved security. Companies can centralize the access and authentication systems maintained by separate business units. They can reduce potential points of threat, such as unsafe password management practices, as users will no longer have to enter credentials and passwords in multiple places. For federated identity policies to become more widely used, the information technology and security industry will have to knock down barriers to implementing such policies. So far, it appears the barriers are not economic or technological, but trust-related. Federated identity models, like the strong authentication services that enforce them, are only as strong as their weakest link. Each member of the federation must be trusted to comply with the group’s security policies.

Expanding the circle of trust means expanding the threat surface where problems could arise and increasing the potential for single points of failure in the community of trust. The best way of ensuring that trust and security are preserved within communities of federation is to require all community members to enforce a uniform, acceptable level of strong authentication. Some IT industry initiatives are attempting to establish security standards that facilitate federated identities and authentication. For instance, the OASIS Security Services Technical Committee has developed the Security Assertion Markup Language (SAML), an XML-based standard for exchanging authentication and authorization data between security domains, to facilitate web browser single sign-on. SAML appears to be evolving into the definitive standard for enterprises deploying web single sign-on solutions.

Best Practice – Preserve segregation of administrator duties

While data isolation and preventing data leakage are essential, enterprise systems administrators still need appropriate levels of access to manage and configure their company's applications within the shared infrastructure. Furthermore, in addition to systems administrators and network administrators, private clouds introduce a new function into the circle of trust: the cloud administrator. Cloud administrators, the IT professionals working for the cloud provider, need sufficient access to an enterprise's virtual facilities to optimize cloud performance while being prevented from tapping into the proprietary information they are hosting on behalf of their tenants. Enterprises running private clouds on hosted servers should consider requiring that their data center operator disable all local administration of hypervisors, using a central management application instead to better monitor and reduce risks of unauthorized administrator access.

As an added security measure, enterprises should preserve a separation of administrator duties in the cloud. The temptation may be to consolidate duties, as many functions can be centrally administered from the cloud using virtualization management software. However, as with physical IT environments, in which servers, networks and security functions are split among several administrators or departments, segregating those functions within the cloud can provide added security by decentralizing control. Furthermore, organizations can use centralized virtualization management capabilities to limit administrative access, define roles and appropriately assign privileges to individual administrators. By segregating administrator duties and employing a centralized virtualization management console, organizations can safeguard their private clouds from unauthorized administrator access.

Best Practice - Set clear security policies

Set clear policies to define trust and be equipped to enforce them. In a private cloud, trust relationships are defined and controlled by the organization using the cloud. While every party in the trust relationship will naturally protect information covered by government privacy and compliance regulations (employee tax ID numbers, proprietary financial data, and so on), organizations will also need to set policies for how other types of proprietary data are shared in the cloud. For instance, a corporation may classify information such as purchase orders or customer transaction histories as highly sensitive, even as trade secrets, and may establish risk-based policies for how cloud providers and business partners store, handle and access that data outside the enterprise. For trust relationships to work, there must be clear, agreed-upon policies for what information is privileged, how that data is managed and how cloud providers will report and validate their performance in enforcing the standards set by the organization. These agreed-upon standards must be enforced by binding service level agreements (SLAs) that clearly stipulate the consequences of security breaches and service agreement violations.

Best Practice - Employ data encryption and tokenization

Cloud providers sometimes store enterprise data used in cloud applications, for example in online backups. Encrypting data is often the simplest way to protect proprietary information against unauthorized access, particularly by administrators and other parties within the cloud.

Organizations should encrypt data residing with or accessible to cloud providers. As in traditional enterprise IT environments, organizations should encrypt data in applications at the point of ingest. Additionally, they should ensure cloud vendors support data encryption controls that secure every layer of the IT stack. Segregate sensitive data from the users or identities they are associated with as an additional precaution to secure data residing in clouds. For instance, companies storing credit card data often keep credit card numbers in separate databases from where cardholders’ personal data is stored, reducing the likelihood that security breaches will result in fraudulent purchases. Companies also can protect sensitive cardholder information in the cloud through a form of data masking called tokenization. This method of securing data replaces the original number with a token value that has no explicit relationship to the original value. The original card number is kept in a separate, secure database called a vault.
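As a rough illustration of the tokenization idea described above, the sketch below swaps a card number for a random token and keeps the original only in a separate vault. It is a conceptual example, not a PCI-compliant implementation, and the class and method names are invented for illustration.

import secrets

class TokenVault:
    """Conceptual tokenization sketch: sensitive values are replaced by random
    tokens with no mathematical relationship to the originals, and the
    originals are held only in a separate, secured store (the "vault")."""

    def __init__(self):
        self._vault = {}                  # token -> original value

    def tokenize(self, card_number: str) -> str:
        token = secrets.token_hex(8)      # random; carries no card data
        self._vault[token] = card_number
        return token

    def detokenize(self, token: str) -> str:
        return self._vault[token]

# Usage: the application stores and passes around only the token.
vault = TokenVault()
token = vault.tokenize("4111111111111111")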

Best Practice - Manage policies for provisioning virtual machines

To secure their virtual infrastructure, companies using private clouds must be able to oversee how virtual machines are provisioned and managed within their clouds. In particular, managing virtual machine identities is crucial, as they are used for basic administrative functions, such as identifying the systems and people with which virtual machines are physically associated, and moving software to new host servers. Organizations establishing a security position based on virtual machine identities should know how those identities are created, validated and verified, and what safety measures their cloud vendors have taken to safeguard those identities. Additionally, information security leaders should set their identity access and management policies to grant all users, whether human or machine, the lowest level of access needed for each to perform their authorized functions within the cloud.
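A minimal sketch of the least-privilege principle follows. The roles and privilege names are hypothetical and would map onto whatever role model the virtualization management console actually provides.

from enum import Enum

class Privilege(Enum):
    VIEW = 1
    OPERATE = 2      # power on/off, migrate
    CONFIGURE = 3    # change VM settings
    PROVISION = 4    # create/destroy VMs

# Hypothetical role table: each role gets only the lowest privileges it needs.
ROLE_PRIVILEGES = {
    "auditor":      {Privilege.VIEW},
    "vm_operator":  {Privilege.VIEW, Privilege.OPERATE},
    "cloud_admin":  {Privilege.VIEW, Privilege.OPERATE, Privilege.CONFIGURE},
    "provisioning": {Privilege.VIEW, Privilege.PROVISION},
}

def is_allowed(role: str, privilege: Privilege) -> bool:
    """Return True only if the role has been explicitly granted the privilege."""
    return privilege in ROLE_PRIVILEGES.get(role, set())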

Best Practice – Require transparency into cloud operations to ensure multi-tenancy and data isolation

In the virtualized environment of the cloud, many different companies, or “tenants,” may share the same physical computing, storage and network infrastructure. Cloud providers need to ensure isolation of access so that software, data and services can be safely partitioned within the cloud and that tenants sharing physical facilities cannot tap into their neighbors’ proprietary information and applications.

The best way to ensure secure data isolation and multi-tenancy is to partition access to appropriate cloud resources for all tenants. Cloud vendors should furnish log files and reports of user activities. Some cloud vendors are able to provide an even higher degree of visibility through applications that allow enterprise IT administrators to monitor the data traversing their virtual networks and to view events within the cloud in near real time. Specific performance metrics should be written into managed service agreements and enforced with financial consequences if those agreed-upon performance conditions are not upheld.

Organizations and businesses with private clouds should work with cloud vendors to ensure transferability of security controls. In other words, if data or virtual resources are moved to another server or to a backup data center, the security policies established for the original server or primary data center should automatically be implemented in the new locations.

Governance

The ability to govern and measure enterprise risk within a company-owned data center is difficult and, perhaps surprisingly, still appears to be in the early stages of maturity in most organizations. Cloud computing brings new unknowns to governance and enterprise risk.

Online agreements and contracts of this type for the most part are still untested in a court of law and consumers have yet to experience an extended outage of services that they may someday determine to need on a 24/7 basis. Questions still remain about the ability of user organizations to assess the risk of the provider through onsite assessments.

The storage and use of information considered sensitive by nature may be allowed, but it could be unclear as to who is responsible in the event of a breach. If both the code authored by the user and the service delivered by the provider are flawed, who is responsible? Current statutes cover the majority of the United States but how are the laws of foreign countries, especially the European Union, to be interpreted in the event of disputes? Many questions remain with respect to Cloud Governance and Enterprise Risk.

Best Practices – Do your due diligence on your SLAs

Cloud consumers considering using Cloud Services should perform in-depth due diligence prior to the execution of any service “Terms of Service,” “Service Level Agreements” (SLAs), or use. This due diligence should assess the risks known at present and the ability of partners to work within, and contribute to, the customer’s enterprise risk management program for the length of the engagement. Some recommendations include:


1. Consider creating a Private (Virtual) Cloud or a Hybrid Cloud that provides the appropriate level of controls while maintaining risk at an acceptable level.
2. Review what type of provider you prefer, such as software, infrastructure or platform. Gain clarity on how pricing is performed with respect to bandwidth and CPU utilization in a shared environment. Compare usage as measured by the cloud service provider with your own log data, to ensure accuracy.
3. Request clear documentation on how the facility and services are assessed for risk and audited for control weaknesses, the frequency of assessments and how control weaknesses are mitigated in a timely manner. Ask the service provider if they make the results of risk assessments available to their customers.
4. Require the definition of what the provider considers critical success factors, key performance indicators, and how they measure them relative to IT Service Management (Service Support and Service Delivery).
5. Require a listing of all provider third party vendors, their third party vendors, their roles and responsibilities to the provider, and their interfaces to your services.
6. Request divulgence of incident response, recovery, and resiliency procedures for any and all sites and associated services.
7. Request a review of all documented policies, procedures and processes associated with the site and associated services, assessing the level of risk associated with the service.
8. Require the provider to deliver a comprehensive list of the regulations and statutes that govern the site and associated services, and how compliance with these items is executed.
9. Perform full contract or terms of use due diligence to determine roles, responsibilities, and accountability. Ensure legal counsel review, including an assessment of the enforceability of local contract provisions and laws in foreign or out-of-state jurisdictions.
10. Determine whether due diligence requirements encompass all material aspects of the cloud provider relationship, such as the provider’s financial condition, reputation (e.g., reference checks), controls, key personnel, disaster recovery plans and tests, insurance, communications capabilities and use of subcontractors.

Request a scope of services including:
• Performance standards
• Rapid provisioning – de-provisioning
• Methods of multi-tenancy and resource sharing
• Pricing
• Controls
• Financial and control reporting
• Right to audit
• Ownership of data and programs
• Procedures to address a Legal Hold
• Confidentiality and security
• Regulatory compliance
• Indemnification
• Limitation of liability
• Dispute resolution
• Contract duration
• Restrictions on, or prior approval for, subcontractors
• Termination and assignment, including timely return of data in a machine-readable format
• Insurance coverage
• Prevailing jurisdiction (where applicable)
• Choice of Law (foreign outsourcing arrangements)
• Regulatory access to data and information necessary for supervision
• Business Continuity Planning

Consumers, Businesses, Cloud Service Providers, and Information Security and Assurance professionals must collaborate to focus on the potential issues and solutions listed above, and to discover the holes. The Cloud Security Alliance (CSA), one of the standards bodies outlined in the section titled “Standards”, starting on page 135, calls for collaboration in setting standard terms and requirements that drive governance and enterprise risk issues to a mature and acceptable state allowing for negotiation. The CSA is working to address these issues so businesses can take full advantage of the nimbleness, expansive service options, flexible pricing and cost savings of Cloud Services to achieve a sustainable IT solution.

Compliance

With cloud computing resources as a viable and cost effective means to outsource entire systems and increase sustainability, maintaining compliance with your security policy and the various regulatory and legislative requirements to which your company has adhered can become even more difficult to demonstrate. The cost of auditing compliance is likely to increase without proper planning. With that in mind, it is imperative to consider all of your requirements and options prior to progressing with cloud computing plans [6].

Best Practice - Know Your Legal Obligations

Your organization must fully understand all of the necessary legal requirements. The regulatory landscape is typically dictated by the industry in which you reside. Depending on where your organization operates, you are likely subject to a lengthy collection of legislation that governs how you treat specific types of data, and it is your obligation to understand it and remain compliant. Without understanding its obligations, an organization cannot formulate its data processing requirements. It is a best practice to engage internal auditors, external auditors, and legal counsel to ensure that nothing is left out.

Best Practice - Classify / Label your Data & Systems

Your company must classify data to adequately protect it. Considering the regulatory and legislative requirements discussed earlier, your organization needs to classify its data to isolate the data that requires the most stringent protection from public or otherwise less sensitive data. The data and systems must also be clearly labeled and the processes surrounding the handling of the data formalized. At this point, your organization can consider cloud-computing resources for data and systems not classified at a level that would be subject to burdensome regulatory requirements.

Best Practice - External Risk Assessment

A third party risk assessment of the systems and data being considered for cloud resources should be conducted to ensure all risks are identified and accounted for. This includes a Privacy Impact Assessment (PIA) as well as other typical Threat Risk Assessments (TRA). Impacts to other internal systems, such as Data Leakage Protection (DLP) systems, should also be considered. Be prepared to discover extensive risks with costly remediation strategies in order to consider cloud computing for regulated data.

Best Practice - Do Your Diligence / External Reports

At a minimum, you need to understand the security of the organization hosting your cloud computing resources and what they are prepared to offer. If you have very stringent security requirements, you may want to mandate that your cloud provider be certified to ISO/IEC 27001:2005 annually. It is also likely that your organization will need to improve its processes and operational security maturity to manage your cloud provider to that level of security. It is important to utilize the risk assessment and data classification exercises previously mentioned to provide the amount of security required to ensure the appropriate confidentiality, integrity and availability of your data and systems without overspending. If ISO/IEC 27001:2005 certification is too costly or not available within the class of service you seek, the assurance statement most likely to be available is the Statement on Auditing Standards (SAS) 70 Type II. Work these requirements into the contract requirements and ensure that you see a previous certificate of compliance prior to formalizing an agreement.

Similarly, the business should demand the results of external security scans and penetration tests on a regular basis due to the unique attack surfaces associated with cloud computing. The value of certifications such as ISO/IEC 27001:2005 or audit statements like SAS 70 is the source of significant debate among security professionals. Skeptics will point out that, through the scoping process, an organization can exclude critical systems and processes from scrutiny and present an unrealistic picture of organizational security. This is a legitimate issue, and our recommendation is that domain experts develop standards relating to scoping these and other certifications, so that over time, the customer will expect broad scoping. Customers must demand an ISO certification based upon a comprehensive security program. In the end, this will benefit the cloud provider as well, as a certifiably robust security program will pay for itself in reduced requests for audit.

Best Practice - Understand Where the Data Will Be!

If your company is considering using cloud-computing resources for regulated data, it is imperative to understand where the data will be processed and stored under any and all situations. Of course, this task is far from simple for all parties, including cloud-computing providers. However, with respect to legislative compliance surrounding where data can and cannot be transmitted or stored, the cloud computing provider will need to be able to demonstrate assurance that the data will be where they say it is and only there. This applies to third parties and other outsourcers used by the cloud computing provider. If the provider has reciprocal arrangements or other types of potential outsourcing of the resources, strict attention to how this data is managed, handled, and located must extend to that third party arrangement.

If the potential provider you have engaged cannot do this, investigate others. As this requirement becomes more prevalent, the option is likely to become available. Remember, if that assurance cannot be provided, some of your data and processing cannot use public cloud computing resources as defined in Domain 1 without exception. Private clouds may be the appropriate option in this case.

Best Practice - Track your applications to achieve compliance

To manage an application effectively, you have to know where it is. Establish a "chain of custody" that enables you to see where applications are running and manage them against any legal concerns. The chain of custody includes identifying the machine the application is installed on, what data is associated with that application, who is in control of the machine, and what controls are in place.
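One way to picture such a chain of custody is as an append-only list of records capturing the elements listed above. The sketch below is illustrative only, and the field names are assumptions rather than any particular product's schema.

from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class CustodyRecord:
    """One link in an application's chain of custody (illustrative fields)."""
    application: str
    host_machine: str            # where the application is installed
    data_assets: list            # data associated with the application
    responsible_party: str       # who controls the machine
    controls: list               # security controls in place
    recorded_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

# The chain of custody is the ordered history of such records, with a new
# record appended every time the application or its data moves.
chain = [
    CustodyRecord("billing-app", "esx-host-07", ["invoices-db"],
                  "ops-team-emea", ["disk encryption", "role-based access"]),
]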

With server virtualization, applications move among different machines, and without careful control over the chain of custody, you can expose an application or the data to circumstances where a high-security application may be shifted into a low-security environment. Before you change anything in the environment, consider whether the change will create unauthorized access to the application or related data.

Best Practice - With off-site hosting, keep your assets separate

If a third party controls or hosts one of your servers, keeping your operating assets separate from those of the host's other customers is critical to avoid potential liability for security exposures, including improper access. For hosted applications, you also need to ensure that settings for one application cannot drift or migrate into the control of another, so no other host customers can access your data.

To do this, you need to evaluate how the host distributes and controls applications and data stored in its server array. Depending on the configurations of the hosts and client machines, settings and programmatic adjustments can trickle down and install in an unexpected manner.

This is why you need to make sure that appropriate security controls are in place. You do not want unexpected updates or configuration controls to gain control over your data or application versions. Make sure your contract with the hosting company details the technical specifications that protect your data and users, and that the hosting company provides testing and monitoring reports that show compliance with your controls.

Best Practice - Protect yourself against power disruptions

Any CIO overseeing a data center knows that power outages are a common occurrence. The reason is simple: the power to run and cool a data center is more and more vulnerable. A 2006 AFCOM survey reported that 82.5% of data center outages in a five-year period were power-related.

If your data center has experienced power-related business interruptions, consider drafting contract terms for your own customers that protect you from liability if the power supply to your facilities is disrupted or lost. You may want more than general "acts of God" clauses in your customer-facing agreements.

If you are considering a shift to a hosted extension of your data center, you need to understand your hosted site's power supply and capabilities. Make sure your contract precisely defines those capabilities and allocates the risks for any service disruptions that occur. Account for this in your own customer contracts as well. Draft them carefully to ensure that power disruptions to your suppliers do not expose you to liability that you would avoid if your data center were in-house.

Best Practice - Ensure vendor cooperation in legal matters

What happens when virtualization and compliance collide and the matter ends up in court? When a legal collision between virtualization and e-discovery occurs, such as if a third-party host was unable to produce documents a business needs for a legal action, a service provider can be a significant risk variable.

To avoid this scenario, it is best practice that you obtain the provider’s commitment to cooperate in legal matters. This must be done contractually with the third-party data custodian.

In conclusion, virtualizing any aspect of your data center changes the game for compliance and e-discovery. It is best practice to make sure you know exactly where your applications are running, that your server controls are intact, and that your service provider contract provisions are "virtualization-friendly." There are benefits of a virtual data center or cloud. It is wise to address the issues above and not worry about whether your compliance controls are falling out of the cloud.

Profitability

The business of sustainability in Information Technology is the catalyst for sustainable and profitable growth. To put it another way, profitability and sustainability go hand in hand.

There is a new definition of profitability that has been the mantra of Sustainability for some time. It is called the “Triple Bottom Line.” The three bottom lines comprise the social, environmental, and economic extents, and you should align these extents to profitability. A more detailed definition of the extents is shown in Table 3 - Extent of Sustainability to achieve profitability, below. In principle, the basic idea is as simple as it is compelling: resources may only be used at a rate at which they can be replenished naturally. It is obvious that the way in which the industrialized world operates today is not sustainable and that change is imperative.

Table 3 - Extent of Sustainability to achieve profitability

Social
• Labor, health, and safety: Address occupational health, safety, working conditions, and so on
• Human rights and diversity: Ensure compliance with human rights and organizational diversity
• Product safety: Ensure consumer safety
• Retention and qualification: Attract, foster, and retain top talent by fostering a “green” profile

Environmental
• Energy optimization: Manage energy costs via planning, risk management, and process improvements
• Water optimization: Ensure sustainable and cost-effective water supply
• Raw materials optimization: Control raw material–related costs and manage price volatility
• Air and climate change: Reduce or account for greenhouse gas emissions
• Sewage: Manage sewage emissions and impact on water supply
• Land pollution: Avoid or reduce land pollution
• Waste: Manage waste in a sustainable way
• Sustainable product life cycle: Sustainably develop new products and manage life cycle

Economic
• Sustainability performance management: Provide key performance indicators to manage sustainability efforts
• Sustainable business opportunity: Enable new goods and services for customers
• Emission trading: Ensure financial optimization (cap and trade)
• Reporting: Comply with external demands for adequate reporting and disclosures

Sustainability is very relevant not only at times of growth, but specifically during times of economic challenge. The main drivers of sustainability do not change for the following reasons:

• Regulation will continue to increase. That is specifically true in the case of carbon emissions, but will likely include many other environmental and social aspects in the future.
• Energy prices will continue to fluctuate and, with economic recovery, rise sharply and increase cost pressure.
• Consumer awareness will continue to intensify and force transparency and optimization across entire business networks and supply chains.

Business and Profit objectives to achieve Sustainability

Looking at the big picture, the new sustainability model is all about being environmentally friendly and making money. This sounds great to CEOs thinking about sustainability, but what are companies actually doing to achieve this? The backbone of most programs is based on the best Business Practices consisting of Awareness and Transparency, Efficiency Improvements, Innovation and Mitigation.

Best Practice - Consumer Awareness and Transparency

Consumer Awareness and Transparency communicates the value of your sustainability initiative and is key to building brand equity. Transparency offers accountability to the program and avoids “green-washing.” In addition to getting out the word, awareness programs are often promoted as educational, providing a series of sustainability best practices to improve industry at large. Whether you view this through a lens of being altruistic or self-serving, the net result promotes and advances sustainable practices.

Best Practice – Implement Efficiency Improvement

Efficiency improvement, how to do more with less, is a central theme in most sustainability programs. Efficiencies improve products or processes, typically without making major changes to the underlying product or technology. Modifying engine design to be 20% more efficient is an example of product efficiency, whereas redesigning packaging to reduce waste, or transporting components and finished products more efficiently, are examples of process efficiency. The effects of efficiencies are additive, each contributing to the sustainability goals of the company, driving the bottom line and creating potential for increased brand value.

Best Practice - Product Innovation

Product Innovation is often more challenging than efficiency enhancements because it results in fundamental changes to products and processes. Innovation tends to have a higher barrier to entry than efficiency programs, requiring ideas that challenge the status quo and require significant R&D and marketing investments. The risks of failure for both product development and ultimately customer acceptance are higher for innovations, but so too are the potential rewards. Developments of thin module photovoltaic (TMPV) solar cells and algae-based bio-diesel, both with potential to significantly change the economics of renewable energy, are examples of innovation.

Best Practice - Carbon Mitigation

Carbon Mitigation offsets greenhouse gas (GHG) emissions through projects that remove carbon from the atmosphere. The Kyoto Protocol’s cap and trade mechanism created the framework for trading carbon allowances as a way for companies to meet mandatory GHG emissions targets. It also paved the way for a voluntary carbon offset market targeted to companies without mandatory requirements or those seeking to be carbon-neutral. Carbon-neutral status is also becoming popular for individuals, with a number of sites and affinity credit cards catering to this desire.

Information Technology Sector Initiatives

IT has become pervasive across all sectors and although invisible in many ways, it forms a service backbone for almost all products. People rarely think about it, but computers and communications are invoked for every cell phone call, every online purchase, every item shipped by a courier, every Google search and every invoice processed. In short, everything in the modern economy has an associated IT carbon footprint.


Network and Server Infrastructure

Until recently, computing and communications were all about capacity and speed, with little thought to energy requirements. Following Moore’s Law, computing power and speed have grown consistently, to the point where, in some markets, it can cost more to power a server than to purchase it. In response, innovation has turned to designing low power chips that deliver high performance without the energy penalty. Network and server infrastructure manufacturers are focusing on reducing energy, space and cooling requirements with a new breed of high-density, high-capacity platforms using state of the art energy-efficient chipsets and components, not to mention the Cloud and the new paradigms that arise.

The proposition is fundamentally ROI based and is especially attractive to businesses that have hit power, size, or cooling barriers in their existing installations. As described in the previous sections, the technologies outlined below will go a long way toward achieving sustainability.

Best Practice – Virtualization

When analyzing efficiency improvements, one option is to eliminate a facility altogether. Virtualization favors consolidating many distributed datacenters into a specially designed, centralized “Cloud” facility. An example of this is Google’s advanced data center facility, affectionately known as a Googleplex, which is alleged to be the most efficient and economical datacenter. While some argue the Googleplex is search specific, the concept of achieving Google economies of scale for applications across the board holds merit.

Best Practice - Recycling e-Waste

Equipment vendors are increasing e-waste collection and recycling in efforts to reduce heavy metal and toxin levels in local landfills. Companies such as HP have long-standing e-cycling programs as part of their cradle-to-grave sustainability programs.

Cloud Profitability and Economics

Cloud Computing, the long-held dream of computing as a utility, has the potential to transform a large part of the IT industry, making software even more attractive as a service and shaping the way IT hardware is designed and purchased. Developers with innovative ideas for new Internet services no longer require large capital outlays in hardware to deploy their service or the human expense to operate it. They need not be concerned about over provisioning for a service whose popularity does not meet their predictions, therefore wasting costly resources, or under provisioning for one that becomes wildly popular, therefore missing potential customers and revenue. Moreover, companies with large batch-oriented tasks can get results as quickly as their programs can scale, since using 1000 servers for one hour costs no more than using one server for 1000 hours. This elasticity of resources, without paying a premium for large scale, is unprecedented in the history of IT.

Cloud Computing refers to both the applications delivered as services over the Internet and the hardware and systems software in the datacenters that provide those services. The services themselves have long been referred to as Software as a Service (SaaS). The datacenter hardware and software is what we call a Cloud. When a Cloud is made available in a pay-as-you-go manner to the public, we call it a Public Cloud; the service being sold is Utility Computing. We use the term Private Cloud to refer to internal datacenters of a business or other organization, not made available to the public. Therefore, Cloud Computing is the sum of SaaS and Utility Computing, but does not include Private Clouds. People can be users or providers of SaaS, or users or providers of Utility Computing. We focus on SaaS Providers (Cloud Users) and Cloud Providers, who have received less attention than SaaS Users. From a hardware point of view, there are three aspects to Cloud Computing.

1. The illusion of infinite computing resources available on demand, thereby eliminating the need for Cloud Computing users to plan for provisioning.
2. The elimination of an up-front commitment by Cloud users, thereby allowing companies to start small and increase hardware resources only when their needs increase.
3. The ability to pay for use of computing resources on a short-term basis as needed (e.g., processors by the hour and storage by the day), and release them as needed, thereby rewarding conservation by letting machines and storage go when they are no longer useful.

You can argue that the construction and operation of extremely large-scale, commodity-computer datacenters at low-cost locations was the key enabler of Cloud Computing. These datacenters uncovered decreases in the cost of electricity, network bandwidth, operations, software, and hardware by factors of 5 to 7 at these very large economies of scale. These factors, combined with statistical multiplexing to increase utilization compared to a private cloud, meant that cloud computing could offer services below the costs of a medium-sized datacenter and yet still make a profit.

Any application needs a model of computation, a model of storage, and a model of communication. The statistical multiplexing necessary to achieve elasticity and the illusion of infinite capacity requires each of these resources to be virtualized to hide the implementation of how they are multiplexed and shared.

One view is that different utility-computing offerings will be distinguished based on the level of abstraction presented to the programmer and the level of management of the resources. Amazon EC2 is at one end of the spectrum. An EC2 instance looks much like physical hardware, and users can control nearly the entire software stack, from the kernel upwards. This low level makes it inherently difficult for Amazon to offer automatic scalability and failover, because the semantics associated with replication and other state management issues are highly application-dependent.

At the other extreme of the spectrum are application domain specific platforms, such as Google AppEngine. AppEngine is targeted exclusively to traditional web applications, enforcing an application structure of clean separation between a stateless computation tier and a stateful storage tier. AppEngine’s impressive automatic scaling and high-availability mechanisms, and the proprietary MegaStore data storage available to AppEngine applications, all rely on these constraints. Applications for Microsoft’s Azure are written using the .NET libraries, and compiled to the Common Language Runtime, a language-independent managed environment. Therefore, Azure is intermediate between application frameworks like AppEngine and hardware virtual machines like EC2.

From a business and profitability perspective, when is Utility Computing preferable to running a Private Cloud?

Case 1: Demand for a service varies with time

Provisioning a data center for the peak load it must sustain a few days per month leads to underutilization at other times. Instead, Cloud Computing lets an organization pay by the hour for computing resources, potentially leading to cost savings even if the hourly rate to rent a machine from a cloud provider is higher than the rate to own one.


Case 2: Demand is unknown in advance

For example, a web startup will need to support a spike in demand when it becomes popular, followed potentially by a reduction once some visitors turn away.

Case 3: Batch Processing

Organizations that perform batch analytics can use the “cost associativity” of cloud computing to finish computations faster: using 1000 EC2 machines for 1 hour costs the same as using 1 machine for 1000 hours.

For the first case of a web business with varying demand over time and revenue proportional to user hours, the tradeoff is shown in Equation 9 – Cloud Computing - Cost Advantage, below. Cloud Computing is more profitable when the following is true:

Equation 9 – Cloud Computing - Cost Advantage

Profit From Using Cloud Computing ≥ Profit From Using a Fixed Capacity Data Center

Equation 10 – Cloud Computing - Cost tradeoff for demand that varies over time

UserHours_Cloud × (Net Revenue − Cost_Cloud) ≥ UserHours_Datacenter × (Net Revenue − Cost_Datacenter / Utilization)

In Equation 10 – Cloud Computing - Cost tradeoff for demand that varies over time, above, the left-hand side multiplies the net revenue per user-hour by the number of user-hours, giving the expected profit from using Cloud Computing. The right-hand side performs the same calculation for a fixed-capacity datacenter by factoring in the average utilization, including nonpeak workloads. Whichever side is greater represents the opportunity for higher profit.
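The sketch below evaluates Equation 10 for a set of purely hypothetical figures, to show how the utilization term penalizes the fixed-capacity datacenter.

def cloud_vs_datacenter_profit(user_hours_cloud, user_hours_dc,
                               net_revenue_per_hour,
                               cost_cloud_per_hour,
                               cost_dc_per_hour, utilization):
    """Evaluate both sides of Equation 10 (illustrative numbers only)."""
    profit_cloud = user_hours_cloud * (net_revenue_per_hour - cost_cloud_per_hour)
    profit_dc = user_hours_dc * (net_revenue_per_hour - cost_dc_per_hour / utilization)
    return profit_cloud, profit_dc

# Hypothetical figures: $1.00 net revenue per user-hour, cloud at $0.12/hour,
# owned capacity at $0.08/hour but only 40% utilized on average.
cloud, dc = cloud_vs_datacenter_profit(
    user_hours_cloud=10_000, user_hours_dc=10_000,
    net_revenue_per_hour=1.00, cost_cloud_per_hour=0.12,
    cost_dc_per_hour=0.08, utilization=0.40)
print(f"Cloud profit: ${cloud:,.0f}  vs  datacenter profit: ${dc:,.0f}")

With these assumed numbers, the cloud side wins even though its hourly rate is higher, because the owned capacity sits idle much of the time.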

Table 4 - Best Practices in Cloud Architectures, shown below, defines best practices to achieve growth of Cloud Computing Architectures. The first three best practices concern adoption and the next five deal with growth. The last two are marketing related. All best practices should aim at horizontal scalability of virtual machines over the efficiency of a single VM.

Table 4 - Best Practices in Cloud Architectures

Best Practice | Solution
Increase Availability of Service | Use multiple Cloud providers; use elasticity to prevent DDoS (Distributed Denial of Service) attacks
Mitigate Data Lock-In | Standardize APIs; compatible software to enable Surge Computing
Data Confidentiality and Auditability | Deploy encryption, VLANs, firewalls; geographical data storage
Reduce Data Transfer Bottlenecks | FedEx disks; data backup/archival; higher-bandwidth switches
Minimize Performance Unpredictability | Improved VM support; flash memory; gang-schedule VMs
Implement a Scalable Storage Solution | Data de-duplication, tiered storage and other scalable storage solutions
Reduce or Minimize Bugs in Large-Scale Distributed Systems | Implement a full-functioned fault isolation and root cause analysis capability as well as distributed VMs
Implement Architecture to Scale Quickly | Implement an auto-scaler that relies on machine learning; snapshots to encourage Cloud Computing conservationism
Reputation Fate Sharing | Offer reputation-guarding services like those for email
Implement Best-of-Breed Tiered Software Licensing | Pay-for-use licenses; bulk use sales, etc.

In addition:

1. Applications need to both scale down rapidly as well as scale up, which is a new requirement. Such software also needs a pay-for-use licensing model to match the needs of Cloud Computing.

2. Infrastructure Software needs to be aware that it is no longer running on bare metal, but on VMs. It needs to have billing built in from the beginning.

3. Hardware Systems should be designed at the scale of a container (at least a dozen racks), which will be the minimum purchase size. Cost of operation will match performance and cost of purchase in importance, rewarding energy proportionality, such as by putting idle portions of the memory, disk, and network into low power mode. Processors should work well with VMs, flash memory should be added to the memory hierarchy, and LAN switches and WAN routers must improve in bandwidth and cost.

Cloud Computing Economics

When deciding whether hosting a service in the cloud makes sense over the long term, you can argue that the fine-grained economic models enabled by Cloud Computing make tradeoff decisions more fluid, and in particular the elasticity offered by clouds serves to transfer risk. Although hardware resource costs continue to decline, they do so at variable rates. For example, computing and storage costs are falling faster than WAN costs. Cloud Computing can track these changes and potentially pass them through to the customer more effectively than building your own datacenter, resulting in a closer match of expenditure to actual resource usage.

In making the decision about whether to move an existing service to the cloud, you must examine the expected average and peak resource utilization, especially if the application may have highly variable spikes in resource demand; the practical limits on real-world utilization of purchased equipment; and operational costs that vary depending on the type of cloud environment being considered. See the section titled “Economics Pillar,” starting on page 162, for additional details on economic issues. After all, profitability and economics go hand in hand.

Best Practice – Consider Elasticity as part of the business decision metrics

Although the economic appeal of Cloud Computing and its variants is often described as “converting capital expenses to operating expenses” (CapEx to OpEx), the phrase “pay as you go” may more directly capture the economic benefit to the buyer or consumer. Hours purchased via Cloud Computing can be distributed non-uniformly in time (e.g., use 100 server-hours today and no server-hours tomorrow, and still pay only for what you use). In the networking community, this way of selling bandwidth is already known as usage-based pricing. In addition, the absence of up-front capital expense allows capital to be redirected to core business investment. Even though, as an example, Amazon’s pay-as-you-go pricing could be more expensive than buying and depreciating a comparable server over the same period, you can argue that the cost is outweighed by the extremely important Cloud Computing economic benefits of elasticity and transference of risk. This is especially true if the risks of over provisioning (underutilization) and under provisioning (saturation) are paramount.

Starting with elasticity, the key observation is that Cloud Computing’s ability to add or remove resources at a fine granularity (for example, one server at a time with EC2) and with a lead-time of minutes rather than weeks allows us to match resources to workload much more closely.

Real world estimates of server utilization in datacenters range from 5% to 20%. This may sound shockingly low, but it is consistent with the observation that for many services, the peak workload exceeds the average by factors of 2 to 10. Few users deliberately provision for less than the expected peak, and therefore they must provision for the peak and allow the resources to remain idle at nonpeak times. The more pronounced the variation, the more the waste.

A simple example demonstrates how elasticity reduces this waste and can more than compensate for the potentially higher cost per server-hour of paying as you go versus buying.

For example, regarding elasticity, assume a service has a predictable daily demand where the peak requires 500 servers at noon but the trough requires only 100 servers at midnight, as shown in Figure 28 – Provisioning for peak load, on page 160.


Figure 28 – Provisioning for peak load

As long as the average utilization over a whole day is 300 servers, the actual utilization over the whole day (shaded area under the curve) is 300 x 24 = 7200 server-hours; but since we must provision to the peak of 500 servers, we pay for 500 x 24 = 12000 server-hours, a factor of 1.7 more than what is needed. Therefore, as long as the pay-as-you-go cost per server-hour over 3 years is less than 1.7 times the cost of buying the server, you can save money using utility computing.
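The short calculation below simply reproduces the arithmetic of this example, so the 1.7 break-even factor can be recomputed for other peak and average loads.

# Illustrative recalculation of the provisioning example above.
peak_servers = 500
average_servers = 300
hours_per_day = 24

used_server_hours = average_servers * hours_per_day        # 7,200
provisioned_server_hours = peak_servers * hours_per_day    # 12,000
overprovision_factor = provisioned_server_hours / used_server_hours
print(f"Provisioned {overprovision_factor:.1f}x the server-hours actually used")
# Pay-as-you-go wins as long as its per-hour premium over owning stays below this factor.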

In fact, the example in Figure 28 – Provisioning for peak load, above, underestimates the benefits of elasticity. In addition to the simple diurnal (24-hour) pattern, most nontrivial services also experience seasonal or other periodic demand variations (e.g., e-commerce peaks in December and photo sharing sites peak after holidays) as well as some unexpected demand bursts due to external events (e.g., news events). Since it can take weeks to acquire and rack new equipment, the only way to handle such spikes is to provision for them in advance. We already saw that even if service operators predict the spike sizes correctly, capacity is wasted, and if they overestimate the spike they provision for, it is even worse. They may also underestimate the spike, as shown in Figure 29 – Under Provisioning Option 1 on page 160, accidentally turning away excess users. While the monetary effects of over provisioning are easily measured, those of under provisioning are more difficult to measure, yet equally serious given performance and scalability concerns.


Figure 29 – Under Provisioning Option 1

Not only do rejected users generate zero revenue; they may never come back due to poor service. Figure 30 – Under Provisioning Option 2 on page 161 aims to capture this behavior. Users will abandon an under provisioned service until the peak user load equals the datacenter’s usable capacity, at which point users again receive acceptable service, but with fewer potential users.

Figure 30 – Under Provisioning Option 2

Regarding Figure 28, even if peak load can be correctly anticipated without elasticity, we waste resources (shaded area) during nonpeak times. In the case of Figure 29 – Under Provisioning Option 1, potential revenue from users not served (shaded area) is sacrificed. Lastly, in Figure 30 – Under Provisioning Option 2, some users desert the site permanently after experiencing poor service. This attrition and possible negative press result in a permanent loss of a portion of the revenue stream.

Economics Pillar

It seems to be a foregone conclusion that by increasing energy efficiency, the world as a whole reduces carbon emissions and overall, aids in the goal of sustainability.

Consider the economics of passenger airline flights. A number of years ago, the thinking was that building wide-body aircraft that could carry more people would mean fewer flights and therefore greater efficiency (i.e., reduced carbon emissions). Interestingly, the opposite happened. At the micro level, passenger-per-flight efficiency did increase, but since the cost of airfare was reduced, more people started to fly and greenhouse emissions therefore increased.

It has become an article of faith among environmentalists, seeking to reduce greenhouse gas emissions, that improving the efficiency of energy use will lead to a reduction in energy consumption. This proposition has even been adopted by many countries who are promoting energy efficiency as the most cost effective solution to global warming.

However, in the United States, there has been a backlash against energy efficiency as an instrument of energy policy. This has been stimulated partly by disillusionment with the failures of energy conservation programs undertaken by utilities, and partly by the growing influence of the 'contrarians', those hostile to government mandated environmental programs.

The debate as to whether energy efficiency is effective (i.e., reduces energy consumption) has spread from the pages of obscure energy economics journals in the early 1990s to the pages of the leading US science journal, Science, and newspaper, the New York Times, in the mid 1990s. It has recently produced such polemics as the US book by Inhaber entitled “Why Energy Conservation Fails.” Inhaber argues, with the aid of an extensive bibliography, that energy efficiency programs are a waste of time and effort.

This debate has also prompted discussion among the climate change community and US energy analysts over the extent of the “rebound” or “take-back” effect; that is, how much of the energy saving produced by an efficiency investment is taken back by consumers in the form of higher consumption, both at the micro and macro levels.

The Khazzoom-Brookes postulate20, first put forward by the US economist Harry Saunders in 1992, says that energy efficiency improvements that, on the broadest considerations, are economically justified at the micro level, lead to higher levels of energy consumption at the macro level than in the absence of such improvements.

It argues against the views of conservationists who promote energy efficiency as a means of reducing energy consumption. Conservationists identify every little benefit from each individual act of energy efficiency and then aggregate them all to produce a macroeconomic total. The postulate, in effect, adopts a macroeconomic (top down) approach rather than the microeconomic (bottom up) approach used by conservationists.

It warns that although it is possible to reduce energy consumption through improved energy efficiency, it would be at the expense of loss of economic output. It can be argued that overzealous pursuit of energy efficiency per se would damage the economy through misallocation of resources. In other words, reduced energy consumption is possible but at an economic cost.

The effect of higher energy prices, either through taxes or producer-induced shortages, initially reduces demand, but in the longer term, encourages greater energy efficiency. This efficiency response amounts to a partial accommodation of the price rise and therefore the reduction in demand is blunted. The result is a new balance between supply and demand at a higher level of supply and consumption than if there had been no efficiency response.

For example, under the economic conditions of falling fuel prices and a free market approach that have prevailed in the United Kingdom for most of this century, energy consumption has increased at the same time as energy efficiency has improved. During periods of high energy prices, such as 1973-4 and 1979-80, energy consumption fell. Whether this is due to the adverse consequences of higher fuel prices on economic activity, or to energy efficiency improvements, is a matter of fierce dispute.

20 http://www.zerocarbonnetwork.cc/News/Latest/The-Khazzoom-Brookes-Postulate-Does-Energy-Efficiency-Really-Save-Energy.html

The lower level of energy consumption at times of high energy prices may be at the expense of reduced economic output. This, in turn, is due to the adverse effect of the high price of an important resource on economic productivity as a whole.

Best Practice – Consider Efficiency as only one part of the Economic Sustainability equation

Energy is only one factor of production. Therefore, there are no economic grounds for favoring energy productivity over labor or capital productivity. Governments may have non-economic reasons, such as combating global warming, for singling out energy productivity.

However, climate policies that rely only on energy efficiency technologies may need reinforcement by market instruments such as fuel taxes and other incentive mechanisms. Without such mechanisms, a significant portion of the technological achievable carbon and energy savings could be lost to the rebound.

Conclusion

Data centers are changing at a rapid, exponential pace. Cloud computing and all of its variants have been discussed. How we align the different data center disciplines to understand how new technologies will work together to solve data center sustainability problems remains a key discussion area. We reviewed Best Practices to achieve business value and sustainability. In summary, this article went above the Cloud, offering Best Practices that align with the most important goal of creating a sustainable computing infrastructure to achieve business value and growth.


Appendix A – Green IT, SaaS, Cloud Computing Solutions

Green Computing Goods and Services Company | Webpage Address | SaaS or Cloud Value Proposition Offered to Companies

APX, Inc. | http://www.apx.com/ | Analytics, technology, information and services for the energy and environmental markets. See Environmental Registry and Banking (http://www.apx.com/environmental/environmental-registries.asp) and Market Operations (http://www.apx.com/marketoperations/) for examples.

Carbon Fund Offsets | http://www.carbonfund.org/calculators | Carbon footprint calculator and preset option enables customers to easily and affordably offset their carbon footprint by pressing the “Offset Your Footprint Now!” button after adding a list of items to a shopping list. Offers detailed information to learn more about carbon offsets.

Cloud Computing Expo | http://cloudcomputingexpo.com/ | Information about Cloud Computing.

CO2 Stats | http://www.co2stats.com/ | CO2Stats makes your site carbon neutral and shows visitors you are environmentally friendly.

Ecorio | http://www.ecorio.org/ | Mobile phone capability to track carbon footprint.

GaBi Software | http://www.pe-international.com/english/gabi/ | Software tools and databases for product and process sustainability analyses.

Greenhouse Gas Equivalencies Calculator | http://www.epa.gov/cleanenergy/energy-resources/calculator.html | Green company business models must offer a value proposition that reduces carbon dioxide. Designing the green business model begins by identifying how its products, services or solutions can reduce carbon dioxide (CO2) emissions. Market standards trend toward measuring reductions by 1 million metric tons. It can be difficult to visualize what a “metric ton of carbon dioxide” really is. This calculator translates difficult-to-understand statements into more commonplace terms, such as “is equivalent to avoiding the carbon dioxide emissions of X number of cars annually.” It also offers an excellent example of the kind of green analytics, metrics and intelligence measures that SaaS / Cloud Computing solutions must address.

Ideal Sports Entertainment | idealsportsent@gmail.com | Promoting a healthy, active lifestyle and aiding in the preservation of the sports environment through the demonstration of bike riding to reduce CO2 emissions and exercise.

PE INTERNATIONAL | http://www.pe-international.com/english/ | PE INTERNATIONAL provides conscientious companies with cutting-edge tools, in-depth knowledge and an unparalleled spectrum of experience in making both corporate operations and products more sustainable. Applied methods include implementing management systems, developing sustainability indicators, life cycle assessment (LCA), carbon footprint, design for environment (DfE) and environmental product declarations (EPD), technology benchmarking, or eco-efficiency analysis and emissions management. PE INTERNATIONAL offers two leading software solutions, with the GaBi software for product sustainability and the SoFi software for corporate sustainability.

Planet Metrics | http://www.planetmetrics.com/ | Rapid Carbon Modeling (RCM) approach enables organizations to efficiently assess their exposure to commodity, climate, and reputational risks and the implications of these forces on the corporation, its suppliers and customers.

Point Carbon | http://www.pointcarbon.com/trading/ | Point Carbon Trading Analytics provides the market with independent analysis of the power, gas and carbon markets. We offer 24/7 accessible web tools, aimed at continuously providing our clients with the latest market-moving information and forecasts.

SoFi | http://www.pe-international.com/english/sofi/ | SoFi is a leading software system for environmental and sustainability / corporate social responsibility management; it is currently used in 66 countries. The fast information flow and the consistent database in SoFi will help you to improve your environmental and sustainability performance. The main product lines are: SoFi EH&S for Environmental Management and Occupational Safety, SoFi CSM for Sustainable Corporate Management, and SoFi EM for Emissions Management and Benchmarking.

Trucost PLC | http://www.trucost.com/ | Trucost is an environmental research organization working with companies, investors and government agencies to understand the impacts companies have on the environment. Trucost is an independent organization founded in 2000.

Appendix B – Abbreviations

Acronym / metric, definition, and comment:

DCE: Data Center Efficiency = IT equipment power / Total facility power. Shows a ratio of how well a data center is consuming power.
DCPE: Data Center Performance Efficiency = Effective IT workload / Total facility power. Shows how effectively a data center is consuming power to produce a given level of service or work, such as energy per transaction or energy per business function performed.
PUE: Power Usage Effectiveness = Total facility power / IT equipment power. Inverse of DCE.
Kilowatts (kW): Watts / 1,000. One thousand watts.
Annual kWh: kW x 24 x 365. kWh used in one year of continuous operation.
Megawatts (MW): kW / 1,000. One thousand kW.
BTU/hour: Watts x 3.413. Heat generated in an hour from using energy, in British Thermal Units; 12,000 BTU/hour can equate to 1 ton of cooling.
kWh: 1,000 watt-hours. The number of watts used in one hour.
Watts: Amps x Volts (e.g., 12 amps x 12 volts = 144 watts). Unit of electrical power.
Watts (from BTU): BTU/hour x 0.293. Converts BTU/hr to watts.
Volts: Watts / Amps (e.g., 144 watts / 12 amps = 12 volts). The amount of force on electrons.
Amps: Watts / Volts (e.g., 144 watts / 12 volts = 12 amps). The flow rate of electricity.
Volt-Amperes (VA): Volts x Amps. Power is sometimes expressed in volt-amperes.
kVA: Volts x Amps / 1,000. Number of kilovolt-amperes.
kW: kVA x power factor. Power factor is the efficiency of a piece of equipment's use of power.
kVA: kW / power factor. Kilovolt-amperes.
U: 1U = 1.75 inches. EIA metric describing the height of equipment in racks.
Activity / Watt: Amount of work accomplished per unit of energy consumed; this could be IOPS, transactions, or bandwidth per watt. Indicator of how much work is accomplished and how efficiently energy is being used to do useful work. This metric applies to active workloads or to actively used and frequently accessed storage and data; examples are IOPS per watt, bandwidth per watt, transactions per watt, and users or streams per watt. Activity per watt should be used in conjunction with other metrics, such as capacity supported per watt and total watts consumed, for a representative picture.
IOPS / Watt: Number of I/O operations (or transactions) / energy (watts). Indicator of how effectively energy is being used to perform a given amount of work; the work could be I/Os, transactions, throughput, or another indicator of application activity. Examples include SPC-1 / Watt, SPEC / Watt, TPC / Watt, transactions / Watt, and IOPS / Watt.
Bandwidth / Watt: Amount of data transferred or moved per second per unit of energy used (GBps, TBps, or PBps per watt). Indicates how much data is moved or accessed per second, or per time interval, per unit of energy consumed. Often confused with capacity per watt, since both bandwidth and capacity reference GBytes, TBytes, and PBytes.
Capacity / Watt: GB, TB, or PB of storage capacity (space) per watt. Indicator of how much capacity (or bandwidth) is supported in a given configuration or footprint per watt of energy. For inactive, off-line, or archive data, capacity per watt can be an effective measurement gauge; for active workloads and applications, activity per watt also needs to be considered to get a representative indicator of how energy is being used.
MHz / Watt: Processor performance / energy (watts). Indicator of how effectively energy is being used by a CPU or processor.
Carbon Credit: Carbon offset credit. Offset credits that can be bought and sold to offset your CO2 emissions.
CO2 Emission: Average of 1.341 lbs per kWh of electricity generated. The average amount of carbon dioxide (CO2) emitted in generating one kWh of electricity.
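To make these conversions concrete, the following is a minimal Python sketch, not taken from the original document, that applies the electrical conversions and facility efficiency ratios defined in the table above. The function names and the sample IT and facility loads are illustrative assumptions.

```python
# Minimal sketch (not from the paper) of the facility power conversions and
# efficiency ratios defined in the abbreviations table. Function names and
# sample loads are illustrative assumptions.

def watts(volts: float, amps: float) -> float:
    """Electrical power in watts: Watts = Volts x Amps."""
    return volts * amps

def kva(volts: float, amps: float) -> float:
    """Apparent power in kilovolt-amperes: kVA = Volts x Amps / 1,000."""
    return volts * amps / 1000.0

def kw_from_kva(kva_value: float, power_factor: float) -> float:
    """Real power: kW = kVA x power factor (power factor <= 1.0)."""
    return kva_value * power_factor

def btu_per_hour(watts_value: float) -> float:
    """Heat generated per hour: BTU/hr = Watts x 3.413."""
    return watts_value * 3.413

def annual_kwh(kw_value: float) -> float:
    """Energy used over a year of continuous operation: kWh = kW x 24 x 365."""
    return kw_value * 24 * 365

def pue(total_facility_watts: float, it_equipment_watts: float) -> float:
    """Power Usage Effectiveness = total facility power / IT equipment power."""
    return total_facility_watts / it_equipment_watts

def dce(total_facility_watts: float, it_equipment_watts: float) -> float:
    """Data Center Efficiency = IT equipment power / total facility power (inverse of PUE)."""
    return it_equipment_watts / total_facility_watts

def annual_co2_lbs(annual_kwh_value: float, lbs_per_kwh: float = 1.341) -> float:
    """Approximate CO2 emissions using the average 1.341 lbs of CO2 per kWh figure."""
    return annual_kwh_value * lbs_per_kwh

if __name__ == "__main__":
    it_load_w = 400_000          # hypothetical IT equipment load in watts
    facility_w = 680_000         # hypothetical total facility load in watts
    print(f"PUE = {pue(facility_w, it_load_w):.2f}")   # 1.70
    print(f"DCE = {dce(facility_w, it_load_w):.2f}")   # 0.59
    yearly = annual_kwh(facility_w / 1000.0)
    print(f"Annual energy = {yearly:,.0f} kWh")
    print(f"Annual CO2    = {annual_co2_lbs(yearly):,.0f} lbs")
```

With these hypothetical figures, a 400 kW IT load in a 680 kW facility yields a PUE of 1.70 (DCE of about 0.59), roughly 5.96 million kWh per year of continuous operation, and, at 1.341 lbs of CO2 per kWh, close to 8 million lbs of CO2 annually.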

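The work-per-watt metrics carry a caveat worth making concrete: capacity per watt favors dense, mostly idle storage, while activity per watt favors systems doing useful work, which is why the table recommends reading them together. The short sketch below illustrates this; the helper names and sample figures are hypothetical and not taken from the paper.

```python
# Minimal sketch (assumed helper names, not from the paper) of the
# work-per-watt metrics, showing why capacity per watt should be read
# together with activity per watt for active workloads.

def iops_per_watt(iops: float, watts: float) -> float:
    """Activity per watt for transaction-style workloads (e.g., IOPS / Watt)."""
    return iops / watts

def bandwidth_per_watt(gbytes_per_sec: float, watts: float) -> float:
    """Data moved per second per watt (GBps / Watt); distinct from capacity per watt."""
    return gbytes_per_sec / watts

def capacity_per_watt(capacity_tb: float, watts: float) -> float:
    """Storage capacity supported per watt (TB / Watt); best suited to inactive or archive data."""
    return capacity_tb / watts

if __name__ == "__main__":
    # Hypothetical comparison: a dense archive array vs. a smaller high-activity array.
    archive = {"tb": 500.0, "watts": 4000.0, "iops": 5_000.0}
    active  = {"tb": 100.0, "watts": 2500.0, "iops": 150_000.0}
    for name, a in (("archive", archive), ("active", active)):
        print(name,
              f"capacity/W = {capacity_per_watt(a['tb'], a['watts']):.3f} TB/W,",
              f"IOPS/W = {iops_per_watt(a['iops'], a['watts']):.1f}")
    # The archive array wins on capacity per watt, the active array on IOPS per
    # watt - which is why both metrics are needed for a representative picture.
```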


Author’s Biography

Paul Brant is a Senior Technology Consultant at EMC in the Global Technology Solutions Group, located in New York City. He has over 25 years of experience in semiconductor VLSI design, board-level hardware and software design, and IT solutions, in roles spanning engineering, marketing, and technical sales. He also holds a number of patents in the data communication and semiconductor fields. Paul has Bachelor's and Master's degrees in Electrical Engineering from New York University (NYU), located in downtown Manhattan, as well as a Master of Business Administration (MBA) from Dowling College in Suffolk County, Long Island, NY. In his spare time, he enjoys his family of five, bicycling, and various other endurance sports.

