Architecting for the cloud: lessons learned from 100 CloudStack deployments

Sheng Liang
CTO, Cloud Platforms, Citrix

CloudStack History

• Sept 2008: VMOps founded
• Nov 2009: CloudStack 1.0 GA
• May 2010: Cloud.com launch & CloudStack 2.0 GA
• July 2011: Citrix acquires Cloud.com
• April 2012: Apache CloudStack

The inventor of IaaS cloud: Amazon EC2

Amazon eCommerce Platform

EC2 API

Amazon Proprietary Orchestration Software

Open Source Xen Hypervisor

Commodity Networking, Storage, Servers

CloudStack is inspired by Amazon EC2

Amazon EC2                                   CloudStack
Amazon eCommerce Platform                    CloudPortal
EC2 API                                      Cloud APIs
Amazon Proprietary Orchestration Software    CloudStack Orchestration Software
Open Source Xen Hypervisor                   ESX, Hyper-V, XenServer, KVM, OVM
Commodity Networking, Storage, Servers       Commodity Networking, Storage, Servers

There will be 1000s of clouds

[Figure: cloud landscape. Labels: SP, IT; Owner | Operator, Owner; Horizontal, Vertical; General Purpose, Special Purpose; Data center mgmt and automation; Desktop Cloud]

Learning from 100s of CloudStack deployments

Service Providers, Web 2.0, Enterprise

What is the biggest difference between traditional-style automation and Amazon-style cloud?

How to handle failures

• Server failure comes from:
  ᵒ 70% - hard disk
  ᵒ 6% - RAID controller
  ᵒ 5% - memory
  ᵒ 18% - other factors
• Annual Failure Rate of servers: 8%
• Application can still fail for other reasons:
  ᵒ Network failure
  ᵒ Software bugs
  ᵒ Human admin error

Kashi Venkatesh Vishwanath and Nachiappan Nagappan, Characterizing Cloud Computing Hardware Reliability, SoCC'10
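As a rough worked example of what these numbers mean at scale, the sketch below applies the 8% annual failure rate and the cause breakdown to a fleet of servers. The fleet sizes and the assumption that servers fail independently of each other are illustrative and are not taken from the study.

```python
# Back-of-the-envelope arithmetic on the failure figures quoted above.
# ASSUMPTIONS (not from the slide): fleet sizes, independent failures.

AFR = 0.08  # annual failure rate of servers, from the slide

# Share of server failures by cause, from the slide (sums to 99%).
CAUSES = {
    "hard disk": 0.70,
    "RAID controller": 0.06,
    "memory": 0.05,
    "other factors": 0.18,
}


def expected_failures(num_servers: int, afr: float = AFR) -> float:
    """Expected number of servers that fail in one year."""
    return num_servers * afr


def prob_at_least_one_failure(num_servers: int, afr: float = AFR) -> float:
    """Chance that at least one server fails in a year, assuming independence."""
    return 1.0 - (1.0 - afr) ** num_servers


def failures_by_cause(num_servers: int, afr: float = AFR) -> dict:
    """Split the expected yearly failures across the quoted causes."""
    total = expected_failures(num_servers, afr)
    return {cause: total * share for cause, share in CAUSES.items()}


if __name__ == "__main__":
    fleet = 1000  # hypothetical fleet size
    print(f"Expected failures per year in {fleet} servers: {expected_failures(fleet):.0f}")
    for cause, count in failures_by_cause(fleet).items():
        print(f"  {cause}: {count:.1f}")
    print(f"P(at least one failure among 10 servers): {prob_at_least_one_failure(10):.2f}")
```

Under these assumptions even a 10-server deployment has somewhat better than even odds of losing a server within a year, so failure handling cannot be treated as a rare exception.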


[Figure: data center network. Core Routers, Access Routers, Aggregation Switches, Load Balancers, Top of Rack Switches, Servers]

Effectiveness of network redundancy in reducing failures: 40%

• Bugs in failover mechanism
• Incorrect configuration
• Protocol issues such as TCP back-off, timeouts, and spanning tree reconfiguration

Phillipa Gill, Navendu Jain & Nachiappan Nagappan, Understanding Network Failures in Data Centers: Measurement, Analysis and Implications, SIGCOMM 2011

A. Promise users that VM, storage, and networking will never fail; no strategy to handle failures

B. Back up VMs for users and restore them when failure happens

C. Tell users to expect failure. Users back up VMs and handle failure themselves

[Figure: zCloud West Zone, zCloud East Zone, AWS West Zone, AWS East Zone]

Design for Failure

Cloud workloads
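Option C and the "Design for Failure" label put the burden on the application: when one zone is unreachable, the workload itself tries another. A minimal sketch of that idea follows. The zone names are modeled on the figure labels, and `ZoneUnavailable`, `call_with_failover`, and `flaky_request` are hypothetical names made up for this illustration, not part of any CloudStack or AWS API.

```python
# Sketch: application-level failover across zones ("design for failure").
# All names here are hypothetical; a real client would issue network calls.
from typing import Callable, Sequence


class ZoneUnavailable(Exception):
    """Signals that a zone could not serve the request."""


def call_with_failover(zones: Sequence[str], request: Callable[[str], str]) -> str:
    """Try each zone in order and return the first successful response."""
    errors = []
    for zone in zones:
        try:
            return request(zone)
        except ZoneUnavailable as exc:
            # Record the failure and move on to the next zone.
            errors.append(f"{zone}: {exc}")
    raise RuntimeError("all zones failed: " + "; ".join(errors))


def flaky_request(zone: str) -> str:
    """Stand-in for a real request; pretends the first zone is down."""
    if zone == "zcloud-west":
        raise ZoneUnavailable("zone is down")
    return f"served by {zone}"


if __name__ == "__main__":
    zones = ["zcloud-west", "zcloud-east", "aws-west", "aws-east"]
    print(call_with_failover(zones, flaky_request))
```

The point of the sketch is the ownership of the retry loop: it lives in the application, so the infrastructure does not have to promise that any single zone stays up.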

Traditional-Style
Reliable hardware, back up entire cloud, and restore for users when failure happens
• Link aggregation
• Storage multi-pathing
• VM HA, fault tolerance
• VM live migration
• Strong consistency

Amazon-Style
Tell users to expect failure. Users build apps that can withstand infrastructure failure
• VM backup/snapshots
• Ephemeral resources
• Chaos monkey
• Multi-site redundancy
• Eventual consistency

Designing a zone for a traditional workload

Traditional-Style Availability Zone

[Figure: vCenter/XenCenter; Enterprise Networking (e.g., VLAN); three Hypervisor Clusters; Enterprise Storage (e.g., SAN)]

• Hypervisor: vSphere or XenServer Enterprise
• Storage: SAN
• Networking: L2 VLANs
• Network Services: Load Balancing, VPN
• Multi-tier Apps: Ent App Mgmt

Designing a zone for an Amazon-style workload

Amazon-Style Availability Zone

[Figure: Software Defined Networks (e.g., Security Groups, EIP, ELB,...); Server Racks; Elastic Block Storage]

• Hypervisor: XenServer or KVM
• Storage: Local, EBS
• Networking: L3, SDN based L2, Elastic IP
• Network Services: Security Groups, ELB, GSLB
• Multi-tier Apps: 3rd Party Tools (e.g., RightScale, enStratus)

Object store is critical for Amazon-style cloud

[Figure: Users; ELB/GSLB; NetScaler in Availability Zone 1 and Availability Zone 2; Storage Cloud marked "?"]

Same Cloud can Support Both Styles

[Figure: one Apache CloudStack Mgmt Server above two Traditional-Style Availability Zones and three AWS-style Availability Zones; Object Storage; Replication/DR]

Tests for a “true” cloud app

• Does it require SAN or VLAN?
• Does it run in multiple data centers?
• Does it involve a distributed object store?
• Is there a single point of failure?

Learning from 100s of CloudStack deployments

• Service Providers: Traditional-style
• Web 2.0: Mostly Amazon-style
• Enterprise: Mostly traditional style

[Figure: deployment architecture. Labels: CloudStack Admin; Internet; Primary CloudStack Mgmt Server Cluster; Standby CloudStack Mgmt Server Cluster; Primary MySQL; Backup MySQL; Router; Load Balancer; L3 Core Switch; Top of Rack Switch; Object Store; Servers; Availability Zone 1; Availability Zone 2; Pod 1, Pod 2, Pod 3, ..., Pod N]

Layer 3 cloud networking (security groups)

[Figure: Web VMs in a Web Security Group and DB VMs in a DB Security Group]
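Security groups isolate tenants at layer 3 by filtering traffic per group rather than per VLAN. The sketch below models the Web and DB groups from the figure as plain data with ingress rules. It is a simplified illustration and not CloudStack's API: the class names, the port numbers (80 and 3306), and the rule that the DB group accepts traffic only from the Web group are all assumptions.

```python
# Sketch: evaluating security-group ingress rules for a Web and a DB group.
# ASSUMPTIONS: class names, ports, and rule contents are illustrative only.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class IngressRule:
    protocol: str
    port: int
    source_group: Optional[str]  # None means "any source"


@dataclass
class SecurityGroup:
    name: str
    ingress: List[IngressRule] = field(default_factory=list)

    def allows(self, protocol: str, port: int, source_group: Optional[str]) -> bool:
        """Return True if any ingress rule admits this traffic."""
        return any(
            rule.protocol == protocol
            and rule.port == port
            and (rule.source_group is None or rule.source_group == source_group)
            for rule in self.ingress
        )


# Web group: HTTP from anywhere. DB group: MySQL only from the Web group.
web = SecurityGroup("web", [IngressRule("tcp", 80, None)])
db = SecurityGroup("db", [IngressRule("tcp", 3306, "web")])

if __name__ == "__main__":
    print("internet -> web:80  ", web.allows("tcp", 80, None))
    print("web      -> db:3306 ", db.allows("tcp", 3306, "web"))
    print("internet -> db:3306 ", db.allows("tcp", 3306, None))
```

Because rules refer to groups instead of network segments, adding another Web VM only means placing it in the Web group; no VLAN has to be provisioned for it.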

Layer 2 VLAN networking

[Figure: VMs labeled User 1 and User 2]

OVS networking

[Figure: VMs of User 1 and User 2 attached to OVS instances; GRE Key 1, GRE Key 2]

Multi-tier virtual networking

Network Services
• IPAM
• DNS
• LB [intra]
• S-2-S VPN
• Static Routes
• ACLs
• NAT, PF
• FW [ingress & egress]
• BGP

[Figure: Internet; Public VLAN; IPSec VPN; Customer Premises; MPLS VLAN; Virtual Router; NetScaler VPX; GRE Key 1, GRE Key 2, GRE Key 3; Web VM 1 to Web VM 4; App VM 1, App VM 2; DB VM 1]

Web subnet: 10.1.1.0/24
App subnet: 10.1.2.0/24
DB subnet: 10.1.3.0/24

Network flexibility
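The three tier subnets from the multi-tier virtual networking slide (10.1.1.0/24, 10.1.2.0/24, 10.1.3.0/24) can be used to illustrate ACL-based isolation between tiers. The sketch below uses Python's standard `ipaddress` module. The policy that Web may reach App and App may reach DB, with everything else denied, is an assumption for illustration; the slide lists ACLs as a service but does not state the rules.

```python
# Sketch: classifying addresses into tiers and applying a tier-to-tier ACL.
# ASSUMPTION: the ALLOWED policy below is illustrative, not from the slide.
import ipaddress
from itertools import combinations
from typing import Optional

TIERS = {
    "web": ipaddress.ip_network("10.1.1.0/24"),
    "app": ipaddress.ip_network("10.1.2.0/24"),
    "db": ipaddress.ip_network("10.1.3.0/24"),
}

# Permitted (source tier, destination tier) pairs.
ALLOWED = {("web", "app"), ("app", "db")}


def tier_of(address: str) -> Optional[str]:
    """Return the tier whose subnet contains the address, or None."""
    ip = ipaddress.ip_address(address)
    for name, network in TIERS.items():
        if ip in network:
            return name
    return None


def is_allowed(src: str, dst: str) -> bool:
    """Apply the tier ACL; unknown addresses are denied."""
    return (tier_of(src), tier_of(dst)) in ALLOWED


def subnets_overlap() -> bool:
    """Sanity check: tier subnets should be disjoint."""
    return any(a.overlaps(b) for a, b in combinations(TIERS.values(), 2))


if __name__ == "__main__":
    print("subnets overlap:", subnets_overlap())
    print("web -> app:", is_allowed("10.1.1.5", "10.1.2.17"))
    print("web -> db: ", is_allowed("10.1.1.5", "10.1.3.9"))
```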

Network Services
• L2 connectivity
• IPAM
• DNS
• Routing
• ACL
• Firewall
• NAT
• VPN
• LB
• IDS
• IPS

Service Providers
• Virtual appliances
• Hardware firewalls
• LB appliances
• SDN controllers
• IDS/IPS appliances
• VRF
• Hypervisor

Network Isolation
• No isolation
• VLAN isolation
• SDN overlays
• L3 isolation

“The Apache Way”

• Collaborative software development
• Commercial-friendly standard license
• Consistently high quality software
• Respectful, honest, technical-based interaction
• Faithful implementation of standards
• Security as a mandatory feature

Apache CloudStack Community

                                         Pre Apache Move (Jan 2012)    June Actuals
# of companies endorsing project         1                             68
# of companies participating             10                            140
# of developers working on project       40                            238

Apache CloudStack community projects

• SDN
  ᵒ Nicira
  ᵒ Midokura
  ᵒ Big Switch Networks
  ᵒ Stratosphere
• Backup/DR
  ᵒ Sungard
• Networking
  ᵒ Cisco
  ᵒ Brocade (ADX)
• Smart Storage
  ᵒ Hadoop + S3 API for object store
  ᵒ NetApp (FlexPod, object store)
  ᵒ Basho RIAK CS
  ᵒ Caringo object store
  ᵒ Cloudian S3
• PaaS
  ᵒ CloudFoundry implementation through IronFoundry and Stackato teams
  ᵒ Cumulogic
  ᵒ GigaSpaces

Workload requirements drive cloud architecture

There is real demand for SDN in cloud infrastructure

Open source developers drive cloud adoption

More info: http://cloudstack.org