Multi-Site Openstack Deployment Options & Challenges for Telcos
MULTI-SITE OPENSTACK DEPLOYMENT OPTIONS & CHALLENGES FOR TELCOS
Azhar Sayeed, Chief Architect, [email protected]

DISCLAIMER
Important Information
The information described in this slide set does not provide any commitments to roadmaps or availability of products or features. Its intention is purely to provide clarity in describing the problem and to drive a discussion that can then be used to drive open source communities. Red Hat Product Management owns the roadmap and supportability conversation for any Red Hat product.

AGENDA
• Background: OpenStack Architecture
• Telco Deployment Use Case
• Distributed Deployment – Requirements
• Multi-Site Architecture
• Challenges
• Solution and Further Study
• Conclusions

OPENSTACK ARCHITECTURE
[Diagram: OpenStack architecture overview]

WHY MULTI-SITE FOR TELCO?
• Compute requirements – not just at the data center
• Multiple data centers
• Managed service offering
• Managed branch office
• Thick vCPE
• Mobile edge compute
• vRAN – vBBU locations
• Virtualized central offices
• Hundreds to thousands of locations
• Primary and backup data center – disaster recovery
• IoT gateways – fog computing
Centrally managed, with compute closer to the user

Multiple DC or Central Offices – Independent OpenStack Deployments
• Hierarchical connectivity model of central offices
• Remote sites with compute requirements
• Extend OpenStack to these sites
[Diagram: E2E orchestrator managing a main and backup data center, remote data centers, and remote sites (security & firewall, QoS, traffic shaping, device management) connected by an overlay tunnel over the Internet]
A typical service almost always spans multiple DCs

Multiple DCs – NFV Deployment: Real Customer Requirements
• 25 sites
• 2-5 VNFs required at each site
• Maximum of 2 compute nodes per site needed for these VNFs
• Storage requirements = image storage only
• Total number of control nodes = 25 * 3 = 75
• Total number of storage nodes = 25 * 3 = 75
• Total number of compute nodes = 25 * 2 = 50
• Configuration overhead: 75% (see the sizing sketch below)
[Diagram: Region 1 and Region 2, each a fully redundant system with controllers, redundant storage nodes, and compute nodes, linked by L2 or L3 extensions between DCs]

Virtual Central Office: Real Customer Challenge
• 1000+ sites – central offices
• From a few 10s to 100s of VMs per site
• Fully redundant configurations
• Termination of residential, business, and mobile services
• Managing 1000 OpenStack islands
• Tier 1 telcos already have >100 sites today
[Diagram: Region 1 and Region 2, each a fully redundant system with controllers, storage nodes, and compute nodes, linked by L2 or L3 extensions between DCs]
Management challenge
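A back-of-the-envelope sketch of the sizing arithmetic from the 25-site NFV example above, assuming the fully redundant layout on the slide (3 controllers and 3 storage nodes per site, 2 compute nodes per site); the values simply reproduce the slide's numbers and can be changed per deployment:

```python
# Rough sizing arithmetic for fully redundant, per-site OpenStack islands.
# Values mirror the 25-site NFV example above; adjust per deployment.

sites = 25
controllers_per_site = 3   # redundant control plane
storage_per_site = 3       # redundant storage (image storage only)
compute_per_site = 2       # enough for the 2-5 small VNFs per site

controllers = sites * controllers_per_site   # 75
storage = sites * storage_per_site           # 75
compute = sites * compute_per_site           # 50
total = controllers + storage + compute      # 200

# Fraction of nodes that run no tenant workloads at all.
overhead = (controllers + storage) / total
print(f"control/storage nodes: {controllers + storage}, compute nodes: {compute}")
print(f"configuration overhead: {overhead:.0%}")   # -> 75%
```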
DEPLOYMENT OPTIONS

OPTIONS
• Multiple independent island model – seen this already
• Common authentication and management
  – External user policy management with LDAP integration
  – Common Keystone
• Stretched deployment model
  – Extend compute and storage nodes into other data centers
  – Keep central control of all remote resources
• Allow data centers to share workloads – Tricircle approach
• Proxy the APIs – master/slave or cascading model
• Agent-based model
• Something else?

Multiple DC or Central Offices – Independent OpenStack Deployments
Feed the load balancer
• Site capacity independent of the others
• User information separate or replicated offline
• Load balancer directs where traffic goes – good for load sharing
• DR – an external problem
[Diagram: a cloud management platform and directory feeding a load balancer in front of Region 1 and Region 2…N, each a fully redundant system with controllers, storage nodes, and compute nodes, linked by L2 or L3 extensions between DCs]
Good for a few 10s of sites – what about 100s or thousands of sites?

Extended OpenStack Model – Shared Keystone Deployment
Common or shared Keystone
• Single Keystone for authentication
• User information in one location
• Independent resources
• Modify the Keystone endpoint table
  – Endpoint, Service, Region, IP
[Diagram: a cloud management platform and directory in front of a shared Keystone serving Region 1 through Region 2…N, each a fully redundant system with controllers, storage nodes, and compute nodes, linked by L2 or L3 extensions between DCs]
Identity: Keystone – single point of control

Extended OpenStack Model – Central Controller and Remote Compute & Storage (HCI) Nodes
Central controller
• Single authentication
• Distributed compute resources
• Single availability zone per region
[Diagram: a cloud management platform and directory in front of a fully redundant Region 1 control plane (controllers; storage nodes for Cinder, Glance and images; replicated storage via a Galera cluster) with remote compute nodes in Region 2…N over L2 or L3 extensions between DCs]
Manual restore

Revisiting the Branch Office – Thick CPE
Can we deploy compute nodes at all the branch sites and centrally control them?
[Diagram: an E2E network orchestrator and data center (OpenStack, OpenShift/Kubernetes NFVI) connected to enterprise vCPE sites (x86 servers with VNFs for security & firewall, QoS, traffic shaping, device management) over IPsec, MPLS, the Internet, or another tunnel mechanism; nova-compute deployed at the branch]
How do I scale it to thousands of sites?

OSP 10 – Scale Components Independently
Most OpenStack HA services and VIPs must be launched/managed by Pacemaker or HAProxy. However, some can be managed via systemctl, thanks to the simplification of Pacemaker constraints introduced in versions 9 and 10.

COMPOSABLE SERVICES AND CUSTOM ROLES
[Diagram: the hardcoded controller role (Keystone, Neutron, RabbitMQ, Glance, ...) decomposed into custom roles, e.g. a custom controller role, a custom Ceilometer role, and a custom networker role]
• Leverage the composable services model to define a central Keystone
  – Place functionality where it is needed, i.e. disaggregate
• Deployable standalone on separate nodes or combined with other services into custom role(s)
  – Distribute the functionality depending on the DC locations

Re-visiting the Virtual Central Office Use Case: Real Customer Challenge
[Diagram: Region 1 and Region 2 (fully redundant systems with controllers, storage nodes, and compute nodes) linked by L2 or L3 extensions between DCs, with Regions 3, 3a, 3b, and 4 hanging off them]
Requires flexibility and some hierarchy

CONSIDERATIONS
Scaling across a thousand sites?
Some areas that we need to look at:
• Latency and outage times
  – Delays due to distance between DCs and link speeds – RTT
  – The remote site is lost – headless operation and subsequent recovery
  – Startup storms
• Scaling oslo.messaging
  – RabbitMQ
  – Scaling of nodes => scale RabbitMQ/messaging
  – Ceilometer (Gnocchi & Aodh) – heavy user of the MQ

LATENCY AND OUTAGE TIMES
Scaling across a thousand sites?
• Latency between sites – Nova API calls
  – 10, 50, 100 ms? Round-trip time => queue tuning
  – Bottleneck link/node speed
• Outage time – recovery time
  – 30 s or more?
  – Nova compute services flapping
  – Confirmation – from provisioning to operation
  – Neutron timeouts – binding issues
• Headless operation
  – Restart causes storms

RABBITMQ TUNING
• Tune the buffers – increase buffer size
• Take into account messages in flight – rates and round-trip times
  – BDP = bottleneck speed * RTT
• Number of messages
  – Servers * backends * requests/sec = number of messages/sec
• Split into multiple instances of message queues for a distributed deployment
  – Ceilometer into its own MQ – heaviest user of the MQ
  – Nova into a single MQ
  – Neutron into its own MQ
[Diagram: separate message queues for Neutron, Nova conductor, compute, and the Ceilometer agents/collector]
• Refer to an interesting presentation on this topic – "Tuning RabbitMQ at Large Scale Cloud" – OpenStack Summit, Austin 2016
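A minimal sketch of the two sizing rules on the slide above: buffer sizing from the bandwidth-delay product (BDP = bottleneck speed * RTT) and aggregate message rate (servers * backends * requests/sec). The link speed, RTT, and request rates below are illustrative placeholders, not measurements:

```python
# Rough RabbitMQ sizing for a distributed (multi-site) control plane.
# All input values are illustrative placeholders.

bottleneck_bps = 100_000_000   # 100 Mbit/s bottleneck link to the remote site
rtt_s = 0.050                  # 50 ms round-trip time

# Bandwidth-delay product: bytes that can be "in flight" on the link,
# a starting point for TCP/RabbitMQ buffer sizing.
bdp_bytes = (bottleneck_bps / 8) * rtt_s
print(f"BDP ~= {bdp_bytes / 1024:.0f} KiB per connection")

# Aggregate message rate: servers * backends * requests/sec.
servers = 25            # API servers across sites
backends = 4            # workers/backends per server
requests_per_sec = 10   # RPC calls per backend per second
msgs_per_sec = servers * backends * requests_per_sec
print(f"~{msgs_per_sec} messages/sec hitting the message bus")
```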
RECENT AMQP ENHANCEMENTS
• Eliminates the broker-based model
• Enhances AMQP 1.0
• Separates the messaging endpoint from the message routers
• Newton has the AMQP driver for oslo.messaging
• Ocata provides perf tuning and upstream support for TripleO
• If you must use RabbitMQ
  – Use clustering and exchange configurations
  – Use the shovel plugin with exchange configurations and multiple instances
[Diagram: brokers arranged as a hierarchical tree versus a routed mesh]

OPENSTACK CASCADING PROJECT
• Proxy for the Nova, Cinder, Ceilometer & Neutron subsystems per site
• At the parent – lots of proxies, one set per child
• User communicates with the master
[Diagram: a parent OpenStack fronting multiple child OpenStack instances, AZ1 ... AZn]

TRICIRCLE AND TRIO2O
Cascading solution split into two projects
• Tricircle – networking across OpenStack clouds
  – Make Neutron(s) work as a single cluster
  – Expand workloads into other OpenStack instances
  – Create networking extensions
  – Isolation of east-west traffic
  – Application HA
• Trio2o – single API gateway for Nova and Cinder
  – Single region with multiple sub-regions (AZ1 ... AZx ... AZn pods)
  – Shared or federated Keystone
  – Shared or distributed Glance
  – UID = TenantID + PODID
[Diagram: User1 ... UserN accessing an API gateway in front of multiple pods]
OPNFV Multi-Site Project – Euphrates release

WHAT'S THE ALTERNATIVE?
Remote compute nodes
• Should we abandon the idea of remote Nova nodes?
  – Use Packstack/all-in-one – OSP in a box – à la Vz uCPE
  – High overhead if you want to run 1-2 VNFs
  – Perhaps some optimization is possible using the Kolla/container model
• Initialize the remote nodes – need L3/L2 connectivity for PXE
• Make that a Kubernetes node – use containers on that node
• Implement a new interface for remote nodes
  – Nova agent on remote nodes?
• Abandon the idea of OpenStack – no!!! No OpenStack, really?
  – Use a CMP to manage remote bare-metal nodes
  – KVM hypervisor
  – Run containers on remote nodes – do we run into the same issues?

VIRTUAL CONTROLLER MODEL
Virtual controllers – to get around node restrictions
Kolla – containerizing the control plane
• Kolla-Kubernetes and Kolla-Ansible
• Containerizing OSP control makes the previous options easier
• Can remote nodes be considered as pods in Kubernetes environments? (see the sketch after the summary)
• Interface between master and host node
• The containers can be deployed on those nodes to manage apps or even OSP services
[Diagram: a host running VMs alongside containerized Keystone, Glance, Nova, and Neutron services]

SUMMARY
• Deploying OpenStack at multiple sites is a must for telcos
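Tying back to the virtual controller model above: if remote branch nodes are treated as Kubernetes nodes with containerized services scheduled onto them, a basic health check could look roughly like the sketch below. This is only an illustration, not the deck's method; the node label site=remote-branch and the openstack namespace are hypothetical, and it assumes the official kubernetes Python client with a reachable kubeconfig.

```python
# Hypothetical sketch: treat remote branch nodes as Kubernetes nodes and
# check whether they are Ready and which control-plane containers run there.
from kubernetes import client, config

config.load_kube_config()   # or config.load_incluster_config() when run inside the cluster
v1 = client.CoreV1Api()

# "site=remote-branch" is an assumed label; a real deployment would pick its own.
remote_nodes = v1.list_node(label_selector="site=remote-branch")
for node in remote_nodes.items:
    ready = next((c.status for c in node.status.conditions if c.type == "Ready"), "Unknown")
    print(f"{node.metadata.name}: Ready={ready}")

    # The "openstack" namespace is also an assumption for containerized OSP services.
    pods = v1.list_namespaced_pod(
        "openstack", field_selector=f"spec.nodeName={node.metadata.name}"
    )
    for pod in pods.items:
        print(f"  {pod.metadata.name}: {pod.status.phase}")
```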