Using PubSub For Scheduling in Azure SDN
Qi Zhang (Microsoft - Azure Networking)
Azure Networking
[Figure: Azure networking overview. Labels: Azure Region ‘A’, Azure Region ‘B’, Regional Network, WAN, Microsoft Edge, ExpressRoute, Internet Exchanges, CDN, Carrier, Cable, Consumers, Mobile, Enterprise, SMB, Enterprise DC/Corpnet.]
DC Hardware
• SmartNIC/FPGA
• SONiC
Services
• Virtual Networks
• Load Balancing
• VPN Services
• Firewall
• DDoS Protection
• DNS & Traffic Management
Intra-Region
• DC Networks
• Regional Networks
• Optical Modules
WAN Backbone
• Software WAN
• Subsea Cables
• Terrestrial Fiber
• National Clouds
Edge and ExpressRoute
• Internet Peering
• ExpressRoute
CDN
• Acceleration for applications and content
Last Mile
• E2E monitoring (Network Watcher, Network Performance Monitoring)
Microsoft Global Network
[Figure: world map of the Microsoft global network. Legend: Data center, Edge Site, Owned Capacity, Leased Capacity, Moving to Owned.]
One of the largest private networks in the world
• 8,000+ ISP sessions
• 130+ edge sites
• 44 ExpressRoute locations
• 33,000 miles of lit fiber
• SDN Managed (SWAN, OLS)
DCs and Network sites not exhaustive
Software Defined Networking (SDN)
[Figure: SDN stack. Labels: Management, Central, API, Controllers, Commodity HW, vNIC (x6), Host Agents, SmartNIC.]
Azure SDN
• Basis of all NW virtualization in our datacenters
• Control Plane: centralized, hierarchical, highly scalable and available controllers
• Data Plane: host agent, drivers
• Key to flexibility and scale is SDN
PubSub in SDN
• Scale:
  • 40+ regions, hundreds of DCs, millions of servers
  • millions of VNets and LBs
• Flexible, scalable and efficient scheduling between controllers and agents
• Publisher/Subscriber pattern
[Figure: Controller -> (publish flow) -> PubSub -> (notification flow) -> Agent 1 … Agent i … Agent N]
Virtual Network in Azure
[Figure: several Virtual Networks joined by VNet Peering, with Cross premises Connectivity and the Internet.]
• Secure, per-customer virtual datacenter in the cloud
• Instantiate and configure complex topologies in minutes
• Rich security and networking services
CA-PA Mappings
• Payload, including CA, is encapsulated
• Traverses physical network
[Figure: a Directory Service holds CA-to-PA mapping tables for VM switches VM-SW1, VM-SW2 and VM-SW3 on Host Nodes 1-3. A payload between CA 10.0.0.1 and CA 10.0.0.7 (VM1, VM2) is carried across the physical network between PA 10.1.1.2 and PA 10.1.5.2. Legend: data traffic, control msgs.]
CA/PA pairs shown in the figure:
CA         PA
10.0.0.1   10.1.1.2
10.0.0.7   10.1.1.4
10.0.0.4   10.1.1.3
10.0.0.6   10.1.3.3
10.0.0.4   10.1.3.2
10.0.0.1   10.1.5.3
10.0.0.7   10.1.5.2
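The lookup-and-encapsulate step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `MAPPINGS`, `InnerPacket`, `EncapsulatedPacket`, `encapsulate` and `decapsulate` are hypothetical, the grouping of the figure's address pairs into two VNets ("vnet-a", "vnet-b") is assumed, and the real wire format is not shown in the slides.

```python
from dataclasses import dataclass

# Hypothetical per-VNet CA -> PA tables. The addresses come from the figure;
# the split into "vnet-a" and "vnet-b" is an assumption for illustration.
MAPPINGS = {
    "vnet-a": {"10.0.0.1": "10.1.1.2", "10.0.0.7": "10.1.5.2"},
    "vnet-b": {"10.0.0.1": "10.1.5.3", "10.0.0.7": "10.1.1.4"},
}


@dataclass(frozen=True)
class InnerPacket:
    """Customer packet, addressed with customer addresses (CA)."""
    src_ca: str
    dst_ca: str
    payload: bytes


@dataclass(frozen=True)
class EncapsulatedPacket:
    """Outer packet, addressed with provider addresses (PA)."""
    src_pa: str
    dst_pa: str
    inner: InnerPacket


def encapsulate(vnet: str, packet: InnerPacket) -> EncapsulatedPacket:
    """Wrap the whole customer packet, CA header included, in a PA header."""
    table = MAPPINGS[vnet]
    return EncapsulatedPacket(table[packet.src_ca], table[packet.dst_ca], packet)


def decapsulate(packet: EncapsulatedPacket) -> InnerPacket:
    """Strip the PA header on the receiving host."""
    return packet.inner
```

Because the same CA can map to different PAs, the lookup in this sketch is keyed by VNet first and by CA second.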
PubSub for CA-PA Mapping
Challenges:
• Scale: hundreds of thousands of agents, millions of VNets
• Scope: cluster, regional, global
• VNet size limit: 4K mappings -> 64K mappings, 500 peerings
• Provisioning speed: minutes -> seconds
[Figure: VNet Controller -> Directory Service -> Agent 1 … Agent i … Agent N, alongside VNet Controller -> PubSub -> Agent 1 … Agent i … Agent N]
Scenario I: Global Peering
[Figure: Region A / VNET A and Region B / VNET B, each with a VNet Controller, a PubSub and an Agent.]
Scenario II: DataExfil
[Figure labels: NRP, Policy Service, Tunnel Policy, PubSub, Host Agent, VNetPolicyCache, Storage FE, Resource "Metadata" for Resource A, Resource B and Resource C; BLOCK shown at Resource B.]
Tunnel Policy:
{
  id: "policy-123",
  service: "xstore",
  subscription: "{guid}",
  accounts: [ "users", "wiki.*" ],
  storage_type: "blob",
  access: "rw"
}
METADATA (resource A):
{ subscription: "{guid}", account: "users", storage_type: "blob" }
METADATA (resource B): BLOCK
{ subscription: "{guid}", account: "users", storage_type: "table" }
METADATA (resource C):
{ subscription: "{guid}", account: "wikimain", storage_type: "blob" }
Overview
• Persisted KV Store
• Hierarchical name space
• Set watcher on a node
  • Single watcher
  • Bulk watcher
• Interfaces
  • Publish (batch/multi supported)
  • Subscribe
  • Notification
  • Query
• State Update/Delivery
  • Initial state
  • Subsequent state updates
[Figure: a Publisher issues Query (GetNodeInfo) and Publish (CreateNode, UpdateNode) against a tree: Root -> partition keys PK1 … PKi … PKn -> nodes a1, a2, a3, … -> b1 … b5, with watchers (W) set on nodes. A Subscriber issues Subscribe (watcher, bulkwatcher) and receives Notification (Created, Deleted, DataChanged, ChildrenChanged).]
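The interfaces listed in the Overview (publish, subscribe, notification, query over a hierarchical name space) can be illustrated with a small in-memory Python sketch. It is not the service's API: the class `TinyPubSub` and its method names are hypothetical, nothing is persisted, ChildrenChanged is omitted, and delivering initial state as `Created` events is an assumption.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Tuple

Event = Tuple[str, str]            # (event type, node path)
Callback = Callable[[Event], None]


class TinyPubSub:
    """In-memory stand-in for a hierarchical key-value store with watchers."""

    def __init__(self) -> None:
        self._nodes: Dict[str, bytes] = {}
        self._watchers: Dict[str, List[Callback]] = defaultdict(list)
        self._bulk_watchers: Dict[str, List[Callback]] = defaultdict(list)

    @staticmethod
    def _covers(root: str, path: str) -> bool:
        """True if path is root itself or lies in the subtree under root."""
        return path == root or path.startswith(root.rstrip("/") + "/")

    def publish(self, path: str, value: bytes) -> None:
        event_type = "DataChanged" if path in self._nodes else "Created"
        self._nodes[path] = value
        self._notify(event_type, path)

    def delete(self, path: str) -> None:
        if path in self._nodes:
            del self._nodes[path]
            self._notify("Deleted", path)

    def query(self, path: str) -> Optional[bytes]:
        return self._nodes.get(path)

    def subscribe(self, path: str, callback: Callback) -> None:
        """Single watcher: one node. Initial state is delivered first."""
        self._watchers[path].append(callback)
        if path in self._nodes:
            callback(("Created", path))

    def subscribe_bulk(self, root: str, callback: Callback) -> None:
        """Bulk watcher: a whole subtree. Initial state is delivered first."""
        self._bulk_watchers[root].append(callback)
        for path in sorted(self._nodes):
            if self._covers(root, path):
                callback(("Created", path))

    def _notify(self, event_type: str, path: str) -> None:
        event = (event_type, path)
        for callback in list(self._watchers.get(path, [])):
            callback(event)
        for root, callbacks in list(self._bulk_watchers.items()):
            if self._covers(root, path):
                for callback in list(callbacks):
                    callback(event)
```

In this sketch an agent would call `subscribe_bulk("/Vnet/{VnetId}", handler)` once, receive the current mappings as initial state, and then receive one event per later change under that subtree.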
4 Microservices:
• Stateless Service
  • Routing Service
  • Notification Service
• Stateful Service
  • Selector Service
  • Madari Service
[Figure: SDN PubSub Service, numbered steps 1-6. Publisher (Vnet Controller): PK: /Vnet/{VnetId1}, Path: /mappings/ipv4/{CA1}. Subscriber (Agent): SetBulkWatcher with PK: /Vnet/{VnetId2}, Path: /. Partition keys /Vnet/{VnetId1} and /Vnet/{VnetId2} are shown with MadariService_02 and MadariService_03.]
Madari Selector Service: Data Partitioning
[Figure: AddPartitionKey(“baz”) is sent to the Selector Service (steps 1-3), which tracks MadariService_01, MadariService_02 and MadariService_03 in the two tables below.]
Partition Key   Madari Instance
“foo”           MadariService_01
“bar”           MadariService_02
…..             …..
“baz”           MadariService_01

Madari Instance    Total Data Size
MadariService_01   1.05G
MadariService_02   1.9G
MadariService_03   1.6G

Subscription through Notification Service
[Figure: MadariService_02 and MadariService_04 each hold a tree under Root with nodes A, B, C, D and vnet1 … vnet4. NotificationService_03 and NotificationService_08 hold entries for vnet1, vnet2 and vnet3 and serve Subscriber I, Subscriber II and Subscriber III.]
Service Fabric Ring
• Service Fabric ring
  • Multiple PaaS tenants form a Service Fabric ring
  • Service Fabric ring is on a VNET
• PubSub as Service Fabric application
  • Routing Service/Notification Service
    • Stateless
    • On every node
  • MadariService/MadariSelectorService(s)
    • Stateful
    • Min 3, target 7
[Figure: Tenant1 / Cluster1 (n1-n5), Tenant2 / Cluster2 (n6-n10), Tenant3 / Cluster3 (n11-n15).]
Client Libraries
• Managed Libraries
  • Madari.ClientLibrary
    • Publishing through WCF channel
  • Reliable Publisher
    • IMOS-based publishers
    • User implements:
      • Commit hooks
      • Handler
    • Nuget package:
      Madari.ReliablePublisher.RSL
      Madari.ReliablePublisher.ServiceFabric
• Native Libraries
  • Publish
    • Nuget package: Madari.MadariFrontEnd.Native
  • Subscribe
    • Nuget package: Madari.Subscriber.Native
[Figure: reliable publisher flow. Mark objects modified -> commit hooks triggered (IMOS Lib, Repo) -> Runtime persists reliable tasks -> Worker picks up tasks -> execute handler (Handler), retry on failure -> delete executed tasks on success.]
Hierarchical PubSub Infrastructure
Resource Scope => PubSub Service Scope
Resource           Scope      Publisher         Subscriber
CA-PA mapping      regional   VNet Controller   Agent
DataExfil policy   global     NRP               Agent
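The scope table above amounts to a small routing rule: the resource's scope decides which PubSub instance it is published to. A minimal Python sketch follows, assuming hypothetical names (`ResourceRoute`, `ROUTES`, `pubsub_instance`) and made-up instance identifiers such as `regional-pubsub/uswest`; only the table rows themselves come from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceRoute:
    scope: str        # "regional" or "global"
    publisher: str
    subscriber: str


# Rows copied from the scope table; the dictionary layout is illustrative.
ROUTES = {
    "CA-PA mapping": ResourceRoute("regional", "VNet Controller", "Agent"),
    "DataExfil policy": ResourceRoute("global", "NRP", "Agent"),
}


def pubsub_instance(resource: str, region: str) -> str:
    """Pick the (hypothetically named) PubSub instance for a resource."""
    route = ROUTES[resource]
    if route.scope == "regional":
        return f"regional-pubsub/{region}"
    return "global-pubsub"
```

With this rule a CA-PA mapping published in one region stays in that region's PubSub, while a DataExfil policy always goes to the global instance regardless of the caller's region.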
[Figure: DataExfil policy is published to the Global PubSub; CA->PA mappings are published to each of three Regional PubSub instances.]
Global PubSub
[Figure: Global PubSub with a Replication Service, and Region A and Region B, each with PubSub instances in AZ01, AZ02 and AZ03.]
Publish Policy – No Replication (Sync)
[Figure: numbered steps 1-8 for publishing /DataExfil/Policies/{policyid} in the Global PubSub, involving the Routing Service, Selector Service, Madari Service, Replication Service and a Remote Regional P/S.]
Replication Service
[Figure: Madariservice/01 (Partition 1) and Replicationservice/01 with a Replication Queue (Request to Partition 1), an Operation Tracking Table and a Destination Tracker.]
Operation Tracking Table
Op Id   Status        Operation                              Replication Details
1001    Replicated    [add] /DataExfil/Policies/Policy1      {Dest1:Y, Dest2:Y, Dest3:Y}
1002    Replicating   [update] /DataExfil/Policies/Policy1   {Dest1:Y, Dest2:N, Dest3:Y}
1003    Committed     [remove] /DataExfil/Policies/Policy1   {Dest1:N, Dest2:N, Dest3:N}
Destination Tracker
Dest1: req1002
Dest2: req1001
Dest3: req1001
Global SF Ring
[Figure: global Service Fabric ring. Labels: Tenant1 to Tenant5, each with nodes n1-n5; regions uswest, useast, uswestcentral, asiasoutheast, europewest; vnet1 to vnet5.]
Major Performance KPIs
• 15 partitions
KPI
Write throughput     10k req/s
Read throughput      42k req/s
End to End latency   10ms/300ms (50%/99%)
Max subscribers      500K
• In a large region:
  • < 300k agents
  • < 100K VNets
  • ~1k read/sec, ~200 write/sec
Work in Progress
• Accelerating read flow
• End to end validation
Q & A
Thank you!