Using PubSub For Scheduling in Azure SDN
Qi Zhang (Azure Networking)

Azure Networking

[Figure: Azure networking overview. Labels: Azure Region ‘A’, Azure Region ‘B’, regional networks, WAN, Microsoft Edge, CDN, ExpressRoute, Internet exchanges. Customers shown: consumers (cable, carrier, mobile), enterprise and SMB, enterprise DC/corpnet.]

• DC Hardware: SmartNIC/FPGA, SONiC
• Services: Virtual Networks, Load Balancing, VPN Services, Firewall, DDoS Protection, DNS & Traffic Management
• Intra-Region: DC Networks, Regional Networks, Optical Modules
• WAN Backbone: Software WAN, Subsea Cables, Terrestrial Fiber, National Clouds
• Edge and ExpressRoute: Internet Peering, ExpressRoute
• CDN: Acceleration for applications and content
• Last Mile: E2E monitoring (Network Watcher, Network Performance Monitoring)

Microsoft Global Network

One of the largest private networks in the world
• 8,000+ ISP sessions
• 130+ edge sites
• 44 ExpressRoute locations
• 33,000 miles of lit fiber
• SDN Managed (SWAN, OLS)

[Figure: world map of the Microsoft global network. Legend: owned capacity, leased capacity, moving to owned, data center, edge site.]

DCs and Network sites not exhaustive

Software Defined Networking (SDN)
[Figure: Management API, central controllers, and commodity HW hosting vNICs.]

Azure SDN
• Basis of all NW virtualization in our datacenters
• Control Plane: centralized, hierarchical, highly scalable and available controllers
• Data Plane: host agent, drivers
• Key to flexibility and scale is SDN
[Figure labels: Host Agents, SmartNIC.]

PubSub in SDN
• Scale:
  • 40+ regions, hundreds of DCs, millions of servers
  • millions of VNets and LBs
• Flexible, scalable and efficient scheduling between controllers and agents
• Publisher/Subscriber pattern
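The publisher/subscriber pattern named above can be illustrated with a minimal in-process sketch. The `PubSub` class, the topic names, and the agent names are hypothetical and chosen only for illustration; they are not the service's actual API. The point is that the controller publishes once per topic and never needs to know which agents are listening.

```python
from collections import defaultdict


class PubSub:
    """Toy broker: maps a topic to the callbacks subscribed to it."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Deliver message to each subscriber of topic; return the count notified."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(message)
        return len(callbacks)


bus = PubSub()
inbox = {"agent-1": [], "agent-2": [], "agent-3": []}

# Agents subscribe only to the topics they care about (hypothetical names).
bus.subscribe("vnet-a", inbox["agent-1"].append)
bus.subscribe("vnet-a", inbox["agent-2"].append)
bus.subscribe("vnet-b", inbox["agent-3"].append)

# The controller publishes once; the broker fans the message out.
notified = bus.publish("vnet-a", {"ca": "10.0.0.1", "pa": "10.1.1.2"})
print(notified)          # 2
print(inbox["agent-3"])  # []
```

A real deployment would add persistence, partitioning, and delivery guarantees; this sketch only shows the decoupling between publisher and subscribers.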

[Figure: the Controller publishes to PubSub (publish flow); PubSub notifies Agent 1 ... Agent i ... Agent N (notification flow).]

Virtual Network in Azure
• Secure per customer virtual datacenter in the cloud
• Instantiate and configure complex topologies in minutes
• Rich security and networking services
[Figure: virtual networks joined by VNet Peering, with cross-premises connectivity and Internet access.]

CA-PA Mappings

• Payload, including CA, is encapsulated
• Traverses physical network

[Figure: a Directory Service holds the CA (customer address) to PA (provider address) tables used by the virtual switches VM-SW1, VM-SW2, and VM-SW3. A packet from CA 10.0.0.7 to CA 10.0.0.1 is wrapped in an outer header from PA 10.1.5.2 to PA 10.1.1.2, crosses the physical network, and is unwrapped on arrival.]

CA-PA pairs shown in the figure:
CA          PA
10.0.0.1    10.1.1.2
10.0.0.7    10.1.1.4
10.0.0.4    10.1.1.3
10.0.0.6    10.1.3.3
10.0.0.4    10.1.3.2
10.0.0.1    10.1.5.3
10.0.0.7    10.1.5.2
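The encapsulation step in the CA-PA figure can be sketched as a table lookup followed by wrapping. This is a simplified illustration, not the actual data plane: the `Packet` type and the `encapsulate` and `decapsulate` helpers are hypothetical, and the table holds only the two pairs needed for the packet shown in the figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str
    payload: object  # bytes for a customer packet, or an inner Packet once wrapped


# CA -> PA entries for one virtual network, taken from the figure.
CA_TO_PA = {
    "10.0.0.7": "10.1.5.2",
    "10.0.0.1": "10.1.1.2",
}


def encapsulate(inner, ca_to_pa):
    """Wrap a CA-addressed packet in an outer header addressed by PA."""
    return Packet(src=ca_to_pa[inner.src], dst=ca_to_pa[inner.dst], payload=inner)


def decapsulate(outer):
    """Recover the original CA-addressed packet at the destination host."""
    return outer.payload


inner = Packet("10.0.0.7", "10.0.0.1", b"hello")
outer = encapsulate(inner, CA_TO_PA)
print(outer.src, "->", outer.dst)    # 10.1.5.2 -> 10.1.1.2
print(decapsulate(outer) == inner)   # True
```

The lookup table is per virtual network because the same CA can be reused by different customers, which is why each agent needs the mappings for the VNets it hosts.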

[Figure legend: Host Node 1, Host Node 2, Host Node 3; data traffic; control messages.]

PubSub for CA-PA Mapping
Challenges:
• Scale: hundreds of thousands of agents, millions of VNets
• Scope: cluster, regional, global
• VNet size limit: 4K mappings -> 64K mappings, 500 peerings
• Provisioning speed: minutes -> seconds
[Figure: two designs side by side. Left: VNet Controller, Directory Service, Agent 1 ... Agent i ... Agent N. Right: VNet Controller, PubSub, Agent 1 ... Agent i ... Agent N.]

Scenario I: Global Peering
[Figure: Region A / VNET A and Region B / VNET B, each with a VNet Controller, a PubSub, and an Agent.]

Scenario II: DataExfil

[Figure: NRP publishes a Service Tunnel Policy through PubSub to the Host Agent, which holds it in a VNetPolicyCache. Requests to the Storage FE for Resource A, Resource B, and Resource C each carry resource "Metadata"; Resource B is marked BLOCK.]

Service Tunnel Policy:
{
  id: "policy-123",
  service: "xstore",
  subscription: "{guid}",
  accounts: [
    "users",
    "wiki.*"
  ],
  storage_type: "blob",
  access: "rw"
}

METADATA (resource A):
{
  subscription: "{guid}",
  account: "users",
  storage_type: "blob"
}

METADATA (resource B):
{
  subscription: "{guid}",
  account: "users",
  storage_type: "table"
}

METADATA (resource C):
{
  subscription: "{guid}",
  account: "wikimain",
  storage_type: "blob"
}

Overview
• Persisted KV Store
• Hierarchical name space
• Set watcher on a node
  • Single watcher
  • Bulk watcher
• Interfaces
  • Publish (batch/multi supported)
  • Subscribe
  • Notification
  • Query
• State Update/Delivery
  • Initial state
  • Subsequent state updates
[Figure: a Publisher issues Query (GetNodeInfo) and Publish (CreateNode, UpdateNode) against a tree with a Root, partition keys PK1 ... PKi ... PKn, and child nodes a1, a2, a3 ... n and b1 to b5; watchers (W) are set on nodes. A Subscriber issues Subscribe (watcher, bulkwatcher) and receives Notification events: Created, Deleted, DataChanged, ChildrenChanged.]
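To make the Overview concrete, here is a minimal in-memory sketch of a hierarchical key-value store with single and bulk watchers. It is an illustration under stated assumptions, not the service's implementation: the `NodeStore` class and its method names are hypothetical, the partition key and path are folded into one path string, initial state is delivered as `Created` events, and `ChildrenChanged` is left out for brevity.

```python
from collections import defaultdict


def is_under(path, prefix):
    """True if path equals prefix or lies below it in the namespace."""
    if prefix == "/":
        return True
    prefix = prefix.rstrip("/")
    return path == prefix or path.startswith(prefix + "/")


class NodeStore:
    """In-memory stand-in for the persisted, hierarchical KV store."""

    def __init__(self):
        self.nodes = {}                         # path -> data
        self.watchers = defaultdict(list)       # exact path -> callbacks
        self.bulk_watchers = defaultdict(list)  # path prefix -> callbacks

    def set_watcher(self, path, callback):
        """Single watcher: events for exactly one node."""
        self.watchers[path].append(callback)
        if path in self.nodes:  # deliver initial state
            callback(("Created", path, self.nodes[path]))

    def set_bulk_watcher(self, prefix, callback):
        """Bulk watcher: events for every node at or below prefix."""
        self.bulk_watchers[prefix].append(callback)
        for path in sorted(self.nodes):  # deliver initial state
            if is_under(path, prefix):
                callback(("Created", path, self.nodes[path]))

    def publish(self, path, data):
        event = "DataChanged" if path in self.nodes else "Created"
        self.nodes[path] = data
        self._notify((event, path, data))

    def delete(self, path):
        if path in self.nodes:
            del self.nodes[path]
            self._notify(("Deleted", path, None))

    def _notify(self, event):
        path = event[1]
        for callback in self.watchers.get(path, []):
            callback(event)
        for prefix, callbacks in self.bulk_watchers.items():
            if is_under(path, prefix):
                for callback in callbacks:
                    callback(event)


store = NodeStore()
store.publish("/Vnet/v1/mappings/ipv4/10.0.0.1", "10.1.1.2")

events = []
store.set_bulk_watcher("/Vnet/v1", events.append)             # initial state
store.publish("/Vnet/v1/mappings/ipv4/10.0.0.7", "10.1.5.2")  # new node
store.publish("/Vnet/v1/mappings/ipv4/10.0.0.1", "10.1.5.3")  # update
store.delete("/Vnet/v1/mappings/ipv4/10.0.0.7")
store.publish("/Vnet/v2/mappings/ipv4/10.0.0.1", "10.1.1.4")  # other VNet, not delivered

print([kind for kind, _, _ in events])
# ['Created', 'Created', 'DataChanged', 'Deleted']
```

A subscriber that joins late still converges on the current state, because the initial snapshot and the later updates arrive through the same callback.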

4 Microservices:
• Stateless Service
  • Routing Service
  • Notification Service
• Stateful Service
  • Selector Service
  • Madari Service

SDN PubSub Service
[Figure: request flow through the SDN PubSub Service, with steps numbered 1 to 6 on both the publish side and the subscribe side.]
• Publisher (VNet Controller) publishes:
  • PK: /Vnet/{VnetId1}
  • Path: /mappings/ipv4/{CA1}
  • Data (Bond message): {PA1}
• Subscriber (Agent) calls SetBulkWatcher:
  • PK: /Vnet/{VnetId2}
  • Path: /
• Partition key /Vnet/{VnetId1} is held by MadariService_02; partition key /Vnet/{VnetId2} is held by MadariService_03.
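One plausible reading of the flow figure is that the stateless Routing Service asks the Selector Service which Madari instance owns a partition key and then forwards the request there. The sketch below illustrates that reading only; the slides do not spell out what each numbered step does, and the class and method names are hypothetical. The literal strings `VnetId1`, `CA1`, and `PA1` stand in for the placeholders on the slide.

```python
class SelectorService:
    """Stateful: remembers which Madari instance owns each partition key."""

    def __init__(self, assignments):
        self._assignments = dict(assignments)

    def lookup(self, partition_key):
        return self._assignments[partition_key]


class MadariService:
    """Stateful: stores the nodes for the partition keys it owns."""

    def __init__(self, name):
        self.name = name
        self.nodes = {}  # (partition key, path) -> data

    def write(self, partition_key, path, data):
        self.nodes[(partition_key, path)] = data


class RoutingService:
    """Stateless: resolves the owner on every request, then forwards."""

    def __init__(self, selector, instances):
        self._selector = selector
        self._instances = {instance.name: instance for instance in instances}

    def publish(self, partition_key, path, data):
        owner = self._selector.lookup(partition_key)
        self._instances[owner].write(partition_key, path, data)
        return owner


madari_02 = MadariService("MadariService_02")
madari_03 = MadariService("MadariService_03")
selector = SelectorService({
    "/Vnet/VnetId1": "MadariService_02",
    "/Vnet/VnetId2": "MadariService_03",
})
router = RoutingService(selector, [madari_02, madari_03])

owner = router.publish("/Vnet/VnetId1", "/mappings/ipv4/CA1", "PA1")
print(owner)  # MadariService_02
```

Keeping the routing layer stateless means any node can accept a request, while ownership of the data stays with the stateful services.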

Madari Selector Service: Data Partitioning
[Figure: AddPartitionKey(“baz”) is sent to the Selector Service, which works with MadariService_01, MadariService_02, and MadariService_03; steps numbered 1 to 3.]

Partition Key    Madari Instance
“foo”            MadariService_01
“bar”            MadariService_02
…..              …..
“baz”            MadariService_01

Madari Instance     Total Data Size
MadariService_01    1.05G
MadariService_02    1.9G
MadariService_03    1.6G

Subscription through Notification Service

[Figure: Madari instances (MadariService_02, MadariService_04, …) each hold a Root with VNet nodes vnet1 to vnet4, labeled A to D. Notification Service instances (NotificationService_03, NotificationService_08, …) carry the watched VNets (vnet1, vnet2, vnet3) to Subscriber I, Subscriber II, and Subscriber III.]

Service Fabric Ring

• Service Fabric ring
  • Multiple PaaS tenants form a Service Fabric ring
  • Service Fabric ring is on a VNET
• PubSub as Service Fabric application
  • Routing Service/Notification Service
    • Stateless
    • On every node
  • MadariService/MadariSelectorService(s)
    • Stateful
    • Min 3, target 7
[Figure: Tenant1 (nodes n1 to n5, Cluster1), Tenant2 (nodes n6 to n10, Cluster2), Tenant3 (nodes n11 to n15, Cluster3).]

Client Libraries
• Managed Libraries
  • Madari.ClientLibrary
    • Publishing through WCF channel
  • Reliable Publisher
    • IMOS-based publishers
    • User implements:
      • Commit hooks
      • Handler
    • Nuget package:
      Madari.ReliablePublisher.RSL
      Madari.ReliablePublisher.ServiceFabric
• Native Libraries
  • Publish
    • Nuget package: Madari.MadariFrontEnd.Native
  • Subscribe
    • Nuget package: Madari.Subscriber.Native
[Figure: reliable publisher pipeline. Marking objects modified triggers commit hooks in the IMOS Lib; the runtime persists reliable tasks to a repo; a worker picks up tasks and executes the handler, retrying on failure and deleting executed tasks on success.]

Hierarchical PubSub Infrastructure

Resource Scope => PubSub Service Scope

Resource            Scope       Publisher          Subscriber
CA-PA mapping       regional    VNet Controller    Agent
DataExfil policy    global      NRP                Agent
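The mapping from resource scope to PubSub service scope can be written as a small lookup. This is only a sketch of the idea in the table: the function name and the instance identifiers (`global-pubsub`, `regional-pubsub/<region>`) are hypothetical and not real endpoint names.

```python
# Rows from the "Resource Scope => PubSub Service Scope" table.
RESOURCE_SCOPES = {
    "CA-PA mapping": {
        "scope": "regional",
        "publisher": "VNet Controller",
        "subscriber": "Agent",
    },
    "DataExfil policy": {
        "scope": "global",
        "publisher": "NRP",
        "subscriber": "Agent",
    },
}


def pubsub_instance_for(resource, region):
    """Pick the PubSub instance whose scope matches the resource's scope."""
    scope = RESOURCE_SCOPES[resource]["scope"]
    if scope == "global":
        return "global-pubsub"              # hypothetical identifier
    return f"regional-pubsub/{region}"      # hypothetical identifier


print(pubsub_instance_for("CA-PA mapping", "uswest"))     # regional-pubsub/uswest
print(pubsub_instance_for("DataExfil policy", "uswest"))  # global-pubsub
```

Matching the service scope to the resource scope keeps regional data such as CA-PA mappings inside the region, while global resources such as DataExfil policies go to a single global instance.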

[Figure: DataExfil policy is published to the Global PubSub; CA->PA mappings are published to each of three Regional PubSubs.]

Global PubSub
[Figure: Global PubSub with a Replication Service spanning Region A and Region B; each region runs PubSub instances in AZ01, AZ02, and AZ03.]

Publish Policy – No Replication (Sync)

[Figure: publishing /DataExfil/Policies/{policyid}, with steps numbered 1 to 8 through the Routing Service, Selector Service, and Madari Service; the Replication Service and the remote regional PubSub (Regional P/S) are also shown.]

Global PubSub Replication Service
[Figure: Madariservice/01 (Partition 1) and Replicationservice/01, with a replication queue, a request to Partition 1, and a destination tracker.]

Operation Tracking Table
Op Id    Status         Operation                               Replication Details
1001     Replicated     [add] /DataExfil/Policies/Policy1       {Dest1:Y, Dest2:Y, Dest3:Y}
1002     Replicating    [update] /DataExfil/Policies/Policy1    {Dest1:Y, Dest2:N, Dest3:Y}
1003     Committed      [remove] /DataExfil/Policies/Policy1    {Dest1:N, Dest2:N, Dest3:N}

Destination Tracker
Dest1: req1002
Dest2: req1001
Dest3: req1001

Global SF Ring

[Figure: global Service Fabric ring formed by five tenants of five nodes each (n1 to n5): Tenant1 (uswest, vnet1), Tenant2 (useast, vnet2), Tenant3 (uswestcentral, vnet3), Tenant4 (asiasoutheast, vnet4), Tenant5 (europewest, vnet5).]

Major Performance KPIs

• 15 partitions

KPI
Write throughput      10k req/s
Read throughput       42k req/s
End to End latency    10ms/300ms (50%/99%)
Max subscribers       500K

• In a large region:
  • < 300k agents
  • < 100K VNets
  • ~1k read/sec, ~200 write/sec

Work in Progress
• Accelerating read flow
• End to end validation

Q & A

Thank you!