Performance of ORBs on Switched Fabric Transports
Victor Giddings
Objective Interface Systems

© 2001 Objective Interface Systems, Inc.

Switched Fabrics

• High-speed interconnects
  – High-bandwidth, low-latency switched circuits
  – Adaptive routing through alternate paths
  – DMA transfers between memories of processors

[Figure: block diagram with two processors, two memories, and a cross-bar switch]

Motivation: ORB Performance on Ethernet

• Most ORB performance studies have used TCP over Ethernet
  – Most common use of CORBA
  – Well-known performance
• Prediction of an ORB's performance using TCP over Ethernet
  – CPU speed is the largest determinant of performance
    ◦ Startup latency is dominated by processing in the protocol stack
    ◦ Data transfer time is dominated by marshalling (copy) time
  – Extrapolating performance is a matter of scaling CPU speeds
• Problem: how to predict ORB performance on other transports

Context: ORB Switched Fabric Transports

• ORBexpress transports developed for two different switched fabric technologies
  – Mercury Computing’s RACEway
    ◦ Joint development with Mercury Computing
  – Myrinet (CSPI & Myricom)
• Performance results show
  – Extremely low latency
  – Low variability

Example Latency Results CSPI 2841

[Figure: ORBexpress Latency - CSPI 2841. Latency in usec (0 to 250) versus bytes transferred (0 to 5,000). Series: Switched Fabric Transport - LongSeq; TCP - LongSeq.]

Comparison - Latency

[Figure: Comparison of Model 2641 vs. Model 2841 Latency. Latency in usec (1 to 1,000,000, logarithmic) versus bytes transferred (1 to 10,000,000, logarithmic). Series: 2841 DoubleSeq; 2641 DoubleSeq.]

Ratio of Latencies

[Figure: Ratio - Latency Model 2641 vs. Model 2841. Ratio (0 to 4) versus bytes transferred (0 to 2,000,000).]

Performance Model – First Attempt

• Use “Ethernet” model
  – Latency = ORB “overhead” + Transport propagation delay
• ORBexpress provides a “mirror” transport
  – “Reflects” requests back to collocated sender
  – Directly measures ORB overhead including marshalling
• Simple model:
  – Latency = ORB latency + Number of bytes * Propagation per byte
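The simple model is linear in the message size, so it can be written as a one-line function. The following is a minimal sketch; the function name, parameter names, and sample numbers are hypothetical and are not measurements from this study.

```python
def simple_latency_usec(num_bytes, orb_latency_usec, propagation_usec_per_byte):
    """First-attempt model: ORB latency plus a per-byte propagation term.

    orb_latency_usec is the ORB overhead for this message size, for example
    as measured with a mirror transport (marshalling included).
    """
    if num_bytes < 0:
        raise ValueError("num_bytes must be non-negative")
    return orb_latency_usec + num_bytes * propagation_usec_per_byte


# Illustrative values only (assumptions, not results from the slides).
ORB_LATENCY_USEC = 50.0
PROPAGATION_USEC_PER_BYTE = 0.01

for size in (0, 1_000, 1_000_000):
    predicted = simple_latency_usec(size, ORB_LATENCY_USEC, PROPAGATION_USEC_PER_BYTE)
    print(f"{size} bytes: {predicted:.1f} usec")
```

Passing the ORB latency in as an argument keeps the measured part (mirror transport) separate from the assumed part (per-byte propagation cost).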

Performance Model – Result of First Attempt

[Figure: Latency in usec (0 to 80,000) versus bytes transferred (0 to 2,000,000). Series: Actual; Simple Prediction.]

Performance Model - Refinement

• Refined hardware block diagram

[Figure: block diagram with blocks labelled Memory, Processor, ASIC, Cross-bar Switch, and Memory]

• Account for propagation delay on memory bus
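One possible reading of this refinement is to add a per-byte memory-bus term to the linear latency model. The slides do not state the refined formula, so the additive form, the `bus_crossings` parameter, and every name below are assumptions made for illustration.

```python
def refined_latency_usec(num_bytes, orb_latency_usec,
                         fabric_usec_per_byte, bus_usec_per_byte,
                         bus_crossings=2):
    """Linear latency model with an assumed memory-bus term.

    bus_crossings is hypothetical: by default one memory-bus crossing on
    the sending node and one on the receiving node.
    """
    if num_bytes < 0 or bus_crossings < 0:
        raise ValueError("num_bytes and bus_crossings must be non-negative")
    per_byte_usec = fabric_usec_per_byte + bus_crossings * bus_usec_per_byte
    return orb_latency_usec + num_bytes * per_byte_usec


# Illustrative values only (assumptions, not results from the slides).
print(refined_latency_usec(1_000_000, 50.0, 0.01, 0.005))
```

With `bus_usec_per_byte` set to zero the function reduces to the simple model, which makes it easy to compare the two predictions against measured latencies.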

Performance Model – Result of Refinement

[Figure: Latency in usec (0 to 80,000) versus bytes transferred (0 to 2,000,000). Series: Simple Prediction; Actual; Refined Prediction.]

Performance Model – Prediction of Ratios

[Figure: Ratio - 2641 vs. 2841. Ratio (0 to 5) versus bytes transferred (0 to 2,000,000). Series: Actual; Refined Prediction.]

Bandwidth – ORB over Switched Fabric Transport

[Figure: ORBexpress Throughput - 2841. Throughput in MB/s (0 to 45) versus bytes transferred (0 to 2,000,000). Series: MyriTransport - LongSeq.]

Bandwidth

• Attained bandwidth is a small part of the available transport bandwidth

• Mirror transport bandwidth offers a clue to the cause
  – Inverse bandwidths add (the time per byte of each stage adds)
  – So transport bandwidth must be combined with “ORB bandwidth”

• ORB bandwidth factors
  – Startup latency: insignificant for large byte counts
  – Memory copies
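Because the time per byte of each stage adds, the combined bandwidth is the reciprocal of the sum of the inverse bandwidths. A minimal sketch of that combination follows; the function name and the sample figures are hypothetical, not values from this study.

```python
def combined_bandwidth_mb_per_s(*bandwidths_mb_per_s):
    """Combine stages traversed in series: inverse bandwidths add."""
    if not bandwidths_mb_per_s or any(b <= 0 for b in bandwidths_mb_per_s):
        raise ValueError("need one or more positive bandwidths")
    seconds_per_mb = sum(1.0 / b for b in bandwidths_mb_per_s)
    return 1.0 / seconds_per_mb


# Illustrative values only: a fast transport paired with a slower ORB stage.
transport_mb_per_s = 160.0  # assumed transport bandwidth
orb_mb_per_s = 40.0         # assumed "ORB bandwidth"
print(combined_bandwidth_mb_per_s(transport_mb_per_s, orb_mb_per_s))
```

The result is always below the slowest stage, which is consistent with attained bandwidth being only a small part of the raw transport bandwidth.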

Memory Bandwidth – Copy Inverse Rate

[Figure: Memory Copy Inverse Rate (2841 400MHz PPC7400 with Altivec). Inverse rate in nanosec/Byte (0.00 to 16.00) versus bytes copied (0 to 2,000,000).]

ORB Bandwidth

• ORB Bandwidth
  – Measures the ability of the ORB to transfer volumes of data

• ORB bandwidth is dominated by
  – The number of copies
  – The memory bandwidth of the processor

• Increasing an ORB's bandwidth
  – Requires elimination of copying
  – Motivation for “High Performance Enablers” RFP
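The dependence on copy count can be sketched by treating each copy as costing a fixed number of nanoseconds per byte. This is an assumed simplification, and the function name, parameters, and sample inverse rate are hypothetical rather than taken from the measurements.

```python
def orb_bandwidth_mb_per_s(num_copies, copy_nsec_per_byte):
    """Assumed model: every byte is copied num_copies times, each copy
    costing copy_nsec_per_byte (the memory copy inverse rate)."""
    if num_copies < 1 or copy_nsec_per_byte <= 0:
        raise ValueError("need at least one copy and a positive inverse rate")
    nsec_per_byte = num_copies * copy_nsec_per_byte
    # 1 byte per nanosecond equals 1000 MB/s (taking 1 MB as 10**6 bytes).
    return 1000.0 / nsec_per_byte


# Illustrative inverse rate only (an assumption, not a measured value).
for copies in (1, 2, 4):
    print(copies, orb_bandwidth_mb_per_s(copies, 10.0))
```

Under this model, halving the number of copies doubles the ORB bandwidth, which is why eliminating copies is the lever named above.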

Bandwidth - Predicted

[Figure: Predicted vs. Attained Bandwidth. Bandwidth in MBps (0 to 70) versus bytes transferred (0 to 2,000,000). Series: Predicted Bandwidth; Attained Bandwidth.]

Summary

• Examined two aspects of ORB performance over switched fabric transports

• Latency
  – Prediction is more complicated than for TCP/IP over Ethernet or Loopback
  – More complex model needed

• Bandwidth
  – Prediction is more straightforward
  – Introduced concept of “ORB Bandwidth”
  – ORB bandwidth is dependent on the number of copies
