Network Protocols: Myths, Missteps, and Mysteries

Radia Perlman, EMC

1 Network Protocols

• A lot of what we all know

2 Network Protocols

• A lot of what we all know…is false!

3 This field is really confusing

• “Common knowledge” – Need IP+Ethernet because IP is “layer 3” and Ethernet is “layer 2” – Ethernet is CSMA/CD – Replacing IPv4 with ISO’s CLNP in 1992 would have been a traumatic change; transitioning to IPv6 is just “a new version of IP” – Security is built into IPv6, but is just an add-on to IPv4 – ISO had too many layers – SDN is revolutionary stuff

4 How networking tends to be taught • Memorize these standards documents, or the arcane details of some implementation that got deployed • Nothing else ever existed • Except possibly to make vague, nontechnical, snide comments about other stuff

5 My philosophy on teaching (and books) • Look at each conceptual problem, like how to autoconfigure an address • Talk about a bunch of approaches to that, with tradeoffs • Then mention how various protocols (e.g., IPv4, IPv6, Appletalk, IPX, DECnet, …) solve it

6 But some professors say…

• Why is there stuff in here that my students don’t “need to know”?

7 Where does confusion come from? • Hype, FUD • People repeating stuff • Buzzwords with no clear definition – Or persons A and B have a clear definition in mind, but different from each other • Or the world changing, so something that used to be true is no longer true

8 Things are so confusing

• Comparing technology A vs B – Nobody knows both of them – Somebody mumbles some vague marketing thing, and everyone repeats it – Both A and B are moving targets

9 What about “facts”?

• What if you measure A vs B?

10 What about “facts”?

• What if you measure A vs B? • What are you actually measuring?...one implementation of A vs one implementation of B

11 What about “facts”?

• What if you measure A vs B? • What are you actually measuring?...one implementation of A vs one implementation of B • So don’t believe something unless you can figure out a plausible property of the two protocols that would make that true

12 Fostering and Practicing Critical Thinking • Don’t believe something (and certainly don’t repeat it!) unless you understand something intrinsic that makes it true • Encourage “naïve” questions – Delight in teaching what “everyone knows” – Cherish the chance to question your basic assumptions – Be a role model by asking questions yourself

13 An example of something confusing

14 What is Ethernet?

15 The story of Ethernet

• What is Ethernet? • How does it compare/work with IP? • People talk about “layer 2 solutions” vs “layer 3 solutions”. What’s that about?

16 So, first we need to review network “layers” • ISO credited with naming the layers • It’s just a way of thinking about networks

17 Perlman’s View of ISO Layers

• 1: Physical

18 Perlman’s View of ISO Layers

• 1: Physical • 2: Data link: (neighbor to neighbor)

19 Perlman’s View of ISO Layers

• 1: Physical • 2: Data link: (neighbor to neighbor) • 3: Network: create path, forward data (e.g., IP)

20 Perlman’s View of ISO Layers

• 1: Physical • 2: Data link: (neighbor to neighbor) • 3: Network: create path, forward data (e.g., IP) • 4: Transport: end-to end (e.g., TCP, UDP)

21 Perlman’s View of ISO Layers

• 1: Physical • 2: Data link: (neighbor to neighbor) • 3: Network: create path, forward data (e.g., IP) • 4: Transport: end-to end (e.g., TCP, UDP) • 5 and above:

22 Perlman’s View of ISO Layers

• 1: Physical • 2: Data link: (neighbor to neighbor) • 3: Network: create path, forward data (e.g., IP) • 4: Transport: end-to end (e.g., TCP, UDP) • 5 and above: …. boring

23 So…why are we forwarding Ethernet packets? • Ethernet was intended to be layer 2 • Just between neighbors – not forwarded

24 So…why are we forwarding Ethernet packets? • Ethernet was intended to be layer 2 • Just between neighbors – not forwarded • What exactly is Ethernet?

25 So…why are we forwarding Ethernet packets? • Ethernet was intended to be layer 2 • Just between neighbors – not forwarded • What exactly is Ethernet? • No way to understand it without seeing the history

26 Back then…

• I was the designer of layer 3 of DECnet – the protocol I designed was adopted by ISO and renamed IS-IS • Layer 3 calculates paths, and forwards packets • Layer 2 just marked beginning and end of packet, and checksum (links between two nodes)

27 Router/Switch

[Figure: a router/switch with its forwarding table and a packet]

28 Computing the Forwarding Table

29 Computing the Forwarding Table

• Could be done with a central – ATM, Infiniband, … • Or with a distributed algorithm

30 Distributed Routing Algorithms

31 Distributed Routing Protocols

• Rtrs exchange info • Use it to calculate forwarding table

32 Link State Routing

• meet nbrs • Construct Link State Packet (LSP) – who you are – list of (nbr, cost) pairs • Broadcast LSPs to all rtrs • Store latest LSP from each rtr • Compute Routes (breadth first, i.e., “shortest path” first—well known and efficient algorithm)

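33 [Figure: seven routers A through G joined by links with costs; the link state database below lists each router's nbr/cost pairs]

A: B/6, D/2
B: A/6, C/2, E/1
C: B/2, F/2, G/5
D: A/2, E/2
E: B/1, D/2, F/4
F: C/2, E/4, G/1
G: C/5, F/1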
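The "Compute Routes" step of link state routing can be sketched as follows. This is a minimal illustration, not any router's actual code: it runs Dijkstra's shortest path algorithm over the stored LSPs, using the link costs from the slide 33 example. The names `LSP_DB` and `forwarding_table` are made up for this sketch.

```python
import heapq

# Link state database: the latest LSP from each router, as {neighbor: cost}.
# Costs are the ones listed in the slide 33 example.
LSP_DB = {
    "A": {"B": 6, "D": 2},
    "B": {"A": 6, "C": 2, "E": 1},
    "C": {"B": 2, "F": 2, "G": 5},
    "D": {"A": 2, "E": 2},
    "E": {"B": 1, "D": 2, "F": 4},
    "F": {"C": 2, "E": 4, "G": 1},
    "G": {"C": 5, "F": 1},
}


def forwarding_table(lsp_db, me):
    """Return {destination: (cost, first_hop)} as seen from router `me`."""
    table = {}
    # Heap entries: (cost so far, node, first hop used when leaving `me`).
    heap = [(0, me, None)]
    while heap:
        cost, node, first_hop = heapq.heappop(heap)
        if node in table:
            continue  # already settled at a cost that is at least as good
        table[node] = (cost, first_hop)
        for nbr, link_cost in lsp_db[node].items():
            if nbr not in table:
                # Leaving `me` directly, the neighbor itself is the first hop.
                hop = nbr if node == me else first_hop
                heapq.heappush(heap, (cost + link_cost, nbr, hop))
    del table[me]  # no entry needed for ourselves
    return table


if __name__ == "__main__":
    for dest, (cost, hop) in sorted(forwarding_table(LSP_DB, "C").items()):
        print(f"to {dest}: cost {cost} via {hop}")
```

Every router runs the same computation on the same database, which is why the per-router forwarding tables are consistent with each other.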

34 Back to history

• I was doing layer 3 • Then along came Ethernet

35 The story of Ethernet

• CSMA/CD • Spanning Tree • TRILL

36 CSMA/CD Ethernet

• CSMA/CD…shared bus, peers, no master – CS: carrier sense (don’t interrupt) – MA: multiple access (you’re sharing the air!) – CD: listen while talking, for collision • Lots of papers about goodput under load only about 60% or so because of collisions • Limited in # of nodes (maybe 1000), distance (kilometer or so)

37 I saw Ethernet as a new type of link • I had to modify the routing protocol to accommodate this type of link • For instance, the concept of “pseudonodes” and “designated routers” so that instead of n² links, it’s n links with n+1 nodes

38 Pseudonodes

[Figure: instead of a link between every pair of routers on the Ethernet, use a pseudonode with one link to each router]
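The saving from a pseudonode can be counted directly. The sketch below is only an illustration: the function names and the pseudonode label "LAN" are assumptions, and the full mesh is counted as n(n-1)/2 pairwise links, which grows on the order of n².

```python
from itertools import combinations


def full_mesh_links(routers):
    """Without a pseudonode: one link for every pair of routers on the LAN."""
    return list(combinations(sorted(routers), 2))


def pseudonode_links(routers, pseudonode="LAN"):
    """With a pseudonode: each router reports one link, to the pseudonode."""
    return [(r, pseudonode) for r in sorted(routers)]


if __name__ == "__main__":
    routers = [f"R{i}" for i in range(10)]
    mesh = full_mesh_links(routers)
    star = pseudonode_links(routers)
    nodes = {end for link in star for end in link}
    print(len(mesh), "links in the full mesh")
    print(len(star), "links and", len(nodes), "nodes with a pseudonode")
```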

39 But Ethernet was a link in a network, not a network • I wish they’d called it “Etherlink”

40 Original Invention

• A way of cheaply hooking together lots of nodes on a single link • Everyone could directly talk to everyone • No forwarding

41 Ethernet packet

dest source data

42 Layer 3 Packet

dest source cnt data

43 It’s easy to confuse Ethernet with layer 3 • It looks sort of the same • No hop count field… • Flat addresses (no way to summarize a bunch of addresses in a forwarding table) • But it never occurred to the Ethernet inventors that anyone would be forwarding an Ethernet packet

44 So…why are we forwarding Ethernet packets?

45 How Ethernet evolved from CSMA/CD to spanning tree • People got confused, and thought Ethernet was a network (layer 3) instead of a link (layer 2) – Link (layer 2) = nbr-nbr – Network (layer 3) = forward along a path • Built apps on Ethernet, with no layer 3 • Router can’t forward without the right envelope • I tried to argue…

46 Problem Statement (from about 1983)

Need something that will sit between two Ethernets, and let a station on one Ethernet talk to another

A C

47 Problem Statement (from about 1983)

Need something that will sit between two Ethernets, and let a station on one Ethernet talk to another

A C

Without modifying the endnode, or Ethernet packet, in any way!

48 The basic concept

• Bridge just listens “promiscuously”, and forwards to each other port(s) when the ether is free • Learn (Source=S, input port). Once learned, if see a packet with destination=S, know where to forward it (rather than “all the ports”) • This requires a topology with only one path between any pair of nodes
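The learning behavior can be sketched in a few lines. This is a simplified model under stated assumptions: the class and method names are hypothetical, ports are plain integers, and queuing until "the ether is free" and aging of learned entries are left out.

```python
class LearningBridge:
    """Toy transparent bridge: learns (source, input port), then forwards."""

    def __init__(self, ports):
        self.ports = list(ports)
        self.table = {}  # station address -> port it was last seen on

    def receive(self, in_port, src, dst):
        """Return the list of ports the packet should be forwarded to."""
        # Learn: the source must be reachable through the arrival port.
        self.table[src] = in_port
        out_port = self.table.get(dst)
        if out_port is None:
            # Destination not learned yet: send to all the other ports.
            return [p for p in self.ports if p != in_port]
        if out_port == in_port:
            return []  # destination is on the arrival side; nothing to do
        return [out_port]


if __name__ == "__main__":
    bridge = LearningBridge(ports=[1, 2, 3])
    print(bridge.receive(1, src="A", dst="C"))  # C unknown: all other ports
    print(bridge.receive(2, src="C", dst="A"))  # A was learned on port 1
    print(bridge.receive(1, src="A", dst="C"))  # C was learned on port 2
```

With a loop in the topology the same source would show up on two ports and packets would circulate, which is why the concept needs only one path between any pair of nodes.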

49 Basic concept

[Figure: bridge learning example with stations A, C, D, E, J, and X]

50 How about require physical tree topology? • What about miscabling? • What about backup paths? • So…spanning tree algorithm – Allowing any physical topology – Pruning to a loop-free topology for sending data

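51 Physical Topology

[Figure: mesh of LANs and bridges with stations A and X; bridges numbered 2, 3, 4, 5, 6, 7, 9, 10, 11, 14]

52 Pruned to Tree

[Figure: the same topology pruned to a loop-free tree]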

53 Algorhyme

I think that I shall never see
A graph more lovely than a tree.
A tree whose crucial property
Is loop-free connectivity.
A tree which must be sure to span
So packets can reach every LAN.
First the root must be selected,
By ID it is elected.
Least cost paths from root are traced,
In the tree these paths are placed.
A mesh is made by folks like me.
Then bridges find a spanning tree.
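The two steps in the poem (elect the root by ID, then keep least cost paths from the root) can be sketched centrally. The real algorithm is distributed and runs between the bridges; this sketch only computes the tree it would converge to, under simplifying assumptions: bridges are connected directly to each other rather than through LANs, the lowest ID is elected root, and ties between equal-cost paths go to the parent with the lowest ID. The function name is hypothetical.

```python
import heapq


def spanning_tree(links):
    """links: {(bridge_a, bridge_b): cost}. Returns (root, set of tree links)."""
    nbrs = {}
    for (a, b), cost in links.items():
        nbrs.setdefault(a, {})[b] = cost
        nbrs.setdefault(b, {})[a] = cost
    root = min(nbrs)  # "By ID it is elected" (assumed: lowest ID wins)
    parent = {}
    # Heap entries: (cost from root, parent ID, bridge). Putting the parent ID
    # second breaks ties between equal-cost paths in favor of the lower ID.
    heap = [(0, root, root)]
    while heap:
        cost, via, bridge = heapq.heappop(heap)
        if bridge in parent:
            continue  # a least cost path to this bridge is already in the tree
        parent[bridge] = via
        for nbr, link_cost in nbrs[bridge].items():
            if nbr not in parent:
                heapq.heappush(heap, (cost + link_cost, bridge, nbr))
    tree = {tuple(sorted((b, p))) for b, p in parent.items() if b != root}
    return root, tree


if __name__ == "__main__":
    # A triangle 1-2-3 (a loop) plus bridge 4 hanging off bridge 3.
    mesh = {(1, 2): 1, (2, 3): 1, (1, 3): 1, (3, 4): 1}
    root, tree = spanning_tree(mesh)
    print("root:", root)
    print("links kept:", sorted(tree))
    print("links pruned:", sorted(set(mesh) - tree))
```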

54 Bother with spanning tree?

• Maybe just tell customers “don’t do loops” • First bridge sold...

55 First Bridge Sold

A C

56 CSMA/CD died long ago

• A variant is used on wireless links • But wired Ethernet quickly became spanning tree • So “Ethernet” today has nothing to do with all the papers about CSMA/CD

57 Why spanning tree is unstable

• Spanning tree algorithm is unstable if bridges cannot look at all incoming packets – My spec required sufficient computation power to keep up with links – IEEE removed this, and nets do “melt down”

58 Next stage in Ethernet evolution

59 Why not get rid of Ethernet and use only IP? • World has converged to IP as layer 3, and it’s in the network stacks

60 Why not get rid of Ethernet and use only IP? • World has converged to IP as layer 3, and it’s in the network stacks • If IP were designed slightly differently, we wouldn’t need Ethernet anymore • Just put your data in a layer 3 envelope!

61 What’s wrong with IP?

• IP is configuration intensive, moving VMs disruptive – IP protocol requires every link to have a unique block of addresses – Routers need to be configured with which addresses are on which ports – If something moves, its address changes

62 Layer 3 doesn’t have to work that way!

• CLNP / DECnet...20 byte address – Bottom level of routing is a whole cloud with the same 14-byte prefix – Routing is to 6 byte ID inside the cloud – Enabled by “ES-IS” protocol, where endnodes periodically announce themselves to the routers

Address layout: 14 bytes (prefix shared by all nodes in large cloud) | 6 bytes (endnode ID)
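A small sketch of the 14 + 6 byte split. The function names and the example byte values are made up for illustration; only the field sizes come from the slide.

```python
PREFIX_LEN = 14  # shared by every node in the cloud
ID_LEN = 6       # identifies the endnode inside the cloud


def split_address(addr: bytes):
    """Split a 20-byte address into (cloud prefix, endnode ID)."""
    if len(addr) != PREFIX_LEN + ID_LEN:
        raise ValueError("expected a 20-byte address")
    return addr[:PREFIX_LEN], addr[PREFIX_LEN:]


def same_cloud(a: bytes, b: bytes) -> bool:
    """Routers outside the cloud only ever compare the 14-byte prefix."""
    return split_address(a)[0] == split_address(b)[0]


if __name__ == "__main__":
    cloud = bytes(range(14))                # made-up prefix
    node1 = cloud + bytes.fromhex("0200000000aa")
    node2 = cloud + bytes.fromhex("0200000000bb")
    print(same_cloud(node1, node2))         # both are inside the same cloud
    print(split_address(node1)[1].hex())    # the 6-byte endnode ID
```

Because the endnode ID is routed on inside the cloud, a node that moves within the cloud keeps its whole address.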

63 IP Plus Ethernet vs CLNP

IP Plus Ethernet: • IP gets you to Ethernet “link” • Need to do ARP to get final Ethernet address • Ethernet: CSMA/CD? Spanning tree? TRILL?

CLNP: • Top 14 bytes of CLNP address gets you to “cloud” • Bottom 6 bytes of CLNP: true layer 3 routing inside circle

64 Hierarchy

[Figure: address hierarchy with one prefix per link (like IP) versus one prefix per campus; prefix labels in the figure: 2*, 22*, 25*, 28*, 292*, 293*]

65 Worst decision ever

• 1992… could have adopted CLNP • Easier to move to a new layer 3 back then – Internet smaller – Not so mission critical – IP hadn’t yet (out of necessity) invented DHCP, NAT, so CLNP gave understandable advantages • CLNP much cleaner than IP; wouldn’t need ARP, wouldn’t need Ethernet/spanning tree • IPv6 still not better than CLNP! (IPv6 also routes to a link, so will require Ethernet clouds, and ARP-like thing)

66 Ethernet looks to IP like a single IP link • So Ethernet provides a large cloud in which switches can autoconfigure, and nodes (e.g., VMs) can move around transparently • But don’t want limitations of spanning tree

67 So Bridges were a kludge, digging out of a bad decision • Why are they so popular? – plug and play – simplicity – high performance • Will they go away? – because of idiosyncrasy of IP, need it for lower layer.

68 Note some things about bridges

• Certainly don’t get optimal source/destination paths • Temporary loops are a disaster – No hop count – Exponential proliferation • Unstable if bridges can’t keep up with wire speed • But they are wonderfully plug-and-play

69 Next step in evolution: TRILL

70 Next step in evolution: TRILL

• We’re stuck with IP, meaning we’re also stuck with Ethernet. Can we improve Ethernet to eliminate limitations of spanning tree? • Yes, because length restriction on Ethernet packet is now relaxed, so we can add an extra header

71 TRILL

• TRansparent Interconnection of Lots of Links • Want best of both worlds – From Ethernet: autoconfiguration, and flat address space – From layer 3: Optimal paths, multipathing, stability, traffic engineering, etc.

72 My general philosophy about protocol designs • Autoconfiguration

73 My general philosophy about protocol designs • Autoconfiguration • OK…I’ll give you knobs if you want knobs

74 My general philosophy about protocol designs • Autoconfiguration • OK…I’ll give you knobs if you want knobs • Be evolutionary if possible

75 TRILL switches form network between themselves • Run a (link state) routing protocol between the TRILL switches – Spanning tree switches are just glue between TRILL switches • So TRILL switches know how to reach other TRILL switches • Put Ethernet packet into a layer 3-like header, addressing it to last switch

76 Form network of TRILL switches

• TRILL switches find each other if: – Directly connected with pt-to-pt – Both connected to same Ethernet island • Do “link state protocol” among TRILL switches to calculate paths to other TRILL switches

77 to 81 [Figure, built up over five slides: a campus of TRILL switches (T) and bridges (b); successive slides show fewer bridges, ending with TRILL switches only]

82 [Figure: TRILL switches, two of them labeled T1 and T2]

Note: only one T must encap/decap, so T1 and T2 must find each other and coordinate

83 TRILL

[Figure: endnodes a and c attached to a network of TRILL switches R1 through R7]

84 TRILL packet

TRILL header: Last switch | 1st switch | hops, followed by the original Ethernet packet

Switch addresses are 16 bits
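A sketch of the extra header, to show why a hop count makes temporary loops harmless. The layout here simply follows the field order on the slide (last switch, first switch, hops, each packed as 16 bits); it is an assumption for illustration and not the bit layout of the TRILL standard. All function names are hypothetical.

```python
import struct

# Assumed layout: last-switch nickname, first-switch nickname, hop count.
HEADER = struct.Struct("!HHH")  # three 16-bit fields, network byte order


def encapsulate(ethernet_packet: bytes, first: int, last: int, hops: int) -> bytes:
    """Done by the first switch: wrap the untouched Ethernet packet."""
    return HEADER.pack(last, first, hops) + ethernet_packet


def forward(packet: bytes):
    """Done by each transit switch: drop at zero hops, else decrement."""
    last, first, hops = HEADER.unpack_from(packet)
    if hops == 0:
        return None  # a packet caught in a loop dies here
    return HEADER.pack(last, first, hops - 1) + packet[HEADER.size:]


def decapsulate(packet: bytes):
    """Done by the last switch: recover (first, last, original packet)."""
    last, first, _hops = HEADER.unpack_from(packet)
    return first, last, packet[HEADER.size:]


if __name__ == "__main__":
    pkt = encapsulate(b"original ethernet packet", first=1, last=2, hops=2)
    print(len(pkt) - len(b"original ethernet packet"), "bytes of extra header")
    print(decapsulate(forward(pkt)))
```

Transit switches look only at the outer header, so the inner Ethernet packet reaches the destination unmodified.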

85 16-bit nicknames

• Piggyback on link state protocol • Look for an unused nickname (not in any other LSPs), and claim it • If R1 and R2 both claim the same nickname, use 48-bit ID (plus configured priority perhaps) as tie-breaker. One keeps the nickname, the other has to choose another one.
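Nickname claiming with a tie-breaker can be sketched as below. This is a centralized, deterministic stand-in for what the switches do through their LSPs. Two points are assumptions that the slide does not state: the higher (priority, 48-bit ID) pair keeps a contested nickname, and the loser takes the lowest nickname nobody has claimed. The function name and the usable nickname range are hypothetical.

```python
def resolve_nicknames(claims, space=range(1, 0xFFFF)):
    """claims: {system_id: (priority, claimed_nickname)} -> {system_id: nickname}."""
    wanted = {nickname for _, nickname in claims.values()}
    taken = set()
    assigned = {}
    # Assumption: higher priority wins, then the higher 48-bit ID.
    ordered = sorted(claims.items(), key=lambda kv: (kv[1][0], kv[0]), reverse=True)
    for system_id, (_priority, nickname) in ordered:
        if nickname in taken:
            # Lost the tie-break: choose a nickname not seen in any other LSP.
            nickname = next(n for n in space if n not in taken and n not in wanted)
        taken.add(nickname)
        assigned[system_id] = nickname
    return assigned


if __name__ == "__main__":
    # Switches 0xAA and 0xBB both claim nickname 7 at equal priority.
    claims = {0xAA: (0, 7), 0xBB: (0, 7), 0xCC: (0, 9)}
    for system_id, nickname in sorted(resolve_nicknames(claims).items()):
        print(f"switch {system_id:#x} uses nickname {nickname}")
```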

86 How does R1 know that R2 is the correct “last RBridge”? • Currently….If R1 doesn’t, R1 sends packet through a tree • When R2 decapsulates, it remembers (ingress RBridge, source MAC)

87 How does R1 know R2 is “last switch”? • Orthogonal concept to rest of TRILL • R1 needs table of (destination MAC, egress switch) • Various possibilities – Edge switch learns when decapsulating data, floods if destination unknown – Configuration of edge switches – Directory that R1 queries – Central fabric manager pushes table
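The first possibility (learn when decapsulating, flood if the destination is unknown) can be sketched as a table keyed by MAC address. The class and method names are hypothetical, and aging of stale entries is omitted.

```python
class EdgeSwitch:
    """Toy edge switch holding the (destination MAC, egress switch) table."""

    def __init__(self, nickname):
        self.nickname = nickname
        self.egress_for = {}  # endnode MAC -> nickname of the switch it sits behind

    def on_decapsulate(self, first_switch, source_mac):
        # The "first switch" field tells us where the source endnode lives.
        self.egress_for[source_mac] = first_switch

    def choose_egress(self, destination_mac):
        """Return the last-switch nickname, or None to send on a tree."""
        return self.egress_for.get(destination_mac)


if __name__ == "__main__":
    r1 = EdgeSwitch(nickname=1)
    print(r1.choose_egress("c"))            # unknown: goes out on a tree
    r1.on_decapsulate(first_switch=2, source_mac="c")
    print(r1.choose_egress("c"))            # now addressed to switch 2
```

A directory or a central fabric manager would fill the same table; only the source of the entries changes.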

88 Other possibilities

• Configuration of (MAC addresses, location) into switches • Directory listing (IP, MAC, switch location) – Consulted by first switch, or hypervisor, or VM, or application – No reason endnode couldn’t encapsulate into TRILL header, using switch’s nickname as “first switch” – or, pretend to be a switch and get a nickname

89 16-bit TRILL switch “nicknames” • Allows 64,000 switches…many more endnodes • TRILL autoconfigures nicknames • Allows simple forwarding table lookup – Direct table lookup – Don’t need associative memory, or hash, or longest prefix match

90 Advantage of extra header

• Switches inside cloud don’t need to know about all the endnodes… – Forwarding table size of # of switches

91 Advantage of extra header

• Switches inside cloud don’t need to know about all the endnodes… – Forwarding table size of # of switches • The outer header is like a layer 3 header, and can use all the layer 3 techniques, e.g., – Shortest paths – Multiple paths (exploit parallelism) – Traffic engineering

92 TRILL and Multicast

93 TRILL and Multicast

• For spreading multicast traffic around, campus computes several trees • “Last TRILL switch” field in TRILL header specifies which tree to send on • Traffic filtered in the core based on VLAN, and IP multicast addresses

94 Use of “first” and “last” RBridge in TRILL header • For Unicast, obvious – Route towards “last” RBridge – Learn location of source from “first” RBridge • For Multicast/unknown destination – Use of “first” • to learn location of source endnode • to do “RPF check” on multicast – Use of “last” • To allow first RB to specify a tree • Campus calculates some number of trees

95 Multiple trees for multicast

TRILL header: Which tree | 1st switch | hops, followed by the original Ethernet packet

R1 specifies which tree (yellow, red, or blue)

96 TRILL link state routing calculates:

• Paths from me to all other TRILL switches • A few trees for distribution of multicast • A unique nickname for myself

97 Note: TRILL is evolutionary

• Endnodes just think it’s Ethernet…no changes • Even interworks with existing spanning tree switches • The more switches you upgrade to TRILL, the better the utilization

98 Orthogonal concept

99 Who encapsulates/decapsulates?

• Could be – first switch – Or hypervisor – Or VM – Or application • For “evolution”, switch • Having endnode do it saves work for switch, easier to eliminate stale entries

100 Algorhyme v2

I hope that we shall one day see
A graph more lovely than a tree.
A graph to boost efficiency
While still configuration-free.
A network where RBridges can
Route packets to their target LAN.
The paths they find, to our elation,
Are least cost paths to destination.
With packet hop counts we now see,
The network need not be loop-free.
RBridges work transparently.
Without a common spanning tree.

Ray Perlner

101 Recently, a bunch of similar things invented • NVGRE, VXLAN, …

102 How to compare with TRILL

• “Inner” packet flat address space – TRILL uses Ethernet – Other things use IPv4 • “Outer” header say where in the cloud the destination is – TRILL uses TRILL header (6 bytes, autoconfigured switch nicknames) – Others use IP+UDP or GRE

103 Outer header: TRILL is 6 bytes, autoconfigured, vs IP+UDP/GRE+stuff (VXLAN/NVGRE)

Inside: Flat address space Ethernet (TRILL) vs IP (the recent stuff) Ethernet bigger addresses, smaller header

104 Interesting (to me, anyway) note

• CLNP vs IP+TRILL – Advantage of CLNP: no need for ARP to get address on final link…it’s part of the CLNP address – Advantage of TRILL: forwarding table inside final cloud can be smaller…with CLNP, routers have to keep track of all endnodes inside the cloud – but edge TRILL guys still need to map (endnode, exit switch)

105 Protocol Folklore

• Obvious stuff everyone gets wrong

106 Version Number

107 What’s a Version Number?

• Version number • What is the purpose? • Philosophical question: – what is “new version” vs “new protocol”?

108 What I think makes sense

• Envelope says what the protocol is (how to parse the packet) – Ethernet: Ethertype – IP: Protocol Type – TCP/UDP: port

109 What I think makes sense

• Envelope says what the protocol is (how to parse the packet) • If differentiate based on protocol type, then it’s a new protocol • If differentiate based on version number, then it’s a new version of the same protocol

110 If differentiate based on version number • You can’t just say “write this value into this field” • You have to say “Look at the version number, and if it’s not your version, then drop the packet”!
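The difference between the two rules is a single check on the receive path. The packet format below is hypothetical (one leading version byte followed by the payload), as are the function names.

```python
MY_VERSION = 2  # hypothetical version implemented by this node


def build(payload: bytes) -> bytes:
    """Sender side: "write this value into this field"."""
    return bytes([MY_VERSION]) + payload


def receive(packet: bytes):
    """Receiver side: check the version and drop the packet if it is not ours."""
    if len(packet) < 1 or packet[0] != MY_VERSION:
        return None  # dropped: we do not know how to parse this version
    return packet[1:]


if __name__ == "__main__":
    print(receive(build(b"hello")))          # same version: payload is parsed
    print(receive(bytes([3]) + b"hello"))    # other version: dropped
```

A spec that only defines `build` leaves deployed receivers free to ignore the field, and then the field cannot be used to introduce a new version later.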

111 Version #

• Nobody seems to do this right • IP, IKEv1, SSL don’t say what to do if version # different. Most implementations ignore version number field • SSL v3 moved version field!

112 Parameters

• Minimize these: – someone has to document it – customer has to read documentation and understand it • How to avoid – architectural constants if possible – automatically configure if possible

113 Settable Parameters

• Make sure they can’t be set incompatibly across nodes, across layers, etc. (e.g., hello time and dead timer) • Make sure they can be set at nodes one at a time and the net can stay running

114 Example: Hello Timer

• IS-IS – pairwise parameters reported in “hellos” – So you know what to expect from that neighbor • OSPF – Kind of copied IS-IS, but decided…

115 Example: Hello Timer

• IS-IS – pairwise parameters reported in “hellos” – So you know what to expect from that neighbor • OSPF – Kind of copied IS-IS, but decided… – Refuse to talk if timers not identical with neighbor’s!
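The two behaviors can be contrasted in a few lines. This is a sketch with hypothetical names and a simplified hello containing just two timers; it is not the packet format of either protocol.

```python
from dataclasses import dataclass


@dataclass
class Hello:
    sender: str
    hello_interval: int  # seconds between hellos from this sender
    dead_interval: int   # seconds of silence before this sender is declared dead


def isis_style_dead_timer(theirs: Hello) -> int:
    """Always form the adjacency; just use the timer the neighbor reported."""
    return theirs.dead_interval


def ospf_style_adjacency_ok(mine: Hello, theirs: Hello) -> bool:
    """Refuse to talk unless the timers match the neighbor's exactly."""
    return (mine.hello_interval, mine.dead_interval) == (
        theirs.hello_interval,
        theirs.dead_interval,
    )


if __name__ == "__main__":
    r1 = Hello("R1", hello_interval=10, dead_interval=40)
    r2 = Hello("R2", hello_interval=5, dead_interval=20)  # retuned first
    print(isis_style_dead_timer(r2))        # R1 simply expects R2's value
    print(ospf_style_adjacency_ok(r1, r2))  # adjacency refused mid-change
```

With the first style the timers can be changed at nodes one at a time while the net stays running; with the second, the link is down until both ends have been changed.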

116 Latency

• Store-and-forward vs cut-through • Cut through can start after the forwarding decision is made • What field do you need to see for forwarding decision?

117 IPv4 header

118 IPv6 header
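How long a cut-through switch must wait depends on where the destination field sits. A rough calculation, under these assumptions: bytes are counted from the start of each header on its own (any lower-layer header in front is ignored), the forwarding decision needs only the destination address, and lookup time is ignored. The function name and the 1500-byte packet size are illustrative.

```python
# Bytes that must arrive before the destination address has been read in full.
BYTES_BEFORE_DECISION = {
    "Ethernet": 6,   # destination address is the first 6 bytes
    "IPv4": 20,      # destination address occupies bytes 16 to 19
    "IPv6": 40,      # destination address occupies bytes 24 to 39
}


def wait_ns(byte_count: int, link_bps: int) -> float:
    """Time in nanoseconds for `byte_count` bytes to arrive on the link."""
    return byte_count * 8 * 1_000_000_000 / link_bps


if __name__ == "__main__":
    link = 1_000_000_000  # 1 Gb/s
    for header, count in BYTES_BEFORE_DECISION.items():
        print(f"{header}: cut-through can start after {wait_ns(count, link):.0f} ns")
    print(f"store-and-forward, 1500-byte packet: {wait_ns(1500, link):.0f} ns")
```

The earlier in the header the destination appears, the sooner cut-through forwarding can begin.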

119 Another latency issue

• TCP has checksum in the header • So can’t start transmitting until you see the whole packet

120 Parting Thoughts

• What “wins out in the market place” isn’t necessarily the best thing • Don’t believe (or repeat) things you can’t understand…they are often false • Know what problem you’re solving before you try to solve it!
