
Consistent SDNs through Network State Fuzzing

Apoorv Shukla (TU Berlin), S. Jawad Saidi (MPI-Informatics), Stefan Schmid (University of Vienna), Marco Canini (KAUST), Thomas Zinner (TU Berlin), Anja Feldmann (MPI-Informatics, UDS)

Abstract—The conventional wisdom is that a software-defined network (SDN) operates under the premise that the logically centralized control plane has an accurate representation of the actual data plane state. Unfortunately, bugs, misconfigurations, faults or attacks can introduce inconsistencies that undermine correct operation. Previous work in this area, however, lacks a holistic methodology to tackle this problem and thus addresses only certain parts of the problem. Yet, the consistency of the overall system is only as good as its least consistent part. Motivated by an analogy of network consistency checking with program testing, we propose to add active probe-based network state fuzzing to our consistency check repertoire. Hereby, our system, PAZZ, combines production traffic with active probes to periodically test if the actual forwarding path and decision elements (on the data plane) correspond to the expected ones (on the control plane). Our insight is that active traffic covers the inconsistency cases beyond the ones identified by passive traffic. The PAZZ prototype was built and evaluated on topologies of varying scale and complexity. Our results show that PAZZ requires minimal network resources to detect persistent data plane faults through fuzzing and localize them quickly while outperforming baseline approaches.

Index Terms—Consistency, Fuzzing, Network verification, Software defined networking.

[Figure 1: Overview of consistency checks described in the literature. The management plane holds the policy P (e.g., "Traffic A should take Path X"); the control plane holds the logical rules R_logical over a logical topology T_logical, yielding the logical paths P_logical; the data plane holds the physical rules R_physical and the forwarding topology T_physical, yielding the physical paths P_physical. Annotated checks: P ≡ R_logical [26]-[42], P ≡ P_logical [5], [21]-[25], P_logical ≡ P_physical [20], and R_logical ≡ R_physical [2], [18], [19].]

I. INTRODUCTION

The correctness of a software-defined network (SDN) crucially depends on the consistency between the management, the control, and the data plane. There are, however, many causes that may trigger inconsistencies at run time, including switch hardware failures [1], [2], bit flips [3]-[8], misconfigurations [9]-[11], priority bugs [12], [13], and control and switch software bugs [14]-[16]. When an inconsistency occurs, the actual data plane state does not correspond to what the control plane expects it to be. Even worse, a malicious user may actively try to trigger inconsistencies as part of an attack vector.

Figure 1 shows a visualization inspired by the one by Heller et al. [17] highlighting where consistency checks operate. The figure illustrates the three network planes – management, control, and data plane – with their components. The management plane establishes the network-wide policy P, which corresponds to the network operator's intent. To realize this policy, the control plane governs a set of logical rules (R_logical) over a logical topology (T_logical), which yield a set of logical paths (P_logical). The data plane consists of the actual topology (T_physical), the rules (R_physical), and the resulting forwarding paths (P_physical).

Consistency checking is a complex problem. Prior work has tackled individual subpieces of the problem, as highlighted by Figure 1. Monocle [4], RuleScope [18], and RuleChecker [19] use active probing to verify whether the logical rules R_logical are the same as the rules R_physical of the data plane. ATPG [2] creates test packets based on the control plane rules to verify whether the paths taken by packets on the data plane P_physical are the same as the expected paths from the high-level policy P, without giving attention to the matched rules. VeriDP [20] in SDNs and P4CONSIST [21] in P4 SDNs use production traffic to verify only whether the paths taken by packets on the data plane P_physical are the same as the expected paths from the control plane P_logical. NetSight [22], PathQuery [23], CherryPick [24], and PathDump [25] use production traffic whereas SDN Traceroute [26] uses active probes to verify P ≡ P_physical. Control plane solutions focus on verifying network-wide invariants such as reachability, forwarding loops, slicing, and violation detection against high-level network policies, both for stateless and stateful policies. This includes tools [27]-[37] that monitor and verify some or all of the network-wide invariants by comparing the high-level network policy with the logical rule set that translates to the logical path set at the control plane, i.e., P ≡ R_logical or P ≡ P_logical. These systems, however, only model the network behavior, which is insufficient to capture firmware and hardware bugs, as "modeling" and verifying control-data plane consistency are significantly different techniques.

Typically, previous approaches to consistency checking proceed "top-down," starting from what is known to the management and control planes, and subsequently checking whether the data plane is consistent. We claim that this is insufficient and underline this with several examples (§II-C) wherein data plane inconsistencies would go undetected. This can be critical since, analogous to security, the overall system consistency is only as good as the weakest link in the chain.

We argue that we need to complement existing top-down approaches with a bottom-up approach. To this end, we rely on an analogy to program testing. Programs can have a huge state space, just like networks. There are two basic approaches to test program correctness: one is static testing and the other is dynamic testing using fuzz testing, or fuzzing [38]. Hereby, the latter is often needed, as the former cannot capture the actual run-time behavior. We realize that the same holds true for network state.

Fuzz testing involves testing a program with invalid, unexpected, or random data as inputs. The art of designing an effective fuzzer lies in generating semi-valid inputs that are valid enough not to be directly rejected by the parser, but do create unexpected behaviors deeper in the program, and are invalid enough to expose corner cases that have not been dealt with properly. For a network, this corresponds to checking its behavior not only with the expected production traffic but also with unexpected or abnormal packets [39]. However, in networking, what is expected or unexpected depends not only on the input (ingress) port but also on the topology up to the exit (egress) port and on the configuration, i.e., the rules on the switches. Thus, there is a huge state space to explore. Relying only on production traffic is not sufficient because production traffic may or may not trigger inconsistencies. However, having faults that can be triggered at any point in time by a change in production traffic, be it malicious or accidental, is undesirable for a stable network. Thus, we need fuzz testing for checking network consistency. Accordingly, this paper introduces PAZZ, which combines such capabilities with previous approaches to verify SDNs against persistent data plane faults. Therefore, similar to program testing, we ask: "How far to go with consistency checks?"

Our Contributions:
• We identify and categorize the causes and symptoms of data plane faults which are currently unaddressed, providing insights into the limitations of existing approaches by investigating the frequency of occurrence of such faults. Based on our insights, we make a case for a fuzz testing mechanism for campus and private datacenter SDNs (§II);
• We introduce a novel methodology, PAZZ, which detects and later localizes faults by comparing control vs. data plane information for all three components: rules, topology, and paths. It uses production traffic as well as active probes (to fuzz test the data plane state) (§III);
• We develop and evaluate the PAZZ prototype¹ in multiple experimental topologies representative of multi-path/grid campus and private datacenter SDNs. Our results show that fuzzing through PAZZ outperforms baseline approaches in all experimental topologies while consuming minimal resources, as compared to Header Space Analysis, to detect and localize data plane faults (§IV).

¹PAZZ software and experiments with topologies and configs will be made public to the research community.

II. BACKGROUND & MOTIVATION

This section briefly navigates the landscape of faults and reviews the symptoms and causes (§II-A) to set the stage for the program testing analogy in networks (§II-B). Finally, we highlight the scenarios of data plane faults manifesting as inconsistency (§II-C).

A. Landscape of Faults: Frequency, Causes and Symptoms

As per the survey [1], the top primary causes for abnormal network behaviour or failures, in the order of their frequency of occurrence, are the following: 1) Software bugs: code errors, bugs, etc.; 2) Hardware failures or bugs: bit errors or bitflips, switch failures, etc.; 3) Attacks and external causes: compromised security, DoS/DDoS, etc.; and 4) Misconfigurations: ACL/protocol misconfigs, etc.

In SDNs, the above causes still exist and are persistent [2]-[4], [12]-[16], [40]-[42]. We, however, realized that the symptoms [1] of the above causes can manifest either as functional or as performance-based problems on the data plane. To clarify further, the symptoms are either functional (reachability, security policy correctness, forwarding loops, e.g., broadcast/multicast storms) or performance-based (high CPU utilization, congestion, latency/throughput degradation, intermittent connectivity). To abstract the analysis, if we disregard the performance-based symptoms, we realize the functional problems can be reduced to the verification of network correctness. Making the situation worse, the faults manifest in the form of inconsistency, where the expected network state at the control plane differs from the actual data plane state.

A physical network or network data plane comprises devices and links. In SDNs, such devices are SDN switches connected through links. The data plane of an SDN is all about the network behaviour when subjected to input in the form of traffic. Just like in programs, where we need different test cases as inputs with different coverage to test the code coverage, in order to dynamically test the network behaviour thoroughly, we need input in the form of traffic with different coverage [43]. Historically, network correctness or functional verification on the data plane has been either path-based verification (network-wide) [20], [22]-[26], [44] or rule-based verification (mostly switch-specific) [2], [4], [18], [19], [26], [45]. Path-based verification can be end-to-end or hop-by-hop, whereas rule-based verification is a switch-by-switch verification. The notion of network coverage brings us to the concept of packet header space coverage.

B. Packet Header Space Coverage: Active vs Passive

We observe that networks, just as programs, can have a huge distributed state space. Packets with their packet headers, including source IP, destination IP, port numbers, etc., are the inputs, and the state includes all forwarding

equivalence classes defined by the flow rules. Note that every pair of ingress-egress ports (source-destination pair) can have different forwarding equivalence classes (FECs). We use the term covered packet header space to refer to this. Our motivation is that the forwarding equivalence classes refer to parts of the packet header space. For a given pair of ingress and egress ports (source-destination pair), when receiving traffic on the egress port from the ingress port, we can check if the packet is covered by the corresponding "packet header space". If it is within the space it is "expected"; otherwise it is "unexpected" and, thus, we have discovered an inconsistency due to the presence of a fault on the data plane.

[Figure 2: Example topology with two example ingress/egress port pairs (source-destination pairs) and their packet header space coverage (w.r.t. IPv4 addresses and destination ports). Four switches S0-S3. The header space coverage from ingress port i to egress port e spans source and destination IPv4 addresses (2^32 each); the coverage from ingress port i' to egress port e spans destination IPv4 addresses (2^32) and destination ports (2^16); each space has covered and uncovered regions.]

Table I: Classification of related work in the data plane based on the type of monitoring/verification and the packet header space coverage. ✓ denotes full capability, (✓) denotes a part of full capability, × denotes missing capability.

| Related work in the data plane | Traffic type | Rule-based | Path-based |
| ATPG [2]²                      | Active          | (✓) | (✓) |
| Monocle [4]                    | Active          | (✓) | ×   |
| RuleScope [18]                 | Active          | (✓) | ×   |
| RuleChecker [19]³              | Active          | (✓) | ×   |
| SDNProbe [45]                  | Active          | (✓) | (✓) |
| FOCES [44]                     | Passive         | (✓) | (✓) |
| VeriDP [20]                    | Passive         | (✓) | (✓) |
| NetSight [22]                  | Passive         | (✓) | (✓) |
| PathQuery [23]                 | Passive         | (✓) | (✓) |
| CherryPick [24]⁴               | Passive         | (✓) | (✓) |
| PathDump [25]                  | Active          | (✓) | (✓) |
| P4CONSIST [21]                 | Passive         | ×   | ✓   |
| PAZZ                           | Active, Passive | ✓   | ✓   |

²In this tool, if the packet is received at the expected destination from a source, the path is considered to be the same.
³In this tool, the authors claim that the tool may detect match and action faults, without guarantee.
⁴In this tool, issues in only symmetrical topologies are addressed.

Consider the example topology in Figure 2. It consists of four switches S0, S1, S2, and S3. Let us focus on two ingress ports i and i' and one egress port e. The figure also includes possible packet header space coverage. For i to e, it includes matches for the source and destination IPv4 addresses. For i' to e, it includes matches for the destination IPv4 address and possible destination port ranges. Indeed, it is possible to take into account other dimensions such as the IPv6 header space or source-destination MAC addresses.

When testing a network, if traffic adheres to a specific packet header space, there are multiple possible cases. If we observe a packet sent via an ingress port i and received at an egress port e, then we need to check if it is within the covered area; if it is not, we refer to the packet as "unexpected" and we have an inconsistency for that packet header space caused by a fault. If a packet from an ingress port is within the expected packet header space of multiple egress ports, we need to check if the sequence of rules expected to be matched and the path/s expected to be taken by the packet correspond to the actual output port on the data plane. This is yet another way of finding inconsistency caused by faults.

Similar to program testing, where negative test cases are used to fuzz test, in networking we should not only test the network state with "expected" or production traffic, but also with specially crafted probe packets to test corner cases and "fuzz test". In principle, there are two ways for testing network forwarding: passive and active. Passive corresponds to using the existing or production traffic, while active refers to sending specific probe traffic. The advantage of passive traffic is that it has low overhead and popular forwarding paths are tested repeatedly. However, production traffic may (a) not cover all cases (it covers only faults that can be triggered by production traffic); (b) change rapidly; and (c) lead to delayed fault detection, as the fraction of traffic triggering the faults may arrive late. Indeed, malicious users may be able to inject malformed traffic that triggers fault/s. Thus, production traffic may not cover the whole packet header space achievable by active probing.

Furthermore, we should also fuzz test the network state. This is important as we derive our network state from the information of the controller. Yet, this is not sufficient, since we cannot presume that the controller state is complete and/or accurate. Thus, we propose to generate packets that are outside of the covered packet header space of an ingress/egress port pair. We suggest doing this by systematically and continuously or periodically testing the header space just outside of the covered header space (see the sketch at the end of this subsection). E.g., if port 80 is within the covered header space, test for ports 81 and 79. If x.0/17 is in the covered header space, test for x.1.0.0, which is part of the x.1/17 prefix. In addition, we propose to randomly probe the remaining packet header space continuously or periodically by generating appropriate test traffic. The goal of active traffic generation through fuzzing is to detect the faults identifiable by active traffic only.

Table I classifies the existing data plane approaches by the kind of verification or monitoring they perform, in addition to their packet header space coverage. We see that the existing data plane verification approaches are insufficient when it comes to both path- and rule-based verification in addition to ensuring sufficient packet header space coverage. In this paper, our system PAZZ aims to ensure packet header space coverage in addition to path- and rule-based verification, to ensure network correctness on the data plane and thus detect and localize persistent inconsistency.
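To make the boundary probing just described concrete, the following is a minimal Python sketch of candidate probe generation. It is our illustration, not part of PAZZ: the function names, the toy covered spaces, and the use of single ports/prefixes (instead of the BDD-based header space sets of §III-B) are our simplifying assumptions.

    import ipaddress
    import random

    def boundary_probes(covered_ports, covered_prefix):
        """Probe candidates just outside the covered header space."""
        probes = []
        # Ports adjacent to covered ones, e.g., 80 -> 79 and 81.
        for port in covered_ports:
            for neighbor in (port - 1, port + 1):
                if 0 <= neighbor <= 65535 and neighbor not in covered_ports:
                    probes.append(("dst_port", neighbor))
        # First address of the sibling prefix, e.g., 10.0.0.0/17 -> 10.0.128.0,
        # mirroring the x.0/17 -> x.1.0.0 example above.
        net = ipaddress.ip_network(covered_prefix)
        sibling = int(net.network_address) ^ (1 << (32 - net.prefixlen))
        probes.append(("dst_ip", str(ipaddress.ip_address(sibling))))
        return probes

    def random_probe(covered_prefix):
        """Randomly probe the remaining (uncovered) destination IPv4 space."""
        net = ipaddress.ip_network(covered_prefix)
        while True:
            addr = ipaddress.ip_address(random.getrandbits(32))
            if addr not in net:
                return ("dst_ip", str(addr))

    print(boundary_probes({80}, "10.0.0.0/17"))
    # [('dst_port', 79), ('dst_port', 81), ('dst_ip', '10.0.128.0')]

PAZZ's Fuzzer (§III-C) performs the equivalent computation on whole header spaces rather than on individual ports and prefixes.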

C. Data plane faults manifesting as inconsistency

[Figure 3: Example misconfiguration with a hidden rule. Expected/actual route ⇔ blue/red arrows. Switches S1-S3 with their flow tables (rules matching destination ranges within x.1.1.0-79, forwarding either towards the firewall or directly to the bank server).]

[Figure 4: Example misconfiguration with a hidden rule detectable by active probing only. Expected/actual route ⇔ blue/red arrows.]

1) Faults identified by Passive Traffic: Type-p: To highlight this type of fault, consider the scenario shown in Figure 3. It has three OpenFlow switches (S1, S2, and S3) and one firewall (FW). Initially, S1 has three rules R1, R3, and R4. R4 is the least specific rule and has the lowest priority; R1 has the highest priority. Note that the rules are written in the order of their priority.

Incorrect packet trajectory: We start by considering a known fault [20], [22], [25]: a hidden rule/misconfiguration. For this, the rule R2 is added to S1 via the switch command-line utility. The controller remains unaware of R2, since R2 is a non-overlapping flow rule and is thus installed without notification to the controller [46]; [4], [12] have hinted at this problem. As a result, traffic to IP x.1.1.31 bypasses the firewall, as it uses a different path.

Priority faults are another reason for such incorrect forwarding, where rule priorities either get swapped or are not taken into account. The Pronto-Pica8 3290 switch with PicOS 2.1.3 caches rules without accounting for rule priorities [13]. The HP ProCurve switch lacks rule priority support [12]. Furthermore, priority faults may manifest in many forms, e.g., they may cause trajectory changes or incorrect matches even when the trajectory remains the same. Action faults can be another reason, where a bitflip in the action part of a flow rule may result in a different trajectory.

Insight 1: Typically, the packet trajectory tools only monitor the path.

Correct packet trajectory, incorrect rule matching: If we add a higher priority rule in a similar fashion where the path does not change, i.e., the match and action remain the same as in the shadowed rule, then previous work will be unable to detect it and, thus, it is unaddressed⁵. Even if the packet trajectory is correct but the wrong rule is matched, it can inflict serious damage. Misconfigs, hidden rules, priority faults, and match faults (described next) may be the reason for incorrect matches. Next, we focus on match faults, where an anomaly in the match part of a forwarding flow rule on a switch causes packets to be matched incorrectly. We again highlight known as well as unaddressed cases, starting with a known scenario. In Figure 3, a bitflip⁶, e.g., due to hardware problems, changes R1 from matching x.1.1.0/28 to matching x.1.1.0 up to x.1.1.79. Traffic to x.1.1.17 is now forwarded based on R1 rather than R4 and thus bypasses the firewall. This may still be detectable, e.g., by observing the path of a test packet [20]. However, the bitflip in R1 also causes an overlap in the match of R1 and R3 in switch S1, and both rules have the same action, i.e., forward to port 1. Thus, traffic to x.1.1.66, supposed to be matched by R3, will be matched by R1. If later the network administrator removes R3, the traffic pertaining to R3 still runs. This violates the network policy (a toy version of this overlap check appears at the end of this subsection). In this paper, we categorize the data plane faults detectable by the production traffic as Type-p faults.

Insight 2: Even if the packet trajectory remains the same, the matched rules need to be monitored.

2) Faults identified by Active Traffic only: Type-a: To highlight, we focus on the hidden or misconfigured rule R3 (in green) in Figure 4. This rule matches the traffic corresponding to x.1.1.33 on switch S1, which reaches the confidential bank server; however, the expected or production traffic does not belong to this packet header space [40]-[42]. Therefore, we need to generate probe packets to trigger such rules and thus detect their presence. This requires generating and sending traffic corresponding to the packet header space which is not expected by the control plane. We call this traffic fuzz traffic in the rest of the paper, since it tests the network behavior with unexpected packet header space. In this paper, we categorize the data plane faults detectable only by the active or fuzz traffic as Type-a faults.

Insight 3: The tools which test the rules check only rules "known" to the control plane (SDN controller) by generating active traffic for "known" flows.

Insight 4: Typically, the active traffic for certain flows checks only if the path remains the same, even when the rule/s matched may be different on the data plane.

⁵We validated this via experiments. The OpenFlow specification [46] states that if a non-overlapping rule is added via the switch command-line utility, the controller is not notified.
⁶A previously unknown firmware bug in the HP 5406zl switch.
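The match-fault example above (a bitflip widening R1 from x.1.1.0/28 so that it overlaps R3) can be checked mechanically once the actual match is known. The sketch below is our toy illustration: it substitutes 10 for the paper's placeholder octet x and models destination matches as inclusive integer ranges.

    import ipaddress

    def ip_range(first, last):
        """A rule's destination match as an inclusive integer range."""
        return (int(ipaddress.ip_address(first)), int(ipaddress.ip_address(last)))

    def overlaps(a, b):
        return a[0] <= b[1] and b[0] <= a[1]

    # R1 as installed: 10.1.1.0/28, i.e., 10.1.1.0 - 10.1.1.15.
    r1_expected = ip_range("10.1.1.0", "10.1.1.15")
    # R1 after the bitflip in its match: 10.1.1.0 - 10.1.1.79.
    r1_actual = ip_range("10.1.1.0", "10.1.1.79")
    # R3: 10.1.1.64 - 10.1.1.79, same action (forward to port 1).
    r3 = ip_range("10.1.1.64", "10.1.1.79")

    print(overlaps(r1_expected, r3))  # False: no overlap as configured
    print(overlaps(r1_actual, r3))    # True: the bitflipped R1 shadows R3

Such a check presupposes knowing the actual match installed on the switch, which is exactly what the control plane lacks here; this is why PAZZ instead observes the matched rules on the data plane itself.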
III. PAZZ METHODOLOGY

Motivated by the insights gained in §II-C about the Type-p and Type-a faults on the data plane resulting in inconsistency, we aim to take the consistency checks further. Towards this end, our novel methodology, PAZZ, compares forwarding rules, topology, and paths of the control and the data plane, using top-down and bottom-up approaches, to detect and localize the data plane faults.

PAZZ, derived from Passive and Active (fuZZ testing), takes into account both production and probe traffic to ensure adequate packet header space coverage.

[Figure 5: PAZZ methodology. 1. The Control Plane Component (CPC) at the SDN controller sends the corpus to the Fuzzer; 2. the Fuzzer computes and sends fuzz traffic into the data plane alongside the production traffic; 3. the Data Plane Component (DPC) at the OpenFlow switches sends sampled actual reports to the Consistency Tester; 4. the Consistency Tester sends the actual report to the CPC; 5. the CPC sends the expected report based on the actual report.]

PAZZ checks the matched forwarding flow rules as well as the links constituting the paths of the various packet headers (5-tuple microflows) present in the passive and active traffic. To detect faults, PAZZ collects state information (in terms of reports) from the control and the data plane: PAZZ compares the "expected" state reported by the control plane to the "actual" state collected from the data plane. Figure 5 illustrates the PAZZ methodology. It consists of four components, listed here in their sequential order:

1) Control Plane Component (CPC): Uses the current controller information to proactively compute the packets that are reachable between any/every source-destination pair. It then sends the corpus of seed inputs to the Fuzzer. For any given packet header and source-destination pair, it reactively generates an expected report which encodes the paths and the sequence of rules. (§III-B)

2) Fuzzer: Uses the information from the CPC to compute the packet header space not covered by the controller and, hence, by the production traffic. It generates active traffic for fuzz testing the network. (§III-C)

3) Data Plane Component (DPC): For any given packet header and source-destination pair, it encodes the path and the sequence of forwarding rules to generate a sampled actual report. (§III-A)

4) Consistency Tester: Detects and later localizes faults by comparing the expected report/s from the CPC with the actual report/s from the DPC. (§III-D)

Now we go through all components in a non-sequential manner for ease of description.

A. Data Plane Component (DPC)

To record the actual path of a packet and the rules that are matched in forwarding, we rely on tagging the packets contained in active and production traffic. In particular, we propose the use of a shim header that gives us sufficient space even for larger network diameters or larger flow rule sets. Indeed, INT [47] can be used for data plane monitoring; however, it is applicable to P4 switches [48] only. Unlike [20], [22]-[25], we use our custom shim header for tagging; therefore, tagging is possible without limiting forwarding capabilities. To avoid adding extra monitoring rules to the scarce TCAM, which may also affect the forwarding behavior [23], [49], we augment OpenFlow with new actions. Between any source-destination pair, the new actions are used by all rules of the switches to add/update the shim header if necessary, encoding the sequence of inports (path) and matched rules. To remove the shim header, we use another custom OpenFlow action. To trigger the actual report to the Consistency Tester, we use sFlow [50] sampling; indeed, any other sampling tool could be used. Note that sFlow is a packet sampling technique: it samples packets, not flows, based on a sampling rate and a polling interval. For a given source, the report contains the packet header, the shim header content, and the egress port of the exit switch (destination).

Even with a shim header (Verify), however, it is impractical to expect packets to have available and sufficient space to encode information about each port and rule on the path. Therefore, we rely on a combination of a Bloom filter [51] and binary hash chains. For scalability purposes, sampling is used before sending a report to the Consistency Tester.

Data Plane Tagging: To limit the overhead, we decided to insert the Verify shim header on layer-2. VerifyTagType is the EtherType for the Verify header, Verify_Port (a Bloom filter) encodes the local inport in a switch, and Verify_Rule (a binary hash chain) encodes the local rule/s in a switch. Thus, the encoding is done with the help of the Bloom filter and the binary hash chain respectively. To take actions on the proposed Verify shim header and to save TCAM space, we propose four new OpenFlow actions: two for adding (push_verify) and removing (pop_verify) the Verify shim header, and two for updating the Verify_Port (set_Verify_Port) and Verify_Rule (set_Verify_Rule) header fields respectively.

Algorithm 1: Data Plane Tagging
Input: (p, s, i, o, r) for each incoming packet p and switch with ID s; let i be the inport ID and o the outport ID for packet p; r is the flow rule used for forwarding.
Output: Tagged packet p, if necessary with the Verify shim header.
   // Is there already a shim header, e.g., (s, i) is not an entry point or source port?
1  if (p has no shim header) then
      // Add shim header with EtherType 2080; initialize tag values: Verify_Port = entry point hash, Verify_Rule = 1
2     p.push_verify;
   // Determine u_p ID from switch ID and port ID
3  u_p = s ∥ i;
   // Bloom filter
4  p.Verify_Port ← bloom(hash(u_p));
   // Determine u_r ID from rule ID r of table ID t
5  u_r = s ∥ r ∥ t;
   // Binary hash chain
6  p.Verify_Rule ← hash(p.Verify_Rule, u_r);
   // Shim header has to be removed if (s, o) is an exit point
7  if ((s, o) is exit point) then
8     if (p has no shim header) then
         // For traffic injected between a source-destination pair
9        p.push_verify;
10    Generate_report((s, o), p.Verify_Port, p.Verify_Rule, p.header); p.pop_verify;
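To make the per-switch tag update of Algorithm 1 concrete, here is a minimal Python sketch. It is our simplification, not the OvS implementation (§IV-A): the packet is a dictionary, SHA-256 stands in for the prototype's CRC/Jenkins hashes, and a single-hash Bloom insert replaces the three-hash scheme of §IV-A3.

    import hashlib

    BLOOM_BITS = 32  # width of Verify_Port (see Sec. IV-A1)

    def h32(data: bytes) -> int:
        return int.from_bytes(hashlib.sha256(data).digest()[:4], "big")

    def bloom_mask(value: int) -> int:
        """Single-hash Bloom insert: one bit out of 32 (simplified)."""
        return 1 << (value % BLOOM_BITS)

    def tag(pkt, s, i, r, t):
        """One hop of Algorithm 1: switch s, inport i, rule r in table t."""
        if "verify" not in pkt:                        # push_verify
            pkt["verify"] = {"port": 0, "rule": 1}
        u_p = f"{s}|{i}".encode()                      # u_p = s || i
        pkt["verify"]["port"] |= bloom_mask(h32(u_p))  # Bloom filter insert
        u_r = f"{s}|{r}|{t}".encode()                  # u_r = s || r || t
        chain = str(pkt["verify"]["rule"]).encode() + b"|" + u_r
        pkt["verify"]["rule"] = h32(chain) & 0xFFFF    # 16-bit binary hash chain
        return pkt

    pkt = {}
    for hop in (("S1", 1, "R4", 0), ("S2", 1, "R1", 0), ("S3", 1, "R1", 0)):
        pkt = tag(pkt, *hop)
    print(hex(pkt["verify"]["port"]), hex(pkt["verify"]["rule"]))

The structure of the update is the point here: a Bloom insert for the inport and a hash chain over the matched rule, so that any deviation in path or in matched rules changes the tags.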

[Figure 6: Data plane tagging using Bloom filters and hashing. At the entry switch S1: p.push_Verify; p.Verify_Port ← bloom(hash(u_p)); p.Verify_Rule ← hash(p.Verify_Rule, u_r). At each intermediate switch (S2): p.Verify_Port ← bloom(hash(u_p)); p.Verify_Rule ← hash(p.Verify_Rule, u_r). At the exit switch S3: the same updates, followed by p.pop_Verify.]

Since the sizes of the Verify_Port and Verify_Rule fields and the tagging actions are implementation-specific, we explain them in the prototype sections §IV-A1 and §IV-A2 respectively. Algorithm 1 describes the data plane tagging between a source-destination pair. For each packet, whether from the production or the active traffic (§III-C), entering the source inport, the Verify shim header is added automatically by the switch. At each switch on the path, the tags in the packet, namely the Verify_Port and Verify_Rule fields, are updated automatically. Figure 6 illustrates the per-switch tagging approach. Once the packet leaves the destination outport, the resulting report, known as the actual report, is sent to the Consistency Tester (§III-D). Note that if there is no Verify header, the Verify shim header is pushed at the exit switch to ensure that any traffic injected at any switch interface between a source-destination pair gets tagged. To reduce the overhead on the Consistency Tester as well as on the switch, we employ sampling at the egress port. We continuously or periodically test the network, as the data plane is dynamic due to reconfigurations, link/switch/interface failures, and topology changes. Note that periodic testing can also be realized by implementing timers.

B. Control Plane Component (CPC)

In principle, we can use existing control plane mechanisms, including HSA [29], NetPlumber [30] and APVerifier [37]. In addition to the experiments in [37], our independent experiments show that Binary Decision Diagram (BDD)-based [52] solutions like [37] perform better for set operations on headers than HSA [29] and NetPlumber [30]. In particular, we propose in the following a novel BDD-based solution that supports rule verification in addition to path verification (APVerifier [37] takes into account only paths). Specifically, our Control Plane Component (CPC) performs two functions: a) proactive reachability and corpus computation, and b) reactive tag computation.

Proactive Reachability & Corpus Computation: We start by introducing an abstraction of a single switch configuration called a switch predicate. In a nutshell, a switch predicate specifies the forwarding behavior of the switch for a given set of incoming packets, and is defined in turn by the rule predicates. More formally, the general configuration abstraction of an SDN switch s with ports 1 to n can be described by switch predicates S_{i,j}, where i ∈ {1, 2, ..., n} and j ∈ {1, 2, ..., n}, with n denoting the number of switch ports. The packet headers satisfying predicate S_{i,j} can be forwarded from port i to port j only. The switch predicate is defined via rule predicates R_{i,j}, which are given by the flow rules belonging to the switch s and a flowtable t. Each rule has an identifier that consists of a unique_id and a table_id representing the flowtable t in which the rule resides, an in_port array representing the list of inports for that rule, an out_port array representing the list of outports in the action of that rule, and the rule priority p. Based on the rule priority p, the in_port in the match part, and the out_port in the action part of a flow rule, each rule has a list of rule predicates (BDD predicates) which represent the set of packets that can be matched by the rule for the corresponding inport and forwarded to the corresponding outport.

Similar to the plumbing graph of [30], we generate a dependency graph of rules (henceforth called rule nodes), called the reachability graph, based on the topology and switch configuration; it computes the set of packet headers between any source-destination pair. There exists an edge between two rules a and b if (1) the out_port of rule a is connected to the in_port of b, and (2) the intersection of the rule predicates for a and b is non-empty. For computational efficiency, each rule node keeps track of higher priority rules in the same table in the switch. A rule node computes the match of each higher priority rule, subtracting it from its own match. We refer to this as the same-table dependency of rules in a switch. In the following, by slightly abusing notation, we will use switch predicates S_{i,j} and rule predicates R_{i,j} to denote also the set of packet headers {p1, p2, ..., pn} they imply. Disregarding the ACL predicates for simplicity, the rule predicates in each switch s representing the packet header space forwarded from inport i to outport j are given by R^{fwd}_{i,j}. The switch predicates are then computed as: S_{i,j} = ∪ R^{fwd}_{i,j}.
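The following sketch illustrates the same-table dependency and the composition of switch predicates from rule predicates. It is a strong simplification for illustration only: Python sets of destination addresses stand in for the BDD predicates, and the two rules are hypothetical.

    # Rules: (switch, priority, in_port, out_port, predicate-as-set).
    rules = [
        ("S0", 10, "i", 1, set(range(0, 32))),  # low priority, broad match
        ("S0", 20, "i", 2, set(range(0, 8))),   # higher priority, overlapping
    ]

    def effective_predicates(rules):
        """Subtract higher-priority same-table matches from each rule
        (the same-table dependency), yielding the R^fwd predicates."""
        out = []
        for s, prio, inp, outp, pred in rules:
            higher = [p for s2, pr, i2, _, p in rules
                      if s2 == s and i2 == inp and pr > prio]
            shadow = set().union(*higher) if higher else set()
            out.append((s, inp, outp, pred - shadow))
        return out

    def switch_predicate(rules, s, i, j):
        """S_{i,j}: union of the effective rule predicates R^fwd_{i,j}."""
        preds = [p for s2, i2, j2, p in effective_predicates(rules)
                 if (s2, i2, j2) == (s, i, j)]
        return set().union(*preds) if preds else set()

    print(sorted(switch_predicate(rules, "S0", "i", 1)))  # 8..31 (shadowed part removed)
    print(sorted(switch_predicate(rules, "S0", "i", 2)))  # 0..7

An edge of the reachability graph then connects rule a to rule b whenever a's out_port is wired to b's in_port and the intersection of their effective predicates is non-empty; reachability between a source-destination pair is the composition of such predicates along the path.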
In next switch until we reach the destination port. Thus, we can particular, we will propose in the following a novel BDD- compute reachability between any/every source-destination based solution that supports rule verification in addition to pair. For caching and tag computation, we generate the in- path verification (APVerifier [37] takes into account only verse reachability graph simultaneously to cache the traversed paths). Specifically, our Control Plane Component (CPC) paths and rules matched by a packet header p between performs two functions: a) Proactive reachability and corpus every source-destination pair. After the reachability/inverse computation, and b) Reactive tag computation. reachability graph computation, CPC sends the current switch Proactive Reachability & Corpus Computation: We start predicates of the entry and exit switch pertaining to a by introducing an abstraction of a single switch configuration source-destination pair to Fuzzer as a corpus for fuzz traffic called switch predicate. In a nutshell, a switch predicate generation (§III-C). specifies the forwarding behavior of the switch for a given In case of a FlowMod, the reachability/inverse reachability set of incoming packets, and is defined in turn by the graph and new corpus are re-computed. Recall every rule rule predicates. More formally, the general configuration node in a reachability graph keeps track of high-priority rules abstraction of a SDN switch s with ports 1 to n can be in a table in a switch. Therefore, only a part of the affected described by switch predicates: Si,j where i ∈ {1, 2, ..., n} reachability/inverse reachability graph needs to be updated and j ∈ {1, 2, ..., n} where n denotes the number of switch in the event of rule addition/deletion. In the case of rule ports. The packets headers satisfying predicate Si,j can be addition, the same-table dependency of the rule is computed 7 by comparing the priorities of new and old rule/s before it is Algorithm 2: Fuzzer added as a new node in the reachability graph. If the priority Input : Switch predicates of entry (Si) and exit Se switch for a of a new rule is higher than any rule/s and there is an overlap source-destination pair i-e Output: fuzz_sweep_area, random_fuzz_area 0 in the match part, the new switch predicate: Si,j as per the // Generate fuzz traffic in the difference of covered 0fwd packet header space area between entry and exit new rule predicate: Ri,j is computed as: switch 0 0fwd fwd 0fwd Si,j = Ri,j ∪ (Ri,j − Ri,j ) 1 fuzz_sweep_area ← Se − Si; 2 residual_area ← U − fuzz_sweep_area − Si; Reactive Tag Computation: For any given data plane report 3 random_fuzz_area ← Φ; p // Generate fuzz traffic in completely uncovered packet corresponding to a packet header between any source- header space area randomly destination pair, we traverse the pre-computed inverse reach- 4 while random_fuzz_area 6= residual_area do ability graph to generate a list of sequences of rules that 5 fuzz ← random_choose(residual_area); 6 random_fuzz_area ← random_fuzz_area S fuzz; can match and forward the actual packet header observed at a destination port from a source port. Note, there can be multiple possible paths, e.g., due to multiple entry points and per-packet or per-flow load balancing. For a packet to start with active traffic generation on the boundary of header p, the appropriate Verify_Port and Verify_Rule tags are the net covered packet header space area between a source- computed similarly as in Algorithm1. 
The expected report is destination pair as there is a maximum possibility of faults then sent to the Consistency Tester (§III-D) for comparison. in this area. A packet will reach e from i iff all of the rules Note we can generate expected reports for any number of on the switches in a path match it; else it will be dropped source-destination pairs. either midway or on the first switch. Therefore, for an end- to-end reachability, the ruleset on S0 and S3 should match C. Fuzzer the packet p contained in the production traffic belonging to Inspired by the code coverage-guided fuzzers like Lib- the covered packet header space: Si ∩ Se. This implies that Fuzz [53], we design a mutation-based fuzz testing [54] we need to first generate active traffic in the area: Se − Si component called Fuzzer. Fuzzer receives the corpus of seed and then randomly generate in the leftover area. Traffic can, inputs in the form of the switch predicates of the entry and however, be also injected at any switch on any path between exit switch from the CPC for a source-destination pair. In a source-destination pair, thus the checking needs to be done particular, the switch predicates pertaining to the inport of for different source-destination pairs. the entry switch (source) and the outport of the exit switch We now explain the Algorithm2 in the context of Fig- (destination) represent expected covered packet header space ure2. For active or fuzz traffic generation, if there is a containing the set of packet headers satisfying those predi- difference in the covered packet header space areas of S0 cates. Fuzzer applies mutations to the corpus (Algorithm2). and S3, we first generate traffic in the area, i.e., Se − Si Where Can Most Faults Hide: Before explaining Algo- denoted by fuzz_sweep_area (Line 1). Recall there is rithm2, we present a scenario to explain the packet header a high probability that there may be hidden rules in this space area where potential faults can be present. Consider area since the header space coverage of the exit switch the example topology illustrated in Figure2. Due to a huge may be bigger than the same at the entry switch. Later, header space in IPv6 (128-bit), we decide to focus on the des- we generate traffic randomly in the residual packet header tination IPv4 header space (32-bit) in a case of destination- space area, i.e., U − fuzz_sweep_area − Si denoted by based . In principle, it is possible to consider IPv6 residual_area (Lines 2-6). We generate traffic randomly in header space as the fuzzing mechanism remains the same. the area as this is mostly, a huge space and fault/s can lie We use Si, S1 and Se to represent covered packet header anywhere. The fuzz traffic generated randomly is given by the space (switch predicates) of switches S0, S1 and S3 between completely uncovered packet header space area denoted by i-e (source-destination pair). Note there can be multipaths random_fuzz_area. Thus, the fuzz traffic that the Fuzzer for the same packet header p. Now, assume there is only a generates belongs to the packet header space area given single path: S0 → S1 → S3, the reachable packet header by fuzz_sweep_area and random_fuzz_area. It is worth space or net covered packet header space area is given by noting that not all of the packets generated by the fuzz traffic Si ∩ S1 ∩ Se. Note this area corresponds to the control plane are allowed in the network due to a default drop rule in the perspective so there may be more or less coverage on the switches. 
Therefore, if some packets of the fuzz traffic are matched, the reason can be attributed to the presence of faulty rule/s, wildcarded rules, or hardware/software faults that match such traffic. This also means that the fuzz traffic is unlikely to cause network congestion. As discussed previously, there is another scenario where the traffic gets injected at one of the switches on the path between a source-destination pair and may end up getting matched on the data plane. The Verify header is pushed at the exit switch if it is not already present (§III-A, §IV-A2), and thus the packets still get tagged on the data plane and included in the actual report.
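A minimal Python mirror of Algorithm 2 follows, under the same set-based stand-in for switch predicates as in the earlier sketches and with a toy 8-bit universe (the real universe being the 2^32 destination IPv4 space); the probe budget is our addition, bounding the random sampling instead of exhausting residual_area.

    import random

    def fuzz_areas(s_entry, s_exit, universe):
        fuzz_sweep_area = s_exit - s_entry                    # Line 1: S_e - S_i
        residual_area = universe - fuzz_sweep_area - s_entry  # Line 2
        return fuzz_sweep_area, residual_area

    def random_fuzz(residual_area, budget):
        """Sample the completely uncovered space (Lines 4-6),
        bounded by a probe budget."""
        k = min(budget, len(residual_area))
        return set(random.sample(sorted(residual_area), k))

    universe = set(range(256))    # toy U
    s_entry = set(range(0, 64))   # covered space at the entry switch (S_i)
    s_exit = set(range(0, 80))    # covered space at the exit switch (S_e)
    sweep, residual = fuzz_areas(s_entry, s_exit, universe)
    print(sorted(sweep))          # 64..79: the boundary area, probed first
    probes = sweep | random_fuzz(residual, budget=16)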

For such unexpected traffic, however, the CPC may generate empty Verify_Rule and Verify_Port tags. In such cases, the fault is still detected but may not be localized automatically (§III-D). Furthermore, the Fuzzer can be positioned to generate traffic at different inports to detect more faults in the network between any/every source-destination pair. In case the production traffic does not cover all of the expected rules at the ingress or entry switch, the Fuzzer design can easily be tuned to also generate the traffic for critical flows. Our evaluations confirm that an exhaustive active traffic generator, which randomly generates traffic in the uncovered area, performs poorly against PAZZ in real-world topologies (§IV-E). Note that if the network topology or configuration changes, the CPC sends a new corpus to the Fuzzer and Algorithm 2 is repeated. We can thus continuously or periodically test the network with fuzz traffic after any changes.

D. Consistency Tester

After receiving an actual report from the data plane, the Consistency Tester queries the CPC for its expected report for the packet header and the corresponding source-destination pair in the actual report. Once the Consistency Tester has received both reports, it compares them as per Algorithm 3 for fault detection and localization. To avoid confusion, we use the Verify_Port_a, Verify_Rule_a tags for the actual data plane report and the Verify_Port_e, Verify_Rule_e tags for the corresponding expected control plane report.

Algorithm 3: Consistency Tester (detection, localization)
Input: Actual and expected reports containing the Verify_Rule and Verify_Port tags for a packet pertaining to a flow (5-tuple)
Output: Detected and localized faulty switch S_f or faulty rule R_f
   // Different rules were matched on the data plane.
1  if (Verify_Rule_a ≠ Verify_Rule_e) then
      // Fault is detected and reported.
2     Report fault
      // Different path was taken on the data plane.
3     if (Verify_Port_a ≠ Verify_Port_e) then
         // Localize the fault.
4        for i ← 0 to n by 1 do
5           if Verify_Port_a ∩ Verify_Port_ei = Verify_Port_ei then
6              No problems for this switch hop
            // Previous switch wrongly routed the packet.
7           else
8              S_f ← S_{i−1}
      // Path is the same even though the rules matched are different.
9     else
         // Localize Type-p match fault.
10       if Verify_Rule_e ≠ 0 then
11          Go through the different switches hop-by-hop to find R_f
         // Localize Type-a fault.
12       else
13          R_f lies in S0; else go through the different switches hop-by-hop
   // Different path was taken on the data plane.
14 else if (Verify_Port_a ≠ Verify_Port_e) then
      // Type-p action fault is detected and reported.
15    Report fault
      // Localize Type-p action fault.
16    for i ← 0 to n by 1 do
17       if Verify_Port_a ∩ Verify_Port_ei = Verify_Port_ei then
18          No problems for this switch hop
         // Previous switch wrongly routed the packet.
19       else
20          S_f ← S_{i−1}
21 else
22    No fault detected

If the Verify_Rule tag differs for a packet header and a pair of ingress and egress ports, then the fault is detected and reported (Lines 1-2). Note that we avoid the Bloom filter false positive problem by first matching the hash value of the Verify_Rule tag. Therefore, the detection accuracy is high unless a hash collision occurs in the Verify_Rule field (§IV-A3). Once a fault is detected, the Consistency Tester uses the Verify_Port Bloom filter for the localization of faults where the actual path is different from the expected path, i.e., the Verify_Port_a Bloom filter differs from the Verify_Port_e Bloom filter (Lines 3-8). Here, Verify_Port_a is compared with the per-switch-hop Verify_Port in the control plane, i.e., Verify_Port_ei for the ith hop, starting from the source inport to the destination outport. This hop-by-hop walkthrough is done by traversing the reachability graph at the CPC from the source port of the entry switch to the destination port of the exit switch. As per Algorithm 3, a bitwise logical AND operation between Verify_Port_a and Verify_Port_ei is executed at every hop. It is, however, important to note that if the actual path was the same as the expected path, i.e., Verify_Port_e = Verify_Port_a, even though the actual rules matched were different on the data plane, i.e., Verify_Rule_e ≠ Verify_Rule_a (Lines 10-13), the localization of the faulty rule R_f gets tricky, as it can be either a case of a Type-p match fault (e.g., a bitflip in the match part) (Lines 10-11) or a Type-a fault (Lines 12-13). Hereby, it is worth noting that there will be no expected report from the CPC in the case of unexpected fuzz traffic. Therefore, the Consistency Tester checks if Verify_Rule_e ≠ 0 (Line 10). If true, localization can be done through hop-by-hop inspection or manual polling of the expected switches (Lines 10-11); else the Type-a fault may be localized to the entry switch, as it has a faulty rule that lets the unexpected fuzz traffic into the network (Lines 12-13). There is another scenario where the actual rules matched are the same as expected but the path is different (Lines 14-20). This is a case of a Type-p action fault (e.g., a bitflip in the action part of the rule). In this case, the expected and actual Verify_Port Bloom filters can be compared and, thus, the Type-p action fault is detected and localized. Note that an action fault is Type-p, as it is triggered by production traffic.

The binary hash chain in Verify_Rule gives PAZZ better accuracy; however, we lose the ability to automatically localize the Type-p match faults where the path remains the same and the rules matched are different, since the Verify_Port Bloom filter remains the same. To summarize, detection will always happen, but localization can happen automatically only when two conditions hold simultaneously: a) the traffic is production traffic; and b) there is a change in path, since the Verify_Port Bloom filter will then be different. In most cases, fuzz traffic is not permitted in the network. Recall that active traffic can be injected at any switch between a source-destination pair. In such cases, the faulty rule R_f will still be detected and can be localized by either manual polling of the switches or hop-by-hop traversal from source to destination. Blackholes for critical flows can be detected, as the Consistency Tester generates an alarm after a chosen time of some seconds if it does not receive any packet pertaining to that flow. For localizing silent random packet drops, the MAX-COVERAGE [55] algorithm can be implemented on the Consistency Tester.
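The comparison logic of Algorithm 3 can be sketched as follows; this is our simplified rendering with integer Bloom masks and hash values (as in the earlier tagging sketch), not the prototype's Thrift-based implementation:

    def check(actual, expected_hop_masks, expected_rule_hash):
        """actual: {'port': Bloom mask, 'rule': hash} from the actual report;
        expected_hop_masks: per-hop Verify_Port masks from the CPC's
        reachability graph; expected_rule_hash: the expected Verify_Rule."""
        expected_port = 0
        for mask in expected_hop_masks:
            expected_port |= mask
        if actual["rule"] == expected_rule_hash and actual["port"] == expected_port:
            return "no fault detected"                       # Lines 21-22
        if actual["port"] != expected_port:                  # path differs
            for hop, mask in enumerate(expected_hop_masks):  # Lines 3-8, 14-20
                # Bitwise AND: is this hop's expected tag in the actual tag?
                if actual["port"] & mask != mask:
                    return f"fault: switch before hop {hop} misrouted the packet"
        # Same path, different rules: Lines 9-13.
        if expected_rule_hash == 0:   # no expected report: unexpected fuzz traffic
            return "Type-a fault: suspect a faulty rule at the entry switch"
        return "Type-p match fault: inspect the expected switches hop-by-hop"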
IV. PAZZ PROTOTYPE AND EVALUATION

A. Prototype

1) DPC: Verify Shim Header: We decided to use a 64-bit (8 Byte) Verify shim header on layer-2. To ensure sufficient space, we limit the link layer MTU to a maximum of 8,092 Bytes for jumbo frames and 1,518 Bytes for regular frames. Verify has three fields, namely:
• VerifyTagType: 16-bit EtherType to signify the Verify header.
• Verify_Port: 32-bit field encoding the local inport in a switch.
• Verify_Rule: 16-bit field encoding the local rule/s in a switch.

We use a new EtherType value for VerifyTagType to ensure that any layer-2 switch on the path without our OpenFlow modifications will forward the packet based on its layer-2 information. The Verify shim header is inserted within the layer-2 header after the source MAC address, just like a VLAN header. In the presence of VLAN (802.1q), the Verify header is inserted before the VLAN header. Note that the Verify header is backward compatible with legacy L2 switches and transparent to other protocols.
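As an illustration of the field layout, the 64-bit Verify header can be packed and parsed as follows. The sketch assumes big-endian (network) byte order and reads the EtherType value 2080 from Algorithm 1's comment as hexadecimal; both are our assumptions.

    import struct

    VERIFY_ETHERTYPE = 0x2080  # VerifyTagType (assumed value)

    def pack_verify(port_bloom: int, rule_hash: int) -> bytes:
        """16-bit VerifyTagType + 32-bit Verify_Port + 16-bit Verify_Rule."""
        return struct.pack("!HIH", VERIFY_ETHERTYPE, port_bloom, rule_hash)

    def unpack_verify(raw: bytes):
        tag_type, port_bloom, rule_hash = struct.unpack("!HIH", raw)
        assert tag_type == VERIFY_ETHERTYPE
        return port_bloom, rule_hash

    hdr = pack_verify(0x80000001, 0xBEEF)
    assert len(hdr) == 8  # 64 bits on the wire
    assert unpack_verify(hdr) == (0x80000001, 0xBEEF)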
2) DPC: New OpenFlow Actions: The new actions ensure that there is no interference with forwarding, as no extra rules are added. To make efficient use of the shim header space, we use the Bloom filter to encode path-related information in the Verify_Port field and binary hash chains to encode rule-level information in the Verify_Rule field. A binary hash chain adds a new hash-entry to an existing hash-value by computing a hash of the existing hash-value and the new hash-entry and then storing the result as the new value. The Verify_Port field is a Bloom filter which contains all intermediate hash results, including the first and last value. This ensures that we can test the initial value as well as the final path efficiently.

• set_Verify_Port: Computes a hash of the unique identifier (u_p) formed by the switch ID and its inport ID, and adds the result to the Bloom filter in the Verify_Port field.
• set_Verify_Rule: Computes a hash of the globally unique identifier (u_r) of the flow rule, i.e., switch ID, rule ID (uniquely identifying a rule within a table) and flow table ID, together with the previous value of Verify_Rule, to form a binary hash chain.
• push_verify: Inserts a Verify header if needed, initializes the value of Verify_Rule to 1, and sets the value of Verify_Port to the hash of u_p. It is immediately followed by set_Verify_Rule and set_Verify_Port. If there is no Verify header, push_verify is executed at the entry and the exit switch between a source-destination pair.
• pop_verify: Removes the Verify header from the tagged packet.

push_verify should be used, if there is no Verify header, for a) all packets entering a source inport, or b) all packets leaving the destination outport (in case any traffic is injected between a source-destination pair), just before a report is generated for the Consistency Tester. For packets leaving the destination outport, pop_verify should be used only after a sampled report to the Consistency Tester has been generated. To initiate and execute data plane tagging, the actions set_Verify_Port and set_Verify_Rule are prepended to all flow rules in the switches as the first actions in the OpenFlow "action list" [46]. On the entry and exit switch, the action push_verify is added as the first action. On the exit switch, pop_verify is added as an action once the report is generated. Recall that our actions do not change the forwarding behavior per se, as the match part remains unaffected. However, if one of the actions gets modified unintentionally or maliciously, it may have a negative impact, but this gets detected and localized later. Notably, set_Verify_Rule encodes the priority of the rule and the flow table number in the Verify_Rule field, thus providing support for rule priorities and cascaded flow tables.

3) Bloom Filter & Hash Function: We use the Verify_Port Bloom filter for the localization of detected faults. In an extreme case from the perspective of operational networks like datacenter or campus networks, for a packet header and a pair of ingress and egress ports, if the CPC computes i different paths with n hops in each of the paths, the probability of a collision in the Bloom filter and the hash value simultaneously is given by: (0.6185)^{m/n} × p(i). In our case, (0.6185)^{m/n} = (0.6185)^{32/n}, using the Bloom filter false positive formula [51], where m = 32 is the Bloom filter size of the Verify_Port field and n is the network diameter, i.e., the number of switches in a path. p(i) is the probability of collision of the hash function, computed using a simple approximation of the birthday attack formula: p(i) = i²/(2H) = i²/2^17, where i is the number of different paths and H = 2^16 for the 16-bit Verify_Rule field hash.

For the 16-bit Verify_Rule hash operation, we used a Cyclic Redundancy Check (CRC) code [56]. For the 32-bit Verify_Port Bloom filter operations, we use one of the approaches mentioned in [51]. First, three hashes are computed as g_i(x) = h1(x) + i · h2(x) for i = 0, 1, 2, where h1(x) and h2(x) are the two halves of a 32-bit Jenkins hash [57] of x. Then, we use the first 5 bits of each g_i(x) to set the 32-bit Bloom filter for i = 0, 1, 2.

PAZZ Components: We implemented the DPC on top of software switches, in particular Open vSwitch [58] Version 2.6.90. The customized OvS switches and the fuzz/production traffic generators run in Vagrant VMs. Currently, the prototype supports both OpenFlow 1.0 and OpenFlow 1.1. In our prototype, we chose the Ryu SDN controller. The Python-based Consistency Tester, the Java-based CPC and the Python-based Fuzzer communicate through Apache Thrift [59].
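Two quick illustrations of §IV-A3 follow: the simultaneous-collision probability for sample parameters, and the g_i(x) bit selection. Here zlib.crc32 stands in for the Jenkins hash and the low 5 bits stand in for the "first 5 bits"; both substitutions are ours.

    import zlib

    def collision_probability(n_hops: int, n_paths: int) -> float:
        """(0.6185)^(m/n) * p(i), with m = 32 and p(i) = i^2 / 2^17."""
        return 0.6185 ** (32 / n_hops) * (n_paths ** 2 / 2 ** 17)

    def bloom_bits(x: bytes):
        """g_i(x) = h1(x) + i*h2(x) for i = 0, 1, 2, where h1/h2 are the
        two 16-bit halves of a 32-bit hash of x; 5 bits of each g_i pick
        a position in the 32-bit Verify_Port filter."""
        h = zlib.crc32(x) & 0xFFFFFFFF
        h1, h2 = h >> 16, h & 0xFFFF
        return [(h1 + i * h2) & 0x1F for i in range(3)]

    # e.g., a 6-hop path with 5 alternative paths between the port pair:
    print(f"{collision_probability(6, 5):.2e}")
    print(bloom_bits(b"S1|1"))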

Reachability Path graph Fuzzer Execu- Topology #Rules #Paths Length computation tion Time time 4 switches (grid) ~5k ~24k 2 0.64 seconds ~1 ms 9 switches (grid) ~27k ~50k 4 0.91 seconds ~1.2 ms 16 switches (grid) ~60k ~75k 6 1.13 seconds ~3.2 ms 4-ary fat-tree (20 Figure 7: Left to right: Experimental grid topologies with 4, 9 and 16 switches ~100k ~75k 6 1.15 seconds ~7.5 ms respectively where fuzz and production traffic enter from the leftmost switch and leave switches) at the rightmost switch in each topology. Table II: Columns 1-4 depict the parameters of four experimental topologies. Column 5 depicts the reachability graph computation time by the CPC for the experimental topologies proactively by the CPC. Represents an average over 10 runs. C1 C2 C3 C4 Column 6 depicts the Fuzzer execution time to compute the packet header space for generating the fuzz traffic for the corresponding experimental topologies. Represents an average over 10 runs. at the desired rate using Tcpreplay [61] with infinite loops to test the network continuously. To execute PAZZ periodically

A1 A2 A3 A4 A5 A6 A7 A8 instead of continuously, there is an API support to implement timers with desired timeouts as per the deployment scenario. T1 T2 T3 T4 T5 T6 T7 T8 For sampling, we used sFlow [50] with a polling interval of 1 second and a sampling rate of 1/100, 1/500 and 1/1000. Note, the polling interval and sampling rate can be modified as Src Dest Src Src per the deployment scenario. The sampling was done on the Figure 8: 4-ary fat-tree (20 switches) experimetal topology with expected paths in egress port of the exit switch in the data plane so the sampled dotted blue line from three different sources (Src) to Destination (Dst). T< x >: (Top-of-Rack), A< x >: Aggregate and C< x >: Core switches respectively. Paths actual report reaches the Consistency Tester and thus, avoids for each source-destination pair: T2 A2 C3 A4 T4 A3 T3 (Blue), T6 A6 C3 A4 T4 overwhelming it. Note each experiment was conducted for A3 T3 (Green), T8 A8 C3 A4 T4 A3 T3 (Red). ten times for a randomly chosen source-destination pair. In our prototype, we chose Ryu SDN controller. Python- Workloads: For 1 Gbps links between the switches in the 4 based Consistency Tester, Java-based CPC and Python-based experimental topologies: 3 grid and 1 fat-tree (4-ary), the 6 Fuzzer communicate through Apache Thrift [59]. production traffic was generated at 10 pps (packets per second). In parallel, fuzz traffic was generated at 1000 pps, B. Experiment Setup i.e., 0.1% of the production traffic. Note, the fuzz traffic rate We evaluate PAZZ on 4 topologies: a) 3 grid topologies can be modified as per the use-case requirements. (Figure7) of 4, 9 and 16 switches respectively with varying complexities to ensure diversity of paths, and b) 1 datacenter fat-tree (4-ary) topology of 20 switches with multipaths C. Evaluation Strategy (Figure8). Experiments were conducted on an 8 core 2.4GHz For a source-destination pair, our experiments are parame- Intel-Xeon CPU machine and 64GB of RAM. For scalability terized by: (a) size of network (4-20 switches), (b) path length purposes, we modified and translated the Stanford backbone (2-6), (c) configs (flow rules from 5k-100k), (d) number configuration files [60] to equivalent OpenFlow rules as per of paths (24k-75k), (e) number around (1-30) and kind of our topologies, and installed them at the switches to allow faults (Type-p, Type-a), (f) sampling rate (1/100, 1/500, multi-path destination-based routing. We used our custom 1/1000) with polling interval (1 sec), and (g) workloads, script to generate configuration files for the four experimental i.e., throughput (106 pps for production and 1000 pps for topologies. The configuration files ensured the diversity of fuzz traffic). Note, due to space constraints, we removed the paths for the same packet header. Columns 1-4 in TableII results of 1/500 sampling rate. Our primary metrics of interest illustrate the parameters of the four experimental topologies. are fault detection with localization time, and comparison of We randomly injected faults on randomly chosen OvS fault detection/localization time in PAZZ against the baseline switches in the data plane where each fault belonged to of exhaustive traffic generation approach and Header Space different packet header space (in 32-bit destination IPv4 Analysis (HSA) [29]. In particular, we ask these questions: address space) either in the production or fuzz traffic header Q1. How does PAZZ perform under different topologies and space. 
C. Evaluation Strategy

For a source-destination pair, our experiments are parameterized by: (a) size of the network (4-20 switches), (b) path length (2-6), (c) configs (flow rules from 5k-100k), (d) number of paths (24k-75k), (e) number (1-30) and kind of faults (Type-p, Type-a), (f) sampling rate (1/100, 1/500, 1/1000) with a polling interval of 1 second, and (g) workloads, i.e., throughput (10^6 pps for production and 1000 pps for fuzz traffic). Note, due to space constraints, we omit the results for the 1/500 sampling rate. Our primary metrics of interest are the fault detection and localization times, and the comparison of PAZZ's fault detection/localization time against the baselines of an exhaustive traffic generation approach and Header Space Analysis (HSA) [29]. In particular, we ask the following questions:
Q1. How does PAZZ perform under different topologies and configs of varying scale and complexity? (§IV-D)
Q2. How does PAZZ compare to the strawman case of exhaustive random packet generation for a single and multiple source-destination pairs? (§IV-E)
Q3. How much time does PAZZ take to compute the reachability graph as compared to Header Space Analysis [29]? (§IV-F)
Q4. How much time does PAZZ take to generate active traffic for a source-destination pair, and how much overhead does PAZZ incur on the links? (§IV-G)
Q5. How much packet processing overhead does PAZZ incur on varying packet sizes? (§IV-H)

[Figure 9 omitted: six panels of CDFs of fault detection time in seconds. Panels: (a) detection time for all topologies; (b) 4-switch grid; (c) 9-switch grid; (d) 16-switch grid; (e) 4-ary fat-tree (20-switch); (f) 3 source-destination pairs in the fat-tree.]

Figure 9: a) For a source-destination pair, CDF of Type-p and Type-a fault detection time by PAZZ in all 4 experimental topologies for sampling rates of 1/100 (left) and 1/1000 (right). Red, green, blue and black lines belong to the 4-switch, 9-switch, 16-switch and 4-ary fat-tree (20-switch) topologies respectively. b), c), d), e) For a source-destination pair, comparison of fault detection time by PAZZ (blue) and the exhaustive packet generation approach (red) in the 4 experimental topologies (4-switch, 9-switch, 16-switch and 4-ary fat-tree respectively); in each subfigure, left and right show sampling rates of 1/100 and 1/1000 respectively. f) For 3 source-destination pairs, comparison of fault detection time (in seconds) by PAZZ (blue) and the exhaustive packet generation approach (red) in the 4-ary fat-tree topology; left and right show sampling rates of 1/100 and 1/1000 respectively.

[Figure 10 omitted: bar charts of median detection time in seconds (Pazz vs. Exhaust) per bug (1-28). Panels: (a) Results of 4-ary fat-tree (20-switch) topology with 1/100 sampling rate; (b) Results of 4-ary fat-tree (20-switch) with 1/1000 sampling rate.]

Figure 10: a) Median fault detection time with quartiles in PAZZ vs. the baseline for a single source-destination pair in the 4-ary fat-tree topology with a sampling rate of 1/100 (Figure 9e (left) shows the CDF of the mean fault detection time). b) The same with a sampling rate of 1/1000 (Figure 9e (right) shows the CDF of the mean fault detection time).

D. PAZZ Performance

Figure 9a illustrates the cumulative distribution function (CDF) of the Type-p and Type-a faults detected in the four experimental topologies with the parameters listed in Table II. As expected, in the 16-switch grid topology with 60k rules and 75k paths, PAZZ takes only 25 seconds to detect 50% of the faults and 105 seconds to detect all of the faults for a sampling rate of 1/100 and a polling interval of 1 second (left in Figure 9a). For the same sampling rate of 1/100, in the 4-ary fat-tree topology with 20 switches containing 100k rules and 75k paths, PAZZ detects 50% of the faults in 40 seconds and all faults in 160 seconds. Since the production traffic was replayed at 10^6 pps in parallel with the fuzz traffic replayed at 1000 pps, the Type-p faults in the production traffic header space (35% of the total faults) were detected faster, within a maximum of 24 seconds across all four topologies, compared to the Type-a faults (65% of the total faults) in the fuzz traffic header space, which were detected within a maximum of 420 seconds. As each experiment was conducted ten times, the reported time to detect a fault pertaining to a packet header space is the mean over the ten runs. We omit confidence intervals as they are small after 10 runs. In all cases, the detection time difference was marginal.

Localization Time: As per Algorithm 3, the production traffic-specific faults were, after detection, automatically localized within a span of 50 µs for all four experimental topologies. The localization of faults pertaining to fuzz traffic was manual, as there was no expected report from the CPC. Hereby, the localization was done for two cases: a) when the fuzz traffic entered at the ingress port of the entry switch, and b) when the fuzz traffic entered between a pair of ingress and egress ports. In the first case, each fault was localized within a second of its detection by the Consistency Tester, as the first switch possessed a flow rule to allow such traffic into the network. The second case, i.e., where the fuzz traffic was injected between the pair of ingress and egress ports, took approx. 2-3 minutes after detection for manual localization, as the path was constructed by hop-by-hop inspection of the switch rules.
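To illustrate what this hop-by-hop inspection amounts to in practice, consider the following minimal sketch (our illustration; switch names, path and match filter are hypothetical) that dumps, on each switch along the suspected path, the flow entries matching the faulty fuzz header:

```python
import subprocess

# Illustrative hop-by-hop inspection: walk the suspected path and dump each
# switch's flow entries matching the faulty header, to spot the hidden rule.
SUSPECTED_PATH = ["s1", "s2", "s3", "s4"]
FAULTY_DST = "10.0.3.7"  # destination IP carried by the offending fuzz packets

def matching_flows(switch: str, dst: str) -> str:
    # ovs-ofctl dump-flows accepts an optional flow match to filter entries.
    result = subprocess.run(
        ["ovs-ofctl", "dump-flows", switch, f"ip,nw_dst={dst}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

for sw in SUSPECTED_PATH:
    print(f"--- {sw} ---\n{matching_flows(sw, FAULTY_DST)}")
```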

To compute the CPU usage of PAZZ, we measured the CPU time (cpu_time = cpu_cycles / cpu_clock_rate) required to compute the expected paths and to test for and detect inconsistency per sFlow sample analyzed by the Consistency Tester (the brain of PAZZ) for the 4-ary fat-tree scenario. Our single-threaded implementation required 4.271 ms (mean) and 4.397 ms (median), with a minimum of 0.259 ms and a maximum of 7.310 ms per sample, on a single core of the Intel Xeon E5-2609 2.40GHz CPU. For memory usage, topology and configuration are the main factors: the fat-tree scenario with 100k rules requires a maximum resident set size (RSS) of 1.29MB. The memory usage of the CPC is covered in §IV-F.

E. Comparison to Exhaustive Packet Generation

We compare the fault detection time of PAZZ, which uses the Fuzzer, against an exhaustive packet generation approach. For a fair comparison, the exhaustive approach generates the same number of flows, randomly, and at the same rate as PAZZ. Figures 9b, 9c, 9d and 9e illustrate the fault detection time CDFs in the 4-switch, 9-switch, 16-switch and 4-ary fat-tree (20-switch) experimental topologies respectively; the two subfigures per topology show the results for the two sampling rates of 1/100 and 1/1000 (left and right). The blue line indicates PAZZ, which uses the Fuzzer, and the red line indicates the exhaustive packet generation approach. As expected, PAZZ performs better than the exhaustive approach, providing an average speedup of 2-3 times. In all cases, 50% of the faults are detected by PAZZ within a maximum of ~50 seconds, i.e., in under a minute. Note, we excluded the Fuzzer execution time (§IV-G) from the plots. It is worth mentioning that PAZZ would perform far better still against a fully exhaustive packet generation approach that generates 2^32 flows covering the entire destination IPv4 header space. Hereby, the detected faults are Type-a, as they require active probes in the uncovered packet header space. Since PAZZ relies on production traffic to detect the Type-p faults, it avoids exhaustively generating all possible packet headers. Similar results were observed for the localization of the detected faults.
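For reference, the random baseline can be approximated with a few lines of packet crafting. The sketch below (our illustration using Scapy; the flow count and file name are hypothetical) draws uniformly random destinations from the 32-bit IPv4 space and writes them to a pcap for replay:

```python
import random
from scapy.all import Ether, IP, UDP, wrpcap

# Illustrative exhaustive-baseline generator: craft the same number of flows
# as the Fuzzer, but with uniformly random 32-bit destination addresses, then
# replay the resulting pcap at the same rate (e.g., with tcpreplay).
NUM_FLOWS = 10_000  # hypothetical; matched to the Fuzzer's flow count

def random_dst() -> str:
    return ".".join(str(random.randint(0, 255)) for _ in range(4))

packets = [Ether() / IP(dst=random_dst()) / UDP(sport=1234, dport=80)
           for _ in range(NUM_FLOWS)]
wrpcap("exhaustive_baseline.pcap", packets)
```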
Multiple Source-Destination Pairs & Header Coverage: We observed similar results for different source-destination pairs when we placed the Fuzzer and the production traffic generator at different sources in parallel, for a different header space coverage. Figure 9f shows the fault detection time comparison of PAZZ against the exhaustive packet generation approach for 3 different source-destination pairs in the 4-ary fat-tree topology (Figure 8). Note, PAZZ can be extended to Layer-2 and Layer-4 headers.

Figures 10a and 10b illustrate the median fault detection time with quartiles in PAZZ against the baseline for the 4-ary fat-tree topology (20 switches) with 100k rules, for a single source-destination pair, at sampling rates of 1/100 and 1/1000 respectively. Note, due to space constraints, we omit the quartile plots for the three grid topologies. To summarize, the results for the four experimental topologies are as expected and similar, without any observable deviation.

F. Reachability Graph Computation vs Header Space Analysis (HSA) [29]

Table II (Column 5) shows the reachability graph computation time of the CPC for an egress port in all experimental topologies. To observe the effect of evolving configs, we added additional rules to various switches at random. We observe that the CPC computes the reachability graph in all topologies in < 1s.

We compare the CPC against the state-of-the-art Header Space Analysis (HSA) [29], which can be used to generate reachability graphs in order to generate test packets and compute the covered header space. HSA is also used by NetPlumber [30] and ATPG [2]. We observe that the HSA library has scalability issues due to its poor support for set operations on wildcards, unlike BDDs [52]. Indeed, we reconfirm this by running a simple reachability experiment between a source-destination pair in the fat-tree topology of 20 switches with 100k rules in total. In this experiment, we compare the resource usage of the Python implementation of the HSA library with the CPC of PAZZ, which uses BDDs. HSA requires in excess of 206GB of memory to compute only 10 possible paths between two ports, while the CPC requires only 1.5GB of memory to compute all 75k paths.

G. Fuzzer Execution Time & Overhead

Execution Time: Table II (Column 6) shows the time taken by the Fuzzer to compute the packet header space for fuzz traffic in the four experimental topologies after it receives the covered packet header space (corpus) from the CPC. Since we considered destination-based routing, the packet header space computation was limited to the 32-bit destination IPv4 address space in the presence of wildcarded rules. When rules were added to the data plane, the CPC recomputed the corpus; the new corpus was sent to the Fuzzer, which recomputed the new fuzz traffic within a maximum of 7.5 milliseconds.

Overhead: The fuzz traffic consists of 54-byte test packets at a rate of 1000 pps on a 1 Gbps link, i.e., 0.04% of the link bandwidth, and therefore imposes minimal overhead on the links in the data plane. Note that most of the fuzz traffic is dropped at the first switch unless there is a flow rule to match that traffic, incurring even less overhead on the links.
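As a quick sanity check, the stated link overhead follows directly from the probe size and rate:

\[
54\,\mathrm{bytes} \times 8\,\tfrac{\mathrm{bits}}{\mathrm{byte}} \times 1000\,\mathrm{pps} = 432\,\mathrm{kbit/s} \approx 0.04\%\ \text{of a}\ 1\,\mathrm{Gbit/s}\ \text{link.}
\]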

H. DPC Overhead

We generated packets of different sizes, from 64 bytes to 1500 bytes, at almost Gbps rate on switches running the DPC software of PAZZ and on native OvS switches. We added flow rules on our switches to match the packets and tag them with the Verify shim header using our push_verify, set_Verify_Rule and set_Verify_Port actions in the flow rules. Then, we measured the average throughput over 10 runs. We observe that the Verify shim header and the tagging mechanism incur a negligible throughput drop compared to native OvS. A throughput drop of 1.1% is observed only on the entry/exit switch, as the push_verify action inserting the Verify shim header executes on the entry/exit switch; furthermore, sFlow sampling is done at the exit switch only. In contrast, set_Verify_Port and set_Verify_Rule just set the Verify_Port and Verify_Rule fields respectively and have no observable impact. Thus, PAZZ introduces minimal packet processing overhead atop OvS.

[Figure 11 omitted: management/control/data plane diagram, analogous to Figure 1, in which PAZZ checks P_logical ≡ P_physical, T_logical ≡ T_physical, and R_logical ≡ R_physical.]

Figure 11: PAZZ provides coverage on the forwarding flow rules, topology and paths.

V. RELATED WORK

In addition to the related work covered in §I, which includes the existing literature based on [17] and Table I, we now navigate the landscape of related work and compare it to PAZZ in terms of the Type-p and Type-a faults which cause inconsistency (§II). The related work in the area of the control plane [27]-[37] either checks the controller applications or the control plane's compliance with the high-level network policy. These approaches are insufficient to check the physical data plane's compliance with the control plane. As illustrated in Table I, we navigate the landscape of the data plane approaches and compare them with PAZZ based on their ability to detect Type-p and Type-a faults. It is worth noting that these approaches test either the rules or the paths, whereas PAZZ tests both together. In the case of Type-p match faults (§II-C1), where the path is the same even though a different rule is matched, path trajectory tools [20], [22]-[26], [44] fail. The approaches based on active probing [2], [4], [18], [19], [26], [45] do not detect the Type-a faults (§II-C2) caused by hidden or misconfigured rules on the data plane which only match the fuzz traffic; these tools only generate probes to test the rules known or synced to the controller. Such Type-a faults are detected by PAZZ. Recently, Shukla et al. developed P4CONSIST [21], which relies on production traffic to detect path-based control-data plane inconsistency in P4 SDNs. However, it can neither localize faults nor detect rule-based control-data plane inconsistency in SDNs, unlike PAZZ. Furthermore, its packet header space coverage is smaller than that of PAZZ, as it relies solely on production traffic. Recently, P4RL [39] leveraged machine learning-guided fuzzing to detect path deviations in P4 switches. However, P4RL is limited to P4 switches only.

VI. DISCUSSION: HOW FAR TO GO?

Motivated by our analysis of existing faults which cause inconsistency in SDNs and the fact that network consistency is only as good as its weakest link, we have argued that we need to take consistency checks further. Using an analogy to program testing, we proposed a novel "bottom-up" methodology, making the case for network state fuzzing.

However, the question of "how far to go?" in terms of consistency checks is a non-trivial one and needs further attention. In particular, while methodologies such as PAZZ provide a framework to systematically find (and subsequently remove) faults, in theory, they raise the question of how many resources to invest. Indeed, we observe that providing full coverage of forwarding flow rules, topology, and forwarding paths is inherently expensive. Consider a simplistic linear network of n = 4 switches with two links between each pair of adjacent switches. Assume IP traffic enters at a source and exits at a destination, and each switch stores a simple forwarding rule: switch S_i for i = {1, 2, 3, ..., n} bases its forwarding decision solely on the i-th least significant bit of the destination IP address: packets whose i-th bit is even (0) are forwarded to the upper link (up) and packets whose i-th bit is odd (1) to the lower (lo) link. While such a configuration requires only a small constant number of rules per switch, the number of possible paths is exponential in n: the set of paths is given by {up, lo}^(n-1).
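To make this path explosion concrete, a small sketch (our illustration, not part of PAZZ) enumerates the path set {up, lo}^(n-1) for such a linear network:

```python
from itertools import product

# Illustrative enumeration of all forwarding paths in the linear network
# described above: each of the n-1 inter-switch hops independently uses the
# upper ("up") or lower ("lo") link, giving 2^(n-1) possible paths even
# though every switch holds only a constant number of rules.
def enumerate_paths(n: int):
    return list(product(("up", "lo"), repeat=n - 1))

for n in (4, 8, 16):
    print(f"n = {n:2d} switches -> {len(enumerate_paths(n))} possible paths")
# n =  4 switches -> 8 possible paths
# n =  8 switches -> 128 possible paths
# n = 16 switches -> 32768 possible paths
```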
In addition to the combinatorial complexity of testing static configurations, we face the challenge that both the (distributed) control plane and the data plane state (as well as, possibly, the network connecting the controllers to the data plane elements) evolve over time, possibly in an asynchronous (or even stateful [27]) and hard-to-predict manner. This temporal dimension exacerbates the challenge of identifying faults, raising the question of "how frequently to check".

Finally, the question of finding faults goes beyond deciding on the right amount of resources to invest: any traffic-based testing introduces the inherent "Schrödinger's cat" problem that tagged traffic may not be subject to the same treatment as untagged traffic.

We note that current trends towards programmable data planes [48] and hardware [62] may aid in solving some of the issues this paper raises. INT [47] in P4 specifically tries to solve this problem, but it is yet to be made scalable to the whole network and is limited to P4 switches. Still, leveraging programmability, future SDNs will encompass even more possibilities of faults, with a mix of vendor code, reusable libraries, and in-house code. Yet, there will no longer be a common API, i.e., OpenFlow. As such, the general problem will persist, and we will have to explore how to extend the PAZZ methodology to programmable networks.

Overall, as illustrated in Figure 11, PAZZ checks consistency at all levels between the control and the data plane, i.e., P_logical ≡ P_physical, T_logical ≡ T_physical, and R_logical ≡ R_physical.

VII. CONCLUSION & FUTURE WORK

This paper presented PAZZ, a novel network verification methodology that automatically detects and localizes the data plane faults manifesting as inconsistency in SDNs. PAZZ periodically fuzz-tests the packet header space and compares the expected control plane state with the actual data plane state. The tagging mechanism tracks the paths and rules at the data plane, while the reachability graph at the control plane tracks paths and rules to help PAZZ verify consistency. Our evaluation of PAZZ over real network topologies and configurations showed that PAZZ efficiently detects and localizes the faults causing inconsistency while requiring minimal data plane resources (e.g., switch rules and link bandwidth). Currently, we do not handle transient faults [63], packet rewriting, or the checking of backup rules; these lay the groundwork for our future work. To improve the performance of PAZZ further, multithreading at the Consistency Tester will allow multiple flows to be checked in parallel, and for the DPC, the Data Plane Development Kit (DPDK) or P4 can improve performance. We will also consider performance-related faults such as throughput, latency and packet loss. Distributed SDN controllers and P4-specific hardware [48], [62] design and implementation will also be part of our future work.

REFERENCES

[1] H. Zeng, P. Kazemian, G. Varghese, and N. McKeown, "A Survey of Network Troubleshooting," Stanford University, Tech. Rep. TR12-HPNG-061012, 2012.
[2] ——, "Automatic test packet generation," in Proc. 8th International Conference on Emerging Networking Experiments and Technologies (CoNEXT). ACM, 2012.
[3] P. Gawkowski and J. Sosnowski, "Experiences with software implemented fault injection," in Architecture of Computing Systems (ARCS), 2007.
[4] P. Perešíni, M. Kuźniar, and D. Kostić, "Monocle: Dynamic, Fine-Grained Data Plane Monitoring," in Proc. ACM CoNEXT, 2015.
[5] B. Schroeder, E. Pinheiro, and W.-D. Weber, "DRAM errors in the wild: a large-scale field study," in ACM SIGMETRICS Performance Evaluation Review, vol. 37, no. 1. ACM, 2009, pp. 193–204.
[6] Z. Al-Ars, A. J. van de Goor, J. Braun, and D. Richter, "Simulation based analysis of temperature effect on the faulty behavior of embedded DRAMs," in Proc. International Test Conference 2001. IEEE, 2001, pp. 783–792.
[7] NetworkWorld, "Cisco says router bug could be result of 'cosmic radiation' ... Seriously?" https://www.networkworld.com/article/3122864/cisco-says-router-bug-could-be-result-of-cosmic-radiation-seriously.html, 2016.
[8] Y. Yankilevich and G. Malchi, "TCAM soft-error detection method and apparatus," Jul. 20 2017, US Patent App. 15/390,465.
[9] R. Mahajan, D. Wetherall, and T. Anderson, "Understanding BGP misconfiguration," in ACM SIGCOMM Computer Communication Review. ACM, 2002.
[10] "Cloud leak: WSJ parent company Dow Jones exposed customer data," https://www.upguard.com/breaches/cloud-leak-dow-jones/.
[11] "Con-Ed steals the 'net'," http://dyn.com/blog/coned-steals-the-net/.
[12] M. Kuźniar, P. Perešíni, and D. Kostić, "What you need to know about SDN flow tables," in Proc. PAM, 2015.
[13] N. Katta, O. Alipourfard, J. Rexford, and D. Walker, "CacheFlow: Dependency-aware rule-caching for software-defined networks," in ACM SOSR, 2016.
[14] J. Rexford, "SDN Applications," in Dagstuhl Seminar 15071, 2015.
[15] M. Kuźniar, P. Perešíni, M. Canini, D. Venzano, and D. Kostić, "A SOFT Way for OpenFlow Switch Interoperability Testing," in Proc. ACM CoNEXT, 2012.
[16] M. Kuźniar, P. Perešíni, and D. Kostić, "Providing reliable FIB update acknowledgments in SDN," in Proc. ACM CoNEXT, 2014.
[17] B. Heller, J. McCauley, K. Zarifis, P. Kazemian, C. Scott, N. McKeown, S. Shenker, A. Wundsam, H. Zeng, S. Whitlock, V. Jeyakumar, and N. Handigol, "Leveraging SDN layering to systematically troubleshoot networks," 2013.
[18] K. Bu, X. Wen, B. Yang, Y. Chen, L. E. Li, and X. Chen, "Is every flow on the right track?: Inspect SDN forwarding with RuleScope," in IEEE INFOCOM, 2016.
[19] P. Zhang, C. Zhang, and C. Hu, "Fast testing network data plane with RuleChecker," in Network Protocols (ICNP). IEEE, 2017.
[20] P. Zhang, H. Li, C. Hu, L. Hu, L. Xiong, R. Wang, and Y. Zhang, "Mind the gap: Monitoring the control-data plane consistency in software defined networks," in Proc. 12th International Conference on emerging Networking EXperiments and Technologies (CoNEXT). ACM, 2016.
[21] A. Shukla, S. Fathalli, T. Zinner, A. Hecker, and S. Schmid, "P4CONSIST: Towards Consistent P4 SDNs," in IEEE Journal on Selected Areas in Communications (JSAC), NetSoft issue, 2020.
[22] N. Handigol, B. Heller, V. Jeyakumar, D. Mazières, and N. McKeown, "I Know What Your Packet Did Last Hop: Using Packet Histories to Troubleshoot Networks," in USENIX NSDI, 2014.
[23] S. Narayana, M. Tahmasbi, J. Rexford, and D. Walker, "Compiling Path Queries," in Proc. USENIX NSDI, 2016.
[24] P. Tammana, R. Agarwal, and M. Lee, "CherryPick: Tracing Packet Trajectory in Software-Defined Datacenter Networks," in ACM SOSR, 2015.
[25] ——, "Simplifying datacenter network debugging with PathDump," in USENIX OSDI, 2016.
[26] K. Agarwal, J. Carter, and C. Dixon, "SDN traceroute: Tracing SDN Forwarding without Changing Network Behavior," 2014.
[27] S. K. Fayaz, T. Yu, Y. Tobioka, S. Chaki, and V. Sekar, "BUZZ: Testing context-dependent policies in stateful networks," in Proc. USENIX NSDI, 2016.
[28] H. Mai, A. Khurshid, R. Agarwal, M. Caesar, P. B. Godfrey, and S. T. King, "Debugging the Data Plane with Anteater," in SIGCOMM, 2011.
[29] P. Kazemian, G. Varghese, and N. McKeown, "Header Space Analysis: Static Checking for Networks," in Proc. USENIX NSDI, 2012.
[30] P. Kazemian, M. Chang, H. Zeng, G. Varghese, N. McKeown, and S. Whyte, "Real Time Network Policy Checking Using Header Space Analysis," in Proc. USENIX NSDI, 2013.
[31] H. Zeng, S. Zhang, F. Ye, V. Jeyakumar, M. Ju, J. Liu, N. McKeown, and A. Vahdat, "Libra: Divide and Conquer to Verify Forwarding Tables in Huge Networks," in USENIX NSDI, 2014.
[32] M. Canini, D. Venzano, P. Perešíni, D. Kostić, and J. Rexford, "A NICE Way to Test OpenFlow Applications," in Proc. USENIX NSDI, 2012.
[33] C. Scott, A. Wundsam, B. Raghavan, A. Panda, A. Or, J. Lai, E. Huang, Z. Liu, A. El-Hassany, S. Whitlock et al., "Troubleshooting blackbox SDN control software with minimal causal sequences," in ACM SIGCOMM Computer Communication Review. ACM, 2015.
[34] A. Khurshid, X. Zou, W. Zhou, M. Caesar, and P. B. Godfrey, "VeriFlow: Verifying Network-Wide Invariants in Real Time," in USENIX NSDI, 2013.
[35] A. Fogel, S. Fung, L. Pedrosa, M. Walraed-Sullivan, R. Govindan, R. Mahajan, and T. D. Millstein, "A general approach to network configuration analysis," in NSDI, 2015.
[36] N. P. Lopes, N. Bjørner, P. Godefroid, K. Jayaraman, and G. Varghese, "Checking Beliefs in Dynamic Networks," in Proc. USENIX NSDI, 2015.
[37] H. Yang and S. S. Lam, "Real-Time Verification of Network Properties Using Atomic Predicates," in IEEE/ACM Transactions on Networking, 2016.
[38] P. Godefroid, M. Y. Levin, and D. Molnar, "SAGE: Whitebox fuzzing for security testing," in CACM, vol. 55, no. 3, 2012.
[39] A. Shukla, K. N. Hudemann, A. Hecker, and S. Schmid, "Runtime Verification of P4 Switches with Reinforcement Learning," in ACM SIGCOMM NetAI, 2019.
[40] P.-W. Chi, C.-T. Kuo, J.-W. Guo, and C.-L. Lei, "How to detect a compromised SDN switch," in Network Softwarization (NetSoft). IEEE, 2015.
[41] M. Antikainen, T. Aura, and M. Särelä, "Spook in your network: Attacking an SDN with a compromised OpenFlow switch," in Nordic Conference on Secure IT Systems. Springer, 2014.
[42] G. Pickett, "Staying persistent in software defined networks," in Black Hat, 2015.
[43] G. Varghese, "Technical perspective: Treating networks like programs," New York, NY, USA: ACM, Oct. 2015.
[44] P. Zhang, S. Xu, Z. Yang, H. Li, Q. Li, H. Wang, and C. Hu, "FOCES: Detecting forwarding anomalies in software defined networks," in ICDCS. IEEE, 2018.
[45] Y. Ke, H. Hsiao, and T. H. Kim, "SDNProbe: Lightweight Fault Localization in the Error-Prone Environment," in 2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS). IEEE, 2018.
[46] OpenFlow Spec, https://www.opennetworking.org/images//openflow-switch-v1.5.1.pdf, 2015.
[47] C. Kim, A. Sivaraman, N. Katta, A. Bas, A. Dixit, and L. J. Wobker, "In-band network telemetry via programmable dataplanes," in ACM SOSR, 2015.
[48] P. Bosshart, D. Daly, G. Gibb, M. Izzard, N. McKeown, J. Rexford, C. Schlesinger, D. Talayco, A. Vahdat, G. Varghese, and D. Walker, "P4: Programming Protocol-Independent Packet Processors," 2014.
[49] H. Zhang, C. Lumezanu, J. Rhee, N. Arora, Q. Xu, and G. Jiang, "Enabling Layer 2 Pathlet Tracing through Context Encoding in Software-Defined Networking," in HotSDN, 2014.
[50] S. Panchen, P. Phaal, and N. McKee, "InMon Corporation's sFlow: A method for monitoring traffic in switched and routed networks," 2001.
[51] A. Kirsch and M. Mitzenmacher, "Less hashing, same performance: Building a better Bloom filter," in ESA. Springer, 2006.
[52] R. E. Bryant, "Graph-based algorithms for Boolean function manipulation," IEEE Computer Society, Aug. 1986.

[53] “Llvm compiler infrastructure: libfuzzer: a library for coverage-guided fuzz [58] “Open vswitch,” http://openvswitch.org/, 2016. testing,” http://llvm.org/docs/LibFuzzer.html. [59] “Apache thrift,” https://thrift.apache.org/. [54] Z. Zhang, Q.-Y. Wen, and W. Tang, “An efficient mutation-based fuzz testing [60] “Hassel-public,” https://bitbucket.org/peymank/hassel-public/. approach for detecting flaws of network protocol,” in Computer Science & Service System (CSSS), 2012. [61] A. Turner and M. Bing, “Tcpreplay: Pcap editing and replay tools for* nix,” in online], http://tcpreplay. sourceforge. net [55] R. R. Kompella, J. Yates, A. Greenberg, and A. C. Snoeren, “Detection and , 2005. localization of network black holes,” in INFOCOM 2007. 26th IEEE International [62] D. Firestone et al., “Azure accelerated networking: Smartnics in the public cloud,” Conference on Computer Communications. IEEE. in USENIX NSDI, 2018. [56] G. Castagnoli, J. Ganz, and P. Graber, “Optimum cycle redundancy-check codes [63] A. Shukla, S. Schmid, A. Feldmann, A. Ludwig, S. Dudycz, and A. Schuetze, with 16-bit redundancy,” in IEEE Transactions on Communications. IEEE, 1990. “Towards transiently secure updates in asynchronous sdns,” in Proceedings of the [57] B. Jenkins, “Hash functions,” in Dr Dobbs Journal. 2016 ACM SIGCOMM Conference, 2016, pp. 597–598.